Torchlight 3 Early Access Launch Weekend Postmortem - "We had to kill all of the zombies."

Sometimes the best possible way to look to the future of a project, is to understand its past. This Torchlight 3 takes a look back at it's Early Access launch weekend and what it took to "kill all the zombies."

Jill Sullivan, Blogger

July 6, 2020

19 Min Read

During Early Access we want players to get up-to-the-moment information about what's going on with the game & dev team focus. This week's State of the Game letter comes from Guy Somberg, Lead Programmer of Echtra, with help from Jill Sullivan, Senior Community Manager.

Forward

Sometimes the best possible way to look to the future of a project, is to understand its past. This State of the Game is for those who enjoy a good technical read and want a postmortem on our Early Access weekend launch. This was written shortly after the Early Access launch weekend in order to retain as many details as possible.

Introduction

Wow, what an exciting weekend it’s been. On Saturday morning, we launched Torchlight 3 for early access on Steam, after having been in alpha and then beta for 15 months. After a year of having live players from all over the world playing our game, we had developed a lot of technology and procedures for deploying changes, fixing issues, investigating problems, triaging bugs, and the other fine-grained minutiae of running a live service. The game servers and game clients had been running fairly smoothly, so we felt prepared to open up the floodgates.

Life likes to throw curve balls, and we had a ton thrown at us over the course of this past weekend. Our team was up through the weekend chasing down problems, and now that things have settled down and our service is in better shape, it’s worth taking a look back. So, let’s go on a journey of the release and see what happened and how we fixed it.

Torchlight 3 Launches

Our live team got together early on Saturday morning. Or, at least, early for game developers. At 10am, we had our build ready to go, and being white-gloved by QA. A white-glove test is a test that QA does on a build when there are no other players on it to “check for dust” - that is, to make sure that there are no surprise issues. That process was complete, and at 10:30am - with no fanfare - the game was available for purchase on Steam.

Just after 11am, Max went on the PC Gaming Show to talk about Torchlight 3, and told people that it’s live now. So much for no fanfare. Thus opened the floodgates, and people started playing. For a little while, things seemed to go well. Our concurrent users (CCU) numbers went from zero up to a few thousand in a very short timeframe.

Very quickly, though, we started to get reports of players disconnecting, failing to log on, and failing to travel across zones. The team jumped into action like a cheetah on a pogo stick. We determined that there were likely two root causes here: service scaling and server reaping.

Most of our back-end services are horizontally scalable. That means that, if one of the services is under high load, then we can bring up more instances of it to handle the load. We did exactly that to handle one of the problems. The fact that the load was so high was indicative of other problems, but increasing the resources allotted to those services helped resolve those issues while we investigated the root causes.

The other problem was server reaping. For some reason, the services that were monitoring our game servers were not getting informed that those servers had players on them. We tracked down the connection issue and the disconnects stopped.

The Next Step: Quests and Characters

With these two problems resolved, we now had a more stable service, and players started running into a new set of problems. This time, the problems manifested as quest progress taking a very long time to appear (if at all). You could still play, but the game would be subtly broken. Around the same time, we were getting reports of players being unable to create characters.

We tracked the quest problem to a database configuration issue that was causing the responses from the database to the backend service to be slow. Unfortunately, in order to deploy those changes, we needed to take the game down briefly. Our expectation was that it would be 10 minutes, so we told people to expect 20 minutes of downtime. 30 minutes after we started the downtime, we were finally able to bring up the servers.

The other issue that we were trying to track down was why some people were unable to create characters. Fortunately, this one was easy. Log files from the affected players all indicated that these people were attempting to name their pets using Cyrillic characters, which is disallowed by the services, but accidentally allowed by the game client. We will fix this one in the client, but in the meantime we just informed these players about this restriction.

We’re All Good Now, Right?

There were still some disconnects and other issues, but toward the end of the day we felt as though we were in good enough shape that we could all go to sleep. We had some inklings that there were still more issues, but we only had a few shots in the dark to try to fix them before bed.

Of course, the world is round and Europe and Asia were just about to get into their respective peak times. Several of our team members were up through the night watching the EU and Asia servers go bad. People didn’t seem to be crashing out, but they did seem to continually be disconnected.

The Next Morning

Our first impulse on having load issues is to scale up the services, which we did as quickly as we could. We were bringing up servers in our data centers as well as in our cloud providers. Unfortunately, the cloud provider has a rate limit on how frequently they’re willing to grant IP addresses - something like 2 per minute - so that severely limited how frequently we could bring up the extra load.

Early Sunday morning, then, one of our devops engineers said - in a very polite and timid fashion - “You know, it seems like we’re trying to run in a way that is not sustainable. This whole thing with loading up 20 zones per player just isn’t going to work long-term.”

Wait, WHAT? 20 zones per player? That’s not how our game is supposed to work at all! Our game servers are a piece of hardware which runs a whole bunch of processes that we call Zones (or sometimes game servers). Each zone runs the game world for a collection of areas - for example, Edgewood Bluff, Heroes’ Rest, and the Den of Upheaval will all be hosted by a single zone process. How many zones do we expect to have per player? Well, we’ve got the collection of zones that you’re in, maybe the last zone you just came from, maybe a town, and maybe your fort. That’s a maximum of 4, but we should be averaging somewhere around 1.2 to 1.5 zones per player. How did we end up with 20 zones? That’s insanity! Something is wrong with what we’re measuring or doing, but that is not right!

Zombie Zones

Now that we knew that we had a problem, we inspected the processes on the servers that were running and saw some defunct processes - colloquially called “zombie processes”, which gave rise to the nickname “zombie zones”. A “defunct” process in Linux is a process which has been killed, but is still lingering. It will still hold on to some minor resources such as file handles and such, but it is officially dead - but not gone. Thus: zombies.

Although we did not realize it at the time that we found the zombies, we later discovered these zombie processes were actually counting toward the capacity of the servers. We would ask a server if it had more capacity to spin up a new zone, and it would say “no, I’m full up”. But it was full of zombies, not actual zones.

At this point, we had to do two things:

We had to kill all of the zombies.
We had to figure out why the zombies were appearing in the first place.

In order to kill the zombies for now, we actually just rebooted each of the servers in order to free up the capacity. Players could then start taking up that space, but we knew that the zombies would be coming back, so we needed a fix for the underlying issue. It was like playing Whack-a-Mole - as soon as we had cleaned up one set of servers, another one would get infested with zombies again.

Background

At this point, we’re going to get a bit more technical. We’ve already discussed the zone servers, but there is also one single other process that we call the Zone Controller (ZC) whose job it is to spin up and monitor zone processes.

So a message comes in (say from a player who wants to travel) to the ZC, and it spins up a new zone process and responds with instructions about how to travel to that zone. The ZC also has a few other tasks. It monitors the zones that it has spun up, and if it sees one that it thinks isn’t responding (due to a crash or hang), it kills it. It also reports back a lot of information such as what zones it’s got, how much capacity it has, and what players are in what zones to another server called Zone Lookup (ZL). (Zone Lookup is another character in our story, which will come up soon.)

More Background

Upon examining the Zone Controller logs, we realized that it was sending 410 HTTP status codes to the zone servers. When a zone server checks in with the ZC, the ZC can examine the server’s state and respond with a message indicating that the zone server should shut itself down. It would do this, for example, if it has observed that there are no players in that zone for a certain length of time.

However, this 410 code is actually a vestigial piece of technology right now, because of a different piece of tech that we implemented. When all of the players leave a zone server, the ZC will actually put the process to sleep using a Linux system call. In this mode, the process is not even scheduled by the operating system, so it consumes zero CPU cycles. We wake the process up if somebody travels back there (like if you portaled to town or to your fort and then returned), but if you never return then we never wake the process up. After the timeout occurs, we kill the process entirely.

That makes it very surprising to see these 410 codes showing up. Under normal operation, we should never see the 410s being triggered. But we were seeing a lot of them. Like, A LOT. On an individual ZC that manages 100 to 200 zones, we would see one 410 every 1 to 2 seconds.

So, that was a thing. And, of course, we’re building up these zombie zones at the same time.

Moving Toward a Fix

Both the zombie zones and the 410s seemed to be related, and both of these problems were pointing at the ZC. We started to dig into the code of the ZC to see what the problem could be.

The ZC stores a map containing information about all of the zones. This map gets locked as a whole entity any time changes are made to it, and we were suspicious that there was some contention over the lock. The zone server sends messages to the ZC at a regular rate containing information about itself and its player counts, and the ZC updates this map in response to those messages. The 410 codes could be explained if the ZC was missing out on these update messages or not processing them fast enough, and then the piece of code that checks the map coming in and noticing that it hasn’t heard from the zone server and killing it, even though there was an update message in the queue that it just hadn’t processed yet.

We still didn’t know why this was happening, but we were fairly confident that this is what was happening.

An Unrelated Problem? Zone Server Spin-up Time

Another issue that we noticed is that new zones were taking a long time to spin up. Ordinarily, it should take about 10-15 seconds, but new zones were taking significantly longer to spin up for some reason. This issue seemed to be unrelated to the other issues. When it rains it pours.

We traced this issue to the ZC reaching out to the content service. When starting up a new zone, the ZC gets information about which content version to use. These are things like our spreadsheets, quests, recipes, etc. Ordinarily, it requests this value every 5 minutes.

However, the code was written such that every new zone that was trying to spin up while it was requesting the new content version would block waiting for that answer. So, if it was slow to get the answer - and we think it was for a while - then new zones wouldn’t come up for a long time because they were waiting for the content version.

This was where we started to realize that if the upstream services like Zone Lookup and the content service were lagging, then they were causing blocks in the ZC. While we started investigating those issues, we rewrote how the ZC was getting the content version so that we store a cached value and reuse it until the new answer comes back.

Blocking Operations While Locking

Now that we saw that there were issues with the locks and communicating with other services, we examined whether there was anything that could be holding onto the lock for long periods of time. The first thing we observed was that we had some situations where we would take a lock on the map of zones, and then perform some operating system calls, which could cause the process to be unscheduled, and can potentially be long-running. We rewrote those sections of code such that the OS calls happened outside of the lock.

Now we had two fixes to try out: the content service cached value, and the OS calls outside of the locks. We put both of these fixes up onto a single server and monitored that one server to see whether or not the situation had improved. After a few minutes, it was clear that the situation was a little better - a 410 code every two seconds instead of two times every second, and we were still seeing zombie servers - but it was still a problem. It was definitely some improvement, but there was still work to do.

Rubber Ducks and Go Channels

At this point, it was back to the drawing board. We knew generally where the problem had to lie, but did not have a specific root cause. So, we busted out our rubber ducks.

What do rubber ducks have to do with it? The parable goes that a junior engineer walks into a senior engineer’s office and asks for help with a problem. The senior engineer stops the junior engineer short, hands over a rubber duck and says “Tell it to the rubber duck, then tell it to me,” then turns around and goes back to work. The junior engineer thinks this is weird, but shrugs and starts to tell the rubber duck about the problem. “Whenever I call this function, it crashes. But it’s always on the last element of the list for some reason. Oh! It’s because that element is one past the end, so I can’t dereference it!” The junior engineer rushes off to fix the code.

The moral is that often times you just need to describe a problem to somebody else - anybody else - in order to realize what the problem is. “Rubber duck debugging” is the process of just having somebody else describe to you how the system works and what the problem is.

So we started asking questions about how the ZC works and how it spins up and monitors zones. As a part of this, one of our engineers noticed that we are sending events on a Go channel while holding a lock on the map.

Most of our backend services are written using the Go programming language, which has many low-level routines for concurrency and communication. One of these primitives is called a Channel, which is a queue that can have messages pushed onto it from one thread of execution and received from another thread of execution. (These threads of execution are called “goroutines”.)

We discovered that the system was set up such that the system would take a lock on the map of zone servers, push some data onto this channel, make some request to the Zone Lookup, and then go into a loop monitoring the zone server process. Ordinarily, this would be fine because the data getting pushed into the channel would be buffered up. However, the buffer length is finite and eventually the buffer gets filled up as the other routines try to hit up the Zone Lookup.

So our procedure looked like this:
Start up a zone
Wait for some network communication with Zone Lookup
Monitor the zone

However, due to some slowness in Zone Lookup, the network communication could take multiple minutes to complete. And, during that time, the routine that is monitoring zones for liveness would come in and see that the zone is empty and isn’t doing anything - or, rather, it appeared that way because the monitoring hasn’t started yet - and then killed it. Finally, when the network communication returned, it would continue operation and start to monitor an already-dead process.

Now, the monitoring would detect that the process was dead and clean it up once it got there. However, because it was waiting for this network communication that was taking multiple minutes to complete, these processes would stick around until the monitoring call could clean them up. Thus: zombie zones!

A Fix for the Zombie Zones

Now that we knew what was happening, we fixed the code that was doing this so that it could start monitoring the process while it was waiting for the zone lookup. Once again, we deployed the change, and…

Still getting 410s and zombie zones. It was definitely better, but the problem was still there. Fortunately, one of our engineers noticed that we were doing this pattern in multiple places, where we were sending data into a Go channel and waiting for a response. But now we had a procedure for fixing them. We found all of the instances where this was happening and fixed them all by putting the channel communication into a goroutine so that it wouldn’t block normal operation.

Now that we had an understanding of the problem, this helped explain a lot of the issues we had been seeing. For example, the zone server sends an HTTP message to the ZC and waits (asynchronously) for a response. The ZC is blocked on its own downstream operation for a long time. Meanwhile, the zone server is trying to send more HTTP requests and waiting for responses, but the zone server also has a limited queue of operations that it’s willing to wait for. The end result is that the zone server drops further requests without sending them, so we start to lose things like quest updates or character saves.

So, to reiterate, we’re blocking up ZC’s ability to monitor zones, we’re blocking up the zone server’s ability to send messages, and we’re leaving these zombie zones around which occupy capacity that should be going to actual players. All of these problems stacked up at the same time due to this one pattern that we were using, which caused all sorts of badness.
Stable(r) Service

It was now 11:40pm on Sunday night, and we deployed our fixes to a single ZC and started watching it to see what happened. Everything looked good initially. We waited for 5 minutes, excitedly - everything was going great! Then we waited for 20 minutes, impatiently - just to make sure. After 20 minutes everything looked good, so we started deploying it out in waves to the rest of the servers across our infrastructure in the US, Europe, and Asia.

Rolling this change out did require us to kick people off of servers, but we did our best to make sure that the players knew what was going on and that it was a temporary thing as we rolled out stability fixes.

After about an hour of rollouts, we declared success on this issue. The service is in good shape. Players are able to log in, travel, and play successfully for extended periods of time. It’s not perfect yet - there are still some issues left that we are continuing to monitor and tackle, but our servers were in a much better place after those first 48 hours.

We went to bed late Sunday night (really early Monday morning) and woke up Monday somewhat later morning for a day where our players were discussing skill respecs and other gameplay issues rather than complaining about not being able to play.

Conclusion

In any distributed system, there is always going to be more than one root cause of any perceived issue. We had to dive through about four levels of other problems before we even realized that the zombie zones were an issue in the first place, and then once we tracked down the problem there were several places in our code that had to be fixed in the same way.

The lesson here for us is that the first impression of what a problem’s root cause is is unlikely to be the actual problem, so we need to constantly be re-evaluating our fundamental assumptions.

As we move forward from this Early Access launch, we want to thank all of our fans and players for your patience as we worked through these problems. We know that it can be frustrating sometimes, but rest assured that we’re working tirelessly to fix the problems, and that we are listening to your concerns and feedback about the game.

- Guy Somberg

About the Author(s)

Jill Sullivan

Blogger

See more from Jill Sullivan

Related Topics

Related Topics

Recent in More

Related Topics

Torchlight 3 Early Access Launch Weekend Postmortem - "We had to kill all of the zombies."

About the Author(s)

Latest News

Trending

Cooking Games Spotlight: Deep Dives, Interviews, and More

Featured Blogs

Game Developer Essentials

Related Topics

Related Topics

Recent in More

Related Topics

<span class="ArticleBase-LargeTitle">Torchlight 3 Early Access Launch Weekend Postmortem - &amp;quot;We had to kill all of the zombies.&amp;quot;</span>Torchlight 3 Early Access Launch Weekend Postmortem - &quot;We had to kill all of the zombies.&quot;

About the Author(s)

Latest News

Trending

Cooking Games Spotlight: Deep Dives, Interviews, and More

Featured Blogs

Game Developer Essentials

Torchlight 3 Early Access Launch Weekend Postmortem - "We had to kill all of the zombies."