GAMEhud, our game analytics service, has been live for a little over two years now. During that time, we have had spikes in traffic that we had to work through. I am happy to say that we were always able to record an event coming in.
However, when we discovered that one of our customers, Coffee Stain Studios, was releasing a new game called Goat Simulator, we knew it was going to be big. We just didn't know it was going to be BIG!
We have had to do a bit of scrambling to keep up with the traffic, but overall things are going smoothly. I wanted to share some of the principles we used to prepare our service for the onslaught of spaceships, trolls, or . . . goats that game developers throw at us. Hopefully, you can use some of these principles when designing your own high-traffic game or online service.
Keep the Web Request Lean
The first rule of thumb when setting up a high-traffic web service is to do as little work as possible within the web request. GAMEhud is basically a backend API for sending events and player and device information. When we set up our system two years ago, we had heard about other analytics systems that were “falling down” under heavy load, so our focus was to not let that happen to us. We made sure the process to capture an event or a player/device update was super lean.
All our API does is confirm the secret key being sent is valid and then insert the payload (e.g., an event) into a queue. That is it. Even under load, a web request that submits an event typically takes less than 10 milliseconds. We have a separate background process (actually processes) that takes each event from the queue and processes it. Therefore, we have never encountered a problem with not being able to record an event due to load. Our only issue is our background processes falling behind and not being able to keep up at times.
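To make the shape of this concrete, here is a minimal Python sketch of a lean request handler. It is an illustration only, not GAMEhud's actual code: the names `VALID_KEYS`, `handle_event_request`, and `drain_queue`, the status codes, and the in-process `queue.Queue` are all assumptions, since the article does not say which framework or queue system is used.

```python
import queue

# Hypothetical set of valid secret keys; a real service would look these up
# in some fast store.
VALID_KEYS = {"demo-secret"}

# In-process stand-in for the real queue system (assumption: the article
# does not name the queue technology).
event_queue = queue.Queue()


def handle_event_request(secret_key, payload):
    """Do the minimum inside the request: check the key, enqueue, return."""
    if secret_key not in VALID_KEYS:
        return 401  # reject unknown keys without touching the queue
    event_queue.put(payload)  # no parsing and no database write here
    return 202  # accepted; the real processing happens later


def drain_queue(process):
    """Background side: process everything currently queued.

    Returns the number of payloads handled.
    """
    handled = 0
    while True:
        try:
            payload = event_queue.get_nowait()
        except queue.Empty:
            return handled
        process(payload)
        handled += 1
```

The request path never waits on the slow part. If the background side falls behind, the queue simply grows, which matches the failure mode described above: events are still recorded, they are just processed later.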
Now this does not mean you should not do any work in the web request. It just means you want to keep it as lean as possible. For example, you could do a little bit of work if you are using a really fast web framework tied to a memory resident database like Redis. Just be sure to keep it lean so you can always be available for new requests.
Do Work In Parallel
This is probably obvious to most of you, but I will include it for completeness. Do work in parallel wherever you can. I am not just talking about multiple processes on the same box, but multiple processes running across multiple boxes. This is otherwise known as horizontal scaling.
You don’t have to do this on day one, but at least plan for it. It was MONTHS before we set up our background queue worker to run in parallel. We didn’t need to. Once we had enough traffic, we did it. We can now bring up multiple boxes to work through the events in our queue system.
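As a rough sketch of the idea, the Python below drains one queue with several workers. Threads in a single process stand in for worker processes on multiple boxes; that substitution, the function name `run_workers`, and the `worker_count` parameter are assumptions made for the example. With a shared, networked queue, the same loop could run on each box.

```python
import queue
import threading


def run_workers(events, process, worker_count=4):
    """Drain `events` with several workers and return the processed results.

    Results come back in completion order, not input order.
    """
    pending = queue.Queue()
    for event in events:
        pending.put(event)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                event = pending.get_nowait()
            except queue.Empty:
                return  # nothing left, so this worker is done
            outcome = process(event)
            with results_lock:  # protect the shared results list
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each worker only pulls the next item from the queue, adding capacity is a matter of raising `worker_count` (or, in a real deployment, starting another box) rather than changing the processing code.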
Pre-Aggregate Your Data
Once you have captured all this data, you want to do something with it! Think about what questions you will be asking a lot or at least repetitively. You should be pre-aggregating the answers to those questions so there is less load on your database.
For example, GAMEhud automatically produces 16 different game metrics on a daily basis for one part of our dashboard. We process these once a day and store them. So no matter how many times and how many people access this area, the data for it is only calculated once.
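A small Python sketch of the compute-once, read-many pattern follows. The event tuples, the two metric names (`active_players`, `sessions`), and the dictionary standing in for a metrics table are hypothetical; the article does not list the 16 metrics or how they are stored.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw events: (day, player_id, event_name).
raw_events = [
    (date(2014, 4, 1), "p1", "session_start"),
    (date(2014, 4, 1), "p2", "session_start"),
    (date(2014, 4, 1), "p1", "level_complete"),
    (date(2014, 4, 2), "p1", "session_start"),
]

# Stand-in for a metrics table: one stored value per (day, metric name).
daily_metrics = {}


def aggregate_day(day, events):
    """Scan the raw events for one day once and store the results."""
    players = set()
    counts = defaultdict(int)
    for event_day, player_id, name in events:
        if event_day != day:
            continue
        players.add(player_id)
        counts[name] += 1
    daily_metrics[(day, "active_players")] = len(players)
    daily_metrics[(day, "sessions")] = counts["session_start"]


def read_metric(day, metric):
    """Dashboard reads use the stored value and never touch raw events."""
    return daily_metrics[(day, metric)]
```

In this sketch `aggregate_day` would run from a daily scheduled job, and every dashboard view afterwards is a cheap lookup through `read_metric`.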
Cache Wherever You Can
Use caching wherever you can. For example, when you have millions of rows of data and want to do an aggregation across them, it can take a while. This can happen for some of our event tables. What we decided to do is cache the results of a given long-running query. That way, if the user decides to run the same query again, it responds instantly. We also employ multiple levels of caching. For example, we process reports by day. We cache not only the daily results but also the results of the report in total.
Look for opportunities to cache what your users are looking for to be able to deliver it to them quickly.
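The two cache levels described above can be sketched in Python like this. Everything here is illustrative: the sample counts in `EVENTS_BY_DAY`, the plain dictionaries used as caches, and the function names are assumptions, and the article does not say which cache store GAMEhud uses.

```python
# Hypothetical event counts per day, standing in for a slow aggregation query.
EVENTS_BY_DAY = {"2014-04-01": 120, "2014-04-02": 340, "2014-04-03": 95}

daily_cache = {}     # level 1: result for a single day
report_cache = {}    # level 2: result for a whole report (a range of days)
slow_query_log = []  # records each time the slow path actually runs


def slow_daily_count(day):
    """Pretend this scans millions of rows."""
    slow_query_log.append(day)
    return EVENTS_BY_DAY.get(day, 0)


def daily_count(day):
    """Return the per-day result, running the slow query only on a miss."""
    if day not in daily_cache:
        daily_cache[day] = slow_daily_count(day)
    return daily_cache[day]


def report_total(days):
    """Return the total for a report, reusing cached days where possible."""
    key = tuple(days)
    if key not in report_cache:
        report_cache[key] = sum(daily_count(day) for day in days)
    return report_cache[key]
```

Running the same report twice hits `report_cache` and does no work the second time. A different report that overlaps the first still reuses the per-day entries, so only the new days are computed. The sketch leaves out expiry; a day that is still receiving events would need its cached entries refreshed.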
Choose the Right Database for the Job
Decisions on what type of database to use to store your data can be tough. I want to share my experiences so far.
Relational databases are great for pulling one indexed record out of a set of millions or billions. It is pretty fast. They are also great at giving you the most flexible reporting environment. Storing data in third normal form allows you to easily ask a variety of different questions. One disadvantage is that reporting large aggregations across multiple tables with joins can be . . . S . . . L . . . O . . . W. A second disadvantage is that it is harder, or near impossible, to do master-to-master clustering across machines. It is easy to set up a master and slaves, but having a cluster of masters is harder.
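To illustrate the two relational strengths mentioned above, here is a tiny Python sketch using the standard library's `sqlite3` module with an in-memory database. The `players` and `events` tables, their columns, and the sample rows are hypothetical and are not GAMEhud's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, country TEXT NOT NULL);
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        player_id INTEGER NOT NULL REFERENCES players(id),
        name TEXT NOT NULL
    );
    CREATE INDEX idx_events_player ON events(player_id);
""")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)", [(1, "SE"), (2, "US"), (3, "US")]
)
conn.executemany(
    "INSERT INTO events (player_id, name) VALUES (?, ?)",
    [(1, "jump"), (2, "jump"), (2, "headbutt"), (3, "jump")],
)


def events_for_player(player_id):
    """Indexed lookup: the index on player_id avoids a full table scan."""
    rows = conn.execute(
        "SELECT name FROM events WHERE player_id = ? ORDER BY id", (player_id,)
    )
    return [name for (name,) in rows]


def events_per_country():
    """Ad hoc question answered with a join and GROUP BY.

    Flexible, but it has to read every event row, which is the kind of
    query that gets slow as the tables grow.
    """
    rows = conn.execute(
        """
        SELECT p.country, COUNT(*)
        FROM events e JOIN players p ON p.id = e.player_id
        GROUP BY p.country
        """
    )
    return dict(rows)
```

Because the data is normalized, a new question such as "events per country" needs no schema change, only a new query. The cost is that the join and aggregation run over the raw rows each time unless the result is pre-aggregated or cached.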
NoSQL databases are faster at aggregations across millions of rows and are easier to set up as a cluster of master databases. They can also retrieve single records quickly, like a relational database. Where it becomes more difficult is that you need to think a lot more about HOW you are going to store your data. Many are schema-less, so you really need to think about how you are going to pull out the data you need for reports. You don’t have nearly the flexibility in reporting that you do with a relational database.
So look these over and choose the best solution for your needs. Or better yet, maybe use a hybrid approach and use one of each for the tasks it is optimized for.
Have a Service Oriented Architecture (SOA) Plan
You don’t have to have a service oriented architecture on day one. Actually, I will go so far as to say you should not. However, I do believe you should have a PLAN for one. When we first released GAMEhud, everything ran on a single server. But we had a plan for how to break it up into multiple components when the timing was right.
Here is our rough SOA plan for GAMEhud. First, support multiple tracking servers whose only job is to accept events and store them in a queue. Second, support multiple worker boxes that retrieve events from the queues of the tracking servers and push the data into the primary database. Third, support multiple report boxes that process reports or aggregations for end users. We did not set this up on day one, but we had a plan for how to scale when the time came.
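The three roles in this plan can be sketched as three small Python classes. This is only an illustration of how the pieces relate: the class names, the in-process queues, and the list standing in for the primary database are assumptions, and in a real deployment each role would run on its own boxes and talk over the network.

```python
import queue


class TrackingServer:
    """Role 1: accept events and put them on a queue, nothing else."""

    def __init__(self):
        self.queue = queue.Queue()

    def accept(self, event):
        self.queue.put(event)


class Worker:
    """Role 2: move events from tracking-server queues to the primary store."""

    def __init__(self, trackers, primary_store):
        self.trackers = trackers
        self.primary_store = primary_store

    def drain(self):
        """Empty every tracker queue; return how many events were moved."""
        moved = 0
        for tracker in self.trackers:
            while True:
                try:
                    event = tracker.queue.get_nowait()
                except queue.Empty:
                    break  # this tracker is empty, go to the next one
                self.primary_store.append(event)
                moved += 1
        return moved


class ReportBox:
    """Role 3: answer reporting questions from the primary store."""

    def __init__(self, primary_store):
        self.primary_store = primary_store

    def count_by_name(self):
        counts = {}
        for event in self.primary_store:
            counts[event["name"]] = counts.get(event["name"], 0) + 1
        return counts
```

Each role depends only on the queue or the store, not on the other roles directly, which is what lets you add more tracking servers, workers, or report boxes independently when traffic calls for it.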
I am proud to say that due to the solutions we put in place above, we are readily accepting the herd of goats coming in through Goat Simulator. I hope you found this information beneficial as you think through high traffic issues with your game. If you have any questions, feel free to ask.