[In this Intel-sponsored feature, part of the Gamasutra Visual Computing microsite, Goo! developer Tommy Refenes of PillowFort matches wits with multi-threading in four gripping acts -- and emerges victorious]
I'll admit it. I'm a speed freak. (I tried to get help, but it turns out those clinics by the train station are for something totally different than my addiction.) Simply put, I like what I write to run fast, getting a certain amount of personal joy when something I've written runs at over 100 FPS.
I've always strived to make my code as efficient as possible, but I didn't realize how little I knew until my previous job. Back then I was an Xbox 360 engine programmer at a studio working crazy hours to get a game up on XBLA. I was thrown into the world of memory management, cache optimizations, rendering optimizations, and briefly touched on the land of threading.
In May 2006 I left that company to start working on Goo!, a game in which you control big globs of liquid and use those globs to envelop enemy goos. Players have total control over their goo, which means a ton of collision calculations and even more physics calculations.
My new job, once again, was as an engine programmer, and I knew that to have real-time interactive liquid rendering on the screen at a speed that wouldn't give me FPS withdrawal, I would need to thread the collision and physics to the extreme.
(A small disclaimer: Goo! is my first attempt at multi-threaded programming. I've learned everything I know about threading from reading MSDN help, a few random Intel documents, and various other resources on threads. I didn't have the fancy-pants books or money to buy said books. So in this article I may miss a few official terms and concepts and I may be totally wrong about some things, but like all programming, it's all about results.)
Act 1: Power Overwhelming
Goo! began life on an Intel Pentium 4 processor 3.0 GHz with Hyper-Threading Technology. Goos are rendered by combining thousands of small blob particles on the screen to make up a height map, which is then passed through filters to make the goos look like liquid or mercury. That's how they are rendered now, but in the beginning they were hundreds of smiley faces.
Back then, Goo! rendered and calculated around 256 blobs on the screen. Blob collision calculations were culled based on velocity and position, but it was all pushed through the Pentium 4, and, needless to say, the results weren't great.
With the game rendering at around 20 FPS I knew it was time to thread the collision detection. I broke down and purchased an Intel Pentium D 965 Extreme Edition processor. The dual 3.73 GHz cores would be more than enough to start building a new multi-threaded collision detection system. Like a five-year-old back in 1986 with a brand new Optimus Prime, I was excited and anxious to get started on the new collision engine.
So began my first attempt at threading. I adopted a philosophy inspired by Ron Popeil (of Ronco fame): "Set it and forget it." Logically, data that isn't needed immediately is a perfect candidate for a threaded operation. Collision detection fit this thinking very nicely because most of the time collision results stay the same for several frames before changing. Therefore, I reasoned, collision could run on a separate thread: the main thread would check every frame to see whether the routine had finished, copy over the results when it had, and use them in the current frame's physics calculations.
At the time, this seemed like a great idea. Not waiting for collision detections to finish every frame allowed the engine to move on and continue crunching physics and gameplay calculations with old data, while still maintaining a pretty high frame rate. At this point, Goo! was pushing through around 625 blobs rendering at about 60 FPS.
The threading model was simple: A collision thread was created when the game loaded, sleeping until it had something to work on. The main thread copied position and size data for every blob into a static array the thread could access and then continued on to physics calculations and rendering. The collision thread woke up, saw it had work to do, and proceeded with a very brute force method of figuring out who's touching who. It then saved this data into an array of indices that corresponded to every blob in the game.
Once finished, it posted a flag to the main thread. When the main thread saw that the collision thread had completed its calculations, it copied the newly calculated collision data over the old data, copied new position and size data, and once again sent it to the collision thread to calculate. At this time, each collision calculation took around 20 ms, which meant that collision data was updated just about every 1.5 frames.
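That flag-and-copy handshake can be sketched in portable C++ (std::thread and std::atomic standing in for the Win32 threads Goo! actually used; the names, data layout, and brute-force overlap test here are my illustration, not Goo!'s real code):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical blob snapshot: position and size, copied in for the worker.
struct BlobData { float x, y, radius; };

static std::vector<BlobData> g_snapshot;       // copied in by the main thread
static std::vector<int>      g_results;        // first contact per blob (-1 = none)
static std::atomic<bool>     g_hasWork{false}; // "you have work" flag
static std::atomic<bool>     g_done{false};    // "results are ready" flag
static std::atomic<bool>     g_quit{false};

// Brute-force O(n^2) pass figuring out who's touching who, as in the article.
static void CollisionThread() {
    while (!g_quit.load()) {
        if (!g_hasWork.load()) { std::this_thread::yield(); continue; }
        const std::size_t n = g_snapshot.size();
        g_results.assign(n, -1);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j) {
                float dx = g_snapshot[i].x - g_snapshot[j].x;
                float dy = g_snapshot[i].y - g_snapshot[j].y;
                float r  = g_snapshot[i].radius + g_snapshot[j].radius;
                if (dx * dx + dy * dy < r * r) {   // circles overlap
                    if (g_results[i] < 0) g_results[i] = (int)j;
                    if (g_results[j] < 0) g_results[j] = (int)i;
                }
            }
        g_hasWork.store(false);
        g_done.store(true);        // post the flag back to the main thread
    }
}
```

The main thread's half of the exchange is the mirror image: copy fresh positions into `g_snapshot`, raise `g_hasWork`, then poll `g_done` once per frame and copy `g_results` out when it flips.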
With 625 blobs on the screen rendering at around 60 FPS, gameplay could now start development. This is where the first problems began to hit. The goos themselves didn't really feel like goos. When 625 blobs represent one goo it looks fine, but Goo! is a game where you battle against other goos so having more than one on the screen made the goos look a little chunky.
Chunky goos (unless you are talking about GooGoo Clusters) are no good. The solution: more blobs! I expanded the threading model by adding another collision thread and splitting up the work over the two threads. The main thread now broke the collision work up into two pieces, telling collision thread 1 to calculate the first half of the data, and collision thread 2 to calculate the second half. Once the two threads finished, they sent two flags and waited for the main thread to see that data, copy it over, and send more work. This method yielded some better results, allowing around 900 blobs to be rendered to the screen at 60 FPS.
With 900 blobs on the screen goos started looking like goos, and it was time to put the collision to rest for a while and focus on rendering. But as gameplay development progressed, various little unexplained bugs started to pop up: Memory would appear as if it were being accessed after being freed, blobs wouldn't change ownership properly, and the simulation would occasionally explode. Although these were game-breaking bugs, they were so infrequent that I didn't bother with them until about seven months later, after some core gameplay was fleshed out.
Act 2: Why Act 1 is Stupid
Now, I can hear those of you familiar with threading screaming, "What the hell were you thinking?" I agree. My threading model worked nicely for this particular application, but it was horribly inefficient and prone to all sorts of problems.
For starters, the model of having two threads sleeping and waiting for work (at least in the way in which they were waiting) was horrible. The threads waited in an infinite loop for work, and if they didn't find any, they performed a Sleep(0) and then continued waiting for work. At the time, I thought the Sleep(0) would free the processor enough so that other threads running elsewhere on the OS would be scheduled efficiently. Boy, was I wrong. The work-check loop had the two processors constantly running at 90 to 100 percent capacity -- an absolute waste of time. To understand why, picture this scenario.
Let's say you are in charge of five construction workers tasked with paving a road. You know that you will need at least two of them to lay blacktop and at least the remaining three to grade the road, dig the ditches, and so on. Being the efficient supervisor that you are, you wouldn't think of allowing the two workers that need to lay the blacktop to sleep in the middle of the road until the blacktop is ready to be laid. No, to work as efficiently as possible, you would put all five of them to work grading and then all five paving.
That is basically what the first threading model was doing -- allowing construction workers to sleep in the path of the other workers until the blacktop was ready to be laid on the section just prepared by the other workers. The processor was tied up and unable to schedule threads efficiently, which caused the thread queue to build up, which made the overall scheduling and delegation of work much slower than it should have been, which tied up processor resources, which made the game slower. Basically, my über-threaded engine was using way too much of the processor in an extremely inefficient way.
Because of this threading bottleneck, odd problems began to surface. At times, the controls seemed jumpy -- almost as if they weren't being updated fast enough -- and physics became unpredictable because collision data was no longer being sent every frame and a half, but every six or seven frames.
Because the worker threads were constantly checking for work with barely any rest in between, threads for other operations, such as XACT sound and XInput, could not be scheduled efficiently. Switching the workers to a Sleep(1) only slowed the collision threads down further: besides their already lengthy operation, they now added at least 1 ms before any actual work was done, causing the physics to explode even more. I could have fixed these problems by syncing the collision to the main thread, but the frame rate would have dropped below 30 FPS, which looks horrible when rendering interactive liquid. Goo! was in a bad way and needed help.
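The difference between that spin-and-Sleep(0) pattern and a properly blocking wait can be sketched like this (portable C++ for illustration only -- std::this_thread::yield() stands in for Win32's Sleep(0), and none of these names come from Goo!'s code):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

std::atomic<int> wokeUp{0};   // just for observing that both workers finished

// The pattern Goo! used: spin until work arrives, yielding on each pass.
// The thread stays runnable the whole time, so the core runs near 100%.
std::atomic<bool> spinWork{false};
void SpinningWorker() {
    while (!spinWork.load())
        std::this_thread::yield();   // like Sleep(0): still burns the timeslice
    ++wokeUp;
}

// The blocking alternative: wait on a condition variable. The OS deschedules
// the thread entirely until it is signaled, freeing the core for other
// threads -- XACT, XInput, whatever else needs to run.
std::mutex workMutex;
std::condition_variable workReady;
bool cvWork = false;
void BlockingWorker() {
    std::unique_lock<std::mutex> lock(workMutex);
    workReady.wait(lock, [] { return cvWork; });  // sleeps at zero CPU cost
    ++wokeUp;
}
```

The predicate form of `wait` rechecks the flag under the lock, so the worker wakes correctly even if the signal arrives before it starts waiting.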
Act 3: How Intel VTune Performance Analyzer Hurt My Feelings
It was painfully obvious that I needed to rewrite the collision engine. I purchased Intel VTune Performance Analyzer around this time to help analyze the threads and locate the bottleneck. Everything I discovered in Act 2 was a result of the information that VTune analyzer turned up the first time I ran Goo! through it. Although my functions had very low tick counts, there was an obvious problem with the thread queue and my privileged processor time.
I was hoping for an easy fix, something along the lines of "Create threads with *inaudible* and your bottleneck will be solved." No, that wasn't the case at all. VTune basically told me that "The number of threads waiting closely matches the number of processors on the machine" and that I "should consider multi-threading my application."
Obviously VTune analyzer didn't know my plight and how much I had slaved over my current threading model. Here I had just paid good money for a program to tell me that I needed to thread my already heavily threaded application. But after VTune analyzer and I sat down and talked for a bit, I began to see things her way. VTune was only looking out for me and calling it how she saw it. My thread queue was long (10 or 12 threads I think).
My privileged processor time was very high and, as a result, the game was running very inefficiently. I also found that the amount of data being passed to the threads was causing several cache misses and bringing the execution of the code on the thread to a virtual crawl. Sure, my threads were doing a lot of work, but their inefficiencies were slowing down the entire system. The entire situation basically snowballed from a very poor threading model.
The solution? A new, better designed threading model.
Using VTune analyzer I found and optimized some inefficient code and optimized the threads as best I could. I restructured the collision results data both sent to and calculated on the threads to fit into cache better and I optimized some of the collision detection calculations.
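The article doesn't show the restructured data, but a common cache-friendly reshaping of this kind is switching from an array-of-structs to a struct-of-arrays, so the hot collision loop only drags the fields it actually reads through the cache. A hypothetical sketch:

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: each blob's fields are interleaved, so a pass that only
// needs positions and radii still pulls every other field into cache.
struct BlobAoS { float x, y, radius, velX, velY, mass; };

// Struct-of-arrays: the collision pass touches only the x/y/radius arrays,
// so every cache line fetched is packed with exactly the data being read.
struct BlobsSoA {
    std::vector<float> x, y, radius;
    std::size_t size() const { return x.size(); }
};

// Count overlapping pairs; reads nothing but positions and radii.
std::size_t CountContacts(const BlobsSoA& b) {
    std::size_t contacts = 0;
    for (std::size_t i = 0; i < b.size(); ++i)
        for (std::size_t j = i + 1; j < b.size(); ++j) {
            float dx = b.x[i] - b.x[j], dy = b.y[i] - b.y[j];
            float r  = b.radius[i] + b.radius[j];
            if (dx * dx + dy * dy < r * r) ++contacts;
        }
    return contacts;
}
```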
The remaining issues would have to be addressed later. With the Independent Games Festival (IGF) Audience Award deadline fast approaching, I needed to focus on gameplay. After going through and optimizing some functions to cut down on cache misses and rewriting part of the collision algorithms to be a little more processor friendly, Goo! finally got back above 60 FPS and was ready for display at the IGF Pavilion during the Game Developers Conference (GDC) 2008.
Act 4: A New Hope
While on display at the IGF pavilion, Goo! was running at around 60 FPS at 1920 × 1080 with 1024 blobs on the screen, dropping to around 50 FPS when goos were tightly grouped together. At GDC it had to display at 1024 × 768, and it maintained a healthy, V-Synced 60 FPS.
A few Intel software engineers approached me at my booth asking about the game and what I had done to get it to run. I told them that Goo! ran off a custom-made multi-threaded collision engine, and they told me I should enter the 2008 Intel Game Demo contest, which judges how efficiently your game scales from single-core to quad-core machines. Exciting news, but my game in its current state wouldn't even run on a single-core machine, and the scaling from dual to quad core probably wouldn't be significant enough to win or even place in the finals.
Returning from GDC, I began coding a new, much more efficient threading model for Goo!. With VTune analyzer's blessing, I decided to trash everything I had done up to that point with threading and collision. After destroying the worker threads and eliminating the copying back and forth, I was basically starting back at square one, but I now had almost two years of experimentation and experience to guide me in creating a more efficient threading model.
Instead of having threads constantly running, I decided to create threads when I needed them and allow them to expire. Rather than allocating two constantly running threads that would tie up even a quad-core machine, I could create more threads and split the work up according to the number of processor cores, helping the game scale better from dual core to quad core. This method proved several times more efficient than waiting, or even suspending and resuming, threads.
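A minimal sketch of that create-and-expire approach, splitting a contiguous range of blobs across one short-lived thread per core (the helper name and slicing scheme are my illustration, not Goo!'s actual code):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Spawn one thread per core, hand each a contiguous slice of the work, and
// join them all before physics consumes the results. Threads are created
// only when there is work and expire as soon as their slice is done, so
// nothing sits in the scheduler's queue between frames.
void ParallelFor(std::size_t count,
                 void (*work)(std::size_t begin, std::size_t end)) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;                       // fallback if unknown
    std::size_t chunk = (count + cores - 1) / cores; // ceiling division
    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < count; begin += chunk) {
        std::size_t end = begin + chunk < count ? begin + chunk : count;
        threads.emplace_back(work, begin, end);      // created on demand...
    }
    for (auto& t : threads) t.join();                // ...and expire when done
}
```

Because the slice count follows `hardware_concurrency()`, the same call saturates a dual-core or a quad-core machine without any code change -- exactly the scaling the contest judged.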
Over about two months I rewrote the entire collision engine, with phenomenal results. Goo!, which used to run at 60 FPS at 1920 × 1080, was now rendering at 100 FPS, dropping to around 80 FPS when goos were clumped tightly together. At my old default test resolution of 1280 × 720 it rendered at 140 FPS instead of 80 FPS. Having the threads create and expire allowed for much more efficient scheduling across the entire processor, shrinking the thread queue and, in turn, speeding up the game tremendously.
I recently started work on Goo! again to prepare for the Intel contest finals, hoping Goo! will be a finalist. The new engine is totally rewritten; not a scrap of code from the old engine exists. I have not built a final release version of Goo! on the new engine yet, but at 1024 × 768 with around 1600 blobs on the screen it runs at 70 FPS in debug and 190 FPS in a standard release build (not the final profiled release). When I push it to 2048 blobs, the debug build drops to around 35 FPS (roughly the frame rate the last iteration of the engine managed in its final optimized release build) while the standard release build stays around 100 FPS.
The new threading model is similar to the previous one in that I create threads as I need them, but the amount of code running on the threads is down dramatically, which prevents tons of cache misses and lets the new threads really pump out data at lightning speed.
As I progress in my career, I intend to learn much more about threading -- understanding it as completely and fully as possible. In the meantime, I hope my blunders and obviously bad choices in thread design will inspire and warn those who read this article.
Now...where's my GooGoo Cluster?