Profiling, Data Analysis, Scalability, and Magic Numbers, Part 2: Using Scalable Features and Conquering the Seven Deadly Performance Sins
Ensemble's Herb Marselas concludes his two-part series on game optimization. Learn how to improve your game's performance by using optimization tools and desiging scalable features, as his team did on Age of Empires II: The Age of Kings.
Last month we discussed some of the performance issues facing Age Of Empires II: The Age Of Kings (AoK). I described some of the tools that we used at Ensemble to collect that data, including Intel's VTune, NuMega's TrueTime, and our own profiling code. In this concluding article, I'll describe how to improve performance by effectively using performance tools and instrumentation. We'll also look at general types of problems we encountered as we optimized AoK which can affect any game. Then we'll wrap things up by taking a look at the last bastion of getting a game to run on the minimum platform when all else fails: scalable features.
Performance Targets
In the first part of this article, we discussed the CPU and process utilization of AoK using Intel VTune and our own profiling code. However, the data and graphs presented only showed half the story - the culmination of optimizing AoK.
Luckily, depending on your point of view, we're in the middle of optimizing an expansion pack to AoK called Age of Empires II: The Conquerors (TC). In The Conquerors, we're again facing the same problems of raising performance to an acceptable level on the minimum platform: a 166 MHz Pentium computer with 32 MB of memory.
Having already gone through the performance process for AoK, we know rough CPU and process utilization breakdowns for the minimum platform (Figure 1c and 2c). Using this information, we can set performance targets that we need to reach in order to perform well on a given platform. Setting performance targets based on known data removes the guesswork of knowing when an optimization is reaching the point of diminishing returns, or when it's at least "good enough."
We were fortunate that AoK and TC share so much of the same code. This allowed us to use the shipping performance of AoK as the baseline performance target for TC. However, if you're working on a new game that doesn't follow on from an existing work, it's still possible to create test cases to establish performance targets for your game.
Optimizing Attila
In the case of TC, the data presented comes specifically from analyzing the performance of a new single player campaign based on Attila the Hun. The problem is that one of the scenarios only runs at a few frames per second on the minimum platform. The Attila Session 1 numbers show the CPU utilization (Figure 1a) and process utilization (Figure 2a) of the first performance test on Attila. Intel Vtune and profiling instrumentation were used to collect this data.
Examining the CPU utilization by itself (Figure 1a), it's not apparent where the slowdown actually occurs. By comparing the CPU utilization of the Attila scenario with data from a comparable scenario from AoK, it's obvious that TC needs to use about 20% less CPU on this scenario. This will give the virtual memory manager (VMM) time to run.
The reason the VMM needs to have a specific portion of the CPU is that AoK (and TC) trade CPU cycles for physical memory on the minimum platform. This trade, however, requires CPU time to run the VMM so that it can move data between physical and virtual memory. Since AoK/TC are typically memory bound, not CPU bound, this is an effective tradeoff on the minimum platform.
To find the actual cause of the slowdown in Attila, we need to look at the breakdown of functionality within TC (Figure 2a). The first thing that's noticeable is that TC spends upwards of 40% of it's time in pathing code.
Increased pathing time could be a symptom of several problems: the pathing code is slow, a large number of units attempting to path, or units trying to path to an inaccessible location. Knowing that few changes were made to pathing code for TC ruled out slow pathing performance. Using statistics taken from AoK on a similar scenario, we can also see that we're executing pathing nearly four times as much as we would expect to be (Figure 2c).
With the pathing code ruled out, the next step is to watch the scenario to determine how many units are in it, and where they are going. By looking at the scenario, it became obvious that there were a number of units attempting to path to inaccessible locations. This was causing every unit that attempted to path to repeatedly try and fail to path to their desired location. After making the desired location accessible, the scenario was re-tested (Figure 2b). This change decreased pathing execution by nearly half.
Solving the pathing problem both reduced the amount of time TC spent in pathing code, and the amount of time the CPU spent executing TC (Figure 1b). Looking again to the performance targets, we can see that while we made an improvement, we still need to reduce TC's CPU utilization by nearly 20%, and still need to cut pathing by another 50%.
The next step will be to work with the designers to balance their goals for the scenario with the performance requirements for the minimum platform. This may be through either reducing the number of units, or creating special code in TC to handle the pathing situation that arises in this scenario.
The Seven Deadly Performance Sins
All the performance problems AoK encountered fell into one or more of seven general categories. These problems ranged from executing dead code to inefficient code and they can affect any game. Let's take a look at these categories.
1. Executing dead or superfluous code. Over the course of a long development cycle, a lot of code-based functionality is created, changed, and/or discarded. Sometimes discarded or superceded functionality is not removed from the game and continues to be executed. While it's a waste of effort to optimize code that should be removed in the first place, it can be difficult to determine whether a few lines of code, a function, or an entire subsystem is going unused.
One feature we had envisioned for AoK was renewable resources, so natural resources such as trees would increase over time if they weren't depleted. After play-testing the game, we found that this feature would often cause a game to last indefinitely, so we eliminated it. Later, when profiling game performance, we discovered that not all of the code had been removed -- the code that controlled tree regrowth appeared at the top of our profiler's function list, and we quickly removed it.
Unfortunately, superfluous code is not always so easily found, and often it's only when the code gets executed enough that you spot it on a profiling list. Such was the case with another problem also related to the trees in our game.
In our derived unit hierarchy of classes (described in last month's article), we easily added new units to the game by deriving new classes in the hierarchy. This hierarchy also is powerful in that functionality can be added or changed in a single place in the code to affect many different game units. One such change inadvertently added line-of-sight checking for trees, which is unnecessary since trees are not player-controlled. This was not an obvious performance problem and it was found only through logging data and stepping through code while trying to make the line-of-sight code faster.
2. Executing code too much. Trees, wall segments, and houses were often indicators of general performance issues in AoK, given the large amount of them on maps -- some AoK maps contain more than 15,000 trees. In order to process these units quickly, we created shortcuts in various derived functions within the unit hierarchy to avoid unnecessary unit processing. This became very complicated in some circumstances, since the computer player uses walls and houses as part of their line of sight. If it weren't for the differences between the way computer and human players used these units, the wall and house special processing would have been simpler. But the player's ability to use the buildings to scan for enemies made our AI processing simpler and more effective.
Pathing was another system that we spent a lot of time optimizing so that it wouldn't execute for too long. To do this, we capped the number of times the pathfinding system could be executed to a fixed number of iterations per unit per game update. When trying to optimize a pathing system by capping its execution, you have to balance the desire to limit CPU usage with the desire to not make players think the units exhibit dumb behavior when instructed to move or attack. This forced us to tweak the game a great deal to achieve the right balance between playability and speed, but that's often the trade-off you face when optimizing a game.
We tried a variety of caps to optimize the pathfinding system, and it was determined that at five or more pathing attempts, units attempting to retarget were the most responsive to the player. Five attempts were too many for the minimum platform, and we decided that two pathing attempts were too few based on the results of play-testing. We ultimately decided to cap the number of pathing attempts at three, once again based on our desire to balance playability with usability.
Briton Dark Age. Western European building set. | Viking Dark Age. Eastern European building set. |
---|---|
Castle Age. Western European building set. | Castle Age. Eastern European buidling set. |
Post-Imperial Age towns. Western European building set. | Post-Imperial Age towns. Eastern European building set. |
We also placed execution caps on other systems to improve performance. These included the number of pathing attempts made by a player's units, the amount of time the computer player could spend thinking during each game update, and the number of targets a unit could look for when retargeting.
3. Using inappropriate algorithms. While the pathing system in Age of Empires was a good general purpose system, it broke down in some specific circumstances (as discussed last month). Also, there were new performance issues raised by AoK, including a larger number of units and larger maps to path across.
We could have continued to attempt to optimize the single-pathing system, but it was obvious from the work performed on AoE that enough requirements had changed so that the algorithm could no longer stand on its own. What had been a good algorithm for AoE had become an inappropriate algorithm for AoK due to new and changing pathing requirements.
The AoE pathing system was used to path units from one general area to another over short distances in AoK. New pathing systems were added to path units quickly across the game map and to path units accurately within short distances. Also, as part of the new pathing system, a new unit obstruction manager (see Pottin-
ger in the For More Information section) was added for detecting unit collisions during pathing.
4. Encountering exceptional data. Built for efficiency from the start, the unit obstruction manager surprised us when it was identified by our performance profilers as one of the top problems. After reviewing the code to look for obvious (or not obvious) problems, we added instrumentation code that catalogued how units and their locations were stored within the quadtree.
With this logging code in place, we quickly saw that the majority of units placed in the quadtree ended up being not in the leaf nodes, but higher up in the quadtree branches. We also discovered that units touching the edge of a tile were interpreted as spanning two tiles, which caused performance problems. By bumping units back onto the proper tiles, we immediately saw a 300 percent performance boost in obstruction manager performance.
This code, as is most code, was written based on assumptions about the data. Programmers assume that the data processed by a function is of a certain type and will fall within certain limits or into certain sets. When data fell outside these expectations, our algorithm -- which would otherwise have performed well -- was identified as a performance problem.
Some sections of the game were instrumented from the very outset of development to help diagnose data processing problems that arose frequently in those sections of code. The unit AI, for instance, contained conditional #define statements to log approximately 50 different sets of performance information. These performance monitors could be used alone or in various combinations to help resolve performance issues related to data processing.
5. Inefficient memory usage. Poor performance can be caused by data structures that are not cache-line aligned, random access to main memory, using too much memory, allocating memory, and data dependencies. In AoK, memory problems could be especially severe since multiplayer games can last six hours or more, during which time tens of thousands of units can be created and destroyed.
Many data structures in performance-critical areas were compacted to fit in multiples or fractions of cache lines to improve memory access. There were also other areas that could have been improved by re-arranging data in structures of arrays, or streams (see For More Information section), but this would have made the code even more complicated.
To analyze and improve the memory usage of AoK, we used a number of different tools. The first tool that was a tremendous help was the set of Windows NT performance counters, which we used to examine memory statistics quickly. The NT performance counters provided a wide array of data about an application, including processor, process, memory, and network statistics. In the case of AoK, the most important memory statistic was Private Bytes, the amount of nonshared memory allocated for the AoK process.
By sampling the memory footprint at specific intervals, we created a general picture of the game memory footprint (Figures 3a and 3b). Since the game's memory requirements are effectively the same across Windows NT and Windows 98, the NT performance counters helped us examine how memory was used during a four-player game on the minimum specified player's system. This was key to helping us determine if AoK would fit within the minimum target memory size of 32MB.
Given the minimum system game requirements (Figure 4), we estimated that a game should typically last about 45 to 60 minutes. In the four-player game example shown in Figure 1a, about 21MB of memory was allocated by the game upon start up. Thirty minutes into the game, memory usage rises to around 23MB.
Figure 4. AoK minimum system game specifications. |
---|
4 player; any combination of human and computer players |
4-player map size |
75-unit population cap |
800X600 resolution |
Low-detail terrain graphics quality* |
*added as part of scalability effort
In contrast, look at the memory footprint of the eight-player game shown in Figure 3b. The addition of more players to the game requires more memory for their data at startup, as well as more memory to support the larger game map. The amount of memory consumed continues to grow during the game as more units and buildings are created until a plateau is reached. After reaching that plateau (not shown), the memory footprint starts tapering back down. The receding memory footprint occurs as players and units are defeated.
While these high-level memory statistics from the NT performance counters are quick and useful, often it's necessary to drill down to see which specific functions are allocating memory. To get that information, we created a simple memory instrumentation system to track memory allocations (see Listing 1). The memory allocation code tracked allocations and de-allocations by memory address, number of bytes requested, and file name and line number of the actual function call. It also provided a running count of the number of allocations and de-allocations, and the bytes of memory allocated in each game update loop.
Listing 1. A simple memory instrumentation system for AoK
//================================================
// memory.h header
//================================================
extern "C"
{
void *mymalloc( size_t size, const char *pfilename,
const long dwline);
void myfree( void *memblock, const char *pfilename,
const long dwline);
};
//================================================
#ifdef _INSTRUMENTMEMORY
#define malloc DEBUG_MALLOC
#define free DEBUG_FREE
#endif
#define DEBUG_MALLOC(size) mymalloc(size, __FILE__, __LINE__)
#define DEBUG_FREE(block) myfree(block, __FILE__, __LINE__)
//================================================
#ifdef _INSTRUMENTMEMORY
void MemoryInit(void);
int MemorySave(void);
void MemoryUpdate(void);
#else
#define MemoryInit
#define MemorySave
#define MemoryUpdate
#endif
//================================================
// eof: memory.h
//================================================
//================================================
// memory.cpp
//================================================
#include <windows.h>
#include <stdio.h>
#include <io.h>
// !!! DO NOT include memory.h header file here !!!
//================================================
static FILE *pmemfile, *pupdatefile;
static bool binitialized = false;
//================================================
static DWORD gdwAllocCount;
static DWORD gdwByteCount;
static DWORD gdwDeletions;
static DWORD gdwFrameCount;
//================================================
void MemoryInit(void);
//================================================
void MemoryUpdate(void)
{
if (pupdatefile)
{
fprintf(pupdatefile, "%lu\t%lu\t%lu\t%lu\n",
gdwFrameCount, gdwAllocCount, gdwDeletions, gdwByteCount);
gdwDeletions = 0;
gdwAllocCount = 0;
gdwByteCount = 0;
gdwFrameCount++;
}
} // MemoryUpdate
//================================================
extern "C" void *mymalloc( size_t size, const char *pfilename, const long dwline)
{
RGEMemoryEntry entry;
gdwAllocCount++;
gdwByteCount += size;
void *p = malloc(size);
if (!binitialized)
MemoryInit();
if (pmemfile)
fprintf(pmemfile, "malloc\t0x%X\t%ld\t%s\t%ld\n", p, size, pfilename, dwline);
return p;
} // mymalloc
//================================================
extern "C" void myfree( void *memblock, const char *pfilename, const long dwline)
{
RGEMemoryEntry entry;
gdwDeletions++;
if (!binitialized)
MemoryInit();
if (pmemfile)
fprintf(pmemfile, "free\t0x%x\t\t%s\t%ld\n", memblock,
pfilename, dwline);
free(memblock);
} // myfree
//================================================
void MemoryInit(void)
{
if (binitialized)
return;
pmemfile = fopen("c:\\memory-alloc.txt", "wb");
pupdatefile = fopen("c:\\memory-update.txt", "wb");
if (pmemfile)
fputs("type\tptr\tbytes\tfilename\tline\n", p);
if (pupdatefile)
fputs("frame\tallocations\tdeletions\ttotal bytes\n", p);
binitialized = true;
} // MemoryInit
//================================================
int MemorySave(void)
{
fclose(pmemfile);
fclose(pupdatefile);
pmemfile = 0;
pupdatefile = 0;
return 0;
} // MemorySave
//================================================
// eof: memory.cpp
//================================================
The sheer number of memory allocation schemes used in AoK complicated our memory analysis. AoK uses the C++ new and delete operators; C library malloc, free, and calloc functions; and Win32 GlobalAlloc, GlobalFree, LocalAlloc, and LocalFree functions. In the future, we will be actively restricting ourselves to a subset of these functions.
To reduce memory fragmentation and eliminate overhead caused by allocating and de-allocating memory, memory pooling was used in many subsystems. While this significantly increased performance, it did create problems when trying to fix bugs where code referred to recycled data.
In an attempt to improve performance further, we utilized MicroQuill's SmartHeap to manage memory in release builds. (We were unable to use it in debug builds due to incompatibilities with interactive debugging.) In the final analysis, the performance benefit of SmartHeap over the standard heap manager wasn't clear to us, due to the efforts we made to reduce and pool memory allocations.
After profiling performance and memory usage, it turned out that the most performance-limiting factor in AoK could be the Windows 95/98 virtual memory system. Unlike Windows NT/2000, Windows 95/98 doesn't require or configure a fixed-size swap file for virtual memory. To make matters worse, the swap file can grow and shrink as a program runs. An expert user can create a swap file of fixed size, but it's not something the vast majority of users can do or should have to worry about.
AoK relies on the virtual memory system to handle the growing footprint of game data over time within the game. It also uses multiple read-only memory-mapped files to access game graphics and sounds residing in large aggregated resource files. These memory-mapped files ranged in size from 28MB to 70MB. Since the amount of virtual memory available can vary so widely on a user's Windows 95/98 system, this ended up being the number one AoK performance issue beyond our control. It should be noted that this virtual memory problem didn't effect every minimally configured system. Virtual memory problems in Windows 95/98 seemed to occur just on certain systems, even when identically configured systems performed with little or no problem.
6. Inefficient code. Rewriting inefficient code is likely the most well known performance optimization, but it was typically the last resort to fix our performance problems. In many cases, the performance problem was resolved by identifying and fixing one of the previously mentioned deadly sins.
The easiest place to attempt to improve inefficient code is with the compiler optimization settings. Due to the size of AoK, we chose to compile release builds with the default "maximize speed" setting for all modules. This may cause some code bloat (since speed is favored over size), but in general it's a good choice. We chose not to use "full optimization" since we've seen few programs that could run after using it.
Since shipping AoK we've been looking at the performance benefits of compiling with "minimize size" and then using #pragma (or module settings) to optimize specific hotspots for speed. This seems to be a better trade-off than just using the single speed optimization setting for everything.
In AoK we chose to use the "only _inline" option in Visual C++, instead of inlining "any suitable" function. This let us choose which functions to inline based on their appearance in the profile list. Inlining any suitable function would most certainly increase the code size and lead to slower performance.
Using an alternate compiler, such as Intel's C/C++ compiler, to optimize one or more performance-intensive modules is also another way to realize some additional performance gains. We decided against this for AoK, however, because of the risk associated with changing compilers (or even compiler versions) near the ship date.
7. Other programs. One of the greatest strengths of Microsoft Windows is its ability to preemptively run multiple programs at the same time. However, it can be a huge drawback when programs that the user is unaware of take CPU time away from a game or cause the game to lock up. For instance, during the play-testing phase of AoK's development, we received reports of problems that we couldn't reproduce on our own systems. Sometimes these issues were caused when the game entered an unstable state, but often other programs running in the background on the tester's computer caused the reported problems.
Virus scanners and other programs spontaneously running in the background while a tester was playing AoK were the most widespread cause of externally induced performance problems. Unfortunately, there's no way to easily and adequately interrogate a player's computer and warn them about potential problems that other programs can cause.
The most severe issue related to other programs involved the Windows Critical Update Notification. Play-testers sometimes reported input lock ups during game play for no apparent reason. We accidentally discovered that when AoK was in full-screen mode, the Critical Update Notification could pop up a dialog box behind AoK. This would take the focus off AoK and make it appear to players as if the game had stopped accepting input. Changing AoK to handle situations like this was relatively easy once the problem was identified. Other applications likely cause similar behavior to occur, but it's only by trial and error that these problems are identified.
When we had finally squeezed as much performance as we could out of AoK for the minimum platform, we were still left with aperformance deficit in the area of terrain drawing. We couldn't make the feature optional since players need to see the terrain, yet we couldn't make it any faster, either. The only alternative was to provide different implementations of terrain drawing or different levels of terrain detail.
We decided to offer three different terrain detail settings: a fast algorithm with low detail for the minimum platform, a medium detail (but slower) one for mid-range platforms, and a high detail (slower still) for high-end platforms. This allowed AoK to run on a lower minimum platform, but still give the user on the high end additional visual quality to look forward to. This was a tactic used for a number of features in AoK.
Scalable features least likely to confuse the player are those that fit directly into the context of the game, such as the number of players or the game map size. In total, there were six scalable features, four of which fit the game context. In the game, these are not called "scalable features," but "game options" (Figure 5). These options are as follows:
Figure 5. Game play and feature scalability.
Number of players | 2 to 8. in any combination of human or computer |
Size of map | 2 to 8 player sizes and "giant" size |
Unit population cap | 25 to 200 units per player |
Civilization sets | Western European, Eastern European, Middle Eastern, Asian |
Resolution | 800X600 |
1024X768 | |
1280X1024 | |
Three terrain detail modes | High detail - multi-pass, fast lower-quality filtering, RGB color calculation |
Medium detail - multi-pass, faster lower-quality filtering, RGB color calculation |
Number of players. The simplest scalable feature within AoK is the number of computer-controlled or human opponents or allies that a person chooses play with, which is up to eight players per game. The more players, the higher the performance required by each player's system. Up to four human or computer-controlled players can play on a minimum-specified system.
Map size. Related to the number of players is the size of the map. The map size is expressed in terms of numbers of players (for example, two-player map, four-player map, and so on). The number of players supported by a specific map size was determined by the distance that our game designers felt should be between player starting positions. But map size is independent of the number of players, so you can have a two-player game on a big eight-player map. This gives players the choice to accept our recommendations, space themselves out further, or squeeze in tighter. Based on our choice of four players for the minimum platform, the default map size is for four players.
Population limit. The unit population cap sets the maximum number of units the player can build during a game. By default, this value is 75, but it can range between 25 and 200 units. Again, the user does not see this as scalability, but as a tweakable option for creating the perfect game. We chose to make 75 units the default population cap because game performance on the minimum platform degrades too much at the next higher population limit.
Different artwork. In addition to these scalable game options is also an implied, but generally unrealized, scalability in AoK's art assets. Each of the 13 civilizations in AoK is assigned to one of four sets of building art. Each building art set represents the area of the world where the civilization is from. For example, the Japanese use the Asian building art set and the Britons use the Western European building art set. There are also Eastern European and Middle Eastern art sets. These art sets not only have their own styles, each also has different styles of buildings for each of the four evolutionary ages in AoK: Dark Age, Feudal Age, Castle Age, and Imperial Age. In other words, an Asian Dark Age house looks different from an Asian Feudal Age house, or an Asian Castle Age house.
This "upgrade" of buildings within each art set as the ages progress creates an interesting memory allocation curve. In the beginning of the game, all the players use the Dark Age version of their particular art set. As the game progresses, players advance through the ages at different rates. Since the advancement to the next Age causes the building style to change for the player, new art must be loaded and displayed. This increase in memory allocation continues until all players again reach the same age.
Assuming all players start in the Dark Age and survive to the Imperial Age, the memory allocation exhibits bell curve behavior. The worst case is when there is a player in each of the four Ages at the same time, which sometimes happens in the middle of a game. If all the players in an eight-player game select civilizations from different art sets, they use at least twice as much memory as if they had chosen civilizations from the same art set.
Display resolution. This, along with the terrain detail (which is explained in a moment), is one of two scalability options within AoK that is outside the scope of the game's design concept. The default display resolution is 800x600, and can scale up to 1280x1024. Again, the lowest resolution was chosen for the minimum system.
Terrain detail. A terrain detail setting was introduced to reduce the amount of processor time required to draw 2D isometric terrain on slower computers by reducing the visual quality. Three levels of terrain detail are provided. The terrain highest detail setting uses multiple rendering passes, anisotropic filtering, and RGB color to bring out the best detail. The medium-detail setting replaces anisotropic filtering with a lower quality but faster filter; the low-detail setting uses flat shading and an eight-bit color lookup table similar to the terrain in the original AoE. No matter which terrain detail setting is used, the final display output only uses 256 colors.
The choice of which level of terrain detail to display is made automatically by AoK the first time it runs. Since all rendering is performed on the CPU, this decision is made quickly by using a test that gauges the CPU speed. Players can change the setting later using an in-game menu.
Unfortunately, scalable features that fall outside of the game design (such as the last two options above) are less likely to be understood by players. This lack of understanding can lead players to change settings, resulting in a negative impact on their game experience without them realizing what they did, and leave them unable to restore the original settings.
Key Lessons
After analyzing and improving the performance of AoK, our team learned some essential lessons that we hope to use to improve the quality and performance of our future products.
We hope it will improve your future games, as well. These lessons are:
Obtain objective performance information.
Don't leave performance issues until the code's completely done. Work to improve it throughout the course of the project.
Don't fall prey to the Seven Deadly Performance Sins.
Create a saved game, scenario, and/or recorded game system to capture performance workloads.
Make performance statistics easy to collect, see, and use.
If performance statistics aren't easy to use, they won't be used.
Set performance targets. If you don't know how far you should go, how do you know when you've gotten there?
Use a single memory allocation scheme.
When all else fails, create scalable versions of features.
Herb Marselas currently works at Ensemble Studios. He helped out on Age of Empires II: The Age of Kings. Shhhh! Please don't tell anyone he's working on a secret 3D-engine project called [deleted]. Previously, he worked at the Intel Platform Architecture Lab where he created the IPEAK Graphics Performance Toolkit. You can reach him at [email protected].
For more information and awknowledgements please refer to Profiling, Data Analysis, Scalability, and Magic Numbers: Meeting the Minimum Requirements for Age of Empires II: The Age of Kings
Read more about:
FeaturesAbout the Author
You May Also Like