Writing Efficient Game Code for Next-Gen Console Architectures
Next-gen consoles have introduced us to multiple cores, longer pipelines, Level 2 caches, and multi-threading issues. SN Systems' Andy Thomason analyzes the key concepts needed to write efficient code for all next-gen consoles.
Introduction
Code performance is vital to a successful game title, and poor performance can lead to shipping delays, cut content, and a much more expensive development process. Getting code generation right has to start from the beginning of a project, as the architecture of a game system determines performance as much as anything else. A naïve approach to architecture design will spell disaster for a project.
Things are only getting more complicated, since next-gen consoles have introduced us to multiple cores, longer pipelines, Level 2 caches, and multi-threading issues. Most next-gen consoles have chosen to pack more computing elements onto their chips than most workstations. This comes at the price of omitting features intended to make legacy applications run faster, such as register renaming and large caches.
The plus side is that next-gen consoles are capable of hundreds of gigaflops, the equivalent of about a thousand Cray-1s, the most powerful supercomputer available in 1976 when I was first learning to code. This article illustrates a few of the key elements that lead to high-performance code in the system as a whole.
Memory Access
The most important optimization you can make is to reduce the size of your data. A typical title will spend a significant proportion of its time waiting for Level 1 and Level 2 cache misses. Level 1 misses are bad; Level 2 misses can be catastrophic.
A typical Level 1 cache miss on a modern processor will waste tens of cycles, whereas a Level 2 miss will waste hundreds. It is very important, therefore, that the game state is kept under the size of the Level 2 cache, otherwise the cache will thrash on every frame and performance will plummet. The drop in performance will be sudden and apparently unexplained, but the cause will be that the code is touching too much data in a frame. If the code touches 511k of data per frame on a machine with a 512k Level 2 cache, it will run fine. As soon as the working set grows beyond 512k, the LRU replacement mechanism will cause every access to miss. This was not the case on previous-generation consoles, which had only Level 1 caches with low cache miss penalties.
Use of bit fields, carefully designed access methods, and heavy-duty pre-processing of game data will significantly reduce game state size. Systems that work on larger datasets, like collision plane information or AI route finding, should be encouraged to access the same patches of memory from frame to frame. Even though the datasets may be quite large, regular access patterns will not thrash the cache.
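As a minimal sketch of the bit-field approach (the field names and sizes here are invented for illustration), a game-object structure packed into bit fields can be several times smaller than a naïve layout of ints and floats, so several times as many objects fit into each cache line:

// Naive layout: roughly 20 bytes per object.
struct EnemyStateNaive
{
    int   team;       // only ever holds 0-3
    int   health;     // 0-1000
    int   animFrame;  // 0-255
    bool  alive;
    bool  visible;
    float heading;    // full precision where a coarse angle would do
};

// Packed layout: 4 bytes per object.
struct EnemyStatePacked
{
    unsigned team      : 2;
    unsigned health    : 10;
    unsigned animFrame : 8;
    unsigned alive     : 1;
    unsigned visible   : 1;
    unsigned heading   : 10;  // angle quantized to 1024 steps
};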
Avoid using look-up tables and constants in memory: these cause cache misses and thrash the Level 2 cache. GPUs suffer from memory access problems too; typically a GPU has its own cache that follows the same rules. Using bytes and shorts for vertex data is recommended, and balancing texture LOD by adjusting mip-map bias is an important tool. Use anisotropic filtering, rather than a positive LOD bias, to sharpen textures.
Most advanced game systems have a method of sub-sampling the artist's original textures to reduce texture usage. Different scenes make different use of the same texture, so a texture may be loaded at a lower resolution in scenes where the player will never encounter it close-up. Finally, the smaller the texture pool, the higher the GPU performance; sorting the scene by texture will preserve GPU cache coherency.
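As an illustration of the vertex data point (the exact fields and encodings here are assumptions, with the quantization scales applied in the vertex program), a packed layout of shorts and bytes can halve vertex fetch bandwidth compared with an all-float layout:

// 32 bytes per vertex.
struct VertexFat
{
    float pos[ 3 ];       // 12 bytes
    float normal[ 3 ];    // 12 bytes
    float uv[ 2 ];        // 8 bytes
};

// 16 bytes per vertex.
struct VertexPacked
{
    short       pos[ 3 ];     // positions quantized to the mesh's bounding box
    short       pad;          // keep the following fields aligned
    signed char normal[ 4 ];  // unit normal scaled to -127..127
    short       uv[ 2 ];      // texture coordinates in 12.4 fixed point
};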
Branch Avoidance
The next biggest source of optimization is to re-code critical methods to avoid branches; this includes virtual function calls and pointers to functions. Branch misprediction causes pipeline stalls and instruction look-ahead failures, which are usually much more expensive. Templates are a great way to generate variations on methods without virtual function calls, and function pointers passed as constants to inline functions can often be converted to direct calls. Profile-driven branch tuning can feed information about branches back into the executable. Branches usually have two forms, likely and unlikely; when a branch is first encountered by the thread, its direction is predicted from the likelihood hint encoded in the instruction. A profile-driven optimizer can choose branch likelihoods and re-order blocks of code to avoid loading uncommon blocks into the Level 1 cache.
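As a hedged sketch of the template technique (the policy and object names are invented for illustration), a per-object update that would otherwise go through a virtual call or a function pointer can take its behaviour as a template parameter, so the compiler sees a direct, inlinable call with no branch target to mispredict:

struct Object { float pos, vel; };

// Policies supply the per-object behaviour as ordinary static functions.
struct MovePolicy   { static void Update( Object &o ) { o.pos += o.vel; } };
struct FreezePolicy { static void Update( Object & )  { } };

// One instantiation per policy: the call is resolved at compile time and
// inlined, so there is no virtual dispatch and no indirect branch to predict.
template< typename Policy >
void UpdateAll( Object *objects, int count )
{
    for( int i = 0; i != count; ++i )
        Policy::Update( objects[ i ] );
}

// Usage: UpdateAll< MovePolicy >( movers, numMovers );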
We can eliminate branches entirely by changing code of this kind:
if( x == 0 ) y = z;
to
y = x == 0 ? z : y;
and
if( ptr != 0 ) ptr->next = prev;
to
*( ptr != 0 ? &ptr->next : &dummy ) = prev;
With a good compiler, and given a dummy variable to absorb the unwanted store, this will execute far faster than the branching code that the "if" form generates. Next-gen consoles have deep pipelines that need large, uninterrupted function bodies to schedule efficiently.
Inlining
After branch avoidance, inlining can be used to remove call overhead and increase function body size, but at the expense of extra Level 2 cache usage. Some compilers, like the SN Systems compiler (SNC), will auto-inline functions if requested. Even legacy code can be made to go faster provided the scope of functions is limited by “static.”
Given a fresh design, it is possible to arrange for all functions to be inlinable, especially if all functions are defined in header files and large parts of the program are compiled as a single unit. Code like this will actually compile and link faster than its multi-module counterpart, owing to the reduction in debug information emitted by the compiler.
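A minimal sketch of such a single-unit build (the file and function names are hypothetical): each subsystem lives in a header of static or inline functions, and one master source file includes them all, so the compiler sees every call site and callee together and can inline freely.

// game_all.cpp -- the only file handed to the compiler.
#include "world.h"      // shared types
#include "physics.h"    // static inline void IntegrateBodies( World &, float );
#include "render.h"     // static inline void BuildDisplayList( const World & );
#include "game_loop.h"  // static inline void RunFrame( World & );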
Most compilers have optimization passes whose cost grows roughly with the square of the compilation unit's size, so if compile times get too high, the biggest units may need to be split or compiled at lower optimization levels. Another benefit of inlining is better alias analysis, which helps the compiler remove unnecessary loads and stores and lengthen the blocks available to the scheduler. Inlining also removes the register usage constraints imposed by the calling convention.
Floating Point Pipeline Usage
The current generation of processors has long floating point pipelines that may allow out-of-order completion. This allows faster clock rates, because the number of gates that the pipeline travels through can be reduced if an operation is spread over several cycles. The price that we pay for this is that the results of floating point operations are not available for tens of cycles. It is important, therefore, to avoid using floating point results for as long as possible and to make function bodies that use floating point code as large as possible, with no branches.
In particular, be careful when converting the result of a floating point operation to an integer. The pipeline will have to be flushed, causing big delays.
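As a trivial sketch of the kind of conversion that can often be avoided entirely (the function names are made up for illustration), staying in the floating point domain removes the round trip from the floating point pipeline to the integer registers:

// Forces a float-to-int conversion on every call:
int IsTallSlow( float height )
{
    int h = (int)height;           // conversion stalls the pipeline
    return h > 100;
}

// Stays in the floating point domain; no conversion is needed at all:
int IsTallFast( float height )
{
    return height > 100.0f;
}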
Vector Units and Supervectors
Ideally, all maths should be done using the vector units available either on the main processors or on coprocessors. Unfortunately, retro-fitting a vector maths library to float-based code is rarely effective, resulting in only a few percent improvement to critical functions. A scalar type should be included in the vector maths library and used extensively in place of "float" arithmetic in cases where the cost of copying between scalar and vector registers would otherwise be prohibitive. One way of getting good use out of your vector units is the concept of Supervectors: instead of vectors of four floats, operate on units of sixteen or more to absorb the latency between instructions.
Instead of:
void Add( vector &elem, vector a, vector b )
{
elem = vec_add( a, b );
}
Write:
void Add( vector elem[], vector a[], vector b[] )
{
elem[ 0 ] = vec_add( a[ 0 ], b[ 0 ] );
elem[ 1 ] = vec_add( a[ 1 ], b[ 1 ] );
elem[ 2 ] = vec_add( a[ 2 ], b[ 2 ] );
elem[ 3 ] = vec_add( a[ 3 ], b[ 3 ] );
elem[ 4 ] = vec_add( a[ 4 ], b[ 4 ] );
elem[ 5 ] = vec_add( a[ 5 ], b[ 5 ] );
elem[ 6 ] = vec_add( a[ 6 ], b[ 6 ] );
elem[ 7 ] = vec_add( a[ 7 ], b[ 7 ] );
}
The second function will take about the same time to complete as the first, because of FPU latency, but will do eight times the work. Note that the __restrict keyword may be needed to help alias analysis if the function is not inlined, as sketched below. This is an important numeric optimization. The general rule of thumb is to do the same thing many times in a row: pipelines work best when the same instruction is issued repeatedly for the lifetime of the pipelined operation.
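If Add is called out-of-line, a restrict-qualified signature along these lines tells the compiler that the three arrays never overlap, so it is free to interleave the loads and stores:

void Add( vector * __restrict elem,
          const vector * __restrict a,
          const vector * __restrict b );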
Next-gen coprocessors have long latencies but large numbers of registers, which can be exploited to fill the gaps caused by read-after-write dependencies. Structure-of-arrays (SOA) methods can be used to apply regular arithmetic to vector units, with Supervectors treated like scalars. In an SOA layout, values for many objects are stored as arrays: the x, y and z components of a vector, for example, each become elements of three separate Supervectors. It is important to batch up numerical operations so that they can be performed together. For example, in a particle system that adds gravity to velocity, the gravity value needs to be loaded only once if all particles of the same type are evolved together in one loop.
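A minimal structure-of-arrays sketch of the particle example (the names are invented, and vec_madd is assumed to be available alongside vec_add as a multiply-add intrinsic): the y components of many particles live in their own arrays, gravity is loaded once for the whole batch, and each iteration advances a full vector of particles.

// posY and velY hold the y components of many particles, packed four to a
// vector register; the x and z components live in identical arrays of their
// own.  gravityDt and dt are splatted into registers once for the whole batch.
void EvolveY( vector * __restrict posY, vector * __restrict velY,
              int vectorCount, vector gravityDt, vector dt )
{
    for( int i = 0; i != vectorCount; ++i )
    {
        velY[ i ] = vec_add( velY[ i ], gravityDt );        // v += g * dt
        posY[ i ] = vec_madd( velY[ i ], dt, posY[ i ] );   // p += v * dt
    }
}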
If spare GPU power is available, simple linear tasks can also be run on the GPU, although synchronization can be an issue unless the result is used only for rendering. Again, it is essential to batch up similar tasks.
Thread Management
First-timers to embedded multithread environments will usually make the mistake of spawning too many threads. It is important to spawn only as many threads as there are hardware resources to execute them, otherwise the operating system will thrash the threads and take up a considerable overhead. Although multiple threads can be created on a single core, they will often fight for resources, negating the advantage of having them. Care should be taken to schedule complementary tasks into threads. For example, mix a memory-intensive task with a computation-intensive task, so that they do not compete for memory resources. Spawn the threads in the game initialization code; it is not wise to attempt to spawn threads in the game loop, as this is an expensive operation. Profiling the game code early in the development cycle can spot operating system overheads before they become serious.
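A minimal sketch of this pattern, using pthreads for illustration (a real console SDK will have its own thread API, and the worker count here is an arbitrary placeholder to be tuned per platform): a small, fixed pool of worker threads is created once at initialization and then fed work from the game loop, which never spawns threads itself.

#include <pthread.h>

enum { NUM_WORKERS = 5 };   // roughly one per spare hardware thread

static pthread_t workers[ NUM_WORKERS ];

static void *WorkerMain( void *arg )
{
    // Pull tasks from the job queue until the game shuts down.
    return 0;
}

// Called once from game initialization; never spawn threads in the game loop.
void SpawnWorkers()
{
    for( long i = 0; i != NUM_WORKERS; ++i )
        pthread_create( &workers[ i ], 0, WorkerMain, (void *)i );
}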
Next-gen engines may have an Execution Plan, where tasks are given fixed timeslots to complete in, with the timeslots organized so that the outputs of one task feed the inputs of another. Failure of a task to complete in time may cause an assertion failure in pre-production builds, a short-term irritation tolerated to prevent last-minute optimization panics and slipped deadlines. Execution plans can be represented as XML files and tie in nicely with systems like the Collada scene description language. Careful design of the execution plan for a scene or sub-scene is essential: if the resources available in a scene are finite, it is possible to guarantee frame rate without over-allocation.
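A sketch of how such a plan might look once loaded into the game (the task names, budgets and the ReadTimerMs helper are hypothetical): each entry pairs a task with a time budget, and a pre-production build asserts as soon as a task overruns its slot.

#include <assert.h>

// Hypothetical per-frame tasks and a platform timer.
void UpdateAI();
void UpdatePhysics();
void BuildDrawLists();
float ReadTimerMs();

struct PlanEntry
{
    void  (*task)();     // the work scheduled into this slot
    float budgetMs;      // the timeslot allowed for it
};

// One frame's plan: the output of each task feeds the input of the next.
static const PlanEntry framePlan[] =
{
    { UpdateAI,       2.0f },
    { UpdatePhysics,  4.0f },
    { BuildDrawLists, 3.0f },
};

void RunPlan()
{
    for( unsigned i = 0; i != sizeof( framePlan ) / sizeof( framePlan[ 0 ] ); ++i )
    {
        float start = ReadTimerMs();
        framePlan[ i ].task();
        // Pre-production builds assert as soon as a task overruns its slot.
        assert( ReadTimerMs() - start <= framePlan[ i ].budgetMs );
    }
}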
Using multiple threads on one core reduces effective latencies as instructions are executed alternately. A good compiler will have an optimization option to support this.
Serialization and Pre-compilation
Serialization in-game, which causes the long load times common in PC games, can be avoided by pre-serializing game state data and constants in the game compilation phase of the tool-chain. Binary images of structures load much faster, as no code needs to be executed during loading. Seamless scene changes become possible, as code and data resources for a scene can be loaded without using precious CPU cycles. On consoles, this is the preferred method.
Games usually load from DVD, which has little tolerance for seeking. Large game projects usually have banks of servers to execute game-specific processes like lightmapping, collision data building, and AI routefinding table construction. There are a number of distributed build systems available to help developers build large code and data systems.
For example, it is possible to convert game data into binary images of in-game structures that can be loaded in seconds in megabyte-sized blocks direct from DVD. Often for DVD efficiency, data are duplicated to avoid doing seeks. It is a false economy to store a texture only once on a DVD, for example, as this will cause game loading times to multiply.
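As a sketch of the binary image approach (the structure layout and names are assumptions), the tool chain writes out structures exactly as they will sit in memory, with pointers stored as offsets from the start of the image, so loading is just a block read plus a cheap fixup:

struct Mesh;

struct SceneHeader
{
    int numMeshes;
    int meshTableOffset;   // byte offset from the start of the image
};

// The image is read straight off the disc into one block; no per-object
// parsing or construction runs at load time.  Offsets stored in the image
// are turned back into real pointers here.
Mesh *GetMeshTable( void *image )
{
    SceneHeader *header = (SceneHeader *)image;
    return (Mesh *)( (char *)image + header->meshTableOffset );
}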
Memory Management
High-performance games do their own memory management. Calling malloc or the default new in a game loop is considered irresponsible. Memory fragmentation is the biggest cause of console game crashes, and in PC games it will cause slow-downs after long periods of gameplay. Heap management systems range from simple "hunk" allocators to systems that sort block sizes by location in memory and can allocate from pools in a few cycles. Debug information is usually part of the heap management system; it is useful to see which subsystems are allocating blocks of memory.
Avoiding Ransom Blocks, which prevent two large free blocks from being merged, is important. A scene should be given its own memory area to allocate from. This means that when the scene is torn down, the entire area can be freed without fear of fragmentation.
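A minimal "hunk"-style scene allocator along these lines (names, alignment and failure handling are illustrative assumptions) hands out memory by bumping a pointer, and tearing down the scene is a single reset, so fragmentation never builds up:

#include <stddef.h>

struct SceneArena
{
    char  *base;    // start of the scene's memory area
    size_t size;    // total bytes reserved for the scene
    size_t used;    // high-water mark of allocations
};

// Allocation is a pointer bump: a few cycles, no search, no free list.
void *ArenaAlloc( SceneArena &arena, size_t bytes )
{
    bytes = ( bytes + 15 ) & ~(size_t)15;     // keep 16-byte alignment
    if( arena.used + bytes > arena.size )
        return 0;                             // scene memory budget exceeded
    void *block = arena.base + arena.used;
    arena.used += bytes;
    return block;
}

// Tearing down the scene frees everything at once: no fragmentation possible.
void ArenaReset( SceneArena &arena )
{
    arena.used = 0;
}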
Conclusions
Next-gen consoles are very sensitive to poor programming practice and require nice long function bodies with no branches in order to optimize the code effectively. The main CPU, the coprocessors, and the GPU all need careful planning to avoid thrashing of caches, but if they are treated well, huge performance benefits are possible, far more than can be achieved with PC games. "The bigger, the better" is the motto for vectorization: don't limit yourself to the four floats available in SIMD implementations, as sixteen- or thirty-two-float Supervectors are much better at soaking up the stalls on read-after-write dependencies.
Careful thread management is essential to reduce latency and avoid competition for resources. The use of an Execution Plan is a wise idea to tie together macroscopic components of the system.