
In-Depth: Cold Code

In this reprinted <a href="http://altdevblogaday.com/">#altdevblogaday</a> in-depth piece, SN Systems compiler engineer Andy Thomason explains how you can improve performance for your cold code -- or code executed at most once a frame.

Andy Thomason, Blogger

November 4, 2011


A Brief History Of Code Temperature

Since the dawn of the computing age, processor speed has been dominated by memory speed, a phenomenon known as "the memory wall". No matter how many ALUs you throw onto the silicon, applications will always be limited by the speed of memory access.

To solve this problem, caches were introduced, and to an extent they help to reduce memory bandwidth issues. But they come at a considerable price, not only in the silicon real estate they occupy, but primarily in additional latency. The result is that benchmark code performance improves, but real code performance takes a big hit. I'd like to explain why this is and what we can do to mitigate the effects.

What Is Code Temperature?

In compiler terms, code is split into "hot" code, which is frequently executed, and "cold" code, which is executed at most once a frame. Caches do not help game loops on the whole: by the time the loop is re-run, the caches have been flushed and all code ends up cold. Cache locking and cache affinity can help, but they require O/S assistance that is not generally available.

The Anatomy Of A Game Loop

We all know the kinds of computational tasks that our games carry out in the game loop. A great deal of emphasis is always placed on the hot loops that perform matrix stack calculations or physics simulation, but very little is understood about the cold code that makes up the bulk of the code executed in a loop. In fact, as compiler engineering is largely about statistics, we frequently analyze the composition of game programs to work out where to put our optimization effort.
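To make the hot/cold distinction concrete, here is a minimal sketch (the functions are my own illustration, not from the article): the hot routine runs thousands of times per frame over a small working set that stays in cache, while the cold routine runs once per frame and pays cache misses every time.

```cpp
#include <cstddef>

// Hot code: executed thousands of times per frame; its instructions
// and data stay resident in the caches while the loop runs.
float sum_squares(const float* v, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += v[i] * v[i];
    return acc;
}

// Cold code: executed at most once per frame. By the next frame the
// caches have been flushed, so each run pays instruction-cache misses
// again -- the cache gives it no lasting benefit.
int update_frame_counter(int frame) {
    return frame + 1;
}
```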
Developers look closest at their hot code because it is an easy optimization target; in fact, most of the game code we see has had its hot loops removed and transferred to coprocessors, which is a largely mechanical task. Hot code is easily identified from profile traces, and as it is generally very small, it is easy to tweak loops to improve performance, for example by manually unrolling them or using GPGPU coding.

This tends to leave the cold code dominating the performance of the game loop, and to the developer this looks like uniform background noise as "whack-a-mole" profile-fix-profile cycles flatten out the hot functions.

Making Cold Code Go Faster

When we analyze the composition of game loops, we find that the vast majority of instructions are loads, address calculations, conditional branches, and calls/returns. Floating point and general integer operations are actually very rare: vector and floating point operations account for less than 1 percent of a typical game. Function prologues and epilogues are the single most identifiable part of the code, and prologues often account for 10 percent or more of the execution time, mostly because the branch to the function is mispredicted, usually as a result of a virtual function call.

Pass-by-reference is very expensive because we need extra instructions to store to memory and load back again. On CPUs that have no store-forwarding mechanism, this can seriously impact performance.

A good linker should provide an option to sort functions by call order. This has the side effect of minimizing cross-page branches and reducing iCache pollution. Block reordering within cold functions can also help, but the heuristics differ from the hot-code block reordering common in GCC and other compilers. Interrupt handlers and O/S activity are also a common cause of cold code slowdown.
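To illustrate the pass-by-reference cost mentioned above, here is a hedged sketch (the types and names are mine, not the author's): a small struct passed by reference must round-trip through memory, while passing it by value lets the compiler keep it entirely in registers.

```cpp
struct Vec2 { float x, y; };

// By reference: the caller typically stores the struct to memory and
// the callee loads it back -- extra instructions, and a potential
// stall on CPUs without store forwarding.
float dot_ref(const Vec2& a, const Vec2& b) {
    return a.x * b.x + a.y * b.y;
}

// By value: a struct this small fits in registers, so no memory
// round-trip is needed.
float dot_val(Vec2 a, Vec2 b) {
    return a.x * b.x + a.y * b.y;
}
```

Both compute the same result; the difference is only in the generated code, which is why this shows up in cold-code profiles rather than in unit tests.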
If you catch your audio programmers using sleep(), you should engage in a program of re-education: any O/S call is a cache killer. On the whole, we find the best way to improve cold code performance is to reduce code size in all but the hottest parts of the code.

What Your Compiler Can Do For You

The science of cold code optimization is still in its infancy, so the contributions that current compiler technology can make are still very limited. You can write smaller code, for instance, but this is very difficult even for experienced programmers. Making game logic data-driven is a good step in the right direction.

On Intel and ARM platforms there is a limited set of code-size-reducing optimizations associated with the -Os (/Os) groups. If you are building on Windows, you may find that your whole project runs faster at -Os, even the hot loops. The Thumb-2 instruction set has been a great boost to lowering code size on mobile devices: many instructions are 16-bit, and if the compiler has good register allocation algorithms (and I don't mean Chaitin-Briggs!), it is possible to choose the maximum number of 16-bit instructions in a function. Register allocation is the focus of my current commercial work, with an estimated 10 percent performance gain to be had in some real-code cases.

Scripting

Ironically, scripted code can often run as fast as C++ code in cold cases. This is why Java JITs often do not compile functions until they have been proved to be hot. A good bytecode can be more compact than the machine code of the processor, and scripting engines can pull many tricks to batch up opcodes. This suits the general trend of migrating games towards dynamic languages. Dynamic languages are also easier to execute in parallel, and their data layout can be changed to improve performance. A large part of compiler optimization effort is going into dynamic languages such as Lua and Python.
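One way to read the data-driven advice above (the enemy-tuning example is my own, not the author's): instead of a spread of near-identical per-type functions, all of it cold code, keep the variation in a table and run everything through one small shared routine.

```cpp
#include <cstddef>

// Per-type differences live in data, not in separate functions...
struct EnemyTuning { float speed; int damage; };

const EnemyTuning kTuning[] = {
    {1.0f, 5},   // grunt
    {2.5f, 2},   // scout
    {0.5f, 20},  // brute
};

// ...so one small piece of code serves every enemy type, shrinking
// the cold-code footprint of the game loop.
float advance(std::size_t type, float position) {
    return position + kTuning[type].speed;
}
```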
A Taxonomy Of Optimizations

We can represent compiler optimizations on a sliding scale:

[generalizing------------------------------specializing]
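As a preview of the specializing end of this scale, here is a minimal sketch (the function names are illustrative, not from the article): a parameter that is the same at every call site can be baked in, removing both the argument and the branch on it.

```cpp
// General, parametrized form: imagine every call site passes
// additive == true.
int blend(int a, int b, bool additive) {
    return additive ? a + b : (a + b) / 2;
}

// Specialized form, as a compiler might produce after inlining with
// the constant folded in: the parameter and the branch disappear,
// so the code is both smaller and faster.
int blend_additive(int a, int b) {
    return a + b;
}
```

Generalization runs the other way: spotting repeated instruction sequences and factoring them out into shared functions to shrink total code size.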

Specialization is the most familiar of these transforms. Once code is inlined, many opportunities exist to specialize it based on constants and invariant behavior. Almost all optimizations found in textbooks are of this form. Generalization is less well understood; its primary purpose is to reduce code size, perhaps by factoring repeated sections of code into function calls, or by replacing stores to consecutive locations with loops. The vast bulk of game code consists of parametrized function calls. Often these parameters are the same for many or all calls, which gives us a great specialization opportunity that will also reduce code size.

Profile-Guided Optimization

Profile-guided optimization gives us the opportunity to discover which regions of code are hot and which are cold. This usually means running several phases: instrument, execute, then compile with the results. Good compilers provide PGO information that has good longevity and will improve code performance through many edits to the code. However, many of the implementations available are extremely crude, and it is usually best to manually mark a few of your functions as hot. For example, if the compiler unrolls all the loops, the hot code will run faster, but the cold code will run much slower, negating any benefit. It is a common error to use the -O3 settings on compilers, which are targeted at benchmarks, rather than -O2 or -Os, which give a compromise for real code.

LLVM

We hope that LLVM will provide a good framework for much of our future research work. It will provide the global view of programs we need to perfect whole-program generalization optimizations through link-time code generation. Being open source, it makes results easy to share, unlike the proprietary compilers used by many console manufacturers.
Most of the work to date has focused on benchmark results rather than real code performance, so we look forward to applying the principles we have learned from console compiler development to the open source world. LLVM is not the last word, however: its many-IRs approach to code generation already looks dated, and its complexity makes it unsuitable for dynamic languages.

Next Time

Next time I'll show some more specific examples of cold code performance and how to measure it. I am planning to do a talk on LLVM in games at the Euro Game Connection conference in Paris in December with my "Goldsmiths" hat on, and may see some of you there. http://www.game-connection.com/gameconn/content/gameconnection-europe

[This piece was reprinted from #AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]

About the Author(s)

Andy Thomason

Blogger

Andy Thomason lives in Bristol, U.K. Starting by writing chess and arcade games in the early '80s, he worked for the Psygnosis Technology group, Rage Games, and Confounding Factor before joining SN Systems, the console tools developer.

