Fast, Simple, Higher Level Optimization
We recently had the rather daunting task of porting a next generation SDK to current generation and handheld platforms. This presented a variety of difficult challenges, one of the most rewarding being performance optimization. Over several weeks, we used the often overlooked methods described in this article to gain an approximately 4x increase in performance and a healthy reduction in library size. A little assembly language was necessary, but on the whole the bulk of the optimization was high level.
First Step – Profile Profile Profile
Before undertaking any optimization exercise it is necessary to establish a solid baseline against which to benchmark optimization changes. To do this we set up a simple, repeatable scenario in our game. This scenario represented a rather typical but nonetheless demanding situation for the SDK. In our case, we captured approximately 30 seconds of profiling data on each run. Results were then saved and compared with subsequent profiling runs.
We spent a lot of time in the profiler doing essentially two different types of run. The first type was sample-based profiling, where the profiler interrupts the application at a high frequency. With each interrupt, the program counter is queried and the tuning software makes a note of which function is being executed. This method of profiling is crude in terms of accuracy (due to the overhead of interrupting and logging results) but it is useful for quickly spotting where most cycles are being spent. Once we'd established potential offenders we would then go through the source code and perform a second, more focused, function-specific profile. We were then able to quickly establish exactly where the time was being spent. The next step was then to figure out why.
Second Step – Look at the Code
Sometimes we get lucky and can spot a bottleneck in a higher level function immediately. Other times it is not so obvious, and this was especially true in our case as the code had already been optimized for next generation platforms. With this in mind, we resorted to cross referencing the source code with the actual assembly language that had been generated. Most debuggers have a 'source code annotation' option that makes this easier. By doing this we were quickly able to spot areas where we thought the code was unnecessarily bloated. It is important to say that one doesn't need a great knowledge of assembly programming to do this. Sure, it helps, but as developers we have a 'feel' for how much code something should take.
On the following pages, I’ll spend some time covering some of the more beneficial optimizations that we made.
Improving Instruction Cache Usage
Firstly, if you are unfamiliar with how caches operate I’d like to point you to the resources at the end of this article.
The instruction cache is your friend, please don’t kill it! It is often overlooked as an area for performance optimization. However on consoles with smaller caches, time spent optimizing this area can be a big win.
In our case the biggest win here came from removing excessive use of inline. The library we were optimizing had been created with large cache systems in mind. It is possible that the programmers had subscribed to the common assumption that the easiest way to make code run fast is to inline it. This can be true in areas where the function call overhead is large compared to the cost of running the function's code. However, in our case we were seeing extremely large functions inlined. Our target platform had a small cache with a cache line size of 32 bytes. Each instruction opcode takes 4 bytes, so after every 8 instructions we could expect a cache miss when executing uncached code. In extreme cases, excessive inlining meant that code that repeatedly called the same inlined sub function was never hitting the cache. If the sub function were not inlined then, aside from an initial miss, the sub function's code would be in the cache. Combine this with the fact that filling cache lines isn't fast and you can see the potential for a huge win!
Aside from being selective about what we inlined, we also made a pair of other optimizations to improve performance in this area. As you may know, the cache can only hold so many lines at once, so grouping areas/modules of the game so that they fit in the cache at the same time is beneficial. The code that we were dealing with was already modular in that clearly defined sections of the SDK were executed sequentially. To facilitate even better instruction cache performance in these areas we used two other approaches. Firstly, within each module we identified functions that were executed close to each other. These functions were then grouped together within the source code. The purpose of this was to increase our chance of code already being in the cache and also to reduce the chance of line contention issues. Secondly, our compilers had a function stripping option as part of the linker process, but in our case it proved to be ineffective at removing dead code. With this in mind we took on the tedious task of manually removing code we knew wouldn't be executed and also removing functionality that we didn't require. This was time consuming but extremely rewarding in performance improvements once our code's footprint was reduced.
In conclusion I’d say the time spent tweaking the instruction cache was well spent. The performance impact was substantial, immediate and all for very little effort.
Improving Data Cache Usage
Our second area of focus was improving our data cache usage. Again, in our case the cache line size was 32 bytes. It thus made sense for us to group data that was accessed at similar times into the same cache line. As you can imagine, this isn't a fast process as it involves wading through source code. However, we found the performance gain to be worthwhile, so it was time well spent.
Another optimization that we made in this area was to reduce the data footprint. We switched from using bools (often 32 bits with many compilers) to using bit flags. We looked at integer values and, where possible, stored them as chars. Again, this is a slow process but very worthwhile.
Finally, there were a couple of key data arrays in the code that were checked as follows:
for ( int i = 0; i < num; ++i )
    if ( data[i].m_active )
        // ...
Even though most elements were inactive, this innocent looking code was generating a cache miss and a substantial cycle penalty on each and every loop iteration. The reason for this was that each element was larger than a cache line, so a cache line refill was necessary on every iteration; there was no way the data structure would ever fit in one line. The obvious fix here was to remove the m_active flag from the data structure and store it as a bit flag in a separate structure. With the separate structure aligned on a cache boundary we could then store flags for 256 (32 bytes per line * 8 bits per byte) array elements within the same cache line.
It is worth mentioning that many profilers monitor instruction / data cache misses. Needless to say, this can save you a lot of time in identifying hotspots.
Optimizing Memory Management
One of the biggest hits we found was calls from the SDK to the system malloc and free operations. As you know, new and delete also go through the same memory allocation process. The sheer volume of calls was causing a major performance drain, so this was an area we concentrated on. The first optimization that we made was to use a custom memory manager engineered for the typical size of our allocations and the target platform (allocations respected alignment requirements etc). We added logging to the memory manager so that we could establish where the bulk of the activity was taking place. Once we knew this, we applied the best way of optimizing a memory manager: not calling it in the first place. In the case of smaller, fixed size, per class allocations we simply reserved the memory within the class structure itself. After all, the bottleneck was not the size of the allocations but the frequency of the calls.
Another optimization that we made was for variable size memory that was allocated at the start of a function and freed at the end. For this we used the alloca function which allocates memory on the stack with a very low CPU overhead. There is no cleanup function when using alloca as the memory/stack is cleaned up automatically when the function exits. Another benefit of using alloca is that unlike malloc, there is no chance of memory fragmentation.
Optimizing the new and delete operators was a simple case of creating our own versions of the operators that allocated the memory through our own memory manager. Each element was then new'ed in place at the memory address we obtained.
Use the Instruction Set
Out of the box, the SDK that we were optimizing used no platform specific instructions. For example, calls to math functions such as cosf, sinf and atanf were all using the default implementation. As you know, these default functions can be extremely slow, so we replaced them with platform specific variants. We mostly used assembler versions of approximation functions, which are easily obtainable. However, one platform actually has sin and cos as part of the coprocessor instruction set, which was a pleasant surprise.
Other platform specific optimizations included using SIMD instructions for any Matrix/Vector operations. This optimization in particular requires very little assembly language and in some cases the compiler supports intrinsics which make this task even easier. The only difficulty that we faced when doing this was that the SIMD instructions required 16 byte alignment and modifying a huge code base after the fact was tough. Again, this optimization is worthwhile as the difference in performance from using SIMD can be staggering.
Finally, if the target platform supports 8 or 16 byte load/store operations then it is your duty to use them. It doesn’t take long to code blazing fast replacements for the default memcpy and memset functions.
To Err is Human
When looking at compiled code through a disassembler, simple mistakes in C++ code become easy to spot, such as code bloat caused by programmers returning objects by value or using temporary objects. For example, in the case of the matrix/vector class that we optimized, the original programmer had supplied two versions of common matrix/vector operations. One was a standard function within the class, such as:
void Multiply( const Matrix3x3 &matA, const Matrix3x3 &matB )
and the other was an overloaded operator:
Matrix3x3 operator*( const Matrix3x3 &mat )
Most of the code had been written to use the overloaded operator, e.g. result = matrixa * matrixb. This certainly looked a lot neater than result.Multiply( matrixa, matrixb ), but neatness came at a price. As you've probably guessed, the overloaded operator, albeit neater, involves a temporary object which is returned on the stack and then copied into the destination object. Most C++ programmers are aware of this but many seem to forget it. Needless to say, our fix was to remove the overloaded operator and force programmers to use the explicit, faster version.
Another very simple mistake that is easy to spot from the debugger is use of double precision math. This could be as simple as a programmer forgetting to append an f on a constant floating point value or using sqrt instead of sqrtf etc. Catching calls to these functions is as easy as setting a breakpoint.
Finally, it is worth touching upon effective use of the instruction set again. If you need to return objects by value and your instruction set supports 8/16 byte operands, please overload the class assignment operator and use the operands to accelerate the copy operation.
Getting the Best Out of the Compiler
Some of the simplest optimizations that you can make come from flipping a switch in the compiler options. I'll cover a few common options that, depending on your compiler, may be enabled by default. These options can increase the size of your code as well as compromise its performance:
RTTI – Runtime Type Information
This option allows you to query information about classes/objects at runtime and as such comes with an additional memory overhead. The dynamic_cast and typeid operators rely on this information but if you don’t use it within your program or library then it is definitely worth disabling RTTI. With this option disabled you should see a healthy reduction in code size. If you are unsure whether you do use it then disable the option and rebuild your code. The compiler will generate an error if it finds code that requires RTTI.
Exception Handling
This is another option that is often enabled by default but comes with both a performance impact and an increase in code size. If you don't use exception handling then it should be safe to disable this option. Avoiding exception handling means not using try, catch or throw within your code.
Check New
This often supported option, when enabled, comes with a slight memory and performance overhead. It generates extra code to check that the pointer returned by operator new is non-null. If you are in complete control of your memory management and don't expect to encounter a null return then it's probably worthwhile disabling the option.
I would recommend spending quite a bit of time looking at the impact of different optimization settings. The temptation is to set optimization to the highest level, but in my experience this doesn't necessarily lead to the fastest code. What I found is that the highest optimization setting often gives the compiler complete freedom over what to inline. On this setting there was a significant increase in code size for no noticeable increase in performance. I personally prefer the highest optimization setting that doesn't include automatic inlining; I'd rather be in complete control of which functions are inlined.
If you are aiming to reduce the size of your program then another interesting observation that I made was with regard to compilers with an 'optimize for size' option. On the platforms where I tried this option I found that, as expected, the generated code ran slower than code generated on a higher optimization setting. However, the difference in program size was negligible (0.3% in our case).
Finally, to summarise, here are some results of playing with the various compiler options on the size of our binaries:
No optimization, RTTI on, Exceptions on, Check new on - 11,674,296 bytes
No optimization, RTTI on, Exceptions on - 11,587,128 bytes
No optimization, RTTI on - 7,579,672 bytes
No optimization - 7,139,128 bytes
Full optimization with automatic inlining - 4,623,800 bytes
Full optimization without automatic inlining - 4,574,392 bytes
Optimize for size - 4,559,928 bytes
Whilst we did indeed perform lower level optimization of the code base, the bulk of the performance increases came from the simple higher level techniques I covered here. They should be applicable to code written for most platforms. Optimizing code is a very rewarding experience and I hope that you’ll enjoy it as much as I have.