“With enough inside information and a million dollars you can go broke in a year” – Warren Buffett on stock tips
Myths, tips, inside information – however you state it, dangerous blanket statements and generalizations increase the learning curve of those practicing Video Game Optimization (VGO). Optimization myths thrive on complex hardware, legacy implementations, faith, and e-mail lists. The following is a description of some of the most popular myths that I’ve heard over the past 7 years of optimizing games.
Myth 1: Premature Optimization Is The Root Of All Evil
Before accusing me of disagreeing with Stanford’s legend Donald Knuth and a knighted programmer (Sir C.A.R. Hoare, inventor of quicksort), realize that I am planning to agree with what has become the mantra for the reactive optimizer. Let’s examine the quote in its entirety.
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” - Donald Knuth
A statement from a professional that exposes “the root of all evil” is momentous, but when you examine the entire statement, it’s easier to see the forest, not the tree. I’m not ordained to interpret programming scripture, but from this famous statement I infer the following: prematurely optimizing small efficiencies is usually the root of all evil; prematurely optimizing large efficiencies is a necessity.
There are three levels of VGO: the system level, the application level, and the micro level. The system level of optimization is where we as programmers examine our architecture and compare it to our system specs. System level questions include “How many cores do we support?” or “Do we require a minimum level of shader support?”. The application level of optimization is usually implemented at the class level. Examples include quad trees, occlusion culling, and instancing. Micro level optimizations, the most tangible and arguable level, are easily recognized since they exist within the domain of a single line, or a few lines, of code.
There are more flavors of PC configurations than there are of Linux operating systems. System level and application level optimizations are more likely to “raise the tide” of frame rates across combinations of AMD, Intel, and Nvidia CPUs and GPUs. The results of micro optimizations tend to vary across configurations more than those of system or application level optimizations.
Optimization is a part of design! System and application level optimizations are best implemented during design. If we miss these opportunities because we feel we are acting prematurely, then only an abundance of flexibility, a level that few engines provide, will afford us the opportunity to integrate the optimization before shipping.
Premature optimization at the system and application levels is not the root of all evil. Premature optimization at the micro level is.
Myth 2: Less Code Is Faster Than More
There are two polar opinions that dominate game programming personalities. On the left is a personality I call the LEAN_AND_MEAN programmer. On the right is the class-heavy abstractionist.
Which one is correct in regards to performance? I’m afraid there is no clear winner, and in fact, they may both be correct. The amount of code you write before you compile does not always answer the more important questions about the runtime performance.
The argument against the abstractionist is that the overhead of their design is burdensome; however, a well designed class hierarchy does not need to travel through many lines of code during execution. A poorly designed class hierarchy will be the victim of its verbose design.
The argument against the LEAN_AND_MEAN programmer is the lack of the flexibility needed to reduce superfluous lines of code and to refactor rapidly.
The bottom line - sometimes writing more code can reduce superfluous CPU and memory system work and maximize parallelism. In this, both are taking the correct approach as long as the class-heavy abstractionist uses a good design and the LEAN_AND_MEAN programmer manages superfluous execution and flexibility.
Unintuitive hardware can also propagate this myth. A good example of how more lines of code can be faster than fewer is evident when using write-combined buffers.
Dynamic vertex buffers, when locked, sometimes use write-combined buffers, a memory type that does not travel through the cache system. This is done to reduce the management of memory coherency between the GPU and the CPU. When we use a write-combined buffer, it is important to update all 64 bytes of a write-combined buffer line. If the entire 64 byte line is not updated, the write-combined buffer writes to system memory in 8 byte increments. When all 64 bytes are updated, the entire line writes in a single burst.
What does this mean to a game programmer? When considering the memory performance of a write-combined buffer, we should update every byte of a vertex, even if position is the only value that changed. In this example, writing more code, which appears slower on the C++ level, unlocks a latent hardware optimization.
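As a minimal sketch of that advice, assume a hypothetical 32-byte vertex layout, so that two vertices fill one 64-byte line. The `Vertex` struct and the `writeWholeVertices` name are invented for illustration, and `dst` stands in for whatever pointer your API returns when you lock a dynamic vertex buffer. The point is only that every byte of every vertex is written, in ascending address order, and that the destination is never read back.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical vertex layout for this sketch: 8 floats = 32 bytes,
// so two consecutive vertices cover one 64-byte line.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};
static_assert(sizeof(Vertex) == 32, "layout assumed by this sketch");

// Copies every field of every vertex into dst, front to back.
// dst is a stand-in for the pointer obtained by locking a dynamic
// vertex buffer; this function only writes to it and never reads it.
// Returns the number of bytes written.
std::size_t writeWholeVertices(Vertex* dst, const Vertex* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        // Whole-struct assignment touches all 32 bytes, not just the
        // position, so each 64-byte line ends up completely written.
        dst[i] = src[i];
    }
    return count * sizeof(Vertex);
}
```

The tempting alternative, writing only `dst[i].position` for the vertices that moved, is fewer bytes on paper but leaves partially filled lines, which is the case the paragraph above warns about. Whether the full write wins on your hardware is something to measure rather than assume.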
The “lines of code” myth survives on the belief that more code, and a larger design, means less performance. I’d bet a programmer with this belief wouldn’t give up their quad tree as a strategy to reduce the number of lines of code.
Myth 3: Game Developers Don’t “Do” Multi-core Well
Let’s face it. There are many areas in which we as programmers lag behind the mainstream. One area is multi-core CPU programming. We are, however, ahead of the pack as multi-core programmers.
Any PC with at least one CPU core and one GPU core is a multi-core machine. And the rules for optimizing multi-core machines - which also include next-gen consoles such as the PlayStation 3 and Xbox 360 - are very different from those for optimizing a single core. Not realizing that a machine is multi-core makes the process of optimization difficult and inefficient. This leads us to myth number 4…
Myth 4: Every Optimization Yields Some Performance Gain
Because every hardware accelerated game is using a minimum of two cores (see Myth 3), there is a possibility that a successful optimization could yield no frame rate increase. Consider the following example:
A dealer splits a deck of cards in half and hands them to Jack and Jill. The dealer then asks the participants to sort the deck by red and black. Assume for our purposes that Jill is much faster than Jack, and finishes her half of the deck in 45 seconds. Jack, who is slower, finishes in 60 seconds. The entire process, since Jack and Jill operate in parallel, is equal to the slowest participant, in this example, Jack. Therefore, the entire process takes 60 seconds.
Now assume we optimize Jill’s performance so that she is able to sort the deck 15 seconds faster. If we run the experiment again, we can clearly see that our bottleneck, Jack, is still causing our experiment to take 60 seconds. We have optimized Jill by 15 seconds but noticed no increase in the overall performance.
Any time we fail to optimize the slowest core or parallel GPU kernel we have the potential for a zero percent frame rate increase. This sort of optimization, especially if it requires two or more weeks of work, does not impress management.
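The Jack and Jill arithmetic can be written down as a toy model. This is only a sketch: it assumes the two workers are perfectly parallel and never wait on each other, and the function name `parallelSeconds` is made up for the example.

```cpp
#include <algorithm>
#include <cassert>

// Toy model of two workers running fully in parallel: the elapsed time
// is set by whichever worker finishes last.
int parallelSeconds(int jackSeconds, int jillSeconds) {
    assert(jackSeconds >= 0 && jillSeconds >= 0);
    return std::max(jackSeconds, jillSeconds);
}
```

With the numbers from the example, `parallelSeconds(60, 45)` and `parallelSeconds(60, 30)` both give 60: speeding up Jill changes nothing. Only reducing Jack's time, for instance `parallelSeconds(50, 45)`, moves the result. Substitute "CPU frame time" and "GPU frame time" for Jack and Jill and the same reasoning applies to frame rate.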
It is, however, possible to increase performance by optimizing the “incorrect” core. This occurs when we indirectly optimize the slowest core. For example, if our game is limited by fragment processing on the GPU, then optimizing AI will do little, and probably nothing, to increase our overall frame rate performance. If we were instead to optimize CPU work such as culling, with a faster and better culling system, we would indirectly be optimizing pixel processing. In this case, we targeted a CPU optimization that led to a GPU optimization in our limiting kernel (fragment processing).
Myth 5: Reducing Instruction Processing Is Our Primary Goal In CPU Optimization
When comparing the growth rate of instructions retired in the past five years, the GPU is the winner. The CPU, by means of increased instruction level parallelism and multi-core, is in second place. The slowest growth (of resources commonly utilized in game runtime) is the memory system.
The reason is simple: when used correctly, memory is very fast. The problem is that games, which are getting close to the 32-bit OS limit of 4 GB, frequently abuse our fragile memory architecture.
Many traditional optimizations, made famous before the requirement of a tiered cache system, can be harmful to modern architectures. For example, a look-up table trades memory for instruction processing. If this increase in memory causes a cache miss that requires a fetch from system memory, you have done little to increase your performance. A cache miss that causes a fetch to system memory is many times slower than the slowest instruction. In attempting to save instructions, you have created latency and a data dependency.
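To make the trade concrete, here is a small sketch that answers the same question two ways: once with a lookup table and once with arithmetic only. Counting the set bits in a byte is my choice of example, not something taken from a particular engine, and all names are hypothetical. The comment about cache lines assumes 64-byte lines.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Option A: trade memory for instructions. The table is 256 bytes,
// which is four cache lines if lines are 64 bytes.
std::array<std::uint8_t, 256> buildBitCountTable() {
    std::array<std::uint8_t, 256> table{};
    for (int i = 1; i < 256; ++i) {
        // Bits in i = lowest bit of i + bits in i with that bit shifted out.
        table[i] = static_cast<std::uint8_t>((i & 1) + table[i / 2]);
    }
    return table;
}

int bitCountLookup(const std::array<std::uint8_t, 256>& table, std::uint8_t value) {
    return table[value];  // one load: cheap when cached, costly on a miss
}

// Option B: a few ALU instructions and no table load.
int bitCountCompute(std::uint8_t value) {
    unsigned v = value;
    v = (v & 0x55u) + ((v >> 1) & 0x55u);  // sums of adjacent bit pairs
    v = (v & 0x33u) + ((v >> 2) & 0x33u);  // sums of adjacent 2-bit fields
    v = (v & 0x0Fu) + ((v >> 4) & 0x0Fu);  // sum of the two nibbles
    return static_cast<int>(v);
}
```

Both functions return the same answer for every byte. Which one is faster depends on whether the table is still in cache when you need it, and that depends on everything else your frame touches, so a microbenchmark that keeps the table hot can be misleading. A table this small is the friendly case; the larger the table, the more likely the miss.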
When optimizing the CPU, we have a tendency to seek out the slowest instruction loops in our engine. The usual suspects are AI, culling, and physics. If you are not optimizing your engine for cache efficiency you are doing yourself a disservice. If you are reducing instructions and increasing cache misses, you are committing a sin.
Myth 6: Optimization And Assembly
For many programmers, optimization and assembly are synonymous. This is much less true today than it was in the past. If this inflames you, don’t think that I believe there is NO place for assembly in optimization; it’s just a smaller piece.
First, the use of APIs is much more prevalent today than in the past. In many programs, the application code written by the developer consumes a small percentage of the total runtime. It is very common for the drivers, graphics bus, or the graphics pipeline to be the 20% of the Pareto principle.
When the code you use, not the code you write, is the hotspot or bottleneck, coding with optimized assembly (excluding shader assembly) will not be your most efficient use of time. We have sacrificed a lot of control for code reuse through APIs.
Second, optimizing for assembly is a classic example of a micro optimization. A micro optimization is more likely to have different results across PC configurations than system or application optimizations. Everyone is familiar with the office phrase, “It doesn’t crash on my machine”. Micro optimizations sometimes stimulate the phrase, “It runs fast on my machine”.
Finally, when you write assembly, you get exactly what you write. To some, this is great. To those who are not experts in assembly, it’s an opportunity to shoot yourself in the foot. The term “optimizing compiler” is antiquated; now, even standard compilers contain optimizations. Writing assembly bypasses the optimizations ingrained in compilers. Even textbook examples of non-data-dependent loop unrolling can yield slower performance than the pristine loop. The risk vs. reward of pigeon-holing your compiler is not as justified as it was in the past.
In closing this myth, I will again repeat that there are circumstances in which optimizing with assembly is still the correct choice. For those who prefer a higher level language, it’s good to know that compilers are doing more of our work.
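The loop unrolling case is easy to try for yourself. The sketch below shows a pristine loop and a hand-unrolled version of the same non-data-dependent work, written in C++ rather than assembly to keep it short. The function names are hypothetical. I am not claiming which one is faster on your compiler and target; the point is that the two are equivalent in result, so the only way to choose is to measure.

```cpp
#include <cassert>
#include <cstddef>

// Pristine loop: each iteration is independent, and the compiler is free
// to unroll or vectorize it as it sees fit.
void scalePlain(float* data, std::size_t count, float factor) {
    assert(data != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        data[i] *= factor;
    }
}

// Hand-unrolled by four, with a cleanup loop for the leftover elements.
// This fixes one unrolling strategy in the source instead of leaving the
// decision to the compiler.
void scaleUnrolled(float* data, std::size_t count, float factor) {
    assert(data != nullptr || count == 0);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        data[i] *= factor;
        data[i + 1] *= factor;
        data[i + 2] *= factor;
        data[i + 3] *= factor;
    }
    for (; i < count; ++i) {
        data[i] *= factor;
    }
}
```

If you time both with optimizations enabled and compare the generated code, you may find the compiler has already done to the plain loop what you did by hand, or something better suited to the target.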
Myth 7: A Ratio Of 1 To 1, Thread To Logical Core, Is Optimal
OK, so maybe I’m getting a bit nit-picky. This statement is common, but more detail must be applied to stop confusion. If using Microsoft Windows XP, you will find that on start-up, your machine will be running anywhere between 400 and 800 threads. Does this mean we are never going to achieve the ratio until we have an 800 core machine? Of course not.
A more accurate phrase is, “a ratio of 1 to 1, intensive thread to logical core, is optimal”. Two threads, each running at 50%, can share one core efficiently. The difficulty in this example is to ensure that the executing 50% does not occur at the same time for both threads.
Myth 8: Multi-threading With Efficient Synchronization Will Always Increase Performance
This myth is related to myth 5 and is likely to go away as the memory architectures of multi-core systems evolve. The root of this myth is that multi-threading does not increase memory performance; it complicates it.
If a given algorithm is bound by memory performance, then dividing the task across threads will not increase the performance. And by opening the door to false sharing and increased cache eviction, the potential for a decrease in performance exists.
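False sharing is easier to picture with the data layout in front of you. The sketch below assumes a 64-byte cache line (common on x86, but an assumption) and uses hypothetical names throughout. It shows only the layout and the per-thread work; it deliberately does not launch threads, so treat it as an illustration of the padding technique rather than a benchmark.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache line size for this sketch; check your target hardware.
constexpr std::size_t kCacheLineBytes = 64;

// Packed: several of these sit side by side in one cache line, so threads
// that each update "their own" counter still fight over the same line.
struct PackedCounter {
    long value = 0;
};

// Padded: alignas forces each counter onto its own cache line, at the
// cost of extra memory per counter.
struct alignas(kCacheLineBytes) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLineBytes,
              "each padded counter should occupy exactly one line");

// The work one worker thread would do on the slot assigned to it.
void countEvents(PaddedCounter& slot, long events) {
    assert(events >= 0);
    for (long i = 0; i < events; ++i) {
        ++slot.value;
    }
}
```

In use, you would allocate one `PaddedCounter` per worker, hand each thread a reference to its own slot, and sum the slots after the threads join. Note that padding addresses false sharing only; it does nothing for an algorithm that is limited by memory bandwidth in the first place.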
Myth 9: You Can Determine The Performance Of A Graphics Call With The Following Methodology:
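int start = getTime();
DrawIndexedPrimitive(/* ... */);  // the graphics call being "measured"
int end = getTime();
int time = end - start;  // time to issue the call, not to render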
The code above is understandable and intuitive. Unfortunately, VGO is not. The missing piece is the work that exists under the “hood” of the API and the hardware that it drives.
The CPU is a low latency, low throughput resource. In other words, if you ask it one question, you get a quick response. The GPU is a high latency, high throughput processor. In other words, if you ask it a question, you get a slow response. If you ask it many questions, the response is just as slow as if you asked it one.
How do we manage this difference in processing? With a structure called the command buffer. The code above is measuring the time it takes to place the command into the command buffer and determine if your render state changed from the last time you called DrawIndexedPrimitive. This does NOT tell you how long it takes to render the triangles. That work occurs at a different time and is under the control of your drivers.
To measure that data, you would either need to force the DrawIndexedPrimitive call to no longer act in parallel (thus reducing the need for a buffer) or use drivers that can provide that information for you.
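The gap between issuing a command and executing it can be shown with a toy model. The `CommandBuffer` class below is entirely hypothetical: it is not how any real driver is implemented and none of its names come from a graphics API. It only mimics the one property that matters here, which is that submitting work returns immediately and the work itself runs later.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a driver command buffer (hypothetical, not a real API).
class CommandBuffer {
public:
    // What a draw call looks like from the caller's side:
    // record the command and return right away.
    void submit(std::function<void()> command) {
        pending_.push_back(std::move(command));
    }

    // What happens later, on the consumer's schedule: the recorded
    // commands actually run. Returns how many were executed.
    std::size_t flush() {
        std::size_t executed = 0;
        for (auto& command : pending_) {
            command();
            ++executed;
        }
        pending_.clear();
        return executed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};
```

If you wrap a timer around `submit`, you measure a `push_back`. The cost of the command itself only shows up around `flush`, which in a real system is not something your code calls directly. That is the same reason the start/end timing around a draw call reports the cost of queuing, not rendering.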
Myth 10: OpenGL Is Faster Than DirectX (Or Vice Versa)
Not so fast. Religion, politics, and API supremacy are not things I am prepared to discuss in a public forum. The decision is deeply personal, and rooted in the financial gain that your favorite API provides for your future. All sarcasm aside, let’s look at some facts.
You can make a slow game in OpenGL and you can make a slow game in DirectX. An API is an interface to your driver, and the driver is an interface to hardware. Any API that gives access to the hardware, without hindering it, will provide plenty of opportunity for performance.
I will go out on a limb and suggest that, at the time of this article, OpenGL is lacking the toolset supported by DirectX for analyzing GPU performance. There are several efforts in development to alleviate this.
The process of VGO is as mystifying and complex as the hardware that drives our games. In the near future, this process will be demystified by better tools and made more complex by modern hardware, such as multi-core (think double digit number of cores) and unified shader architectures. Before implementing any optimization “tip” or “trick”, be certain to understand the underlying hardware that provides the optimization. If you don’t, you may be perpetuating another optimization myth.