Game industry veteran Preisz takes a close look at "the most popular myths that I’ve heard over the past 7 years of optimizing games", covering vital programming topics such as multi-threading, assembly, and the eternal 'premature optimization' question.
“With enough inside information and a million dollars you can go broke in a year” – Warren Buffett on stock tips
Myths, tips, inside information –
however you state it, dangerous blanket statements and
generalizations increase the learning curve of those practicing Video
Game Optimization (VGO). Optimization myths thrive on complex
hardware, legacy implementations, faith, and e-mail lists. The
following is a description of some of the most popular myths that
I’ve heard over the past 7 years of optimizing games.
Myth 1: Premature Optimization Is The Root Of All Evil
Before accusing me of disagreeing with Stanford’s legend Donald Knuth and a knighted programmer (Sir C.A.R. Hoare, the inventor of quicksort), realize that I am planning to agree with what has become the mantra for the reactive optimizer. Let’s examine the quote in its entirety.
“Forget about small efficiencies, say
about 97% of the time: premature optimization is the root of all
evil.” -Donald Knuth.
A statement from a professional that exposes “the root of all evil” is momentous, but when you examine the entire statement, it’s easier to see the forest, not just the trees. I’m not ordained to interpret programming scripture, but from this famous statement I infer the following: prematurely optimizing small efficiencies is usually the root of all evil; prematurely optimizing large efficiencies is a necessity.
There are three levels of VGO: the system level, the application level, and the micro level. The system level of optimization is where we as programmers examine our architecture and compare it to our system specs. System level questions include: “How many cores do we support?” or “Do we require a minimum level of shader support?” The application level of optimization is usually implemented at the class level; examples include quad trees, occlusion culling, and instancing. Micro level optimizations, the most tangible and arguable level, are easily recognized because they live within a few lines of code, or even a single line.
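As a hypothetical illustration of the micro level, consider a change confined to a handful of lines. The function names below are invented for this sketch:

// Original: one divide per element.
void ScaleInverse(float* v, int count, float s)
{
    for (int i = 0; i < count; ++i)
        v[i] /= s;
}

// Micro-optimized: hoist the divide and multiply by the reciprocal, since
// multiplies are cheaper than divides on most CPUs (at the cost of a tiny
// change in rounding).
void ScaleInverseFast(float* v, int count, float s)
{
    const float inv = 1.0f / s;
    for (int i = 0; i < count; ++i)
        v[i] *= inv;
}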
There are more flavors of PC configurations than there are Linux distributions. System level and application level optimizations are more likely to raise the tide of frame rates across combinations of AMD, Intel, and Nvidia CPUs and GPUs. Micro optimizations tend to vary across different configurations more than system or application level optimizations do.
Optimization is a part of design! System and application level optimizations are best implemented during design. If we miss these opportunities because we feel we are acting prematurely, then only an abundance of flexibility, of a degree that few engines provide, will afford us the opportunity to integrate the optimization before shipping.
Premature optimization at the system and application levels is not the root of all evil. Premature optimization at the micro level is.
Myth 2: Less Code Is Faster Than More
There are two polar opinions that dominate game programming personalities. On the left is a personality I call the LEAN_AND_MEAN programmer. On the right is the class-heavy abstractionist.
Which one is correct with regard to performance? I’m afraid there is no clear winner; in fact, they may both be correct. The amount of code you write before you compile does not always answer the more important questions about runtime performance.
The argument against the abstractionist
is that the overhead of their design is burdensome; however, a well
designed class hierarchy does not need to travel through many lines
of code during execution. A poorly designed class hierarchy will be
the victim of its verbose design.
The argument against the LEAN_AND_MEAN programmer is the lack of flexibility needed to reduce superfluous lines of code and to refactor rapidly.
The bottom line: sometimes writing more code can reduce superfluous CPU and memory system work and maximize parallelism. Both are taking the correct approach as long as the class-heavy abstractionist uses a good design and the LEAN_AND_MEAN programmer manages superfluous execution and flexibility.
Unintuitive hardware can also propagate
this myth. A good example of how more lines of code can be faster
than less is evident when using write-combined buffers.
Dynamic vertex buffers, when locked,
sometimes use write-combined buffers, a memory type that does not
travel through the cache system. This is done to reduce the
management of memory coherency between the GPU and the CPU. When we
use a write-combined buffer, it is important to update all 64 bytes
of a write-combined buffer line. If the entire 64 byte line is not
updated, the write-combined buffer writes to system memory in 8 byte
increments. When all 64 bytes are updated, the entire line writes in
a single burst.
What does this mean to a game
programmer? When considering the memory performance of a
write-combined buffer, we should update every byte of a vertex, even
if position is the only value that changed. In this example, writing
more code, which appears slower on the C++ level, unlocks a latent
hardware optimization.
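As a minimal sketch, assuming a hypothetical 64-byte vertex layout and a destination pointer returned by locking a dynamic vertex buffer, the update might look like this:

// Hypothetical 64-byte vertex: position, normal, tangent, two UV sets, color, padding.
struct Vertex
{
    float px, py, pz;
    float nx, ny, nz;
    float tx, ty, tz;
    float u0, v0, u1, v1;
    unsigned int color;
    float pad[2];
};

void UpdateLockedVertices(Vertex* dst, const Vertex* src, int count)
{
    for (int i = 0; i < count; ++i)
    {
        // Copy the entire vertex, even if only position changed, so each
        // 64-byte write-combined line is fully populated and flushed in one
        // burst instead of several 8-byte writes.
        dst[i] = src[i];
    }
}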
The “lines of code” myth survives on the belief, held by some, that more code and a larger design mean less performance. I’d bet a programmer with this belief wouldn’t give up their quad tree just to reduce the number of lines of code.
Myth 3: Game Developers Don’t “Do” Multi-core Well
Let’s face it: there are many areas in which we as programmers lag behind the mainstream. One area is multi-core CPU programming. We are, however, ahead of the pack as multi-core programmers.
Any PC with at least one CPU core and one GPU core is a multi-core machine. The rules for optimizing multi-core machines, which also include next-gen consoles such as the PlayStation 3 and Xbox 360, are very different from those for optimizing single cores. Not realizing that a machine is multi-core makes the process of optimization difficult and inefficient. This leads us to myth number 4…
Myth 4: Every Optimization Yields Some Performance Gain
Because every hardware accelerated game
is using a minimum of two cores (see Myth 3), there is a possibility
that a successful optimization could yield no frame rate increase.
Consider the following example:
A dealer splits a deck of cards in half
and hands them to Jack and Jill. The dealer then asks the
participants to sort the deck by red and black. Assume for our
purposes that Jill is much faster than Jack, and finishes her half of
the deck in 45 seconds. Jack, who is slower, finishes in 60 seconds.
The entire process, since Jack and Jill operate in parallel, is
equal to the slowest participant: in this example, Jack. Therefore,
the entire process takes 60 seconds.
Now assume we optimize Jill’s performance so that she is able to sort the deck 15 seconds faster. If we run the experiment again, we can clearly see that our bottleneck, Jack, is still causing our experiment to take 60 seconds. We have optimized Jill by 15 seconds but notice no increase in the overall performance.
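A toy model of the experiment makes the arithmetic plain; the numbers are simply those from the example above:

#include <algorithm>
#include <cstdio>

int main()
{
    float jack = 60.0f;  // seconds for Jack's half
    float jill = 45.0f;  // seconds for Jill's half

    // Running in parallel, the wall-clock time is that of the slowest worker.
    printf("Before: %.0f seconds\n", std::max(jack, jill));  // 60

    jill -= 15.0f;  // optimize Jill by 15 seconds
    printf("After:  %.0f seconds\n", std::max(jack, jill));  // still 60

    return 0;
}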
Any time we fail to optimize the
slowest core or parallel GPU kernel we have the potential for a zero
percent frame rate increase. This sort of optimization, especially
if it requires two or more weeks of work, does not impress
management.
It is possible to increase performance
by optimizing the incorrect core. This occurs when we indirectly
optimize the slowest core. For example, if our game is limited by
fragment processing on the GPU, then optimizing AI will do little,
and probably nothing, to increase our overall frame rate performance.
If we were to optimize CPU work, such as a faster and better culling
system, we would indirectly be optimizing pixel processing. In this
case, we targeted a CPU optimization that led to a GPU optimization
in our limiting kernel (fragment processing).
Myth 5: Reducing Instruction Processing Is Our Primary Goal In CPU Optimization
When comparing the growth rate of
instructions retired in the past five years, the GPU is the winner.
The CPU, by means of increased instruction-level parallelism and multi-core, is in second place. The slowest growth (among resources commonly utilized at game runtime) belongs to the memory system.
The reason is simple: when used correctly, memory is very fast. The problem is that games, which are getting close to the 32-bit OS limit of 4 GB, frequently abuse our fragile memory architecture.
Many traditional optimizations, made
famous before the requirement of a tiered cache system, can be
harmful to modern architectures. For example, a look-up table trades
memory for instruction processing. If this increase in memory causes
a cache miss that requires a fetch from system memory, you have done
little to increase your performance. A cache miss that causes a
fetch to system memory is many times slower than the slowest
instruction. In attempting to save instructions, you have created
latency and a data dependency.
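A sketch of that trade-off, using an invented sine table as the classic example:

#include <cmath>

// Trades 256 KB of memory for "saved" instructions. If the table is not
// resident in cache, each access can cost a fetch from system memory.
static float g_sinTable[65536];  // filled at startup (not shown)

float SinLookup(unsigned short angle)
{
    return g_sinTable[angle];    // potential cache miss and data dependency
}

float SinCompute(float radians)
{
    return std::sin(radians);    // pure instruction work, no table traffic
}

Whether the table wins depends entirely on whether it stays hot in the cache.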
When optimizing the CPU, we have a
tendency to seek out the slowest instruction loops in our engine.
The usual suspects are AI, culling, and physics. If you are not
optimizing your engine for cache efficiency you are doing yourself a
disservice. If you are reducing instructions and increasing cache
misses, you are committing a sin.
Myth 6: Optimization And Assembly
For many programmers, optimization and
assembly are synonymous. This is much less true today than it was in
the past. If this inflames you, don’t think that I believe there
is NO place for assembly in optimization; it’s just a
smaller piece.
My justifications?
First, the use of APIs is much more
prevalent today than in the past. In many programs, the application
code written by the developer consumes a small percentage of the
total runtime. It is very common for the drivers, the graphics bus, or the graphics pipeline to be the critical 20% of the Pareto principle, the 20% where most of the time goes.
When the code you use, not the code you
write, is the hotspot or bottleneck, coding with optimized assembly (excluding shader assembly) will not be your most efficient use of
time. We have sacrificed a lot of control for code reuse through
APIs.
Second, optimizing for assembly is a
classic example of a micro optimization. A micro optimization is
more likely to have different results across PC configurations than
system or application optimizations. Everyone is familiar with the
office phrase, “It doesn’t crash on my machine”. Micro
optimizations sometimes stimulate the phrase, “It runs fast on my
machine”.
Finally, when you write assembly, you
get exactly what you write. To some, this is great. To those who
are not experts in assembly, it’s an opportunity to shoot yourself
in the foot. The term “optimizing compiler” is antiquated. Now,
even standard compilers contain optimizations. Writing assembly
bypasses the optimizations ingrained in compilers. Even textbook examples of non-data-dependent loop unrolling can yield slower performance than the pristine loop. The risk versus reward of pigeonholing your compiler is not as justified as it was in the past.
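For reference, here is the sort of textbook unrolling in question, sketched over a simple accumulation loop; a modern compiler will often vectorize and unroll the pristine version on its own:

float SumPristine(const float* a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

// Hand-unrolled four at a time; this can interfere with the compiler's own
// unrolling and vectorization rather than help it.
float SumUnrolled(const float* a, int n)
{
    float sum = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)
        sum += a[i];
    return sum;
}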
In closing this myth, I will repeat that there are circumstances under which optimizing with assembly is still the correct choice. For those who prefer a higher-level language, it’s good to know that compilers are doing more of our work.
Myth 7: A Ratio Of 1 To 1, Thread To Logical Core, Is Optimal
Ok, so maybe I’m getting a bit
nit-picky. This statement is common, but more detail must be applied
to stop confusion. If using Microsoft Windows XP, you will find that
on start-up, your machine will be running anywhere between 400 and 800 threads. Does this mean we will never achieve the ratio until we have an 800-core machine? Of course not.
A more accurate phrase is, “a ratio of 1 to 1, intensive threads to logical cores, is optimal”. Two threads, each running at 50%, can share one core efficiently. The difficulty in this example is ensuring that the two threads’ busy periods do not occur at the same time.
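One way to put that into practice on Windows is to size the pool of intensive worker threads from the logical core count; this is a sketch, not a prescription:

#include <windows.h>

// Number of intensive worker threads to create: one per logical core.
unsigned int IntensiveThreadCount()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors;  // logical cores, e.g. 4 on a quad-core
}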
Myth 8: Multi-threading With Efficient Synchronization Will Always Increase Performance
This myth is related to Myth 5 and is likely to fade as the memory architectures of multi-core systems evolve. The root of this myth is that multi-threading does not increase memory performance; it complicates it.
If a given algorithm is bound by memory
performance, then dividing the task across threads will not increase
the performance. And by opening the door to false sharing and
increased cache eviction, the potential for a decrease in performance
exists.
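A minimal sketch of the false sharing mentioned above, assuming a 64-byte cache line:

// Two threads each increment their own counter, yet the counters share one
// cache line, so the line ping-pongs between cores.
struct Counters
{
    volatile long countA;   // written only by thread A
    volatile long countB;   // written only by thread B, but same line as countA
};

// Padding gives each counter its own line (assuming the struct itself starts
// on a cache-line boundary), removing the contention.
struct PaddedCounters
{
    volatile long countA;
    char pad[64 - sizeof(long)];
    volatile long countB;
};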
Myth 9: You Can Determine The Performance Of A Graphics Call With The Following Methodology:
int start = getTime();
pD->DrawIndexedPrimitive(…);
int end = getTime();
int time = end-start;
The code above is understandable and
intuitive. Unfortunately, VGO is not. The missing piece is the work
that exists under the hood of the API and the hardware that it
drives.
The CPU is a low latency, low
throughput resource. In other words, if you ask it one question, you
get a quick response. The GPU is a high latency, high throughput
processor. In other words, if you ask it a question, you get a slow
response. If you ask it many questions, the response is just as slow
as if you asked it one.
How do we manage this difference in
processing? With a structure called the command buffer. The code
above is measuring the time it takes to set the command into the
command buffer and determine if your render state changed from the
last time you called DrawIndexedPrimitive. This does NOT tell you how long it takes to render the triangles. That work occurs at a
different time and is under the control of your drivers.
To measure that data, you would either
need to force the DrawIndexedPrimitive call to no longer act in
parallel (thus reducing the need for a buffer) or use drivers that
can provide that information for you.
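On Direct3D 9, one hedged way to do the former is an event query that stalls the CPU until the GPU has consumed everything submitted so far; bracket the draw call with such a flush and the naive timer starts to mean something. This serializes the CPU and GPU, so it is strictly a profiling tool:

#include <d3d9.h>

void FlushGPU(IDirect3DDevice9* pD)
{
    IDirect3DQuery9* pQuery = NULL;
    if (SUCCEEDED(pD->CreateQuery(D3DQUERYTYPE_EVENT, &pQuery)))
    {
        pQuery->Issue(D3DISSUE_END);
        // Spin until the GPU reaches the event; D3DGETDATA_FLUSH pushes the
        // command buffer to the driver.
        while (pQuery->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
        {
        }
        pQuery->Release();
    }
}

// Usage sketch: FlushGPU(pD); start = getTime(); pD->DrawIndexedPrimitive(...);
// FlushGPU(pD); end = getTime();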
Myth 10: OpenGL Is Faster Than DirectX (Or Vice Versa)
Not so fast. Religion, politics, and
API supremacy are not something I am prepared to discuss in a public
forum. The decision is deeply personal, and rooted in the financial
gain that your favorite API provides for your future. All sarcasm
aside, let’s look at some facts.
You can make a slow game in OpenGL and
you can make a slow game in DirectX. An API is an interface to your
driver, and the driver is an interface to hardware. Any API that
gives access to the hardware, without hindering it, will provide
plenty of opportunity for performance.
I will go out on a limb and suggest
that, at the time of this article, OpenGL is lacking the toolset supported by DirectX for analyzing GPU performance. There are
several efforts in development to alleviate this.
In Closing
The process of VGO is as mystifying and
complex as the hardware that drives our games. In the near future,
this process will be demystified by better tools and made more
complex by modern hardware, such as multi-core (think double-digit core counts) and unified shader architectures. Before
implementing any optimization “tip” or “trick”, be certain to
understand the underlying hardware that provides the optimization. If
you don’t, you may be perpetuating another optimization myth.