Leveraging the Power of Cache Memory
Looking for ways to increase your game's performance? Optimizing it to make the most efficient use of cache memory is one possible strategy. Here is a discussion of cache memory and some examples of putting it to work for you and your game.
April 9, 1999
Author: by Pascal Urro
When hardware engineers presented us software developers with cache memory for the first time, essentially they gave us a black box. "Don’t worry about what’s inside of it," they assured us, "just trust us that without changing your programming habits, your programs are going to run much faster!" That was the goal: free software performance enhancement. While it worked fine for a while, over the years the rift between processor speed and memory access time became greater, and it became obvious that by opening the black box of cache memory and exploring the contents within, we’d find solutions to the widening performance gap.
The first steps into demystifying cache memory were modest, and the first tips that resulted from these explorations were vague. For instance, some of the first tips about making the best use of cache memory were non-specific suggestions such as, "avoid random memory access" and "keep your data local". With the appearance of dissociated data and code caches, new suggestions arose, like "keep your data and code strictly separated." The "allocate on write" technique came with its own set of advice. These days, the manual prefetch is in vogue, which takes advantage of dedicated instructions introduced with the new generation of Intel, AMD, PowerPC, and Mips processors. Unfortunately, these basic tips aren’t really satisfactory if you are serious about trying to optimize your game. As a result, the major new code optimization challenge facing game developers today has cache memory optimization.
Simply blindly applying vague suggestions like those to your game, without knowing their basis for operation, will not give you fully optimized code. In fact, it’s easy to find instances where strict code optimization and cache memory optimization conflict. Let’s look at an example.
Assume that you are computing a true-color image. You have the choice of representing your image using either a 24-bit RGB data structure or a 32-bit structure using an idle byte. Your first reflex would probably be to select the second solution for one simple reason: all your memory access will be 4-byte aligned, avoiding the three (at least) processor cycles needed to handle misaligned access. Usually, that’s the right choice, but if you’re using cache memory, it’s not always the way to go.
The technique you should use is implementation dependant. If you’re only going to look at the data once, a 24-bit structure may be more efficient than the byte-aligned 32-bit structure, but the opposite is true if the data is being accessed multiple times. Why? Because on the first pass, the image data is absent from the cache (assuming that we’re working on a processor without a prefetch instruction). Accessing the data from RAM and fetching that data into cache are expensive enough operations as to make the misalignment penalty insignificant. What we’re worried about during the first pass is not the shape of the data per se, but the raw amount of data that has to be moved. So to reduce the amount of data we need to fetch, we should use a 24-bit data structure as opposed to the 32-bit alternative.
At this point let me clarify something about optimizing your game for cache memory. The main handicap of cache memory is its lack of determinism and the difficulty we encounter when trying to tune its performance. Of course, Pentium event counters provide some useful help in this regard, but not enough to strictly optimize your code. Therefore the only real solution is, as usual, hard work. To that end I recommend purchasing a good book on the subject, such as The Cache Memory Book by Jim Handy (Academic Press, 1998).
This hard work will be rewarded, however. To convince you that the results of properly handling cache issues justify the added programming complications, let’s take a look at two detailed, real-life examples. The first example will convince you of one simple thing: accessing uncached data is expensive. The second (and more interesting) example will show you that sometimes changing a very small aspect of your game can solve a huge performance problem, provided you look deeply enough into the cache structure. There isn’t much actual code in these examples – they are primarily meant to illustrate that understanding and optimizing your data structures for cache can result in significant performance improvements.
Example 1: A 16-bit floating-point Z-buffer
What is the drawback of Z-buffers? They’re costly, that’s true. But the cost of a Z-buffer depends on the one parameter associated with its algorithm: the depth. There are several choices when it comes to constructing an efficient Z-buffer, but it turns out none of them are perfect for our use.
The highest precision type of Z-buffer is given by the 32-bit floating-point Z-buffer. However, this flavor of the algorithm is also the slowest and it can’t be used with MMX code – so we can disregard it. The 32-bit integer Z-buffer is faster, and remains satisfying in terms of precision: it looks pretty good. Finally, the 16-bit integer Z-buffer is not accurate enough, although it would be quite quick. So the best choice is the 32-bit integer Z-buffer.
Do we have to be satisfied with such a situation? No. Game developers are greedy, and we want both 32-bit integer precision and 16-bit integer speed and compactness. It’s this desire that germinated the idea of a 16-bit floating point Z-Buffer. You may be saying to yourself, "You crazy Frenchmen, 16-bit floats don’t exist!" and you’re quite right: they don’t. But I’m are not crazy, either. In order to use 16-bit floating-point Z-buffers, you have to create them.
Before diving deep into the description of small floating-point numbers, I want to clear something up. It’s easy to imagine that the code needed to implement the 16-bit floating-point (16bFP) Z-buffer is more complex than the code needed for the 32-bit integer (32bINT) Z-buffer, and you may be concerned that this added complexity will slow down your game. However, remember the lesson that was learned when we tried to structure the image data in memory: sometimes the efficiency gained from using less memory outweighs the penalty imposed by using more complex code. At the end of this example, I’ll explain the circumstances in which the 16bFP is faster than the 32bINT, but keep in mind that both have comparable results. With that in mind, let’s get started creating 16bFP!
First, ask yourself the following: "What arithmetic operations are used in the Z-buffer algorithm?" We need just two operations: addition and comparison (see Figure 1).
if (Z < Zbuffer[pos]) { write pixel ; Zbuffer[pos] = Z ; } pos += 4 ; Z = Z + dZ ; |
Figure 1 |
The first approach is to write code for these operations for our new representation. I advise those of you that are not familiar with floating point number structure to look at Figure 2, or better yet, read Chris Hecker’s column, "Let’s get to the [floating] point" (Game Developer magazine, February 1996). Of the two operations, the comparison could be the simpler if a 16-bit floating-point number structure similar to the 32-bit one was used (exponent in the top part, and mantissa in the low part). This structure allows us to compare two floating point numbers as if they were integers, which ensures a fast and easy comparison. Now, if we can find an efficient addition scheme for two 16-bit floats, we’ll be set.
A floating point number can be divided into 3 parts : the sign bit, the exponent and the mantissa.
31 | 30 | 23 | 0 |
sign | exponent | mantissa | |
It’s the working of the exponent that interest us here. It indicates the position of the decimal point in the number. If the number is in [0..1] then the exponent represents the number of reset bits that separates the decimal point from the first set bit. |
Unfortunately, I have not found an efficient way to perform 16-bit floating point addition. Instead, I use a somewhat different approach. First I use the simplest addition operation that available: 32-bit integer addition. After this addition, the result must be converted into a 16-bit floating point number, which turns out to be rather simple. This conversion looks like a compression scheme: the idea is to work like floating point numbers by eliminating the most significant bits that are at zero. In other words, a 16-bit floating number is composed by a first integer number that indicates the position of the first (most significant bit) that is set (we call it the exponent) and a second integer number created by a truncated piece of the original 32-bit integer number (the mantissa). For example, assume the original 32-bit integer number is represented as:
00010101010101010011001001110000b
The most significant bit that is at 1 is the 28th (11100b) in this example, but it could have any value from 31 down to 0. So, we need five bits to code the exponent. That leaves us 16-5=11 bits for the mantissa. For this example, then, the mantissa is (in bold):
000 10101010101 010011001001110000b
The final 16 bit floating point number looks like this:
11100 10101010101b
This organization preserves the magnitude-order of the numbers, so we’ve succeed in our goal to keep a simple comparison operation. Let’s sum up what we said by rewriting the algorithm (see Figure 3).
Zf16 = convert (Zi32) ; if (Zf16 < Zf16Buffer[pos]) { write pixel ; Zf16Buffer[pos] = Zf16 ; } pos += 2 ; Zi32 = Zi32 + dZi32 ; |
Figure 3 |
Let’s look at the conversion function now (see Figure 4). There are several ways to perform the conversion. I’ll describe the smallest one, which needs only three x86 instructions. Note than even though this is the smallest implementation, it is not necessarily the fastest: depending on the context, other solutions could be faster.
STEP 1: EAX = 00010101010101010011001001110000b ECX = ? BSR ECX ,EAX EAX = 00010101010101010011001001110000b ECX = 28 STEP 2: EAX = 00010101010101010011001001110000b ECX = 28 EDX = ? SHRD EDX ,EAX,CL EAX = 00010101010101010011001001110000b ECX = 28 EDX = 0101010101010011001001110000????b STEP 3: EDX = 0101010101010011001001110000????b ECX = 11000b (28d) SHRD EDX ,ECX,5 EDX = 11000010101010101001100100111b ECX = 28 |
---|
Figure 4 |
First we need to know which is the most significant bit that is set. To do determine this, you can use the x86 instruction called BSR (Bit Search Reverse). This instruction used to consume several dozen cycles on a Pentium processor, but it became a just two-cycle instruction on the Pentium II.
Second, eliminate the top three most significant bits like this:
From : EAX = 00010101010101010011001001110000b
To : EAX = 10101010101010011001001110000000b
To do that, the most intuitive solution is to shift EAX three bits to the left. However, to compute the number of bits to shift, we need to use at least two instructions and to make use of one more register (MOV EDX,31; SUB EDX,ECX) in addition to the shift instruction. Another possibility is to use the more exotic SHRD (Shift Right Double) instruction. This instruction shifts a register to the right, introducing the shifted bits into another register from its left hand side. (You’ll notice that the most significant bit set is lost, as real floating numbers.)
For the third step, we just need to combine the exponent (ECX) and the mantissa (EDX). To do that we use the SHRD operation again.
We’re almost done. Our 16-bit float is now present in the upper half of EDX register, but to use it, it must be in the low half (DX) of the register. For this reason we won’t use the last SHRD instruction with a count of 5, but with a count of 5+16=21.
Let ’s sum up the conversion function. The total resources used are three instructions and uses two registers:
BSR ECX ,EAX
SHRD EDX ,EAX,CL
SHRD EDX ,ECX,21
You’ve may have noticed that the EDX register isn’t initialized. This is only necessary if the 32bINT number is less than 2^11 - 1 = 2047 (meaning that the most significant bit set is less than the mantissa size in bits). However, experience shows that we can usually do without this initialization.
These three instructions are "long decoded" on Pentium II processors, which means that every instruction needs one cycle to be decoded, and generates two micro-operations each. In order to lower the cost of conversion, the instructions above should be interlaced with simple, decoded instructions.
Before I move on, let me issue a caveat regarding the BSR instruction. This instruction is usable only on the Pentium Pro, Pentium II, Pentium III and Celeron processors. The cost of using the BSR instruction remains exorbitant on all other processors, including the K6-family.
Now that we’ve looked at the 16-bit floating point Z-buffer, let’s think about its performance. I ran some tests on a 233 MHz Pentium II with EDO RAM, and found the performance of a 32-bit integer Z-buffer to be very similar to the performance of the 16-bit floating point version, so we can choose between them depending on the situation. The 32bINT Z-buffer should be used on systems where memory access is cheaper (such as those systems that have fast SDRAM), and the 16bFP is a better solution on Celeron-based PCs where memory access must be avoided at all costs. Even with the 128KB L2 cache of Celeron A processors, the 16bFP solution minimizes cache pollution, thereby achieving better performance.
The moral of this story is fairly simple: Increasing code complexity to reduce data size is a legitimate way to achieve better performance.
Example 2: Managing Big Triangles with Large Texture Maps
The second example, which I’ll present now, is more persuasive for a number of reasons. First, this problem is directly related to cache memory. Second, the solution can’t be found without a deep analysis of the way cache memory works. Third, the solution requires a minuscule change in the program.
In this example let’s look at a software rendering engine, to be used in a Quake-like game. In these kind of games, we find corridors, rooms, and open spaces. In big rooms or open spaces, we have huge roofs (or a sky), and huge floors. We’ll be talking about these huge surfaces.
Let’s assume that the roof in a particular room is a square divided in two triangles, and one 256x256 texel texture is used for the whole square. Here’s the problem: when a player looks at the roof, the game’s frame rate varies from a certain value to half that value, whenever the player rotates his head. To solve the problem, you don’t have to change one line of code. You simply have to add a pitch of sixteen texels to your mapped texture. This means that the 256x256 texel maps will be stored in a memory space of 272x256 texels.
The texture map is stored in memory as shown in Figure 5. Imagine that you are in the middle of the room, facing North, and looking at the roof (or the floor). The rendering engine maps the two triangles, reading the texture in the order that it’s stored, and everything is fine. A problem emerges, however, when you turn your head 90 degrees in either direction, because the texture is no longer being read in the same order that it’s organized in memory. To clarify, let’s take a closer look at what is happening in the left side of the texture map.
A0 B0 C0 D0 E0 F0 G0 H0 I0 J0 K0 L0 M0 N0 O0 P0 Q0 R0 S0 T0 U0 V0 W0...A1 B1 C1 D1 E1 F1 G1 H1 I1 J1 K1 L1 M1 N1 O1 P1 Q1 R1 S1 T1 U1 V1 W1...A2 B2 C2 D2 E2 F2 G2 H2 I2 J2 K2 L2 M2 N2 O2 P2 Q2 R2 S2 T2 U2 V2 W2...A3 B3 C3 D3 E3 F3 G3 H3 I3 J3 K3 L3 M3 N3 O3 P3 Q3 R3 S3 T3 U3 V3 W3...A4 B4 C4 D4 E4 F4 G4 H4 I4 J4 K4 L4 M4 N4 O4 P4 Q4 R4 S4 T4 U4 V4 W4...A5 B5 C5 D5 E5 F5 G5 H5 I5 J5 K5 L5 M5 N5 O5 P5 Q5 R5 S5 T5 U5 V5 W5...A6 B6 C6 D6 E6 F6 G6 H6 I6 J6 K6 L6 M6 N6 O6 P6 Q6 R6 S6 T6 U6 V6 W6...A7 B7 C7 D7 E7 F7 G7 H7 I7 J7 K7 L7 M7 N7 O7 P7 Q7 R7 S7 T7 U7 V7 W7.........A255 B255 C255 D255 E255 F255 G255 H255 I255 J255 K255 L255 M255... Assume that each letter-digit pair represents a 16bit texel, and that the surface to map is exactly the sizeof the texture. |
Figure 5: How the texture map is stored |
What happens when we look at the roof in the "good" way? Well, first the A0 texel is read and written to the corresponding pixel. The same treatment is given to the texels B0, C0, D0, and so on. Nothing strange here. What is happening from the point of view of the cache? For the cache, reading A0 will cause a cache line fill. This operation store texels A0 to P0 in the cache memory (on Pentium II, the cache line size is 32 bytes, but K6-family processors have 2x32 byte size lines). So a certain number of cycles will be consumed during the access to the A0 pixel, but accessing texels B0 through P0 does not consume any other processor cycles.
What happens when we look at the roof in the "bad" way? This time, it’s not the letter that will change, but the digit. Lines A0, A1, A2,... will be accessed successively, so the cache will, as expected, load lines A0 to P0, A1 to P1, A2 to P2, and so on. This induces costly penalties due to cache misses and cache loads.
As the first line (A0 to A255) is processed, a huge number of cycles are lost. That wouldn’t be a problem if the texels B0 to B255, C0 to C255, D0 to D255, ..., P0 to P255 were present in the data cache afterward. Will it be? It’s tempting to think so, since the entire structure will fit in cache (32 bytes per cache line * 256 texels in the texture map * 2 bytes per texel = 16KB). You might think that the entire A0,P0,A255,P255 block could be cached after reading texels A0 to A255. Unfortunately, this isn’t the case. The unhappy consequence of accessing A0 to A255 successively is that no texel is cached when accessed, which is why there is a considerable decrease in the game’s frame rate. So the heart of this problem becomes, "why is B0 not present in the cache memory when accessed?" Why is the data that we need not in cache when we expect it to be?.
First, level one cache memory organization is very similar on Pentium, Pentium II, and K6-family processors. The main difference is the cache size. On Pentium II processors, cache memory is organized as four two-dimensional arrays of 128 rows and 32 columns each, while K6 processors have two 256x64 arrays. The smallest element is one byte. We shall assume for this example that we have just one array of 512 rows and 32 columns in either processor. A byte is either present or absent in cache, and if it’s present, it can only be at one address. To know where, we need the address of the byte in the memory space. The five least significant bits indicate the row, while the next nine bits give us the line of the byte, as shown in Figure 6. It’s important to understand this addressing scheme.
Let’s see what happens when the texture map is accessed in the "bad" way. Assume the map is stored at the address 0x08800000. When accessing A0, the texels A0 to P0 are stored in the line 0,at 0x008800000. A1 is present at the address 0x08800200 (256 texels = 512 bytes), which means that texels A1 to P1 are located in the row 16. Hey – what about the rows 1, 2, ...,15? The answer is that they stay unused. So A2 to P2 goes in row 32, and so on, until we reach texels A31 to P31 which go into row 496. Once we reach that point, though, trouble begins: the data in A32-P32, as you might have guessed, gets placed in row 0, evicting the data that was already there (A0 to P0). Data is now being overwritten beginning with texel A32. This is why texel B0 is absent from cache when we try to access it later.
After investigating the caching mechanism, we are armed with enough knowledge to start looking for a solution. We have three options.
We could store the texture map as a double specimen, one rotated 90 degrees from the other. This is a somewhat inelegant approach, however.
The last approach is the one I prefer. The idea is to break up the pattern. This is achieved by using a texture map line size that isn’t a power of two. Adding 16 void texels (16 bits each) to the end of each line is sufficient on the Pentium II processor, but not enough on the K6, where 32 void texels is a better choice (see Figures 8a and 8b).
Cache Makes a Difference
Cache memory is a powerful tool. In order unleash its potential, however, it needs a little bit of help. There’s no need to use assembly language, even if it offers more flexibility than C/C++. In the last example we didn’t talk about the code at all, just about the data structures, and that’s our main point: always be sure that your data organization is cache friendly.
Alexandre Macris is currently working in the R&D department at Cryo Interactive as 3D Engine code optimizer for K6-2, Pentium II and Pentium III processors. He's waiting your feed back at [email protected]. Pascal Urro has a Ph.D. in physics and chemistry and is a videogame addict. He has worked on video compression and 3D engines. He is now the R&D manager at Cryo Interactive. You can reach him at [email protected]. Both authors would like to thank Joshua Thayer for his input while writing this article.
Read more about:
FeaturesYou May Also Like