Optimizing Games for the Pentium III Processor

With the launch of the next generation of Intel processors, new options are available to game developers. This overview of the Pentium III illustrates how programmers can use the new floating-point capabilities to produce better game graphics.

Pete Baker, Blogger

March 26, 1999

This spring, Pentium III processor-based systems began shipping. The chip was designed with today’s floating-point-intensive multimedia, Internet, and 3D applications in mind, but of course gamers won’t buy the processor to marvel at its new registers and instructions. Consumers just want to play better games on their new systems.

To make the most of your game on Pentium III processor-based systems, you have to know how to optimize your game for the new processor, and that in turn requires that you understand the processor's architecture. The Pentium III processor is based on the same well-known foundation as the Pentium II processor, so many of the software design principles and optimization techniques still apply. With this in mind, the best optimization tool with which to arm yourself is a basic understanding of how the processor executes code (see Sidebar 1, "Understanding the Pentium II Processor"). Beyond that, though, you need to learn about the Pentium III processor’s new registers and new instructions, and decide how to design your next title to make the most of these features. This article explains some of the features of the Pentium III processor, and shows how they can be used by game developers.

New Floating-Point Registers – No More ‘Double Duty’

One of the most common requests for enhancement to microprocessor architecture is the addition of new registers. Let’s face it, we’ve all spent time performing loop optimizations and said to ourselves, "If I just had one more register, I could really work some magic…". The Pentium III processor architecture takes steps to address this problem. It provides eight new registers in a flat register model (no floating-point stack implementation to deal with), in addition to the original 14 Intel Architecture registers and the eight FP/MMX technology registers (see Figure 1). These new registers are called XMM0-XMM7. They are 128 bits wide, and they can be used to perform calculations on data. (These new registers may not be used to address memory.) Not coincidentally, these registers are the perfect size for storing four single-precision (32-bit) floating-point numbers.

The new registers are completely independent of the existing floating point and MMX technology registers. There’s no penalty or emms instruction required to perform a context switch between the new registers and the existing FP/MMX registers – they work concurrently. This means you have up to four times the register real estate when working on single precision floating point data: the original ST0-ST7, and the new XMM0–XMM7 (which hold four single precision floating-point numbers each). This gives programmers the ability to develop algorithms that can finely mix packed single-precision floating point and integer data using both the Streaming SIMD Extensions and MMX technology, respectively.

Figure 1 - New Registers (XMM0 - XMM7)

New Status/Control Register

The addition of new registers requires a new processor state. The Pentium III processor has a new combined status/control register to handle this new state (see Figure 2). The status/control register is used to perform the following: enable masked/unmasked numerical exception handling, set rounding modes, set the flush to zero mode, and view status flags. The four existing FP rounding modes remain the same: round to nearest, round down, round up, and round toward zero. Additionally, a flush to zero (FTZ) bit has been added to remove the burden of checking for underflow of FP numbers (all underflows are automatically forced to zero with the sign of the true result, and the precision and underflow exception flags are set).

Figure 2 - New status / control register

Changes to this new register will only affect the new processor state. For example, changing the rounding mode to "round up" in this register will affect operations done in the new 128-bit Pentium III registers (XMM0-XMM7), but not operations done in the existing FP registers (ST0-ST7).
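As a rough illustration of the point above, here is a minimal C sketch that uses the compiler intrinsics from xmmintrin.h to change the rounding mode in the new status/control register and then restore it. The helper names (sse_float_to_int, sse_float_to_int_round_up) are our own inventions for this example, and the sketch assumes a compiler that ships the SSE intrinsics header.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_getcsr, _mm_setcsr, rounding macros */

/* Convert a float to an int with cvtss2si, which honors the rounding
   mode currently held in the new status/control register. */
int sse_float_to_int(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Hypothetical helper: convert with "round up" temporarily selected,
   then put the caller's register contents back. */
int sse_float_to_int_round_up(float f)
{
    unsigned int saved = _mm_getcsr();     /* stmxcsr: read status/control */
    int result;

    _MM_SET_ROUNDING_MODE(_MM_ROUND_UP);   /* rewrites only the rounding bits */
    result = sse_float_to_int(f);
    _mm_setcsr(saved);                     /* ldmxcsr: restore previous state */
    return result;
}
```

With the default round-to-nearest mode, sse_float_to_int(2.25f) should give 2, while sse_float_to_int_round_up(2.25f) should give 3. Saving and restoring the register keeps the change from leaking into unrelated code. In optimized builds it is worth checking the generated code, since compilers do not always track dependencies on this register.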

New Data Types, New Instructions

Probably the most useful new processor feature to game developers is the new Streaming SIMD Extensions (see Sidebar 2, "SIMD Explained"). After the unveiling of MMX technology with its SIMD operations on integer data types, it was clear that the instruction set architecture could be enriched to be more flexible and more adaptable to algorithms that used single-precision floating-point data.

The Streaming SIMD Extensions were designed specifically to address the needs of algorithms that are:

  1. Computationally intensive

  2. Inherently parallel

  3. Dependent on efficient cache utilization

  4. Single-precision floating point implementations

Generally, you want to try to optimize code segments that are computationally expensive and that take most of the overall application processing time. The new instructions help accelerate applications that rely heavily on floating-point operations, such as 3D geometry and lighting, video processing, and high-end audio mixing.

Before we delve into the instructions themselves, however, it makes sense to look at the type of data the instructions require. The principal data type of the Streaming SIMD Extensions is a new 128-bit data type (see Figure 3). In most cases, this data type must be 16-byte aligned.

Figure 3 – the __m128 data type

The new data types follow IEEE Standard 754 for binary floating-point arithmetic, in the 32-bit single-precision format. The existing Intel architecture floating-point unit also follows IEEE 754, but it can carry intermediate results at higher internal precision (up to its 80-bit extended format). As a result, results from operations done with the Streaming SIMD Extensions and results obtained by the standard Intel architecture floating-point operations may not be bit exact.

The 70 new instructions can be broken down into six basic categories. We won’t list all of the instructions in this article (check out http://developer.intel.com for documentation). Instead, we’ll hit the highlights: an overview of each type of instruction, with an example of how each might be used in a subject near and dear to game developers, 3D graphics.

Not too long ago, 3D application performance was limited by poorly performing accelerator hardware (or worse, no accelerator at all). Fast rasterization hardware has quickly become mainstream on PC platforms. The processor now has the difficult task of performing calculations fast enough on the geometry and lighting side of the 3D pipeline to keep the accelerator fed.

The processor, as expected, has a number of instructions that perform arithmetic computations. These can be further sorted into two groups: full precision instructions and approximate precision instructions. Full precision instructions consist of all of those floating point operations you know and love for doing adds, subtracts, multiplies, and divides, and so on, which operate on the new Pentium III registers.

There are also several approximate precision instructions for doing reciprocals and reciprocal square roots. The approximate precision instructions are extremely fast, but only return 11 bits of precision (rather than 23). These are useful for doing lighting, perspective projection and all kinds of other 3D graphics tasks for which 11 bits of precision is sufficient. For applications where more precision is required, you can use the following code to perform a Newton-Raphson iteration on the results, and get up to 22 bits of precision:

// Newton-Raphson approximation for
// 1/tz = 2 * 1/tz - tz * 1/tz * 1/tz
// the initial value, tz, assumed to be in xmm0

rcpps xmm1, xmm0 // xmm1 = 1/tz (approximate)
mulps xmm0, xmm1 // xmm0 = tz * 1/tz
mulps xmm0, xmm1 // xmm0 = tz * 1/tz * 1/tz
addps xmm1, xmm1 // xmm1 = 2 * 1/tz
subps xmm1, xmm0 // xmm1 = 2 * 1/tz - tz * 1/tz * 1/tz

This can be accomplished in half the time it takes to do a full-precision divide, which means that you get four results in less time than it takes to do one on a Pentium II processor.
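For readers working in C rather than assembly, the same refinement can be sketched with compiler intrinsics. This is an illustration under our own assumptions, not the only way to write it: the function names reciprocal_nr and reciprocal4 are made up for this example, and the intrinsics come from xmmintrin.h.

```c
#include <xmmintrin.h>

/* One Newton-Raphson step on the rcpps estimate, four values at once:
   x1 = 2*x0 - d*x0*x0, where x0 is the approximate 1/d. */
__m128 reciprocal_nr(__m128 d)
{
    __m128 x0      = _mm_rcp_ps(d);                     /* rcpps estimate */
    __m128 two_x0  = _mm_add_ps(x0, x0);                /* 2 * x0         */
    __m128 d_x0_x0 = _mm_mul_ps(_mm_mul_ps(d, x0), x0); /* d * x0 * x0    */
    return _mm_sub_ps(two_x0, d_x0_x0);
}

/* Convenience wrapper (hypothetical) that works on plain float arrays. */
void reciprocal4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, reciprocal_nr(_mm_loadu_ps(in)));
}
```

Each step roughly doubles the number of correct bits, which is why a single iteration is enough to go from the estimate to nearly full single precision.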

Each of the computational instructions has both a packed (denoted by a ps suffix -- see Figure 4) and a scalar (denoted by an ss suffix – see Figure 5) version. The difference between these two versions is that packed operations complete four operations with one instruction, whereas the scalar versions only operate on the least significant data element and leave the other three elements of the destination unchanged.

Figure 4 - The packed versions of the instructions will operate on four data elements at a time

Figure 5 - The scalar versions of the instructions will operate on the least-significant data element only.
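The packed/scalar difference is easy to see in a small C sketch using the addps and addss intrinsics. The function names add_packed and add_scalar are ours, chosen only for this illustration.

```c
#include <xmmintrin.h>

/* addps: all four elements are summed. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* addss: only the least significant element is summed; the other
   three elements are passed through from the first operand. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version should produce {11, 22, 33, 44} and the scalar version {11, 2, 3, 4}.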

 

Data Movement and Data Manipulation

For everyday data movement, the Streaming SIMD Extensions provide move instructions. The movaps (move aligned packed single) and movups (move unaligned packed single) instructions transfer 128 bits of packed data from memory to one of the XMM registers and vice-versa, or between XMM registers. The faster movaps instruction can be used if the data is aligned on a 16-byte boundary.
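In C, the two moves map onto the _mm_load_ps (movaps) and _mm_loadu_ps (movups) intrinsics. The sketch below picks between them at run time purely to show the alignment rule; sum4 is a hypothetical helper, and in real code you would normally guarantee the alignment up front and always take the aligned path.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Sum four consecutive floats starting at p. */
float sum4(const float *p)
{
    __m128 v;
    float lanes[4];

    if (((uintptr_t)p & 15u) == 0)
        v = _mm_load_ps(p);    /* movaps: p must be 16-byte aligned */
    else
        v = _mm_loadu_ps(p);   /* movups: works at any address      */

    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Using the aligned form on an address that is not 16-byte aligned raises a fault, so the check (or an alignment guarantee from your allocator) matters.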

Remember that this is a four-way parallel instruction set; we want to get as much parallelism out of the code as we can. Let’s say your data isn’t laid out in memory four in a row. In that case, some data manipulation may be required. Since we’re using a packed data type, it’s important to provide ways to get the data into the correct format for optimal use by the instruction set. To that end, the instruction set has instructions for performing data manipulations like shuffles, 64-bit moves, packing and unpacking, inserts and extracts.

For instance, say you want to perform simple dot products. In most 3D engines, data is laid out in a simple structure like this (where ‘w’=1):

struct vertice {
    float x, y, z, w;
    float nx, ny, nz;
    float u, v;
};

Then the following code performs the dot products:

for (i=0;...)
{
    FR3 = ((X*m00) + (Y*m01) + (Z*m02) + mat03);
    .
    .
    .
}

This performs the operations shown in Figure 6.

Figure 6 – Non-optimal data layout

In Figure 6, you can see that we’re wasting 25% of our execution bandwidth in the multiply (we really only have to do three multiplies, assuming w=1), and we suffer from the additional overhead of three shuffles and three adds to get the final result.

Optimally, the data should be set up in a parallel format, so that the four dot products could be done with three multiplies and three adds, as shown in Figure 7. These parallel calculations can be done with the Streaming SIMD Extensions in the same time it took to do the one dot product on the Pentium II processor.

Figure 7 – Optimal SIMD data layout

How do you go about reordering the data? One method is to use the 64-bit move movhps dest, src (see Figure 8) and the shuffle shufps dest, src, mask (see Figure 9). The 64-bit move instructions move 64 bits, representing two single-precision operands, between memory and either the upper or lower 64 bits of an XMM register.

Figure 8 – movhps instruction

The shuffle can be used to rotate, shift, swap and broadcast data between two registers or within one register (if both src and dest are the same), under the control of a mask. The mask contains eight bits: two bits for each data element in the dest. Bits 0 and 1 of the immediate field select which of the four numbers in dest will be used as the first number of the result; bits 2 and 3 select the second number, also from dest; bits 4 and 5 and bits 6 and 7 select the third and fourth numbers from the four numbers in src.

Figure 9 – shufps instruction
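To make the mask encoding concrete, here is a small C sketch. The shuffle_mask and shuffle_low_pairs helpers are our own, written only for illustration; _MM_SHUFFLE and _mm_shuffle_ps are the standard macro and intrinsic from xmmintrin.h. Note that _MM_SHUFFLE takes its arguments from the highest result element down to the lowest.

```c
#include <xmmintrin.h>

/* Build a shufps immediate by hand: two bits per result element,
   element 0 in bits 0-1 up to element 3 in bits 6-7. */
unsigned shuffle_mask(unsigned e0, unsigned e1, unsigned e2, unsigned e3)
{
    return (e0 & 3u) | ((e1 & 3u) << 2) | ((e2 & 3u) << 4) | ((e3 & 3u) << 6);
}

/* shufps dest, src, 0x44: result = { dest[0], dest[1], src[0], src[1] } */
void shuffle_low_pairs(const float dest[4], const float src[4], float out[4])
{
    __m128 d = _mm_loadu_ps(dest);
    __m128 s = _mm_loadu_ps(src);
    _mm_storeu_ps(out, _mm_shuffle_ps(d, s, _MM_SHUFFLE(1, 0, 1, 0)));
}
```

For example, shuffle_mask(0, 1, 0, 1) works out to 0x44 and shuffle_mask(1, 3, 1, 3) to 0xDD. Computing the immediate this way (or with _MM_SHUFFLE) tends to be less error-prone than writing hex constants by hand.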

Now we’ll show an example of how these instructions can be used to reorganize vertex data. (The "-" symbol in the comments below denotes a "don’t care".)

// Where xmm7 = -z0y0x0; xmm2 = -z1y1x1;
// xmm4 = -z2y2x2; xmm3 = -z3y3x3
// Reorder the input vertices to be
// in xxxx,yyyy,zzzz format

movhps temp1, xmm7
// Use 64-bit moves to move the high 64-bits…

movhps temp2, xmm4
// Save the Z0, Z2 values from these vectors

shufps xmm7, xmm2, 0x44
// xmm7 = y1,x1,y0,x0

shufps xmm4, xmm3, 0x44
// xmm4 = y3,x3,y2,x2

movaps xmm5, xmm7
// save content of register to extract
// the X elements later

shufps xmm7, xmm4, 0xDD
// xmm7 = y3,y2,y1,y0

shufps xmm5, xmm4, 0x88
// xmm5 = x3,x2,x1,x0

movhps xmm6, temp1
// mov the Z0 element from memory into the
// high half of the reg: xmm6 = -,z0,-,-

shufps xmm6, xmm2, 0x22
// xmm6 = -,z1,-,z0

movhps xmm2, temp2
// mov the Z2 element from memory into the
// high half of the reg: xmm2 = -,z2,y1,x1

shufps xmm2, xmm3, 0x22
// xmm2 = -,z3,-,z2

shufps xmm6, xmm2, 0x88
// xmm6 = z3,z2,z1,z0
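For comparison, here is a C sketch of the same idea: reorder four vertices into xxxx, yyyy, zzzz form and then compute four dot products with three multiplies and three adds. To keep it short, it swaps in the _MM_TRANSPOSE4_PS macro from xmmintrin.h instead of the hand-written movhps/shufps sequence above. The Vec4 type and the dot4 function are hypothetical names for this example, and the vertex is assumed to be a plain x, y, z, w record.

```c
#include <xmmintrin.h>

/* Hypothetical packed vertex: x, y, z, w, 16 bytes per vertex. */
typedef struct { float x, y, z, w; } Vec4;

/* out[i] = x_i*m[0] + y_i*m[1] + z_i*m[2] + m[3] for four vertices. */
void dot4(const Vec4 v[4], const float m[4], float out[4])
{
    __m128 r0 = _mm_loadu_ps((const float *)&v[0]);
    __m128 r1 = _mm_loadu_ps((const float *)&v[1]);
    __m128 r2 = _mm_loadu_ps((const float *)&v[2]);
    __m128 r3 = _mm_loadu_ps((const float *)&v[3]);
    __m128 acc;

    /* After the transpose: r0 = xxxx, r1 = yyyy, r2 = zzzz, r3 = wwww */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    acc = _mm_mul_ps(r0, _mm_set1_ps(m[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(r1, _mm_set1_ps(m[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(r2, _mm_set1_ps(m[2])));
    acc = _mm_add_ps(acc, _mm_set1_ps(m[3]));   /* w is assumed to be 1 */
    _mm_storeu_ps(out, acc);
}
```

If the vertex data can be stored in xxxx, yyyy, zzzz order to begin with, the reorder step disappears entirely, which is usually the better design when you control the data format.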

Prefetching Data and Cache Instructions

The most appealing applications for the home PC market handle growing amounts of data – whether it be integer or floating-point. (Just think about the amount of texture and vertex data your next title will use.) Unfortunately, most of the data is out of the caches when it’s needed. The operation of loading and storing the data to and from the caches slows down the application while it waits for the data to become available. In some cases, the data address is known ahead of time, and the data could have been fetched in advance, reducing these waiting cycles. There are ways to do this with reads today, but it’s obvious that the methods could be improved. To address this problem, the Streaming SIMD Extensions contain new instructions dedicated to memory streaming: the prefetches and the streaming stores.

Some multimedia data types, such as the 3D display list, are referenced once and aren’t used again immediately. A programmer wouldn’t want a game’s cached code and data to be overwritten by this non-temporal data. The movntq/movntps (or streaming store) instructions let data be written directly to memory, thereby minimizing cache pollution. For data that you know you’ll use soon and often, there’s the prefetch instruction. This instruction lets you prefetch 32 bytes of data (a cache line on the Pentium III processor) before it’s actually used. All of these prefetch instructions can be used to prefetch data into the L1 cache, all cache levels, or all levels except L1. Table 1 shows the different uses of data prefetching.

 

Data Use                                      | Prefetch Type                    | Prefetch Instruction
Data will be used once                        | Prefetch into L1 only            | PrefetchNTA
Data likely to be reused                      | Prefetch into all levels         | PrefetchT0
Data likely to be reused, but not immediately | Prefetch to all levels except L1 | PrefetchT1 / T2

Table 1 – Data Use vs. Prefetch Type

These instructions retire quickly, and because they are merely a hint to the processor, they won’t generate any exceptions or faults. When prefetching data, it’s important to remember a few simple rules:

  1. Choose the right type of prefetch

  2. Try to process a whole cache line (32 bytes) in one iteration

  3. Unroll the loops as necessary

  4. Make sure the CPU has some work to do while the data is being prefetched (i.e., don’t try to use the data right away)

  5. Treat the prefetch execution like a memory read when scheduling code.
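The rules above can be sketched in C as a loop that scales an array, prefetching ahead with a non-temporal hint and writing the results with streaming stores. Several things here are our assumptions rather than anything from the processor documentation: the function name scale_stream, the requirement that both arrays are 16-byte aligned and n is a multiple of 8, and the look-ahead distance of four cache lines, which is a guess that would need tuning by measurement.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Scale n floats from src into dst.
   Assumed: src and dst are 16-byte aligned, n is a multiple of 8. */
void scale_stream(float *dst, const float *src, size_t n, float k)
{
    const __m128 vk = _mm_set1_ps(k);
    size_t i;

    for (i = 0; i < n; i += 8) {
        /* Hint only: request a line we will need a few iterations from
           now, so the CPU has work to do while the data arrives. */
        if (i + 32 < n)
            _mm_prefetch((const char *)(src + i + 32), _MM_HINT_NTA);

        /* Unrolled by two so each iteration consumes one 32-byte line. */
        _mm_stream_ps(dst + i,     _mm_mul_ps(_mm_load_ps(src + i),     vk));
        _mm_stream_ps(dst + i + 4, _mm_mul_ps(_mm_load_ps(src + i + 4), vk));
    }
    _mm_sfence();   /* order the streaming stores before later stores */
}
```

The streaming stores (movntps) keep the output from displacing cached data, and the NTA hint matches the "data will be used once" row of Table 1.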

Branching

As processor pipelines get deeper and deeper, branch mispredictions become more and more costly. There are a couple things you can do to deal with this problem. First, try to follow the branch prediction rules for the processor. The Pentium® III processor branch prediction rules are the same as the Pentium® II processor (see Sidebar 1, "Understanding the Pentium® II Processor"). Second, you can simply remove the branch where appropriate. Take the following example where we’re using logical instructions to remove branches:

C++:

a = (a < b) ? c : d; // Only doing a single compare here, and there is a branch.

Assembly:

cmpps xmm0, xmm1, 1
;4 compares ("a" and "b") w/ one instruction –
;creates mask. This is also the beginning of the
;branch removal.

movaps xmm2, xmm0
;Save a copy of the mask.

andps xmm0, xmm3
;and(mask, c) | andnot(mask, d)

andnps xmm2, xmm4
;Where c=xmm3 and d=xmm4

orps xmm0, xmm2
;Final result as in the above C++ statement, but 4X.
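The same mask-and-merge pattern can be written in C with intrinsics. This is a sketch with a made-up function name (select_lt); the intrinsics themselves correspond one-to-one with the cmpps, andps, andnps and orps instructions above.

```c
#include <xmmintrin.h>

/* Per element: out = (a < b) ? c : d, with no branch. */
void select_lt(const float a[4], const float b[4],
               const float c[4], const float d[4], float out[4])
{
    /* All-ones in each element where a < b, all-zeros elsewhere. */
    __m128 mask   = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 from_c = _mm_and_ps(mask, _mm_loadu_ps(c));     /* mask & c  */
    __m128 from_d = _mm_andnot_ps(mask, _mm_loadu_ps(d));  /* ~mask & d */
    _mm_storeu_ps(out, _mm_or_ps(from_c, from_d));
}
```

Because both candidate values are computed for every element, this approach pays off when c and d are cheap to produce and the branch would otherwise be hard to predict.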

Or, say you wanted to simply perform a clamp on an angle. You could use either the MINPS or MAXPS instruction to apply the clamp to four values at once. Here we’re using MINPS to clamp a vector to one (1.0f).

C++:

a = (a > b) ? b : a
; Only doing ONE compare here AND there is a branch.

Assembly:

minps xmm0, xmm1
;Where xmm0 = a; xmm1 = b = 1.0f;
;4 compares ("a" and "b") w/ one instruction
;Final result as in the above C++ statement, but 4X.
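In C, the clamp is a single intrinsic call. The wrapper below, clamp_max, is a hypothetical helper written for this illustration.

```c
#include <xmmintrin.h>

/* Clamp four values to an upper limit with one minps. */
void clamp_max(const float a[4], float limit, float out[4])
{
    _mm_storeu_ps(out, _mm_min_ps(_mm_loadu_ps(a), _mm_set1_ps(limit)));
}
```

Calling clamp_max with a = {0.25, 1.5, 1.0, -2.0} and a limit of 1.0 should give {0.25, 1.0, 1.0, -2.0}. A lower bound can be applied the same way with _mm_max_ps.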

Type Conversion

Converting color values from floating point to integer values is a common requirement in a 3D application pipeline. Yet using simple high level language casts, such as intvalR = (int)fpvalR;, typically isn’t the most efficient way to do this. This ties you to the library implementation of the cast, which might be slow. These conversions usually happen for at least one color value (RGB) per vertex, and in a realistic lighting scheme, such values are often calculated multiple times. Over the course of drawing large models in the screen coordinate space, the inefficiencies can accumulate and slow down a game’s frame rate.

If you lay out your data in the representation described using the shuffle above (4-wide parallelism), you will be faced with the problem of converting four parallel floating-point numbers contained in the XMM registers into integer values quickly to pass to the rasterizer. By cleverly using the Streaming SIMD Extensions conversion instructions and MMX technology, a more efficient method for doing such conversions can be achieved. The following example shows how to convert four floating-point diffuse color values to integer values intermixing the Streaming SIMD Extensions and MMX technology:

// Where xmm2 = r3r2r1r0;
// xmm5 = g3g2g1g0; xmm1 = b3b2b1b0
// Each within range 0.0 - 255.0

cvtps2pi mm0, xmm2
// convert (int)r1,(int)r0 & store in mm0

shufps xmm2, xmm2, 0xEE
// broadcast r3,r2,r3,r2 on itself

cvtps2pi mm3, xmm2
// convert (int)r3,(int)r2 & store in mm3

cvtps2pi mm1, xmm5
// convert (int)g1,(int)g0 & store in mm1

shufps xmm5, xmm5, 0xEE
// broadcast g3,g2,g3,g2 on itself

cvtps2pi mm4, xmm5
// convert (int)g3,(int)g2 & store in mm4

cvtps2pi mm2, xmm1
// convert (int)b1,(int)b0 & store in mm2

shufps xmm1, xmm1, 0xEE
// broadcast b3,b2,b3,b2 on itself

cvtps2pi mm5, xmm1
// convert (int)b3,(int)b2 & store in mm5

 

// Now use the logical MMX instructions
// to correctly format the data

pslld mm0, 0x10
// shift r1<<16,r0<<16

pslld mm3, 0x10
// shift r3<<16,r2<<16

pslld mm1, 0x08
// shift g1<<8,g0<<8

pslld mm4, 0x08
// shift g3<<8,g2<<8

por mm0, mm1
// bitwise OR red(0,1) and green(0,1) & store in mm0

por mm0, mm2
// bitwise OR result with blue(0,1) & store in mm0

por mm3, mm4
// bitwise OR red(2,3) and green(2,3) & store in mm3

por mm3, mm5
// bitwise OR result with blue(2,3) & store in mm3

You could now perform the same operations on the specular light element, and then use the unpack instructions to interleave the colors for processing by the graphics card.
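To show the data layout the listing above produces, here is a simplified C sketch. It is not a translation of the MMX code: it converts one value at a time with the scalar cvtss2si intrinsic and does the shift-and-OR step in ordinary integer code, so it avoids the MMX registers entirely. The function name pack_rgb4 and the 0x00RRGGBB output format are assumptions made for this example.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Pack four colors, stored as separate r, g and b arrays, into
   0x00RRGGBB words. Inputs are assumed to lie in 0.0 - 255.0. */
void pack_rgb4(const float r[4], const float g[4], const float b[4],
               uint32_t out[4])
{
    int i;
    for (i = 0; i < 4; ++i) {
        /* cvtss2si converts using the current rounding mode rather
           than going through a library routine for the cast. */
        uint32_t ri = (uint32_t)_mm_cvtss_si32(_mm_set_ss(r[i]));
        uint32_t gi = (uint32_t)_mm_cvtss_si32(_mm_set_ss(g[i]));
        uint32_t bi = (uint32_t)_mm_cvtss_si32(_mm_set_ss(b[i]));

        out[i] = (ri << 16) | (gi << 8) | bi;   /* same shifts as pslld/por */
    }
}
```

The two-at-a-time cvtps2pi version above should be faster; this sketch is only meant to make the bit layout of the result easy to follow.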

State Management

With all these changes to the internal workings of the processor, what about operating system support? The new registers must be saved and restored across context switches. In other words, the operating system must be aware of how to save and restore the information contained in the processor state through task switches. The new status/control register can be loaded with the LDMXCSR and FXRSTOR instructions and stored in memory with the STMXCSR and FXSAVE instructions.

All of the following operating systems support this new state: Windows 98, Windows 2000, and Windows NT 4.0 (with Service Pack 4 or greater, plus a special driver which can be found at http://www.intel.com).

The architectural changes discussed in this article, including the new registers, new instructions, and faster clock speeds, all enable developers to create the best-looking titles around. Take the opportunity to develop to these new features and we’ll all see titles with faster frame rates, more polygons per frame, deeper color depths, and more realistic lighting.

Hopefully, this article has planted some ideas in your head as to how your code can benefit from the Pentium III processor. We hope you can take this information and push the envelope of technology in real-time games and multimedia applications further than ever before.

Pete Baker and Kim Pallister are Senior Technical Marketing Engineers in Intel’s Developer Relations Group, where they evangelize processor architectures and optimization techniques to the development community. They can be reached at [email protected] and [email protected]

Understanding the Pentium II Processor

With the launch of the Pentium Pro processor, dynamic execution technology was introduced into the Pentium family. Dynamic execution means that the processor is able to predict program flow, analyze the data path, and speculatively execute instructions out-of-order. The processor does this using multiple branch prediction and dynamic data flow analysis.

Multiple branch prediction is the processor’s ability to predict the outcome of branches many instructions ahead of the instruction pointer. Dynamic data flow analysis is the processor’s ability to determine instructions’ data and register dependencies, which helps the processor spot opportunities for executing instructions out of order. Both of these techniques are integral to the processor’s ability to do speculative execution, the ability to execute instructions out of order, but then retire them back in the correct order. With this in mind, here’s a "Top Five" list of optimization tips for the Pentium family:

1. Memory optimizations

When you sit down to begin optimizing your code for the Pentium II processor, the first thing you need to consider is your algorithm’s memory usage. Here are some simple tips you can use when looking into doing memory optimizations for the Pentium II:

Align data according to the data type. That is, WORDs on two-byte boundaries, DWORDs on four-byte boundaries, QWORDs on eight-byte boundaries, TBYTEs on 16-byte boundaries, and arrays to the beginning of a cache line (32 bytes).

Remember that write misses allocate and fill cache lines.

Read and write sequentially for maximum performance.

2. Minimize mispredicted branches.

After predicting a branch, the processor begins fetching and decoding instructions. If the branch was mispredicted, then the processor will be fetching and decoding the wrong instructions (obvious enough).

Additionally, before the processor can begin fetching and decoding the correct ones, the entire front-end of the processor (fetch, decode, and decoded micro-ops – up to 40 of them!) must be flushed. By understanding the rules by which the processor performs branch prediction, you can structure your code branches so that they will be accurately predicted more often. You can make this happen by remembering the following:

If the branch has never been seen before (static branch prediction),

  • Forward conditional branches will be predicted as not taken (jz @forward).

  • Backward conditional branches will be predicted as taken (jnz @backward).

  • Unconditional direct branches are always taken (jmp label).

  • Unconditional indirect branches fall through (jmp [eax]).

If the branch has been seen before (dynamic branch prediction),

  • The prediction is based on the branch's history in the Branch Target Buffer (BTB).

  • If you can guess from the history what the branch will be, so can the processor (0101 -- taken, not taken, taken, not taken).

For example, a switch statement based on application data is an unconditional indirect branch, and cannot be predicted. If you have a case that you know will occur, say, 50% of the time, you are best off placing that in an if statement, and placing the remainder of the switch in an else. Additionally, you can use the Pentium II processor’s conditional move instructions to reduce the number of branches in your code. Structuring your code with these rules in mind will help you avoid mispredicted branches.
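The switch advice might look like this in C. The message codes and return values are invented purely for illustration, and whether a given compiler actually turns the switch into an indirect jump depends on the compiler and the case values.

```c
/* Hypothetical message codes, for illustration only. */
enum { MSG_MOVE = 0, MSG_FIRE = 1, MSG_JUMP = 2, MSG_QUIT = 3 };

int handle_message(int msg)
{
    /* The common case gets its own conditional branch, which the
       processor can predict from the branch's history. */
    if (msg == MSG_MOVE)
        return 10;

    /* The rarer cases stay in the switch, which a compiler may
       implement as an indirect jump through a table. */
    switch (msg) {
    case MSG_FIRE: return 20;
    case MSG_JUMP: return 30;
    case MSG_QUIT: return 40;
    default:       return -1;
    }
}
```

Profiling real input is the only reliable way to find out which case deserves the if statement.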

3. Avoid Processor stalls.

The processor’s execution can be hampered by a number of types of stalls. Understanding them can help you avoid them in your code. Some examples of stalls are:

Store address unknown. If the destination address of the store is unknown, all subsequent loads will stall because the processor doesn’t know if the store address will be the same as the load address.

Partial register stall. A partial register stall occurs when a large register (eax) is read after one of its partial registers (for instance, al) is written. Because of the register renaming performed by the Pentium II, the read of the large register has to wait until the write to the partial register is retired.

Serializing. Serializing instructions cause all execution to be delayed until all previous instructions have executed and been retired. Serializing instructions include: read-modify-write instructions with a lock prefix, XCHG, CPUID, FLDCW, and any move to the control or debug registers.

4. Use a good blend of instructions

The Pentium II and Pentium III processors have five execution ports that are each responsible for sending micro-op instructions sitting in the re-order buffer to their respective execution units. There is one port responsible for loads, two for stores, and two for arithmetic instructions. In order to keep all of the execution units happily working away, it is a good idea to use a good mix of different instructions.

5. Optimize for the front end

While your code may keep the execution units busy, it may still bottleneck at the instruction fetch and decode units. The fetch unit is responsible for fetching instructions from the instruction cache and sending 16 bytes at a time to the decoders. To help the fetch unit, you should:

  • Keep targets to branches aligned on 16-byte boundaries

  • Keep your instructions short so that you can fit at least three instructions into each 16-byte chunk.

The decoder units are responsible for decoding the instructions into micro-ops for the execution units to digest. The Pentium II and Pentium III have three instruction decoders: two decoders that can handle instructions one micro-op in size, and one decoder that handles instructions decoded to four or fewer micro-ops. To help the decoders, you should use simple instructions, as these are less likely to require the large instruction decoder. Using many complex instructions together can bottleneck the one large instruction decoder and leave the other two sitting idle.

SIMD Explained

The motive for a parallel instruction set is simple: whereas performing one operation at a time is good, doing four at once is usually better. The premise behind "single instruction, multiple data" (SIMD) is that certain applications (specifically multimedia, video, and 3D graphics) can be accelerated greatly if specific arrays of data common to these applications are executed quickly in parallel. The majority of the new instructions use this technique.

Figure 2. SIMD processing
