

Rob Wyatt, Blogger

May 28, 1999


Editor’s Note: Welcome to a new column about game programming, written by Rob Wyatt, a programmer at DreamWorks Interactive. In Wyatt’s World, Rob will look at the details of programming games, in a format that revolves around you, the reader. It’s going to require your input and ideas as readers, so if you have programming questions or topics you’d like to see addressed here in Wyatt’s World, let him know at [email protected].

And since it’s a new column, Rob’s decided that he may as well cover a new processor, Intel’s Pentium III. He’ll attempt to answer some common questions, and provide some background information and working examples.

What is all the fuss about with the Pentium III?

This new processor contains 70 new multimedia instructions, or "Streaming SIMD Instructions" as Intel would like them to be called. Some of these new SIMD (Single Instruction Multiple Data) instructions provide an extension to MMX, and like the existing MMX instructions, they are integer-based. However, of real interest to game developers who work with 3D graphics and physics are the SIMD floating-point instructions. Gamasutra has already published a couple of articles on the subject, including "Implementing a 3D SIMD Geometry and Lighting Pipeline" ( http://www.gamasutra.com/view/feature/3331/implementing_a_3d_simd_geometry_.php) and "Optimizing Games for the Pentium III Processor" ( http://www.gamasutra.com/view/feature/3323/optimizing_games_for_the_pentium_.php).

The new Pentium III Streaming SIMD instructions are functionally similar to the instructions AMD added with its 3DNow! instruction set, but the Pentium III instructions are implemented substantially differently. The K6-2 is also a SIMD device, but it only operates on two floating-point numbers at once, whereas the Pentium III operates on four. On the K6-2, the pair of 32-bit floating-point values is held within one of the 64-bit MMX registers, which, as everybody knows, are aliased onto the floating-point registers. To use the 3DNow! instructions (which are in effect an extension to MMX), the processor has to operate in MMX mode, and along with MMX come all of the associated restrictions, such as no FPU and the overhead of the EMMS instruction when the FPU is required. With the Pentium III, Intel has solved the problem of register aliasing and allowed wider registers by adding eight new registers, called XMM0 to XMM7. Each register is 128 bits wide and holds four IEEE 32-bit floats. Fortunately, the SIMD registers can be used while the processor is in floating-point or MMX mode, although the latter works better in practice: you may experience scheduling problems in the processor if you try to interleave x87 floating-point instructions with SIMD instructions.

The execution speed of the new SIMD instructions is good. For example, the MULPS instruction (Multiply Packed Scalar, which means it independently multiplies all four elements of the register) has a latency of five cycles and a throughput of one instruction every two cycles, which equates to two floating-point multiplies per clock cycle (see Figure 1). This throughput is typical of most of the floating-point SIMD instructions; the only real exceptions are the full-precision divide and square root, which take a whopping 36 and 58 cycles, respectively! Fortunately, there are instructions which approximate the results of both the reciprocal and the reciprocal square root, each accurate to 12 bits of mantissa, and these instructions take only two cycles. This makes normalizing vectors a little faster.

Without these new instructions, the Pentium III is functionally identical to the Pentium II, and at the same clock speed, there is no difference in performance. With this in mind, it stands to reason that only applications that take advantage of the new instructions will benefit from a Pentium III. However, using the right algorithms, the benefit of the Pentium III can be huge, and modern games use many such algorithms.

How do I detect the new instructions?

You detect the Streaming SIMD Instructions in the same way as you detect MMX. You issue the CPUID instruction with EAX=1 and check the SIMD bit (bit 25) in the feature flags (EDX). Don’t forget to first check for the presence of the CPUID instruction! Here is code for detecting the new instructions:

bool DetectSIMD()
{
    bool found_simd;
    _asm
    {
        pushfd
        pop eax              // get EFLAGS into eax
        mov ebx,eax          // keep a copy
        xor eax,0x200000     // toggle the CPUID bit
        push eax
        popfd                // set new EFLAGS
        pushfd
        pop eax              // EFLAGS back into eax
        xor eax,ebx          // have we changed the ID bit?
        je NO_SIMD           // no - there is no CPUID instruction
        // the ID bit toggled, so CPUID is present
        mov eax,1
        cpuid                // get processor features
        test edx,1<<25       // check the SIMD bit
        jz NO_SIMD
        mov found_simd,1
        jmp DONE
    NO_SIMD:
        mov found_simd,0
    DONE:
    }
    return found_simd;
}

This function simply returns true when SIMD instructions are present. Ideally, you should always protect detection code with a __try/__except block, so if things go wrong you do not quit the application with an illegal opcode. Here’s how to implement the __try/__except block:

bool simd_present = false;
__try
{
    simd_present = DetectSIMD();
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
    // If DetectSIMD() generates any sort of exception we end
    // up here, and we assume there are no SIMD instructions.
    simd_present = false;
}

The above code will function correctly on any make of processor. Note that even if you detect the presence of a Pentium III (Family 6, Model 7), you should not assume that it supports Streaming SIMD instructions -- always check the feature bit. Other manufacturers such as AMD or Cyrix may release processors with Streaming SIMD support, and future Intel processors may not support it; relying on processor identity rather than the feature bit could jeopardize your game.

Be aware that getting the above code to compile can be a problem if you are using Microsoft Visual C++. If you put the __try/__except block around the assembly listing inside the function and then call it using bool simd_present = DetectSIMD(), Visual C++ 6.0 will not allow it. Visual C++ balks at the fact that the destinations of the jump instructions are within a __try block, even though the sources of the jumps are within the same block. This appears to be a limitation of inline assembly language within __try/__except blocks, but fortunately the problem is remedied if you use the Intel C/C++ reference compiler in conjunction with Visual C++.

What operating system support is required for the Pentium III?

Both MMX and 3DNow! were aliased onto the floating-point registers so that no additional processor state had to be introduced, which meant all existing operating systems could continue to work without modification. However, the eight new extended multimedia registers found in the Pentium III add state to the processor, and the operating system must be aware of this when switching tasks. To help, Intel added two new instructions, FXSAVE and FXRSTOR, which save and restore a whopping 512 bytes of state. Within these 512 bytes are the SIMD registers, the floating point/MMX registers and various control registers. These new instructions actually first appeared in late-model Pentium II processors, and their presence is indicated by the FXSR bit (bit 24) in the CPUID feature flags.

So, all we need now is an operating system that is aware of the new save and restore state instructions. Fortunately, Windows 95 OSR2, Windows 98, Windows NT 4.0 with Service Pack 4 and Windows NT 5.0 (beta 2) are aware of and use these new state instructions, so support problems should be minimal. Windows NT 4.0 requires a special driver in addition to Service Pack 4; the driver is available at http://developer.intel.com, and Service Pack 4 is available from http://www.microsoft.com or through MSDN.

Detecting operating system support for Streaming SIMD Instructions is less straightforward than it sounds. If the operating system supports the new state management instructions, it sets the OSFXSR bit (bit 9) in control register 4. While this might seem an ideal way to detect the necessary operating system support, the problem is that control registers are out of bounds for general applications -- they can only be accessed from Ring 0. So how do you detect operating system support? Dropping into Numega's SoftICE and executing the CPU command will verify the CR4 setting on your development system. At run time you have to rely on behavior: assuming you have already detected the presence of the SIMD instructions on the chip, if the OSFXSR bit is not set, the SIMD instructions will generate invalid opcode exceptions, and that alerts you to the fact that the operating system lacks support.

The subject of exceptions brings me to the last operating system support issue. Just like floating point instructions, the SIMD instructions can generate exceptions for cases such as divide by zero, inexact result, overflow, and so on. I recommend disabling SIMD exceptions within your shipping code, since the SIMD unit will provide reasonable values in situations where an exception would occur. However, in development code, it is useful to enable exceptions to see where (or if) they are happening and if they are important.

How do you detect the new exceptions? Unfortunately, SIMD exceptions are not easily detected in current versions of Windows. The processor can take two paths when a SIMD instruction generates an exception: either a new exception (protected mode interrupt #19 decimal) is raised, or an invalid opcode is signaled. Which path is taken depends on the state of the OSXMMEXCEPT bit (bit 10) in control register 4. If the operating system supports the new exception, it should set this bit; otherwise, it has to leave it clear. No version of Windows except Windows NT 5.0 beta 2 can handle this exception, so SIMD exceptions will appear as invalid opcodes. Perhaps this will change with the upcoming second edition of Windows 98. If the operating system supports SIMD exceptions, the abstract exception passed through to Win32 applications is known as STATUS_FLOAT_MULTIPLE_FAULTS. Regardless of the exception generated, it is not trivial for an application to determine which of the floating-point values within a SIMD register caused the exception. The listing below returns a bool indicating whether or not the operating system supports SIMD (it requires the Visual C++ macros described later in this article).

bool DetectOSSupport()
{
    bool support = true;
    __try
    {
        _asm
        {
            // Execute a Streaming SIMD instruction and
            // see if an exception occurs.
            ADDPS(_XMM0,_XMM1)
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        // We should really check the reason for the exception
        // in case it is not an illegal instruction, but any
        // other exception is very unlikely.
        support = false;
    }
    return support;
}

Detecting support for exceptions is difficult because of the need to change the SIMD control register. The listing below returns a Boolean representing operating system exception support. Again, this function requires the Visual C++ macros, which are described shortly.

bool DetectExceptionSupport()
{
    bool exception_support = true;
    float test_val[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    DWORD control;
    __try
    {
        _asm
        {
            // Enable divide-by-zero exceptions by clearing
            // bit 9 in the SIMD control register.
            push ebp
            lea ebp,control
            STMXCSR
            and DWORD PTR [ebp], 0fffffdffh
            LDMXCSR
            pop ebp

            // Clear XMM0 (all bits being 0 is 0.0 in floating
            // point), then divide by it.
            lea eax,test_val
            XORPS (_XMM0,_XMM0)
            MOVUPS (_XMM1,EAX_PTR)
            DIVPS (_XMM1,_XMM0)
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        // The divide by zero above caused an illegal
        // instruction exception, so the OS must not
        // support SIMD exceptions.
        if (_exception_code() == STATUS_ILLEGAL_INSTRUCTION)
        {
            exception_support = false;
        }
    }
    _asm
    {
        // Disable the divide-by-zero exception again.
        or control, 0x200
        push ebp
        lea ebp,control
        LDMXCSR
        pop ebp
    }
    return exception_support;
}

More robust versions of the above functions and the DetectSIMD() function from the previous question have been put into an easy-to-use C++ class for your convenience, and are listed in Detect.cpp and Detect.h. An example using this class is provided in DetectExample.cpp. A pre-built version of this test application is available as DETECT.EXE.

 

What are these new SIMD instructions?

The tables below cover all the new Streaming SIMD Instructions for floating-point and integer operations. The integer Streaming SIMD instructions are actually extensions to MMX; they work the same way as the existing MMX instructions and use the same registers. All the floating-point operations come in two forms: a packed format indicated by instructions ending in "PS", and a single format indicated by instructions ending in "SS". The PS instructions perform operations on each of the four floating-point elements within an XMM register (Figure 1), whereas the SS instructions operate only on the bottom float, leaving the others untouched (Figure 2). The data is stored within XMM registers in right-to-left order, so the value on the right-hand side is the least significant 32 bits. Note that this can be confusing if you store a vector in memory as [x,y,z,w], because in a register it appears as [w,z,y,x].

XMM0:  [ 8.0,  6.0,  4.0,  2.0]
   *   [ 3.0,  5.0,  7.0,  9.0]   (XMM1)
   =   [24.0, 30.0, 28.0, 18.0]   (XMM0)

Figure 1. Example of the MULPS xmm0,xmm1 instruction

 

XMM0:  [8.0, 6.0, 4.0,  2.0]
   *   [3.0, 5.0, 7.0,  9.0]   (XMM1 -- only the low element is multiplied)
   =   [8.0, 6.0, 4.0, 18.0]   (XMM0)

Figure 2. Example of the MULSS xmm0,xmm1 instruction

 

The next tables show the various Streaming SIMD operations. The column on the far right of each table is the issue (throughput) rate for the instruction: ADDPS, for example, can be issued every two cycles (it has a latency of four cycles). Unfortunately, there is a little more to scheduling than these simple timings, because execution port and resource usage must be taken into account, but these numbers give you a rough idea. For more information on decode scheduling, see the latest Intel optimization reference manual, available at http://developer.intel.com.

The "Src" and "Dst" columns in the following tables show the possible locations for the source and destination operands of the various instructions. The following symbols are used:

Xmm (Floating point SIMD Multimedia register)

Mmx (Integer MMX Multimedia register)

Mem (Memory address/Indirect address)

Reg (x86 integer register)

 

Mathematical operations

  Instruction  Description                            Dst   Src       Issue
  ADDPS        Add packed scalar                      Xmm   Xmm/Mem   2
  ADDSS        Add single scalar                      Xmm   Xmm/Mem   1
  SUBPS        Subtract packed scalar                 Xmm   Xmm/Mem   2
  SUBSS        Subtract single scalar                 Xmm   Xmm/Mem   1
  MULPS        Multiply packed scalar                 Xmm   Xmm/Mem   2
  MULSS        Multiply single scalar                 Xmm   Xmm/Mem   1
  DIVPS        Divide packed scalar                   Xmm   Xmm/Mem   38
  DIVSS        Divide single scalar                   Xmm   Xmm/Mem   18
  SQRTPS       Square root packed scalar              Xmm   Xmm/Mem   58
  SQRTSS       Square root single scalar              Xmm   Xmm/Mem   30
  RCPPS        Reciprocal packed scalar               Xmm   Xmm/Mem   2
  RCPSS        Reciprocal single scalar               Xmm   Xmm/Mem   2
  RSQRTSS      Reciprocal square root single scalar   Xmm   Xmm/Mem   2
  RSQRTPS      Reciprocal square root packed scalar   Xmm   Xmm/Mem   2
  MAXPS        Maximum packed scalar                  Xmm   Xmm/Mem   2
  MAXSS        Maximum single scalar                  Xmm   Xmm/Mem   1
  MINPS        Minimum packed scalar                  Xmm   Xmm/Mem   2
  MINSS        Minimum single scalar                  Xmm   Xmm/Mem   1

 

Conversion operations

  Instruction  Description                                             Dst   Src       Issue
  CVTPI2PS     Convert packed integer to packed scalar                 Xmm   Mmx/Mem   1
  CVTSI2SS     Convert single integer to single scalar                 Xmm   Reg/Mem   2
  CVTPS2PI     Convert packed scalar to packed integer                 Mmx   Xmm/Mem   1
  CVTSS2SI     Convert single scalar to single integer                 Reg   Xmm/Mem   1
  CVTTPS2PI    Convert packed scalar to packed integer, with truncate  Mmx   Xmm/Mem   1
  CVTTSS2SI    Convert single scalar to single integer, with truncate  Reg   Xmm/Mem   1

 

Move operations

  Instruction     Description                                          Dst   Src       Issue
  MOVAPS (load)   Move from aligned memory to XMM register             Xmm   Mem       2
  MOVAPS (reg)    Move XMM register to XMM register                    Xmm   Xmm       1
  MOVAPS (store)  Store from XMM register to aligned memory            Mem   Xmm       2
  MOVUPS (load)   Load from unaligned memory to XMM register           Xmm   Mem       2
  MOVUPS (store)  Store from XMM register to unaligned memory          Mem   Xmm       3
  MOVSS (load)    Load single scalar                                   Xmm   Mem       1
  MOVSS (reg)     Move single scalar                                   Xmm   Xmm       1
  MOVSS (store)   Store single scalar                                  Mem   Xmm       1
  MOVMSKPS        Move MSBs of packed scalars to integer register      Reg   Xmm       1
  MOVLHPS         Move low 2 packed scalars to high position           Xmm   Xmm       1
  MOVHLPS         Move high 2 packed scalars to low position           Xmm   Xmm       1
  MOVLPS (load)   Load 2 packed scalars to low position                Xmm   Mem       1
  MOVLPS (reg)    Move 2 packed scalars in low position                Xmm   Xmm       1
  MOVLPS (store)  Save 2 packed scalars in low position to memory      Mem   Xmm       1
  MOVHPS (load)   Load 2 packed scalars to high position               Xmm   Mem       1
  MOVHPS (reg)    Move 2 packed scalars in high position               Xmm   Xmm       1
  MOVHPS (store)  Save 2 packed scalars in high position to memory     Mem   Xmm       1
  MOVNTPS         Store XMM register to aligned memory, non-temporal   Mem   Xmm       2
  SHUFPS          Shuffle single scalars within packed                 Xmm   Xmm/Mem   2
  UNPCKLPS        Unpack low                                           Xmm   Xmm/Mem   2
  UNPCKHPS        Unpack high                                          Xmm   Xmm/Mem   2

 

Compare operations

  Instruction  Description                                      Dst   Src       Issue
  CMPPS        Compare packed scalar                            Xmm   Xmm/Mem   2
  CMPSS        Compare single scalar                            Xmm   Xmm/Mem   1
  COMISS       Compare single scalar and set EFLAGS             --    Xmm/Mem   1
  UCOMISS      Unordered compare single scalar and set EFLAGS   --    Xmm/Mem   1

 

Logical operations

  Instruction  Description                  Dst   Src       Issue
  ANDNPS       And Not packed scalar        Xmm   Xmm/Mem   2
  ANDPS        And packed scalar            Xmm   Xmm/Mem   2
  ORPS         Or packed scalar             Xmm   Xmm/Mem   2
  XORPS        Exclusive or packed scalar   Xmm   Xmm/Mem   2

 

Memory operations

  Instruction  Description                              Dst   Src   Issue
  PREFETCHT0   Prefetch using T0 hint                   --    Mem   1
  PREFETCHT1   Prefetch using T1 hint                   --    Mem   1
  PREFETCHT2   Prefetch using T2 hint                   --    Mem   1
  PREFETCHNTA  Prefetch using NTA hint (non-temporal)   --    Mem   1
  SFENCE       Store fence                              --    --    1

 

Integer/MMX operations

  Instruction  Description                          Dst   Src       Issue
  PSHUFW       Packed shuffle word                  Mmx   Mmx/Mem   1
  PEXTRW       Extract word                         Reg   Mmx       2
  PINSRW       Insert word                          Mmx   Reg/Mem   1
  PMINUB       Packed minimum unsigned byte         Mmx   Mmx/Mem   ½
  PMINSW       Packed minimum signed word           Mmx   Mmx/Mem   ½
  PMAXUB       Packed maximum unsigned byte         Mmx   Mmx/Mem   ½
  PMAXSW       Packed maximum signed word           Mmx   Mmx/Mem   ½
  PMOVMSKB     Move byte mask to integer register   Reg   Mmx       1
  PSADBW       Packed sum of absolute differences   Mmx   Mmx/Mem   2
  PAVGW        Packed average word                  Mmx   Mmx/Mem   ½
  PAVGB        Packed average byte                  Mmx   Mmx/Mem   ½
  PMULHUW      Packed multiply high                 Mmx   Mmx/Mem   1
  MOVNTQ       Move QWORD non-temporal              Mem   Mmx       1
  MASKMOVQ     Byte mask write                      Mmx   Mmx       1

 

Control operations

  Instruction  Description                              Dst   Src
  FXSAVE       Store extended state (FP/MMX and SIMD)   Mem   --
  FXRSTOR      Load extended state (FP/MMX and SIMD)    --    Mem
  LDMXCSR      Load 32 bits of SIMD status/control      --    Mem
  STMXCSR      Store 32 bits of SIMD status/control     Mem   --

What disappoints me about this instruction set is that there are no instructions that combine the elements within a register to calculate, for instance, a dot product. Although a dot product can be calculated by shuffling, a dedicated dot product instruction would have been very useful.

There has been talk on the Internet that a thermal noise random number generator is present within the Pentium III. Although this would be very useful, I cannot find any trace of it. If you know anything about it, let me know.

How do I make use of the new instructions?

The best way to write code for Pentium III is to use version 4.0 of the Intel C/C++ compiler. This compiler, which comes with Intel’s VTune, is a replacement for the Microsoft C/C++ compiler that Visual C/C++ uses. The advantage of using the Intel compiler is that you can still use the IDE, debugger, linker and tools that you are familiar with, and there is no learning curve. If you prefer C++, the Intel compiler is a much better implementation of the language than Microsoft’s version.

The Intel compiler supports the new Pentium III instructions through its vectorizing code generator, so it can generate SIMD instructions automatically. The new instructions are also supported within the inline assembler, and if you don’t want to code in assembly, there is a new Intel compiler-specific SIMD data type called __m128 and a set of intrinsic C functions, so optimized code can be developed without assembly language. For every SIMD instruction, there is an intrinsic function that does the same thing. For example,

__m128 _mm_add_ps(__m128, __m128)

adds the two specified SIMD values together and can compile to a single instruction. The __m128 type can be made into a union with an array of floats if access to the individual floats is required. The only caveat is that this operation requires the __m128 to be in memory, because there are no instructions to move floating-point data between the SIMD and x87 registers. Intel also tries to maintain code portability via an abstract class called F32vec4. This is a C++ class which has inline member functions built on the intrinsic C functions. When used, this class generates exactly the same code as the intrinsic functions, so an application’s performance should not be affected. The benefit of using the class is that it can be re-implemented using x87 (or even AMD’s 3DNow!) without having to change your source code. For full documentation of the intrinsic types, the C++ classes and the vectorizing compiler options, look at appendix C in volume 2 of the Intel Architecture Reference Manual, or the Intel optimization reference manual, both of which are available from Intel’s developer web site or on the VTune CD-ROM.

I assume most professional game programmers use Visual C++, and many of them may not want to change compilers just to access the new instructions. If you are in this position, all is not lost. Here are a couple of options for using the new instructions within Visual C++.

First, the Intel compiler produces object files that are binary compatible with Visual C++, including C++ name mangling. With this in mind, you can separate the individual C functions or C++ class members that require Pentium III optimizations into separate source files and compile them with the Intel compiler. To make switching compilers even easier, Intel implemented a #pragma that lets you select the compiler on a source-file basis.

Alternatively, Microsoft has updated MASM via a patch to include these new instructions. The latest MASM version is 6.14, and the patch will update versions 6.11, 6.11a, 6.11d, 6.12 and 6.13. The ML614.exe patch is available from the Microsoft web site at http://support.microsoft.com/download/support/mslfiles/ml614.exe. Intel also provides an include file for MASM, called IAXMM.INC and available from Intel’s developer web site, that defines macros for the Pentium III instructions; this include file works with all versions of MASM. If your build environment includes MASM, one of these options may be the way to go.

The MASM include file inspired me to build a set of macros that emit the opcode bytes directly into the code stream, thereby allowing any compiler to use the Pentium III instructions. This turned out to be a little more difficult than I anticipated, mainly because of inline assembly code restrictions. The instruction macros I provide here are not ideal, but they do the job, and for small sections of inline assembly code they are perfectly adequate and can make a huge difference.

To create these macros, I first defined the register names and their respective values. The standard register names are reserved words, so they cannot be used. Further, the SIMD register names will be reserved words in a future version of Visual C++ (they are already reserved words in the Intel compiler), so it’s better not to use them either. In the end, I decided to call the SIMD registers _XMM0 to _XMM7, and the MMX registers _MM0 to _MM7. The integer registers have two forms, depending on whether or not they are used as address pointers. The pointer versions of the standard registers are called EAX_PTR, EBX_PTR, and so on, and the register versions are called EAX_REG, EBX_REG, and so on.

Opcode 0f 58 /r is the ADDPS instruction, where "/r" means the source operand is either a register or a memory pointer. Fortunately, only registers can be destinations for most SIMD instructions, so there are only two forms of each instruction. Looking at the encoding of the "/r" component, you’ll notice that it’s a standard "mod/rm-sib-offset" encoding (for example, [eax*2+ebx+offset]), just like any other instruction. With this in mind, the register-to-register version of the instruction (addps xmm,xmm) becomes trivial to encode, because the one-byte "/r" component (the mod/rm byte) is laid out as follows:

  Bit 7   Bit 6   Bits 5-3             Bits 2-0
  1       1       Dst XMM register     Src XMM register

The following macro assembles the instruction:

#define ADDPS_REG(dst,src) \
{ \
    _asm _emit 0x0f \
    _asm _emit 0x58 \
    _asm _emit 0xc0 | ((dst)<<3) | (src) \
}

This would simply be used as ADDPS_REG(_XMM0,_XMM1) from either inside or outside of an assembly code block.

The register-to-register form of the instructions is of no use unless we can also use the memory form of the instructions to load data. If we look at the same instruction with a 32-bit integer register pointing to the data, the "/r" component of the instruction remains a single byte. It is laid out as:

  Bit 7   Bit 6   Bits 5-3             Bits 2-0
  0       0       Dst XMM register     Src integer register ptr

Like before, we can define a macro to assemble this instruction:

#define ADDPS_MEM(dst,src) \
{ \
    _asm _emit 0x0f \
    _asm _emit 0x58 \
    _asm _emit ((dst)<<3) | (src) \
}

This would be used as ADDPS_MEM(_XMM1,EAX_PTR) and would add the 128-bit value pointed to by EAX to the contents of the XMM1 register.

It would be nice if both of these macro forms could be combined into a single macro, so that you could easily switch from a register to memory pointer. If you define the registers as shown in the table below, the following macro will successfully assemble both forms of the instruction.

 

  Register   Value     Register   Value     Register   Value     Register   Value
  _XMM0      0xC0      _MM0       0xC0      EAX_PTR    0x00      EAX_REG    0xC0
  _XMM1      0xC1      _MM1       0xC1      EBX_PTR    0x03      EBX_REG    0xC3
  _XMM2      0xC2      _MM2       0xC2      ECX_PTR    0x01      ECX_REG    0xC1
  _XMM3      0xC3      _MM3       0xC3      EDX_PTR    0x02      EDX_REG    0xC2
  _XMM4      0xC4      _MM4       0xC4      ESI_PTR    0x06      ESI_REG    0xC6
  _XMM5      0xC5      _MM5       0xC5      EDI_PTR    0x07      EDI_REG    0xC7
  _XMM6      0xC6      _MM6       0xC6      ESP_PTR    0x04      ESP_REG    0xC4
  _XMM7      0xC7      _MM7       0xC7      EBP_PTR    0x05      EBP_REG    0xC5

 

#define ADDPS(dst,src) \
{ \
    _asm _emit 0x0f \
    _asm _emit 0x58 \
    _asm _emit ((dst & 0x07)<<3) | (src) \
}

This macro is simply used as ADDPS(_XMM0, _XMM1) for the register version, or ADDPS(_XMM0, EAX_PTR) for the memory version. In the KNI.h header file, similar macros are provided for all the new Pentium III instructions.

Single register indirect addressing is the only addressing mode the macros support, which can be restrictive compared to the functionality of a proper assembler. All other addressing modes can be achieved by using an LEA instruction to calculate the address and then using the result in the macro. While this method takes two instructions, it’s usually not too difficult to schedule the LEA between other instructions where the processor would otherwise have stalled.

The opcode determines the type of registers used within a given instruction. (The possibilities are shown in the instruction tables above.) However, because the macros cannot perform any error checking, it is possible to assemble what appear to be illegal instructions. For example, the instruction ADDPS(EAX_REG,EBX_REG) is invalid, but it actually assembles to the valid ADDPS xmm0, xmm3 instruction. With this in mind, you have to be very careful when using the macros, because simple typos can lead to bizarre side effects.

The only SIMD instructions that can take a memory operand as the destination are the various move instructions, such as MOVAPS or MOVUPS, and these move instructions have different opcodes for storing and therefore require a different macro. To keep things simple, a storing version of an instruction has a postfix of _ST. For example, the instruction

MOVAPS [eax], xmm0

becomes

MOVAPS_ST(EAX_PTR,_XMM0)

when using the macros. The KNI.H header file contains macros for all the SIMD instructions and constants for registers.

Note that both Visual C++ and the Intel compiler know which registers each assembly instruction modifies. Using this information, the compilers save and restore only the registers actually used within an assembly block, resulting in more optimal code. If instructions are emitted directly into the code stream with the _emit operator, the compiler does not know which registers are used and makes no attempt to guess. As a result, you may corrupt a register that the compiler was using but did not save.

How do I debug code with the new instructions?

Currently, the only way of viewing the SIMD registers or disassembling SIMD code within the application environment is to use the Intel Register Viewing Tool -- examining SIMD code within the Visual C++ disassembler will reveal nothing. The Register Viewing Tool is a stand-alone tool and is not linked to the Visual C++ IDE in any way, apart from being accessible from the tool menu. The register view highlights changes to individual elements of a register, but the update is not instant: the tool only refreshes the registers every half second (fast enough while you work in the debugger), and you can always click the ‘Refresh’ button for an instantaneous update. In addition to viewing the registers in floating-point format, they can be viewed in byte, word and dword formats.

The disassembly window displays code at the current address, which is marked with an asterisk ‘*’, plus 40 bytes either side. The current address line moves down as you step through in the debugger, and it also displays the correct address when you hit a breakpoint. The only inconvenient aspect of this tool is that when a breakpoint is first hit, an int 3 instruction (debugger break) is shown in the disassembly window, and you have to enter the first byte of the actual instruction if you want an accurate disassembly. For a SIMD instruction, the first byte will usually be 0x0f for packed scalar instructions and 0xf3 for single scalar instructions. This little problem aside, it is a very usable and essential tool if you are serious about writing SIMD code. Using the Register Viewing Tool and some good old-fashioned debugging techniques, you will get by fine. Hopefully Microsoft will be quicker to implement SIMD debug support than it was to implement MMX support.

[Figure: Viewing SIMD code with Intel's Register Viewing Tool]

For low level debugging there is a new version of SoftICE (version 3.25) that supports the Pentium III registers and instructions. The latest version is available for download from the Numega web site at http://www.numega.com/drivercentral/components/si325.shtml (this is free to anyone who has a registered version of SoftICE 3.20 or higher).

Programming Considerations

As I said earlier, I am disappointed that there is no dot product (inner product) instruction. Such an instruction could have made a huge difference to lighting and collision calculation performance. Fortunately, a dot product can still be computed with the new instructions in just a few cycles -- significantly faster than with the older x87 floating-point instructions. The code below performs a simple dot product between two vectors and broadcasts the result into every element so that it's ready to use. With careful scheduling and interleaving with neighboring operations, this code could go significantly faster.

// Load the vectors
movaps xmm0, lv       // xmm0 = [-, lz, ly, lx]
movaps xmm1, nv       // xmm1 = [-, nz, ny, nx]

// Do the math
mulps  xmm0, xmm1     // xmm0 = [-, lz*nz, ly*ny, lx*nx]
movaps xmm2, xmm0     // xmm2 = [-, lz*nz, ly*ny, lx*nx]
shufps xmm0, xmm0, 9  // xmm0 = [-, lx*nx, lz*nz, ly*ny]
addps  xmm0, xmm2     // xmm0 = [-, lx*nx+lz*nz, lz*nz+ly*ny, lx*nx+ly*ny]
shufps xmm2, xmm2, 18 // xmm2 = [-, ly*ny, lx*nx, lz*nz]
addps  xmm0, xmm2     // xmm0 = [-, dp, dp, dp]

To get the most out of SIMD instructions, you must ensure that every register element performs a useful operation on every instruction. For example, if you place a single 3D vector into a SIMD register, at most you will get 75 percent of the maximum possible throughput. You can see in the dot product example above that each instruction performs only three useful operations. Leaving elements of a register unused also means those elements may contain unknown values. These unknown values generally cause no harm, but be careful when issuing divide and square root instructions -- especially if exceptions are enabled.
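For reference, the arithmetic that the SIMD sequence above performs can be expressed in plain C++. This is a sketch of my own, not code from the article; Vec4 and DotBroadcast are hypothetical names.

```cpp
#include <cassert>

// Scalar sketch of the SIMD dot product above: multiply element-wise,
// sum the three products, and broadcast the result into every slot.
struct Vec4 { float v[4]; };

Vec4 DotBroadcast(const Vec4& l, const Vec4& n)
{
    float dp = l.v[0]*n.v[0] + l.v[1]*n.v[1] + l.v[2]*n.v[2];
    Vec4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = dp;    // every element holds the dot product, ready to use
    return r;
}
```

The shuffle-and-add dance in the assembly exists purely to achieve this sum and broadcast without leaving the XMM registers.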

A modification to the above code can be used to perform vector normalization:

// Load the vector
movaps xmm0, v        // xmm0 = [-, z, y, x]

// Do the math
movaps xmm1, xmm0     // xmm1 = [-, z, y, x]
mulps  xmm0, xmm0     // xmm0 = [-, z*z, y*y, x*x]
movaps xmm2, xmm0     // xmm2 = [-, z*z, y*y, x*x]
shufps xmm0, xmm0, 9  // xmm0 = [-, x*x, z*z, y*y]
addps  xmm0, xmm2     // xmm0 = [-, x*x+z*z, z*z+y*y, x*x+y*y]
shufps xmm2, xmm2, 18 // xmm2 = [-, y*y, x*x, z*z]
addps  xmm0, xmm2     // xmm0 = [-, x*x+y*y+z*z, x*x+y*y+z*z, x*x+y*y+z*z]
sqrtps xmm0, xmm0     // xmm0 = [-, len, len, len]
divps  xmm1, xmm0     // xmm1 = [-, unit z, unit y, unit x]

This produces results at full precision, and takes about 100 cycles -- not significantly faster than the same operation in x87 floating point. If the vector must be calculated at full precision, a significant speed gain is still available by exploiting the fact that the final square root and divide (sqrtps and divps) both operate on vectors containing the same value in every element. The single-scalar square root and divide instructions are faster than their packed counterparts, and replacing the last two instructions in the above example with the short sequence below saves around 40 cycles, making this code about twice as fast as the equivalent x87 floating-point code.

sqrtss xmm0, xmm0     // xmm0 = [-, -, -, len]
movss  xmm2, one      // xmm2 = [-, -, -, 1.0]  (one is a 1.0f constant in memory)
divss  xmm2, xmm0     // xmm2 = [-, -, -, 1/len]
shufps xmm2, xmm2, 0  // xmm2 = [1/len, 1/len, 1/len, 1/len]
mulps  xmm1, xmm2     // xmm1 = [-, unit z, unit y, unit x]

It is unlikely that a vector normalization really requires full precision, so take advantage of the approximate reciprocal square root instruction to speed things up. Replacing the sqrtps and divps at the end of the original code with the two instructions below reduces the overall time to around 16 cycles, which is much faster than anything in x87 floating point -- it's even faster than using a lookup table, as this method does not thrash the cache.

rsqrtps xmm0, xmm0    // xmm0 = [-, 1/len, 1/len, 1/len]
mulps   xmm1, xmm0    // xmm1 = [-, unit z, unit y, unit x]

With these vector normalization routines, you still need to be careful of the unknown element if exceptions are enabled. If exceptions are disabled, the SIMD unit substitutes reasonable values when an exceptional condition occurs.
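In scalar C++, the whole normalization boils down to the following sketch. Vec3 and Normalize are illustrative names of my own, not part of the instruction set; the hardware's rsqrtps approximates the 1/sqrt step in a single instruction.

```cpp
#include <cassert>
#include <cmath>

// Scalar sketch of the SIMD vector normalization above:
// length = sqrt(x*x + y*y + z*z), then scale each component by 1/length.
struct Vec3 { float x, y, z; };

Vec3 Normalize(const Vec3& v)
{
    float len = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    float inv = 1.0f / len;   // rsqrtps produces an approximation of this
    return Vec3{ v.x*inv, v.y*inv, v.z*inv };
}
```

Note that the scalar version, like the SIMD one, does not guard against a zero-length input; that is the exceptional case discussed above.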

In the examples so far, we placed a whole vector in a single SIMD register, which is known as "horizontal" data processing, or the AoS (Array of Structures) method. As the name implies, if you process a set of vectors with AoS, each vector is a structure and you have an array of them (this is probably how you process 3D geometry today). The C code below shows a typical AoS layout for 1024 vectors.

 

struct Vector3
{
    float X;
    float Y;
    float Z;
};

Vector3 AoS_Data[1024];

Another problem with this data layout is alignment. Each vector is only 12 bytes long, but the SIMD movaps instruction must fetch its data from a 16-byte-aligned address. You could fix this by padding to four-element vectors, but if the fourth element isn't needed, it may not be worth the additional 33 percent of storage. Alternatively, you can use the movups instruction, which reads unaligned data, but every access to an unaligned address incurs a penalty. The same alignment restrictions apply to any SIMD instruction that references memory directly, such as addps xmm1,[eax]; if they are not satisfied, a general protection fault is generated.
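Here is a sketch of the four-element padding approach, using the alignas keyword from modern C++ for brevity (in 1999 you would have used a compiler-specific alignment declaration or manual pointer rounding). AlignedVector3 is a hypothetical name of my own.

```cpp
#include <cassert>
#include <cstddef>

// A padded, 16-byte-aligned vector suitable for movaps. The unused fourth
// float is the 33 percent storage cost mentioned above.
struct alignas(16) AlignedVector3
{
    float X, Y, Z;
    float pad;  // unused, present only to reach movaps granularity
};

static_assert(sizeof(AlignedVector3) == 16,
              "each element must span exactly one XMM load");
```

An array of these is automatically 16-byte aligned at every element, so movaps can load each vector directly.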

It is common knowledge in digital signal processing and SIMD programming that AoS is not the most efficient way to represent data such as vertices. Vertical programming, also known as the SoA (Structure of Arrays) method, is significantly faster. In this method, each component of the vectors is stored in its own array: in our example we would have an array of X components, an array of Y components, and an array of Z components, declared like this:

struct SoA_Data
{
    float x[1024];
    float y[1024];
    float z[1024];
};

Now consider the unoptimized and unscheduled code below to normalize a set of vectors using the SoA method:

// Load the data for 4 vectors
movaps xmm0, X        // xmm0 = [x3, x2, x1, x0]
movaps xmm1, Y        // xmm1 = [y3, y2, y1, y0]
movaps xmm2, Z        // xmm2 = [z3, z2, z1, z0]

// Keep a copy
movaps xmm3, xmm0     // xmm3 = [x3, x2, x1, x0]
movaps xmm4, xmm1     // xmm4 = [y3, y2, y1, y0]
movaps xmm5, xmm2     // xmm5 = [z3, z2, z1, z0]

// Do the math
mulps  xmm0, xmm0     // xmm0 = [x3x3, x2x2, x1x1, x0x0]
mulps  xmm1, xmm1     // xmm1 = [y3y3, y2y2, y1y1, y0y0]
mulps  xmm2, xmm2     // xmm2 = [z3z3, z2z2, z1z1, z0z0]
addps  xmm0, xmm1     // xmm0 = [x3x3+y3y3, x2x2+y2y2, x1x1+y1y1, x0x0+y0y0]
addps  xmm0, xmm2     // xmm0 = [x3x3+y3y3+z3z3, x2x2+y2y2+z2z2, x1x1+y1y1+z1z1, x0x0+y0y0+z0z0]
rsqrtps xmm0, xmm0    // xmm0 = [1/len3, 1/len2, 1/len1, 1/len0]
mulps  xmm3, xmm0     // xmm3 = [unit x3, unit x2, unit x1, unit x0]
mulps  xmm4, xmm0     // xmm4 = [unit y3, unit y2, unit y1, unit y0]
mulps  xmm5, xmm0     // xmm5 = [unit z3, unit z2, unit z1, unit z0]

Even in this unoptimized form, the gains are huge: the above code takes less than 30 cycles to perform four vector normalizations. The primary reason for the speedup is that every element of every register performs a useful operation on every instruction (which is almost impossible with three-element vectors in the AoS format). And the advantages don't stop there. Most of the alignment issues disappear as well, since only the individual component arrays have to be aligned to 16-byte boundaries. Another advantage is that the dimension of the vectors is independent of the register width, so code in this format is easy to convert to AMD's 3DNow! instructions.

Laying data out in the SoA format does not help in all cases, however. Its disadvantages are usually down to human error, since the method requires a different way of thinking; once you understand it, though, it's not much different from AoS. The SoA format is really only useful for processing arrays of vectors -- it's inefficient for single operations. But with a little thought on your part, it's not difficult to convert from AoS to SoA.
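Converting between the two layouts is just a gather loop. The sketch below is my own illustration, not code from the article; AosToSoa and SoA_Block are hypothetical names.

```cpp
#include <cassert>

// Convert an array of vectors (AoS) into separate component arrays (SoA),
// ready for four-wide SIMD processing.
struct Vector3 { float X, Y, Z; };

struct SoA_Block
{
    float x[1024];
    float y[1024];
    float z[1024];
};

void AosToSoa(const Vector3* in, SoA_Block& out, int count)
{
    for (int i = 0; i < count; ++i)
    {
        out.x[i] = in[i].X;   // gather each component into its own array
        out.y[i] = in[i].Y;
        out.z[i] = in[i].Z;
    }
}
```

In practice you would do this conversion once, offline or at load time, and keep the working set in SoA form rather than converting per frame.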

How do I read the new Pentium III serial number?

Why are so many people upset about having a serial number within the processor? I think it's a fine place for a centralized number for security, identification, software protection, or licensing. Unique numbers are not hard to generate anyway -- especially if the machine has a network card (which provides 48 unique bits for starters). Add the variety of other information you can grab from a user's machine (such as the vendor ID of the video card, the version of Windows, and the serial number of the motherboard) and you get a unique number. Internet games could make use of a persistent, unique number for online identification, for instance. I think it makes perfect sense for a computer to have a unique ID, and the processor is an ideal place to put it; plenty of other machines have had unique IDs for years. Let's see how to read it.

To detect the presence of the serial number, issue the CPUID instruction with EAX=1 to read the feature flags, and check the SN bit (bit 18) in EDX. If the bit is clear, the machine either has no serial number or it has been disabled -- and once disabled, it cannot be re-enabled without resetting the processor. The serial number, if present, is read by issuing CPUID with EAX=3. The lower 64 bits of the 96-bit serial number are returned in EDX:ECX (ECX = bits 0 to 31, EDX = bits 32 to 63), and the top 32 bits come from the processor signature. The function below detects the presence of the serial number:

bool DetectSerialNumber()
{
    bool found_sn;
    _asm
    {
        pushfd
        pop eax              // get EFLAGS into eax
        mov ebx,eax          // keep a copy
        xor eax,0x200000     // toggle CPUID bit
        push eax
        popfd                // set new EFLAGS
        pushfd
        pop eax              // EFLAGS back into eax

        // have we changed the ID bit?
        xor eax,ebx
        je NO_SERIAL_NUM

        // we could toggle the bit, so CPUID is present
        mov eax,1
        cpuid                // get processor features

        // check the serial number bit
        test edx,1<<18
        jz NO_SERIAL_NUM
        mov found_sn,1
        jmp DONE
NO_SERIAL_NUM:
        mov found_sn,0
DONE:
    }
    return found_sn;
}

The next function actually reads the serial number, but make sure you have verified that it is present before calling this function: if the serial number is absent or has been disabled, the CPUID operation with EAX=3 is illegal and will raise an illegal instruction exception. The source files DETECT.CPP and DETECT.H contain code that will read the serial number.

void ReadSerialNumber(DWORD* serial)
{
    DWORD t, m, b;
    _asm
    {
        mov eax,1
        cpuid

        // top 32 bits are the processor signature bits
        mov t,eax

        // a new CPUID code for the Pentium III
        mov eax,3
        cpuid
        mov m,edx
        mov b,ecx
    }

    // copy the locals into the array passed in
    serial[0] = b;
    serial[1] = m;
    serial[2] = t;
}

Intel recommends displaying the serial number to the user as six groups of four hexadecimal digits (i.e., xxxx-xxxx-xxxx-xxxx-xxxx-xxxx).
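A formatting sketch of that recommendation follows. FormatSerial is my own helper name, not an Intel API; it assumes the three DWORDs are ordered bottom to top, as ReadSerialNumber above stores them.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Print a 96-bit serial number (three 32-bit words, element 2 holding the
// top bits) as six dash-separated groups of four hex digits.
// out must have room for 29 characters plus the terminating null.
void FormatSerial(const unsigned int serial[3], char out[30])
{
    char hex[25];
    // 24 hex digits, most significant word first
    std::sprintf(hex, "%08X%08X%08X", serial[2], serial[1], serial[0]);

    int pos = 0;
    for (int group = 0; group < 6; ++group)
    {
        std::memcpy(out + pos, hex + group * 4, 4);  // copy one group of four
        pos += 4;
        out[pos++] = (group < 5) ? '-' : '\0';       // dash or terminator
    }
}
```

This is purely presentation; the number itself should be compared in its raw binary form.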

If you have a multi-processor machine, then you also have multiple serial numbers. Worse, the serial number may be disabled on just one of the processors, which can introduce some really hard-to-find bugs. For example, the detection function could report that a serial number is present on one processor, but due to a context switch, the function that reads the serial number might actually run on the other processor, whose number has been disabled. The resulting exception is very difficult to track down. I recommend detecting the features of each processor individually by running the detection functions from a thread whose affinity is set to a single processor; this guarantees reliability by forcing all the code to run on the same physical processor. (See the DETECTEXAMPLE.CPP source file for an example of this technique.)

[Figure: Example output of the 'DetectExample' program when run on Windows 98]

How do I disable the Pentium III serial number?

The easiest way to disable the serial number is to get the serial number utility from Intel's web page and run it; this utility makes all traces of the serial number disappear. Note, though, that if you run the tool from within Windows, it may already be too late to prevent the number from being read, as a driver could have read it during boot. To disable the serial number completely, it must be switched off by the BIOS during initial boot up, and at the time of writing only a few BIOSes provide this feature. If you have an Intel SE440BX or SE440BX-2 motherboard, ensure that the BIOS version is 4S4EB0X1.86A.0031.P11 or 4S4EB2X0.86A.0017.P10, respectively (these numbers are displayed on the initial boot screen). These latest BIOSes support the Pentium III and can disable the serial number at boot up; they also support the latest clock speeds (550MHz in the case of the SE440BX-2). BIOS updates for any Intel motherboard are available from the Intel web page at http://www.intel.com/design/motherbd/genbios.htm.

To write code to disable the serial number, you simply set bit 21 of Model Specific Register (MSR) 119h, but as with all system registers, you have to be at ring zero to accomplish this. Additionally, remember to disable the serial number for every processor in the system, if you are running a machine with more than one processor. It is very unlikely that a game or other application will ever want to disable the serial number, however.

 

Is there any new performance/profiling information?

The performance registers within the Pentium III are identical to the performance registers within the Pentium II, apart from the addition of the following:

Event Num.   Unit Mask   Description

SIMD Instruction Performance Counters

0xD8         0x00        Packed and scalar SIMD instructions retired
0xD8         0x01        Scalar SIMD instructions retired
0xD9         0x00        Packed and scalar SIMD computational instructions retired
0xD9         0x01        Scalar SIMD computational instructions retired

Prefetch Performance Counters

0x07         0x00        Non-temporal prefetch instructions dispatched, including speculative
0x07         0x01        Prefetch instructions dispatched for all caches, including speculative
0x07         0x02        Prefetch instructions dispatched for L1 and L2, including speculative
0x07         0x03        Weakly ordered stores dispatched, including speculative
0x4B         0x00        Non-temporal prefetch instructions that missed all caches
0x4B         0x01        Prefetch instructions that missed all caches
0x4B         0x02        Prefetch instructions for L1 or L2 that missed all caches
0x4B         0x03        Weakly ordered stores that missed all caches

Accessing and using the performance information is beyond the scope of this article. (I covered the subject in depth in my article "Building an Inline Performance Monitoring System," Game Developer, May 1998.) With that article and the information above, you should have no problem reading the new events. VTune 4.0 can also read the new Pentium III performance counters, if you don't want to do it yourself.

For Further Information

Lots of information can be found on Intel's developer web site, which offers source code examples, all the processor manuals, and the SIMD tools mentioned in this article. If you have VTune 4.0, you'll also find the SIMD example code and the latest processor manuals (including the optimization manual) on the CD-ROM. Check out the following:

http://developer.intel.com/design/pentiumiii/manuals (Pentium III manuals)
http://developer.intel.com/vtune/cbts/strmsimd/appnotes.htm (Pentium III App Notes)
http://developer.intel.com/design/pentiumiii/applnots/245125.htm (Serial Number)
http://developer.intel.com/design/pentiumii/applnots/241618.htm (CPUID)
http://developer.intel.com/design/pentiumiii/psover.htm (Serial Number Control)
http://developer.intel.com/vtune/macropak/index.htm (MASM macro file)
http://developer.intel.com/vtune/rvt/index.htm (Register viewing tool)
http://developer.intel.com/vtune/optidrvr/index.htm (NT4.0 SIMD driver)
http://developer.intel.com/vtune/ (VTune home page)

Next month I’ll explore processor detection, using a DLL that detects every feature of every make and model of processor on the market. If you have questions about this article, Pentium processors, Windows, or PCs in general, email them to me at [email protected]

Rob has been involved with games and system level code on almost every major platform for the past 12 years. Currently he is mainly involved with PCs and Windows. When he is not working, he can be found either flying his Cessna out of Santa Monica airport or at the pub. He can be contacted at [email protected].
