As faster rasterization hardware has become more widespread on PCs, the throughput of the front end of the 3D pipeline (i.e., the geometry set up and lighting portions of the process) has become more critical to overall performance. Not too long ago, application performance was limited by poorly performing 3D accelerating hardware – or the lack of an accelerator altogether. Now game developers must be more concerned about feeding the 3D accelerator enough data to keep it busy.
Computing 3D geometry is a fairly straightforward task. It simply requires spitting out processed vertices as fast as possible to keep the accelerator well fed. Pervertex lighting (usually performed within the geometry engine) is used by the rasterizer to compute polygon fills and perform texture modulation.
Intel’s Streaming SIMD Extensions includes SIMD floating point operations that operate on new 4wide packed, singleprecision floatingpoint registers. It’s no coincidence that this feature is perfect for 3D geometry, and we’ll demonstrate how to exploit this new extension using a highlevel language (C++) and still achieve outstanding performance.
Why Custom 3D Pipelines?
By nature, custom 3D pipelines can take advantage of known, a priori characteristics of 3D content that a general engine cannot always exploit. This can lead to both performance and/or quality advantages. Let’s consider the pros and cons of a custom engine:
Pros:
 Data structures can be optimized. Some APIs (e.g., Direct3D) require input data to be in an array of structures (AOS – more on that shortly). With these APIs, this data must be reformatted (i.e. swizzled) to a structure of arrays (SOA) format for vertical SIMDstyle processing. Optimized structures increase performance.
 Custom lighting can be used, which often looks better a standard API’s lighting support. This results in better visual quality.
 Custom engines can be built to support advanced techniques (e.g., NURBS tessellation), resulting in better performance and better visual quality than nonspecialized APIs. For instance, a custom 3D engine could just transform a small number of control points and tessellate to the proper level of detail (resulting in better performance), as well as properly deform when external forces are applied (resulting in improved visual quality).
Cons:
 You have to tune and maintain your own engine instead of relying on the developer of the API do it for you.
The bottom line is that when performance and quality are a high priority, the motivation for using a custom engine is greater. What we intend to show here is that a high performing custom engine can be written efficiently in C++ exclusively and still take advantage of Intel’s new Streaming SIMD Extensions. Support for Streaming SIMD Extensions using C++ and intrinsics is part of the support given by the Intel C/C++ compiler. For more information on developing with this tool (including hybrid development using Intel's compiler for some code and another compiler for other code), please see http://developer.intel.com/vtune/icl.
An Introduction to Streaming SIMD Extensions
The Streaming SIMD Extensions is a natural extension of SIMDstyle processing which was initiated when MMX technology was launched. The Streaming SIMD Extensions include three basic categories of instructions: SIMD floatingpoint instructions, SIMD integer instructions, and memoryrelated cachecontrol instructions.
The SIMD integer commands are extensions to MMX technology. They include new instructions that operate on the 64bit wide MMX technology registers. However, for the purposes of this article, we are primarily interested in the latter two instruction categories. The SIMD floatingpoint instructions operate on 128bit wide packed singleprecision floatingpoint registers. There are eight such new registers in the Pentium III, which support 4wide SIMD processing using single precision, floating point numbers. The memoryrelated cachecontrol instructions help control cache locality and reduce engine latency. These memory hints (e.g., prefetches) tell the processor that we’ll need to grab this data in advance so it’s ready when we need it. Figure 1 shows a diagram with the new 128bit packed floatingpoint registers. For more information, read "Optimizing Games for the Pentium III Processor".
Figure 1. The new 128bit packed floatingpoint registers.

Working with Data structures
To begin with, we emphasize vertical parallelism. The algorithms are often efficiently written in a serial manner, processing four pieces of data (i.e., vertices) in parallel. This type of processing is often more efficient in a SIMD environment since data dependencies often limit the amount of horizontal parallelism you can exploit.
Now, let's look at the data. A vertex pool is represented as an AOS. Each structure contains data for one vertex (such as its position, normal, color, texture coordinates, and so on), and the entire pool is represented as an array of all these structures. However, this method of representing the vertex pool makes the transfer to SIMD difficult. Why? Well, keep in mind that data should be 16byte aligned to avoid performance penalties, and that the Pentium III requires 16byte alignment when reading and writing the Streaming SIMD Extensions registers (since their new registers are 128 bits wide). As the size of a vertex structure is not always divisible by 16, you can always align the first vertex in the array, but not necessarily the second one.
On the other hand, you could represent your data as an structure of arrays (SOA). In this method, each element of vertex data is stored in an array, and all of these arrays are used together to generate the vertex pool (which takes the format of xxxx….,yyyy…..,zzzz….., and so on). Although this method may seem a bit strange, it has advantages. First, if all the element heads are aligned, then the entire data structure is aligned. Second, the regular nonSIMD algorithm can be used in a vertical fashion; instead of dealing with one vertex per iteration, the algorithm can handle four vertices per iteration. There are some drawbacks to this method, though. The main drawback is that you have to manage a set of pointers or indices – one for each element of the vertex data. Another drawback is that because algorithm handles four data elements in one iteration, all of the constants and parameters have to be expanded four times, which requires more time and storage space.
There is a third method of representing a vertex pool. This method is a hybrid between the previous two methods, and involves dividing the vertex pool into small SIMD portions (usually four or eight vertices per portion). You could represent each portion in an SOA, and the entire vertex pool could be made up of arrays of these portions. Thus, if the portion size was four, the vertex pool would look like this: xxxxyyyyzzzz, xxxxyyyyzzzz,…and so on. This method retains the alignment, in each iteration we deal with four vertices without changing the algorithm, and we need to manage only one pointer. We will use this method in this article.
The data structures we will work with are defined as the following:
struct SsimdTex {
F32vec4 u,v;
};
Struct SsimdVector {
F32vec4 x,y,z;
};
struct SsimdVertex {
SsimdVector pos;
SsimdVector norm;
SSimdTex tex;
};
The vertex pool is defined as :
_MM_ALIGN16 SSimdVertex vertex_pool[(MAX_NUM_VERITCES + 3)/4];
Since the algorithm operates on four vertices in each iteration, all of the external data (such as the transform matrix and lighting and material information) should be expanded. To keep things simple, we created structures built from F32vec4 members that will hold the expanded data. For example, the transform matrix is defined as:
struct SSimd4X4Matrix {
F32Vec4 _11,_12,_13,_14;
F32Vec4 _21,_22,_23,_24;
F32Vec4 _31,_32,_33,_34;
F32Vec4 _41,_42,_43,_44;
};
The 3D Pipeline Structure and Body
Let’s look at a rudimentary 3D pipeline. In the simplest terms, a 3D pipeline usually takes this form:
For each vertex do:
// transform.
1. Calculate w (dot product of the fourth matrix column & the vertex position).
2. Calculate we (the reciprocal of w).
3. Calculate x, y, z (dot product of the first, second and third matrix columns and the vertex position).
4. Multiply x, y, z by we.
// light
1. Color < 0.
2. For each light do:
a. Calculate dir  the direction vector from the light source to the vertex in the world space (for directional light this vector is the light vector).
b. Normalize dir.
c. Calculate d  the dot product between dir and the vertex normal (assuming the vertex normal is already normalized, this gives the cosine of angle between the vertex and the light direction).
d. If (d > 0) then
color < color + light_color * material_color * d
3. Write x, y, z, color and other vertex data to the output buffer.
Initializing the SIMD pipeline
After the main program calculates the transformation matrix and the lighting data, all these data elements should be expanded to the F32vec4 class so that all data elements have the same value. The simplest way of performing this data expansion is by using the F32vec4 float constructor. The typical code to perform the data expansion looks like this:
x_mat>_11 = F32vec4(float_matrix>_11);
x_mat>_12 = F32vec4(float_matrix>_12);
where float_matrix is a structure similar to the SsimdMatrix, except that its data elements are all floats.
Transforming SIMD Coordinates
Since the transformation part of the algorithm is purely computational, there is no problem converting it to SIMD. The main issue in this part of the pipeline is calculating we (the reciprocal of w). One way to perform this calculation is by dividing an expanded "one" value by the calculated w, but in this method we pay a performance penalty for the divide operation.
There’s also second technique for transforming SIMD coordinates, which makes use of the new Pentium III SIMD and scalar instructions (rcpps, rcpss) that estimate the reciprocal of a given value. Since these instructions only provide estimations however, we’ll also use one iteration of the NewtonRaphson method to improve the instruction’s result accuracy. Using this method we still come out ahead – the result of these steps still takes fewer clock cycles than performing a divide. The reciprocal and the reciprocal with NewtonRaphson method are already encapsulated in Intel’s SIMD classes and their corresponding functions as rcp and rcp_nr.
The transform part of the pipeline looks like this:
// calculating w.
w = pos>x * x_mat>_41 + pos>y * x_mat>_42 + pos>z * x_mat>_43 + x_mat>_44;
// calculating we = 1/w.
we = rcp_nr(w);
// transforming x ,y ,z and multiplying by we.
x = (pos>x * x_mat>_11 + pos>y * x_mat>_12 + pos>z * x_mat>_13 + x_mat>_14) * we;
y = (pos>x * x_mat>_21 + pos>y * x_mat>_22 + pos>z * x_mat>_23 + x_mat>_24) * we;
z = (pos>x * x_mat>_31 + pos>y * x_mat>_32 + pos>z * x_mat>_33 + x_mat>_34) * we;
Simple Lighting
There are two main issues to consider when converting a lighting pipeline to SIMD. The first concern, which pops up with every SIMD algorithm, is "What do I do if I have a branch in my code?". There is no simple answer to that question – the solution depends on the nature of the branch. Solutions vary between "execute all the code paths in the branch and then glue the SIMD results according to the branch condition" and "try to manipulate the data in such manner that the branch is no longer needed".
In our lighting code that we previously showed, this branch appeared: if d (the dot product) is greater than zero, then multiply it by the color constants and add the results to the accumulated color. Let’s handle this branch using the second technique – changing things so that the branch is no longer needed.
Let’s try to clamp all of the negative values of d to zero. All the positive values (the ones that I’m interested in) will be left unchanged and the negative values will be clamped to zero and won’t affect the overall result. This clamping is simple and it’s done by the maximum function (simd_max, which uses the new maxps instruction) between the four different SIMD elements and zero. We can still use a scalar branch to handle the case of all four elements of the SIMD data are less than zero. We can use the move_mask function (which generates an integer bit mask representing the sign of each one of the SIMD elements) to determine if all of the elements are negative.
There’s another issue to consider when using this solution. In order to normalize a vector we have to calculate square root of the vector’s magnitude, and than divide each of the vector elements by the calculated root. The Pentium III has a new SIMD and scalar instructions that estimate the reciprocal of the square root of a given value (rsqrtps, rsqrtss) . These instructions are already encapsulated in the classes as the functions rsqrt and rsqrt_nr (this latter function calculates the reciprocal of the square root with one NewtonRaphson iteration). Since accuracy in lighting is much less important than accuracy when calculating a vertex position, I’ll use the regular version of the function in the lighting code.
The lighting portion of the pipeline looks like this:
switch (light>light_type) {
case DIRECTIONAL:
// assume that light direction is already normalized.
dot = light>dir.x * norm>x + light>dir.y * norm>y + light>dir.z * norm>z;
break;
case POINT:
dir.x = pos>x  light>pos.x;
dir.y = pos>y  light>pos.y;
dir.z = pos>z  light>pos.z;
magnitude = dir.x * dir.x + dir.y * dir.y + dir.z * dir.z;
len = rsqrt(magnitude);
dot = (dir.x * norm>x + dir.y * norm>y + dir.z * norm>z) * len;
break;
default:
}
if (move_mask(dot) == 0xf) // if all dot elements are less than zero
continue; // skip this light.
dot = simd_max(dot,_ZERO_); // zero all simd elements below zero.
diffuse.r += light_color.r * mat_color.r * dot;
diffuse.g += light_color.g * mat_color.g * dot;
diffuse.b += light_color.b * mat_color.b * dot;
Converting the Data For a Serial Rasterizer
Most rasterizers still need their data to be submitted in AOS format. This means that transposing the data from our internal data structure to AOS (this process is also known as "deswizzling"). With Direct3D, you also have to pack the colors from floats in the [0..1] range to a DWORD ARGB format of values between 0 and 255. Here’s how to convert the data.
The first step is to pack the colors. Most of the instructions that deal with data arrangement and conversion inside the SIMD register are not encapsulated in the SIMD classes, but they do appear as intrinsic instructions. Color packing comprises three parts: clamping the values above one down to one (the maximum color value); changing the scale of the colors from [0..1] to [0..255]; and converting the values to integers and rearranging them in the ARGB manner.
Here is how to clamp the values above one:
diffuse.r = simd_min(diffuse.r,_ONE_);
diffuse.g = simd_min(diffuse.g,_ONE_);
diffuse.b = simd_min(diffuse.b,_ONE_);
Here’s how to change their scale:
diffuse.r *= _255_;
diffuse.g *= _255_;
diffuse.b *= _255_;
The conversion instruction cvttps2pi operates only on the lower half of the SIMD data (2 elements only). How do we convert all four elements? We can convert and pack the lower two elements and then, using movhlps, we will move the upper two elements to the lower segment and convert them. The packing is done by using MMX bitwise shifts and ors:
// conversion of lower two
r = _mm_cvtt_ps2pi(diffuse.r);
g = _mm_cvtt_ps2pi(diffuse.g);
b = _mm_cvtt_ps2pi(diffuse.b);
// packing
r = _m_psllqi(r,16);
g = _m_psllqi(g,8);
out[0] = _m_por(r,_m_por(g,b));
// conversion of upper two
r = _mm_cvtt_ps2pi(_mm_movehl_ps
(diffuse.r,diffuse.r));
g = _mm_cvtt_ps2pi(_mm_movehl_ps
(diffuse.r,diffuse.r));
b = _mm_cvtt_ps2pi(_mm_movehl_ps
(diffuse.r,diffuse.r));
// packing
r = _m_psllqi(r,16);
g = _m_psllqi(g,8);
out[1] = _m_por(b,_m_por(r,g));
The second step is to transpose the SOA vertex position to four AOS positions. We can perform this transposition using the _MM_TRANSPOSE4_PS macro defined in xmmintr.h. This macro transposes a 4x4 matrix, as passed by its line. The output from this macro gives us the first vertex (x,y,z,we) in x, the second vertex in y, and so on. The syntax looks like this:
_MM_TRANSPOSE4_PS(x,y,z,we)
Ideas for future Improvements
This paper concerns improvements that a very conventional 3D engine could implement. However, this infrastructure can be extended to handle many custom features, such as particle systems, NURBS tessellation, texture transforms, environment and displacement mapping, and so on. We hope this article illustrates that the high performance available from the Pentium III and its Streaming SIMD Extensions can be exploited through C++ and that developers make full use of it.
Appendix
All the code, including a sample 3D application is in the attached zip file;
http://www.gamasutra.com/features/
19990416/intel_simd.zip
Ronen Zohar has a B.A.. (1998) in computer science from The Technion  Israel Institute of Technology. His areas of concentration are in 3D graphics, geometric modeling and computer vision. Ronen currently works in the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.
Haim Barad has a Ph.D. (1987) in Electrical Engineering from The University of Southern California. His areas of concentration are in 3D graphics, image processing and computer vision. Haim currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.