Viewport clipping is one of the essential stages within the 3D geometry pipeline. Since 3D hardware cannot accept position values outside the viewing frustum, we have to look for vertices that are outside this boundary and remove them from further processing steps. Any polygons which are only partially on-screen must be clipped by the screen volume boundaries. In this article I’ll show some techniques for speeding up the process of clipping polygons, using Intel’s Streaming SIMD Extension instructions.

*The Sutherland-Hodgman Clipping Algorithm*

The most widely used algorithm for clipping convex polygons is the Sutherland-Hodgman algorithm. Here are the functions used by the algorithm, in pseudocode:

state(Clip plane C, Vertex v)
    // Returns "inside" if v is inside clip plane C.

next(Vertex v)
    // Returns the next vertex after v in the polygon.

interpolate(Vertex v, Vertex v', Clip plane C)
    // Returns a new vertex v'' interpolated from v and v'
    // according to the C plane.

Pseudo code for the algorithm looks like this:

S0 = initial polygon
For each clip plane C:
    S1 = empty set
    For each vertex v in S0 do:
        If state(C,v) = "inside" then
            S1 = S1 + {v}
        v' = next(v)
        If state(C,v) <> state(C,v') then
            S1 = S1 + {interpolate(v,v',C)}
    S0 = S1
Result = S0

Because of its iterative nature, optimizing the entire algorithm using Streaming SIMD Extensions is difficult. However, two of its critical functions can easily be optimized: the generation of clip flags, and the interpolation of two vertices into a new vertex.
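For reference, the whole algorithm is straightforward in scalar code. Here is a minimal C++ sketch; the `Vertex` and `Plane` types and the sign convention of the inside test are illustrative, not the article's actual structures:

```cpp
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };
struct Plane  { float a, b, c, d; };   // plane formula: a*x + b*y + c*z + d

// Signed distance of v from plane p; >= 0 counts as "inside" here.
static float dist(const Plane &p, const Vertex &v) {
    return p.a*v.x + p.b*v.y + p.c*v.z + p.d;
}

// The interpolate() step: vertex where the edge (v0,v1) crosses plane p.
static Vertex intersect(const Vertex &v0, const Vertex &v1, const Plane &p) {
    float d0 = dist(p, v0), d1 = dist(p, v1);
    float t = d0 / (d0 - d1);             // blend factor along the edge
    return { v0.x + t*(v1.x - v0.x),
             v0.y + t*(v1.y - v0.y),
             v0.z + t*(v1.z - v0.z) };
}

// Sutherland-Hodgman: clip the polygon against each plane in turn.
std::vector<Vertex> clipPolygon(std::vector<Vertex> poly,
                                const std::vector<Plane> &planes) {
    for (std::size_t pi = 0; pi < planes.size(); ++pi) {
        const Plane &p = planes[pi];
        std::vector<Vertex> out;                            // S1
        for (std::size_t i = 0; i < poly.size(); ++i) {
            const Vertex &v  = poly[i];
            const Vertex &vn = poly[(i + 1) % poly.size()]; // next(v)
            bool inV  = dist(p, v)  >= 0.0f;
            bool inVn = dist(p, vn) >= 0.0f;
            if (inV) out.push_back(v);                // keep inside vertices
            if (inV != inVn)                          // edge crosses plane
                out.push_back(intersect(v, vn, p));
        }
        poly = out;                                   // S0 = S1
    }
    return poly;
}
```

Clipping a triangle that straddles a single plane yields a quadrilateral: the algorithm inserts one interpolated vertex per crossing edge.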

*Generating Clip Flags*

Most applications implement the state function as a bit array assigned to each vertex, where each bit represents the state for a specific clip plane (a value of '1' indicates the vertex is outside that plane). These bit arrays are usually called "clip flags". Clip flags are usually calculated before or during the transformation part of the geometry pipeline, and by applying Boolean operations to them we can determine whether we need to enter the clipping algorithm itself or drop the entire polygon from further processing.

If the AND of all the clip flags is a value other than 0, then all of the polygon's vertices are outside at least one of the clip planes, so the entire polygon is outside the view volume and can be dropped. On the other hand, if the OR of all the clip flags is equal to 0, the entire polygon is inside the view volume and can be submitted to the rasterizer without further processing.
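In code, the trivial accept/reject test is just two Boolean reductions over the per-vertex flags. A minimal sketch (the type and enum names are illustrative):

```cpp
#include <cstdint>

enum ClipResult { CLIP_REJECT, CLIP_ACCEPT, CLIP_MUST_CLIP };

// flags[i] holds one bit per clip plane for vertex i
// ('1' = outside that plane).
ClipResult classifyPolygon(const uint32_t *flags, int vertexCount) {
    uint32_t andFlags = ~0u;    // AND of all clip flags
    uint32_t orFlags  = 0u;     // OR of all clip flags
    for (int i = 0; i < vertexCount; ++i) {
        andFlags &= flags[i];
        orFlags  |= flags[i];
    }
    if (andFlags != 0u) return CLIP_REJECT;    // all outside one plane
    if (orFlags  == 0u) return CLIP_ACCEPT;    // all inside every plane
    return CLIP_MUST_CLIP;                     // must run the clipper
}
```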

Clip flags are usually generated in object space (before transformation) or in screen space (after transformation), and there are several Streaming SIMD Extension techniques for each one of these options.

*Using Clip Flags in Object Space (Clipping Against Arbitrary Planes)*

In object space, we have to transform the screen boundaries from view space to object space. This transformation yields an arbitrary plane in object space. To determine whether a vertex is "inside" or "outside" such a plane, we substitute the vertex position into the clip plane formula (Ax + By + Cz + D) and check whether the result is negative (indicating that the vertex is outside the viewing volume).
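In scalar form, the object-space test is one plane-formula evaluation and a sign check per vertex and plane. A sketch, assuming an illustrative `ClipPlane` type:

```cpp
#include <cstdint>

struct ClipPlane { float a, b, c, d; };   // illustrative plane type

// One bit per plane: bit i is set when the vertex is outside plane i,
// i.e. when a*x + b*y + c*z + d is negative.
uint32_t objectSpaceClipFlags(const ClipPlane *planes, int planeCount,
                              float x, float y, float z) {
    uint32_t flags = 0u;
    for (int i = 0; i < planeCount; ++i) {
        float d = planes[i].a*x + planes[i].b*y + planes[i].c*z + planes[i].d;
        if (d < 0.0f)
            flags |= 1u << i;
    }
    return flags;
}
```

The SIMD versions below compute exactly this, four results at a time.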

*Vertices arranged in SOA (four vertices processed in one SIMD iteration).* If the vertices are arranged in SOA format (xxxx…, yyyy…, zzzz…), we can replicate each clip plane constant four times, evaluate (using SIMD) the clip plane formula for four vertices at once, and use the Streaming SIMD Extensions move_mask to get the sign bit for each of the four vertices. Then, using a look-up table, we can expand each of the 16 possible sign combinations into an MMX register of four words, and shift the register left to the clip plane's mask position. The code for this operation looks like this:

static Iu16vec4 maskCodeToMMXWords[16] = {
    0x0000000000000000, 0x0000000000000001,
    0x0000000000010000, ...
};

Iu16vec4 generateClipFlags(SIMDVertex *vtx)
{
    Iu16vec4 clipFlags = 0;
    for (int i = 0; i < clipPlaneCount; i++) {
        F32vec4 temp;
        Iu16vec4 mask;
        SIMDClipPlane *cp = &clip_planes[i];
        temp = cp->A*vtx->x + cp->B*vtx->y + cp->C*vtx->z + cp->D;
        mask = maskCodeToMMXWords[move_mask(temp)];
        mask = mask << i;
        clipFlags |= mask;
    }
    return clipFlags;
}

*Vertices arranged in AOS (four clip planes processed in one SIMD iteration).* When the vertices are arranged in an AOS structure (x,y,z,…,x,y,z,…), we can replicate each vertex component four times and use the SIMD operators to evaluate four clip planes in one iteration. After the evaluation, we use the Streaming SIMD Extensions move_mask to get the sign bit of each clip plane, and then shift the four resulting bits to their mask position. It looks like this:

WORD generateClipFlags(D3DVertex *vtx)
{
    WORD clipFlags = 0;
    int count = (clipPlaneCount + 3) & ~3;  // round up to a multiple of 4
    F32vec4 x = vtx->x, y = vtx->y, z = vtx->z;
    for (int i = 0; i < count; i += 4) {
        SIMDClipPlane *cp = &clip_planes[i >> 2];
        WORD mask;
        mask = move_mask(cp->A*x + cp->B*y + cp->C*z + cp->D);
        mask = mask << i;
        clipFlags |= mask;
    }
    return clipFlags & clipPlanesCountMask;
}

*Using Clip Flags in Screen Space (Clipping Against Screen Boundaries)*

In screen space, the clip flags are calculated against constant planes perpendicular to the x, y and z axes. Each axis has two limits, upper and lower, and we have to generate clip flags according to these six boundaries.
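A scalar version of this six-boundary test produces the familiar six-bit outcode. A sketch (the bit assignments are illustrative):

```cpp
#include <cstdint>

enum { LEFT_BIT = 0, RIGHT_BIT = 1, TOP_BIT = 2,
       BOTTOM_BIT = 3, BACK_BIT = 4, FRONT_BIT = 5 };

// Six-bit outcode for one transformed vertex against the screen-space
// box [xMin,xMax] x [yMin,yMax] x [zMin,zMax].
uint32_t screenSpaceClipFlags(float x, float y, float z,
                              float xMin, float xMax,
                              float yMin, float yMax,
                              float zMin, float zMax) {
    uint32_t flags = 0u;
    if (x < xMin) flags |= 1u << LEFT_BIT;
    if (x > xMax) flags |= 1u << RIGHT_BIT;
    if (y < yMin) flags |= 1u << TOP_BIT;
    if (y > yMax) flags |= 1u << BOTTOM_BIT;
    if (z > zMax) flags |= 1u << BACK_BIT;
    if (z < zMin) flags |= 1u << FRONT_BIT;
    return flags;
}
```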

*Vertices arranged in SOA (four vertices in one SIMD iteration).* In this case we expand each of the six boundaries into a SIMD word, then compare and generate the clip flags according to the compare mask, like this:

#define MASK_CF(compare,bit) \
    maskCodeToMMXWords[move_mask((compare))] << (bit)

Iu16vec4 generateClipFlags(SIMDVertex *vtx)
{
    Iu16vec4 clipFlags;
    clipFlags  = MASK_CF(cmpgt(vtx->tx,clipLeft),LEFT_BIT);
    clipFlags |= MASK_CF(cmplt(vtx->tx,clipRight),RIGHT_BIT);
    clipFlags |= MASK_CF(cmpgt(vtx->ty,clipTop),TOP_BIT);
    clipFlags |= MASK_CF(cmplt(vtx->ty,clipBottom),BOTTOM_BIT);
    clipFlags |= MASK_CF(cmpgt(vtx->tz,clipBack),BACK_BIT);
    clipFlags |= MASK_CF(cmplt(vtx->tz,clipFront),FRONT_BIT);
    return clipFlags;
}

*Vertices arranged in AOS.* When the vertices are arranged in AOS, we perform two compares for each vertex: one against the (x,y,z) lower limits, and one against the (x,y,z) upper limits. Then we use move_mask to pack the six planes together. It looks like this:

WORD generateClipFlags(D3DVertex *vtx)
{
    F32vec4 aos_pos;
    WORD clipFlags;
    loadu(aos_pos, &vtx->tx);
    clipFlags  = move_mask(cmpgt(aos_pos,clip_upper)) << 4;
    clipFlags |= move_mask(cmplt(aos_pos,clip_lower));
    return clipFlags & 0x0077;
}

*Interpolating Two Vertices to Generate a New Vertex*

Interpolation is the blending of two vertices' data to produce a new, third vertex. It is done using a blend factor computed from the relative distances of the two vertices to the clip plane. The interpolation is linear, and it is performed for each one of the vertex components (position, colors, texture coordinates, and so on).
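The blend factor comes from the signed distances of the two vertices to the clip plane. A scalar sketch of the factor and of one component's linear blend (using the conventional v + t*(v'-v) form):

```cpp
// Signed distances of the two vertices to the clip plane have opposite
// signs when the edge crosses the plane; the blend factor is the point
// where the edge meets the plane.
float blendFactor(float d1, float d2) {
    return d1 / (d1 - d2);    // lies in [0,1] when the signs differ
}

// Linear blend of one vertex component (position, color, u/v, ...).
float lerp(float a, float b, float t) {
    return a + t * (b - a);
}
```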

Since only two vertices are interpolated at a time, and because the operation is usually performed on relatively few vertices, there is no point in using the SOA format for the interpolation. Additionally, graphics hardware expects data to be in the AOS format, which is another reason to perform the interpolation after converting from the SOA input format to the AOS output format. Using this method, the vertices' data will be stored in memory, and up to four of the vertex data elements can be interpolated in one iteration.

For example, if the vertex format is {x, y, z, 1/w, diffuse, specular, u0, v0, u1, v1}, the x, y, z and 1/w components can be interpolated in one iteration, as can the two texture coordinate sets, and each of the two color channels can be interpolated in one iteration using MMX instructions. The only drawback of this example is that 1/w can't be blended linearly. 1/w must first be converted to w, then interpolated to w', and then the w' converted back to 1/w' (this conversion can be performed easily using the rcp instruction, which operates on a single-precision floating-point number). A typical blend of the position and 1/w components looks like this:

F32vec4 interpolate_positionAndRHW(D3DTLVERTEX *v1, D3DTLVERTEX *v2,
                                   float alpha)
{
    F32vec4 result;
    _asm {
        mov eax,v1
        // expand alpha to all four elements
        movss xmm0,alpha
        shufps xmm0,xmm0,0
        // load first vertex, change order
        // to 1/w,y,z,x & convert 1/w to w
        movups xmm1,[eax]
        mov eax,v2
        shufps xmm1,xmm1,0x27
        rcpss xmm1,xmm1
        // load second vertex, change order
        // to 1/w,y,z,x & convert 1/w to w
        movups xmm2,[eax]
        shufps xmm2,xmm2,0x27
        rcpss xmm2,xmm2
        // interpolate (v2 + alpha*(v2-v1))
        movaps xmm3,xmm2
        subps xmm3,xmm1
        mulps xmm3,xmm0
        addps xmm3,xmm2
        // convert w of result back to 1/w, change
        // order back to x,y,z,1/w
        rcpss xmm3,xmm3
        shufps xmm3,xmm3,0x27
        movaps result,xmm3
    }
    return result;
}
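The 1/w round trip can also be illustrated in scalar form. In this sketch, plain division stands in for the approximate rcpss instruction, and the conventional w1 + alpha*(w2-w1) blend form is used:

```cpp
// Scalar illustration of the 1/w round trip: convert both 1/w values
// to w, blend linearly, convert back. Plain division stands in for the
// approximate rcpss reciprocal.
float interpolateRHW(float rhw1, float rhw2, float alpha) {
    float w1 = 1.0f / rhw1;
    float w2 = 1.0f / rhw2;
    float w  = w1 + alpha * (w2 - w1);
    return 1.0f / w;
}
```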

Blending the texture coordinates is much simpler. All we have to do is load the texture coordinates and interpolate them. Since each SIMD register can hold up to two texture coordinate sets, if the number of textures is odd we must process only half a register in one of the iterations. The code to blend texture coordinates looks like this:

void interpolate_textureCoords(float *t1, float *t2, float *tnew,
                               float alpha, DWORD texturesCount)
{
    F32vec4 tex1, tex2, texNew, alpha4 = alpha;
    int i;

    // odd number -> process only half a register for the
    // first texture set
    if (texturesCount & 0x1) {
        tex1 = _mm_loadl_pi(tex1, (__m64 *)t1);
        tex2 = _mm_loadl_pi(tex2, (__m64 *)t2);
        texNew = tex2 + alpha4*(tex2-tex1);
        _mm_storel_pi((__m64 *)tnew, texNew);
        t1 += 2; t2 += 2; tnew += 2;
        texturesCount--;
    }

    // process the remaining texture coordinate sets, two at a time
    texturesCount /= 2;
    for (i = 0; i < (int)texturesCount; i++) {
        loadu(tex1, t1);
        loadu(tex2, t2);
        texNew = tex2 + alpha4*(tex2-tex1);
        storeu(tnew, texNew);
        t1 += 4; t2 += 4; tnew += 4;
    }
}

Colors are usually represented as a four-byte array, with one entry for each of the four color channels (red, green, blue and alpha/fog). To interpolate them efficiently, I'll use MMX instructions to interpolate the four color channels in one iteration. First, I'll calculate 256*alpha and 256*(1-alpha), convert the results to WORDs and expand each four times into an MMX QWORD. Second, I'll read the input color channels and expand them into one MMX QWORD made of four WORDs, then perform the interpolation itself using integer arithmetic and divide the results by 256 (with a simple shift). Finally, I'll pack the four WORDs back into four bytes. Here's what it looks like:

DWORD interpolate_color(DWORD col1, DWORD col2, float alpha)
{
    F32vec1 alpha256 = F32vec1(alpha) * F32vec1(256.0f);
    Iu16vec4 m64alpha, m64alpha2, m64Col1, m64Col2, m64out;
    static Iu16vec4 MMX256 = 0x0100010001000100;

    // m64alpha = 4 times 256*alpha;
    // m64alpha2 = 4 times 256*(1-alpha)
    m64alpha = M64(_m_pshufw(_m_from_int(F32vec1ToInt(alpha256)),0));
    m64alpha2 = M64(_m_psubw(MMX256,m64alpha));

    // load color 1 and expand to four WORDs
    m64Col1 = M64(_m_from_int(col1));
    m64Col1 = M64(_m_punpcklbw(m64Col1,0));

    // load color 2 and expand to four WORDs
    m64Col2 = M64(_m_from_int(col2));
    m64Col2 = M64(_m_punpcklbw(m64Col2,0));

    // interpolate (col1*alpha*256 + col2*(1-alpha)*256) / 256
    m64out = (m64Col1*m64alpha + m64Col2*m64alpha2) >> 8;
    return _m_to_int(_m_packuswb(m64out,0));
}
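A scalar equivalent of the MMX blend is useful for checking the fixed-point arithmetic. This sketch performs the same (c1*256*alpha + c2*256*(1-alpha)) >> 8 computation per channel with plain integer math:

```cpp
#include <cstdint>

// Scalar equivalent of the MMX color blend: for each byte channel,
// compute (c1*256*alpha + c2*256*(1-alpha)) >> 8 in integer arithmetic.
uint32_t blendColor(uint32_t col1, uint32_t col2, float alpha) {
    uint32_t a  = (uint32_t)(alpha * 256.0f);   // 256*alpha
    uint32_t a2 = 256u - a;                     // 256*(1-alpha)
    uint32_t out = 0u;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c1 = (col1 >> shift) & 0xFFu;
        uint32_t c2 = (col2 >> shift) & 0xFFu;
        out |= (((c1*a + c2*a2) >> 8) & 0xFFu) << shift;
    }
    return out;
}
```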

Because viewport clipping is so important to efficient 3D geometry processing, improvements to the process can make a big difference in the speed of a game. The methods I have outlined here exploit the power of the Streaming SIMD Extensions. Implementing these techniques results in a clipping algorithm that runs about three times faster than older legacy x87 implementations.

**Ronen Zohar has a B.A. in computer science from The Technion - Israel Institute of Technology. His areas of concentration are 3D graphics, geometric modeling and computer vision. Ronen currently works in the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel. He can be reached at [email protected]**