Hardware Accelerating Art Production

Steve Hill shows how the power of GPUs can be harnessed for the creation of computationally intensive artistic effects. The technique he describes here for generating self-shadowing, ambient occlusion (AO), uses real-time shaders running on the graphics card to do the heavy lifting, resulting in a solution fast enough for preview purposes, allowing artists to better see the results.

Steve Hill, Blogger

March 19, 2004

The use of shaders is not only impacting real-time rendering in a fundamental way but also game production in terms of workflow and art tools as we seek to maximize their visual potential. With more parameters and surface maps affecting the final on-screen look of game characters and scenes, it's more important than ever that artists have tools which enable them to fine-tune their work quickly and easily.

Thankfully, the major 3D graphics packages are adding native support for shaders. But in-house art tools (such as custom plug-ins, editors and build processes) should strive for the same level of usability.

This feature promotes the use of programmable graphics hardware outside of real-time rendering, in the processing of artistic effects in game development. To demonstrate the power of this technique, a large chunk of this article concerns mapping an intensive precomputed shadowing technique for meshes (ambient occlusion) onto the GPU. The method can be 15 times faster than a solution that takes advantage of only the host CPU.

This technique is a boon for art builds, but it also opens up the possibility of wrapping things up in a plug-in that would let artists effortlessly generate, visualize and tweak vertex or texture data within their 3D art package. In addition, the implementations presented here use version 1.1 Direct3D shaders for processing, so they can be widely deployed for use with today's common graphics cards.

Feature Overview

The first section looks at how technological advances affect art production and what features tools need to provide. An overview of ambient occlusion follows, leading to a three-step workout, turning an initially sluggish hardware approach into a lean, streamlined solution.

In fact, several variations on the technique are covered, illustrating a range of production trade-offs. I'll also show how these can be extended to handle higher-order occlusion (generalized precomputed radiance transfer) before closing with a round-up of other accelerated preprocesses that you might wish to consider.

Art Production

To begin, let's consider the broad picture of art production and where ambient occlusion (AO), as a preprocess, fits in. The technique for generating the AO shadows can be integrated into the creation end of the production process: 3D art packages such as 3ds max and Maya already support real-time shaders, giving the 3D artist an instant preview of what the effect will look like. Of course, it could be integrated with any commercial or custom tool that has access to 3D mesh data.

The Case Study

The AO process (for game purposes, at least) takes in one or more meshes and spits out vertex or texture data. There are a number of variations described later on, but with the basic technique this data consists of a single shadow term for every surface point--stored as an additional vertex attribute or packed into an occlusion map--which is used in-game in addition to ambient (or diffuse environmental) lighting.

In contrast to its low run-time cost, the processing is pretty intense, since lots of visibility samples need to be taken at each point. With a typical CPU-based approach, AO forms part of a lengthy build process that is far from ideal for editing and previewing purposes. Speeding up the calculation will not only yield faster builds but also afford us the opportunity to expose AO as a modeling package plug-in, so that artists can iterate more readily on content.

I will provide an overview of AO, but the subject is covered in greater detail by a chapter in the forthcoming GPU Gems book (see "Additional Reading"). Whether you're already familiar with the technique or not, this primer sets the scene, so to speak, for mapping AO onto graphics hardware.

Ambient Occlusion: A Primer

AO measures the amount that a point on a surface is obscured from light that might otherwise arrive from the outside. This average occlusion factor is recorded at every surface element, vertex, or texel, and used to simulate self-shadowing (see Figure 1).

The extra soft self-shadowing adds a great deal of believability to the lighting.

Note: This example setup is available as an effect which comes bundled with the RenderMonkey shader development suite.

The technique has come to prominence through its use in the film industry, with ILM first employing it for Pearl Harbor [Landis02]. Its niche is attenuating soft environmental lighting, particularly from indirect sources such as walls, sky, or ground, achieving the look of more complex setups--usually the manual placement of extra bounce lights or full-blown global illumination--at a lower cost.

One point that should be stressed is that fine-tuning lighting is a frequent task in post-production, and by decoupling the shadowing and lighting, it is possible to tweak the lighting without re-rendering the shadows.


Several assumptions are required for the re-use of visibility information to work:

  • Diffuse surfaces

  • Rigid geometry

  • Constant lighting

The first two constraints are also required by traditional precalculated radiosity. The diffuse surface limitation means that the stored visibility is view-independent, and in conjunction with a lack of deformation, ensures that the data can be re-used. Because of the separation of lighting and visibility, AO is less restrictive than static lighting, since single meshes with independent occlusion can be merrily translated, rotated and uniformly scaled. A group must be transformed as one, however, in order to preserve inter-object shadows.

The Process

Equation 1, expressing occlusion, Op, in integral form, is typically solved via Monte Carlo integration. For detailed coverage of the methodology consult the "Additional Reading" section.
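Equation 1 appeared as an image in the original article; a standard formulation consistent with the surrounding description (binary visibility V over the hemisphere Omega around the normal N) would be along these lines:

```latex
O_p = \frac{1}{\pi} \int_{\Omega} V_p(\omega)\,\max(N \cdot \omega,\, 0)\, d\omega
```

Note that with this convention--V returning 1 for unoccluded rays--O_p is really the *unoccluded* fraction (1 at a fully open point), which is exactly the term multiplied into the ambient lighting in-game.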

Rays are traced outward from a given surface point p over the hemisphere around the normal N. A binary visibility function V is evaluated for each of these rays, which returns 0 if the ray intersects any geometry before reaching the extent of the scene, and returns 1 otherwise. Figure 3 shows a simple 2D depiction of the process for a single semi-occluded point. In reality, creating smooth gradations in the shadows requires hundreds or even thousands of rays.

In the case of a uniform distribution of ray directions, each visibility sample is weighted by the cosine between the normal and sample direction (again, consult Additional Reading for an explanation) and averaged together, resulting in our scalar occlusion term. For efficiency, ILM uses a cosine distribution of rays, which removes the need for the cosine factor (as shown in Equation 2) and concentrates samples in statistically important directions.
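As a concrete (if unoptimized) illustration of the uniform-sampling estimator just described, here is a minimal CPU sketch in Python. The scene is abstracted into a caller-supplied binary visibility function, and all names are illustrative rather than taken from any production code:

```python
import math
import random

def sample_sphere():
    """Uniform random direction on the unit sphere (normalized Gaussian)."""
    while True:
        d = (random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1))
        n = math.sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2])
        if n > 1e-9:
            return (d[0]/n, d[1]/n, d[2]/n)

def ambient_occlusion(p, normal, visible, n_rays=1024):
    """Average cosine-weighted visibility at point p (1 = fully open).
    `visible(p, s)` is the binary visibility function V: 1.0 if a ray
    from p along s escapes the scene, 0.0 if it hits geometry."""
    total = weight_sum = 0.0
    for _ in range(n_rays):
        s = sample_sphere()
        # flip below-horizon samples into the hemisphere around the normal
        if s[0]*normal[0] + s[1]*normal[1] + s[2]*normal[2] < 0.0:
            s = (-s[0], -s[1], -s[2])
        # cosine weight between normal and sample direction
        w = max(s[0]*normal[0] + s[1]*normal[1] + s[2]*normal[2], 0.0)
        total += w * visible(p, s)
        weight_sum += w
    return total / weight_sum
```

With ILM's cosine-distributed rays, the weight drops out and the estimator reduces to a plain average of the visibility samples, as Equation 2 indicates.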

Game Use

For games, AO immediately enhances constant ambient lighting, adding definition to otherwise flat, shadowed areas. NVIDIA's Ogre demo (see Figure 4) uses AO in this way, in addition to a key directional light plus shadow map.

This is the only correct use for AO, where lighting is assumed to be constant and thus it can be factored out. In practice, however, we can stretch things slightly as the overview suggested. ILM, for instance, uses a pre-filtered environment map for secondary lighting of diffuse surfaces, instead of a uniform ambient term. Here, believable results are achieved from shadowing with AO because the illumination varies slowly. They also improve on the approximation by storing the average un-occluded direction--a so-called "bent normal"--which replaces the surface normal when indexing the map.

Spherical harmonic (SH) lighting [Forsyth03] is often a more attractive option than using a pre-filtered map. Again, AO can be used for attenuation but it can also be extended to precomputed radiance transfer (using SH), as is shown later.

In summary, AO is a cheap but effective shadowing technique for diffuse environmental illumination, and the combination of the two can add a plausible global effect on top of traditional game lighting. Using the power of today's programmable graphics cards, we can accelerate the technique enough to make it usable during the creation of 3D models and art, as described next.



Mapping AO onto the GPU


One method of directly mapping AO's hemispherical sampling onto hardware is the hemi-cube [Purvis03], but this is computationally expensive: the scene must be transformed and rendered multiple times for each surface element. A more efficient alternative comes from tracing many coherent rays together in the opposite direction. An intuitive implementation of this, developed by Weta Digital and described in [Whitehurst03], involves surrounding the scene with a sphere of lights, as Figure 5 shows. The light directions are used for sample weighting, while associated depth maps (our coherent rays) are used for visibility determination.

Equation 3 encapsulates Monte Carlo integration in this instance. For each element p, weighted visibility samples (from this point on simply referred to as "samples") are accumulated and averaged via the weight sum w. Because the sample directions si cover the unit sphere, those outside of a point's vision are given a zero weighting by the hemispherical function H. The visibility function V is just a depth comparison between the surface element in depth map space and the corresponding map value.
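Equation 3 was also pictured rather than typeset; a reconstruction consistent with the description above (sample directions s_i covering the unit sphere, hemispherical weighting H zeroing back-facing directions, weight sum w) would be:

```latex
O_p \approx \frac{1}{w} \sum_i H(N, s_i)\, V(p, s_i),
\qquad w = \sum_i H(N, s_i),
\qquad H(N, s) = \max(N \cdot s,\, 0)
```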


We can use graphics hardware to quickly render depth maps, but it's possible to go further and gain full benefit from the GPU after overcoming a couple of hurdles. The following work-through covers the stages involved in accomplishing this.

First Try

Before getting more adventurous in the kitchen, let's start baking using plain rasterization hardware and software processing. Here's the recipe:

1. Pick an orientation around scene center
2. Render geometry from this viewpoint
3. Read back depth information
4. Transform each surface element into depth buffer space
5. Perform visibility test via depth comparison
6. Repeat above steps, accumulating element samples and weights
7. Calculate AO for each element from sample and weight totals
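To make the seven steps concrete, here is a toy software-only version in Python. It "renders" a point cloud into an orthographic depth grid instead of rasterizing real geometry, so every name and parameter is illustrative rather than taken from the article's implementation:

```python
import math
import random

def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
def cross(a, b):
    return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def normalize(v):
    n = math.sqrt(dot(v, v))
    return (v[0]/n, v[1]/n, v[2]/n)

def random_dir():
    while True:
        d = (random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1))
        if dot(d, d) > 1e-12:
            return normalize(d)

def project(p, right, up, fwd, res, radius):
    # orthographic projection into a res x res depth grid of extent [-radius, radius]
    u = int((dot(p, right)/(2*radius) + 0.5)*res)
    v = int((dot(p, up)   /(2*radius) + 0.5)*res)
    return (u, v), dot(p, fwd)

def bake_vertex_ao(points, normals, n_dirs=128, res=64, radius=2.0, bias=1e-3):
    totals  = [0.0]*len(points)
    weights = [0.0]*len(points)
    for _ in range(n_dirs):
        s = random_dir()                          # 1. pick an orientation
        fwd = (-s[0], -s[1], -s[2])               #    camera looks opposite the light dir
        ref = (0.0, 0.0, 1.0) if abs(fwd[2]) < 0.9 else (1.0, 0.0, 0.0)
        right = normalize(cross(ref, fwd))
        up = cross(fwd, right)
        depth = {}                                # 2./3. render depth and "read back"
        for p in points:
            key, z = project(p, right, up, fwd, res, radius)
            if key not in depth or z < depth[key]:
                depth[key] = z
        for i, (p, n) in enumerate(zip(points, normals)):
            key, z = project(p, right, up, fwd, res, radius)  # 4. to depth-buffer space
            vis = 1.0 if z <= depth[key] + bias else 0.0      # 5. depth comparison
            w = max(dot(n, s), 0.0)
            totals[i] += w*vis                    # 6. accumulate samples...
            weights[i] += w                       #    ...and weights
    return [t/w if w > 0.0 else 1.0               # 7. final AO per element
            for t, w in zip(totals, weights)]
```

Even this toy version exhibits the cost structure discussed below: the per-element transform and compare loop dominates, which is precisely the work the shader versions push onto the GPU.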

This is a valid procedure for graphics chips that lack any sort of programmability, but otherwise hardware capabilities are going to waste; per-element transform and triangle rasterization (in the case of texture baking) are tasks better performed by dedicated hardware. The performance of this approach is also hurt by read-back, a big deal given the high number of iterations necessary to avoid noise (or banding because of shared viewpoints).

To boost processing speed substantially, we must make more effective use of the GPU. As the remaining steps show, this leads to reducing read-back as well.

Second Attempt

The time-consuming steps--bar depth transfer--are amenable to stream processing, so let's take advantage of this. Depth rendering will remain largely as before, but we can replace software transform and comparison of surface elements with shaders. A vertex shader computes the weight for a given surface element and orientation. A corresponding pixel shader performs the necessary depth check, with the weight and test result written out. As Figure 6 captures, we now have two accelerated stages: a depth pass and a sampling pass.

For vertex AO, points (i.e., D3DPT_POINTLIST) are sent through the new shader pair, each occupying a single pixel--containing the weight and test result--in the render target. With these laid out contiguously in vertex order, the target contents can be simply iterated over in software.

In the case of texture baking, the process is slightly more involved, as triangles are rasterized instead using non-overlapping UVs specified by the artist or generated automatically. The weight is also calculated per-pixel, with the vertex shader performing setup instead.

Read-back has been reduced somewhat compared to the first try, assuming that surface resolution (number of elements) is lower than depth buffer resolution--under-sampling would occur otherwise. A jump in speed should be expected as well, depending on relative GPU and CPU muscle, plus the amount of time spent optimizing the last version. The new process is also simpler since the dedicated hardware takes care of lower-level computations.

Despite these gains, room for improvement remains, as the pipeline is still stalled by read-back. If the summation stage were to be moved to the graphics card, the GPU would be kept busy and we would only have to read the totals back at the end. This is possible via ps2.0 shaders and high-precision targets, but cards which support these features are not ubiquitous and the extra accuracy can come with a speed and storage hit. As the final version reveals, however, with a little craftiness the GPU can perform partial summation without the features just mentioned, further cutting back-transfer by well over an order of magnitude.

Third Time Lucky

A split ramp, a neat idea borrowed from [James03], is the solution to the problem of accumulating values in an 8-bit-per-component buffer. Consider the case of vertex AO from the previous attempt. Rather than just passing the weight through the pixel shader, the value can be used to index a special texture which splits it across color components. High bits are returned in one channel and the low bits in another.

The issue of precision has been avoided up until now, but as the results will show, 8 bits are plenty for a single weight or sample--for preview purposes, anyway. Accordingly, the split ramp chops the weight into two 4-bit parts over R and G. Samples can be handled in the same way with the weight in B and A, masked by the visibility result. The empty high bits in all components allow values to be accumulated via alpha blending, saturating after 16 passes or so. Figure 7 illustrates this extended process for a single element.

After every block of iterations, the host reconstructs sums from the target, adding them to main totals held in main memory for each element. The target is then cleared to zero for the next set of iterations.
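The split-ramp bookkeeping is easy to mis-implement, so here is a small Python model of the packing, blending, and host-side reconstruction (the function names are mine, not the article's):

```python
def split(v):
    """Emulate the split-ramp lookup: an 8-bit value becomes two 4-bit
    halves, each stored in its own channel with 4 bits of headroom."""
    assert 0 <= v <= 255
    return v >> 4, v & 0xF          # high nibble (e.g. R), low nibble (e.g. G)

def accumulate(values):
    """Emulate additive frame-buffer blending over one block of iterations."""
    assert len(values) <= 16, "headroom allows only ~16 adds before saturating"
    hi = sum(split(v)[0] for v in values)
    lo = sum(split(v)[1] for v in values)
    assert hi <= 255 and lo <= 255  # neither channel overflows its 8 bits
    return hi, lo

def reconstruct(hi, lo):
    """Host side: rebuild the block total from the two channel sums."""
    return (hi << 4) + lo
```

With 4 bits of headroom per channel, sixteen additive blends fit before saturation (15 x 16 = 240 <= 255), which is why the host reads back and clears the target after every block of 16 iterations.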

We have reduced the frequency of the read-back to 1/16th, on top of the earlier improvements, and processing is now blazingly fast for typical game meshes and texture sizes. Timings and analysis are provided later in the article.


There are several practical issues I have glossed over that can affect quality and correctness, some of which are independent of this case study. While perspective projection could be used for depth map rendering, as suggested by earlier talk of a "sphere of lights", orthographic projection suits our needs perfectly.

With the former, extra shader math is required because the eye direction (or sample direction depending on the point of view) varies, and the resolution is biased towards near features. Orthographic rendering, on the other hand, offers uniformity and constructing a tight frustum is effortless, since the dimensions are best based upon the scene's bounding sphere for consistency across all viewpoints.

With Direct3D, as used here, the issue of rasterization rules comes into play. Geometry is sampled at pixel centers--a mismatch with texture lookup, which reads from texel edges. When rendering to a texture with subsequent reading in another pass, it's simplest to adjust the rasterization coordinates and [Brown03] presents a clean way to achieve this. It's a good idea to get this aspect set up and tested with predictable data from the outset, so that you can avoid problems later, subtle or otherwise.

Filtering is another easily forgotten but important issue when using certain lookup tables. With a split ramp, point sampling is required to ensure a correct value is returned in regions where the low bits wrap around.

All of the variations in the next section make use of an 8-bit pseudo depth buffer. In most cases it's possible to swap in a more accurate 16-bit version (using a splitting scheme) without needing a separate pass for the depth comparison. Hardware shadow mapping can also be used when supported.


It has already been noted that AO can be calculated and stored at every mesh vertex or separately as an occlusion map. The former is a good option with constant ambient lighting, provided that surfaces are sufficiently tessellated to capture shadow changes well. This requirement goes away with the latter and while an AO texture results in a storage cost, one may be able to pack the channel with another map in order to save a texture stage.

Another option is available when high-poly models are used for normal map generation. The extra information can be used to compute a more detailed AO texture, at the cost of additional processing time.

Also, precomputed radiance transfer can go beyond AO (and bent normals) through extra terms, handling low-frequency lighting situations more accurately--again at the cost of storage, although the data can be compressed. Rendering is affected as well, since SH lighting requires some additional support.

These situations are all dealt with in the remainder of this section, particularly with regard to accelerated preprocessing.

Vertex Baking

There is very little more to say on the workings of vertex AO and the shader code in Listing 1 should be self-explanatory.

Listing 1.

// depth.vsh
// c0       : Rasterisation offset
// c1-4     : World*View*Proj. matrix
vs.1.1
dcl_position    v0
// Output projected coordinates
m4x4 r0, v0, c1
mad oPos, r0.w, c0, r0
// Output depth via diffuse colour register
mov oD0, r0.z

// sampling_v.vsh
// c0       : Rasterisation offset
// c1-4     : World*View*Proj. matrix
// c5       : Sample direction
vs.1.1
def c8,   2.0, -2.0, -1.0,  1.0
def c9,   0.5, -0.5,  0.0,  1.0
def c10,  0.0,  0.0,  0.0,  0.504
dcl_position    v0
dcl_normal      v1
dcl_texcoord    v2
// Scale and offset texture coordinates
// to [-1, 1] range for render target
mad r0.xy, v2.xy, c8.xy, c8.zw
mov r0.zw, c9.zw
// Output coordinates for rasterising
mad oPos, r0.w, c0, r0
// Project vertex coordinates
m4x4 r0, v0, c1
// Output depth via diffuse colour register
// (for consistency with depth pass)
mov oD0, r0.z
// Output bias for depth test
mov oD1, c10.w
// Scale and offset projected coordinates
// for depth map lookup:
// x' =  x*0.5 + 0.5*w
// y' = -y*0.5 + 0.5*w
// z' =  0
// w' =  w
mul r0, r0, c9
mad oT0, r0.w, c9.xxzz, r0
// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z
// Output weight, to be split via ramp lookup
mov oT1.x, r0.z
mov oT1.yzw, c9.zzw

// sampling_v.psh
ps.1.1
def c0, 1.0, 1.0, -1.0, -1.0  // Sample mask
tex t0  // Depth
tex t1  // Weight (R & G), sample (B & A)
// Compute depth difference, with
// a bias added for cnd (0.5 + 1/255)
sub r0.a, t0.a, v0.a
add r0.a, r0.a, v1.a
// Output weight and sample (zero if occluded)
mul_sat r1, t1, c0
cnd r0, r0.a, t1, r1

Texture Baking

For texture AO, weights must be calculated per-pixel. To this end, a transform of the normal into world-space replaces previous weight operations in the vertex shader. The interpolated normal is then used to index a cubic split ramp, which performs the necessary normalization per-pixel. The remaining math comes courtesy of the ramp, and we arrive at the split weight as before. This all works because the world z-axis is the sample direction, which points in the opposite direction of the camera.

With a little effort, these two versions can be unified. A cube-map lookup might come at a cost for vertex AO, but with a little fiddling it's possible to set things up so that only the split texture need change between the cases.

Texture Baking II

While a texture offers higher fidelity than vertex occlusion, in light of normal mapping the additional resolution is somewhat under-used. There are a couple of ways to increase texture detail if high-poly (HP) reference geometry is available.

The first approach is to make use of the derived normal map and calculate the weight in the pixel shader, split via a dependent texture look-up--still manageable in ps1.1 without spreading sampling work across passes. This gives the extra detail of the HP model with the coarse geometric occlusion of the low-poly (LP) mesh.

For more faithful reproduction of the extra detail, ATI's Normal Mapper tool has the option of sampling AO per texel during normal map generation. This is rather costly, however; a more frugal approach would be to precalculate AO at the vertices of the HP mesh using hardware. Occlusion then becomes just another attribute sampled by the ray caster (as described by [Cignoni98]).

Precomputed Radiance Transfer

While a single occlusion term is extremely compact, a useful lighting component for games right now, and can be stretched beyond its assumptions, it has its limits. Precomputed radiance transfer (PRT) using the SH basis functions [Sloan02] generalizes AO (the 0th term) and can capture additional directionality and therefore other effects. With nine or more transfer components, soft shadows noticeably track as lights move. Although this comes at the cost of increased storage, vectors can be efficiently compressed through Clustered Principal Component Analysis (CPCA) [Sloan03].

The assumptions for PRT to work, at least in the form presented here, are as with AO except that incoming radiance is assumed to be distant rather than constant. Lighting, now approximated via basis coefficients, can be factored out with the remaining integral evaluated for all surface points, yielding a transfer vector. Without compression, outgoing radiance Rp is reconstructed as a large inner product between the vector L (expressing environmental lighting) and the transfer vector Tp scaled by the diffuse surface response, as Equation 4 shows.
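Equation 4 was supplied as an image; based on the description (an inner product of the lighting vector L and the transfer vector T_p, scaled by the diffuse surface response, here written as an albedo rho_p), it would read along the lines of:

```latex
R_p = \frac{\rho_p}{\pi}\,(L \cdot T_p)
    = \frac{\rho_p}{\pi} \sum_i L_i\, T_{p,i}
```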

So our occlusion term is now a vector, computed in much the same way as with AO, except for the presence of basis functions Bi. Indeed, Equation 5 is very similar to Equation 3, but with a summation for each term.
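Hedging a reconstruction of the pictured Equation 5: it mirrors the Equation 3 sampling with a per-component basis factor B_i, matching the shader comment "B(s)*[V(s)]*Hn(s)" in Listing 2:

```latex
T_{p,i} \approx \frac{1}{w} \sum_j B_i(s_j)\, V(p, s_j)\, H(N, s_j),
\qquad w = \sum_j H(N, s_j)
```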

For our hardware implementation, each basis function Bi is evaluated in software and uploaded as a constant, since the sampling direction is fixed for a given viewpoint. Scaling and biasing is required with these terms so that they fall within the range [0,1], to prevent clipping during split ramp lookup. Accumulated values are later range-expanded in software.

Assuming a fairly liberal number of PRT components, it makes sense to factor out the parts of the sampling process that are constant across all components: namely the weights, which are currently packed with the samples, and the visibility determination.

With the operations moved to separate passes that write their results to additional textures, shader resources are freed up, thus allowing two PRT components to be processed in parallel per sampling pass--at least in the case of vertex baking.

One side benefit of this extra partitioning is that it's now easier to swap in a more accurate set of depth render and depth test passes, perhaps at run-time based on a quality setting. Listing 2 provides just the shaders for packed PRT sampling as the rest (depth, depth compare and weights passes) can be derived from Listing 1.

Listing 2.

// sampling_prt_v.vsh
// c0       : Rasterisation offset
// c1-4     : World*View*Proj. matrix
// c5       : Sample direction
// c6       : Packed SH basis terms:
//          : t0*scale0, t1*scale1, bias0, bias1
vs.1.1
def c8,  2.0, -2.0, -1.0,  1.0
def c9,  0.5, -0.5,  0.0,  1.0
dcl_position    v0
dcl_normal      v1
dcl_texcoord    v2
// Scale and offset texture coordinates
// to [-1, 1] range for render target
mad r0.xy, v2.xy, c8.xy, c8.zw
mov r0.zw, c9.zw
// Output coordinates for rasterising
mad oPos, r0.w, c0, r0
// Output coordinates for depth result lookup
mov oT0, v2
// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z
// Sample = B(s)*[V(s)]*Hn(s), scaled and
// biased to [0, 1] range
// V(s) is evaluated in the pixel shader
// The following values are passed through
// a pair of 2D split ramps:
// Scaled and biased samples in x and y
mad oT1.xy, r0.zz, c6.xy, c6.zw 
mov oT1.zw, c9.zw
// Biases in x and y (occluded case)
// Note: since these are fixed, a pixel
// shader constant could be used instead
mov oT2.xy, c6.zw
mov oT2.zw, c9.zw

// sampling_prt_v.psh
ps.1.1
tex t0  // Depth test result - V(s)
tex t1  // Sample0, sample1
tex t2  // Bias0, bias1
// Use depth test result to mask packed samples
mov r0.a, t0.a
cnd r0, r0.a, t1, t2


The Direct3D Extensions Library (D3DX) provides a number of functions for software processing of SH PRT as part of the latest public release (DirectX SDK Update Summer 2003). Since this implementation is both robust and readily available, it is an ideal reference for comparison with the hardware version presented earlier.

While not exhaustive, Table 1 shows a trend in the performance between D3DX and our hardware setup for 9-component vertex PRT. There is a clear gulf between the two versions as the number of surface elements increases--something which bodes well for texture processing.

Two sets of timings are listed for the hardware (HW1 and HW2), which differ only in the number of samples taken per vertex. The former is for direct comparison with D3DX; twice as many samples are used since the hardware method samples over a sphere, so on average only half the samples contribute anything (that's quite a contrast to hemispherical sampling in software). Bear in mind that this still isn't an entirely fair comparison, as D3DX importance samples via a cosine distribution, which speeds up convergence.

HW2 is supplied for a rough idea of the sort of performance to expect when previewing using a lower number of samples. The optimum will vary from scene to scene, however, depending on visibility variance.

These figures were recorded on an Athlon XP 2400+ PC equipped with a GeForce FX 5800, using a release build and libraries. To put the hardware version in the best light, meshes were "vertex cache order" optimized beforehand, although this improves locality of reference for software as well. The depth resolution for hardware processing was 512x512x8 bits in all cases.

Visually there is a minor difference in lighting (Figure 8) with hardware processing--caused by the limited storage precision of samples--which in my opinion is acceptable for previews. Precision and read-back can be traded off via the split ramp as required, but both the number of iterations and the depth map (resolution and precision) have a greater effect on the quality of the results.

The only major differences are the shadows under the eyelids (Figure 9), which are not captured by the hardware version. This is due to the limited depth precision, which can be increased as described earlier. A reduced depth bias combined with nudging vertices outwards (along the normal) may also improve accuracy. It should also be noted that the cyan color, present in transition areas between the red and white lights, is due to the low-order SH lighting approximation.

[Sloan04] reports similar results to these with floating point hardware. His accelerated PRT implementation--essentially the same as the one presented in this feature but using version 2.0 shaders and high-precision buffers--will be available in an upcoming DirectX SDK Update along with significant optimizations to the software simulator. The revised API will also be more modular, enabling (among other things) general CPCA compression of user-created data, as produced here.


It might also be possible to map one or more of the versions presented earlier onto fixed-function hardware. Whether such cards have the necessary power to out-do a clever software implementation is unclear however, and operations may need to be split over extra passes, cutting performance further.

One trick that may improve vertex AO performance is adaptive ray sampling. A simple form of this would be to examine vertices after some fixed number of iterations, find those that appear to be fully visible or within some margin of error, then through a dynamic index buffer (or other method) and extra book-keeping, spare these vertices further processing. A more general solution could look at differences between blocks of iterations, terminating at a given error threshold. It's an interesting idea, but extra communication between host and GPU may counter any potential speed gain.
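The block-to-block termination test described above can be sketched in a few lines. This is my own illustrative model of the idea--the `sample_block` callback stands in for one block of GPU sampling passes, and the `active` list stands in for a dynamic index buffer:

```python
def adaptive_bake(sample_block, n_verts, max_blocks=16, eps=1e-3):
    """Run blocks of iterations, retiring vertices whose AO estimate has
    stabilised. `sample_block(active)` returns one block's worth of
    (weighted-sample, weight) pairs for the still-active vertex indices."""
    totals  = [0.0]*n_verts
    weights = [0.0]*n_verts
    prev    = [None]*n_verts
    active  = list(range(n_verts))        # stands in for a dynamic index buffer
    for _ in range(max_blocks):
        if not active:
            break                         # everything converged early
        results = sample_block(active)
        still = []
        for i, (s, w) in zip(active, results):
            totals[i] += s
            weights[i] += w
            est = totals[i]/weights[i] if weights[i] else 1.0
            if prev[i] is not None and abs(est - prev[i]) < eps:
                continue                  # converged: spare further processing
            prev[i] = est
            still.append(i)
        active = still
    return [t/w if w else 1.0 for t, w in zip(totals, weights)]
```

As the paragraph above cautions, on real hardware each convergence check implies a read-back, so the saved sampling work has to outweigh that extra host-GPU traffic.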

Other Applications

The following are a few other examples of hardware-accelerated preprocesses, some of which have already been employed in game development, and others that could be used.


Radiosity

Coombe et al. [Coombe03] map a progressive radiosity method almost completely onto the GPU and in the process their method solves a couple of the classic problems with hemi-cubes. Firstly, the hemi-cube faces no longer need to be read back to main memory for software processing of delta form factors. Rather than iterating over the faces and randomly updating elements--an infeasible task with current hardware--surface elements, maintained directly in textures, are instead inverse-transformed to the ID hemi-cube faces. Secondly, since IDs now only need to be assigned to patches rather than elements, aliasing--a problem with hemi-cubes when shooting--is also reduced.

The paper is an enlightening read, with clever tricks devised for face shooting selection and adaptive subdivision. From the point of view of acceleration, their system reaches a solution rapidly for simple scenes and intermediate results are available at any point due to the progressive nature and texture storage.

Static SH Volume

Max Payne 2 uses a static volume of SH estimates over a game level, for environmental lighting of models at run-time. Remedy accelerates the offline computation [Lehtinen04]--in particular the SH projection of environment maps through ps2.0 shaders and floating-point storage.

At a given grid point, the pre-lit scene is first rendered to an HDR cube-map. Face texels are then multiplied by the current basis function evaluated in the corresponding direction. Naturally, since these directional terms are fixed for all cubes, they can be pre-calculated and read from a texture. The resulting values are then accumulated via reductive summing of groups of four neighboring texels through multiple passes--a process akin to repeated box filtering, just without the averaging. The final level is then read back and scaled appropriately, yielding an SH coefficient. Projection is repeated for all coefficients and rendering for all grid points.
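The reductive summation step can be modeled in a few lines; this is my own sketch of the idea, not Remedy's code:

```python
def reduce_sum(texels):
    """Repeatedly collapse 2x2 texel groups into their sum until a single
    value remains -- repeated box filtering, just without the averaging."""
    size = len(texels)
    assert size == len(texels[0]) and size & (size - 1) == 0  # square, power of two
    while size > 1:
        size //= 2
        texels = [[texels[2*y][2*x]     + texels[2*y][2*x + 1] +
                   texels[2*y + 1][2*x] + texels[2*y + 1][2*x + 1]
                   for x in range(size)] for y in range(size)]
    return texels[0][0]
```

On the GPU each halving is one render pass sampling four neighboring texels; here, each face texel would already have been multiplied by the precalculated basis-function term before the reduction begins.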

Normal Mapping

Wang et al. [Wang03] describe an image-space method to accelerate normal map processing using graphics hardware, which has similarities to the first AO scheme described earlier. The reference mesh is rendered from a number of viewpoints and depth comparisons are made in software to determine the nearest surface point and normal.

The authors work around the problem of hidden surface points in complex meshes by falling back to interpolating target triangle normals. The process outlined in the paper relies on reading back the frame buffer, containing the reference normals, and the depth buffer. It's quite possible that vertex and fragment programs could be used, as in the case study, to move more of the work to the GPU, thereby accelerating the process further.

Christian Seger, author of ORB, takes a different approach to normal map acceleration [Seger03] that avoids the issue of hidden surface points. For a given triangle of the LP model, triangles from the HP reference mesh within a search region are first of all culled. Rather than simply planar projecting these triangles, they are shrunk down based on the interpolated normal of the target triangle. In ray-tracing terms this emulates normal, rather than nearest point, sampling, which reduces artifacts--see [Sander00]. Using the coordinates calculated from the shrinking process, hardware then renders these triangles to the normal map, with stenciling used to clip away any texels outside of the target triangle in texture-space.


Conclusion

This feature, through the case study and other examples, has hopefully convinced you that programmable graphics hardware can play a role in developing more responsive art tools. Furthermore, the potential for acceleration is not restricted to the very latest floating-point graphics processors.

It's true that software is ultimately more general, often easier to debug and extend, and may offer greater accuracy. But GPU restrictions are falling away, shader debugging is becoming easier, and as the results show, high numerical accuracy isn't a prerequisite for previews.

When a process can be mapped efficiently onto graphics hardware, the speed increase can be significant, and GPUs are scaling up faster than CPUs in terms of raw performance. The implementation can also be simpler than a comparable software version when the latter needs extra data structures, algorithms and lower-level optimizations to cut processing time.

Future shader versions will clear the way for mapping a larger class of algorithms onto graphics processors and higher-level abstractions such as BrookGPU are another welcome development. Faster communication through PCI-Express will also make hybrid solutions more viable.


Acknowledgements

I would like to thank Simon Brown, Willem de Boer, Heine Gundersen, Peter McNeill, David Pollak, Richard Sim and Neil Wakefield for comments and support; Peter-Pike Sloan and Rune Vendler for detailed feedback, information and ideas; Jaakko Lehtinen and Christian Seger for describing their respective hardware processing schemes; the guys at Media Mobsters for regular testing of unreliable code on a range of GPUs; Simon Green and NVIDIA for the Ogre stills and permission to use the head mesh, modeled by Steve Burke; Microsoft for permission to publish details of upcoming D3DX features.


References

[Brown03] Brown, S, How To Fix The DirectX Rasterisation Rules, 2003.
[Cignoni98] Cignoni, P, Montani, C, Rocchini, C, Scopigno, R, A general method for preserving attribute values on simplified meshes, IEEE Visualization, 1998.
[Coombe03] Coombe, G, Harris, M J, Lastra, A, Radiosity on Graphics Hardware, June 2003.
[Forsyth03] Forsyth, T, Spherical Harmonics in Actual Games, GDC Europe 2003.
[James03] James, G, Rendering Objects as Thick Volumes, ShaderX2 2003.
[Landis02] Landis, H, Production-Ready Global Illumination, Siggraph course notes #16, 2002.
[Lehtinen04] Lehtinen, J, personal communication, 2004.
[Purvis03] Purvis, I, Tokheim, L, Real-Time Ambient Occlusion, 2003.
[Sander00] Sander, P, Gu, X, Gortler, S J, Hoppe, H, Snyder, J, Silhouette Clipping, Siggraph 2000.
[Seger03] Seger, C, personal communication, 2003.
[Sloan02] Sloan, P-P, Kautz, J, Snyder, J, Precomputed Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments, Siggraph 2002.
[Sloan03] Sloan, P-P, Hall, J, Hart, J, Snyder, J, Clustered Principal Components for Precomputed Radiance Transfer, Siggraph 2003.
[Sloan04] Sloan, P-P, personal communication, 2004.
[Wang03] Wang, Y, Fröhlich, B, Göbel, M, Fast Normal Map Generation for Simplified Meshes, Journal of Graphics Tools, Vol. 7 No. 4, 2003.
[Whitehurst03] Whitehurst, A, Depth Map Based Ambient Occlusion Lighting, 2003.

Additional Reading

Advanced Global Illumination, Philip Dutré, Philippe Bekaert, Kavita Bala, AK Peters, 2003.

"Spherical Harmonics, The Gritty Details", Robin Green, GDC 2003.

"Practical Precomputed Radiance Transfer", Peter-Pike Sloan, ShaderX2, 2003.

"Ambient Occlusion", Matt Pharr, Simon Green, GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics, Addison-Wesley, 2004.

General-Purpose Computation on GPUs, GPGPU.org.

About the Author

Steve Hill


Steve Hill has a BSc in Computer Science (Honours, 2.1) from The University of Southampton, UK. During his studies he interned at STMicroelectronics, Graphics Products Division (Bristol, UK, unfortunately closed). After graduating he worked at Criterion Software Ltd, Art Tools Group (Guildford, UK) as a Software Engineer. Steve can be reached via the username sjh199 at the domain zepler.org.
