The use of shaders is affecting not only real-time rendering in a fundamental way but also game production, in terms of workflow and art tools, as we seek to maximize their visual potential. With more parameters and surface maps affecting the final on-screen look of game characters and scenes, it's more important than ever that artists have tools that enable them to fine-tune their work quickly and easily.
Thankfully, the major 3D graphics packages are adding native support for shaders. But in-house art tools (such as custom plugins, editors and build processes) should strive for the same level of usability.
This feature promotes the use of programmable graphics hardware outside of real-time rendering, in the processing of artistic effects in game development. To demonstrate the power of this technique, a large chunk of this article concerns mapping an intensive precomputed shadowing technique for meshes (ambient occlusion) onto the GPU. The method can be 15 times faster than a solution that takes advantage of only the host CPU.
This technique is a boon for art builds, but it also opens up the possibility of wrapping things up in a plugin that would let artists effortlessly generate, visualize and tweak vertex or texture data within their 3D art package. In addition, the implementations presented here use version 1.1 Direct3D shaders for processing, so they can be widely deployed for use with today's common graphics cards.
Feature Overview
The first section looks at how technological advances affect art production and what features tools need to provide. An overview of ambient occlusion follows, leading to a three-step workout, turning an initially sluggish hardware approach into a lean, streamlined solution.
In fact, several variations on the technique are covered, illustrating a range of production tradeoffs. I'll also show how these can be extended to handle higher-order occlusion (generalized precomputed radiance transfer) before closing with a roundup of other accelerated preprocesses that you might wish to consider.
Art Production
To begin, let's consider the broad picture of art production and where ambient occlusion (AO), as a preprocess, fits in. The technique for generating the AO shadows can be integrated into the creation end of the production process (3D art packages such as 3ds max and Maya that are already using real-time shaders), giving the 3D artist an instant preview of what the effect will look like. Of course, it could be integrated with any commercial or custom tool that has access to 3D mesh data.
The Case Study
The AO process (for game purposes, at least) takes in one or more meshes and spits out vertex or texture data. There are a number of variations described later on, but with the basic technique this data consists of a single shadow term for every surface point (stored as an additional vertex attribute or packed into an occlusion map), which is used in-game in addition to ambient (or diffuse environmental) lighting.
In contrast to its low runtime cost, the processing is pretty intense, since lots of visibility samples need to be taken at each point. With a typical CPU-based approach, AO forms part of a lengthy build process that is far from ideal for editing and previewing purposes. Speeding up the calculation will not only yield faster builds but also afford us the opportunity to expose AO as a modeling package plugin, so that artists can iterate more readily on content.
I will provide an overview of AO, but the subject is covered in greater detail by a chapter in the forthcoming GPU Gems book (see "Additional Reading"). Whether you're already familiar with the technique or not, this primer sets the scene, so to speak, for mapping AO onto graphics hardware.
Ambient Occlusion: A Primer
AO measures the amount that a point on a surface is obscured from light that might otherwise arrive from the outside. This average occlusion factor is recorded at every surface element, vertex, or texel, and used to simulate self-shadowing (see Figure 1).


Figure 1: Breakdown of Ambient Occlusion in conjunction with
The extra soft self-shadowing adds a great deal of believability to the lighting.
Note: This example setup is available as an effect which comes bundled with the RenderMonkey shader development suite.
The technique has come to prominence through its use in the film industry, with ILM first employing it for Dinosaur [Landis02]. Its niche is attenuating soft environmental lighting, particularly from indirect sources such as walls, sky, or ground, achieving the look of more complex setups (usually the manual placement of extra bounce lights or full-blown global illumination) at a lower cost.
One point that should be stressed is that fine-tuning lighting is a frequent task in post-production, and by decoupling the shadowing and lighting, it is possible to tweak the lighting without re-rendering the shadows.
Assumptions
Several assumptions are required for the reuse of visibility information to work:
 Diffuse surfaces
 Rigid geometry
 Constant lighting
The first two constraints are also required by traditional precalculated radiosity. The diffuse surface limitation means that the stored visibility is view-independent and, in conjunction with a lack of deformation, ensures that the data can be reused. Because of the separation of lighting and visibility, AO is less restrictive than static lighting, since single meshes with independent occlusion can be merrily translated, rotated and uniformly scaled. A group must be transformed as one, however, in order to preserve inter-object shadows.


Figure 2: The two basic forms of AO: (left) Vertex data. Note the linear interpolation artifacts behind the ear due to the low tessellation, which is unable to accurately capture shadow changes in this region. (right) Texture data. Note the smoother shadowing.
The Process
Equation 1, expressing occlusion, O_{p}, in integral form, is typically solved via Monte Carlo integration. For detailed coverage of the methodology consult the "Additional Reading" section.


Equation 1: Hemispherical integration of visibility 
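As a hedged sketch, one form of Equation 1 that is consistent with the description that follows is given here. The symbol $\Omega$ for the hemisphere around $N$ and the $1/\pi$ normalization are assumptions, chosen so that a completely unobstructed point yields $O_p = 1$; $V_p$ is the binary visibility function and the $(N \cdot \omega)$ factor is the cosine weighting.

```latex
O_p = \frac{1}{\pi} \int_{\Omega} V_p(\omega)\,(N \cdot \omega)\,\mathrm{d}\omega
```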
Rays are traced outward from a given surface point p over the hemisphere around the normal N. A binary visibility function V is evaluated for each of these rays, which returns 0 if the ray intersects any geometry before reaching the extent of the scene, and returns 1 otherwise. Figure 3 shows a simple 2D depiction of the process for a single semi-occluded point. In reality, creating smooth gradations in the shadows requires hundreds or even thousands of rays.


Figure 3: A cross-section showing visibility sampling over the hemisphere for a point p. Many rays are fired off and tested for intersection with blocking geometry, with the results averaged together.
In the case of a uniform distribution of ray directions, each visibility sample is weighted by the cosine between the normal and sample direction (again, consult Additional Reading for an explanation) and averaged together, resulting in our scalar occlusion term. For efficiency, ILM uses a cosine distribution of rays, which removes the need for the cosine factor (as shown in Equation 2) and concentrates samples in statistically important directions.


Equation 2: Monte Carlo integration using a cosine ray distribution 
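The two sampling strategies just described can be compared with a small Python sketch. It is illustrative only: it works in a local frame where the normal N is the +z axis, the function names are hypothetical, and the `visible` callback stands in for a real ray caster. With cosine-distributed rays the estimate reduces to a plain average of visibility results, O_p ≈ (1/n) Σ V_p(s_i), which is the shape Equation 2 is described as having; with uniform rays each sample carries the cosine factor, plus a factor of 2 that comes from the uniform density over the hemisphere.

```python
import math
import random

def uniform_hemisphere(rng):
    """Direction uniformly distributed over the hemisphere around +z."""
    z = rng.random()
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def cosine_hemisphere(rng):
    """Direction distributed proportionally to cos(theta) around +z."""
    u = rng.random()
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(u)
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u))

def occlusion_uniform(visible, n, rng):
    """Uniform rays: weight each visibility sample by the cosine (d.z)."""
    total = 0.0
    for _ in range(n):
        d = uniform_hemisphere(rng)
        total += visible(d) * d[2]
    return 2.0 * total / n

def occlusion_cosine(visible, n, rng):
    """Cosine-distributed rays: a plain average of visibility samples."""
    return sum(visible(cosine_hemisphere(rng)) for _ in range(n)) / n

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical occluder: a wall blocking every direction with x > 0.
    wall = lambda d: 0.0 if d[0] > 0.0 else 1.0
    print(occlusion_uniform(wall, 20000, rng))
    print(occlusion_cosine(wall, 20000, rng))
```

Both estimators converge to the same value; the cosine-distributed version simply spends its rays where they contribute most.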
Game Use
For games, AO immediately enhances constant ambient lighting, adding definition to otherwise flat, shadowed areas. NVIDIA's Ogre demo (see Figure 4) uses AO in this way, in addition to a key directional light plus shadow map.



Figure 4: Breakdown of lighting in NVIDIA's Ogre demo: (above) AO (below) AO multiplied by constant ambient lighting, on top of key lighting (directional light plus shadow map). Images courtesy of Spellcraft Studio GmbH and NVIDIA Corporation. 
This is the only correct use for AO, where lighting is assumed to be constant and thus it can be factored out. In practice, however, we can stretch things slightly as the overview suggested. ILM, for instance, uses a prefiltered environment map for secondary lighting of diffuse surfaces, instead of a uniform ambient term. Here, believable results are achieved from shadowing with AO because the illumination varies slowly. They also improve on the approximation by storing the average unoccluded direction (a so-called "bent normal"), which replaces the surface normal when indexing the map.
Spherical harmonic (SH) lighting [Forsyth03] is often a more attractive option than using a prefiltered map. Again, AO can be used for attenuation but it can also be extended to precomputed radiance transfer (using SH), as is shown later.
In summary, AO is a cheap but effective shadowing technique for diffuse environmental illumination, and the combination of the two can add a plausible global effect on top of traditional game lighting. Using the power of today's programmable graphics cards, we can accelerate the technique enough to make it usable during the creation of 3D models and art, as described next.
______________________________________________________
Mapping AO onto the GPU
Reformulation
One method of directly mapping AO's hemispherical sampling onto hardware is the hemicube [Purvis03], but this is computationally expensive: the scene must be transformed and rendered multiple times for each surface element. A more efficient alternative comes from tracing many coherent rays together in the opposite direction. An intuitive implementation of this, developed by Weta Digital and described in [Whitehurst03], involves surrounding the scene with a sphere of lights, as Figure 5 shows. The light directions are used for sample weighting, while associated depth maps (our coherent rays) are used for visibility determination.


Figure 5: Sampling over a sphere. The viewpoints (represented as cones) are processed one at a time. This batches together visibility rays through a shadow depth map for each orientation. 
Equation 3 encapsulates Monte Carlo integration in this instance. For each element p, weighted visibility samples (from this point on simply referred to as "samples") are accumulated and averaged via the weight sum w. Because the sample directions s_{i} cover the unit sphere, those outside of a point's vision are given a zero weighting by the hemispherical function H. The visibility function V is just a depth comparison between the surface element in depth map space and the corresponding map value.


Equation 3a: Monte Carlo integration over a sphere 


Equation 3b: Definitions 
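A form of Equations 3a and 3b consistent with the surrounding text and with the "max(N.s_i, 0)" weighting used in the shader listings is sketched below. The subscripts and the symbols $d_i(p)$ (depth of element $p$ in the depth map space of direction $s_i$) and $D_i(p)$ (the depth map value stored at the corresponding location) are notational assumptions.

```latex
O_p = \frac{1}{w} \sum_{i=1}^{n} V_p(s_i)\, H_N(s_i),
\qquad
w = \sum_{i=1}^{n} H_N(s_i)

H_N(s) = \max(N \cdot s,\ 0),
\qquad
V_p(s_i) =
\begin{cases}
1 & \text{if } d_i(p) \le D_i(p) \\
0 & \text{otherwise}
\end{cases}
```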
We can use graphics hardware to quickly render depth maps, but it's possible to go further and gain full benefit from the GPU after overcoming a couple of hurdles. The following walkthrough covers the stages involved in accomplishing this.
First Try
Before getting more adventurous in the kitchen, let's start baking using plain rasterization hardware and software processing. Here's the recipe:
1. Pick an orientation around scene center
2. Render geometry from this viewpoint
3. Read back depth information
4. Transform each surface element into depth buffer space
5. Perform visibility test via depth comparison
6. Repeat above steps, accumulating element samples and weights
7. Calculate AO for each element from sample and weight totals
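The recipe above can be sketched entirely in software. The following Python is a minimal illustration under stated assumptions, not the article's implementation: the scene is made of spheres so that an orthographic depth map can be "rendered" by intersecting one ray per pixel, directions come from a Fibonacci spiral, and every name and default (`bake_ao`, `res`, `bias`) is hypothetical.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def fibonacci_sphere(n):
    """Step 1: n roughly evenly spread directions over the whole sphere."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    dirs = []
    for i in range(n):
        z = 1.0 - (2 * i + 1) / n
        r = math.sqrt(1.0 - z * z)
        dirs.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return dirs

def image_axes(s):
    """Two unit vectors spanning the image plane perpendicular to s."""
    helper = (1.0, 0.0, 0.0) if abs(s[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = cross(helper, s)
    length = math.sqrt(dot(u, u))
    u = (u[0] / length, u[1] / length, u[2] / length)
    return u, cross(s, u)

def render_depth(spheres, s, u, v, radius, res):
    """Steps 2-3: orthographic depth map, viewed from direction s."""
    depth = [[math.inf] * res for _ in range(res)]
    for j in range(res):
        b = ((j + 0.5) / res * 2.0 - 1.0) * radius
        for i in range(res):
            a = ((i + 0.5) / res * 2.0 - 1.0) * radius
            o = tuple(a * u[k] + b * v[k] + radius * s[k] for k in range(3))
            for centre, rad in spheres:
                oc = (o[0] - centre[0], o[1] - centre[1], o[2] - centre[2])
                h = dot(oc, s)
                disc = h * h - (dot(oc, oc) - rad * rad)
                if disc >= 0.0:
                    depth[j][i] = min(depth[j][i], h - math.sqrt(disc))
    return depth

def to_pixel(coord, radius, res):
    return min(int((coord / radius + 1.0) * 0.5 * res), res - 1)

def bake_ao(spheres, elements, radius, n_dirs=64, res=64, bias=0.1):
    """elements: list of (position, normal); radius: scene bounding sphere."""
    samples = [0.0] * len(elements)
    weights = [0.0] * len(elements)
    for s in fibonacci_sphere(n_dirs):
        w = [max(dot(n, s), 0.0) for _, n in elements]
        if not any(w):
            continue  # no element faces this direction
        u, v = image_axes(s)
        depth = render_depth(spheres, s, u, v, radius, res)
        for k, (p, _) in enumerate(elements):
            if w[k] == 0.0:
                continue
            # Steps 4-5: transform into depth map space and compare.
            i = to_pixel(dot(p, u), radius, res)
            j = to_pixel(dot(p, v), radius, res)
            visible = radius - dot(p, s) <= depth[j][i] + bias
            # Step 6: accumulate samples and weights.
            weights[k] += w[k]
            samples[k] += w[k] if visible else 0.0
    # Step 7: normalise.
    return [sm / wt if wt > 0.0 else 1.0 for sm, wt in zip(samples, weights)]
```

For instance, baking the top point of a unit sphere once on its own and once with a second sphere hovering above it should give a clearly lower value in the second case. The fixed depth bias is a simplification; limited depth resolution and grazing angles cause the same kinds of errors discussed later in the article.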
This is a valid procedure for graphics chips that lack any sort of programmability, but otherwise hardware capabilities are going to waste; per-element transform and triangle rasterization (in the case of texture baking) are tasks better performed by dedicated hardware. The performance of this approach is also hurt by readback, a big deal given the high number of iterations necessary to avoid noise (or banding because of shared viewpoints).
To boost processing speed substantially, we must make more effective use of the GPU. As the remaining steps show, this leads to reducing readback as well.
Second Attempt
The time-consuming steps (bar depth transfer) are amenable to stream processing, so let's take advantage of this. Depth rendering will remain largely as before, but we can replace software transform and comparison of surface elements with shaders. A vertex shader computes the weight for a given surface element and orientation. A corresponding pixel shader performs the necessary depth check, with the weight and test result written out. As Figure 6 captures, we now have two accelerated stages: a depth pass and a sampling pass.


Figure 6: The shader passes in the second attempt at mapping AO onto the GPU. 
For vertex AO, points (i.e., D3DPT_POINTLIST) are sent through the new shader pair, each occupying a single pixel (containing the weight and test result) in the render target. With these laid out contiguously in vertex order, the target contents can be simply iterated over in software.
In the case of texture baking, the process is slightly more involved, as triangles are rasterized instead using non-overlapping UVs specified by the artist or generated automatically. The weight is also calculated per-pixel, with the vertex shader performing setup instead.
Readback has been reduced somewhat compared to the first try, assuming that surface resolution (number of elements) is lower than depth buffer resolution (undersampling would occur otherwise). A jump in speed should be expected as well, depending on relative GPU and CPU muscle, plus the amount of time spent optimizing the last version. The new process is also simpler since the dedicated hardware takes care of lower-level computations.
Despite these gains, room for improvement remains, as the pipeline is still stalled by readback. If the summation stage were to be moved to the graphics card, the GPU would be kept busy and we would only have to read the totals back at the end. This is possible via ps2.0 shaders and high-precision targets, but cards which support these features are not ubiquitous and the extra accuracy can come with a speed and storage hit. As the final version reveals, however, with a little craftiness the GPU can perform partial summation without the features just mentioned, further cutting back transfer by well over an order of magnitude.
Third Time Lucky
A split ramp, a neat idea borrowed from [James03], is the solution to the problem of accumulating values in an 8-bit-per-component buffer. Consider the case of vertex AO from the previous attempt. Rather than just passing the weight through the pixel shader, the value can be used to index a special texture which splits it across color components. High bits are returned in one channel and the low bits in another.
The issue of precision has been avoided up until now, but as the results will show, 8 bits are plenty for a single weight or sample (for preview purposes, anyway). Accordingly, the split ramp chops the weight into two 4-bit parts over R and G. Samples can be handled in the same way, with the weight written to B and A and masked by the visibility result. The empty high bits in all components allow values to be accumulated via alpha blending, saturating after 16 passes or so. Figure 7 illustrates this extended process for a single element.


Figure 7: The process of splitting weights and samples across multiple color components. 
After every block of iterations, the host reconstructs sums from the target, adding them to main totals held in main memory for each element. The target is then cleared to zero for the next set of iterations.
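The arithmetic behind the split ramp and the host-side reconstruction can be sketched in a few lines of Python. This is a simplified emulation with hypothetical function names: one RGBA pixel is modelled as four integers, and additive blending is modelled as a saturating add to 255.

```python
def split_ramp(value8):
    """Split an 8-bit value (0..255) into its high and low 4-bit parts."""
    return value8 >> 4, value8 & 0x0F

def accumulate_block(weights8, visible):
    """Blend a block of passes into one 8-bit-per-component RGBA pixel."""
    r = g = b = a = 0
    for w, v in zip(weights8, visible):
        hi, lo = split_ramp(w)
        r = min(r + hi, 255)      # weight, high bits
        g = min(g + lo, 255)      # weight, low bits
        if v:                     # sample = weight masked by visibility
            b = min(b + hi, 255)
            a = min(a + lo, 255)
    return r, g, b, a

def reconstruct(pixel):
    """Host side: rebuild (weight sum, sample sum) from the split channels."""
    r, g, b, a = pixel
    return r * 16 + g, b * 16 + a

if __name__ == "__main__":
    weights = [200, 17, 255, 3] * 4           # 16 passes
    visible = [True, False, True, True] * 4
    print(reconstruct(accumulate_block(weights, visible)))
```

Since each 4-bit part is at most 15, sixteen passes can add up to at most 240 per channel, so nothing saturates within a block and the reconstructed sums are exact at 8-bit weight precision.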
We have reduced the frequency of the readback to 1/16th, on top of the earlier improvements, and processing is now blazingly fast for typical game meshes and texture sizes. Timings and analysis are provided later in the article.
Practicalities
There are several practical issues I have glossed over that can affect quality and correctness, some of which are independent of this case study. While perspective projection could be used for depth map rendering, as suggested by earlier talk of a "sphere of lights", orthographic projection suits our needs perfectly.
With the former, extra shader math is required because the eye direction (or sample direction depending on the point of view) varies, and the resolution is biased towards near features. Orthographic rendering, on the other hand, offers uniformity and constructing a tight frustum is effortless, since the dimensions are best based upon the scene's bounding sphere for consistency across all viewpoints.
With Direct3D, as used here, the issue of rasterization rules comes into play. Geometry is sampled at pixel centers, a mismatch with texture lookup, which reads from texel edges. When rendering to a texture with subsequent reading in another pass, it's simplest to adjust the rasterization coordinates and [Brown03] presents a clean way to achieve this. It's a good idea to get this aspect set up and tested with predictable data from the outset, so that you can avoid problems later, subtle or otherwise.
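As an illustration of the kind of adjustment involved, the Python sketch below computes a half-pixel offset in clip space and checks that a texel centre then lands exactly on a pixel centre. It assumes the Direct3D 9 convention that pixel (i, j) is sampled at screen position (i, j); the function names are hypothetical, and the scheme in [Brown03] may differ in detail.

```python
def raster_offset(width, height):
    """Clip-space offset that shifts output by half a pixel (up and left).

    In a shader this would be multiplied by w and added to the position.
    """
    return (-1.0 / width, 1.0 / height, 0.0, 0.0)

def ndc_to_screen(x, y, width, height):
    """Viewport mapping from normalised device coordinates to screen space."""
    return (x + 1.0) * 0.5 * width, (1.0 - y) * 0.5 * height

def texel_centre_to_screen(i, j, width, height, offset):
    """Where the centre of texel (i, j) ends up when rasterised from its UV."""
    u, v = (i + 0.5) / width, (j + 0.5) / height
    x = 2.0 * u - 1.0 + offset[0]
    y = 1.0 - 2.0 * v + offset[1]
    return ndc_to_screen(x, y, width, height)

if __name__ == "__main__":
    size = (256, 128)
    print(texel_centre_to_screen(3, 5, *size, (0.0, 0.0, 0.0, 0.0)))
    print(texel_centre_to_screen(3, 5, *size, raster_offset(*size)))
```

Without the offset the texel centre falls halfway between pixel centres; with it, texel (3, 5) maps onto pixel (3, 5).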
Filtering is another easily forgotten but important issue when using certain lookup tables. With a split ramp, point sampling is required to ensure a correct value is returned in regions where the low bits wrap around.
All of the variations in the next section make use of an 8-bit pseudo depth buffer. In most cases it's possible to swap in a more accurate 16-bit version (using a splitting scheme) without needing a separate pass for the depth comparison. Hardware shadow mapping can also be used when supported.
Implementations
It has already been noted that AO can be calculated and stored at every mesh vertex or separately as an occlusion map. The former is a good option with constant ambient lighting, provided that surfaces are sufficiently tessellated to capture shadow changes well. This requirement goes away with the latter, and while an AO texture results in a storage cost, one may be able to pack the channel with another map in order to save a texture stage.
Another option is available when high-poly models are used for normal map generation. The extra information can be used to compute a more detailed AO texture, at the cost of additional processing time.
Also, precomputed radiance transfer can go beyond AO (and bent normals) through extra terms, handling low-frequency lighting situations more accurately, again at the cost of storage, although the data can be compressed. Rendering is affected as well, since SH lighting requires some additional support.
These situations are all dealt with in the remainder of this section, particularly with regard to accelerated preprocessing.
Vertex Baking
There is very little more to say on the workings of vertex AO, and the shader code in Listing 1 should be self-explanatory.
Listing 1.
////////////////////////////////////////////////
// depth.vsh
////////////////////////////////////////////////
vs.1.1

// c0   : Rasterisation offset
// c1-4 : World*View*Proj. matrix

dcl_position v0

// Output projected coordinates
m4x4 r0, v0, c1
mad oPos, r0.w, c0, r0

// Output depth via diffuse colour register
mov oD0, r0.z

////////////////////////////////////////////////
// sampling_v.vsh
////////////////////////////////////////////////
vs.1.1

// c0   : Rasterisation offset
// c1-4 : World*View*Proj. matrix
// c5   : Sample direction

// Note: the negative signs in c8 and c9 are assumed, following
// Direct3D's convention that texture v runs opposite to clip-space y
def c8, 2.0, -2.0, -1.0, 1.0
def c9, 0.5, -0.5, 0.0, 1.0
def c10, 0.0, 0.0, 0.0, 0.504

dcl_position v0
dcl_normal v1
dcl_texcoord v2

// Scale and offset texture coordinates
// to [-1, 1] range for render target
mad r0.xy, v2.xy, c8.xy, c8.zw
mov r0.zw, c9.zw

// Output coordinates for rasterising
mad oPos, r0.w, c0, r0

// Project vertex coordinates
m4x4 r0, v0, c1

// Output depth via diffuse colour register
// (for consistency with depth pass)
mov oD0, r0.z

// Output bias for depth test
mov oD1, c10.w

// Scale and offset projected coordinates
// for depth map lookup:
// x' =  x*0.5 + 0.5*w
// y' = -y*0.5 + 0.5*w
// z' = 0
// w' = w
mul r0, r0, c9
mad oT0, r0.w, c9.xxzz, r0

// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z

// Output weight, to be split via ramp lookup
mov oT1.x, r0.z
mov oT1.yzw, c9.zzw

////////////////////////////////////////////////
// sampling_v.psh
////////////////////////////////////////////////
ps.1.1

// Sample mask: keeps the weight (R & G) and
// zeroes the sample (B & A)
def c0, 1.0, 1.0, 0.0, 0.0

tex t0 // Depth
tex t1 // Weight (R & G), sample (B & A)

// Compute depth difference, with
// a bias added for cnd (0.5 + 1/255)
sub r0.a, t0.a, v0.a
add r0.a, r0.a, v1.a

// Output weight and sample (zero if occluded)
mul_sat r1, t1, c0
cnd r0, r0.a, t1, r1
Texture Baking
For texture AO, weights must be calculated per-pixel. To this end, a transform of the normal into world space replaces previous weight operations in the vertex shader. The interpolated normal is then used to index a cubic split ramp, which performs the necessary normalization per-pixel. The remaining math comes courtesy of the ramp, and we arrive at the split weight as before. This all works because the world z-axis is the sample direction, which points in the opposite direction of the camera.
With a little effort, these two versions can be unified. A cubemap lookup might come at a cost for vertex AO, but with a little fiddling it's possible to set things up so that only the split texture need change between the cases.
Texture Baking II
While a texture offers higher fidelity than vertex occlusion, in light of normal mapping the additional resolution is somewhat underused. There are a couple of ways to increase texture detail if high-poly (HP) reference geometry is available.
The first approach is to make use of the derived normal map and calculate the weight in the pixel shader, split via a dependent texture lookup (still manageable in ps1.1 without spreading sampling work across passes). This gives the extra detail of the HP model with the coarse geometric occlusion of the low-poly (LP) mesh.
For more faithful reproduction of the extra detail, ATI's Normal Mapper tool has the option of sampling AO per texel during normal map generation. This is rather costly, however; a more frugal approach would be to precalculate AO at the vertices of the HP mesh using hardware. Occlusion then becomes just another attribute sampled by the ray caster (as described by [Cignoni99]).
Precomputed Radiance Transfer
While a single occlusion term is extremely compact, is a useful lighting component for games right now, and can be stretched beyond its assumptions, it has its limits. Precomputed radiance transfer (PRT) using the SH basis functions [Sloan02] generalizes AO (the 0^{th} term) and can capture additional directionality and therefore other effects. With nine or more transfer components, soft shadows noticeably track as lights move. Although this comes at the cost of increased storage, vectors can be efficiently compressed through Clustered Principal Component Analysis (CPCA) [Sloan03].
The assumptions for PRT to work, at least in the form presented here, are as with AO except that incoming radiance is assumed to be distant rather than constant. Lighting, now approximated via basis coefficients, can be factored out with the remaining integral evaluated for all surface points, yielding a transfer vector. Without compression, outgoing radiance R_{p} is reconstructed as a large inner product between the vector L (expressing environmental lighting) and the transfer vector T_{p} scaled by the diffuse surface response, as Equation 4 shows.


Equation 4: Outgoing radiance (from a diffuse surface and distant lighting) approximated via spherical harmonics 
So our occlusion term is now a vector, computed in much the same way as with AO, except for the presence of basis functions B_{i}. Indeed, Equation 5 is very similar to Equation 3, but with a summation for each term.


Equation 5: Transfer vector (Monte Carlo integration) 
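Forms of Equations 4 and 5 consistent with the prose are sketched below, as an aid rather than a definitive reconstruction. The symbol $\rho_p$ for the diffuse surface response, the component index $i$, the sample index $j$ and the exact normalization constants are assumptions; $H_N$, $V_p$ and $w$ are used as in Equation 3.

```latex
R_p \approx \rho_p \sum_{i=1}^{n} L_i \, T_{p,i}

T_{p,i} = \frac{1}{w} \sum_{j} B_i(s_j)\, V_p(s_j)\, H_N(s_j)
```

Setting $B_0$ to a constant recovers the single AO term of Equation 3, up to that constant.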
For our hardware implementation, each basis function B_{i} is evaluated in software and uploaded as a constant, since the sampling direction is fixed for a given viewpoint. Scaling and biasing is required with these terms so that they fall within the range [0,1], to prevent clipping during split ramp lookup. Accumulated values are later range-expanded in software.
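The scale, bias and later range expansion can be illustrated with a short Python sketch. The names and the particular choice of scale and bias are assumptions (any pair that keeps the encoded value inside [0,1] would do); the structure follows the shader convention of writing the bias alone when a sample is occluded.

```python
def scale_and_bias(max_abs_basis):
    """Pick a scale and bias so that encoded samples stay within [0, 1]."""
    return 0.5 / max_abs_basis, 0.5

def encode(basis_value, weight, visible, scale, bias):
    """Value written per pass: bias + scale * B(s) * H(s) * V(s)."""
    return bias + (basis_value * scale * weight if visible else 0.0)

def range_expand(encoded_sum, passes, scale, bias):
    """Host side: undo the bias (once per pass) and the scale."""
    return (encoded_sum - passes * bias) / scale

if __name__ == "__main__":
    # (basis value, cosine weight, visible) for a few hypothetical passes
    passes = [(-0.9, 0.8, True), (0.4, 0.5, False), (1.0, 1.0, True)]
    scale, bias = scale_and_bias(1.1)
    total = sum(encode(b, h, v, scale, bias) for b, h, v in passes)
    print(range_expand(total, len(passes), scale, bias))
```

Because the bias is added on every pass, occluded or not, the host only needs the pass count to remove it from the accumulated total.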
Assuming a fairly liberal number of PRT components, it makes sense to factor out parts of the sampling process that are constant over all components. These are namely weights, which are currently packed with samples and visibility determination.
With the operations moved to separate passes that write their results to additional textures, shader resources are freed up, thus allowing two PRT components to be processed in parallel per sampling pass, at least in the case of vertex baking.
One side benefit of this extra partitioning is that it's now easier to swap in a more accurate set of depth render and depth test passes, perhaps at runtime based on a quality setting. Listing 2 provides just the shaders for packed PRT sampling as the rest (depth, depth compare and weights passes) can be derived from Listing 1.
Listing 2.
////////////////////////////////////////////////
// sampling_prt_v.vsh
////////////////////////////////////////////////
vs.1.1

// c0   : Rasterisation offset
// c1-4 : World*View*Proj. matrix
// c5   : Sample direction
// c6   : Packed SH basis terms:
//      : t0*scale0, t1*scale1, bias0, bias1

// Note: the negative signs in c8 are assumed, following
// Direct3D's convention that texture v runs opposite to clip-space y
def c8, 2.0, -2.0, -1.0, 1.0
def c9, 0.5, 0.5, 0.0, 1.0

dcl_position v0
dcl_normal v1
dcl_texcoord v2

// Scale and offset texture coordinates
// to [-1, 1] range for render target
mad r0.xy, v2.xy, c8.xy, c8.zw
mov r0.zw, c9.zw

// Output coordinates for rasterising
mad oPos, r0.w, c0, r0

// Output coordinates for depth result lookup
mov oT0, v2

// Cosine weighting: max(N.s_i, 0)
dp3 r0.z, v1, c5
max r0.z, r0.z, c9.z

// Sample = B(s)*[V(s)]*Hn(s), scaled and
// biased to [0, 1] range
// V(s) is evaluated in the pixel shader
//
// The following values are passed through
// a pair of 2D split ramps:

// Scaled and biased samples in x and y
mad oT1.xy, r0.zz, c6.xy, c6.zw
mov oT1.zw, c9.zw

// Biases in x and y (occluded case)
// Note: since these are fixed, a pixel
// shader constant could be used instead
mov oT2.xy, c6.zw
mov oT2.zw, c9.zw

////////////////////////////////////////////////
// sampling_prt_v.psh
////////////////////////////////////////////////
ps.1.1

tex t0 // Depth test result: V(s)
tex t1 // Sample0, sample1
tex t2 // Bias0, bias1

// Use depth test result to mask packed samples
mov r0.a, t0.a
cnd r0, r0.a, t1, t2
Results
The Direct3D Extensions Library (D3DX) provides a number of functions for software processing of SH PRT as part of the latest public release (DirectX SDK Update Summer 2003). Since this implementation is both robust and readily available, it is an ideal reference for comparison with the hardware version presented earlier.
While not exhaustive, Table 1 shows a trend in the performance between D3DX and our hardware setup for 9-component vertex PRT. There is a clear gulf between the two versions as the number of surface elements increases, something which bodes well for texture processing.
Two sets of timings are listed for the hardware (HW1 and HW2), which differ only in the number of samples taken per vertex. The former is for direct comparison with D3DX; twice as many samples are used since the hardware method samples over a sphere, so on average only half the samples contribute anything (that's quite a contrast to hemispherical sampling in software). Bear in mind that this still isn't an entirely fair comparison, as D3DX importance samples via a cosine distribution, which speeds up convergence.
HW2 is supplied for a rough idea of the sort of performance to expect when previewing using a lower number of samples. The optimum will vary from scene to scene, however, depending on visibility variance.



Table 1: results 
These figures were recorded on an Athlon XP 2400+ PC equipped with a GeForce FX 5800 using a release build and libraries. To put the hardware version in the best light, meshes were "vertex cache order" optimized beforehand, although this improves locality of reference for software as well. The depth resolution for hardware processing was 512x512x8 bits in all cases.
Visually there is a minor difference in lighting (Figure 8) with hardware processing, caused by the limited storage precision of samples, which in my opinion is acceptable for previews. Precision and readback can be traded off via the split ramp as required, but both the number of iterations and the depth map (resolution and precision) have a greater effect on the quality of the results.


Figure 8: Output from software and hardware processing (respectively), (left) D3DX, (right) HW1 
The only major differences are the shadows under the eyelids (Figure 9), which are not captured by the hardware version. This is due to the limited depth precision, which can be increased as described earlier. A reduced depth bias combined with nudging vertices outwards (along the normal) may also improve accuracy. It should also be noted that the cyan color, present in transition areas between the red and white lights, is due to the low-order SH lighting approximation.


Figure 9: A close-up showing the lack of depth resolution in the hardware implementation compared to software: (left) D3DX, (right) HW1. Note the missing shadow under the eyelids caused by the 8-bit depth map.
[Sloan04] reports similar results to these with floating point hardware. His accelerated PRT implementation (essentially the same as the one presented in this feature but using version 2.0 shaders and high-precision buffers) will be available in an upcoming DirectX SDK Update along with significant optimizations to the software simulator. The revised API will also be more modular, enabling (among other things) general CPCA compression of user-created data, as produced here.
Extensions
It might also be possible to map one or more of the versions presented earlier onto fixed-function hardware. Whether such cards have the necessary power to outdo a clever software implementation is unclear, however, and operations may need to be split over extra passes, cutting performance further.
One trick that may improve vertex AO performance is adaptive ray sampling. A simple form of this would be to examine vertices after some fixed number of iterations, find those that appear to be fully visible or within some margin of error, then through a dynamic index buffer (or other method) and extra bookkeeping, spare these vertices further processing. A more general solution could look at differences between blocks of iterations, terminating at a given error threshold. It's an interesting idea, but extra communication between host and GPU may counter any potential speed gain.
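The simple form of this idea can be sketched as a host-side filter run after each block of iterations. The Python below is a hypothetical illustration (the function name, the `margin` parameter and the list-based bookkeeping are all assumptions): it returns the indices of vertices that still need sampling, which could then be used to rebuild a dynamic index buffer.

```python
def still_active(samples, weights, active, margin=0.01):
    """Keep vertices whose running AO is not yet within `margin` of 1.

    samples, weights: per-vertex running totals from the blocks so far.
    active: indices of vertices processed in the last block.
    Vertices with no weight yet carry no information, so they stay active.
    """
    keep = []
    for k in active:
        ao = samples[k] / weights[k] if weights[k] > 0.0 else 0.0
        if ao < 1.0 - margin:
            keep.append(k)
    return keep

if __name__ == "__main__":
    samples = [10.0, 7.5, 0.0, 9.95]
    weights = [10.0, 10.0, 0.0, 10.0]
    print(still_active(samples, weights, [0, 1, 2, 3]))
```

A retired vertex keeps the totals it had when it was dropped, so its final AO is simply the ratio of those totals.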
Other Applications
The following are a few other examples of hardware-accelerated preprocesses, some of which have already been employed in game development, and others that could be used.
Radiosity
Coombe et al. [Coombe03] map a progressive radiosity method almost completely onto the GPU and in the process their method solves a couple of the classic problems with hemicubes. Firstly, the hemicube faces no longer need to be read back to main memory for software processing of delta form factors. Rather than iterating over the faces and randomly updating elements (an infeasible task with current hardware), surface elements, maintained directly in textures, are instead inverse-transformed to the ID hemicube faces. Secondly, since IDs now only need to be assigned to patches rather than elements, aliasing (a problem with hemicubes when shooting) is also reduced.
The paper is an enlightening read, with clever tricks devised for face shooting selection and adaptive subdivision. From the point of view of acceleration, their system reaches a solution rapidly for simple scenes and intermediate results are available at any point due to the progressive nature and texture storage.
Static SH Volume
Max Payne 2 uses a static volume of SH estimates over a game level, for environmental lighting of models at runtime. Remedy accelerates the offline computation [Lehtinen04], in particular the SH projection of environment maps, through ps2.0 shaders and floating-point storage.
At a given grid point, the prelit scene is first rendered to an HDR cubemap. Face texels are then multiplied by the current basis function evaluated in the corresponding direction. Naturally, since these directional terms are fixed for all cubes, they can be precalculated and read from a texture. The resulting values are then accumulated via reductive summing of groups of four neighboring texels through multiple passes, a process akin to repeated box filtering, just without the averaging. The final level is then read back and scaled appropriately, yielding an SH coefficient. Projection is repeated for all coefficients and rendering for all grid points.
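The reductive summing step can be emulated on the CPU in a few lines. The Python sketch below assumes a square grid whose side is a power of two and uses hypothetical function names; each call to `reduce_once` corresponds to one pass that halves the resolution by summing 2x2 groups of texels.

```python
def reduce_once(grid):
    """One pass: sum each 2x2 group of texels into a single texel."""
    half = len(grid) // 2
    return [[grid[2 * y][2 * x] + grid[2 * y][2 * x + 1] +
             grid[2 * y + 1][2 * x] + grid[2 * y + 1][2 * x + 1]
             for x in range(half)]
            for y in range(half)]

def reductive_sum(grid):
    """Repeat until a single texel remains; return (total, pass count)."""
    passes = 0
    while len(grid) > 1:
        grid = reduce_once(grid)
        passes += 1
    return grid[0][0], passes

if __name__ == "__main__":
    face = [[x + 8 * y for x in range(8)] for y in range(8)]
    print(reductive_sum(face))
```

An N x N face therefore needs log2(N) passes, and only the final 1x1 level has to be read back.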
Normal Mapping
Wang et al. [Wang03] describe an image-space method to accelerate normal map processing using graphics hardware, which has similarities to the first AO scheme described earlier. The reference mesh is rendered from a number of viewpoints and depth comparisons are made in software to determine the nearest surface point and normal.
The authors work around the problem of hidden surface points in complex meshes by falling back to interpolating target triangle normals. The process outlined in the paper relies on reading back the frame buffer, containing the reference normals, and the depth buffer. It's quite possible that vertex and fragment programs could be used, as in the case study, to move more of the work to the GPU, thereby accelerating the process further.
Christian Seger, author of ORB, takes a different approach to normal map acceleration [Seger03] that avoids the issue of hidden surface points. For a given triangle of the LP model, triangles from the HP reference mesh within a search region are first culled. Rather than simply planar projecting these triangles, they are shrunk down based on the interpolated normal of the target triangle. In raytracing terms this emulates normal, rather than nearest-point, sampling, which reduces artifacts (see [Sander00]). Using the coordinates calculated from the shrinking process, hardware then renders these triangles to the normal map, with stenciling used to clip away any texels outside of the target triangle in texture space.
Conclusion
This feature, through the case study and other examples, has hopefully convinced you that programmable graphics can play a role in developing more responsive art tools. Furthermore, the potential for acceleration is not restricted to the very latest floating-point graphics processors.
It's true that software is ultimately more general, often easier to debug and extend and may offer greater accuracy. But GPU restrictions are falling away, shader debugging is becoming easier and as the results show, high numerical accuracy isn't a prerequisite for previews.
When a process can be mapped efficiently onto graphics hardware, the speed increase can be significant, and GPUs are scaling up faster than CPUs in terms of raw performance. The implementation can also be simpler than a comparable software version when the latter needs extra data structures, algorithms and lowerlevel optimizations to cut processing time.
Future shader versions will clear the way for mapping a larger class of algorithms onto graphics processors, and higher-level abstractions such as BrookGPU are another welcome development. Faster communication through PCI Express will also make hybrid solutions more viable.
Acknowledgements
I would like to thank Simon Brown, Willem de Boer, Heine Gundersen, Peter McNeill, David Pollak, Richard Sim and Neil Wakefield for comments and support; Peter-Pike Sloan and Rune Vendler for detailed feedback, information and ideas; Jaakko Lehtinen and Christian Seger for describing their respective hardware processing schemes; the guys at Media Mobsters for regular testing of unreliable code on a range of GPUs; Simon Green and NVIDIA for the Ogre stills and permission to use the head mesh, modeled by Steve Burke; and Microsoft for permission to publish details of upcoming D3DX features.
References
[Brown03] Brown, S, How To Fix The DirectX Rasterisation Rules,
2003.
[Cignoni98] Cignoni, P, Montani, C, Rocchini, C, Scopigno, R, A
general method for preserving attribute values on simplified meshes,
IEEE Visualization, 1998.
[Coombe03] Coombe, G, Harris, M J, Lastra, A, Radiosity on Graphics
Hardware, June 2003.
[Forsyth03] Forsyth, T, Spherical Harmonics in Actual Games, GDC
Europe 2003.
[James03] James, G, Rendering Objects as Thick Volumes, ShaderX2,
2003.
[Landis02] Landis, H, Production-Ready Global Illumination, Siggraph
course notes #16, 2002.
[Lehtinen04] Lehtinen, J, personal communication, 2004.
[Purvis03] Purvis, I, Tokheim, L, Real-Time Ambient Occlusion, 2003.
[Sander00] Sander, P, Gu, X, Gortler, S J, Hoppe, H, Snyder, J,
Silhouette Clipping, Siggraph 2000.
[Seger03] Seger, C, personal communication, 2003.
[Sloan02] Sloan, P-P, Kautz, J, Snyder, J, Precomputed Radiance
Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting
Environments, Siggraph 2002.
[Sloan03] Sloan, P-P, Hall, J, Hart, J, Snyder, J, Clustered Principal
Components for Precomputed Radiance Transfer, Siggraph 2003.
[Sloan04] Sloan, P-P, personal communication, 2004.
[Wang03] Wang, Y, Fröhlich, B, Göbel, M, Fast Normal Map
Generation for Simplified Meshes, Journal of Graphics Tools, Vol.
7 No. 4, 2003.
[Whitehurst03] Whitehurst, A, Depth Map Based Ambient Occlusion
Lighting, 2003.
Additional Reading
Advanced Global Illumination. Philip Dutré, Philippe Bekaert, Kavita Bala, AK Peters 2003.
"Spherical Harmonics, The Gritty Details", Robin Green, GDC 2003.
Practical Precomputed Radiance Transfer, Peter-Pike Sloan, ShaderX2, 2003.
Ambient Occlusion, Matt Pharr, Simon Green, GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics, Addison Wesley 2004.
General-Purpose Computation on GPUs, GPGPU.org.