An increasing number of 3D application developers, mainly game developers, are writing applications that take advantage of 3D graphics acceleration via Microsoft's Direct3D API. While writing to the API itself is relatively straightforward, getting optimal performance out of the underlying hardware can prove elusive. This article presents a variety of optimization techniques as well as insight into how a 3D application interacts with Direct3D and, ultimately, the 3D graphics accelerator. This knowledge has been gleaned from my efforts in assisting developers with Direct3D optimization as well as porting 3D applications to ATI's proprietary API, often from Direct3D versions. I will also present some hard data gathered from a simple Direct3D application run on a variety of 3D accelerators.
Optimizing Direct3D code for 3D hardware means minimizing the communication with the 3D graphics accelerator. This translates to:
- minimizing render state changes
- separating 2D and 3D operations
- batching/stripping/fanning vertices
There is a theoretical minimum number of DrawPrimitive(), SetRenderState(), and other calls necessary to render a given scene. An application should minimize communication with the 3D card in an attempt to approach this theoretical minimum. Naturally, there are limitations which can prevent a given game from getting all the way there (e.g., the game is sort dependent, it has architectural baggage due to being converted from another platform, or it's poorly architected but too late to change). Usually, however, there are still changes that can be made to a Direct3D application to improve its performance on a variety of hardware.
Minimizing SetRenderState() Calls
The single biggest optimization that can be made is minimization of SetRenderState() calls. One way to design an application or engine to minimize these calls is to use a material abstraction where each material is an associated set of render states. For each frame, an application would render all of the polygons of a given material (set of render states) and then move on to the next material without revisiting any materials. The engine could even traverse the materials such that the minimum number of SetRenderState() calls is made, though this might be going a bit overboard. Of course, it wouldn't be quite this simple with non-z-buffered applications. Additionally, alpha-blended polygons should be deferred to the end of the frame and depth-sorted. Such an architecture will go a long way toward eliminating redundant SetRenderState() calls.
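The ordering described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Poly` record and its fields are hypothetical and are not part of Direct3D, and a material change stands in for whatever SetRenderState() calls that material would require.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-polygon record; the field names are assumptions.
struct Poly {
    int   material;  // index of a set of render states (often just a texture)
    bool  alpha;     // true if the polygon is alpha-blended
    float depth;     // view-space depth, larger means farther away
};

// Order polygons so that opaque ones are grouped by material and
// alpha-blended ones come last, sorted back to front.
void SortForRendering(std::vector<Poly>& polys) {
    std::stable_sort(polys.begin(), polys.end(),
        [](const Poly& a, const Poly& b) {
            if (a.alpha != b.alpha) return !a.alpha;  // opaque before blended
            if (a.alpha) return a.depth > b.depth;    // blended: far to near
            return a.material < b.material;           // opaque: by material
        });
}

// Count how often the material changes while walking the list in order.
// The first polygon always counts as one change.
int CountMaterialChanges(const std::vector<Poly>& polys) {
    int changes = 0;
    for (std::size_t i = 0; i < polys.size(); ++i)
        if (i == 0 || polys[i].material != polys[i - 1].material) ++changes;
    return changes;
}
```

Calling CountMaterialChanges() before and after SortForRendering() on a frame's polygon list gives a rough measure of how many state changes the sort saves.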
Many graphics engines were not originally architected with any kind of material abstraction or even hardware acceleration in mind. Such applications can still cut out redundant SetRenderState() calls by maintaining the current state of all of the Direct3D render states and only changing a given render state when the hardware is not already in the right state. ATI has taken this approach when converting a variety of Direct3D applications to its proprietary API, which is very Direct3D-like, and we have seen around a 20% speed boost over the Direct3D versions (even in cases where we have to copy and convert data from D3DTLVERTEX structures to our ATI specific structures). Much of this speed boost is due to the elimination of redundant SetRenderState() calls, but some of it is attributable to being "closer to the metal." In most cases, the only render state that will change on a transition from one material to another is the current texture. That is, materials will, in most cases, map to textures.
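A minimal sketch of this state-shadowing approach follows. It is not the ATI code: `FakeDevice` and `RenderStateCache` are hypothetical names, and the fake device only counts calls where a real application would forward to the device's SetRenderState().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real device: it only counts the calls that reach it.
struct FakeDevice {
    int calls = 0;
    void SetRenderState(std::size_t /*state*/, unsigned long /*value*/) { ++calls; }
};

// Shadows the last value sent for each render state and drops redundant sets.
class RenderStateCache {
public:
    RenderStateCache(FakeDevice& dev, std::size_t numStates)
        : dev_(dev), value_(numStates, 0), known_(numStates, false) {}

    void Set(std::size_t state, unsigned long value) {
        assert(state < value_.size());
        if (known_[state] && value_[state] == value) return;  // already set
        dev_.SetRenderState(state, value);
        value_[state] = value;
        known_[state] = true;
    }

    // Forget all shadowed values, for use when something outside the cache
    // may have changed the device state.
    void Invalidate() { known_.assign(known_.size(), false); }

private:
    FakeDevice&                dev_;
    std::vector<unsigned long> value_;
    std::vector<bool>          known_;
};
```

The cache starts with every state marked unknown, so the first Set() for each state always reaches the device. Only later calls with an unchanged value are dropped.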
Batching polygons of a common texture has many performance benefits. It eliminates call overhead, minimizes PCI bus traffic, and perhaps most importantly, minimizes texel cache thrashing. Newer graphics accelerators come with several kilobytes of texel cache on chip. In order to keep costs low, these texel caches do not snoop the PCI bus. That is, without PCI bus snooping these caches may contain data that is out of sync with the actual video memory addresses that are cached. As a result, switching textures may result in a complete flush of the texture cache, while rendering polygons of a common texture in a batch will dramatically increase texel cache hits.
Many developers either ignore the redundant render state issue or assume that the driver or hardware will check for redundancy. Games must not rely on drivers or accelerators to check for SetRenderState() redundancy. The game has the information to best optimize away any redundant SetRenderState() calls. Pushing this responsibility downward would be far less efficient than keeping it at the application level.
In order to illustrate the performance falloff due to the addition of SetRenderState() calls to a render loop, I have created a simple Direct3D application which renders and profiles five different scenes. The code for this application was modified from the Microsoft flip3D sample application and is available as a download from Gamasutra. The sample application renders two quads on the screen for each scene. Each quad is rendered as a D3DPT_TRIANGLESTRIP of four vertices. One of the quads is screen aligned, while the other is somewhat oblique to the screen. Figure 1 shows the two quads as rendered in the test scenes.
Scene 1 consists of these two quads rendered without texture maps and with a Gouraud shaded gradation of color from top to bottom. The function used to render test Scene 1 is shown in Listing 1.
Listing 1 - RenderTimedScene1()
DWORD RenderTimedScene1(int times_to_render)
{
    DWORD begin_time, end_time;
    HRESULT result;
    int i;

    // warm the data cache
    result = d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE, NULL);

    // grab timestamp
    begin_time = timeGetTime();

    for (i = 0; i < times_to_render; i++)
    {
        // render the first strip
        d3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, D3DVT_TLVERTEX, gTestTriangle, 4, NULL);
        // render the second strip
        d3dDevice->DrawPrimitive(D3DPT_TRIANGLESTRIP, D3DVT_TLVERTEX, gTestTriangle + 4, 4, NULL);
    }

    // grab another timestamp
    end_time = timeGetTime();

    return (end_time - begin_time);
}
Scene 2 texture maps both quads with a 256x256 texture while Scene 3 texture maps both quads with a 128x128 texture. Perspective correction and D3DTBLEND_MODULATE texture blending are on. As you would expect, the render state for the current texture is set once and there are no SetRenderState() calls within the loop. Scene 4 texture maps the two quads with two different textures as shown in Figure 1 above. Naturally, there are two SetRenderState() calls in the loop. Scene 5 renders the same image as Scene 4 but there are redundant SetRenderState() calls introduced into the loop as shown in Listing 2 to show performance degradation due to redundant SetRenderState() calls.
Listing 2. Redundant Calls Introduced.
for (i = 0; i < times_to_render; i++)
{
    d3dDevice->SetRenderState(D3DRENDERSTATE_TEXTUREHANDLE, gTextureOneHandle);
    ...
}
I have measured the performance of a variety of current 3D graphics cards using the test program with times_to_render set to 2500, resulting in 10,000 triangles per scene. The results for three typical cards are shown in Figure 2 below. For each card, the time to render the 10,000 triangles was measured in milliseconds. This number was then converted to triangles per millisecond and normalized (divided by the Scene 1 score for the given card). This normalization was done so that the fall-off in performance is made clear without obscuring the issue with absolute performance comparisons.
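The arithmetic behind the normalized scores can be sketched as follows. The function names are hypothetical, and any timings passed in are the caller's own measurements. The sketch assumes only what is stated above: 2,500 iterations of two four-vertex strips give 10,000 triangles, and each scene's throughput is divided by the Scene 1 throughput for the same card.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Triangles per millisecond for one scene.
double TrianglesPerMs(double triangles, double elapsedMs) {
    assert(elapsedMs > 0.0);
    return triangles / elapsedMs;
}

// Normalize each scene's throughput by the Scene 1 throughput (index 0),
// so Scene 1 always scores 1.0 and slower scenes score below 1.0.
std::vector<double> NormalizeToScene1(const std::vector<double>& elapsedMs,
                                      double triangles) {
    assert(!elapsedMs.empty());
    const double base = TrianglesPerMs(triangles, elapsedMs[0]);
    std::vector<double> scores;
    scores.reserve(elapsedMs.size());
    for (std::size_t i = 0; i < elapsedMs.size(); ++i)
        scores.push_back(TrianglesPerMs(triangles, elapsedMs[i]) / base);
    return scores;
}
```

Because the triangle count is the same for every scene, the normalized score reduces to the Scene 1 time divided by the scene's own time. A scene that takes twice as long as Scene 1 scores 0.5.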
Figure 2. Triangles-per-millisecond fall-off for Scenes 1 through 5 on a variety of 3D accelerators. The data was gathered with the application developed for this article on a 300MHz Pentium II.
Both Card 1 and Card 2 take a minor hit for turning on texture mapping (Scene 1 to Scene 2), while Card 3 takes a significant performance hit. Changing the size of the texture used in this limited test scenario (Scene 2 to Scene 3) does not affect performance on any of the cards. Adding a SetRenderState() call to each iteration of the loop to change between two textures (Scene 3 to Scene 4) incurs a performance penalty on all three cards, particularly Card 2. Adding the redundant SetRenderState() calls as shown in Listing 2 degrades performance further still.
I encourage developers interested in this issue to download the source for this test application and experiment with it. I think it's a good idea to do this kind of profiling of Direct3D performance and SetRenderState() tracking in a developer's application as well. Intel is also devoting time and resources to this issue and the Graphics Toolkit in their recently released IPEAK family of platform performance and integration tools is intended to help developers with just this sort of workload and scene analysis.
It should be pointed out that SetRenderState() calls are effectively 3D operations that cause communication with the hardware in 3D mode. As a result, SetRenderState() calls should be made in 3D mode (within a BeginScene() - EndScene() block) to prevent the hardware from having to switch from 2D to 3D and back again when the SetRenderState() is executed. In the next section, 2D and 3D modes will be discussed further.
2D and 3D Modes
Another big optimization you can make to Direct3D games is minimizing the transitions between 2D and 3D modes via BeginScene() and EndScene() calls. On combination 2D-3D cards (the vast majority of 3D graphics accelerators), the hardware incurs overhead when switching between 2D and 3D modes. Applications should attempt to use one render block (a BeginScene() - EndScene() pair) per frame to cut this overhead to its minimum. Additionally, as mentioned above, all SetRenderState() calls should be made while in 3D mode (i.e. in a render block) since they require the hardware to switch to and from 3D mode if done while in 2D mode.
Operations such as DirectDraw Lock(), Unlock() and Blt() calls are 2D operations and can fail if performed within a render block. Many applications use both 2D and 3D operations to compose a frame. Blts are often used for heads-up displays (HUDs) and other screen-aligned overlay primitives. If possible, these 2D blts should be deferred until the end of the frame, after the EndScene() and before the Flip(), since the chip is in 2D mode at this point. Some 3D-only cards do not support Blts. As a result, many developers will use 3D polygons for inherently 2D primitives. This is fine for 3D-only cards, but on 2D/3D cards it can be more efficient to use Blts for large 2D primitives such as sky/scenery backdrops. Applications should therefore detect whether the hardware can do Blts (checked via the various DDCAPS_BLTx flags returned from a call to GetCaps()) and use Blt operations on hardware that supports them.
Some applications also use a least-recently-used (LRU) scheme (or other texture management method) in the event that the application's texture footprint is larger than the amount of video memory available for textures. In this situation, a game may not realize that a texture needs to be swapped into video memory until the middle of a frame. This can result in a series of 3D-2D-3D mode switches as texture data is moved from system to video RAM mid-frame via 2D operations. This should be avoided, and with the right design it can be. Additionally, the greater amount of texture memory provided by AGP can reduce this potential performance penalty.
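One way to reason about these mode switches is to model a frame as a list of operations and count the 2D/3D transitions. The sketch below is purely illustrative: the `Op` categories are assumptions, not Direct3D or DirectDraw types. It also assumes the application can work out before rendering which textures a frame needs, so that uploads can be moved ahead of the render block.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical categories of per-frame work, listed in the order we would
// like them to run.
enum class Op {
    TextureUpload,  // 2D: copy a texture from system memory to video memory
    Draw3D,         // 3D: issued between BeginScene() and EndScene()
    OverlayBlt      // 2D: HUD or other screen-aligned overlay
};

inline bool Is3D(Op op) { return op == Op::Draw3D; }

// Count 2D/3D transitions, assuming the chip starts each frame in 2D mode
// and must be back in 2D mode for the Flip().
int CountModeSwitches(const std::vector<Op>& frame) {
    int switches = 0;
    bool in3D = false;
    for (Op op : frame) {
        if (Is3D(op) != in3D) { ++switches; in3D = !in3D; }
    }
    if (in3D) ++switches;  // leave 3D mode at the end of the frame
    return switches;
}

// Reorder so that uploads come first, then all 3D work, then overlays.
// stable_sort keeps the relative order of operations within each group.
void GroupByMode(std::vector<Op>& frame) {
    std::stable_sort(frame.begin(), frame.end(), [](Op a, Op b) {
        return static_cast<int>(a) < static_cast<int>(b);
    });
}
```

A frame grouped this way needs at most two transitions: one into 3D mode for the single render block and one back out for the overlays and the Flip().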
Stripping and Fanning Vertices
In Direct3D, as in any 3D API, more than one primitive can be sent to the rendering hardware via a single call to the API. This amortizes the call overhead across all of the primitives rendered due to a given call. Do not make one DrawPrimitive() call per polygon. At the very least, primitives should be sent to the hardware via Direct3D as a D3DPT_TRIANGLELIST (specified using the first parameter to DrawPrimitive()). Applications may want to experiment with the number of vertices sent per DrawPrimitive() call since this will affect concurrency on 3D hardware that has PCI bus-mastering capabilities.
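The batching idea can be sketched as a small accumulator that issues one draw per full batch. This is a sketch under assumptions: `TriangleBatcher` and `Vertex` are hypothetical, and the point where a real application would call DrawPrimitive() with D3DPT_TRIANGLELIST is only counted here. The batch size is a parameter so that different values can be tried.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };  // simplified stand-in for D3DTLVERTEX

// Collects triangle-list vertices and issues one draw per full batch.
class TriangleBatcher {
public:
    explicit TriangleBatcher(std::size_t maxVertices) : max_(maxVertices) {
        assert(max_ >= 3 && max_ % 3 == 0);  // whole triangles only
        pending_.reserve(max_);
    }

    void AddTriangle(const Vertex& a, const Vertex& b, const Vertex& c) {
        if (pending_.size() + 3 > max_) Flush();
        pending_.push_back(a);
        pending_.push_back(b);
        pending_.push_back(c);
    }

    // A real application would make a single triangle-list draw call here
    // for all pending vertices; this sketch only counts it.
    void Flush() {
        if (pending_.empty()) return;
        ++drawCalls_;
        verticesSent_ += pending_.size();
        pending_.clear();
    }

    int drawCalls() const { return drawCalls_; }
    std::size_t verticesSent() const { return verticesSent_; }

private:
    std::size_t         max_;
    std::vector<Vertex> pending_;
    int                 drawCalls_ = 0;
    std::size_t         verticesSent_ = 0;
};
```

The caller must invoke Flush() once more before changing render states or ending the scene, so that the final partial batch is drawn with the states it was collected under.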
Polygons which share vertices (including texture coordinates) and which should be rendered with identical render states can be organized for more concise and efficient communication with Direct3D and thus the underlying hardware. Such groups of vertices can be rendered as a D3DPT_TRIANGLESTRIP or D3DPT_TRIANGLEFAN. This can be very efficient if the application's 3D models are structured accordingly, but can be wasteful on the CPU side if the 3D structures to be rendered are not already stripped or fanned.
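To make the saving concrete, the sketch below compares vertex counts for lists and strips and expands a strip into the equivalent triangle list. The function names are hypothetical. The vertex order used for every second triangle follows the usual strip convention of swapping two vertices so that all triangles keep the same winding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Vertices needed to describe n connected triangles.
std::size_t ListVertexCount(std::size_t n)  { return 3 * n; }
std::size_t StripVertexCount(std::size_t n) { return n == 0 ? 0 : n + 2; }

// Expand a triangle strip (given as vertex indices) into a triangle list.
// Odd-numbered triangles have their last two vertices swapped so that the
// winding order matches the first triangle.
std::vector<int> StripToList(const std::vector<int>& strip) {
    std::vector<int> list;
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        list.push_back(strip[i]);
        if (i % 2 == 0) {
            list.push_back(strip[i + 1]);
            list.push_back(strip[i + 2]);
        } else {
            list.push_back(strip[i + 2]);
            list.push_back(strip[i + 1]);
        }
    }
    return list;
}
```

For a quad like those in the test scenes, a strip needs four vertices where a list needs six. For long strips the ratio approaches one vertex per triangle instead of three.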
Strive to Optimize
There are a variety of basic techniques which can be used to achieve optimal performance in Direct3D applications, but the main idea behind them is minimizing the communication with the 3D graphics accelerator. This can be done by minimizing render state changes, separating 2D and 3D operations and batching/stripping/fanning vertices. These techniques can have varying degrees of effectiveness depending on the overall architecture of a 3D graphics engine, but keeping these basic principles in mind will take you a long way toward optimal performance on 3D graphics hardware.
Jason L. Mitchell is a Software Engineer at ATI Research Inc. (Marlborough, MA). He can be reached at [email protected].