
3D Acceleration Demystified, Part II: The Benchmarks

3D graphics accelerators from 3Dlabs, ATI Technologies, Cirrus Logic, Diamond Multimedia, Intergraph, Matrox, Number Nine, and Rendition are put to the test.

Brian Hook and Andy Bigos

June 1, 1997


In the last issue of Game Developer, I discussed some of the basics of 3D hardware acceleration and promised a performance benchmark that would scrutinize some of today's popular accelerators. To do this, I defined some standardized tests to measure performance and enlisted the help of Andy Bigos to write a rasterization performance benchmark called D3DBench. I must stress that the performance comparisons within this article take neither price nor availability into account - I'm targeting game developers who want to know what kind of performance a given accelerator can offer.

D3DBench

D3DBench uses Microsoft's Direct3D Immediate Mode rendering API for abstracting hardware access and Microsoft's Foundation Classes (MFC) for Windows-specific issues. Direct3D was selected because it is heavily supported by hardware vendors and is specifically targeted towards game developers. However, Direct3D isn't an ideal interface for all hardware, so using a vendor's proprietary API may be a better way of achieving maximum performance.

I don't have the space to describe D3DBench's inner workings, but the help files distributed with D3DBench contain additional information on its features and implementation. Although D3DBench is capable of controlling many types of display options, only a specific set of feature combinations was tested for this article.

D3DBench cares only about raw rasterization speed. While this isn't a perfect benchmark, it does provide a basis for comparing hardware rendering performance. It's important to realize that D3DBench doesn't attempt to take into account issues that will affect overall game speed, including overlap between the CPU and the hardware, CPU load, texture download performance, and texture memory size constraints.

This is very important - you cannot take the numbers derived from D3DBench and correlate them proportionally to frame rate. Besides rasterization, a game's frame rate is controlled by a number of factors, including geometric complexity, sound, artificial intelligence, collision detection and response, physics, and input management. Our results don't necessarily indicate how much faster a game will run on Hardware A than on Hardware B. With that said, I encourage you to include a demo loop within your game that can be used as your own benchmark, since the only valid measurement of true game performance uses the game itself.
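As a rough illustration, the following is a minimal sketch of the kind of in-game demo-loop benchmark I'm suggesting. The PlaybackDemoFrame() and RenderFrame() calls are hypothetical placeholders for your own engine's functions; the timing uses the standard C library for simplicity.

// Minimal sketch of an in-game demo-loop benchmark (not part of D3DBench).
// PlaybackDemoFrame() and RenderFrame() are hypothetical placeholders for
// your own engine's demo playback and rendering functions.
#include <ctime>
#include <cstdio>

extern bool PlaybackDemoFrame();   // advances the recorded demo; returns false when done
extern void RenderFrame();         // renders the current frame through your engine

void RunDemoBenchmark()
{
    unsigned long frames = 0;
    std::clock_t start = std::clock();

    while (PlaybackDemoFrame())
    {
        RenderFrame();
        ++frames;
    }

    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
    if (seconds > 0.0)
        std::printf("%lu frames in %.2f s = %.2f fps\n",
                    frames, seconds, frames / seconds);
}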

Flaky Drivers and Nonexistent Specs

During the course of developing this benchmark, the issue of flaky drivers reared its head more than once - some drivers reported erroneous capability information or gave weird or incorrect output. Unfortunately, I don't have the space to list each driver's bugs, especially since I'm assuming that most of these bugs will be ironed out by the time this article is published.

The primary problem we encountered while developing D3DBench is Direct3D's lack of a reference implementation or specification. Direct3D endorses the concept of capability bits, or the ability for a particular driver to tell an application exactly what 3D acceleration capabilities it supports. It is the application's responsibility to compensate for missing capabilities, a burdensome and error-prone requirement to say the least. The very nature of capability bits means that an application can be bug free when written for a specific piece of hardware, yet break down the moment a different piece of hardware is inserted. This is where APIs such as Silicon Graphics' OpenGL really show an advantage - OpenGL requires that all functions be available under any implementation, so one OpenGL program should work fine with any OpenGL implementation or driver. Direct3D's paradigm of capability determination is completely counter to this and, as we learned, very buggy and error prone. Further complicating the implementation of Direct3D is the fact that different hardware drivers interpret the capability bit fields differently! The point is that Direct3D programming isn't as easy as it should be. I'm taking this time to warn those of you delving into Direct3D programming to be patient, careful, and cynical.
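To make the burden concrete, here is a minimal sketch of the capability-driven fallback pattern that Direct3D forces onto the application. The DeviceCaps structure and its fields are hypothetical stand-ins for whatever your driver reports at enumeration time, and the copy-mode fallback mirrors the modulate-with-a-white-light trick described later in this article.

// Sketch of capability-driven fallback. DeviceCaps is a hypothetical
// abstraction of the bits a Direct3D driver reports; it is not the real
// device description structure.
struct DeviceCaps
{
    bool hasTextureCopy;    // "copy" (decal) texture-blend mode
    bool hasMonoLighting;   // white/mono lighting path
    bool hasAlphaBlending;  // source/destination alpha blending
};

enum TextureBlendMode { BLEND_COPY, BLEND_MODULATE };

struct RenderPath
{
    TextureBlendMode blend;
    bool useWhiteVertexColor;   // emulate copy with modulate * white
    bool alphaEffectsEnabled;
};

// Choose a rendering path that the reported capabilities can actually satisfy.
RenderPath SelectRenderPath(const DeviceCaps& caps)
{
    RenderPath path;
    if (caps.hasTextureCopy)
    {
        path.blend = BLEND_COPY;
        path.useWhiteVertexColor = false;
    }
    else
    {
        // A missing copy mode can be emulated by modulating the texture
        // with a pure white light source.
        path.blend = BLEND_MODULATE;
        path.useWhiteVertexColor = true;
    }
    path.alphaEffectsEnabled = caps.hasAlphaBlending;
    return path;
}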

The Players

Every 3D graphics accelerator manufacturer with an announced product that I was aware of was contacted, provided they had working silicon, up-to-date Direct3D drivers, and products aimed at the consumer market (no $5,000 CAD boards need apply). Those that responded with loaner boards and working drivers were 3Dlabs, ATI Technologies, Cirrus Logic, Diamond Multimedia, Intergraph, Matrox, Number Nine, and Rendition. All manufacturers were allowed to review the test results and comment privately before the article was submitted for final publication. For those of you who wish to find out more about specific products and developer programs, manufacturers' URLs are located at the end of this article.

Microsoft's software-only Direct3D RGB emulation driver was used as a reference benchmark. While the Microsoft Direct3D Ramp emulation driver and/or an 8-bit display mode would have exhibited better performance, they weren't included in the tests because of their poorer image quality (hence, they would not have represented a fair comparison). Table 1 lists the complete set of boards tested for this article.

I really wanted to test Intel's new MMX processor with Microsoft's MMX Direct3D driver; unfortunately, I didn't manage to gain access to such a machine in time for this article.

The Field

All tests had the following in common: 640x480 screen resolution, 15/16-bpp screen depth, 16-bit Z-buffering, dithering, double buffering, and all were tested in full-screen mode. The consensus is that this is the "standard Direct3D game mode" configuration. In all likelihood, support for lower resolutions, such as 400x300 and 512x384, will be common because of the lower fill rate requirements. Still, 640x480 seems to be the ideal resolution for games. All texture-mapping tests were perspective corrected, had a texel-to-pixel ratio of 1:4 (each texel maps to 4 pixels), and all texture maps were RGB and 64x64 (note that some accelerators will likely perform better with paletted textures, but this was beyond the scope of the benchmark). MIP mapping wasn't used, although it is a feature that should be measured in future benchmarks. Triangles are rotated arbitrarily so that textures are stepped through at different orientations.

All tests were run on a PC with an Intel motherboard, a 166MHz Pentium CPU, the Triton chipset, 64MB of RAM, and Windows 95 with Service Pack 1 installed. The initial release of DirectX 3.0 was used, with the nondebugging (retail) libraries. The test itself was compiled as a release build using Microsoft Visual C++ 4.2. All tests were executed with no buffer swapping so as to remove the effects of vertical retrace synchronization (a final buffer swap is executed after the timer is stopped and the hardware is idle so that output can be verified visually). When possible, the display was set to a 60Hz refresh rate for all adapters. Triangle sizes tested were 3, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, and 10000 pixels.

The tests we used represent common and important feature sets for 3D games (Table 2). I had to choose a few reasonable tests from the 50 or 60 I could devise (the original test suite had several hundred tests). I'm fairly certain that these nine tests gauge features that game developers are currently exploiting. In these tests, I stressed the importance of Z-buffering, because this feature is now supported by almost every hardware accelerator and is an elegant way to solve hidden surface removal problems. However, since D3DBench supports many more options, those of you interested in doing your own benchmarking should definitely download and play with it.

The Tests

Each specific test addresses a particular form of rendering algorithm. The following is a short description of each of the tests:

Test 1 - Smooth Shading. The SMOOTH test represents what will likely be the lowest useful common denominator: RGB-lit, Z-buffered triangles. While many people believe that nontextured rendering is a moot issue, the higher fill rates of nontextured rendering - and the associated higher frame rates - make smooth shading an appealing option for developers and consumers who believe that frame rate matters more than visual appeal.

[Figures: SMOOTH fill-rate and triangle-throughput charts]



Tests 2 and 3 - Rendering BSP trees. BSPWALLS represents the rendering mode used when drawing walls, ceilings, and floors using a BSP (Binary Space Partition) tree. When rendering BSP trees, sorting order is implicit; you don't need to use Z-buffering to handle occlusion of static objects represented by the BSP tree. Instead, you need Z-buffering to correctly render dynamic objects (such as objects that are not part of the BSP). This can be leveraged in two ways - back-to-front rendering or front-to-back rendering.

[Figures: BSPWALLS_B2F and BSPWALLS_F2B fill-rate and triangle-throughput charts]



When rendering back to front (BSPWALLS_B2F) you can set the Z comparison function to ALWAYS, since you know that anything rendered will be closer to the viewer than anything previously drawn. Through this setting, the accelerator no longer needs to read a value from the Z-buffer, easing memory bandwidth strain.

When rendering front to back (BSPWALLS_F2B), on the other hand, you set the Z comparison function to LESSEQUAL, since you now need the Z-buffer to handle occlusion. However, this mode allows accelerators that defer computations until after a pixel passes the Z test to achieve higher performance.
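A minimal sketch of the two strategies follows, assuming a hypothetical SetZCompareFunc() helper that issues the Z-function render state through whatever mechanism your Direct3D version provides (under DirectX 3, a render-state instruction in an execute buffer), and a hypothetical DrawBspWalls() traversal routine.

// Sketch of the two BSP-wall strategies. SetZCompareFunc() and
// DrawBspWalls() are hypothetical helpers, not Direct3D API calls;
// the comparison values mirror Direct3D's ALWAYS and LESSEQUAL modes.
enum ZCompare { Z_ALWAYS, Z_LESSEQUAL };

extern void SetZCompareFunc(ZCompare func);   // hypothetical state helper
extern void DrawBspWalls(bool frontToBack);   // hypothetical BSP traversal

void RenderBspBackToFront()
{
    // Sorting order guarantees each wall is nearer than anything already
    // drawn, so the Z-buffer never needs to be read -- only written.
    SetZCompareFunc(Z_ALWAYS);
    DrawBspWalls(false);
}

void RenderBspFrontToBack()
{
    // Here the Z-buffer resolves occlusion; hardware that defers work on
    // pixels failing the Z test can render this order faster.
    SetZCompareFunc(Z_LESSEQUAL);
    DrawBspWalls(true);
}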

It is assumed that something such as id Software's surface caching scheme (see Michael Abrash, "QUAKE's Lighting Model: Surface Caching," Dr. Dobb's Sourcebook, pp. 43-47) is used when rendering BSP walls; thus, the texture is unlit and bilinear filtered. Lack of a texture-copy mode precludes participation in this test. However, since a texture-copy mode can easily be emulated with flat-modulated texturing using a white light source, lack of a texture-copy mode doesn't necessarily imply a true lack of functionality.

Tests 4 and 5 - General Rendering Ability. The TEXLIT_RGB and TEXLITB_RGB tests are indicative of the most general rendering case: Z-buffered, texture-mapped, and smooth RGB-lit triangles. TEXLITB_RGB differs from TEXLIT_RGB only in that bilinear blending is enabled.


[Figures: TEXLIT_RGB and TEXLITB_RGB fill-rate and triangle-throughput charts]



Tests 6 and 7 - "Flight-Sim" Rendering. Most driving and flight simulators do not need true RGB-colored lighting, nor do they need Z-buffering. The TEXLIT_WHITE_NOZ and TEXLITB_WHITE_NOZ tests represent these types of games. Note that lack of support for mono lighting doesn't indicate missing functionality - white-light support is an optimization capability, not a feature capability.

[Figures: TEXLIT_WHITE_NOZ fill-rate chart; TEXLITB_WHITE_NOZ fill-rate and triangle-throughput charts]



Test 8 - Rendering Meshed Groups. The TEXLIT_RGB_MESH test is identical to TEXLIT_RGB, except that triangles are sent in meshed groups. Meshing lets the driver take advantage of vertex sharing and potentially has better caching effects. The obvious candidate for meshed rendering is a high-polygon-count model. For this test, the mesh size was set to thirty triangles. Meshed rendering will rarely be slower than nonmeshed rendering and, in some cases, should be considerably faster (the tests bear this out).
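The sketch below illustrates the vertex sharing that meshed submission exploits: a shared vertex pool plus indices, so the driver transforms and sets up each shared vertex once instead of once per triangle. The Vertex layout and SubmitMesh() call are hypothetical placeholders, not D3DBench or Direct3D API.

// Sketch of indexed, meshed triangle submission (hypothetical types).
#include <vector>

struct Vertex { float x, y, z, u, v; unsigned long color; };

struct MeshGroup
{
    std::vector<Vertex>         vertices;  // shared vertex pool
    std::vector<unsigned short> indices;   // 3 indices per triangle
};

extern void SubmitMesh(const MeshGroup& mesh);   // hypothetical driver call

// Two triangles forming a quad: 4 shared vertices instead of 6.
MeshGroup MakeQuad(const Vertex& a, const Vertex& b,
                   const Vertex& c, const Vertex& d)
{
    MeshGroup m;
    m.vertices.push_back(a);
    m.vertices.push_back(b);
    m.vertices.push_back(c);
    m.vertices.push_back(d);
    const unsigned short idx[6] = { 0, 1, 2, 0, 2, 3 };
    m.indices.assign(idx, idx + 6);
    return m;
}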

[Figure: TEXLIT_RGB_MESH triangle-throughput chart]



Test 9 - The STRESS Test. STRESS tests the worst-case scenario for an accelerator: Huge amounts of memory reads and writes are performed, and a large amount of data is transferred to the accelerator. This rendering mode isn't that far-fetched, either - transparent, texture-mapped objects aren't necessarily rare, and alpha blending has many uses other than transparency, including multipass lighting effects. If an accelerator is reasonably fast with STRESS, it is highly doubtful that it is slower with any of the other modes.

[Figures: STRESS fill-rate and triangle-throughput charts]


The Score

The accompanying graphs illustrate the relative performance of the accelerators with the different modes mentioned earlier. Drivers or accelerators that didn't support a particular benchmark configuration are not listed in the relevant graph. Whenever possible, the most recent drivers were used: whatever each company provided, unless more up-to-date drivers were available on its web or FTP site.

There are two sets of graphs: the triangle-throughput graphs and the fill-rate graphs. Triangle throughput measures the number of triangles per second that a hardware accelerator can process. Fill rate measures the number of pixels per second that a hardware accelerator can render. The only exception within our graphs is the TEXLIT_RGB_MESH test, where only the throughput of three-pixel triangles was quantified; TEXLIT_RGB_MESH is designed to test throughput, not fill rate.
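For back-of-the-envelope conversions between the two metrics, you can treat effective fill rate as triangle throughput multiplied by the nominal triangle size of the test in question. This is a rough approximation of my own, since actual pixel coverage varies with triangle orientation, but it helps when reading the graphs side by side.

// Rough relationship between the two metrics, assuming every triangle in a
// test covers the nominal triangle size used by D3DBench (3 to 10000 pixels).
double FillRateFromThroughput(double trianglesPerSecond, double pixelsPerTriangle)
{
    return trianglesPerSecond * pixelsPerTriangle;   // pixels per second
}

// Example: 200,000 triangles/s at 50 pixels each is roughly a
// 10 Mpixel/s effective fill rate.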

3Dlabs Reference Board (3Dlabs PERMEDIA/Delta): This board turned in some excellent triangle throughput numbers, typically second behind the Diamond Monster3D for smaller triangles (less than 250 pixels), and supported all the modes we requested. The drivers were robust and fast and definitely showed that meshing is a big win in performance on 3Dlabs' hardware. However, bilinear-filtering performance was extremely lackluster, probably due to the extra memory fetches required for proper bilinear filtering. Depending on the particular test, there seemed to be a crossover around the 100- to 250-pixel triangle mark where fill rate began to limit triangle throughput. The 3Dlabs PERMEDIA/Delta board has the notable distinction of being one of only two boards (along with the Intergraph Reactor) to execute all tests successfully.

ATI 3DXpression+: The ATI Technologies 3DXpression+ was a competent performer, with average or above average fill rates across the board. Lack of a texture-copy mode precluded generating BSPWALLS test scores. Note that lack of a copy mode isn't a killer, since you can use "modulate" with a white light to achieve the same effect. The ATI also lacked a mono-lighting mode, precluding its inclusion in the TEXLIT_WHITE_NOZ tests. Triangle throughput was fairly low, but this is a common attribute of lower-cost 3D accelerators, where it is easy to offload a lot of setup computation onto the host and leave the rendering to the hardware. Still, the 3DXpression+ will probably be a popular board because of its wide range of features and the fact that ATI is traditionally a high-volume chip vendor.

Cirrus Logic Reference Board (Cirrus Logic Laguna): The early beta drivers for this board weren't very stable. I had to hack D3DBench a bit to get it to work, but once that was done the tests looked correct. The Laguna lacks true alpha blending, a texture-copy mode, and mono lighting, preventing its participation in the STRESS, BSPWALLS, and TEXLIT_WHITE_NOZ tests respectively. Overall, the board posted average scores, sometimes a little faster and sometimes a little slower than the rest of the pack, with a tendency towards lower triangle throughput.

Diamond Monster3D (3Dfx Interactive Voodoo): The Monster3D is king of the hill in pure rendering performance, and it possesses a rich feature set to boot. Unfortunately, it lacks VGA and Windows acceleration, meaning that its penetration into the consumer market will be limited; I expect this board will only find its way into the hands of hardcore game players. The only feature missing from the Monster3D is mono lighting, so it did not participate in the TEXLIT_WHITE_NOZ test.

Fill rate was pretty much even on every test - turning on features such as Z-buffering, bilinear filtering, or alpha blending doesn't seem to exact a fill-rate penalty. Also, throughput peaks at 100-pixel triangles - for some reason, the Monster3D can process 100-pixel triangles faster than it can process 10-pixel triangles. This may be because of some hardware anomaly (10-pixel triangles come too quickly and stall the PCI bus) or something as mundane as the fact that larger triangles require smaller execute buffers in D3DBench, and thus show better caching effects.

Diamond Stealth3D, Number Nine 772 (S3 Virge and S3 Virge/VX): No matter how hard I tried, I could not get the S3 drivers to work in my system. More time was spent trying to solve problems with the S3-based boards than all the others combined. The problem seemed to vacillate between issues with my computer system and problems working with D3DBench, depending on the whims of the Compatibility Gods. I'd like to mention that both Nicholas Wilt of Microsoft and Phil Parker of Number Nine made valiant attempts at trying to get these boards working in my system.

Intergraph Reactor (Rendition Verite): The Intergraph Reactor showed average or slightly below average raw performance, at least in terms of fill rate. Triangle throughput was very good, placing it third, behind the Diamond Monster3D and the 3Dlabs PERMEDIA/Delta.

In all fairness, I'd like to note that games actually written for the Rendition Verite (the chipset used in the Reactor) have demonstrated very good performance, probably the result of overlap more than anything else. For this reason, D3DBench is not a good measure of performance for architectures that depend on overlap to realize their optimal performance figures.

Matrox Mystique (MGA 1064-SG): The Matrox board's lack of bilinear filtering really hurt its usefulness with D3DBench - the test suite runs on the assumption that users will be demanding bilinear filtering from future games, and with this in mind, the Matrox could not execute with TEXLITB_RGB, TEXLITB_WHITE_NOZ, BSPWALLS, or STRESS. On the tests that the Matrox could complete, however, its performance was respectable. As with the ATI board, triangle throughput was fairly low, typical of less-expensive 3D accelerators that have expensive setup overhead.

Microsoft Direct3D RGB Emulation Driver: If you're writing a game with Direct3D, don't even think about supporting software-only rendering, at least not with the RGB Emulation driver provided by Microsoft. This driver pretty much flatlined near the bottom of the charts in all modes (at least the ones it supported), and established the low end of the performance spectrum, as is to be expected for a software-only renderer.

Unfortunately, if a game is written with hardware acceleration in mind, it may not be usable at all with Microsoft's RGB Emulation driver. In Microsoft's defense, they have been concentrating on optimizing the Ramp Emulation driver and their MMX driver. The Microsoft RGB software driver doesn't support Z-functions other than "less or equal," so BSPWALLS_B2F couldn't be performed. Lack of alpha blending precluded gathering numbers for the STRESS test.

The Effect of the Processor

An important measure of a hardware accelerator's performance is how much of a load it exacts on the host CPU. An accelerator that requires a lot of CPU time may actually be slower in a game than one with low load characteristics, even if it has a faster fill rate. The amount of CPU load an accelerator consumes consists of all the CPU activities required to get data to the accelerator. This load generally falls into two categories: triangle setup and flow control.

Triangle setup consists of all the work done to compute the triangle parameters that are relevant to the accelerator. This may be as simple as gradient computations, or as complex as splitting up big triangles into smaller triangles that the hardware can handle. Triangle setup requires CPU time to calculate parameters, and thus influences load significantly. Flow control is the amount of handshaking that the CPU has to do with the accelerator to send it the triangle parameter data (and other information) computed during triangle setup. Bad flow control may require the CPU to poll the accelerator for busy status before every hardware register write, or to poll the hardware before writing out each new triangle. If your code is subjected to this kind of waiting, you had better hope the hardware can render triangles significantly faster than your software can, or it may be just as fast (or faster) to render the triangles yourself.
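The sketch below illustrates the busy-waiting case. FifoSlotsFree() and WriteTriangleRegisters() are hypothetical stand-ins for reading an accelerator status register and writing setup parameters across the bus; the point is that every iteration spent spinning here is CPU time lost to the rest of the game.

// Sketch of busy-wait flow control (hypothetical hardware interface).
extern int  FifoSlotsFree();                           // hypothetical status-register read
extern void WriteTriangleRegisters(const void* setup); // hypothetical register writes

void SubmitTriangleWithPolling(const void* setupParams, int slotsNeeded)
{
    // Poll the accelerator until its command FIFO can accept the triangle.
    while (FifoSlotsFree() < slotsNeeded)
    {
        // busy wait -- no useful work gets done here
    }
    WriteTriangleRegisters(setupParams);
}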

Hardware with good flow-control characteristics does not subject the CPU to busy waiting, either by having very deep FIFO write buffers or by using PCI bus mastering to asynchronously fetch triangle data from the host. When a hardware accelerator lets the CPU do nonrendering tasks in parallel with the rendering, it is called "execution overlap." The less CPU load you have, the more overlap is achievable. In terms of overlap, the ideal accelerator performs triangle setup and has good flow control characteristics. In this situation, the game only has to send vertex data to the accelerator.

Overlap can become a huge factor in game performance, especially if your game is consuming a lot of time performing nonrasterization activities. This is another situation in which rasterization performance doesn't necessarily equate to overall game performance - you must measure the performance of a specific game on a specific accelerator to get a valid idea of performance differences between accelerators. D3DBench does not attempt to measure overlap - as a matter of fact, the benchmark discourages it by waiting for hardware rendering to complete before stopping the timer.

Brian Hook is a freelance 3D graphics software and hardware consultant based out of Sunnyvale, Calif. He can be reached at [email protected], or http://www.wksoftware.com.

Andy Bigos is a software engineer with 3Dlabs, based out of the United Kingdom. He can be reached at [email protected].
