This year's Game Developers Conference offered ten to twenty parallel tracks of keynotes, lectures, workshops and roundtables, complemented by up to six sponsored sessions, some of which covered a full day -- and all that following two days of full-day tutorials. Such an abundance of offerings forces attendees to cull and prioritize, and as our needs at Loki revolve around portability, I tended to have a distinct preference for OpenGL.
Unfortunately, registration delays left me late for the OpenGL tutorial that started at 10:30, but luckily the organizers of the tutorial offered complete and detailed course notes -- weighing in at a good 150 pages. (The same held true for several other tutorials at the show.) During the day, John Spitzer offered sane ways to maximize performance, and Brad Grantham closed the session with a brief description of OpenGL on Linux. The bulk of the material (and most of the presentation) was given by Mark Kilgard of Nvidia.
By the time I arrived, the room was already full, with quite a few attendees
left without chairs as they listened to a description of shadow volume
techniques. This part of Kilgard's presentation was kept brief, as it
would later be used in his Friday afternoon presentation, "Advanced
Hardware and Rendering Techniques." The tutorial subsequently covered
translucency and the well-known problems of non-commutative blends,
touched on the usefulness and pitfalls of screen space stipple, and moved
on to a review of GL fog.
The OpenGL specification allows fog to use depth instead of Euclidean distance, which can lead to problems -- especially with wide fields of view and steep fog functions. As Kilgard pointed out, the GL specification contains several examples where the accurate solution is merely recommended, but alternative implementations that trade small artifacts for potentially huge performance gains are allowed. Another example is perspective-corrected color interpolators. In some cases, GL even specifies features that are inspired more by practicality than modeling; GL_EXP2 fog is a good example of this. GL_EXP2 fog does not implement a model of atmospheric attenuation, but it's well-suited for effectively hiding a far clipping plane (and since it operates on the squared distance, it avoids a square root, making it potentially cheaper to implement even with correct distance calculations).
Propagating Euclidean distance through the hardware pipeline is an expensive addition. Kilgard proposed the GL Fog Coordinate extension as a possible solution, which puts the burden of calculating the proper fog distance on the application, but empowers it to implement layered (e.g., altitude dependent) fog. In general, GL extensions played a prominent role in the tutorial as well as GL-related lectures at the conference, culminating in an OpenGL Extension safari in the afternoon of the tutorial in which Kilgard performed an annotated walkthrough of thirty-six out of over two hundred published GL extensions, commenting on their possible applications and their availability.
The remaining time before the lunch break was dedicated to a few GLUT-based demos (the GLUT framework is still available from the SGI website). Kilgard used his low-poly dinosaur (which should be known as GLUTzilla, in my opinion) to demonstrate reflections and shadowing using stencils, and during his demonstrations he was careful to point out artifacts and shortcomings, such as visible tessellation caused by projection from nearby point light sources. He also showed a multipass stencil-based technique for rendering magic halos around an object, which I suspect to be of limited use. Much more striking was the demonstration of real-time rendering of six views into a cube texture, followed by environment mapping on a reflective sphere using the GeForce-supported texture cube map extension. A Direct3D version of this demo was presented in Ron Fosner's "All Aboard Hardware T&L" article in the April issue of Game Developer.
The fill rate and bandwidth requirements for rendering six views of the scene can be quite expensive, but image caching techniques and sparse updates can certainly be applied. In addition, reflective objects like mirrors are usually used as special gadgets -- they are placed to make certain parts of a level more memorable, or to underscore a moment of a game's plot. Hardware support for environment mapping has entered the market, and techniques that are comparatively expensive today might be common one or two product cycles later.
Cass Everitt from Physitron elaborated on "Real-Time Per-Pixel Illumination" as introduced in his article on OpenGL.org (www.openGL.org). Also known as Orthogonal Illumination Mapping (OIM), the technique requires the calculation of signed dot products per pixel - an excellent example of what are often referred to as pixel shaders. OIMs also serve as a good illustration of the problems involved: firstly, OpenGL does not support signed data types, which makes it necessary to break down each signed multiplication into four multiplications of mixing terms. The results are stored in GL's unsigned color buffer, thus we end up with four passes per dot product, half of which are done using subtractive blending. For OIMs, a total of three dot products have to be calculated per pixel, for a total of twelve passes. Specular lights and other effects add to this rapidly.
With multipass counts of twenty-six and above, it seemed a bit overboard when Mark Kilgard pointed out that this technique could be implemented with unextended GL -- if you have all-out multitexture hardware (OpenGL allows for up to thirty-two texture units, but just a few months ago most people would have thought that hardware with more than two or maybe four would hit diminishing returns), or lots of fill rate to burn.
Significant pass reductions are possible by transforming the problem into tangent space (recommended
reference: Peercy et al., SIGGRAPH 1997), essentially turning a 3D
mathematical operation into a 2D one. Beyond that, the technique seems
destined to become a vehicle to show off Nvidia's register combiner
extensions, which allow for collapsing several passes into a sequence
of operations applied in a single pass.
Not only can these operations be signed, they can also address a problem of limited resolution - while consumer space color buffers are inevitably 8-bit per RGBA channel, the same is not true for the internal paths. Hardware manufacturers have maintained higher accuracy in their pipeline for years now, and while it is unlikely that 3D accelerators will abandon commodity RAM anytime soon, there are possibly fewer constraints on the accelerator chip itself. Currently the precision in the color buffer is limited, and this will cause problems (i.e., banding), particularly with the exponentiation needed by specular color.
Kilgard contributed a white paper entitled "A Practical and Robust Bump-Mapping
Technique for Today's GPU's" to the Course Notes, which will make
its way to the Nvidia Developer web site: a rocking twenty-four pages
of equations and register combiner data flow diagrams, plus a sixteen-page
appendix of sample code. If you consider this scary, rest assured
that the results are worth the effort. As Kilgard put it, it is a feature
ready for prime time. I never found myself particularly impressed with
the various screenshots of both fake bump mapping and OIM, but Everitt's
demonstration of OIM-based illumination in real time was striking.
Even a close look at the full geometry grid (this image is not textured) as compared to the lit simple geometry (sub-sampled by sixteen in each dimension) does not convey the effect of a light source moving in real time. Of course, tangent space is comparatively easy for height fields, and while certainly doable for parametric surfaces (like the ubiquitous donut), the technique might turn out to be more of a problem for manually created polygon meshes (e.g. in combination with Intel's Multi-Resolution Meshing techniques). The issue had been raised before, and one of the early morning demonstrations I missed seemingly showed off per-pixel illumination applied to polygonal Quake models.
Kilgard followed up on the OIM presentation by reviewing the bump mapping "hype" of last year's advertising (not sparing his own employer), pointing out problems of aliasing and lack of self-shadowing, and revisiting the precision and resolution issues with unextended GL. His view of the OpenGL state machine (an attitude probably nurtured by the architects of the API themselves) is that of a general-purpose machine to implement mathematical operations. A related example can be found in David Blythe's presentation to implement Renderman in hardware. In this view, the 3D accelerator is just a subsystem applying calculations to a data stream - buffers are used as memory, internal paths become registers, logical and algebraic operations are implemented (ab)using alpha, stencil, and depth test, and texture and cube maps serve as lookup tables to look up values for scalar, 2D and 3D vectors (the latter even allowing for use of unnormalized vectors).
If you attended Kurt Akeley's keynote on Saturday, you will have heard an echo of this when he described SGI's experiments -- and lack of success -- with programmable hardware and microcode. In general, the crowd of SGI alumni now employed at Nvidia, 3dfx and other companies has taken with them the concept that hardware programmability is best implemented through the API, not by introducing a backdoor into programming the hardware directly. Having eavesdropped on Friday's DirectX8 presentation on programmable vertex shaders, I can report that this view is obviously not shared by Microsoft. Perhaps scared by the presumed flexibility of the Playstation2, and with their X-Box announcement in mind, Microsoft will have DirectX8 give birth to the very "programmable hardware" which Akeley, in the "nay-saying" part of his keynote, dismissed as "limited at best, error prone at worst." As with the GL tutorials and lectures, Nvidia also dominated the "Direct3D Programming" tutorial.
The introduction to DX8 vertex shaders done by Richard Huddy proved quite scary to my
OpenGL-biased mind. Details are not yet finalized, and some
will not be available until after DX8 goes beta - quite different from
the GL extensions which, as the Nvidia engineers themselves pointed
out, are already available to developers (the Register Combiner extensions
can be found in this PDF
document, but only one of them is already documented in SGI's reference
repository (separate ASCII documents)). GL drivers offering features
equivalent to vertex shaders will come (and are seemingly easier to implement,
but not first priority).
The GDC2000 D3D tutorial does not seem to be on Nvidia's website yet either, so my summary of the presentation will have to suffice.
Starting with a rendering pipeline that went from curved surfaces all the way down into an anti-aliased frame buffer, vertex shaders according to DX8/Nvidia are placed between those higher order primitives and the triangle setup and rasterization stage. They are to be written in an assembly language comprised of 19 RISC-style pseudo-instructions that can be applied to either vectors of four 32-bit floats or -- by subscript -- to a single component of a vector. The data is weakly typed until placed in the output registers, where it will be automatically clamped when necessary. A vertex shader is applied to a stream of vertices, on a vertex-in-vertex-out basis, and Huddy was careful to point out that DX8 vertex shaders can't operate on a set of vertices, as required for culling, to name just one example. A well-formed output vertex exists in HCLIP space and has a maximum of four texture coordinates (seemingly cut down from eight coordinates from earlier DX8 alphas) plus diffuse and specular color, fog value and a point size. Point size is a feature used for different purposes (e.g. particles -- see also GL's point parameter extension -- as used by flight simulators as well as Quake2, but abandoned by Quake3), and Richard Huddy took the opportunity for an informal poll on how many developers would actually use this output channel. As if to prove Akeley's verdict, plenty of restrictions apply. The total instruction count is limited to 128. Most of these are executed in a single cycle: as a general rule, a coder should worry about the instruction count, but can safely ignore the issue of cycles. A DX8 vertex shader is valid only if it writes out a complete vertex. No creation or destruction of vertices is possible, and you can't keep data or feed it to the next vertex either. There are no branching instructions, and you can't exit a shader early, so creative use of, say, the SLT/SGE instructions to effectively multiply by one or add zero is required.
The constant memory segments come as 96 DWORDs that are
the transformation and one serving as index buffer. There is room for
sixteen 4-vectors (hardware limits you to 16 streams), and six more
as temporary registers. According to the presentation, you can load
a palette of twenty-two transformation matrices (eighty-eight values
total) into the same space as a constant table, which is quite a bit
of overkill by any definition of multi-matrix skinning. What is known
as "DX8 hardware" (the presence of which indicated by yet
another caps bit) does not support multiple indices into multiple streams.
On the upside, it only takes six instructions to transform and light
one vertex, partly due to the fact that some of the RISC instructions,
namely LIT and DST, are shortcuts for efficient lighting, cutting efforts
for directional light sources down to 2 instructions/3 cycles. With
a total of 128 instructions, four of which are required for transformation,
shading using infinite distance light and viewer approximations totals
6+7/light. The slides seemingly had an error there, but my understanding
is that numbers for local light source and viewer should be about tripled.
One of the Nvidia vertex shader demos, which will eventually show up
on their developer website, implements seventeen lights per vertex in
124 instructions (another accomplishment that makes me wonder about
diminishing returns). Other instructions implement matrix/vector operations,
including matrix inversion and normalization. In particular, single
cycle DP3 and DP4 dot products are provided. In some cases, instructions
have to be used in creative ways (e.g., MAX is used to implement ABS
or clamping operations).
What is it good for? DX8 vertex shaders will give you direct access to the raw programmable GPU core, in a similar vein as anticipated for the two Vector Units in the Playstation2 console. In the words of the speaker, it is both "your opportunity to look different" and a chance to spend hours debugging shoddy RISC code (in a syntax that features an "excessive number of brackets" and will seemingly bring back fond memories of early x86 assembly pains). There is a range of opportunities, from custom texture coordinate generation to custom lighting to fisheye transformations and procedural mesh morphing. The DX8 vertex shaders can do everything that DX7 T&L can do, and more. Ironically, the demonstrations started off by showing that one of the things you simply can't do in DX7 is implement the DX6 light falloff. There was a multi-matrix skinning example animating a Dolphin, and a less impressive demo that demonstrated how to implement back and front side lighting on a per vertex basis (remember, you can stream in multiple normals per vertex, but you can't operate at triangle level). Once game developers (and, even more so, the demo scene) get their hands on the development tools, we can expect to see a lot more, and different, applications.
Speaking of development tools, Nvidia also showed off a debugging tool that runs on top of DX8, which is supposed to replace the current Microsoft console tools. You might be advised to use the DX8 software implementation of vertex shaders for debugging, though -- an implementation which, interestingly, is not done by Microsoft: Intel and AMD are each competing to provide optimized software emulation for their respective CPUs, both of which are destined to ship with DX8. The DX8 shader instruction set avoids a good chunk of the problems Kurt Akeley was aiming at -- endless loops executed inside your video hardware -- by omitting looping instructions along with branches, but DX9 or a later revision might well open that microbox of Pandora.
It will not be possible to mix vertex shaders with the built-in, fixed functionality exposed by DX - once you start adding custom components, you have to implement your own hardware T&L as you need it. The instructions used for this will compete for resources with whatever new mesh-morphing and procedural geometry techniques you create. However, Nvidia expects good drivers to eventually exceed fixed-function performance.
There are a lot of practical issues that might not even be finalized yet.
Vertex shaders are text files handed to DX8 for compilation when the
application starts. It seems recommended, and maybe even necessary,
to provide those shaders as text files, as the instruction stream is
hardware dependent, and might be optimized in ways not available when
you shipped your game. DX8 hands back a handle after compilation, and
shaders are loaded and unloaded as necessary. The efficiency of this
process might also be an issue depending on how many competing shaders
are used and switched each frame - as Akeley pointed out, SGI found
runtime setup costs for programmable hardware to outweigh any processing
gains. Questions from the audience regarding saving shaders to files
and shipping bytecodes were not answered conclusively.
Microsoft seemingly plans to provide precompiled shaders, presumably the DX7 T&L for compatibility and tutorial value. Whether text shaders, encrypted source, or bytecode, this new approach will also raise intellectual property issues. Similar issues come up with audio APIs now that sound effects will be shipped with DS3D. The difference between data and code gets blurry here, and it will be educational to see how these issues work out.
While nobody has to solve the halting problem (yet) for DX vertex shaders, the Playstation 2 will have to be programmed with all due care. Dominic Mallinson of Sony's R&D gave a brief review of the microprogrammable graphics architecture of the PS2, which was probably old news to everybody but pedestrians like me. The PS2 offers two Vector Unit programmable co-processors, one of which is directly connected to the 300 MHz MIPS CPU, the other to the Graphics Synthesizer (GS) 3D accelerator handling triangle setup, rasterization and texture mapping. PS2 will not offer pixel shaders; the VUs are meant for procedural geometry, custom transformation and lighting, and tessellation of higher order surfaces. High level culling and scene database management are left to the CPU, which shares memory with the VUs over a 128-bit bus running at clock speed. Mallinson pointed out that microprocessors, while more flexible and less bandwidth consuming than (multipass) hardware, are still slower than a hardwired pipeline. However, clock-by-clock, any coprocessor has to be faster than the CPU. The same is true for geometry "coprocessors" in the PC market - given the current arms race between AMD and Intel in the above-GHz realm, 3D accelerator companies might have a hard time delivering GPUs that outperform those CPUs. Tim Sweeney made a comment last year that dedicated 3D accelerator hardware will just be a detour of little more than a decade -- while I fail to see why rasterization and texture mapping should ever end up in CPU instruction sets, I can certainly see the issue of balancing competing subsystems recurring with tessellation, transformation and lighting.
Also mentioned were the potential savings from offloading computationally intensive tasks from the CPU.
However, this benefit might be overrated. First, you will never save more than 100 percent (and that only in unrealistic worst case scenarios), and second, simulation-related calculations will still have to be made: neither a PSX2 nor a PC accelerator accessed through DX8 or GL will allow for efficient read-back of transformation results (although the PSX2 CPU-VU0 tandem might have an advantage here).
Higher order surfaces have pros and cons, as they can't handle every object, and not all operations can be applied (cutting holes into a Bezier patch can prove challenging). Accuracy and computational expense have to be watched. Mallinson showed the "Teapot" composed of 28 patches, each rendered with up to 65x65 subdivisions, or a total of 14,196,000 polygons. The big advantage of procedural representations is that they are an excellent means of compression. They sometimes allow for automated LOD and often accommodate parameterized variation.
With respect to the relatively small amount of memory in the PS2, low memory requirements might actually be a necessity. There is a limit on how much patch subdivision for apparent smoothness pays off, so new (e.g. fractal) means to preserve detail when compressing geometry might become more important. Mallinson himself mentioned L-Systems, a well-known generative technique absent from contemporary computer games. Another example can be found in Ken Perlin's slides of his GDC Hardcore talk, mentioned by Jeff Lander in his Sunday GDC lecture on creating 3D characters (also featured in Lander's column in a recent issue of Game Developer).
The PS2 vector units, both SIMD designs operating on two-instruction VLIW words with 31 general-purpose 128-bit vector registers (the same four 32-bit floats found in DX8 vertex shaders), come with different-sized instruction and data caches, and due to the dedicated CPU-VU0 and VU1-GS connections, PS2 programmers will have to load-balance three programmable devices. Mallinson concluded the presentation with a bunch of demos, which covered projective shadows, run-time cloud generation, a day-night cycle, and nonstandard rendering techniques like surfaces represented by particle emissions instead of texturing. The terrain/day cycle demonstration was impressive, but the occasionally-visible aliasing made me wonder how the PS2 will hold its own against upcoming full-screen antialiasing hardware in the consumer space.
I started off this article with OpenGL, so I should conclude with it. Let's return to Mark Kilgard's two-hour presentation on advanced hardware rendering techniques, in which he elaborated on shadow mapping and shadow volume techniques. The former, essentially depth testing against the light source instead of the eye in a second lighting pass, employs projective texturing techniques described by Mark Segal in 1992. Projective texture mapping comes practically for free in the GL pipeline, and this is partly due to the SGI engineers letting themselves be led by the beauty of symmetry in the GL design. As Kurt Akeley pointed out in his keynote, the decision to treat texture coordinates just like vertices was one taken as much for aesthetic reasons as for applicability.
GL extensions that might have an encore in consumer hardware in the
future are SGIX_depth_texture and SGIX_shadow. One potential pitfall
with using projective textures on current consumer hardware is that
multiple texture units might actually share the divider gates needed
for the hyperbolic interpolator (i.e. the TNT cannot do dual projective
texturing, while the GeForce can). The issue of limited resolution in
consumer space buffers returned with a vengeance, and a suggested workaround
is using GL_LUMINANCE8_ALPHA8 and two comparison passes (essentially
adding 8 bits of LSB precision to the 8-bit MSB depth information), as placing near and far clipping planes around the occluder and occluded scene might not always be workable with 8-bit depth maps.
To overcome the limitations inherent to image space depth buffer techniques (which also
suffer from aliasing and require different filtering), Kilgard suggested
applying image processing techniques from computer vision to extract
edges from the depth maps for shadow volume reconstruction. This was
inspired by the
1998 work of Michael McCool. Personally, I have my doubts about applying expensive computer vision techniques to compress data that is already present as a high-level geometric primitive, but you can grab examples from Nvidia's Developer Resources. Michael McCool was also involved in collaborations with Wolfgang Heidrich, another name that came up frequently during the various presentations. Whether texture shaders or BRDFs, Heidrich in particular might be a source of inspiration for game developers trying to anticipate the next key technologies entering consumer hardware markets. Mark Kilgard strongly recommended downloading Heidrich's PhD thesis, High Quality Shading and Lighting for Hardware Accelerated Rendering; an introductory example of this researcher's work can be found in his SIGGRAPH 1999 paper.
One reason that this research appeals to hardware vendors and developers is that
it provides a nice migration path: using the techniques described there
and elsewhere, you can experiment with standard GL, take a proof of
concept to judicious use for special effects as soon as some hardware
in the market makes it feasible, and scale it all the way up to a ubiquitous
feature that anticipates next-generation graphics hardware. The 3D accelerator
companies have used up the incredible boost that originated in more
than a decade of accumulated knowledge from SGI and other sources, but
Moore's law still applies, and not only for GPU floating point gates.
The smaller the chip structures get, the more texturing units can be
put in the same wafer area. Duplicating existing components has increased --
and will increase -- fill rate further and further. Adding new operations
(and maybe adequate data types) to the API and hardware will collapse
passes even while, as Kilgard predicts, the number of texture units
increases. In Akeley's words, it is a smart strategy to restrict your
API to simple operations, and then optimize them to the utmost. NURBS
have no place in a GL API.
Even lines and points, the legacy primitives every driver writer dreads, might have been better left out. So maybe DX8 vertex shaders, and even those appealing register combiners, are not what we graphics programmers really need. Some developers, most notably John Carmack, have expressed a preference for a vertex-array-based approach instead of shaders operating on single vertices. Whatever their technical elegance, register combiners can turn out to be a mixed blessing for non-technical reasons: a vendor-specific extension exposing patented hardware technology might leave us short on abstraction and portability. It will be interesting to see how this works out in the market. Judging from developer questions raised during Kilgard's safari stop at the S3TC texture compression extension, proprietary technology and widespread acceptance do not mix well in this quickly moving PC industry.
On the other hand, the GL extension mechanism seems to pick up speed, and it is good to see vendors like Nvidia taking the initiative. Taking examples from John Spitzer's presentation on GL optimization, we find an impressive palette. There are operating-system-specific extensions like wglAllocateMemoryNV for application side AGP memory management. Nvidia also proposes GL extensions that essentially mimic D3D concepts, hopefully adding versatility. For instance, NV_vertex_array_range is Direct3D vertex buffers in disguise. The safari also offered examples originating in the open source realm -- like MESA_window_pos. The value of each extension will have to be proven, and even reviewed over time. As Kurt Akeley pointed out, circumstances do change.
Mark Kilgard praised id Software as a good role model for game development companies: as with Quake 3, id embraced GL extensions they found essential for their purposes, and the company keeps hardware vendors informed of their plans. John Carmack is also vocal about extensions he would like to see (but unfortunately, his request for a different approach to geometry operations was not heard). Judging from Carmack's contributions to the Utah GLX project, which Brad Grantham briefly described in his Linux OpenGL presentation, id is also still well ahead on the issue of open source.
As Kurt Akeley claimed in his keynote, proprietary advantage is temporary, and one has to recognize when a solution has to become an open industry standard. SGI considers Open Source a powerful tool to accomplish this, and the company certainly has a lot of experience in such migrations, starting with the transition from Iris GL to OpenGL. On a smaller scale, every vendor's GL extension will have to follow a similar path. Ultimately, the true value of, say, Nvidia's contributions to GL will depend on whether and how proprietary solutions can be evolved into vendor-neutral versions of the OpenGL API. Nvidia has to be applauded for taking the initiative on extending GL, and for evangelizing the technology they want us to adopt at the conference. There is a significant intellectual investment behind those tutorials, lectures, and the documentation and sample implementations available to developers, and often to the public. If more hardware vendors adopt the same strategy, we will find a future GDC graphics track filled with imaginative uses and new solutions. Another example was the set of sessions sponsored by the Intel Architecture Labs.
While such sessions covered techniques not yet ready for prime time, it is obvious that game developers are hungry for new ways to do graphics, be it "real-time toons" or other ways to look different -- photo-realistic or not.
So much for my impressions. Depending on your area of work and interest, Graphics at GDC 2000 might have appeared completely different to your eyes. My cursory and selective overview is certainly lacking order and completeness, but if you follow the references and consult the GDC proceedings material available on the net, you should find a treasure of valuable information.
Bernd Kreimeier is a physicist, writer, and coder, working as senior programmer and project lead at Loki Entertainment. His previous work for Gamasutra and Game Developer Magazine includes Killing Games: A Look At German Videogame Legislation as well as the "Dirty Java" features on Using the Java Native Interface and Optimizing Pure Java. He can be reached at [email protected].