Multi-core processors are going mainstream. Most PCs sold today and also all new game consoles allow the parallel execution of several hardware threads. For games programmers it would be a real shame to have all this additional computational power available and not use it. The new consoles do not seem to have out-of-order cores (see [Stokes05]). Threading is therefore the only way to really use all available execution units in parallel.
Creating a multi-threaded game engine is a challenging task (see e.g. [Gabb05]). Various game tasks like rendering, Artificial Intelligence (AI) computations, physics or collision detection often rely on the sequential availability of results from other game tasks. If game engine developers choose to run several different game tasks, like physics and rendering, in parallel (task level parallelism) it can be complicated to have required results from other tasks available in time. After all one does not want to add too much synchronization overhead. The other possibility is to work on several problems of the same kind in a data parallel way (e.g. path finding for 4 characters in parallel). To be as efficient and as scalable as possible your game engine should ideally be implemented using a mixture of task level and data level parallelism. This guarantees scalability for more cores to be available.
Concerning multi-threaded software development there are nice documents available for download that describe all relevant concepts and also most coding pitfalls (see e.g. [DevMTApps]). Also it should be noted that tools like VTune* (for finding hotspots), ThreadCecker* (for finding race conditions and other threading bugs) and ThreadProfiler* (for finding load balancing and other performance issues) are valuable tools to help your multi-threaded development effort (see [IntelTools]). The threading problems you solve on your PC with their help will most likely help you on other multi-core platforms as well.
[West06] has written an introduction presenting ideas of how to take advantage of multiple cores in game engines. This article takes his approach one step ahead and describes the use of multi-threading to realize real-time terrain height field smoothing (see Figure 1). The visible part of the terrain is smoothed in real-time every frame to achieve smooth LOD transitioning without visible geo-morphing or popping. Also this way you make sure that you spend your vertex budget for the terrain in an almost optimal way.
It has to be mentioned that the terrain smoothing described here could also be done on the graphics card using floating point render targets (see [Bunnell05]) or ‘Tessellation through Instancing’ (see [Gruen05]). Also DirectX10* and GeometryShaders are just around the corner. Still, most games are processor limited on low to medium display resolutions, so offloading the processor tasks to other cores can help. Again most games are graphics-card-limited at higher display resolutions. So with higher display resolution it makes sense to shift load away from the graphics card as you might desperately need your graphics card power for other things. Of course all this depends on how you can balance your processor and graphics card workloads. If you do rendering in multiple passes on pre DirectX10* hardware, tessellation on the processor can be faster than doing it potentially many times on the graphics card.
The use of OpenMP* (see [OpenMP]) to realize data parallel smoothing is investigated as a first approach. Since it is harder to use OpenMP* to realize asynchronous threading it is also shown how to use the Windows* threading API to realize an asynchronous tessellation that only picks up a new terrain mesh when it is done. This latter approach has no impact on the frame time at all. The algorithm described can generate balanced workloads for even a big number of cores and therefore scales nicely. You can use it to let just one thread or one free core do the heavy lifting of smoothing or distribute smaller workloads to cores that you know are not fully utilized.
Next this article discusses how a sparse terrain height field can be turned into an overall smooth surface using Bezier patches. After that it is described how culling, LOD and thread workload selection can be realized and how a big and watertight mesh can be generated. Finally the discussion on how to multi-thread the smoothing work concludes the article.
For simplicity reasons this article chooses one dimensional rectangular bi-cubic Bezier patches to smooth a sparse height field. These patches are only used to smooth the height values across the height field, the other coordinates are generated by linear interpolation. The following text assumes, because of brevity reasons, that the reader is familiar with how Bezier patches work. See [Farin96] for a well written introduction.
Figure 1 shows a 4 by 4 neighborhood within a bigger height field. The left and bottom corner of this neighborhood is at location (x,y) (both are integer values) in the height field. The height values, indicated by thin vertical lines, are meant to be different but are shown as roughly equal to make the drawing more readable. The patch p shown at the center of the height field is a representative the grid cells defined by the height field. The grid cell for p is defined by h(x+1,y+1), h(x+2,y+1), h(x+1,y+2) and h(x+2,y+2).
In order to replace p by a bi-cubic Bezier patchone needs to define the 16 control heights. These control heights are indicated by b00, b10, b20, b30, b01, b11, b21, b31, b02, b12, b22, b32, b03, b13, b23 and b33.
Because of brevity reasons – after all the focus of this article is on multi-threading – Formula 1 simply presents one algorithm to setup the control heights to achieve an overall C1 smooth terrain surface.
How does one now evaluate the surface? The key insight is that each patch defines four vertical Bezier curves defined by four control heights each. The curves are:
- Curve 0 : Defined by b00, b01, b02 and b03
- Curve 1 : Defined by b10, b11, b12 and b13
- Curve 2 : Defined by b20, b21, b22 and b23
- Curve 3 : Defined by b30, b31, b32 and b33
Each curve can be written as a cubic polynomial with coefficients that can be derived from the control heights (see [Foley90]). Forward differencing (see [Foley90]) can be used to step through each of the curves in constant steps. The implementation of the demo for this article does this with SSE (see e.g. [Klimovitski01]) compiler intrinsics. It actually steps all four curves simultaneously using instruction level parallelism. For each step, forward differencing gives us four height values. These four height values are again used as four control heights which now define a curve that runs across the patch in the horizontal direction. Again forward differencing is used to step along this horizontal curve. This time SSE is used to transform the three adds of an FFD step to one 4 dimensional add and one shuffle operation. Directional derivatives are stepped in a similar manner. It has to be noted though that the code only generates approximate derivates along the vertical direction.
Culling and LOD Selection
Now assume one has some efficient culling algorithm that quickly identifies the cells of the height field inside the current view frustum. The output of this culling algorithm is a list of potentially visible grid cells.
The goal is to tessellate patches near the viewer with more details and patches far away from the viewer with less detail. The implementation in the demo chooses a level of detail (LOD) value for every grid cell. This LOD value is based on the distance of the center of the grid cell from the viewer. The LOD is chosen as a floating point number that defines how many points the FFD algorithm generates along each edge of the patch. The non-integral part of the LOD is defined in the same way that D3D defines segment counts for curved surface patches (see [D3D05]). This ensures that a moving camera generates a very smooth view dependent LOD transitioning since there is no popping of discrete LODs.
The demo generates one big triangle strip for the whole of the tessellated terrain. It also generates extra triangles that stitch the gaps between neighboring patches. Using an LOD per edge as [D3D05] does would be more advanced because stitching triangle would not be needed. The per-cell LOD approach was used because the setup for forward differencing is easier to code this way.
Now that the smoothing mathematics, culling and mesh generation have been covered the text moves on to describe how to generate a balanced workload for multiple threads to do the actual tessellation work.
Creating a Balanced Workload for Multi-Threading
The goal is to distribute the tessellation computations collected by the culling algorithm evenly to a set of hardware threads. To distribute an even workload one needs some value that specifies how complex a certain tessellation task is. Luckily the LOD value of each grid cell describes how expensive it will be to tessellate the corresponding patch. What is also nice is that the LOD values usually fall into a very limited range. In the demo implementation LODs range from 2.0 to 30.0.
Therefore an algorithm collects all cells with the same integral LOD into a separate list and essentially sorts patches to be tessellated into buckets with roughly the same tessellation work to be done. Given a number of worker threads the code generates an initially empty workload list for every thread. Now starting from the buckets with the most complex workload, in a round robin fashion, the workload is distributed to the threads. This process is outlined by the following pseudo code Code Fragment 1.
|Code Fragment 1|
You can further enhance this algorithm by adding a per thread value that is used to weight the current thread workload when comparing it with the current maximum workload. This allows the generation of controlled uneven workloads with the same algorithm.
From the LOD value associated with every cell it is also very easy to compute how many vertices will be produced by a certain tessellation workload. When offloading workloads to several threads each thread will also be given an offset along with the starting address of a vertex buffer that has been locked by the main thread of the application. This way threads can write their tessellation results into non-overlapping memory regions.
It has been shown how to generate a workload distribution for a number of threads. Next it will be discussed how to architect multi-threaded code that picks up these workloads. First the use of OpenMP* is discussed and afterwards the use of the Windows* threading API is described.
Multi-Threaded essellation with OpenMP*
You can use OpenMP* to add thread level parallelism to your application by adding special OpenMP* compiler directives to you source code. These directives come as pragmas that do not change the semantics of your initial code and are therefore non-intrusive. As you will see they are easy to add and you can use them to incrementally parallelize your code. Of course you need a compiler that supports OpenMP* but luckily VS2005* and the also the Intel C/C++ compiler both support it. Source code with OpenMP* pragmas is portable to the Xbox360 because the Microsoft* compiler supports it there as well. Still, this article is not meant to be a primer on OpenMP* - please see [OpenMP] for an in-depth introduction on it.
The main usage model for OpenMP* is a fork and join multi-threading, which means that a set of threads fork from the main execution flow and work together on a shared set of tasks. After having finished their work they all join again. Since OpenMP* uses an internal thread pool there is no thread creation or cleanup overhead.
For the purpose of the multi-threaded tessellation a data parallel OpenMP* pragma is used to parallelize a for()-loop. As indicated by Code Fragment 2 OpenMP* is told work on N tessellation workloads in parallel. The OpenMP* runtime will decide how many threads it will use to do this work. By default it is the number of hardware threads supported by your machine. It is possible though to change this through OpenMP* library calls.
|Code Fragment 2|
Figure 2 shows what would happen on a machine capable of four hardware threads – please note that the MainThread is also working with the other threads. Also note that the main thread does the culling and workload distribution and also the drawing.
If you start the demo application it starts in a mode that uses OpenMP* to do a data parallel tessellation of the tessellation workload as just described. A slider in the demo can be used to tell OpenMP* to use as many hardware threads as your machine can run in parallel.
Unfortunately the heavy use of SSE on all threads does not work well with using all logical processors of a hyperthreading system, and will even result in a slowdown. D3D* and the graphics driver which run on the main thread also make use of the SSE units. If you also wanted to use all logical processors and gain a speedup you would have to write additional tessellation code that does not use the SSE units at all. The demo can use affinity masks to try to make sure that only one of the two logical processors of an HT core will be used for tessellation (see below). Still if you ever get hold of a real 4 core machine the demo allows you to use them.
To prove that the demo can really reach a speedup on a dual core machine do the following:
- Make sure that the device settings indicate that vertical syncing is off
- Select the number of threads to be used to one
- Tick ‘Use OpenMP’
- Increase the viewing distance until you go down to 60 FPS. It is assumed that your graphics card is fast enough to run the initial settings at over 60 FPS.
- Increase the number of threads to be used to two.
- You should see the frame rate go up again, obviously only if you really have a two core machine. If there is no speedup or almost no speedup, then the tessellation workload is not the limiting factor. Most likely your graphics card is then transform or memory (transfer) limited which means rendering is relatively expensive. To check this you can un-tick the ‘Tessllation running’ box. After that you should see how fast your card can draw the vertex load generated by the tessellation.
You will have noticed that the speedup is not necessarily very high. Depending on you system and your graphics card you can get an increase of frame rate from e.g. 60 to say 75 FPS which would be a speedup of roughly 25%. Again, how much speedup you get is determined by how fast your system can render the tessellated scene. If rendering cost is small compared to the tessellation cost the speedup gained with OpenMP* can be higher. One test-system I used produced a 50% speedup.
If you bring up the Windows* task manager it becomes apparent that OpenMP* does not use affinity masks to try to lock threads on certain cores or processors. You will see that Windows* reschedules the tessellation threads trying to minimize core utilization. For our purpose this is not too bad but it might be worth trying to bind threads to certain cores.
The reason why one can’t get a higher speedup on certain systems (where rendering is relatively expensive) is that the time the tessellation work takes does represent a relatively small percentage of the overall frame processor load on these systems. Culling, workload distribution and mainly rendering are taking most of the time. This is not necessarily a problem and actually can be predicted by Amdahl’s Law (see [DevMTApps]). This law in a nutshell states that the maximum parallel speedup one can reach is limited by the serial portion of your code. Since the rendering is done in just one thread it limits the speedup. Still it is possible to reach higher frame rates by decoupling tessellation work from rendering. How this can be done is discussed next.
Asynchronous Multi-Threaded Tessellation
To reach a maximum frame rate on systems where the rendering cost is high when compared to tessellation one ideally wants to completely decouple rendering from culling and tessellation. The basic idea is to only pick up a new terrain tessellation when it is done. To cope with camera movements a triangle strip for an enlarged view frustum can be generated. The demo does not do this. You will thus notice that for fast rotations there simply is no terrain available for a short moment in time.
The asynchronous threading architecture that is realized in the demo (activated if you un-tick ‘Use OpenMP’) is shown in Figure 3. For this architecture one needs two vertex buffers that are used alternatively. One vertex buffer is rendered by the main thread. The other vertex buffer is asynchronously filled by the tessellation threads. The main thread checks every frame if a new tessellation is available. If it is available, it from then on uses the new vertex buffer to be drawn. It then locks the other (old) vertex buffer and hands it off to the tessellation threads to fill it. This is done in a round robin fashion.
The synchronization of the threads is handled using Windows* events. One event is used to signal the main tessellation thread that it should start a new tessellation. The main tessellation thread uses yet another event to signal to the main thread that a new tessellation is available. The main tessellation thread itself first does the culling and the workload distribution. After that it signals a set of events that will kick off additional tessellation threads. That is, if there are more than two cores in your system. The additional tessellation threads will work along with the main tessellation thread to finish the tessellation. Each additional tessellation thread signals the main tessellation thread when it has finished its job by setting its own event.
The main tessellation thread does a WaitForMultipleObjects() to wait for all its siblings to finish before signaling the main thread.
The demo application actually initially does not run completely asynchronously but the main thread waits until the last tessellation has been done by the tessellation threads kicked off last frame. Interestingly you will still see frames with an incomplete terrain. The reason for this is that the main tessellation thread has picked up a view cone for culling that is not the same used by the main thread when drawing the actual frame. This can be rectified if we accept a one frame lag.
You can now switch to fully asynchronous mode if you un-tick ‘Wait for tessellation’. In this case the main thread will only use a new tessellation when it is done.
All threads used by the tessellation are created at the startup of the demo, so no thread creation or cleanup is going on while the demo is running. In addition to that, all threads including the main thread can be affinity-bound to exactly one logical processor of one of the cores by setting the appropriate affinity masks for them. This has been done to enable the use of the Windows* task manager to really see how much processor time is spent for tessellation and on each core – that is if Windows* really respects the affinity masks.
The slider for the number of threads is used differently when un-ticking ‘Use OpenMP’ and running asynchronously. It specifies the number of threads, including the main tessellation thread, to be used for the asynchronous tessellation. On a two core machine it should be left at a value of one.
Compared to the OpenMP* mode you should now, using the same viewing distance and tessellation settings, see a much higher frame rate if the render cost is high when compared to the tessellation cost. If rendering is cheap when compared to tessellation you will see a smaller speedup than with OpenMP*. Depending on your machine this means you can let the player look even further or increase the quality of the tessellation. If you un-tick ‘Wait for tessellation’ the frame rate you see is independent of the complexity of the tessellation workload. It should be the same that you see when un-ticking ‘Tessellation running’.
The source code for the demo (see Figure 5) is available for download, so everybody can have a look. The culling code that has been implemented is far from optimal, but you may stick your own culling code into the sample.
|Figure 5 (download source)|
The demo pre-computes a grid of patches from a height field in memory. It would be easy to change the code to work on a height field that is synthesized on-the-fly and does not sit in memory at all. Also it is probably also easy to port the SSE intrinsics to appropriate code for the vector units of the new consoles.
In addition to the tessellation code, you will find the source code for a library that implements CPU detection (written by my colleague Leigh Davies). The CPU detection library enumerates cores and logical processors which enables you to detect which logical processor is a HT core. Please note that the CPU detection code is supposed to work on all IA32* PC processors not only on Intel processors.
This article has described how to multi-thread terrain smoothing in a scalable way. The tessellation will be faster with every core you allow the code to use. Initial performance tests indicate that the OpenMP* code path can tessellate and display a terrain with around 20-40 million vertices a second on a dual core processor system. Further the graphics card that has been used could draw tessellated terrain from dynamic vertex buffers at roughly 70 million vertices a second. This indicates that additional cores can be successfully used to do dynamic terrain tessellation and generate other dynamic geometry generation like procedural plants. Just imagine a forest with trees that all look different. Furthermore it has been shown that additional cores can be used to offload tasks from the graphics card. The graphics card would otherwise have to do terrain tessellation in addition to what it has to do anyway. It seems as if the new consoles have even more efficient ways to push dynamic geometry to the graphics card (see [Stokes05]), so the approach described in this article could probably be applied very successfully.
[DevMTApps] ‘Developing Multithreaded Applications: A Platform Consistent Approach’ online at http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/mrte/53797.htm?page=1
[D3D05] DirectX9c December SDK – Available online from www.microsoft.com
[Farin96] Farin, Gerald E. “Curves and Surfaces for Computer-Aided Geometric Design“Academic Press Inc. (London) Ltd (8. Oktober 1996)
[Foley90] Foley James D., van Dam Andries, Feiner. Steven K., Hughes John F. ,”Computer Graphics”, Addison Wesley 1990
[Gruen05] – Gruen Holger, “Efficient Tessellation on the GPU through Instancing”, Journal Of Game Development Volume 1, Issue 3, Thomson Delmar Learning, December 2005
[Bunnell05] Bunnell Michael, “Adaptive Tessellation of Subdivision Surfaces with Displacement Mapping“, GPU Gems II, Addison Wesley 2005
[Gabb05] Gabb Henry, “Threading 3D Game Engine Basics” available online at http://www.gamasutra.com/features/20051117/gabb_01.shtml
[Klimovitski01] Klimovitski Alex, “SSE/SSE2 Toolbox Solutions for Real-Life SIMD Problems“, Game Developer Conference 2001, available online at http://www.gamasutra.com/features/gdcarchive/2001E/Alex_Klimovitski3.pdf
[Stokes05] Stokes Jon
‘Inside the Xbox 360, part I: procedural synthesis and dynamic worlds’ available online at http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars
‘Inside the Xbox 360, Part II: the Xenon CPU’ available online at http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars
‘Introducing the IBM/Sony/Toshiba Cell Processor — Part I: the SIMD processing units’ available online at http://arstechnica.com/articles/paedia/cpu/cell-1.ars
‘Introducing the IBM/Sony/Toshiba Cell Processor -- Part II: The Cell Architecture’ available online at http://arstechnica.com/articles/paedia/cpu/cell-2.ars
[West06] West Nick, “The Inner Product: Multi-core Processors” Game Developer Magazine Volume 13, Number2, February 2006
*Names and brands may be claimed as the properties of others