[In this Intel-sponsored feature, part of the Visual Computing microsite, an Intel duo introduces Horsepower, a demonstration of enhanced, multi-CPU specific ambient animation -- with full, redistributable source code included.]
Game developers want to deliver the best experience possible for each player, but they also want a game that is fair to all players. A higher-performing machine for one player can and should lead to a better game experience, but not a gameplay advantage in a multi-player situation.
Many solutions to this dilemma exist; one approach is to use the extra power to render more frames. Another approach is to take incidental effects and amplify them on multi-core machines. This leaves gameplay consistent across all computer platforms, but rewards those with higher-end systems.
This article introduces a demo called Horsepower that shows such a technique: enhanced ambient animation when run on a multi-core CPU. The source code for Horsepower is free to use and improve.
Horsepower started with the code base of the Intel Smoke demo, which features a multi-threaded game framework with tremendous flexibility. Smoke is a free application and its source code can be downloaded.
Most of the existing Smoke systems were carried over to Horsepower. Smoke already contained a threaded AI system so this capability was migrated to Horsepower. The Smoke framework also utilizes Havok Physics, which is threaded, and Horsepower benefits from this as well.
Horsepower was created to showcase a perceptible difference for users with a multi-core CPU. By threading to take advantage of multi-core CPU power, a few horses running in a field are multiplied to hundreds of horses (Figs. 1-3). Dynamically adjusting the number of horses drawn on-screen maintains a consistent frame rate of 30 frames per second (fps).
When the frame rate falls below the 30 fps target, the demo removes horses from the screen until 30 fps is reached and maintained; when there is headroom above the target, it adds horses until the frame rate settles at 30 fps. Although the horses are the main animated objects in this demo, you can use this technique on any ambient animations in a game for a better game experience.
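The feedback loop described above can be sketched roughly as follows. This is a minimal illustration and not code from the Horsepower source: the function name `adjustHorseCount`, the `PopulationLimits` struct, the step size, the dead band, and the count limits are all assumptions chosen for the example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical tuning values; the article does not give the demo's real numbers.
struct PopulationLimits {
    int minCount = 1;
    int maxCount = 1000;
    int step = 5;             // horses added or removed per adjustment
    double targetFps = 30.0;  // the target frame rate named in the article
    double tolerance = 1.0;   // dead band around the target to limit oscillation
};

// Returns the horse count to use for the next measurement interval.
int adjustHorseCount(int currentCount, double measuredFps,
                     const PopulationLimits& limits = PopulationLimits()) {
    assert(limits.minCount <= limits.maxCount);
    int next = currentCount;
    if (measuredFps < limits.targetFps - limits.tolerance) {
        next = currentCount - limits.step;  // too slow: shed ambient work
    } else if (measuredFps > limits.targetFps + limits.tolerance) {
        next = currentCount + limits.step;  // headroom: add ambient work
    }
    // Keep the population inside the allowed range.
    return std::min(std::max(next, limits.minCount), limits.maxCount);
}
```

Calling this once per measurement interval with an averaged frame rate, rather than every frame, would likely give a steadier count. Because only the ambient population changes, gameplay-relevant objects are unaffected.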
Figure 1. On an Intel Core i7 processor-based system using eight threads, we can maintain almost 600 horses.
Figure 2. On the same Intel Core i7 processor-based system using four threads -- which means no Intel Hyper-Threading Technology -- we drop to 570 horses. So simultaneous multi-threading gives us about 20 additional horses by adding logical cores (and not adding any physical cores).
Figure 3. When we drop to two threads, and therefore are only using two of the four physical cores and no Intel Hyper-Threading Technology, we can maintain roughly 235 horses.
In this article, we introduce the technique of threading an animation system, and you'll see how to apply it to your ambient animations. We'll show you the code we built to explore the idea, show how we changed an animation system so it would scale as desired, and describe some issues you might find as you try this in your game. Your gamers will thank you for providing a richer gaming experience!
To achieve this enhanced performance, the entire animation process must be highly parallel. The code uses OGRE 3D, with a custom-threaded version of OGRE's animation system. Although some documentation exists that details the threaded animation system already available in OGRE, the performance of the animation system was not acceptable for this demo. Optimization of the existing code was necessary.
Let's start with a simple overview of how animation normally works in OGRE, shown in Figure 4.
Figure 4. OGRE animation system: single-threaded case.
The demo calls renderOneFrame in OGRE once per frame. This loops through all the entities in the scene and updates the animation by calling updateAnimation. When OGRE runs updateAnimation, it calculates all the vertex positions based on the current frame of animation. This update includes calculating bone positions, blending animations, and applying weight maps.
OGRE updates all of the entities in serial order, using a single thread as illustrated in Figure 5. This presents a unique opportunity to increase performance by introducing multi-threading. OGRE's animation is ideal for threading because:
- There are many calculations
- An entity's animation doesn't affect other entities
- The work can be easily pulled out of OGRE
For Horsepower, OGRE's animation system was modified to distribute the updates of all the entities across multiple threads (the work shown in the OGRE block in Figure 4), increasing performance. All of the "adjustments" (changing or advancing animations) could also be threaded. Fortunately, because Horsepower is built on top of Smoke, it could take advantage of the threading that was already in place.
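As a rough sketch of the idea, the per-entity updates can be split into contiguous chunks and handed to worker threads. The example below uses plain `std::thread` for brevity, whereas the demo relies on the Smoke framework's threading; `AnimatedEntity` and `updateAllEntities` are hypothetical stand-ins and not OGRE classes or functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for an animated entity; this is not OGRE's Entity class.
struct AnimatedEntity {
    float time = 0.0f;
    float pose = 0.0f;  // placeholder for bone transforms and skinned vertices
    // Reads and writes only this entity's own state, so calls are independent.
    void updateAnimation() { pose = time * 2.0f; }
};

// Split the entity list into contiguous chunks, one chunk per worker thread.
void updateAllEntities(std::vector<AnimatedEntity>& entities, unsigned workerCount) {
    assert(workerCount > 0);
    if (entities.empty()) return;
    const std::size_t chunk = (entities.size() + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < entities.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, entities.size());
        workers.emplace_back([&entities, begin, end] {
            for (std::size_t i = begin; i < end; ++i) entities[i].updateAnimation();
        });
    }
    // Wait for every chunk so all poses are ready before rendering starts.
    for (std::thread& worker : workers) worker.join();
}
```

No locking is needed in this sketch because each worker touches a disjoint range of entities, which mirrors the observation that one entity's animation does not affect another's. A production system would normally reuse a thread pool or task scheduler instead of creating threads every frame.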
Figure 5. In single-threaded mode, we can maintain only about 87 horses.
Figure 6 shows how things look once the update is parallelized.
Since the Adjust step is already parallel, the demo calls updateAnimation directly for each entity after it adjusts the animation. Previously, Adjust happened after the update, effectively moving the adjustments to the next frame.
In the threaded case, Adjust must be called before updateAnimation; otherwise there is nothing to update and the call returns early. Compared with the original ordering, this applies each adjustment one frame sooner. With so many animated objects on screen, the two orderings produce the same net visual effect, so the change in order is of no concern.
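The ordering requirement can be illustrated with a small sketch. Here a hypothetical `dirty` flag models "there is something to update"; the type `AnimatedHorse` and the function `adjustThenUpdate` are assumptions made for this example and do not reflect how OGRE tracks animation state internally.

```cpp
#include <cassert>

// Stand-in entity; "dirty" models whether Adjust changed anything this frame.
struct AnimatedHorse {
    float time = 0.0f;
    float pose = 0.0f;
    bool dirty = false;

    // Adjust: advance the animation state by dt seconds.
    void adjust(float dt) {
        assert(dt >= 0.0f);
        time += dt;
        dirty = true;
    }

    // Update: recompute the pose only if something changed.
    // Returns false when it exits early because there was nothing to do.
    bool updateAnimation() {
        if (!dirty) return false;
        pose = time * 2.0f;
        dirty = false;
        return true;
    }
};

// Per-entity task in the threaded path: adjust first, then update.
bool adjustThenUpdate(AnimatedHorse& horse, float dt) {
    horse.adjust(dt);
    return horse.updateAnimation();
}
```

If `updateAnimation` ran before `adjust` within the same task, it would hit the early return and the pose would not change until the following frame.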
When first implemented, this change crashed the demo for two reasons:
- The demo accessed OGRE and DirectX from multiple threads.
- OGRE does not support multiple readers of a hardware buffer.
While exploring solutions to these problems, we discovered that OgreConfig.h contains a pre-processor macro called OGRE_THREAD_SUPPORT. If OGRE_THREAD_SUPPORT is defined, OGRE supports multi-threaded access and also initializes DirectX in multi-thread mode (the DX device is created with the D3DCREATE_MULTITHREADED flag). Defining this macro resolved the first issue.
Resolving the second issue required more insight. In Horsepower, all of the horses are based on the same mesh. The data for the mesh are loaded only once for all of the horses. To animate the horses, each entity has to access this shared mesh data.
Because DirectX supports multiple readers and OGRE does not, OgreHardwareBuffer needed to be modified to support this functionality. The OgreHardwareBuffer changes can be viewed in the source code's OgreHardwareBuffer.h file (in the path code\extern\Ogre1_9\OgreMain\include). Those two changes to OGRE were sufficient to enable threaded animation in the demo.
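To give a feel for what "multiple readers" means here, the sketch below counts concurrent read locks instead of treating the buffer as either locked or unlocked. It is a hypothetical illustration only: `SharedReadBuffer` and its methods are invented for this example and are not the actual changes made to OgreHardwareBuffer, which can be found in the Horsepower source.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical read-shared buffer; not the modified OgreHardwareBuffer.
class SharedReadBuffer {
public:
    explicit SharedReadBuffer(std::vector<float> data) : mData(std::move(data)) {}

    // Any number of threads may hold a read lock at the same time.
    const float* lockForRead() {
        mReaders.fetch_add(1);
        return mData.data();
    }

    // Each lockForRead must be paired with exactly one unlockRead.
    void unlockRead() {
        const int previous = mReaders.fetch_sub(1);
        assert(previous > 0 && "unlockRead without a matching lockForRead");
        (void)previous;  // avoids an unused-variable warning in release builds
    }

    int activeReaders() const { return mReaders.load(); }
    std::size_t size() const { return mData.size(); }

private:
    std::vector<float> mData;      // shared mesh data, read-only while animating
    std::atomic<int> mReaders{0};  // number of outstanding read locks
};
```

This works only because the shared mesh data is never written while the animation threads are reading it; a buffer that also needs writers would require a full reader-writer lock.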
Horsepower uses a unique performance metric: horse count. On any system, with the frame rate locked at 30 fps, the number of horses displayed indicates the relative performance of that system. The data shown in Figure 7 was captured on an Intel Core i7 processor capable of running eight threads.
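Because the metric is a simple count, configurations can be compared by dividing one count by another. The helper below is a small sketch of that arithmetic; the function name `relativeHorseCount` is an assumption made for illustration.

```cpp
#include <cassert>

// Horse count of one configuration relative to a baseline configuration.
double relativeHorseCount(int horses, int baselineHorses) {
    assert(baselineHorses > 0);
    return static_cast<double>(horses) / baselineHorses;
}
```

Using the counts quoted in the figure captions, with the single-threaded count of about 87 as the baseline, two threads (about 235 horses) give a ratio of roughly 2.7 and four threads (570 horses) give roughly 6.6. These are ratios of horse counts at a fixed frame rate, not direct measurements of raw speedup.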
Possible future work on the Horsepower demo includes:
- Instanced horses -- Horsepower currently shares the same mesh data, but each object is its own entity with its own animation vertices; performance could be improved with instancing.
- An optimized level of detail (LOD) system -- LOD was removed for performance reasons; an optimized LOD system could improve performance further while allowing a higher level of detail.
- Possibly turning it into a herding game -- The framework already has "fear" programmed into the demo (from its Smoke roots). Have the horses fear the camera and create a pen into which the horses are herded.
The possibilities are endless, and with the source code free for use, anyone can try anything out!
Horsepower's primary goal is to show a fair, perceptible difference in effects through the use of threaded animation. We hope this example will encourage developers to make use of the extra compute power on multi-core CPUs in real PC games.