Occlusion Culling Algorithms
Developers are always going to want better performance in real-time rendering, and so speed-up techniques and acceleration schemes will always be needed. In this excerpt from Chapter 7, "Speed-Up Techniques," of Real-Time Rendering, the authors discuss the class of acceleration schemes known as the occlusion culling techniques.
November 9, 1999
Author: Tomas Möller
One of the great myths concerning computers is that one day we will have enough processing power. Even in a relatively simple application like word processing, we find that additional power can be applied to all sorts of things, such as on-the-fly spell and grammar checking, more elaborate graphic presentation, antialiased text display, automated voice recognition and dictation, etc.
In real-time rendering we have at least three performance goals: more frames per second, higher resolution, and more (and more realistic) objects in the scene. A speed of 60-72 frames per second is generally considered enough, and perhaps 1600x1200 is enough resolution for a while, but there is no real upper limit on scene complexity. A full model of a Boeing 777 includes 132,500 unique parts and over 3,000,000 fasteners, which would yield a polygonal model with over 500,000,000 polygons [Cripe98]. Our conclusion: speed-up techniques and acceleration schemes will always be needed.
In this article, we will discuss a certain class of acceleration schemes called occlusion culling techniques. Most of this article is an excerpt from chapter 7, "Speed-Up Techniques," of our book Real-Time Rendering (www.realtimerendering.com or www.acm.org/tog/resources/RTR/). In the book, the occlusion culling section is preceded by sections on backface and clustered culling, hierarchical view-frustum culling, portal culling, and detail culling. Sections on impostor algorithms, level-of-detail techniques, and triangle fan, strip, and polygon mesh techniques follow.
To cull can mean to "select from a flock," and in the context of computer graphics this is exactly what culling techniques do. The flock is the whole scene that we want to render, and the selection is limited to those portions of the scene that are not considered to contribute to the final image. The rest of the scene is sent through the rendering pipeline. The actual culling can theoretically take place at any stage of the rendering pipeline. For culling algorithms that are implemented in hardware, we can typically only enable/disable or set some parameters for the culling function. For full control, the programmer can implement the algorithm in the application stage (on the CPU). Culling is often achieved by using geometric calculations but is in no way limited to these. For example, an algorithm may also use the contents of the frame buffer.
As we all know, visibility may be solved via a hardware construction called the Z-buffer. Even though it may solve visibility correctly, the Z-buffer is not a very smart mechanism in all respects. For example, imagine that the viewer is looking along a line on which 10 spheres are placed. This is illustrated in Figure 1.
An image rendered from this viewpoint will show but one sphere, even though all 10 spheres will be scan-converted and compared to the Z-buffer and then potentially written to the color buffer and Z-buffer. The simple conclusion in this case is that nine spheres will be drawn unnecessarily. This uninteresting scene is not that likely to be found in reality, but it describes (from its viewpoint) a densely populated model. These sorts of configurations are found in such real scenes as a rain forest, an engine, a city, and the inside of a skyscraper.
Thus it seems plausible that an algorithmic approach to avoid this kind of inefficiency may pay off in terms of speed. Such approaches go under the name of occlusion culling algorithms, since they try to cull away (avoid drawing) objects that are occluded, that is, inside the view frustum but not visible in the final image. The optimal occlusion culling algorithm would select only the objects that are visible. In a sense, the Z-buffer selects and renders only those objects which are visible, but not before all objects are sent through the pipeline. The idea behind efficient occlusion culling algorithms is to perform some simple tests early on and so avoid sending data through much of the pipeline.
Pseudocode for a general occlusion culling algorithm is shown in Figure 2, where the function isOccluded, often called the visibility test, checks whether an object is occluded. G is the set of geometrical objects to be rendered; OR is the occlusion representation.
OcclusionCullingAlgorithm(G)
    OR = empty
    for each object g in G
        if isOccluded(g, OR)
            Skip(g)
        else
            Render(g)
            Update(OR, g)
        end
    end

Figure 2. Pseudocode for a general occlusion culling algorithm. G contains all the objects in the scene, and OR is the occlusion representation.
Depending on the particular algorithm, OR represents some kind of occlusion information. OR is set to be empty at the beginning. After that, all objects that passed the view-frustum culling test are processed.
Consider a particular object. First we test whether the object is occluded with respect to the occlusion representation OR. If it is occluded, then it is not processed further, since we then know that it will not contribute to the image. If the object is determined not to be occluded, then that object has to be rendered, since it probably contributes to the image (at that point in the rendering). Finally OR is updated with that object.
For some algorithms, it is expensive to update the occlusion representation, so this is only done once (before the actual rendering starts) with the objects that are believed to be good occluders. This set is then updated from frame to frame.
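To make this concrete, here is a minimal C++ sketch of the loop in Figure 2. The Object type, the OcclusionRep interface, and the render callback are hypothetical stand-ins of ours, not code from the book:

#include <functional>
#include <vector>

struct Object { /* geometry omitted */ };

// Abstract occlusion representation (the OR of Figure 2).
struct OcclusionRep {
    virtual bool isOccluded(const Object& g) const = 0;  // visibility test
    virtual void update(const Object& g) = 0;  // add g as a new occluder
    virtual ~OcclusionRep() = default;
};

void occlusionCull(const std::vector<const Object*>& scene,
                   OcclusionRep& rep,
                   const std::function<void(const Object&)>& render)
{
    for (const Object* g : scene) {
        if (rep.isOccluded(*g))
            continue;       // skip: g cannot contribute to the image
        render(*g);         // g probably contributes, so draw it
        rep.update(*g);     // g may occlude objects processed later
    }
}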
A number of occlusion algorithms will be scrutinized in this section.
Hierarchical Z-Buffering and the Hierarchical Visibility Algorithm
One approach to occlusion culling is the hierarchical visibility (HV) algorithm [Greene93]. This algorithm maintains the scene model in an octree, and a frame's Z-buffer as an image pyramid, which we call a Z-pyramid. The octree enables hierarchical culling of occluded regions of the scene, and the Z-pyramid enables hierarchical Z-buffering of individual primitives and bounding volumes. The Z-pyramid is thus the occlusion representation of this algorithm. Examples of these data structures are shown in Figure 3.
Any method can be employed for organizing scene primitives in an octree, although Greene et al. [Greene93] recommend a specific algorithm that avoids assigning small primitives to large octree nodes. In general, an octree is constructed by enclosing the entire scene in a minimal axis-aligned box. The rest of the procedure is recursive in nature, and starts by checking whether the box contains fewer than a threshold number of primitives. If it does, the algorithm binds the primitives to the box and then terminates the recursion. Otherwise, it subdivides the box along its main axes using three planes, thereby forming eight boxes (hence the name octree). Each new box is tested and possibly subdivided again into 2x2x2 smaller boxes. This process continues until each box contains fewer than the threshold number of primitives, or until the recursion has reached a specified deepest level [Samet89a,Samet89b]. This is illustrated in two dimensions, where the data structure is called a quadtree, in Figure 4.
The construction of an octree takes too much time to be done at runtime, so this method is best suited for static models.
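As a concrete illustration of the generic construction just described, the following C++ sketch subdivides until a node holds few enough primitives or the depth limit is reached. All types and names here are hypothetical, and the primitive-assignment policy is a simple overlap test, not the more careful scheme Greene et al. recommend:

#include <cstddef>
#include <memory>
#include <vector>

struct AABB { float min[3], max[3]; };
struct Primitive { AABB bounds; /* geometry omitted */ };

struct OctreeNode {
    AABB box;
    std::vector<const Primitive*> primitives;
    std::unique_ptr<OctreeNode> children[8]; // all null for a leaf
};

static bool overlaps(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i)
        if (a.max[i] < b.min[i] || a.min[i] > b.max[i])
            return false;
    return true;
}

void build(OctreeNode& node, int depth, int maxDepth, std::size_t maxPrims) {
    if (node.primitives.size() <= maxPrims || depth == maxDepth)
        return;                          // leaf: terminate the recursion
    float mid[3];
    for (int a = 0; a < 3; ++a)          // the three splitting planes
        mid[a] = 0.5f * (node.box.min[a] + node.box.max[a]);
    for (int i = 0; i < 8; ++i) {
        node.children[i] = std::make_unique<OctreeNode>();
        for (int a = 0; a < 3; ++a) {    // bit a of i picks the low/high half
            bool high = (i >> a) & 1;
            node.children[i]->box.min[a] = high ? mid[a] : node.box.min[a];
            node.children[i]->box.max[a] = high ? node.box.max[a] : mid[a];
        }
    }
    // A primitive straddling a splitting plane is referenced by every
    // child box it overlaps.
    for (const Primitive* p : node.primitives)
        for (auto& c : node.children)
            if (overlaps(p->bounds, c->box))
                c->primitives.push_back(p);
    node.primitives.clear();
    for (auto& c : node.children)
        build(*c, depth + 1, maxDepth, maxPrims);
}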
Once the octree has been created, each frame is rendered in approximately front-to-back order by calling the procedure ProcessOctreeNode (outlined in Figure 5) with the root node of the octree. Octree nodes that are outside the view frustum are culled away. The first step determines whether the node's bounding box is visible with respect to the Z-pyramid, using a procedure that will be described later. In this case, a node's bounding box is a box in the octree. If the node is occluded, we do not need to process the contents of that box further, since its contents do not contribute to the final image. Otherwise, we render the primitives associated with the node into the Z-pyramid (tileInto in the pseudocode) and then process each of the node's children (if it has any) in front-to-back order using this same recursive procedure. When recursion finishes, all visible primitives have been tiled into the Z-pyramid, and a standard Z-buffer image of the scene has been created.
The HV algorithm performs occlusion culling very efficiently because it only traverses visible octree nodes and their children, and it only renders the primitives in visible nodes. This can save much of the work in scenes that are densely occluded. For example, in the scene pictured in Figure 3, more than 99% of on-screen polygons are inside occluded octree nodes, which are therefore culled by the Z-pyramid [Greene95].
ProcessOctreeNode(OctreeNode N)
    if isOccluded(NBV, ZP) then return
    for each primitive p in N
        tileInto(p, ZP)
    end
    for each child node C of N in front-to-back order
        ProcessOctreeNode(C)
    end

Figure 5. Pseudocode for the hierarchical visibility algorithm. To render a frame, this procedure is called with the root node of the octree. NBV is the bounding volume of the octree node N, and ZP is the Z-pyramid that is the occlusion representation of this algorithm. The operation tileInto renders a primitive p into the Z-pyramid, ZP, and this also updates the entire Z-pyramid.
Now we will describe how the Z-pyramid is maintained and how it is used to accelerate culling. The finest (highest-resolution) level of the Z-pyramid is simply a standard Z-buffer. At all other levels, each z-value is the farthest z in the corresponding 2x2 window of the adjacent finer level. Therefore each z-value represents the farthest z for a square region of the screen. To maintain a Z-pyramid, whenever a z-value is overwritten in the Z-buffer it is propagated through the coarser levels of the Z-pyramid. This is done recursively until the top of the image pyramid is reached, where only one z-value remains (this is illustrated in Figure 6).
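A minimal C++ sketch of this propagation, under the assumptions that larger z means farther from the viewer and that each level halves the resolution (hypothetical code, not the authors' implementation):

#include <algorithm>
#include <vector>

// levels[0] is the full-resolution Z-buffer; each coarser level stores
// the farthest z of the corresponding 2x2 block of the next finer level.
struct ZPyramid {
    std::vector<std::vector<float>> levels; // levels[0] finest ... 1x1 top
    std::vector<int> dims;                  // dims[k] = width == height of level k

    float at(int level, int x, int y) const {
        return levels[level][y * dims[level] + x];
    }
    void set(int level, int x, int y, float z) {
        levels[level][y * dims[level] + x] = z;
    }
};

// Propagate a newly written z-value at pixel (x, y) of the Z-buffer up
// through the coarser levels, stopping as soon as nothing changes.
void propagateZ(ZPyramid& zp, int x, int y) {
    for (int level = 1; level < (int)zp.levels.size(); ++level) {
        int fine = level - 1;
        x /= 2;
        y /= 2;
        // Farthest z of the 2x2 block at the finer level.
        float farthest = std::max(
            std::max(zp.at(fine, 2 * x,     2 * y),
                     zp.at(fine, 2 * x + 1, 2 * y)),
            std::max(zp.at(fine, 2 * x,     2 * y + 1),
                     zp.at(fine, 2 * x + 1, 2 * y + 1)));
        if (zp.at(level, x, y) == farthest)
            return;                         // value unchanged: done
        zp.set(level, x, y, farthest);
    }
}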
Next, we describe how hierarchical culling of octree nodes is done. To determine whether a node is visible, the front faces of its bounding box are tested against the Z-pyramid. The node is occluded if all of its front faces are occluded by the Z-pyramid. To establish whether an individual face is occluded, we begin at the coarsest Z-pyramid cell that encloses the face's screen projection. The face's nearest depth within the cell (z_near) is then compared to the Z-pyramid value, and if z_near is farther, the face is known to be occluded. For densely occluded scenes, this procedure often culls an occluded face with a single depth comparison. When this initial test fails to cull a face, its visibility can be definitively established by recursively traversing from the initial Z-pyramid cell down to finer levels in the Z-pyramid, performing additional depth comparisons at the child cells encountered. If this subdivision procedure does not ultimately find a visible sample on the face, the face is occluded. In scenes where the bounding boxes of octree nodes overlap deeply on the screen, this hierarchical procedure can establish visibility much more efficiently than conventional Z-buffering can. For example, in the scene pictured in Figure 3, hierarchical culling of octree nodes with the Z-pyramid generates roughly one hundred times fewer depth comparisons than visibility testing with conventional Z-buffering.
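Continuing the sketch above (and reusing its ZPyramid struct), the recursive face test might look as follows. FaceProjection and the conservative use of a single z_near for the whole rectangle are our own simplifications; the real algorithm computes the face's nearest depth per cell:

struct FaceProjection {
    int x0, y0, x1, y1; // screen-space pixel bounds of the face, level 0
    float zNear;        // nearest z of the face within this rectangle
};

// A cell at 'level' covers a (2^level x 2^level)-pixel square of the screen.
static bool cellTouchesRect(int level, int cx, int cy, const FaceProjection& f) {
    int size = 1 << level;
    int x0 = cx * size, y0 = cy * size;
    return !(x0 + size - 1 < f.x0 || x0 > f.x1 ||
             y0 + size - 1 < f.y0 || y0 > f.y1);
}

// Returns true if the face is occluded. Call initially with the coarsest
// cell that encloses the face's screen projection; in densely occluded
// scenes that single comparison often suffices.
bool isFaceOccluded(const ZPyramid& zp, const FaceProjection& f,
                    int level, int cx, int cy)
{
    if (f.zNear >= zp.at(level, cx, cy))
        return true;     // nearest point behind the farthest z: hidden here
    if (level == 0)
        return false;    // a potentially visible sample was found
    for (int j = 0; j < 2; ++j)
        for (int i = 0; i < 2; ++i) {
            int x = 2 * cx + i, y = 2 * cy + j;
            if (cellTouchesRect(level - 1, x, y, f) &&
                !isFaceOccluded(zp, f, level - 1, x, y))
                return false;   // visible in some overlapped child cell
        }
    return true;         // every overlapped child cell was occluded
}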
As we have seen, the original HV algorithm [Greene93] used a Z-pyramid only for occlusion tests and employed traditional Z-buffer scan conversion to render the polygons in visible octree nodes. Subsequently, a more efficient method, called hierarchical polygon tiling [Greene96], was developed. This algorithm adapts the basic screen subdivision procedure described above to polygon "tiling," that is, finding the visible samples on a polygon. When tiling a polygon into the Z-pyramid, it must be determined, at each level of subdivision where the polygon is compared to a Z-pyramid cell, whether the polygon overlaps the cell and if so, whether the polygon is occluded by the cell. Overlap tests are accelerated with coverage masks that indicate which subcells within a Z-pyramid cell are covered by the polygon, and occlusion tests are performed by comparing the polygon's nearest z-value within the cell to the Z-pyramid value. This procedure for hierarchical Z-buffering is very efficient, because it only traverses regions of the screen where a polygon is visible or nearly visible.
Hierarchical polygon tiling can be performed even more efficiently if the polygons in a scene are organized into a BSP tree or an "octree of BSP trees" [Greene96]. The reason is that this enables traversing polygons in strict front-to-back order, which eliminates the need to maintain depth information. Rather, the only occlusion information required is whether or not each image sample has been written, and this information can be maintained in an image pyramid of coverage masks called a coverage pyramid. This is the data structure maintained by hierarchical polygon tiling with coverage masks [Greene96], which is similar to hierarchical Z-buffering except that occlusion tests are performed with coverage-mask operations instead of depth comparisons, which accelerates tiling considerably.
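To give a flavor of why coverage masks are fast, here is a tiny C++ sketch of the bookkeeping for a single 8x8 cell; the bitmask representation is a common one, but the names are ours and this is not the tiling algorithm itself:

#include <cstdint>

using CoverageMask = std::uint64_t;  // bit (8*y + x) = sample (x, y)

// True if every sample under the polygon has already been written,
// i.e., the polygon is occluded within this cell.
bool fullyCovered(CoverageMask polySamples, CoverageMask written) {
    return (polySamples & ~written) == 0;
}

// Mark the polygon's samples as written (front-to-back order makes a
// depth comparison unnecessary).
CoverageMask writeSamples(CoverageMask polySamples, CoverageMask written) {
    return written | polySamples;
}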
The HV algorithm implemented with hierarchical tiling may well be the most efficient method known for software rendering of complex scenes composed of polygons, but it is not fast enough for real-time rendering of complex scenes on today's microprocessors. To enable real-time rendering, Greene et al. [Greene93] suggest modifying hardware Z-buffer pipelines to support HV, which requires substituting a Z-pyramid for the Z-buffer, and including a fast feedback path to report visibility of bounding volumes. It is likely that these modifications would extend the domain of real-time rendering to much more complex scenes, such as the scene in Figure 3.
In the absence of this kind of hardware support, the HV algorithm can be accelerated on systems having conventional Z-buffer hardware by exploiting frame-to-frame coherency [Greene93]. The idea is that octree nodes that were visible in one frame tend to be visible in the next. With this variation, the first frame of an animation sequence is generated with the standard HV algorithm, except that after completing the frame, a list of octree nodes that were visible in that frame (the visible node list) is created by testing nodes for visibility against the Z-pyramid. Subsequent frames are generated with the following two-pass algorithm. In the first rendering pass, primitives associated with nodes on the visible node list are rendered by Z-buffer hardware. Then, the Z-buffer of the partially rendered scene is read back from the hardware, and a Z-pyramid is built from this Z-buffer. In the second rendering pass, the standard HV algorithm is run in software, traversing the octree from front to back but skipping nodes which have already been rendered. This second pass fills in any missing parts of the scene. The final step in processing a frame is to update the visible node list. Typically, this variation of the HV algorithm runs considerably faster than the all-software version, because nearly all visible polygons are rendered with Z-buffer hardware.
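As a structural sketch of this two-pass frame loop (emphatically not the authors' code), the helpers below are hypothetical placeholder declarations for the operations named in the text, and ZPyramid refers to the struct from the earlier sketch:

#include <vector>

struct OctreeNode;  // scene octree node (definition omitted)

// Placeholder declarations; their bodies depend on the rendering API.
void renderWithHardware(const OctreeNode& n);             // pass 1 drawing
std::vector<float> readBackZBuffer();                     // hardware read-back
ZPyramid buildZPyramid(const std::vector<float>& zbuffer);
void processOctreeNodeSkipping(OctreeNode& root, ZPyramid& zp,
                               const std::vector<OctreeNode*>& alreadyDrawn);
std::vector<OctreeNode*> collectVisibleNodes(OctreeNode& root,
                                             const ZPyramid& zp);

// One frame of the coherence-based, two-pass variant.
void renderFrame(OctreeNode& root, std::vector<OctreeNode*>& visibleList) {
    // Pass 1: draw last frame's visible nodes with the Z-buffer hardware.
    for (OctreeNode* n : visibleList)
        renderWithHardware(*n);

    // Read back the partially rendered Z-buffer and build a Z-pyramid.
    ZPyramid zp = buildZPyramid(readBackZBuffer());

    // Pass 2: run the standard HV algorithm in software, front to back,
    // skipping nodes already rendered in pass 1.
    processOctreeNodeSkipping(root, zp, visibleList);

    // Update the visible node list for the next frame.
    visibleList = collectVisibleNodes(root, zp);
}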
Greene and Kass [Greene94b] have developed an extension to hierarchical Z-buffering which renders antialiased scenes with error bounds. Another interesting algorithm for occlusion culling is the visibility skeleton developed by Durand et al. [Durand97,Durand97b].
The HOM Algorithm
The hierarchical occlusion map (HOM) algorithm [Zhang97] is another way of enabling hierarchical image-space culling (such as the hierarchical Z-buffering algorithm). However, the HOM algorithm can be used on systems that have graphics hardware but not a hardware Z-pyramid, and it can also handle dynamic scenes. The HOM algorithm is described in detail in Zhang's Ph.D. thesis [Zhang98].
We start by describing how the function isOccluded works. This function, used in the pseudocode in Figure 2, is a key part of the algorithm. The occlusion test takes place after the view transform, so the viewer is located at the origin looking down the negative z-axis, with the x-axis going to the right and the y-axis going upwards. The test is then divided into two parts: a one-dimensional depth test in the z-direction and a two-dimensional overlap test in the xy plane, i.e., the plane onto which the image is projected. The overlap test supports approximate visibility culling, whereby objects that "shine through" small holes in the occluders can be culled away using an opacity threshold parameter.
For both tests, a set of potentially good occluders is identified before the scene is rendered, and the occlusion representation is built from these. This step is followed by the rendering of the scene, where the occluders are rendered without an occlusion test. Then the rest of the scene is processed by testing each object against the occlusion representation. If an object is occluded by the occluder representation, it is not rendered.
For the two-dimensional overlap test, the occluders are first rendered into the color buffer with a white color on a black background. Therefore, texturing, lighting, and Z-buffering can be turned off. An advantage of this operation is that a number of small occluders can be combined into a large occluder. The rendered image, which is called an occlusion map, is read back into the main memory of the computer. For simplicity, we assume that this image has a resolution of 2^n x 2^n pixels. It is used as the base for the occlusion representation. Then a hierarchy of occlusion maps (HOM), i.e., an image pyramid of occlusion maps, is created by averaging over 2x2 pixel blocks to form an image of 2^(n-1) x 2^(n-1) pixels. This is done recursively until a minimum size is reached (for example, 4x4 pixels). The highest-resolution level of the HOM is numbered 0, with increasing numbers having decreasing resolution. The gray-scale values in the HOM are said to be the opacity of the pixels. A high opacity value (near white) for a pixel at a level above 0 means that most of the pixels it represents are covered by occluders.
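Building one coarser HOM level from the next finer one is a plain 2x2 box filter. A C++ sketch (hypothetical code; as noted below, graphics hardware can do the same with bilinear minification):

#include <vector>

// Average each 2x2 block of a square opacity map (values in [0,1]) to
// produce the next, half-resolution level of the HOM.
std::vector<float> downsample(const std::vector<float>& src, int srcDim) {
    int dstDim = srcDim / 2;
    std::vector<float> dst(dstDim * dstDim);
    for (int y = 0; y < dstDim; ++y)
        for (int x = 0; x < dstDim; ++x)
            dst[y * dstDim + x] = 0.25f *
                (src[(2 * y)     * srcDim + 2 * x]     +
                 src[(2 * y)     * srcDim + 2 * x + 1] +
                 src[(2 * y + 1) * srcDim + 2 * x]     +
                 src[(2 * y + 1) * srcDim + 2 * x + 1]);
    return dst;
}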
The creation of the HOM can be implemented either on the CPU or by texture mapping, with bilinear interpolation used as a minification filter. For large image sizes, the texture filtering approach was found to be faster, and for small image sizes, the CPU was faster. Of course, this varies with CPUs and graphics hardware. For a 1024x1024 image, Zhang et al. [Zhang97] used a 256x256 image as the base for the HOM. An example of a HOM is shown in Figure 7.
The overlap test against the HOM starts by projecting the bounding volume of the object to be tested onto the screen (Zhang et al. [Zhang97] used oriented bounding boxes). This projection is then bounded by a rectangle, which covers at least as many pixels as the object enclosed in the bounding volume. The test is therefore conservative: even if it reports that the object is not occluded, the object may still be occluded. The rectangle is then compared against the HOM for overlap. The overlap test starts at the level at which the size of a pixel in the HOM is approximately the size of the rectangle. If all pixels in the rectangle are opaque (fully white, for non-approximate culling), then the rectangle is occluded in the xy plane and the object is said to pass the test. On the other hand, if a pixel is not opaque, the test for that pixel continues recursively with the subpixels in the HOM covered by the rectangle, so the resolution of the occlusion maps increases with each test.
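A C++ sketch of this recursive overlap test follows; the OcclusionMaps layout, the Rect type, and the per-level threshold array are our own assumptions (thresholds of 1.0 give exact culling, lower values the approximate culling described next):

#include <vector>

struct OcclusionMaps {
    std::vector<std::vector<float>> maps; // opacity in [0,1]; maps[0] finest
    std::vector<int> dims;                // dims[k] = width == height of level k
    float at(int level, int x, int y) const {
        return maps[level][y * dims[level] + x];
    }
};

struct Rect { int x0, y0, x1, y1; }; // pixel bounds at level 0, inclusive

// Returns true if every level-0 pixel under the rectangle is opaque
// enough, given the per-level thresholds. Call initially at the level
// whose cell size roughly matches the rectangle.
bool overlapTest(const OcclusionMaps& hom, const std::vector<float>& threshold,
                 const Rect& r, int level, int cx, int cy)
{
    if (hom.at(level, cx, cy) >= threshold[level])
        return true;            // opaque enough: covered at this cell
    if (level == 0)
        return false;           // a transparent pixel: object may show
    for (int j = 0; j < 2; ++j)
        for (int i = 0; i < 2; ++i) {
            int x = 2 * cx + i, y = 2 * cy + j;
            int size = 1 << (level - 1);
            // Skip child cells the rectangle does not touch.
            if (x * size > r.x1 || (x + 1) * size - 1 < r.x0 ||
                y * size > r.y1 || (y + 1) * size - 1 < r.y0)
                continue;
            if (!overlapTest(hom, threshold, r, level - 1, x, y))
                return false;
        }
    return true;
}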
For approximate visibility culling, the pixels in the HOM are not compared to full opacity, i.e., white, but rather against an opacity threshold value, a gray-scale value. The lower the threshold value, the more approximate the culling. The advantage here is that if a pixel is not fully opaque (white) but its opacity is still higher than the threshold, the overlap test can terminate earlier. The penalty is that some objects may be omitted from rendering even though they are (partially) visible. The opacity threshold values are not constant from one level to another in the HOM, as shown in the following example.
Example: Computation of opacity threshold values
Assume the rendered image is 1024x1024 pixels and that the lowest level in the HOM (i.e., the one with the largest resolution) has a resolution of 128x128 pixels. A pixel in this level-zero occlusion map corresponds to an 8x8-pixel region in the rendered image. Also assume that a 2x2 region of black pixels in an 8x8 region can pass as a negligible hole. This gives an opacity threshold of O = 1 - 2^2/8^2 = 0.9375. The next level in the HOM has a 64x64 resolution, and a pixel at this level corresponds to 16x16 pixels in the rendered image. So the opacity threshold at this level is O = 1 - 2^2/16^2, which is approximately 0.984.
We will now derive a recursive formula for computing the opacity threshold values of the different levels in the HOM. The threshold of the level with the highest resolution in the HOM is O_0 = 1 - n/m, where n is the number of black pixels that can be considered a negligible hole, and m is the number of pixels in the rendered image represented by one pixel in this occlusion map (m = 8x8 in the example above). The next level in the HOM has a threshold of O_1 = 1 - n/(4m) = 1 - (1 - O_0)/4 = (3 + O_0)/4. This reasoning generalizes to the following formula, which gives the threshold at level k+1 in terms of the threshold at level k.
O_(k+1) = (3 + O_k) / 4
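A small helper (hypothetical code) that tabulates these thresholds:

#include <vector>

// n: number of black pixels accepted as a negligible hole; m: number of
// rendered-image pixels represented by one level-0 occlusion-map pixel.
std::vector<float> opacityThresholds(int levels, float n, float m) {
    std::vector<float> t(levels);
    t[0] = 1.0f - n / m;                  // O_0 = 1 - n/m
    for (int k = 1; k < levels; ++k)
        t[k] = (3.0f + t[k - 1]) / 4.0f;  // O_(k+1) = (3 + O_k) / 4
    return t;
}
// With n = 2*2 and m = 8*8 this yields 0.9375, 0.984375, ... as in the
// example above.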
For more details on this topic, consult Zhang's Ph.D. thesis [Zhang98].
For the one-dimensional z-depth test, we must be able to determine whether an object is behind the selected occluders. Zhang [Zhang98] describes a number of methods; we choose to describe the depth estimation buffer, which provides a reasonable estimate and does not require a Z-buffer. It is implemented as a software Z-buffer that divides the screen into a number of rectangular regions that are rather large in relation to the pixel size. The selected occluders are inserted into this buffer, and for each region the farthest z-value is stored. This is in contrast to a normal Z-buffer, which stores the nearest z-value at each pixel. To obtain a far value for an occluder quickly, the z-value of the farthest vertex of its bounding box is used as an estimate. An example of a depth estimation buffer is shown in Figure 8.
The depth estimation buffer is built for each frame. During rendering, to test whether an object passes the depth test (i.e., whether it is behind the occluders) the z-value of the nearest vertex of its bounding box is computed. This value is compared against the z-values of all regions in the depth estimation buffer that the bounding box rectangle covers in screen space. If the near value of the bounding box is larger than the stored z-depth in all regions, then the object passes the depth test, and is thus occluded in the depth direction. A resolution of 64x64 regions in the depth estimation buffer was used by Zhang et al. [Zhang97].
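Putting the two pieces together, a C++ sketch of a depth estimation buffer might look like this; the grid-coordinate interface and the use of +infinity for empty regions are our own assumptions (larger z is taken to mean farther from the viewer):

#include <limits>
#include <vector>

// A coarse grid of screen regions, each storing the farthest z of the
// occluders that touch it. A region no occluder touches keeps +infinity,
// so nothing can pass the depth test there.
struct DepthEstimationBuffer {
    int res;                  // e.g. 64 for the 64x64 grid used in [Zhang97]
    std::vector<float> farZ;  // farthest occluder z per region

    explicit DepthEstimationBuffer(int r)
        : res(r), farZ(r * r, std::numeric_limits<float>::infinity()) {}

    // Insert an occluder covering grid regions [gx0,gx1] x [gy0,gy1],
    // with zFar estimated from the farthest vertex of its bounding box.
    void insert(int gx0, int gy0, int gx1, int gy1, float zFar) {
        for (int y = gy0; y <= gy1; ++y)
            for (int x = gx0; x <= gx1; ++x) {
                float& z = farZ[y * res + x];
                if (z == std::numeric_limits<float>::infinity() || zFar > z)
                    z = zFar;
            }
    }

    // The object passes (is behind the occluders) if the z of the nearest
    // vertex of its bounding box exceeds the stored farthest z in every
    // region its screen rectangle covers.
    bool passesDepthTest(int gx0, int gy0, int gx1, int gy1, float zNear) const {
        for (int y = gy0; y <= gy1; ++y)
            for (int x = gx0; x <= gx1; ++x)
                if (zNear <= farZ[y * res + x])
                    return false;
        return true;
    }
};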
For an object to be occluded, it must thus first pass the overlap test; i.e., the rectangle of the projected bounding volume of the object must pass the HOM test. Then it must pass the depth test, i.e., it must be behind the occluders. If an object passes both tests, the object is occluded and is not rendered.
Before the algorithm starts, an occluder database is built, where a few criteria are used to exclude certain objects [Zhang98]. First, small objects do not work well in the occluder database, since they usually cover a small portion of the image unless the viewer is very close to them. Second, objects with a high polygon count are also avoided, as these may negatively affect the performance of the rendering of the occlusion map. Third, objects with large or ill-shaped bounding boxes (e.g., a bounding box for a skinny polygon) should be avoided, as they may cause the depth estimation buffer to be too conservative. Finally, redundant objects are avoided: for example, a clock on a wall does not contribute as much to occlusion as the wall itself.
At runtime, occluders are selected from the database. To keep the creation of the HOM from becoming a bottleneck, there is a limit on the number of occluders that can be selected. Occluders are selected with respect to their distance from the viewer and their size, and only objects inside the view frustum are selected. A case in which this does not work well is shown in Figure 9.