The famous Moore's law, which states in rough terms that every 18 months the speed of computers doubles, has an evil twin: every 18 months software becomes twice as slow. A similar relationship can be formulated for RAM and game data: no matter how big the memory budget of your nextgeneration game may seem, your art team can probably fill it up faster than you can say "disk thrashing." The appetite for art megabytes grows faster than the publisher's willingness to raise the minimum platform requirements.
Until
we start seeing games with a serious amount of geometry, the greatest
slice of the memory pie will belong to textures. Nobody wants to ship
a game with small, blurry, obviously tiling textures—and it's
up to the programmers to alleviate texture limitations. The hundreds
of megabytes of stuff coming from the art quarters must be compressed.
Conventional imagecompression algorithms are not very well suited
to the specific requirements of art storage in games. They are designed
for relatively fast compression, which is not an issue, since art
assets are preprocessed offline; their decompression speed leaves
much to be desired. Also, it is usually hard to access a specific
portion of the image.
For fixed textures used in hardwarerendered games, the texture compression
schemes such as DXTn present a solution; however, for supporting older
hardware, for (gasp!) software renderers, and doing more complicated
stuff with textures they aren't perfect. Sure, you could decompress
DXTn in software and process it, but those formats aren't really meant
for this—it would probably be quite slow. There is a better solution
in terms of both decompression speed and image quality.
Imagecompression algorithms based on vector quantization (VQ) techniques
have been researched for years. Recently, such algorithms have been
implemented in hardware by several graphics chip vendors. Unlike DXTn,
VQ decompression is as easy to do in software as it is in hardware,
and might be just what you need to slash the memory requirements of
your project in half.
This article provides an introduction to the field of VQ, presents
two algorithms for performing VQ, and goes into the details of a successful
realworld application for VQ texture compression.
What Is Vector
Quantization?
Strictly speaking, quantization is the procedure of approximating
continuous with discrete values; in practice, the input values to
the quantization procedure are often also discrete, but with a much
finer resolution than that of the output values. The goal of quantization
usually is to produce a more compact representation of the data while
maintaining its usefulness for a certain purpose. For example, to
store color intensities you can quantize floatingpoint values in
the range [0.0, 1.0] to integer values in the range 0255, representing
them with 8 bits, which is considered a sufficient resolution for
many applications dealing with color. In this example, the spacing
of possible values is the same over the entire discrete set, so we
speak of uniform quantization; often, a nonuniform spacing is more
appropriate when better resolution is needed over some parts of the
range of values. Floatingpoint number representation is an example
of nonuniform quantization—you have the as many possible FP values
between 0.1 and 1 as you have between 10 and 100.
Both these are examples of scalar quantization—the input and output values are scalars, or single numbers. You can do vector quantization (VQ) too, replacing vectors from a continuous (or dense discrete) input set with vectors from a much sparser set (note that here by vector we mean an ordered set of N numbers, not just the special case of points in 3D space). For example, if we have the colors of the pixels in an image represented by triples of red, green, and blue intensities in the [0.0, 1.0] range, we could quantize them uniformly by quantizing each of the three intensities to an 8bit number; this leads us to the traditional 24bit representation.
By quantizing each component of the vector for itself, we gain nothing over standard scalar quantization; however, if we quantize the entire vectors, replacing them with vectors from a carefully chosen sparse nonuniform set and storing just indices into that set, we can get a much more compact representation of the image. This is nothing but the familiar paletted image representation. In VQ literature the "palette," or the set of possible quantized values for the vectors is called a "codebook," because you need it to "decode" the indices into actual vector values.
Why Does VQ Work?
It turns out that VQ is a powerful method for lossy compression of data such as sounds or images, because their vector representations often occupy only small fractions of their vector spaces. We can illustrate this distribution in the case of a simple representation of a grayscale image in a 2D vector space. The vectors will be composed by taking in pairs the values of adjacent pixels. If the input image has 256 shades of gray, we can visualize the vector space as the [0,0][255,255] square in the plane. We can then take the two components of the vectors as XY coordinates and plot a dot for each vector found in the input image.
Figure 2 shows the result of this procedure applied to a grayscale version of the famous "Lena" (Figure 1), a traditional benchmark for imagecompression algorithms.


FIGURE 1. Lena in grayscale. 
The diagonal line along which the density of the input vectors is concentrated is the x = y line; the reason for this clustering is that "Lena," like most photographic images, consists predominantly of smooth gradients. Adjacent pixels from a smooth gradient have similar values, and the corresponding dot on the diagram is close to the x = y line. The areas on the diagram which would represent abrupt intensity changes from one pixel to the next are sparsely populated.


FIGURE 2. Distribution of pairs of adjacent pixels from grayscale Lena. 
If we decide to reduce this image to 2 bits/pixel via scalar quantization, this would mean reducing the pixels to four possible values. If we interpret this as VQ on the 2D vector distribution diagram, we get a picture like Figure 3.


FIGURE 3. Scalar quantization to 2 bits/pixel interpreted as 2D VQ. 
The big red dots on the figure represent the 16 evenly spaced possible values of pairs of pixels. Every pair from the input image would be mapped to one of these dots during the quantization. The red lines delimit the "zones of influence," or cells of the vectors—all vectors inside a cell would get quantized to the same codebook vector.
Now we see why this quantization is very inefficient: Two of the cells are completely empty and four other cells are very sparsely populated. The codebook vectors in the six cells adjacent to the x = y diagonal are shifted away from the density maxima in their cells, which means that the average quantization error in these cells will be unnecessarily high. In other words, six of the 16 possible pairs of pixel values are wasted, six more are not used efficiently and only four are O.K.
Let's perform an equivalent (in terms of size of resulting quantized image) vector quantization. Instead of 2 bits/pixel, we'll allocate 4 bits per 2D vector, but now we can take the freedom to place the 16 vectors of the codebook anywhere in the diagram. To minimize the mean quantization error, we'll place all of these vectors inside the dense cloud around the x = y diagonal.


FIGURE 4. Vector quantization to 4 bits per 2Dvector. 
Figure 4 shows how things look with VQ. As in Figure 3, the codebook vectors are represented as big red dots, and the red lines delimit their zones of influence. (This partitioning of a vector space into cells around a predefined set of "special" vectors, such as for all vectors inside a cell the same "special" vector is closest to them, is called a Voronoi diagram; the cells are called Voronoi cells. You can find a lot of resources on Voronoi diagrams on the Internet, since they have some interesting properties besides being a good illustration of the merits of VQ.)
You can see that in the case of VQ the cells are smaller (that is, the quantization introduces smaller errors) where it matters the most—in the areas of the vector space where the input vectors are dense. No codebook vectors are wasted on unpopulated regions, and inside each cell the codebook vector is optimally spaced with regard to the local input vector density.
When you go to higher dimensions (for example, taking 4tuples of pixels instead of pairs), VQ gets more and more efficient—up to a certain point. How to determine the optimal vector size for a given set of input data is a rather complicated question beyond the scope of this article; basically, to answer it, you need to study the autocorrelation properties of the data. It suffices to say that for images of the type and resolution commonly used in games, four is a good choice for the vector size. For other applications, such as voice compression, vectors of size 4050 are used.
Algorithms for Vector Quantization
The main problem in VQ is choosing the vectors for the codebook so that the mean quantization error is minimal; after the codebook is known, mapping input vectors to it is a trivial matter of finding the best match. In applications where the quantization is performed in real time, a trivial approach to this latter step might prove too slow, but in our case it was on orders of magnitudes faster than finding an optimal codebook.
We experimented with two algorithms for VQ, the classical GLA (generalized Lloyd algorithm, sometimes called Kmeans clustering), and Anthony Dekker's Neuquant. Both of them are extremely computationally expensive, basically using brute force to find a general solution to the problem. Other, much faster algorithms exist, but they achieve speed by restricting the generality of the codebook (for example, treestructured VQ), which would lead to greater quantization error. For our purposes—compression as a preprocessing stage for some of the art assets—compression times of a few hours were acceptable, and that was well within the reach of the brute force algorithms.
Generalized Lloyd Algorithm
Each iteration of the GLA consists of two phases: the codebook assignment phase and the codebook adjustment phase. During the former, each of the vectors from the input set is assigned to the nearest vector from the codebook. During the latter, each of the codebook vectors is replaced with the centroid (in this case, average) of all input vectors assigned to it. This process is convergent, and minimizes the mean square error of quantization.


FIGURE 5. A grass texture (uncompressed.) 
There are two problems with GLA: what to do with "empty cells" and how to choose the initial codebook. You have an empty cell when some vector from the codebook gets no input vectors assigned to it during the assignment phase (its Voronoi cell is empty). It will not move during the adjustment phase and will therefore the cell will probably remain empty in all subsequent operations. In this case, you should remove it from the codebook. You need some kind of heuristic to come up with a prospective replacement. You could split the codebook vector with the greatest number of assigned input vectors into two close vectors, and let several iterations pull them apart; or you could split the one whose assigned vectors are most distant. The first heuristic aims to minimize the mean error, while the second minimizes maximum error.
If you have a satisfying solution to the dead vector problem, the choice of initial codebook does not matter—you could start out with random vectors, or with all vectors at the origin, GLA will eventually move them into position and eliminate improper ones. This, however, can take tens of iterations—work which would be spared with a more careful starting position.
Neuquant
Neuquant was the name given by Anthony Dekker for his application of Kohonen's selforganizing maps (SOMs)—a special type of neural network—to color quantization. We found it quite suitable for quantization of vectors of higher dimensions.
Imagine our codebook as a string of connected points—neurons—in the vector space. Each neuron has two neighbors (except, of course, for the first and the last) in this string. Now, with each vector in the input set, you "stimulate" the string: you find the neuron closest to the stimulus vector, and you pull it in the direction of the stimulus, say, onehalf the distance towards it. Its neighbors in the string also get pulled towards the stimulus, but by a lesser amount, say, onefourth. Their neighbors are influenced only by oneeighth and so on, until at some distance along the string the reaction to the stimulus stops.
When you feed the input set as stimuli, the string moves, turns, and stretches itself to occupy these parts of the vector space where the input data is dense—which is precisely what you want from a good VQ codebook. Neuquant with larger codebooks sometimes wastes codebook entries.
We found that, in general, Neuquant gives decent results faster than GLA (with less iterations), but when given enough time GLA tends to produces better codebooks, which adapt well to "stray" vectors. If you can live with the time, using GLA is definitely recommended.
VQ for Texture Compression
VQ compression is highly asymmetric in processing time: choosing an optimal codebook takes huge amounts of calculations, but decompression is lightningfast—only one table lookup per vector. This makes VQ an excellent choice for data which once created will never change, like most art assets for a game, but which can be kept in a compressed form even in memory right up to the moment when it's used.
Texture images are a prime candidate for VQ—they are often of limited color gamut (for example, in a grass texture you might have hundreds of shades of green, but only a few different reds, blues, and whites) and have a lot of similar, samefrequency features. Several hardware vendors have recognized the suitability of textures for VQ compression. For example, the Dreamcast's video chip supports rendering directly from textures compressed to 2 bits/pixel. The vectors in this case are 2x2 blocks of pixels, or 12dimensional; the codebook has 256 entries for singlebyte indices.
Roll Your Own: Practical Issues
The rest of this article is a detailed account of our practical experience with VQ compression for our current project, a softwarerendered 2.5D RTS. The source code accompanying the article is very close to what we have in the actual game, so you can easily experiment with ideas and tradeoffs discussed below.
For our game we didn't have the luxury of building special hardware for VQ, so we had to design our VQ scheme around the software blitter/blender. Since it uses MMX to process four adjacent 16bit pixels, we chose to quantize 12dimensional vectors too, but taken from a 4x1 block of pixels. This leads to slightly worse results compared to the 2x2 blocks, because the pixels in the tightly packed square block are more closely correlated (that is, likely to have similar values).
Both VQ algorithms work on vectors via a bunch of operators and don't care about their dimensionality or internal representation. This makes them a perfect fit for generic implementations in the form of templated C++ classes, with the vector types left as template parameters for the quantizer classes. The vector class should provide +=, =, *= operators with their usual meaning, a function returning the distance between two vectors according to the metrics of choice (Euclidian distance works just fine). Neuquant needs an additional function, shifting a vector towards another vector by a specified amount.
Because with both algorithms almost all of the time is spend in these vector operations, they are a good candidate for SIMD optimization. Writing SSE and 3DNow versions of the vector classes took us a couple of hours. They both run at about the same speed, roughly twice as fast as their scalar counterparts on the respective processors; greater speedup is probably possible with careful handtuning. The plain old x87 version of the vector class can also be implemented in a generic manner without sacrificing performance, with the dimensionality as a template parameter.


FIGURE 6. VQcompressed grass texture, codebook size 256, compression time 2 minutes, 2.1 bits/pixel. 
We had to spend much more time tackling the empty cell problem with the GLA algorithm. We found that splitting the cell with the largest population of input vectors results in neglecting colors outside the general color gamut of the image; for example, small patches of brown dirt in a grass texture almost completely disappeared because the heuristics allocated all available vectors to the greens. Splitting the largest cells (measured by the maximal distance from a codebook vector to a input vector mapped to it) works much better and preserves color variations more faithfully.
Another issue is the initial codebook configuration and the stop condition for the GLA algorithm. Depending on the input data, sometimes GLA never stops updating the codebook vectors, falling into a stable loop of several iterations. We arbitrarily stop the process when the last N iterations have updated less than 1 percent of the codebook vectors, or when they have updated the same number of input vectors. It might be possible to come across an image for which the algorithm will fall into a loop which updates a different number of vectors on each iteration but still never finishes; we haven't found such an image, but still we put an arbitrary limit on the total number of iterations.
As for the initial configuration, we kept the naive solution of starting with all codebook vectors having the same values. It takes about 1015 iterations just to fill up the codebook with "active" vectors, but the end results are better than starting with a random sample of vectors from the image, starting with a random codebook or using a few iterations of Neuquant as a "seed" for GLA.
Because we let the compression run overnight anyway, for each image we generated compressed versions with different codebook sizes in the range 2568,192. Then we examined them and chose the best tradeoff between visual fidelity and size.
For the compressed images we keep the codebook in highcolor format (because that's what our blitter needs; the sample code works with 24bit true color) and for each image vector a 8bit or 16bit index into the codebook. Reasonable codebook sizes (up to about 8,192) don't use all 16 bits of the indices, but tight packing of the bits would slow down decompression, and this is more important in our case than the additional compression gains. Even in this case you should always use the smallest codebook possible to minimize memory traffic; ideally it should fit into the CPU L1 data cache.
If you ignore the memory for the codebook, this corresponds to 2 or 4 bits/pixel. While a bit on the low side for general image compression, this is excellent for an algorithm which is practically free in terms of CPU time decompression.
When we put the VQ compression described above in our renderer, we expected to get slightly worse performance but were ready to live with it, because it took about 20MB off our memory footprint. However, profiles showed that rendering has actually gained about 15 to 20 percent performance, probably due to decreased memory traffic.
Possible Applications for VQ
Your project may benefit from VQ image compression if you have access to the renderer and you can implement decompression directly inside it. It is also helpful if you can't rely on your hardware renderer supporting a particular standard texturecompression format and still need to stream from disk significant amounts of image data.
What about DXTn texture compression, you might ask? Well, if you can count on hardware support for a texturecompression scheme on your minimum target platform, it is most certainly a good idea to stick with it: after all, somebody has already paid for the tens of thousands of transistors to implement it, and you'd better make good use of them, save tons of RAM on the video card, and keep its precious memory bandwidth down. However, if, for some reason—for example, software rendering (gasp!), texture preprocessing, or support for legacy hardware (where "legacy" may mean "older than nine months" in our business)—you need to have at some point in your application access to the decompressed pixels of your texture, VQ might provide a better answer. DXTn compresses textures to 4 bits/pixel (for fully opaque or singlebitalpha images) or 8 bits/pixel (for images with alpha), while in our experiments we rarely needed more than about 3 bits/pixel with VQ. Decompression times for DXTn are much larger than those for VQ: in our experiments with a 1024x1024 truecolor image, DXTn decompression took about 200ms, straight copying of uncompressed data took about 25ms, and VQ decompression from a 256entry codebook took about 16ms. (Note: We didn't have access to DXTn decompression source code, so we did what most developers would do to decompress DXTn textures, we made the DirectX drivers do it for us. Driver code may or may not be optimized, so above figures should be taken with a grain of salt. Still, the nature of DXTn compression makes us believe that it's impossible to reach VQ decompression speeds.)


FIGURE 7. "Lena" uncompressed. 


FIGURE 8. VQcompressed "Lena," codebook size 1,024, compression time 6 minutes, 2.6 bits/pixel. 
Here's a quick summary of pros and cons of VQ for image compression:
Pros:

Blindingly fast decompression (often faster than simply copying the uncompressed data, orders of magnitude faster than decompressing PNGs or JPEGs)

Good quality at excellent compression ratios (see the 2.6 bits/pixel samples below)

A flexible choice of the tradeoff between compression ratio and fidelity (from about 2 bits/pixel all the way to about 8 bits/pixel; even at 4 bits/pixel most images look considerably better in VQ than in 8bit simple palletization).
Cons:

Very slow compression: compressing any practical amount of art assets is definitely an overnight batch job.

Nonstandard, not widely supported in hardware.
Sample Code
This article is accompanied by sample code for a commandline compression tool and a minimalistic compressed file viewer. To build them you'd need Microsoft Visual C++. No libraries outside the Win32 API and the STL are used.
Two samples of images compressed with this tool are shown in Figures 5 through 9. Running times are for a 600MHz Pentium III for the SSE version of the algorithm. The bitsperpixel ratios are for the case where 10 or 11bit codebook indices are packed tightly, not padded to 16 bits.


FIGURE 9. VQcompressed "Lena," codebook size 2,048, compression time 10 minutes, 2.9 bits/pixel. 