Broken Sword is a graphical adventure game re-implemented from scratch to work on the Game Boy Advance. Selecting the right compression techniques for various data is important for all games, and never more so than when you want to squash as much as possible onto an 8MB cartridge without resorting to content butchery. The largest chunk of our storage budget was given to animation data, but the game also contains roughly 150 unique background images, nearly 80 music cues complete with orchestral sample set, a suite of SFX samples and five languages worth of dialogue.
Uncompressed, the backgrounds in Broken Sword would occupy close to 7MB. This article discusses the technique we used for storing lots of high-detail unique backgrounds for use with character (tiled) BG modes. There's no glamor here, I'm afraid; this article presents a useful, flexible solution to the meat and potatoes problem of managing a large number of background graphics.
Deciding on a graphics mode to best fit the game was the first port of call after identifying all the visual requirements. We decided to use a screen mode which supports 256 colors because it suited the in-game graphics best.
JPEG compression suits detailed images particularly well, so it was a natural first choice for storing our backgrounds. Not only is it very powerful, crunching images down to impressive ratios, but it's very flexible too. JPEG compression can accept a quality parameter, making it very simple to play the quality of the resulting image against its compressed size. We exploited this in our build process by applying an initial global quality level to all backgrounds, set as low as we could get away with, then spending the remainder of the budget individually improving any backgrounds that decompressed with too many ugly artifacts (those with large areas of solid color in particular). Figures 1 to 3 give an example of JPEG's flexible quality control. It's also worth noting that both the artifacts and the subtle decrease in saturation that JPEG introduces are much less noticeable on the Game Boy Advance LCD.
Figure 1: Uncompressed original (46.0 KB)
Figure 2: JPEG compressed at 50/100 quality (15.3 KB)
Figure 3: JPEG compressed at 20/100 quality (9.3 KB)
Across ten representative backgrounds, JPEG compression at a quality of 50/100 reduces images, on average, to 30.2 percent of their original size. With a quality setting of 35/100 this becomes 24.8 percent, and at 20/100 it's down to 18.1 percent.
Exporting and Decompressing JPEG
To generate JPEGs that are free from any unnecessary information, we used a tool that optimizes the compression tables and strips out any comments and the like, saving a bunch of bytes per image. cjpeg.exe is a tool that does just that and is freely available on the web, along with its complement tool djpeg.exe for decompressing.
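By way of illustration, the IJG tools can be driven from a build script like this (the quality value and file names here are just examples; the flags are standard cjpeg/djpeg options):

```shell
# Compress a raw PPM background at quality 50, generating optimized
# Huffman tables; cjpeg writes no comment markers of its own
cjpeg -quality 50 -optimize -outfile background.jpg background.ppm

# Round-trip back to raw pixels to inspect the compression artifacts
djpeg -outfile roundtrip.ppm background.jpg
```

Raising or lowering the `-quality` argument per image is what makes the per-background tuning described above so easy to automate.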
Implementing a JPEG decompressor on the Game Boy Advance is mostly a matter of converting the algorithm to fixed-point math; there is no hardware support for floating-point numbers on the Game Boy Advance, so relying on them is very slow.
The JPEG method processes image data in blocks, 8 by 8 pixels in size, which is particularly useful as tiles in character BG modes on the Game Boy Advance are exactly the same size. What this effectively means is that we can decompress blocks contiguously, straight into tiles.
Some backgrounds in our game are wider than the screen, so for an easy scrolling implementation we decompressed such graphics into an external RAM buffer and DMA-transferred tiles to video memory as they were needed. For static images, however, the buffer is redundant as the JPEG can be decompressed directly to video memory.
Due to its nature, JPEG generates pixel data as 24-bit colors, so in order to display a JPEG compressed image in our chosen graphics mode, we need to palletize the output to fit 256 colors. Performing this operation accurately at runtime, on top of the JPEG decompression itself, is an intensive business, and for this to be a feasible solution we couldn't afford for a screen change to take more than a few seconds.
To palletize an image, pixels must be mapped to the closest possible color in a representative palette of colors, so the choice of palette entries themselves is critical in ensuring the results are as close as possible to the original image. Because accurately computing this optimal palette is a very time consuming operation, we decided to export a pre-calculated palette along with the JPEG data rather than attempt to generate it algorithmically.
Now, as JPEG is a lossy compression technique, palletizing the original image will not necessarily yield an ideal palette, as the image still contains information that JPEG compression might discard. As it's the decompressed JPEG that we need to palletize at runtime, to pre-calculate the best possible palette we need to palletize the very same decompressed JPEG offline. Using a tool that mimics our runtime decompression routines we are able to generate the exact same pixels that we expect to get on the Game Boy Advance and create an image suitable for Photoshop to process, giving us our optimal palette.
Palletizing a pixel involves comparing the color the pixel decompresses to with all available colors in the palette and mapping it to the index of the closest match. Finding the closest match accurately is governed by the method that determines how alike two colors are. One such method is to simply split the colors into their respective RGB components and sum the absolute differences between them. Another more expensive but more accurate method is to sum the squares of the differences between each component. To demonstrate the differences consider the following example:
Suppose we decompress the RGB color (128, 128, 128) and want to determine which of the following two palette entries we should map to:
Palette entry 1 is (132, 129, 129):
Summing the absolute differences between components gives the value:
abs(132 - 128) + abs(129 - 128) + abs(129 - 128) = 6
Summing the squares of the differences between components gives:
(132 - 128)² + (129 - 128)² + (129 - 128)² = 18
Palette entry 2 is (130, 130, 130):
Summing the absolute differences between components gives the value:
abs(130 - 128) + abs(130 - 128) + abs(130 - 128) = 6
Summing the squares of the differences between components gives:
(130 - 128)² + (130 - 128)² + (130 - 128)² = 12
The color we're trying to map is a pure shade of grey so, as palette entry 1 has a red tint, the ideal match is palette entry 2. The simpler method considers both palette entries equally alike in the above example and could therefore map to either. The second method, however, correctly differentiates between the two. Generally speaking, summing the squares produces much more accurate results, but at the cost of three more multiplies per comparison. Since the difference between any two components lies between -255 and 255, the multiplies can be substituted with a lookup into a pre-calculated table of squares, using the absolute value of the difference as the index:
unsigned short pow2[256] ;

// Pre-calculate the table of squares once
for(int i = 0; i < 256; i++)
    pow2[i] = i*i ;

// Squared distance between two colors
difference = pow2[abs( red1 - red2 )] +
             pow2[abs( green1 - green2 )] +
             pow2[abs( blue1 - blue2 )] ;
Of course, calculating a palette index for every pixel in the image would take an enormous amount of time, so we need to consider ways to improve efficiency. A good start is to exploit the fact that an image is likely to contain multiple pixels of the same color, so to avoid performing the same calculations more than once, we can use a lookup table. First of all, initialize the table by filling it with a value that represents a blank entry. This needs to be chosen carefully so as not to collide with real palette indices. When a pixel is decompressed, use its packed RGB color value as a unique index into the table. If the table entry is marked as blank, search for the most appropriate palette color to map to and store its index at that table entry. If the table entry is not blank, however, then it already contains a valid mapping for this color, so just use the palette index already computed.
// 256 colors in a palette
typedef unsigned char PaletteIndex ;

// The lookup table needs an entry for every possible color
#define NUM_COLORS (1 << 24)

// Broken Sword backgrounds only use the first 240 colors in the
// palette so 255 is a safe value for a blank entry
#define BLANK_ENTRY 0xff

PaletteIndex lookupTable[NUM_COLORS] ;

// Prepare table before palletizing
for(int i = 0; i < NUM_COLORS; i++)
    lookupTable[i] = BLANK_ENTRY ;

// For all pixels in image
PaletteIndex pixelMapping ;

// Get next pixel
int pixelcolor = GetDecompressedPixel() ;
if(lookupTable[pixelcolor] == BLANK_ENTRY)
    // Palletize and store in table
    lookupTable[pixelcolor] = FindClosestPaletteEntry(pixelcolor) ;

// Fetch and write pixel
pixelMapping = lookupTable[pixelcolor] ;
Off the shelf, JPEG operates at 24-bit color depth, 8 bits per RGB component. As seen in the code snippet above, this means we need a lookup table with 2^24 entries, one for every possible color an image may contain. In the worst case, this translates to performing a massive number of palletizing calculations, not to mention a lookup table with a 16MB memory footprint. Rather obviously, this is impossible to implement with only 256KB of external RAM.
Fortunately, the maximum color depth available on the Game Boy Advance is only 15-bit, 5 bits per RGB component. So we can instantly simplify things by working at a reduced color depth. Modifying the JPEG decompressor then:
@ Decompressed pixel from JPEG:
@ r1 = R (8-bit)
@ r2 = G (8-bit)
@ r3 = B (8-bit)
@ Reduce 24-bit RGB to 15-bit BGR
mov r0,r1,lsr #3
mov r1,r2,lsr #3
add r0,r0,r1,lsl #5
mov r1,r3,lsr #3
add r0,r0,r1,lsl #10
@ r0 = ((B >> 3) << 10) | ((G >> 3) << 5) | (R >> 3)
With this in place, we now have a lookup table with only 2^15 entries, 32KB in size. It's a step in the right direction certainly, though for images containing a large variety of different colors, populating such a lookup table at runtime is still very expensive. In fact, we have a worst-case execution time of the same order as not using a lookup table at all (occurring when every pixel color in the image is unique).
A sensible way to improve matters is to further reduce the size of the lookup table, saving not only memory but also the maximum number of entries we would need to calculate. By sacrificing a little precision and working with 12-bit colors, discarding the least significant bit of each RGB color component, we can do just that. Of course this introduces a risk of mapping very similar colors to the same palette entry when a more appropriate entry might exist, but for images containing a wide range of colors, like our backgrounds, the impact is negligible. The payoff, however, is vast: now the lookup table is only 4KB in size and the worst-case execution time has been reduced by a factor of eight.
@ Reduce 24-bit RGB to 12-bit BGR
mov r0,r1,lsr #4
mov r1,r2,lsr #4
add r0,r0,r1,lsl #4
mov r1,r3,lsr #4
add r0,r0,r1,lsl #8
@ r0 = ((B >> 4) << 8) | ((G >> 4) << 4) | (R >> 4)
As we've switched to a color depth different from that of our exported palette, runtime palletizing operations now require comparisons to occur in 12-bit also. Rather than converting each palette entry every time a search is needed, it's worthwhile constructing a 12-bit copy of the entire palette just once as part of initialization and palletizing to that. Or better yet, avoiding these calculations at runtime altogether...
Offline Table Calculation
Despite the new slimline size, populating the lookup table remains the most expensive part of palletizing our images. Pre-calculating the table contents and exporting them along with the JPEG data is one further way to improve performance. It's a trade-off between ROM consumption and execution time: the more you choose to export, the faster the images get processed. Selecting the right balance is often very difficult until most of the project is relatively static, so with this in mind we settled for a flexible approach. It may be more reasonable to export only a portion of the lookup table and then calculate any remaining entries at runtime; in this case, choosing the best portion becomes an issue. Once the entire table has been calculated offline, it can be analyzed in order to obtain the most populated slice. By making the size of the slice an input parameter to the tool, it's very simple to run tests and discover what suits the project best.
Another option is to compress the entire lookup table and export that. Choice of compression method is also important here of course; anything too intensive and there's the risk of it taking longer to decompress the table than it would do to populate it. In practice run length encoding and LZ77 compression both gave acceptable results.
If you can afford to reserve 32KB of Game Boy Advance memory for a 15-bit lookup table, then perhaps it's worth considering an adaptive decompressor. As a first pass, allow the tool to generate a 15-bit lookup table and check its compressed size against some pre-defined maximum acceptable limit. For the majority of cases this limit will most likely prove too strict, so perform another pass at 12-bit color depth and export as previously. Some images, however, particularly those that use only similar ranges of color, may palletize with a relatively modest 15-bit table because it compresses particularly well. In that case, export this instead. Modifying the decompressor to cope with either color depth is a trivial affair, so certain images get palletized with improved quality for only a little extra effort.
Figure 4: 15-bit Palletized Image
A further modification, to decrease the average size of 15-bit lookup tables, is to construct the tables with dual entries. For every color the table is expected to palletize, export an explicit index mapping alongside it. This way the table need only contain as many entries as there are unique colors, and even though the size of each entry has grown by two bytes, the total table size is likely to be much smaller. The biggest drawback, however, occurs when accessing the table. Rather than using the color of the decompressed pixels as direct indices, the table now needs to be searched, which is much slower.
For any algorithm, in order to squeeze as much performance as possible from the Game Boy Advance CPU, it pays to hand optimize the time critical routines in 32-bit ARM code and execute them from internal WRAM. We compiled all our JPEG decompression and palletizing routines in this way and also kept the JPEG scaling and zigzag tables in WRAM too.
In conclusion then, using all these techniques on Broken Sword, the 7MB of uncompressed background data was successfully stored in 1.95MB of cartridge ROM, and the average time taken to decompress a single background lies between one and two seconds.
For Further Information
Independent JPEG Group: http://www.ijg.org/