The task of porting Indiana Jones and the Infernal Machine from PC to N64 proved to be quite challenging. The Nintendo 64 could handle the amount of polygons quite well, but memory use was a huge problem because the developers of the PC original hadn't needed to worry about it at all. While a typical PC memory footprint might be 128 MB, the N64 needed to run the game with only 4 MB -- and that memory must also hold the code, frame buffers, textures and display lists. You can take advantage of the N64 Expansion Pak to provide higher resolution and better performance, but only if it is present.
Looking at these numbers alone, it seems that this kind of challenge is doomed to fail from the start. However, when the project started, the original PC game was much smaller in terms of texture and model usage. As we all know, projects tend to grow during the last phase of production because everybody wants to make the game as good as it can be. Certainly, an adventure game like Indy provided enough opportunities for growth.
As it turned out, we were looking for every thousand bytes we could free up in the end. A number of different methods were used to make it fit. This article presents a summary of the most important methods for dealing with memory constraints, most of which are applicable and beneficial to other projects. Reducing memory usage also helps to increase performance, because data is more likely to be local, and therefore more likely to be found in one of the CPU's data caches.
One could argue that developers today no longer need to concentrate on memory constraints because of new systems like the Sony PlayStation 2 and Nintendo GameCube. However, while their rendering capabilities have steadily improved, their memory subsystems have not changed very much. This suggests that memory organization and efficient memory usage are now an even more pressing issue.
The approach used in porting Indy to the N64 shares many similarities with the data path used in producing an original title from scratch. Normally, the art and level data is exported into the game engine's specific data format using a combination of exporting tools. Those files in turn are bound together, creating a binary resource which is loaded on the target. To reduce the size of the binary resource, we were left with two options: rewriting the export tools to output smaller and more optimized structures, or reusing the existing PC game's startup and initialization code to load the binary resource.
We chose the second approach because it offered a quick start, given that all the code for loading and initializing data was already in place. All we needed to do was take the game and strip the rendering loop out of it. Starting from there, we could take all the game's data structures and rewrite and optimize them in an easy fashion (Figure 1), familiarizing ourselves with the game's internal structure in the process. We called this abused executable the munger. We extended it to optimize and rewrite all data structures. It also provided for more consistency checks, which were required to ensure that our changes to the data structures didn't cripple the data, e.g. by casting 32 bit values to eight bits.
Figure 1: Data-Path Setup
This data path has many similarities with that of an original title, and therefore the methods used and lessons learned in porting Indy apply to original titles as well. It is also important to take data management and storage requirements into account, since the amount of art data (e.g. geometry, textures, animation data, etc.) is increasing with the release of the next-generation systems.
The methods used fall into three classes. The rest of the article discusses the methods in more detail and offers some practical insight, which may be of benefit to those of you experiencing similar memory problems.
- Rewriting & Optimizing Data Structures: decreasing the size of data structures sounds like a tedious and error-prone process; done well, it can be an easy way to reduce the size of the data.
- Resource Caching on the Target: although a large amount of data can be cached (this works surprisingly well), different kinds of data require different caching approaches. The caching schemes used are described and classified.
- Using a Virtual Memory Kernel to reduce code size: one of the benefits of console programming is that you have total control over the target hardware. Implementing a virtual memory kernel sounds over the top at first; however, it not only frees up precious memory, but also helps in development and debugging. In general, a virtual memory kernel presents a means to trap illegal memory accesses.
Method A: Rewriting & Optimizing Data Structures

One basic problem at the beginning of the project was that the PC data was all little-endian, whereas the N64 (and GCN) use big-endian. Although it's possible to change the MIPS R4300i CPU's endianness, that would have required a new operating system, which wasn't feasible at the time. Since data has to be swapped according to its word size (i.e. a 32 bit value is swapped differently than a 16 bit value, and a byte isn't swapped at all), the munger needed to know the layout of the data structures. For a homogeneous array this is easy. Floating-point vectors are a great example: you know that a vector consists of three consecutive floating-point values, which are 32 bits each. From that you can derive a tiny piece of code, which would swap "# of vectors * 3" long words. However, as soon as a vector is realized using signed 16 bit values, you would need to change all of those tiny pieces of code. Experience shows that approaches like that don't work.

This problem is even more challenging if you need to write complex structures, which contain substructures and subunions. You would have to write a tiny piece of code for each type-definition in the game, which would be almost impossible to maintain, because a discrepancy between the type-definition and the tiny piece of code would cause the data to be written out incorrectly. Looking at a full-scale game, we quickly realized that this approach was not possible; there were simply too many data types involved, and many of them were expected to be under constant change because our goal was to reduce their size. Therefore, we needed a mechanism that could automatically determine how to convert the endianness of arbitrary data structures, regardless of whether they contain unions and substructures.

The solution sounds a bit scary at first, but it proved to be quite useful. During startup, the munger parses a part of its own source code to determine the memory layout of all the data structures that are about to be converted from little to big-endian. With Indy, this would have been too difficult if we had allowed the full C syntax for data structures, including include files and macro expansion. There was, however, an easy solution which kept the parsing really simple while still producing valid C structures:

- Defining a simple set of C preprocessor macros (Figure 2), which are easy to recognize by a single-line scanner, but expand to valid C syntax.
- Using a callback during runtime to specify which part of a union to write out, providing a (void *) pointer to the current object in question.
- Keeping all data structures, which are to be converted, together in one file.
By doing so, we can modify data structures without worrying about any conversion issues. It also makes it feasible to ensure proper alignment in the data types, because it's possible to check all elements of a structure for their position within it. One needs to verify that the C compiler does not introduce automatic padding that is invisible to the munger; this can be accomplished using some compiler-specific #pragma instructions.
Figure 2: Data Structure Definitions (Example)
The example above resembles legal C code, which can be run through any compiler. Note that in the structure definitions, each ENTRY line consists of four elements. The last two elements are for the compiler's usage and are consumed by the #define ENTRY macro. The munger, on the other side, scans the file on a line-by-line basis and detects the three keywords BEGIN_STRUCT, ENTRY and END after stripping white space. It then looks at the first two elements of the ENTRY line. The first number gives the element count, whereas the second field declares the data type. The second field cannot hold arbitrary C types; it can only hold types known to the munger, such as UBYTE, UWORD, ULONG, float, enum, and POINTER. You can also specify complex types which are already known to the munger, as is done with the HAL_VECTOR type in the example above.
Using a couple more macros, subunions and substructures can be specified. The munger also gathers a named list of all elements and types involved, so it's possible to refer to structure elements by name during runtime. This is useful when it is necessary to decide which part of a union needs to be written out. Using unions, one chunk of memory can be interpreted in different ways, which affects the byte swapping during write-out. Therefore, each object that uses unions must encode which part of the union is being used. A callback, which provides the name of the structure/union in question and a pointer to it, allows the munger to examine the object and return the correct union member to be written to disk (Figure 3).
Figure 3: Decision Making Callback (Example)
How do you save memory with that?
Why, you may ask, is it necessary to have such a long-winded discussion about how to write data structures, when all we are trying to do is reduce their size? The answer is straightforward. The primary goal is to reduce the size of the data structures; given that, it's obvious that the data structure definitions themselves are under constant change during the development process, and automatically dealing with those changes helps a lot. However, what if you change a structure so much that it requires huge code changes to the munger, which, as we recall, is based on the actual game code? For instance, if you change all the world points from vectors of floats to vectors of 16 bit integers, you will end up with an unbelievable amount of compiler errors. Even if it compiles, it's very doubtful that the executable will still work.
Therefore, we introduced two sets of data structures. The original set (set A) is left as it was in the original PC game and in the munger. We also introduced a new set of data structures (set B), which is, in the beginning, identical to set A. Set B is recognized by the munger as the set to be converted to big-endian, and the definitions of its data structures are read during runtime as described above. The data structures in set B are likely to be smaller than the original ones, using the different techniques described later. A transfer function is also needed, which copies data from set A to set B. During this copy step, a number of things can be done to the data to remove redundancy and compress it. The target machine (the N64 in our case) only knows about the data in set B. Therefore, it's wise to share the include file which contains all the data structure definitions between the munger and the target.
Optimizing the B set of data structures follows a couple of principles, which are briefly outlined here:
- Use shorter types of the same kind (i.e. use WORD instead of LONG, use BYTE instead of SHORT)
- Combine multiple bit fields and enumerations into one
- Use short integers instead of floating point values
- Remove and optimize padding within the structure
- Remove unused components
Of course, many of these methods are dangerous because values can exceed their new data type's range. In addition, by using short integers for floating point values, we could lose needed precision. However, if we use a transfer function to copy over the data, it's possible to safeguard against all those pitfalls (in fact it's insane not to do so). When copying a long into a short, simply add an if statement and bail out if the long is < -32768 || > 32767. The same technique can be used for all the other problems.
Besides changing the data structures themselves, there are a number of other techniques that can free up considerable amounts of memory.
- Find unique elements in huge arrays and use indices to access them. For example, instead of storing hundreds of floating point normals which all point in the same direction, store just one vector and use a short index to look up the correct normal. This approach works for a lot of data, such as normals, vertices and colors.
- Remove inherently redundant data. For instance, in a portal engine there are always two portals, where one is the exact mirror of the other. In Indy, it was possible to cut exactly half of the portal structures by introducing an "is-mirror" bit; the code needed to be changed to invert the normal of the portal. This gained a lot of memory, since the portal structure was considerable in size.
- An engine sometimes abuses data structures: complex types are used to represent simple ones. The best example in Indy was the use of a "thing" structure to represent points in the 3D world. The waste was about 500 bytes for each instance, which summed up to almost 100K in a typical level. The reason for this design decision on the PC was that it simply didn't matter there, and the level designers could use the "thing" placement tools in the level editor to specify positions. Fixing it took some investigation and code changes in the game, but it was well worth the effort.
- Use bit fields instead of arrays of booleans. Sometimes booleans are defined as 32 bit values to circumvent some typical C pitfalls, but even when they are already unsigned char, you can reduce the size to 1/8. This was used in Indy to reduce the size of the potential visibility information (PVS).
- Have a look at the binary representation of data objects. Some simple compression scheme can be applied, especially when the data is not frequently used. It may be beneficial to use a small cache to hide the decompression costs (see below).
- Get rid of unused and/or unnecessary precision where it is feasible. Indy's animation system on the PC used three floats of translational and three floats of rotational data per key frame. It also stored another three floats of translational delta information, three additional floats of rotational delta information, plus one float for the key frame's time index (frame #). By using absolute rotational information, the six floats of delta information could be stripped, and instead of floats for the remaining values, we used signed 16 bit values (in fact only 13 bits were used; the remaining 3*3 bits encode the frame number). The size of one key frame was thus reduced from 7*4 = 28 bytes to 3*2 = 6 bytes.
However, one has to be careful when stripping precision. In Indy, we also converted world vertices and world normals from floats to signed 16 bit values. What happened is that triangle fans became bumpy and the collision system failed, because it assumed convex polygons. Moving vertices (i.e. snapping them to the integer grid) destroyed that convexity, so the collision system had to be changed to deal with concave polygons. Needless to say, we did this in Indy.
- Find matching rows in huge multi-dimensional arrays, and replace duplicate rows with indices to a single shared copy. Note that this only works with read-only data, since a modification of one physical element could affect many logical ones (Figure 4). Using this method to compress the animation description tables (which animation to play for which move at what speed, etc.) gained almost 100K in Indy when there was a lot of animation in a level.
- Combine multiple elements of a structure in a union if they are mutually exclusive. This technique is more of a 'last resort', because one can't quickly determine whether some elements of a structure really are mutually exclusive. In addition, it requires a lot of code changes to access the unions correctly.
- Split off huge substructures if they are rarely used. A good example of this are the 'surface' structures in Indy, which contained a fairly large amount of information regarding surface properties, such as animation and lighting, that was only used for 'special fx' surfaces. By their very nature, fewer than one percent of the surfaces in a typical level were of that kind. Remove the "special fx" information, store it in a new data type such as tSurfaceFX, and assign a handle (it could be just an eight bit value). During runtime, allocate a new tSurfaceFX as soon as the engine animates a surface.
Figure 4: Array Compression (Example)
Resource Caching on the Target
Caching objects is based on the observation that at any point of gameplay, you do not need all resources at once. There are many different kinds of resources that you should consider caching. Many resources seem hard to deal with or not worth caching, since you assume a cache would not be beneficial. When we were creating Indy, the following lessons were learned:
- It's hard to predict how a cache will perform. Generally, it tends to work better than expected, which is of course a very nice thing.
- When writing a cache, make it general. Don't create a cache that deals only with world textures; create one that deals with all textures.
When it comes down to selecting which resources to cache, you may want to consider everything depending on how desperate you are. To give you an impression of what is feasible, here is a list of things that have been successfully cached in Indy on the N64.
Textures

This proved very beneficial because you seldom see all textures at once in a typical indoor environment. Textures are often compressed using an LZ method.
Display Lists for World/Object Geometry
Display lists are a huge memory issue on the N64. They contain binary information used by the rendering hardware to draw primitives, and are considerably larger than the geometry itself. For that reason, world geometry (sliced up in sectors) and objects are converted from their geometry representation to display lists when they become visible. They are purged if they are not rendered for a specific amount of time.
Visibility Information (PVS)
To speed up rendering, a PVS table is used to ease the visibility computation. It contains one bit field per world sector, which in turn is of size "# world sectors"; a set bit means there is a line of sight between that pair of sectors. The renderer only needs the visibility information for the sector the camera is currently in. That means the PVS set changes rarely (i.e. if the player is running fast, about every five seconds). Even when it changes, it's likely that the camera goes back to a previously visited sector (i.e. the player is running in circles). Therefore, a cache with just ten entries helps a great deal. Since adding a new PVS set to the cache is such a rare event, the data is downloaded from ROM and decompressed using an RLE method.
Animation Tracks

In a third-person character-based game, there are many animation tracks that can be applied to the characters. For each actor, there are over 700 different animations. Of course, some of them are used only once in the complete game (i.e. for special cutscenes). It would be a crime to hold them all in memory at once. Luckily, the animation system used in the original PC game supplied functions to start and stop a track. These two entry points made it easy to anchor the cache's lock and unlock functions.
Scripts

The game's scripting engine used a byte code that was interpreted to execute scripted actions. Those scripts were specific to a certain situation and level, so caching them was a straightforward idea. In addition, many scripts used a basic execution scheme: after initializing themselves, they waited for some specific condition to occur. During this time, no script code is executed, so it doesn't need to hang around in memory.
Collision Data for World/Object Geometry
By their very nature, collisions occur locally. The data involved consists of normals and vertices. If an object is not moving, the engine does not perform any collision checks for it, so its normals and vertices are not needed. Remember that on the N64 architecture, a copy of this data already rests in the display lists for rendering. Unfortunately, the display lists cannot be used for fetching vertices and normals, because their binary format is extremely complicated to access.
Object Templates

During gameplay, many objects are created dynamically. Each object is created from a template that describes the new object's properties. The templates are therefore not constantly needed; a cache of just eight entries proved sufficient.
Benefits of caching
Caching is used to reduce memory usage for a given set of objects. This is achieved by only keeping a fixed number of objects in memory. When you need room for a new one, you figure out which one hasn't been used for a while and remove it, thereby creating room for the new object. Of course, thrashing will occur when you make the cache very small. Thrashing is the term for a cache line that is constantly refilled with different objects; it is bad for overall performance because you are constantly downloading from mass storage.
However, there is a different view on this as well. Without caching, you simply need a specific amount of memory; there is nothing to fiddle around with. When you cache a resource, there is the potential to trade off performance against memory usage. Maybe you're running so tight that you are happy to give away some performance in exchange for some much-needed memory.
With caching, memory usage during runtime can be made fixed. When you create dynamic objects using a cache, you can impose an upper bound (of course, it must be OK to throw away old objects and recreate them later). Having fixed memory footprints on consoles is a very good idea. However, you must provide 'panic' modes for some caches. For example, when you use a cache for render geometry and the cache is full, preventing you from purging an element (since all of them are visible), the renderer copes with this by simply not rendering the object. Otherwise, you must fill the screen wisely.
Another benefit of caches is that they can hide expensive access costs (in fact, this is what the caches in CPUs are all about). When your low-entropy source data compresses nicely, but the unpacking consumes quite a bit of time, it's wise to put the unpacked objects into a cache; the next time they are requested, one just returns a pointer to the object. Normally, a small cache is sufficient for that kind of use.
There are basically two different approaches to integrating a cache with its client. One could use a Lock(); and Unlock(); protocol. Whenever the engine decides to use an object for a prolonged amount of time, it must Lock(); it with the cache first. The cache, in turn, does whatever it takes to provide that resource (i.e. downloading and/or uncompressing). The object then resides in the cache until the client decides that it's no longer needed and issues an Unlock(); call. The cache recognizes that and can reuse the cache line for other objects. Of course, it is wise to keep the information in the cache even if it's no longer locked; it may be requested again soon, as long as the cache is not running out of space. The drawback of this scheme is that it's possible for the cache to run out of space because all the cache lines are filled with locked objects. Since the Lock(); call returns a pointer to the client, and the client may use that pointer until the object is unlocked, there is no way to free such a cache line. Therefore, the client must be able to deal with the "nope" condition.
The second approach for interfacing the cache with its client is to issue a Touch(); call to the cache whenever an object is about to be used. The idea is that the object pointer returned by the Touch(); call is only valid until the next Touch(); call. Considering the full condition, the worst thing that can happen is the cache being constantly refilled and thrashed. That would cause poor performance, but at least there is no special condition to worry about. This is definitely the way to go if the cache is sufficiently large and it can be guaranteed that thrashing will not occur. However, whenever an object is used, a Touch(); call must be made. Consider the Lock(); a