The task of porting Indiana Jones and the Infernal Machine from PC to N64 proved to be quite challenging. The Nintendo 64 could handle the amount of polygons quite well, but memory use was a huge problem because the developers fo the PC original hadn't needed to worry about it at all. While a typical PC memory footprint might be 128 MB, the N64 needed to run the game with only 4MB -- and that must also hold the code, frame buffers, textures and display lists. You can take advantage of the N64 Expansion Pak to provide higher resolution and better performance, but only if it is present.
Looking at these numbers alone, it seems that this kind of challenge is doomed to fail from the start. However, when the project started, the original PC game was much smaller in terms of texture and model usage. As we all know, projects tend to grow during the last phase of production because everybody wants to make the game as good as it can be. Certainly, an adventure game like Indy provided enough opportunities for growth.
out, we were looking for every thousand bytes we could free up in the
end. A number of different methods were used to make it fit. This article
presents a summary of the most important methods necessary for dealing
with memory constraints, most of which are applicable and beneficial to
other projects. Reducing memory usage also helps to increase performance
because data is more likely to be local, and therefore more likely to
be found in one of the CPU's data caches.
One could argue that developers today no longer need to concentrate on memory constraint issues because of new systems like the Sony Playstation 2 and Nintendo Gamecube. However, while their rendering capabilities have steadily improved, their memory subsystems have not changed very much. This suggests that memory organization and efficient memory usage are now an even more pressing issue.
The approach used in porting Indy to the N64 shares many similarities with a data path that is used in producing an original title from scratch. Normally, the art and level data is exported into the game engine-specific data format using a combination of exporting tools. Those in turn are then bound together, creating a binary resource which is loaded on the target. To reduce the size of the binary resource, we were left with two options: rewriting the export tool setting it to output smaller and more optimized structures, or using the existing PC game startup and initialization part to load the binary resource.
We chose the second approach, because it offered the benefit of a quick startup given that all the code loading and initializing data was already in place. All we needed to do was to take the game and strip the rendering loop out of it. Starting here, we could take all the game's data structures, rewrite and optimize them in an easy fashion (Figure 1). We familiarized ourselves with the game's internal structure in the process. We called this abused executable the munger. We extended it to optimize and rewrite all data structures. It also provided for more consistency checks, which were required to ensure that our changes to the data structures didn't cripple the data by casting 32 bit values to eight bits.
Figure 1: Data-Path Setup
This approach has many similarities between an original data path because it is used for an original title, and therefore the methods used and lessons learned by porting Indy apply to original titles. It is also important to take data management and storage requirements into account, since the amount of art data (e.g. geometry, textures, animation data, etc.) is increasing with the release of the next generation systems
The methods used are categorized into three classes. The rest of the article discusses the methods in more detail and offers some practical insight, which may be of some benefit to those of you experiencing the similar memory problems.
- Rewriting & Optimizing Data Structures: decreasing the size of data structures sounds like a tedious and erroneous process; done well this can be an easy way to reduce the size of the data.
- Resource Caching on the Target: although a large amount of data can be cached (this works surprisingly well), different kinds of data require different caching approaches. The caching schemes used are described and classified.
- Using a Virtual Memory Kernel to reduce code size: one of the benefits of console programming is that you have total control over the target hardware. Implementing a virtual memory kernel sounds over the top at first, however, it not only provides precious memory, but also helps in developing and debugging. In general, a virtual memory kernel presents a means to trap illegal memory access.
- Defining a simple set of C preprocessor macros (Figure 2), which are easy to recognize by a single-line scanner, but expand to valid C syntax.
- Using a callback during runtime to specify which part of a union to write out, providing a (void *) pointer to the current object in question.
- Keeping all data structures, which are to be converted, together in one file.
Method A: Rewriting & Optimizing Data Structures
One basic problem at the beginning of the project was that the PC data was all in little-endian, whereas the N64 (and GCN) uses big-endian. Although it's possible to change the MIPS4300i CPU's endianess, it would require a new operating system, which wasn't feasible at the time. Since the data had to be swapped according to the data's word size (i.e. you swap a 32 bit value in a different way than you swap a 16 bit value and you don't swap a byte at all), the munger needed to know the layout of the data structures. This was necessary because it's easy to write an array. Floating-point vectors are a great example. You know that a vector consists of three consecutive floating-point values, which are 32 bits each. From that you can derive a tiny piece of code, which would swap "# of vectors * 3" long words. However, as soon as a vector is realized using signed 16 bit values, one would need to change all of the tiny pieces of code. Based on experience, approaches like that don't work.
This problem is even more challenging if you need to write complex structures, which contain substructures and subunions. You would have to write a tiny piece of code for each type-definition in the game, which would be almost impossible to maintain because a discrepancy between the type-definition and the tiny piece of code would cause the data to be written out incorrectly. Looking at a full-scale game, we quickly realized that this approach was not possible; there were simply too many data types involved, and many of them were predicted to be under constant change because our goal was to reduce their size. Therefore, we needed a mechanism, which could automatically determine how to convert the endianess of arbitrary data structures, regardless of whether they contain unions and substructures.
The solution sounds a bit scary a first, but it proved to be quite useful. During startup of the munger, it parses a part of it's own source code to determine the memory layout of all the data structures that are about to be converted from big to little-endian. With Indy, this would have been too difficult if we allowed for the full C syntax for data structures, including files and macro expansion. There was, however, an easy solution which allowed for keeping the parsing really simple while creating all C structures:
By doing so, we can modify data structures without worrying about any conversion issues. It is also a feasible method to ensure proper alignment in the data types because it's possible to check all elements of a structure for their position within it. One would need to verify that the C compiler does not introduce automatic padding that is invisible to the munger. This can be accomplished using some compiler specific #pragma instructions.
Figure 2: Data Structure Definitions (Example)
The example above resembles legal C code, which can run through any compiler. Please note that in the structure definitions, each ENTRY line consists of four elements. The last two elements are for the compiler's usage and are replaced by the #define ENTRY macro. The Munger on the opposite side scans the file on a line-by-line basis and detects the three keywords BEGIN_STRUCT, ENTRY and END after stripping white space. It then looks to the first two elements of the ENTRY line. The first number gives the amount information whereas the second field declares the data type. The second field cannot hold all C-types. It can only hold C-types known to the Munger, such as UBYTE, UWORD, ULONG, float, enum, and POINTER. You can also specify complex types, which are already known to the Munger, as it is done with the HAL_VECTOR type in the example above.
Using a couple more macros, subunions and substructures can be specified. The Munger also gathers a named list of all elements and types involved, so it's possible to name structure- elements during runtime. This is useful when it is necessary to decide which part of a union needs to be written out. Using unions, we can interpret one chunk of memory in different ways, which will affect the byte swapping during write out. Nevertheless, each object that is using unions must encode part of that union which is being used. Using a callback, which provides you with the name of the structure/union in question and a pointer to it, allows the Munger to examine an object and return the correct union name to be written to disk (Figure 3).
Figure 3: Decision Making Callback (Example)
you save memory with that?
Why, you may ask, is it necessary to have such a long-winded discussion about how to write data structures, when all we are trying do is reduce their size? The answer is straightforward. Your primary goal is to reduce the size of the data structures. Given that, it's obvious that the data structure definitions themselves are under constant change during the development process. Automatically dealing with those changes helps a lot. However, what if you change a structure so much that it requires huge code changes to the Munger, which as we recall, is based on the actual game code? For instance, if you change all the world points to be an integer of 16 bit vectors instead of floats, you will end up with an unbelievable amount of compiler errors. Even if it's compiling, it's very doubtful that the executable will still work.
Therefore, we introduce two sets of data structures. Leave the original set (set A) as it was in the original PC game and in the Munger. We also introduce a new set of data structures, which is, in the beginning, identical to set A. This set (set B) is recognized by the munger as the set that is to be converted to big endian and the definitions of the data structures are read during runtime as described above. The data structures in set B are likely to be smaller than the original ones by using different techniques, which is described later. A transfer function is also needed, which copies data from set A to set B. During this copy step, a number of things can be done to the data to remove redundancy and compress it. The target machine (the N64 in our case) only knows about the data in set B. Therefore, it's wise to share the include file, which contains all the data structure definitions between the munger and the target.
Optimizing the B set of data structures uses a couple of principles, which is briefly outlined here:
- Use shorter types of the same kind (i.e. use WORD instead of LONG, use BYTE instead of SHORT)
- Combine multiple bit fields, enumerations into one
- Use short integers instead of floating point values
- Remove and optimize padding within the structure
- Remove unused components
Of course, many of these methods are dangerous because the values can exceed their data type's capabilities. In addition, by using short integers for floating point values, we could loose needed precision. However, if we use a transfer function to copy over the data, it's possible to safeguard for all those pitfalls (in fact it's insane not to do so). While copying over a long to a short, simply add an if statement and bail out if the long is < -32768 || > 32767. The same technique can be used for all other problems.
Besides changing data structures themselves, there are a number of other techniques that can produce considerable amounts of memory.
- Find unique elements in huge arrays and use indices to access them. For example, instead of storing hundreds of floating point vectors, which are all pointing in the same direction, just store one vector, and use a short index that looks up the correct normal. This approach is feasible for a lot of data such as normals, vertices and colors.
- Remove inherit redundant data. For instance, in a portal engine, there are always two portals, where one is the exact mirror of the other. In Indy, it was possible to cut exactly half of the portal structures by introducing an "is-mirror" bit. The code needed to be changed to invert the normal of the portal. This gained a lot of memory since the portal structure was considerable in size.
- An engine sometimes abuses data structures. Complex types are used to represent simple ones. The best example in Indy was the use of a "thing" structure to represent points in the 3D world. The waste was about 500 bytes for each instance, which summed up to almost 100k in a typical level. The reason for this bad design decision on the PC was that it simply didn't matter at all and the level designers could use the "thing" placement tools in the level editor to specify positions. However, it took some investigation and code changes in the game, but it was well worth the effort.
- Use bit fields instead of arrays of booleans. Sometimes booleans are defined as 32 bit values to circumvent some typical C pitfalls. However, even when they are already unsigned char, you can reduce the size to 1/8. This was used in Indy to reduce the size of the potential visibility information (PVS).
- Have a look at the binary representation of data objects. Some simple compression scheme can be applied, especially when the data is not frequently used. It may be beneficial to use a small cache to hide the decompression costs (see below).
- Get rid of unused and/or unnecessary precision where it is feasible. Indy's animation system on the PC used three floats of translational and three floats of rotational data per key frame. It also stored another three floats of translational delta information as it stored three additional floats of rotational delta information plus one float for the key frame's time index (frame #). By using absolute rotational information, the six floats for the delta information can be stripped, and instead of using floats for the remaining six values, we used signed 16 bit values (in fact only 13 bits have been used, the remaining 3*3 bits have been used to encode the frame number). Therefore, the size of one key frame was reduced from 7*4 = 28 bytes to 3*2 = 6 bytes.
However, one has to be careful when stripping precision. In Indy, we also converted world vertices and world normals from float to signed 16 bit values. What happened is that triangle fans became bumpy and the collision system failed because it assumed convex polygons. By moving vertices (i.e. snapping them to the integer grid), this property was lost along with the collision in order to deal with concave polygons. No need to mention that we did this in Indy.
- Find matching rows in huge multi-dimensional arrays, and store indices into references. Note that this only works with read only data, since a modification of one element could affect many logical ones (Figure 4). Using this method to compress the animation description tables (which animation to play for what move with what speed, etc) gained almost 100k in Indy when there were was a lot of animation in a level.
- Combine multiple elements of a structure in a union if they are mutually exclusive. This technique is rather more of a 'last resort' kind because one can't quickly determine if some elements of a structure really are mutually exclusive. In addition, it requires a lot of code changes to access the unions correctly.
huge substructures if they are rarely used. A good example of this are
the 'surface' structures in Indy because they contained a fairly
large amount of information regarding surface properties, as animation
and lighting, which was only used for 'special fx' surfaces. By their
very nature, there were less than one percent of those in a typical
level. Remove the "special fx" information and store it in
a new data type such as tSurfaceFX
and assign a handle (it could be just an eight bit value). During runtime,
allocate a new tSurfaceFX
as soon as the engine animates a surface.
Figure 4: Array Compression (Example)
Resource Caching on the Target
Caching objects is based on the observations made during any point of gameplay; you do not need all resources at once. There are many different kinds of resources that you should consider caching. Many resources seem hard to deal with or not worth caching, since you assume a cache would not be beneficial. When we were creating Indy, the following lessons were learned:
- It's hard to predict how a cache will perform. Generally, it always tends to work better than expected, which is of course a very nice thing.
- When writing a cache, make it general. Don't create a cache, which deals with the world textures, it's better to create one, which deals with all textures.
When it comes down to selecting which resources to cache, you may want to consider everything depending on how desperate you are. To give you an impression of what is feasible, here is a list of things that have been successfully cached in Indy on the N64.
This proved very beneficial because you seldom see all textures at once in a typical indoor environment. Textures are often compressed using a LZ method.
lists for world/object geometry
Display lists are a huge memory issue on the N64. They contain binary information used by the rendering hardware to draw primitives, and are considerably larger than the geometry itself. For that reason, world geometry (sliced up in sectors) and objects are converted from their geometry representation to display lists when they become visible. They are purged if they are not rendered for a specific amount of time.
Visibility Information (PVS)
To speed up rendering, a pvs table is used to ease the visibility computation. However, it contains a bit field for each world sector, which is of size "# world sectors". A set bit in this bit field means there is a line of sight between a pair of sectors. The renderer only needs the visibility information for the sector the camera is in currently. That means that the pvs set changes rarely (i.e. if the player is running fast, about every five seconds). Even if it is changing, it's likely that the camera goes back to a previously visited sector (i.e. the player is running in circles). Therefore, a cache with just ten entries helps a great deal. Since adding a new pvs set to the cache is such a rare event, the data is downloaded from ROM and decompressed using a RLL method.
In a third person character based game, there are many animation tracks that can be applied to the characters. For each actor, there are over 700 different animations. Of course, some of them are used only once in the complete game (i.e. for special cutscenes). It would be a crime to hold them all in memory at once. Luckily, the animation system used on the original PC game, supplied functions to start and stop a track. Those two points made it easy to anchor cache lock and unlock functions.
The game's scripting engine used a byte code that was interpreted to execute scripted actions. Those scripts were specific to a certain situation and level. Therefore, caching them was a straightforward idea. In addition, many scripts used a basic execution scheme: after initializing themselves, they waited for some specific condition to occur. During this time, none of the script is executed, so the code doesn't need to hang around.
Data for World/Object Geometry
By their very nature collisions occur locally. The data involved consists of normals and vertices. If an object is not moving, the engine does not perform any collision checks, therefore their normals and vertices are not needed. Remember that with the N64 architecture, a copy of those 'rests' in the display lists for rendering. Unfortunately, one cannot access the display lists for fetching vertices and normals because the format is binary, which is extremely complicated to access.
During game play many objects are created dynamically. The objects are created from a template that describes the new object's properties. The templates are therefore not constantly needed with a cache of just eight being sufficient.
Benefits of caching
Caching is used to reduce memory usage for a given set of objects. This is archived by only keeping a fixed amount of objects in memory. When you need room for a new one, you'll figure out which one isn't used for a while and remove it, by doing so you create room for the new object. Of course trashing will occur when you make the cache very small, Trashing is a term used to describe a cache line that is constantly refilled with different objects. This is bad for the overall performance because you are constantly loading and downloading from mass storage.
However, there is a different view on this as well. Without caching, you simply need a specific amount of memory; there is nothing to fiddle around with. When you cache a resource, there is the potential to trade off performance against memory usage. Maybe you're running so tight, that you are happy to give away some performance in exchange for some much needed memory.
With caching, memory usage during runtime can be made fixed. When you create dynamic objects using a cache, you can impose an upper bound (of course, it must be ok to throw away old objects and recreate them later). Having fixed memory footprints on consolesis a very good idea. However, you must provide 'panic' modes for some caches. For example, when you use a cache for render geometry and the cache is full thereby preventing you from purging an element from the cache (since they are all visible), the renderer copes with this by not rendering the object. Otherwise, you must fill the screen wisely.
Another benefit of caches is that they can hide expensive access costs (in fact, this is what caches in CPUs are all about). When your low entropy source data is compressing nicely, but the unpacking is consuming quite a bit of time, it's wise to put the unpacked objects into a cache. Therefore, the next time they are requested, one just returns a pointer to the object. Normally, a small one is sufficient for that kind of cache.
There are basically two different approaches to integrate a cache with its client. One could use a Lock(); and Unlock(); protocol. Whenever the engine decides to use an object for a prolonged amount of time it must use Lock(); with the cache first. The cache in turn, does whatever it takes to provide that resource (i.e. downloading and/or uncompressing). The object then resides in the cache until the client decides that it's no longer needed, and issues an unlock call. The cache recognizes that, and can reuse the cache line for other objects. Of course, it would be wise to keep the information in the cache, even if its no longer locked. The information may be requested again soon and as long, the cache is not running out of space. The drawback to using this scheme is that it's possible for the cache to run out of space, because all the cache lines are filled with objects. Since the Lock(); call returns a pointer to the client, and the client can use it as long as the pointer hasn't been unlocked, there is no way to free a cache line. Therefore, the client must be able to deal with the "nope" condition.
The second approach for interfacing the cache with its client is to use a Touch(); call to the cache whenever an object is about to be used. The idea is, that the objects pointer, returned by the Touch(); call, is only valid until the next Touch(); call. Considering the full condition, the worst thing that can happen is having the cache being constantly refilled and trashed. That would cause poor performance, but at least there is no special condition to worry about. This is definitely the way to go if the cache is sufficiently large enough, and it can be guaranteed that trashing will not occur,. However, whenever an object is used, there must a Touch(); call made. Consider the Lock(); and Unlock(); interface to reduce the amount of overhead if this happens frequently.
The method of choice is based on a number of different circumstances including, cost of providing resource (download time, uncompressing time), existing software framework, and the duration frequency of cached objects being used.
The example header file below (c.f. fig. 5) shows the structural definitions for Indy's vertex cache. This cache contains vertex information for collision purposes with the world geometry. It's invoked whenever collision takes place in a specific sector of the world. More precisely speaking, it holds an array of unsigned 16 bit values, which are indices into the (one, big) array of world vertices. The indices are not in memory all the time, since they are only used for collision purposes.
The definition of the cache consists of three functions and a type definition. The structure holds an allocation bit for each of the NUM_LINES cache lines (here: 128). Note the use of the preprocessor macros. It would have been possible to change the size of the cache dynamically during runtime, however, there is no real need for that considering the size definition using a macro makes the code quicker to write. Also, the cache contains a mapping from a cache line to a sector. This can be used to quickly find what sector is cached in a specific cache line. In contrast, there is also a mapping from a given sector to a cache line. This is used when the cache is asked to provide information regarding a specific sector. That means that the object is not cached if the mapping is uninitialized (i.e. 0xff). If there is a cache line assigned, the position of the data is looked up using the CacheLine2Base mapping with the cache's base address *Data. Since the amount of vertex indices can vary from sector to sector, the amount is encoded in CacheLine2Size as well. The data chunk of the cache is managed using a freelist service, which is encapsulated in the tFreeList member. It provides basic allocation and deallocation for the *Data area. The cache keeps track of the oldest cache line using another service. This tLRU (last used) service provides means to maintain a list ordered by access age. Every time the cache is accessed using the GetSectorPoints(); function, the cache's LRU is updated. Whenever it's time to purge an entry, the LRU service provides the oldest cache line. It's a good idea to implement separate services such as tFreeList and tLRU, since they are very likely to be reused for different caches.
Figure 5: Example Cache
The cache also contains a counter to keep track of the amount of bytes downloaded from ROM for debugging and profiling reasons. One may also notice that even though the number of cache lines is set in stone during compile time, the actual size of the cache (i.e. the amount of indices it can hold with that fixed number of cache lines) is dynamically adjustable using the Initialize(); call. The amount of memory allocated for *Data is changed accordingly.
One can observe a number of characteristics for specific caches having specific requirements, through the implementation of many different caches. Some characteristics have a huge impact on the design of a cache.
the cache deal with the "full" condition?
There are different approaches to the "full" condition. Of course, the cache should be appropriately sized so that this nasty condition occurs infrequently. You can avoid this by performing the following steps:
- Discard a random cache line. Of course, this is not the most optimal way of dealing with the situation; however, it's easy to implement. This is exactly what the MIPS CPUs memory management unit does, when it has to fetch a new page descriptor (see below).
- Discard the oldest cache line. This is only useful when there is an unlocked cache line (which is the oldest). Using the Lock(); and Unlock(); protocol is not always the best method because all the cache lines could be currently locked.
- It is sometimes impossible to discard any cache line (return NULL). This can happen in a texture cache, where all cache lines contain locked textures because the engine is claiming that they are all visible. In Indy, the renderer drew a filled rectangle using the sector's ambient color.
is the cache's access frequency?
Caches vary in their access frequency. Sometimes, an object is requested only once during level startup (i.e. an initializing scripting object) and then never touched again. Sometimes an object is used frequently during a prolonged timeframe in a specific level, but is not used anywhere else in the game. Finally, some objects (such as the main characters animation data) are used throughout the whole game on a constant basis.
There is also the access frequency during a single rendering loop. Is the object requested from the cache for each frame, or is it requested (locked) only once and then used until it is freed (unlocked) again? Caches, that have a low access frequency, qualify for compressed source data because uncompressing doesn't occur very often (of course, the data must be suitable to compression as well).
the cache deal with fragmentation?
Note that memory fragmentation and garbage collection is only an issue for caches that support differently sized data objects. Fragmentation is not an issue because it does not occur if all objects are the same size. It's always worth trying to avoid fragmentation, although it's not always possible.
If one has to deal with fragmentation, then it's likely that one is interested in avoiding holes in the allocated cache memory. Those holes can lead to the full condition even if the cache is half empty because there is not a single block of memory that is big enough to hold a requested object. In Indy, garbage collection was required for the texture cache. Many small textures shared the cache with a couple of large ones. The small textures (if not collected) would spread out evenly in the cache, effectively cluttering up the whole cache, so that no big texture could be downloaded.
Luckily, a texture cache is especially suited for garbage collecting because the only client keeping a pointer to a texture is the renderer. Therefore, you only have a single client informing you about moved textures. The movement is necessary to fill the holes, which pop up during the use of the cache. The approach is quite simple. Search for the cache, beginning at the end, for a texture that fits into a hole. By doing so, space at the end of the cache is freed up whereas the holes at the start of the cache are filled. To do this efficiently, search for one match per each or (as it was done in Indy) have a second thread using otherwise unused CPU time to browse through the cache and clean it up. Using a second thread enforces the usage of semaphores to ensure proper synchronization between the threads, but this is off topic and belongs to the lovely field of parallel programming.
determine the oldest object in a cache?
This problem is tightly bound to the basic access protocol used with a cache because the only input available to solve this problem is the time of a Lock();or Touch(); call, as long as no aging information is written to the cached object during usage.
- Find the oldest object using a LRU algorithm (recently used). Such algorithms build a sorted list of cached objects. Whenever an object is used, it is moved to the head of the list. When an object is used the first time it is added to the list, otherwise it's just moved from its previous position. Notice, that this can produce incorrect results for the Lock(); and Unlock(); protocol, since you take the time of locking as basis and not the last time of usage. This problem does not occur when you use the Touch(); protocol, because you have a correct basis.
- Another option is to use age stamps. Whenever you use an object, you update an age stamp embedded into an object. The current frame number is a good value. When you need to purge a cache line, search through all cached objects (which is only recommended, when you have a small cache) to determine the oldest one.
Note that the stamping can be done either during Touch(); calls or even with the Lock(); and Unlock(); scheme. In that case, you will need to write the age stamp to a designated field of the cached object every time it is used.
Tips for Running Caches
Once the cache is running, you should take some time and profile it. Get a feeling for the amount of data that goes through, and determine why it is requested. Find out about access patterns. This information can be used to resize the cache (yes, it may be too large). In almost all caches, there are two resources that determine the "full" condition. The cache can run out of storage space. In this case, the allocated memory for it needs to be increased. On the other hand, the cache can run out of free cache lines. It won't make sense in this case to increase the storage size, if the cache doesn't have the means to hold the information to manage the extra space. In addition, if you figure out that your cache's storage space is too large and you scale it down, you also may end up in using less cache lines. When profiling caches, be aware of real game play situations. The stress put on a texture cache is very different when you use a fast flying debug camera instead of the real game play cameras.
For complex caches use many, many asserts(); to help you check the consistency of your new cache. Sometimes some erroneous programming can make you believe everything is fine until a point late in the game; like when your cache starts to return pointers to data, which do not belong to the correct object.
Lastly, implement separate LRU and FREELIST services because you will need them all over again. One could argue, that an abstracted caching module would be of some benefit, but at least in Indy, each cache had it's own characteristics and tiny special cases, which prevented that approach.
Using a Virtual Memory Kernel
The final step we made in Indy was implementing a virtual memory kernel for code and read only data. This decision was made because the code for a release build came very close to the one-megabyte mark. Compared to the total of four megabytes the unexpanded N64 has to offer, this was simply too much. Strictly speaking, a virtual memory kernel represents just another method of caching. Of course, one exploits the possibilities of the processor due to the usage of page mapping tables, and the memory management unit. Another good thing about using a kernel, is that there is no need to split up the code into several overlay sections, whereas it was common with the first generation of N64 games. Overlaying is a dangerous technique to reduce code size. The linker puts mutually exclusive modules into the same physical address space. It's the applications responsibility to download the correct code at the correct time, and to not call any code, which is physically not present.
Of course, you will make mistakes with that scheme (simply since there is the potential for making mistakes), and it does not give very good results. However, because a virtual memory kernel will only hold those code pages in memory, which are actually being used, you'll get the benefit of overlaying code for free. Many modules also split up into two sections: one for initialization, and one for updating objects. After engine startup, a virtual memory kernel discards the initialization code automagically, based on the selected virtual page size.
Figure 6: Components of a virtual memory kernel
A kernel can also provide and aid you in debugging your program. It's possible to detect write accesses to code and read only data. In addition, the use of a unique virtual address space distinguishes data from addresses even more. Memory mapped resources can also be implemented, however, it was not done in Indy.
A virtual memory kernel consists of two mayor components (c.f. fig. 6). A page table describes which virtual page is mapped to what physical page in memory, if it is mapped at all. There is an entry in the page table for all known virtual pages. A page is usually a block of 4096 bytes. The CPU has a 32 entry TLB (translation look aside cache) on chip. This TLB maps up to 32 virtual pages to their physical counterparts. You have to setup one global (i.e. complete address space) mapping to cover valid read/write accesses to memory. However, the remaining 31 entries can be used to map 4K pages of code. Of course, those 31 mappings are not enough. Therefore, the CPU issues a refill IRQ whenever a mapping is unknown to the TLB cache. A very short assembly routine then fetches the correct mapping information from the external page table and loads it into a random TLB entry. This random strategy is not the best one, but it's also not the worst and it is easy to implement. In fact, the MIPS CPU even provides an assembly command LoadToRandomTLBEntry. However, when the CPU finds an unmapped (i.e. marked as invalid) virtual page it issues an invalid page IRQ. At this point, the kernel software needs to determine which page to purge from physical memory (using a very, very fast LRU algorithm of course), invalidate that page's valid bits, download the new code page from ROM, and mark the requested page as being valid after updating the TLB entry responsible for that page.
This whole process may sound scary, but in fact, it's not too hard to implement. The kernel software is overall about ~1000 lines of MIPS assembly instructions. One needs to dig deep into the processor manual though, but it's well worth the effort. Of course, we tried to run Indy with just 64k of code, which worked fine, but very slow. In the end, Indy used ~400k of physical code memory, freeing up about 550k.
To summarize this article:
- Even though some methods gain considerable amount of memory, no single method alone gains what you need.
- The combination of methods makes the difference. Of course, one wants to deal with the big chunks first.
- Organizing your data with space considerations in mind right from the beginning is better than performing emergency programming in the end.
- Caching works better than expected, especially when you cache resources at central points (i.e. dealing with textures in one place only).
- Caching also helps to keep data local, and therefore increases performance. The less data the CPU has to deal with, the better.
- Don't be afraid to do low level work. At least, on consoles you can still do so!
- Even with the next generation of consoles, the issues presented here are still valid (if not even more important), and with Gamecube's ARAM there is something to download from…