Monitoring Your Console's Memory Usage, Part One
When developing games for Xbox and PS2, satisfying the console's memory requirements is one of the most challenging tasks. Not having found an off-the-shelf tool that was sufficient, Jelle van der Beek set about creating his own.
When developing games for consoles, satisfying the console's memory requirements is one of the most challenging tasks. It's a recurring problem: you put in a lot of effort to get your memory requirements right, then a week later you must start all over because of the changes in the game's content. Having a tool that provides you the correct information quickly would be invaluable. Not having found an off-the-shelf tool that meets my needs, I set about creating my own. This article, the first of two parts, describes the cross-platform tool I created to support our Xbox and PS2 development.
The solution described here is not about monitoring memory performance such as cache misses or page misses--it instead focuses on three main aspects of memory usage:
The amount of memory your application uses (and by what code).
Displaying the memory layout, to visualize memory fragmentation.
Discovering memory leaks in the application, and what caused them.
I do not know of any third-party tools for consoles that monitor these very basic issues, and I find this rather odd, because memory and performance issues are frequently at the top of my to-do list. Microsoft has taken a nice first step with the Xbox development tools. XbMemdump is able to display the layout of physical pages, but it is very basic: just a command-line tool that outputs ASCII characters. Metrowerks has a tool that covers these memory issues, CodeTEST, but it is not available for game consoles [REF7]. Another tool that covers memory issues is Boundschecker [REF8]. It finds memory leaks and, as of version 7.1, it also has a memory and resource viewer. Again, this product is not available for consoles.
I will show you how we built a tool, called MemAnalyze, which monitors all three of the above memory issues for Xbox and PS2. (Supporting the GameCube is not covered, simply because Playlogic does not develop for the platform.) After reading this article, you should have enough ideas on how to extend the tool to other platforms.
This article provides an overview of the tool and how to make a memory snapshot of the game. In part two of the article (which will be published on Gamasutra this Friday), I'll show you how to interpret the data.
Overview
Theory
We will run the game, and at the press of a button, have the game output a file that holds the current memory status--the addresses and sizes of all the blocks currently allocated in memory. The file will be stored on the console's hard disk, if present. Otherwise, it will be stored on the PC's hard disk, which means we can only output the file if we are running the game remotely from a PC. This article does not cover any alternative ways to output the file—it simply describes how to collect the right contents for the file. Saving it to an appropriate location is up to you.
Besides the allocation's block information, we will also provide a callstack for each allocated block, using real-time callstack tracing. Real-time callstack tracing should be possible on every platform. Why? Every function eventually has to return to its caller, so the return address must be stored somewhere. We just need to figure out how to retrieve it on each platform.
If you have written your own allocation or heap manager, gathering the correct information for the memory dump will be an easier task. You probably have most of the information at hand. For the tool, we need the following data:
The address of each allocated block
The allocation size
The callstack per allocation, which is an array of function addresses (not quite, but we'll get to that later).
Our tool will read the memory dump offline, on a PC, and read symbol information from a map file or a program database. The symbol information is then used to convert the function addresses to function names. The tool will implement several views of this data.
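As a rough sketch of that conversion step, the tool can keep its symbols sorted by start address and binary-search for the last symbol that starts at or below a given address. The Symbol struct and the find_symbol function are illustrative names chosen for this example, not MemAnalyze's actual code, and the sketch assumes that the symbol table has already been parsed and sorted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol entry, e.g. as parsed from a map file. */
typedef struct {
    uint32_t    start;  /* start address of the function */
    const char *name;   /* function name */
} Symbol;

/* Returns the symbol with the greatest start address <= addr, or NULL if
   addr lies below the first symbol. `symbols` must be sorted by start. */
static const Symbol *find_symbol(const Symbol *symbols, size_t count,
                                 uint32_t addr)
{
    size_t lo = 0, hi = count;          /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (symbols[mid].start <= addr)
            lo = mid + 1;               /* candidate; look further right */
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &symbols[lo - 1];
}
```

Because the lookup accepts any address inside a function, it also works for addresses that do not point at the function's first instruction.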
Platform independence
Where do we draw the line between platform dependence and platform independence? That is mostly up to you. The platform-specific information includes:
Heap information, as stored on the console in hardware.
Symbol information, in the form of a map file or program database.
Image location information, if needed.
You could let the game walk the heap, process it and dump it to file in a platform-independent data structure. You could also let the game itself parse its own symbols and immediately replace the addresses in the memory dump with function names. This scheme would output a single platform-independent file, and you could make MemAnalyze completely platform-independent. While this sounds great, there are a few disadvantages to this approach:
Adding a function name to each dumped callstack function creates a lot of overhead because the function names will be duplicated numerous times.
If the function addresses are replaced by names, we have to convert the names to a unique value such as a CRC32 in order to process (compare, collapse) the data.
We are limited to the console's libraries for parsing symbol information. For some platforms, this might turn out to be a problem. If we want to parse our symbols from a program database, there is a good chance we will need to write our own PDB parser, which is quite complex and hard to maintain, in terms of version changes.
The game itself needs to load symbol information, and the memory this data occupies will then show up in our memory analysis. We can partially work around this problem by reloading symbol information on each memory dump.
We chose, instead, to output a platform-dependent file from the game, excluding function names. For the PS2, we even dump the entire heap to a file, then walk the heap completely offline. Doing so, we can even dump the PS2 memory to file if a critical assertion occurs, and do some postmortem debugging in MemAnalyze. This also has the benefit of being able to display and compare the memory's contents. Obviously, this shifts some of the platform-dependent code to the tool.
The tool will include two platform-dependent pieces of code:
Reading of the platform-specific memory dump, and converting it to an internal, platform-independent data structure.
Reading of platform-specific symbols, and converting them to an internal, platform-independent data structure.
From this point on, everything should be multiplatform.
MemAnalyze
In the end we will have three different views of the data. In terms of graphical views, I only implemented two: one for displaying the layout of the memory and one that shows how much memory each function has allocated. In MemAnalyze, we can open multiple memory dumps in multiple windows. The third view is simply a dialog that lists memory leaks by comparing multiple memory dumps.
A view that I have not yet implemented is a Hierarchy view. It will display a hierarchy of the functions that allocated memory. Using this view, we can have more of an overview on the memory usage and zoom in and out on allocation hotspots. More information on this will be covered in Part two.
Memory layout view
This view shows blocks of memory as they are physically laid out on the console, in the style of Microsoft's Defrag display. Moving the mouse cursor over a block causes a tooltip to appear that shows the complete callstack that led to its allocation. This is very convenient for the PS2, where memory fragmentation is a big issue. You will mostly be searching for scattered small blocks that clutter your memory.
Unfortunately, this view is not that useful for Xbox games. The Xbox uses virtual memory addressing and this solves a lot of the heap fragmentation issues within the VMM. The VMM can split large virtual allocations into separate, non-contiguous 4KB physical pages. The addresses we use in our programs are virtual addresses and may be mapped onto multiple physical pages. We can monitor the virtual addresses, but fragmentation in the virtual address space is not really an issue, as we can map our 64MB onto a 4GB address space.
I don't know if it is possible to track the real physical pages on each virtual heap allocation. Maybe then we could really show the physical mapping of our virtual allocations. But I am not even sure if this would prove to be useful information.
Physical allocations, on the other hand, might be useful to monitor in the tool.
TopX view
This view shows a series of bars, one for each function that allocated memory. Again, if you move the mouse cursor over a bar, a tooltip will display the name of the function, along with the exact size it allocated and the number of allocations.
We can sort the functions in several interesting ways:
The total size allocated to each.
The number of allocations by each.
The function name.
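The three sort orders can be sketched as comparison functions for the standard qsort. The FunctionStats struct and the comparator names are assumptions made for this example; the real tool's data structures may look different.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-function totals gathered from one memory dump. */
typedef struct {
    const char *name;         /* function name from the symbol table */
    uint32_t    total_size;   /* bytes allocated by this function */
    uint32_t    alloc_count;  /* number of allocations it made */
} FunctionStats;

/* Largest total size first. */
static int by_total_size_desc(const void *a, const void *b)
{
    const FunctionStats *x = a, *y = b;
    return (x->total_size < y->total_size) - (x->total_size > y->total_size);
}

/* Highest allocation count first. */
static int by_alloc_count_desc(const void *a, const void *b)
{
    const FunctionStats *x = a, *y = b;
    return (x->alloc_count < y->alloc_count) - (x->alloc_count > y->alloc_count);
}

/* Alphabetical by function name. */
static int by_name(const void *a, const void *b)
{
    const FunctionStats *x = a, *y = b;
    return strcmp(x->name, y->name);
}
```

A call such as qsort(stats, count, sizeof stats[0], by_total_size_desc) then puts the biggest memory users at the front of the array.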
Memory leaks view
The memory leaks view will compare two dumps and display the differences in a dialog box, as seen in Figure 3.
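One simple way to compare two dumps, shown here only as a sketch and not necessarily how MemAnalyze does it, is to report every block in the later dump that has no identical block (same address and size) in the earlier one. The DumpBlock struct and find_new_blocks function are hypothetical, and both block lists are assumed to be sorted by ascending address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block record from a memory dump. */
typedef struct {
    uint32_t address;
    uint32_t size;
} DumpBlock;

/* Copies blocks that appear in `after` but not in `before` into `out`.
   Both inputs must be sorted by ascending address. Stops once max_out
   blocks have been written; returns the number of blocks written. */
static size_t find_new_blocks(const DumpBlock *before, size_t before_count,
                              const DumpBlock *after, size_t after_count,
                              DumpBlock *out, size_t max_out)
{
    size_t i = 0, found = 0;
    for (size_t j = 0; j < after_count && found < max_out; ++j) {
        /* Skip earlier-dump blocks that lie below the current block. */
        while (i < before_count && before[i].address < after[j].address)
            ++i;
        int same = i < before_count &&
                   before[i].address == after[j].address &&
                   before[i].size == after[j].size;
        if (!same)
            out[found++] = after[j];    /* leak candidate */
    }
    return found;
}
```

Blocks reported this way are only candidates: an allocation made between the two dumps is not a leak if the game still intends to free it.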
______________________________________________________
Making a memory dump
Xbox
Those who have experienced the joy of Xbox programming will have already found out that Microsoft thought of almost everything concerning game programming. Luckily for us, they also have a tool that dumps all allocated memory blocks, along with a callstack. It is called XbMemDump. If XbMemDump does not suit your needs, they also provide a series of debugging functions to store callstack info and to walk the heap manually.
Lastly, the Xbox has a unified memory structure, which makes it possible to monitor all memory, including that used for sound and video.
Automatically dumping memory using XbMemdump
At first look, XbMemdump seems like everything we need. It has many benefits: it has support for memory tracking at the kernel level, so it does not miss any allocations. It is able to display callstack information on up to 32 levels per allocation, so you won't have to bother tracking these allocations yourself.
However, when I started building MemAnalyze half a year ago, XbMemdump ran horribly slowly if allocation tracking was enabled. It regularly crashed during a level load, and when it did not crash, it took about 1.5 hours to complete. When I could finally dump the memory, it displayed just the return addresses and I couldn't get the symbol information to work.
Now, half a year later, I tested XbMemDump again, and there is no performance problem and the symbols load just fine. When I asked, Microsoft reported no changes to XbMemdump since December 2002, so you should check how it performs with your code. It might be running smoothly now because of the different allocation strategies we have implemented in our game since I first began this work. In case XbMemdump doesn't perform well with your code, or if you are interested in how I worked around the problems, the following will explain how to manually dump the Xbox's memory.
Manually dumping the memory
Intercepting all allocations
We first need to intercept all allocations. This can be a pretty tough job. The Xbox has two different types of allocations: PhysicalAllocs, typically used to allocate contiguous memory (video buffers, sound data), and HeapAllocs.
Xbox provides a global allocation function, XMemAlloc, which can be overloaded. XMemAlloc supports (almost) all types of allocations. Every third-party product should use XMemAlloc for their allocations, so the game developer can intercept them. If the tool developer really needs other behavior that XMemAlloc doesn't support, like 32-byte alignments or higher, a wrapper for the allocation function should be created, with the possibility for providing a callback function. This way, the application can respond to all allocations.
Sadly, not all third-party products conform to these rules. Even Microsoft has ignored them: up until the December 2003 SDK, the XACT and XMV modules did not use XMemAlloc. (They do now, however.)
Once we can intercept all allocations, or at least all the allocations needed, we can then store our callstack information.
Real-time callstack tracing
Microsoft offers a series of debugging functions with the prefix "Dm". To use them, you need to link with the debug library XbDm.lib. The function DmCaptureStackBackTrace is used to store callstack information. (If you would like to know more about callstack tracing on Intel-based machines, I suggest reading Chavdar Dimitrov's explanation [REF2]). Listing 1 shows my own callstack trace function that works on any IA-32 based architecture (and above), provided that you disable the omission of frame pointers in the compiler settings.
Please note that we have not obtained the start addresses of the functions that preceded our function. Instead, we have found the return addresses! Each return address lies somewhere between the start and end addresses of the calling function.
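The idea behind frame-pointer based tracing can be sketched as follows. This is not the article's Listing 1: to stay portable and testable, the sketch walks a simulated stack held in an array instead of reading the real frame pointer register. It assumes the usual IA-32 layout when frame pointers are not omitted, where each frame stores the caller's frame pointer followed by the return address. The walk_frames name and the use of word indices as addresses are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Walks a chain of saved frame pointers in simulated stack memory.
   mem[frame]     holds the word index of the caller's frame (0 ends the chain),
   mem[frame + 1] holds the return address into the caller.
   Stores up to max_depth return addresses in out; returns how many. */
static size_t walk_frames(const uint32_t *mem, size_t mem_words,
                          uint32_t frame, uint32_t *out, size_t max_depth)
{
    size_t depth = 0;
    while (frame != 0 && (size_t)frame + 1 < mem_words && depth < max_depth) {
        uint32_t return_address = mem[frame + 1];
        if (return_address == 0)
            break;                      /* no caller recorded */
        out[depth++] = return_address;

        uint32_t caller_frame = mem[frame];
        if (caller_frame <= frame)
            break;  /* the stack grows down, so callers sit at higher
                       addresses; anything else means the chain has ended
                       or is corrupt */
        frame = caller_frame;
    }
    return depth;
}
```

On real hardware, the starting frame would come from the frame pointer register (EBP on IA-32), which is why the compiler must not omit frame pointers.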
The functions StoreCallStackAsm and StoreCallStackCPP return the number of successful items placed in the array. Listing 2 shows how to use StoreCallStack.
In this example, StoreCallStack will store return addresses that lie within the functions Foo2, Foo1 and _tmain. Neither Foo3, the caller of StoreCallStack, nor StoreCallStack itself is included in the callstack!
Storing the data
We must store the callstack somewhere. For heap allocations, I decided to enlarge the block that was allocated by 16 bytes, and add our information at the back of the allocated block. I also provide a tag of 4 bytes in the 16 bytes. Choose a hexadecimal value such as 0xCAFEBABE for the tag value. The tag value is used later, when walking the heap. The heap walker must check if the allocated block it is processing has our callstack information, since there will always be allocations that we didn't track. In running a test of our first level, I found that we managed to track almost all allocations:
Heap summary: Total count=76162, of which: Tagged: 75756, Untagged: 406!
Heap summary: Total size=28244816 bytes, of which: Tagged: 26859088, Untagged: 1385728!
The Xbox memory manager aligns each heap allocation to a 16-byte address, and the size is always a multiple of 16 bytes. If you want to append your own data to a block, do this math yourself: first round up the size of the allocated item to a multiple of 16 bytes, and then add another 16 for your own data (or any multiple of 16). Using 16 bytes, we can store the 4-byte tag and a callstack three functions deep. Figure 4 shows the layout of an allocation of 24 bytes on Xbox.
As you can see, we are losing 8 valuable bytes. There is not much we can do about this: during the heapwalk, there is no way to recover the original size that was requested for the block after the allocation. As a last resort you could add a byte at the back of the block indicating the number of callstack levels present. This way you could have a dynamic number of callstack levels, ranging from 3 to 6 levels deep, filling up unused bytes (the tag needs to shrink to 3 bytes though).
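The size arithmetic and the 16-byte footer can be sketched like this. The names (padded_size, TrackFooter, write_footer, read_footer) are made up for this example; only the tag value, the 16-byte rounding, and the three-level callstack come from the text above.

```c
#include <stdint.h>
#include <string.h>

#define TRACK_TAG    0xCAFEBABEu
#define TRACK_ALIGN  16u
#define TRACK_FOOTER 16u   /* 4-byte tag + three 4-byte return addresses */

/* Hypothetical layout of the data stored at the back of each block. */
typedef struct {
    uint32_t tag;
    uint32_t return_addresses[3];
} TrackFooter;

/* Rounds the request up to a multiple of 16 and adds room for the footer. */
static uint32_t padded_size(uint32_t requested)
{
    uint32_t rounded = (requested + TRACK_ALIGN - 1u) & ~(TRACK_ALIGN - 1u);
    return rounded + TRACK_FOOTER;
}

/* Stores the tag and callstack in the last 16 bytes of the block. */
static void write_footer(void *block, uint32_t block_size,
                         const uint32_t return_addresses[3])
{
    TrackFooter footer;
    footer.tag = TRACK_TAG;
    memcpy(footer.return_addresses, return_addresses,
           sizeof footer.return_addresses);
    memcpy((char *)block + block_size - TRACK_FOOTER, &footer, sizeof footer);
}

/* Copies the footer to *out and returns 1 if the block carries our tag. */
static int read_footer(const void *block, uint32_t block_size, TrackFooter *out)
{
    if (block_size < TRACK_FOOTER)
        return 0;
    memcpy(out, (const char *)block + block_size - TRACK_FOOTER, sizeof *out);
    return out->tag == TRACK_TAG;
}
```

For the 24-byte request discussed above, padded_size(24) is 48: 24 rounds up to 32, plus 16 for the footer, which leaves the 8 unused bytes between the user data and the footer.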
Although I have used the approach as described above, there are a few disadvantages to it:
The 16-byte overhead per allocation block pollutes the memory dump.
The callstack is quite limited, unless we add even more overhead per block.
There is a small chance that a memory block is recognized as a tagged block, even if it is not, since we can't guarantee our tag will be unique. This is not very harmful: the system won't crash; it will simply display a few blocks with incorrect or unknown callstack functions.
On the positive side, these downsides never really proved to be a problem to me. The system is easy to implement, and more importantly: there is no performance penalty involved when a block is allocated or freed!
Still, I would like to present another approach. Since the Xbox has support for multiple heaps, we can create a separate list that contains the extra allocation data and put it on an alternative heap. The advantage of this technique is that our memory snapshot will be the exact representation of the memory in a normal build. It is also much easier to track larger callstack levels, as XbMemDump does, and it makes walking the heap easier: we can just run over this list. The disadvantage is that each free of a memory block will need to search this list in order to delete our extra data. We need to use a hash table or another optimization algorithm in order to keep the performance penalty down.
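A minimal sketch of such a side table is shown below, using a fixed-size hash table with linear probing so that freeing a block does not require a linear search. All names and sizes are assumptions of this example. In a real build, the table itself would live on the alternative heap so that it does not disturb the game's own heap layout.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS   10u
#define TABLE_SIZE   (1u << TABLE_BITS)
#define SLOT_EMPTY   0u   /* never a valid block address */
#define SLOT_DELETED 1u   /* tombstone; also never a valid block address */

typedef struct {
    uint32_t address;       /* key: address of the tracked block */
    uint32_t size;
    uint32_t callstack[8];  /* deeper than the 3 levels a footer allows */
} TrackedAlloc;

static TrackedAlloc g_table[TABLE_SIZE];    /* zero-initialised: all empty */

static uint32_t home_slot(uint32_t address)
{
    /* Blocks are 16-byte aligned, so drop the low 4 bits before hashing. */
    return ((address >> 4) * 2654435761u) >> (32u - TABLE_BITS);
}

/* Call after an allocation. Returns the slot to fill in, or NULL if full. */
static TrackedAlloc *track_alloc(uint32_t address, uint32_t size)
{
    if (address <= SLOT_DELETED)
        return NULL;
    uint32_t i = home_slot(address);
    for (uint32_t probes = 0; probes < TABLE_SIZE; ++probes) {
        TrackedAlloc *slot = &g_table[i];
        if (slot->address == SLOT_EMPTY || slot->address == SLOT_DELETED) {
            slot->address = address;
            slot->size = size;
            return slot;
        }
        i = (i + 1u) & (TABLE_SIZE - 1u);
    }
    return NULL;
}

static TrackedAlloc *find_alloc(uint32_t address)
{
    if (address <= SLOT_DELETED)
        return NULL;
    uint32_t i = home_slot(address);
    for (uint32_t probes = 0; probes < TABLE_SIZE; ++probes) {
        TrackedAlloc *slot = &g_table[i];
        if (slot->address == address)
            return slot;
        if (slot->address == SLOT_EMPTY)
            return NULL;                /* end of this probe chain */
        i = (i + 1u) & (TABLE_SIZE - 1u);
    }
    return NULL;
}

/* Call before a free. Returns 1 if the block was tracked. */
static int untrack_alloc(uint32_t address)
{
    TrackedAlloc *slot = find_alloc(address);
    if (slot == NULL)
        return 0;
    slot->address = SLOT_DELETED;       /* keep probe chains intact */
    return 1;
}
```

A fixed table keeps the sketch short; a game with tens of thousands of live allocations would need a larger or growable table.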
For physical allocations, you have no choice but to maintain a separate list with the addresses, sizes and return addresses. We have to, because there is no such thing as a "PhysicalAllocWalker" on the Xbox. Typically there will be far fewer PhysicalAllocs than HeapAllocs, so the performance penalty for walking the list on a deallocation is not too big. In the test run of our first level, the number of PhysicalAllocs was:
*** Number of tracked physical allocations:39, total size: 12601656 ***
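With only a few dozen entries, a plain array with a linear search is enough for this list. The sketch below uses hypothetical names and a fixed capacity; the wrappers around the physical allocation and free functions would call track_physical and untrack_physical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PHYSICAL_ALLOCS 256

typedef struct {
    uint32_t address;
    uint32_t size;
    uint32_t return_address;    /* who asked for the memory */
} PhysicalAlloc;

static PhysicalAlloc g_physical[MAX_PHYSICAL_ALLOCS];
static size_t        g_physical_count;

/* Call after a physical allocation. Returns 0 if the list is full. */
static int track_physical(uint32_t address, uint32_t size,
                          uint32_t return_address)
{
    if (g_physical_count == MAX_PHYSICAL_ALLOCS)
        return 0;
    g_physical[g_physical_count].address = address;
    g_physical[g_physical_count].size = size;
    g_physical[g_physical_count].return_address = return_address;
    ++g_physical_count;
    return 1;
}

/* Call before a physical free. Returns 1 if the block was tracked. */
static int untrack_physical(uint32_t address)
{
    for (size_t i = 0; i < g_physical_count; ++i) {
        if (g_physical[i].address == address) {
            /* Order does not matter, so fill the hole with the last entry. */
            g_physical[i] = g_physical[g_physical_count - 1];
            --g_physical_count;
            return 1;
        }
    }
    return 0;
}

/* Sums the sizes of all tracked physical allocations. */
static uint32_t total_physical_size(void)
{
    uint32_t total = 0;
    for (size_t i = 0; i < g_physical_count; ++i)
        total += g_physical[i].size;
    return total;
}
```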
Dumping all allocations
We can now create a snapshot of the memory. If we decided to put our heap data on a separate heap, we can simply run over the list. If we didn't, we will need to walk the heap and, for each item, check the tag to see if it was tracked by our code, then output the extra allocation data that we stored at the end of the block. For PhysicalAllocs, we simply run over the list of PhysicalAllocs.
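The tag check during the heap walk can be sketched as follows. The HeapEntry struct stands in for whatever the platform's heap walker reports for each block (a data pointer and a size), and the function names are assumptions of this example. The summary it builds mirrors the tagged and untagged totals shown earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACK_TAG   0xCAFEBABEu
#define FOOTER_SIZE 16u

/* Hypothetical view of one block as reported by a heap walker. */
typedef struct {
    const void *data;
    uint32_t    size;
} HeapEntry;

typedef struct {
    uint32_t tagged_count, untagged_count;
    uint32_t tagged_bytes, untagged_bytes;
} HeapSummary;

/* A block is ours if its last 16 bytes start with the tag. */
static int is_tagged(const HeapEntry *entry)
{
    uint32_t tag;
    if (entry->size < FOOTER_SIZE)
        return 0;
    memcpy(&tag, (const char *)entry->data + entry->size - FOOTER_SIZE,
           sizeof tag);
    return tag == TRACK_TAG;
}

/* Counts tracked and untracked blocks across all reported entries. */
static HeapSummary summarize_heap(const HeapEntry *entries, size_t count)
{
    HeapSummary s = {0, 0, 0, 0};
    for (size_t i = 0; i < count; ++i) {
        if (is_tagged(&entries[i])) {
            s.tagged_count++;
            s.tagged_bytes += entries[i].size;
        } else {
            s.untagged_count++;
            s.untagged_bytes += entries[i].size;
        }
    }
    return s;
}
```

In the real dump code, the branch for tagged blocks is also where the stored return addresses would be written to the output file.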
We can walk the heap pretty easily by using Microsoft's debug function HeapWalk. It works perfectly, but unfortunately, it is only available in the debug libraries. It is difficult, if not impossible, to make a release build while linking with just the XapiLibD.lib. Whenever I tried this, I always ended up in a complete debug build. The reason HeapWalk is put in a debug library is purely that Microsoft does not want our final game to have low-level heapwalk functionality, which sounds plausible. Perhaps they should place the HeapWalk function in the XbDm library, which can be easily linked into a release build, but is unapproved.
One key disadvantage of a debug build is that the data structures will look quite different. In debug mode the memory manager behaves slightly differently. For instance, the heap header for each allocation block is larger, and it adds 0xFF tags to check for memory overruns. Last but not least, most games run terribly slowly in debug mode.
Sadly, there is no simple way to walk the heap in a release configuration unless we write our own heapwalker. I have tried and I have come a long way, but it is not a methodology I want to propagate. The Xbox kernel is way too complicated and it is bad practice not to use Microsoft's existing code. For the PS2 however, my colleague Tom van Dijck wrote a heapwalker. A detailed description of his PS2 heapwalker can be found below.
Finally, we need to output the image's base address. We can retrieve the image base address by calling DmWalkLoadedModules. This function will return all currently loaded modules, including kernel and debugging modules. We need to output all the base addresses along with their names. An in-depth description of the image base address will be given in part two of this series.
As mentioned earlier, I personally decided not to output function names in the memory dump. If you would like to do so, the “Dm” functions provide functionality for parsing symbol information and converting addresses to function names. For more information on the Xbox memory functions, take a look at Forrest Trepte's Xstream training session on Xbox central [REF9].
______________________________________________________
PS2
The Playstation 2 does not have the Xbox's great debugging tools. It also lacks a unified memory structure. On the bright side, the heap system is so simple that we can easily write our own heapwalker.
Dumping the memory
Intercepting all allocations
On the PS2, there is no global allocation that can be intercepted as on the Xbox. Of course, the new operators can be overloaded, but there is no way to intercept any other allocation functions. That leaves us with just one option: wrap all allocations! Since we used Renderware Graphics for our game, we simply called their allocation routines. Renderware's allocation routines can be redirected, so we redirected them to our own custom allocation routines. Now all allocations were done through Renderware, and therefore, through our custom allocation functions.
Overloading new and delete operators is even easier, so after our game used just these functions, most of the memory was intercepted. With a few exceptions...
The Sony runtime libraries allocate memory from the heap as well. More specifically, printf and atof were the two functions bugging us. They allocated small memory blocks as soon as they needed them, causing fragmentation. We couldn't capture them because they used malloc_r directly. Malloc_r is an internal allocation routine from the runtime libraries. In the end we made sure that on startup of our application printf and atof were called a few times to be sure they allocated all the memory they needed. The following code did the trick for us, and caused no memory fragmentation during the game.
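A warm-up routine along these lines might look like the sketch below. It is not the original listing: the function name, the number of iterations, and the format string are guesses, and which calls actually trigger allocations depends on the runtime library in use.

```c
#include <stdio.h>
#include <stdlib.h>

/* Call once at startup, before the game makes its own allocations, so that
   the runtime library grabs its internal buffers while the heap is still
   empty. Returns the accumulated value so the calls cannot be optimised
   away. */
static double warm_up_runtime(void)
{
    double sum = 0.0;
    for (int i = 0; i < 4; ++i) {
        sum += atof("3.14159");                 /* exercise atof */
        printf("warm-up %d: %f %s\n", i, sum, "text");   /* exercise printf */
    }
    return sum;
}
```

Taking a memory dump before and after the call is an easy way to check that the library's allocations really do end up at the start of the heap.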
Now that this issue was solved, we could intercept all other allocations used in our game, and we could add our 16-byte additional data to store our callstack information.
First we tried to store it at the beginning of each block, and simply return an address 16 bytes further, but somehow the Renderware DMA handler did not like that idea, so we ended up putting our data at the end of the memory block, which exposed a small quirk:
When we perform “malloc(8)” we get a memory block of at least 8 bytes, but when we retrieve the block's size using malloc_usable_size() we receive 12, which means the block is actually 12 bytes in size.
So, when we allocate 16 bytes extra, we should not put our information at “address + 8”, but at “address + malloc_usable_size(address) - 16”, because otherwise we will be unable to find it later in functions such as free, realloc, and in our heapwalker.
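The address calculation can be sketched as a small helper. To keep the example portable, the usable size is passed in as a parameter; on the PS2 it would be the value returned by malloc_usable_size for that block. The footer_address name is an assumption of this example, as is the usable size of 28 used in the note below.

```c
#include <stddef.h>

#define FOOTER_SIZE 16u

/* Returns where our 16 bytes of tracking data live inside a block.
   usable_size is the block's real size as reported by the allocator,
   which may be larger than the size that was requested. */
static unsigned char *footer_address(void *block, size_t usable_size)
{
    return (unsigned char *)block + usable_size - FOOTER_SIZE;
}
```

If a request for 8 + 16 = 24 bytes were to yield a usable size of 28, the footer would start at offset 12 rather than offset 8. Every function that needs the footer must use the same calculation.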
Again, we mark our data with a tag such as 0xCAFEBABE. As mentioned earlier, this is needed because the Sony libraries allocate some memory too, and by using this tag we can identify whether the memory block has callstack information or not.
Real-time callstack tracing
Callstack tracing on a MIPS machine is far more complicated than on Intel-based machines. I suggest reading See MIPS Run [REF1] for more detailed information.
Keith Packard from the MIT X Consortium created a callstack tracer algorithm for MIPS processors. It can be found in the Sony Developer Newsgroups [REF6]. This one already contains some modifications for EE specific instructions, and works like a charm for us.
Writing a heapwalker
The layout of the PS2's memory heap is very easy to parse. It is simply a large block of contiguous memory. First we have to find out where the heap starts. We are using CodeWarrior to build our project, and the following code will likely be different for other compilers such as GCC or ProDG.
In CodeWarrior, there is a feature called Linker Configuration Files (LCF), which, amongst other things, can be used to specify the heap size.
The CodeWarrior linker defines some symbols that can be accessed in the code. Sony's default heap implementation uses these symbols too, and so we are able to find out exactly where the heap starts.
typedef int __attribute__ ((mode (TI))) heap_size_type __attribute__((aligned(16)));
extern heap_size_type _end;
Adding the above two lines of code to one of your files makes it possible to find the exact address where your executable data ends. This also seems to be the start of the heap.
Using the following code, we can walk from the start of the heap to its exact end, visiting every single block of memory along the way (Listing 6).
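The general shape of such a walker is sketched below. It is not the article's Listing 6, and the block header it parses is purely hypothetical: the real PS2 heap has its own header layout. The sketch only illustrates hopping from block to block through a contiguous heap, using each block's size to find the next one, and discovering the end of the heap in the process.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL block header; the real PS2 heap layout differs. */
typedef struct {
    uint32_t size;    /* total bytes in this block, header included */
    uint32_t in_use;  /* non-zero if the block is allocated */
} BlockHeader;

typedef void (*BlockVisitor)(uint32_t offset, const BlockHeader *header,
                             void *user);

/* Walks the blocks in [heap, heap + heap_size) and calls visit for each.
   Returns the offset just past the last valid block, i.e. the heap's end. */
static size_t walk_heap(const unsigned char *heap, size_t heap_size,
                        BlockVisitor visit, void *user)
{
    size_t offset = 0;
    while (offset + sizeof(BlockHeader) <= heap_size) {
        BlockHeader header;
        memcpy(&header, heap + offset, sizeof header);
        /* A size that is too small or runs past the buffer ends the walk. */
        if (header.size < sizeof header || header.size > heap_size - offset)
            break;
        if (visit != NULL)
            visit((uint32_t)offset, &header, user);
        offset += header.size;
    }
    return offset;
}

/* Example visitor: adds up the sizes of all allocated blocks. */
static void sum_used(uint32_t offset, const BlockHeader *header, void *user)
{
    (void)offset;
    if (header->in_use)
        *(uint32_t *)user += header->size;
}
```

Because the walker only reads from a byte buffer, the same code can run in the game on live memory and offline in the tool on a dumped copy of the heap.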
Dumping the memory
By walking the heap, we can also figure out what its end address is. We already had the start address, so by using the start and end address, we can dump the entire heap to file. This supplies us the actual contents of the memory. We will walk the heap again offline in MemAnalyze, using a slightly modified version of the HeapWalk function from listing 6. Listing 7 shows how you can dump the entire heap to file.
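Writing the raw heap contents to a file takes very little code, as the sketch below suggests. It is not the article's Listing 7. The two-word file header (base address and length) is an assumption of this example, added so that an offline tool could map file offsets back to console addresses. The address is truncated to 32 bits, which matches the consoles discussed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes the bytes in [heap_start, heap_end) to `out`, preceded by a small
   header holding the original base address and the length in bytes.
   Returns 1 on success, 0 on failure. */
static int dump_heap(FILE *out, const void *heap_start, const void *heap_end)
{
    const unsigned char *start = heap_start;
    const unsigned char *end = heap_end;
    if (end < start)
        return 0;

    size_t length = (size_t)(end - start);
    uint32_t header[2];
    header[0] = (uint32_t)(uintptr_t)start;  /* 32-bit console address */
    header[1] = (uint32_t)length;
    if (fwrite(header, sizeof header, 1, out) != 1)
        return 0;
    if (length != 0 && fwrite(start, 1, length, out) != length)
        return 0;
    return 1;
}
```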
What's next?
Now that we can dump the heap data from both platforms to file, it is time to take a look at the tool. In part two, I'll discuss the details on map file parsing, PDB parsing, and take a close look at how the Xbox image is loaded into memory. We will also see how the tool processes the data from the memory dump to come up with several interesting views.
References
[1] See MIPS run, by Dominic Sweetman. Morgan Kaufmann Publishers, 1999 [ISBN: 1558604103]
[2] Playing with the stack, by Chavdar Dimitrov.
http://www.codeproject.com/tips/stackdumper.asp#xx324128xx
[3] XDK documentation: chapter "Xbox kernel memory management"
[4] Rob Wyatt's explanation on fragmentation and caching on Xbox
Xbox newsgroups: news.xds.xbox.com
Search for:
Matt Benic
D3D_AllocContiguousMemory question
08/12/2002
[5] Xbox Memory Architecture and Performance, by Mike Abrash.
Available in the XDK documentation and on Microsoft website:
https://xds.xbox.com/BPProgInfo.asp?Page=content/prog_wp_memoryarch.htm
[6] Keith Packard's algorithm for callstack tracing on MIPS processors
Sony Developer Newsgroups (news.ps2-pro.com)
Search for:
Phil Camp (SN Systems)
sce.dev.prog.ee
Tuesday, February 04, 2003 2:22 PM
Re: call stack trace for EE?
[7] Metrowerks' CodeTEST
http://www.metrowerks.com/MW/Develop/AMC/CodeTEST/CodeTEST+Memory.htm
[8] Compuware Boundschecker
http://www.compuware.com/products/devpartner/bounds.htm
[9] Forrest Trepte's training session on Xbox memory management
https://xds.xbox.com/media/Memory%20Management_files/default.htm