In Part 1 we started of our Engine development cycle by first designing an Architecture that would suit our Game. We then realized that a Game Engine, like any software, is nothing more than a big data crunching machine. We have some data (Game Assets), the user can modify some data (Input), we process that data (Game Logic) and display the resulting data (Rendering). Thus, from a developer viewpoint, managing that data is the most important aspect of the Engine. For now, we'll ignore the fact that gameplay seems to be an important aspect to the Player. But, what does he know.
In order to work with data, we chose a Top-down approach, going from overall Engine Architecture right down into Memory Management. But we still haven't talked about the actual data we will manage. We will do this here.
We will also look at some important Cache related performance issues when we work with data in order to make educated design decisions on how to structure it.
Be warned, as we are going to work at a lower level than before, there might be Code involved.
Let's say our Game will need a Deck of 60 Cards. How do we represent that in code? A Stack? What if we need to shuffle it? What if a Card is randomly placed inside the Deck, is a Stack still appropriate?
Those are alot of design questions we face while coding. It all boils down to this: What we are really interested in is something very fundamental about data. Something that applies to all shapes and sizes we might encounter in the wild - How we organize it. Or, as these organizational elements are also called, Containers.
At first glance, this might seem like something we covered in Memory Management. But it is not. The difference is subtle, our Allocators in Memory Management tell us where we store the data and our Containers tell us how we store our data.
So what is a Container? Well, you probably know alot of them, but might not have used this term before. Just to name a few:
- Linked Lists
and on.. and on..
Basically a Container is just some way of organizing pieces of Information. These small pieces might be simple variables, structs or even classes - or in our case, Game Cards.
The goal of this Article is not to teach you about the types of Containers themselves. The goal of this Article is to bridge our knowledge of Containers to our Allocators and find good and Cache-friendly ways of designing these for use in our Engine. And to be honest, this will be no easy task. I'd even go as far as saying, this is probably one of the hardest parts of writing an Engine.
But first, let's start with something simple and take a look at Containers from our Memory Management background before we move on.
The Linked List
As you probably know, a linked List is just a collection of elements, each containing some piece of data and a pointer to the next (and/or to the previous) element in the List. It's an okay solution to our Card Game, probably better than a pure Stack implementation, if we need random access or shuffling.
The obviously cool thing about linked Lists is, that you can store each element wherever you want in Memory and that this decision won't affect the relative position of the Element inside the List. Perhaps you have even implemented a linked List in the past this way, storing Elements somewhere in Memory and linking them up later on.
In short, a simple, yet very bad implementation of a singly linked List in our imaginary language might look like this:
void addElement( someObject item )
if ( firstElement == null )
firstElement = new listElement();
firstElement->data = item;
lastElement = firstElement;
lastElement->nextElement = new listElement();
lastElement->nextElement->data = item;
lastElement = lastElement->nextElement;
void deleteElementAt( int index )
listElementPointer currentElement = firstElement;
listElementPointer previousElement = firstElement;
for (int i = 0; i < index; i++)
previousElement = currentElement;
currentElement = currentElement->nextElement;
previousElement->nextElement = null;
lastElement = previousElement;
// Iteration functions.... etc...
listElementPointer firstElement = null;
listElementPointer lastElement = null;
(Ignore all the potential for errors in the above code and focus on the use of new and delete)
Imagine we ran that code in our Engine. By adding single elements to our List, using the standard allocator that is given to us by our Language, we will seriously spread out the area where this simple List stores it's elements. They could be anywhere.
Smells like Memory Fragmentation. How can we fix this? Well, we already know how from Part 2! By allowing our SinglyLinkedList to use one of our Allocators or Allocation Techniques. Let's go with a Pool-based Allocator. By replacing new and delete inside the List Class with our allocate function, we constrain our Linked List to live entirely on a pre-defined area in our Memory.
Pretty neat, right? Sure, we have to be certain our Pool is large enough to fit all cases our List might encounter, but more often than not, we can actually estimate this maximum size. For example, our Card deck might never grow beyond 200 Cards.
This isn't just some special Game development thing to use custom Allocators on Data Containers. As a matter of fact every Container in the Standard Library in C++ allows you to use a custom Allocator. So the above List would simply look like this in C++ using our own little myAllocator.
Pretty neat, if you followed the C++ guidelines on writing Allocators. I won't recommend, nor not recommend using the Standard Library, but if you are coming from a C++ Background, it is definitely worth reading this first!
Okay, enough C++ specific things you probably don't care about, we just wanted to look at a simple example!
What's the big deal, I know how to write Containers and we already covered Memory Management!
Well, apart from avoiding all the pitfalls we discussed in the last Article, there are a lot of speed related issues we face when designing Containers. The two big ones we will face are:
- Container Overhead
We will mainly talk about Caching, as this is can become very important and currently is a very "in" thing to do. Furthermore, the issue of Overhead we face is actually tightly coupled to this subject anyways.
At this point you will probably recall this Quote by Donald Knuth:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
This is indeed true most of the time. But what we are going to do and talk about is none the less necessary at this stage. Why? For one, we already know it's going to be slow if we mess up, thus we should design it efficiently from the beginning using some of the methods we are going to talk / have talked about. Furthermore, we are designing something as fundamental as our data structures at this point. Optimizing this at a later stage will be a huge can of bugs we don't want to open, so why not take the time now and design it well once.
Before we start to talk about Cache:
- if you are actually writing an Engine that needs some hardcore Data/Memory handling right now, concider reading this Article by Ulrich Drepper instead of my babbling. It's worth it.
- if you are writing a simple Engine or are just interested in Engines, you could continue reading here. But bookmarking the above Article never hurts.
So, what is a Cache? As the name suggests, it is some form of Storage. When talking about Programming, it mostly refers to a Region of Memory that temporarily stores hard-to-get data to avoid fetching it again. This alone is a very interesting concept, as we saw how slow RAM access can be.
And indeed, this is exactly how our CPU operates. Pretty much every piece of Memory coming from the RAM is stored in the Cache, just in case we might need it again later. Schematically it looks something like this:
I deliberately drew the RAM far away, as to symbolize the slow access time. It's okay to think of the Cache to be the middle man between the CPU and the RAM. Let's just stick with this picture for now and assume, that everything that is coming from the RAM is actually first stored inside the Cache, before the CPU gets it.
So why not just use the Cache the whole time instead of the RAM?
Let's look at an actual Processor and its Cache.
(That's an Intel Core i7-4960X, Image by Intel, found through Google)
The above Processor "only" has 15 MB of Cache and look at how much space that takes. How can we increase that? We could slap another 15 MB of Cache on each side of the Cores and double the Cache (in theory), but then, we start running out of space. So, what needs to be done? We need to go farther away. At the next distance interval, we might be able to add another 60 MB. We will thus need to go farther and farther away from the actual Core on the Die in order to achieve the large amounts of Memory we have in our RAM.
We'd end up with a huge Processor Die. And this is followed up by a lot of Problems. Apart from heating issues, the distance between the outer edges of the Cache and the Cores will be problematic. This distance will increase the time it takes for the Memory to travel back and forth. Something we tried to avoid. Not good.
What? Yes, having too much Cache could hurt the speed our Memory runs at. We would gain alot more overhead just to address our Cache (We will see why later). This might seem negligible at first glance, if we concider that we wouldn't be using our RAM anymore / as often. But our main Goal here is to reduce RAM access anyways, thus introducing more Cache by sacrificing Cache speed would be counterproductive.
Well, Cache isn't just faster than your usual RAM just because it's closer, it's also faster, because of the way it is manufactured. Cache mostly uses SRAM (Static RAM - basically, just a few Transistors forming a FlipFlop) to store it's information, instead of the typical DRAM (Dynamic RAM - formed using a capacitor to hold the bit and a transistor) found on the RAM sticks inside your PC. Sadly, having large arrays of SRAM is alot more expensive than using large DRAM arrays, hence cost becomes a Factor.
L1, L2 and L3
In the Pictures we looked at, we had different types of Cache called L1, L2 and L3. So what do they mean? In short, they just express how close the Cache is to the actual Core, indicated by the use of L as in Level.
So an L1-Cache would be the closest. Being the closest also means, that it is the smallest of the Caches, due to size constraints. Usually the L1 is local to a single Core, i.e. only that one Core has access to it. The size hasn't changed much in the past few years and it's pretty safe to assume that we have about 32KB - 64KB of L1-Cache per Core.
L1 (hit) Access Time: 0.5 ns - 2 ns
The L2-Cache is a bit larger than the L1. Starting with the L2, Cache's might be shared between Cores, so 2 Cores sitting next to each other might have a local L1 cache each, but share their L2 Cache. Here sizes start to vary more. It's roughly 4 - 8 times the size of the L1-Cache.
L2 (hit) Access Time: 4 ns - 7 ns
The L3-Cache (if present) is pretty much just the largest Cache and often shared by multiple Cores. Its size varies the most, but if you had to guess, you could assume that it's around 1/1000th the size of the RAM in use. (For example, an 8GB mid-range PC would probably run a mid-range Processor with 8 MB of L3 Cache)
L3 (hit) Access Time: 15 ns - 50 ns (depending on how it is shared)
Notice how the Access Times jump, even though all these Caches are on the same Die? Confirming our suspicion that more Cache wouldn't necessarily be faster.
Why do we have 3 Levels?
For this, let's take a detailed look at a previous graphic.
A simple way of looking at how it works is,
- Core needs a certain Memory Location (for example to multiply it by some number)
- Check if the Memory Location happens to be in L1.
- No? Check if the Memory Location happens to be in L2.
- No? Check if the Memory Location happens to be in L3.
- No? Fetch it from the RAM.
The last "No?" - not finding the Memory in the Cache and having to get it from the RAM - is mostly referred to as a Cache-miss.
So what happens on a Cache-miss or when fetching Memory from the RAM in general?
A request to the RAM is made for the Memory Location in question. The RAM sends that piece of Memory to our Cache and we store it. Sounds pretty neat. But say we don't have any space left? We must therefore kick out some part of the L1 Cache, or our Core has no way of accessing it! Ouch...
Usually, the piece of Memory that is kicked out of the L1 Cache is the one that hasn't been accessed the longest time ago. But still, we lose it, and if we were to need it again, we would have to go back all the way to the RAM and get it.
Or do we? Like we saw earlier, the L2 Cache is bigger than the L1 Cache. So, instead of completely kicking out the oldest Memory out of the L1, we simply kick it to L2 and store it there.
But what if L2 is full as well? Well, to make space we then kick the oldest L2 Cache piece of Information to L3 and can safely store our data coming from L1.
What happens to the oldest Memory on L3 if we run out of space there? That's gone! If we ever need it again, we need to fetch it from our RAM.
Speedwise, avoiding as many Cache-misses as possible is our primary goal when designing data structures. But to understand Cache-misses better, we need to learn how data is actually represented in the Cache, how it gets send there and how it gets send back!
We learned that Memory from the RAM is first passed to the Cache before being processed in one of the Cores. The currently executed Program does not know this, therefore, the Program will want to fetch some Memory Location as it is mapped on the RAM. This means, our Cache must be smart enough to realize that if the CPU wants Memory Location 0x0033FA22 on the RAM, this would reference Position 7 in the Cache.
Therefore, there must be some sort of look-up table that maps certain RAM Locations to their spot on the Cache. And this look-up table has to be stored somewhere near the Cache.
Obviously, it would be pretty stupid to have a seperate Cache location for each Byte or Word we load from the RAM. Our look-up table would be bigger than the Cache itself!
Therefore, we don't just map single Bytes of RAM to the Cache, instead we map little 64 Byte chunks. And we call these little chunks a Cache-Line.
(Note: 64 Byte is about the average Cache-Line size you might expect around the date this Article is written. I will keep using 64 Byte Cache-Lines for the remainder of this Article. Replace by X if this should not hold true for your Processor / Alternate Timeline)
Storing entire Cache-Lines reduces the overhead of mapping Cache Locations to RAM Locations. Furthermore, if we were to restrict the mapping to only map to addresses that are a multiple of 64 Bytes, we can actually shorten the Reference Address we store for the RAM location, by omitting the last few bits entirely, as they are always 0.
And indeed, we find that Cache-Lines are loaded at 64-Byte offsets or boundaries from the RAM.
We therefore load entire 64 Byte chunks from the RAM into the Cache "at once". Well, not entirely "at once", as the Bus between the RAM and Cache isn't wide enough to achieve this. It would require 512 bits of Data to actually get the entire Cache-Line at once. That just doesn't happen with current hardware. Most CPU's
wouldn't even have enough pins dedicated to Memory to actually achieve all that. Which is fine, the entire Cache-Line gets read piece by piece.
The 64 Bytes we load are the 64 Bytes that contain our requested Memory Location.
Now, we don't need to kick individual Bytes around from Cache to Cache if we need space, we can kick entire Cache-Lines between L1 and L3.
We've talked about reading Data from the RAM thusfar. But what about writing Data to the RAM? Well, it should be obvious that we can't just write Data to the RAM without editing the Cache Line as well. Otherwise we would be working with outdated Data Sets most of the time.
What actually happens is, that the Memory gets altered in the Cache and is only "occasionally" updated / written to the RAM. Fo