(cross posted to http://mtnphil.wordpress.com/2012/01/06/identifying-entities-in-the-game-world/)
Everywhere in the game engine are identifiers that let us access important objects. These objects might be game entities, scripts, 3d model names, etc… We can’t use pointers or live managed references to access these objects since we need to be able to persist these identifiers for save games or access from scripts. If the resource has a distinct name, then we can use strings. Strings, however, take up a lot of space, are slow to compare against, and – in the managed world – impact garbage collection cost.
So generally we use some sort of integer that can uniquely identify the item. The problem then becomes how to generate these integer identifiers and maintain them.
My first attempt at this used a giant monolithic string list. When any piece of code requested an integer identifier for a string, we would look in our list and hand back the index. If the string was not present, we would add it to the list and hand back a new id. We started from 0, and the ids simply increased as we added more strings.
Now, this requires maintenance of this monolithic string list. Once I have generated an id for a string, it cannever change. That means I can never remove any string from the list (well, I could replace it with an empty string), re-order the list or anything. If any change is made to the game world that adds a new string, I need to ensure the monolithic list is correctly updated and saved out, always in sync.
While that’s doable, it still frightens me a little. If the list did get out of sync, it might produce very insidious bugs.
I had briefly thought of just using string hashes for the ids, but quickly dismissed it due to the risk of collisions (two strings producing the same hash code). However, a chapter I came across in Game Engine Architecture changed my mind. The author stated that they used a 32 bit hash for strings, and didn’t encounter a single collision in two years of development on a game. Even in the rare case a collision was encountered, it is easily detected and easily fixed by changing one of the strings.
And so I began looking into that. It would simply a lot if it would really work.
Unfortunately, the .net implementation of GetHashCode() for strings, while quick and sufficient for use in a hash table, isn’t good enough for these purposes. Ideally the hash would result in large differences in the number for small differences in input. I noticed that the values returned for similar strings were very similar.
Even more important though, is that the value returned by GetHashCode() is not guaranteed to be consistent across platforms or versions of .net. The values it produces really cannot be persisted.
So I found a CRC-32 implementation on the web and adapted it for my purposes. Since I control the implementation I can guarantee consistent results across platforms and .net versions, and the distribution of values is also much better.
In order to convince myself of the unlikelihood of collisions, I found a large (58,000) list of English words on the internet and ran them through the hash. No collisions at 32 bits. Or 31 or 30 bits. 1 at 29 bits, and 2 at 28 bits. Seemed good enough for me.
Just for kicks I ran it through the .net GetHashCode() implementation too. 1 collision at 32 bits (“nominee” and “mineshaft”), and 14 collisions at 28 bits.
So now I’ve converted most of my code that needs this to using these hash ids, eliminating a lot of the data files I needed to maintain and simplifying my code quite a bit. In many cases, I don’t even need to load the actual strings into the game – the ids are all resolved at compile time, and they strings themselves never need to be seen by the user (they can be loaded into the debug version of the game or the game editor though). I will at some point need to write some code that warns collisions, but that shouldn’t be very difficult.
I use these ids for many purposes: for variables in scripts, to identify words in NPC conversations, and to identify named game entities.
For game entities, I need to be careful. Not all game entities have names. They only need a name if they are a unique item in the world and they need to be referenced from a script. A random piece of fence, a chair, or a weapon dropped as loot in combat do not need names. They may still need a unique identifier, however. If that weapon is moved into a character’s inventory, I need some way to track that.
I don’t want two separate notions of id though, so I have decided to allocate specific zones in a 32 bit space to different types of ids. All ids with the high bit set are string hashes (yes, I’m only using 31 bits, but I’m still not worried about collisions). The bottom half is reserved for unnamed entity ids which are allocated by an ever-increasing value that I need to “maintain”. Thus I avoid collisions between the two types. (I actually have additional requirements that result in me segregating the 32 bit space into 4 different zones, but I won’t go into those details now).
Generating string hashes is certainly not cheap (you need to check each character of the string), but I don’t need to do it in tight loops anywhere. And as much as possible they are resolved at compile time.