Opinion: Practical Floating Point

[In this reprinted #altdevblogaday-opinion piece, Bungie's engineer architect Andy Firth offers some tips to help programmers push floating point performance and relative accuracy further, and points out pitfalls he's fallen into over the years.]

Floating point is ubiquitous in programming these days. Hardware has improved to the point where in many environments it is actually faster to use floating point as opposed to integer (this wasn't the case a decade ago). This post will attempt to educate the reader on various "tricks" that can help push floating point performance and relative accuracy even further, and allow the programmer to avoid some of the pitfalls I have fallen into over the years.

This article will assume basic knowledge of floating point numbers. We've all used them, likely for all and sundry within our games and tools, so here's a question to ask yourself: "What does the code below print out for the 'test' variable?"

    float small_value= 1.0f;
    float smaller_value= 1 / 100000000.0f;
    float test= small_value + smaller_value;
    printf("{%g,%g} => {0x%08x,0x%08x}\n",
         small_value, test, *(int*)&small_value, *(int*)&test);
Almost all programmers I spoke to assumed the value of test to be 1.00000001f, and of course mathematically it should be. However, it is not. The print out would show the following:

{1,1} => {0x3f800000,0x3f800000}
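One way to see why the sum collapses back to 1.0f is to look at the gap between 1.0f and the next representable float. The following is a minimal sketch, assuming IEEE 754 single and double precision and a compiler that is not running in a fast-math mode; the helper names `float_gap_at_one` and `is_absorbed` are illustrative and do not come from the original post.

```cpp
#include <cmath>
#include <limits>

// Gap between 1.0f and the next representable float above it.
// For IEEE 754 single precision this is 2^-23, roughly 1.19e-07.
inline float float_gap_at_one()
{
    return std::nextafter(1.0f, 2.0f) - 1.0f;
}

// True when adding `increment` to `base` leaves `base` unchanged,
// i.e. the increment is lost to rounding.
inline bool is_absorbed(float base, float increment)
{
    // volatile forces the sum to be stored as a float rather than
    // being held in a wider register.
    volatile float sum = base + increment;
    return sum == base;
}

// Same check in double precision.
inline bool is_absorbed(double base, double increment)
{
    volatile double sum = base + increment;
    return sum == base;
}
```

Under these assumptions, `is_absorbed(1.0f, 1.0f / 100000000.0f)` is true because the increment is smaller than half the gap at 1.0f, and `is_absorbed(1.0, 1.0 / 10000000000000000.0)` is also true for the same reason in double precision.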
Single precision floating point simply cannot represent the accuracy the result requires, and as such, the result is kept at 1.0f. Now if you are mathematically minded, this will likely poke at your OCD gene and "force" you into using doubles. This is a perfectly palatable solution in many situations where one does not require speed. However, double still suffers the same fate if you double the number of zeros in the denominator (1 / 10000000000000000.0).

For those of us who work in games, math is important, but it's far less important than simulation determinism and getting things done expediently whilst executing within performance budgets. This forces us to make decisions that might otherwise be seen as somewhat mathematically incorrect. The above is one such circumstance. In most game applications, we cannot afford to switch math to use doubles, and as such, the inherent limitations of single precision floating point math are deemed acceptable, even becoming "normal". I've been working with them for so long that it is now natural to apply these limitations in all circumstances as par for the course.

As was mentioned in the link I posted above, comparing floating point numbers using == is only sensible in a purely deterministic (read: constant) form. If math is being used to generate the values to be compared, then great care has to be taken; great care is rarely taken. The general rule in most studios is simple: don't compare floats using == on pain of death or public embarrassment. The usual method is to apply some form of epsilon to the comparison:

    // fabsf (the float version of fabs) is declared in <math.h>
    float a= 1.0f;
    float b= 1.0005f;
    static const float epsilon= 0.0001f;
    float temp= fabsf(b - a);
    if(temp > epsilon)
    {
        // not the same
    }
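The comparison above can be wrapped in a small helper so that each call site supplies the epsilon appropriate to its data. This is only a sketch; the name `nearly_equal` and its signature are assumptions made for illustration, not something the original post defines.

```cpp
#include <cmath>

// Hypothetical helper: true when a and b differ by no more than epsilon.
// The caller chooses epsilon, since a suitable value depends on the data.
inline bool nearly_equal(float a, float b, float epsilon)
{
    return std::fabs(a - b) <= epsilon;
}
```

With the values from the snippet above, `nearly_equal(1.0f, 1.0005f, 0.0001f)` is false, which corresponds to the "not the same" branch.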
Within games, this is one method used to enforce determinism; apply a decent epsilon, and most math "behaves" (this doesn't mean it's accurate, mind). One complication, however, is that the right epsilon is not always obvious, nor can it always be constant, especially within helper classes such as Vector Math. A good epsilon is usually dependent upon the data you're representing. If you are dealing in world coordinates, for instance, where 1.0f = 1m, then:

0.01    = 1cm
0.001   = 1mm
0.0001  = 100µm
0.00001 = 10µm
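Part of the reason an epsilon depends on the data is that the gap between adjacent floats grows with the magnitude of the value. The sketch below measures that gap, assuming IEEE 754 single precision; `gap_above` and `epsilon_is_resolvable` are hypothetical names introduced here for illustration.

```cpp
#include <cmath>
#include <limits>

// Distance from `value` to the next representable float above it.
inline float gap_above(float value)
{
    return std::nextafter(value, std::numeric_limits<float>::infinity()) - value;
}

// Hypothetical check: an epsilon is only meaningful at a given magnitude
// if it is at least as large as the gap between adjacent floats there.
inline bool epsilon_is_resolvable(float magnitude, float epsilon)
{
    return epsilon >= gap_above(std::fabs(magnitude));
}
```

With 1.0f = 1m, the gap at 10000.0f is 0.0009765625 (just under 1mm), while at 100000.0f it is 0.0078125, so a 1mm epsilon can no longer separate neighbouring values at that distance from the origin.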
Many of the games I've worked on use 1mm as their positional epsilon (0.001f) and 10µm for directional/rotational epsilon (0.00001f). For positional epsilon, this provides an effective range of approximately +/- 10000.001 and for rotational epsilon +/- 1000.0001f. The general rule of thumb I use is 8 decimal places between your highest and lowest accuracy requirements. If you feel you need a larger range than this but still want accuracy, then consider using a reference frame: one value to represent low accuracy high values (say kilometers) and another to represent high accuracy low values (meters down to µm).

Now one area that seems to catch a lot of people out (myself included on many occasions) is Infinity & NaN Rules… so here is a handy table borrowed from here:
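A few of the standard IEEE 754 rules for Infinity and NaN can also be checked directly in code. The sketch below is not the table referred to above; it is a small selection of well-known cases, and it assumes an IEEE 754 conforming float type with fast-math style optimizations disabled. The struct and function names are illustrative.

```cpp
#include <cmath>
#include <limits>

// Results of a handful of Infinity/NaN checks; every field is expected
// to be true on an IEEE 754 conforming implementation.
struct inf_nan_checks
{
    bool inf_plus_one_is_inf;
    bool inf_minus_inf_is_nan;
    bool zero_times_inf_is_nan;
    bool inf_over_inf_is_nan;
    bool one_over_inf_is_zero;
    bool nan_not_equal_to_itself;
    bool nan_propagates;
};

inline inf_nan_checks run_inf_nan_checks()
{
    // volatile keeps the compiler from folding the expressions away.
    volatile float inf = std::numeric_limits<float>::infinity();
    volatile float nan = std::numeric_limits<float>::quiet_NaN();
    volatile float zero = 0.0f;

    inf_nan_checks r;
    r.inf_plus_one_is_inf     = std::isinf(inf + 1.0f);
    r.inf_minus_inf_is_nan    = std::isnan(inf - inf);
    r.zero_times_inf_is_nan   = std::isnan(zero * inf);
    r.inf_over_inf_is_nan     = std::isnan(inf / inf);
    r.one_over_inf_is_zero    = (1.0f / inf) == 0.0f;
    r.nan_not_equal_to_itself = (nan != nan);
    r.nan_propagates          = std::isnan(nan + 1.0f) && std::isnan(nan * zero);
    return r;
}
```

The practical consequence for game code is the last field: once a NaN appears, ordinary arithmetic on it keeps producing NaN, so a single bad value can spread through every result that depends on it.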
On the subject of error handling, consider the standard method of normalizing a 3D vector:

vector3 normalize_vector(const vector3 & vec)
{
    float length= vector_length(vec);
    vector3 result=vec;
    // only divide when the length is non-zero
    if(length > 0.0f)
    {
        float reciprocal= 1.0f / length;
        result.x *= reciprocal;
        result.y *= reciprocal;
        result.z *= reciprocal;
    }
    return result;
}
We want to avoid introducing a problem by way of an INF=>NAN in the returned data, or throwing an exception, so a branch is inserted. This effectively removes the divide by zero problem, but at great cost: on many platforms, the branch can cause a pipeline flush, resulting in significant performance issues. The problem is there really isn't another way to achieve the same avoidance mathematically. But there is a method of avoiding it if you rely upon floating point math.

We've established that a large value remains the same when a small value is added, and we've also discussed that the effective range of floating point values for use in games is limited, both for determinism and to avoid issues with the math itself. Combining the two provides a rather elegant method of avoiding the divide by zero issue under normalization and many other well known situations:

const float very_small_float= 1.0e-037f;
vector3 normalize_vector_2(const vector3 & vec)
{
    // the added value keeps length non-zero, so no branch is needed
    float length= very_small_float + vector_length(vec);
    float reciprocal= 1.0f / length;
    vector3 result;
    result.x = vec.x * reciprocal;
    result.y = vec.y * reciprocal;
    result.z = vec.z * reciprocal;
    return result;
}
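To try the branchless version on its own, the sketch below repeats it with assumed definitions of `vector3` and `vector_length`, neither of which is shown in the original post. The `is_finite_vector` helper is also hypothetical and exists only to make the result easy to check.

```cpp
#include <cmath>

// Assumed layout; the original post does not show the vector3 type.
struct vector3
{
    float x, y, z;
};

// Assumed implementation: plain Euclidean length.
inline float vector_length(const vector3 & vec)
{
    return std::sqrt(vec.x * vec.x + vec.y * vec.y + vec.z * vec.z);
}

const float very_small_float = 1.0e-37f;

// Branchless normalize, as in the listing above.
inline vector3 normalize_vector_2(const vector3 & vec)
{
    float length = very_small_float + vector_length(vec);
    float reciprocal = 1.0f / length;
    vector3 result;
    result.x = vec.x * reciprocal;
    result.y = vec.y * reciprocal;
    result.z = vec.z * reciprocal;
    return result;
}

// Hypothetical helper: true when no component is INF or NaN.
inline bool is_finite_vector(const vector3 & v)
{
    return std::isfinite(v.x) && std::isfinite(v.y) && std::isfinite(v.z);
}
```

With these definitions, a zero vector comes back as a zero vector rather than NaN, because the reciprocal of 1.0e-37f is still a finite float, and a vector such as (3, 4, 0) normalizes to approximately (0.6, 0.8, 0). Components so small that their squares underflow fall outside the limited range the technique assumes.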
Due to limiting the allowed range of vector3 components and understanding that addition of the very small value to any value larger than 1.0e-29 has zero effect, we have effectively removed the possibility of receiving INF and thus NAN as the result. (A denormal length is still possible, though much less likely.)

There is a plethora of information out there on floating point, both practical and theoretical. However, the above is my attempt to represent those cases not covered or not well highlighted, and hopefully to make people think about their implementations more in terms of the specific requirements and less in terms of "floating point numbers handle everything".

[This piece was reprinted from #AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]
