In-depth: Tricks with the floating-point format

In this reprinted #altdevblogaday (http://altdevblogaday.com/) in-depth piece, Valve Software programmer Bruce Dawson explains a trick for doing epsilon floating-point comparisons by using integer comparisons.

Bruce Dawson, Blogger

January 11, 2012


Years ago, I wrote an article about how to do epsilon floating-point comparisons by using integer comparisons. That article has been quite popular (it is frequently cited, and the code samples have been used by a number of companies), and this worries me a bit, because the article has some flaws. I'm not going to link to the article because I want to replace it, not send people looking for it.

Today I am going to start laying the groundwork for explaining how and why this trick works, while also exploring the weird and wonderful world of floating-point math.

There are lots of references that explain the layout and decoding of floating-point numbers. In this post I am going to supply the layout, and then show how to reverse engineer the decoding process through experimentation.

The IEEE 754-1985 standard specifies the format for 32-bit floating-point numbers, the type known as 'float' in many languages. The 2008 version of the standard adds new formats but doesn't change the existing ones, which have been standardized for over 25 years.

A 32-bit float consists of a one-bit sign field, an eight-bit exponent field, and a twenty-three-bit mantissa field. The union below shows the layout of a 32-bit float. This union is very useful for exploring and working with the internals of floating-point numbers. I don't recommend using this union for production coding, but it is useful for learning.

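#include <stdint.h>

union Float_t
{
    int32_t i;    // The float's bits viewed as a signed integer
    float f;
    struct
    {
        // Bitfield order is compiler dependent; this order works on Visual C++ for x86 and x64.
        unsigned mantissa : 23;
        unsigned exponent : 8;
        unsigned sign : 1;
    } parts;
};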
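If you want the same view of a float's bits outside of exploratory code, one common alternative to the union is to copy the bits with memcpy and extract the fields with shifts and masks. That sidesteps the compiler-dependent bitfield layout, and it also avoids reading an inactive union member, which is undefined behavior in standard C++ even though most compilers support it. The following is only a minimal sketch and is not part of the original code: the helper names (float_to_bits, bits_to_float, sign_field, exponent_field, mantissa_field) are hypothetical, and it assumes that float is a 32-bit IEEE 754 type. It is written to compile as either C or C++.

```cpp
#include <assert.h>   /* provides static_assert when compiled as C11 */
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Assumption: float is the 32-bit IEEE 754 type discussed in this post. */
static_assert(sizeof(float) == sizeof(uint32_t), "this sketch assumes a 32-bit float");

/* Hypothetical helper: return the raw bits of a float. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical helper: build a float from raw bits. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Field extraction by shifting and masking, independent of bitfield layout. */
static uint32_t sign_field(uint32_t bits)     { return bits >> 31; }
static uint32_t exponent_field(uint32_t bits) { return (bits >> 23) & 0xFFu; }
static uint32_t mantissa_field(uint32_t bits) { return bits & 0x7FFFFFu; }
```

With these helpers, float_to_bits(1.0f) yields 0x3F800000, and exponent_field() of that value is 127.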

The format for 32-bit float numbers was carefully designed to allow them to be put in a union with an integer, and the aliasing of 'i' and 'f' should work on all platforms, with the sign bit of the integer and the float occupying the same location. The layout of bitfields is compiler dependent, so the bitfield struct that is also in the union may not work on all platforms. However, it works on Visual C++ on x86 and x64, which is good enough for my exploratory purposes.

In order to really understand floats, it is important to explore and experiment. One way to explore is to write code like this, in a debug build so that the compiler doesn't optimize it away:

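Float_t num;
num.f = 1.0f;
num.i -= 1;    // Step to the next smaller float, just below 1.0
printf("Float value, representation, sign, exponent, mantissa\n");
// Deliberately endless: break on the printf and edit num between iterations.
for (;;)
{
    printf("%1.8e, 0x%08X, %d, %d, 0x%06X\n",
                num.f, num.i,
                num.parts.sign, num.parts.exponent, num.parts.mantissa);
}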

Put a breakpoint on the 'printf' statement and then add the various components of num to your debugger's watch window and examine them, like this:

[Screenshot: the components of num in the debugger's watch window]

You can then start trying interactive experiments, such as incrementing the mantissa or exponent fields, incrementing num.i, or toggling the value of the sign field. As you do this you should watch num.f to see how it changes. Or, assign various floating-point values to num.f and see how the other fields change. You can either view the results in the debugger's watch window, or hit 'Run' after each change so that the printf statement executes and prints some nicely formatted results.

Go ahead. Put Float_t and the sample code into a project and play around with it for a few minutes. Discover the minimum and maximum float values. Experiment with the minimum and maximum mantissa values in various combinations. Think about the implications. This is the best way to learn. I'll wait.

I've put some of the results that you might encounter during this experimentation into the table below:

[Table: float value, integer representation, sign, exponent field, and mantissa field for the values discussed later in this post]

With this information we can begin to understand the decoding of floats. Floats use a base-two exponential format, so we would expect the decoding to be mantissa * 2^exponent. However, in the encodings for 1.0 and 2.0 the mantissa is zero, so how can this work?

It works because of a clever trick. Normalized numbers in base-two scientific notation are always of the form 1.xxxx*2^exp, so storing the leading one is not necessary. By omitting the leading one we get an extra bit of precision: the 23-bit field of a float actually manages to hold 24 bits of precision because there is an implied 'one' bit with a value of 0x800000.

The exponent for 1.0 should be zero but the exponent field is 127. That's because the exponent is stored in excess 127 form. To convert from the value in the exponent field to the value of the exponent you simply subtract 127.

The two exceptions to this exponent rule are when the exponent field is 255 or zero.
255 is a special exponent value that indicates that the float is either infinity or a NaN (not-a-number), with a zero mantissa indicating infinity. Zero is a special exponent value that indicates that there is no implied leading one, meaning that these numbers are not normalized. This is necessary in order to exactly represent zero. The exponent value in that case is -126, which is the same as when the exponent field is one.

To clarify the exponent rules I've added an "Exponent value" column which shows the actual binary exponent implied by the exponent field:

[Table: the same values with an added "Exponent value" column]

Although these examples don't show it, negative numbers are dealt with by setting the sign field to 1, which is called sign-and-magnitude form. All numbers, even zero and infinity, have negative versions.

The numbers in this chart were chosen in order to demonstrate various things:

0.0: It's handy that zero is represented by all zeroes. However there is also a negative zero which has the sign bit set. Negative zero is equal to positive zero.
1.40129846e-45: This is the smallest positive float, and its integer representation is the smallest positive integer.
1.17549435e-38: This is the smallest float with an implied leading one, the smallest number with a non-zero exponent, the smallest normalized float. This number is also FLT_MIN. Note that FLT_MIN is not the smallest float. There are actually about 8 million positive floats smaller than FLT_MIN.
0.2: This is an example of one of the many decimal numbers that cannot be precisely represented with a binary floating-point format. That mantissa wants to repeat 'C' forever.
1.0: Note the exponent and the mantissa, and memorize the integer representation in case you see it in hex dumps.
1.5, 1.75: Just a couple of slightly larger numbers to show the mantissa changing while the exponent stays the same.
1.99999988: This is the largest float that has the same exponent as 1.0, and the largest float that is smaller than 2.0.
2.0: Notice that the exponent is one higher than for 1.0, and the integer representation and exponent are one higher than for 1.99999988.
16,777,215: This is the largest odd float. The next larger float has an exponent value of 24, which means the mantissa is shifted enough left that odd numbers are impossible. Note that this means that above 16,777,216 a float has less precision than an int.
3.40282347e+38: FLT_MAX. The largest finite float, with the maximum finite exponent and the maximum mantissa.
Positive infinity: The papa bear of floats.
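The entries in the list above can be checked without a debugger. The sketch below is my own illustration rather than code from the original article: format_float_row() produces one row in the same column order as the earlier printf (value, representation, sign, exponent field, mantissa field), and next_float_up() steps to the adjacent float by incrementing the integer representation. All helper names are hypothetical, the code assumes a 32-bit IEEE 754 float, and it uses memcpy instead of the union so that it does not depend on bitfield layout.

```cpp
#include <assert.h>   /* assert; static_assert when compiled as C11 */
#include <stdint.h>
#include <stdio.h>    /* snprintf */
#include <string.h>   /* memcpy */

/* Assumption: float is the 32-bit IEEE 754 type discussed in this post. */
static_assert(sizeof(float) == sizeof(uint32_t), "this sketch assumes a 32-bit float");

/* Hypothetical helper: return the raw bits of a float. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical helper: build a float from raw bits. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: format one row as
   value, representation, sign, exponent field, mantissa field.
   Returns whatever snprintf returns. */
static int format_float_row(float f, char *buf, size_t size)
{
    uint32_t bits = float_to_bits(f);
    return snprintf(buf, size, "%1.8e, 0x%08lX, %lu, %lu, 0x%06lX",
                    (double)f,
                    (unsigned long)bits,
                    (unsigned long)(bits >> 31),
                    (unsigned long)((bits >> 23) & 0xFFu),
                    (unsigned long)(bits & 0x7FFFFFu));
}

/* Hypothetical helper: the next larger float, found by adding one to the
   integer representation. Only valid for non-negative finite inputs. */
static float next_float_up(float f)
{
    uint32_t bits = float_to_bits(f);
    assert(bits < 0x7F800000u);   /* sign bit clear and exponent field below 255 */
    return bits_to_float(bits + 1u);
}
```

For example, next_float_up() applied to the float with representation 0x3FFFFFFF (the 1.99999988 entry) gives exactly 2.0, and next_float_up(16777216.0f) gives 16777218.0f, skipping the odd value 16,777,217 as the list describes.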

We can now describe how to decode the float format:

If the exponent field is 255 then the number is infinity (if the mantissa is zero) or a NaN (if the mantissa is non-zero).
If the exponent field is from 1 to 254 then the exponent is between -126 and 127, there is an implied leading one, and the float's value is:
    (1.0 + mantissa-field / 0x800000) * 2^(exponent-field - 127)
If the exponent field is zero then the exponent is -126, there is no implied leading one, and the float's value is:
    (mantissa-field / 0x800000) * 2^-126
If the sign bit is set then negate the value of the float.
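The decoding rules above translate almost line for line into code. The following is a minimal sketch of my own, not code from the original article: decode_float_bits() and two_to_the() are hypothetical names, the result is returned as a double so that every float value is represented exactly, and the power of two is computed with a simple loop to keep the example free of math-library calls.

```cpp
#include <assert.h>
#include <math.h>     /* INFINITY and NAN macros only */
#include <stdint.h>

/* Hypothetical helper: 2^e as a double, for the exponents the rules can produce. */
static double two_to_the(int e)
{
    double result = 1.0;
    assert(e >= -126 && e <= 127);
    for (; e > 0; --e) result *= 2.0;   /* exact: only the exponent changes */
    for (; e < 0; ++e) result *= 0.5;
    return result;
}

/* Hypothetical helper: decode the raw bits of a 32-bit float by following
   the rules in the text. */
static double decode_float_bits(uint32_t bits)
{
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;
    double value;

    if (exponent == 255)
    {
        /* Infinity if the mantissa is zero, otherwise a NaN. */
        value = (mantissa == 0) ? (double)INFINITY : (double)NAN;
    }
    else if (exponent != 0)
    {
        /* Exponent field 1 to 254: implied leading one, excess-127 exponent. */
        value = (1.0 + mantissa / (double)0x800000) * two_to_the((int)exponent - 127);
    }
    else
    {
        /* Exponent field zero: no implied leading one, exponent fixed at -126. */
        value = (mantissa / (double)0x800000) * two_to_the(-126);
    }

    return sign ? -value : value;
}
```

As a quick check, decode_float_bits(0x3F800000) returns 1.0 and decode_float_bits(0x4B7FFFFF) returns 16777215.0, matching the entries discussed earlier.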

The excess-127 exponent and the omitted leading one lead to some very convenient characteristics of floats, but I've rambled on too long so those must be saved for the next post, in a fortnight*1.0714.

[This piece was reprinted from #AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]

About the Author(s)

Bruce Dawson

Blogger

Bruce is the director of technology at Humongous Entertainment, which means he gets to work on all the fun and challenging tasks that nobody else has time to do. He also teaches part-time at DigiPen. Prior to Humongous Entertainment he worked at Cavedog Entertainment, assisting various product teams. Bruce worked for several years at the Elastic Reality branch of Avid Technology, writing special effects software and video editing plug-ins, and before that worked at Electronic Arts Canada, back when it was called Distinctive Software. There he wrote his first computer games, for the Commodore Amiga, back when thirty-two colours was state of the art. Bruce presented a paper at GDC 2001 called "What Happened to my Colours?!?" about the quirks of NTSC. He is currently trying to perfect the ultimate Python script that will automate all of his job duties - so that he can spend more time playing with the console machine he won at a poker game. Bruce lives with his wonderful wife and two exceptional children near Seattle, Washington where he tries to convince his coworkers that all computer programmers should juggle, unicycle and balance on a tight wire. You can contact Bruce Dawson at: [email protected]
