*[In this reprinted #altdevblogaday in-depth piece, Valve Software programmer Bruce Dawson shares a few seemingly "fantastic/improbable" tricks he's picked up with the floating-point format.]*

I left the last post with a promise to share an interesting property of the IEEE float format. There are several equivalent ways of stating this property; here are two of them. For floats of the same sign:

- Adjacent floats have adjacent integer representations
- Incrementing the integer representation of a float moves to the next representable float, moving away from zero

The code below walks through every positive float by repeatedly incrementing the integer representation:

```c
union Float_t
{
    int32_t i;
    float f;
    struct
    {
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
};

void IterateAllPositiveFloats()
{
    // Start at zero and print that float.
    Float_t allFloats;
    allFloats.f = 0.0f;
    printf("%1.8e\n", allFloats.f);

    // Continue through all of the floats, stopping
    // when we get to positive infinity.
    while (allFloats.parts.exponent < 255)
    {
        // Increment the integer representation to move
        // to the next float.
        allFloats.i += 1;
        printf("%1.8e\n", allFloats.f);
    }
}
```

The (partial) output looks like this:

```
0.00000000e+000
1.40129846e-045
2.80259693e-045
4.20389539e-045
5.60519386e-045
7.00649232e-045
8.40779079e-045
9.80908925e-045
…
3.40282306e+038
3.40282326e+038
3.40282347e+038
1.#INF0000e+000
```

For double precision floats you could use _nextafter() to walk through all of the available doubles, but I'm not aware of a simple and portable alternative to this technique for 32-bit floats. We can use this property and the Float_t union to find out how much precision a float variable has at a particular range. We can assign a float to Float_t::f, then increment or decrement the integer representation, and then compare the before/after float values to see how much they have changed. Here is some sample code that does this:

```c
float TestFloatPrecisionAwayFromZero(float input)
{
    union Float_t num;
    num.f = input;
    // Incrementing infinity or a NaN would be bad!
    assert(num.parts.exponent < 255);
    // Increment the integer representation of our value
    num.i += 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return delta;
}

float TestFloatPrecisionTowardsZero(float input)
{
    union Float_t num;
    num.f = input;
    // Decrementing from zero would be bad!
    assert(num.parts.exponent || num.parts.mantissa);
    // Decrementing a NaN would be bad!
    assert(num.parts.exponent != 255 || num.parts.mantissa == 0);
    // Decrement the integer representation of our value
    num.i -= 1;
    // Subtract the initial value to find our precision
    float delta = num.f - input;
    return -delta;
}

struct TwoFloats
{
    float awayDelta;
    float towardsDelta;
};

struct TwoFloats TestFloatPrecision(float input)
{
    struct TwoFloats result =
    {
        TestFloatPrecisionAwayFromZero(input),
        TestFloatPrecisionTowardsZero(input),
    };
    return result;
}
```

Note that the difference between the values of two adjacent floats can always be stored exactly in a (possibly subnormal) float. I have a truly marvelous proof of this theorem which the margin is too small to contain. These functions can be called from test code to learn about the float format. Better yet, when sitting at a breakpoint in Visual Studio you can call them from the watch window, which allows dynamic exploration of precision. Usually the delta is the same whether you increment the integer representation or decrement it. However, if incrementing or decrementing changes the exponent, then the two deltas will be different. This can be seen at 65536, where the precision is twice as good (half the delta) going towards zero compared to going away from zero.

**Caveat incrementor**

Pay close attention to the number of caveats that you have to watch for when you start partying on the integer representation of a float. It's safe in a controlled environment, but things can quickly go bad in the real world:

- Some compilers may prohibit the type-punning/aliasing used by Float_t (gcc and VC++ allow it)
- Incrementing the integer representation of infinity gives you a NaN
- Decrementing the integer representation of zero gives you a NaN
- Incrementing or decrementing some NaNs will give you zero or infinity
- The ratio of the values of two adjacent floats is usually no more than about 1.00000012 (one part in 2^23), but for subnormals it can be much larger: the two smallest positive floats differ by a factor of two
- The representations of positive and negative zero are far removed from each other
- The representations of FLT_MIN and -FLT_MIN are as far from each other as the representations of FLT_MAX and -FLT_MAX
- Floating-point math is *always* more complicated than you expect

**Log rolling**

A related property is that, for floats that are positive, finite, and non-zero, the integer representation of a float is a piecewise linear approximation of its base-2 logarithm. I just like saying that. It sounds cool. The reason that the integer representation is (after appropriate scaling and biasing) a piecewise linear approximation of the base-2 logarithm of a (positive) float is that the exponent field is logically the base-2 logarithm of the float, and it sits in the high bits; the mantissa then linearly interpolates between power-of-2 floats. The code below demonstrates the concept:

```c
void PlotFloatsVersusRep()
{
    // Let's plot some floats and their representations from 1.0 to 32.0
    for (float f = 1.0f; f <= 32.0f; f *= 1.01f)
    {
        Float_t num;
        num.f = f;
        // The representation of 1.0f is 0x3f800000 and the representation
        // of 2.0f is 0x40000000, so if the representation of a float is
        // an approximation of its base 2 log then 0x3f800000 must be
        // log2(1.0) == 0 and 0x40000000 must be log2(2.0) == 1.
        // Therefore we should scale the integer representation by
        // subtracting 0x3f800000 and dividing by (0x40000000 - 0x3f800000)
        double log2Estimate = (num.i - 0x3f800000) / double(0x40000000 - 0x3f800000);
        printf("%1.5f,%1.5f\n", f, log2Estimate);
    }
}
```

If we drop the results into Excel and plot them with the x-axis on a base-2 log scale then we get a lovely chart. If it's plotted linearly then the 'linear' part of 'piecewise linear' becomes obvious, but I like the slightly scalloped straight line better. The estimate is exact when the float is a power of two, and at its worst is about 0.086 too small. In the last of this week's stupid float tricks, I present you with this dodgy code:

```c
int RoundedUpLogTwo(uint64_t input)
{
    assert(input > 0);
    union Float_t num;
    num.f = (float)input;
    // Increment the representation enough that for any non
    // power-of-two (FLT_MIN or greater) we will move to the next exponent.
    num.i += 0x7FFFFF;
    // Return the exponent after subtracting the bias.
    return num.parts.exponent - 127;
}
```

Depending on how you think about such things, this is either the simplest and most elegant, or the most complicated and obtuse, way of finding out how many bits it takes to represent a particular integer.

**Random aside**

The IEEE standard guarantees correct rounding for addition, subtraction, multiplication, division, and square root. If you've ever wondered whether this is important, then try this: using the Windows calculator (I'm using the Windows 7 version), calculate sqrt(4) - 2. The answer should, of course, be zero. However, the answer that the calculator actually returns is:

```
Scientific mode: -8.1648465955514287168521180122928e-39
Standard mode:   -1.068281969439142e-19
```

This is utterly fascinating. It shows that the calculator is using impressive precision (it did the calculations to *40* digits of accuracy, 20 digits in standard mode) and yet it got the wrong answer. Because the Windows calculator is using so much precision it will, for many calculations, get a more accurate answer than an IEEE float or double. However, because it fails to correctly round the answer (the last digit cannot be trusted) it sometimes gives answers that are laughably wrong. I can just imagine the two calculator modes arguing: the standard mode says "according to my calculations sqrt(4) is 1.9999999999999999999" and the scientific mode says "according to *my* calculations it's 1.999999999999999999999999999999999999992, which is far more accurate". Meanwhile an IEEE float says "I've only got about eight digits of precision, but I think sqrt(4) is 2." Having lots of digits of precision is nice, but without correct rounding it can just end up making you look foolish.

**That's all folks**

In the next post I'll discuss some concrete examples that show the value of knowing how much precision is available around certain values. Until next time, float on.

*[This piece was reprinted from #AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]*