language agnostic - Is floating point math broken?

Question

52 votes

18 answers

Consider the following code:

0.1 + 0.2 == 0.3  ->  false

0.1 + 0.2         ->  0.30000000000000004

Why do these inaccuracies happen?

Undefined asked

2021-12-1

score 384 · Answer 1

Answer

Solution:

Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

0.1 in decimal, or
0x1.99999999999999...p-4 in an analogue of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
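These claims can be checked directly. The sketch below uses Python (my choice; the question itself is language agnostic) and its standard decimal and fractions modules, which convert a float to its exact stored value. The variable names are purely illustrative.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) and Fraction(float) convert the stored binary64 value exactly,
# with no rounding, so they reveal what the literal really became.
stored_tenth = Decimal(0.1)
hex_tenth = (0.1).hex()

# Compare each stored double with the rational number its literal stands for.
fifth_is_too_large = Fraction(0.2) > Fraction(1, 5)
three_tenths_is_too_small = Fraction(0.3) < Fraction(3, 10)
sum_is_too_large = Fraction(0.1 + 0.2) > Fraction(3, 10)

print(stored_tenth)   # 0.1000000000000000055511151231257827021181583404541015625
print(hex_tenth)      # 0x1.999999999999ap-4
print(fifth_is_too_large, three_tenths_is_too_small, sum_is_too_large)
```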

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
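As a rough illustration in Python, the sketch below compares with a tolerance and shows why a machine-epsilon constant makes a poor absolute tolerance. The helper name nearly_equal and the tolerance 1e-9 are assumptions chosen for the example, not recommendations; math.nextafter needs Python 3.9 or later.

```python
import math
import sys

def nearly_equal(x, y, tolerance):
    """Absolute-tolerance comparison: true when x and y differ by less than tolerance."""
    return abs(x - y) < tolerance

a = 0.1 + 0.2
b = 0.3
direct = (a == b)                      # exact comparison fails
tolerant = nearly_equal(a, b, 1e-9)    # 1e-9 is an arbitrary, application-specific choice

# Machine epsilon is the gap between 1.0 and the next double, so it only
# describes the spacing of doubles near 1.0. Near 1e6 two *adjacent* doubles
# are already further apart than epsilon.
big = 1e6
next_big = math.nextafter(big, math.inf)
gap_exceeds_epsilon = (next_big - big) > sys.float_info.epsilon

# A relative tolerance scales with the magnitude of the operands instead.
relatively_close = math.isclose(big, next_big, rel_tol=1e-9)

print(direct, tolerant, gap_exceeds_epsilon, relatively_close)
```

The relative form (math.isclose) is usually the easier one to reason about when the magnitude of the inputs is not known in advance.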

score 761 · Answer 2

Answer

Solution:

A Hardware Designer's Perspective

I believe I should add a hardware designer's perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.

1. Overview

From an engineering perspective, most floating point operations will have some element of error, since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Much hardware therefore stops at exactly the precision needed to meet that bound for a single operation, which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes: for most it is two, but some units take three or more. Because of this, there is no guarantee that repeated operations will stay within a desirable error, since the errors add up over time.

2. Standards

Most processors follow the IEEE-754 standard, but some use denormalized, or different, standards. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754, which is the typical mode of operation.

In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.

3. Cause of Rounding Error in Division

The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively, i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2 = 4 bits (plus a few optional bits).

3.1 Division Rounding Error: Approximation of Reciprocal

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
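To make "repeated multiplication by an approximation of a reciprocal" concrete, here is a software sketch in Python. It uses the Newton-Raphson reciprocal iteration, which is a different (though related) technique from the SRT and Goldschmidt dividers named above, and it is not a model of any particular hardware. The function name, the linear starting guess and the iteration count are all assumptions made for the example.

```python
import math

def reciprocal(y, iterations=5):
    """Approximate 1/y for a finite, non-zero y without using division on y.

    Newton-Raphson refinement: x_next = x * (2 - m * x), which roughly doubles
    the number of correct bits on every step.
    """
    m, e = math.frexp(y)              # y == m * 2**e with 0.5 <= |m| < 1
    sign = -1.0 if m < 0 else 1.0
    m = abs(m)
    x = 48 / 17 - (32 / 17) * m       # linear first guess for 1/m on [0.5, 1)
    for _ in range(iterations):
        x = x * (2.0 - m * x)         # each step multiplies by an approximate reciprocal
    return sign * math.ldexp(x, -e)   # undo the scaling: 1/y == (1/m) * 2**-e

approx_third = reciprocal(3.0)
print(approx_third, 1 / 3)
```

Every multiplication in the loop is itself rounded, so the final value can differ from the correctly rounded 1/y in the last bit or so; real dividers add guard bits and a final correction step to deal with exactly that.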

4. Rounding Errors in Other Operations: Truncation

Another cause of the rounding errors in all operations is the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.

5. Repeated Operations

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is why, in computations that require a bounded error, mathematicians use two kinds of method: rounding to the nearest even digit in the last place, because over time the errors are more likely to cancel each other out; and Interval Arithmetic combined with variations of the IEEE 754 rounding modes, to predict rounding errors and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place) is the default rounding mode of IEEE-754.

Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using the truncation, round-up, and round down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.
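The half-unit bound for a single operation can be measured in software. The Python sketch below (the function name is hypothetical; math.ulp needs Python 3.9 or later) computes the error of one addition exactly with rational arithmetic and expresses it in units in the last place of the result. For 0.1 + 0.2 the exact sum lands precisely halfway between two doubles, so the error is exactly one half; in other words the bound is "at most", not strictly "less than", one half in the tie case.

```python
import math
from fractions import Fraction

def addition_error_in_ulps(x, y):
    """Error of the computed x + y, in units in the last place of the result."""
    computed = x + y
    exact = Fraction(x) + Fraction(y)            # exact rational sum of the two doubles
    error = abs(Fraction(computed) - exact)
    return error / Fraction(math.ulp(computed))  # math.ulp gives the spacing at `computed`

print(addition_error_in_ulps(0.1, 0.2))    # 1/2
print(addition_error_in_ulps(0.5, 0.25))   # 0
```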

6. Summary

In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.
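The accumulation over repeated operations is easy to observe. This Python sketch adds 0.1 ten times with one rounding per step, then repeats the sum with math.fsum, a standard-library routine that keeps track of the low-order bits lost along the way. An explicit loop is used rather than the built-in sum(), because recent Python versions apply a compensated algorithm inside sum() for floats.

```python
import math

values = [0.1] * 10

naive_total = 0.0
for v in values:
    naive_total += v               # one rounding per addition; the errors pile up

careful_total = math.fsum(values)  # rounds once, at the very end

print(naive_total)    # 0.9999999999999999
print(careful_total)  # 1.0
```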

score 526 · Answer 3

Answer

Solution:

It's broken in the exact same way the decimal (base-10) notation you learned in grade school is broken, just for base-2.

To understand, think about representing 1/3 as a decimal value. It's impossible to do exactly! In the same way, 1/10 (decimal 0.1) cannot be represented exactly in base 2 (binary) as a "decimal" value; a repeating pattern after the decimal point goes on forever. The value is not exact, and therefore you can't do exact math with it using normal floating point methods.
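The repeating pattern can be generated with ordinary long division. The following Python sketch (function name hypothetical) produces the first few digits after the radix point of a fraction in any base up to 10, using integers only so that no floating point rounding is involved.

```python
def fraction_digits(numerator, denominator, base, count):
    """First `count` digits after the radix point of numerator/denominator in `base`."""
    digits = []
    remainder = numerator % denominator
    for _ in range(count):
        remainder *= base                          # shift one place to the left
        digits.append(remainder // denominator)    # next digit
        remainder %= denominator                   # carry the rest forward
    return "".join(str(d) for d in digits)         # single characters, so base <= 10

print(fraction_digits(1, 3, 10, 8))    # 33333333
print(fraction_digits(1, 10, 2, 12))   # 000110011001
print(fraction_digits(1, 10, 10, 4))   # 1000
```

The binary expansion of 1/10 settles into the block 0011 and never terminates, just as 1/3 never terminates in decimal.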

score 804 · Answer 4

Answer

Solution:

Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.

Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.

That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.

Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)

Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.

For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.

(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)

In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.

Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.

In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
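The two neighbours of 0.1 and the size of the "smallest slice" can be inspected from Python. This is only a sketch with illustrative variable names; math.nextafter requires Python 3.9 or later, and Fraction is used so that every difference is exact.

```python
import math
from decimal import Decimal
from fractions import Fraction

stored = 0.1                           # the double a parser picks for "0.1"
below = math.nextafter(stored, 0.0)    # the neighbouring double just under it
true_tenth = Fraction(1, 10)

overshoot = Fraction(stored) - true_tenth          # error if we take the upper neighbour
undershoot = true_tenth - Fraction(below)          # error if we take the lower neighbour
smallest_slice = Fraction(stored) - Fraction(below)  # one ulp at this magnitude

print(Decimal(below))
print(Decimal(stored))
print(overshoot < undershoot)   # the upper neighbour is closer, so it is chosen
print(smallest_slice == Fraction(1, 2**56))
```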

P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.
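Python's standard decimal module is one such "tenths" pizza cutter. A minimal sketch follows; note that the values must be built from strings, because building them from float literals would carry the binary approximation along.

```python
from decimal import Decimal

exact_sum = Decimal("0.1") + Decimal("0.2")
matches = (exact_sum == Decimal("0.3"))

# Constructing from floats imports the binary rounding error into the decimal world.
from_floats = Decimal(0.1) + Decimal(0.2)
still_off = (from_floats != Decimal("0.3"))

print(exact_sum, matches, still_off)   # 0.3 True True
```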

score 837 · Answer 5

Answer

Solution:

Floating point rounding errors. 0.1 cannot be represented as accurately in base-2 as in base-10 due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal, but is "0.1" in base-3, 0.1 takes an infinite number of digits in base-2 where it does not in base-10. And computers don't have an infinite amount of memory.
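The prime-factor rule can be turned into a small test: a fraction in lowest terms has a finite expansion in a given base exactly when every prime factor of its denominator also divides the base. The Python function below is a sketch of that rule; its name is made up for the example.

```python
from math import gcd

def terminates(numerator, denominator, base):
    """True if numerator/denominator has a finite expansion in the given base."""
    denominator //= gcd(numerator, denominator)   # reduce to lowest terms
    shared = gcd(denominator, base)
    while shared > 1:
        # Strip every factor the denominator has in common with the base.
        while denominator % shared == 0:
            denominator //= shared
        shared = gcd(denominator, base)
    return denominator == 1                       # nothing foreign to the base is left

print(terminates(1, 10, 10))   # True
print(terminates(1, 10, 2))    # False: the factor 5 does not divide 2
print(terminates(1, 3, 10))    # False
print(terminates(1, 3, 3))     # True
```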

score 925 · Answer 6

Answer

Solution:

My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.

Preamble

An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form

value = (-1)^s * (1.m₅₁m₅₀...m₂m₁m₀)₂ * 2^{e-1023}

in 64 bits:

The first bit is the sign bit: 1 if the number is negative, 0 otherwise¹.
The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always² omitted since the most significant bit of any binary value is 1.

¹ - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal².

² - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is d_min ≤ |x| ≤ d_max, where d_min (the smallest representable nonzero number) is 2^{-1023 - 51} (≈ 4.94 * 10^-324) and d_max (the largest denormal number, for which the mantissa consists entirely of 1s) is 2^{-1023 + 1} - 2^{-1023 - 51} (≈ 2.225 * 10^-308).
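The layout can be exercised in code. The Python sketch below (function name hypothetical) pulls the three fields out of a double with the standard struct module and rebuilds the exact value from the formula (-1)^s * 1.m * 2^(e-1023). It deliberately handles normal numbers only and rejects zeros, denormals, infinities and NaNs.

```python
import struct
from fractions import Fraction

def decode_normal_double(x):
    """Return (sign, exponent, mantissa, exact value) for a normal binary64 number."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))   # raw 64 bits as an integer
    sign = bits >> 63                       # 1 bit
    exponent = (bits >> 52) & 0x7FF         # 11 bits, still offset by 1023
    mantissa = bits & ((1 << 52) - 1)       # 52 stored bits
    if exponent in (0, 0x7FF):
        raise ValueError("zero, denormal, infinity or NaN: no implied leading 1")
    significand = Fraction((1 << 52) | mantissa, 1 << 52)   # re-attach the implied 1.
    value = (-1) ** sign * significand * Fraction(2) ** (exponent - 1023)
    return sign, exponent, mantissa, value

sign, exponent, mantissa, value = decode_normal_double(0.1)
print(sign, exponent - 1023, hex(mantissa))   # 0 -4 0x999999999999a
print(value == Fraction(0.1))                 # True
```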

Turning a double precision number to binary

Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:)):

public static string BinaryRepresentation(double value)
{
    long valueInLongType = BitConverter.DoubleToInt64Bits(value);
    string bits = Convert.ToString(valueInLongType, 2);
    string leadingZeros = new string('0', 64 - bits.Length);
    string binaryRepresentation = leadingZeros + bits;

    string sign = binaryRepresentation[0].ToString();
    string exponent = binaryRepresentation.Substring(1, 11);
    string mantissa = binaryRepresentation.Substring(12);

    return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
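For readers without a C# toolchain, a roughly equivalent Python sketch is shown here. It is my own port rather than part of the original answer, and it relies only on the standard struct module.

```python
import struct

def binary_representation(value):
    """Return the binary64 bit pattern of `value` as sign:exponent:mantissa."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))  # reinterpret the 8 bytes
    text = format(bits, "064b")                              # zero-padded to 64 digits
    return f"{text[0]}:{text[1:12]}:{text[12:]}"

print(binary_representation(0.1))
print(binary_representation(0.2))
```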

Getting to the point: the original question

(Skip to the bottom for the TL;DR version)

Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.

Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:

0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010

Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.

Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10^-3 * 1.23 == 10^-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2^p, where 'a' is an integer.

Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:

0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125

To add two numbers, the exponent needs to be the same, i.e.:

0.1 => 2^-3 *  0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 *  1.1001100110011001100110011001100110011001100110011010
sum =  2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397  = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794  = 0.200000000000000011102230246251565404236316680908203125
sum =  2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875

Since the sum is not of the form 2ⁿ * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)
    = 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875

There are now 53 bits in the mantissa (the 53rd is in parentheses in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest' - i.e. if a number x falls exactly halfway between two values a and b, the value where the least significant bit is zero is chosen.

a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
  = 2^-2  * 1.0011001100110011001100110011001100110011001100110011

x = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)

b = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
  = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
    = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

whereas the binary representation of 0.3 is:

0.3 => 2^-2  * 1.0011001100110011001100110011001100110011001100110011
    =  2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875

which only differs from the binary representation of the sum of 0.1 and 0.2 by 2^-54.

The binary representations of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representations, due to the default rounding mode, results in a value which differs from the representation of 0.3 by one unit in the last place.
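The whole derivation can be replayed with exact integer arithmetic. The Python sketch below uses the standard fractions module; the variable names are illustrative, and the tie-breaking line applies only because the exact sum lies precisely halfway between two doubles, as it does here.

```python
from fractions import Fraction

tenth = Fraction(0.1)    # exactly 3602879701896397 / 2**55
fifth = Fraction(0.2)    # exactly 7205759403792794 / 2**55
exact_sum = tenth + fifth

# Express the exact sum in units of 2**-54, the spacing of doubles in [0.25, 0.5).
scaled = exact_sum * 2**54                       # 5404319552844595.5: a tie
lower = scaled.numerator // scaled.denominator   # a = 5404319552844595 (odd)
upper = lower + 1                                # b = 5404319552844596 (even)
chosen = lower if lower % 2 == 0 else upper      # a tie goes to the even neighbour

print(Fraction(0.1 + 0.2) == Fraction(chosen, 2**54))   # True: hardware picked b
print(Fraction(0.3) == Fraction(lower, 2**54))          # True: the literal 0.3 is a
print((0.1 + 0.2) - 0.3 == 2**-54)                      # True: they differ by 2^-54
```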

TL;DR

Writing 0.1 + 0.2 in a IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):

0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3       => 0:01111111101:0011001100110011001100110011001100110011001100110[011]

Converted back to decimal, these values are:

0.1 + 0.2 => 0.300000000000000044408920985006...
0.3       => 0.299999999999999988897769753748...

The difference is exactly 2^-54, which is ~5.5511151231258 × 10^-17 - insignificant (for many applications) when compared to the original values.

Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.

Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.

score 798 · Answer 7

Answer

Solution:

In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.

For example:

var result = 1.0 + 2.0;     // result === 3.0 returns true

... instead of:

var result = 0.1 + 0.2;     // result === 0.3 returns false

The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact (for integers up to 2^53 in magnitude), so decimal representation errors can be avoided by scaling.

As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended¹ to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.
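A minimal Python sketch of the cents approach is given below. Both helper names are invented for the example, and it assumes non-negative amounts written with at most two decimal places; a real application would need to decide how to treat negative values, extra digits and malformed input.

```python
def dollars_to_cents(text):
    """Parse a decimal string such as "25.50" into an integer number of cents."""
    dollars, _, cents = text.partition(".")
    # Pad or trim the fractional part to exactly two digits: "5" -> "50", "" -> "00".
    return int(dollars) * 100 + int((cents + "00")[:2])

def format_cents(cents):
    """Format an integer number of cents back into a dollars string."""
    return f"{cents // 100}.{cents % 100:02d}"

total = dollars_to_cents("0.10") + dollars_to_cents("0.20")   # pure integer addition
print(total, format_cents(total))   # 30 0.30
```

Parsing from the original text matters here: converting an already-stored float such as 25.50 to cents would bring the binary approximation back in.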

¹ Douglas Crockford: .

score 390 · Answer 8

Answer

Solution:

Floating point numbers stored in the computer consist of two parts, an integer and an exponent that the base is taken to and multiplied by the integer part.

If the computer were working in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 will obviously result in 0.3.

Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5 is 1 x 2⁻¹ and 0.25 is 1 x 2⁻², and adding them results in 3 x 2⁻², or 0.75. Exactly.

The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1 is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 x 2⁻⁵⁵; adding them together results in 10808639105689191 x 2⁻⁵⁵, or an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating point numbers are generally rounded for display.

score 614 · Answer 9

Answer

Solution:

Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.

**Date the issue was resolved:** · Answer 10 · 2021-12-1

Answer

Solution:

In short it's because:

Floating point numbers cannot represent all decimals precisely in binary

So just like 10/3 which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.

So what? How to deal with it? Is there any workaround?

To offer the best solution, I can say I discovered the following method:

parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3

Let me explain why it's the best solution. As others mentioned in the answers above, it's a good idea to use the built-in JavaScript toFixed() function to solve the problem. But you will most likely run into some problems.

Imagine you are going to add up two float numbers like 0.2 and 0.7. Here it is: 0.2 + 0.7 = 0.8999999999999999.

Your expected result was 0.9, which means you need a result with 1 digit of precision in this case. So you should have used (0.2 + 0.7).toFixed(1), but you can't just pass a fixed parameter to toFixed() since it depends on the given number, for instance

0.22 + 0.7 = 0.9199999999999999

In this example you need 2 digits of precision, so it should be toFixed(2). So what should the parameter be to fit every given float number?

You might say let it be 10 in every situation then:

(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"

Damn! What are you going to do with those unwanted zeros after the 9? It's time to convert it back to a float to get the result you want:

parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9

Now that you have found the solution, it's better to offer it as a function like this:

function floatify(number) {
    return parseFloat(number.toFixed(10));
}

Try it yourself:

function floatify(number) {
    return parseFloat(number.toFixed(10));
}
 
function addUp(){
  var number1 = +$("#number1").val();
  var number2 = +$("#number2").val();
  var unexpectedResult = number1 + number2;
  var expectedResult = floatify(number1 + number2);
  $("#unexpectedResult").text(unexpectedResult);
  $("#expectedResult").text(expectedResult);
}
addUp();

input {
  width: 50px;
}
#expectedResult {
  color: green;
}
#unexpectedResult {
  color: red;
}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>

score 759 · Answer 11

Answer

Solution:

My workaround:

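function add(a, b, precision) {
    // Scale factor 10^precision, defaulting to 2 decimal places.
    // (precision || 2 would wrongly turn an explicit 0 into 2.)
    var x = Math.pow(10, precision === undefined ? 2 : precision);
    // Round each scaled operand to an integer, add, then scale back.
    return (Math.round(a * x) + Math.round(b * x)) / x;
}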

precision refers to the number of digits you want to preserve after the decimal point during addition.
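The same idea can be sketched by keeping amounts as whole numbers of the smallest unit and converting only at the edges. The names toUnits and fromUnits are hypothetical, and the scale of 100 (two decimal places) is an assumption made for the example.

```javascript
const SCALE = 100; // assumed: two decimal places, e.g. cents

// Convert a decimal amount to a whole number of units.
function toUnits(amount) {
  return Math.round(amount * SCALE);
}

// Convert back to a decimal amount for display.
function fromUnits(units) {
  return units / SCALE;
}

// Integers of this size add exactly, so no error builds up in the sum.
const total = toUnits(0.1) + toUnits(0.2);
console.log(total);                    // 30
console.log(fromUnits(total) === 0.3); // true
```

Keeping the running total in units and dividing only once at the end avoids re-rounding after every addition.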


score 568 · Answer 12

Answer

Solution:

A lot of good answers have been posted, but I'd like to append one more.

Not all numbers can be represented exactly as floats or doubles. For example, the number "0.2" is represented as "0.200000003" (to nine decimal places) in IEEE 754 single precision.

Under the hood, the model for storing real numbers represents a floating-point number as a significand multiplied by a power of the radix.

Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX are 2, not 10, on a computer with an FPU that follows the "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".

So such numbers are hard to represent exactly, even if you assign the value to a variable directly, without any intermediate calculation.
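The single-precision figure can be checked with a small sketch. JavaScript numbers are doubles, so this uses Math.fround, which rounds a double to the nearest 32-bit float value, to stand in for a C float.

```javascript
// Math.fround rounds a double to the nearest 32-bit float value.
const single = Math.fround(0.2);

console.log(single);            // 0.20000000298023224
console.log(single.toFixed(9)); // "0.200000003"
console.log(single === 0.2);    // false: the double 0.2 is a different value
```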

score 513 · Answer 13

Answer

Solution:

No, not broken, but most decimal fractions must be approximated

Summary

Floating point arithmetic is exact. Unfortunately, it doesn't match up well with our usual base-10 number representation, so we are often giving it input that is slightly off from what we wrote.

Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base₂. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict so we throw up our hands and say "FP is inexact", but that's not really true.

We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.

How did this happen?

When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form

a / (2ⁿ × 5ᵐ)

In binary, we only get the 2ⁿ term, that is:

a / 2ⁿ

So in decimal, we can't represent ¹/₃. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction can also be written as a base 10 fraction. However, hardly anything we write as a base₁₀ fraction is representable in binary. In the range 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, the only ones whose reduced denominator has 2 as its sole prime factor.

In base₁₀ we can't represent ¹/₃. But in binary, we can't do ¹/₁₀ or ¹/₃.

So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
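The claim about 0.01 through 0.99 can be checked with integer arithmetic alone. This is a minimal sketch with made-up helper names: it reduces k/100 to lowest terms and tests whether the remaining denominator is a power of two.

```javascript
function gcd(a, b) {
  while (b !== 0) {
    [a, b] = [b, a % b];
  }
  return a;
}

// k/100 terminates in binary exactly when its reduced denominator
// is a power of two.
function terminatesInBinary(k) {
  const denominator = 100 / gcd(k, 100);
  return (denominator & (denominator - 1)) === 0;
}

const exact = [];
for (let k = 1; k <= 99; k++) {
  if (terminatesInBinary(k)) exact.push((k / 100).toFixed(2));
}
console.log(exact); // [ '0.25', '0.50', '0.75' ]
```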

Dealing with it

Developers are usually instructed to do < epsilon comparisons. Better advice might be to round to integral values (in the C library: round() and roundf(), i.e., stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
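Both approaches mentioned here can be sketched side by side. The function names, the default tolerance of 1e-9, and the choice of two decimals are all assumptions for the example, not recommendations from the answer.

```javascript
// Tolerance-based comparison; the default tolerance is an arbitrary choice.
function nearlyEqual(a, b, epsilon = 1e-9) {
  return Math.abs(a - b) < epsilon;
}

// Round both values to whole units at a chosen number of decimals,
// then compare the resulting integers.
function equalAtDecimals(a, b, decimals) {
  const scale = Math.pow(10, decimals);
  return Math.round(a * scale) === Math.round(b * scale);
}

console.log(0.1 + 0.2 === 0.3);                  // false
console.log(nearlyEqual(0.1 + 0.2, 0.3));        // true
console.log(equalAtDecimals(0.1 + 0.2, 0.3, 2)); // true
```

The rounding variant makes the intended precision explicit, whereas a suitable epsilon depends on the magnitude of the values being compared.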

Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.

The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.

I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is that the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)

Conclusion

If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.

score 445 · Answer 14

Answer

Solution:

Some statistics related to this famous double precision question.

When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values. Here are some examples:

0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)

When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have ~34% chance of precision error. Here are some examples:

0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)

15% and 34% are indeed huge, so always use BigDecimal when precision matters. With two decimal digits (step 0.01) the situation gets slightly worse (18% and 36%).
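A count of this kind can be reproduced with a short loop. The answer does not say exactly how its percentages were measured, so the sketch below makes its own assumption: a sum counts as "off" when i/10 + j/10 differs from the double nearest to (i + j)/10. The figure it prints may therefore not match the ones quoted above.

```javascript
// Fraction of pairs (i/10, j/10), with 1 <= i, j <= maxTenths, whose
// floating-point sum differs from the double nearest to (i + j)/10.
function additionMismatchRate(maxTenths) {
  let mismatches = 0;
  let total = 0;
  for (let i = 1; i <= maxTenths; i++) {
    for (let j = 1; j <= maxTenths; j++) {
      total++;
      if (i / 10 + j / 10 !== (i + j) / 10) mismatches++;
    }
  }
  return mismatches / total;
}

console.log(additionMismatchRate(1000)); // fraction of sums that are "off"
```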

score 442 · Answer 15

Answer

Solution:

Did you try the duct tape solution?

Try to determine when errors occur and fix them with short if statements. It's not pretty, but for some problems it is the only solution, and this is one of them.

if ((n * 0.1) < 100.0) { return n * 0.1 - 0.000000000000001; }
else { return n * 0.1 + 0.000000000000001; }

I had the same problem in a scientific simulation project in C#, and I can tell you that if you ignore the butterfly effect, it's going to turn into a big fat dragon and bite you in the a**.

score 261 · Answer 16

Answer

Solution:

Given that nobody has mentioned this...

Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:

Python's decimal module and Java's BigDecimal class represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone; however, they solve most common problems with binary floating point arithmetic.

Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
```
>>> from decimal import Decimal
>>> 0.1 + 0.2 == 0.3
False
>>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
True
```
Python's decimal module is based on IEEE standard 854-1987.

Python's fractions module and Apache Common's BigFraction class both represent rational numbers as (numerator, denominator) pairs, and they may give more accurate results than decimal floating point arithmetic.

Neither of these solutions is perfect (especially if we look at performance, or if we require very high precision), but they still solve a great number of problems with binary floating point arithmetic.
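To make the (numerator, denominator) idea concrete, here is a minimal sketch using JavaScript BigInt values. It is not how any particular library is implemented; the function names are invented for the example and only addition and equality are covered.

```javascript
function gcd(a, b) {
  while (b !== 0n) {
    [a, b] = [b, a % b];
  }
  return a < 0n ? -a : a;
}

// Build a fraction in lowest terms; assumes a positive denominator.
function fraction(numerator, denominator) {
  const g = gcd(numerator, denominator);
  return { n: numerator / g, d: denominator / g };
}

function addFractions(x, y) {
  return fraction(x.n * y.d + y.n * x.d, x.d * y.d);
}

function sameFraction(x, y) {
  return x.n === y.n && x.d === y.d;
}

const sum = addFractions(fraction(1n, 10n), fraction(2n, 10n));
console.log(sameFraction(sum, fraction(3n, 10n))); // true
```

Because every intermediate value is an exact integer, no rounding happens at any step. The cost is that numerators and denominators can grow without bound.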

score 800 · Answer 17

Answer

Solution:

Those weird numbers appear because computers use the binary (base 2) number system for calculations, while we use decimal (base 10).

Most fractional numbers cannot be represented exactly in binary, in decimal, or in either. The result is a rounded (but precisely defined) number.

score 166 · Answer 18

Answer

Solution:

Many of this question's numerous duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it. Some languages provide ways of doing that, such as converting a float or double to BigDecimal in Java.

Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter.

Applying it to the numbers in the question, treated as doubles:

0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,

0.2 converts to 0.200000000000000011102230246251565404236316680908203125,

0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and

0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.
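Exact expansions like these can also be printed without an external converter. The sketch below relies on toFixed producing exact digits when asked for many decimal places, and it assumes 60 places are enough for the values shown. The helper name exactDecimal is made up.

```javascript
// Exact decimal expansion of a double; assumes 60 places are enough
// for the values printed here (trailing zeros are stripped).
function exactDecimal(x) {
  return x.toFixed(60).replace(/0+$/, "");
}

console.log(exactDecimal(0.1));
console.log(exactDecimal(0.2));
console.log(exactDecimal(0.3));
console.log(exactDecimal(0.1 + 0.2));
```

Comparing the last two lines shows why the equality test in the question fails: the stored sum and the stored constant 0.3 are two different doubles.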
