48241

# Floating-point Numbers in IDL

TOPIC:

Exelis VIS Technical Support frequently gets questions about "floating-point errors" in IDL.

This help article discusses frequently asked questions regarding floating-point numbers with respect to their use in IDL.

More information about floating point numbers in IDL may be found in the IDL Online Help under the Contents section:

IDL Programming > Concepts > IDL Data Types

--under the topic:

Precision of Floating-Point Numbers

--and also by following the link under this topic titled:

Accuracy and Floating Point Operations

Note also that most of the concepts explained in this help article may also be found in introductory Numerical Analysis textbooks.

DISCUSSION:

First, two important facts regarding floating-point numbers:

1) Floating-point processors in modern computers use BINARY arithmetic, not DECIMAL arithmetic. Consequently, floating-point numbers are stored in BINARY formats, not DECIMAL formats. Decimal numbers must undergo a conversion from base 10 to base 2, when entered into a computer program. Likewise, they must undergo another conversion, from base 2 to base 10, before they can be displayed.

2) Floating-point numbers are represented using standard floating-point formats, with a limited number of significant BINARY places, which translate to a limited number of significant DECIMAL places.

For these reasons, inaccuracies in floating-point number representations are guaranteed.

Floating-point inaccuracies are not unique to IDL. The limitations imposed by floating-point formats exist on all computer hardware and are evident in every computer programming language. In fact, IDL uses standard C-Library routines, such as scanf() and printf(), for converting, formatting, and displaying numbers; inaccuracies observed in IDL are directly related to C-Library routines.

Regarding Fact 1, consider the number 5 3/4, which has the "exact" decimal representation 5.75 (as opposed to the number 1/3, which would require an infinite number of significant decimal places: 0.333...forever...).

When the number 5.75 is converted to binary, the result is 101.11, which is also exact.

Not all decimal numbers have such a "nice" relationship to the binary number system. For example, the number 1/5 has the exact decimal representation 0.2, but an infinite number of significant binary places would be required to represent 0.2.

Regarding Fact 2, the floating-point formats in use on modern computer hardware include a single-precision floating-point format and a double-precision floating-point format. The formats are defined by The Institute of Electrical and Electronics Engineers (IEEE). The single-precision format supports the equivalent of approximately 7 significant decimal places. The double-precision format supports the equivalent of approximately 16 significant decimal places. The word "approximately" refers to the fact that each decimal place does not correspond, exactly, to an integral number of binary places. Therefore, the least significant decimal place associated with these formats may be "fuzzy." In general, it is best not to trust it, if it is displayed.

An example:

Using IDL and the decimal number 423.36, observe what happens when it is entered as a single-precision floating-point value and is printed with an increasing number of significant DISPLAY places.

IDL> a = 423.36
IDL> ; Display 7 significant decimal places:
IDL> print, a, format='(f20.4)'
423.3600
IDL> ; Display 8 significant decimal places:
IDL> print, a, format='(f20.5)'
423.35999
IDL>

Asking for 8 significant decimal places to be DISPLAYED, reveals the limitations of the single-precision format. "Under the hood," 423.36 can't be represented, exactly, in binary floating-point format. The number 423.36 can be DISPLAYED exactly, because IDL (which means the C-Library printf() function) performs rounding (in binary, not decimal) to the requested number of display places. The rounding process fails, when there aren't enough significant binary places available.

Now, look at a double-precision version of the same number:

IDL> a = 423.36d0
IDL> print, a, format='(d20.13)'  ; Display 16 significant decimal places.
423.3600000000000
IDL> print, a, format='(d20.14)'  ; Display 17 significant decimal places.
423.36000000000001
IDL>

Requesting 17 significant decimal places reveals the limitations of the double-precision format; the least significant DISPLAYED place represents uninitialized data bits (a.k.a., "garbage").

A Detailed Look at the IEEE Single-precision Floating-point Format

The IEEE floating-point formats are, essentially, the binary equivalent of the familiar Scientific Notation used in the decimal number system. The IEEE formats store a signed mantissa part, which contains the significant places of a number, and a signed exponent part, which represents the magnitude of a number. The mantissa bits are coefficients of powers of two and the exponent part is a binary exponent. Just as in Scientific Notation, where the exponent part represents the number of decimal places (and direction) by which the decimal point is to be shifted, so the IEEE exponent part represents the number of binary places (and direction) by which the binary point is to be shifted. As with Scientific Notation, the significant places are positioned so that the most significant place is in the units position (the position corresponding to the zero-th power of the base). This process is called normalization.

Except for the special case of the number 0.0, every other floating-point number will have at least one non-zero significant binary place. Therefore, after the normalization process, the units place will always be non-zero (one) for non-zero numbers. This fact allows for the concept of a "hidden" bit in the IEEE formats. An extra significant binary place is gained by treating the units place as a "virtual" one-bit (a special IEEE "bit-pattern" allows computer floating-point arithmetic units to identify the 0.0 case). Only the remaining significant places of the normalized mantissa, from the negative-first power of two downwards, are actually stored in computer memory.

Another example:

IDL can be used to examine the actual bit-pattern of any floating-point number. The single-precision format can be revealed by copying the bit-pattern into a variable of type LONG and printing it using the hexadecimal editing code. Using the previous example of the decimal number 5.75, we would expect to see the number 101.11 pop out of the binary floating-point representation.

IDL> a = 5.75
IDL> b = long(a, 0)
IDL> print, b, format='(z8)'
40b80000
IDL>

Convert each hexadecimal place to its binary equivalent:

0100 0000 1011 1000 0000 0000 0000 0000

Spread out the sign bit, the exponent part, and the mantissa part:

 s exponent Mantissa 0 10000001 01110000000000000000000

The IEEE single-precision floating-point format has a sign bit (set to zero for positive numbers and to one for negative numbers), an 8-bit exponent part, and a 23-bit mantissa part. The exponent part is stored with a bias value, so that both negative and positive exponents may be represented. The bias value is 01111111 (127 decimal). Exponent parts which are greater than, or equal to, the bias value represent positive powers of two. Otherwise, they represent negative powers of two.

Subtract the bias value from the biased exponent part:

10000001 - 01111111 = 00000010

OK., it is easier to understand the subtraction if it is done in decimal:

129 - 127 = 2

The exponent part is actually a positive 2.

Combine the "hidden" bit (units place) with the bits actually stored in the mantissa part:

1.0111

Since the exponent part is a positive 2, shift the binary point two places to the right:

101.11

This result agrees with our expectations.

IDL TYPE-CONVERSION FUNCTIONS

The behavior and purpose of the IDL type-conversion functions, such as the DOUBLE function, are often misunderstood. For example, using the DOUBLE function with a single-precision argument will create a double-precision result. However, the result will contain only a single-precision amount of significant places.

IDL> x = double(423.36)
IDL> print, x, format='(d20.4)'   ; Display 7 significant decimal places.
423.3600
IDL> print, x, format='(d20.5)'   ; Display 8 significant decimal places.
423.35999

The number 423.36, which is an argument to the DOUBLE function, is converted to a single-precision floating-point value BEFORE it is passed to the DOUBLE function. Since 423.36 can't be represented, exactly, in binary floating-point format, truncation of significant places occurs BEFORE it is passed to the DOUBLE function.

To declare a floating-point number as a double-precision number, with a full double-precision mantissa, the "D" exponential notation may be used:

IDL> x = 423.36d0

This notation forces the conversion process, from decimal to binary, to generate as many significant binary places as are possible in the double-precision format.

The IDL type-conversion functions are useful in expressions where specific behavior is required. For example, to avoid truncation of fractional parts during integer division:

IDL> x = 5
IDL> y = 7
IDL> print, x / y
0
IDL> print, float(x) / float(y)
0.714286