What Every Computer Scientist Should Know About Floating-Point Arithmetic

Note – This appendix is an edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys. Copyright 1991, Association for Computing Machinery, Inc., reprinted by permission.

Abstract

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General -- instruction set design; D.3.4 [Programming Languages]: Processors -- compilers, optimization; G.1.0 [Numerical Analysis]: General -- computer arithmetic, error analysis, numerical algorithms (Secondary)

D.2.1 [Software Engineering]: Requirements/Specifications -- languages; D.3.4 [Programming Languages]: Formal Definitions and Theory -- semantics; D.4.1 [Operating Systems]: Process Management -- synchronization.

General Terms: Algorithms, Design, Languages

Additional Key Words and Phrases: Denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow.

Introduction

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first section, Rounding Error, discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is rapidly becoming accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in the section Rounding Error. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling.

I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus. Those explanations that are not central to the main argument have been grouped into a section called "The Details," so that they can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ∎ symbol. When a proof is not included, the ∎ appears immediately following the statement of the theorem.

Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. The section Guard Digits discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Exactly Rounded Operations.

Floating-point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. Floating-point representations have a base β (which is always assumed to be even) and a precision p. If β = 10 and p = 3, then the number 0.1 is represented as 1.00 × 10^-1. If β = 2 and p = 24, then the decimal number 0.1 cannot be represented exactly, but is approximately 1.10011001100110011001101 × 2^-4.

In general, a floating-point number will be represented as ± d.dd...d × β^e, where d.dd...d is called the significand and has p digits. More precisely, ± d_0.d_1 d_2 ... d_(p-1) × β^e represents the number

(1)   ±(d_0 + d_1 β^-1 + ... + d_(p-1) β^-(p-1)) β^e,   0 ≤ d_i < β.

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, e_max and e_min. Since there are β^p possible significands, and e_max - e_min + 1 possible exponents, a floating-point number can be encoded in

⌈log2(e_max - e_min + 1)⌉ + ⌈log2(β^p)⌉ + 1

bits, where the final +1 is for the sign bit. The precise encoding is not important for now.

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus when β = 2, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range, that is, its absolute value is larger than β × β^e_max or smaller than 1.0 × β^e_min. Most of this paper discusses issues due to the first reason. However, numbers that are out of range will be discussed in the sections Infinity and Denormalized Numbers.

Floating-point representations are not necessarily unique. For example, both 0.01 × 10^1 and 1.00 × 10^-1 represent 0.1. If the leading digit is nonzero (d_0 ≠ 0 in equation (1) above), then the representation is said to be normalized. The floating-point number 1.00 × 10^-1 is normalized, while 0.01 × 10^1 is not. When β = 2, p = 3, e_min = -1 and e_max = 2 there are 16 normalized floating-point numbers, as shown in FIGURE D-1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero! A natural way to represent 0 is with 1.0 × β^(e_min - 1), since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographic ordering of their floating-point representations. When the exponent is stored in a k bit field, that means that only 2^k - 1 values are available for use as exponents, since one must be reserved to represent 0.

Note that the × in a floating-point number is part of the notation, and different from a floating-point multiply operation. The meaning of the × symbol should be clear from the context. For example, the expression (2.5 × 10^-3) × (4.0 × 10^2) involves only a single floating-point multiplication.


FIGURE D-1 Normalized numbers when β = 2, p = 3, e_min = -1, e_max = 2

Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with β = 10 and p = 3, which will be used throughout this section. If the result of a floating-point computation is 3.12 × 10^-2, and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as 3.14 × 10^-2, then it is in error by .159 units in the last place. In general, if the floating-point number d.d...d × β^e is used to represent z, then it is in error by |d.d...d - (z/β^e)| β^(p-1) units in the last place. The term ulps will be used as shorthand for "units in the last place." If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as .5 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number. For example the relative error committed when approximating 3.14159 by 3.14 × 10^0 is .00159/3.14159 ≈ .0005.

To compute the relative error that corresponds to .5 ulp, observe that when a real number is approximated by the closest possible floating-point number d.dd...dd × β^e, the error can be as large as 0.00...00β' × β^e, where β' is the digit β/2, there are p units in the significand of the floating-point number, and p units of 0 in the significand of the error. This error is ((β/2)β^-p) × β^e. Since numbers of the form d.dd...dd × β^e all have the same absolute error, but have values that range between β^e and β × β^e, the relative error ranges between ((β/2)β^-p) × β^e/β^e and ((β/2)β^-p) × β^e/β^(e+1). That is,

(2)   (1/2)β^-p ≤ 1/2 ulp ≤ (β/2)β^-p

In particular, the relative error corresponding to .5 ulp can vary by a factor of β. This factor is called the wobble. Setting ε = (β/2)β^-p to the largest of the bounds in (2) above, we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.

In the example above, the relative error was .00159/3.14159 ≈ .0005. In order to avoid such small numbers, the relative error is normally written as a factor times ε, which in this case is ε = (β/2)β^-p = 5(10)^-3 = .005. Thus the relative error would be expressed as ((.00159/3.14159)/.005)ε ≈ 0.1ε.

To illustrate the difference between ulps and relative error, consider the real number x = 12.35. It is approximated by x̃ = 1.24 × 10^1. The error is 0.5 ulps, the relative error is 0.8ε. Next consider the computation 8x̃. The exact value is 8x = 98.8, while the computed value is 8x̃ = 9.92 × 10^1. The error is now 4.0 ulps, but the relative error is still 0.8ε. The error measured in ulps is 8 times larger, even though the relative error is the same. In general, when the base is β, a fixed relative error expressed in ulps can wobble by a factor of up to β. And conversely, as equation (2) above shows, a fixed error of .5 ulps results in a relative error that can wobble by β.
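To make the two measures concrete, here is a small Python sketch (ours, not from the paper) that evaluates both of them for the β = 10, p = 3 example above; the helper names are made up.

    def error_in_ulps(significand, e, z, p=3, beta=10):
        # Error in units in the last place: |d.d...d - z/beta^e| * beta^(p-1).
        return abs(significand - z / beta**e) * beta**(p - 1)

    def relative_error(approx, z):
        return abs(approx - z) / abs(z)

    eps = (10 / 2) * 10.0**-3                    # machine epsilon for beta = 10, p = 3
    print(error_in_ulps(1.24, 1, 12.35))         # about 0.5 ulps
    print(relative_error(12.4, 12.35) / eps)     # about 0.8, i.e. 0.8 * eps
    print(error_in_ulps(9.92, 1, 98.8))          # about 4.0 ulps
    print(relative_error(99.2, 98.8) / eps)      # still about 0.8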

The most natural way to measure rounding error is in ulps. For example rounding to the nearest floating-point number corresponds to an error of less than or equal to .5 ulp. However, when analyzing the rounding error caused by various formulas, relative error is a better measure. A good illustration of this is the analysis in the section Theorem 9. Since ε can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of β, error estimates of formulas will be tighter on machines with a small β.

When only the order of magnitude of rounding error is of interest, ulps and ε may be used interchangeably, since they differ by at most a factor of β. For example, when a floating-point number is in error by n ulps, that means that the number of contaminated digits is log_β n. If the relative error in a computation is nε, then

(3)   contaminated digits ≈ log_β n.

Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming p = 3, 2.15 × 10^12 - 1.25 × 10^-5 would be calculated as

x = 2.15 × 10^12
y = .0000000000000000125 × 10^12
x - y = 2.1499999999999999875 × 10^12

which rounds to 2.15 × 10^12. Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose that the number of digits kept is p, and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then 2.15 × 10^12 - 1.25 × 10^-5 becomes

x = 2.15 × 10^12
y = 0.00 × 10^12
x - y = 2.15 × 10^12

The answer is exactly the same as if the difference had been computed exactly and then rounded. Take another example: 10.1 - 9.93. This becomes

x = 1.01 × 10^1
y = 0.99 × 10^1
x - y = .02 × 10^1

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit! How bad can the error be?

Theorem 1

Using a floating-point format with parameters β and p, and computing differences using p digits, the relative error of the result can be as large as β - 1.

Proof

A relative error of β - 1 in the expression x - y occurs when x = 1.00...0 and y = .ρρ...ρ, where ρ = β - 1. Here y has p digits (all equal to ρ). The exact difference is x - y = β^-p. However, when computing the answer using only p digits, the rightmost digit of y gets shifted off, and so the computed difference is β^(-p+1). Thus the error is β^-p - β^(-p+1) = β^-p (β - 1), and the relative error is β^-p (β - 1)/β^-p = β - 1. ∎

When β = 2, the relative error can be as large as the result, and when β = 10, it can be 9 times larger. Or to put it another way, when β = 2, equation (3) shows that the number of contaminated digits is log_2(1/ε) = log_2(2^p) = p. That is, all of the p digits in the result are wrong! Suppose that one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, and then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes

x = 1.010 × 10^1
y = 0.993 × 10^1
x - y = .017 × 10^1

and the answer is exact. With a single guard digit, the relative error of the result may be greater than ε, as in 110 - 8.59.

x = 1.100 × 10^2
y = .085 × 10^2
x - y = 1.015 × 10^2

This rounds to 102, compared with the correct answer of 101.41, for a relative error of .006, which is greater than ε = .005. In general, the relative error of the result can be only slightly larger than ε. More precisely,

Theorem 2

If x and y are floating-point numbers in a format with parameters β and p, and if subtraction is done with p + 1 digits (i.e. one guard digit), then the relative rounding error in the result is less than 2ε.

This theorem will be proven in Rounding Error. Addition is included in the above theorem since x and y can be positive or negative.

Cancellation

The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large. In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error so large that all the digits are meaningless (Theorem 1). When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. For example in the quadratic formula, the expression b^2 - 4ac occurs. The quantities b^2 and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider b = 3.34, a = 1.22, and c = 2.28. The exact value of b^2 - 4ac is .0292. But b^2 rounds to 11.2 and 4ac rounds to 11.1, hence the final answer is .1, which is an error of 70 ulps, even though 11.2 - 11.1 is exactly equal to .1. The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.
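The effect is easy to reproduce. The sketch below (not from the paper) uses Python's decimal module to emulate β = 10, p = 3 arithmetic, in which every operation is rounded to three significant digits.

    from decimal import Decimal, getcontext

    getcontext().prec = 3                  # emulate beta = 10, p = 3
    a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

    b2 = b * b                             # 11.1556 rounds to 11.2
    ac4 = 4 * a * c                        # 11.1264 rounds to 11.1
    print(b2 - ac4)                        # 0.1, versus the exact value .0292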

Benign cancellation occurs when subtracting exactly known quantities. If x and y have no rounding error, then by Theorem 2 if the subtraction is done with a guard digit, the difference x - y has a very small relative error (less than 2ε).

A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula

(4)   r1 = (-b + √(b^2 - 4ac)) / (2a),   r2 = (-b - √(b^2 - 4ac)) / (2a).

When b^2 ≫ ac, then b^2 - 4ac does not involve a cancellation and

√(b^2 - 4ac) ≈ |b|.

But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation. To avoid this, multiply the numerator and denominator of r1 by

-b - √(b^2 - 4ac)

(and similarly for r2) to obtain

(5)   r1 = 2c / (-b - √(b^2 - 4ac)),   r2 = 2c / (-b + √(b^2 - 4ac)).

If b^2 ≫ ac and b > 0, then computing r1 using formula (4) will involve a cancellation. Therefore, use formula (5) for computing r1 and (4) for r2. On the other hand, if b < 0, use (4) for computing r1 and (5) for r2.
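As an illustration, here is a minimal sketch (our own function, not from the paper) that chooses between (4) and (5) for each root according to the sign of b, using IEEE double precision; it assumes a ≠ 0 and b^2 ≥ 4ac, and it does not guard against every degenerate case (for example b = c = 0).

    import math

    def quadratic_roots(a, b, c):
        # Pick formula (4) or (5) per root so that the additive step never
        # subtracts two nearly equal quantities.
        d = math.sqrt(b * b - 4.0 * a * c)
        if b >= 0.0:
            return 2.0 * c / (-b - d), (-b - d) / (2.0 * a)   # (5) for r1, (4) for r2
        else:
            return (-b + d) / (2.0 * a), 2.0 * c / (-b + d)   # (4) for r1, (5) for r2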

The expression x^2 - y^2 is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as (x - y)(x + y). Unlike the quadratic formula, this improved form still has a subtraction, but it is a benign cancellation of quantities without rounding error, not a catastrophic one. By Theorem 2, the relative error in x - y is at most 2ε. The same is true of x + y. Multiplying two quantities with a small relative error results in a product with a small relative error (see the section Rounding Error).

In order to avoid confusion between exact and computed values, the following notation is used. Whereas x - y denotes the exact difference of x and y, x ⊖ y denotes the computed difference (i.e., with rounding error). Similarly ⊕, ⊗, and ⊘ denote computed addition, multiplication, and division, respectively. All caps indicate the computed value of a function, as in LN(x) or SQRT(x). Lowercase functions and traditional mathematical notation denote their exact values as in ln(x) and √x.

Although (x ⊖ y) ⊗ (x ⊕ y) is an excellent approximation to x^2 - y^2, the floating-point numbers x and y might themselves be approximations to some true quantities x̂ and ŷ. For example, x̂ and ŷ might be exactly known decimal numbers that cannot be expressed exactly in binary. In this case, even though x ⊖ y is a good approximation to x - y, it can have a huge relative error compared to the true expression x̂ - ŷ, and so the advantage of (x + y)(x - y) over x^2 - y^2 is not as dramatic. Since computing (x + y)(x - y) is about the same amount of work as computing x^2 - y^2, it is clearly the preferred form in this case. In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often (but not always) an approximation. But eliminating a cancellation entirely (as in the quadratic formula) is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible.

The expression x^2 - y^2 is more accurate when rewritten as (x - y)(x + y) because a catastrophic cancellation is replaced with a benign one. We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation.

The area of a triangle can be expressed directly in terms of the lengths of its sides a , b , and c as

(6)   A = √(s(s - a)(s - b)(s - c)),   where s = (a + b + c)/2.

Suppose the triangle is very flat; that is, a ≈ b + c. Then s ≈ a, and the term (s - a) in formula (6) subtracts two nearby numbers, one of which may have rounding error. For example, if a = 9.0, b = c = 4.53, the correct value of s is 9.03 and A is 2.342. Even though the computed value of s (9.05) is in error by only 2 ulps, the computed value of A is 3.04, an error of 70 ulps.

There is a way to rewrite formula (6) so that it will return accurate results even for flat triangles [Kahan 1986]. It is

(7)   A = √((a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))) / 4,   a ≥ b ≥ c.

If a, b, and c do not satisfy a ≥ b ≥ c, rename them before applying (7). It is straightforward to check that the right-hand sides of (6) and (7) are algebraically identical. Using the values of a, b, and c above gives a computed area of 2.35, which is 1 ulp in error and much more accurate than the first formula.
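A direct transcription of (7) into IEEE double precision might look like the sketch below (the function name is ours; the sorting step performs the renaming that (7) requires).

    import math

    def triangle_area(a, b, c):
        # Kahan's formula (7): order the sides so that a >= b >= c and keep
        # the parentheses exactly as written.
        a, b, c = sorted((a, b, c), reverse=True)
        return math.sqrt((a + (b + c)) * (c - (a - b)) *
                         (c + (a - b)) * (a + (b - c))) / 4.0

    print(triangle_area(9.0, 4.53, 4.53))   # about 2.342, even for this very flat triangle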

Although formula (7) is much more accurate than (6) for this example, it would be nice to know how well (7) performs in general.

Theorem 3

The rounding error incurred when using (7) to compute the area of a triangle is at most 11ε, provided that subtraction is performed with a guard digit, ε ≤ .005, and that square roots are computed to within 1/2 ulp.

The condition that ε ≤ .005 is met in virtually every actual floating-point system. For example when β = 2, p ≥ 8 ensures that ε < .005, and when β = 10, p ≥ 3 is enough.

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error is actually of the expression

(8)   SQRT((a ⊕ (b ⊕ c)) ⊗ (c ⊖ (a ⊖ b)) ⊗ (c ⊕ (a ⊖ b)) ⊗ (a ⊕ (b ⊖ c))) ⊘ 4

Because of the cumbersome nature of (8), in the statement of theorems we will usually say the computed value of E rather than writing out E with circle notation.

Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is 2.35, compared with a true value of 2.34216 for a relative error of 0.7ε, which is much less than 11ε. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

A final example of an expression that can be rewritten to use benign cancellation is (1 + x)^n, where x ≪ 1. This expression arises in financial calculations. Consider depositing $100 every day into a bank account that earns an annual interest rate of 6%, compounded daily. If n = 365 and i = .06, the amount of money accumulated at the end of one year is

100 [(1 + i/n)^n - 1] / (i/n)

dollars. If this is computed using β = 2 and p = 24, the result is $37615.45 compared to the exact answer of $37614.05, a discrepancy of $1.40. The reason for the problem is easy to see. The expression 1 + i/n involves adding 1 to .0001643836, so the low order bits of i/n are lost. This rounding error is amplified when 1 + i/n is raised to the nth power.

The troublesome expression (1 + i/n)^n can be rewritten as exp(n ln(1 + i/n)), where now the problem is to compute ln(1 + x) for small x. One approach is to use the approximation ln(1 + x) ≈ x, in which case the payment becomes $37617.26, which is off by $3.21 and even less accurate than the obvious formula. But there is a way to compute ln(1 + x) very accurately, as Theorem 4 shows [Hewlett-Packard 1982]. This formula yields $37614.07, accurate to within two cents!

Theorem 4 assumes that LN(x) approximates ln(x) to within 1/2 ulp. The problem it solves is that when x is small, LN(1 ⊕ x) is not close to ln(1 + x) because 1 ⊕ x has lost the information in the low order bits of x. That is, the computed value of ln(1 + x) is not close to its actual value when x ≪ 1.

Theorem 4

If ln(1 + x) is computed using the formula

ln(1 + x) = x                              if 1 ⊕ x = 1
          = x ln(1 + x) / ((1 + x) - 1)    if 1 ⊕ x ≠ 1

the relative error is at most 5ε when 0 ≤ x < 3/4, provided subtraction is performed with a guard digit, ε < 0.1, and ln is computed to within 1/2 ulp.

This formula will work for any value of x but is only interesting for x ≪ 1, which is where catastrophic cancellation occurs in the naive formula ln(1 + x). Although the formula may seem mysterious, there is a simple explanation for why it works. Write ln(1 + x) as

ln(1 + x) = x (ln(1 + x)/x) = x µ(x).

The left hand factor can be computed exactly, but the right hand factor µ(x) = ln(1 + x)/x will suffer a large rounding error when adding 1 to x. However, µ is almost constant, since ln(1 + x) ≈ x. So changing x slightly will not introduce much error. In other words, if x̃ ≈ x, computing x̃ µ(x̃) will be a good approximation to x µ(x) = ln(1 + x). Is there a value for x̃ for which x̃ and x̃ + 1 can be computed accurately? There is; namely x̃ = (1 ⊕ x) ⊖ 1, because then 1 + x̃ is exactly equal to 1 ⊕ x.
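In IEEE double precision the formula of Theorem 4 takes only a few lines. The sketch below is our own rendering (most math libraries already provide it as log1p) and assumes round-to-nearest arithmetic with a guard digit.

    import math

    def log1p_accurate(x):
        # Theorem 4: if 1 + x rounds to exactly 1, ln(1 + x) is x to working
        # accuracy; otherwise use x * ln(w)/(w - 1) with w = 1 ⊕ x, so that
        # the rounding error made in forming w largely cancels in the quotient.
        w = 1.0 + x
        if w == 1.0:
            return x
        return x * math.log(w) / (w - 1.0)

    x = 1e-10
    print(log1p_accurate(x))   # about 9.9999999995e-11, essentially math.log1p(x)
    print(math.log(1.0 + x))   # about 1.00000008e-10: the naive form loses about 8 digits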

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high, because it merely requires making the adder one bit wider. For a 54 bit double precision adder, the additional cost is less than 2%. For this price, you gain the ability to run many algorithms such as formula (6) for computing the area of a triangle and the expression ln(1 + x ). Although most modern computers have a guard digit, there are a few (such as Cray systems) that do not.

Exactly Rounded Operations

When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number. Operations performed in this manner will be called exactly rounded. 8 The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results. The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding.

So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the 10 digits in half, letting 0, 1, 2, 3, 4 round down, and 5, 6, 7, 8, 9 round up; thus 12.5 would round to 13. This is how rounding works on Digital Equipment Corporation's VAX computers. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior is to require that the rounded result have its least significant digit be even. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [1975] offer the following reason for preferring round to even.

Theorem 5

Let x and y be floating-point numbers, and define x_0 = x, x_1 = (x_0 ⊖ y) ⊕ y, ..., x_n = (x_(n-1) ⊖ y) ⊕ y. If ⊕ and ⊖ are exactly rounded using round to even, then either x_n = x for all n or x_n = x_1 for all n ≥ 1. ∎

To clarify this result, consider β = 10, p = 3 and let x = 1.00, y = -.555. When rounding up, the sequence becomes

x_0 ⊖ y = 1.56,  x_1 = (x_0 ⊖ y) ⊕ y = 1.01,  x_1 ⊖ y = 1.57,

and each successive value of x_n increases by .01, until x_n = 9.45 (n ≤ 845). Under round to even, x_n is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.
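The drift is easy to watch with Python's decimal module, which can emulate β = 10, p = 3 arithmetic under either tie-breaking rule (a small sketch, not from the paper; the helper is ours).

    from decimal import Decimal, getcontext, ROUND_HALF_UP, ROUND_HALF_EVEN

    def iterate(rounding, steps=10):
        getcontext().prec = 3
        getcontext().rounding = rounding
        x, y = Decimal("1.00"), Decimal("-0.555")
        for _ in range(steps):
            x = (x - y) + y              # each operation is rounded to 3 digits
        return x

    print(iterate(ROUND_HALF_UP))        # 1.10: drifts upward by .01 per step
    print(iterate(ROUND_HALF_EVEN))      # 1.00: no drift, as Theorem 5 promises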

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic.

The key to multiplication in this system is representing a product xy as a sum, where each summand has the same precision as x and y. This can be done by splitting x and y. Writing x = x_h + x_l and y = y_h + y_l, the exact product is

xy = x_h y_h + x_h y_l + x_l y_h + x_l y_l.

If x and y have p bit significands, the summands will also have p bit significands provided that x_l, x_h, y_h, y_l can be represented using ⌊p/2⌋ bits. When p is even, it is easy to find a splitting. The number x_0.x_1 ... x_(p-1) can be written as the sum of x_0.x_1 ... x_(p/2-1) and 0.0 ... 0 x_(p/2) ... x_(p-1). When p is odd, this simple splitting method will not work. An extra bit can, however, be gained by using negative numbers. For example, if β = 2, p = 5, and x = .10111, x can be split as x_h = .11 and x_l = -.00001. There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [1971], but it requires more than a single guard digit.

Theorem 6

Let p be the floating-point precision, with the restriction that p is even when β > 2, and assume that floating-point operations are exactly rounded. Then if k = ⌈p/2⌉ is half the precision (rounded up) and m = β^k + 1, x can be split as x = x_h + x_l, where

x_h = (m ⊗ x) ⊖ (m ⊗ x ⊖ x),   x_l = x ⊖ x_h,

and each x_i is representable using ⌊p/2⌋ bits of precision.

To see how this theorem works in an example, let β = 10, p = 4, b = 3.476, a = 3.463, and c = 3.479. Then b^2 - ac rounded to the nearest floating-point number is .03480, while b ⊗ b = 12.08, a ⊗ c = 12.05, and so the computed value of b^2 - ac is .03. This is an error of 480 ulps. Using Theorem 6 to write b = 3.5 - .024, a = 3.5 - .037, and c = 3.5 - .021, b^2 becomes 3.5^2 - 2 × 3.5 × .024 + .024^2. Each summand is exact, so b^2 = 12.25 - .168 + .000576, where the sum is left unevaluated at this point. Similarly, ac = 3.5^2 - (3.5 × .037 + 3.5 × .021) + .037 × .021 = 12.25 - .2030 + .000777. Finally, subtracting these two series term by term gives an estimate for b^2 - ac of 0 ⊕ .0350 ⊖ .000201 = .03480, which is identical to the exactly rounded result. To show that Theorem 6 really requires exact rounding, consider p = 3, β = 2, and x = 7. Then m = 5, mx = 35, and m ⊗ x = 32. If subtraction is performed with a single guard digit, then (m ⊗ x) ⊖ x = 28. Therefore, x_h = 4 and x_l = 3, hence x_l is not representable with ⌊p/2⌋ = 1 bit.
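In IEEE double precision (β = 2, p = 53, so k = ⌈p/2⌉ = 27) the splitting of Theorem 6 is only a few operations, sketched below with a function name of our own; it assumes exactly rounded arithmetic and that m ⊗ x neither overflows nor underflows.

    def split(x, k=27):
        # Theorem 6 with m = 2^k + 1: x_h = (m ⊗ x) ⊖ (m ⊗ x ⊖ x), x_l = x ⊖ x_h.
        m = 2.0**k + 1.0
        t = m * x
        x_h = t - (t - x)
        x_l = x - x_h
        return x_h, x_l

    x_h, x_l = split(0.1)
    print(x_h + x_l == 0.1)      # True: the split is exact
    print(x_h, x_l)              # each part fits in about half the significand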

As a final example of exact rounding, consider dividing m by 10. The result is a floating-point number that will in general not be equal to m/10. When β = 2, multiplying m/10 by 10 will restore m, provided exact rounding is being used. Actually, a more general fact (due to Kahan) is true. The proof is ingenious, but readers not interested in such details can skip ahead to the section The IEEE Standard.

Theorem 7

When β = 2, if m and n are integers with |m| < 2^(p-1) and n has the special form n = 2^i + 2^j, then (m ⊘ n) ⊗ n = m, provided floating-point operations are exactly rounded.

Proof

Scaling by a power of two is harmless, since it changes only the exponent, not the significand. If q = m/n, then scale n so that 2^(p-1) ≤ n < 2^p and scale m so that 1/2 < q < 1. Thus, 2^(p-2) < m < 2^p. Since m has p significant bits, it has at most one bit to the right of the binary point. Changing the sign of m is harmless, so assume that q > 0.

If q̄ = m ⊘ n, to prove the theorem requires showing that

(9)   |n q̄ - m| ≤ 1/4.

That is because m has at most 1 bit right of the binary point, so n q̄ will round to m. To deal with the halfway case when |n q̄ - m| = 1/4, note that since the initial unscaled m had |m| < 2^(p-1), its low-order bit was 0, so the low-order bit of the scaled m is also 0. Thus, halfway cases will round to m.

Suppose that q = .q_1 q_2 ..., and let q̂ = .q_1 q_2 ... q_p 1. To estimate |n q̄ - m|, first compute |q̂ - q| = |N/2^(p+1) - m/n|, where N is an odd integer. Since n = 2^i + 2^j and 2^(p-1) ≤ n < 2^p, it must be that n = 2^(p-1) + 2^k for some k ≤ p - 2, and thus

|q̂ - q| = |(2^(p-1-k) + 1)N - 2^(p+1-k) m| / (n 2^(p+1-k)).

The numerator is an integer, and since N is odd, it is in fact an odd integer. Thus, |q̂ - q| ≥ 1/(n 2^(p+1-k)).

Assume q < q̂ (the case q > q̂ is similar). Then n q̄ < m, and

|m - n q̄| = m - n q̄ = n(q - q̄) ≤ n(q - (q̂ - 2^(-p-1))) ≤ (2^(p-1) + 2^k) 2^(-p-1) - 2^(-p-1+k) = 1/4.

This establishes (9) and proves the theorem. ∎

The theorem holds true for any base β, as long as 2^i + 2^j is replaced by β^i + β^j. As β gets larger, however, denominators of the form β^i + β^j are farther and farther apart.
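Theorem 7 is easy to spot-check in IEEE double precision (β = 2, p = 53), where division and multiplication are exactly rounded; the test below is a sketch of our own.

    import random

    p = 53
    random.seed(1)
    for _ in range(10000):
        i, j = random.sample(range(p), 2)
        n = 2**i + 2**j                        # the special form n = 2^i + 2^j
        m = random.randrange(1, 2**(p - 1))    # |m| < 2^(p-1)
        assert (m / n) * n == m                # (m ⊘ n) ⊗ n recovers m exactly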

We are now in a position to answer the question, Does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense they have a small relative error. The section Cancellation discussed several algorithms that require guard digits to produce correct results in this sense. If the input to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting. The reason is that the benign cancellation x - y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7. These are useful even if every floating-point variable is only an approximation to some actual value.

The IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, p = 24 for single precision and p = 53 for double precision [IEEE 1987]. It also specifies the precise layout of bits in single and double precision. IEEE 854 allows either β = 2 or β = 10 and unlike 754, does not specify how floating-point numbers are encoded into bits [Cody et al. 1984]. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision. The term IEEE Standard will be used when discussing properties common to both standards.

This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use. For full details consult the standards themselves [IEEE 1987; Cody et al. 1984].

Formats and Operations

Base

It is clear why IEEE 854 allows β = 10. Base ten is how humans exchange and think about numbers. Using β = 10 is especially appropriate for calculators, where the result of each operation is displayed by the calculator in decimal.

There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2. The section Relative Error and Ulps mentioned one reason: the results of error analyses are much tighter when β is 2 because a rounding error of .5 ulp wobbles by a factor of β when computed as a relative error, and error analyses are almost always simpler when based on relative error. A related reason has to do with the effective precision for large bases. Consider β = 16, p = 1 compared to β = 2, p = 4. Both systems have 4 bits of significand. Consider the computation of 15/8. When β = 2, 15 is represented as 1.111 × 2^3, and 15/8 as 1.111 × 2^0. So 15/8 is exact. However, when β = 16, 15 is represented as F × 16^0, where F is the hexadecimal digit for 15. But 15/8 is represented as 1 × 16^0, which has only one bit correct. In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p - 3 rather than 4p binary bits. Since large values of β have these problems, why did IBM choose β = 16 for its System/370? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range. Single precision on the System/370 has β = 16, p = 6. Hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. Thus the magnitude of representable numbers ranges from about 16^(-2^6) to about 16^(2^6) = 2^(2^8). To get a similar exponent range when β = 2 would require 9 bits of exponent, leaving only 22 bits for the significand. However, it was just pointed out that when β = 16, the effective precision can be as low as 4p - 3 = 21 bits. Even worse, when β = 2 it is possible to gain an extra bit of precision (as explained later in this section), so the β = 2 machine has 23 bits of precision to compare with a range of 21 - 24 bits for the β = 16 machine.

Another possible explanation for choosing β = 16 has to do with shifting. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation. In the β = 16, p = 1 system, all the numbers between 1 and 15 have the same exponent, and so no shifting is required when adding any of the (15 choose 2) = 105 possible pairs of distinct numbers from this set. However, in the β = 2, p = 4 system, these numbers have exponents ranging from 0 to 3, and shifting is required for 70 of the 105 pairs.

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base. Another advantage of using β = 2 is that there is a way to gain an extra bit of significance. Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of storage representing it. Formats that use this trick are said to have a hidden bit. It was already pointed out in Floating-point Formats that this requires a special convention for 0. The method given there was that an exponent of e_min - 1 and a significand of all zeros represents not 1.0 × 2^(e_min - 1), but rather 0.

IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits.

Precision

The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In IEEE 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. Extended precision is a format that offers at least a little extra precision and exponent range (TABLE D-1).

TABLE D-1   IEEE 754 Format Parameters

Parameter                  Single    Single-Extended    Double    Double-Extended
p                          24        ≥ 32               53        ≥ 64
e_max                      +127      ≥ +1023            +1023     ≥ +16383
e_min                      -126      ≤ -1022            -1022     ≤ -16382
Exponent width in bits     8         ≥ 11               11        ≥ 15
Format width in bits       32        ≥ 43               64        ≥ 79
The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits. The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits. 13

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that "implementations should support the extended format corresponding to the widest basic format supported."

One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally. By displaying only 10 of the 13 digits, the calculator appears to the user as a "black box" that computes exponentials, cosines, etc. to 10 digits of accuracy. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with. It is not hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator.

Extended precision in the IEEE standard serves a similar function. It enables libraries to efficiently compute quantities to within about .5 ulp in single (or double) precision, giving the user of those libraries a simple model, namely that each primitive operation, be it a simple multiply or an invocation of log, returns a value accurate to within about .5 ulp. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE 754 single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (see the section Binary to Decimal Conversion). When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer. Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer N, ignoring the decimal point. From TABLE D-1, p ≥ 32, and since 10^9 < 2^32 ≈ 4.3 × 10^9, N can be represented exactly in single-extended. Next find the appropriate power 10^P necessary to scale N. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute 10^|P|. If |P| ≤ 13, then this is also represented exactly, because 10^13 = 2^13 5^13, and 5^13 < 2^32. Finally multiply (or divide if P < 0) N and 10^|P|. If this last operation is done exactly, then the closest binary number is recovered. The section Binary to Decimal Conversion shows how to do the last multiply (or divide) exactly. Thus for |P| ≤ 13, the use of the single-extended format enables 9-digit decimal numbers to be converted to the closest binary number (i.e. exactly rounded). If |P| > 13, then single-extended is not enough for the above algorithm to always compute the exactly rounded binary equivalent, but Coonen [1984] shows that it is enough to guarantee that the conversion of binary to decimal and back will recover the original binary number.
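The sketch below (our own, not from the paper) follows the same outline, with double precision standing in for single-extended and numpy's float32 as the single format. It ignores the refinement that makes the final multiply or divide exact, so in rare cases it can double-round; the section Binary to Decimal Conversion deals with that.

    import numpy as np

    def decimal9_to_single(digits, P):
        # Convert int(digits) * 10^P, with at most 9 digits and |P| <= 13.
        # Double precision (p = 53 >= 32) plays the role of single-extended.
        N = float(int(digits))        # exact, since 10^9 < 2^53
        scale = float(10 ** abs(P))   # exact for |P| <= 13, since 10^13 < 2^53
        x = N * scale if P >= 0 else N / scale
        return np.float32(x)          # final rounding to single precision

    print(decimal9_to_single("123456789", -13))   # about 1.2345679e-05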

If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17-digit decimal number and back would require the double-extended format.

Exponent

Since the exponent can be positive or negative, some method must be chosen to represent its sign. Two common methods of representing signed numbers are sign/magnitude and two's complement. Sign/magnitude is the system used for the sign of the significand in the IEEE formats: one bit is used to hold the sign, the rest of the bits represent the magnitude of the number. The two's complement representation is often used in integer arithmetic. In this scheme, a number in the range [-2^(p-1), 2^(p-1) - 1] is represented by the smallest nonnegative number that is congruent to it modulo 2^p.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if k is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is k - 127. This is often called the unbiased exponent to distinguish from the biased exponent k.

Referring to TABLE D-1, single precision has e_max = 127 and e_min = -126. The reason for having |e_min| < e_max is so that the reciprocal of the smallest number will not overflow. Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. The section Base explained that e_min - 1 is used for representing 0, and Special Quantities will introduce a use for e_max + 1. In IEEE single precision, this means that the unbiased exponents range between e_min - 1 = -127 and e_max + 1 = 128, whereas the biased exponents range between 0 and 255, which are exactly the nonnegative numbers that can be represented using 8 bits.
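A short sketch (ours) that decodes a single precision bit pattern makes the bias and the hidden bit concrete; biased exponents 0 and 255 are the reserved values e_min - 1 and e_max + 1 described under Special Quantities.

    import struct

    def decode_single(x):
        # Reinterpret the 32-bit IEEE 754 single precision pattern of x.
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        sign = bits >> 31
        biased = (bits >> 23) & 0xFF        # stored (biased) exponent, 0..255
        fraction = bits & 0x7FFFFF          # the 23 explicitly stored significand bits
        if 0 < biased < 255:                # normalized: the hidden bit supplies the leading 1
            return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (biased - 127)
        return None                         # 0 and 255 encode zeros/denormals and infinities/NaNs

    print(decode_single(6.5))   # 6.5, stored as +1.625 x 2^2 with biased exponent 129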

Operations

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). The section Guard Digits pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg 1990]. Thus the standard can be implemented efficiently.

One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. Once an algorithm is proven to be correct for IEEE arithmetic, it will work correctly on any machine supporting the IEEE standard.

Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation and Exactly Rounded Operations, which require features not present on all hardware. Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.

There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker [1986] have proposed adding inner product to the list of operations that are precisely specified. They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. For example sums are a special case of inner products, and the sum ((2 × 10^-30 + 10^30) - 10^30) - 10^-30 is exactly equal to 10^-30, but on a machine with IEEE arithmetic the computed result will be -10^-30. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch 1987].
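The sum in question behaves the same way in IEEE double precision, as this tiny sketch shows:

    x = ((2e-30 + 1e30) - 1e30) - 1e-30
    print(x)    # -1e-30: the 2e-30 was absorbed when it was added to 1e30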

All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary. The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen 1984].

The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500...0ddd or 5.0834999...9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC, and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.

Special Quantities

On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this. On the other hand, the VAX reserves some bit patterns to represent special numbers called reserved operands. This idea goes back to the CDC 6600, which had bit patterns for the special quantities INDEFINITE and INFINITY.

The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. Under IBM System/370 FORTRAN, the default action in response to computing the square root of a negative number like -4 results in the printing of an error message. Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, √(-4) = 2 is returned. In IEEE arithmetic, a NaN is returned in this situation.

The IEEE standard specifies the following special values (see TABLE D-2): ±0, denormalized numbers, ±∞ and NaNs (there is more than one NaN, as explained in the next section). These special values are all encoded with exponents of either e_max + 1 or e_min - 1 (it was already pointed out that 0 has an exponent of e_min - 1).

TABLE D-2   IEEE 754 Special Values

Exponent                 Fraction    Represents
e = e_min - 1            f = 0       ±0
e = e_min - 1            f ≠ 0       0.f × 2^e_min
e_min ≤ e ≤ e_max        --          1.f × 2^e
e = e_max + 1            f = 0       ±∞
e = e_max + 1            f ≠ 0       NaN

NaNs

Traditionally, the computation of 0/0 or √-1 has been treated as an unrecoverable error which causes a computation to halt. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function f, say zero(f). Traditionally, zero finders require the user to input an interval [a, b] on which the function is defined and over which the zero finder will search. That is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function f at various values. If it probed for a value outside the domain of f, the code for f might well compute 0/0 or √-1, and the computation would halt, unnecessarily aborting the zero finding process.

This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and √-1 produce NaN, rather than halting. A list of some of the situations that can cause a NaN are given in TABLE D-3. Then when zero(f) probes outside the domain of f, the code for f will return NaN, and the zero finder can continue. That is, zero(f) is not "punished" for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be. Suppose that the final statement of f is return(-b + sqrt(d))/(2*a). If d < 0, then f should return a NaN. Since d < 0, sqrt(d) is a NaN, and -b + sqrt(d) will be a NaN, if the sum of a NaN and any other number is a NaN. Similarly if one operand of a division operation is a NaN, the quotient should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN.

TABLE D-3   Operations That Produce a NaN

Operation    NaN Produced By
+            ∞ + (-∞)
×            0 × ∞
/            0/0, ∞/∞
REM          x REM 0, ∞ REM y
√            √x (when x < 0)
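NaN propagation is easy to observe in IEEE double precision. The sketch below uses numpy so that the square root of a negative number quietly yields a NaN instead of raising a Python exception.

    import numpy as np

    with np.errstate(invalid="ignore"):
        a, b, d = np.float64(1.0), np.float64(1.0), np.float64(-4.0)
        root = (-b + np.sqrt(d)) / (2 * a)   # sqrt(-4) is NaN, and it propagates
        print(root, np.isnan(root))          # nan True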


Another approach to writing a zero solver that doesn't require the user to input a domain is to use signals. The zero-finder could install a signal handler for floating-point exceptions. Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability.

In IEEE 754, NaNs are often represented as floating-point numbers with the exponent e_max + 1 and nonzero significands. Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs. When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first.

Infinity

Just as NaNs provide a way to continue a computation when expressions like 0/0 or √-1 are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number. As an example, consider computing √(x^2 + y^2), when β = 10, p = 3, and e_max = 98. If x = 3 × 10^70 and y = 4 × 10^70, then x^2 will overflow, and be replaced by 9.99 × 10^98. Similarly y^2, and x^2 + y^2 will each overflow in turn, and be replaced by 9.99 × 10^98. So the final result will be √(9.99 × 10^98) = 3.16 × 10^49, which is drastically wrong: the correct answer is 5 × 10^70. In IEEE arithmetic, the result of x^2 is ∞, as is y^2, x^2 + y^2 and √(x^2 + y^2). So the final result is ∞, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer.
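The same effect is easy to see in IEEE single precision (e_max = 127), where 3 × 10^30 squared already overflows; a rescaled formulation such as hypot avoids it. A small sketch using numpy's float32:

    import numpy as np

    with np.errstate(over="ignore"):
        x, y = np.float32(3e30), np.float32(4e30)
        naive = np.sqrt(x * x + y * y)   # x*x overflows float32 to inf, so the result is inf
        better = np.hypot(x, y)          # internally rescaled: about 5e30, the right answer
        print(naive, better)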

The division of 0 by 0 results in a NaN. A nonzero number divided by 0, however, returns infinity: 1/0 = ∞, -1/0 = -∞. The reason for the distinction is this: if f(x) → 0 and g(x) → 0 as x approaches some limit, then f(x)/g(x) could have any value. For example, when f(x) = sin x and g(x) = x, then f(x)/g(x) → 1 as x → 0. But when f(x) = 1 - cos x, f(x)/g(x) → 0. When thinking of 0/0 as the limiting situation of a quotient of two very small numbers, 0/0 could represent anything. Thus in the IEEE standard, 0/0 results in a NaN. But when c > 0, f(x) → c, and g(x) → 0, then f(x)/g(x) → ±∞, for any analytic functions f and g. If g(x) < 0 for small x, then f(x)/g(x) → -∞, otherwise the limit is +∞. So the IEEE standard defines c/0 = ±∞, as long as c ≠ 0. The sign of ∞ depends on the signs of c and 0 in the usual way, so that -10/0 = -∞, and -10/-0 = +∞. You can distinguish between getting ∞ because of overflow and getting ∞ because of division by zero by checking the status flags (which will be discussed in detail in section Flags). The overflow flag will be set in the first case, the division by zero flag in the second.

The rule for determining the result of an operation that has infinity as an operand is simple: replace infinity with a finite number x and take the limit as x → ∞. Thus 3/∞ = 0, because

lim_{x→∞} 3/x = 0.

Similarly, 4 - ∞ = -∞, and √∞ = ∞. When the limit doesn't exist, the result is a NaN, so ∞/∞ will be a NaN (TABLE D-3 has additional examples). This agrees with the reasoning used to conclude that 0/0 should be a NaN.

When a subexpression evaluates to a NaN, the value of the entire expression is also a NaN. In the case of ±∞, however, the value of the expression might be an ordinary floating-point number because of rules like 1/∞ = 0. Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function x/(x² + 1). This is a bad formula, because not only will it overflow when x is large enough that x² overflows, but infinity arithmetic will give the wrong answer in that case, because it will yield 0 rather than a number near 1/x. However, x/(x² + 1) can be rewritten as 1/(x + x⁻¹). This improved expression will not overflow prematurely and, because of infinity arithmetic, will have the correct value when x = 0: 1/(0 + 0⁻¹) = 1/(0 + ∞) = 1/∞ = 0. Without infinity arithmetic, the expression 1/(x + x⁻¹) requires a test for x = 0, which not only adds extra instructions, but may also disrupt a pipeline. This example illustrates a general fact, namely that infinity arithmetic often avoids the need for special case checking; however, formulas need to be carefully inspected to make sure they do not have spurious behavior at infinity (as x/(x² + 1) did).
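
The following C99 sketch (ours; the variable names are illustrative) compares the two formulas discussed above for a large argument and for x = 0:

#include <stdio.h>

int main(void)
{
    double xs[] = { 0.0, 1e200 };
    for (int i = 0; i < 2; i++) {
        double x = xs[i];
        double bad  = x / (x*x + 1.0);    /* for x = 1e200, x*x overflows to
                                             infinity, so this yields 0     */
        double good = 1.0 / (x + 1.0/x);  /* stays in range; at x = 0 it is
                                             1/(0 + infinity) = 0           */
        printf("x = %g   bad = %g   good = %g\n", x, bad, good);
    }
    return 0;
}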

Signed Zero

Zero is represented by the exponent emin - 1 and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and -0. If a distinction were made when comparing +0 and -0, simple tests like if (x = 0) would have very unpredictable behavior, depending on the sign of x. Thus the IEEE standard defines comparison so that +0 = -0, rather than -0 < +0. Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Thus 3 · (+0) = +0, and +0/-3 = -0. If zero did not have a sign, then the relation 1/(1/x) = x would fail to hold when x = ±∞. The reason is that 1/-∞ and 1/+∞ both result in 0, and 1/0 results in +∞, the sign information having been lost. One way to restore the identity 1/(1/x) = x is to have only one kind of infinity; however, that would result in the disastrous consequence of losing the sign of an overflowed quantity.

Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log. In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number.
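
A small C99 sketch (ours, not from the paper) showing two of the points above: signed zero preserves 1/(1/x) = x at x = ±∞, and signbit() can tell an underflowed negative quantity (which becomes -0) apart from +0 even though -0 == +0:

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("1/(1/+inf) = %g\n", 1.0 / (1.0 /  INFINITY));   /* +inf         */
    printf("1/(1/-inf) = %g\n", 1.0 / (1.0 / -INFINITY));   /* -inf, via -0 */

    double x = -1e-300 * 1e-100;      /* a negative product underflows to -0 */
    printf("x = %g, x == 0.0 is %d, signbit(x) is %d\n",
           x, x == 0.0, signbit(x) != 0);
    return 0;
}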

Probably the most interesting use of signed zero occurs in complex arithmetic. To take a simple example, consider the equation √(1/z) = 1/√z. This is certainly true when z ≥ 0. If z = -1, the obvious computation gives √(1/-1) = √-1 = i and 1/√-1 = 1/i = -i. Thus, √(1/z) ≠ 1/√z! The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration. This leaves the problem of what to do for the negative real numbers, which are of the form -x + i0, where x > 0. Signed zero provides a perfect way to resolve this problem. Numbers of the form -x + i(+0) have square root i√x, and numbers of the form -x + i(-0) on the other side of the branch cut have the square root with the other sign, -i√x. In fact, the natural formulas for computing √ will give these results.

Back to √(1/z) = 1/√z. If z = -1 = -1 + i0, then

1/z = 1/(-1 + i0) = [(-1 - i0)]/[(-1 + i0)(-1 - i0)] = (-1 - i0)/((-1)² - 0²) = -1 + i(-0),

and so √(1/z) = √(-1 + i(-0)) = -i, while 1/√z = 1/i = -i. Thus IEEE arithmetic preserves this identity for all z. Some more sophisticated examples are given by Kahan [1987]. Although distinguishing between +0 and -0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation x = y ⇔ 1/x = 1/y, which is false when x = +0 and y = -0. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.

Denormalized Numbers

Consider normalized floating-point numbers with β = 10, p = 3, and emin = -98. The numbers x = 6.87 × 10⁻⁹⁷ and y = 6.81 × 10⁻⁹⁷ appear to be perfectly ordinary floating-point numbers, which are more than a factor of 10 larger than the smallest floating-point number 1.00 × 10⁻⁹⁸. They have a strange property, however: x ⊖ y = 0 even though x ≠ y! The reason is that x - y = .06 × 10⁻⁹⁷ = 6.0 × 10⁻⁹⁹ is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property

(10) x = y ⇔ x ⊖ y = 0 ?

It's very easy to imagine writing the code fragment, if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing formula (6), it was very helpful to know that x/2 < y < 2x ⇒ x ⊖ y = x - y. Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything.

The IEEE standard uses denormalized 18 numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard. 19 The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is emin, the significand does not have to be normalized, so that when β = 10, p = 3 and emin = -98, 1.00 × 10⁻⁹⁸ is no longer the smallest floating-point number, because 0.98 × 10⁻⁹⁸ is also a floating-point number.

There is a small snag when β = 2 and a hidden bit is being used, since a number with an exponent of emin will always have a significand greater than or equal to 1.0 because of the implicit leading bit. The solution is similar to that used to represent 0, and is summarized in TABLE D-2. The exponent emin is used to represent denormals. More formally, if the bits in the significand field are b₁, b₂, ..., b_(p-1), and the value of the exponent is e, then when e > emin - 1, the number being represented is 1.b₁b₂...b_(p-1) × 2^e, whereas when e = emin - 1, the number being represented is 0.b₁b₂...b_(p-1) × 2^(e+1). The +1 in the exponent is needed because denormals have an exponent of emin, not emin - 1.

Recall the example of β = 10, p = 3, emin = -98, x = 6.87 × 10⁻⁹⁷ and y = 6.81 × 10⁻⁹⁷ presented at the beginning of this section. With denormals, x - y does not flush to zero but is instead represented by the denormalized number 0.6 × 10⁻⁹⁸. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.
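
The same experiment can be run in IEEE double precision, where the smallest normalized number is about 2.2 × 10⁻³⁰⁸; the sketch below (ours, not from the paper) scales the example above into that range and shows the difference becoming a denormal rather than zero:

#include <float.h>
#include <stdio.h>

int main(void)
{
    double x = 6.87e-308, y = 6.81e-308;   /* both normalized, near DBL_MIN */
    double d = x - y;                      /* about 6.0e-310: a denormal    */
    printf("d = %g, d == 0 is %d, d < DBL_MIN is %d\n",
           d, d == 0.0, d < DBL_MIN);
    return 0;
}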


FIGURE D-2 Flush To Zero Compared With Gradual Underflow

FIGURE D-2 illustrates denormalized numbers. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number 1.0 × β^emin. If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The "gulf" is filled in, and when the result of a calculation is less than 1.0 × β^emin, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the spacing abruptly changes from β^(-p+1)β^emin to β^emin, which is a factor of β^(p-1), rather than the orderly change by a factor of β. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

Without gradual underflow, the simple expression x - y can have a very large relative error for normalized inputs, as was seen above for x = 6.87 × 10⁻⁹⁷ and y = 6.81 × 10⁻⁹⁷. Large relative errors can happen even without cancellation, as the following example shows [Demmel 1984]. Consider dividing two complex numbers, a + ib and c + id. The obvious formula

(a + ib)/(c + id) = (ac + bd)/(c² + d²) + i · (bc - ad)/(c² + d²)

suffers from the problem that if either component of the denominator c + id is large enough that its square overflows, the formula will overflow, even though the final result may be well within range. A better method of computing the quotient is to use Smith's formula:

(11) (a + ib)/(c + id) = (a + b(d/c))/(c + d(d/c)) + i · (b - a(d/c))/(c + d(d/c))   if |d| ≤ |c|
                       = (a(c/d) + b)/(c(c/d) + d) + i · (b(c/d) - a)/(c(c/d) + d)   if |d| > |c|

Applying Smith's formula to (2 · 10⁻⁹⁸ + i 10⁻⁹⁸)/(4 · 10⁻⁹⁸ + i (2 · 10⁻⁹⁸)) gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to 1.0 × β^emin.
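
A C99 sketch of Smith's formula (ours; the helper name smith_divide is illustrative, and the driver values mirror the example above rescaled into double precision):

#include <math.h>
#include <stdio.h>

/* Computes (a + ib)/(c + id) without forming c*c + d*d directly. */
static void smith_divide(double a, double b, double c, double d,
                         double *re, double *im)
{
    if (fabs(d) <= fabs(c)) {
        double r   = d / c;
        double den = c + d*r;
        *re = (a + b*r) / den;
        *im = (b - a*r) / den;
    } else {
        double r   = c / d;
        double den = c*r + d;
        *re = (a*r + b) / den;
        *im = (b*r - a) / den;
    }
}

int main(void)
{
    double re, im;
    smith_divide(2e-308, 1e-308, 4e-308, 2e-308, &re, &im);
    printf("%g + %g i\n", re, im);     /* approximately 0.5 + 0 i */
    return 0;
}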

Exceptions, Flags and Trap Handlers

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. Typical of the default results are NaN for 0/0 and √-1, and ∞ for 1/0 and overflow. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are "sticky" in that once set, they remain set until explicitly cleared. Testing the flags is the only way to distinguish 1/0, which is a genuine infinity, from an overflow.

Sometimes continuing execution in the face of exception conditions is not appropriate. The section Infinity gave the example of x/(x² + 1). When x is large enough that x² overflows, the denominator is infinite, resulting in a final answer of 0, which is totally wrong. Although for this formula the problem can be solved by rewriting it as 1/(x + x⁻¹), rewriting may not always solve the problem. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.

The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in TABLE D-3, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions in TABLE D-3. 20

TABLE D-4 Exceptions in IEEE 754*

Exception         Result when traps disabled       Argument to trap handler
overflow          ±∞ or ±x_max                     round(x · 2^-α)
underflow         0, ±2^emin or denormal           round(x · 2^α)
divide by zero    ±∞                               operands
invalid           NaN                              operands
inexact           round(x)                         round(x)

* x is the exact result of the operation, α = 192 for single precision, 1536 for double, and x_max = 1.11...11 × 2^emax.

The inexact exception is raised when the result of a floating-point operation is not exact. In the β = 10, p = 3 system, 3.5 ⊗ 4.2 = 14.7 is exact, but 3.5 ⊗ 4.3 = 15.0 is not exact (since 3.5 · 4.3 = 15.05), and raises an inexact exception. Binary to Decimal Conversion discusses an algorithm that uses the inexact exception. A summary of the behavior of all five exceptions is given in TABLE D-4.
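
A minimal C99 sketch (ours, not from the paper) of the same distinction in binary double precision, using the inexact status flag from <fenv.h>:

#include <fenv.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double a = 3.0, b = 4.0, c = 10.0;

    feclearexcept(FE_ALL_EXCEPT);
    volatile double p = a * b;                       /* 12.0 is exact       */
    printf("3*4  inexact = %d\n", fetestexcept(FE_INEXACT) != 0);

    feclearexcept(FE_ALL_EXCEPT);
    volatile double q = a / c;                       /* 0.3 must be rounded */
    printf("3/10 inexact = %d\n", fetestexcept(FE_INEXACT) != 0);
    return 0;
}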

There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive. This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled.

Trap Handlers

One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. This is especially useful for codes with a loop like do S until (x >= 100). Since comparing a NaN to a number with <, ≤, >, ≥, or = (but not ≠) always returns false, this code will go into an infinite loop if x ever becomes a NaN.

There is a more interesting use for trap handlers that comes up when computing products such as ∏ xi that could potentially overflow. One solution is to use logarithms, and compute exp(Σ log xi) instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression ∏ xi, even if there is no overflow. There is another solution using trap handlers called over/underflow counting that avoids both of these problems [Sterbenz 1974].

The idea is as follows. There is a global counter initialized to zero. Whenever the partial product p_k = x₁x₂···x_k overflows for some k, the trap handler increments the counter by one and returns the overflowed quantity with the exponent wrapped around. In IEEE 754 single precision, emax = 127, so if p_k = 1.45 × 2¹³⁰, it will overflow and cause the trap handler to be called, which will wrap the exponent back into range, changing p_k to 1.45 × 2⁻⁶² (see below). Similarly, if p_k underflows, the counter would be decremented, and the negative exponent would get wrapped around into a positive one. When all the multiplications are done, if the counter is zero then the final product is p_n. If the counter is positive, the product overflowed; if the counter is negative, it underflowed. If none of the partial products are out of range, the trap handler is never called and the computation incurs no extra cost. Even if there are over/underflows, the calculation is more accurate than if it had been computed with logarithms, because each p_k was computed from p_(k-1) using a full precision multiply. Barnett [1987] discusses a formula where the full accuracy of over/underflow counting turned up an error in earlier tables of that formula.

IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by 2^α, and then rounded to the relevant precision. For underflow, the result is multiplied by 2^α. The exponent α is 192 for single precision and 1536 for double precision. This is why 1.45 × 2¹³⁰ was transformed into 1.45 × 2⁻⁶² in the example above.

Rounding Modes

In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary to decimal conversion) each operation is computed exactly and then rounded. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward -∞. When used with the convert to integer operation, round toward -∞ causes the convert to become the floor function, while round toward +∞ is ceiling. The rounding mode affects overflow, because when round toward 0 or round toward -∞ is in effect, an overflow of positive magnitude causes the default result to be the largest representable number, not +∞. Similarly, overflows of negative magnitude will produce the largest negative number when round toward +∞ or round toward 0 is in effect.

One application of rounding modes occurs in interval arithmetic (another is mentioned in Binary to Decimal Conversion). When using interval arithmetic, the sum of two numbers x and y is an interval [z₁, z₂], where z₁ is x ⊕ y rounded toward -∞, and z₂ is x ⊕ y rounded toward +∞. The exact result of the addition is contained within the interval [z₁, z₂]. Without rounding modes, interval arithmetic is usually implemented by computing z₁ = (x ⊕ y)(1 - ε) and z₂ = (x ⊕ y)(1 + ε), where ε is machine epsilon. 21 This results in overestimates for the size of the intervals. Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval. If two intervals [x₁, x₂] and [y₁, y₂] are added, the result is [z₁, z₂], where z₁ is x₁ ⊕ y₁ with the rounding mode set to round toward -∞, and z₂ is x₂ ⊕ y₂ with the rounding mode set to round toward +∞.
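
A C99 sketch (ours; the helper name interval_add is illustrative) of interval addition using directed rounding, assuming the compiler honors the FENV_ACCESS pragma:

#include <fenv.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON

/* The exact sum of [xlo,xhi] and [ylo,yhi] lies in the returned [*zlo,*zhi]. */
static void interval_add(double xlo, double xhi, double ylo, double yhi,
                         double *zlo, double *zhi)
{
    int old = fegetround();
    fesetround(FE_DOWNWARD);  *zlo = xlo + ylo;   /* rounded toward -infinity */
    fesetround(FE_UPWARD);    *zhi = xhi + yhi;   /* rounded toward +infinity */
    fesetround(old);
}

int main(void)
{
    double lo, hi;
    interval_add(0.1, 0.1, 0.2, 0.2, &lo, &hi);
    printf("[%.17g, %.17g]\n", lo, hi);       /* a tight interval around 0.3 */
    return 0;
}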

When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation. This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size.

Flags

The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. There are four rounding modes: round toward nearest, round toward +∞, round toward 0, and round toward -∞. It is strongly recommended that there be an enable mode bit for each of the five exceptions. This section gives some simple examples of how these modes and flags can be put to good use. A more sophisticated example is discussed in the section Binary to Decimal Conversion.

Consider writing a subroutine to compute xⁿ, where n is an integer. When n > 0, a simple routine like the following will do:

double PositivePower(double x, int n)    /* assumes n > 0 */
{
    /* strip factors of two from n, squaring x for each one */
    while (n % 2 == 0) {
        x = x*x;
        n = n/2;
    }
    double u = x;
    /* binary exponentiation on the remaining (odd) n */
    for (;;) {
        n = n/2;
        if (n == 0)
            return u;
        x = x*x;
        if (n % 2 == 1)
            u = u*x;
    }
}


If n < 0, then a more accurate way to compute xⁿ is not to call PositivePower(1/x, -n) but rather 1/PositivePower(x, -n), because the first expression multiplies n quantities each of which has a rounding error from the division (i.e., 1/x). In the second expression these are exact (i.e., x), and the final division commits just one additional rounding error. Unfortunately, there is a slight snag in this strategy. If PositivePower(x, -n) underflows, then either the underflow trap handler will be called, or else the underflow status flag will be set. This is incorrect, because if x⁻ⁿ underflows, then xⁿ will either overflow or be in range. 22 But since the IEEE standard gives the user access to all the flags, the subroutine can easily correct for this. It simply turns off the overflow and underflow trap enable bits and saves the overflow and underflow status bits. It then computes 1/PositivePower(x, -n). If neither the overflow nor underflow status bit is set, it restores them together with the trap enable bits. If one of the status bits is set, it restores the flags and redoes the calculation using PositivePower(1/x, -n), which causes the correct exceptions to occur.
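
Here is a sketch (ours, C99) of that bookkeeping using the <fenv.h> status flags; standard C exposes no portable trap enable bits, so only the flag part of the strategy is shown, and the wrapper name Power is illustrative:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

double PositivePower(double x, int n);       /* the routine given earlier */

double Power(double x, int n)                /* handles the case n < 0 */
{
    fexcept_t saved;
    fegetexceptflag(&saved, FE_OVERFLOW | FE_UNDERFLOW);     /* save flags */
    feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);

    double r = 1.0 / PositivePower(x, -n);
    if (fetestexcept(FE_OVERFLOW | FE_UNDERFLOW) == 0) {
        fesetexceptflag(&saved, FE_OVERFLOW | FE_UNDERFLOW); /* restore    */
        return r;                        /* the more accurate expression   */
    }
    /* A spurious over/underflow occurred: restore the old flags and redo
       the calculation the other way, which raises the exceptions that
       really belong to x^n. */
    fesetexceptflag(&saved, FE_OVERFLOW | FE_UNDERFLOW);
    return PositivePower(1.0/x, -n);
}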

Another example of the use of flags occurs when computing arccos via the formula

arccos x = 2 arctan √((1 - x)/(1 + x)).

If arctan(∞) evaluates to π/2, then arccos(-1) will correctly evaluate to 2 · arctan(∞) = π, because of infinity arithmetic. However, there is a small snag, because the computation of (1 - x)/(1 + x) will cause the divide by zero exception flag to be set, even though arccos(-1) is not exceptional. The solution to this problem is straightforward. Simply save the value of the divide by zero flag before computing arccos, and then restore its old value after the computation.
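
A C99 sketch (ours; my_arccos is an illustrative name) of this save-and-restore idiom:

#include <fenv.h>
#include <math.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON

static double my_arccos(double x)
{
    fexcept_t saved;
    fegetexceptflag(&saved, FE_DIVBYZERO);   /* save the divide by zero flag */
    double r = 2.0 * atan(sqrt((1.0 - x) / (1.0 + x)));
    fesetexceptflag(&saved, FE_DIVBYZERO);   /* restore its old value        */
    return r;
}

int main(void)
{
    printf("my_arccos(-1) = %.17g, acos(-1) = %.17g\n",
           my_arccos(-1.0), acos(-1.0));
    return 0;
}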

Systems Aspects

The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions. Computer system designers rarely get guidance from numerical analysis texts, which are typically aimed at users and writers of software, not at computer designers. As an example of how plausible design decisions can lead to unexpected behavior, consider the following BASIC program.

q = 3.0/7.0
if q = 3.0/7.0 then print "Equal":
else print "Not Equal"


When compiled and run using Borland's Turbo Basic on an IBM PC, the program prints Not Equal! This example will be analyzed in the next section.

Incidentally, some people think that the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound E. This is hardly a cure-all because it raises as many questions as it answers. What should the value of E be? If x < 0 and y > 0 are within E, should they really be considered to be equal, even though they have different signs? Furthermore, the relation defined by this rule, a ~ b ⇔ |a - b| < E, is not an equivalence relation because a ~ b and b ~ c does not imply that a ~ c.

Instruction Sets

It is quite common for an algorithm to require a short burst of higher precision in order to produce accurate results. One example occurs in the quadratic formula (-b ± √(b² - 4ac))/2a. As discussed in the section Proof of Theorem 4, when b² ≈ 4ac, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. By performing the subcalculation of b² - 4ac in double precision, half the double precision bits of the root are lost, which means that all the single precision bits are preserved.

The computation of b² - 4ac in double precision when each of the quantities a, b, and c are in single precision is easy if there is a multiplication instruction that takes two single precision numbers and produces a double precision result. In order to produce the exactly rounded product of two p-digit numbers, a multiplier needs to generate the entire 2p bits of product, although it may throw bits away as it proceeds. Thus, hardware to compute a double precision product from single precision operands will normally be only a little more expensive than a single precision multiplier, and much cheaper than a double precision multiplier. Despite this, modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands. 23
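
In the absence of such an instruction, the same effect can often be obtained in C by widening the operands before multiplying, since two 24-bit significands produce at most 48 significant bits, which fit exactly in an IEEE double. The sketch below (ours, not from the paper) uses this to form b² - 4ac in double precision from single precision inputs:

#include <stdio.h>

/* Each product below is exact; only the final subtraction is rounded. */
static double discriminant(float a, float b, float c)
{
    return (double)b * (double)b - 4.0 * (double)a * (double)c;
}

int main(void)
{
    printf("%g\n", discriminant(1.0f, 3.0f, 2.0f));   /* prints 1 */
    return 0;
}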

If an instruction that combines two single precision operands to produce a double precision product was only useful for the quadratic formula, it wouldn't be worth adding to an instruction set. However, this instruction has many other uses. Consider the problem of solving a system of linear equations,

a₁₁ x₁ + a₁₂ x₂ + ··· + a₁ₙ xₙ = b₁
a₂₁ x₁ + a₂₂ x₂ + ··· + a₂ₙ xₙ = b₂
···
aₙ₁ x₁ + aₙ₂ x₂ + ··· + aₙₙ xₙ = bₙ

which can be written in matrix form as Ax = b, where A is the n × n matrix with entries aᵢⱼ, x = (x₁, ..., xₙ) is the vector of unknowns, and b = (b₁, ..., bₙ).


Suppose that a solution x⁽¹⁾ is computed by some method, perhaps Gaussian elimination. There is a simple way to improve the accuracy of the result called iterative improvement. First compute

(12) ξ = Ax⁽¹⁾ - b

and then solve the system

(13) Ay = ξ

Note that if x⁽¹⁾ is an exact solution, then ξ is the zero vector, as is y. In general, the computation of ξ and y will incur rounding error, so Ay ≈ ξ ≈ Ax⁽¹⁾ - b = A(x⁽¹⁾ - x), where x is the (unknown) true solution. Then y ≈ x⁽¹⁾ - x, so an improved estimate for the solution is

(14) x⁽²⁾ = x⁽¹⁾ - y

The three steps (12), (13), and (14) can be repeated, replacing x⁽¹⁾ with x⁽²⁾, and x⁽²⁾ with x⁽³⁾. This argument that x⁽ⁱ⁺¹⁾ is more accurate than x⁽ⁱ⁾ is only informal. For more information, see [Golub and Van Loan 1989].

When performing iterative improvement, ξ is a vector whose elements are the difference of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation. Thus iterative improvement is not very useful unless ξ = Ax⁽¹⁾ - b is computed in double precision. Once again, this is a case of computing the product of two single precision numbers (A and x⁽¹⁾), where the full double precision result is needed.
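
A C99 sketch (ours; residual is an illustrative name, and A is stored row-major) of computing the residual in double precision from single precision data:

/* xi = A*x - b, accumulated in double even though A, x and b are floats. */
static void residual(int n, const float A[], const float x[],
                     const float b[], double xi[])
{
    for (int i = 0; i < n; i++) {
        double s = -(double)b[i];
        for (int j = 0; j < n; j++)
            s += (double)A[i*n + j] * (double)x[j];   /* exact products */
        xi[i] = s;
    }
}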

To summarize, instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set. Some of the implications of this for compilers are discussed in the next section.

Languages and Compilers

The interaction of compilers and floating-point is discussed in Farnum [1988], and much of the discussion in this section is taken from that paper.

Ambiguity

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. While this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. Perhaps this is due to the fact that many language designers believe that nothing can be proven about floating-point, since it entails rounding error. If so, the previous sections have demonstrated the fallacy in this reasoning. This section discusses some common grey areas in language definitions, including suggestions about how to deal with them.

Remarkably enough, some languages don't clearly specify that if x is a floating-point variable (with say a value of 3.0/10.0 ), then every occurrence of (say) 10.0*x must have the same value. For example Ada, which is based on Brown's model, seems to imply that floating-point arithmetic only has to satisfy Brown's axioms, and thus expressions can have one of many possible values. Thinking about floating-point in this fuzzy way stands in sharp contrast to the IEEE model, where the result of each floating-point operation is precisely defined. In the IEEE model, we can prove that (3.0/10.0)*10.0 evaluates to 3 (Theorem 7). In Brown's model, we cannot.

Another ambiguity in most language definitions concerns what happens on overflow, underflow and other exceptions. The IEEE standard precisely specifies the behavior of exceptions, and so languages that use the standard as a model can avoid any ambiguity on this point.

Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 10³⁰, y = -10³⁰ and z = 1 (it is 1 in the former case, 0 in the latter). The importance of preserving parentheses cannot be overemphasized. The algorithms presented in theorems 3, 4 and 6 all depend on it. For example, in Theorem 6, the formula xh = mx - (mx - x) would reduce to xh = x if it weren't for parentheses, thereby destroying the entire algorithm. A language definition that does not require parentheses to be honored is useless for floating-point calculations.

Subexpression evaluation is imprecisely defined in many languages. Suppose that ds is double precision, but x and y are single precision. Then in the expression ds + x*y, is the product performed in single or double precision? Another example: in x + m/n where m and n are integers, is the division an integer operation or a floating-point one? There are two ways to deal with this problem, neither of which is completely satisfactory. The first is to require that all variables in an expression have the same type. This is the simplest solution, but has some drawbacks. First of all, languages like Pascal that have subrange types allow mixing subrange variables with integer variables, so it is somewhat bizarre to prohibit mixing single and double precision variables. Another problem concerns constants. In the expression 0.1*x, most languages interpret 0.1 to be a single precision constant. Now suppose the programmer decides to change the declaration of all the floating-point variables from single to double precision. If 0.1 is still treated as a single precision constant, then there will be a compile time error. The programmer will have to hunt down and change every floating-point constant.

The second approach is to allow mixed expressions, in which case rules for subexpression evaluation must be provided. There are a number of guiding examples. The original definition of C required that every floating-point expression be computed in double precision [Kernighan and Ritchie 1978]. This leads to anomalies like the example at the beginning of this section. The expression 3.0/7.0 is computed in double precision, but if q is a single-precision variable, the quotient is rounded to single precision for storage. Since 3/7 is a repeating binary fraction, its computed value in double precision is different from its stored value in single precision. Thus the comparison q = 3/7 fails. This suggests that computing every expression in the highest precision available is not a good rule.

Another guiding example is inner products. If the inner product has thousands of terms, the rounding error in the sum can become substantial. One way to reduce this rounding error is to accumulate the sums in double precision (this will be discussed in more detail in the section Optimizers). If d is a double precision variable, and x[] and y[] are single precision arrays, then the inner product loop will look like d = d + x[i]*y[i]. If the multiplication is done in single precision, then much of the advantage of double precision accumulation is lost, because the product is truncated to single precision just before being added to a double precision variable.
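
A short C sketch (ours, not from the paper) of the accumulation loop, with the multiplication forced into double precision by casting the operands:

/* Inner product of two single precision arrays, accumulated in double. */
static double inner_product(int n, const float x[], const float y[])
{
    double d = 0.0;
    for (int i = 0; i < n; i++)
        d += (double)x[i] * (double)y[i];   /* product formed in double */
    return d;
}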

A rule that covers both of the previous two examples is to compute an expression in the highest precision of any variable that occurs in that expression. Then q = 3.0/7.0 will be computed entirely in single precision 24 and will have the boolean value true, whereas d = d + x[i]*y[i] will be computed in double precision, gaining the full advantage of double precision accumulation. However, this rule is too simplistic to cover all cases cleanly. If dx and dy are double precision variables, the expression y = x + single(dx-dy) contains a double precision variable, but performing the sum in double precision would be pointless, because both operands are single precision, as is the result.

A more sophisticated subexpression evaluation rule is as follows. First assign each operation a tentative precision, which is the maximum of the precisions of its operands. This assignment has to be carried out from the leaves to the root of the expression tree. Then perform a second pass from the root to the leaves. In this pass, assign to each operation the maximum of the tentative precision and the precision expected by the parent. In the case of q = 3.0/7.0, every leaf is single precision, so all the operations are done in single precision. In the case of d = d + x[i]*y[i], the tentative precision of the multiply operation is single precision, but in the second pass it gets promoted to double precision, because its parent operation expects a double precision operand. And in y = x + single(dx-dy), the addition is done in single precision. Farnum [1988] presents evidence that this algorithm is not difficult to implement.

The disadvantage of this rule is that the evaluation of a subexpression depends on the expression in which it is embedded. This can have some annoying consequences. For example, suppose you are debugging a program and want to know the value of a subexpression. You cannot simply type the subexpression to the debugger and ask it to be evaluated, because the value of the subexpression in the program depends on the expression it is embedded in. A final comment on subexpressions: since converting decimal constants to binary is an operation, the evaluation rule also affects the interpretation of decimal constants. This is especially important for constants like 0.1 which are not exactly representable in binary.

Another potential grey area occurs when a language includes exponentiation as one of its built-in operations. Unlike the basic arithmetic operations, the value of exponentiation is not always obvious [Kahan and Coonen 1982]. If ** is the exponentiation operator, then (-3)**3 certainly has the value -27. However, (-3.0)**3.0 is problematical. If the ** operator checks for integer powers, it would compute (-3.0)**3.0 as (-3.0)³ = -27. On the other hand, if the formula x^y = e^(y log x) is used to define ** for real arguments, then depending on the log function, the result could be a NaN (using the natural definition of log(x) = NaN when x < 0). If the FORTRAN CLOG function is used however, then the answer will be -27, because the ANSI FORTRAN standard defines CLOG(-3.0) to be iπ + log 3 [ANSI 1978]. The programming language Ada avoids this problem by only defining exponentiation for integer powers, while ANSI FORTRAN prohibits raising a negative number to a real power.

In fact, the FORTRAN standard says that

Any arithmetic operation whose result is not mathematically defined is prohibited.

Unfortunately, with the introduction of ±∞ by the IEEE standard, the meaning of not mathematically defined is no longer totally clear cut. One definition might be to use the method shown in section Infinity. For example, to determine the value of a^b, consider non-constant analytic functions f and g with the property that f(x) → a and g(x) → b as x → 0. If f(x)^g(x) always approaches the same limit, then this should be the value of a^b. This definition would set 2^∞ = ∞, which seems quite reasonable. In the case of 1.0^∞, when f(x) = 1 and g(x) = 1/x the limit approaches 1, but when f(x) = 1 - x and g(x) = 1/x the limit is e⁻¹. So 1.0^∞ should be a NaN. In the case of 0⁰, f(x)^g(x) = e^(g(x) log f(x)). Since f and g are analytic and take on the value 0 at 0, f(x) = a₁x + a₂x² + ··· and g(x) = b₁x + b₂x² + ···. Thus lim_{x→0} g(x) log f(x) = lim_{x→0} x log(x(a₁ + a₂x + ···)) = lim_{x→0} x log(a₁x) = 0. So f(x)^g(x) → e⁰ = 1 for all f and g, which means that 0⁰ = 1. 25 26 Using this definition would unambiguously define the exponential function for all arguments, and in particular would define (-3.0)**3.0 to be -27.

The IEEE Standard

The section The IEEE Standard discussed many of the features of the IEEE standard. However, the IEEE standard says nothing about how these features are to be accessed from a programming language. Thus, there is usually a mismatch between floating-point hardware that supports the standard and programming languages like C, Pascal or FORTRAN. Some of the IEEE capabilities can be accessed through a library of subroutine calls. For example, the IEEE standard requires that square root be exactly rounded, and the square root function is often implemented directly in hardware. This functionality is easily accessed via a library square root routine. However, other aspects of the standard are not so easily implemented as subroutines. For example, most computer languages specify at most two floating-point types, while the IEEE standard has four different precisions (although the recommended configurations are single plus single-extended or single, double, and double-extended). Infinity provides another example. Constants to represent ±∞ could be supplied by a subroutine. But that might make them unusable in places that require constant expressions, such as the initializer of a constant variable.

A more subtle situation is manipulating the state associated with a computation, where the state consists of the rounding modes, trap enable bits, trap handlers and exception flags. One approach is to provide subroutines for reading and writing the state. In addition, a single call that can atomically set a new value and return the old value is often useful. As the examples in the section Flags show, a very common pattern of modifying IEEE state is to change it only within the scope of a block or subroutine. Thus the burden is on the programmer to find each exit from the block, and make sure the state is restored. Language support for setting the state precisely in the scope of a block would be very useful here. Modula-3 is one language that implements this idea for trap handlers [Nelson 1991].

There are a number of minor points that need to be considered when implementing the IEEE standard in a language. Since x - x = +0 for all x, 27 (+0) - (+0) = +0. However, -(+0) = -0, thus -x should not be defined as 0 - x. The introduction of NaNs can be confusing, because a NaN is never equal to any other number (including another NaN), so x = x is no longer always true. In fact, the expression x ≠ x is the simplest way to test for a NaN if the IEEE recommended function Isnan is not provided. Furthermore, NaNs are unordered with respect to all other numbers, so x ≤ y cannot be defined as not x > y. Since the introduction of NaNs causes floating-point numbers to become partially ordered, a compare function that returns one of <, =, >, or unordered can make it easier for the programmer to deal with comparisons.

Although the IEEE standard defines the basic floating-point operations to return a NaN if any operand is a NaN, this might not always be the best definition for compound operations. For example when computing the appropriate scale factor to use in plotting a graph, the maximum of a set of values must be computed. In this case it makes sense for the max operation to simply ignore NaNs.

Finally, rounding can be a problem. The IEEE standard defines rounding very precisely, and it depends on the current value of the rounding modes. This sometimes conflicts with the definition of implicit rounding in type conversions or the explicit round function in languages. This means that programs which wish to use IEEE rounding can't use the natural language primitives, and conversely the language primitives will be inefficient to implement on the ever increasing number of IEEE machines.

Optimizers

Compiler texts tend to ignore the subject of floating-point. For example Aho et al. [1986] mentions replacing x/2.0 with x*0.5, leading the reader to assume that x/10.0 should be replaced by 0.1*x. However, these two expressions do not have the same semantics on a binary machine, because 0.1 cannot be represented exactly in binary. This textbook also suggests replacing x*y-x*z by x*(y-z), even though we have seen that these two expressions can have quite different values when y ≈ z. Although it does qualify the statement that any algebraic identity can be used when optimizing code by noting that optimizers should not violate the language definition, it leaves the impression that floating-point semantics are not very important. Whether or not the language standard specifies that parentheses must be honored, (x+y)+z can have a totally different answer than x+(y+z), as discussed above. There is a problem closely related to preserving parentheses that is illustrated by the following code:

eps = 1;
do eps = 0.5*eps; while (eps + 1 > 1);



This is designed to give an estimate for machine epsilon. If an optimizing compiler notices that eps + 1 > 1 ⇔ eps > 0, the program will be changed completely. Instead of computing the smallest number x such that 1 ⊕ x is still greater than 1 (x ≈ ε ≈ β⁻ᵖ), it will compute the largest number x for which x/2 is rounded to 0 (x ≈ β^emin). Avoiding this kind of "optimization" is so important that it is worth presenting one more very useful algorithm that is totally ruined by it.
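
One common workaround (ours, not from the paper; it assumes arithmetic is actually carried out in double precision) is to force the intermediate sum through a volatile variable, so the comparison cannot be rewritten as eps > 0:

#include <stdio.h>

int main(void)
{
    volatile double sum;
    double eps = 1.0;
    do {
        eps = 0.5 * eps;
        sum = eps + 1.0;          /* must really be computed and rounded */
    } while (sum > 1.0);
    printf("estimated machine epsilon = %g\n", 2.0 * eps);  /* about 2.2e-16 */
    return 0;
}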

Many problems, such as numerical integration and the numerical solution of differential equations involve computing sums with many terms. Because each addition can potentially introduce an error as large as .5 ulp, a sum involving thousands of terms can have quite a bit of rounding error. A simple way to correct for this is to store the partial summand in a double precision variable and to perform each addition using double precision. If the calculation is being done in single precision, performing the sum in double precision is easy on most computer systems. However, if the calculation is already being done in double precision, doubling the precision is not so simple. One method that is sometimes advocated is to sort the numbers and add them from smallest to largest. However, there is a much more efficient method which dramatically improves the accuracy of sums, namely

Theorem 8 (Kahan Summation Formula)

Suppose that Σ_{j=1}^N x_j is computed using the following algorithm