What every computer scientist should know about floating point arithmetic

Transcript

1 What Every Should Know About Computer Scientist Floating-Point Arithmetic DAVID GOLDBERG Alto, Research Coyote Hill Road, Palo 3333 CalLfornLa 94304 Xerox Palo Center, Alto considered an esotoric subject by many people. This is Floating-point arithmetic is floating-point surprising, ubiquitous in computer systems: Almost rather because is PCs has datatype; computers from floating-point to supercomputers language every a accelerators; most compilers will be called upon to compile have floating-point from algorithms to time; and virtually every operating system must floating-point time presents floating-point as overflow This paper such a tutorial on to exceptions respond of floating-point that have a direct impact on designers of computer the aspects begins systems. background on floating-point representation and rounding It with floating-point with discussion of the IEEE continues standard, and concludes error, a examples of how computer system builders can better support floating point, with and Categories (Primary) C.0 [Computer Systems Organization]: Subject Descriptors: design; [Programming Languages]: D.3.4 General– instruction set optirruzatzon; 1.0 [Numerical Analysis]: General—computer G. —compders, Processors analysis, numerzcal algorithms (Secondary) D. 2.1 [Software arithmetic, error languages; 3.1 [Programming D, Engineering]: Requirements/Specifications– Definitions ,4.1 Theory —semantZcs D Formal [Operating Systems]: and Languages]: Management—synchronization Process General Terms: Algorithms, Design, Languages Key Words and Phrases: denormalized number, exception, floating-point, Additional standard, gradual guard digit, NaN, overflow, relative error, underflow, floating-point rounding mode, ulp, underflow error, rounding addition, subtraction, multipli- tions of INTRODUCTION and contains also cation, It division. computer often systems need Builders of background two the on information about arith- floating-point information methods error, rounding measuring of remarkably however, are There metic. second and ulps relative error. The part information about few sources of detailed the discusses IEEE floating-point stand- One subject, the on books few it. the of ard, accepted rapidly becoming is which by Ster- Computation Pat Floating-Point hardware by commercial manufacturers. paper benz, is long out of a is print. This the Included in the IEEE standard is of those aspects on tutorial floating-point method for basic operations; rounding floating-point hereafter) that arithmetic ( standard therefore, the discussion of the systems a direct connection to have the on draws The 1. Section in material It building. consists of three loosely con- discusses be- connections the part third nected dis- (Section first 1) The parts. and tween floating point the design of of implications using the cusses different various aspects of computer systems. strategies the rounding opera- for basic Topics include instruction set design, Permission to copy without fee all or part of this material is granted provided that the copies are not made notice or direct commercial advantage, the ACM copyright for and the title of the publication distributed and its data appear, and notice is given that copying is by permission of the Association for Computing requires Machinery. otherwise, or to republish, copy a fee and/or specific permission. To @ 1991 ACM 0360-0300/91/0300-0005 $01.50 ACM Computing Surveys, Vol 23, No 1, March 1991

2 . 6 Goldberg David 1 CONTENTS into its finite representation. The fit back resulting rounding error is the character- istic computa- feature of floating-point 1.2 is it how describes Section tion. measured. INTRODUCTION ROUNDING 1 ERROR Since calculations floating-point most Floating-Point 1 1 Formats have rounding error anyway, does it Relatlve Ulps and Error 12 operations arithmetic basic the if matter 1 3 Guard Dlglts a bit introduce rounding error than more Cancellation 14 Operations Exactly 5 1 Rounded That question is a main theme necessary? IEEE STANDARD 2 1. throughout dis- 1.3 Section Section 2 and Operations Formats 1 reducing means of guard a digits, cusses Quantltles 22 S~eclal the error nearby two subtracting when Handlers Trap and Flags, Exceptions, 23 3 SYSTEMS ASPECTS numbers. Guard digits were considered 1 Instruction Sets 3 in that IBM by important sufficiently 32 Compders and Languages the to digit guard a added it 1968 double Handling 33 Exception System/360 precision format ar- the in 4 DETAILS (single a had already precision chitecture Error 1 Rounding 4 42 Conversion Bmary-to-Decimal existing retrofitted and digit) all guard Errors 3 4 Summatmn in examples field. Two machines in the are SUMMARY 5 guard of utility the illustrate given to APPENDIX digits. ACKNOWLEDGMENTS REFERENCES standard goes further than IEEE The of a guard digit. It just requiring the use an addition, subtrac- for algorithm gives and square division, multiplication, tion, optimizing compilers, and exception and requires that implementations root handling. as result same the produce algo- that about All the statements made float- rithm. Thus, when a is program moved are justifications, with provided ing-point machine from to results another, the one central to not explanations those but the same the basic be the of operations will main a are in argument section called the every bit if both machines support in Details The de- if skipped be can and standard. simplifies greatly This IEEE the particular, of many of In proofs sired. of the porting programs. Other uses of theorems the The appear in this section. this precise specification are given in marked the end of H with is m-oof each 1.5. Section symbol; not proof included, the a whe~ is immediately appears q following the 2.1 Floating-Point Formats statement the theorem. of real representations Several different of ROUNDING ERROR 1. far have been proposed, but by numbers the most widely used is the floating-point real Squeezing many numbers infinitely representation.’ Floating-point represen- into a an number of bits requires finite is (which O base a tations have always approximate representation. Although If p. to and even) assumed be precision a are infinitely many integers, in there com- most programs the result of integer 3, rep- is 0.1 p = number the = and 10 6 P If 10-1. x 1.00 as and = resented 2 In putations can be stored in 32 bits. fixed any given contrast, of bits, number the number decimal cannot 24, 0.1 P = most real will numbers with calculations exactly be cannot that quantities produce bits. There- represented using that many lExamples of other representations are floatzng result of a calcu- floating-point fore, the slas;, a ud szgned logan th m [Matula and Kornerup be often order must lation to in rounded 1985; and Alexopoulos 1975] Swartzlander ACM Computing 1991 Vol 23, No 1, March Surveys,

3 Floating-Point 7 l Arithmetic 100X22 11 X221.11X22 101X22 O [,!,1 I I , , r r c \ I i I 2 1 o 6 5 4 7 3 p = 3, em,n = – 1, emax = 2. Normalized 2, Figure 1. = (3 when numbers of ~em~. Most x ‘m= or smaller than 1.0 is approxi- be represented exactly but o issues due to this paper discusses the x mately 1.10011001100110011001101 out that Numbers reason. first of are 2-4. In general, a floating-point num- range however, be discussed in Sec- will, d . . “ dd d. ~ as represented be ber will 2.2.2 tions and 2.2.4. . x where d. dd . /3’, . d is called the Floating-point are representations not p More has pre- significand2 and digits. example, both For unique. necessarily “.” dp_l x dld2 repre- kdO. b’ cisely, x 10-1 1.00 and 101 x represent 0.01 number the sents O do # digit leading is nonzero [ 0.1. If the eq. is to representation the (1)], said in +dP_l&(P-l))&, dl~-l ““. do + + + ( floating-point The num- normalized. be o<(il <~. (1) ber 1.00 x 10-1 is normalized, whereas = 3, p not. is ~ = 2, 101 0.01 x When floating-point will number term The 1, and = 2, there are 16 e~~X – e~i~ = mean can that number real a to used be floating-point numbers, as normalized be in represented exactly the format un- 1. hash marks shown in Figure The bold parameters Two discussion. der other correspond numbers to significant whose associated with floating-point represen- Requiring 1.00. floating-point that is a tations and largest the are smallest al- makes representation be normalized the and e~~X lowable Since exponents, e~,~. representation unique. Unfortunately, possible (3P are there significands and to impossible it makes restriction this exponents, a e~i. + 1 possible e - repre to way natural A zero! represent — max number in encoded be can floating-point this O sent with 1.0 x is ~em~- 1, since + + 1)] 1 + [log2((3J’)] preserves the fact that the numerical or- ‘ma. 1°g2 ‘m,. – L( final its, where the is for the sign 1 + nonnegative cor- of dering numbers real is bit. The precise impor- not encoding to responds ordering the lexicographical tant for now. of floating-point representations. 3 their There num- real a why reasons two are stored When the exponent bit k a in is not might ber be as representable exactly k values 1 – 2 only that means that field, a The most number. com- floating-point as use are for available since exponents, mon deci- the by illustrated is situation one reserved to represent O. must be mal finite a has it Although 0.1. number a floating-point x in that Note the has it binary in representation, decimal of number is differ- part notation the and an repeating representation. infinite opera- multiply floating-point a from ent lies when Thus, D = 2, number 0.1 the the The of meaning tion. symbol x strictly between two floating-point num- For clear should from the context. be bers and is exactly by representable nei- x 10-3, x (2.5 expression the example, of them. ther A common situation is less (4.0 float- single a only involves 102) X real that a number is out of range; that ing-point multiplication. x its absolute value is larger than f? is, Forsythe and Moler 2This term was introduced by 3This arrangement the where usual the assumes the [196’71 and has older generally replaced term exponent is stored to the left of the significant mantissa. ACM Computing Surveys, Vol. 23, No 1, March 1991

4 David Goldberg 8* x /3’/~’+1. That is, Relative Error Ulps 1.2 and error rounding in float- is Since inherent s ;6-’. ;Ulp s :(Y’ (2) to computation, ing-point important is it 2 have Con- a way to measure this error. with ~ = ~ sider the floating-point format n particular, the relative error corre - 3, p will be used = which 10 and 1/2 factor a by vary spending to can ulp this a of result the If section. throughout wobble. the called is of O. This factor 10’2 3.12 is computation floating-point x = (~ /2)~-P to the largest of Setting E when answer to infi- the bounds in (2), we can say that computed a and the when is .0314, it is clear that precision nite is rounded to the closest real number floating-point number, the relative error in the last units by 2 is in this error if real number Similarly, the place. bounded by c, which is referred is always to represented 3.14 x 10-2, as as machine epsilon is .0314159 it is in error by .159 then in the units In example above, the relative er- the place. last if the floating-point ~or was .oo159i3, ~4159 In general, To = avoid 0005. used x fle is to repre- d . . d . d. number such relative error is small numbers, the Id. d . . . d– it is in error sent z, by 6, times factor a normally written as last place.4 The units in the I 1 - flp ( z//3’) = c is case this in which = (~/2)P-P used be will ulps for as shorthand term Thus, the relative error -3 = .005. 5(10) in “units place. ” If the result of the last ((.00159/ be would expressed as is calculation floating-point num - a the E. O.l = /.oo5)e 3.14159) it result, correct the to nearest ber still illustrate the difference between To be error by as much as 1/2 ulp. might in the consider error, real relative ulps and difference measure to way Another the = number x 12.35. It is approximated by and between a number floating-point the Z = 1.24 x 101. The error is 0.5 ulps; the rela- real approximating is it number is 0.8 relative error is e. Next consider the which difference be- is the error, tive exact x is 8 The = computation 8x. value computed numbers two the 98.8, whereas, the divided value is 81 the tween by For example, the relative real number. x = The error is now 4.0 ulps, 9.92 101. when The but the relative error is still 0.8 e. committed error approximating 10° is .00159 /3.14159 x 3.14 3.14159 by measured in ulps is eight times error .0005. = is error relative the though even larger, cor- To relative the compute that error In the the base is (3, general, same. when ulp, observe that when a to responds 1/2 relative error expressed a ulps fixed in number approximated by the real is by wobble factor of up to (3. Con- can a possible floating-point number closest versely, eq. (2) shows, a fixed error of as P that error relative a in results ulps 1/2 X can absolute the ~e, error dd ~. dd d be can wobble by (3. u measure to way natural most The x is & where /3’ ‘(Y as large as rounding error is in ulps. For example, x the is ((~/2)&P) digit This error ~/2. neared rounding to the flo~ting.point dd --- Since the form of d. numb... /3’ number corresponds to 1/2 ulp. When absolute error dd x /3e all have this same analyzing the rounding error caused by have that range between ~’ but values error various formulas, however, relative ranges fle, be- the relative error x and O illustration good A measure. better a is ((~/2)&J’) and /3’//3’ x tween ((&/2 )~-’) is the analysis immediately fol- this of proof of Theorem 10. 
Since ~ lowing the overestimate the effect of rounding can floating-point to the nearest number b; ~em=+ 1 or the z is number larger than 4Un1ess the wobble factor of (3, error estimates of range smaller than (lem~. Numbers that are out of machines formulas will be tighter on with until further in fashion will not be this considered small p. a notice. ACM Computmg 1991 Vol. 23, No 1, March Surveys,

5 Floating-Point Arithmetic g 9 magnitude of When order of only the How digit! every in wrong bad can the and error rounding interest, e ulps of is be? error may they since interchangeably used be most exam- by differ For ~. of factor a at Theorem 1 is in floating-point a when ple, number with format floating-point a Using pa- number error by n ulps, that means the rameters and p and computing differ- /3 digits If n. logD is contaminated of the p using ences of error relative the digits, ne, is computation a in error relative the as large b 1. – be can result as then error relative in 1 – A 13 of Proofi = contaminated digits (3) log,n. the expression x – y = x when occurs 1.00””” ””p, wherep= Oandy=. pp. Guard Digits 1.3 digits (all equal to p Here @ has – 1. y x is difference exact The Q). ‘p. P = y – method difference the computing of One answer using only computing When the to between two floating-point numbers is digit of rightmost the however, digits, p then compute the difference exactly, shifted off, so differ- computed y gets the floating-point nearest round it to the p-p ence is P–p+l. Thus, the error – is This number. is the if very expensive = ~-P(~ – 1), and the relative er- @-P+l Assuming size. in greatly differ operands H ror is $-P((3 – l)/O-p = 6 – 1. would X 1012 – 1.25 X 10-5 2,15 3, = P be as calculated can When f? = 2, the absolute error be as = 13 when and result, the 10, large as 1012 X = 2.15 x larger. put it can be nine times it To X 1012 y = .0000000000000000125 another way, that (3 = 2, (3) shows when X – y = 2.1499999999999999875 1012, X is the number of contaminated digits is, p. That all of log2(l/~) = logJ2 J’) = 2.15 x 1012. Rather than rounds which to are p digits in the result wrong! the digits, floating-point using all these digit extra is Suppose added to one normally a hardware fixed on operates guard this guard situation (a against number the Suppose digits. number of of smaller the That is, number is digit). when and that the p is kept digits result p + 1 digits, then the truncated to is right, smaller operand shifted digits digits. p subtraction of the is rounded to (as to opposed simply are discarded the previous example With a guard digit, Then, rounding). 2.15 x 1012 – 1.25 x becomes 10-5 becomes = 101 x 1.010 x X 1012 x = 2.15 101 x y 0.993 = x 1012 ‘y = 0.00 x–y= .017 x 101, =2.15x1012. x–y and answer is exact. With single the a exactly the The same as is if the answer error - re the of relative the digit, guard had difference been computed exactly in 110 – suit as ~, than greater be may Take rounded. then another example: 8.59: becomes This 10.1 – 9.93. 1.1OX x= 102 x 101 1.01 x= 102 X .085 = y x 101 ‘y 0.99 = 1.015 102 x z–y= X–yz x .02 101. with the This rounds to 102, compared correct is the answer so The .17, com- correct answer of 101.41, for a relative is difference puted is ulps 30 by and off error .006, which is greater than of ACM 1991 Surveys, Vol 23, No. 1, March Computing

6 Goldberg David 10 “ to .1. The subtraction did not introduce relative of general, the error In .005. e = exposed but error the rather error any than can larger be slightly only the result the earlier multiplications. in introduced Theorem 2. More we precisely, c. have when Benign cancellation occurs sub- quantities. known If tracting exactly x 2 Theorem then by and have no y rounding error, numbers floating-point y and x If in are a 2 with done is subtraction the if Theorem subtraction if and p and 13 with format is – a guard digit, the difference a has y x p guard (i. with done digits e., 1 one + small e). 2 than (less error relative very then the in error rounding relative digit), A formula that exhibits catastrophic is the result less than 2 ~. cancellation can sometimes be rear- ranged to eliminate the problem. Again will This theorem be proven in Section formula quadratic consider the 4.1. Addition the above is included in and can be positive y x since theorem ~b2–4ac –b+ negative. or 1.4 Cancellation –b–~ = r2 (4) 1.3 by saying Section summarized can be “ 2a the that without a guard digit, relative P b2 When not does ac 4 – b2 then ac, two error committed when subtracting can be very large. In nearby quantities a cancellation = ~ and involve words, other ex- any of evaluation the (subtraction) addition other the But b 1. \ a pression (or an containing subtraction will formulas the of catas- one in a have of quantities with opposite signs) addition mul- cancellation. To avoid this, trophic a large so error relative could result in numerator and denominator of the tiply (The- the digits are meaningless all that – similarly (and by – b ~ r-l quan- nearby orem 1). When subtracting to obtain r2 ) for digits in the the tities, most significant 2C operands other. each cancel and match rl = are cancellation: of kinds two There –b–~’ benign. and catastrophic 2C Catastrophic cancellation occurs when rz = (5) are operands the subject to rounding er- –b+~” quadratic the in example, For rors. for- and % b2 If rl computing then >0, b ac 4 – bz ac occurs. mula, the expression involve a will formula using cancella- (4) subject quantities 62 and 4 ac are The to rl tion. Therefore, use (5) for computing the they since errors rounding re- are and (4) for rz. On the other hand, if sults floating-point multiplications. of computing for (4) use <0, b and (5) rl the to rounded are they Suppose nearest r2. for accu- are so and number floating-point expression The – y2 is another for- X2 rate ulp. are they 1/2 within to When that mula catastrophic cancella- exhibits can many cancellation subtracted, cause accurate tion. It is as it evaluate to more disappear, of the accurate digits to leav- quadratic ( x – y)( x + y). 5 Unlike the ing mainly digits contaminated behind Hence rounding by the difference error. have ulps. For might an error of many a = 1.22, b = 3.34, consider example, does y) + x .Y)( not x ( expression the 5Although – catastrophic IS shghtly less cause a it cancellation, b2 of value exact The 2.28. = c and -- x In this x or y > y If y2 – X2 than accurate < and But b2 rounds to 11.2 is .0292. ac 4 three has y) + x -Y)( but errors, rounding – ( x case, an- ac rounds to 11.1, hence the final 4 only the rounding error com- has two since X2 – y2 error by 70 ulps swer is .1, which is an smaller y when computing the mitted of x 2 and 2 – 11.2 though even equal is exactly 11.1 does affect the final subtraction. not ACM 1991 Surveys, Vol 23, No 1, March Computmg

7 Floating-Point 8 Arithmetic 11 replaced with pre- benign one. We next a a formula, improved form still has this of sent more interesting examples formu- benign subtraction, cancella- a is it but exhibiting cancellation catastrophic las quantities without rounding er- tion of exhibit can be rewritten to that only ror, a Theorem By one. catastrophic not benign cancellation. at relative in error the 2, – y most is x area expressed be can triangle The a of x + y. Multiply- true is same e. 2 The of directly in terms of its of lengths the small relative ing a with quantities two sides c a, b, and as a with results error a product small in 4.1). relative error (see Section a)(s - A - c) , = ~s(s - b)(s and between exact To avoid confusion following values, notation computed the a+b+c . (6) exact the is used. denotes y – x Whereas where = s 2 y, and x of difference the @y denotes x difference with computed rounding (i. e., Suppose the triangle is very flat; that is, Similarly @, @, and @ denote error). = and term Then s the a, b + c. = a addition, computed multiplication, and eq. (6) subtracts two nearby a) in – (s respectively. division, All caps indicate numbers, have round- may which of one the computed of a function, as in value = b 9.0, = a c if example, For error. ing LN( x) or SQRT( x). Lowercase functions then s of value correct the is = 4.53, and mathematical notation traditional Even and the 2.34. is A though 9.03 exact denote as in ln( their x) values (9.05) computed value of s by is in error and &. only 2 ulps, the computed value of A is (x y) ex- an is @ (x @ @y) Although 60 error an 3.04, ulps. of y2, – 2 x of the approximation cellent formula rewrite way a is There (6) to and y might x floating-point numbers it will return so accurate results that be themselves some to approximations [Kahan triangles flat for even It is 1986]. 2 example, For j. and true quantities 2 (a-b)) (b+c))(c [(la+ A= - and j might be known decimal exactly expressed be cannot that numbers ex- c))] (b– b))(a+ ’/’/4, (a– X(C+ actly even binary. In this case, though in x ~ is x good approximation to y – y, a (7) b?c. a? relative error a com- have can it huge pared and to the true expression 2 – $, a a, > c, b, and c do not satisfy If > b the y) over y)( + x ( of advantage – so x (7). applying before them rename simply Since dramatic. as not is y2 – - comput X2 check to straightforward the that is It – x the about is y) same x y)( + ( ing are alge- sides right-hand (6) of and (7) y2, amount X2 computing as work of it – Using braically identical. of values the is case. the preferred form in this clearly and area b, computed a gives above c a, catas- a replacing however, general, In is much 2.35, of 1 ulp in error and which by trophic cancellation a benign is one formula. first the than accurate more worthwhile not the expense is large if more much is (7) formula Although because (but not often is input the al- it example, this for (6) than accurate - eliminat But approximation. an ways) to would nice be per- (7) well how know the ing a cancellation entirely (as in forms general. in quadratic formula) is worthwhile even if are Throughout this the data exact. not Theorem 3 float- the that assumed will it paper, be incurred when using The rounding error ing-point to an algorithm are ex - inputs at ie t.icqqle a of area the compuie #o (T) are computed as and Qxat the results .aGt e, per- is subtraction provided 11 most as possible. 
accurately e digit, guard a formed with <.005, and more accu- – y2 is The expression X2 1/2 square roots are computed to within y)( when x + rate rewritten y) as (x – Ulp. is catastrophic a because cancellation ACM Computing Surveys, Vol. 23, No. 1, March 1991

8 David Goldberg 12 “ that .005 is met in The c condition s + i/n)’ The troublesome expression (1 be n)], + ln(l n exp[ as rewritten i can / sys- floating-point actual every virtually compute to is problem where now the >8 p 13 2, = tem. when example, For + x) for small x. One approach is to In(l 10, that e < .005, and when 6 = ensures in x, = x) ln(l approximation the use + z 3 is enough. p In Theorem statements becomes payment the case which 3 that dis- like of cuss the relative error expression, an even by $3.21 and off $37617.26, which is formula. obvious less accurate the than is it is understood that the expression x) compute to a is there But + way ln(l using arith- computed floating-point as accurately, Theorem 4 shows error is the relative particular, In metic. [Hewlett-Packard 1982], This formula actually the expression of accurate within $37614.07, 2 yields to cents! @c))@ @(a @b)) @(b (C (sQRT(a assumes that LN( x) Theorem 4 ap- proximate ln( x) to within 1/2 ulp. The (a @(b @c))) F3(c @(a @b))@ it problem that when x is small, solves is + LN(l not close to ln(l is x) be- @ x) (8) @4. x the lost @ information in cause 1 has order of x. That is, the com- low the bits the cumbersome nature Because (8), puted value of ln(l + x) is not close to its of of the of theorems we will actual value when x statement 1. in < the computed value of E say usually than writing out E with circle rather 4 Theorem notation. – x) is computed using the for- If ln(l too are bounds Error usually pes- mula simistic. numerical given In the example is 2.35, the above, computed value of (7) ln(l + x) 2.34216 of value true a with compared O. 7c, which is much for a relative error of x forl~x=l I than 11 e. The main reason for less com- — — + xln(l x) error bounds is precise puting to not get G3x#l forl +X)-1 (1 verify that the bounds but rather to 1 numerical contain not does formula 5 relative the error is at most c when O < problems. is 3/4, < x per- provided subtraction of example final that A expression an and <0.1, digit, guard a with formed e can to benign rewritten be cancella- use 1/2 within to is in ulp. computed where x < 1. This ex- tion is (1 + x)’, in calculations. financial arises pression formula This will work for any value of day Consider depositing $100 every into x only which 1, + x for interesting but is a annual earns that account bank an catastrophic cancellation occurs is where interest rate of 6~o, compounded daily. If = 365 and i = ,06, the amount of n x) ln(l formula Although + in the naive seem may formula the there mysterious, money accumulated the end of one at it works. for explanation simple why a is is 100[(1 + i/n)” – 11/(i/n) dol- year 2 = ~ using computed is this If lars. and ln(l = x)/xl + x[ln(l as x) + Write XV(x). com- be can factor left-hand The the P = 24, result is $37615.45 compared of to the exact answer $37614.05, a puted factor right-hand the but exactly, of reason The $1.40. for discrepancy a suffer will large P(x) = ln(l + x)/x How- rounding 1 adding when error x. to expres- see. the problem is easy to The involves adding 1 to i/n + 1 sion ever, since + constant, almost is ln(l v i/n bits order the so .0001643836, of low not x) = x. So changing x slightly will amplified are This rounding error lost. is words, if In introduce much error. other n / nth i power. the to raised is when + 1 z= x, computing XK( 2) will be a good ACM Computmg Surveys, Vol 23, No 1, March 1991

9 Floating-Point 13 8 Arithmetic ln(l + x). Is approximation to x) = xp( Corporation’s VAXG comput - Equipment for which and 2 5 value a there for says that ers. Another school of thought + 1 5 There accurately? computed be can are halfway since numbers ending in 5 is; namely, they roundings, possible two between = (1 @ x) e 1, because 2 equal then + 2 is exactly 1 to 1 @ x. should down round half the time and be sum- can section this of The results the up One round way of ob - half. other a by saying guard that digit marized taining 50’%0 behavior is to require this when nearby pre- guarantees accuracy the its have result rounded least that known cisely quantities are subtracted 12.5 significant even. be Thus digit for- (benign a Sometimes cancellation). to rounds is 2 because 13 than rather 12 inaccurate can be results gives that mula best, even. Which of methods these is numeri much have to rewritten - higher and round round up or to even? Reiser cal accuracy using benign cancella- by reason following the offer [1975] Knuth works only the however, tion; procedure even. to round preferring for if is performed using a guard subtraction digit. The price of a guard not digit is Theorem 5 is because merely high making requires x and y be floating-point numbers, Let 1 bit wider. For a 54 the double adder bit define X. = x, and xl=(xOey)O precision the additional cost is less adder, and y,...,=(x(ley)@y)[email protected]@ gain you price, this 2%. than For the e are exactly rounded using round to ability run to many algorithms such as or then all for x = x. either xl even, = x. n of the for (6) formula computing a area q foralln >1. the expression and in Theorem 4 triangle for computing ln(l + ~). Although most this clarify To consider ~ 10, = result, a have computers modern guard digit, x p –.555. and let = = 1.00, y = 3 (such few a are there that Crays) as up, be- sequence the rounding When not. do Y 9 X. 1.56, Xl = 1.56 9 .555 = comes xl e y ~ LO1 Q .555 1.01, 1.57, = = 1.5 Rounded Operations Exactly each successive value of x. and in- creases by .01. Under round to even, x. When floating-point operations done are always 1.00. This example suggests is accu- with as are they digit, guard a not com- up round the using when that rule, rate as if they were computed exactly putations drift upward, gradually can to the then rounded nearest floating-point even whereas the when using round to Operations in performed this number. cannot theorem this says happen. exactly rounded. will be called manner the Throughout rest of this round paper, preceding example The immediately to will be used. even single Theorem shows guard 2 a that exact oc- of One application rounding always not will digit rounded exactly give precision curs in arithmetic. multiple Section results. exam- several gave 1.4 There are two basic approaches higher to ples that require of guard a algorithms precision. approach represents float - One digit sec- This properly. work order to in numbers sig- large very a using ing-point tion gives examples of algorithms that of nificant, stored in an array which is require exact rounding. codes the manip- for routines words, and definition the far, So of has rounding assembly lan- ulating in numbers these not straightfor- is Rounding given. been approach second The guage. represents to of exception the with ward, round how numbers floating-point precision higher cases; halfway for example, should 12.5 array of ordinary floating-point as an mnnd to 12 OP 12? 
thought Ofie of whool divides the digits in half, letting 10 1,2,3,4} round down and {5,6,’7,8,9} {0, would round to 13. round up; thus 12.5 Digital ‘VAX is a trademark of Equipment Digital works rounding how is on This Corporation. ACM Computmg Surveys, Vol 23, No. 1, March 1991

10 14 Goldberg “ David Theorem 6 to write b = 3.5 ulps. Using elements of the adding where numbers, c = 3.5 – and .037, – 3.5 = a .024, – recovers precision in array the infinite x b2 becomes 3.52 – 2 x 3.5 .024 .021, floating-point precision number. the high b2 Each summand is .0242. so + exact, this is that It will be second approach the where .000576, + .168 – 12.25 = here. The advantage of using discussed sum is left unevaluated at this point. that numbers is an array of floating-point Similarly, high-level can be coded it portably in a exactly language, but it requires rounded ac .037 x (3.5 – x 3.52 .021) 3.5 = + arithmetic. in to multiplication this key sys- The x .037 + .021 representing a product xy a is tem as = .000777. + .2030 – 12.25 the has summand each where sum, same can done This y. and x as precision be Finally, term these two series subtracting x x~ + xl = x Writing y. by splitting and of ac – b2 an by gives for estimate term exact the yl, + y~ = y and xy is product which .03480, = .04685 e .0350 is @ O y Xlyh + Xlyl. If X and xhyh = + xhyl + the identical exactly rounded result. to the significands, p summands bit have 6 Theorem that show To requires really pro- significands, bit p have also will = = p 3, P 2, exact rounding, consider yh? Y1 carI XI, xh, be represented vided and = 7. Then m 35, = 5, mx = x and bits. is p/2] is even, it When p using [ performed subtraction If is 32. m x @ = easy The splitting. a find to number x) @ m guard with then ( single a digit, sum can be written as the xp_l Xo. xl ““” 3, Therefore, x~ = 4 and xl = e 28. = x ““” xl O. of .OXP,Z O.. and xp/2–l Xo. p/2] = not \ ~~e representable xl with . . . simple p is odd, this XP ~. When An extra method will not work. splitting As a final example of exact rounding, neg- can, bit however, be gained by using m is The 10. result by dividing consider ative = ~ if example, For 2, numbers. gen- will that number floating-point a in x can be split as P = 5, x and = .10111, When 2, /10. m P = be not eral to equal There .00001. – = xl and .11 = x~ is 10 by @10 m will however, multiplying number. way to split a more than A one exact provided m, miraculously restore that is easy splitting method to compute more a Actually, used. being is rounding [1971], but Dekker to due is it requires is Kahan) to (due fact general The true. a digit. guard single more than ingenious, proof is inter- not readers but to such details can skip ahead ested in Theorem 6 2. Section precision, Let p be the floating-point with Theorem 7 restriction the >2, p that D when even is operations fl;ating-point assume that and are m n with integers When O = 2, if and ~ p /2~ is are Then exactly if k rounded. = 2p-1 and m ~ < ~ n has the special form m and up) (rounded the = precision half n=2z+2J then (m On)@n=m, je split as x = Xh + xl, fik + 1, x can operations fi?~ating-point provided are xh=(m where xl Q9x)e ([email protected] Xe x), exactly rounded. — — representable is x, each and Xh, e x using precision. of bits p/2] ~ is 2 of power a by Scaling Proof it changes since harmless, only the expo- q = m /n, nent the significant. If not To how this theorem works in an see so < then scale n and that n 2P-1 s 2P p = 4, 3.476, = b 10, = P let example, < 1. 2P–2 Thus, q 1/2 that so m < scale – ac and c = 3.479. Then b2 a = 3.463, < 2P. has m m Since significant p < floating-point nearest the to rounded @ b = 12.08, b number .03480, is while has it bits, to right of 1 most at the bit m the binary point. 
sign of Changing the so value computed the and = 12.05, a @ c so 0. is harmless, assume q > is ac – b2 480 error an of is This .03. of ACM Computmg Surveys, Vol 23, No 1, March 1991

11 Floating-Point g Arithmetic 15 = m @ ij to prove the theorem If n, introduce a arithmetic little operations requires showing that The more rounding error than necessary? because answer is that it does matter, basic to accurate enable us operations are formulas that prove the in “correct” they sense have a small relative error. right bit 1 most at m has because That is 1.4 algorithms several discussed Section the round will nij so of point, binary to cor- that require guard digits to produce deal TO m. when case halfway the with to rect results in this If the input sense. ini- I 1/4, note that = since the m – [email protected] I those formulas are numbers representing m I had m I its 1, 2‘- < unscaled tial the imprecise measurements, however, O, low-order low-order bit the so was bit less bounds Theorems 3 and 4 become of m is halfway Thus, O. also scaled of the The be- interesting. reason is that the m. cases will round to y can become cancellation nign x – and = g & , “.. .qlqz = q Suppose catastrophic x and y are only if approxi- . . . nq estimate I – m 1, qP1. To mations to But measured quantity. some = I N/2p+1 – I ~ – q I compute ifs? operations accurate useful even in are integer. odd an is N m/nl, where they because data, inexact of face the relationships us enable exact establish to 2P-l <2p, n=2’+2J and – ~ Thus, I specifies a in bits of layout precise the Then similar). is Q > q (the @ < q case precision. IEEE single and double 854 and m, < nij P or 2 = L? and = either allows 10 unlike floating-point 754, does not specify how [email protected]= m-nij=n([email protected]) bits into encoded are numbers et [Cody 2-P-1)) = n(q – (~ – does It al. not require a 19841. particular but instead it specifies con- p, value for 1 2–P–1 — < n p for on the values of allowable straints +1–k p n2 ) term double precision. The and single ( used when dis- will Standard IEEE be +2-P-’+’=:. +2’)2-’-’ P-1 (2 = cussing properties common to both standards. (9) establishes and proves the theo- This of IEEE the tour a provides section This rem. q standard. discusses one subsection Each and aspect the standard why it was of It is not the this purpose of included. base any for true holds theorem The 6, paper is standard IEEE the that argue to J z DJ. + L (3 as long as 2 by + 2 replaced is As there however, larger. gets 6 are floating-point possible best the standard but as the accept to rather given standard denominators fewer the and fewer of and introduction its an provide use. to form + p’. ~’ full For standards the consult details to We are the now in a position answer Does question, basic if the matter it [Cody al. 1984; Cody 1988; IEEE 19871. et 1991 Computing Surveys, Vol 23, No 1, March ACM

12 David Goldberg 16 l sec- precision later in (as this explained 2.1 and Operations Formats the of so tion), bits 23 has machine 2 = ~ Base 2. 1.1 with range of a compare to precision for the ~ = 16 machine. 21-24 bits ~ It IEEE = 10. allows 854 why clear is for explanation possible Another Base 10 is how humans exchange and has to do with shift- choosing ~ = 16 bits 10 is about think = (3 Using numbers. ing. When adding two floating-point especially appropriate for calculators, numbers, their exponents are different, if each operation is dis- where result the of to the of significands will have be one by decimal. in calculator played the points up, line radix the make to shifted are IEEE w~y reasons several There the = slowing operation. In the /3 down 10, not is base the if that requires 854 it 16, all p = 1 system, numbers be- the must one mentioned 1.2 Section 2. be tween and 15 1 have the same exponent, error analyses results The reason: of are adding when required is shifting so no much when ~ is 2 because a tighter of possible 105 = 15 pairs the of any of error rounding a by wobbles ulp 1/2 () In this distinct numb~rs from set. the of computed factor as a relative fl when al- error and error, almost are analyses P = 4 system, however, these b = 2, have numbers O from ranging exponents relative er- on based when simpler ways of the 3, to and shifting is required for 70 do to the A has with reason ror. related pairs. 105 precision for bases. large Con- effective In modern hardware, the perform- most ~ to compared 1 = p 2, = fi = 16, sider gained a ance shift a avoiding by for bits 4 have systems Both 4. = p signif- of so the subset of operands is negligible, Consider 15/8. of computation the icant. of wobble small the it makes 2 = (3 When 1.111 as represented is 15 2, = ~ 23 is and 15/8 So 2°, x 1.111 as 15/8 x Another of base. advantage preferable using gain to way a is there that is 2 ~ = 16, 15 is rep- however, = p When exact. hex- F the is F where 160, x resented as of an significance extra .V Since float- bit normal- always ing-point are numbers But adecimal 15/8 is repre- 15. for digit the of bit significant ized, the most has which 1 bit 160, x only 1 as sented can lose up to correct. In general, base 16 significant is no is there and 1, always have can p an a so precision of bits, 3 reason to waste a repre- bit of storage it. senting this use trick that Formats 3 – 4p as low as precision effective It was bit. hidden said to are a have 4p. than rather that 1.1 this in out pointed re- Section (3 Since these have of values large quires a special for O. The convention problems, why did IBM 6 = 16 choose for Only system/370? its sure, for knows IBM method given there expo- an that was of significant 1 all a and of e~,~ – nent The reasons. possible two are there but but first is increased exponent range. Single 2 x 1.0 not represent zeros 1 ‘mln– 16, = ~ has system/370 the on precision rather O. significant requires 24 the Hence 6. = p 754 encoded IEEE precision single is this fit bits, 32 bits. into Since this must the for bit 1 using bits 8 sign, bits 32 in bits for 1 and exponent the for 7 leaves sig- exponent, the the for bits 23 and for the rep- of magnitude the Thus, bit. sign howeve~, bit, a uses It nificant. hidden about from ranges numbers resentable 24), (p = 24 is so bits the significant even though is it using encoded only get 16-2’ 1626 = 228. To about a simi- to 23 bits. 
exponent D range lar when 2 would = exponent, of bits 9 require leaving only the for just 22 bits It significant. was D when that however, pointed out, 16, = as effective precision can be as low the appears Gold- by published been first have to ‘This 4p – 3 = 21 bits. Even worse, when B = 211] berg [1967], although Knuth [1981 page at- it 2 of extra an gain to possible bit is tributes Idea to Konrad Zuse this ACM 1991 Surveys, Vol 23, No 1, March Computing

13 Arithmetic Floating-Point 17 l 1.2 den, the calculator presents a simple Precision 2. the operator. to model defines different standard IEEE The four stand- the in precision Extended IEEE double, single, single ex- precision: similar a enables ard It function. serves sin- 754, In extended. double and tended, libraries within to quantities compute to correspond precision double and gle (or ulp about 1/2 in single double) preci- most floating-point roughly to what those the of giving efficiently, sion user Single provides. hardware oc- precision namely, model, simple a libraries that 32 bit word, double cupies preci- a single a it be primitive each operation, simple words. sion two consecutive 32 bit multiply returns or an invocation of log, a format that precision offers Extended is about ulp. 1/2 within to accurate value a just a little extra precision and exponent When however, precision, extended using standard range only 1). The IEEE (Table is that its use sure make to it important how many specifies a lower bound on user. the to is transparent example, For bits extra extended precision provides. calculator, if the represen- on a internal double-extended allowable The minimum not is rounded displayed a of tation value 80-bit format is sometimes referred to as to the display, as precision same the the shows table the though even it format, will operations further of result depend The bits. hard- 79 that is reason using unpre- and digits hidden the on appear implementations preci- of extended ware the dictable user. to bit and sion normally do not use a hidden fur- precision extended illustrate To 79 would so use 80 rather bits.8 than converting ther, of problem consider the The emphasis the puts standard most IEEE and precision single 754 between recom- precision, making no extended on num- decimal. Ideally, single precision concerning double precision mendation bers will be so digits enough with printed but strongly recommending that that when the number is read decimal the support should Implementations extended precision number single the in, back can to corresponding format basic format widest the be recovered. It turns out that 9 decimal supported, to single pre- a recover digits enough are 4.2). binary number (see Section cision One extended for motivation precision decimal number back When converting a comes often will which calculators, from binary a to its unique representation, 13 display 10 digits but use digits inter- as small as error rounding fatal is ulp 1 nally. By displaying only 10 of the 13 the because it will give answer. wrong user the calculator digits, appears the to Here a situation where extended preci- is black box that computes exponen- } a ~ algorithm. an sion is vital for efficient and 10 digits of to on, cosines, so tial, When a single extended is available, accuracy. compute For to calculator the for method con- straightforward exists to cos and log, exp, like functions within a verting decimal number to a single with digits how- efficiency, reasonable 10 in read First, precision 9 the one. binary ever, needs a few with digits extra it integer N, ignoring as digits decimal an work. find to which to hard not is a It p >32, Table 1, From point. decimal the that expression rational simple approxi- and = N 109, x 4.3 232 < 109 since can mates log with an error of 500 units in ex- in exactly represented be single the last place. Thus, computing with 13 find Next, tended. power appropriate the digits an answer correct to 10 dig- gives N. 
scale to necessary 10P This will be a hid- By its. keeping these extra 3 digits deci- of the exponent of the combination position the and number, mal the of point. decimal ignored now) until (up extended Kahan, 64 to *According precision has P is this 13, s I also 10 \ If ‘l. I Compute significant that because bits of widest the was exactly, represented 1013 because = carry which across could be precision propagation 513<232. and 213513 multiply Finally, increasing cycle done on the Intel 8087 without the [Kahan 19881. time If P < 0) N and 10’ P‘. this (or divide if ACM Computmg Surveys, Vol. 23, No. 1, March 1991

14 Goldberg David 18 - 754 Format Parameters IEEE 1. Table Format Parameter Double Single Single Extended Double Extended 64 > 24 > 32 53 P > 16383 + 1023 + 1023 z + 127 + emax 163$32 – < – < 1022 1022 – – 126 emln 11 11 > 15 8 2 bits Exponent width in 2 79 64 32 2 43 Format width in bits case pre- representation. single of In the is last closest the exactly, done operation stored in cision, 8 exponent the where is binary number is recovered. Section 4.2 bias bits, is 127 (for precisiog double the do shows how to (or multiply last the is What k 1023). this means is that if it P 13, the s I divide) Thus, exactly. I for the is of the exponent inter- bits value use enables format single-extended the of then the unsigned preted an as integer, 9 decimal numbers to be converted digit is floating-point the number of exponent (i. number closest the to e., ex- binary biased is ~ the called often This 127. – 13, > I P single- I actly If rounded). to di~tinguish from the unbi- exponent is not for the above enough extended k. An advantage of’ biased ased exponent rounded to exactly the compute algorithm that representation nonnegative flout- is Coonen but always, equivalent binary as treated be can numbers ing-point to [1984] guaran- enough shows that it is integers comparison for purposes. deci- tee that the conversion of binary to Referring to 1, single precision Table mal back will recover the original and 126. The – has e~~, = 127 and e~,~ = binary number. I e~,X for having that e~l~ I < is reason so If double the supported, is precision number reciprocal the of the smallest above double in run would algorithm Although overflow. not will ‘mm) (1/2 is it than single-extended, rather precision true of the largest that the reciprocal to double precision to a 17 but convert will underflow, underflow is usu- number would digit decimal number and back Section overflow. than serious less ally require the double-extended format. for that – explained 1 is e~,~ 2.1.1 used in- will 2.2 Section O, representing and Exponent 2.1.3 for In troduce a use e~,X + sin- IEEE 1. gle biased the this means that precision, or be can exponent the Since positive e~,~ – 1 = exponents range between negative, must method be chosen to some – 127 and e = the whereas 128 ~.X 1 + sign. represent methods common Two its unbiased O between range exponents are numbers signed representing of nonneg- which and are exactly the 255, complement. two’s sign/magnitude and can that numbers ative represented be is Sign/magnitude the system used for bits. using 8 sign of the significant in the IEEE the sign; formats: 1 bit is used the the hold to rest the the bits represent magnitude of 2. 1.4 Operations the The two’s complement of number. in integer representation is often used standard IEEE The requires that the re- arithmetic. In this scheme, a number addition, multiplica- subtraction, of sult represented nonneg- smallest is the by exactly tion, and division be rounded. it to congruent is that number ative is, That the result must be computed modulo ~. 2 to then exactly the nearest float- rounded The IEEE binary standard does not even). to round (using number ing-point represent to methods these of either use computing Section pointed out that 1.3 biased the exponent but instead uses a- the exact difference or sum of two float- ACM Computmg Surveys, Vol 23, No 1, March 1991

15 Floating-Point 19 “ Arithmetic than proving them assuming operations expensive ing-point numbers be very can substantially are exponents when their are exactly rounded. There is not complete agreement on That guard introduced section different. a stand- what floating-point operations which way practical provide a of digits, In cover. should ard basic the to addition guarantee- while differences computing ing that is small. error relative the Com- /, the –, IEEE +, operations x, and also square root, that specifies standard guard puting digit, a single with inte- between conversion and remainder, always give the same not will however, ger floating point be correctly and answer as computing the exact result conversion rounded. It also requires that a then rounding. By introducing second differ- sticky bit, guard a and digit third be internal formats and decimal between very for (except rounded correctly large little a can only be at computed ences Kulisch numbers). [19861 Miranker and cost guard digit, single than more with a proposed have to adding inner product the result is the same as if the differ- but precisely are that operations of list the computed ence exactly then rounded were note inner when that They specified. [Goldberg Thus, can the standard 19901. arith- are computed in IEEE products be implemented efficiently. metic, the final answer quite be can completely specifying One reason for For wrong. sums are example, a special the results of arithmetic operations is to sum case of inner products, and the ((2 x improve the portability of software. When – 10--30) – 1030) + 10-30 exactly is 1030 two IS moved between ma- a .Program but 30 with machine a on chmes arith- IEEE support both and to equal 10- if result intermediate any differs, metic, the IEEE arithme~ic result computed 30. It is possible to compute be will – – 10 must be because of software bugs not it Another ad- differences arithmetic. in with inner products to within 1 ulp less than it takes imple- to hardware vantage precise specification is that it of easier it to reason about floating makes [Kirchner multiplier fast a ment and are point floating about Proofs point. Kulisch 19871.9 hard without with deal to having enough the All the in mentioned operations multiple cases from multiple arising dec- standard, except conversion between of integer pro- as Just arithmetic. kinds to be are required imal and binary, correct, can grams be proven to can so be rounded. that is reason exactly The effi- programs, floating-point although what rounding algorithms for all exactly cient the round- is case that proven in that is operations, the are conversion, except error ing certain result the of satisfies best the conversion, For known. known efficient produce results that algorithms bounds. 4 is an example of such Theorem a proof. These much are made eas- proofs rounded exactly than worse slightly are operations the when ier reasoned being 19841. [Coonen ones does require not standard IEEE The an are specified. precisely Once about transcendental exactly be to functions is IEEE for correct be to proven algorithm table maker’s the because of rounded arithmetic, will work correctly on any it illustrate, are you suppose To dilemma. the IEEE standard. supporting machine the exponential making a table func- of has [1981] Brown for axioms proposed Then four to tion = exp(l.626) places. of include that point floating most the rounded 5.083 to be this Should 5.0835. Proofs hardware. existing floating-point exp(l If or 5.084? 
more computed is .626) verify in this system cannot, however, 5.08350, then becomes it carefully, algorithms of Sections the 1.5, and 1.4 not features which all on require present hardware. Brown’s axioms Furthermore, are complex than simply defining more exactly then operations to be performed ‘Some inner including against arguments product theorems rounded. Thus, proving from of as one the basic operations are presented by axioms Brown’s difficult usually is more Kahan LeBlanc [19851. and ACM Computing 1991 Vol 23, No 1, March Surveys,

16 David Goldberg 20 “ 2. Table 754 IEEE Values Special is exp Since 5.0835000. then 5.083500, could this arbitrar- on go transcendental, Fraction Represents Exponent ily whether distinguishing before long f=o ~=~ *O –1 nun ddd or 5.083500 “ exp(l.626) “ O “ is O fx 2’mLn -1 f#o e ‘ml. = ddd. 9 not is it Thus, prac- “ “ “ 5.0834999 fx2’ 1 5 e 5 emax ~,n e f:o e=emay+l of precision the tran- that specify tical to e=g N%; f#o 1 ~ay + as if same the scendental functions be infinite to computed were functions the precision rounded. then ap- Another to proach would be specify transcenden- a num- valid pattern bit represents tal there functions algorithmically. But value return the ber, square root of not appear single a be does to algorithm must be number. floating-point some all across well works ar- hardware that System/370 of case the In FORTRAN, Rational approximation, chitectures. returned. IEEE arith- In ~ = 2 is three are tables large CORDIC,1° and metic, an this in returned is NaN different techniques used for computing situation. ma- transcendental on contemporary specifies The IEEE standard the fol- a Each is appropriate chines. for differ- 2): Table (see values special lowing O, f hardware, and present of class ent no at and NaNs numbers, + co denormalized single algorithm works over acceptably more is NaN, one (there ex- than as the of current hardware. wide range in These section). next the plained are encoded with all special values Quantities 2.2 Special or 1 exponents of 1 – e~,~ either + e~.X some hardware floating-point every On O has an (it was already pointed out that a bit pattern represents valid floating- of exponent e~,. 1). – point number. The IBM System/370 is the other hand, of example this. On an NaNs 2.2.1 patterns bit some VAX the reserves to re- called represent special numbers 0/0 Traditionally, computation of the or This idea goes back to served operands. 4 as unrecover- an treated been has 1 – CDC the 6600, which had bit patterns for that causes a computation able error to the special quantities INDEFINITE and for are, halt. There however, examples INFINITY. computation for sense makes it which a continues standard this IEEE The in Consider such in continue to a situation. tradition Number, a (Not NaNs has and a of zeros the finds that subroutine a in- and plan) with rhyme to pronounced function Traditionally, zero(f). say f, finities. special quantities, there Without zero an finders require the user to input exceptional sit- to way handle no is good on b] the which function is [a, interval the a of root square taking like uations over zero which the finder defined and negative aborting than number other is, will search. is subroutine That the computation. Under IBM System/370 A b). zero a, zero(f, useful more as called FORTRAN, the re- default action in finder to not the in- user would require a of root square the computing to sponse more This information. extra this put negative the in results 4 – like number appropri- general zero finder is especially every printing of an error message. Since calculators, to natural is it where for ate function a in key then to awkward and is have to specify the domain. It easy, 10 CORDIC Rotation Coordinate for acronym an is however, most finders zero why see to Digital Computer and is a method of computing finder require a domain. The zero does mostly funct~ons transcendental shifts uses that the work at its f probing by function divi- and multiplications few very e., (i. adds and probed value a for it If values. 
various both on used method the is It 1971], [Walther sions) the outside the domain of f, code for f 68881. and the Intel Motorola the 8087 ACM Computmg Surveys, Vol 23. No 1, March 1991

17 Floating-Point 21 9 Arithmetic 3. Operations that Produce an NaN Table e~~X nent significands. nonzero and 1 + put to free are Implementations system- Operation by Produced NaN the signifi- into dependent information W+(–w) + there unique cant. Thus, NaN a not is x Oxw NaNs. a whole family of rather When but cO/03 0/0, I ordinary floating-point NaN and an an m REM O, y REM REM x number should result the combined, are fi(when x < O) \ operand. the be same as the NaN Thus, a if the result of long computation is an the NaN, information system-dependent and might well compute ~, or 0/0 in information be will significant the the the would computation halt, unnecessar- in the generated when the first NaN the zero finding ily aborting process. Actually, computation was generated. This problem can be avoided by intro- last statement. If there is a caveat to the ducing and NaN called value special a the will both operands are NaNs, result of ex- computation specifying that the might one of those NaNs but it be not be first. the NaN that was generated produce ~ and 0/0 pressions like some of list (A halting. than rather NaN situations can that the of cause a NaN is Infinity 2.2.2 in Table 3.) Then, when zero(f) given to way a provide NaNs as Just continue probes outside the the of f, code domain computation when expressions like 0/0 a finder zero the and NaN return for f will can not is zero(f) is, That continue. infinities or ~ are encountered, pro- “punished” incorrect for an making way overflow an when continue to a vide this guess. in mind, With it is example This simply than safer much is occurs. the combining of to see what easy result returning to the largest representable an ordinary NaN a floating-point with example, consider an As number. com- final should be. Suppose number the = p 3, = b when ~~, 10, puting sqrt(d))/ statement off + b – return( is e~~X If = and and 3 = 98. x 1070 x d a f should return then <0, (2* If a). and overflow will X2 en th 1070, X = y 4 <0, an NaN, d sqrt(d) is NaN. Since Similarly 1098. x 9.99 by be replaced yz and will if NaN a be sqrt(d) + b – the turn and X2 + yz will each overflow in any sum of an NaN and other number the So 1098. x be 9.99 by replaced and operand one if NaN. Similarly, a is = 1098)112 x be final result will (9.99 NaN, of a division operation is an wrong. x 1049, which 3.16 is drastically a be should quotient the In NaN. The 1070. x IEEE 5 is answer correct In whenever participates NaN a general, is X2 CO, of result the arithmetic, is as operation, in a floating-point the X2 yz, + yz, and -. SO the final is another NaN. result result than m, which is safer is a writing to approach Another zero an returning floating-point ordinary user to the solver that does not require number the correct is nowhere near that use The input a domain is to signals. answer.” handler zero finder could install a signal results in an O O division of The by for floating-point exceptions. Then if f by number O, divided nonzero A NaN. were evaluated its domain and outside infinity: 1/0 returns ~, = however, re- raised an exception, control would be the distinc- – The reason = for 1/0 – co. solver. The problem zero the to turned and -0 f(x) If this: is tion O + as g(x) lan- with this approach is that every handling guage method has a different of method all), signals (if it so and at has a hope it portability. 
has no of the arith- IEEE Although default llFine point: in as 754, NaNs IEEE In are represented it is to round overflowed numbers to ~, metic is to possible change the default (see Section 2.3.2). expo- with numbers the floating-point ACM Computing Surveys, Vol 23, No 1, March 1991

18 David Goldberg 22 “ correct value when = = O: 1/(0 + 0-1) x x some then f( x)/g( x) approaches limit, 1/(0 CO) infinity Without O. = l/CO = + For could have example, any value. the x-1) arithmetic, + x 1/( expression sin when f’(x) then x, = g(x) and x = which not a only test for x = O, requires ~ 1 ~(x) when But O. + x as ~(x)/g(x) may but instructions extra adds dis- also ~ When O. f(x)/g(x) =l– COSX, a This example illustrates rupt pipeline. thinking situation limiting the as 0/0 of infinity that namely, fact; general a of a of two very small numbers, quotient spe - often arithmetic avoids the need for 0/0 represent anything. Thus, in could cial case checking; however, formulas IEEE the an in results 0/0 standard, make to be carefully inspected to need g(x) NaN. c, ~ f(x) and >0 c when But sure behavior spurious have not do they ~(x)/g(*) then O, ~ ana- any for m * ~ [as at did]. 1) + x/(X2 infinity f functions lytic for <0 g(x) If g. and other- m; ~ f(x)/g(x) then – small x, is the IEEE wise the limit stan- + m. So Slgnea 2.2.3 Zero # m as O. = c/0 long defines dard c as & of c the on depends The sign of signs co Zero is by the exponent represented = – co in the usual way, so – 10/0 O and the Since significant. zero a and 1 – e mm can +m. –10/–0= distin- You and bit values, take different two can on sign m getting between guish of over- because O. If a + zeros, two are there O and – because m and of division by getting flow when made were distinction comparing the status by O flags (which will checking O) = (x if like tests simple O, – and O -t in be discussed detail 2.3.3). in Section have behavior, de- would unpredictable will the in set be flag overflow The first pending x. the sign Thus, of IEEE the on second. in flag O by division the case, the defines standard that so comparison The of result the determining for rule than –O< +0. Al- +0= rather –O an as infinity has that operation an to always possible be would it though operand is simple: Replace infinity with sign stan- the IEEE the zero, of ignore take a finite number x and the limit as dard does not do so. When a multiplica- 3/m + x m. = O, because Thus, tion or division a signed zero, involves – co = – 4 Iim ~+~3/x = O. Similarly aI apply rules sign usual the computing in the of answer. Thus, 3(+ O) = -t O sign the not does limit the When w. = G and a = 3 – and O. If +0/– did not have zero will NaN, an is result the exist, so m/co additional be an NaN (Table 3 has exam- the 1/(1 sign, relation /x) = x would fail ples). This agrees with the used reasoning when hold to x = *m. The reason is in O, 1/– and 1/+ that ~ ~ both result NaN. to conclude that 0/0 should be an subexpression When evaluates to a a the and results 1/0 + ~, informa- sign in way One lost. having tion been restore to expression the NaN, the value of entire the In NaN. a also is how- w, & of case to is the identity 1/(1 /x) = have x the might expression the of value ever, however, infinity; of’ kind one only be- number floating-point ordinary an be that in the disastrous result would I/m a is Here O. = like rules cause of consequence an of losing the sign of example that makes use of the practical quantity. overflowed rules infinity for arithmetic. Consider Another signed of use the of example + computing 1). the function x/( X2 This zero functions and underflow concerns is only will it formula, a bad because not that have a discontinuity at such as zero is larger than IEEE natural is it arithmetic, to In log. 
x overflow when – and log x to be an O log define = w arithmetic infinity but iz will fib’”” NaN represents whe”n x <0. Suppose” x it &rong the give answer will because a small negative number that under- has yield O rather than a number near 1/x. Thanks zero. signed zero, x to flowed to can as rewritten be 1) + X2 x/( However, negative so can return log be will an expression 1/( x + x- l). This improved there no zero, signed were If NaN. will be- overflow prematurely and not not the log function could however, cause the will arithmetic infinity have of ACM Computmg Surveys, Vol 23, No. 1, March 1991

19 “ Floating-Point Arithmetic 23 distinguish an negative underflowed IEEE committee decided, however, that from and would therefore have number O the zero out- advantages of signed using to return – m. of example Another a the weighed disadvantages. at discontinuity with function is zero a function, returns the signum the which Numbers Denormalized 2.2.4 sign of a number. Consider normalized num- floating-point the most interesting of use Probably and –98. = p = e~,. 3, with bers = 10, O occurs in signed complex arithmetic. zero x 10-97 and = y The numbers % = 6.87 the equation example, an As consider appear to be perfectly ordi- 10-97 x 6.81 = ~/&. This true is certainly ~ which are numbers, nary floating-point = O. z when If z = 1. the obvious com- — larger 10 of factor a than more the than i and gives ~~ = putation ~ = x number floating-point 1.00 smallest I/i ~# Thus, –i. = I/n= strange property, They 10-98. have a 1/W The problem can be traced to the ! y! # x though however: even O = x 0 y and multivalued, is square that fact root The reason is that x – y = .06 x 10-97 there to select no way so values the is — — too be to small 6.0 is ‘g repre- 10- x entire the in continuous they are com- as a normalized number and so sented plex plane. continuous, is root Square must flushed to zero. be branch cut consisting of all if a however, important is it to How the preserve negative numbers is excluded real from property of problem the leaves This consideration. numbers, real negative the for do to what X=yex–y=o? (lo) iO, where of the form which x + are – to imagine writing the is very It easy x zero 0. Signed way perfect a > provides z = 1/ y) # (x if fragment code then of the to this resolve problem. Numbers and later having a program fail (x y) – of x square a have O) + i( + – form root to Track- due zero. by division spurious a form + x – the of numbers and i&, this bugs is frustrating down ing like the other side branch the of on O) – cut i( time consuming. a more philo- and On root with the other sign a have square text science level, computer - sophical ~). the for formulas natural i fact, In (– that though out point often books it even will ~ computing give results. these is currently impractical to prove large Let z = to return us ~ = l/fi. If programs correct, designing programs –l+iO, –1= then idea the - with re often them proving of in suits For example, intro- better code. 1/2 = 1/(-1 + iO) they ducing invariants is useful, even if 1(-1 -iO) part of a are not going to be used as — — just any like is code Floating-point proof. iO)(-1-iO) (-1+ facts provable have other to helps It code: on For example, when which depend. to - iO)/(( -1)2 (-1 - 02) = helpful will it (7), formula analyzing be –l+i(–0), = Oy=x <2x*x toknowthat x/2

20 David Goldberg 24 l 1! I , , , , I , I I r , I t I I , , 1 t I ~,,.+3 ~ +2 .=. o P’-’ P-+’ P P-”+’ compared underflow. with zero gradual to Flush Figure 2. e~,. 3, = Recall the example p O 10, = are They relations. useful other as well — — of the stan- the most controversial part –98, y x 6.87 = x 10-97, = 6.81 and 10-97 x at the beginning of presented long the for accounted probably and dard – x denormals, With y section. this does approved. getting in delay 754 Most is not repre - but zero to flush instead high-performance hardware that claims sented denormalized the by number be compatible IEEE does not support to is This called behavior .6 10-98. X but denormalized directly numbers It is verify to easy gradual underflow. rather when consuming or produc- traps always (10) using when holds that ing denormals, leaves it to software and gradual underflow. to simulate IEEE standard. 13 The the denormalized numbers goes behind idea illustrates Figure 2 denormalized numbers. the in line number top The to simple. is and back [19671 Goldberg figure normalized shows floating-point signifi- the the e~,., is exponent When between O numbers. Notice the gap and be cant does not For normalized. to have the smallest normalized number 1.0 x p e~,. 3, = and 13 = 10, when example, — — 98, 1.00 x 10-98 is no – the longer ~em~. the result of a floating-point cal- If falls gulf, it is flushed culation into this floating-point number, smallest because -‘8 10 x 0.98 is a floating-point also shows to zero. The bottom number line number. happens what are denormals added when is P 2 = when snag small a and There The numbers. floating-point of set the to a since used, being is bit hidden a num- in; “gulf’ a of result the when filled is e~,. ber with an always exponent will of is it ~’m~, calculation is less than 1.0 x or equal greater than a have significant by represented the nearest denormal. of the leading implicit because 1.0 bit. to When denormalized are added numbers to that used to The solution is similar between number the to line, the spacing O Table represent in summarized is and varies adjacent floating-point numbers in – e~,. exponent The 2. is rep- to used 1 are a regular way: Adjacent spacings ei- formally, if the denormals. More resent ther the same length or differ factor a by bl, are bits in the significant field of the Without denormals, spacing f3. the expo- value the and of b b lflem~ ‘P+ B from changes abruptly ~em~, to p–1 z~...~ when nent is then e, e > e~,~ – 1, the PP a – 1, rather than the which factor of is bl bz . “ . number being represented is 1. factor of ~, change Because a by orderly 2’, 1, – e~,~ = e when whereas x b ~ have can that algorithms many this, of bz being t~e represented 0.61 number is large num- normalized for error relative . . . ~_l b x 2’+1. The + 1 in the exponent to close bers the are threshold underflow denormals have an ex- is needed because behaved well gradual when range this in ponent of e~l., not e~,~ – 1. used. underflow is simple Without the underflow, gradual a expression x + y large very can have inputs, relative error for normalized as of the most troublesome 13This M the cause of one x 10–97 and x 6.87 = for above seen was Programs the of aspects that frequently #,andard. errors relative arge L 10-97. x 6.81 y = underilow on noticeably run hardware often slower can happen even without cancellation, as uses that traps. software ACM Computmg Surveys, Vol 23, No 1, March 1991

21 Floating-Point 25 l Arithmetic the following example shows [Demmel set, they remain set until ex- that once complex two dividing Consider 1984]. Testing plicitly is the flags the cleared. and c The obvious id. ib + + a numbers, is distinguish to way only which 1/0, a formula from infinity genuine an overflow. execution continuing Sometimes the in a+ib ac + bd be – ad — face appro- not is conditions exception of i + c+id–c2+d2 d2 -F C2 Section priate. gave the example of 2.2.2 x/(x2 the ~(3e-xf2, > x When 1). + if suffers from the problem that either a in resulting infinite, is denominator id is denominator of component c + the wrong. is which O, totally of answer final formula the /2, will ‘m= fib than larger the Although for this formula problem final even may result the though overflow as 1/ solved by be it rewriting can better be of method A range. within well (x always x-l), not rewriting may + computing quotients is to use Smith’s the problem. The IEEE standard solve the formula: recommends strongly that implementa- allow tions handlers to be installed. trap b(d/c) + a b a(d/c) - trap the exception an when Then occurs, ‘Zc+d(d/c) c+d(d/c) handler called instead of setting the is a+ib ifldl

22 26 Goldberg David * Table4. mlEEE Exceptions 754a Exception to Argument Result Trap When Traps Disabled Handler *xmaY or fm Overflow Round(x2-”) ordenormal + O, 2em~ Underflow Round(x2”) Divide zero by Operands Invalid ~amN Operands round(x) Inexact round(x) = ‘.x for single precision, 1536 for Is the exact result of the operation, a 192 x23m*. 1.11...11 = xm,, and double, product partial ~ ~= H = hardware p~ for not have flags of’ its own overflows xi does trap k, the handler increments the sys- some operating the interrupts instead but by 1 counter returns the overflowed and to signal a floating-point exception, tem wrapped exponent the with quantity of be the inexact exceptions could cost around. precision, single 754 IEEE In avoided prohibitive. This cost can be by by having the status flags maintained 1.45 = if so 127, will p~ it x 2130, = e max software. The time an exception is first be overflow and cause to the trap handler ap raised, set the software flag - for the called, the which will wrap exponent back propriate the tell and class floating-point 2-62 ph 1.45 to x changing into range, mask off of class to excep- hardware that underflows, p~ below). Similarly, if (see all will further exceptions tions. Then and counter would be decremented the the interrupting without run operating get the wrapped negative exponent would status that resets user a When system. the all around into a positive one. When reenabled. is mask hardware the flag, is counter the if are multiplications done, p.. If the is zero, the final product Handlers Trap 2.3.1 product is over- counter positive, the is flowed; if the counter is negative, it un- One obvious use for trap handlers is for products none of the If partial derflowed. backward that codes Old compatibility. handler out range, the trap is is never of exceptions oc- when expect to be aborted no incurs the and called computation ex- cur can install a trap handler that aborts cost. Even if there are over/under- tra is useful especially This process. the for like loop with a do S until (x codes > = accurate is calculation the flows, more than with loga- computed been had it if a Since comparing a num- NaN to 100). pk computed was because each rithms, berwith (but or= >,~, <,<, not multi- full-precision _ ~ using a p~ from go always #) returns false, this code will Barnett ply. a formula [1987] discusses infinite an into if x ever becomes loop over/under- the accuracy full where of NaN. an counting turned flow an error in ear- up a is There for use interesting more formula. lier tables of that when up comes that handlers trap com- over- an when that specifies 754 IEEE could products puting such as ~= ~ x, that H is handler trap underflow or flow called, potentially overflow. One solution is to passed it the wrapped-around result as is use logarithms and exp(X log x,) compute wrapped argument. The definition of an instead. ap- this with problems The result is around the that is overflow for and proach are that it less accurate is then to if as computed precision, infinite than more the simple expression costs IIx,, no if there is is There overflow. even divided 2”, by then rounded to the rele- the result underflow, For precision. vant trap handlers another using solution underfZo w counting that over / called 2a. is a The by multiplied is exponent avoids of these problems [Sterbenz both dou- for for single precision and 1536 192 1974]. 1.45 This why is was x ble precision. 
2130 x ex- the in 2-62 global The idea is as follows: There is a into 1.45 transformed the Whenever zero. to initialized counter ample above. ACM Computmg Surveys, Vol. 23, No. 1, March 1991

23 Floating-Point 27 “ Arithmetic be anywhere in that could answer correct Rounding Modes 2.3.2 interval. Interval arithmetic makes more rounding occurs standard, IEEE In the a with conjunction in when sense used that operation a whenever an has result precision multiple floating-point pack- the exception exact, since (with not of is first performed is calculation age. The conversion) decimal binary opera- each interval arith- If p. precision with some is computed exactly then tion rounded. metic suggests the final answer that may default, By round means rounding to- be redone the computation is inaccurate, that standard The nearest. ward requires precision higher and until with higher provided; be modes rounding other three is a reasonable size. final the interval namely, round toward O, round toward – + used When co. m, toward round and Flags 2.3.3 with the convert integer operation, to The of number flags a has standard IEEE to round toward – m causes the convert As and is there above, discussed modes. function, floor the become round whereas, flag excep- five the of each for status one mode rounding is The ceiling. toward + m division tions: underflow, overflow, by to- overflow affects round because when invalid inexact. and operation, zero, or round – m is in ward effect, toward O round There are four rounding modes: magnitude overflow positive of causes an round round w, + toward nearest, toward result default the to be the largest repre- is and round O, toward – m. It toward Similarly, + not number, sentable m. be an there strongly recommended that magnitude will negative overflows of each ex- five enable mode bit for the of negative the largest produce number some gives section This ceptions. exam- toward m + toward round when or round be ples how these modes and flags can of effect. O is in use. good to put sophisticated more A oc- One application of rounding modes in discussed is 4.2. Section example is (another arithmetic interval in curs com- writing a subroutine Consider to mentioned When using 4.2). Section in n is an integer. When pute x‘, where sum the interval arithmetic, num- two of like routine simple a 0, n > x y and bers is where 21, [Z, interval an toward rounded y @ x is g is 2 and co – PositivePower(x,n) { The m. + toward rounded y @ x exact while (n is even) { result the addition is contained within of X=X*X rounding the Without 21. [Z, interval = n/2 n im- modes, is interval usually arithmetic } U.x Y) @ (x = z computing by plemented while (true) { Z e), + y)(l (1 – c) and ~ = (x @ where = n/2 n over- in results epsilon. machine is This if u = = (n O) return for the size of the intervals. estimates X=X*X an the result of Since operation in inter- if(nisodcl)u=u*x val arithmetic is an interval, in general the input an operation to will also be an 11 intervals two If interval. and [y, y] [g, are the result is [g, .21, where-g is added, will compute x n. g to the @ rounding mode set y with most the <0, accurate way to n If not is to call Positive- x” compute 2 with roun~ toward – ~, and Z is Z @ Power(l /x, – but rather 1 /Posi- n) rounding mode set toward + ~. the the because first n), – tivePower(x, When a floating-point calculation is n each quantities, expression multiplies arithmetic, interval using performed the the has from error rounding a which of answer final interval that an contains is the division (i.e., 1/x). In second expres- calculation. 
This of result the exact the sion these are exact (i.e., x) and the final the turns interval very not is helpful if division commits just one additional large be to since out the does), often it (as ACM Computmg Surveys, Vol 23, No 1, March 1991

24 Goldberg David 28 “ tures usually have floating-point instruc- error. rounding is there Unfortunately, a tions, those generate must compilers If slight snag in this strategy. Positive- oper- instructions, floating-point the and Power(x, – n) then either underflows, what to do ating must decide system called the underflow tra~ handler will be 1 raised when exception conditions are for or set. be will flag status underflow the floating-point instructions. Com- those because is incorrect, if This x-n under- guid- get designers system puter rarely x‘ or will be either then flow, overflow ance from numerical analysis texts, standard IEEE the since But 14 range. in which aimed at users and typically are flags, all the the gives to access user the writers software at not of computer subroutine can easilv for this. correct designers. It turns underflow and overflow the off plausible how of example an design As trap enable bits overflow and saves the can lead decisions to unexpected behav- then and com- status bits. It underflow ior, consider following BASIC the tmtes /PositivePower(x. n). If – 1 nei- program: ~her overflow nor underflow status th~ them is together restores it set, bit with 3.0/7.0 = q of one If enable bits. status the the trap if q = 3.0/7.0 then print “Equal”: redoes bits flags the restores it set, is and “Not else Equal” print the calculation PositivePower using using Borland’s run and compiled When ex- correct the causes n), which – (1/x, Basic15 Turbo the on pro- PC, an IBM to occur. ceptions gram example Not prints Equal! This of flags Another example of the use will in Section 3.2.1. analyzed be the computing arccos via occurs when some people think that Incidentally, formula never the solution to such to anomalies is compare for numbers floating-point l–x but equality them consider instead to arccos x = 2 arctan — 1+X” r equal some error bound if they within are all, it because cure a hardly is This E. as it answers. raises questions many as to m /2, then arc- If arctan(co) evaluates E be? If x <0 of value What should the correctly evaluate to – COS( 1) will should E, re- they are within and > y 0 of 2 arith- infinity because r = arctan(co) snag, small however, a is There metic. though equal, considered be ally even Furthermore, they have different signs? because computation of (1 – x)/ the & b - the relation defined by this rule, a (1 i- x) will cause the divide zero ex- by not is equivalence an rela- E, - I a b I < be ception flag to set, even though arc- The is exceptional. not solution COS( – 1) c not b and b do - a because - tion c. a that imply - to straightforward. is problem this Sim- the save ply zero by divide the of value flag computing arccos, then re- before 3.1 Instruction Sets after the computation. store its old value for common is It an algorithm to require in precision higher of burc.t a short order 3. SYSTEMS ASPECTS One example to results. produce accurate every The design of a of aspect almost b [ the quadratic – in occurs formula system computer requires knowledge Sec- in discussed As ~/2a. t about floating architec- Computer point. rounding error b2 = 4 ac, 4.1, when tion in can contaminate to half the digits up the with computed roots the quadratic because <0, n <1, z if and range in be can 141t than smaller bit a just is –” x underflow the tiny and so threshold 2em,n, then x“ = 2 ‘em~ < 2em= registered trademark of Borland 15 Turbo Basic is a precision, may not overflow, since in all IEEE International, Inc. em... 
emln – < ACM 1991 Surveys, Vol 23, No 1, March Computmg

25 Floating-Point 29 A~ithmetic l solution x(l) is computed by Suppose a formula. By subcalcula- the performing perhaps elimina- Gaussian method, some 4 b2 precision, half – ac in double tion of a is There tion. improve to way simple are double precision bits of the root the iteratiue the accuracy of the result called lost, the pre- that means which single all compute First improvement. preserved. bits are cision b2 ac 4 double – in The computation of a, each of the quantities precision when easy if precision is and c are in single b, solve the system Then is there instruction that multiplication a two single precision numbers and takes (13) double result. a produces precision To of product rounded exactly the produce is x(l) if that Note solution, exact an a needs multiplier two numbers, p-digit is & then as is vector, zero the y. product, to generate the entire 2p bits of the In general, computation of & y and away although it may throw as it bits & = Ay = rounding will error, so incur proceeds. a Thus, compute to hardware the x where is = – x), Ax(l) A(x(lJ – b from double-precision single-pre- product = y Then solution. true (unknown) a only be normally cision operands will estimate the so an improved for x(l) x, – single-preci- a little more expensive than solution is and sion multiplier much less expensive than double-precision multiplier. De- a X(2) = .Jl) _ (14) Y. instruction modern this, spite sets tend only provide produce instructions that to (12), be can (14) and (13), steps three The as a result of the same the precision replacing and X(2), with x(l) repeated, operands. 16 X(2) with x(’ that argument This X(3). +1) If an instruction that two combines more than X(L) is infor- is accurate only single-precision operands to produce a more For mal. Golub see information, double-precision only useful product were Loan [1989]. and Van would for it not formula, the be quadratic When performing improve- iterative an instruction set. worth adding to This are elements whose vector a is $ ment, instruction has how- uses, other many floating- nearby of difference the inexact a ever. Consider the problem of solving suffer from numbers point so can and equations: system of linear iterative Thus, catastrophic cancellation. improvement unless useful very not is – pre- double in computed is b Ax(l) = ~ Once cision. again, this is a case of com- two puting the product single-precision of where ( A and X(l)), numbers the full needed. double-precision result is that summarize, To multi- instructions written form matrix as in can be which two re- and numbers ply floating-point b, = Ax where the of turn a product with twice precision make operands the a to addition useful a the instruction set. Some of floating-point for implications of this compilers are dis- the section. cussed in next Compilers and Languages 3.2 because like “orthogo- probably designers is IsThis The and floating compilers of interaction where a of precision the sets, instruction nal” Farnum in is discussed point [19881, and the of floating-point instruction are independent is much of the discussion in this section case actual multi- a special operation. for Making taken from that paper. plication orthogonality, destroys this ACM Computing Surveys, Vol. 23, No. 1, March 1991

26 Goldberg David 30 “ example, the expression numbers. For 3.2.1 Ambiguity + answer different totally a has z (x y) + de- a Ideally, language definition should + (y + x than 1030, = x when z) semantics the language the of pre- fine –1030 = y z 1 = and for- the in 1 is (it enough about to prove statements cisely ‘O case, mer in impor- The latter). the programs. true usually is this Whereas tance parentheses preserving of cannot lan - language, for the integer part of a pre- be overemphasized. The algorithms often gray guage large a have definitions 4, 3, in Theorems sented all 6 and depend point area when it comes to floating example, on it. For in Theorem 6, the [Nelson 19911). exception an (modula-3 is re - – ( mx mx – x) would x. formula = that fact the to due is this Perhaps many for not were it if x x~”= to duce paren- designers nothing language that believe entire destroying thereby theses, the floating about since preven point, can be that language A algorithm. definition error. previ- rounding the so, If it entails to require not does be parentheses fal- demonstrated have the ous sections floating-point useless is honored for reasoning. section lacy in this This calculations. areas in common some discusses mav evaluation is impre- Subexpression sugges- “gi~es and definitions language defined Sup- cisely in many languages. them. with deal to how about tions is x ds y and pose but double precision, languages enough, some Remarkably are expres- precision. Then in the single is x if that specify float- a not do clearly x + sion *y, is the product performed ds value of a say (with variable ing-point or in precision? Here is single double (say) every then 3.0/10.0), occurrence of another m where h m + x In examde: . must 10.0 * x have the same value. For and are the division is an n integers, which is based on example Ada,~7 floating-point integer operatio; or a one? Brown’s model, seems to that imply with this There ways to are deal two floating-point only has to sat- arithmetic completely problem, neither which of is expres thus axioms, Brown’s isfy and satisfactory. that require is first to The sions have one of many possible can the have all variables in expression an in floating point about Thinking values. solution type. This is the simplest same stands way fuzzy this contrast sharp in but some lan- drawbacks. First, has where model, IEEE the to of result the that guages like Pascal have subrange each floating-point operation is precisely types allow mixing subrange variables prove defined. In the IEEE model, we can with so somewhat is it variables, integer to that (Theo- 3 evaluates 3.0 * (3.0/10.0) prohibit bizarre mixing single- and to cannot. we model, Brown’s rem 7), In Another variables. prob- double-precision Another ambiguity in most language concerns lem expres- the In constants. what concerns happens on definitions most interpret 0.1 sion 0.1 *x, languages other overflow, and excep- underflow, single-precision a be to Now constant. precisely standard IEEE The tions. spec- the programmer decides to suppose behavior so the exceptions, of ifies declaration of all the the change a that use the standard as languages single from variables floating-point to model avoid any ambiguity on this can double as 0.1 is If still treated precision. Boint. . be will there single-precision a constant, gray Another the concerns area inter- compile The error. time programmer a Due parentheses. 
of pretation roundoff to will every change and down to have hunt errors, the associative laws of algebra do floating-point constant. not necessarily floating-point for hold second allow is to The mixed approach in which case rules for expressions, subexpression evaluation must be pro- guiding vided. are a number of There examples. The original definition of C is 17Ada S. the of U trademark Govern- registered a required that every floating-point expres - ment office joint Ada program ACM Computing Surveys, Vol. 23, No 1, March 1991

27 Floating-Point 31 g Arithmetic be computed in double precision both operands are single cause precision, sion result. the is as leads 19781. Ritchie and [Kernighan This more A subexpression sophisticated the immedi- example to like anomalies as- follows. rule is as First, evaluation expres- ately proceeding Section 3.1. The in is double sion computed 3.0/7.0 sign operation a tentative precision, each single-precision is q if but precision, a of maximum the is which of precision the be to has assignment This operands. its variable, sin- to rounded the quotient is is a gle precision for storage. Since 3/7 out carried from the leaves to of root the perform Then, tree. expression the - sec a computed repeating binary fraction, its in different from is precision double value the leaves. from pass In ond the root to operation each to assign pass, this the Thus, precision. single in value stored its and maximum of the tentative precision 3/7 fails. This sug- = q comparison the every in expression gests computing that precision the In parent. the by expected a the highest precision available is not of leaf the case q = 3.0/7.0, every is sin- rule. good are precision, gle so all the operations inner is example guiding Another the in done single precision. In case of inner the thou- If products. has product * = y[il, the tentative precision x[il + d d terms, of sands the the in error rounding preci- single of the multiply operation is second sion, pass pro- but in the it gets substantial. One way to sum can become to its moted double precision because error is to accumu- reduce this rounding in sums the late (this precision double double-preci- a operation parent expects single sion operand. And in y = x + discussed in more detail in Sec- will be If 3). 3.2. a tion double-precision is d addition is (dx – single in done dy), the [1988] precision. Farnum evi- presents preci- variable, and X[ 1 and y[ 1 are single dence difficult not is algorithm this that sion will arrays, the inner product loop If d like multi- = d + x[i] * y[i]. look the implement. to plication is done in single precision, much disadvantage The that is rule this of the of a subexpression de- evaluation of the advantage of double-precision ac- cumulation is lost because the product is which it is pends on the expression in just precision single to truncated before embedded. This can have some annoying double-precision a to added being For you suppose example, consequences. variable. a program debugging to want are and covers two previous the that rule A know subexpression. the value of a You compute to is examples in expression an cannot simply type the subexpression to ask and evaluated be to it the debugger highest variable any of precision the that because subexpression the of value the in Then in that expression. occurs q = will sin- in entirely computed be 3.0/7.0 the expression the program depends on precisionlg gle which in it is embedded. A final com- Boolean the have will and since con- on ment subexpression is that * d true, whereas d = + y[il x[il value precision, will be computed in double an is verting decimal binary to constants full gaining double-pre- of advantage the af- rule evaluation the operation, also sim- too cision accumulation. This rule is decimal con- interpretation of fects the cases plistic, however, to cover all for especially is This stants. important If and dx dy are double-preci- cleanly. exactly not are which 0.1, like constants representable binary. 
in variables, sion the expression y = x + area potential Another gray occurs single(dx dy) contains a double-preci- – exponentia includes language a when - the but performing sum variable, in sion built-in as operations. Un- tion its one of double pointless be- would be precision the the basic arithmetic operations, like value of exponentiation is not always ob- and vious [Kahan * * If 19821. Coonen is 18This assumes the common convention 3.0 that operator, then is the exponentiation whereas is 3.ODO constant, single-precision a a dou- 3 value – 3) * * ( certainly has the – 27. constant. ble-precision ACM Computing Surveys, Vol. 23, No. 1, March 1991

28 Goldberg David “ 32 Using which means 00 = 1.19 this defini- However, problematical. is (–3.0)**3.O tion would unambiguously define the ex- checks for integer pow- * the If * operator ponential and arguments all for function it as (–3.0)**3.O compute would ers, define in particular would 3.0 ( – 3.0) * * On = 3.03 – the if hand, other the –27. * * Y k x is to used define –27. be to = formula x e- Y then depending on for real arguments, the result could be a function, the log 3.2.2 IEEE Standard the of definition natural (using NaN x log(x) the If O). when FOR- < NaN = 2 Section features the of many discussed the IEEE of The IEEE stan- standard. is TRAN CLOG however, used, function be – 27 because the ANSI the answer will dard, however, says about how nothing CLOG defines standard FORTRAN these a from accessed be to are features is there programming language. Thus, 3.0) be [ANSI 1978]. The 3 log ( to i~ – mismatch a floating- between usually avoids Ada language programming this exponentia- problem only by defining point hardware that supports the stan- dard like languages programming and C, tion ANSI while powers, integer for FORTRAN. Pascal, Some of the IEEE or FORTRAN prohibits negative a raising accessed through can capabilities a be to a number real power. subroutine For example, calls. library of fact, standard FORTRAN says the In requires standard IEEE the square that that root be exactly rounded, and the square result M Any arithmetic operation whose not implemented root often function di - is . mathematically . prohibited defined is hardware. rectly functionality is in This introduction Unfortunately, the with root square library a accessed easily via the standard, the by IEEE mean- of CO + routine. Other standard, the aspects of no is not defined mathematically of ing however, are easily implemented not so totally longer clear cut. One definition example, most subroutines. as For com- of Section use to the method might be most at specify languages puter two For 2.2.2. determine the to example, IEEE floating-point types, the whereas, nonconstant consider ana- ab, of value standard (al- precision different four has property the with g and f functions lytic recommended though configurations the b If x~O. as g(x)+ and a f(x)+ that or single, extended single plus single are always approaches the same f’( X)g(’) extended). double and Infinity double, of ab. This value the limit, this should be to another example. provides Constants m, which 2m set would definition seems = be supplied by a represent co + could quite Om, In the case of 1. reasonable. make might that But subroutine. them and 1 = f(x) limit the I/x = g(x) when unusable in places constant require that – x and f(x) = 1 1, when but approaches expressions, initializer the as a such of e. g(x) = l/x the limit is So l.0~ should constant variable. = f(x)g(’) NaN. In the case be 0°, of an A - manipulat is situation subtle more Since analytical are g and f eg(x)log ~t’). state with ing the a computa- associated = f(x) O, and take on the value of O at of the consists state the where tion, + and g(x) = blxl alxl+az .“” x2+ trap rounding trap bits, modes, enable Thus. . . .. bzxz+ handlers, and exception flags. One ap- proach is to provide subroutines for read- the writing a In state. ing and addition, :~g(x)log f(x) — — .“. )) limxlog(x(al + CZzx+ x-o lgThe that 00 = 1 depends on the conclusion re- striction f be nonconstant. 
If this restriction is removed, then letting ~ be the identically O func- for tion O as a possible value gives lim ~ - ~ f( x)gt’~, a NaN. be and so 00 would have to be defined to g, f and all So ~(x)g(’) + e“ = 1 for ACM Computmg Surveys, Vol. 23, No 1, March 1991

29 Floating-Point 33 “ Arithmetic a new call single that atomically set can and cisely, current the on depends it value often and return the old is value value some- This modes. rounding the of Section in examples the As useful. 2.3.3 im- the with conflicts times of definition showed, modifying of pattern common a rounding plicit in type conversions or the state IEEE change within only it to is function explicit round in languages. of scope or subroutine. the Thus, a block This programs that wish means that the on programmer to is burden the find to use the cannot rounding IEEE use the from each sure make and block exit and language con- natural primitives, the support Language restored. is state will versely the language primitives be precisely in the for setting the state scope ever- implement inefficient on the to useful here. block of very a be would of number increasing machines. IEEE that imple- language one is Modula-3 trap for idea ments handlers this 3.2.3 Optimizers 19911. [Nelson to subject the ignore tend texts Compiler minor be need number A points to of al. point. floating of For example, Aho et IEEE the when considered implementing with x/2.O replacing mentions [19861 x language. standard in a = x – Since x * 0.5, leading the reader to assume that all +0 How- +0. = (+0) – (+0) X,zo for x/10.O be should *x. 0.1 by replaced ever, should x – ( + O) = – O, thus not – expressions two These however, not, do O of introduction The x. – as defined be on a binary same the have semantics can NaN confusing because an NaNs be cannot 0.1 because repre- machine be (in- number other any to equal never is textbook sented exactly in binary. This so NaN), another cluding no is x = x * z x – y * x replacing suggests also by longer always true. In fact, the expres- x z), – even though we have seen that *(y is for way simplest the to x # x sion test can expressions two these dif- quite have the func- recommended IEEE if NaN a values ferent y == when z. Although it provided. not tion Isnan is Furthermore, alge- does qualify the statement that any are NaNs with unordered respect to all when optimiz- can identity be used braic de- be cannot y s x so numbers, other by ing should optimizers that noting code intro- z not as fined > the Since y. not definition, language the violate it duction of NaNs causes floating-point floating-point impression that the leaves become partially a to numbers ordered, important. semantics are not very compare of one returns that function the Whether or not language standard unordered can make it , or > <,=, specifies that parenthesis must be hon- deal with easier for the programmer to have ored, + -y) + z (x a totally differ- can comparisons. as z), + (y -F x than answer ent discussed defines standard IEEE the Although above. floating-point basic the - re to operations There is to related closely problem a NaN, turn a this is a NaN if any operand is preserving parentheses that illus- might not always be the best definition code: trated by the following example, for For operations. compound computing appropriate scale the when eps 1 = graph, the a plotting in use to factor do 1> + (eps while eps 0.5* = 1) eps must be maximum of a set of values sense computed. In this case, it makes an estimate give to designed is code This simply for ignore to the operation max machine for epsilon. If an optimizing NaNs. 1 eps eps @ + > 1 notices that compiler problem. Finally, rounding can be a com- changed the program will be >0, pletely. 
the computing of Instead small- The - standard defines rounding pre IEEE est still x such that 1 @ x is number c = x(x greater than com- it 8-F’), = will pute x/2 which for x number largest the rounding mode is round toward – m, 20 Unless the (x is rounded to O = (?em~). Avoiding this O. = x – – x case which in ACM Computing Surveys, Vol. 23, No. 1, March 1991

30 Goldberg David 34 “ that believed floating- An optimizer kind so important is “optimization” of of arithmetic point alge- laws the obeyed that use- is worth it presenting one more would bra [T = C that conclude – – S] ruined totally is that algorithm ful it. by rendering = [(S + Y) – Y – Y = O, S] in- numerical as such problems, Many These the completely useless. algorithm of tegration and the numerical solution examples summarized by saying be can involve differential computing equations, cau- that optimizers should be extremely sums terms. manv Because each with . tious identities algebraic applying when an er- introduce potentially can addition real num- hold for the that mathematical large ulp, sum a 1/2 ror as as involving involving bers floating- to expressions a bit thousands of terms can have quite variables. point to way cor- of rounding error. A simple that can optimizers way Another is rect for this to store the partial sum- floating-point change the semantics of a mand double-precision variable and in expres- the In constants. involves code addition each perform to double using is sion 1.OE –40 *x, there an implicit dec- done being is calculation the If precision. to that binary conversion operation imal the in single sum precision, performing converts the a to number decimal binary com- in double precision is easy on most constant cannot this Because constant. al- is calculation the If systems. puter in exactly be represented in- the binary, precision, in done being ready double In exact - raised. be should exception addi so precision the doubling however, not is tion, the underflow flag should to be set method One that is sometimes simple. evaluated is expression the if single in sort is advocated the numbers and add to precision. inexact, is constant the Since to smallest them a is There largest. from depends on exact conversion to binary its much more efficient method, however, rounding value IEEE the of current the that dramatically improves the accuracy converts that an Thus, modes. optimizer 8. of sums, namely Theorem compile OE-40 would 1. to binary at time the pro- of be changing the semantics 8 (Kahan Summation Formula) Theorem however, that Constants like 27.5, gram. the smallest are exactly representable in is computed using the Suppose ~ XJ El: be available precision can safely con- following algorithm verted time, compile since they are at s X[l] = raise exception, any cannot exact, always C=o and rounding the by are unaffected forj=2to N{ Constants be modes. to that are intended Y.xrjl-c converted at compile time should be done T=S+Y constant a with declaration such as const C=(T– S)– Y pi 3.14159265. = S=T elimination is Common subexpression } that another example of an optimization as can change floating-point semantics, computed equal Then to is S sum the following code: by illustrated the xi where I 6, I 1, I 0(iVe2)X Exj(l + dj) + 2f. s A*B; C= the com- the ~xl, formula Using naive Up RndMode = A*B; D= + is Xx~(l to equal where 6J) puted sum - the with this Comparing j)e. (n < 6, I I Although * com- a be to appear may B A the error in Kahan summation form- improvement. shows a dramatic ula not is it the because subexpression, mon only e by perturbed Each summand is 2 the two rounding mode is different at n e as large as instead of perturbations evaluation sites. Three final examples in Details are formula. simple in the are x = x cannot be replaced by the Section 4.3. Boolean true, because it fails constant ACM 1991 Surveys, Vol 23, No 1, March Computmg

31 Floating-Point 9 Arithmetic 35 NaN; x O – x fails for x – is an The implementation of library functions when = cos such as sin and difficult, more even is is opposite x = + O; not of and x < y the because transcenden- these of value the are greater neither NaNs because y, > x nor than floating-point than ordinary less rational not are functions tal numbers. integer arithmetic is often pro- Exact numbers. there are use- these Despite examples, handy is and vided by for Lisp systems some Exact floating-point problems. that optimizations ful can done on be code. floating-point there are alge - First, useful. arithmetic is, however, rarely algorithms useful are there is fact The that float- for valid are identities braic formula) (like the Kahan summation that Some in numbers. ing-point examples z x (x + y) + exploit # + (y + z), and 2 x + IEEE are x y = y + arithmetic x, work whenever the bound O.5 x=x+x. lxx=x. and =X12. XX however, the’se Even simple ‘identities, a machines on fail can few such as CDC afllb=(a+b)(l+d) and Cray Instruction supercomputers. and scheduling inline procedure substi- holds –, for bounds similar as well (as potentially other two are tution useful for /). Since these bounds hold and x, eXample) cOn- final a As 21 optimizations. just not hardware commercial almost all y, where sider the expression dx = x * x machines IEEE arithmetic, it would with and y are single precision variables and numerical for be to programmers foolish m-ecision. double that machines On is dx such ignore would be it and algorithms, multiplies two ins~ruction an have that irresponsible compiler writers to de- for numbers single-precision to a produce that these algorithms pretending by stroy * = dx number, double-precision y can x num- floating-point variables have real mapped instruction that to get rather semantics. ber than instructions of compiled series to a then that convert the operands to double Exception Handling 3.3 perform a double-to-double precision The discussed up to now have pri- topics multiply. concerned of implications systems marily view restric- Some compiler writers handlers and precision. Trap accuracy tions that prohibit z -t y) + (x converting raise systems some interesting is- also x to of irrelevant, as z) + interest + (y recom- strongly standard IEEE The sues. unportable use who programmers to only able trap to mends a that users be specify have that mind in thev Perham tricks. five handler for each of the classes of real model num- floating-point’ num~ers exceptions, and Section 2.3.1 some gave real the bers and should obey same laws applications trap user defined han- of num- with problem The do. numbers real operation invalid of case the In dlers, semantics extremelv are thev that is ber . exceptions, the han- and zero by division expensive two time Every implement. to with dler should be provided the n bit numbers are multiplied, the prod- exactly otherwise the operands, with n have 2 uct n will two time Every bits. result. Depending rounded on the pro- bit with widely numbers spaced expo- language gramming being used, trap the n are nents 2 have will sum the added, handler might be access able other to An algorithm that involves thou- bits. variables For well. as program the in all a sands of operations (such as solving handler trap be able the exceptions, must system) operating be soon will on linear to identify what operation was being hopelessly huge numbers be and slow. performed and the precision of its destination. 
oper- The that assumes standard IEEE ‘lThe VMS math libraries on the VAX use a weak are serial that and conceptually ations form that in substitution procedure inline of they to when an interrupt occurs, it is possible call use jump rather to subroutine the inexpensive identify the operation and its operands. slower the than instructions. CALLG and CALLS ACM Computing Surveys, Vol 23, No. 1, March 1991

32 Goldberg David l 36 specifies an exception and a value to be or machines have On pipelining that when the the as used exception result an when units, arithmetic multiple ex- that suppose example, an As occurs. in sim- enough be not may it occurs, ception sin the /x, x computing de- for code user to the have the trap handler examine ply = x that cides O would it that rare so is for program counter. Hardware support a improve for performance to avoid test exactly which operation identifying = O case this handle x instead and when necessary. trapped may be a occurs. trap IEEE Using trap 0/0 han- the Another problem is illustrated by handler dlers, the user would write a following fragment: program installs that returns it value of 1 and a x=y*z - presub Using x/x. sin computing before Z=X*W the user would specify that stitution, a=b+c an occurs, operation when the invalid d = a/x value calls be should 1 of used. Kahan this presubstitution because the value to raises Suppose the second multiply an be be before the ex- must used specified wants handler trap the and to exception, using trap han- ception occurs. When the can that hardware On a. of value use be the value to be returned can dlers, and multiply add parallel, in do an the occurs. computed when trap o~timizer mobablv move the would an . of The advantage presubstitution is ahead the of second operation addi

33 Floating-Point 37 Arithmetic ~ subtraction will always be accurate (The- x – ~ <1, then 6 = O. Since Second, if 2). this verify to proceed now We orem the y – x that smallest can be is Theorem one 2 parts, has two for fact. k k part addition. for one and subtraction The – o.r—— 1.0 subtraction is as follows: for o@”””@ (P – 1)(8-1 + . . . +p-~) > Theorem 9 positive If x and y are fZoating-point num- Q (where rela- the case this in 1), – ~ = bers in a format with parameters D and p is tive error bounded by done with p + 1 is dig- if and subtraction the guard one e., rela- (i. its then digit), is less result the in error tive rounding 26. = (2/~)]e + [1 = l]~-p [(~/2) + than . + (6- l)p-p(r’ . . +p-k) < and x y is neces- Interchange Proof It so sary > y. that is also harmless to x . + 1)(6-’ - (/3 +6-’) . . y so represented is x that by and x scale = /’-P (18) 13°. x ~ is ~_ x ““” xl Xo. y If represented yo. as . . . yl then is difference the 1, YP. when is but final The x – y <1 case yP, c exact. If y is represented as O. yl . . this could The x > ~ hap- only way 1. – then that the ensures digit guard the which 8 – j = 1, in x case if = O. is pen computed difference will be the exact dif- applies, But again if 6 = so then (18) O, ference rounded to a floating-point num- b < relative error is bounded by ‘p the most ~. In the ber, so rounding error is at 6/2). + p-p(l q general, “ yh+P .00 Oy~+l let . 0.0 = y . p digits. 1 + truncated y be Y let to and When P = 2, the bound is exactly Then, e, = x for achieved is bound this and 2 – 1 y = 21-P and 21-2P in the 22-P + of p + co. When adding numbers as limit (0 – +p-~-~). ““” 1)(6-P-1 < the not neces- digit same sign, a guard is (15) sary to achieve good accuracy, as the following shows. result From definition guard digit, the the of value x – y is x–y computed of Theorem 10 rounded a floating-point to number; be J) 8, where the rounding – (x is, + that in >0 and y If >0, the relative error x 8 satisfies error e, is -t x computing at most 2 y even no if guard digits are used. (16) addition for algorithm The Proof to k guard digits is similar the with and for subtraction. If x = y, algorithm The exact difference is y, x – so the er- x of points radix the until right y shift roris(x ~)= J–y +6. –y)–(x–ji+ digits Discard and y any are aligned. three the >1, x – y There If are cases. position. k Compute + p shifted the past error bounded by relative is k + num- p digit of these two the sum y–j+ti digits. p round to bers exactly. Then will verify the theorem no We when 1 the is case general used; are guard digits There in generality of loss similar. no is (~- ““” 1)(~-’+ +6-’)+: ==~-p that O x is assuming that x z y ~ and 1 [ /3°. X d d 00. d. scaled the of be to form Then carry out. First, assume there is no (17)

It is obvious that combining these two theorems gives Theorem 2. Theorem 2 gives the relative error for performing one operation. Comparing the rounding error of x² − y² and (x + y)(x − y) requires knowing the relative error of multiple operations. The relative error of x ⊖ y is δ₁ = [(x ⊖ y) − (x − y)]/(x − y), which satisfies |δ₁| ≤ 2ε. Or to write it another way,

    x ⊖ y = (x − y)(1 + δ₁),   |δ₁| ≤ 2ε.   (19)

Similarly,

    x ⊕ y = (x + y)(1 + δ₂),   |δ₂| ≤ 2ε.   (20)

Assuming that multiplication is performed by computing the exact product and then rounding, the relative error is at most 1/2 ulp, so

    u ⊗ v = uv(1 + δ₃),   |δ₃| ≤ ε,   (21)

for any floating-point numbers u and v. Putting these three equations together (letting u = x ⊖ y and v = x ⊕ y) gives

    (x ⊖ y) ⊗ (x ⊕ y) = (x − y)(1 + δ₁)(x + y)(1 + δ₂)(1 + δ₃).   (22)

So the relative error incurred when computing (x − y)(x + y) is

    [(x ⊖ y) ⊗ (x ⊕ y) − (x² − y²)] / (x² − y²) = (1 + δ₁)(1 + δ₂)(1 + δ₃) − 1.   (23)

This relative error is equal to δ₁ + δ₂ + δ₃ + δ₁δ₂ + δ₁δ₃ + δ₂δ₃ + δ₁δ₂δ₃, which is bounded by 5ε + 8ε². In other words, the maximum relative error is about five rounding errors (since ε is a small number, ε² is almost negligible).

A similar analysis of (x ⊗ x) ⊖ (y ⊗ y) cannot result in a small value for the relative error, because when two nearby values of x and y are plugged into x² − y², the relative error will usually be quite large. Another way to see this is to try to duplicate the analysis that worked on (x ⊖ y) ⊗ (x ⊕ y), yielding

    (x ⊗ x) ⊖ (y ⊗ y) = [x²(1 + δ₁) − y²(1 + δ₂)](1 + δ₃)
                       = ((x² − y²)(1 + δ₁) + (δ₁ − δ₂)y²)(1 + δ₃).

When x and y are nearby, the error term (δ₁ − δ₂)y² can be as large as the result x² − y². These computations formally justify our claim that (x − y)(x + y) is more accurate than x² − y².
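The difference is easy to observe numerically. The following small C program is an illustrative addition (the inputs are chosen arbitrarily and are not from the text); it evaluates both forms in single precision and compares them with a higher-precision reference:

    #include <stdio.h>

    int main(void)
    {
        float x = 1.234567f, y = 1.234500f;          /* nearby values        */
        float prod = (x - y) * (x + y);              /* form analyzed in (22) */
        float sq   = x * x - y * y;                  /* form analyzed above   */
        double ref = ((double)x - y) * ((double)x + y);  /* reference value   */
        printf("(x-y)(x+y) = %.9g   rel err %.2e\n", prod, (prod - ref) / ref);
        printf("x*x - y*y  = %.9g   rel err %.2e\n", sq,   (sq   - ref) / ref);
        return 0;
    }

For inputs this close together, the second form typically shows a noticeably larger relative error, in line with the analysis above; the exact figures depend on the inputs and the machine.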

We next turn to an analysis of the formula for the area of a triangle. To estimate the maximum error that can occur when computing with formula (7), the following fact will be needed.

Theorem 11

If subtraction is performed with a guard digit, and y/2 ≤ x ≤ 2y, then x − y is computed exactly.

Proof. Note that if x and y have the same exponent, then certainly x ⊖ y is exact. Otherwise, from the condition of the theorem, the exponents can differ by at most 1. Scale and interchange x and y if necessary so that 0 ≤ y ≤ x and x is represented as x₀.x₁⋯x_{p−1} and y as 0.y₁⋯y_p. Then the algorithm for computing x ⊖ y will compute x − y exactly and round to a floating-point number. If the difference is of the form 0.d₁⋯d_p, the difference will already be p digits long, and no rounding is necessary. Since x ≤ 2y, x − y ≤ y, and since y is of the form 0.d₁⋯d_p, so is x − y. ∎

When β > 2, the hypothesis of Theorem 11 cannot be replaced by y/β ≤ x ≤ βy; the stronger condition y/2 ≤ x ≤ 2y is still necessary.

The analysis of the error in (x − y)(x + y) in the previous section used the fact that the relative error in the basic operations of addition and subtraction is small [namely, eqs. (19) and (20)]. This is the most common kind of error analysis. Analyzing formula (7), however, requires something more; namely, Theorem 11, as the following proof will show.

Theorem 12

If subtraction uses a guard digit, and if a, b, and c are the sides of a triangle, then the relative error in computing (a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c)) is at most 16ε, provided ε < .005.

Proof. Let us examine the factors one by one. From Theorem 10, b ⊕ c = (b + c)(1 + δ₁), where δ₁ is the relative error and |δ₁| ≤ 2ε. Then the value of the first factor is (a ⊕ (b ⊕ c)) = (a + (b ⊕ c))(1 + δ₂) = (a + (b + c)(1 + δ₁))(1 + δ₂), and thus

    (a + (b + c))(1 − 2ε)² ≤ [a + (b + c)(1 − 2ε)](1 − 2ε)
                           ≤ a ⊕ (b ⊕ c)
                           ≤ [a + (b + c)(1 + 2ε)](1 + 2ε)
                           ≤ (a + (b + c))(1 + 2ε)².

This means there is an η₁ so that

    (a ⊕ (b ⊕ c)) = (a + (b + c))(1 + η₁)²,   |η₁| ≤ 2ε.   (24)

The next term involves the potentially catastrophic subtraction of c and a ⊖ b. Because a, b, and c are the sides of a triangle, a ≤ b + c, and combining this with the ordering c ≤ b ≤ a assumed by formula (7) gives a ≤ b + c ≤ 2b ≤ 2a. So a − b satisfies the conditions of Theorem 11. This means that a − b = a ⊖ b is exact, and hence c ⊖ (a − b) is a harmless subtraction that can be estimated from Theorem 9 to be

    (c ⊖ (a ⊖ b)) = (c − (a − b))(1 + η₂),   |η₂| ≤ 2ε.   (25)

The third term is the sum of two positive quantities, so

    (c ⊕ (a ⊖ b)) = (c + (a − b))(1 + η₃),   |η₃| ≤ 2ε.   (26)

Finally, the last term is

    (a ⊕ (b ⊖ c)) = (a + (b − c))(1 + η₄)²,   |η₄| ≤ 2ε,   (27)

using both Theorem 9 and Theorem 10. If multiplication is assumed to be exactly rounded, so that x ⊗ y = xy(1 + ζ) with |ζ| ≤ ε, then combining (24), (25), (26), and (27) gives

    (a ⊕ (b ⊕ c)) ⊗ (c ⊖ (a ⊖ b)) ⊗ (c ⊕ (a ⊖ b)) ⊗ (a ⊕ (b ⊖ c))
        = (a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c)) E,

where E = (1 + η₁)²(1 + η₂)(1 + η₃)(1 + η₄)²(1 + ζ₁)(1 + ζ₂)(1 + ζ₃). An upper bound for E is (1 + 2ε)⁶(1 + ε)³, which expands out to 1 + 15ε + O(ε²); accounting for the O(ε²) term when ε < .005 gives E ≤ (1 + 2ε)⁶(1 + ε)³ < 1 + 16ε, and similarly 1 − 16ε < (1 − 2ε)⁶(1 − ε)³ ≤ E. Combining these two bounds yields 1 − 16ε < E < 1 + 16ε. Thus the relative error is at most 16ε. ∎
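As a concrete illustration, here is a small C transcription of formula (7) under the ordering a ≥ b ≥ c that Theorem 12 relies on (the function name is an illustrative addition, not from the paper):

    #include <math.h>

    /* Area of a triangle with sides a >= b >= c, using formula (7).
       The parentheses must be kept exactly as written: the analysis in
       Theorem 12 depends on a - b being formed before it is combined
       with c, and so on. */
    double triangle_area(double a, double b, double c)
    {
        return sqrt((a + (b + c)) * (c - (a - b)) *
                    (c + (a - b)) * (a + (b - c))) / 4.0;
    }

A compiler that rearranges these parentheses (for example, under aggressive floating-point optimizations) would invalidate the error analysis, which is one of the compiler issues discussed in Section 3.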

Theorem 12 certainly shows that there is no catastrophic cancellation in formula (7). Therefore, although it is not necessary to show formula (7) is numerically stable, it is satisfying to have a bound for the error of the entire formula, which is what Theorem 3 of Section 1.4 gives.

Proof of Theorem 3. Let

    q = (a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c))

and

    Q = (a ⊕ (b ⊕ c)) ⊗ (c ⊖ (a ⊖ b)) ⊗ (c ⊕ (a ⊖ b)) ⊗ (a ⊕ (b ⊖ c)).

Then Theorem 12 shows that Q = q(1 + δ), with |δ| ≤ 16ε. It is easy to check that taking the square root (computed to within 1/2 ulp) roughly halves this relative error, and the final division by 4 contributes at most one further rounding error; accumulating these contributions stays within the 11ε bound stated in Theorem 3. ∎

Theorem 13

If μ(x) = ln(1 + x)/x, then for 0 ≤ x ≤ 3/4, 1/2 ≤ μ(x) ≤ 1 and the derivative satisfies |μ′(x)| ≤ 1/2.

Proof. Note that μ(x) = 1 − x/2 + x²/3 − ⋯ is an alternating series with decreasing terms, so for x ≤ 1, μ(x) ≥ 1 − x/2 ≥ 1/2. It is even easier to see that μ(x) ≤ 1, because the series for μ is alternating. The Taylor series of μ′(x) is also alternating, and if x ≤ 3/4 it has decreasing terms, so −1/2 ≤ μ′(x) ≤ −1/2 + 2x/3, or −1/2 ≤ μ′(x) ≤ 0; thus |μ′(x)| ≤ 1/2. ∎

Proof of Theorem 4. Since the Taylor series for ln,

    ln(1 + x) = x − x²/2 + x³/3 − ⋯,

is an alternating series, 0 < x − ln(1 + x) < x²/2. Therefore, the relative error incurred when approximating ln(1 + x) by x is bounded by x/2. If 1 ⊕ x = 1, then |x| < ε, so the relative error is bounded by ε/2.

When 1 ⊕ x ≠ 1, define x̂ via 1 ⊕ x = 1 + x̂. Then, since 0 ≤ x < 1, (1 ⊕ x) ⊖ 1 = x̂. If division and logarithms are computed to within 1/2 ulp, then the computed value of x·ln(1 ⊕ x)/((1 ⊕ x) ⊖ 1) is

    x · [ln(1 + x̂)/x̂] (1 + δ₁)(1 + δ₂) = x·μ(x̂)(1 + δ₁)(1 + δ₂),   |δ₁|, |δ₂| ≤ ε.

To estimate μ(x̂), use the mean value theorem, which says that

    μ(x̂) − μ(x) = (x̂ − x)μ′(ξ)

for some ξ between x and x̂. From the definition of x̂, it follows that |x̂ − x| ≤ ε, and combining this with Theorem 13 gives |μ(x̂) − μ(x)| ≤ ε/2, or |μ(x̂)/μ(x) − 1| ≤ ε/(2|μ(x)|) ≤ ε, which means that μ(x̂) = μ(x)(1 + δ₃) with |δ₃| ≤ ε. Finally, multiplying by x introduces a final δ₄, so the computed value of x·ln(1 + x)/((1 + x) − 1) is x·μ(x)(1 + δ₁)(1 + δ₂)(1 + δ₃)(1 + δ₄). It is easy to check that if ε < 0.1, then (1 + δ₁)(1 + δ₂)(1 + δ₃)(1 + δ₄) = 1 + δ with |δ| ≤ 5ε. Since x·μ(x) = ln(1 + x), this proves the theorem. ∎
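The formula analyzed in Theorem 4 translates directly into C. The sketch below is an illustrative addition (it assumes that the library log is accurate to within 1/2 ulp, as the theorem requires, and that the intermediate u is stored in the working precision rather than an extended register; a volatile qualifier may be needed on some systems to guarantee this):

    #include <math.h>

    /* ln(1 + x) via the formula of Theorem 4.  The rounded quantity
       u = 1 (+) x is formed once, so u - 1 recovers exactly the value
       called x-hat in the proof above. */
    double log1p_formula(double x)
    {
        double u = 1.0 + x;
        if (u == 1.0)
            return x;                   /* 1 (+) x == 1: ln(1+x) ~ x  */
        return x * log(u) / (u - 1.0);  /* x * ln(1+x)/((1+x) - 1)    */
    }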

An interesting example of error analysis using formulas (19), (20), and (21) occurs in the quadratic formula [−b ± √(b² − 4ac)]/2a. Section 1.4 explained how rewriting the equation will eliminate the potential cancellation caused by the ± operation. But there is another potential cancellation that can occur when computing d = b² − 4ac, and this one cannot be eliminated by a simple rearrangement of the formula. Roughly speaking, when b² ≈ 4ac, rounding error can contaminate up to half the digits in the roots computed with the quadratic formula. Here is an informal proof (another approach to estimating the error in the quadratic formula appears in Kahan [1972]).

Write (b ⊗ b) ⊖ (4a ⊗ c) = (b²(1 + δ₁) − 4ac(1 + δ₂))(1 + δ₃), where |δᵢ| ≤ ε. (In this informal proof, assume that β = 2, so that multiplication by 4 is exact and does not require a δᵢ.) Since ax² + bx + c = a(x − r₁)(x − r₂), we have ar₁r₂ = c. Since b² ≈ 4ac, the roots are nearly equal, r₁ ≈ r₂, so the second error term is 4acδ₂ ≈ 4a²r₁²δ₂. Thus the computed value of d is approximately d + 4a²r₁²δ₂ (the first error term is similar). The inequality √(d + E) ≤ √d + √E for nonnegative d and E shows that √(d + 4a²r₁²δ₂) = √d + E′, where |E′| ≤ √(4a²r₁²δ₂) = 2|ar₁|√δ₂, so the absolute error in √d/(2a) is about r₁√δ₂. Since δ₂ ≈ β^{−p}, √δ₂ ≈ β^{−p/2}, and thus the absolute error r₁√δ₂ destroys the bottom half of the bits of the roots r₁ ≈ r₂. In other words, since the calculation of the roots involves computing with √d, and this expression does not have meaningful bits in the position corresponding to the lower-order half of rᵢ, the lower-order bits of rᵢ cannot be meaningful. ∎
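For concreteness, here is a hedged C sketch of the rearrangement described in Section 1.4: it removes the cancellation caused by the ± but, as just argued, cannot remove the cancellation inside b² − 4ac. The function name is an illustrative addition, real roots (d ≥ 0) and a ≠ 0 are assumed, and the copysign call simply folds the two sign cases of the rewritten formulas into one expression:

    #include <math.h>

    /* Roots of a*x^2 + b*x + c, assuming a != 0 and b*b - 4*a*c >= 0.
       The larger-magnitude root is formed without cancellation; the other
       comes from the relation r1*r2 = c/a.  The cancellation inside d
       itself remains when b*b is close to 4*a*c. */
    void quadratic_roots(double a, double b, double c, double *r1, double *r2)
    {
        double d = b * b - 4.0 * a * c;   /* may lose up to half the digits */
        double q = -0.5 * (b + copysign(sqrt(d), b));
        *r1 = q / a;
        *r2 = c / q;                      /* assumes q != 0 */
    }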

Finally, we turn to the proof of Theorem 6. It is based on the following fact, which is proven in the Appendix.

Theorem 14

Let 0 < k < p, set m = βᵏ + 1, and assume that floating-point operations are exactly rounded. Then (m ⊗ x) ⊖ (m ⊗ x ⊖ x) is exactly equal to x rounded to p − k significant digits. More precisely, x is rounded by taking the significand of x, imagining a radix point just left of the k least significant digits, and rounding to an integer.

Proof of Theorem 6. By Theorem 14, x_h is x rounded to p − k = ⌊p/2⌋ places. If there is no carry out, then certainly x_h can be represented with ⌊p/2⌋ significant digits. Suppose there is a carry out. Rounding adds 1 to the digit x_{p−k−1}, and the only way there can be a carry out is if x_{p−k−1} = β − 1; but then the low-order digit of x_h is 1 + x_{p−k−1} = 0, and so again x_h is representable in ⌊p/2⌋ digits.

To deal with x_l, scale x to be an integer satisfying β^{p−1} ≤ x ≤ βᵖ − 1. Let x = x̄_h + x̄_l, where x̄_h is the p − k high-order digits of x and x̄_l is the k low-order digits. There are three cases to consider. If x̄_l < (β/2)β^{k−1}, then rounding x to p − k places is the same as chopping, so x_h = x̄_h and x_l = x̄_l. Since x̄_l has at most k digits, if p is even, then x̄_l has at most k = ⌈p/2⌉ = ⌊p/2⌋ digits. Otherwise, β = 2 and x̄_l < (β/2)β^{k−1} = 2^{k−1} is representable with k − 1 = ⌊p/2⌋ significant bits. The second case is when x̄_l > (β/2)β^{k−1}; then computing x_h involves rounding up, so x_h = x̄_h + βᵏ, and x_l = x − x_h = x − x̄_h − βᵏ = x̄_l − βᵏ. Once again, |x_l| < (β/2)β^{k−1}, so as before it is representable with ⌊p/2⌋ digits. Finally, if x̄_l = (β/2)β^{k−1}, then x_h is x̄_h or x̄_h + βᵏ, depending on whether there is a round up. Therefore, x_l is either (β/2)β^{k−1} or (β/2)β^{k−1} − βᵏ = −βᵏ/2, both of which are represented with 1 digit. ∎

Theorem 6 gives a way to express the product of two floating-point numbers exactly as a sum. There is a companion formula for expressing a sum exactly. If |x| ≥ |y|, then

    x + y = (x ⊕ y) + ((x ⊖ (x ⊕ y)) ⊕ y)

[Dekker 1971; Knuth 1981, Theorem C in Section 4.2.2]. However, when using exactly rounded operations, this formula is true only for β = 2, and not for β = 10, as the example x = .99998, y = .99997 shows.

4.2 Binary-to-Decimal Conversion

Since single precision has p = 24 and 2²⁴ < 10⁸, we might expect that converting a binary number to eight decimal digits would be sufficient to recover the original binary number. This is not the case, however.

Theorem 15

When a binary IEEE single-precision number is converted to the closest eight-digit decimal number, it is not always possible to recover the binary number uniquely from the decimal one. If nine decimal digits are used, however, then converting the decimal number to the closest binary number will recover the original floating-point number.

Proof. Binary single-precision numbers lying in the half-open interval [10³, 2¹⁰) = [1000, 1024) have 10 bits to the left of the binary point and 14 bits to the right of the binary point. Thus, there are (2¹⁰ − 10³)2¹⁴ = 393,216 different binary numbers in that interval. If decimal numbers are represented with eight digits, then there are (2¹⁰ − 10³)10⁴ = 240,000 decimal numbers in the same interval. There is no way that 240,000 decimal numbers could represent 393,216 different binary numbers, so eight decimal digits are not enough to represent each single-precision binary number uniquely.

To show that nine digits are sufficient, it is enough to show that the spacing between binary numbers is always greater than the spacing between decimal numbers. This will ensure that for each decimal number N, the interval [N − 1/2 ulp, N + 1/2 ulp] contains at most one binary number. Thus, each binary number rounds to a unique decimal number, which in turn rounds to a unique binary number.

To show that the spacing between binary numbers is always greater than the spacing between decimal numbers, consider an interval [10ⁿ, 10ⁿ⁺¹]. On this interval, the spacing between consecutive decimal numbers is 10^{(n+1)−9}. On [10ⁿ, 2ᵐ], where m is the smallest integer so that 10ⁿ < 2ᵐ, the spacing of binary numbers is 2^{m−24}, and the spacing gets larger further on in the interval. Thus, it is enough to check that 10^{(n+1)−9} < 2^{m−24}. But in fact, since 10ⁿ < 2ᵐ,

    10^{(n+1)−9} = 10ⁿ · 10⁻⁸ < 2ᵐ · 10⁻⁸ < 2ᵐ · 2⁻²⁴. ∎

The same argument applied to double precision shows that 17 decimal digits are required to recover a double-precision number.
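Theorem 15 is easy to exercise directly. The following C sketch is an illustrative addition (it assumes an IEEE single-precision float and that the C library's binary-decimal conversions are correctly rounded, as IEEE 754 recommends): it prints a float with nine significant decimal digits and checks that reading the string back recovers the identical bit pattern.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Round-trip a float through a 9-significant-digit decimal string.
       Theorem 15 says 9 digits always suffice; 8 do not. */
    int roundtrips(float x)
    {
        char buf[64];
        float y;
        snprintf(buf, sizeof buf, "%.9g", x);   /* 9 significant digits   */
        y = strtof(buf, NULL);
        return memcmp(&x, &y, sizeof x) == 0;   /* compare bit patterns   */
    }

    int main(void)
    {
        printf("%d\n", roundtrips(1000.015625f));   /* arbitrary test value */
        return 0;
    }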

Binary-decimal conversion also provides another example of the use of flags. Recall from Section 2.1.2 that to recover a binary number from its decimal expansion, the decimal-to-binary conversion must be computed exactly. That conversion is performed by multiplying the quantities N and 10^|P| (which are both exact provided |P| ≤ 13) in single-extended precision and then rounding this to single precision (or dividing if P < 0; both cases are similar). Of course, the computation of N · 10^|P| cannot be exact; it is the combined operation round(N · 10^|P|) that must be exact, where the rounding is from single-extended to single precision. To see why it might fail to be exact, take the simple case of β = 10, with p = 2 for single precision and p = 3 for single-extended. If the product is 12.51, this would be rounded to 12.5 as part of the single-extended multiply operation. Rounding 12.5 to single precision would then give 12. But that answer is not correct, because rounding the exact product to single precision should give 13. The error is due to double rounding.

By using the IEEE flags, double rounding can be avoided as follows. Save the current value of the inexact flag, and then reset it. Set the rounding mode to round toward zero. Then perform the multiplication N · 10^|P|. Store the new value of the inexact flag in ixflag, and restore the rounding mode and inexact flag. If ixflag is 0, then N · 10^|P| is exact, so rounding it to single precision will be correct down to the last bit. If ixflag is 1, then some digits were truncated, since round toward zero always truncates. The significand of the product will look like 1.b₁⋯b₂₂b₂₃⋯b₃₁. A double-rounding error may occur if b₂₃⋯b₃₁ = 10⋯0. A simple way to account for both cases is to perform a logical OR of ixflag with b₃₁. Then round(N · 10^|P|) will be computed correctly in all cases.

4.3 Errors in Summation

Section 3.2.3 mentioned the problem of accurately computing very long sums. The simplest approach to improving accuracy is to double the precision. To get a rough estimate of how much doubling the precision improves the accuracy of a sum, let s₁ = x₁, s₂ = s₁ ⊕ x₂, …, sᵢ = sᵢ₋₁ ⊕ xᵢ. Then sᵢ = (1 + δᵢ)(sᵢ₋₁ + xᵢ), where |δᵢ| ≤ ε, and ignoring second-order terms in δᵢ gives

    s_n = Σ_{j=1}^{n} x_j (1 + Σ_{k=j}^{n} δ_k) = Σ_{j=1}^{n} x_j + Σ_{j=1}^{n} x_j (Σ_{k=j}^{n} δ_k),   (31)

where δ₁ = 0. The first equality of (31) shows that the computed value of Σ x_j is the same as if an exact summation had been performed on perturbed values of x_j: the first term x₁ is perturbed by nε, the last term x_n by only ε. The second equality in (31) shows that the error term is bounded by nε Σ |x_j|. Doubling the precision has the effect of squaring ε. If the sum is being done in an IEEE double-precision format, 1/ε ≈ 10¹⁶, so that nε ≪ 1 for any reasonable value of n. Thus, doubling the precision takes the maximum perturbation of nε and changes it to nε². Thus the 2ε error bound for the Kahan summation formula (Theorem 8) is not as good as using double precision, even though it is much better than single precision.

For an intuitive explanation of why the Kahan summation formula works, consider the following diagram of the procedure:

[Diagram omitted: one step of the procedure, showing the running sum S and the corrected summand Y split into high- and low-order parts Yh and Yl; forming T = S + Y loses Yl, and (T − S) − Y recovers it as the next correction C.]

Each time a summand is added, there will be a correction factor C, which will be applied on the next loop. So first subtract the correction C computed in the previous loop from X_j, giving the corrected summand Y. Then add this summand to the running sum S. The low-order bits of Y (namely Yl) are lost in the sum. Next, compute the high-order bits of Y by computing T − S. When Y is subtracted from this, the low-order bits of Y will be recovered. These are the bits that were lost in the first sum in the diagram, and they become the correction factor for the next loop.

The examples in this section show that it is possible to reason rigorously about floating point: algorithms involving cancellation can be proven to have small relative errors if the underlying hardware has a guard digit, and there is an efficient algorithm for binary-decimal conversion that can be proven to be invertible, provided extended precision is supported. The task of constructing reliable floating-point software is made easier when the underlying computer system is supportive of floating point. In addition to the two examples just mentioned (guard digits and extended precision), Section 3 of this paper has examples ranging from instruction set design to compiler optimization illustrating how to better support floating point. The increasing acceptance of the IEEE floating-point standard means that codes that use features of the standard are becoming more portable. Section 2 gave numerous examples illustrating how the features of the IEEE standard can be used in writing practical floating-point codes.
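The summation procedure just described is short enough to state as running code. A minimal C sketch (an illustrative addition; Theorem 8 in the Appendix below gives the formal algorithm and its error bound, and the code must not be compiled with optimizations that re-associate floating-point expressions, or the correction term would be simplified away):

    /* Kahan's compensated summation: the running correction c holds the
       low-order bits lost when each corrected summand y is added to s.
       Assumes n >= 1. */
    double kahan_sum(const double *x, int n)
    {
        double s = x[0];
        double c = 0.0;
        for (int j = 1; j < n; j++) {
            double y = x[j] - c;     /* corrected summand                 */
            double t = s + y;        /* low-order bits of y are lost here */
            c = (t - s) - y;         /* recover (minus) the lost bits     */
            s = t;
        }
        return s;
    }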

APPENDIX

This Appendix contains two of the technical proofs omitted from the text.

Theorem 14

Let 0 < k < p, set m = βᵏ + 1, and assume that floating-point operations are exactly rounded. Then (m ⊗ x) ⊖ (m ⊗ x ⊖ x) is exactly equal to x rounded to p − k significant digits. More precisely, x is rounded by taking the significand of x, imagining a radix point just left of the k least significant digits, and rounding to an integer.

Proof. The proof breaks up into two cases, depending on whether or not the computation of mx = βᵏx + x has a carry out.

Assume there is no carry out. It is harmless to scale x so that it is an integer. Then the computation of mx = x + βᵏx looks like this:

        aa⋯aabb⋯bb
      + aa⋯aabb⋯bb
        zz⋯zzbb⋯bb

where x has been partitioned into two parts: the low-order k digits are marked b and the high-order p − k digits are marked a. To compute m ⊗ x from mx involves rounding off the low-order k digits (the ones marked with b), so

    m ⊗ x = mx − x mod(βᵏ) + rβᵏ.   (32)

The value of r is 1 if .bb⋯b is greater than 1/2 and 0 otherwise. More precisely,

    r = 1 if a.bb⋯b rounds to a + 1,   r = 0 otherwise.   (33)

Next compute

    m ⊗ x − x = mx − x mod(βᵏ) + rβᵏ − x = βᵏ(x + r) − x mod(βᵏ).

The picture below shows the computation of m ⊗ x − x rounded, that is, (m ⊗ x) ⊖ x. The top line is βᵏ(x + r), where B is the digit that results from adding r to the lowest-order digit b:

        aa⋯aabb⋯bB00⋯00
      −          bb⋯bb
        zz⋯zz Z00⋯00

If .bb⋯b < 1/2, then r = 0; subtracting causes a borrow from the digit marked B, but the difference is rounded up, so the net effect is that the rounded difference equals the top line, which is βᵏx. If .bb⋯b > 1/2, then r = 1, and 1 is subtracted from B because of the borrow, so again the result is βᵏx. Finally, consider the case .bb⋯b = 1/2. If r = 0, then B is even, Z is odd, and the difference is rounded up, giving βᵏx; similarly, if r = 1, then B is odd, Z is even, and the difference is rounded down, so again the difference is βᵏx. To summarize,

    (m ⊗ x) ⊖ x = βᵏx.   (34)

Combining equations (32) and (34) gives (m ⊗ x) − (m ⊗ x ⊖ x) = x − x mod(βᵏ) + rβᵏ. The rule for computing r, eq. (33), is the same as the rule for rounding a.bb⋯b to p − k places. Thus, computing mx − (mx − x) in floating-point arithmetic precision is exactly equal to rounding x to p − k places, in the case when x + βᵏx does not carry out.

When x + βᵏx does carry out, m ⊗ x = mx − x mod(βᵏ) + wβᵏ for some small integer w whose exact value is unimportant, and m ⊗ x − x = βᵏx − x mod(βᵏ) + wβᵏ. Rounding gives (m ⊗ x) ⊖ x = βᵏx + wβᵏ − rβᵏ, where now r = 1 if .bb⋯b > 1/2, or if .bb⋯b = 1/2 and the digit it rounds into is odd. Finally,

    (m ⊗ x) − (m ⊗ x ⊖ x) = mx − x mod(βᵏ) + wβᵏ − (βᵏx + wβᵏ − rβᵏ) = x − x mod(βᵏ) + rβᵏ.

Once again, r = 1 exactly when rounding a.bb⋯b to p − k places involves rounding up. Thus, Theorem 14 is proven in all cases. ∎
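With β = 2, p = 53, and k = 27 (so m = 2²⁷ + 1), Theorem 14 is the basis of the splitting used to implement Theorem 6 for IEEE double precision. The sketch below is an illustrative addition, not taken from the paper; it assumes round-to-nearest IEEE double arithmetic, that m*x does not overflow, and that no extended-precision registers or fused multiply-adds change the individual roundings:

    /* Split a double x into xh + xl, where xh is x rounded to its
       high-order p - k = 26 bits (Theorem 14 with m = 2^27 + 1) and
       xl = x - xh holds the remaining low-order part exactly. */
    void split(double x, double *xh, double *xl)
    {
        const double m = 134217729.0;   /* 2^27 + 1 */
        double t = m * x;               /* m (x) x */
        *xh = t - (t - x);              /* (m (x) x) (-) (m (x) x (-) x) */
        *xl = x - *xh;
    }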

Theorem 8 (Kahan Summation Formula)

Suppose Σ_{j=1}^{N} x_j is computed using the following algorithm:

    S = X[1];
    C = 0;
    for j = 2 to N {
        Y = X[j] - C;
        T = S + Y;
        C = (T - S) - Y;
        S = T;
    }

Then the computed sum S is equal to Σ x_j(1 + δ_j) + O(Nε²) Σ |x_j|, where |δ_j| ≤ 2ε.

Proof. First recall how the error estimate went for the simple formula Σ x_i. Introduce s₁ = x₁, s_i = (1 + δ_i)(s_{i−1} + x_i). Then the computed sum is s_n, which is a sum of terms, each of which is an x_i multiplied by an expression involving δ_j's. The exact coefficient of x₁ is (1 + δ₂)(1 + δ₃)⋯(1 + δ_n), so by renumbering, the coefficient of x₂ must be (1 + δ₃)(1 + δ₄)⋯(1 + δ_n), and so on. The proof of Theorem 8 runs along exactly the same lines, only the coefficient of x₁ is more complicated. In detail, s₀ = c₀ = 0 and

    y_k = x_k ⊖ c_{k−1} = (x_k − c_{k−1})(1 + η_k),
    s_k = s_{k−1} ⊕ y_k = (s_{k−1} + y_k)(1 + σ_k),
    c_k = (s_k ⊖ s_{k−1}) ⊖ y_k = [(s_k − s_{k−1})(1 + γ_k) − y_k](1 + δ_k),

where all the Greek letters are bounded by ε. Although the coefficient of x₁ in s_k is the ultimate expression of interest, it turns out to be easier to compute the coefficients of x₁ in s_k − c_k and in c_k. When k = 1,

    c₁ = (s₁(1 + γ₁) − y₁)(1 + δ₁) = y₁((1 + σ₁)(1 + γ₁) − 1)(1 + δ₁)
       = x₁(σ₁ + γ₁ + σ₁γ₁)(1 + δ₁)(1 + η₁),
    s₁ − c₁ = x₁(1 + η₁)[1 + σ₁ − (σ₁ + γ₁ + σ₁γ₁)(1 + δ₁)].

Calling the coefficients of x₁ in these expressions C_k and S_k, respectively, then

    C₁ = 2ε + O(ε²),
    S₁ = 1 + η₁ − γ₁ + 4ε² + O(ε³).

To get the general formulas for S_k and C_k, expand the definitions of s_k and c_k, ignoring all terms involving x_i with i > 1; since only the coefficient of x₁ is wanted, this amounts to substituting y_k = −c_{k−1}(1 + η_k) into the expressions above. Because S_k and C_k are only being computed up to order ε², the resulting formulas can be simplified to

    C_k = (σ_k + O(ε²)) S_{k−1} + (−γ_k + O(ε²)) C_{k−1},
    S_k = (1 + 2ε² + O(ε³)) S_{k−1} + (2ε + O(ε²)) C_{k−1}.

Using these formulas gives

    C₂ = σ₂ + O(ε²),
    S₂ = 1 + η₁ − γ₁ + 10ε² + O(ε³),

and, in general, it is easy to check by induction that

    C_k = σ_k + O(ε²),
    S_k = 1 + η₁ − γ₁ + (4k + 2)ε² + O(ε³).

Finally, what is wanted is the coefficient of x₁ in s_k. To get this value, let x_{n+1} = 0, let all the Greek letters with subscripts of n + 1 equal 0, and compute s_{n+1}. Then s_{n+1} = s_n − c_n, and the coefficient of x₁ in s_n is less than the coefficient in s_{n+1}, which is S_{n+1} = 1 + η₁ − γ₁ + (4n + 2)ε² = 1 + 2ε + O(nε²). ∎

ACKNOWLEDGMENTS

This article was inspired by a course given by W. Kahan at Sun Microsystems from May through July of 1988, ably organized by David Hough of Sun. Its aim is to enable others to learn about the interaction of floating point and computer systems without having to get up in time to attend 8:00 a.m. lectures. Thanks are due to Kahan and many of my colleagues at Xerox PARC (especially John Gilbert) for reading drafts of this paper and providing many useful comments. Reviews from Paul Hilfinger and an anonymous referee also helped improve the presentation.

REFERENCES

Aho, A. V., Sethi, R., and Ullman, J. D. 1986. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass.

ANSI 1978. American National Standard Programming Language FORTRAN, ANSI Standard X3.9-1978. American National Standards Institute, New York.

Barnett, D. 1987. A portable floating-point environment. Unpublished manuscript.

Brown, W. S. 1981. A simple but realistic model of floating-point computation. ACM Trans. Math. Softw. 7, 4, 445-480.

Cardelli, L., Donahue, J., Glassman, L., Jordan, M., Kalsow, B., and Nelson, G. 1989. Modula-3 report (revised). Digital Systems Research Center Report #52, Palo Alto, Calif.

Cody, W. J., et al. 1984. A proposed radix- and word-length-independent standard for floating-point arithmetic. IEEE Micro 4, 4, 86-100.

Cody, W. J. 1988. Floating-point standards—Theory and practice. In Reliability in Computing: The Role of Interval Methods in Scientific Computing, Ramon E. Moore, Ed. Academic Press, Boston, Mass., pp. 99-107.

Coonen, J. 1984. Contributions to a proposed standard for binary floating-point arithmetic. PhD dissertation, Univ. of California, Berkeley.

Dekker, T. J. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3, 224-242.

Demmel, J. 1984. Underflow and the reliability of numerical software. SIAM J. Sci. Stat. Comput. 5, 4, 887-919.

Farnum, C. 1988. Compiler support for floating-point computation. Softw. Pract. Exper. 18, 7, 701-709.

Forsythe, G. E., and Moler, C. B. 1967. Computer Solution of Linear Algebraic Systems. Prentice-Hall, Englewood Cliffs, N.J.

Goldberg, I. B. 1967. 27 bits are not enough for 8-digit accuracy. Commun. ACM 10, 2, 105-106.

Goldberg, D. 1990. Computer arithmetic. In Computer Architecture: A Quantitative Approach, David Patterson and John L. Hennessy, Eds. Morgan Kaufmann, Los Altos, Calif., Appendix A.

Golub, G. H., and Van Loan, C. F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD.

Hewlett Packard 1982. HP-15C Advanced Functions Handbook.
IEEE 1987. IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE. Reprinted in SIGPLAN 22, 2, 9-25.

Kahan, W. 1972. A survey of error analysis. In Information Processing 71 (Ljubljana, Yugoslavia), vol. 2. North-Holland, Amsterdam, pp. 1214-1239.

Kahan, W. 1986. Calculating Area and Angle of a Needle-like Triangle. Unpublished manuscript.

Kahan, W. 1987. Branch cuts for complex elementary functions. In The State of the Art in Numerical Analysis, M. J. D. Powell and A. Iserles, Eds. Oxford University Press, N.Y., Chap. 7.

Kahan, W. 1988. Unpublished lectures given at Sun Microsystems, Mountain View, Calif.

Kahan, W., and Coonen, J. T. 1982. The near orthogonality of syntax, semantics, and diagnostics in numerical programming environments. In The Relationship between Numerical Computation and Programming Languages, J. K. Reid, Ed. North-Holland, Amsterdam, pp. 103-115.

Kahan, W., and LeBlanc, E. 1985. Anomalies in the IBM acrith package. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic (Urbana, Ill.), pp. 322-331.

Kernighan, B. W., and Ritchie, D. M. 1978. The C Programming Language. Prentice-Hall, Englewood Cliffs, N.J.

Kirchner, R., and Kulisch, U. 1987. Arithmetic for vector processors. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic (Como, Italy), pp. 256-269.

Knuth, D. E. 1981. The Art of Computer Programming, vol. II, 2nd ed. Addison-Wesley, Reading, Mass.

Kulisch, U. W., and Miranker, W. L. 1986. The arithmetic of the digital computer: A new approach. SIAM Rev. 28, 1, 1-36.

Matula, D. W., and Kornerup, P. 1985. Finite precision rational arithmetic: Slash number systems. IEEE Trans. Comput. C-34, 1, 3-18.

Reiser, J. F., and Knuth, D. E. 1975. Evading the drift in floating-point addition. Inf. Process. Lett. 3, 3, 84-87.

Sterbenz, P. H. 1974. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, N.J.

Swartzlander, E. E., and Alexopoulos, G. 1975. The sign/logarithm number system. IEEE Trans. Comput. C-24, 12, 1238-1242.

Walther, J. S. 1971. A unified algorithm for elementary functions. In Proceedings of the AFIPS Spring Joint Computer Conference, pp. 379-385.

Received December 1988; final revision accepted March 1990.
