roundoff error.
Machine accuracy is smallest number ε such
that (1.0 + ε != ε). In Java, it is XYZ
with double and XYZ with float.
Changing the error tolerance ε to a small positive
value helps, but does not fix the problem (see exercise XYZ).
Every time you perform an arithmetic operation, you introduce
an additional error of at least ε.
Kernighan and Plauger:
"Floating point numbers are like piles of sand; every
time you move one you lose a little sand and
pick up a little dirt."
If the errors occur at random, we might expect a cumulative error of
sqrt(N) εm. However,
if we are not vigilant in designing our numerical algorithms,
these errors can propagate in very unfavorable and non-intuitive ways,
leading to cumulative errors of N εm or worse.
We must be satisfied with approximating the square root.
A reliable way to do the computation is to choose some
error tolerance ε, say 1E-15, and try to find a
value t such that
|t - c/t| <ε t. we use relative error
instead of absolute error; otherwise the program
may go into an infinite loop (see exercise XYZ).
double EPSILON = 1E-15;
double t = c;
while (Math.abs(t*t - c) > c*EPSILON)
t = (c/t + t) / 2.0;
|
Harmonic sum.
Possibly motivate by trying to estimate Euler's constant γ =
limit as n approaches infinity of γn = Hn - ln n.
The limit exists and is approximately 0.5772156649.
Surprisingly, it is not even known whether γ is irrational.
γ100 = 0.582207332 is accurate to only one decimal place;
γ1000000 = 0.577216164 is accurate to only five decimal
places, assuming no roundoff error.
Better formula: γ ≈ γn - 1/(2n) + 1/(12n2).
Accurate to 12 decimal places for n = 1 million.
Many numerical computations (e.g., integration or solutions to
differential equations) involve summing up alot of small terms.
The errors can accumulate.
Program HarmonicSum.java
computes 1/1 + 1/2 + ... + 1/N using single precision
and double precision.
With single precision, when N = 10,000, the sum is accurate to 5 decimal
digits, when N = 1,000,000 it is accurate to only 3 decimal digits,
when N = 10,000,000 it is accurate to only 2 decimal digits.
In fact, once N reaches 10 million, the sum never increases.
Although the Harmonic sum diverges to infinity, in floating point it
converges to a finite number!
This dispels the popular misconception that if you are solving
a problem that requires only 4 or 5 decimal digits of
precision, then you are safe using a type that stores 7.
Indeed, accumulation of roundoff error can lead to serious problems.
Financial computing.
We illustrate some examples of roundoff error that can ruin
a financial calculation. Financial calculations involving dollars
and cents involve base 10 arithmetic. The following examples
demonstrate some of the perils of using a binary floating
point system like IEEE 754.
Sales tax.
Calculate the 5% sales tax of a 70 cent phone call. Using IEEE 754,
1.05 * 0.70 = 0.73499999999999998667732370449812151491641998291015625.
This results gets rounded down to 0.73 even though the exact
answer is 0.735, which the telephone company is permitted to round up
to 0.74 (using Banker's rounding).
In contrast, a $1.30 phone call
1.05 * 1.30 = 1.3650000000000002131628207280300557613372802734375
might get rounded up to 1.37 even though the result should be
rounded down (using Banker's rounding) to 1.36.
This difference in pennies might not seem significant, and you might
hope that the effects cancel each other out in the long run.
However, taken over hundreds of millions of transactions,
this might result in millions
of dollars. Reference: Superman III where Richard Prior
discovers a flaw in the computer system and embezzles
fractions of a penny from every business transaction.
For this reason, some programmers believe that you should always
use integer types to store financial values instead of
floating point types.
The next example will show the perils of storing financial
values using type int.
Compound interest.
This example introduces you to the dangers of roundoff error.
Suppose you invest $1000.00 at 5% interest, compounded daily.
How much money will you end up with after 1 year?
If the bank computes the value correctly, you should end up
with $1051.27 since the exact formula is
which leads to 1051.2674964674473....
Suppose, instead, that the bank stores your balance as an integer (measured
in pennies). At the end of each day, the bank calculates
your balance and multiplies it by 1.05 and rounds the result to
the nearest penny. Then, you will end up with only $1051.10,
and have been cheated out of 17 cents. Suppose instead the bank
rounds down to the nearest penny at the end of each day.
Now, you will end up with only $1049.40 and you will have
been cheated out of $1.87. The error of not storing the fractions
of a penny accumulate and can eventually become significant,
and even fraudulent.
Program CompoundInterest.java.
Instead of using integer or floating point types, you should
use Java's BigDecimal library. This library has two
main advantages.
First, it can represent decimal numbers exactly. This prevents
the sales tax issues of using binary floating point.
Second, it can store numbers with an arbitrary
amount of precision.
This enables the programmer to control the degree to which
roundoff error affects the computation.
Other source of error.
€ƒIn addition to roundoff error inherent when using floating point
arithmetic, there are two other types of approximation errors that commonly
arise in scientific applications.
- Measurement error.
The data values used in the computation are not accurate. This arises from
both practical (inaccurate or imprecise measuring instruments) and theoretical
(Heisenberg uncertainty principle) considerations. We are primarily interested
in models and solution methods whose answer is not highly sensitive to the
initial data.
Discretization error.
Another source of inaccuracy results from discretizing continuous system,
e.g., approximating a transcendental function by truncating its Taylor expansion, approximating an
integral using a finite sum of rectangles, finite difference method for finding approximate
solutions to differential equations,
or estimating a continuous function over a lattice.
Discretization error would still be present even if using exact
arithmetic. It is often unavoidable, but we can reduce truncation error by using more refined
discretization. Of course, this comes at the price of using more resources, whether it be memory or
time.Discretization error often more important than roundoff error in practical applications.
Catastrophic cancellation.
Devastating loss of precision when small numbers are computed
from large numbers by addition or subtraction.
For example if x and y agree in all but the last few digits,
then if we compute z = x - y, then z may only have a few digits
of accuracy. If we subsequently use z in a calculation, then
the result may only have a few digits of accuracy.
As a first example, consider the following contrived code fragment
from Catastrophic.java.
double x1 = 10.000000000000004;
double x2 = 10.000000000000000;
double y1 = 10.00000000000004;
double y2 = 10.00000000000000;
double z = (y1 - y2) / (x1 - x2);
System.out.println(z);
|
It is easy to compute the exact answer by hand as
(4/10,000,000,000,000) / (4/100,000,000,000,000) = 10.
However, Java prints out result 11.5.
Even though all calculations are done using 15 digits of
precision, the result suddenly has only 1 digit of accuracy.
Now we consider a more subtle and pernicious consequence of catastrophic
cancellation.
Program Exponential.java
computes ex using the Taylor series
ex = 1 + x + x2/2! + x3/3! + x4/4! + ...
|
This series converges uniformly and absolutely for all values of x.
Nevertheless, for many negative values of x (e.g., -25),
the program obtains no correct digits, no matter how many
terms in the series are summed.
For example when x = -25, the series converges in floating point
to -7.129780403672078E-7 (a negative number!), but the true
answer is 1.3887943864964021E-11.
To see why, observe that term 24 in the sum is 2524/24! and
term 25 is -2525/25!. In principle, they should exactly
cancel each other out. In practice, they cancel each other out
catastrophically. The size of these terms (5.7 * 109)
is 20 orders of magnitude greater than the true answer, and any
error in the cancellation is magnified in the calculated answer.
Fortunately, in this case, the problem is easily rectified
(see Exercise XYZ).
Stability.
A mathematical problem is well-conditioned if it solution
changes by only a small amount when the input parameters changes by a small amount.
An algorithm is
numerically stable if the output of the algorithm changes by only
a small amount when the input data changes by a small amount.
Numerical stability captures how errors are propagated by the algorithm.
Numerical analysis is the art and science of finding numerically stable
algorithms to solve well-conditioned problems.
Accuracy depends on the conditioning of problem and the stability of the algorithm.
Inaccuracy can result from applying a stable algorithm to an ill-conditioned
problem or an unstable algorithm to a well-conditioned problem.
For a simple example, computing f(x) = exp(x) is a well conditioned problem
since f(x + ε) = ....
One algorithm for computing exp(x) is via its Taylor series.
Suppose we estimate f(x) = exp(x) using the first four
terms of its Taylor approximation: g(x) = 1 + x + x2/2 + x3/3!.
Then f(1) = 2.718282, g(1) = 2.666667.
However, it is unstable since if x <0 ..... a stable algorithm is to use the taylor series if x is nonnegative, but if x is negative, compute e-x using a Taylor series and take
the reciprocal.
Ill conditioning.
In these previous section, we say an example of a non-stable algorithm for computing
exp(x). We also discovered an improved algorithm for the problem that was stable.
Sometimes, we are not always so lucky. We say that a problem is
ill-conditioned if there is no stable algorithm for solving it.
Addition, multiplication, exponentiation, and division of positive numbers are all
well-conditioned problems. So is computing the roots of a quadratic equation. (See
Exercise XYZ.) Subtraction is ill-conditioned. So is finding the roots of a general
polynomial. The conditioning of the problem of solving Ax = b depends on the matrix A.
Population dynamics.
The Verhulst equation is a simplified model of population dynamics.
xn = (R + 1)xn-1 - R (xn-1)2
|
Program Verhulst.java reads in a command
line parameter R and iterates the Verhulst equation for 100 iterations,
starting with x0 = 0.5.
It orders the computation in four different, but mathematically equivalent,
ways. All lead to significantly different results when R = 3, none of which is
remotely correct (as can be verified using exact arithmetic in Maple).
(Reference)
(R+1)x-R(xx) (R+1)x-(Rx)x ((R+1)-(Rx))x x + R(x-xx) correct
0: 0.5000000000 0.5000000000 0.5000000000 0.5000000000 0.5000000000
10: 0.3846309658 0.3846309658 0.3846309658 0.3846309658 0.3846309658
20: 0.4188950250 0.4188950250 0.4188950250 0.4188950250 0.4188950250
30: 0.0463994725 0.0463994756 0.0463994787 0.0463994775 0.0463994768
40: 0.3201767255 0.3201840912 0.3201915468 0.3201885159 0.3201870617
50: 0.0675670955 0.0637469566 0.0599878799 0.0615028942 0.0622361944
60: 0.0011449576 0.2711150754 1.0005313342 1.2945078734 0.0049027225
70: 1.2967745569 1.3284619999 1.3296922488 0.1410079704 0.5530823827
80: 0.5530384607 0.8171627721 0.0219521770 0.0184394043 0.1196398067
90: 0.0948523432 0.1541841695 0.0483069438 1.2513933889 0.3109290854
100: 0.0000671271 0.1194574394 1.2564956763 1.0428230334 0.7428865400
|
When R > 2.57, this system exhibits chaotic behavior. The system
itself is ill-conditioned, so there is no way to restructure the computation
to iterate the system using floating point arithmetic.
Although we cannot expect our floating point algorithms to correctly
handle ill-conditioned problem, we can ask that they report back
an error rangle associated with the solution so that at least we are alerted
to potential problems. For example, when solving
linear systems of equations (see section 9.5), we can compute something
called a condition number: this quantity can be used to bound the
error of the resulting solution.
Ill-conditioning is not just a theoretical possibility. Astrophysicists have
determined that our Solar System is chaotic.
The trajectory of Pluto's orbit is chaotic, as in the motion of the
Jovian planets, Halley's comet, and asteroidal zone trajectories.
Upon a close encounter with Venus,
Mercury could be ejected from the Solar System in
less than 3.5 billion year!
Calculating special functions.
When learning calculus, we derive convergent Taylor series
formulas for calculating exponential and trigometric functions.
(e.g., exp(x), log(x), sin(x), cos(x), arctan(x), etc.).
However, we must
take great care when applying such formulas on digital computers.
Not all special functions are built in to Java's Math
library, so there are times when you must create your own.
Give example, e.g., error function.
Numerical analysts have derived accurate, precise, and
efficient algorithms for computing
classic functions (e.g., hyperbolic trigometric functions,
gamma function, beta function, error function,
Bessel functions, Jacobian elliptic functions, spherical harmonics)
that arise in scientific applications.
We strongly recommend using these proven recipes instead
of devising your own ad hoc routines.
Real-world numerical catastrophes.
The examples we've discussed above are rather simplistic. However,
these issues arise in real applications. When done incorrectly,
disaster can quickly strike.
Ariane 5 rocket.
Ariane 5 rocket exploded 40 seconds after being launched
by European Space Agency. Maiden voyage after a decade and
7 billion dollars of research and development.
Sensor reported
acceleration that so was large that it caused an overflow in
the part of the program responsible for recalibrating inertial
guidance. 64-bit floating point number was converted to
a 16-bit signed integer, but the number was larger than
32,767 and the conversion failed.
Unanticipated overflow was caught by a general
systems diagnostic and dumped debugging data into an area of memory
used for guiding the rocket's motors. Control was switched to
a backup computer, but this had the same data. This resulted
in a drastic attempt to correct the nonexistent problem, which
separated the motors from their mountings, leading to
the end of Ariane 5.
Patriot missile accident.
On February 25, 1991 an American Patriot missile failed to track
and destroy an Iraqi Scud missile. Instead it hit an Army barracks,
killing 26 people. The cause was later determined to be
an inaccurate calculate caused by measuring time in tenth of a second.
Couldn't represent 1/10 exactly since used 24 bit floating point.
Software to fix problem arrived in Dhahran on February 26.
Here is more
information.
Intel FDIV Bug
Error in Pentium hardwire floating point divide circuit.
Discovered by Intel in July 1994, rediscovered and
publicized by math professor in September 1994. Intel
recall in December 1994 cost $300 million. Another floating
point bug discovered in 1997.
Sinking of Sleipner oil rig.
Sleipner A $700 million platform for producing oil and gas sprang a leak
and sank in North Sea in August, 1991.
Error in inaccurate finite element approximation
underestimate shear stress by 47%
Reference.
Vancouver stock exchange.
Vancouver stock exchange index was undervalued by over 50% after
22 months of accumulated roundoff error.
The obvious algorithm is to add up all the stock prices after
Instead a "clever" analyst decided it would be more efficient to
recompute the index by adding the net change of a stock after
each trade. This computation was done using four decimal places
and truncating (not rounding) the result to three.
Reference
Lessons.
-
Use double instead of float for accuracy.
-
Use float only if you really need to conserve
memory, and are aware of the associated risks with accuracy.
Usually it doesn't make things faster, and occasionally
makes things slower.
-
Be careful of calculating the difference of two very similar values
and using the result in a subsequent calculation.
-
Be careful about adding two quantities of very different magnitudes.
-
Be careful about repeating a slightly inaccurate computation
many many times. For example, calculating the change in position
of planets over time.
-
Designing stable floating point algorithms is highly nontrivial.
Use libraries when available.
Q + A
Q. Is there any direct way to check for overflow on integer types?
A. No. The integer types do not indicate overflow in any way.
Integer divide and integer remainder throw exceptions when
the denominator is zero.
Q. What happens if I enter a number that is too large, e.g., 1E400?
A. Java returns the error message "floating point number too large."
Q. What about floating point types?
A. Operations that overflow evaluate to plus or minus infinity.
Operations that underflows result in plus or minus zero.
Operations that have no mathematically definite evaluate to
NaN (not a number), e.g., 0/0, sqrt(-3), acos(3.0), log(-3.0).
Q. How do I test whether my variable has the value NaN?
A. Use the method Double.isNan. Note that NaN is unordered so
the comparison operations <,>, and = involving one or
two NaNs always evaluates to false. Any != comparison involving
NaN evaluates to true, even (x != x), when x is NaN.
Q. What's the difference between -0.0 and 0.0?
A. Both are representations of the number zero. It is true that if
x = 0.0 and y = -0.0 that (x == y). However, 1/x yields
Infinity, whereas 1/y yields -Infinity. Program
NegativeZero.java illustrates
this. It gives a surprising counterexample to the misconception
that if (x == y) then (f(x) == f(y)).
log 0 = -infinity, log (-1) = NaN. log(ε) vs. log(-ε).
Q. I've heard that programmers should never
compare two real numbers for exact equality, but you
should always use a test like if |a-b| <ε or |a-b| < ε min(|a|, |b|). is this right?
A. It depends on the situation.
The relative error method is usually preferred, but it fails
if the numbers are very close to zero.
The absolute value method may have unintended consequences.
It is not clear what value of ε to use. If
a = +ε/4 and b = -ε/4, should we really consider
them equal. Transitivity is not true: if a and b are "equal"
and b and c are "equal", it may not be the case that a and
c are "equal."
Q. How are zero, infinity and NaN represented using IEEE 754?
A. By setting all the exponent bits to 1.
Positive infinity = 0x7ff0000000000000 (all exponent bits 1,
sign bit 0 and all mantissa bits 0),
negative infinity = 0xfff0000000000000 (all exponent bits 1,
sign bit 1 and all mantissa bits 0),
NaN = 0x7ff8000000000000 (all exponent bits 1, at least
one mantissa bit set).
Positive zero = all bits 0.
Negative zero = all bits 0, except sign bit which is 1.
Q. What is StrictMath?
A. Java's Math library guarantees its results to
be with 1 or to 2 ulps of the true answer. Java's
StrictMath guarantees that all results are
accurate to within 1/2 ulp of the true answer.
A classic tradeoff of speed vs. accuracy.
Q. What is strictfp modifier?
A. It is also possible to use the strictfp modifier
when declaring a class or method. This ensures that the
floating point results will be bit-by-bit accurate across
different JVMs. The IEEE standard allows processors to perform
intermediate computations using more precision if the result
would overflow. The strictfp requires that
every intermediate result be truncated to 64-bit double format.
This can be a major performance
hit since Intel Pentium processor registers operate using
IEEE 754's 80-bit double-extended format. When operating
Not many people use this mode.
Q. What does the compiler flag javac -ffast-math do?
A. It relaxes some of the rounding requirements of IEEE.
It makes some floating point computations faster.
Q. Are integers always represented exactly using IEEE floating
point?
A. Yes, barring overflow of the 52 bit mantissa.
The smallest positive integer that is not represented
exactly using type double is
253 + 1 = 9,007,199,254,740,993.
Q. Does (a + b) always equal (b + a) when a and b and floating point
numbers?
A. Yes.
Q. Does x / y always equal the same value, independent of my platform?
A. Yes. IEEE requires that operations (+ * - /) are performed exactly
and then rounded to the nearest floating point number (using Banker's round
if there is a tie: round to nearest even number).
This improves portability at the expense of efficiency.
Q. Does (x - y) always equal (x + (-y))?
A. Yes. Guaranteed by IEEE 754.
Q. Is -x always the same as 0 - x?
A. Almost, except if x = 0. Then -x = -0, but 0 - x = 0.
Q. Does (x + x == 2 * x)? (1 * x == x)? (0.5 * x == x / 2)?
A. Yes, guaranteed by IEEE 754.
Q. If (x != y), is z = 1 / (x - y)
always guaranteed not to produce division by zero.
A. Yes, guaranteed by IEEE 754 (using denormalized numbers).
Q. Why not use a decimal floating point instead of binary
floating point?
A. Decimal
floating point offers several advantages over binary floating
point, especially for financial computations.
However, it usually requires about 20% more storage (assuming it is stored
using binary hardware) and the resulting code is somewhat slower.
Q. Any good resources for numerics in Java?
A. Java numerics provides
a focal point for information on numerical computing in Java.
JSci:
a free science API for numerical computing in Java.
Exercises
- What is the result of the following code fragment?
double a = 12345.0;
double b = 1e-16;
System.out.println(a + b == a);
|
- How many values does the following code fragment print?
for (double d = 0.1; d <= 0.5; d +=0.1) system.out.println(d); for (double d=1.1; d <=1.5; d +=0.1) system.out.println(d); |
Answer: nine (four in the first loop and five
in the second).
- What numbers are printed out when executing the following
code fragment with N = 25?
Answer: 1 5 6 7 8 9 10 11 12 13 14 15 16 17 18.
Unlikely you would predict this without typing it in.
Avoid using floating point numbers to check loop continuation
conditions.
- Find real numbers a, b, and c such that (a * b) / c doesn't
equal a * (b / c) without resorting to overflow tricks.
Answer: a = 2, b = 0.1, c = 3.
-
Find a value of x for which (0.1 * x) is different from (x / 10).
- Find a real number a such that (a * (3.0 / a)) doesn't
equal 3.0.
Answer: If a = 0, the first expression evaluates to NaN.
- Find a real number a such that (Math.sqrt(a) * Math.sqrt(a))
doesn't equal a.
Answer: a = 2.0.
- Find floating point values such that (x/x != 1), (x - x != 0),
(0 != 0 * x).
Answer: consider values of x that are 0, +- infinity, or NaN.
-
What is the result of the following code fragment from
FloatingPoint.java?
float f = (float) (3.0 / 7.0);
if (f == 3.0 / 7.0) System.out.println("yes");
else System.out.println("no");
|
-
Fix Exponential.java
so that it works properly for negative inputs using the method
described in the text.
- Sum the following real numbers from left to right.
Use floating point arithmetic, and assume your computer
only stores three decimal digits of precision.
0.007 0.103 0.205 0.008 3.12 0.006 0.876 0.005 0.015
|
Repeat using the priority queue algorithm. Repeat using
Kahan's algorithm.
-
Write a program to read in a list of bank account values and print
the average value, rounded exactly to the nearest penny using banker's
rounding. The input values will be separeated by whitespace and using two
decimal digits.
Hint: avoid Double.parseDouble.
100.00 489.12 1765.12 3636.10
3222.05 3299.20 5111.27 3542.25
86369.18 532.99
|
-
What does the following code fragment
print?
long x = 16777217; // 2^24 + 1
System.out.println(x);
float y = 16777217;
DecimalFormat df = new DecimalFormat("0.00000000000");
System.out.println(df.format(y));
|
Answer: 16777217 and 16777216.00000000000.
2^24 + 1 is the smallest positive integer that is not exactly
representable using type float.
-
What is wrong with each of the following
two loops?
int count1 = 0;
for (float x = 0.0f; x <20000000.0f; x=x + 1.0f) count1++; system.out.println(count1); int count2=0; for (double x=0.0; x < 20000000.0; x=x + 1.0) count2++; system.out.println(count1); |
Answer: The first code fragment goes into an infinite
loop when x = 16777216.0f since 16777217.0f cannot be represented
exactly using type float and
16777216.0f + 1.0f = 16777216.0f. The second loop is guaranteed
to work correctly, but using a variable of type int for
the loop variable might be more efficient.
-
One way to define e is the limit as n approaches infinity of
(1 + 1/n)n. Write a computer program to estimate e using
this formula. For which value of n do you get the best approximation of e?
Hint: if n gets too large, then 1 + 1/n is equal to 1 using
floating point arithmetic.
-
Not all subtraction is catastrophic.
Demonstrate that the expression x2 - y2
exhibits catastrophic cancellation when x is close to y.
However, the improved expression (x + y)(x - y) still subtracts
nearly equal quantities, but it is a benign cancellation.
- (Goldberg)
(1 + i/n)^n arises in financial calculations involving interest.
Rewrite as e^(n ln (1 + i/n)). Now trick is to compute
ln(1+x). Math.log(1+x) not accurate when x <1.
if (1 + x == 1) return x
else return (x * Math.log(1 + x)) / (1 + x) - 1.
|
- Catastrophic cancellation: f(x) = ex - sin
x - cos x near x = 0. Use x2 + x3/3 as an approximation near x = 0.
-
Catastrophic cancellation: f(x) = ln(x) - 1.
For x near e use ln(x/e).
-
Catastrophic cancellation: f(x) = ln(x) - log(1/x).
For x near 1 use 2 ln(x).
-
Catastrophic cancellation: x-2(sinx - ex + 1).
-
Find a value of x for which Math.abs(x) doesn't equal
Math.sqrt(x*x).
-
Numerical stability.
The golden mean φ = sqrt(5)/2 - 1/2 = 0.61803398874989484820...
It satisfies the equation φN = φN-2 - φN-1.
We could use this recurrence to compute power of φ by starting with
phi0 = 1, phi1 = φ, and iterating the recurrence to calculate
successive powers of φ.
However, floating-point roundoff error swamps the computation by about N = 40.
Write a program Unstable.java
that reads in a command line parameter N and prints out φN
computed using the recurrence above and also using Math.pow().
Run your program with N = 40 and N = 100 to see the consequences of
roundoff error.
Creative Exercises
- Roundoff errors.
Sums.
Relative error = |x - x'| / |x| = independent of units.
One idea is to create priority queue of numbers and repeatedly
add the two smallest values and insert back into the priority queue.
(See exercise XYZ.)
Simpler and faster alternative: Kahan's summation formula (Goldberg, Theorem 8).
Example of useful algorithm that exploits fact that (x + y) + z doesn't
equal x + (y + z). Optimizing compiler better not rearrange terms.
double c = 0.0, sum = 0.0, y;
for (int i = 0; i |
Recommended method: sorting and adding.
- Cauchy-Schwartz inequality.
The Cauchy-Schwartz inequality guarantees that for
any real numbers xi and yi that
(x1x1 + ... xnxn) × (y1y1 + ... ynyn) ≥ (x1y1 + ... xnyn)2
|
In particular, when n = 2, y1 = y2 = 1, we have
2(x1x1 + x2x2) ≥ (x1 + x2)2
|
Write a program CauchySchwartz.java that
reads in two integers x1 and x2, and verifies that the
identity is not necessarily true in floating point.
Try x1 = 0.5000000000000002 and x2 = 0.5000000000000001
- Chebyshev approximation for distance computations.
Use the following approximation for computing the
distance between p and q without
doing an expensive square root operation.
sqrt(dx2 + dy2) = max(|dx|, |dy|) + 0.35 * min(|dx|, |dy|).
How accurate.
- Pythagoras.
Write a program
Program Pythagoras.java that reads
in two real-valued command line parameters a and b and prints out
sqrt(a*a + b*b). Try to stave off overflow. For example, if
|a| >= |b| then compute |a| sqrt(1 + (b/a)*(b/a)).
If |a| <|b| do analogous. ex: x=y =DBL_MAX / 10.0;
- Overflow.
What is the result of the following code fragment?
double a = 4.0, b = 0.5, c = 8e+307;
System.out.println((a * b) * c);
System.out.println(a * (b * c));
|
- Optimizing compiler.
An optimizing compiler....
Must be wary about blindly applying
algebraic laws to computer programs since this could
lead to disastrous results.
Show that with floating point numbers, addition is
not associative. That is find
values a, b, and c such that ((a + b) + c) != (a + (b + c)).
Show also that the distributive property does not necessarily
apply finding a, b, and c such that
(a * (b + c)) != (a * b + a * c).
Problem more pronounced with multiplication ((a * b) * c) !=
(a * (b * c)). Give simple example.
- Harmonic sum.
Redo the harmonic sum computation, but sum
from right-to-left instead of left-to-right.
It illustrates that addition is not associative (a + b) + c
vs. a + (b + c) since we
get different answers depending on whether we sum from left-to-right
or right-to-left.
- Pi.
Compare the following two methods for approximating the mathematical
constant pi. Repeat while p_k > p_k-1. The sequence p_k converges
to pi.
s_6 = 1 s_6 = 1
s_2k = sqrt(2 - sqrt((2-s_k)(2+s_k)) s_2k = s_k / sqrt(2 + sqrt((2-s_k)(2+s_k))
p_k = k * s_k / 2 p_k = k * s_k / 2
|
The second formula has much better behavior since it avoids the
catastrophic cancellation 2 - sqrt((2-s_k)(2+s_k)).
- Jean-Michael Muller example.
Consider the sequence
x0 = 4, x1 = 4.25,
xn = 108 - (815 - 1500/xn-2) / xn-1.
Program Muller.java attempts
to estimate what xn converges to as n gets large.
It computes the value 100.0, but the correct value is 5.
- Another Kahan example.
Myth: if you keep adding more precision
and answer converges then it's correct.
Counterexample:
E(0) = 1, E(z) = (exp(z) - 1) / z.
Q(x) = |x - sqrt(x2 + 1)| - 1/(x + sqrt(x2 + 1))
H(x) = E(Q(x)2).
Compute H(x) for x = 15.0, 16.0, 17.0, 9999.0.
Repeat with more precision, say using BigDecimal.
- Dividing rational numbers (a + ib) /(c + id). See Smith's formula (11).
- With integer types we must be cognizant of overflow.
The same principles of overflow apply to floating point numbers.
For example, to compute sqrt(x*x + y*y),
use fabs(y) * sqrt(1+(x/y)*(x/y))
if x y.
x = y = DBL_MAX / 10.0;
- Gamma function.
Write a program Gamma.java
that takes a command line parameter x and prints Gamma(x) and
log Gamma(x) to 9 significant digits, where Gamma(x) is the gamma function:
Gamma(x) = integral( tx-1 e-t, t = 0..infinity )
|
The Gamma function is a continuous version of the factorial function:
if n is a positive integer, then n! = Gamma(n+1). Use Lanczos' formula:
Gamma(x) ≈ (x + 4.5)x - 1/2 * e-(x + 4.5) * sqrt(2 π) *
[ 1.0 + 76.18009173 / (x + 0) - 86.50532033 / (x + 1)
+ 24.01409822 / (x + 2) - 1.231739516 / (x + 3)
+ 0.00120858003 / (x + 4) - 0.00000536382 / (x + 5)
]
|
This is accurate to 9 significant digits for x > 1.
- Siegfried M. Rump example.
Write a program Remarkable.java
to compute
y = 333.75 b6 + a2(11 a2 b2 - b6 - 121 b4 - 2) + 5.5 b8 + a/(2b)
|
for a = 77617 and b = 33096. Try with different floating point precision.
Can rewrite
x = 5.5 b8
z = 333.75 b6 + a2(11 a2 b2 - b6 - 121 b4 - 2)
y = z + x + a/(2b)
|
It turns out that x and z have 35 digits in common.
z = -7919111340668961361101134701524942850
x = +7919111340668961361101134701524942848
|
The answer we get in Java (on Sun Sparc) using single precision is
6.338253 ×1029 and using double precision is
a/(2b) = 1.1726039400531787.
The true answer is -2 + a/(2b) = -54767/66192 = -0.82739606....,
but the z + x = -2 term is catostrophically cancelled unless you are using at
least 122 digits of precision!
We match this answer using the BigDecimal ADT in Java's Math library.
The paper
A Remarkable Example of Catastrophic Cancellation Unraveled
describes this well-known formula, which demonstrates why obtaining identical
results in different precisions does not mathematically imply accuracy.
Moreover, even on IEEE 754 conforming platforms (Intel, Sun Sparc),
you can get different answers depending on whether the intermediate
rounding mode is 24, 53, or 64 bits.
This remarkable example was constructed by Siegfried M. Rump at IBM
for S/370 architectures.
- Area of a triangle.
Write a program TriangleArea.java that takes three command
line inputs a, b, and c, representing the side lengths of a
triangle, and prints out the area of the triangle using
Heron's formula: area = sqrt(s(s-a)(s-b)(s-c)), where s = (a + b + c) / 2.
Inaccurate when a is close to b.
Improve with 1960s era formula: sort a >= b >= c. if (c + b Quadratic formula.
The quadratic equation is a grade school formula for finding
the real roots of ax2 + bx + c.
discriminant = Math.sqrt(b*b - 4*a*c);
root1 = -b + discriminant / (2*a);
root2 = -b - discriminant / (2*a);
|
A novice programmer who needed to calculate the roots of a quadratic
equation might naively apply this formula. Of course, we should be
careful to handle the a = 0 case specially to avoid division by zero.
Also we should check the discriminant b2 - 4ac. If it's negative,
then there are no real roots. However, the major problem with
this formula is that it doesn't work if a or c are very small because
one of the two roots will be computed by subtracting b from a very
nearly equal quantity.
The correct way to compute the roots is
if (b > 0) q = -0.5 * (b + discriminant);
else q = -0.5 * (b - discriminant);
root1 = q / a;
root2 = c / q;
|
For example the roots of x2 - 3000000x + 2 are r1 and r2.
The grade school formula yields only 3 digits of accuracy
for the smaller root (6.665941327810287E-7) when using double
precision arithmetic, whereas the good formula yields
12 digits of accuracy (6.666666666668148E-7).
When b = -30000000, the situation deteriorates.
The grade school method now only has two digits of accuracy
(6.705522537231445E-8) vs. 12 digits of accuracy
(6.666666666666681E-8). When b = -300000000, the grade school
method has zero digits of accuracy (0.0) as opposed to
16 (6.666666666666667E-9).
Consider summarizing this in a table.
- Beneficial cancellation.
Massive cancellations in subtraction do not always lead to
catastrophic cancellation.
within a few ulps unless overflow occurs.
Cancellation error does not always lead to inaccuracy.
Straightforward implementation of f(x) = e^x - 1 not accurate if
x is close to 0.
Surprisingly, the solution below achieves full working accuracy
regardless of the size of x.
Final computed answer more accurate than intermediate results
through extreme cleverness in helpful cancellation!
Error analysis can be very subtle and non-obvious.
double y = Math.exp(x);
if (y == 1) return 1;
if (y - 1 == -1) return -1;
else return (y - 1) * x / Math.log(y);
|
- 1 - cos(x).
Compare computing 1 - cos(x) naively and using the formula
cos(x) = 2 sin(x/2)^2.
The Right Way to
Calculate Stuff.
- Square root cancellation.
Devise an accurate expression for
the difference between the square root of (b2 + 100) and b.
Solution: If b is negative or relatively small, just do the
obvious thing. If b is much larger than 100, you can get catastrophic
cancellation. The difference equals b (sqrt(1 + 100/b2) - 1).
If x is very small then, it's safe to approximate sqrt(1 + x) by 1 + x/2.
Now, the difference is approximately b (100 / b2) / 2 = 50 / b.