Problem 1 [5 points] Consider solving the scalar equation $ax = b$, for given $a$ and $b$, and assume that you have computed $\hat{x}$. To measure the quality of $\hat{x}$, we can compute the residual $\hat{r} = b - a\hat{x}$. Derive the error in $\hat{r}$, that is, the relative error in the floating point representation of $\hat{r}$. Can it be large? Explain.
Answer:
Given $ax = b$:
- Let $\mathrm{fl}(a)$ be the floating point representation of $a$
- Let $\mathrm{fl}(b)$ be the floating point representation of $b$
- Let $\mathrm{fl}(\hat{x})$ be the floating point representation of $\hat{x}$

Assume $a$, $b$ and $\hat{x}$ are already machine numbers, so the only rounding errors in $\hat{r}$ come from the multiplication and the subtraction. Denote the exact residual of the computed solution by $r = b - a\hat{x}$.

Assuming the relative error of the multiplication is $\delta_1$, $|\delta_1| \le u$ ⇒ $\mathrm{fl}(a\hat{x}) = a\hat{x}(1+\delta_1)$

Therefore:

Computed residual $\hat{r} = \mathrm{fl}\big(b - \mathrm{fl}(a\hat{x})\big)$

Assuming the relative error of the subtraction is $\delta_2$, $|\delta_2| \le u$ ⇒ $\hat{r} = \big(b - a\hat{x}(1+\delta_1)\big)(1+\delta_2)$

Thus, the error in $\hat{r}$ is
$\hat{r} - r = r\,\delta_2 - a\hat{x}\,\delta_1(1+\delta_2)$, so the relative error is
$\frac{\hat{r}-r}{r} \approx \delta_2 - \frac{a\hat{x}}{r}\,\delta_1$

The error can be large if:
- the relative errors $\delta_1, \delta_2$ of the multiplication and subtraction are significant (otherwise $\delta_1 = \delta_2 = 0$ and $\hat{r} = r$), and
- the values of $b$ and $a\hat{x}$ are close, so that the subtraction $b - a\hat{x}$ introduces "catastrophic cancellation": $|r| \ll |a\hat{x}|$ and the amplification factor $|a\hat{x}/r|$ becomes huge.

Since a good approximation $\hat{x}$ makes exactly this cancellation happen ($a\hat{x} \approx b$), the relative error of the computed residual can indeed be very large.
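As a concrete illustration (a hypothetical example, not part of the original derivation), the following Python sketch compares the residual computed in double precision with the exact residual of the stored $\hat{x}$:

```python
from fractions import Fraction

# Hypothetical example: a = 3, b = 1, and x_hat = fl(1/3).
a, b = 3.0, 1.0
x_hat = 1.0 / 3.0          # rounded representation of 1/3

# Residual computed in floating point: both the product and the
# subtraction are rounded.
r_computed = b - a * x_hat

# Exact residual of the stored x_hat, using rational arithmetic.
r_exact = Fraction(b) - Fraction(a) * Fraction(x_hat)

print(r_computed)          # 0.0 -- the rounded product 3*fl(1/3) is exactly 1.0
print(float(r_exact))      # ~5.55e-17 -- the true residual is nonzero
```

The computed residual is $0$ while the true residual is $2^{-54}$: cancellation in $b - a\hat{x}$ makes the relative error of the computed residual as large as 100%.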
Problem 2 [2 points] Explain the output of the following code
```matlab
clear all;
x = 10/9;
for i=1:20
    x = 10*(x-1);
end
x
```
Is the result accurate?
Answer:
The above MATLAB code performs the following steps:
- `clear all` clears all variables in the current workspace.
- `x = 10/9` initialises the value of $x$ to $10/9 = 1.1111\ldots$
- The `for` loop runs 20 times, where each iteration updates $x$ using the formula $x = 10(x - 1)$.
- Finally, `x` prints the value of `x` in the MATLAB terminal window.

Mathematically, $x = 10/9$ is a fixed point of the iteration, since $10(10/9 - 1) = 10/9$, so the output should still be $10/9 \approx 1.1111$. The actual output is not accurate, due to floating point errors. Machine epsilon by default in MATLAB (which works in double precision) is approximately $2^{-52} \approx 2.22 \times 10^{-16}$. Since $10/9$ has an infinite binary expansion, it is not exactly representable in floating point: the stored value carries a rounding error of roughly machine epsilon, and each iteration of the `for` loop multiplies that error by 10. After 20 iterations the initial error is amplified by about $10^{20}$, so the printed result is nowhere near the mathematical value $10/9$.
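The amplification can be demonstrated by running the same iteration in exact rational arithmetic alongside double precision (a quick Python sketch, not part of the assignment):

```python
from fractions import Fraction

x = 10 / 9                    # double precision: already carries a rounding error
x_exact = Fraction(10, 9)     # exact rational value

for i in range(20):
    x = 10 * (x - 1)                  # the error is multiplied by 10 each pass
    x_exact = 10 * (x_exact - 1)      # stays exactly 10/9

print(x)                  # far from 1.111...
print(x_exact)            # Fraction(10, 9)
```

The exact iteration stays at $10/9$ forever, while the double precision value drifts off by thousands: the initial representation error of order $10^{-16}$ has been multiplied by roughly $10^{20}$.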
Problem 3 [3 points] Suppose you approximate $e^x$ by its truncated Taylor series. For given $x = 0.1$, derive the smallest number of terms of the series needed to achieve accuracy of $10^{-8}$. Write a Matlab program to check that your approximation is accurate up to $10^{-8}$. Name your program check_exp.m.
Answer:
The Taylor series of a real or complex function $f$ at $a$ is defined by
$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$
Given $f$ has $n+1$ continuous derivatives, the truncated Taylor series (about $a = 0$) can be defined as
$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(0)}{k!}x^k + R_n(x)$
where
$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}x^{n+1}$
Hence, with $f(x) = e^x$ we have $R_n(x) = \frac{e^{\xi}}{(n+1)!}x^{n+1}$, where $\xi$ is between $0$ and $x$.
Thus, we need to find $n$ such that $\frac{e^{\xi}}{(n+1)!}(0.1)^{n+1} < 10^{-8}$, with $\xi$ between $0$ and $0.1$.
With $0 < \xi < 0.1$, then $e^{\xi} < e^{0.1} \approx 1.105$.
From the above bound, $n = 5$ gives $\frac{e^{0.1}(0.1)^6}{6!} \approx 1.5 \times 10^{-9} < 10^{-8}$, while $n = 4$ gives $\approx 9.2 \times 10^{-8} > 10^{-8}$. Hence with $6$ terms ($1 + x + \cdots + x^5/5!$) the Taylor series will be accurate up to $10^{-8}$.
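The remainder bound can also be checked numerically; this short Python sketch (an illustration, separate from the required MATLAB program) finds the smallest $n$ with $e^{0.1}(0.1)^{n+1}/(n+1)! < 10^{-8}$:

```python
from math import exp, factorial

x, tol = 0.1, 1e-8

def remainder_bound(n):
    # Upper bound on |R_n(x)| for f = exp: e^xi <= e^x for 0 < xi < x
    return exp(x) * x ** (n + 1) / factorial(n + 1)

n = 0
while remainder_bound(n) >= tol:
    n += 1

print(n)                      # 5 -> six terms: 1 + x + ... + x^5/5!
print(remainder_bound(n))     # ~1.5e-09
```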
The MATLAB program `check_exp.m` below examines the above terms (the variable is named `err` so it does not shadow MATLAB's built-in `error` function):

```matlab
function check_exp()
    x = 0.1;
    % Approximation with the first 6 terms of the Taylor series
    approx = 1 + x + x^2/factorial(2) + x^3/factorial(3) + x^4/factorial(4) + x^5/factorial(5);
    actual = exp(x);
    err = abs(approx - actual);
    % Display the results
    fprintf('Approximated value: %f\n', approx);
    fprintf('Actual value: %f\n', actual);
    fprintf('Error: %e\n', err);
    % Check if the error is less than 10^-8
    if err < 10^-8
        disp('The approximation is accurate up to 10^-8.');
    else
        disp('The approximation is NOT accurate up to 10^-8.');
    end
end
```
Problem 4 [3 points] The sine function has the Taylor series expansion $\sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}x^{2n+1}$. Suppose we approximate $\sin(x)$ by $x - \frac{x^3}{3!} + \frac{x^5}{5!}$. What are the absolute and relative errors in this approximation for $x = 0.1, 0.5, 1.0$? Write a Matlab program to produce these errors; name it sin_approx.m.
Answer:
Assuming $y = \sin(x)$ is the exact value and $\hat{y}$ is the approximate value, $\hat{y} = x - \frac{x^3}{3!} + \frac{x^5}{5!}$:
- Absolute error is given by $|y - \hat{y}|$
- Relative error is given by $\frac{|y - \hat{y}|}{|y|}$

For $x = 0.1, 0.5, 1.0$, the following table presents the errors:

| Error | $x = 0.1$ | $x = 0.5$ | $x = 1.0$ |
|---|---|---|---|
| Absolute | 1.983852e-11 | 1.544729e-06 | 1.956819e-04 |
| Relative | 1.987162e-10 | 3.222042e-06 | 2.325474e-04 |
The MATLAB program `sin_approx.m`:

```matlab
function sin_approx()
    % Define the values of x
    x_values = [0.1, 0.5, 1.0];
    % Loop through each value of x to compute the errors
    for i = 1:length(x_values)
        x = x_values(i);
        % Calculate the approximation
        approx = x - x^3/factorial(3) + x^5/factorial(5);
        % Calculate the actual value of sin(x)
        actual = sin(x);
        % Calculate the absolute error
        abs_error = abs(approx - actual);
        % Calculate the relative error
        rel_error = abs_error / abs(actual);
        % Display the results for each x
        fprintf('For x = %f:\n', x);
        fprintf('Approximated value: %f\n', approx);
        fprintf('Actual value: %f\n', actual);
        fprintf('Absolute Error: %e\n', abs_error);
        fprintf('Relative Error: %e\n\n', rel_error);
    end
end
```
Problem 5 [2 points] How many terms are needed in the series to compute $\operatorname{arccot}(x)$ for $x = 1/2$ accurate to 12 decimal places?
Answer:
To calculate $\operatorname{arccot}(x)$ for $x = 1/2$ accurate to 12 decimal places, we need to find $n$ such that the truncation error is below the tolerance.
Substituting for the error term, we need to find $n$ such that $|E_n| < 10^{-12}$.
We know that the general term of the series is $(-1)^n \frac{x^{2n+1}}{2n+1}$.
Since we are considering $x = 1/2$ on the interval $(0, 1]$, and the series for `arccot(x)`
is an alternating series with terms decreasing in magnitude, the largest possible value of the error term is the magnitude of the first omitted term.
Thus, the inequality to solve for the term index $n$ is $\frac{x^{2n+1}}{2n+1} < 10^{-12}$.
Using the following function `find_nth_terms`, we can find that the first term with magnitude below $10^{-12}$ occurs at $n = 17$; keeping the terms $n = 0, \ldots, 16$ (17 terms) will therefore ensure the result for $x = 1/2$ is accurate to 12 decimal places.
```python
import math

def find_nth_terms(x: float, eps: float = 1e-12):
    n = 0
    term = x
    while abs(term) >= eps:
        n += 1
        term = math.pow(-1, n) * math.pow(x, 2 * n + 1) / (2 * n + 1)
    return n

find_nth_terms(0.5)
```
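As a sanity check (an addition, not part of the original answer), the partial sum with 17 terms can be compared against `math.atan`: since $\operatorname{arccot}(x) = \pi/2 - \arctan(x)$, the two series share the same terms.

```python
import math

x = 0.5
# Partial sum of the first 17 terms (n = 0..16) of the alternating series
s = sum((-1) ** n * x ** (2 * n + 1) / (2 * n + 1) for n in range(17))

# By the alternating series bound, the error is at most the first
# omitted term, which is below 1e-12.
print(abs(s - math.atan(x)))   # < 1e-12
```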
Problem 6 [2 points] Consider the expression $1024 + x$. Derive for what values of $x$ this expression evaluates to 1024.
Answer:
In IEEE 754 double precision, $\varepsilon = 2^{-52} \approx 2.22 \times 10^{-16}$.
From the definition of machine epsilon ($\varepsilon$), the difference between $2^e$ and the next representable number is proportional to $2^e$; that is, the spacing is $2^e \varepsilon$.
Thus the expression evaluates to 1024 (with round to nearest) for every $x$ whose magnitude is within half the spacing around $1024 = 2^{10}$, that is $|x| < \frac{1}{2} \cdot 2^{10}\varepsilon$.
Substituting the values $\varepsilon = 2^{-52}$ and $2^{10}$:
⇒ $|x| < \frac{1}{2} \cdot 2^{10} \cdot 2^{-52} = 2^{-43} \approx 1.14 \times 10^{-13}$ (more precisely, the spacing just below $1024$ is halved, so on the negative side the range shrinks to about $-2^{-44} < x$).
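This range can be verified directly in Python, where `math.ulp(1024.0)` gives the spacing $2^{-42}$ (an illustrative check, assuming the expression is evaluated in double precision):

```python
import math

# Spacing between 1024 and the next representable double is 2^-42.
print(math.ulp(1024.0) == 2.0 ** -42)   # True

# x below half the spacing is absorbed: 1024 + x rounds back to 1024.
print(1024.0 + 2.0 ** -44 == 1024.0)    # True

# x at a full spacing is representable and survives.
print(1024.0 + 2.0 ** -42 > 1024.0)     # True
```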
Problem 7 [2 points] Give an example in base-10 floating-point arithmetic when a. $(a + b) + c \ne a + (b + c)$ b. $(a + b)\,c \ne a\,c + b\,c$
Answer:
For the first example $(a + b) + c \ne a + (b + c)$, assuming double precision:
Let $a = 1$, $b = c = 2^{-53}$:
⇒ $(a + b) + c = 1$, whereas $a + (b + c) = 1 + 2^{-52} > 1$.
The explanation from Problem 6 can be used here: since $2^{-53}$ is half the spacing between $1$ and the next representable number, $\mathrm{fl}(1 + 2^{-53}) = 1$ (the tie rounds to even), so the left-hand side stays at $1$; whereas $b + c = 2^{-52}$ is computed exactly, and $1 + 2^{-52}$ rounds up to the next representable number.
For the second example $(a + b)\,c \ne a\,c + b\,c$, assuming a decimal system with 3 significant digits. Let $a = b = 1.01$ and $c = 3.34$:
⇒ $(a + b)\,c = 2.02 \times 3.34 = 6.7468$ (rounded to $6.75$), whereas $a\,c = b\,c = 3.3734$ (rounded to $3.37$) and $3.37 + 3.37 = 6.74$.
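Both examples can be checked in Python; the `decimal` module emulates the 3-significant-digit base-10 system (the specific numbers are the illustrative choices used above):

```python
from decimal import Decimal, getcontext

# (a) Associativity of addition fails in double precision.
a, b, c = 1.0, 2.0 ** -53, 2.0 ** -53
lhs = (a + b) + c        # 2^-53 is absorbed twice -> 1.0
rhs = a + (b + c)        # b + c = 2^-52 is exact and survives
print(lhs == rhs)        # False

# (b) Distributivity fails in 3-significant-digit decimal arithmetic.
getcontext().prec = 3
p, q, r = Decimal("1.01"), Decimal("1.01"), Decimal("3.34")
left = (p + q) * r       # 2.02 * 3.34 = 6.7468 -> 6.75
right = p * r + q * r    # 3.3734 -> 3.37, then 3.37 + 3.37 = 6.74
print(left, right)       # 6.75 6.74
```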
Problem 8 [8 points] Consider a binary floating-point (FP) system with normalised FP numbers and 8 binary digits after the binary point:
$x = \pm\, 1.d_1 d_2 \cdots d_8 \times 2^e, \quad d_i \in \{0, 1\}$
For this problem, assume that we do not have a restriction on the exponent $e$. Name this system B8.
(a) [2 points] What is the value (in decimal) of the unit roundoff in B8?
(b) [1 point] What is the next binary number after $1.10011001$?
(c) [5 points] The binary representation of the decimal $0.1$ is infinite: $0.1 = (0.0001\overline{1001})_2$. Assume it is rounded to the nearest FP number in B8. What is this number (in binary)?
Answer:
The B8 system can also be defined as $\pm\, 1.d_1 d_2 \cdots d_8 \times 2^e$, $d_i \in \{0, 1\}$.
(a). For a binary FP system with $t$ binary digits after the binary point (rounding to nearest), the unit roundoff is given by $u = \frac{1}{2} \cdot 2^{-t}$.
With $t = 8$, the unit roundoff for this system in decimal is $u = 2^{-9} = 0.001953125 \approx 1.95 \times 10^{-3}$.
(b). Given $1.10011001$ in binary, the next binary number can be calculated by adding one unit in the last (8th) place:

```
  1.10011001
+ 0.00000001
= 1.10011010
```
(c).
Normalising, $0.1 = 0.000110011001\ldots_2 = 1.100110011\ldots_2 \times 2^{-4}$. We look at the first 9 digits of the significand after the binary point to determine how to round: $1.10011001\,|\,1\ldots$
Since the 9th digit is 1 (with nonzero digits following) → round the 8-digit significand up.
Therefore, $0.1$ rounded to the nearest FP number in B8 is $1.10011010_2 \times 2^{-4}$ in binary.
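The rounding in part (c) can be reproduced exactly with rational arithmetic (a verification sketch; B8 itself is the system defined in the problem):

```python
from fractions import Fraction

# 0.1 = m * 2^-4 with m in [1, 2); round m to 8 binary fraction digits.
m = Fraction(1, 10) * 2 ** 4            # exact significand, 1.6
m_rounded = round(m * 2 ** 8)           # integer significand on 9 bits

print(format(m_rounded, "b"))           # 110011010, i.e. 1.10011010
print(Fraction(m_rounded, 2 ** 8))      # 205/128, the rounded significand
```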
Problem 9 [10 points] For a scalar function $f$ consider the derivative approximations
$f'(x_0) \approx g_1(x_0, h) = \frac{f(x_0 + 2h) - f(x_0)}{2h}$
and
$f'(x_0) \approx g_2(x_0, h) = \frac{f(x_0 + h) - f(x_0 - h)}{2h}$.
a. [4 points] Let $f(x) = e^{\sin x}$ and $x_0 = \pi/4$.
- Write a Matlab program that computes the errors $|f'(x_0) - g_1(x_0, h)|$ and $|f'(x_0) - g_2(x_0, h)|$ for each $h = 10^{-k}$, $k = 1, 1.5, 2, \ldots, 16$.
- Using `loglog`, plot on the same plot these errors versus $h$. Name your program derivative_approx.m.
For each of these approximations:
b. [4 points] Derive the value of $h$ for which the error is the smallest.
c. [2 points] What is the smallest error and for what value of $h$ is it achieved? How does this value compare to the theoretically "optimum" value?
Answer:
(a).
```matlab
function derivative_approx()
    % Define the function f and its derivative
    f = @(x) exp(sin(x));
    df = @(x) cos(x) * exp(sin(x));
    % Define the approximation functions g1 and g2
    g1 = @(x, h) (f(x + 2*h) - f(x)) / (2*h);
    g2 = @(x, h) (f(x + h) - f(x - h)) / (2*h);
    % Define x0
    x0 = pi/4;
    % Define k values and compute h values
    k_values = 1:0.5:16;
    h_values = 10.^(-k_values);
    % Initialize error arrays
    errors_g1 = zeros(size(h_values));
    errors_g2 = zeros(size(h_values));
    % Compute errors for each h_value
    for i = 1:length(h_values)
        h = h_values(i);
        errors_g1(i) = abs(df(x0) - g1(x0, h));
        errors_g2(i) = abs(df(x0) - g2(x0, h));
    end
    % Find the h value for which the error is the smallest for each approximation
    [~, idx_min_error_g1] = min(errors_g1);
    [~, idx_min_error_g2] = min(errors_g2);
    h_min_error_g1 = h_values(idx_min_error_g1);
    h_min_error_g2 = h_values(idx_min_error_g2);
    % Display the h values for the smallest errors
    fprintf('For g1, the smallest error is at h = %e\n', h_min_error_g1);
    fprintf('For g2, the smallest error is at h = %e\n', h_min_error_g2);
    % Plot errors using loglog
    loglog(h_values, errors_g1, '-o', 'DisplayName', '|f''(x_0) - g_1(x_0, h)|');
    hold on;
    loglog(h_values, errors_g2, '-x', 'DisplayName', '|f''(x_0) - g_2(x_0, h)|');
    hold off;
    % Add labels, title, and legend
    xlabel('h');
    ylabel('Error');
    title('Errors in Derivative Approximations');
    legend;
    grid on;
end
```
(b).
The Taylor series expansion of the function $f$ around the point $x$ is:
$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots$
For the first approximation $g_1$, with the Taylor series expansion
$f(x + 2h) = f(x) + 2h f'(x) + 2h^2 f''(x) + O(h^3)$
for the numerator, we get $g_1(x, h) = f'(x) + h f''(x) + O(h^2)$, so the truncation error is $\approx h\,|f''(x)|$. The rounding error in the difference $f(x+2h) - f(x)$ is about $\varepsilon |f(x)|$, contributing $\approx \varepsilon/(2h)$ after the division.
Hence the error term is $E_1(h) \approx c_1 h + \frac{\varepsilon}{2h}$
⇒ setting $E_1'(h) = 0$ gives the optimum $h^* \sim \sqrt{\varepsilon} \approx 10^{-8}$.
For the second approximation $g_2$: expanding $f(x+h)$ and $f(x-h)$, the even-order terms cancel and $g_2(x, h) = f'(x) + \frac{h^2}{6} f'''(x) + O(h^4)$, so the error term is $E_2(h) \approx c_2 h^2 + \frac{\varepsilon}{2h}$, minimised at $h^* \sim \varepsilon^{1/3} \approx 6 \times 10^{-6}$.
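The derived orders can be confirmed numerically: reducing $h$ tenfold should shrink the $g_1$ error about 10-fold and the $g_2$ error about 100-fold while truncation error dominates. A Python check with the same $f$ and $x_0$ (an illustration alongside the MATLAB program):

```python
import math

f = lambda x: math.exp(math.sin(x))
df = lambda x: math.cos(x) * math.exp(math.sin(x))
x0 = math.pi / 4

g1 = lambda x, h: (f(x + 2 * h) - f(x)) / (2 * h)
g2 = lambda x, h: (f(x + h) - f(x - h)) / (2 * h)

err1 = lambda h: abs(df(x0) - g1(x0, h))
err2 = lambda h: abs(df(x0) - g2(x0, h))

# In the truncation-dominated regime the error ratios reveal the order.
print(err1(1e-2) / err1(1e-3))   # ~10  (first order)
print(err2(1e-2) / err2(1e-3))   # ~100 (second order)
```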
(c).
Running the program: for $g_1$, the smallest error is at $h = 10^{-8}$; for $g_2$, the smallest error is at $h = 3.162278 \times 10^{-6}$. These agree well with the theoretical optima $h^* \sim \sqrt{\varepsilon} \approx 10^{-8}$ for $g_1$ and $h^* \sim \varepsilon^{1/3} \approx 6 \times 10^{-6}$ for $g_2$ (the program samples only $h = 10^{-k}$ for half-integer $k$, so $3.16 \times 10^{-6} = 10^{-5.5}$ is the closest sampled value).
Problem 10 [7 points] In the Patriot disaster example, the decimal value 0.1 was converted to a single precision number with chopping.
Suppose that it is converted to a double precision number with chopping.
(a). [5 points] What is the error in this double precision representation of 0.1?
(b). [2 points] What is the error in the computed time after 100 hours?
Answer:
(a).
Given the binary representation of in double precision:
- Sign:
- Exponent: , which is 1019 in decimal ⇒ effective exponent is
- Significand: the binary digits will be chopped off at 52 bit. Therefore, and thus
(b).
After 100 hours:
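The chopping error and its accumulation can be computed exactly with rational arithmetic (a verification sketch of the numbers above):

```python
from fractions import Fraction

# Chop 0.1 = 2^-4 * 1.f to 52 fraction bits: keep floor(0.1 * 2^56) / 2^56.
exact = Fraction(1, 10)
chopped = Fraction(exact.numerator * 2 ** 56 // exact.denominator, 2 ** 56)

err = exact - chopped
print(float(err))                 # ~8.33e-18  (= 0.6 * 2^-56)

# 100 hours of counting in 0.1 s increments: 3,600,000 additions.
total = 3_600_000 * err
print(float(total))               # ~3.0e-11 seconds
```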