Problem 1 [5 points] Consider solving the scalar equation $ax = b$, for given $a$ and $b$, and assume that you have computed $\hat{x}$. To measure the quality of $\hat{x}$, we can compute the residual $\hat{r} = b - a\hat{x}$. Derive the error in $\hat{r}$, that is, the relative error in the floating point representation of $\hat{r}$. Can it be large? Explain.
Answer:
Given $ax = b$:
- Let $\mathrm{fl}(a)$ be the floating point representation of $a$
- Let $\mathrm{fl}(b)$ be the floating point representation of $b$
- Let $\mathrm{fl}(\hat{x})$ be the floating point representation of $\hat{x}$

Assume $a$, $b$ and $\hat{x}$ are already machine numbers, so the only rounding errors in $\hat{r}$ come from the multiplication and the subtraction. Denote the exact residual of the computed solution by $r = b - a\hat{x}$.

Assuming the relative error of the multiplication is $\delta_1$, $|\delta_1| \le u$ ⇒ $\mathrm{fl}(a\hat{x}) = a\hat{x}(1+\delta_1)$

Therefore:

Computed residual $\hat{r} = \mathrm{fl}\big(b - \mathrm{fl}(a\hat{x})\big)$

Assuming the relative error of the subtraction is $\delta_2$, $|\delta_2| \le u$ ⇒ $\hat{r} = \big(b - a\hat{x}(1+\delta_1)\big)(1+\delta_2)$

Thus, the error in $\hat{r}$ is
$\hat{r} - r = r\,\delta_2 - a\hat{x}\,\delta_1(1+\delta_2)$, so the relative error is
$\frac{\hat{r}-r}{r} \approx \delta_2 - \frac{a\hat{x}}{r}\,\delta_1$

The error can be large if:
- the relative errors $\delta_1, \delta_2$ of the multiplication and subtraction are significant (otherwise $\delta_1 = \delta_2 = 0$ and $\hat{r} = r$), and
- the values of $b$ and $a\hat{x}$ are close, so that the subtraction $b - a\hat{x}$ introduces "catastrophic cancellation": $|r| \ll |a\hat{x}|$ and the amplification factor $|a\hat{x}/r|$ becomes huge.

Since a good approximation $\hat{x}$ makes exactly this cancellation happen ($a\hat{x} \approx b$), the relative error of the computed residual can indeed be very large.
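As a concrete illustration (a hypothetical example, not part of the original derivation), the following Python sketch compares the residual computed in double precision with the exact residual of the stored $\hat{x}$:

```python
from fractions import Fraction

# Hypothetical example: a = 3, b = 1, and x_hat = fl(1/3).
a, b = 3.0, 1.0
x_hat = 1.0 / 3.0          # rounded representation of 1/3

# Residual computed in floating point: both the product and the
# subtraction are rounded.
r_computed = b - a * x_hat

# Exact residual of the stored x_hat, using rational arithmetic.
r_exact = Fraction(b) - Fraction(a) * Fraction(x_hat)

print(r_computed)          # 0.0 -- the rounded product 3*fl(1/3) is exactly 1.0
print(float(r_exact))      # ~5.55e-17 -- the true residual is nonzero
```

The computed residual is $0$ while the true residual is $2^{-54}$: cancellation in $b - a\hat{x}$ makes the relative error of the computed residual as large as 100%.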
Problem 2 [2 points] Explain the output of the following code
```matlab
clear all;
x = 10/9;
for i=1:20
    x = 10*(x-1);
end
x
```
Is the result accurate?
Answer:
The above MATLAB code performs the following steps:
- `clear all` clears all variables in the current workspace.
- `x = 10/9` initialises the value of $x$ to $10/9 = 1.1111\ldots$
- The `for` loop runs 20 times, where each iteration updates $x$ using the formula $x = 10(x - 1)$.
- Finally, `x` prints the value of `x` in the MATLAB terminal window.

Mathematically, $x = 10/9$ is a fixed point of the iteration, since $10(10/9 - 1) = 10/9$, so the output should still be $10/9 \approx 1.1111$. The actual output is not accurate, due to floating point errors. Machine epsilon by default in MATLAB (which works in double precision) is approximately $2^{-52} \approx 2.22 \times 10^{-16}$. Since $10/9$ has an infinite binary expansion, it is not exactly representable in floating point: the stored value carries a rounding error of roughly machine epsilon, and each iteration of the `for` loop multiplies that error by 10. After 20 iterations the initial error is amplified by about $10^{20}$, so the printed result is nowhere near the mathematical value $10/9$.
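The amplification can be demonstrated by running the same iteration in exact rational arithmetic alongside double precision (a quick Python sketch, not part of the assignment):

```python
from fractions import Fraction

x = 10 / 9                    # double precision: already carries a rounding error
x_exact = Fraction(10, 9)     # exact rational value

for i in range(20):
    x = 10 * (x - 1)                  # the error is multiplied by 10 each pass
    x_exact = 10 * (x_exact - 1)      # stays exactly 10/9

print(x)                  # far from 1.111...
print(x_exact)            # Fraction(10, 9)
```

The exact iteration stays at $10/9$ forever, while the double precision value drifts off by thousands: the initial representation error of order $10^{-16}$ has been multiplied by roughly $10^{20}$.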
Problem 3 [3 points] Suppose you approximate $e^x$ by its truncated Taylor series. For given $x = 0.1$, derive the smallest number of terms of the series needed to achieve accuracy of $10^{-8}$. Write a Matlab program to check that your approximation is accurate up to $10^{-8}$. Name your program check_exp.m.
Answer:
The Taylor series of a real or complex function $f$ at $a$ is defined by
$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$
Given $f$ has $n+1$ continuous derivatives, the truncated Taylor series (about $a = 0$) can be defined as
$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(0)}{k!}x^k + R_n(x)$
where
$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}x^{n+1}$
Hence, with $f(x) = e^x$ we have $R_n(x) = \frac{e^{\xi}}{(n+1)!}x^{n+1}$, where $\xi$ is between $0$ and $x$.
Thus, we need to find $n$ such that $\frac{e^{\xi}}{(n+1)!}(0.1)^{n+1} < 10^{-8}$, with $\xi$ between $0$ and $0.1$.
With $0 < \xi < 0.1$, then $e^{\xi} < e^{0.1} \approx 1.105$.
From the above bound, $n = 5$ gives $\frac{e^{0.1}(0.1)^6}{6!} \approx 1.5 \times 10^{-9} < 10^{-8}$, while $n = 4$ gives $\approx 9.2 \times 10^{-8} > 10^{-8}$. Hence with $6$ terms ($1 + x + \cdots + x^5/5!$) the Taylor series will be accurate up to $10^{-8}$.
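The remainder bound can also be checked numerically; this short Python sketch (an illustration, separate from the required MATLAB program) finds the smallest $n$ with $e^{0.1}(0.1)^{n+1}/(n+1)! < 10^{-8}$:

```python
from math import exp, factorial

x, tol = 0.1, 1e-8

def remainder_bound(n):
    # Upper bound on |R_n(x)| for f = exp: e^xi <= e^x for 0 < xi < x
    return exp(x) * x ** (n + 1) / factorial(n + 1)

n = 0
while remainder_bound(n) >= tol:
    n += 1

print(n)                      # 5 -> six terms: 1 + x + ... + x^5/5!
print(remainder_bound(n))     # ~1.5e-09
```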
The MATLAB program `check_exp.m` below examines the above terms (the variable is named `err` so it does not shadow MATLAB's built-in `error` function):

```matlab
function check_exp()
    x = 0.1;
    % Approximation with the first 6 terms of the Taylor series
    approx = 1 + x + x^2/factorial(2) + x^3/factorial(3) + x^4/factorial(4) + x^5/factorial(5);
    actual = exp(x);
    err = abs(approx - actual);
    % Display the results
    fprintf('Approximated value: %f\n', approx);
    fprintf('Actual value: %f\n', actual);
    fprintf('Error: %e\n', err);
    % Check if the error is less than 10^-8
    if err < 10^-8
        disp('The approximation is accurate up to 10^-8.');
    else
        disp('The approximation is NOT accurate up to 10^-8.');
    end
end
```
Problem 4 [3 points] The sine function has the Taylor series expansion $\sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}x^{2n+1}$. Suppose we approximate $\sin(x)$ by $x - \frac{x^3}{3!} + \frac{x^5}{5!}$. What are the absolute and relative errors in this approximation for $x = 0.1, 0.5, 1.0$? Write a Matlab program to produce these errors; name it sin_approx.m.
Answer:
Assuming $y = \sin(x)$ is the exact value and $\hat{y}$ is the approximate value, $\hat{y} = x - \frac{x^3}{3!} + \frac{x^5}{5!}$:
- Absolute error is given by $|y - \hat{y}|$
- Relative error is given by $\frac{|y - \hat{y}|}{|y|}$

For $x = 0.1, 0.5, 1.0$, the following table presents the errors:

| Error | $x = 0.1$ | $x = 0.5$ | $x = 1.0$ |
|---|---|---|---|
| Absolute | 1.983852e-11 | 1.544729e-06 | 1.956819e-04 |
| Relative | 1.987162e-10 | 3.222042e-06 | 2.325474e-04 |
The MATLAB program `sin_approx.m`:

```matlab
function sin_approx()
    % Define the values of x
    x_values = [0.1, 0.5, 1.0];
    % Loop through each value of x to compute the errors
    for i = 1:length(x_values)
        x = x_values(i);
        % Calculate the approximation
        approx = x - x^3/factorial(3) + x^5/factorial(5);
        % Calculate the actual value of sin(x)
        actual = sin(x);
        % Calculate the absolute error
        abs_error = abs(approx - actual);
        % Calculate the relative error
        rel_error = abs_error / abs(actual);
        % Display the results for each x
        fprintf('For x = %f:\n', x);
        fprintf('Approximated value: %f\n', approx);
        fprintf('Actual value: %f\n', actual);
        fprintf('Absolute Error: %e\n', abs_error);
        fprintf('Relative Error: %e\n\n', rel_error);
    end
end
```
Problem 5 [2 points] How many terms are needed in the series to compute $\operatorname{arccot}(x)$ for $x = 1/2$ accurate to 12 decimal places?
Answer:
To calculate $\operatorname{arccot}(x)$ for $x = 1/2$ accurate to 12 decimal places, we need to find $n$ such that the truncation error is below the tolerance.
Substituting for the error term, we need to find $n$ such that $|E_n| < 10^{-12}$.
We know that the general term of the series is $(-1)^n \frac{x^{2n+1}}{2n+1}$.
Since we are considering $x = 1/2$ on the interval $(0, 1]$, and the series for `arccot(x)`
is an alternating series with terms decreasing in magnitude, the largest possible value of the error term is the magnitude of the first omitted term.
Thus, the inequality to solve for the term index $n$ is $\frac{x^{2n+1}}{2n+1} < 10^{-12}$.
Using the following function `find_nth_terms`, we can find that the first term with magnitude below $10^{-12}$ occurs at $n = 17$; keeping the terms $n = 0, \ldots, 16$ (17 terms) will therefore ensure the result for $x = 1/2$ is accurate to 12 decimal places.
```python
import math

def find_nth_terms(x: float, eps: float = 1e-12):
    n = 0
    term = x
    while abs(term) >= eps:
        n += 1
        term = math.pow(-1, n) * math.pow(x, 2 * n + 1) / (2 * n + 1)
    return n

find_nth_terms(0.5)
```
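As a sanity check (an addition, not part of the original answer), the partial sum with 17 terms can be compared against `math.atan`: since $\operatorname{arccot}(x) = \pi/2 - \arctan(x)$, the two series share the same terms.

```python
import math

x = 0.5
# Partial sum of the first 17 terms (n = 0..16) of the alternating series
s = sum((-1) ** n * x ** (2 * n + 1) / (2 * n + 1) for n in range(17))

# By the alternating series bound, the error is at most the first
# omitted term, which is below 1e-12.
print(abs(s - math.atan(x)))   # < 1e-12
```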
Problem 6 [2 points] Consider the expression $1024 + x$. Derive for what values of $x$ this expression evaluates to 1024.
Answer:
In IEEE 754 double precision, $\varepsilon = 2^{-52} \approx 2.22 \times 10^{-16}$.
From the definition of machine epsilon ($\varepsilon$), the difference between $2^e$ and the next representable number is proportional to $2^e$; that is, the spacing is $2^e \varepsilon$.
Thus the expression evaluates to 1024 (with round to nearest) for every $x$ whose magnitude is within half the spacing around $1024 = 2^{10}$, that is $|x| < \frac{1}{2} \cdot 2^{10}\varepsilon$.
Substituting the values $\varepsilon = 2^{-52}$ and $2^{10}$:
⇒ $|x| < \frac{1}{2} \cdot 2^{10} \cdot 2^{-52} = 2^{-43} \approx 1.14 \times 10^{-13}$ (more precisely, the spacing just below $1024$ is halved, so on the negative side the range shrinks to about $-2^{-44} < x$).
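This range can be verified directly in Python, where `math.ulp(1024.0)` gives the spacing $2^{-42}$ (an illustrative check, assuming the expression is evaluated in double precision):

```python
import math

# Spacing between 1024 and the next representable double is 2^-42.
print(math.ulp(1024.0) == 2.0 ** -42)   # True

# x below half the spacing is absorbed: 1024 + x rounds back to 1024.
print(1024.0 + 2.0 ** -44 == 1024.0)    # True

# x at a full spacing is representable and survives.
print(1024.0 + 2.0 ** -42 > 1024.0)     # True
```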
Problem 7 [2 points] Give an example in base-10 floating-point arithmetic when a. $(a + b) + c \ne a + (b + c)$ b. $(a + b)\,c \ne a\,c + b\,c$
Answer:
For the first example $(a + b) + c \ne a + (b + c)$, assuming double precision:
Let $a = 1$, $b = c = 2^{-53}$:
⇒ $(a + b) + c = 1$, whereas $a + (b + c) = 1 + 2^{-52} > 1$.
The explanation from Problem 6 can be used here: since $2^{-53}$ is half the spacing between $1$ and the next representable number, $\mathrm{fl}(1 + 2^{-53}) = 1$ (the tie rounds to even), so the left-hand side stays at $1$; whereas $b + c = 2^{-52}$ is computed exactly, and $1 + 2^{-52}$ rounds up to the next representable number.
For the second example $(a + b)\,c \ne a\,c + b\,c$, assuming a decimal system with 3 significant digits. Let $a = b = 1.01$ and $c = 3.34$:
⇒ $(a + b)\,c = 2.02 \times 3.34 = 6.7468$ (rounded to $6.75$), whereas $a\,c = b\,c = 3.3734$ (rounded to $3.37$) and $3.37 + 3.37 = 6.74$.
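Both examples can be checked in Python; the `decimal` module emulates the 3-significant-digit base-10 system (the specific numbers are the illustrative choices used above):

```python
from decimal import Decimal, getcontext

# (a) Associativity of addition fails in double precision.
a, b, c = 1.0, 2.0 ** -53, 2.0 ** -53
lhs = (a + b) + c        # 2^-53 is absorbed twice -> 1.0
rhs = a + (b + c)        # b + c = 2^-52 is exact and survives
print(lhs == rhs)        # False

# (b) Distributivity fails in 3-significant-digit decimal arithmetic.
getcontext().prec = 3
p, q, r = Decimal("1.01"), Decimal("1.01"), Decimal("3.34")
left = (p + q) * r       # 2.02 * 3.34 = 6.7468 -> 6.75
right = p * r + q * r    # 3.3734 -> 3.37, then 3.37 + 3.37 = 6.74
print(left, right)       # 6.75 6.74
```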
Problem 8 [8 points] Consider a binary floating-point (FP) system with normalised FP numbers and 8 binary digits after the binary point:
$x = \pm\, 1.d_1 d_2 \cdots d_8 \times 2^e, \quad d_i \in \{0, 1\}$
For this problem, assume that we do not have a restriction on the exponent $e$. Name this system B8.
(a) [2 points] What is the value (in decimal) of the unit roundoff in B8?
(b) [1 point] What is the next binary number after $1.10011001$?
(c) [5 points] The binary representation of the decimal $0.1$ is infinite: $0.1 = (0.0001\overline{1001})_2$. Assume it is rounded to the nearest FP number in B8. What is this number (in binary)?
Answer:
The B8 system can also be defined as $\pm\, 1.d_1 d_2 \cdots d_8 \times 2^e$, $d_i \in \{0, 1\}$.
(a). For a binary FP system with $t$ binary digits after the binary point (rounding to nearest), the unit roundoff is given by $u = \frac{1}{2} \cdot 2^{-t}$.
With $t = 8$, the unit roundoff for this system in decimal is $u = 2^{-9} = 0.001953125 \approx 1.95 \times 10^{-3}$.
(b). Given $1.10011001$ in binary, the next binary number can be calculated by adding one unit in the last (8th) place:

```
  1.10011001
+ 0.00000001
= 1.10011010
```
(c).
Normalising, $0.1 = 0.000110011001\ldots_2 = 1.100110011\ldots_2 \times 2^{-4}$. We look at the first 9 digits of the significand after the binary point to determine how to round: $1.10011001\,|\,1\ldots$
Since the 9th digit is 1 (with nonzero digits following) → round the 8-digit significand up.
Therefore, $0.1$ rounded to the nearest FP number in B8 is $1.10011010_2 \times 2^{-4}$ in binary.
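The rounding in part (c) can be reproduced exactly with rational arithmetic (a verification sketch; B8 itself is the system defined in the problem):

```python
from fractions import Fraction

# 0.1 = m * 2^-4 with m in [1, 2); round m to 8 binary fraction digits.
m = Fraction(1, 10) * 2 ** 4            # exact significand, 1.6
m_rounded = round(m * 2 ** 8)           # integer significand on 9 bits

print(format(m_rounded, "b"))           # 110011010, i.e. 1.10011010
print(Fraction(m_rounded, 2 ** 8))      # 205/128, the rounded significand
```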
Problem 9 [10 points] For a scalar function $f$ consider the derivative approximations
$f'(x_0) \approx g_1(x_0, h) = \frac{f(x_0 + 2h) - f(x_0)}{2h}$
and
$f'(x_0) \approx g_2(x_0, h) = \frac{f(x_0 + h) - f(x_0 - h)}{2h}$.
a. [4 points] Let $f(x) = e^{\sin x}$ and $x_0 = \pi/4$.
- Write a Matlab program that computes the errors $|f'(x_0) - g_1(x_0, h)|$ and $|f'(x_0) - g_2(x_0, h)|$ for each $h = 10^{-k}$, $k = 1, 1.5, 2, \ldots, 16$.
- Using `loglog`, plot on the same plot these errors versus $h$. Name your program derivative_approx.m.
For each of these approximations:
b. [4 points] Derive the value of $h$ for which the error is the smallest.
c. [2 points] What is the smallest error and for what value of $h$ is it achieved? How does this value compare to the theoretically "optimum" value?
Answer:
(a).
```matlab
function derivative_approx()
    % Define the function f and its derivative
    f = @(x) exp(sin(x));
    df = @(x) cos(x) * exp(sin(x));
    % Define the approximation functions g1 and g2
    g1 = @(x, h) (f(x + 2*h) - f(x)) / (2*h);
    g2 = @(x, h) (f(x + h) - f(x - h)) / (2*h);
    % Define x0
    x0 = pi/4;
    % Define k values and compute h values
    k_values = 1:0.5:16;
    h_values = 10.^(-k_values);
    % Initialize error arrays
    errors_g1 = zeros(size(h_values));
    errors_g2 = zeros(size(h_values));
    % Compute errors for each h_value
    for i = 1:length(h_values)
        h = h_values(i);
        errors_g1(i) = abs(df(x0) - g1(x0, h));
        errors_g2(i) = abs(df(x0) - g2(x0, h));
    end
    % Find the h value for which the error is the smallest for each approximation
    [~, idx_min_error_g1] = min(errors_g1);
    [~, idx_min_error_g2] = min(errors_g2);
    h_min_error_g1 = h_values(idx_min_error_g1);
    h_min_error_g2 = h_values(idx_min_error_g2);
    % Display the h values for the smallest errors
    fprintf('For g1, the smallest error is at h = %e\n', h_min_error_g1);
    fprintf('For g2, the smallest error is at h = %e\n', h_min_error_g2);
    % Plot errors using loglog
    loglog(h_values, errors_g1, '-o', 'DisplayName', '|f''(x_0) - g_1(x_0, h)|');
    hold on;
    loglog(h_values, errors_g2, '-x', 'DisplayName', '|f''(x_0) - g_2(x_0, h)|');
    hold off;
    % Add labels, title, and legend
    xlabel('h');
    ylabel('Error');
    title('Errors in Derivative Approximations');
    legend;
    grid on;
end
```
(b).
The Taylor series expansion of the function $f$ around the point $x$ is:
$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots$
For the first approximation $g_1$, with the Taylor series expansion
$f(x + 2h) = f(x) + 2h f'(x) + 2h^2 f''(x) + O(h^3)$
for the numerator, we get $g_1(x, h) = f'(x) + h f''(x) + O(h^2)$, so the truncation error is $\approx h\,|f''(x)|$. The rounding error in the difference $f(x+2h) - f(x)$ is about $\varepsilon |f(x)|$, contributing $\approx \varepsilon/(2h)$ after the division.
Hence the error term is $E_1(h) \approx c_1 h + \frac{\varepsilon}{2h}$
⇒ setting $E_1'(h) = 0$ gives the optimum $h^* \sim \sqrt{\varepsilon} \approx 10^{-8}$.
For the second approximation $g_2$: expanding $f(x+h)$ and $f(x-h)$, the even-order terms cancel and $g_2(x, h) = f'(x) + \frac{h^2}{6} f'''(x) + O(h^4)$, so the error term is $E_2(h) \approx c_2 h^2 + \frac{\varepsilon}{2h}$, minimised at $h^* \sim \varepsilon^{1/3} \approx 6 \times 10^{-6}$.
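The derived orders can be confirmed numerically: reducing $h$ tenfold should shrink the $g_1$ error about 10-fold and the $g_2$ error about 100-fold while truncation error dominates. A Python check with the same $f$ and $x_0$ (an illustration alongside the MATLAB program):

```python
import math

f = lambda x: math.exp(math.sin(x))
df = lambda x: math.cos(x) * math.exp(math.sin(x))
x0 = math.pi / 4

g1 = lambda x, h: (f(x + 2 * h) - f(x)) / (2 * h)
g2 = lambda x, h: (f(x + h) - f(x - h)) / (2 * h)

err1 = lambda h: abs(df(x0) - g1(x0, h))
err2 = lambda h: abs(df(x0) - g2(x0, h))

# In the truncation-dominated regime the error ratios reveal the order.
print(err1(1e-2) / err1(1e-3))   # ~10  (first order)
print(err2(1e-2) / err2(1e-3))   # ~100 (second order)
```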
(c).
Running the program: for $g_1$, the smallest error is at $h = 10^{-8}$; for $g_2$, the smallest error is at $h = 3.162278 \times 10^{-6}$. These agree well with the theoretical optima $h^* \sim \sqrt{\varepsilon} \approx 10^{-8}$ for $g_1$ and $h^* \sim \varepsilon^{1/3} \approx 6 \times 10^{-6}$ for $g_2$ (the program samples only $h = 10^{-k}$ for half-integer $k$, so $3.16 \times 10^{-6} = 10^{-5.5}$ is the closest sampled value).
Problem 10 [7 points] In the Patriot disaster example, the decimal value 0.1 was converted to a single precision number with chopping.
Suppose that it is converted to a double precision number with chopping.
(a). [5 points] What is the error in this double precision representation of 0.1?
(b). [2 points] What is the error in the computed time after 100 hours?
Answer:
(a).
Given the binary representation of in double precision:
- Sign:
- Exponent: , which is 1019 in decimal ⇒ effective exponent is
- Significand: the binary digits will be chopped off at 52 bit. Therefore, and thus
(b).
After 100 hours:
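The chopping error and its accumulation can be computed exactly with rational arithmetic (a verification sketch of the numbers above):

```python
from fractions import Fraction

# Chop 0.1 = 2^-4 * 1.f to 52 fraction bits: keep floor(0.1 * 2^56) / 2^56.
exact = Fraction(1, 10)
chopped = Fraction(exact.numerator * 2 ** 56 // exact.denominator, 2 ** 56)

err = exact - chopped
print(float(err))                 # ~8.33e-18  (= 0.6 * 2^-56)

# 100 hours of counting in 0.1 s increments: 3,600,000 additions.
total = 3_600_000 * err
print(float(total))               # ~3.0e-11 seconds
```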