Numerically computing any non-trivial function (from the trigonometric functions through to special functions like the complex gamma function) is a large field of numerical computing, and one in which I am interested in expanding my knowledge. In this article, I describe my recent exploration of how the exponential function can be implemented numerically.

## Introduction

A recent pull request in the Julia repository (PR 36761) proposes speeding up
the calculation of the exponential function (`\(e^x\)`) for double-precision floating-point values.
While reading through the code and comments, I found myself wondering how some of the
“magic constants” were being derived, especially because they are similar — but *not* identical —
to the Taylor expansion coefficients any mathematician or physicist is likely to recognize on
sight.
Curiosity led me to devote the past few days to investigating some of the standard techniques
used in implementing numerical approximations of special functions, so this article serves as
a useful reference for myself (and hopefully others) on the techniques I’ve learned.

## Taylor Series

To any physicist, the first obvious choice for approximating any (well-behaved) function is
its Taylor series expansion.
For an infinitely differentiable function `\(f(x)\)` at `\(x = a\)`, the Taylor series is defined
as the infinite sum
`\begin{align*} f(x) = \sum_{n = 0}^{\infty} \frac{1}{n!} \frac{d^n f(a)}{dx^n} (x - a)^n \end{align*}`
where `\(!\)` denotes the factorial function and `\(d^nf(a)/dx^n\)` is the `\(n\)`-th derivative of
`\(f(x)\)` evaluated at `\(x = a\)`.^{1}
The approximation is made by truncating the infinite series to a finite number of terms,
giving a finite-degree polynomial approximation to the function `\(f(x)\)`.

There is an immediately obvious problem with just using the Taylor series, though —
what value of `\(x = a\)` do you choose to expand about?
The exponential function is defined on the entire real line, but you only have one
point about which the function is well approximated, with the approximation growing
poorer the further from `\(a\)` you go.
The span of the well-approximated region can be increased by including more terms in
the series, but this is a strategy of diminishing returns given that increasing the
accuracy to arbitrary precision everywhere requires an infinite series.
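As a quick illustration of this diminishing-returns behavior (a standalone sketch; the names `taylor5` and `relerr` are mine, not part of any implementation below), a fixed-degree truncation evaluated far from the expansion point is dramatically worse than the same polynomial evaluated nearby:

```julia
# Degree-5 Taylor polynomial of e^x about x = 0, evaluated via Horner's rule.
coeff = (1.0, 1.0, 1/2, 1/6, 1/24, 1/120)
taylor5(x) = evalpoly(x, coeff)

# Relative error with respect to the true exponential.
relerr(x) = abs(taylor5(x) - exp(x)) / exp(x)

relerr(0.1)  # tiny near the expansion point
relerr(3.0)  # order-10% error a few units away
```

The same five terms that are essentially exact near `\(x = 0\)` are off by several percent at `\(x = 3\)`, which is exactly the behavior visible in the error plots below.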

In general, there are also issues of ensuring convergence of the series (i.e. the
series does not have to converge everywhere for a given choice of `\(a\)`), and the
rapidly growing size of `\(n!\)` and high powers of `\((x-a)^n\)` will quickly lead to issues
of numerically computing their values without under/overflow.

Despite these problems, though, let us ignore them for now and proceed to investigate
the family of Taylor series approximations `\(P_n(x)\)` of degree `\(n\)`.
The Taylor series for `\(e^x\)` about `\(x = 0\)` is
`\begin{align} e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \ldots \end{align}`
which can be easily coded up as a generated function (so that the factorial coefficients
need only be calculated once):

```
@generated function exp_taylor(x::Float64, ::Val{N}) where {N}
    coeff = ntuple(i -> inv(factorial(i - 1)), N + 1)  # (1/0!, 1/1!, ..., 1/N!)
    return :(evalpoly(x, $coeff))
end
```

The following figure plots the first- through third-degree Taylor series approximations
versus `\(e^x\)` (top panel), along with the absolute error `\(\Delta = e^x - P_n(x)\)` (middle)
and relative error `\(\epsilon = \Delta / e^x\)` (bottom).

With just this simple plot, we can already make some useful observations, both good and
bad.
First, a good feature is that the Taylor series expansion is almost trivial to implement —
we defined it using only 4 lines of code, and this works up to a 20th-degree series.
(After that `factorial(n)` overflows, so continuing to add degrees would require a bit more
effort to work around.)
Second, we also see the expected behavior that adding higher-degree terms to the series
improves the approximation, with the third-degree approximation having uniformly smaller
error than the first-degree approximation.

On the other hand, though, there are some problems evident as well.
For one, the odd-degree polynomial approximations lead to values which are generally biased low.
This is clearly visible for the first-degree polynomial, where the orange line in the upper panel
is everywhere below the black line except at `\(x = 0\)` where they are equal.
Ideally we’d like to have an approximation which gives an unbiased answer over some domain.

Next, the error is distributed unevenly in both the absolute and relative sense, and we
do not have a way to tune or control that other than to just increase the number of terms
in the series and hope for the best.
In particular, the relative error rapidly increases for `\(x < 0\)` where the exponential takes
on very small values, and a calculation based on a value which deviates from its true value
by `\(>100\%\)` is likely to give problematic results.

Thankfully, we have plenty of tools yet to use which can address each of these problems.

## Range reduction

Let us first address the problem of choosing where to center the Taylor expansion.
In the previous section I chose to expand the exponential about `\(x = 0\)`, and that
caused the approximations to be best only near that point.
The general technique of “range reduction” aims to reformulate a calculation in
terms of multiple (hopefully easy) calculations that shift one domain (here infinite)
to a smaller one of our choice.

The simplest case of such a range reduction is for a periodic function, where only a single
period of the domain has to be considered.
For instance, `\(\cos(x + 2n\pi) = \cos(x)\)` for all integers `\(n\)`, so the Taylor approximation
only has to converge to the required accuracy for a domain like `\(x \in [0, 2\pi)\)`, and all
other inputs are remapped into the range.
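As a concrete sketch of this periodic reduction (using Julia's built-in `rem2pi`, which is just my illustration of the idea, not how any library kernel is required to do it):

```julia
# Reduce an arbitrary argument into a single period before evaluating the kernel.
# `rem2pi(x, RoundNearest)` returns r with |r| ≤ π and x ≡ r (mod 2π).
x = 1234.5678
r = rem2pi(x, RoundNearest)

# Periodicity guarantees the reduced argument gives the same answer:
cos(r) ≈ cos(x)
```

Any polynomial kernel for `cos` then only needs to be accurate on the single reduced period rather than the whole real line.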

We can do something analogous for the exponential function, but instead of periodicity
we will exploit the identity `\(e^{a + b} = e^a e^b\)` for some carefully chosen values of
`\(a\)` and `\(b\)`.

Let `\(x = k \ln 2 + r\)` for some integer `\(k\)`, which implies that the residual `\(r\)` is
limited to the finite domain `\(|r| \le \frac{1}{2}\ln 2\)`.
Then substituting and simplifying with some exponential-logarithmic identities, we
find that `\(e^x\)` can be re-expressed as
`\begin{align*} e^x = e^{k\ln 2 + r} = e^{k\ln 2} e^r = 2^k e^r \end{align*}`
This is a manifestly simpler expression because the factor of `\(2^k\)` is easy to compute.
Standard floating point values^{2} are stored in base-2 “scientific notation” in terms
of a fractional part `\(f\)` and a power of two `\(p\)` such that `\(z = f \cdot 2^p\)`.
Because of this representation, the particular power `\(2^k\)` can be constructed by a few fast
bit manipulations, and then the answer is a simple product of two floating point values
(where we assume we can calculate `\(e^r\)`, such as with a Taylor series approximation).

Rearranging the expression to solve for `\(k\)`, we invoke the requirement that `\(k\)` be an
integer to eliminate the residual `\(r\)` so that
`\begin{align*} k = \operatorname{round}\left(\frac{x}{\ln 2}\right) = \left\lfloor \frac{x}{\ln 2} + \frac{1}{2} \right\rfloor \end{align*}`
where `\(\operatorname{round}(z)\)` rounds to the nearest integer (so that the residual
satisfies `\(|r| \le \frac 1 2 \ln 2\)`), and the rightmost expression just rewrites the
rounding as the floor function.
Then the residual is obtained by simply solving the expression again, but this time
for `\(r\)` now that `\(k\)` is known:
`\begin{align*} r = x - k\ln 2 \end{align*}`

We can then put this range reduction and the Taylor series approximation from the previous
section together^{3} to make a range-reducing Taylor series approximation to `\(e^x\)`.

```
function reduce_exp(x::Float64)
    ln2 = 0.6931471805599453     # log(2)
    invln2 = 1.4426950408889634  # 1 / log(2)
    k = floor(muladd(x, invln2, 0.5))  # ⌊ x/log(2) + 1/2 ⌋
    r = muladd(k, -ln2, x)             # x - k * ln2
    return (trunc(Int, k), r)
end

function exp_reduce_taylor(x::Float64, n::Val{N}) where {N}
    k, r = reduce_exp(x)
    return exp2(k) * exp_taylor(r, n)  # 2^k * exp(r)
end
```
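A quick check that the reduction-plus-kernel pipeline reproduces the exponential (a standalone sketch restating the pieces above with my own hypothetical names `reduce_exp_demo`, `taylor_coeffs`, and `exp_demo` so it runs on its own):

```julia
# Same reduction as reduce_exp: x = k*ln2 + r with |r| ≤ ln(2)/2.
function reduce_exp_demo(x)
    ln2 = 0.6931471805599453
    k = floor(muladd(x, 1.4426950408889634, 0.5))
    return trunc(Int, k), muladd(k, -ln2, x)
end

# Taylor coefficients (1/0!, 1/1!, ..., 1/N!) for the kernel on the reduced range.
taylor_coeffs(N) = ntuple(i -> 1 / factorial(i - 1), N + 1)

# 2^k * P_N(r), the range-reduced approximation of e^x.
function exp_demo(x, N)
    k, r = reduce_exp_demo(x)
    return exp2(k) * evalpoly(r, taylor_coeffs(N))
end

exp_demo(10.3, 12) ≈ exp(10.3)   # accurate far outside |x| ≤ ln(2)/2
```

Even a modest degree-12 kernel now matches the library `exp` to near machine precision over a wide range, because the polynomial only ever sees `\(|r| \le \frac12 \ln 2\)`.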

Figure 3.1 is analogous to Fig. 2.1 from the previous section but instead calls our new
`exp_reduce_taylor()` function which does range reduction.
The first obvious feature is the periodicity of the relative error across the domain caused
by the range reduction and scaling.
This technique has allowed us to constrain the relative error to a bounded range everywhere
(and equal to the error near `\(x = 0\)`), and the bound is progressively tightened by adding
higher-order terms to the Taylor series.
Note that unlike the relative error, though, the absolute error grows as
`\(x \rightarrow \infty\)` because any approximation error in `\(e^r\)` is then scaled by the factor
of `\(2^k\)`.

While we have successfully improved upon the problem of choosing a suitable point to expand
about and the problem of an infinite domain, we still see the biased average result in the
odd degrees.
Furthermore, we’ve added a new problem with the fact that the numerical function is no longer
continuous across the `\(\left(k + \frac{1}{2}\right)\ln 2\)` boundaries.

To solve these problems, we have to move beyond the Taylor series approximation.

## Minimax polynomial

To take a new approach, let us consider some conditions we’d like our approximation to
satisfy.
First, the approximation should be a “good” approximation, which we can quantify by
requiring that the absolute (relative) error be minimized over all possible `\(n\)`-th degree
polynomials `\(P_n(x)\)`:
`\begin{align} \newcommand{\min}{\operatorname*{min}} \newcommand{\max}{\operatorname*{max}} \text{absolute error}\quad &\Rightarrow \quad \min_{P_n} \big\lVert f(x) - P_n(x) \big\rVert_\infty \\ \text{relative error}\quad &\Rightarrow \quad \min_{P_n} \left\lVert \frac{f(x)-P_n(x)}{f(x)} \right\rVert_\infty \end{align}`
where `\(\|\cdot\|_\infty\)` is notation for the infinity norm, which reduces to the
maximum of the expression.
The solution `\(P_n(x)\)` is then called the minimax polynomial.
To generically describe both the absolute and relative error minimizations simultaneously,
we’ll add a weight function `\(w(x)\)` so that we are instead minimizing
`\begin{align} \min_{P_n} \Big\lVert w(x) \big[ f(x) - P_n(x) \big] \Big\rVert_\infty \end{align}`
where `\(w(x) = 1\)` to minimize the absolute error and `\(w(x) = 1 / f(x)\)` to minimize the
relative error.

The solution to this minimization problem is given by the Remez algorithm, which is a method for iteratively improving the polynomial approximation, and it can be easily understood if you keep a couple of facts about polynomials in mind.

First, an `\(n\)`-th degree polynomial has `\(n + 1\)` degrees of freedom (coefficients)
and at most `\(n\)` roots.
For an arbitrary polynomial, the roots may be repeated or complex, but we can ensure the
polynomial has all distinct and real roots by adding an additional constraint — if the
polynomial is constructed such that it has `\(n - 1\)` local extrema that successively change
sign, then all `\(n\)` roots will be real and distinct.

Second, recall that all non-constant (`\(n > 0\)`) polynomials will always diverge as
`\(|x| \rightarrow \infty\)`, so to constrain the error over a finite domain `\([a, b]\)`, we are
motivated to augment the interior local extrema of the error function with the endpoints.
This gives us an ordered list of `\(n + 2\)` points
`\((a = x_0) < x_1 < x_2 < \ldots < x_n < (x_{n+1} = b)\)`
at which the error function is to be minimized.

The minimax condition is then achieved when the error function evaluated at each of these
extrema is equal in magnitude.
(This is the equioscillation theorem.)
That is to say, we are seeking the polynomial that satisfies the system of equations
over all of the extremal nodes given by
`\begin{align*} e(x_i) = (-1)^i E = w(x_i) \big[ f(x_i) - P_n(x_i) \big] \end{align*}`
for `\(i = \{0, 1, \ldots, n+1\}\)`.
Note that this constraint adds another degree of freedom (the error `\(E\)`) to the system
(on top of the polynomial coefficients) so that there are `\(n+2\)` undetermined coefficients
and `\(n+2\)` equations, and therefore the system is solvable.

Rewritten into the form of a matrix equation, we can formally solve for the
polynomial coefficients `\(a_i\)` — where `\(P_n(x) = \sum_{k=0}^n a_k x^k\)` — and the maximum
deviation `\(E\)` with the weights and oscillation constraint as:

`\begin{align} \newcommand{\vs}{\vphantom{(-1)_{n+1}^{n+1}}} \begin{bmatrix} 1 & x_0 & x_0^2 & \ldots & x_0^n & 1/w(x_0) \vs \\ 1 & x_1 & x_1^2 & \ldots & x_1^n & -1/w(x_1) \vs \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & x_n & x_n^2 & \ldots & x_n^n & (-1)^n/w(x_n) \vs \\ 1 & x_{n+1} & x_{n+1}^2 & \ldots & x_{n+1}^n & (-1)^{n+1}/w(x_{n+1}) \vs \end{bmatrix} \begin{bmatrix} a_0\vs \\ a_1\vs \\ \vdots \\ a_n\vs \\ E\vs \end{bmatrix} &= \begin{bmatrix} f(x_0)\vs \\ f(x_1)\vs \\ \vdots \\ f(x_n)\vs \\ f(x_{n+1})\vs \end{bmatrix} \label{eqn:sys} \end{align}`

You may have noticed by now, though, that there are actually twice as many unknowns as
equations, since we haven’t actually stated how to *a priori* choose the nodes `\(x_i\)`
other than to say they should be the local extrema of the minimax polynomial’s error function.
This is where the iterative part of the Remez algorithm comes in — instead of solving
the system of equations exactly, we make an initial guess at some nodal
points and compute the least-squares solution.
This will almost surely fail to produce the minimax polynomial with equal-magnitude
oscillations of the error function, but we can use this best-fit estimate to update our
choice of nodes and repeat.
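To see what a single solve of that system looks like, here is a standalone sketch for a degree-1 fit to `\(e^x\)` on `\([0, 1]\)` with hand-picked (not yet converged) nodes and `\(w(x) = 1\)`:

```julia
using LinearAlgebra

# n + 2 = 3 trial nodes for a degree n = 1 polynomial; unknowns are (a0, a1, E).
nodes = [0.0, 0.5, 1.0]
# Rows are [1, x_i, (-1)^i / w(x_i)] per the system above (w ≡ 1 here).
R = [1.0  nodes[1]   1.0;
     1.0  nodes[2]  -1.0;
     1.0  nodes[3]   1.0]
a = R \ exp.(nodes)   # a = (a0, a1, E)

# By construction the error alternates in sign with magnitude |E| at the nodes:
err = exp.(nodes) .- (a[1] .+ a[2] .* nodes)
```

The solve forces equal-magnitude, alternating errors *at the trial nodes*; the Remez iteration's remaining job is to move the nodes until the true extrema of the error share that property.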

Finally, we are ready to state the full iterative Remez algorithm:

1. Choose an initial vector of trial nodes `\(\mathbf{x}\)`.
2. Calculate the regressor matrix `\(\mathbf{R}\)` (the square matrix on the left-hand side of Eqn. `\(\ref{eqn:sys}\)`) and the vector of function values `\(\mathbf{f} = f(\mathbf{x})\)` (the vector on the right-hand side).
3. Solve (in a least-squares sense) the system of equations `\(\mathbf{R} \mathbf{a} = \mathbf{f}\)` for the vector of polynomial coefficients and deviation magnitude.
4. For the best-fit polynomial, find the local extrema `\(\mathbf{x'}\)` and compare their deviation magnitudes. If they are equal (or sufficiently close within some tolerance), the current solution is the minimax polynomial. Quit iterating; the polynomial is defined by all but the last entry in the `\(\mathbf{a}\)` vector.
5. Otherwise, replace the nodes `\(\mathbf{x}\)` with the locations of the extrema (i.e. let `\(\mathbf{x} \leftarrow \mathbf{x'}\)`) and return to step 2 to iterate the process.

For additional context on solving a system of equations like this built from
evaluating basis functions over a set of points, see my previous article on
building linear regression matrix operators.
For our purposes, though, it is sufficient to just know that the least-squares
regression can be obtained by using Julia’s built-in backslash (`\`) operator
and/or `LinearAlgebra.ldiv!` methods.

We’ll also skip working through the details of finding the extrema, except to say that
it can be easily accomplished through a couple more iterative algorithms that first
find the zeros of the error function (`rootbisect()` in the code below) and then
search for an extremum between each pair of roots (`extremumsearch()`).
The code for these functions is included in a hidden section which can be shown by
clicking the link below the following code block.

Finally, the only remaining definition we require is a method for choosing the initial
node distribution.
We’ve already argued that the endpoints are two of the nodes, but we have `\(n\)` more to
choose.
In theory almost any distribution of non-overlapping nodes should work, but the
convergence could be very slow if they are very poor initial estimates.
A common suggestion is to use the zeros of the Chebyshev polynomial of degree `\(n\)` remapped
from `\([-1,1]\)` to `\([a,b]\)`.
In the definition below, I have chosen a modified, Chebyshev-like distribution^{4}
that differs by naturally including the endpoints.

The following `remez()` function is a translation of this process.
It takes as arguments the exact function to approximate `f`, the endpoints
`a` and `b`, the degree of the polynomial to produce `N`, and an optional
weight function `w` (defaulting to the unit function, which as you’ll recall corresponds
to finding the absolute-error minimax approximation).

```
using LinearAlgebra

function remez(f, a, b, N, w = one; atol = sqrt(eps(b - a)), maxiter = 20)
    # Initialize node locations
    nodes = @. (a + b)/2 - (b - a)/2 * cospi((0:N+1) / (oftype(a, N) + 1))
    # Working space for exact function values, the regressor matrix, and the
    # locations of the roots of the error function
    fvals = similar(nodes)
    regr = similar(nodes, N + 2, N + 2)
    roots = similar(nodes, N + 1)
    # Weighted error function
    coeff = @view fvals[1:N+1]
    err(x) = w(x) * (f(x) - evalpoly(x, coeff))
    # Difference in magnitude between the error function and best-fit deviation constant
    dev(x) = abs(abs(err(x)) - abs(fvals[N+2]))
    iter = 0
    while true
        # Update the exact solution and the regressor; the last column holds the
        # alternating-sign terms (-1)^i / w(x_i) from the matrix equation.
        regr[:, 1:N+1] .= nodes .^ transpose(0:N)
        regr[:, N+2] .= ((-1)^(i - 1) / w(nodes[i]) for i in 1:N+2)
        # Perform the regression, overwriting the exact solution
        fvals .= f.(nodes)
        ldiv!(factorize(regr), fvals)  # coeff .= regr \ fvals
        iter += 1
        if iter > maxiter
            @warn "Maximum number of iterations reached. Quitting."
            break
        end
        # Search for each root between successive pairs of nodes.
        roots .= (rootbisect(err, nodes[i], nodes[i+1]) for i in 1:N+1)
        # And then update all the internal nodes with the locations of the actual
        # local extrema
        nodes[2:N+1] .= (extremumsearch(err, roots[i], roots[i+1]) for i in 1:N)
        # Stopping condition is when the actual deviations differ from the best-fit
        # deviation coefficient in magnitude by less than `atol`.
        maximum(dev, nodes) ≤ atol && break
    end
    return (nodes, copy(coeff))
end
```

The following helper functions to find the zeros and extrema of the error functions are very simple implementations of the bisection method for finding roots and the golden-section search for finding the extrema. In particular, these are not robust against all possible inputs and are not guaranteed to work in all cases, but they were simple to code and are sufficient for the few cases shown here.

```
"""
Simple bisecting search for the zero of function ``f(x)`` in the interval ``a ≤ x ≤ b``.
"""
function rootbisect(f, a, b; atol = sqrt(eps(abs(b - a))))
    fa, fb = f(a), f(b)
    abs(fa) < atol && return a  # Left bound is a zero
    abs(fb) < atol && return b  # Right bound is a zero
    signbit(fa) == !signbit(fb) ||  # no zero crossing is an error
        error("Invalid interval — does not cross zero")
    while true
        m = (a + b) / 2  # bisect halfway between the bounds
        fm = f(m)        # and recalculate f(m)
        abs(fm) < atol && return m  # found the zero
        # If f(a) and f(m) are the same sign, then drop the lower half of the
        # remaining interval
        if signbit(fa) == signbit(fm)
            a, fa = m, fm
        # otherwise the upper half has been excluded.
        else
            b, fb = m, fm
        end
    end
end

"""
Search for an extremum of ``f(x)`` in the interval ``a ≤ x ≤ d``.
"""
function extremumsearch(f, a, d; atol = sqrt(eps(abs(d - a))))
    T = typeof(atol)
    ϕinv = T(2) / (one(T) + sqrt(T(5)))
    b = d - ϕinv * (d - a)
    c = a + ϕinv * (d - a)
    fa, fb, fc, fd = abs.(f.((a, b, c, d)))
    while true
        # Stop when the search bounds have converged to sufficient accuracy
        d - a < atol && return (a + d) / 2
        # The extremum must be in [a, c]
        if fb > fc
            c, d = b, c
            fc, fd = fb, fc
            b = d - ϕinv * (d - a)
            fb = abs(f(b))
        # otherwise the extremum must be in [b, d]
        else
            a, b = b, c
            fa, fb = fb, fc
            c = a + ϕinv * (d - a)
            fc = abs(f(c))
        end
    end
end
```

We are now *finally* ready to improve our approximation of the exponential function over
that given by a Taylor series.
As mentioned earlier, we can choose to minimize the error in an absolute or relative
sense, so we’ll do both by defining variations of `exp` optimized under both weighting
functions.
These kernel functions replace the Taylor-series approximation on the interval
`\(|x| \le \frac 1 2 \ln 2\)`, and we continue to use range reduction over the entire real
axis.

```
# Minimizes absolute error in |x| ≤ log(2)/2
@generated function exp_absminimax_kernel(x::Float64, ::Val{N}) where {N}
    ln2by2 = log(2.0) / 2
    _, coeff = remez(exp, -ln2by2, ln2by2, N)
    return :(evalpoly(x, $(tuple(coeff...))))
end

# Minimizes relative error in |x| ≤ log(2)/2
@generated function exp_relminimax_kernel(x::Float64, ::Val{N}) where {N}
    ln2by2 = log(2.0) / 2
    _, coeff = remez(exp, -ln2by2, ln2by2, N, x -> exp(-x))
    return :(evalpoly(x, $(tuple(coeff...))))
end

function exp_reduce_kernel(x::Float64, n::Val{N}, kernel) where {N}
    k, r = reduce_exp(x)
    return exp2(k) * kernel(r, n)  # 2^k * exp(r)
end

exp_absminimax(x::Float64, n::Val{N}) where {N} = exp_reduce_kernel(x, n, exp_absminimax_kernel)
exp_relminimax(x::Float64, n::Val{N}) where {N} = exp_reduce_kernel(x, n, exp_relminimax_kernel)
```

First, both variations of the minimax polynomial eliminate the systematic bias in the odd-degree polynomials that was present when using the Taylor series approximation.

Second, we can see that both variations do minimize the absolute and relative error,
respectively.
(This is probably easiest to see by comparing the last one and a half range-reducing
periods, from about `\(x = 1\)` to the right side.)
In the left column, which shows the absolute minimax approximation, the left and right
endpoints of each range-reducing period have the same deviation.
(Compare this against Fig. 3.1, where the lower and upper endpoints of the period have
different absolute errors.)
Then in the right column, which shows the relative minimax approximation, we see a similar
per-period relationship but now in the relative error measurement.

For the relative error case, we also gain continuity of the approximation across range-reducing periods in the odd-degree approximations. This is a direct consequence of the error polynomial being a leading-order-even function, combined with constraining the relative errors to be equal on the lower and upper edges of the range-reducing period.

Finally, let us directly compare the known Taylor series coefficients against the minimax polynomial coefficients (since, as mentioned in the introduction, the observation that the Julia pull request contained not-quite Taylor series coefficients is what initially motivated my exploration of this topic). The following table shows the Taylor series and relative minimax polynomial coefficients up to degree 3, along with the difference (minimax minus Taylor) and the relative percent change with respect to the Taylor series coefficients.

| Degree | Taylor series | Minimax poly | Abs. (Rel.) change |
| --- | --- | --- | --- |
| 0 | 1.0000000000000000 | 0.9999280752600668 | -7.192e-05 (-0.00719 %) |
| 1 | 1.0000000000000000 | 1.0001641903948264 | 0.0001642 (0.0164 %) |
| 2 | 0.5000000000000000 | 0.5049632650961922 | 0.004963 (0.993 %) |
| 3 | 0.1666666666666667 | 0.1656683995499798 | -0.0009983 (-0.599 %) |

The bottom-line conclusion is that only very small (less than 1%) tweaks to the Taylor series coefficients are necessary to transform the Taylor polynomial into the minimax polynomial, and the coefficients observed in the Julia pull request could reasonably be due to a minimax-like optimization.

## A more complex minimax example — `expm1`

Before we conclude, let’s take a quick look at a slightly more complex example of defining minimax polynomials.

A well-known issue with numerically calculating the exponential function is that
for `\(|x| \ll 1\)`, the answer is very nearly `\(1\)`, and accuracy in the result is lost
due to the limits of finite floating point representation.
Namely, calculating `exp(x)` for a value very nearly `\(0\)` and then subtracting `1.0`
is subject to so-called “catastrophic” cancellation.
Therefore, getting a high-accuracy result requires direct calculation of a special form of
the function — `expm1(x)`, which is defined to be `\(e^x - 1\)` — without intermediately computing
`exp(x)` and subtracting.
The `expm1` function is commonly provided in math libraries (including by the C99 standard
and, of course, by Julia’s base library).

Given the simple mathematical definition, we can consider how to form a numerical
approximation by expressing the function in terms of the exponential’s Taylor series
and simplifying:
`\begin{align*} e^x - 1 &= \left(\sum_{k=0}^{\infty} \frac{x^k}{k!}\right) - 1 \\ {} &= \left(1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \ldots\right) - 1 \\ {} &= x \left(1 + \frac{x}{2} + \frac{x^2}{6} + \ldots \right) \\ e^x - 1 &= x \sum_{k=0}^{\infty} \frac{x^k}{(k+1)!} \approx x P_n(x) \end{align*}`
Thus we have identified `\(e^x - 1\)` as the product of `\(x\)` and an infinite power series
which can be truncated and approximated by some `\(n\)`-th order polynomial.

To have the `remez()` function above solve for the minimax polynomial, we must manipulate
the error between `\(e^x - 1\)` and `\(x P_n(x)\)` for either absolute or relative
error minimization back into the standard form `\(w(x)\left[ f(x) - P_n(x) \right]\)`.
Doing so for the relative case, we simply divide through by the extra factor of `\(x\)` and
make the appropriate identifications:
`\begin{align*} \frac{\left[e^x - 1\right] - \left[x P_n(x)\right]}{\left[e^x - 1\right]} &= \frac{x}{e^x - 1}\left[ \frac{e^x - 1}{x} - P_n(x) \right] \\ {} &= w(x) \left[ f(x) - P_n(x) \right] \\ {} &\qquad\text{where}\quad f(x) \equiv \frac{1}{w(x)} \equiv \frac{e^x - 1}{x} \\ \end{align*}`
Note that both `\(f(x)\)` and `\(w(x)\)` limit to unity as `\(x \rightarrow 0\)`, so the functions
are well-defined.

For a generic solution, we also need to replace the range-reduction step with one which
accounts for the subtraction by one.
With a bit of algebra, it’s easy to show that for `\(x = k\ln 2 + r\)` as previously, we have
`\begin{align} e^x - 1 = 2^k(e^r - 1) + 2^k - 1 \end{align}`
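A quick numerical check of this identity (standalone, re-deriving `\(k\)` and `\(r\)` the same way `reduce_exp` does):

```julia
ln2 = 0.6931471805599453
x = 4.2
k = floor(muladd(x, 1 / ln2, 0.5))  # k = ⌊x/ln2 + 1/2⌋
r = muladd(k, -ln2, x)              # r = x - k*ln2

lhs = expm1(x)
rhs = muladd(exp2(k), expm1(r), exp2(k) - 1)  # 2^k * (e^r - 1) + (2^k - 1)
lhs ≈ rhs
```

Note that evaluating the right-hand side accurately relies on `expm1(r)` for the reduced argument, which is exactly what the minimax kernel below provides.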

Translating these definitions into code:

```
# Minimizes relative error in |x| ≤ log(2)/2
@generated function expm1_minimax_kernel(x::Float64, ::Val{N}) where {N}
    ln2by2 = log(2.0) / 2
    expm1_kernel(x) = iszero(x) ? one(x) : expm1(x) / x
    expm1_weight(x) = iszero(x) ? one(x) : x / expm1(x)
    # Solve for one fewer degree than requested to make comparisons with
    # exp(x)-1 fair — i.e. the highest degree in (x * P_n(x)) == N
    _, coeff = remez(expm1_kernel, -ln2by2, ln2by2, N - 1, expm1_weight)
    return :(x * evalpoly(x, $(tuple(coeff...))))  # Note the leading mult. by `x`
end

function expm1_minimax(x::Float64, n::Val{N}) where {N}
    k, r = reduce_exp(x)
    s = exp2(k)
    return muladd(s, expm1_minimax_kernel(r, n), s - 1.0)
end
```

and then directly comparing a third-order relative minimax approximation of
`exp(x)-1` versus `expm1(x)`:

The catastrophic cancellation for small `\(x\)` is visible in the bottom panel of the left
column, where the fractional error diverges as `\(x \rightarrow 0\)`.
This is because the absolute error (middle row of the left column) is non-zero near and at
`\(x = 0\)`, so since `\(\lim_{x\rightarrow 0} (e^x - 1) = 0\)`, the fractional error is unbounded.
In contrast, the right column shows that the direct minimax optimization of `expm1(x)`
actually constructs a minimax polynomial that approaches `\(0\)` as `\(x \rightarrow 0\)`, and
therefore the relative error is tightly bounded.

## Conclusions

For myself (and likely many physicists) the Taylor series is a well known and familiar method of constructing polynomial approximations of arbitrary functions. Numerically, though, the minimax polynomial is an alternative construction that provides more desirable properties, and it is a concept that any numerically-oriented person should probably have in their toolbox.

In this article, I ignored the problem of bootstrapping the implementation — i.e. we
optimized minimax polynomials against Julia’s built-in `exp(x::Float64)` and
`expm1(x::Float64)` functions — but this isn’t a deal-breaking problem.
In some cases you may want to build a custom, fast approximation of an intricate function
which can be accurately (but too slowly) calculated in terms of known functions.
Then this is just a performance optimization, and there’s no issue of how to bootstrap
the process.

If the initial bootstrapping of a special function is the goal, though, then we already have a perfectly serviceable definition — just calculate the exact answers to machine precision using a large number of terms in the Taylor series, and optimize for a minimax polynomial that requires fewer terms to achieve the same accuracy. In practice the use of extended-precision arithmetic will almost certainly be required to avoid loss of precision in standard floating point implementations, but that is not an insurmountable problem.

Finally, this is far from being a comprehensive explanation of how to write accurate
numerical calculations of special functions.
For instance, our range reduction for `exp` only contracted the domain down to the
interval `\(|x| \le \frac 1 2 \ln 2\)`, but a commonly-referenced extension of this idea
is Tang’s method, which makes use of a pre-computed table of non-integer powers of two
to further reduce the working range.
This reduces the polynomial degree required to calculate all answers to machine
precision.
(The Julia pull request linked to in the introduction uses a similar scheme, and the
range is reduced to `\(|x| \le \ln(2)/512\)`.)
There’s also an entire domain of implementation details to be considered to ensure the
calculation is actually performant on real hardware — e.g. optimizing for branch
(mis-)prediction, instruction pipelining, auto-vectorization, etc. — that are outside
the scope of this article to discuss.

1. Note that for `\(n = 0\)`, the `\(0\)`-th derivative of `\(f\)` is defined to be the function itself (e.g. `\(d^0 f(x)/dx^0 = f(x)\)`), and `\(0! = 1\)`, so that the first term always simplifies to just `\(f(a)\)`. ↩︎

2. This is technically not true for all possible ways of storing floating point values, but the IEEE 754 standard is used almost everywhere most users are likely to encounter floating point values. ↩︎

3. Using the `exp2()` function is not a subtle cheat and/or circular definition that itself depends on a properly defined generic `exp()` function when the argument is an integer type. You can verify with the Julia source code that it does exactly as I claimed and uses bit-level manipulations to directly construct the floating point value equal to `\(2^k\)`. ↩︎

4. The Chebyshev polynomial zeros are all contained *within* the interval `\(-1\)` to `\(1\)` (excluding the endpoints) and are distributed such that they correspond to evenly distributed points on a unit sphere mapped onto the `\(x\)`-axis. The left-hand panel below shows the `\(n\)` Chebyshev zeros for a 12th-degree polynomial as the blue points, with the two endpoints added in as the red points. Notice that the spacing between the first and second points and between the penultimate and last points is not equal to the spacing between the rest of the successive pairs. The initial distribution of points used in `remez()` is very similar but modified to directly produce `\(n + 2\)` points distributed uniformly across the unit sphere *with* the endpoints included. This case is shown in the right-hand panel. ↩︎