Numerically computing the exponential function with polynomial approximations

Numerically computing any non-trivial function (spanning the trigonometric functions through to special functions like the complex gamma function) is a large field of numerical computing, and one I am interested in expanding my knowledge within. In this article, I describe my recent exploration of how the exponential function can be implemented numerically.


Introduction

A recent pull request in the Julia repository (PR 36761) was proposed that aims to speed up the calculation of the exponential function (\(e^x\)) for double floating point values. While reading through the code and comments, I found myself wondering how some of the “magic constants” were being derived, especially because they are similar — but not identical — to the Taylor expansion coefficients any mathematician or physicist is likely to recognize on sight. Curiosity led me to devote the past few days to investigating some of the standard techniques used in implementing numerical approximations of special functions, so this article serves as a useful reference for myself (and hopefully others) for some of the techniques I’ve learned.

Taylor Series

To any physicist, the first obvious choice for approximating any (well behaved) function is its Taylor series expansion. For an infinitely differential function \(f(x)\) at \(x = a\), the Taylor series is defined as the infinite sum \begin{align*} f(x) = \sum_{n = 0}^{\infty} \frac{1}{n!} \frac{d^n f(a)}{dx^n} (x - a)^n \end{align*} where \(!\) denotes the factorial function and \(d^nf(a)/dx^n\) is the \(n\)-th derivative of \(f(x)\) evaluated at \(x = a\).1 The approximation is made by truncating the infinite series to a finite series of some arbitrary number of terms, giving a finite-degree polynomial approximation to the function \(f(x)\).

There is an immediately obvious problem with just using the Taylor series, though — what value of \(x = a\) do you choose to expand about? The exponential function is defined on the entire real line, but you only have one point about which the function is well approximated, with the approximation growing poorer the further from \(a\) you go. The span of the well-approximated region can be increased by including more terms in the series, but this is a strategy of diminishing returns given that increasing the accuracy to arbitrary precision everywhere requires an infinite series.

In general, there are also issues of ensuring convergence of the series (i.e. the function does not have to converge everywhere for a given choice of \(a\)), and the rapidly growing size of \(n!\) and high powers of \((x-a)^n\) will quickly lead to issues of numerically computing their values without under/overflow.

Despite these problems, though, let us ignore them for now and proceed to investigate the family of Taylor series approximations \(P_n(x)\) of degree \(n\). The Taylor series for \(e^x\) about \(x = 0\) is \begin{align} e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \ldots \end{align} which can be easily coded up as a generated function (so that the factorial coefficients need only be calculated once):

@generated function exp_taylor(x::Float64, ::Val{N}) where {N}
    coeff = ntuple(i -> inv(factorial(i-1)), N+1) # (1/0!, 1/1!, ..., 1/N!)
    return :(evalpoly(x, $coeff))
end

The following figure plots the first through third degree Taylor series approximations versus \(e^x\) (top panel), along with the absolute error \(\Delta = e^x - P_n(x)\) (middle) and relative error \(\epsilon = \Delta / e^x\) (bottom).

Plots of the exponential function \(e^x\) and three Taylor series approximations (top panel), along with the absolute (middle) and relative (bottom) errors.

With just this simple plot, we can already make some useful observations, both good and bad. First, a good feature is that the Taylor series expansion is almost trivial to implement — we defined it using only 4 lines of code, and this works up to a 20th-degree series. (After that factorial(n) overflows, so continuing to add degrees would require a bit more effort to work around.) Second, we also see the expected behavior that adding higher degree terms to the series improves the approximation, with the third-order approximation having uniformly smaller error than the first-degree approximation.

On the other hand, though, there are some problems evident as well. For one, the odd degree polynomial terms lead to values which are generally biased low. This is clearly visible in the first-degree term where the orange line in the upper panel is everywhere below the black line except at \(x = 0\) where they are equal. Ideally we’d like to have an approximation which gives an unbiased answer over some domain.

Next, the error is distributed unevenly in both the absolute and relative sense, and we do not have a way to tune or control that other than to just increase the number of terms in the series and hope for the best. In particular, the relative error rapidly increases for \(x < 0\) where the exponential takes on very small values, and a calculation based on a value which deviates from its true value by \(>100\%\) is likely to give problematic results.

Thankfully, we have plenty of tools to use yet which can address each of these problem.

Range reduction

Let us first address the problem of choosing where to center the Taylor expansion. In the previous section I chose to expand the exponential about \(x = 0\), and that caused the approximations to be best near only that point. The general technique of “range reduction” aims to reformulate a calculation in terms of multiple (hopefully easy) calculations that shift one domain (here infinite) to a smaller one of our choice.

The simplest case of such a range reduction is for a periodic function where only a single period of the domain has to be considered. For instance, \(\cos(x + 2n\pi) = \cos(x)\) for all integers \(n\), so the Taylor approximation only has to converge to the required accuracy for a domain like \(x \in [0, 2\pi)\) and all other inputs are remapped into the range.

We can do something analogous for the exponential function, but instead of periodicity we will exploit the identity \(e^{a + b} = e^a e^b\) for some carefully chosen values of \(a\) and \(b\).

Let \(x = k \ln 2 + r\) for some integer \(k\), which implies that the residual \(r\) is limited to finite domain \(|r| \le \frac{1}{2}\ln 2\). Then substituting and simplifying with some exponential-logarithmic identities, we find that \(e^x\) can be re-expressed as \begin{align*} e^x = e^{k\ln 2 + r} = e^{k\ln 2} e^r = 2^k e^r \end{align*} This is a manifestly simpler expression because the factor of \(2^k\) is easy to compute. Standard floating point values2 are stored in base-2 “scientific notation” in terms of its fractional part \(f\) and a power-of-two \(p\) such that \(z = f \cdot 2^p\). Because of this representation, the particular power \(2^k\) is constructed by a few fast bit manipulations, and then the answer is a simple product of two floating point values (where we assume we can calculate \(e^r\), such as with a Taylor series approximation).

Rearranging the expression to solve for \(k\), we invoke the requirement that \(k\) be an integer to eliminate the residual \(r\) so that \begin{align*} k = \operatorname{round}\left(\frac{x}{\ln 2}\right) = \left\lfloor \frac{x}{\ln 2} + \frac{1}{2} \right\rfloor \end{align*} where we assume \(\operatorname{round}(z)\) rounds toward zero (to agree with the definition that \(|r| \le \frac 1 2 \ln 2\) and still not have \(k\) change) and the the rightmost expression just rewrites the rounding as the floor function. Then the residual is obtained by simply again solving the expression again but this time for \(r\) now that \(k\) is known: \begin{align*} r &= x - k\ln 2 \end{align*} We can then put this range reduction and the Taylor series approximation from the previous section together3 to make a range-reducing Taylor series approximation to \(e^x\).

function reduce_exp(x::Float64)
    ln2 = 0.6931471805599453 # log(2)
    invln2 = 1.4426950408889634 # 1 / log(2)
    k = floor(muladd(x, invln2, 0.5)) # ⌊ x/log(2) + 1/2 ⌋
    r = muladd(k, -ln2, x) # x - k * ln2
    return (trunc(Int, k), r)
end
function exp_reduce_taylor(x::Float64, n::Val{N}) where {N}
    k, r = reduce_exp(x)
    return exp2(k) * exp_taylor(r, n) # 2^k * exp(r)
end
Similar to Fig. .2.1 with the additional of a range-reducing step that limits the Taylor-series approximation to operating within the range \(|x| \le \frac 1 2 \ln 2\), and the full result is a rescaled version of the reduced domain. Note that compared to Fig. 2.1, the range of the absolute error has decreased by about a factor of 10, and the relative error by about 200 times.

Figure 3.1 is analogous to Fig. 2.1 from the previous section but instead calls our new exp_reduce_taylor() function which does range-reduction. The first obvious feature is the periodicity of the relative error across the domain caused by the range reduction and scaling. This technique has allowed us to constrain the relative error to a bounded range everywhere (and equal to the error near \(x = 0\)), and the bound is progressively tighted by adding higher-order terms to the Taylor series. Note that unlike the relative error, though, the absolute error grows as \(x \rightarrow \infty\) because any approximation error in \(e^r\) is then scaled by the factor of \(2^k\).

While we have successfully improved upon the problem of choosing a suitable point to expand about and the problem of an infinite domain, we still see the biased average result in the odd orders. Furthermore, we’ve added a new problem with the fact that the numerical function is no longer continuous across the \(\frac{k}{2}\ln 2\) boundaries.

To solve these problems, we have to move beyond the Taylor series approximation.

Minimax polynomial

To take a new approach, let us consider some conditions we’d like our approximation to satisfy. First, the approximation should be a “good” approximation, which we can quantify by requiring that the absolute (relative) error be minimized over all possibly \(n\)-th degree polynomials \(P_n(x)\): \begin{align} \newcommand{\min}{\operatorname*{min}} \newcommand{\max}{\operatorname*{max}} \text{absolute error}\quad &\Rightarrow \quad \min_{P_n} \big\lVert f(x) - P_n(x) \big\rVert_\infty \\ \text{relative error}\quad &\Rightarrow \quad \min_{P_n} \left\lVert \frac{f(x)-P_n(x)}{f(x)} \right\rVert_\infty \end{align} where \(\|\cdot\|_\infty\) is notation for the infinity norm which reduces to the maximum of the expression. The solution \(P_n(x)\) is then called the minimax polynomial. To generically describe both the absolute and relative error minimizations simultaneously, we’ll add a weight function \(w(x)\) so that we are instead minimizing \begin{align} \min_{P_n} \Big\lVert w(x) \big[ f(x) - P_n(x) \big] \Big\rVert_\infty \end{align} where \(w(x) = 1\) to minimize the absolute error and \(w(x) = 1 / f(x)\) to minimize the relative error.

The solution to this minimization problem is given by the Remez algorithm, which is a method for iteratively improving the polynomial approximation, and it can be easily understood if you keep a couple of facts about polynomials in mind.

First, \(n\)-th degree polynomials have \(n + 1\) degrees of freedom (coefficients) and \(n + 1\) roots. For any arbitrary polynomial, the roots may be repeated or complex, but we can ensure the polynomial has all distinct and real roots by adding an additional constraint — if the polynomial is constructed such that it has \(n\) local extrema that successively change signs, then all \(n + 1\) roots will be real and distinct.

Second, recall that all non-constant (\(n > 0\)) polynomials will always diverge as \(|x| \rightarrow \infty\), so to constrain the error over a finite domain \([a, b]\), we are motivated to augment the local extrema with the endpoints. This gives us an ordered list of \(n + 2\) points \((a = x_0) < x_1 < x_2 < \ldots < x_n < (x_{n+1} = b)\) at which the error function is to be minimized.

The minimax condition is then achieved when the error function evaluated at each of these extrema are equal in magnitude. (This is the equioscillation theorem.) That is to say, we are seeking the polynomial that satisfies the system of equations over all of the extrema nodes given by \begin{align*} e(x_i) = (-1)^i E = w(x_i) \big[ f(x_i) - P_n(x_i) \big] \end{align*} for \(i = \{0, 1, \ldots, n+1\}\). Note that this constraint adds another degree of freedom (the error \(E\)) to the system (on top of the polynomial coefficients) so that there are \(n+2\) undetermined coefficients and \(n+2\) equations and therefore forms a solvable system.

Rewritten into the form of a matrix equation, we can formally solve for the polynomial coefficients \(a_i\) — where \(P_n(x) = \sum_{k=0}^n a_k x^k\) — and the maximum deviation \(E\) with the weights and oscillation constraint as:

\begin{align} \newcommand{\vs}{\vphantom{(-1)_{n+1}^{n+1}}} \begin{bmatrix} 1 & x_0 & x_0^2 & \ldots & x_0^n & 1/w(x_0) \vs \\ 1 & x_1 & x_1^2 & \ldots & x_1^n & -1/w(x_1) \vs \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & x_n & x_n^2 & \ldots & x_n^n & (-1)^n/w(x_n) \vs \\ 1 & x_{n+1} & x_{n+1}^2 & \ldots & x_{n+1}^n & (-1)^{n+1}/w(x_{n+1}) \vs \end{bmatrix} \begin{bmatrix} a_0\vs \\ a_1\vs \\ \vdots \\ a_{n+1}\vs \\ E\vs \end{bmatrix} &= \begin{bmatrix} f(x_0)\vs \\ f(x_1)\vs \\ \vdots \\ f(x_n)\vs \\ f(x_{n+1})\vs \end{bmatrix} \label{eqn:sys} \end{align}

You may have noticed by now, though, that there are actually twice as many unknowns than equations since we haven’t actually stated how to a priori choose the nodes \(x_i\) other than to say they should be the local extrema of the minimax polynomial. This is where the iterative part of the Remez algorithm comes in — instead of solving the exact solution to the system of equations, we make an initial guess at some nodal points and solve the least-squares solution. This will almost surely fail to solve to the minimax polynomial with equal magnitude oscillations of the error function, but we can use this best-fit estimate to update our choice of nodes and repeat.

Finally, we are ready to state the full iterative Remez algorithm:

  1. Choose an initial vector of trial nodes \(\mathbf{x}\).
  2. Calculate the regressor matrix \(\mathbf{R}\) (square matrix on left-hand side of Eqn. \(\ref{eqn:sys}\)) and the vector of function values \(\mathbf{f} = f(\mathbf{x})\) (vector on the right-hand side).
  3. Solve (in a least-squares sense) the system of equations \(\mathbf{R} \mathbf{a} = \mathbf{f}\) for the vector of polynomial coefficients and deviation magnitude.
  4. For the best-fit polynomial, find the local extrema \(\mathbf{x'}\) and compare their deviation magnitudes. If they are equal (or suffiently close within some tolerance), the current solution is the minimax polynomial. Quit iterating; the polynomial is defined by all but the last entry in the \(\mathbf{a}\) vector.
  5. Otherwise, replace the nodes \(\mathbf{x}\) with the locations of the extrema (i.e. let \(\mathbf{x} \leftarrow \mathbf{x'}\)) and return to step 2 to iterate the process.

For additional context on solving a system of equations like this built from evaluating basis functions over a set of points, see my previous article on building linear regression matrix operators. For our purposes, though, it is sufficient to just know that the least-squares regression can be obtained by using Julia’s built-in backslash (\) operator and/or LinearAlgebra.ldiv! methods.

We’ll also skip working through the details of finding the extrema except to say that it can be easily accomplished though a couple more iterative algorithms that first find the zeros of the error function (rootbisect() in the function below) and then search for extrema between each pair of roots (extremumsearch()). The code for these functions is included in a hidden section which can be shown by clicking the link below the following code block.

Finally, the only remaining definition we require is a method for choosing the initial node distribution. We’ve already argued that the endpoints are two of the nodes, but we have \(n\) more to choose. In theory almost any distribution of non-overlapping nodes should work, but the convergence could be very slow if they are very poor initial estimates. A common suggestion is to the zeros of the Chebyshev polynomial of degree \(n\) remapped from \([-1,1]\) to \([a,b]\). In the definition below, I have chosen a modified, Chebyshev-like distribution4 that differs by naturally including the endpoints.

The following remez() function is a translation of this process. It takes in as arguments the exact function to approximate f, the endpoints a and b, the degree of the polynomial to produce N, and an optional weight function w (defaulting to the unit function which as you’ll recall corresponds to finding the absolute error minimax approximation).

using LinearAlgebra

function remez(f, a, b, N, w = one; atol = sqrt(eps(b - a)), maxiter = 20)
    # Initialize node locations
    nodes  = @. (a + b)/2 - (b - a)/2 * cospi((0:N+1) / (oftype(a, N)+1))

    # Working space for exact function values, the regressor matrix, and the
    # locations of the roots of the error function
    fvals = similar(nodes)
    regr  = similar(nodes, N + 2, N + 2)
    roots = similar(nodes, N + 1)

    # Weighted error function
    coeff = @view fvals[1:N+1]
    err(x) = w(x) * (f(x) - evalpoly(x, coeff))
    # Difference in magnitude between the error function and best-fit deviation constant
    dev(x) = abs(abs(err(x)) - abs(fvals[N+2]))

    iter = 0
    while true
        # Update the exact solution and the regressor
        regr[:,1:N+1] .= nodes .^ transpose(0:N)
        regr[:,  N+2] .= ((-1)^i / w(nodes[i]) for i in 1:N+2)
        # Perform the regression, overwriting the exact solution
        fvals .= f.(nodes)
        ldiv!(factorize(regr), fvals) # coeff .= regr \ fvals

        iter += 1
        if iter > maxiter
            @warn "Maximum number of iterations reached. Quitting."
            break
        end

        # Search for each root between successive pairs of nodes.
        roots .= (rootbisect(err, nodes[i], nodes[i+1]) for i in 1:N+1)
        # And then update all the internal nodes with the locations of the actual
        # local extrema
        nodes[2:N+1] .= (extremumsearch(err, roots[i], roots[i+1]) for i in 1:N)

        # Stopping condition is when the actual deviations differ from the best-fit
        # deviation coefficient in magnitude by less than `atol`.
        maximum(dev, nodes)  atol && break
    end
    return (nodes, copy(coeff))
end

The following helper functions to find the zeros and extrema of the error functions are very simple implementations of the bisection method for finding roots and the golden-section search for finding the extrema. In particular, these are not robust against all possible inputs and are not guaranteed to work in all cases, but they were simple to code and are sufficient for the few cases shown here.

"""
Simple bisecting search for the zero of function ``f(x)`` in the interval ``a ≤ x ≤ b``.
"""
function rootbisect(f, a, b; atol = sqrt(eps(abs(b - a))))
    fa, fb = f(a), f(b)
    abs(fa) < atol && return a    # Left bound is a zero
    abs(fb) < atol && return b    # Right bound is a zero
    signbit(fa) == !signbit(fb) || # no zero crossing is an error
        error("Invalid interval — does not cross zero")
    while true
        m = (a + b) / 2 # bisect halfway between intervals
        fm = f(m)       #  and recalculate f(m)
        abs(fm) < atol && return m  # found the zero

        # If f(a) and f(m) are the same sign, then drop lower half of remaining interval
        if signbit(fa) == signbit(fm)
            a, fa = m, fm
        # otherwise the upper half has been excluded.
        else
            b, fb = m, fm
        end
    end
end

"""
Search for an extremum of ``f(x)`` in the interval ``a ≤ x ≤ d``.
"""
function extremumsearch(f, a, d; atol = sqrt(eps(abs(d - a))))
    T = typeof(atol)
    ϕinv = T(2) / (one(T) + sqrt(T(5)))

    b = d - ϕinv * (d - a)
    c = a + ϕinv * (d - a)
    fa, fb, fc, fd = abs.(f.((a, b, c, d)))

    while true
        # Stop when the search bounds have converged to sufficient accuracy
        d - a < atol && return (a + d) / 2

        # The extremum must be in [a, c]
        if fb > fc
            c, d = b, c
            fc, fd = fb, fc
            b = d - ϕinv * (d - a)
            fb = abs(f(b))
        # otherwise the extremem must be in [b, d]
        else
            a, b = b, c
            fa, fb = fb, fc
            c = a + ϕinv * (d - a)
            fc = abs(f(c))
        end
    end
end

We are now finally ready to improve our approximation of the exponential function over that given by a Taylor series. As mentioned earlier, we can choose to minimize the error in an absolute or relative sense, so we’ll do both by defining variations of exp optimized under both weighting functions. These kernel functions replace the Taylor-series approximation on the interval \(|x| \le \frac 1 2 \ln 2\), and we continue to use range reduction over the entire real axis.

# Minimizes absolute error in |x| ≤ log(2)/2
@generated function exp_absminimax_kernel(x::Float64, ::Val{N}) where {N}
    ln2by2 = log(2.0) / 2
    _, coeff = remez(exp, -ln2by2, ln2by2, N)
    return :(evalpoly(x, $(tuple(coeff...))))
end
# Minimizes relative error in |x| ≤ log(2)/2
@generated function exp_relminimax_kernel(x::Float64, ::Val{N}) where {N}
    ln2by2 = log(2.0) / 2
    _, coeff = remez(exp, -ln2by2, ln2by2, N, x -> exp(-x))
    return :(evalpoly(x, $(tuple(coeff...))))
end

function exp_reduce_kernel(x::Float64, n::Val{N}, kernel) where {N}
    k, r = reduce_exp(x)
    return exp2(k) * kernel(r, n) # 2^k * exp(r)
end
exp_absminimax(x::Float64, n::Val{N}) where {N} = exp_reduce_kernel(x, n, exp_absminimax_kernel)
exp_relminimax(x::Float64, n::Val{N}) where {N} = exp_reduce_kernel(x, n, exp_relminimax_kernel)
Similar to Fig. 2.2 but changes the kernel function from a Taylor approximation to a minimax polynomial approximation which minimizes the absolute error (left column) or relative error (right column).

First, both variations of the minimax polynomial eliminate the systematic bias in the odd-order polynomials that was present while using the Taylor series approximation.

Second, we can see that both variations do minimize the absolute and relative error, respectively. (This is probably easiest to see by comparing the last one and a half of the range-reducing periods from about \(x = 1\) to the right side.) In the left column which shows the absolute minimax approximation, the left and right end points of each range-reducing period has the same deviation. (Compare this against Fig. 3.1 where lower and upper endpoints of the period have different absolute errors.) Then in the right column which shows the relative minimax approximation, we see a similar per-period relationship but now in the relative error measurement.

For the relative error case, we also gain continuity of the approximation across range-reducing periods in the odd-degree approximations; this is a direct consequence of the error polynomial being a leading-order-even function together with constraining the relative errors to be equal on the lower and upper edges of the range-reducing period.

Finally, let us directly compare the known Taylor series coefficients against the minimax polynomial coefficients (since, as mentioned in the introduction, the observation that the Julia pull request contained not-quite Taylor series coefficients is what initially motivated my exploration into this topic). The following table shows the Taylor series and relative minimax polynomial coefficients up to degree 3, along with the difference (minimax minus Taylor) the relative percent change with respect to the Taylor series coefficients.

DegreeTaylor seriesMinimax polyAbs. (Rel.) change
01.00000000000000000.9999280752600668-7.192e-05 (-0.00719 %)
11.00000000000000001.00016419039482640.0001642 (0.0164 %)
20.50000000000000000.50496326509619220.004963 (0.993 %)
30.16666666666666670.1656683995499798-0.0009983 (-0.599 %)

The bottom-line conclusion is that only very small (less than 1%) tweaks to the Taylor series coefficients are necessary to transform the Taylor polynomial into the minimax polynomial, and the coefficients observed in the Julia pull request could reasonably be due to a minimax-like optimization.

A more complex minimax example — expm1

Before we conclude, let’s take a quick look at a slightly more complex example of defining minimax polynomials.

A well known issue with numerically calculating the exponential function is that for \(|x| \ll 1\), the answer is very nearly \(1\), and accuracy in the result is lost due to the limits of finite floating point representation. Namely, calculating exp(x) for a value very nearly \(0\) and then subtracting 1.0 is subject to so-called “catastrophic” cancellation. Therefore, getting a high-accuracy result requires direct calculation of a special form of the function — exmp1(x) which is defined to be \(e^x - 1\) — without intermediately using exp(x) and subtracting. The expm1 function is commonly provided in math libraries (including the C99 standard and, of course, by Julia’s base library).

Given the simple mathematical definition, we can consider how to form a numerical approximation by expressing the function in terms of the exponential’s Taylor series and simplifying: \begin{align*} e^x - 1 &= \left(\sum_{k=0}^{\infty} \frac{x^k}{k!}\right) - 1 \\ {} &= \left(1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \ldots\right) - 1 \\ {} &= x \left(1 + \frac{x}{2} + \frac{x^2}{6} + \ldots \right) \\ e^x - 1 &= x \sum_{k=0}^{\infty} \frac{x^k}{(k+1)!} \approx x P_n(x) \end{align*} Thus we have identified \(e^x - 1\) as the product of \(x\) and an infinite power series which can be truncated and approximated by some \(n\)-th order polynomial.

To have the remez() function above solve for the minimax polynomial, we must manipulate the error between \(e^x - 1\) and \(x P_n(x)\) for either absolute or relative error minimization back into the standard form \(w(x)\left[ f(x) - P_n(x) \right]\). Doing so for the relative case, we simply divide through by the extra factor of \(x\) and make the appropriate identifications: \begin{align*} \frac{\left[e^x - 1\right] - \left[x P_n(x)\right]}{\left[e^x - 1\right]} &= \frac{x}{e^x - 1}\left[ \frac{e^x - 1}{x} - P_n(x) \right] \\ {} &= w(x) \left[ f(x) - P_n(x) \right] \\ {} &\qquad\text{where}\quad f(x) \equiv \frac{1}{w(x)} \equiv \frac{e^x - 1}{x} \\ \end{align*} Note that both \(f(x)\) and \(w(x)\) limit to unity as \(x \rightarrow 0\), so the functions are well-defined.

For a generic solution, we also need to replace the range-reduction step with one which accounts for the subtraction by one. With a bit of algebra, it’s easy to show that for \(x = k\ln 2 + r\) as previously, we have \begin{align} e^x - 1 = 2^k(e^r - 1) + 2^k - 1 \end{align} Translating these definitions into code:

# Minimizes relative error in |x| ≤ log(2)/2
@generated function expm1_minimax_kernel(x::Float64, ::Val{N}) where {N}
    ln2by2 = log(2.0) / 2
    expm1_kernel(x) = iszero(x) ? one(x) : expm1(x) / x
    expm1_weight(x) = iszero(x) ? one(x) : x / expm1(x)
    # Solve for one fewer degrees than requested to make comparisons with
    # exp(x)-1 fair — i.e. the highest degree in (x * P_n(x)) == N
    _, coeff = remez(expm1_kernel, -ln2by2, ln2by2, N - 1, expm1_weight)
    return :(x * evalpoly(x, $(tuple(coeff...)))) # Note the leading mult. by `x`
end

function expm1_minimax(x::Float64, n::Val{N}) where {N}
    k, r = reduce_exp(x)
    s = exp2(k)
    return muladd(s, expm1_minimax_kernel(r, n), s - 1.0)
end

and then directly comparing a third-order relative minimax approximation of exp(x)-1 versus expm1(x):

Similar to Fig. 4.1 but compares a minimax polynomial (optimized for relative error) calculated for \(e^x\) and subsequently used to calculate the quantity \(e^x - 1\) (left column, exp(x) - 1) versus a minimax polynomial optimized to directly calculate \(e^x - 1\) (rightcolumn, expm1(x)). In both cases, the answer is calculated up to third order in \(x\) (i.e. the residual is \(\mathcal{O}(x^4)\)).

The catastrophic calculation for small \(x\) is visible in the bottom panel of the left column where the fractional error diverges as \(x \rightarrow 0\). This is because the absolute error (middle row of left column) is non-zero near and at \(x = 0\), so since \(\lim_{x\rightarrow 0} e^x - 1 = 0\), the fractional error is unbounded. In contast, the right column shows that the direct minimax optimization of expm1(x) actually constructs a minimax polynomial that approaches \(0\) as \(x \rightarrow 0\), and therefore the relative error is tightly bounded.

Conclusions

For myself (and likely many physicists) the Taylor series is a well known and familiar method of constructing polynomial approximations of arbitrary functions. Numerically, though, the minimax polynomial is an alternative construction that provides more desirable properties, and it is a concept that any numerically-oriented person should probably have in their toolbox.

In this article, I ignored the problem of bootstrapping the implementation — i.e. we optimized minimax polynomials against Julia’s built-in exp(x::Float64) and expm1(x::Float64) functions — but this isn’t a deal-breaking problem. In some cases you may want to build a custom, fast approximation of an intricate function which can be accurately (but too slowly) calculated in terms of known functions. Then this is just a performance optimization, and there’s no issue of how to bootstrap the process.

If the initial bootstrapping of a special function is the goal, though, then we already have a perfectly servicable definition — just calculate the exact answers to machine precision using a large number of terms in the Taylor series, and optimize for a minimax polynomial that requires fewer terms to achieve the same accuracy. In practice the use of extended-precision arithmetic will almost certainly be required to avoid loss of precision in standard floating point implementations, but that is not an insurmountable problem.

Finally, this is far from being a comprehensive explanation of how to write accurate numerical calculations of special functions. For instance, our range reduction for exp only contracted the domain down to the interval \(|x| \le \frac 1 2 \ln 2\), but a commonly-referenced extension of this idea is Tang’s method which makes use of a pre-computed table of non-integer powers of two to further reduce the working range. This reduces the polynomial degree required to calculate all answers to machine precision. (The Julia pull request linked to in the introduction uses a similar scheme, and the range is reduced to \(|x| \le \ln(2)/512\).) There’s also an entire domain of implementation details to be considered to ensure the calculation is actually performant on real hardware — e.g. optimizing for branch (mis-)prediction, instruction pipelining, auto-vectorization, etc — that are outside the scope of this article to discuss.


  1. Note that for \(n = 0\), the \(0\)-th derivative of \(f\) is defined to be itself (e.g. \(d^0 f(x)/dx^0 = f(x)\)), and \(0! = 1\) so that the first term always simplifies to just \(f(a)\). ↩︎

  2. This is technically not true for all possible ways of staring floating point values, but the IEEE 754 standard is used almost everywhere most users are likely to encounter floating point values. ↩︎

  3. Using the exp2() function is not a subtle cheat and/or circular definition that itself depends on a properly defined generic exp() function when the argument is an integer type. You can verify with the Julia source code that it does exactly as I claimed and uses bit-level manipulations to directly construct the floating point value equal to \(2^k\). ↩︎

  4. The Chebyshev polynomial zeros are all contained within the interval \(-1\) to \(1\) (excluding the endpoints) and are distributed such that they correspond to evenly distributed points on a unit sphere mapped onto the \(x\)-axis. The left-hand panel below shows the \(n\) Chebyshev zeros for a 12th degree polynomial as the blue points, with the two endpoints added in as the red points. Notice that the spacing between the first and second points and the penultimate and last points are not equal with the spacing between the rest of the successive pairs.

    The initial distribution of points used in remez() is very similar but modified to directly produce \(n + 2\) points distributed uniformly across the unit sphere with the endpoints included. This case is shown in the right-hand panel.
    inital point distribution ↩︎