Maintaining numerical accuracy in the Legendre recurrences

Legendre.jl Series: Parts 1, 2, 3, 4, 5

An algorithm for calculating special functions is no use if it has poor numerical accuracy and returns inaccurate or invalid answers. In this article I discuss the importance of using fused multiply-add (FMA) and a form of loop-unrolling to maintain numerical accuracy in calculating the associated Legendre polynomials to degrees and orders of order 1000 and greater.


In this article we are going to explore a pair of simple yet important modifications to the existing legendre() implementation that are aimed solely at improving the numerical precision of the algorithm. The following two sections will each concentrate on one of the transformations, discussing in depth why the transformation is prudent, and the conclusion will tease a calculation I was performing which demonstrated the need for this careful attention to numerical accuracy.

As a reminder, the calculations proceed via the evaluation of the three following recurrence relations: \begin{align} p_\ell^\ell(x) &= -\mu_\ell \sqrt{1-x^2} p_{\ell-1}^{\ell-1}(x) \label{eqn:cus_rr_1term_lm} \\ p_\ell^{\ell-1}(x) &= \nu_\ell x p_{\ell-1}^{\ell-1}(x) \label{eqn:cus_rr_1term_l} \\ p_\ell^m(x) &= \alpha_\ell^m x p_{\ell-1}^m(x) - \beta_\ell^m p_{\ell-2}^m(x) \label{eqn:cus_rr_2term} \end{align}

Graphical depiction of the order of evaluating the recurrence relations. The steps denoted in red that move along the diagonal (where \(\ell = m\)) are made by evaluating Eq. \(\ref{eqn:cus_rr_1term_lm}\). Then a single boost (blue step) from \(\ell \rightarrow \ell+1\) is made by evaluating Eq. \(\ref{eqn:cus_rr_1term_l}\). This gives the two requiste previous terms to permit using Eq. \(\ref{eqn:cus_rr_2term}\) to step upward in \(\ell\) (green steps) to the target degree.

In particular, note that Eqn. \(\ref{eqn:cus_rr_1term_lm}\) (the expression that steps along the matrix diagonal) is the precursor to the other two recursions. The implication is that any accumulated error in the first step propagates through the other two recurrences, so it is prudent to ensure the process starts with as much accuracy as possible.

There are essentially three parts to the recurrence — the previous term of the recurrence, the normalization-specific coefficient, and the factor of \(\sqrt{1 - x^2}\). The only one we can meaningfully (and generically) effect is the square root factor, so let us consider that specifically. (Of course, the normalization-specific constants should also be calculated in a numerically accurate way, but since that’s specific to each case, we’ll just assume that has been done expertly here.)

In the following sections we will split the calculation of \(y = \sqrt{1 - x^2}\) into two parts and analyze each of them independently. First, we’ll analyze and discuss the accuracy of calculating the inner component \(y^2 = 1 - x^2\), and then the following section will concentrate on the impact of calculating the square root \(y = \sqrt{y^2}\).

Fused Multiply-Add (FMA) in two-term diagonal recurrence

If you recall your algebra well, it should be immediately obvious that the polynomial \(1 - x^2\) can be mathematically equivalently expressed as the product of factors \((1 - x)(1 + x)\). The slightly more generic case of any difference of squares \(w^2 - x^2\) is a classic example from the well-known article “What Every Computer Scientist Should Know about Floating-Point Arithmetic”1 by D. Goldberg that demonstrates that mathematical equivalence does not mean the expression is numerically equivalent within a calculation using floating point numbers. In that article, it is stated that the factored product is generally preferred over the difference of squares because the latter suffers from catastrophic cancellation whereas the former has only benign cancellation.

We can demonstrate this numerically by defining functions for each of the two forms and evaluating the functions with both standard Float64 64-bit floating point numbers and with arbitrary-precision BigFloat numbers. (In this case, the default 256-bit precision of the BigFloat numbers is sufficient to quantify the rounding error in the 64-bit floats.)

f₁(x) = 1.0 - x^2
f₂(x) = (1.0 - x) * (1.0 + x)

Since the functions are even, we will analyze only half of the domain \(x \in [0, 1]\) since the span \(-1\) to \(0\) will be a mirror reflection of the result on \(0\) to \(1\). We’ll take 1 million samples spaced uniformly2 within the range and calculate a reference answer \(y' = 1 - x^2\) at BigFloat precision. Then because we care about the fractional error and not the absolute magnitude — that is to say, a difference of 1 compared to 1 billion is just as significant a difference of \(10^{-9}\) compared to \(1\) — the difference will be normalized by \(\varepsilon\), which is the corresponding size of the ulp3 (or “unit in last place”) appropriate for the result.

x  = collect(range(0.0, 1.0, length=1_000_001))
y′ = @. big(1.0) - big(x)^2
ε  = @. eps(Float64(y′))

ulperr(y) = @. Float64((y - y′) / ε)

Now we can calculate the maximum error for both f₁\({} = 1 - x^2\) and f₂\({} = (1-x)(1+x)\):

println("max err   1-x^2    = ", maximum(ulperr(f₁.(x))))
println("max err (1-x)(1+x) = ", maximum(ulperr(f₂.(x))))
max err   1-x^2    = 52233.570617393154
max err (1-x)(1+x) = 1.4976462110125

In agreement with the Goldberg article, the first form has a worst-case scenario which has much greater loss of precision in the final result. The simple maximum, though, is not very informative — if only a single pathelogical case is orders of magnitude worse, than the maximum is not a good representation of the behavior across the entire domain of interest.

Therefore, it’ll be more useful to plot the error in ulps for every sample. With one million points, no single marker or line segment will be visible; instead we’ll be looking at what are essentially shaded regions that envelope all answers within some span of the domain (with the span being implicitly defined by the pixel resolution of our plot).

using PyPlot, Printf

function summary_plot(fns, labels)
    for (fn,lbl) in zip(fns, labels)
        δ = ulperr(fn.(x))
        maxδ = maximum(abs, δ)
        plot(x, δ; linewidth=0.5, alpha=0.5,
            label = @sprintf("%s :: max err = %0.2f ulps", lbl, maxδ))
    xlabel(raw"$x$"); ylabel("ulps")
    ylim(-10, 10)

summary_plot((f₁, f₂), (raw"$1 - x^2$", raw"$(1-x)(1+x)$"))

What we find is that the first form actually loses less precision than the second for about 75% of the domain (from \(x = 0\) to about \(0.75\)), but as \(x\) approaches \(1\) it rapidly degrades in precision. This is precisely the catastrophic cancellation mentioned earlier.

In contrast, the second form is bounded across the entire domain, seemingly with no point having an error greater than 1.5 ulps. It is generally less precise for small values of \(x\) but avoids the catastrophic cancellation as \(x\) approaches 1.

The process of catastrophic cancellation can be shown through a simple example. Take the value \(c = 1 - 2^{-30} \approx 0.9999999990686774\) and square it: \begin{align*} c^2 = \left(1 - 2^{-30}\right)^2 = 1 - 2 \cdot 2^{-30} + 2^{-60} \end{align*} A standard double (64-bit) floating point value has only 51 base-2 digits of precision, so the small factor of \(2^{-60}\) is truncated away and the result is rounded to \(1 - 2^{-29}\). Now subtracting the quantity from 1 gives the result that \(1 - c^2\) is \(2^{-29}\):

c = 1 - 2^-30
1 - c^2 == 2^-29

Mathematically, this seems perverse because we can easily see that the two factors of \(1\) cancel, so that \begin{align*} 1 - c^2 &= 1 - \left(1 - 2^{-30}\right)^2 \\ {} &= 2^{-29} - 2^{-60} \end{align*} which spans only 31 powers of two and is well within the 51 orders representable by a double float. The relative error between the value \(2^{-29}\) and \(2^{-29}+2^{-60}\) is roughly 4 million ulps, which is certainly a catastrophic loss of precision!

This problem of intermediate rounding has long been recognized in numerical computing, and many mathematical transformations have been developed (like using the factored form \((1-x)(1+x)\)) to maintain precision in calculations.

Thankfully for us, though, the operation \(1 - x^2\) is a specific form of the more general \(a \cdot b + c\), and that generic form is so crucial to a wide variety of algorithms that special “fused multiply-add” (FMA) functions have long been available that perform the computation at extended precision and only round the result once. Delaying the rounding of the squared term until after adding (or subtracting) in this case is exactly what is required to eliminate the common \(1\) in the expression and collapse the range of digits to a span which does fit within a standard double float.

Adding the FMA calculation to our toolbox and comparing to the previous two cases:

f₃(x) = fma(-x, x, 1.0) # == (-x)*x + 1
summary_plot((f₁, f₂, f₃), (raw"$1 - x^2$", raw"$(1-x)(1+x)$", raw"fma(-x*x + 1.0)"))

The FMA calculation is clearly the best option in terms of precision, with a maximum error of 0.5 ulps.

As of several years ago, the use of FMA instructions could be very expensive to utilize as it was implemented in software libraries rather than at the hardware level. At this point, though, a version of the FMA instruction has been supported4 in both Intel and AMD processors since circa 2011–2014 (and even more recently in many mobile (ARM) processors or GPUs as well). In the context of scientific computing, we can thankfully now reasonable assume the FMA instructions will be performed in hardware, so we have no reason not to improve the numerical precision of the Legendre polynomials by exploiting the FMA.

Pairwise unrolling the diagonal recurrence

The second improvement we can make is to simply recognize that the best way to preserve numerical accuracy in a calculation is to not do a calculation in the first place. Consider the first two iterations along the diagonal, given the initial condition \(p_0^0\): \begin{align*} p_1^1(x) &= -\mu_1 \sqrt{1-x^2} p_0^0(x) \\ p_2^2(x) &= -\mu_2 \sqrt{1-x^2} p_1^1(x) \\ {} &= \mu_1 \mu_2 \left(1 - x^2\right) p_0^0(x) \end{align*} We have just established a way to accurately calculate \(1 - x^2\), but the first iteration requires the square root of the quantity which will necessarily incur some more numerical rounding. The second iteration would benefit from just using the FMA quantity directly, but as the naive iteration proceeds, it is instead a square of the square-root quantity. We can easily show the extra error incurred by unnecessarily squaring a square root with the same summary plot used in the previous section:

r₁(x) = sqrt(f₃(x))^2
summary_plot((f₃, r₁), (raw"$1-x^2$", raw"$(\sqrt{1-x^2})^2$"))

In general, a round-trip incurs one extra ulp of error across the domain. This error is then carried forward and amplified through the recurrence since each step builds upon the last.

A simple transform that can be made is to observe that the first two iterations are just a particular case of a more general pattern — odd degrees require one factor of \(\sqrt{1 - x^2}\) in the product while even degrees can be written completely in terms of \(1 - x^2\). Therefore, let us rewrite the recurrence relation Eqn. \(\ref{eqn:cus_rr_1term_lm}\) as \begin{align} p_\ell^\ell(x) = \begin{cases} \hphantom{\mu_\ell \mu_{\ell-1}}\llap{-\mu_\ell} \sqrt{1 - x^2} \, p_{\ell-1}^{\ell-1}(x) & \ell\text{ is odd} \\ \mu_\ell \mu_{\ell-1} \left(1 - x^2\right) \, p_{\ell-2}^{\ell-2}(x) & \ell\text{ is even} \end{cases} \label{eqn:cus_rr_1term_lm_unroll} \end{align}

Before modifying the legendre() implementation, let us first explore incrementally how each of the FMA and pairwise-unrolling of the diagonal recurrence impacts the numerical precision. Ignoring the normalization-specific coefficient factor (i.e. letting \(\mu_\ell = 1\)) and assuming \(p_0^0 = 1\), then the recurrence relation simplifies directly to \begin{align*} p_\ell^\ell(x) = \left(1 - x^2 \right)^{\ell/2} \end{align*} We calculate this expression explicitly at extended precision for a 10,000 values of \(x \in [0,1]\) and for all \(0 < \ell \le 500\) to generate a matrix of values with 5 million entries.

Then at standard Float64 precision, we’ll calculate the recurrence (not as a direct power) for three cases:

  1. The baseline case that lacks even the FMA improvement discussed above.
  2. A version which introduces the FMA calculation of \(1 - x^2\) to case 1.
  3. A final version which further introduces on top of case 2 the pairwise loop iteration.

The matrices are very large, so we summarize them by histogramming the error of each version with respect the reference calculation.

The left panel of the plot below shows the binned histograms for the three cases on a horizontal axis that ranges up to 2500 ulps of error in either direction from the reference value. It is immediately obvious that the baseline case (blue line) has very long tails toward high error. To see the significance of that tail more clearly, the right panel of the figure instead plots the cumulative counts of (absolute) error values greater than or equal to a given error threshold on a logarithmic horizontal axis. The tail of this distribution reaches to nearly \(2 \times 10^{5}\) ulps, and about 1% of all values have an error of about 700 ulps or greater.

When the naive calculation of \(1 - x^2\) is replaced with the equivalent FMA calculation, the histogram (orange line) tightens considerably toward zero error. In fact, all terms have less than 700 ulps of error, so this change alone eliminates the long tails of the baseline version.

Finally, adding in the pairwise loop iteration as well further reduces the error, with the width of the histogram being about half of the FMA-only case.

Code for generating the following figure:

lmax = 500 # even so that m÷2 is even as well
X  = collect(range(0.0, 1.0, length=10_001))

# The reference case
Y′ = @. sqrt((1 - big(X)^2)^(1:lmax)')
E  = @. eps(Float64(Y′))
ulperr2(Y) = Float64.((Y - Y′) ./ E)

# Case 1 -- baseline
Y₀ = zeros(length(X), lmax)
@. Y₀[:,1] = sqrt(1.0 - X^2)
for ii in 2:lmax
    @. Y₀[:,ii] = Y₀[:,ii-1] * Y₀[:,1]

# Case 2 -- with FMA
Y₁ = zeros(length(X), lmax)
@. Y₁[:,1] = sqrt(fma(-X, X, 1.0))
@inbounds for ii in 2:lmax
    @. Y₁[:,ii] = Y₁[:,ii-1] * Y₁[:,1]

# Case 3 -- + unroll
Y₂ = zeros(length(X), lmax)
@. Y₂[:,2] = fma(-X, X, 1.0)
@. Y₂[:,1] = sqrt(Y₂[:,2])
@inbounds for ii in 4:2:lmax
    @. Y₂[:,ii-1] = Y₂[:,ii-2] * Y₂[:,1]
    @. Y₂[:,ii]   = Y₂[:,ii-2] * Y₂[:,2]

# Plot generation
function summary_plot2d(Ys, labels)
    binslin = range(-2500, 2500, length=251)
    binslog = 10 .^ range(-1, 6, length=250)
    fig, axs = subplots(1, 2)
    for Y in Ys
        Δ = vec(ulperr2(Y))
        axs[1].hist(Δ, binslin; histtype=:step)
        axs[2].hist(abs.(Δ), binslog; histtype=:step, density=true, cumulative=-1)
        xlabel = "err [ulps]",
        yscale = "log", ylim = (10, 3e6), ylabel = "counts",
        title = "binned")
    axs[2].yaxis.set_label_position("right"); axs[2].yaxis.tick_right()
        xscale = "log", xlim = (1, 2e5), xlabel = "|err| [ulps]",
        yscale = "log", ylim = (1e-6, 2), ylabel = "fraction",
        title = "cumulative, greater than")
    return (fig, axs)

summary_plot2d((Y₀, Y₁, Y₂), ["baseline legendre()", "... with FMA", "... + unroll"])

Note that there are two reasons for not unrolling the iteration further. First, this investigation case was simplified by assuming \(\mu_\ell = 1\), but in general the coefficient varies as a function of \(\ell\); there is an implicit coefficient \(\Pi_{\ell=1}^{\ell_\mathrm{max}} \mu_\ell\) that we were able to simplify away that would require keeping track of for a general normalization. Furthermore, direct calculation of \(\left(1-x^2\right)^{\ell/2}\) with double floats actually results in an error distribution which is almost identical to the FMA+unrolled case, so there is very little motivation to further complicate the calculation.

Legendre calculation with FMA and pairwise unrolling

Let us finally update the legendre() implementation to take advantage of both the enhancements we have enumerated here. (The definitions of normalization trait types and property functions are unchanged from the previous post, so there are omitted here for brevity.)

The changes are straight-forward: we calculate the quantity \(1 - x^2\) using an FMA instruction and store that as the value \(y^2\). The iteration along the diagonal proceeds in steps of two where we employ only the even case of Eqn. \(\ref{eqn:cus_rr_1term_lm_unroll}\) up to the last even integer less than or equal to the target \(m\). Only if the target \(m\) is odd does the function take a single step using the odd case of Eqn. \(\ref{eqn:cus_rr_1term_lm_unroll}\) and the square root factor \(y^1 = \sqrt{y^2}\).

The rest of the function to iterate through higher degrees \(\ell\) off the diagonal is unchanged thereafter.

function legendre(norm, , m, x)
    P = convert(typeof(x), initcond(norm))
     == 0 && m == 0 && return P

    # Use recurrence relation 1 to increase to the target m
     = fma(-x, x, one(x))
    meven = 2 * (m ÷ 2)
    for m′ in 2:2:meven
        μ₁ = coeff_μ(norm, m′ - 1)
        μ₂ = coeff_μ(norm, m′)
        P = μ₁ * μ₂ *  * P
    if meven < m
         = sqrt()
        μ₁ = coeff_μ(norm, m)
        P = -μ₁ *  * P
     == m && return P

    # Setup for 2-term recurrence, using recurrence relation 2 to set
    # Pmin1
    ν = coeff_ν(norm, m + 1)
    Pmin2 = P
    Pmin1 = ν * x * Pmin2
     == m+1 && return Pmin1

    # Use recurrence relation 3
    for ℓ′ in m+2:
        α = coeff_α(norm, ℓ′, m)
        β = coeff_β(norm, ℓ′, m)
        P = α * x * Pmin1 - β * Pmin2
        Pmin2, Pmin1 = Pmin1, P

    return P

Comparison of legendre() from the previous posting versus the code cell above.

 function legendre(norm, ℓ, m, x)
     P = convert(typeof(x), initcond(norm))
     ℓ == 0 && m == 0 && return P
     # Use recurrence relation 1 to increase to the target m
-    y = sqrt(one(x) - x^2)
-    for m′ in 1:m
-        μ = coeff_μ(norm, m′)
-        P = -μ * y * P
+    y² = fma(-x, x, one(x))
+    meven = 2 * (m ÷ 2)
+    for m′ in 2:2:meven
+        μ₁ = coeff_μ(norm, m′ - 1)
+        μ₂ = coeff_μ(norm, m′)
+        P = μ₁ * μ₂ * y² * P
+    end
+    if meven < m
+        y¹ = sqrt(y²)
+        μ₁ = coeff_μ(norm, m)
+        P = -μ₁ * y¹ * P
     ℓ == m && return P
     # Setup for 2-term recurrence, using recurrence relation 2 to set
     # Pmin1
     ν = coeff_ν(norm, m + 1)
     Pmin2 = P
     Pmin1 = ν * x * Pmin2
     ℓ == m+1 && return Pmin1
     # Use recurrence relation 3
     for ℓ′ in m+2:ℓ
         α = coeff_α(norm, ℓ′, m)
         β = coeff_β(norm, ℓ′, m)
         P = α * x * Pmin1 - β * Pmin2
         Pmin2, Pmin1 = Pmin1, P
     return P

All of this effort spent to study the numerical accuracy of the diagonal recurrence relation is not merely an interesting academic exercise. I was driven to investigate this path through a concrete example where the loss of precision led to non-physical results in a calculation.

Without going into greater detail (which may appear in a future article), I was interested in calculating a quantity that included a term similar to: \begin{align*} Q_\ell \sim \frac{4\pi}{2\ell + 1} \sum_{m=-\ell}^{\ell} \left| \int_{\phi_\mathrm{min}}^{\phi_\mathrm{max}} \int_{x_\mathrm{min}}^{x_\mathrm{max}} \lambda_\ell^m(x) e^{im\phi} \,dx\,d\phi \right|^2 \end{align*} where \(\lambda_\ell^m(x)\) is the spherical-harmoinc normalization of the Legendre polynomials. The value of \(Q_\ell\) was well-behaved at low-\(\ell\), but suddenly diverged in an unexpected way after some critical point. After some investigation, I concluded that the cause was specifically the Legendre term, and I narrowed down a representative case to the following slice of the matrix over degrees and orders for a fixed \(\ell\):

function example_use(legendre_fn)
    # Constants for a particular use of legendre functions
    lmax = 2125
    θlim = deg2rad.((147.375, 147.875))
    xs = range(cos.(θlim)..., length=100)
    # λlm_lmax^m(x) for m ∈ 0:lmax
    Λ = [legendre_fn(SphereNorm(), lmax, m, x) for x in xs, m in 0:lmax]
    # "integrated" over all xs in a small range
    Λint = step(xs) .* dropdims(sum(Λ, dims=1), dims=1)
    return Λint

# legendre_prev() is the legendre() implementation taken from the end of Part III
plot(example_use(legendre_prev), linewidth=0.5, label="previous legendre()")
plot(example_use(legendre), linewidth=0.5, label="legendre() with FMA+unroll")
title("Example use of Legendre functions")

Without the improvements explored in this posting, the integral for fixed degree \(\ell\) has a brief period of aberrant growth (blue line near \(m = 1250\)) before the normalization suppresses the function. The disturbance continues to grow with higher degrees, so the average over all orders \(m\) was dominated by this numerical instability and the desired quantity \(Q_\ell\) had an unwanted divergence.

The orange line instead shows the same quantity calculated using the improved implementation that makes use of the FMA and partially-unrolled recurrence enhancements. The function is very similar everywhere except around \(m = 1250\) where the hump disappears. When extended to all higher degrees as well, these changes solved the problems with calculating the quantity \(Q_\ell\) and allowed me to complete my desired task.

  1. D. Goldberg. “What Every Computer Scientist Should Know about Floating-Point Arithmetic” In: ACM Comput. Surv. 23.1 (Mar 1991), pp 5–48. issn: 0360-0300. doi: 10.1145/103162.103163. Also online at ↩︎

  2. Uniformly spaced in the interval subject to being representable by a double float. For instance, the classic example where \(0.1 + 0.2 = 0.30000000000000004 \neq 0.3\) is due to the fractions \(1/10\) and \(1/5\) having repeating digit expansions in the base-2 representations used to store floating point values. Here we take the representable finite precision value as exact and ignore that some values will be rounded and truncated from the true mathematically-defined sequence of values. ↩︎

  3. A quick example of an ulp is easiest to see in base 10. If the storage has only 1 digit of precision in the fraction, 1 ulp equals \(0.1\) since that is the smallest difference between any two numbers with a single digit after the decimal. The ulp is a useful quantity because a correctly rounded result should differ from the true (infinite precision) result by at most 0.5 ulps — i.e. \(0.06\) rounds to \(0.1\) and has a corresponding error of 0.4 ulps. ↩︎

  4. See for a more detailed summary of the supporting chipset architectures. ↩︎