In this article, I collect simple derivations that demonstrate how to calculate several statistical quantities using point-by-point online statistics. Particular emphasis is placed on supporting multiple weighting schemes (not limited to just uniform weights).

## Introduction

Typical discussions of statistical calculations make use of operations which are most easily expressed over batches (e.g. vectors) of data. There are situations, though, where data arrives incrementally and the typical definitions are not desirable. One example is streaming-data applications, where a statistic may need to be calculated over periods of time so long that it is not possible to store the entire history in working memory simultaneously. Therefore, the goal is to recast the traditional statistical calculations in a form which is amenable to being updated point-by-point while maintaining minimal state. More specifically, in this article we’ll use the term “online statistics” to mean any statistic which can be calculated in a point-by-point manner while requiring only a fixed (finite) amount of state to be propagated/stored at each update, for any length of data stream.

## Weighted Online Mean

The (arguably) simplest statistic to start with is the average of a vector. Because we’re interested in the general case, we’ll skip the simple arithmetic mean (assuming uniform weights) and go directly to the weighted mean $\mu_n$. For a data vector of $n$ data points $\{x_1, x_2, \ldots, x_n\}$ with associated weights $\{w_1, w_2, \ldots, w_n\}$, the weighted mean is defined as the quantity

$$
    \mu_n \equiv \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}
        = \frac{1}{W_n} \sum_{i=1}^n w_i x_i
    \quad\text{where}\quad
    W_n \equiv \sum_{i=1}^n w_i
$$

The online calculation follows from separating the $n$-th term from the rest of the series and simplifying.

Therefore, the online weighted mean is incrementally updated via the recursion

$$
    \mu_n = \left(1 - \frac{w_n}{W_n}\right) \mu_{n-1} + \frac{w_n}{W_n} x_n
$$
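As a concrete illustration, the recursion translates directly into a tiny accumulator driven entirely by the weight fraction $w_n/W_n$. This is a minimal Python sketch (the class and names are illustrative, not from any particular library); note that the first update must use a fraction of $1$, which overwrites the placeholder initial value:

```python
class OnlineMean:
    """Weighted online mean, updated one (value, weight-fraction) pair at a time."""

    def __init__(self):
        self.mean = 0.0  # placeholder; the first update (fraction 1) overwrites it

    def update(self, x, frac):
        # mu_n = (1 - w_n/W_n) * mu_{n-1} + (w_n/W_n) * x_n
        self.mean = (1.0 - frac) * self.mean + frac * x
        return self.mean

# Uniform weights correspond to the fraction 1/n at step n:
m = OnlineMean()
for n, x in enumerate([2.0, 4.0, 9.0], start=1):
    m.update(x, 1.0 / n)
# m.mean is now the arithmetic mean (2 + 4 + 9) / 3 = 5.0
```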

## Weighted Online Variance

Progressing from the mean, the next obvious statistic is the standard deviation.
For reasons which will soon become clear, we’ll instead consider the *biased* weighted variance $\sigma_n^2$, from
which the standard deviation can be very easily derived.
Let the biased variance be defined as the ratio of summations

$$
    \sigma_n^2 \equiv \frac{\sum_{i=1}^n w_i \left(x_i - \mu_n\right)^2}{\sum_{i=1}^n w_i}
$$

where the weighted mean $\mu_n$ is defined in the previous section. To make progress with calculating the online variance, it is actually easier to first consider the sum of weighted squared deviations $\beta_n$, which ignores the normalization of dividing by the sum of weights:

$$
    \beta_n \equiv \sum_{i=1}^n w_i \left(x_i - \mu_n\right)^2 = W_n \sigma_n^2
$$

For $\beta_n$, we can derive the online update via Welford’s Algorithm. (The purpose of using Welford’s algorithm rather than expanding the summation naively is that under finite floating-point arithmetic, the naive calculation is more apt to suffer loss of precision due to catastrophic cancellation, which Welford’s algorithm is less prone to.) The online update to the sum of squared deviations is

$$
    \beta_n = \beta_{n-1} + w_n \left(x_n - \mu_{n-1}\right)\left(x_n - \mu_n\right)
$$

(For a detailed derivation, see the appendix.)

Substituting this back into the expression for the variance,

which finally provides the online update to the biased variance $\sigma_n^2$:

$$
    \sigma_n^2 = \left(1 - \frac{w_n}{W_n}\right) \sigma_{n-1}^2
        + \frac{w_n}{W_n} \left(x_n - \mu_{n-1}\right)\left(x_n - \mu_n\right)
$$
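Putting the mean and variance updates together, a combined accumulator stays just as small. The following is a minimal Python sketch (class and names are illustrative, not from a particular library):

```python
class OnlineMeanVar:
    """Weighted online mean and biased variance via a Welford-style update."""

    def __init__(self):
        self.mean = 0.0
        self.var = 0.0  # biased variance sigma_n^2

    def update(self, x, frac):
        delta = x - self.mean      # x_n - mu_{x,n-1}
        self.mean += frac * delta  # equivalent to (1 - frac)*mean + frac*x
        # sigma_n^2 = (1 - w_n/W_n) sigma_{n-1}^2
        #           + (w_n/W_n) (x_n - mu_{n-1}) (x_n - mu_n)
        self.var = (1.0 - frac) * self.var + frac * delta * (x - self.mean)

stats = OnlineMeanVar()
for n, x in enumerate([1.0, 2.0, 3.0, 4.0], start=1):
    stats.update(x, 1.0 / n)
# mean = 2.5, biased variance = (1.5**2 + 0.5**2 + 0.5**2 + 1.5**2) / 4 = 1.25
```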

## Online Linear Polynomial Regression

To incrementally calculate the fit of a degree-1 (linear) polynomial, we’ll start with the matrix-form expression for weighted linear regression matrices. Recall that the model $y = mx + b$ translates for vectors of data points $(\bm x, \bm y)$ of length $n$ with model coefficients $m$ and $b$ to

The weighted least-squares solution finds $\bm a$ such that, for an observed set of values $\bm{\hat y}$, the sum of weighted squared deviations with respect to weights $\bm w$ is minimized. The closed-form solution is the normal equation:

$$
    \bm a = \left(\bm X^\top \bm W \bm X\right)^{-1} \bm X^\top \bm W \bm{\hat y}
$$

Since the model is only two-dimensional, the matrix inversion of the $2 \times 2$ matrix $\bm X^\top \bm W \bm X$ is tractable to expand element-by-element. Starting with this quadratic term, the elements are:

Inverting the matrix in terms of the defined scalar accumulators,

Similarly, the component-wise expansion of the latter term in the closed-form solution is:

Multiplying the two components together, we obtain expressions for the slope $m$ and intercept $b$ of the least-squares solution in terms of quantities which are amenable to online updates:

Inspection of these quantities should show several nearly-recognizable terms. For instance, relatively obvious is that $\mu_n = \delta_n / W_n$, but more subtly you may also see that $W_n \tau_n - \delta_n^2 = W_n^2 \sigma_n^2$.

Motivated both by these familiar terms and by the desire to re-write the accumulation quantities in terms of the ratios $w_n/W_n$ (in alignment with how the mean and variance are already defined), divide the numerator and denominator of each fraction through by $W_n^2$ and update the definitions of the accumulators:

where we have introduced new names $\alpha_n/W_n \rightarrow \mu_{y,n}$ and $\delta_n/W_n \rightarrow \mu_{x,n}$ to explicitly recognize the fact that they calculate the mean of each coordinate, and added primes to $\chi_n'$ and $\tau_n'$ to make it clear that they have been rescaled from their unprimed counterparts.

For the same reason as argued in the calculation of the variance, using a variation of Welford’s algorithm to accumulate $\sigma_n^2$ is numerically more accurate than taking the difference of accumulated quantities and is therefore preferred, but it comes at the apparent cost of requiring that yet another quantity be tracked. Thankfully, we can eliminate the accumulation of $\tau'_n$ based on the fact that it is related to the mean and variance as:

The slightly non-obvious transformation^{1} to

lends itself to inserting into the expression for $b$ and simplifying:

which eliminates any need to keep track of $\tau_n'$.

Continuing with the idea of Welford’s algorithm, the expression $\chi_n’ - \mu_{x,n} \mu_{y,n}$ may be recognizable as calculating the covariance between $x$ and $y$. Another variation of Welford’s algorithm applies to the calculation of the covariance between two quantities (see the appendix), so we substitute the covariance between $x$ and $y$, $\rho^2$, into the numerator of the expression for the slope.

Finally, we find that the online calculation of the slope and intercept over a data series is given by the set of equations
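As an illustrative sketch of the result (the class and names are mine), the fit reduces to tracking the two means, the variance of $x$, and the covariance, from which the slope follows as $\rho_n^2 / \sigma_n^2$ and the intercept as $\mu_{y,n} - m\,\mu_{x,n}$:

```python
class OnlineLinearFit:
    """Online weighted degree-1 polynomial fit, y = m*x + b."""

    def __init__(self):
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.var_x = 0.0   # sigma_n^2 of the x series
        self.cov_xy = 0.0  # rho_n^2 between x and y

    def update(self, x, y, frac):
        dx = x - self.mean_x  # x_n - mu_{x,n-1}
        self.mean_x += frac * dx
        self.mean_y += frac * (y - self.mean_y)
        # Welford-style updates for variance and covariance:
        self.var_x = (1.0 - frac) * self.var_x + frac * dx * (x - self.mean_x)
        self.cov_xy = (1.0 - frac) * self.cov_xy + frac * dx * (y - self.mean_y)

    def coefficients(self):
        m = self.cov_xy / self.var_x
        b = self.mean_y - m * self.mean_x
        return m, b

fit = OnlineLinearFit()
for n, (x, y) in enumerate([(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)], start=1):
    fit.update(x, y, 1.0 / n)
m, b = fit.coefficients()  # points lie on y = 3x + 1, so m ≈ 3 and b ≈ 1
```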

## Weighting Schemes

In the following subsections are the definitions for a couple of different weighting schemes. Before discussing particular weighting schemes, though, there are several properties and relationships that we can derive generally.

**Definitions**

Let us define the *weight fraction* $\omega_n \equiv w_n / W_n$ as the ratio of the *weight factor* $w_n$ and the *total weight* $W_n$.
Since every statistical update above is defined in terms of the weight fraction, we will take that as the only
known/given quantity, and then our task is to derive a self-consistent set of definitions for the weight factors and
total weight.

We begin by writing the total weight in terms of the weight fractions. Start by expanding the total weight by one term and substituting to eliminate the unknown $w_n$ that arises:

$$
    W_n = W_{n-1} + w_n = W_{n-1} + \omega_n W_n
$$

Then simply solving for $W_n$ in terms of just $\omega_n$ and $W_{n-1}$:

$$
    W_n = \frac{W_{n-1}}{1 - \omega_n}
$$

This is a simple recursive definition, where the only subtlety to consider is that there is no predecessor to $W_1$.

To resolve this definitional problem, let us assume that $W_1 = 1$. By the definition of the total weight, $W_1 = \sum_{i=1}^1 w_i = w_1$, so we are constrained to $w_1 = 1$, and therefore $\omega_1 = 1$ as well.
**Note that we’ve just argued that no matter the weighting scheme, we will have
$\boxed{\omega_1 = w_1 = W_1 = 1}$.**
We can justify this choice by considering (as an example) the update to the online mean. The “previous” mean is undefined, but since it is multiplied by the factor $(1 - \omega_1) = 0$, the first “update” is operationally equivalent to an assignment of the mean to the value of the first data sample. If you examine every one of the updates defined above, you should see that special-casing the first weight factor to unity effectively initializes every accumulation quantity.

Returning to the total weight recursion, we can now pretty easily conclude that the explicit formula for $W_n$ is

$$
    W_n = \prod_{i=2}^{n} \frac{1}{1 - \omega_i}
$$

From this, it directly follows that the traditional weight factors are given by

$$
    w_n = \omega_n W_n = \omega_n \prod_{i=2}^{n} \frac{1}{1 - \omega_i}
$$

**Effective sample size**

The effective sample size $n_\mathrm{eff}$ of a weighting scheme (given one particular definition) can be defined as a function of only the weights:

$$
    n_\mathrm{eff} \equiv \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2}
        = \frac{W_n^2}{Q_n}
$$

We’ve already derived the expression for the sum of weights $W_n$ above, so we turn to the sum of squared weights $Q_n$. Using the definition of the weight factors $w_i$ in terms of the weight fractions $\omega_i$,

$$
    Q_n \equiv \sum_{i=1}^n w_i^2
        = \sum_{i=1}^n \omega_i^2 \prod_{j=2}^{i} \left(\frac{1}{1 - \omega_j}\right)^2
$$

While mathematically correct, this expression is non-ideal from a numerical standpoint. First, the repeated products inside the summation would (naively) mean performing the same calculation multiple times (i.e. for $i = 3$, the product multiplies the factors for $j = 2$ and $j = 3$; for $i = 4$, we calculate the product of the factors for $j = 2$ through $j = 4$; and so on). Second, consider the case where $\omega_j$ is very small such that $\left(\frac{1}{1-\omega_j}\right) \sim 1 + \epsilon$ — the repeated products between terms imply sums involving contributions from terms of order $\mathcal{O}(\epsilon^{2i})$, which is challenging to do accurately with finite floating point arithmetic.

For both of these reasons, let us expand the summation and products for a few terms and look for a pattern:

Notice that except for the first term, the factor of $\left(\frac{1}{1-\omega_2}\right)^2$ appears in every later term. Factoring this out,

The term inside the square brackets now has a single initial factor that stands on its own, but then in every subsequent term there’s a factor of $\left(\frac{1}{1-\omega_3}\right)^2$ which can also be factored out. Repeating the pattern:

Finally, to help make the pattern a little more complete, let’s recall that the initial $1$ comes from the first term where $w_1 = \omega_1 = 1$, so we can substitute back in $1 \rightarrow \omega_1^2$ for it to get the nested pattern:

Starting at the innermost bracket, the next bracket outward is a simple multiply-add of the previous quantity with terms that depend only on either $\omega_i$ or $\omega_{i+1}$. This is a recursive process:

where we’ve introduced new quantities $Q_n^{(i)}$ which denote the partial sum of $Q_n$ at step $(i)$. Using this new notation, we have:

The only remaining quantity that needs to be specified is the initial condition $Q_n^{(n)}$. It should be clear from the finite expansion we performed above that if $n = 4$, there are no remaining terms implied by the $[\ldots]$, so $Q_4^{(4)} = \omega_4^2$. In general, we then conclude that $Q_n^{(n)} = \omega_n^2$, giving us the complete recursive definition of the sum of squared weights:

Therefore, the generic method for calculating the effective sample size — agnostic to the particular weighting scheme in use — is to calculate the sum of weights $W_n$ and sum of squared weights $Q_n$ as defined above, and then the effective sample size is simply
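For illustration, the bookkeeping can be sketched as a small helper (the function name is mine): update $W_n$ from the weight fraction, derive $w_n$, and accumulate $Q_n$. One caveat worth noting: $W_n$ grows geometrically for exponential-style weights, so a very long stream may instead prefer tracking the normalized quantity $q_n = Q_n/W_n^2$, which (dividing the recursions through by $W_n^2$) obeys $q_n = (1-\omega_n)^2 q_{n-1} + \omega_n^2$ with $n_\mathrm{eff} = 1/q_n$.

```python
def effective_sample_size(fracs):
    """n_eff = W_n**2 / Q_n from a stream of weight fractions (fracs[0] must be 1)."""
    W = 1.0  # W_1 = 1 by the convention omega_1 = w_1 = W_1 = 1
    Q = 1.0  # Q_1 = w_1**2 = 1
    for frac in fracs[1:]:
        W = W / (1.0 - frac)  # W_n = W_{n-1} / (1 - omega_n)
        w = frac * W          # w_n = omega_n * W_n
        Q += w * w            # Q_n = Q_{n-1} + w_n**2
    return W * W / Q

# Uniform weights (omega_n = 1/n) recover n_eff ≈ n:
n_eff = effective_sample_size([1.0 / n for n in range(1, 6)])  # ≈ 5
```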

### Uniform Weights

**Definitions**

- $\displaystyle \boxed{\omega_n = \frac{1}{n}}$

Uniform weights are the implicit weights seen in the “unweighted” case.

On the surface, there’s an ambiguity in the ratio $\omega_n = w_n / W_n$ as to whether it should be interpreted as the total weight always being unity, with every data point “reweighted” to the uniform value $1/n$ when the next data point is observed, or whether every weight factor is $1$ and the sum increases. Both definitions have their benefits, but the expressions derived above resolve this ambiguity (i.e. there’s only one self-consistent interpretation given the other assumptions we’ve made).

The total weight is

$$
    W_n = \prod_{i=2}^n \frac{1}{1 - \frac{1}{i}} = \prod_{i=2}^n \frac{i}{i - 1} = n
$$

where the last expression follows because the numerator and denominator of each successive term cancel each other out, except for the first denominator ($i - 1 = 2 - 1 = 1$) and the last numerator ($i = n$). It follows trivially that the weight factors are then

$$
    w_n = \omega_n W_n = \frac{1}{n} \cdot n = 1
$$

**Effective Sample Size**

The effective sample size for uniform weights is trivial since each weight factor is $1$. The sum of squared weights is $\boxed{Q_n = n}$, so $W_n^2/Q_n = n^2/n$ gives $\boxed{n_\mathrm{eff} = n}$, exactly as expected.

### Exponential Weights

- $\displaystyle \boxed{\omega_n = \lambda}$ (for $n > 1$) where $0 < \lambda < 1$.

Qualitatively, the constant weight fraction results in an exponential weight distribution since the newest data point always contributes a fixed fraction of the total weight, which necessarily means that prior data points are repeatedly “de-weighted” (relative to the total weight) by a fixed ratio as well. The first data point undergoes this de-weighting the most number of times, and the relative de-weighting will occur fewer times for successively newer data points. This describes a geometric series of increasing relative weights.

Analytically, the total weight and weight factors according to the expression above are:

$$
    W_n = \prod_{i=2}^n \frac{1}{1 - \lambda} = \left(\frac{1}{1 - \lambda}\right)^{n-1}
    \qquad\text{and}\qquad
    w_n = \lambda \left(\frac{1}{1 - \lambda}\right)^{n-1}
$$

for $n > 1$ (with $w_1 = W_1 = 1$ as always).

**Effective Sample Size**

Rather than using the generic, recursive definition for the sum of squared weights $Q_n$, we can specialize given the known weight factors. Explicitly,

$$
    Q_n = \sum_{i=1}^n w_i^2
        = 1 + \lambda^2 \sum_{i=2}^n \left(\frac{1}{1 - \lambda}\right)^{2(i-1)}
$$

Formally, the summation is a geometric series, which is easier to see by making the convenient substitutions $\gamma \equiv 1 - \lambda$ and $j = i - 1$:

$$
\begin{aligned}
    Q_n &= 1 + \lambda^2 \sum_{j=1}^{n-1} \gamma^{-2j} \\
        &= 1 + \frac{\lambda^2}{\gamma^2} \sum_{j=1}^{n-1} \left(\gamma^{-2}\right)^{j-1}
\end{aligned}
$$

where the second line follows by pulling out one factor of $1/\gamma^2$ from the summation in order to decrease the power inside the summation by 1. This puts the summation in the form of a finite geometric series, which has a well-known closed-form expression for the partial sum, so that

$$
    Q_n = 1 + \frac{\lambda^2}{\gamma^2} \cdot \frac{\gamma^{-2(n-1)} - 1}{\gamma^{-2} - 1}
        = 1 + \frac{\lambda^2}{1 - \gamma^2} \left(\gamma^{-2(n-1)} - 1\right)
$$

Substituting back in the definition of $\gamma = 1 - \lambda$ (so that $1 - \gamma^2 = \lambda(2 - \lambda)$), we get that the sum of squared weights for exponential weighting is given by:

$$
    Q_n = 1 + \frac{\lambda}{2 - \lambda} \left[\left(\frac{1}{1 - \lambda}\right)^{2(n-1)} - 1\right]
$$

Combining with the known expression for $W_n$, the effective sample size $n_\mathrm{eff}$ is

$$
    n_\mathrm{eff} = \frac{W_n^2}{Q_n}
        = \frac{(1 - \lambda)^{-2(n-1)}}{1 + \dfrac{\lambda}{2 - \lambda}\left[(1 - \lambda)^{-2(n-1)} - 1\right]}
$$
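The closed forms for exponential weighting are easy to cross-check numerically against the direct sums over the weight factors; the following sketch does exactly that (variable names are mine):

```python
lam, n = 0.25, 12
gamma = 1.0 - lam

# Direct evaluation of the weight factors: w_1 = 1, w_i = lam / gamma**(i-1)
w = [1.0] + [lam / gamma ** (i - 1) for i in range(2, n + 1)]
W_direct = sum(w)
Q_direct = sum(wi * wi for wi in w)

# Closed-form expressions for the total weight and sum of squared weights:
W_closed = gamma ** -(n - 1)
Q_closed = 1.0 + (lam / (2.0 - lam)) * (gamma ** (-2 * (n - 1)) - 1.0)

n_eff = W_closed ** 2 / Q_closed  # approaches 2/lam - 1 = 7 as n grows
```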

**Scaling calculations for $\lambda$**

To define the value of $\lambda$, a useful definition can be to consider the “time” at which the $k$-th prior weight decays to some fraction of the latest weight, i.e. $\frac{w_{n-k}}{w_n} = \epsilon$ (where $0 < \epsilon < 1$). Since $\frac{w_{n-k}}{w_n} = (1 - \lambda)^k$, solving for $\lambda$ as a function of $k$ yields:

$$
    \lambda = 1 - \epsilon^{1/k} = 1 - e^{\ln\epsilon / k}
$$

The two most common values of $\epsilon$ are:

- $\epsilon = \frac{1}{2}$ corresponds to defining the “half life” $k = k_\mathrm{half}$. Then $\lambda = 1 - e^{-\ln 2 / k_\mathrm{half}}$.
- $\epsilon = \frac{1}{e}$ corresponds to defining the “time constant” $k = \kappa$. Then $\lambda = 1 - e^{-1/\kappa}$.

The half life and time constant are related by $k_\mathrm{half} = \kappa \ln 2$.

The effective sample size calculation gives another way to define $\lambda$. Consider the limit of $n_\mathrm{eff}$ as $n \rightarrow \infty$:

$$
    \lim_{n \rightarrow \infty} n_\mathrm{eff} = \frac{2 - \lambda}{\lambda} = \frac{2}{\lambda} - 1
$$

We see that the effective sample size asymptotically approaches a constant defined in terms of only the weight fraction $\lambda$. Therefore, we can reverse the relationship and solve for a weight fraction that results in a particular effective sample size, $\lambda = \frac{2}{1 + n_\mathrm{eff}}$.
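These parameterizations are convenient to encapsulate as conversion helpers; the following is a small sketch (the helper names are hypothetical, not from any library):

```python
import math

def lam_from_halflife(k_half):
    """lambda such that a weight decays to 1/2 of the newest after k_half steps."""
    return 1.0 - math.exp(-math.log(2.0) / k_half)

def lam_from_time_constant(kappa):
    """lambda such that a weight decays to 1/e of the newest after kappa steps."""
    return 1.0 - math.exp(-1.0 / kappa)

def lam_from_neff(n_eff):
    """lambda whose asymptotic effective sample size is n_eff (inverts 2/lam - 1)."""
    return 2.0 / (1.0 + n_eff)

lam = lam_from_halflife(10.0)
# After k = 10 steps, the relative weight is (1 - lam)**10 = 1/2
```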

## Welford’s Algorithm

The sum of weighted squared deviations $\beta_n$ is defined as

$$
    \beta_n \equiv \sum_{i=1}^n w_i \left(x_i - \mu_n\right)^2
$$

To calculate the quantity as an online update from $\beta_{n-1}$, the naive approach looks like the following^{2}, where the square is expanded, terms are regrouped, and the single terms can be identified and accumulated:

$$
    \beta_n = \sum_{i=1}^n w_i x_i^2 - 2\mu_n \sum_{i=1}^n w_i x_i + \mu_n^2 \sum_{i=1}^n w_i
        = \tau_n - W_n \mu_n^2
    \quad\text{where}\quad
    \tau_n \equiv \sum_{i=1}^n w_i x_i^2
$$

where $\mu_n$ and $W_n$ are updated as derived above, and we’ve introduced a new quantity to track, the sum of weighted squares $\tau_n$. The problem is that if the quantities $\tau_n$ and $W_n \mu_n^2$ have similar magnitude, the difference can suffer from catastrophic cancellation and significantly lose precision.

Welford’s Algorithm describes an alternate update method which avoids the worst of the numerical issues. To derive the Welford-style update, instead start by considering the difference of successive sums of deviations,

$$
    \beta_n - \beta_{n-1} = \sum_{i=1}^n w_i \left(x_i - \mu_n\right)^2
        - \sum_{i=1}^{n-1} w_i \left(x_i - \mu_{n-1}\right)^2
$$

Expanding the first summation to exclude the $i=n$ term and grouping the remaining summation with the second summation,

The term in the square brackets is a difference of squares, so factoring as $a^2 - b^2 = (a - b)(a + b)$, the summation can be rewritten as

Then using the null identity $\sum_{i=1}^n w_i \left( x_i - \mu_n \right) = 0$, the second parenthesized group of terms trivially goes to zero:

(The null identity is readily derived from the definition of the weighted mean.) For the first parenthesized group where the limit $n$ of the summation and the subscript $n-1$ on the mean $\mu$ differ, we must first manipulate the sum.

Substituting this back into the expression for the differences of $\beta$, we finally arrive at

which gives us the completed derivation of the weighted version of Welford’s algorithm:

$$
    \beta_n = \beta_{n-1} + w_n \left(x_n - \mu_{n-1}\right)\left(x_n - \mu_n\right)
$$

It is also interesting to note that the variance $\operatorname{var}(x)$ is a special case of the covariance $\operatorname{cov}(x, y)$ between two vectors $x$ and $y$. Therefore, a useful ansatz is that we can generalize Welford’s algorithm for the variance to also work for calculating the online covariance $\rho_n^2$ of two series $\{x_i\}$ and $\{y_i\}$ as:

$$
\begin{aligned}
    \rho_n^2 &= \left(1 - \frac{w_n}{W_n}\right) \rho_{n-1}^2
        + \frac{w_n}{W_n} \left(x_n - \mu_{x,n-1}\right)\left(y_n - \mu_{y,n}\right) \\
    {} &= \left(1 - \frac{w_n}{W_n}\right) \rho_{n-1}^2
        + \frac{w_n}{W_n} \left(x_n - \mu_{x,n}\right)\left(y_n - \mu_{y,n-1}\right)
\end{aligned}
$$

where the mean $\mu$ has gained subscripts to differentiate the two distinct means. There’s an apparent asymmetry in whether $x_n$ or $y_n$ is paired with its $n$-th or $(n-1)$-th mean, but it is easy to show that $z_n - \mu_{z,n} = (1 - w_n/W_n) (z_n - \mu_{z,n-1})$, so both the first and second line are equivalent.
A full proof that the generalization is valid is given by Schubert & Gertz (2018)^{3}.
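The generalized update is a one-line change from the variance accumulator; here is a minimal sketch (illustrative naming), normalized by the weight fraction as elsewhere in this article:

```python
class OnlineCovariance:
    """Weighted online (biased) covariance via the generalized Welford update."""

    def __init__(self):
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.cov = 0.0  # rho_n^2

    def update(self, x, y, frac):
        dx = x - self.mean_x  # x_n - mu_{x,n-1}
        self.mean_x += frac * dx
        self.mean_y += frac * (y - self.mean_y)
        # Pairs (x_n - mu_{x,n-1}) with (y_n - mu_{y,n}); the opposite
        # pairing is equivalent, as noted in the text.
        self.cov = (1.0 - frac) * self.cov + frac * dx * (y - self.mean_y)

c = OnlineCovariance()
for n, (x, y) in enumerate([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)], start=1):
    c.update(x, y, 1.0 / n)
# Biased covariance of these series is 2.5
```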

## Online Least Squares via QR decomposition

A well-known substitute for directly solving for the regression coefficients via the normal equation

$$
    \bm a = \left(\bm X^\top \bm W \bm X\right)^{-1} \bm X^\top \bm W \bm y
$$

which involves an explicit matrix inversion, is to instead QR-decompose the regressor $\bm X$ and solve the system of equations

$$
    \bm a = \bm R \backslash \left(\bm Q^\top \bm y\right)
$$

where the backslash denotes solving the system of equations without explicitly inverting $\bm R$.
For a detailed derivation, see Appendix D.3 of my thesis^{4}.

For use in online statistics, we must explicitly construct the QR decomposition (rather than relying on a numerical linear algebra library as typically happens). One method (and the easiest for our use here) of calculating the QR decomposition is via successive rounds of Gram-Schmidt orthogonalization. The first column of $\bm Q$, $\bm q_1$, is simply the first column of $\bm X$, $\bm x_1$, normalized to unit length:

The second column $\bm q_2$ is constructed by orthogonalizing $\bm x_2$ by removing its component which is parallel to $\bm q_1$ and normalizing:

A third column would be constructed similarly — projecting out the components of both $\bm q_1$ and $\bm q_2$ which are present in $\bm x_3$ and normalizing — but since we are interested specifically in regressing a degree-1 polynomial with only two coefficients, just the first two columns will suffice.

These together implicitly define $\bm Q = \begin{bmatrix} \bm q_1 & \bm q_2 \end{bmatrix}$. The partnering $\bm R$ matrix is a record of the transformations performed to turn the column basis $\bm x_i$ into $\bm q_i$. Namely for the $2 \times 2$ case,

which can be verified by performing the block-multiplication $\bm Q \bm R$ and observing that the products recover the column vectors $\bm x_i$.
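The two-column Gram-Schmidt procedure translates directly into code; the following pure-Python sketch (function name is mine) performs the decomposition and lets us check that $\bm Q \bm R$ reproduces the original columns, as described above:

```python
import math

def qr_two_columns(x1, x2):
    """QR decomposition of [x1 x2] via classical Gram-Schmidt (columns as lists)."""
    r11 = math.sqrt(sum(v * v for v in x1))
    q1 = [v / r11 for v in x1]                  # normalize the first column
    r12 = sum(a * b for a, b in zip(q1, x2))    # component of x2 along q1
    v2 = [b - r12 * a for a, b in zip(q1, x2)]  # orthogonalize x2 against q1
    r22 = math.sqrt(sum(v * v for v in v2))
    q2 = [v / r22 for v in v2]                  # normalize the result
    return (q1, q2), ((r11, r12), (0.0, r22))

(q1, q2), R = qr_two_columns([1.0, 1.0, 1.0], [0.0, 1.0, 2.0])
# x1 == r11 * q1 and x2 == r12 * q1 + r22 * q2 (up to roundoff)
```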

For the specific case of a weighted degree-1 polynomial regression, there is an extra subtlety: the regressor $\bm X \rightarrow \bm X' = {\bm W}^{1/2} \bm X$ and the ordinate vector $\bm y \rightarrow \bm y' = {\bm W}^{1/2} \bm y$ — where the square root is applied elementwise to the diagonals of the weight matrix — to account for the weight factors. Let $(\bm w')_{i} = \sqrt{(\bm w)_i}$ be the vector of the elementwise square roots of the weight vector.

With the weights in mind, we start with the first column of $\bm X'$, which is simply the square-root weight vector $\bm w'$. Its norm is $\| \bm w' \| = \sqrt{W_n}$, so we easily find

Next, the second column of $\bm X'$ is the vector $\begin{bmatrix} w'_1 x_1 & w'_2 x_2 & \ldots & w'_n x_n \end{bmatrix}^\top$, so

where we first recognize $(w'_i)^2 = w_i$ and then identify the near match to the definition of the weighted mean over values $x_i$.

Projecting out the component of $\bm q_1$ in $\bm x'_2$,

and the norm $\| \bm v_2 \|$ should now be recognizable as containing the sum of weighted squared deviations of $\bm x$, so we directly jump to using Welford’s algorithm on the quantity to declare

Therefore,

With every component of the QR decomposition in hand, we can now start solving the system of equations. For the product $\bm z \equiv \bm Q^\top \bm y'$,

where again we’ve recognized definitions of the weighted mean over values $y_i$ and covariance between $x_i$ and $y_i$ from the derivation in the main body.

Finally, we have every component ready and can solve the system of equations $\bm R \bm a = \bm z$:

Using back-substitution, we start with the equation for $m$:

where we recognize that $\beta_n / W_n = \sigma_n^2$. Then relying on $m$ being known, the equation for $b$ is

These are precisely what was derived in the main body without ever constructing an expression for $\tau_n$ and having to eliminate it.


A way to more directly arrive at this particular choice of accumulation variables is to replace use of the normal equation with the back-solve method allowed by QR decomposition of the regressor matrix $\bm X$ (and proceeding to do the component-wise expansion of the solve). ↩︎

This is effectively the derivation that shows that

\begin{align*} \newcommand{\bbE}[1]{\mathbb{E}\left[#1\right]} \operatorname{var}(x) &= \bbE{(x - \mu_x)^2} \\ {} &= \bbE{x^2} - \left(\bbE{x}\right)^2 \end{align*}

where $\bbE{\cdot}$ denotes the expectation value. ↩︎

Erich Schubert and Michael Gertz. “Numerically Stable Parallel Computation of (Co-)Variance.” In *Proceedings of the 30th International Conference on Scientific and Statistical Database Management*, SSDBM ‘18. New York, NY, USA, 2018. Association for Computing Machinery. URL: https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf, doi: 10.1145/3221269.3223036 ↩︎

Justin Willmert. *Constraining Inflationary B-modes with the Bɪᴄᴇᴘ/Keck Array Telescopes.* PhD thesis, University of Minnesota, October 2019. URL: https://hdl.handle.net/11299/211821 ↩︎

*2022 Oct 27* — The article has been updated to produce a self-consistent set of weight fractions, weight factors, and total weight. The “Weighting Schemes” section has gained a general derivation of how the weight factors and total weight can be derived from the weight fractions, and the initial condition of all weighting schemes has been clarified. These clarifications and corrections particularly impact the exponential case, where there is no remaining ambiguity in how to define the weight factors (and an explicit expression for the total weight has been added). ↩︎

*2022 Nov 01* — Additional notes have been added to describe how the exponential weighting fraction $\lambda$ relates to decay rates, parameterized in terms of both half-life and the exponential time constant. ↩︎

*2022 Dec 12* — Derivations of the effective sample size $n_\mathrm{eff}$ were added, both in the general case and specializations for each of the uniform and exponential weighting schemes. ↩︎