heiner.ai

The chain rule, Jacobians, autograd, and shapes

Sun, 19 Feb 2023 00:00:00 +0000

“Man muss immer umkehren”
– Carl Gustav Jacob Jacobi¹

This is a short explainer about the chain rule and autograd in PyTorch and JAX, from the perspective of a mathematical user.

~~The Joker~~

Carl Gustav Jacob Jacobi

(1804 – 1851)
“Die Haare immer nach hinten kehren”
Image source: Wikipedia

There are many, many explanations of this on the web. Many are likely better than this one. I’ll still write my own, which in the spirit of this blog is meant to be written, not read. I also won’t focus on the implementation of the system, just on its observable behavior.

Other, perhaps better sources for the same info are:

A gentle introduction to torch.autograd from the tutorials at pytorch.org.
Autograd mechanics from pytorch.org.
Zachary DeVito’s excellent colab with an example implementation of reverse-mode autodiff from scratch. I highly recommend studying this one.
The Autodiff Cookbook in the JAX docs. Also very good.

The JAX docs especially are delightfully mathematical. There are many more sources on the web. I also like the more extensive treatment in Thomas Frerix’s PhD thesis.

Still, let me add my own spin on the issue. One reason is that many other articles write things like dL/dOutputs and dL/dInputs and generally use a “variable”-based notation that, while entirely reasonable from an implementation standpoint, would make Spivak sad.

The chain rule

All of “deep learning”, and most of Physics, depends on this theorem discovered by Leibniz ca. 1676.

One dimensional functions

Given two differentiable functions $f,g\from\R\to\R$, the chain rule says

\[(g\circ f)'(x) = g'(f(x))f'(x) \where{x\in\R}. \label{eq:cr1}\tag{1}\]

Here, $(g\circ f)(x) = g(f(x))$ is the composition, i.e., chained evaluation of first $y = f(x)$, then $g(y)$.

Multidimensional functions

It’s one of the wonders of analysis that this rule keeps being correct for differentiable multidimensional functions $f\from\R^n\to\R^m$, $g\from\R^m\to\R^k$ if “differentiable” is defined correctly. Skipping over some technicalities, this requires $f$ and $g$ to be totally differentiable (also called Fréchet differentiable), which implies that all components $f_j\from\R^n\to\R$ are partially differentiable; the derivative $f'(a)$ for $a\in\R^n$ can then be shown to be equal to the Jacobian matrix

\[f'(a) = J_f(a) := \bigl(\partial_k f_j(a)\bigr)_{\substack{j=1,\ldots,m\\ k=1,\ldots,n}} = \begin{pmatrix} \partial_1 f_1(a) & \cdots & \partial_n f_1(a) \\ \vdots && \vdots \\ \partial_1 f_m(a) & \cdots & \partial_n f_m(a) \end{pmatrix} \in\R^{m\times n}.\]

The idea is to view $f(a) = (f_1(a), \ldots, f_m(a))^\top\in\R^m$ as a column vector and add one column per partial derivative, i.e., dimension of its input $a$.

With this definition, the multidimensional chain rule reads like its 1D version $\eqref{eq:cr1}$,

\[(g\circ f)'(x) = g'(f(x))\cdot f'(x) \where{x\in\R^n}. \label{eq:crN}\tag{2}\]

However, in this case this is the matrix multiplication $\cdot\from\R^{k\times m}\times\R^{m\times n}\to\R^{k\times n}$ of $J_g(f(x))$ and $J_f(x)$ and their order is important. Jacobi would have called this nachdifferenzieren.
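To see $\eqref{eq:crN}$ in action, here is a small numerical sketch. The two functions `f` and `g` are made up for illustration; the only library call assumed is `torch.autograd.functional.jacobian`, which returns the full Jacobian of a function at a point.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):  # Hypothetical example map R^3 -> R^2.
    return torch.stack([x[0] * x[1], torch.sin(x[2])])


def g(y):  # Hypothetical example map R^2 -> R^4.
    return torch.stack([y[0] ** 2, y[0] * y[1], torch.exp(y[1]), y[0] + y[1]])


x = torch.tensor([0.5, -1.0, 2.0], dtype=torch.float64)

J_f = jacobian(f, x)  # Shape (2, 3).
J_g = jacobian(g, f(x))  # Shape (4, 2), evaluated at f(x).
J_gf = jacobian(lambda t: g(f(t)), x)  # Shape (4, 3).

# The chain rule: Jacobian of the composition == product of Jacobians.
chain_rule_holds = torch.allclose(J_gf, J_g @ J_f)
print(J_gf.shape, chain_rule_holds)
```

Swapping the order to `J_f @ J_g` would not even type-check here, since the shapes $(2\times 3)$ and $(4\times 2)$ don't match.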

By its nature, this rule iterates, i.e., the derivative of the composition $h\circ g\circ f$ is

\[(h\circ g\circ f)'(x) = h'(g(f(x)))\cdot g'(f(x))\cdot f'(x) = (h'\circ g\circ f)(x)\cdot (g'\circ f)(x)\cdot f'(x),\]

and likewise for a composition of $n$ functions

\[(f_n\circ \cdots \circ f_1)'(x) = (f_n'\circ f_{n-1}\circ\cdots\circ f_1)(x)\cdot (f_{n-1}'\circ f_{n-2}\circ\cdots\circ f_1)(x) \,\cdots\, (f_2'\circ f_1)(x) \cdot f_1'(x). \label{eq:crNn}\tag{3}\]

Note that subscripts no longer mean components here, we are talking about $n$ functions, each with multidimensional inputs and outputs.

Backprop

Now, while matrix multiplication isn’t commutative, meaning in general $AB \ne BA$, it is associative: For a product of several matrices $ABC$, it does not matter if one computes $(AB)C$ or $A(BC)$; this is what makes the notation $ABC$ sensible in the first place. However, this only means the same output is produced by those two alternatives. It does not mean the same amount of “work” (or “compute”) went into either case.

Counting scalar operations, a product $AB$ with $A\in\R^{k\times m}$ and $B\in\R^{m\times n}$ takes $nkm$ multiplications and $nk(m-1)$ additions, since each entry in the output matrix is the sum of $m$ multiplications. In a chain of matmuls like $ABCDEFG\cdots$, it’s clear that some groupings like $((AB)(CD))(E(FG))\cdots$ might be better than others. In particular, there may well be better and worse ways of computing the Jacobian $\eqref{eq:crNn}$. However, to quote Wikipedia:

The problem of computing a full Jacobian of $f\from\R^n\to\R^m$ with a minimum number of arithmetic operations is known as the optimal Jacobian accumulation (OJA) problem, which is NP-complete.
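To make the difference in “work” concrete, here is a rough sketch that only counts scalar multiplications for two extreme groupings of a matrix chain. The shapes are made up; the first factor has a single row, as it would for a scalar-valued final function.

```python
def matmul_cost(shape_a, shape_b):
    """Multiplications for (k x m) times (m x n), plus the result shape."""
    k, m = shape_a
    m2, n = shape_b
    assert m == m2, "inner dimensions must match"
    return k * m * n, (k, n)


def chain_cost_left_to_right(shapes):
    """Cost of ((A B) C) D ... for matrices with the given shapes."""
    total, acc = 0, shapes[0]
    for shape in shapes[1:]:
        cost, acc = matmul_cost(acc, shape)
        total += cost
    return total


def chain_cost_right_to_left(shapes):
    """Cost of A (B (C D)) ... for matrices with the given shapes."""
    total, acc = 0, shapes[-1]
    for shape in reversed(shapes[:-1]):
        cost, acc = matmul_cost(shape, acc)
        total += cost
    return total


# Hypothetical chain: one row vector followed by three square matrices.
shapes = [(1, 100), (100, 100), (100, 100), (100, 100)]
left = chain_cost_left_to_right(shapes)
right = chain_cost_right_to_left(shapes)
print(left, right)  # 30000 vs. 2010000 multiplications.
```

Both groupings produce the same $1\times 100$ result, but going from the left never materializes a $100\times 100$ intermediate product.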

The good news is that in important special cases, this problem is easy. In particular, if the first matrix in the product has only one row, or the last only one column, it makes sense to compute the product “from the left” or “from the right”, respectively. And those cases are not particularly pathological either, as they correspond to either the final function $f_n$ mapping to a scalar or the whole composition depending only on a scalar input $x\in\R$. The former case is especially important as that’s what happens whenever we compute the gradients of a scalar loss function, such as in basically all cases where neural networks are used.² The corresponding orders of multiplications in the chain rule $\eqref{eq:crNn}$ are known as reverse mode or forward mode, respectively. Ignoring some distinctions not relevant here, reverse mode is also known as backpropagation, or backprop for short.

In this case, the image of $f_n$ is one dimensional, and therefore its Jacobian matrix $f_n'$ has only one row – it’s a row vector. After multiplication with the next Jacobian $f_{n-1}'$, this property is preserved: All matmuls in the chain turn into vector-times-matrix. Specifically, they are vector-times-Jacobian-matrix, more commonly known as a vector-Jacobian product, or VJP. To illustrate:

\[\begin{pmatrix} \unicode{x2E3B} \end{pmatrix}_1 \begin{pmatrix} | & | & | \\ | & | & | \\ | & | & | \end{pmatrix}_2 \begin{pmatrix} | & | & | \\ | & | & | \\ | & | & | \end{pmatrix}_3 \cdots \begin{pmatrix} | & | & | \\ | & | & | \\ | & | & | \end{pmatrix}_n = \begin{pmatrix} \unicode{x2E3B} \end{pmatrix} \begin{pmatrix} | & | & | \\ | & | & | \\ | & | & | \end{pmatrix}_3 \cdots \begin{pmatrix} | & | & | \\ | & | & | \\ | & | & | \end{pmatrix}_n = \text{etc.}\]

What this means is that for neural network applications, a system like PyTorch or JAX doesn’t need to actually compute full Jacobians – all it needs are vector-Jacobian products. It turns out that (classic) PyTorch does and can in fact do nothing else.

Also notice that the row vector $\begin{pmatrix}\unicode{x2E3B}\end{pmatrix}$ multiplied from the left is “output-shaped” from the perspective of the Jacobian it gets multiplied to. This is the reason grad_output in PyTorch always has the shape of the operation’s output and why torch.Tensor.backward receives an argument described as the “gradient w.r.t. the tensor”.

You may wonder what it means for it to have a specific shape vs just being a “flat” row vector as it is here. I certainly wondered about this; the answer is below.

In PyTorch

Here’s a very simple PyTorch example (lifted from here):

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

y = torch.empty(3)
y[0] = x[0] ** 2
y[1] = x[0] ** 2 + 5 * x[1] ** 2
y[2] = 3 * x[1]

v = torch.tensor([1.0, 2.0, 3.0])
y.backward(v)  # VJP.

print("y:", y)
print("x.grad:", x.grad)

# Manual computation.
dydx = torch.tensor(
    [
        [2 * x[0], 0],
        [2 * x[0], 10 * x[1]],
        [0, 3],
    ]
)

assert torch.equal(x.grad, v @ dydx)

In PyTorch, Tensor.backward computes the backward pass via vector-Jacobian products. If the tensor in question is a scalar, an implicit 1 is assumed; otherwise one has to supply a tensor of the same shape, which is used as the vector in the VJP, as in the example above.
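The “implicit 1” can be checked directly. The following sketch uses a made-up scalar loss and compares calling `backward()` without an argument to passing an explicit scalar `1.0`; it also shows that a non-scalar tensor refuses to guess the vector.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Scalar output: backward() without arguments uses an implicit 1.
loss = (x**2).sum()
loss.backward()
grad_implicit = x.grad.clone()

# Same computation, but with the "vector" of the VJP spelled out.
x.grad = None
loss = (x**2).sum()
loss.backward(torch.tensor(1.0))
grad_explicit = x.grad.clone()

# Non-scalar output: PyTorch raises instead of assuming a vector.
y = x**2
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

print(grad_implicit, grad_explicit, raised)
```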

But wait, tensor or vector?

For the typical mathematician, the term “tensor” for the multidimensional array in PyTorch and JAX is a bit of an acquired (or not) taste – but to be fair, so is anything about the real tensor product $\otimes$.

The situation here seems especially confusing – the chain rule from multidimensional calculus makes a specific point of what’s a row and what’s a column and treats functions $\R^n\to\R^m$ by introducing $m\times n$ matrices. In PyTorch, functions depend on and produce one (or several) multidimensional “tensors”. What gives?

The answer turns out to be relatively simple: The math part of the backward pass doesn’t depend on these tensor shapes in any deep way. Instead, it does the equivalent of “reshaping” everything into a vector. Example:

x = torch.tensor(
    [
        [1.0, 2.0],
        [3.0, 4.0],
    ],
    requires_grad=True,
)

y = torch.empty((3, 2))
y[0, 0] = x[0, 0] ** 2
y[1, 0] = x[0, 0] ** 2 + 5 * x[0, 1] ** 2
y[2, 0] = 3 * x[1, 0]
y[0, 1] = x[1, 1]
y[1, 1] = 0
y[2, 1] = torch.sum(x)

v = torch.tensor(
    [
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0],
    ]
)

print("y:", y)
y.backward(v)  # VJP.
print("x.grad:", x.grad)

What did this even compute? It’s the equivalent of reshaping inputs and outputs to be vectors, then applying the standard calculus from above:

dydx = torch.tensor(  # y.reshape(-1) x x.reshape(-1) Jacobian matrix.
    [
        [2 * x[0, 0], 0, 0, 0],  # y[0, 0]
        [0, 0, 0, 1],  # y[0, 1]
        [2 * x[0, 0], 10 * x[0, 1], 0, 0],  # y[1, 0]
        [0, 0, 0, 0],  # y[1, 1]
        [0, 0, 3, 0],  # y[2, 0]
        [1, 1, 1, 1],  # y[2, 1]
    ]
)

print("dydx", dydx)
assert torch.equal(
    x.grad,
    (v.reshape(-1) @ dydx).reshape(x.shape),
)

To be clear: PyTorch does not actually compute the Jacobian only to multiply it from the left with this vector, but what it does has the same output as this less efficient code.
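Instead of writing the flattened Jacobian by hand, one can also let PyTorch produce it. This is a sketch with a different, made-up function `f` from a $(2, 2)$ tensor to a $(3, 2)$ tensor. It assumes `torch.autograd.functional.jacobian`, which returns a tensor of shape output shape + input shape, here $(3, 2, 2, 2)$.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Hypothetical map from a (2, 2) tensor to a (3, 2) tensor.
    return torch.cat([x**2, x.sum(dim=0, keepdim=True)])


x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
v = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

f(x).backward(v)  # VJP, as computed by autograd.

jac = jacobian(f, x.detach())  # Shape (3, 2, 2, 2).
flat_jac = jac.reshape(v.numel(), x.numel())  # The (6, 4) Jacobian matrix.

# Flatten v to a row vector, multiply, reshape back to the shape of x.
vjp = (v.reshape(-1) @ flat_jac).reshape(x.shape)
matches = torch.allclose(x.grad, vjp)
print(x.grad, matches)
```

The reshape to $(6, 4)$ works because both `reshape(-1)` calls use the same row-major order for outputs and inputs.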

In JAX

PyTorch is great. But it’s not necessarily principled. This tweet expresses this somewhat more aggressively:

The Aesthetician in me wants to be constantly annoyed by how ugly PyTorch is but frankly I’m consistently impressed with how easy it is to develop in.
— Aidan Clark (@_aidan_clark_) September 9, 2022

JAX is, arguably, different, at least on the first count. It comes with vector-Jacobian products, Jacobian-vector products, and also the option to compute full Jacobians if required. I couldn’t write up the details better than the JAX Autodiff Cookbook does, so I won’t try. To quote just one relevant portion, adjusted slightly to our notation:

The JAX function vjp can take a Python function for evaluating $f$ and give us back a Python function for evaluating the VJP $(x, v)\mapsto(f(x), v^\top f'(x))$.

So what does it really do?

For the autodiff details, read either Zachary DeVito’s colab with an example implementation of what PyTorch does, or the JAX Autodiff Cookbook, or ideally both.

But to round things out, let’s look at how one defines a “custom function” with its own forward and backward pass in PyTorch.

We’ll take elementwise multiplication as our first example; the fancy mathematics name of this simple operation is Hadamard product (also known as Schur product – their motto was “name, always name”³). We’ll denote it by $\odot.$ To fit it within the standard calculus above, we reinterpret $\odot\from\R^n\times\R^n\to\R^n$ as $\odot\from\R^{2n}\to\R^n$, multiplying the first and second “half” of its single-vector input, i.e., $\R^{2n}\ni x\mapsto\odot(x) = (x_jx_{n+j})_{j=1,\ldots,n}\in\R^n.$ Its derivative is

\[\odot'(x) = \begin{pmatrix} x_{n+1} & & & & x_1 & & & \\ & x_{n+2} & & & & x_2 & & \\ & & {\lower 3pt\smash{\ddots}} & & & & {\lower 3pt\smash{\ddots}} & \\ & & & x_{2n} & & & & x_n \end{pmatrix} \in\R^{n\times 2n}, \where{x\in\R^{2n}}\]

where empty cells are zeros. It should be immediately obvious that there’s no need to “materialize” these diagonal Jacobians. In fact, given an output-shaped vector $v = (v_1, \ldots, v_n)\in\R^n$, the VJP here is just

\[v\cdot \odot'(x) = (v_1x_{n+1}, \ldots, v_nx_{2n}, v_1x_1, \ldots, v_nx_n) \in \R^{2n}.\]

(And that already seems like a needless complication of such a simple thing as elementwise multiplication!)

The PyTorch version of this is … well, it’s just x * y, and PyTorch knows full well how to compute and differentiate that. But if we wanted for some reason to add this from scratch, it might look like this:

class HadamardProduct(torch.autograd.Function):
    """Computes lhs * rhs and its backward pass."""
    @staticmethod
    def forward(ctx, lhs, rhs):
        ctx.save_for_backward(lhs, rhs)
        return lhs * rhs

    @staticmethod
    def backward(ctx, grad_output):
        lhs, rhs = ctx.saved_tensors
        grad_lhs = grad_output * rhs
        grad_rhs = grad_output * lhs
        return grad_lhs, grad_rhs

mul = HadamardProduct.apply

The ctx argument is used to stash the tensors required for the backward pass somewhere (in Zachary’s colab, this is done via a closure) and the grad_output argument is the output-shaped “vector” of the VJP. It’s a single argument in this case since the function has a single output. PyTorch may reuse this tensor, so it’s important even in cases where the gradient computed has the same shape to “NEVER modify [this argument] in-place”, as the PyTorch docs say.

Testing this:

lhs = torch.tensor([1.0, 2.0], requires_grad=True)
rhs = torch.tensor([3.0, 4.0], requires_grad=True)

y = mul(lhs, rhs)

print("y =", y)
v = torch.tensor([5.0, 6.0])
y.backward(v)
assert torch.equal(lhs.grad, v * rhs)
assert torch.equal(rhs.grad, v * lhs)

This example is typical for elementwise operations, where the Jacobian matrices are diagonal and VJPs are just elementwise operations themselves.
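A hand-written backward pass is easy to get subtly wrong, so it can be worth comparing it against finite differences. The sketch below repeats the class so that it runs on its own and uses `torch.autograd.gradcheck`; the tolerances and the random inputs are arbitrary choices, and double precision is used because finite differences are too noisy in float32.

```python
import torch


class HadamardProduct(torch.autograd.Function):
    """Computes lhs * rhs and its backward pass (same as above)."""

    @staticmethod
    def forward(ctx, lhs, rhs):
        ctx.save_for_backward(lhs, rhs)
        return lhs * rhs

    @staticmethod
    def backward(ctx, grad_output):
        lhs, rhs = ctx.saved_tensors
        return grad_output * rhs, grad_output * lhs


torch.manual_seed(0)
lhs = torch.randn(4, dtype=torch.float64, requires_grad=True)
rhs = torch.randn(4, dtype=torch.float64, requires_grad=True)

# Compares the analytical VJPs with numerically estimated Jacobians;
# returns True on success and raises an error on mismatch.
ok = torch.autograd.gradcheck(HadamardProduct.apply, (lhs, rhs), eps=1e-6, atol=1e-4)
print(ok)
```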

To contrast this with another example, let’s look at a real, genuine linear function. Remember that for a matrix $W\in\R^{m\times n}$, the function $\R^n\ni x \mapsto Wx \in \R^m$ is Fréchet differentiable and its derivative is the constant $W$.⁴ Taking $x$ as a constant and $W=(W_{jk})_{j=1,\ldots,m;\; k=1,\ldots,n}$ as the “dependent variable”, and “reshaping to vector form” as above, the derivative at $W$ is

\[\begin{pmatrix} x_1 & \cdots & x_n & & & & & & & \\ & & & x_1 & \cdots & x_n & & & & \\ & & & & & & \ddots & & & \\ & & & & & & & x_1 & \cdots & x_n \end{pmatrix} \in\R^{m\times nm}.\]

Forming the VJP with a vector $v\in\R^m$ and reshaping back to the shape of $W$ results in the outer product $v\cdot x^\top\in\R^{m\times n}$, with entries $(v_jx_k)_{j,k}$.

The PyTorch code for that is (cf. the PyTorch docs):

class ActuallyLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        output = weight @ input
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_weight = None
        # Efficiency check -- only compute the VJP for inputs that require a gradient.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output @ weight
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.unsqueeze(1) @ input.unsqueeze(0)
        return grad_input, grad_weight

Testing this:

W = torch.tensor(
    [
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0],
    ],
    requires_grad=True,
)

x = torch.tensor([10.0, 11.0], requires_grad=True)
y = ActuallyLinear.apply(x, W)
print("y =", y)
v = torch.tensor([1.0, 2.0, 3.0])
y.backward(v)

# Save for comparing.
x_grad = x.grad
W_grad = W.grad

x.grad = None  # Reset to zero.

torch_linear = torch.nn.Linear(W.shape[1], W.shape[0], bias=False)

# Set weight without touching gradient tape.
torch_linear.weight.data[...] = W
torch_y = torch_linear(x)
torch_y.backward(v)

assert torch.equal(x_grad, x.grad)
assert torch.equal(torch_linear.weight.grad, W_grad)

One final comment

I could go on, but this is plenty for now. The last thing worth mentioning is that $\eqref{eq:crNn}$ doesn’t quite fit the actual situation of deep neural networks, where one takes the gradient of all weights, and each layer $f_j$ depends both on the outputs (“activations”) of the previous layer and its own weight, and the inputs of the first layer $f_1$ (and potentially some later layers, e.g. for “targets”) are just “data”, which is treated as a parametrization of the functions. One way to apply the Calculus 102 rule above would be to have earlier layers pass on all weights they don’t need as an identity function (with identity matrix for that part of its Jacobian). This can also be written in other ways, but the margin-bottom here is too small to contain it.
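To illustrate how weights and activations interact, here is a sketch of a manual reverse-mode pass through a tiny two-layer network. The sizes, the tanh nonlinearity, and the squared-error loss are arbitrary choices for illustration. Each step is a VJP, and the weight gradients are the outer products from the linear example above.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, dtype=torch.float64)  # "Data", not differentiated.
t = torch.randn(2, dtype=torch.float64)  # Target, also just data.
W1 = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(2, 4, dtype=torch.float64, requires_grad=True)

# Forward pass; autograd records what it needs for the backward pass.
h = torch.tanh(W1 @ x)
y = W2 @ h
loss = 0.5 * ((y - t) ** 2).sum()
loss.backward()

# Manual backward pass: start with the 1 x 2 row vector loss'(y) and
# multiply Jacobians from the left, layer by layer.
with torch.no_grad():
    g_y = y - t  # Derivative of the loss w.r.t. y.
    grad_W2 = torch.outer(g_y, h)  # VJP w.r.t. the weights of layer 2.
    g_h = g_y @ W2  # VJP w.r.t. the activations h.
    g_a = g_h * (1 - h**2)  # tanh has a diagonal Jacobian.
    grad_W1 = torch.outer(g_a, x)  # VJP w.r.t. the weights of layer 1.

print(torch.allclose(W1.grad, grad_W1), torch.allclose(W2.grad, grad_W2))
```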

Did Jacobi actually say that? Googling this is hard because Charlie Munger likes the supposed quote so much. However, there’s a 1916 source claiming “[t]he great mathematician Jacobi is said to have inculcated upon his students the dictum”. A 1968 source gives a slightly different version. Perhaps it’s a matter of invent, always invent. ↩
To illustrate the extremes this is pushed to for large scale cases: A large language model like GPT3 can have billions or even trillions of weights (corresponding to the dimension of $x\in\R^n$) but still computes a scalar (one dimensional) loss. ↩
This is a joke. In reality, mathematicians are taught that if something is named after someone, that is a good indication that person didn’t invent that. ↩
In fact, the underlying idea of the Fréchet derivative is linear approximations like that. Unlike partial derivatives, this idea carries over to the infinite dimensional case, where it’s a stronger requirement than weaker notions of differentiability like the existence of the Gateaux derivative. ↩

Gompertz, annuities, and special functions

Wed, 14 Apr 2021 00:00:00 +0000

This is part 2 of a 2 part series on Gompertz’s law. Part 1 is here.

The first part of this series discussed the Gompertz distribution and gave a formula for “How likely am I to live until at least age $t_0 + t$ conditional on having lived until age $t_0$?”. In this post, we will use this to compute the present value of an annuity. This is a riff on chapter 3 of Moshe Milevsky’s book The 7 Most Important Equations for Your Retirement with a more mathy twist.

Present value

A standard idea in finance is that getting $\$1000$ today is worth more than getting $\$1000$ in a year from now. This should be true even if these are inflation-adjusted dollars: If we assume some bank will give us the expected rate of inflation as their interest rate when we deposit the $\$1000$, we could just do that and create the “$\$1000$ a year from now” situation. Ignoring the dilemma of choice, the first situation gives us more options and should therefore be worth more to us.

How much more should it be worth? Or, equivalently, what’s the value for $y$ at which point we become indifferent between $\$x$ now and $\$y$ in one year?

The answer will be different for different people (you may really need that money now), but generally this is modeled with an interest/discount rate as well: $\$x$ today will be worth $\$x\cdot(1+r)$ a year from now (and therefore $\$x\cdot (1+r)^2$ two years from now) for some discount rate $r$. Why the temporal structure is exponential as opposed to some other increasing function is a topic of its own (look up hyperbolic discounting), but it’s a common choice in various fields including finance, economics, psychology, neuroscience, and reinforcement learning.

Conversely, $\$y$ in one year should be worth $\frac{\$y}{1+r}$ today.

Getting money forever

A nice property of this discounting rule is that the option of getting a neverending stream of money (say, $a = \$1000$ every month) is worth a finite amount today by the magic of the geometric series:

\[\sum_{n=0}^\infty \frac{a}{(1+r)^n} = \frac{a}{1-\frac{1}{1+r}} = a \frac{1+r}{r} = a(1 + \tfrac{1}{r})\]

for a monthly discount rate $r > 0$. Just to be extra clear: The $n$th term of this sum is the present value of the payment we receive in the $n$th period.

Such an arrangement is called an annuity. We talked about annuity loans before on this blog and just as in that case, the present value of a neverending stream of $a$ per month is $a$ times an annuity factor, in this case $\af(r)=1 + \frac{1}{r}$.
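As a quick sanity check of the geometric series, the following sketch compares a long partial sum with the closed form $1 + \frac{1}{r}$. The monthly rate of $0.5\%$ and the number of periods are arbitrary choices.

```python
def perpetuity_factor(r):
    """Closed form 1 + 1/r of the infinite sum, valid for r > 0."""
    return 1 + 1 / r


def partial_sum(r, periods):
    """Present value of 1 per period for a finite number of periods."""
    return sum((1 + r) ** -n for n in range(periods))


r = 0.005  # Hypothetical monthly discount rate of 0.5%.
closed = perpetuity_factor(r)
approx = partial_sum(r, 5000)
print(closed, approx)
```

With these numbers, \$1000 per month forever is worth about 201 monthly payments today.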

The same is true if the annuity runs out after a fixed number of periods. However, as a component for retirement planning, annuities that pay for the rest of one’s natural life (or the life of one’s spouse) tend to be a better option, as they help cover longevity risk. In fact, annuities are (in theory, although maybe not psychologically in practice) a great component of a retirement plan as they allow pensioners to take more risk with the rest of their funds, e.g., stay invested in the stock market where expected returns are higher (but so is the variance of returns).

Getting money until you die

This raises a question: What’s the present value of an annuity that pays for the rest of one’s life? It ought to be less than $1+\frac{1}{r}$ times the annual amount (unless one expects to live forever), but how much less exactly?

Our answer, of course, is the survival function of the Gompertz distribution, as per the last blog post. Of course this depends on the current age of the pensioner: The survival function of the Gompertz distribution, conditioned on having lived until year $t_0$, is

\[p_{t_0}(t) := \exp\bigl(- e^{b(t_0-t_m)} (e^{bt} - 1)\bigr) \where{t\ge 0}\]

where $b \approx 1/9.5$ is the inverse dispersion coefficient and $t_m \approx 87.25$ is the modal value of human life.

The value of a pension that starts now (with a pensioner at age $t_0$) and pays $\$1$ per year for the rest of the life is therefore

\[\af(r, t_0) = \sum_{n=0}^\infty \frac{p_{t_0}(n)}{(1 + r)^n}. \tag{#3} \label{discraf}\]

The $n$th term in this series is the probability of being alive at year $t_0 + n$, given having been alive at year $t_0$, times the discount factor that turns it into a present value.

This is equation #3 in Milevsky’s book The 7 Most Important Equations for Your Retirement.

Some insurance contracts allow the pensioner to decide between a lump sum payment or an annuity; this formula can help to compare these two options. Some contracts provide payments until death, but with a minimum of (say) 10 years. Replacing $p_{t_0}(0), \ldots, p_{t_0}(9)$ with $1$ in $\eqref{discraf}$ would model this situation.
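Summing the series doesn’t require a spreadsheet. Here is a sketch in plain Python that also models the guaranteed-minimum variant by replacing the first survival probabilities with $1$. The function names, the `guaranteed_years` parameter, and the cutoff at 150 terms are my own choices; beyond that cutoff the survival probability is numerically zero.

```python
import math

B = 1 / 9.5  # Inverse dispersion coefficient b.
T_M = 87.25  # Modal value t_m of human life.


def survival(t, t_0):
    """Gompertz survival p_{t_0}(t): alive at t_0 + t given alive at t_0."""
    eta = math.exp(B * (t_0 - T_M))
    return math.exp(-eta * (math.exp(B * t) - 1))


def annuity_factor(r, t_0, guaranteed_years=0, max_years=150):
    """Discrete annuity factor: yearly payments of 1, starting right now."""
    total = 0.0
    for n in range(max_years):
        # Guaranteed payments are made regardless of survival.
        p = 1.0 if n < guaranteed_years else survival(n, t_0)
        total += p / (1 + r) ** n
    return total


plain = annuity_factor(0.025, t_0=65)
guaranteed = annuity_factor(0.025, t_0=65, guaranteed_years=10)
print(plain, guaranteed)
```

The guaranteed version is worth a bit more, and both stay below the “live forever” bound of $1 + \frac{1}{r} = 41$.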

Milevsky suggests using a spreadsheet program to sum this series, along the lines of

age $65+n$	$p_{65}(n)$	$(1+r)^{-n}$	product
70	93.6%	0.705	0.660
75	83.6%	0.497	0.415
80	69.1%	0.35	0.242
85	50.0%	0.247	0.124
90	29.0%	0.174	0.050
95	11.5%	0.122	0.014
100	2.4%	0.086	0.002
105	0.2%	0.061	0.000
		∑	1.507

In this example a pensioner of age 65 buys an annuity that gives them $\$1$ every 5 years (starting at age 70) with a discount rate of $r=7.25\%$. The present value of that annuity is $\$1.51$. If the payment every 5 years is $\$50{,}000$ instead, the present value of the annuity at age 65 is $1.507\cdot \$50{,}000 \approx \$75{,}369.90$.

The example uses 5 year intervals in order to not get unduly long. But typically, annuities will pay monthly. Computing the sum in a spreadsheet then becomes somewhat annoying: Here’s a monthly example in Google Sheets. (Notice that the difference to the 5 year interval version actually matters: In the latter case, one gets the money for five years in advance; in the former case one gets it only for the current month. A lot can happen in 5 years.)
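The table can be recomputed in a few lines. This sketch assumes the parameters $b = 1/9.5$ and $t_m = 87.25$ from above; the third decimal of some entries may differ slightly from the table, presumably due to rounding.

```python
import math

B = 1 / 9.5
T_M = 87.25


def survival(t, t_0):
    """Gompertz survival p_{t_0}(t) with the parameters from the text."""
    eta = math.exp(B * (t_0 - T_M))
    return math.exp(-eta * (math.exp(B * t) - 1))


r, t_0 = 0.0725, 65
rows = []
for n in range(5, 45, 5):  # Payments at ages 70, 75, ..., 105.
    p = survival(n, t_0)
    discount = (1 + r) ** -n
    rows.append((t_0 + n, p, discount, p * discount))

for age, p, discount, product in rows:
    print(f"{age}\t{p:.1%}\t{discount:.3f}\t{product:.3f}")

total = sum(product for _, _, _, product in rows)
print(f"sum\t{total:.3f}")
```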

I wasn’t quite satisfied with this.

The hunt for a closed-form solution

Equation $\eqref{discraf}$ is nice and all, but it would be much better to have a closed-form expression for it. Since $p_{t_0}$ is a doubly-exponential function, the solution isn’t immediately obvious. In fact, I don’t know a closed-form solution for $\eqref{discraf}$¹.

But let’s imagine a world with continuous banking, where instead of once a month we get a small portion of our annuity every moment. The continuous annuity factor would then be

\[\af_c(r, t_0) = \int_0^\infty p_{t_0}(x) e^{-rx} \,dx = \int_0^\infty \exp\bigl(-\eta(e^{bx} - 1)\bigr) e^{-rx} \,dx \tag{G.1} \label{contaf}\]

where $\eta = e^{b(t_0 - t_m)}$ as in the last blog post.²

This, too, doesn’t look super easy. In fact, it’s not solvable with “elementary” functions. But somewhere within the large zoo of special functions there is the right one for us.

In this case, Wikipedia already tells us that the moment-generating function (aka Laplace transform) of the Gompertz distribution is

\[\E(e^{-tx}) = \eta e^{\eta}\mathrm{E}_{t/b}(\eta) \where{t>0}\]

with the generalized exponential integral

\[\mathrm{E}_{t/b}(\eta)=\int_1^\infty e^{-\eta v} v^{-t/b}\,dv \where{t>0}.\]

The Gompertz distribution has the nice property that “left-truncated” versions of itself are still Gompertz distributed (see the last blog post for details). This is a consequence of the fact that its survival function $S(x) = \exp(-\eta(e^{bx}-1))$ shows up in its pdf $f(x) = b\eta S(x)e^{bx}$.

Hence,

\[\E(e^{-tx}) = b\eta\int_0^\infty S(x) e^{-x(t-b)}\,dx = b\eta\af_c(t-b, t_0) \where{t>0}.\]

So the moment-generating function at $t = r + b$ gives us a closed-form for $\eqref{contaf}$:

\[\af_c(r, t_0) = \frac{1}{b\eta}\E(e^{-(r+b)x}) = \frac{e^{\eta}}{b} \mathrm{E}_{1+r/b}(\eta) = \frac{1}{b}\exp(e^{b(t_0 - t_m)})\mathrm{E}_{1+r/b}(e^{b(t_0 - t_m)}). \label{contaf-gef} \tag{G.2}\]

The annuity factor $\af_c(r, t_0)$ for different interest rates $r$ and starting ages $t_0$. Retiring early is expensive. Gnuplot source

Trying to use this

Formula $\eqref{contaf-gef}$ is in fact a closed-form solution to our problem. So let’s try to use it in a computation with Python. SciPy has the generalized exponential integral function, so this should be easy:

import numpy as np
from scipy import special

b = 1 / 9.5
t_m = 87.25


def af(r, t_0):
    eta = np.exp(b * (t_0 - t_m))
    return np.exp(eta) / b * special.expn(1 + r / b, eta)


print(af(0.025, t_0=65))

and the output is:

gompertz_discount.py:10: RuntimeWarning: floating point number truncated to an integer
  return np.exp(eta) / b * special.expn(1 + r / b, eta)
19.439804660538815

Ah, dang! Even though the generalized exponential integral is part of SciPy, the implementation there only supports integer orders.

We’ll have to visit the special functions zoo for a bit longer.

More zoo animals: The incomplete gamma function

The classic reference for special functions is Abramowitz and Stegun. Wikipedia has a quote about it from the (American) National Institute of Standards and Technology:

More than 1,000 pages long, the Handbook of Mathematical Functions was first published in 1964 and reprinted many times […] [W]hen New Scientist magazine recently asked some of the world’s leading scientists what single book they would want if stranded on a desert island, one distinguished British physicist said he would take the Handbook. […] During the mid-1990s, the book was cited every 1.5 hours of each working day.

In the internet age, the Digital Library of Mathematical Functions (DLMF) hosts an updated version of this classic work and it is very useful for situations like the one we are in now.

In fact, looking at equation 8.19.1 in DLMF, we see that the generalized exponential integral is nothing but the (upper) incomplete gamma function:

\[\mathrm{E}_p(z) = z^{p-1}\Gamma(1-p, z) \where{p, z\in\C}\]

and thus

\[\af_c(r, t_0) = \frac{e^{\eta}}{b} \mathrm{E}_{1+r/b}(\eta) = \frac{\eta^{r/b}e^{\eta}}{b} \Gamma(-r/b, \eta).\]

Alas, this won’t do either. SciPy’s implementation of the incomplete gamma function doesn’t allow for a negative first argument.

Why might this be the case? Looking at the definition of the incomplete gamma functions in DLMF (8.2.2 there), we read

\[\Gamma(a, z) = \int_z^\infty t^{a-1}e^{-t}\,dt \where{a, z\in\C, \ \Re a > 0}.\]

Note how this is an “incomplete” version of the standard gamma function $\Gamma(a) = \Gamma(a, 0)$. Also note how this integral won’t work for negative $a$: When $z=0$, there’s a non-integrable singularity at $t=0$.

So did our previous formulae not make any sense? They did. We just have to understand these functions a little bit better.

Analytic continuation

Complex analysis, the theory of differentiable complex functions, is one of the most beautiful areas of mathematics. In fact, I believe I once saw a video of Donald Knuth saying it’s such a great topic that he asked his daughter to attend a complex analysis course – although she didn’t otherwise study at any university.³

One of the results of complex analysis is that if two nice (aka complex-differentiable, aka holomorphic, aka analytic) functions coincide on a set that’s not totally discrete, they are the same! This means any analytic function like $a\mapsto\Gamma(a, z)$ has at most one analytic continuation. This is an important principle which also applies to the famous Riemann zeta function $\zeta$. The zeros of its analytic continuation to the left complex half-plane are what the most important open problem in mathematics is about (you’ll get $1M if you solve this one, which should help with retirement planning). In fact, $\zeta$ and $\Gamma$ are closely linked.

In the case of $\zeta(s) := \sum_{n=1}^\infty n^{-s}$ its unique analytic continuation to the left half-plane yields for instance $\zeta(-1) = -\frac{1}{12}$ which gives some sense to Ramanujan’s mysterious equation $1 + 2 + 3 + \cdots = -\frac{1}{12}.$

How is such an analytic continuation found? In the case of both the zeta and the gamma functions, it’s via functional equations. For instance, $\Gamma(n) = (n-1)!$ for integer $n$, and more generally $\Gamma(z+1) = z\Gamma(z)$ for complex $z$. If we start with some $z$ in the left half-plane (i.e., $\Re z \le 0$), repeatedly applying this formula eventually yields only terms of $\Gamma$ at “known” arguments, which establishes the value of the analytic continuation of $\Gamma$ to that point $z$.
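For real arguments, this procedure fits in a few lines. The sketch below only ever evaluates $\Gamma$ at positive arguments and walks back via $\Gamma(z) = \Gamma(z+1)/z$. Python’s `math.gamma` already handles negative non-integers, so it serves as a reference here; the helper name is made up.

```python
import math


def gamma_continued(z):
    """Gamma at a real non-integer z, using math.gamma only for z > 0."""
    factor = 1.0
    while z <= 0:
        factor *= z  # Gamma(z) = Gamma(z + 1) / z.
        z += 1
    return math.gamma(z) / factor


value = gamma_continued(-1.5)
reference = math.gamma(-1.5)
print(value, reference)
```

At $z = 0, -1, -2, \ldots$ the loop would divide by zero, which matches the poles of $\Gamma$ at those points.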

For the incomplete Gamma function, a similar recurrence relation exists, namely (8.8.2 in DLMF)

\[\Gamma(a+1, z) = a\Gamma(a,z) + z^ae^{-z}\]

for, say, $a$ and $z$ in the right half-plane. This equation can be used to extend the incomplete gamma function to negative values of $a$. It also gives rise to a corresponding recurrence relation for the generalized exponential integral (8.19.12 in DLMF)

\[p\mathrm{E}_{p+1}(z) + z\mathrm{E}_{p}(z) = e^{-z}.\]
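Both recurrence relations can be spot-checked with what SciPy does support: positive $a$ for the incomplete gamma function and integer orders for the exponential integral. The test values are arbitrary, and the gamma recurrence is checked in the form $\Gamma(a+1, z) = a\Gamma(a, z) + z^a e^{-z}$ from DLMF 8.8.2.

```python
import numpy as np
from scipy import special


def upper_gamma(a, z):
    """Unnormalized upper incomplete gamma function for a > 0."""
    return special.gamma(a) * special.gammaincc(a, z)


a, z = 1.7, 0.4  # Arbitrary test values.

# Gamma(a + 1, z) = a * Gamma(a, z) + z^a * exp(-z).
lhs_gamma = upper_gamma(a + 1, z)
rhs_gamma = a * upper_gamma(a, z) + z**a * np.exp(-z)

# p * E_{p+1}(z) + z * E_p(z) = exp(-z), here for the integer order p = 2.
p = 2
lhs_expn = p * special.expn(p + 1, z) + z * special.expn(p, z)
rhs_expn = np.exp(-z)

print(np.isclose(lhs_gamma, rhs_gamma), np.isclose(lhs_expn, rhs_expn))
```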

Applying the functional equation

Since the implementations of $\mathrm{E}_{p}(z)$ and $\Gamma(a,z)$ that we want to use don’t implement the continued versions of these functions, we can apply the functional equations that define the continuations ourselves. The recurrence relation for $\mathrm{E}_{p+1}$ yields

\[\begin{align} \af_c(r, t_0) & = \frac{e^{\eta}}{b} \mathrm{E}_{1+r/b}(\eta) = \frac{e^{\eta}}{r}\bigl(e^{-\eta} - \eta\mathrm{E}_{r/b}(\eta)\bigr) = \frac{1}{r}\bigl(1 - \eta e^\eta\mathrm{E}_{r/b}(\eta)\bigr) \\ & = \frac{1}{r}\bigl(1 - e^\eta \eta^{r/b} \Gamma(1 - r/b, \eta)\bigr). \end{align}\]

For moderate values of $r$ (and $b$), this is enough: The first argument $1 - r/b$ of $\Gamma$ stays positive as long as $r < b \approx 10.5\%$. For larger $r$ we’d have to apply the recurrence relation again.

This allows us to finally run this in Python:

import numpy as np
from scipy import special

b = 1 / 9.5
t_m = 87.25


def af(r, t_0):
    eta = np.exp(b * (t_0 - t_m))
    return (
        1
        - np.exp(eta)
        * eta ** (r / b)
        * special.gamma(1 - r / b)
        * special.gammaincc(1 - r / b, eta)  # A normalized version of \Gamma(a, z).
    ) / r


print(af(0.025, t_0=65))

Which prints 14.79901377449508.
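A closed form derived over this many steps deserves a cross-check. The sketch below integrates $\eqref{contaf}$ numerically with `scipy.integrate.quad` and compares it to the formula above. Cutting the integral off at 150 years is my own shortcut; the survival function is numerically zero long before that.

```python
import numpy as np
from scipy import integrate, special

b = 1 / 9.5
t_m = 87.25


def af_closed(r, t_0):
    """Closed form via the incomplete gamma function (needs r < b)."""
    eta = np.exp(b * (t_0 - t_m))
    upper_gamma = special.gamma(1 - r / b) * special.gammaincc(1 - r / b, eta)
    return (1 - np.exp(eta) * eta ** (r / b) * upper_gamma) / r


def af_numeric(r, t_0, horizon=150.0):
    """Direct numerical integration of survival times discount factor."""
    eta = np.exp(b * (t_0 - t_m))

    def integrand(x):
        return np.exp(-eta * (np.exp(b * x) - 1) - r * x)

    value, _abs_err = integrate.quad(integrand, 0.0, horizon, limit=200)
    return value


closed = af_closed(0.025, t_0=65)
numeric = af_numeric(0.025, t_0=65)
print(closed, numeric)
```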

What about spreadsheets?

The above works for SciPy, but Excel (or Google Sheets) is a different matter as it doesn’t directly have support for the incomplete gamma function.⁴ However, it does support the Gamma distribution which has a normalized version of the lower incomplete gamma function as its cdf. As the helpful authors of Wikipedia note, $\Gamma(s, x)$ can be computed as GAMMA(s)*(1-GAMMA.DIST(x,s,1,TRUE)).
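The spreadsheet formula can be mimicked in Python to check it against SciPy’s direct route. In this sketch, `scipy.stats.gamma.cdf` plays the role of `GAMMA.DIST(x, s, 1, TRUE)`, assuming the third spreadsheet argument is the scale parameter; the function names are made up.

```python
import numpy as np
from scipy import special, stats


def upper_gamma_spreadsheet_style(s, x):
    """GAMMA(s) * (1 - GAMMA.DIST(x, s, 1, TRUE)), translated to SciPy."""
    return special.gamma(s) * (1 - stats.gamma.cdf(x, a=s, scale=1))


def upper_gamma_direct(s, x):
    """Upper incomplete gamma via the normalized gammaincc."""
    return special.gamma(s) * special.gammaincc(s, x)


# Arguments as they occur above for r = 2.5% and t_0 = 65.
s = 1 - 0.025 * 9.5
x = np.exp((65 - 87.25) / 9.5)

via_cdf = upper_gamma_spreadsheet_style(s, x)
direct = upper_gamma_direct(s, x)
print(via_cdf, direct)
```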

This finally gives us a closed form we can use in spreadsheets. Here’s an example in Google Sheets. Note that the relative difference of $\af(r,t_0)$ and $\af_c(r,t_0)$ for $r = 0.25\%$ and $t_0 = 65$ is only $0.64\%$.
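The spreadsheet identity for $\Gamma(s, x)$ quoted above can be mirrored in Python to check that it agrees with the SciPy route. This is a sketch under the assumption that `scipy.stats.gamma.cdf(x, s, scale=1)` plays the role of `GAMMA.DIST(x,s,1,TRUE)`; the function name `upper_gamma_via_cdf` and the test values are hypothetical.

```python
from scipy import special, stats


def upper_gamma_via_cdf(s, x):
    """Γ(s, x) for s > 0, mirroring GAMMA(s)*(1-GAMMA.DIST(x,s,1,TRUE))."""
    return special.gamma(s) * (1 - stats.gamma.cdf(x, s, scale=1))


s, x = 0.7625, 0.12  # Arbitrary test values with s > 0.
direct = special.gamma(s) * special.gammaincc(s, x)
via_cdf = upper_gamma_via_cdf(s, x)
print(direct, via_cdf)  # Both routes should give the same number up to rounding.
```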

What does it all mean?

If you (or your parents) bought an annuity in your name and the issuer offers the option for a lump sum payment instead, you can multiply the yearly (or monthly) payments by the annuity factor as determined above, treat that as the annuity’s value, and compare it to the lump sum. (The computations here mostly assume the annuity will start right away; extending this to an annuity that starts in a number of years is left as an exercise for the reader.)

That raises the question of which interest rate to pick, and why it should be treated as constant over time.

Also, the annuity provider will have done some math as well and will be aware of adverse selection: People who self-assess as being in good health might be right about this and choose the annuity more often. Milevsky quotes from Jane Austen’s Sense and Sensibility,

People always live forever when there is an annuity to be paid them.

Milevsky also ends his discussion of his equation #3 with a warning: This is a model. If market prices are different, they are likely to be right and the model wrong.

That said, if you got an annuity a while back, for instance pre-2008 when the effective Fed funds rate was 5%+, you likely got a good deal by today’s standards. The future might be different. But at least some people argue that interest rates will stay permanently low, as they did in Japan for the last decades (demographic factors might play a role here). Others disagree, and of course some people always predict stronger inflation.

That’s it for today, thanks for reading. Reach out if you have comments.

Although, come to think of it, perhaps Abel-Plana might help? Eyeballing our function $f(z) = p_{t_0}(z)e^{-z\ln(1+r)}$ it might apply, but the $\int_0^\infty \frac{f(iy) - f(-iy)}{e^{2\pi y} - 1}\,dy$ integral looks hard. ↩
There’s some extra confusion regarding the interest rate here: For continuous compounding we need something else than the effective rate, as $e^r - 1 \ne r$ for $r > 0$. See this answer for some explanation. ↩
I might misremember this; at any rate I can’t find the reference right now. It might have been part of his 1987 course series on mathematical writing. ↩
Apparently, Microsoft Excel supports Gamma at negative values. However, Google Sheets doesn’t. For Excel, we could have also used a series like 8.7.3 in DLMF to compute the incomplete gamma function. ↩

πs, deaths, and statistics

Tue, 23 Mar 2021 00:00:00 +0000

This is part 1 of a 2 part series on Gompertz’s law.

After enjoying a podcast interviewing Moshe Milevsky, I got interested in Milevsky’s work and read his book The 7 Most Important Equations for Your Retirement.

It’s a bit of a weird book. I can’t say I didn’t enjoy it. But it gave me the impression Moshe Milevsky didn’t expect his readers to enjoy the book, or at least he didn’t seem to expect them to like the formulas. That seems like a weird proposition for a book with “equations” in its very title. Perhaps the readers were imagined to be part of a captured audience of students, asked to read the book as an assignment? At any rate the author apologizes every time he actually discusses formulae, and having to resort to logarithms seems to make him positively embarrassed.

Then again, Prof. Milevsky wrote many books and this is apparently one of his bestsellers. And the anecdotes of Fibonacci, Gompertz, Halley, Fisher, Huebner, and Kolmogorov¹ in the book are lovely, as is Milevsky’s account of his experience of economist Paul Samuelson. Perhaps the professor just knows his audience.

But so do I, being well acquainted with the empty set. So there’s no issue with too much math on this blog.

Let’s therefore discuss what Milevsky doesn’t do in his book and try to explain why his equation #2 might be true: Let’s talk about the Gompertz distribution.

Gompertz’s discovery

Image source: Wikipedia

Benjamin Gompertz (1779–1865) was a British mathematician and actuary of German Jewish descent. According to Milevsky, Gompertz was looking for a law of human mortality, comparable to Newton’s laws of mechanics. At his time, statistics of people’s lifespan were already available and formed the basis for Gompertz’s discovery. For instance, one could compile a mortality table of a (hypothetical) group of 45-year-olds as they age. The data might look like this:

Age	Alive at Birthday	Die in Year
45	98,585	146
46	98,439	161
47	98,278	177
48	98,101	195
49	97,906	214
50	97,692	236

What does this tell us? It doesn’t immediately tell me anything that isn’t obvious: The older one gets, the more likely one is to die soon.

But let’s compute the mortality rate: How likely were people to die at a given age in this cohort? To be precise, we compute “Which proportion of people alive at age $t$ died before age $t+1$?”

Age	Alive at Birthday	Die in Year	Mortality Rate
45	98,585	146	0.148%
46	98,439	161	0.164%
47	98,278	177	0.180%
48	98,101	195	0.199%
49	97,906	214	0.219%
50	97,692	236	0.242%

So again, so much so obvious: The probability of dying the next year, conditional on having lived until the current year, increases with age.

But how much does it increase? This is Gompertz’s discovery, now called Gompertz’s law: The mortality rate appears to increase exponentially. One way of seeing this is to take logs and compute their difference:

Age	Alive at Birthday	Die in Year	Mortality Rate	Log of Mortality Rate	Difference in Log Values
45	98,585	146	0.148%	−6.515	—
46	98,439	161	0.164%	−6.416	0.0993
47	98,278	177	0.180%	−6.319	0.0964
48	98,101	195	0.199%	−6.221	0.0987
49	97,906	214	0.219%	−6.126	0.0950
50	97,692	236	0.242%	−6.026	0.1000

The log values increase by about 0.1 per year, regardless of the current age. Since $\exp(0.1) \approx 1.1$, this works out to roughly a 10% increase in the mortality rate per year.
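The columns of the table can be recomputed in a few lines. This is a small sketch using only the cohort numbers from the table above; the variable names are made up for illustration.

```python
import math

# (age, alive at birthday, deaths during the year), copied from the table above.
cohort = [
    (45, 98585, 146),
    (46, 98439, 161),
    (47, 98278, 177),
    (48, 98101, 195),
    (49, 97906, 214),
    (50, 97692, 236),
]

# Log of the mortality rate for each age.
log_rates = [math.log(deaths / alive) for _, alive, deaths in cohort]
# Year-on-year differences of the log mortality rates.
diffs = [later - earlier for earlier, later in zip(log_rates, log_rates[1:])]

for (age, _, _), diff in zip(cohort[1:], diffs):
    print(age, round(diff, 4))

mean_diff = sum(diffs) / len(diffs)
print("mean log-difference:", round(mean_diff, 4))
print("implied yearly growth factor:", round(math.exp(mean_diff), 4))
```

The mean log-difference is a crude estimate of the growth rate of the mortality rate; a least-squares fit of the log rates against age would be a natural refinement.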

According to Gompertz’s discovery, the likelihood of dying in the next year (conditional on having lived until this year) increases exponentially by a fixed percentage every year.

How do we model this mathematically? And if Gompertz’s law has a rate growing by a rate, how come everything stays a probability, i.e., $\le 1$?

Keep reading to learn about hazard functions and find out!

PDFs, CDFs, and Hazard Functions

Some probability density functions
Image source: Wikipedia

Some cumulative distribution functions
Image source: Wikipedia

If you have taken a probability or statistics course, you probably (ha!) know about probability density functions (pdfs). A pdf is a positive function that we use as a density; to make it a probability density, it needs to integrate to one. If $f$ is a pdf and $X$ is a random variable with that distribution then

\[\P(a < X\le b) = \int_a^b f(x) \dx,\]

i.e., the probability of $X$ landing between $a$ and $b$ is the integral from $a$ to $b$ of $f$.² Let’s say our random variable takes only positive values, then for $a=0$ and $b=\infty$ this value will be $1$, i.e., 100%. For smaller values of $b$ this will be the so-called cumulative distribution function (cdf):

\[F(b) := \P(X\le b) = \int_0^b f(x)\dx.\]

While pdfs are positive and integrate to $1$, cdfs are positive, $0$ at $0$, monotone (non-decreasing), and $1$ at the limit to $\infty$. They are also typically assumed to be right-continuous. To compute the previous expression one can then take

\[\P(a < X \le b) = \int_a^b f(x) \dx = F(b) - F(a).\]

Moreover the fundamental theorem of calculus tells us that (barring a few exceptional points),

\[F'(x) = f(x).\]

What this means is that the cdf determines the pdf and therefore the probability distribution itself: Any function $F$ with the properties above can serve as a cdf and gives rise to a pdf $f$, and vice versa.

If you made it this far, you likely knew this in principle. So on to hazard functions now (which I learned about from Wikipedia only recently).

Hazard functions

Some hazard functions
Gnuplot source

If we imagine $f$ to describe deaths (or failure rates) over time, $F(t)$ will be the probability of having died by time $t$. Conversely, $S(t) := 1 - F(t)$ (the survival function) is the chance of still being alive at time $t$. If $f$ is the corresponding pdf and $X$ is a random variable with this distribution,

\[S(t) = \P(X > t) = \int_t^\infty f(x)\dx = 1 - F(t).\]

Suppose we made it until some time $t_0 \ge 0$. What will the future look like? We will need to condition on having lived so far and renormalize with $S(t_0)$ to get the pdf going forward. This is the same as finding the probability of not surviving an additional infinitesimal time:

\[h(t_0) := \frac{f(t_0)}{S(t_0)} = \lim_{t\downarrow 0} \frac{\P(t_0\le X < t_0 + t)}{t S(t_0)} = -\frac{S'(t_0)}{S(t_0)} = - \frac{\d}{\d t}\ln(S(t))\Biggr\rvert_{t=t_0}.\]

This is the hazard function, also known as force of mortality or force of failure. It takes positive values and the last equation indicates that it will not be integrable, i.e., $\int_0^\infty h(x)\dx = \infty$.

There is also the cumulative hazard function, which is

\[H(t) = \int_0^t h(x)\dx = -\ln S(t).\]

This last equation, which follows from the above, tells us something interesting: We ought to be able to retrieve $S$, and therefore the cdf $F$, and therefore the pdf $f$, from $h$ itself, because

\[S(t) = e^{-H(t)}.\]

This means, just as any one of a pdf or a cdf determines the other, so does a hazard function: Given either a pdf, a cdf, or a hazard function (or a cumulative hazard function), the probability density and therefore all of these functions and the entire distribution are uniquely determined. The criteria for $h$ are: (1) being nonnegative and (2) not being integrable.
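To make the relation $S(t) = e^{-H(t)}$ concrete, here is a minimal sketch that recovers a survival function from a hazard function by numerical integration. The example hazard $h(t) = 2t$ (a Weibull distribution with shape 2 and scale 1, whose survival function is $e^{-t^2}$) is an assumption chosen purely for illustration, as are the function names.

```python
import math


def hazard(t):
    # Example hazard, chosen for illustration: Weibull with shape 2 and scale 1.
    return 2.0 * t


def survival_from_hazard(h, t, steps=10_000):
    """S(t) = exp(-H(t)), with H(t) approximated by the trapezoidal rule."""
    dx = t / steps
    total = 0.5 * (h(0.0) + h(t)) + sum(h(i * dx) for i in range(1, steps))
    return math.exp(-total * dx)


t = 1.3
numeric = survival_from_hazard(hazard, t)
exact = math.exp(-(t**2))  # Known closed form for this particular hazard.
print(numeric, exact)
```

From $S$ one gets the cdf as $1 - S$ and the pdf as $h \cdot S$, so the hazard function alone is enough to reconstruct everything.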

Back to Gompertz

So how do we model Gompertz’s discovery? We choose the simplest exponentially growing function we know and take it as our hazard function:

\[h(t) := b \eta e^{bt}\]

where $\eta, b > 0$ are parameters to be determined from the data (see below). This also tells us why our “rate increase of a rate” doesn’t lead to probabilities greater than one over time: It’s a rate increase for the conditional probability at that fixed point in time $t_0$. The hazard function itself does grow to infinity.

Given this, what is the cdf for this Gompertz distribution? Well,

\[H(t) = \int_0^t h(x)\dx = b \eta\int_0^t e^{bx}\dx = \eta(e^{bt} - 1)\]

and therefore

\[F(t) = 1 - S(t) = 1 - e^{-H(t)} = 1 - \exp\bigl(-\eta(e^{bt} - 1)\bigr)\]

and

\[f(t) = h(t)S(t) = b\eta\exp\bigl(bt - \eta(e^{bt} - 1)\bigr).\]

So the simple exponential choice for the hazard function yields this not quite so simple double exponential as a pdf for the Gompertz distribution.
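As a quick plausibility check of these formulas, the following sketch integrates the pdf numerically and compares the result with the cdf. The parameter values for $b$ and $\eta$ are illustrative only and not fitted to any data; the helper `trapezoid` is a hand-rolled trapezoidal rule.

```python
import math

B, ETA = 0.1, 1e-4  # Illustrative values only, not fitted to any data.


def hazard(t):
    return B * ETA * math.exp(B * t)


def survival(t):
    return math.exp(-ETA * (math.exp(B * t) - 1))


def pdf(t):
    return hazard(t) * survival(t)


def trapezoid(f, lo, hi, steps=20_000):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    inner = sum(f(lo + i * dx) for i in range(1, steps))
    return dx * (0.5 * (f(lo) + f(hi)) + inner)


total_mass = trapezoid(pdf, 0.0, 200.0)  # Should be close to 1.
mass_to_80 = trapezoid(pdf, 0.0, 80.0)   # Should be close to F(80) = 1 - S(80).
print(total_mass, mass_to_80, 1 - survival(80.0))
```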

The pdf, cdf and hazard function of the Gompertz distribution for some choices of $\eta$ and $b$.
(See above for source.)

We could proceed to compute the mean, variance or moment-generating function (aka Laplace transform) of this distribution, but the math gets somewhat hairy and special functions (mainly the generalized exponential integral, a special function related to the incomplete gamma function) are involved. We will return to this for other reasons in a future blogpost. The infobox on Wikipedia has the data if necessary.

For now, let’s note that $h(t) = b \eta e^{bt}$ describes a hazard function where the chance of dying in the next moment increases throughout life, but there’s only this time-dependent component. If there are time-independent causes of death or failure (war, some kinds of diseases, voltage surges), one could consider modelling this differently. William Makeham, another British 19th century mathematician, proposed such an addition via the hazard function

\[h(t) = b \eta e^{bt} + \lambda.\]

The result is known as the Gompertz-Makeham distribution.

As simple as this change may seem, it does complicate the resulting integrals quite a bit. So much so that finding closed-form expressions of the quantile function or the moments of this distribution is still an active field of research!

Gompertz from here on out

If a random variable $X \sim \textrm{Gompertz}(\eta, b)$ describes the expected time of death of a newborn (or, less morbidly, the expected time of failure for a newly built device), what does the world look like at time $t_0$? That’s an important question for, e.g., retirement planning.

So let’s compute the pdf after surviving until $t_0$:

\[\frac{f(t_0 + t)}{S(t_0)} = \frac{b\eta\exp\bigl(b(t_0+t) - \eta(e^{b(t_0+t)} - 1)\bigr)}{\exp\bigl(-\eta(e^{bt_0} - 1)\bigr)} = b\eta e^{bt_0}\exp\bigl(-\eta e^{bt_0}(e^{bt}-1) + bt\bigr).\]

This is the pdf of $\textrm{Gompertz}(\eta e^{bt_0}, b)$. So the future stays Gompertz-distributed with new parameters. This is a useful property of the Gompertz distribution: truncated renormalized versions of it stay within the family, in contrast to, e.g., the Gaussian distribution.
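The closure property can be checked pointwise: the renormalized pdf $f(t_0 + t)/S(t_0)$ should coincide with the pdf of $\textrm{Gompertz}(\eta e^{bt_0}, b)$ evaluated at $t$. A small sketch, with illustrative parameter values and made-up function names:

```python
import math


def gompertz_pdf(t, eta, b):
    return b * eta * math.exp(b * t - eta * (math.exp(b * t) - 1))


def gompertz_survival(t, eta, b):
    return math.exp(-eta * (math.exp(b * t) - 1))


eta, b, t0 = 1e-4, 0.1, 65.0  # Illustrative values only.

pairs = []
for t in (0.0, 5.0, 20.0, 35.0):
    # Pdf of the original distribution, conditioned on surviving until t0.
    conditional = gompertz_pdf(t0 + t, eta, b) / gompertz_survival(t0, eta, b)
    # Pdf of the Gompertz distribution with the updated first parameter.
    shifted = gompertz_pdf(t, eta * math.exp(b * t0), b)
    pairs.append((conditional, shifted))
    print(t, conditional, shifted)
```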

This helps us answer questions like “How long will I spend in retirement?”, which is why Milevsky discusses Gompertz in his book about retirement planning.

Before we can do that, let’s discuss how one could fit the parameters $\eta$ and $b$.

The modal value(s) of a distribution is the point (or set of points) at which the pdf attains its maximum. In general, it is distinct from both the mean and the median. In the case of Gompertz it’s also an easy computation:

\[\frac{\d}{\d t}f(t) = 0 \ \Rightarrow\ b\eta\exp\bigl(bt - \eta(e^{bt} - 1)\bigr)(b - \eta b e^{bt}) = 0 \ \Rightarrow\ \eta e^{bt} = 1\]

and therefore $\eta = e^{-bt_m}$ for the modal value $t_m$.

So once we have $b$ we can get $\eta$ from the data by looking for the year (or month) with the most deaths.

The $b$ parameter we also get from the data as it determines the year-on-year increase in the mortality rate: From the data above we would estimate $b=0.1$. Milevsky suggests using $1/b = 9.5$ years and calls this the dispersion coefficient of human life.

As the modal value of human life Milevsky suggests using $t_m=87.25$ for the general population in North America. Obviously $t_m$ was lower in Gompertz’s time; for a population of healthy females in a modern society Milevsky suggests using up to $t_m=90$.
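The relation $\eta = e^{-bt_m}$ can be checked by brute force: compute $\eta$ from Milevsky's suggested $b$ and $t_m$, then locate the maximum of the resulting pdf on a fine grid. This is only a sketch; the grid resolution of 0.01 years is an arbitrary choice.

```python
import math

b = 1 / 9.5   # Milevsky's suggested inverse dispersion coefficient.
t_m = 87.25   # Milevsky's suggested modal value.
eta = math.exp(-b * t_m)


def pdf(t):
    return b * eta * math.exp(b * t - eta * (math.exp(b * t) - 1))


grid = [i * 0.01 for i in range(12_001)]  # Ages 0.00, 0.01, ..., 120.00.
t_max = max(grid, key=pdf)                # Grid point with the largest pdf value.
print(eta, t_max)
```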

Plugging it all together

For a member of the general population in a developed country we can take $b = 1/9.5$, $t_m=87.25$ and therefore $\eta = e^{- bt_m} = e^{-87.25 / 9.5} \approx 1.026\cdot 10^{-4}$. If that person is $t_0\ge 0$ years old, the Gompertz pdf for their future looks like

\[b\eta e^{bt_0}\exp\bigl(-\eta e^{bt_0}(e^{bt}-1) + bt\bigr) = b e^{b(t_0 - t_m)} \exp\bigl(-e^{b(t_0 - t_m)}(e^{bt}-1) + bt\bigr)\]

while the cdf is

\[1 - \exp\bigl(- e^{b(t_0-t_m)} (e^{bt} - 1)\bigr).\]

Therefore, the survival function conditioned on having lived until year $t_0$ is

\[p_{t_0}(t) := \exp\bigl(- e^{b(t_0-t_m)} (e^{bt} - 1)\bigr).\]

This is also the chance of still being alive at year $t_0 + t$ given being alive at year $t_0$ and constitutes an answer to the question of “How long will I spend in retirement?”, or more precisely “How likely am I to be alive $t$ years into my retirement?”. It’s also Milevsky’s equation #2.
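In code, the conditional survival function is a one-liner. The sketch below uses Milevsky's suggested parameters from above; the function name `p_alive` and the chosen ages are made up for illustration.

```python
import math

B = 1 / 9.5   # Inverse dispersion coefficient, as suggested by Milevsky.
T_M = 87.25   # Modal age at death, as suggested by Milevsky.


def p_alive(t, t_0):
    """Probability of being alive at age t_0 + t, given being alive at age t_0."""
    return math.exp(-math.exp(B * (t_0 - T_M)) * (math.exp(B * t) - 1))


for years in (0, 10, 20, 30):
    print(years, round(p_alive(years, t_0=65), 3))
```

With these parameters, a 65-year-old has roughly even odds of reaching 85 according to the model.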

In a future blogpost, we will use this to compute values of annuities. For now, let’s wrap up with a quick discussion of how realistic this model is.

Is this model any good?

The first thing to say is that sadly, human babies are more likely to die than the Gompertz distribution would indicate. That’s partially a sad fact of life and partially a consequence of the fact that severe medical problems in unborn babies or during birth are less likely to cause outright death at the point of delivery for the baby in question, but may still cause death after a few days, months, or even years.

On a potentially more positive note, the Gompertz distribution might overestimate the chance of dying next year for very old people. As Wikipedia has it:

At more advanced ages, some studies have found that death rates increase more slowly – a phenomenon known as the late-life mortality deceleration – but more recent studies disagree.

Generally, there seems to be consensus that the Gompertz distribution models mortality really well between the ages of 30 and 80.

A question very dear to “transhumanists” is to what extent the gains in longevity seen in the last centuries can be expected to continue to happen in the future. Obviously a simple two-parameter model won’t tell us a lot about that, and neither will the three-parameter Gompertz-Makeham distribution. Still, I found it interesting if a bit discouraging to read, again in Wikipedia:

The decline in the human mortality rate before the 1950s was mostly due to a decrease in the age-independent (Makeham) mortality component, while the age-dependent (Gompertz) mortality component was surprisingly stable. Since the 1950s, a new mortality trend has started in the form of an unexpected decline in mortality rates at advanced ages and “rectangularization” of the survival curve.

So apparently $\eta$ and $b$ have been stable over long time periods. Whatever that says about human longevity, it also says the Gompertz (or the Gompertz-Makeham) model is not so bad.

That’s it for today. My apologies for not actually having any $\pi$s in all of this after all :)

For Kolmogorov, I’d love to have another reference for his supposed War and Peace-inspired nickname “ANK”. The book in question is the only reference I could find. ↩
This works for probability measures that are absolutely continuous with respect to the Lebesgue measure, meaning such a Lebesgue density exists. All measures on the reals can be decomposed into such a measure, a discrete measure and a so-called singular continuous measure. Let’s just stay with Lebesgue densities here. ↩

Well-definedness in measure theory

Tue, 29 Dec 2020 00:00:00 +0000

This blog covers very niche topics. Today: How to easily show that the Lebesgue measure and the Lebesgue integral are well-defined.

I recently started reading the “Bandit Algorithms” book by Tor Lattimore and Csaba Szepesvari. I like it a lot. Chapter 2 there got me back to one of my favorite topics in math: Measure Theory. While measure theory is often held as being boring or dry, I disagree. It’s just often presented in a suboptimal way. In particular, many authors get so carried away by a progression of simple functions – positive functions – real-valued functions that they forget that linearity is a great thing (one symptom of this: Trying to find a “common partition” for two simple functions).

Here we present a simple lemma and its corollary that take care of most well-definedness questions in measure theory, in particular: Why the extension of a pre-measure from a semiring to a ring is well-defined (and therefore unique), and why the Lebesgue integral of simple functions is well-defined.

Lemma 1. Let $\H$ be a family of sets with the following partition property: For each finite family $(A_1, \ldots, A_n)$ in $\H$ there is a finite disjoint family $(B_k\st k \in K_0)$ in $\H$ such that

\[\forall j \in\set{1,\ldots,n}\, \exists K_j \subseteq K_0: A_j=\mathop{\dot{\bigcup}}_{k \in K_j} B_k.\]

(In particular, either $A_j \supseteq B_k$ or $A_j\cap B_k=\emptyset$ is true at all times, with $A_j\supseteq B_k \Leftrightarrow k \in K_j.$)

Let $\mu\from \H\to[0, \infty)$ be an additive set function (i.e., if $A, B\in\H$ are disjoint with $A\mathbin{\dot{\cup}} B\in\H$, then $\mu(A\mathbin{\dot{\cup}} B) = \mu(A) + \mu(B)$). Let $A_1, \ldots, A_n \in \H$ and $a_1, \ldots, a_n \in \F \in\set{\R, \mathbb{C}}$. Define

\[\qquad b_k:=\sum_{\substack{1\le j\le n\\ A_j\supseteq B_k}} a_j \where{k\in K_0}.\]

Then $\sum_{j=1}^n a_j \1_{A_j}=\sum_{k\in K_0} b_k\1_{B_k}$ and

\[\sum_{j=1}^n a_j \mu(A_j)=\sum_{k \in K_0} b_k \mu(B_k)\]

Proof. The first statement follows directly from the definition of $b_k$. The second:

\[\begin{aligned} \sum_{j=1}^n a_j \mu(A_j) &=\sum_{j=1}^n a_j \sum_{k \in K_j} \mu(B_k) \\ &=\sum_{k \in K_0} \underbrace{\sum_{j=1}^n \1_{K_j}(k) a_j}_{=b_k} \mu(B_k)=\sum_{k \in K_0} b_k \mu(B_k). \qquad\text{//}\end{aligned}\]

Corollary 2. Let $\H, \mu$ be as in Lemma 1. Let $X:=\lin\set{\1_A\st A\in\H}$. Then $J\from X\to\F$, defined via

\[X\ni f=\sum_{j=1}^n a_j \1_{A_j} \mapsto \sum_{j=1}^n a_j \mu(A_j),\]

is well-defined and linear.

Proof. It suffices to treat the case $f=0$. Let $(B_k\st k \in K_0)$ be the disjoint representation and $(b_k\st k \in K_0)$ as in Lemma 1. Then $b_k=0$ for all $k \in K_0$, as $f=\sum_{j=1}^n a_j \1_{A_j}=\sum_{k \in K_0} b_k \1_{B_k}$ and the $B_k$ are pairwise disjoint. By Lemma 1,

\[\sum_{j=1}^n a_j \mu(A_j) = \sum_{k\in K_0} b_k \mu(B_k)=0.\]

This shows $J$ is well-defined; it’s also clearly linear. //

With this simple tool, we can look at our first application: The Lebesgue-Stieltjes pre-measure.

Lemma 3. Let $\H$ be the set of half-open intervals and let $\mathcal{R}$ be the ring of finite unions of half-open intervals. Then each $F \in \mathcal{R}$ is the disjoint union of finitely many intervals.

Proof (Adapted from J. Voigt, “Einführung in die Integration”, Satz 1.1.1). Let $F=\bigcup_{j=1}^n I_j$ with $I_j=[a_j, b_j)$. The set $\set{a_j, b_j\st j=1,\ldots,n}$ can be written as $\set{x_k\st k=0,\ldots,m}$ with $x_0 < x_1 < \cdots < x_m$. A set of disjoint intervals with union $F$ is

\[\{[x_{k-1}, x_k)\st 1\le k\le m \text{ and } \exists j\in\set{1,\ldots,n} : [x_{k-1}, x_k) \subseteq[a_j,b_j)\}. \qquad\text{//}\]
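The construction in this proof translates directly into code. The following sketch represents a half-open interval $[a, b)$ as a tuple `(a, b)` with `a < b`; the function name `disjoint_decomposition` and the example intervals are made up for illustration.

```python
def disjoint_decomposition(intervals):
    """Split a union of half-open intervals [a, b) into disjoint half-open pieces.

    Follows the proof: sort all endpoints and keep each elementary interval
    [x_{k-1}, x_k) that is contained in at least one of the input intervals.
    """
    points = sorted({x for a, b in intervals for x in (a, b)})
    pieces = []
    for left, right in zip(points, points[1:]):
        if any(a <= left and right <= b for a, b in intervals):
            pieces.append((left, right))
    return pieces


intervals = [(0, 3), (2, 5), (7, 8)]
pieces = disjoint_decomposition(intervals)
print(pieces)  # [(0, 2), (2, 3), (3, 5), (7, 8)]
```

Each input interval is then the disjoint union of the pieces it contains, which is exactly the partition property from Lemma 1.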

Corollary 4. Let $G\from\R\to\R$ be monotone (non-decreasing). The mapping $\mu\from\H\to[0,\infty)$, defined via $\mu([a,b)) := G(b)-G(a)$, can be uniquely extended to an additive set function $\mu\from\mathcal{R}\to[0, \infty)$.

Proof. $\mu$ is additive and $\H$ has the partition property from Lemma 1 by Lemma 3. Hence $J\from X\to\F$ from Corollary 2 is well-defined. Let $F\in\mathcal{R}$. Then $\1_F\in X$ and thus $\mu(F):=J(\1_F)$ is well-defined and $\geq 0$. This extension $\mu$ is additive, since for $\tilde{F}\in\mathcal{R}$,

\[\mu(F \mathbin{\dot{\cup}} \tilde{F}) = J(\1_F+\1_{\tilde{F}}) = J(\1_F) + J(\1_{\tilde{F}}) = \mu(F)+\mu(\tilde{F}).\]

For the uniqueness: Let $\tilde{\mu}\from\mathcal{R}\to[0,\infty)$ be another additive extension and let $F = \mathop{\dot{\bigcup}}_{k=1}^m B_k \in\mathcal{R}$ with $B_1, \ldots, B_m\in\H$. Then

\[\tilde{\mu}(F)=\tilde{\mu}(\mathop{\dot{\bigcup}} B_k)=\sum \tilde{\mu}(B_k)=\sum \mu(B_k)=\mu(F). \qquad//\]

If $G$ is left-continuous, $\mu$ can be shown to be $\sigma$-additive (and hence a pre-measure). One can then use Carathéodory’s extension theorem to extend $\mu$ to a measure on the $\sigma$-algebra generated by $\mathcal{R}$, which is the Borel $\sigma$-algebra $\mathcal{B}(\R)$.

Later in the theory, Lemma 1 proves the well-definedness of the Lebesgue integral for simple functions:

Lemma 5. Let $(\Omega, \mathcal{A}, \mu)$ be a measure space and let $f=\sum_{j=1}^n a_j \1_{A_j}$ be a simple function. Then $\int_\Omega f {~d}\mu:=\sum_{j=1}^n a_j \mu(A_j)$ is well-defined.

Proof. For $K\subseteq K_0:=\{1, \ldots, n\}$, define

\[B_K:=\bigcap_{k\in K} A_k \cap \bigcap_{k \in K_0 \setminus K}(\Omega \setminus A_k)\]

Then $(B_K\st \emptyset \neq K \subseteq K_0)$ is a partition as in Corollary 2, and hence $\int f {~d}\mu=J(f)$ is well-defined. //
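For a toy illustration of this proof, one can take a finite set with the counting measure, build the nonempty atoms $B_K$, and verify the identity from Lemma 1. The sets and coefficients below are made up, and the counting measure is an assumption chosen to keep the example finite.

```python
omega = set(range(10))
sets = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6}]  # A_1, ..., A_n
coeffs = [2.0, -1.0, 0.5]                    # a_1, ..., a_n
mu = len                                     # Counting measure on finite sets.

# Atoms B_K: the points lying in A_k exactly for the indices k in K.
# Only nonempty atoms with nonempty K are stored; empty ones contribute nothing.
atoms = {}
for point in omega:
    key = frozenset(j for j, a_set in enumerate(sets) if point in a_set)
    if key:
        atoms.setdefault(key, set()).add(point)

# Left-hand side: sum of a_j * mu(A_j).
lhs = sum(a * mu(a_set) for a, a_set in zip(coeffs, sets))
# Right-hand side: sum of b_K * mu(B_K), where b_K sums a_j over all j in K.
rhs = sum(sum(coeffs[j] for j in key) * mu(atom) for key, atom in atoms.items())
print(lhs, rhs)
```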

This defines the Lebesgue integral for simple functions. Since simple functions are dense in $L_1(\mu)$, one is tempted to just use the BLT theorem now. That doesn’t quite work however, since the norm of $L_1(\mu)$ is defined via that integral in the first place. Instead, one can now define the integral for positive functions via pointwise convergence almost everywhere from below, then extend to real-valued (and complex-valued) functions.

More on linear regression: Capital asset pricing models

Tue, 08 Sep 2020 00:00:00 +0000

This is the long-awaited second part of February’s post on Linear Regression. This time, without the needlessly abstract math, but with some classic portfolio theory as an example of “applied linear regression”.

We’ll be discussing two papers: The performance of mutual funds in the period 1945-1964 (1968) by Michael Jensen and Common Risk Factors in the Returns On Stocks And Bonds (1993) by Eugene Fama and Kenneth French.

Some disclaimers:

I am by no means an expert on this topic. I learned about these investing concepts on the Rational Reminder podcast by the excellent Benjamin Felix and Cameron Passmore.
There are many more results on this, both classic and recent. The two papers discussed here added some delta at the time, but previous results by others were essential too, and get no mention here. I chose these two papers because they are relatively easy to follow and their main message can be explained well in a blog post.
All of this is just for fun. Capital at risk.

Jensen (1968): Standard stuff.

On models

The papers we discuss present models. Models have assumptions as well as domains in which they are valid and domains in which they are not. We are not going to make precise statements about either of these, but it’s useful to not confuse models with reality. ‘Whereof one has no model thereof one must be silent’ (otherwise, philosophus mansisses). This is also known as searching under the streetlights.

The aim of our model will be to evaluate, or assess, the performance of any given portfolio. Specifically, the model should evaluate a portfolio manager’s ability to increase returns by successfully predicting future prices for a given level of riskiness. For instance, if stock market returns are positive in expectation, a leveraged portfolio could outperform an unleveraged portfolio without any special forecasting ability on the manager’s part. Conversely, a classical 60/40 mix of stocks and bonds would likely do worse than a 100% equity portfolio in this case, but may still be preferable due to its reduced risk. We won’t go into detail how risk is measured, but we should mention that under certain assumptions, it is precisely its riskiness that causes a given asset to yield higher returns (ceteris paribus, and on average): If investors are risk-averse, they will need to be compensated for taking on the additional risk of a specific asset compared to some other asset, and higher expected future returns (expected future prices relative to its current price) are the way that compensation happens in a liquid market. Of course, the actual expectation of any given investor may also just be wrong, but over time one may expect the worst investors to drop out, improving the accuracy of the average investor’s expectations.

Time series regressions

The specific models here do linear regression on time series. Suppose we have one data point per time interval, e.g. per trading day:

day		value
5 January 1970		388.8K
6 January 1970		475.2K
	⋮

We will use this data as an $n$-dimensional feature vector (aka explanatory variable). Think of it as coming from the value of an index, specifically from the capitalization-weighted value of the whole stock market on that trading day. We will call this the market portfolio.

The portfolio (or individual stock) we want to assess will also have a value each trading day. This puts us in a situation in which we could try to use linear regression to express our assessed portfolio as a linear combination of the market value and a constant intercept vector. However, little would be gained by just doing that – we’ll have to at least normalize things a bit, otherwise this is what the model will use its capacity (two numbers!) for. And while we are at it, we remember that there used to be a time when the (almost, or by definition) risk-free return offered by central banks was not basically zero. To tease out the “risk factor”, we will use the market return minus this risk-free rate as our feature vector. Our data thus could look like this:

day		market	risk-free
5 January 1970		1.21%	0.029%
6 January 1970		0.62%	0.029%
	⋮

For the risk-free rate, one could use the treasury bill rate. All these numbers can in principle be retrieved from the historical records (and in practice downloaded from Kenneth French’s homepage).

Let’s call the market return $R_M$ and the risk-free return $R_F$. A sensible regression without an intercept term could then read as

\[R \approx R_F + \beta(R_M - R_F), \label{eq:1}\tag{1}\]

where $R$ is the vector of observed returns of the portfolio or stock we want to assess, and $\beta$ is computed via the least-squares condition of linear regression, i.e., $\beta = \argmin\{\abs{R - R_F - \beta(R_M - R_F)}_2\st \beta\in\R\}$, as in the last blog post. Notice that the $R_F$s make sense here: It’s deviations from this risk-free return that we want to model, and we want $\beta$ to scale the non risk-free portion of the market portfolio.

Example. For a classic 60/40 portfolio with 60% whole market and 40% risk-free return (historically not realistic for private investors, but easy to compute) we have

\[R - R_F= 0.6R_M + 0.4R_F - R_F = 0.6(R_M - R_F)\in\R^{n\times 1}\]

and therefore, by the normal equation,

\[\beta = \bigl((R_M - R_F)^\top (R_M - R_F)\bigr)^{-1} (R_M - R_F)^\top (R - R_F) = 0.6.\]

That seems sensible in this case.

Finding Alpha

However, \eqref{eq:1} isn’t quite good enough: More precisely, it reads

\[R = R_F + \beta(R_M - R_F) + e, \label{eq:2}\tag{2}\]

with an error term $e\in\R^n$. As Jensen (1968) argues,

[W]e must be very careful when applying the equation to managed portfolios. If the manager is a superior forecaster (perhaps because of special knowledge not available to others) he will tend to systematically select securities which realize [$e_j > 0$].

This touches on a subject glossed over in the last blog post: Most statements about linear regression models depend on certain statistical assumptions, among them that the error terms are elementwise iid, ideally with a mean of zero. There are autocorrelation tests like Durbin-Watson to test if this is true for a particular dataset. In this particular modeling exercise, we can do better by adding the constant $\1=(1,\ldots,1)\in\R^n$ intercept vector to the subspace we project on, which turns \eqref{eq:2} into

\[R - R_F = \alpha + \beta(R_M - R_F) + u, \label{eq:3}\tag{3}\]

with an error term $u\in\R^n$.
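To see what this regression does in practice, here is a minimal sketch on synthetic data. All numbers below (the return distributions, the "true" $\alpha$ and $\beta$, the noise level) are invented for illustration and are not market data; the least-squares fit uses `np.linalg.lstsq` on a design matrix with a column of ones and the excess market return.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Synthetic inputs; every number here is made up for illustration.
r_f = np.full(n, 0.0001)                     # Constant risk-free return R_F.
excess_market = rng.normal(0.0004, 0.01, n)  # R_M - R_F.
true_alpha, true_beta = 0.0002, 0.8
noise = rng.normal(0.0, 0.002, n)            # Plays the role of the error term u.
r = r_f + true_alpha + true_beta * excess_market + noise

# Project R - R_F onto the span of the intercept vector and R_M - R_F.
design = np.column_stack([np.ones(n), excess_market])
(alpha, beta), *_ = np.linalg.lstsq(design, r - r_f, rcond=None)
residuals = r - r_f - design @ np.array([alpha, beta])

print("alpha:", alpha, "beta:", beta)
print("mean residual:", residuals.mean())
```

Because the intercept vector is part of the subspace we project on, the residuals have mean zero up to rounding, regardless of the data.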

Ever wondered where the “alpha” in the clickbait website Seeking Alpha comes from? It is this $\alpha$, the coefficient of the $\1$ intercept vector in \eqref{eq:3}. To quote Jensen (1968) again:

Thus if the portfolio manager has an ability to forecast security prices, the intercept, [$\alpha$, in eq. \eqref{eq:3}] will be positive. Indeed, it represents the average incremental rate of return on the portfolio per unit time which is due solely to the manager’s ability to forecast future security prices. It is interesting to note that a naive random selection buy and hold policy can be expected to yield a zero intercept. In addition if the manager is not doing as well as a random selection buy and hold policy, [$\alpha$] will be negative. At first glance it might seem difficult to do worse than a random selection policy, but such results may very well be due to the generation of too many expenses in unsuccessful forecasting attempts.

However, given that we observe a positive intercept in any sample of returns on a portfolio we have the difficulty of judging whether or not this observation was due to mere random chance or to the superior forecasting ability of the portfolio manager. […]

It should be emphasized that in estimating [$\alpha$], the measure of performance, we are explicitly allowing for the effects of risk on return as implied by the asset pricing model. Moreover, it should also be noted that if the model is valid, the particular nature of general economic conditions or the particular market conditions (the behavior of $\pi$) over the sample or evaluation period has no effect whatsoever on the measure of performance. Thus our measure of performance can be legitimately compared across funds of different risk levels and across different time periods irrespective of general economic and market condition.

About the error term $u$, first notice that thanks to the intercept term we can expect it to have a mean of zero. Further, Jensen (1968) argues it “should be serially [i.e., elementwise] independent” as otherwise “the manager could increase his return even more by taking account of the information contained in the serial dependence and would therefore eliminate it.”
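To make the serial independence claim a bit more tangible, here is a minimal sketch of the Durbin-Watson statistic mentioned earlier, $d = \sum_t (u_t - u_{t-1})^2 / \sum_t u_t^2$. Values near 2 are what we'd expect without first-order autocorrelation, while values well below 2 hint at positive autocorrelation. The function name `durbin_watson` and both test series are my own assumptions for illustration; they are synthetic numbers, not residuals of any fund.

```python
import numpy as np


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    diff = np.diff(residuals)
    return float(diff @ diff / (residuals @ residuals))


rng = np.random.default_rng(0)  # Synthetic data, not real returns.
white = rng.standard_normal(5000)

# An AR(1) series u_t = 0.8 * u_{t-1} + e_t as an example of serial dependence.
ar1 = np.empty_like(white)
ar1[0] = white[0]
for t in range(1, len(white)):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

print("white noise: d=%.2f" % durbin_watson(white))
print("AR(1):       d=%.2f" % durbin_watson(ar1))
```

In a real analysis one would feed in the residuals $u$ of the fitted model instead of these toy series.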

Just show me the code!

After introducing this model, Jensen (1968) continues with “the data and empirical results”. In ca. 2015 AI parlance, this part could be called “Experiments”. Take a look at Table 1 in the paper to get a list of quaint mutual fund names. Notice too that it’s not immediately obvious how to get the market portfolio’s returns from historical trading data, as companies enter and leave the stock market, and how they leave will play a huge role. (Bankruptcy? Taken private at $420?)

For our purposes though, all of this has been taken care of and Kenneth French’s homepage has the data, including data for each trading day, in usable formats.

Let’s start by getting the data and sanity-checking our 60/40 example above. All of the code in this post can also be downloaded separately or run in a Google Colab.

import os
import re
import urllib.request
import zipfile

import numpy as np

FF3F = (  # Monthly data.
    "https://mba.tuck.dartmouth.edu/pages/faculty/"
    "ken.french/ftp/F-F_Research_Data_Factors_TXT.zip"
)


def download(url=FF3F):
    archive = os.path.basename(url)
    if not os.path.exists(archive):
        print("Retrieving", url)
        urllib.request.urlretrieve(url, archive)
    return archive


def extract(archive, match=re.compile(rb"\d" * 6)):
    with zipfile.ZipFile(archive) as z:
        name, *rest = z.namelist()
        assert not rest
        with z.open(name) as f:
            # Filter for the actual data lines in the file.
            return np.loadtxt((line for line in f if match.match(line)), unpack=True)


date, mktmrf, _, _, rf = extract(download())

mkt = mktmrf + rf

# Linear regression using the normal equation.
A = np.stack([np.ones_like(mktmrf), mktmrf], axis=1)
alpha, beta = np.linalg.inv(A.T @ A) @ A.T @ (0.6 * mkt + 0.4 * rf - rf)

print("alpha=%f; beta=%f" % (alpha, beta))

As expected, this prints

alpha=0.000000; beta=0.600000

If we are seeking $\alpha$, we’ll have to look elsewhere.
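As a side note, the explicit inverse in the normal equation is fine for two columns, but `np.linalg.lstsq` solves the same least-squares problem without forming $(A^\top A)^{-1}$. Here is a small sketch of the 60/40 sanity check done that way. To keep it self-contained it uses made-up random returns (the means and spreads below are arbitrary assumptions), not the data from French's homepage.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for monthly percentage returns (assumption, not real data).
mktmrf = rng.normal(0.6, 4.5, size=600)
rf = np.full(600, 0.3)
mkt = mktmrf + rf

A = np.stack([np.ones_like(mktmrf), mktmrf], axis=1)
y = 0.6 * mkt + 0.4 * rf - rf  # Excess return of the 60/40 portfolio.

# lstsq minimizes |A b - y|_2 without computing (A^T A)^{-1} explicitly.
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
print("alpha=%f; beta=%f" % (alpha, beta))
```

Since the excess return of the 60/40 portfolio is exactly $0.6$ times the market's excess return, the fit recovers $\alpha = 0$ and $\beta = 0.6$ up to rounding, regardless of the random numbers used.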

Let’s try some real data. The biggest issue is getting hold of the daily returns of real portfolios. Real researchers use data sources like the Center for Research in Security Prices (CRSP), but their data isn’t available for free. Instead, let’s use the data for iShares S&P Small-Cap 600 Value ETF (IJS) from Yahoo Finance.

# Continuing from above.

import time

IJS = (
    "https://gist.githubusercontent.com/heiner/b222d0985cbebfdfc77288404e6b2735/"
    "raw/08c1cacecbcfcd9e30ce28ee6d3fe3d96c07115c/IJS.csv"
)


def extract_csv(archive):
    with open(archive) as f:
        return np.loadtxt(
            f,
            delimiter=",",
            skiprows=1,  # Header.
            converters={  # Hacky date handling.
                0: lambda s: time.strftime(
                    "%Y%m", time.strptime(s.decode("ascii"), "%Y-%m-%d")
                )
            },
        )


ijs_data = extract_csv(download(IJS))

ijs = ijs_data[:, 5]  # Adj Close (includes dividends).

# Turn into monthly percentage returns.
ijs = 100 * (ijs[1:] / ijs[:-1] - 1)
ijs_date = ijs_data[1:, 0]

ijs_date, indices, ijs_indices = np.intersect1d(date, ijs_date, return_indices=True)

# Regression model for CAPM.
A = np.stack([np.ones_like(ijs_date), mktmrf[indices]], axis=1)
y = ijs[ijs_indices] - rf[indices]
B = np.linalg.inv(A.T @ A) @ A.T @ y
alpha, beta = B

# R^2 and adjusted R^2.
model_err = A @ B - y
ss_err = model_err.T @ model_err
r2 = 1 - ss_err.item() / np.var(y, ddof=len(y) - 1)
adjr2 = 1 - ss_err.item() / (A.shape[0] - A.shape[1]) / np.var(y, ddof=1)

print(
    "CAPM: alpha=%.2f%%; beta=%.2f. R^2=%.1f%%; R_adj^2=%.1f%%. Annualized alpha: %.2f%%"
    % (
        alpha,
        beta,
        100 * r2,
        100 * adjr2,
        ((1 + alpha / 100) ** 12 - 1) * 100,
    )
)

This prints:

CAPM: alpha=0.13%; beta=1.14. R^2=77.8%; R_adj^2=77.7%. Annualized alpha: 1.58%

Since $\alpha$ is the weight for the constant intercept vector $\1 = (1,\ldots,1)$, we can think of it as having percentage points as its unit. Note that fees are not included in this calculation. However, as for many ETFs, fees for IJS are low, currently at 0.25% per year. (Managed mutual funds will typically have an annual fee of at least 1%, historically often more than that.)

It seems a bit strange that this ETF tracking the S&P Small-Cap 600 Value index has significant $\alpha$: Presumably, the index just includes firms based on simple rules, not genius insights by some above-average fund manager. Looking at the $R^2$ value, we “explain” only about 78% of the variance of the returns of IJS (the usual caveats to the wording “explain” apply).

Clearly more research was needed. Or just a larger subspace for the linear regression to project onto?

Fama & French (1993): More factors.

By the 1980s, research into financial economics had noticed that certain segments of the market outperformed other segments, and thus the market as a whole, on average. There are several possible explanations for this effect with different implications for the future. For example: Are these segments of the market just inherently riskier such that rational traders demand higher expected returns via sufficiently low prices for these stocks? Or were traders just irrationally disinterested in some ‘unsexy’ firms and have perhaps caught on by now (or not, hence TSLA)? The latter is the behavioural explanation, while the former tends to be put forth by proponents of the Efficient-market hypothesis (EMH), which includes Jensen as well as Fama and French. We won’t be getting into this now. Let’s instead take a look at which ‘easily’ identifiable segments of the market have historically outperformed.

Citing previous literature, Fama & French (1993) mention size (market capitalization, i.e., price per stock times number of shares), earnings per price and book-to-market (book value divided by market value) as variables that appear to have “explanatory power”, which I take to mean that some model that includes these variables has nonzero regression coefficients and a relatively large $R^2$ or other appropriate statistics.

The specific way in which Fama & French (1993) introduce these variables into the model is through the construction of portfolios that mimic these variables. This approach contributed to their being awarded the Nobel (Memorial) Prize in Economic Sciences in 2013. The specific construction goes as follows:

Take all stocks in the overall market and order them by their size (i.e., market capitalization). Then take the top and bottom halves and call them “big” (B) and “small” (S), respectively.

Next, again take all stocks in the overall market and order them by book-to-market equity. Then take the bottom 30% (“low”, L), the middle 40% (“medium”, M), and the top 30% (“high”, H). In both cases, some care needs to be taken: E.g., how to handle firms dropping in and out of the market, how to define book equity properly in the presence of deferred taxes, and other effects.

Then, construct the six portfolios containing stocks in the intersections of the two size and the three book-to-market equity groups, e.g.

	low	medium	high
small	S/L	S/M	S/H
big	B/L	B/M	B/H

Out of these six building blocks, Fama & French build a size and a book-to-market equity portfolio:

The size portfolio is “small minus big” (SMB), the monthly difference between the average return of the three small-stock portfolios S/L, S/M, and S/H and the average return of the three big-stock portfolios B/L, B/M, and B/H.
The book-to-market equity portfolio is “high minus low” (HML), the monthly difference between the average return of the two high book-to-market portfolios S/H and B/H and the average return of the two low book-to-market portfolios S/L and B/L. This is also known as the value factor.

Additionally, the authors also use the market portfolio as “market return minus risk-free return (one-month treasury bill rate)” in the same way as Jensen (1968).
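In code, the construction of the two mimicking portfolios from the six building blocks might look like the sketch below. It assumes equal-weighted averages within each group, and the monthly returns are random placeholders; the names `r`, `smb_toy` and `hml_toy` are made up for this illustration and have nothing to do with the actual factor data.

```python
import numpy as np

rng = np.random.default_rng(0)
months = 12
# Hypothetical monthly percentage returns of the six portfolios (random numbers).
names = ["S/L", "S/M", "S/H", "B/L", "B/M", "B/H"]
r = {name: rng.normal(1.0, 5.0, size=months) for name in names}

# Small minus big: average of the small portfolios minus average of the big ones.
smb_toy = (r["S/L"] + r["S/M"] + r["S/H"]) / 3 - (r["B/L"] + r["B/M"] + r["B/H"]) / 3
# High minus low: average of the high book-to-market portfolios minus the low ones.
hml_toy = (r["S/H"] + r["B/H"]) / 2 - (r["S/L"] + r["B/L"]) / 2

print("SMB:", np.round(smb_toy, 2))
print("HML:", np.round(hml_toy, 2))
```

Note that the medium book-to-market portfolios S/M and B/M enter SMB but not HML.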

An aside

I’m not sure why SMB and HML need to have their two terms be equally weighted among the splits of the other ordering. The authors mention

[For HML] the difference between the two returns should be largely free of the size factor in returns, focusing instead on the different return behaviors of high- and low-[book-to-market] firms. As testimony to the success of this simple procedure, the correlation between the 1963–1991 monthly mimicking returns for the size and book-to-market factors is only $- 0.08$.

Taking “correlation” to mean the Pearson correlation coefficient, we can test this using the data from French’s homepage:

date, mktmrf, smb, hml, rf = extract(download())
print(np.corrcoef(smb, hml))

This prints

[[1.         0.12889074]
 [0.12889074 1.        ]]

which implies a coefficient of $0.13$. In 1993, Fama and French had less data available: The paper uses the 342 months from July 1963 to December 1991. Let’s check with this range:

orig_indices = (196307 <= date) & (date <= 199112)
assert len(smb[orig_indices]) == 342
print(np.corrcoef(smb[orig_indices], hml[orig_indices]))

This yields

[[ 1.         -0.09669641]
 [-0.09669641  1.        ]]

a coefficient of roughly $-0.10$, not the $-0.08$ the authors mention, but relatively close. I guess the data has been cleaned a bit since 1993?

As a further aside, accumulating this data and analyzing it was a true feat in 1993. These days, we can do the same using the internet and a few lines of Python (or, spoiler alert, using just a website).

Back to modelling

So what are we to do with SMB and HML? You guessed it – just add them to the regression model. Of course, this makes the subspace we project on larger, which will always decrease the “fraction of variance unexplained”, without necessarily explaining much. However, in the case of IJS it appears to explain a bit:

# Continuing from above.
A = np.stack(
    [np.ones_like(ijs_date), mktmrf[indices], smb[indices], hml[indices]], axis=1
)
y = ijs[ijs_indices] - rf[indices]
B = np.linalg.inv(A.T @ A) @ A.T @ y
alpha, beta_mkt, beta_smb, beta_hml = B

model_err = A @ B - y
ss_err = model_err.T @ model_err
r2 = 1 - ss_err.item() / np.var(y, ddof=len(y) - 1)
adjr2 = 1 - ss_err.item() / (A.shape[0] - A.shape[1]) / np.var(y, ddof=1)

print(
    "FF3F: alpha=%.2f%%; beta_mkt=%.2f; beta_smb=%.2f; beta_hml=%.2f."
    " R^2=%.1f%%; R_adj^2=%.1f%%. Annualized alpha: %.2f%%"
    % (
        alpha,
        beta_mkt,
        beta_smb,
        beta_hml,
        100 * r2,
        100 * adjr2,
        ((1 + alpha / 100) ** 12 - 1) * 100,
    )
)

This prints:

FF3F: alpha=0.04%; beta_mkt=0.97; beta_smb=0.79; beta_hml=0.51. R^2=95.8%; R_adj^2=95.8%. Annualized alpha: 0.43%

In other words, we dropped from an (annualized) $\alpha_{\rm CAPM} = 1.58\%$ to only $\alpha_{\rm FF3F} = 0.43\%$. The explained fraction of variance has increased to above 95%.

Remembering that Jensen (1968) talked about assessing fund managers with this model, we could try the same with actual managed funds. While I couldn’t produce any impressive results there, French and Fama did go into the question of Luck versus Skill in the Cross-Section of Mutual Fund Returns in a 2010 paper. The results, on average, don’t look good for fund managers’ skill. The story for individual fund managers may be better, but don’t hold your breath.

Was this worth it?

We discussed academic outputs of Jensen, French and Fama. The latter two even got a Nobel for their work on factor models. But nowadays, we can do (parts of) their computations in a few lines of Python.

It’s actually easier than that still. The website portfoliovisualizer.com allows us to do all of these computations and more with a few clicks. In that sense, this blog post was perhaps not worth it.

Another question is how useful these models are. This touches on why SMB and HML ‘explain’ returns of portfolios (e.g., the risk explanation vs the behavioural explanation mentioned above, or perhaps both or neither). In 2014, Fama and French presented another updated model with five factors, adding profitability and investment; judged by $R^2$, this five-factor model ‘explains’ even more of the variance of example portfolios. Other research suggesting alternative factors abounds.

How well do these models really ‘explain’ the phenomenal historical returns of star investors like Warren Buffett? Given that Buffett is a proponent of the Benjamin Graham school of value investing, including a value factor like HML could perhaps be the key to explain his success? For the Fama & French five-factor model, we can check portfoliovisualizer.com: With $R^2 \approx 33\%$, and an annualized $\alpha$ of 4.87%, the results don’t look too good for the math nerds but very good for the ‘Oracle of Omaha’.

This is obviously not a new observation. There is even a paper by a number of people from investment firm AQR about Buffett’s Alpha that aims to explain Buffett’s successes with leveraging as well as yet another set of new factors in a linear regression model:

[Buffett’s] alpha became insignificant, however, when we controlled for exposure to the factors “betting against beta” and “quality minus junk.”

Nice as this may sound, it would appear more convincing to this author if the financial analysis community could converge on a small common set of factors instead of seemingly creating them ad hoc. Otherwise, von Neumann’s line comes to mind: “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

And now what?

We discussed two financial economics papers and the linear regression models they propose, merely to give us a sense of what’s done in this field. One may get a sense that this research should be useful for more than just amusement, perhaps it could even inform our investment choices? Many good financial advisors will make use of data analyses like this and suggest factor-tilted portfolios. However, value investing, both with factors as well as the Buffett/Munger variety, has trailed the overall market in the last 10–15 years. Statistically, this is to be expected to happen every now and then, so we cannot read too much into that. But it’s possible the market has just caught on, past performance is not indicative of future results and value investing should be cancelled in 2020. That would at least match the zeitgeist. However, it’s also entirely possible it’s exactly times like the present that make value investing hard but ultimately worthwhile and we should be greedy when others are fearful.

Time will tell.

References

Frazzini, Andrea & Kabiller, David & Pedersen, Lasse Heje. Buffett’s Alpha. Financ. Anal. J. 74 (4), 35–55 (2018). DOI:10.2469/faj.v74.n4.3.
Fama, Eugene F. & French, Kenneth R. Common risk factors in the returns on stocks and bonds. J. Financ. Econ. 33 (1), 3–56 (1993). DOI:10.1016/0304-405X(93)90023-5.
———. Luck versus skill in the cross‐section of mutual fund returns. J. Financ. 65 (5), 1915–1947 (2010). DOI:10.1111/j.1540-6261.2010.01598.x.
———. A five-factor asset pricing model. J. Financ. Econ. 116 (1), 1–22 (2015). DOI:10.1016/j.jfineco.2014.10.010.
Jensen, Michael C. The performance of mutual funds in the period 1945–1964. J. Financ. 23 (2), 389–416 (1968). DOI:10.1111/j.1540-6261.1968.tb00815.x.

On linear regression

Wed, 19 Feb 2020 00:00:00 +0000

[Edit 14 May 2020: Thanks to Marcus Waurick for pointing out $A$ needs to have closed range for Lemmas 1 and 2 to be true in general. Unfortunately, there seems to be no easy way of doing linear regression in general Hilbert spaces (no equivalent of $\1$), although the idea of using weighted $L_2$ spaces and writing a blog post on Linear Regression in Hilbert spaces with Morgenstern norm has a certain ring to it.]

Apropos of nothing, this is a post about linear regression. There’s a whole world of information on that out there, but little with the right point of view: The Hilbert space perspective. That might not be universally loved, so you might feel more comfortable reading $\H_1 = \R^m$, $\H_2 = \R^n$ and “matrix” in place of “linear operator” below.

We’ll do the short theory up front, then look at an example, then say something about “fraction of variance explained”. A later blog post will discuss the CAPM, French-Fama models and efficient markets as examples.

Two lemmas.

Let $\H_1$, $\H_2$ be Hilbert spaces and $A\in L(\H_1,\H_2)$ be a linear operator with closed range (i.e., $\overline{\im A} = \im A$).

Lemma 1. $\H_2 = \im A\oplus\ker A^*$ in the sense of direct sums of Hilbert spaces.

Proof. Let $x\in\im A$, $y\in\ker A^*$. There is an $x_1\in\H_1$ with $x = Ax_1$ and

\[\langle x,y\rangle = \langle Ax_1, y\rangle = \langle x_1, A^* y\rangle = 0.\]

Thus $\im A\perp\ker A^*$.

Conversely, let $y\in(\im A)^\perp$. Then $0 = \langle Ax_1, y\rangle = \langle x_1, A^* y\rangle$ for all $x_1\in\H_1$, and thus $y\in\ker A^*$. //

Lemma 2. Let $P\from\H_2\to\im A$ be the orthogonal projection.

(a) Let $y\in\H_2$. Then there is a $b\in\H_1$ such that $A^*Ab = A^*y$ and $Py = Ab$. This is called the normal equation.

(b) If $\ker A = \{0\}$, i.e., $A$ injective, then $P = A(A^*A)^{-1}A^*$.

Proof. (a) Since $Py\in\im A$, there is $b\in\H_1$ with $Py = Ab$. Also $y - Py\in(\im A)^\perp$, and thus by Lemma 1

\[0 = A^*(y -Py) = A^*(y - Ab) = A^*y - A^*Ab.\]

(b) Lemma 1 implies $\H_1 = \im A^*\oplus\ker A = \im A^*$, i.e. $A^*$ is surjective. Since $\im A = (\ker A^*)^\perp$ and evidently $A^*\from(\ker A^*)^\perp\to\H_1$ is injective, that mapping is bijective. Additionally $A\from\H_1\to\im A$ is bijective, and thus $A^*A\from\H_1\to\H_1$ is bijective and invertible. Using (a) it follows that $Py = Ab = A(A^*A)^{-1}A^*y$. //
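For the finite-dimensional case, we can sanity-check Lemma 2 numerically. The sketch below uses a random matrix (injective with probability one) and a random $y$, so these are arbitrary test inputs rather than anything meaningful, and it checks that $P = A(A^\top A)^{-1}A^\top$ behaves like an orthogonal projection and that the normal equation gives $Py = Ab$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
A = rng.standard_normal((n, m))  # Random matrix, injective with probability one.
y = rng.standard_normal(n)

P = A @ np.linalg.inv(A.T @ A) @ A.T
b = np.linalg.solve(A.T @ A, A.T @ y)  # Normal equation A^T A b = A^T y.

print("P is idempotent:", np.allclose(P @ P, P))
print("P is symmetric: ", np.allclose(P, P.T))
print("Py = Ab:        ", np.allclose(P @ y, A @ b))
print("A^T(y - Py) = 0:", np.allclose(A.T @ (y - P @ y), 0))
```

Idempotent and symmetric together is what makes $P$ an orthogonal projection; the last line is the statement $y - Py\in\ker A^\top$ from Lemma 1.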

Linear regression as projection to subspaces.

Example: Warships

To take a favorite example from German Wikipedia, let’s discuss warships from the Second World War. Let’s say we’ve got $x_1\in\R^n$, a list of lengths of $n$ warships, $x_2\in\R^n$ a list of their beams (widths) and $y\in\R^n$ a list of their drafts (distance from bottom to waterline), all in meters. We call $x_1$ and $x_2$ feature vectors (also known as explanatory or independent variables) and can safely assume they are linearly independent, such that the matrix

\[A = \begin{pmatrix} | & | \\ x_1 & x_2 \\ | & | \end{pmatrix}\in\R^{n\times 2}\]

defines an injective operator. Its image is the linear span of $x_1$ and $x_2$, i.e., $\im A = \{\alpha x_1 + \beta x_2 \st \alpha, \beta\in\R\}\subseteq\R^n$. We wouldn’t expect $y$ to be an element of $\im A$, but the hope of a linear regression model trying to predict a warship’s draft from its length and beam is that $y$ is not too far from $\im A$ as a subspace of $\R^n$ either. Specifically, the $\abs{\dotid}_2$-distance is

\[\dist(\{y\}, \im A) = \min_{y'\in\im A}\abs{y - y'}_2 = \abs{y - P y}_2\]

where $P\from\R^n\to\im A$ is the orthogonal projection. We can call $Py$ the model’s prediction, and Lemma 2 above tells us how to compute $P$ from $A$, namely $P = A(A^\top A)^{-1}A^\top$. Moreover, the normal equation tells us which linear combination¹ of $x_1$ and $x_2$ computes $Py$:

\[Py = A(A^\top A)^{-1}A^\top y = Ab \;\text{ with }\; b = (A^\top A)^{-1}A^\top y\in\R^2.\]

Thus, if we are given an additional warship’s length and beam as a vector $a\in\R^2$, the model’s prediction of its draft will be $\langle a, b\rangle$.

We can implement this idea in simple Python code (Colab link):

import numpy as np

# Lengths, beams and drafts of 8 warships
data = """
187	31	9.4
242	31	9.6
262	32	8.7
216	32	9
195	32	9.7
227	31	11
193	20	5.2
175	17	5.2
""".split("\n")

lengths, beams, drafts = np.loadtxt(data, unpack=True)
A = np.stack([lengths, beams], axis=1)
b = np.linalg.inv(A.T @ A) @ A.T @ drafts
print("b:", b)
print("model prediction:", np.array([200, 20]) @ b)

This will print

b: [-0.00539494  0.34045321]
model prediction: 5.730077087608246

So our model would predict a hypothetical warship with length 200m and beam 20m to have 5.73m of draft.

Notice that the way we built our model makes it predict a draft of zero for the (nonsensical) inputs of a warship of zero meters length or beam. While this seems fine enough in this case, it’s not in others. A simple way of fixing this is to add an “intercept” (in machine learning known as “bias”) term: Define $\1=(1,\ldots,1)\in\R^n$ and add that as an additional (say, first) “feature” vector. This turns our model into an affine function of its data inputs. Doing this in our Python example changes little:

A = np.stack([np.ones_like(lengths), lengths, beams], axis=1)
b = np.linalg.inv(A.T @ A) @ A.T @ drafts
print("b:", b)
print("model prediction:", np.array([1, 200, 20]) @ b)

Result:

b: [ 0.09353128 -0.00580401  0.34027063]
model prediction: 5.738140993529072

However, trying to predict crew sizes changes that, see the Colab for details.

Of course, we can use more than two or three feature vectors as part of the design matrix $A$, as long as they are linearly independent, which is typically the case in practice with enough examples $n$.

Using a matrix $Y\in\R^{n\times m}$ in place of $y\in\R^n$ the same math allows us to succinctly treat the “multivariate” case in which we’re trying to predict more than one “dependent variable”, e.g. a warship’s draft and the size of its crew. This is just doing more than one linear regression at once.
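A quick sketch of the multivariate case, using made-up numbers throughout: the design matrix has an intercept and two random "length" and "beam" columns, and $Y$ holds two random dependent variables as columns. None of this is real warship data. The point is only that the same formula applied to $Y$ yields one coefficient column per dependent variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# Intercept plus two made-up feature columns.
A = np.stack([np.ones(n), rng.uniform(150, 270, n), rng.uniform(15, 35, n)], axis=1)
# Two made-up dependent variables as columns of Y.
Y = rng.standard_normal((n, 2))

B = np.linalg.inv(A.T @ A) @ A.T @ Y  # Shape (3, 2): one column per target.

# The same as two separate regressions, one per column of Y.
b0 = np.linalg.inv(A.T @ A) @ A.T @ Y[:, 0]
b1 = np.linalg.inv(A.T @ A) @ A.T @ Y[:, 1]
print(B.shape, np.allclose(B, np.stack([b0, b1], axis=1)))
```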

$R^2$ and all that

As mentioned above, the distance $\abs{y - Py}_2$ gives us a sense of the quality of our model’s predictions when applied to the data we built it on. The textbooks, being obsessed with element-wise expressions, call $\abs{y - Py}_2^2$ the residual sum of squares.

While this quantity (or a normalized version of it) is a measure of the error our model produces on the data we know, it doesn’t tell us if it was our input data that was useful in particular. To quantify that, we can compare it with a minimal, “inputless” model built from only the intercept entry, i.e., using $A_\1 = \1$. The prediction of that model will be the mean of the data, $\frac{1}{n}\langle y, \1\rangle$, and the orthogonal projection on $\im A_\1 = \lin\{\1\}$ is $P_1y = \frac{1}{n}\langle y, \1\rangle\1$. The squared distance $\abs{y - P_1y}_2^2$ of that minimal model’s predictions from the data is known as the total sum of squares. It’s also the unnormalized variance of the data. If our original model used an intercept term, i.e., $\1\in\im A$, then $Py - P_1y\in\im A$ and therefore $y - Py \perp Py - P_1y$. Hence, the Pythagorean identity tells us

\[\abs{y - P_1y}_2^2 = \abs{y - Py + Py - P_1y}_2^2 = \abs{y - Py}_2^2 + \abs{Py - P_1y}_2^2.\]

The last term is known somewhat factitiously as the explained sum of squares with the idea that it measures deviations from the mean caused by the data.

The ratio $\abs{y - Py}_2^2 / \abs{y - P_1y}_2^2$ is called the fraction of variance unexplained, although one has to be careful with that terminology. The coefficient of determination is one minus that number, or

\[R^2 := 1 - \frac{\abs{y - Py}_2^2}{\abs{y - P_1y}_2^2} = \frac{\abs{Py - P_1y}_2^2}{\abs{y - P_1y}_2^2},\]

where the last equality is true if and only if $\1\in\im A$.

Adjustments. Since including more features in our model matrix $A$ will make $\im A$ a larger subspace of $\R^n$, the error $\abs{y - Py}_2^2$ will be monotonically decreasing and $R^2$ monotonically increasing. While this offers the opportunity to claim to “explain” more, it might in fact make our model’s predictions on new inputs worse. Various solutions for this have been proposed. One somewhat principled approach is to use “adjusted $R^2$”, defined as

\[\bar{R}^2 = 1 - \frac{\abs{y - Py}_2^2 / (n - p - 1)}{\abs{y - P_1y}_2^2 / (n - 1)} \le R^2,\]

where $p$ is the number of features, i.e., $A\in\R^{n\times (p+1)}$ if we use an intercept term. This substitutes the unbiased sample variance and error estimators for their biased versions.
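Here is a small sketch of both quantities in code. The helper `r_squared` is a name I made up, it assumes the first column of $A$ is the intercept $\1$, and the data is synthetic: one informative feature, then the same model with five pure-noise features appended.

```python
import numpy as np


def r_squared(A, y):
    """Return (R^2, adjusted R^2); assumes the first column of A is the intercept."""
    n, cols = A.shape
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ b) ** 2)  # |y - Py|^2
    tss = np.sum((y - y.mean()) ** 2)  # |y - P_1 y|^2
    r2 = 1 - rss / tss
    adj = 1 - (rss / (n - cols)) / (tss / (n - 1))  # n - cols = n - p - 1
    return r2, adj


rng = np.random.default_rng(0)
n = 30
x = rng.standard_normal(n)
y = 2 * x + rng.standard_normal(n)

A = np.stack([np.ones(n), x], axis=1)
# Append five pure-noise features that carry no information about y.
A_noise = np.concatenate([A, rng.standard_normal((n, 5))], axis=1)

print("R^2=%.3f, adjusted=%.3f" % r_squared(A, y))
print("R^2=%.3f, adjusted=%.3f" % r_squared(A_noise, y))
```

The plain $R^2$ cannot go down when columns are added, since the subspace only grows. The adjusted version can move in either direction, as the correction factor $(n-1)/(n-p-1)$ grows with $p$.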

That’s it for now. We’ll add a proper example from finance in a future blog post.

¹ In situations with very many feature vectors, computing the inverse $(A^\top A)^{-1}$ may no longer be the best way of finding $b$. Instead, one could try to minimize $\abs{y-Ab}_2$ in another way, e.g. via gradient descent. This is how “linear layers” in machine learning are trained.

Annuity loans

Mon, 19 Aug 2019 23:25:06 +0000

Edit June 2023: Via Twitter, I found that the same formula I arrive at here has been published in P. Milanfar, “A Persian Folk Method of Figuring Interest”, Mathematics Magazine, vol. 69, no. 5, Dec. 1996 and apparently was known and used “in the bazaar’s of Iran (and elsewhere)”. The mnemonics in that paper are $\textrm{Monthly payment} = \frac{1}{\textrm{Number of months}}(\textrm{Principal} + \textrm{Interest})$ where $\textrm{Interest} = \tfrac{1}{2} \textrm{Principal} \cdot \textrm{Number of years} \cdot \textrm{Annual interest rate}$.

I’m fond of small mental calculation helpers like the rule of 72. Not that I am good at mental math (I once tried to fix that and got sidetracked), but I am good at spending time contemplating how I’d do it if I was better at it.

Another thing I’m not good at is finance. Lack of capital usually saves me from the worst mistakes, but despite the brilliant advice from Ben Felix, I sometimes contemplate spending money on real estate. Real estate tends to be financed with a mortgage, which often is a type of annuity loan. What is an annuity loan and is there a neat rule of thumb for it? Read on to find out.

What’s an annuity loan?

We receive a loan of size $S_0$ and pay it back. Each period we’ll pay the same amount

\[R = T_k + Z_k\]

with a principal repayment $T_k$ and an interest payment $Z_k$ for the $k$th payment out of $n$ total payments.

The interest payment is going to be a fixed percentage of the outstanding principal balance and thus $Z_1 = pS_0,$ where we set e.g. $p=0.02$ for a 2% interest rate.

So we always pay the same amount. How much?

Let’s consider an interest rate of $p$ and $n$ periods with one payment per period. We note that the payment

\[R = T_1 + pS_0 = T_2 + p(S_0 - T_1),\]

since $S_0 - T_1$ is the outstanding balance after the first payment. Hence $T_2 = (1 + p)T_1$. After $k$ and $k+1$ periods

\[R = T_k + p\Bigl(S_0 - \sum_{j=1}^{k-1} T_j\Bigr) = T_{k+1} + p\Bigl(S_0 - \sum_{j=1}^k T_j\Bigr),\]

hence $T_{k+1} = (1+p)T_k$ and

\[T_k = (1 + p)^{k-1}T_1.\]

To impose another condition, let’s say we want to fully pay back the loan after $n$ periods, i.e.,

\[S_0 = \sum_{k=1}^{n} T_k = T_1 \sum_{k=1}^{n}(1+p)^{k-1} = T_1\frac{(1+p)^n - 1}{p},\]

where, as so often, we made use of the sequence of partial sums $\sum_{k=0}^{n} q^k = \frac{q^{n+1}-1}{q-1}$ of the geometric series. Thus we find

\[R = T_1 + Z_1 = pS_0\frac{1}{(1+p)^n - 1} + pS_0 = pS_0\frac{(1+p)^n}{(1+p)^n - 1}\]

The regular annuity payment $R$ is therefore a constant factor of the loaned amount $S_0$, the annuity factor

\[\af(p, n) = p\frac{(1+p)^n}{(1+p)^n - 1}.\]
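To see the derivation at work, here is a minimal sketch that computes the annuity factor and walks through the payment schedule period by period. The function names and the example loan (1000 at 2% per period over 10 periods) are arbitrary choices of mine.

```python
def af(p, n):
    """Annuity factor for interest rate p per period and n periods."""
    q = (1 + p) ** n
    return p * q / (q - 1)


def schedule(s0, p, n):
    """Yield (k, principal repayment T_k, interest Z_k, remaining balance)."""
    payment = af(p, n) * s0
    balance = s0
    for k in range(1, n + 1):
        interest = p * balance
        principal = payment - interest
        balance -= principal
        yield k, principal, interest, balance


rows = list(schedule(1000.0, 0.02, 10))
for k, t_k, z_k, balance in rows:
    print("%2d  T=%7.2f  Z=%6.2f  balance=%8.2f" % (k, t_k, z_k, balance))
```

The balance after the last payment comes out as zero up to rounding, and consecutive principal repayments differ by the factor $1 + p$, matching $T_{k+1} = (1+p)T_k$.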

Not quite incidentally, the annuity factor also shows up in formulas for computing the equivalent annual cost from the net present value. We won’t go into that here.

Note that the condition to fully pay back $S_0$ does not in practice constrain us very much – if less is paid back, say $S_1$, we would do the calculation with $S_1$ in place of $S_0$ and add a regular interest payment of $p(S_0 - S_1)$.

Months and years

Typically we’ll make a monthly payment over many years. This is where some treatments get confusing and also where banks make a bit of an extra buck. For $n$ years and $p$ interest per year, for our purposes we’ll just take $12n$ periods and a monthly interest of $\sqrt[12]{1 + p} - 1 = \frac{p}{12} + O(p^2)$. For small $p$, e.g. not more than 10%, this is a good approximation. The monthly payment is then

\[\frac{p}{12}\frac{(1 + p/12)^{12n}}{(1 + p/12)^{12n} - 1}S_0.\]

Taylored for mental arithmetic

We started this looking for a simple approximation we can use for mental arithmetic, like the rule of 72 (it takes $72/x$ periods for an investment to double in value if it appreciates $x$% per period). The above isn’t that. The binomial theorem helps:

\[\Bigl(1 + \frac{p}{12}\Bigr)^{12n} = \sum_{k=0}^{12n} \binom{12n}{k} \Bigl(\frac{p}{12}\Bigr)^k = 1 + np + \binom{12n}{2}\Bigl(\frac{p}{12}\Bigr)^2 + O(p^3).\]

If we want to forget about the $p^2$ term as well we end up at

\[\af(p/12, 12n) \approx \frac{p}{12}\frac{1 + np}{np} = \frac{1 + np}{12n},\]

which is nice. But not great. A short calculation with the binomial coefficient $\binom{12n}{2}$ yields

\[\binom{12n}{2}\Bigl(\frac{p}{12}\Bigr)^2 = \frac{(np)^2}{2} - \frac{np^2}{24}.\]

For typical $n$ and $p$ (e.g., $n = 20$ and $p = 0.02$) the second term won’t play any role. With the first term we improve our approximation to

\[\af(p/12, 12n) \approx \frac{p}{12}\frac{1 + np + (np)^2/2}{np + (np)^2/2} = \frac{1}{12n}\frac{1 + np(1 + np/2)}{1 + np/2} = \frac{1}{12n}\Bigl(\frac{1}{1 + np/2} + np\Bigr).\]

Now, this is clearly unsuitable. But using the geometric series again we see that $\frac{1}{1 + np/2} = 1 - np/2 + (np/2)^2 + O(p^3)$ and hence

\[\af(p/12, 12n) \approx \frac{1 + \frac{np}{2}(1 + \frac{np}{2})}{12n}.\]

This, too, is not too convenient. But if we drop the $(np)^2$ term we end up with

\[\af(p/12, 12n) \approx \frac{1 + np/2}{12n}.\]

This isn’t too bad. For slightly larger $n$ and $p$ we could also have taken $1 + \frac{np}{2} = 1.2$, which would give us $0.6$ instead of $1/2$ in the above formula. Of course we can also do the division by 12 and arrive at another formula.

Let’s look at this from a slightly different perspective. A naive, “interest-free” calculation on what the monthly payment for an $n$ year loan of $S_0$ is would be $\frac{S_0}{12n}.$

Our formula $\af(p/12, 12n) \approx \frac{1 + np/2}{12n}$ says: Do the naive thing, then add x% to the result, where x is “$n$ times the interest rate over 2”.

Example

A loan of $400000 with 2% interest over 20 years:

\[\frac{\$400000}{12\cdot 20} \ \cdot\ \frac{20\cdot 2}{2}\% = \$1666 \ \cdot\ 20\% = \$333,\]

so the monthly rate will be $\$1666 + \$333 = \$2000$. This isn’t too far from the exact value of $\af(\sqrt[12]{1.02} - 1, 12n) \cdot \$400000 = \$2020$.
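The same comparison as a short sketch in code; the function names are my own and the numbers are those of the example.

```python
def af(p, n):
    """Annuity factor for interest rate p per period and n periods."""
    q = (1 + p) ** n
    return p * q / (q - 1)


def monthly_exact(s0, p, years):
    """Monthly payment using the monthly rate (1 + p)^(1/12) - 1."""
    return af((1 + p) ** (1 / 12) - 1, 12 * years) * s0


def monthly_rule_of_thumb(s0, p, years):
    """Naive payment s0 / (12 n), plus a share of n * p / 2 on top."""
    return s0 / (12 * years) * (1 + years * p / 2)


exact = monthly_exact(400_000, 0.02, 20)
approx = monthly_rule_of_thumb(400_000, 0.02, 20)
print("exact=%.0f; approx=%.0f" % (exact, approx))
```

Rounded to whole dollars, this gives the \$2020 and \$2000 from above.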

More numbers

The following table compares exact monthly rates as a percentage of the total loan with our estimate, i.e., the true annuity factor $\af(\sqrt[12]{1 + p} - 1, 12n)$ and our approximation $\frac{1 + np/2}{12n}$ of it (bold).

	$n = 2$	$n = 5$	$n = 7$	$n = 10$	$n = 15$	$n = 20$	$n = 25$	$n = 30$
$p = 0.5\%$	$4.19\%$ $\bf 4.19\%$	$1.69\%$ $\bf 1.69\%$	$1.21\%$ $\bf 1.21\%$	$0.85\%$ $\bf 0.85\%$	$0.58\%$ $\bf 0.58\%$	$0.44\%$ $\bf 0.44\%$	$0.35\%$ $\bf 0.35\%$	$0.30\%$ $\bf 0.30\%$
$p = 1.0\%$	$4.21\%$ $\bf 4.21\%$	$1.71\%$ $\bf 1.71\%$	$1.23\%$ $\bf 1.23\%$	$0.88\%$ $\bf 0.88\%$	$0.60\%$ $\bf 0.60\%$	$0.46\%$ $\bf 0.46\%$	$0.38\%$ $\bf 0.38\%$	$0.32\%$ $\bf 0.32\%$
$p = 1.5\%$	$4.23\%$ $\bf 4.23\%$	$1.73\%$ $\bf 1.73\%$	$1.25\%$ $\bf 1.25\%$	$0.90\%$ $\bf 0.90\%$	$0.62\%$ $\bf 0.62\%$	$0.48\%$ $\bf 0.48\%$	$0.40\%$ $\bf 0.40\%$	$0.34\%$ $\bf 0.34\%$
$p = 2.0\%$	$4.25\%$ $\bf 4.25\%$	$1.75\%$ $\bf 1.75\%$	$1.28\%$ $\bf 1.27\%$	$0.92\%$ $\bf 0.92\%$	$0.64\%$ $\bf 0.64\%$	$0.51\%$ $\bf 0.50\%$	$0.42\%$ $\bf 0.42\%$	$0.37\%$ $\bf 0.36\%$
$p = 3.0\%$	$4.30\%$ $\bf 4.29\%$	$1.80\%$ $\bf 1.79\%$	$1.32\%$ $\bf 1.32\%$	$0.96\%$ $\bf 0.96\%$	$0.69\%$ $\bf 0.68\%$	$0.55\%$ $\bf 0.54\%$	$0.47\%$ $\bf 0.46\%$	$0.42\%$ $\bf 0.40\%$
$p = 4.0\%$	$4.34\%$ $\bf 4.33\%$	$1.84\%$ $\bf 1.83\%$	$1.36\%$ $\bf 1.36\%$	$1.01\%$ $\bf 1.00\%$	$0.74\%$ $\bf 0.72\%$	$0.60\%$ $\bf 0.58\%$	$0.52\%$ $\bf 0.50\%$	$0.47\%$ $\bf 0.44\%$
$p = 5.0\%$	$4.38\%$ $\bf 4.38\%$	$1.88\%$ $\bf 1.88\%$	$1.41\%$ $\bf 1.40\%$	$1.06\%$ $\bf 1.04\%$	$0.79\%$ $\bf 0.76\%$	$0.65\%$ $\bf 0.62\%$	$0.58\%$ $\bf 0.54\%$	$0.53\%$ $\bf 0.49\%$
$p = 7.0\%$	$4.47\%$ $\bf 4.46\%$	$1.97\%$ $\bf 1.96\%$	$1.50\%$ $\bf 1.48\%$	$1.15\%$ $\bf 1.13\%$	$0.89\%$ $\bf 0.85\%$	$0.76\%$ $\bf 0.71\%$	$0.69\%$ $\bf 0.62\%$	$0.65\%$ $\bf 0.57\%$
$p = 10.0\%$	$4.59\%$ $\bf 4.58\%$	$2.10\%$ $\bf 2.08\%$	$1.64\%$ $\bf 1.61\%$	$1.30\%$ $\bf 1.25\%$	$1.05\%$ $\bf 0.97\%$	$0.94\%$ $\bf 0.83\%$	$0.88\%$ $\bf 0.75\%$	$0.85\%$ $\bf 0.69\%$
$p = 20.0\%$	$5.01\%$ $\bf 5.00\%$	$2.56\%$ $\bf 2.50\%$	$2.12\%$ $\bf 2.02\%$	$1.83\%$ $\bf 1.67\%$	$1.64\%$ $\bf 1.39\%$	$1.57\%$ $\bf 1.25\%$	$1.55\%$ $\bf 1.17\%$	$1.54\%$ $\bf 1.11\%$

Our linear approximation $\frac{1 + np/2}{12n}$, solid, and the true annuity factor $\af(\sqrt[12]{1 + p} - 1, 12n)$, dashed, for different numbers of years $n$. Gnuplot source

For low interest rates $p$ our approximation is rather good. For higher interest rates over many years it starts underestimating the true factor. In this regime using $0.6$ or higher in place of $1/2$ will yield better results.
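A small Python sketch of that last remark, again assuming the standard annuity factor $q / (1 - (1 + q)^{-N})$ for $\af$. The helper names and the coefficient parameter `c` are my own invention for this comparison.

```python
def annuity_factor(q: float, periods: int) -> float:
    """Payment per period per unit of loan (assumed definition of af)."""
    return q / (1.0 - (1.0 + q) ** -periods)


def true_factor(p: float, n: int) -> float:
    """Exact monthly factor for yearly interest p over n years."""
    monthly_interest = (1.0 + p) ** (1.0 / 12.0) - 1.0
    return annuity_factor(monthly_interest, 12 * n)


def approx_factor(p: float, n: int, c: float = 0.5) -> float:
    """Estimate (1 + c*n*p) / (12*n); c = 0.5 is the formula from the text."""
    return (1.0 + c * n * p) / (12 * n)


def abs_error(p: float, n: int, c: float) -> float:
    """Absolute gap between the estimate with coefficient c and the true factor."""
    return abs(approx_factor(p, n, c) - true_factor(p, n))


# High interest over many years versus low interest.
for p, n in [(0.10, 30), (0.02, 20)]:
    print(
        f"p={p:.0%}, n={n}: true={true_factor(p, n):.4%}, "
        f"c=0.5 -> {approx_factor(p, n, 0.5):.4%}, "
        f"c=0.6 -> {approx_factor(p, n, 0.6):.4%}"
    )
```

For $p = 10\%$ over 30 years the larger coefficient moves the estimate from about $0.69\%$ to about $0.78\%$, against a true factor of about $0.85\%$. For $p = 2\%$ over 20 years it overshoots instead, so $1/2$ remains the better choice for low interest.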

This doesn’t mean you should invest in real estate though.

Dylanchords

Sun, 25 Dec 2016 13:50:34 +0000

Ahh… dylanchords.com. A long time ago, when the internet was in its puberty and I had just rid myself of »Music« as a subject in school, you were there for me. A website in the style of the times, which filled my youthful admiration of Dylan with an actual, achievable purpose: to teach myself to play the guitar (after a fashion). I scanned those songs for the evil F chord and other cryptic nastiness (Bm7-5?!), printed them out and tried to recreate one after another (after a fashion). Soon enough I tuned my borrowed guitar to Open D and played a heartfelt Buckets of Rain. I felt sad but proud, the family went from amused to annoyed, and my guitar skills from pre-beginner to amateur.

The guitar, Dylan, and dylanchords.com would stay with me through university. Printing out old-style HTML would take a detour through an arguably more old-style system, and that mere process helped convince a number of people that I had sufficient interest in »culture« to be worth their time. Today, the resulting »project« is dormant, and made obsolete by more recent technology.

And yet, dylanchords stayed with me. The .com domain name effectively moved to another one for reasons that don’t seem to matter so much now, and these days, picking up an instrument isn’t something I do daily anymore. Or even weekly. But dylanchords still seems to matter. Recently, even the choice of subject has arguably been vindicated.

It being Christmas today, it feels right to thank the creator. No, not that one. Eyolf Østrem has »tabbed« around 900 songs, with patience and skill that others couldn’t even think of displaying, not to mention giving away all this work for free. Thank you, Eyolf Chordmaster. We don’t always agree. It doesn’t matter. Thanks for your work. It inspired me, and I’m sure it inspired many others.

Merry Christmas.

For further convenience, I’ve started hosting a Jekyll-generated, mobile-enabled version of Eyolf’s chordwork at dylaniki.org (code at GitHub). I find it useful for looking up songs from the phone, and maybe others feel the same.

In every beginning there is a delusion

Sun, 21 Aug 2016 23:31:34 +0000

So… somewhere in the order of 15 years too late I’ve also taken up this blogging thing. At least for a while. Karpathy’s blog and the elegance of Jekyll finally convinced me.
