heiner.ai
https://heiner.ai/
Tue, 16 Apr 2024 18:29:15 +0000

The chain rule, Jacobians, autograd, and shapes
<div class="right">
<p><cite>“Man muss immer umkehren”</cite> <br />
– Carl Gustav Jacob Jacobi<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</div>
<p>This is a short explainer about the chain rule and autograd in PyTorch and JAX,
from the perspective of a mathematical user.</p>
<div class="floated centered">
<p><img src="/img/jacobi.jpeg" alt="Carl Gustav Jacob Jacobi" style="width:200px;" />
<br /></p>
<s>The Joker</s>
<p>Carl Gustav Jacob Jacobi</p>
<div class="small">
<p>(1804 – 1851)<br />
“Die Haare immer nach hinten kehren”<br />
Image source: <a href="https://en.wikipedia.org/wiki/Carl_Gustav_Jacob_Jacobi#/media/File:Carl_Jacobi.jpg">Wikipedia</a></p>
</div>
</div>
<p>There are many, many explanations of this on the web. Many are likely
better than this one. I’ll still write my own, which in the spirit of
this blog is meant to be written, not read. I also won’t focus on
the implementation of the system, just on its observable behavior.</p>
<p>Other, perhaps better sources for the same info are:</p>
<ul>
<li><a href="https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py">A gentle introduction to
<code class="language-plaintext highlighter-rouge">torch.autograd</code></a>
from the tutorials at pytorch.org.</li>
<li><a href="https://pytorch.org/docs/stable/notes/autograd.html">Autograd
mechanics</a>
from pytorch.org.</li>
<li>Zachary DeVito’s excellent <a href="https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC">colab with an example implementation
of reverse-mode
autodiff</a>
from scratch. I highly recommend studying this one.</li>
<li><a href="https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html">The Autodiff
Cookbook</a>
in the JAX docs. Also very good.</li>
</ul>
<p>The JAX docs especially are delightfully mathematical. There are many more
sources on the web. I like the more extensive treatment in <a href="https://mediatum.ub.tum.de/doc/1638886/document.pdf#page=25">Thomas Frerix’s
PhD thesis</a>.</p>
<p>Still, let me add my own spin on the issue. One reason is that many
other articles write things like <code class="language-plaintext highlighter-rouge">dL/dOutputs</code> and <code class="language-plaintext highlighter-rouge">dL/dInputs</code> and
generally use a “variable”-based notation that, while entirely
reasonable from an implementation standpoint, would <a href="https://twitter.com/HeinrichKuttler/status/1262337725161771009">make Spivak
sad</a>.</p>
<h2 id="the-chain-rule">The chain rule</h2>
<p>All of “deep learning”, and most of Physics, depends on this theorem
discovered by Leibniz ca. 1676.</p>
<h3 id="one-dimensional-functions">One dimensional functions</h3>
<p>Given two differentiable functions \(f,g\from\R\to\R\), the chain rule
says</p>
\[(g\circ f)'(x) = g'(f(x))f'(x) \where{x\in\R}. \label{eq:cr1}\tag{1}\]
<p>Here, \((g\circ f)(x) = g(f(x))\) is the composition, i.e., chained
evaluation of first \(y = f(x)\), then \(g(y)\).</p>
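<p>A quick numerical sanity check of \(\eqref{eq:cr1}\), with the hypothetical example
functions \(f(x) = x^2\) and \(g(y) = \sin y\) (not from the text, just for illustration):</p>

```python
import math

# Hypothetical example functions and their (hand-computed) derivatives.
def f(x): return x * x
def g(y): return math.sin(y)
def fprime(x): return 2 * x
def gprime(y): return math.cos(y)

x, h = 1.3, 1e-6
# Central finite difference of the composition g ∘ f at x ...
finite_diff = (g(f(x + h)) - g(f(x - h))) / (2 * h)
# ... versus the chain rule: g'(f(x)) * f'(x).
chain_rule = gprime(f(x)) * fprime(x)
assert abs(finite_diff - chain_rule) < 1e-5
```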
<h3 id="multidimensional-functions">Multidimensional functions</h3>
<p>It’s one of the wonders of analysis that this rule keeps being correct
for differentiable multidimensional functions \(f\from\R^n\to\R^m\),
\(g\from\R^m\to\R^k\) if “differentiable” is defined
correctly. Skipping over some technicalities, this requires \(f\) and
\(g\) to be <em>totally differentiable</em> (also called
<em>Fréchet differentiable</em>), which implies that all components
\(f_j\from\R^n\to\R\) are partially differentiable; the derivative
\(f'(a)\) for \(a\in\R^n\) can then be shown to be equal to the
<em>Jacobian matrix</em></p>
\[f'(a) = J_f(a)
:= \bigl(\partial_k f_j(a)\bigr)_{\substack{j=1,\ldots,m\\ k=1,\ldots,n}}
= \begin{pmatrix}
\partial_1 f_1(a) & \cdots & \partial_n f_1(a) \\
\vdots && \vdots \\
\partial_1 f_m(a) & \cdots & \partial_n f_m(a)
\end{pmatrix}
\in\R^{m\times n}.\]
<p>The idea is to view \(f(a) = (f_1(a), \ldots, f_m(a))^\top\in\R^m\)
as a column vector and add one column per partial derivative, i.e.,
per dimension of its input \(a\).</p>
<p>With this definition, the multidimensional chain rule reads like its
1D version \(\eqref{eq:cr1}\),</p>
\[(g\circ f)'(x) = g'(f(x))\cdot f'(x) \where{x\in\R^n}. \label{eq:crN}\tag{2}\]
<p>However, in this case this is the matrix multiplication
\(\cdot\from\R^{k\times m}\times\R^{m\times n}\to\R^{k\times n}\) of
\(J_g(f(x))\) and \(J_f(x)\) and their order is important. Jacobi
would have called this <em>nachdifferenzieren</em>.</p>
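<p>The multidimensional rule \(\eqref{eq:crN}\) can be checked numerically, e.g. with
<code class="language-plaintext highlighter-rouge">torch.autograd.functional.jacobian</code>. A small sketch with hypothetical
example functions \(f\from\R^2\to\R^3\) and \(g\from\R^3\to\R^2\):</p>

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical example functions, chosen only to have nontrivial Jacobians.
def f(x):
    return torch.stack([x[0] ** 2, x[0] * x[1], torch.sin(x[1])])

def g(y):
    return torch.stack([y[0] + y[1], y[1] * y[2]])

x = torch.tensor([1.0, 2.0])
# Jacobian of the composition g ∘ f at x, shape (2, 2).
J = jacobian(lambda x: g(f(x)), x)
# Chain rule: J_g(f(x)) · J_f(x), a (2, 3) @ (3, 2) matrix product.
J_chain = jacobian(g, f(x)) @ jacobian(f, x)
assert torch.allclose(J, J_chain)
```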
<p>By its nature, this rule iterates, i.e., the derivative of the
composition \(h\circ g\circ f\) is</p>
\[(h\circ g\circ f)'(x)
=
h'(g(f(x)))\cdot g'(f(x))\cdot f'(x)
=
(h'\circ g\circ f)(x)\cdot (g'\circ f)(x)\cdot f'(x),\]
<p>and likewise for a composition of \(n\) functions</p>
\[(f_n\circ \cdots \circ f_1)'(x)
=
(f_n'\circ f_{n-1}\circ\cdots\circ f_1)(x)\cdot
(f_{n-1}'\circ f_{n-2}\circ\cdots\circ f_1)(x)
\,\cdots\,
(f_2'\circ f_1)(x) \cdot f_1'(x). \label{eq:crNn}\tag{3}\]
<p>Note that subscripts no longer mean components here; we are talking
about \(n\) functions, each with multidimensional inputs and outputs.</p>
<h3 id="backprop">Backprop</h3>
<p>Now, while matrix multiplication isn’t commutative, meaning in general
\(AB \ne BA\), it is associative: For a product of several matrices
\(ABC\), it does not matter if one computes \((AB)C\) or \(A(BC)\);
this is what makes the notation \(ABC\) sensible in the first
place. However, this only means the same output is produced by those
two alternatives. It does not mean the same amount of “work” (or
“compute”) went into either case.</p>
<p>Counting scalar operations, a product \(AB\) with
\(A\in\R^{k\times m}\) and \(B\in\R^{m\times n}\) takes \(nkm\)
multiplications and \(nk(m-1)\) additions, since each entry in the
output matrix is the sum of \(m\) multiplications. In a chain of
matmuls like \(ABCDEFG\cdots\), it’s clear that some groupings like
\(((AB)(CD))(E(FG))\cdots\) might be better than others. In
particular, there may well be better and worse ways of computing the
Jacobian \(\eqref{eq:crNn}\). However, to quote
<a href="https://en.wikipedia.org/wiki/Automatic_differentiation#Beyond_forward_and_reverse_accumulation">Wikipedia</a>:</p>
<blockquote>
<p>The problem of computing a full Jacobian of \(f\from\R^n\to\R^m\) with a
minimum number of arithmetic operations is known as the optimal
Jacobian accumulation (OJA) problem, which is NP-complete.</p>
</blockquote>
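<p>To make the difference in work concrete, a toy count of scalar multiplications
for the two groupings of \(ABC\), with the hypothetical sizes
\(A\in\R^{1\times 100}\) and \(B, C\in\R^{100\times 100}\):</p>

```python
def matmul_mults(k, m, n):
    """Scalar multiplications in a (k x m) @ (m x n) product."""
    return k * m * n

# A: k x m, B: m x n, C: n x p (hypothetical sizes for illustration).
k, m, n, p = 1, 100, 100, 100
left_first = matmul_mults(k, m, n) + matmul_mults(k, n, p)    # (AB)C
right_first = matmul_mults(m, n, p) + matmul_mults(k, m, p)   # A(BC)
assert left_first == 20_000       # two vector-matrix products
assert right_first == 1_010_000   # one full matmul, then vector-matrix
```

With a one-row \(A\), sweeping from the left is cheaper by a factor of about 50 here; the gap grows with the matrix sizes.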
<p>The good news is that in important special cases, this problem is
easy. In particular, if the first matrix in the product has only one
row, or the last only one column, it makes sense to compute the
product “from the left” or “from the right”, respectively. And those
cases are not particularly pathological either, as they correspond to
either the final function \(f_n\) mapping to a scalar or the whole
composition depending only on a scalar input \(x\in\R\). The former
case is especially important as that’s what happens whenever we
compute the gradients of a scalar <em>loss function</em>, such as in basically all
cases where neural networks are used.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>
The corresponding orders of multiplications in the chain rule
\(\eqref{eq:crNn}\) are known as
<em>reverse mode</em> or <em>forward mode</em>, respectively. Ignoring some
distinctions not relevant here, reverse mode is also known as
<em>backpropagation</em>, or <em>backprop</em> for short.</p>
<p>In this case, the image of \(f_n\) is one dimensional, and therefore
its Jacobian matrix \(f_n'\) has only one row – it’s a row
vector. After multiplication with the next Jacobian \(f_{n-1}'\), this
property is preserved: All matmuls in the chain turn into
<em>vector-times-matrix</em>. Specifically, they are
vector-times-Jacobian-matrix, more commonly known as a vector-Jacobian
product, or VJP. To illustrate:</p>
\[\begin{pmatrix}
\unicode{x2E3B}
\end{pmatrix}_1
\begin{pmatrix}
| & | & | \\
| & | & | \\
| & | & |
\end{pmatrix}_2
\begin{pmatrix}
| & | & | \\
| & | & | \\
| & | & |
\end{pmatrix}_3
\cdots
\begin{pmatrix}
| & | & | \\
| & | & | \\
| & | & |
\end{pmatrix}_n
=
\begin{pmatrix}
\unicode{x2E3B}
\end{pmatrix}
\begin{pmatrix}
| & | & | \\
| & | & | \\
| & | & |
\end{pmatrix}_3
\cdots
\begin{pmatrix}
| & | & | \\
| & | & | \\
| & | & |
\end{pmatrix}_n
=
\text{etc.}\]
<p>What this means is that for neural network applications, a system like
PyTorch or JAX <em>doesn’t need to actually compute</em> full Jacobians – all
it needs are vector-Jacobian products. It turns out that
(classic) PyTorch does and can in fact do nothing else.</p>
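<p>To make the left-to-right sweep concrete, a sketch with random stand-in
Jacobians (hypothetical, not any particular network):</p>

```python
import torch

torch.manual_seed(0)
# Stand-in Jacobians J_1, ..., J_5 of some chain of functions R^4 -> R^4.
Js = [torch.randn(4, 4, dtype=torch.double) for _ in range(5)]
# Row vector: the Jacobian of a scalar-valued final function.
v = torch.randn(1, 4, dtype=torch.double)

# Reverse mode: sweep from the left, one vector-matrix product at a time.
out = v
for J in Js:
    out = out @ J

# Same result as first materializing the full matrix product (more work).
full = torch.linalg.multi_dot(Js)
assert torch.allclose(out, v @ full)
```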
<p>Also notice that the row vector
\(\begin{pmatrix}\unicode{x2E3B}\end{pmatrix}\) multiplied from the
left is “output-shaped” from the perspective of the Jacobian it gets
multiplied to. This is the reason <code class="language-plaintext highlighter-rouge">grad_output</code> in PyTorch always has
the shape of the operation’s output and why
<a href="https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html"><code class="language-plaintext highlighter-rouge">torch.Tensor.backward</code></a>
receives an argument described as the “gradient w.r.t. the
tensor”.</p>
<p>You may wonder what it means for it to have a specific shape vs just
being a “flat” row vector as it is here. I certainly wondered about
this; the answer is below.</p>
<h2 id="in-pytorch">In PyTorch</h2>
<p>Here’s a very simple PyTorch example (lifted from
<a href="https://medium.com/@monadsblog/pytorch-backward-function-e5e2b7e60140">here</a>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">empty</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">5</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">y</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">])</span>
<span class="n">y</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c1"># VJP.
</span>
<span class="k">print</span><span class="p">(</span><span class="s">"y:"</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"x.grad:"</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c1"># Manual computation.
</span><span class="n">dydx</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
<span class="p">[</span>
<span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">],</span>
<span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">10</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="p">]</span>
<span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">,</span> <span class="n">v</span> <span class="o">@</span> <span class="n">dydx</span><span class="p">)</span>
</code></pre></div></div>
<p>In PyTorch, <code class="language-plaintext highlighter-rouge">Tensor.backward</code> computes the backward pass via
vector-Jacobian products. If the tensor in question is a scalar, an
implicit <code class="language-plaintext highlighter-rouge">1</code> is assumed, otherwise one has to supply a tensor of the
same shape which is used as the vector in the VJP, as in the example
above.</p>
<h2 id="but-wait-tensor-or-vector">But wait, tensor or vector?</h2>
<p>For the typical mathematician, the term “tensor” for the
multidimensional array in PyTorch and JAX is a bit of an acquired (or
not) taste – but to be fair, so is anything about the real tensor
product \(\otimes\) as well.</p>
<p>The situation here seems especially confusing – the chain rule from
multidimensional calculus makes a specific point of what’s a row and
what’s a column and treats functions \(\R^n\to\R^m\) by introducing
\(m\times n\) matrices. In PyTorch, functions depend on and produce one (or
several) multidimensional “tensors”. What gives?</p>
<p>The answer turns out to be relatively simple: The math part of the
backward pass doesn’t depend on these tensor shapes in any deep
way. Instead, it does the equivalent of “reshaping” everything into a
vector. Example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
<span class="p">[</span>
<span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span>
<span class="p">[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span>
<span class="p">],</span>
<span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">empty</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">5</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">y</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">y</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
<span class="p">[</span>
<span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span>
<span class="p">[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span>
<span class="p">[</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">],</span>
<span class="p">]</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"y:"</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">y</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="c1"># VJP.
</span><span class="k">print</span><span class="p">(</span><span class="s">"x.grad:"</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">)</span>
</code></pre></div></div>
<p>What did this even compute? It’s the equivalent of reshaping inputs
and outputs to be vectors, then applying the standard calculus from
above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dydx</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span> <span class="c1"># y.reshape(-1) x x.reshape(-1) Jacobian matrix.
</span> <span class="p">[</span>
<span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="c1"># y[0, 0]
</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="c1"># y[0, 1]
</span> <span class="p">[</span><span class="mi">2</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="mi">10</span> <span class="o">*</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="c1"># y[1, 0]
</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="c1"># y[1, 1]
</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="c1"># y[2, 0]
</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="c1"># y[2, 1]
</span> <span class="p">]</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"dydx"</span><span class="p">,</span> <span class="n">dydx</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span>
<span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">,</span>
<span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">@</span> <span class="n">dydx</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>
<p>To be clear: PyTorch does not actually compute the Jacobian only to
multiply it from the left with this vector, but what it does has the
same output as this less efficient code.</p>
<h2 id="in-jax">In JAX</h2>
<p>PyTorch is great. But it’s not necessarily principled. This tweet
expresses this somewhat more aggressively:</p>
<blockquote class="twitter-tweet tw-align-center"><p lang="en" dir="ltr">The Aesthetician in me wants to be constantly annoyed by how ugly PyTorch is but frankly I’m consistently impressed with how easy it is to develop in.</p>— Aidan Clark (@_aidan_clark_) <a href="https://twitter.com/_aidan_clark_/status/1568052833030897664?ref_src=twsrc%5Etfw">September 9, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>JAX is, arguably, different, at least on the first account. It comes
with vector-Jacobian products, Jacobian-vector products, and also the
option to compute full Jacobians if required. I couldn’t write up the
details better than the <a href="https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html">JAX Autodiff
Cookbook</a>
does, so I won’t try. To quote just one relevant portion, adjusted
slightly to our notation:</p>
<blockquote>
<p>The JAX function <code class="language-plaintext highlighter-rouge">vjp</code> can take a Python function for evaluating \(f\)
and give us back a Python function for evaluating the VJP \((x, v)\mapsto(f(x), v^\top f'(x))\).</p>
</blockquote>
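<p>In code, a minimal sketch (using the same example function as the first
PyTorch snippet):</p>

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] ** 2, x[0] ** 2 + 5 * x[1] ** 2, 3 * x[1]])

x = jnp.array([1.0, 2.0])
# jax.vjp returns f(x) and a function evaluating v ↦ v^T f'(x).
y, f_vjp = jax.vjp(f, x)
(grad,) = f_vjp(jnp.array([1.0, 2.0, 3.0]))
assert jnp.allclose(grad, jnp.array([6.0, 49.0]))
```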
<h2 id="so-what-does-it-really-do">So what does it really do?</h2>
<p>For the autodiff details, read either Zachary DeVito’s <a href="https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC">colab with an
example implementation of what PyTorch does</a>,
or the JAX Autodiff Cookbook, or ideally both.</p>
<p>But to round things out, let’s look at how one defines a “custom
function” with its own forward and backward pass in PyTorch.</p>
<p>We’ll take elementwise multiplication as our first example; the fancy
mathematics name of this simple operation is
<a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)"><em>Hadamard
product</em></a>
(also known as <em>Schur product</em> – their motto was “<em>name, always name</em>”<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>). We’ll
denote it by \(\odot.\) To fit it within the
standard calculus above, we reinterpret
\(\odot\from\R^n\times\R^n\to\R^n\) as \(\odot\from\R^{2n}\to\R^n\), multiplying
the first and second “half” of its single-vector input, i.e.,
\(\R^{2n}\ni x\mapsto\odot(x) = (x_jx_{n+j})_{j=1,\ldots,n}\in\R^n.\) Its derivative
is</p>
\[\odot'(x) =
\begin{pmatrix}
x_{n+1} & & & & x_1 & & & \\
& x_{n+2} & & & & x_2 & & \\
& & {\lower 3pt\smash{\ddots}} & & & & {\lower 3pt\smash{\ddots}} & \\
& & & x_{2n} & & & & x_n
\end{pmatrix}
\in\R^{n\times 2n}, \where{x\in\R^{2n}}\]
<p>where empty cells are zeros. It should be immediately obvious that
there’s no need to “materialize” these diagonal Jacobians. In fact,
given a vector \(v = (v_1, \ldots, v_{2n})\in\R^{2n}\), the VJP here
is just</p>
\[v\cdot \odot'(x) = (v_1x_{n+1}, \ldots, v_nx_{2n},
v_{n+1}x_1, \ldots, v_{2n}x_n) \in \R^{2n}.\]
<p>(And that already seems like a needless complication of such a simple
thing as elementwise multiplication!)</p>
<p>The PyTorch version of this is … well,
it’s just <code class="language-plaintext highlighter-rouge">x * y</code>, and PyTorch knows full well how to compute and
differentiate that. But if we wanted for some reason to add this from
scratch, it might look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HadamardProduct</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">Function</span><span class="p">):</span>
<span class="s">"""Computes lhs * rhs and its backward pass."""</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">):</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">save_for_backward</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">)</span>
<span class="k">return</span> <span class="n">lhs</span> <span class="o">*</span> <span class="n">rhs</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
<span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">saved_tensors</span>
<span class="n">grad_lhs</span> <span class="o">=</span> <span class="n">grad_output</span> <span class="o">*</span> <span class="n">rhs</span>
<span class="n">grad_rhs</span> <span class="o">=</span> <span class="n">grad_output</span> <span class="o">*</span> <span class="n">lhs</span>
<span class="k">return</span> <span class="n">grad_lhs</span><span class="p">,</span> <span class="n">grad_rhs</span>
<span class="n">mul</span> <span class="o">=</span> <span class="n">HadamardProduct</span><span class="p">.</span><span class="nb">apply</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">ctx</code> argument is used to stash the tensors required for the
backward pass somewhere (in Zachary’s
<a href="https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC">colab</a>,
this is done via a closure) and the <code class="language-plaintext highlighter-rouge">grad_output</code> argument is the
output-shaped “vector” of the VJP. It’s a single argument in this case
since the function has a single output. PyTorch may reuse this tensor,
so it is important, even in cases where the computed gradient has the same
shape, to “NEVER modify [this argument] in-place”, as the <a href="https://pytorch.org/docs/stable/notes/extending.html#how-to-use">PyTorch docs
say</a>.</p>
<p>Testing this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lhs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">rhs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">mul</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"y ="</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])</span>
<span class="n">y</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">lhs</span><span class="p">.</span><span class="n">grad</span><span class="p">,</span> <span class="n">v</span> <span class="o">*</span> <span class="n">rhs</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">rhs</span><span class="p">.</span><span class="n">grad</span><span class="p">,</span> <span class="n">v</span> <span class="o">*</span> <span class="n">lhs</span><span class="p">)</span>
</code></pre></div></div>
<p>This example is typical for elementwise operations, where the
Jacobian matrices are diagonal and VJPs are just elementwise
operations themselves.</p>
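<p>A common way to test such a custom <code class="language-plaintext highlighter-rouge">Function</code> is
<code class="language-plaintext highlighter-rouge">torch.autograd.gradcheck</code>, which compares the analytic backward pass against
finite differences (the class is repeated here so the snippet is self-contained):</p>

```python
import torch

class HadamardProduct(torch.autograd.Function):
    """Computes lhs * rhs and its backward pass (as defined above)."""

    @staticmethod
    def forward(ctx, lhs, rhs):
        ctx.save_for_backward(lhs, rhs)
        return lhs * rhs

    @staticmethod
    def backward(ctx, grad_output):
        lhs, rhs = ctx.saved_tensors
        return grad_output * rhs, grad_output * lhs

# gradcheck wants double precision inputs with requires_grad=True.
lhs = torch.randn(4, dtype=torch.double, requires_grad=True)
rhs = torch.randn(4, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(HadamardProduct.apply, (lhs, rhs))
assert ok
```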
<p>To contrast this with another example, let’s look at a real, genuine
linear function. Remember that for a matrix \(W\in\R^{m\times n}\),
the function \(\R^n\ni x \mapsto Wx \in \R^m\) is Fréchet
differentiable and its derivative is the constant \(W\).<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> Taking \(x\)
as a constant and \(W=(W_{jk})_{j=1,\ldots,m;\; k=1,\ldots,n}\) as the variable we differentiate with respect to, and “reshaping to
vector form” as above, the derivative at \(W\) is</p>
\[\begin{pmatrix}
x_1 & \cdots & x_n & & & & & & & \\
& & & x_1 & \cdots & x_n & & & & \\
& & & & & & \ddots & & & \\
& & & & & & & x_1 & \cdots & x_n
\end{pmatrix}
\in\R^{m\times nm}.\]
<p>Forming the VJP with a vector \(v\in\R^m\) and reshaping back to the
shape of \(W\) yields the VJP \(v\cdot x^\top\), with entries
\((v_jx_k)_{j,k}\).</p>
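<p>To make the reshaping concrete, here is a small NumPy sketch (my own, not
from the original post) that builds the \(m\times nm\) block-diagonal Jacobian
above explicitly and checks that the reshaped VJP equals \(v\cdot x^\top\):</p>

```python
import numpy as np

m, n = 3, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
v = rng.standard_normal(m)

# Jacobian of vec(W) -> W @ x (W flattened row-major): shape (m, m*n),
# with the row (x_1 ... x_n) repeated along the block diagonal.
J = np.kron(np.eye(m), x)
vjp = (v @ J).reshape(m, n)  # VJP in "vector form", reshaped back to W's shape.
assert np.allclose(vjp, np.outer(v, x))  # entries (v_j x_k)_{j,k}
```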
<p>The PyTorch code for that is (cf. the <a href="https://pytorch.org/docs/stable/notes/extending.html#example">PyTorch
docs</a>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ActuallyLinear</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">Function</span><span class="p">):</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="nb">input</span><span class="p">,</span> <span class="n">weight</span><span class="p">):</span>
<span class="n">ctx</span><span class="p">.</span><span class="n">save_for_backward</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">weight</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">weight</span> <span class="o">@</span> <span class="nb">input</span>
<span class="k">return</span> <span class="n">output</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">grad_output</span><span class="p">):</span>
<span class="nb">input</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">saved_tensors</span>
<span class="n">grad_input</span> <span class="o">=</span> <span class="n">grad_weight</span> <span class="o">=</span> <span class="bp">None</span>
<span class="c1"># Optional checks for efficiency -- only compute the VJPs that are actually needed.
</span> <span class="k">if</span> <span class="n">ctx</span><span class="p">.</span><span class="n">needs_input_grad</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>
<span class="n">grad_input</span> <span class="o">=</span> <span class="n">grad_output</span> <span class="o">@</span> <span class="n">weight</span>
<span class="k">if</span> <span class="n">ctx</span><span class="p">.</span><span class="n">needs_input_grad</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
<span class="n">grad_weight</span> <span class="o">=</span> <span class="n">grad_output</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">@</span> <span class="nb">input</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">grad_input</span><span class="p">,</span> <span class="n">grad_weight</span>
</code></pre></div></div>
<p>Testing this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">W</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span>
<span class="p">[</span>
<span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span>
<span class="p">[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span>
<span class="p">[</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">],</span>
<span class="p">],</span>
<span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">10.0</span><span class="p">,</span> <span class="mf">11.0</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">ActuallyLinear</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"y ="</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">])</span>
<span class="n">y</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="c1"># Save for comparison.
</span><span class="n">x_grad</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">grad</span>
<span class="n">W_grad</span> <span class="o">=</span> <span class="n">W</span><span class="p">.</span><span class="n">grad</span>
<span class="n">x</span><span class="p">.</span><span class="n">grad</span> <span class="o">=</span> <span class="bp">None</span> <span class="c1"># Reset to zero.
</span>
<span class="n">torch_linear</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">W</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">W</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># Set weight without touching gradient tape.
</span><span class="n">torch_linear</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">data</span><span class="p">[...]</span> <span class="o">=</span> <span class="n">W</span>
<span class="n">torch_y</span> <span class="o">=</span> <span class="n">torch_linear</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">torch_y</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">x_grad</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">torch_linear</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">grad</span><span class="p">,</span> <span class="n">W_grad</span><span class="p">)</span>
</code></pre></div></div>
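<p>The same numbers can also be obtained without writing a custom
<code class="language-plaintext highlighter-rouge">Function</code>, via
<code class="language-plaintext highlighter-rouge">torch.autograd.functional.vjp</code>
(my own cross-check, not part of the original post):</p>

```python
import torch

W = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x = torch.tensor([10.0, 11.0])
v = torch.tensor([1.0, 2.0, 3.0])

# vjp evaluates f at the inputs and forms the vector-Jacobian product
# v^T Df in a single call, returning (f(inputs), vjp-per-input).
y, (x_grad, W_grad) = torch.autograd.functional.vjp(
    lambda x, W: W @ x, (x, W), v
)
assert torch.equal(x_grad, v @ W)              # VJP w.r.t. x is v^T W.
assert torch.equal(W_grad, torch.outer(v, x))  # VJP w.r.t. W is v x^T.
```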
<h2 id="one-final-comment">One final comment</h2>
<p>I could go on, but this is plenty for now. The last thing worth
mentioning is that \(\eqref{eq:crNn}\) doesn’t quite fit the actual
situation of deep neural networks, where one takes the gradient with
respect to all weights, each layer \(f_j\) depends both on the outputs
(“activations”) of the previous layer and on its own weights, and the
inputs of the first layer \(f_1\) (and potentially of some later
layers, e.g. for “targets”) are just “data”, treated as a
parametrization of the functions. One way to apply the Calculus 102
rule above would be to have earlier layers pass on all weights they
don’t need via an identity function (with an identity matrix for that
part of the Jacobian). This can also be written in other ways, but the
<a href="https://developer.mozilla.org/en-US/docs/Web/CSS/margin-bottom"><code class="language-plaintext highlighter-rouge">margin-bottom</code></a>
here is too small to contain it.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Did Jacobi actually say that? Googling this is hard because
<a href="https://fs.blog/inversion/">Charlie Munger likes the supposed quote so
much</a>.
However, there’s a <a href="https://www.ams.org/journals/bull/1916-23-01/S0002-9904-1916-02863-1/S0002-9904-1916-02863-1.pdf#page=3">1916
source</a>
claiming “[t]he great mathematician Jacobi is said to have
inculcated upon his students the dictum”. A <a href="https://books.google.de/books?id=pVNR7-6XeCQC&pg=PA1214&lpg=PA1214&dq=%22man+muss+alles+umkehren%22+jacobi&source=bl&ots=MAEDG85f23&sig=ACfU3U1AEZdgV7tqBzSLUxTojqAQLVihuQ&hl=en&sa=X&redir_esc=y#v=onepage&q=%22man%20muss%20alles%20umkehren%22%20jacobi&f=false">1968 source</a>
gives a slightly different version. Perhaps it’s a
matter of <em>invent, always invent</em>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>To illustrate the extremes this is pushed to for large scale
cases: A large language model like GPT3 can have billions or even
trillions of weights (corresponding to the dimension of
\(x\in\R^n\)) but still computes a scalar (one dimensional) loss. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>This is a joke. In reality, mathematicians are taught that if
something is named after someone, that is a good indication that
person <em>didn’t</em> invent that. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>In fact, the underlying idea of the Fréchet derivative is linear
approximations like that. Unlike partial derivatives, this idea
carries over to the infinite-dimensional case, where it’s a stronger
requirement than weaker notions of differentiability like the
existence of the <a href="https://en.wikipedia.org/wiki/Gateaux_derivative">Gateaux derivative</a>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sun, 19 Feb 2023 00:00:00 +0000
https://heiner.ai/blog/2023/02/19/chain-rule-jacobians-autograd-shapes.html
Gompertz, annuities, and special functions

<p>This is part 2 of a 2-part series on Gompertz’s law. <a href="/blog/2021/03/23/pis-deaths-and-statistics.html">Part 1 is
here</a>.</p>
<p>The <a href="/blog/2021/03/23/pis-deaths-and-statistics.html">first part of this
series</a> discussed the
Gompertz distribution and gave a formula for “How
likely am I to live until at least age \(t_0 + t\) conditional on
having lived until age \(t_0\)?”. In this post, we will use this to
compute the present value of an annuity. This is a riff on chapter 3
of <a href="https://www.goodreads.com/book/show/13838804-7-most-important-equations-for-your-retirement">Moshe Milevsky’s book <em>The 7 Most Important Equations for Your
Retirement</em></a>
with a more mathy twist.</p>
<h3 id="present-value">Present value</h3>
<p>A standard idea in finance is that getting \(\$1000\) today is worth more
than getting \(\$1000\) in a year from now. This should be true even if
these are inflation-adjusted dollars: If we assume some bank will give
us the expected rate of inflation as their interest rate when we deposit
the \(\$1000\), we could just do that and create the “\(\$1000\) a year
from now” situation. Ignoring the dilemma of choice, the first situation
gives us more options and should therefore be worth more to us.</p>
<p>How much more should it be worth? Or, equivalently, what’s the value
of \(y\) at which we become indifferent between \(\$x\) now and
\(\$y\) in one year?</p>
<p>The answer will be different for different people (you may really need
that money <em>now</em>), but generally this is modeled with an
interest/discount rate as well: \(\$x\) today will be worth \(\$x\cdot(1+r)\)
a year from now (and therefore \(\$x\cdot (1+r)^2\) two years from now) for some
discount rate \(r\). Why the temporal structure is exponential as
opposed to some other increasing function is a topic all for itself
(look up <em>hyperbolic discounting</em>), but it’s a common choice in
various fields including finance, economics, psychology, neuroscience,
and reinforcement learning.</p>
<p>Conversely, \(\$y\) in one year should be worth \(\frac{\$y}{1+r}\)
today.</p>
<h3 id="getting-money-forever">Getting money <em>forever</em></h3>
<p>A nice property of this discounting rule is that the option of getting
a neverending stream of money (say, \(a = \$1000\) every month) is worth a
finite amount today by the magic of the geometric series:</p>
\[\sum_{n=0}^\infty \frac{a}{(1+r)^n} = \frac{a}{1-\frac{1}{1+r}}
= a \frac{1+r}{r} = a(1 + \tfrac{1}{r})\]
<p>for a monthly discount rate \(r > 0\) (needed for the series to
converge). Just to be extra clear: The \(n\)th term of this sum is the
present value of the payment we receive in the \(n\)th period.</p>
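<p>A quick numerical sanity check (my own snippet, not from the post): the
truncated series converges to \(a(1+\tfrac{1}{r})\):</p>

```python
# Present value of a perpetuity: truncated series vs. closed form a*(1 + 1/r).
a, r = 1000.0, 0.005  # $1000 per month at a 0.5% monthly discount rate
pv_series = sum(a / (1 + r) ** n for n in range(100_000))
pv_closed = a * (1 + 1 / r)  # about $201,000 for these numbers
assert abs(pv_series - pv_closed) < 1e-3
```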
<p>Such an arrangement is called an <em>annuity</em>. We <a href="/blog/2019/08/20/annuity-loans.html">talked about annuity
loans before on this blog</a> and
just as in that case, the present value of a neverending stream of
\(a\) per month is \(a\) times an <em>annuity factor</em>, in this case
\(\af(r)=1 + \frac{1}{r}\).</p>
<p>The same is true if the annuity runs out after a fixed number of
periods. However, as a component of retirement planning, annuities
that pay for the rest of one’s natural life (or the life of one’s
spouse) tend to be a better option, as they
help cover <em>longevity risk</em>. In fact, annuities are (in theory,
although maybe not psychologically in practice) a great component of a
retirement plan as they allow pensioners to take more risk with
the rest of their funds, e.g., stay invested in the stock market
where expected returns are higher (but so is the variance of returns).</p>
<h3 id="getting-money-until-you-die">Getting money until you die</h3>
<p>This raises a question: What’s the present value of an annuity that
pays for the rest of one’s life? It ought to be less than
\(1+\frac{1}{r}\) times the annual amount (unless one expects to live
forever), but how much less exactly?</p>
<p>Our answer, of course, is the survival function of the Gompertz
distribution, as per the <a href="/blog/2021/03/23/pis-deaths-and-statistics.html#plugging-it-all-together">last blog
post</a>. Of
course this depends on the current age of the pensioner: The survival
function of the Gompertz distribution,
conditioned on having lived until year \(t_0\), is</p>
\[p_{t_0}(t) := \exp\bigl(- e^{b(t_0-t_m)} (e^{bt} - 1)\bigr)
\where{t\ge 0}\]
<p>where \(b \approx 1/9.5\) is the inverse dispersion coefficient and
\(t_m \approx 87.25\) is the modal value of human life.</p>
<p>The value of a pension that starts now (with a pensioner at age
\(t_0\)) and pays \(\$1\) per year for the rest of the life is
therefore</p>
\[\af(r, t_0) =
\sum_{n=0}^\infty \frac{p_{t_0}(n)}{(1 + r)^n}.
\tag{#3} \label{discraf}\]
<p>The \(n\)th term in this series is the probability of being alive at
year \(t_0 + n\) given having been alive at year \(t_0\), times the
discount factor that turns it into a present value.</p>
<p>This is equation #3 in Milevsky’s book <em>The 7 Most Important Equations
for Your Retirement</em>.</p>
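<p>Equation \(\eqref{discraf}\) is easy to evaluate numerically. A small
Python sketch (my own, using the Gompertz constants quoted above):</p>

```python
import numpy as np

b, t_m = 1 / 9.5, 87.25  # inverse dispersion and modal age, as above

def p(t0, t):
    """Gompertz survival probability, conditioned on being alive at age t0."""
    return np.exp(-np.exp(b * (t0 - t_m)) * (np.exp(b * t) - 1))

def af(r, t0, horizon=120):
    """Truncated version of the series; p(t0, n) is ~0 long before n=120."""
    n = np.arange(horizon)
    return np.sum(p(t0, n) / (1 + r) ** n)

print(af(0.025, t0=65))  # annual $1 payments for a 65-year-old at r = 2.5%
```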
<p>Some insurance contracts allow the pensioner to decide between a lump
sum payment or an annuity; this formula can help to compare these two
options. Some contracts provide payments until death, but with a
minimum of (say) 10 years. Replacing \(p_{t_0}(0), \ldots,
p_{t_0}(9)\) with \(1\) in \(\eqref{discraf}\) would model this situation.</p>
<p>Milevsky suggests using a spreadsheet program to sum this series,
along the lines of</p>
<table>
<thead>
<tr>
<th style="text-align: right">age \(65+n\)</th>
<th>\(p_{65}(n)\)</th>
<th>\((1+r)^{-n}\)</th>
<th>product</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">70</td>
<td>93.6%</td>
<td>0.705</td>
<td>0.660</td>
</tr>
<tr>
<td style="text-align: right">75</td>
<td>83.6%</td>
<td>0.497</td>
<td>0.415</td>
</tr>
<tr>
<td style="text-align: right">80</td>
<td>69.1%</td>
<td>0.35</td>
<td>0.242</td>
</tr>
<tr>
<td style="text-align: right">85</td>
<td>50.0%</td>
<td>0.247</td>
<td>0.124</td>
</tr>
<tr>
<td style="text-align: right">90</td>
<td>29.0%</td>
<td>0.174</td>
<td>0.050</td>
</tr>
<tr>
<td style="text-align: right">95</td>
<td>11.5%</td>
<td>0.122</td>
<td>0.014</td>
</tr>
<tr>
<td style="text-align: right">100</td>
<td>2.4%</td>
<td>0.086</td>
<td>0.002</td>
</tr>
<tr>
<td style="text-align: right">105</td>
<td>0.2%</td>
<td>0.061</td>
<td>0.000</td>
</tr>
<tr>
<td style="text-align: right"> </td>
<td> </td>
<td><strong>∑</strong></td>
<td><strong>1.507</strong></td>
</tr>
</tbody>
</table>
<p>In this example a pensioner of age 65 buys an annuity that gives them
\(\$1\) every 5 years (starting at age 70) with a discount rate of
\(r=7.25\%\) per year. The present value of that annuity is \(\$1.51\). If the
payment every 5 years is \(\$50{,}000\) instead, the present value of the
annuity at age 65 is about \(1.507\cdot \$50{,}000 \approx \$75{,}370\).</p>
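<p>The table’s bottom line is easy to reproduce in Python (my own check,
using the Gompertz constants from above):</p>

```python
import numpy as np

b, t_m = 1 / 9.5, 87.25

def p(t0, t):
    # Conditional Gompertz survival probability.
    return np.exp(-np.exp(b * (t0 - t_m)) * (np.exp(b * t) - 1))

r = 0.0725                      # 7.25% annual discount rate
n = np.arange(70, 110, 5) - 65  # payments at ages 70, 75, ..., 105
print(np.sum(p(65, n) / (1 + r) ** n))  # ~1.507, the table's bottom line
```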
<p>The example uses 5 year intervals in order to not get unduly
long. But typically, annuities will pay monthly. Computing the sum in
a spreadsheet then becomes somewhat annoying:
Here’s a <a href="https://docs.google.com/spreadsheets/d/1IvtfyU9qddNthHRmai9soHpTad79aNgUU_GRnEQlukU/edit#gid=2051005501">monthly example in Google
sheets</a>. (Notice
that the difference to the <a href="https://docs.google.com/spreadsheets/d/1IvtfyU9qddNthHRmai9soHpTad79aNgUU_GRnEQlukU/edit#gid=0">5 year interval
version</a>
actually matters: In the latter case, one gets the money for five
years in advance; in the former case one gets it only for the current
month. A lot can happen in 5 years.)</p>
<p>I wasn’t quite satisfied with this.</p>
<h3 id="the-hunt-for-a-closed-form-solution">The hunt for a closed-form solution</h3>
<p>Equation \(\eqref{discraf}\) is nice and all, but it would be much
better to have a closed-form expression for it. Since \(p_{t_0}\) is a
doubly-exponential function, the solution isn’t immediately
obvious. In fact, I don’t know a closed-form solution for
\(\eqref{discraf}\)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
<p>But let’s imagine a world with continuous banking, where instead of
once a month we get a small portion of our annuity every moment. The
continuous annuity factor would then be</p>
\[\af_c(r, t_0)
=
\int_0^\infty p_{t_0}(x) e^{-rx} \,dx =
\int_0^\infty \exp\bigl(-\eta(e^{bx} - 1)\bigr) e^{-rx} \,dx
\tag{G.1}
\label{contaf}\]
<p>where \(\eta = e^{b(t_0 - t_m)}\) as in the last blog post.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>This, too, doesn’t look super easy. In fact, it’s not solvable with
“elementary” functions. But somewhere within the <a href="https://dlmf.nist.gov/">large zoo of special
functions</a> there is the right one for us.</p>
<p>In this case, <a href="https://en.wikipedia.org/wiki/Gompertz_distribution">Wikipedia already tells
us</a> that the
moment-generating function (aka Laplace transform) of the Gompertz
distribution is</p>
\[\E(e^{-tx})
=
\eta e^{\eta}\mathrm{E}_{t/b}(\eta) \where{t>0}\]
<p>with the <a href="https://dlmf.nist.gov/8.19#E3">generalized exponential integral</a></p>
\[\mathrm{E}_{t/b}(\eta)=\int_1^\infty e^{-\eta v} v^{-t/b}\,dv \where{t>0}.\]
<p>The Gompertz distribution has the nice property that “left-truncated”
versions of itself are still Gompertz distributed (see the last blog
post for details). This is a consequence of the fact that its
survival function \(S(x) = \exp(-\eta(e^{bx}-1))\) shows up in its
pdf \(f(x) = b\eta S(x)e^{bx}\).</p>
<p>Hence,</p>
\[\E(e^{-tx}) = b\eta\int_0^\infty S(x) e^{-x(t-b)}\,dx
=
b\eta\af_c(t-b, t_0) \where{t>0}.\]
<p>So the moment-generating function at \(t = r + b\) gives us a
closed-form for \(\eqref{contaf}\):</p>
\[\af_c(r, t_0)
=
\frac{1}{b\eta}\E(e^{-(r+b)x})
=
\frac{e^{\eta}}{b} \mathrm{E}_{1+r/b}(\eta)
=
\frac{1}{b}\exp(e^{b(t_0 - t_m)})\mathrm{E}_{1+r/b}(e^{b(t_0 - t_m)}).
\label{contaf-gef}
\tag{G.2}\]
<div class="centered">
<img src="/img/af.svg" alt="pdf" style="width:70%;" />
<br />
The annuity factor \(\af_c(r, t_0)\) for different interest rates
\(r\) and starting ages \(t_0\). Retiring early is expensive.
<a href="/img/af.gp">Gnuplot source</a>
</div>
<h3 id="trying-to-use-this">Trying to use this</h3>
<p>Formula \(\eqref{contaf-gef}\) is in fact a closed-form solution to
our problem. So let’s try to use it in a computation with
Python. <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expn.html">SciPy has the generalized exponential integral
function</a>,
so this should be easy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">special</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="mf">9.5</span>
<span class="n">t_m</span> <span class="o">=</span> <span class="mf">87.25</span>
<span class="k">def</span> <span class="nf">af</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t_0</span><span class="p">):</span>
<span class="n">eta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">b</span> <span class="o">*</span> <span class="p">(</span><span class="n">t_0</span> <span class="o">-</span> <span class="n">t_m</span><span class="p">))</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">eta</span><span class="p">)</span> <span class="o">/</span> <span class="n">b</span> <span class="o">*</span> <span class="n">special</span><span class="p">.</span><span class="n">expn</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">r</span> <span class="o">/</span> <span class="n">b</span><span class="p">,</span> <span class="n">eta</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">af</span><span class="p">(</span><span class="mf">0.025</span><span class="p">,</span> <span class="n">t_0</span><span class="o">=</span><span class="mi">65</span><span class="p">))</span>
</code></pre></div></div>
<p>and the output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gompertz_discount.py:10: RuntimeWarning: floating point number truncated to an integer
return np.exp(eta) / b * special.expn(1 + r / b, eta)
19.439804660538815
</code></pre></div></div>
<p>Ah, dang! Even though the generalized exponential integral is part of
SciPy, the implementation there only supports integer orders.</p>
<p>We’ll have to visit the special functions zoo for a bit longer.</p>
<h4 id="more-zoo-animals-the-incomplete-gamma-function">More zoo animals: The incomplete gamma function</h4>
<p>The classic reference for special functions is <a href="https://en.wikipedia.org/wiki/Abramowitz_and_Stegun">Abramowitz and
Stegun</a>. Wikipedia
has a quote about it from the (American) National Institute of
Standards and Technology:</p>
<blockquote>
<p>More than 1,000 pages long, the Handbook of Mathematical Functions
was first published in 1964 and reprinted many times […] [W]hen
New Scientist magazine
recently asked some of the world’s leading scientists what single
book they would want if stranded on a desert island, one
distinguished British physicist said he would take the
Handbook. […] During the mid-1990s, the book was cited every 1.5
hours of each working day.</p>
</blockquote>
<p>In the internet age, the <a href="https://dlmf.nist.gov/">Digital Library of Mathematical Functions
(DLMF)</a> hosts an updated version of this
classic work and it is very useful for situations like the one we are
in now.</p>
<p>In fact, looking at <a href="https://dlmf.nist.gov/8.19#E1">equation 8.19.1</a>
in DLMF, we see that the generalized exponential integral is nothing
but the (upper) incomplete gamma function:</p>
\[\mathrm{E}_p(z) = z^{p-1}\Gamma(1-p, z) \where{p, z\in\C}\]
<p>and thus</p>
\[\af_c(r, t_0)
=
\frac{e^{\eta}}{b} \mathrm{E}_{1+r/b}(\eta)
=
\frac{\eta^{r/b}e^{\eta}}{b} \Gamma(-r/b, \eta).\]
<p>Alas, this won’t do either. <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammaincc.html">SciPy’s implementation of the incomplete
gamma
function</a>
doesn’t allow for a negative first argument.</p>
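<p>One way out, if an extra dependency is acceptable (my own aside, not from
the original post): <a href="https://mpmath.org/">mpmath</a>’s
<code class="language-plaintext highlighter-rouge">gammainc</code> implements the
analytically continued incomplete gamma function and accepts negative first
arguments directly:</p>

```python
import mpmath as mp

b, t_m = mp.mpf(1) / mp.mpf("9.5"), mp.mpf("87.25")
r, t_0 = mp.mpf("0.025"), 65
eta = mp.exp(b * (t_0 - t_m))

# mpmath's incomplete gamma is analytically continued in its first
# argument, so the formula works as written, without recurrence tricks.
af_c = eta ** (r / b) * mp.exp(eta) / b * mp.gammainc(-r / b, eta)
print(af_c)  # ~14.799
```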
<p>Why might this be the case? Looking at the definition of the
incomplete gamma functions in DLMF (<a href="https://dlmf.nist.gov/8.2#E2">8.2.2
there</a>), we read</p>
\[\Gamma(a, z) = \int_z^\infty t^{a-1}e^{-t}\,dt \where{a, z\in\C, \
\Re a > 0}.\]
<p>Note how this is an “incomplete” version of the standard gamma
function \(\Gamma(a) = \Gamma(a, 0)\). Also note how this integral
won’t work for negative \(a\) when \(z=0\): the integrand then has a
non-integrable singularity at \(t=0\).</p>
<p>So did our previous formulae not make any sense? They did. We just
have to understand these functions a little bit better.</p>
<h3 id="analytic-continuation">Analytic continuation</h3>
<p><em>Complex analysis</em>, the theory of differentiable complex functions, is
one of the most beautiful areas of mathematics. In fact, I believe I
once saw a video of Donald Knuth saying it’s such a great topic that
he asked his daughter to attend a complex analysis course – although
she didn’t otherwise study at any university.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>One of the results of complex analysis is that if two nice
(aka complex-differentiable, aka holomorphic, aka analytic) functions
coincide on a set that’s not totally discrete, they <a href="https://en.wikipedia.org/wiki/Identity_theorem">are the
same</a>! This means any
analytic function like \(a\mapsto\Gamma(a, z)\) has at most one
analytic continuation. This is an important principle which also
applies to the famous Riemann zeta function \(\zeta\). The nontrivial
zeros of its analytic continuation, which lie in the critical strip
\(0 < \Re s < 1\),
are what the <a href="https://en.wikipedia.org/wiki/Riemann_hypothesis">most important open problem in mathematics is
about</a> (you’ll get
<a href="https://www.claymath.org/millennium-problems/riemann-hypothesis">$1M if you solve this
one</a>,
which should help with retirement planning). In fact, \(\zeta\) and
\(\Gamma\) are closely linked.</p>
<p>In the case of \(\zeta(s) := \sum_{n=1}^\infty n^{-s}\) its unique
analytic continuation to the left half-plane yields for
instance \(\zeta(-1) = -\frac{1}{12}\) which gives some sense to
Ramanujan’s mysterious
equation \(1 + 2 + 3 + \cdots = -\frac{1}{12}.\)</p>
<p>How is such an analytic continuation found? In the case of both
the zeta and the gamma functions, it’s via functional equations. For
instance, \(\Gamma(n) = (n-1)!\) for integer \(n\), and more generally
\(\Gamma(z+1) = z\Gamma(z)\) for complex \(z\). If we start with some
\(z\) in the left
half-plane (i.e., \(\Re z \le 0\)), repeatedly applying this formula
eventually yields only terms of \(\Gamma\) at “known” arguments, which
establishes the value of the analytic continuation of \(\Gamma\) to
that point \(z\).</p>
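<p>A quick illustration (mine, not the post’s): SciPy’s complete gamma
function already implements this continuation, and the functional equation
reproduces its values:</p>

```python
import numpy as np
from scipy import special

# Gamma(z) = Gamma(z+1) / z, applied once, reaches the "known"
# right half-plane, where the integral definition converges.
z = -0.5
via_recurrence = special.gamma(z + 1) / z
assert np.isclose(via_recurrence, special.gamma(z))
assert np.isclose(via_recurrence, -2 * np.sqrt(np.pi))  # Gamma(-1/2) = -2*sqrt(pi)
```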
<p>For the incomplete Gamma function, a similar recurrence relation
exists, namely (<a href="https://dlmf.nist.gov/8.8#E2">8.8.2</a> in DLMF)</p>
\[\Gamma(a+1, z) = a\Gamma(a,z) + z^ae^{-z}\]
<p>for, say, \(a\) and \(z\) in the right half-plane. This equation can be
used to extend the incomplete gamma function to negative values of
\(a\). It also gives rise to a corresponding recurrence relation for the
generalized exponential integral
(<a href="https://dlmf.nist.gov/8.19#E12">8.19.12</a> in DLMF)</p>
\[p\mathrm{E}_{p+1}(z) + z\mathrm{E}_{p}(z) = e^{-z}.\]
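<p>At integer orders, where SciPy’s
<code class="language-plaintext highlighter-rouge">expn</code> does work, the
recurrence is easy to verify numerically (my own check):</p>

```python
import numpy as np
from scipy import special

# DLMF 8.19.12: p*E_{p+1}(z) + z*E_p(z) = e^{-z}, checked at an integer order.
p, z = 3, 0.7
lhs = p * special.expn(p + 1, z) + z * special.expn(p, z)
assert np.isclose(lhs, np.exp(-z))
```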
<h3 id="applying-the-functional-equation">Applying the functional equation</h3>
<p>Since the implementations of \(\mathrm{E}_{p}(z)\) and \(\Gamma(a,z)\)
that we want to use don’t implement the continued versions of
these functions, we can use the functional equations that define the
continuations ourselves. The recurrence relation for
\(\mathrm{E}_{p+1}\) yields</p>
\[\begin{align}
\af_c(r, t_0)
& =
\frac{e^{\eta}}{b} \mathrm{E}_{1+r/b}(\eta)
=
\frac{e^{\eta}}{r}\bigl(e^{-\eta} - \eta\mathrm{E}_{r/b}(\eta)\bigr)
= \frac{1}{r}\bigl(1 - \eta e^\eta\mathrm{E}_{r/b}(\eta)\bigr)
\\
& = \frac{1}{r}\bigl(1 - e^\eta \eta^{r/b} \Gamma(1 - r/b, \eta)\bigr).
\end{align}\]
<p>For moderate values of \(r\) (and \(b\)), this is enough: SciPy’s
<code class="language-plaintext highlighter-rouge">gammaincc</code> needs a
positive first argument, i.e. \(r\) below \(b \approx 10.5\%\). For larger
\(r\) we’d have to apply the recurrence relation again.</p>
<p>This allows us to finally run this in Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">special</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="mf">9.5</span>
<span class="n">t_m</span> <span class="o">=</span> <span class="mf">87.25</span>
<span class="k">def</span> <span class="nf">af</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t_0</span><span class="p">):</span>
<span class="n">eta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">b</span> <span class="o">*</span> <span class="p">(</span><span class="n">t_0</span> <span class="o">-</span> <span class="n">t_m</span><span class="p">))</span>
<span class="k">return</span> <span class="p">(</span>
<span class="mi">1</span>
<span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">eta</span><span class="p">)</span>
<span class="o">*</span> <span class="n">eta</span> <span class="o">**</span> <span class="p">(</span><span class="n">r</span> <span class="o">/</span> <span class="n">b</span><span class="p">)</span>
<span class="o">*</span> <span class="n">special</span><span class="p">.</span><span class="n">gamma</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">r</span> <span class="o">/</span> <span class="n">b</span><span class="p">)</span>
<span class="o">*</span> <span class="n">special</span><span class="p">.</span><span class="n">gammaincc</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">r</span> <span class="o">/</span> <span class="n">b</span><span class="p">,</span> <span class="n">eta</span><span class="p">)</span> <span class="c1"># A normalized version of \Gamma(a, z).
</span> <span class="p">)</span> <span class="o">/</span> <span class="n">r</span>
<span class="k">print</span><span class="p">(</span><span class="n">af</span><span class="p">(</span><span class="mf">0.025</span><span class="p">,</span> <span class="n">t_0</span><span class="o">=</span><span class="mi">65</span><span class="p">))</span>
</code></pre></div></div>
<p>Which prints <code class="language-plaintext highlighter-rouge">14.79901377449508</code>.</p>
<h4 id="what-about-spreadsheets">What about spreadsheets?</h4>
<p>The above works for SciPy, but Excel (or Google Sheets) is a different
matter as it doesn’t directly have support for the incomplete gamma
function.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> However, it does support the <a href="https://en.wikipedia.org/wiki/Gamma_distribution">Gamma
distribution</a> which
has a normalized version of the lower incomplete gamma function as
its cdf. As the helpful authors of <a href="https://en.wikipedia.org/wiki/Incomplete_gamma_function#Software_Implementation">Wikipedia
note</a>,
\(\Gamma(s, x)\) can be computed as
<code class="language-plaintext highlighter-rouge">GAMMA(s)*(1-GAMMA.DIST(x,s,1,TRUE))</code>.</p>
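<p>As a sanity check (not part of the spreadsheet itself), we can verify
this identity in SciPy, where <code>GAMMA.DIST(x, s, 1, TRUE)</code>
corresponds to the regularized lower incomplete gamma function
<code>special.gammainc</code>; the values of \(s\) and \(x\) below are
arbitrary examples:</p>

```python
import numpy as np
from scipy import special

s, x = 0.7625, 0.0961  # arbitrary example values

# Upper incomplete gamma directly, via the regularized Q(s, x):
upper_direct = special.gamma(s) * special.gammaincc(s, x)

# The spreadsheet route GAMMA(s)*(1-GAMMA.DIST(x,s,1,TRUE)), where
# GAMMA.DIST(..., TRUE) is the regularized lower P(s, x):
upper_spreadsheet = special.gamma(s) * (1 - special.gammainc(s, x))

assert np.isclose(upper_direct, upper_spreadsheet)
```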
<p>This finally gives us a closed form we can use in spreadsheets. <a href="https://docs.google.com/spreadsheets/d/1IvtfyU9qddNthHRmai9soHpTad79aNgUU_GRnEQlukU/edit#gid=1825725391">Here’s
an example in Google
Sheets</a>.
Note that the relative difference of \(\af(r,t_0)\) and
\(\af_c(r,t_0)\) for \(r = 2.5\%\) and \(t_0 = 65\) is only
\(0.64\%\).</p>
<h3 id="what-does-it-all-mean">What does it all mean?</h3>
<p>If you (or your parents) bought an annuity in your name and the issuer
offers the option for a lump sum payment instead, you can multiply the
yearly (or monthly) payments by the annuity factor as determined
above, treat that as the annuity’s value, and compare it to the lump
sum. (The computations here mostly assume the annuity will start right
away, extending this to an annuity that starts in a number of years is
left as an exercise for the reader.)</p>
<p>That raises two questions: which interest rate should one pick? And
why should it be treated as constant over time?</p>
<p>Also, the annuity provider will have done some math as well and will be
aware of <em>adverse selection</em>: People who self-assess as being in good
health might be right about this and choose the annuity more
often. Milevsky quotes from Jane Austen’s <em>Sense and Sensibility</em>,</p>
<blockquote>
<p>People always live forever when there is an annuity to be paid them.</p>
</blockquote>
<p>Milevsky also ends his discussion of his equation #3 with a warning:
This is a model. If market prices are different, they are likely to be
right and the model wrong.</p>
<p>That said, <em>if</em> you got an annuity a while back, for instance pre-2008
when the <a href="https://fred.stlouisfed.org/series/FEDFUNDS">effective Fed funds rate was
5%+</a>, you likely got a
good deal by today’s standards. The future might be different. But at
least some people argue that interest rates will stay permanently
low, as they have in Japan for the last few decades (demographic factors
might play a role here). Others disagree, and of course some people
always predict stronger inflation.</p>
<p>That’s it for today, thanks for reading. Reach out if you have
comments.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Although, come to think of it, perhaps
<a href="https://en.wikipedia.org/wiki/Abel%E2%80%93Plana_formula">Abel-Plana</a>
might help? Eyeballing our function \(f(z) =
p_{t_0}(z)e^{-z\ln(1+r)}\) it might apply, but the \(\int_0^\infty
\frac{f(iy) - f(-iy)}{e^{2\pi y} - 1}\,dy\) integral looks <em>hard</em>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>There’s some extra confusion regarding the interest rate here:
For continuous compounding we need something else than the
effective rate, as \(e^r - 1 \ne r\) for \(r > 0\). See <a href="https://math.stackexchange.com/a/4099424/5051">this
answer</a> for some
explanation. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I might misremember this; at any rate I can’t find the reference
right now. It might have been part of his <a href="https://www.youtube.com/watch?v=mert0kmZvVM">1987 course series on
mathematical writing</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Apparently, <a href="https://support.microsoft.com/en-us/office/gamma-function-ce1702b1-cf55-471d-8307-f83be0fc5297">Microsoft Excel supports <code class="language-plaintext highlighter-rouge">Gamma</code> at negative
values</a>. However,
<a href="https://support.google.com/docs/answer/9365856?hl=en">Google
Sheets</a>
doesn’t. For Excel, we could have also used a series like <a href="https://dlmf.nist.gov/8.7#E3">8.7.3 in
DLMF</a> to compute the incomplete
gamma function. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Wed, 14 Apr 2021 00:00:00 +0000
https://heiner.ai/blog/2021/04/14/gompertz-annuities-and-special-functions.html
https://heiner.ai/blog/2021/04/14/gompertz-annuities-and-special-functions.html
πs, deaths, and statistics
<p>This is part 1 of a 2-part series on Gompertz’s law.</p>
<p>After enjoying a <a href="https://rationalreminder.ca/podcast/122">podcast interviewing Moshe
Milevsky</a>, I got interested
in Milevsky’s work and read his book <em>The 7 Most Important Equations for Your
Retirement</em>.</p>
<p>It’s a bit of a weird book. I can’t say I didn’t enjoy it. But it
gave me the impression Moshe Milevsky didn’t expect his readers to enjoy the
book, or at least he didn’t seem to expect them to like the formulas. That
seems like a weird proposition for a book with “equations” in its
very title. Perhaps the readers were imagined to be part of a
captive audience of students, asked to read the book as an
assignment? At any rate the author apologizes every time he actually
discusses formulae, and having to resort to <em>logarithms</em> seems to make
him positively embarrassed.</p>
<p>Then again, Prof. Milevsky wrote many books and this is apparently one
of his bestsellers. And
the anecdotes of Fibonacci, Gompertz, Halley, Fisher, Huebner, and
Kolmogorov<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> in the book are lovely, as is Milevsky’s account of his
experience of economist Paul Samuelson. Perhaps the professor just
knows his audience.</p>
<p>But so do I, being well acquainted with the empty set. So there’s no
issue with too much math on this blog.</p>
<p>Let’s therefore discuss what Milevsky doesn’t do in his book and try
to explain why his equation #2 might be true: Let’s talk about the
Gompertz distribution.</p>
<h3 id="gompertzs-discovery">Gompertz’s discovery</h3>
<div class="floated">
<img src="/img/gompertz.png" alt="Benjamin Gompertz" style="width:200px;" />
<br />
<div class="centered small">
Image source: <a href="https://en.wikipedia.org/wiki/Benjamin_Gompertz#/media/File:Gompertz.png">Wikipedia</a>
</div>
</div>
<p>Benjamin Gompertz (1779–1865) was a British mathematician and actuary
of German Jewish descent. According to Milevsky, Gompertz was looking
for a law of human mortality, comparable to Newton’s laws of
mechanics. At his time, statistics of people’s lifespan were already
available and formed the basis for Gompertz’s discovery. For instance,
one could compile a <em>mortality table</em> of a (hypothetical) group of
45-year-olds as they age. The data might look like this:</p>
<table>
<thead>
<tr>
<th>Age</th>
<th>Alive at Birthday</th>
<th>Die in Year</th>
</tr>
</thead>
<tbody>
<tr>
<td>45</td>
<td>98,585</td>
<td>146</td>
</tr>
<tr>
<td>46</td>
<td>98,439</td>
<td>161</td>
</tr>
<tr>
<td>47</td>
<td>98,278</td>
<td>177</td>
</tr>
<tr>
<td>48</td>
<td>98,101</td>
<td>195</td>
</tr>
<tr>
<td>49</td>
<td>97,906</td>
<td>214</td>
</tr>
<tr>
<td>50</td>
<td>97,692</td>
<td>236</td>
</tr>
</tbody>
</table>
<p>What does this tell us? It doesn’t immediately tell <em>me</em> anything
beyond the obvious: the older one gets, the more likely one is to die soon.</p>
<p>But let’s
compute the <em>mortality rate</em>: How likely were people to die at a given
age in this cohort? To be precise, we compute “<em>Which proportion of
people alive at age \(t\) died before age \(t+1\)?”</em></p>
<table>
<thead>
<tr>
<th>Age</th>
<th>Alive at Birthday</th>
<th>Die in Year</th>
<th>Mortality Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>45</td>
<td>98,585</td>
<td>146</td>
<td>0.148%</td>
</tr>
<tr>
<td>46</td>
<td>98,439</td>
<td>161</td>
<td>0.164%</td>
</tr>
<tr>
<td>47</td>
<td>98,278</td>
<td>177</td>
<td>0.180%</td>
</tr>
<tr>
<td>48</td>
<td>98,101</td>
<td>195</td>
<td>0.199%</td>
</tr>
<tr>
<td>49</td>
<td>97,906</td>
<td>214</td>
<td>0.219%</td>
</tr>
<tr>
<td>50</td>
<td>97,692</td>
<td>236</td>
<td>0.242%</td>
</tr>
</tbody>
</table>
<p>Again, so far so obvious: the probability of dying in the next year,
conditional on having lived until the current year,
increases with age.</p>
<p>But <em>how much</em> does it increase? This is
Gompertz’s discovery, now called Gompertz’s law: The mortality rate
appears to increase exponentially. One way of seeing this is to take
logs and compute their difference:</p>
<table>
<thead>
<tr>
<th>Age</th>
<th>Alive at Birthday</th>
<th>Die in Year</th>
<th>Mortality Rate</th>
<th>Log of Mortality Rate</th>
<th>Difference in Log Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>45</td>
<td>98,585</td>
<td>146</td>
<td>0.148%</td>
<td>−6.515</td>
<td>—</td>
</tr>
<tr>
<td>46</td>
<td>98,439</td>
<td>161</td>
<td>0.164%</td>
<td>−6.416</td>
<td>0.0993</td>
</tr>
<tr>
<td>47</td>
<td>98,278</td>
<td>177</td>
<td>0.180%</td>
<td>−6.319</td>
<td>0.0964</td>
</tr>
<tr>
<td>48</td>
<td>98,101</td>
<td>195</td>
<td>0.199%</td>
<td>−6.221</td>
<td>0.0987</td>
</tr>
<tr>
<td>49</td>
<td>97,906</td>
<td>214</td>
<td>0.219%</td>
<td>−6.126</td>
<td>0.0950</td>
</tr>
<tr>
<td>50</td>
<td>97,692</td>
<td>236</td>
<td>0.242%</td>
<td>−6.026</td>
<td>0.1000</td>
</tr>
</tbody>
</table>
<p>The difference in log values increases by about 0.1, regardless of the
current age. Since \(\exp(0.1) \approx 1.1\), this works out to
roughly a 10% increase in the mortality rate per year.</p>
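<p>This computation is easy to reproduce; here is the table above in
NumPy (the numbers are just the ones from the table, so this is merely
a sanity check):</p>

```python
import numpy as np

# Mortality table from above: alive at birthday and deaths in year,
# for ages 45 through 50.
alive = np.array([98585, 98439, 98278, 98101, 97906, 97692])
deaths = np.array([146, 161, 177, 195, 214, 236])

rate = deaths / alive             # mortality rate
log_diff = np.diff(np.log(rate))  # year-on-year difference of log rates

print(rate)      # roughly 0.148% up to 0.242%
print(log_diff)  # each entry is close to 0.1
```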
<p><b>According to Gompertz’s discovery, the likelihood of dying in the
next year (conditional on having lived until this year) increases
by a fixed percentage every year, i.e., exponentially.</b></p>
<p>How do we model this mathematically? And if Gompertz’s law has a
rate growing by a rate, how come everything stays a probability, i.e.,
\(\le 1\)?</p>
<p>Keep reading to learn about hazard functions and find out!</p>
<h3 id="pdfs-cdfs-and-hazard-functions">PDFs, CDFs, and Hazard Functions</h3>
<div class="floated">
<img src="/img/gompertz-pdf.svg" alt="pdf" style="width:250px;" />
<br />
<div class="centered small">
Some probability density functions<br />
Image source: <a href="https://en.wikipedia.org/wiki/File:GompertzPDF.svg">Wikipedia</a>
</div>
<img src="/img/gompertz-cdf.svg" alt="cdf" style="width:250px;" />
<br />
<div class="centered small">
Some cumulative distribution functions<br />
Image source: <a href="https://en.wikipedia.org/wiki/File:GompertzCDF.svg">Wikipedia</a>
</div>
</div>
<p>If you have taken a probability or statistics course, you probably
(ha!) know about <em>probability density functions</em> (pdfs). A pdf is a
positive function that we use as a density; to make it a
<em>probability</em> density, it needs to integrate to one. If \(f\)
is a pdf and \(X\) is a random variable with that distribution then</p>
\[\P(a < X\le b) = \int_a^b f(x) \dx,\]
<p>i.e., the probability of \(X\) landing between \(a\) and \(b\) is the
integral from \(a\) to \(b\) of \(f\).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Let’s say our random
variable takes only positive values, then for \(a=0\) and \(b=\infty\)
this value will be \(1\), i.e., 100%. For smaller values of \(b\) this
will be the so-called cumulative distribution function (cdf):</p>
\[F(b) := \P(X\le b) = \int_0^b f(x)\dx.\]
<p>While pdfs are positive and integrate to \(1\), cdfs are positive, \(0\)
at \(0\), monotone (non-decreasing), and \(1\) at the limit to
\(\infty\). They are also typically assumed to be
right-continuous. To compute the previous expression one can then
take</p>
\[\P(a < X \le b) = \int_a^b f(x) \dx = F(b) - F(a).\]
<p>Moreover the fundamental theorem of calculus tells us that (barring a
few exceptional points),</p>
\[F'(x) = f(x).\]
<p>What this means is that the cdf determines the pdf and therefore the
probability distribution itself: Any function \(F\) with the properties
above can serve as a cdf and gives rise to a pdf \(f\), and vice
versa.</p>
<p>If you made it this far, you likely knew this in principle. So on to
hazard functions now (which I learned about from
<a href="https://en.wikipedia.org/wiki/Survival_analysis#Hazard_function_and_cumulative_hazard_function">Wikipedia</a>
only recently).</p>
<h4 id="hazard-functions">Hazard functions</h4>
<div class="floated">
<img src="/img/gompertz-hazard.svg" alt="hazard" style="width:250px;" />
<br />
<div class="centered small">
Some hazard functions<br />
<a href="/img/gompertz.gp">Gnuplot source</a>
</div>
</div>
<p>If we imagine \(f\) to describe deaths (or failure rates) over time,
\(F(t)\) will be the probability to have died by time
\(t\). Conversely, \(S(t) := 1 - F(t)\) (the <em>survival function</em>) is
the chance to still be alive at time \(t\). If \(f\) is the
corresponding pdf and \(X\) is a random variable with this
distribution,</p>
\[S(t) = \P(X > t) = \int_t^\infty f(x)\dx = 1 - F(t).\]
<p>Suppose we made it until some time \(t_0 \ge 0\). What will the future
look like? We will need to condition on having lived so far and
renormalize with \(S(t_0)\) to get the pdf going forward. This is the
same as finding the probability of not surviving an additional
infinitesimal time:</p>
\[h(t_0) := \frac{f(t_0)}{S(t_0)}
= \lim_{t\downarrow 0} \frac{\P(t_0\le X < t_0 + t)}{t S(t_0)}
= -\frac{S'(t_0)}{S(t_0)} = - \frac{\d}{\d t}\ln(S(t))\Biggr\rvert_{t=t_0}.\]
<p>This is the <em>hazard function</em>, also known as <em>force of mortality</em> or
<em>force of failure</em>. It takes positive values and the last equation
indicates that it will not be integrable, i.e., \(\int_0^\infty
h(x)\dx = \infty\).</p>
<p>There is also the <em>cumulative hazard function</em>, which is</p>
\[H(t) = \int_0^t h(x)\dx = -\ln S(t).\]
<p>This last equation, which follows from the above, tells us something
interesting: We ought to be able to retrieve \(S\), and therefore the cdf
\(F\), and therefore the pdf \(f\), from \(h\) itself, because</p>
\[S(t) = e^{-H(t)}.\]
<p>This means, just as any one of a pdf or a cdf determines the other, so
does a hazard function: Given either a pdf, a cdf, or a hazard
function (or a cumulative hazard function), the probability density and
therefore all of these functions and the entire distribution are
uniquely determined. The criteria
for \(h\) are: (1) being nonnegative and (2) not being integrable.</p>
<h3 id="back-to-gompertz">Back to Gompertz</h3>
<p>So how do we model Gompertz’s discovery? We choose the simplest
exponentially growing function we know and take it as our hazard
function:</p>
\[h(t) := b \eta e^{bt}\]
<p>where \(\eta, b > 0\) are parameters to be determined from the
data (see below). This also tells us why our “rate increase of a rate”
doesn’t lead to probabilities greater than one over time: It’s a rate
increase for the conditional probability at that fixed point in time
\(t_0\). The hazard function itself does grow to infinity.</p>
<p>Given this, what is the cdf for this <em>Gompertz distribution</em>? Well,</p>
\[H(t) = \int_0^t h(x)\dx = b \eta\int_0^t e^{bx}\dx
= \eta(e^{bt} - 1)\]
<p>and therefore</p>
\[F(t) = 1 - S(t) = 1 - e^{-H(t)} = 1 - \exp\bigl(-\eta(e^{bt} -
1)\bigr)\]
<p>and</p>
\[f(t) = h(t)S(t) = b\eta\exp\bigl(bt - \eta(e^{bt} - 1)\bigr).\]
<p>So the simple exponential choice for the hazard function yields this not
quite so simple double exponential as a pdf for the Gompertz
distribution.</p>
<div class="centered">
<img src="/img/gompertz-pdf.svg" alt="pdf" style="width:30%;" />
<img src="/img/gompertz-cdf.svg" alt="cdf" style="width:30%;" />
<img src="/img/gompertz-hazard.svg" alt="hazard" style="width:30%;" />
<br />
The pdf, cdf and hazard function of the Gompertz distribution for some
choices of \(\eta\) and \(b\). <br />
(See above for source.)
</div>
<p>We could proceed to compute the mean, variance or
moment-generating function (aka Laplace transform) of this
distribution, but the math gets somewhat hairy and special functions
(mainly the <a href="https://dlmf.nist.gov/8.19">generalized exponential
integral</a>, a special function related to
the incomplete gamma function) are involved. We
will return to this for other reasons in a future blogpost. The
<a href="https://en.wikipedia.org/wiki/Gompertz_distribution">infobox on
Wikipedia</a> has
the data if necessary.</p>
<p>For now, let’s note that \(h(t) = b \eta e^{bt}\) describes a hazard
function where the chance of dying in the next moment increases
throughout life, but there’s only this time-dependent component. If
there are time-independent causes of death or failure (war, some kinds of
diseases, voltage surges), one could consider modelling this
differently. William Makeham, another British 19th century
mathematician, proposed such an addition via the hazard function</p>
\[h(t) = b \eta e^{bt} + \lambda.\]
<p>The result is known as the <a href="https://en.wikipedia.org/wiki/Gompertz%E2%80%93Makeham_law_of_mortality">Gompertz-Makeham
distribution</a>.</p>
<p>As simple as this change may seem, it does complicate the resulting
integrals quite a bit. So much so that finding closed-form expressions
of the <a href="https://dl.acm.org/doi/abs/10.1016/j.matcom.2009.02.002">quantile
function</a> or
the <a href="https://www.researchgate.net/publication/261641522_On_Order_Statistics_from_the_Gompertz-Makeham_Distribution_and_the_Lambert_W_Function">moments of this
distribution</a>
is still an active field of research!</p>
<h4 id="gompertz-from-here-on-out">Gompertz from here on out</h4>
<p>If a random variable \(X \sim \textrm{Gompertz}(\eta, b)\) describes
the time of death of a newborn (or, less morbidly, the
time of failure for a newly built device), what does the world
look like at time \(t_0\)? That’s an important question for,
e.g., retirement planning.</p>
<p>So let’s compute the pdf after surviving until \(t_0\):</p>
\[\frac{f(t_0 + t)}{S(t_0)} = \frac{b\eta\exp\bigl(b(t_0+t) -
\eta(e^{b(t_0+t)} - 1)\bigr)}{\exp\bigl(-\eta(e^{bt_0} - 1)\bigr)}
= b\eta e^{bt_0}\exp\bigl(-\eta e^{bt_0}(e^{bt}-1) + bt\bigr).\]
<p>This is the pdf of \(\textrm{Gompertz}(\eta e^{bt_0}, b)\). So the
future stays Gompertz-distributed with new parameters. This is a
useful property of the Gompertz distribution: truncated
renormalized versions of it stay within the family, in contrast to,
e.g., the Gaussian distribution.</p>
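<p>A quick numerical check of this closure property (again with
arbitrary parameter values):</p>

```python
import numpy as np

b, eta, t0 = 0.1, 1e-4, 40.0  # arbitrary example parameters

def pdf(t, eta, b):  # Gompertz pdf
    return b * eta * np.exp(b * t - eta * (np.exp(b * t) - 1))

def sf(t, eta, b):  # survival function S(t)
    return np.exp(-eta * (np.exp(b * t) - 1))

t = np.linspace(0, 60, 601)
conditional = pdf(t0 + t, eta, b) / sf(t0, eta, b)
updated = pdf(t, eta * np.exp(b * t0), b)

# The conditional pdf equals the Gompertz pdf with eta -> eta * e^(b*t0):
assert np.allclose(conditional, updated)
```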
<p>This helps us answer questions like “How long will I spend in
retirement?”, which is why Milevsky discusses Gompertz in his book
about retirement planning.</p>
<p>Before we can do that, let’s discuss how one could fit the parameters
\(\eta\) and \(b\).</p>
<h4 id="how-to-fit-the-data-modal-value-of-human-life">How to fit the data: Modal value of human life</h4>
<p>The modal value(s) of a distribution are the point(s) at which the
pdf attains its maximum. In general, the mode is distinct from both the
mean and the median. In the case of Gompertz it’s also an easy
computation:</p>
\[\frac{\d}{\d t}f(t) = 0
\ \Rightarrow\
b\eta\exp\bigl(bt - \eta(e^{bt} - 1)\bigr)(b - \eta b e^{bt}) = 0
\ \Rightarrow\
\eta e^{bt} = 1\]
<p>and therefore \(\eta = e^{-bt_m}\) for the modal value \(t_m\).</p>
<p>So once we have \(b\) we can get \(\eta\) from the data by looking for
the year (or month) with the most deaths.</p>
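<p>We can verify the formula for the mode on a grid, using Milevsky’s
parameter values from below (\(b = 0.1\), \(t_m = 87.25\)):</p>

```python
import numpy as np

b, t_m = 0.1, 87.25
eta = np.exp(-b * t_m)  # eta = exp(-b * t_m) from the mode condition

t = np.linspace(0, 120, 120001)
f = b * eta * np.exp(b * t - eta * (np.exp(b * t) - 1))

t_max = t[np.argmax(f)]
print(t_max)  # ≈ 87.25, the modal value t_m
```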
<p>The \(b\) parameter we also get from the data as it determines the
year-on-year increase in the mortality rate: From the data above we
would estimate \(b=0.1\). Milevsky suggests using \(1/b = 9.5\) years
and calls this the <em>dispersion coefficient of human life</em>.</p>
<p>As the modal value of human life Milevsky suggests using \(t_m=87.25\)
for the general population in North America. Obviously \(t_m\) was lower
in Gompertz’s time; for a population of healthy females in a modern
society Milevsky suggests using up to \(t_m=90\).</p>
<h4 id="plugging-it-all-together">Plugging it all together</h4>
<p>For a member of the general population in a developed country we can
take \(b = 1/9.5\), \(t_m=87.25\) and therefore \(\eta = e^{- bt_m} =
e^{-87.25 / 9.5} \approx 1.026\cdot 10^{-4}\). If that person is
\(t_0\ge 0\) years old, the Gompertz pdf for their future looks like</p>
\[b\eta e^{bt_0}\exp\bigl(-\eta e^{bt_0}(e^{bt}-1) + bt\bigr)
=
b e^{b(t_0 - t_m)} \exp\bigl(-e^{b(t_0 - t_m)}(e^{bt}-1) + bt\bigr)\]
<p>while the cdf is</p>
\[1 - \exp\bigl(- e^{b(t_0-t_m)} (e^{bt} - 1)\bigr).\]
<p>Therefore, the survival function conditioned on having lived until
year \(t_0\) is</p>
\[p_{t_0}(t) := \exp\bigl(- e^{b(t_0-t_m)} (e^{bt} - 1)\bigr).\]
<p>This is also the chance of still being alive at year \(t_0 + t\) given
being alive at year \(t_0\) and constitutes an answer to the question
of “How long will I spend in retirement?”, or more precisely “How
likely am I to be alive \(t\) years into my retirement?”. It’s also
Milevsky’s equation #2.</p>
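<p>To make this concrete, here is equation #2 in Python with Milevsky’s
suggested parameters. For example, under this model a 65-year-old has
roughly even odds of reaching 85:</p>

```python
import numpy as np

b = 1 / 9.5   # dispersion coefficient of human life
t_m = 87.25   # modal value of human life

def p(t0, t):
    """Chance of being alive at age t0 + t, given being alive at age t0."""
    return np.exp(-np.exp(b * (t0 - t_m)) * (np.exp(b * t) - 1))

print(p(65, 20))  # ≈ 0.50: a 65-year-old reaches 85 with ~50% probability
```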
<p>In a future blogpost, we will use this to compute values of
annuities. For now, let’s wrap up with a quick discussion of how
realistic this model is.</p>
<h3 id="is-this-model-any-good">Is this model any good?</h3>
<p>The first thing to say is that sadly, human babies are more likely to
die than the Gompertz distribution would indicate. That’s partially a
sad fact of life and partially a consequence of the fact that severe
medical problems in unborn babies or during birth are less likely to
cause outright death at the point of delivery for the baby in
question, but may still cause death after a few days, months, or even
years.</p>
<p>On a potentially more positive note, the Gompertz distribution might
<em>overestimate</em> the chance of dying next year for very old people. As
<a href="https://en.wikipedia.org/wiki/Gompertz%E2%80%93Makeham_law_of_mortality">Wikipedia has
it</a>:</p>
<blockquote>
<p>At more advanced ages, some studies have found that death rates
increase more slowly – a phenomenon known as the late-life mortality
deceleration – but more recent studies disagree.</p>
</blockquote>
<p>Generally, there seems to be consensus that the Gompertz distribution
models mortality really well between the ages of 30 and 80.</p>
<p>A question very dear to “transhumanists” is to what
extent the gains in longevity seen in the last centuries can be
expected to continue to happen in the future. Obviously a simple
two-parameter model won’t tell us a lot about that, and neither will
the three-parameter Gompertz-Makeham distribution. Still, I found it
interesting if a bit discouraging to read, again in
<a href="https://en.wikipedia.org/wiki/Gompertz%E2%80%93Makeham_law_of_mortality">Wikipedia</a>:</p>
<blockquote>
<p>The decline in the human mortality rate before the 1950s was mostly
due to a decrease in the age-independent (Makeham) mortality
component, while the age-dependent (Gompertz) mortality component
was surprisingly stable. Since the 1950s, a new mortality
trend has started in the form of an unexpected decline in mortality
rates at advanced ages and “rectangularization” of the survival
curve.</p>
</blockquote>
<p>So apparently \(\eta\) and \(b\) have been stable over long time
periods. Whatever that says about human longevity, it also says the
Gompertz (or the Gompertz-Makeham) model is not so bad.</p>
<p>That’s it for today. My apologies for not actually having any \(\pi\)s in all of
this after all :)</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>For Kolmogorov, I’d love to have another reference for his
supposed <em>War and Peace</em>-inspired nickname “ANK”. The book in question
is the only reference I could find. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This works for probability measures that are absolutely
continuous with respect to the Lebesgue measure, meaning such a
Lebesgue density exists. All measures on the reals can be decomposed
into such a measure, a discrete measure and a so-called singular
continuous measure. Let’s just stay with Lebesgue densities here. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Tue, 23 Mar 2021 00:00:00 +0000
https://heiner.ai/blog/2021/03/23/pis-deaths-and-statistics.html
https://heiner.ai/blog/2021/03/23/pis-deaths-and-statistics.html
Well-definedness in measure theory
<p>This blog covers very niche topics. Today: how to easily show that the
Lebesgue measure and the Lebesgue integral are well-defined.</p>
<p>I recently started reading the <a href="https://banditalgs.com">“Bandit Algorithms” book
by Tor Lattimore and Csaba
Szepesvari</a>. <a href="https://twitter.com/HeinrichKuttler/status/1343551842580639750">I like it a
lot.</a>
Chapter 2 there got me back to
one of my favorite topics in math: Measure Theory. While measure
theory is often held as being boring or dry, I disagree. It’s just
often presented in a suboptimal way. In particular, many authors get
so carried away by a progression of simple functions – positive
functions – real-valued functions that they forget that linearity is a
great thing (one symptom of this: Trying to find a “common partition”
for two simple functions).</p>
<p>Here we present a simple lemma and its corollary that takes care of
most well-definedness questions in measure theory, in particular: Why
the extension of a pre-measure from a semiring to a ring is
well-defined (and therefore unique), and why the Lebesgue integral of
simple functions is well-defined.</p>
<p><strong>Lemma 1.</strong> Let \(\H\) be a family of sets with the following partition
property: For each finite family \((A_1, \ldots, A_n)\) in \(\H\)
there is a finite disjoint family \((B_k\st k \in
K_0)\) in \(\H\) such that</p>
\[\forall j \in\set{1,\ldots,n}\, \exists K_j \subseteq
K_0: A_j=\mathop{\dot{\bigcup}}_{k \in K_j} B_k.\]
<p>(In particular, either \(A_j \supseteq B_k\) or \(A_j\cap
B_k=\emptyset\) is true at all times, with
\(A_j\supseteq B_k \Leftrightarrow k \in K_j.\))</p>
<p>Let \(\mu\from \H\to[0, \infty)\) be an additive set function (i.e.,
if \(A, B\in\H\) are disjoint with \(A\mathbin{\dot{\cup}} B\in\H\),
then \(\mu(A\mathbin{\dot{\cup}} B) = \mu(A) + \mu(B)\)). Let \(A_1, \ldots, A_n \in
\H\) and \(a_1, \ldots, a_n \in \F \in\set{\R, \mathbb{C}}\). Define</p>
\[\qquad b_k:=\sum_{\substack{1\le j\le n\\ A_j\supseteq B_k}} a_j \where{k\in K_0}.\]
<p>Then \(\sum_{j=1}^n a_j \1_{A_j}=\sum_{k\in K_0} b_k\1_{B_k}\) and</p>
\[\sum_{j=1}^n a_j \mu(A_j)=\sum_{k \in K_0} b_k
\mu(B_k)\]
<p><em>Proof.</em> The first statement follows directly from the definition of
\(b_k\). The second:</p>
\[\begin{aligned} \sum_{j=1}^n a_j \mu(A_j)
&=\sum_{j=1}^n a_j \sum_{k \in K_j}
\mu(B_k) \\ &=\sum_{k \in K_0}
\underbrace{\sum_{j=1}^n \1_{K_j}(k) a_j}_{=b_k}
\mu(B_k)=\sum_{k \in K_0}
b_k \mu(B_k). \qquad\text{//}\end{aligned}\]
<p><strong>Corollary 2.</strong> Let \(\H, \mu\) as in Lemma 1. Let
\(X:=\lin\set{\1_A\st A\in\H}\). Then \(J\from X\to\F\), defined via</p>
\[X\ni f=\sum_{j=1}^n a_j \1_{A_j} \mapsto \sum_{j=1}^n a_j \mu(A_j),\]
<p>is well-defined and linear.</p>
<p><em>Proof.</em> It suffices to treat the case \(f=0\). Let
\((B_k\st k \in K_0)\) be the disjoint representation and \((b_k\st k
\in K_0)\) as in Lemma 1. Then \(b_k=0\) for all \(k \in K_0,\) as
\(f=\sum_{j=1}^n a_j \1_{A_j}=\sum_{k \in
K_0} b_k \1_{B_k}\) and the \(B_k\) are pairwise disjoint. By Lemma 1,</p>
\[\sum_{j=1}^n a_j \mu(A_j)
% =\sum_{j=1}^n a_j \mu(\mathop{\dot{\bigcup}}_{k \in k_j}B_k)
= \sum_{k\in K_0} b_k \mu(B_k)=0.\]
<p>This shows \(J\) is well-defined; it’s also clearly linear. //</p>
<p>With this simple tool, we can look at our first application: The
Lebesgue-Stieltjes pre-measure.</p>
<p><strong>Lemma 3.</strong> Let \(\H\) be the set of half-open intervals and let
\(\mathcal{R}\) be the ring of finite unions of
half-open intervals. Then each \(F \in \mathcal{R}\) is the disjoint
union of finitely many intervals.</p>
<p><em>Proof</em> (Adapted from <a href="http://www.math.tu-dresden.de/~voigt/mui/mui.pdf#page=7">J. Voigt, “Einführung in die Integration”, Satz 1.1.1</a>).
Let \(F=\bigcup_{j=1}^n
I_j\) with \(I_j=[a_j, b_j)\).
The set \(\set{a_j, b_j\st j=1,\ldots,n}\) can be written as
\(\set{x_k\st k=0,\ldots,m}\) with \(x_0 < x_1 < \cdots < x_m\). A set
of disjoint intervals with union \(F\) is</p>
\[\{[x_{k-1}, x_k)\st
1\le k\le m \text{ and } \exists j\in\set{1,\ldots,n} :
[x_{k-1}, x_k) \subseteq[a_j,b_j)\}. \qquad\text{//}\]
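<p>The construction in this proof translates directly into code. Here is
a sketch (with made-up example intervals) computing the disjoint
decomposition:</p>

```python
def disjoint_union(intervals):
    """Decompose a finite union of half-open intervals [a, b) into
    disjoint half-open intervals, as in the proof of Lemma 3."""
    xs = sorted({x for a, b in intervals for x in (a, b)})
    # Keep the subintervals between consecutive endpoints that lie
    # inside some [a_j, b_j); each one is either contained or disjoint.
    return [
        (lo, hi)
        for lo, hi in zip(xs, xs[1:])
        if any(a <= lo and hi <= b for a, b in intervals)
    ]

# [0, 3) ∪ [2, 5) ∪ [7, 8) decomposes into [0,2), [2,3), [3,5), [7,8):
print(disjoint_union([(0, 3), (2, 5), (7, 8)]))
```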
<p><strong>Corollary 4.</strong> Let \(G\from\R\to\R\) be
monotone. The mapping \(\mu\from\H\to[0,\infty)\),
defined via \(\mu([a,b)) := G(b)-G(a)\), can be uniquely extended to
an additive set function \(\mu\from\mathcal{R}\to[0, \infty)\).</p>
<p><em>Proof.</em> \(\mu\) is additive and \(\H\) has the partition property
from Lemma 1 by Lemma 3. Hence \(J\from X\to\F\) from Corollary 2 is
well-defined. Let \(F\in\mathcal{R}\). Then
\(\1_F\in X\) and thus \(\mu(F):=J(\1_F)\) is well-defined and \(\geq 0\).
This extension \(\mu\) is additive, since for disjoint \(F, \tilde{F}\in\mathcal{R}\),</p>
\[\mu(F \mathbin{\dot{\cup}} \tilde{F}) =
J(\1_F+\1_{\tilde{F}}) = J(\1_F) + J(\1_{\tilde{F}}) =
\mu(F)+\mu(\tilde{F}).\]
<p>For the uniqueness: Let \(\tilde{\mu}\from\mathcal{R}\to[0,\infty)\)
be another additive extension and let \(F =
\mathop{\dot{\bigcup}}_{k=1}^m B_k \in\mathcal{R}\) with \(B_1, \ldots, B_m\in\H\).
Then</p>
\[\tilde{\mu}(F)=\tilde{\mu}(\mathop{\dot{\bigcup}}
B_k)=\sum \tilde{\mu}(B_k)=\sum \mu(B_k)=\mu(F). \qquad//\]
<p>If \(G\) is left-continuous, \(\mu\) can be shown to be
\(\sigma\)-additive (and hence a pre-measure). One can then use
Carathéodory’s extension theorem to extend \(\mu\) to a measure on the
\(\sigma\)-algebra generated by \(\mathcal{R}\), which is the Borel
\(\sigma\)-algebra \(\mathcal{B}(\R)\).</p>
<p>Later in the theory, Lemma 1 proves the well-definedness of the
Lebesgue integral for simple functions:</p>
<p><strong>Lemma 5.</strong> Let \((\Omega, \mathcal{A}, \mu)\) be a measure space and
let \(f=\sum_{j=1}^n a_j \1_{A_j}\) be a simple function.
Then \(\int_\Omega f {~d}\mu:=\sum_{j=1}^n a_j \mu(A_j)\) is well-defined.</p>
<p><em>Proof.</em> For \(K\subseteq K_0:=\{1, \ldots, n\}\), define</p>
\[B_K:=\bigcap_{k\in K}
A_k \cap \bigcap_{k \in K_0 \setminus
K}(\Omega \setminus A_k)\]
<p>Then \((B_K\st \emptyset \neq K
\subseteq K_0)\) is a partition as in Corollary 2, and hence
\(\int f {~d}\mu=J(f)\) is well-defined. //</p>
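<p>As an illustration (not from the text), take Lebesgue measure on
finite unions of half-open intervals: two different representations of
the same simple function give the same integral, as Lemma 5 promises:</p>

```python
def integral(terms):
    """Integral of a simple function sum_j a_j * 1_[l_j, r_j)
    against Lebesgue measure, i.e. sum_j a_j * (r_j - l_j)."""
    return sum(a * (r - l) for a, (l, r) in terms)

# Two representations of the same simple function 1_[0,2) + 1_[1,3):
rep1 = [(1, (0, 2)), (1, (1, 3))]
rep2 = [(1, (0, 1)), (2, (1, 2)), (1, (2, 3))]  # disjoint version

assert integral(rep1) == integral(rep2) == 4
```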
<p>This defines the Lebesgue integral for simple functions. Since simple
functions are dense in \(L_1(\mu)\), one is tempted to just use the
<a href="https://en.wikipedia.org/wiki/Continuous_linear_extension">BLT
theorem</a>
now. That doesn’t quite work however, since the norm of \(L_1(\mu)\)
is defined via that integral in the first place. Instead, one can now
define the integral for positive functions via pointwise convergence
almost everywhere from below, then extend to real-valued (and
complex-valued) functions.</p>
Tue, 29 Dec 2020 00:00:00 +0000
https://heiner.ai/blog/2020/12/29/well-definedness-in-measure-theory.html
More on linear regression: Capital asset pricing models<p>This is the long-awaited second part of February’s post on <a href="/blog/2020/02/19/linear-regression.html">Linear
Regression</a>. This time,
without the needlessly abstract math, but with some classic portfolio
theory as an example of “applied linear regression”.</p>
<p>We’ll be discussing two papers: <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.1968.tb00815.x" target="_blank"><em>The performance of mutual funds in
the period 1945-1964</em> (1968) by Michael
Jensen</a>
and <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.139.5892" target="_blank"><em>Common Risk
Factors in the Returns On Stocks And Bonds</em> (1993) by Eugene Fama
and Kenneth French</a>.</p>
<p>Some disclaimers:</p>
<ul>
<li>I am by no means an expert on this topic. I learned about these
investing concepts on the <a href="https://rationalreminder.ca" target="_blank">Rational Reminder
podcast</a> by the
excellent Benjamin Felix and Cameron Passmore.</li>
<li>There are many more results on this, both classic and
recent. The two papers discussed here added some delta at the time,
but previous results by others were essential too, and get no
mention here. I chose these two papers because they are relatively easy to
follow and their main message can be explained well in a blog post.</li>
<li>All of this is just for fun. Capital at risk.</li>
</ul>
<h2 id="jensen-1968-standard-stuff">Jensen (1968): Standard stuff.</h2>
<h4 id="on-models">On models</h4>
<p>The papers we discuss present <em>models</em>. Models have assumptions as
well as domains in which they are valid and domains in which they are
not. We are not going to make precise statements about either of
these, but it’s useful to not confuse models with reality. ‘Whereof one
has no model thereof one must be silent’ (otherwise, <em>philosophus
mansisses</em>). This is also known as <a href="https://en.wikipedia.org/wiki/Streetlight_effect" target="_blank">searching under the
streetlights</a>.</p>
<p>The <em>aim</em> of our model will be to evaluate, or assess, the performance
of any given portfolio. Specifically, the model should evaluate
a portfolio manager’s ability to increase returns by successfully
predicting future prices for a <em>given level of riskiness</em>. For
instance, if stock market returns are positive in expectation, a
leveraged portfolio could outperform an unleveraged portfolio without
any special forecasting ability on the manager’s part. Conversely, a
classical <a href="https://www.investopedia.com/articles/financial-advisors/011916/why-6040-portfolio-no-longer-good-enough.asp" target="_blank">60/40 mix of stocks and
bonds</a>
would likely do worse than a 100% equity portfolio in this case, but
may still be preferable due to its reduced risk. We won’t go into
detail how risk is measured, but we should mention that under certain
assumptions, it is precisely its riskiness that causes a given asset
to yield higher returns (<em>ceteris paribus</em>, and on average): If
investors are risk-averse, they will need to be compensated for
taking on the additional risk of a specific asset compared to some
other asset, and higher expected future returns (expected future
prices relative to its current price) are the way that
compensation happens in a liquid market. Of course, the actual
expectation of any given investor may also just be wrong, but over
time one may expect the worst investors to drop out, improving the
accuracy of the average investor’s expectations.</p>
<h4 id="time-series-regressions">Time series regressions</h4>
<p>The specific models here do linear regression on time series. Suppose
we have one data point per time interval, e.g. per trading day:</p>
<table>
<thead>
<tr>
<th style="text-align: left">day</th>
<th> </th>
<th style="text-align: right">value</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">5 January 1970</td>
<td> </td>
<td style="text-align: right">388.8K</td>
</tr>
<tr>
<td style="text-align: left">6 January 1970</td>
<td> </td>
<td style="text-align: right">475.2K</td>
</tr>
<tr>
<td style="text-align: left"> </td>
<td>⋮</td>
<td style="text-align: right"> </td>
</tr>
</tbody>
</table>
<p>We will use this data as an \(n\)-dimensional feature vector (aka explanatory
variable). Think of it as coming from the value of an index,
specifically from the capitalization-weighted value of the
whole stock market on that trading day. We will call this the <em>market
portfolio</em>.</p>
<p>The portfolio (or individual stock) we want to assess will also have a
value each trading day. This puts us in a situation in which we could
try to use linear regression to express our assessed portfolio as a
linear combination of the market value and a constant intercept
vector. However, little would be gained by just doing that – we’ll
have to at least normalize things a bit; otherwise, undoing
differences in scale is what the model will use its capacity (two
numbers!) for. And while we are at
it, we remember that there used to be a time when the (almost, or by
definition) risk-free return offered by central banks was not
basically zero. To tease out the “risk factor”, we will use the market
return minus this risk-free rate as our feature vector. Our data thus
could look like this:</p>
<table>
<thead>
<tr>
<th style="text-align: left">day</th>
<th> </th>
<th style="text-align: right">market</th>
<th style="text-align: right">risk-free</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">5 January 1970</td>
<td> </td>
<td style="text-align: right">1.21%</td>
<td style="text-align: right">0.029%</td>
</tr>
<tr>
<td style="text-align: left">6 January 1970</td>
<td> </td>
<td style="text-align: right">0.62%</td>
<td style="text-align: right">0.029%</td>
</tr>
<tr>
<td style="text-align: left"> </td>
<td>⋮</td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
</tr>
</tbody>
</table>
<p>For the risk-free rate, one could use the treasury bill rate. All these
numbers can in principle be retrieved from the historical
records (and in practice downloaded from <a href="https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html" target="_blank">Kenneth French’s
homepage</a>).</p>
<p>Let’s call the market return \(R_M\) and the risk-free return
\(R_F\). A sensible regression without an intercept term could then
read as</p>
\[R \approx R_F + \beta(R_M - R_F), \label{eq:1}\tag{1}\]
<p>where \(R\) is the vector of observed returns of the portfolio or stock
we want to assess, and \(\beta\) is computed via the least-squares
condition of linear regression, i.e., \(\beta = \argmin\{\abs{R - R_F -
\beta(R_M - R_F)}_2\st \beta\in\R\}\), as in the <a href="/blog/2020/02/19/linear-regression.html" target="_blank">last blog post</a>. Notice that the \(R_F\)s
make sense here: It’s deviations from this risk-free return that we
want to model, and we want \(\beta\) to scale the non risk-free
portion of the market portfolio.</p>
<p><strong>Example.</strong> For a classic 60/40
portfolio with 60% whole market and 40% risk-free return (historically
not realistic for private investors, but easy to compute) we have</p>
\[R - R_F= 0.6R_M + 0.4R_F - R_F = 0.6(R_M - R_F)\in\R^{n\times 1}\]
<p>and therefore, by the <a href="/blog/2020/02/19/linear-regression.html" target="_blank">normal equation</a>,</p>
\[\beta = \bigl((R_M - R_F)^\top (R_M - R_F)\bigr)^{-1}
(R_M - R_F)^\top (R - R_F) = 0.6.\]
<p>That seems sensible in this case.</p>
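<p>A quick numerical check of this \(\beta = 0.6\) claim, with
synthetic returns (the numbers below are made up for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
rm = rng.normal(0.5, 4.0, n)  # synthetic monthly market returns (%)
rf = np.full(n, 0.03)         # flat risk-free rate (%)
r = 0.6 * rm + 0.4 * rf       # the 60/40 portfolio

# Normal equation in the one-dimensional, no-intercept case.
x = rm - rf
beta = (x @ (r - rf)) / (x @ x)
print(beta)  # ≈ 0.6
```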
<h4 id="finding-alpha">Finding Alpha</h4>
<p>However, \eqref{eq:1} isn’t quite good enough: More precisely, it
reads</p>
\[R = R_F + \beta(R_M - R_F) + e, \label{eq:2}\tag{2}\]
<p>with an error term \(e\in\R^n\). As Jensen (1968) argues,</p>
<blockquote>
<p>[W]e must be very careful when applying the equation to managed portfolios. If the manager is a superior forecaster (perhaps because of
special knowledge not available to others) he will tend to systematically select
securities which realize [\(e_j > 0\)].</p>
</blockquote>
<p>This touches on a subject glossed over in the last blog post: Most
statements about linear regression models depend on certain
statistical assumptions, among them that
the error terms are elementwise iid, ideally with a mean of zero. There’s
autocorrelation tests like
<a href="https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic" target="_blank">Durbin-Watson</a>
to test if this is true for a particular dataset. In this
particular modeling exercise, we can do better by
adding the constant \(\1=(1,\ldots,1)\in\R^n\) intercept vector to the
subspace we project on, which turns \eqref{eq:2} into</p>
\[R - R_F = \alpha + \beta(R_M - R_F) + u, \label{eq:3}\tag{3}\]
<p>with an error term \(u\in\R^n\).</p>
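<p>The Durbin–Watson statistic mentioned above is simple enough to
compute directly; a minimal sketch (our own helper, not from the
post):</p>

```python
import numpy as np

def durbin_watson(u):
    """DW = sum of squared first differences of the residuals over
    their sum of squares; values near 2 suggest no first-order
    autocorrelation."""
    u = np.asarray(u, dtype=float)
    return np.sum(np.diff(u) ** 2) / np.sum(u ** 2)

rng = np.random.default_rng(0)
print(durbin_watson(rng.normal(size=10_000)))  # ≈ 2 for iid residuals
```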
<p><em>Ever wondered where the
“<a href="https://en.wikipedia.org/wiki/Alpha_(finance)" target="_blank">alpha</a>” in the
clickbait website <a href="https://seekingalpha.com/" target="_blank">Seeking Alpha</a> comes
from?</em> It is this \(\alpha\), the coefficient of the
\(\1\) intercept vector in \eqref{eq:3}. To quote
Jensen (1968) again:</p>
<blockquote>
<p>Thus if the portfolio manager has an ability to forecast security prices, the
intercept, [\(\alpha\), in eq. \eqref{eq:3}] will be positive. Indeed,
it represents the average incremental rate of return on the portfolio
per unit time which is due solely to
the manager’s ability to forecast future security prices. It is
interesting to
note that a naive random selection buy and hold policy can be expected to
yield a zero intercept. In addition if the manager is not doing as
well as a
random selection buy and hold policy, [\(\alpha\)] will be
negative. At first glance it
might seem difficult to do worse than a random selection policy, but such
results may very well be due to the generation of too many expenses in unsuccessful forecasting attempts.</p>
<p>However, given that we observe a positive intercept in any sample of returns on a portfolio we have the difficulty of judging whether or not this
observation was due to mere random chance or to the superior forecasting
ability of the portfolio manager. […]</p>
<p>It should be emphasized that in estimating [\(\alpha\)], the measure
of performance,
we are explicitly allowing for the effects of risk on return as implied by the
asset pricing model. Moreover, it should also be noted that if the model is
valid, the particular nature of general economic conditions or the particular
market conditions (the behavior of \(\pi\)) over the sample or evaluation period
has no effect whatsoever on the measure of performance. Thus our measure
of performance can be legitimately compared across funds of different risk
levels and across different time periods irrespective of general economic and
market condition.</p>
</blockquote>
<p>About the error term \(u\), first notice that thanks to the intercept
term we can expect it to have a mean of zero. Further, Jensen (1968)
argues it “should be serially [i.e., elementwise] independent” as
otherwise “the manager could increase his return even more by
taking account of the information contained in the serial dependence
and would therefore eliminate it.”</p>
<h3 id="just-show-me-the-code">Just show me the code!</h3>
<p>After introducing this model, Jensen (1968) continues with “the data
and empirical results”. In ca. 2015 AI parlance, this part could be
called “Experiments”. Take a look at Table 1 in the paper to get a
list of quaint mutual fund names. Notice too that it’s not immediately
obvious how to get the market portfolio’s returns from historical
trading data, as companies enter and leave the stock market, and how
they leave will play a huge role. (Bankruptcy? Taken private at
$420?)</p>
<p>For our purposes though, all of this has been taken care of and
<a href="https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html" target="_blank">Kenneth French’s
homepage</a>
has the data, including data for each trading day, in usable
formats.</p>
<p>Let’s start by getting the data and sanity-checking our 60/40 example
above. All of the code in this post can also be <a href="/src/ff3f.py">downloaded
separately</a> or
run in a <a href="https://colab.research.google.com/drive/1iqpYDgpElizyAWW-6xXHMTpatw0Hqm1c?usp=sharing" target="_blank">Google Colab</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">urllib.request</span>
<span class="kn">import</span> <span class="nn">zipfile</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">FF3F</span> <span class="o">=</span> <span class="p">(</span> <span class="c1"># Monthly data.
</span> <span class="s">"https://mba.tuck.dartmouth.edu/pages/faculty/"</span>
<span class="s">"ken.french/ftp/F-F_Research_Data_Factors_TXT.zip"</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">download</span><span class="p">(</span><span class="n">url</span><span class="o">=</span><span class="n">FF3F</span><span class="p">):</span>
<span class="n">archive</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">archive</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Retrieving"</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>
<span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlretrieve</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">archive</span><span class="p">)</span>
<span class="k">return</span> <span class="n">archive</span>
<span class="k">def</span> <span class="nf">extract</span><span class="p">(</span><span class="n">archive</span><span class="p">,</span> <span class="n">match</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="sa">rb</span><span class="s">"\d"</span> <span class="o">*</span> <span class="mi">6</span><span class="p">)):</span>
<span class="k">with</span> <span class="n">zipfile</span><span class="p">.</span><span class="n">ZipFile</span><span class="p">(</span><span class="n">archive</span><span class="p">)</span> <span class="k">as</span> <span class="n">z</span><span class="p">:</span>
<span class="n">name</span><span class="p">,</span> <span class="o">*</span><span class="n">rest</span> <span class="o">=</span> <span class="n">z</span><span class="p">.</span><span class="n">namelist</span><span class="p">()</span>
<span class="k">assert</span> <span class="ow">not</span> <span class="n">rest</span>
<span class="k">with</span> <span class="n">z</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="c1"># Filter for the actual data lines in the file.
</span> <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">((</span><span class="n">line</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span> <span class="k">if</span> <span class="n">match</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">line</span><span class="p">)),</span> <span class="n">unpack</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">date</span><span class="p">,</span> <span class="n">mktmrf</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">rf</span> <span class="o">=</span> <span class="n">extract</span><span class="p">(</span><span class="n">download</span><span class="p">())</span>
<span class="n">mkt</span> <span class="o">=</span> <span class="n">mktmrf</span> <span class="o">+</span> <span class="n">rf</span>
<span class="c1"># Linear regression using the normal equation.
</span><span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">mktmrf</span><span class="p">),</span> <span class="n">mktmrf</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span><span class="p">)</span> <span class="o">@</span> <span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="p">(</span><span class="mf">0.6</span> <span class="o">*</span> <span class="n">mkt</span> <span class="o">+</span> <span class="mf">0.4</span> <span class="o">*</span> <span class="n">rf</span> <span class="o">-</span> <span class="n">rf</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"alpha=%f; beta=%f"</span> <span class="o">%</span> <span class="p">(</span><span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="p">))</span>
</code></pre></div></div>
<p>As expected, this prints</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alpha=0.000000; beta=0.600000
</code></pre></div></div>
<p>If we are seeking \(\alpha\), we’ll have to look elsewhere.</p>
<p>Let’s try some real data. The biggest issue is getting hold of
the historical returns of real portfolios. Real researchers use data
sources like the <a href="http://www.crsp.org/" target="_blank">Center for Research in Security Prices
(CRSP)</a>, but their data isn’t available for
free. Instead, let’s use the data for the iShares S&P Small-Cap 600 Value ETF
(IJS) from <a href="https://finance.yahoo.com/quote/IJS/history?period1=964742400&period2=1599350400&interval=1mo&filter=history&frequency=1mo" target="_blank">Yahoo
finance</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Continuing from above.
</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="n">IJS</span> <span class="o">=</span> <span class="p">(</span>
<span class="s">"https://gist.githubusercontent.com/heiner/b222d0985cbebfdfc77288404e6b2735/"</span>
<span class="s">"raw/08c1cacecbcfcd9e30ce28ee6d3fe3d96c07115c/IJS.csv"</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">extract_csv</span><span class="p">(</span><span class="n">archive</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">archive</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span>
<span class="n">f</span><span class="p">,</span>
<span class="n">delimiter</span><span class="o">=</span><span class="s">","</span><span class="p">,</span>
<span class="n">skiprows</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># Header.
</span> <span class="n">converters</span><span class="o">=</span><span class="p">{</span> <span class="c1"># Hacky date handling.
</span> <span class="mi">0</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="n">time</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span>
<span class="s">"%Y%m"</span><span class="p">,</span> <span class="n">time</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"ascii"</span><span class="p">),</span> <span class="s">"%Y-%m-%d"</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">},</span>
<span class="p">)</span>
<span class="n">ijs_data</span> <span class="o">=</span> <span class="n">extract_csv</span><span class="p">(</span><span class="n">download</span><span class="p">(</span><span class="n">IJS</span><span class="p">))</span>
<span class="n">ijs</span> <span class="o">=</span> <span class="n">ijs_data</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]</span> <span class="c1"># Adj Close (includes dividends).
</span>
<span class="c1"># Turn into monthly percentage returns.
</span><span class="n">ijs</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="p">(</span><span class="n">ijs</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="o">/</span> <span class="n">ijs</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ijs_date</span> <span class="o">=</span> <span class="n">ijs_data</span><span class="p">[</span><span class="mi">1</span><span class="p">:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">ijs_date</span><span class="p">,</span> <span class="n">indices</span><span class="p">,</span> <span class="n">ijs_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">intersect1d</span><span class="p">(</span><span class="n">date</span><span class="p">,</span> <span class="n">ijs_date</span><span class="p">,</span> <span class="n">return_indices</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Regression model for CAPM.
</span><span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">ijs_date</span><span class="p">),</span> <span class="n">mktmrf</span><span class="p">[</span><span class="n">indices</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">ijs</span><span class="p">[</span><span class="n">ijs_indices</span><span class="p">]</span> <span class="o">-</span> <span class="n">rf</span><span class="p">[</span><span class="n">indices</span><span class="p">]</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span><span class="p">)</span> <span class="o">@</span> <span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">y</span>
<span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">B</span>
<span class="c1"># R^2 and adjusted R^2.
</span><span class="n">model_err</span> <span class="o">=</span> <span class="n">A</span> <span class="o">@</span> <span class="n">B</span> <span class="o">-</span> <span class="n">y</span>
<span class="n">ss_err</span> <span class="o">=</span> <span class="n">model_err</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">model_err</span>
<span class="n">r2</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ss_err</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">ddof</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">adjr2</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ss_err</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">A</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">ddof</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span>
<span class="s">"CAPM: alpha=%.2f%%; beta=%.2f. R^2=%.1f%%; R_adj^2=%.1f%%. Annualized alpha: %.2f%%"</span>
<span class="o">%</span> <span class="p">(</span>
<span class="n">alpha</span><span class="p">,</span>
<span class="n">beta</span><span class="p">,</span>
<span class="mi">100</span> <span class="o">*</span> <span class="n">r2</span><span class="p">,</span>
<span class="mi">100</span> <span class="o">*</span> <span class="n">adjr2</span><span class="p">,</span>
<span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span> <span class="o">**</span> <span class="mi">12</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span>
<span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>This prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CAPM: alpha=0.13%; beta=1.14. R^2=77.8%; R_adj^2=77.7%. Annualized alpha: 1.58%
</code></pre></div></div>
<p>Since \(\alpha\) is the weight for the constant intercept vector \(\1
= (1,\ldots,1)\), we can think of it as having percentage points as
its unit. Note that fees are not included in this calculation.
However, as for many ETFs, the fees for IJS are low, currently at 0.25%
per year. (Managed mutual funds will typically have an
annual fee of at least 1%, historically often more than that.)</p>
<p>It seems a bit strange that this ETF tracking the S&P Small-Cap 600
Value index has significant \(\alpha\): Presumably, the index just
includes firms based on simple rules, not genius insights by some
above-average fund manager. Looking at the \(R^2\) value, we “explain”
only 77% of the variance of the returns of IJS (the usual caveats
to the wording “explain” apply).</p>
<p>Clearly more research was needed. Or just a larger subspace for the
linear regression to project onto?</p>
<h2 id="fama--french-1993-more-factors">Fama & French (1993): More factors.</h2>
<p>By the 1980s, research
into financial economics had noticed that certain segments of the market
outperformed other segments, and thus the market as a whole, on
average. There are several possible explanations for this effect with
different implications for the future. For example: Are these segments
of the market just inherently riskier such that rational traders
demand higher expected returns via sufficiently low prices for these
stocks? Or were traders
just irrationally disinterested in some ‘unsexy’ firms and have
perhaps caught on by now (or not, hence TSLA)? The latter is the
behavioural explanation, while the former tends to be put
forth by proponents of the <a href="https://en.wikipedia.org/wiki/Efficient-market_hypothesis" target="_blank">Efficient-market hypothesis
(EMH)</a>,
which includes
<a href="https://en.wikipedia.org/wiki/The_Superinvestors_of_Graham-and-Doddsville" target="_blank">Jensen</a>
as well as Fama and French.
We won’t be getting into this now. Let’s instead take
a look at which ‘easily’ identifiable segments of the market have
historically outperformed.</p>
<p>Citing previous literature, Fama & French (1993) mention <em>size</em>
(market capitalization, i.e., price per stock times number of shares),
<em>earnings per price</em> and <em>book-to-market</em> (book value divided by
market value) as variables that appear to have “explanatory power”,
which I take to mean that some model that includes these variables has
nonzero regression coefficients and a relatively large \(R^2\) or other
appropriate statistics.</p>
<p>The specific way in which Fama & French (1993) introduce these variables into
the model is through the construction of portfolios that mimic these
variables. This approach contributed to Fama being awarded the
Nobel (Memorial) Prize in Economic Sciences in 2013. The specific
construction goes as follows:</p>
<p>Take all stocks in the overall market and order them by their size
(i.e., market capitalization). Then take the top and bottom halves and
call them “<em>big</em>” (<em>B</em>) and “<em>small</em>” (<em>S</em>), respectively.</p>
<p>Next, again take all stocks in the overall market and order them by
book-to-market equity. Then take the bottom 30% (“<em>low</em>”, <em>L</em>), the middle 40%
(“<em>medium</em>”, <em>M</em>), and the top 30% (“<em>high</em>”, <em>H</em>). In both cases, some care
needs to be taken: E.g., how to handle firms dropping in and out of the
market, how to define book equity properly in the presence of deferred
taxes, and other effects.</p>
<p>Then, construct the six portfolios containing stocks in the
intersections of the two size and the three book-to-market equity
groups, e.g.</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: center">low</th>
<th style="text-align: center">medium</th>
<th style="text-align: center">high</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right"><strong>small</strong></td>
<td style="text-align: center"><em>S/L</em></td>
<td style="text-align: center"><em>S/M</em></td>
<td style="text-align: center"><em>S/H</em></td>
</tr>
<tr>
<td style="text-align: right"><strong>big</strong></td>
<td style="text-align: center"><em>B/L</em></td>
<td style="text-align: center"><em>B/M</em></td>
<td style="text-align: center"><em>B/H</em></td>
</tr>
</tbody>
</table>
<p>Out of these six building blocks, Fama & French build a <em>size</em> and a
<em>book-to-market equity</em> portfolio:</p>
<ul>
<li>
<p>The <em>size</em> portfolio is “small minus
big” (<em>SMB</em>), the monthly difference between the average return of the three
small-stock portfolios <em>S/L</em>, <em>S/M</em>, and <em>S/H</em> and that of the three big-stock
portfolios <em>B/L</em>, <em>B/M</em>, and <em>B/H</em>.</p>
</li>
<li>
<p>The <em>book-to-market equity</em> portfolio is “high minus low” (<em>HML</em>),
the monthly difference between the average return of the two high book-to-market
portfolios <em>S/H</em> and <em>B/H</em> and that of the two low book-to-market
portfolios <em>S/L</em> and <em>B/L</em>. This is also known as the <em>value</em>
factor.</p>
</li>
</ul>
<p>Additionally, the authors use the <em>market portfolio</em> as “market
return minus risk-free return (one-month treasury bill rate)” in the
same way as Jensen (1968).</p>
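<p>To make the construction concrete: <em>SMB</em> is the average return of the three
small portfolios minus the average return of the three big ones, and <em>HML</em> is
the average return of the two high book-to-market portfolios minus the average of
the two low ones. A minimal sketch with made-up monthly returns (the real inputs
would come from French’s data library):</p>

```python
import numpy as np

# Made-up monthly returns (in %) of the six size/book-to-market
# portfolios, two months each. Real data: French's data library.
sl, sm, sh = np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([2.0, 3.0])
bl, bm, bh = np.array([0.8, 1.2]), np.array([1.0, 1.0]), np.array([1.5, 2.5])

# SMB: average of the three small portfolios minus average of the big ones.
smb = (sl + sm + sh) / 3 - (bl + bm + bh) / 3
# HML: average of the two high-B/M portfolios minus average of the low ones.
hml = (sh + bh) / 2 - (sl + bl) / 2
```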
<h4 id="an-aside">An aside</h4>
<p>I’m not sure why <em>SMB</em> and <em>HML</em> need to have their
two terms equally weighted among the splits of the other
ordering. The authors mention</p>
<blockquote>
<p>[For <em>HML</em>] the difference between the two returns should be largely free of
the size factor in returns, focusing instead on the different return
behaviors of high- and low-[book-to-market] firms. As testimony to the
success of this simple procedure, the correlation between the
1963–1991 monthly mimicking returns for the size and
book-to-market factors is only \(- 0.08\).</p>
</blockquote>
<p>Taking “correlation” to mean the Pearson correlation coefficient, we
can test this using the data from French’s homepage:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">date</span><span class="p">,</span> <span class="n">mktmrf</span><span class="p">,</span> <span class="n">smb</span><span class="p">,</span> <span class="n">hml</span><span class="p">,</span> <span class="n">rf</span> <span class="o">=</span> <span class="n">extract</span><span class="p">(</span><span class="n">download</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">corrcoef</span><span class="p">(</span><span class="n">smb</span><span class="p">,</span> <span class="n">hml</span><span class="p">))</span>
</code></pre></div></div>
<p>This prints</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[1. 0.12889074]
[0.12889074 1. ]]
</code></pre></div></div>
<p>which implies a coefficient of \(0.13\). In 1993, Fama and French had
less data available: The paper uses the 342 months from July 1963 to
December 1991. Let’s check with this range:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">orig_indices</span> <span class="o">=</span> <span class="p">(</span><span class="mi">196307</span> <span class="o"><=</span> <span class="n">date</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">date</span> <span class="o"><=</span> <span class="mi">199112</span><span class="p">)</span>
<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">smb</span><span class="p">[</span><span class="n">orig_indices</span><span class="p">])</span> <span class="o">==</span> <span class="mi">342</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">corrcoef</span><span class="p">(</span><span class="n">smb</span><span class="p">[</span><span class="n">orig_indices</span><span class="p">],</span> <span class="n">hml</span><span class="p">[</span><span class="n">orig_indices</span><span class="p">]))</span>
</code></pre></div></div>
<p>This yields</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[ 1. -0.09669641]
[-0.09669641 1. ]]
</code></pre></div></div>
<p>a coefficient of roughly \(-0.10\), not the \(-0.08\) the authors
mention, but relatively close. I guess the data has been cleaned a bit
since 1993?</p>
<p>As a further aside, accumulating this data and analyzing it was a true
feat in 1993. These days, we can do the same using the internet and
a few lines of Python (or, spoiler alert, using just a
<a href="https://www.portfoliovisualizer.com/" target="_blank">website</a>).</p>
<h4 id="back-to-modelling">Back to modelling</h4>
<p>So what are we to do with <em>SMB</em> and <em>HML</em>? You guessed it – just add
them to the regression model. Of course, this makes the subspace we
project on larger, which will always decrease the “fraction of
variance unexplained”, without necessarily explaining much. However,
in the case of IJS it appears to explain a bit:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Continuing from above.
</span><span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span>
<span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">ijs_date</span><span class="p">),</span> <span class="n">mktmrf</span><span class="p">[</span><span class="n">indices</span><span class="p">],</span> <span class="n">smb</span><span class="p">[</span><span class="n">indices</span><span class="p">],</span> <span class="n">hml</span><span class="p">[</span><span class="n">indices</span><span class="p">]],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span>
<span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">ijs</span><span class="p">[</span><span class="n">ijs_indices</span><span class="p">]</span> <span class="o">-</span> <span class="n">rf</span><span class="p">[</span><span class="n">indices</span><span class="p">]</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span><span class="p">)</span> <span class="o">@</span> <span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">y</span>
<span class="n">alpha</span><span class="p">,</span> <span class="n">beta_mkt</span><span class="p">,</span> <span class="n">beta_smb</span><span class="p">,</span> <span class="n">beta_hml</span> <span class="o">=</span> <span class="n">B</span>
<span class="n">model_err</span> <span class="o">=</span> <span class="n">A</span> <span class="o">@</span> <span class="n">B</span> <span class="o">-</span> <span class="n">y</span>
<span class="n">ss_err</span> <span class="o">=</span> <span class="n">model_err</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">model_err</span>
<span class="n">r2</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ss_err</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">ddof</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">adjr2</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">ss_err</span><span class="p">.</span><span class="n">item</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">A</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">ddof</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span>
<span class="s">"FF3F: alpha=%.2f%%; beta_mkt=%.2f; beta_smb=%.2f; beta_hml=%.2f."</span>
<span class="s">" R^2=%.1f%%; R_adj^2=%.1f%%. Annualized alpha: %.2f%%"</span>
<span class="o">%</span> <span class="p">(</span>
<span class="n">alpha</span><span class="p">,</span>
<span class="n">beta_mkt</span><span class="p">,</span>
<span class="n">beta_smb</span><span class="p">,</span>
<span class="n">beta_hml</span><span class="p">,</span>
<span class="mi">100</span> <span class="o">*</span> <span class="n">r2</span><span class="p">,</span>
<span class="mi">100</span> <span class="o">*</span> <span class="n">adjr2</span><span class="p">,</span>
<span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span> <span class="o">**</span> <span class="mi">12</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span>
<span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>This prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FF3F: alpha=0.04%; beta_mkt=0.97; beta_smb=0.79; beta_hml=0.51. R^2=95.8%; R_adj^2=95.8%. Annualized alpha: 0.43%
</code></pre></div></div>
<p>In other words, we dropped from an (annualized) \(\alpha_{\rm CAPM} = 1.58\%\) to
only \(\alpha_{\rm FF3F} = 0.43\%\). The explained fraction of
variance has increased to above 95%.</p>
<p>Remembering that Jensen (1968) talked about assessing fund managers
with this model, we could try the same with actual managed
funds. While I couldn’t produce any impressive results there, French
and Fama did go into the question of <a href="http://mba.tuck.dartmouth.edu/bespeneckbo/default/AFA611-Eckbo%20web%20site/AFA611-S8C-FamaFrench-LuckvSkill-JF10.pdf" target="_blank"><em>Luck versus Skill in the
Cross-Section of Mutual Fund
Returns</em></a>
in a 2010 paper. The results, on average, don’t look good for fund
managers’ skill. The story for individual fund managers may be better,
but don’t hold your breath.</p>
<h3 id="was-this-worth-it">Was this worth it?</h3>
<p>We discussed academic outputs of Jensen, French and Fama; Fama
even received a Nobel for
his work on factor models. But nowadays, we can do (parts of) their
computations in a few lines of Python.</p>
<p>It’s actually even easier than that. The website
portfoliovisualizer.com allows us to do <a href="https://www.portfoliovisualizer.com/factor-analysis?s=y&regressionType=1&symbols=IJS&sharedTimePeriod=true&factorDataSet=0&marketArea=0&factorModel=1&useHMLDevFactor=false&includeQualityFactor=false&includeLowBetaFactor=false&fixedIncomeFactorModel=0&__checkbox_ffmkt=true&__checkbox_ffsmb=true&__checkbox_ffsmb5=true&__checkbox_ffhml=true&__checkbox_ffmom=true&__checkbox_ffrmw=true&__checkbox_ffcma=true&__checkbox_ffstrev=true&__checkbox_ffltrev=true&__checkbox_aqrmkt=true&__checkbox_aqrsmb=true&__checkbox_aqrhml=true&__checkbox_aqrhmldev=true&__checkbox_aqrmom=true&__checkbox_aqrqmj=true&__checkbox_aqrbab=true&__checkbox_aamkt=true&__checkbox_aasmb=true&__checkbox_aahml=true&__checkbox_aamom=true&__checkbox_aaqmj=true&__checkbox_qmkt=true&__checkbox_qme=true&__checkbox_qia=true&__checkbox_qroe=true&__checkbox_qeg=true&__checkbox_trm=true&__checkbox_cdt=true&timePeriod=2&rollPeriod=12&marketAssetType=1&robustRegression=false" target="_blank">all
of</a>
<a href="https://www.portfoliovisualizer.com/factor-analysis?s=y&regressionType=1&symbols=IJS&sharedTimePeriod=true&factorDataSet=0&marketArea=0&factorModel=3&useHMLDevFactor=false&includeQualityFactor=false&includeLowBetaFactor=false&fixedIncomeFactorModel=0&__checkbox_ffmkt=true&__checkbox_ffsmb=true&__checkbox_ffsmb5=true&__checkbox_ffhml=true&__checkbox_ffmom=true&__checkbox_ffrmw=true&__checkbox_ffcma=true&__checkbox_ffstrev=true&__checkbox_ffltrev=true&__checkbox_aqrmkt=true&__checkbox_aqrsmb=true&__checkbox_aqrhml=true&__checkbox_aqrhmldev=true&__checkbox_aqrmom=true&__checkbox_aqrqmj=true&__checkbox_aqrbab=true&__checkbox_aamkt=true&__checkbox_aasmb=true&__checkbox_aahml=true&__checkbox_aamom=true&__checkbox_aaqmj=true&__checkbox_qmkt=true&__checkbox_qme=true&__checkbox_qia=true&__checkbox_qroe=true&__checkbox_qeg=true&__checkbox_trm=true&__checkbox_cdt=true&timePeriod=2&rollPeriod=12&marketAssetType=1&robustRegression=false" target="_blank">these
computations</a>
and more with a few clicks. In that sense, this blog post was perhaps
not worth it.</p>
<p>Another question is how useful these models are. This touches on <em>why</em>
<em>SMB</em> and <em>HML</em> ‘explain’ returns of portfolios (e.g., the risk
explanation vs the behavioural explanation mentioned above, or perhaps
both or neither). In
2015, Fama and French presented another updated model with five
factors, adding <em>profitability</em> and <em>investment</em>; judged by \(R^2\),
this five-factor model ‘explains’ even more of the variance of example
portfolios. Other research
suggesting alternative factors abounds.</p>
<p>How well do these models
really ‘explain’ the phenomenal historical returns of star investors
like Warren Buffett? Given that Buffett is a proponent of the <a href="https://en.wikipedia.org/wiki/Benjamin_Graham" target="_blank">Benjamin
Graham</a> school of
<em>value investing</em>, perhaps a value factor like <em>HML</em>
is the key to explaining his success?
For the Fama & French five-factor model, we can
check
<a href="https://www.portfoliovisualizer.com/factor-analysis?s=y&regressionType=1&symbols=BRK.A&sharedTimePeriod=true&factorDataSet=0&marketArea=0&factorModel=5&useHMLDevFactor=false&includeQualityFactor=false&includeLowBetaFactor=false&fixedIncomeFactorModel=0&__checkbox_ffmkt=true&__checkbox_ffsmb=true&__checkbox_ffsmb5=true&__checkbox_ffhml=true&__checkbox_ffmom=true&__checkbox_ffrmw=true&__checkbox_ffcma=true&__checkbox_ffstrev=true&__checkbox_ffltrev=true&__checkbox_aqrmkt=true&__checkbox_aqrsmb=true&__checkbox_aqrhml=true&__checkbox_aqrhmldev=true&__checkbox_aqrmom=true&__checkbox_aqrqmj=true&__checkbox_aqrbab=true&__checkbox_aamkt=true&__checkbox_aasmb=true&__checkbox_aahml=true&__checkbox_aamom=true&__checkbox_aaqmj=true&__checkbox_qmkt=true&__checkbox_qme=true&__checkbox_qia=true&__checkbox_qroe=true&__checkbox_qeg=true&__checkbox_trm=true&__checkbox_cdt=true&timePeriod=2&rollPeriod=12&marketAssetType=1&robustRegression=false" target="_blank">portfoliovisualizer.com</a>:
With \(R^2 \approx 33\%\), and an annualized \(\alpha\) of 4.87%, the results
don’t look too good for the math nerds but very good for the ‘Oracle
of Omaha’.</p>
<p>This is obviously not a new observation. There is even a paper by a
number of people from investment firm AQR about <a href="https://www.tandfonline.com/doi/full/10.2469/faj.v74.n4.3" target="_blank"><em>Buffett’s
Alpha</em></a>
that aims to explain Buffett’s successes with leveraging as well as
yet another set of new factors in a linear regression model:</p>
<blockquote>
<p>[Buffett’s] alpha became insignificant, however, when we controlled
for exposure to the factors “betting against beta” and “quality
minus junk.”</p>
</blockquote>
<p>Nice as this may sound, it would appear more convincing to this author
if the financial analysis community could converge on a small common
set of factors instead of seemingly creating them <em>ad hoc</em>. Otherwise,
von Neumann’s line comes to mind: “With four parameters I can fit an
elephant, and with five I can make him wiggle his trunk.”</p>
<h3 id="and-now-what">And now what?</h3>
<p>We discussed two financial economics papers and the linear regression
models they propose, merely to give us a sense of what’s done in this
field. One may get a sense that this research should be useful for
more than just amusement, perhaps it could even inform our investment
choices? Many good <a href="https://www.youtube.com/watch?v=ViTnIebSzj4" target="_blank">financial
advisors</a> will make use
of data analyses
like this and suggest <em>factor-tilted</em> portfolios. However, value
investing, both with factors as well as the Buffett/Munger variety,
has trailed the overall market in the last 10–15
years. Statistically, this is expected to happen every now and
then, so we cannot read too much into that. But it’s <em>possible</em> the
market has just caught on, past performance is not indicative of
future results and value investing should be cancelled in 2020.
That would at least match the <em>zeitgeist</em>. However, it’s also
entirely possible it’s exactly times like the present that make value
investing hard but ultimately worthwhile and we should be greedy when
others are fearful.</p>
<p>Time will tell.</p>
<h4 id="references">References</h4>
<ul class="simplelist">
<li>Frazzini, Andrea & Kabiller, David & Pedersen, Lasse Heje.
<a href="https://www.tandfonline.com/doi/full/10.2469/faj.v74.n4.3" target="_blank"><em>Buffett’s Alpha</em></a>.
Financ. Anal. J. 74 (4), 35–55 (2018).
<span class="smallcaps"><a href="https://doi.org/10.2469/faj.v74.n4.3" target="_blank">DOI:10.2469/faj.v74.n4.3</a></span>.</li>
<li>Fama, Eugene F. & French, Kenneth R.
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.139.5892" target="_blank"><em>Common risk factors in the returns on stocks and bonds</em></a>.
J. Financ. Econ. 33, (1), 3–56 (1993).
<span class="smallcaps"><a href="https://doi.org/10.1016/0304-405X(93)90023-5" target="_blank">DOI:10.1016/0304-405X(93)90023-5</a></span>.</li>
<li>———.
<a href="http://mba.tuck.dartmouth.edu/bespeneckbo/default/AFA611-Eckbo%20web%20site/AFA611-S8C-FamaFrench-LuckvSkill-JF10.pdf" target="_blank"><em>Luck versus skill in the cross‐section of mutual fund
returns</em></a>.
J. Financ. 65 (5), 1915–1947 (2010).
<span class="smallcaps"><a href="https://doi.org/10.1111/j.1540-6261.2010.01598.x" target="_blank">DOI:10.1111/j.1540-6261.2010.01598.x</a></span>.</li>
<li>———.
<a href="https://tevgeniou.github.io/EquityRiskFactors/bibliography/FiveFactor.pdf" target="_blank"><em>A five-factor asset pricing model</em></a>.
J. Financ. Econ. 116 (1), 1–22 (2015).
<span class="smallcaps"><a href="https://doi.org/10.1016/j.jfineco.2014.10.010" target="_blank">DOI:10.1016/j.jfineco.2014.10.010</a></span>.</li>
<li>Jensen, Michael C.
<a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.1968.tb00815.x" target="_blank"><em>The performance of mutual funds in the period 1945–1964</em></a>.
J. Financ. 23 (2), 389–416 (1968).
<span class="smallcaps"><a href="https://doi.org/10.1111/j.1540-6261.1968.tb00815.x" target="_blank">DOI:10.1111/j.1540-6261.1968.tb00815.x</a></span>.</li>
</ul>
Tue, 08 Sep 2020 00:00:00 +0000
https://heiner.ai/blog/2020/09/08/more-linear-regression-capm.html
https://heiner.ai/blog/2020/09/08/more-linear-regression-capm.htmlblogOn linear regression<p><small>[Edit 14 May 2020: Thanks to Marcus Waurick for pointing out \(A\)
needs to have closed range for Lemmas 1 and 2 to be true in
general. Unfortunately, there seems to be no easy way of doing linear
regression in general Hilbert spaces (no equivalent of \(\1\)),
although the idea of using weighted \(L_2\) spaces and writing a
blog post on <em>Linear Regression in Hilbert spaces with Morgenstern
norm</em> has a certain ring to it.]</small></p>
<p>Apropos of nothing, this is a post about linear regression. There’s a
whole world of information on that out there, but little with the
right point of view: The Hilbert space perspective. That might not
be universally loved, so you might feel more comfortable reading
\(\H_1 = \R^m\), \(\H_2 = \R^n\) and “matrix” in place of “linear
operator” below.</p>
<p>We’ll do the short theory up front, then look at an example, then
say something about “fraction of variance explained”. A later blog
post will discuss the CAPM, Fama–French models and efficient markets as examples.
<h2 id="two-lemmas">Two lemmas.</h2>
<p>Let \(\H_1\), \(\H_2\) be Hilbert spaces and \(A\in L(\H_1,\H_2)\) be
a linear operator with closed range (i.e., \(\overline{\im A} = \im A\)).</p>
<p><strong>Lemma 1.</strong> \(\H_2 = \im A\oplus\ker A^*\) in the sense of direct sums
of Hilbert spaces.</p>
<p><em>Proof</em> Let \(x\in\im A\), \(y\in\ker A^*\). There is a \(x_1\in\H_1\)
with \(x = Ax_1\) and</p>
\[\langle x,y\rangle = \langle Ax_1, y\rangle = \langle x_1, A^*
y\rangle = 0.\]
<p>Thus \(\im A\perp\ker A^*\).</p>
<p>Reversely let \(y\in(\im A)^\perp\). Then \(0 = \langle Ax_1,
y\rangle = \langle x_1, A^* y\rangle\) for all \(x_1\in\H_1\), and
thus \(y\in\ker A^*\). //</p>
<p><strong>Lemma 2.</strong> Let \(P\from\H_2\to\im A\) be the orthogonal
projection.</p>
<p>(a) Let \(y\in\H_2\). Then there is a \(b\in\H_1\) such that \(A^*Ab =
A^*y\) and \(Py = Ab\). This is called the <em>normal equation</em>.</p>
<p>(b) If \(\ker A = \{0\}\), i.e., \(A\) injective, then \(P =
A(A^*A)^{-1}A^*\).</p>
<p><em>Proof.</em> (a) There is \(b\in\H_1\) with \(Py = Ab\). Also \(y - Py\in(\im
A)^\perp\), and thus by Lemma 1</p>
\[0 = A^*(y -Py) = A^*(y - Ab) = A^*y - A^*Ab.\]
<p>(b) Lemma 1 implies \(\H_1 = \im A^*\oplus\ker A = \im A^*\),
i.e. \(A^*\) is surjective. Since \(\im A = (\ker A^*)^\perp\) and
evidently \(A^*\from(\ker A^*)^\perp\to\H_1\) is injective, that
mapping is bijective. Additionally
\(A\from\H_1\to\im A\) is bijective, and thus \(A^*A\from\H_1\to\H_1\)
is bijective and invertible. Using (a) it follows that \(Py = Ab =
A(A^*A)^{-1}A^*y\). //</p>
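<p>Lemma 2(b) is easy to sanity-check numerically: for an injective matrix
\(A\), the matrix \(P = A(A^\top A)^{-1}A^\top\) should be idempotent,
self-adjoint, and act as the identity on \(\im A\). A quick sketch (the random
matrix is just an example; a tall matrix with i.i.d. Gaussian entries is
almost surely injective):</p>

```python
import numpy as np

# Numerical check of Lemma 2(b): P = A (A^T A)^{-1} A^T is the
# orthogonal projection onto im A for an injective A.
rng = np.random.default_rng(42)
A = rng.normal(size=(7, 3))  # tall random matrix, almost surely injective
P = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(P @ P, P)  # idempotent
assert np.allclose(P.T, P)    # self-adjoint
assert np.allclose(P @ A, A)  # identity on im A
```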
<h2 id="linear-regression-as-projection-to-subspaces">Linear regression as projection to subspaces.</h2>
<h3 id="example-warships">Example: Warships</h3>
<p>To take a <a href="https://de.wikipedia.org/wiki/Bestimmtheitsma%C3%9F#Rechenbeispiel">favorite example from German
Wikipedia</a>,
let’s discuss warships from the Second World War. Let’s say we’ve got
\(x_1\in\R^n\), a list of lengths of \(n\) warships, \(x_2\in\R^n\) a list
of their beams (widths) and \(y\in\R^n\) a list of their
drafts (distance from bottom to waterline), all in meters. We call
\(x_1\) and \(x_2\) <em>feature vectors</em> (also known as <em>explanatory</em> or
<em>independent</em> variables) and can safely assume they are
linearly independent, such that the matrix</p>
\[A = \begin{pmatrix}
| & | \\
x_1 & x_2 \\
| & |
\end{pmatrix}\in\R^{n\times 2}\]
<p>defines an injective operator. Its image is the linear span of \(x_1\)
and \(x_2\), i.e., \(\im A = \{\alpha x_1 + \beta x_2 \st \alpha,
\beta\in\R\}\subseteq\R^n\). We wouldn’t expect \(y\) to be an element
of \(\im A\), but the hope of a linear regression model trying to
predict a warship’s draft from its length and beam is that \(y\)
is not too far from \(\im A\) as a subspace of \(\R^n\)
either. Specifically, the \(\abs{\dotid}_2\)-distance is</p>
\[\dist(\{y\}, \im A) = \min_{y'\in\im A}\abs{y - y'}_2 = \abs{y - P y}_2\]
<p>where \(P\from\R^n\to\im A\) is the orthogonal projection. We can call
\(Py\) the model’s prediction, and Lemma 2 above tells us how to
compute \(P\) from \(A\), namely \(P = A(A^\top A)^{-1}A^\top\). Moreover, the
normal equation tells us which linear combination<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> of \(x_1\) and \(x_2\)
computes \(Py\):</p>
\[Py = A(A^\top A)^{-1}A^\top y = Ab \;\text{ with }\; b = (A^\top
A)^{-1}A^\top y\in\R^2.\]
<p>Thus, if we are given an additional warship’s length and beam as a
vector \(a\in\R^2\), the model’s prediction of its draft will be
\(\langle a, b\rangle\).</p>
<p>We can implement this idea in simple Python code (<a href="https://colab.research.google.com/drive/1lPXUxTBDrRC8aZ7MqZZDkvKaUz5Ogh5b" target="_blank">Colab
link</a>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c1"># Lengths, beams and drafts of 8 warships
</span><span class="n">data</span> <span class="o">=</span> <span class="s">"""
187 31 9.4
242 31 9.6
262 32 8.7
216 32 9
195 32 9.7
227 31 11
193 20 5.2
175 17 5.2
"""</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="n">lengths</span><span class="p">,</span> <span class="n">beams</span><span class="p">,</span> <span class="n">drafts</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">unpack</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">lengths</span><span class="p">,</span> <span class="n">beams</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span><span class="p">)</span> <span class="o">@</span> <span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">drafts</span>
<span class="k">print</span><span class="p">(</span><span class="s">"b:"</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"model prediction:"</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">200</span><span class="p">,</span> <span class="mi">20</span><span class="p">])</span> <span class="o">@</span> <span class="n">b</span><span class="p">)</span>
</code></pre></div></div>
<p>This will print</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>b: [-0.00539494 0.34045321]
model prediction: 5.730077087608246
</code></pre></div></div>
<p>So our model would predict a hypothetical warship with length 200m
and beam 20m to have 5.73m of draft.</p>
<p>Notice that the way we built our model makes it predict a draft of
zero for the (nonsensical) inputs of a warship
of zero meters length or beam. While this seems fine enough in this
case, it’s not in others. A simple way of fixing this is to add an
“intercept” (in machine learning known as “bias”) term: Define
\(\1=(1,\ldots,1)\in\R^n\) and add that as an additional (say, first)
“feature” vector. This turns our model into an affine function of its
data inputs. Doing this in our Python example changes little:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">lengths</span><span class="p">),</span> <span class="n">lengths</span><span class="p">,</span> <span class="n">beams</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span><span class="p">)</span> <span class="o">@</span> <span class="n">A</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">drafts</span>
<span class="k">print</span><span class="p">(</span><span class="s">"b:"</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"model prediction:"</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">20</span><span class="p">])</span> <span class="o">@</span> <span class="n">b</span><span class="p">)</span>
</code></pre></div></div>
<p>Result:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>b: [ 0.09353128 -0.00580401 0.34027063]
model prediction: 5.738140993529072
</code></pre></div></div>
<p>However, trying to predict crew sizes changes that, see the <a href="https://colab.research.google.com/drive/1lPXUxTBDrRC8aZ7MqZZDkvKaUz5Ogh5b" target="_blank">Colab for
details</a>.</p>
<p>Of course, we can use more than two or three feature vectors as part
of the <em>design matrix</em> \(A\), as long as they are linearly
independent, which is typically the case in practice with enough
examples \(n\).</p>
<p>Using a matrix \(Y\in\R^{n\times m}\) in place of
\(y\in\R^n\) the same math allows us to succinctly treat the
“multivariate” case in which we’re trying to predict more than one
“dependent variable”, e.g. a warship’s draft and the size of its
crew. This is just doing more than one linear regression at once.</p>
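<p>As a sketch of the multivariate case (with made-up numbers): stacking the
targets as columns of \(Y\), a single normal-equation solve yields one
coefficient column per dependent variable.</p>

```python
import numpy as np

# Multivariate linear regression: one normal-equation solve
# fits all dependent variables (columns of Y) at once.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))           # design matrix: 8 samples, 3 features
Y = rng.normal(size=(8, 2))           # two targets, e.g. draft and crew size
B = np.linalg.inv(A.T @ A) @ A.T @ Y  # 3x2: one coefficient column per target

# Each column of B equals the corresponding single-target solution.
b0 = np.linalg.inv(A.T @ A) @ A.T @ Y[:, 0]
assert np.allclose(B[:, 0], b0)
```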
<h3 id="r2-and-all-that">\(R^2\) and all that</h3>
<p>As mentioned above, the distance \(\abs{y - Py}_2\) gives us a sense
of the quality of our model’s predictions when applied to the data we
built it on. The textbooks, being obsessed with element-wise
expressions, call \(\abs{y - Py}_2^2\) the <em>residual sum of squares</em>.</p>
<p>While this quantity (or a normalized version of it) is a measure of
the error our model produces on the data we know, it doesn’t tell us
if it was our input data that was useful in particular. To quantify
that, we can compare it with a minimal, “inputless” model built from only
the intercept entry, i.e., using \(A_\1 = \1\). The prediction of that model
will be the mean of the data, \(\frac{1}{n}\langle y,
\1\rangle\), and the orthogonal projection on \(\im A_\1 =
\lin\{\1\}\) is \(P_\1 y = \frac{1}{n}\langle y, \1\rangle\1\). The
squared distance \(\abs{y - P_\1y}_2^2\) of that minimal model’s
predictions from the data is known as the <em>total sum of squares</em>. It’s
also the unnormalized variance of the data. If our original model used
an intercept term, i.e., \(\1\in\im A\), then \(Py - P_\1 y\in\im A\)
and therefore \(y - Py \perp Py - P_\1 y\). Hence, the Pythagorean
identity tells us</p>
\[\abs{y - P_\1 y}_2^2 = \abs{y - Py + Py - P_\1 y}_2^2
= \abs{y - Py}_2^2 + \abs{Py - P_\1 y}_2^2.\]
<p>The last term is known somewhat factitiously as the <em>explained sum of
squares</em> with the idea that it measures deviations from the mean
caused by the data.</p>
<p>The ratio \(\abs{y - Py}_2^2 / \abs{y - P_\1 y}_2^2\) is called the
<em>fraction of variance unexplained</em>, although one has to be careful
with that terminology.
The <em>coefficient of determination</em> is one minus that number, or</p>
\[R^2 := 1 - \frac{\abs{y - Py}_2^2}{\abs{y - P_\1y}_2^2}
= \frac{\abs{Py - P_\1y}_2^2}{\abs{y - P_\1y}_2^2},\]
<p>where the last equality is true if and only if \(\1\in\im A\).</p>
<p><strong>Adjustments.</strong> Since including more features in our model matrix
\(A\) will make
\(\im A\) a larger subspace of \(\R^n\), the error \(\abs{y -
Py}_2^2\) will be monotonically decreasing and \(R^2\) monotonically
increasing. While this offers the
opportunity to claim to “explain” more, it might in fact make our
model’s predictions on new inputs worse. Various solutions for this
have been proposed. One somewhat principled approach is to use
“adjusted \(R^2\)”, defined as</p>
\[\bar{R}^2 = 1 - \frac{\abs{y - Py}_2^2 / (n - p - 1)}{\abs{y -
P_1y}_2^2 / (n - 1)} \le R^2,\]
<p>where \(p\) is the number of features, i.e., \(A\in\R^{n\times
(p+1)}\) if we use an intercept term. This substitutes the unbiased
sample variance and error estimators for their biased versions.</p>
<p>That’s it for now. We’ll add a proper example from finance in a future
blog post.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>In situations with very many feature vectors, computing the
inverse \((A^\top A)^{-1}\) may no longer be the best way of
finding \(b\). Instead, one could try to minimize \(\abs{y-Ab}_2\)
in another way, e.g. via gradient descent. This is how “linear
layers” in machine learning are trained. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Wed, 19 Feb 2020 00:00:00 +0000
https://heiner.ai/blog/2020/02/19/linear-regression.html
Annuity loans
<p><small>Edit June 2023: Via
<a href="https://twitter.com/HeinrichKuttler/status/1673272830560022529?s=20">Twitter</a>,
I found that the same formula I arrive at here has been published in
P. Milanfar, “<em>A Persian Folk Method of Figuring Interest</em>”, Mathematics
Magazine, vol. 69, no. 5, Dec. 1996 and apparently was known and used
“in the bazaars of Iran (and elsewhere)”. The mnemonics in <a href="https://www.maa.org/sites/default/files/Peyman_Milanfar45123.pdf">that
paper</a>
are
\(\textrm{Monthly payment}
=
\frac{1}{\textrm{Number of months}}(\textrm{Principal} + \textrm{Interest})\)
where
\(\textrm{Interest} = \tfrac{1}{2} \textrm{Principal} \cdot
\textrm{Number of years} \cdot \textrm{Annual interest rate}\).
</small></p>
<p>I’m fond of small mental calculation helpers like the <a href="https://en.wikipedia.org/wiki/Rule_of_72">rule of
72</a>. Not that I am good at
mental math (<a href="https://play.google.com/store/apps/details?id=org.kuettler.mathapp">I once tried to fix that and got
sidetracked</a>),
but I am good at spending time contemplating how I’d do it if I was
better at it.</p>
<p>Another thing I’m not good at is finance. Lack of capital usually
saves me from the worst mistakes, but despite the <a href="https://www.youtube.com/watch?v=Uwl3-jBNEd4">brilliant advice
from Ben Felix</a>, I sometimes
contemplate spending money on real estate. Real estate tends to be
financed with a mortgage, which often is a type of annuity loan. What
is an annuity loan and is there a neat rule of thumb for it? Read on
to find out.</p>
<h2 id="whats-an-annuity-loan">What’s an annuity loan?</h2>
<p>We receive a loan of size \(S_0\) and pay it back. Each period we’ll
pay the same amount</p>
\[R = T_k + Z_k\]
<p>with a principal repayment \(T_k\) and an interest payment \(Z_k\) for
the \(k\)th payment out of \(n\) total payments.</p>
<p>The interest payment is going to be a fixed percentage of the
outstanding principal balance and thus \(Z_1 = pS_0,\) where we set
e.g. \(p=0.02\) for a 2% interest rate.</p>
<h2 id="so-we-always-pay-the-same-amount-how-much">So we always pay the same amount. How much?</h2>
<p>Let’s consider an interest rate of \(p\) and \(n\) periods with one
payment per period. We note that the payment</p>
\[R = T_1 + pS_0 = T_2 + p(S_0 - T_1),\]
<p>since \(S_0 - T_1\) is the outstanding balance after the first
payment. Hence \(T_2 = (1 + p)T_1\). After \(k\) and \(k+1\) periods</p>
\[R = T_k + p\Bigl(S_0 - \sum_{j=1}^{k-1} T_j\Bigr)
= T_{k+1} + p\Bigl(S_0 - \sum_{j=1}^k T_j\Bigr),\]
<p>hence \(T_{k+1} = (1+p)T_k\) and</p>
\[T_k = (1 + p)^{k-1}T_1.\]
<p>To impose another condition, let’s say we want to fully pay back the
loan after \(n\) periods, i.e.,</p>
\[S_0 = \sum_{k=1}^{n} T_k = T_1 \sum_{k=1}^{n}(1+p)^{k-1} =
T_1\frac{(1+p)^n - 1}{p},\]
<p>where, as so often, we made use of the sequence of partial sums
\(\sum_{k=0}^{n} q^k = \frac{q^{n+1}-1}{q-1}\) of the geometric
series. Thus we find</p>
\[R = T_1 + Z_1 = pS_0\frac{1}{(1+p)^n - 1} + pS_0
= pS_0\frac{(1+p)^n}{(1+p)^n - 1}\]
<p>The regular annuity payment \(R\) is therefore a constant factor of
the loaned amount \(S_0\), the <em>annuity factor</em></p>
\[\af(p, n) = p\frac{(1+p)^n}{(1+p)^n - 1}.\]
<p>Not quite incidentally, the annuity factor also shows up in formulas
for computing the equivalent annual cost from the net present
value. We won’t go into that here.</p>
<p>Note that the condition to fully pay back \(S_0\) does not in practice
constrain us very much – if less is paid back, say \(S_1\), we would
do the calculation with \(S_1\) in place of \(S_0\) and add a regular
interest payment of \(p(S_0 - S_1)\).</p>
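<p>The derivation above can be checked in a few lines of Python (the
function names are our own, not from any library): compute the annuity
factor, then simulate the loan period by period and verify that the
constant payment amortizes it to zero.</p>

```python
# A sketch of the annuity factor af(p, n) = p (1+p)^n / ((1+p)^n - 1)
# derived above. R = af(p, n) * S0 is the constant per-period payment.

def annuity_factor(p, n):
    """Constant per-period payment per unit of principal."""
    growth = (1 + p) ** n
    return p * growth / (growth - 1)

# Sanity check: paying R each period pays the loan off after exactly
# n periods (the balance accrues interest, then R is subtracted).
s0, p, n = 100_000.0, 0.02, 10
r = annuity_factor(p, n) * s0
balance = s0
for _ in range(n):
    balance = balance * (1 + p) - r
```

<p>After the loop, <code class="language-plaintext highlighter-rouge">balance</code>
is zero up to floating-point error, as the derivation demands.</p>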
<h2 id="months-and-years">Months and years</h2>
<p>Typically we’ll make a monthly payment over many years. This is where
some treatments get confusing and also where banks make a bit of an
extra buck. For \(n\) years and \(p\) interest per year, for our
purposes we’ll just take \(12n\) periods and a monthly interest of
\(\sqrt[12]{1 + p} - 1 = \frac{p}{12} + O(p^2)\). For small \(p\),
e.g. not more than 10%, this is a good approximation. The monthly
payment is then</p>
\[\frac{p}{12}\frac{(1 + p/12)^{12n}}{(1 + p/12)^{12n} - 1}S_0.\]
<h2 id="taylored-for-mental-arithmetic">Taylored for mental arithmetic</h2>
<p>We started this looking for a simple approximation we can use for
mental arithmetic, like the rule of 72 (it takes \(72/x\) periods for an
investment to double in value if it appreciates \(x\)% per
period). The above isn’t that. The binomial theorem helps:</p>
\[\Bigl(1 + \frac{p}{12}\Bigr)^{12n}
= \sum_{k=0}^{12n} \binom{12n}{k} \Bigl(\frac{p}{12}\Bigr)^k
= 1 + np + \binom{12n}{2}\Bigl(\frac{p}{12}\Bigr)^2 + O(p^3).\]
<p>If we want to forget about the \(p^2\) term as well we end up at</p>
\[\af(p/12, 12n) \approx \frac{p}{12}\frac{1 + np}{np} = \frac{1 +
np}{12n},\]
<p>which is nice. But not great. A short calculation with the binomial
coefficient \(\binom{12n}{2}\) yields</p>
\[\binom{12n}{2}\Bigl(\frac{p}{12}\Bigr)^2
= \frac{(np)^2}{2} - \frac{np^2}{24}.\]
<p>For typical \(n\) and \(p\) (e.g., \(n = 20\) and \(p = 0.02\)) the
second term won’t play any role. With the first term we improve our
approximation to</p>
\[\af(p/12, 12n)
\approx \frac{p}{12}\frac{1 + np + (np)^2/2}{np + (np)^2/2}
= \frac{1}{12n}\frac{1 + np(1 + np/2)}{1 + np/2}
= \frac{1}{12n}\Bigl(\frac{1}{1 + np/2} + np\Bigr).\]
<p>Now, this is clearly unsuitable. But using the geometric series again
we see that \(\frac{1}{1 + np/2} = 1 - np/2 + (np/2)^2 + O(p^3)\) and
hence</p>
\[\af(p/12, 12n) \approx \frac{1 + \frac{np}{2}(1 +
\frac{np}{2})}{12n}.\]
<p>This, too, is not too convenient. But if we drop the \((np)^2\) term
we end up with</p>
\[\af(p/12, 12n) \approx \frac{1 + np/2}{12n}.\]
<p>This isn’t too bad. For slightly larger \(n\) and \(p\) we could also
take \(0.6\) instead of \(1/2\) in the above formula. Of course we can
also carry out the division by 12 and arrive at yet another formula.</p>
<p>Let’s look at this from a slightly different perspective. A naive,
“interest-free” calculation on what the monthly payment for an \(n\)
year loan of \(S_0\) is would be \(\frac{S_0}{12n}.\)</p>
<p>Our formula \(\af(p/12, 12n) \approx \frac{1 + np/2}{12n}\) says: <strong>Do
the naive thing, then add x% to the result, where x is “\(n\) times
the interest rate over 2”.</strong></p>
<h2 id="example">Example</h2>
<p>A loan of $400000 with 2% interest over 20 years:</p>
\[\frac{\$400000}{12\cdot 20} \ \cdot\ \frac{20\cdot 2}{2}\%
= \$1666 \ \cdot\ 20\% = \$333,\]
<p>so the monthly payment will be \(\$1666 + \$333 = \$2000\). This isn’t
too far from the exact value of \(\af(\sqrt[12]{1.02} - 1, 12n) \cdot
\$400000 = \$2020\).</p>
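<p>The example can be reproduced in a few lines (the helper names are
our own): the rule of thumb against the exact annuity factor with the
equivalent monthly rate \(\sqrt[12]{1 + p} - 1\).</p>

```python
# Rule of thumb vs. exact annuity payment for the $400,000, 2%,
# 20-year example. Helper names are illustrative only.

def exact_monthly(s0, p, years):
    pm = (1 + p) ** (1 / 12) - 1        # equivalent monthly interest rate
    n = 12 * years
    growth = (1 + pm) ** n              # equals (1 + p) ** years
    return pm * growth / (growth - 1) * s0

def thumb_monthly(s0, p, years):
    naive = s0 / (12 * years)           # interest-free monthly payment
    return naive * (1 + years * p / 2)  # surcharge of n*p/2

exact = exact_monthly(400_000, 0.02, 20)   # about $2020
approx = thumb_monthly(400_000, 0.02, 20)  # $2000
```

<p>The approximation lands within about one percent of the exact
payment here.</p>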
<h2 id="more-numbers">More numbers</h2>
<p>The following table compares exact monthly payments as a percentage of
the total loan with our estimate, i.e., the true annuity factor
\(\af(\sqrt[12]{1 + p} - 1, 12n)\) and our approximation
\(\frac{1 + np/2}{12n}\) of it (<strong>bold</strong>).</p>
<div class="centered overflow">
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: right">\(n = 2\)</th>
<th style="text-align: right">\(n = 5\)</th>
<th style="text-align: right">\(n = 7\)</th>
<th style="text-align: right">\(n = 10\)</th>
<th style="text-align: right">\(n = 15\)</th>
<th style="text-align: right">\(n = 20\)</th>
<th style="text-align: right">\(n = 25\)</th>
<th style="text-align: right">\(n = 30\)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">\(p = 0.5\%\)</td>
<td style="text-align: right">\(4.19\%\)<br />\(\bf 4.19\%\)</td>
<td style="text-align: right">\(1.69\%\)<br />\(\bf 1.69\%\)</td>
<td style="text-align: right">\(1.21\%\)<br />\(\bf 1.21\%\)</td>
<td style="text-align: right">\(0.85\%\)<br />\(\bf 0.85\%\)</td>
<td style="text-align: right">\(0.58\%\)<br />\(\bf 0.58\%\)</td>
<td style="text-align: right">\(0.44\%\)<br />\(\bf 0.44\%\)</td>
<td style="text-align: right">\(0.35\%\)<br />\(\bf 0.35\%\)</td>
<td style="text-align: right">\(0.30\%\)<br />\(\bf 0.30\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 1.0\%\)</td>
<td style="text-align: right">\(4.21\%\)<br />\(\bf 4.21\%\)</td>
<td style="text-align: right">\(1.71\%\)<br />\(\bf 1.71\%\)</td>
<td style="text-align: right">\(1.23\%\)<br />\(\bf 1.23\%\)</td>
<td style="text-align: right">\(0.88\%\)<br />\(\bf 0.88\%\)</td>
<td style="text-align: right">\(0.60\%\)<br />\(\bf 0.60\%\)</td>
<td style="text-align: right">\(0.46\%\)<br />\(\bf 0.46\%\)</td>
<td style="text-align: right">\(0.38\%\)<br />\(\bf 0.38\%\)</td>
<td style="text-align: right">\(0.32\%\)<br />\(\bf 0.32\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 1.5\%\)</td>
<td style="text-align: right">\(4.23\%\)<br />\(\bf 4.23\%\)</td>
<td style="text-align: right">\(1.73\%\)<br />\(\bf 1.73\%\)</td>
<td style="text-align: right">\(1.25\%\)<br />\(\bf 1.25\%\)</td>
<td style="text-align: right">\(0.90\%\)<br />\(\bf 0.90\%\)</td>
<td style="text-align: right">\(0.62\%\)<br />\(\bf 0.62\%\)</td>
<td style="text-align: right">\(0.48\%\)<br />\(\bf 0.48\%\)</td>
<td style="text-align: right">\(0.40\%\)<br />\(\bf 0.40\%\)</td>
<td style="text-align: right">\(0.34\%\)<br />\(\bf 0.34\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 2.0\%\)</td>
<td style="text-align: right">\(4.25\%\)<br />\(\bf 4.25\%\)</td>
<td style="text-align: right">\(1.75\%\)<br />\(\bf 1.75\%\)</td>
<td style="text-align: right">\(1.28\%\)<br />\(\bf 1.27\%\)</td>
<td style="text-align: right">\(0.92\%\)<br />\(\bf 0.92\%\)</td>
<td style="text-align: right">\(0.64\%\)<br />\(\bf 0.64\%\)</td>
<td style="text-align: right">\(0.51\%\)<br />\(\bf 0.50\%\)</td>
<td style="text-align: right">\(0.42\%\)<br />\(\bf 0.42\%\)</td>
<td style="text-align: right">\(0.37\%\)<br />\(\bf 0.36\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 3.0\%\)</td>
<td style="text-align: right">\(4.30\%\)<br />\(\bf 4.29\%\)</td>
<td style="text-align: right">\(1.80\%\)<br />\(\bf 1.79\%\)</td>
<td style="text-align: right">\(1.32\%\)<br />\(\bf 1.32\%\)</td>
<td style="text-align: right">\(0.96\%\)<br />\(\bf 0.96\%\)</td>
<td style="text-align: right">\(0.69\%\)<br />\(\bf 0.68\%\)</td>
<td style="text-align: right">\(0.55\%\)<br />\(\bf 0.54\%\)</td>
<td style="text-align: right">\(0.47\%\)<br />\(\bf 0.46\%\)</td>
<td style="text-align: right">\(0.42\%\)<br />\(\bf 0.40\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 4.0\%\)</td>
<td style="text-align: right">\(4.34\%\)<br />\(\bf 4.33\%\)</td>
<td style="text-align: right">\(1.84\%\)<br />\(\bf 1.83\%\)</td>
<td style="text-align: right">\(1.36\%\)<br />\(\bf 1.36\%\)</td>
<td style="text-align: right">\(1.01\%\)<br />\(\bf 1.00\%\)</td>
<td style="text-align: right">\(0.74\%\)<br />\(\bf 0.72\%\)</td>
<td style="text-align: right">\(0.60\%\)<br />\(\bf 0.58\%\)</td>
<td style="text-align: right">\(0.52\%\)<br />\(\bf 0.50\%\)</td>
<td style="text-align: right">\(0.47\%\)<br />\(\bf 0.44\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 5.0\%\)</td>
<td style="text-align: right">\(4.38\%\)<br />\(\bf 4.38\%\)</td>
<td style="text-align: right">\(1.88\%\)<br />\(\bf 1.88\%\)</td>
<td style="text-align: right">\(1.41\%\)<br />\(\bf 1.40\%\)</td>
<td style="text-align: right">\(1.06\%\)<br />\(\bf 1.04\%\)</td>
<td style="text-align: right">\(0.79\%\)<br />\(\bf 0.76\%\)</td>
<td style="text-align: right">\(0.65\%\)<br />\(\bf 0.62\%\)</td>
<td style="text-align: right">\(0.58\%\)<br />\(\bf 0.54\%\)</td>
<td style="text-align: right">\(0.53\%\)<br />\(\bf 0.49\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 7.0\%\)</td>
<td style="text-align: right">\(4.47\%\)<br />\(\bf 4.46\%\)</td>
<td style="text-align: right">\(1.97\%\)<br />\(\bf 1.96\%\)</td>
<td style="text-align: right">\(1.50\%\)<br />\(\bf 1.48\%\)</td>
<td style="text-align: right">\(1.15\%\)<br />\(\bf 1.13\%\)</td>
<td style="text-align: right">\(0.89\%\)<br />\(\bf 0.85\%\)</td>
<td style="text-align: right">\(0.76\%\)<br />\(\bf 0.71\%\)</td>
<td style="text-align: right">\(0.69\%\)<br />\(\bf 0.62\%\)</td>
<td style="text-align: right">\(0.65\%\)<br />\(\bf 0.57\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 10.0\%\)</td>
<td style="text-align: right">\(4.59\%\)<br />\(\bf 4.58\%\)</td>
<td style="text-align: right">\(2.10\%\)<br />\(\bf 2.08\%\)</td>
<td style="text-align: right">\(1.64\%\)<br />\(\bf 1.61\%\)</td>
<td style="text-align: right">\(1.30\%\)<br />\(\bf 1.25\%\)</td>
<td style="text-align: right">\(1.05\%\)<br />\(\bf 0.97\%\)</td>
<td style="text-align: right">\(0.94\%\)<br />\(\bf 0.83\%\)</td>
<td style="text-align: right">\(0.88\%\)<br />\(\bf 0.75\%\)</td>
<td style="text-align: right">\(0.85\%\)<br />\(\bf 0.69\%\)</td>
</tr>
<tr>
<td style="text-align: right">\(p = 20.0\%\)</td>
<td style="text-align: right">\(5.01\%\)<br />\(\bf 5.00\%\)</td>
<td style="text-align: right">\(2.56\%\)<br />\(\bf 2.50\%\)</td>
<td style="text-align: right">\(2.12\%\)<br />\(\bf 2.02\%\)</td>
<td style="text-align: right">\(1.83\%\)<br />\(\bf 1.67\%\)</td>
<td style="text-align: right">\(1.64\%\)<br />\(\bf 1.39\%\)</td>
<td style="text-align: right">\(1.57\%\)<br />\(\bf 1.25\%\)</td>
<td style="text-align: right">\(1.55\%\)<br />\(\bf 1.17\%\)</td>
<td style="text-align: right">\(1.54\%\)<br />\(\bf 1.11\%\)</td>
</tr>
</tbody>
</table>
</div>
<div class="centered">
<img src="/img/afapprox.svg" alt="AF" style="width:70%;" />
<br />
Our linear approximation \(\frac{1 + np/2}{12n}\), solid, and the true annuity factor
\(\af(\sqrt[12]{1 + p} - 1, 12n)\), dashed, for different number of years \(n\).
<a href="/img/afapprox.gp">Gnuplot source</a>
</div>
<p>For low interest rates \(p\) our approximation is rather good. For
larger rates over many years it starts underestimating the true
factor. In this regime, using \(0.6\) or higher in place of \(1/2\)
will yield better results.</p>
<p>This doesn’t mean you should invest in real estate though.</p>
Mon, 19 Aug 2019 23:25:06 +0000
https://heiner.ai/blog/2019/08/19/annuity-loans.html
Dylanchords
<p>Ahh… dylanchords.com. A long time ago, when the internet was in its
puberty and I had just rid myself of »Music« as a subject in school,
you were there for me. A website in the style of the times, which
filled my youthful admiration of Dylan with an actual, achievable,
purpose: To teach myself to play the guitar (after a fashion). I
scanned those songs for the <a href="http://dylanchords.info/roadmaps.htm#fingering">evil F
chord</a> and other
cryptic nastiness (Bm7-5?!), printed them out, and tried to recreate one
after another (after a fashion). Soon enough I tuned
my borrowed guitar in Open D tuning and played a heartfelt
<em>Buckets of Rain</em>. I felt sad but proud, the family went from
amused to annoyed, and my guitar skills from pre-beginner to amateur.</p>
<p>The guitar, Dylan, and dylanchords.com would stay with me through
university. Printing out old-style HTML would take a detour through an
arguably more <a href="https://en.wikipedia.org/wiki/TeX">old-styled system</a>,
and that mere process helped convince a number of people that I had
sufficient interest in »culture« to be worth their time. Today,
the <a href="http://kuettler.org/seal/">resulting</a> »project« is dormant, and
made obsolete by more <a href="http://www.apple.com/iphone/">recent</a>
<a href="https://madeby.google.com/phone/">technology</a>.</p>
<p>And yet, dylanchords stayed with me. The site effectively moved from
the .com domain to another one for reasons that don’t seem to matter so
much now, and these days, picking up an instrument isn’t something I do
daily anymore. Or even weekly. But dylanchords still seems to
matter. Recently, even the choice of subject has <a href="https://www.nobelprize.org/nobel_prizes/literature/laureates/2016/dylan-facts.html">arguably been
vindicated</a>.</p>
<p>It being Christmas today, it feels right to thank the creator. No, not
that one. Eyolf Østrem has »tabbed« around 900 songs, with patience
and skill that others couldn’t even think of displaying, not to
mention giving away all this work for free. <em>Thank you</em>, Eyolf
Chordmaster. We don’t always
<a href="http://oestrem.com/thingstwice/2010/06/neighbourhood-bully-indeed/">agree</a>. It
doesn’t matter. Thanks for your work. It inspired me, I’m sure it
inspired many others.</p>
<p>Merry Christmas.</p>
<p>For further convenience, I’ve started hosting a Jekyll-generated,
mobile-enabled version of Eyolf’s chordwork at
<a href="http://dylaniki.org">dylaniki.org</a> (<a href="https://github.com/heiner/dylaniki/tree/gh-pages">code at
github</a>). I find it
useful for looking up songs from the phone, and maybe others feel the same.</p>
Sun, 25 Dec 2016 13:50:34 +0000
https://heiner.ai/blog/2016/12/25/dylaniki.html
In every beginning there is a delusion
<p>So… somewhere in the order of 15 years too late I’ve also taken up
this blogging thing. At least for a while. <a href="http://karpathy.github.io/">Karpathy’s
blog</a> and the elegance of
<a href="http://jekyllrb.com">Jekyll</a> finally convinced me.</p>
Sun, 21 Aug 2016 23:31:34 +0000
https://heiner.ai/blog/2016/08/21/in-every-beginning-there-is-a-delusion.html
https://heiner.ai/blog/2016/08/21/in-every-beginning-there-is-a-delusion.html