On linear regression
[Edit 14 May 2020: Thanks to Marcus Waurick for pointing out needs to be closed for Lemmas 1 and 2 to be true in general. Unfortunately, there seems no easy way of doing linear regression in general Hilbert spaces (no equivalent of , although the idea of using weighted spaces and writing a blog post on Linear Regression in Hilbert spaces with Morgenstern norm has a certain ring to it.]
Apropos of nothing, this is a post about linear regression. There’s a whole world of information on that out there, but little with the right point of view: The Hilbert space perspective. That might not be universally loved, so you might feel more comfortable reading , and “matrix” in place of “linear operator” below.
We’ll do the short theory up front, then look at an example, then say something about “fraction of variance explained”. A later blog post will discuss the CAPM, FrenchFama models and efficient markets as examples.
Two lemmas.
Let , be Hilbert spaces and be a closed linear operator.
Lemma 1. in the sense of direct sums of Hilbert spaces.
Proof Let , . There is a with and
Thus .
Reversely let . Then for all , and thus . //
Lemma 2. Let be the orthogonal projection.
(a) Let . Then there is a such that and . This is called the normal equation.
(b) If , i.e., injective, then .
Proof. (a) There is with . Also , and thus by Lemma 1
(b) Lemma 1 implies , i.e. is surjective. Since and evidently is injective, that mapping is bijective. Additionally is bijective, and thus is bijective and invertible. Using (a) it follows that . //
Linear regression as projection to subspaces.
Example: Warships
To take a favorite example from German Wikipedia, let’s discuss warships from the Second World War. Let’s say we’ve got , a list of lengths of warships, a list of their beams (widths) and a list of their drafts (distance from bottom to waterline), all in meters. We call and feature vectors (also known as explanatory or independent variables) and can safely assume they are linearly independent, such that the matrix
defines an injective operator. Its image is the linear span of and , i.e., . We wouldn’t expect to be an element of , but the hope of a linear regression model trying to predict a warship’s draft from its length and beam is that is not too far from as a subspace of either. Specifically, the distance is
where is the orthogonal projection. We can call the model’s prediction, and Lemma 2 above tells us how to compute from , namely . Moreover, the normal equation tells us which linear combination^{1} of and computes :
Thus, if we are given an additional warship’s length and beam as a vector , the model’s prediction of its draft will be .
We can implement this idea in simple Python code (Colab link):
import numpy as np
# Lengths, beams and drafts of 10 warships
data = """
187 31 9.4
242 31 9.6
262 32 8.7
216 32 9
195 32 9.7
227 31 11
193 20 5.2
175 17 5.2
""".split("\n")
lengths, beams, drafts = np.loadtxt(data, unpack=True)
A = np.stack([lengths, beams], axis=1)
b = np.linalg.inv(A.T @ A) @ A.T @ drafts
print("b:", b)
print("model prediction:", np.array([200, 20]) @ b)
This will print
b: [0.00539494 0.34045321]
model prediction: 5.730077087608246
So our model would predict a hypothetical warship with length 200m and beam 20m to have 5.73m of draft.
Notice that the way we built our model makes it predict a draft of zero for the (nonsensical) inputs of a warship of zero meters length or beam. While this seems fine enough in this case, it’s not in others. A simple way of fixing this is to add an “intercept” (in machine learning known as “bias”) term: Define and add that as an additional (say, first) “feature” vector. This turns our model into an affine function of its data inputs. Doing this in our Python example changes little:
A = np.stack([np.ones_like(lengths), lengths, beams], axis=1)
b = np.linalg.inv(A.T @ A) @ A.T @ drafts
print("b:", b)
print("model prediction:", np.array([1, 200, 20]) @ b)
Result:
b: [ 0.09353128 0.00580401 0.34027063]
model prediction: 5.738140993529072
However, trying to predict crew sizes changes that, see the Colab for details.
Of course, we can use more than two or three feature vectors as part of the design matrix , as long as they are linearly independent, which is is typically the case in practice with enough examples .
Using a matrix in place of the same math allows us to succinctly treat the “multivariate” case in which we’re trying to predict more than one “dependent variable”, e.g. a warship’s draft and the size of its crew. This is just doing more than one linear regression at once.
and all that
As mentioned above, the distance gives us a sense of the quality of our model’s predictions when applied to the data we built it on. The textbooks, being obsessed with elementwise expressions, call the residual sum of squares.
While this quantity (or a normalized version of it) is a measure of the error our model produces on the data we know, it doesn’t tell us if it was our input data that was useful in particular. To quantify that, we can compare it with a minimal, “inputless” model built from only the intercept entry, i.e., using . The prediction of that model will be the mean of the data, , and the orthogonal projection on is . The squared distance of that minimal model’s predictions from the data is known as the total sum of squares. It’s also the unnormalized variance of the data. If our original model used an intercept term, i.e., , then and therefore . Hence, the Pythagorean identity tells us
The last term is known somewhat factitiously as the explained sum of squares with the idea that it measures deviations from the mean caused by the data.
The ratio is called the fraction of variance unexplained, although one has to be careful with that terminology. The coefficient of determination is one minus that number, or
where the last equality is true if and only if .
Adjustments. Since including more features in our model matrix will make a larger subspace of , the error will be monotonically decreasing and monotonically increasing. While this offers the opportunity to claim to “explain” more, it might in fact make our model’s predictions on new inputs worse. Various solutions for this have been proposed. One somewhat principled approach is to use “adjusted ”, defined as
where is the number of features, i.e., if we use an intercept term. This substitutes the unbiased sample variance and error estimators for their biased versions.
That’s it for now. We’ll add a proper example from finance in a future blog post.

In situations with very many feature vectors, computing the inverse may no longer be the best way of finding . Instead, one could try to minimize in another way, e.g. via gradient descent. This is how “linear layers” in machine learning are trained. ↩