Linear Algebra for Those Who Know Linear Algebra

1 Introduction

This book is targeted at an audience who has already had an intro to linear algebra course, but may not have the sense that they’ve mastered the topic. The intended audience may not even remember linear algebra well. For those readers, we have some review sections in the appendix. (The section on bra-ket notation is not usually covered in an intro course, but is also indispensable for the material.)

You’d normally spend a lot of time in an intro to linear algebra class doing calculations, including reduced row echelon form; inverses; eigenvalues and eigenvectors; and Cramer’s rule. We skip these topics, even in the review section. This book covers a lot of applications of linear algebra, and we accept that many of the interested readers will be practitioners. So we don’t need to pretend that these things can’t be easily calculated with a computer program. I won’t go so far as to say that there’s no educational value in doing some of these calculations by hand, but given that you’ve successfully completed a linear algebra class, I give you permission to forget them.

This book attempts to put linear algebra concepts in a more intuitive context. As such it covers most topics that could be called linear algebra. But it doesn’t cover everything. Some very rich and interesting advanced topics that are not covered here include:

Finally, some good news for some of you. I’ve written this book to avoid talking about complex numbers. It’s surprising, but I’ve pulled it off. With the approach I’ve taken, considering complex numbers only adds a bunch of pesky qualifiers to any statement I make. So throughout the book, we’ll pretend that these don’t even exist.

Chapters labeled [WIP] are not available currently.

2 Singular Value Decomposition

2.1 Statement of SVD

We start the book with the most remarkable fact of linear algebra: For any matrix \(A\), there exists a singular value decomposition (SVD). The SVD of a matrix \(A\) is three matrices, \(U\), \(D\), and \(V\), where

\[A=UDV^T,\]

with \(U\) and \(V\) orthogonal matrices and \(D\) diagonal in the rectangular sense described below.

(See Section [Orthogonal Matrices] for a review on orthogonal matrices.)

If \(A\) is dimension \(m\times n\), then \(U\) is dimension \(m\times m\), \(V\) is dimension \(n\times n\), and \(D\) is dimension \(m\times n\). \(D\) is zero everywhere except potentially at \(D_{ii}\) for \(1\leq i\leq\min(m, n)\). By convention, we sometimes call \(\lambda_i:=D_{ii}\). As well, we’ll sometimes call \(r:=\min(m, n)\).

Example

Let’s whip up an example real quick to demonstrate. We start with a matrix chosen arbitrarily and show that it has an SVD.

\[\left( \begin{array}{ccc} 1 & 1 & 2 \\ 3 & 5 & 8 \end{array} \right) = UDV^T,\]

where

\[\begin{array}{rcl} U &=& \left( \begin{array}{cc} -0.2381 & -0.9712 \\ -0.9712 & 0.2381 \end{array} \right)\\ \\ D&=&\left( \begin{array}{ccc} 10.192 & 0 & 0 \\ 0 & 0.3399 & 0 \end{array} \right)\\ \\ V^T&=&\left( \begin{array}{ccc} -0.3092 & -0.4998 & -0.8090 \\ -0.7557 & 0.6456 & -0.1100 \\ -0.5774 & -0.5774 & 0.5774 \end{array} \right) \end{array}\]

This calculation was done with a computer; the numbers are rounded here. The reader should verify that:

  1. The matrices \(U\), \(D\), and \(V^T\) multiply to the original matrix.

  2. The two outside matrices are orthogonal. That is, the rows are orthonormal.
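If you’d like to check these claims by computer, here is a minimal sketch in Python with NumPy (my choice of tool, not something the text assumes). It computes an SVD of the matrix above, rebuilds the matrix from the three factors, and checks that the outer factors are orthogonal. The signs of corresponding columns of \(U\) and rows of \(V^T\) may come out flipped relative to the numbers printed above; that is part of the non-uniqueness discussed next.

```python
import numpy as np

A = np.array([[1.0, 1.0, 2.0],
              [3.0, 5.0, 8.0]])

# NumPy returns U, the singular values as a vector, and V^T
U, s, Vt = np.linalg.svd(A)

# Rebuild the m x n "diagonal" matrix D from the singular values
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ D @ Vt, A))         # 1. the factors multiply back to A
print(np.allclose(U @ U.T, np.eye(2)))    # 2. U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  #    and so is V
```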

At this point, this is a neat party trick. It’s not at all clear yet that such a decomposition can be found for any matrix.

The SVD of a given matrix is generally not unique. This fact causes some difficulties. Because the SVD is not unique, we usually say that a decomposition is an SVD and not the SVD. We show later in the chapter that SVDs are almost unique, and in practice we usually only need some SVD.

Example

To give an example of non-uniqueness, let’s look at the matrix:

\[\left(\begin{array}{ccc}1&0&0\\0&0&0\\0&0&0\end{array}\right)\]

This can be decomposed as:

\[\left(\begin{array}{ccc}1&0&0\\0&1&0\\0&0&1\end{array}\right) \left(\begin{array}{ccc}1&0&0\\0&0&0\\0&0&0\end{array}\right) \left(\begin{array}{ccc}1&0&0\\0&1&0\\0&0&1\end{array}\right)\]

Or as:

\[\left(\begin{array}{ccc}1&0&0\\0&\sqrt{1/2}&\sqrt{1/2}\\0&\sqrt{1/2}&-\sqrt{1/2}\end{array}\right) \left(\begin{array}{ccc}1&0&0\\0&0&0\\0&0&0\end{array}\right) \left(\begin{array}{ccc}1&0&0\\0&\sqrt{1/2}&\sqrt{1/2}\\0&\sqrt{1/2}&-\sqrt{1/2}\end{array}\right)\]

If you follow the multiplication through, you can see that for the outside matrices, everything except the first row and column gets ignored. We must choose orthonormal vectors for the other two rows and columns in order to satisfy the orthogonality condition of the SVD. But outside of that constraint, we’re free to do whatever we want with these other rows and columns.
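As a quick numerical check of this freedom (again a NumPy sketch, with the two factorizations typed in by hand), both triples multiply back to the same matrix, and both outer factors are orthogonal:

```python
import numpy as np

A = np.diag([1.0, 0.0, 0.0])
r = np.sqrt(0.5)

U1 = V1 = np.eye(3)
U2 = V2 = np.array([[1, 0,  0],
                    [0, r,  r],
                    [0, r, -r]])
D = np.diag([1.0, 0.0, 0.0])

for U, V in [(U1, V1), (U2, V2)]:
    assert np.allclose(U @ D @ V.T, A)      # both reproduce A
    assert np.allclose(U.T @ U, np.eye(3))  # both outside matrices are orthogonal
```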

Low-rank matrices are one way that uniqueness can fail, but not the only way. See Section 2.3 for a more complete discussion.

Soon we’ll show that the SVD exists. Said another way: We’ll show there exist bases in the domain and range for which \(A\) is diagonal. (See Section 9.3 for a review of this concept.) On its face, this is a simple-yet-powerful theorem that doesn’t have the usual qualifiers seen in linear algebra. It doesn’t require complex numbers, and it’s one of the few theorems that apply cleanly to rectangular matrices. But beyond these surface advantages, we’ll see that the SVD is widely applicable. In fact, it’s so applicable that we make it a cornerstone of the book. The SVD is critical to understanding linear algebra, because it expresses something fundamental about linear transformations.

After a brief detour to discuss bra-ket form, we’ll spend the rest of this section exploring that fundamentality. We will view the SVD through three major lenses: first by viewing matrices as collections of eigenvalues and eigenvectors, then by viewing matrices as products of vectors and covectors, and finally by viewing matrices as bilinear forms.

2.1.1 Bra-ket Form

This section uses Bra-ket Notation heavily. If you’re unfamiliar with this form, please read Section [Bra-ket Notation] before proceeding.

An often more-convenient way to write the SVD is:

\[A=\sum_{i=1}^r\lambda_i| u _i\rangle\langle v _i|\]

Here \(| u _i\rangle\) are the columns of \(U\) and \(\langle v _i|\) are the rows of \(V^T\).

Theorem 2.1. Any matrix \(A\) can be written as: \[A=\sum_{i=1}^r\lambda_i| u _i\rangle\langle v _i|\]

Proof. Say \(A\) is \(m\times n\) dimensional with \(r=\min(m, n)\) as usual. We write the matrices as though \(m<n\), but other cases work out the same way. Write the SVD of \(A\) as:

\[A = UDV^T = \left[ \begin{array}{cccc} \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex}\\ |u_1\rangle & |u_2\rangle & \cdots & |u_m\rangle \\ \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex} \end{array} \right] \left[ \begin{array}{cccccc} \lambda_1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & \lambda_r & \cdots & 0 \\ \end{array} \right] \left[ \begin{array}{ccc} \rule[.5ex]{2.5ex}{0.5pt}& \langle v_1| & \rule[.5ex]{2.5ex}{0.5pt}\\ \rule[.5ex]{2.5ex}{0.5pt}& \langle v_2|& \rule[.5ex]{2.5ex}{0.5pt}\\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.5pt}& \langle v_n|& \rule[.5ex]{2.5ex}{0.5pt} \end{array} \right]\]

We can write the diagonal matrix as:

\[D = \sum_{i=1}^r \left[\begin{array}{ccccc} 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_i & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 \\ \end{array}\right]\]

Each summand is the diagonal matrix with all but the \(i\)-th entry zeroed out. We distribute the \(U\) on the left and \(V^T\) on the right, to end up with:

\[\begin{array}{rcl} A &=& \sum_{i=1}^r\left[ \begin{array}{cccc} \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex}\\ | u _1\rangle & | u _2\rangle & \cdots & | u _m\rangle \\ \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex} \end{array} \right] \left[\begin{array}{ccccc} 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_i & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 \\ \end{array}\right]\left[ \begin{array}{ccc} \rule[.5ex]{2.5ex}{0.5pt}& \langle v _1| & \rule[.5ex]{2.5ex}{0.5pt}\\ \rule[.5ex]{2.5ex}{0.5pt}& \langle v _2|& \rule[.5ex]{2.5ex}{0.5pt}\\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.5pt}& \langle v _n|& \rule[.5ex]{2.5ex}{0.5pt} \end{array} \right] \\ &=& \sum_{i=1}^r\left[ \begin{array}{cccc} \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex}\\ | u _1\rangle & | u _2\rangle & \cdots & | u _m\rangle \\ \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex} \end{array} \right] \left[ \begin{array}{ccc} \rule[.5ex]{2.5ex}{0.5pt}& 0 & \rule[.5ex]{2.5ex}{0.5pt}\\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.5pt}& \lambda_i\langle v _i|& \rule[.5ex]{2.5ex}{0.5pt}\\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.5pt}& 0& \rule[.5ex]{2.5ex}{0.5pt} \end{array} \right] \\ &=& \sum_{i=1}^r\left[ \begin{array}{ccc} \lambda_i| u _i\rangle_1\langle v _i|_1 & \lambda_i| u _i\rangle_1\langle v _i|_2 & \cdots \\ \lambda_i| u _i\rangle_2\langle v _i|_1 & \lambda_i| u _i\rangle_2\langle v _i|_2 & \cdots \\ \vdots & \vdots & \ddots \end{array} \right] \end{array}\]

When you compare this final form to the definition of the outer product, you see that it equals \(\sum_{i=1}^r\lambda_i| u _i\rangle\langle v _i|\). ◻

The reader should pause to appreciate that although the matrix is \(m\times n\), only \(r=\min(m, n)\) summands are needed. What happened to the remaining \(n-m\) rows of \(V^T\)?

Because of the orthogonality of \(U\) and \(V\), the covectors \(\left\{\langle v _i|\right\}\) form a basis of a subspace of the domain, and the vectors \(\left\{| u_i \rangle\right\}\) form a basis of a subspace of the codomain. (See Exercises [Ex-SVD Null Space] and [Ex-SVD Range] for a more precise statement.) The orthogonality of these vectors is very important; we use that fact constantly throughout the book.

To see why this form is convenient, we’ll show a quick proof that will be used later.

Lemma 2.2. Given an SVD, \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), then

\[A^TA=\sum_i\lambda_i^2|v_i\rangle\langle v_i|\]

Proof. Transposition distributes over sums. As well, \((AB)^T=B^TA^T\), so each summand satisfies \(\left(|u_i\rangle\langle v_i|\right)^T=|v_i\rangle\langle u_i|\). So we can expand \(A^TA\) as:

\[\begin{array}{rcl} A^TA&=& \left(\sum_i\lambda_i|v_i\rangle\langle u_i|\right) \left(\sum_j\lambda_j|u_j\rangle\langle v_j|\right) \\ &=&\sum_i\sum_j\lambda_i\lambda_j|v_i\rangle\langle u_i|u_j\rangle\langle v_j| \end{array}\]

Now we look closer at the middle term \(\langle u_i|u_j\rangle\). Because the vectors \(|u_j\rangle\) are orthonormal, this product equals \(0\) when \(i\neq j\) and \(1\) when \(i=j\). So the double sum, which loops over all possible values of \(i\) and \(j\), ends up with non-zero terms only where \(i=j\).

\[A^TA = \sum_i\lambda_i\lambda_i|v_i\rangle\langle u_i|u_i\rangle\langle v_i| = \sum_i\lambda_i^2|v_i\rangle\langle v_i|\] ◻
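For readers who like to verify identities numerically, here is a small NumPy sketch of Lemma 2.2 (the random test matrix is an arbitrary choice): build \(A^TA\) directly and also from the singular values and the rows of \(V^T\), and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # Vt has r = min(3, 5) rows

# Sum of lambda_i^2 |v_i><v_i|, built from the rows of V^T
AtA = sum(s[i]**2 * np.outer(Vt[i], Vt[i]) for i in range(len(s)))

print(np.allclose(A.T @ A, AtA))  # matches the direct product A^T A
```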

We will call the decomposition \(\sum_i\lambda_i|u_i\rangle\langle v_i|\) the bra-ket form (or summation form) of the SVD; the \(UDV^T\) form is called the matrix form of the SVD. The steps from above can be done in reverse to go from bra-ket form to matrix form.

Example

Given an SVD in bra-ket form \(A=5|u_1\rangle\langle v_1|+2|u_2\rangle\langle v_2|\), where

\[\begin{array}{rcl} \vec u_1 &=& (1, 0) \\ \vec v_1 &=& (\sqrt{1/2}, -\sqrt{1/2}) \\ \vec u_2 &=& (0, -1) \\ \vec v_2 &=& (\sqrt{1/2}, \sqrt{1/2}) \end{array}\]

Then \(A=UDV^T\) where

\[\begin{array}{rcl} U &=& \left( \begin{array}{cc} 1 & 0 \\ 0 & -1 \end{array} \right)\\ \\ D&=&\left( \begin{array}{cc} 5 & 0 \\ 0 & 2 \end{array} \right)\\ \\ V^T&=&\left( \begin{array}{cc} \sqrt{1/2} & -\sqrt{1/2} \\ \sqrt{1/2} & \sqrt{1/2} \end{array} \right) \end{array}\]

Notice that we can just slot the vectors and scalars into their corresponding spots. We end up with the matrix form of the SVD. In Exercise [Ex-Confirm SVD], you’ll be asked to confirm these calculations.
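The same slotting-in is easy to mirror numerically. In this NumPy sketch (variable names follow the example above), we build \(A\) from the bra-ket form and from the matrix form and confirm they agree:

```python
import numpy as np

r = np.sqrt(0.5)
u1, v1 = np.array([1.0, 0.0]),  np.array([r, -r])
u2, v2 = np.array([0.0, -1.0]), np.array([r,  r])

# Bra-ket (summation) form
A_sum = 5 * np.outer(u1, v1) + 2 * np.outer(u2, v2)

# Matrix form: the vectors become columns of U and V, the scalars the diagonal of D
U = np.column_stack([u1, u2])
V = np.column_stack([v1, v2])
D = np.diag([5.0, 2.0])

print(np.allclose(A_sum, U @ D @ V.T))
```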

There is a caveat here when the matrix has low rank, which the next example illustrates.

Example

Let’s return to the rank-one matrix from the non-uniqueness example.

\[A=\left(\begin{array}{ccc}1&0&0\\0&0&0\\0&0&0\end{array}\right)\]

We can write this using the bra-ket form as:

\[A=\left(\begin{array}{c}1\\0\\0\end{array}\right)\left(\begin{array}{ccc}1&0&0\end{array}\right) = |e_1\rangle\langle e_1|\]

The right-most expression uses the convention that \(\vec e_i\) is a vector with a \(1\) in the \(i\)-th position and \(0\) elsewhere.

In most cases, the bra-ket form of a \(3\times3\) matrix will have three summands, but that didn’t happen for this example. We can think of this as having two extra terms with coefficients \(\lambda_2=\lambda_3=0\). In that case, you can get creative with what you choose for \(|u_2\rangle\langle v_2|\) and \(|u_3\rangle\langle v_3|\). As long as the vectors are orthonormal, this is a valid SVD. Filling in arbitrary orthonormal vectors like this is necessary to convert back to the matrix form, as we did above. (See Exercise [Ex-Rectangular bra-ket to matrix].)

2.1.2 Eigenvalues and Eigenvectors

A basic fact about linear maps is that they’re often invariant along certain axes. That is, \(A\vec v=\lambda\vec v\) for some \(\vec v\) and \(\lambda\). This is a somewhat remarkable fact. For example, consider a map, \(A\), that sends \((1, 0)\) to \((5, 1)\) and \((0, 1)\) to \((1, 5)\). Imagine picking up the x-y plane at the points \((1, 0)\) and \((0, 1)\) and dragging them to their respective targets. As you drag these points, you stretch some parts of the plane and squeeze other parts. Yet the squeezing cancels out along the line \(y=x\), and this axis only gets stretched out radially. The second and fourth quadrants get stretched apart, causing a similar cancellation effect along the line \(y=-x\). This example is easy to visualize. Yet even when you do fairly complicated linear transformations with many dimensions, usually you’ll get some axes somewhere where all the pulling and pushing cancel out.


The fact that eigenvalues usually exist isn’t obvious, and it’s a fact that people can’t shut up about. If you remember one thing from linear algebra, it’s probably eigenvalues. Eigenvalues and eigenvectors are important. But talking about eigenvalues is difficult and requires a lot of caveats. Eigenvalues only exist for square matrices, and they are sometimes complex. Defective matrices don’t have a "full set" of eigenvectors. (Defective matrices are covered much later in the book.) The SVD doesn’t have any of these qualifiers. We’ll see throughout this book that for many applications of eigenvalues, the SVD will work instead. Look to Section 2.5 for a couple of immediate examples.

It’s not clear yet why the SVD is analogous to an eigendecomposition. To demonstrate: Say a matrix \(A\) has a set of eigenvectors, \(|v_i\rangle\), that span the domain, with corresponding eigenvalues \(\left\{\lambda_i\right\}\). Define a linear map \(P\) by the actions \(P|e_i\rangle=|v_i\rangle\), and define \(D\) as a diagonal matrix with \(\lambda_i\) as the \(i\)-th entry on the diagonal. Then \(A=PDP^{-1}\). This must be true because it’s true for every eigenvector, and every vector is a linear combination of eigenvectors (because they span):

\[PDP^{-1}|v_i\rangle = PD|e_i\rangle = \lambda_iP|e_i\rangle = \lambda_i|v_i\rangle = A|v_i\rangle\]

This equality makes sense. \(P^{-1}\) maps the eigenvectors of \(A\) to the standard basis; \(D\) stretches invariantly along each standard basis axis; and \(P\) maps the standard basis back. The net effect is that the invariant stretching from \(D\) gets moved to the eigenvectors. An SVD, \(UDV^{-1}\), on the other hand rotates, stretches along the standard axes, then rotates again into a (potentially) new orientation. (In the last sentence, we used \(V^T=V^{-1}\), which is true for all orthogonal matrices; we also called the orthogonal matrices "rotations", which is a good way to think of orthogonal matrices.) We can see that \(PDP^{-1}\) and \(UDV^{-1}\) are just two decompositions of matrices. The former, called a diagonalization, requires the outside matrices to be inverses of each other, while the SVD requires the outside matrices to be orthogonal. Diagonalizations exist sometimes, whereas SVDs exist all the time. We’ll see in Section 2.4 that these two concepts are intimately connected. See Chapter [Symmetric Matrices] for a discussion of how the SVD is also a diagonalization.
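To make the comparison concrete, here is a NumPy sketch using the matrix from the eigenvector discussion above (NumPy is just one convenient tool here): diagonalize \(A\) as \(PDP^{-1}\), factor it as \(UDV^T\), and confirm both reproduce \(A\).

```python
import numpy as np

A = np.array([[5.0, 1.0],
              [1.0, 5.0]])

# Diagonalization: columns of P are eigenvectors, D holds the eigenvalues
evals, P = np.linalg.eig(A)
print(np.allclose(P @ np.diag(evals) @ np.linalg.inv(P), A))

# SVD: the outside matrices are orthogonal instead of merely invertible
U, s, Vt = np.linalg.svd(A)
print(np.allclose(U @ np.diag(s) @ Vt, A))
print(np.allclose(U.T @ U, np.eye(2)), np.allclose(Vt @ Vt.T, np.eye(2)))
```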

2.1.3 Vectors and Covectors

One view of matrices is that they’re built from products of vectors and covectors. The columns of \(A\) tell us how it acts on the standard basis vectors. Call the \(i\)-th column of \(A\) \(|A_{*i}\rangle\). Then,

\[A = \left[ \begin{array}{cccc} \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex}\\ |A_{*1}\rangle & |A_{*2}\rangle & \cdots & |A_{*n}\rangle \\ \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex} \end{array} \right] = \left[ \begin{array}{cccc} \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex}\\ A|e_1\rangle & A|e_2\rangle & \cdots & A|e_n\rangle \\ \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex} \end{array} \right]\]

Any vector \(\vec v=\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}\) can be written as \(|v\rangle = \sum_i v_i|e_i\rangle\). So to figure out the effect of \(A\) on \(|v\rangle\), we only need to find each component \(v_i\) of \(|v\rangle\) and multiply by the corresponding column \(|A_{*i}\rangle\). That is,

\[\begin{array}{rcl} A|v\rangle &=& A\sum_iv_i|e_i\rangle \\ &=& \sum_iv_iA|e_i\rangle \\ &=& \sum_i|A_{*i}\rangle v_i \end{array}\]

In the last line, we’ve written the coefficient after the vector to hint at the next step. In the next step, we observe that \(\langle e_i|v\rangle=\sum_j v_j\langle e_i|e_j\rangle = v_i\), where the last equality follows from the orthonormality of the standard basis. Now we can write the above equality as

\[\begin{array}{rcl} A|v\rangle &=& \sum_i|A_{*i}\rangle\langle e_i|v\rangle \end{array}\]

Comparing both sides, we can now say that \(A=\sum_i|A_{*i}\rangle\langle e_i|\). This rewriting recognizes that \(A\) is a sum of vector/covector products. The covectors \(\left\{\langle e_i|\right\}\) project onto the standard axes to get a scalar. Then the vectors \(\left\{|A_{*i}\rangle\right\}\) give direction to each of these scalars. This is an important way to understand matrices.
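A short NumPy sketch of this column/covector decomposition, with an arbitrary small matrix: summing the outer products of each column with the corresponding standard basis covector recovers \(A\).

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
n = A.shape[1]
I = np.eye(n)

# A = sum_i |A_{*i}><e_i|
A_rebuilt = sum(np.outer(A[:, i], I[i]) for i in range(n))
print(np.allclose(A_rebuilt, A))
```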

But by studying linear algebra, you learn that there’s nothing special about the standard basis \(\left\{|e_i\rangle\right\}\). For any orthonormal basis \(\left\{| v _i\rangle\right\}\), we can write the matrix \(A\) with this basis on the domain. (See Section [Change of Bases] for a more detailed discussion of bases.) Call the columns of this new matrix \(\left\{| u _i\rangle\right\}\), and it follows that \(A=\sum_i| u _i\rangle\langle v _i|\). Although \(\left\{\langle v _i|\right\}\) are orthonormal, the same is not usually true of the resulting \(\left\{| u _i\rangle\right\}\). The existence of an SVD is the existence of a domain basis \(\left\{\langle v _i|\right\}\) for which the corresponding \(\left\{| u _i\rangle\right\}\) are also orthogonal, though not necessarily normalized. We can write \(|u_i\rangle=\lambda_i|u_i'\rangle\), so that \(\left\{u_i'\right\}\) are orthonormal. The basis \(\left\{\langle v _i|\right\}\) that produces orthogonal columns is a very special basis. There’s no reason to prefer the standard basis over any other basis when representing \(A\). So we might as well choose \(\left\{\langle v _i|\right\}\), giving \(A=\sum_i\lambda_i|u_i'\rangle\langle v_i|\), the SVD. We’ll see throughout the course of this book that this choice simplifies many things.

I want to revisit the equivalence of the two forms of the SVD with this new language. Given a \(UDV^T\) representation of a matrix, \(A\), we can write \(V^T=\sum_i|e_i\rangle\langle v_i|\), where the \(|v_i\rangle\) are the rows of \(V^T\), and we can write \(U=\sum_i|u_i\rangle\langle e_i|\) where the \(|u_i\rangle\) are the columns of \(U\). And of course \(D=\sum_i\lambda_i|e_i\rangle\langle e_i|\), where \(\lambda_i\) are the diagonal components of \(D\). Now we have

\[A=UDV^T=\left(\sum_i|u_i\rangle\langle e_i|\right)\left(\sum_j\lambda_j|e_j\rangle\langle e_j|\right)\left(\sum_k|e_k\rangle\langle v_k|\right)=\sum_i\lambda_i|u_i\rangle\langle v_i|\]

where the last equality follows from the orthonormality of \(|e_i\rangle\).

2.1.4 Coordinates

In this section we continue to build intuition.

Coordinate Systems Consider a basis, \(\mathcal B\), of \(\mathbb R^3\) given by \(\vec v_1=\langle 1/\sqrt2, 1/\sqrt2, 0\rangle\), \(\vec v_2=\langle 1/\sqrt2, -1/\sqrt2, 0\rangle\), \(\vec v_3=\langle 0, 0, -1\rangle\). Because \(\mathcal B\) is a basis, we know that any vector in \(\mathbb R^3\) can be written as a linear combination of these three vectors. For example, \(\langle 1, 2, 3\rangle = 3/\sqrt2 \vec v_1-1/\sqrt2\vec v_2-3\vec v_3\). Occasionally, this linear combination is written in its own vector notation, with a special \(\mathcal B\) subscript: \(\langle 1, 2, 3\rangle=\langle 3/\sqrt2, -1/\sqrt2, -3\rangle_{\mathcal B}\).

The purpose of this short section is to make the connection that both \(\langle 1, 2, 3\rangle\) and \(\langle 3/\sqrt2, -1/\sqrt2, -3\rangle_{\mathcal B}\) are equally legitimate representations of the same vector. This is a conceptual decoupling between a vector and the coordinates we use to write it. The observation is akin to the realization that the number \(14\) can be written in binary as \(1110\); the number itself is not either of these things, but is merely represented by them.

And if any choice of coordinates is equally legitimate, we may as well choose the bases given by the SVD, \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\). That is, if \(\mathcal B_u=(|u_1\rangle, |u_2\rangle, \cdots, |u_n\rangle)\) and \(\mathcal B_v=(|v_1\rangle, |v_2\rangle, \cdots, |v_n\rangle)\) then \(M\langle x_1, x_2, \cdots x_n\rangle_{\mathcal B_v}=\langle \lambda_1x_1, \lambda_2x_2, \cdots, \lambda_nx_n\rangle_{\mathcal B_u}\). So the existence of an SVD says that any matrix is merely a dilation under the correct coordinate systems. Quite a surprising fact!
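Here is a NumPy sketch of this dilation-in-the-right-coordinates reading (the random square matrix is incidental): the \(\mathcal B_v\)-coordinates of \(\vec x\) are \(V^T\vec x\), and the \(\mathcal B_u\)-coordinates of \(M\vec x\) are those same coordinates scaled by the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
U, s, Vt = np.linalg.svd(M)

x = rng.standard_normal(3)
coords_v = Vt @ x          # coordinates of x in the basis {v_i}
coords_u = U.T @ (M @ x)   # coordinates of M x in the basis {u_i}

print(np.allclose(coords_u, s * coords_v))  # M acts as a dilation by the lambda_i
```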

We can take this notation a bit further even.

Functions on Coordinate Systems You and two friends, Leeta and Odo, are doing some classical physics problem, and have to draw your own axes to solve the problem. You correctly label the canonical \(x\) and \(y\) axes, \(\langle 1,0\rangle\) and \(\langle 0,1\rangle\). But Leeta is looking at the problem from an angle, so she ends up with axes \(\langle 1/\sqrt2, 1/\sqrt2\rangle\) and \(\langle -1/\sqrt2, 1/\sqrt2\rangle\). And Odo is left-handed, so he ends up with axes \(\langle 0,1\rangle\) and \(\langle 1,0\rangle\). As an intermediate step, the problem requires you to write down a rotation matrix. Because the axes are made up by the solver, they shouldn’t affect the ultimate solution. So you can write the rotation in any basis. When you use the canonical basis \(\mathcal B\), you get the usual rotation matrix \([R]_{\mathcal B\mathcal B}=\left(\begin{array}{cc}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{array}\right)\). When Leeta calculates the same matrix in her basis, \(\mathcal B_L\), she too ends up with \([R]_{\mathcal B_L\mathcal B_L}=\left(\begin{array}{cc}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{array}\right)\). Note the notation on the matrix, which means that both the domain and range are written in Leeta’s coordinates, \(\mathcal B_L\). Now Odo’s a little odd, so in his coordinates, \(\mathcal B_O\), counterclockwise rotation rotates his \(y\)-axis towards his \(x\)-axis, and he gets \([R]_{\mathcal B_O\mathcal B_O}=\left(\begin{array}{cc}\cos\theta & \sin\theta \\ -\sin\theta & \cos\theta\end{array}\right)\).

As you may guess, it’s also possible to write down the rotation matrix where the input is in one coordinate system and the output is in another. For example \([R]_{\mathcal B_L\mathcal B_O}=\left(\begin{array}{cc} \cos\left(\theta-\pi/4\right) & \sin\left(\theta-\pi/4\right) \\ -\sin\left(\theta-\pi/4\right) & \cos\left(\theta-\pi/4\right)\end{array}\right)\).

If we have a change-of-basis matrix, \(C\), that maps our coordinates to Leeta’s, then we could write, for any matrix, \(M\), \([M]_{\mathcal B\mathcal B}=C^{-1}[M]_{\mathcal B_L\mathcal B_L}C\). The change-of-basis matrix itself can be written as \(C=[I]_{\mathcal B_L\mathcal B}\).

With this notation for matrices in different bases, an SVD of a matrix \(M\) is simply a pair of bases \(\mathcal B\) and \(\mathcal B'\) for which \([M]_{\mathcal B\mathcal B'}=D\), a diagonal matrix.

2.1.5 Bilinear Forms

Linear forms can be defined in more general settings, but we’ll define a linear form as a function \(f:\mathbb R^n \to\mathbb R\) that is "linear", meaning that it satisfies the two properties:

  1. \(f\left(\vec x+\vec y\right)=f\left(\vec x\right) + f\left(\vec y\right)\) for all \(\vec x,\vec y\in\mathbb R^n\).

  2. \(f\left(\lambda\vec x\right)=\lambda f\left(\vec x\right)\) for all \(\lambda\in\mathbb R\) and \(\vec x\in\mathbb R^n\).

We state without proof that every linear form can be written as \(\lambda\langle u|\) for some scalar \(\lambda\) and unit vector \(\vec u\), meaning that \(f\left(|x\rangle\right)=\lambda\langle u|x\rangle\).

By the definition of linear forms, the sum of any two linear forms is itself a linear form. It’s interesting, though not surprising, that the sum of two covectors of the form \(\lambda\langle u|\) must therefore be another covector, so that it too can be written this way.

By analogy a bilinear form is a function \(f:\mathbb R^m\times\mathbb R^n \to\mathbb R\) that satisfies linearity for both inputs:

  1. \(f\left(\vec x+\vec y,\vec z\right)=f\left(\vec x,\vec z\right) + f\left(\vec y,\vec z\right)\) for all \(\vec x,\vec y\in\mathbb R^m\) and \(\vec z\in\mathbb R^n\).

  2. \(f\left(\vec z,\vec x+\vec y\right)=f\left(\vec z,\vec x\right) + f\left(\vec z,\vec y\right)\) for all \(\vec z\in\mathbb R^m\) and \(\vec x,\vec y\in\mathbb R^n\).

  3. \(f\left(\lambda\vec x,\vec y\right)=\lambda f\left(\vec x,\vec y\right)=f\left(\vec x,\lambda\vec y\right)\) for all \(\lambda\in\mathbb R\), \(\vec x\in\mathbb R^m\), and \(\vec y\in\mathbb R^n\).

We also state without proof that every bilinear form can be written as a matrix, \(A\), so that \(f\left(\vec x, \vec y\right)=\langle x|A|y\rangle\). Conversely, every matrix yields a bilinear form in this way. So maybe a good way to think about matrices is exactly as bilinear forms. In analogy to linear forms, say that a particular matrix \(A\) can be written as \(\lambda|u\rangle\langle v|\). Then its action on a pair of vectors \(\vec x,\vec y\) is \(\lambda\langle x|u\rangle\langle v|y\rangle\), which is a product of two linear forms. This seems like a useful analogy. But unlike linear forms, we can’t add together a bunch of matrices of the form \(\lambda|u\rangle\langle v|\) and still expect to get a matrix of the form \(\lambda|u\rangle\langle v|\). Instead you get a sum of such matrices, \(\sum_i\lambda_i|u_i\rangle\langle v_i|\). The surprising thing here is that no matter how many such matrices you add, you can always reduce the sum to \(r:=\min(m, n)\) summands. Either way, this seems like a sensible way to represent a matrix.
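A NumPy sketch of reading a matrix as a bilinear form (random data, nothing specific assumed): evaluating \(\langle x|A|y\rangle\) directly and term by term through a bra-ket form gives the same number.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
x, y = rng.standard_normal(3), rng.standard_normal(4)

f = lambda x, y: x @ A @ y  # the bilinear form <x|A|y>

# Evaluate the same form through an SVD, one rank-one summand at a time
U, s, Vt = np.linalg.svd(A, full_matrices=False)
terms = sum(s[i] * (x @ U[:, i]) * (Vt[i] @ y) for i in range(len(s)))

print(np.allclose(f(x, y), terms))
```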

2.1.6 Conventions

The bra-ket form of the SVD yields a set of scalars \(\left\{\lambda_i\right\}\) and vectors \(\left\{\vec u_i, \vec v_i\right\}\). We can and do require that all of these be real. We also require that all the scalars be positive \(\lambda_i>0\). We can require this because:

  1. If an SVD has a negative summand, \(\lambda_i|u\rangle\langle v|\), then we get a positive-only SVD by replacing that summand with \((-\lambda_i)|-u\rangle\langle v|\).

  2. If an SVD has a summand with \(\lambda_i=0\), then we omit this and don’t count it as part of the SVD.

Finally we adopt the convention that these are ordered: \(\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_r\).

We prove in Section 3 that with our conventions, the multiset of scalars is unique to the matrix \(A\); that is, all SVDs of \(A\) yield the same multiset. We call these values the singular values of \(A\).

2.1.7 Outro

The goal of this book is to build intuition and demonstrate applications. With the intuition, the applications will seem natural. (You should agree when you read Chapter 2.) Hopefully by the end of the book, you’ll be able to recall or modify the techniques more easily. At this point, you should have some intuition for what an SVD is, but how somebody comes up with such a representation is still fairly abstract. The rest of this chapter proves the existence and almost-uniqueness of the SVD. Outside of these sections, this book generally is not heavy on the proofs. But the techniques used in these proofs should help you get a concrete sense for what the SVD is.

Exercises

  1. Continuing the example from the beginning of the chapter, write \(\left( \begin{array}{ccc} 1 & 1 & 2 \\ 3 & 5 & 8 \end{array} \right)\) in bra-ket form. (Note: You should not have to recalculate anything.)

  2. [Ex-Confirm SVD] Confirm the calculation in Example X by multiplying and adding the bra-ket form of \(A\) and by multiplying the matrix form of \(A\). These should be the same.

  3. Given a matrix \(A=\left(\begin{array}{ccc}1&0&0\\0&0&0\\0&0&0\end{array}\right)\) and a bra-ket form representation with a single term \(\lambda|u\rangle\langle v|\). Prove that \(\lambda=\pm 1\), \(|u\rangle=\pm|e_1\rangle\), and \(\langle v|=\pm\langle e_1|\). (This is a specific case of the almost-uniqueness that we prove in a later section.)

  4. [Ex-Rectangular bra-ket to matrix] Given a matrix with SVD in bra-ket form: \[A=2\left(\begin{array}{c}1\\0\\0\end{array}\right)\left(\begin{array}{cc}1&0\end{array}\right) + 2\left(\begin{array}{c}0\\0\\1\end{array}\right)\left(\begin{array}{cc}0&1\end{array}\right)\] Write \(A\) in matrix-form SVD two different ways.

  5. The orthonormality of \(U\) and \(V\) will often be useful. Given an SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), verify the following.

    1. For \(|v\rangle = \sum_ia_i|v_i\rangle\), \(A|v\rangle=\sum_ia_i\lambda_i|u_i\rangle\).

    2. \(\langle u_i|A|v_i\rangle=\lambda_i\).

  6. [Ex-SVD Null Space] Given a matrix with SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), prove that \(A\vec v=0\) if and only if \(\vec v\) is perpendicular to each \(\vec v_i\).

  7. [Ex-SVD Range] Given a matrix with SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), prove that a vector \(\vec u\) equals \(A\vec v\) for some \(\vec v\) if and only if \(\vec u\in\operatorname{span}\left(\vec u_i\right)\).

  8. An invertible square matrix has an SVD \(A=UDV^T\), with \(D=\left(\begin{array}{cccc}\lambda_1&&&\\ &\lambda_2&&\\ &&\ddots&\\ &&&\lambda_r\end{array}\right)\). Show that all \(\lambda_i\neq0\). Next argue that \(D^{-1}=\left(\begin{array}{cccc}\frac{1}{\lambda_1}&&&\\&\frac{1}{\lambda_2}&&\\&&\ddots&\\&&&\frac{1}{\lambda_r}\end{array}\right)\). Finally prove that \(A^{-1}=VD^{-1}U^T\).

2.2 Existence

In this section, we prove the existence of an SVD. This proof is constructive, in the sense that it gives an algorithm for building a bra-ket form of an SVD.

Algorithm 1: SVD [Algorithm1]

Set \(A_0 \gets A\) and \(i \gets 0\). While \(A_i\neq 0\):

  1. Increment \(i\gets i+1\).

  2. Choose unit vectors \(| u _i\rangle\) and \(| v _i\rangle\) that maximize \(\langle u _i|A_{i-1}| v _i\rangle\).

  3. Set \(A_i \gets A_{i-1}-| u _i\rangle\langle u _i|A_{i-1}| v _i\rangle\langle v _i|\).

When the algorithm finishes, we have \(A_r=0\), so that \(0=A_r=A-\sum_{i=1}^r| u _i\rangle\langle u _i|A_{i-1}| v _i\rangle\langle v _i|\). We recognize that \(\lambda_i:=\langle u _i|A_{i-1}| v _i\rangle\) is just a scalar, and rearranging gives \(A=\sum_{i=1}^r\lambda_i| u _i\rangle\langle v _i|\). The resulting vectors are unit vectors, by choice. But for this to be an SVD as described in the first section, we need the vectors \(\left\{u_i\right\}_i\) to be a perpendicular set, and \(\left\{v_i\right\}_i\) to be a perpendicular set. We prove that this is so in Theorem 2.7.

An objection to this algorithm may be that it only works if the while loop eventually terminates. For the algorithm to work, it must be true that \(A_r=0\) for some \(r\). In Theorem 2.8, we show that for \(A\) with dimension \(m\times n\) this must happen within \(r := \min(m, n)\) steps.
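To make the algorithm concrete, here is a minimal NumPy sketch of one possible implementation. The text doesn’t commit to a method for finding the maximizing pair; this sketch uses power iteration on \(A_i^TA_i\) to find the top direction at each step (an implementation choice on my part, not something prescribed above) and then subtracts off the captured signal.

```python
import numpy as np

def greedy_svd(A, tol=1e-10, iters=500):
    """Return (lams, us, vs) with A ~= sum_i lams[i] * outer(us[i], vs[i])."""
    A_i = A.astype(float).copy()
    lams, us, vs = [], [], []
    rng = np.random.default_rng(0)
    for _ in range(min(A.shape)):
        if np.linalg.norm(A_i) < tol:      # A_i is (numerically) zero: stop early
            break
        # Power iteration on A_i^T A_i finds the maximizing |v>
        v = rng.standard_normal(A.shape[1])
        for _ in range(iters):
            v = A_i.T @ (A_i @ v)
            v /= np.linalg.norm(v)
        u = A_i @ v
        lam = np.linalg.norm(u)            # lam = <u|A_i|v> once u is normalized
        u /= lam
        lams.append(lam); us.append(u); vs.append(v)
        A_i = A_i - lam * np.outer(u, v)   # subtract the captured signal
    return lams, us, vs

A = np.array([[1.0, 1.0, 2.0],
              [3.0, 5.0, 8.0]])
lams, us, vs = greedy_svd(A)
print(np.round(lams, 4))  # approximately [10.192, 0.3399], as in the first example
approx = sum(l * np.outer(u, v) for l, u, v in zip(lams, us, vs))
print(np.allclose(approx, A))
```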

Before proving these two theorems, let’s build some intuition. We see that \(| u _i\rangle\langle u _i|A_{i-1}| v _i\rangle\langle v _i|=\lambda_i| u _i\rangle\langle v _i|\) is just a matrix; and it’s the same shape as \(A_{i-1}\). But it’s not just any matrix. It’s the matrix that captures all the signal that \(A_{i-1}\) has in the axes \(\vec u_i\) and \(\vec v_i\). To illustrate what this means, let’s first look at a lower-dimensional analogy.

Covector Analogy

Consider the covector \(\vec v=\left(\begin{array}{ccc} 3 & 3 & 2 \end{array}\right)\), written as a row. If we wanted to remove all the signal along the axis \(\vec e_1 = \left(\begin{array}{ccc} 1 & 0 & 0 \end{array}\right)\), then we would subtract \(3\vec e_1 = \left(\begin{array}{ccc} 3 & 0 & 0 \end{array}\right)\) from \(\vec v\) to get \(\left(\begin{array}{ccc} 0 & 3 & 2 \end{array}\right)\).

Saying that this "removes signal" along \(e_1\) means that now any vector along \(\vec e_1\), \(\lambda\vec e_1\), gets zeroed out by \(\vec v-3\vec e_1\). That is, \(\langle v-3e_1|e_1\rangle=0\). We arrived at the coefficient \(\lambda=3\) by looking at the first element of \(\vec v\). This can be formalized as \(3=\langle v|e_1\rangle\).

Now we could repeat this with the second dimension: \(3=\langle v|e_2\rangle\). So now \(\langle v|-3\langle e_1|-3\langle e_2| = \left(\begin{array}{ccc} 0 & 0 & 2 \end{array}\right)\). This covector equals zero when it pairs with any linear combination of \(|e_1\rangle\) and \(|e_2\rangle\). That is \(\left\langle\left(\begin{array}{ccc} 0 & 0 & 2 \end{array}\right)|\left(\begin{array}{ccc} \lambda & \mu & 0 \end{array}\right)\right\rangle = 0\). We can run an SVD-like algorithm to assign \(\vec v\)’s signal to its components: \(\langle v| = \sum_{i=1}^3\langle v|e_i\rangle\langle e_i|\).

Moreover there’s nothing special about the standard vectors. For any orthonormal basis \(\left\{\vec v _i\right\}_{i=1}^3\), \(\langle v| = \sum_{i=1}^3\langle v| v _i\rangle\langle v _i|\). It’s further true that if we stop short, the partial remainder zeros out the vectors already encountered, like \(\left(\langle v|-\langle v| v _1\rangle\langle v _1|-\langle v| v _2\rangle\langle v _2|\right)\left(\lambda| v _1\rangle+\mu| v _2\rangle\right)=0\). We say that \(\langle v| v _i\rangle\langle v _i|\) "captures all the info" from \(\langle v|\) in the direction of \(\langle v _i|\), because when we subtract off that bit, there’s nothing left in that direction.

This subtract-off-the-projection step is the same one used in the Gram-Schmidt process (here with directions that are already orthonormal). The SVD can be thought of as a generalization of this to matrices. We say that \(| u \rangle\langle u |A| v \rangle\langle v |\) captures all the info from \(A\) in the direction of \(\vec u\) and \(\vec v\), because \(\langle u |\left(A-| u \rangle\langle u |A| v \rangle\langle v |\right)| v \rangle=0\). (Check that this is true.) This isn’t a proof yet, but this is why we think to try this.
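The parenthetical check above is easy to do numerically; here is a NumPy sketch with arbitrary unit vectors \(\vec u\) and \(\vec v\):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
u = rng.standard_normal(3); u /= np.linalg.norm(u)
v = rng.standard_normal(4); v /= np.linalg.norm(v)

# Subtract off the (u, v) signal: |u><u|A|v><v| is the scalar <u|A|v> times |u><v|
A_rest = A - (u @ A @ v) * np.outer(u, v)
print(np.isclose(u @ A_rest @ v, 0.0))
```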

By choosing the pair of vectors that maximizes the remaining signal at each step, we guarantee that each new vector is perpendicular to the earlier ones. The reasoning is that if we chose some vector pointing in a similar direction to an earlier vector, then we’d be pointing partly in a zeroed-out direction. The new summand, \(| u \rangle\langle u |A| v \rangle\langle v |\), starts by projecting onto the zeroed axis; if this projection is non-zero, then some part of the vector is going to waste.

2.2.1 Proof of the Correctness of Algorithm [Algorithm1]

Lemma 2.3. For any matrix \(A\), if the unit vectors \(\vec u\) and \(\vec v\) satisfy \(\langle u |A| v \rangle=\max_{\|u'\|=\|v'\|=1}\langle u' |A| v' \rangle\), then \(A| v \rangle=\lambda| u \rangle\) for some scalar \(\lambda\).

Proof. Fix the maximizing \(|v\rangle\). \(A| v \rangle\) is some vector; call it \(\lambda|t\rangle\), where \(|t\rangle\) is a unit vector. (The magnitude has been factored out into the scalar \(\lambda\).) So \(\langle u |A| v \rangle=\lambda\langle u | t\rangle\). Given that \(\langle u |\) and \(| t\rangle\) are unit vectors, the inner product is given by the cosine of the angle between them (Theorem 9.6), which is maximized when the vectors are the same. So the maximizing \(| u \rangle\) is \(| t\rangle\) itself, giving \(A| v \rangle=\lambda| u \rangle\). ◻

This is an important fact that helps with the next lemma. It also advances our intuition. We think of a matrix \(A\) as acting on a vector by projecting that vector onto a covector (\(\vec v\)) and using the resulting coefficient to tell us how far to go in the direction of another vector (\(\vec u\)). The above lemma says that when we’ve chosen the maximizing unit vector (\(u\)) and covector (\(v\)), as prescribed by the SVD algorithm, we’ve chosen a matching pair. If we wanted to remove the effect of projecting onto \(v\), then we need to subtract from the output some multiple of \(u\).

Note that though Lemma 2.3 doesn’t make a claim about the value of \(\lambda\), it’s easy to see that \(\lambda=\langle u |A| v\rangle\), because \(\langle u|A|v\rangle = \lambda\langle u|u\rangle = \lambda\).

Next we use this fact to see:

Lemma 2.4. Suppose \(\langle u_1|\) and \(|v_1\rangle\) maximize \(\langle u|A|v\rangle\). Define \(A_1=A-\lambda_1| u _1\rangle\langle v _1|\), with \(\lambda_1=\langle u_1|A|v_1\rangle\). Then \(A_1| v _1\rangle\langle v _1|=0\).

Proof. \[\begin{array}{rcll} A_1| v _1\rangle\langle v _1| &=& \left(A-\lambda_1| u _1\rangle\langle v _1|\right)| v _1\rangle\langle v _1|& \\ &=& A| v _1\rangle\langle v _1| - \lambda_1| u _1\rangle\langle v _1| v _1\rangle\langle v _1| & \quad\text{Distribute}\\ &=& \left(A| v _1\rangle\right)\langle v _1| - \lambda_1| u _1\rangle\langle v _1|& \\ &=& \left(\lambda_1| u _1\rangle\right)\langle v _1| - \lambda_1| u _1\rangle\langle v _1|&\quad\text{By above lemma} \\ &=& 0 \end{array}\] ◻

We can immediately conclude from Lemma 2.4 that \(A_1=A_1\left(I-| v _1\rangle\langle v _1|\right)\). This says that applying \(A_1\) to a vector \(| v \rangle\) is the same as first applying \(\left(I-| v _1\rangle\langle v _1|\right)\) to \(| v \rangle\), then applying \(A_1\) to the resulting vector. \(\left(I-| v _1\rangle\langle v _1|\right)\) is a special matrix that acts on any vector by projecting onto the subspace of all vectors perpendicular to \(| v _1\rangle\). This will be justified when we reach the section on projection matrices. For now it’s sufficient to understand:

  1. \(\left(I-| v _1\rangle\langle v _1|\right)| v \rangle\) is exactly \(| v \rangle\) when \(\langle v _1| v \rangle=0\). I.e. \(v\) is perpendicular to \(v _1\), and

  2. \(\left\Vert\left(I-| v _1\rangle\langle v _1|\right)| v \rangle\right\Vert < 1\) otherwise (taking \(| v \rangle\) to be a unit vector).

The first statement is obvious. The second statement should be fairly intuitive because we’re subtracting off a component of a vector. This is analogous to zeroing out the first term of a unit vector, which will leave it smaller. A more formal proof of this point is left as Exercise [Ex-Finish Projection].

We state without justification that this also holds in more dimensions. This fact follows in much the same way as the single-vector case we’ve discussed, but the proof is tedious.

Theorem 2.5. For \(A_i\) as defined above,

  1. \(A_i| v \rangle=A_i\left(I-\sum_{k=1}^{i-1}| v _k\rangle\langle v _k|\right)| v \rangle\)

  2. \(\left\Vert\left(I-\sum_{k=1}^{i}| v _k\rangle\langle v _k|\right)| v \rangle\right\Vert<1\) unless the unit vector \(| v \rangle\) is perpendicular to each of the earlier \(| v _k\rangle\)

Finally because the same logic can be done on the other side of the matrix, we have \(A_i=\left(I-\sum_{k=1}^i| u _k\rangle\langle u _k|\right)A_i\). So we make a more general statement.

Corollary 2.6. For \(A_i\) as defined above, \[A_i=\left(I-\sum_{k=1}^i| u _k\rangle\langle u _k|\right)A_i\left(I-\sum_{k=1}^{i}| v _k\rangle\langle v _k|\right)\]

We’re ready to prove the theorems mentioned at the opening of this section. First we’ll prove that the vectors \(\left\{\langle u_i|\right\}\) are perpendicular, and that \(\left\{|v_i\rangle\right\}\) are perpendicular. Specifically we argue that each new vector generated by the algorithm is perpendicular to all previous ones.

Theorem 2.7. Let \(A_i\) be as in Algorithm [Algorithm1] with \(A_i\neq0\). If the unit vectors \(u\), \(v\) maximize \(\langle u |A_i| v \rangle\) (so that they will become \(u_{i+1}\) and \(v_{i+1}\) in the algorithm), then \(u\) is perpendicular to \(u_j\) and \(v\) is perpendicular to \(v_j\) for all \(j\leq i\).

Proof. Suppose to the contrary that one or both of the vectors is not perpendicular, and that the maximum is \(M\).

\(M\) must be positive. To see this: \(\langle u|A_i|v\rangle\) must be non-zero for at least one pair of unit vectors, for otherwise \(A_i=0\). Consider any non-zero value, \(N=\langle u|A_i|v\rangle\). If \(N\) is positive, then the maximum \(M\geq N>0\). Otherwise \(-N=\langle -u|A_i|v\rangle\) is positive, which also implies \(M\geq-N>0\).

Using Corollary 2.6, we can write \(M\) as:

\[M=\langle u |A_i| v \rangle=\langle u |\left(I-\sum_{k=1}^i| u _k\rangle\langle u _k|\right)A_i\left(I-\sum_{k=1}^{i}| v _k\rangle\langle v _k|\right)| v \rangle=\langle u '|A_i| v '\rangle\]

where \(\langle u'|=\langle u |\left(I-\sum_{k=1}^i| u _k\rangle\langle u _k|\right)\) and \(|v'\rangle=\left(I-\sum_{k=1}^{i}| v _k\rangle\langle v _k|\right)| v \rangle\).

Because we assumed that either \(\vec u\) or \(\vec v\) is not perpendicular to all of the previous vectors, one of the vectors \(u '\) or \(v '\) has magnitude less than \(1\), by Theorem 2.5, point (2).

We can now reach a contradiction, because the unit vectors \(\frac{\langle u '|}{\left\Vert\langle u '|\right\Vert}\) and \(\frac{| v '\rangle}{\left\Vert| v '\rangle\right\Vert}\) do a better job of maximizing \(\langle u|A_i|v\rangle\) than \(\langle u|\) and \(|v\rangle\) do. This is because:

\[\frac{\langle u '|}{\left\Vert\langle u '|\right\Vert}A_i\frac{| v '\rangle}{\left\Vert| v '\rangle\right\Vert} = \frac{\langle u'|A_i|v'\rangle}{\left\Vert\langle u '|\right\Vert\left\Vert| v '\rangle\right\Vert} = \frac{M}{\left\Vert\langle u '|\right\Vert\left\Vert| v '\rangle\right\Vert} > M >0\] ◻

Finally we prove that Algorithm [Algorithm1] terminates.

Theorem 2.8. If \(A\) is dimension \(m\times n\), call \(r:=\min(m,n)\), then \(A_r=0\).

Note: This doesn’t preclude \(A_{r'}=0\) for some \(r'<r\). If that happens, then the algorithm will terminate early.

Proof. Suppose we’ve gone through \(r\) steps. Then either the orthonormal vectors \(\left\{| u _i\rangle\right\}\) span the codomain or the orthonormal vectors \(\left\{| v _i\rangle\right\}\) span the domain, depending on whether \(r=m\) or \(r=n\). Without loss of generality, say that the domain is spanned. Then \[\begin{array}{rcl} A_r&=&A_r\left(I-\sum_{i=1}^{r}| v _i\rangle\langle v _i|\right)\\ &=&A_r\cdot 0\\ &=&0 \end{array}\] ◻

The first equality follows from Theorem 2.5 and the second from Exercise [Ex-Outer Product Identity].

2.2.2 Change of bases for outer product sums

Before moving on, we’ll state a fact that we’ll need later. This is a little out of place here, except that it shows a nice application of the concepts covered in this section.

We will see in Exercise [Ex-Outer Product Identity] that for any orthonormal basis \(\left\{|v_i\rangle\right\}\), we have \(I=\sum_i|v_i\rangle\langle v_i|\). A natural extension of this is that if you had two sets of orthonormal vectors, \(\left\{|v_i\rangle\right\}\) and \(\left\{|v_i'\rangle\right\}\), spanning the same space, then \(\sum_i|v_i\rangle\langle v_i|=\sum_i|v_i'\rangle\langle v_i'|\). This is true, but we’ll need an even more general statement. In the next section we will encounter sums of outer products that aren’t necessarily symmetric, so we need that \(\sum_i|u_i\rangle\langle v_i|=\sum_i|u_i'\rangle\langle v_i'|\). This only happens under special circumstances, but these happen to be the circumstances that arise when picking apart the SVD.

Lemma 2.9. For orthonormal sets of vectors \(\left\{\langle u_i|\right\}_{i=1}^n\), \(\left\{|v_i\rangle\right\}_{i=1}^n\), \(\left\{\langle u_i'|\right\}_{i=1}^n\), and \(\{|v_i'\rangle\}_{i=1}^n\), we have \[\sum_i|u_i\rangle\langle v_i|=\sum_i|u_i'\rangle\langle v_i'|\] provided the following conditions hold:

  1. \(\operatorname{span}\left\{\langle u_i|\right\}=\operatorname{span}\left\{\langle u_i'|\right\}\) and \(\operatorname{span}\left\{|v_i\rangle\right\}=\operatorname{span}\left\{|v_i'\rangle\right\}\).

  2. There exists a fixed linear map \(B\) such that \(B|v_i\rangle=|u_i\rangle\) and \(B|v_i'\rangle=|u_i'\rangle\).

For the second point, there will always exist some linear map taking one orthonormal set to another (of the same cardinality). The key assumption here is that the same \(B\) works for both pairs of sets.

Proof. Call \(M=\sum_i|u_i\rangle\langle v_i|\) and \(M'=\sum_i|u_i'\rangle\langle v_i'|\). To prove these matrices are equal, we show that they act the same on every vector. We choose an arbitrary vector \(|v\rangle\), and show that \(M|v\rangle=M'|v\rangle\).

By Theorem 9.5, we can write \(|v\rangle=|v_\parallel\rangle+|v_\bot\rangle\), with the subspace \(W\) in that theorem taken to be \(\operatorname{span}\left\{|v_i\rangle\right\}=\operatorname{span}\left\{|v_i'\rangle\right\}\). With this construction \(\langle v_i|v_\bot\rangle=0\) for each \(\langle v_i|\), so \(M|v_\bot\rangle=M'|v_\bot\rangle=0\).

Because \(|v_\parallel\rangle\in\operatorname{span}\{|v_i\rangle\}\), \(|v_\parallel\rangle=\sum_ia_i|v_i\rangle\).

\[\begin{array}{rcl} M|v\rangle &=& M|v_\parallel\rangle+M|v_\bot\rangle \\ &=& M|v_\parallel\rangle \\ &=& \left(\sum_i|u_i\rangle\langle v_i|\right)\left(\sum_ja_j|v_j\rangle\right) \\ &=& \sum_i|u_i\rangle a_i \\ &=& \sum_ia_iB|v_i\rangle \\ &=& B\left(\sum_ia_i|v_i\rangle\right) \\ &=& B|v_\parallel\rangle \end{array}\]

The same calculation with primes on all the \(M\), \(u\), and \(v\) variables gives that \(M'|v\rangle\) also equals \(B|v_\parallel\rangle\). ◻

Exercises

  1. [Ex-Finish Projection] Given unit vectors \(\vec u\) and \(\vec v\). Prove that if \(\langle u|v\rangle\neq0\), then \(\left\Vert\left(I-|u\rangle\langle u|\right)|v\rangle\right\Vert < 1\). (Hint: For any vector \(\vec v\), \(\Vert \vec v \Vert<1\) if and only if \(\langle v | v \rangle<1\).)

  2. [Ex-Outer Product Identity] Given an orthonormal basis \(\left\{\vec v_i\right\}_{i=1}^n\) of \(\mathbb R^n\), prove that \(I=\sum_{i=1}^n|v_i\rangle\langle v_i|\). That is, show for any \(\vec v\), \(\left(\sum_i|v_i\rangle\langle v_i|\right)|v\rangle=|v\rangle\).

  3. Given a matrix with an SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), let \(\left\{A_i\right\}\) be the matrices given by Algorithm [Algorithm1]. Lemma 2.3 implies that \(A_{i-1}|v_i\rangle=\lambda_i|u_i\rangle\). Show that \(A|v_i\rangle=\lambda_i|u_i\rangle\).

  4. Given a matrix with a SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), show that \(A=\sum_iA|v_i\rangle\langle v_i|\). (For fun, prove this two different ways.)

2.3 Almost Uniqueness

An annoying fact about the SVD is that it’s not unique. But it is almost unique. In this section, we pin down exactly how much two SVDs of the same matrix can differ.

Let’s review briefly. Given a matrix \(A\), we can find an SVD, \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\). The \(\lambda_i\) do not need to be distinct. For the sake of this section, we write this as

\[\label{MultiLambdaNotation} A=\sum_i\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|,\]

where \(\lambda_1>\lambda_2>\cdots\). By indexing the vectors/covectors with a second index \(j\), we can make sure that all our \(\lambda_i\) are distinct.

Example

Suppose we had an SVD: \[A=2|u_1\rangle\langle v_1|+2|u_2\rangle\langle v_2|+2|u_3\rangle\langle v_3|+|u_4\rangle\langle v_4|+|u_5\rangle\langle v_5|\]

To rewrite this in the language of Equation ([MultiLambdaNotation]), we would write: \[A=2\cdot\left(|u_{11}\rangle\langle v_{11}|+|u_{12}\rangle\langle v_{12}|+|u_{13}\rangle\langle v_{13}|\right)+1\cdot\left(|u_{21}\rangle\langle v_{21}|+|u_{22}\rangle\langle v_{22}|\right)\]

So that \(\lambda_1=2\) and \(\lambda_2=1\). \(|u_1\rangle\) gets renamed to \(|u_{11}\rangle\), \(\langle v_4|\) gets remapped to \(\langle v_{21}|\), etc.

If the SVD was built with our algorithm from the previous section, then \(\langle u_{11}|A|v_{11}\rangle=\lambda_1\) maximizes \(\langle u|A|v\rangle\). Of course because \(\langle u_{1j}|A|v_{1j}\rangle=\lambda_1\) for any \(1\leq j\leq n_1\), the algorithm may just as well have chosen \(v_{12}\) or \(v_{13}\) first, so any reordering of these could have been produced. In fact we don’t need to be restricted to these exact vectors; for any unit vector \(|v\rangle\in\operatorname{span}\left\{|v_{1*}\rangle\right\}\), we can find a \(\langle u|\) (given by Lemma 2.3) for which \(\langle u|A|v\rangle=\lambda_1\). It also turns out that for any \(|v\rangle\not\in\operatorname{span}\left\{|v_{1*}\rangle\right\}\), there is no \(\langle u|\) for which \(\langle u|A|v\rangle=\lambda_1\). These two statements are proven in Corollary 2.11.

Together these show that if you follow the algorithm from the previous section of repeatedly finding the maximizing vectors, the only variation you can get is a change of basis in \(\left\{|v_{1*}\rangle\right\}\). (If \(n_1=1\), then this means that we can only vary \(\lambda_1|u\rangle\langle v|=\lambda_1|-u\rangle\langle -v|\).) We’ve only shown this for the largest value \(\lambda_1\), but by subtracting off these components we can apply the same argument to the resulting difference, whose new largest value is \(\lambda_2\), and so on; this is a trick we’ll use repeatedly.

So running the algorithm won’t give you the exact same answer each time you do it. It may vary, but only by choosing a different basis for each \(\left\{|v_{i*}\rangle\right\}\). But is it possible to use some other algorithm to come up with an SVD? And maybe if you did, you’d get a much different SVD? We show in this section that no matter how you come up with an SVD, it must be the same as one of the SVDs created by the algorithm. That is, it can only vary by a change of basis for each spanning space \(\operatorname{span}\left\{\langle u_{i*}|\right\}\) and \(\operatorname{span}\left\{|v_{i*}\rangle\right\}\).

Theorem 2.10. Given an SVD of a matrix, \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), then for any unit vectors \(\vec u\), \(\vec v\), \(\langle u|A|v\rangle\leq\lambda_1\).

Proof. Using Lemma 2.3, the maximizing \(\langle u|\) satisfies \(A|v\rangle=\lambda|u\rangle\), i.e. \(\langle u|=\frac{\langle v|A^T}{\lambda}\) with \(\lambda=\left\|A|v\rangle\right\|\). So we get that

\[\max_{\|u\|=\|v\|=1}\langle u|A|v\rangle = \max_{\|v\|=1}\frac{\langle v|A^T}{\left\|A|v\rangle\right\|}A|v\rangle = \max_{\|v\|=1} \frac{\langle v|A^TA|v\rangle}{\left\|A|v\rangle\right\|}\]

For any vector \(\vec w\), \(\left\|\vec w\right\|=\sqrt{\vec w^T\vec w}\). So,

\[\max_{\|v\|=1} \frac{\langle v|A^TA|v\rangle}{\left\|A|v\rangle\right\|} = \max_{\|v\|=1} \frac{\langle v|A^TA|v\rangle}{\sqrt{\langle v|A^TA|v\rangle}} = \max_{\|v\|=1} \sqrt{\langle v|A^TA|v\rangle}\]

A common trick is to notice that the square root of a value is maximal exactly when the value itself is maximal. So the same \(|v\rangle\) that maximizes \(\langle u|A|v\rangle\) also maximizes \(\langle v|A^TA|v\rangle\).

So to maximize \(\langle v|A^TA|v\rangle\), we use the fact that \(A^TA=\sum_i\lambda_i^2|v_i\rangle\langle v_i|\) (Lemma 2.2). If \(\left\{\vec v_i\right\}\) do not span the domain (which has dimension \(n\)), then we pad the set with additional orthonormal vectors, each assigned \(\lambda_i=0\), so that \(A^TA=\sum_{i=1}^n\lambda_i^2|v_i\rangle\langle v_i|\) still holds. Now \(\left\{\vec v_i\right\}_{i=1}^n\) is a basis, so any unit vector can be written \(|v\rangle = \sum_{i=1}^na_i|v_i\rangle\), with \(\sum_{i=1}^na_i^2 =1\).

\[\label{UpperboundEquation} \begin{array}{rcl} \langle v|A^TA|v\rangle &=& \langle v|\left(\sum_i\lambda_i^2|v_i\rangle\langle v_i|\right)|v\rangle \\ &=& \left(\sum_j a_j\langle v_j|\right)\left(\sum_i\lambda_i^2|v_i\rangle\langle v_i|\right)\left(\sum_k a_k|v_k\rangle\right) \\ &=& \sum_ia_i^2\lambda_i^2 \leq \sum_ia_i^2\lambda_1^2 = \lambda_1^2 \end{array}\]

The inequality on the last line follows because \(\lambda_1\) is at least as large as every other \(\lambda_i\).

Putting it all together, we have:

\[\langle u|A|v\rangle\leq\max_{\|u\|=\|v\|=1}\langle u|A|v\rangle=\max_{\|v\|=1}\sqrt{\langle v|A^TA|v\rangle}\leq\sqrt{\lambda_1^2}=\lambda_1\] ◻

Notice that, by eliminating the \(\langle u|\), the maximization problem became easier. After finding \(|v\rangle\), we could use Lemma 2.3 to find \(\langle u|\). But by symmetry, we could have instead found the \(\langle u|\) that maximizes \(\langle u|AA^T|u\rangle\).
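Theorem 2.10 is also easy to poke at numerically. In this NumPy sketch (random matrix, random unit vectors), no sampled pair beats the top singular value, and the pair \((\vec u_1, \vec v_1)\) attains it:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))
U, s, Vt = np.linalg.svd(A)

best = 0.0
for _ in range(10_000):
    u = rng.standard_normal(3); u /= np.linalg.norm(u)
    v = rng.standard_normal(5); v /= np.linalg.norm(v)
    best = max(best, u @ A @ v)

print(best <= s[0])                           # no sample exceeds lambda_1
print(np.isclose(U[:, 0] @ A @ Vt[0], s[0]))  # (u_1, v_1) attains the bound
```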

While the theorem gives an upper bound of \(\lambda_1\), note that this bound is achievable, namely with \(\vec u_1\) and \(\vec v_1\). But we can be a little more specific:

Corollary 2.11. Given a matrix with SVD, \(A=\sum_i\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|\) (with \(\lambda_1>\lambda_2>\cdots\)) and a unit vector \(|v\rangle\), \(\max_{\|u\|=1}\langle u|A|v\rangle=\lambda_1\) if and only if \(|v\rangle\in\operatorname{span}\left\{|v_{1*}\rangle\right\}\).

We skip some of the details of this proof. It amounts to observing that the inequality at the end of Eq. ([UpperboundEquation]) is strict unless all the non-zero terms have \(\lambda_i\) coefficients equal to \(\lambda_1\).

Proof. Redo the work of Lemma 2.2, but this time with the SVD written as \(A=\sum_i\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|\) to get

\[A^TA=\sum_i\lambda_i^2\sum_{j=1}^{n_i}|v_{ij}\rangle\langle v_{ij}|\]

Write \(|v\rangle=\sum a_{ij}|v_{ij}\rangle\). Because \(|v\rangle\) is a unit vector, \(\sum a_{ij}^2=1\). By the same logic as Eq. ([UpperboundEquation]), we get that

\[\langle v|A^TA|v\rangle = \sum_ia_i^2\lambda_i^2\]

where \(a_i^2=\sum_ja_{ij}^2\). Again the coefficients must satisfy \(\sum_ia_i^2=1\), but this time the \(\lambda_i\) are distinct, so the sum equals \(\lambda_1^2\) (equivalently, \(\max_{\|u\|=1}\langle u|A|v\rangle=\lambda_1\)) if and only if \(a_1=1\) and all the others are \(0\). But this happens if and only if \(|v\rangle\in\operatorname{span}\left\{|v_{1*}\rangle\right\}\). ◻

Some words should be said about Theorem 2.10 and Corollary 2.11. We’ll see in a bit that they are used to prove the almost-uniqueness of the SVD. In many ways, this theorem is the keystone of the entire chapter; it connects all the concepts. We needed to first prove the existence of an SVD to prove it. But we now see a motivation for Algorithm [Algorithm1]. Until now, the singular values seemed like an artifact of the specific SVD representation. But now we see that the first singular value \(\lambda_1\) is something fundamental about the matrix - it’s an upper bound that doesn’t depend on the SVD representation. (By repeated subtraction, all singular values are representation-independent.) So it makes sense that we should try to find and use these singular values. Looking ahead, the next chapter shows how to calculate the vectors \(|v_i\rangle\). That calculation relies entirely on the trick (first encountered in Theorem 2.10) of canceling the \(\langle u_i|\) by taking \(A^TA\).

Theorem 2.10 and its corollary are also perhaps the most surprising results here. It seems sort of like magic. Maybe you’re along for the ride when I say that we’ll take \(A^TA\) to cancel out the \(|u_i\rangle\), but it seems like luck when you end up with \(\sum_ia_i^2\), which just happened to equal \(1\) because we were looking at unit vectors. Maybe there’s some magic here. But at the same time, matrices are sometimes thought of as quadratic forms. For a vector \(\vec x=\left(\begin{array}{c}x_1\\x_2\\\vdots\\x_n\end{array}\right)\) and an \(n\times n\) matrix \(M\), \(\vec x^TM\vec x\) has every power-\(2\) term of \(x\) with coefficients derived from \(M\). (Check this.) Given this quadratic nature, maybe it’s less surprising that squared coefficients, which sum to the squared magnitude, show up.

Imagine that your matrix \(A\) is a diagonal matrix with positive, decreasing entries. This is already an SVD with \(D=A\) and \(U=V=I\). Look at how obvious the logic is now. With some \(\vec x=\left(\begin{array}{c}x_1\\x_2\\\vdots\\x_n\end{array}\right)\), \(\langle x|D|x\rangle=\sum_i\lambda_ix_i^2\). If you wanted the first singular value, \(\lambda_1\), the largest value, then you need to do something to get \(A_{11}\). Of course \(A_{11}=\langle e_1|A|e_1\rangle\) would do the trick. When \(U\) and \(V\) aren’t the identity, then you should use \(\langle e_1|U^T\) and \(V|e_1\rangle\) to cancel them. The existence of the SVD just says that aside from these little twists, \(U\) and \(V^T\), all matrices are just these canonical quadratic forms.

With that digression behind us, we can now actually prove almost-uniqueness. All the hard work has been done already. This theorem brings together lots of ideas, so take your time with it.

Theorem 2.12. Given two SVDs of a matrix \(A=\sum_i\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|=\sum_i\lambda_i'\sum_{j=1}^{n_i'}|u_{ij}'\rangle\langle v_{ij}'|\), then for each \(i\):

  1. \(\lambda_i=\lambda_i'\)

  2. \(n_i=n_i'\)

  3. \(\operatorname{span}\left\{\vec u_{i*}\right\}=\operatorname{span}\left\{\vec u_{i*}'\right\}\)

  4. \(\operatorname{span}\left\{\vec v_{i*}\right\}=\operatorname{span}\left\{\vec v_{i*}'\right\}\)

Proof. We first prove the four statements for \(i=1\).

Theorem 2.10 tells us that the largest value of \(\langle u|A|v\rangle\) is the largest singular value, for any SVD. Because these SVDs represent the same matrix, \(A\), it follows that these two largest singular values are equal, \(\lambda_1=\lambda_1'\). So (1) is true.

Because \(\langle u_{1j}|A|v_{1j}\rangle=\lambda_1=\lambda_1'\) for each \(j\), Corollary 2.11 tells us that \(\langle u_{1j}|\in\operatorname{span}\left\{\langle u_{1*}'|\right\}\) and \(|v_{1j}\rangle\in\operatorname{span}\left\{|v_{1*}'\rangle\right\}\). Since spans are closed under linear combinations, it follows that \(\operatorname{span}\left\{\langle u_{1*}|\right\}\subset\operatorname{span}\left\{\langle u_{1*}'|\right\}\) and \(\operatorname{span}\left\{|v_{1*}\rangle\right\}\subset\operatorname{span}\left\{|v_{1*}'\rangle\right\}\). Applying the same argument in the other direction gives the reverse inclusions, so by double inclusion we can conclude that \(\operatorname{span}\left\{\langle u_{1*}|\right\}=\operatorname{span}\left\{\langle u_{1*}'|\right\}\) and \(\operatorname{span}\left\{|v_{1*}\rangle\right\}=\operatorname{span}\left\{|v_{1*}'\rangle\right\}\). This proves (3) and (4).

Because \(\left\{|v_{1*}\rangle\right\}\) are orthonormal, the dimension of their span is the number of elements in the set, \(n_1\). The span of \(\left\{|v_{1*}'\rangle\right\}\) is the same space, so the same dimension. But this time the dimension is the number of elements in this orthonormal set, \(n_1'\). It follows that \(n_1=n_1'\), which proves (2).

We’ve only proven this for the first singular value. This is because Theorem 2.10 and Corollary 2.11 are only proven for the maximum singular value. We want to use the usual trick of subtracting that signal out and repeating the argument on the resulting matrix, over and over. Usually we gloss over this detail, but here it requires a little more care. We subtract \(\lambda_1\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|\) from \(A\). On the representation \(\sum_{i=1}^n\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|\), this just removes the first summand.

\[A-\lambda_1\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|=\sum_{i=2}^n\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|\]

However for the other representation, we have something messier.

\[A-\lambda_1\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|=\lambda_1\left(\sum_{j=1}^{n_1'}|u_{1j}'\rangle\langle v_{1j}'|-\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|\right)+\sum_{i=2}^n\lambda_i'\sum_{j=1}^{n_i'}|u_{ij}'\rangle\langle v_{ij}'|\]

In order to repeat the above argument, we need to reduce this to an SVD again. We use Lemma 2.9 with \(B=\frac{1}{\lambda_1}A\) to conclude that \(\sum_{j=1}^{n_1'}|u_{1j}'\rangle\langle v_{1j}'|=\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|\), which cancels the \(\lambda_1\) terms above, resulting in:

\[A-\lambda_1\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|=\sum_{i=2}^n\lambda_i'\sum_{j=1}^{n_i'}|u_{ij}'\rangle\langle v_{ij}'|\]

Now we have two SVDs of \(A-\lambda_1\sum_{j=1}^{n_1}|u_{1j}\rangle\langle v_{1j}|\), namely \(\sum_{i=2}^n\lambda_i'\sum_{j=1}^{n_i'}|u_{ij}'\rangle\langle v_{ij}'|\) and \(\sum_{i=2}^n\lambda_i\sum_{j=1}^{n_i}|u_{ij}\rangle\langle v_{ij}|\). So we can repeat the above arguments on this matrix. Subtract, and repeat again, and so on. ◻

What this tells you is that no matter how you obtain the SVD, it will be in basically the same form. It will have the same singular values with the same multiplicities, and the only difference is that a different basis may be chosen for the vectors.

This is a lot of work to show almost-uniqueness, but by now we should start to have some intuition for what the SVD is. Firstly, the singular values are something fundamental about the matrix, and not something that depends on the SVD representation. Secondly, our algorithm from the previous section is a rational thing to try. The largest singular value comes with a subspace where that maximum is attained, and after you subtract out that piece, the maximum drops to the next singular value.

Exercises

  1. With an SVD given by Eq. ([MultiLambdaNotation]), show that if \(|v\rangle\in\operatorname{span}\{|v_{i*}\rangle\}\), then \(A|v\rangle\in\operatorname{span}\{|u_{i*}\rangle\}\).

  2. Given a matrix with SVD \(A=\sum_i\lambda_i|v_i\rangle\langle v_i|\), show that for any \(n\in\mathbb N\), \(A^n=\sum_i\lambda_i^n|v_i\rangle\langle v_i|\).

  3. Given a matrix with a SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), show that \(A=\left(\sum_i|u_i\rangle\langle u_i|\right)A\left(\sum_j|v_j\rangle\langle v_j|\right)\) by distributing the sums and arguing that \(\langle u_i|A|v_j\rangle\) equals \(\lambda_i\) if \(i=j\) and \(0\) otherwise. How could you show this equality without distributing?

  4. Given an invertible square matrix \(A\) with minimum singular value \(\lambda_n\), prove that for any unit vectors \(\vec u\) and \(\vec v\), \(\langle u|A|v\rangle\geq\lambda_n\).

  5. Given a matrix \(A\), prove that for any value \(t\) between the smallest and largest singular value \(\lambda_n\leq t\leq\lambda_1\), there are some unit vectors \(\langle u|\) and \(|v\rangle\) for which \(\langle u|A|v\rangle=t\).

2.4 How to Calculate

So far we haven’t said how to actually calculate an SVD. Algorithm [Algorithm1] says to repeatedly find maximal vectors, but we haven’t said how to do this. In this section we show that this calculation amounts to calculating certain eigenvalues and eigenvectors. In practice, you could calculate the SVD using a computer program. (See Section 3.1.) Yet knowing how to calculate manually will advance our understanding of the SVD.

Before going on, we need a small fact about calculus. I’ll state a lemma here without proof. Calculus is outside the scope of this book, but I provide some discussion in Appendix X.

Lemma 2.13. For any symmetric matrix \(B\),

\[\frac{\partial}{\partial|v\rangle}\langle v|B|v\rangle = 2B|v\rangle\]

The left-hand side of this equation is a vector whose entries equal \(\frac{\partial\langle v|B|v\rangle}{\partial v_i}\). This lemma seems simple. (It’s analogous to \(\frac{d}{dx}(bx^2)=2bx\).) Indeed, it could be demonstrated with basic calculus, but the calculation is slightly more difficult than you may guess.
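
If you want to convince yourself without doing the calculus, here’s a minimal finite-difference check in numpy (using an arbitrary random symmetric \(B\)):

import numpy as np

rng = np.random.default_rng(9)
B = rng.standard_normal((4, 4))
B = B + B.T                        # symmetric, as the lemma requires
v = rng.standard_normal(4)

# Approximate each partial derivative of <v|B|v> with a finite difference.
eps = 1e-6
grad = np.array([((v + eps * np.eye(4)[i]) @ B @ (v + eps * np.eye(4)[i]) - v @ B @ v) / eps
                 for i in range(4)])
print(np.allclose(grad, 2 * B @ v, atol=1e-4))   # True: the gradient is 2 B v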

We saw in the last section that we could find \(|v_1\rangle\) in \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\) as the unit vector \(|v\rangle\) that maximizes \(\langle v|A^TA|v\rangle\). To calculate this max, we can do some calculus, but we have to be careful to restrict to unit vectors (\(\langle v|v\rangle=1\)). To enforce this, we use Lagrange multipliers. (We call the Lagrange multiplier \(\varepsilon\) instead of the usual \(\lambda\) to avoid confusion with singular values.)

\[\langle v|A^TA|v\rangle-\varepsilon\langle v|v\rangle=\langle v|\left(A^TA-\varepsilon I\right)|v\rangle\]

By taking the derivative using Lemma 2.13 and setting it to zero, we get

\[\left(A^TA-\varepsilon I\right)|v_1\rangle=0\]

That is, \(|v_1\rangle\) is an eigenvector of \(A^TA\). Though this doesn’t tell us which eigenvector, we can check all eigenvectors until we find the maximum. Similarly, \(|u_1\rangle\) is an eigenvector of \(AA^T\).

If the eigenvalue corresponding to \(|v_1\rangle\) is \(\varepsilon_1\), then \(\langle v_1|A^TA|v_1\rangle=\varepsilon_1\). From the discussion in the proof of Theorem 2.10,

\[\lambda_1=\langle u_1|A|v_1\rangle=\sqrt{\langle v_1|A^TA|v_1\rangle}=\sqrt{\varepsilon_1}\]

It’s not surprising that these vectors are eigenvectors of \(A^TA\), because recall that

\[A^TA=\sum_i\lambda_i^2|v_i\rangle\langle v_i|.\]

It’s easy to see that \(A^TA|v_i\rangle=\lambda_i^2|v_i\rangle\), which makes \(|v_i\rangle\) an eigenvector.

Yet another way to see this is to write \(A=UDV^T\), so that \(A^TA=VD^2V^T\). You can read this as a rotation (\(V^T\)), a stretch or squeeze along the axes (\(D^2\)), then a rotation back (\(V\)). \(V^T\) maps each of the vectors \(\vec v_i\) to the axes, and \(V\) maps the axes back to the vectors \(\vec v_i\). It’s then clear that \(VD^2V^T\) maps each \(\vec v_i\) to some \(\varepsilon_i\vec v_i\).
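
Here’s a minimal numpy sketch of this route, with an arbitrary small matrix (a real implementation would be more careful about numerical issues): the eigenvalues of \(A^TA\) are the squared singular values, and the eigenvectors are the \(|v_i\rangle\).

import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])

# Eigen-decomposition of the symmetric matrix A^T A.
evals, evecs = np.linalg.eigh(A.T @ A)        # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]               # re-sort descending
singular_values = np.sqrt(np.clip(evals[order], 0, None))
V = evecs[:, order]                           # columns are candidate |v_i> vectors

print(np.round(singular_values, 4))                       # includes a trailing 0 for this 2x3 case
print(np.round(np.linalg.svd(A, compute_uv=False), 4))    # matches the non-zero values
print(np.allclose(A.T @ A @ V[:, 0], singular_values[0]**2 * V[:, 0]))   # |v_1> is an eigenvector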

Exercises

  1. Using the eigenvector technique presented in this section, calculate the SVD of \(\left(\begin{array}{cc}2&6\\1&1\end{array}\right)\).

  2. Prove that the eigenvalues of \(A^TA\) form the same multiset as the eigenvalues of \(AA^T\).

  3. Given a matrix \(A\), let \(\left\{A_i\right\}\) be the matrices found in Algorithm [Algorithm1]. Show that for any \(j>i\) an eigenvector corresponding to a non-zero eigenvalue of \(A_j\) is also an eigenvector of \(A_i\).

  4. Given a symmetric matrix \(A\) with distinct singular values, prove that every SVD has the form \(\sum_i\lambda_i|v_i\rangle\langle v_i|\).

2.5 Quick Conclusions

It took a lot of work to build up some understanding of the SVD, but we will use it throughout the book. In the next chapter, we discuss a number of applications of the SVD. For now, I will whet your appetite with some immediate conclusions.

2.5.1 Rank-Nullity Theorem

Theorem 2.14. If \(A\) is an \(m\times n\)-dimensional matrix, then \(\operatorname{Dim}\left(\operatorname{Range}(A)\right)+\operatorname{Dim}\left(\operatorname{NullSpace}(A)\right)=n\).

Proof. Choose some SVD of \(A\), \(A=\sum_{i=1}^r\lambda_i|u_i\rangle\langle v_i|\), with each \(\lambda_i\neq0\). Expand \(\left\{|v_i\rangle\right\}_{i=1}^r\) to a basis of the domain by adding vectors \(|v_{r+1}\rangle, |v_{r+2}\rangle, ..., |v_n\rangle\). \(\operatorname{Range}(A)=\operatorname{Span}\left\{|u_i\rangle\right\}_{i=1}^r\), and \(\operatorname{NullSpace}(A)=\operatorname{Span}\left\{|v_i\rangle\right\}_{i=r+1}^n\). (See Exercises [Ex-SVD Null Space] and [Ex-SVD Range].) So the equality follows. ◻
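
As a quick sanity check, here’s a minimal numpy sketch; the matrix is arbitrary, and the rank is read off as the number of non-negligible singular values.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])                 # second row is twice the first, so rank 1

d = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(d > 1e-10))                   # dimension of the range
nullity = A.shape[1] - rank                     # dimension of the null space
print(rank, nullity, rank + nullity == A.shape[1])   # 1 2 True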

2.5.2 Determinant

Theorem 2.15. Let \(A\) be an \(m\times m\)-dimensional matrix with \(m\) (not-necessarily-distinct) singular values, \(\lambda_1, ..., \lambda_m\). Then \(|A|=\pm\prod_{i=1}^m\lambda_i\).

Proof. Choose an SVD, \(A=UDV^T\). Because the determinant distributes over matrix multiplication, \(|A|=|U|\cdot|D|\cdot|V^T|\). Because they’re orthogonal, \(|U|\) and \(|V^T|\) are each \(\pm1\). So \(|A|=\pm|D|\). \(D\) is a diagonal matrix with entries \(\lambda_1, ..., \lambda_m\), and the usual formula for the determinant of a diagonal matrix yields the product of the diagonal terms. ◻
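
A quick numerical check (with an arbitrary matrix) shows the \(\pm\) sign is real: the determinant can be negative, while singular values never are.

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, -3.0]])
d = np.linalg.svd(A, compute_uv=False)
print(np.linalg.det(A), np.prod(d))   # -6.0 and 6.0: equal up to sign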

There’s another way to understand this. The orthogonal matrices \(U\) and \(V^T\) are isometries. This means that they preserve distances, and importantly, volumes. The determinant can be understood as the volume of the image of a unit cube. With a \(UDV^T\) transformation, the cube gets rotated (\(V^T\)), stretched or squashed (\(D\)) in a way that maps the unit cube to a box of volume \(|D|\), then rotated again (\(U\)). The rotations don’t affect the volume, so the unit cube gets mapped to a box of volume \(|D|\).

2.5.3 Misc

Theorem 2.16. Given a matrix with SVD, \(X=\sum_i\lambda_i|u_i\rangle\langle v_i|\), the span of the columns of \(X\) is the span of \(\left\{|u_i\rangle\right\}_i\), and the span of the rows of \(X\) is the span of \(\left\{|v_i\rangle\right\}_i\).

Proof. TODO ◻

Exercises

  1. In some contexts, a matrix norm is defined as \[\|A\|=\max_{|v\rangle\neq0}\frac{\|A|v\rangle\|}{\||v\rangle\|}.\] Given an SVD, \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), show that \(\|A\|=\lambda_1\).

  2. Call \(\sigma_1(A)\) the largest singular value of \(A\). Prove that \(\sigma_1(A+B)\leq\sigma_1(A)+\sigma_1(B)\).

  3. Let \(\mathcal F\) be the vector space spanned by \[\left\{|\cos(x)\rangle, |\cos(2x)\rangle, |\cos(3x)\rangle, |\sin(x)\rangle, |\sin(2x)\rangle, |\sin(3x)\rangle\right\}.\] Show that the derivative operator \(D\) is linear on this space. Define the inner product on this space as \(\langle u(x)|v(x)\rangle=\frac{1}{\pi}\int_{x=-\pi}^\pi u(x)v(x)\operatorname{d}x\). Show that \(|D|=36\).

3 Direct Applications of SVD

3.1 Computer Programming

For the applications we discuss, the work will be done with computers. Though we won’t focus on programming in this book, there will be exercises with computer work.

If you have a language of choice, feel free to use that. If not, I recommend Python. For a quick start, I recommend using Google Colab to make small Python notebooks.

To do an SVD in Python is easy:

import numpy as np

A = ...  # Make matrix as a numpy array or a list of lists
U, d, VT = np.linalg.svd(A)  # d is a 1-D array of singular values, largest first
D = np.diag(d)               # for a non-square A, pad D with zeros to match A's shape

Some exercises will involve a deeper dive on the programming side. For these problems, you may need to look up and learn about libraries not covered in this book. I label these problems with a \(\diamondsuit\).

Exercises

For these exercises, we use

\[A = \left(\begin{array}{ccc} 1 & 2 & 3 \\ 0 & 3 & 5 \\ 3 & 6 & 9 \end{array}\right)\]

  1. Use the computer to calculate the SVD of \(A\).

  2. Use the computer to multiply \(U\), \(D\), and \(V^T\). Verify that it’s the same as \(A\).

  3. Verify that \(U\) and \(V^T\) are orthogonal.

  4. Write this in bra-ket notation.

  5. [Ex-Irrational Summands] Use the computer to multiply out the summands in the bra-ket notation. Notice that these are not rational numbers despite the sum being whole.

  6. Use the bra-ket notation to find the null space.

3.2 Collaborative Filtering

Famously Netflix created a prize several years ago, which paid anyone who could beat their recommendation algorithm by more than ten percent. The winner of the prize used collaborative filtering.

Netflix streams movies to a large number of users. Most of those streams originate from Netflix’s recommendations. One way to think of recommendation is as a map from users to movies. We might say that this is a linear map, so that this is just a giant matrix. But before we’re ready to talk about a recommendation matrix, let’s instead start with a "watch matrix" \(W\). \(W_{ij}=1\) if user \(j\) watched movie \(i\), and \(0\) otherwise.

This isn’t very useful yet, but let’s try to understand this better. When trying to understand a matrix, we could try taking the SVD.

\[W=\sum_i\lambda_i|m_i\rangle\langle u_i|\]

This decomposition has an enormous number of summands, and not all of them are equally important. The vectors are always unit vectors, but the \(\lambda_i\) will typically vary quite a bit. Some will be very large, while others will be relatively small.

An observation about a generic matrix \(A\): if one of the singular values, say \(\lambda_n\), is very tiny and you omit its summand from the SVD (call the resulting matrix \(A'\)), you get a matrix that is very similar, in the sense that \(A|v\rangle\approx A'|v\rangle\) for all \(|v\rangle\).

But who cares? Well, this gives a good way to smooth out a linear map. This is sometimes called a low-rank approximation. Similar to fitting a polynomial to a scatterplot, a low-rank approximation can help capture the signal through the noise. And experience has shown that this works quite well.

But why choose this particular low-rank matrix? Well, we’d want \(\left\|A|v\rangle-A'|v\rangle\right\|\) to be small as often as possible. Let’s again use the SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\). For the sake of this example, what’s the best rank-\(1\) matrix \(A'\) to approximate \(A\)? It turns out that \(\left\|A|v\rangle-A'|v\rangle\right\|\) is minimal for an "average" \(|v\rangle\) precisely when \(A'\) is the first summand of the SVD of \(A\).

Theorem 3.1. Given a matrix with SVD, \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), then \[\min_{A'\text{ rank-$1$}}E_{|v\rangle}\left(\left\|A|v\rangle-A'|v\rangle\right\|\right)\] is achieved when \(A'=\lambda_1|u_1\rangle\langle v_1|\).

Proving Theorem 3.1 would need a better definition of "average" and would be too much of a diversion here, but we do leave a simplified version as Exercise [Ex-CFMinimizes Simplified]. Applying this theorem repeatedly, we see that the best low-rank approximation we can get for a matrix is to just truncate the SVD after a number of the largest summands. Very cool!
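
Here’s a minimal numpy sketch of truncating an SVD (a random matrix stands in for real data): keeping more summands shrinks the error, and keeping all of them reproduces the matrix.

import numpy as np

def truncate_svd(A, k):
    """Rank-k approximation: keep the k largest summands of the SVD."""
    U, d, VT = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(d[:k]) @ VT[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
for k in range(1, 5):
    err = np.linalg.norm(A - truncate_svd(A, k))
    print(k, round(err, 4))          # the error shrinks as k grows, hitting ~0 at full rank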

Let’s talk about Netflix again. \(W\) is a large \(m\times n\) matrix, where \(m\) is the number of movies and \(n\) is the number of users. The rank is probably close to \(m\), which is probably over ten thousand. But what if we low-rank approximated \(W\) with a \(100\)-dimensional truncation:

\[W'=\sum_{i=1}^{100}\lambda_i|m_i\rangle\langle u_i|\]

Maybe this captures signal through the noise, but what does it give us practically? You’ll remember from Exercise [Ex-Irrational Summands] that the summands of the SVD usually aren’t rational, even if their sum is. \(W\) mapped users to movies, but it wasn’t very useful, because it only told us what movies a user already watched. \(W'\) maps users to movies too, but it gives a (generally irrational) value for every movie, which gives us a total ranking of movies. Now we can tell you not just what a user watched, but what they’re likely to have watched. By recommending movies that the user hasn’t watched but scores high for, we suggest movies they’re likely to want to watch.

This low-rank approximation is called collaborative filtering. The \(j\)-th column of the resulting matrix represents an interest profile for the \(j\)-th user, with one entry for every movie. This translates to recommendations by simply looking for the movies with the highest scores. There’s a lot of conversation in this section, but the algorithm is really nothing more than a truncated SVD.

Some form of Collaborative Filtering won the Netflix prize. Though conceptually simple when stood up against some of today’s complex models, it performed quite well. I should point out that they trained this not on \(W\), but on \(S\), where \(S_{ij}\) represents the star rating that user \(j\) gave to movie \(i\). (That is, users manually rated movies from 1 star to 5 stars.) If you were working at Netflix today and tried to build this, the choice between \(W\) and \(S\) would be a modeling decision you’d have to make.

3.2.1 Latent Dimensions and Embeddings

As it happens, the SVD is very good at extracting meaningful signal. The summands of the SVD end up corresponding to meaningful genres. You might get a dimension that represents action movies, which scores high for action movies and low for non-action movies. Or you might get a Kurt Russell dimension which is near \(1\) for all Kurt Russell movies and near \(0\) for all others. The value \(\langle m|m_i\rangle\) describes how well the movie \(|m\rangle\) fits into the \(i\)-th genre, and the value \(\langle u_i|u\rangle\) describes how much the user \(|u\rangle\) is interested in the \(i\)-th genre.

Continuing our example, say that our low-rank matrix has two genres, action and Kurt Russell (\(KR\)), and say that both singular values equal \(1\), so that

\[W'=|m_{action}\rangle\langle u_{action}|+|m_{KR}\rangle\langle u_{KR}|\]

Let’s say our movie library includes Overboard (\(\mathcal O\)), Escape From New York (\(\mathcal E\)), Frankenstein (\(\mathcal F\)), and Rambo (\(\mathcal R\)). (Sorry to anyone that hates movies.) Let’s also assume the following genre assignments:

\[\begin{array}{lll} \langle m_{\mathcal O}|m_{action}\rangle=0&\quad\quad&\langle m_{\mathcal O}|m_{KR}\rangle=1 \\ \langle m_{\mathcal E}|m_{action}\rangle=1&\quad\quad&\langle m_{\mathcal E}|m_{KR}\rangle=1 \\ \langle m_{\mathcal F}|m_{action}\rangle=0&\quad\quad&\langle m_{\mathcal F}|m_{KR}\rangle=0 \\ \langle m_{\mathcal R}|m_{action}\rangle=1&\quad\quad&\langle m_{\mathcal R}|m_{KR}\rangle=0 \end{array}\]

Now assume you’re a user who is three times more interested in action movies than in Kurt Russell. That is:

\[\begin{array}{lll} \langle u_{action}|u\rangle=3&\quad\quad&\langle u_{KR}|u\rangle=1 \end{array}\]

Then we see that something like Rambo may score pretty high for this user:

\[\langle m_{\mathcal R}|W'|u\rangle = \langle m_{\mathcal R}|m_{action}\rangle\langle u_{action}|u\rangle + \langle m_{\mathcal R}|m_{KR}\rangle\langle u_{KR}|u\rangle = 1\cdot 3 + 0\cdot 1 = 3\]

And though we see that the user has no interest in Frankenstein, \(\langle m_{\mathcal F}|W'|u\rangle=0\), they’ll have some mild interest in Overboard, \(\langle m_{\mathcal O}|W'|u\rangle=1\). Yet, because these features are additive, this user will have the most interest in Escape From New York, a movie that both is an action movie and stars Kurt Russell, \(\langle m_{\mathcal E}|W'|u\rangle=4\).

Collaborative Filtering is also smart enough to figure out when two weak signals should be considered more highly than a single strong signal. For the sake of argument, say that Death Proof (\(\mathcal D\)) is rated lower for Kurt Russell, because he’s not the lead, \(\langle m_{\mathcal D}|m_{KR}\rangle = 0.5\), and is rated lower for action, because it’s a bit of a genre-bender, \(\langle m_{\mathcal D}|m_{action}\rangle = 0.7\). We can do the calculation to find that \(\langle m_{\mathcal D}|W'|u\rangle = 0.7\cdot 3+0.5\cdot 1 = 2.6\). So it should be ranked above Overboard but below Rambo, closer to Rambo.
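
To make the arithmetic concrete, here’s a minimal numpy sketch of this toy example. The genre scores below are the made-up \(\langle m|m_{action}\rangle\) and \(\langle m|m_{KR}\rangle\) values from the text, not the output of a real SVD.

import numpy as np

titles = ["Overboard", "Escape From New York", "Frankenstein", "Rambo", "Death Proof"]
action = np.array([0.0, 1.0, 0.0, 1.0, 0.7])   # <m|m_action> for each movie
kr     = np.array([1.0, 1.0, 0.0, 0.0, 0.5])   # <m|m_KR> for each movie

u_action, u_kr = 3.0, 1.0                      # <u_action|u> and <u_KR|u> for our user

# With both singular values equal to 1, <m|W'|u> is just the weighted sum of genre scores.
scores = action * u_action + kr * u_kr
for title, s in zip(titles, scores):
    print(f"{title}: {s:.1f}")                 # 1.0, 4.0, 0.0, 3.0, 2.6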

We should pause to mention the importance of orthogonality here. You may object that Kurt Russell has done mostly action movies. Isn’t a Kurt Russell fan just a specific type of action fan? Though this example isn’t very realistic, in actuality the dimensions will be orthogonal. If a user watches Escape From New York, The Thing, and Big Trouble in Little China, Collaborative Filtering can figure out how much of this behavior is attributable to the user being a run-of-the-mill action fan, and how much should be attributed to an affinity for Kurt Russell specifically. They may end up at \(0.8\) on the action axis and \(0.3\) on the Kurt Russell axis. But that \(0.3\) won’t be in a vacuum. The model can figure out that for that user Kurt Russell enhances an action movie by \(0.3\), so that an action movie with Kurt Russell will end up totaling \(1.1\). This \(0.3\) will apply to all Kurt Russell movies, so that if some other signal gives them a rom-com score of \(0.35\), then Overboard (a rom-com with Kurt Russell) will score \(0.65\).

The dimensions "action" and "KR" aren’t decided ahead of time in a board room, but rather come out as the result of Collaborative Filtering. These are called latent dimensions, where "latent" means something like hidden or underlying. Collaborative Filtering basically "finds" these hidden dimensions by finding the axes onto which the data should be projected to capture the most prominent signal. They may even surprise the Netflix data scientists by surfacing genres that they didn’t know existed, like "90s irreverent comedies."

Our example isn’t realistic, because in practice the numbers won’t be rational, so \(1\) and \(0\) are unlikely. Rather, what you’d see is a gradation. For the action dimension, maybe Escape From New York scores a \(0.99\), but Big Trouble in Little China (an action movie with fewer action scenes) may score a \(0.82\). Hateful Eight and Miracle might score a middling \(0.67\) and \(0.40\) respectively; these speak more to how likely they are to appeal to action-movie fans than anything else. And Overboard might score a low \(0.11\). (These numbers are all made up, of course.) For the Kurt Russell dimension, movies may be a little more black-and-white, but there will still be some gray. Movies where Kurt Russell is on the poster may score higher than ones where he has a smaller role. Some non-Kurt Russell movies might score high; rather than resembling a Kurt Russell movie per se, those movies will score based on how much they would appeal to a Kurt Russell fan. For example, Ladyhawke originally cast Kurt Russell as the lead, and would make sense in his career; so it might score \(0.4\) for KR despite not having Kurt Russell anywhere in it. The scores may even surprise the data scientists doing the calculation. They may see Overlord score really high on the Kurt Russell dimension, only to realize that it stars Kurt Russell’s son. Or they may see The Christmas Chronicles score lower than they expected; although this movie features Kurt Russell prominently, it’s a break from the movies he’s most well-known for.

When you browse Netflix, you’ll see Netflix group movies by genre or category, like say "Irreverent 90s Comedies." Although we don’t know how Netflix makes these categories, it’s likely they do something like: List movies that score high for a given latent dimension, then have a human curate these by adding or subtracting movies.

3.2.2 Embeddings

"Embeddings" is a term that’s used a lot in ML. There’s a precise definition for it, and it’s rich with connotation. But for us, we’ll just use it to mean a dimensionality reducer. For example, if you take a user, their full set of data would be their column in \(W\): which movies they watched and didn’t watch. If there are \(m\) movies, then this is an \(m\)-dimensional vector. But as soon as we’ve decided that we’re going to do Collaborative Filtering, we don’t need that much data from a user. All we actually care about are the values \(\langle u_i|u\rangle\) for each \(i\) representing a latent dimension. (In our above example there were \(100\) of these.)

The map

\[\vec u\mapsto\left(\langle u_1|u\rangle, \langle u_2|u\rangle, \ldots\right)\]

is a dimensionality reducer, and so it fits our definition of an embedding.

It also happens to be a linear map, and it’s exactly equal to \(U^T\), where \(W'=MDU^T\) is the matrix form of the SVD of \(W'\). Another embedding is \(\sqrt{D}U^T\), where \(\sqrt{D}\) is a diagonal matrix whose entries are the square roots of the entries of \(D\).

We should be a little more careful with dimensions. Recall that \(U^T\) is an \(n\times n\) matrix, where \(n\) is the number of users. This isn’t actually a dimensionality reducer, but we think of it like it is. This is because \(DU^T\) will zero out all but the first \(100\) entries of the output. \(\sqrt DU^T\) is an \(m\times n\) matrix, but all but the first \(100\) rows will be zero. So if we replace \(\sqrt D\) with a diagonal matrix with the same entries but shaped \(100\times n\), we have a true dimensionality reduction. Let’s go ahead and do that, so that all of this is just a pesky caveat.

If we do the same thing for movies, the linear map \(\sqrt DM^T\) takes movie vectors (indexed by the rows of \(W'\)) to a \(100\)-dimensional vector space. We can think of this space as the same space as the image of \(\sqrt DU^T\). If we take the embedding of a movie \(|m\rangle\), \(\sqrt DM^T|m\rangle\), and the embedding of a user \(|u\rangle\), \(\sqrt DU^T|u\rangle\), and take the inner product, we get \(\langle m|M\sqrt D\sqrt DU^T|u\rangle=\langle m|W'|u\rangle\). So from these two embeddings, we can reconstruct the entries in \(W'\) by doing dot products. Recall that the dot product is the product of the norms times the cosine of the angle between the vectors. In practice the embeddings will be comparable in norm, so the closeness of the movie and user vectors, as measured by angle, corresponds to how well the movie matches the user. In fact, similar movies will be close to each other as well, owing to the fact that users who like one movie will like the other. Similar users will also be close.
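
Here’s a minimal numpy sketch of this bookkeeping, using a tiny random watch matrix as a stand-in: the dot products of the movie and user embeddings reproduce the truncated matrix \(W'\).

import numpy as np

rng = np.random.default_rng(0)
n_movies, n_users, k = 8, 5, 2                               # tiny sizes; k latent dimensions kept

W = (rng.random((n_movies, n_users)) < 0.4).astype(float)    # toy watch matrix (movies x users)

M, d, Ut = np.linalg.svd(W, full_matrices=False)
W_trunc = M[:, :k] @ np.diag(d[:k]) @ Ut[:k, :]              # the rank-k approximation W'

sqrtD = np.diag(np.sqrt(d[:k]))
movie_emb = sqrtD @ M[:, :k].T      # column j is the embedding of movie j
user_emb  = sqrtD @ Ut[:k, :]       # column j is the embedding of user j

print(np.allclose(movie_emb.T @ user_emb, W_trunc))          # True: dot products recover W'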

This is just a rearrangement of the same data, but the interpretation is helpful. We can leverage approximate nearest neighbor algorithms to produce good recommendations on extremely large datasets. As well, the movie-to-movie similarity that this interpretation gives you lets you make categories like "Because you watched The Christmas Chronicles."

3.2.3 Averaging

There’s a cool feature of Collaborative Filtering that I want to mention. Call \(B=\sqrt DU^T\), the embedding of users, and \(A=\sqrt DM^T\), the embedding of movies, so that \(W\approx W'=A^TB\). Let’s say that you know the embeddings of movies, \(A\), the singular values \(D\), and the watch matrix \(W\). From these can you get the user embeddings \(B\)? Yes, some simple algebra yields:

\[\begin{array}{rcl} W\approx A^TB \implies B & \approx & \left(A^T\right)^{-1}W \\ & = & \left(MD^{1/2}\right)^{-1}W \\ & = & \left(D^{-1/2}M^T\right)W \\ & = & D^{-1/2}D^{-1/2}D^{1/2}M^TW \\ & = & D^{-1}AW \\ \end{array}\]

Call \(D^{-1}A\) the "modified embedding" (a non-standard term). This amounts to taking the embedding of each movie and dividing the \(i\)-th entry by \(\lambda_i\). Now the above equation says that the embedding for a user is approximately the sum of the modified embeddings of the movies that the user watched! This is hugely helpful when a user starts on the platform. After they watch a few movies, we can determine what their embedding should be, without having to refactor an enormous matrix.
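
Here’s a minimal numpy sketch of that claim, again on a tiny random watch matrix: summing the modified embeddings of the watched movies reproduces the user embeddings (exactly in this setup, since \(A\) and \(B\) come from the SVD of \(W\) itself).

import numpy as np

rng = np.random.default_rng(3)
n_movies, n_users, k = 8, 6, 2
W = (rng.random((n_movies, n_users)) < 0.4).astype(float)

M, d, Ut = np.linalg.svd(W, full_matrices=False)
M, d, Ut = M[:, :k], d[:k], Ut[:k, :]                # keep k latent dimensions

A = np.diag(np.sqrt(d)) @ M.T                        # movie embeddings (k x n_movies)
B = np.diag(np.sqrt(d)) @ Ut                         # user embeddings  (k x n_users)

# Each column of W picks out the movies a user watched, so D^{-1} A W sums the
# modified embeddings of those movies, one user per column.
B_from_watches = np.diag(1.0 / d) @ A @ W
print(np.allclose(B, B_from_watches))                # True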

There’s some theoretical value here too. Let’s say that the top \(100\) singular values are all close in value; call it \(\lambda\). (The idea here is that because we’re only keeping the largest singular values, they’ll be near to each other.) Then we can say that \(D\approx\lambda I\). (It’s this author’s experience that this assumption won’t be close to true; however, making the assumption will still give a useful model that makes good recommendations.) Under this assumption, we can drop the "modified." Then the user embedding is the sum of the movie embeddings across all the movies they watched, AND the movie embedding is the sum of the user embeddings across all the users who watched it. This fact gives a good intuition for what Collaborative Filtering actually does.

A quick note on notation before we wrap up: Breaking \(UDV^T\) into two pieces, \(U\sqrt D\sqrt DV^T\), makes the bra-ket notation a little trickier. To keep each piece a sum, we need to put the \(i\)-th term of \(\sqrt DV^T\) into the \(i\)-th coordinate by attaching \(|e_i\rangle\). So we write \(\sqrt DV^T=\sum_i\sqrt{\lambda_i}|e_i\rangle\langle v_i|\) and \(U\sqrt D=\sum_i\sqrt{\lambda_i}|u_i\rangle\langle e_i|\). Verify that these multiply to equal \(\sum_i\lambda_i|u_i\rangle\langle v_i|\).

Exercises

  1. [Ex-CFMinimizes Simplified] In this exercise we prove a simplified version of Theorem 3.1; instead of proving it for an average vector, we prove it for an average basis vector. Given an SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\), with distinct singular values and the vectors \(\left\{\langle v_i|\right\}\) spanning the domain, prove that \(\frac{1}{n}\sum_j\left\|\left(A-\lambda|u\rangle\langle v|\right)|v_j\rangle\right\|\) (with \(\|\vec u\|=\|\vec v\|=1\)) is minimal when \(\lambda=\lambda_1\), \(|u\rangle=|u_1\rangle\), and \(\langle v|=\langle v_1|\). (Hint: Use \(\|\vec x\|^2=\langle x|x\rangle\). Then use \(\sum_i\langle v_i|v\rangle^2 = 1\) along with some other identities to reduce. Then use Corollary 2.11 along with some calculus to minimize.)

  2. Suppose you are running a small movie-streaming service, and you have the following watch matrix:

              Squid Game   Tiger King   Nomadland   The Father   Mank   Hamilton   Borat   Palm Springs
    Abed          1            0            1            1        0        0         0          0
    Bayani        1            1            1            1        0        0         0          0
    Carol         0            1            1            1        1        0         0          0
    Diego         1            1            1            0        0        0         1          1
    Ellie         1            1            0            1        0        1         1          0
    Frank         1            1            0            0        0        1         0          0

    Using a computer, perform Collaborative Filtering using \(1\) latent dimension, then \(2\), \(3\), \(4\), and \(5\) dimensions. For each of these, what movie is recommended to each user? (Do not recommend a movie that the user has already watched.) Discuss these results. How do the recommendations change as you increase the number of dimensions? Do you agree with the recommendations? How many dimensions would you recommend using?

  3. Prove that \(\langle e_i|W^TW|e_j\rangle\) equals the number of movies that users \(i\) and \(j\) both watched.

  4. Users sometimes provide star ratings (\(1\)-\(5\) stars) for movies. An alternative approach would be to use these ratings instead of a watch matrix. What are some pros and cons of this alternative approach?

  5. A user not watching a movie doesn’t mean that they don’t like the movie. How can you use user traffic to determine when a user isn’t interested in a movie? How could you use this information in a Collaborative Filtering approach?

  6. Suppose that a matrix \(A\) has all values \(0\leq A_{ij}\leq1\), and let \(A'\) be a low-dimensional approximation obtained by taking a partial sum of the SVD. Prove or disprove: \(A'\) also has all values \(0\leq A'_{ij}\leq1\).

  7. Imagine that you’re building a Netflix-like recommendation engine, and you’ve discovered that your engine doesn’t work well for black users. To measure this effect, you modify the loss equation from Theorem 3.1: \[\min_{A'\text{ rank-$r$}}E_{|v\rangle\text{, black users}}\left(\left\|A|v\rangle-A'|v\rangle\right\|\right)\] Explain how you might modify Collaborative Filtering to reduce this loss. Consider algorithm runtime, and how you’ll score new users.

  8. [Ex-MovieLens] \(\diamondsuit\) Download the MovieLens dataset from https://grouplens.org/datasets/movielens/. (Choose which size dataset you want to work with. For a greater challenge, use a larger set.) Perform Collaborative Filtering on the data. Test your model by filling in 5-star ratings for some of your favorite movies; what movies get recommended? (If you don’t like movies, then try entering Escape from New York, Overboard, and Death Proof.) Bonus: Make software (or a Jupyter notebook) that lets the user enter movies and updates the top-10 recommendation list.

3.3 Principal Component Analysis

Let’s say you’re trying to build a credit profile on a set of users. We know that everyone gets a one-number credit score, but if you’re a bank or an insurance company, you may find this insufficient. These same credit companies will sell you a packet of hundreds (or thousands) of variables for each of your customers. You can use these to predict loan defaults or car accidents. You could even build an insurance-specific credit score for each of your customers. (Such a score would omit medical debt, by law.) This will be our motivating example for this section. Although the use of credit scores can be controversial, we’ll focus only on the modeling challenge it presents.

A difficulty of working with hundreds of variables is that modeling becomes hard. Experienced practitioners may agree with this immediately. Neural network architects may laugh at fewer than a thousand variables. And newer modelers may not see why this would be difficult; can’t I just hit run on my regression? So I feel this claim needs some justification. I’ll give one example of a problem. Take a regression on a hundred variables. Likely some of these are correlated with each other. Linear regressions struggle with correlation: sometimes the fit will give a large coefficient to one variable, other times to a correlated variable. The inclusion or exclusion of a few points can change which variable gets the coefficient. If you’re not careful, a regular refresh can change your model a lot. Even if you only added a few points, you could greatly impact the credit score you give to some users. With low-variable models, it is easier to monitor and account for this.

So if you’re a credit modeler, you may want to perform a dimensionality reduction, something we know about. The most canonical dimensionality reducer is Principal Component Analysis (PCA). We’ll describe this soon, but first let’s imagine that you’re the first person to come up with a dimensionality reducer - it’s never been done before. You have \(1000\) dimensions, but you’re targeting \(10\). Your first approach might be to just take the first \(10\) variables on your list. Nice job! You’ve just built a dimensionality reducer! But it’s not very good. There’s not actually an order to the variables, so why these ten and not some other ten? Well, okay, which variables would be the best candidates?

An example of a bad variable to use is the number of bankruptcies in the last three years, because the vast majority of users have not had any bankruptcies in the last three years. And if you only have ten variables, you don’t want to waste one on a variable that doesn’t give you much information. On the other hand, a variable like total credit card debt will be better because it gives you a big range of experiences that varies from customer to customer. There are lots of things that make variables "good" or "bad;" I’m choosing to focus on this specific criterion. I propose that if a variable has lots of variance, then it’s a good variable to try. Without domain knowledge about what these variables are and how they will be used, this criterion seems as good as any.

Recall the definition of variance: It’s a measure of the spread of the data points, with points further from the center contributing more. It can be calculated as:

\[\label{VarianceEquation} \operatorname{Var}(X) = \operatorname{E}(X^2) - \operatorname{E}(X)^2\]

The \(\operatorname{E}\) means expected value, or average. We’ll ignore the second term; this move is justified later in the Normalization section. So we get:

\[\operatorname{Var}(X) = \frac{1}{n}\sum_{i=1}^n x_i^2\]

We can ignore the \(1/n\) also, because \(n\) is constant. If you have a matrix \(A\) where \(A_{ij}\) is the value of the \(i\)-th variable for the \(j\)-th user, then the values of the \(i\)-th variable form the vector \(\langle e_i|A\), and the sum of their squares is \(\langle e_i|AA^T|e_i\rangle\). Pretty cool; it’s starting to look familiar.

With this proposal, you’re trying to find the standard axes \(\vec e_i\) that maximize \(\langle e_i|AA^T|e_i\rangle\). But we know now that the standard axes are just one choice of basis; there’s nothing really special about them. To be more general, let’s consider any axis. That is, find the unit vector that maximizes \(\langle u|AA^T|u\rangle\). We know how to do this. We can perform the same algorithm that we use to find the \(\left\{|u_i\rangle\right\}\) in the SVD \(A=\sum_i\lambda_i|u_i\rangle\langle v_i|\). If we only want \(10\) axes, then we apply the algorithm for just \(10\) steps, which gives the \(|u_i\rangle\) that correspond to the \(10\) largest singular values. These chosen axes are called the principal components.

What’s nice about the SVD is that it builds orthogonality right in. This means that, like with Kurt Russell and action movies, the new axes will be independent. To visualize this, we think briefly about scatter plots. If we have \(n\) users and \(m\) features about them, we can view this as \(n\) points on an \(m\)-dimensional scatter plot. If two variables are highly correlated, then the points fall mostly along a line. For example, number of credit cards and amount of credit card debt will be correlated, so they will live mostly along a line. The first principal component will pick up this axis. Any other principal component will have to be perpendicular to this. Maybe it’ll represent something like the amount of credit card debt above what’s typical for the number of cards held.

(Figure: First and second principal components.)

After we get the principal components, we project a user’s data onto these axes. This projection map can be written \(\sum_i |e_i\rangle\langle u_i|\). The \(|e_i\rangle\) terms just say that we want the \(i\)-th projection to go in the \(i\)-th coordinate.
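
Here’s a minimal numpy sketch of the whole recipe on made-up data (features in rows, users in columns, and no normalization yet; that’s the next section):

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 200))      # A_ij = value of feature i for user j (toy data)

# The principal components are the top left singular vectors of A:
# the unit vectors maximizing <u|A A^T|u>.
U, d, VT = np.linalg.svd(A, full_matrices=False)
k = 2
components = U[:, :k]                  # columns are the chosen axes |u_1>, ..., |u_k>

# The map sum_i |e_i><u_i| projects each user's feature column onto those axes.
projected = components.T @ A           # shape (k, 200): one k-dimensional vector per user
print(projected.shape)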

3.3.1 Normalization

It’s impossible to talk about PCA without discussing normalization. Let’s say you have two variables, age and annual income. These are likely correlated in your data, and you want to find the best axis to capture them. But age only varies between users by a couple of decades, whereas income varies by tens of thousands of dollars. PCA will say that most of the variance lives in the income direction. But this doesn’t actually make sense, because income is measured in dollars and age is measured in years. Our answer shouldn’t change if we decide to measure income in euros and age in minutes. We need something unitless that captures some general sense of spread. As well, the range of incomes will be different from the range of ages. Why should a model capture that people don’t get credit cards until the age of \(18\)? We need something that also centers.

There are a few different ways to recenter and scale. More generally, such techniques are called regularization. We’ll talk about a specific technique, called normalization. This is done before we apply any algorithm. For each variable \(X\), with points \(x_i\), we make new variables:

\[x_i':=\frac{x_i-\operatorname{E}(X)}{\sqrt{\operatorname{Var}(X)}}\]

Say that the \(x_i'\) are the points of the new variable \(X'\). We’ll mention some nice effects of normalization. Subtracting \(\operatorname{E}(X)\) guarantees that \(\operatorname{E}(X')=0\). This is what allows us to ignore the \(\operatorname{E}(X)^2\) in Eq. ([VarianceEquation]) above. We put a square root in the denominator (vs. using the variance) to ensure that the units match the numerator. Also notice that \(\langle x_i'|(A')^TA'|x_i'\rangle=\frac{\langle x_i|A^TA|x_i\rangle}{\lambda_i^2} = \frac{\lambda_i^2}{\lambda_i^2}=1\). Because the \(\left\{|x_i\rangle\right\}\) form a basis, every vector in the span can be written as \(|x\rangle=\sum_ia_i|x_i\rangle\), where \(\sum_ia_i^2=1\); it follows that \(\langle x|(A')^TA'|x\rangle=1\) for all \(|x\rangle\).
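
Here’s a minimal numpy sketch of normalization on made-up age and income columns; the specific numbers are arbitrary.

import numpy as np

def normalize(A):
    """Center each feature (row) and scale it to unit variance."""
    mean = A.mean(axis=1, keepdims=True)
    std = A.std(axis=1, keepdims=True)
    return (A - mean) / std

rng = np.random.default_rng(5)
age = rng.uniform(18, 80, 200)
income = rng.uniform(20_000, 200_000, 200)
A = np.vstack([age, income])            # raw features on wildly different scales

A_norm = normalize(A)
print(A_norm.mean(axis=1).round(6))     # ~[0, 0]
print(A_norm.std(axis=1).round(6))      # [1, 1]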

3.3.2 User embeddings

Start with a matrix \(A\) that maps users to their normalized features, with SVD \(A=\sum_i|u_i\rangle\langle v_i|\). (Notice: the \(\lambda_i=1\) because we normalized.) We’ve been talking about the principal components as the vectors \(\left\{\vec u_i\right\}\). But if we want to know the values for an actual user \(e_j\), then we can first map the user to their raw variables, \(A|e_j\rangle\), and then project the raw values (number of bankruptcies, number of credit cards, etc.) onto the principal components by applying the map \(\sum_{i=1}^{10}|e_i\rangle\langle u_i|\). Then this equals:

\[\left(\sum_{i=1}^{10}|e_i\rangle\langle u_i|\right)\left(\sum_{i=1}^{1000}|u_i\rangle\langle v_i|\right)|e_j\rangle = \sum_{i=1}^{10}|e_i\rangle\langle v_i|e_j\rangle\]

Comparing to the user embeddings from the last section, we get that the embedding of the \(j\)-th user with \(10\) latent dimensions is \(\sum_i|e_i\rangle\langle v_i|e_j\rangle\). They’re the same. We might think of PCA as Collaborative Filtering where we just don’t use the embeddings of the features. It’s an imperfect analogy, because really these are used in different cases. Not only do we not want the "embeddings" of the features, but the vectors \(\left\{\vec u_i\right\}\) are not even meaningful in the same way that the movie vectors were; the rows of \(U\sqrt D\) would be the embedding of a credit profile that has a single normalized feature set to \(1\) and the rest set to \(0\), which doesn’t represent any real profile.

Exercises

  1. [Ex-Iris PCA] Load the Iris dataset. In Python this can be done with:

    import sklearn.datasets
    iris = sklearn.datasets.load_iris()

    This is available in other languages too. The dataset has \(150\) datapoints with \(4\) numerical features and a single categorical target. In Python, you can get these as numpy arrays:

    features, target = iris["data"], iris["target"]

    Use PCA to reduce the feature space to a \(3\)-dimensional space. Report the principal components. These should be three vectors, each \(4\)-dimensional. You can either compute this manually using the SVD functionality introduced in Section 3.1, or you can look up a PCA function (like in sklearn).

  2. Given a matrix \(A\), and an SVD \(AA^T=UDV^T\), consider the embedding of the \(i\)-th row of \(A\) given by \(\sqrt DV^TA|e_i\rangle\). Show that the dot product of the \(i\)-th embedding and the \(j\)-th embedding equals \(\langle e_i|A^TDA|e_j\rangle\).

  3. One might argue that the rows of the watch matrix from the previous section should be normalized. What are some pros and cons of this?

  4. \(C=A^TA\) is sometimes called the covariance matrix, because every term, \(C_{ij}\) is the covariance between the \(i\)-th and \(j\)-th variable. Suppose \(A'\) normalizes the rows of \(A\). Prove that \(C'=(A')^TA'\) is a correlation matrix, in the sense that \(C'_{ij}\) is the Pearson’s correlation coefficient between the \(i\)-th and \(j\)-th variable of \(A\).

  5. For both parts of this problem, assume you have some matrix

    \[A=\left(\begin{array}{cc} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_n & y_n\end{array}\right)\]

    1. Suppose that for all \(i\), \(y_i=mx_i\) for some fixed \(m\). Prove that \(A=\lambda|u\rangle\langle v|\), and that the data points \((x_i, y_i)\) all lie in \(\operatorname{span}\{|v\rangle\}\). (Hint: Guess and verify an SVD, then observe that this is unique up to \(\pm |v\rangle\).)

    2. Suppose that for all \(i\), \(y_i=(m+\varepsilon_i)x_i\), where each \(\varepsilon_i\) is taken independently from a normal distribution \(\operatorname{N}(0, \sigma)\), for some small fixed \(\sigma\). Given an SVD, \(A=\lambda_1|u_1\rangle\langle v_1|+\lambda_2|u_2\rangle\langle v_2|\), prove that \(\operatorname{E}\left(|v_1\rangle\right)=|v\rangle\), where \(|v\rangle\) is from part (a).

  6. Critique the following reasoning: “A model is trained on a PCA based on 3 years’ worth of data. After one month, the PCA is updated with the extra month of data, but still using the first 3 years. I shouldn’t have to retrain the model, since the principal components will be approximately the same as the previous month.”

3.4 Partial Least Squares

In the previous section, we may have rejected the number of bankruptcies too quickly. Though it’s uncommon, it may be very predictive of a default. If so, we should include it. Instead of looking at the variance of the feature space by itself, Partial Least Squares (PLS) tries to find the axes that most correlate with the target.

Continuing the example, say that the target is loan default (\(1\) for default and \(0\) for no default). We have historical data which says whether users with a given profile defaulted. With \(n\) historical data points, we put these targets in an \(n\times 1\) matrix \(X\). We also have features for these data points, an \(n\times m\) matrix \(Y\). \(X^TY\) is a \(1\times m\) matrix, where each entry \((X^TY)_{1j}\) is (up to a constant factor) the correlation coefficient of the \(j\)-th dimension of \(Y\) with the target variable, provided that \(X\) and \(Y\) are normalized. \(\langle u|X^TY|v\rangle\) is the correlation between the axes \(X|u\rangle\) and \(Y|v\rangle\). To maximize correlation, we find the maximizing \(\vec u\) and \(\vec v\), which we can again do with the SVD. \(\vec u\) is a \(1\)-dimensional unit vector, so it must be \(\pm (1)\). The maximizing vector \(|v\rangle\) is more interesting; it represents the axis in the feature space that most correlates with the target. Because \(X^TY\) has rank \(1\), we can only get a single PLS dimension out. However, this dimension is likely to be a good dimension for predicting a default.
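
Here’s a minimal numpy sketch of the single-target case on synthetic data; the target is built to depend on two of the features, and the PLS axis (the top right singular vector of \(X^TY\)) puts its weight on those two.

import numpy as np

rng = np.random.default_rng(6)
n, m = 500, 8
Y = rng.standard_normal((n, m))                                       # features (n points, m variables)
X = Y[:, [0]] + 0.5 * Y[:, [3]] + 0.1 * rng.standard_normal((n, 1))   # synthetic target

# Normalize both: zero mean, unit variance per column.
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# The PLS axis is the top right singular vector of the 1 x m matrix X^T Y.
U, d, VT = np.linalg.svd(X.T @ Y)
pls_axis = VT[0]
print(np.round(pls_axis, 2))      # weight concentrated on features 0 and 3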

If there are multiple target dimensions, you could still perform an SVD on \(X^TY\), and get as many PLS axes as there are target dimensions. (See Exercise [Ex-Iris PLS] for an example of this.) In that case, the vectors \(\left\{\vec u_i\right\}\) will be non-trivial. However, even in that case, only the dimensions in the feature space, \(\left\{\vec v_i\right\}\) will be of interest.

Collaborative Filtering and PCA are considered unsupervised models, because they don’t take the target into account. PLS is a supervised model because it does. Unsupervised models are common for building embeddings, because we think of them as trying to capture something intrinsic about the data. A single embedding will often be used in multiple applications, and tying it too closely to a single target may not be desirable.

PLS is less commonly used than the other algorithms presented here. One reason may be that it can only produce as many dimensions as you have targets. (However, see Exercise [Ex-PLS PCA hybrid] for a way to supplement PLS with PCA.) Another reason may be that PCA is an unsupervised algorithm. This means that you don’t need target data. But also, dimensionality reduction is usually done upstream of modeling; the output may be used for exploration or sent to many different applications. A third reason that PLS is less common than PCA may simply be that it is less well-known. PLS might be thought of as a variant of PCA. PCA is valued for its simplicity, and modelers probably don’t spend much time researching variants. In my experience, I’ve found that PLS can be very useful, especially if you have a single application in mind. It’s my opinion that PLS should be a tool in any modeler’s tool belt.

Exercises

  1. [Ex-Iris PLS] Load the Iris dataset. (See Exercise [Ex-Iris PCA] for more details.) Make three target variables, each \(1\) or \(0\), indicating whether the datapoint is a Setosa, Versicolor, or Virginica iris. Use PLS to reduce the features to a \(3\)-dimensional space.

  2. Critique the following model proposal: "Given loan data, we use PLS to get the single dimension that is most correlated with default. The applicants who score lower on this dimension must be less likely to default. Approve all loans below a certain threshold."

  3. You and a coworker are building an insurance model on credit data. The model will be a logistic regression to predict the probability of an insurance claim. You have historical claim data. You think you should use only PLS, and your coworker thinks you should only use PCA. How can you decide with data which model is better?

  4. [Ex-PLS PCA hybrid] Consider a scheme where you use both PLS and PCA. If you start with 1 PLS dimension \(|x\rangle\), how can you choose the dimension that is orthogonal to this dimension and that captures the most variance? That is, how do you find

    \[\max_{\left\||u_1\rangle\right\|=1, \langle x|u_1\rangle=0}\langle u_1|A^TA|u_1\rangle\]

  5. Imagine you fit a logistic regression on a PLS dimension. Because the PLS dimension and the regression are both fit on the same data, you have concerns that you are overfitting the data. How can you confirm that your model is not overfit?

  6. Given a feature matrix, \(\left(\begin{array}{cc}a&b\\c&d\end{array}\right)\), and a target variable, \(\left(\begin{array}{c}x\\y\end{array}\right)\), suppose you compute a PLS dimension as \(\left(\begin{array}{c}u\\v\end{array}\right)\). Compute \(\frac{\partial u}{\partial x}\), \(\frac{\partial u}{\partial y}\), \(\frac{\partial v}{\partial x}\), and \(\frac{\partial v}{\partial y}\).

  7. \(\diamondsuit\) Start with embeddings from problem [Ex-MovieLens] or found online, and filter this list to about 100 movies that people are likely to know. (Look at the IMDB top 250 if you aren’t sure.) Make a game, called Hot-or-Cold: The game chooses a random (secret) movie from those roughly 100 movies. The user guesses which movie it is, and the program tells them how close they are to the target by reporting the cosine similarity between the guessed movie and the secret movie. Show the player the top-100 list, or allow the user to enter any movie from the larger dataset.

3.5 Canonical Correlation Analysis

For this last section, we imagine that we’re making recommendations in the iPhone app store. A partner team uses a model that takes as input features of apps that users have downloaded, and outputs a recommendation for a new app. We don’t know exactly how that model works, but that team counts on us to build features that represent apps well.

We’ll consider a couple of good approaches to featurizing apps. One approach is to use Collaborative Filtering to get co-download embeddings for the apps. This means that if two apps have lots of users in common, then the apps’ embedding vectors will be near each other; apps with fewer users in common will be far apart. Another approach is to use the app description. These descriptions are entered by the developer and are presented to the user in the app store. Somebody from the natural language team has made embeddings based on these text descriptions.

Although these two embeddings are capturing different signal, these signals are likely to be correlated. For example, apps like Microsoft Teams, Microsoft Word, and Microsoft OneDrive are likely to be downloaded by the same people, but these are also likely to all say “Microsoft” in the description. Here we think correlated signals may amplify each other. The word “bird” might show up in bird watching apps, but also in bird-themed cell phone games (Angry Birds, Flappy Bird, and some knock-offs). It’s not until a signal shows up in both embeddings that we think we’ve caught a family of related apps.

To do this, we look for axes that maximize covariance. This is exactly the same problem as PLS. Specifically, we start with two matrices of features, \(X\) and \(Y\), of dimensions \(m_1\times n\) and \(m_2\times n\), where \(n\) is the number of apps, and \(m_1\) and \(m_2\) are the numbers of dimensions in the two feature sets. The matrix \(XY^T\) has entries \((XY^T)_{ij}\) proportional to the correlation between the \(i\)-th feature in \(X\) and the \(j\)-th feature in \(Y\), provided the features are normalized. We look for the axes that maximize \(\langle u|XY^T|v\rangle\). Again this is solved with an SVD. We call the search for these axes a Canonical Correlation Analysis (CCA).
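
Here’s a minimal numpy sketch of this version of CCA (the SVD of \(XY^T\), as described above) on synthetic data where both feature sets share one underlying signal:

import numpy as np

rng = np.random.default_rng(7)
n, m1, m2 = 300, 6, 4
z = rng.standard_normal(n)                                             # shared signal
X = np.vstack([z + 0.3 * rng.standard_normal(n) for _ in range(m1)])   # m1 x n features
Y = np.vstack([z + 0.3 * rng.standard_normal(n) for _ in range(m2)])   # m2 x n features

# Normalize each feature (row): zero mean, unit variance.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
Y = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)

# The CCA axes are the top singular vectors of X Y^T.
U, d, VT = np.linalg.svd(X @ Y.T)
u1, v1 = U[:, 0], VT[0]        # the axes in X-space and Y-space with maximal covariance
print(np.round(u1, 2))
print(np.round(v1, 2))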

Many of the topics covered in this chapter are closely related, but none more than PLS and CCA. Again the use cases for the two are much different. For PLS, we use the target variable to determine the best axis among a single set of features. But for CCA, we use two sets of features to find the best axis.

Because there are two sets of features, we end up having two signals we can attach to any app. We can embed an app \(|e_i\rangle\) by first mapping it to its \(X\) features, then embedding this like we did in the Collaborative Filtering section, \(\sqrt DU^TX|e_i\rangle\). Or we could map first into \(Y\) and embed, \(\sqrt DV^TY|e_i\rangle\). These are almost the same signal (see Exercise [Ex-Two LSRLs]), and in practice we could use just one of them.

Exercises

  1. For each of the following scenarios choose one of the models we covered in this chapter, and describe how it could be used to help the situation.

    1. Twitter is trying to recommend posts to users. You have a lot of data about historical post views for each user, and data about posts, including co-views and text embeddings.

    2. You’re trying to build a spam filter. You have a large labeled set, and features for each of the emails, including text embeddings, domain-specific flags, and author information.

    3. YouTube is trying to build a user embedding based on a very large number of features, including geographic, platform, demographics, watch history, and others. The embeddings will be used for several applications.

    4. Pinterest is trying to build a user embedding based only on what posts the user favorited. The embeddings will be used for several applications.

    5. An insurance company is trying to build a fraud detector based only on the text content of police reports. You have a large number of labeled data points.

  2. [Ex-Two LSRLs] You do CCA on two matrices \(X\) and \(Y\) to get that \(XY^T=UDV^T\). Imagine that we used the axes \(\left\{\vec u_i\right\}\) to perform a linear regression. A linear regression treats the vectors \(\left\{\vec u_i\right\}\) as the data points, along with targets \(y_i\), and it tries to find the coefficients that minimize the square error:

    \[\min_{\left\{c_j\right\}}\sum_i \left(y_i-\sum_jc_ju_{ij}\right)^2 = \min_{\langle c|}\sum_i \left(y_i-\langle c|u_{i}\rangle\right)^2\]

    Show that when \(D=I\), that:

    \[\min_{\langle c|}\sum_i \left(y_i-\langle c|u_{i}\rangle\right)^2 = \min_{\langle c|}\sum_i \left(y_i-\langle c|v_{i}\rangle\right)^2\]

  3. Imagine you have two embeddings representing the same data. To get a \(20\)-dimensional representation of the data, you suggest using CCA on the two embeddings. Your coworker argues that to ensure orthogonality, you should instead do PCA on both embeddings individually to get \(10\) dimensions from each. How would you argue that your approach is better?

  4. Suppose that \(X\) and \(Y\) have the same latent dimensions with different singular values. That is, \(X=\sum_i\lambda_i|u_i\rangle\langle v_i|\) and \(Y=\sum_i\mu_i|u_i\rangle\langle v_i|\). Show that the first axis that CCA produces is \(\langle v_1|\).

  5. Suppose that \(X\) and \(Y\) have SVDs \(X=\sum_i\lambda_i^X|u_i^X\rangle\langle v_i^X|\) and \(Y=\sum_i\lambda_i^Y|u_i^Y\rangle\langle v_i^Y|\), with \(\langle u_i^X|v_j^Y\rangle=0\) and \(\langle u_i^Y|v_j^X\rangle=0\) for all \(i\) and \(j\). (The superscripts are indices, and not exponents.) Prove that the resulting CCA axes are the principal components from \(X\) and \(Y\). (Hint: Consider the PCA of \((X-Y)\).)

4 [WIP] Regression

4.1 Projection matrices

We’ll talk now about an important family of matrices, called projection matrices. A Projection Matrix is a square matrix with an SVD given by \(P=\sum_i|v_i\rangle\langle v_i|\) for some set of orthonormal vectors. The important properties are that each vector is paired with itself and that all coefficients are \(1\). Note that projection matrices under this definition are necessarily symmetric. On the surface, this is a simple definition: a projection matrix just has a special SVD. But it has a number of equivalent properties, which are sometimes given as the definition instead.

Theorem 4.1. For a matrix \(P\), with \(\mathcal P:=\operatorname{Im}(P)\), the following properties are equivalent:

  1. [pm_svd] Every SVD representation of \(P\) has the form \(P=\sum_i|v_i\rangle\langle v_i|\), where the set \(\left\{|v_i\rangle\right\}_i\) spans \(\mathcal P\).

  2. [pm_projection] \(P|v\rangle=|v\rangle\) for all \(|v\rangle\in\mathcal P\), and \(P|w\rangle=|0\rangle\) for all \(|w\rangle\in\mathcal P^\bot\).

  3. [pm_idempotence] \(P^2=P\) and \(P\) is symmetric.

Property [pm_projection] is the projection property that gives these matrices their name, and it’s probably the most important of the three. This property says that the matrix projects vectors onto its image, \(\mathcal P\). We know from Theorem 9.5 that we can write any vector \(|v\rangle\) as \(|v_\parallel\rangle+|v_\bot\rangle\), where \(|v_\parallel\rangle\in\mathcal P\) and \(|v_\bot\rangle\in\mathcal P^\bot\). Property [pm_projection] then tells us that \(P|v\rangle=|v_\parallel\rangle\); this is the orthogonal projection of \(|v\rangle\) to the subspace \(\mathcal P\). This is called orthogonal because the difference, \(|v_\bot\rangle\), is orthogonal to \(\mathcal P\). \(|v_\parallel\rangle\) happens to be the vector in \(\mathcal P\) that’s closest to \(|v\rangle\). When we say "projection" colloquially, usually we mean orthogonal projection; for example, the projection of all the points on a sphere to a planar subspace is the disc that gets projected (like a shadow cast) from the sphere onto the plane. \(\mathcal P\) can be called the Invariant Subspace of \(P\).

Property [pm_idempotence] contains the property that \(P^2=P\), which is called idempotence, meaning simply that applying the matrix repeatedly does the same thing as applying it only once. See Exercise [idempotence_taylor] for an example of the type of arithmetic you can do with these matrices. Property [pm_idempotence] also mentions that the matrix must be symmetric. The reader may be familiar with this property, which means \(P^T=P\). If you’re unfamiliar, read the beginning of Section 5.1, search the internet, or skip this property for now.
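
As a quick numerical illustration (not from the text; the subspace here is arbitrary), we can build \(P=\sum_i|v_i\rangle\langle v_i|\) from a few orthonormal vectors and check the idempotence and symmetry properties directly:

    import numpy as np

    rng = np.random.default_rng(1)
    # Get 3 orthonormal vectors in R^5 by orthonormalizing random columns (QR).
    Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))
    P = sum(np.outer(Q[:, i], Q[:, i]) for i in range(3))   # sum_i |v_i><v_i|

    print(np.allclose(P @ P, P))    # idempotent: True
    print(np.allclose(P, P.T))      # symmetric:  True

    # Property [pm_projection]: vectors in the invariant subspace are fixed.
    v_in = Q @ np.array([1.0, -2.0, 0.5])
    print(np.allclose(P @ v_in, v_in))    # True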

Proof. [pm_svd]\(\implies\)[pm_projection] Let \(|v\rangle\) be in \(\mathcal P=\operatorname{Im}(P)\), then \(|v\rangle=\sum_ia_i|v_i\rangle\). Then \(P|v\rangle=\sum_j\sum_ia_i|v_j\rangle\langle v_j|v_i\rangle=\sum_ja_j|v_j\rangle=|v\rangle\). If \(|w\rangle\in\mathcal P^\bot\), then \(P|w\rangle=|0\rangle\) is immediate.

[pm_projection]\(\implies\)[pm_idempotence] \(P^2=P\) if \(P^2|v\rangle=P|v\rangle\) for all \(|v\rangle\). Given the above argument \(P|v\rangle=|v_\parallel\rangle\), where \(|v_\parallel\rangle\in\mathcal P\). So we have \(P^2|v\rangle=P|v_\parallel\rangle=|v_\parallel\rangle=P|v\rangle\).

To prove that \(P\) must be symmetric, we show that \(\langle u|P|v\rangle=\langle u|P^T|v\rangle\) for all \(\langle u|\), \(|v\rangle\). (Why is this sufficient?) We have that

\[\begin{array}{rcl} \langle u|P|v\rangle &=& \langle u|v_\parallel\rangle \\ &=& \langle u_\parallel|v_\parallel\rangle + \langle u_\bot|v_\parallel\rangle \\ &=& \langle u_\parallel|v_\parallel\rangle \end{array}\]

The last equality follows from the fact that \(v_\parallel\in\mathcal P\), and is therefore perpendicular to \(u_\bot\). A similar argument says that \(\langle u|P^T|v\rangle=(\langle u|P^T)|v\rangle=\langle u_\parallel|v\rangle=\langle u_\parallel|v_\parallel\rangle\). So these calculations are the same.

[pm_idempotence]\(\implies\)[pm_svd] We look ahead to a theorem from Section 5.1 to say that any SVD for a symmetric matrix has the form \(P=\sum_i\lambda_i|v_i\rangle\langle v_i|\). Now when we square this we get

\[\begin{array}{rcl} P^2 &=& \left(\sum_i\lambda_i|v_i\rangle\langle v_i|\right)\left(\sum_j\lambda_j|v_j\rangle\langle v_j|\right) \\ &=& \sum_i\sum_j\lambda_i\lambda_j|v_i\rangle\langle v_i|v_j\rangle\langle v_j| \\ &=& \sum_i\lambda_i^2|v_i\rangle\langle v_i| \end{array}\]

Where the last equality follows from the usual trick of taking advantage of orthonormality to cancel any term with \(\langle v_i|v_j\rangle\) where \(i\neq j\).

Now \(P^2=P\implies \sum_i\lambda_i^2|v_i\rangle\langle v_i|=\sum_i\lambda_i|v_i\rangle\langle v_i|\). We can apply both sides to \(|v_i\rangle\) to get \(\lambda_i^2|v_i\rangle=\lambda_i|v_i\rangle\), so \(\lambda_i^2=\lambda_i\), implying that each \(\lambda_i\) is \(0\) or \(1\), completing the proof. ◻

Corollary 4.2. For a projection matrix \(P\) with range \(\mathcal P\), \(\left\|P|v\rangle-|v\rangle\right\|=\min_{|w\rangle\in\mathcal P}\left\| |w\rangle-|v\rangle\right\|\).

Proof. Write \(|v\rangle=|v_\parallel\rangle+|v_\bot\rangle\) with \(|v_\parallel\rangle\in\mathcal P\) and \(|v_\bot\rangle\in\mathcal P^\bot\), so that \(P|v\rangle=|v_\parallel\rangle\). For any \(|w\rangle\in\mathcal P\), the difference \(|w\rangle-|v_\parallel\rangle\) is in \(\mathcal P\) and is therefore orthogonal to \(|v_\bot\rangle\). By the Pythagorean theorem, \(\left\||w\rangle-|v\rangle\right\|^2=\left\||w\rangle-|v_\parallel\rangle\right\|^2+\left\||v_\bot\rangle\right\|^2\), which is minimized exactly when \(|w\rangle=|v_\parallel\rangle=P|v\rangle\). ◻

Recall that we used projection matrices already in Theorem 2.5. There we argued that a matrix like \(I-|v\rangle\langle v|\) is a projection. With Theorem 4.1, we can see that this must be a projection matrix, because it’s obviously symmetric, and the reader can check that it’s idempotent. But what is its SVD? By Exercise [Ex-Outer Product Identity], we can choose any orthonormal basis \(\left\{|v_i\rangle\right\}_i\), where \(|v_1\rangle=|v\rangle\), then \(I=\sum_i|v_i\rangle\langle v_i|\), and \(I-|v\rangle\langle v|=\sum_{i\neq1}|v_i\rangle\langle v_i|\) is a valid SVD. This begins to establish the one-to-one connection between a projection matrix’s SVD and its invariant subspace, made explicit by the next theorem.

Theorem 4.3. A projection matrix has an invariant subspace spanned by \(\left\{|v_i\rangle\right\}_i\) if and only if \(\sum_i|v_i\rangle\langle v_i|\) is an SVD for the matrix.

Proof. Start by assuming that \(\left\{|v_i\rangle\right\}_i\) spans the invariant subspace, \(\mathcal P\), of \(P\). Then we argue as above that for any \(\langle u|\), \(|v\rangle\), that \(\langle u|P|v\rangle=\langle u_\parallel|v_\parallel\rangle\). Now for any \(|v\rangle\), we can write \(|v\rangle=|v_\parallel\rangle+|v_\bot\rangle\). Since \(|v_\parallel\rangle\in\mathcal P\), the span of \(\left\{|v_i\rangle\right\}_i\), we can write \(|v_\parallel\rangle=\sum_ia_i|v_i\rangle\), for some \(\left\{a_i\right\}_i\). So that \(\left(\sum_i|v_i\rangle\langle v_i|\right)|v_\parallel\rangle=\left(\sum_i|v_i\rangle\langle v_i|\right)\left(\sum_ja_j|v_j\rangle\right)=\sum_ia_i|v_i\rangle=|v_\parallel\rangle\). So that

\[\begin{array}{rcl} \left(\sum_i|v_i\rangle\langle v_i|\right)|v\rangle &=& \left(\sum_i|v_i\rangle\langle v_i|\right)\left(|v_\parallel\rangle+|v_\bot\rangle\right) \\ &=& \left(\sum_i|v_i\rangle\langle v_i|\right)|v_\parallel\rangle + \left(\sum_i|v_i\rangle\langle v_i|\right)|v_\bot\rangle \\ &=& |v_\parallel\rangle + 0 \end{array}\]

Applying the same to the other side, we get that \(\langle u|\left(\sum_i|v_i\rangle\langle v_i|\right)| v\rangle=\langle u_\parallel|v_\parallel\rangle\). So that \(P=\sum_i|v_i\rangle\langle v_i|\), and this is therefore an SVD of \(P\).

To prove the other direction, assume that \(P=\sum_{i=1}^k|v_i\rangle\langle v_i|\); we take some basis of the domain \(\left\{|w_i\rangle\right\}_i\) with \(|w_i\rangle=|v_i\rangle\) for \(i=1..k\). That is, this basis extends the \(|v_i\rangle\) to the entire domain, as in Theorem 9.4. Now any vector \(|v\rangle=\sum_{i=1}^na_i|w_i\rangle\), and \(\left(\sum_{i=1}^k|v_i\rangle\langle v_i|\right)\sum_{j=1}^na_j|w_j\rangle=\sum_{i=1}^ka_i|w_i\rangle\). This equals \(|v\rangle\) if and only if \(a_i=0\) for all \(i>k\); that is, if \(|v\rangle\in\operatorname{span}\left\{|v_i\rangle\right\}\). ◻

Notice that this theorem doesn’t rely on the fact that the invariant subspace is the image of the projection operator, but from the SVD, we can see that the image is precisely spanned by the \(\left\{|v_i\rangle\right\}\).

Before moving on, we want to highlight a special projection matrix, \(|v\rangle\langle v|\). For this to be a projection matrix, \(|v\rangle\) must be a unit vector. (Why?) This takes a vector and projects onto \(|v\rangle\), so that \(|w\rangle\to|v\rangle\langle v|w\rangle\). Recall that \(\langle v|w\rangle\) is the inner product, or dot product; this projection is sometimes written as \(\operatorname{proj}_{\vec v}\vec w=\left(\vec v\cdot\vec w\right)\vec v\). Because \(|v\rangle\) is a unit vector, it gives the direction of the projection and \(\langle v|w\rangle\) gives its magnitude. Doing some trig, we get that the magnitude must also equal \(\left\||w\rangle\right\|\cos\theta\), where \(\theta\) is the angle between \(|v\rangle\) and \(|w\rangle\). More generally, if \(|v\rangle\) isn’t a unit vector, then \(\langle v|w\rangle=\left\||v\rangle\right\|\left\||w\rangle\right\|\cos\theta\). This formula gives a concrete interpretation of the inner product.
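
A tiny sketch of the rank-one case (the vectors here are arbitrary): the matrix \(|v\rangle\langle v|\) applied to \(|w\rangle\) agrees with the \(\left(\vec v\cdot\vec w\right)\vec v\) formula.

    import numpy as np

    v = np.array([3.0, 4.0])
    v = v / np.linalg.norm(v)          # |v> must be a unit vector
    w = np.array([2.0, 1.0])

    P = np.outer(v, v)                 # |v><v|
    print(np.allclose(P @ w, (v @ w) * v))   # True: matches (v . w) v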

Exercises

  1. To make the shadow example explicit, argue that the unit sphere given by all points \(x^2+y^2+z^2=1\) maps to a unit disc under the projection matrix \(\left(\begin{array}{ccc}1&0&0\\0&1&0\\0&0&0\end{array}\right)\). Graph these. Generalize by arguing that the image of the unit sphere under any projection is a unit disc (potentially 0 or 1 dimensional).

  2. Prove that the image of any matrix forms a subspace.

  3. Prove that a projection matrix is either singular or the identity matrix.

  4. Matrices do not have a unique SVD representation. However, prove that if a matrix has any SVD of the form \(P=\sum_i|v_i\rangle\langle v_i|\), then all SVDs must be of the form \(P=\sum_j|v_j\rangle\langle v_j|\).

  5. Planes in higher dimensions are linear spaces with codimension of \(1\), meaning that there’s a unique (up to magnitude) vector that’s perpendicular to the plane, called the normal vector. Let \(|v\rangle\) be a normal vector to the plane, and argue that \(\langle v|x\rangle=0\) is a formula for the plane. Why might it be valuable to choose a normal vector?

  6. Prove the Cauchy-Schwarz inequality: \(\langle u|v\rangle^2\leq\langle u|u\rangle\langle v|v\rangle\).

  7. Intuitively we expect the composition of two projection matrices, \(P\) and \(Q\), to be a projection matrix, \(PQ\), with the image being the intersection of \(\operatorname{Im}(P)\) and \(\operatorname{Im}(Q)\). Prove that this is the case when \(P\) and \(Q\) commute, that is, when \(PQ=QP\).

  8. Let \(P\) and \(Q\) be projection matrices with images \(\mathcal P\) and \(\mathcal Q\) that are orthogonal - that is \(\langle p|q\rangle=0\) for all \(|p\rangle\in\mathcal P\), \(|q\rangle\in\mathcal Q\). Prove that \(P+Q\) is a projection matrix also.

  9. [idempotence_taylor] Given the definition of \(\exp(M)=\sum_n\frac{M^n}{n!}\) for any matrix \(M\), show that \(\exp(P)=I+(e-1)P\) for all idempotent matrices, \(P\). We generalize this in Section 5.5.

4.2 Inverses and pseudo-inverses

An inverse of a matrix, \(M\), is a new matrix, \(M^{-1}\) that "undoes" \(M\). That is, whenever \(M|v\rangle=|w\rangle\), then \(M^{-1}|w\rangle=|v\rangle\). Said another way, \(MM^{-1}=M^{-1}M=I\).

Rotation Matrix A rotation matrix in two dimensions is given by

\[R(\theta)=\left(\begin{array}{cc}\cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta)\end{array}\right)\]

This acts on any vector \(|v\rangle\) by rotating it counterclockwise by \(\theta\). It follows then that \(R(\theta)^{-1}=R(-\theta)\).

We note that not all matrices have inverses; more on that in a bit.

Notice that if \(A\), \(B\), and \(AB\) are invertible, then \((AB)^{-1}=B^{-1}A^{-1}\). That is, the order is reversed, and because matrices do not commute, this matters. This is because if \(AB|v\rangle=|w\rangle\), then \(B|v\rangle=A^{-1}|w\rangle\), and finally \(|v\rangle=B^{-1}A^{-1}|w\rangle\).

Next we notice that if a matrix is a diagonal with non-zero entries \(D=\left(\begin{array}{cccc}\lambda_1&0&\cdots&0\\0&\lambda_2&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\lambda_n\end{array}\right)\), then the inverse is also a diagonal matrix, \(D^{-1}=\left(\begin{array}{cccc}1/\lambda_1&0&\cdots&0\\0&1/\lambda_2&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1/\lambda_n\end{array}\right)\). The reader can easily verify that these are inverses.

Putting these facts together with the fact that \(M^{-1}=M^T\) for orthogonal matrices (see Section 9.1), we can calculate the inverse of a matrix, \(M\), with the given SVD (in matrix form) of \(M=UDV^T\). That is, \(M^{-1}=(V^T)^{-1}D^{-1}U^{-1}=VD^{-1}U^T\). This form is already given as an SVD, since the multiplicands are still orthogonal and diagonal. Since every square matrix has an SVD \(UDV^T\), this approach gives an inverse whenever \(D\) has all non-zero diagonals; that is whenever \(D\) is full-rank. Conversely, if \(D\) does have a zero entry on the diagonal, then \(M\) is not invertible; we call such a matrix singular. To see this, suppose there’s a zero in the \((1, 1)\) entry of \(D\); then nothing maps to the first column of \(U\). Restated in summation form, if \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\) where \(\lambda_1=0\), then no \(|v\rangle\) maps to \(M|v\rangle=|u_1\rangle\); the best we could do is some combination of \(|u_i\rangle\) with \(i>1\), which are all perpendicular to \(|u_1\rangle\). Given Theorem 2.15, we can conclude that a matrix is singular if and only if the determinant is \(0\).
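
Here’s a small numerical check of the \(VD^{-1}U^T\) formula (the matrix is arbitrary); it agrees with the inverse computed directly.

    import numpy as np

    rng = np.random.default_rng(2)
    M = rng.standard_normal((4, 4))          # a generic square matrix is invertible

    U, d, Vt = np.linalg.svd(M)              # M = U @ diag(d) @ Vt
    M_inv = Vt.T @ np.diag(1.0 / d) @ U.T    # V D^{-1} U^T

    print(np.allclose(M_inv, np.linalg.inv(M)))   # True
    print(np.allclose(M @ M_inv, np.eye(4)))      # True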

What inversion does to an SVD is very natural. Suppose a square matrix \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\) is full-rank, that is \(\lambda_i\neq0\) for \(i=1..n\), with \(n\) equal to the dimension of the domain. Then \(M^{-1}=\sum_i\frac{1}{\lambda_i}|v_i\rangle\langle u_i|\). Because \(M\) is full-rank, both \(\left\{|u_i\rangle\right\}_i\) and \(\left\{|v_i\rangle\right\}_i\) form bases, sometimes called coordinates. The SVD of \(M\) basically says that to calculate \(M|v\rangle\), you must first write \(|v\rangle\) in the \(\left\{|v_i\rangle\right\}_i\) coordinate system, then scale each \(i\)-th coordinate by \(\lambda_i\), and the result is \(M|v\rangle\) written in the \(\left\{|u_i\rangle\right\}_i\) coordinate system. (See Subsection 2.1.4 for more details.) The inverse then writes a vector in the \(\left\{|u_i\rangle\right\}_i\) coordinate system, unscales the \(i\)-th coordinate by \(1/\lambda_i\), and ends up with a vector written in the \(\left\{|v_i\rangle\right\}_i\) coordinate system.

We now look at an example of a singular matrix.

Singular Rotation Matrix The following matrix rotates the first two dimensions, but zeroes out the third dimension.

\[R=\left(\begin{array}{ccc}\cos(\theta) & -\sin(\theta)&0 \\ \sin(\theta) & \cos(\theta)&0 \\ 0&0&0\end{array}\right)\]

\(R\langle x, y, z\rangle\) first zeroes out the third coordinate, then rotates \(\langle x, y\rangle\) by \(\theta\) in the \(x-y\) plane, landing on \(\langle x', y', 0\rangle\). Of course, this is singular because nothing maps to \(\langle x', y', z'\rangle\) if \(z'\neq0\). Any vector \(\langle x', y', 0\rangle\) does have a vector that maps to it, \(\langle x, y, 0\rangle\). However, this isn’t unique, because \(\langle x, y, z\rangle\) also maps to \(\langle x', y', 0\rangle\) for all \(z\in\mathbb R\). Because of the non-uniqueness, we don’t call these inverses.

Although \(R\) has no inverse, we could construct an almost-inverse. Given a vector, \(|w\rangle\), we could project \(|w\rangle\) to the first two dimensions, resulting in \(P|w\rangle\), then apply \(R(-\theta)\) to those two dimensions. This would give us a new vector, \(|v\rangle\), which has the property that \(R|v\rangle\) is as close to \(|w\rangle\) as possible. This "close as possible" claim is justified by the fact that \(R|v\rangle=RR(-\theta)P|w\rangle=P|w\rangle\) is the projection of \(|w\rangle\) onto the image of \(R\). By Theorem 9.5, the projection is the closest vector in the first two dimensions (the image of \(R\)) to \(|w\rangle\). Writing such an "almost-inverse" down would look like this:

\[S=\left(\begin{array}{ccc}\cos(-\theta) & -\sin(-\theta)&0 \\ \sin(-\theta) & \cos(-\theta)&0 \\ 0&0&0\end{array}\right)\left(|e_1\rangle\langle e_1|+|e_2\rangle\langle e_2|\right)\]

The first matrix is the inverse on the image of \(R\) and the second matrix is the projection onto that image. If we knew an SVD for \(R=\lambda_1|u_1\rangle\langle v_1|+\lambda_2|u_2\rangle\langle v_2|\), then we’d know that the image is \(\operatorname{Span}\left\{|u_1\rangle, |u_2\rangle\right\}=\operatorname{Span}\left\{|e_1\rangle, |e_2\rangle\right\}\). By Theorem 4.3, we can rewrite \(|e_1\rangle\langle e_1|+|e_2\rangle\langle e_2|=|u_1\rangle\langle u_1|+|u_2\rangle\langle u_2|\). And so we can rewrite,

\[S=\left(\frac{1}{\lambda_1}|v_1\rangle\langle u_1|+\frac{1}{\lambda_2}|v_2\rangle\langle u_2|\right)\left(|u_1\rangle\langle u_1|+|u_2\rangle\langle u_2|\right)=\frac{1}{\lambda_1}|v_1\rangle\langle u_1|+\frac{1}{\lambda_2}|v_2\rangle\langle u_2|\]

The first matrix is the inverse of the SVD of \(R\), which exists if we restrict to the first two dimensions. This tells us that the almost-inverse is given by inverting the non-zero terms of the SVD. It almost doesn’t feel worth mentioning projection, but when you consider that there’s an implicit zero term in the SVD of \(R\), we see that \(R=\lambda_1|u_1\rangle\langle v_1|+\lambda_2|u_2\rangle\langle v_2|+0|e_3\rangle\langle e_3|\). It hardly feels justifiable to invert the non-zero terms and leave the zero term untouched. But when we consider that it’s the orthogonal projection onto \(R\)’s image, then it seems justifiable by closeness arguments.

Generalizing the example, we define the pseudo-inverse of a matrix \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\) to be \(M^+=\sum_i\frac{1}{\lambda_i}|v_i\rangle\langle u_i|\), where the sum runs over the non-zero \(\lambda_i\). The pseudo-inverse of a matrix always exists, even for non-square matrices. In general, there’s no good way to write this in matrix form, because we’re unable to ignore the zero entries in the diagonal matrix. Also, in general there is no simple relationship between \(M^+\), \(M\), and \(I\). We’ll discuss some special cases now.
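
Before those special cases, here’s the definition in code (a sketch; the example matrix is arbitrary): invert the non-zero singular values, leave the zero ones alone, and compare against numpy’s built-in pinv.

    import numpy as np

    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 2.0]])           # rank 2: row 3 = row 1 + row 2

    U, d, Vt = np.linalg.svd(A)
    tol = max(A.shape) * np.finfo(float).eps * d.max()
    d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])   # invert only the non-zero terms
    A_plus = Vt.T @ np.diag(d_plus) @ U.T     # sum_i (1/lambda_i) |v_i><u_i|

    print(np.allclose(A_plus, np.linalg.pinv(A)))   # True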

Given a matrix \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\), with non-zero terms, if \(\left\{|v_i\rangle\right\}_i\) span the domain, then \(M^TM=\sum_i\lambda_i^2|v_i\rangle\langle v_i|\) is square and full-rank, and therefore invertible. Its inverse, using the arguments above, must be \(\left(M^TM\right)^{-1}=\sum_i\frac{1}{\lambda_i^2}|v_i\rangle\langle v_i|\). This looks like a projection, with some scaling, so we try composing it with \(M\) to see if we can get the pseudo-inverse. And indeed, we see that \(\left(M^TM\right)^{-1}M^T=\sum_i\frac{1}{\lambda_i}|v_i\rangle\langle u_i|=M^+\). In the case that \(\left\{|v_i\rangle\right\}_i\) span the domain, and \(M^+=\left(M^TM\right)^{-1}M^T\), we have that \(M^+M=\left(M^TM\right)^{-1}M^TM=I\). We say then that \(M^+\) is a Left Inverse of \(M\).

On the other hand, given a matrix \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\), with non-zero terms, if \(\left\{|u_i\rangle\right\}_i\) span the codomain, then \(MM^T=\sum_i\lambda_i^2|u_i\rangle\langle u_i|\) is full-rank, and therefore invertible. In this case \(M^+=M^T\left(MM^T\right)^{-1}\), and \(MM^+=I\), making \(M^+\) a Right Inverse of \(M\).

The definition of a pseudo-inverse is very natural when you have the bra-ket form of the SVD. But most linear algebra books do not center the SVD. So the pseudo-inverse is given as an abstract matrix satisfying certain properties. In the two cases outlined above, an explicit formula can be given for the pseudo-inverse. As such some books decide to only define the pseudo-inverse when it can be written explicitly. But this gets the implication backwards: It doesn’t make sense to define the pseudo-inverse in terms of the inverse, because the inverse is a special case of the pseudo-inverse. These books prefer this only because they’ve taken time to teach matrix inversion in detail, even though almost all matrices are not invertible.

The condition for a left inverse is that \(\left\{|v_i\rangle\right\}_i\) span the domain. But this is equivalent to saying that \(M\) is injective. That is, for any two distinct vectors \(|v\rangle\) and \(|w\rangle\) in the domain, \(M|v\rangle\neq M|w\rangle\). It’s outside the scope of this book, but any injective function (even a non-linear one) has a left inverse. The condition for a right inverse is that \(\left\{|u_i\rangle\right\}_i\) span the codomain. This is equivalent to the matrix being surjective. That is, for any vector \(|w\rangle\) in the codomain, there’s a vector \(|v\rangle\) in the domain with \(M|v\rangle=|w\rangle\).

Exercise [Ex-Outer Product Identity] shows that for any orthonormal basis of \(\mathbb R^n\), \(\left\{|v_i\rangle\right\}\), \(\sum_i|v_i\rangle\langle v_i|=I\). This section puts that exercise in a new light, because we can recognize \(\sum_i|v_i\rangle\langle v_i|\) as a projection matrix. As such, its invariant subspace is its image, but because the image contains a basis, it is all of \(\mathbb R^n\); that is, \(\left(\sum_i|v_i\rangle\langle v_i|\right)|v\rangle=|v\rangle\) for all \(|v\rangle\in\mathbb R^n\). So it’s the identity.

Theorem 4.4. If a matrix \(Q\) has orthonormal rows, then \(QQ^T=I\). If it has orthonormal columns, then \(Q^TQ=I\). Such matrices are called semi-orthogonal.

Proof. We’ll prove the first part. Any matrix \(Q\) with rows given by \(\left\{|v_i\rangle\right\}\) can be written as \(Q=\sum_i|e_i\rangle\langle v_i|\). (See Section 2.1.3.) This implies that \(QQ^T=\sum_i\sum_j|e_i\rangle\langle v_i|v_j\rangle\langle e_j|\). When the \(|v_i\rangle\) are orthonormal, the usual trick cancels every term with \(i\neq j\), leaving \(QQ^T=\sum_i|e_i\rangle\langle e_i|=I\). It’s worth noting what happens on the other side: \(Q^TQ=\sum_i|v_i\rangle\langle v_i|\), which for orthonormal rows is a valid SVD, so \(Q^TQ\) is a projection matrix, invariant on the span of \(\left\{|v_i\rangle\right\}\). When \(Q\) is square, its \(n\) orthonormal rows span the whole domain, so this projection fixes every vector and \(Q^TQ=I\) as well. ◻

Note that by Theorem [OrthogonalMatrixDefiniton], square matrices have orthonormal rows precisely when they have orthonormal columns. By the above theorem, such matrices must have \(Q^TQ=QQ^T=I\). So Theorem 9.2’s claim that \(Q^{-1}=Q^T\) for orthogonal matrices follows immediately.

Exercises

  1. Prove that if \(AB=I\), then \(BA=I\).

  2. Show that \(R(\theta)R(-\theta)=I\) by multiplying matrices, and applying trig identities.

  3. Prove that \(A\) and \(B\) are both invertible if and only if \(AB\) is invertible.

  4. Prove the above claim that for a matrix \(M=\sum_i\lambda_i|u_i\rangle\langle v_i|\), \(\left\{|v_i\rangle\right\}_i\) spans the domain if and only if \(M\) is injective.

  5. Given that the SVD of an \(n\times n\) square matrix can be computed in \(O(n^3)\)-time, argue that we can compute the inverse in \(O(n^3)\)-time. Note: This is not the most efficient way to calculate an inverse.

  6. Show that when an inverse exists, there is a unique inverse. That is, for an invertible matrix, \(A\), there is only one matrix, \(B\), for which \(AB=I\). Pseudo-inverses are also unique.

  7. Prove the following facts about the pseudo-inverse:

    1. \(AA^+A=A\)

    2. \(A^+AA^+=A^+\)

    3. \(\left(AA^+\right)^T=AA^+\) (We can continue to assume that \(A\) and \(A^+\) are all real.)

    4. \(\left(A^+A\right)^T=A^+A\)

  8. Show that \(\left(AB\right)^+=B^+A^+\) in the case that \(A\) has full column rank and \(B\) has full row rank. (Unlike with inverses, this identity does not hold for arbitrary matrices.)

  9. Why is it true that a square matrix is injective if and only if it’s surjective?

  10. A \(1\times1\) matrix \((m)\) maps \(1\)-dimensional vectors \((x)\) to \(1\)-dimensional vectors \((mx)\). When is this matrix invertible? What is the inverse when it’s invertible? What is the pseudo-inverse when it’s singular?

4.3 Least Squares Regression

4.3.1 Solving System of Equations

Let’s look at how inverses can be used to solve problems. We’ll present a word problem like you might see in a high school class.

Word Problem You’re selling tickets to a 2-day show. You sell two types of tickets, regular and VIP. On the first day, you sell \(100\) regular tickets and \(3\) VIP tickets, and make \(11,500\) dollars. On the second day, you sell \(90\) regular tickets and \(6\) VIP tickets and make \(12,000\) dollars. How much does a VIP ticket cost versus a regular ticket?

Say that a regular ticket costs \(\beta_1\) dollars and a VIP ticket costs \(\beta_2\) dollars. We then have the linear relationship:

\[\left(\begin{array}{c} 11500 \\ 12000 \end{array}\right)=\left(\begin{array}{cc} 100&3\\90&6\end{array}\right)\left(\begin{array}{c}\beta_1\\ \beta_2\end{array}\right)\]

Of course you can solve this system of equations with some logic or using algebra techniques, but linear algebra saves you from having to do anything difficult and scales nicely to larger systems (at least from a human effort perspective). The matrix in this equation is invertible, so we can solve the system by writing:

\[\left(\begin{array}{c}\beta_1\\ \beta_2\end{array}\right)=\left(\begin{array}{cc} 100&3\\90&6\end{array}\right)^{-1}\left(\begin{array}{c} 11500 \\ 12000 \end{array}\right)=\left(\begin{array}{c}100 \\500\end{array}\right)\]

This is bread-and-butter applied linear algebra, and will probably be familiar to the reader. It’s not surprising that solving systems of linear equations is the primary motivation for studying inverses; this is the matrix-equivalent of solving \(y=ax\) for \(x\).
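
In code, a minimal numpy sketch looks like this (np.linalg.solve avoids forming the inverse explicitly):

    import numpy as np

    A = np.array([[100.0, 3.0],
                  [90.0, 6.0]])
    y = np.array([11500.0, 12000.0])

    beta = np.linalg.solve(A, y)       # equivalent to inv(A) @ y, but more stable
    print(beta)                        # [100. 500.]  regular = $100, VIP = $500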

4.3.2 Under-determined System of Equations

But what if we modify the example slightly, so that we add a third class of tickets, called "VIP Plus?" Now your system looks like:

\[\left(\begin{array}{c} 12500 \\ 13000 \end{array}\right)=\left(\begin{array}{ccc} 100&3&2\\90&6&1\end{array}\right)\left(\begin{array}{c}\beta_1\\ \beta_2\\ \beta_3\end{array}\right)\]

You may recall from algebra that you generally cannot solve for three unknowns with only two equations (and for linear equations, there is never a unique solution). Such a system is called underdetermined. But with the help of pseudo-inverses we can make a best guess. For now let’s call the system \(|y\rangle=X|\beta\rangle\), but we’ll continue to assume that \(X\) is \(2\times 3\). We can almost solve this equation as \(X^+|y\rangle=|\beta\rangle\).

It’s probably still a little unclear what the pseudo-inverse means, or why this is a good approximate solution. So let’s try a different approach: We know that as a \(2\times3\)-dimensional matrix, \(X\) cannot possibly be invertible. But suppose it’s rank-\(2\), as in our example. (A \(2\times3\) matrix cannot have more than rank \(2\).) Then we want to reduce the dimensionality of the domain by \(1\) to end up with a \(2\times2\) matrix that we can then invert. We know from Section 3.3 that PCA is the way to reduce dimensions with minimal information loss. For a rank-\(2\) matrix, there will be two PCA axes that we can choose from. Let \(X=UDV^T\) be an SVD. Now \(V^T\) will be a \(3\times3\) orthogonal matrix, where the bottom row has no impact on \(X\), because of a zero entry in \(D\). The two PCA axes will be the first two rows of \(V^T\), \(|v_1\rangle\) and \(|v_2\rangle\). To transform data points to the two PCA axes, we can just project onto each one. That is, we let the first coordinate be the magnitude of the projection onto \(|v_1\rangle\), and the second coordinate be the magnitude of the projection onto \(|v_2\rangle\). To transform any point to these two axes, we can just apply the matrix \(W^T:=|e_1\rangle\langle v_1|+|e_2\rangle\langle v_2|\).

We project down to \(\mathbb R^2\) via \(W^T\), then compose \(X\) with \(W\) to get \(XW\); since \(W^TW=I\), the matrix \(W\) undoes the projection \(W^T\) on the PCA space.

[Figure: the maps between \(\mathbb R^3\), the PCA space \(\mathbb R^2\), and the image of \(X\)]

This gives us a map from the PCA space to the image. Applying the inverse \(\left(XW\right)^{-1}\) to \(|y\rangle\) gives coefficients in terms of the PCA axes. If we wanted to recover coefficients in the original \(\mathbb R^3\), then we need to compose this result with \(W\), so that the resulting coefficients are \(W\left(XW\right)^{-1}|y\rangle\). Notice that if \(X\) were square and invertible, then \(W\) would be full rank and invertible, so that this would reduce to \(WW^{-1}X^{-1}|y\rangle=X^{-1}|y\rangle\); this is a good sanity check.

So this tells us that we can almost invert by first projecting onto the PCA axes, and this gives an almost-inverse of \(W\left(XW\right)^{-1}\). We compare this to the usual pseudo-inverse, \(X^T\left(XX^T\right)^{-1}\).

\[\begin{array}{rcl} W\left(XW\right)^{-1} &=& \left(|v_1\rangle\langle e_1|+|v_2\rangle\langle e_2|\right)\left[\left(\sum_i\lambda_i|u_i\rangle\langle v_i|\right)\left(|v_1\rangle\langle e_1|+|v_2\rangle\langle e_2|\right)\right]^{-1} \\ &=& \left(|v_1\rangle\langle e_1|+|v_2\rangle\langle e_2|\right)\left[\lambda_1|u_1\rangle\langle e_1|+\lambda_2|u_2\rangle\langle e_2|\right]^{-1} \\ &=& \left(|v_1\rangle\langle e_1|+|v_2\rangle\langle e_2|\right)\left(\frac{1}{\lambda_1}|e_1\rangle\langle u_1|+\frac{1}{\lambda_2}|e_2\rangle\langle u_2|\right) \\ &=& \frac{1}{\lambda_1}|v_1\rangle\langle u_1|+\frac{1}{\lambda_2}|v_2\rangle\langle u_2| \\ &=& X^+ \end{array}\]

The last equality follows from the previous section. This calculation shows that taking the pseudo-inverse is the same as projecting onto the PCA axes and inverting the resulting full-rank matrix. This adds some intuition, because PCA is a straightforward dimensionality reduction.
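
Returning to the three-ticket example, a quick numpy sketch: the pseudo-inverse hands back a \(|\beta\rangle\) that reproduces both observed revenues exactly, even though the system has infinitely many solutions.

    import numpy as np

    X = np.array([[100.0, 3.0, 2.0],
                  [90.0, 6.0, 1.0]])
    y = np.array([12500.0, 13000.0])

    beta = np.linalg.pinv(X) @ y       # X^+ |y>
    print(X @ beta)                    # recovers [12500. 13000.] exactly
    print(beta)                        # one particular solution (the minimum-norm one)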

4.3.3 Over-determined System of Equations

The other side of under-determined systems of equations is, of course, over-determined systems of equations. These are scenarios where there are more equations than unknowns. Importantly, an over-determined system gives different coefficients depending on which subset of equations you use. Going back to our example, if you had VIP tickets at \(500\) dollars and regular tickets at \(100\) dollars, and data from multiple weeks, you could choose any given weekend, and solve the problem just as you did in the earlier section. In fact, in this case, if you did do an SVD, you’d get a rank-\(2\) matrix, and the pseudo-inverse would still give the same answer.

Imagine instead a situation where customers spend additional money on merchandise and refreshments. We still know for each day the number of regular customers, \(x_1\), and VIP customers, \(x_2\). And we know for each day, how much money we made, \(y\). Then we let \(\beta_1\) and \(\beta_2\) be the typical total spend (ticket plus merch plus refreshments) of regular and VIP customers. Can we solve for \(\beta_1\) and \(\beta_2\)? If we found the right values of \(\beta\), we may not have that \(y=\beta_1x_1+\beta_2x_2\) holds for any day in our data set, but it will still provide the best estimate.

Over-determined System[OverDeterminedSystem] Suppose, as above, we know for each day how many of each type of ticket we sold, and the total revenue after merch and refreshments. Say we have sales like this:

  Date     Revenue (000s)   Num regular   Num VIP
  Jan 3    14.1             100           3
  Jan 4    14.8             90            6
  Jan 10   15.1             94            5
  Jan 11   18.0             163           0
  Jan 17   20.0             119           5
  Jan 18   11.8             78            4

We can formulate this as a matrix problem like:

\[\left(\begin{array}{c} 14.1\\14.8\\15.1\\18.0\\20.0\\11.8 \end{array}\right)=\left(\begin{array}{cc}100&3\\90&6\\94&5\\163&0\\119&5\\78&4 \end{array}\right)\left(\begin{array}{c}\beta_1 \\ \beta_2\end{array}\right)\]

Then we can almost solve for \(\vec\beta\) by taking the pseudo-inverse. Call this system \(|y\rangle=X|\beta\rangle\). We come up with a solution \(\hat\beta=X^+|y\rangle=\left(X^TX\right)^{-1}X^T|y\rangle\).

This example shouldn’t be surprising or even interesting at this point. But I do want to point out that this is a more interesting problem than under-determined or full-rank square matrices. This is the first example that resembles a real world problem. By solving it, we learn something interesting about the customers; we probably already knew how much tickets themselves cost. The over-determined matrix arises from real-world problems more often. A possible explanation is that we’re unlikely to build a model with more variables than data points. There are situations where a family of inputs is quite large, but in those cases, a dimensionality reduction like PCA is done explicitly as a pre-processing step. An old-fashioned rule of thumb (from before the era of big data) states that with \(n\) data points, you shouldn’t try to model more than \(\sqrt[3]{n}\) variables. So if you did have a \(216\times1000\) matrix, it’d probably be better to use PCA to project onto the top \(6\) principal axes, then set up a linear system on a \(216\times6\) matrix. This may do a better job smoothing out the noise, and you’d see better performance on out-of-sample data points.

The solution you get from \(\hat\beta=X^+|y\rangle=\left(X^TX\right)^{-1}X^T|y\rangle\) is the "best" solution in some abstract sense; the argument is something, something, orthogonal projection. But the over-determined scenario has a clearer interpretation. Specifically, \(\hat\beta\) is what we get when we minimize \(\mathcal L=\sum_i\left(\beta_1(x_i)_1+\beta_2(x_i)_2-y_i\right)^2\). Written in matrix form, this is

\[\begin{array}{rcl} \mathcal L &=& \left(X|\beta\rangle - |y\rangle\right)^T\left(X|\beta\rangle - |y\rangle\right) \\ &=& \left(\langle\beta|X^T - \langle y|\right)\left(X|\beta\rangle - |y\rangle\right) \\ &=& \langle\beta|X^TX|\beta\rangle-\langle\beta|X^T|y\rangle-\langle y|X|\beta\rangle+\langle y|y\rangle \\ &=& \langle\beta|X^TX|\beta\rangle-2\langle y|X|\beta\rangle+\langle y|y\rangle \end{array}\]

Where the last equality follows from the fact that \(\langle\beta|X^T|y\rangle\) is a scalar, so it’s its own transpose. (Or equivalently, the commutativity of the inner product.) We minimize this using Lemma 2.13.

\[0=\frac{\partial\mathcal L}{\partial \vec\beta} = 2X^TX|\beta\rangle-2X^T|y\rangle\]

Solving this for \(|\beta\rangle\), we get \(|\beta\rangle=\left(X^TX\right)^{-1}X^T|y\rangle\), precisely as before.
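
As a quick check in code (a sketch; the data are from Example [OverDeterminedSystem]), the normal-equation formula agrees with numpy’s least-squares solver:

    import numpy as np

    X = np.array([[100, 3], [90, 6], [94, 5], [163, 0], [119, 5], [78, 4]], dtype=float)
    y = np.array([14.1, 14.8, 15.1, 18.0, 20.0, 11.8])     # revenue in thousands

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # (X^T X)^{-1} X^T |y>
    print(beta_hat)                                         # roughly [0.113, 0.931]

    lstsq_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(beta_hat, lstsq_beta))                # True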

\(\mathcal L\) is the sum of the squares of the residuals, the difference between predicted and actual. And as such, this type of model is called Least Squares Regression, or sometimes Ordinary Least Squares (OLS). You may ask if \(\mathcal L\) is the correct thing to minimize. Minimizing \(\mathcal L\) gives the maximum likelihood estimate when we assume that the residuals are all drawn from the same normal curve, and that observations are independent. This assumption allows us to write the covariance of the residuals as a matrix \(\sigma^2I\). The maximum likelihood estimation is outside the scope of this book, but we can think of it like a best guess. The key assumption though is that we assume that the residuals are independent samples belonging to a normal distribution, and that the variance doesn’t change for different inputs. If that still doesn’t make sense, it suffices for now to know that linear regression is a simple technique that works well very often; as such, it should always be your first attempt.

One notation that we find useful is the "hat" notation. After we solve for \(|\beta\rangle\), we say that the solution is "beta-hat" or \(|\hat\beta\rangle\). This just means the predicted value of \(\beta\). With this notation we may write \(|\hat y\rangle=X|\hat\beta\rangle\) to mean the predicted values of \(|y\rangle\) that we get by plugging \(|\hat\beta\rangle\) into our original equation. It helps to keep this distinguished from the original \(|y\rangle\). For example, now we can write the vector of residuals as \(|\hat y\rangle-|y\rangle\). It then follows that \(\mathcal L=\left\| |\hat y\rangle-|y\rangle \right\|^2\).

We have that \(|\hat y\rangle=X|\hat\beta\rangle=X\left(X^TX\right)^{-1}X^T|y\rangle\). We sometimes call \(H:=X\left(X^TX\right)^{-1}X^T\) the "hat matrix," because it puts a hat on \(|y\rangle\): \(|\hat y\rangle=H|y\rangle\). The remarkable thing about the hat matrix is that it equals

\[\begin{array}{rcl} H&=&X\left(\left(X^TX\right)^{-1}X^T\right) \\ &=&\left(\sum_i\lambda_i|u_i\rangle\langle v_i|\right)\left(\sum_j\frac{1}{\lambda_j}|v_j\rangle\langle u_j|\right) \\ &=&\sum_i|u_i\rangle\langle u_i| \end{array}\]

which is a projection matrix! (Aren’t you glad we just went over those?) If \(X\) is an \(m\times n\) matrix, representing \(m\) data points with \(n\) variables, then \(H\) is a projection matrix in \(\mathbb R^m\). This is the sample space, where each dimension represents a data point. It’s strange to think of this as a space. A vector in this space, like \(|y\rangle\), represents the values observed across the data points. Each column in \(X\) is a vector in the sample space, representing all the sampled values for that input variable. The subspace spanned by \(\left\{|u_i\rangle\right\}_i\) is the same as the space spanned by the columns of \(X\), by Theorem 2.16. So you can think of this subspace as the space of all vectors that can be written as a linear combination of the input variables. If \(|y\rangle\) is in this span, then \(|\hat y\rangle=|y\rangle\), meaning that \(|y\rangle\) can be written exactly as a linear combination of the input variables, and there’s no "almost-ness" to the almost inverse we compute. Otherwise, \(|\hat y\rangle\) is the closest vector (to \(|y\rangle\)) in the span of inputs, by Corollary 4.2, which is why \(|\hat y\rangle\) minimizes \(\mathcal L=\left\| |\hat y\rangle-|y\rangle \right\|^2\).
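
To see the hat matrix’s projection property numerically (same data as above; a sketch):

    import numpy as np

    X = np.array([[100, 3], [90, 6], [94, 5], [163, 0], [119, 5], [78, 4]], dtype=float)
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # the hat matrix

    print(np.allclose(H @ H, H))               # idempotent: True
    print(np.allclose(H, H.T))                 # symmetric:  True
    print(np.allclose(H @ X, X))               # fixes anything in the span of the columns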

Exercises

  1. Calculate \(\hat\beta\) for Example [OverDeterminedSystem].

  2. An \(m\times n\) matrix with \(m<n\) is often surjective. Find an example of such a matrix that is not surjective.

  3. Prove that \(\mathcal L\) is minimized precisely when \(\sqrt{\mathcal L}\) is minimized.

    For the next two exercises, call \(|y\rangle=\left(\begin{array}{c}y_1\\y_2\\ \vdots \\y_m\end{array}\right)\) and \(X=\left(\begin{array}{cc}X_{1,1}&X_{1,2}\\X_{2,1}&X_{2,2}\\ \vdots&\vdots \\X_{m,1}&X_{m,2}\end{array}\right)\).

  4. In the case that \(y_i=X_{i,1}+2\cdot X_{i,2}\) for every \(i\), prove that \(|\hat\beta\rangle=\left(X^TX\right)^{-1}X^T|y\rangle=\left(\begin{array}{c}1\\2\end{array}\right)\).

  5. Call \(\left(\begin{array}{c}\beta_1\\ \beta_2\end{array}\right)=\left(X^TX\right)^{-1}X^T|y\rangle\). Assume additionally that one of the columns of \(X\) is constant, so the model includes an intercept. Prove that \(\bar y=\beta_1\bar{X_1}+\beta_2\bar{X_2}\), where \(\bar y=\frac{1}{m}\sum_iy_i\) and \(\bar{X_j}=\frac{1}{m}\sum_iX_{i,j}\).

4.4 Least Squares Regression, Part 2

4.4.1 Variance

We start this section by solving Example [OverDeterminedSystem] with a computer program.
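
The readout below comes from statsmodels; a sketch of the kind of script that produces it (using the data from the table above) looks like this:

    import pandas as pd
    import statsmodels.api as sm

    data = pd.DataFrame({
        "regular_tickets": [100, 90, 94, 163, 119, 78],
        "vip_tickets":     [3, 6, 5, 0, 5, 4],
        "revenue":         [14.1, 14.8, 15.1, 18.0, 20.0, 11.8],   # in thousands of dollars
    })

    # No constant column is added, which is why the readout reports uncentered R-squared.
    model = sm.OLS(data["revenue"], data[["regular_tickets", "vip_tickets"]])
    print(model.fit().summary())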

OLS Regression Results                                
  =======================================================================================
  Dep. Variable:                revenue   R-squared (uncentered):                   0.997
  Model:                            OLS   Adj. R-squared (uncentered):              0.995
  Method:                 Least Squares   F-statistic:                              570.9
  Date:                Sat, 16 Dec 2023   Prob (F-statistic):                    1.22e-05
  Time:                        19:10:10   Log-Likelihood:                         -8.1232
  No. Observations:                   6   AIC:                                      20.25
  Df Residuals:                       4   BIC:                                      19.83
  Df Model:                           2                                                  
  Covariance Type:            nonrobust                                                  
  ===================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
  -----------------------------------------------------------------------------------
  regular_tickets     0.1130      0.007     16.900      0.000       0.094       0.132
  vip_tickets         0.9309      0.172      5.403      0.006       0.453       1.409
  ==============================================================================
  Omnibus:                          nan   Durbin-Watson:                   2.641
  Prob(Omnibus):                    nan   Jarque-Bera (JB):                1.799
  Skew:                           1.325   Prob(JB):                        0.407
  Kurtosis:                       3.422   Cond. No.                         40.8
  ==============================================================================

We read this as the estimated coefficient for regular_tickets is $\(113.00\) and the estimated coefficient for vip_tickets is $\(930.90\). (Recall that the \(y\) values are in thousands of dollars.) In the context of the problem, this means that we estimate each regular ticketholder will spend \(113\) dollars (including ticket cost), while each VIP ticketholder will spend around \(931\) dollars. This is a valuable insight. Say you know that you’re selling regular tickets for $\(100\) and VIP tickets for $\(500\). Now you know that VIPs will spend almost their entire ticket price on merchandise and refreshments, whereas regular ticket holders spend a much less significant amount. This kind of insight can inform how you market to VIPs, where you place merchandise, or even how you set prices. And it can be achieved without having to track the movement of every VIP throughout the venue.

We should be careful however about how we state our conclusions. Say the venue has an East entrance and a West entrance, with separate ticket stands and separate revenue, but virtually identical. If you run an analysis on the East entrance and somebody else runs an analysis on the West entrance, you’d be embarrassed to confidently assert that the average VIP spends $\(931\) if the analyst on the West side concludes that a VIP spends $\(510\), barely more than their ticket price. But this could happen. By modeling this as a linear regression, we’re asserting that \(y=\beta_1x_1+\beta_2x_2+\varepsilon\), where \(\varepsilon\) is a random normal variable. And although we’re guessing \(\beta_1\) and \(\beta_2\) based on the data we’ve observed, our estimate itself is subject to the randomness. So it stands to reason that if we measured on different days or even a different half of the venue on the same days, we’d expect to get different estimates for \(\beta\).

But we do have a sense of the range of error on our estimate. We need to dive into stats a little bit. Variance gives a sense of spread, and it’s calculated as \(\operatorname{Var}(T)=\frac{1}{\#(T)}\sum_i\left(t_i-E(T)\right)^2\), where \(\left\{t_i\right\}_i\) is a large sample of \(T\), and \(E(T)=\frac{1}{\#(T)}\sum_it_i\). (Statisticians may object to the use of "large sample" instead of "population," but close enough is fine for this book.) Standard deviation is the square root of variance, and for normal variables, this is all we need to comment on the relative likelihood of different scenarios. Any observation drawn from a normal distribution is \(95\%\) likely to be within about \(2\) standard deviations of the center. In Table [sm_readout], the coefficient of vip_tickets is $\(930.90\) with a standard deviation of $\(172\). The \(95\%\) confidence interval is given as $\(453\) and $\(1,409\). This means that there’s a \(95\%\) chance that whatever the West entrance measures for this coefficient will be between $\(453\) and $\(1,409\). This is a pretty wide range. (With a keen eye, you may notice that the confidence interval is more than \(2\) standard deviations. This is because statsmodels uses a more-accurate t-test rather than a z-test. Don’t worry about this difference; at around \(30\) or more data points, it’s negligible.) The \(95\%\) confidence interval is all points that are within \(2\) standard deviations of the mean, \(|\hat\beta\rangle\); said another way, it’s all points for which the magnitude of the z-score, \(z=\frac{|\beta\rangle-|\hat\beta\rangle}{\sigma}\), is less than \(2\).

There’s a multidimensional version of variance. For a multivariate normal distribution, with center \(|\mu\rangle\) and variance \(\Sigma\) given as a matrix, the "Mahalanobis distance" of a point \(|x\rangle\) is given by \(\sqrt{\langle x-\mu|\Sigma^{-1}|x-\mu\rangle}\). In the one dimensional case, this equals the absolute value of the z-score. And this generalizes the z-score in that \(95\%\) of data drawn from the distribution have Mahalanobis distance less than \(2\). As before \(|\mu\rangle=\frac{1}{n}\sum_{i=1}^n|x_i\rangle\), the average of points. And \(\Sigma=\frac{1}{n}\sum_{i=1}^n\left(|x_i\rangle-|\mu\rangle\right)\left(\langle x_i|-\langle\mu|\right)\). We have a convenient identity here:

\[\begin{array}{rcl} \Sigma &=& \frac{1}{n}\sum_{i=1}^n\left(|x_i\rangle-|\mu\rangle\right)\left(\langle x_i|-\langle\mu|\right) \\ &=& \frac{1}{n}\sum_{i=1}^n\left(|x_i\rangle\langle x_i|-|\mu\rangle\langle x_i|-|x_i\rangle\langle\mu|+|\mu\rangle\langle\mu|\right) \\ &=& \frac{1}{n}\sum_{i=1}^n|x_i\rangle\langle x_i|-\frac{1}{n}\sum_{i=1}^n|\mu\rangle\langle x_i|-\frac{1}{n}\sum_{i=1}^n|x_i\rangle\langle\mu|+\frac{1}{n}\sum_{i=1}^n|\mu\rangle\langle\mu| \\ &=& \frac{1}{n}\sum_{i=1}^n|x_i\rangle\langle x_i|-|\mu\rangle\langle \mu|-|\mu\rangle\langle \mu|+|\mu\rangle\langle \mu| \\ &=&\frac{1}{n}\sum_{i=1}^n|x_i\rangle\langle x_i|-|\mu\rangle\langle \mu| \\ &=& \operatorname{E}\left(|x\rangle\langle x|\right) - \operatorname{E}\left(|x\rangle\right)\operatorname{E}\left(\langle x|\right) \end{array}\]
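
A quick numerical check of this identity (random data; a sketch):

    import numpy as np

    rng = np.random.default_rng(3)
    pts = rng.standard_normal((500, 4))                          # 500 samples of a 4-d variable
    mu = pts.mean(axis=0)

    sigma_direct = (pts - mu).T @ (pts - mu) / len(pts)          # (1/n) sum |x-mu><x-mu|
    sigma_identity = pts.T @ pts / len(pts) - np.outer(mu, mu)   # E[|x><x|] - |mu><mu|
    print(np.allclose(sigma_direct, sigma_identity))             # True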

This is a rather quick treatment of variance. See Section 7.1 for more on this topic.

Theorem 4.5. \(|\hat\beta\rangle\) is a normal random variable with center \(\left(X^TX\right)^{-1}X^T\operatorname{E}\left(|y\rangle\right)\) and variance \(\left(X^TX\right)^{-1}\sigma^2\), where \(\sigma^2I=\operatorname{Var}\left(|y\rangle\right)\), the covariance matrix of the residuals. (Recall that the assumptions of OLS allow us to write the covariance of the residuals as a multiple of \(I\).)

Proof. We start with \(|\hat\beta\rangle=\left(X^TX\right)^{-1}X^T|y\rangle\). On the right-hand side, only \(|y\rangle\) is random (being a function of \(\varepsilon\)), so that \(\operatorname{E}\left(|\hat\beta\rangle\right)=\left(X^TX\right)^{-1}X^T\operatorname{E}\left(|y\rangle\right)\). (See Exercise [random-constant-2].)

Now using the identity above we do some serious math here:

\[\begin{array}{rcl} \Sigma &=& \left[\operatorname{E}\left(|\hat\beta\rangle\langle\hat\beta|\right)\right] - \left[\operatorname{E}\left(|\hat\beta\rangle\right)\operatorname{E}\left(\langle\hat\beta|\right)\right] \\ &=& \left[\operatorname{E}\left(\left(X^TX\right)^{-1}X^T|y\rangle\langle y|X\left(X^TX\right)^{-1}\right)\right]-\left[\operatorname{E}\left(\left(X^TX\right)^{-1}X^T|y\rangle\right)\operatorname{E}\left(\langle y|X\left(X^TX\right)^{-1}\right)\right] \\ &=& \left[\left(X^TX\right)^{-1}X^T\operatorname{E}\left(|y\rangle\langle y|\right)X\left(X^TX\right)^{-1}\right]-\left[\left(X^TX\right)^{-1}X^T\operatorname{E}\left(|y\rangle\right)\operatorname{E}\left(\langle y|\right)X\left(X^TX\right)^{-1}\right] \\ &=& \left(X^TX\right)^{-1}X^T\left[\operatorname{E}\left(|y\rangle\langle y|\right)-\operatorname{E}\left(|y\rangle\right)\operatorname{E}\left(\langle y|\right)\right]X\left(X^TX\right)^{-1} \\ &=& \left(X^TX\right)^{-1}X^T\left[\operatorname{Var}\left(|y\rangle\right)\right]X\left(X^TX\right)^{-1} \\ &=& \left(X^TX\right)^{-1}X^T\left[\sigma^2I\right]X\left(X^TX\right)^{-1} \\ &=& \left(X^TX\right)^{-1}\sigma^2 \end{array}\]

The last equality follows from the fact that scalars and \(I\) commute with all matrices, so that the penultimate equation is \(\sigma^2\left(X^TX\right)^{-1}X^TX\left(X^TX\right)^{-1}\), allowing for some cancellations. ◻

The standard deviations given in Table [sm_readout] are \(\sigma\sqrt{\left(\left(X^TX\right)^{-1}\right)_{i,i}}\), the square roots of the diagonal elements of \(\Sigma\). The idea is that if you start with the coefficient estimate \(|\hat\beta\rangle\) and vary one dimension, \(|\beta'\rangle=|\hat\beta\rangle+\varepsilon|e_i\rangle\), then the Mahalanobis distance becomes

\[\begin{array}{rcl} \sqrt{\langle \varepsilon e_i|\Sigma^{-1}|\varepsilon e_i\rangle}&=&\varepsilon\sqrt{\left(\Sigma^{-1}\right)_{i,i}} \\ &\approx& \varepsilon\sqrt{\left(\Sigma_{i,i}\right)^{-1}} \\ &=& \frac{\varepsilon}{\sqrt{\Sigma_{i,i}}} \\ &=& \frac{\left\||\beta'\rangle-|\hat\beta\rangle\right\|}{\sigma\sqrt{\left(\left(X^TX\right)^{-1}\right)_{i,i}}} \end{array}\]

At the end, this looks like an ordinary z-score in one dimension, which justifies the confidence intervals reported by the regression program. The approximation in this calculation is doing some heavy lifting. The approximation is exact when \(X^TX\) is diagonal (which is the case when the columns of \(X\) are orthogonal, i.e., the input variables are uncorrelated). In practice this is often close enough to correct, but there are real cases where this isn’t true. Imagine the same problem setup, but there are some "mega-event" weekends where we sell ten thousand tickets, instead of the usual hundred or so. We’d expect mega-event weekends to have large numbers for both VIP and regular tickets, introducing a strong correlation. When variables are correlated, our estimates of the coefficients are also correlated. That is, if we lowered the predicted spend from VIP customers, we’d increase the predicted spend from regular customers to make up the difference. But in these cases, a single-dimension confidence interval is less meaningful: It may actually be likely that VIP spend will be predicted significantly higher, but only in tandem with a lower regular spend. We talk about this more in the next subsection.

So far we’ve discussed variance on the coefficients themselves. And this is useful. One thing you may want to do with linear regressions is report the coefficients. In our example, they give the relative "worth" of each type of customer, which can help with some business decisions. But another use of linear regression is "inference" or making out-of-sample predictions. This, for example, would be answering a question like, "If we sell \(100\) regular tickets and \(10\) VIP tickets, how much revenue should we expect to make?" Given the numbers above, we model this as \(\hat y=113(100)+930.9(10)\), or \(y=113(100)+930.9(10)+\varepsilon\) to express the randomness. Or written in matrix form \(y=\left(\begin{array}{cc}100&10\end{array}\right)\left(\begin{array}{c}113\\930.9\end{array}\right)+\varepsilon=\langle x|\hat\beta\rangle+\varepsilon\). As a point estimate, we’d expect revenue of $\(20,609\), but what kind of range can we give on this estimate?

Given that \(|y\rangle=\langle x|\hat\beta\rangle+\varepsilon\), and both \(|\hat\beta\rangle\) and \(\varepsilon\) are random, we have

\[\begin{array}{rcl} \operatorname{Var}\left(|y\rangle\right) &=& \operatorname{Var}\left(\langle x|\hat\beta\rangle\right) + \operatorname{Var}(\varepsilon) \\ &=& \langle x|\Sigma|x\rangle + \sigma^2 \\ &=& \langle x|\left(X^TX\right)^{-1}\sigma^2|x\rangle + \sigma^2 \\ &=& \sigma^2\left(\langle x|\left(X^TX\right)^{-1}|x\rangle + 1\right) \end{array}\]

Where \(\Sigma\) is taken from Theorem 4.5. Some of the details here are covered in Exercise [random-constant], but this calculation can also be revisited after reading Section 7.1. The interesting thing about this calculation is the common \(\sigma^2\), the variance of the residuals. In the first case, it factors into the uncertainty in the coefficients, \(|\hat\beta\rangle\); in the second case, it adds constant variance to this single new prediction. The range given on coefficients is sometimes called a "confidence interval," while an equivalent interval given on a new prediction using \(\sigma^2\left(\langle x|\left(X^TX\right)^{-1}|x\rangle + 1\right)\) is called a "prediction interval."
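
Here’s a sketch of both computations for the prediction above, using the rough two-standard-deviation rule (statsmodels does this more carefully with a t-distribution; \(\sigma^2\) is estimated here with the usual \(n-p\) degrees-of-freedom correction, matching the readout’s Df Residuals):

    import numpy as np

    X = np.array([[100, 3], [90, 6], [94, 5], [163, 0], [119, 5], [78, 4]], dtype=float)
    y = np.array([14.1, 14.8, 15.1, 18.0, 20.0, 11.8])

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (len(y) - X.shape[1])     # residual variance, n - p degrees of freedom

    x_new = np.array([100.0, 10.0])
    y_hat = x_new @ beta_hat                           # point estimate, about 20.6 (thousands)
    var_coef = sigma2 * (x_new @ np.linalg.inv(X.T @ X) @ x_new)   # uncertainty from the coefficients
    var_pred = var_coef + sigma2                                    # plus the noise on a single new day
    print(y_hat, 2 * np.sqrt(var_coef), 2 * np.sqrt(var_pred))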

We look at some measures of variance, given a training set. So far, we’ve talked about the residual sum of squares (RSS) \(=\sum_i\left(y_i-\hat y_i\right)^2\). This measures how far off our predictions are. But when we talk about the variance of \(y\), we usually mean \(\sum_i\left(y_i-\bar y\right)^2\) (up to a factor of \(\frac1n\)), where \(\bar y=\frac{1}{n}\sum_i y_i\), the average. We call this the total sum of squares (TSS). Finally, we have an explained sum of squares (ESS) \(=\sum_i\left(\hat y_i-\bar y\right)^2\). This name is justified by the next theorem.

Theorem 4.6. With \(\text{ESS}\), \(\text{RSS}\), and \(\text{TSS}\) as defined above,

\[\text{ESS}+\text{RSS}=\text{TSS}.\]

Proof. Throughout, it will be useful to write any sum of squares, \(\sum_i x_i^2\), as an inner product \(\langle x|x\rangle\). When needed, \(|\bar y\rangle\) will be a vector with every entry equal to \(\bar y\). Other vectors will match names but without the subscripts.

\[\begin{array}{rcl} \text{TSS} &=& \langle y-\bar y|y-\bar y\rangle \\ &=& \langle y-\hat y+\hat y -\bar y|y-\hat y+\hat y -\bar y\rangle \\ &=& \langle y-\hat y|y-\hat y\rangle + \langle \hat y - \bar y|\hat y-\bar y\rangle + 2\left(\langle y|\hat y\rangle -\langle\hat y|\hat y\rangle\right) -2\left(\langle y-\hat y|\bar y\rangle\right) \\ &=& \text{RSS} + \text{ESS} + (0) + \left(-2\bar y\sum_i\left(y_i-\hat y_i\right)\right) \\ &=& \text{RSS} + \text{ESS} + (0) + (0) \\ \end{array}\]

The zeroing of the first parenthetical follows from Exercise [y-hat-ex-1], and the second follows from the fact that \(\langle y-\hat y|\), the residuals, average to zero, and therefore sum to zero. (This last fact requires that a constant vector lies in the span of the columns of \(X\); for example, it holds when the model includes an intercept column.) ◻

Notice that both ESS and RSS must be non-negative. In the case that \(y_i=\hat y_i\) for all \(i\), this is a perfect prediction; here ESS=TSS and RSS=\(0\). In the case that \(\hat y_i=\bar y\) for all \(i\), the model is totally informationless; here ESS=\(0\) and RSS=TSS. We see that as the prediction gets better, TSS gradually falls more into the ESS bucket, rather than the RSS bucket. ESS/TSS is a number between \(0\) and \(1\) that measures goodness of fit; this is called the coefficient of determination, or more commonly just \(R^2\). (\(\sqrt{R^2}\) also has a meaning, which justifies the exponent, but we don’t discuss that here.) This is also reported in Table [sm_readout].
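
A quick numerical check of the decomposition (a sketch; an intercept column is included so that the residuals sum to zero, which the proof relies on):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])   # intercept + 2 features
    y = X @ np.array([1.0, 2.0, -0.5]) + 0.3 * rng.standard_normal(n)

    y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
    TSS = np.sum((y - y.mean()) ** 2)
    RSS = np.sum((y - y_hat) ** 2)
    ESS = np.sum((y_hat - y.mean()) ** 2)
    print(np.isclose(ESS + RSS, TSS), ESS / TSS)     # True, and ESS/TSS is the R^2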

4.4.2 Multicollinearity

If you ask a computer to do a linear regression, the computer will set off on calculating \(|\hat y\rangle=X|\hat\beta\rangle=X\left(X^TX\right)^{-1}X^T|y\rangle\). But this exists only if \(X^TX\) is invertible. As mentioned earlier, this is invertible if and only if \(X\) is injective, which happens exactly when the columns of \(X\) are linearly independent. The columns fail to be independent if there’s some linear relationship among them (by Theorem 9.3). For example, let’s say that you’re pricing insurance policies. You have a matrix \(X\) where each row is a data point (customer) and the first column is a constant \(1\), added to many models; the second column is an indicator for male, \(1\) if the customer is a man, and \(0\) if the customer is a woman; and the third column is an indicator for female. Call these columns "const", "is_male", and "is_female". Imagine that the dataset has every customer recorded as either male or female (due to software limitations or some policy); then we’d have the relationship const = is_male + is_female. In this case \(X^TX\) is not invertible. This is called aliasing or multicollinearity.

There’s a plain English problem here. If you removed the const column and ran a regression against the insurance cost of the customers, then you’d find that the coefficient for is_male would be the average cost for men, and is_female’s coefficient would be the average cost for women. Let’s say for this example, it’s $\(600\) and $\(400\) respectively. If instead we dropped the is_male column and kept the other two, then we’d see that const has coefficient \(600\) and is_female \(-200\), so that men still get a prediction of \(600\) and women still get \(400\). But what happens if we include all three variables? We may see that we get const \(0\), is_male \(600\), and is_female \(400\). But we could just as validly get const \(1000\), is_male \(-400\), and is_female \(-600\). Or we could get const \(99999\), is_male \(-99399\), and is_female \(-99599\). Any of these are equally valid, so the linear regression doesn’t know which to pick. There is still a pseudo-inverse even for low-rank \(X\), but without the formula \(\left(X^TX\right)^{-1}X^T\). However, we don’t want to jump to using it here, because there isn’t a perfect answer in this case. It’s true that we may get a decent overall model, but imagine a model reporting that being male saves you \(99,399\) dollars from your insurance; it doesn’t make sense! It’s better to force the modeler to resolve the alias, usually by dropping an unneeded column.

Perhaps a more pernicious issue is near-aliasing. Imagine the same aliasing issue as above, but this time there’s a single data point with both is_male and is_female equal to \(0\). This is no longer technically aliasing, but the problem persists. Imagine that the single non-binary data point had a cost of \(10,000\) dollars; then our coefficients will be const \(10000\), is_male \(-9400\), and is_female \(-9600\). Again this presents a problem with explainability. As well, there’s a problem of variance; all the confidence intervals will be wide. There’s a problem of over-leveraged points; the single non-binary point has too much influence on predictions. There’s a stability problem; that is, if we run the same model next year, predictions could be wildly different. Instability and lack of explainability are unacceptable for insurance, but also for many other applications.

Near Aliasing Example Let’s think about what’s happening mathematically. Let \(X\) be a \((2n+1)\times2\) matrix with every entry equal to \(0\) or \(1\). We’ll assume that for the first \(2n\) rows the two columns match, and that exactly half of these rows are \(1\)s. The last row will differ. For example, this may look like:

\[\left(\begin{array}{cc} 0&0\\1&1\\1&1\\0&0\\0&0\\ \vdots &\vdots\\1&1\\ 1&0 \end{array}\right)\]

Assume that \(n\) is large. This matrix is almost-aliased because there’s nearly a linear relationship between the two columns: specifically, the first column minus the second column is nearly the zero vector.

Say that \(X\) has the SVD \(\sum_i\lambda_i|u_i\rangle\langle v_i|\). Then \(X^TX=\sum_i\lambda_i^2|v_i\rangle\langle v_i|\). Meanwhile \(X^TX\) is

\[\left(\begin{array}{cc}n+1&n\\n&n\end{array}\right)\]

Using the arguments from Section 2.4, we know that we can calculate eigenvalues and eigenvectors from here. We don’t cover calculations much in this book, because when \(n\) isn’t a variable, this can be done with a computer. But the reader may recall that it looks like:

\[0=\left\|\begin{array}{cc}n+1-\lambda&n\\n&n-\lambda\end{array}\right\|=(n+1-\lambda)(n-\lambda)-n^2=\lambda^2-(2n+1)\lambda+n\]

Solving this for \(\lambda\) using the quadratic formula, we get \(\lambda=\frac{(2n+1)\pm\sqrt{4n^2+1}}{2}\). \(\sqrt{4n^2+1}\) is between \(2n=\sqrt{4n^2}\) and \(2n+1=\sqrt{4n^2+4n+1}\). So the two eigenvalues are \(a\) and \(2n+b\), with both \(a\) and \(b\) between \(0\) and \(1\). Associated with \(a\) is the singular matrix

\[\left(\begin{array}{cc}n+1-a&n\\n&n-a\end{array}\right)\]

with an eigenvector pretty close to \(\langle1/\sqrt2, -1/\sqrt2\rangle\). Associated with \(2n+b\) is the singular matrix

\[\left(\begin{array}{cc}-n+1-b&n\\n&-n-b\end{array}\right)\]

with an eigenvector close to \(\langle1/\sqrt2, 1/\sqrt2\rangle\). These two eigenvectors are \(|v_1\rangle\) and \(|v_2\rangle\). To get \(|u_1\rangle\) out of the equation \(X=\lambda_1|u_1\rangle\langle v_1|+\lambda_2|u_2\rangle\langle v_2|\), we can compute \(\frac{1}{\lambda_1}X|v_1\rangle\). If \(\vec v_1=\langle1/\sqrt2, -1/\sqrt2\rangle\), then \(\frac{1}{\lambda_1}X|v_1\rangle\) is a multiple of \(|e_{2n+1}\rangle\). Rather than worrying too much about constants, we know that \(\vec u_1\) must be a unit vector in the direction of \(\vec e_{2n+1}\) (because all other vector entries cancel). Because \(\langle1/\sqrt2, -1/\sqrt2\rangle\) is only approximately \(\vec v_1\), \(|e_{2n+1}\rangle\) is only approximately \(|u_1\rangle\). Similarly, \(|u_2\rangle\) is approximately a unit vector pointing in the direction of the average of the two columns; the average of the two columns is \(1/2\) for the last entry, and the value from either column for the rest of the entries. We call this vector \(|\bar u\rangle\). Let’s also call \(\vec v_{\text{sum}}=\langle1/\sqrt2, 1/\sqrt2\rangle\) and \(\vec v_{\text{diff}}=\langle1/\sqrt2, -1/\sqrt2\rangle\).

Recalling that the eigenvalues we calculated earlier are \(\lambda_i^2\), we have that \(X\) has the SVD \(X\approx\sqrt{2n+b}|\bar u\rangle\langle v_{\text{sum}}|+\sqrt{a}|e_{2n+1}\rangle\langle v_{\text{diff}}|\). This tells us, not surprisingly, that the principal axis looks very close to either of the two columns, and the second principal axis is just the difference between these two columns. Then \(X\) maps these two axes to some perpendicular axes in the range, but with the first principal axis weighted much higher. And this makes sense: The first principal axis takes into account all \(2n+1\) entries, so its weight should be much higher than the second axis, which only looks at the last row.

But now observe what happens to the pseudoinverse: \(X^+\approx\frac{1}{\sqrt{a}}|v_{\text{diff}}\rangle\langle e_{2n+1}|+\frac{1}{\sqrt{2n+b}}|v_{\text{sum}}\rangle\langle\bar u|\). When applied to observed values \(\vec y\), the first term takes the last element and the second term takes the correlation of \(y\) with \(\bar u\). The odd thing, though, is that this time the first term is much larger. What’s happening is that the linear regression needs to determine an entire coefficient using only the last data point; it does this by scaling up. This demonstrates that the less influence a principal axis has on a matrix, the larger its influence on the pseudoinverse. Mathematically this explains how the non-binary customer described above was able to single-handedly shift all the coefficients dramatically.
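The flip in weights is easy to see numerically. Below is a sketch of my own (the response values are invented): it builds the \((2n+1)\times2\) matrix above for \(n=50\), looks at its singular values, and shows the lone odd row pinning down an entire coefficient.

import numpy as np

n = 50
col = np.r_[np.ones(n), np.zeros(n), 1.0]          # one column: n ones, n zeros, then a 1
X = np.column_stack([col, np.r_[col[:-1], 0.0]])   # other column matches except in the last row

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)      # roughly [sqrt(2n+b), sqrt(a)]: one dominant axis, one weak one
print(1 / s)  # in the pseudoinverse the ordering flips: the weak axis gets the big weight

# Made-up observations: 500 for every ordinary row, 10,000 for the lone odd row.
y = np.r_[np.full(2 * n, 500.0), 10_000.0]
print(np.linalg.pinv(X) @ y)   # roughly [10000, -9500]: one row fixes a whole coefficient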

OLS works best when the input variables are independent, and fails completely when there’s a linear relationship (perfect correlation). But there’s a spectrum of correlation. Small categories, like non-binary, are usually grouped with larger buckets to avoid near-aliasing. But near-aliasing can also be the consequence of over-zealous feature engineering. Imagine you find that historical insurance cost is a good predictor of future cost. As such you make two variables, number of accidents and size category of last accident (1-3). A coworker says that you should instead use the average size category of all accidents. Another coworker says that you should total all losses and bucket them (1-9). Another coworker prefers a lighter touch, and creates a variable called "is loss" which is \(1\) or \(0\) based on history. If you include all of these variables, you won’t get a perfect alias, but you will get a ton of correlation. For example, "is loss" will be correlated with number of accidents and total losses, since most customers have no accidents. Average size category will equal last size category most of the time, since most people with accidents had only one accident. You may even find that total losses (1-9) is usually close to 3 x (number of accidents - 1) + (last accident size). The long story short here is that lots of correlation produces weaker models.

The lack of stability here is a big deal. Unlike the non-binary example above, it’s not clear where the instability occurs. Which data points will get a different prediction if we update the model? What happens to our model if we remove a few data points? You don’t want to be on the job and change your model dramatically when you remove a few data points. You don’t want to give a much different price to an important client than you did a week ago. It would be better if you used fewer variables. A priori, I don’t know which ones are the best, but you could learn with a forward regression, for example. There’s also an art to building features that are likely to be close to independent. For example, instead of including both average accident size and last accident size, you could use last accident size and "trend", which equals last accident size minus second-to-last accident size (or NULL if only one accident). This would be just as correlated, but it moves all the instability to the trend variable, which can be more clearly analyzed. Today non-linear models (random forests or neural networks) are common, and throwing in many highly-correlated variables is standard operating procedure there. The "include everything" strategy is less artful, but scales better to many models, updated frequently.

4.4.3 VIF and Cook’s D

This subsection will be light on proofs; the curious reader is encouraged to search. We frame this subsection around understanding how variance depends on data points (rows of \(X\)) and input variables (columns of \(X\)).

To begin with, we present a formula without proof.

\[\operatorname{Var}\left(\hat\beta_j\right)=\frac{\sigma^2}{n\operatorname{Var}\left(X_j\right)}\cdot\frac{1}{1-R_j^2}\]

The left-hand side is just the variance of the estimate for \(\beta_j\). On the right-hand side, \(\sigma^2\) as usual is the variance of the residuals, \(n\) is the number of data points, \(\operatorname{Var}\left(X_j\right)\) is the variance of the \(j\)-th column of \(X\) (viewed as \(n\) observations of a random scalar variable), and \(R_j^2\) is the coefficient of determination when we fit \(X_j\) as a linear regression on the other columns of \(X\). (Note that we’ve again been careless about degrees of freedom, and this equation is only approximate.)

This formula tells us that variance of a coefficient estimate varies directly with the variance of the residuals, and inversely with the number of observations and the variance of the input variable itself. We already knew about the variance of residuals. The \(n\) in the denominator says that coefficients will become less variable with more data, which is expected. (Notice that the residual part of the prediction interval does not shrink with \(n\) though.) The variance of the column in the denominator makes some sense: Imagine, for example, I had an input variable representing length in feet, and I had a coefficient confidence interval of \((9, 15)\); if I replaced this with the same variable in yards, I should have a coefficient confidence interval of \((3, 5)\). The only difference is the scale, and variance measures this scale.

The second multiplicand, \(\frac{1}{1-R_j^2}\), is the most surprising part of this formula. It is called the Variance Inflation Factor (VIF). This inflation factor is minimized when \(R_j^2=0\). This is the case where \(X_j\) is independent from the other columns of \(X\); that is, there’s no relationship. On the other end of the spectrum, if there were aliasing, then \(X_j\) could be perfectly predicted by the other columns of \(X\), and the variance would be, in effect, infinite. The VIF tells you how much the variance of the \(j\)-th coefficient is inflated due to the correlation between the variables in \(X\). If you think you have multicollinearity problems, you can use the VIF of each variable (computed for you by a computer) to determine which variables you should be most concerned about. Some practitioners will use rules of thumb, like "if VIF \(>5\), then the variable is highly correlated to others."
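In Python, statsmodels provides a helper that computes VIFs directly. A sketch with fabricated data (all names below are mine): x2 is nearly a copy of x1, so both get large VIFs, while the unrelated x3 stays near \(1\).

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly aliased with x1
x3 = rng.normal(size=200)                   # independent of the others
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

vifs = [variance_inflation_factor(X.values, j) for j in range(1, X.shape[1])]
print(dict(zip(X.columns[1:], vifs)))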

Sometimes high variance in \(|\hat\beta\rangle\) is due to individual points, as in our example with a single non-binary data point. Cook’s Distance or Cook’s D provides a very simple definition for how much a single point can change our prediction. Call \(\hat y_j\) the predicted value for the \(j\)-th data point, and \(\hat y_{j(i)}\) the predicted value for the \(j\)-th data point when the \(i\)-th data point is removed from the data set. Then define

\[D_i:=\frac{\sum_j\left(\hat y_j-\hat y_{j(i)}\right)^2}{r\sigma^2}\]

where \(r\) is the number of input variables. This is very intuitive; it just measures how much the predictions change by including the \(i\)-th data point. The denominator is a sort of normalizing factor that gives comparable values of \(D\) for different types of models. When normalized, we can use rules of thumb, like "remove any data point with \(D_i>4/n\)." Generally, we can look at data points with the highest \(D_i\) as the most influential, and consider removing them. Or we can just use this as an additional piece of information. Imagine you were building a model to predict outcomes of American football games. If the most influential data point was a large upset that occurred because the quarterback was injured, then this may tell you that injuries need to be a variable in your model.

The nice thing about Cook’s D is that there’s this identity that I won’t prove:

\[D_i = \frac{\varepsilon_i^2}{r\sigma^2}\left(\frac{h_{ii}}{(1-h_{ii})^2}\right)\]

where \(h_{ii}\) is the \(i\)-th diagonal entry of the hat matrix, which we have to compute to solve for \(\hat\beta\) anyway. This allows us to get all the values of Cook’s D fairly inexpensively.
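statsmodels also computes Cook’s D for you through its influence diagnostics. A sketch with synthetic data (my own construction): one observation gets a large planted error and duly shows up with the biggest \(D_i\).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)
y[-1] += 15.0                                  # plant one wildly off observation

results = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = results.get_influence().cooks_distance
print(np.argmax(cooks_d), cooks_d.max())       # the planted point has the largest D_i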

4.4.4 Closing remarks

One note is that least squares regression is just one of many kinds of model. But whereas many complex or weak-assumption models, like neural networks, ARIMA, or MCMC, generally solve an optimization via some variation of gradient descent, least squares regression has a closed-form matrix solution. Computing it is fast, even with large datasets.

Finally, it should be mentioned that this book is meant to motivate and explore the mathematics that go into a regression. But this does not make you a modeling expert. There are many considerations that go into modeling, like model choice, data collection, data quality, outlier modeling, train/test splits, hyperparameter optimization, over-fitting, winner’s curse, model stability, explainability, acceptability, and model maintenance. These things are outside the scope of this book. And although many of these topics are less mathematically complex than the formula itself, these topics can take years to master. For the practitioner, this book should be supplemented with modeling books and practice.

Exercises

  1. [random-constant] Let \(|u\rangle\) be a random vector. The expected value of \(|u\rangle\) is itself a vector, \(\left|\operatorname{E}(u)\right\rangle\), obtained by taking the expected value of each component, and the variance is a matrix, which we call \(\Sigma^2\). (See section 7.1 for details.) Let \(|v\rangle\) be a non-random (constant) vector. Then \(\langle u|v\rangle\) is a scalar, and so the expected value and variance of this are also scalars. Show the following:

    1. \(\operatorname{E}\left(|u\rangle+|v\rangle\right)=\left|\operatorname{E}(u)\right\rangle+|v\rangle\).

    2. \(\operatorname{E}\left(\langle u|v\rangle\right)=\left\langle\operatorname{E}(u)|v\right\rangle\).

    3. \(\operatorname{Var}\left(\langle u|v\rangle\right)=\langle v|\Sigma^2|v\rangle\).

    4. Assuming \(|u\rangle\) and \(|u'\rangle\) are independent, \(\operatorname{Var}\left(|u\rangle+|u'\rangle\right)=\operatorname{Var}(|u\rangle)+\operatorname{Var}(|u'\rangle)\). And in particular if \(|u'\rangle\) is constant (non-random), then \(\operatorname{Var}\left(|u\rangle+|u'\rangle\right)=\operatorname{Var}(|u\rangle)\).

  2. [random-constant-2] For a non-random matrix \(X\), prove that \(\operatorname{E}\left(X|v\rangle\right)=X\operatorname{E}\left(|v\rangle\right)\).

  3. [y-hat-ex-1] Prove that \(\langle y|\hat y\rangle=\langle \hat y|\hat y\rangle\), using the definition \(|\hat y\rangle=X|\hat\beta\rangle=X\left(X^TX\right)^{-1}X^T|y\rangle\).

  4. We will explore inverses for almost diagonal matrices in a very special case: Given an \(n\times n\) matrix \(M\) with \(\left|M_{i,j}\right|=\left\{\begin{array}{rcl} 0&;&i<j\\1&;&i=j\\\leq\varepsilon/n &;&i>j\end{array}\right.\) Prove that \(\left|\left(M^{-1}\right)_{i,j}\right|=\left\{\begin{array}{rcl} 0&;&i<j\\1&;&i=j\\\leq\varepsilon/\left(1-\varepsilon\right) &;&i>j\end{array}\right.\). (Hint: Write \(M=I+E\), then \(M^{-1}=I-E+E^2-E^3+\cdots\). Argue that \(E^n=0\).)

  5. Figure [sm_readout] reports AIC, which takes into account parsimony of the models. Look this up. How is it defined, and how does it relate to \(R^2\) for ordinary least squares?

  6. Return to the example of VIP tickets. Imagine that the venue has some days where they sell tens of tickets and other days where they sell thousands of tickets. As a result the number of VIP tickets they sell is highly correlated with the number of regular tickets they sell. How can you model this to account for the correlation?

4.5 Gram-Schmidt and QR factorizations

4.5.1 Gram-Schmidt Algorithm

The Gram-Schmidt Algorithm provides a way to transform a set of vectors to a set of orthonormal vectors that span the same subspace. It does this through two basic operations, normalizing and orthogonal projecting. (The latter is a term I made up for this chapter that can probably be understood from context.) Normalizing means to transform a vector \(|v\rangle\to\frac{|v\rangle}{\sqrt{\langle v|v\rangle}}\). The resulting vector is normal (unit length), because \(\left(\frac{\langle v|}{\sqrt{\langle v|v\rangle}}\right)\left(\frac{|v\rangle}{\sqrt{\langle v|v\rangle}}\right)=\frac{\langle v|v\rangle}{\langle v|v\rangle}=1\). Orthogonal projecting means to subtract off the components of a vector that live in a given subspace; to orthogonally project a vector against the span of an orthonormal set \(\left\{|v_i\rangle\right\}_i\), we do \(|v\rangle\to|v\rangle-\sum_i|v_i\rangle\langle v_i|v\rangle\). The resulting vector is orthogonal to all the \(|v_i\rangle\) (and therefore to the span of \(\left\{|v_i\rangle\right\}_i\)), because

\[\langle v_j|\left(|v\rangle-\sum_i|v_i\rangle\langle v_i|v\rangle\right)=\langle v_j|v\rangle - \langle v_j|v_j\rangle\langle v_j|v\rangle = 0\]

Notice how we used orthonormality of the \(\left\{|v_i\rangle\right\}_i\) for this calculation. This arithmetic may seem magic, but it’s a natural consequence of how we chose the transformation.

Gram-Schmidt works by building up a set of orthonormal vectors as follows:

Algorithm 2: Gram-Schmidt

Given: A set of independent vectors \(\left\{|v_i\rangle\right\}_{i=1}^n\).

Set \(|v_1'\rangle:=\frac{|v_1\rangle}{\sqrt{\langle v_1|v_1\rangle}}\)

For \(i=2..n\)

  1. Orthogonally project: \(|v_i^\bot\rangle:=|v_i\rangle-\sum_{j=1}^{i-1}|v_j'\rangle\langle v_j'|v_i\rangle\).

  2. Normalize: \(|v_i'\rangle:=\frac{|v_i^\bot\rangle}{\sqrt{\langle v_i^\bot|v_i^\bot\rangle}}\)

The resulting set \(\left\{|v_i'\rangle\right\}_i\) spans the same set as \(\left\{|v_i\rangle\right\}_i\) and is orthonormal.

Note that \(\sum_{j=1}^{i-1}|v_j'\rangle\langle v_j'|\) is just projection onto the first \(i-1\) vectors, and \(I-\sum_{j=1}^{i-1}|v_j'\rangle\langle v_j'|\) is projection onto the orthogonal space.
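Here is a minimal NumPy sketch of the algorithm (the function name is mine); like the algorithm itself, it assumes the inputs are independent. It is run on the vectors from the worked example below.

import numpy as np

def gram_schmidt(vectors):
    # Orthonormalize a list of independent vectors, in order.
    basis = []
    for v in vectors:
        v_perp = v - sum(np.dot(u, v) * u for u in basis)  # project off what's already spanned
        basis.append(v_perp / np.linalg.norm(v_perp))      # then normalize
    return basis

# The vectors from the worked example below:
vs = [np.array([1.0, 2, 2, 0]), np.array([0.0, 3, 3, 0]), np.array([0.0, 0, 0, 2])]
for u in gram_schmidt(vs):
    print(np.round(u, 4))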

By now you may appreciate how useful orthonormal vectors are. But to demonstrate the usefulness, we’ll look at an example.

Equation of a plane Call the four axes of \(\mathbb R^4\), \(v\), \(x\), \(y\), and \(z\). Find the equation of the hyperplane that’s spanned by \(\langle 1,2,2,0\rangle\), \(\langle 0,3,3,0\rangle\) and \(\langle 0,0,0,2\rangle\).

We perform Gram-Schmidt to orthonormalize these vectors.

\[|v_1'\rangle=\langle 1/3, 2/3, 2/3, 0\rangle\]

\[|v_2^\bot\rangle=\langle 0,3,3,0\rangle-\left(\langle 0,3,3,0\rangle\cdot\langle 1/3, 2/3, 2/3,0\rangle\right)\langle 1/3, 2/3, 2/3,0\rangle=\langle -4/3, 1/3, 1/3, 0\rangle\]

\[|v_2'\rangle=\left\langle -\frac{4}{3\sqrt2}, \frac{1}{3\sqrt2}, \frac{1}{3\sqrt2}, 0\right\rangle\]

\(\langle 0,0,0,2\rangle\) is already seen to be orthogonal to \(|v_1'\rangle\) and \(|v_2'\rangle\). If we do the projection, we’ll see that this is untouched. The normalization then is \(|v_3'\rangle=|e_4\rangle\).

We can then use these vectors to get an orthogonal vector. We want to project something into the orthogonal space \(I-\sum_{i=1}^3|v_i'\rangle\langle v_i'|\), but what? Any ole vector will work, so long as it’s not in the plane, and almost no vectors are. Let’s try \(|w\rangle=\langle 0,1,0,0\rangle\). (Careful: \(\langle 1,0,0,0\rangle\) would not work, since it happens to lie in the hyperplane; it equals the first spanning vector minus \(\frac{2}{3}\) of the second.)

\[|w_\bot\rangle=|w\rangle-\frac{2}{3}|v_1'\rangle-\frac{1}{3\sqrt2}|v_2'\rangle-0|v_3'\rangle=\langle 0, 1/2, -1/2, 0\rangle\]

Though it’s common to normalize \(|w_\bot\rangle\), it’s not needed. Any vector orthogonal to \(|w_\bot\rangle\) is in the hyperplane. So \(\frac{1}{2}x-\frac{1}{2}y=0\), or more simply \(x=y\), is an equation for the hyperplane.

Another clear application is that once you have an orthonormal basis for a subspace, you can write down a projection matrix for it.

4.5.2 QR factorization

One application of Gram-Schmidt is QR factorization.

In the following theorem we’ll need a nice representation. Say \(\left\{|v_i\rangle\right\}_i\) transforms via Gram-Schmidt to \(\left\{|v_i'\rangle\right\}_i\). By construction, each \(|v_i\rangle\) is in the span of the first \(i\) transformed vectors. That is, \(|v_i\rangle=\sum_{j=1}^i a_j|v_j'\rangle\). We can use orthonormality to solve for the \(\left\{a_j\right\}\), because for any \(k\leq i\), we can say that

\[\langle v_k'|v_i\rangle=\left\langle v_k'|\left(\sum_{j=1}^i a_j|v_j'\rangle\right)\right\rangle=\sum_{j=1}^i a_j \langle v_k'|v_j'\rangle=a_k\]

This allows us to write \(|v_i\rangle=\sum_{j=1}^i \langle v_j'|v_i\rangle |v_j'\rangle\).

Theorem 4.7. Any square matrix, \(A\), can be written as \(A=QR\) where \(Q\) is orthogonal and \(R\) is upper triangular (all elements below the diagonal are zero). This is called the QR factorization of \(A\).

Proof. Let \(\left\{|v_i\rangle\right\}_{i=1}^n\) be the columns of \(A\), and \(\left\{|v_i'\rangle\right\}_i\) be the Gram-Schmidt transform of these. (For simplicity, assume the columns are independent; the factorization still exists when they’re not, but Gram-Schmidt needs a small modification.)

From the argument above, we can write \(|v_i\rangle=\sum_{j=1}^i \langle v_j'|v_i\rangle |v_j'\rangle\). Now set

\[Q = \left( \begin{array}{cccc} \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex}\\ |v_1'\rangle & |v_2'\rangle & \cdots & |v_n'\rangle \\ \rule[-1ex]{0.5pt}{2.5ex}& \rule[-1ex]{0.5pt}{2.5ex}& & \rule[-1ex]{0.5pt}{2.5ex} \end{array} \right)\]

and

\[R = \left(\begin{array}{ccccc} \langle v_1'|v_1\rangle & \langle v_1'|v_2\rangle & \langle v_1'|v_3\rangle & \cdots & \langle v_1'|v_n\rangle \\ 0 & \langle v_2'|v_2\rangle & \langle v_2'|v_3\rangle & \cdots & \langle v_2'|v_n\rangle \\ 0 & 0 & \langle v_3'|v_3\rangle & \cdots & \langle v_3'|v_n\rangle \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \langle v_n'|v_n\rangle \end{array}\right)\]

The reader can verify that \(A=QR\). ◻

The QR factorization makes some calculations easier, especially if you’re solving by hand. If you were solving a system of equations \(|y\rangle=A|x\rangle\) for \(|x\rangle\) (with \(A\) an \(n\times n\) matrix), \(|y\rangle=QR|x\rangle\) is easier to solve because we could write it as \(Q^T|y\rangle=R|x\rangle\). The \(n\)-th element of \(Q^T|y\rangle\) must equal \(R_{n,n}x_n\) (all other entries in the last row of \(R\) are \(0\)), so we can solve for \(x_n\). The \((n-1)\)-th element of \(Q^T|y\rangle\) equals \(R_{n-1,n-1}x_{n-1}+R_{n-1,n}x_n\); we can plug in \(x_n\) and solve for \(x_{n-1}\), and so on. Similarly, we can easily calculate the determinant. Because the determinant distributes over matrix multiplication and the determinant of an orthogonal matrix is \(\pm1\), we need only compute the determinant of \(R\) (at least up to sign), and for an upper triangular matrix this is just the product of the diagonal elements. (See Exercise [determinant-ex].)
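Here is a sketch of both uses in NumPy (the matrix and the back-substitution loop are mine, mirroring the argument above):

import numpy as np

A = np.array([[2.0, 1, 1], [1, 3, 2], [1, 0, 0]])
y = np.array([4.0, 5, 6])

Q, R = np.linalg.qr(A)
z = Q.T @ y

# Back substitution: solve R x = z from the bottom row up.
x = np.zeros_like(y)
for i in range(len(y) - 1, -1, -1):
    x[i] = (z[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]

print(x, np.linalg.solve(A, y))               # same solution
print(np.prod(np.diag(R)), np.linalg.det(A))  # determinants agree up to sign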

QR factorization is key in the QR algorithm, which the Society for Industrial and Applied Mathematics once listed among the ten most important algorithms of the 20th century.

Algorithm 3: QR Algorithm

Given a square matrix \(A\).

Set \(A_0=A\)

For \(i=1...\)

  1. Factor \(A_{i-1}=Q_{i-1}R_{i-1}\).

  2. Set \(A_i=R_{i-1}Q_{i-1}\).

Under certain conditions, \(A_i\) converges to an upper triangular matrix.

In the algorithm, \(A_i=Q_{i-1}^TA_{i-1}Q_{i-1}\). By multiplying all the \(Q_i\) together and calling the product \(Q\), we get the relationship \(A_n=Q^{-1}AQ\). This is called similarity. We don’t spend enough time on similarity, but it preserves lots of nice properties, for example:

Lemma 4.8. If \(A=P^{-1}BP\), then \(A\) and \(B\) have all the same eigenvalues.

This is true because \(A-\lambda I=P^{-1}BP-\lambda P^{-1}P=P^{-1}\left(B-\lambda I\right)P\), so \(A-\lambda I\) is singular exactly when \(B-\lambda I\) is. For an upper triangular matrix, the eigenvalues are simply the elements of the diagonal. So if you found a matrix \(A_n\) that was upper-triangular and similar to \(A\), you could easily read off the eigenvalues. The QR algorithm computes quickly, and so is useful for many computer applications.
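Here is a bare-bones sketch of my own, on a small symmetric matrix where the iteration is known to behave well (one of the programming exercises at the end of this chapter asks you to implement this properly):

import numpy as np

A = np.array([[4.0, 1, 2], [1, 3, 0], [2, 0, 1]])

Ai = A.copy()
for _ in range(200):            # A <- R Q, over and over
    Q, R = np.linalg.qr(Ai)
    Ai = R @ Q

print(np.round(np.diag(Ai), 4))           # approximate eigenvalues on the diagonal
print(np.round(np.linalg.eigvals(A), 4))  # for comparison (possibly in a different order)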

4.5.3 Least Squares Regression as Projection

Suppose you have a regression, \(|y\rangle=\sum_{i=1}^n\beta_i|x_i\rangle\). Suppose you had an orthonormal basis for the span of \(\left\{|x_i\rangle\right\}_{i=1}^{n-1}\). You could use this to build a projection matrix, \(P\). If you define \(|x_n^\bot\rangle=|x_n\rangle-P|x_n\rangle\), then \(\hat\beta_n=\frac{\langle y|x_n^\bot\rangle}{\langle x_n^\bot|x_n^\bot\rangle}\). This is precisely the coefficient you get if you fit \(|y\rangle\) as a regression over the single variable \(|x_n^\bot\rangle\). We don’t prove this here.
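Here is a quick numerical check of that claim on synthetic data (I build \(P\) with a pseudoinverse rather than Gram-Schmidt, but any projection onto the span of the other columns works):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # full least-squares fit

X_rest = X[:, :2]
P = X_rest @ np.linalg.pinv(X_rest)           # projection onto the span of the other columns
x_perp = X[:, 2] - P @ X[:, 2]

print(beta[2], (y @ x_perp) / (x_perp @ x_perp))   # these agree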

This is reminiscent of Gram-Schmidt, and indeed Gram-Schmidt could be used to generate the projection matrix. However, don’t extend the analogy too far: Whereas Gram-Schmidt produces orthonormal vectors as you run the algorithm, this rule will only produce the last coefficient, with intermediate steps giving you no information about the other coefficients. While it’s true that you could rearrange your variables so that any one of them is last, there is no generate-as-you-go algorithm for fitting coefficients, because each coefficient requires information about all the other variables.

We owe the idea for the connection between least squares and Gram Schmidt to an excellent book called The Elements of Statistical Learning.

Exercises

  1. Complete the details of the proof of Theorem 4.7 by performing the multiplication of \(Q\) and \(R\).

  2. Explain how Gram-Schmidt can be used to determine the dimension of span given by a set of (not necessarily independent) vectors.

  3. For an orthonormal set, \(\left\{|v_i\rangle\right\}_i\), prove that \(|v\rangle-\sum_i|v_i\rangle\langle v_i|v\rangle=0\) when \(|v\rangle\) is in the span of \(\left\{|v_i\rangle\right\}_i\).

  4. Write a projection matrix that projects onto the subspace of \(\mathbb R^4\) spanned by \(|e_1\rangle+|e_2\rangle+|e_3\rangle+|e_4\rangle\) and \(|e_1\rangle+|e_2\rangle+|e_3\rangle-|e_4\rangle\).

  5. Prove that similarity is an equivalence relation. That is, show reflexivity, symmetry, and transitivity of similarity.

  6. An \(n\times n\) matrix \(A\) is called nilpotent if \(A^n=0\). Prove that \(A\) is nilpotent if and only if all eigenvalues are \(0\).

  7. [determinant-ex] Prove that the determinant of an upper triangular matrix is the product of the diagonal terms. You will probably need to use an algorithm or theorem that is not presented in this book.

  8. Read about LU decompositions.

4.6 Legendre polynomials

This is a fun section. Instead of \(\mathbb R^n\), we look at the vector space of polynomials. This is a well defined vector space, because it’s closed under addition and scalar multiplication. We define an inner product as:

\[\langle f(x)|g(x)\rangle = \int_{-1}^1f(x)g(x)dx\]

We know that \(\left\{x^n\right\}_{n=0}^\infty\) spans the infinite-dimensional space of polynomials. Here we will use Gram-Schmidt to make an orthonormal basis for polynomials. The results are called Legendre polynomials.

The norm of \(p_0(x)=1\) is \(\sqrt{\int_{-1}^1dx}=\sqrt2\), so the first Legendre polynomial is \(P_0(x)=\frac{1}{\sqrt2}\).

Next, we look at \(p_1(x)=x\). Now \(\langle p_0|p_1\rangle=\int_{-1}^1\frac{1}{\sqrt2}xdx=0\). And \(\langle p_1|p_1\rangle = \int x^2dx=\frac{2}{3}\). So that \(P_1(x)=\sqrt{\frac{3}{2}}x\).

Now \(p_2(x)=x^2\). \(\langle p_1|p_2\rangle=0\), since the integrand \(x\cdot x^2=x^3\) is odd. \(\langle p_0|p_2\rangle=\frac{1}{\sqrt2}\int x^2=\frac{2}{3\sqrt2}\). So \(p_2^\bot(x)=x^2-\frac{2}{3\sqrt2}\cdot\frac{1}{\sqrt2}=x^2-\frac{1}{3}\). \(\langle p_2^\bot|p_2^\bot\rangle=\frac{8}{45}\). So \(P_2(x)=\sqrt{\frac{45}{8}}\left(x^2-\frac{1}{3}\right)\).

We’ll list the next couple without details.

\[\begin{array}{rcl} P_3(x)&=&\sqrt{\frac{175}{8}}\left(x^3-\frac{3}{5}x\right) \\ P_4(x)&=&\sqrt{\frac{11025}{128}}\left(x^4-\frac{30}{35}x^2+\frac{3}{35}\right) \end{array}\]

Any polynomial can be written as a linear combination of Legendre polynomials. For example \(x^2=\sqrt{\frac{8}{45}}P_2(x)+\frac{\sqrt2}{3}P_0(x)\). We got this by doing some rearranging, but more algorithmically, we could write any polynomial as \(p(x)=\sum_i\lambda_iP_i(x)\). Then we solve for \(\lambda_j\) by applying an inner product to both sides:

\[\langle p(x)|P_j(x)\rangle=\left\langle\left(\sum_i\lambda_iP_i(x)\right)|P_j\right\rangle=\sum_i\lambda_i\langle P_i(x)|P_j(x)\rangle=\lambda_j\]
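Here is a sympy sketch of this recipe (the symbols and the helper function are mine): it recovers the Legendre coefficients of \(x^2\) by taking inner products.

import sympy as sp

x = sp.symbols("x")
P = [1 / sp.sqrt(2),                                            # P_0
     sp.sqrt(sp.Rational(3, 2)) * x,                            # P_1
     sp.sqrt(sp.Rational(45, 8)) * (x**2 - sp.Rational(1, 3))]  # P_2

def inner(f, g):
    # <f|g> = integral of f*g over [-1, 1]
    return sp.integrate(f * g, (x, -1, 1))

p = x**2
print([sp.simplify(inner(p, Pj)) for Pj in P])
# -> [sqrt(2)/3, 0, sqrt(8/45)] (up to how sympy writes the radicals)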

Once we have the Legendre representation, \(x^2=\sqrt{\frac{8}{45}}P_2(x)+\frac{\sqrt2}{3}P_0(x)\), we can do cool math with it like

\[\int_{-1}^1x^4dx=\langle x^2|x^2\rangle = \left\langle \sqrt{\frac{8}{45}}P_2(x)+\frac{\sqrt2}{3}P_0(x)|\sqrt{\frac{8}{45}}P_2(x)+\frac{\sqrt2}{3}P_0(x)\right\rangle=\frac{8}{45}+\frac{2}{9}=\frac{2}{5}\]

Generally, if you want to integrate a product of two polynomials, you can dot-product their Legendre coefficients for an easier calculation. It may not be worth it to convert a polynomial to a linear combination of Legendre polynomials just to integrate. But in theory, you could live in a world where your polynomial is already a linear combination. \(\sqrt{\frac{8}{45}}P_2(x)+\frac{\sqrt2}{3}P_0(x)\) is an equally valid representation of \(x^2\). What if we taught schoolchildren Legendre polynomials instead of regular polynomials? Then in Physics class, instead of learning that kinetic energy is \(\frac{1}{2}mv^2\), we’d learn that it’s \(\frac{1}{\sqrt6}P_1(m)\left(\sqrt{\frac{8}{45}}P_2(v)+\frac{\sqrt2}{3}P_0(v)\right)\).

Note that there are several variants of the Legendre polynomials, and this is not the most common normalization. These variants might use a different definition of the inner product or a different scaling factor.

  1. Validate the formulas for \(P_3(x)\) and \(P_4(x)\) by performing the calculation.

  2. Validate that \(P_1\) and \(P_3\) are orthogonal.

  3. Apply the projection matrix \(|P_0\rangle\langle P_0|+|P_1\rangle\langle P_1|\) to \(x^5\). How should you interpret this projection?

  4. Argue that \(x^n\) can be written as a linear combination of \(\left\{P_i(x)\right\}_{i=0}^n\), the Legendre polynomials up through degree \(n\).

  5. Consider the space of all polynomials in two variables, \(x\) and \(y\). Find an orthonormal basis for the subspace spanned by all polynomials of degree up to \(2\).

  6. \(e^x=\sum_{i=0}^\infty \frac{x^i}{i!}\) is a linear combination of basis elements, but it’s not a polynomial. Think about this.

4.7 Computer Programming, Regression

This section contains some questions to run with a computer. Some of the problems will require the reader to look up commands. Examples will be given in Python, but feel free to use the language of your choice.

Python has two common packages for linear regression, sklearn (sklearn.linear_model.LinearRegression) and statsmodels. sklearn is far more widely used, and better supports more modern modeling best practices. But this author prefers statsmodels for simple use cases, primarily due to its friendly readouts.

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame(
    data = {
        "revenue": [14.1, 14.8, 15.1, 18.0, 20.0, 11.8],
        "regular_tickets": [100, 90, 94, 163, 119, 78],
        "vip_tickets": [3, 6, 5, 0, 5, 4],
    }
)
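# Note: sm.OLS does not add an intercept automatically; wrap the design matrix
# in sm.add_constant(...) if you want a constant column in the model.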

model = sm.OLS(df["revenue"], df[["regular_tickets", "vip_tickets"]])
results = model.fit()

print(results.summary())

This chapter’s contents are discussed entirely through exercises. These exercises use the diabetes dataset, which can be loaded with sklearn.

from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

If you’re using another language: This is a small dataset so you can copy as a CSV or literals, after finding it online or executing the above code in a Python environment.

Exercises

  1. Using the diabetes data, fit a linear regression. What is the 95 percent confidence interval for each coefficient? What is \(R^2\)?

  2. Build a forward regression: Using the diabetes dataset, first build \(10\) single-variable regressions, one for each variable. Determine which of these models has the best \(R^2\). Keep the variable used for that model, call it \(X\). Then build \(9\) two-variable regressions, with inputs of \(X\) and each of the other variables. Determine which of these pairs of variables produces the best \(R^2\). Then repeat to find the best three variable model that includes these two variables. Continue until you’ve included all \(10\) variables. Graph the best \(R^2\) for each of \(10\) stages.

    Forward regression is a greedy algorithm for deciding which variables are the most predictive. Here we use \(R^2\), but we could also have used some other metric, like AIC. For the diabetes dataset, the best \(R^2\) is achieved when all variables are included. But this usually isn’t the case in practice. Forward regression tries to find the optimal subset of variables to include. However, what it finds isn’t optimal, just close. I’ve found that usually one forward regression followed by one backwards regression on the result does a good job.

  3. Perform bootstrapping by choosing a random \(70\%\) of diabetes data, and running a regression. Do this \(10,000\) times and record the coefficients. For each coefficient give a confidence interval by reporting the range that the coefficient falls into \(90\%\) of the time. Compare these to the confidence intervals given by the regression. Note: You can use a similar technique to get a range on \(R^2\).

  4. Using the diabetes dataset, model a regression as \(y=\beta_{age}X_{age}\) to get \(\hat y_1\). Then model \(y-\hat y_1=\beta_{bmi}X_{bmi}\), setting \(\hat y_2=\hat y_1+\hat\beta_{bmi}X_{bmi}\). Then model \(y-\hat y_2=\beta_{age}X_{age}\), setting \(\hat y_3=\hat y_2+\hat\beta_{age}X_{age}\). Then repeat modeling the residuals on BMI to get \(\hat y_4\). Then get \(\hat y_5\) using age, \(\hat y_6\) on BMI, etc. Do this until you get convergence. Compare the final values of \(\beta_{age}\) and \(\beta_{bmi}\) to the coefficients you get when you regress \(y=\beta_{age}X_{age}+\beta_{bmi}X_{bmi}\).

  5. Fit \(\beta_{age}\) on the diabetes dataset using Gram-Schmidt, as described in that section. Verify that this is the same as what you get when you perform a regression.

  6. Implement the QR algorithm.

  7. Make a table with covariances between all the variables. Comment on the results. (Seaborn is my preferred package for this.)

  8. Search Kaggle for the diabetes dataset to see how others are using this.

5 [WIP] Symmetric Matrices

5.1 Symmetric Matrices

5.2 Normal matrices and simultaneous diagonalizability

5.3 Inner products

5.4 Spectral Theorem

5.5 Taylor series and smooth functions

5.6 Functions on Non-symmetric matrices

6 [WIP] Definite Matrices

6.1 Definite matrices

6.2 Quadratic programming

6.3 Partial second derivative test

6.4 Ridge Regression

7 [WIP] Stats

7.1 Covariance matrix

7.2 Multi-dimensional normal

7.3 Moment generating function

7.4 Markov Chain

7.5 PageRank

7.6 Linear discriminant analysis

8 [WIP] Applications

8.1 Polynomial Regression

8.2 Logistic

8.3 GLM

8.4 LOESS

8.5 SVM

9 Review

9.1 Orthogonal Matrices

This section is under construction. In the meantime, https://www.nagwa.com/en/explainers/476190725258/ is a good source to review this topic.

Theorem 9.1. A square matrix has orthonormal rows if and only if it has orthonormal columns. We call such matrices orthogonal.

Theorem 9.2. The inverse of an orthogonal matrix is its transpose. \(Q^{-1}=Q^T\).

9.2 Spans and Bases

This section is under construction. In the meantime, https://math.libretexts.org/Bookshelves/Linear_Algebra/Book%3A_Linear_Algebra_(Schilling_Nachtergaele_and_Lankham)/05%3A_Span_and_Bases is a good source.

Theorem 9.3. If \(n\) vectors, \(\left\{|v_i\rangle\right\}_i\) in \(\mathbb R^n\) do not span \(\mathbb R^n\), then there is some linear relationship among the vectors. That is, \(\sum_i\lambda_i|v_i\rangle=0\), where not all \(\lambda_i=0\). Similarly, if \(m\) vectors belong to an \(m\)-dimensional subspace but do not span it, then there’s a linear relationship.

9.3 Change of Basis

This section is under construction. In the meantime, https://eli.thegreenplace.net/2015/change-of-basis-in-linear-algebra/ is a good source.

9.4 Bra-Ket Notation

This section is under construction. In the meantime, https://www.mathpages.com/home/kmath638/kmath638.htm is a good source.

9.5 Misc

This section is under construction.

Theorem 9.4. For any set of orthonormal vectors, \(\left\{|v_i\rangle\right\}_{i=1}^k\) in \(\mathbb R^n\), there exists a basis, \(\left\{|w_i\rangle\right\}_{i=1}^n\) such that \(|v_i\rangle=|w_i\rangle\) for \(i=1..k\). We call this Extending the Basis.

Theorem 9.5. For any subspace \(W\) and any vector \(|v\rangle\), \(|v\rangle\) can be written uniquely as \(|v\rangle=|v_\parallel\rangle+|v_\bot\rangle\), where \(|v_\parallel\rangle\in W\) and \(\langle w|v_\bot\rangle=0\) for all \(w\in W\). Moreover, \(|v_\parallel\rangle\) is the unique vector in \(W\) that minimizes \(\| |v\rangle-|w\rangle \|\) over \(|w\rangle\in W\), with minimum value \(\|v_\bot\|\).

Theorem 9.6. Given two vectors, \(\vec u\), \(\vec v\), with angle between them \(\theta\), the inner product can be expressed as: \[\langle u|v\rangle = \|\vec u\|\cdot\|\vec v\|\cdot\cos\theta\]


  1. By multiset we mean a collection where we care about the number of times each element appears, but we don’t care about the order.↩︎

  2. The keen-eyed mathematician may object that maximums don’t always exist. But because we’ve restricted to unit vectors, the domain is compact. Therefore maximums exist.↩︎

  3. The sheer size of the matrix presents technical challenges that are beyond the scope of this book.↩︎

  4. "Default" means failure to make payments owed.↩︎

  5. Some readers will recognize the equation as the z-score used for normal distribution, from which normalization gets its name.↩︎

  6. Don’t try to make sense of the name.↩︎