In statistics, one often encounters linear models for data. That is, each data point in $$\mathbb{R}^n$$ is modelled approximately as $$\bar{u} + d$$ for some fixed $$\bar{u} \in \mathbb{R}^n$$ and $$d$$ in the column space of some matrix $$A \in \mathbb{R}^{n\times k}$$. Typically, $$\bar{u}$$ is the mean of all the data points.

As such models almost never fit the data perfectly, there is usually an error in the approximation given by the model. Say $$u'$$ is the actual vector in $$\mathbb{R}^n$$ representing a particular data point and the approximation from the model is $$\hat{u} = \bar{u} + d'$$. The error is usually measured as $$\|u' - \hat{u}\|$$, where $$\|x\|$$ is defined as $$\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$ for $$x \in \mathbb{R}^n$$. The least-squares approximation of $$u'$$ is an $$\hat{u}$$ that minimizes $$\|u' - \hat{u}\|$$. Note that the norm is zero if and only if we have a perfect fit; that is, there exists $$d$$ such that $$u' = \bar{u} + d$$.
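As a small numerical illustration, the error of one candidate approximation can be computed directly from the definition of the norm. The data point, mean, and model matrix below are made-up values, not taken from any particular dataset:

```python
import numpy as np

# A hypothetical linear model in R^3: mean u_bar plus a vector in the
# column space of A (here A has a single column, so the model is a line).
u_bar = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0], [1.0], [0.0]])

u_prime = np.array([2.0, 3.5, 3.2])      # an observed data point
u_hat = u_bar + A @ np.array([1.2])      # one candidate approximation

# The error is the Euclidean norm sqrt(x_1^2 + ... + x_n^2).
error = np.linalg.norm(u_prime - u_hat)
print(error)
```

Here the error is nonzero, since $$u'$$ does not lie exactly on the modelled line.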

## Least-squares problem

Given a system $$Ax = b$$, a least-squares solution is an $$x$$ such that $$\|Ax -b\|$$ is minimized.
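Before deriving a formula by hand, it may help to see the problem solved numerically. The sketch below uses numpy's built-in least-squares solver on a small overdetermined system (the matrix and right-hand side are assumed example values):

```python
import numpy as np

# An overdetermined system: 3 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

# numpy returns an x minimizing ||Ax - b||.
x, residual, rank, sigma = np.linalg.lstsq(A, b, rcond=None)
print(x, np.linalg.norm(A @ x - b))
```

Internally, solvers like this rely on exactly the kind of SVD-based reasoning developed next.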

This is a nonlinear optimization problem. But we can make use of an SVD of $$A$$ and some algebraic tricks to solve it without resorting to calculus.

Observe that minimizing $$\|Ax -b\|$$ is the same as minimizing $$\|Ax -b\|^2$$, since $$t \mapsto t^2$$ is increasing on the nonnegative reals.

Consider the SVD of $$A$$ given by $$U\Sigma V^\mathsf{T}$$. Let $$r$$ denote the rank of $$A$$. Let $$y = V^\mathsf{T} x$$.

Then \begin{eqnarray*} \|Ax -b\|^2 & = & (Ax - b)^\mathsf{T} (Ax-b) \\ & = & (U\Sigma y - b)^\mathsf{T} (U\Sigma y-b) \\ & = & (y^\mathsf{T}\Sigma^\mathsf{T} U^\mathsf{T} - b^\mathsf{T})(U\Sigma y-b) \\ & = & (y^\mathsf{T}\Sigma^\mathsf{T} U^\mathsf{T})(U\Sigma y) - 2y^\mathsf{T}(\Sigma^\mathsf{T} U^\mathsf{T}b) + b^\mathsf{T} b \\ & = & y^\mathsf{T}\Sigma^\mathsf{T} \Sigma y - 2y^\mathsf{T}(\Sigma^\mathsf{T} U^\mathsf{T}b) + b^\mathsf{T} b \\ \end{eqnarray*} Here, the second-to-last step uses that $$b^\mathsf{T} U\Sigma y$$ is a scalar and hence equals its own transpose $$y^\mathsf{T}\Sigma^\mathsf{T} U^\mathsf{T} b$$, and the last step uses $$U^\mathsf{T} U = I$$.
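This chain of equalities can be checked numerically for a randomly generated instance (the matrix sizes and the random seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

# Full SVD: U is 5x5 orthogonal, Sigma is 5x3, Vt is 3x3 orthogonal.
U, s, Vt = np.linalg.svd(A)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
y = Vt @ x

lhs = np.linalg.norm(A @ x - b) ** 2
rhs = y @ Sigma.T @ Sigma @ y - 2 * y @ (Sigma.T @ U.T @ b) + b @ b
print(abs(lhs - rhs))  # agrees up to floating-point rounding
```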

Observe that $$\Sigma^\mathsf{T} \Sigma$$ is an $$n \times n$$ diagonal matrix whose $$i$$th diagonal entry is $$\sigma_i^2$$ for $$i \leq r$$ and 0 for $$i\gt r$$.

Thus, letting $$d = \Sigma^\mathsf{T} U^\mathsf{T}b$$, we can write $$y^\mathsf{T}\Sigma^\mathsf{T} \Sigma y - 2y^\mathsf{T}(\Sigma^\mathsf{T} U^\mathsf{T}b)$$ as $\sum_{i=1}^r (\sigma_i^2 y_i^2 - 2d_iy_i) =\sum_{i=1}^r \sigma_i^2 \left(y_i^2 - 2\frac{d_i}{\sigma_i^2}y_i\right).$

Completing the square gives $\sum_{i=1}^r \left(\sigma_i^2\left(y_i - \frac{d_i}{\sigma_i^2}\right)^2 - \left(\frac{d_i}{\sigma_i}\right)^2\right) .$

Hence, the minimum occurs when $$y_i = \frac{d_i}{\sigma_i^2}$$ for $$i = 1,\ldots, r$$, or equivalently, since $$d_i = \sigma_i (U^\mathsf{T}b)_i$$, when $$y_i = \frac{1}{\sigma_i}(U^\mathsf{T}b)_i$$ for $$i = 1,\ldots, r$$. Note that $$y_i$$ for $$i \gt r$$ does not appear in the sum, so we may set $$y_i = 0$$ for $$i \gt r$$.
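The minimizing $$y$$ can be built directly from this formula and converted back to $$x$$. As a sanity check, the sketch below compares the result against numpy's least-squares solver on a random full-column-rank instance (the dimensions and seed are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # full column rank almost surely, so r = 3
b = rng.standard_normal(4)

U, s, Vt = np.linalg.svd(A)
r = len(s)

# y_i = (U^T b)_i / sigma_i for i <= r; the remaining entries are set to 0
# (here y already has length r, so there is nothing left to pad).
y = (U.T @ b)[:r] / s
x = Vt.T @ y                      # x = V y, since y = V^T x

# Compare with numpy's built-in least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))
```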

Since $$\sigma_i = 0$$ for $$i \gt r$$, we can say that a solution is given by $y = \Sigma^+ U^\mathsf{T} b$ where $$\Sigma^+$$ is obtained from $$\Sigma$$ by replacing every nonzero diagonal entry by its multiplicative inverse and transposing the resulting matrix. ($$\Sigma^+$$ is known as the pseudoinverse of $$\Sigma$$.) Since $$y = V^\mathsf{T}x$$ and $$V$$ is an orthogonal matrix, we get $x = V(\Sigma^+ U^\mathsf{T} b) = (V\Sigma^+ U^\mathsf{T}) b.$

The matrix $$V\Sigma^+ U^\mathsf{T}$$ is called the pseudoinverse of $$A$$ and is denoted by $$A^+$$.
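This construction of $$A^+$$ can be carried out explicitly and compared with numpy's `pinv`, which computes the same pseudoinverse. The matrix below is a random example:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A)

# Sigma is 5x3, so Sigma^+ is 3x5: invert each nonzero singular value
# and transpose.
Sigma_plus = np.zeros((3, 5))
Sigma_plus[:3, :3] = np.diag(1.0 / s)

A_plus = Vt.T @ Sigma_plus @ U.T
print(np.allclose(A_plus, np.linalg.pinv(A)))
```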

In other words, a least-squares solution to $$Ax = b$$ is given by $$A^+ b$$. Note that there can be many least-squares solutions: as we saw above, $$y_i$$ could be set to anything for $$i \gt r$$. However, it is known that setting them all to 0 yields the least-squares solution $$x$$ of minimum norm $$\|x\|$$.
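The non-uniqueness and the minimum-norm property can both be seen on a rank-deficient example. The matrix and null-space vector below are chosen by hand for illustration:

```python
import numpy as np

# A rank-1 matrix: infinitely many least-squares solutions exist.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
b = np.array([1.0, 1.0])

x_min = np.linalg.pinv(A) @ b     # A^+ b, the minimum-norm solution

# Adding any null-space vector of A leaves the residual ||Ax - b|| unchanged.
n = np.array([2.0, -1.0])         # A @ n = 0
x_other = x_min + n
assert np.isclose(np.linalg.norm(A @ x_min - b),
                  np.linalg.norm(A @ x_other - b))

print(np.linalg.norm(x_min) < np.linalg.norm(x_other))
```

Both vectors achieve the same (minimal) residual, but $$A^+ b$$ has the smaller norm.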