In statistics, one often encounters linear models for data. That is, each data point in \(\mathbb{R}^n\) is modelled approximately as \(\bar{u} + d\) for some fixed \(\bar{u} \in \mathbb{R}^n\) and some \(d\) in the column space of a matrix \(A \in \mathbb{R}^{n\times m}\). Typically, \(\bar{u}\) is the mean of all the data points.
As such models almost never fit the data perfectly, there is usually an error in the approximation given by the model. Say \(u'\) is the actual vector in \(\mathbb{R}^n\) representing a particular data point and the approximation from the model is \(\hat{u} = \bar{u} + d'\). The error is usually measured as \(\|u' - \hat{u}\|\), where \(\|x\|\) is defined as \(\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}\) for \(x \in \mathbb{R}^n\). The least-squares approximation of \(u'\) is a \(\hat{u}\) that minimizes \(\|u' - \hat{u}\|\). Note that the norm is zero if and only if we have a perfect fit; that is, there exists \(d\) in the column space of \(A\) such that \(u' = \bar{u} + d\).
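For concreteness, here is a minimal numerical sketch of how such an error would be computed; the vectors and the one-column matrix \(A\) below are made up purely for illustration.

```python
import numpy as np

# Made-up example: model a data point u_actual as u_bar + d, where d lies
# in the column space of A (here a single model direction).
u_actual = np.array([2.0, 3.1, 4.9])        # the observed data point u'
u_bar    = np.array([2.0, 3.0, 4.0])        # mean of all the data points
A        = np.array([[1.0], [0.0], [1.0]])  # columns span the model subspace

d     = A @ np.array([0.5])                  # some d in the column space of A
u_hat = u_bar + d                            # the model's approximation of u'
error = np.linalg.norm(u_actual - u_hat)     # ||u' - u_hat||
print(error)                                 # roughly 0.65; zero would mean a perfect fit
```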
Given a system \(Ax = b\), a least-squares solution is an \(x\) such that \(\|Ax -b\|\) is minimized.
This is a nonlinear optimization problem. However, we can make use of an SVD of \(A\) and some algebraic manipulation to solve it without resorting to calculus.
Observe that minimizing \(\|Ax -b\|\) is the same as minimizing \(\|Ax -b\|^2\), since squaring is an increasing function on nonnegative values.
Consider the SVD of \(A\) given by \(U\Sigma V^\mathsf{T}\). Let \(r\) denote the rank of \(A\). Let \(y = V^\mathsf{T} x\).
Then \begin{eqnarray*} \|Ax -b\|^2 & = & (Ax - b)^\mathsf{T} (Ax-b) \\ & = & (U\Sigma y - b)^\mathsf{T} (U\Sigma y-b) \\ & = & (y^\mathsf{T}\Sigma^\mathsf{T} U^\mathsf{T} - b^\mathsf{T})(U\Sigma y-b) \\ & = & (y^\mathsf{T}\Sigma^\mathsf{T} U^\mathsf{T})(U\Sigma y) - 2y^\mathsf{T}(\Sigma^\mathsf{T} U^\mathsf{T}b) + b^\mathsf{T} b \\ & = & y^\mathsf{T}\Sigma^\mathsf{T} \Sigma y - 2y^\mathsf{T}(\Sigma^\mathsf{T} U^\mathsf{T}b) + b^\mathsf{T} b. \end{eqnarray*} Here the two cross terms were combined because \(b^\mathsf{T} U\Sigma y\) is a scalar and hence equals its own transpose \(y^\mathsf{T}\Sigma^\mathsf{T} U^\mathsf{T} b\); the last step uses \(U^\mathsf{T} U = I\).
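The identity just derived can be sanity-checked numerically. Below is a small sketch, using random \(A\), \(x\), and \(b\) purely for verification, comparing both sides:

```python
import numpy as np

# Check that ||Ax - b||^2 = y^T Sigma^T Sigma y - 2 y^T (Sigma^T U^T b) + b^T b,
# where A = U Sigma V^T and y = V^T x.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A)            # full SVD: U is m x m, Vt is n x n
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)   # Sigma is m x n with the singular values on its diagonal

y = Vt @ x
lhs = np.linalg.norm(A @ x - b) ** 2
rhs = y @ Sigma.T @ Sigma @ y - 2 * y @ (Sigma.T @ U.T @ b) + b @ b
print(np.isclose(lhs, rhs))            # True (up to rounding error)
```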
Observe that \(\Sigma^\mathsf{T} \Sigma\) is a square diagonal matrix whose \(i\)th diagonal entry is \(\sigma_i^2\) for \(i \leq r\) and 0 for \(i\gt r\).
Thus, letting \(d = \Sigma^\mathsf{T} U^\mathsf{T}b\) (so that \(d_i = \sigma_i (U^\mathsf{T}b)_i\) for \(i \leq r\) and \(d_i = 0\) for \(i \gt r\)), we can write \(y^\mathsf{T}\Sigma^\mathsf{T} \Sigma y - 2y^\mathsf{T}(\Sigma^\mathsf{T} U^\mathsf{T}b)\) as \[\sum_{i=1}^r (\sigma_i^2 y_i^2 - 2d_iy_i) =\sum_{i=1}^r \sigma_i^2 \left(y_i^2 - 2\frac{d_i}{\sigma_i^2}y_i\right).\]
Completing the square gives \[\sum_{i=1}^r \left(\sigma_i^2\left(y_i - \frac{d_i}{\sigma_i^2}\right)^2 - \left(\frac{d_i}{\sigma_i}\right)^2\right) .\]
Hence, the minimum occurs when \(y_i = \frac{d_i}{\sigma_i^2}\) for \(i = 1,\ldots, r\), or equivalently, \(y_i = \frac{1}{\sigma_i}(U^\mathsf{T}b)_i\) for \(i = 1,\ldots, r\). Since the entries \(y_i\) with \(i \gt r\) do not appear in the expression at all, we may set \(y_i = 0\) for \(i \gt r\).
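For example, if \(\sigma_i = 2\) and \(d_i = 4\), the \(i\)th term of the sum is \[4y_i^2 - 8y_i = 4(y_i - 1)^2 - 4,\] which attains its minimum value \(-4\) at \(y_i = 1 = \frac{d_i}{\sigma_i^2}\).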
Since \(\sigma_i = 0\) for \(i \gt r\), we can say that a solution is given by \[ y = \Sigma^+ U^\mathsf{T} b\] where \(\Sigma^+\) is obtained from \(\Sigma\) by replacing every nonzero diagonal entry by its multiplicative inverse and transposing the resulting matrix. (\(\Sigma^+\) is known as the pseudoinverse of \(\Sigma\).) Since \(y = V^\mathsf{T}x\) and \(V\) is an orthogonal matrix, we get \[x = V(\Sigma^+ U^\mathsf{T} b) = (V\Sigma^+ U^\mathsf{T}) b.\]
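Here is a small NumPy sketch of this construction (random \(A\) and \(b\), purely illustrative), building \(\Sigma^+\) exactly as described and comparing \(V\Sigma^+ U^\mathsf{T} b\) against NumPy's built-in pseudoinverse:

```python
import numpy as np

# Build Sigma^+ by inverting the nonzero diagonal entries of Sigma and
# transposing, then form the least-squares solution x = V Sigma^+ U^T b.
rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A)                   # full SVD of A
r = int(np.sum(s > 1e-12))                    # numerical rank of A
Sigma_plus = np.zeros((n, m))                 # note the transposed shape: n x m
Sigma_plus[:r, :r] = np.diag(1.0 / s[:r])     # invert the nonzero singular values

x = Vt.T @ Sigma_plus @ U.T @ b               # Vt.T is V, so x = V Sigma^+ U^T b
print(np.allclose(x, np.linalg.pinv(A) @ b))  # True: agrees with NumPy's A^+ b
```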
The matrix \(V\Sigma^+ U^\mathsf{T}\) is called the pseudoinverse of \(A\) and is denoted by \(A^+\).
In other words, a least-squares solution to \(Ax = b\) is given by \(A^+ b\). Note that there can be many least-squares solutions; as we saw above, \(y_i\) could be set to anything for \(i \gt r\). However, setting them all to 0 gives a solution \(x\) with minimum norm: since \(V\) is orthogonal, \(\|x\| = \|Vy\| = \|y\|\), and \(\|y\|\) is smallest when \(y_i = 0\) for all \(i \gt r\).
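The following sketch (with a contrived rank-deficient \(A\)) illustrates both points: shifting a least-squares solution by a vector in the null space of \(A\) leaves the residual unchanged, while \(A^+ b\) has the smallest norm among such solutions.

```python
import numpy as np

# A rank-1 matrix, so Ax = b has infinitely many least-squares solutions.
A = np.array([[1.0, 2.0, 2.0],
              [2.0, 4.0, 4.0],
              [0.0, 0.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])

x_min    = np.linalg.pinv(A) @ b          # the minimum-norm least-squares solution A^+ b
null_dir = np.array([2.0, -1.0, 0.0])     # A @ null_dir = 0
x_other  = x_min + null_dir               # another least-squares solution

print(np.linalg.norm(A @ x_min - b),      # same residual norm ...
      np.linalg.norm(A @ x_other - b))    # ... for both solutions
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))  # True: A^+ b has smaller norm
```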