We have seen how we can represent a black-and-white bitmap image using a matrix. Here, we outline a technique for facial recognition described by Matthew Turk and Alex Pentland in their paper “Eigenfaces for Recognition” (1991).
To recognize a face, a computer needs to “learn” from different instances of the face of a person. (One can think of the process as learning by looking at examples.) Therefore, the first step is to obtain a training set of images of the person. To illustrate the ideas described in the paper, let us consider the problem of recognizing the digit “2”.
Suppose we start with \(k\) images of “2”. This set of images is called the initial training set. For the method to work, all the images need to have the same dimensions. In addition, we “vectorize” each of the matrices representing the images by stacking up the columns one after another. For example, vectorizing the matrix \(\begin{bmatrix} 1 & 3 & 5\\ 2 & 4 & 6\end{bmatrix}\) gives the vector \(\begin{bmatrix} 1 \\2 \\3\\4\\5\\6\end{bmatrix}\) in \(\mathbb{R}^6\).
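In NumPy, for instance, this column stacking is just a flatten in column-major (Fortran) order; a quick sketch using the example matrix above:

```python
import numpy as np

# The 2x3 example matrix from above.
M = np.array([[1, 3, 5],
              [2, 4, 6]])

# Stacking the columns one after another is column-major (Fortran) order.
v = M.flatten(order="F")
print(v)  # [1 2 3 4 5 6]
```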
Let \(v_1,\ldots,v_k\) denote the vectorizations of the matrices of the images. Let \(\bar{u}\) denote the mean of these vectors, namely \(\bar{u} = \frac{1}{k} (v_1+\cdots + v_k)\).
Let \(A\) be the matrix whose \(i\)th column is given by \(v_i - \bar{u}\). (In other words, each column corresponds to the difference between an image and the mean.) To test if a freshly acquired image is a “2”, let \(v\) be the vectorization of the matrix for the image and find a least-squares solution to \(A x = b\) where \(b = v - \bar{u}\) by computing \(\hat{x} = A^+b\). Then \(v' = A\hat{x} + \bar{u}\) gives an approximation to our image according to our linear model. If \(\|v - v'\|\) is less than some threshold, we say that it is a “2” and we add the image to the training set. Otherwise, we reject it.
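Here is a minimal sketch of the whole procedure in NumPy. The names build_model, is_match, and the threshold tau are ours for illustration; np.linalg.lstsq computes the same minimum-norm least-squares solution as \(A^+ b\).

```python
import numpy as np

def build_model(train):
    """Form the mean vector and the matrix A from a list of image matrices."""
    V = np.column_stack([m.flatten(order="F") for m in train])
    u_bar = V.mean(axis=1)
    A = V - u_bar[:, None]                 # i-th column is v_i - u_bar
    return A, u_bar

def is_match(A, u_bar, img, tau):
    """Accept img if the least-squares reconstruction error is below tau."""
    v = img.flatten(order="F")
    b = v - u_bar
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)  # x_hat = A^+ b
    v_prime = A @ x_hat + u_bar
    return np.linalg.norm(v - v_prime) < tau
```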
The above outlines the idea behind the method. However, there are a number of implementation details that need to be considered.
In practice, a database for facial recognition may contain images of thousands of individuals, not just one. As a result, one needs to create an image class for each individual. If there are 10,000 individuals against which you want to test a new image, you will need to solve, in the worst case, 10,000 least-squares problems. When no match is found, one may either discard the image or place it in a new image class, signifying a new individual to be added to the database.
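Testing a new image against the whole database is then a loop over the per-class models, reusing is_match from the sketch above; the early exit on the first match and the handling of a non-match are design choices, not part of the method.

```python
def classify(models, img, tau):
    """models: list of (A, u_bar) pairs, one per image class.
    Returns the index of the first matching class, or None if no match."""
    for i, (A, u_bar) in enumerate(models):
        if is_match(A, u_bar, img, tau):   # is_match from the sketch above
            return i
    return None                            # no match: discard or start a new class
```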
Having to solve a large number of least-squares problems is a major concern. If each image is merely \(256\times 256\), the matrix \(A\) will already have \(65,536\) rows. Furthermore, the training set can become large. These two factors could make obtaining the pseudoinverse \(A^+\) computationally prohibitive if we are not careful.
Each image class essentially corresponds to the column space of the matrix \(A\) formed from the images in its training set. If the image class contains 300 images, each having dimension \(720\times 1280\), the matrix \(A\) will have 921,600 rows and 300 columns. Hence, in the full singular value decomposition of \(A\) given by \(U\Sigma V^\mathsf{T}\), \(U\) is \(921,600\times 921,600\), far too large to compute or store. Fortunately, there is no need to obtain the full \(U\).
First, obtain the positive eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_r > 0\) of \(A^\mathsf{T}A\) (a mere \(300\times 300\) matrix) together with a matrix \(V'\) whose orthonormal columns are corresponding eigenvectors. Let \(\Sigma'\) be the \(r\times r\) diagonal matrix whose diagonal entries are the singular values \(\sigma_i = \sqrt{\lambda_i}\), and set \(U' = AV'\Sigma'^{-1}\), which has orthonormal columns. Then \(A = U' \Sigma' V'^\mathsf{T}\), and \(U'\) has at most as many columns as there are training images.
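A sketch of this computation in NumPy, with the function name thin_svd and the tolerance eps our own choices; note that \(U'\) is obtained as \(AV'\Sigma'^{-1}\), exactly as above.

```python
import numpy as np

def thin_svd(A, eps=1e-10):
    """Reduced SVD of A computed from the small k x k matrix A^T A."""
    G = A.T @ A                                # k x k, cheap when k is small
    w, V = np.linalg.eigh(G)                   # ascending eigenvalues, orthonormal V
    keep = w > eps                             # discard zero (and tiny) eigenvalues
    w, V = w[keep][::-1], V[:, keep][:, ::-1]  # reorder to descending
    s = np.sqrt(w)                             # singular values
    U = (A @ V) / s                            # U' = A V' Sigma'^{-1}
    return U, s, V                             # A = U @ np.diag(s) @ V.T
```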
We can reduce the size of \(U'\) further, at the cost of a little accuracy, by keeping only the largest singular values, say the largest 40. We form \(U''\) and \(V''\) from the first 40 columns of \(U'\) and \(V'\), respectively, and take \(\Sigma''\) to be the top-left \(40\times 40\) submatrix of \(\Sigma'\). Of course, \(U''\Sigma'' {V''}^\mathsf{T}\) is no longer \(A\), but it is a closest rank-40 approximation to \(A\) under the Frobenius norm.
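Truncation amounts to keeping the leading columns. As a self-contained sketch, NumPy's np.linalg.svd with full_matrices=False computes the same reduced decomposition directly, and slicing gives the rank-40 factors:

```python
import numpy as np

def best_rank_r(A, r=40):
    """U'', the singular values of Sigma'', and V'' for the closest
    rank-r approximation of A in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # singular values descending
    return U[:, :r], s[:r], Vt[:r, :].T

# Reconstruct the rank-40 approximation:
# U2, s2, V2 = best_rank_r(A, 40)
# A_40 = U2 @ np.diag(s2) @ V2.T
```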
Note that the columns of \(U'\) form an orthonormal basis of the column space of \(A\). But it is not just any orthonormal basis: statistically, the columns are ordered in decreasing order of the variability of the data they capture. One can imagine that each column corresponds to a feature that varies across the images. Since the columns of \(U'\) are orthonormal, these features are uncorrelated; that is, the variability of the feature corresponding to one column is not tied to the variability of the feature corresponding to another. In statistical terms, the columns of \(U'\) are the result of performing principal component analysis on the image data. Turk and Pentland refer to the images corresponding to the columns of \(U'\) as eigenfaces.
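To actually look at an eigenface, one reverses the vectorization by reshaping a column of \(U'\) back into an image; a one-line sketch, assuming the \(720\times 1280\) images from the earlier example and U as returned by thin_svd above:

```python
# Reverse the column-stacking to view the first column of U' as an image.
eigenface = U[:, 0].reshape((720, 1280), order="F")
```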
We have now come to the end of these lecture notes. The application of linear algebra to facial recognition encompasses all the main ideas we have seen in these notes. In particular, the singular value decomposition is, in the opinion of the present author, the most elegant result one encounters in an introductory course in linear algebra. Not only is it an important mathematical result; its utility is also far-reaching.