Data Analysis

We have seen that a vector space can have many different bases and that we can change between them via matrix multiplication. But is there ever a practical need to change basis?

Choosing a good basis to work with is central to some of the popular compression algorithms for audio and images. For example, wavelet bases are widely used in digital signal processing. Similarly, the Fast Fourier Transform is an efficient algorithm for computing the discrete Fourier transform, a change of basis that expresses a signal in terms of the Fourier basis.

Changing basis is also frequently employed in statistical analysis. In the present age of big data, linear dimension reduction through Principal Component Analysis is a common technique.

We now illustrate the use of changing basis with a simple example. Consider the following table listing the number of hours per week Ann, Bob, Cam, Dan, and Eve spend on studying and their midterm marks for algebra, calculus, and statistics.

Name	Ann	Bob	Cam	Dan	Eve
Hours	3	8.5	6	5	7
Algebra	45	100	75	65	85
Calculus	56	100	80	72	88
Statistics	48	81	66	60	72

In real life, we could have hundreds of students and many more subjects. But for the sake of illustration, let us pretend that the above table is too large to load into the computer for analysis. Since we want to analyse the data as a whole, having only a subset of records at any given time is not an option. What can we do?

Let us first express the data as tuples in \(\mathbb{R}^4\) as follows:

Name	Ann	Bob	Cam	Dan	Eve
Data	\(\begin{bmatrix}3\\45\\56\\48\end{bmatrix}\)	\(\begin{bmatrix}8.5\\100\\100\\81\end{bmatrix}\)	\(\begin{bmatrix}6\\75\\80\\66\end{bmatrix}\)	\(\begin{bmatrix}5\\65\\72\\60\end{bmatrix}\)	\(\begin{bmatrix}7\\85\\88\\72\end{bmatrix}\)

So far, we have not reduced the amount of data. We have just rewritten the table in a different way. But now observe that if \(d = \begin{bmatrix} 1\\10\\8\\6\end{bmatrix}\) and \(v = \begin{bmatrix} 0 \\ 15 \\ 32 \\ 30\end{bmatrix}\), then Ann's tuple is given by \(v + 3d\), Bob's by \(v+8.5d\), Cam's by \(v + 6 d\), Dan's by \(v + 5d\), and Eve's by \(v+7d\). Hence, the scalar multiples of \(d\) capture all the variability of the data.
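
The claim that every tuple has the form \(v + hd\), where \(h\) is the number of hours, is easy to check by machine. The following is a minimal Python sketch using the values from the first table; the names `reconstruct` and `all_match` are introduced here only for illustration.

```python
# Hours studied per week, taken from the first table.
hours = {"Ann": 3, "Bob": 8.5, "Cam": 6, "Dan": 5, "Eve": 7}

# Marks in the order (algebra, calculus, statistics), from the same table.
marks = {
    "Ann": (45, 56, 48),
    "Bob": (100, 100, 81),
    "Cam": (75, 80, 66),
    "Dan": (65, 72, 60),
    "Eve": (85, 88, 72),
}

d = (1, 10, 8, 6)
v = (0, 15, 32, 30)


def reconstruct(h):
    """Return v + h*d as a tuple in R^4."""
    return tuple(vi + h * di for vi, di in zip(v, d))


# Each student's tuple is (hours, algebra, calculus, statistics).
all_match = all(
    reconstruct(h) == (h, *marks[name]) for name, h in hours.items()
)
print(all_match)  # True
```

Storing \(v\), \(d\), and one scalar per student takes \(4 + 4 + 5 = 13\) numbers instead of the \(20\) in the table, and the saving grows with the number of students.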

Suppose that \(B = \{v_1, v_2, v_3, d\}\) is a basis for \(\mathbb{R}^4\). We know that every vector in \(\mathbb{R}^4\) can be written as a linear combination of \(v_1,v_2,v_3\), and \(d\). But what is special about the tuples in the above table is that if \(u\) is any of these tuples, then the linear combination of \(v_1,v_2,v_3\), and \(d\) expressing \(u - v\) will be of the form \(0 v_1 + 0 v_2 + 0 v_3 + \alpha d\) for some nonzero \(\alpha\). In essence, with respect to the basis \(B\), only one out of the four dimensions is relevant!
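
To see this with actual numbers, here is a small Python sketch that computes the coordinates of \(u - v\) with respect to one possible basis \(B\). The text leaves \(v_1, v_2, v_3\) unspecified, so the choice below (three standard basis vectors) is an assumption made purely for illustration, and `coordinates` is a hypothetical helper that solves the linear system by Gauss-Jordan elimination with exact fractions. Eve's tuple uses 7 hours, as in the first table.

```python
from fractions import Fraction


def coordinates(basis, w):
    """Solve c[0]*basis[0] + ... + c[n-1]*basis[n-1] = w exactly.

    Assumes `basis` really is a basis of R^n (otherwise no pivot is found).
    """
    n = len(w)
    # Augmented matrix [B | w]; the basis vectors are the columns of B.
    rows = [
        [Fraction(basis[j][i]) for j in range(n)] + [Fraction(w[i])]
        for i in range(n)
    ]
    for col in range(n):
        # Find a row with a nonzero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so that the pivot equals 1.
        p = rows[col][col]
        rows[col] = [x / p for x in rows[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


d = (1, 10, 8, 6)
v = (0, 15, 32, 30)

# Assumed choice of v1, v2, v3. Together with d they form a basis
# because the first entry of d is nonzero.
B = [(0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), d]

students = {
    "Ann": (3, 45, 56, 48),
    "Bob": (8.5, 100, 100, 81),
    "Cam": (6, 75, 80, 66),
    "Dan": (5, 65, 72, 60),
    "Eve": (7, 85, 88, 72),
}

coords = {}
for name, u in students.items():
    w = tuple(ui - vi for ui, vi in zip(u, v))  # u - v
    coords[name] = coordinates(B, w)
    print(name, [str(c) for c in coords[name]])
```

For every student the first three coordinates are zero and the last one is the number of hours, so a single number per student suffices once \(v\) and \(B\) are known.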

Remark. In real life, we almost never have data like those above that permit a perfect fit using a simple model. We are more likely to have data that resemble the following instead:

Name	Ann	Bob	Cam	Dan	Eve
Hours	3	8.5	6	5	7
Algebra	47	100	74	64	87
Calculus	55	99	80	73	89
Statistics	50	80	65	59	73

Even though we cannot use \(v\) and \(d\) as above to perfectly match the data, we can still get very close. For practical purposes, an approximation is often good enough. As George E.P. Box once wrote, “Essentially, all models are wrong, but some are useful.” The key to doing proper statistical analysis is coming up with an appropriate model for the data at hand and determining whether it is a good approximation.
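
One way to find such an approximation, sketched below in Python, is to fit each subject's marks as a straight-line function of the hours by ordinary least squares. This is only one possible approach and is an assumption of this sketch, not a method prescribed by the text; Principal Component Analysis would be another. The names `fit_line`, `v_fit`, `d_fit`, and `max_abs_error` are introduced here for illustration.

```python
# Data from the second (imperfect) table.
hours = [3, 8.5, 6, 5, 7]
marks = {
    "Algebra": [47, 100, 74, 64, 87],
    "Calculus": [55, 99, 80, 73, 89],
    "Statistics": [50, 80, 65, 59, 73],
}


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the model y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# The first coordinate is the hours themselves, so it is matched
# exactly by taking 0 in v and 1 in d.
v_fit = [0.0]
d_fit = [1.0]
for ys in marks.values():
    a, b = fit_line(hours, ys)
    v_fit.append(a)
    d_fit.append(b)

# Largest gap between an actual mark and the fitted value v_fit + h*d_fit.
max_abs_error = max(
    abs(v_fit[i + 1] + h * d_fit[i + 1] - y)
    for i, ys in enumerate(marks.values())
    for h, y in zip(hours, ys)
)

print("v ~", [round(x, 2) for x in v_fit])
print("d ~", [round(x, 2) for x in d_fit])
print("largest error in any mark:", round(max_abs_error, 2))
```

Comparing the printed `d_fit` with \(d\) from the exact example, and looking at the largest error, gives a rough sense of how well the one-dimensional model still describes the data.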

Exercise

Give an orthogonal basis for \(\mathbb{R}^4\) that contains the vector \(d = \begin{bmatrix} 1 \\ 10 \\ 8 \\ 6\end{bmatrix}\).