In the last video, we went through the steps of the PCA algorithm.

In order to do PCA,

we need to compute the data covariance matrix.

In D dimensions the data covariance matrix is a D by D matrix.

If D is very high,

so in very high dimensions,

then computing the eigenvalues and eigenvectors of this matrix can be quite expensive.

It scales cubically in the number of rows and columns,

in this case, the number of dimensions.

In this video, we provide a solution to this problem for

the case that we have substantially fewer data points than dimensions.

Assume we have a dataset given as

X1 up to XN in RD,

and we assume that the data is centered so that it has mean zero.

Then the data covariance matrix is given as S equal to

1 over N times X transpose times X,

where we define X to be the matrix that consists of X1 transpose up

to XN transpose and that is an N by D matrix.
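As a small numerical sketch of this setup (assuming NumPy; the sizes N = 5 and D = 100 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 100                 # far fewer data points than dimensions

X = rng.normal(size=(N, D))   # rows are the data points x_1^T, ..., x_N^T
X = X - X.mean(axis=0)        # center the data so it has mean zero

S = X.T @ X / N               # D x D data covariance matrix
print(S.shape)                # (100, 100)
```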

We now assume that N is significantly smaller than D. That means

the number of data points is significantly smaller than the dimensionality of the data.

Since the data is centered, the rank of the covariance matrix is at most N minus 1. So rank

of S is at most N minus 1. And that also

means it has at least D minus N plus 1 eigenvalues which are zero.

That means that the matrix is not full rank,

and the rows and columns are linearly dependent.

In other words, there are some redundancies.

In the next few minutes, we'll exploit this and turn the eigenvalue problem for the D by

D covariance matrix S into one for a much smaller N by N matrix that has the same non-zero eigenvalues.
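The rank deficiency of S is easy to verify numerically. A minimal sketch, assuming NumPy and the same made-up sizes as before:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 100
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)        # centering removes one degree of freedom

S = X.T @ X / N
eigvals = np.linalg.eigvalsh(S)                # eigenvalues of the symmetric S
num_zero = int(np.sum(np.abs(eigvals) < 1e-8)) # count numerically zero eigenvalues

print(np.linalg.matrix_rank(S))   # N - 1 = 4
print(num_zero)                   # D - N + 1 = 96
```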

I've just moved the covariance definition up here to have a bit more space on the board.

In PCA, we ended up with the following eigenvalue eigenvector equation.

We had S times bi equals lambda i times bi,

where bi is a basis vector of the orthogonal complement of the principal subspace.

Now, let's rewrite this equation a little bit.

We're now going to replace S with its definition from up here.

So, we'll get 1 over N times X transpose X,

which is just S, times bi equals lambda i times bi.

And now, we multiply by X from the left.

So we will get X times

X transpose times X times bi times 1 over N equals lambda i

times X times bi.

And now, we have a new eigenvector eigenvalue equation.

So, lambda i is still an eigenvalue.

And now, we have eigenvectors X times bi,

which we call ci, of the matrix 1 over N times X times X transpose.

This means that 1 over N times X X transpose has

the same non-zero eigenvalues as

the data covariance matrix, but this is now an N by N matrix,

so that we can compute the eigenvalues and eigenvectors

much quicker than for the original data covariance matrix.

So, this is an N by N matrix,

whereas S used to be a D by D matrix.
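We can check numerically that the small N by N matrix has the same non-zero eigenvalues as S. A sketch, again assuming NumPy and made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 5, 100
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

S = X.T @ X / N               # D x D: expensive to eigendecompose for large D
M = X @ X.T / N               # N x N: cheap to eigendecompose

top_S = np.sort(np.linalg.eigvalsh(S))[::-1][:N]   # largest N eigenvalues of S
all_M = np.sort(np.linalg.eigvalsh(M))[::-1]       # all N eigenvalues of M

print(np.allclose(top_S, all_M))   # True: the non-zero spectra agree
```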

So, now, we can compute the eigenvectors of this matrix 1 over N times X X transpose.

And we use this to recover the original eigenvectors, which we still need in order to do PCA.

Currently, we know the eigenvectors of 1 over N times X X transpose,

and we want to recover the eigenvectors of S. If we

left multiply our eigenvalue eigenvector equation with X transpose,

we get the following: 1 over N times X transpose times

X times X transpose times

ci equals lambda i

times X transpose times ci.

And now, we find our S matrix again.

This is S, and that

also means that X transpose times

ci is an eigenvector of S that belongs to the eigenvalue lambda i.
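To verify this recovery step, we can eigendecompose the small N by N matrix and map its eigenvectors back through X transpose (a sketch assuming NumPy; the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 5, 100
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

M = X @ X.T / N                        # the small N x N matrix 1/N * X X^T
lambdas, C = np.linalg.eigh(M)         # columns of C are the eigenvectors c_i
i = int(np.argmax(lambdas))            # pick the largest eigenvalue
b = X.T @ C[:, i]                      # X^T c_i is an eigenvector of S ...
b = b / np.linalg.norm(b)              # ... which we normalize to unit length

S = X.T @ X / N
print(np.allclose(S @ b, lambdas[i] * b))   # True: S b = lambda_i b
```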

In this video, we reformulated PCA so that we can efficiently run PCA

on datasets where the dimensionality of

the data is substantially bigger than the number of data points.