0:00

In this video, I'm going to introduce principal components analysis, which is a very widely used technique in signal processing. The idea of principal components analysis is that high-dimensional data can often be represented using a much lower-dimensional code.

This happens when the data lies near a linear manifold in the high-dimensional space. So the idea is, if we can find this linear manifold, we can project the data onto the manifold and then just represent where it is on the manifold. And we haven't lost much, because in the directions orthogonal to the manifold there's not much variation in the data.

As we'll see, we can do this operation efficiently using standard principal components methods, or we can do it inefficiently using a neural net with one hidden layer, where both the hidden units and the output units are linear. The advantage of doing it with the neural net is that we can then generalize the technique to deep neural nets, in which the code is a nonlinear function of the input, and our reconstruction of the data from the code is also a nonlinear function of the code. This enables us to deal with curved manifolds in the input space. So we represent data by where it gets projected on the curved manifold, and this is a much more powerful representation.

In principal components analysis, we have

N-dimensional data, and we want to represent it using fewer than N numbers. So we find the M orthogonal directions in which the data has the most variance, and we ignore the directions in which the data doesn't vary much. The M principal dimensions form a lower-dimensional subspace, and we represent an N-dimensional data point by its projections onto these M dimensions in the lower-dimensional space. So we've lost all information about where the data point is located in the remaining orthogonal directions. But since these don't have much variance, we haven't lost that much information.

If we wanted to reconstruct the data point from our representation in terms of M numbers, we'd use the mean value for all the N minus M directions that are not represented, and then the error in our reconstruction would be the sum, over all these unrepresented directions, of the squared difference between the value of the data point on that direction and the mean value on that direction. This is most easily seen in the picture.
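As an illustration (my own, not from the lecture), this projection and reconstruction can be sketched in a few lines of NumPy: project the centered data onto the M directions of greatest variance, reconstruct using the mean in the dropped directions, and check that the squared error equals the variance left in those directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in N=5 dimensions, with most variance in 2 directions.
N, M = 5, 2
latent = rng.normal(size=(500, M)) @ rng.normal(size=(M, N)) * 3.0
data = latent + rng.normal(size=(500, N)) * 0.1 + 10.0

mean = data.mean(axis=0)
centered = data - mean

# The principal directions come from the SVD of the centered data.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:M]                      # the M directions of greatest variance

# The M-number code for each point: its projections onto those directions.
code = centered @ components.T

# Reconstruction: the projection plus the mean, i.e. the mean value is used
# for every direction we chose not to represent.
reconstruction = code @ components + mean

# The reconstruction error is exactly the squared deviation of the data
# from the mean in the N - M ignored directions.
error = np.sum((data - reconstruction) ** 2)
ignored = centered @ vt[M:].T
assert np.isclose(error, np.sum(ignored ** 2))
```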

So consider two-dimensional data that's distributed according to an elongated Gaussian, like this. The ellipse is meant to show, roughly, a one standard deviation contour of the Gaussian.

And consider a data point like that red one. If we used principal components analysis with a single component, that component would be the direction in the data that had the greatest variance. And so to represent the red point, we'd represent how far along that direction it lay. In other words, we'd represent the projection of the red point onto that line, i.e. the green point. When we need to reconstruct the red point, what we'll do is simply use the mean value of all the data points in the direction that we've ignored. In other words, we'll reconstruct a point on that black line.

And so the loss in the reconstruction will be the squared difference between the red point and the green point. That is, we'd have lost the difference between the data point and the mean value of all the data in the direction we're not representing, which is the direction of least variance. And so we have obviously minimized our loss if we choose to ignore the direction of least variance.

Now, we can actually implement PCA, or a version of it, using backpropagation, but it's not very efficient.

So what we do is we make a network in which the output of the network is the reconstruction of the data, and we try to minimize the squared error in the reconstruction. The network has a central bottleneck that only has M hidden units, and those are going to correspond to the principal components, or something like them. So it looks like this: we have an input vector, we project that onto a code vector, and from the code vector we construct an output vector. And the aim is to make the output vector as similar as possible to the input vector.
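As a minimal sketch (my own, not code from the lecture), such a linear bottleneck network can be trained with plain full-batch gradient descent on the squared reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 centered points in 4 dimensions, lying near a 2-D linear manifold.
data = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 4))
data += rng.normal(size=(200, 4)) * 0.05
data -= data.mean(axis=0)

N, M = 4, 2                              # input dimension, bottleneck size
W_enc = rng.normal(size=(N, M)) * 0.5    # input -> code weights (linear units)
W_dec = rng.normal(size=(M, N)) * 0.5    # code -> output weights (linear units)

lr = 0.02
for step in range(4000):
    code = data @ W_enc                  # activities of the M bottleneck units
    output = code @ W_dec                # reconstruction of the input
    err = output - data
    # Gradients of the squared reconstruction error, averaged over cases.
    g_dec = code.T @ err / len(data)
    g_enc = data.T @ (err @ W_dec.T) / len(data)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# After training, the reconstruction error should be far below the data variance.
final_error = np.mean(((data @ W_enc) @ W_dec - data) ** 2)
```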

Â 4:58

The activities of the hidden units in the code vector form a bottleneck. So the code vector is a compressed representation of the input vector.

If the hidden units and the output units are linear, then an autoencoder like this will learn codes that minimize the squared reconstruction error, and that's exactly what principal components analysis does. It will get exactly the same reconstruction error as principal components analysis does. But it won't necessarily have hidden units that correspond exactly to the principal components. They will span the same space as the first M principal components, but there may be a rotation and skewing of those axes. So the incoming weight vectors of the code units, which are what represent the directions of the components, may not be orthogonal, and, unlike in principal components analysis, they will typically have equal variances. But the space spanned by the incoming weight vectors of those code units will be exactly the same as the space spanned by the M principal components. So in that sense, this network will do an equivalent thing to principal components. It's just that if we use stochastic gradient descent learning for this network, it will typically be much less efficient than the algorithm used for principal components.

Although if there's a huge amount of data, it might actually be more efficient. The main point of implementing principal components analysis using backpropagation in a neural net is that it allows us to generalize principal components analysis. If we use a neural net that has nonlinear layers before and after the code layer, it should be possible to represent data that lies on a curved manifold, rather than a linear manifold, in a high-dimensional space. And this is much more general.

So our network will look something like this. We'll have an input vector, and then one or more layers of non-linear hidden units; typically we use logistic units. Then there'll be a code layer, which might consist of linear units. And then, following the code layer, there'll be one or more layers of non-linear hidden units. And then there'll be an output vector, which we train to be as similar as possible to the input vector.

So this is a curious network in which we're using a supervised learning algorithm to do unsupervised learning. The bottom part of the network is an encoder, which takes the input vector and converts it into a code using a non-linear method. The top part of the network is a decoder, which takes the nonlinear code and maps it back to a reconstruction of the input vector. So after we've done the learning, we have mappings in both directions.
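As an illustration only (my own sketch, not code from the lecture), here is a minimal NumPy encoder-decoder of this shape, trained by backpropagation to reconstruct points lying on a curved one-dimensional manifold in 2-D through a single-unit linear code, with logistic hidden layers on either side:

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Data on a curved 1-D manifold in 2-D: an arc of a cosine, plus a little noise.
t = rng.uniform(-1.5, 1.5, size=(400, 1))
data = np.hstack([t, np.cos(t)]) + rng.normal(size=(400, 2)) * 0.02

# Encoder: 2 -> 20 logistic -> 1 linear code.  Decoder: 1 -> 20 logistic -> 2 linear.
sizes = [(2, 20), (20, 1), (1, 20), (20, 2)]
Ws = [rng.normal(size=s) * 0.5 for s in sizes]
bs = [np.zeros(s[1]) for s in sizes]

lr = 0.1
for step in range(5000):
    # Forward pass.
    h1 = logistic(data @ Ws[0] + bs[0])       # nonlinear encoder layer
    code = h1 @ Ws[1] + bs[1]                 # linear code layer (the bottleneck)
    h2 = logistic(code @ Ws[2] + bs[2])       # nonlinear decoder layer
    out = h2 @ Ws[3] + bs[3]                  # linear output layer

    # Backward pass on the squared reconstruction error, averaged over cases.
    d_out = (out - data) / len(data)
    d_h2 = (d_out @ Ws[3].T) * h2 * (1 - h2)  # back through the logistic
    d_code = d_h2 @ Ws[2].T
    d_h1 = (d_code @ Ws[1].T) * h1 * (1 - h1)

    grads = [data.T @ d_h1, h1.T @ d_code, code.T @ d_h2, h2.T @ d_out]
    deltas = [d_h1, d_code, d_h2, d_out]
    for i in range(4):
        Ws[i] -= lr * grads[i]
        bs[i] -= lr * deltas[i].sum(axis=0)

mse = np.mean((out - data) ** 2)
```

After training, the encoder maps each point to its one-number code and the decoder maps the code back to a point on (near) the curved manifold, so we have mappings in both directions, as described above.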
