If the encoder and decoder are linear mappings,

then we get the PCA solution when we minimise the squared autoencoding loss.

If we replace the linear mapping of PCA with a nonlinear mapping,

we get a nonlinear autoencoder.

A prominent example of this is a deep autoencoder, where

the linear functions of the encoder and decoder are replaced with deep neural networks.
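
To make the linear case concrete, here is a minimal NumPy sketch. It compares the squared autoencoding loss of the PCA basis with that of a random orthonormal basis of the same size. The toy data, the two-dimensional code, and the helper name `autoencoding_loss` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption): 500 points in 5 dimensions with unequal variances.
X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)  # centre the data


def autoencoding_loss(X, B):
    """Average squared reconstruction error with encoder B^T and decoder B."""
    Z = X @ B          # encode: project onto the columns of B
    X_hat = Z @ B.T    # decode: map the code back to data space
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))


# PCA basis: eigenvectors of the data covariance with the two largest eigenvalues.
S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
B_pca = eigvecs[:, ::-1][:, :2]

# A random orthonormal basis of the same size, for comparison.
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))

loss_pca = autoencoding_loss(X, B_pca)
loss_rand = autoencoding_loss(X, Q)
print(loss_pca, loss_rand)
```

In this sketch the PCA basis should never do worse than the random basis, and its loss equals the sum of the discarded eigenvalues.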

Another interpretation of PCA is related to information theory.

We can think of the code as a smaller compressed version of the original data point.

When we reconstruct our original data using the code,

we don't get the exact data point back,

but a slightly distorted or noisy version of it.

This means that our compression is lossy.

Intuitively, we want to maximise

the correlation between the original data and the lower dimensional code.

More formally, this would be related to the mutual information.

We would then get the same solution to PCA we discussed

earlier in this course by maximising the mutual information,

a core concept in information theory.

When we derived PCA using projections,

we reformulated the average reconstruction error loss as minimising

the variance of the data that is projected onto

the orthogonal complement of the principal subspace.

Minimising that variance is equivalent to maximising

the variance of the data when projected onto the principal subspace.

If we think of variance in the data as information contained in the data,

this means that PCA can also be

interpreted as a method that retains as much information as possible.
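
The variance argument can be checked numerically. The following is a small sketch, assuming toy data and a two-dimensional principal subspace: the variance kept in the principal subspace and the variance lost in its orthogonal complement add up to the total variance, so minimising one is the same as maximising the other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (an assumption): 1000 correlated points in 4 dimensions.
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

S = X.T @ X / len(X)                   # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order

B = eigvecs[:, -2:]        # basis of the principal subspace (two largest eigenvalues)
B_perp = eigvecs[:, :2]    # basis of its orthogonal complement

var_kept = np.sum(np.var(X @ B, axis=0))        # variance inside the principal subspace
var_lost = np.sum(np.var(X @ B_perp, axis=0))   # variance in the orthogonal complement
total_var = np.sum(np.var(X, axis=0))

print(var_kept, var_lost, total_var)
```

Because the total is fixed by the data, any basis that keeps more variance necessarily loses less.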

We can also look at PCA from the perspective of a latent variable model.

We assume that an unknown lower dimensional code z generates data

x and we assume that we have a linear relationship between z and x.

So, generally, we can then write that x is B times z plus mu and maybe some noise.

We assume that the noise is

isotropic with mean zero and covariance matrix sigma squared times I.

We further assume that the distribution of z is a standard normal, so P of z is

Gaussian with mean zero and covariance matrix the identity matrix.

We can now write down the likelihood of this model.

So, the likelihood is P of x given z and that is a Gaussian distribution in x

with mean Bz plus

mu and covariance matrix sigma squared I.
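
Written out, the model described so far reads as follows. The symbol $\epsilon$ for the noise is notation introduced here for convenience; it is not named in the video.

```latex
\begin{align}
  x &= B z + \mu + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I), \\
  p(z) &= \mathcal{N}(z \mid 0, I), \\
  p(x \mid z) &= \mathcal{N}(x \mid B z + \mu, \sigma^2 I).
\end{align}
```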

And we can also compute the marginal likelihood: P

of x is the integral of P of x given z times P of z, dz.

So, that is the likelihood times the distribution on z, integrated over z,

and that turns out to be a Gaussian distribution in x with

mean mu and with covariance matrix B times B transpose plus sigma squared I.
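
As a quick sanity check of the marginal likelihood, we can sample from the model and compare the empirical mean and covariance of x with mu and B times B transpose plus sigma squared I. This is only a sketch: the values of `B`, `mu`, and `sigma2` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model parameters: 3-dimensional data, 2-dimensional code.
D, M, N = 3, 2, 200_000
B = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, -1.0]])
mu = np.array([1.0, -2.0, 0.5])
sigma2 = 0.1

# Generate data: z ~ N(0, I), noise ~ N(0, sigma^2 I), x = B z + mu + noise.
Z = rng.standard_normal((N, M))
noise = np.sqrt(sigma2) * rng.standard_normal((N, D))
X = Z @ B.T + mu + noise

empirical_mean = X.mean(axis=0)
empirical_cov = np.cov(X, rowvar=False)
model_cov = B @ B.T + sigma2 * np.eye(D)   # covariance of the marginal p(x)

print(np.round(empirical_mean, 2))
print(np.round(empirical_cov, 2))
print(model_cov)
```

With this many samples the empirical quantities should agree with the model values up to small sampling error.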

The parameters of this model

are mu, B, and sigma squared.

And we can write them explicitly down in our model up here.

So model parameters are B and mu and sigma squared.

We can now determine the parameters of this model

using maximum likelihood estimation,

and we will find that mu is the mean of the data and B is

a matrix that contains the eigenvectors of the data covariance matrix that correspond to the largest eigenvalues.
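
The maximum likelihood solution can be sketched in a few lines of NumPy. Note that the scaling of the columns of B and the estimate of sigma squared used below are not spelled out in this video; they are taken from the standard closed-form result for probabilistic PCA (Tipping and Bishop), with the arbitrary rotation set to the identity. The toy data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (an assumption): 2000 points in 4 dimensions; we use a 2-dimensional code.
X = rng.normal(size=(2000, 4)) * np.array([3.0, 2.0, 0.5, 0.3])
M = 2

mu_ml = X.mean(axis=0)                       # ML estimate of mu: the data mean
Xc = X - mu_ml
S = Xc.T @ Xc / len(X)                       # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in descending order

U = eigvecs[:, :M]                           # eigenvectors of the M largest eigenvalues
sigma2_ml = eigvals[M:].mean()               # average variance in the discarded directions
B_ml = U * np.sqrt(eigvals[:M] - sigma2_ml)  # scale each eigenvector column

# Covariance implied by the fitted model.
model_cov = B_ml @ B_ml.T + sigma2_ml * np.eye(X.shape[1])
print(mu_ml, sigma2_ml)
print(B_ml)
```

The columns of `B_ml` point in the directions of the leading eigenvectors, which is the sense in which the latent variable model recovers the PCA solution.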

To get the low-dimensional code of a data point,

we can apply Bayes' theorem to invert the linear relationship between z and x.

In particular, we are going to get P of z

given x as P of x given z.

So, that is the likelihood which comes from here times P of z.

So, that's our distribution that we have here.

Divided by the marginal likelihood P of x which comes from here.
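
Since everything in this model is Gaussian, the posterior P of z given x is Gaussian as well and can be computed in closed form. The sketch below computes its mean and covariance in two equivalent ways: once with the small matrix B transpose B plus sigma squared I, and once by conditioning the joint Gaussian of x and z using the marginal covariance. The parameter values and the data point `x` are made up for illustration.

```python
import numpy as np

# Hypothetical model parameters and a single data point.
B = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, -1.0]])
mu = np.array([1.0, -2.0, 0.5])
sigma2 = 0.1
x = np.array([2.0, -1.0, 0.0])

D, M = B.shape

# Form 1: work with the small M x M matrix B^T B + sigma^2 I.
M_mat = B.T @ B + sigma2 * np.eye(M)
post_mean = np.linalg.solve(M_mat, B.T @ (x - mu))
post_cov = sigma2 * np.linalg.inv(M_mat)

# Form 2: condition the joint Gaussian of (x, z) on x,
# using the marginal covariance C = B B^T + sigma^2 I.
C = B @ B.T + sigma2 * np.eye(D)
post_mean_alt = B.T @ np.linalg.solve(C, x - mu)
post_cov_alt = np.eye(M) - B.T @ np.linalg.solve(C, B)

print(post_mean, post_mean_alt)
print(post_cov)
```

The posterior mean plays the role of the low-dimensional code of x, and the posterior covariance expresses how uncertain that code is.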

In this video, we looked at five different perspectives of

PCA that lead to different objectives: minimising the squared reconstruction error,

minimising the autoencoder loss,

maximising the mutual information,

maximising the variance of the projected data,

and maximising the likelihood in a latent variable model.

All these different perspectives give us the same solution to the PCA problem.

The strengths and weaknesses of the individual perspectives become

clearer and more important when we consider properties of real data.