In the last video,

we looked into the PCA as a dimension reduction tool

that is based on a linear transformation of correlated features,

such as returns of all stocks in the index, into uncorrelated features.

We also looked at the PCA as a data visualization tool,

by projecting all historical stock returns onto

a two dimensional plane formed by the first two principal components.

We did not find many insights in this graph, though.

Let's take another look at it.

In that exercise, we colored points

from different historical regimes with different colors.

Green points here correspond to the period of 2001 to 2002,

which corresponds to the dot-com bubble in the technology sector.

Red dots correspond to the period of 2007 to 2009,

when the credit crisis morphed into a general economic crisis.

The blue points are the rest of the observations,

which we call here benign regime.

For this data, the PCA relies on a linear transformation of the data,

along with a truncation of less important dimensions,

to arrive at the graph shown here.
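As a sketch of that pipeline, the projection onto the first two principal components can be reproduced with scikit-learn; the returns matrix below is a random stand-in, not the actual index data used in the video:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the matrix of daily stock returns:
# rows are dates, columns are stocks in the index.
rng = np.random.default_rng(0)
returns = rng.normal(size=(1000, 30))

# Linear transformation plus truncation: keep only the first
# two principal components to get a 2D view of the data.
pca = PCA(n_components=2)
projection = pca.fit_transform(returns)
print(projection.shape)  # (1000, 2)
```

Each row of `projection` is one historical observation mapped onto the plane of the first two principal components.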

One thing that stands out in this graph is

that most points are located at the center of the graph.

This is called the Crowding Problem.

As you can see,

the Crowding Problem seems to be a real problem indeed,

and apparently, not much can be learned

just by looking at such a two-dimensional view of the data.

An interesting question though is what causes the Crowding Problem?

It turns out that there is

a different non-linear way of two dimensional data visualization,

which often works much better than the PCA.

This method is known as the tSNE,

which stands for t-distributed Stochastic Neighbor Embedding.

The tSNE method was proposed in 2008 by Laurens van der Maaten and Geoffrey Hinton.

And since then, has become a very popular tool in machine learning and data science.

Now, how does the tSNE compare with the PCA?

There are multiple dimensions by which we can compare these two methods.

Let's start with the type of optimization algorithm used by both methods.

While the PCA is a deterministic algorithm,

the tSNE is stochastic.

One consequence of this is that the PCA leads to a unique solution,

but the tSNE may lead to different results for multiple runs on the same data.

Furthermore, while the PCA is a global method, that is,

all d-dimensional points affect a given two-dimensional projection point,

the tSNE is a local method,

with induced global behavior.

The two methods also differ in how they handle extra dimensions.

The PCA simply truncates all other principal components when projecting onto a 2D plane.

But the tSNE does not discard extra dimensions.

Instead, they impact the position of a projection onto the 2D plane.

Finally, the last major difference between the two methods deals with interpretability.

While the PCA is straightforward to interpret,

it's just the rotation of coordinates,

the results of the tSNE are sometimes not easy to interpret, as we will see shortly.

Now, let's see how the tSNE works.

The tSNE method is a probabilistic method,

based on matching two distributions.

The first distribution is a distribution of

data points in the original d-dimensional space.

The second distribution is a distribution of projections of these points onto a 2D plane.

Let's start with the first distribution.

The tSNE assumes that all pairs of points X_I and

X_J are generated via a Gaussian distribution P_IJ,

which is a symmetrized version of a Gaussian density of X_J

centered at X_I with a data-specific variance sigma_I squared,

shown in this form.

We will discuss the choice of sigma_I below.
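As a minimal sketch of this construction, the symmetrized Gaussian affinities can be computed as follows; for simplicity, this sketch assumes one common bandwidth sigma rather than the per-point sigma_I that the tSNE actually tunes:

```python
import numpy as np

def symmetric_affinities(X, sigma):
    """Symmetrized Gaussian affinities p_ij (simplified: one common
    bandwidth sigma instead of a per-point sigma_i)."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                  # exclude self-pairs
    P = P / P.sum(axis=1, keepdims=True)      # conditional p_{j|i}
    P = (P + P.T) / (2.0 * n)                 # symmetrize into p_ij
    return P

X = np.random.default_rng(1).normal(size=(5, 3))
P = symmetric_affinities(X, sigma=1.0)
print(np.isclose(P.sum(), 1.0))  # affinities form a distribution over pairs
```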

Next, let's discuss the distribution of projections of data points onto a 2D plane.

The method assumes that points on

a 2D plane follow a Student's t-distribution with one degree of freedom,

which is also known as the Cauchy distribution.

This distribution has two interesting features.

First, it has fat tails,

that is, probabilities of large events

are substantially larger there than for a Gaussian distribution.

Second, large distances would be suppressed in this distribution as a power law.

It turns out that these two features of the Cauchy distribution

help a lot to produce better visualizations with the tSNE method.
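In code, the heavy-tailed similarities q_ij for the 2D projections might look like this sketch (this is the standard tSNE formula; the points Y here are random placeholders):

```python
import numpy as np

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities q_ij
    between 2D projection points."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # heavy-tailed kernel: power-law decay
    np.fill_diagonal(inv, 0.0)     # exclude self-pairs
    return inv / inv.sum()

Y = np.random.default_rng(2).normal(size=(5, 2))
Q = low_dim_affinities(Y)
print(np.isclose(Q.sum(), 1.0))  # q_ij form a distribution over pairs
```

Note the power-law suppression of large distances: the kernel decays like 1 over squared distance, rather than exponentially as a Gaussian would.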

Now, one remaining step is to choose the data-specific variances sigma_I squared.

The tSNE fixes the variance for each point X_I,

such that the distribution of points X_J around X_I has a fixed perplexity.

The perplexity of the distribution is simply

two raised to the power of entropy of the distribution.

As suggested by this formula,

when we fix perplexity,

we fix the number of neighbors that the point X_I compares itself to.
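This definition can be written out directly; the sketch below computes the perplexity of a conditional distribution and checks the intuition that a uniform distribution over k neighbors has perplexity exactly k:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a conditional distribution p_{j|i},
    where H is the Shannon entropy in bits."""
    p = p[p > 0]                         # ignore zero-probability entries
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy

# Uniform over 10 neighbors: perplexity is 10, i.e. the point
# effectively compares itself to 10 neighbors.
print(np.isclose(perplexity(np.full(10, 0.1)), 10.0))
```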

After variances are fixed for all points, the

algorithm chooses optimal values of projections Y_I by

minimizing the KL divergence between

the original distribution and the distribution of projections.

This objective function can be optimized using stochastic gradient descent,

with its gradient given by this simple expression.
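For reference, the objective and its gradient, reconstructed here in the standard notation of the original paper, read:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
\frac{\partial C}{\partial y_i}
  = 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\,
    \bigl(1 + \lVert y_i - y_j \rVert^2\bigr)^{-1}
```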

For your convenience, the complete algorithm is copied here from the original paper.

All steps here should be already familiar to you except

the very last line in the loop that uses the momentum term in optimization,

which is the last term in this expression.

We will talk about momentum when we discuss optimization methods in more detail.

But for now, let me finish this video by showing you results of

the tSNE visualization of

the same Dow Jones data that we used in our examples with the PCA.

Let's take a few values of perplexity and draw the resulting projections.
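In practice, such an experiment can be run with scikit-learn's `TSNE`; the returns matrix below is a random stand-in for the Dow Jones data, so only the mechanics, not the shapes from the video, are reproduced:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the Dow Jones returns matrix.
rng = np.random.default_rng(0)
returns = rng.normal(size=(100, 30))

# Embed the same data with several perplexity values; the resulting
# pictures can look very different. Perplexity must be smaller than
# the number of samples.
embeddings = {}
for perp in (5, 30, 50):
    embeddings[perp] = TSNE(n_components=2, perplexity=perp,
                            random_state=0).fit_transform(returns)
    print(perp, embeddings[perp].shape)
```

Plotting each `embeddings[perp]` as a scatter plot reproduces the kind of perplexity sweep shown in the video.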

So, if we take perplexity equal 10,

we get this shape that vaguely resembles an octopus.

But if we increase perplexity to 50,

we all of a sudden get something that looks like an anaconda.

Let's increase perplexity to 100.

And now, it looks like a dinosaur.

Let's continue, let's take it to be 500,

and then we get a shark jaw.

And finally, drums please,

we take perplexity to 1000,

and what we get is the first ever visualization of the famous volatility smile,

known in option trading.

Well, of course, it's a joke,

it has nothing to do with volatility smile,

it's just a nice visualization of the data.

But it does look like a smile, which,

perhaps, we should take as a hint to stop increasing perplexity.

So, as you can see,

what you get on these graphs substantially varies with

the value of the perplexity parameter,

and often leaves lots of freedom for different interpretations.

Further details on the tSNE method can be found in the original paper.

I only want to add a few short comments here,

without going into much details.

First, the basic algorithm does not scale well,

as it has quadratic complexity.

But using special tree-based algorithms,

the complexity can be reduced to O(N log N). Second,

the basic tSNE algorithm is a data visualization tool,

but not a dimensional reduction tool,

as it doesn't provide the trained representation of data that could

be used to project new data points onto the plane.

Any new data point should be included in the data set again, and the algorithm retrained.

So, to summarize, in this video you got familiar with the tSNE,

one of the most popular tools for data visualization.

This was our first non-linear, unsupervised method.

And as I noted earlier, the tSNE,

in its basic form,

is mostly used as data visualization tool,

rather than as a dimensional reduction tool,

as it does not provide you with a trained representation that

you could apply to a new data point once it becomes available.

Instead, the whole algorithm should be run again.

The basic tSNE algorithm also has issues with computational complexity,

which call for some additional technical tweaks

if we want to apply it to large data sets.

In the next video,

we will take a first look at the Autoencoder,

another non-linear dimension reduction

method that is free of such deficiencies of the tSNE.