0:03

In this video, we'll see what Gaussian processes are.

But before we go on,

we should see what random processes are,

since a Gaussian process is just a special case of a random process.

So, in a random process,

you have a d-dimensional space R^d,

and to each point x of the space

you assign a random variable f(x).

So, those variables can have some correlation.

As a result, the samples can be, for example,

smooth functions, or maybe non-smooth ones.

So, if you take f(x) at some point, in this case x = 3,

you'll have a one-dimensional distribution that will look like,

for example, this red curve.

You can also sample f(x) at every point of the space R^d,

and in this case you'll get just an ordinary function f.

We can draw such a function, like this trajectory.

The functions we get after sampling all the random variables are called trajectories.

When the dimension of x is 1,

we can interpret x as time.

We're now ready to define the Gaussian process.

So, what I would like to say is that the joint distribution over

all points in R^d is Gaussian.

However, we haven't defined the Gaussian for an infinite number of points.

We have the multivariate Gaussian only for a finite number of points,

and so what we can say is that for an arbitrary number of points n,

if we take the n points x1 to xn,

their joint distribution will be normal.

And since we said that this holds for an arbitrary number of points,

we can select n to be arbitrarily large,

and then, loosely speaking,

the joint distribution of all points would be approximately Gaussian.

So, when we take n points and consider their joint distribution,

it is called a finite-dimensional distribution.

Finite-dimensional distributions are actually useful in practice.

For example, we cannot sample the whole function,

but we can sample it at, say, a thousand points

and plot it by interpolating between them.

And that is, in fact, what I used to draw this plot.
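That sampling procedure can be sketched in a few lines. This is a hypothetical sketch using numpy; the RBF kernel here is an assumed choice, discussed later in the video.

```python
import numpy as np

# Hypothetical sketch: sample one trajectory of a Gaussian process at
# a thousand points. The RBF kernel below is an assumed choice.
def kernel_matrix(x, sigma=1.0, length_scale=1.0):
    # K[i, j] = sigma^2 * exp(-(x_i - x_j)^2 / (2 * length_scale^2))
    d = x[:, None] - x[None, :]
    return sigma**2 * np.exp(-d**2 / (2 * length_scale**2))

x = np.linspace(0, 10, 1000)          # the points where we sample f(x)
mean = np.zeros_like(x)               # zero mean function, for simplicity
cov = kernel_matrix(x) + 1e-6 * np.eye(len(x))  # jitter for stability

rng = np.random.default_rng(0)
trajectory = rng.multivariate_normal(mean, cov)  # one sampled trajectory
# Plotting `trajectory` against `x` and interpolating between the points
# gives a curve like the ones described here.
```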

So, a Gaussian process is parametrized by its mean and covariance functions.

So, the mean is the function that takes the random variable f(x) at each point

of the space and assigns its mean value to that point.

We also have the covariance function, which takes two points x1 and

x2 and returns the covariance between the random variables f(x1) and f(x2).

It will be equal to some function K

that depends only on the positions of those two points.

We'll call this function the kernel.

So, finally, if we take n points,

their joint distribution will be normal, with the mean being

the vector whose components are m(x1),

m(x2), and so on, up to m(xn),

and the covariance matrix will have the function K

evaluated at pairs of points as its elements.

For example, the first element would be K(x1, x1), and so on.
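As a minimal sketch of that finite-dimensional distribution, here is how the mean vector and covariance matrix could be assembled for n = 3 points. The zero mean and the RBF kernel are assumptions for illustration only.

```python
import numpy as np

def m(x):
    # assumed mean function m(x) = 0
    return 0.0

def k(x1, x2, sigma=1.0, l=1.0):
    # assumed RBF kernel K(x1, x2)
    return sigma**2 * np.exp(-(x1 - x2)**2 / (2 * l**2))

xs = [0.0, 0.5, 2.0]                                  # n = 3 points
mean_vec = np.array([m(x) for x in xs])               # (m(x1), ..., m(xn))
K = np.array([[k(a, b) for b in xs] for a in xs])     # K[i, j] = K(x_i, x_j)
# The first element K[0, 0] is K(x1, x1), as described above.
```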

We'll also need the notion of a stationary process.

A process is called stationary if its

finite-dimensional distributions depend only on the relative positions of the points.

For example, here I have four points drawn in red,

and their joint distribution is equal to the joint distribution of

the blue points, since the blue points are obtained by shifting the red ones to the right.

So let's play a game.

I have some samples from a Gaussian process,

and we should find out whether those are samples from the stationary process or not.

So, what do you think about this sample?

Actually, it is not stationary, since there is seasonality here.

If we take points at the beginning of a period,

you can easily tell that you are

at the beginning of that period.

And if we shift them a bit to the right,

you'll say something like you're at the end of the period.

And so, the joint distribution would be different in different parts of the space.

And so it is not stationary.

What about this sample?

Well, again we have a trend here.

So by computing the mean of, for example,

some points, you can take 10 points,

compute their mean, and by using this mean you would be able to tell

which part of the space you are in.
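That window-mean check might look like this. This is a hypothetical sketch on a synthetic trended sample; the slope and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
trended = 0.5 * x + rng.normal(0, 0.1, size=x.size)  # a sample with a trend

early_mean = trended[:10].mean()    # mean of 10 points at the start
late_mean = trended[-10:].mean()    # mean of 10 points at the end
# The window mean differs a lot between the two regions, which reveals
# where in the space we are, so the process is not stationary.
```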

And so this means that the process is not stationary. What about this one?

Well, this one seems stationary.

A stationary process

shouldn't have a trend.

This means that the mean should be constant over the whole space,

so m(x) as a function would be constant.

Also the kernel should depend only on the difference of the two points,

and this would mean that the joint distribution

depends only on the relative positions of the points.

So, we'll write it as K(x1 - x2).

Also, using this notation,

it is really easy to compute the variance of the random variable f(x).

The variance is actually equal to the kernel at position zero,

since that is the covariance

of the random variable f(x) with itself, which is exactly the variance.

Below I have an example of a kernel.

The covariance at 0 is 1,

so the variance of f(x) is 1,

and as we move further from the point,

the covariance becomes lower and lower.

There are many different kernels that you can use for a Gaussian process.

The most widely used one is called the radial basis function, or RBF for short.

It equals sigma squared times the exponential of minus

the squared distance between the two points over 2l^2.

l is a parameter called the length scale,

and sigma squared controls the variance at position zero,

which is the variance of f(x).
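The RBF formula translates directly into code. This is a minimal sketch; the parameter defaults are arbitrary.

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, length_scale=1.0):
    # K(x1, x2) = sigma^2 * exp(-(x1 - x2)^2 / (2 * length_scale^2))
    return sigma**2 * np.exp(-(x1 - x2)**2 / (2 * length_scale**2))

# At zero distance the kernel returns sigma^2, the variance of f(x);
# the covariance decays as the two points move apart.
```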

There are also many other kernels like a rational quadratic kernel or Brownian kernel.

They all have some different parameters and

will give you different samples from the process.

So, let's look at the radial basis function a bit closer.

When the length scale parameter equals 0.1,

the samples look like this.

If we increase the length scale,

the samples look a bit smoother and change less rapidly.

And if we keep increasing the value of l,

we'll get almost constant functions.
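The effect of the length scale can also be checked numerically. This is a hypothetical sketch; the grid, seed, and roughness measure are assumptions.

```python
import numpy as np

def rbf_matrix(x, sigma=1.0, l=1.0):
    # RBF covariance matrix over a grid of points x
    d = x[:, None] - x[None, :]
    return sigma**2 * np.exp(-d**2 / (2 * l**2))

x = np.linspace(0, 5, 200)
rng = np.random.default_rng(0)
jitter = 1e-6 * np.eye(len(x))       # numerical stability

rough = rng.multivariate_normal(np.zeros(len(x)), rbf_matrix(x, l=0.1) + jitter)
smooth = rng.multivariate_normal(np.zeros(len(x)), rbf_matrix(x, l=2.0) + jitter)

def roughness(f):
    # a crude roughness measure: mean absolute step between neighbours
    return np.mean(np.abs(np.diff(f)))
# Samples with a small length scale wiggle much more rapidly.
```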