In this video,

we'll see another example of conjugate priors.

Well, before we continue, we need to learn some new distributions.

And what we will learn now is called the gamma distribution.

Its probability density function is given as follows.

It is parameterized by two parameters, a and b.

Those are assumed to be positive.

The variable gamma is also assumed to be positive.

That is, the gamma distribution is a distribution over the positive axis.

Its probability density function looks as follows.

So, you can see, it can be either a unimodal distribution,

or an exponentially decaying one, depending on which parameters we choose.

Its functional form is gamma to the power of a minus one,

times the exponent of minus b gamma.

For some parameters, the first term is an increasing function

and the second term is a decreasing function,

so we can model a unimodal distribution.

The normalization constant is b to the power of a over the gamma function of a.
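
Written out, the density described above would read roughly as follows. The notation p(gamma | a, b) is an assumption, since the slide itself is not part of this transcript.

```latex
p(\gamma \mid a, b) = \frac{b^{a}}{\Gamma(a)}\,\gamma^{a-1} e^{-b\gamma},
\qquad \gamma > 0,\; a > 0,\; b > 0 .
```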

The gamma function is a smooth extension of the factorial.

That is, the gamma function at a positive integer n equals n minus one factorial.

For example, gamma of five equals 24.
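
As a quick check of the factorial property, here is a minimal sketch using only Python's standard `math` module. The variable names are illustrative.

```python
import math

# The gamma function extends the factorial: Gamma(n) = (n - 1)!
# for positive integers n.
for n in range(1, 8):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

gamma_of_five = math.gamma(5)  # equals 4! = 24

# It is also defined between the integers, e.g. Gamma(1/2) = sqrt(pi).
gamma_of_half = math.gamma(0.5)
```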

Its statistics are given as follows:

the mean value equals a/b,

the mode is (a-1)/b, for a of at least one,

and the variance is a divided by b squared.
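
These statistics can be checked numerically. The sketch below integrates the density on a grid using only the standard library. The values a = 3 and b = 2 and the grid settings are arbitrary choices for illustration.

```python
import math

def gamma_pdf(x, a, b):
    """Density of the gamma distribution with shape a and rate b, for x > 0."""
    # Work in log space so that large a or b do not overflow.
    log_p = a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x
    return math.exp(log_p)

a, b = 3.0, 2.0  # illustrative values

# Midpoint-rule integration on (0, 20); the tail beyond 20 is negligible here.
n, upper = 20_000, 20.0
h = upper / n
xs = [(i + 0.5) * h for i in range(n)]
ps = [gamma_pdf(x, a, b) for x in xs]

total = sum(ps) * h                                          # should be close to 1
mean = sum(x * p for x, p in zip(xs, ps)) * h                # close to a / b
var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps)) * h   # close to a / b**2
mode = xs[max(range(n), key=ps.__getitem__)]                 # close to (a - 1) / b
```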

Let's see an example of how we can apply the gamma distribution to model real things.

Imagine that you run every day.

You run somewhere around five kilometers.

However, some days you are a bit more active.

Some days you are a bit tired, and so

you can run 100 meters more or 100 meters less.

This means that the expected value of the distance that you run is

five kilometers and the variance is 0.01 kilometers squared, that is, a standard deviation of 0.1 kilometers.

So, we can model this as a random variable that has a gamma distribution.

We could also use a normal distribution, however,

it would imply that you can run a negative distance, which is impossible.

We can plug in those values into the formulas

for the mean and the variance and find out that a should be

equal to 2,500 and b should be equal to 500, and the density would look like this.
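
The step of plugging the values into the formulas can be sketched as follows. Solving mean = a/b and variance = a/b² gives b = mean/variance and a = mean times b. The function name is hypothetical.

```python
def gamma_params_from_moments(mean, variance):
    """Solve mean = a / b and variance = a / b**2 for shape a and rate b."""
    b = mean / variance
    a = mean * b
    return a, b

# Running example: mean 5 km, standard deviation 0.1 km (variance 0.01 km^2).
a, b = gamma_params_from_moments(5.0, 0.1 ** 2)
print(a, b)  # approximately 2500 and 500
```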

The gamma distribution is actually conjugate to the normal with respect to the precision.

I haven't defined precision yet, so let's do that now.

The precision is the inverse of the variance.

For example, here the blue curve has high precision, that

is, you can easily predict where its samples would be.

Correspondingly, it has a low variance, and

the green line would have high variance and low precision.

Here's the probability density function of a normal distribution.

If we replace the variance with the inverse of the precision,

we'll get the following formula.
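
The substitution can be written out as below. The symbols x for the observation, mu for the mean, and sigma squared for the variance are assumed notation. Gamma is the precision, as in the lecture.

```latex
\mathcal{N}(x \mid \mu, \sigma^{2})
  = \frac{1}{\sqrt{2\pi\sigma^{2}}}
    \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
\quad\xrightarrow{\;\gamma = 1/\sigma^{2}\;}\quad
\mathcal{N}(x \mid \mu, \gamma^{-1})
  = \sqrt{\frac{\gamma}{2\pi}}\,
    \exp\!\left(-\frac{\gamma\,(x-\mu)^{2}}{2}\right)
```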

Now, we ask ourselves,

what is the conjugate prior with respect to the precision?

Here's our formula again.

If we drop all the constants that do not depend on gamma,

we'll get the following function.

Let's try to find the conjugate distribution in the following way.

It would be proportional to gamma to the power of 1/2 times the exponent of minus b gamma.

Would it work or not?

What do you expect?

All right. Let's check it out.

Here's our basic formula.

Again, we can drop the denominator since it does

not depend on the parameters, and we'll get the following functional form.

If we rearrange the terms,

we see that gamma now has the power of 1,

which means that it doesn't lie in the same family of distributions.

The one we considered before had the power of 1/2, so this is a wrong answer.
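
For a single observation x, this check would go roughly as follows. The symbols x and mu are assumed notation, and the likelihood is the normal density with all factors that do not depend on gamma dropped.

```latex
\begin{align*}
p(\gamma) &\propto \gamma^{1/2} e^{-b\gamma}, \\
p(\gamma \mid x) &\propto
  \underbrace{\gamma^{1/2} e^{-\gamma (x-\mu)^{2}/2}}_{\text{likelihood}}
  \cdot
  \underbrace{\gamma^{1/2} e^{-b\gamma}}_{\text{prior}}
  = \gamma^{1}\, e^{-\gamma\left(b + (x-\mu)^{2}/2\right)} .
\end{align*}
```

The power of gamma went from 1/2 to 1, so the posterior left the family we started with.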

What if we choose a gamma distribution?

That is, we'll have another parameter that would allow us to vary the power of gamma.

It would look as follows.

Again, this is the gamma distribution parameterized by two parameters, a and b.

Well, let's try to compute the posterior.

Again, here is our prior.

Here's our posterior.

It is proportional to the likelihood times the prior.

Here is what we get if we drop all the constants.

If we rearrange the terms,

we'll get gamma to some new power times the exponent of minus gamma times a new

constant, and so we can read off the parameters of the posterior from this form.
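
For a single observation x, the rearrangement can be written out as below. The symbols x and mu are assumed notation, and factors that do not depend on gamma are dropped.

```latex
\begin{align*}
p(\gamma \mid x)
  &\propto \gamma^{1/2} e^{-\gamma (x-\mu)^{2}/2}
     \cdot \gamma^{a-1} e^{-b\gamma} \\
  &= \gamma^{\left(a + \frac{1}{2}\right) - 1}\,
     e^{-\left(b + \frac{(x-\mu)^{2}}{2}\right)\gamma} ,
\end{align*}
```

which has the form of a gamma density with parameters a + 1/2 and b + (x - mu)²/2.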

So the new a would be a+1/2 and the new b would be b plus a quadratic term, (x minus mu) squared over 2,

and so we avoided computing the evidence simply by choosing the conjugate prior.
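
The conjugate update can also be checked numerically. The sketch below assumes a single observation x with known mean mu. All function names and numeric values are illustrative. If the updated parameters are right, the likelihood times the prior divided by the claimed posterior must be the same constant (the evidence) at every value of the precision.

```python
import math

def gamma_log_pdf(g, a, b):
    """Log density of the gamma distribution (shape a, rate b) at g > 0."""
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(g) - b * g

def normal_log_pdf(x, mu, g):
    """Log density of a normal with mean mu and precision g at x."""
    return 0.5 * math.log(g / (2 * math.pi)) - 0.5 * g * (x - mu) ** 2

def posterior_params(a, b, x, mu):
    """Conjugate update of the gamma prior after one observation x."""
    return a + 0.5, b + 0.5 * (x - mu) ** 2

a, b, mu, x = 2.0, 1.0, 0.0, 1.5  # illustrative values
a_new, b_new = posterior_params(a, b, x, mu)

# log(likelihood) + log(prior) - log(posterior) at several precisions;
# the entries should all be equal to the log of the evidence.
log_evidence = [
    normal_log_pdf(x, mu, g) + gamma_log_pdf(g, a, b)
    - gamma_log_pdf(g, a_new, b_new)
    for g in (0.1, 0.5, 1.0, 3.0, 10.0)
]
```

Note that the evidence falls out as a by-product here. The update itself never needed it.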