0:01

We're continuing our discussion of parameter estimation.

Previously we talked about maximum likelihood estimation, which tries to optimize

the likelihood of the data, given the parameters.

An alternative approach that offers some better properties is the approach

of Bayesian estimation, which is what we're going to talk about today.

So first let's understand why maximum likelihood estimation isn't perfect.

Consider two scenarios. In the first one, two teams have played ten

times, and the first team has won seven out of the ten matches.

So if we're going to use maximum likelihood estimation, the probability of the first

team winning is 0.7, which seems like a reasonable guess going forward.

On the other hand, we take a dime out of our pocket and we toss it ten times and

it comes out heads seven out of the ten tosses.

Maximum likelihood estimation is going to come out with exactly the same estimate,

which is that the probability of the next toss coming out heads is also 0.7.

In this case that doesn't seem like quite as reasonable an inference based on the

results of these ten tosses. To elaborate the scenario still further,

let's imagine that we take that same dime and now we patiently sit and toss it

10,000 times. And sure enough, it comes out heads 7,000

out of the 10,000 tosses. Now the estimated probability of heads is still 0.7,

but now it might be a more plausible inference for us to make than in the

previous case, where we only had ten tosses to draw on.

And so maximum likelihood estimation has absolutely no ability to distinguish

between these three scenarios: between the case of a familiar setting

such as a coin versus an unfamiliar event such as the two teams playing, on the one

hand, and between the case where we toss a coin ten times versus tossing a coin

ten thousand times, on the other. Neither of these distinctions is apparent

in the maximum likelihood estimate.
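
To make the point concrete, here is a minimal sketch of the maximum likelihood estimate for the three scenarios. The function name `bernoulli_mle` and the scenario labels are illustrative choices, not something from the lecture; the estimate itself is just the observed frequency of successes.

```python
def bernoulli_mle(successes: int, trials: int) -> float:
    """Maximum likelihood estimate of a Bernoulli parameter: the observed frequency."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    return successes / trials


# The three scenarios from the discussion, as (successes, trials) pairs.
scenarios = {
    "teams, 7 wins in 10 matches": (7, 10),
    "dime, 7 heads in 10 tosses": (7, 10),
    "dime, 7000 heads in 10000 tosses": (7000, 10000),
}

estimates = {name: bernoulli_mle(k, n) for name, (k, n) in scenarios.items()}
for name, estimate in estimates.items():
    print(f"{name}: {estimate}")
```

All three scenarios produce the same number, so nothing in the estimate reflects either prior familiarity with coins or the amount of data behind it.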

To provide an alternative formalism, we're going to go back to our view of

parameter estimation as a probabilistic graphical model, where we have the

parameter theta over here and we have the data being dependent on the parameter

theta. But unlike in the previous case, where we

were just trying to figure out the most likely value of theta.

Now we're going to take a radically different approach.

And we're going to say that theta is, in fact, a random variable by itself.

It's a continuous-valued random variable, which in the case of a coin

toss takes on values in the interval [0, 1]. But in any case it is a random variable,

and therefore something over which we'll maintain a probability distribution.

Now this is, in fact, at the heart of the Bayesian formalism.

Anything about which we are uncertain we should view as a random variable over

which we have a distribution that is updated over time as data is acquired.

Now let's understand the difference between this view and the maximum

likelihood estimation view. So certainly we have as before that given

theta, the tosses are independent. But now that we're explicitly viewing

theta as a random variable, we have that if theta is unknown, then the tosses are

not marginally independent. So for example, if we observe that X1 is

equal to heads, that's going to tell us something about the parameter. It's going to

increase our probability that the parameter favors heads over tails, and

therefore it's going to change our probability of other coin tosses.

So the coin tosses are dependent: not given theta, but marginally, without

being given theta, they're dependent.
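
A small exact calculation illustrates this marginal dependence. This is only a sketch under an assumption the lecture has not made at this point: that theta has a uniform prior on [0, 1]. Under that assumption, the probability that the first k tosses are all heads is the integral of theta to the power k over [0, 1], which is 1 / (k + 1). The helper name `prob_all_heads` is hypothetical.

```python
from fractions import Fraction


def prob_all_heads(k: int) -> Fraction:
    """P(first k tosses are all heads) when theta is uniform on [0, 1] (an assumption).

    This is the integral of theta**k over [0, 1], which equals 1 / (k + 1).
    """
    return Fraction(1, k + 1)


p_x1_heads = prob_all_heads(1)           # P(X1 = heads)
p_x2_heads = prob_all_heads(1)           # P(X2 = heads): same marginal as the first toss
p_both_heads = prob_all_heads(2)         # P(X1 = heads, X2 = heads)
p_x2_given_x1 = p_both_heads / p_x1_heads

print("P(X2 = heads)              =", p_x2_heads)
print("P(X2 = heads | X1 = heads) =", p_x2_given_x1)
```

Observing one head raises the probability of a second head above its marginal value, so the tosses are not marginally independent. If theta were given, the conditional probability would simply be theta, regardless of the first toss.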

So that really gives us a joint probabilistic model over all of the coin

tosses and the parameter together. So if we break down that probability

distribution using this PGM that we have over here, it breaks down using the chain

rule for the Bayesian network that we have drawn there.

So we have P of theta, which is the parameter at the root of this network,

and then the probability of the X's given theta, where, because of the structure of

the network, they are conditionally independent given theta, and

so we have a product of the probabilities of each x i given theta. This over here is just our good

friend from before, the likelihood function.

4:53

Which is just the probability of the data given the parameters, and we've already

computed what that is in the context of this coin tossing example:

it is theta to the power of the number of heads times one minus theta to the power of the

number of tails. But now we have an additional term, which

is the probability of theta, which we obtain from the prior that we have over

theta. And now, by virtue of having a

prior, and in fact a joint distribution, we can go ahead and compute a posterior

over my parameter theta given my data set D.
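
The joint distribution and likelihood described above, and the posterior that Bayes' rule yields from them, can be written compactly as follows. The symbols are notational assumptions for this sketch: x_1, ..., x_N are the N coin tosses, D is the data set they form, and m_H and m_T are the numbers of heads and tails.

```latex
\begin{align*}
P(x_1, \dots, x_N, \theta) &= P(\theta) \prod_{j=1}^{N} P(x_j \mid \theta)
  && \text{(chain rule for the network)} \\
P(D \mid \theta) &= \theta^{m_H} (1 - \theta)^{m_T}
  && \text{(likelihood for the coin)} \\
P(\theta \mid D) &= \frac{P(D \mid \theta)\, P(\theta)}{P(D)},
  \qquad P(D) = \int_0^1 P(D \mid \theta)\, P(\theta)\, d\theta
  && \text{(Bayes' rule)}
\end{align*}
```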

So this is, after having observed the values of N coin tosses, a

new probability distribution over the

parameter, and by simple application of Bayes' rule that is going to be equal to

the probability of the data given theta, which is again my likelihood function,

5:53

times the prior, divided by the probability of the data,

which, importantly, just as in any application of Bayes' rule, is a

normalizing constant. And constant here means relative to

theta, which means that if I know how to compute the numerator, I can derive

the denominator by simply, in this case, integrating over the value of theta

to derive the normalizing constant required to make this a legal density

function. Now the most common parameter

distribution to use when we have a parameter that describes a multinomial

distribution over K different values, such as this parameter theta, is

what's called a Dirichlet distribution. Now the Dirichlet distribution is

characterized by a set, alpha-1 up to alpha-K, of what are called

hyperparameters. And that is to distinguish them from the

actual parameters theta. So the

probability distribution that is defined using these hyperparameters is a

density over theta, here's theta, which has the following form.

Let's first look at this part over here, which is the part that

is a function of the parameters theta. What we

see here is that for each of the entries theta i in the multinomial, we

have an expression of the form theta i to the power of alpha i minus one, where

alpha i is the associated hyperparameter. In order to make this a legal density,

we have, in addition, a normalizing constant, the partition function,

which in this case, and this is something that we'll come

back to, has the following form that we're not going to dwell on right now.

It's a ratio of these things called gamma functions, where a gamma function is

defined via the following integral. And for the moment we don't really need

to worry about this because the only thing we really care about for the moment

is the form of this internal expression over here, knowing that it needs to be

normalized in order to produce a density. Now intuitively, and we'll see this in a

couple of different ways, these hyperparameters, these alphas, correspond

intuitively to the number of samples that we've seen so far.
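
For reference, the Dirichlet density described above is usually written as follows. This is the standard textbook form rather than a copy of the lecture slide; Z denotes the normalizing constant (partition function) and Gamma the gamma function.

```latex
\begin{align*}
P(\theta) &= \frac{1}{Z} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
  \qquad \theta_i \ge 0,\ \sum_{i=1}^{K} \theta_i = 1 \\
Z &= \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)},
  \qquad \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt
\end{align*}
```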

So let's understand why that intuition holds.

But before we do that, let's look at a couple of examples of Dirichlet

distributions, and this is a special case of the Dirichlet distribution

where we have just two values for the random variable.

So it's really a distribution over the parameter of a Bernoulli random variable, and in this

case the Dirichlet is often known as a beta distribution.

But a beta is just a Dirichlet with two hyperparameters.

9:00

So here we have several examples of a Dirichlet, or beta, distribution.

This one is the Dirichlet, or beta, (1, 1). And notice that that corresponds

to a uniform distribution. As we increase the

hyperparameters, for example as we go to this green line,

which is Dirichlet (2, 2), we notice we get a peak in the middle.

So there's an increase around 0.5, and that corresponds to a stronger

belief that the parameter is centered around the middle.

That probability increases yet further when we go to the Dirichlet (5, 5), or beta (5, 5),

where now we have an even bigger peak around the value in the middle.

And as we shift the amount of data that we get and its mix, this distribution is

going to get moved to the left or to the right, depending on the mix

between heads or tails in this case. And as we get more and more data, the

distribution becomes more and more peaked.
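
The shapes described here can be checked numerically. The sketch below evaluates the beta density directly from its definition using the standard library's `math.gamma`. The function name `beta_pdf`, the grid resolution, and the asymmetric example with hyperparameters (7, 3) are illustrative assumptions, not values from the lecture.

```python
import math


def beta_pdf(theta: float, alpha_heads: float, alpha_tails: float) -> float:
    """Density of a beta (two-valued Dirichlet) distribution at theta in (0, 1)."""
    normalizer = math.gamma(alpha_heads + alpha_tails) / (
        math.gamma(alpha_heads) * math.gamma(alpha_tails)
    )
    return normalizer * theta ** (alpha_heads - 1) * (1 - theta) ** (alpha_tails - 1)


# Symmetric hyperparameters: the density at 0.5 grows as the hyperparameters grow.
for alphas in [(1, 1), (2, 2), (5, 5)]:
    print(alphas, "density at 0.5:", beta_pdf(0.5, *alphas))

# Asymmetric hyperparameters (hypothetical values): the peak moves toward heads.
grid = [i / 100 for i in range(101)]
peak = max(grid, key=lambda t: beta_pdf(t, 7, 3))
print("peak location for (7, 3):", peak)
```

The (1, 1) case is flat, larger symmetric hyperparameters concentrate mass around 0.5, and an unbalanced pair shifts the peak toward the more heavily weighted outcome.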

So roughly speaking, we have the mix

between alpha heads and alpha tails: the balance determines the position of the

peak, and the total alpha determines how sharp it

is. So now that we know a little bit about

what the Dirichlet distribution looks like, let's see how it's updated as we

obtain data. So let's consider a case where we have a

prior, which we're going to assume is Dirichlet, and we have a likelihood, which is

for a data set D derived from a multinomial with parameters theta.

And now we'd like to figure out the posterior P of theta given D, after having

seen the data D. So, the likelihood we've already seen

before. This is the probability of a data set

in which m i is the number of instances with value little x i.

And so this is just the likelihood function.

And the prior has the form of a Dirichlet with the associated

hyperparameters. And what's important to see, looking at

this, is that the theta i term in the likelihood and the theta i term in the

prior have exactly the same form. So when you multiply the likelihood with

the prior, you can bring together like terms, those with theta i at the base of

the exponent. And you're going to end up with a

posterior that looks exactly like a Dirichlet distribution as well because

it's going to have the form theta i to the power of m i plus alpha i minus one.

So if our prior was Dirichlet alpha 1 up to alpha K and the data counts are m 1 up

to m K, then the posterior is simply a Dirichlet with hyperparameters alpha 1

plus m 1 up to alpha K plus m K. And that, again, suggests that the

hyperparameters of a distribution represent counts that we've seen.

If a priori our counts for x i were alpha i, and now we saw an additional m i counts

for x i, then now in the posterior we have alpha i plus m i counts that we've

seen for that particular event. Now, from a formal perspective, this is a

useful term to know. This situation where the prior and the

posterior have the same form is called a conjugate prior.
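
The argument above comes down to one line of algebra. Written out, with m_i the count of value x_i in D and the constants that do not depend on theta absorbed into the proportionality sign:

```latex
\begin{align*}
P(\theta \mid D) \;\propto\; P(D \mid \theta)\, P(\theta)
  \;\propto\; \prod_{i=1}^{K} \theta_i^{m_i} \cdot \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}
  \;=\; \prod_{i=1}^{K} \theta_i^{\alpha_i + m_i - 1}
\end{align*}
```

The right-hand side is the unnormalized density of a Dirichlet with hyperparameters alpha_1 + m_1 up to alpha_K + m_K, so the posterior is that Dirichlet.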

14:13

So to summarize we've presented the framework of Bayesian learning.

Bayesian learning treats parameters as random variables.

Continuous-valued random variables, but still random variables, which then allows

us to reformulate the learning problem simply as an inference problem.

Because what we're doing is we're taking a distribution over the random variables

and updating it using evidence which in this case is the observed training data.

Now specifically in the context of discrete random variables over which we

have a multinomial distribution in the likelihood and a Dirichlet distribution

as the prior, we have this very elegant situation where the prior, the Dirichlet

distribution prior is conjugate to the multinomial distribution, which as we

just discussed means that the posterior has the same form as the prior.

And that in turn allows us to keep a closed-form distribution

over the parameters, which has the same form all along as we

keep updating it. And that update uses the sufficient

statistics from the data for the update process, usually in a very efficient

form.
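
As a closing illustration, here is a minimal sketch of the conjugate update driven by sufficient statistics. The function names, the uniform prior of [1, 1], and the ten-toss data set are assumptions made for the example. It shows that updating once with the full counts gives the same hyperparameters as updating one observation at a time.

```python
from collections import Counter


def sufficient_statistics(data, values):
    """Count how often each possible value occurs in the data."""
    tally = Counter(data)
    return [tally[v] for v in values]


def dirichlet_update(alpha, counts):
    """Conjugate update: add the observed counts to the prior hyperparameters."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + m for a, m in zip(alpha, counts)]


values = ["heads", "tails"]
prior = [1, 1]  # hypothetical uniform prior over the coin's parameter
data = ["heads"] * 7 + ["tails"] * 3

# Batch update from the counts of the whole data set.
batch_posterior = dirichlet_update(prior, sufficient_statistics(data, values))

# Sequential update, one toss at a time; the form stays Dirichlet throughout.
sequential_posterior = prior
for toss in data:
    sequential_posterior = dirichlet_update(
        sequential_posterior, sufficient_statistics([toss], values)
    )

print("batch:     ", batch_posterior)
print("sequential:", sequential_posterior)
```

Because only the counts enter the update, the same function handles ten tosses or ten thousand at the same cost once the counts are known, and unlike the maximum likelihood estimate the resulting hyperparameters differ between those two cases.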