0:17

So let's look at the more general case where now we have a general Bayesian network, not necessarily binary valued, and our parameters now consist of a set of parameters theta X for all possible values of the variable X, and a set of parameters theta Y given X for all values of the parent X and the child Y. And that's the parameterization of this Bayesian network in the general case of table CPDs.

Now let's think about the likelihood of this set of parameters, theta. What does the likelihood function look like? Well, the likelihood function, remember, is the probability of the data given the parameters, which decomposes, because the data instances are IID, as a product over all data instances (xm, ym) of their probability relative to the parameterization theta.

Well, so we're going to break that probability up using the chain rule for Bayesian networks. And that's going to give me a decomposition of this joint probability as a product of the probability of the parent xm times the conditional probability of ym given xm, which is just, again, the application of the chain rule in this very simple example.

And now I'm going to break this up into two separate products. So I'm going to rearrange the product so that I have one product over just the first set of terms, and then a second product over the y given x terms. And then I'm going to observe that the first set of terms, the probability of xm, only depends on the parameters theta X, and the probability of ym given xm only depends on theta Y given X. And so now I've broken up the likelihood function into a product of two what are called local likelihoods.

Â 6:52

And if we now consider maximum likelihood estimation, it turns out that the maximum likelihood estimate of theta x given u is simply, as one would expect, the fraction of instances with X = x among the instances where U equals little u.

So what happens with maximum likelihood estimation when we have a model with shared parameters? Let's consider that question in the context of a hidden Markov model, where initially we're actually just going to look at a Markov chain, that is, a case where we have just a single state variable S and we transition from one state to the other.

So let's look at the likelihood function in this context before we consider what maximum likelihood estimation would look like. So the likelihood function of the parameter vector theta, which dictates the transition model, given a set of observations of the states between time 0 and time T, decomposes, using the Markov assumption, as a product over t from 1 to big T of the probability of the state at time t given the state at time t - 1.

Now we can further reformulate this product by looking at how it decomposes across specific pairs of states, Si and Sj. So we can first multiply over all possible pairs of states, i and j, and then consider all of the time points in which we transition from Si to Sj. And then we have the probability of that transition from Si to Sj.

Now, the critical observation is that this probability is the same regardless of the time point; this is because of the time invariance assumption that we have in these models. And so the parameter, specifically, that we would be multiplying in this context is simply the transition parameter theta Si to Sj.

Now, if you look at this expression, which has all these products in it, the product over i and j, and then the product over the time points in which the transition from i to j took place, we end up with the product over here, which multiplies, over all i, j, the parameter associated with that transition, exponentiated by the sufficient statistic M Si to Sj, which is simply the number of times in the data stream that we saw a transition between those two states.

Now, if we consider what maximum likelihood estimation would look like for the parameter theta Si to Sj, the maximum likelihood estimate theta hat Si to Sj would be exactly what we would expect: the number of transitions in which Si transitioned to Sj, divided by the total number of occurrences of the state Si within the time sequence.
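As a sketch of the counting just described (the state sequence here is made up, and the names are mine), the sufficient statistics and the transition estimate can be computed like this:

```python
from collections import Counter

# A made-up observed state sequence s_0, ..., s_T
states = ["A", "A", "B", "A", "B", "B", "A"]

# Sufficient statistics M[Si -> Sj]: counts of consecutive state pairs,
# pooled over all time points (time invariance lets us pool them)
transitions = Counter(zip(states[:-1], states[1:]))

# Number of transitions leaving each state Si
leaving = Counter(states[:-1])

def theta_hat(si, sj):
    """MLE of the transition parameter: M[Si -> Sj] / M[Si]."""
    return transitions[(si, sj)] / leaving[si]

print(theta_hat("A", "B"))  # 2 of the 3 transitions out of A go to B
```

Note that the estimates out of each state form a proper distribution: for a fixed Si, the theta_hat values over all Sj sum to 1.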

Let's now elaborate this to the context of a hidden Markov model, just to see what it would look like. So here we have, in addition to the first component of the likelihood function, which is the same one that we had before, just the transition probabilities of the state sequence, an additional component, which multiplies, over every time point from 1 to big T, the probability of the particular observation O t given the state S t.

Now we can just do the exact same process of reformulating the likelihood function. We've already seen what the first term looks like: it's exactly the same as the one that we had on the previous slide, theta Si to Sj exponentiated with the sufficient statistics. And we have a very analogous term that multiplies, over pairs of states i and observations k, the parameter that corresponds to the kth observation in the state i, exponentiated with the sufficient statistic, which is the number of times that we saw, at the same time point, the observation k and the state i.

And so, adding now to our set of sufficient statistics, in addition to the count of going from Si to Sj, we also have sufficient statistics that count the number of time points at which we had the state Si and the observation Ok.
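The additional sufficient statistics are accumulated the same way as the transition counts. A minimal sketch (both sequences are made up for illustration):

```python
from collections import Counter

# Made-up state sequence and the aligned observation sequence
states = ["s1", "s2", "s2", "s1", "s2"]
obs    = ["o1", "o2", "o1", "o1", "o2"]

# Transition counts M[Si -> Sj], exactly as for the plain Markov chain
trans_counts = Counter(zip(states[:-1], states[1:]))

# The new sufficient statistics: M[Si, Ok], the number of time points at
# which the state was Si and the observation was Ok
emit_counts = Counter(zip(states, obs))

# Number of time points spent in each state Si
state_counts = Counter(states)

def emission_mle(ok, si):
    """MLE of the observation parameter P(Ok | Si)."""
    return emit_counts[(si, ok)] / state_counts[si]
```

Both sets of counts come from a single pass over the data stream, which is all that the maximum likelihood estimates need.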

So, to summarize maximum likelihood estimation in the case of discrete Bayesian networks. For a Bayesian network that has disjoint sets of parameters in the CPDs, that is, where each CPD has its own set of parameters, the likelihood function decomposes as a product of local likelihood functions, and this is important, because we're going to use that later on, one per variable.

So the likelihood function is a product of small likelihoods, or local likelihoods, one per variable.
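This decomposition can be checked numerically. Here is a minimal sketch for a two-variable network X -> Y; the parameter values and data instances are made up for illustration:

```python
from math import prod, isclose

# Hypothetical parameters for a two-variable network X -> Y (both binary)
theta_x = [0.3, 0.7]                 # P(X = x)
theta_y_x = [[0.9, 0.1],             # P(Y = y | X = 0)
             [0.2, 0.8]]             # P(Y = y | X = 1)

# A small made-up data set of (x_m, y_m) instances
data = [(0, 0), (1, 1), (1, 0), (0, 0), (1, 1)]

# Joint likelihood: product over instances of P(x_m) * P(y_m | x_m)
joint = prod(theta_x[x] * theta_y_x[x][y] for x, y in data)

# Local likelihoods: one depends only on theta X, the other only on theta Y|X
local_x = prod(theta_x[x] for x, _ in data)
local_y = prod(theta_y_x[x][y] for x, y in data)

assert isclose(joint, local_x * local_y)  # the likelihood factorizes
```

Because each local likelihood involves a disjoint set of parameters, each one can be maximized on its own, which is what makes the estimation problem tractable.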

Now, when we have table CPDs, we get a further decomposition. Specifically, even the local likelihood now decomposes as a product of likelihoods for specific multinomials, one for each combination of the parents. And so now we get a further decomposition that is basically going to allow us to estimate each of those multinomials separately, via the same likelihood estimation that we had in the original case of estimating parameters for a single multinomial, because if you just have a product of these likelihood functions, each of them can be optimized separately.

Finally, in the very common case of networks that have shared CPDs, so this is a case unlike the first bullet, where we had disjoint sets of parameters, here we have shared CPDs. We simply accumulate the sufficient statistics over all occurrences or uses of that CPD, and then, once we do that, maximum likelihood estimation can be done in effectively the same way.

Now, one important observation that arises from the form of maximum likelihood estimation is worth remembering and thinking about when one uses this. So let's look at the form of the maximum likelihood estimate for a parameter theta x given u. As we've seen, that is a ratio between M x u, the number of counts of the instantiation little x together with the parent assignment little u, and the sum of M x' u over the different values x'; or, written otherwise, it's M x u divided by M u.
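The ratio M x u over M u is pure counting. A minimal sketch, with hypothetical variable names and made-up data:

```python
from collections import Counter

# Made-up instances of (u, x): parent assignment u and child value x
data = [("u0", "x0"), ("u0", "x0"), ("u0", "x1"),
        ("u1", "x1"), ("u1", "x1")]

M_xu = Counter(data)                # M[x, u]: joint counts
M_u  = Counter(u for u, _ in data)  # M[u]: counts of the parent assignment

def theta_hat(x, u):
    """MLE: theta hat of x given u = M[x, u] / M[u]."""
    return M_xu[(u, x)] / M_u[u]

print(theta_hat("x0", "u0"))  # X = x0 in 2 of the 3 instances with U = u0
```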

Now, what are the consequences of this expression when the number of parents grows large? When the number of parents grows larger, the number of buckets, that is, the number of different possible assignments little u, increases exponentially with the number of parents. If you have two parents, the number of buckets u is still small, but if you have fifteen or twenty parents, then the number of buckets into which we need to partition the data increases very quickly. And that means that most buckets, that is, most assignments little u, will have very few instances assigned to them, which means that, effectively, most of these estimates, which divide M x u by M u, will be computed from few or even zero instances of u, at which point any estimate is just totally bogus. And specifically, that means that the parameter estimates that we're going to get for most of the multinomial parameters, once the space of assignments u gets large, are going to be very poor.
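To get a feel for the effect, here is a small made-up simulation. It assumes, purely for illustration, that instances fall uniformly over the parent assignments: with 15 binary parents, a data set of 1,000 instances is spread over 2^15 buckets, so almost every bucket ends up empty:

```python
import random

random.seed(0)

n_parents = 15
n_buckets = 2 ** n_parents   # one bucket per parent assignment u
n_instances = 1000           # a modest made-up data set size

# Throw each instance into a uniformly random bucket and count occupancy
counts = [0] * n_buckets
for _ in range(n_instances):
    counts[random.randrange(n_buckets)] += 1

empty = sum(1 for c in counts if c == 0)
print(n_buckets, empty)  # the vast majority of buckets get zero instances
```

At most 1,000 of the 32,768 buckets can be nonempty, so over 96% of the multinomials would be estimated from no data at all.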

So this concept is called fragmentation of the data, and it has the following important consequence. If the amount of data is limited relative to the model dimensionality, we can often get better estimates of the parameters with simpler structures, even when these structures are actually wrong. That is, even if we make incorrect independence assumptions, removing edges that really ought to be there, we can sometimes get better generalization, in terms of, say, log likelihood of test data, than when using the much more complicated correct structure.