0:36

So first, let's reconsider the learning problem from a very high level.

The learning problem can be viewed in terms of three different components. The first is the hypothesis space, the class of models that we're interested in considering. The second is the objective function that we're trying to optimize when we're applying the learning algorithm. And the third is the optimization algorithm itself, which aims to optimize the objective function to give us some good model. Pretty much all of the learning algorithms that we've discussed can be viewed in terms of these three components, with just different combinations of hypothesis space, objective function, and optimization algorithm. So let's look at each of these components

in turn.

The hypothesis space is what we're searching over, and there are really two different pieces that we can be searching for: the first is the parameters of the model, and the second is the structure. In some cases we're searching for one, in some we search for the other, and in many cases we're actually searching for both. Depending on which choice we make, this sets up a different learning problem: searching for parameters is a continuous optimization problem, searching for structure is usually a discrete one, and when we're searching for both, it's actually a hybrid search problem.

The second part of the hypothesis space is when we impose a certain set of constraints on what we're searching for, and this can be done for multiple reasons. We can constrain the hypothesis space for computational efficiency: a smaller space is often easier to search, and admits different algorithms that might be computationally more efficient. The second reason is that we might restrict the search space to reduce the model capacity, that is, the complexity of the models that we're considering. That can reduce the risk of overfitting, because we're searching over a smaller set of possibilities and are therefore less likely to overfit random statistics in the training set. And a third form of constraint is when we have some kind of prior knowledge that we want to encode, which allows us to guide the learning algorithm towards models that are more reasonable than in a totally unconstrained way.

The second piece is the objective function. Perhaps the most commonly used objective function is a penalized likelihood. A penalized likelihood has a log-likelihood component, which measures the log-likelihood of both our graph structure and the parameters relative to the training set. But in most cases that has to be accompanied by a second term, a regularizer, which serves to guide the learning toward simpler models that are less likely to overfit the data. That regularizer can include one or both of two pieces. The first is a parameter prior, which usually guides the parameters towards smoother models: for example, in the case of MRFs we used an L2 or L1 regularizer, and in the case of Bayesian networks we often used some kind of Dirichlet prior. Those are some examples that we've looked at, and both tend to smooth the parameters and avoid overfitting to the statistics of the training set. The second form of regularization is a structure complexity penalty, which of course is only relevant if we're also searching over structures, and which aims to push us towards models that have fewer parameters or fewer edges, again reducing overfitting.
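In symbols, the penalized-likelihood objective just described has the general shape (a schematic form; the exact regularizer depends on the model class):

```latex
\text{score}(\mathcal{G}, \theta : \mathcal{D}) \;=\;
  \log P(\mathcal{D} \mid \mathcal{G}, \theta)
  \;-\; \underbrace{\lambda \,\|\theta\|_p^p}_{\text{parameter prior}}
  \;-\; \underbrace{c \cdot \mathrm{dim}(\mathcal{G})}_{\text{structure penalty}}
```

Here $\|\theta\|_p^p$ with $p = 1$ or $p = 2$ is the parameter regularizer, and $\mathrm{dim}(\mathcal{G})$ counts edges or independent parameters; choosing $c = \tfrac{1}{2}\log M$ for $M$ training instances gives the familiar BIC structure penalty.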

An alternative paradigm, which we also discussed, is what we called the Bayesian score. The Bayesian score is a score that's used when we're really searching over graph structures alone and integrating out over parameters. So this is a case where the hypothesis space is really just a space of structures, and the parameters are implicit, or integrated out. The Bayesian score, which is the log of the probability of the graph given the data, is equal to the log of the probability of the data given the graph, which as a reminder is called the marginal likelihood, plus the log of the graph prior. Now, importantly, just as a reminder, the graph prior is a regularizer: log P(G) is a regularizer over graphs. But log P(D given G), that is, the log of the marginal likelihood, incorporates within it a regularization term as well, as we've seen before, and also serves to reduce the overfitting of the model to the data.
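Written out, the decomposition just described is:

```latex
\log P(\mathcal{G} \mid \mathcal{D})
  \;=\; \log P(\mathcal{D} \mid \mathcal{G}) \;+\; \log P(\mathcal{G}) \;-\; \log P(\mathcal{D}),
\qquad
P(\mathcal{D} \mid \mathcal{G}) \;=\; \int P(\mathcal{D} \mid \mathcal{G}, \theta_{\mathcal{G}})\,
  P(\theta_{\mathcal{G}} \mid \mathcal{G})\, d\theta_{\mathcal{G}}
```

The $\log P(\mathcal{D})$ term is the same for every structure, so it can be ignored when comparing graphs.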

The final component is the optimization algorithm, which of course depends both on the space that we're trying to optimize over and on the objective that we're trying to optimize. If we're optimizing purely over a continuous space, then in some cases we're very lucky. We showed, for example, that for Bayesian networks with multinomial CPDs we can actually optimize the likelihood in closed form, finding a unique optimum

5:56

for the parameters. And this also happens when we have a penalized likelihood, or even a full Dirichlet distribution over the parameters. In other cases we're not so lucky, and we need to somehow optimize a function for which no closed-form solution exists. In this case we use gradient ascent, or perhaps some other second-order method, that allows us to hill-climb over the likelihood function. We saw that this happens, for example, both for MRF learning and for learning with missing data, for both Bayesian networks and MRFs. Gradient ascent in its naive form is a fairly slow method, but there are second-order methods that build on top of gradient ascent, for example conjugate gradient or L-BFGS, that use similar computations but converge much faster.
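To make the hill-climbing concrete, here is a toy sketch (an invented example, not from the lecture): gradient ascent on the L2-penalized log-likelihood of a single Bernoulli parameter, written on the log-odds scale so the problem is unconstrained.

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def fit_bernoulli(heads, tails, lam=0.01, lr=0.1, steps=2000):
    """Gradient ascent on  heads*log(p) + tails*log(1-p) - lam*w**2,
    where p = sigmoid(w); the log-odds w make the problem unconstrained."""
    w = 0.0
    for _ in range(steps):
        p = sigmoid(w)
        grad = heads * (1 - p) - tails * p - 2 * lam * w  # exact derivative in w
        w += lr * grad
    return sigmoid(w)

p_hat = fit_bernoulli(heads=7, tails=3)  # shrunk slightly below the MLE of 0.7
```

In practice one would hand the same objective and gradient to an off-the-shelf routine such as an L-BFGS implementation, which typically converges in far fewer iterations.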

Finally, there is the expectation maximization algorithm, which is specifically geared for learning with missing data, that is, for optimizing the log-likelihood function in cases where the log-likelihood is multimodal because of the missing data. And we talked about some of the tradeoffs of this method, for example its susceptibility to local optima.
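As a concrete sketch of EM (the classic two-coin example, not taken from the lecture; the data and names are invented): we observe sequences of coin flips but not which of two biased coins produced each sequence, and EM alternates between soft-assigning sequences to coins (E-step) and re-estimating each coin's bias from its weighted counts (M-step).

```python
import random

def em_two_coins(counts, n_flips, steps=50):
    """EM for a two-coin Bernoulli mixture.

    counts: list of heads-counts, one per observed sequence of n_flips tosses.
    The identity of the coin behind each sequence is the missing data.
    """
    p_a, p_b = 0.3, 0.6                      # arbitrary asymmetric initialization
    for _ in range(steps):
        # E-step: posterior responsibility of coin A for each sequence
        a_heads = a_tails = b_heads = b_tails = 0.0
        for h in counts:
            t = n_flips - h
            lik_a = p_a ** h * (1 - p_a) ** t
            lik_b = p_b ** h * (1 - p_b) ** t
            r = lik_a / (lik_a + lik_b)
            a_heads += r * h
            a_tails += r * t
            b_heads += (1 - r) * h
            b_tails += (1 - r) * t
        # M-step: weighted maximum-likelihood re-estimate of each bias
        p_a = a_heads / (a_heads + a_tails)
        p_b = b_heads / (b_heads + b_tails)
    return p_a, p_b

random.seed(0)
true_bias = [0.2, 0.8]
counts = [sum(random.random() < true_bias[i % 2] for _ in range(20))
          for i in range(100)]
p_a, p_b = em_two_coins(counts, n_flips=20)
```

With data this well separated, EM recovers biases close to 0.2 and 0.8; with a symmetric initialization or more overlapping coins it can instead stall in a poor local optimum, which is exactly the tradeoff discussed above.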

If instead we're trying to do discrete optimization, searching over the structure space, again in some cases we can be lucky. When we're optimizing over tree structures, we saw a very nice, simple, computationally efficient algorithm, finding a maximum-weight spanning tree, that allows us to find the optimal scoring tree very efficiently, in polynomial time.
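A minimal sketch of that tree-learning idea, in the spirit of the Chow–Liu algorithm (the data and names are invented for illustration): weight every candidate edge by the empirical mutual information of its endpoints, then keep a maximum-weight spanning tree, here via Kruskal's algorithm.

```python
import math
import random
from collections import Counter

def mutual_info(samples, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(samples)
    p_ij = Counter((s[i], s[j]) for s in samples)
    p_i = Counter(s[i] for s in samples)
    p_j = Counter(s[j] for s in samples)
    mi = 0.0
    for (a, b), c in p_ij.items():
        p = c / n
        mi += p * math.log(p / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

def max_spanning_tree(n_vars, weight):
    """Kruskal's algorithm, keeping the heaviest edges that don't form a cycle."""
    parent = list(range(n_vars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = sorted(((weight(i, j), i, j)
                    for i in range(n_vars) for j in range(i + 1, n_vars)),
                   reverse=True)
    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

# Toy data: X1 copies X0 exactly, X2 is independent noise,
# so the highest-mutual-information edge is (0, 1).
random.seed(1)
samples = [(x0, x0, random.randint(0, 1))
           for x0 in (random.randint(0, 1) for _ in range(500))]
tree = max_spanning_tree(3, lambda i, j: mutual_info(samples, i, j))
```

The resulting tree has the required two edges and links the two perfectly correlated variables, as expected.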

In other cases the space is more complicated, and we cannot do an optimization that's guaranteed to give us the optimal answer. In that case we typically end up doing some kind of local hill climbing, where we do things like add, delete, or move edges, which gradually gets us to a better and better optimum. There are also algorithms that take larger steps in the space.

And finally, an interesting combination is when we have to search over both the discrete and the continuous parts of the space together, when we're trying to optimize over both parameters and structure. That ends up being quite complicated in many cases, because when you take a step on the discrete side, you also have to optimize the parameters for the new structure before you can score that structure to figure out whether it was a good move to take. So this tends to be computationally expensive in many cases, and there are some tricks that one can do to reduce the cost.

Now, every learning algorithm that we've discussed has some set of hyperparameters that we need to set before we can run the learning algorithm. These include, for example: if we're using a Dirichlet prior over the parameters, the equivalent sample size, and perhaps the prior itself more broadly, is usually a hyperparameter we need to set. If we're doing CRF or MRF learning, we need to set the regularization strength for the L1 or L2 prior. If we're doing expectation maximization, we need to figure out when to stop; we already talked about the fact that early stopping is a useful strategy, because it reduces overfitting. For structure learning, we need to figure out the strength of the structure penalty: how much do we want to penalize models that are more complex? If we're doing, say, MRF or CRF learning, we have to figure out how many features we want to add, and which of those we should add initially. And when we're doing learning with latent variables, there's the decision of how many values a latent variable ought to take; for example, when we're doing clustering, how many clusters should we consider? Each of these is an important decision that we need to think about, and the question is: where does this come from? Do we just invent these from thin air, or where do we get the answers to this?

One bad answer, something we shouldn't do, is to try and estimate these on the training set, that is, to find the hyperparameters that maximize the objective function on the training set. If we do that, we are effectively calling for a choice of hyperparameters that is going to overfit to the statistics of the training set. So a common strategy, perhaps the one that's most commonly used, is to have a separate validation set on which these hyperparameters are set. This means that we train on the training set, then evaluate on the validation set, and pick the hyperparameters that give us the best evaluation on the validation set, not on the training set.

Now, if computational cost allows, it's good to do this not with a single training set and a single validation set. At that point we usually end up doing something like cross-validation, where we split the training set into a training set and a validation set in several different ways, pick the hyperparameters that give us the best cross-validation scores, and then use those hyperparameters to train on the entire training set. This of course gives rise to the question of what it is that we're actually evaluating.
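The selection loop just described can be sketched as follows (a deliberately tiny invented example: the "model" here is a mean estimate shrunk toward zero, with the shrinkage strength playing the role of the regularization hyperparameter):

```python
import random

def fit_mean(ys, lam):
    """Toy 'model': the mean of ys, shrunk toward 0 by regularizer lam
    (the minimizer of  sum((y - m)**2) + lam * m**2)."""
    return sum(ys) / (len(ys) + lam)

def cv_error(ys, lam, k=5):
    """Average squared held-out error of fit_mean over k folds."""
    folds = [ys[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        m = fit_mean(train, lam)
        total += sum((y - m) ** 2 for y in folds[i]) / len(folds[i])
    return total / k

random.seed(0)
data = [3.0 + random.gauss(0, 1) for _ in range(40)]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(data, lam))
final_model = fit_mean(data, best_lam)  # retrain on the full training set
```

Note the last line: after the hyperparameter is chosen by cross-validation, the model is retrained on the entire training set, exactly as described above.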

So what are some of the evaluation criteria that we might use to figure out whether a model is good or not? One obvious one is log-likelihood on a test set: if we're actually trying to find a model that gives us good predictive performance on unseen data, then the log-likelihood of the model on the test set is a good way of measuring that. But in many cases that's not the objective we actually care about, and so we might often evaluate the learned model using a task-specific objective instead. For example, segmentation accuracy if we're trying to do image segmentation, or, for speech recognition, the word error rate (WER), would be task-specific objectives on which we can evaluate a model. And if you really want to get a high-performing model, it's useful to think about optimizing the objective that you actually care about, in the context of the task. Finally, when we're trying to do knowledge discovery, it's often useful to think about the extent to which the model that you learned matches prior knowledge. So if you're doing clustering, for example, and you have some notion of clusters that you think exist in your data, trying to see whether the clustering algorithm's output is more or less compatible with those is a good sanity check on the model. Similarly, if we have some notion of edges that ought to exist in the network, trying to see whether the learned structure matches those is also a useful thing. So now we've learned a model, and let's think

about some possible error modes that might occur, how we might diagnose them, and how we might fix them.

One error mode that occurs often is under-fitting. Under-fitting means that the model is not sufficiently expressive to capture the patterns in the data. We can recognize it when the training and the test performance are both low, which suggests that we just haven't put enough expressive power into the model; it can't even get good performance on the training set. At that point, we have several possible solutions. We can decrease the amount of regularization, because maybe the regularization is driving us towards a regime where we can't fit the patterns in the training set. We can reduce structure penalties, to allow more edges and more interactions to be learned. And then a very important one is to add features into the model, often driven by a very targeted error analysis, in which we look at the errors the model is making and say: wait a minute, in this kind of instance I would have expected to see this answer, and I didn't; how do we improve the model by adding features so as to give us better results for this set of instances?

The complementary error mode is over-fitting. Over-fitting can be diagnosed in cases where the training performance is high, often ridiculously high, whereas the test performance is low. That indicates the model has fit patterns in the training data that are really just statistical noise and don't generalize to other, unseen instances. In this case the solutions are complementary: you increase the regularization, you impose capacity constraints by forcing the model to pick within a subclass, and you reduce the set of features, so as to eliminate or reduce the capacity of

the model to overfit.

A different error mode happens when our optimization algorithm is not doing a good job of optimizing the objective that we've laid out for it, that is, when the optimization is not converging to a good, or the global, optimum. People tend to associate this error mode with cases where the problem is multimodal, so that we're converging to a poor local optimum. But in fact, depending on how carefully we design our optimization algorithm, this can happen even when the problem is convex; it can happen, for example, when we don't set our learning rates appropriately and we never actually converge to an optimum. A way to try and diagnose this is to compare models learned with different learning rates and different random initializations, and see whether these models give rise to very different values of the objective function. If they do, it suggests that we should be very careful about how we optimize, and maybe try multiple random starting points or set the learning rate more carefully.
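One way to run that diagnostic (a toy illustration on an invented multimodal function): optimize the same objective from several starting points and compare the final objective values; a large spread indicates sensitivity to initialization.

```python
import math

def objective(x):
    """A deliberately multimodal 1-D function: many local maxima."""
    return math.sin(3 * x) - 0.1 * x * x

def gradient(x):
    return 3 * math.cos(3 * x) - 0.2 * x

def ascend(x, lr=0.01, steps=3000):
    """Plain gradient ascent from a given starting point."""
    for _ in range(steps):
        x += lr * gradient(x)
    return x

starts = [-3.0, -1.0, 0.0, 1.0, 3.0]
finals = [objective(ascend(x0)) for x0 in starts]
spread = max(finals) - min(finals)  # large spread => runs found different optima
```

Each run converges to the local maximum nearest its start, so the final objective values differ widely; seeing the same pattern with a real model is the signal to try multiple restarts or tune the learning rate.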

This previous analysis called for a comparison of the objective function, that is, the objective that we're trying to optimize. A very different error mode happens when our objective function is not actually a good match for the performance metric that we're trying to optimize. So, for example, one important thing to check for is the case where we have two models, say learned in some way as we discussed on the previous slide, where the objective function for model one is significantly better than the objective function for model two, yet the performance objective that we care about, say segmentation accuracy, is considerably lower for model one than for model two. This suggests a case where the objective we are optimizing is not a good surrogate for the performance that we care about, which happens quite often when we're using a regularized likelihood as an objective, since a regularized likelihood is not necessarily a good match for the performance objective that we actually care about.

And an important mistake that people often make at this point is to say: well, let's try and change the optimization algorithm to give us better results. That is a bad idea, because you can't fix a bug in the objective function by breaking the optimization algorithm. The right thing to do is to redesign the objective function to better match the desired performance criterion, so as to get a better alignment between those

two.

So to summarize, let's think about the typical learning loop that we use when trying to apply machine learning in practice. First, we design the model template, which tells us, for example, the set of random variables that are involved, whether the model is directed or undirected, and perhaps some notion of the kinds of features that we might include. We then select the hyperparameters via cross-validation on the training set, and train on the training set with the chosen hyperparameters. We then evaluate performance on a held-out set, to see which of the error modes we've discussed on the previous slides, if any, is actually occurring. This gives rise to a process of error analysis: what kinds of errors are we seeing, and how do we redesign the model, the objective function, or the optimization algorithm in order to address those problems? That gives rise to a new model template, and perhaps some other changes, and the cycle then repeats. When we're done, and only when we're done, we can report results on a separate test set, which is not the same as the held-out set. Because we've effectively used that held-out set to set a lot of model parameters, reporting results on the held-out set would be misleadingly optimistic with regard to the performance of the learned model on real, unseen data; the held-out set is not actually unseen data at this point.
