0:00

In this video, I'm going to talk about the Bayesian interpretation of weight penalties. In the full Bayesian approach, we try to compute the posterior probability of every possible setting of the parameters of a model. But there's a much reduced form of the Bayesian approach, where we simply say: I'm going to look for the single set of parameters that is the best compromise between fitting my prior beliefs about what the parameters should be like, and fitting the data I've observed. This is called Maximum a Posteriori learning, and it gives us a nice explanation of what's really going on when we use weight decay to control the capacity of a model.

I'm now going to talk a bit about what's really going on when we minimize the squared error during supervised maximum likelihood learning. Finding the weight vector that minimizes the squared residuals, that is, the differences between the target values and the values predicted by the net, is equivalent to finding a weight vector that maximizes the log probability density of the correct answer.

1:14

In order to see this equivalence, we have to assume that the correct answer is produced by adding Gaussian noise to the output of the neural net. So the idea is, we make a prediction by first running the neural net on the input to get the output, and then adding some Gaussian noise. And then we ask: what's the probability that, when we do that, we get the correct answer? So the model's output is the center of a Gaussian, and what we're interested in is the target value having high probability under that Gaussian, because the probability of producing the value t, given that the network gives an output of y, is just the probability density of t under a Gaussian centered at y.

So the math looks like this: let's suppose that the output of the neural net on training case c is y_c, and this output is produced by applying the weights W to the input c. The probability that we'll get the correct target value when we add Gaussian noise to that output y_c is given by a Gaussian centered on y_c. So we're interested in the probability density of the target value under a Gaussian centered at the output of the neural net. And on the right here, we have that Gaussian distribution, with mean y_c. We also have to assume some variance, and that variance will be important later.

If we now take logs and put in a minus sign, we see that the negative log probability density of the target value t_c, given that the network outputs y_c, is a constant that comes from the normalizing term of the Gaussian, plus the log of that exponential with the minus sign in, which is just (t_c - y_c)^2 divided by twice the variance of the Gaussian. So what you see is that, if our cost function is the negative log probability of getting the right answer, that turns into minimizing a squared distance. It's helpful to know that whenever you see a squared error being minimized, you can make a probabilistic interpretation of what's going on, and in that probabilistic interpretation, you'll be maximizing the log probability under a Gaussian.

So the proper Bayesian approach is to find the full posterior distribution over all possible weight vectors.

4:06

If there's more than a handful of weights, that's hopelessly difficult when you have a non-linear net. Bayesians have a lot of ways of approximating this distribution, often using Monte Carlo methods. But for the time being, let's try and do something simpler. Let's just try to find the most probable weight vector: the single setting of the weights that's most probable given the prior knowledge we have and given the data. So what we're going to try and do is find an optimal value of W by starting with some random weight vector, and then adjusting it in the direction that improves the probability of that weight vector given the data. It will only be a local optimum.

4:55

Now, it's going to be easier to work in the log domain than in the probability domain, so if we want to minimize a cost, we'd better use negative log probabilities. Just an aside about why we maximize sums of log probabilities, or minimize sums of negative log probabilities: what we really want to do is maximize the probability of the data, which means maximizing the product of the probabilities of producing all the target values that we observed on all the different training cases.

5:30

If we assume that the output errors on different cases are independent, we can write that down as the product over all the training cases of the probability of producing the target value t_c, given the weights. That is, the product of the probability of producing t_c, given the output that we're going to get from our network if we give it input c and it has weights W.

5:58

The log function is monotonic, so it can't change where the maxima are. So instead of maximizing a product of probabilities, we can maximize a sum of log probabilities, and that typically works much better on a computer; it's much more stable. So we maximize the log probability of the data given the weights, which is simply maximizing the sum over all training cases of the log probability of the output for that training case, given the input and given the weights.

In Maximum a Posteriori learning, we're trying to find the set of weights that optimizes the trade-off between fitting our prior and fitting the data. So that's Bayes' theorem.

6:44

If we take negative logs to get a cost, we get that the negative log probability of the weights given the data is the negative log of the prior term, plus the negative log of the data term, plus an extra term. That last extra term is an integral over all possible weight vectors, and so it doesn't depend on W; we can ignore it when we're optimizing W. The term that depends on the data is the negative log probability of the data given W, and that's our normal error term.

7:21

And the term that only depends on W is the negative log probability of W under its prior. Maximizing the log probability of a weight is related to minimizing a squared distance in just the same way as maximizing the log probability of producing the correct target value is related to minimizing a squared distance. So minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior.

So here's a Gaussian. It's got a mean of zero, and we want to maximize the probability of the weights, or the log probability of the weights, and to do that, we obviously want w to be close to the mean of zero. The equation for the Gaussian is just like this, where the mean is zero so we don't have to put it in. The negative log probability of w is then the squared weight scaled by twice the variance, plus a constant that comes from the normalizing term of the Gaussian and isn't affected when we change w.

So finally we can get to the Bayesian interpretation of weight decay, or weight penalties. We're trying to minimize the negative log probability of the weights given the data, and that involves minimizing a term that depends on the data, namely how well we fit the targets, and a term that depends only on the weights. The term that depends on the data is derived from the log probability of the data given the weights; if we assume Gaussian noise is added to the output of the model to make the prediction, then that negative log probability is the squared distance between the output of the net and the target value, scaled by twice the variance of that Gaussian noise. Similarly, if we assume we have a Gaussian prior for the weights, the negative log probability of a weight under the prior is the squared value of that weight, scaled by twice the variance of the Gaussian prior.

9:40

So now let's take that equation and multiply through by twice the variance of the data noise, 2 sigma_D^2. We get a new cost function. The first term, when we multiply through, turns into simply the sum over all training cases of the squared difference between the output of the net and the target; that's the squared error we typically minimize in a neural net. The second term becomes the ratio of two variances, sigma_D^2 / sigma_W^2, times the sum of the squares of the weights. And so what you see is that the ratio of those two variances is exactly the weight penalty.

So we initially thought of weight penalties as just a number you make up to try and make things work better, where you fix the value of the weight penalty by using a validation set. But now we see that if we make this Gaussian interpretation, where we have a Gaussian prior and we have a Gaussian model of the relation of the output of the net to the target, then the weight penalty is determined by the variances of those Gaussians. It's just the ratio of those variances. It's not an arbitrary thing at all within this framework.
