Previously, we've gone through and we've defined different types of networks,

we started introducing what a deep network actually is,

what a deep neural network actually is and

we've set up and talked about what they can do.

But we haven't yet really talked about how we can actually learn a deep network.

In this video today,

we're going to be talking about how we actually can define learning,

and this is going to mean how can we define learning mathematically,

and this will allow us to set this up as

a problem and allow us to actually learn a network.

So, if you recall,

we had this previous schematic,

where we are showing that we wanted to combine our training set or training data,

which is shown here on the left,

where we have N different training examples,

these are independent data examples,

where we have features or covariates and we have

a particular outcome that's shown here in the y's, and what we want to be able to do

is create a network that can actually predict these outcomes given the features.

So, on the right we're showing here the logistic regression model or network,

where we're just showing we have all these features and they're combining

linearly to construct the zi and we give the equation there,

this is the exact same equation that you've seen several times

already and then we're going to convert it into a probability sigma zi.
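
As a rough illustration of the model just described, the sketch below builds z as a linear combination of the features and then passes it through the sigmoid to get a probability. The function names `sigmoid` and `predict_proba`, the intercept `b0`, and the numbers are assumptions made for this example; they are not taken from the slide.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a probability between 0 and 1."""
    # Two branches so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x: list[float], b: list[float], b0: float = 0.0) -> float:
    """Return sigma(z), where z = b0 + sum_j b[j] * x[j]."""
    z = b0 + sum(bj * xj for bj, xj in zip(b, x))
    return sigmoid(z)


# Hypothetical example: two features and made-up parameters.
# Here z = 0.5 * 1.0 + (-0.25) * 2.0 = 0, and sigma(0) = 0.5.
p = predict_proba(x=[1.0, 2.0], b=[0.5, -0.25], b0=0.0)
print(p)
```

Changing the parameters b changes z, and therefore changes the predicted probability for the same features.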

We previously stated that

we're just going to combine our training set with

this network to learn the parameters, and we're going to have our learned parameters here.

But we need to talk about how we actually do this,

how do we actually learn the parameters,

given our training data and our network that we want to learn.

So, the fundamental question that we're going to talk about today

is how do we actually do this?

In particular, given a large amount of data and a model or network that we want to fit,

how can we both efficiently and effectively learn the model parameters,

and we want to have these parameters be effective,

we want them actually to predict fairly accurately.

But we also want them to be learned efficiently.

We have a finite amount of computational resources,

we want to use these resources efficiently and this is going to be a big deal

if you want to use this in something like a cell phone or learn these on the fly.

Specifically, let's talk about what it is we're trying to

do when we're trying to do learning. Succinctly,

we just want to learn parameters that give us the best performance.

All this means is, given data,

let's find the best parameters b for that data.

But, we're going to run into our first problem here,

we haven't actually defined performance.

So, how can we actually go through and define the performance of this network?

So, we have this network on the right and we say it

can make predictions and we're going to find these parameters,

but how are we actually going to define how well it's doing?

So, the way this actually happens is we're going to

use something called Empirical Risk Minimization.

Empirical Risk Minimization is jargon in

the machine learning field and we'll talk about what it actually means.

But what we're going to have here,

is we're going to have a loss function and a loss function is going to take two inputs.

It's going to have a true value or true outcome and it's also going to have

our prediction and this loss function is going to define a penalty on a poor prediction.

If we have a really great prediction,

we're not really going to pay a loss,

but whenever we have poor predictions we're going to pay

a big loss, and what we want to do is

minimize the average loss. We still need to define what this loss actually is,

but if we know that we have this loss,

all we're going to want to do

is just minimize the average loss.

If the loss is a penalty for doing badly,

we just want to minimize it.

So, mathematically we can state how we actually want to find our parameters b.

So, in this equation here on the bottom,

we have that b star,

the optimal parameters,

is going to be the argument of the minimum over b of one over N times

the sum over all of our data points, where

for each data point we have our loss function

applied to our true label and our guess, right?

So, this guess, the sigma zi here,

comes from

our logistic regression model. The parameters b are a little bit hidden here,

but zi depends on the parameters b,

so our parameters b actually determine

what we're guessing for each of these.
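
Written out, the spoken statement above could look like the following. The symbol ℓ for the loss function and the index i over data points are notational choices for this sketch; the slide may use different symbols.

```latex
b^{*} = \arg\min_{b} \; \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(y_i, \sigma(z_i)\bigr)
```

Here z_i depends on b through the model, which is why b does not appear explicitly inside the sum.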

So, what we want to say,

is now that we have some sort of loss function,

we just want to find the parameters that give us

the minimum average loss and this is our mathematical statement of what it is,

but we still need to define what our loss function actually is.

So, for our loss function,

just as a reminder, sigma zi

is defined as

our predicted probability and yi is our true label or outcome.

This is something that we have from our training set.

Our network turns the features into this probability by combining them with our parameters b,

and this was shown in that network model.

So, we can view this loss function

as the negative log-likelihood.

The log-likelihood is widely used in statistics;

generally in statistics, we're trying to maximize the log-likelihood.

When we talk about losses we're trying to minimize a loss,

so we just take the negative of the log-likelihood.

So, we say that, the loss of our true outcome and

our predicted probability is just

equivalent to its negative log-likelihood and we're just writing it in there,

minus log p of yi given our predicted probability,

and the mathematical form here for this negative log-likelihood,

when we're dealing with any binary problem is given below here.

So, we have this loss function of y and our prediction,

sigma z, and it's going to be equal to minus

y times log of sigma z, minus (1 minus y) times log of (1 minus sigma z).
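
To see where this form comes from, one way to write it out, assuming y is either 0 or 1 and treating sigma z as the predicted probability that y is 1, is the following. The symbol ℓ for the loss is a notational choice for this sketch.

```latex
\begin{align*}
p\bigl(y \mid \sigma(z)\bigr) &= \sigma(z)^{y}\,\bigl(1 - \sigma(z)\bigr)^{1 - y} \\
\ell\bigl(y, \sigma(z)\bigr) &= -\log p\bigl(y \mid \sigma(z)\bigr) \\
&= -y \log \sigma(z) - (1 - y)\log\bigl(1 - \sigma(z)\bigr)
\end{align*}
```

When y is 1 only the first term survives, and when y is 0 only the second term survives.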

So, this is a somewhat opaque mathematical form if you've never seen it before.

So, we're going to go through and we're actually going to draw what

this equation is actually doing to try to get some understanding about what's going on.

So, what we have here is just a visualization of the logistic loss function,

so the logistic or cross-entropy loss here

is that equation that we have defined,

we defined this on our previous slide,

this is just a reminder of what it is.

But there are two possible outcomes for what y is,

y can either be positive or y can be negative, right?

If it's positive it's a one and if it's negative it's a zero,

this is just how we've decided to define y.

So, we can visualize the two cases here.

We can visualize the case where the outcome is

positive, and we're showing this in the dashed blue line.

Let's look at the trend in the dashed blue line here,

which is when we have a positive outcome.

If we're predicting a probability of one for a one,

so we're saying there's a 100 percent chance that there's a one,

then we pay no penalty at all if we get it correct.

But the problem is that if we're overconfident, we're actually going to pay a huge penalty.

So what's going to happen

is, if we're predicting that it's a one and we're predicting this

extremely confidently and it's actually a zero,

we're going to pay a penalty that actually increases to infinity. So we

want to guess the correct answer, but we actually don't want to be overconfident,

and so this is a nice loss function that's going to penalize overconfidence when we're wrong

but still encourage us to get it correct, and it comes from deep properties of statistics.
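
The behavior of the two curves can be checked numerically with a small sketch like the one below. The function name `logistic_loss` and the particular probabilities tried are choices made for this example.

```python
import math


def logistic_loss(y: int, p: float) -> float:
    """Cross-entropy loss for one example.

    y is the true label (0 or 1); p is the predicted probability that y is 1.
    For a binary label this equals -y*log(p) - (1-y)*log(1-p).
    """
    if y == 1:
        return -math.log(p)
    if y == 0:
        return -math.log(1.0 - p)
    raise ValueError("y must be 0 or 1")


# True label is 1: the loss grows as the predicted probability of a one shrinks.
for p in (0.99, 0.9, 0.5, 0.1, 0.01, 0.0001):
    print(f"p = {p:<7} loss when y is 1 = {logistic_loss(1, p):.4f}")

# True label is 0: a confident prediction of a one is heavily penalized.
print(f"p = 0.99    loss when y is 0 = {logistic_loss(0, 0.99):.4f}")
```

The loss is close to zero for confident correct predictions and grows without bound as a confident prediction turns out to be wrong.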

We can talk about what this goal actually looks like for binary classification, right?

So, if we have our optimization goal,

we're going to minimize the average loss.

So, we have this b star here,

the arg min over all possible b of our average loss function,

where the average is taken over all data points.

So, if we have a logistic problem or we're using

logistic regression or anything where we're doing

a binary problem and we're doing it probabilistically,

we can use this logistic loss,

or it's also sometimes called

the cross-entropy loss and we've just defined it there again,

this is the same exact equation you've seen before.

But now we're just going to try to find the b star that

minimizes this loss, and we're defining the loss just like this.
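
A minimal sketch of the quantity being minimized, assuming a logistic regression with a single feature and an intercept. The tiny dataset and the two candidate parameter settings are made up for illustration, and comparing two hand-picked candidates is not an optimization algorithm; it only shows that the average loss ranks parameter choices.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number to a probability; fine for the small z used here."""
    return 1.0 / (1.0 + math.exp(-z))


def average_loss(b0: float, b1: float, xs: list[float], ys: list[int]) -> float:
    """Average cross-entropy loss of the model sigma(b0 + b1 * x) on the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)  # predicted probability that y is 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)


# Made-up training set: larger x goes with y = 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

loss_good = average_loss(0.0, 1.0, xs, ys)   # slope agrees with the data
loss_bad = average_loss(0.0, -1.0, xs, ys)   # slope points the wrong way
print(loss_good, loss_bad)  # the first value is the smaller one
```

Under this objective, the parameters with the lower average loss are the better fit to the training data.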

So, now that we have a precise mathematical statement of what our parameters should be,

we need to come up with

an optimization algorithm that can actually find these parameters.

So, just to recap what we've talked about in this video,

we have our training set,

we have our model or network that we're trying to learn, and we can define this

however we want. Here we're just visualizing

the logistic regression model, and we have these learned parameters,

and in order to learn these parameters we're going to define

this loss function and we're going to define an average loss

and we're going to set up an optimization algorithm

to find these parameters and that's what's going to come up in our next videos.