
So, why do ResNets work so well?

Let's go through one example that illustrates why ResNets work so well, at least in the sense of how you can make them deeper and deeper without really hurting your ability to at least get them to do well on the training set. And hopefully, as you've understood from the third course in this sequence, doing well on the training set is usually a prerequisite to doing well on your hold-out, or your dev, or your test sets. So, being able to at least train a ResNet to do well on the training set is a good first step toward that. Let's look at an example.

What we saw in the last video was that if you make a network deeper, it can hurt your ability to train the network to do well on the training set. And that's why sometimes you don't want a network that is too deep. But this is not true, or at least is much less true, when you're training a ResNet. So let's go through an example.

Let's say you have X feeding into some big neural network, which outputs some activation a[l]. Let's say for this example that you are going to modify the neural network to make it a little bit deeper. So, take the same big NN, which outputs a[l], and we're going to add a couple of extra layers to this network; let's add one layer there and another layer there, and have it output a[l+2]. Only let's make this a ResNet block, a residual block with that extra shortcut. And for the sake of our argument, let's say that throughout this network we're using the ReLU activation function, so all the activations are going to be greater than or equal to zero, with the possible exception of the input X. Right, because the ReLU activation outputs numbers that are either zero or positive.

Now, let's look at what a[l+2] will be. To copy the expression from the previous video, a[l+2] will be ReLU applied to z[l+2] plus a[l], that is, a[l+2] = g(z[l+2] + a[l]), where this addition of a[l] comes from the shortcut, from the skip connection that we just added. And if we expand this out, this is equal to g(w[l+2] a[l+1] + b[l+2] + a[l]), since z[l+2] = w[l+2] a[l+1] + b[l+2]. Now notice something: if you are using L2 regularization, or weight decay, that will tend to shrink the value of w[l+2]. If you are applying weight decay to b, that will also shrink it, although I guess in practice sometimes you do and sometimes you don't apply weight decay to b; but w is really the key term to pay attention to here. And if w[l+2] is equal to zero, and let's say for the sake of argument that b[l+2] is also equal to zero, then these terms go away because they're equal to zero, and you're left with g(a[l]), which is just equal to a[l], because we assumed we're using the ReLU activation function. All of the activations are non-negative, so g(a[l]) is the ReLU applied to a non-negative quantity, and you just get back a[l].
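To make this concrete, here is a small numerical sketch (mine, not from the lecture) of a two-layer residual block with ReLU activations. The weight and bias names are illustrative; the point is that when weight decay drives the second layer's parameters to zero, the block reduces to the identity:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two added layers with a skip connection: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(W1 @ a_l + b1)   # first added layer
    z_l2 = W2 @ a_l1 + b2        # second added layer (pre-activation)
    return relu(z_l2 + a_l)      # skip connection adds a[l] back before the ReLU

# a[l] is non-negative because it is itself a ReLU output
a_l = np.array([1.0, 0.5, 2.0])
W1 = np.random.randn(3, 3)

# With w[l+2] = 0 and b[l+2] = 0, the block outputs exactly a[l]:
out = residual_block(a_l, W1, np.zeros(3), np.zeros((3, 3)), np.zeros(3))
print(np.allclose(out, a_l))     # True: the block has learned the identity
```

So no matter what the first added layer computes, zeroed-out final-layer weights make a[l+2] equal to a[l].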

So, what this shows is that the identity function is easy for a residual block to learn. It's easy to get a[l+2] equal to a[l] because of this skip connection. And what that means is that adding these two layers to your neural network doesn't really hurt your network's ability to do as well as the simpler network without these two extra layers, because it's quite easy for it to learn the identity function and just copy a[l] to a[l+2], despite the addition of these two layers. And this is why adding two extra layers, adding this residual block somewhere in the middle or at the end of this big neural network, doesn't hurt performance. But of course our goal is not just to not hurt performance, it's to help performance, and so you can imagine that if all of these hidden units actually learn something useful, then maybe you can do even better than learning the identity function.

And what goes wrong in very deep plain nets, in very deep networks without these residual or skip connections, is that when you make the network deeper and deeper, it's actually very difficult for it to choose parameters that learn even the identity function, which is why a lot of layers end up making your result worse rather than better. And I think the main reason the residual network works is that it's so easy for these extra layers to learn the identity function that you're kind of guaranteed it doesn't hurt performance, and then a lot of the time you maybe get lucky and it even helps performance. At least it's easier to start from a decent baseline of not hurting performance, and gradient descent can only improve the solution from there.

So, one more detail in the residual network that's worth discussing: with this addition here, we're assuming that z[l+2] and a[l] have the same dimension. And so what you see in ResNets is a lot of use of same convolutions, so that the dimension of this is equal to the dimension, I guess, of this layer, or of the output layer, so that we can actually do this shortcut connection. Because the same convolution preserves dimensions, it makes it easier for you to carry out this shortcut and then carry out this addition of two equal-dimension vectors.

In case the input and output have different dimensions, so for example, if a[l] is 128-dimensional and z[l+2], and therefore a[l+2], is 256-dimensional, what you would do is add an extra matrix, call it Ws, over here, and Ws in this example would be a 256 by 128 dimensional matrix. So then Ws times a[l] becomes 256-dimensional, and this addition is now between two 256-dimensional vectors. And there are a few things you could do with Ws: it could be a matrix of parameters that you learn, or it could be a fixed matrix that just implements zero padding, that takes a[l] and zero-pads it to be 256-dimensional. Either of those versions, I guess, could work.
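As a sketch of those two options, using the lecture's 128-to-256 example (the variable names and initializations are mine, for illustration only):

```python
import numpy as np

a_l = np.random.rand(128)           # a[l] is 128-dimensional
z_l2 = np.random.rand(256)          # z[l+2] is 256-dimensional

# Option 1: Ws is a learned 256 x 128 parameter matrix.
Ws_learned = np.random.randn(256, 128) * 0.01
shortcut = Ws_learned @ a_l         # now 256-dimensional

# Option 2: Ws is a fixed matrix that just implements zero padding:
# copy a[l] into the first 128 entries, leave the remaining 128 as zeros.
Ws_pad = np.zeros((256, 128))
Ws_pad[:128, :] = np.eye(128)
shortcut = Ws_pad @ a_l

# Either way, the skip-connection addition is now between equal dimensions:
a_l2 = np.maximum(0, z_l2 + shortcut)
print(a_l2.shape)                   # (256,)
```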

So finally, let's take a look at ResNets on images. These are images I got from the paper by He et al. This is an example of a plain network, in which you input an image and then have a number of conv layers until eventually you have a softmax output at the end. To turn this into a ResNet, you add those extra skip connections. And I'll just mention a few details: there are a lot of three by three convolutions here, and most of these are three by three same convolutions, and that's why you're adding equal-dimension feature vectors. So rather than fully connected layers, these are actually convolutional layers, but because they're same convolutions, the dimensions are preserved, and so the z[l+2] plus a[l] addition makes sense.
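A quick way to check why same convolutions make the addition valid: the spatial output size of a convolution is floor((n + 2p - f) / s) + 1, and a same convolution at stride 1 uses padding p = (f - 1) / 2, so the output size equals the input size. A small helper (mine, not from the lecture, with an assumed 56x56 feature map):

```python
def conv_output_size(n, f, p, s):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

n = 56  # e.g. a hypothetical 56x56 feature map

# A 3x3 same convolution: padding p = (3 - 1) / 2 = 1, stride 1.
print(conv_output_size(n, f=3, p=1, s=1))   # 56: dimensions preserved,
                                            # so z[l+2] + a[l] is well-defined

# By contrast, stride 2 with no padding shrinks the map,
# which is when the Ws adjustment from the previous slide is needed.
print(conv_output_size(n, f=3, p=0, s=2))   # 27
```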

And similar to what you've seen in a lot of ConvNets before, you have a bunch of convolutional layers, and then occasionally there are pooling layers, or pooling-like layers, as well. Whenever one of those happens, you need to make an adjustment to the dimension, which we saw on the previous slide; you can do that with the matrix Ws. And then, as is common in these networks, you have a pooling layer, and at the end you now have a fully connected layer that makes a prediction using a softmax.

So that's it for ResNets. Next, there's a very interesting idea behind using neural networks with one by one filters, one by one convolutions. So, how can you use a one by one convolution? Let's take a look in the next video.
