0:02

Hi, my name is Andrey. This week, you will learn how to solve computer vision tasks with neural networks. You already know about the multilayer perceptron, which can have lots of hidden layers. In this video, we will introduce a new layer of neurons specifically designed for image input.

What is image input? Let's take the example of a grayscale image. It is actually a matrix of pixels, or picture elements. The dimensions of this matrix are called the image resolution. For example, it can be denoted as 300 by 300. Each pixel stores its brightness, or intensity, ranging from 0 to 255. Zero intensity corresponds to black. You can see this in the example of a giraffe image: we have a close-up of the left ear, where dark colors correspond to values roughly around zero, and light colors are close to 255. Color images store pixel intensities for three different channels: red, green, and blue.

You know that neural networks like their inputs normalized. Now you know that an image is just a matrix of numbers, so let's normalize them: divide by 255 and subtract 0.5. This way, we will have roughly zero mean and our numbers are normalized.
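As a quick sketch (not from the video), this normalization might look like the following; the image array here is made up for illustration:

```python
import numpy as np

# Hypothetical 300x300 grayscale image, integer intensities in [0, 255].
image = np.random.randint(0, 256, size=(300, 300), dtype=np.uint8)

# Scale to [0, 1], then shift so values lie in [-0.5, 0.5], roughly zero mean.
normalized = image.astype(np.float32) / 255.0 - 0.5
```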

What do we do next? You already know about the multilayer perceptron, right? So what if we use it for this task? We take our pixels, which are the green nodes here, and for each of these pixels, we train a weight W. Our perceptron will take all those inputs, multiply them by the weights, add a bias term, and pass the result through an activation function. It seems like we can use it for images, right? But actually, it doesn't work like that.

Let's look at an example. Let's say we want to train a cat detector. On this training image, where we have a cat in the lower right corner, the red weights will change during backpropagation to better detect the cat. But let's take a different example, where we have a cat in the upper left corner. Then the green weights will change. What is the problem here? The problem is that we learn the same cat features in different areas, and hence we don't fully utilize the training set. The red weights are only trained on the images where the cat is in that corner, and the same goes for the green weights. What if cats in the test set appear in different places? Then our neurons are just not ready for that.

Luckily, we have convolutions. A convolution is a dot product of a kernel, or filter, and a patch of the image of the same size, which is also called a local receptive field.

Let's see an example of how it works. We have an input, which can be an image, and we have a sliding window, shown with a red border. Let's extract the first patch, a local receptive field, and multiply it by the kernel. We are actually taking a dot product, and what we get here is one plus four, which is five. Then we slide that window across the image, and for all possible locations, we take dot products with the kernel. For example, somewhere in the middle of the image, we can have this convolution: our patch is 1, 1, 0, 1, and if we take a dot product with the kernel, we will have 1 + 2 + 4, which is seven.
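The sliding-window dot product can be sketched in code like this. The kernel values below are an assumption: a kernel of 1, 2, 3, 4 is consistent with the arithmetic in the example ("one plus four is five", "1 + 2 + 4 is seven"), but the exact values in the video may differ.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take
    a dot product with the patch at every location."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # dot product
    return out

kernel = np.array([[1, 2],
                   [3, 4]])  # assumed kernel, consistent with the example
patch = np.array([[1, 1],
                  [0, 1]])   # the 1, 1, 0, 1 patch from the example
print(convolve2d_valid(patch, kernel)[0, 0])  # 1*1 + 1*2 + 0*3 + 1*4 = 7.0
```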

Actually, convolutions have been used for a while. Let's see an example. We have an original image, and we have a kernel with an eight in the center and minus ones everywhere else. How does it work? The kernel entries sum up to zero, so when the patch is a solid fill, the output is zero, which corresponds to black: when all the pixels of our patch have the same color, we get zero. It works like edge detection, because anywhere we have an edge, as opposed to a solid fill, we will have a non-zero activation.

Another example is a sharpening filter. It has a five in the center and minus ones to the north, west, east, and south, so it doesn't sum up to zero, and it doesn't work like edge detection. For solid fills, it outputs the same color, but when we have an edge, it adds a little bit of intensity at the edges, because it is somewhat similar to the edge detection kernel. That's why we perceive it as an increase in sharpness. Last but not least is a simple convolution which takes the average of its inputs; this way we lose details, and it acts like blurring.
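The three classic filters above can be written down like this (a sketch, with a made-up solid-fill patch to show the sum-to-zero argument):

```python
import numpy as np

edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])      # entries sum to 0

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])   # entries sum to 1

blur = np.full((3, 3), 1.0 / 9.0)    # averaging kernel, entries sum to 1

# On a solid-fill patch of intensity c, the response is c * sum(kernel entries):
solid = np.full((3, 3), 100.0)
print(np.sum(solid * edge))              # 0.0 -> black for solid fills
print(np.sum(solid * sharpen))           # 100.0 -> same color preserved
print(round(np.sum(solid * blur), 6))    # 100.0 -> same color, details lost near edges
```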

Convolution is actually similar to correlation. Let's take an input where a backslash is painted on the image. If we convolve it with a kernel that looks like a backslash, then for two locations of our sliding window we will have a non-zero dot product; they are denoted by a red border here. In the output, we have a one and a two, and all the rest are zeros. If we take a different image, where our stroke is not a backslash but a forward slash, and we convolve it with the same kernel, the backslash pattern, then in the output we will have something like this: two activations of one, and the rest are zeros. What can we see here? If we take the maximum value of the activations from our convolutional layer, for the first example it will be two, and for the second one, one. It looks like we've made a simple classifier of backslashes on our image.
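This maximum-activation classifier can be sketched as follows. The 4x4 images below are made up (the exact images in the video differ), but the max activation still separates the two patterns the same way:

```python
import numpy as np

def correlate2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

backslash_kernel = np.array([[1, 0],
                             [0, 1]])

backslash_img = np.eye(4)                 # a "\" stroke
forward_slash_img = np.fliplr(np.eye(4))  # a "/" stroke

# The maximum activation separates the two patterns:
print(correlate2d(backslash_img, backslash_kernel).max())      # 2.0
print(correlate2d(forward_slash_img, backslash_kernel).max())  # 1.0
```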

Another interesting property of convolution is translation equivariance. It means that if we translate the input and then apply the convolution, we get the same result as if we first applied the convolution and then translated the output. Let's look at the example: we moved our backslash on the image, and in the convolution result we have the same numbers, just translated. So if we take the maximum of these outputs, it stays the same, and our simple classifier turns out to be invariant to translation.
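A small sketch of this property, again with made-up images: translating the backslash translates the feature map, and the max over the map is unchanged.

```python
import numpy as np

def correlate2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1, 0],
                   [0, 1]])

img = np.zeros((5, 5))
img[0:3, 0:3] = np.eye(3)       # backslash in the top-left corner

shifted = np.zeros((5, 5))
shifted[2:5, 2:5] = np.eye(3)   # the same backslash, moved down-right by 2

out1 = correlate2d(img, kernel)
out2 = correlate2d(shifted, kernel)

# Equivariance: where the outputs overlap, out2 is out1 translated by (2, 2).
assert np.allclose(out1[0:2, 0:2], out2[2:4, 2:4])
# Hence the max over the feature map is invariant to the translation.
assert out1.max() == out2.max()
```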

How does a convolutional layer in a neural network work? First, we have an input, which can be an image, and we add so-called padding. It is denoted as the gray area, and it is necessary so that our convolution result has the same dimensions as the input. Okay, let's look at how it works. We take the first three-by-three patch from our image with padding. If we take a dot product with our kernel, which has weights that we need to train, W1 to W9, then in the output we will have W6 plus W8 plus W9, plus a bias term, and then we apply an activation function, which can be a sigmoid. If we move that window, we get a different neuron, right? The step with which we move that window is called the stride. In this example, we have a stride of one, and we get a new output, which is W5 plus W7 plus W8 (notice that W8 is reused: we actually share that weight), plus the bias term, and then we apply the sigmoid activation. As a result, if we continue to do that for all output neurons, we will have a so-called feature map, and it has the same dimensions as the input image, three by three, and we employed only 10 parameters to calculate it.
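The forward pass just described can be sketched like this, assuming (as in the example) a 3x3 input, a 3x3 kernel with weights W1 to W9, zero padding of one, and stride one; the input and weight values are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(image, kernel, bias, stride=1, pad=1):
    """'Same' convolution: with a 3x3 kernel, padding 1, and stride 1,
    the feature map has the same dimensions as the input."""
    padded = np.pad(image, pad)  # the gray padding area, filled with zeros
    kh, kw = kernel.shape
    oh = (padded.shape[0] - kh) // stride + 1
    ow = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = padded[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            # dot product with the shared weights, plus bias, then sigmoid
            out[i, j] = sigmoid(np.sum(patch * kernel) + bias)
    return out

image = np.ones((3, 3))
kernel = np.random.randn(3, 3)  # W1..W9, the 9 trainable weights
bias = 0.1                      # the 10th parameter
feature_map = conv_layer(image, kernel, bias)
print(feature_map.shape)        # (3, 3), same as the input
```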

How does backpropagation work for convolutional neural networks? Let's look at this simple example. We have a three-by-three input, and we have a two-by-two convolution, which means that we have four weights to train. Let's take our first patch from the image, which has a purple border, and denote the weight in the position of W4 with B. If we move the window to the right, then the next patch will have the parameter A in the position of W4. Actually, we have four different locations where W4 is used. Let's assume for a moment that these are not W4, but different parameters. How will backpropagation work then? We compute the gradients dL/dA, dL/dB, and so forth, and we have to make a step in the direction opposite to the gradient, right? If we look at all these update rules, you can see that we are updating A, B, C, and D, each with its own rule; but actually, A, B, C, and D are the same parameter W4, because we shared it in the convolutional layer. That means that we are effectively changing the value of W4, and the step that we make is equal to the sum of the gradients for all the parameters A, B, C, and D. That's how backpropagation works for a convolutional layer: we just sum up the gradients for the same shared weight.
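One gradient step for the shared weight might look like this; the gradient values, learning rate, and initial weight below are made up for illustration:

```python
# Hypothetical gradients of the loss with respect to the four "copies"
# A, B, C, D of the shared weight W4 (one copy per window location).
dL_dA, dL_dB, dL_dC, dL_dD = 0.1, -0.3, 0.2, 0.05

# Because A, B, C and D are really the same parameter, the gradient
# for W4 is the sum of the gradients of all its copies.
dL_dW4 = dL_dA + dL_dB + dL_dC + dL_dD

learning_rate = 0.01
W4 = 0.5
W4 = W4 - learning_rate * dL_dW4  # one gradient-descent step on the shared weight
```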

In a convolutional layer, the same kernel is used for every output neuron; that way, we share the parameters of the network and train a better model. Remember the cat problem, when the cat appeared in different regions of the image? With a convolutional layer, we will train the same cat features no matter where the cat is. Let's look at an example: we have a 300 by 300 input, an output of the same size, and a five-by-five convolutional kernel. In the convolutional layer, we will have only 26 parameters to train, but if we want to make it a fully connected layer, where each output is a perceptron, then we will need about eight billion parameters. That is too much. A convolutional layer can be viewed as a special case of a fully connected layer, where all the weights outside the local receptive field of each neuron equal zero, and the kernel parameters are shared between neurons.
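The parameter counts above are easy to check, assuming one bias per neuron for the fully connected case:

```python
# Parameter counts for a 300x300 input and a 300x300 output.
conv_params = 5 * 5 + 1  # one shared 5x5 kernel plus a bias term
print(conv_params)       # 26

# Fully connected: every output neuron gets a weight per input pixel, plus a bias.
fc_params = (300 * 300) * (300 * 300) + 300 * 300
print(fc_params)         # 8100090000, roughly eight billion
```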

To wrap it up, we have introduced the convolutional layer, which works better than a fully connected layer for images. This layer will be used as a building block for large neural networks. In the next video, we will introduce one more layer that we will need to build our first fully working convolutional network.
