Hello everyone, welcome back. In this video we're going to talk about convolutional neural networks. The video starts with about a 15-minute review from last week, and from there we will move on to images.

If you remember, a few weeks ago we talked about the beginnings of deep learning and why we learn it. Deep learning can be used in self-driving cars: instead of trying to build full 3D maps and localize the vehicle with expensive lidars and sensors, you can use a cheap camera, much cheaper than lidar, and do object recognition. You can build a model that recognizes cars, pedestrians, and things like that, so that's one use of deep learning. Another application of deep learning on still images is helping doctors diagnose medical images. Image recognition is also useful for building tools that search visual information: you just take a photo of an item you want, and an e-commerce website can search that image and find the items for you. You can also create fake images with convolutional neural networks that look very real, so the field has become very advanced by now. And it's not only images; you can use convolutional neural networks for sound as well, for voice and speech recognition, and you can even generate sound with them.

We talked about the perceptron, which is the small, basic unit of a neural network. It is also called an artificial neuron, and it is modeled very similarly to a biological neuron. A biological neuron has many, many dendrites that take input signals from other neurons; it processes them, produces a kind of nonlinear response that travels down the axon, and that information goes to the other neurons connected to it. Similarly, in a perceptron we take inputs, multiply them by weights, do some processing, apply an activation function, and then send the output to the other neurons connected to this neuron.

Just a quick note about the activation functions we talked about last time. There is the binary threshold, or step function: given a threshold, if your logit value is below that threshold it gives zero, and if it is above the threshold it gives one. That is for binary classification with a threshold value. We also talked about the sigmoid, which is a smoother version of the step function: it goes from zero to one smoothly using the logistic (sigmoid) function, where the argument is a linear combination of your features or input x with some weights and a bias. There is also the softmax, which is more useful when you do multi-class classification. With the sigmoid, the output is the probability that your predicted label is one, et cetera; but with softmax, because you have multiple classes, it is a conditional probability given x: the probability that your label is the j-th index, or the j-th class. The formula is the exponential of the j-th logit divided by the sum of the exponentials over all possible classes, so that sum is the normalizing factor and you take the one component you need. Tanh is another function similar to the sigmoid; it is again for binary classes, but the output goes from minus one to one.
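To make these formulas concrete, here is a minimal sketch of the activation functions just reviewed, written in plain NumPy. This is my own illustration, not code from the lecture, and the input values are made up.

```python
import numpy as np

def step(z, threshold=0.0):
    # Binary threshold / step function: 0 below the threshold, 1 at or above it
    return np.where(z >= threshold, 1.0, 0.0)

def sigmoid(z):
    # Smooth version of the step function; output between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Normalized exponentials: outputs are non-negative and sum to 1 across classes
    e = np.exp(z - np.max(z))   # subtracting the max keeps it numerically stable
    return e / e.sum()

# The argument z is a linear combination of the inputs x with weights w plus a bias b
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1
z = np.dot(w, x) + b

print(step(z), sigmoid(z), np.tanh(z))       # tanh ranges from -1 to 1
print(softmax(np.array([2.0, 1.0, 0.1])))    # three-class example; sums to 1
```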
So tanh is more suitable for something like the steering angle of a self-driving car model, if you have one. It is closely related to the sigmoid, but the output is between minus one and one. So you have to choose among many activation functions, depending on what kind of output you have in your particular problem, okay?

A quick comparison between sigmoid and softmax: sigmoid is usually ideal for binary classification, and softmax is for multi-class classification. However, the sigmoid can also be used for multi-class problems, but as you can see, if you do multi-class classification with the sigmoid you have to be careful, and you have to be able to choose which one to use depending on the type of multi-class classification problem you have. Softmax is usually used when your output should be exactly one valid answer out of multiple choices: your answer is either a cat or a dog or a bird or maybe something like a tiger, et cetera. You have many choices, and if the answer should be just one of them, then you need to use softmax as the output activation function. On the other hand, if you have something that can be both a dog and a cat, like this image (let's just call it a cat-dog), then you can use the sigmoid for multi-class classification. With softmax, the conditional probabilities P(y = 1 | x), P(y = 2 | x), P(y = 3 | x), and so on must all sum to one, whereas with the sigmoid multi-class setup they don't have to. So the probability of having a cat in this image might be 0.9, the probability of having a dog in this image might be 0.97, and the probability of having a bird in this image might be 0.21, things like that; they don't have to sum to one. But you won't usually have this kind of image anyway, so what is the sigmoid for multi-class classification useful for? Usually, if you have data like this, with multiple objects in it, and you are asked to pick all the things that appear in the image, then you can use the sigmoid for multi-class classification, whereas you cannot use softmax because it will choose one class or the other, right? You can see this difference in the sketch below.

Okay, more activation functions. So far the sigmoid, the step function, tanh, and softmax were all for output layers; I mean they don't have to be, but they are mostly used for output layers, and that's why we talked about the different target values, the target types, that we have to think about. The Rectified Linear Unit, ReLU for short, can also be used for output layers: if you have a regression problem, ReLU is perfect. But ReLU is also typically used for hidden layers. Remember, if you have multiple layers, the neurons in each layer also output some nonlinear activation, which will be one of these functions. If you have an input layer and an output layer, anything in between is called a hidden layer, and hidden-layer neurons also produce some kind of nonlinear output. The Rectified Linear Unit can take any value from zero upward, unlike the threshold or the sigmoid, whose outputs are most of the time close to zero or close to one; ReLU has more spread, so it has more expressive power, and that's why we use it in hidden layers. Similarly, there is the Parametric Rectified Linear Unit, PReLU for short, which people sometimes call leaky ReLU because it has a kind of leak: it looks like this, so instead of being zero below the threshold, it has a small slope that can be parameterized, so it has more flexibility in the values below the threshold.
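As another illustration of my own (not the lecture's code, and with made-up logits), the sketch below contrasts softmax, whose outputs sum to one and effectively pick a single winner, with independent sigmoids, which can flag several objects in the same image, and it also shows ReLU and the leaky/parametric ReLU we just discussed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    # Rectified Linear Unit: 0 below zero, identity above
    return np.maximum(0.0, z)

def prelu(z, alpha=0.01):
    # Parametric / leaky ReLU: a small slope alpha below zero instead of a hard 0
    return np.where(z >= 0, z, alpha * z)

# Hypothetical logits for the classes [cat, dog, bird] on one image
logits = np.array([2.2, 2.5, -1.0])

print(softmax(logits))   # sums to 1: forced to favour one class
print(sigmoid(logits))   # independent probabilities: cat AND dog can both be high

print(relu(np.array([-2.0, 0.5])))    # [0.0, 0.5]
print(prelu(np.array([-2.0, 0.5])))   # [-0.02, 0.5], small leak below zero
```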
Both ReLU and PReLU are often very good choices as activation functions for hidden layers, but again, both can also be used for output layers depending on what type of target variable you have.

Okay, so let's review the multi-layer perceptron. As I mentioned, anything in between the input and output layers is called a hidden layer, and there are design parameters. This kind of structure is called the architecture: the number of layers and the number of neurons in each layer are design parameters, design choices, and what the activation function should be in each layer is also a design parameter. You could have different types of activation functions within one layer, but normally we don't do that; we simply keep the same type of activation function per layer, because there is nothing useful gained by mixing activation functions neuron by neuron, okay?

We also talked last time about how to train this neural network. Because it has multiple layers, we use the chain rule, and we use this weight update rule, which is derived from gradient descent. What happens is that, first, these connections are actually the weights, so in this case, for inputs zero, one, two, three, this line corresponds to the weight w(0,3), or maybe I can call the index i here, something like that. You have all the combinations, because it is a fully connected layer, and each connection corresponds to a weight. Initially all these weights are unknown, so we initialize them with small random values. Then we produce the output, and this output is not going to be the same as, or anywhere close to, the target value, so we measure the error. When we measure this error, we compute the gradient of it with respect to each weight, and to be able to compute the gradient of the loss function with respect to this particular weight we need to use the chain rule. After we use the chain rule, we can calculate this term, and then all of these terms, working backwards, and that was called back-propagation; this direction was forward propagation, and this is back-propagation. Okay, so we update the weights, and this is just the learning rate, or step size, which controls how much of an update we make to each weight. Typically it is much smaller than one, maybe 0.1 or 10 to the minus 3, some small value like that, so that we don't swing the update around or overshoot the values of the estimated parameters. Okay, so that was the review so far.
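To tie the review together, here is a minimal sketch of one training step for a tiny fully connected network; again this is my own illustration rather than the lecture's code. It shows small random initial weights, a forward pass, the chain rule (back-propagation) to get the gradient of the loss with respect to each weight, and the gradient descent update with a small learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 3 inputs -> 4 hidden units (sigmoid) -> 1 output (sigmoid)
W1 = rng.normal(scale=0.1, size=(4, 3))   # small random initial weights
b1 = np.zeros(4)
W2 = rng.normal(scale=0.1, size=(1, 4))
b2 = np.zeros(1)

x = np.array([0.5, -1.2, 0.3])   # one training example (made up)
y = np.array([1.0])              # its target value
lr = 0.1                         # learning rate / step size, much smaller than 1

# Forward propagation
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Back-propagation (chain rule), using sigmoid'(z) = s * (1 - s)
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # error at the output layer
dW2 = np.outer(delta2, h)
delta1 = (W2.T @ delta2) * h * (1 - h)       # error pushed back through W2
dW1 = np.outer(delta1, x)

# Gradient descent update: w <- w - lr * dLoss/dw
W2 -= lr * dW2
b2 -= lr * delta2
W1 -= lr * dW1
b1 -= lr * delta1
```

Repeating this forward pass, back-propagation, and weight update over the training data is exactly the training loop described above.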