Hey, everyone. In this video, we're going to give an introduction to non-parametric models. Parametric models look like this: the model has parameters, and it may or may not have hyper-parameters. The model takes features and makes a prediction, compares that prediction with the target value, and produces an error. That error is used to optimize the parameter values so that the prediction gets as close as possible to the target.

In non-parametric models, the parameters don't exist. So the question is: how do we optimize the model so that the prediction still gets as close as possible to the target? Non-parametric models usually have hyper-parameters, although they don't have to. Sometimes they use this error, and sometimes they use some other quantity to optimize the model; we'll get to that. Examples of non-parametric models are KNN (K-Nearest Neighbors), which is one of the simplest machine learning algorithms; decision trees, which use a tree-like model (we'll get to that later); and Support Vector Machines, which use the distance between the points and the decision boundary, or hyperplane.

K-Nearest Neighbors works like this. Imagine I have training data that looks like this: red dots and blue dots. The task is to classify my data points as red or blue. Let's say I have a data point to classify here, and I don't know whether it's red or blue. K-Nearest Neighbors says: just take the K nearest neighbors and classify the point as the majority of them. If K equals 1, I take the single closest neighbor, which is red, so my green point is classified as red in this case. If I take three neighbors, then I have two blues and one red, so my green point is classified as blue by the majority (voting) rule. If K equals 5, then I have three red neighbors and two blue neighbors, so it's classified as red again.

You might have noticed two things. First, this green point keeps swinging between red and blue. Second, the choices of K are odd numbers. Why is that? Because if K is even, I might get a tie, say two reds and two blues, and then I don't know which class to choose. That's why we usually use odd values of K for a KNN classifier. As for why the green point swings between red and blue: it just happens to sit on the decision boundary. For example, say this side is red and this side is blue, and the green point sits right in between, so it can swing either way. That's not very important; I just wanted to show you that it can happen.

Another question you might ask is: can KNN do anything other than classification? Yes, it can also do regression. The difference is that instead of taking the majority vote, a KNN regressor takes the average of the K neighbors' values, for example the average of five values when K equals 5. KNN uses a distance metric, for example Manhattan distance or Euclidean distance. Euclidean distance is the straight-line distance between two points, whereas Manhattan distance is the sum of the absolute coordinate differences, |delta x| + |delta y|. There are more distance metrics you can use, but these two are pretty popular.

All right, let's have an example. This is the famous Iris dataset. To display it more conveniently, I only used two features and two of the Iris classes.
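To make this concrete, here is a minimal sketch of the same idea in scikit-learn. The video doesn't say which two features, which two classes, or even which library were used, so those choices below are assumptions for illustration only.

```python
# Minimal KNN sketch on the Iris data (two assumed features, two assumed classes).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
mask = y < 2                   # keep only two of the three Iris classes (assumption)
X, y = X[mask, :2], y[mask]    # keep only the first two features (assumption)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5):
    # "euclidean" is the default metric; "manhattan" is the other one mentioned above
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    print(f"k={k}: test accuracy = {knn.score(X_test, y_test):.3f}")
```

Swapping metric="euclidean" for metric="manhattan" is how you would try the other distance mentioned above.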
You can see that some blue points and red points are mixed in certain areas, so they're hard to separate. These two plots show the decision boundaries of KNN models trained on this data, each with a different value of k. Now, I have a question for you: which of the two has the smaller k? Here's the answer: the left one has the smaller k value; in fact, it was k equals 1. As you can see, as k increases, the decision boundary becomes smoother and smoother.

When k is small, say 1, I only have to consider a single neighbor. If my data point is here, I only consider this one neighbor, and if my next data point is over here, I consider that one instead. Therefore the decision boundary can be very granular, and the model can fit very complex data like this. Whereas if I have to consider many neighbors, say 61, then at each point I have to count red versus blue among those 61 neighbors and decide which one is dominant, in this case red. The decision boundary becomes very smooth that way because I'm averaging over a lot of data points.

This might remind you of the concept of bias and variance. How do bias and variance behave in KNN? Here's a question: which model has a larger bias, the one with small k or the one with large k? The answer is that the model with larger k has larger bias. Why? Because a model with larger k is a simpler, less flexible model; as you saw in the previous slides, the decision boundaries are much smoother when k is larger. A simpler, less flexible model has a larger bias because it simplifies the real-world data and introduces more assumptions about it. Another question: which model has a larger variance, small k or large k? Larger variance happens when the model is more flexible, so we can guess that KNN with a small k has a larger variance.

How do we determine the optimal k value? As you saw previously, the training error goes down as the model complexity increases. The test error goes down at first, reaches an optimal point, and then goes up again as the model complexity keeps increasing, because the model becomes too complex for the data: it overfits and doesn't generalize well. That happens here: around k equals 21 the test error is minimized, and it goes back up if we keep increasing the model complexity. Note that this side of the plot is the more complex model: as k gets smaller, the model gets more complex and more flexible, while the other side is simpler, with larger bias and smaller variance. That's the relationship between the k of KNN and bias and variance.

More KNN properties. As we saw, it's a simple, memory-based algorithm. Memory-based means it has to keep all of the training data around in order to make predictions. Its time complexity is on the order of the number of samples times the number of features. There could be a factor of k in there as well, since you have to pick out the k nearest neighbors, but there are clever selection algorithms for that, so when you actually measure it the time doesn't grow linearly in k. Roughly, though, the time complexity is n times m, where n is the number of samples and m is the number of features. We can support that with a few experiments. This data comes from OpenML and has more than 90,000 samples with 100 features; the task is binary classification.
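The bias-variance trade-off above is easy to reproduce with a validation-style sweep over k. The exact dataset from the video isn't identified here, so this sketch assumes a synthetic stand-in from make_classification; the general pattern (training accuracy falling as k grows, test accuracy peaking at some intermediate k) is what the plot in the video shows.

```python
# Sweep over k: small k = flexible/high-variance, large k = smooth/high-bias.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the dataset used in the video (assumption).
X, y = make_classification(n_samples=5000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 3, 7, 21, 61, 121):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)   # decreases as k grows
    test_acc = knn.score(X_test, y_test)      # typically peaks at an intermediate k
    print(f"k={k:3d}  train={train_acc:.3f}  test={test_acc:.3f}")
```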
At k equals 9, which we chose as the optimal value here, the training time of KNN increases linearly as the number of samples increases. Another experiment supports the complexity estimate as well: this time, instead of increasing the number of samples, we increase the number of features, and the training time again grows linearly. This dataset is about classifying three different boundary types in gene sequences and has 180 binary features. Training logistic regression on the same dataset and measuring its training time gives us a comparison. Surprisingly, the KNN model is very efficient. People usually say that KNN is slow because it has to measure the distances to all the points in the training set, but with this number of samples it's not terribly bad: its training time is very small, and compared to logistic regression it's surprisingly fast. Logistic regression might be slow here simply because it uses a fancy second-derivative optimization algorithm that can run for many iterations. This graph also shows that there is an optimal k value for the KNN model: the test accuracy peaks at a certain k, which is 7 in this case.

Another property of KNN is that it suffers severely from the curse of dimensionality. What is the curse of dimensionality? It means that the model performs very poorly when we have a lot of features; there is a "curse" when the dimension is high. To see it, we're going to do an experiment. First, I plotted the explained variance ratio from PCA. Transforming our 180 features with principal component analysis ranks the combinations of those 180 features in order of how much variance they explain; that is called the explained variance ratio. The gradual increase of this explained variance ratio tells us that many of these features matter. If only a few of the features were important, the explained variance ratio curve would rise very sharply and then flatten, meaning more than 90 percent of the variance would be explained by just a few features. However, this graph increases gradually, which means essentially all of the features carry information. With that in mind, let's compare with logistic regression.

Because most of these features are informative, the test accuracy of logistic regression keeps increasing as the number of features increases. However, in KNN, for the various values of k, the accuracy peaks at a very small number of features and then drops sharply; that's the curse of high dimensionality. Let's fix k at our optimal value, k equals 7: as you can see, the test accuracy drops very sharply as we increase the number of features included in the model.

Why does the curse of dimensionality happen here? Intuitively, the number of data points within a given volume drops sharply as the dimension grows, so we would need many more data points to keep the same level of accuracy. With a fixed dataset size, the density of the data decreases dramatically, so accuracy degrades when the dimension is too high. But it's not quite that simple: researchers have found that if the features are highly correlated with each other, KNN may suffer less, because the effective dimension is lower than the number of features. Still, KNN suffers from the curse of dimensionality.
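The curse-of-dimensionality experiment can be sketched the same way. The gene-sequence dataset from the video isn't reproduced here, so a synthetic dataset stands in, and the exact numbers will differ; the point is the recipe: look at the cumulative explained variance ratio from PCA, then watch KNN accuracy at fixed k as more features are included.

```python
# Curse-of-dimensionality sketch: PCA explained variance, then KNN accuracy vs. dimension.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 180-feature gene-sequence data (assumption).
X, y = make_classification(n_samples=2000, n_features=180,
                           n_informative=120, random_state=0)

# Cumulative explained variance ratio: a gradual rise means many features matter.
pca = PCA().fit(X)
print(np.cumsum(pca.explained_variance_ratio_)[:10])

# KNN test accuracy as a function of how many features we include, at fixed k=7.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for d in (2, 5, 10, 40, 90, 180):
    knn = KNeighborsClassifier(n_neighbors=7).fit(X_train[:, :d], y_train)
    print(f"{d:3d} features  test acc = {knn.score(X_test[:, :d], y_test):.3f}")
```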
When this happens, you want to use a smaller number of features and avoid very high dimensions when you're using KNN. Also, not only KNN but other machine learning models that use a distance metric in their algorithm can suffer from the curse of dimensionality. So choose wisely which model to use, and when your dimension is too high, you may want to reduce the number of features (a short sketch of doing that follows below). In this video we talked about KNN as an example of a non-parametric model and one of the simplest machine learning models. We talked about its properties: its bias and variance, its hyper-parameter k and how the model behaves as k increases or decreases, and issues such as the curse of dimensionality. We'll talk about more sophisticated non-parametric models in the next videos.
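As a practical follow-up to the advice above, here is a minimal sketch of one way to reduce dimensionality before KNN. The video doesn't prescribe a method, so PCA with 10 components is an arbitrary illustrative choice, not a recommendation from the lecture.

```python
# Reduce dimensionality with PCA before running KNN (illustrative choices only).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data as a stand-in (assumption).
X, y = make_classification(n_samples=2000, n_features=180, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),          # hypothetical choice of dimension
                     KNeighborsClassifier(n_neighbors=7))
print(cross_val_score(pipe, X, y, cv=5).mean())
```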