Let's now focus on how to compute the distance between two given points. This is the similarity measure in our model, where close points are thought to be similar. The assumption is that similar points will most likely have similar outcomes, churned versus not churned here.

So what is the distance between these two points? There are a few choices one can make here. Perhaps the most popular is the Euclidean distance. This is easy to understand and interpret because it's very visual: it's the literal straight-line distance. We have the distance from our point of prediction in data usage on the vertical axis, and the distance from our point of prediction in phone usage on the horizontal axis. The Euclidean distance is just the square root of the sum of each of these distances squared. With more features, we square the distance from the point of prediction along each individual feature, sum those squares, and take the square root of that sum. That's our Euclidean distance.

Another measure is the L1 distance, appropriately named the Manhattan distance, since in Manhattan you usually can't walk between two points directly; you have to turn at the corner first. Here we just add up all the absolute distances. Again, if we extend to more dimensions, we simply add on the absolute distance along each individual feature from the point we're trying to predict. These are the two choices that are most popular in practice.

So we've discussed how distance is going to be important for our K nearest neighbors model. Given that it relies so heavily on distance when we're trying to predict a point from our training set, the scale of our variables is going to be really important. Suppose we have another customer feature, the number of services to which they subscribe. We can see an example of this here, where the number of services is small, ranging between 1 and 5 as we see on the x axis, relative to the average data usage, which is measured in gigabytes and ranges from 10 to 60. Unscaled, the services feature has a reduced impact on the distance relative to data usage. This is a problem, because once we expand that region and look at the closest distances to this point, we can see that data usage will largely determine the nearest neighbors, and the number of services will have little effect on the outcome at all, since data usage is much more spread out and has much greater distances. And if you remember the formulas, we're artificially weighting the data usage distances heavily here. This can be a good thing in some cases, if you find, given your domain expertise, that perhaps we do want to weight data usage more heavily. But in most cases, that's probably not what we want.

So let's look at the same example when the data has been appropriately scaled. We've discussed this in prior courses: the process of creating a uniform range for all of our features is known as feature scaling. Thinking back, can we recall different methods that we discussed for scaling our data? Hopefully what you're thinking of are the min-max scaler, where we subtract the minimum value of a given feature and divide by the maximum minus the minimum for that feature, and the standard scaler, where we subtract the mean and divide by the standard deviation.
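As a quick sketch of these ideas (not from the course slides, and with made-up customer values purely for illustration), here is how the two distances and the two scalers might look in Python with NumPy and scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical customers: [avg data usage (GB), number of services]
point_a = np.array([42.0, 2.0])
point_b = np.array([18.0, 5.0])

# Euclidean (L2) distance: square each per-feature difference, sum, take the square root
euclidean = np.sqrt(np.sum((point_a - point_b) ** 2))

# Manhattan (L1) distance: sum of the absolute per-feature differences
manhattan = np.sum(np.abs(point_a - point_b))

print(euclidean)  # dominated by the 24 GB gap in data usage
print(manhattan)

# Scaling puts both features on a comparable range before any distances are computed
X = np.array([[42.0, 2.0],
              [18.0, 5.0],
              [55.0, 1.0],
              [12.0, 4.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min), per feature
X_standard = StandardScaler().fit_transform(X)  # (x - mean) / std, per feature
```

After either scaler, a one-unit gap in services and a comparable fraction of the data usage range contribute similar amounts to the distance, which is the point of scaling before KNN.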
So once we perform our feature scaling, we can ensure that our features have equal influence on our K nearest neighbors model. Here we can see that the x axis, the number of services, is expanded much more than before. In this case, the nearest neighbors are going to be different from what we saw previously. This time, with equal influence coming from both features, they are customers that churned rather than customers that did not churn, which was the case when we didn't scale our features.

We can also predict multiple classes with KNN fairly easily. Here we are predicting three classes: churned and left for a competitor, churned by canceling (with no other info), and did not churn. We can again just set the value of K and create a decision boundary as before. Predicting multiple classes is quite simple for K nearest neighbors. For other machine learning methods, we'll see that modifications are sometimes required to handle multiple classes, but here it's as simple as choosing between more potential classes: whichever class has the majority vote among the nearest neighbors of the point we're trying to predict becomes our prediction. Notice that with multiple classes, choosing an odd K may not save us from ties, so instead you may want to choose K in relation to the number of labels. If we have three labels here, choose some multiple of 3 plus 1, so that the votes can never split evenly across all three classes.

We have been focused on classification thus far, but it's worth noting that regression can also be performed with K nearest neighbors. That is, we can predict continuous values as well, that idea of how much. This works in a similar fashion to classification, except that the value predicted is the mean value of its neighbors. So when K is the same as the number of points, a single value is predicted: the mean of all the values. When K is larger than 1, KNN regression acts as a smoothing function, just a rolling average of the closest K points (K equals 3 here). And when K is 1, the prediction simply connects each one of the points exactly; we're just saying the nearest neighbor of each point is our prediction.
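To tie the classification and regression pieces together, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor. The feature matrix, class labels, and regression target below are hypothetical and assumed to be already scaled; they only illustrate the workflow, not results from the course:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical, already-scaled features: [data usage, phone usage, number of services]
X = np.array([[0.2, 0.7, 0.4],
              [0.8, 0.1, 0.9],
              [0.5, 0.5, 0.2],
              [0.9, 0.3, 0.6],
              [0.1, 0.8, 0.8],
              [0.6, 0.2, 0.1],
              [0.3, 0.9, 0.5]])

# Three classes: 0 = churned to a competitor, 1 = churned by canceling, 2 = did not churn
y_class = np.array([0, 1, 2, 1, 0, 2, 2])

# Multi-class KNN needs no modification: the majority vote simply runs over
# however many labels are present. K = 4 here (a multiple of 3, plus 1).
clf = KNeighborsClassifier(n_neighbors=4)
clf.fit(X, y_class)
print(clf.predict([[0.4, 0.6, 0.3]]))

# KNN regression: the prediction is the mean target value of the K nearest neighbors.
# y_spend is a made-up continuous target (e.g., monthly spend).
y_spend = np.array([30.0, 55.0, 42.0, 61.0, 28.0, 47.0, 39.0])
reg = KNeighborsRegressor(n_neighbors=3)  # rolling average of the 3 closest points
reg.fit(X, y_spend)
print(reg.predict([[0.4, 0.6, 0.3]]))
```

Both estimators default to the Euclidean distance; passing metric='manhattan' switches them to the L1 distance discussed above.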