[MUSIC] And what this motivates us to look at is something called k-nearest neighbors, which is a very simple extension of one nearest neighbor. Again, going back to our housing application, it's what real estate agents do. I said they just look at the nearest house, but that's not quite true; they tend to go out and get a set of what are called comps. So instead of just looking for the single closest, most similar house to yours, they're going to collect a set of houses that are similar, look at the values for that set of houses, and say, well, your house value should be somehow related to this set of houses. Because the agent understands that there might have been something strange about the value the one most similar house got. Maybe a buyer was really drawn to some specific feature of that house, and that drove up the cost. Or maybe it sold at a strange time of year, and that gave it an artificially low value. So looking at a larger set of houses to assess the value of yours can give you a much more reliable estimate.

So, formally, k-nearest neighbors is gonna take the same data set that we had before and a query point. And what it's gonna do is find the k closest observations in our data set. Specifically, let's call them x nearest neighbor 1, x nearest neighbor 2, all the way to x nearest neighbor k. I'm gonna assume that I've ordered these nearest neighbors so that x nearest neighbor k is the furthest, the least similar house in this set to the query house. This set is defined such that for any xi not in my nearest neighbor set, the distance between xi and my query point xq is greater than or equal to the distance between my kth nearest neighbor and xq. So what this means is that every other observation in the data set is further away than my kth nearest neighbor. That's what defines it as my kth nearest neighbor.

Okay, so then what do I do for my prediction? When we were doing one nearest neighbor regression, we just took our nearest neighbor, looked at the output associated with that nearest neighbor, and said that was my predicted value. But now I have a set of nearest neighbors, so I need to decide what I'm gonna do with these nearest neighbors and their associated outputs to form my prediction. And what k-nearest neighbors does is say that the predicted value associated with my query point is simply gonna be the average value of all the nearest neighbor outputs. So it's 1/k times (y nearest neighbor 1 + y nearest neighbor 2 + all the way up to y nearest neighbor k), or we can write this more simply as 1/k times the sum from j = 1 to k of y nearest neighbor j. Okay, so k-nearest neighbors, like I said, is a very simple variant, or generalization, of one nearest neighbor.
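To make that prediction rule concrete, here is a minimal brute-force sketch, assuming NumPy arrays and Euclidean distance; the function name `knn_predict` and the toy housing numbers are illustrative, not from the lecture.

```python
import numpy as np

def knn_predict(X, y, x_query, k):
    # Distance from the query to every observation (Euclidean, as an assumption).
    dists = np.linalg.norm(X - x_query, axis=1)
    # Indices of the k smallest distances: the k nearest neighbors.
    nn_idx = np.argsort(dists)[:k]
    # k-NN regression: average the neighbors' outputs, (1/k) * sum_j y_NNj.
    return y[nn_idx].mean()

# Hypothetical example: value a 1600 sq. ft. house from its 2 nearest comps.
X = np.array([[1200.0], [1500.0], [1800.0], [2100.0]])   # sq. ft.
y = np.array([313000.0, 350000.0, 412000.0, 458000.0])   # sale prices
print(knn_predict(X, y, np.array([1600.0]), k=2))        # averages the 2 closest
```

Note that this version sorts every distance, which is simple but wasteful; the queue-based search described next avoids that.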
Now, let's talk about how we implement our k-nearest neighbor search in practice. The setup is exactly like it was for one nearest neighbor, but instead of outputting just the single most similar house, we're gonna output a set of similar houses, specifically k similar houses. For completeness, I've included what the k-nearest neighbor algorithm looks like, where the one key insight is to maintain a sorted queue of the k nearest neighbors as you're going through the different steps of your search.

So let's just walk through this algorithm. What we're gonna do is initialize what I'm calling "distance to k nearest neighbors," which is a list of the distances to the k nearest neighbors I have at the current iteration. To start with, we can just look at the first k houses in our data set and initialize those as our k nearest neighbors, sorting their distances to the query. However these distances sort, we're gonna likewise sort the first k houses themselves to be our current estimate of our nearest neighbors.

Then we're gonna cycle through houses k + 1 to capital N. For each one, we compute the distance of this new house that we haven't seen before to our query house. And if that distance is less than the distance to my kth nearest neighbor, the house in my set that's furthest away, then I'm going to go through my sorted list. Because just because this new house is closer than the kth nearest neighbor, that doesn't mean this new house becomes my kth nearest neighbor; it could be closer than my second nearest neighbor, or my fourth. So what I need to do is go through and find which slot this house should go into, meaning there's some index j at which nearest neighbors 1 through j minus 1 are all closer than the house I'm looking at now, but this house is closer than all the other nearest neighbors in my set.

Now that we have this index j, we're gonna insert this house, our newly found nearest neighbor, into the queue. We simply knock off the kth nearest neighbor, which we know was our worst estimate of a nearest neighbor so far, and insert this house into the queue. So one step is just shifting the indices in our list of houses, knocking off that kth, last house; the next step does exactly the same thing for the distances associated with each of these nearest neighbors. Then, finally, we insert the new nearest neighbor: in the jth slot of the distance list, we put the distance to this newly found house, and in the jth slot of our list of houses, we put the house itself.

So once we cycle through all the observations in our data set, what we end up with is the set of the k most similar houses to the query house. And like we described before, when we wanna go and predict the value associated with our query house, we simply average over the values associated with these k nearest neighbors. [MUSIC]
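Here is a minimal sketch of that sorted-queue search, under the same assumptions as before (NumPy arrays, Euclidean distance); the name `knn_search` is mine, and Python's `bisect.insort` could replace the manual scan, but the explicit loop mirrors the lecture's steps.

```python
import numpy as np

def knn_search(X, x_query, k):
    """Return the indices of the k nearest neighbors of x_query,
    keeping a sorted queue of the best k seen so far."""
    def dist(i):
        return np.linalg.norm(X[i] - x_query)

    # Initialize with the first k houses, sorted by distance to the query.
    knn = sorted(range(k), key=dist)        # current neighbor indices
    dist_to_knn = [dist(i) for i in knn]    # matching sorted distances
    # Cycle through houses k+1 .. N.
    for i in range(k, len(X)):
        d = dist(i)
        if d < dist_to_knn[-1]:             # closer than the current kth neighbor?
            # Find the slot j: neighbors 1..j-1 are closer than house i,
            # everything from slot j onward is not.
            j = 0
            while dist_to_knn[j] <= d:
                j += 1
            # Insert into slot j and knock off the old kth nearest neighbor.
            dist_to_knn.insert(j, d)
            dist_to_knn.pop()
            knn.insert(j, i)
            knn.pop()
    return knn
```

Given outputs `y`, the prediction is then `y[knn_search(X, x_query, k)].mean()`, matching the brute-force version above, but each new house costs only one distance computation plus a scan of a length-k queue rather than a full sort.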