[MUSIC] So if our query point falls in this bin here, meaning it gets scored as 0 1 0, so maybe our query point is right here. And so according to the lines that we threw down, it's going to have the binary vector 0 1 0. And then what we do is just go and look within this box, and look at the distances to data point 4, data point 8, and data point 11. The question is, do we know we're going to find the nearest neighbor? And hopefully it shouldn't be such a trick question at this point, or such a challenging question: the answer is no. We know that we're just thinking about returning an approximate nearest neighbor, because the nearest neighbor could very well be in some other bin, having been split from the bin that our query point was put into. But the important thing here is that the situation is even worse than what we mentioned before, when we just threw down one line, because the chance of splitting our query point from its nearest neighbor is even higher when we're throwing down multiple lines: there's more of a chance that one or more of these lines will split our query point from its nearest neighbor. So what we see is that, when we think about throwing down multiple lines, what we're trading off is accuracy for speed. There's a greater chance that searching within a single bin we're not going to find our true nearest neighbor, but we can search much more efficiently. But there's actually even better news, which is that we have some control over the approximation that we're making, and over how much computation we're going to perform for our nearest neighbor search. In particular, we don't have to just search the one bin that our query point falls into; we can also search nearby bins. So what does it mean to be a nearby bin? Well, let's look at this. Okay, so remember our query point was, let's say, somewhere around here, and the bin that it falls into is 010.
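The scoring step described above can be sketched as follows. This is just an illustrative sketch, not the course's implementation: each line through the origin is represented here by a made-up normal vector, and a point's bit for a line is the sign of its score under that line.

```python
import numpy as np

def lsh_bin_index(point, lines):
    """Score a point against each line and read off its binary bin index.

    A point scores 1 on a line if it falls on the positive side of that
    line, and 0 otherwise.
    """
    scores = lines @ point                # one score per line
    bits = (scores >= 0).astype(int)      # sign of each score -> one bit
    return tuple(bits)

# Example with h = 3 hypothetical lines in 2-D (coefficients made up):
lines = np.array([[1.0, 0.5],
                  [-0.3, 1.0],
                  [0.8, -1.0]])
query = np.array([0.5, -1.2])
print(lsh_bin_index(query, lines))  # -> (0, 0, 1)
```

Reading the bit vector as a binary number then gives the bin index used to look the point up in the hash table.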
So what are the nearby bins? They're bins 000 and 110. And what we see, which makes sense if you think about how these vectors were actually defined, is that the nearby bins are the ones that are just one bit off from our current bin index. So for example, taking the one and flipping that bit to a zero, we get to the orange bin, or if we take a zero and flip it to a one, we get to this blue bin. Flipping a bit is changing the score of the point, saying that it just fell on the other side of the line that defines the current bin. Okay, so what we can do in practice, if we go back to our hash table, is instead of just searching bin 2, and you can't see that at all with this color, let me switch colors. Instead of just searching bin 2, we're going to look at the binary encoding of 2, and then we're going to flip bits and search these nearby bins, so we're going to search bin 0 and bin 6 in this case. So we're going to compute not just distances to these points here, but also distances to these other data points, and we're going to hope that if our nearest neighbor didn't fall into this pink bin here, maybe it was in the orange, or maybe it was in the blue. And what that assumption is saying is that if our nearest neighbor fell in one of these other orange and blue bins, then relative to our query point, there was only one line that split the nearest neighbor from the query point. But there might have been two lines that split our query point and its nearest neighbor. So if that were the case, let's just draw a little picture over here, and say this is our query point and this here is its nearest neighbor. Well, it could be that we threw down two lines that split these two points, and in this case, we'd have to search two bits off.
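The bit-flipping idea above can be sketched in a few lines. This is a hypothetical helper, not the course's code: it enumerates every bin index that differs from the query's bin in exactly one bit.

```python
def neighbors_one_bit(bits):
    """All bin indices that differ from `bits` in exactly one bit position."""
    nearby = []
    for i in range(len(bits)):
        flipped = list(bits)
        flipped[i] = 1 - flipped[i]   # the point fell on the other side of line i
        nearby.append(tuple(flipped))
    return nearby

# The one-bit neighbors of bin 010 (bin 2):
print(neighbors_one_bit((0, 1, 0)))
# -> [(1, 1, 0), (0, 0, 0), (0, 1, 1)], i.e. bins 6, 0, and 3
```

With h lines there are exactly h one-bit neighbors, so the cost of this wider search grows only linearly in the number of lines.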
So we'd have to search, now flipping 2 bits, taking our 0 and turning it into a 1, and taking our other 0 and also turning it into a 1, to search bin 7, and then also consider the distances to all of these data points here. And again, if we found a nearest neighbor in this bin 7, that would mean we had exactly this case, where two lines separated our query point from the nearest neighbor. So just to make sure it's very clear, the quality of the nearest neighbor that we can return can only improve as we search more and more neighboring bins. So what can we do in practice? Well, typically when we go and use locality-sensitive hashing to perform nearest neighbor search, we think about running the algorithm, flipping more and more bits and searching all of those nearby bins, until we've run out of the computational budget that we've set for that search. Or alternatively, maybe we have a standard for how good the nearest neighbor should be relative to the query point: as long as the point we find is within some epsilon distance of the query point, we're going to say that's good enough. In that case, we would just keep searching until we found a neighbor within that epsilon ball of the query point, even if we hadn't run out of our computational budget. So the cool thing here, relative to the approach of just throwing down one line, is that we can actually control this approximation versus complexity-of-search trade-off that we're making, because we can get better and better approximations the more bins we search, but we can cut off that process at some point, if we've searched too many points or run out of time in performing this search. Whereas if we just threw down one line, we wouldn't know where to stop the search; we wouldn't know which set of points was closer or further. So that's part of the benefit of throwing down multiple lines.
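The search loop just described, visiting bins in order of increasing number of flipped bits and stopping on either a point budget or an epsilon-ball hit, can be sketched as below. The `table` layout (bin-index tuples mapping to lists of points) and the parameter names `budget` and `epsilon` are assumptions for illustration, and distances are plain Euclidean.

```python
from itertools import combinations
import math

def bins_at_distance(bits, r):
    """Yield every bin index whose Hamming distance from `bits` is exactly r."""
    for positions in combinations(range(len(bits)), r):
        flipped = list(bits)
        for i in positions:
            flipped[i] = 1 - flipped[i]
        yield tuple(flipped)

def approx_nearest_neighbor(query, query_bits, table, budget=50, epsilon=None):
    """Search bins in order of increasing Hamming distance from the query's
    bin, stopping once the point budget is spent or, optionally, once a
    neighbor within epsilon of the query has been found."""
    best, best_dist, examined = None, float("inf"), 0
    for r in range(len(query_bits) + 1):      # r = 0 is the query's own bin
        for b in bins_at_distance(query_bits, r):
            for p in table.get(b, []):
                d = math.dist(query, p)
                examined += 1
                if d < best_dist:
                    best, best_dist = p, d
                if epsilon is not None and best_dist <= epsilon:
                    return best, best_dist    # good enough: stop early
                if examined >= budget:
                    return best, best_dist    # out of budget: stop
    return best, best_dist
```

Raising `budget` (or tightening `epsilon`) buys a better approximation at the cost of more distance computations, which is exactly the trade-off being described.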
Okay, so just to recap what we've seen so far about Locality-Sensitive Hashing. The way that we partition our data points is we throw down some h different lines, and then we're going to score every data point under every line and use that to create a binary bit vector. And then we're going to take that and use it to encode a bin index, or really, that's going to serve as the key in forming a hash table that lists all the data points that have that specific binary vector or bin number. And so this data structure is going to serve as basically the direct analog of what we did for kd-trees: a fixed partitioning of our data points into different bins. Then when we go to perform our nearest neighbor search for a given query, we're going to use this data structure that we've created, which is just simply a hash table, and we're going to search the query's bin and nearby bins to return our nearest neighbor, really an approximate nearest neighbor. And we're going to use this same hash table for every query we have. So although there's a cost to forming this hash table, which we'll discuss a little bit later, we can reuse it again and again and again, every time we're doing a query for our nearest neighbor. [MUSIC]
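Putting the recap together, the one-time build step can be sketched as follows. The data, the random lines, and the `defaultdict`-of-lists table layout are all illustrative assumptions; the point is just that the table is built once and then reused for every query.

```python
import numpy as np
from collections import defaultdict

def build_lsh_table(data, lines):
    """Hash every data point once into its bin; the resulting table is
    then reused for every subsequent query."""
    table = defaultdict(list)
    for x in data:
        bits = tuple((lines @ x >= 0).astype(int))  # score under every line
        table[bits].append(x)
    return table

# Build once over some made-up 2-D data with h = 3 random lines...
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 2))
lines = rng.normal(size=(3, 2))
table = build_lsh_table(data, lines)
# ...then every query reuses `table`, searching its own bin and nearby bins.
```

Note that every data point lands in exactly one bin, so the table stores each point once, mirroring the fixed partitioning a kd-tree gives.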