[MUSIC] So there's one more algorithm I want to talk about that's less well known than k-means, but it has slightly nicer properties, so it's another good one to be familiar with, and that is DBSCAN. Here's how DBSCAN works: given points, say, in a two-dimensional space as always, we're going to look for points that are separated by a distance of no more than some epsilon. If you can hop from one point to another by hopping no more than epsilon at each step, then all of those points are considered to be in the same cluster, and whenever you need to jump farther than epsilon, you're entering a new cluster. So in this example, if epsilon is this distance, then these points are within epsilon, these points are within epsilon, and so on, and you sort of induce this graph over the data. B is reachable from A by hops of no more than epsilon, so all of these are in the same cluster. C is also reachable from B by hops, so it's in the same cluster too, but D is not, and so D is in a different cluster. All the solid points are in one cluster and all the hollow points are in another.

So there are some advantages here that are illustrated by this picture. One is that DBSCAN can find clusters that aren't linearly separable. Right here, there's no straight line you can draw to separate these two clusters, and yet there's clearly one dense region and another dense region, so you want to draw a curved boundary. DBSCAN finds this naturally, but k-means won't: k-means will pick centroids, depending on where you start, maybe here and here, in which case you'll get clusters like this.

There's also no dependence on fixing the number of clusters in advance, as k-means requires, and no dependence on starting conditions; you can start from any vertex in the graph and begin making hops. There are only two parameters. One is the distance threshold epsilon, and the other is a minimum number of neighbors. What that second parameter controls is this: if I find a point way out here in space that has only a few neighbors, then maybe I just consider that point noise and ignore it completely, without trying to put it into any cluster at all. Here, if you set the minimum neighbors to zero, you'd get this one point as its own very little cluster, this other very little cluster, and so on. That just requires an extra bit of bookkeeping; it's not necessarily a big flaw in the algorithm, and you'll still find the obvious two main clusters, but you can set this second min-neighbors parameter to control how much of that extra bookkeeping you do to maintain tiny clusters.

The core primitive as the algorithm runs is: find me all the neighbors within epsilon, find me all the neighbors within epsilon, over and over. This operation is amenable to spatial indexing techniques, as we talked about in the scalability lecture, so that every lookup requires only a logarithmic number of steps, and the overall runtime is O(n log n).
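To tie the pieces together, here's a minimal sketch using scikit-learn (the lecture doesn't prescribe a particular library) on the classic two-moons data set: eps plays the role of the epsilon hop threshold, min_samples is the minimum-neighbors parameter, noise points come back labeled -1, and algorithm='ball_tree' requests a spatial index for the neighbor lookups.

```python
# Sketch: DBSCAN vs. k-means on two crescent-shaped clusters that are dense
# but not linearly separable. eps is the hop distance threshold, min_samples
# is the minimum-neighbors parameter (points failing it are labeled -1),
# and algorithm='ball_tree' uses a spatial index for neighbor queries.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5, algorithm='ball_tree').fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
print("Points labeled as noise:", int((db_labels == -1).sum()))
print("k-means clusters (fixed in advance):", len(set(km_labels)))
```

Note that k-means, told to find two clusters here, will still split the moons with a roughly straight boundary, which is the failure mode the lecture describes.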
Now, some disadvantages. One is that DBSCAN is sensitive to the problems of the Euclidean distance measure. If you remember, a few lectures ago I made a point of saying that whenever you see Euclidean distance, you should be thinking about its one big problem, which is the curse of dimensionality. As you get a very, very large number of dimensions, Euclidean distance starts to become somewhat meaningless: the distances get bigger and bigger, and the data set gets sparser and sparser, because the space it's embedded in is very, very sparse. But k-means also has this problem, because it also relies on Euclidean distance, and you can in fact define other distance measures and adapt both k-means and DBSCAN to use them. The other problem is that you're making an implicit assumption that the density that defines a cluster is constant throughout the data set. This is related to the concept of heteroskedasticity that we talked about, which may have seemed a little unusual at the time, but it comes up in various guises in different contexts, so I wanted to make sure I mentioned it.
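To see concretely what "somewhat meaningless" means, here is a small sketch (not from the lecture; the point counts and dimensions are arbitrary) that measures how pairwise Euclidean distances concentrate as the number of dimensions grows for uniformly random points.

```python
# Curse-of-dimensionality illustration: as the dimension d grows, the ratio
# between the smallest and largest pairwise Euclidean distances creeps toward 1,
# so "within epsilon" stops discriminating between near and far points.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    dists = pdist(rng.random((200, d)))   # all pairwise Euclidean distances
    print(f"d={d:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```

If you do need a different distance measure, implementations such as scikit-learn's DBSCAN accept a metric argument, so the epsilon-hop machinery itself doesn't change; only the notion of "within epsilon" does. [MUSIC]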