[MUSIC] So for example, this plot is actually the same one we generated to make the point about heteroscedasticity in the statistics segments, but what I've done is cut out some portions of the data here and here. And so now it looks to our eyes like there ought to be three clusters: one here, one here, and one here. But DBSCAN, given its sensitivity to this epsilon parameter, faces a choice. Do you define epsilon to be very small, about this kind of distance, so that this tight cluster is easy to detect? Or do you pick a slightly bigger epsilon so that you can capture these more diffuse clusters? Now, I've actually doctored this example so that it's okay to use the bigger value. But if you're not careful, the point spacing within the dense region here may be on par with the distance between these two clusters over here, in which case a small change to epsilon might all of a sudden put the whole data set into one cluster. And so an adaptive epsilon, one that depends on which region you're in, might be something you could think about. All right. And just like with k-means, you can think about how to parallelize DBSCAN, and the only trick I wanted to point out here is that you need to worry about a halo region around each cluster, or rather, around each segment. Here we divide up the space among, say, four different processors, and each processor is responsible for its own region of the space. In the interior of a processor's region, you can run DBSCAN as usual. But when you get close to the boundary of the region handled by that processor, you need to do some careful bookkeeping: all the points within this boundary region need to be sent to the neighboring processor for comparison, to see whether you need to relabel c4 and c6 as part of the same cluster, that is, whether those two points marked with pluses are within a distance of epsilon of each other or not.
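The epsilon sensitivity described above is easy to reproduce. Here's a minimal sketch using scikit-learn's `DBSCAN`, with synthetic data standing in for the doctored plot: two tight clusters close together plus one diffuse cluster. The cluster centers, spreads, and eps values are illustrative choices, not taken from the lecture's figure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative stand-in for the lecture's plot: two tight clusters that are
# close together, plus one diffuse cluster further away.
rng = np.random.default_rng(0)
tight1 = rng.normal((0, 0), 0.1, size=(50, 2))
tight2 = rng.normal((2, 0), 0.1, size=(50, 2))
diffuse = rng.normal((8, 0), 1.0, size=(50, 2))
X = np.vstack([tight1, tight2, diffuse])

for eps in (0.3, 2.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # DBSCAN marks noise points with the label -1.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With the small epsilon, the two tight clusters are found separately, while much of the diffuse cluster is treated as noise; with the large epsilon, the diffuse cluster is captured, but the two tight clusters merge into one, which is exactly the trade-off the lecture is pointing at.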
And so each processor will independently find all its clusters, but then you'll merge some of those clusters when you combine the results. You can actually express this as a sequence of MapReduce jobs, which I won't go into in detail, but there's some work done by a student here a few years ago that describes and analyzes the algorithm. Finally, I just want to wrap up with this great picture from the scikit-learn Python library website, which illustrates the differences between various clustering algorithms. We've only talked about two of these: k-means and DBSCAN. Mini-batch k-means is an online version of k-means that uses one batch of points at a time to update the centroids. But you can see some of the strengths and weaknesses here. K-means actually doesn't do that great on any one of these admittedly tricky challenge problems. Here, you would probably think that this inner circle should be one cluster and this outer circle a second cluster, but k-means doesn't quite get that right, depending on the starting condition. Meanwhile, these two clusters kind of overlap in space as well, and here, if you have a bad starting condition in between these two clusters, you might put them all in the same cluster. Really, any time the starting centroids land, say, up here and down here, one will sort of gravitate toward this cluster and the other will gravitate toward that cluster, or end up stuck in the middle, and so on. Meanwhile with DBSCAN, the hop from this cluster to this cluster is greater than epsilon, and so it does what seems to be the correct thing, which is to put the inner circle in one cluster and the outer circle in another. Similarly here, the distance between these clusters is enough that DBSCAN figures it out, and similarly here. And for the noisy case, DBSCAN puts all these points in the same cluster, which, to our eyes, is probably the right thing to do, okay.
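The concentric-circles case from that comparison figure is simple to recreate. Here's a minimal sketch, assuming scikit-learn's `make_circles` generator; the eps, min_samples, and noise values are illustrative choices, not the ones used in the scikit-learn figure.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two concentric rings, as in the scikit-learn clustering comparison figure.
# The "true" label y says which ring each point belongs to.
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

def ring_agreement(pred, truth):
    # Fraction of points assigned to the "right" ring, taking the better
    # of the two possible 0/1 label pairings.
    a = float(np.mean(pred == truth))
    return max(a, 1.0 - a)

print("k-means agreement with true rings:", ring_agreement(km, y))
print("DBSCAN agreement with true rings:", ring_agreement(db, y))
```

K-means, which can only carve the plane into convex regions around centroids, splits both rings roughly in half, while DBSCAN follows the dense rings themselves, because the hop between the inner and outer ring is larger than epsilon.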
The thing to take away here is that, at least on these four challenge problems, DBSCAN does the right thing in every case, and so it's a good algorithm to be familiar with. [MUSIC]