0:15

Hello. In this lesson we're going to introduce one of the first machine learning algorithms that we'll be discussing in more detail: the k-nearest neighbors, or k-NN, algorithm. This algorithm is important for a couple of reasons. First, it's a very simple algorithm to understand. Second, it's known as a lazy learner, or non-parametric technique, that simply uses the properties of the training data to make predictions about new test data. The key aspect here is that we don't actually build a model with this algorithm. Instead, it uses the training data during the prediction phase to make predictions about that new data.

So, by the end of this lesson, you should be able to articulate what's going on with the k-nearest neighbors algorithm. You should also be able to apply a k-nearest neighbors algorithm to classification or regression tasks by using the scikit-learn library. And you should know where you can apply this algorithm successfully and where you probably need to use a different algorithm.

Now, the entirety of this lesson is contained in the Introduction to k-Nearest Neighbors notebook. So what is this notebook going to do? Well, we're going to talk about this algorithm, and then we're going to move on to other ideas that are applicable to other algorithms as well, because k-NN provides a nice demonstration of some of them.

So, what is this k-NN algorithm? If you think about where data comes from, you can understand that data points near each other are often related. I like to use the example of people. People who are similar to you typically live near you, whether because of income, educational level, or religious or political beliefs. Thus, if you take a census and look at people who live in the same neighborhood, there are often a lot of similarities between them. So if somebody new moves into that neighborhood and you want to understand what they are probably like, the easiest thing to do is to average over their neighbors.

Now, as you add more neighbors, you can sometimes get a more accurate prediction. On the other hand, if you go too far away, you may start pulling in other neighborhoods with different properties, and you may lose accuracy. So the number of neighbors, or the distance your algorithm is exposed to when making predictions, can be important.
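The neighborhood analogy maps directly onto code. Here is a minimal k-NN sketch in plain Python with a hypothetical toy dataset (two made-up clusters, not data from the notebook): to predict a new point's label, take a majority vote over its k nearest training points.

```python
from collections import Counter
import math

def knn_predict(train, new_point, k):
    """train: list of ((x, y), label) pairs."""
    # Sort training points by Euclidean distance to the new point.
    by_distance = sorted(train, key=lambda p: math.dist(p[0], new_point))
    # Majority vote over the k closest labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two loose "neighborhoods": cluster A near (0, 0), cluster B near (10, 10).
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((10, 10), "B"), ((9, 10), "B"), ((10, 9), "B")]
print(knn_predict(train, (1, 1), 3))   # a point near cluster A -> "A"
```

Note there is no training step at all, which is exactly what "lazy" means here: all the work happens at prediction time.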

So, what are some things that we have to understand? First, we have to understand how to compute distances, and there are actually different ways to compute them; that will be important in trying to find the nearest neighbors. We'll also have to look at something known as the curse of dimensionality. The reason we talk about that is because you would usually think that as you add more and more features, or dimensions, to your data, you'll be able to make better predictions. But sometimes that can actually slow down the computations, because it becomes harder and harder to, in this case, find nearest neighbors. Alternatively, you have to get more and more training data to adequately sample the space that your data can occupy.

So, we'll start by talking about some of those basic concepts. We will introduce the idea of helper code, which I put in a separate file that you can open. We're not going to go into too much detail; the helper functions basically do things like load data, do some of the pre-processing, and make visualization plots. Then we'll look at k-nearest neighbors for classification. For that, we're going to talk about some performance metrics, and we're going to talk about some of the hyperparameters that might be important. And then we're going to go on and look at regression.

So we start with our standard imports, and then we move into the actual formalism of this method. What I'm doing here is generating some data, transforming it into a different space, and making a plot. So here I have our data, and I say: show me the five nearest neighbors. Here's my green point; that's my special point. The blue points are all the training data, and green is now a test data point, and I say: find me the five nearest neighbors. And you can see, here are the five nearest neighbors.

If I had said only four, then perhaps this point would have been dropped. And if I had said six, probably this point would have been picked as well. As you see, as the number of neighbors changes, the radius of a circle centered on our test point gets bigger and includes more and more points, and we'll be averaging over a greater area. That can be both good and bad, and that's one of the reasons the number of neighbors is a hyperparameter that we have to tune.
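The "find me the five nearest neighbors" step can be reproduced directly with scikit-learn's NearestNeighbors estimator. This is a sketch on randomly generated points (the notebook's actual arrays and transformation differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train = rng.uniform(0, 10, size=(40, 2))   # the "blue" training points
test_point = np.array([[5.0, 5.0]])        # the "green" test point

# Fit on the training data, then query the five closest training rows.
nn = NearestNeighbors(n_neighbors=5).fit(train)
distances, indices = nn.kneighbors(test_point)
print(indices.shape)   # (1, 5): one query point, five neighbor indices
```

The returned distances are sorted from closest to farthest, so changing `n_neighbors` just extends or truncates that sorted list, which is the circle-growing picture described above.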

The next step is distance measurements. Typically, you might just use the Pythagorean theorem: the square root of (x2 − x1) squared plus (y2 − y1) squared. You've seen that for years, but that's really only applicable for data that follows the Euclidean distance metric. If you've plotted points on a Cartesian plot, that would be a good example of where you would use it. But you might want to do something different for other types of data. If it's currency data, or even categorical or text data, what kind of distance would you use?

For this, we define the idea of a metric, which is how you measure distance for a particular type of data. And there are a number of different metrics. One is the Euclidean metric that we saw; that's what we know as the L2 norm. Another is Manhattan, where you restrict distances to lie along grid lines. The idea came from taxicab distances on the island of Manhattan, where the streets are all laid out in a grid, one block square, and you had to stay on the streets; you couldn't drive through buildings. This one is known as the L1 norm.

There are also some others; I mention a few here. The haversine metric is for distances traveled on the surface of a sphere. If you've ever seen the path an airline takes drawn on a flat map, it looks curved. The reason is that the plane is actually traveling across the surface of the sphere of the Earth. So when you flatten that out, the shortest distance between New York and Los Angeles, or San Francisco and Beijing, is actually a curved path, and we use the haversine metric for that. Then there are some other ones, such as the Chebyshev and Minkowski metrics.

We demonstrate calculating these in this particular code cell, and you should get a little familiar with it because some of this may be new to you. So for instance, the Euclidean distance you should understand already. For the Manhattan distance, you're just summing up the block travel distances. Then there are the Chebyshev and the Minkowski distances.
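As a rough sketch of what that code cell computes, here are the four metrics written out from their definitions in plain Python (the notebook likely uses library routines instead; this is just to make the formulas concrete):

```python
import math

def euclidean(a, b):
    # L2 norm: straight-line distance.
    return math.dist(a, b)

def manhattan(a, b):
    # L1 norm: sum of per-axis "block" distances.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # L-infinity norm: the largest single-axis difference.
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # General Lp norm: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7
print(chebyshev(a, b))        # 4
print(minkowski(a, b, 3))     # between Chebyshev and Manhattan
```

Notice how the same pair of points gets a different distance under each metric, which is exactly why the metric is itself a modeling choice for k-NN.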

I mentioned the curse of dimensionality briefly: as the number of dimensions grows, the number of points we need to maintain the same sampling density jumps very fast. And that, of course, is a problem. We can look at this in terms of a visualization and ask: if we keep the number of points the same as the number of dimensions increases, what's our relative density? You can see that it drops very quickly. So really, if you have a very high-dimensional dataset, you need a lot of training data to sufficiently span that space and make accurate predictions for new, unseen data.
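A back-of-the-envelope version of that density argument (my own illustration, not the notebook's plotting code): with a fixed budget of n points spread over d dimensions, the effective number of samples per axis is n to the power 1/d, which collapses quickly as d grows.

```python
# Fixed budget of points; watch the per-axis sampling density fall with dimension.
n = 1000
density = {d: n ** (1 / d) for d in (1, 2, 3, 5, 10)}
for d, per_axis in density.items():
    print(f"{d:2d} dimensions -> about {per_axis:7.1f} samples per axis")
```

A thousand points sample a line densely, a plane reasonably, and a ten-dimensional space at only about two samples per axis, which is why nearest neighbors become far apart and unrepresentative in high dimensions.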

We talk a little bit about the helper code; you can read about it and understand it. There are a few functions I've created that load the data, do some pre-processing, and make plots. We demonstrate one of them here: we're importing from our helper code and calling a function to get the data. This loads the data, splits it into training and testing sets, and makes this plot. You can see classes 0, 1, and 2; those are setosa, versicolor, and virginica. We also plot some test data to show where the test data comes from, and we can analyze this.

So the next thing is k-nearest neighbors again. How do we use it? We import our module, we create our estimator, we get our training and testing data, we fit our model to the training data, and then we can make predictions on our test data. So here we go; this is it. We saw this in the first lesson notebook, where the introduction to machine learning demonstrated classification with the k-nearest neighbors classifier. We specified one hyperparameter, the number of neighbors, we fit the model, and we score the model.

That's very simple. We can also try changing a few things. We can look at the confusion matrix. Here we explicitly make predictions for our test data, and then see how well we do on a per-class basis. This is called a confusion matrix. Each row is a different predicted class, and each column is the actual class. I've made a convenience function called confusion, which will plot this so it's a little easier to see, but it's the same data. Here you have the predicted label and the true label, and you can see the relationship between them.
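The `confusion` helper in the notebook is a plotting wrapper; underneath, the matrix itself can be computed like this. One caution: scikit-learn's own `confusion_matrix` puts the true classes on the rows and predictions on the columns, which may be transposed relative to the notebook's display.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test))
print(cm)   # in scikit-learn's convention: rows = true class, columns = predicted
```

The diagonal holds the correctly classified test points for each class, and everything off the diagonal shows exactly which classes get confused with which.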

We can also measure other metrics such as precision, recall, and the F1 score. These can be generated and displayed on a per-class basis with something called the classification report, which is demonstrated here. We see how well our model performs for the different classes, as well as an average score. The support is the number of test data points in each class, and you can see those are all roughly the same.
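The report described above comes from `classification_report` in scikit-learn. A minimal sketch, using the same plain iris split as before:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
report = classification_report(
    y_test, knn.predict(X_test),
    target_names=["setosa", "versicolor", "virginica"])
print(report)   # precision, recall, f1-score, and support per class
```

Because the split is stratified, the support column comes out roughly equal across the three classes, matching what the lesson points out.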

The last thing I want to show you about classification is the idea of decision surfaces, and how they can be used to understand the impact of hyperparameters. For the nearest neighbors algorithm, we typically vary the neighbors hyperparameter, which might be two, three, four, or five. The code here simply makes different plots of a decision surface, and here's an example. We've restricted ourselves to two dimensions, sepal width and petal width. Here's the data that we're actually predicting on. Underneath it is a grid of points, and you can see the colors change as we move around. This is known as a decision surface because it tells us: if we put a test data point right here, what would it be classified as?

In this case, you can see it's going to be blue. If it were over here it would be green, and if it were over here it would be brown. So it shows you how this algorithm is splitting up the underlying space. Now, this was for three nearest neighbors. If I change it to five nearest neighbors, you'll notice that it changes dramatically. Look up here; this is very different from what we had before. And it changes a little bit over here as well. The idea of the decision surface is that it shows you the impact of a hyperparameter on predictions across the whole space you're looking at. So it's a good way to figure out a good value for a hyperparameter if you don't want to resort to just scoring.
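The core of any decision-surface plot is just predicting every point on a dense grid over two features; the notebook's plotting code then colors the grid. A sketch of that grid step (feature indices 1 and 3 are sepal width and petal width in scikit-learn's iris ordering):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X2 = X[:, [1, 3]]   # restrict to sepal width and petal width, as in the plot
knn = KNeighborsClassifier(n_neighbors=3).fit(X2, y)

# Build a 50x50 grid spanning the two features, then classify every grid point.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 50))
grid = np.column_stack([xx.ravel(), yy.ravel()])
surface = knn.predict(grid).reshape(xx.shape)   # one class label per grid cell
print(surface.shape)   # (50, 50)
```

Changing `n_neighbors` and recomputing `surface` is all it takes to see the boundaries shift the way the lesson describes.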

Then here we've increased it to nine, and again it's changed things slightly. Notice that we now have these two values here that are predicted correctly. There are other things you can look at. One of them is called weights. For instance, you may want to say: if a data point is close to me, give it a higher weight, and give data points that are farther away less weight. So we can do the same thing. Here we're going to change the weighting scheme from uniform, where all data points have the same weight, to the alternative, where we weight points by how far away they are from us. So if they're farther away, they have a lower weight.

So we can look at the decision surface in this case. Here is uniform weighting; this is the same plot we saw before for three nearest neighbors. Then we can look at distance weighting, and you can see how it changes. Notice that this point is now correctly classified, and this one is as well; this is the only one that's not. That's interesting in how it all works, and it probably gives you a bit of a feel for why decision surfaces are important.
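In code, the weighting scheme is just the `weights` hyperparameter of the classifier. The notebook compares the two schemes visually; this sketch compares them by score instead, on the same plain iris split used earlier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

scores = {}
for weights in ("uniform", "distance"):
    # "uniform": every neighbor votes equally.
    # "distance": closer neighbors get proportionally larger votes.
    knn = KNeighborsClassifier(n_neighbors=3, weights=weights)
    knn.fit(X_train, y_train)
    scores[weights] = knn.score(X_test, y_test)
print(scores)
```

Which scheme wins depends on the data; that is exactly why `weights`, like the neighbor count, is a hyperparameter to tune rather than a fixed choice.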

The next thing is regression: k-nearest neighbors for regression, which is very similar to classification. We simply change our estimator, and we get our data. In this case we're just going to make a plot of the values, and then we're going to show how the regression works. So here are some data points. Here's the k-nearest neighbors regressor for five neighbors with uniform weighting, and here's five neighbors with distance weighting, just to give you an introduction to regression in this particular case.
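"Simply change our estimator" means swapping in `KNeighborsRegressor`, which averages the neighbors' target values instead of voting. A sketch on toy noiseless sine data (the notebook's dataset differs):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel()

scores = {}
for weights in ("uniform", "distance"):
    # Same estimator pattern as classification; the prediction is now a
    # (possibly weighted) average of the five nearest targets.
    reg = KNeighborsRegressor(n_neighbors=5, weights=weights).fit(X, y)
    scores[weights] = reg.score(X, y)   # R^2 on the training data
print(scores)
```

With distance weighting, each training point is its own zero-distance neighbor, so the training fit is exact; uniform weighting smooths over the five neighbors and fits the training data a little less tightly.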

So with that, I'm going to go ahead and stop. K-nearest neighbors is a very simple algorithm, and it should give you a pretty good feel for how things work. If you have any questions, let us know. And good luck.
