Let's go back to that simple example we had earlier. We had this data on houses, and we're trying to predict the house price class being low or high. Let's set the price class aside for a minute and look at the variables we are using: the total area, the living area, the floor space, the maximum number of floors in the building, the type of material used, the year it was built, the number of rooms, and the kitchen area. The price class itself, of course, is not used as an input, because that's what we want to classify. The idea is this: if you give me a house, find, say, three houses that are very similar to it, see what price class they belong to, and then classify the new house into that class.

I have to tell you that I'm going to use RStudio, because this function, being so simple, is not available in Rattle. I'll first walk you through the R commands, and then run them in RStudio to show you how it works.

In the first part, we read the data; you don't have to type anything, it's already done. Then we pick the input variables: column one is the ID, which we don't use; columns two to nine are the inputs; and the target variable is the price class, which is column number ten. The next trick we play is normalization. We don't want something measured in inches versus meters to sway the results, so we center and scale the variables, as we have done in other examples. The next step you are familiar with: we split the data on a random sample in the ratio 80 to 20, taking the independent values, the X values, into X_train and X_test, and separating the dependent values into y_train and y_test.

First, we run the model with k equal to 5. We'll see that the accuracy we get is about 67 percent on our validation dataset; that means we are able to predict 67 percent of the price classes correctly using the five nearest neighbors. Then we loop through different values of k, starting from 1 and going all the way to 20, calculating the accuracy for each value of k. When we run this and graph the results, you will see that the highest accuracy on the validation set comes at about k equal to 16. The accuracy goes up and then flattens out; you could stop a little earlier if you wanted, rather than going all the way to 20. Once it starts flattening out, we just stop there.

Now remember, in this part we use only three variables for prediction: the first three variables in the dataset, which, if you go back, are the total area, the living area, and the floor space. In the last part, I'll use all the variables in the dataset, and when I do, the accuracy goes up to about 74 percent.

So let me run this and show it to you very quickly, so that you can try it afterwards. Actually, I'm going to call the R script directly instead of starting RStudio. I have this set up already here, and we're looking at k nearest neighbors. I just click on my R script file. As you can see, my environment is empty because no data has been loaded yet, and here is your script file, which opens up on your left.
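Before we step through the run, here is a minimal sketch of what this first part of the script does, assuming the data sits in a file named house_prices.csv (a hypothetical name; the file in the course materials may be called something else), that columns 2 to 9 are numeric, and using the knn() function from the class package. The actual course script may differ in details.

```r
library(class)   # provides knn()

houses <- read.csv("house_prices.csv")   # hypothetical file name
dim(houses)                              # 11,995 rows and 10 columns in the lecture's data

X <- scale(houses[, 2:9])      # drop the ID (column 1); center and scale the inputs
y <- factor(houses[, 10])      # the price class is the target

set.seed(42)                   # fix the seed so the split is reproducible
in_train <- sample(c(TRUE, FALSE), nrow(houses),
                   replace = TRUE, prob = c(0.8, 0.2))

X_train <- X[in_train, ]
X_test  <- X[!in_train, ]
y_train <- y[in_train]
y_test  <- y[!in_train]

# k nearest neighbors with k = 5, using only the first three inputs
pred <- knn(train = X_train[, 1:3], test = X_test[, 1:3], cl = y_train, k = 5)
table(pred, y_test)            # confusion matrix
mean(pred == y_test)           # accuracy; about 0.67 in the lecture
```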
Now, it does everything for you automatically; all you have to do is use the Run command. Every time you click Run, it advances one line. So it's installing packages for you, loading the library, installing the other packages you need; you can read through it at your leisure. Here it is reading the file. Because the CSV file is in the same place as the R file, it has no problem; otherwise, you may have to put the CSV file exactly where the script file is, or wherever R expects the file to be.

Next it asks for the dimensions, and it says the data has 11,995 rows and 10 columns. Then it drops the first variable, which is the ID, retaining columns two to nine as the classifiers, and defines the target variable y, which is column number ten. The split is done randomly, so we fix the seed; that way, when you run it, you get the same results as I do. If you don't set the seed, then due to randomness your results will differ from mine. It then normalizes the values, centering and scaling them, and you can see the normalized values; their range is much smaller because each variable is divided by its standard deviation.

Then we create the sample. The sample is just a logical variable saying whether each row is in the sample or not, and the training and test data are formed using it. You can see it has created 9,595 rows of training data and 2,400 rows of test data; together, they make up the original 11,995. The target values are partitioned using the same logical variable.

Then, as I said earlier, we use only the first three inputs of the dataset, so we are building the model with only three variables; you can see 1 to 3 out here. Then it summarizes the model, creates a confusion matrix, and reports the accuracy. It puts the classifications into three categories, high, medium, and low prices, and says the accuracy of classification is 67 percent.

Well, we're not happy, because we just took k equal to 5. In this part, we step through different values of k. That's what this loop is doing: for i from 1 to 20, it steps through the different values of k; you can read it carefully. Then we plot the accuracy; as you can see, it joins the data points with a line. The maximum is at k equal to 16. How do we know that? We ask where the maximum accuracy occurs, and the answer is 16; you can see it on the plot on the screen. Evaluating the model at k equal to 16 gives an accuracy of 69 percent.

Now, we run the model with all eight variables; recall that the earlier model used only three. We get a new accuracy curve, with the maximum at k equal to 15, and the highest accuracy with all the variables is 74 percent. Remember, instead of three variables we used all of them, and we got 74 percent. You can play with this at your leisure, and we will give you some exercises to try this out on, okay?

So to summarize: it's not very complicated to implement; you can implement it in any language you want. Because of the search time, it works well on small to moderate datasets.
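Continuing the sketch from above, the k-loop part of the script looks roughly like this, with the same variable names carried over; again, this is an illustration rather than the course's exact code.

```r
# Step through k = 1..20 and record the validation accuracy for each
accuracy <- numeric(20)
for (i in 1:20) {
  pred_i <- knn(train = X_train[, 1:3], test = X_test[, 1:3],
                cl = y_train, k = i)
  accuracy[i] <- mean(pred_i == y_test)
}

plot(1:20, accuracy, type = "b",
     xlab = "k", ylab = "Validation accuracy")
which.max(accuracy)          # the lecture finds k = 16, at about 69 percent

# For the all-variables model, drop the column subset:
pred_all <- knn(train = X_train, test = X_test, cl = y_train, k = 15)
mean(pred_all == y_test)     # about 0.74 in the lecture
```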
It doesn't do very well when there are too many features, because with too many features this distance function doesn't work well: your data looks very sparse in multiple dimensions. So you may like to reduce the dimensionality or use some other method. There are also extensions of this method to categorical data, which you will find if you search on the internet.

I find it a very handy method, and it works, unless one class has too many points, in which case it starts classifying everything as that class. So one of its drawbacks is that if there's a bias in favor of one particular class, it doesn't work so well; there are ways of adjusting for that, and a sketch of one such adjustment follows at the end of this section.

Let me give you one interesting example, which I used maybe yesterday. I asked my students, [inaudible] of graduating students, various questions: their background, their major, the subjects they took, their preference for living in a rural area or a city, whether they want to live close to work or not, and so on. Then, finally, I asked which city they are going to live in, and which suburb, if they know. So basically, the data now says: for every student, I know a number of features, and I know their living preferences.

The idea is that next year we can have a student who hasn't yet got a job, or who's thinking of getting one, and say, "Hey, we know all your preferences, so we can pull three students who are very similar to you and see what city and what suburb they went to." It's interesting because then, instead of randomly going and living somewhere in the US, they can say, "Last year my seniors went here, and that's because they are very similar to me, and they got jobs." That may be a nice way of creating the data and keep augmenting it every year. You can imagine that for every year we collect this data, students can just query this database and say, "If I am an accounting major who likes living in a city, who wants to live close to work, loves music, blah blah blah," and it answers, "Here are two or three cities you may like to go to, and these are the suburbs you may like to live in." That would be useful for students. But at the heart of it is this idea: we use nearest neighbor matching on a dataset that's not too large, with not too many features, which allows students to figure out where to live.
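As for the class-imbalance adjustment mentioned above, here is a toy illustration of one common idea, not the lecture's code: weight each neighbor's vote by the inverse frequency of its class in the training data, so the majority class cannot win on sheer numbers alone. The function name and the choice of Euclidean distance are my own assumptions.

```r
# Classify one new point with class-frequency-weighted voting (illustrative)
weighted_knn_one <- function(X_train, x_new, y_train, k = 5) {
  y_train <- factor(y_train)
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  nb <- y_train[order(d)[1:k]]     # labels of the k nearest neighbors
  # each class's vote: its count among the neighbors / its count in training
  votes <- sapply(levels(y_train),
                  function(cl) sum(nb == cl) / sum(y_train == cl))
  names(which.max(votes))          # class with the largest weighted vote
}

# Usage, with the objects from the earlier sketches:
# weighted_knn_one(X_train, X_test[1, ], y_train, k = 5)
```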