This lecture is about random forests, which you can think of as an extension of bagging for classification and regression trees. The basic idea is very similar to bagging: we bootstrap samples, taking resamples of our observed training data, and we rebuild a classification or regression tree on each of those bootstrap samples. The one difference is that each time we split the data, we also bootstrap the variables; in other words, only a random subset of the variables is considered at each potential split. This makes for a diverse set of potential trees. So the idea is that we grow a large number of trees and then either vote across them or average them to get the prediction for a new observation.

The pros of this approach are that it's quite accurate; along with boosting, it's one of the most widely used and highly accurate methods for prediction in competitions like Kaggle. The cons are that it can be quite slow, since it has to build a large number of trees, and it can be hard to interpret, in the sense that you have a large number of trees averaged together, each built on a bootstrap sample with bootstrapped variables at the nodes, which can be complicated to understand. It can also lead to a little bit of overfitting, and that is compounded by the fact that it's very hard to tell which trees are driving the overfitting, so it's very important to use cross validation when building random forests.

Here's an example of how this works in practice. The idea is that you build a large number of trees, each based on a different random subsample of your data, and at each node a different subset of the variables is allowed to contribute to the split. If we get a new observation, we run it through the first tree and it ends up at a particular leaf at the bottom of that tree, which gives a particular prediction. We take that same observation and run it through the next tree, where it goes down a slightly different path and gets a slightly different set of predictions, and then through the third tree for yet another set of predictions. What we then do is average those predictions together to get the predicted probability of each class across all the different trees.

So I'm going to show you an example of how this works. I load the iris data and the ggplot2 package for plotting, and I build a training set and a test set, fitting the model only on the training set. In the caret package I use the train function just as I've used it for the other model building: I pass it the training data set and tell it method = "rf", which is the random forest method, fitting Species as the outcome with all the other variables as potential predictors. I'm also setting prox = TRUE; you'll see why I'm doing that in a minute, because it produces a little bit of extra information I can use after building the model fit. The output tells me the model was built with bootstrap resampling and that a range of values of the tuning parameter was tried.
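Here's a minimal sketch of that setup, assuming the caret, ggplot2, and randomForest packages are installed; the seed and the 70/30 split proportion are illustrative choices, not fixed by the lecture.

```r
library(caret)
library(ggplot2)
data(iris)

# Split into training and test sets; the model is built on training only
set.seed(125)
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# Fit a random forest; prox = TRUE stores proximity information
# that we will use later to compute class centers
modFit <- train(Species ~ ., data = training, method = "rf", prox = TRUE)
modFit
```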
The tuning parameter here is mtry, the number of variables that are randomly sampled as candidates at each split.

I can look at a specific tree in the final model using the getTree function. So I apply getTree to the final model and ask for the second tree, and this is what that tree looks like. Each row corresponds to a particular split: you can see the left daughter node, the right daughter node, which variable we're splitting on, the value at which that variable is split, and the prediction that comes out of that particular split.

You can also use the proximity information to compute class centers, the centers of the predicted values for each class. Here I'm looking at two particular variables, petal length and petal width, plotting petal width on the x axis and petal length on the y axis. To get the class centers I call classCenter, passing it the training data for those two variables, the species labels, and the proximity matrix we asked for in the previous fitting step. Once I've created the centers data set, along with the species labels, I plot petal width versus petal length colored by species in the training data; that's what the qplot command does. Then I add points on top, again petal width and petal length colored by species, but now coming from irisP, the data set of class centers. Each dot represents an observation, and the x's show the class centers for each of the different predictions. You can see that each species gets a prediction for these two variables that sits right in the center of the cloud of points corresponding to that particular species.

You can then predict new values using the predict function: you pass it the model fit and the testing data set. Here I'm also creating a variable, predRight, which records whether we got the prediction right, in other words whether our prediction is equal to the species in the testing data set. I can then make a table of our predictions versus the true species. I can see, for example, that I missed two values with my random forest model, but overall it was highly accurate in the prediction. I can then look and see which two I missed, and perhaps unsurprisingly the two misclassified points, marked in red here, lie in between two separate classes. Remember there was one class up in the right corner and one class right in the middle, and the two points that lie right on the border between those clouds are the ones that were misclassified. So you can use this to explore where your prediction is doing well and where it's doing poorly.

Random forests are usually one of the top performing algorithms, along with boosting, in any prediction contest. They're often difficult to interpret because of the multiple trees we're fitting, but they can be very accurate for a wide range of problems.
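Here's a sketch of the code being described, assuming the fit above; the object names irisP and predRight follow the lecture's description, and the plotting calls are one reasonable way to reproduce the figures.

```r
library(randomForest)

# Look at the second tree in the forest; each row is one split, showing
# the daughter nodes, split variable, split value, and prediction
getTree(modFit$finalModel, k = 2)

# Class centers for petal length/width, using the stored proximity matrix
irisP <- classCenter(training[, c(3, 4)], training$Species,
                     modFit$finalModel$prox)
irisP <- as.data.frame(irisP)
irisP$Species <- rownames(irisP)

# Training data colored by species, with class centers drawn as x's
p <- qplot(Petal.Width, Petal.Length, col = Species, data = training)
p + geom_point(aes(x = Petal.Width, y = Petal.Length, col = Species),
               size = 5, shape = 4, data = irisP)

# Predict on the test set and flag which predictions were right
pred <- predict(modFit, testing)
testing$predRight <- pred == testing$Species
table(pred, testing$Species)

# Plot the test data, highlighting the misclassified points
qplot(Petal.Width, Petal.Length, colour = predRight, data = testing,
      main = "newdata Predictions")
```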
For more information you can read about random forests directly from the inventor, Leo Breiman; the Wikipedia page for random forests is also quite good, and The Elements of Statistical Learning covers them as well. Finally, you can check out the rfcv function to make sure that cross validation is being performed, though the train function in caret also handles that for you.
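As a minimal sketch of that cross validation check with rfcv, assuming the same training split as above (the fold count shown is rfcv's default):

```r
library(randomForest)

# Cross-validated prediction error as the number of predictors shrinks;
# column 5 of iris is Species, so training[, -5] keeps only the predictors
cvResult <- rfcv(trainx = training[, -5], trainy = training$Species,
                 cv.fold = 5)
cvResult$error.cv  # CV error rate for each number of variables used
```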