So up until now we've focused on methods of building individual classifiers or regressors. But why limit ourselves to just one model? Why not play as a team? And if we do this, how should we form a team? That's really the role of today's lecture and the role of ensembles. We can take all of these different classifiers that we create, we can parameterize them, and we can build them all together into one team and have that team make a prediction. So there are several kinds of ensemble approaches. The first one, and the most common one, is called voting. And this is very simple: we treat each classifier as if it were a citizen in our community, and we just take the majority vote on a new classification task. This is called hard voting. If the models support probabilistic interpretations, like decision trees do, then we can combine the probabilities instead, which is called soft voting. So we can imagine that when we're trying to classify who's going to win a hockey game, for instance, we might have three different models: a logistic regression, maybe a decision tree, and maybe an SVM. If the SVM and the decision tree say team A is going to win and team B is going to lose, then that's what we go with. And if it turns out that team A gets all three votes, then that's what we go with too. So conceptually, voting is a very simple ensemble method. Now, bagging is a little bit different. There we take a portion of our training data and we create a model, and then we repeat this process for another random slice of the training data. We can either sample with replacement or without replacement; the latter is called pasting, and it's a kind of bagging. Then we use a vote of the models. So here we could actually take lots of different chunks of data, and we can choose them randomly or we can choose them in a principled way. Maybe we want to train on one year, and then another year, and then another year, and then bring those together into an ensemble to predict on some held-out year of data. Now, this feels a lot like cross-validation, and it is very similar, but in cross-validation we're just splitting up our data set, often into five or ten folds, to understand the accuracy of our final model better. With a bagging ensemble, we're doing this to actually improve the overall accuracy of our model. Now, boosting is a little different even from bagging. There we're going to take a model, we're going to tune it, and we're going to focus on the misclassifications from that model. So let's say we're predicting the outcome of a game and about 20% of our predictions were wrong. Then we're going to train new models to deal with just those misclassifications, and we're going to do this as an iterative process as well. Now, the problem with boosting is that it's very hard to do in parallel, because you take one model, then you look at its misclassifications, then you train another model on those misclassifications, then you look at its misclassifications, and so forth. So when it comes to training many different models in general, when it comes to building ensembles, things become quite computationally complex, especially if you're trying to do something like a grid search and hyperparameter tuning at the same time. And with boosting, unfortunately, you run into a bit of a problem because it's a serialized task: you train the first model, then a second, then a third. You can't just fork it over many different machines and train all of them at once.
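To make the hard versus soft voting distinction concrete, here's a minimal sketch, assuming a generic feature matrix X_train and labels y_train; the model choices just mirror the hockey example, and none of this is the lecture's actual notebook code:

    # Minimal hard vs. soft voting sketch; X_train and y_train are placeholders.
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import VotingClassifier

    models = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=3)),
        ("svm", SVC(probability=True)),  # probability=True so soft voting can average class probabilities
    ]

    hard_vote = VotingClassifier(estimators=models, voting="hard")  # majority vote of predicted labels
    soft_vote = VotingClassifier(estimators=models, voting="soft")  # average of predicted probabilities

    hard_vote.fit(X_train, y_train)
    soft_vote.fit(X_train, y_train)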
And then stacking. Stacking is probably my favorite ensemble, and it's built around meta-learning. Here you create something called a meta-learner, and the meta-learner's job is to take a whole bunch of different models and learn which model is informative for new instances. So this is a lot like having a team of athletes under the direction of a coach. Each of the athletes has their own knowledge and input into the problem, but the coach gets to call the shots. And so with a stacking ensemble, we focus first on building a whole bunch of our learners, and then on trying to predict our data to understand which models we should trust, when, and for which cases. So let's go take a look at how we build these ensembles. I think you'll find that it's surprisingly easy with scikit-learn, and surprisingly powerful. For these ensembles, let's take a look at our baseball pitch prediction again. I'm just going to bring in a bunch of the data that we've already been looking at and process it just like I did before. Okay, so now we want to create our training and validation sets, and this time I'm going to stratify our sample across the different kinds of pitches that exist. You can see here that I'm setting our training set size to 10%. Again, I always set the random state so that you can reproduce it. And here I am saying that I want the stratification to be on this pitch numeric column. Then we can fill in the missing values with the mean. Now, we've done this before, but I always want to remind you of it because it's really important. First, choosing the mean on my end is an arbitrary choice; it might not even make sense here, and I'm just doing it for a quick lecture. But second, you must do this after you split your training and validation sets. These two data sets have to be treated as independently as possible, or you risk leaking information from one set to the other. Okay, now we can verify that these look correct as far as stratification goes by looking at the histograms. So I'm going to take a look at our training set and our validation set, and they should look the same as far as histograms go, and this shows that they do. Across the bottom will be our pitch type; remember, this is a numeric value, and we dropped one of the values because it's not very common. The other axis is going to be the prevalence of that pitch. And so we can see that this one here is quite common, and six here is actually very uncommon, right? So let's start by looking at our voting classifier. As a reminder, this is just a majority vote of the models that we've trained. We just pull it in from sklearn.ensemble, and I'm going to create three different decision trees. The very first one I'm going to create with a max depth of three. The next one I'm going to give a stopping parameter of min samples per leaf of seven. This means it will create a tree of arbitrary depth, but it will stop splitting whenever a split would leave a leaf with fewer than seven items, regardless of what the purity level is. And for the last one, I'm going to create a decision tree which balances the classes, because we've got a lot of unbalanced data here. Now, ensembles are just another kind of classifier, but many of the ensembles, like the voting ensemble here, can do their work in parallel.
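Here's a rough sketch of that setup, assuming a DataFrame df with a pitch_numeric target column; the column name, the random_state value, and the variable names are my placeholders, not necessarily what the lecture notebook uses:

    # Split first, then impute, so the training and validation sets stay independent.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import VotingClassifier

    X = df.drop(columns=["pitch_numeric"])
    y = df["pitch_numeric"]

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.1, random_state=42, stratify=y)

    # Fill missing values with each set's own mean, only after splitting.
    X_train = X_train.fillna(X_train.mean())
    X_val = X_val.fillna(X_val.mean())

    # Three differently parameterized decision trees.
    dt0 = DecisionTreeClassifier(max_depth=3)
    dt1 = DecisionTreeClassifier(min_samples_leaf=7)
    dt2 = DecisionTreeClassifier(class_weight="balanced")

    voters = VotingClassifier(
        estimators=[("dt0", dt0), ("dt1", dt1), ("dt2", dt2)],
        voting="hard",
        n_jobs=-1)  # use all available CPUs
    voters.fit(X_train, y_train)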
So here I'm going to set the n_jobs parameter to -1. This tells the classifier to use all of the system's CPUs. With the voting classifier and its three different decision trees, for instance, this will fire up three different Python processes, one for each of the decision trees, when it goes to actually build the classifier. And then I fit the classifier to our training set. All right, so the result of that is our voting classifier, and it tells us some information about what we've actually passed in as far as parameters. Now the voters variable actually holds those three different models, and so we can do things like look at the accuracy or other evaluation measures. So with the scikit-learn API, an ensemble looks like just another kind of classifier, which is really quite a powerful approach, and one of the reasons I really like working with scikit-learn. So we'll do voters.score here and look at the training data. Okay, so our accuracy was about 0.84, and of course now we want to look at the validation data as well. All right, so you can see here that our training accuracy score is much higher than our validation accuracy score, and this is really to be expected. All of the previous comments about using cross-fold validation to improve your understanding of the accuracy, or even better, not using accuracy at all and instead looking at something like a confusion matrix, still apply. But instead of tuning this more, I want to move on to our next ensembling technique: the bagger. Okay, so recall that the bagging approach to ensembles creates a number of different classifiers, but does so from a single model definition, and acts more like cross-validation, pulling out random subsets of the data. So in this approach we only have one model definition. Let's use one of our decision trees; I'm just going to pick dt0 arbitrarily. We can set how many classifiers we want the bagger to use with the n_estimators parameter. There are lots of other parameters; the one I'm going to use here sets the maximum number of features each classifier in a given bag can use to 70%. So you see here, I've set the base estimator, that is, the classifier, or rather the classifier parameterization, that I want to use. I've said that I want ten of these created, ten different trees, and that I want each of them to use up to 70% of the features. That's a feature of the bagging classifier I can take advantage of. And of course I want to train all of these in parallel. So let's fit our data there. Okay, so with both the bagger and the voter, we can actually explore the individual models which have been created, using the estimators attribute. Let's take a look at what that looks like with the bagger. You can see that the only difference between the individual classifier models is in their parameterization, here the random state. So it's created ten decision tree classifiers, as I asked, and they all have a max depth of three. That's the hyperparameter setting that I decided on, and it has just randomized the state of each one and recorded it for repeatability. So if I find something interesting here, I can go see what that random state value was and explore it more later. And since these are all just regular decision trees underneath, we can do anything we want with them. We could plot them, for instance.
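Here's a sketch of that bagging setup, including pulling out and plotting one of the fitted trees; variable names are my own, and note that newer scikit-learn releases call the first parameter estimator, while older ones call it base_estimator:

    # Ten bagged copies of the dt0 parameterization, each seeing up to 70% of the features.
    import matplotlib.pyplot as plt
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import plot_tree

    bagger = BaggingClassifier(
        estimator=dt0,      # 'base_estimator=dt0' on older scikit-learn versions
        n_estimators=10,    # ten trees in total
        max_features=0.7,   # each tree is blinded to 30% of the features
        n_jobs=-1)          # train the bags in parallel
    bagger.fit(X_train, y_train)

    # The individual fitted trees are available for inspection.
    print(bagger.estimators_)

    # Plot the first tree in the bag.
    plt.figure(figsize=(12, 6))
    plot_tree(bagger.estimators_[0], filled=True)
    plt.show()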
So here I'm just going to take bagger.estimators_ sub zero, pull it out, and plot it just like we would before, and there's our tree, if we wanted to actually explore what its rules are. So even though it's an ensemble, the bagger gives us some opportunity to inspect what the individual models, the individual trees, look like. We can of course take a look at the score as well, so I'll look at the bagger's score. Just looking at the accuracy score on the training data, it's pretty abysmal in comparison to our previous model. But when I apply it to the validation data, it's actually not that bad: 0.62. So we can see here that the amount of overtraining, or overconfidence, our bagger has is much less than our voter's. Our voter seemed quite confident, with a training accuracy above 80%, but they both perform pretty similarly on the validation set. Now remember, one of the big differences between the bagger and the voter is how they capture diversity of models. The voter captures it by saying: okay, there are three different kinds of models, they've all been parameterized differently, and I'm going to train each one. The bagger captures it by saying: okay, I'm going to blind you to 30% of the features, take a chunk of the data, and train you as a model, and now I'm going to do the same to the next one, and the next one, and then federate all of these together into the ensemble. So they both capture diversity of models, but they do so in different ways. All right, with the boosting ensemble, our goal is to build an additive model, where our second model builds on the first, the third model builds on both of those, and so forth. There are various algorithms that can be used to do this; a common approach in scikit-learn is the gradient boosting classifier. This method doesn't take a model to use: it is its own model, and it actually uses a series of regression trees underneath. Since this is a tree, we can set various tree parameters. Again, we set the n_estimators parameter to determine the maximum number of trees that should be built. But remember, these estimators are different from the estimators in the previous bagger. In the bagger, all of those estimators will always be built; in the booster, the second estimator is only built on the misclassifications from the first estimator. And so even if we set this to a large value, the size of the data set each successive estimator sees is actually smaller, so progressively they'll be faster to train, and the number of them might actually run out, because we might hit a purity level that's quite clean before we get to 100. That's what I've chosen here for the number of estimators: n equals 100. I've also chosen min samples per leaf of seven and a max depth of five. Actually, if you get rid of max depth, this will take much longer to run. The performance of models is something you start to get a feeling for over time: what will lead to overtraining and overfitting of your data, but also what will take a long time to run without giving you any better understanding of your data. All right, so let's fit this booster.
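Here's a sketch of that gradient boosting setup, with the hyperparameter values the lecture mentions; the variable names are my own:

    # Gradient boosting builds its trees sequentially, so there is no n_jobs here.
    from sklearn.ensemble import GradientBoostingClassifier

    booster = GradientBoostingClassifier(
        n_estimators=100,     # upper bound on the number of boosting stages
        min_samples_leaf=7,
        max_depth=5)
    booster.fit(X_train, y_train)

    # How many stages were actually built (fewer than 100 if training stops early).
    print(booster.n_estimators_)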
Now, the booster will in general take longer to run than the other ensembles, because nothing can be done in parallel and because models are being built off of the output of other models, so there are a whole bunch of them to be created here. The nice thing with decision trees is that they're fast to build. If you want to see your computer generate a lot of heat and spend a lot of time, change this to a linear SVM and then try to use a booster with it; it's a very different experience. Once the booster's done, we can see how many models it ended up actually generating. In this case, it used all 100, and you could set that number arbitrarily high. And just like our other classifiers, we can look at the score on the training set. So the score here is phenomenal: some 98.5% accuracy. Then we can look at the score on the validation set, and we see that that improvement in training accuracy doesn't actually translate into an improvement on our validation set. So let's think about that for a minute, and about how this ensemble actually builds itself out. Diversity isn't being captured in the same way here. In this ensemble, we first build a model, and then on everything we get wrong, we try to build another model, and then on everything we get wrong from that one, we try to build another, and so on. These ensembles overfit; very often they overfit like crazy. If there is nothing common about what you get wrong, then you end up segmenting that space into very small pieces and training many, many different models. And so the final model itself might actually perform not badly. In this case, I'm not sure if 62% is good or bad, but it's on par with our other models. But you rarely get a lot of confidence from the accuracy values on your training data, because you aren't actually penalizing the model for misclassifications beyond the initial model training; instead, you're trying to make up for them. So, the stacking ensemble. The stacking ensemble is probably my favorite, and I love meta-learning. It really consists of using a number of different models and learning which one is best. You can think of this as surveying the talking heads on TV as to who they think is going to perform the best, or win the trophy, or win the award, or win the tournament, and then trying to learn, over all of these, which one knows their stuff and which one you should weigh more heavily for certain kinds of questions. So what this means for you as a data scientist is that you need to provide both the list of models that you want to stack and some classifier, this meta-learner, that's going to learn which model is best. You can think of this as a sort of weighted, intelligent voting. So we're going to have to provide at least two different classifiers. The classifier you want to use to learn over your individual voters is completely up to you. I'm just going to go with the default here, which is logistic regression, and I'm going to bring it in completely unparameterized, except that I'm going to increase the number of iterations. Then we're going to use cross-validation here, and truthfully, I'm doing that just to show you that we can. Many of these ensembles are just classifiers, so we can set all of our regular parameters on them. And then I'm going to pass in our three decision trees here as the named estimators. So let's fit these models. Now, the work with the stacking ensemble can be done partly in an iterative manner and partly in a parallel manner.
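Here's a sketch of that stacking setup, reusing the three decision trees from earlier; the variable names and the max_iter and cv values are my own choices:

    # A logistic regression meta-learner stacked on top of the three decision trees.
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import StackingClassifier

    stacker = StackingClassifier(
        estimators=[("dt0", dt0), ("dt1", dt1), ("dt2", dt2)],
        final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
        cv=5,        # cross-validation used to generate the meta-learner's training data
        n_jobs=-1)   # the base estimators can be fit in parallel
    stacker.fit(X_train, y_train)

    print(stacker.score(X_train, y_train))  # training accuracy
    print(stacker.score(X_val, y_val))      # validation accuracy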
You can train your individual models, your individual estimators, in parallel, but the final estimator has to be trained after everything else. So let's take a look at the score of the stacker: almost 90% on the training set. Then we can look at the score on the validation set. And it's kind of interesting to see that the stacker got a 0.67 on our validation set, the bagger a 0.62, the booster a 0.62, and the voter a 0.66. So our two essentially voting-based methods, where diversity is captured in the algorithms we're using and in the parameterization of the models, seemed to work better on this validation set. But we did see that that's not necessarily clear from the scores we get by fitting to the training data and then scoring against the training data, which looked all over the map; the booster certainly looked like the best there, but ended up performing the worst. So ensembles are a powerful way to leverage the benefits of different kinds of models in making accurate predictions. In general, ensembles perform better than individual models. In fact, it used to be that whenever somebody presented some results with a model, you'd ask, well, what did you then use in production? An ensemble of them. It was kind of an in-joke that ensembles always boost performance a little bit, because they capture something new. But you have to keep in mind the issues of data leakage and of overfitting in particular. I see these two as novice issues that happen a lot. Data leakage, in particular, is kind of a funny beast, because you have to think about what it is you're actually holding out as your validation set and how representative that's going to be of the real world. In addition to overfitting, keep in mind the explainability of models. Ensembles actually make it pretty difficult to understand why a given prediction is being made, and sometimes a limited-depth decision tree, for instance, or even a logistic regression is a good place to start. These allow you to engage in a different kind of dialogue about the models you're training with different stakeholders in the organization, whether they be coaches, athletes, people placing bets or making wagers, news reporters, and so forth: the various kinds of people who are interested in sports analytics.