Let's look at how to implement stacked ensembles in H2O. I'm going to start by looking at some R code to show you the overall process, and then we'll switch over to Python and actually work through an example.

So here's the outline structure. You bring in h2o, you bring in your data, and you split it the normal way. You then create a number of normal H2O models; they can be GLMs, deep learning models, trees, anything. Then you take those model IDs and pass them in the base_models argument to h2o.stackedEnsemble. From then on, this acts just like any other H2O model, and you can evaluate the performance of your stacked ensemble as if it was a single object.

Okay, let's flesh that out a little bit more. I'm going to set m1 to be a GLM, giving it x, y, and the training frame. For a GLM you'd normally specify a family, a validation_frame, and so on; any other parameters go here. Same for a GBM: I'll put the GBM into m2, and I'll put a random forest into m3. The rest of the code is as we already saw.

Now, stacked ensembles only work with cross-validation. If you want to understand the algorithm behind stacked ensembles, there will be some links in the further reading after this video. If you've already split your data into train, valid, and test as you would normally, we're just going to join train and valid back together into a data set called train2. It's important that each of your models, m1, m2, and m3, uses exactly the same cross-validation settings, so I've pulled nfolds out into a variable to make sure it's kept in sync, and I'm setting nfolds in each of my models. I'm also specifying fold_assignment as Modulo. That means record one goes into fold one, record two into fold two, and so on up to record five in fold five; then record six wraps around into fold one, record eleven goes into fold one again, and so on. The point of doing this for a stacked ensemble is to make sure the folds are assigned identically across m1, m2, and m3. The final setting you need is to keep the cross-validation predictions; that has to be set on each model as well.

So although with a stacked ensemble you can just create a bunch of models and then decide to use them together, you do have to follow a few conventions, which are summarized in these three arguments. You can't just take any old model that you've built.

Okay, let's look at how that fits together in a real example in R, and then we'll work through a Python example. I'm bringing in the airlines data, splitting it as normal, and deciding what I want to predict. It's a binary column, so I'm going to be doing binomial classification. Then come the fields I want to learn from. I'm combining the data, making a binomial GLM, a GBM with all defaults, and a random forest with all defaults. I then grab a list of the model IDs (it's important these are the model_id values) and pass that into base_models. I'm training on train2, both in the stacked ensemble and in each of my models. And then I'm going to look at logloss, AUC, and h2o.performance on each of the constituent models, as well as the stacked ensemble.

Okay, so I've already loaded in the airlines data set, and it's the same split we've been looking at in other videos. Those are the fields. I'm setting nfolds to 5, and we have 39,500 rows to use for cross-validation, which is the training data plus the validation data. Then I'm importing all the estimators I want to use: random forest, GBM, GLM, and stacked ensemble. Let's build the GLM.
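As a minimal sketch of what this base-model step looks like in the Python API, assuming h2o is already running, train2, x, and y are set up as described, and using illustrative variable names (m_GLM, m_GBM, m_RF) rather than the exact ones from the video:

```python
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator

nfolds = 5  # kept in one variable so every base model uses the same folds

# Binomial GLM: the same three cross-validation settings appear on every base model.
m_GLM = H2OGeneralizedLinearEstimator(
    family="binomial",
    nfolds=nfolds,
    fold_assignment="Modulo",                # record 1 -> fold 1, record 2 -> fold 2, ...
    keep_cross_validation_predictions=True,  # needed later by the stacked ensemble
)
m_GLM.train(x, y, train2)

# GBM and random forest with default settings apart from the cross-validation arguments.
m_GBM = H2OGradientBoostingEstimator(
    nfolds=nfolds,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
)
m_GBM.train(x, y, train2)

m_RF = H2ORandomForestEstimator(
    nfolds=nfolds,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
)
m_RF.train(x, y, train2)
```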
So you can see we have nfolds, fold_assignment, and keep_cross_validation_predictions set for every model, and I've set the family to binomial for the GLM. Otherwise, for each of these three models I've gone with default settings. If you want to spend more time on this you will, of course, be tuning each of those three models; you might also want to bring in a deep learning model. Let's build the GBM, and once that's finished, we'll build the random forest.

So I'm setting this to the three model IDs, meaning this list contains three strings. Then we're passing that to the H2OStackedEnsembleEstimator and calling train. What you'll notice is that this is very quick to train, much quicker than each of the constituent models.

Okay, let's go through and analyze the performance. This isn't part of the stacked ensemble; it's just some pandas structure so I can compare the four models I now have: a list of the models and some names. Let's start by looking at the logloss of each model. Remember, lower is better for logloss. So the GLM is at 0.57, the GBM at 0.5, and the random forest at 0.51. The stacked ensemble is much, much better, at 0.23. If we look at AUC, we get similar results; the stacked ensemble is practically perfect.

You should be getting suspicious at this point. Ensembles generally give you a few more percentage points of performance; they don't normally give you this kind of dramatic improvement. So I'm going to add the xval argument. At the moment we're looking at the data the models were trained on; let's look at the cross-validation results instead. We don't have cross-validation results for the stacked ensemble. What's happening is that the stacked ensemble was built on all of the cross-validation data, so there was no separate data set to evaluate it against. To evaluate it properly, what we're going to do is run model_performance on the test data set, for each model. So test_perf now contains four performance objects. Let's call logloss on each of those.

Okay, these are much more realistic results. It varies quite a lot from run to run, but we see the GLM is still doing worst. The GBM is on 0.54; its logloss up above, on the data it had seen, was 0.5. The random forest was 0.51 and has actually improved on the unseen data, to 0.48. So the GBM was showing more overfitting. But when we bring all three models together in an ensemble, we get the best result of all, slightly better at 0.479. With AUC, again the random forest is the best of our three base models on the test data set, and the ensemble has improved on it in the fourth decimal place. So really, the ensemble has given us only the slightest improvement over the random forest. These results will vary from run to run, and will vary quite dramatically from data set to data set, so don't just run an ensemble mechanically and automatically without comparing the performance at the end.
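To make that last comparison step concrete, here's a short Python sketch of the ensemble construction and the test-set evaluation described above. The variable names m_GLM, m_GBM, m_RF, and test are carried over from the sketch earlier and are assumptions, not necessarily the exact names used in the video:

```python
import pandas as pd
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

# base_models takes the model IDs of the already-trained, cross-validated models.
m_SE = H2OStackedEnsembleEstimator(
    base_models=[m_GLM.model_id, m_GBM.model_id, m_RF.model_id]
)
m_SE.train(x, y, train2)

# Compare all four models on the held-out test frame (assumed to exist as `test`).
models = [m_GLM, m_GBM, m_RF, m_SE]
names = ["GLM", "GBM", "RF", "StackedEnsemble"]
test_perf = [m.model_performance(test) for m in models]

print(pd.DataFrame(
    {"logloss": [p.logloss() for p in test_perf],
     "AUC":     [p.auc() for p in test_perf]},
    index=names,
))
```

This is the same idea as the comparison in the video: evaluate every model, including the ensemble, on data none of them saw during training, and only then decide whether the ensemble is worth keeping.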