
Hello, and welcome to the lesson on Decision Trees.

Decision trees are a powerful algorithm that can be used for both classification and regression tasks. One reason decision trees are popular is that they are simple to build; another is that they are easy to visualize, which means you can explain why the model made the predictions it did. Some machine learning algorithms are much harder to interpret.

One example of an easy model to explain is linear regression: you have the formula, so you know why a given input produces the output it does. Other algorithms, such as k-nearest neighbors, are harder to interpret; you have to understand the distribution of the data to understand why a prediction was made, which is a more complex problem. Decision trees are easier because you can literally see why each decision was made as you traverse the tree.

By the end of this lesson, you should understand why decision trees are powerful and how to use them effectively, how to create and use them in Python with the scikit-learn library, and how to apply them to both classification and regression tasks. There are two activities: the first is a website that provides a visual introduction to machine learning, and the second is our course notebook.

Let me first quickly demonstrate the website. The visual introduction to machine learning is a very powerful site. I really like it because it shows the data visually while also providing text for background. You can see different ways to interact with the data: how to split it, how to visualize it, and how to apply machine learning, in particular a decision tree, which involves chopping the data into subcategories or chunks. That chopping is what lets you build the tree, and it is what the site walks through.

Before going any farther, though, I want to talk about our course notebook. In this notebook, we introduce the decision tree algorithm and how it can be used for both classification and regression. First, we discuss the fundamental concepts used to create a decision tree for classification: entropy and information gain. Then we show how a decision tree can be used to classify the Iris dataset. We look at the decision surface, and we see how it can be used to understand the impact of a hyperparameter. Then we introduce a new concept called feature importance, through which decision trees let you understand which features were most important in building the tree.

Then we look at how to visualize a tree, and we introduce a new dataset, the adult dataset, which is well suited to a classification task. It is bigger than the Iris data, so it provides a nice, more real-world example of applying decision trees. Then we introduce decision trees for regression, along with another new dataset, the automobile fuel prediction (auto MPG) dataset, and show how decision trees can be used to build a regression model for these data. These last two datasets will be used with many other algorithms, so you will see the relative performance of different algorithms on these data and be able to better understand how the algorithms work and how they compare.

First, we start with our standard setup code before jumping into the formalism. When you build a decision tree, the first concept to keep in mind is that you have to split the data; that is the fundamental operation of a decision tree. You split the data into two groups, the tree encapsulates each split as a node, and nodes keep being split until you reach a terminal node, known as a leaf node. The following figure shows this.

This is a figure I made. We start with our full dataset, represented by the black square. We split it into two, A and B, and you can see those subsets here. Then we split B into two more, C and D. This is a tree: we have a root node and two nodes at the next level. One of them is a leaf node; it is not split any further. The other is a non-leaf node, split into two leaf nodes. The data are color-coded so that the points in region C go with node C, the blue points in region D go with node D, and so on. That is visually what a decision tree is, but we still have to figure out where to split and when to stop splitting.

Several different techniques can be used to decide where to split. One is variance reduction: you choose the split that maximally reduces the variance along a given feature, which is often used for regression problems. The Gini impurity is another technique, designed to minimize misclassification, particularly in a multiclass setting. Multiclass means you are predicting Class 0, Class 1, Class 2, and so on; the alternative is binary classification, where you are predicting true or false. The third technique is information gain, where you try to create the purest child nodes. The idea is to keep data that are near each other and similar together.

To do this, we have to introduce the concepts of entropy and information gain. Entropy is simply a statistical measure of the information content of a dataset, and this section of the notebook talks about what that means. Formally, for each outcome we multiply the probability by the logarithm of the probability and sum up; the negative of that sum is the entropy. I find it easier sometimes to look at this in code, so we demonstrate it in code. We can compute the entropy for a binary probability: one is, say, success, and zero is failure, and you can see the resulting entropy. In the binary case, entropy is maximized when the probability is 0.5; for a different split of probabilities, it is smaller.
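As a minimal sketch of that calculation (not the notebook's actual code), the binary entropy can be computed with NumPy like this; the `entropy` helper and its name are my own illustration:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]           # by convention, 0 * log(0) = 0
    return -np.sum(probs * np.log2(probs))

# For a binary outcome, entropy peaks at p = 0.5 (one full bit of uncertainty)
for p in [0.1, 0.5, 0.9]:
    print(f"p = {p}: H = {entropy([p, 1 - p]):.3f} bits")
```

Running this shows the symmetric curve: uncertainty is greatest at p = 0.5 and falls off toward 0 and 1.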

We can also compute entropy from data. Here we load the tips data and pull out two categorical features, day and time, which I then display. We can look at these data in a pivot table to get the total counts, along with the individual counts for each combination. From the relative frequencies we can compute the entropy, and at the end we can get the node counts we would see for a particular split.

Now, we do not split on entropy directly. Instead, we split on information gain, which uses entropy to compute how much we gain by making a particular split; that is what this entire section covers. We compute the counts, turn the counts into probabilities, and turn the probabilities into entropies. Then we can ask: if we make a split at a certain value, what is the information gain? As we change the split value, the information gain changes, and the idea is to maximize it.
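The steps above can be sketched in code. This is an illustrative toy example with made-up data, not the notebook's tips example; the helper names `entropy` and `information_gain` are mine:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(values, labels, split):
    """Parent entropy minus the weighted entropy of the two child nodes."""
    values, labels = np.asarray(values), np.asarray(labels)
    left, right = labels[values < split], labels[values >= split]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# Toy feature/label data: the middle split cleanly separates the two classes
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for split in [2.5, 4.5, 6.5]:
    print(f"split at {split}: gain = {information_gain(x, y, split):.3f}")
```

The split at 4.5 produces two pure children, so it has the highest gain; a tree-building algorithm would choose it.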

Now, what about decision tree classification? It is very similar to the other types of machine learning model building we have done in scikit-learn: we have some hyperparameters. Decision trees have a lot of hyperparameters, and some of them are listed here; we will want to vary them to get different results. First, we apply the classifier to the Iris dataset, because we have seen the Iris dataset before, so you should already have a pretty good feel for how it splits and how well you can do.

We use our helper functions, make our plot, and see our test and train data. Then we create our classifier, using a random state for reproducibility, fit it, and score it. You can see we get a pretty good score, about the same as we got with k-nearest neighbors. We can then produce a classification report and a confusion matrix, just as we have done before.
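In outline, the fit-and-score workflow looks like the following sketch. The notebook uses its own helper functions and split; here I use `train_test_split` directly, and the `test_size` and `random_state` values are arbitrary choices of mine:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)

# random_state makes the tree construction reproducible
dtc = DecisionTreeClassifier(random_state=23)
dtc.fit(X_train, y_train)
score = dtc.score(X_test, y_test)
print(f"Test accuracy: {score:.1%}")
print(classification_report(y_test, dtc.predict(X_test)))
```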

One thing I do want to introduce, though, is feature importance. We can ask: what are the most important features? You do that by accessing the feature_importances_ attribute of our decision tree classifier. Here we zip the feature names and importances together, which simply combines them into a new data structure, then iterate through it, pulling out each name and its importance and printing them. That is all we are doing. The result shows that petal width and petal length carry most of the feature importance; together they account for over 96 percent of the total. That means we could use just those two features and still capture most of the information in our data.
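The zip-and-print idea is short enough to sketch in full; the training split and random state here are my own illustration, so the exact numbers will differ from the notebook's:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dtc = DecisionTreeClassifier(random_state=23).fit(iris.data, iris.target)

# zip() pairs each feature name with its importance value
for name, importance in zip(iris.feature_names, dtc.feature_importances_):
    print(f"{name:18s}: {importance:.3f}")
```

The importances always sum to one, so each value reads directly as a share of the total.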

We can also look at the decision tree's decision surface. This code should be similar to what you saw before with k-nearest neighbors. We split up our data, and you can see that the decision tree produces nice linear boundaries.

We can also visualize the tree itself, which shows the breakdown at each split. Here is our root node. This tree is actually built using the Gini impurity, not information gain; we could have changed the tree to be built with information gain instead. We have 90 data values, you can see the class counts, and it tells you we are going to split on petal width: if your petal width is less than the split value, you go to one side, and if not, you go to the other.

One child is a leaf node. On the other side we split on petal width again; remember, most of the information was in petal width, so we split on it a lot. Again you can read off the split value and the number of data points. If the condition is true, you go one way; if it is false, you go the other, and then we split on petal length. We split on petal length again on the other branch, end up with four new child nodes, and then the tree stops. The values show what you get out at each node. Remember, this is a three-class problem, with classes Setosa, Versicolor, and Virginica, and that is what the counts show: for example, one leaf holds 0 samples of the first class, 31 of the second class, and 1 of the third class. So the counts show you the purity, if you will. You can see that most of these nodes are fairly pure, though that last one is not as pure as the others.
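If you want a quick text rendering of such a tree rather than a graphic, one option is scikit-learn's `export_text`; this sketch (my own, not the notebook's code) also shows where you would switch the splitting criterion:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# criterion="gini" is the default; pass criterion="entropy" to split on information gain
dtc = DecisionTreeClassifier(max_depth=3, random_state=23)
dtc.fit(iris.data, iris.target)

# export_text prints the fitted tree: split feature, threshold, and leaf classes
tree_text = export_text(dtc, feature_names=iris.feature_names, show_weights=True)
print(tree_text)
```

Each indented line is one node, so you can trace any prediction from the root down to its leaf.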

We can then look at varying our hyperparameters. This time we vary not the number of neighbors but the depth of our tree: three, six, or nine. Three is shallower than what we had at first, and you can see there are just some basic splits through the data. At six, we start to get a little more structure, and at nine we have even more; we have even carved out that one little point. You can see how the decision tree is cutting up the feature space, and the decision surface shows us the results.
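The depth comparison can be sketched as a simple loop over `max_depth`; the split and random state are arbitrary choices of mine, so your scores will differ from the plots in the notebook:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)

# Deeper trees carve the feature space into finer and finer regions
scores = {}
for depth in [3, 6, 9]:
    dtc = DecisionTreeClassifier(max_depth=depth, random_state=23)
    dtc.fit(X_train, y_train)
    scores[depth] = dtc.score(X_test, y_test)
    print(f"max_depth={depth}: test accuracy = {scores[depth]:.1%}")
```

Note that a deeper tree does not always score better on test data; past some point the extra structure is just fitting noise.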

We can now move to a new dataset and the adult income prediction task. The idea is that we have census data and want to predict whether somebody makes above or below $50,000. We have two code cells: we define a local data file name and, if the file exists, we use it; if not, we grab it from the website. We then do some basic processing of the dataset, create our label, and display some values to make sure the label was constructed correctly.

Next, we compute something called the zero model performance. This is an important idea. With a binary classification task like this one, where most people are in the low-salary category, what happens if my model simply assigns everybody to the low-salary class? I would be right about 75 percent of the time. That is a bad model, because it will never predict that somebody has a high salary, and you want to make sure you are measuring performance on both categories. But it gives you a baseline: your model should, if possible, do that well or better, and if it does not, you need a good reason why. It may be that you do a good job predicting the high salaries and a bad job on the low salaries, or vice versa. Either way, it is always important to understand what the zero model is.
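The zero-model baseline is easy to sketch; the label array below is hypothetical, constructed so that 75 percent of the records fall in the low-salary class, matching the proportion quoted above:

```python
import numpy as np

# Hypothetical label array: 0 = low salary (majority), 1 = high salary
labels = np.array([0] * 75 + [1] * 25)

# The "zero model" always predicts the most common class
majority_class = np.bincount(labels).argmax()
zero_model_accuracy = np.mean(labels == majority_class)
print(f"Zero model accuracy = {zero_model_accuracy:.0%}")  # 75% by construction
```

scikit-learn's `DummyClassifier` implements the same idea if you want it as a fitted estimator.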

We can then go through and convert our categorical features into appropriate numerical features, combine them into a features dataset, and then get to the classification.
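One common way to do that conversion is one-hot encoding with pandas; the tiny DataFrame below is a toy stand-in for the census records, with column names and values invented for illustration:

```python
import pandas as pd

# Toy stand-in for the census data (the notebook loads the real adult dataset)
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp", "Private"],
    "sex": ["Male", "Male", "Female"],
})

# One-hot encode the categorical columns; numeric columns pass through untouched
features = pd.get_dummies(df, columns=["workclass", "sex"])
print(features.columns.tolist())
```

Each categorical column becomes one binary column per category, which is the form a scikit-learn tree can consume.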

As a first step, we take our features and label, turn them into training and test data, and then compute some metrics. We create our decision tree classifier; the only hyperparameter we specify is the random state, for reproducibility. We fit the model and score it, and it is actually doing quite well: 82 percent on recall. The performance is not even across the classes, though; the minority class scores lower, and we will have to worry about that. What this is telling us is that the model does a good job of predicting the majority class and a reasonable job of predicting the minority class.

Now, how do we do regression? The simple answer is: almost exactly the same way we did classification. Here I load a new dataset, the auto MPG data. Again, we define the local file name and download the file if it does not already exist. We extract the data, and then we can build a regression model.

In this case, there is one thing I want to highlight: we are using the formula interface, which is fundamentally provided by a library called Patsy. This is how statsmodels handles formulas, but here I want to prepare the data for use in scikit-learn. So we say: relate the miles-per-gallon column to the cylinders feature, which is categorical (that is why it is wrapped in C()), plus the displacement, weight, and acceleration features, and the categorical year and origin features. Here are the resulting input features; you can see there are a lot of them, mostly because of all the categorical features. You can try removing those and seeing how the performance changes.
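As a minimal sketch of the Patsy formula interface (assuming Patsy is installed; the DataFrame is a toy stand-in for the auto MPG data with only two predictors):

```python
import pandas as pd
from patsy import dmatrices

# Toy stand-in for the auto MPG data
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 26.0],
    "cylinders": [8, 8, 4, 4],
    "weight": [3504, 3693, 2430, 1835],
})

# C(...) marks a column as categorical, so Patsy one-hot encodes it
y, X = dmatrices("mpg ~ C(cylinders) + weight", data=df,
                 return_type="dataframe")
print(X.columns.tolist())
```

The resulting `y` and `X` DataFrames can be handed straight to a scikit-learn estimator, which is the point of using the formula here.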

Then we import our regression algorithm, split our data into training and testing sets, create our regressor, fit it, and score it. When you look at the score, it is about 0.53, and you might think, "Well, that's really bad." But remember, this is a regression task, so the score is not a classification accuracy; it is a measure of how well we are predicting the continuous target. This cell also computes several other metrics, including the Pearson correlation coefficient, and displays them. One I like to look at is the mean absolute error, which here is only about three. If you think about that in terms of predicting miles per gallon, it means you are getting within three or four miles per gallon of the true value, which is not too bad.
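The regression workflow and metrics can be sketched as follows. The data here are synthetic, generated to mimic the weight-vs-mpg relationship, so the printed numbers will not match the notebook's:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the auto MPG data: heavier cars get fewer mpg
rng = np.random.default_rng(23)
X = rng.uniform(1500, 4500, size=(300, 1))            # vehicle weight
y = 45.0 - 0.008 * X.ravel() + rng.normal(0, 2, 300)  # mpg with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

dtr = DecisionTreeRegressor(random_state=23).fit(X_train, y_train)
pred = dtr.predict(X_test)

print(f"R^2 score  = {dtr.score(X_test, y_test):.2f}")  # not an accuracy!
print(f"MAE        = {mean_absolute_error(y_test, pred):.2f} mpg")
print(f"Pearson r  = {np.corrcoef(y_test, pred)[0, 1]:.2f}")
```

Note that `score` for a regressor returns R², which is why it should not be read as a percentage of correct predictions.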

So, with this notebook, I have introduced the idea of decision trees and how to use them for both classification and regression tasks. A lot has been presented, but hopefully you now have a better feeling for how to build a decision tree, how to use one, and why it is an important algorithm to understand. If you have any questions, let us know. And good luck.
