So how do we compute this information gain, and how do we use it to make decisions in building the decision tree? Let's go back to the simpler example of predicting whether we're going to play golf or do some other outdoor activity. Before we've made any decisions about the tree, we have 14 records, nine of which are labeled yes; those are the positive examples. So we can compute the unpredictability, the entropy, of this data set, just as we did with the Titanic data set and the die rolls, and we get 0.94.

Now, suppose we choose the outlook attribute to be the root of our tree, or to be the next node in the tree. The value overcast has four records, all of which are yes, so the entropy of that branch is 0: it's 100% predictable, zero unpredictability. The value rainy has two nos and three yeses, five records total, so the entropy is -(3/5) log2(3/5) - (2/5) log2(2/5) ≈ 0.97. And finally, if it's sunny, we have three nos and two yeses, and it's the same value, 0.97.

So now the expected new entropy, if we had chosen outlook, is an expected value: the probability of landing on each branch times the entropy of that branch. That's (4/14)(0.0) + (5/14)(0.97) + (5/14)(0.97) ≈ 0.69. The gain is the difference between where we started and where we ended up: 0.94 - 0.69 = 0.25.

Next, consider temperature. For the value cool, we have four records (yes, yes, no, and yes), so three of the four are yes, and the formula gives 0.81. You can do the same thing for hot and mild.
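The entropy values above are easy to reproduce. Here is a minimal sketch in Python, using the yes counts from the lecture's play-golf table (the helper name `entropy` is just illustrative, not from any particular library):

```python
from math import log2

def entropy(pos, total):
    """Shannon entropy (bits) of a yes/no label set with `pos` positives out of `total`."""
    if pos == 0 or pos == total:
        return 0.0            # perfectly predictable branch
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Full data set: 9 "yes" labels out of 14 records.
start = entropy(9, 14)
print(round(start, 2))        # -> 0.94

# Outlook branches: overcast (4 yes / 4), rainy (3 yes / 5), sunny (2 yes / 5).
branches = [(4, 4), (3, 5), (2, 5)]
expected = sum((n / 14) * entropy(yes, n) for yes, n in branches)
print(round(expected, 2))     # -> 0.69
print(round(start - expected, 2))  # gain -> 0.25
```

Note that the overcast branch contributes nothing to the sum, which is exactly why a perfectly predictable split is so attractive.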
With hot, exactly two of the four records are yes. That's 50/50, and remember, 50/50 always has an unpredictability of 1. Finally, when it's mild, there are six records, four of them yes, and we get an unpredictability of 0.92. Plug those into the formula and the expected new entropy is 0.91, so the gain here is less than it was for outlook.

Next we can consider humidity, and here there are only two values. Seven records have normal humidity, and six of those are yes, so the entropy is 0.59 if you plug in the numbers. When humidity is high, there are seven records as well, three of which are yes, and the entropy there is 0.99. So the expected new entropy is about 0.79.

And finally windy, which also has only two values. When it's not windy, there are eight records, six of which indicate we played, and the entropy is 0.81. When it is windy, there are six records, three of which we played; that's 50/50 again, so the entropy is 1. Run through the numbers and you get about 0.89.

So now we have a computation for every possible choice of attribute, and we compute these four gains. Since we're subtracting each expected entropy from the same starting value, picking the highest gain is really just picking the minimum expected entropy. And outlook, the first one we did, is indeed the one with the highest gain, so that's the one we'll pick.

This makes sense if you think about it. We went through the calculations here to help you internalize what's going on with entropy, but intuitively, if you can choose an attribute that tells you exactly what's going on, you want to pick that one. If you always play when it's sunny, then selecting that attribute gives you a lot of information, so you want it at the top. You want the most discriminative choices at the top.
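Putting all four attributes together, the attribute selection step can be sketched as follows. The (yes, count) pairs are taken from the standard 14-record play-golf table, and the dictionary layout is just one convenient way to organize it:

```python
from math import log2

def entropy(pos, total):
    """Shannon entropy (bits) of `pos` positives out of `total` records."""
    if pos in (0, total):
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(branches, total=14):
    """Gain = entropy before the split minus the expected entropy after it."""
    base = entropy(9, 14)  # 9 yes out of 14 records before any split
    return base - sum((n / total) * entropy(yes, n) for yes, n in branches)

# (yes, count) per attribute value.
attributes = {
    "outlook":     [(4, 4), (3, 5), (2, 5)],   # overcast, rainy, sunny
    "temperature": [(3, 4), (2, 4), (4, 6)],   # cool, hot, mild
    "humidity":    [(6, 7), (3, 7)],           # normal, high
    "windy":       [(6, 8), (3, 6)],           # false, true
}
gains = {name: round(info_gain(b), 3) for name, b in attributes.items()}
print(gains)
print(max(gains, key=gains.get))   # -> outlook
```

Because every gain subtracts from the same starting entropy, ranking by gain and ranking by smallest expected entropy give the same winner.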
And it's just like we showed with that document classification example: if the document includes the word Sunday, we go over here, check whether there's a sports team reference in it, and determine the sport; over here we check whether something related to science is in it. So it really wasn't much use to pick Sunday at the top, and in fact, I have that on the next slide. Which of these two has the higher information gain? If we split on whether the document includes the word Falcons, well, Falcons might appear in some science articles and also in some sports articles, so let's say that down the tree about half are sports and about half are science, on both branches. But if a document includes the word Mars, it's not too likely to be about sports: only about 2% of the articles that include the word Mars will be about sports, while about 98% will be about science. So which one of these has the higher information gain? The Mars split, so that's the one we'd choose first.
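The same entropy comparison makes this concrete. A rough sketch, assuming the branch proportions quoted above (50/50 for Falcons, 2%/98% for Mars) and comparable branch sizes, so the branch entropies alone tell the story:

```python
from math import log2

def entropy_p(p):
    """Entropy (bits) of a two-class branch where one class has probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# "Falcons": both branches end up roughly half sports, half science.
falcons = entropy_p(0.5)   # 1.0 bit: maximally unpredictable, no help
# "Mars": the includes-Mars branch is about 2% sports, 98% science.
mars = entropy_p(0.02)     # ~0.14 bits: nearly pure
print(falcons > mars)      # -> True: "Mars" leaves less entropy, higher gain
```

A split whose branches stay at 50/50 leaves the full bit of uncertainty in place, while the nearly pure Mars branch drives the expected entropy, and therefore the remaining unpredictability, way down.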