Hello. This is Huan Liu, professor of computer science and engineering with the School of Computing, Informatics, and Decision Systems Engineering. I'm going to cover supervised learning and unsupervised learning, and this lecture is about supervised learning. In this segment, we would like to define supervised learning, which is also called classification, and describe various basic methods.

We start with a Twitter data example. In this dataset, there are 10 instances and three attributes, and the last column is the class attribute. So in this case, imagine we have collected data for the three attributes: celebrity, verified account, and number of followers. The class label is what we would like to learn or predict: is this user influential or not? This dataset is very important because for classification we usually have this kind of format: we collect data and try to learn the patterns.

For supervised learning, the process is like this. We have a training dataset and a learning algorithm; after we learn the model, we use the model for the testing part. In reality, we would simply use the model for the classification task, but we usually include testing because we would like to see how good the model is. In the earlier example, when we apply the model, we predict whether a user is influential or not. As another example, for our e-mail messages, we would like to classify some messages as spam or not. So here, for example, we have these unlabeled instances and we would like to know their class, and M here is the model. The key is to find the mapping: we say the model M maps x to y, where y is the class.

There are numerous supervised learning algorithms, and we won't be able to cover all of them, but we will try to cover those that are basic and intuitive as well as popular.
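The workflow just described, a labeled training set, a learning step that produces a model M, and applying M to unlabeled instances, can be sketched in a few lines of Python. This is a minimal sketch: the toy data and the simple "majority class per attribute value" learner are illustrative assumptions, not the lecture's dataset or algorithm.

```python
# Sketch of the supervised learning process: train -> model M -> predict.
# The data and the one-attribute learner below are illustrative only.
from collections import Counter

# Training set: each instance is (attributes, class label).
train = [
    ({"celebrity": "yes", "verified": "yes"}, "influential"),
    ({"celebrity": "yes", "verified": "no"},  "influential"),
    ({"celebrity": "no",  "verified": "yes"}, "influential"),
    ({"celebrity": "no",  "verified": "no"},  "not influential"),
]

def learn(train, attribute):
    """Learn a toy model M: map each attribute value to its majority class."""
    votes = {}
    for x, y in train:
        votes.setdefault(x[attribute], Counter())[y] += 1
    return {v: c.most_common(1)[0][0] for v, c in votes.items()}

model = learn(train, "celebrity")   # the learned model M
print(model["yes"])                 # apply M to an unlabeled instance x
```

Here the model is literally the mapping from x to y that the lecture describes; a real learner would generalize rather than memorize per-value majorities.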
There are three listed here. One category is classification, which includes Decision Tree Learning and the k-Nearest Neighbor Classifier; another category is regression. We will spend time on all three of these methods. Another reason to choose or start with these three is that people usually use one of them as a baseline for performance comparison, and all three are intuitive and more understandable than other methods.

We start with Decision Tree Learning. When we discuss decision trees, we focus on induction, that is, how to induce a tree. We again use an example to show what a decision tree is. This is the dataset we used at the very beginning to illustrate the need for supervised learning, and this is one of the trees we can induce from it. For the training data, we have known classes, and a tree like this will then be applied to those instances with unknown classes; by unknown classes, we mean the influential class we talked about.

The question now is: we have three attributes, so which attribute should we start with as the root node? Multiple decision trees, as we said, can be learned from the same dataset. We can start with celebrity as the root node, we can also start with verified account as the root node, or even the number of followers. After we pick the first attribute, we split the tree based on that attribute's values. For celebrity, we have two values: yes or no. For verified account, we have three values: no, yes, and unknown. There are so many possible trees to induce or to learn, so the question is: when the dataset is big and the number of attributes is large, how can we find a good way to learn the tree? Decision trees are constructed recursively. In what sense?
So, we can have a root node, which holds the dataset D, and let's say we use verified account as the first attribute to split the data. Because this attribute has three values, yes, no, and unknown, we can have three branches, and the data is split into three subsets: D1, D2, and D3. In this case, the union of D1, D2, and D3 equals D. You can see that after the data is split based on attribute values, the subsets of data become smaller and smaller. This structure is called a tree stump. To build the tree recursively means that after we have these subsets of data, we decide whether we need to continue building another tree stump. So in this case, let's say this branch is yes, this is no, this is unknown. If you check the data, you will see this subset is already pure. But if this part, say D3, is not pure, then we decide, "Hey, I need another split." So I take this subset and find another attribute. The attributes we can choose from are the remaining two: one is celebrity, the other is number of followers. Let's say we use the celebrity attribute; its values are yes and no, so we get D31 and D32, which are subsets of D3. In this way we build the tree recursively, and each of these subsets of data becomes smaller and smaller.

So now, we need to know when we should stop. One obvious criterion is when a branch's data becomes pure, that is, all instances belong to one class. Then we can stop, and we create a leaf node under that branch with the class label of that pure class. In some cases, even if a node is not pure, we still stop, because we don't have enough data points or instances to make a statistically sound split; we still call it a leaf node, and we just use the majority class in that node. So now we know how to build the tree recursively and when to stop. But the key is: how do we measure purity? We have to calculate it, right?
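The recursive construction above, split D on an attribute's values, stop when a subset is pure or no attributes remain (taking the majority class), can be sketched as follows. The data layout and the arbitrary attribute ordering are assumptions for illustration; the lecture will choose the split attribute by purity, which this sketch does not do.

```python
# Sketch of recursive decision tree construction: split on an attribute,
# recurse on each subset (D1, D2, ...), stop at pure subsets or when no
# attributes remain (majority class). Attribute order here is arbitrary.
from collections import Counter

def build_tree(data, attributes):
    labels = [y for _, y in data]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:  # pure, or nothing left to split
        return majority                          # leaf node with its class label
    attr = attributes[0]
    tree = {"split_on": attr, "branches": {}}
    for value in {x[attr] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[attr] == value]
        tree["branches"][value] = build_tree(subset, attributes[1:])
    return tree

data = [
    ({"verified": "yes", "celebrity": "yes"}, "influential"),
    ({"verified": "yes", "celebrity": "no"},  "influential"),
    ({"verified": "no",  "celebrity": "yes"}, "influential"),
    ({"verified": "no",  "celebrity": "no"},  "not influential"),
]
tree = build_tree(data, ["verified", "celebrity"])
print(tree["branches"]["yes"])   # the "yes" subset is pure, so it is a leaf
```

Each recursive call works on a strictly smaller subset, which is why the process terminates, exactly the "smaller and smaller" behavior described above.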
We have to compute it. Usually what we use is called entropy, and we try to minimize it. Consider a subset of training instances T. If we have a binary class, then all these instances can be divided into the positive class or the negative class. With two class values, we can calculate the entropy this way:

entropy(T) = -p+ log p+ - p- log p-

Here p+ is the number of instances belonging to the positive class divided by the total number of instances, and p- is the number of instances belonging to the negative class divided by the total number of instances. The total number of instances equals the number of positive instances plus the number of negative instances. So p+ and p- are probabilities, and each is used twice: once as the weight and once inside the log. This is the formula we use to calculate entropy.

Now I will give you an example. Let's assume, for the sake of easy calculation, that we have 10 instances: seven belong to the positive class and three to the negative class. Denoting the node as seven plus and three minus, we can calculate the entropy accordingly. As long as we pick the log base consistently, it is okay to use base 2 or base 10; for explanation, we usually use base 2, but you can pick any base.

In a pure subset, all instances have the same class attribute value, which means the entropy is zero. What does that mean? A pure subset is a case like 10 positive instances and zero negative instances, or zero positive instances and 10 negative instances. So now, let's calculate the entropy.
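The entropy formula above translates directly into code. This is a minimal sketch; the function name and the count-based interface are my own choices, and the convention 0 × log(0) = 0 is used so that pure subsets come out to zero.

```python
# Entropy of a binary-class subset from its positive/negative counts:
# entropy = -p_plus * log2(p_plus) - p_minus * log2(p_minus)
import math

def entropy(n_pos, n_neg):
    total = n_pos + n_neg
    result = 0.0
    for count in (n_pos, n_neg):
        p = count / total
        if p > 0:                       # convention: 0 * log(0) = 0
            result -= p * math.log2(p)  # base 2, as in the lecture
    return result

# The lecture's example: 7 positive and 3 negative instances.
print(round(entropy(7, 3), 4))
```

With base 2 the result for 7 positive and 3 negative comes out a little under 1, consistent with a mostly-but-not-fully pure node.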
For 10 positive and zero negative:

entropy(10+, 0-) = -(10/10) log(10/10) - (0/10) log(0/10)

In this case, log(10/10) is log 1, which is zero, and zero times anything is zero, so the first term is zero; the second term is zero by the same logic (by convention, 0 log 0 is taken as zero). So the entropy is zero.

Now consider the worst case for a split: 5 positive instances and 5 negative instances. Basically, this split does not help us at all. What is the entropy of this case? Because the counts are 5 and 5, the two terms are exactly the same, so we can calculate once and multiply by two:

entropy(5+, 5-) = -2 × (5/10) log(5/10) = 2 × (1/2) × log 2

If we choose base 2, then log 2 is one, so the entropy is one. This is the worst-case scenario. In all other cases, the entropy is between zero and one. That ends our first segment on classification and decision tree learning.
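The two boundary cases worked out above can be verified directly in code, again using log base 2; dropping the 0/10 term follows the 0 × log(0) = 0 convention.

```python
# Verify the two worked examples: a pure split has entropy 0,
# a 50/50 split has entropy 1 (log base 2).
import math

pure = -(10 / 10) * math.log2(10 / 10)        # the 10+, 0- case; 0-term dropped
balanced = -2 * (5 / 10) * math.log2(5 / 10)  # the 5+, 5- case: two equal terms
print(pure, balanced)
```

Any other positive/negative mixture falls strictly between these two extremes, which is what makes entropy a usable purity score for choosing splits.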