How do you evaluate your model's performance? In this lecture, we will look at different metrics that can be used to evaluate the performance of your classification model. After this video, you will be able to discuss how performance metrics can be used to evaluate models, name three model evaluation metrics, and explain why accuracy may be misleading.

For a classification task, an error occurs when the model's prediction of the class label differs from the true class label. We can also categorize a model's predictions into different outcomes depending on the predicted and true labels. Let's take the case where the task is to predict whether a given animal is a mammal or not. This is a binary classification task with the class label being either yes, indicating mammal, or no, indicating non-mammal. The different outcomes are as follows.

If the true label is yes and the predicted label is yes, this is a true positive, abbreviated as TP. This is the case where the label is correctly predicted as positive. If the true label is no and the predicted label is no, this is a true negative, abbreviated as TN. This is the case where the label is correctly predicted as negative. If the true label is no and the predicted label is yes, this is a false positive, abbreviated as FP. This is the case where the label is incorrectly predicted as positive, when it should be negative. If the true label is yes and the predicted label is no, this is a false negative, abbreviated as FN. This is the case where the label is incorrectly predicted as negative, when it should be positive.

These definitions can take a while to sink in, so feel free to hit the pause and replay buttons several times here to review this part. These four outcomes are used in calculating many evaluation metrics for classifiers. The most commonly used evaluation metric is the accuracy rate, or accuracy for short.
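The four outcomes can be tallied mechanically from paired true and predicted labels. Here is a minimal sketch in Python, my own illustration rather than code from the lecture, using the yes/no labels of the mammal example:

```python
def count_outcomes(true_labels, predicted_labels):
    """Tally the four prediction outcomes for a binary classifier.

    Labels are "yes" (positive, e.g. mammal) and "no" (negative).
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for true, pred in zip(true_labels, predicted_labels):
        if true == "yes" and pred == "yes":
            counts["TP"] += 1  # correctly predicted positive
        elif true == "no" and pred == "no":
            counts["TN"] += 1  # correctly predicted negative
        elif true == "no" and pred == "yes":
            counts["FP"] += 1  # incorrectly predicted positive
        else:                  # true == "yes" and pred == "no"
            counts["FN"] += 1  # incorrectly predicted negative
    return counts
```

For instance, `count_outcomes(["yes", "no"], ["no", "yes"])` reports one false negative and one false positive.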
For classification, accuracy is calculated as the number of correct predictions divided by the total number of predictions. Note that the number of correct predictions is the sum of the true positives and the true negatives, since the true and predicted labels match in those cases. The accuracy rate is an intuitive way to measure the performance of a classification model. Model performance can also be expressed in terms of the error rate, which is the complement of the accuracy rate.

Let's look at an example to see how accuracy and error rates are calculated. The table on the left lists the true label along with the model's prediction for a data set of ten samples. First, let's figure out the number of true positives. Recall that a true positive occurs when the output is correctly predicted as positive; in other words, the true label is yes and the model's prediction is yes. In this example there are three true positives, as indicated by the red arrows. So TP = 3; remember that value, as we'll need it later. Now, let's figure out the number of true negatives. A true negative occurs when the output is correctly predicted as negative; in other words, the true label is no and the model's prediction is no. In this example there are four true negatives, as indicated by the green arrows. So TN = 4; we'll need to remember this value as well.

Now we use the values for TP and TN to calculate the accuracy rate. Using the equation for accuracy rate, we plug in three for TP and four for TN, giving seven correct predictions for the numerator. The denominator is simply the total number of samples in our data set, which is ten. So the accuracy rate for this example is 7 out of 10, which is 0.7, or 70%. To calculate the error rate, we simply subtract the accuracy rate from 1. For our example that is 1 - 0.7 = 0.3, so the error rate is 0.3, or 30%.
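The arithmetic above can be expressed as a pair of small helper functions. This is my own illustration, not code from the lecture, applied to the lecture's ten-sample example:

```python
def accuracy_rate(tp, tn, total):
    """Accuracy = number of correct predictions / total predictions."""
    return (tp + tn) / total

def error_rate(tp, tn, total):
    """Error rate is the complement of accuracy."""
    return 1 - accuracy_rate(tp, tn, total)

# The lecture's example: TP = 3, TN = 4, ten samples in total.
acc = accuracy_rate(3, 4, 10)  # 0.7, i.e. 70%
err = error_rate(3, 4, 10)     # about 0.3, i.e. 30%
```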
There is a limitation with accuracy and error rates when you have a class imbalance problem. This is when there are very few samples of the class of interest and the majority are negative examples. An example of this is identifying whether a tumor is cancerous or not. What is of interest is identifying samples with cancerous tumors, but these positive cases are very rare. So you end up with a very small fraction of positive samples, and most of the samples are negative; thus the name, class imbalance problem.

What could be the problem with using accuracy for a class imbalance problem? Consider the situation where only 3% of the cases are cancerous tumors. If the classification model always predicts non-cancer, it will have an accuracy rate of 97%, since 97% of the samples have non-cancerous tumors. But note that in this case the model fails to detect any cancer cases at all, so the accuracy rate is very misleading. You may think that your model is performing very well with such a high accuracy rate, when in fact it cannot identify any of the cases in the class of interest. In these cases we need evaluation metrics that can capture how well the model classifies the positive versus the negative class.

A pair of evaluation metrics that are commonly used when there is a class imbalance are precision and recall. Precision is defined as the number of true positives divided by the sum of true positives and false positives; in other words, it is the number of true positives divided by the total number of samples predicted as positive. Recall is defined as the number of true positives divided by the sum of true positives and false negatives; it is the number of true positives divided by the total number of samples that actually belong to the positive class. Here's an illustration that shows precision and recall. The selected elements, indicated by the green half circle, are the true positives.
That is, samples that are predicted as positive and are actually positive. The relevant elements, indicated by the green half circle and the green half rectangle, are the true positives plus the false negatives; that is, samples that are actually positive, some of which are correctly predicted as positive and some incorrectly predicted as negative. Recall, then, is the number of samples correctly predicted as positive, divided by all samples that are actually positive. The entire circle, indicated by the green half circle and the pink half circle, is the true positives plus the false positives; that is, samples that were predicted as positive, although some were actually positive and some were actually negative. Precision, then, is the number of samples correctly predicted as positive, divided by the number of all samples predicted as positive.

Precision is considered a measure of exactness, because it calculates the percentage of samples predicted as positive that are actually in the positive class. Recall is considered a measure of completeness, because it calculates the percentage of positive samples that the model correctly identified.

There is a trade-off between precision and recall. A perfect precision score of one for a class C means that every sample predicted as belonging to class C does indeed belong to class C. But this says nothing about the number of samples from class C that were predicted incorrectly. A perfect recall score of one for a class C means that every sample from class C was correctly labeled. But this says nothing about how many other samples were incorrectly labeled as belonging to class C. So the two metrics are used together; for example, precision values can be compared for a fixed value of recall, or vice versa. The goal for classification is to maximize both precision and recall.

Precision and recall can be combined into a single metric called the F-measure. The equation is two times the product of precision and recall, divided by their sum.
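Putting those definitions together, here is a minimal sketch in Python; the counts at the bottom are hypothetical, chosen only to illustrate the calculations:

```python
def precision(tp, fp):
    """Exactness: of all samples predicted as positive, the fraction
    that actually is positive. TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Completeness: of all samples that are actually positive, the
    fraction the model found. TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(p, r):
    """F-measure: 2 * P * R / (P + R), the harmonic mean of the two."""
    return 2 * p * r / (p + r)

# Hypothetical counts, for illustration only.
p = precision(tp=3, fp=1)  # 0.75
r = recall(tp=3, fn=2)     # 0.6
f1 = f1_score(p, r)        # roughly 0.667
```

Note how the always-predict-negative model from the cancer example has a recall of zero, since TP = 0, which exposes the failure that a 97% accuracy hides.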
There are different versions of the F-measure. The equation on this slide is for the F1 measure, which is the most commonly used variant. With the F1 measure, precision and recall are equally weighted. The F2 measure weights recall higher than precision, and the F0.5 measure weights precision higher than recall. The value of the F1 measure ranges from zero to one, with higher values indicating better classification performance.

In summary, there are several metrics for evaluating the performance of a classification model. They are defined in terms of the types of prediction outcomes you can get in a classification problem. We covered some of the most commonly used evaluation metrics in this lecture: accuracy and error rates, precision and recall, and the F1 measure.
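The weighted variants mentioned above all follow the general F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 1 recovers the F1 measure. A quick sketch, my own illustration with made-up precision and recall values:

```python
def f_beta(p, r, beta):
    """General F-measure: beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily, beta = 1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# With precision high (0.9) and recall low (0.5), F2 is pulled
# toward the lower recall, while F0.5 stays closer to precision.
f2 = f_beta(0.9, 0.5, 2)
f05 = f_beta(0.9, 0.5, 0.5)
```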