0:08

So, let's go back to the matrix of possible binary classification outcomes, this time filled out with the actual counts from the notebook's decision tree output. Remember, our original motivation for creating this matrix was to go beyond a single number, accuracy, to get more insight into the different types of prediction successes and failures of a given classifier. Now we have these four numbers that we can examine and compare manually.

Let's look at this classification result visually to help us connect these four numbers to a classifier's performance. What I've done here is plot the data instances using two specific feature values out of the total 64 feature values that make up each instance in the digits dataset. The black points here are the instances with true class positive, namely the digit one, and the white points have true class negative, that is, they are all the other digits except for one. The black line shows a hypothetical linear classifier's decision boundary, for which any instance to the left of the decision boundary is predicted to be in the positive class, and everything to the right of the decision boundary is predicted to be in the negative class.

The true positive points are those black points in the positive prediction region, and false positives are those white points in the positive prediction region. Likewise, true negatives are the white points in the negative prediction region, and false negatives are black points in the negative prediction region.

We've already seen one metric that can be derived from the confusion matrix counts, namely accuracy. The successful predictions of the classifier, the ones where the predicted class matches the true class, are along the diagonal of the confusion matrix. So, if we add up all the counts along the diagonal, that gives us the total number of correct predictions across all classes, and dividing this sum by the total number of instances gives us accuracy.

But let's look at some other evaluation metrics we can compute from these four numbers. A very simple related number that's sometimes used is classification error, which is the sum of the counts off the diagonal, namely all of the errors, divided by the total instance count. Numerically, this is equivalent to just one minus the accuracy.
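As a quick arithmetic sketch, here are accuracy and classification error computed directly from the four confusion matrix cells. The counts are the ones that appear in this lecture's example (26 true positives, 17 false negatives, 7 false positives, and 400 true negatives out of 407 negative instances):

```python
# Confusion matrix counts from the notebook's decision tree output
tp, fn = 26, 17   # true positives, false negatives
fp, tn = 7, 400   # false positives, true negatives (407 negatives total)

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # correct predictions lie on the diagonal
error = (fp + fn) / total      # off-diagonal cells are the errors

# Classification error is numerically just one minus accuracy
assert abs(error - (1 - accuracy)) < 1e-12

print(round(accuracy, 3), round(error, 3))  # 0.947 0.053
```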

Now, for a more interesting example, let's go back to our medical tumor-detecting classifier and suppose we wanted an evaluation metric that would give higher scores to classifiers that not only achieved a high number of true positives but also avoided false negatives; that is, that rarely failed to detect a true cancerous tumor. Recall, also known as the true positive rate, sensitivity, or probability of detection, is such an evaluation metric, and it's obtained by dividing the number of true positives by the sum of true positives and false negatives.

You can see from this formula that there are two ways to get a larger recall number: either by increasing the number of true positives or by reducing the number of false negatives, since that makes the denominator smaller. In this example, there are 26 true positives and 17 false negatives, which gives a recall of about 0.60.
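Plugging the counts just mentioned into the recall formula:

```python
tp, fn = 26, 17          # true positives and false negatives from the example
recall = tp / (tp + fn)  # fraction of true positives the classifier found
print(round(recall, 2))  # 0.6
```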

Now suppose that we have a machine learning task where it's really important to avoid false positives. In other words, we're fine with cases where not all true positive instances are detected, but when the classifier does predict the positive class, we want to be very confident that it's correct. A lot of customer-facing prediction problems are like this; for example, predicting when to show a user a query suggestion in a web search interface might be one such scenario. Users will often remember the failures of a machine learning prediction even when the majority of predictions are successes.

So, precision is an evaluation metric that reflects this situation, and it's obtained by dividing the number of true positives by the sum of true positives and false positives. To increase precision, we must either increase the number of true positives the classifier predicts or reduce the number of errors where the classifier incorrectly predicts that a negative instance is in the positive class. Here, the classifier has made seven false positive errors, and so the precision is 0.79.
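And the corresponding check for precision, using the same example counts:

```python
tp, fp = 26, 7              # true positives and false positives from the example
precision = tp / (tp + fp)  # fraction of positive predictions that are correct
print(round(precision, 2))  # 0.79
```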

Another related evaluation metric that will be useful is the false positive rate; it is equal to one minus the specificity, where specificity is the true negative rate. This gives the fraction of all negative instances that the classifier incorrectly identifies as positive. Here, we have seven false positives, which, out of a total of 407 negative instances, gives a false positive rate of about 0.02.
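The same kind of sketch for the false positive rate, with 400 true negatives making up the rest of the 407 negatives:

```python
fp, tn = 7, 400         # 407 negative instances in total
fpr = fp / (fp + tn)    # fraction of negatives wrongly flagged as positive
specificity = tn / (fp + tn)
assert abs(fpr - (1 - specificity)) < 1e-12  # FPR = 1 - specificity
print(round(fpr, 2))    # 0.02
```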

Going back to our classifier visualization, let's look at how precision and recall can be interpreted. The numbers in the confusion matrix here are derived from this classification scenario. We can see that a precision of 0.68 means that about 68 percent of the points in the positive prediction region, to the left of the decision boundary, or 13 out of the 19 instances, are correctly labeled as positive. A recall of 0.87 means that of all true positive instances, so all black points in the figure, the positive prediction region has found about 87 percent of them, or 13 out of 15.
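Verifying those two percentages from the counts in the figure:

```python
# 19 points fall in the positive prediction region, 13 of which are truly
# positive; there are 15 true positive (black) points overall.
precision = 13 / 19
recall = 13 / 15
print(round(precision, 2), round(recall, 2))  # 0.68 0.87
```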

If we wanted a classifier that was oriented towards higher levels of precision, like in the search engine query suggestion task, we might want a decision boundary instead that looks like this. Now, all the points in the positive prediction region, seven out of seven, are true positives, giving us a perfect precision of 1.0. But this comes at a cost, because out of the 15 total positive instances, eight of them are now false negatives; in other words, they're incorrectly predicted as being negative. And so recall drops to 7 divided by 15, or about 0.47.

On the other hand, if our classification task is like the tumor detection example, we want to minimize false negatives and obtain high recall, in which case we would want the classifier's decision boundary to look more like this. Now, all 15 positive instances have been correctly predicted as being in the positive class, which means these tumors have all been detected. However, this also comes with a cost, since the number of false positives, things that the detector flags as possible tumors, for example, that are actually not, has gone up. So recall is a perfect 1.0, but the precision has dropped to 15 out of 42, or about 0.36.
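The two boundary shifts just described can be summarized numerically:

```python
# Precision-oriented boundary: all 7 predicted positives are correct,
# but 8 of the 15 true positives are missed.
precision_hi_p = 7 / 7       # 1.0
recall_hi_p = 7 / 15         # ~0.47

# Recall-oriented boundary: all 15 true positives are found,
# but only 15 of the 42 positive predictions are correct.
precision_hi_r = 15 / 42     # ~0.36
recall_hi_r = 15 / 15        # 1.0

print(round(recall_hi_p, 2), round(precision_hi_r, 2))  # 0.47 0.36
```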

These examples illustrate a classic trade-off that often appears in machine learning applications: you can often increase the precision of a classifier, but the downside is that you may reduce recall, or you can increase the recall of a classifier at the cost of reducing precision. Recall-oriented machine learning tasks include medical and legal applications, where the consequences of not correctly identifying a positive example can be high. Often in these scenarios, human experts are deployed to help filter out the false positives that almost inevitably increase with high-recall applications.

Many customer-facing machine learning tasks, as I just mentioned, are often precision-oriented, since here the consequences of false positives can be high, for example, hurting the customer's experience on a website by providing incorrect or unhelpful information. Examples include search engine ranking and classifying documents to annotate them with topic tags.

When evaluating classifiers, it's often convenient to compute a quantity known as an F1 score that combines precision and recall into a single number. Mathematically, this is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). After a little bit of algebra, we can rewrite the F1 score in terms of the quantities that we saw in the confusion matrix, true positives, false negatives, and false positives: F1 = 2TP / (2TP + FN + FP).
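Using the counts from the earlier example, we can confirm that the two forms of the F1 score agree:

```python
tp, fn, fp = 26, 17, 7    # counts from the confusion matrix example
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)   # harmonic-mean form
f1_from_counts = 2 * tp / (2 * tp + fn + fp)         # count-based form
assert abs(f1 - f1_from_counts) < 1e-12              # algebraically identical

print(round(f1, 2))  # 0.68
```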

This F1 score is a special case of a more general evaluation metric known as an F-score, which introduces a parameter beta. By adjusting beta, we can control how much emphasis the evaluation gives to precision versus recall. For example, if we have precision-oriented users, we might set beta equal to 0.5, since we want false positives to hurt performance more than false negatives. For recall-oriented situations, we might set beta to a number larger than one, say two, to emphasize that false negatives should hurt performance more than false positives. The setting beta equals one corresponds to the F1 score special case that we just saw, which weights precision and recall equally.
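As a sketch of how beta shifts the emphasis, here is the general F-score written out as a small helper (the name f_beta is my own for illustration; scikit-learn provides the equivalent fbeta_score). The 0.68 and 0.87 inputs are the precision and recall from the visualization example above, where recall is the larger of the two:

```python
def f_beta(precision, recall, beta=1.0):
    """General F-score: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.68, 0.87                        # precision and recall from the example
print(round(f_beta(p, r, beta=1.0), 2))  # 0.76 (the ordinary F1 score)
print(round(f_beta(p, r, beta=0.5), 2))  # precision-oriented, pulled toward p
print(round(f_beta(p, r, beta=2.0), 2))  # recall-oriented, pulled toward r
```

Since recall (0.87) exceeds precision (0.68) here, the beta = 2 score comes out higher than F1, and the beta = 0.5 score comes out lower.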

Let's take a look now at how we can compute these evaluation metrics in Python using scikit-learn. The sklearn.metrics module provides functions for computing accuracy, precision, recall, and F1 score, as shown here in the notebook. The input to these functions is the same: the first argument, y_test, is the array of true labels of the test set data instances, and the second argument is the array of predicted labels for the test set data instances. Here we're using a variable called tree_predicted, which holds the predicted labels from the decision tree classifier in the previous notebook step.
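A minimal sketch of those calls, using small stand-in label arrays rather than the notebook's y_test and tree_predicted:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Stand-in arrays; in the notebook these would be y_test and tree_predicted
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Each function takes (true labels, predicted labels), in that order
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
```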

It's often useful when analyzing classifier performance to compute all of these metrics at once, so sklearn.metrics provides a handy classification_report function. Like the previous score functions, classification_report takes the true and predicted labels as its first two required arguments. It also takes some optional arguments that control the format of the output. Here, we use the target_names option to label the classes in the output table. You can take a look at the scikit-learn documentation for more information on the other output options. The last column, support, shows the number of instances in the test set that have that true label.
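A quick sketch of the call, again with stand-in arrays in place of the notebook's data; the class names passed to target_names are chosen here just for illustration:

```python
from sklearn.metrics import classification_report

# Stand-in arrays; class 1 is "the digit one", class 0 is everything else
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

# target_names labels the rows of the output table; the support column
# counts the true instances of each class in the test set
report = classification_report(y_true, y_pred, target_names=['not 1', '1'])
print(report)
```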

Here we show classification reports for four different classifiers on the binary digit classification problem. The first set of results is from the dummy classifier, and we can see that, as expected, both precision and recall for the positive class are very low, since the dummy classifier is simply guessing randomly, with low probability of predicting the positive class for the positive instances.