So now that we have this confusion matrix set up, we can start to calculate all kinds of evaluation metrics that can help us identify areas where a machine learning system could be more inclusive. With respect to making machine learning more inclusive, we tend to focus on the false positive rates and the false negative rates in order to get a sense of how adversely a subgroup might be affected.

We can calculate things like the true positive rate, also known as sensitivity or recall, which represents the proportion of cases where the label shows there is, say, a face in the image and the model also predicts a face. All you need here are the corresponding true positive and false negative counts in order to calculate the recall. Another example of the sort of calculation you can get from a confusion matrix is the precision, which represents the proportion of the model's positive predictions that are actually correct: of all the times the model predicts there is a face in the image, how often is there really one? For this calculation, all you need are the corresponding true positive and false positive counts.

False positive rates, false negative rates, true positive rates, precision, recall: that's a lot of metrics to deal with. So how should we go about selecting which metrics to focus on for the purposes of making your machine learning system more inclusive? The answer is, it depends. It depends on the consequences of your false positives and your false negatives. Depending on the trade-offs between the two, you may want your machine learning model to have low recall, missing a lot of things, in exchange for high precision, so that the limited number of things it does flag are nearly all correct.

Take the example of a machine learning model that determines whether or not an image should be blurred to preserve privacy. A false positive means something that doesn't need to be blurred gets blurred anyway, because the model predicted that it should. That can be a bummer. But a false negative means something that needs to be blurred is not blurred, because the model didn't predict that it should. Something like that could result in identity theft, because the privacy of the individual in that image could be exposed. So in this example, you would want to minimize false negatives as much as possible, and you would focus your metrics around achieving a low false negative rate.

On the flip side, there are situations where it may be better to encounter a false negative than a false positive. Let's say you're working on a spam filtering model. A false negative means a spam message doesn't get caught by the model, so you end up seeing it in your inbox, and that can be annoying. But what happens when you encounter a false positive? A message from a friend or a loved one gets marked as spam and removed from your inbox, and that can be a real loss. So in this case, the metric to focus on is reducing the false positive rate as much as possible.

Once you've figured out the right set of evaluation metrics to focus on, make sure you go one step further and calculate those metrics across the different subgroups within your data.
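To make those formulas concrete, here's a minimal Python sketch that computes recall, precision, and the false positive and false negative rates from the four cells of a confusion matrix. The counts at the bottom are made-up values for a hypothetical face-detection model, included only for illustration.

```python
def rates_from_confusion(tp, fp, fn, tn):
    """Compute common evaluation metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)               # true positive rate / sensitivity
    precision = tp / (tp + fp)            # how many positive predictions were correct
    false_positive_rate = fp / (fp + tn)  # negatives incorrectly flagged as positive
    false_negative_rate = fn / (fn + tp)  # positives the model missed
    return {
        "recall": recall,
        "precision": precision,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }

# Hypothetical counts, not real data.
metrics = rates_from_confusion(tp=85, fp=10, fn=15, tn=90)
print(metrics)
```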
As shown in this plot, you can visualize the distributions of your evaluation metrics across subgroups, depicted here by the blue and green distributions, each representing a separate subgroup within your data. Once all of that is in place, it's a matter of finding an acceptable value for the metric and comparing that value across the subgroups. For example, you may find that a false negative rate of 0.1 is acceptable for the problem you're trying to solve with your machine learning system. Given that overall rate, how does the rate look across your subgroups? By incorporating these methodologies, you're one step closer to identifying ways in which you can make your machine learning system more inclusive. So, to reiterate, calculating evaluation metrics across subgroups is one of the key things we can do to measure how inclusive a machine learning system is, and it's important to do so in light of the acceptable trade-offs between your false positives and your false negatives.
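As a sketch of that last step, the snippet below computes the false negative rate separately for each subgroup and flags any subgroup that exceeds the 0.1 threshold mentioned above. The subgroup names, labels, and predictions are hypothetical placeholders, not data from the plot.

```python
import numpy as np

# Hypothetical ground-truth labels, model predictions, and subgroup tags.
labels      = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
predictions = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])
subgroups   = np.array(["blue", "blue", "blue", "blue", "blue",
                        "green", "green", "green", "green", "green"])

ACCEPTABLE_FNR = 0.1  # the overall rate deemed acceptable for this problem

for group in np.unique(subgroups):
    mask = subgroups == group
    positives = labels[mask] == 1
    # False negatives: actual positives the model predicted as negative.
    fn = np.sum(positives & (predictions[mask] == 0))
    fnr = fn / np.sum(positives)
    status = "OK" if fnr <= ACCEPTABLE_FNR else "needs attention"
    print(f"{group}: false negative rate = {fnr:.2f} ({status})")
```

Even when the overall rate looks acceptable, a per-subgroup breakdown like this can reveal that one group (here, the "blue" subgroup) bears a disproportionate share of the false negatives.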