There are many reasons why you could have errors in your data analysis. Bad data and bad models lead to bad decisions, so it's important to understand what the sources of error are and to remove them. If decision-making is opaque, the results can be bad in the aggregate, and they can be catastrophic for an individual. The consequences of errors can be severe. For example, what if someone has a loan denied because of an error in the data analyzed, or because of an error in the design of the analysis method? We have actually harmed somebody because we conducted data science in a less than perfect way.

There are many different sources of error that we're going to talk about. Let's begin with the first of them: the choice of a representative sample. There is the apocryphal story of the drunk looking for his keys under the lamppost because that's where he can see. Similarly, as data scientists we are often limited by what data we have. We often cannot choose the data we would like to have, so we analyze the data we do have and hope for the best. This is actually not that different from looking for keys under the lamppost because that's where we can see: if the keys happen to be under the lamppost, we might actually find them, but if the keys are elsewhere, we're out of luck.

Let's look at a concrete example. The Twitter universe, or Twitterverse, is a popular source for analysis of public opinion. But are Twitter users representative of the population as a whole? Surely we know that they aren't: they are younger, more tech savvy, and richer than the average population. Furthermore, even if we know the characteristics of Twitter users, are tweets representative of the opinions of Twitter users as a whole? Most Twitter users don't tweet at all; they just watch tweets go by. So what we're really hearing are the opinions of opinion makers.
And that may or may not be a good representation of what the population as a whole thinks. Think about what a lot of companies do in terms of establishing a web presence. They listen to customer responses on a forum that they have established, or that a third party has established. In this sort of situation, it may actually be a reasonable thing to do. A company may care about complaints it hears about its products, irrespective of whether those opinions are representative of the population. It may be enough for the company to know that some segment of the population is pointing out a problem with its product, and it acts to fix it. I think that's a good thing, and it's good for everybody. The fact that the company is dealing with data that's not representative is not a problem in this case.

As long ago as 1963, there was a famous quotation by William Cameron, who said: not everything that can be counted counts, and not everything that counts can be counted. This distinction between the data that we have and the data that we wish we had is important to keep in mind.

When we don't have exactly the data we would like, there are statistical techniques that help us weight the samples we have, to balance at least the important attributes: the attributes we think are likely to matter for the question at hand. So if we think that opinions may differ by race, gender, or age, we can try to make sure the samples we collect are balanced with respect to these, and to the extent they're not balanced, that they're weighted to give us an effective mathematical balance.

Sometimes it's hard to know what these attributes should be. For example, biology experiments have often been conducted with mice of a single sex, typically male.
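The weighting idea described above can be sketched in a few lines of Python. This is a minimal illustration of one common technique (post-stratification): each respondent is weighted by the ratio of their group's population share to its share in the sample. All group names and numbers below are made up for the example.

```python
# Post-stratification sketch: re-weight a skewed sample so that an
# attribute we think matters (here, age group) matches the population.
# All shares and responses are made-up illustration data.

population_share = {"18-34": 0.30, "35+": 0.70}  # assumed census shares
sample_share     = {"18-34": 0.60, "35+": 0.40}  # our sample skews young

# Weight = population share / sample share for each group.
weights = {g: population_share[g] / sample_share[g] for g in population_share}
# 18-34 respondents get weight 0.5; 35+ respondents get weight 1.75

# Made-up survey responses: (age group, 1 = approves of some policy)
responses = [("18-34", 1), ("18-34", 1), ("18-34", 0),
             ("35+", 0), ("35+", 1), ("35+", 0)]

# Unweighted estimate treats every respondent equally.
raw_rate = sum(r for _, r in responses) / len(responses)

# Weighted estimate discounts the over-represented young group.
weighted_rate = (sum(weights[g] * r for g, r in responses)
                 / sum(weights[g] for g, _ in responses))
```

In this made-up sample the raw approval rate is 0.5, but the weighted rate drops to about 0.41, because the group driving the approvals was over-represented in the sample.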
It's only recently that the National Institutes of Health instituted a policy requiring female mice in biology experiments.

To see how balancing may work, here's a chart of American Idol semifinalists across the years that the show has been popular. The color of each state shows how underrepresented or overrepresented it is among the semifinalists, normalized, of course, for the population of the state as a whole. What we find is that most states on the west coast and in the south are fairly represented; that's green on the spectrum. However, if you look at the midwest and the Plains states, these are all blue, which means there are very few semifinalists, even compared to these generally low-population states. The same is true of New England. And if you really want to be a semifinalist on American Idol, you want to be from Hawaii, which is a brilliant red and seriously overrepresented. That was a fun little exercise in thinking about balance in your data.

Here is a much more troubling situation. Google, when it was developing its image recognition technology, mistakenly labeled this photograph "Gorillas." Naturally, this caused considerable distress, and they immediately fixed it; in some sense the incident was a small teething problem in a fast-developing technology. The question is, why did that kind of problem arise in the first place? It has to do with the training data that was used to teach the labeling algorithm. Since the training data included very few faces as dark as that of the woman in this photograph, the learning algorithm wasn't appropriately trained to recognize dark human faces.

Training data is also a problem when you want a system to work with a future population that is different from the past population. By definition, we train on what we had in the past, and we apply the system to something in the future.
If these are the same, it's not a problem. When the future doesn't resemble the past, which happens from time to time, we have to watch out for those singularities. A much harder thing to watch out for is gradual drift. As society changes over time, the nature of our population may shift: the way we use language, the way we dress, the way we speak. Things that were trained a while ago will no longer work, because we will have drifted away from what they were trained on. Managing this kind of drift, in terms of how and when models are retrained, is an open question.
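Even though retraining policy remains an open question, drift can at least be monitored. As one minimal sketch (an illustration, not a method from this lecture), the population stability index (PSI) compares a feature's binned distribution at training time against a later snapshot; the bin counts below are made up.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    Returns 0 when the distributions match; larger values mean more drift.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # guard against empty bins
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total

# Made-up bin counts for one feature: at training time vs. today.
training = [100, 300, 400, 200]
today    = [150, 350, 300, 200]

drift = psi(training, today)  # about 0.06 for these numbers
```

A rule of thumb sometimes used in industry treats PSI above roughly 0.2 as significant drift and a signal to consider retraining; the thresholds, like the technique itself, are a convention rather than anything prescribed here.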