[MUSIC] This was a study. The title here is "The Expression of Emotions in 20th Century Books." What they were interested in is whether the words we choose to use in our collective literature have changed over time, and whether that tells us something about a culture or a civilization. I find the scientific inquiry compelling, but what I think is most striking about this, and why I wanted to include it as an example, is that the methodology they used is pretty straightforward. You could do this yourself, without a significant background in technology, or in statistics, or even in linguistics.

So this is what they did. The first step is kind of a doozy: take all the books written in the 20th century and digitize them. Well, that would be a non-starter; none of us could do that. But that's okay, Google has already done it for us, and they've made that data available at this URL, so you can go check that out. What they've done is digitize the books, run character recognition on them, and produce these n-gram datasets. These are tables of data where each row has an n-gram, followed by the year and the count of the number of times that n-gram occurred. The data has already been broken down and processed into a digestible form.

Okay, so what's an n-gram? It's pretty simple: a sequence of n consecutive words. A 1-gram is just a single word, like "yesterday." A 5-gram, for example, is the phrase "analysis is often described as." In this study they ignored everything but the 1-grams, and then they took some subset of those 1-grams and assigned them a mood score.

So how did they do that? Well, you can imagine that certain words are charged with a particular mood, associated with joy or sadness or fear and so on. And you can also imagine that synonyms of those words might be associated with the same mood. This analysis sounds nontrivial, and it is. But once again, it's already been done for you: there's a resource on the web called WordNet where this kind of affect analysis has been done. I'll sketch that idea below.

And so the authors of this paper were able to take the digitized books from Google, already broken down into n-grams, and the affect scores from WordNet, and then do this calculation, which may look intimidating if you're not used to staring at mathematical expressions, but is actually pretty simple. You take the count of a particular word in the set, the set being the set of WordNet mood words, which is not as big as the set of all words, since only some words can be scored for mood. Then you normalize by the count of the number of occurrences of the word "the."

Why did they do that? Well, you need to normalize over something to account for the fact that perhaps we just write more books in 2005 than we did in 1937, or we've been able to digitize more of them, so we need to normalize by some total. But why not just normalize by the total number of words? The reason is that the word "the" is a better indicator of prose than the total word count. Apparently we've also started to produce more captions, figures, technical language, formulas, and other non-prose utterances in these books, and those can skew the results. They really wanted to capture how often these words are used when we write full, complete sentences.

Then you add those normalized counts up and divide by the number of words in the set. Finally, there's one more transformation that should look familiar if you recall your high school statistics: you subtract the mean and divide by the standard deviation, normalizing with respect to a normal distribution. Putting it together, the calculation looks something like this.
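(The expression itself is on the slide rather than in this transcript, so this is my reconstruction from the description, in my own notation, not copied from the paper:)

$$
M_m(y) \;=\; \frac{1}{n}\sum_{i=1}^{n}\frac{c_i(y)}{c_{\mathrm{the}}(y)},
\qquad
Z_m(y) \;=\; \frac{M_m(y) - \mu_m}{\sigma_m}
$$

Here $c_i(y)$ is the count of the $i$-th mood word in year $y$, $c_{\mathrm{the}}(y)$ is the count of the word "the" in year $y$, $n$ is the number of words in the list for mood $m$, and $\mu_m$, $\sigma_m$ are the mean and standard deviation of $M_m$ across all years.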
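As for building the mood word lists themselves: the paper drew on WordNet's affect work, which is a separate labeled resource and not something bundled with standard toolkits, but the basic synonym-expansion idea looks something like this minimal sketch using NLTK's plain WordNet interface. The seed words here are just placeholders I picked for illustration.

```python
# A minimal sketch of mood-word list construction by synonym expansion,
# using NLTK's WordNet interface (requires nltk.download("wordnet")).
# This only illustrates the idea; the study used precomputed affect scores.
from nltk.corpus import wordnet as wn

def expand_seeds(seed_words):
    """Return the seed words plus all single-word WordNet synonyms of them."""
    mood_words = set(seed_words)
    for word in seed_words:
        for synset in wn.synsets(word):
            for lemma in synset.lemma_names():
                if "_" not in lemma:          # keep 1-grams only, as in the study
                    mood_words.add(lemma.lower())
    return mood_words

joy_words = expand_seeds(["joy", "happiness", "delight"])
sadness_words = expand_seeds(["sadness", "sorrow", "grief"])
```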
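And here is a sketch of the whole calculation, assuming a 1-gram file with tab-separated rows of (word, year, count, ...); the actual Google Books Ngram files have a few more columns, so treat the parsing as illustrative rather than definitive.

```python
# A sketch of the mood-score computation described above: count each mood
# word per year, normalize by the count of "the", average over the mood
# list, then z-score across years.
from collections import defaultdict
import statistics

def yearly_counts(path, vocab):
    """counts[word][year] = total occurrences, for the words we care about."""
    counts = defaultdict(lambda: defaultdict(int))
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            word, year, matches = fields[0].lower(), int(fields[1]), int(fields[2])
            if word in vocab:
                counts[word][year] += matches
    return counts

def mood_scores(counts, mood_words, years):
    """Mean over mood words of count(word)/count('the') per year, z-scored."""
    raw = {}
    for y in years:
        the = counts["the"][y]   # assumes "the" actually occurs in every year
        raw[y] = sum(counts[w][y] for w in mood_words) / (len(mood_words) * the)
    mu = statistics.mean(raw.values())
    sigma = statistics.stdev(raw.values())
    return {y: (raw[y] - mu) / sigma for y in raw}

# Usage: the joy-minus-sadness series in the first plot is just a difference
# of two such z-scores, year by year (filename here is hypothetical):
# vocab = joy_words | sadness_words | {"the"}
# counts = yearly_counts("googlebooks-eng-1gram.tsv", vocab)
# years = range(1900, 2001)
# joy_z = mood_scores(counts, joy_words, years)
# sad_z = mood_scores(counts, sadness_words, years)
# diff = {y: joy_z[y] - sad_z[y] for y in years}
```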
But that's about it. There's a count and there's a division. There are two datasets you can pull from the web, and they're big, but not exceedingly big; they fit in memory on most of your laptops nowadays. So it's a significant computational task, but nothing that requires Hadoop. You could do this in a weekend if you'd thought of it. So I find that pretty compelling.

So, these are the results. This first plot is joy words minus sadness words: the z-score for joy minus the z-score for sadness. You can see there's a big dip after World War II, and that's one of the points they make in the paper. And you can see the score starts to increase again in the late 90s. I won't try to analyze this for its scientific value; I'll just present the results.

What I think is maybe more interesting is this one. This is now total emotion words minus total random words, and there's a prominent downward slope over time. So what's going on here? Well, apparently you can make the argument that we're using fewer emotion words over time. That said, there's a bit of an uptick in this red line. What does that represent? That's fear words, and you can imagine some of the reasons why there might be an increase in fear words since the 1980s.

So this is pretty fun. This is a significant analysis that can be done just by taking datasets the authors didn't have to prepare themselves. The other point I want to make: this is just a copy and paste of a segment of the papers that this paper cites, and I was struck by the titles. "Quantitative analysis of culture using millions of digitized books." "Quantifying the evolutionary dynamics of language." "Frequency of word-use predicts rates of lexical evolution throughout Indo-European history." "Song lyrics and linguistic markers." What strikes me is that linguistics, anthropology, history, culture: these fields are becoming hard sciences by virtue of data-driven methods. So all science is becoming data science, and therefore data scientists have a lot of power [LAUGH] in this regime. It's a great time to be a data geek.

There's data journalism as well. I probably should have put a slide in here about this, but when the WikiLeaks material came out, you weren't going to pour yourself a pot of coffee and pore over those materials, print them all out, and go through them one by one. You were going to write algorithms that do this kind of word-use analysis, look for email chains and dialogues, these sorts of computational methods, in order to analyze that material. So now journalism itself is a computational enterprise. It is a data science problem, or at least it's amenable to data science techniques. So, as a data scientist, the world is your oyster.

All right, let me pause there, and we'll look at a couple more examples before moving on in the next segment. [MUSIC]