Hello, my name is Johannes Eichstaedt from the University of Pennsylvania, and I'd like to talk to you today about big data psychology: some of the recent developments we've seen in psychology using social media, and what we can do with social media to learn more about humans. This short segment is called Measuring Psychological States of Large Populations through Social Media.

It all started in the year 2009, when Google came out with a study showing that just by measuring the occurrence of search queries related to the flu, things that had to do with medications and with symptoms, they could do a decent job of measuring how the flu was spreading through the United States. So they were able to create a map and a time series that showed how the flu was developing over the seasons and across the US states. In this slide, you see a black curve and a red curve, and you'll notice two things. The black curve, which is Google's estimate, is about two weeks ahead of the estimates from the leading medical authorities in the US about the incidence of flu. The other thing you'll notice is that these curves track almost perfectly, with something like 98% accuracy in the Google estimates. This really was the proof of concept that you could use the Internet, and the data flows in the Internet, to measure things about people that we care about. What Google had done was to go from waiting for reports to the Centers for Disease Control to just listening in on the data flows, and getting the same data they needed about human health.

Now, the question is: can we do something very similar with psychology? Most psychology studies are very, very small. Most psychology research is based on something like 50 to 100 people, and even the largest psychology studies we've ever done have a few thousand people at most. Can we do with psychology what Google did with the flu? Can we just listen in on the data that flows through data centers and get an estimate of the things we care about?

A normal study in psychology, as I said, has somewhere between 50 and 1,000 people, whereas on social media, you suddenly have access to billions of people. Facebook alone is above a billion people, Twitter has been growing, and social networks in China like Renren have been growing. What that means is that rather than take a small sample of the population and try to understand what it means for people across the population, you can get estimates about the entire population, or large fractions thereof, because they're part of your data set. This is a huge statistical advantage: if you can measure so many people, you can look at different subgroups, and you can understand how women are different from men, young people from old people, and so forth. So the great advantage of big data psychology, among some others, is that the numbers on the right are much larger than the numbers on the left.

But there's an additional advantage: people on social media behave in ways that they don't when you give them surveys. It's sort of like a digital campfire. They talk about the things they care about with their friends, in an environment that is more natural than knocking on somebody's door and asking them to take a survey.

But how do we do these analyses? Well, first you start with a number of people. Let's assume they're on Facebook, so they all have Facebook statuses.
But before we can get their Facebook statuses, we have to ask them for permission. Every single user is presented with a permission dialog and informed consent, and we collect the data very often through Facebook apps that run within the Facebook universe, if we're working with Facebook. Once we've gotten permission from every single user, we can also pull other data. For example, we can give them a personality survey: a traditional survey on the computer where they answer questions about themselves (very much like me, not very much like me, indifferent). We get the survey responses, which allow us to build a profile of their personality. That's just the way psychologists have always collected this sort of information. And we also know other things about them, such as age and gender.

These things fortunately already come in the form of numbers. When you fill out a survey, you can average the responses, and you get an average agreeableness score, or extraversion score, for a person. Age is already a number, and gender you can encode as a number, say 0 and 1. The question, though, is how you turn the language in the Facebook data into numbers. As psychologists, we always need things to be quantitative in order to do statistics with them. So we take these status updates and find a way to turn them into numbers; then we do additional statistics with those numbers, and then we visualize the results. The easiest way to do this is simply to count words: take a given word, say 'the', and count what fraction of a given person's words are that word. What this does is turn every single word into a frequency. So you have the frequency with which a person uses a given word, and you have their, say, personality information, or age and gender information. You can combine these two data points across all the different people, correlate every word against the outcome, and then shortlist those words that show the strongest statistical association with the outcome you're interested in.
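To make that counting-and-correlating step concrete, here is a minimal sketch in Python. The handful of users, their statuses, and the extraversion scores are all invented for illustration; nothing below is the actual research pipeline.

```python
from collections import Counter
from math import sqrt

# Invented toy data: each user has status text and an outcome score.
users = [
    {"text": "party tonight cant wait love you all", "extraversion": 4.5},
    {"text": "reading at home with tea and a good book", "extraversion": 1.8},
    {"text": "great party last night so much fun", "extraversion": 4.1},
    {"text": "quiet day of anime and video games", "extraversion": 2.0},
]

def relative_frequencies(text):
    """Turn one person's words into relative frequencies."""
    tokens = text.lower().split()
    total = len(tokens)
    return {word: n / total for word, n in Counter(tokens).items()}

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

profiles = [relative_frequencies(u["text"]) for u in users]
outcome = [u["extraversion"] for u in users]
vocabulary = {word for p in profiles for word in p}

# Correlate every word's frequency against the outcome, then shortlist
# the words with the strongest statistical association.
correlations = {
    word: pearson([p.get(word, 0.0) for p in profiles], outcome)
    for word in vocabulary
}
shortlist = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(shortlist[:5])
```

With real data you would also apply significance tests and multiple-comparison corrections before shortlisting, since many thousands of words are tested at once.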
So what does that look like? Well, here's a very, very simple example: this is the language of women on Facebook. In this slide you see two things: words of different sizes and words of different colors. The size in this word cloud encodes the strength of the correlation, the strength of the statistical association of the word with being female; in other words, how predictive a given word, when it's used, is of the user being female. In this case, the <3 heart character is the single most predictive feature of being female on Facebook, and this is corrected for age. The color indexes the frequency of use: if something is grey, it's rarely used; if something is blue, it's moderately frequently used; and if something is red, it's very frequently used. So you see that the <3 heart character is both highly predictive, because it's large, and frequently used, because it's dark red.

Compare this to the language of men. When we show this to big audiences, we generally get a big laugh at this point. The first thing that jumps to people's minds is, of course, the curse words in the slide. The curse words are there because they are signs of disagreeableness. Think about what it takes to say these words on Facebook, and you'll realize you have to be more disagreeable, more willing to break social norms, to share these words on Facebook. And that's a trait that men have more than women. You'll also notice competitive concerns, such as video games and sports, and you'll notice words that simply make more sense for men, such as beard and shaving.

Okay, so far so good. This is a very, very simple example of what we can do with these methods, on something like self-identified binary gender. (There are other gender choices that are not included in our analysis.)

Here's something that's psychologically a little more interesting: extraversion, just the way we all use the word extroverted, somebody who enjoys going out and enjoys spending time with other people. Here's the language that distinguishes extroverts. The single most predictive feature is the word party. Notice also that there are bigrams, phrases of two words: 'cant wait' is in here as highly predictive. Note that it's missing an apostrophe, because the person couldn't wait. Extraversion is often associated with lack of impulse control and with reward seeking in social situations, so for somebody to drop an apostrophe, as silly as it seems, makes a lot of sense in light of psychological theory.

What about introversion? This is introversion on the Internet in the years 2009 through 2012. You see a preponderance of Japanese culture. You see emoticons, Japanese emoticons that are eye-focused, as opposed to our emoticons, which are mouth-focused. And you see Pokemon and other signs of an introverted lifestyle.

These word clouds are a way of capturing, of encapsulating, the information we have about people. Once we have these language models, we can apply them to the language of a new person to get an estimate of their personality (a small code sketch of this step appears a little further below). What we've done in one specific study is to compare how well your friends do at estimating your personality with how well our computer, text-based algorithms do. For the different personality dimensions (this is a sort of standard model in psychology: openness, conscientiousness, extraversion, agreeableness, neuroticism), the red lines show you how well your friends do, and the blue lines show you how well our language models do. You see that for most personality traits they're about as good, with the exception of openness, where the language models do much better than your friends. Openness is a trait that has to do with intellect, with your interest in the arts, with liberalism, and so forth. So in conclusion, we're now at a point where these text-based algorithms are about as good at generating personality profiles of users as other people are.

One thing to note with these word cloud results is that what you're seeing here is the language that happens to distinguish men from women. If you actually look at the overall language use of men and women, they're almost indistinguishable: both genders use 'the', and 'a', and commas much more than any other words. What our methods do is amplify the perception of differences. So even though there might only be a 3 to 4% difference in language use, our methods are able to pull out the words that drive those differences.

Very often, when people are confronted with our research, they wonder whether people are really putting up a face on Facebook. Are we all curating ourselves, presenting the side of us that we'd like other people to see rather than the truer one? Our research has shown that that's the case: we're particularly more likely to express positive emotion and success on Facebook.
And conversely, when things are not going so well and we experience negative emotion, we tend not to share that information with our Facebook friends. But because of the size of these data sets, and the way our algorithms are able to subtract these biases from the overall pattern of language use, we can still distill a psychologically meaningful signal that differentiates people from one another.

I want to give you one example of how this works, with something you might not think people are willing to talk about on Facebook, and that's depression. In this case, we gave 16,000 people a depression survey, and we were able to isolate a low mood component, a sadness component. Before I show you these results, I'd like you to think about what you think the most predictive word is of having low mood. Take a moment. Here's what it really is: it's the word alone. When I ask audiences, they generally pick words such as sad or other low mood words, but it turns out that there's really a process of social isolation coming out in these language results. And this shows one of the powers of these kinds of analyses: having the data tell the story. This is a data-driven analysis, where the results bubble up from the data itself, and then we, as psychologists and as specialists, can make sense of them. But the fact that social isolation, feeling alone, is so centrally implicated in feeling low mood is, to a psychologist, a really cool finding.

Here's another component of depression: low self-worth, feeling poorly about yourself. Think again about what language you think might indicate that people aren't feeling good about themselves. It turns out it's words like 'why': not having a sense of meaning in your life, not being able to make sense of the world, feeling that your world is a disordered place. Also note words like apparently, probably, and actually in this word cloud. Sometimes we refer to this as epistemic hedging. The idea is that people are disconnecting from the things they say: they don't mean what they say, or they indicate that they're not convinced of the truth of their own statements, which is a subtle indicator that something is wrong with the way they perceive reality. And all of this bubbled out of a simple data-driven language analysis like this one.

So to summarize what we've talked about: these Facebook-based language analyses, which use people as the units of analysis, can give you insight about thoughts, emotions, and behaviors, so all the things that psychologists tend to care about. They make measurement possible and cheap: you can measure tens of thousands of people in an afternoon on a computer server. And it's unobtrusive: you don't have to knock on anyone's door and interrupt what they're doing. Instead, you get to observe them in their own ecosystems, in their own friendship circles. There are biases: people putting on a face (social desirability biases), and sample biases, because younger people tend to be on social media. These can be handled statistically. And increasingly, as something like 50% of the US population is now on Facebook, we have to worry less and less about these biases. And certainly, when you compare these biases to psychological studies that are based on 100 WEIRD undergraduates (Western, educated, industrialized, rich, and democratic), you're doing pretty well if you can get 10 million people in your sample.

So far, the language analyses we've seen were about people.
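Before we move from people to communities, here is that sketch of the prediction step: learn word-level associations from many users, then apply the fitted model to a brand-new person's language. It assumes scikit-learn is available; the texts and trait scores are invented, and the real models use far more data and richer features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Invented training data: four users' statuses and their survey-based
# extraversion scores (real models are trained on thousands of users).
train_texts = [
    "party tonight cant wait to see everyone",
    "quiet evening reading a book at home",
    "so excited for the party this weekend",
    "playing video games alone all day",
]
train_extraversion = [4.5, 1.8, 4.2, 2.0]

# Features: relative word frequencies (counts normalized per person).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts).toarray().astype(float)
X = X / X.sum(axis=1, keepdims=True)

# Ridge regression learns one weight per word; regularization matters
# because there are usually far more words than people.
model = Ridge(alpha=1.0)
model.fit(X, train_extraversion)

# Apply the fitted language model to a new person's statuses.
new_texts = ["cant wait for the party"]
X_new = vectorizer.transform(new_texts).toarray().astype(float)
X_new = X_new / X_new.sum(axis=1, keepdims=True)
print(model.predict(X_new))  # estimated extraversion for the new user
```

The word clouds from earlier are essentially visualizations of the word-level associations such a model learns.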
But here at Penn, we've discovered something we think is really cool: what you do for people also works for communities. So rather than people, in this next analysis we're going to use counties. Counties are communities in the US. There are about 3,000 of them, and they have, on average, a few tens of thousands of people living in them. From these counties, we can get tweets. We start with a billion tweets, figure out where they were sent from, and then put them into the corresponding counties. We end up with a data set of about 1,400 counties or so, covering something like 80% of the US population, for which we have enough language from every county.

For the counties, we also have other information from government agencies, such as the Centers for Disease Control or the US Census Bureau. We can even get things like how many people died from heart disease, specifically the kinds of heart disease that have to do with the clogging of your arteries. For every county, we get these estimates because they were recorded on death certificates and averaged to the county level, and we, as researchers, can pull them from these organizations. So for every county, we get these numbers.

Now we start with the tweets again, and again we have the job of coming to a quantitative representation of the tweets, which is to say we have to turn the tweets into numbers. How do we do that? Well, one way to do this, which we haven't seen before, is rather than counting words, to run a topic-modeling algorithm over these tweets. This is a cool, nifty technique out of natural language processing that allows us to describe, say, 150 million tweets as a distribution over 2,000 topics. That is to say, we tell the algorithm: assume that these 150 million tweets are not talking about 150 million different things, but really are only talking about the same 2,000 things; now find me the 2,000 things that best explain these tweets. What you end up with is these really cool little clusters of language that are semantically coherent, that seem to talk about the same concept. I'll show you a few results in a moment, and a toy version of this topic-modeling step is sketched in code below. But just to finish the method pipeline: we have these numbers again, in this case the topics, and we have the numbers that come from places like the Centers for Disease Control. Now we can correlate every topic against the outcome, and shortlist and visualize those topics that are the most strongly associated.
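Here is that toy sketch. It uses scikit-learn's LatentDirichletAllocation on a handful of invented tweets and asks for just two topics, standing in for the study's 2,000 topics learned over 150 million tweets.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A handful of invented tweets standing in for 150 million real ones.
tweets = [
    "so tired of this traffic hate my commute",
    "great run this morning feeling healthy and strong",
    "cannot stand these people so much hate",
    "training for the marathon healthy breakfast first",
    "stuck in traffic again this city is the worst",
    "gym session then a long walk in the park",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

# "Assume these tweets only talk about 2 things; find me those 2 things."
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-tweet topic proportions

# Each topic is a cluster of words that tend to occur together;
# these are the little word clouds on the slides.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {top_words}")

# Averaging doc_topics over all tweets from a county gives that county a
# topic distribution, which can then be correlated with health outcomes.
```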
I'm going to show you results now for the case of heart disease. Here's the language that is associated with higher heart disease in US counties. You see language like this, which looks like hate and interpersonal tension. Every one of these little word clouds is a topic, so these are words that occur together, or occur in similar contexts. You see language like this, which looks like hostility, and aggression, and cursing, and disagreeableness. There's an old hypothesis in the heart disease literature referred to as the type A personality. The idea used to be that if you're really high strung, you're more likely to die from heart disease. What we've learned over the years is that it's actually not being high strung that kills you; what kills you is the hostility towards other people, because it releases stress hormones in our system, and those tend to do things to our arteries that promote heart disease. So we see the same result here, the hostility component of type A personality. But we also saw this: it looks to us like disengagement, or boredom and fatigue. This is the language of people not having a reason to get out of bed in the morning. And time and time again, we've seen that these subtle psychological factors can have surprising explanatory power for the heart disease rates in these communities.

Here's the language that is associated with lower heart disease, the language that is statistically protective against heart disease. It looks like positive experiences. We'd expect that, because richer communities also have lower heart disease. We see things like skilled occupations, which makes sense, because richer communities with higher education talk about things that are associated with higher education, such as going to conferences, public service, community service, communication, and so forth. So this makes sense; both of these have to do with what health professionals call socioeconomic status. But there's a third bit here. Look at these word clouds: to us, they look like optimism. Optimism, resilience, goal pursuit, these are all related constructs, and they have to do with a mindset, a mindset that is willing to overcome obstacles, plan around them, and persevere in the face of them.

Now, all the results I'm showing you are what's called cross-sectional. That means we haven't followed individuals over time; we've taken one time slice and compared language rates to disease rates. Other studies that have followed people over time have shown that things like optimism are highly protective at the individual level against dying from heart disease. There are a number of reasons why, but it's a really cool finding that we see it here as well.

So what can we do with these methods? What we've seen so far was the insight piece: the power of these data-driven analyses to identify specific processes that might be associated with getting or not getting a disease. Well, we can flip it around. Once we've learned all these language associations, we can just use Twitter to estimate the death rates from heart disease, atherosclerotic heart disease, in US counties. To show you how well this works, I'm showing you a map here. On the left is what the Centers for Disease Control report, as recorded on death certificates, and on the right are our estimates, based on Twitter alone. You see it does surprisingly well. How well? Compare the power of different predictors to predict heart disease. First, we have demographic predictors: percentage black, female, married, and Hispanic. They have some predictive power. Next are the traditional health risk factors, the biggest of them still being smoking: smoking, diabetes, hypertension, obesity. You see they begin to account for more variance in heart disease; they do a better job predicting it. Then there's income and education, still very important: there's a huge disparity in the US between how long the rich and the poor live, for a number of different reasons, and this predictor captures that variance. Here's a standard model, then, that combines all these ten predictors, the demographic, health, and socioeconomic factors, into a standard epidemiological model. Compare this to the prediction performance of Twitter alone: Twitter significantly outperforms these models. And when you combine Twitter with these other predictors, the combination doesn't do much better than Twitter alone.
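To make that comparison concrete, here is one way you could compare the variance explained by different predictor sets, using cross-validated R² with ridge regression. All the county data below is synthetic stand-in noise, so it will not reproduce the actual result (in the study, the Twitter-only model outperformed the combined ten-predictor epidemiological model); it only shows the shape of the analysis.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_counties = 200

# Synthetic county-level predictors (stand-ins for the real measures).
demographics = rng.normal(size=(n_counties, 4))  # % black, female, married, Hispanic
risk_factors = rng.normal(size=(n_counties, 4))  # smoking, diabetes, hypertension, obesity
ses = rng.normal(size=(n_counties, 2))           # income, education
topics = rng.normal(size=(n_counties, 50))       # Twitter topic usage (toy: 50 topics)

# Synthetic heart disease rate, loosely driven by the predictors plus noise.
hd_rate = (risk_factors.sum(axis=1) + 0.8 * ses.sum(axis=1)
           + 0.5 * topics[:, :5].sum(axis=1) + rng.normal(size=n_counties))

def cv_r2(X, y):
    """Cross-validated variance explained (R^2) for a ridge model."""
    return cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()

standard = np.hstack([demographics, risk_factors, ses])  # the ten-predictor model
print("standard epidemiological model R2:", round(cv_r2(standard, hd_rate), 2))
print("Twitter topics only R2:           ", round(cv_r2(topics, hd_rate), 2))
print("combined model R2:                ", round(cv_r2(np.hstack([standard, topics]), hd_rate), 2))
```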
So what that comparison tells us is that what Twitter is doing is measuring income, measuring education, measuring smoking, and all these other things, and then adding a sliver of psychological reasons why you may or may not get heart disease, like the optimism we saw a few slides ago.

The interesting thing about these findings is that the people dying from heart disease are not the people on Twitter, and the people on Twitter are not the people dying. The people on Twitter live in the same communities as the people who have heart disease, develop heart disease, or die from heart disease. What the people on Twitter are reporting are general attributes of their communities, what it's like to live in those communities: are there green spaces, places to walk, do you trust your neighbors, all these other things. They serve as canaries in the coal mine, if you will, for the quality of the community. And if they report a psychological profile of the community that looks like it would increase heart disease risk, it is the people who really are at risk of heart disease who die. So that explains the, at first sight, bizarre association between what people say on Twitter and the death rates of older people dying from heart disease.

So what can we do with these methods? Well, we're about to release this: a live map of the US that shows, for every county, the wellbeing over the last month or the last year. If you were about to move somewhere in the United States, wouldn't you want to know the wellbeing of the community you're moving to? These wellbeing attributes of a community have to do with green spaces, with whether you trust your neighbors, with whether your kids have places to play. All of this figures into the quality of life in a community, and through social media, again, we can measure it cheaply and unobtrusively.

But we also run projects where we look at individuals, individuals that we want to follow over time, like students. You might be interested in measuring student wellbeing, and to do so, we can install an application on the students' phones, with their permission and the permission of their parents. The application reads and scans all the texts they write to get a rather coarse estimate of their wellbeing at the time; but certainly, you should be able to see when people start becoming depressed, so you can help them. Very often, these technologies are a stepping stone between people and other people who want to help them, right? If you're a parent, the information that your child is not doing well is something you find very valuable, and if these technologies can aid you in obtaining that information so you can help in the way you see fit, I think that is an appropriate use for these technologies.

This is where we are right now. But what will happen in the next ten to twenty years with this kind of research, and with these kinds of technologies? Well, if there's something we've learned in the last 20 years, it's that computers really don't help people; people help other people. And now researchers like us, big organizations like Google, and data conglomerates like Facebook have this information. Increasingly, healthcare systems such as insurers have this information too; whether with your permission or without your permission is something that's still being debated in legislatures all across the world. Now that we have this information, the question is: what do we do with it?
And here, really, comes the biggest challenge I think is coming out of all this research: how do we design systems that ethically and effectively respond to knowing that you might be physically unwell or mentally unwell?

Take the case of therapy. I've been a therapist for two years; I saw patients. The idea that I could, as their therapist, install something in their data ecosystem, be it a phone (today we talk about phones; in 20 years we might talk about different devices), something in their technology layer that is able to give me a daily feed of how they are doing over time, is really attractive. And why is it attractive? Because I have a defined relationship with them: it is my job to take care of them, and they have given me the permission to do so. And when I see a drop, I can pick up the phone, call them, and ask them what's up, whether they want to come in, and so forth.

Similarly, imagine you're an insurance company, and you're not just using this information to re-stratify your risk pools and make insurance more or less affordable depending on a person's risk profile. Imagine you realize that somebody who's insured with you is adopting an unhealthy lifestyle. What's the way you can approach that person that is compassionate, that doesn't talk down to the person, but is still effective? And how can you make that person an ally? Over the last few years, my sense is that the answer to this is health coaching: having specialists who can help people with behavior change, who are skilled at talking in the right way so that they don't come across as offensive, and who are also protected by regulations that keep the data private. That is a possible answer to the question of how we integrate these detection and scanning technologies with our systems of care and our systems of education.