[MUSIC] Very often, when people are confronted with this research, they wonder whether people are really putting on a face on Facebook: whether we are changing ourselves so that we represent not the truth inside of us, but what we would like people to see. Our research has shown that that's the case. We're particularly more likely to express positive emotion and success on Facebook. And conversely, when things are not going so well and we experience negative emotion, we tend not to share that information with our Facebook friends. But because of the size of these data sets, our algorithms are able to subtract these biases from the overall pattern of language. We can still distill the psychologically meaningful signal that differentiates different people from one another.
And I want to give you one example of how this works with something you might not think people are willing to talk about on Facebook, and that's depression. In this case, we gave 16,000 people a depression survey. And we were able to isolate a low mood component, so a sadness component. And before I show you these results, I'd like you to think about what the most predictive word of having low mood is. Take a moment. Here's what it really is: it's the word "alone." When I ask audiences, they generally pick words such as "sad" or other low mood words. But it turns out that there's really a process of social isolation that is coming out in these language results. And this shows one of the powers of this kind of analysis: let the data tell the story. This is a data-driven analysis, where the results bubble up from the data itself. And then, since we are psychologists, as specialists we can make sense of them. But the fact that social isolation, feeling alone, is so centrally implicated in feeling low mood is, to a psychologist, a really cool finding. Here is another component of depression: low self-worth, feeling poorly about yourself.
Think again about what might be the language that indicates people are not feeling good about themselves. Turns out, it's words like "why." Not having a sense of meaning in your life, not being able to make sense of the world, feeling that your world is a disordered place. Also note words like "apparently," "probably," and "actually" in that word cloud. Sometimes we refer to this as epistemic hedging. The idea is that people are disconnecting from the things they say. They don't mean things, or they indicate that they're not convinced of the truth of their own statements, which is a subtle indicator that something is wrong with the way they perceive reality. And all of this bubbled out of a simple data-driven language analysis like this one.
So, to summarize what we've talked about: these Facebook-based language analyses that use people as units of analysis can give you insight into thoughts, emotions, and behaviors, so all the things that psychologists tend to care about. They make measurement possible, and it's cheap. You can measure tens of thousands of people in an afternoon on a computer server. And it's unobtrusive. That means you don't have to knock on anyone's door and have them interrupt what they do. Instead, you get to observe them in their own ecosystems and their own friendship circles. There are biases: people putting on a face, social desirability biases, and sample biases, since younger people tend to be on social media. This can be handled statistically, and increasingly, with something like 50% of the US population now on Facebook, we no longer have to worry about these biases. And certainly, when you compare these biases to psychological studies that are based on 100 weird undergraduates, you're doing pretty well if you can get ten million people in your sample.
So far, what we've seen is language analyses that were about people. But here at Penn, we've discovered something that we think is really cool.
What you do for people also works for communities. So rather than people, in this next analysis, we're going to use counties. Counties are communities in the US. There are about 3,000 of them, and they have on average a few tens of thousands of people living in them. From these counties, we can get tweets. So we start with a billion tweets, we figure out where they were sent from, and then we put them in the corresponding counties. And we end up with a data set of about 1,400 counties or so that cover something like 80% of the US population, and for each of those counties we have enough language.
For the counties, we also have other information from government agencies, such as the Centers for Disease Control or the US Census Bureau. And we can even get things like how many people died from heart disease, specific kinds of heart disease that have to do with clogging of your arteries. For every county we get these estimates, because they were recorded on death certificates and averaged up to the county level, and we as researchers can pull them from these organizations. And so for every county, we get these numbers.
Now we start with tweets again, and again we have the job of coming to a quantitative representation of the tweets, which is to say we have to turn the tweets into numbers. How do we do that? Well, one of the ways we can do this that we haven't seen before is, rather than counting words, to run a topic algorithm over these tweets. This is a cool and nifty thing out of natural language processing that allows us to describe, say, 150 million tweets as a distribution over 2,000 topics. That is to say, we tell the algorithm: "Algorithm, assume that these 150 million tweets are not talking about 150 million things, but really are only talking about the same 2,000 things. Now find me the 2,000 things that best explain these tweets."
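The topic step just described, and the idea of relating each topic to a county-level health number, can be sketched in a few lines. This is only an illustrative sketch, not the pipeline used in the research: the talk does not name the topic algorithm, so a small LDA sampler (collapsed Gibbs) is assumed here, and the function names `fit_lda` and `pearson`, the four toy "counties," their words, and the mortality numbers are all invented for the example.

```python
import random
from math import sqrt

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA (assumed technique, toy scale).
    docs: list of token lists. Returns (theta, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]         # token counts per (doc, topic)
    topic_word = [[0] * V for _ in range(n_topics)]    # token counts per (topic, word)
    topic_total = [0] * n_topics                       # token counts per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assign.append(zs)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, z = word_id[w], assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wi] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1
    # Smoothed share of each topic in each document (each row sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Invented toy data: each "county" is the pooled words of its tweets.
counties = {
    "A": "hate mad angry hate tired bored mad".split(),
    "B": "angry hate bored tired mad hate".split(),
    "C": "hope great conference learning hope strong".split(),
    "D": "great hope learning strong conference great".split(),
}
mortality = {"A": 60.0, "B": 55.0, "C": 30.0, "D": 28.0}  # made-up rates

names = sorted(counties)
n_topics = 2
theta, topic_word, vocab = fit_lda([counties[n] for n in names], n_topics)
outcome = [mortality[n] for n in names]

# Correlate each topic's county-level usage with the outcome.
correlations = []
for k in range(n_topics):
    usage = [theta[d][k] for d in range(len(names))]
    correlations.append(pearson(usage, outcome))
    top = sorted(range(len(vocab)), key=lambda i: -topic_word[k][i])[:3]
    print(k, [vocab[i] for i in top], round(correlations[k], 2))
```

On real data the same shape applies at a much larger scale: one pooled document per county, a couple of thousand topics, and one correlation per topic against the health outcome. A production analysis would use an established topic-modeling library rather than this toy sampler.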
And what you end up with is these really cool little clusters of language that are semantically coherent, that seem to talk about the same concept. And I'll show you a few results in a moment, but just to finish the method pipeline: we have these numbers again, in this case these topics, and we have the numbers that come from places like the Centers for Disease Control. And now we can correlate every topic against this outcome, and shortlist and visualize those topics that are the most strongly associated.
And I'm going to show results now for the case of heart disease. Here's the language that is associated with higher heart disease in US counties. You see language like this, which looks like hate and interpersonal tension. Now, every one of these little word clouds is a topic, so these are words that occur together or occur in similar contexts. You see language like this, which looks like hostility and aggression and cursing and disagreeableness. There's an old hypothesis in the heart disease literature that is referred to as type A personality. The idea used to be that if you're really high-strung, you're more likely to die of heart disease. What we've learned over the years is that it's actually not being high-strung that kills you. What kills you is the hostility towards other people, because it releases stress hormones in our system, and that tends to do things to our arteries that promote heart disease. So we see the same result here as in this hostility component of type A personality. But we also saw this, which looks to us like disengagement, or boredom and fatigue. This is the language of people not having a reason to get out of bed in the morning. And time and time again, we've seen that these subtle psychological factors can have surprising explanatory power for the heart disease rates in communities.
Here's the language that is associated with lower heart disease. So this is the language that is statistically protective against heart disease.
It looks like positive experiences. We'd expect that, because richer communities also have lower heart disease. We see things like skilled occupations, which makes sense because these are communities with higher education. They talk about things that are associated with higher education, such as going to conferences, public service, community service, communication, and so forth. So this makes sense. This has to do with what health professionals call socioeconomic status. These are both manifestations of socioeconomic status. But there's a third bit here. Look at these word clouds. To us they look like optimism. Optimism, resilience, goal pursuit. These are all related constructs, but they have to do with a mindset. A mindset that is willing to overcome obstacles, plan around them, and persevere in the face of them.
Now, all the results I'm showing you are what's called cross-sectional. That means we haven't followed individuals over time, but we've taken one time slice and compared language rates to disease rates. Other studies that have followed people over time have shown that things like optimism are highly protective at the individual level against dying of heart disease. There are a number of reasons why, but it's a really cool finding that we see it here as well.
So what can we do with these methods, right? What we've seen so far was the insight piece. We've seen the power of data-driven analyses to identify specific processes that might be associated with getting or not getting a disease. But we can flip it around again. Once we've learned all these language associations, we can just use Twitter to estimate the death rates from heart disease, atherosclerotic heart disease, in US counties. And to show you how well this works, I'm showing you a map here. On the left is what the Centers for Disease Control report, as recorded on death certificates. And on the right are our estimates, just based on Twitter, and you see it does surprisingly well. How well?
Well, compare the power of different predictors to predict heart disease. First, your demographic predictors: percentage Black, female, married, and Hispanic. They have some predictive power. Here are the traditional health risk factors. The biggest one of them is still smoking. Smoking, diabetes, hypertension, obesity, and you see they begin to account for more variance in heart disease. They do a better job predicting heart disease. Here's income and education, still very important. There's a huge disparity in the US between how long the rich and the poor live, for a number of different reasons, and this predictor here captures that variance. Here, then, is a standard model that combines all these ten predictors, the demographic, the health, and the socioeconomic factors, into a standard epidemiological model. And compare this to just the prediction performance of Twitter. Twitter outperforms these models significantly. When you combine Twitter with these other predictors, it doesn't do much better. So what that tells us is that what Twitter is doing is measuring income, measuring education, measuring smoking, and all these other things. And on top of that it is adding a sliver of psychological causes, of psychological reasons why you may or may not get heart disease, like the optimism that we saw a few slides ago.
The interesting thing about these findings is that the people dying from heart disease are not the people on Twitter, and the people on Twitter are not the people dying. The people on Twitter live in the same communities as the people who have heart disease, will develop heart disease, or die from heart disease. And what the people on Twitter are reporting are general attributes of the communities. What is it like to live in those communities? Are there green spaces, places to work? Do you trust your neighbors? All these other things. They serve as canaries in the coal mine, if you will, for the quality of the community.
And if they report a psychological profile of the community that looks like it would increase heart disease risk, it really is the people who are actually at risk of heart disease who die. So that explains the, at first sight, bizarre association between what people say on Twitter and the death rates of older people dying from heart disease.
So what can we do with these methods? Well, we're about to release this: a live map of the US that shows, for every county in the US, what the last month's or the last year's well-being was. If you were about to move somewhere in the United States, wouldn't you want to know what the well-being is of the community you're moving to? These well-being attributes of a community have to do with green spaces, with whether you trust your neighbors, with whether your kids have places to play. All of this figures into the quality of life in a community. And through social media, again, we can measure this cheaply and unobtrusively.
We also run projects where we look at individuals, individuals that we want to follow over time, like students, where we might be interested in measuring student well-being. To do so, we're now at the point where we can install an application on a student's phone, with their permission and the permission of their parents. And the application reads and scans all the texts that they write, to get a very coarse estimate of their well-being over time. But certainly you should be able to see when people start being depressed, so you can help them. Very often these technologies are a stepping stone between people and other people who can help them. So if you're a parent, the information that your child is not doing well is something that you would find very valuable. And if these technologies can aid you in obtaining that information, so you can help in the way you see best, the way you see most fit, I think that's an appropriate use for these technologies. This is where we are right now.
But what will happen in the next 10 to 20 years with this kind of research and with these kinds of technologies? Well, if there's something we've learned in the last 20 years, it's that computers really don't help other people; people help other people. And now researchers like us, and big organizations like Google, and data conglomerates like Facebook have this information. And increasingly healthcare systems, such as insurers, have this information. Whether that is with your permission or without your permission is something that's still being debated in legislatures all across the world. Now that we have this information, the question is, what do we do with it? And here really comes the biggest challenge, I think, that is coming out of all this research. How do we design systems that ethically and effectively respond to knowing that you might be physically unwell or mentally unwell?
Take the case of therapy. I was a therapist for two years; I saw patients. The idea that I could, as a therapist, install something in a patient's data ecosystem, be it a phone (today we talk about phones; 20 years from now we might be talking about different devices), something in their technology layer that is able to get me a daily feed of how they're doing over time, is really attractive. And why is it attractive? Because I have a defined relationship with them, it's my job to take care of them, and they have given me permission to do so. And when I see that curve drop, I can pick up the phone and call them and ask them what's up, whether they want to come in, and so forth.
Similarly, imagine you are an insurance company, and you are not just using this information to re-stratify your risk pools and make insurance more affordable or less affordable depending on someone's risk profile. Imagine you're realizing that somebody who is insured with you is adopting an unhealthy lifestyle.
How can you approach that person in a way that is compassionate, that is not talking down to the person, but is still effective? And how can you make that person an ally? Over the last few years, my sense has been that the answer to this is health coaching. Having specialists who can help people with behavior change, who are skilled in talking in the right way so that they don't come across as offensive, and who are also bound by regulations that keep the data private, is a possible answer to this question of how we integrate these detection and scanning technologies with our systems of care and our systems of education.