Hello. Welcome back to the course. I'm Trent Buskirk, and today we're going to talk about threats to validity for gathered data. We've already defined what gathered data and designed data are, and today we'll walk through some threats to the validity of that kind of data.

Just to make sure we're all on the same page: we've already talked about the dimensions of total data quality, and we'll keep referring to this picture throughout the sequence of courses. Where we are right now is on the measurement side of the picture, and we're going to be talking about the validity piece. Sitting between the construct and the variable and field elements of measurement is this validity element, and specifically we'll look at things that can threaten the validity of gathered data. Remember, gathered data are data generated from web sources, social media, and other big data sources; keep that in the back of your mind.

Why would we care about threats to validity for gathered data in the first place? Some people argue that big data are opportunistic: we receive the data, we find the data, we use the data. Why care about quality or validity when there's nothing we can do about it and we're sourcing it from somewhere else? There is something to that point, in the sense that we didn't design the data for research purposes, yet we do want to use it for research purposes. But here's the thing. There's an interesting study that dates back 20 years or so, and even though it's old, you can imagine the power of this finding today: erroneous data cost U.S. businesses over $600 billion annually. That estimate implies these costs are somewhere between 8 and 12 percent of revenue, and that 40 to 60 percent of a service organization's expenses may be consumed as a result of poor data quality. Can you imagine using data that are not valid to make decisions for your company, only to realize the decision wasn't correct because the data were misinforming you? That's a striking aspect of business operations, but now imagine putting it into the research arena. For data whose origin, or reason for being created in the first place, you don't necessarily understand, these threats to quality can really derail research studies if we don't begin to think through what they could mean for us.

Organizations typically find data error rates between one and five percent, but rates can be as high as 30 percent for some data sources and organizations. This metric is simply the quotient of the number of fields with errors in them over all the fields of interest, and the figures come from a 2014 study of potential error rates in such data. While it may seem like error rates are not something we can do anything about, they do occur at some level, and we should be aware of them, think about ways to evaluate them, and potentially address them for our research purposes.
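To make that error-rate metric concrete, here is a minimal Python sketch with made-up records and illustrative validity rules; the column names and what counts as an "error" are assumptions for this example, not anything taken from the study itself.

```python
import pandas as pd

# Hypothetical transaction extract; in practice this would be your own gathered data.
records = pd.DataFrame({
    "store_id":  [101, 102, None, 104],
    "amount":    [19.99, -5.00, 42.50, 7.25],          # a negative sale is flagged as an error here
    "timestamp": ["2021-03-01", "2021-03-01", "not_a_date", "2021-03-02"],
})

# Flag fields that are missing or violate a simple validity rule (illustrative rules only).
errors = pd.DataFrame({
    "store_id":  records["store_id"].isna(),
    "amount":    records["amount"] < 0,
    "timestamp": pd.to_datetime(records["timestamp"], errors="coerce").isna(),
})

# Error rate = fields with errors / all fields of interest.
error_rate = errors.values.sum() / errors.size
print(f"Field-level error rate: {error_rate:.1%}")   # 3 of 12 fields -> 25.0%
```

The rules themselves would come from whatever your particular data source is supposed to contain; the calculation is just the quotient described above.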
The platform or data source dynamics, and the structure of the platform itself, may limit how accurately the data reflect human behavior. This is one of the primary threats to the validity of gathered data according to Ruths and Pfeffer (2014). By a platform I mean something like Twitter, LinkedIn, or Facebook, and a data source might be something like transactional data from a meter or a sensor connected to a cash register in a store. These are the places where we might acquire gathered data, how gathered data may be generated, or the repositories that store them.

The whole goal of platform designers like Twitter and Facebook is to improve the user experience. Who wants a bad Facebook? Nobody wants to be on Facebook if it isn't enjoyable or informative. That is the job of these platform designers, and they improve the user experience based on three main ideas: homophily, otherwise stated as "birds of a feather flock together"; transitivity, "a friend of a friend is a friend"; and propinquity, "those close by form a tie." So there is a geographic element, there is a similarity element, and there is this transitivity element: if I know you and you know someone else, then I likely come to know that person too. You can see this come to pass when you get recommendations from Facebook, for example, of people you don't know but who are tangential to your friend network. This is exactly how these platforms generate suggestions about how to grow your network.

If, as researchers, we are using that data to study networks, for example, we should care about how those platforms curate and generate those networks, because that could be a threat to the validity of the data we gather and then use for our research. We may have to think more carefully about whether there are errors or threats to quality in that data that could impact our ability to use it for research purposes. I'm going to talk about these three design elements as they relate to threats to validity for gathered data because I think they're important to think through, not necessarily as outright sources of error, but because they help us reason more logically about gathered data in the first place. Today we have Facebook, but tomorrow we may have Handbook or PalPage; we don't know what's coming, and big data is ever evolving. Part of the point of this course is to help you think through these issues even though Facebook may be different tomorrow than it is today. The concepts, I think, are helpful in working with data you might source from these repositories.

Think about this: the optimal user experience may not result in accurate measures based on data gathered from these sources. For example, following users on Twitter may not be an adequate measure of true network size if the following is only in one direction. If you look at who I follow and treat that as an estimate of my network size, it may be an overestimate if many of those people don't follow me back. There is a directional element that can threaten the validity of using that data for a network analysis, because we're assuming too much.
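As a rough sketch of that directional point, suppose we have already pulled two hypothetical sets of account IDs for one user from some platform, one listing who the user follows and one listing who follows the user back; the names and counts here are invented purely for illustration.

```python
# Hypothetical follow lists for one account; on a real platform these would
# come from the platform's API, subject to its own limits and curation.
following = {"ann", "bo", "cal", "dee", "ed", "fay", "gil"}   # accounts the user follows
followers = {"ann", "cal", "fay"}                             # accounts that follow the user

one_directional = len(following)          # naive "network size" from follows alone
reciprocal = len(following & followers)   # ties that exist in both directions

print(f"Follows (one direction): {one_directional}")   # 7
print(f"Reciprocated ties:       {reciprocal}")         # 3
# Treating the raw follow count as network size would overstate the mutual network here.
```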
Correlation may also be higher among friend samples recruited from Facebook, because recommendations are made based on concepts of similarity across those friends. Suppose Facebook picks up on some similarity measure, say we're all in taekwondo, and you want to study whether we all have a BMI in a reasonable range. If I'm connected only to my friends through the taekwondo network, we might all be physically active and our BMIs might be lower than average. The point is that some of these platform-specific curation techniques, or at least the concepts behind the curation, may create correlations or inadequacies in the data we're trying to gather and use, and that might threaten the validity of their use for our particular purpose.

The technical specifications of a platform and the processes it uses may also distort measurements of human behavior. A couple of examples. Only the most recent 3,200 tweets are returned for a public Twitter account when a specific username is queried. Suppose you're trying to understand how someone has been talking about the pandemic. Their perception of COVID might have changed over the last year: my most recent 3,200 tweets might be me having a really bad spell and hating COVID, while the prior 3,200 tweets, which you couldn't get with a simple query, might have been me putting on my smiley face about how things were going. Being able to access only the most recent 3,200 tweets may limit your perception of my current opinion about COVID.

Google stores and reports the final searches submitted after auto-completion, as opposed to the actual text that was typed. So when you're thinking about what keywords people used to search for COVID, you're going to get the versions of those search terms that were auto-completed or corrected by Google, or that were suggested and then clicked on. If you're trying to get a sense of misspellings or slang terms people use for COVID on the street, terms not necessarily surfaced by Google's auto-completion, you might miss some of those in your research as well.

Twitter dismantles retweet chains back to the original user who posted the tweet. For example, if you're trying to figure out the geographic location of a retweet, it is traced back to the location of the original tweeter. If I'm trying to understand geographic patterns of opinion based on retweets, I can't look at the geography of the retweet; I can only ascribe to that retweet the geography of the original tweeter. That is a threat to the validity of the measure with respect to geographic specificity. These are all consequences of the way these platforms make decisions about how to store, represent, and let you query data.

Gathered data, even from human-oriented platforms, can also contain non-human results. If you care about a study of human opinion on a certain topic, on Twitter you might actually get an opinion from a bot, which may or may not be considered eligible for your study. The presence of bots, even if you cannot detect them initially, can threaten the validity of the data you gather, because you're looking at a mix of bot data and human data when you may only want to study human data.
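Here is one way you might screen suspected bots out of a gathered sample before analysis, sketched with a hypothetical bot_score column and an assumed cutoff of 0.5; neither comes from Twitter or any particular bot-detection tool, so treat this as a pattern rather than a recipe.

```python
import pandas as pd

# Hypothetical account-level extract with a bot-likelihood score attached
# (for example, from a separate bot-detection step you ran yourself).
accounts = pd.DataFrame({
    "account_id": ["a1", "a2", "a3", "a4", "a5"],
    "bot_score":  [0.05, 0.92, 0.10, 0.61, 0.20],
    "opinion":    [1, 0, 1, 0, 1],   # illustrative outcome of interest
})

threshold = 0.5                      # assumed cutoff; a sensitivity analysis is wise
humans = accounts[accounts["bot_score"] < threshold]

print(f"Kept {len(humans)} of {len(accounts)} accounts as likely human")
print(f"Mean opinion, all accounts:  {accounts['opinion'].mean():.2f}")
print(f"Mean opinion, likely humans: {humans['opinion'].mean():.2f}")
```

Comparing the estimate with and without the flagged accounts, as in the last two lines, is a simple way to gauge how much the bot mixture could be distorting your measure.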
Varol and colleagues in 2017 estimated that between 9 and 15 percent of active Twitter accounts are bots. Now, some of these bots are nefarious, but others are considered white-box, or friendly, bots: if you use a bot to get traffic or weather information on Twitter, those are friendly bots, and Twitter will likely not eliminate them from the ecosystem. In fact, in 2018 Twitter released a new review policy for bot accounts that sought to limit the number of bad actors, but it doesn't eliminate all bots completely, as I just mentioned.

Here's an interesting slide that looks at very popular contemporary Twitter users you might recognize from the popular press: Justin Bieber, Lady Gaga, Katy Perry, and the list goes on. The question asked here is, "Is the number of Twitter followers a valid measure of interest, popularity, support, or engagement?" This slide classifies the followers of these popular users as fake, inactive, or good; the good accounts would be the human users, and the fake ones would be the bots, for example. You can see that the percentage of bots following these particular users varies somewhere between 5 and 15 percent, as we already mentioned. If you are interested in studying human populations, you may not want the bots that come along for the ride.

We also have to think about the fact that when you're using gathered data, in some cases big data, you have lots and lots of variables in those datasets, and you may find relationships between some of those variables, but those relationships may not be causative. Correlation certainly is not causation, as we know from all of our research-methods mantras, but correlation can also be spurious, and a spurious correlation doesn't necessarily represent a useful finding for research purposes. Here's an example. The per capita consumption of mozzarella cheese in the US correlates positively with the number of civil engineering doctorates awarded, and the correlation reported here is 0.96. You can see from the plot that this is a really strong correlation; as a social scientist, I'm having a party over a correlation that strong. This is also my most cheesy slide, so I apologize for that. But the point is that mozzarella cheese consumption and the number of civil engineering doctorates are very likely not related. When we put together lots and lots of data sources from different areas, it is not uncommon to find such pairings spuriously: the two things happen together, but they are not necessarily related. These are all threats to validity we might experience when using big data sources.
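To see how easily a strong but meaningless correlation can appear, here is a small sketch with invented numbers (not the actual cheese or doctorate figures): two unrelated series that both drift upward over the same years end up with a very high Pearson correlation simply because they share a time trend.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2000, 2010)

# Two unrelated, made-up series that each drift upward over time.
cheese_lbs_per_capita = 9.0 + 0.25 * (years - 2000) + rng.normal(0, 0.1, years.size)
civil_eng_phds        = 480 + 20   * (years - 2000) + rng.normal(0, 15, years.size)

r = np.corrcoef(cheese_lbs_per_capita, civil_eng_phds)[0, 1]
print(f"Pearson r = {r:.2f}")   # very high, purely because both series trend upward together
```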
Just so you're aware of what's ahead: in the next segment of the course, we're going to explore a historic example of threats to gathered data in the Google Flu Trends case study. In that case study, we'll ask you to read a couple of articles that show how the Google Flu Trends tool worked for predicting flu-like illness, and then another article that discusses how the tool could be improved by combining information from gathered data sources with designed data sources, specifically survey data. We'll come back to this idea throughout the course: the ability to leverage multiple data sources is one of the ways we can overcome threats to gathered data, particularly around validity. Thanks a lot for joining us; until next time, be well.