Hello, everyone. Welcome to Week 2 of this first course. We're going to start off this week by defining validity, one of the key measurement dimensions of the total data quality framework. Let's revisit the big picture that we learned about last week in terms of the overall total data quality framework. Again, we have the measurement dimensions and the representation dimensions. We're going to start off this week talking about validity as one of the measurement dimensions, so you see we've highlighted the validity dimension as where we're getting started with measurement. Again, we're starting with a theoretical construct, and then we're thinking about variables or fields that we'd like to measure, which should give us valid measures of that particular construct. We'll talk about the definition of validity and what we need to think about in terms of validity as this first dimension. Validity means that a variable for analysis is accurately measuring what it is supposed to be measuring in theory. Again, we have a theoretical construct that we'd like to measure, and then we choose a variable to capture measures of that particular construct. The more valid a variable is, the more accurately it's measuring what it's supposed to be measuring in theory. This concept of validity could pertain to construct validity for designed data, such as surveys: are you coming up with a valid measure of a theoretical construct? Or it could represent face validity for gathered data: in other words, if you're trying to gather data from existing sources, is something, on the surface, measuring what it's supposed to be measuring in theory? Now what do we mean by a construct? A construct is a theoretical trait or characteristic that we're attempting to quantify for the different units that we're trying to study. Possible examples of constructs include physical functioning, personality, socio-economic status, et cetera.
These are all theoretical constructs, and we try to measure them with different variables or fields, depending on whether we're collecting data in the designed data framework or gathering data from existing sources. The validity idea is whether or not the variables that we're actually working with are providing valid measures of the constructs that we're interested in, like physical functioning or personality, for example. Here's an example with designed data. Consider a measure of socio-economic status, or SES, derived from other variables that are measured in a survey, something like household income, education, and wealth. These could potentially combine to give you an overall measure of socio-economic status. The question that we're asking with validity is this: if we take these different measures collected in a survey and combine them in some way to come up with an overall measure of socio-economic status, are we getting an accurate measure of the construct that's known as someone's overall socio-economic status? In this picture, you can see that we have a circle with socio-economic status inside it. We think of that as a latent variable or an unmeasured variable. We don't know a person's socio-economic status right away. We have to try to infer what their socio-economic status is based on their responses to other survey questions in our designed data collection. That latent construct of socio-economic status could be driving the values of household income, years of education, and household wealth. That's why we have question marks here. We're not sure how much of socio-economic status is indicated by household income. Socio-economic status could be a very strong predictor of household income and household wealth, but maybe not as strong a predictor of years of education.
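To make the idea of combining observed variables into one SES measure concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the respondent values are made up, the field names (`income`, `education_years`, `wealth`) are hypothetical, and the unweighted average of standardized indicators is just one possible combination rule, not a method prescribed by the framework. The item-rest correlations at the end are only a rough first look at how closely each observed variable tracks the others.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical survey responses: made-up values for illustration only.
respondents = [
    {"income": 32000, "education_years": 12, "wealth": 15000},
    {"income": 58000, "education_years": 16, "wealth": 90000},
    {"income": 45000, "education_years": 14, "wealth": 40000},
    {"income": 120000, "education_years": 18, "wealth": 350000},
    {"income": 27000, "education_years": 10, "wealth": 5000},
]
INDICATORS = ["income", "education_years", "wealth"]


def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


n = len(respondents)

# Standardize each observed indicator so that the units are comparable.
z = {name: z_scores([r[name] for r in respondents]) for name in INDICATORS}

# Assumed combination rule: unweighted average of the standardized indicators.
ses_index = [mean([z[name][i] for name in INDICATORS]) for i in range(n)]

# Item-rest correlation: each indicator against the average of the other two.
item_rest = {}
for name in INDICATORS:
    others = [other for other in INDICATORS if other != name]
    rest = [mean([z[other][i] for other in others]) for i in range(n)]
    item_rest[name] = pearson(z[name], rest)

for name, r in item_rest.items():
    print(f"{name}: item-rest correlation = {r:.2f}")
```

An indicator with a noticeably lower item-rest correlation than the others would be one of the "question marks" worth a closer look. Keep in mind that indicators agreeing with each other does not by itself establish that the composite is a valid measure of socio-economic status.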
The boxes that we're using are observed variables in the actual survey. We might be interested in using these observed variables to indicate a person's socio-economic status. But the relationship between these observed variables and the latent construct that we're trying to measure may not be exactly clear. We want to make sure that these are all valid indicators of socio-economic status before using them to define that particular construct. That's a survey example. What about the case of gathered data? Suppose that we were collecting data on tweets. Does a qualitative analysis of the text gathered from a collection of tweets about mental health difficulties with the pandemic provide valid measures of the mental health of Twitter users with particular characteristics? Again, it's the same picture here. We have some latent, unobserved measure of mental health that we're really interested in, and we're deciding to measure that construct with certain information extracted from those existing tweets. Again, in this gathered data example, we're trying to make sense of existing data or organic data, that is, data that are generated organically. For example, somebody might tweet, "This pandemic really has me feeling blue." Is that a valid indicator of a person's mental health? Again, is mental health a good predictor of what they're actually going to tweet in terms of how they feel? That's the question that we're really getting at here. If we're coding that particular tweet, "This pandemic really has me feeling blue," as an indicator of a person's mental health, saying that this person is struggling with mental health, is that in fact a valid indicator? That's the question mark that we're trying to answer here. Again, this is not designed data. This is trying to make sense of data that we're gathering from existing sources such as tweets. Here's another example.
This is a bar chart that shows the number of COVID-19 deaths in the first 100 days in office of the past two presidents. You see Joe Biden has over 8,000 COVID-19 deaths attributed to his presidency in that first bar. Then if you go to Donald Trump, whom Biden succeeded, you can see that there are zero deaths in his first 100 days in office. The question here is, what's wrong with using the number of deaths attributed to COVID-19 in the first 100 days of a presidency as a comparative measure of performance for the last two presidents? If your research was aiming to compare the performance of the past two presidents and you decided to use this as a measure of their performance, is that really a valid measure of performance? Our construct is the performance of these two presidents. The variable that we're choosing as a measure of that construct is the number of COVID-19 deaths in the first 100 days in office. Is that really a valid measure of their performance, or is it a measure of the context in which they're working? That's really the question. So obviously, this is not a fair comparison of the performance of the past two presidents, because it's not really a direct measure of performance. That's what we really mean by validity. Are we choosing variables or fields that provide valid measures of the theoretical construct, in this case performance, that we're really interested in? Because by this measure of performance, it looks like Donald Trump was doing a whole lot better in his first 100 days than Joe Biden. But clearly there are other factors driving this particular measure, rather than just their performance. We started with an initial definition of validity and some simple examples. Validity is crucial when designing a research study. We need to make sure that our measures are valid measures of what it is that we're trying to measure in theory.
This week, we're going to continue to learn about threats to validity for both designed and gathered data. Then we'll turn to the data origin and data processing measurement dimensions of the total data quality framework. Thank you.