Hi! In this video we're going to further build intuition for inverse probability of treatment weighting. In particular, we're going to focus on how this type of weighting creates pseudo-populations, and in fact creates pseudo-populations that are unconfounded.

As motivation, we'll look at survey sampling. In surveys it's very common to oversample some groups relative to others. For example, you might ultimately be interested in a minority group, or older adults, or obese individuals, or some other subpopulation that is small relative to the larger population, and you want to make sure you get a large enough sample of that group, so you oversample it. If you then want to estimate, say, a mean for the whole population, not just for one of the subpopulations, you have to weight the data to account for the oversampling. This kind of approach is known as Horvitz-Thompson estimation. In other words, you can oversample, and as long as you know the underlying proportion of each group and how much you oversampled by, you can weight back to recover the original population.

This relates to observational studies when we're thinking about comparing different treatments, because in an observational study where you have confounding, you're typically going to have oversampling of either the treated group or the control group at various values of the covariates. In other words, you could imagine that healthier people are more likely to get one treatment than another. You can think of that as a type of oversampling: you're oversampling treated people, in some sense, relative to the ideal situation where you had randomized, in which case there would have been an equal number of treated and control subjects among the healthy. It's like survey sampling in that sense, so there's the potential to use weighting to get back to the original population. So what we're going to try to do is use weighting to create a pseudo-population where there's no confounding.

As an example, let's think about a situation where we have a propensity score of 0.9. What I mean here is that there's some subpopulation of people defined by X, and their probability of being treated is 0.9. So these are people who are highly likely to be treated. You could represent this as follows: in the treated group you see nine red circles and in the control group there's one, so that's the right ratio, where 9 out of every 10 subjects are treated. That's the original population. And if you look at that, it does look like oversampling, right? Oversampling relative to the ideal. You're oversampling treated people relative to what you would have liked, which is balance between the treated and control arms. So what we can do is apply weighting to get back to what we really want. For the treated group, we'll apply a weight of 1/0.9, one over the probability of treatment, which corresponds to a weight of 10/9. So each treated subject gets weight 10/9. For the control arm, we'll apply a weight of 1/0.1, which is one over the probability of being in the control arm given X, and that equals 10.
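Just to make that arithmetic concrete, here is a minimal sketch in Python of the weighting step. The counts (nine treated, one control) and the 0.9 propensity score come straight from the example above; the variable names are purely illustrative.

```python
pi = 0.9                      # propensity score P(A = 1 | X) for this subpopulation
n_treated, n_control = 9, 1   # observed counts at this value of X (the red circles)

w_treated = 1 / pi            # weight for each treated subject: 10/9
w_control = 1 / (1 - pi)      # weight for each control subject: 10

# The size of each arm in the pseudo-population is the sum of the weights in that arm.
pseudo_treated = n_treated * w_treated   # 9 * (10/9) = 10
pseudo_control = n_control * w_control   # 1 * 10     = 10

print(pseudo_treated, pseudo_control)    # both come out to 10: balanced, as if randomized
```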
If you apply those weights, you end up with 10 circles in the treated group and 10 circles in the control group, because in the treated group each person counts as 10/9 of a person: there are nine people, and 10/9 times nine is 10 people. In the control arm, the one original person counts as 10 people. So we'll think of this as a pseudo-population: we apply the weights and now we have this new population. And in this new population there's no oversampling; it's perfectly balanced in the way that we would like, as if it were a randomized trial. Just to reiterate: in the original population, some people were more likely to get treated than others, based on their covariates. But in the pseudo-population, everyone is equally likely to be treated regardless of their X values. And that is of course what you would have in a randomized trial: everyone would be equally likely to be treated, and treatment wouldn't depend on X because we're essentially flipping a coin.

So now let's imagine we actually want to carry out estimation. If we want to estimate a causal effect, we're typically going to want to estimate something like this, which is the expected value of a potential outcome. I just chose one of the potential outcomes; this is the potential outcome under treatment. So the expected value of Y¹ is the average value of Y if everyone in the whole population had been treated. Hypothetically, if we were able to treat everybody, this would be the average value. That's one of the things we're interested in. We would also want the expected value of Y⁰, the mean of the other potential outcome; you would estimate that in a similar way, but I'm just going to illustrate how you would estimate the mean of this one potential outcome. So I'll walk through this particular formula, which will hopefully make it more clear what's going on. The first thing to notice is that we have these two indicator functions, here and here, and we're summing from i equals 1 up to n, where n is the total number of people in your data. You randomly sample n people, so you have n data points. The indicator function is going to pick off the ones that were treated. So the indicator that Ai = 1 is just a binary dummy variable: it takes value one if treated, zero if not. It's an on-off switch: if the person was treated, it's flipped on and we keep that value; if it's equal to zero, we ignore it. So this is just a fancy way of saying we only want the treated subjects. Now, if you didn't have any confounding, we could just take the usual sample mean: to estimate the mean of Y among treated subjects, we would simply average Y over the treated subjects. But here we have confounding, so we actually want to take the sample mean in the pseudo-population, not the original population, because the original population has confounding. That's all we're doing here: weighting each Y by one over the propensity score either upweights or downweights that particular value of Y, so this part gives us the values of Y in the pseudo-population. Then we add those up—we have a summation symbol here—so in the numerator we're just adding up all of the Y's in the treated pseudo-population.
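The formula on the slide isn't reproduced in this transcript, but based on the description here and in the next part, the estimator of E[Y¹] being walked through is presumably the standard inverse-probability-weighted estimator, written with π̂(Xᵢ) for the estimated propensity score:

\[
\hat{E}[Y^1] \;=\; \frac{\sum_{i=1}^{n} \dfrac{I(A_i = 1)\, Y_i}{\hat{\pi}(X_i)}}{\sum_{i=1}^{n} \dfrac{I(A_i = 1)}{\hat{\pi}(X_i)}}
\]

The numerator is the sum of the outcomes in the treated pseudo-population, and the denominator, discussed next, is the size of that pseudo-population, i.e., the sum of the weights.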
I say treated because we're picking off the treated subjects, and I say pseudo-population because we're weighting. So we're adding up all the values of Y in the treated pseudo-population. But we don't want to stop at a sum; if we just add up the values of Y, we then need to divide by some kind of total. Just as, when taking a sample mean, you would add up all the values of Y and divide by n, here we need to divide by the number of subjects in the treated pseudo-population. That's essentially just the sum of the weights: we still restrict to treated subjects, but then we add up their weights. So if you do that division, we end up with a valid estimate of the mean of the potential outcome, assuming that our typical assumptions are met. The first is exchangeability, also known as ignorability, which says that our X's fully capture confounding, so treatment assignment is effectively random given X. The second is positivity, which says that the propensity score is strictly between zero and one; it's never exactly zero or exactly one, so everyone has some non-zero probability of getting either treatment. These π's here, just as a reminder, are the propensity scores. I'll make one additional comment: you can also see why positivity is so important, because we're dividing by the propensity score itself. If the propensity score were actually zero, we would be dividing by zero and we would have a problem. So that just further reiterates the importance of the positivity assumption.
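Putting the pieces together, here is a short sketch of how that estimator might be computed in practice. It assumes you already have arrays of outcomes, treatment indicators, and estimated propensity scores; the function name and the toy numbers are made up for illustration.

```python
import numpy as np

def ipw_mean_treated(Y, A, pi):
    """Inverse-probability-weighted estimate of E[Y^1].

    Y  : outcomes
    A  : treatment indicators (1 = treated, 0 = control)
    pi : estimated propensity scores P(A = 1 | X), assumed strictly between 0 and 1
    """
    Y, A, pi = map(np.asarray, (Y, A, pi))
    # Positivity matters here: a propensity score of zero would mean dividing by zero.
    w = A / pi                        # weight 1/pi for treated subjects, 0 for controls
    return np.sum(w * Y) / np.sum(w)  # weighted sum of Y's / size of treated pseudo-population

# Toy usage with made-up numbers; E[Y^0] would be estimated analogously,
# using (1 - A) and (1 - pi) in place of A and pi.
Y  = np.array([3.0, 2.5, 4.0, 1.0])
A  = np.array([1,   1,   0,   1  ])
pi = np.array([0.9, 0.9, 0.9, 0.2])
print(ipw_mean_treated(Y, A, pi))
```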