In the previous module, we took up the model-based approach to the estimation of causal effects, focusing on linear regression in lessons 2 and 3. We saw that in a completely randomized experiment, the adjusted estimator τ̂_adj, which equals the difference in the treatment and control group means adjusted for the difference in covariate means, is unbiased for the average treatment effect. It also has smaller variance than the unadjusted difference in means. It's important to note that we made no assumption that the regression model is correctly specified, nor that the average treatment effect at value x of the covariates was actually equal to the average treatment effect. In other words, we could have had heterogeneous effects across the covariate values.

Now, in observational studies, we know that the difference between the treatment and control group means is generally neither unbiased nor consistent for the average treatment effect, because of the confounders. But we know that the conditional expectation of Y given treatment assignment and the confounders is, by the unconfoundedness assumption, equal to the conditional expectation of the potential outcome Y(z) given the confounders. This suggests modeling the conditional expectation function of the observed data and using our estimate of it to estimate the average treatment effect. In lesson 3, we did just that, again using linear regression. Unlike the case of the completely randomized experiment, however, we saw that if the regression function was not specified correctly, the estimate of the average treatment effect was biased, unless it just so happened that the means of the confounders were equal in the treatment and control groups. But in observational studies, the difference in confounder means between the treatment and control groups can be substantial.

To remove the bias from estimates of the average treatment effect in observational studies, the previous analysis suggests two remedies. First, one might attempt to model the regression function non-linearly. However, a researcher often doesn't have enough knowledge to specify a functional form; one might then try a non-parametric model. Second, one might attempt to make the covariate distributions the same in the treatment and control groups, as would happen in a completely randomized study. A traditional way to achieve such balance is by matching: for each unit in the treatment group, we find an observation in the control group with the same covariate values. The difference in the outcomes between the two units is then unbiased for the average treatment effect at that value x of the confounders.

Both approaches, estimating the regression function and matching, may become unwieldy when there are many covariates. This is the so-called curse of dimensionality. In the past 30 years, great advances have been made in computing that have allowed significant progress in both directions. We will discuss some of these advances, but to place them in context, to understand the role of certain key ideas, and because the majority of empirical studies are not conducted using these latest advances, it's useful to start with the propensity score, which comes from the seminal article by Rosenbaum and Rubin. The propensity score also figures prominently in weighting approaches, so it's important for us to understand the score and its properties very well. So, let's talk about the propensity score.
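To make the regression-adjustment idea concrete, here is a minimal sketch in Python on simulated data. The data-generating process and all names are illustrative assumptions of mine, not from the lectures; the point is only that the coefficient on the treatment indicator in an OLS fit of the outcome on treatment and covariates is the difference in group means adjusted for the difference in covariate means.

```python
# Minimal sketch of regression adjustment in a completely randomized experiment.
# The data-generating process and variable names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))                       # covariates
z = rng.binomial(1, 0.5, size=n)                  # completely randomized treatment
y = 1.0 + 2.0 * z + X @ np.array([1.5, -0.5]) + rng.normal(size=n)   # outcome; true ATE = 2

# Unadjusted estimator: simple difference in group means.
tau_unadj = y[z == 1].mean() - y[z == 0].mean()

# Adjusted estimator: coefficient on z from OLS of y on an intercept, z, and X,
# i.e., the difference in means adjusted for the difference in covariate means.
design = np.column_stack([np.ones(n), z, X])
beta = np.linalg.lstsq(design, y, rcond=None)[0]
tau_adj = beta[1]

print(round(tau_unadj, 3), round(tau_adj, 3))     # both near 2; the adjusted one is less variable
```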
The propensity score e(X) is simply the probability of treatment given the covariates, P(Z = 1 | X). Rosenbaum and Rubin showed that if treatment assignment is strongly ignorable given the covariates X, by which they meant two things, (1) treatment assignment is unconfounded given the covariates, and (2) 0 < e(X) < 1, then treatment assignment is strongly ignorable given e(X). The second condition just means that at every value of the covariates, the units could be exposed to either the treatment or the control condition. Rosenbaum and Rubin also showed that the distribution of the confounding covariates is the same in the treatment group and the control group for subjects with the same propensity score. This is important.

The results are not hard to prove. First, consider the probability of assignment given the propensity score and the potential outcomes. Since Z is binary, that probability is just an expectation of Z: P(Z = 1 | e(X), Y(0), Y(1)) = E[Z | e(X), Y(0), Y(1)]. Now use iterated expectations: we condition on the finer information (X, Y(0), Y(1)), of which (e(X), Y(0), Y(1)) is a coarsening, so by the tower property this equals E[ E[Z | X, Y(0), Y(1)] | e(X), Y(0), Y(1) ]. By the unconfoundedness assumption, the inner expectation is E[Z | X] = e(X), so the whole thing becomes E[e(X) | e(X), Y(0), Y(1)], which is just e(X). So we used the unconfoundedness assumption in the second-to-last equality.

The result is important because it implies that the expectation of Y given the treatment assignment and the propensity score is equal to the expectation of the potential outcome Y(z) given the propensity score. So, even though it might be difficult to non-parametrically model the conditional expectation as a function of many covariates, due to the curse of dimensionality, it might be easier to do so as a function of the scalar propensity score. The estimated regression functions may then be used to form an estimate of the average treatment effect at each value of the propensity score, and averaging over the distribution of the propensity score, that is, integrating the average treatment effect at the propensity score against that distribution, gives an estimate of the average treatment effect.

That the distribution of the confounding covariates is the same in the treatment group and the control group for subjects with the same propensity score follows by a similar argument. Since Z is binary, probabilities are again expectations, and using iterated expectations as before, P(Z = 1 | e(X)) = E[Z | e(X)] = E[ E[Z | X] | e(X) ] = E[e(X) | e(X)] = e(X). On the other hand, because e(X) is a function of X, P(Z = 1 | X, e(X)) = P(Z = 1 | X) = e(X) as well. The two agree, so treatment assignment and the covariates are independent given the propensity score. That's how the balancing result follows.

Now, because the distribution of the covariates is the same in the treatment and control groups given the propensity score, we call the propensity score a balancing score. The propensity score divides the covariate space into components with equal probability of receiving treatment. Any division that is finer will also balance the covariates, while any division that is coarser will not. Now, a caveat: as is evident from these arguments, but sometimes forgotten, the propensity score is a balancing score whether or not treatment assignment is unconfounded.
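To see the balancing property concretely, here is a small numerical sketch in Python. The simulated data, the logistic-regression propensity model, and the quintile strata are illustrative assumptions on my part; the point is only that, within strata of similar estimated propensity scores, the covariate means in the two groups line up much more closely than they do overall.

```python
# Minimal sketch of the balancing property on simulated data.
# The data-generating process and the logistic propensity model are
# illustrative assumptions, not part of the result itself.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))                                    # covariates
p_true = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
z = rng.binomial(1, p_true)                                    # treatment depends on X

# Overall, covariate means differ across groups because X drives treatment.
print(np.round(X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0), 2))

# Estimate the propensity score e(X) = P(Z = 1 | X) and form quintile strata.
e_hat = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]
strata = np.digitize(e_hat, np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8]))

# Within strata of similar e(X), covariate means should be close across groups.
for s in range(5):
    idx = strata == s
    diff = X[idx & (z == 1)].mean(axis=0) - X[idx & (z == 0)].mean(axis=0)
    print(s, np.round(diff, 2))
```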
But, to continue the caveat: using the propensity score to create groups that are balanced on X does not imply that potentially important confounders the investigator has failed to include in X are balanced across the treatment and control groups. The propensity score is not magic; you have to have all the confounders.

Now, I want to move on to propensity score sub-classification. In a randomized block experiment, the function that maps covariates into strata is a balancing score. Previously, we saw that a natural way to estimate the average treatment effect from a randomized block experiment is to first estimate the stratum-specific average treatment effects and then average them using the stratum proportions as weights. The result above immediately suggests using the propensity score to form strata and proceeding as in the randomized block experiment. Rosenbaum and Rubin refer to this approach as sub-classification.

More generally, the result that the propensity score is a balancing score suggests comparing subjects in the treatment group and subjects in the control group with the same propensity score. Matching observations from the two groups on the propensity score balances the distribution of covariates across groups. Prior to Rosenbaum and Rubin, statisticians would attempt to match on the covariates themselves to achieve such balance. But when the covariate space is large and the sample not so large, one quickly runs out of matches, requiring the investigator either to throw away cases that cannot be matched or to accept matches that may not be that close on the covariate values. In theory, matching on the propensity score achieves the same balance without requiring matching on the covariates. As the propensity score is a many-to-one function of the covariates, in theory an investigator should be able to match subjects more precisely. However, matters may not be as rosy in practice. Further, over the last 25 years, improved methods for matching directly on the covariates have been developed.

In the next few lessons, we shall study methods used in the statistical literature to estimate effects such as the average treatment effect and the effect of treatment on the treated in observational studies. We begin by discussing methods that use the propensity score for regression, sub-classification, and matching. Throughout, we will assume the unconfoundedness assumption holds.
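To preview the sub-classification idea described above, here is a minimal Python sketch on simulated data: it estimates the propensity score with a logistic regression, forms quintile strata, and averages the within-stratum differences in means using the stratum proportions as weights. The data-generating process, the logistic model, and the choice of quintiles are all illustrative assumptions, not prescriptions from the lecture.

```python
# Minimal sketch of propensity-score sub-classification on simulated data.
# The data-generating process, logistic propensity model, and quintile strata
# are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 3))                                    # observed confounders
p_true = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
z = rng.binomial(1, p_true)                                    # confounded treatment
y = 2.0 * z + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)   # true ATE = 2

# Naive difference in means is biased because X affects both z and y.
tau_naive = y[z == 1].mean() - y[z == 0].mean()

# Estimate the propensity score and form quintile strata.
e_hat = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]
strata = np.digitize(e_hat, np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8]))

# Sub-classification estimator: average the within-stratum differences in
# means, weighted by the stratum proportions.
tau_sub = 0.0
for s in range(5):
    idx = strata == s
    tau_s = y[idx & (z == 1)].mean() - y[idx & (z == 0)].mean()
    tau_sub += idx.mean() * tau_s

print(round(tau_naive, 3), round(tau_sub, 3))                  # tau_sub should sit closer to 2
```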