In the previous module, we introduced the propensity score and developed its key properties. We also discussed how sub-classification on the propensity score can be used to estimate average treatment effects and effects of treatment on the treated. We looked at sub-classification as an attempt to make an observational study mimic a randomized block experiment, and then we also looked at weighting, and we saw that it could be thought of as a refined version of sub-classification. In this module, we will focus entirely on matching. In its most basic form, so-called one-to-one matching, where each treated unit is matched to a control unit, matching mimics the paired randomized experiment. In this lesson, we're going to examine the matching of experimental units using the propensity score and other functions of the covariates. We will also touch upon recent work that matches using the covariates directly.

Now, matching is an old idea, and it goes back at least to the first half of the 20th century. It has been used, and continues to be used, for making causal inferences and for other purposes as well. For example, Cochran and Rubin studied the use of matching to reduce bias from confounding in observational studies, relative to just using the simple estimator, the difference between the treatment group mean and the control group mean. Stuart (2010) is a useful review of the literature up to that time.

One caveat: when matching is used for causal inference, it is imperative that it be performed without looking at the outcome values. Otherwise, if many good control matches are available, a researcher might pick the matches that he or she likes best. Then the ones most consistent with the researcher's preferred findings are chosen, and that obviously biases the findings. So it's important that you perform the matching without looking at the outcomes.
Now, in the majority of the literature on matching for causal inference, one finds matches for the treated observations among the controls, and one is estimating the effect of treatment on the treated. In observational studies, typically, the majority of units are untreated, so there are relatively more controls available for matching; we'll come back to this. Our exposition will parallel this treatment: we're going to focus on estimating the effect of treatment on the treated. But it's very important to note that the same approach can be used to find matches for control observations in the treatment group and estimate the average effect of treatment on the untreated. As we know, the average treatment effect is a weighted average of the effect of treatment on the treated and the effect of treatment on the untreated, so matching can be used to estimate the average treatment effect as well.

So, suppose for the moment that it is possible to exactly match each of the n1 treated units to M sub i, greater than or equal to one, untreated observations, by which we mean that the treated and untreated observations that are matched have identical values on all confounders. Now, in the ith matched set, we'll let Yi1 denote the outcome for the treated observation, and we'll let Yi2 bar be the average outcome of the Mi untreated matches. The special case where we only have one match is one we'll talk about quite a bit. But in general, Yi1 minus Yi2 bar is an estimate of the unit effect for the particular confounder values shared by all the matches. Averaging these differences over all the n1 matched sets then gives an estimate of the ATT.

In general, however, exact matching is much too stringent. When there are many confounders, few exact matches will be obtained. The majority of the data would then have to be thrown away.
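To make the exact-matching estimator concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `att_exact_matching` and the toy data are made up, each unit is a pair of a covariate tuple and an outcome, controls may be reused by any treated unit with the same covariate values, and treated units with no exact match are simply dropped.

```python
from collections import defaultdict
from statistics import mean

def att_exact_matching(treated, controls):
    """Estimate the ATT by exact matching on all confounders.

    treated, controls: lists of (covariates, outcome) pairs, where
    covariates is a hashable tuple of confounder values (hypothetical layout).
    Returns (estimate, number of treated units that found a match).
    """
    # Pool control outcomes by their exact covariate profile.
    pool = defaultdict(list)
    for x, y in controls:
        pool[x].append(y)

    diffs = []
    for x, y1 in treated:
        matches = pool.get(x)
        if not matches:
            continue  # no exact match: this treated unit is dropped
        # Y_i1 minus the average outcome of its M_i exact matches.
        diffs.append(y1 - mean(matches))
    if not diffs:
        raise ValueError("no treated unit has an exact match")
    return mean(diffs), len(diffs)

# Toy data: two binary confounders, one outcome.
treated = [((1, 0), 10.0), ((0, 1), 7.0), ((1, 1), 9.0)]
controls = [((1, 0), 8.0), ((1, 0), 6.0), ((0, 1), 6.0), ((0, 0), 5.0)]
estimate, n_matched = att_exact_matching(treated, controls)
print(estimate, n_matched)
```

In this toy example the treated unit with covariates (1, 1) has no exact match and is lost, which is the problem that gets worse as the number of confounders grows.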
Often in practice, researchers use approximate matching methods when the observations for which matches are to be found do not have counterparts in the control group with the same values on all confounders. A natural way to do this is to define a distance and to match each treated unit to the closest available control unit. For now, let's assume that all confounders are continuous. In this case, the Euclidean distance is an obvious choice, but generally the Mahalanobis distance, which takes into consideration the fact that confounders do not in general share the same units of measurement, should be preferred. The Mahalanobis distance between two units, i and i prime, is defined in terms of a matrix S, which is basically the sample covariance matrix for both the treated and the untreated: it is the average of S1 and S naught star, the within-group sample covariance matrices. S naught star is obtained using the entire pool of n naught star untreated units. We're not going to use all the untreated units once the matching is done; we'll only use the n naught matched controls.

When matching is done with replacement, that is, when the controls can be reused and matched to multiple treated observations, the order in which the treated observations are matched doesn't matter at all. But when matching is done without replacement, that is, once a control is used as a match, it is not used again, the ordering can matter quite a bit.

Rosenbaum and Rubin, remember the famous article, showed that the difference between the outcomes of a treated observation and an untreated observation with the same propensity score is unbiased for the average treatment effect at that score. So, if you match each treated observation to an untreated observation with the same propensity score, and you average the differences over all the pairs, you're going to get an unbiased estimator of the effect of treatment on the treated.
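Going back to matching on the covariates with a distance, here is a rough sketch of one-to-one nearest-neighbor matching with the Mahalanobis distance. The formula itself is not spelled out in this transcript, so the sketch assumes the usual form, the square root of (x - y)' S^{-1} (x - y), with S the average of the two within-group sample covariance matrices. All function names and the toy data are hypothetical, and the matching is a simple greedy pass in the order the treated units are given, not an optimal matching.

```python
from math import sqrt

def cov_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of equal-length rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(k)] for a in range(k)]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def mahalanobis(x, y, s):
    """Assumed form: sqrt((x - y)' S^{-1} (x - y)), computed without inverting S."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return sqrt(sum(di * zi for di, zi in zip(d, solve(s, d))))

def nearest_matches(x_treated, x_controls, replace=True):
    """Greedy 1:1 matching; returns one control index per treated unit."""
    if not replace and len(x_treated) > len(x_controls):
        raise ValueError("not enough controls to match without replacement")
    s1, s0 = cov_matrix(x_treated), cov_matrix(x_controls)  # s0 uses all controls
    k = len(s1)
    s = [[(s1[a][b] + s0[a][b]) / 2 for b in range(k)] for a in range(k)]
    available = set(range(len(x_controls)))
    matches = []
    for xt in x_treated:  # the order only matters when replace=False
        j = min(sorted(available), key=lambda c: mahalanobis(xt, x_controls[c], s))
        matches.append(j)
        if not replace:
            available.remove(j)  # a used control cannot be matched again
    return matches

# Toy data: two continuous confounders on very different scales.
x_treated = [(1.0, 10.0), (1.0, 10.0), (2.0, 15.0), (3.0, 12.0)]
x_controls = [(1.1, 11.0), (0.5, 9.0), (2.2, 14.0), (3.1, 12.5), (4.0, 20.0), (1.5, 8.0)]
with_repl = nearest_matches(x_treated, x_controls, replace=True)
without_repl = nearest_matches(x_treated, x_controls, replace=False)
print(with_repl, without_repl)
```

The first two treated units have identical covariates, so with replacement they share the same control, while without replacement the second one has to settle for its next-closest control. That is the sense in which the ordering matters.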
Now, the Rosenbaum and Rubin result on matching units with the same propensity score is useful because it may be easier to find good matches on the propensity score, which reduces the confounders to one dimension, than it is to find matches on the multidimensional set of confounders. But if you think about it, we don't know the propensity score, so we have to estimate it. Even though the propensity score is a one-dimensional summary of the covariates, exact matching on the estimated score is also too stringent in practice. So, we need to use some distance metric on the propensity score. The first one is just the absolute value of the difference between the estimated propensity scores. The second one is just the difference between the logits of the estimated scores.

Now, even allowing matches between units where the distance between the propensity scores is small, it is often still difficult to find acceptable matches. This is especially the case for treated units with values of the estimated propensity score near one: they are likely to be treated, and there may be no controls with such a high probability of receiving treatment. To deal with this, one might try to match such observations first, especially if you're going to match without replacement. If you can't find matches that are adequate, e.g., the difference between the logits is greater than some threshold d that you've decided on, you can then discard the treated unit. This is often done when there's insufficient overlap; we talked a bit about that in a previous module.

There are a number of other practical issues that arise when you match either directly on the covariates or on the propensity score. Consider a unit in the treatment group that we want to match to an untreated unit. First question: do we even have enough control observations that we could match each treated observation without reusing the observations in the control group?
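The idea of matching the hardest treated units first, on the logit scale, and discarding those without an adequate match could be sketched as follows. This assumes the propensity scores have already been estimated and lie strictly between zero and one. The function name `caliper_match`, the threshold value, and the scores themselves are made up for illustration.

```python
from math import log

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return log(p / (1 - p))

def caliper_match(ps_treated, ps_controls, caliper):
    """Greedy 1:1 matching without replacement on the logit of the estimated score.

    Treated units are processed from the highest estimated score to the lowest,
    since those near one are the hardest to match. A treated unit whose nearest
    available control is further away than `caliper` (the threshold d, on the
    logit scale) is discarded.
    Returns (pairs, discarded), where pairs holds (treated_idx, control_idx).
    """
    available = {j: logit(p) for j, p in enumerate(ps_controls)}
    order = sorted(range(len(ps_treated)), key=lambda i: ps_treated[i], reverse=True)
    pairs, discarded = [], []
    for i in order:
        li = logit(ps_treated[i])
        if available:
            j = min(available, key=lambda c: abs(available[c] - li))
            if abs(available[j] - li) <= caliper:
                pairs.append((i, j))
                del available[j]  # without replacement: control j is used up
                continue
        discarded.append(i)  # no adequate match within the caliper
    return pairs, discarded

# Hypothetical estimated propensity scores.
ps_treated = [0.60, 0.95, 0.30]
ps_controls = [0.25, 0.55, 0.62, 0.10, 0.40]
pairs, discarded = caliper_match(ps_treated, ps_controls, caliper=0.5)
print(pairs, discarded)
```

Here the treated unit with an estimated score of 0.95 has no control anywhere near it on the logit scale, so it is discarded, which is the insufficient overlap situation described above.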
So, as I mentioned earlier, typically in an observational study, the number of controls is quite large relative to the number of treated observations. So at least the problem of having enough controls is solved: every treated observation can be matched with a control, not necessarily a great match, but you could find one. If that weren't the case, one might prefer some kind of alternative to matching, such as sub-classification or weighting. So, hereafter, we're going to assume that we're in this happy case, where the number of controls is large relative to the number of treated units.

Another issue that we already raised is that the propensity score is typically unknown and must be estimated. But things are a little different here than in the case of weighting. We saw in weighting that model misspecification could really screw up your estimates. Here, what people use the propensity score for is to obtain a matched sample in which the treated and control units have similarly distributed covariates, as would theoretically happen in a completely randomized experiment. That is, they're capitalizing on the fact that the propensity score is a balancing score. When the propensity score model used does not lead to the creation of such a sample, you can always re-estimate the model, typically by including more higher-order terms and interactions among the confounders, and then you can recheck the balance. When the balance is deemed adequate, the resulting matched sample can then be used for estimating the effect of treatment on the treated.

So you want to check the balance. There's a lot of literature on that, but one thing you might do to check balance is look at the normalized difference for each covariate in the matched sample that you're tentatively considering using. The covariates are indexed by k from one through cap K, and for each covariate, you look at the value of equation one. When this is small for all of the covariates, the match would be deemed adequate.
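The balance check could be sketched as follows. Equation one is not reproduced in this transcript, so the sketch assumes a common form of the normalized difference for covariate k: the difference between the treated and control sample means, divided by the square root of the average of the two within-group sample variances. The function names and the 0.1 cutoff are placeholders for illustration, not values from the lecture.

```python
from math import sqrt
from statistics import mean, variance

def normalized_differences(x_treated, x_controls):
    """Normalized difference for each covariate k = 1, ..., K in a matched sample.

    Assumed form: (mean_t - mean_c) / sqrt((var_t + var_c) / 2), using sample
    variances. Rows are units, columns are covariates. A covariate that is
    constant in both groups would give a division by zero.
    """
    k = len(x_treated[0])
    out = []
    for j in range(k):
        t = [row[j] for row in x_treated]
        c = [row[j] for row in x_controls]
        out.append((mean(t) - mean(c)) / sqrt((variance(t) + variance(c)) / 2))
    return out

def balanced(x_treated, x_controls, threshold=0.1):
    """True when every normalized difference is small (threshold is arbitrary)."""
    return all(abs(d) <= threshold
               for d in normalized_differences(x_treated, x_controls))

# Toy matched sample: the first covariate is balanced, the second is not.
x_treated = [(1, 10), (2, 12), (3, 14)]
x_controls = [(1, 8), (2, 10), (3, 12)]
nd = normalized_differences(x_treated, x_controls)
ok = balanced(x_treated, x_controls)
print(nd, ok)
```

In this toy sample the second covariate differs by a full standard deviation between the groups, so the tentative match would not be deemed adequate.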
If the balance is not adequate, you could re-estimate the propensity score, use the new estimate to create a new matched sample, and then reexamine equation one. Now, as an alternative to equation one, you could always use a summary measure, such as the Mahalanobis distance between the sample averages in the matched sample, or the normalized difference for the estimated propensity scores, or the logits of these, et cetera.

Now, if there are categorical confounders, it's often desirable to require the treated and control units to match exactly on them. One reason for doing this is that interactions between these characteristics and continuous confounders are often very important in the propensity score model. For example, consider confounders such as sex and ethnicity. Anybody who has ever done any social science would realize that those things often interact with continuous covariates in the model for the outcome. With ordinal confounders, you can assign scores and treat them as continuous, or you can treat them like categorical confounders, either before or after aggregating some of the categories. For example, if you have an ordinal scale with categories running from strongly disagree to strongly agree, you might collapse these to just disagree and agree.

As the goal of the matching methods considered above is to achieve a sample in which the covariate distributions in the treatment and control groups are as close as possible, some recent work has focused on developing matching methods intended to achieve balance directly, bypassing the intermediate steps of estimating the propensity score, checking for balance, and iterating the procedure again until the desired level of balance is achieved. Several of the recent methods include genetic matching, optimal subset matching, and cardinality matching.