Hi, this video is on matching directly on confounders. We're going to begin by understanding why a distance measure is needed and look at some options for distance measures, and then we'll move on to how those distances can be used for matching. Here we'll begin to think about how to match. In order to find close matches, we'll need some metric of closeness, so we'll look at two of them: Mahalanobis distance, which I'll just call M distance for short, and robust M distance. These are two measures of how similar two sets of covariates are to each other. As a reminder, we won't be able to match exactly, because we'll typically have a lot of variables that we need to control for, and it's very likely that no treated subject will have exactly the same values of all of those variables as a control subject. So these two distance measures will quantify the degree to which two sets of covariates are similar to each other.

So now we look at M distance, and we'll begin with some notation. First, X subscript j is just a vector of covariates for subject j. Subject j is a person in your data set, so you can think of it as one row in your data. X is some collection of covariates that you decided ahead of time you needed to control for: the set of variables that are sufficient to control for confounding. So just picture X as the values of those variables; it could be age, sex, blood pressure, and so on. The distance between subject i and subject j on the covariates we'll denote with D. D(Xi, Xj) is a function: as input, you give it the values of the covariates for subject i and subject j, and as output it gives you a single value, which is a distance.

So we can look at this formula and try to break it down in some detail. First, the inner part here is just the difference between the covariates, Xi minus Xj. Remember, each is a vector of covariates, so for each covariate, say blood pressure, we would take the difference in blood pressure between subject i and subject j; if the next variable is age, we would take the difference in age between subject i and subject j, and so on, running through the whole list. That gives us a vector of differences. The T here is just a transpose of that vector, and then we have the difference vector again. In matrix and vector terms, a vector transpose times another vector like this essentially sums the products, so you can roughly think of this as squaring the differences and then adding them up. But there's one additional complication, which is this S inverse. S is just the covariance matrix of X. Remember, X is a random vector, so each of the Xs has a variance, but they also might covary together. So S is your standard covariance matrix, and we're inverting it; that's what S inverse is. You can roughly think of that as a scaling step, where we're scaling by the unit of measure. And the reason we want to do that is because we have different kinds of variables here.
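Written out, the distance just described is the usual Mahalanobis distance, with S the covariance matrix of the covariates:

$$D(X_i, X_j) = \sqrt{(X_i - X_j)^T \, S^{-1} \, (X_i - X_j)}$$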
So one unit of difference in age will mean something a lot different than a one unit difference in diabetes status, where diabetes is a 0/1 variable. For example, if you compare somebody age 67 with somebody age 68, their age differs by one year. If you compare somebody who has diabetes with somebody who doesn't, that variable also differs by one. But we don't necessarily think those are the same thing: differing by one year of age is probably not very meaningful, whereas differing on diabetes status probably is. This S inverse is going to scale things. You could roughly think of it this way: if you had just one variable, say age, we would take the difference in age, square it, divide by the variance, and then take the square root. So what we're really doing is scaling each variable, essentially by its variance, which puts all of these different variables on the same kind of scale. Hopefully, once we rescale, a difference of one on one variable means something similar to a difference of one on another variable. That's the main idea. So that's what the formula is: you could roughly think of it as a sum of squared differences that are scaled, and then we take the square root. But the details aren't quite as important as the big picture: if these collections of variables differ by a lot, the distance will be big; if they don't differ by very much, the distance will be small. We'll get this value of D for any pair of subjects. In particular, we're going to get it for a given treated subject compared with all the different controls. So we'll have a list of distances, and we can use that to help us find good matches.

So we can walk through an example where we just have three covariates: age; COPD, which is chronic obstructive pulmonary disease, coded 1 = yes, 0 = no; and female, coded 1 = yes, 0 = no. Imagine we have this treated subject on the left and we want to find a match for them among the controls, and imagine there are six controls available, just for simplicity. This person is age 78.17, they don't have COPD, and they're female. As a candidate match, we could look at this first person. This person is fairly close in age, 70.25, but they have COPD, whereas the treated person doesn't, and they're male, whereas the treated person is female. So this is not really a great match overall; they differ on COPD and sex. And you'll see that if you used the formula on the previous page, we would end up with a distance of a little over 4. Here's a different person. This person actually matches on COPD and sex, but they differ on age by a lot: this is an 18 year old compared to a 78 year old. So they're not a good match, and you see that reflected in the distance, which is 3.6, so that wouldn't be a good match either. Whereas this person agrees on COPD and sex, and on age they don't differ by very much: it's a 75 year old compared to a 78 year old. So it's a close match, and out of this list, you would probably say this is the best match. The distance between their sets of covariates is the smallest, and we would probably be pretty happy with this particular one.
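As a rough sketch of how you might carry out this calculation yourself, here's one way to do it in Python with NumPy. The numbers are hypothetical (not the exact data set on the slide), and using the pooled covariance of treated and controls for S is just one common choice:

```python
import numpy as np

# Hypothetical covariate data for illustration (columns: age, COPD, female);
# these are not the exact values from the slide.
controls = np.array([
    [70.25, 1, 0],
    [18.00, 0, 1],
    [75.00, 0, 1],
    [60.00, 1, 1],
])
treated = np.array([78.17, 0, 1])

# S is the covariance matrix of the covariates; here it is estimated from the
# pooled data (treated plus controls), which is one common choice.
pooled = np.vstack([controls, treated])
S_inv = np.linalg.inv(np.cov(pooled, rowvar=False))

# M distance from the treated subject to each candidate control:
# sqrt( (x_i - x_j)^T S^{-1} (x_i - x_j) )
diffs = controls - treated
distances = np.sqrt(np.einsum("ij,jk,ik->i", diffs, S_inv, diffs))

print(distances)             # one distance per candidate control
print(np.argmin(distances))  # index of the closest (best) match
```

In practice, matching software will compute these distances for you; the point is just that the calculation is nothing more than the formula above.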
So that distance, using the formula on the previous page, can be calculated pretty easily with software such as R. Now we'll look at an alternative, which is a robust version of the distance. The motivation has to do with outliers. Outliers can create a large distance between subjects, even if their covariates are otherwise similar. So consider a treated subject and a control subject where all of their covariates are very close, but the control subject has one really extreme value of one of the covariates, an unusually large value, for example. Then the distance would be very large, even though otherwise they seem like a good match. So an outlier can create a greater distance than we would want it to. An alternative is to use ranks. This is like classical robust statistics, where instead of using the original values you use ranks. You can do the same kind of thing here: just for the purpose of matching, we can replace all of our variables with their ranks. So if we looked at age, for example, in whatever software you use, you would ask it to rank the age variable. The person who was the youngest in the data set would get a rank of one; if you had 100 people in your data set, the person who was the oldest would get a rank of 100, and so on. So you replace the original values with ranks, and this should help with the issue of outliers. Now the difference between the largest and second largest value is just a difference of one in the ranks, whereas if you kept the original values, the difference between the largest and second largest could be very extreme. So that's the basic idea: if you want to use this robust distance measure, you replace the original values with ranks and then apply the same kind of distance measure to them. And then there's a subtle thing having to do with the diagonal of the covariance matrix. Remember we have that S inverse. Well, if you're going to use ranks, you would want the diagonal of the covariance matrix to be a constant. That's because the ranks should all be on the same scale now, so you really don't want to weight one variable more than the others in that sense. Again, there's statistical software that can do this for you automatically, but that's the main idea. And then you can use this distance measure on the ranks.

So those were just two distance measures, or distance metrics, that you could use. And hopefully it's very clear why we need a distance: we need it to figure out what "close" means. When we're trying to find a match, we want to know how similar the covariates are. So you could use either of the two measures that I just talked about, but there are other things you can do as well. One of the nice things about matching is you can do some clever things to get exact matches on some variables and be more willing to tolerate inexact matches on others. So if you want an exact match on a few important covariates, what you could do is essentially set the distance to infinity if two subjects don't match on those variables. You could apply the M distance formula, but then take an additional step which says: for these important variables, if the two subjects are not an exact match, replace the original distance with infinity, essentially making it impossible for them to be matched.
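To make those two ideas concrete, here's a minimal sketch in the same style as before, with hypothetical data: replace each covariate with its ranks, make the diagonal of the rank covariance matrix a constant (here, the average of the original diagonal, which is one simple choice; software packages may do this a bit differently), and then set the distance to infinity for any control that disagrees on a must-match variable:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical covariates (columns: age, COPD, female); rows 0..3 are controls,
# the last row is the treated subject. Values are made up for illustration.
X = np.array([
    [70.25, 1, 0],
    [18.00, 0, 1],
    [75.00, 0, 1],
    [60.00, 1, 1],
    [78.17, 0, 1],
])
controls, treated = X[:-1], X[-1]

# Robust M distance: replace each column with its ranks (ties get average ranks).
ranks = np.apply_along_axis(rankdata, 0, X)

# Covariance of the ranks, with the diagonal forced to a constant so that no
# single variable is weighted more heavily than the others.
S = np.cov(ranks, rowvar=False)
np.fill_diagonal(S, S.diagonal().mean())
S_inv = np.linalg.inv(S)

diffs = ranks[:-1] - ranks[-1]
dist = np.sqrt(np.einsum("ij,jk,ik->i", diffs, S_inv, diffs))

# Exact matching on an important covariate (here, female, column 2):
# set the distance to infinity wherever the control's value differs.
dist[controls[:, 2] != treated[2]] = np.inf
print(dist)
```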
So sometimes there are going to be a couple of covariates that you think are the most critical, the biggest confounders. You might want to do exact matching on those, and then tolerate inexact, but still good, matching on the others. So that's one thing you could do. Another common approach is to use a propensity score and do a distance match on the propensity score. That's a topic for another video, but I just wanted to raise awareness of it here. So where are we going from here? We've figured out that we need to calculate a distance between the sets of covariates of different subjects. But once you have that, how do you actually select matches? We'll look at a couple of algorithms; these are not the only ones, but they're probably the most popular. We'll look at greedy matching, which is also known as nearest neighbour matching. This has the advantage of being very computationally fast, so you can typically apply it to large data sets without a problem. Then there's optimal matching, which should be better, but it's computationally demanding, so you might have more trouble with it on large data sets.
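As a preview of greedy matching, here's a minimal sketch assuming you already have a distance matrix with one row per treated subject and one column per control (the distances are hypothetical, and matching is done without replacement):

```python
import numpy as np

def greedy_match(dist):
    """Greedy nearest-neighbour matching without replacement.

    dist: 2-D array, dist[i, j] = distance between treated subject i and
    control subject j. Returns a dict mapping treated index -> control index.
    """
    dist = dist.astype(float).copy()
    matches = {}
    for i in range(dist.shape[0]):       # take treated subjects in order
        j = int(np.argmin(dist[i]))      # closest remaining control
        if np.isinf(dist[i, j]):         # no acceptable control left
            continue
        matches[i] = j
        dist[:, j] = np.inf              # that control cannot be reused
    return matches

# Example with made-up distances: 2 treated subjects, 3 controls.
D = np.array([[4.2, 3.6, 0.3],
              [0.5, 2.0, 0.4]])
print(greedy_match(D))  # {0: 2, 1: 0}
```

Because each match is made one at a time and never revisited, the result depends on the order in which treated subjects are processed; optimal matching avoids that, which is part of why it can find better matches overall but costs more computationally.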