Hi, in this video we'll talk about greedy, or nearest neighbor, matching. Our goals are to understand what greedy matching is and how the algorithm works. We'll discuss advantages and disadvantages of greedy matching. We'll also look at many-to-one matching versus pair matching and discuss trade-offs between the two approaches. And finally, we'll introduce the idea of calipers. So just to begin as a setup here, we're going to imagine that you've already identified a set of variables that are sufficient to control for confounding. Again, these are pre-treatment covariates, and we believe these variables will satisfy the ignorability assumption. And we've calculated a distance, which we'll call Dij, between each treated subject and every control subject. So here you could think of subject i as being a treated subject and subject j as being one of the control subjects, and Dij is the distance between their covariates. So for a given treated subject, imagine we have a distance score between that subject and every single control subject. Imagine we've already accomplished that: we have our distance measure and we've already selected our covariates. Now, for this part of the talk, we'll imagine that we have more controls than treated subjects, and hopefully many more controls than treated subjects. We want a big pool of subjects to choose from for matching. This is often the case, but it doesn't have to be; you could potentially still do matching otherwise, but ideally you have a lot more controls than treated subjects, which gives you a lot of options in terms of selecting matches. So for now, let's assume that's the case. We're also going to focus on pair matching for now, which is also known as one-to-one matching. So now I'll introduce greedy matching. 
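As a concrete illustration of this setup, here is a minimal sketch (my own code, not from the lecture) of computing a distance Dij between every treated subject and every control. The covariates, scale values, and function names are illustrative assumptions; in practice you would use a standard distance such as Mahalanobis or a propensity score difference.

```python
import math

def distance_matrix(treated, controls, scales):
    """D[i][j] is the Euclidean distance between treated subject i and
    control subject j, with each covariate divided by a rough scale so
    that age doesn't dominate the binary variables (an assumption here)."""
    def dist(a, b):
        return math.sqrt(sum(((x - y) / s) ** 2
                             for x, y, s in zip(a, b, scales)))
    return [[dist(t, c) for c in controls] for t in treated]

# covariates: (age, COPD, female) -- made-up example subjects
treated = [(78, 0, 1), (67.6, 0, 0)]
controls = [(75, 0, 1), (55, 0, 0), (60, 1, 1)]
D = distance_matrix(treated, controls, scales=(10.0, 1.0, 1.0))
```

Row i of D then holds the distances from treated subject i to every available control, which is exactly the input the greedy algorithm below consumes.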
So greedy matching is really relatively straightforward, and we'll just walk through the steps here: if you were going to program this yourself, what would you actually do? The first thing would be to randomly order your lists of treated subjects and control subjects. You wouldn't strictly have to do this, but it's probably good practice. It could be that you've already sorted your data in some way and you'd rather it be in random order, so you might as well just randomly order your data set. In particular, we want the treated subjects to be in some kind of random order. So you can do that first. Then we'll go to the top of that list of treated subjects, and we'll match the first treated subject to the control that has the smallest distance. We already have our distances, so we'll look through the whole list of distances and find the smallest one, okay? And this is the greedy step: you're just going for whatever's best without thinking about the bigger picture. You're just grabbing whatever seems to be the best match, and that's basically why it's considered greedy: you always take whatever the best match is. Once you have a match, you remove that control from the list of available matches. Then you move on to the next treated subject and find the control that has the smallest distance to that treated subject. And you just keep repeating these steps until you've matched all treated subjects: go one at a time, find the control with the smallest distance to that treated subject, remove that control from the set of possible matches for the next person, and continue on like that. 
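The steps above can be sketched in a few lines of code. This is my own illustrative implementation, assuming the distance matrix D has already been computed, with rows indexing treated subjects and columns indexing controls:

```python
import random

def greedy_match(D, seed=0):
    """Greedy (nearest-neighbor) pair matching.
    D[i][j] is the distance between treated subject i and control j.
    Returns a list of (treated_index, control_index) matched pairs."""
    n_treated, n_controls = len(D), len(D[0])
    order = list(range(n_treated))
    random.Random(seed).shuffle(order)   # step 1: randomly order the treated list
    available = set(range(n_controls))   # controls not yet used
    pairs = []
    for i in order:                      # step 2: go down the treated list
        # greedy step: take the closest still-available control
        j = min(available, key=lambda j: D[i][j])
        pairs.append((i, j))
        available.remove(j)              # step 3: remove that control from the pool
    return pairs

pairs = greedy_match([[0.1, 5.0, 2.0],
                      [0.2, 3.0, 4.0]])
```

Note how the result depends on the shuffled order: whichever treated subject is processed first gets first pick of the controls, which is exactly the order-dependence discussed later in this lecture.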
So I think this is a really straightforward and intuitive kind of algorithm, and therefore it should be pretty easy to implement in practice. As an example, let's imagine we have just two treated subjects and six controls, and we're going to use this greedy algorithm to try to find matches for the two treated subjects. Again, we'll just have three variables: age, COPD, and female. So let's find the first match. The first person is 78 years old, doesn't have COPD, and is female. And you'll see that we've calculated the distance between this treated subject's covariates and each available control's covariates; you see that distance on the far right. And we found a match: the match is the one with the smallest distance, so we're just going to match this treated subject up with this control. You'll see that they agree on COPD, they're both female, and they're pretty similar on age, 78 versus 75. So now that I've found a match, I'll start creating my matched data set. You'll notice on the far left what I am calling Match_id. You can name it whatever you want, but you need some kind of identifying variable to identify a matched pair. So I'm going to set Match_id to 1 for this first pair. And then I also have a variable that I call treated, which can equal yes or no, 1 or 0: 1 means you're treated, 0 means you're a control. And if you recall from the previous slide, the treated person was 78 years old, didn't have COPD, and was female, and you'll see those values here. And they were matched to a person who was untreated, so this is a control; they were age 75, as we saw on the previous slide, they didn't have COPD, and they're female. And you'll notice they have the same Match_id. The Match_id is what tells us that this is a matched pair; we want to link these two together. 
So we found a matched pair, and we're now adding them to our data set, using an identifying variable to let us know that yes, these two are linked together. Next, we're going to find a match for the second treated subject. This person is about 68 years old, doesn't have COPD, and is male; female equals 0 means they are male. Again, we've already calculated the distance between that subject and all of the available controls. But you'll notice we're crossing out that second control there, because that person has already been matched, so they're not a candidate to be matched with the second treated subject. And we find a match here: the smallest distance is 0.78. If you look at it, you'll notice that the treated subject and the control subject we matched are both male and neither has COPD. Age-wise they differ by a fair amount, one's 55 years old and the other's about 68 years old, but it's the best we could do; they're as close as we could get in age, so this was our best match. And of course, because we only have six available controls, we're probably not going to get great matches, but this is just a simple example to illustrate the main ideas. So again, we were greedy: we just went for the best match we could find. And now we can add that to our data set. The key here is that I have a new Match_id, which is 2; this is the second match. I want to make sure to link these two subjects to each other, so they're paired up by this Match_id. One is treated and one is not, and we saw from the previous slide that one's about 68 years old and one's about 55 years old, and so on. So you would just continue in this way until you have a big data set full of matched pairs. In this particular example I only have two treated subjects, but in practice you might have thousands, and you would create a big data set. 
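Assembling the matched data set from the pairs might look like the following sketch. The column names (match_id, treated, and so on) mirror the slides, but the exact layout is my own assumption:

```python
# covariates: (age, COPD, female) -- the two example treated subjects
# and their matched controls from the slides
treated = [(78, 0, 1), (67.6, 0, 0)]
controls = [(75, 0, 1), (55, 0, 0)]
pairs = [(0, 0), (1, 1)]   # (treated_index, matched_control_index)

rows = []
for match_id, (i, j) in enumerate(pairs, start=1):
    # one row for the treated subject and one for its matched control,
    # linked together by the shared match_id
    rows.append({"match_id": match_id, "treated": 1,
                 "age": treated[i][0], "copd": treated[i][1],
                 "female": treated[i][2]})
    rows.append({"match_id": match_id, "treated": 0,
                 "age": controls[j][0], "copd": controls[j][1],
                 "female": controls[j][2]})
```

Each matched pair contributes two rows with the same match_id, one with treated = 1 and one with treated = 0, which is what lets downstream analyses treat them as linked.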
So greedy matching in general is intuitive and easy to explain. I just explained exactly how you would carry out the algorithm if you wanted to code it yourself, and it didn't take very long to explain. It's computationally very fast because it just involves a very simple series of steps. Imagine we've already calculated these distances; then we just go one by one through the treated group, find the minimum distance, and line them up. Any software should be able to do that very quickly, so even if you have a very large data set, this isn't very computationally demanding. There are packages that can do it; for example, there's one called MatchIt. It's not invariant to the initial order of the list; in other words, the order of the list actually does matter. I said let's initially randomize the order of the list, but whatever order that list happens to be in does matter: you're not going to get the exact same matches if the list was ordered differently. And I would consider that a drawback of this approach; it would be nice if it didn't depend at all on how you randomize the list. More importantly, it's not optimal, and what I really mean by that is globally optimal. Always taking the smallest distance for each match does not necessarily minimize the total distance, where by total distance I mean, say, the sum of the distances over all matched pairs. What we might consider optimal would be something that minimized that total distance, and greedy matching is not guaranteed to do that. It could even lead to some bad matches as a result. Now, instead of pair matching, let's say we want to do many-to-one matching. Well, you could actually use this greedy algorithm to do many-to-one matching. As an example, suppose we wanted k:1 matching, where k is some integer greater than 1. For example, we might want three-to-one matching. 
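To see why greedy is not globally optimal, consider a tiny made-up example (the distances are my own invented numbers): the first treated subject grabs the control that the second treated subject needed much more.

```python
# D[i][j]: distance between treated subject i and control j (made up)
D = [[1.0, 2.0],
     [1.1, 9.0]]

# Greedy, processing treated 0 first: it takes control 0 (distance 1.0),
# forcing treated 1 onto its only remaining option, control 1 (distance 9.0).
greedy_total = 1.0 + 9.0

# Globally optimal assignment: treated 0 -> control 1, treated 1 -> control 0.
optimal_total = 2.0 + 1.1

# Greedy's per-step best choices do not minimize the total distance.
assert optimal_total < greedy_total
```

Optimal matching algorithms (for example, network-flow based assignment methods) minimize this total distance directly, at the cost of more computation.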
What you would do then is proceed with the algorithm just as before: you would find one match for everybody, going through the whole list. But then, once you get to the end of the list, you would cycle back up to the first person again, find a match for them, and that would be their second match. You would go through and find the second match for every single person, and keep doing that until you have k matches for each person. So you would still use the same kind of greedy algorithm; you would just cycle through the list multiple times. So there are trade-offs between pair matching and many-to-one matching. With pair matching you'll find closer matches in general, because you're finding the single best match you can for each treated subject. If you want to find multiple matched controls for a treated subject, you're guaranteed that some of the matches aren't going to be as good, because you're essentially trying to match more people, so you're going to have to accept some matches that aren't as good. Pair matching would also, of course, have faster computing time, because you're matching a single treated subject to a single control subject as opposed to a single treated subject to k controls, so it should be faster. An advantage of many-to-one matching, probably the only advantage, is that you gain a lot of sample size. With pair matching, the second you've found a match for everybody, you discard all the rest of the controls and you won't use them. In many-to-one matching, you're going to end up using more of the controls, so you have a larger sample size, which should mean more efficient estimates of, say, causal effects. However, there are a couple of caveats. One is that even though you gain sample size, the gain isn't perhaps as much as you might think. 
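The cycling idea above can be sketched by reusing the same greedy step inside an outer loop over the k passes. Again this is my own illustrative code, assuming a precomputed distance matrix D and at least k times as many controls as treated subjects:

```python
import random

def greedy_match_k(D, k, seed=0):
    """k:1 greedy matching: cycle through the treated list k times,
    finding one additional control for each treated subject per pass.
    Returns {treated_index: [control_index, ...]} with k controls each."""
    n_treated, n_controls = len(D), len(D[0])
    order = list(range(n_treated))
    random.Random(seed).shuffle(order)
    available = set(range(n_controls))
    matches = {i: [] for i in range(n_treated)}
    for _ in range(k):                   # one full pass per match wanted
        for i in order:
            # same greedy step as pair matching: closest available control
            j = min(available, key=lambda j: D[i][j])
            matches[i].append(j)
            available.remove(j)
    return matches

matches = greedy_match_k([[1.0, 2.0, 3.0, 4.0],
                          [4.0, 3.0, 2.0, 1.0]], k=2)
```

Notice that the second and later passes can only pick from controls left over after the first pass, which is exactly why the later matches are guaranteed to be no better, and typically worse, than the first.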
If you were to add a new pair, that is, a new treated subject, to your data set, that would be more of an efficiency gain than if you just added another control to a matched set. So essentially, the efficiency gained by adding additional controls to a matched set is not as much as it would be if you were actually adding new treated subjects that you could find matches for. There is some power gain, some efficiency gain, but it's not very large. Essentially, this comes down to a bias-variance trade-off. With pair matching there should be less bias, because the matching is closer, but it also should be less efficient, because you're discarding data. Many-to-one matching should lead to a little more bias but smaller variance. And so you have to think about that kind of trade-off: are you willing to tolerate some bias for more efficiency or not? The next thing I wanted to mention is the idea of a caliper. You can imagine that there might be some treated subjects for which there really aren't any reasonably good matches among the controls; there might only be bad matches, where the distance is kind of big. We might want to not allow that, and you can use a caliper for that, where a caliper is the maximum acceptable distance. The main idea would be that we go through this greedy matching algorithm, one treated subject at a time, finding the best match, but now we only accept the best match if it's within the caliper. If the best match has a distance that's greater than the caliper, then we just won't match anybody to that treated subject; we'll get rid of that treated subject. We'll say there was nobody in the control group who was a good match for them. So you would pick a value for the caliper, the maximum distance you're willing to tolerate, and then go through the greedy algorithm in the same way, but excluding any matches that were not within your tolerance limit. 
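Adding a caliper changes only the acceptance step of the greedy loop: if the best available control is farther than the caliper, the treated subject is dropped rather than matched. Here is an illustrative sketch under the same assumptions as before (precomputed D, more controls than treated subjects):

```python
import random

def greedy_match_caliper(D, caliper, seed=0):
    """Greedy pair matching with a caliper: skip any treated subject
    whose best available match exceeds the maximum acceptable distance.
    Returns (matched_pairs, unmatched_treated_indices)."""
    order = list(range(len(D)))
    random.Random(seed).shuffle(order)
    available = set(range(len(D[0])))
    pairs, unmatched = [], []
    for i in order:
        j = min(available, key=lambda j: D[i][j])
        if D[i][j] <= caliper:        # accept only if within the caliper
            pairs.append((i, j))
            available.remove(j)
        else:                         # no acceptable match: drop this subject
            unmatched.append(i)
    return pairs, unmatched

# treated 1 has no control within the caliper of 1.0, so it goes unmatched
pairs, unmatched = greedy_match_caliper([[0.1, 5.0],
                                         [7.0, 8.0]], caliper=1.0)
```

When a propensity score distance is used, a common rule of thumb is to set the caliper to a fraction of the standard deviation of the (logit) propensity score, though the lecture leaves the choice of value open.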
And this also gets back to the positivity assumption, which says that the probability of each treatment, for a given set of X's, should be non-zero. So for given values of X, there should be some non-zero probability of getting either treatment. But if there's nobody who's a good match for a given treated subject, if in fact there are only bad matches, that kind of suggests that this treated subject really didn't have much of a chance of being a control subject, because there's nobody like them in the control group. So caliper matching will help make the positivity assumption more plausible. A drawback is that the population becomes a little harder to define. You start with an original population, let's say all treated subjects, but now we're saying it's all treated subjects except for those for which there's no match available. That's a little harder to define, to one of your collaborators, for example. Nevertheless, it's probably a useful idea to use a caliper and not accept matches that are really bad. But there are trade-offs between the two.