Hi, in this video we'll go over how to carry out an actual data analysis in R, in particular a matched causal analysis. The dataset we'll use as an example is the right heart catheterization data, which is publicly available on this Vanderbilt website. It has been available for a long time and has been analyzed by many different people to illustrate the main ideas. It involves ICU patients in five hospitals. The treatment is right heart catheterization, which we'll just call RHC, so RHC or not is the treatment. The outcome we'll look at is death, so just died or not. And there are many confounding variables: demographics, insurance, disease diagnoses and so on. This dataset is fairly large: there are over 2,000 treated subjects and about 3,500 controls. We have posted the actual R code on the Coursera website, but here we'll walk through the main points of the code.
To begin, we'll load the data, load the packages we'll use, and view the data. I'm going to start with two packages. The first one is the tableone package, which is really useful if you're going to be doing a matched analysis. It can very easily create the kind of Table 1 where you compare the balance between treated and control subjects, in either matched or unmatched data. We'll also use the Matching package to carry out the actual matching. So we can read in the data, and typically you would also want to view the data, just to understand what the variables are like, to get a sense of whether there are missing values, and all of those kinds of details. And here, what I'm putting as an optional step is to convert some of the character variables into numeric variables.
In this dataset, a lot of variables are character variables. Some R packages handle that fine and others prefer numeric, so just to make things simpler, I have a bunch of lines of code here that create new numeric variables. For example, in the dataset rhc there's the outcome variable death, which in the dataset itself is coded as Yes or No. Here I'm going to convert that to a binary one-zero variable where one means yes. You'll notice I say rhc$death=='Yes'. That's just an indicator function: it sets the new variable to one if death is Yes and to zero otherwise. I create a whole bunch of these numeric variables, and then I create a new dataset that includes just those variables. That's what I do next. It's what I'm calling mydata, the data I'm going to use, and I just took a handful of confounders plus the treatment variable and the outcome variable. This dataset actually has many more confounders than what I'm using; I'm using a shorter list just to make the presentation easier. The list of covariates that I'm going to consider here I'm calling xvars.
Now we can create a Table 1 pre-matching using the tableone package. You can just say CreateTableOne. What I want is summary statistics for the variables in xvars, that list of covariates I just went over, and I'm telling it to stratify on treatment because we want to look at the balance between the treatment and control groups. I tell it to use my dataset, mydata, and this part here is just saying I don't actually want a significance test, just give me the means and standard deviations and so on.
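As a rough sketch of the recoding and Table 1 steps just described, the block below builds a tiny made-up data frame standing in for rhc, converts its character columns to 0/1 indicators, collects them into a new data frame, and calls CreateTableOne if the tableone package happens to be installed. Only the death == 'Yes' comparison, the name xvars, and the CreateTableOne call with stratification and no significance test come from the lecture. The toy values, the column names trt and sex, and the as.numeric() wrapper are assumptions for illustration, not the posted course code.

```r
# Toy stand-in for the rhc data; all values are invented for illustration.
rhc_toy <- data.frame(
  death = c("Yes", "No", "Yes", "No", "Yes", "No"),
  trt   = c("RHC", "RHC", "No RHC", "No RHC", "RHC", "No RHC"),  # hypothetical column
  sex   = c("Female", "Male", "Female", "Female", "Male", "Male"),  # hypothetical column
  age   = c(71, 54, 63, 80, 47, 66),
  stringsAsFactors = FALSE
)

# A comparison returns TRUE/FALSE; as.numeric() turns that into 1/0.
died      <- as.numeric(rhc_toy$death == "Yes")
treatment <- as.numeric(rhc_toy$trt == "RHC")
female    <- as.numeric(rhc_toy$sex == "Female")
age       <- rhc_toy$age

# New dataset with only the numeric variables we plan to use.
mydata <- data.frame(age, female, treatment, died)

# Covariates to summarize (and later match on).
xvars <- c("age", "female")

# Pre-matching Table 1, stratified by treatment, without significance tests.
# Guarded so the sketch still runs when tableone is not installed.
if (requireNamespace("tableone", quietly = TRUE)) {
  table1 <- tableone::CreateTableOne(vars = xvars, strata = "treatment",
                                     data = mydata, test = FALSE)
  print(table1, smd = TRUE)  # smd = TRUE adds standardized mean differences
}
```

With the real data you would replace rhc_toy with the loaded dataset and extend xvars to the full covariate list.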
Then I have it print the table, and here I say smd=TRUE, which means print the standardized differences. Here you'll see what this table looks like. This is a table that I just copied directly from R, so the tableone package will create a table like this for you with just that simple line of code. These are all of our different covariates. Again, this is a shorter list than you would probably want to use in practice, just to illustrate the main ideas. We see about 3,500 subjects in the control group and about 2,100 in the treated group, and then the means and standard deviations of the different variables. Over here we have the standardized mean differences. Remember that we're especially concerned about standardized mean differences greater than 0.1. We see one there, one there, and so on, so a few of them are showing some imbalance. For example, this variable here is mean blood pressure, and that is roughly 85 versus 68, so we see a pretty big difference there. This is how you can create a Table 1 and get a sense of what the data are like and how much confounding you might be dealing with.
Next, we can go on to matching. Here we'll carry out a greedy matching analysis, and we're going to match on the whole set of covariates: we calculate the distance between subjects based on their covariates and match on that. We can use the Match function from the Matching package, and the code is very simple. Tr is where they want to know what your treatment variable is. Ours happens to be named treatment, so that's straightforward. M=1 has to do with whether you want pair matching or more than one control per treated subject. Here I'm just doing pair matching, so I'm saying match one treated subject to one control subject.
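Here is a minimal sketch of what a call like that might look like, run on simulated data rather than the course dataset. The arguments Tr, M and X and the index.treated and index.control components are real parts of Match in the Matching package, and the lecture goes on to describe X and the index vectors right after this point. The simulated covariates, the name meanbp, and replace = FALSE are assumptions: the lecture does not say whether matching was done with replacement, and without replacement is the setting that gives each treated subject a distinct control.

```r
library(Matching)  # provides Match(); assumed to be installed

set.seed(1)
n <- 200
# Simulated covariates (hypothetical names and distributions).
toy <- data.frame(
  age    = rnorm(n, mean = 60, sd = 15),
  female = rbinom(n, 1, 0.45),
  meanbp = rnorm(n, mean = 80, sd = 30)
)
# Treatment is more likely at lower blood pressure, so groups start imbalanced.
toy$treatment <- rbinom(n, 1, plogis(-1 - 0.02 * (toy$meanbp - 80)))
xvars <- c("age", "female", "meanbp")

# Tr: 0/1 treatment vector. M = 1: one control per treated subject.
# X: covariates to match on. replace = FALSE is an assumption (see above).
greedymatch <- Match(Tr = toy$treatment, M = 1,
                     X = as.matrix(toy[xvars]), replace = FALSE)

# Row positions in the original data; position i in each vector is pair i.
head(cbind(treated = greedymatch$index.treated,
           control = greedymatch$index.control))

# Stack the treated rows, then the control rows, to form the matched dataset.
matched <- toy[unlist(greedymatch[c("index.treated", "index.control")]), ]
table(matched$treatment)
```

Because the treated rows are stacked first and the controls second, in the same pair order, subsetting the matched data by treatment later keeps the pairs aligned.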
And these are the variables I want to match on. What they're calling X is the set of variables you want to match on, so I say, in mydata, just use what I called xvars. If you run that command, it will do the matching for you. What you end up with is an index.treated and an index.control, which tell you who the matched treated and control subjects are, identified by their position in the original data. If you unlist these, you can use them to pull out the matched dataset and then look at it, again with CreateTableOne from the tableone package. This is going to look very similar to what we did before: we say create a Table 1 using xvars, stratify on treatment, but now the data are the matched data. If you look at the previous slide, I called that new matched dataset matched, so that's what I put here. Again, I'm not interested in statistical testing. Now we can print what I'm calling the matched Table 1 and also ask for the standardized mean differences.
And here's what that looks like. The first thing you might notice is that we matched 2,186 controls to 2,186 treated patients. These are exactly equal, which is what you would expect since we're doing pair matching. Here's the same list of variables, and if you look at the standardized mean differences, they're now all very small, so we did a great job of matching. This is the kind of table you would expect to find if you had actually randomized treatment. There's great balance on all of these variables; the standardized mean difference is never even close to 0.1.
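To make the 0.1 rule of thumb concrete, here is a small base R sketch of a standardized mean difference for a continuous covariate, using the common definition: the absolute difference in group means divided by the square root of the average of the two group variances. The numbers are invented toy values, not taken from the RHC data, and the helper name smd_cont is hypothetical.

```r
# Standardized mean difference for a continuous covariate:
# |mean_1 - mean_0| / sqrt((var_1 + var_0) / 2)
smd_cont <- function(x1, x0) {
  abs(mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)
}

# Invented blood-pressure-like values for illustration only.
controls        <- c(80, 90, 100, 70, 85)  # mean 85
treated_before  <- c(60, 70, 75, 65, 80)   # mean 70: clearly imbalanced
treated_matched <- c(84, 88, 99, 71, 86)   # mean 85.6: close to the controls

smd_before <- smd_cont(treated_before, controls)
smd_after  <- smd_cont(treated_matched, controls)

smd_before        # well above 0.1
smd_after         # below 0.1
smd_before > 0.1  # flags imbalance
smd_after > 0.1   # no flag
```

Because the difference is scaled by the standard deviation, the same 0.1 threshold can be applied to every covariate regardless of its units.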
If you remember, there was a big difference in mean blood pressure in the unmatched data, and now the groups are very similar. We're not matching on very many variables here, but it seemed to do a really good job, so at this stage you would probably feel pretty happy about how the matching worked.
Now that you're happy with the matching, you can carry out an outcome analysis, and there are a lot of different ways to do that. Here I'm just going to show you a simple paired t-test. All I'm doing is defining two new variables: y for the treated and y for the control. Remember the outcome variable is died. For y treated, I take the subset of this outcome for those who have treatment equal to one, so I'm just extracting from the whole died variable the values for those who were treated. I do the same thing for the controls. Now I have two outcome variables of equal size, one for the treated and one for the controls, and the rows of these correspond to the same matched pair: the first observation of y treated is matched with the first observation of y control. So I can take a pairwise difference. I calculate a new variable that is the difference between the two: outcome for the treated minus outcome for the control. It's a pairwise difference, so it preserves the matched pairs; you're taking the difference one pair at a time. And a paired t-test is just a regular one-sample t-test on the difference in outcomes within matched pairs, so I can use the standard t-test command on this difference variable.
This is what the results look like. The point estimate is about 0.045, this is the 95% confidence interval, and we have a very small p-value.
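The outcome analysis described above can be sketched in base R as follows. The eight matched pairs and their outcomes are invented purely for illustration, and the variable names y_trt, y_con and diffy are hypothetical. The block also checks the point made in the lecture, that a paired t-test is the same thing as a one-sample t-test on the within-pair differences.

```r
# Invented matched dataset: 8 treated rows stacked on top of their 8 controls,
# in the same pair order (row i of each block belongs to pair i).
matched <- data.frame(
  treatment = rep(c(1, 0), each = 8),
  died      = c(1, 1, 0, 1, 1, 0, 1, 1,   # treated outcomes
                1, 0, 0, 1, 0, 0, 1, 0)   # matched control outcomes
)

# Outcome vectors for each arm; positions line up with matched pairs.
y_trt <- matched$died[matched$treatment == 1]
y_con <- matched$died[matched$treatment == 0]

# Pairwise difference, one value per matched pair.
diffy <- y_trt - y_con

# Mean difference = estimated risk difference for a 0/1 outcome.
mean(diffy)

# One-sample t-test on the differences ...
one_sample <- t.test(diffy)
# ... which is exactly what a paired t-test computes.
paired <- t.test(y_trt, y_con, paired = TRUE)

one_sample
all.equal(unname(one_sample$statistic), unname(paired$statistic))
```

The alignment of y_trt and y_con depends on how the matched dataset was assembled, so it is worth confirming the row order before taking differences.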
This is estimating a causal risk difference. The reason we know it's a risk difference is that the outcome is binary and we're taking the difference in outcomes and then the mean, and the mean of a binary outcome is a risk. So, to summarize, our point estimate is about 0.045. That's the difference in probability of death if everyone received the treatment, RHC, versus if no one received the treatment. In other words, there's a higher risk of death in the treatment group. The 95% confidence interval is here, giving you a sense of the plausible range of the true risk difference, and the result is highly significant. I should remind you that this analysis is just to illustrate the main ideas; we haven't controlled for all of the confounders in the dataset. In practice you would match on a larger set of variables, and your conclusions might end up being a little different if you did that.
You could also carry out a McNemar test. What I could do is just ask for a table. Remember that these are paired outcome observations, so the table will show the counts of each type of pair. For example, 994 pairs were concordant with an outcome of one, and 305 pairs were concordant with an outcome of zero. The more interesting cases are the discordant pairs. In 493 pairs the treated subject died when their matched control did not, whereas in 394 pairs we had the opposite situation, where the control died and the treated subject did not. So among the discordant pairs, there's a bigger number where the treated subject is the one who died, which again suggests that the treated group is at higher risk.
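As a quick check on these numbers, the sketch below puts the four pair counts quoted above into a 2x2 matrix, recovers the risk difference from the discordant pairs, and runs the McNemar test that the lecture turns to next. The counts are the ones quoted in the lecture; the variable names and the layout of the matrix (treated outcome in rows, control outcome in columns) are choices made here for illustration.

```r
# Pair counts as quoted in the lecture.
both_died    <- 994  # treated died, control died
trt_only     <- 493  # treated died, control did not
con_only     <- 394  # control died, treated did not
neither_died <- 305  # neither died

n_pairs <- both_died + trt_only + con_only + neither_died
n_pairs  # total number of matched pairs

# Concordant pairs contribute 0 to the paired difference, so the mean
# difference depends only on the discordant pairs.
risk_diff <- (trt_only - con_only) / n_pairs
round(risk_diff, 3)

# 2x2 table: rows = treated outcome, columns = control outcome.
# matrix() fills column by column.
pairs_tab <- matrix(c(neither_died, trt_only, con_only, both_died), nrow = 2,
                    dimnames = list(treated = c("0", "1"),
                                    control = c("0", "1")))
pairs_tab

# McNemar's test uses only the two discordant cells.
mc <- mcnemar.test(pairs_tab)
mc
```

The total comes to 2,186 pairs and the risk difference rounds to 0.045, which agrees with the matched sample size and the paired t-test estimate reported earlier.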
And this is all for matched data, so it already controls for the confounders that we matched on. Now, if we want to carry out the test, we can just use the McNemar test command. We tell it what the matrix looks like and it carries out the hypothesis test. We get a really small p-value, just like we did with the paired t-test. Either of these methods is getting at the same thing: it's testing the hypothesis of no treatment effect.
So this is just summarizing what I said previously. In both analyses, we conclude that treated subjects were at higher risk of death. This was, again, a relatively simple outcome analysis where we were just estimating a risk difference, but if you were interested in a risk ratio or an odds ratio, you could use GEE, for example. If you wanted a causal odds ratio, you would use a logit link. I've already mentioned that I just used a subset of covariates for simplicity; in practice, you would use more variables.
In this case I illustrated the Matching package, but there are other packages that can do matching. One I like in particular is the rcbalance package, because it has a lot more options, although it's also slightly more sophisticated. There is a really great description of this package in the journal Observational Studies, which is open access, so you're free to download it. It describes in detail how you could carry out more sophisticated matching analyses where, for example, you could have a fine balance constraint or do optimal matching. I'm also mentioning here that the tableone package documentation goes into a lot of detail about steps you could use to carry out a matched analysis, including some very nice figures that assess balance, so I'm recommending that document as well.