In this video, we'll look at one piece of software that will allow you to actually do multiple imputation. There are various choices out there that you might think about using. In R, there are a couple of packages. One is called mi, the other is called mice, and mice is the one that we'll be working with here. In SAS, there are also a couple of options. One is a SAS macro written at the University of Michigan, called IVEware. In the main SAS product, there's PROC MI. And then in Stata there's a command called mi impute. All of those are possibilities.
So what we'll look at is an example using mice. The default method for continuous variables in mice is predictive mean matching, which we learned about in the previous video. For two-level factors, a 0-1 or yes-or-no kind of characteristic, the default is logistic regression, and the parameter name for that is logreg. It's possible to use a different method of imputation and a different set of covariates for every variable that you're trying to impute. So that's a nice piece of flexibility. You can impute all variables with missing values, or you could impute just the subset that is most important to you.
The place to look to get a good description of this is the article by the authors of mice, who are van Buuren and Groothuis-Oudshoorn. And I apologize to them if I butchered their names. The article that you should read is in the Journal of Statistical Software in 2011. It's called "mice: Multivariate Imputation by Chained Equations in R". So this is where the term chained equations comes up.
There's a tiny data set that comes with mice that I'll use for illustration. It's called nhanes2. NHANES stands for National Health and Nutrition Examination Survey. This is a big survey done in the United States. It's a household survey of persons, and many different physical measurements are taken. Blood samples are drawn, and there's an interview done with the people who participate in the survey.
It's a very important public health data set in the US for measuring the prevalence of different kinds of health conditions. There are four variables in nhanes2. One is age. Another is body mass index, bmi. Another is hypertension, yes or no, do you have high blood pressure, and hyp is the variable name. Total serum cholesterol is the final variable. That's a continuous one called chl.
To get access to the package, I call require on it, or I could say library here, which has the same effect for our purposes. Another nice function in R is the head function, which by default will list the first six records in an object. There's a corresponding tail function that will list the last six. You can also pass a second argument and list a different number, like 10 or 20.
So what we see here is the first six observations. You see that age is a categorical variable, and it's coded in ranges here. BMI is continuous, and it's got four missing values in the first six records. Hypertension is categorical, and it's got three missing values here. Serum cholesterol has two missing values, and otherwise it's continuous. We'll see in the next video how to go about actually imputing for those missing values.
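The steps just described can be sketched in R roughly as follows. This is a minimal sketch, not the exact code shown on screen. It assumes the mice package is installed and that the installed version provides the helper make.method (available in recent releases of mice). The object names first_six, missing_first_six, and default_methods are made up for illustration.

```r
# Attach the package; require(mice) would also work here.
library(mice)

# nhanes2 ships with mice. head() lists the first six records by default,
# tail() lists the last six, and a second argument changes the count.
head(nhanes2)
head(nhanes2, 10)
tail(nhanes2)

# Count missing values per variable in the first six records.
# Per the walkthrough above: bmi has 4, hyp has 3, chl has 2.
first_six <- head(nhanes2)
missing_first_six <- colSums(is.na(first_six))
print(missing_first_six)

# Look up the default imputation method mice would pick for each variable.
# Assumption: make.method() exists in the installed mice version.
# Continuous variables should get "pmm" (predictive mean matching) and
# two-level factors "logreg" (logistic regression).
default_methods <- make.method(nhanes2)
print(default_methods)
```

Nothing is imputed here. The sketch only inspects the data and the default methods, so any per-variable method you want to change could be edited in default_methods before running an imputation.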