0:07

Let's continue now with the mice example for multiply imputing missing data.

Now this is what's called the margin plot,

which is a handy way, at least for small data sets,

of seeing the pattern of missingness.

So the first thing I do to draw one of these is load the VIM package,

which has a function in it called marginplot, which is nice.

You can find this example in the van Buuren paper that was

referenced in the last video.

So I feed it the nhanes2 data set.

I'm going to plot total serum cholesterol and BMI.

I set some colors here using parameters that marginplot takes, the cex

parameters set the size of the labeling in the plot,

and pch = 19 is just a particular plotting character

that gets filled in with color.
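The call being described looks roughly like this. It is a sketch, not necessarily the lecturer's exact settings: the specific color and size values are assumptions, `chl` and `bmi` are the column names in the nhanes2 data that ships with mice, and `mdc()` is mice's helper for its standard blue/red missing-data colors.

```r
library(mice)  # supplies the nhanes2 data set and mdc()
library(VIM)   # supplies marginplot()

# Margin plot of total serum cholesterol (chl) against BMI;
# the sizes below are illustrative choices.
marginplot(nhanes2[, c("chl", "bmi")],
           col = mdc(1:2),     # blue = observed, red = missing
           cex = 1.2,          # point size
           cex.lab = 1.2,      # axis-label size
           cex.numbers = 1.3,  # size of the missingness counts
           pch = 19)           # filled plotting character
```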

So what have I got here?

On the horizontal axis, I've got total serum cholesterol.

On the vertical axis, I've got body mass index.

Â 1:35

Then, along the sides, I've got two boxplots.

So the blue one here on the left, this one right here,

is for the BMI values of cases that have serum cholesterol present.

The red boxplot right here is for cases that have BMI present

but are missing serum cholesterol.

And then you've got the corresponding sort of thing down here for

cases that have serum cholesterol present but are missing BMI.

And we've got a total of 9 cases that are present on

BMI but missing serum cholesterol, 10 cases that

are present on cholesterol but missing BMI, and

then we've got 7 cases right here that are missing both.

Â 2:40

Now another thing that's interesting to note from this plot:

if the data were MCAR, missing completely at random, in other words

the missingness is just a random draw from the total sample,

I would expect to see boxplots that looked alike.

The cases that have both BMI and serum cholesterol present

should have a boxplot that looks the same as the cases

that have BMI but are missing serum cholesterol.

Â 3:16

That is not the case here, on either axis.

You can see on the vertical axis that these two boxplots I've drawn arrows to

look substantially different, and the same thing down here

for the two boxplots on the horizontal axis.

So that means I need to account for the covariates to have

any hope of getting imputations that I would consider to

be unbiased predictions of what the actual values are.

Â 3:51

Now how do I do the imputation?

In this simple little example it's easy: I call the mice function,

I send it nhanes2, I set a seed for the random number generator

in case I want to generate the same set of imputed values again, and

I store the whole thing in nhanes2.imp, for imputations.
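A minimal sketch of that call; the particular seed value here is an arbitrary assumption, and any fixed seed makes the imputations reproducible.

```r
library(mice)

# Impute nhanes2 with mice's defaults; the seed lets us regenerate
# exactly the same set of imputed values later.
nhanes2.imp <- mice(nhanes2, seed = 23109)

# Prints the number of imputations, the method used for each
# variable, and the covariates used to impute each variable.
summary(nhanes2.imp)
```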

The summary function operating on this will print information about

the number of imputations m, the imputation method for each variable, and

the covariates used to impute each variable.

Â 4:29

If I want to look at the completed data sets one at a time,

I can use the complete function.

So if I call complete on nhanes2.imp with action = k as a parameter,

it retrieves the kth completed data set.

So that's handy; you can see what you actually did.
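For example, to pull out the second completed data set (the index 2 here is just for illustration):

```r
# action selects which of the m completed data sets to return
completed2 <- complete(nhanes2.imp, action = 2)
head(completed2)  # inspect the imputed-and-observed values together
```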

Â 4:51

Now let's take a look at what summary gives us.

So there are a few things to note. It echoes the call to the function here.

The number of imputations by default is 5, but you can control it;

you could do more than 5 if you wanted to.

The number of missing cells, or values, for

each column in the data set is reported here, and

then it gives you in this row the imputation methods that are used.

So age is not missing, so I don't need to impute for that.

BMI is continuous, so the default is predictive mean matching.

Hypertension is categorical; the default there is logistic regression.

Serum cholesterol is also continuous, so

I get predictive mean matching as the default there too. But you can control that:

if you've got a better idea of how to do it,

you can use one of the other methods that are available.

So the VisitSequence, as it shows here, says that

I impute BMI first, hypertension second, and

serum cholesterol third in the sequence of imputing.

Now the other piece of information here is a matrix

that tells us what covariates were used to impute each variable.

So what you see here is that this row of zeros means age did not need to be imputed,

so no covariates were used there.

For BMI, on the other hand, age, hypertension, and

serum cholesterol were all used to form a model to impute BMI.

So everything except itself was used to impute BMI.

For hypertension we see a similar thing: age, BMI, and

total cholesterol were used. And then for

total serum cholesterol: age, BMI, and hypertension.

Now you can control that if you want.

If you know that a better model involves just a subset of the variables,

you can specify that to the function.
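A sketch of that kind of customization: start from the defaults mice chose, stored on the imputation object, and modify them before re-running. The specific changes below are illustrative assumptions, not choices made in the lecture.

```r
# The mids object keeps the predictor matrix and methods mice used.
pred <- nhanes2.imp$predictorMatrix
pred["bmi", "hyp"] <- 0  # e.g., drop hypertension from the BMI model

meth <- nhanes2.imp$method
meth["chl"] <- "norm"    # e.g., Bayesian linear regression instead of pmm

# Re-impute with the customized models (seed value is arbitrary).
nhanes2.imp2 <- mice(nhanes2, predictorMatrix = pred,
                     method = meth, seed = 23109)
```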

Â 7:18

So just to show you a little example of how you'd go about using this

once you've done the imputing, I fit a linear model.

So you can say, with the nhanes2.imp object that contains the imputations,

I want to fit a linear model using the lm function.

Here I'm regressing cholesterol on age plus BMI,

and I then do a summary on the fit.

But interposed in here is the pool statement, which comes with mice.

The pool statement

Â 8:01

essentially results in the multiple imputation formulas being used to

summarize the fits, so the estimates will be averages across the completed data sets.

The standard errors that are reported use that variance formula which

involves the average of the direct variance estimates plus an increment due to the variation

between the completed-data-set estimates that use the imputations.
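Putting that analysis step together: mice's with() fits the same model on each completed data set, and pool() applies the multiple imputation combining rules. Depending on the mice version, fmi and lambda appear either in the summary output or on the pooled object itself.

```r
# Fit the regression on each of the m completed data sets
fit <- with(nhanes2.imp, lm(chl ~ age + bmi))

# Pool the m fits: estimates are averaged, and the variance is the
# within-imputation average plus a between-imputation increment
pooled <- pool(fit)
summary(pooled)  # pooled estimates, standard errors, t statistics
```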

Â 8:32

And the variance estimate was put together with that special multiple

imputation formula.

So I round things off to two decimals just to give us less to look at.

And you see here, here are the regression parameters in this column.

Here are the standard error estimates using that particular multiple

imputation variance formula, the square root of it, then the t stats, and

then I get some other things on the end here called fmi and lambda.

Fmi is the fraction of missing information,

which has to do with the size of the B component, the between-imputation

variance, in that variance formula. And lambda is similar:

it is the proportion of total variance attributable to the imputations.

So let's look at lambda.

This says that of the variance for the intercept,

27% is attributed to the imputations.

So in a big data set where you've got a relatively small number or

proportion of imputed values, these numbers will be much smaller,

but this is just a toy example.

So that summarizes the simple example of how you would use the mice software.

So mice is nice.

I'd recommend it.

It's very flexible, and it's quite popular, both for doing the imputations and

for properly reflecting the effect of those imputations on variances.
