0:06

Now that we've heard about some reasons for imputing,

let's look at particular methods.

So I'll talk about means and hot deck, in particular.

But first, let's look at a list of all the possibilities

that we've got that we'll cover in this course.

The first one here is imputation based on logical rules.

Now, that is not normally what you'd think of as an imputation.

It's more like an edit check.

And the idea there is if, say, at one point in a questionnaire,

you get somebody's date of birth, and later on age is missing,

then you can do a little calculation and fill in what age is.

So that's a sort of fill-in imputation.
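
As a rough sketch of that date-of-birth rule: the field names `date_of_birth` and `age` and the helper names below are made up for illustration, and the interview date is assumed to be known.

```python
from datetime import date

def age_on(birth_date, reference_date):
    """Completed years between birth_date and reference_date."""
    years = reference_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def impute_age(record, interview_date):
    """Fill in a missing 'age' from 'date_of_birth' when that is available."""
    if record.get("age") is None and record.get("date_of_birth") is not None:
        filled = dict(record)  # copy so the raw record is left untouched
        filled["age"] = age_on(record["date_of_birth"], interview_date)
        return filled
    return record

# Hypothetical respondent interviewed one day before a 40th birthday.
respondent = {"date_of_birth": date(1980, 6, 15), "age": None}
result = impute_age(respondent, date(2020, 6, 14))
```

Because the value is computed rather than estimated, this behaves more like an edit check than a statistical imputation.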

0:57

Another choice that we've got would be filling in a mean

that's usually computed within cells defined by characteristics that you need to

know for all the units in your sample.

And you may or may not add a random error to that.
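
A minimal sketch of filling in a cell mean, without the random error. The variable names (`age_group`, `sex`, `income`) and the tiny dataset are invented for illustration, and the means are unweighted, which a real survey would probably not want.

```python
from collections import defaultdict
from statistics import mean

def impute_cell_means(records, cell_keys, target):
    """Replace missing target values with the respondent mean of the same cell."""
    # Collect the observed values of the target within each cell.
    observed = defaultdict(list)
    for r in records:
        if r[target] is not None:
            observed[tuple(r[k] for k in cell_keys)].append(r[target])
    cell_mean = {cell: mean(vals) for cell, vals in observed.items()}

    completed = []
    for r in records:
        r = dict(r)  # copy so the input is not modified
        cell = tuple(r[k] for k in cell_keys)
        # A cell with no respondents has no mean, so its value stays missing.
        if r[target] is None and cell in cell_mean:
            r[target] = cell_mean[cell]
        completed.append(r)
    return completed

sample = [
    {"age_group": "18-34", "sex": "F", "income": 30.0},
    {"age_group": "18-34", "sex": "F", "income": 40.0},
    {"age_group": "18-34", "sex": "F", "income": None},
    {"age_group": "35-64", "sex": "M", "income": 60.0},
    {"age_group": "35-64", "sex": "M", "income": None},
]
completed = impute_cell_means(sample, ["age_group", "sex"], "income")
```

The cell variables have to be known for every unit, otherwise a nonrespondent cannot be assigned to a cell.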

The third choice is what's called cold deck.

So the idea there is this works if you've got a continuing survey

where you've got a previous edition of the survey.

If a unit responded in the previous version,

then you go back, get that value or possibly some

indexed-forward version of it, and fill it in for the current missing value.

So it's called cold because you're referring back

to a dataset that is already in your hands.

It's not the current one that you're dealing with.
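
A small sketch of the cold deck idea, assuming units can be matched across waves by an identifier. The `index_factor` argument is a hypothetical stand-in for whatever indexing forward a particular survey would apply.

```python
def cold_deck(current, previous, target, index_factor=1.0):
    """Fill missing current values from the same unit's previous-wave value.

    current, previous: dicts mapping unit id -> record dict.
    index_factor: multiplier used to index the old value forward (assumed).
    """
    completed = {}
    for unit_id, rec in current.items():
        rec = dict(rec)  # copy so the input is not modified
        prior = previous.get(unit_id)
        if rec[target] is None and prior is not None and prior.get(target) is not None:
            rec[target] = prior[target] * index_factor
        completed[unit_id] = rec
    return completed

# Hypothetical business survey: unit A is missing now but reported last time,
# unit C is missing now and was not in the previous wave at all.
previous_wave = {"A": {"sales": 100.0}, "B": {"sales": 200.0}}
current_wave = {"A": {"sales": None}, "B": {"sales": 210.0}, "C": {"sales": None}}
completed = cold_deck(current_wave, previous_wave, "sales", index_factor=1.25)
```

Unit C stays missing here, since there is no earlier report to fall back on; it would need one of the other methods.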

Now, in contrast to that is something called hot deck, and

what that amounts to is you look at your current dataset.

If you've got a missing value, then you find a similar case that

has complete data for the variable, and you just grab that value and fill it in.

So that's usually done within cells, also.

A fifth possibility is regression prediction.

So based on covariates that are available for all units, both those with missing values and

those without, you generate a regression prediction, and you may or

may not add a random error to that.
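
To make regression prediction concrete, here is a sketch with a single covariate and ordinary least squares fitted on the complete cases. The function names, the normal residual error, and the toy data are assumptions for illustration only.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def regression_impute(xs, ys, add_error=False, rng=None):
    """Impute missing ys (None) from xs; optionally add a normal residual."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    # Residual standard deviation on n - 2 degrees of freedom
    # (this sketch assumes at least three complete cases).
    resid = [y - (intercept + slope * x) for x, y in pairs]
    sigma = (sum(e * e for e in resid) / (len(pairs) - 2)) ** 0.5
    rng = rng or random.Random()
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x
            if add_error:
                y += rng.gauss(0.0, sigma)
        filled.append(y)
    return filled

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]
filled = regression_impute(x, y)  # pure prediction, no random error
```

Passing `add_error=True` adds a random draw to each prediction, so that repeated imputations do not all sit exactly on the fitted line.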

Now, somewhat related to that is a method called predictive mean matching.

You find a unit that's got the closest

observed value to one predicted by regression for your missing case.

So you've got a missing case.

You make a regression prediction based on some covariates.

So you've gotta fit that regression from the complete cases.

Then you look at that prediction, find a complete case that actually has reported

data that's close to that prediction, and then you fill in that value.

So it's got the virtue of taking advantage of any sort of regression

relationship between covariates and your analysis variable.
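
A sketch of predictive mean matching as just described: fit the regression on the complete cases, predict for the missing case, then borrow the reported value that lies closest to that prediction. The single covariate and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def pmm_impute(xs, ys):
    """Impute each missing y with the reported value closest to its prediction."""
    donors = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([d[0] for d in donors], [d[1] for d in donors])
    donor_values = [d[1] for d in donors]
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            prediction = intercept + slope * x
            # Take the reported value nearest to the regression prediction.
            y = min(donor_values, key=lambda v: abs(v - prediction))
        filled.append(y)
    return filled

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, None, 8.2, 9.8]
filled = pmm_impute(x, y)
```

The imputed value is always one that some respondent actually reported, while the regression still decides which reported value gets picked.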

3:27

Now, each one of methods 2 through 6 can be done sequentially.

So you find an item where there are the fewest missing values.

You fill all those in.

Then you go to the item with the next-fewest missing values.

You use your complete data, plus the imputations you just made.

You impute for the missing values for this new variable, and

you keep going in a sequential method.

A variation on that which we'll get to later is called imputation through

chained equations, and we'll look at some software that will do that for you.
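
The sequential idea can be sketched as follows. To keep it short, each variable is imputed by a simple regression on the variable completed just before it, which is a simplification chosen for illustration; it is not the chained equations method, and the column names are invented.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def count_missing(values):
    return sum(v is None for v in values)

def sequential_impute(columns):
    """columns: dict of name -> list of values, with None marking missing.

    Assumes the column with the fewest missing values is fully observed.
    """
    # Work from the item with the fewest missing values to the one with the most.
    order = sorted(columns, key=lambda name: count_missing(columns[name]))
    first = order[0]
    if count_missing(columns[first]) > 0:
        raise ValueError("this sketch needs one fully observed column")
    completed = {first: list(columns[first])}
    predictor = first
    for name in order[1:]:
        xs = completed[predictor]  # reported values plus earlier imputations
        ys = columns[name]
        pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
        a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
        completed[name] = [a + b * x if y is None else y for x, y in zip(xs, ys)]
        predictor = name
    return order, completed

data = {
    "z": [None, 8.0, 12.0, None],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, None, 8.0],
}
order, completed = sequential_impute(data)
```

Note that the fit for `z` uses the value of `y` that was imputed in the previous step, which is the point of doing things sequentially.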

4:08

Now, looking specifically at mean imputation,

one of the troubles with it is that if you've got a lot of missing values and

you keep repeatedly filling in the mean, you're going to introduce a kind of

a spike in the distribution, a lot of values at that one mean value.

So one way to work around that is to add a random error to the mean.

That would help reduce that distortion of repeatedly imputing the same value.

So what's the error?

The error could be normal with mean 0 and

a variance equal to the observed element variance of the nonmissing values.

But you don't have to use normal.

I mean, there's no law that says that's the way data are distributed.

So you could certainly use distributions other than normal.

If you looked at your complete data and

you found that they were distributed like a gamma or

some sort of chi-square distribution, then certainly that could be used.
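
A short sketch of the contrast between plain mean imputation and mean plus random error, using a normal error with variance equal to the sample variance of the reported values. The seed and the data are arbitrary, and the normal draw could be swapped for another distribution.

```python
import random
from statistics import mean, variance

def impute_mean(values):
    """Plain mean imputation: every missing value gets the same number."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def impute_mean_plus_noise(values, rng):
    """Mean imputation plus a N(0, s^2) error, s^2 from the reported values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    sd = variance(observed) ** 0.5  # sample variance of the nonmissing values
    return [m + rng.gauss(0.0, sd) if v is None else v for v in values]

values = [10.0, 12.0, None, 14.0, None, 16.0, None]
spiked = impute_mean(values)  # three identical imputations at the mean
rng = random.Random(2024)     # arbitrary seed so the run can be repeated
spread = impute_mean_plus_noise(values, rng)
```

In `spiked`, all three imputed values equal the respondent mean, which is the spike; in `spread`, they scatter around it.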

5:16

The cells or subgroups that you form to do this for

mean imputation are a way of accounting for

the possibility that the value depends on covariates.

So in that case, you're really implicitly using a model.

And the regression model that is implicit in this is a kind of ANOVA model,

an analysis of variance model, where all the covariates are categorical.

Those are the ones used to form the cells.

And you're imputing a mean based on those categorical covariates.
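
Written out, the implicit model might look like the following. The notation is assumed here rather than taken from the lecture: c indexes cells, i indexes units within a cell, R_c is the set of respondents in cell c, r_c is their number, and the mean is unweighted.

```latex
y_{ci} = \mu_c + \varepsilon_{ci}, \qquad E(\varepsilon_{ci}) = 0,
\qquad
\hat{y}_{ci} = \bar{y}_{c} = \frac{1}{r_c} \sum_{j \in R_c} y_{cj}
```

Every missing value in cell c is replaced by the same estimate of the cell mean, which is why the cells carry all of the covariate information.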

5:55

Now, hot deck imputation is somewhat different.

Usually, you put things into cells.

For example, if you've got a business survey,

you might use type of business by size.

If you're doing a survey of persons, you might use age by gender.

So within each of those cells, if a unit has a missing value,

what you do is find the non-missing cases, draw one of those at random, and

fill in its value as the imputed value.

So that's got the advantage of

not imputing impossible values because you're using observed values.

It does have the implicit assumption that all units in

a group have a common mean.

So it's got an ANOVA model type of assumption underlying it.

That's the case where it makes the most sense to use a kind of hot

deck imputation.
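
A minimal sketch of a random hot deck within cells. The cell variables (`industry`, `size`), the target (`sales`), and the data are invented for illustration, and donors are drawn with equal probability and with replacement, which is only one of several possible choices.

```python
import random
from collections import defaultdict

def hot_deck(records, cell_keys, target, rng):
    """Fill each missing target with a randomly drawn reported value from its cell."""
    # Pool the reported values (the donors) within each cell.
    donors = defaultdict(list)
    for r in records:
        if r[target] is not None:
            donors[tuple(r[k] for k in cell_keys)].append(r[target])

    completed = []
    for r in records:
        r = dict(r)  # copy so the input is not modified
        pool = donors.get(tuple(r[k] for k in cell_keys))
        # A cell with no donors is left missing in this sketch.
        if r[target] is None and pool:
            r[target] = rng.choice(pool)
        completed.append(r)
    return completed

sample = [
    {"industry": "retail", "size": "small", "sales": 50.0},
    {"industry": "retail", "size": "small", "sales": 70.0},
    {"industry": "retail", "size": "small", "sales": None},
    {"industry": "mining", "size": "large", "sales": 900.0},
    {"industry": "mining", "size": "large", "sales": None},
]
completed = hot_deck(sample, ["industry", "size"], "sales", random.Random(7))
```

Every imputed value is one that a respondent in the same cell actually reported, so impossible values cannot be created.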