0:05

Let's take a look at this video about how you might do some diagnostics in R.

Â Now, there is some literature on adapting standard diagnostics for

Â use with survey data.

Â In fact, my students and I have contributed to a bit of that.

Â But there are other things that are out there.

Â The problem right now is very few of these things are available in packages,

Â so you need to program your own.

Â 0:48

So how do we compute these things?

Â We're going to use the same model as in a previous video.

Â The academic performance index data file and

Â the we'll look at standardize residuals.

Â So you may know from other courses that this should have mean 0 and

Â variance 1 under the model if you've got the model specified correctly.

Â And I have the form data- prediction,

Â yi-yi hat over the estimated standard

Â deviation in the model of the residuals,

Â of the model errors.

Â Now that assumes that there is a constant variance in the model,

Â that may or may not be true.

Â So how do we estimate sigma hat?

Â It is an estimate of the model error variance.

Â 1:50

It's not a design variance, but

Â we can use the survey weights in making an estimate of it.

Â So an approximately unbiased estimate of

Â the underlying common variance of the errors in the model,

Â will be the sum over the elements in the sample i an element of s.

Â The weight, the survey weight, times the squared residual

Â fitted from the model, divided by the sum of the survey weights.

Â 2:21

So that'll be approximately unbiased,

Â assuming that you've got the models specified correctly.

Â So we can use that, calculate these standardized residuals here,

Â and take a look at them.

Â 2:39

This repeats some of the things that we did in previous videos so

Â I won't spend a lot of time on this.

Â We want to require the survey package because that does the model fitting.

Â We tell it use the API data.

Â We specify our stratified design object using the API strat data set.

Â Fit our model with svyglm.

Â And the same model as before.

Â I calculate the estimate of the underlying sigma square this way.

Â I take, there's a built in function called weighted mean in the stat piece of R.

Â So you don't have to do anything special to get at that.

Â We just invoke it weighted.mean.

Â And first you tell it what are the things I want the weighted mean of.

Â 3:35

The model object that I saved, m1, has got a component called residuals.

Â So I square those and then I tell it here's my weight apistrat$pw.

Â Which is just the survey weight.

Â And it'll compute that sigma-hat squared that we looked at on the last slide.

Â So then I take the standardized residuals and calculate those this way.

Â I take the model residuals, m1$residuals,

Â divided by the root of sig2, which I just computed.

Â So we can look at a summary those, and you can see they're a little off balance.

Â And since they're not being centered around zero,

Â the mean's -0.278, or -0.2837.

Â You might know from ordinary leaf squares that the sum of

Â the residuals ought to be 0.

Â We've actually fit a weighted leaf squares here.

Â So that's not true.

Â But still you'd like to see those centered around 0.

Â And they're kind of asymmetric.

Â The smallest one is -3.86 largest is 2.54.

Â The standard normal distribution the bulk of

Â the distribution ought to be between plus or minus 3 so

Â we're seeing something a little different here.

Â 5:06

Now, let's draw a picture to look at these standardized residuals.

Â So ours got some really nice graphics capabilities in

Â their packages that extend those, the gg plot package is an excellent one.

Â This is just basic R graphics.

Â So I'm setting up a 1 by 2 plotting matrix with

Â this par(mfrow) specification.

Â And then I want to plot in the two pieces of the plotting matrix.

Â First I plot on the x axis, a variable called apistrat$cnum.

Â cnum is county number.

Â And then on the y axis I plot my standardized residuals.

Â Just to see if there's some relationship between county and

Â how big the residual is.

Â Conceivably that could be true.

Â You get urban counties, rural counties, they could act differently.

Â I put some labels on my plot so I could understand it better.

Â And I draw a horizontal line at 0.

Â That's where the residuals should be centered if all is going well.

Â And then a thing that I especially like is to put

Â a non parametric smoother through the plot.

Â So if the model's specified correctly your residuals

Â ought to be up and down around 0, centered on 0.

Â So this particular non-parametric smoother which is called

Â lowess ought to be wiggling through above the horizontal 0 line.

Â So we'll take a look at that.

Â And I've colored it red just so you can see it and made it twice as big or

Â wide as the standard selection for the line width.

Â Now, in the second panel, I'm going to plot the standardized residuals versus

Â school enrollment, apistrat$enroll, and I label the axes again.

Â And I fit a non-parametric smoother through there again.

Â 7:22

So here's what it looks like.

Â We've got counties on the left.

Â And here's my county number on this axis.

Â So it's not a whole lot going on there, the non-parametric

Â smoother is down below 0 which is slightly troubling.

Â There's nothing obvious about extreme residuals.

Â They're not asymmetrically

Â distributed in an extreme way around 0.

Â There is, however, this one outlier here compared to the other points.

Â And maybe that one so

Â you can take a look at those to see what's going on with those.

Â 8:06

Then, over in this panel, I plotted standardized residuals versus enrollment.

Â And here, we really see something.

Â The non-parametric smoother is definitely not horizontal at 0.

Â There's a real linear trend in these residuals.

Â Which just means that enrollment better get into the model.

Â It's affecting the academic performance index, it's related to it.

Â Over here, county doesn't much matter.

Â There's not an obvious reason for including county.

Â So we learned a little bit by doing this.

Â And there may be other diagnostics you can think of that you can easily

Â enough program yourself.

Â