0:05
Let's take a look at this video about how you might do some diagnostics in R.
Now, there is some literature on adapting standard diagnostics for
use with survey data.
In fact, my students and I have contributed to a bit of that.
But there are other things that are out there.
The problem right now is very few of these things are available in packages,
so you need to program your own.
0:48
So how do we compute these things?
We're going to use the same model as in a previous video.
It uses the academic performance index data file,
and we'll look at standardized residuals.
You may know from other courses that these should have mean 0 and
variance 1 under the model, if you've got the model specified correctly.
They have the form data minus prediction, y_i minus y_i hat,
divided by the estimated standard deviation of the model errors.
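Written out, with y_i the observed value, y_i hat the model prediction, and sigma hat the estimated standard deviation of the model errors, the standardized residual being described is

$$ e_i^{\mathrm{std}} = \frac{y_i - \hat{y}_i}{\hat{\sigma}}, $$

where the label e_i^std is just my notation for it.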
Now, that assumes that there is a constant variance in the model,
which may or may not be true.
So how do we estimate sigma hat?
Sigma hat squared is an estimate of the model error variance.
1:50
It's not a design variance, but
we can use the survey weights in making an estimate of it.
An approximately unbiased estimate of
the underlying common variance of the errors in the model
is the sum, over the elements i in the sample s,
of the survey weight times the squared residual
from the fitted model, divided by the sum of the survey weights.
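In symbols, with w_i the survey weight and e_i = y_i - y_i hat the residual from the fitted model for element i in the sample s, the estimator being described is

$$ \hat{\sigma}^2 = \frac{\sum_{i \in s} w_i\, e_i^2}{\sum_{i \in s} w_i}. $$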
2:21
So that'll be approximately unbiased,
assuming that you've got the model specified correctly.
So we can use that, calculate these standardized residuals here,
and take a look at them.
2:39
This repeats some of the things that we did in previous videos so
I won't spend a lot of time on this.
We want to require the survey package because that does the model fitting.
We tell it to use the API data.
We specify our stratified design object using the apistrat data set.
Then we fit our model with svyglm,
the same model as before.
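Here is a sketch of what that setup might look like in R. The svydesign call follows the standard apistrat example from the survey package; the model formula (api00 regressed on ell, meals, and mobility) is only a placeholder, since the exact model from the earlier video isn't shown in this transcript.

```r
require(survey)
data(api)  # loads apistrat, among other API data sets

# Stratified design: strata are school types, pw is the survey weight,
# fpc supplies the finite population correction
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Fit the linear model with svyglm (placeholder formula)
m1 <- svyglm(api00 ~ ell + meals + mobility, design = dstrat)
```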
I calculate the estimate of the underlying sigma squared this way.
There's a built-in function called weighted.mean in the stats package of base R,
so you don't have to do anything special to get at that.
We just invoke weighted.mean,
and first you tell it the things you want the weighted mean of.
3:35
The model object that I saved, m1, has got a component called residuals.
So I square those, and then I tell it here's my weight, apistrat$pw,
which is just the survey weight.
And it'll compute that sigma-hat squared that we looked at on the last slide.
So then I take the standardized residuals and calculate those this way.
I take the model residuals, m1$residuals,
divided by the root of sig2, which I just computed.
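Putting those two steps together, here's a sketch of the calculation being described (sig2 is the name used in the narration; stdres is just my label for the vector of standardized residuals):

```r
# Weighted, approximately unbiased estimate of the model error variance
sig2 <- weighted.mean(m1$residuals^2, w = apistrat$pw)

# Standardized residuals: model residuals divided by sigma hat
stdres <- m1$residuals / sqrt(sig2)

summary(stdres)
```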
So we can look at a summary of those, and you can see they're a little off balance.
They're not centered around zero;
the mean is -0.2837.
You might know from ordinary least squares that the sum of
the residuals ought to be 0.
We've actually fit a weighted least squares here,
so that's not true.
But still you'd like to see those centered around 0.
And they're kind of asymmetric:
the smallest one is -3.86, the largest is 2.54.
For a standard normal distribution, the bulk of
the distribution ought to be between plus and minus 3, so
we're seeing something a little different here.
5:06
Now, let's draw a picture to look at these standardized residuals.
R's got some really nice graphics capabilities, and
there are packages that extend those; the ggplot2 package is an excellent one.
But this is just basic R graphics.
So I'm setting up a 1 by 2 plotting matrix with
this par(mfrow) specification.
And then I want to plot in the two pieces of the plotting matrix.
First I plot on the x axis, a variable called apistrat$cnum.
cnum is county number.
And then on the y axis I plot my standardized residuals.
Just to see if there's some relationship between county and
how big the residual is.
Conceivably that could be true.
You get urban counties, rural counties, they could act differently.
I put some labels on my plot so I could understand it better.
And I draw a horizontal line at 0.
That's where the residuals should be centered if all is going well.
And then a thing that I especially like is to put
a non-parametric smoother through the plot.
If the model's specified correctly, your residuals
ought to be up and down around 0, centered on 0.
So this particular non-parametric smoother, which is called
lowess, ought to be wiggling along right around the horizontal 0 line.
So we'll take a look at that.
And I've colored it red just so you can see it, and made it twice as
wide as the default line width.
Now, in the second panel, I'm going to plot the standardized residuals versus
school enrollment, apistrat$enroll, and I label the axes again.
And I fit a non-parametric smoother through there again.
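Here's a sketch of the plotting code being described, assuming the standardized residuals are stored in stdres as above. The axis labels are my guesses at what's on the slide; the lowess smoother is drawn in red at twice the default line width, as the narration says.

```r
par(mfrow = c(1, 2))  # 1 x 2 plotting matrix

# Panel 1: standardized residuals versus county number
plot(apistrat$cnum, stdres,
     xlab = "County number", ylab = "Standardized residual")
abline(h = 0)  # residuals should be centered here if the model is right
lines(lowess(apistrat$cnum, stdres), col = "red", lwd = 2)

# Panel 2: standardized residuals versus school enrollment
plot(apistrat$enroll, stdres,
     xlab = "Enrollment", ylab = "Standardized residual")
abline(h = 0)
lines(lowess(apistrat$enroll, stdres), col = "red", lwd = 2)
```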
7:22
So here's what it looks like.
We've got counties on the left.
And here's my county number on this axis.
There's not a whole lot going on there; the non-parametric
smoother is down below 0, which is slightly troubling.
There's nothing obvious about extreme residuals.
They're not asymmetrically
distributed in an extreme way around 0.
There is, however, this one outlier here compared to the other points,
and maybe that one too, so
you can take a look at those to see what's going on with them.
8:06
Then, over in this panel, I plotted standardized residuals versus enrollment.
And here, we really see something.
The non-parametric smoother is definitely not horizontal at 0.
There's a real linear trend in these residuals,
which just means that enrollment had better get into the model.
It's affecting the academic performance index; it's related to it.
Over here, county doesn't much matter.
There's not an obvious reason for including county.
So we learned a little bit by doing this.
And there may be other diagnostics you can think of that you can easily
enough program yourself.