0:08

Well we've discussed sampling techniques such as simple random sampling, or

stratified random sampling, or cluster sampling.

We made a point of talking about how to calculate standard errors within that framework, in which we talked about seven steps of the process.

Going from population to frame to sample and

estimates and the sampling distribution.

And then standard errors estimated from the data, and confidence intervals built around those standard errors to give us uncertainty statements.

We should do the same thing here for systematic samples,

except that what happens with systematic samples, when we start talking about that

uncertainty estimation, is that the approach is built on the others.

It's built on the others in two ways.

In one of those, it is built on the others in terms of thinking about the sampling

distribution and replicating in our sample the sampling distribution.

We're going to refer to that as multiple random starts.

So I'll tell you more about that in a moment.

But then we also saw that systematic sampling applied to lists with certain orders gave us, effectively, certain kinds of samples.

So if the list order were simple random, we had, effectively, a simple random sample when we got done.

That will allow us to use simple random sampling as a variance estimation technique under the assumption of random ordering.

Similarly, stratified random ordering could lead to

a stratified random variance estimate.

There are a couple of other modifications, though. I'm only going to mention one of them, something called a paired difference, which is quite widely used in these kinds of circumstances.

Especially when the selection units are not just elements but

happen to be clusters, and

we'll talk a little bit more about that when we talk about paired differences.

And then we'll go through a simple illustration just to put the calculations

in context and make sure that we know how numbers would actually fit into certain

kinds of calculations.

So this is more about taking a conceptual framework and model, and then applying it to our systematic sampling context, for the purposes of variance estimation, standard error estimation, and confidence interval construction.

But we're not going to go through the full thing,

just highlight these particular topics.

And we'll rely on referring back to previous units as a basis for

getting more detail about how they work.

2:39

All right, so suppose what we're doing is estimating a sample mean,

a mean of our transactions, a mean of some other characteristic.

And we happen to have done a systematic sample.

And we recognize that because we have only a single random start,

we can't technically estimate variance.

Because what we would then have would be one cluster selected from

K clusters if K were our interval.

Remember that the interval defines starting points, and each of those starting points has a fixed sample that goes with it.
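A minimal sketch of that idea, using a hypothetical toy frame (the data and function name are ours, just for illustration):

```python
# Toy illustration (hypothetical frame): with interval K, the random start
# determines the entire systematic sample, so there are only K possible
# samples -- choosing a start is choosing one "cluster" out of K.

def systematic_sample(frame, start, interval):
    """Every interval-th element of the frame, beginning at `start` (0-based)."""
    return frame[start::interval]

frame = list(range(1, 21))  # N = 20 elements
K = 5                       # sampling interval

all_samples = [systematic_sample(frame, s, K) for s in range(K)]
for s, sample in enumerate(all_samples, start=1):
    print(f"start {s}: {sample}")
```

The K lists partition the frame, and a single random start picks exactly one of them, which is why a one-start design gives no internal replication for estimating variance.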

3:44

So, one way to deal with such a problem is to use additional random starts.

Now this is going to complicate the operation.

So instead of having one random start from 1 to K we have two random starts from 1 to

K, and we draw two samples.

Now when we're done, we can actually have two elements from

the sampling distribution which will allow us to calculate sampling variance.

We can take the means from each of our two samples, from each of the random starts, calculate the mean of the means, and then look at the variability of each of our sample means: sample one relative to the average of the two, sample two relative to the average of the two.

4:27

And that is an estimate of our sampling variance.

And we're not limited to two.

We could use c random starts, c is an arbitrary number.

It could be two, or three.

Or 10. The more random starts we have, the more estimates we have; the design is more complicated, but the variance estimate is more stable, because now what we're going to be doing is getting a separate estimate for each sample, and as we do that, our variance will be built on the total number of random starts.

Our degrees of freedom will depend on how many random starts we have.

And if c is only two, we only have one degree of freedom,

we're going to have to use a very large t-value for that.

If we have ten, well, our t-value will still be large for intervals, but it's smaller than if we only used two random starts.

5:24

The basic idea though is as follows.

We would use multiple random starts, and then calculate our mean by just taking

all of our data and dividing it by the sample size for

each of the random starts times the number of random starts.

That's our overall sample size.

So, if our sample size were going to be 50 for each of our random starts, and we did two of them, our overall sample size is 100.

If we did 10, our overall sample size is 500.

Regardless, we would calculate the mean by taking all of the observed

values from all of the samples and dividing by the overall sample size across

all of the different random starts, or what we could do, as I indicated,

calculate the mean for each of the random starts and then average those.

The estimated sampling variance, though, shown on the last line here on this slide, is one in which we have a variance estimate that uses the mean from each sample, y bar sub (I think it's a gamma there; I shouldn't have introduced Greek, should I?), a subscript to denote the mean from different random starts.

And we're going to compare that mean from each of the random starts to the overall mean, which is an average of those; square it, add it up, and when we're done, divide by c minus 1 times c.

So, we've got a process in which we can calculate variances by replicating

the sampling distribution by using multiple starts as we might in a race.
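That calculation can be sketched directly (the sample means below are made up, and the function name is ours):

```python
# Multiple-random-starts variance estimate: treat each random start's sample
# mean as one draw from the sampling distribution, and measure the spread of
# those means around their average.

def multi_start_variance(sample_means):
    """Estimated var(ybar) = sum_g (ybar_g - ybar)^2 / (c * (c - 1))."""
    c = len(sample_means)
    overall = sum(sample_means) / c
    return sum((m - overall) ** 2 for m in sample_means) / (c * (c - 1))

# Suppose c = 4 random starts produced these sample means (hypothetical):
means = [23.1, 25.4, 22.0, 24.5]
var_est = multi_start_variance(means)
se = var_est ** 0.5
```

The divisor c(c - 1) reflects both averaging over c starts and the c - 1 degrees of freedom available for the t-value.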

6:59

But there's another broad approach to this.

And rather than going through and

attempting to identify multiple random starts, complicate our design,

we're going to use just the one random start, and then ask the question, what

order are the elements in the list with respect to our characteristic of interest?

7:29

Suppose we have no reason to believe there is such a connection between list order and the characteristic of interest. That may be because we've deliberately manipulated the list to break that association, or because we have good reason to believe that the underlying list order is completely unrelated to the variable of interest.

We then have, effectively, with systematic sampling, a simple random sample.

And so now, we would use simple random sampling to estimate the variance.

So that's the expression we are showing here.

We're going to treat the sample of size lowercase n as though it were a simple random sample, when in fact it was not, unless our assumption is true, or unless we've deliberately ordered the list in some random fashion.

So we don't have to worry about multiple random starts and multiple means,

and worrying about the complexities of increasing numbers of replicates,

increasing numbers of random starts.

Here we have just the one random start but under an assumption.

Now an assumption means that we have a model in mind.

To the statistician that assumption that the list order is random gives

us simple random sampling as an outcome and

we would use that kind of an expression for the variance.
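Under that random-order assumption, the calculation is just the simple random sampling formula; here is a sketch with hypothetical numbers (the data and function are ours):

```python
# SRS variance of the mean, applied to a systematic sample under the
# assumption that the list order is random: (1 - f) * s^2 / n.
from statistics import variance

def srs_variance_of_mean(y, N):
    """(1 - n/N) * s^2 / n, with s^2 the usual n-1 sample variance."""
    n = len(y)
    return (1 - n / N) * variance(y) / n

y = [12, 9, 15, 11, 8, 14]   # one systematic sample, order assumed random
var_ybar = srs_variance_of_mean(y, N=60)
se = var_ybar ** 0.5
```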

But now suppose that they are related. There's something about the process in which the underlying order is not bouncing back and forth with respect to the outcome; it might be increasing and then decreasing.

It might be that, because of the categorization, the sort order is based on categories of some characteristic, some auxiliary variable for our data in the frame.

We happen to have a group that might have a higher mean,

followed by a group that has a lower mean, followed by a group that has a lower mean,

followed by a group that has a much higher mean and so on.

And those groups now are being represented in the sample in proportion to their size. There's variation, but they're all being represented there, as in stratified sampling.

10:01

The means differ.

Then, effectively, what we've got is a stratified random sample, where we've selected one element per row, one element per zone, as it's sometimes referred to, as we go down a given column.

And that defines our sample and we could indeed use that to calculate

our variances under some kind of stratified random formula.

Now, this is beyond the scope of what we're able to do, but

we just wanted to give you an indication of what could be done here.

We face a dilemma here, because there's only one selection per zone.

10:36

But the zones may be grouped in terms of our sort order, to be all business services, followed by all travel. Or maybe, as we pass from one interval to the next, there's a little mix of business services and travel.

But nonetheless, there's basically a divide between business services and

travel in our transaction example.

In a case like that, what we may be able to do is have multiple selections in each

of these categories and treat them as a stratum and then calculate our variances.

And we would calculate the variance using

a proportionally allocated stratified variance expression.

It's actually shown here.

We didn't go over this in our earlier material, but

that's how it would be treated.
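A sketch of that proportionally allocated stratified calculation (the strata and values below are hypothetical; under proportional allocation the stratum weights N_h/N equal n_h/n):

```python
# Proportionally allocated stratified variance of the mean:
# var(ybar) = (1 - f) / n * sum_h (n_h / n) * s_h^2
from statistics import variance

def prop_stratified_variance(strata, N):
    """Each stratum is a list of sample values; N is the frame size."""
    n = sum(len(h) for h in strata)
    return (1 - n / N) / n * sum(len(h) / n * variance(h) for h in strata)

# e.g. business-services transactions vs. travel transactions (made up):
strata = [[30, 34, 28, 32], [10, 14, 12, 8]]
var_ybar = prop_stratified_variance(strata, N=80)
```

Only the within-stratum variances s_h^2 enter, which is why grouping selections into homogeneous implicit strata can reduce the estimate relative to the SRS formula.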

11:22

Sometimes the ordering though is less discrete.

It's more continuous.

That is, we have established that there is some ordering with respect to the characteristic, so we're not going to do simple random sampling.

Then we're going to do some kind of stratified sampling.

But the ordering is such that it's almost continuous.

Remember our serpentine ordering, going up and down and back and forth?

And there's really some difficulty at times in drawing strict boundaries

within which we can calculate variances and then combine them across those groups.

And so instead, if the ordering is more or less continuous,

what we're basically getting is one selection from each row.

12:23

The fifth selection paired with the sixth, and so on.

And there is an approach that involves variance estimation for this circumstance,

called paired differences, pairing the successive rows.

Again, we give the formula here, and we're going to illustrate its actual application

in our illustrations, coming next.

It's not important to understand where this comes from, but simply that this kind

of thing reflects the underlying implicit stratification of the list.

It's a progression: we've moved from multiple random starts, to simple random sampling with no relationship between the sort order and our outcome, to a relationship expressed in terms of categories that were levels of an auxiliary variable we've used for the sort, to one in which there's an almost continuous ordering.

There's some underlying ordering there, but it's very hard to draw boundaries, and in that case we would use a paired differences expression.

13:23

Just a small illustration.

Suppose that we had this sample here: a sample of blocks selected from a much larger list, and six of the blocks were selected.

We've got records on these blocks and we happen to know the number of rental units

and the number of total housing units on each of the blocks.

And we have collected the information for a sample of six.

And we've got an index, one, two, three, four, five, six for our sample.

So we've got a very small sample here.

And it turns out that when we originally selected the list,

the list happens to be in more or less random order, with respect to the blocks.

There's no reason to expect any particular ordering in the list.

Well, regardless of the underlying order, we're going to calculate the mean first. For example, this is the mean number of rental units in our sample, 23.83. We've added up the six values in the rental column and divided by 6: a mean of 23.8, about 24.

Now, what is the standard error of that mean?

And in this particular case, the first question is, is the list order random?

If the answer to that is "essentially, we think it is," now notice that's an assumption. Or it could be "yes, it is," because we deliberately ordered the list at random.

Then in that case, we would use the simple random sampling variance formula.

The (1- f) s squared over n.

In this particular case, there were actually 60 blocks on the list and

we drew a sample of six of them.

And so our finite population correction is 1 minus 6 over 60, which is 1 minus 0.1, or 0.9.

We're going to divide by n, which is 6 and then compute an s squared.

And there's a computing formula here for s squared.

But the bottom line is that we end up with a sampling variance of 34.0. The square root of that is a little bit less than 6, and that's what we would apply to our mean of 24.

Now, this is a very small sample, but you get the basic idea.

We have made an assumption, or, because we deliberately manipulated the list order, we've taken advantage of that manipulation to choose the simple random sampling model.
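The arithmetic just described can be reproduced directly; the six rental-unit counts are inferred from the paired differences quoted later in the lecture (23, 25, 42, 0, 16, 37):

```python
# Simple random sampling variance for the block example: n = 6 blocks
# from a list of N = 60, rental-unit counts as quoted in the lecture.
from statistics import mean, variance

rentals = [23, 25, 42, 0, 16, 37]
n, N = len(rentals), 60
f = n / N                                    # 0.1
ybar = mean(rentals)                         # about 23.83
var_ybar = (1 - f) * variance(rentals) / n   # about 34.0
se = var_ybar ** 0.5                         # a little less than 6
```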

15:52

That's probably what happened in this particular case: someone had a list of the blocks, and they were in that order, and they simply did a systematic sample, taking every tenth after a random start.

The paired selection model would be most appropriate here.

The paired selection model variance estimate is shown here, in which there is a 1 minus f: 1 minus 6 over 60, or 0.9. There's the division by the sample size squared, 1 over 6 squared, or 36.

And then there are three paired differences. The first sample value for the first block and then the second block, the difference of those, squared. So that's 23 minus 25 from our list, or minus 2 squared, which is 4. The third minus the fourth: 42 minus 0, squared. Now that's a big number, 1,764. And then the final one: 16 minus 37, squared.

Now this is just to illustrate the calculation.
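Reproducing that paired-difference arithmetic with the six rental counts quoted in the lecture, pairing successive selections:

```python
# Paired-difference variance estimate: (1 - f) / n^2 times the sum of
# squared differences within successive non-overlapping pairs.
rentals = [23, 25, 42, 0, 16, 37]
n, N = len(rentals), 60
f = n / N
pair_sq_diffs = [(rentals[i] - rentals[i + 1]) ** 2 for i in range(0, n, 2)]
# the three squared differences: 4, 1764, 441
var_ybar = (1 - f) * sum(pair_sq_diffs) / n ** 2
```

With these numbers the estimate works out to 0.9 times 2,209 over 36, about 55.2, driven largely by the big 42-versus-0 pair.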

17:54

Okay, so uncertainty estimation here is a matter of calculating a standard error, a matter of calculating a variance. Of the four techniques that we've described, one involves multiple random starts, and the other three involve a model assumption: simple random, stratified random, or paired differences for continuous ordering.

We're going to apply one of those to the data to obtain a standard error and

thereby a confidence interval.

All right, that's all we have time to cover for systematic sampling.

Our last unit will be some extensions and application of some of these things.

One of the topics that we talked about in previous units had to do with weighting,

particularly under stratified sampling.

But we're going to return to that, to talk about weighting now, as we did then, for under- and oversampling.

We're going to add to that some discussion about weighting for nonresponse and noncoverage, and combining the over- and undersampling weighting with nonresponse and noncoverage weighting to get a final weight.

18:57

We also need to talk a little bit about stratified cluster sampling and

then move into variance estimation built around stratified cluster samples.

We won't say a lot about stratified cluster samples beyond putting together two of our techniques. But we will talk about variance estimation and the software that goes with it, software that takes these design features into account, as we've been doing.

And as long as we're talking about software, we'll also have a brief lecture about software for sample selection. When we push the button, what will a piece of software do to draw a selection? What code might we use, and what does it actually give us?

And then finally, one last topic in our kind of extensions,

what happens when we're dealing with networks and we sample them by

either sampling elements in the networks or sampling nodes in the networks?

Okay, we're either going to sample connectors between the elements, or

we're going to sample the nodes and their interconnections.

And there is a set of techniques referred to as multiplicity sampling that can be used in that particular case. We have two demonstrations to talk about before we wrap up.

Join us then for our concluding unit, extensions and

applications, as we wrap up our course on sampling people, records, and networks.

Thank you.