
Suppose that I have a statistic that estimates some population parameter.

But I don't know its sampling distribution.

So, take the median: we actually know a lot about the sampling distribution of the median, but let's suppose you don't know anything about it.

Then you'd say, well, I'm stuck. But the bootstrap principle says: why don't we just take the distribution defined by the data to approximate the sampling distribution of the median? So what's the distribution defined by the data? The distribution defined by the data, if you're not putting any constraints on it, puts probability one over N on every data point. That is a distribution; it's a discrete distribution. Right?

It's a discrete distribution that puts probability one over N on each data point, and it's a weird distribution, because, you know, the data points of course generally have fractional amounts, so it's a weird kind of discrete distribution in that it's not, sort of, like a die roll on the integers. But nonetheless it's a nice discrete distribution, and its mean is, of course, the sample mean. So, at any rate, this empirical distribution is a reasonable distribution to work with, and so why don't we look at the sampling distribution of the median based on the empirical distribution from our data, where we place probability one over N on each data point.

And that's the bootstrap principle. It's saying: well, I don't know what the population distribution is, and without that, and without some tool like the central limit theorem, I can't figure out what the distribution of my test statistic is. Well, why don't I take my empirical distribution and figure out what the distribution of sample medians looks like from that distribution? Because I know that distribution, right? It's just a distribution that puts one over N on each data point, and I can work with that.

So that's a really nifty principle, and it can be executed in a parametric way or a nonparametric way. Today we'll be talking about the nonparametric bootstrap, where the empirical distribution that we work with just places probability one over N on each data point.
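To make this concrete, here is a small Python sketch of the empirical distribution that puts probability one over N on each data point; the lecture's own code is in R, and the data values here are made up purely for illustration. It also checks the point above that the empirical distribution's mean is exactly the sample mean:

```python
import random
import statistics

data = [3.1, 4.7, 2.2, 5.9, 4.1]  # hypothetical observations
n = len(data)

# The empirical distribution puts probability 1/n on each observed point,
# so its mean is a weighted sum that collapses to the ordinary sample mean.
empirical_mean = sum(x * (1.0 / n) for x in data)
assert abs(empirical_mean - statistics.mean(data)) < 1e-12

# Drawing from the empirical distribution is just sampling the data
# with replacement.
random.seed(1)
draw = random.choices(data, k=n)
```

Note that every draw is one of the original data points, which is exactly what a discrete distribution on the observed values means.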

If we want to figure out things about this distribution, it's actually not very convenient to work with. Say you have 100 observations, and you're saying: the distribution I'm working with now places probability one over 100 on every data point, and I want to know what the distribution of medians of 100 observations from that distribution is. That's actually kind of a hard thing to work with; you couldn't work that out with pen and paper. Maybe if you had, you know, five observations, then there's only a small number of combinations to work with, but in general it's hard to do exactly. So people said: forget about it, why don't we just use simulation? And then the process of bootstrapping has this kind of interesting resampling interpretation.

So what do people wind up doing? The general procedure works by taking the data set and simulating complete data sets, with replacement. We put all of our data points in a bag; say we have 100 data points, and we pull out a new data set of 100 data points, but we sample with replacement. So once we pull out a point and record it, we put that observation back in the bag, mix it up again, and pull out another point, which could possibly be that same point again. If we sampled without replacement, we would just get a permutation of our original data set. So the bootstrap distribution is sampling with replacement. And what is that? That is exactly drawing IID samples from a distribution that places probability one over N on each data point. That's the idea: for every one of these resamples, you're resampling from the observed data. You're creating fictitious data sets by resampling from the observed data. That's why it's called a resampling procedure.

So let's say we have a hundred observations. We draw, with replacement, a complete data set of size 100 from our observed data. From each of those complete data sets of size 100 we calculate the statistic we're interested in, say, for example, the median. So we draw a sample of size 100 and calculate the median; redraw a sample of size 100 and calculate the median; redraw a sample of size 100 and calculate the median; and we repeat that process over and over and over again. Let's say we did that 10,000 times. We would get 10,000 resampled medians, and we would use those 10,000 resampled medians to talk about the empirical distribution of the sample median.
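That resampling loop can be sketched in a few lines of Python; this is a hedged stand-in for the lecture's R code, and the Gaussian data here is only a placeholder for real observations:

```python
import random
import statistics

random.seed(42)
data = [random.gauss(0, 1) for _ in range(100)]  # stand-in for 100 observations

B = 10_000  # number of bootstrap resamples
resampled_medians = []
for _ in range(B):
    # One resample: 100 draws with replacement from the observed data.
    resample = random.choices(data, k=len(data))
    resampled_medians.append(statistics.median(resample))

# resampled_medians now approximates the sampling distribution of the median.
```

Each resampled median is built only from observed values, so the whole collection stays inside the range of the data.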

And that seems to make a lot of sense. The first time you encounter this you may think it's the strangest thing you've ever heard in terms of a statistical procedure, but it does make a lot of sense. So let me drone on about this for another couple of minutes. Think about how, if you actually knew the distribution and could sample from it, you would get to know the distribution of, say, something like the median, if you couldn't do the math. And for most statistics you can't do the math; even mathematicians can't do the math.

So let's say, for example, I create a distribution. Say I have a 100-sided die, and that 100-sided die has probability one over 100 for each side. And let's make it at least a little bit interesting: the die is shaped a little funny, so it's not exactly probability one over 100 for every number. So you have a distribution on the numbers between one and 100, and you want to know: what's the sampling distribution of the median of ten rolls of this 100-sided die? Well, that's a hard problem; it's difficult to think about how you would work that out with pen and paper. But what you could do is roll the die ten times, get a sample median, and record it; then roll the die ten times again, get a sample median, and record it; and then roll the die ten times again, get a sample median, and record it. And you could do that thousands of times if you had the patience, let's say, you know, you were waiting for something. And if you did it enough, you would get the distribution of the sample median to whatever accuracy you like.


Okay, so you would get the distribution of the sample median of ten die rolls. And if you wanted to know the sampling distribution of the median of twenty die rolls, well, you'd roll the die twenty times, get a sample median, and repeat that process over and over again, and that would do it for you. Okay, so now we know how, if we could actually sample from the population distribution over and over and over again, we would get the sampling distribution of a statistic.

But when confronted with real data, we can't roll the die. Right? We don't know what the population distribution is, so we can't do it. But what we can do is roll a die where on every side we've put the number associated with an observed data point; then we're not drawing from the population distribution, we're drawing from the empirical distribution. Okay, so say we have ten data points and we want to know what the distribution of the sample median of ten observations is. We can't draw from the population distribution, but what we can do is draw samples of size ten from the distribution defined by the data we observed, and look at what the distribution of the sample median is for those.

And that is exactly what the bootstrap does, executed in practice via resampling. It basically says: we know exactly what we would do if we actually knew the population distribution, so why don't we just do that, using the empirical distribution in its place, and see how that works? It's sort of a really nifty idea.

So again, let's just take our 630 measurements of grey matter volume from workers at a lead manufacturing plant. The median grey matter volume is about 589 cubic centimeters, and we want a confidence interval for the median of these measurements. How do we do that?

So here's our bootstrap procedure for calculating a confidence interval for the median of a data set of N observations, where we know nothing about the sampling distribution of medians of N observations.

First, we sample N observations with replacement from the observed data, resulting in one simulated complete data set. We take the median of this simulated complete data set; that gives us one bootstrap resample and one bootstrap resampled median. Then we repeat that step B times, let's say, resulting in B simulated medians of N observations, those N observations having been drawn with replacement from the collection of observed data. These medians are approximately draws from the sampling distribution of the median of N observations. More precisely, they're exactly draws from the sampling distribution of the median of N observations from the distribution of the observed data, but we're going to say that's approximately equal to the sampling distribution of the median of N observations drawn from the population distribution.

That's the leap of faith we're making: that this bootstrap process approximates what we would get if, instead of drawing from the observed data, we were drawing from the actual population distribution. We could then take these B sample medians and draw a histogram of them. And say we want a 95% confidence interval: why not take the 2.5th and 97.5th percentiles of these medians and call that a confidence interval for the median? That's exactly the so-called bootstrap percentile confidence interval.
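The whole procedure fits in a few lines. Here is a hedged Python sketch of a percentile bootstrap interval for the median; the function name and the simulated data are my own illustration, not the lecture's R code:

```python
import random
import statistics

def percentile_bootstrap_ci(data, stat=statistics.median, B=1000, alpha=0.05):
    # Resample B times with replacement, compute the statistic each time,
    # then take the alpha/2 and 1 - alpha/2 quantiles of the results.
    stats = sorted(stat(random.choices(data, k=len(data))) for _ in range(B))
    lo = stats[int((alpha / 2) * B)]
    hi = stats[int((1 - alpha / 2) * B) - 1]
    return lo, hi

random.seed(0)
data = [random.gauss(589, 10) for _ in range(630)]  # stand-in for the 630 volumes
lo, hi = percentile_bootstrap_ci(data)
```

Passing a different statistic for `stat` gives a percentile interval for that statistic instead, which is part of what makes the principle so general.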

So it's hard to describe, and I know I'm butchering it; if I were Efron I'd be doing a much better job of it, but unfortunately you have me and not Efron. It's difficult to describe, for me at least. On the next page I'm showing you the R code for doing this, and I've neatened it up a little bit, so it's probably a little longer than it needs to be; you could do this in about four lines.

So here B is my number of bootstrap resamples. I said let's just do it a thousand times, but you want to set this number B big enough that you don't have to worry about the error in your Monte Carlo resampling; you don't want the number of times you rolled the die to be a factor in what you're doing. So here I did 1,000, but, you know, crank it up until you're tired of waiting. There is a science to how you pick B, but we're not going to talk about it in this class. Then N is the number of observations that I have. Okay.

Then this code right here draws the resamples: it draws, with replacement from the collection of N observations, B complete data sets of size N from that distribution. The replace = TRUE means we're sampling with replacement. And then I dump these resamples into a matrix, so that every row is a complete data set: there are B rows and N columns. Then, for every row, I calculate the median in this next line. That gives B medians, where each median was obtained from a resample of N observations from the observed data. And if you take the standard deviation of these medians, that is a bootstrap estimate of the standard deviation of the sampling distribution of the median.

If you take the 2.5th and 97.5th quantiles of these medians, you get 582 to 595. That is a bootstrap confidence interval for the median of grey matter volumes, constructed in the nonparametric way.
conducted in the non parametric way. And it's always informative in the

bootstrap to plot a histogram of your re-sampled, in this case, medians.

Okay so in here is my histogram of my resampled medians.

And then the 2.5th and 97.5th quantiles of my bootstrap resampled medians are drawn

here in dashed lines, and so that 95% of my resampled medians lie between these two

lines, and so we're gonna call that a bootstrap confidence interval.

Now, I'm going to give you some notes on the bootstrap. For both the bootstrap and the jackknife, today's lecture is really just a teaser. As you can probably guess from my description, they're sufficiently difficult techniques that you don't want to take these lectures as enough knowledge to just run out and use them willy-nilly. I just wanted to give you a teaser, so that if you hear the terms you know what people are talking about.

So the bootstrap, the one that I described today, is nonparametric; it makes very few assumptions about the population distribution. The theoretical arguments proving the validity of the bootstrap tend to rely on large samples, so there's a question about when and how you can apply it, but I find it to be a very handy tool in general.

The confidence interval method that I gave you, these percentile confidence intervals, they're not very good. You can improve on bootstrap confidence intervals by correcting the endpoints of the intervals, and the procedure I would recommend is the so-called BCa confidence interval; the bootstrap package in R will calculate these for you directly if you like. That's what I mean when I say here that better percentile bootstrap confidence intervals correct for bias. And then there are lots and lots of variations on the bootstrap procedure: there's parametric bootstrapping, there's bootstrapping for time series (you have to do something different for time series), and there are all sorts of different ways to think about the bootstrap and data resampling in general.

And there's the book An Introduction to the Bootstrap by Efron and Tibshirani. For anyone who's taken this class and absorbed the material, it is at a level that you should be able to understand. It's beautifully written, and it's a wonderful treatment of the subject. In addition, there are lots and lots of other books on the topic of the bootstrap, probably too many good ones to name; some of them are unbelievably theoretical, and others are quite accessible. I think the Efron and Tibshirani book strikes a very nice balance between giving you why things work and how to do things, and it also covers the jackknife and other data resampling procedures.

The last thing I wanted to mention is that I gave you the exact code you could use to generate the bootstrap sampling distribution for yourself. You could, of course, use the bootstrap package in R, which takes about as many lines of code in this case as programming it up yourself, and on this last slide I go through actually using the bootstrap package. The nice thing about the bootstrap package is that it will actually give you the bias-corrected interval. In this case you can see that the bias-corrected interval is nearly identical to the percentile interval, so it didn't make a big difference, but you can, of course, come up with instances where the bias-corrected interval is a little bit better. So that's the end of today's lecture.

That was a teaser on the idea of bootstrap resampling and a little bit on the use of the jackknife. I hope this inspired you to go learn a little more about these tools. They are among the wide class of tools that became available as modern computing came about: the idea of using our data, especially when we have large data sets, more fully, and using the data to come up with things like sampling distributions instead of relying on mathematics and assumptions and that sort of thing. So it was a neat idea brought about by the computational revolution, and it's a very nifty technique. Well, next time will be our last lecture, and I look forward to talking about it with you.