
Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution.

So take the median: we actually know a lot about the sampling distribution of the median, but let's suppose you don't know anything about it. Then you'd say, well, I'm stuck. Well, the bootstrap principle says: why don't we just take the distribution defined by the data to approximate the sampling distribution of the median?

So what's the distribution defined by the data? The distribution defined by the data, if you're not putting any constraints on it, puts probability 1/N on every data point. That is a distribution; it's a discrete distribution. Right?

It's a discrete distribution that puts probability 1/N on each data point. It's a weird distribution, because the data points generally have fractional values, so it's not like a die roll, where the outcomes sit on the integers. But nonetheless it's a nice discrete distribution, and its mean is, of course, the sample mean.

So, at any rate, this empirical distribution is a reasonable distribution to work with. So why don't we look at the sampling distribution of the median based on the empirical distribution from our data, where we place probability 1/N on each data point?

And that's the bootstrap principle. It's saying: well, I don't know what the population distribution is, and without that, and without some tool like the central limit theorem, I can't figure out what the distribution of my test statistic is. So why don't I take my empirical distribution and figure out what the distribution of sample medians looks like under that distribution? Because I know that distribution, right? It's just a distribution that puts 1/N on each data point, and I can work with that.

So that's a really nifty principle, and it can be executed in a parametric way or a non-parametric way. Today we'll be talking about the non-parametric bootstrap, where the empirical distribution that we work with just places 1/N on each data point.

If we want to figure out things about this distribution, it's actually not very convenient to work with directly. Say you have 100 observations, and you're saying: the distribution I'm working with now places probability 1/100 on every data point, and I want to know the distribution of the median of 100 observations drawn from it. That's actually kind of a hard thing to work with; you couldn't work it out with pen and paper. Maybe if you had, say, five observations, then there are only 126 distinct resamples to enumerate, but in general it's hard. So people said: why don't we just use simulation? And then the process of bootstrapping has this kind of interesting resampling interpretation.

So what do people wind up doing? The general procedure works by taking the data set and simulating complete data sets, with replacement. So we put all of our data points in a bag. Let's say we have 100 data points; we pull out a new data set of 100 data points, but we sample with replacement. Once we pull out a point and record it, we put that observation back in the bag, mix it up again, and pull out another point, which could possibly be that same point again. If we sampled without replacement, we would just get a permutation of our original data set. So the bootstrap distribution is sampling with replacement.

And what is that? That is exactly drawing IID samples from a distribution that places probability 1/N on each data point. And for every one of these resamples, that's the idea: you're creating fictitious data sets by resampling from the observed data. That's why it's called a resampling procedure.
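This equivalence, sampling with replacement from the data versus drawing i.i.d. from the distribution that puts mass 1/N on each observed point, is easy to check by simulation. Here is a minimal Python sketch; the data values are made up for illustration:

```python
import random
from collections import Counter

random.seed(1)

data = [2.3, 4.1, 5.0, 6.2, 7.7]   # five made-up observations
n_draws = 100_000

# random.choices with uniform weights draws i.i.d. from the empirical
# distribution: each draw picks every data point with probability 1/N.
draws = random.choices(data, k=n_draws)

counts = Counter(draws)
for x in data:
    observed = counts[x] / n_draws
    # Long-run frequency of each point should be close to 1/N = 0.2.
    assert abs(observed - 1 / len(data)) < 0.01
```

The same call with `k=len(data)` is exactly one bootstrap resample: a with-replacement draw of N points from the observed data.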

So let's see: we have a hundred observations. We draw, with replacement, a complete data set of size 100 from our observed data. From each of those complete data sets of size 100, we calculate the statistic that we're interested in, say, for example, the median. So we would calculate the median for every resample: draw a sample of size 100, calculate the median; redraw a sample of size 100, calculate the median; redraw a sample of size 100, calculate the median; and repeat that process over and over again. Let's say we did that 10,000 times. We would get 10,000 resampled medians, and we would use those 10,000 resampled medians to approximate the sampling distribution of the sample median.
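The resampling loop just described takes only a few lines. Here is a Python sketch; since the actual data set isn't reproduced in the transcript, it uses simulated stand-in observations:

```python
import random
import statistics

random.seed(42)

# Stand-in data: 100 simulated observations (the lecture's real data
# set is not shown here).
data = [random.gauss(50, 10) for _ in range(100)]

B = 10_000   # number of bootstrap resamples
boot_medians = []
for _ in range(B):
    resample = random.choices(data, k=len(data))   # draw 100 with replacement
    boot_medians.append(statistics.median(resample))

# boot_medians now approximates the sampling distribution of the median.
```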

And that seems to make a lot of sense. The first time you encounter this, you may think it's the strangest thing you've ever heard of as a statistical procedure, but it does make a lot of sense. So let me drone on about this for another couple of minutes.

Think about how, if you actually knew the distribution and could sample from it, you would get to know the distribution of, say, something like the median, if you couldn't do the math. For most statistics you can't do the math; even mathematicians can't do the math.

So let's say, for example, I create a distribution. Let's say I have a 100-sided die, and that 100-sided die has probability 1/100 for each side. And let's make it at least a little bit interesting: the die is shaped a little bit funny, so it's not exactly probability 1/100 for every number. So you have a distribution on the numbers between one and 100, and you want to know the sampling distribution of the median of ten rolls of this 100-sided die.

Well, that's a hard problem; it's difficult to see how you would work it out with pen and paper. But what you could do is roll the die ten times, get a sample median, and record it. Then roll the die ten times again, get a sample median, and record it. Then roll the die ten times again, get a sample median, and record it. You could do that thousands of times if you had the patience, say while you were waiting for something. And if you did it enough times, you would get, for all practical purposes, exactly the distribution of the sample median.
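The die-rolling experiment is easy to mimic in code. Below is a Python sketch; the slightly uneven face probabilities are invented purely to make the die "funny-shaped":

```python
import random
import statistics

random.seed(7)

faces = list(range(1, 101))
# Invented perturbation: faces are close to, but not exactly, equally likely.
weights = [1.0 + 0.002 * f for f in faces]

def median_of_ten_rolls():
    """Roll the weighted 100-sided die ten times; return the sample median."""
    rolls = random.choices(faces, weights=weights, k=10)
    return statistics.median(rolls)

# Repeating the ten-roll experiment many times traces out the sampling
# distribution of the median of ten rolls.
sample_medians = [median_of_ten_rolls() for _ in range(10_000)]
```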

Â 7:34

Okay, so you would get the distribution of the sample median of ten die rolls. And if you wanted to know the sampling distribution of the median of twenty die rolls, well, you'd roll the die twenty times, get a sample median, and repeat that process over and over again, and that would do it for you. Okay, so now we know how, if we can actually sample from the population distribution over and over and over again, we would get the sampling distribution of a statistic.

But when confronted with real data, we can't roll the die. Right? We don't know what the population distribution is, so we can't do it. But what we can do is roll a die where every side has been labeled with the number associated with an observed data point. Then we're not drawing from the population distribution; we're drawing from the empirical distribution.

Okay, so say we had ten data points and we want to know what the distribution of the sample median of ten observations is. We can't draw from the population distribution, but what we can do is draw samples of size ten from the distribution defined by the data we observed, and look at what the distribution of the sample median is for those.

And that is exactly what the bootstrap does in practice, via resampling. It basically says: well, we know exactly what we would do if we actually knew the population distribution. Why don't we just do that, using the empirical distribution instead, and see how it works? And it's a really nifty idea.

So again, let's take our 630 measurements of grey matter volume from workers at a lead manufacturing plant. The median grey matter volume is about 589 cubic centimeters, and we want a confidence interval for the median of these measurements. How do we do that?

So here's our bootstrap procedure for calculating a confidence interval for the median of a data set of N observations, where we know nothing about the sampling distribution of medians of N observations.

We would sample N observations, with replacement, from the observed data, resulting in one simulated complete data set. We would take the median of this simulated complete data set. That gives us one bootstrap resample and one bootstrap resampled median. Then we would repeat that step B times, say, resulting in B simulated medians of N observations, those N observations having been drawn with replacement from the collection of observed data.

These medians are, let's say, approximately draws from the sampling distribution of the median of N observations. More precisely, they're exactly draws from the sampling distribution of the median of N observations from the distribution of the observed data, but we're going to say that's approximately equal to the sampling distribution of the median of N observations drawn from the population distribution.

That's the leap of faith we're making: that this bootstrap process approximates what we would get if, instead of drawing from the observed data, we were drawing from the actual population distribution. And we could take these B sample medians and draw a histogram of them. Then, say we wanted a 95% confidence interval: why not take the 2.5th and 97.5th percentiles and call that a confidence interval for the median? That's exactly the so-called bootstrap percentile confidence interval.
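That percentile interval is just two order statistics of the B resampled medians. Here is a stdlib-only Python sketch; the data are simulated stand-ins for the 630 grey matter volumes, which aren't reproduced in the transcript:

```python
import random
import statistics

random.seed(0)

# Stand-in data: 630 simulated volumes, centered near the lecture's
# reported median of roughly 589 (the real measurements aren't shown here).
data = [random.gauss(589, 20) for _ in range(630)]

B = 1000
boot_medians = [
    statistics.median(random.choices(data, k=len(data))) for _ in range(B)
]

# Cutting the resampled medians into n=40 quantile groups gives cut points
# at 2.5%, 5%, ..., 97.5%; the first and last form the percentile interval.
qs = statistics.quantiles(boot_medians, n=40)
lo, hi = qs[0], qs[-1]
print(f"95% percentile bootstrap interval: ({lo:.1f}, {hi:.1f})")
```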

So it's hard to describe, and I know I'm butchering it; if I were Efron, I'd be doing a much better job, but unfortunately you have me and not Efron, and it's difficult to describe, for me at least. On the next page I'm showing you the R code for doing this. I've neatened up the R code a little bit, so it's probably a little longer than it needs to be; you could do this in about four lines.

So here B is my number of bootstrap resamples. I said let's just do it a thousand times, but you want to set this number B big enough that you don't have to worry about the error in your Monte Carlo resampling: you don't want the number of times you rolled the die, so to speak, to be a factor in your results. So here I did 1,000, but you can crank it up until you're tired of waiting. There is a science to how you pick B, but we're not going to talk about it in this class. And N is the number of observations that I have. Okay.

Then the resampling code right here just draws with replacement from the collection of N observations; it draws B complete data sets of size N from that distribution. The replace = TRUE argument means that we're sampling with replacement. Then I dump these resamples into a matrix, so that every row is a complete data set: there are B rows and N columns. Then I go through every row and calculate its median in the next line. That gives B medians, where each median was obtained from a resample of N observations from the observed data. And if you take the standard deviation of these medians, that is a bootstrap estimate of the standard deviation of the sampling distribution of the median.
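That standard-deviation step looks like this in a stdlib Python sketch (again with simulated stand-in data, since the slide's R code isn't reproduced in the transcript):

```python
import random
import statistics

random.seed(3)

# Stand-in data: 630 simulated grey matter volumes.
data = [random.gauss(589, 20) for _ in range(630)]

B = 1000
boot_medians = [
    statistics.median(random.choices(data, k=len(data))) for _ in range(B)
]

# The standard deviation of the resampled medians is the bootstrap
# estimate of the standard error of the sample median.
se_median = statistics.stdev(boot_medians)
```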

If you take the 2.5th and 97.5th quantiles, you get 582 to 595. That is a bootstrap confidence interval for the median of grey matter volumes, constructed in the non-parametric way. And it's always informative in the bootstrap to plot a histogram of your resampled statistics, in this case medians.

Okay, so here is my histogram of my resampled medians. The 2.5th and 97.5th quantiles of my bootstrap resampled medians are drawn here as dashed lines, so 95% of my resampled medians lie between these two lines, and we're going to call that a bootstrap confidence interval.

Now, I'm going to give you some notes on the bootstrap. For both the bootstrap and the jackknife, today's lecture is really just a teaser. As you can probably guess from my description, they're sufficiently difficult techniques that you don't want to take these lectures as enough knowledge to run out and use them willy-nilly. I just wanted to give you a teaser, so that if you hear the terms, you know what people are talking about.

So the bootstrap, the one that I described today, is non-parametric. It makes very few assumptions about the population distribution. The theoretical arguments proving the validity of the bootstrap tend to rely on large samples, so there's a question about when and how you can apply it, but I find it to be a very handy tool in general.

The confidence interval procedure that I gave you, these percentile confidence intervals, they're not very good. You can improve on bootstrap confidence intervals by correcting the endpoints of the intervals. The procedure I would recommend is the so-called BCa confidence interval; the bootstrap package in R will calculate these for you directly if you like. That's what I mean when I say here that better percentile bootstrap confidence intervals correct for bias. And then there are lots and lots of variations on the bootstrap procedure: there's parametric bootstrapping; there's bootstrapping for time series, where you have to do something different; there are all sorts of different ways to think about the bootstrap and data resampling in general.

And the book An Introduction to the Bootstrap, by Efron and Tibshirani, is, for anyone who has taken this class and absorbed the material, at a level that you should be able to understand. It's beautifully written; it's a wonderful treatment of the subject. In addition, there are lots and lots of other books on the topic of the bootstrap, probably too many good ones to name: some of them unbelievably theoretical, others quite accessible. I think the Efron and Tibshirani book strikes a very nice balance between telling you why things work and showing you how to do things. It also covers the jackknife and other data resampling procedures.

The last thing I wanted to mention: I gave you the exact code that you could use to generate the bootstrap sampling distribution for yourself. You could, of course, use the bootstrap package in R, which in this case takes about as many lines of code as programming it up yourself, and on this last slide I go through actually using the bootstrap package. The nice thing about the bootstrap package is that it will give you the bias-corrected interval. In this case, you can see that the bias-corrected interval is nearly identical to the percentile interval, so it didn't make a big difference.

But you can, of course, come up with instances where the bias-corrected interval is a little bit better.

So that's the end of today's lecture. That was a teaser on the idea of bootstrap resampling and a little bit on the use of the jackknife. I hope this inspired you to go learn a little bit more about these tools; they are among the wide class of tools that became available as modern computing came about. The idea of being able to use our data, especially when we have large data sets, to use the data more fully, to come up with things like sampling distributions instead of relying on mathematics and assumptions: it was a neat idea brought about by the computational revolution, and it's a very nifty technique.

Well, next time will be our last lecture, and I look forward to talking about it with you.
