So now let's get to the bootstrap. The bootstrap is a tremendously useful tool. It's useful for creating confidence intervals, standard errors, basically anything that involves the distribution of a statistic: when you don't know that distribution, the bootstrap is an incredibly useful thing to use. As an example, we said on the previous page that the jackknife standard error and bias calculations for the median didn't appear to be working that well, and maybe that raises some question about the jackknife standard error calculation for the median. Well, how about deriving a confidence interval for the median? Or say you come up with a generic statistic: you want to look at the log difference of two variances. How do you come up with a confidence interval for the log difference of two variances? Well, the bootstrap procedure follows from the so-called bootstrap principle, and you can do things like create confidence intervals for parameters based on difficult-to-work-with statistics. The bootstrap principle goes along the following lines. Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution. Take the median: we actually know a lot about the sampling distribution of the median, but let's suppose you don't know anything about it. Then you'd say, well, I'm stuck. The bootstrap principle says: why don't we just take the distribution defined by the data to approximate the sampling distribution of the median? So what's the distribution defined by the data? The distribution defined by the data, if you're not putting any constraints on it, puts probability one over N on every data point. That is a distribution; it's a discrete distribution. Right?
It's a discrete distribution that puts probability one over N on each data point, and it's a weird distribution, because the data points, of course, generally have fractional values. It's a weird kind of discrete distribution in that it's not, like a die roll, supported on the integers, but nonetheless it's a nice discrete distribution, and its mean is of course the sample mean. At any rate, this empirical distribution is a reasonable distribution to work with, so why don't we look at the sampling distribution of the median based on the empirical distribution from our data, where we place probability one over N on each data point? That's the bootstrap principle. It says: I don't know what the population distribution is, and without that, and without some tool like the central limit theorem, I can't figure out what the distribution of my test statistic is. So why don't I take my empirical distribution and figure out what the distribution of sample medians looks like from that distribution? Because I know that distribution: it's just the one that puts one over N on each data point, and I can work with that. That's a really nifty principle, and it can be executed in a parametric way or a non-parametric way. Today we'll be talking about the non-parametric bootstrap, where the empirical distribution that we work with just places one over N on each data point. Now, if we want to figure out things about this distribution analytically, it's actually not very convenient to work with. Say you have 100 observations. The distribution I'm working with now places probability one over 100 on every data point, and I want to know what the distribution of medians of 100 observations from that distribution is. That's actually a hard thing to work out; you couldn't do it with pen and paper. Maybe if you had,
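As a quick check of that last claim, here is a minimal sketch in Python (the lecture's own code is in R) showing that the empirical distribution, which puts probability 1/N on each observed point, has the sample mean as its mean. The data values are made up purely for illustration:

```python
import statistics

# Hypothetical data, purely for illustration.
x = [3.1, 4.7, 5.9, 2.6, 5.3]
n = len(x)

# The empirical distribution: probability 1/n on each observed point.
empirical = [(xi, 1 / n) for xi in x]

# Its expected value is exactly the sample mean.
mean_of_empirical = sum(value * prob for value, prob in empirical)
assert abs(mean_of_empirical - statistics.mean(x)) < 1e-12
```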
you know, five observations; then there are only 5^5 = 3,125 possible resamples to enumerate. But in general it's hard to work with, so people said: why don't we just use simulation? And then the process of bootstrapping has this interesting resampling interpretation. So what do people wind up doing? The general procedure works by taking the data set and simulating complete data sets, with replacement. We put all of our data points in a bag; let's say we have 100 data points, and we pull out a new data set of 100 data points. But we sample with replacement: once we pull out a point and record it, we put that observation back in the bag, mix it up again, and pull out another point, which could possibly be that same point again. If we sampled without replacement, we would just get a permutation of our original data set. So the bootstrap distribution comes from sampling with replacement. And what is that? That is exactly drawing IID samples from a distribution that places probability one over N on each data point. That's the idea: it's resampling from the observed data. You're creating fictitious data sets by resampling from the observed data; that's why it's called a resampling procedure. So let's see: we have a hundred observations. We draw, with replacement, a complete data set of size 100 from our observed data. From each of those complete data sets of size 100 we calculate the statistic that we're interested in, say, the median. So we draw a sample of size 100 and calculate the median. Redraw a sample of size 100, calculate the median. Redraw a sample of size 100, calculate the median. And we repeat that process over and over and over again.
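To make the "bag of data points" concrete, here is a minimal sketch in Python (the lecture's code is in R). The data are simulated stand-ins, and `random.choices` does the drawing with replacement:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Simulated stand-in for 100 observed data points.
data = [random.gauss(0.0, 1.0) for _ in range(100)]
n = len(data)

# One bootstrap resample: n draws from the data *with* replacement.
# This is exactly an IID sample from the distribution that puts
# probability 1/n on each observed point.
resample = random.choices(data, k=n)

# Every resampled value is one of the original observations...
assert all(point in data for point in resample)
# ...and sampling with replacement repeats some points, so a
# resample is not just a permutation of the original data.
assert len(set(resample)) < n
```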
Let's say we did that 10,000 times. We would get 10,000 resampled medians, and we would use those 10,000 resampled medians to talk about the empirical distribution of the sample median. And that seems to make a lot of sense. The first time you encounter this, you may think it's the strangest thing you've ever heard of as a statistical procedure, but it does make a lot of sense. So let me drone on about this for another couple of minutes. Think about, if you actually knew the distribution and could sample from it, how you would get to know the distribution of something like the median, if you couldn't do the math, which, for most statistics, you can't; even mathematicians can't do the math. Let's say, for example, I create a distribution: I have a 100-sided die, and that die has probability roughly one over 100 for each side. And let's make it at least a little bit interesting: the die is shaped a little funny, so it's not exactly probability one over 100 for every number. So you have a distribution on the numbers between one and 100, and you want to know: what's the sampling distribution of the median of ten rolls of this 100-sided die? Well, that's a hard problem. It's difficult to think about how you would work that out with pen and paper. But what you could do is roll the die ten times, get a sample median, and record it. Then roll the die ten times again, get a sample median, and record it. Then roll the die ten times again, get a sample median, record it. And you could do that thousands of times if you had the patience; let's say you were waiting for something. And if you did it enough, you would get, to within simulation error, the distribution of the sample median of ten die rolls.
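The die-rolling experiment is easy to simulate. Here is a minimal sketch in Python (the lecture's code is in R); the weights giving the die its "funny shape" are invented for illustration:

```python
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

# A lopsided 100-sided die: sides 1..100 with slightly unequal weights.
sides = list(range(1, 101))
weights = [1.0 + 0.005 * s for s in sides]  # made-up "funny shape"

def median_of_ten_rolls():
    """Roll the die ten times and return the sample median."""
    return statistics.median(random.choices(sides, weights=weights, k=10))

# Repeat the ten-roll experiment many times; the collection of medians
# approximates the sampling distribution of the median of ten rolls.
medians = [median_of_ten_rolls() for _ in range(10_000)]
print(statistics.mean(medians), statistics.stdev(medians))
```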
And if you wanted to know the sampling distribution of the median of twenty die rolls, well, you'd roll the die twenty times, get a sample median, repeat that process over and over again, and that would do it for you. Okay, so now we know, if we can actually sample from the population distribution over and over and over again, how we would get the sampling distribution of a statistic. But when confronted with real data, we can't roll the die, right? We don't know what the population distribution is, so we can't do it. But what we can do is roll a die where every side has been labeled with the number associated with an observed data point. Then we're not drawing from the population distribution; we're drawing from the empirical distribution. So if we had ten data points and we want to know the distribution of the sample median of ten observations: well, we can't draw from the population distribution, but what we can do is draw samples of size ten from the distribution defined by the data we observed, and look at what the distribution of the sample median is for those. And that is exactly what the bootstrap does in practice, via resampling. It basically says: we know exactly what we would do if we actually knew the population distribution, so why don't we just do that using the empirical distribution instead, and see how that works? It's a really nifty idea. So again, let's take our 630 measurements of grey matter volume from workers at a lead manufacturing plant. The median grey matter volume is about 589 cubic centimeters, and we want a confidence interval for the median of these measurements. How do we do that? Here's our bootstrap procedure for calculating a confidence interval for the median of a data set of N observations, where we know nothing about the sampling distribution of medians of N observations.
So, we would sample N observations with replacement from the observed data, resulting in one simulated complete data set. We would take the median of this simulated complete data set. That gives us one bootstrap resample and one bootstrap resampled median. Then we would repeat that step B times, let's say, resulting in B simulated medians of N observations, those N observations having been drawn with replacement from the collection of observed data. Then these medians are, let's say, approximately draws from the sampling distribution of the median of N observations. They're exactly draws from the sampling distribution of the median of N observations from the distribution of the observed data, but we're going to say that's approximately equal to the sampling distribution of the median of N observations drawn from the population distribution. That's the leap of faith we're making: that this bootstrap process approximates what we'd get if, instead of drawing from the observed data, we were drawing from the actual population distribution. And we could take these B sample medians and draw a histogram of them. Then, say we wanted a 95% confidence interval: why not take the 2.5th and 97.5th percentiles and call that a confidence interval for the median? That's exactly the so-called bootstrap percentile confidence interval. It's hard to describe, and I know I'm butchering it; if I were Efron I'd be doing a much better job, but unfortunately you have me and not Efron. On the next page I'm showing you the R code for doing this, and I've neatened up the R code a little bit, so it's probably a little longer than it needs to be; you could do this in about four lines. So here B is my number of bootstrap resamples.
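As an aside, the same percentile procedure can be sketched in a few lines of Python (the lecture's actual code, on the next page, is in R). The data here are simulated stand-ins for the grey matter volumes, so the particular numbers are illustrative only:

```python
import random
import statistics

def bootstrap_percentile_ci(data, b=10_000, level=0.95):
    """Percentile bootstrap confidence interval for the median:
    resample n observations with replacement b times, take the median
    of each resample, and read off the outer percentiles."""
    n = len(data)
    medians = sorted(
        statistics.median(random.choices(data, k=n)) for _ in range(b)
    )
    lo = medians[round(b * (1 - level) / 2)]
    hi = medians[round(b * (1 + level) / 2) - 1]
    return lo, hi

# Simulated stand-in for the 630 grey matter volume measurements.
random.seed(0)
volumes = [random.gauss(589, 10) for _ in range(630)]
lo, hi = bootstrap_percentile_ci(volumes)
print(lo, hi)  # a 95% percentile interval for the median
```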
I said let's just do it a thousand times, but you want to set this number B big enough that you don't have to worry about the Monte Carlo error in your resampling; you don't want the number of times you've rolled the die to be a meaningful factor in your results. So here I did 1,000, but, you know, crank it up until you're tired of waiting. There is a science to how you pick B, but we're not going to talk about it in this class. Then N is the number of observations that I have. Okay. Then the resampling code right here just draws with replacement from the collection of N observations; it draws B complete data sets of size N from that distribution. The replace = TRUE means that we're sampling with replacement. And then I dump these resamples into a matrix, so that every row is a complete data set: there are B rows and N columns. Then, for every row, I calculate the median in the next line. That gives B medians, where each median was obtained from a resample of N observations from the observed data. And then if you take the standard deviation of these medians, that is a bootstrap estimate of the standard error of the sample median. If you take the 2.5th and 97.5th quantiles, you get 582 to 595. That is a bootstrap confidence interval for the median of the grey matter volumes, computed in the non-parametric way. And it's always informative in the bootstrap to plot a histogram of your resampled statistics, in this case medians. So here is my histogram of my resampled medians. The 2.5th and 97.5th quantiles of my bootstrap resampled medians are drawn here in dashed lines, so 95% of my resampled medians lie between these two lines, and we're going to call that a bootstrap confidence interval. Now, I'm going to give you some notes on the bootstrap.
So, for both the bootstrap and the jackknife, today's lecture is really just a teaser. As you can probably guess from my description, they're sufficiently subtle techniques that you don't want to take these lectures as enough knowledge to run out and use them willy-nilly. I just wanted to give you a teaser, so that if you hear the terms you know what people are talking about. The bootstrap, the one I described today, is non-parametric: it makes very few assumptions about the population distribution. The theoretical arguments proving the validity of the bootstrap tend to rely on large samples, so there's a question about when and how you can apply it, but I find it to be a very handy tool in general. The confidence interval procedure that I gave you, these percentile confidence intervals, are not very good. You can improve on bootstrap confidence intervals by correcting the endpoints of the intervals, and the procedure I would recommend is the so-called BCa confidence interval; the bootstrap package in R will calculate these for you directly if you like. That's what I mean when I say here that better percentile bootstrap confidence intervals correct for bias. And then there are lots and lots of variations on bootstrap procedures. There's parametric bootstrapping. There's bootstrapping for time series; you have to do something different when bootstrapping time series. There are all sorts of different ways to think about the bootstrap, and data resampling in general. And the book An Introduction to the Bootstrap by Efron and Tibshirani is, for anyone who's taken this class and absorbed the material, at a level that you should be able to understand. It's beautifully written; it's a wonderful treatment of the subject. And in addition to this, there are lots and lots of other books on the topic of the bootstrap, probably too many good ones to name.
Some of them are unbelievably theoretical, and others quite accessible. I think the Efron and Tibshirani book strikes a very nice balance between giving you the why-things-work and the how-to-do-things. It also covers the jackknife and other data resampling procedures. The last thing I wanted to mention: I gave you the exact code that you could use to generate the bootstrap sampling distribution for yourself. You could, of course, use the bootstrap package in R, which in this case takes about as many lines of code as programming it up yourself, and on this last slide I go through actually using the bootstrap package. The nice thing about the bootstrap package is that it will also give you the bias-corrected interval. In this case you can see that the bias-corrected interval is nearly identical to the percentile interval, so it didn't make a big difference, but you can, of course, come up with instances where the bias-corrected interval is a fair bit better. So that's the end of today's lecture. That was a teaser on the idea of bootstrap resampling, and a little bit on the use of the jackknife. I hope this inspired you to go learn a little more about these tools. They're among the wide class of tools that became available as modern computing came about: the idea of being able to use our data, especially when we have large data sets, more fully, and to use the data to come up with things like sampling distributions instead of relying on mathematics and assumptions and that sort of thing. It was a neat idea brought about by the computational revolution, and it's a very nifty technique. Well, next time will be our last lecture, and I look forward to talking about it with you.