In this video we will define sampling distributions, introduce the central limit theorem, and review the conditions required for the theorem to apply. We're also going to do some simulation demos to illustrate the central limit theorem, and talk about how it works and why it might be of use to us, without going into a theoretical proof.
Say we have a population of interest, we take a random sample from it, and based on that sample we calculate a sample statistic, for example, the mean of that sample. Then suppose we take another random sample and also calculate and record its mean. Then we do this again and again, many more times. Each one of the samples will have its own distribution, which we call a sample distribution. Each observation in these distributions is a randomly sampled unit from the population, say, a person, or a cat, or a dog, depending on what population you're studying. The values we recorded from each sample, the sample statistics, now also make up a new distribution, where each observation is not a unit from the population but a sample statistic, in this case a sample mean. The distribution of these sample statistics is called the sampling distribution. So the two terms, sample distribution and sampling distribution, sound similar, but they're different concepts.
Let's look at a more concrete example. Suppose we're interested in the average height of US women. Our population of interest is US women. We'll call the population size capital N, and our parameter of interest is the average height of all women in the US, which we denote as mu. Let's assume that we have height data from every single woman in the US. Using these data we could find the population mean; 65 inches is probably a reasonable estimate. Using the same population data we can also calculate the population standard deviation, which we would usually call sigma.
We wouldn't expect this number, sigma, to be very small, since the heights of all women in the US are probably very variable. It's possible to find a woman as short as four feet or as tall as seven feet. Then, let's assume that we take random samples of a thousand women from each state. We'll start with the first state on the alphabetical list, Alabama. We sample a thousand women from Alabama. We represent each woman in our sample with an x, and we use subscripts to keep track of the state as well as the observation number, ranging from 1 to 1000. Then we collect data from a thousand women from each of a bunch more states, including North Carolina, where I happen to be currently located, and then a bunch more, until finally we get to the last state on the alphabetical list, Wyoming. For each state, we calculate the state mean, which we denote as x bar. So now we have a data set consisting of a bunch of means, 50 to be exact, since there are 50 states. We call this distribution the sampling distribution.
The mean of the sample means will probably be around the true population mean, so roughly 65 inches as well. The standard deviation of these sample means will probably be much lower than the population standard deviation, since we would expect the average heights for the states to be pretty close to one another. For example, we wouldn't expect to find a state where the average height of a random sample of a thousand women is as low as 4 feet or as high as 7 feet. We call the standard deviation of the sample means the standard error. In fact, as the sample size n increases, the standard error will decrease: the fewer women we sample from each state, the more variable we would expect the sample means to be.
Next, we're going to illustrate what we were just talking about in terms of sampling distributions, their shapes, centers, and spreads, using an applet that simulates a bunch of sampling distributions for us.
The applet generates these given certain parameters of the population distribution, such as its shape. If you would like to play along with us, you can follow the URL on this screen.
Let's start with the default case of a normal distribution for the population, with mean 0 and standard deviation 20. Let's take samples of, say, size 45 from this population. What we can see here is that each one of these dot plots shows us one sample of 45 observations from the normal population. We can see that the center of each one of these samples is close to 0, though not exactly 0. And we can also see that the sample mean varies from one sample to another. Since these are random samples from the population, each time we reach out to the population and grab 45 observations we will not be getting the same sample, and therefore the sample means are slightly different. The standard deviation of each one of these samples should be roughly equal to the population standard deviation because, after all, each one of these samples is simply a subset of our population.
We have illustrated the first 8 samples here, but we are actually taking 200 samples from the population. We can make this a very large number, say 1,000 samples from the population. What we have at the very bottom is our sampling distribution. Each one of the sample means, once calculated, gets dropped to the lower plot, and what we're seeing here is a distribution of sample means. Since we saw that the sample means had some variability among them, the sampling distribution illustrates for us what this variability looks like. The sampling distribution, as we expected, has the same shape as the population distribution, so nearly normal. And the center of the sampling distribution, that is, the mean of the means, is close to the true population mean of 0.
However, one big difference between our population distribution up top and our sampling distribution at the bottom is the spread of these distributions. The sampling distribution at the bottom is much skinnier than the population distribution up top. While the standard deviation of the population distribution is 20, the standard error, that is, the standard deviation of the sample means, is only 2.93. The reason for this is that while individual observations can be very variable, it is unlikely that sample means are going to be very variable. If we want to decrease the variability of the sample means, which means taking samples that have more consistent means, we would want to increase our sample size.
Let's say that we increase our sample size all the way to 500. What we have here is again our same population distribution. Here we're seeing the first 8 of the 1,000 samples being taken from the population. The distributions look much more dense here because we simply have more observations: each one of these samples is a sample of 500 observations from the population. We can also see that the means are again variable, but let's check to see if they're as variable as before. The curve is indeed skinnier. So the larger the size of each sample that you are taking from the population, the less variable the means of those samples, and we can see this graphically by looking at the curve and numerically by looking at the value of the standard error.
Now, it's finally time to introduce the central limit theorem. The central limit theorem says that the sampling distribution of the mean, that is, the distribution of sample means from many samples, is nearly normal, centered at the population mean, with standard error equal to the population standard deviation divided by the square root of the sample size.
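The demo above can be approximated with a short simulation. What follows is a minimal Python sketch, not the applet's actual code: it assumes the same normal population (mean 0, standard deviation 20), draws 1,000 samples at each of the two sample sizes used above, and compares the standard deviation of the simulated sample means with the value the central limit theorem predicts, sigma divided by the square root of n. The function name, the seed, and the variable names are illustrative choices, not anything from the demo.

```python
import math
import random
import statistics


def simulate_sample_means(mu, sigma, n, num_samples, rng):
    """Draw num_samples samples of size n from Normal(mu, sigma); return their means."""
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


rng = random.Random(0)  # fixed, arbitrary seed so the run is repeatable
mu, sigma = 0.0, 20.0   # population parameters used in the demo

results = {}
for n in (45, 500):
    means = simulate_sample_means(mu, sigma, n, 1000, rng)
    results[n] = {
        "center": statistics.fmean(means),       # mean of the sample means
        "observed_se": statistics.stdev(means),  # spread of the sample means
        "predicted_se": sigma / math.sqrt(n),    # sigma / sqrt(n)
    }
    print(n, results[n])
```

The predicted standard error is about 2.98 for n = 45 and about 0.89 for n = 500. A simulated value, like the 2.93 reported by the applet, will differ slightly from the prediction because it is based on a finite number of samples.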
Note that this is called the Central Limit Theorem because it's central to much of statistical inference theory. So the central limit theorem tells us about the shape, which it says is going to be nearly normal; the center, which it says is going to be at the population mean; and the spread of the sampling distribution, which we measure using the standard error. If sigma is unknown, which is often the case (remember, sigma is the population standard deviation, and oftentimes we don't have access to the entire population to calculate this number), we use s, the sample standard deviation, to estimate the standard error. That would be the standard deviation of the one sample that we happen to have at hand. In the earlier demo, the simulation, we talked about taking many samples. But if you're running a study, as you can imagine, you would only take one sample. So it's the standard deviation of that sample that we would use as our best guess for the population standard deviation.
So it wasn't a coincidence that the sampling distribution we saw earlier was symmetric and centered at the true population mean, and that as n, the sample size, increased, the standard error decreased. We won't go through a detailed proof of why the standard error is equal to sigma over the square root of n, but understanding the inverse relationship between them is very important. As the sample size increases, we would expect samples to yield more consistent sample means, hence the variability among the sample means would be lower, which results in a lower standard error.
Certain conditions must be met for the central limit theorem to apply. The first one is independence. Sampled observations must be independent, and this is very difficult to verify.
But it is more likely to hold if we have used random sampling or random assignment, depending on whether we have an observational study, where we're sampling from the population randomly, or an experiment, where we're randomly assigning experimental units to various treatments, and, if sampling without replacement, if the sample size n is less than 10% of the population. So we've previously mentioned that we love large samples, and now we're saying that, well, we don't exactly want them to be very large. We're going to talk about why this is the case in a moment.
The other condition is related to the sample size or skew. Either the population distribution is normal or, if the population distribution is skewed or we have no idea what it looks like, the sample size is large. According to the Central Limit Theorem, if the population distribution is normal, the sampling distribution will also be nearly normal, regardless of the sample size. We illustrated this earlier when we were working with the applet, where we looked at a sample size of 45 as well as a sample size of 500, and in both instances the sampling distribution was nearly normal. However, if the population distribution is not normal, the more skewed the population distribution, the larger the sample size we need for the central limit theorem to apply. For moderately skewed distributions, n greater than 30 is a widely used rule of thumb that we're going to make use of often in this course as well. The distribution of the population is also something very difficult to verify, because we often do not know what the population looks like. That's why we're doing this investigation in the first place. But we can check it using the sample data, and assume that the sample mirrors the population. So if you make a plot of your sample distribution and it looks nearly normal, then you might be fairly certain that the parent population distribution it is coming from is nearly normal as well.
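The single-sample situation described above, where sigma is unknown and we only have one sample in hand, can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not a procedure from the course: the function name `summarize_single_sample`, the simulated heights (mean 65 inches, standard deviation 3 inches), and the population size of 50,000 are all hypothetical. It estimates the standard error as s divided by the square root of n and checks the two rules of thumb mentioned so far. Independence itself cannot be checked by code like this; it depends on how the data were collected.

```python
import math
import random
import statistics


def summarize_single_sample(sample, population_size):
    """Estimate the standard error from one sample and check two rule-of-thumb conditions."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation, standing in for sigma
    return {
        "n": n,
        "sample_mean": statistics.fmean(sample),
        "estimated_se": s / math.sqrt(n),                # s / sqrt(n)
        "under_10_percent": n < 0.10 * population_size,  # 10% condition
        "n_over_30": n > 30,                             # rule of thumb for moderate skew
    }


rng = random.Random(1)  # fixed, arbitrary seed
# Hypothetical data: 200 heights (inches) sampled from a population of 50,000 people.
heights = [rng.gauss(65, 3) for _ in range(200)]
summary = summarize_single_sample(heights, population_size=50_000)
print(summary)
```

With a sample of 500 from a population of only 1,000, the same function would report that the 10% condition is not met.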
We'll discuss these conditions in more detail in the next couple of slides. First, let's focus on the 10% condition. If sampling without replacement, n needs to be less than 10% of the population, is what we stated earlier. Why is this the case? Let's think about this for a moment. Say that you live in a very small town, where the population is only 1,000 people, and your family lives there as well, including your extended family. Say that I'm a researcher doing research on some genetic application, and I want to randomly sample some individuals from your town. Say I take a random sample of size just 10. If we're randomly sampling 10 people out of 1,000, and let's say you are included in our sample, it's going to be quite unlikely that your parents are also included in that sample, because remember, we're only grabbing 10 out of a population of 1,000. But say, on the other hand, I sampled 500 people from the 1,000 that live in your town. If in this town you live with your parents and all of your extended family, and I've already grabbed you to be in my sample, and I have 499 other people to grab, chances are I might get somebody from your family in my sample as well. You and a family member of yours are not genetically independent, because observations in the population itself are often not independent of each other. Therefore, if we grab a very big portion of the population to be in our sample, it's going to be very difficult to make sure that the sampled individuals are independent of each other. That's why, while we like large samples, we also want to keep the size of our sample to a small proportion of our population. And a good rule of thumb, if we're sampling without replacement, is that we don't grab more than 10% of the population to be in our sample.
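The small-town example can be made concrete with a quick probability calculation. The sketch below is only an illustration of the intuition: it assumes a town of 1,000 people, that you are already in the sample, and that you have 5 relatives in town (a hypothetical number, as is the function name). It computes the chance that at least one relative also ends up in the sample when sampling without replacement.

```python
import math


def prob_relative_also_sampled(population_size, sample_size, num_relatives):
    """P(at least one relative is sampled | you are sampled), sampling without replacement."""
    others = population_size - 1        # everyone in town except you
    remaining_draws = sample_size - 1   # draws left after you are picked
    non_relatives = others - num_relatives
    # Probability that every remaining draw lands on a non-relative.
    p_none = math.comb(non_relatives, remaining_draws) / math.comb(others, remaining_draws)
    return 1 - p_none


# Hypothetical family size of 5 relatives in a town of 1,000.
p_small = prob_relative_also_sampled(1000, 10, 5)
p_large = prob_relative_also_sampled(1000, 500, 5)
print(f"n = 10:  {p_small:.3f}")
print(f"n = 500: {p_large:.3f}")
```

Under these assumptions the chance is roughly 4% with a sample of 10 and roughly 97% with a sample of 500, which is the contrast the example is pointing at.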
Sampling with replacement is not something we often do in survey settings, because once I've sampled you, given you a survey, and gotten your responses, I don't want to be able to sample you again; I don't need your responses again. But if I were sampling with replacement, then the probability of sampling you versus somebody from your family would stay constant throughout all of the trials. That's why we wouldn't need to worry about the 10% condition there. But again, in realistic survey sampling situations we sample without replacement, and while we like large samples, we also do not want our samples to be any larger than 10% of our population.
And what about the sample size and skew condition? Say we have a skewed population distribution; here we have a population distribution that's extremely right skewed. When the sample size is small (here we're looking at a sampling distribution created based on samples of n = 10), the sample means will be quite variable, and the shape of their distribution will mimic the population distribution. Let's increase the sample size a bit. Now we've gone from n equals 10 to n equals 100. This decreases the standard error, and the distribution starts to condense around the mean and starts looking more unimodal and symmetric. Next, here we're looking at a sampling distribution where each of the individual samples from which the sample means were calculated had a sample size of 200. With quite large samples like this, we can actually overcome the effect of the parent distribution, the central limit theorem kicks in, and the sampling distribution starts to closely resemble a normal distribution.
Why are we somewhat obsessed with having nearly normal sampling distributions? Because we've learned earlier that once you have a normal distribution, calculating probabilities, which will later serve as our p-values in our hypothesis tests, is relatively simple.
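The effect of sample size on a skewed parent distribution can also be checked numerically. The following Python sketch is an illustration under assumptions, not a reproduction of the slides: it uses an exponential population as a stand-in for a right-skewed population (the actual population on the slide is not specified), takes 2,000 samples at each of the three sample sizes mentioned above, and measures the skewness of the resulting sample means. The `skewness` helper is a hypothetical name defined here.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (about 0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])


rng = random.Random(2)  # fixed, arbitrary seed
num_samples = 2000

skew_by_n = {}
for n in (10, 100, 200):
    # Each entry is the mean of one sample of size n from the right-skewed population.
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]
    skew_by_n[n] = skewness(means)
    print(n, round(skew_by_n[n], 2), round(statistics.stdev(means), 3))
```

For an exponential population the skewness of the sample mean is 2 divided by the square root of n, so the printed skewness should shrink toward 0 as n grows, alongside the shrinking standard error.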
So having a nearly normal sampling distribution that relies on the central limit theorem is actually going to open up a bunch of doors for us for doing statistical inference, using confidence intervals and hypothesis tests based on normal distribution theory.
Let's do another demo real quick. We looked earlier at what a sampling distribution looks like when we have a nearly normal population distribution. Let's take a look to see what happens if the population distribution is not nearly normal. Suppose I first pick a uniform distribution. Here we can see that our population distribution is uniform. Let's say that it is going to be uniform between 4 and, since our upper bound obviously needs to be greater, 12. So we can see a uniform distribution between 4 and 12, with absolutely no peaks. Say that we're taking samples of size just 15 from this distribution. Each one of our samples contains 15 observations from the parent population, and the centers of these samples are going to be somewhere close to the population mean. We take a bunch of these samples, a thousand of them, and let's take a look at what the sampling distribution looks like. It actually looks fairly unimodal and symmetric. The center of the distribution is very close to our population mean, and the variability of this distribution is much lower than that of our population distribution. We can see that the standard error is 0.59, while the original population standard deviation was 2.31.
What happens if we have skewed data? Here we have a population distribution that's right skewed. Let's make this an extremely right skewed distribution, so this is what it looks like, and take samples of size 15. Here we're taking a look at each one of our individual samples. The sampling distribution is looking quite skewed.
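The numbers from the uniform demo can be checked against the formula. This is a small Python sketch under the demo's stated settings (uniform between 4 and 12, samples of size 15, 1,000 samples); the seed and variable names are arbitrary. It uses the standard result that a uniform distribution on an interval has standard deviation equal to the interval width divided by the square root of 12.

```python
import math
import random
import statistics

low, high, n = 4.0, 12.0, 15

pop_mean = (low + high) / 2              # center of the uniform population
pop_sd = (high - low) / math.sqrt(12)    # sd of a Uniform(low, high) population
predicted_se = pop_sd / math.sqrt(n)     # sigma / sqrt(n)

rng = random.Random(3)  # fixed, arbitrary seed
means = [
    statistics.fmean([rng.uniform(low, high) for _ in range(n)])
    for _ in range(1000)
]
observed_se = statistics.stdev(means)

print(f"population sd: {pop_sd:.3f}")
print(f"predicted SE:  {predicted_se:.3f}")
print(f"simulated SE:  {observed_se:.3f}")
```

The population standard deviation works out to about 2.31 and the predicted standard error to about 0.596, in line with the 2.31 and 0.59 shown by the applet.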
However, if I increase my sample size to be much larger, say 500, then my sampling distribution starts to look much more unimodal and symmetric and starts to resemble a nearly normal distribution.
What about a left skewed distribution? Once again, let's make the skew of this distribution pretty high. We have still kept it at 500 observations in each sample, and we can see that with a large number of observations in each sample, the sampling distribution looks nearly normal. However, if I were to decrease my sample size to something pretty small, 24 let's say, then my sampling distribution looks more left skewed. And in fact, if I take even smaller samples, let's go all the way down to 12, for example, the distribution looks even more skewed. If, though, I decrease the skew, so that my population distribution is not all that skewed to begin with, then I really don't need a whole lot of observations in my sample. Here I have only 12 observations in each sample, and the sampling distribution is already looking pretty unimodal and symmetric. So the moral of the story is: the more the skew, the higher the sample size you need for the central limit theorem to kick in.
Please feel free to go play with this applet, interact with it, and find out for yourself what the sampling distribution looks like in various scenarios. Also play around with the different parameters of the distributions: how skewed they are, or, if it's a uniform distribution, what the minimum and the maximum are, or, if it's a normal distribution, what the mean and the standard deviation are.