Hi, my name is Brian Caffo, and welcome to the lecture on t confidence intervals. This is part of the Coursera Statistical Inference class, which is part of our Coursera Data Science Specialization. The class is co-taught with my co-instructors, Jeff Leek and Roger Peng. We're all in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health.

In the previous lecture, we discussed creating confidence intervals using the central limit theorem. All the intervals we discussed took the form of the estimate, plus or minus, a quantile from the standard normal distribution times the estimated standard error of the estimate. In this lecture, we're going to discuss some methods for small samples. Notably, we're going to talk about Student's, or Gosset's, t distribution and t confidence intervals. These intervals are going to be of the form estimate plus or minus a t quantile times the standard error of the estimate. So, the only change is that we've swapped the z quantile for a t quantile. The t distribution has heavier tails than the normal distribution, so these intervals are going to be a little bit wider. These are some of the handiest intervals in all of statistics, and if you ever have to choose between a t interval and a z interval in a case where both are available, simply use the t interval, because as you collect more data, the t interval just becomes more and more like the z interval anyway. We're only going to cover the single-group and two-group versions of the t interval. In our regression class, we'll cover some other t intervals that come in handy as well.

The t distribution was invented by William Gosset, under the pseudonym Student, in 1908. He worked for the Guinness Brewery, and in fact, they didn't want him to publish under his real name, so he published under the pseudonym Student. That's why we often talk about Student's t distribution or Student's t test. The t distribution has thicker tails than the normal. Unlike the normal, which is indexed by two parameters, the mean and the variance, we really only talk about the t distribution as centered around zero with a standard formula for the scale. It's indexed by only one parameter, the so-called degrees of freedom. As the degrees of freedom increase, it looks more and more like a standard normal.

The reason for the t distribution is as follows. If we take x bar, subtract off the mean, and divide by the estimated standard error for iid Gaussian data, the result is in fact not Gaussian distributed. If we replaced s by sigma, it would be exactly standard normal. However, when we replace sigma by s, it no longer has the distribution of a standard normal; instead, it has a t distribution. As n increases, this distinction is irrelevant. However, for small sample sizes, the difference can be quite large, and if you use the standard normal for small sample sizes you can get, for example, confidence intervals that are too narrow. Our interval winds up being x bar plus or minus the t quantile with n minus 1 degrees of freedom times the estimated standard error. We'll go through some examples, and I hope you'll get the hang of it.

Here, I'll use RStudio's manipulate function to show the t distribution overlaid on the normal distribution for varying degrees of freedom. Here, I have it for 20 degrees of freedom, and you can see they're quite similar. As I bring the degrees of freedom down, you can see the t distribution's heavier tails.
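(For reference, here is a minimal sketch of how a demonstration like this could be set up with RStudio's manipulate function; the exact code isn't shown in the lecture, so the details below are just one possible way to do it.)

    library(manipulate)
    x <- seq(-4, 4, length = 1000)
    manipulate({
        # overlay the t density (blue) on the standard normal density (black)
        plot(x, dnorm(x), type = "l", lwd = 2, ylab = "density")
        lines(x, dt(x, df = df), col = "blue", lwd = 2)
        legend("topright", legend = c("normal", paste("t, df =", df)),
               col = c("black", "blue"), lwd = 2)
    }, df = slider(1, 30, initial = 20))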
To be clear, it always has heavier tails; it looks like a normal distribution that has been squashed down, with the extra mass heading out into the tails. It's a little bit hard to see the impact, because so much of this plot is devoted to the peak part of the distributions, where they're actually fairly similar.

Let's plot the quantiles. What I'm showing in this plot now are the quantiles of the t distribution plotted against the quantiles of the normal distribution, starting at the 50th percentile. For example, since both distributions are symmetric about zero, zero is the 50th percentile of each, so the plot goes through the point (0, 0). I don't plot percentiles lower than the 50th, because the distributions are symmetric and that information is redundant. What I've drawn here are reference lines at the 97.5th percentile. This is about 1.96 for the standard normal distribution, but it can be quite a bit larger for the t distribution. For example, right now with two degrees of freedom, we get a t quantile that is over four. Let me emphasize, though, that two degrees of freedom means you only had three data points to estimate your variance, which is not many. Let's bump it up to 20 degrees of freedom, and now we see that the t quantiles are much closer to the normal quantiles. This light blue reference line is the identity line; if the t quantiles were exactly identical to the normal quantiles, they would fall right on it. Again, I show reference lines at 1.96 and at the corresponding t quantile. So, this distance right here is exactly the distinction between the two intervals: with the t interval, we're going to have a quantile that's slightly larger than two, whereas the z quantile is slightly smaller than two. This can make a fairly sizable difference in the intervals, and it often can make the difference between the interval containing zero and the interval not containing zero. This is why we use the t distribution. In this quantile plot, the t quantile is always going to be above the blue reference line, which effectively says that the t confidence interval will always be wider than the normal interval. That makes sense: we're estimating an extra parameter, the standard deviation, and it stands to reason that that extra uncertainty should give us wider intervals.

Let's go through a couple of notes about the t interval. The t interval technically assumes that the data are iid normal, though it's quite robust to this assumption. Basically, whenever the distribution is roughly symmetric and mound shaped, the t interval will work fairly well. If you have paired observations, for example when you measure something once and then measure the same unit again a few days later for a second measurement, you can use the t interval to analyze that kind of data by taking differences, or differences on the log scale; we'll go through some examples. For large degrees of freedom, the t quantiles become the same as the standard normal quantiles, so the t interval converges to the normal interval, the same one you would obtain with the central limit theorem.
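(As a quick illustration of those quantiles, and of the interval itself, here is a small sketch in R; the sample size, mean, and standard deviation at the bottom are made-up values for illustration, not numbers from the lecture.)

    # 97.5th percentile of the standard normal versus the t distribution
    qnorm(0.975)        # about 1.96
    qt(0.975, df = 2)   # over 4 with only two degrees of freedom
    qt(0.975, df = 20)  # about 2.09, much closer to the normal quantile

    # t interval from summary statistics: xbar +/- t quantile * s / sqrt(n)
    n <- 10; xbar <- 5; s <- 2   # hypothetical sample size, mean, standard deviation
    xbar + c(-1, 1) * qt(0.975, n - 1) * s / sqrt(n)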
Because of this, instead of picking between the t interval and the normal interval, I always say just use the t interval. For skewed distributions, the spirit of the t interval's assumptions is violated. In that case, you can either work with the data on the log scale, where taking logs will often take care of the skewness, or you can use other procedures, for example bootstrap confidence intervals. Regardless, it simply doesn't make sense to use the t interval for skewed distributions, because in a lot of ways it doesn't make sense to center an interval for a skewed distribution at the mean. Also, I'd add that for highly discrete data, like binary or Poisson data, other intervals are available, and it's probably preferable to use them rather than the t interval.
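(To make the paired-difference and log-scale ideas concrete, here is a minimal sketch using R's t.test function; the measurements below are hypothetical values, not data from the course.)

    # paired measurements on the same units, e.g. once and then again a few days later
    baseline <- c(140, 138, 150, 148, 135)
    followup <- c(132, 135, 151, 146, 130)

    # t confidence interval for the mean difference
    t.test(followup - baseline)$conf.int

    # the same idea with differences on the log scale; exponentiating the
    # endpoints gives an interval on the multiplicative (ratio) scale
    exp(t.test(log(followup) - log(baseline))$conf.int)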