So, now let's look at a specific example of a hypothesis testing situation where we're comparing means in a paired study design. In this lecture section, you will learn how to estimate and interpret a p-value for a hypothesis test of a mean difference between two populations based on a paired study design. The method for getting the p-value is called the paired t-test: "paired" because of the study's design, and "t-test" because in smaller samples the sampling distribution, as we've seen for means and mean differences, is a t-distribution. In larger samples it's a t-distribution with a large number of degrees of freedom, which essentially converges to a normal distribution. So, let's look at the paired study we saw before, where we had diagnoses from two physicians on the same set of patients. The patient set of interest was a random sample of 65 male sexual contacts of men with AIDS or an AIDS-related condition. Each of the men was seen by two doctors, and each doctor assessed the number of detectable lymph nodes on each of the men. The interest was whether this approach was reproducible, and certainly a start toward being reproducible would be for these two physicians to get similar results. It turns out, however, that they estimated rather different numbers of lymph nodes on average for the set of 65 men. Doctor one found, on average, 7.91 lymph nodes per person versus 5.16 per person for doctor two. The difference in means between the two doctors, in the direction (which is arbitrary) of doctor two compared to doctor one, is negative 2.75. So, doctor two found 2.75 fewer lymph nodes on average than doctor one. Recall that with the paired design, even though we had the sample information, the mean and variability of the measurements, for each doctor separately, the measurements are paired by patient. 
This could be reduced to a single sample of differences between the two doctors for each of the 65 men, and the standard deviation of those 65 differences was 2.83. So, let's now take the hypothesis testing approach, using the same data we used before to create confidence intervals. But instead of taking our data and producing an interval for the unknown truth, we'll start with two competing hypotheses for the unknown truth and choose between them based on the data we have. The basic, gut-level null hypothesis we'd be working with here is that there's no difference in the mean number of lymph nodes found by the two doctors were they to examine all men in the population. One way to express that is that the means are the same for the two doctors: mu of doctor one equals mu of doctor two, versus the alternative that the means are different at the population level. We could certainly re-express this in terms of the difference in means: if the two underlying means are the same, then the mean difference is zero. It doesn't matter whether we express it as doctor two minus doctor one or vice versa; if the two means are the same, the difference in either direction is zero. The alternative is that this difference is not zero, and again, regardless of whether we do doctor two minus doctor one or doctor one minus doctor two, if the two means are not the same, the difference will not be zero. We could also write this more succinctly by calling the mean difference between the two populations mu diff, and the null in terms of mu diff is that it's equal to zero. So, what we're going to do is start with the null hypothesis and assume that the difference in population mean lymph nodes found between these two doctors is zero. 
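To collect the three equivalent ways of writing these competing hypotheses in one place (writing the difference as doctor two minus doctor one, though the direction is arbitrary):

```latex
H_0:\; \mu_1 = \mu_2 \;\Longleftrightarrow\; \mu_2 - \mu_1 = 0 \;\Longleftrightarrow\; \mu_{\text{diff}} = 0
H_A:\; \mu_1 \neq \mu_2 \;\Longleftrightarrow\; \mu_2 - \mu_1 \neq 0 \;\Longleftrightarrow\; \mu_{\text{diff}} \neq 0
```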
We're going to measure how far our observed mean difference is from what we'd expect it to be under the null, zero, but we're going to do this in terms of standard errors. So, we're going to get a distance measure that looks very much like the z-scores we used with normal curves back in the early part of the course: measure how far our result is from the mean, or expected value, of the sampling distribution, which is zero when the null hypothesis is true, in units of standard errors of this estimated mean difference. I think of this as a distance just like a z-score. If you look in textbooks, they'll refer to this value as t, simply because in smaller samples we compare this distance to the theoretical sampling distribution under the null, which for smaller samples is a t-distribution. So, let's look at our results. Our observed mean difference, in units of lymph nodes, was negative 2.75, and we're going to divide that by the estimated standard error of the difference. You may recall the standard error of a single mean for a single sample, and we've reduced these data to a single sample of differences: the standard error is equal to the sample standard deviation divided by the square root of the sample size, or 2.83 lymph nodes divided by the square root of 65. When we do this out, we get a result that is 7.86 standard errors below zero, and again, zero is what the true mean difference would be under the null hypothesis we assume for the moment to be true. So, what we're going to do is translate this distance into a p-value by comparing how far our result was to the distribution of such differences under random sampling variability alone, around the assumed true population mean difference of zero. Then we'll compare this p-value to a preset rejection level, or alpha level; for our purposes, as in most of the research world, this is 0.05, or five percent. 
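The arithmetic just described can be sketched in a few lines. This is a minimal illustration using the rounded summary statistics quoted above; because the inputs are rounded, it lands at about negative 7.83 rather than the lecture's negative 7.86 (which reflects unrounded values), but the idea is identical.

```python
import math

# Summary statistics from the paired lymph-node study (rounded values)
mean_diff = -2.75   # mean difference, doctor 2 minus doctor 1 (lymph nodes)
sd_diff = 2.83      # standard deviation of the 65 within-patient differences
n = 65              # number of patients

# Standard error of a single-sample mean: s / sqrt(n)
se = sd_diff / math.sqrt(n)

# Distance of the observed mean difference from 0 (the null value),
# measured in standard errors -- the t statistic
t_stat = (mean_diff - 0) / se
print(round(se, 3), round(t_stat, 2))   # about 0.351 and -7.83
```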
So, we have a result that is 7.86 standard errors below the expected mean difference of zero under the null hypothesis. How likely is this to occur just by chance? Well, we need to appeal to the sampling distribution of our potential estimates of this mean difference across multiple studies of the same size from the same population. That theoretical sampling distribution, which we know tends to be approximately normal in larger samples, is centered at the true population-level mean difference, which we're assuming to be zero. So, under the null hypothesis, our estimates would vary randomly around zero in a normal fashion, and most of the results would be relatively close to zero, within two standard errors. We have a result that's way off the charts. Let's try to draw it somewhat to scale: something that's nearly eight standard errors below zero, way out here. We want the proportion of results that are as far or farther away from zero than that, and we count results in both directions; we'll explain why toward the end of these two lecture sets on hypothesis testing. I can't even draw the curve out that far, but you get the picture: our result is way farther from zero than most of the sample mean differences we could have gotten by chance if, in reality, the true mean difference is zero. So, let's translate this into a p-value. Generally you'd use the computer; all stats packages have functions that do this computation from start to finish for you. But let's think through doing it by hand, and I'm cheating a little bit because I'm using R as the "by hand" table part; there's no real reason to look things up in actual tables anymore. The p-value measures the probability of getting a sample mean difference of negative 2.75, like we did, or something more extreme, if the true population mean difference is zero. 
So, translating to standard errors, this is the probability of getting a result as far or farther than 7.86 standard errors from the mean of a normal curve. Based on the properties of the normal curve, that's pretty far, but if we wanted the actual proportion of results that are that far or farther, we could appeal to the pnorm function in R to get that percentage. You may recall how pnorm works: I'll put in a number, in this case negative 7.86 (I'm not drawing this to scale, but pretend this is negative 7.86), and it will tell me the proportion of results that are less than that under a standard normal curve with mean zero; in other words, the proportion of results that are 7.86 or more standard errors below zero. Because I'm interested in results that are as far or farther in both directions from zero, I want to double that proportion; by the symmetry of the normal curve, the two tail proportions are the same. So, if I take two times pnorm of negative 7.86, I get 3.84 et cetera, e to the negative 15, which actually means 3.84 times 10 to the negative 15. So, this is an extremely small chance of getting a result as far as ours or farther if, in fact, we'd sampled from populations of physicians with the same underlying mean number of detected lymph nodes. So, again, this resulting p-value, I'm just going to say, is very small; it's well less than one in 10,000, so let's put it at less than one in 10,000 for now. How do we interpret that? Well, again, this is computed under the assumption that the null hypothesis is true, that the population mean difference we're estimating is actually zero. 
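If you wanted to reproduce the pnorm calculation outside R, Python's standard library has an equivalent; a minimal sketch, where NormalDist().cdf plays the role of R's pnorm:

```python
from statistics import NormalDist

z = -7.86  # observed distance from 0, in standard errors

# Proportion of a standard normal below -7.86 -- the analogue of pnorm(-7.86)
lower_tail = NormalDist().cdf(z)

# Double it for the two-sided p-value (symmetry of the normal curve)
p_value = 2 * lower_tail
print(p_value)   # on the order of 4e-15, well under 1 in 10,000
```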
So, if there were no difference at the population level in the mean number of lymph nodes found by the two doctors, then the chance of getting, in a study like ours based on 65 persons, a mean difference of negative 2.75 or something even more extreme (by more extreme, I mean less likely) is less than one in 10,000. So, if the underlying truth were no difference in means, then we've gotten a very extreme study result. How are we going to decide whether our result is consistent with the assumption we computed the p-value under, that the mean difference is zero? Well, we're going to compare the p-value to our threshold for likely versus unlikely. Because the p-value is well less than 0.05, or five percent, the decision we would make is to reject the null hypothesis of no difference in favor of the alternative, which is actually very vague: it's just that there is a difference, or that the difference is not zero. Based on this analysis alone, if we just had the p-value, all we could say is that we've ruled out no difference in mean physician ratings as a possibility for the underlying truth, and that our result is statistically significant because we found evidence of a difference at the population level. That unto itself is not very informative. It would be much more informative to know that the difference was negative 2.75 in our study, and to have the confidence interval. I'll just remind you, and you can go back and look, that this decision of ruling out zero as a possibility for the true mean difference is consistent with our 95% confidence interval, which did not include that null value of zero. These complementary approaches, a confidence interval for a difference and a hypothesis test for a difference, will always agree in terms of their ruling about statistical significance, and they should, because they're using the exact same data to make a statement about the exact same underlying truth, just in two different ways. 
So, I just want to point out that this p-value is invariant to the direction of comparison. Again, the direction of comparison is arbitrary, especially in examples like this. Suppose we instead calculated the mean difference as doctor one compared to doctor two; instead of negative 2.75, it would be positive 2.75. Then the distance between the sample mean difference and zero, in standard errors, is positive 7.86 instead of negative 7.86. So, our sample mean difference in this direction is 7.86 standard errors above zero, and again, zero is the true mean difference under the null hypothesis. When we calculate the p-value, it's still the probability of being as far or farther than 7.86 standard errors in either direction from zero, and we get the same resulting p-value. Let's look at another example. We've seen these data before: ten non-pregnant, pre-menopausal women, 16 to 49 years old, who were beginning a regimen of oral contraceptive use had their blood pressures measured prior to starting oral contraceptives and after three months of consistent use. The goal of this small study was to see what changes in average blood pressure, if any, were associated with oral contraceptive use in such women. We had ten women, we had the before and after measurements, and we were able to reduce these to a single sample of differences for the ten women, comparing each after measurement to the corresponding before measurement. We saw before that the average change after three months on oral contraceptives, compared to before, was an increase of 4.8 millimeters of mercury, but there was a fair amount of variability in the individual changes. 
When we did the confidence interval, the result was statistically significant: our estimate of the true difference was 4.8 millimeters of mercury, but taking into account the uncertainty in our sample, the truth could be anywhere from an increase of 1.4 millimeters of mercury after using oral contraceptives up to an increase of 8.2 millimeters of mercury. So, certainly statistically significant, though maybe a little ambiguous clinically, because on one end the result is not that clinically meaningful, while on the other end it would be an extreme result. But certainly, zero is not in the interval, so it's statistically significant. Now let's do this in terms of hypothesis testing. Our competing hypotheses, expressed in terms of the difference in mean blood pressures, after minus before (so mu diff is just mu after minus mu before), are that mu diff is equal to zero versus the alternative that the difference is not zero. We're going to assume that the difference is zero in the population from which these women were taken, and we're going to calculate how far our observed mean difference of 4.8 was from that expected difference of zero, in units of standard errors. The standard deviation of the 10 differences in our sample was 4.6 millimeters of mercury, and there are 10 women. So, the estimated standard error of this estimated mean difference is 4.6, the sample standard deviation, divided by the square root of 10, the sample size. We get a result that is 3.29 standard errors above zero; again, zero is the assumed population mean difference under the null hypothesis. So, the p-value is the probability of getting a sample mean difference of 4.8 or something more extreme, if the true underlying population mean difference is zero. 
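The same few lines of arithmetic apply here; a sketch using the rounded summary statistics quoted above (rounding puts it at about 3.30 versus the lecture's 3.29, an immaterial difference):

```python
import math

# Summary statistics from the oral contraceptive blood pressure study
mean_diff = 4.8   # mean change, after minus before (mm Hg)
sd_diff = 4.6     # standard deviation of the 10 individual changes
n = 10            # number of women

se = sd_diff / math.sqrt(n)      # estimated standard error of the mean difference
t_stat = (mean_diff - 0) / se    # distance from the null value 0, in standard errors
print(round(se, 2), round(t_stat, 2))   # about 1.45 and 3.3
```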
If we translate this into standard errors, this is the probability of getting a result as far or farther than 3.29 standard errors from the mean, and we have only 10 observations here. So, technically speaking (again, the computer will do all this detail for you if you use a function to compute the p-value from a t-test), if we were doing it by hand, we'd appeal to a t-distribution with nine degrees of freedom: there are 10 women, so 10 minus 1, or nine, degrees of freedom. The pt function is the equivalent of the pnorm function but for a t-distribution, and the arguments we give it are our distance measure and the degrees of freedom for the distribution. There are two ways to get this result. Remember, what I want is the proportion of results that are as far or farther than 3.29 standard errors from zero under this t-curve. pt behaves just like pnorm: it gives me the percentage of observations that are less than the value I plug in. So, if I plug in pt of 3.29, it's giving me the percentage of observations that are less than 3.29 standard errors above zero, all the area to the left of it. But I only want the area in the other tail, the percentage that are greater than or equal to 3.29 standard errors above zero. Because pt gives me the percentage that are less than 3.29 standard errors from zero, to get the percentage that are 3.29 or more standard errors above zero, I take one minus pt of positive 3.29; that gives me the area in the tail. And because I'm interested in not just 3.29 or more standard errors above zero but also 3.29 or more below, I'm going to double that. That gives me a p-value of 0.0094. A way to shortcut this, without having to remember to subtract the result from one before doubling, is to plug in the negative counterpart instead, even though our result was positive 3.29. 
If we apply pt to the negative counterpart, negative 3.29, with nine degrees of freedom, we'll get the same tail area, and the same p-value once we double it: 0.0094. So again, how would we interpret that p-value? I would say: if the true mean difference in blood pressures, after compared to before oral contraceptive use, were zero in the population of women from which the sample was taken, the chances of getting our result, a mean change of 4.8 millimeters of mercury, or something even more extreme, is 0.0094. That's less than 0.05, so we'd reject the null hypothesis, and that's consistent with the result we saw in our confidence interval, which did not include the value zero. So, in summary, the paired t-test approach allows us to set up two competing hypotheses about the unknown population means for the two populations being compared. There are three different ways of expressing these competing hypotheses at the gut level: the means of the populations we're comparing are equal; their difference is zero; or, most succinctly, mu diff equals zero. The way we do the hypothesis test is to assume the null, that the underlying population means are the same and their difference is zero, and then compute how far our observed, or estimated, difference in means is from zero in terms of standard errors. Then we translate that distance into a p-value using the normal distribution, or a t-distribution in smaller samples. I've shown you in this lecture set how to do it, quote unquote, by hand, using R as effectively a table, but generally speaking you can use R or any other package to do it all in one step, and it will take care of looking up the appropriate value for you as well. The p-value, remember, is computed assuming the null is true. So, it measures the chance of getting our study results, or something even less likely. 
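R's pt returns this tail area directly; as a sketch of what it's computing, here is a stdlib-only Python version (the helper names t_pdf and t_two_sided_p are made up for illustration, and the tail area is approximated by integrating the t density with Simpson's rule rather than calling a library routine):

```python
import math

def t_pdf(x, df):
    # Density of Student's t distribution with df degrees of freedom
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t_stat, df, steps=10_000):
    # Two-sided p-value: 2 * P(T >= |t|). The area from 0 to |t| is found by
    # composite Simpson's rule, then subtracted from 0.5 to get one tail.
    a, b = 0.0, abs(t_stat)
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    area_0_to_t = total * h / 3        # P(0 <= T <= |t|)
    return 2 * (0.5 - area_0_to_t)     # double the upper-tail area

# 3.29 standard errors, 10 - 1 = 9 degrees of freedom
print(round(t_two_sided_p(3.29, 9), 4))   # 0.0094, matching 2 * pt(-3.29, 9) in R
```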
Sometimes we say something more extreme, when the two samples are assumed to have come from populations with the same means. So, it's a measure of the likelihood of the results we got under an assumption about the unknown truth. This p-value is called the two-sided p-value, and we'll explore that in more detail in a little bit; it is invariant to the direction of comparison, so we don't have to worry about setting up the direction in one particular way to get a valid p-value. And as with confidence intervals, if we ignored the pairing of these data, we'll see in the additional exercises that we'd get an inflated standard error, which would give us higher p-values, incorrectly higher, than if we respected the pairing. In the next section, we'll show how to do this same setup and idea with unpaired samples, and conceptually it's exactly the same. The only thing that's going to change, as with the confidence intervals for paired versus unpaired, is how we estimate the standard error of that mean difference.