So, before we go any further in looking at more hypothesis tests for means, where the approach will be conceptually the same when we start comparing binary outcomes between two populations via samples, or incidence rates between two populations via samples, and only the mechanics will change, just like we've seen with confidence interval creation, I think it's worth stopping and debriefing, for the first time but not the last time, on this idea of a p-value. It is perhaps the most revered, and certainly one of the most misunderstood, numbers in statistics, and more is ascribed to it than it really deserves. So, I just want to talk about that for a little bit.

In this lecture section, the focus will be on what a p-value can and cannot reveal about study results. Upon completion of this section, you'll be able to define type 1 error (we've already talked about it, but didn't call it by this name) and understand its role in hypothesis testing; explain what a p-value is, which we've looked at, but also what it is not; contrast the idea of statistical significance with scientific significance; and start to appreciate (we'll just scratch the surface here and wrap up the first term with this in more detail) why a non-statistically significant result yields a decision of not accepting the null hypothesis per se, but just failing to reject it. Our examples thus far have all been statistically significant, but we'll see some coming up that are not.

So, let's start with p-values. Just to remind ourselves of some characteristics: a p-value is a probability, a number between zero and one. P-values are computed under an assumption about the truth, namely that the difference (so far in means, but this can be expanded to proportions, incidence rates, et cetera), the difference in the measure of association between the two populations we're comparing, is zero. Small p-values mean that the sample result we got, the estimated difference, would be unlikely to occur just by chance if the data we're looking at come from populations under the null hypothesis, with the same means or proportions et cetera. Another way to say this: the p-value is the probability of obtaining a study result as extreme or more extreme, as unlikely or more unlikely, than the one we got by chance alone, assuming the null hypothesis is true. So, it's a measure of how likely your sample result, and results even less likely, are if the null is true.

The p-value is not the probability that the null hypothesis is true, and it actually can't be, because the p-value is only valid and computed under the assumption that the null is true. So, it can't be a measure of the null hypothesis being true; the null hypothesis is assumed to be true, and we can only compute the likelihood of our study results under that assumption about the truth. The p-value tells us nothing about how well the study was conducted. It's not the probability that the study was well conducted or poorly conducted, it's not the probability that the study results are important, and it's not the probability that the alternative hypothesis is true. Again, it's only a measure of our study results under a strict assumption about the truth, the null assumption. It's not the probability that the study findings are legitimate or not legitimate. All it can tell us is how likely our sample results would be to occur just by random chance under an assumption about the truth.
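To make that "as extreme or more extreme, assuming the null is true" idea concrete, here is a minimal Python sketch (not from the course materials) that computes a p-value two ways for a hypothetical two-group comparison: once with a standard two-sample t-test, and once by simulating many studies from a world where the null really holds and counting how often the difference is at least as extreme as the one observed. All the numbers (a mean of 120, an SD of 15, 50 per group) are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical "observed" study: two samples drawn (by construction here)
# from populations with the SAME mean, i.e. the null is true.
n = 50
group_a = rng.normal(loc=120, scale=15, size=n)
group_b = rng.normal(loc=120, scale=15, size=n)

observed_diff = group_a.mean() - group_b.mean()

# Analytic p-value: probability of a difference as extreme or more extreme
# than the observed one, computed assuming the null hypothesis is true.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Simulation of the same idea: repeatedly redraw samples from a world where
# the null holds, and count how often the difference is at least as extreme.
sims = 10_000
null_diffs = np.empty(sims)
for i in range(sims):
    a = rng.normal(120, 15, n)
    b = rng.normal(120, 15, n)
    null_diffs[i] = a.mean() - b.mean()

empirical_p = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))

print(f"observed difference: {observed_diff:.2f}")
print(f"t-test p-value:      {p_value:.3f}")
print(f"simulated p-value:   {empirical_p:.3f}")  # should be close to the t-test value
```

The two p-values should roughly agree, and both answer only the narrow question of how surprising the sample difference is under the null, nothing more.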
The p-value alone imparts no information about the scientific or substantive content of a study's results; p-values in a vacuum aren't necessarily that informative. So, for example, suppose all I told you about our blood pressure/oral contraceptive study is that, based on 10 women, the researchers found a statistically significant (p-value 0.009, well less than 0.05) difference in average blood pressures before and after consistent oral contraceptive use for three months. If that's all I told you, could you answer any of these three questions? Was the blood pressure higher or lower after oral contraceptive use? How big was the average difference, and is it meaningful scientifically? What is the range of possible values for the true average difference in blood pressures before and after oral contraceptive use in the population of women we are studying? I know you know the answers to these based on what we've done, but if all we had done was get the p-value from the paired t-test, you couldn't answer any of these important questions. All you could conclude is that the underlying true mean blood pressures before and after oral contraceptive use were different.

So, what does a small p-value mean? Well, one of two things. If the p-value is small, either a very rare event occurred under the null (we got an extreme sample generated from data coming from populations where the underlying difference in means, or whatever we're comparing, is zero), or the null is false. So, we either have a rare event occurring in terms of sampling, or the null is not true. We have to decide how unlikely our result has to be, under the null, for us to conclude that our results are not consistent with the null. In order to do that, we have to set a cutoff for deciding likely versus unlikely, and this cutoff, which we've called the alpha level, is also called, and we'll introduce a new term here, the type 1 error level. The type 1 error level is our threshold for making a mistake. If our p-value comes in less than the alpha level, or type 1 error level, we're going to reject the null. But by doing so, we're taking a chance: we're potentially rejecting the null in favor of the alternative when in fact the null is true. The probability of doing that, making a type 1 error, rejecting when we shouldn't, is called, again, the alpha level or significance level of the test. This is set in advance of performing the test, and generally, almost universally, it's set at five percent. We're willing to risk a five percent chance of finding a difference, a false positive if you will, when in reality there is no difference at the population level. We could make that smaller, down to 0.01 say, and be more conservative, but again the industry standard is five percent. So, if our p-value is less than some predetermined cutoff like 0.05, the result is called statistically significant, and this corresponds to the null value not being included in the confidence interval for the difference of interest. This cutoff is called the alpha level, and the alpha level is the probability of a type 1 error, just re-expressing that definition again. So, again, the type 1 error rate or alpha level is the probability of falsely rejecting the null hypothesis when the null is true, finding a difference when there isn't one. This is sometimes called a false positive or a false association.
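If we had the raw paired data, we could report not just the p-value but also the direction, the magnitude, and a confidence interval for the mean difference, which is exactly what the p-value alone cannot give us. Here's a hedged sketch of that with scipy; the ten before/after blood pressure values below are hypothetical stand-ins for illustration, not the actual study data from the lecture.

```python
import numpy as np
from scipy import stats

# Hypothetical paired data for 10 women: systolic BP before and after
# three months of oral contraceptive use (made-up values for illustration).
bp_before = np.array([115, 112, 107, 119, 115, 138, 126, 105, 104, 115])
bp_after  = np.array([128, 115, 106, 128, 122, 145, 132, 109, 102, 117])

diffs = bp_after - bp_before
n = len(diffs)

# Paired t-test: the p-value alone only says the before/after means differ.
t_stat, p_value = stats.ttest_rel(bp_after, bp_before)

# The estimate and 95% CI carry the direction and magnitude of the difference.
mean_diff = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)

print(f"p-value:         {p_value:.3f}")
print(f"mean difference: {mean_diff:.1f} mmHg")
print(f"95% CI:          ({ci[0]:.1f}, {ci[1]:.1f}) mmHg")
```

Reporting the last two lines alongside the first is what lets a reader answer the three questions above.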
So, the idea of setting this at five percent or even lower is to keep the chance of making this mistake when H0 is true low, and to only reject when the sample result is very unlikely under the null; again, we determine that threshold by where we set the alpha level or type 1 error level. Here's a table that shows the underlying truth versus the decisions we make, and we have names for these things. For example, if the truth in the population that generated the data is the null, there's no difference, and we reject the null, then we've made a mistake: a type 1 error. If we don't reject the null when the null is true, we've done the right thing. If the alternative is true and we decide not to reject the null, then, perhaps not surprisingly, we've made another mistake; this is called a type 2 error. We'll focus more on this later in the course, towards the end, but the type 2 error is the chance of not rejecting when we should, a false negative. What do we call rejecting the null when we should, that is, when the alternative is true? The probability of that is one minus the type 2 error probability, and it has another name: power, the power of the study. Ideally, we want the power of a study to be high; if there really is a difference in the population-level measures we're comparing, we want the study to have a good chance of detecting it.

Again, we'll focus on this idea of power more in the last lecture set of the course, but I'll just throw this out here. If we have a p-value greater than or equal to 0.05, and a confidence interval that includes the null value, we generally don't make a strong statement like "accept the null"; we say something a little more tentative, that we fail to reject the null. The reason is that it's not necessarily clear how strong a statement that is, and it has to do with the power of our study. If our study has low power, a low chance of finding a difference if there really is one at the population level, then failing to find a difference is hard to interpret. Does it mean there is no difference at the population level, or that there is one but we didn't have enough precision to see it? Contrast that with a higher-powered study, where if there is a difference, we're likely to find it; there, not rejecting is a stronger statement. We'll talk more about this in the last lecture set of the course, but I just want to give you some starting point as to why we're a little ambiguous in our language when results are not statistically significant.

So, let's talk about hypothesis testing and confidence intervals. Again, the confidence interval gives a range of plausible values for the unknown population comparison measure. We say: here's my data; use the uncertainty in the data to create an interval to take me to the unknown truth. Hypothesis testing starts with the truth, in fact two choices for the underlying population comparison measure, and says: here are two possibilities, and with my data I'm going to choose one. It turns out, and I said this in the beginning and we've used it throughout the lectures, that if our null value of zero is not in the 95 percent confidence interval for a mean difference, then we would reject that null hypothesis at the five percent level; in other words, we'd get a p-value of less than 0.05. So, let's think about this.
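Here's a small simulation sketch of the two key quantities in that truth-versus-decision table: run the same two-sample t-test over and over on data generated under the null (true difference zero) and under an alternative, and see how often we reject. The rejection rate under the null is the type 1 error rate, which should land near alpha; under the alternative it's the power. The function name rejection_rate and all the settings (n of 30 per group, a half-standard-deviation true difference) are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_rate(true_diff, n=30, sims=5000, alpha=0.05):
    """Fraction of simulated two-sample studies that reject the null at
    level alpha, when the true difference in population means is true_diff."""
    rejections = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        rejections += p < alpha
    return rejections / sims

# When the null is true, the rejection rate is the type 1 error rate
# and should land near alpha = 0.05.
print("type 1 error rate:", rejection_rate(true_diff=0.0))

# When the alternative is true, the rejection rate is the power of the study.
print("power (true diff = 0.5 SD):", rejection_rate(true_diff=0.5))
```

Shrinking n in the second call is a quick way to see why a non-significant result from a low-powered study is so hard to interpret.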
So, the way we do confidence intervals is to start with our estimate and go plus or minus two standard errors of the estimate in either direction, and 95 percent of the time this interval will include the truth. What does that mean? We go plus or minus two standard errors from our estimate. If this interval does not include zero, it means that zero must be more than two standard errors from our estimate, either more than two standard errors above it or more than two standard errors below it, because we've gone two standard errors in either direction and we didn't reach zero. In other words, if the 95 percent confidence interval for the mean difference does not include zero, then the absolute distance of our estimate from zero, measured in standard errors, is greater than two (technically 1.96 for large samples; this generalizes to t-distributions for smaller samples). Conversely, when we do the hypothesis testing approach, we start by assuming the true difference is zero, we measure how far our result is from zero, and then we compute the proportion of results that are as far or farther away from zero. But again, if the 95 percent confidence interval does not include zero, then we already know our estimate is more than two standard errors away from zero, and therefore our resulting p-value will be less than 0.05. So, it's all about distances, really. For the confidence interval, we go a fixed distance from our estimate and create the interval. For the hypothesis test, we measure the distance of our estimate from a fixed starting point of zero. Again, if zero is not in the confidence interval, our result is more than two standard errors from zero, and hence our resulting p-value will be less than 0.05.

Confidence intervals and p-values are complementary, though I think there's a lot more information in a confidence interval alone than in a p-value alone. In the blood pressure/oral contraceptive example, the 95 percent confidence interval for the population mean difference in blood pressures, 1.4 millimeters of mercury to 8.2 millimeters of mercury, tells us that the resulting p-value for comparing these means is less than 0.05, but it doesn't tell us that it's equal to 0.009. The two results will be consistent in terms of the decision about the null, but you can't get the exact p-value just by looking at a confidence interval, and you can't get a sense of the scientific or substantive significance of your study results simply by looking at a p-value.

One thing to keep in mind, especially when looking only at the p-value: statistical significance is just statistical significance. It does not imply or prove causation, it doesn't imply scientific significance, et cetera. So, for example, in the blood pressure/oral contraceptive paired study example, as we discussed when we saw it the first time, there could be other factors that explain the average change in blood pressure. A significant p-value only rules out random chance as the explanation, random chance under an assumed truth that the means are the same at the population level. We would need a comparison group, preferably a randomized one, to better establish causality.
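Here's a quick numerical sketch of that distance argument. The summary numbers (an estimate of 4.8, a standard error of 1.5, 10 pairs) are made up, roughly in the spirit of the blood pressure example but not the course's exact figures; the point is that zero falling outside the 95 percent interval and the two-sided p-value falling below 0.05 are the same statement about the same distance.

```python
from scipy import stats

# Hypothetical summary statistics for a mean difference (made-up numbers):
estimate = 4.8      # estimated mean difference
se = 1.5            # standard error of the estimate
n = 10              # number of pairs, so df = n - 1

t_crit = stats.t.ppf(0.975, df=n - 1)   # about 2.26 for 9 df; about 1.96 for large samples

# Confidence interval: a fixed distance (t_crit standard errors) from the estimate.
ci_low, ci_high = estimate - t_crit * se, estimate + t_crit * se

# Hypothesis test: the distance of the estimate from zero, in standard errors.
t_stat = (estimate - 0) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# The two views agree: zero is outside the CI exactly when p < 0.05.
print(f"95% CI: ({ci_low:.1f}, {ci_high:.1f})")
print(f"zero outside CI: {not (ci_low <= 0 <= ci_high)}")
print(f"p-value: {p_value:.3f}, p < 0.05: {p_value < 0.05}")
```

Try swapping in a smaller estimate or a larger standard error: the interval will swallow zero at exactly the point where the p-value crosses 0.05.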
Statistically significant results are not necessarily scientifically important, even if causation can be established by a randomized controlled trial. Large sample sizes can lead to statistically significant results even when the magnitude of the difference in means is small, simply because large sample sizes result in very small standard errors. When we measure the distance between our result and zero in units of standard errors, and those units are very small, that distance may be large in standard error terms even if it doesn't mean anything clinically or substantively. So, suppose, for example, that for the difference in mean weight change between the low-carb and low-fat diets (this is fictional) we observed a mean difference of negative 0.5 kilograms, with a confidence interval for that mean difference from negative 0.75 to negative 0.25 kilograms. This is certainly statistically significant; our p-value would be less than 0.05, as our confidence interval does not include zero. But despite being statistically significant, perhaps this result would not be considered clinically significant.

A lot of people consider the imperative of science to be finding a statistically significant difference, but really that result in and of itself is not that informative. It doesn't tell you about the magnitude of the difference, the range of possible values for the difference, whether it's scientifically interesting or informative, et cetera. So, a p-value alone is only a piece of the story. Conversely, people also feel they've failed if they don't get a statistically significant result. Depending on the situation, and we'll talk further about this later in the course, a non-statistically significant result can either be ambiguous, so we can't really make much of it, or it can be conclusive; but in any case it's not a failure on the part of the researcher. Lack of statistical significance is ambiguous in smaller studies, where it's not clear whether we failed to reject the null because there is no population-level difference in the quantities (here, means) we're comparing, or because there's too much uncertainty in the results, a large standard error, to actually see a difference if it's really there. So, a small n can sometimes produce a non-significant result even though the potential association at the population level is real and important, and our study just can't detect it. A lot of times, after a smaller study, a larger follow-up study with higher power (we've quickly defined that term, power, and we'll do so more definitively shortly) will be done to try and answer the question more decisively. But if we have a larger study, then a lack of statistical significance is a more decisive conclusion. Again, we'll talk more about statistical power and sample size in an upcoming lecture set.

So, in summary, the p-value alone can only indicate whether the study results were likely due to random sampling chance or not, under the assumption that there is no difference in the measure being compared between populations. So far we've only looked at comparing means, but this idea will hold for proportions and incidence rates as well. A p-value should generally not be reported alone, without the estimated mean difference and the 95 percent confidence interval endpoints.
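To see the sample-size effect numerically, here's a sketch that simulates a fictional version of the diet comparison in which the true difference in mean weight change is only 0.5 kilograms; the means, standard deviation, and sample sizes are all invented for illustration, not taken from the course's diet study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def diet_comparison(n_per_group):
    """Simulate a hypothetical diet comparison in which the true difference
    in mean weight change is only 0.5 kg (SD 5 kg); return the p-value."""
    low_carb = rng.normal(loc=-3.0, scale=5.0, size=n_per_group)
    low_fat  = rng.normal(loc=-2.5, scale=5.0, size=n_per_group)
    _, p = stats.ttest_ind(low_carb, low_fat)
    return p

# The same tiny true effect, very different statistical conclusions:
for n in (25, 250, 2500):
    print(f"n per group = {n:5d}  p-value = {diet_comparison(n):.3f}")
# With small n, the 0.5 kg difference is usually not significant; with
# thousands per group it usually is, even though 0.5 kg may not matter clinically.
```

The effect never changes; only the standard error shrinks, which is exactly why a small p-value by itself says nothing about clinical importance.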
Again, not rejecting the null hypothesis is not equivalent to accepting the null hypothesis as the truth, as we've discussed, and we'll dig deeper into this, looking at the impact of sample size and power on taking this from an ambiguous statement to a stronger one, in lecture set 12. In the next lecture set we'll come back and look at more hypothesis tests, and the drill will be the same; only the inputs change when comparing binary and time-to-event outcomes. We'll also debrief further on this entity called the p-value.