All right. Well, this is our second lecture set on hypothesis testing, so it's only fitting that we end it like we did the first lecture set: debriefing on the p-value and hypothesis testing. This will be part two of the debriefing.

The way we've demonstrated performing hypothesis tests is for something called two-sided hypothesis tests, which result in two-sided p-values. I mentioned that before but didn't explain what it means. Two-sided refers to the method of getting a p-value: it measures the chance of being as far or farther from the null value, as extreme or more extreme, in either direction. So we're looking at a directionless alternative hypothesis. Let's just use means, for example, but this is generalizable to any of our other measures of comparison, depending on the data type. The null says the means are equal, or the mean difference is zero at the population level, and the alternative simply says the means are not equal, or the mean difference is not zero. So we're not putting a direction on the true mean difference; if it's not zero, we're just saying it's not zero. We're not discriminating between alternatives where the difference is less than zero or greater than zero; we're considering all situations where it's not equal to zero.

It is possible, and in many cases it may seem scientifically preferable, to perform a one-sided test. For example, if we were looking at average cholesterol levels after three months on a new treatment versus the standard treatment, we might hope that the average on the new treatment comes out lower than on the standard, and in fact we're only concerned with that direction, because if the new treatment comes out worse than the standard, it won't be adopted anyway. So it may seem logical from a scientific perspective to perform a one-sided test, and I agree with you scientifically: generally speaking, we have a sense of the direction we're expecting, or the direction that would be useful. However, because of the sanctity the p-value has taken on in research, one-sided tests are not commonly presented in the literature and in fact tend to raise suspicion. Here's why: for alternatives that favor the direction of your sample difference, the p-value is immediately halved. If I have an alternative that says mean one is less than mean two, and hence mean one minus mean two is less than zero, and my sample result, the difference in sample means, comes out less than zero as well, I will get a p-value that's half of what it would have been had I specified a directionless alternative. So a lot of researchers see this as a way to game the system and get a lower p-value than expected, especially if the p-value you get from a one-sided test is on the order of 0.04 or 0.03 when the regular two-sided test would have given 0.06 or 0.08: not significant. It's always viewed with suspicion. So in this class we will always use two-sided hypothesis tests and two-sided p-values, and that's very much what you'll generally see in the literature.
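To make that halving concrete, here is a minimal sketch in Python using scipy's normal distribution functions. The z-statistic of -1.80 is a hypothetical value chosen for illustration; it is not from any study discussed in the course.

```python
from scipy import stats

# Hypothetical standardized distance: suppose the z-statistic from a
# two-sample comparison came out at -1.80, with the sample mean difference
# pointing in the direction the one-sided alternative hoped for (mean1 < mean2).
z = -1.80

# Two-sided p-value: as extreme or more extreme in EITHER direction
p_two_sided = 2 * stats.norm.sf(abs(z))

# One-sided p-value (alternative: mean difference < 0): only the observed
# direction counts, so the p-value is exactly half the two-sided one
p_one_sided = stats.norm.cdf(z)

print(f"two-sided: {p_two_sided:.3f}")  # 0.072 -> not significant at 0.05
print(f"one-sided: {p_one_sided:.3f}")  # 0.036 -> "significant"
```

This is exactly the situation that raises suspicion: the same data cross the 0.05 threshold or not depending only on which alternative was declared.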
You might say, "John, well, I get the fact that we're doing conceptually the same thing whether I do a paired t-test, a two-sample unpaired t-test, a two-sample z-test, a chi-squared test, a log-rank test, et cetera, but how do I keep track of the names?" Well again, regardless of the name, the approach to doing our hypothesis test is consistently the same. We assume the null hypothesis, however it's expressed in terms of the measure of association we're using to compare the populations. We measure the standardized distance between our study sample result and the null value, standardizing by the sampling variability of the sample result. We translate this distance into a p-value and make a decision. Frequently, this means measuring the distance in standard error units, as in the paired t-test, the unpaired t-test, and the two-sample z-tests for proportions and incidence rates. In some cases distance isn't the measurement metric, for example with the chi-squared approach, but we're always measuring a discrepancy between what we get in our study and what we'd expect to see under the null.

Then, regardless of the test we use, the interpretation of the p-value is universally the same. It always measures the chance of getting a study result as extreme, and by extreme I mean unlikely, or more extreme than the sample result, if the underlying populations are the same with regard to the quantity being compared. So it measures how likely our actual results are, and these are the things that are random, under some fixed assumption about the truth, namely that the difference in what we're comparing is zero.

The names only distinguish the tests in terms of the type of data being compared between the groups. So we choose a two-sample t-test to compare continuous outcomes via the means. We use a chi-squared test to compare categorical outcomes, though we've only looked at binary ones, between multiple populations. The specific mechanics of getting to a p-value differ across these tests, but you can always look up the name of the appropriate test given the data being compared. The more important thing to note is that all are conceptually the same, the resulting p-values all have the same interpretation across all the tests, and they will always agree with the corresponding confidence intervals for the chosen measure of association with regard to the null hypothesis. You will undoubtedly see other tests in the literature that we have not covered and will not cover in this class, but if you can figure out what is being compared via the test, then you can interpret the p-value in the context of the comparison.

In the next lecture set, we will discuss extensions to the tests we've covered thus far, to handle comparisons between more than two populations in one test, using data from more than two samples. But let's just stop and take stock, and give you something you can look up in the future if you forget. For comparing means, we have the paired t-test for the paired design, and the unpaired or two-sample t-test for the unpaired design. For proportions, the one we can do easily by hand, paralleling what we did for means, is the two-sample z-test, and this is mathematically equivalent to something we showed called the chi-squared test. The benefit of the chi-squared test is that it can be expanded to compare more than two groups, whereas with a two-sample z-test we're clearly limited to two groups at most. The other one we have, for when we have smaller samples, is something called Fisher's exact test. It will generally yield the same p-value as the equivalent two-sample z-test or chi-squared test; it's only in small samples that the results can differ notably.
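Here is a minimal sketch of that shared recipe in Python with scipy. The data are simulated, the group sizes and 2x2 counts are made up for illustration, and the use of the unequal-variance form of the two-sample t-test is an assumption on my part, not necessarily the exact version used in lecture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical cholesterol levels after three months, new vs. standard treatment
new = rng.normal(loc=195, scale=30, size=60)
std = rng.normal(loc=205, scale=30, size=60)

# By hand: assume the null (mean difference = 0), then compute the
# standardized distance = (sample difference - null value) / standard error
diff = new.mean() - std.mean()
se = np.sqrt(new.var(ddof=1) / len(new) + std.var(ddof=1) / len(std))
t_by_hand = (diff - 0) / se

# scipy wraps the distance and the two-sided p-value in one call
t_stat, p_unpaired = stats.ttest_ind(new, std, equal_var=False)

# Same recipe, different name, for binary outcomes: a 2x2 table of
# counts (rows = treatment group, columns = outcome yes/no)
table = np.array([[18, 42],
                  [30, 30]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Fisher's exact test: similar p-value here, differs mainly in small samples
odds_ratio, p_fisher = stats.fisher_exact(table)
```

Whichever call you end up making, the resulting p-value has the one universal interpretation given above, and it will agree with the corresponding confidence interval regarding the null.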
Then, for time-to-event outcomes, where we have incidence rates computed for the two groups we're comparing, we can do the easy hand-computation two-sample z-test, which assumes the incidence rates are constant over time in the two groups we're comparing. Something that will give similar results, but allows us to compare survival curves when that underlying assumption of a constant incidence rate doesn't necessarily hold, is the log-rank test. We can only do the log-rank test when we have individual time-to-event data, though. It can be expanded to compare more than two survival curves in one test. So we'll revisit these tests again in the next section, where we'll expand them, or give analogues to them, to compare more than two populations with one test.
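As a final sketch, here is one common version of the hand computation for comparing two incidence rates, written in Python. The event counts and person-years are hypothetical, and the particular standard error formula (treating each event count as Poisson) is an assumption for illustration; the course's exact formula may differ slightly.

```python
import numpy as np
from scipy import stats

# Hypothetical summary data: event counts and person-years of follow-up
events_1, pyears_1 = 30, 1200.0   # incidence rate 0.025 per person-year
events_2, pyears_2 = 55, 1100.0   # incidence rate 0.050 per person-year

rate_1 = events_1 / pyears_1
rate_2 = events_2 / pyears_2

# Standardized distance: (difference in rates - null value of 0) / SE,
# where, treating the event counts as Poisson, the variance of each
# estimated rate is events / person-years squared
se = np.sqrt(events_1 / pyears_1**2 + events_2 / pyears_2**2)
z = (rate_1 - rate_2 - 0) / se

# Two-sided p-value, with the same interpretation as every other test here
p_two_sided = 2 * stats.norm.sf(abs(z))
```

The log-rank test, by contrast, needs the individual time-to-event records rather than these summary counts; packages such as lifelines provide it, but that's beyond this quick sketch.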