Now, let's look at hypothesis tests for comparing incidence rates and survival curves between two populations. Upon completion of this lecture section, you will be able to describe the two approaches to getting a p-value for comparing incidence rates between two populations. The two-sample z-test approach, which will look very familiar, is based on comparing the incidence rates on the natural log scale. It can be used both for data where individual times are not recorded, like cases per year by region, and for clinical studies where individual times to event or censoring are recorded. The log-rank test compares the Kaplan-Meier curves for the two groups and can be extended to compare more than two populations, but it can only be used for data where the individual event and censoring times are recorded. We'll look at both of these in this section.

So, let's go back to our Pennsylvania lung cancer dataset. One of the things we looked at was comparing the rates of lung cancer diagnoses for men and women in 2002. We had already computed the incidence rates and incidence rate ratios by sex, but here's a recap of the results on the rate scale. Females had a lower incidence rate of lung cancer than males, and the incidence rate ratio for females compared to males was 0.75. We saw before that this was statistically significant, with a confidence interval of 0.72 to 0.78, which does not include the null value for ratios of 1. So again, the rate of lung cancer in 2002 for females was 0.75 times the rate for males. Another way to state that is that females had a 25 percent lower risk of lung cancer when compared to males. To put the confidence interval into a substantive context, after accounting for sampling variability, females could have anywhere from a 22 percent to a 28 percent lower risk of lung cancer when compared to males.

As with the 95 percent confidence interval, the hypothesis test for the incidence rate ratio will be done on the natural log scale. Recall that our incidence rate ratio is 0.75, and the log of this, you may recall, is negative 0.29. You may also recall that when we're comparing two groups with a ratio, the estimated standard error for the log of the incidence rate ratio is the square root of 1 over the number of events in the first group plus 1 over the number of events in the second group. For these data, there were 4,587 cases among the women and 5,692 among the men, so the standard error of our log incidence rate ratio comes out to about 0.02.
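Here's a minimal Python sketch of where those two ingredients come from, using only the numbers quoted above; the variable names are just for illustration.

```python
import math

# Case counts quoted above: lung cancer diagnoses in Pennsylvania, 2002
events_female = 4587
events_male = 5692

irr = 0.75                        # incidence rate ratio, females to males, as given
log_irr = math.log(irr)           # roughly -0.29

# Estimated standard error of the log incidence rate ratio:
# square root of (1/events in group 1 + 1/events in group 2)
se_log_irr = math.sqrt(1 / events_female + 1 / events_male)   # roughly 0.02

print(round(log_irr, 2), round(se_log_irr, 3))   # -0.29 0.02
```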
So let's do the hypothesis test for comparing the two incidence rates between men and women in the population from which these data come. Again, we've got all the data for one year in Pennsylvania, but we can think of this as a sample of one of many years, if we assume similar associations across the years from which this sample was taken. We can express the null hypothesis by saying the incidence rates are equal to each other, the incidence rate ratio is equal to 1, or the log of the incidence rate ratio is equal to 0. All of these are equivalent ways of stating the null, and we could change each equal sign to a not-equal sign to get the alternative.

As with every other test we've done, and by now you're probably saying this when you walk around or in your sleep, we start by assuming the null is true. Because we're doing the computation on the log scale, we measure the distance between the log of our observed incidence rate ratio and what we'd expect that log to be under the null (again, if the incidence rate ratio were 1 under the null, the log of that incidence rate ratio would be 0), and then we convert this distance to a p-value and make a decision about rejecting or not. We've already set up the two competing hypotheses: we assume the incidence rate ratio is 1 under the null, so the log incidence rate ratio is 0, and we'll figure out how far our observed log incidence rate ratio is from 0 in terms of standard errors. When we do this, we get something that is 14.5 standard errors below the value of 0 we'd expect under the null hypothesis. So, off the charts, if you will. I won't go through the arithmetic here, but you can check it if you're interested: the resulting p-value is very small, well less than 0.0001.

Using that as an upper bound, we could say that if there were no difference in the incidence rates of lung cancer for women and men in Pennsylvania, then the chance of getting an incidence rate ratio as or more extreme than 0.75 is less than 0.0001. So it is very unlikely to have gotten our observed incidence rate ratio of 0.75, or something even less likely, if in reality there is no underlying difference in these incidence rates between men and women. Based on the standard cutoff of 5 percent, we would reject the null hypothesis in favor of the alternative and conclude that the difference is statistically significant, and this is consistent with what we saw with the confidence interval for our comparison measures as well.

Now let's look at our Primary Biliary Cirrhosis study from the Mayo Clinic in the United States: 312 patients with Primary Biliary Cirrhosis randomized to either receive the drug DPCA or a placebo. You may recall we looked at the incidence rates of mortality for both the drug group and the placebo group, and they were nearly equal, but slightly higher for those who got the drug than the placebo. So there was no apparent benefit of this drug in terms of survival. The incidence rate ratio comparing the incidence rate of mortality for those who got DPCA versus the placebo was 1.06, indicating, like I said, a slightly higher risk in the drug group, but after accounting for sampling variability, the results were inconclusive, and the confidence interval includes the null value of 1.

So let's approach this with a hypothesis test, our two-sample z-test. We set up our two competing hypotheses and assume the null hypothesis is true: the underlying population incidence rate ratio is 1, or equivalently the log incidence rate ratio is 0, at the population level. We'll figure out how far the log of our observed incidence rate ratio, based on our study data, is from the expected value of the log incidence rate ratio, which, as we said, is 0 under the null. When we do that with these data (you can go back and check these computations if you wish), the log of 1.06 is 0.06, and our standard error on the log scale is 0.18, as we found in the previous confidence interval lecture, and this gives us a distance measure of 0.33. If we look this up under the assumed sampling distribution under the null, a normal distribution, we see that our result is not particularly unlikely: roughly 74 percent of the estimates we could have gotten under the null would be as far or farther from 0 than our observed distance of 0.33 standard errors.
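Here's a minimal Python sketch of this two-sample z-test on the log scale, checked against both examples above; the function name is just for illustration.

```python
import math

def z_test_log_irr(irr, events_1, events_2):
    """Two-sample z-test for an incidence rate ratio, carried out on the log scale."""
    log_irr = math.log(irr)
    se = math.sqrt(1 / events_1 + 1 / events_2)
    z = (log_irr - 0) / se                 # distance from the null value of 0, in SEs
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value from the standard normal
    return z, p

# Pennsylvania lung cancer, females vs. males: z is about -14.5, p far below 0.0001
print(z_test_log_irr(0.75, 4587, 5692))

# DPCA vs. placebo: the per-group event counts aren't quoted above, but using the
# stated log incidence rate ratio (0.06) and standard error (0.18) directly:
z_dpca = 0.06 / 0.18                              # about 0.33
p_dpca = math.erfc(abs(z_dpca) / math.sqrt(2))    # about 0.74
```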
So as a result, the researchers failed to reject the null and concluded the resulting difference in mortality was not statistically significant.

When the data include the event and censoring times, as with the Primary Biliary Cirrhosis data and some other datasets we've looked at, another test that can be used to compare the time-to-event distributions in two populations is called the log-rank test. This reframes things, not in terms of incidence rates, but in terms of the underlying time-to-event or survival curves in the two populations from which the two samples are drawn. The null hypothesis is that the survival curves are equal in the two populations being compared, versus the alternative that the survival curves are not equal. What this does is compare the distance we observe between the two Kaplan-Meier curve estimates based on our samples to what we'd expect that distance to be under the null, which is essentially 0: we'd expect the curves to be on top of each other. It compares these and decides whether the observed distance is large or not, after accounting for how variable the discrepancy could be just by random sampling, given our sample sizes, standard errors, etc.

So the idea is that the log-rank test compares the number of events observed at each event time in the two groups to the number of expected events in each group at that event time. Then the discrepancies (again, this should sound familiar) between what we observed in our study and what we'd expect under some assumption about the truth, namely the null, are aggregated across all event times and standardized by the uncertainty from sampling variability, the standard error, if you will, of these aggregated distances.

So let's look at the same study here, the DPCA drug trial on patients with Primary Biliary Cirrhosis. We already know the results are not statistically significant. We've seen it with the confidence intervals and with the two-sample z-test we just did for comparing the log incidence rates, and hence the incidence rate ratios. But here are the Kaplan-Meier curves that we observe for these two groups when we look at survival over time. The patients were followed for up to 12 years, and at the end of the 12-year period, approximately 40 percent of both groups were still alive. You can see these curves are not exactly on top of each other; they have differences across the follow-up range. The question is, are these differences large compared to what we'd expect to see if the two curves that generated these data were actually equal at the population level? That's what the log-rank test sets out to answer. What it does is take the total aggregated discrepancy, or distance, between what is observed in the samples and compare it to the distribution of such discrepancies across samples of the same size when the null is true. Then this gets translated into a p-value. For the DPCA/placebo comparison, the p-value from the log-rank test is 0.75, almost identical to the p-value from the two-sample z-test approach. They are testing the same underlying idea; I'll speak to the conceptual differences between the two approaches, and hence the slight computational differences between them, in a moment.
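Here's a minimal Python sketch of the log-rank mechanics just described: at each event time it tallies observed versus expected events in one group, aggregates the discrepancies, and standardizes them by their variance. The data in the example call are hypothetical, just to show the usage.

```python
import math

def logrank_test(times_1, events_1, times_2, events_2):
    """Two-group log-rank test. times_* are follow-up times; events_* are 1 for an
    observed event and 0 for censoring."""
    pooled = ([(t, e, 1) for t, e in zip(times_1, events_1)]
              + [(t, e, 2) for t, e in zip(times_2, events_2)])
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    observed_minus_expected = 0.0   # aggregated (observed - expected) events, group 1
    variance = 0.0                  # variance of that aggregate under the null

    for t in event_times:
        n1 = sum(1 for tt, _, g in pooled if tt >= t and g == 1)   # at risk, group 1
        n2 = sum(1 for tt, _, g in pooled if tt >= t and g == 2)   # at risk, group 2
        n = n1 + n2
        d = sum(1 for tt, e, _ in pooled if tt == t and e == 1)    # total events at t
        d1 = sum(1 for tt, e, g in pooled if tt == t and e == 1 and g == 1)

        observed_minus_expected += d1 - d * n1 / n   # expected share for group 1
        if n > 1:
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)

    chi_square = observed_minus_expected ** 2 / variance    # 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))          # chi-square(1) tail area
    return chi_square, p_value

# Hypothetical follow-up times (years) and event indicators for two small groups
print(logrank_test([2, 4, 4, 7, 10], [1, 1, 0, 1, 0],
                   [3, 5, 8, 9, 12], [1, 0, 1, 0, 0]))
```

In practice you'd use a packaged survival-analysis routine rather than hand-rolling this, but the hand version makes the observed-versus-expected bookkeeping explicit.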
Now let's look at our study on antiretroviral therapy and partner-to-partner HIV transmission that had a very striking result. You may recall this was a prospective cohort study on couples who were HIV serodiscordant, where one member of the sexual partnership had HIV and the other didn't, and they were randomized such that the infected partner either was aggressively treated for HIV or given the standard treatment. Aggressive treatment meant starting antiretrovirals at the start of the study; the standard group started antiretroviral therapy after the CD4 count had gone below a threshold. You may recall that there was a pretty substantial result. There were 28 linked transmissions between the serodiscordant partners in the two groups, and only one occurred in the early, or aggressive, therapy group. So the hazard ratio, or incidence rate ratio, of transmission to the partner for those who got the early therapy versus the standard therapy was 0.04, a very sizeable reduction: a 96 percent reduction in the risk of transmission.

I did not have the individual-level data, but from the article I could compute the incidence rate ratio, which we saw was 0.04, and I knew the number of events in both groups: there was one event in the early therapy group and 27 in the standard group, so we can actually do the two-sample z-test. If you do that, it comes out with a p-value of 0.002. I couldn't do the log-rank test myself, because I didn't have the individual, person-level data that the authors had access to, but they report a p-value of less than 0.01 for that test. I don't know how close their p-value was to the 0.002 I got, but certainly it was close enough in that they were both small numbers and would yield the same decision in terms of rejecting the null hypothesis. Generally, what you'll see in the literature when you have individual event times and censoring times is a p-value from the log-rank test, whereas in situations where you have a total number of cases and a total population size collected over a fixed period of time, the two-sample z-test is used, because we can't do the log-rank test without the individual times.

So how would you interpret the results here? Just to put it into context and bring in the confidence interval, we can say that in a study of 1,763 HIV serodiscordant couples, the risk of partner-to-partner transmission among the 886 couples randomized to receive early antiretroviral therapy was 96 percent lower than among the 877 randomized to receive standard antiretroviral therapy, with a p-value of 0.002. After accounting for sampling variability (this is just the confidence interval on the observed incidence rate ratio of 0.04, which went from 0.01 to 0.31), early ART could reduce the risk of partner transmission anywhere from 69 percent, at the upper end of the interval, to 99 percent, at the population level. This p-value of 0.002 means that if the underlying population-level rate of partner-to-partner transmission were the same in the populations of serodiscordant couples given early and standard antiretroviral therapy, then the chance of getting a sample incidence rate ratio of 0.04, or something more extreme, is two out of a thousand. So certainly low, and certainly lower than our threshold of five percent.
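Here's the same two-sample z-test arithmetic for this trial, as a small Python sketch using only the numbers quoted above (the incidence rate ratio of 0.04 and the 1 and 27 linked transmissions):

```python
import math

# Early vs. standard ART: incidence rate ratio 0.04, with 1 and 27 linked transmissions
z = math.log(0.04) / math.sqrt(1 / 1 + 1 / 27)    # about -3.2 standard errors from 0
p = math.erfc(abs(z) / math.sqrt(2))              # two-sided p-value, about 0.002
print(round(z, 2), round(p, 3))
```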
So, the two-sample z-test versus the log-rank test: when we have time-to-event data, what's the difference between the two? Well, the two-sample z-test uses information from an incidence rate ratio computed using the overall incidence rate in the first group and the overall incidence rate in the second group. By computing the incidence rates by this method, it's assuming that the incidence rates in both groups are constant across the entire time period of interest. So it computes one incidence rate for each group, the way we've been doing it, and assumes that the incidence rate of the outcome is constant across the entire time period. That's true whether we are doing this on case-count data, where we don't have the individual times, or on time-to-event data, where we do have individual times. If we use the incidence rate ratio, we are assuming that the incidence rates are constant in the two groups we're comparing. The log-rank test, however, because it is comparing the curves over the follow-up period, allows the incidence rates of the outcome to change over time in both of the groups being compared. So it's still comparing the overall incidence of the outcome between the two groups, but without requiring the rate to be constant in either group over the entire time period (the sketch after this passage makes the constant-rate idea concrete). In reality, this will very rarely result in much of a difference in the p-values we get from the two tests, but I want to point out that they are computed under slightly different assumptions about the nature of the underlying incidence rate of the outcome in the groups being compared.

In medical and clinical papers, and in any papers where cohorts are followed over time such that we can measure the event times and censoring times for the individuals in the study, Kaplan-Meier curves will be presented and usually the log-rank test will be presented. But again, were we to compute the incidence rates of the outcome in the two groups being compared across the entire time frame, we could compute the incidence rate ratio and do a z-test on that. We'd be testing the same fundamental idea and get results commensurate with the log-rank test.

So here's another example to compare the results from these two tests. This is the study where pregnant women in Nepal were randomized to receive vitamin A, beta carotene, or placebo during pregnancy, and the outcome of interest was infant mortality in the first six months after birth, by maternal supplementation group. Computing the overall incidence rates in each of the three treatment groups, assuming they're relatively constant over that follow-up period of six months, gives the three incidence rates shown here. We saw before that the incidence rates of infant mortality were very similar regardless of whether the mother got vitamin A, beta carotene, or placebo. If we compare vitamin A to placebo, the incidence rate ratio of mortality for children born to mothers who got vitamin A versus placebo is 1.05, which was not statistically significant. The p-value from the two-sample z-test using the log incidence rate ratio is 0.55; from the log-rank test it's 0.52. So not exactly equal, but similar, and they certainly lead to the same conclusion. We also looked at the incidence rate ratio comparing mortality for children born to mothers who got beta carotene to mothers who got placebo. It was exactly equal to 1 and not significant; 1 was not only in the confidence interval, it was our estimate. The p-value from the z-test is 0.84, and the p-value from the log-rank test is 0.82. So again, slightly different p-values, but the same overall result in terms of failing to reject the null hypothesis. So both the two-sample z-test and the log-rank test can be used to test competing null and alternative hypotheses about time-to-event data.
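To make the constant-rate assumption above concrete, here's a small Python sketch with hypothetical follow-up data: the z-test approach collapses each group to a single rate (total events over total follow-up time), whereas the log-rank sketch shown earlier never does that; it works event time by event time.

```python
# Hypothetical follow-up times (years) and event indicators (1 = event, 0 = censored)
times_a  = [0.5, 1.2, 3.4, 5.0, 6.0]
events_a = [1,   0,   1,   0,   0]

times_b  = [2.1, 2.8, 4.5, 6.0, 6.0]
events_b = [0,   1,   0,   0,   0]

# The z-test approach uses one number per group: total events / total person-time,
# implicitly treating that rate as constant over the whole follow-up period.
rate_a = sum(events_a) / sum(times_a)   # events per person-year in group A
rate_b = sum(events_b) / sum(times_b)   # events per person-year in group B
irr = rate_a / rate_b                   # the single ratio the z-test works with on the log scale
```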
The log-rank test is most commonly presented in the literature whenever we have clinical or other scientific data collected at the individual level, where we have the censoring and event times. But the two-sample z-test is a nice, easy-to-implement-by-hand approach that is very similar in spirit to the two-sample t-test for comparing means. Additionally, in situations where we have rate data without individual times-to-event, just cases and counts collected over a year or years, like our lung cancer cases in Pennsylvania, we can't do the log-rank test. Because of the slightly different mechanics, the p-values from these two tests may differ slightly in value, but both tests use the same logic as all other hypothesis tests we've seen. One nice feature of the log-rank test is that it can be extended to compare survival curves between more than two populations with one overall test.
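As one sketch of that multi-group extension, assuming the Python lifelines package is available, its multivariate log-rank routine compares all groups with a single overall test; the data below are hypothetical, just to show the call.

```python
# A sketch assuming the lifelines package is installed (pip install lifelines).
from lifelines.statistics import multivariate_logrank_test

# Hypothetical follow-up times (months), group labels, and event indicators
durations = [2, 3, 4, 4, 5,   1, 3, 5, 6, 6,   2, 4, 5, 6, 6]
groups    = ["vitamin A"] * 5 + ["beta carotene"] * 5 + ["placebo"] * 5
events    = [1, 0, 1, 1, 0,   1, 1, 0, 0, 1,   0, 1, 1, 0, 0]

result = multivariate_logrank_test(durations, groups, events)
print(result.test_statistic, result.p_value)   # one overall test across all three groups
```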