Welcome to Lecture Set 5 in Public Health Statistics, where we'll talk about data that have an element of time we have to deal with when summarizing them. Throughout this set of lectures, we'll look at numerical and graphical measures for both summarizing results in single samples and comparing results between samples when there's an element of time in the measures we're looking at. So, we'll first talk about sample incidence rates as summary measures for time-to-event analysis and spatial data collected over time.
So, for spatial data, which include event counts and person totals in a defined observation period, perhaps across multiple regions, you will be able to summarize the event counts, standardized by the person counts, as event rates. For time-to-event outcomes where the individual outcome times are known, so we have data at the individual level, you will be able to distinguish between the calendar time and study time scales for time-to-event data, define censoring in the context of time-to-event data, and explain why either ignoring the time component or averaging subject follow-up times can be problematic for summarizing time-to-event data, and hence why we need other measures like incidence rates that use event counts and cumulative follow-up times. So, whether we have spatial data, where we have counts and person totals in a defined observation period, so the number of cases per number of persons per year, or we have individual times that we can aggregate to get the total person-time, we'll see that incidence rates are a numerical summary measure that addresses the time component.
So, let's first talk about event rates for data without known event times, and this would include spatially collected data over a fixed period of time. So, for some data involving events occurring over time, the exact event times are not recorded but are grouped into time intervals.
This is the case for many death and disease rates grouped by area (country, state, city, region, zip code, et cetera) and by year, so by area by year or some other unit of time. So, let's look at an example of this. Here are some data from the year 2002 on the incidence of lung cancer diagnoses in the state of Pennsylvania in the United States and on the overall state population as well. The data that we have are stratified by county, sex, race and age groups, but for now, we'll aggregate them into an overall look at the state. So, how can we summarize this for the entire state of Pennsylvania in the year 2002? Well, we can compute what's called an incidence rate of lung cancer in the year 2002 by taking the total number of cases that accrued in that year and dividing by the total person-time at risk in that year.
So, in these data from Pennsylvania, the individual diagnosis times for new cases in 2002 are not known, so what we do for the total person-time at risk is assign one year to each member of the sample being analysed. So, we assume everybody lived in Pennsylvania for the entire year, was observed for a year, and contributed a year of follow-up to our understanding of the process of lung cancer. Certainly, people develop lung cancer at various points during that year, but we're going to count everyone equally in the person-year total. So, in Pennsylvania in 2002, there were 10,279 lung cancer diagnoses and 12,281,054 residents in the state. So, again, we don't know the exact times of the lung cancer diagnoses and we don't know how many residents stayed for the entire year versus moved in or out, so we're going to assume that all 12 million were there for a year and that all cases were diagnosed at the end of the year. The incidence rate under this assumption, represented by IR with a hat over it to indicate that it's an estimate, is 10,279 cases per 12,281,054 person-years.
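As a minimal sketch of this calculation, the Python below takes the two Pennsylvania figures quoted in the lecture and computes the estimated incidence rate, then rescales it. The variable names are illustrative only, and the one-person-year-per-resident line simply encodes the assumption stated above.

```python
# Figures quoted in the lecture for Pennsylvania in 2002.
cases = 10_279           # new lung cancer diagnoses in the year
residents = 12_281_054   # state population

# Assumption from the lecture: every resident contributes exactly
# one person-year of follow-up, since individual times are unknown.
person_years = residents * 1.0

rate = cases / person_years      # cases per person-year
rate_per_10k = rate * 10_000     # cases per 10,000 person-years

print(f"{rate:.6f} cases per person-year")
print(f"{rate_per_10k:.1f} cases per 10,000 person-years")
```

Changing the multiplier changes only the units the rate is reported in, not the underlying estimate.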
So, if we convert this to a decimal, this comes out to be about 0.0008 lung cancer cases per person per year, sometimes given as per person-year. The incidence rate can certainly be rescaled for different amounts of person-time, and it's usually done so the numerator has an integer component. So, for example, this incidence rate came out as a decimal to be 0.0008 cases per person-year, or per person per year. We could rescale it to per 10,000 person-years by multiplying it by 10,000, and then our numerator would be an integer, eight. So, another way to express the same summary statistic is eight cases per 10,000 person-years in the state of Pennsylvania in the year 2002.
So, you might say, "Well, wait a minute John, isn't this just a proportion, a measure for a binary outcome? You take the number of cases out of the total population count." Essentially, it is, but there is an element of time, so technically it's a rate because we are looking at this proportion over a year of follow-up. Nevertheless, you could think of this in percentage or proportion terms. There's a 0.08 percent incidence of new cases in the year. In other words, 0.08 percent of the sample under study developed lung cancer in 2002, and remember there were 0.0008 cases per person per year, which as a percentage is just 0.08%. However, even if we think of this as a percent over some unit of time, or a proportion like we did with binary outcomes, these rates tend to be very small as proportions, and as such, their statistical properties will differ from those of the proportions as we have defined them previously. These tend to be very close in numerical value to zero when taken as a proportion, and technically, there's also an element of time.
So, in some situations, we have time-to-event data where we know the event times. So, in the lung cancer dataset, we knew the total number of cases that were diagnosed in the year, but we weren't given information on when the diagnosis was in that year.
Was it in the first month, was it in the 11th month, et cetera? We don't know, and that's why we had to make those assumptions about the cases being diagnosed at the end of the year and everybody living in Pennsylvania for the entire year to get an incidence rate. But for some time-to-event data, the individual event times are known, and then this individual event time information can be incorporated into incidence rate computations. So, this is the case for many longitudinal cohort studies where subjects are followed from a defined starting point up to a certain amount of time.
So, let's look at an example here to kick this off. We're looking at a randomized trial conducted at the Mayo Clinic in Rochester, Minnesota, in the US, and here's a description. This was on patients with primary biliary cirrhosis, and they were randomized to receive either a drug or a placebo. The study began on January 1, 1974, and patients were accrued up until December 1983. During that 10-year period, 422 patients with primary biliary cirrhosis satisfied the admission or entry criteria for the study, and 312 of them consented to enter the study and were then randomized to be on the drug or placebo arm. The primary outcome of interest for this study was survival, or, another way of looking at it, death in the follow-up period. Ultimately, the researchers were interested in comparing the incidence of death in those who got the treatment, D-penicillamine, versus those who got a placebo. But let's use this as a springboard for thinking about what can happen when we follow subjects over time in a cohort setting.
So let's look at a couple of examples of patients we may get in the study. So we have patient one, for example. He or she enters the study right when the study started in January of 1974, and is followed for seven years, at which point he or she dies. So he or she has the event after seven years.
In terms of the study-time window, this person's time zero happened to be the start of the study, so in terms of the study time, they were also followed from the time of randomization for seven years. So what do we know about this person? We know that he or she did ultimately have the event under study, death, and it was seven years after they were assigned to a treatment group.
Let's look at another person, subject two. Subject two did not enter the study at the beginning of the study. Remember, there was a long accrual period, so people were admitted well after that initial start date of January 1974. So this person enters in June of 1978, and is followed up until the end of May 1980. So in terms of the calendar time, they started well after the beginning of the study, in the sense that they started more than four years after the study opened for people to participate, and were followed for two years after they entered, at which point they were lost to follow-up. At that point, the last visit in May of 1980, they were still alive. So all we know about this person is that they entered in June of 1978 and were still alive as of May of 1980. We don't know when they went on to die, we just know it didn't happen before May of 1980. So in terms of our study time, if we're calculating time in study, the point when this person entered the study is their time zero for the study. That's when they were randomized to the treatment or placebo, and after that they were followed for two years, that period from the beginning of June 1978 to the end of May 1980. They were followed for two years, at which point they were lost to follow-up. We have no information about what happened after two years. All we know is that they were still alive two years from randomization.
Then let's suppose we have another person who entered the study in November of 1980.
So later in the accrual period, they were still enrolling people in November of 1980, and this person actually stayed alive and was still alive at the last measurement or check for the study, in December of 1983. So this person wasn't lost to follow-up so much as they were not followed anymore because the study officially ended. So what do we know about this person from their time of randomization until they stopped being followed? It was about a three-year period. So they survived, if you will, or made it three years into the study without having the event of death, at which point we were no longer doing the study and they were no longer being followed.
So if we put all three of these subjects together on the study time graphic, so we map everyone to their measurement in terms of study time, here's the first patient. He or she, from the time of randomization, which was time zero, made it seven years, at which point they had the event and died. Patient two made it two years from the time of randomization, their time zero, and was still alive at two years when the researchers last saw the patient. Then patient three made it three years, at which point the study ended, and patient three was still alive at three years. So all we know about their time to death is that it had to be more than three years after they were assigned to a treatment group.
So in terms of complete versus censored observations, subject one is what we might call a complete observation. We know that he or she had the outcome under study: they actually died after seven years in the study. So we know that they died and when they died. Subjects two and three are called censored observations. We have partial information about the outcome under study, death. We don't know when they died, but we have a lower bound on when they could have died. So while subject two was still alive when he or she was lost to follow-up, we know that he or she survived two years on the study clock.
So they couldn't have died one year after treatment assignment, or one and a half years after. If they did die, it was beyond two years. Similarly, we know that subject three made it three years without dying before the study ended. So for both of them, we don't have a death time, but we have a lower bound on what their death time could be. For subject two it had to be more than two years, and for subject three it had to be more than three years. So there is some information, partial information, from these censored observations.
So how could we summarize what happened with these three patients numerically? Well, option A would be to treat death as binary and report the proportion who died in the follow-up period. So if we were just looking at these three subjects, a cohort of three, one of the three patients died, so the proportion of those three who died was one in three, or 33 percent. The problem with this is that the amount of time each of these patients was followed after randomization, and hence their time at risk of death from randomization, varies from person to person. Taking a simple proportion ignores this fact and gives all three persons equal influence in computing a summary measure of the event.
So, another option would be to treat the follow-up times as continuous and report the average time. So, treat this as a continuous measure. So, we'd average them: seven plus two plus three, over three, is four years. But what would this be an average time for? Well, only one of the three subjects died while in the study, so this average is not capturing the average time to death since enrollment, only the average follow-up time. If we tried to use this as a measure of average time to death, it would systematically underestimate the true average, because we would be including two persons whose times were not their death times, but the time of last follow-up.
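To make the two naive summaries concrete, here is a small Python sketch using the three hypothetical patients just described. The list names are illustrative, and the code only reproduces the arithmetic from the lecture; neither number is being recommended as a summary.

```python
# Follow-up time in years on the study clock, and whether death was
# observed, for the three example patients (patients one, two, three).
follow_up_years = [7, 2, 3]
died = [True, False, False]

# Option A: treat death as binary and ignore how long each person
# was followed.
proportion_died = sum(died) / len(died)

# Option B: average the follow-up times as if they were all death
# times, even though two of them are censored.
mean_follow_up = sum(follow_up_years) / len(follow_up_years)

print(f"proportion who died: {proportion_died:.2f}")
print(f"mean follow-up: {mean_follow_up:.1f} years")
```

The first number ignores the time dimension, and the second mixes one death time with two censoring times, which is why neither works on its own.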
If we want to represent the data properly, we can't treat them like binary outcomes and ignore the time element, and we can't just treat the time element like a continuous measure, because some are full pieces of time data, the time to death, and some are partial pieces of information and not a death time. So, we're going to handle this like we did with the spatial data. We're going to take an incidence rate, but now we know how much time each person was followed before they died or dropped out. So, what we're going to do is take the total number of events, in this case deaths, that occur in this sample and divide by the total amount of follow-up time contributed by the sample of three. So what we observed in the sample was one death, but the three persons were cumulatively followed for seven plus three plus two years, or a total of 12 years. Now, this computation does assume that the incidence rate is constant across the entire follow-up period, and that's what we'll have to assume for now, but later, in the second part of the course, we can allow that to change over time. So, the observed incidence rate for this sample of three was one death per 12 person-years of follow-up.
So, let's look at some examples of this computation from the literature. Here's an interesting study on antiretroviral therapy and partner-to-partner HIV transmission. In nine countries, the researchers enrolled 1,763 couples in which one partner was HIV positive and the other was HIV negative. The majority of the subjects were from African countries, and 50 percent of the infected partners were men. What they did was this: HIV-positive subjects who had CD4 counts between 350 and 550 cells per cubic millimeter were randomly assigned in a one-to-one ratio to receive antiretroviral therapy either immediately, as soon as they entered the study, or after a decline in the CD4 count or the onset of HIV-related symptoms.
So, this other group, the delayed therapy group, would not be treated with antiretrovirals until they hit a certain threshold. The primary prevention endpoint was linked HIV-1 transmission between the HIV-positive partner and their HIV-negative sexual partner. So, what they did is they enrolled HIV-1 serodiscordant couples at 13 sites in nine countries, and they go through the countries here. A pilot phase started in April of 2005, and enrollment took place from June 2007 through May 2010. So, people may start at different calendar times, and when they analyzed the data, everybody's time of enrollment was reset to be their time zero on the study time scale. So, it turns out as of February 21st, 2011, a total of 39 HIV-1 transmissions were observed, and the incidence rate across the entire cohort, not yet broken out by the randomized groups, was 1.2 infections per 100 person-years. They got that by taking the 39, the total number of transmissions, divided by the total amount of follow-up time contributed by all couples in the study. They found that 28 of these 39 were virologically linked to the infected partner, that is, acquired directly from that partner, and the incidence rate here was 0.9 transmissions per 100 person-years. So, they used incidence rates to summarize the partner-to-partner transmission experience in this cohort.
Here's another example, on infant mortality, and this was done in Nepal. It was a randomized trial in which pregnant women were randomized to receive vitamin A, beta-carotene or a placebo during pregnancy, and of interest was infant mortality in the first six months after birth. So, a total of 43,559 women were enrolled; 15,892 contributed 17,373 pregnancies and 15,997 live-born infants to the trial. We were able to get, very kindly from the investigators, a two-thirds random sample of the live births.
So, of these 10,295 births that we had access to, the children were followed for up to six months after birth and contributed collectively 1.6 million days of follow-up. The total number of children who died in this six-month follow-up period was 644. So, if we were to compute the incidence rate of infant mortality in the six months after birth for children born to all mothers, regardless of randomization group (vitamin A, placebo or beta-carotene), we saw 644 deaths per total cumulative follow-up across all children of 1.6 million days, or approximately 0.0004 deaths per day. So, we could change the scaling from deaths per day to deaths per person-year, or child-year, by multiplying this per-day rate by 365 days, and then we could prorate this to 500 person-years by taking the 0.146 deaths per year and multiplying by 500. So, 73 deaths per 500 person-years of follow-up is another way to express this rate of 0.0004 deaths per day.
So, just a note on terminology. Analysis techniques for prospective cohort data where time to an event is of interest have several synonymous names: incidence analysis, survival analysis, time-to-event analysis, failure time data analysis. Survival analysis is the most commonly used term when we have individual event times and these are known. Incidence analysis is perhaps the most commonly used term for spatially collected data where we don't have the individual times. But in neither case does the event of interest have to be death, and we've seen examples where it isn't. It can be death, like in our Mayo Clinic example or the study of vitamin supplementation for pregnant mothers, but the event of interest could also be HIV transmission, as in that partner-to-partner transmission study, or it could be lung cancer diagnosis, as it was in that Pennsylvania study.
So in summary, event data collected on a sample of initially event-free subjects followed over time are two-dimensional.
For each subject there's a time measure and also a binary indicator of whether the event occurred. In some cases, the time measure is the same for every individual in the sample. When we have spatial rate data from vital statistics sources, like the Pennsylvania data, which are aggregated over a year, that's a situation where we assume the same follow-up time for every individual in the sample. In some cases, like the cohort studies we looked at, the individual event times are known and recorded. So, randomized clinical trials, prospective observational studies, et cetera. In both cases, though, the incidence rate summarizes the two dimensions of our data, the time component and whether the event occurred, in a single number. So, what we'll look at in subsequent sections is how to compare incidence rates between different samples to estimate associations between two or more populations from which the samples are drawn, what some other summary measures of time-to-event data are when the individual times are known, and some graphical displays.
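As a recap of the two-dimensional idea, here is a minimal Python sketch that stores each subject as a follow-up time plus an event indicator and collapses the pair into an incidence rate. The `Subject` type, the `incidence_rate` function and the `per` argument are hypothetical names chosen for illustration, and the data are the three example patients from the Mayo Clinic discussion.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    years: float  # follow-up time on the study clock
    event: bool   # True if the event was observed, False if censored


def incidence_rate(subjects, per=1.0):
    """Events divided by total follow-up time, scaled to `per` person-years.

    Assumes, as the lecture does, a constant rate over follow-up.
    """
    events = sum(s.event for s in subjects)
    person_time = sum(s.years for s in subjects)
    return per * events / person_time


# The three example patients: one death at 7 years, two censored.
cohort = [Subject(7, True), Subject(2, False), Subject(3, False)]

rate = incidence_rate(cohort)               # deaths per person-year
rate_per_100 = incidence_rate(cohort, per=100)

print(f"{rate:.4f} deaths per person-year")
print(f"{rate_per_100:.1f} deaths per 100 person-years")
```

Censored subjects add nothing to the event count but still add their follow-up time to the denominator, which is how their partial information gets used.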