So, we've shown how to summarize time-to-event data numerically with incidence rates, but incidence rates alone are sometimes hard to interpret in terms of the magnitude of the burden of the outcome. Certainly, we can compare them in ratios to get an estimated relative difference in incidence of the outcome between any two samples. But trying to understand what the impact is on a group of persons in terms of the percent of people who have the event especially over time when we have the individual times-to-events or censoring would be helpful. So, that's what we're going to talk about in this section. So, upon completion of this lecture section, you will be able to explain the purpose of a survival curve and its basic properties. The way we're going to estimate these curves is through a method called the Kaplan-Meier process and the results will be called Kaplan-Meier curves. So, you'll be able to interpret the Kaplan-Meier curve estimates as survival curves with respect to summarizing time-to-event data for samples of data. We'll also go through the process of estimating the Kaplan-Meier curve "by hand" for small sample data just to illustrate the process and make you think about how we're using information on both those who have the events and those who were censored in the estimation process. We'll give approximate estimates of event time percentiles from a Kaplan-Meier curve and interpret something that I'll explain once we've established the Kaplan-Meier curve, something called the complementary presentation of this curve. So, incidence rates are appropriate numerical summaries for time-to-event data in that these incorporate the two dimensions of the data, the time factor and the occurrence or count of events into a single statistic. 
However, time-to-event data is two-dimensional, and to capture the richness in such data visually, a graphic would have to display both the dimension of time as it unfolds from the start of the study and the occurrence of events over time. So, a common visual display for time-to-event data is what's called a survival curve, and this can be estimated from a sample of time-to-event data where the event or censoring time is known for each individual, using the Kaplan-Meier approach. So, what we're trying to estimate is called the survival curve, sometimes abbreviated S(t), and this is a function of time: at any given time in the study follow-up period, it gives the proportion who have not had the event by that time. In other words, the proportion who remain event-free or, quote-unquote, survive beyond that time. So, by definition in a cohort study situation, where we're following people from some study start time to when they had the event or were censored, everybody at the beginning of the study is event-free. We certainly would only include people who are alive if we were studying the outcome of death. If we were studying the outcome of quitting smoking, we would only include people who were currently smoking at the beginning of the study. So, by definition, the curve starts at 100 percent or one: the proportion of people who have not had the event of interest at time zero, or who make it beyond time zero, is one. This curve then will either stay constant or decrease depending on when the events occur along time in the cohort, so it can only keep the same value or decrease over the follow-up time period. So, this survival curve can be estimated for a population based on a sample of data. The estimated curve, which we'll denote as S with a hat of t, is based on data from all subjects in the sample, both those who have the outcome of interest and those who are censored.
We will demonstrate the estimation procedure shortly, but first, let me just give you some examples of the curves once estimated. These are estimated by the Kaplan-Meier method and hence called Kaplan-Meier curves, sometimes abbreviated as KM curves. So, with the Mayo Clinic data we had on primary biliary cirrhosis, before we started comparing survival between the treatment and control groups, we were able to estimate the survival in the overall cohort. The overall incidence rate of death in the follow-up period was 125 deaths per 1,715 person-years or 0.073 deaths per person-year, but taking that as just a numerical summary, it's hard to figure out whether that's a high rate of death or a low rate, at least to me. So, something that would be helpful in understanding what that means in terms of death outcomes would be a graphic that shows me the proportion of people who were still alive over the follow-up period. So, this is the Kaplan-Meier estimate of survival for the entire cohort enrolled in the study, including both the drug and the placebo arm. You can see it starts at 100 percent from each individual's time of randomization, since everybody was alive at their respective time of randomization, and it tracks the proportion who have still not died by a certain time. So, for example, at four years after the start of the study, a little less than 80 percent of the original cohort was still estimated to be alive. By the time we get to eight years, that's dropped to a little less than 60 percent, and it continues to go down until the end of the study. Luckily, at the end of the study, there were some people who were still alive, which is why this curve does not go to zero but stops at a little over 30 percent. Still, it's a pretty sobering statistic that after 12 years of follow-up, only a little over 30 percent of the original cohort were still alive, and that puts a real context on that incidence rate that we saw before.
It shows that death is really common and likely across the follow-up period in the cohort of patients with this disease. Let's look at the infant mortality data from Nepal, and we'll look at all infants studied here. Even though they came from the three groups to which their mothers were randomized, we'll combine them here for an overall incidence rate, and the incidence rate of death was 644 deaths per 1.6 million infant-days of follow-up time or 0.0004 deaths per infant-day. We could re-express this with different units, but regardless of whether I express it as deaths per day, deaths per year or deaths per 500 person-years of follow-up, it's hard for me to conceptualize what that means in terms of the impact on this cohort of children. So, to actually get a Kaplan-Meier curve, to actually look at the percentage of children who remain alive over the 180-day follow-up period (remember we're measuring death in the first six months after birth), is really helpful. Luckily, we can see from this Kaplan-Meier curve that the large majority of infants who are born remain alive at the end of the six-month period, but sadly, it's only on the order of 90-something percent, meaning that somewhere between five and maybe 10 percent actually died in those first six months. It's hard to see on this graphic because the curve only goes down, so to speak, to 90 or so, but the axis generally runs from zero to one. So, here's another version where I've zoomed in a little bit and made the axis only run from 90 percent to 100 percent. Now we can really see that at the end of the six-month period, roughly 94 percent of the children were still alive, which sadly means that six percent had died in the 180-day period. Also, what you can get from this is that it seems that the deaths occur more frequently early after birth, and if a child makes it beyond, say, 60 days alive, their risk of subsequently dying tapers off.
So, we can learn something from these visual displays that we can't necessarily get from an incidence rate, which is averaged across the entire follow-up period. So, how is this Kaplan-Meier curve estimated? Well, it's generally done with a computer, and that's what I did for the previous two examples, which were rather large and for which it would be quite cumbersome to do the calculations by hand. But just to show you how the Kaplan-Meier curve uses all the data, both the complete observations, where we know that they had the event and at what time they did, and the censored observations, I'm going to show you an example of estimating this by hand with a small sample. So, the method, as I said before, uses both complete data on persons where we know that the event occurred and when it did, and it also can use the incomplete data in censored observations. These still give us information about who is at risk of having the event at a given time in the follow-up period. People who are censored are considered to be at risk of having the event until the point at which they're censored and they're no longer being followed with the rest of the cohort. Two interesting things to note about the Kaplan-Meier curve. First of all, the Meier in Kaplan-Meier was a statistician named Paul Meier, who spent some time in my department, the Johns Hopkins Department of Biostatistics, so I have some sort of connection to the curve. Secondly, the way this paper came about is that both Meier and Kaplan submitted separate manuscripts to the Journal of the American Statistical Association at about the same time, and they were both working on this idea of how to quantify or summarize time-to-event data when there is censoring. So the editors of the journal encouraged them to get in touch with each other and work together on the problem, and ultimately they did, and statistical history was made. So let's just look at it with a small example to walk through the ideas behind this.
We have 12 subjects who attend a smoking cessation workshop, and after the completion of the workshop they're followed for up to one month. They are all still smoking at the end of the workshop, but the hope is that they will quit shortly thereafter. So they are followed until they either quit smoking in the follow-up period or are lost to follow-up, i.e., censored. The data are as follows. So the times are ordered from smallest to largest and they're in days. When we see a plus next to an observation, that means that's the time at which that observation was censored. If there's no plus, then that's an actual event time, or when the person quit smoking. So the first person quit smoking two days after completing the workshop. The second person was lost to follow-up three days after completing the workshop, and at that point they had not quit smoking. The third person quit smoking six days after the completion of the workshop, and so on and so forth. One thing to note is that the person who was followed for the longest made it through the entire study period and still had not quit smoking at the end of the 30-day observation period. So what we're going to do is estimate a curve that will start at one. We're going to estimate S of t, the proportion of persons who have still not quit smoking by a certain time and hence survive, or remain event-free, beyond that time. By convention this will start at one or 100 percent at time zero, the time when all completed the workshop, and it will not change from 100 percent until we observe the first time a person quit smoking, at two days. Subsequently the curve will only change when we observe a quitting event. At each event time we will look at the total number of persons at risk of quitting smoking, those at risk of having the event, and those will include everyone who has neither quit by that point nor been censored.
So if they're still being followed at that point in the follow-up period, they are at risk of having the event. So what we'll do at each event time is calculate the cumulative proportion of the original 12 in the cohort who still have not quit smoking by doing the following, and we're going to break it up into two pieces. At each event time we're going to look at only those who were still being followed in the study. We'll call them N of t, the number who are still at risk of having the event at this given time. Then E of t will represent the number who have had the event at this time, and so the first thing we'll look at, N of t minus E of t, is the number who have not had the event. So that entire numerator is those who have not had the event, who did not quit smoking at that time, and we'll divide by the total number of people who were at risk of quitting smoking at that time. So this is the proportion of people still at risk at that time who did not quit smoking, and what we'll do, and I'll explain why momentarily when we show the data, is multiply it by the survival estimate beyond the previous event time. So, let's do this. The data are as follows, as you remember. By convention S hat of 0, the proportion who have not quit smoking by time 0 and hence survived beyond time 0 without having the event, is one or 100 percent. 100 percent of the sample have not had the event, have not quit smoking at time 0, and the curve will remain at 1 until the first event occurs at two days. So, S hat of 1 equals 1, S hat of 1.5 days equals 1, et cetera. So again, the data are as follows; times are in days and censoring is indicated by a plus. So the curve is only going to change when we see the first event, the first quitting of smoking, at two days, and at this time all 12 persons in the original cohort are at risk of quitting smoking. So N of 2 equals 12 and only one person does quit. So, E of 2, the number of events at two days, is 1.
There are no previous event times because this is our first event. So, we're going to calculate S hat of 2, the proportion of persons who make it beyond two days without quitting, by doing the following. We'll take the number at risk, 12, and subtract the number of events, 1, to get the number of persons who did not quit at time two, 11, and there were 12 persons who were eligible, or at risk of quitting, at two days. So, 11 out of 12, or 92 percent, of those who could've quit at two days did not and made it beyond two days without having the event. Since there was no event before this, this is also the proportion of the original 12 who made it beyond two days without quitting smoking. So the next event occurs at six days. Nobody quit smoking again until six days, but in the interim somebody drops out or is lost to follow-up. They didn't have the event at time three, so we don't change the curve because we can't count them as a quitting smoking event. So at t equals six, the next event time, we've lost the person who quit at time 2, they're no longer at risk, and we've lost the person who dropped out at three days. They're no longer at risk because we're not following them; they're not at risk on our clock anyway. And so we only have 10 persons in the sample left being followed, and these 10 persons are at risk of quitting smoking at time 6. Again, at six days only one person does quit. So the number of persons at risk at six days is 10 and the number of events is 1. So in order to calculate the proportion of those still being followed at six days who do not quit, we do this first part: we take the number at risk, which is 10, minus the number of events, 1, so we have nine persons who didn't quit divided by the 10 people at risk. So 90 percent of those who were still being followed at six days did not quit. The catch is that this is among those who were still being followed, not among the original 12 we started with.
So how can we use this to estimate the proportion of the original 12 who survive beyond six days, or made it beyond six days without quitting smoking? Well, we can multiply it by the estimated proportion who survived beyond the previous event time; that is what we take as the cumulative proportion of the original cohort who were still eligible to quit at this next event time. So, what we see is that 90 percent of the 92 percent of the original 12 who were still being followed at six days did not quit smoking at six days, for a cumulative proportion of 83 percent. You see the distinction here: this 90 percent is the proportion only among those who were still actively being followed in the study, and this 83 percent tries to estimate, using that and the estimate as of the previous event time, the proportion of the original cohort of 12 who survive beyond six days without quitting smoking. Let's do this again at time equals 8, the next event time. By the time we get to eight days, we've lost three people: two have quit smoking before eight days and one has dropped out. So only nine persons of the sample are still being followed and still at risk of quitting smoking. So N at eight days, the number at risk, is nine, and the number of events is one. So the proportion of those nine who actually make it beyond eight days without quitting is eight out of nine, but that's only among those who were still being followed. We want to translate this into the proportion of the original 12 that we started with, and the best estimate we have of the proportion of the original 12 who would still be being followed at eight days is the 83 percent who made it beyond the previous event time without quitting smoking. So, 89 percent of the 83 percent of the original sample who were still being followed made it beyond time 8, for a cumulative proportion of 74 percent of the original sample.
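As a minimal sketch of the hand calculation just described, the running product S hat(t) = S hat(previous event time) × (N(t) − E(t)) / N(t) can be written in a few lines of Python. The function name `kaplan_meier_steps` and the tuple layout are illustrative assumptions, not anything from the lecture; the inputs are only the first three event times worked through above (at risk 12, 10 and 9, one quit each time).

```python
def kaplan_meier_steps(steps):
    """Return [(time, S_hat)] from (time, n_at_risk, n_events) tuples.

    Hypothetical helper: each tuple describes one event time, in order.
    Censored subjects never appear as events; they only reduce n_at_risk
    at later event times.
    """
    survival = 1.0  # by convention S_hat(0) = 1
    curve = []
    for time, n_at_risk, n_events in steps:
        # Proportion of those still followed who did NOT have the event now
        conditional = (n_at_risk - n_events) / n_at_risk
        # Multiply by the estimate carried forward from the previous event time
        survival *= conditional
        curve.append((time, survival))
    return curve


# First three event times from the smoking cessation example:
# day 2 (12 at risk), day 6 (10 at risk), day 8 (9 at risk), one quit each.
first_three_events = [(2, 12, 1), (6, 10, 1), (8, 9, 1)]
curve = kaplan_meier_steps(first_three_events)

for time, s_hat in curve:
    print(f"day {time}: S_hat = {s_hat:.3f}")
```

Carrying unrounded values forward gives about 0.917, 0.825 and 0.733. The 83 and 74 percent quoted in the hand calculation come from rounding to whole percents at each step before multiplying, so the two agree to within about a percentage point.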
I'm not going to finish the computations here, and you may want to try them by hand and ask me questions if you don't get the same results, but here are the estimated cumulative survival proportions at each of the event times. Actually the curve would extend to 30, but there's no event at 30, so the curve stays at 19 percent and ends at 19 percent at 30 days, when the follow-up ends for the cohort. This actually screams out to be put in a curve, and this is exactly what we have here, the Kaplan-Meier curve for these data. Now, notice it's not as smooth and fluid as the other curves because we have far less data in the sample. But what it does is it starts at one and it stays at one until two days, and you can think of there being a jump here. I'm going to put a little circle at two, indicating that at two days the curve jumps from 100 percent down to 92 percent. Then it stays there until we hit the next event at six days, and jumps. It continues to do so until we get down to the last event time, which was 27 days, when the curve jumped down to 19 percent. But persons were still followed actively; there was somebody followed for 30 days who hadn't quit smoking. So, our curve stops at 30 days, at the 19 percent who had not quit smoking by 27 days. The thing about the Kaplan-Meier curve is it doesn't assume any structure to what's going on in between event times, so there's no interpolation here. It just jumps and stays flat till the next one, and doesn't try to interpolate or assume anything about what would be going on had we observed events in between those event times. So the graph, as we've seen with these 12 data points, is a step function. In the larger datasets we looked at, like the primary biliary cirrhosis and the infant mortality data, it didn't look as step-like because there were more events and less time elapsing between the events. But again, nothing is assumed about the shape between each observed event time.
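To make the step-function behaviour concrete, here is a small sketch that looks up the estimated curve at an arbitrary time by carrying the last estimate forward. The name `survival_at` is a hypothetical helper, and only the first three steps worked out by hand (days 2, 6 and 8) are included, so the lookup is only meaningful for times before the fourth event.

```python
from bisect import bisect_right


def survival_at(t, event_times, survival_values):
    """Evaluate a step-function survival estimate at time t.

    Hypothetical helper: event_times must be sorted, and survival_values[i]
    is the estimate from event_times[i] until the next event time.
    """
    # Number of event times that are <= t
    idx = bisect_right(event_times, t)
    # Before the first event the curve sits at 1; otherwise carry forward
    # the most recent estimate with no interpolation.
    return 1.0 if idx == 0 else survival_values[idx - 1]


# First three steps of the smoking cessation example (unrounded values).
event_times = [2, 6, 8]
survival_values = [11 / 12, (11 / 12) * (9 / 10), (11 / 12) * (9 / 10) * (8 / 9)]

for t in [0, 1.5, 2, 5, 6, 7.9]:
    print(f"S_hat({t}) = {survival_at(t, event_times, survival_values):.3f}")
```

The value at day 5 is the same as at day 2, and the value at day 7.9 is the same as at day 6: the estimate only moves at an observed event time.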
There's no assumption that it's linear or a certain curve, and so we carry forward the estimate of survival through to the next event time. We can use these curves to estimate percentiles of the event times as well. So let's look at percentile estimates of time to death in the primary biliary cirrhosis trial data. So, again, this Kaplan-Meier curve shows the estimated proportion of the original 312 patients who survived, who did not have the event of death, by the corresponding follow-up time. So, if we wanted to estimate, for example, the median, that would be the time at which half the sample had died and the other half were still alive, and we could crudely estimate that from this visual by finding the time associated with where the survival curve hits 50 percent. That's the point at which half the sample had died and half had not, and this is poor interpolation on my part, but the 50th percentile is roughly nine years for these data. So the median, or 50th percentile, of these data was roughly nine years. We could find other percentiles. For example, the fifth percentile, and just think about this, would be where the curve hits 95 percent, because that's the point at which 95 percent of the sample survive beyond this time and five percent had died by this time. So, it's actually the fifth percentile of survival time, and very crudely that's roughly one year. Let's look at the curve for infant mortality in the six months post birth among the Nepalese children, and this is the zoomed-in version of the Kaplan-Meier curve. Again, this curve tracks the estimated proportion of the original sample of 10,000+ births who survived, in other words, did not have the event of death, beyond the corresponding follow-up time. So, luckily and thankfully, we cannot estimate the median or 50th percentile from these data because the curve doesn't even get close to 50 percent.
But we could estimate, for example, the fifth percentile of death time for this group, again where the curve hits 95 percent. That's the point at which 95 percent of the children survive beyond this time and the remaining five percent had died by this time. So, that's roughly 40 days in the sample. So, the fifth percentile of survival time is roughly 40 days for this cohort. So frequently, and perhaps more intuitively, especially for computing percentiles, instead of presenting the Kaplan-Meier curve in the way I've just shown, known as the Kaplan-Meier survival curve, researchers will sometimes present one minus that curve, which shows the cumulative proportion of the original sample that has had the event by a certain time as opposed to the proportion who survive beyond that time. So it's just the complement, and if we wanted to convert from the survival curves I had shown you to the complement, we just take each point on the curve and subtract it from one. So, let me show you the example with the PBC trial data. If we wanted to create this complementary curve, we would take each point on the curve we looked at before, which tracks the proportion who made it beyond a given time without having an event, and subtract it from one. So, with this curve, since we're tracking the proportion who had actually died by a certain time instead of the proportion who survived beyond a certain time, the curve will start at zero, because nobody had died at the beginning of the study, and it will either stay constant or go up over time. So, you can see with this curve, we end at the 12 years of follow-up, and the estimate that roughly 35 percent or so were still alive at the end of the 12-year follow-up corresponds to 65 percent who had died by that time. So I'm just estimating that crudely, visually, but these two are complements of each other. They tell the same story, just instead of emphasizing the proportion who have not had the event, this curve tracks the proportion who have had the event by that time.
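Here is a rough sketch of both ideas, the complementary curve and reading a percentile off the curve, using the first three steps of the small smoking cessation example rather than the PBC or Nepal data, whose individual times are not given here. The helper names `cumulative_incidence` and `event_time_percentile` are illustrative assumptions. The percentile rule used, the first event time at which the survival estimate is at or below one minus p, is one common convention and is stated here as an assumption.

```python
def cumulative_incidence(survival_values):
    """Complementary presentation: proportion who HAVE had the event."""
    return [1.0 - s for s in survival_values]


def event_time_percentile(p, event_times, survival_values):
    """First event time where S_hat <= 1 - p, or None if never reached.

    Hypothetical helper. p is a proportion, e.g. 0.10 for the 10th percentile.
    """
    threshold = 1.0 - p
    for time, s_hat in zip(event_times, survival_values):
        if s_hat <= threshold + 1e-12:  # small tolerance for float rounding
            return time
    return None  # curve never drops far enough in the steps supplied


# First three steps of the smoking cessation example (unrounded values).
event_times = [2, 6, 8]
survival_values = [11 / 12, (11 / 12) * (9 / 10), (11 / 12) * (9 / 10) * (8 / 9)]

quit_by = cumulative_incidence(survival_values)
for time, c in zip(event_times, quit_by):
    print(f"day {time}: proportion who have quit = {c:.3f}")

print("10th percentile of time to quitting:",
      event_time_percentile(0.10, event_times, survival_values))
print("25th percentile of time to quitting:",
      event_time_percentile(0.25, event_times, survival_values))
```

With these three steps the 10th percentile comes out at day 6 and the 25th at day 8. A `None` result means the supplied curve never falls to the required level, which is the same situation as the median in the Nepal data, where the curve never gets near 50 percent.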
Similarly, we can do this for the infant mortality data in Nepal, and while either presentation is fine and it's sort of a toss-up as to which is more common (it's about 50-50 in the literature to see one style or the other), I will say that estimating percentiles is a little easier from this version because the percentile corresponds directly to the proportion on the y-axis. So again, the fifth percentile of survival time in these data, we estimated, was on the order of 40 days. On the survival curve we had to look at where it hit 95 percent, because that was the point which 95 percent of the children lived beyond and by which five percent had died. On this curve we can just look directly at five percent, and I'm not drawing it very well, but it would hit the same point of 40 days. So in summary, Kaplan-Meier curve estimates add richness and understanding to time-to-event data when we have individual event and censoring times on our observations. They add richness to the data from a sample by presenting the two dimensions of the data in one graphic. Kaplan-Meier curves use all of the data in the sample, both the event times and censoring times, and while the curve does not change when there's censoring, the censored observations provide information about who is at risk of having the event of interest at a given time in the follow-up period. Kaplan-Meier curves are summary statistics based on sample times, and they estimate the underlying, unknown, true population survival curve. Event time percentiles can be estimated via Kaplan-Meier curves, and as we saw, there are two different ways to present the results of a Kaplan-Meier curve: tracking the proportion who've not yet had the event, or the proportion who've had the event by a certain time. In the next section, we'll show that these are nice tools for visually comparing the time-to-event experience between samples, above and beyond an incidence rate ratio.