Okay, so a little bit of the basis for why we even need to worry about these kinds of things. Here's a classic illustration of the kind of question we might answer in a very simple trial. The question posed by this diagram is: what is the proportion of black balls in urn A? Imagine sampling from an urn holding a mixture of white and black balls, so that I could compare whether the proportion in urn A is the same as or different from the proportion in urn B. To put this in the context of a clinical trial, perhaps urn B represents the proportion of patients receiving a placebo who experienced a particular outcome, and I am asking whether that proportion is different in a group of patients who receive the drug under study. In the illustration here, I've shown at the bottom the probability of black balls in urn B as 0.3; often in a trial we have to make assumptions about the proportion in both of these groups, not just one of them. So one way to frame the question is: how sure are we that the estimate we will obtain by sampling, for example by drawing balls from urn A, is close to the real population proportion? The answer depends on how close we want to be to the actual proportion in urn A. The only way to get the exact proportion in urn A is to sample everything in it, and that's just not possible in a clinical trial. It also depends on how noisy the distribution of the thing we are measuring is. If I'm measuring the response on a laboratory measure for patients receiving a particular drug, that noisiness might typically be characterized by a standard deviation. The larger the standard deviation, the harder it is to feel confident that I know the exact proportion or measure in that population.
And it also depends on how large my sample size is: if I have a larger sample, then, all other parameters being equal, I will have a better estimate of the actual measure or proportion. So all three of these are critical, and they are all interrelated. The quantities we can solve for when designing a trial and calculating a sample size are typically two of the following three. The first is the sample size itself: how many observations will we have in our trial? The second is the power, defined earlier as one minus beta, where beta is the type II error; this is the probability of correctly rejecting the null hypothesis. The third is the detectable difference: how different are the two groups in the study, that is, how different does my drug-treated population look from my placebo-treated population? This could be defined as an absolute difference in something measured in the two groups, or as a relative measurement of one group against the other, which typically characterizes things like odds ratios and hazard ratios; we'll look at those a little more later. The quantities we have to specify in calculating the necessary sample size for a clinical trial often include the type I error, also known as alpha; most often this will be set at 0.05, although different values of alpha can be specified. We'll have to make an estimate of the outcome in our groups; for example, we will need some baseline estimate of the likely response rate in the placebo-treated group. We'll need a measure of variability; a standard deviation would perhaps be the most common in a simple trial design. We also have to specify the treatment assignment ratio. Often trials will use a 1:1 treatment assignment ratio: in a two-armed trial with a 1:1 allocation ratio, for every participant assigned to the placebo group, one participant is assigned to the intervention group.
But different ratios can certainly be used, and those have to be specified in order to calculate the overall number of participants to be enrolled. Certain study designs may also require additional factors to be specified. So how does one determine the appropriate detectable difference, that is, how different the two groups need to be? One way is to find some prior definition of an important detectable difference, often based on clinical parameters. This can sometimes be found in the literature, or in prior trials in the same disease group or a similar population, perhaps studying different interventions. There are qualitative approaches to determining an appropriate detectable difference, such as the principal investigator's best guess as to what constitutes an important difference to observe between the two groups, or perhaps some known parameter describing what's important to patients, a minimal clinically important difference. Sometimes a trial might be designed based on other available data about what the effect of a particular drug might be, perhaps in a different population or in an earlier form of study. Quantitative approaches to a detectable difference can also come from previous studies that used the drug in the same or a similar population. And then there are approaches that use a standardized effect size, which is essentially just the ratio of an effect size to some measure of variability, such as the effect size divided by a standard deviation. We'll talk later about what to do when you have no estimates available for a detectable difference. Another important thing to remember is that most trials will encounter at least some difficulty in obtaining measurable data from every possible observation, so we want to make sure at the outset that we account for the possibility of missing data.
It's typical for investigators to estimate the degree of missingness to be expected in their data. It's important to remember that not all missing data are the same. Missingness that does not differ by treatment group is less likely to bias the result (though that alone doesn't prove the absence of bias); it will typically reduce the precision of your measurements and can therefore decrease your power. Missingness that does differ by treatment group, and which might be related to the treatment itself, also reduces precision, but it can additionally bias the result you see: if you're missing more data in one group than in the other, and that is related to the treatment itself, you may misestimate the actual treatment effect. The most common approach to adjusting for anticipated missingness is to inflate the sample size by 1/(1 - p), where p is the proportion of data expected to be missing. A typical assumption for many trials might be that 10% of the data will be missing, so you would inflate the sample size by 1/(1 - 0.1), or 1/0.9. This helps with the precision loss from missingness; it does not in any way compensate for bias. Next is a classic diagram to help illustrate exactly what we are talking about when we calculate a sample size. This is a standard normal distribution, reminding us what the tails of the normal curve look like. If I design a trial with an alpha of 0.05, a type I error of 0.05, then I would reject the null hypothesis when the test statistic falls in the small tails to the left and right of what are marked here as critical points. Another way to think of this is that the region between those two critical points is the 95% confidence interval for a distribution with the standard deviation as plotted here.
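Returning for a moment to the missing-data adjustment described a little earlier, here is a minimal sketch of that inflation; the function name and the example numbers are mine, not the lecture's:

```python
import math

def inflate_for_missingness(n_required, p_missing):
    """Inflate a computed sample size by 1 / (1 - p) to offset an
    expected proportion p of missing observations. This protects
    precision only; it does nothing to correct bias from
    missingness that differs by treatment group."""
    if not 0 <= p_missing < 1:
        raise ValueError("p_missing must be in [0, 1)")
    return math.ceil(n_required / (1.0 - p_missing))

# With 10% missingness expected, a hypothetical requirement of 200
# is inflated by 1 / 0.9:
print(inflate_for_missingness(200, 0.10))  # -> 223
```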
A few figures might help to illustrate the computation of sample size and the tradeoffs between type I and type II error. Consider a normally distributed population; this could reflect, for example, a laboratory measurement or some other continuous variable. Here I illustrate with a mean of 20 and a standard deviation of 25. Consider what would happen if I took samples of size two from this population, computed the average of each sample, and plotted the distribution of those successive sample means. The resulting distribution would look something like this: narrower than the original population distribution, because the average of two observations is less variable than the underlying population itself. If I increase the sample size from two to three, I get a narrower distribution still, and so on up to a size of eight. Now imagine a clinical trial in which I take that underlying population with a mean of 20 and expose some of the people in it to an intervention that I think will change the value being measured; for example, as plotted here, imagine changing the mean from 20 to 12. This might be a laboratory measure that I can change, perhaps with a medication. If I took samples from each of these two populations, I would get the two sampling distributions pictured here, each narrower than the underlying population from which the samples were drawn; this illustration shows samples of size three from each of the two populations. If I increase the sample size to 20, I get narrower distributions with somewhat less overlap, and if I continue to increase the sample size, the distributions overlap less and less. Here is illustrated the degree of overlap, and the corresponding amount of type II error, for a sample size of 40 from the same two underlying populations illustrated previously.
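The narrowing of the sampling distributions just described can be checked with a quick simulation. This is my own sketch, not the lecture's code, using the illustrated mean of 20 and standard deviation of 25; it compares the empirical spread of sample means against the theoretical standard error σ/√n:

```python
import random
import statistics

random.seed(0)
MU, SIGMA = 20, 25  # the population from the illustration

def sd_of_sample_means(n, reps=20000):
    """Empirical standard deviation of the mean of n draws, i.e. the
    spread of the sampling distribution for samples of size n."""
    means = [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

for n in (2, 3, 8):
    # empirical spread vs. theoretical sigma / sqrt(n)
    print(n, round(sd_of_sample_means(n), 1), round(SIGMA / n ** 0.5, 1))
```

The two columns agree (roughly 17.7, 14.4, and 8.8), confirming that averaging n observations shrinks the spread by a factor of √n.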
That sample size of 40 reflects a power of 17%. If I increase the sample size to 80, the power increases to 30%, reflecting less overlap between the two sampling distributions. All of these figures show sampling distributions in which the samples are the same size in each of the two groups. If I ultimately increase the sample size to 352, I would have a power of 85%. It so happens that if you rigorously compute the sample size for the underlying population characteristics illustrated previously, targeting a power of 85% and therefore a type II error of 15%, you get a sample size of 352. Now, these figures have illustrated normally distributed variables using a Z-test; it might be more common to use a t-test for samples of this size. The differences between a Z-test and a t-test are quite small, although the t-test is the more appropriate test under most circumstances. For other outcome variables with different distributions, the same general principles apply: increasing the number of samples taken from each of our intervention groups helps us better distinguish between the two populations, as illustrated here. This figure shows all of the important characteristics on a single diagram. For example, the type I error, reflected as alpha, is divided into the two tails of the leftmost distribution because we're illustrating a two-sided test. The difference between the means of the two sampling distributions is the detectable difference, the separation between H0, our null hypothesis, and Ha, our alternative hypothesis. And the shaded area under the alternative distribution, the rightmost distribution, reflects the type II error and the power. Next is the equation used to calculate the sample size for a conventional two-group test with a 1:1 allocation ratio.
Assuming the two groups are treated in parallel, the equation is n = 2(z_{1-α/2} + z_{1-β})² σ² / Δ² per group. For a given level of alpha (again, 0.05 is very common), the type I error is accounted for by the first z term in the numerator of this fraction, and the power by the second z term. The variability measure, the variance σ² (the standard deviation squared), is the other term in the numerator, while the detectable difference Δ, squared, is in the denominator. Here z represents the conventional z statistic from a standard normal distribution: z_{1-α/2} would be 1.96 for an alpha of 0.05 divided into a two-tailed test. This equation gives you the size needed to distinguish between two groups with the detectable difference as specified, with the expected variance as specified by the standard deviation, and with the power and type I error as specified. There are many tradeoffs between these parameters in designing a trial. Things that can be done to make a trial smaller, that is, to decrease the sample size, are to decrease the power, which carries a greater risk of type II error, or to increase the detectable difference limit, that is, to assume the two groups will be more separated and therefore easier to distinguish. That assumption does not change what the actual effect will be, and if the detectable difference you assume is too large, your trial may fail to detect a difference that is present but smaller than the one you're capable of detecting. The tradeoffs run in every direction: decreasing the sample size or decreasing the detectable difference limit will decrease the power, while if you feel justified in increasing the detectable difference, that is, in assuming the groups will be more separated, that will lead to either an increase in power or a decrease in sample size.
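As a sketch, both the power figures quoted earlier and this sample size equation can be reproduced with a normal-approximation calculation. The code is mine, not the lecture's; it assumes the earlier quoted sizes (40, 80, 352) are totals across two equally sized groups, with means 20 and 12 (Δ = 8) and σ = 25 as in the figures:

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal

def n_per_group(delta, sigma, alpha=0.05, power=0.85):
    """n = 2 * (z_{1-alpha/2} + z_{1-beta})**2 * sigma**2 / delta**2,
    the conventional parallel two-group, 1:1 formula from the text."""
    z_a = Z.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05, two-sided
    z_b = Z.inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

def ztest_power(n_total, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal
    allocation, ignoring the tiny contribution of the opposite tail."""
    se = sigma * math.sqrt(4 / n_total)  # SE of the difference in means
    return Z.cdf(delta / se - Z.inv_cdf(1 - alpha / 2))

for n in (40, 80, 352):                        # totals, as in the figures
    print(n, round(ztest_power(n, 8, 25), 2))  # -> 0.17, 0.3, 0.85
n = n_per_group(delta=8, sigma=25)             # alpha 0.05, power 85%
print(n, 2 * n)                                # per group and total -> 176 352
```

Varying `power`, `delta`, or `alpha` in `n_per_group` reproduces the tradeoffs just described: lowering the power or assuming a larger detectable difference shrinks n, and raising alpha does too.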
So if the drug you're testing really does produce a response that's quite different from your control group, that will lead to a trial with a smaller required sample size and/or greater power, and we'll look soon at how sample size and power work together. Another thing that can be done is to increase alpha, which will either increase power, reduce the detectable difference, or lead to a smaller sample size; however, increasing alpha is not a conventional way to approach the design of a trial. A note now about some of the different trial designs and trial outcome measures that can be used. Our examples so far have involved two-group designs with a continuous measure, but another common design uses a time-to-event outcome: for example, a survival trial in which patients either do or do not have a particular event, such as a hospitalization resulting from their illness. Time-to-event outcomes tend to provide more information than measurements at a single time point, and so they tend to lead to either more power or smaller sample size requirements. Designing a time-to-event trial differs from designing a continuous-measure trial in several ways; for instance, you'll need to specify either the event rate (how often do events happen?) or the average survival time. Survival here does not necessarily mean being alive or dead, but rather being event-free, whatever that event might be: in a trial designed to reduce hospitalizations, survival means being free of hospitalization. Parameters that also affect the sample size needed for a time-to-event trial include the length of the recruitment period, that is, for how long patients enter the trial, and how long each individual is observed during the trial; we'll look at examples of that in just a moment. Power in a time-to-event analysis is typically based on the number of events that occur.
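One standard way to make "power comes from events" concrete is Schoenfeld's approximation for the total number of events needed to compare two groups with 1:1 allocation; enrollment and follow-up are then chosen so that this many events are expected. This is a generic sketch of that approximation, not the lecture's own calculation, and the hazard ratio of 0.75 is an arbitrary example:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def required_events(hazard_ratio, alpha=0.05, power=0.9):
    """Schoenfeld's approximation for a 1:1 time-to-event comparison:
    total events D = 4 * (z_{1-alpha/2} + z_{power})**2 / (ln HR)**2."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return math.ceil(4 * (z_a + z_b) ** 2 / math.log(hazard_ratio) ** 2)

# Fewer events are needed at 80% power than at 90%:
print(required_events(0.75, power=0.9))  # -> 508
print(required_events(0.75, power=0.8))  # -> 380
```

This also explains why hazard ratios closer to 1 push the sample size curves up: the (ln HR)² denominator shrinks, so many more events, and hence participants, are required.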
Often, in designing a trial and computing the sample size, you'll estimate the number of events per individual, or the rate of events in your two treatment groups, in order to compute how many individuals need to be enrolled. But the statistical power actually comes from the number of events, not the number of people, unless the number of events per person turns out to be exactly what you estimated. In most time-to-event trials, patients are recruited over some period of time; it would be rare to have a trial that enrolled, say, 1,000 people all on the same day. That means the people in the trial will have potentially different amounts of follow-up time. In a common close-out design, in which everyone is followed until a particular end date, people who enroll in the trial early have more follow-up time than people who enroll later. If you imagine a study in which everyone is guaranteed to be followed for at least one year and recruitment lasts for up to three years, the first patient enrolled would have a total of four years in the trial and the last patient enrolled would have one year. Different patients would therefore have different amounts of follow-up time, which increases the average follow-up time per individual in the study. In time-to-event studies, most of the time not every individual is followed until they have an event: some people will never have the event, and some will drop out of the study or be lost to follow-up. Those situations also have to be handled and accounted for in the sample size calculations. Here's a plot showing some of the parameters that can affect the sample size needed for a time-to-event trial. The top group of lines, shown dotted, represents sample size calculations assuming a power of 90%, or 0.9; the bottom group of lines represents a power of 0.8.
So you can see illustrated here that a study willing to accept a lower power, and thus a slightly higher potential for type II error, will have a smaller sample size requirement; that is, the more power a study wants, the larger the sample size will have to be. Within each of those two groups of lines, dotted and solid, we show the effects of different amounts of average follow-up time, accrual followed by follow-up: the longer the average follow-up time per individual in the trial, the lower the required sample size. On this particular plot, the outcome metric is the hazard ratio, which is common in time-to-event analyses, and the Y axis shows the total sample size, the sum across the two treatment groups being compared. The hazard ratio here represents the detectable difference: when the hazard ratio is closer to 1, the detectable difference is smaller and the required sample size is higher, and as the hazard ratio moves further from 1, the detectable difference gets larger and the sample size requirement goes down. At any particular hazard ratio, the higher the power, the larger the sample size, which is why the dotted lines sit above the solid lines; and the more average follow-up time, the more information you'll have, lowering the number of individuals you'll need. A couple of quick notes about trial designs that use group allocation, such as cluster-randomized trials. One important thing to account for when designing those trials and calculating sample size needs: if the individuals within a group are correlated on the outcome being measured, there needs to be an adjustment to account for that correlation. Typically that is done with an inflation factor, shown in the equation here as 1 + (m - 1) × x, where m is the number of individuals per group and x is some measure of the degree of correlation.
This x could be a correlation coefficient for a continuous outcome, or some other concordance rate for something like a dichotomous variable. An example of the application of this inflation factor might be an ophthalmology trial in which each patient contributes both eyes to the study. It's not uncommon to imagine that an intervention treating something in one eye would also contribute to a treatment effect in the other eye. Here we show the result: if we think the correlation between the two eyes would be 0.5, then we would apply an inflation factor of 1 + (2 - 1) × 0.5 = 1.5 to the sample size calculations previously considered.
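The eye example works out as a direct application of the inflation factor; this tiny sketch is mine, including the function name:

```python
def design_effect(m, icc):
    """Variance inflation factor 1 + (m - 1) * x for groups of size m
    whose members are correlated on the outcome with correlation x."""
    return 1 + (m - 1) * icc

# Two eyes per patient with an assumed correlation of 0.5:
print(design_effect(2, 0.5))  # -> 1.5
# A conventionally computed sample size would then be multiplied by 1.5.
```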