[MUSIC] Let's discuss some of the statistical monitoring methods that are available. One question that gets asked is: why not do repeated significance testing? Why not simply look over and over again and then stop when you see a real difference? The problem with that is that the more you test, the more likely you are to make a mistake, in this case, to reject the null hypothesis and say there is a difference when there isn't one. A great example of that is the table on the right-hand side of this slide. It shows that as the number of repeated tests at a fixed significance level increases (shown in column one), the Type I error also increases (shown in column two). So even though with only one test, as shown in the first row, your Type I error, your chance of making a mistake, was only 5%, by the time you've done five tests your chance of making a mistake is 14%, and it only increases from there.

We create statistical stopping rules as a set of statistical conditions that would trigger you to stop the trial early. They should be pre-specified, both to avoid bias and to avoid inflating that Type I error, the chance of making a mistake. We often refer to these as guidelines rather than rules, because changes and unexpected events can occur: you may see something that you didn't plan for, and participant safety will always overrule any statistical mechanics. However, I would like to emphasize that you must have a very strong rationale for overturning the planned stopping rules or guidelines. This is not the Wild West; we don't get to just choose what we do and stop when we feel like it. That would lead us back to the same problem that we discussed on the previous slide, of just going until we saw something significant.

Alpha spending functions are one statistical tool. We do a series of tests at a fixed number of time points; we describe these time points as proportions of the trial information.
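The inflation in that table is easy to reproduce by simulation. The following is a Monte Carlo sketch, not the slide's own calculation: it repeatedly tests the same accumulating null data at a nominal two-sided 5% level at each of several looks, and counts how often at least one look falsely rejects. The sample sizes and simulation counts are illustrative assumptions.

```python
import numpy as np

def repeated_test_error(n_looks, n_per_look=100, n_sims=10000, seed=0):
    """Estimate the cumulative Type I error when accumulating null data
    are tested at each of n_looks interim looks, all at nominal 0.05."""
    rng = np.random.default_rng(seed)
    rejected = 0
    for _ in range(n_sims):
        data = rng.standard_normal(n_looks * n_per_look)  # null: true mean 0
        for k in range(1, n_looks + 1):
            chunk = data[: k * n_per_look]                # data seen so far
            z = chunk.mean() / (chunk.std(ddof=1) / np.sqrt(len(chunk)))
            if abs(z) > 1.96:                             # nominal 5% threshold
                rejected += 1
                break                                     # stop at first "signal"
    return rejected / n_sims

print(repeated_test_error(1))   # close to the nominal 0.05
print(repeated_test_error(5))   # inflated well above 0.05
```

With five looks the estimate lands near the 14% quoted for the table's fifth row, even though each individual test was run at 5%.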
And we spend our Type I error throughout the trial so that cumulatively the chance of making a mistake stays below our threshold, for example 5%. Now, these spending functions can have different shapes, and I've tried to give you some examples of those shapes in the figure on the right-hand side of the page. This figure looks at three different spending functions, each with looks at 20%, 40%, 60%, 80% and 100% of the trial information, and it shows you what sort of thresholds you would need in order to stop the trial. The alpha level on the right-hand side of the graph represents the Type I error. Now, each boundary's alpha level is half of the total, because you spend half on an upper bound and half on a lower bound. So a 0.05 test, a 5% chance of making a mistake, would have a boundary level of 0.025 on each side, and I've indicated this with a faint dotted line for comparison with the other spending rules.

The Pocock boundary, represented by the dotted gray line with circles, has a uniform testing threshold: you require the same level of evidence at every time point, whether you have 20% of the data or 100% of the data. In order to control the Type I error, however, that means a much stricter barrier for each individual test than the 0.025 we would need if just one test were done. An alpha spending function that keeps the final threshold close to what we would have if only one test were done would be preferable. An example of this is the Haybittle-Peto alpha spending function, represented by the dotted blue line with triangles. In this case we only stop early if there's overwhelming evidence at the 0.001 level, and then we have a final test at the full 0.05 level. This lets us stop early, but only if the evidence is extremely strong. Probably the most commonly used alpha spending function is the O'Brien-Fleming function, represented by the solid purple line with squares.
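To make "spending" concrete, here is a sketch of the Lan-DeMets O'Brien-Fleming-type spending function, computed from the standard normal distribution alone. The one-sided alpha of 0.025 mirrors the halved 0.05 described above, and the five information fractions match the figure's looks; treat the exact function form as one common convention rather than the slide's specific curve.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()
alpha = 0.025                       # one-sided half of a two-sided 0.05
z = N.inv_cdf(1 - alpha / 2)        # reference critical value, ~2.24

def obf_spent(t):
    """Cumulative Type I error spent by information fraction t
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    return 2 * (1 - N.cdf(z / sqrt(t)))

for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t={t:.1f}  cumulative alpha spent = {obf_spent(t):.6f}")
```

Running this shows almost no alpha spent at the 20% look and the full 0.025 spent by the end, which is exactly the shape described next: a very demanding early threshold that relaxes toward the single-test level at the final analysis.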
In this case we have a gradually decreasing threshold of evidence: you need a lot of evidence at the beginning of the trial to stop early, when you don't have much data, but by the end of the trial you need less evidence, and the final test is extremely close to the 0.05 level.

Another statistical tool is called conditional power, which I have abbreviated as CP. In this case we predict the likelihood of rejecting the null hypothesis. Because of that, it's often used in futility analyses, when we want to know: what's the chance that we will have a significant result at the end of the trial? I have given an example of conditional power in the figure to the right. This example has a single interim look, once half of the data have been accumulated. There are a number of steps in a conditional power analysis. First, you have to estimate your effect size given the current data. This is represented by the black line on the left-hand side of the figure, denoted "observed data". Then you must assume a pattern for future data: this is what you think will happen for the rest of the participants in your trial. Now, there are a number of different trajectories that we can use. The most common, represented by a solid purple line ending with a square, is to take the current trajectory and extend it to the end of the trial. This assumes that what you've seen already will be what you see for the rest of the patients. A very conservative approach is what I call the null hypothesis approach, represented by the dotted gray line ending in a circle. In this case you assume that there is no difference between the treatments. The final scenario is what I call the alternative hypothesis, represented by the dotted blue line ending in a triangle. It represents an optimistic view: for example, what you would need to see in order to have 80% power, an 80% chance that your trial will be successful.
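Combining the observed interim data with one of these assumed trajectories gives the conditional power. A minimal sketch using the standard Brownian-motion formulation follows; the interim z-value and look fraction are illustrative assumptions, not the numbers behind the slide's figure.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()

def conditional_power(z_interim, t, theta, z_crit=1.96):
    """P(final Z exceeds z_crit | interim Z at information fraction t),
    assuming the remaining data accrue with drift theta
    (Brownian-motion formulation of a one-sided test)."""
    b_t = z_interim * sqrt(t)            # interim B-value: evidence so far
    mean_remaining = theta * (1 - t)     # expected drift still to come
    sd_remaining = sqrt(1 - t)           # variability of the remaining data
    return 1 - N.cdf((z_crit - b_t - mean_remaining) / sd_remaining)

# Illustrative interim look halfway through the trial (values assumed)
z_t, t = 0.8, 0.5
cp_trend = conditional_power(z_t, t, theta=z_t / sqrt(t))  # current trajectory
cp_null  = conditional_power(z_t, t, theta=0.0)            # null hypothesis
print(f"CP (current trend): {cp_trend:.2f}")   # low: success looks unlikely
print(f"CP (null):          {cp_null:.2f}")    # lower still
```

Swapping in the drift implied by the alternative hypothesis would give the optimistic scenario; the pattern of a modest current-trend CP and a near-zero null-hypothesis CP matches the shape of the example discussed next.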
Under each of these scenarios, you combine the data that you've actually observed with the assumed trajectory to calculate your conditional power: the probability that you will reject the null hypothesis, that you will see a difference at the end of the trial, based upon that combined information. If the conditional power is poor, then you would decide to stop the trial. In the example to the right, you can see that on the current trajectory we only have a 13% chance of actually seeing a difference between treatments. That chance gets even lower, down to 2%, under the null hypothesis assumption, and in fact we would have to diverge wildly from the current projection in order to have an 80% chance of a successful trial, which seems unlikely. One thing to note is that you must pre-specify which trajectory you will use (current, null, or alternative) and also what you consider poor. You can't just look at it and say "that doesn't look good enough" or "I think that's good enough", because that could introduce bias. You have to say, for example, that you require at least a 50% or 60% chance under the current trajectory in order to continue, and specify that threshold.

There are, of course, issues with statistical monitoring. It's very mechanistic: it depends upon p-values, there is a set of rules, and what if you're close but not quite? It's very rigid in that sense. The second problem is that it really focuses upon a single outcome, typically an efficacy outcome. During a trial you collect lots of information, secondary outcomes, safety data, all of which could influence whether or not you think it's a good idea to continue the trial. This sort of statistical monitoring does not take those into account, and this can create problems. Finally, you have to put your planned interim analyses into your protocol, and this can create what I call information leaks.
People can make assumptions about what's happening depending upon whether or not you choose to continue the trial, based upon their knowledge of those stopping rules. That in turn can bias them in terms of whether they would want to recruit, continue seeing patients, or other factors. So it's important to try to keep the information leaks to a minimum. [MUSIC]