[MUSIC] >> In this lecture we will revisit Type 1 errors and take a closer look at one problematic aspect of Type 1 error control. In the scientific literature, people sometimes inflate their error rates through practices that lead them to conclude that there is an effect, when there is actually no effect, at a much higher percentage than the alpha level they've decided upon. It's important not to make this mistake yourself. Just to revisit: a Type 1 error is the situation where you say that there is something when there's actually nothing going on. You're concluding that there is a true effect when the null hypothesis is actually true. Typically, we set an alpha level of 0.05, which means that in 5% of the studies we do, in the long run, where the null hypothesis is true, we will say that there is an effect to be observed, but we'll be wrong. This is a choice. You can set this alpha level at any level that you want. In physics it's very typical to use the 5 sigma level, which is the very, very low alpha value that you see below. Now, one problematic aspect that inflates the Type 1 error rate, so that increases the number of times you will say that there is an effect when there's actually nothing going on, is making multiple comparisons. You collect some data and you do one test, and another test, and another test, so this is a multiple testing situation. If you do this, the probability of saying that something is going on somewhere inflates well beyond the 5% level. Let's look at a situation where we perform an ANOVA, and this is a 2x2x2 ANOVA: three factors, each with two levels. In this situation, you actually have three main effects that you can test for, three two-way interactions, and one three-way interaction, which totals seven tests that you could perform just based on this one study that you've done.
In this case the Type 1 error rate, if you do all of these tests and you say, well look, something happened over here, let's interpret this, is actually much higher than 5%. The Type 1 error rate of finding something in at least one of these seven tests is 1 − (0.95)^7, which comes to about a 30% error rate. So you see this is quite substantial. The probability of saying there's something when there's nothing is much higher than 5%. This is one situation where you have to think more clearly about what you're doing and what you're testing, to make sure that you control the Type 1 error rate and prevent it from being inflated. Now, this is a graph where you can see the probability of a Type 1 error given the number of tests that you do. You can see that there's a steep line here: the more tests you do, the higher the probability that you'll say there's something when there's nothing. With about 60 tests, almost always something will turn out to be statistically significant just due to random variation. This is the example that we looked at before, the 2x2x2 ANOVA where we have seven tests and we test everything. In this graph, you can see the vertical line on the side indicating the Type 1 error rate, so all the p-values smaller than 0.05. This should be at the level of the horizontal line, which is the Type 1 error rate you should expect in this situation. But you see that the actual observed Type 1 error rate is much, much higher, because we are doing many, many different tests. You can control this Type 1 error rate, and you can do this if you design your study well and you change the alpha level based on some considerations. So how can you do this? How can you control the Type 1 error rate to prevent this inflation?
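This inflation is easy to compute yourself. A minimal sketch in Python (the test counts below are just illustrative):

```python
# Familywise Type 1 error rate for m independent tests, each at alpha = .05:
# the probability that at least one test is significant under the null.
alpha = 0.05
for m in (1, 7, 60):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests -> probability of at least one false positive: {fwer:.2f}")
```

For the seven tests of the 2x2x2 ANOVA this gives roughly 0.30, and around 60 tests the probability of at least one false positive approaches certainty.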
The most common way to do this, and the easiest, is the Bonferroni correction. It's called the Bonferroni correction, but a better name might be the Dunn correction. This is Professor Dunn, who actually came up with this idea. And when she did, she said, well, I cannot find anything in the scientific literature that has used this before, and it might be so simple that people just didn't think about it. That's one of the strengths of the Bonferroni correction: it's so remarkably simple to control your error rates that, until it was described by Professor Dunn, nobody had actually written it down. But it's very easy to do. In this case, the alpha level that you'll use isn't 0.05, or whatever you set it to for a single test; it's the alpha level divided by the number of tests that you do. Let's say that you do two tests. You'll divide the 0.05, your overall alpha level, by 2, and have a 0.025 alpha level for each individual test. In the case of the seven tests from the 2x2x2 ANOVA, you would divide your alpha level by seven. Alternatively, you can multiply the p-value by the number of tests, which is exactly equivalent, and then use the 0.05 alpha level for each specific test. If we take this latter approach and we plot the p-value distribution, when we have applied the Bonferroni correction to the same situation we had before, you see a very peculiar distribution of p-values. In this case, you see that the alpha level is exactly controlled. In the bottom left there's the Type 1 error rate, which is actually at 5% of the total number of simulations in this case. Due to the multiplication of the p-values, we see that there is a bar on the right side of this graph that's completely going through the roof. Most of the p-values actually end up here: we multiplied all these p-values, and a lot of them become really, really high, or actually one.
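Both equivalent formulations of the Bonferroni correction can be sketched in a few lines of Python (the p-values here are hypothetical, not from the lecture's simulation):

```python
def bonferroni_adjust(p_values):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

p = [0.004, 0.012, 0.021, 0.032]   # hypothetical p-values from 4 tests
adjusted = bonferroni_adjust(p)
print(adjusted)                    # compare each against alpha = .05
# Equivalent: compare the raw p-values against alpha / m = .05 / 4 = .0125
```

Note that 0.032 is significant against the unadjusted 0.05 level, but 0.032 × 4 = 0.128 is not, which is the scenario revisited under the Holm correction below.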
But at least the number of times that we'll say that there's something when there's nothing is nicely controlled in this situation. We have to control not just the single test that we do, but what is known as the familywise error rate. You have to control for all tests that will lead you to say that there is something when there is nothing. In a situation where we have seven tests and you will take any significant test as an indication that something is going on, the family, in this case, is seven tests. But you could also say, well, arguably, I didn't do a 2x2x2 ANOVA because I was interested in main effects. You're typically interested in an interaction somewhere, maybe a two-way interaction or a three-way interaction. In this case you would say, well, there are actually only four tests that I'm really interested in. This is my family, and therefore I'll divide my alpha level by these four tests. Now, sometimes you have a very specific prediction and you don't really need to use a Bonferroni correction. If there's only one thing you're interested in, like researcher Wahlberg says here, if you are only interested in one girl, there's no comparison. If you're only interested in one statistical test, then you don't need to adjust your alpha level. If you're saying, I'm actually only interested in the three-way interaction, then that's the only test you'll do, and the only situation where you'll say this indeed supports my hypothesis, and you don't need to control your alpha level any further. In these situations, it makes sense to write down your prediction before you collect the data. Later on we'll talk about preregistration as a way to control error rates and write down what the alpha level is and the tests that you want to do. It turns out that the Bonferroni correction is slightly conservative. You can do a little bit better. In practice it won't matter very often.
But given that we use computational tools to calculate these corrections most of the time, why not use the slightly more efficient version? This is called the Holm correction. In this case, you rank your p-values from lowest to highest, and then reverse that rank order: which is the lowest p-value, the second lowest, the third lowest, and so on. You multiply each p-value by its reversed rank to get a corrected p-value. You see that for the lowest p-value in this situation, when we do four tests, we actually multiply this very low p-value by 4; for the second lowest we multiply it by 3, et cetera. The benefit is clear in the last line here, where we have the 0.032, the p-value that we've observed. This is the last one that we test, so its reversed rank is 1, and we don't have to correct it anymore. This is a p-value we can interpret at the 0.05 level, and in this case it can be considered statistically significant, whereas if we would use a Bonferroni correction, it would no longer be statistically significant. Now, this is a graph visualising the Holm-corrected p-value distribution. You can see it's slightly different from the Bonferroni correction. But again, the p-values fall under the 0.05 level 5% of the time, and the Type 1 error rate is adequately controlled. So we talked about multiple testing, and this is one situation where you can inflate your alpha level. Another widely used, problematic research practice is known as optional stopping. In this case you collect some data, you analyze the data, and you do the statistical test. You see whether the result is statistically significant or not; let's say you're interested in whether the p-value is smaller than 0.05. If this is not the case, you collect more people: you have additional observations. This is quite common practice in many, many research areas. And it sounds like it's a good idea, right?
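The step-down procedure can be sketched as follows. The p-values are hypothetical, and note one detail the multiply-by-reversed-rank description glosses over: the textbook Holm procedure also enforces that adjusted p-values never decrease as you move up the ranking.

```python
def holm_adjust(p_values):
    """Holm: sort p-values ascending, multiply the i-th smallest by (m - i),
    then take running maxima so adjusted p-values never decrease."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):   # rank 0 = smallest p-value
        running_max = max(running_max, min(1.0, p_values[idx] * (m - rank)))
        adjusted[idx] = running_max
    return adjusted

p = [0.004, 0.012, 0.021, 0.032]         # hypothetical p-values from 4 tests
print(holm_adjust(p))                    # all stay below .05 here, whereas
                                         # Bonferroni gives 0.032 * 4 = 0.128
```

Because every Holm-adjusted p-value is at most the Bonferroni-adjusted one, the Holm correction rejects at least as often while still controlling the familywise error rate.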
It sounds like you should want to do this, because it's a very efficient way to collect data. If you look at the data and it's already statistically significant, what's the use of continuing on? Regrettably, if you don't control your error rates in this situation, this actually inflates the Type 1 error rate. So even though this is something that you might want to do, and you can do it, you have to do it the right way, by controlling the Type 1 error rate. Let's take a look at how you can do this. But first, let's take a closer look at the problematic aspect of it. If you do this often enough, then formally, this approach will always lead to a statistically significant result. Sometimes you have to look at the data 100,000 times, so it might not be very practical. But formally, as long as you continue looking at your data, even when nothing is going on, you will have a 100% success rate. In terms of good research practices, I would not recommend it, but it is a practice that will give you a guaranteed result 100% of the time. Now, obviously, it will hugely inflate your error rate, so that's not what we want to do. Let's take a look at how we can do this the correct way. In this graph we see the distribution of p-values when we're using optional stopping. In this simulation we've looked at the data five times. There's no true effect, so we should observe a uniform p-value distribution. But you can see there's this very weird peak on the left side of the graph, and there are way more statistically significant p-values than there should be. By looking at the data repeatedly, we have sort of scratched away some of the nearly significant p-values and pulled them just below the significance threshold. What we can do is use a technique that's called sequential analysis. In sequential analysis you can look at the data repeatedly, while controlling your Type 1 error rate.
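The inflation from repeated looks is easy to demonstrate in a small simulation. This is a simplified sketch, not the lecture's own simulation: it uses a z-test with a known standard deviation rather than the t-test you would normally run, and the five looks and sample sizes are arbitrary choices.

```python
import math
import random

def z_test_p(sample):
    """Two-sided z-test of the mean against 0, assuming a known sd of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 1 - math.erf(abs(z) / math.sqrt(2))   # = 2 * (1 - Phi(|z|))

random.seed(1)
n_sims = 2000
looks = [20, 40, 60, 80, 100]                    # five interim looks
false_positives = 0
for _ in range(n_sims):
    data = [random.gauss(0, 1) for _ in range(looks[-1])]   # null is true
    # Optional stopping: declare success if ANY look gives p < .05
    if any(z_test_p(data[:n]) < 0.05 for n in looks):
        false_positives += 1
rate = false_positives / n_sims
print(f"Type 1 error rate with optional stopping over 5 looks: {rate:.3f}")
```

With five looks the observed rate lands well above the nominal 5%, in line with the peak on the left of the lecture's p-value distribution.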
You're using something that's very similar to a Bonferroni correction, but it's actually even more efficient. In principle, you could just use a Bonferroni correction: decide how often you want to look at the data, let's say four times, and then divide the alpha level by four. That would be perfectly fine. But the sequential analysis approach is slightly more flexible and more efficient. Now, you don't have to read this. The font is too small to be read, but I will explain what's going on. This is a quote from a paper by Wald, in 1945, where he introduces the sequential probability ratio test. When he was developing this new statistical tool, it was 1943, during the Second World War. He had done the math and said, this is actually a very efficient way to collect data. And they wanted to use this in the war effort to test, for example, ammunition. Every now and then they had to do a test: is this working correctly or not? Using this sequential approach made them much more efficient, which is why the National Defense Research Committee said, we cannot share this knowledge at this moment. Because if you would publish this in the scientific literature, our enemies, in this case the Germans and the Japanese, would also discover this technique, use it, and become much more efficient. So this is the explanation for why Wald only published it in 1945, after the war: it was deemed such an efficient technique that the US government said you cannot share it with the enemy during wartime. Now, this is interesting, right? There is this technique that correctly controls error rates and that will make you much more efficient, but at the same time, we're not using it very often. Many researchers are slightly hesitant to use these sequential statistical approaches.
But my prediction is that this will be used more and more in the future, just because it's a very efficient way to collect data while controlling error rates. If you want to set an alpha level using this sequential approach, you can do so in many different ways. The easiest, for which I will give an example on the slide, is to lower your alpha level to the same value for each look at the data. But as I said, much more flexibility is possible here. You can spend your alpha level across the looks at the data in any way you like. You can use a very, very strict alpha level at the beginning, spending only a little of your alpha, and then use most of it later on, when you have the largest sample size. But you can also spread it out evenly over every look, depending on what you want. The easiest way is using a Pocock boundary. This is very similar to just dividing the alpha level by the number of looks. It's slightly more efficient, which is why, if you look twice, you have a slightly higher threshold than the 0.025 you would have with the Bonferroni correction. But you can see it's pretty close. These are the thresholds that you would use if you look at the data twice, three times, or four times; something like that is typically sufficient in a study that you do. Remember that this was the optional stopping p-value distribution where the error rate is inflated, so we see way more p-values just below 0.05. Now, if we use the corrected p-values based on the Pocock stopping rule, the distribution still looks a little bit peculiar, but in this case the error rate is exactly 0.05. So this is one approach to control errors. We talked about multiple comparisons using the Bonferroni approach, and you can use sequential tests and control error rates using, for example, the Pocock boundary. There are many different ways in which you can control error rates.
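As an illustration, the standard two-sided Pocock thresholds for an overall alpha of .05 can be compared with the plain Bonferroni division. The boundary values below are the published ones from Pocock's tables, rounded to four decimals:

```python
# Published Pocock boundaries (nominal alpha per look) for overall alpha = .05
pocock = {2: 0.0294, 3: 0.0221, 4: 0.0182, 5: 0.0158}

for looks, threshold in sorted(pocock.items()):
    bonferroni = 0.05 / looks
    print(f"{looks} looks: Pocock {threshold:.4f} vs Bonferroni {bonferroni:.4f}")
```

At every number of looks the Pocock threshold is a little more lenient than the Bonferroni one, which is what makes the sequential approach slightly more efficient.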
And an alternative to this, in recent years, is not controlling the error rate itself, but the false discovery rate. I just want to point out that this is possible. If you're interested in this and you're working in a field where you're doing many, many, many different tests, a huge number of statistical tests, then this is a very interesting approach to consider. There's a paper by Benjamini & Hochberg that you might want to read. I'm not going to go into too much detail, but it's an interesting perspective to learn about, if it's relevant for your research question. Now, remember that there's nothing special about this 5% threshold. You can increase it, you can decrease it. As I said before, it's up to you to determine how this balance should be struck. If you want to use stricter error control or a more lenient version, that's perfectly fine. In recent years, we've seen a lot of people worried about the way that people control their Type 1 errors, and there's some idea that the error rate in the scientific literature is inflated. In part, this is because people want more flexibility in the way that they analyze their data. The sequential approach is one way in which you can actually have this flexibility and be more efficient at the same time. Otherwise, you have to decide very clearly which specific tests you are interested in, and defining these tests before you collect the data will also make it easier to control the error rate. In essence, what I'm saying is that whatever you do, if you set an error rate for yourself, it's very important to make sure that it's not inflated. If you let it inflate, you will fool yourself much more often than you think you're doing. [MUSIC]
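For completeness, the Benjamini-Hochberg procedure itself is short. This sketch uses made-up p-values and the common false discovery rate level q = .05:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg: find the largest k such that the k-th smallest
    p-value is at most (k / m) * q, and reject those k hypotheses."""
    m = len(p_values)
    ranked = sorted(p_values)
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= (i / m) * q:
            k = i
    return ranked[:k]                 # p-values whose hypotheses are rejected

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]  # hypothetical
print(benjamini_hochberg(p))
```

Unlike the familywise corrections above, this controls the expected proportion of false discoveries among the rejected tests, which is why it scales better to situations with huge numbers of tests.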