Welcome back to Principles of fMRI. We're now going to continue our discussion of pitfalls and biases in neuroimaging. In the last module, we talked about publication bias, which is resulting in an increased rate of false positives, and some arguments for why many research findings are false. We also talked about issues underlying these broad problems. Issues of selection bias at multiple levels are what we're going to focus on here. This includes, first, publication bias: only the studies that are the winners, the ones that end up finding something, get published, and the others go into the file drawer. Second, biases at the level of selection of experiments: flexibility in which experiments to include in a paper or publication, how many subjects to run, and when you stop running subjects. If you just run subjects until you find a significant result, that induces a bias. The third one is flexibility in the model, so model selection bias. If you have exercised flexibility in which outcome you choose, which covariates you include, how you divide up the outcomes (like a median split or other kinds of divisions), and which procedures you employ, and you pick the models that look the best at the end, you're also inducing a bias. And finally, a voxel selection bias, or more generally, a test selection bias. If you run many, many tests like we do in imaging, thousands or even hundreds of thousands of tests, then we often end up picking out the significant ones to look at, to focus on. That both increases the false positive rate and inflates the apparent effect sizes of the winning voxels, the ones we focus on. So we talked about those things in brief, and about some solutions to them. And we also talked about some positive responses by the community, in terms of homes and funding for replications and null findings, changes in journal policies, and other new platforms for conducting research.
In this module, we're going to home in on fMRI biases, and we're going to talk more about voxel selection bias. We'll talk about circularity and regression to the mean, and along the way familiarize you with some of the important terminology and effects, like the decline effect, publication bias, and p-hacking. So let's look at voxel selection bias. This is at the heart of the "voodoo correlations" paper and debate. Here's a hypothetical study: a correlation between reward sensitivity in the brain and behavior. It might be a decision-making task, and this is a pretty typical finding. There's the nucleus accumbens right there, and I see this correlation between reward sensitivity, behaviorally and in the brain, of 0.82. It looks really great. Right, great finding, great paper. The problem is that this is a null hypothesis simulation. There are no true effects. And how many times did I have to repeat the brain analysis before I found this null finding that looks really good? Once. [LAUGH] It happens all the time. So here, in this case, are the correlations across the whole brain, all the effect sizes, and you can see that they're centered on zero and roughly symmetric, which is what happens when there's no true effect. It's really just a null hypothesis distribution: there are just many tests, and we can pick out the ones that look the best by chance. And this is what the map looks like at p < 0.005 uncorrected, which is a pretty common threshold for reporting findings in published papers. What we see here is that at 0.005 uncorrected, there are several hundred significant voxels. So I see lots of findings, and what I did is I picked out the nucleus accumbens finding and zoomed in on it, and it looked great. So the problem with uncorrected thresholds is that you find something every time; this is something you have to guard against.
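The null simulation described above is easy to reproduce. Here is a minimal sketch of the idea, not the lecture's actual code: all names and parameters (20 subjects, 50,000 voxels as a stand-in for a whole-brain mask) are illustrative choices.

```python
# Null-hypothesis simulation sketch: brain-behavior correlations when
# there is NO true effect anywhere. Parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_voxels = 20, 50_000

behavior = rng.standard_normal(n_subjects)           # behavioral scores (pure noise)
brain = rng.standard_normal((n_voxels, n_subjects))  # "activation" at each voxel (pure noise)

# Pearson r between behavior and every voxel, via z-scored products
bz = (behavior - behavior.mean()) / behavior.std()
vz = (brain - brain.mean(axis=1, keepdims=True)) / brain.std(axis=1, keepdims=True)
r = vz @ bz / n_subjects

# Two-tailed p-values from the t transform of r
t = r * np.sqrt((n_subjects - 2) / (1 - r**2))
p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)

sig = p < 0.005                                      # uncorrected threshold
print(f"{sig.sum()} 'significant' voxels under the null")
print(f"best |r| = {np.abs(r).max():.2f}")           # typically around 0.8 with n = 20
```

With 50,000 null tests at p < 0.005 uncorrected, you expect roughly 250 "significant" voxels every run, and the best one will usually show a correlation near 0.8, just as in the lecture's example.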
So here's a series of null hypothesis simulations. I'll do the same thing as before, but I'm going to repeat it 10 times, just so you get a feeling for what these maps look like. What we'll see is a display that looks like this. In the top panel, we see the whole-brain map at p < 0.005 uncorrected, so we'll see how many blobs there are and where they are. On the bottom left, we'll see the distribution of those brain-behavior correlation values across the brain, and they'll always be centered on zero. We'll see the region with the strongest correlation in the bottom middle panel, that's the maximally correlated region, and then what the correlation looks like in that best region. And this just illustrates that when we pick out the winners, the findings look really good. So let's repeat this ten times. There's a positive correlation of 0.8; next time, here's a negative correlation of about the same strength; here, negative 0.86. And you can get a sense that no matter which run I do, I can do this again and again, and I'll always find something. It's about 0.8 by chance, and this is with 20 subjects. If you increase the sample size, then the chance correlation values will go down to some degree, so it won't look quite as good. But the smaller the study, the more likely it is you're going to find something that looks really great by chance. I've illustrated this problem with correlations here, but it applies to all kinds of effect sizes, for any kind of statistical test. So this is the idea: the voxel selection bias is what inflates those observed effect sizes. Here's another representation of a brain map with some results, 20 subjects, p < 0.001. There are some true areas here that are in purple, and the red areas are false positives. And in this simulation, in the areas that are truly activated, the purple areas, this is the true correlation on average: 0.5.
So that's the ground truth, but I'll always see something stronger. A typical significant voxel will be correlated at about 0.8. So there's some signal there, but it's not as strong as it looks. And why is that? When we search for significant tests out of a large family of tests, we're conditioning on having a high observed effect size. In fact, here the correlation has to be at least 0.67 to be significant at all, so every significant correlation is going to be above 0.67. But the true correlation is only 0.5, and what's happened is I'm only seeing the voxels where the noise favors my hypothesis, so I'm capitalizing on chance. This is why massive multiple-testing frameworks just don't produce valid measures of effect size, no matter what test I'm conducting. So here's another brain map. We'll look now at a plot of voxels along the x-axis, and what we see here is all the significant voxels; everything that shows up in my map is significant. All the red is noise, and all the blue is true signal. So I've got a true effect size, in the left half of the voxels over there, of 0.5. That's a Cohen's d, the signal divided by its standard deviation, of 0.5. So I've got a modest true effect there, but once I pass it through the filter of the voxel selection, the observed effect sizes are only those that stick up above the threshold. So they're only those with apparent effect sizes above 1.2 or so. In addition, if you look at the right part of the panel, where it's all red, there's only noise; those are false positives. There are fewer false positives than true positives, but even the false positives have effect sizes above 1.2 [LAUGH] or so. As you can see, then, with this thresholding and filtering, there are more true positives than false positives above threshold, right? But it's a very poor method overall for picking out the real signal.
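The conditioning argument above can be checked directly: even when every voxel carries a real effect, selecting on significance inflates the apparent effect size. This is a hypothetical sketch with illustrative parameters (20 subjects, 10,000 voxels, true d = 0.5), not the lecture's simulation code.

```python
# Sketch: true Cohen's d = 0.5 at EVERY voxel, but the voxels that
# survive a significance filter show a larger observed d on average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_voxels, true_d = 20, 10_000, 0.5

# Each voxel: n_subjects draws with mean 0.5 SD above zero (d = 0.5)
data = true_d + rng.standard_normal((n_voxels, n_subjects))

m = data.mean(axis=1)
s = data.std(axis=1, ddof=1)
t = m / (s / np.sqrt(n_subjects))
p = stats.t.sf(t, df=n_subjects - 1)      # one-tailed one-sample t-test

observed_d = m / s                        # observed effect size per voxel
sig = p < 0.001
print(f"true d = {true_d}")
print(f"mean observed d among significant voxels = {observed_d[sig].mean():.2f}")
```

With n = 20, a voxel needs an observed d of roughly 0.8 just to clear p < 0.001, so every survivor overstates the true effect of 0.5: you only see the voxels where the noise happened to help.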
So if you're doing brain mapping analyses, this is something we have to be aware of and live with. There, in the red areas, the noise actually favors the hypothesis and inflates the effect size estimate. Now, one other point is that this is true whether I'm looking at a correlation value, an r-value or t-value in a contrast or test, a Z-score, or p-values; any method for brain mapping has the same kind of bias. And a common misconception is that if we correct for multiple comparisons, that's going to solve this problem. Counterintuitively, multiple comparisons correction actually makes the problem worse. The purpose of multiple comparisons correction is to ensure that we don't get too many false positives, but it doesn't ensure that my effect size estimates are meaningful. When we correct for multiple comparisons, we end up raising the bar. So now, instead of the black threshold there, we have a green threshold, and the threshold now corresponds to an effect size of about 1.5 or greater. There are far fewer significant results, and more of those are true versus false positives, but because I've raised the bar, I've increased the voxel selection bias problem: I have to capitalize on chance even more to get a significant finding. So here, more stringent multiple comparisons correction means a less accurate, more biased, measure of effect size. What we see on the bottom panel is the threshold for significance. I've reported this in terms of effect sizes in d, but it could be p-values as well. The higher the threshold, the greater the contribution of noise to the apparent effect. So multiple comparisons correction has its uses, but it doesn't solve this problem, and in fact it actually makes it worse. One way to think about this is in terms of this warning, which I love: it says, objects in the mirror are closer than they appear.
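The point that a stricter threshold makes the bias worse can also be simulated. This sketch (illustrative parameters, assumed names) sweeps increasingly stringent thresholds over the same true effect of d = 0.5 and shows the mean observed effect among survivors climbing:

```python
# Sketch: raising the significance bar raises the observed effect size
# among surviving voxels, even though the true d is fixed at 0.5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, v, true_d = 20, 20_000, 0.5
data = true_d + rng.standard_normal((v, n))
d_obs = data.mean(axis=1) / data.std(axis=1, ddof=1)
p = stats.t.sf(d_obs * np.sqrt(n), df=n - 1)   # one-tailed one-sample t-test

means = []
for alpha in (0.01, 0.001, 0.0001):            # increasingly stringent thresholds
    surv = d_obs[p < alpha]
    means.append(surv.mean())
    print(f"alpha = {alpha}: mean observed d among survivors = {surv.mean():.2f}")
```

Each tightening of alpha filters out more of the honestly measured voxels and keeps only the luckiest ones, so the apparent effect size drifts further above the true 0.5.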
That's written on old rear-view mirrors, at least, and here the effects that you observe in the rear-view mirror, having already done your study and picked out the winning findings from among a whole family of tests and procedures, look much larger than they are. And this problem is related to the problem of regression to the mean, which is a really well-known problem, but it's worth thinking through. So let's look at this testing scenario again, with the true and false positives and some real signal in blue. And let's imagine that we have a new sample with these exact same voxels, just with independent noise, just a different sample of subjects. Now, what happens is that the noise values tend to regress toward their mean, because they were high by chance in the first place. So the next time we sample, they're going to be lower on average. A retest of the same voxels with independent noise gives me an unbiased estimate of the true effect size. So now you see that the effect sizes are much lower: they vary around the true effect size, plus or minus noise, where there's real signal, and they vary around zero where there's no signal. We could, of course, re-threshold and create a new voxel selection bias problem. [LAUGH] But here, you can see on a per-voxel basis that the effects are varying around their true means. So that's the phenomenon of regression to the mean, and that's why, if you want to estimate effect sizes, it's important to replicate a test in a new sample and get an unbiased measure of effect size. And if you want an unbiased measure, you have to look at all the tests, not just the winners, even in your new replication sample.
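The retest idea can be sketched as a two-sample simulation: select winning voxels in a discovery sample, then re-measure those same voxels in an independent replication sample. Names and parameters here are illustrative assumptions, not the lecture's code.

```python
# Sketch of regression to the mean: winners selected in a discovery
# sample look inflated; the same voxels re-measured in an independent
# sample vary around their true effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, v = 20, 10_000
true_d = np.zeros(v)
true_d[: v // 2] = 0.5                 # left half: real signal; right half: pure noise

def observed_d(sample):
    """Per-voxel Cohen's d: mean over sample SD."""
    return sample.mean(axis=1) / sample.std(axis=1, ddof=1)

d1 = observed_d(true_d[:, None] + rng.standard_normal((v, n)))   # discovery sample
d2 = observed_d(true_d[:, None] + rng.standard_normal((v, n)))   # independent retest

p1 = stats.t.sf(d1 * np.sqrt(n), df=n - 1)
sig = p1 < 0.001                       # select the winners in the discovery sample

print(f"discovery d among winners:  {d1[sig].mean():.2f}")       # inflated
print(f"replication d, same voxels: {d2[sig].mean():.2f}")       # near the true 0.5
```

Because the replication noise is independent of the selection, the retested winners regress back toward their true means: roughly 0.5 for the true-signal voxels and 0 for the false positives, exactly the pattern described above.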