Welcome. Today, we get to welcome two special visitors: a fellow course developer and team member, Laura Ness, currently a graduate student and PhD candidate in our statistics department at the University of Michigan. Then we have Tim Van Der Zee, a doctoral candidate at Leiden University in the Netherlands, currently here at the University of Michigan as a visiting research scholar. They're here to share their discussion and ideas about p-values, p-hacking, and, to keep the P's going, proposals: good practices for the reporting of research results. So, in this course we have now turned to some statistical inference techniques. We have studied a little bit of confidence interval estimation and hypothesis testing, where we state competing theories, gather our data, summarize it in the form of a test statistic, and convert that to a corresponding p-value. So probably a good place to start would be to ask you to sort of fill in the blank: a p-value is blank. Maybe say a little bit about how you might explain what a p-value is more generally, the big idea. Let's have Laura start. Okay, I will start then. When I'm explaining to a student what a p-value is, I usually say: this is in the context of hypothesis testing, so you have a null and an alternative hypothesis, and in describing the p-value you have to say, if the null hypothesis is true, so assuming that it is the ground truth, then the p-value is the probability of getting a statistic at least as extreme as the one you got out of your data. Tim, would you like to give it a go? What I would like to add to that: when I explain a p-value, I like to call it a measure of surprise. Specifically not a direct measure of evidence, because it isn't really, but a measure of surprise. So, you need the null hypothesis, and then the p-value: the higher the p-value, the less surprised you should be, and the lower the p-value, the more surprised you should be, because that means that your data is not what you would expect, given that the null hypothesis is true. So it's being surprised about the data, not about the hypothesis, and that's important. I also think this analogy makes clear that if you are surprised, if the p-value is low, that doesn't necessarily tell you a lot yet, because you still have to work out what that surprise is about. But you know that you should be surprised, and if you shouldn't be surprised, well, that's just it. So, along with that, we need to think about not only what a p-value is measuring, but also what it's not. Some misconceptions. So, a p-value is not blank. How would you fill that in? One of the biggest misconceptions I see is that the p-value will actually tell you how likely it is that your hypothesis is true, and that is not what it gives you, because then you're completely dropping the assumption about the null hypothesis being true and treating the probability as if it were about the world as it is. But you really need to keep that assumption that the null hypothesis is true to get any information out of it. Absolutely, I think that's the biggest issue people have with p-values. Put more generally, the misconception is that it's telling us something about the hypothesis; it's only telling us something about the data, assuming that the null hypothesis is true. Then we can use some inverse inference, in a sense, and make some suggestive claims about hypotheses. But, directly speaking, it's just about data, the probability of data.
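As a quick aside, here is a minimal sketch of that definition in code, assuming Python with NumPy and SciPy; the sample and the hypothesized mean are invented purely for illustration:

    import numpy as np
    from scipy import stats

    # Invented sample of 25 measurements
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=2.3, scale=1.0, size=25)

    # Null hypothesis: the population mean is 2.0.
    # The p-value is the probability, computed assuming that null is true,
    # of a test statistic at least as extreme as the one this sample gives.
    t_stat, p_value = stats.ttest_1samp(sample, popmean=2.0)
    print(t_stat, p_value)

A small p-value here says the sample would be surprising if the true mean really were 2.0; it does not say how probable either hypothesis is.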
Of course, in one sense we already know the probability of the data, because it's one: we have the data in front of us, so we know the data are real. I think that's one of the most important things to remember. Some feel that the p-value is trying to quantify how likely it is that the null hypothesis is true, or how likely the alternative is, and in reality those are statements that are either true or false. So it's not the probability of a theory being true, it's the probability of the data under a certain theory. How about the idea that it doesn't measure the size of an effect? If it is a small p-value, what does that mean? Well, I will come back to the surprise thing again: we can be sure that we should be surprised. But the next question is, why is this p-value so small? There are a lot of possible reasons. It's not that there is only one reason, that it can only be small when the null hypothesis is false; there are other reasons as well. It has to do with how much power you have, how the sampling went, what kinds of analysis strategies you used. There are all kinds of implicit assumptions built into the whole model that eventually produces a p-value, and you have to take all of that into account when you start interpreting a p-value. So, for example, if you want to make claims based on p-values, one of those things is that you have to test repeatedly, because p-values, and inferential statistics in general, at least frequentist statistics, only operate in this imaginary world of long-run experiments, of infinite repetition. So you also need to do multiple studies, calculate multiple p-values based on multiple datasets, and then we can really start homing in: "Okay, this gives us a good description of reality," or "this null hypothesis is not a good predictor of the data we get, so we're going to build a better hypothesis that is a better predictor." With each one of those studies we're either right or wrong when we make that decision. In a sense, yes, and that's fine as long as we do multiple studies. Right. So, we need to see that evidence build up. Yes. Very good. Another misconception I wanted to add to this: having a really low p-value means statistical significance, but it doesn't necessarily mean practical significance. It could be that your mean is above ten, but it's only above ten by something like 0.001. Yes, it could be very statistically significant, you're quite sure that it's above ten given your data, but what does that really mean in the world? Maybe the drug wasn't really that much better than the previous one, and it's not worth it to pursue making it. Yes, absolutely. On top of that, you're almost guaranteed to get very small p-values if you have very large samples. For example, in my work, where I use very large datasets, if the true value of some difference, one of the effects, is something like 0.01, you might still get a very low p-value. The effect is not exactly zero, which is what the null hypothesis says, so the null is not a very good explanation, even though it is actually close to the best explanation, because the true value is so small, so close to zero. So, yes, it can be a bit misleading sometimes, if you forget all the other things that can cause low p-values.
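Tim's large-sample point is easy to see in a small simulation; everything here, the group sizes and the 0.01 difference, is made up for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # A tiny true difference between two groups (0.01), but a huge sample
    n = 1_000_000
    group_a = rng.normal(loc=0.00, scale=1.0, size=n)
    group_b = rng.normal(loc=0.01, scale=1.0, size=n)

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(p_value)                          # tiny: "statistically significant"
    print(group_b.mean() - group_a.mean())  # about 0.01: practically negligible

The p-value comes out extremely small, yet the estimated difference is about 0.01, which may be of no practical importance at all.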
Very good. So, moving on from p-values to our second P: p-hacking. All right. We've seen over the past few years a lot of interesting discussions about the use of p-values and the idea of p-hacking. So, Tim, what does it mean when we say a p-value is hacked? Good question. My favorite general answer would be: whenever we can no longer interpret the p-value by its mathematical meaning, by its mathematical definition, then it's hacked one way or another, and there are like a thousand ways to hack it. So, for example, the most well-known way to hack a p-value is this: we have one dataset from a study, it can be experimental or observational, and you just run as many tests as you can. Sometimes we call that exploratory analysis, sometimes we just call it p-hacking, but you're running tens, maybe hundreds, of different analyses, and then, hey, one of those is significant, you're happy, and that's the one you report. But then we can no longer interpret that p-value in its original meaning. It's still a true p-value, that single specific analysis is correct, but to interpret a p-value we need the big picture as well; we need full transparency about all the analyses that were done. We need to know everything that has been done to that data, and so we can no longer interpret a p-value as it's supposed to be interpreted when it's hacked. So, what are some of the other consequences of p-hacking? Well, if it can no longer be interpreted as it should be, we just can't do anything with it anymore, because we don't know what it means, and that has a big consequence: the study we did basically becomes uninformative. It's like, yeah, we did something, but then we kind of messed up this portion and now we're no longer sure how we can interpret it. So it becomes useless or uninformative most of the time, and that's just a waste of energy and money. Right. So, p-hacking may be more prevalent than we think. And what about accidental p-hacking? Maybe it's not so obvious when you're p-hacking; you can do it without knowing it. Laura, you have some thoughts? Yeah. I would say that most scientists have inadvertently p-hacked at some point in their career, if not multiple times, without knowing it. Because a lot of the time when you're looking at data, you have a general assumption and a hypothesis that you're testing, and you get something really close, and you say, oh, well, what if I looked at this variable instead, or this subset of people? It may seem reasonable to you then, and it is reasonable in the sense that you have data and you want to look at it, but you didn't have those research questions in the first place. So you're poking around until you get a value where you say, okay, this really means something, but it doesn't actually mean something. It really just means that you were looking at multiple different questions, and that's diluting your p-value, essentially. So, I think for the most part it is an inadvertent thing, and you wouldn't necessarily know when it's happening, unless in retrospect you come back and say, oh, okay, somebody else did the same study and got a different result; how did that come about? Maybe it's because I was looking at too many different questions. So, is p-hacking only for p-values, or can you have p-hacking of other statistics and other numbers? You can hack anything you want, sadly, because, again, it comes back to the general definition: as long as we can no longer interpret a value as it's supposed to be interpreted, by my definition, it's hacked.
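A small simulation of the "run many tests and report the one that worked" scenario Tim describes; the data here are pure noise by construction, so every null hypothesis is true:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # 20 outcome variables of pure noise for a "treatment" and a "control"
    # group, so every null hypothesis here is true by construction.
    n_people, n_outcomes = 50, 20
    treatment = rng.normal(size=(n_people, n_outcomes))
    control = rng.normal(size=(n_people, n_outcomes))

    p_values = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
                for j in range(n_outcomes)]

    # With 20 independent tests at the 0.05 level, the chance that at least
    # one comes out "significant" is roughly 1 - 0.95**20, about 64%.
    print(min(p_values), sum(p < 0.05 for p in p_values))

Reporting only the smallest of those twenty p-values, without the other nineteen, is exactly the situation where the number can no longer be interpreted by its definition.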
So, you can hack an average too, in a sense. It's not an inferential statistic, but you can still hack it in a way that you present it as a mean while it's not really a mean anymore. For example, if you take out some outliers and say, okay, we're going to look at this data without all these outliers. Well, you're choosing which outliers, or how to define those outliers, so is that in itself a kind of hacking? Yeah, and you still end up with a mean, but not in the original meaning. If we interpret it as the mean of the data, but we don't know which values were dropped or how the outliers were defined and removed, it's not really a mean in the original sense of the word, so you may be misled if you interpret it in that original manner. That's how we are often somewhat misled, because we don't necessarily know all the decisions that were made along the way, or what choices the researcher took as they started their investigation. That's not always part of what we read in a research study write-up, either. Yeah, and it affects any kind of number. Often what I hear or see people do is, for example: the p-value was just above significance, or the p-value was maybe hacked, so, okay, we're going to ignore the p-value because that number is now shady, and we're just going to use the effect sizes, the parameter estimates. But then they don't realize that all the same factors that affected the p-value also affected all those other numbers. So you can't really rely on the effect size estimate either, because it might be "hacked" for the same reasons that hacked the p-value. Right, other choices that were made can affect many of the numbers that are reported, not just the p-value. Yeah. Whatever forces drive down the p-value are also going to drive up the parameter estimates and drive down the variance estimates, for example. So, there are certainly some issues around p-values and the possibility of p-hacking. People have argued we should abandon p-values altogether; some journals have even decided to ban them. So, what do you think? Do we need to just get rid of p-values, or do they still have some merit? Is it more about watching for the misuse and making sure we're using them in appropriate ways? What are your thoughts? That's a tough question, which I have been struggling with for a few years. I think in a sense banning p-values seems very silly; it's a bit like banning averages because people sometimes misuse them. Sure, people misuse means, people misuse percentages, but I don't think that's a reason to ban them from an academic journal. I do think it's extremely important to make systematic changes in the way we engage with data analysis, so that we actually are fully aware of all the assumptions baked into inferential statistics, specifically p-values, and make sure we respect them. For example, we should on a much larger scale engage with preregistration, which many people argue is, again, a requirement for valid p-value use.
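Before turning to pre-registration, here is a toy illustration of the earlier point about hacked means; the numbers are invented, and the outlier rules are just examples of choices a researcher might quietly make:

    import numpy as np

    # Invented reaction times in seconds, with a few slow responses at the end
    times = np.array([0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.61, 1.8, 2.4, 3.1])

    print(times.mean())                                           # all of the data
    print(times[times < 2.0].mean())                              # drop values above 2 s
    print(times[times < times.mean() + 2 * times.std()].mean())   # a different rule

    # Each rule gives a different "mean"; a reader who never learns which rule
    # was applied cannot interpret the reported number as an ordinary mean.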
What is pre-registration? Good question. It actually comes back to what you said earlier about how many choices we often make in data analysis; inadvertently we're kind of hacking the p through the choices we make. So, if you run a study and beforehand you have only a vague idea of a hypothesis, and then you get the data and start running some hypothesis tests, you still have a lot of flexibility. Without knowing it, you often end up running multiple tests. You make all kinds of very subtle choices which all affect the p-value, the number you get, but also the interpretation thereof. Because if you decide, say, we're going to add 10 more participants and then rerun the test, that becomes part of the way you should interpret that specific p-value you're getting. So, if you really want to have a pure, untainted p-value, you should preregister before you get the data. That means writing down the whole sampling plan, all the methodological choices you're making, as much as you can. Sort of a protocol of the [inaudible]. A protocol, yeah, as much as you can and as explicit as you can; the more the better. You can do a very minimal version as well, but the better defined the better, and then afterwards you follow the protocol. Then you really have a confirmatory test of the hypothesis you're testing, instead of a much more exploratory test. The more exploratory something is, the more the p-value becomes undefined, in a sense. So, it's more of a scale than a yes-or-no, hacked or unhacked. The degree of. The degree of undefinedness, in a sense. The degree of uncertainty we have. If it's fully preregistered, that's the real, pure p-value, where we know exactly what it would mean to do this again by following this protocol. But if there was a lot of freedom, and you don't really know which decisions were made, it becomes harder to say what it would mean to do this exact study again, because we don't really know what the exact study even was. That is part of the definition of the p-value, running that study over and over again. Yeah, the study repeated over and over. The same procedure, though not always the same result. Exactly, yeah. All right. Laura, let's have you chime in on that idea of whether we should get rid of p-values, and this idea of more transparency. I think I agree with what Tim said, that we shouldn't get rid of them completely. But this connects to the whole reproducibility issue as well. A lot of studies have a paper that comes out and says, "Oh, we got this p-value," but they don't include in the methods all of the steps they took; they only include the steps that got them this p-value. If instead you include all the tests that you did and say, we tried this, it didn't work; we tried this, we got this p-value; then we'd have a lot more information on what actually went on, we could reproduce it better, and people reading it would get a little more context for how to interpret the p-value they're given.
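To make that flexibility point concrete, here is a rough simulation of the "add 10 more participants and rerun the test" behavior Tim mentioned, run under a true null hypothesis; the sample sizes, the number of simulated studies, and the 0.05 cutoff are all arbitrary choices for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    def peeking_study(start=20, step=10, max_n=100, alpha=0.05):
        """Simulate one study under a true null hypothesis, testing after
        every new batch of participants and stopping as soon as p < alpha."""
        data = rng.normal(size=start)
        while True:
            if stats.ttest_1samp(data, popmean=0.0).pvalue < alpha:
                return True                     # declared "significant"
            if len(data) >= max_n:
                return False
            data = np.concatenate([data, rng.normal(size=step)])

    # A single fixed-sample test would be "significant" about 5% of the time;
    # with repeated peeking the rate comes out noticeably higher.
    false_positives = sum(peeking_study() for _ in range(2000))
    print(false_positives / 2000)

The false positive rate climbs well above 5%, which is exactly why those decisions have to be part of how the reported p-value is interpreted, or, better, written down in advance.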
That definitely leads into our final P, the idea of having some good proposals, good practices. A couple of years ago the ASA, the American Statistical Association, right, provided a statement on p-values. One of the principles in that statement was, "Proper inference requires full reporting and transparency." Now, Tim, you've had some assignments in this course that are related to food, one of them being pizza, right? Right. And you've also had a connection to some pizza publications. Yes. And I'm going to quote part of the last line of the abstract from one of the articles regarding that: "We hope that our analyses will encourage readers to undertake their own efforts to verify published results, and that such initiatives will improve the accuracy and reproducibility of the scientific literature." So, that aligns with the idea that I think we should end with: what can we do better? Maybe from a couple of different perspectives, because we have some learners who are more on the consumer side. Right. Reading research reports and studies that are summarized in the news today. We have others who are going to be doing their own data analysis, for their own purposes, but not necessarily publishing it, and then we have our scientific researchers out there too. So, we're looking for some advice: maybe small steps in the right direction, maybe larger reform ideas. But let's end with some good ideas that we can all strive toward a little bit more in the work that we do. Yeah. So, one of the things you can do to check on reproducibility is, if you're running a certain analysis, there are different methods you can use, and you can try multiple methods and see if you get the same output, right? Take clustering, for example. There are a bunch of different clustering methods, and you might ask, "Here's the output I get from one; does it match these five other ones?" If it does, it's probably a good sign that there's something there. Versus if you only get your output from one very specific set of choices, and you make a little tweak and it changes a lot, then it's not a very robust finding, and you might need to do another experiment to test your hypothesis more thoroughly and say, "Okay, does this really mean anything?" One thing that's useful, and that I hope more people do in the future, is to post your data publicly and post your methods publicly. This allows amateurs who want to figure out how data science works to just try it out on their own: you can look at the code that researchers have posted and the data they posted, see if you can get the same results they got, maybe even try a few different analyses and see what you get. So it's a great way to learn something, and it's a great way to see if analyses are reproducible. Which is something that needs to be done more often, because most of the time when a paper comes out and gets published, everyone says, "Yeah, that's great," then goes along with it and never checks it again, until ten years later someone says, "I'm not really sure about that." Having everything be open access really provides a lot to the community in every way. Yeah, fully agreed with that. Even when it's sometimes not really possible to publicly share data because of sensitive information, I'm still very surprised by the fact that when we peer review articles, peer reviewers do not get access to the data. They do not get access to the analysis, and they have no way of verifying the results that are reported. They can read about it, but we can only read what is written, and we don't know what we're not reading. So I'm still baffled by the fact that we're not actively verifying those analyses. I think trust is very important, but verifying things is also important, and the two go very well together. Being transparent? Being, yeah, radically transparent. Absolutely. That's one of the key foundations of scientific progress, I think. For consumers, because you also asked us what consumers can do: that's maybe even the hardest question, because you're much more on the outside. You can't change what those people are doing, so you're just stuck with what people give you. I think, again, it's good to look for that radical transparency and rely more on research where they share their data and analysis, because then you can be more confident in those results.
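Going back to Laura's suggestion of trying several methods and comparing their output: here is a rough sketch of that kind of robustness check, assuming scikit-learn is available; the synthetic data, the choice of three clusters, and the three particular methods are all arbitrary:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import adjusted_rand_score

    # Synthetic data with some genuine cluster structure
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    labels = {
        "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
        "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
        "mixture model": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    }

    # Adjusted Rand index near 1 means two methods grouped the points the same way
    for name, found in labels.items():
        print(name, round(adjusted_rand_score(labels["k-means"], found), 2))

If the groupings broadly agree, that's reassuring; if a small change of method scrambles them, that's a reason to be cautious about the result.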
Also, look explicitly for replications, for experiments and studies that were repeated across the years by different people, so that you're not focusing too much on an individual paper. But it's tricky. It's hard. Even for people who are trained in this, it's still fairly hard. One of the things we're striving for in our course is to provide the data we're working with and the code that was used to generate what we're producing and showing to our learners out there, so that they can see it's good practice to keep track of what was done with the data and to have the ability to go back and check it. Why were those observations removed? You have the code that verifies what was done with the data along the way. So, just general data analysis practices: keeping track of your code, keeping a good log of what was done. Yeah, it's amazing. It's very important. I often say that I think statistics is almost entirely useless. What I mean with that is... Clarify that. Yeah, I will; of course this requires clarification. What I mean is that we have a very long chain, call it a chain of evidential reasoning, and statistics is one of the links in that chain. It's a very important one, but it comes relatively late in the process, and it relies on the integrity of all the previous links in the chain. So, for example, if something went wrong with the data collection or data storage or any of those other steps, we can't rely on statistics to fix that; it will still give us numbers as outputs. It also doesn't really serve as a good way to verify the integrity of the previous steps; it's completely reliant on their quality. So, if you've done a study in the right way, handled the data in the right way, maintained a logbook documenting all the steps that you did and did not take, in that case statistics is amazing. It's great. I feel like I need to add a caveat in there. Because... Oh yeah, please do. There are statistics out there whose main point is to tell you how good your data actually are, and whether you should be using it to get these other statistics. Yes, no, that's a very good one; I'm probably painting too black-and-white a picture, so thank you for your addition. There are absolutely ways to deal with low-quality data, but it's all post hoc in a sense; it's like fixing something that should not have been broken in the first place. In the first place, yeah. Thank you for your addition, it's absolutely true. Well, certainly, things are changing. The Internet these days is changing the way we distribute and discuss ideas, changing the way we gather and share our data and our results. We know that a single analysis is not always going to give us a definitive answer. We see in a lot of papers that they write at the end, "more study is needed." Very common phrase. There's always more to learn. Yeah. And I know that I have learned a lot more today, and I appreciate you taking the time to share your thoughts and your views on these important issues, giving us all, if I can keep that food theme going, some good food for thought. So, thank you. Thank you for having us.