Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Discrete Data Settings

In this module, we'll discuss testing in discrete data settings. This includes the famous Fisher's exact test, as well as the many forms of tests for contingency table data. You'll learn the famous "observed minus expected, squared, over the expected" formula, which is broadly applicable.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so let's go through our more mathematical development, where we're assuming a model. Before, when we were talking about it as a randomization process, we were conditioning on the data. We said: you have so many treated, you have so many controls, you have so many tumors and so many non-tumors, and we're simply redoing the randomization process on the computer under the hypothesis that the randomization is irrelevant. Right? That whether you received the treatment or the control was irrelevant. That's one way to think about Fisher's exact test. Now we're going to talk about a different way. So let X be the number of tumors for the treated and Y be the number of tumors for the control. Our null hypothesis is going to be H naught: p1 equal to p2 equal to the common proportion p, where we assume that X is binomial with sample size n1 and success probability p, and Y is binomial with sample size n2 and success probability p under the null hypothesis. Under the alternative, the two probabilities would have to be different.

By the way, if this is true, if both X and Y are sums of IID Bernoullis, then X plus Y is just the sum of more Bernoullis, n1 plus n2 Bernoullis, all with a common success probability p. So it's an interesting and fairly obvious fact that if you add two independent binomials with a common probability, the sum is also binomial, with a total number of trials equal to n1 plus n2 and the same probability. This is clear because if X is a sum of n1 Bernoullis with probability p, and Y is a sum of n2 Bernoullis with probability p, then X plus Y is simply the sum of n1 plus n2 IID Bernoullis with probability p; hence it's binomial with n1 plus n2 trials and probability p. So now, the way we've characterized the problem, we have two numbers, X and Y, that are random. In our two by two table, there are no other free numbers. Right? If we know X and we know n1, then we know the number of non-tumors for the treated group. If we know Y and we know n2, then we know the number of non-tumors for the control group. So in that two by two table we know the margins, n1 and n2.

And then if we know X and Y, which form the first column of numbers, then we know the second column of numbers.
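The claim that two independent binomials with a common p add up to a binomial can be checked numerically. The following is a minimal sketch in Python; the helper names `binom_pmf` and `sum_pmf` and the values of n1, n2 and p are illustrative assumptions, not taken from the lecture.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def sum_pmf(z, n1, n2, p):
    """P(X + Y = z) by convolving independent Binomial(n1, p) and Binomial(n2, p)."""
    return sum(binom_pmf(x, n1, p) * binom_pmf(z - x, n2, p) for x in range(n1 + 1))

# Illustrative values only (assumed for this sketch).
n1, n2, p = 6, 4, 0.3

# Largest disagreement between the convolution and Binomial(n1 + n2, p).
max_gap = max(
    abs(sum_pmf(z, n1, n2, p) - binom_pmf(z, n1 + n2, p))
    for z in range(n1 + n2 + 1)
)
print(max_gap)  # should be at floating-point rounding level
```

The convolution only matches when both binomials share the same p; with different probabilities the sum is generally not binomial.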

So we only have two free numbers among the four numbers in our two by two table. But we still have one parameter that we don't know, even under the null hypothesis. The null hypothesis says H naught: p1 equal to p2 equal to p. Okay? So what if we were to try to figure out a strategy to get rid of that parameter, to find a distribution that doesn't depend on it? It turns out that the distribution of one of the data points given the sum does this. And it doesn't matter which one; we just pick the first data point, and you get the same procedure if you pick the second one. The probability that X equals x given X plus Y equals z follows the hypergeometric probability mass function. And I give the hypergeometric probability mass function right there.
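For reference, under the setup above (X binomial with n1 trials, Y binomial with n2 trials, independent, common p), the hypergeometric mass function has the following standard form. The notation here is an assumption and may differ from the slide.

```latex
P(X = x \mid X + Y = z)
  = \frac{\binom{n_1}{x}\,\binom{n_2}{z - x}}{\binom{n_1 + n_2}{z}},
\qquad \max(0,\, z - n_2) \le x \le \min(n_1,\, z).
```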

Now, what's interesting about this is that this hypergeometric mass function is exactly the probability distribution from a couple of slides earlier, where we have so many bins of t's and n's and so many balls labelled t and c, for treated and control. We randomly allocate treated and control balls to the bins, with the first bin able to hold six balls and the end bin able to hold only four. I need to allocate ten balls, five treated and five control, randomly to those bins. That's the hypergeometric. The other way to think about it is that this is the distribution of the 2 by 2 table where you're permuting the t's and the c's, leaving the t's and the n's fixed in the way that we described earlier. Of course, that's identical to permuting the t's and the n's, leaving the t's and the c's fixed.
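The ball-and-bin description can be checked by brute force. The sketch below enumerates every way of choosing which six of the ten balls land in the first bin, then compares the resulting distribution of treated balls in that bin with the hypergeometric formula. The variable names are illustrative assumptions; the counts (five treated, five control, bins of six and four) are the ones mentioned above.

```python
from collections import Counter
from itertools import combinations
from math import comb

balls = ["T"] * 5 + ["C"] * 5   # five treated, five control
first_bin_size = 6              # the other bin holds the remaining four

# Count treated balls in the first bin over all equally likely allocations.
counts = Counter()
for chosen in combinations(range(len(balls)), first_bin_size):
    treated = sum(1 for i in chosen if balls[i] == "T")
    counts[treated] += 1

total = sum(counts.values())
enumerated = {x: counts[x] / total for x in sorted(counts)}

# Hypergeometric probabilities for the same counts.
formula = {
    x: comb(5, x) * comb(5, first_bin_size - x) / comb(10, first_bin_size)
    for x in sorted(counts)
}

for x in enumerated:
    print(x, enumerated[x], formula[x])
```

Enumeration is used instead of random simulation so that the two columns agree exactly rather than approximately.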

So again, if you have the same data and you assume that the row margins are the margins that include the randomized treatment, or you assume the column margins are the margins that had the randomized treatment, you wind up with the same procedure, provided you have the same data set. So that's interesting; perhaps comforting, perhaps discomforting. Either way, remember that before we only had two free numbers, the two success counts, X and Y. In this case a success is a tumor, so I'd hardly call that successful, but let's say success counts, using the convention of calling a binomial event a success regardless of how successful it is. We have the two success counts at the onset; when we know the value of the sum, we only have one left. So where before we had two free cells, given that the row margins were fixed, now we only have one free cell, given that the row margins are fixed and the sum is fixed. And this is exactly what Fisher's exact test really tells you: as you vary that upper left-hand cell, or any cell, holding both margins fixed, you get the remaining three elements. You can put in a value for the upper left-hand cell and go through the exercise of finding the other three cells very easily, given the margins.
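The exercise of filling in the other three cells can be written out directly. This is a minimal sketch; the function name `fill_table` and the layout (rows are treated and control, columns are tumor and no tumor) are assumptions made for illustration.

```python
def fill_table(x, n1, n2, z):
    """Return the 2 by 2 table implied by the upper left cell x.

    n1, n2: fixed row totals (treated, control).
    z:      fixed total of the first column (X + Y).
    """
    y = z - x  # control-row entry in the first column
    if not (0 <= x <= n1 and 0 <= y <= n2):
        raise ValueError("x is not compatible with the fixed margins")
    return [[x, n1 - x],
            [y, n2 - y]]

# Illustrative margins (assumed): five per row, six in the first column.
table = fill_table(4, n1=5, n2=5, z=6)
print(table)  # [[4, 1], [2, 3]]
```

Only values of x between max(0, z - n2) and min(n1, z) give a valid table, which is why the function rejects anything else.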

The hypergeometric distribution arises if we take the distribution of the upper left-hand cell and condition on the sum. Note that this distribution does not contain p. It got rid of it. And there's a mathematical reason for that: it's so-called conditioning on a sufficient statistic. When you condition on the sufficient statistic for p, you get rid of it. In this class, we won't go over the mechanics of why, or the logic of how Fisher came to condition on X plus Y, or how that mathematical development works. Suffice it to say, for the needs of this class, that when you condition on the sum you do get rid of that probability. And there is a very general mathematical principle behind it, which relies on the fact that X plus Y is sufficient for the parameter p.
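That p drops out can be seen numerically as well. The sketch below computes the conditional probability of X given the sum directly from the two binomials for several values of p and compares each against the hypergeometric formula. The function names and the chosen n1, n2, z and p values are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def conditional_pmf(x, z, n1, n2, p):
    """P(X = x | X + Y = z) computed from the binomial model with common p."""
    joint = binom_pmf(x, n1, p) * binom_pmf(z - x, n2, p)
    marginal = sum(binom_pmf(k, n1, p) * binom_pmf(z - k, n2, p) for k in range(n1 + 1))
    return joint / marginal

def hypergeom_pmf(x, z, n1, n2):
    """Hypergeometric mass function; contains no p."""
    if x < 0 or x > n1 or z - x < 0 or z - x > n2:
        return 0.0
    return comb(n1, x) * comb(n2, z - x) / comb(n1 + n2, z)

n1, n2, z = 5, 5, 6  # illustrative values (assumed)
gaps = []
for p in (0.1, 0.5, 0.9):
    for x in range(n1 + 1):
        gaps.append(abs(conditional_pmf(x, z, n1, n2, p) - hypergeom_pmf(x, z, n1, n2)))
print(max(gaps))  # rounding-level for every p tried
```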

Okay, so let's derive this conditional distribution. We know the probability that X equals x; it's just the binomial probability here. We know the probability that Y equals some value; let's say z minus x. This will make the derivation a little bit easier, but we can plug in anything here, provided z minus x is an integer between 0 and n2. Then it's this binomial probability right here. And we said already that X plus Y is binomial, so the probability that X plus Y equals z is this probability right here.

Okay, now putting everything together: the probability that X equals x and X plus Y equals z, over the probability that X plus Y equals z. That's exactly this conditional probability, just using the rules of conditional probability that we know quite well from Mathematical Biostatistics Boot Camp 1. And if X equals x and X plus Y equals z, then that's the same thing as saying X equals x and Y equals z minus x, of course.
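Written out, the steps described so far look like the following. This is a sketch that assumes X and Y are independent, as in the two-sample setup above; the last line follows because the powers of p and of 1 - p in the numerator multiply to exactly those in the denominator.

```latex
\begin{aligned}
P(X = x \mid X + Y = z)
  &= \frac{P(X = x,\; X + Y = z)}{P(X + Y = z)}
   = \frac{P(X = x)\, P(Y = z - x)}{P(X + Y = z)} \\[4pt]
  &= \frac{\binom{n_1}{x} p^{x} (1 - p)^{n_1 - x}\;
           \binom{n_2}{z - x} p^{z - x} (1 - p)^{n_2 - z + x}}
          {\binom{n_1 + n_2}{z} p^{z} (1 - p)^{n_1 + n_2 - z}} \\[4pt]
  &= \frac{\binom{n_1}{x}\,\binom{n_2}{z - x}}{\binom{n_1 + n_2}{z}}.
\end{aligned}
```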
