Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Techniques

This module is a bit of a hodgepodge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay.

So the biggest problem, of course, is that the magnitude of the differences is discarded. So the sign test is potentially not as powerful as you'd hope. It would be different if, say, only half of the differences were positive, but all the positive ones were much larger in magnitude and all the negative ones were really small; that is a different situation from the differences being spread equally above and below zero.

The other thing I would mention is that there's nothing specific about zero: you could test any median, theta equals theta naught, by calculating the number of times the difference is bigger than that specific value. That is, you can test whether the median equals any specific value.

What's interesting about that (we won't talk about this in detail, but it is kind of interesting) is that you can do this for any value of theta. That means you can find the values of theta for which you fail to reject and the values of theta for which you reject. And if you can do that, say by a grid search, then you can invert the test and get a confidence interval for the median. So this is an interesting, highly non-parametric way to get a confidence interval for the median of a set of data.
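
The grid-search inversion just described can be sketched as follows. This is not code from the lecture: the function name, the grid resolution, and the simulated example data are all my own choices for illustration, and I use SciPy's exact binomial test as the sign test.

```python
# Sketch: invert the two-sided sign test over a grid of candidate medians
# theta0; the set of theta0 we fail to reject is a confidence interval
# for the median. Data and tuning values below are made up.
import numpy as np
from scipy.stats import binomtest

def sign_test_ci(x, conf=0.95, grid_points=2001):
    """Collect the theta0 values the sign test fails to reject at level 1 - conf."""
    x = np.asarray(x, dtype=float)
    alpha = 1 - conf
    grid = np.linspace(x.min(), x.max(), grid_points)
    kept = []
    for theta0 in grid:
        d = x - theta0
        d = d[d != 0]                       # discard observations equal to theta0
        k = int(np.sum(d > 0))              # number of differences above theta0
        p = binomtest(k, n=len(d), p=0.5).pvalue
        if p > alpha:                       # fail to reject: theta0 stays in the interval
            kept.append(theta0)
    return (min(kept), max(kept))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=30)  # hypothetical paired differences
lo, hi = sign_test_ci(x)                     # interval should cover the sample median
```

Note the interval always contains the sample median, since the sign test's p-value is largest there.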

So Wilcoxon thought about this problem of discarding the magnitude of the differences. He said: instead of using just the signs, why don't we also use the ranks of the differences? Using the ranks retains some of the information about the magnitude of the differences.

You're still testing whether the median is zero versus the three potential alternatives, and we can create a statistic that, appropriately normalized, approximately follows a normal distribution. But if there are no ties, the exact small-sample distribution is known, so why use the normal distribution? The one benefit of the normal approximation in this case is that, because the small-sample distribution is known, we can evaluate how accurate the normal approximation is.

This is different from the usual setting for normal approximations, where you don't know the associated exact small-sample distribution, so you can't tell how well the approximation is working. So at least in that one respect, normal approximations in non-parametric tests are a little different from normal approximations in general: here we can get a much better sense of how well the normal approximation is working.

So here's the signed rank procedure. We take all the paired differences, just like we did before, and then we take the absolute values of the differences, which is new. Then we rank these absolute values from least to greatest, throwing out the zeros. Note there's a difference between zeros and ties: if there are ties, in other words two differences that are identical but non-zero, we assign them the average rank. After the ranking, we multiply each rank by the sign of the difference, plus one for a positive difference and minus one for a negative difference. Then we calculate the sum of the positive ranks, and that's the signed rank statistic, W plus.
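
The steps above can be sketched in a few lines. This is not the lecture's code; the function name and the example differences are made up for illustration.

```python
# Sketch of the signed rank statistic W+ as described above:
# drop zeros, rank |differences| (average rank for ties),
# and sum the ranks of the positive differences.
import numpy as np
from scipy.stats import rankdata

def signed_rank_statistic(diffs):
    """Sum of the ranks of |differences| that carry a positive sign."""
    d = np.asarray(diffs, dtype=float)
    d = d[d != 0]                     # throw out the zero differences
    ranks = rankdata(np.abs(d))       # 'average' method handles non-zero ties
    return float(np.sum(ranks[d > 0]))

diffs = [1.2, -0.4, 2.5, -0.3, 0.0, 1.1]   # hypothetical paired differences
w_plus = signed_rank_statistic(diffs)       # ranks 4, 5, 3 are positive: W+ = 12
```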

Now think about what this is doing. Suppose we did have the case where all the differences that were large in absolute value were positive and all the ones that were small in absolute value were negative. Then all the small ranks would carry a negative sign and all the large ranks a positive sign, so the signed rank statistic would be very different from what you'd expect by chance, where the signs are distributed equally among the ranks. The sign test by itself, in contrast, would produce a statistic that looked a lot like it came from the null, as long as roughly equal numbers of differences fell above and below zero.

So just to reiterate this point: if the median is large, then W plus should be large; if the median is small, then W plus should be small. And W plus does follow a normal distribution for large samples. But especially for small sample sizes, we can work with an exact distribution under the null hypothesis. You can get critical values from a table, and you can compute them pretty easily, though it's maybe a bit involved to work with the exact null distribution.

We'll do some of the work with the exact distribution of the statistic. But I wanted to at least talk about Monte Carlo, so you can see that we can pretty easily figure out what the exact distribution is, provided there are no ties.

So here's what you could do. This procedure is invariant to the distribution being used, so you could simulate n observations from any distribution whose median is zero, rank the absolute values of the data, retain the signs, calculate the signed rank statistic, and apply this procedure over and over again. You'll have just used Monte Carlo to get the exact null distribution.

Because we're assuming the statistic is invariant to the distribution, we could use any distribution, for example the standard normal, and that would give you the exact small-sample distribution. But we can actually go further than that. Under the null hypothesis, the signs are equally likely to be distributed anywhere among the ranks. So all you have to do is take the ranks of the numbers between one and n and randomly allocate the signs: flip a coin for each rank value. Take rank one, flip a coin with success probability 0.5; take rank two, flip a coin; and so on. Then you've generated this distribution exactly, rather than having to mess around with the normal distribution or something like that.

I present it this way just to show you conceptually that, if this test really doesn't depend on the distribution, then you could pick any distribution and just simulate under the null. But I'm contending that it's even easier than that. All you have to do is take the ranks, i.e., the integers from one to n, flip a coin for each rank, assigning a plus one for a head and a minus one for a tail, and then you will have exactly calculated the null distribution of the signed rank statistic.

So it works out very conveniently. In fact, I say this explicitly on this slide, where here's a slightly more elegant way to do it: take the ranks one to n, randomly assign the signs as binary with probability 0.5 of being positive and 0.5 of being negative, and then calculate the signed rank statistic.
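
The coin-flip version is even shorter as code, since no data need to be simulated. Again a sketch under assumed sizes, not the lecture's own code:

```python
# Sketch of the more elegant version: take the ranks 1..n and attach a
# random sign to each with probability 0.5. Summing the ranks that came
# up positive gives one draw from the exact null distribution of W+.
import numpy as np

rng = np.random.default_rng(2)
n, n_sim = 24, 20000                          # hypothetical sizes
ranks = np.arange(1, n + 1)
signs = rng.integers(0, 2, size=(n_sim, n))   # one coin flip per rank per draw
null_draws = (signs * ranks).sum(axis=1)      # ranks with a "positive" flip
mean_est = null_draws.mean()                  # should be near n(n+1)/4 = 150
```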

At any rate, if you wanted a Monte Carlo p-value, you would just apply this procedure, simulating over and over again, and calculate the proportion of times the simulated test statistic was as or more extreme than your observed test statistic. That proportion would be a Monte Carlo approximation to your p-value.
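
Putting the pieces together, a Monte Carlo p-value might look like the sketch below. The observed W plus of 194.5 comes from the example later in the lecture; the sample size n = 24 is my inference from the stated null mean of 150, and the two-sided extremeness measure (distance from the null mean) is my own choice.

```python
# Sketch: Monte Carlo two-sided p-value for the signed rank test, using
# the coin-flip-on-ranks null draws. n = 24 is inferred from E[W+] = 150.
import numpy as np

rng = np.random.default_rng(3)
n, n_sim, w_obs = 24, 50000, 194.5
ranks = np.arange(1, n + 1)
signs = rng.integers(0, 2, size=(n_sim, n))
null_draws = (signs * ranks).sum(axis=1)

mu = n * (n + 1) / 4                          # null mean, 150
# proportion of simulated statistics at least as far from the mean
p_mc = np.mean(np.abs(null_draws - mu) >= abs(w_obs - mu))
```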

Then let me just go through the large sample distribution. I'm not going to derive it; it's actually pretty easy to derive, to be honest. If you think about it, the statistic is really a sum of binary variables times the integers one to n, so it's pretty easy to work with. So maybe as a homework assignment, derive the results I've put right here. You need some finite sum results, for the sum of the integers from one to n and the sum of the squared integers, but other than that it's a pretty easy problem.

But let's not spend time on that, because I don't see the true benefit of using the large sample approximation when you can use the small sample one anyway, unless there are ties or something like that.

At any rate, under the null hypothesis the expected value of our statistic is n times n plus 1 over 4, and the variance is the quantity shown on the slide. So our test statistic is just W plus minus its expected value, divided by its standard error, and that limits to a standard normal. You can do the correction for ties if you want. But under the null hypothesis we can still do an exact small-sample test, so why not just do that.

Okay, so let's go through our example. We take our differences, and we can run wilcox.test(diff, exact = FALSE). The calculations I go through right here should agree with the results of wilcox.test, I think. Now that I'm looking at this, I'm not sure; you may have to set correct = FALSE or something, to make sure it isn't applying the continuity correction. At any rate, double check that they're the same; they should be either exactly the same or very close, with the continuity correction being the distinction. We don't really cover little technical details like that in this class.

At any rate, you get an expected value of W plus of 150, and the variance is 1,225. So our test statistic is our observed W plus, 194.5, minus its expected value of 150, divided by the standard error, which works out to be 1.27. Of course that's not significant on the z scale, and our p-value in fact works out to be about 0.21.

So what is this suggesting? This is suggesting that we can't rule out that the median is in fact zero.
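
The example's numbers can be reproduced as below. The variance formula is not shown in the transcript; I'm assuming the standard null variance n(n+1)(2n+1)/24, and n = 24 is inferred from E[W plus] = 150 (with those values the variance does come out to 1,225).

```python
# Sketch: large-sample signed rank test for the lecture's example.
# Assumes the standard null variance n(n+1)(2n+1)/24; n = 24 inferred.
from math import sqrt
from scipy.stats import norm

n, w_plus_obs = 24, 194.5
mu = n * (n + 1) / 4                       # 150, as in the lecture
var = n * (n + 1) * (2 * n + 1) / 24       # 1225, as in the lecture
z = (w_plus_obs - mu) / sqrt(var)          # about 1.27
p = 2 * norm.sf(z)                         # two-sided p, about 0.20
# wilcox.test's default continuity correction nudges this to about 0.21
```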
