Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Techniques

This module is a bit of a hodge podge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so just expanding on these points,

matched binary data can arise from several circumstances:

for example, when measuring a response at two occasions,

or matching on case status in a retrospective study,

or matching on exposure status in a

prospective study or a cross-sectional study.

In all these cases, matching in general

induces a dependency, and that

has to be accounted for in the analysis.

So the pairs of binary observations are dependent;

in other words, your response at time

one is correlated with your response at time two.

So our existing methods don't apply.

However, we assume that

person one, who responded at time one and time two, is

independent of person two, who responded at time one and time two.

So we're assuming independence

across pairs and dependence within pairs.

Okay, so let's look at some notation. So here we're going to use our standard

contingency table notation, where we have n11, n12, n21, n22 for the four cells.

And then we have n plus 1, n plus 2, n1 plus, n2 plus.

So here's our data, the n's, and we're going to assume

that the four cell counts, n11, n12, n21, n22, are multinomial

with n trials, where n is the sum of them.

And the associated probabilities are conveniently labeled pi 11,

pi 12, pi 21, and pi 22. So in other words we're going

to assume that every pair of measurements, every time 1,

time 2 pair of measurements, is going to be a one in

exactly one of these four cells and a zero in the others.

So the person will have either said yes at both occasions, a yes and

then a no, a no and then a yes, or a no and then a no.

So each pair is only going to be a one in one of those cells.

And the probability of

being a one in that particular cell is pi ij.

Okay, and then the multinomial is just

the sum of all of these multivariate Bernoullis.

Okay.

And then we would denote the margins with a plus: n1 plus for the

row margin, pi 1 plus for the row margin of the probabilities, and so on.

And so pi 1 plus and pi plus 1 are the marginal probabilities of

a yes response at the two occasions,

disregarding the other occasion. So pi 1 plus

is the probability of saying yes at time 1 regardless

of whether or not you said yes at time 2.

And pi plus 1 is the probability of saying yes at time 2 regardless of whether or not

you said yes at time 1. Okay?
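As a minimal sketch of the plus notation, the margins can be computed directly from the four cell counts. The counts below are hypothetical, chosen only for illustration, and the variable names are not from the lecture.

```python
# Hypothetical 2x2 table of paired yes/no responses (illustrative counts only).
# Rows index the time 1 response, columns index the time 2 response;
# index 1 means "yes" and index 2 means "no".
n11, n12 = 40, 10   # yes at time 1: (yes, yes) and (yes, no)
n21, n22 = 20, 30   # no at time 1:  (no, yes) and (no, no)

n = n11 + n12 + n21 + n22   # total number of pairs

# Row margins: totals by time 1 response.
n1_plus = n11 + n12
n2_plus = n21 + n22

# Column margins: totals by time 2 response.
n_plus1 = n11 + n21
n_plus2 = n12 + n22

# Sample proportions of "yes" at each occasion, ignoring the other occasion.
p1_plus = n1_plus / n   # sample analogue of pi 1 plus
p_plus1 = n_plus1 / n   # sample analogue of pi plus 1

print(n1_plus, n_plus1, p1_plus, p_plus1)
```

Each pair is counted once in a row margin and once in a column margin, which is why the two sets of margins both add up to n.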

So marginal homogeneity is the hypothesis that

these two marginal probabilities are the same.

That's how it gets its name: marginal homogeneity, pi 1 plus equals pi plus 1.

And of course, because there are only two

probabilities in each margin, right,

if pi 1 plus equals pi plus 1, then pi 2 plus equals pi plus 2.

So the marginal probabilities are the same, and so we call it marginal homogeneity.

You can do a very quick calculation, right?

Pi 1 plus is pi 11 plus pi 12.

Pi plus 1 is pi 11 plus pi 21, right, and the pi 11 is common to both of those.

If you subtract it out, this hypothesis is

identical to pi 12 equal to pi 21.

Okay, and so that hypothesis is referred to as symmetry,

because it concerns the off-diagonal elements of the table, and

it's basically saying that the true probability matrix, the true

two by two table of probabilities, would be unchanged

if you were to transpose the table.

And so that property is called symmetry, and

hence this marginal homogeneity hypothesis is equivalent to symmetry,

but only in the case of a two by two table; in more general cases, it's not true.
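The equivalence between marginal homogeneity and symmetry in the two by two case can be written out as a one-line identity. This is a sketch in LaTeX, writing the spoken "pi ij" as $\pi_{ij}$:

```latex
\begin{aligned}
\pi_{1+} - \pi_{+1}
  &= (\pi_{11} + \pi_{12}) - (\pi_{11} + \pi_{21}) \\
  &= \pi_{12} - \pi_{21},
\end{aligned}
\qquad\text{so}\qquad
\pi_{1+} = \pi_{+1} \iff \pi_{12} = \pi_{21}.
```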

So we clearly have an estimate for all of the pis, so the pi

12 estimate is just n12 divided by n, the pi 21 estimate is

n21 divided by n, and so on: simply the proportions.

The estimates of the true probabilities of landing in each

cell would be the proportion of people who landed in each cell.

So the obvious estimate of the

difference between pi 12 and pi 21, i.e., how far away from symmetry you are,

or in other words, how far away from marginal homogeneity

you are, is just n12 over n minus n21 over n.

And it turns out, and this is maybe a

little bit involved for us to go through, but under

H naught a consistent estimate of the variance turns

out to be n12 plus n21 divided by n squared.
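The step skipped here can be sketched as follows, assuming the multinomial model for the four cell counts. This is a standard multinomial variance calculation rather than something worked through in the lecture, so treat it as a reconstruction:

```latex
\begin{aligned}
\operatorname{Var}\!\left(\frac{n_{12} - n_{21}}{n}\right)
  &= \frac{1}{n^2}\Bigl[\operatorname{Var}(n_{12}) + \operatorname{Var}(n_{21})
       - 2\operatorname{Cov}(n_{12}, n_{21})\Bigr] \\
  &= \frac{1}{n}\Bigl[\pi_{12}(1-\pi_{12}) + \pi_{21}(1-\pi_{21}) + 2\pi_{12}\pi_{21}\Bigr] \\
  &= \frac{\pi_{12} + \pi_{21} - (\pi_{12} - \pi_{21})^2}{n}.
\end{aligned}
```

Under $H_0$, $\pi_{12} = \pi_{21}$, so the last term vanishes and the variance is $(\pi_{12} + \pi_{21})/n$. Plugging in the sample proportions $n_{12}/n$ and $n_{21}/n$ gives $(n_{12} + n_{21})/n^2$.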

And so if you were to take this

as our statistic, n12 over n minus n21 over n,

and divide it by the standard error, so the square root of n12 plus

n21, divided by n,

you would get a so-called z statistic.

The preference in this case is typically to square that statistic.

I think that matches the traditional development.

And so the square of that statistic works out to have

this convenient form: n12 minus n21, quantity squared, over n12 plus n21.

And this follows a chi-squared distribution, because of course the z

statistic squared follows a chi-squared distribution with one degree of freedom.

So this is the famous McNemar's test statistic.

And you would reject marginal homogeneity if this test statistic is large.
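The algebra behind the convenient form is short. As a sketch in LaTeX, using the estimate $(n_{12} - n_{21})/n$ and the null standard error $\sqrt{n_{12} + n_{21}}/n$, the factors of $n$ cancel:

```latex
z = \frac{n_{12}/n - n_{21}/n}{\sqrt{n_{12} + n_{21}}\,/\,n}
  = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}},
\qquad
z^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}.
```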

So this test is called McNemar's test. And

notice what's interesting about McNemar's test is that only n12

and n21 are used. They're the only ones that carry the

relevant information about pi 1 plus and pi plus 1 being different.

Now, n11 and n22, the concordant cells, contribute to

the magnitude of this difference, but in testing whether or not

they differ, it's only the discordant cells, n12 and

n21, where people disagreed from time 1 to time 2, that matter.

So that's an interesting fact about this test.

It's called McNemar's test, and it's a very famous statistic.
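To make the point about concordant and discordant cells concrete, here is a minimal Python sketch. Both tables are hypothetical, and the function names are illustrative rather than taken from the lecture or from any library.

```python
def mcnemar_statistic(n12, n21):
    """Uncorrected McNemar statistic: (n12 - n21)^2 / (n12 + n21)."""
    return (n12 - n21) ** 2 / (n12 + n21)


def estimated_difference(n11, n12, n21, n22):
    """Estimate of pi 1 plus minus pi plus 1, i.e. (n12 - n21) / n."""
    n = n11 + n12 + n21 + n22
    return (n12 - n21) / n


# Two hypothetical tables with the same discordant cells (n12 = 10, n21 = 20)
# but very different concordant cells (n11, n22).
table_a = dict(n11=40, n12=10, n21=20, n22=30)      # n = 100
table_b = dict(n11=400, n12=10, n21=20, n22=570)    # n = 1000

stat_a = mcnemar_statistic(table_a["n12"], table_a["n21"])
stat_b = mcnemar_statistic(table_b["n12"], table_b["n21"])

diff_a = estimated_difference(**table_a)
diff_b = estimated_difference(**table_b)

print(stat_a, stat_b)   # identical: the statistic ignores n11 and n22
print(diff_a, diff_b)   # different: the concordant cells enter through n
```

The test statistic is the same for both tables, while the estimated difference shrinks by a factor of ten in the larger table because the concordant cells inflate n.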

Okay, so let's look at our test statistic from the approval rating example.

We have 86 and 150 as the off-diagonal cells,

so that's 86 minus 150, quantity squared, over 86 plus 150.

That works out to be 17.36.

The p-value is extremely small then, right?

Because a chi-squared with one degree of freedom

is going to be extremely unlikely to be

above 9, which is 3 squared,

with 3 being way out in the tail of the standard normal.

Hence we reject the null hypothesis and conclude that there appears

to be some sort of change in opinion between the polls.
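The by-hand calculation can be reproduced with a few lines of Python. This is a sketch using only the standard library; it relies on the fact that a chi-squared with one degree of freedom is a squared standard normal, so its upper tail can be written with the complementary error function. Only the two off-diagonal counts quoted above are used.

```python
import math

# Off-diagonal (discordant) counts quoted in the lecture's example.
n12, n21 = 86, 150

# Uncorrected McNemar statistic.
statistic = (n12 - n21) ** 2 / (n12 + n21)

# Upper tail of a chi-squared with 1 degree of freedom.
# If Z is standard normal, P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 2))   # 17.36
print(p_value)               # well below 0.001
```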

At any rate, in R you can just do

mcnemar.test; you have to give it a matrix.

And again this is one of these instances where, if

you want to get exactly the statistic you worked out by hand, you

have to put correct equals FALSE, because it does a continuity correction

by default. You want, in general,

to leave the continuity correction in; I'm just

setting it to FALSE here so that it matches exactly your by-hand calculations.
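To see what the continuity correction changes, here is a minimal Python sketch (not R code). It assumes the correction subtracts 1 from the absolute difference of the discordant counts before squaring, which is the commonly used form; whether this matches `mcnemar.test` exactly should be checked against the R documentation. The function name `mcnemar` and its `correct` argument are hypothetical, chosen only to mirror the R option.

```python
import math


def mcnemar(n12, n21, correct=True):
    """Return (statistic, p_value) for McNemar's test on the discordant counts.

    With correct=True, 1 is subtracted from |n12 - n21| before squaring
    (assumed form of the continuity correction).
    """
    diff = abs(n12 - n21)
    if correct and diff > 0:
        diff -= 1
    statistic = diff ** 2 / (n12 + n21)
    # Chi-squared (1 df) upper tail via the standard normal: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Discordant counts from the lecture's example.
uncorrected, p_uncorrected = mcnemar(86, 150, correct=False)
corrected, p_corrected = mcnemar(86, 150, correct=True)

print(round(uncorrected, 2))   # 17.36, the by-hand value
print(round(corrected, 2))     # 16.82
```

Under this assumed form, the correction makes the statistic slightly smaller and the p-value slightly larger, so it is the more conservative choice; in this example the conclusion does not change.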
