Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

41 ratings


From the lesson

Hypothesis Testing

In this module, you'll get an introduction to hypothesis testing, a core concept in statistics. We'll cover hypothesis testing for basic one and two group settings as well as power. After you've watched the videos and tried the homework, take a stab at the quiz.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

OK, so here's a simple example of doing a paired t test. I took data from two exams from a particular class one year. So I had two exams, and the data are clearly paired because it's the same students measured twice.

So let's make up a question: I have exam one and exam two, and the question I'm going to ask is whether there is any evidence that the second exam was easier or harder than the first. One way to investigate that question would be to ask whether the mean of exam one is different from the mean of exam two.

Now, again, we're testing a population mean, so at some level I'm trying to model this as if the students are a draw from some population of students, which seems reasonable in this setting. Even though they weren't exactly randomly sampled, it seems like a reasonable thing to try to do.

OK, so let me take the summary for test one. In R I just run summary on test one, and I found that the worst the students did, which wasn't so bad, was a 76%, and the best did 100%. Similarly, for test two the worst anyone did was a 71%, and the best was 100%.

Okay, so here's a scatter plot where I have test one on the horizontal axis and test two on the vertical axis. I've arranged it so the axes are the same, both starting at 70 and ending at 100. And maybe there's some suggestion that the test two scores were a little bit higher; it seems like more points are lying above the identity line than below.

The correlation was 0.21, the standard deviation for test one was six, and the standard deviation for test two was six also. And from the previous slide, the sample mean for test one was 87, and for test two it was 90.

Then on this next plot, what I'm showing is the difference between test two and test one versus the average of test two and test one. This plot is basically the previous plot tilted 45 degrees, sort of like turning your head. This is called a mean difference plot.

The difference goes on the vertical axis and the mean on the horizontal axis. Tukey was the person who came up with the mean difference plot, but there's also a very famous paper on test-retest reliability by Bland and Altman, where they promote mean difference plots and actually show how to do some inference associated with them. So these plots are quite well known.

One of the reasons for doing them, maybe not so necessary for this data set, is that if the correlation is high, a scatter plot winds up with a lot of blank space in the upper left-hand corner and the lower right-hand corner. Rotating it in this direction gets rid of a lot of that blank space and makes for a far more efficient plot.

At any rate, that's a well-known technique, and whenever you're looking at paired observations, I think for many people the mean difference plot is the more natural starting point for looking at this kind of data.
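As a minimal sketch of how the coordinates for a mean difference plot are computed (the scores below are invented for illustration, not the actual class data):

```python
# Build the (average, difference) coordinates used in a Tukey
# mean difference (Bland-Altman) plot from paired scores.
# These scores are made up for illustration.
test1 = [88, 76, 92, 85, 90]
test2 = [90, 80, 91, 88, 95]

averages = [(a + b) / 2 for a, b in zip(test1, test2)]
differences = [b - a for a, b in zip(test1, test2)]

# Each point (average, difference) would be plotted with the
# difference on the vertical axis and the mean on the horizontal axis.
for avg, d in zip(averages, differences):
    print(avg, d)
```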

Okay, so now let's actually perform our test. I'm going to show R code here; I'm hoping that everyone in the class at this point could actually do the test. The differences are test two minus test one; that's the first line. And n is just the number of subjects. I put a comment there that it worked out to be 49 subjects. The mean of the differences worked out to be 2.88, and the standard deviation, again the standard deviation of the differences, worked out to be 7.61.
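These first steps can be sketched in Python with invented scores (the lecture's actual class data had 49 students and isn't reproduced here):

```python
import statistics

# Made-up paired exam scores for illustration.
test1 = [88, 76, 92, 85, 90, 81]
test2 = [90, 80, 91, 88, 95, 84]

# Pairwise differences, test two minus test one.
diff = [b - a for a, b in zip(test1, test2)]

n = len(diff)                     # number of subjects
mean_diff = statistics.mean(diff)
sd_diff = statistics.stdev(diff)  # sample standard deviation of the differences
```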

Okay, so my test statistic is the square root of the sample size, because remember that's in the denominator of the standard error, so we can just bring it up to the numerator, times the average difference. Our null value that we're testing against is zero, so I'm just going to leave that out. Then divide by the standard deviation of the differences. I have a little comment there that this works out to be 2.65. The two-sided p value works out to be 0.01, so we reject. We knew we were going to reject, because 2.65 is a big value for a standard normal, and by the time we get to 48 degrees of freedom, the t distribution is pretty close to a standard normal.

So we knew with 2.65 that we were going to reject. The exact way we calculate the p value is twice the probability of getting a test statistic as large or larger than 2.65, for a t distribution with n minus one degrees of freedom. I think whether you do this with pt or pnorm you're going to get about the same answer. Because we're doing a two-sided test, we multiply by two, so it's two times this t probability. That gives us our p value, which works out to be around 0.01.
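Using the summary statistics quoted in the lecture (n = 49, mean difference 2.88, standard deviation 7.61), the test statistic and the normal-approximation p value can be sketched as:

```python
import math

# Summary statistics quoted in the lecture for the paired differences
# (test two minus test one).
n = 49
mean_diff = 2.88
sd_diff = 7.61

# Test statistic: sqrt(n) * mean difference / sd of the differences
# (the null value of 0 is omitted from the numerator).
t_stat = math.sqrt(n) * mean_diff / sd_diff

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two-sided p value using the standard normal approximation, which is
# close to the t distribution with n - 1 = 48 degrees of freedom.
p_value = 2 * (1 - normal_cdf(t_stat))

print(round(t_stat, 2))   # ~2.65
print(round(p_value, 2))  # ~0.01
```

With an exact t distribution (pt in R) rather than the normal approximation, the p value comes out slightly larger but still about 0.01, as the lecture notes.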

So we reject the null hypothesis and conclude that there does appear to be some difference in the means between test one and test two. I would say you typically don't go through all these calculations; you calculate your differences and then you just use a function to do the work for you. In this case, the function is t.test in R, but every statistical package has something to do a paired t test.

OK, so we're not going to spend too much more time on paired differences. I'm hoping that for everyone in this class, this discussion is no great stretch. But let me raise some points. One thing, when you're doing paired t tests, it's generally worthwhile to ask: are ratios more relevant than pairwise differences? If ratios are more relevant, then consider doing the paired t test on the log observations rather than the observations themselves.

Another thing: when considering matched pairs data, you always want to do some plots first. Plot the first observation by the second, and then I showed you this mean difference plot of the average versus the difference. And again, if you're interested in relative quantities, then do everything on the log scale. What I mean by that is take the natural log of every observation and then proceed with the analysis on the log data, treating it like you would treat the data normally.
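A small sketch of why the log scale turns a test of differences into a test of ratios (the paired values below are made up for illustration): the difference of logs is the log of the ratio, so a paired t test on logged data is a test about the typical ratio.

```python
import math

# Made-up paired observations for illustration.
before = [10.0, 12.0, 9.0, 15.0]
after = [12.0, 15.0, 10.0, 18.0]

# Taking logs first turns pairwise differences into log ratios:
# log(a) - log(b) == log(a / b).
log_diffs = [math.log(a) - math.log(b) for a, b in zip(after, before)]
log_ratios = [math.log(a / b) for a, b in zip(after, before)]

# The two are the same quantity, so the paired analysis on the logged
# data is about the typical ratio, not the typical difference.
for d, r in zip(log_diffs, log_ratios):
    assert abs(d - r) < 1e-12
```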

So anyway, this plot is called the mean difference plot. It was invented by Tukey, and it's often called a Bland-Altman plot after the very well-known paper by Bland and Altman, who added quite a bit of inference on top of it.

