0:00

[MUSIC]

Hi, in this module I want to talk about statistical significance.

Now statistical significance is an important concept in statistics.

If you've taken any kind of statistics class, you've almost

certainly encountered it, and it's widely misunderstood.

Now for the purpose of this module, I'm not going to get into all the formulas,

you'll have to take a statistics class to learn how to compute a significance

test and so forth.

What I want to talk about is the ways in

which significance is sometimes misunderstood,

and help caution you against misapplication or misunderstanding.

It's maybe useful even if you have had a statistics class because statistics

classes tend to be heavily formal without thinking about the implications for

research.

Now a significance test or a measure of statistical significance,

all this assumes that we are making a measurement in

a sample that is drawn from some larger population.

So statistical significance and hypothesis testing have always

been embedded in the framework of thinking about generalizing

to some large population that we cannot measure in its entirety,

by making measurements on a sample from that larger population

and then somehow generalizing from them.

And it's quite important and often forgotten, that actually most of

the mathematics that underlies tests of statistical significance,

hypothesis testing and so

forth assumes that our sample is actually a random draw from the larger population.

In other words, in drawing our sample, we follow the procedures that

we talked about in previous lectures of randomly sampling or probability sampling

from some larger population in which we're actually interested.

So measurement in a sample, when we go out and, for example,

conduct an opinion poll of a few hundred people, we compute the percentage

of people that hold some attitude or believe in something.

The measurement we make in that sample is what we call

an estimate of a population parameter.

So out there in the population in which we are interested, we think of there being

numeric values, parameters, that describe characteristics of that population.

Perhaps the total percentage of people with a particular attitude or

a preference for a particular party.

And then we draw our sample and we make a measurement.

And then we want to use that measurement to figure out how to say something about

the corresponding parameter that describes that in the larger population.

Now, this is very important, all of the mathematics

involved in the calculation of statistical significance

assumes that the measurement is influenced by chance.

That when we're drawing a sample at random from some larger population

in order to perform some sort of measurement on them,

measure the percentage of people that support a particular party and so forth.

The composition of that sample, again is based on luck of the draw, random chance.

So whatever the population parameter is,

whatever the true percentage of people who support one political party or another.

In the sample that we draw, it's like drawing colored balls from an urn

if you think about examples you may remember from probability classes.

And so even if in an urn we have 500 red balls and

500 blue balls, if we randomly draw 10 balls or

20 balls we may not always get an exact even distribution by color, right?

By the laws of chance,

occasionally you'll get perhaps more red balls or more blue balls.

A little bit different from the overall shares in the urn from which we are drawing.
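To make the urn example concrete, here is a minimal sketch that computes how often a draw of 10 balls splits exactly 5 and 5. It assumes the balls are drawn without replacement, and the function name prob_red_count is made up for this illustration.

```python
from math import comb

def prob_red_count(k, draws=10, red=500, blue=500):
    """Probability of exactly k red balls when drawing `draws` balls
    without replacement from an urn of `red` red and `blue` blue balls."""
    return comb(red, k) * comb(blue, draws - k) / comb(red + blue, draws)

# Probability of every possible number of red balls in a draw of 10.
probs = [prob_red_count(k) for k in range(11)]

p_even = probs[5]        # exactly 5 red and 5 blue
p_uneven = 1 - p_even    # any other split

print(f"P(exactly 5 red of 10) = {p_even:.3f}")
print(f"P(some uneven split)   = {p_uneven:.3f}")
```

Under these assumptions an exactly even draw turns up only about one time in four, even though the urn itself is split perfectly evenly.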

So the same reasoning applies to samples drawn from a larger population.

If we have some share of the population

that has some affiliation with a particular political party,

when we draw a sample, by the laws of chance, yes, maybe on average that

sample will resemble the distribution in the larger population but

there's always a probability that it may tilt in one direction or another.

You may have people with affiliation with a particular party overrepresented in

the sample even if it's a well-designed sample.

Again, just by the laws of chance.

So what a significance test is really trying to do is assess

how likely the value that we measure in our sample is,

perhaps the proportion with an affiliation with a particular party.

That is, how likely we are to see a value like the one we have observed

in a sample if the true population parameter, the true value

in the larger population, were some hypothesized value.

So typically, we put this into the format of a null hypothesis,

where we have a hypothesis that the parameter in the larger

population is some specific value.

And then a significance test works out the probability that just

by the luck of the draw, we could get a distribution like the one

that we have on our sample, purely as a result of random chance.

So going back to the voter analogy, we might hypothesize, as

our null hypothesis, that the people are evenly divided between two parties,

that is, our null hypothesis might be that the split is 50-50.

And then in drawing out our sample, measuring it,

we find that perhaps 48% of the people favor one party and 52% favor the other.

A significance test would be an assessment of

how likely we would be to get a split like that,

48-52, if the population parameter were

the hypothesized value of 50-50.

As you might imagine,

more skewed distributions in the sample get progressively less likely.

So we often talk about a p-value, that's the probability of getting

a value at least as far from the hypothesized value as the one we observed, if the null hypothesis were true.

So in our example, it would be the probability

of observing a split at least as uneven as 48-52, if the true split

was, as hypothesized, 50-50 in the larger population.
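As a rough illustration of that calculation, the sketch below computes an exact two-sided binomial p-value for the voter example. The lecture does not give a sample size, so the 500 respondents here are an assumption, and two_sided_p_value is a hypothetical helper rather than a library function. It treats every count at least as far from the expected count as the observed one as being as extreme or more so, which is one common convention.

```python
from math import comb

def two_sided_p_value(successes, n, p0=0.5):
    """Exact binomial p-value: the probability, if the true proportion
    is p0, of a count at least as far from the expected count n * p0
    as the observed count."""
    expected = n * p0
    observed_gap = abs(successes - expected)
    total = 0.0
    for k in range(n + 1):
        if abs(k - expected) >= observed_gap:
            total += comb(n, k) * p0**k * (1 - p0)**(n - k)
    return total

# Assumed sample: 500 respondents, 240 (48%) favoring one party.
p_48_52 = two_sided_p_value(240, 500)

# A more skewed sample of the same size: 200 (40%) favoring that party.
p_40_60 = two_sided_p_value(200, 500)

print(f"48-52 split in 500 people: p = {p_48_52:.3f}")
print(f"40-60 split in 500 people: p = {p_40_60:.6f}")
```

With these assumed numbers a 48-52 split is not at all surprising under a 50-50 null hypothesis, while a 40-60 split would be very unlikely.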

So when we talk about correlation or regression,

the null hypothesis is usually that there is no relationship between two variables,

that the correlation coefficient is zero, or the regression coefficient is zero.

That is, changes in one variable are not associated in a systematic

fashion with changes in the other variable.

So statistical significance simply means that if the null hypothesis were true,

chance variation would be unlikely to yield a sample that produces

a value like the one that we've actually measured.

So for example, if we ran a correlation or a regression, and we got some

estimate of a correlation coefficient, and we carried out a significance test,

and we claimed that the correlation coefficient was significant

at the 0.05 level,

then all that it's saying is that

if there was no relationship between the two variables in

the original population from which we drew the sample,

there's at most a 0.05 probability, a 5% chance, that in a sample of the size that we've drawn,

we would see a correlation at least as large as the one that we've calculated.

And we might think that gee, 0.05, that's quite small,

a 1 in 20 chance of seeing a result like ours purely through random variation in

the composition of our sample.

And we might decide that this correlation really is not zero.

That somewhere in the larger population, the true correlation must be

not zero.

We reject the null hypothesis.

So what influences statistical significance?

One is the magnitude of the effect in the population or the sample.

So essentially, the further a measured value is from the hypothesized value,

for example, the larger the correlation coefficient,

the more likely it is to show up as being statistically significant.

Sample size matters as well.

So if you're thinking about the same measured effects or

the same measured correlation coefficients across samples of different sizes,

it's more likely that in a large sample that correlation coefficient will be

reported as statistically significant.

The sample size feeds into the calculation of the p-value,

feeds into the calculation of the test statistics

that shape the determination of the p-value.
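One way to see how sample size enters the p-value is the sketch below. It uses Fisher's z transformation with a normal approximation, which is a stand-in chosen because it needs only the standard library, not necessarily the test a statistics package would report. The correlation of 0.2 and the sample sizes are made-up values, and approx_p_value is a hypothetical name.

```python
from math import atanh, erfc, sqrt

def approx_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis that the
    population correlation is zero, using Fisher's z transformation
    and a normal approximation (requires n > 3)."""
    z = atanh(r) * sqrt(n - 3)
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))

# Same measured correlation, three different sample sizes.
r = 0.2
p_by_n = {n: approx_p_value(r, n) for n in (30, 100, 300)}

for n, p in p_by_n.items():
    print(f"r = {r}, n = {n:>3}: approximate p = {p:.4f}")
```

The measured correlation never changes here, yet the approximate p-value shrinks steadily as the sample grows.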

And finally, the amount of error.

So if we're measuring our outcomes, our variables, with a lot of error,

that adds noise to the system.

And then it turns out that it makes it harder to actually

show a statistically significant relationship, or effect.

Because again, we're introducing noise into the system and

it becomes harder to rule out the possibility

that something that we've observed is the product of random chance.
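The effect of measurement error can be sketched with a small simulation. Everything here is invented for illustration: the variables, the noise levels, the sample size of 2000, and the fixed random seed. The helper pearson_r is written out by hand so the example needs only the standard library.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# y really does depend on x, plus some ordinary variation.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

# The same outcome, recorded with a lot of extra measurement error.
y_noisy = [yi + rng.gauss(0, 2) for yi in y]

r_clean = pearson_r(x, y)
r_noisy = pearson_r(x, y_noisy)

print(f"correlation with accurate measurement: {r_clean:.2f}")
print(f"correlation with noisy measurement:    {r_noisy:.2f}")
```

The underlying relationship is identical in both cases, but the noisy measurement yields a smaller correlation, and a smaller correlation is harder to distinguish from zero at any given sample size.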

So what are the limitations of significance testing?

The most important is that significance tests are not proof of a causal relationship.

They have nothing to do with causality by themselves anyway.

Proof of cause and effect can only come from an appropriate study design.

Ideally, perhaps an actual experiment with control and treatment groups.

It might also be a natural experiment like we talked about in previous lectures, or

a quasi-experiment or instrumental variables.

But one way or the other, statistical significance by itself

does not say anything about cause and effect.

Significance tests assume that a sample is drawn at random from a larger population.

As soon as you've got a sample that doesn't satisfy that assumption,

you're actually violating a fairly important assumption behind

the computation of tests of statistical significance, and

it becomes a little harder to interpret the results.

You have to think more carefully about what you are actually

doing in terms of claiming statistical significance.

So we need to exercise caution when we're interpreting tests of

statistical significance from non-random samples.

So a test will again only show that an observed value was unlikely

if the parameter in the population was as hypothesized, but a statistical

significance test never proves that the observed value is impossible.

So there is always some chance,

it might be very very small that whatever we have observed in our sample,

when we measured it, is actually the product of random chance.

And that the population parameter is in fact different from what we claimed.

And we'll talk about some of the problems that that leads to in the next module,

on Type I errors.