
Sampling variability and the central limit theorem should not be new concepts to you anymore.

However, in this unit we're shifting the focus away

from numerical variables and focusing on categorical variables only.

So, in this video, we're going to start by talking about the sampling distribution for a sample proportion, because remember, when we're dealing with categorical variables, the parameter of interest is no longer a mean but a proportion.

And we're also going to define the central limit theorem for proportions, which is very similar to what we've seen before, but with a different measure of the standard error, as expected.

And we're going to walk through the conditions for that central limit theorem to hold as well.

Let's revisit quickly what we mean by a sampling distribution.

Say you have a population of interest and you take a random sample from it.

And based on that random sample, you calculate a sample statistic.

If in that sample the variable of interest is a categorical

variable, the sample statistic is going to be a sample proportion.

Then we take another sample, and also calculate the sample proportion from that.

And then another one, and then another one.

And this goes on for a long time, because we

want to think about taking as many samples as we can.

The distribution of the observations within each sample is called the sample distribution.

However, when we look

at the distribution of the sample statistics,

this is what we call our sampling distribution.

And remember that these two are not the same thing at all.

In the sample distributions, the observations are individual people or cases, whatever it is that you're sampling, whereas in a sampling distribution the observations are sample statistics.

Let's give a more concrete example: say we want to estimate the proportion of smokers in the world.

So our population is our world population, and capital N is going

to be our population size, so this is everybody in the world.

And our parameter of interest is p, the true proportion of smokers in the world.

If we actually had data from the entire population, we could calculate this

p as the number of smokers in the world divided by the total

population size.

But we don't have data from every single person in the world, so instead let's say that you're taking many samples from this population.

So the idea here is not necessarily

a realistic situation where you're doing data

analysis per se, but we're trying to

illustrate what we mean by a sampling distribution.

So you start with the first country on the roster, Afghanistan, and you sample 1000

people from Afghanistan.

And you ask each individual one, are you a smoker or not, and record a yes or a no for each individual person.

Then so on and so forth, you go to many countries.

Let's say you take another random sample of 1000 from the U.S., again asking each person, are you a smoker or not?

And recording a yes or a no for them.


And finally you end up in Zimbabwe, the last

country on the roster.

Another random sample of 1000 people from there as well.

Again asking them, are you a smoker or not?

So now you have a bunch of samples of 1000 observations each, where each observation represents a person from that country.

And say we summarize these samples.

So, we calculate the proportion of smokers in Afghanistan.

This is the sample proportion.

Then the sample proportion of smokers in the U.S. And you do this for every country, all the way up to the proportion of smokers in Zimbabwe.

So now, our data set is not individual people, and whether or

not they smoke, but actually we have a data set of proportions.

The distribution of these proportions is what we call the sampling distribution.

And as you can imagine, each of these should be a somewhat good guess for the true p.

Although we probably expect more variability between these than in the example we gave before, when we were talking about means using the average heights of US women from various states.

Because we actually would expect some trends in

the smoking habits of people from various countries.

But overall, we would expect the mean of these p-hats to be close to our unknown population proportion.
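This thought experiment can be sketched as a simulation (hypothetical numbers, not actual smoking data): draw many random samples from a population with a known p, record each sample proportion, and check that the mean of those p-hats lands near p.

```python
import random

random.seed(42)

p_true = 0.25    # hypothetical true proportion of smokers
n = 1000         # observations per sample
n_samples = 500  # number of samples we take

# For each sample, draw n yes/no answers and record the sample proportion.
p_hats = []
for _ in range(n_samples):
    smokers = sum(1 for _ in range(n) if random.random() < p_true)
    p_hats.append(smokers / n)

# The mean of the sampling distribution should be close to p_true.
mean_p_hat = sum(p_hats) / len(p_hats)
print(round(mean_p_hat, 3))
```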

So, this is very similar to the diagram we drew before. It's slightly repetitive, but we're basically trying to make sure here that it is very clear what we mean by a sampling distribution.

And something that's actually different here is

that initially we started with a categorical variable.

Is the person a smoker or not? Then for each one of our samples, we calculated a summary statistic: the proportion of smokers.

And now we are dealing with a distribution of numerical data.

Where our data, the data items are proportion of smokers in each country.

So we started with a categorical variable, but we're once again talking

about the distribution of a numerical variable, because

we're focusing on the distribution of sample statistics.


So what is the sampling distribution going to look like?

Well, the central limit theorem tells us about that.

It says that the distribution of sample proportions is going to be nearly normal.

Just like with sample means, it's going to be centered at the population parameter, which in this case is the population proportion rather than the population mean.

But again, generically, it's centered at the population parameter, with the standard error inversely proportional to the square root of the sample size, and that's also something we've seen before.

So the central limit theorem tells us about the shape of the distribution.

The center of the distribution, as well as the spread of the distribution.

And we can calculate the standard error as the square root of p times 1 minus p divided by n, where p is the proportion of success.
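As a quick sketch (the function name is just illustrative), this formula translates directly into code:

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Example: p = 0.9, n = 200, as in the angiosperm example later on.
print(round(se_proportion(0.9, 200), 4))  # roughly 0.0212
```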

Just like with any rule we introduce, there

are conditions for the central limit theorem as well.

The first condition is very similar to what we've seen before: independence of observations.

Our sampled observations must be independent, and to achieve that we want either random sampling or random assignment, depending on the type of study we have.

In addition, if we are sampling without replacement, we want to make sure that our sample size is less than 10% of our population.


We also have a condition about the sample size.

And this time we're not just setting a threshold sample size per se, but looking at the balance between the sample size and the proportion of success.

We are saying that there should be at least 10 successes and 10 failures in the sample. So, n times p and n times 1 minus p must both be at least 10.

This rule should sound familiar to you because we've actually talked about this

when we were dealing with the binomial distribution, and we were looking for

the normal approximation of it.

And the same idea holds here: we want our sample proportion to be nearly normally distributed, and therefore we need to meet the success-failure condition one more time.
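The success-failure check can be sketched as a small helper (the function name is hypothetical):

```python
def success_failure_met(n, p, threshold=10):
    """Check that we expect at least 10 successes and 10 failures."""
    return n * p >= threshold and n * (1 - p) >= threshold

print(success_failure_met(200, 0.9))   # 180 successes, 20 failures -> True
print(success_failure_met(50, 0.05))   # only 2.5 expected successes -> False
```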


However, if your p is unknown, we usually use our sample proportion instead, and this goes both for the calculation of the standard error and for checking the number of successes and failures.

Again, if you don't know your population parameter, your best guess is going to

be your sample statistic that you're using as a point estimate for that parameter.

So let's give a quick example.

We're told that 90% of all plant species are classified as angiosperms.

These are flowering plants.

If you were to randomly sample 200

plants from the list of all known plant species, what is the probability

that at least 95% of the plants in your sample will be flowering plants?


We're calling a sampled angiosperm plant a success here.

So "at least 95%" means we're looking for the probability that our sample proportion will be greater than 0.95. If we knew something about the distribution of p-hat, we should be able to easily calculate this probability.

In fact, if we knew that p-hat is distributed nearly normally, we know that we can calculate this probability using the normal distribution, z scores, and percentiles.

Well, the central limit theorem tells us that it may be distributed nearly normally, so let's check to see if the conditions for the central limit theorem hold. And if they do, then we can proceed with that.

The first condition is about independence. We're told we have a random sample.

And 200 is certainly less than 10% of all plants, so we can assume that

whether or not one plant in our sample is angiosperm, is independent of another.

Number two is about the success-failure condition.

200 is our sample size, our proportion of success is 0.9, so n times p, 200 times 0.9, is 180.

And n times 1 minus p, that's 200 times 1 minus 0.9 is 20.

Both of these are greater than 10.

So our success-failure condition holds as well, which tells us that the distribution of the sample proportion is going to be nearly normal.

In fact, it's going to be nearly normal with mean at the population parameter, 0.90, and standard error equal to the square root of 0.9 times 0.10 divided by 200, which gives us roughly 0.0212, or 2.12%.

Now we have a normal distribution.

We know its mean, we know its variability, and we're looking for a probability associated with this distribution.

Well, the first thing we need to do is draw our curve.


We mark our mean at 0.90, and then

we shade the area of interest anything beyond 0.95.

To calculate this probability, we can refer to a z score.

So let's calculate our z score as the observation minus the mean, divided by the standard deviation of that observation.

And because in this case the observation is a sample proportion, its standard deviation is going to be measured by the standard error, and that gives us a z score of 2.36.

We can see that we are more than two

standard deviations away from the mean at this point,

so it's going to be a pretty small probability.

By this time hopefully, you guys

are comfortable with finding these probabilities.

Remember we talked about using the table, using r or using

an applet, so you could for practice try one of these methods.

And check your solution against what I'm about to reveal.

So the probability that we're interested in here should be roughly 0.0091.
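For practice, these numbers can be checked with a short sketch using only Python's standard library (the helper name is illustrative; the lecture suggests a table, R, or an applet instead):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p, n = 0.9, 200
se = math.sqrt(p * (1 - p) / n)  # roughly 0.0212
z = (0.95 - p) / se              # roughly 2.36
prob = 1 - normal_cdf(z)         # upper tail, roughly 0.009

print(round(z, 2), round(prob, 4))
```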

One thing we should mention here is that we were looking for the probability of at least 95%, so it seems like we should have used the notation p-hat greater than or equal to 0.95.

However, remember that under a continuous distribution, which the normal distribution is, the probability of the random variable being equal to any single number is defined as 0.

Because that would be like finding the area of a line or

a sliver under the normal distribution, which doesn't really make sense.

To answer this question, we use the central limit

theorem, which is a technique that we just recently learned.

But we could also do this using the binomial distribution as well.

Remember, our sample size is 200, and our overall proportion of success is 90%, or 0.9. And we're basically being asked for the probability of obtaining at least 95% successes, or in other words, 95% of 200: at least 190 successes in 200 trials where the probability of success is 0.9.

We could do this easily using R; we're going to use the dbinom function to calculate the binomial probabilities.

And since we're looking for a range, we're going to

calculate a bunch of binomial probabilities and add them up.

So we're looking for the sum of all probabilities under the binomial distribution with n equals 200 and p equals 0.9, for anything between 190 and 200.

And this probability comes out to be roughly 0.008.
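The lecture uses R's dbinom; an equivalent sketch in Python's standard library (function name illustrative) sums the binomial pmf over the same range:

```python
import math

n, p = 200, 0.9

def binom_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# P(X >= 190): sum the probabilities of 190 through 200 successes.
prob = sum(binom_pmf(k, n, p) for k in range(190, 201))
print(round(prob, 3))  # roughly 0.008
```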

That is not exactly the probability that we calculated, but it's awfully close to it.


So, before we wrap up our discussion on the sampling distribution of proportions,

let's talk about a what-if scenario: what if the success-failure condition is not met?

The center of the sampling distribution will

still be around the true population proportion.

And the spread of the sampling distribution can still

be approximated using the same formula for the standard error.

However, the shape of the distribution will depend on

whether the true population proportion is closer to 0 or

closer to 1.


Let's take a look at this. Here's our number line, and remember that distributions of proportions have natural boundaries: they can only be between 0 and 1.

So we know that the sample proportion cannot

be below zero and cannot be greater than one.

Let's think about a situation where the success-failure condition is not met, but let's say that our true population proportion is 0.2, a value that's closer to 0 than to 1.

We said that the center of the distribution is still going

to be around the true population parameter, but we're going to

end up with a smaller tail to the left of the

distribution and a much longer tail to the right of the distribution.

This is because, for samples taken from this population where the true population proportion is 20%, we would expect the majority of them to have sample proportions close to 20%.

We'll still get some that are different from 20%, and we might get proportions all the way down to 0, or all the way up to 1.

But it's going to be much less likely to get a sample proportion of 100% in a random sample from a population where the true population proportion is 20%, than something like, let's say, 5 or 10%.

So, the tail to the left is short because we have the natural boundary at 0.

But the tail to the right is much longer, because the natural boundary on the higher end doesn't appear until 1, so that yields a right-skewed distribution.

Similarly, if we had a population where the true population proportion is 80%, we would see the opposite effect, and our sampling distribution would then be expected to be left-skewed.
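A small simulation sketch (hypothetical numbers: n = 15, so n times p is only 3 and the success-failure condition fails for p = 0.2) illustrates the right skew:

```python
import random

random.seed(1)

def sample_p_hats(p_true, n, n_samples=10000):
    """Simulate the sampling distribution of p-hat."""
    return [sum(1 for _ in range(n) if random.random() < p_true) / n
            for _ in range(n_samples)]

# n = 15, p = 0.2: n * p = 3 < 10, so the condition fails.
p_hats = sample_p_hats(0.2, 15)
mean = sum(p_hats) / len(p_hats)

# The center is still near 0.2, but the distribution is right-skewed:
# more room above the center (up to 1) than below it (down to 0).
print(round(mean, 2))
```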

This is if the success-failure condition is not met. If the success-failure condition is met, remember, that means the sample size is larger.

That's going to yield a smaller standard error.

So the curves are going to be much denser around the true population parameter, and they're going to look more and more symmetric as the sample size increases.