0:00

[MUSIC]

Suppose my iPod has 3,000 songs.

The histogram below shows a distribution of the lengths of these songs.

We also know that, for this iPod, the mean length

is 3.45 minutes and the standard deviation is 1.63 minutes.

Calculate the probability that a randomly

selected song lasts more than five minutes.

Here we're looking for the probability

of one randomly selected song lasting more than five minutes.

This is the same thing as asking, among the whole population of

songs on this iPod, what percentage last more than five minutes.

This should be a pretty simple question to calculate.

And lately, what we've been doing is to

calculate Z-scores, and use those to find probabilities.

If that's your instinct here, though, you should not follow it.

Because remember

that we can use Z-scores and the associated normal probabilities

only if the distribution we're working with is nearly normal.

And taking a look at the distribution of songs here, they certainly are not.

The distribution of the lengths of all of

these songs on the iPod is indeed right-skewed.

Does this make sense?

Well, a song can't be less than zero minutes,

so we have a natural boundary at the lower end.

And there's really no

upper end to how long your songs can be.

However, as you can imagine, there are going to be

fewer and fewer songs as the number of minutes increases.

That's what gives us the right-skewed distribution here.

So, we've confirmed that the population distribution

makes sense, but we've also said that

the methods that we've learned most recently

for calculating these probabilities don't apply here.

Does this mean we can't answer this question, though?

No.

We can actually use the histogram and the heights of the

bars to estimate what percentage of songs fall between, let's say

four and five minutes, five and six minutes, six and seven,

so on and so forth, and use those to calculate this probability.

2:05

So here we're interested in everything above five minutes.

This will require eyeballing the heights of these bars.

It looks like there are roughly 350 songs that last

between five and six minutes, 100 between six

and seven minutes, 25 between seven and eight minutes.

I'm kind of making these numbers up, but I'm making an educated guess here.

So your estimates might be slightly off, but should be within this range.

20 songs maybe between eight and nine minutes,

and five songs maybe between nine and ten minutes.

It seems like there are no songs on this iPod

that last more than ten minutes. Let's let X equal the length of one song.

We're using some additional notation here that

isn't absolutely necessary for this one question.

But having some notation in place will come in handy in a little bit.

2:57

Then the probability that X is greater than 5 is 350 plus 100 plus 25 plus

20 plus 5, divided by 3,000.

Which comes out to 500 over 3000, which is 0.17, approximately.

So the probability that a randomly selected song on

my iPod lasts more than five minutes is 0.17.

Another way of thinking about this is 17% of

the songs on my iPod last more than five minutes.
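If you wanted to check this arithmetic in code, a minimal Python sketch might look like the following. The counts are the same eyeballed estimates from the histogram, so they are rough guesses, and the variable names are made up just for illustration.

```python
# Rough bar heights read off the histogram (eyeballed estimates, not exact counts).
songs_over_five = {
    "5-6 min": 350,
    "6-7 min": 100,
    "7-8 min": 25,
    "8-9 min": 20,
    "9-10 min": 5,
}
total_songs = 3000  # number of songs on the iPod

# Add up the bars above five minutes and divide by the population size.
count_over_five = sum(songs_over_five.values())
p_over_five = count_over_five / total_songs

print(count_over_five)        # 500
print(round(p_over_five, 2))  # 0.17
```

If your own estimates of the bar heights differ a little, you can swap them in and the probability should still land close to 0.17.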

3:33

Now let's take a look at another question based on the same iPod.

I'm about to take a trip to visit my parents and the drive is six hours.

I make a random playlist of 100 songs.

What is the probability that my playlist lasts the entire drive?

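So, we know that six hours is 360 minutes.

With 100 songs on the playlist, the total length is 100 times the average song length, X bar,

so a total of 360 minutes corresponds to an average of 360 divided by 100, which is 3.6 minutes.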

4:31

So we want the average length to be greater than 3.6 minutes.

Remember, this is not the same thing as every single song being more than

3.6 minutes.

Because that would give me a very, very long playlist.

That would tell me that the minimum length of that playlist would be 360 minutes.

I just want the total to be greater than 360 minutes to last me the entire drive.

4:58

Now that we have introduced the X bar, the sample mean,

that should remind us that the central limit theorem might be helpful.

Because using

the central limit theorem, we can find

the distribution of the sample mean pretty easily.

The central limit theorem says that X bar will be distributed nearly

normally, with mean equal to the population mean, which is 3.45 minutes.

We were given this information in the previous slide.

And with standard error equal to the population standard

deviation, sigma, divided by the square root of n,

the sample size.

So, that is 1.63 divided by square root of 100, which comes out to 0.163.
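As a quick check on that standard error, here is a small Python sketch. It only uses the numbers given for this iPod, and the variable names are my own choice for illustration.

```python
import math

sigma = 1.63  # population standard deviation of song lengths, in minutes
n = 100       # number of songs in the random playlist

# Standard error of the sample mean: sigma divided by the square root of n.
standard_error = sigma / math.sqrt(n)

print(round(standard_error, 3))  # 0.163
```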

Now, we have a random variable, X bar, our sample mean.

We know its distribution, it's nearly normal.

We know its mean, the center, 3.45.

And we know something about its variability.

The standard error, which is basically the standard deviation of X bar, is 0.163.

And we're interested in some probability.

This combination, a normal

distribution whose parameters I know, and a

probability I'm looking for, should prompt us to

first draw a curve before we proceed.

So I'm drawing my curve, I'm setting the center at 3.45.

And remember, my observation

of interest is 3.6 minutes, and I'm looking

for everything above that.

Remember, drawing the curve is always your friend.

If you do this first, it's much less likely

that you would do something wrong in the following steps.

So next, we calculate the Z-score.

6:39

The Z-score is equal to 3.6, the observation, minus 3.45, the mean,

divided by 0.163, the standard error. And it comes out to be 0.92.
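The same Z-score calculation can be sketched in Python. This is just the arithmetic from above with illustrative variable names. The denominator is the standard error, since the observation here is a sample mean.

```python
import math

x_bar = 3.6   # observation of interest: average song length needed, in minutes
mu = 3.45     # population mean, in minutes
sigma = 1.63  # population standard deviation, in minutes
n = 100       # playlist size

# The observation is a sample mean, so its variability is the standard error.
standard_error = sigma / math.sqrt(n)
z = (x_bar - mu) / standard_error

print(round(z, 2))  # 0.92
```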

Note that we divide by the standard error, and

not by sigma, the standard deviation of the population.

Because the observation of interest, the 3.6, is a sample mean and not

an individual song, so not an individual observation.

7:04

We measure the variability of individual observations with standard deviations.

We measure the variability of sample means with standard errors.

So whatever the observation is that you plug into

the numerator of your Z-score, its variability belongs in the denominator.

In other words, our observation is an X bar, and not an X.

This is where we can see the notation from earlier come in handy.

We can now easily find the area, using many of the methods we've learned so far.

Using a table, using R, or the applet.
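The lecture mentions a table, R, and the applet. As one more option, here is a hedged Python sketch of the same upper-tail area using `statistics.NormalDist` from the standard library (Python 3.8 or later). Python is my substitution here, not one of the tools the course uses, and the variable names are illustrative.

```python
from statistics import NormalDist

# Route 1: standard normal, upper tail above the rounded Z-score of 0.92.
z = 0.92
upper_tail_z = 1 - NormalDist(mu=0, sigma=1).cdf(z)

# Route 2: work directly on the scale of X bar, with mean 3.45 and
# standard error 0.163, and look above 3.6 minutes.
upper_tail_xbar = 1 - NormalDist(mu=3.45, sigma=0.163).cdf(3.6)

print(round(upper_tail_z, 2))     # 0.18
print(round(upper_tail_xbar, 2))  # 0.18
```

Both routes give roughly an 18% chance that a random playlist of 100 songs lasts the entire six-hour drive.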

7:43

If I wanted to find this probability using the applet, I would choose the normal

distribution with mean 0 and standard deviation 1,

because remember, that is indeed the standard normal.

The distribution of Z-scores.

And I'm looking for an upper tail.

And my Z-score is 0.92. So I just need to slide my slider over

8:23

Here's one more question working with sampling distributions

and the central limit theorem for the mean.

We have four plots presented here.

And our task is to determine which plot, plot

A, B, or C, is which of the following.

For the first one, we already know which plot it is.

That's the distribution for a population where the

mean is 10 and the standard deviation is 7.

So that's the big plot, the right-skewed big plot at the bottom.

One of plots A, B, or C is a

single random sample of 100 observations from this population.

Another one is a distribution

of 100 sample means from random samples with size 7.

So, n equals 7.

And another one is a distribution of 100

sample means from random samples with size 49.

To

9:26

The distribution in plot C most closely resembles the normal.

Therefore, this must be the distribution

of 100 sample means from random samples with size 49.

Remember,

the central limit theorem tells us that sampling

distributions will be nearly normal when n is large.

So the largest n of the options

should yield the most normal looking distribution.

9:53

We can choose between the remaining two based on their shapes or spreads.

The plot that most resembles the parent distribution, the population

distribution, is the single random sample of

100 observations from this population.

Because, remember, both in a population and in one

sample, the observations are still

individual observations, so not sample means.

This appears to be plot B.

It's right-skewed, just like the parent population.

And it also has the largest spread.

Its range goes from 0 to 35 while the ranges of other plots are much narrower.

10:32

Then plot A must be the distribution of

100 sample means from random samples with size 7.

So what we've done here is we've used what we know about the central limit theorem,

and how sample sizes affect the shapes and

spreads of sampling distributions to make this assignment.
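To see this reasoning play out, here is a small simulation sketch in Python. One loud assumption: we don't know the actual population behind the plots, so I'm standing in a gamma distribution with mean 10 and standard deviation 7, which is right-skewed like the one shown. The seed, the helper name `draw`, and the variable names are all made up for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable

# ASSUMPTION: a gamma population with mean 10 and SD 7 stands in for the
# right-skewed population in the plots; the real population is not given.
POP_MEAN, POP_SD = 10, 7
shape = (POP_MEAN / POP_SD) ** 2   # gamma shape: mean^2 / variance
scale = POP_SD ** 2 / POP_MEAN     # gamma scale: variance / mean

def draw(n):
    """Draw one random sample of n individual observations."""
    return [random.gammavariate(shape, scale) for _ in range(n)]

single_sample = draw(100)                                    # 100 observations
means_n7 = [statistics.mean(draw(7)) for _ in range(100)]    # 100 means, n = 7
means_n49 = [statistics.mean(draw(49)) for _ in range(100)]  # 100 means, n = 49

sd_single = statistics.stdev(single_sample)
sd_n7 = statistics.stdev(means_n7)
sd_n49 = statistics.stdev(means_n49)

# The central limit theorem predicts spreads near 7, 7/sqrt(7), and 7/sqrt(49).
print("single sample SD:", round(sd_single, 2))
print("means, n = 7 SD: ", round(sd_n7, 2))
print("means, n = 49 SD:", round(sd_n49, 2))
```

The exact numbers depend on the seed, but the ordering should match the plots: the single sample has the widest spread, the means from samples of size 7 are narrower, and the means from samples of size 49 are the narrowest.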