0:01

In this video we will introduce you to the normal distribution and

discuss some of its properties, such as the 68–95–99.7% rule.

This notion's going to make sense in

a little bit when you see what we're talking about.

And we're also going to introduce standardized scores, commonly known

as Z-scores, and we're going to give examples of working with the Z-scores to

find probabilities and percentiles under the normal distribution curve.

Many variables in nature are nearly normally distributed.

A commonly used example is heights.

We're going to take a look at a distribution of recorded heights

of members of an online dating website, OkCupid.

0:41

Since members of this website are US residents and

likely represent a random sample from the US population, we

would expect their heights to follow the same distribution as the heights of all Americans.

However, a closer look shows that that's not exactly the case.

In this plot, the light purple curve shows the distribution of heights of US males.

The dotted line represents the distribution of heights reported by males

on OkCupid.

And the dark purple solid line is the implied

distribution of heights of these men, so the men on OkCupid.

We can see that heights reported by men on OkCupid very

nearly follow the expected normal distribution,

except the whole thing is shifted to the right of where it should be.

It appears that males on OkCupid add on average a couple inches to their heights.

8:18

Standardized scores are also useful for identifying unusual observations.

Usually, observations with absolute Z-scores above 2,

so those that are more than 2 standard deviations below or above the mean,

are considered to be unusual.

While we introduce Z-scores within the context of a normal distribution,

note that they're actually defined for distributions of any type.

After all, every distribution will have a mean and a standard deviation,

therefore for any observation, whatever distribution the random variable follows,

we could calculate a Z-score.

But we're going to talk about why we brought this up within the context

of normal distributions in a moment.
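
To make the Z-score idea concrete, here is a minimal R sketch. The data values are made up purely for illustration (they are not from the lecture) and are deliberately right-skewed, to show that a Z-score can be computed for a distribution of any shape and then used to flag observations with absolute Z-scores above 2.

```r
# Hypothetical, right-skewed data (illustration only, not from the lecture).
x <- c(2, 4, 4, 5, 5, 6, 6, 7, 30)

# Z-score: (observation - mean) / standard deviation.
z <- (x - mean(x)) / sd(x)

# Flag observations more than 2 standard deviations from the mean.
unusual <- x[abs(z) > 2]
print(unusual)
```

With these made-up values only the observation 30 is flagged. Note that the sketch standardizes with the sample mean and sample standard deviation, which is an assumption of the example.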

9:12

Percentile is the percentage of observations

that fall below a given data point.

Graphically it's the area below the probability distribution curve,

to the left of that observation.

So why is it that we can only use Z-scores to find percentiles under normal curves, but

not in a distribution of a different shape?

Well, we can always calculate percentiles for any sort of distribution,

but if the distribution does not follow this nice unimodal, symmetric normal shape,

you'd need to use calculus for that.

And for the purposes of this course, we're not going to be using calculus, so

therefore we're going to be sticking to normal distributions for

calculating percentiles or areas under the curve.

In this day and age, percentiles are easily calculated using computation.

For example, in R, the function pnorm gives the percentile of an observation,

given the mean and the standard deviation of the distribution.

So pnorm of negative 1, for a distribution with mean 0 and

standard deviation of 1, is about 0.1587.
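
As a rough sketch, the pnorm call just described might look like this in R. The variable name p is only for illustration.

```r
# Area under the standard normal curve to the left of -1.
p <- pnorm(-1, mean = 0, sd = 1)

# Rounded to four decimals, as a probability table would report it.
print(round(p, 4))
```

Since mean = 0 and sd = 1 are pnorm's defaults, pnorm(-1) on its own would give the same value.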

We can also obtain the same probability using a web applet, so

no need for access to R to use this one.

So let's go to the URL that's on the slide to the web applet and

do a live demo of how we would use the applet to calculate this percentile.

So to use the applet, the first thing we do is to select our distribution to be

normal.

We can change our mean as we desire,

but we're going to leave it at 0 since that's the distribution,

the standard normal distribution, we're working with for now.

We could also slide our standard deviation around but let's leave that at 1 for

now as well.

And we were interested in the area under the curve below the cutoff

value of negative 1, and we want to pick the lower tail here, and

once again we get to the same answer, 15.9%.

Lastly, we can also avoid computation altogether and

use a normal probability table.

We locate the Z-score on the edges of the table and

grab the associated percentile value given in the center of the table.

So, for a Z-score of negative 1 we look in the negative 1.0 row and

0.00 column for the second decimal and

arrive at the same answer, 0.1587 or roughly 15.9%.

Obviously, we don't have to keep using all methods here.

We've talked about three different methods: using R, using our web applet, or

using the table.

You're welcome to use whichever you like in your calculations.

While the computation approach is a little less archaic,

the tables are actually very useful for

getting a conceptual understanding of what we mean by area under the curve.

So I encourage you to use the computational approaches, R or the applet.

But for the time being, as you're learning this material,

also make sure that you get a chance to interact with the tables and

make sure that you sketch out your distributions.

And don't just rely on the numbers that the computer is spitting out at you but

make sure that you confirm them by hand as well.

Let's take a look at a quick example.

We know that SAT scores are distributed normally with mean 1,500 and

standard deviation 300.

We also know that Pam earned an 1,800 on her SAT and

we want to find out what her percentile score is.

As soon as we find out that the distribution is normal, the first thing to do is to

always draw the curve, mark the mean, and shade the area of interest.

Here we have a normal distribution with mean 1,500, and

to find the percentile score associated with an SAT score of 1,800,

we shade the area under the curve below 1,800.

We can do this using R and the pnorm function.

So here, the first argument is the observation of interest.

The second argument is the mean.

And the third argument is the standard deviation,

which spits out an associated percentile of 0.8413,

meaning that Pam scored better than 84.13% of the SAT takers.
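
The call described here could be written along these lines in R. This is a sketch, with pam_percentile as an illustrative variable name rather than anything shown on the slide.

```r
# Arguments in order: observation of interest, mean, standard deviation.
pam_percentile <- pnorm(1800, mean = 1500, sd = 300)

# Proportion of SAT takers scoring below 1,800.
print(round(pam_percentile, 4))
```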

13:25

First, we calculate the Z-score, as the observation, 1,800 minus the mean,

1,500, divided by the standard deviation, 300. The Z-score is 1.

Remember, we actually saw this before.

Then in the table, we look for the Z-score of 1: the row is 1.0, and the column is 0.00.

And we get the same probability, 0.8413, as the probability of

obtaining a Z-score less than 1, which basically means the same thing:

the shaded area under the curve below 1,800 is 0.8413.
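
The Z-score route can be checked in R as well. This is a small sketch using the SAT numbers from the example, with the standard normal lookup done by pnorm in place of the printed table. The variable names are illustrative.

```r
# Standardize Pam's score: (observation - mean) / standard deviation.
z <- (1800 - 1500) / 300

# Standard normal lookup, playing the role of the table.
p_from_z <- pnorm(z)

# Direct lookup on the original SAT scale, for comparison.
p_direct <- pnorm(1800, mean = 1500, sd = 300)

print(c(z = z, p_from_z = p_from_z, p_direct = p_direct))
```

Both lookups give the same area, which is the point of standardizing: any normal distribution can be read off the one standard normal table.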

14:09

Note that both the table and the pnorm function

always yield the area under the curve below the given observation.

If we actually wanted to find out the area above the observation,

we would simply need to take the complement of this value,

since the total area under the curve is always 1.

So Pam scored worse than 1 - 0.8413, which

amounts to 15.87% of the test takers.
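
In R, the complement can be sketched in two ways. The second uses pnorm's lower.tail argument, which the lecture does not mention but which is part of the standard pnorm interface. Variable names are illustrative.

```r
# Area above 1,800 as the complement of the area below it.
above_complement <- 1 - pnorm(1800, mean = 1500, sd = 300)

# Same area, asking pnorm for the upper tail directly.
above_upper_tail <- pnorm(1800, mean = 1500, sd = 300, lower.tail = FALSE)

print(round(c(above_complement, above_upper_tail), 4))
```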

We can also use the same properties of the standard normal distribution,

in other words the distribution of Z-scores, to find cutoff values

corresponding to a desired percentile.

Here's an example illustrating this:

a friend of yours tells you that she scored in the top 10% on the SAT.

What is the lowest possible score she could have gotten?

Remember, SAT scores are normally distributed with mean 1,500 and

standard deviation 300.

We're looking for the cutoff value for the top 10% of the distribution.

This is a different problem than the one we worked on earlier,

as this time we don't know the value of the observation of interest.

But we do know, or at least we can get, its percentile score.

Since the total area under the curve is 1, the percentile score associated

with the cutoff value for the top 10% is 1 minus 0.10, or 0.90.

Remember that the formula for the Z-score is (observation - mean) / standard deviation.

And we know the mean, we know the standard deviation.

16:20

We know that this number, 1.28, is equal to the unknown observation,

which we're calling X here, minus the mean, divided by the standard deviation.

A little bit of algebra, multiplying both sides by 300 and

adding 1,500, and we find that the cutoff value is 1,884.

So the cutoff value for the top 10%, or the bottom 90%,

of the distribution of SAT scores is 1,884.

In other words, if you have scored above 1,884,

you know that you're in the top 10% of the distribution.
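
The algebra here amounts to rearranging the Z-score formula to X = Z × standard deviation + mean. As a small sketch in R, taking the Z-score of 1.28 as given and using illustrative variable names:

```r
# Z-score for the 90th percentile, as used in the lecture.
z <- 1.28

# Solve z = (x - mean) / sd for x: multiply by sd, then add the mean.
cutoff <- z * 300 + 1500

print(cutoff)
```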

We could also do this using R, and we're going to use the qnorm function this time.

So pnorm for probabilities, qnorm for quantiles or cutoff values,

which takes the percentile as the first input, the mean and the standard

deviation as the second and the third, just like the function we saw earlier.

And the result is the same with either approach, 1,884.
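
A sketch of the qnorm call described here, with illustrative variable names:

```r
# Arguments in order: percentile (as a proportion), mean, standard deviation.
cutoff_q <- qnorm(0.90, mean = 1500, sd = 300)

# The standard normal quantile behind it, which a table rounds to 1.28.
z_q <- qnorm(0.90)

print(round(c(cutoff = cutoff_q, z = z_q), 2))
```

Because qnorm does not round the Z-score to two decimals, its result is a fraction of a point above the table-based 1,884. Rounded to the nearest whole score, the two agree.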
