0:00

In this video, we will discuss the shapes of binomial distributions and take a look at how they change as we tweak some of their parameters, such as the number of trials or the probability of success. We will also talk about the fact that when the number of trials increases, the shape of the binomial starts looking closer and closer to a normal distribution. For such situations, we're going to use the methods we've learned for calculating normal probabilities to approximate binomial probabilities.

Say we have a binomial random variable with probability of success 0.25. This is what the distribution looks like when n is equal to 10. Let's pause for a moment and carefully examine what we're seeing here. Each bar represents a potential outcome. With ten trials, the number of successes could range anywhere from 0 to 10, and therefore we have 11 bars here. The heights of the bars represent the probabilities of these outcomes. For example, the probability of zero successes can be calculated as 0.75, the probability of failure, raised to the 10th power, since zero successes basically means ten failures. This value comes out to be approximately 0.056, which is the height of this bar.

With n equals 10 and p equals 0.25, the expected number of successes is 2.5, and hence the distribution is centered around this value. So the binomial distribution with p equals 0.25 and n equals 10 is right skewed.
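As a rough sketch of how the bar heights described above could be computed, here is a small Python example. The video itself works in R, so the choice of Python and the helper name `binom_pmf` are assumptions made only for illustration.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), using the binomial formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.25

# One bar per possible number of successes: k = 0, 1, ..., 10.
heights = [binom_pmf(k, n, p) for k in range(n + 1)]

p_zero = heights[0]   # zero successes means ten failures: 0.75 ** 10
mean = n * p          # expected number of successes

print(len(heights), round(p_zero, 3), mean)
```

The list has 11 entries, the first entry matches 0.75 raised to the 10th power, and the mean is n times p.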

Let's increase the sample size a bit, keeping p constant at 0.25. With n equals 20, we see a change in the center of the distribution, which is expected since n times p is now different. But we also see a change in the shape. The distribution, while still right skewed, is looking much less skewed. Increasing the sample size further to 50, the distribution looks even more symmetric and much smoother, and increasing the sample size even further to 100, the distribution looks nearly indistinguishable from the normal distribution.
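One way to put a number on "less skewed" is the standard skewness coefficient of a binomial distribution, (1 - 2p) divided by the square root of n times p times (1 - p). The video does not mention this formula, so treat the sketch below, including the helper name `binom_skewness`, as an illustrative assumption rather than part of the lecture.

```python
from math import sqrt

def binom_skewness(n: int, p: float) -> float:
    """Skewness of Binomial(n, p): (1 - 2p) / sqrt(n * p * (1 - p))."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))

p = 0.25
# Same sample sizes as in the plots: 10, 20, 50, 100.
skews = {n: binom_skewness(n, p) for n in (10, 20, 50, 100)}

for n, s in skews.items():
    print(n, round(s, 3))
```

The values stay positive (right skewed) but shrink toward zero as n grows, which is consistent with the plots looking more and more symmetric.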

So let's take a look at why this might be of interest, within the context of data from a study on Facebook usage.

2:20

A recent study found that Facebook users get more than they give. For example, 40 percent of Facebook users in our sample made a friend request, but 63 percent received at least one request. Users in the sample pressed the like button next to friends' content an average of 14 times, but had their content liked an average of 20 times. Users sent nine personal messages on average but received 12. 12% of users tagged a friend in a photo, but 35% were themselves tagged in a photo.

2:55

So what explains this phenomenon? The answer is power users, those who contribute much more content than the typical user. I'm sure you all have a few friends like that, who are so much more active than everyone else on your friend list. Some of the other findings from the study are that 25% of Facebook users are considered power users, so these are the ones that give more than they get, and that the average Facebook user has 245 friends. We're looking for the probability that an average Facebook user with 245 friends has 70 or more friends who are power users.

3:36

So what do we have here? 25% are considered power users, which means that the probability of success is 0.25. And the average Facebook user has 245 friends, meaning that n is equal to 245. The probability we're interested in is that of 70 or more power user friends, which translates to the number of successes being greater than or equal to 70.

4:03

We have n equals 245 trials, a fixed number. Each trial outcome can be classified as a success or a failure, power user or not power user. The probability of success is the same for each trial, 25%. And we're going to assume that the trials are independent. They might not be in reality, since if you're the type of person to have some friends who are power users, the others might be more likely to be power users as well. But again, we're going to assume independence for the sake of this example. This is what the binomial distribution with n equal to 245 and p equal to 0.25 looks like. And we're interested in the probability of 70 or more successes, meaning 70 or more power user friends among 245. What does that mean? That's 70, or 71, or 72, all the way up to 245.

5:00

So what we're interested in is the sum of the probabilities of each one of these outcomes, 70 through 245. We can calculate each one of these probabilities using the binomial formula and add them up, but that really does not sound like fun. This is where the resemblance between the binomial distribution and the normal distribution comes in very handy. The blue-shaded area of interest can be approximated by the area under the smooth normal curve that closely resembles the more jagged binomial distribution. Because calculating a shaded area under the normal curve is a much simpler task than calculating individual binomial probabilities for all of these outcomes and adding them up, we might want to use that method.

To calculate a normal probability, we need a little more information on the parameters of the normal distribution. These are taken to be the mean and the standard deviation of the original binomial distribution. The mean is n times p, so that's 245 times 0.25, 61.25, and the standard deviation is the square root of 245 times 0.25 times 0.75, which comes out to be 6.78. So among 245 friends, we expect 61.25 power users, give or take 6.78.
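The arithmetic for these two parameters can be checked with a few lines of Python. This is only a sketch of the calculation just described; the variable names `mu` and `sigma` are chosen here for illustration.

```python
from math import sqrt

n, p = 245, 0.25

mu = n * p                      # mean of the binomial: n * p
sigma = sqrt(n * p * (1 - p))   # standard deviation: sqrt(n * p * (1 - p))

print(mu, round(sigma, 2))
```

These are the values used as the mean and standard deviation of the approximating normal distribution.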

Given an observation, the mean, and the standard deviation, we can calculate the area under the curve via a z score. So the z score is going to be the observation, 70, minus the mean, 61.25, divided by the standard deviation, 6.78, which comes out to be 1.29. We can then find the probability of a z score being greater than 1.29, since we shaded the area underneath the curve beyond the observation of interest. So we want to look up 1.29 as a z score on our table, and at the intersection of the row and the column of interest, we can see 0.9015. The probability of obtaining a z score greater than 1.29 is going to be one minus that probability from the table. Why are we doing this one minus bit? Well, because the table always gives us the percentile, or the area under the curve below the observed value, and we want to find the complement of that, which comes out to be 0.0985. So there is a 9.85% chance that an average Facebook user with 245 friends has at least 70 friends who are considered power users.
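The table lookup can be mimicked in Python with the standard library's `statistics.NormalDist`. This is a sketch under the assumption that we round the z score to two decimals first, as one would when reading a printed table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, p, observed = 245, 0.25, 70

mu = n * p
sigma = sqrt(n * p * (1 - p))

# Round to two decimals to mirror a z table lookup.
z = round((observed - mu) / sigma, 2)

below = NormalDist().cdf(z)   # area under the standard normal curve below z
upper_tail = 1 - below        # complement: area above z

print(z, round(below, 4), round(upper_tail, 4))
```

Skipping the rounding step gives a slightly different tail area, since the unrounded z score is a little above 1.29.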

7:47

We can also directly calculate this probability using R and the dbinom function we've seen before. The first argument in the function is the number of successes, and we're interested in everything between 70 and 245. The second argument is the total sample size, 245, and

8:06

the third is the probability of success for each trial. So what this function here is doing is actually two things. First, it calculates the probabilities for each outcome, 70, 71, 72, all the way up to 245, and then we wrap the sum function around that, so we're adding all of that up. And the probability comes out to be 0.113, or 11.3%, versus the 0.0985 we found before.
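The transcript does not show the R code itself; the call being described is presumably along the lines of `sum(dbinom(70:245, size = 245, prob = 0.25))`. As a hedged equivalent, here is the same sum written in Python with only the standard library. The helper `binom_pmf` is a hypothetical stand-in for `dbinom`, not something from the video.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p); plays the role of R's dbinom here."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 245, 0.25

# Step 1: probability of each outcome 70, 71, ..., 245.
probs = [binom_pmf(k, n, p) for k in range(70, n + 1)]

# Step 2: add them all up.
exact = sum(probs)

print(round(exact, 3))
```

This mirrors the two steps in the narration: compute each individual probability, then sum them.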

Why are these values ever so slightly different? On one hand, it makes sense. We called the approach the normal approximation to the binomial after all, so it's just an approximation and not an exact result. On the other hand, if we need the exact probability, the difference may be frustrating. Let's take a closer look at the binomial distribution and the normal approximation to it.

9:01

We can see that the red normal curve is slightly different than the bars representing the exact binomial probabilities. It falls a little bit short. Also, under the continuous normal distribution, the probability of exactly 70 successes is zero, so the shaded area above 70 doesn't fully include the probability of 70 successes. A common fix to this problem is a 0.5 adjustment to the observation of interest. So we calculate the z score using 69.5 as opposed to 70, which yields an adjusted z score of 1.22. Everything else about the method stays the same. And the result we get, and you can confirm this using a table or a computation, is now much closer to the exact result from the binomial distribution, 0.1112 versus 0.113.
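To confirm the adjusted result by computation, the earlier normal calculation can be repeated with 69.5 in place of 70. As before, this is a Python sketch with illustrative variable names, and the z score is rounded to two decimals to match a table lookup.

```python
from math import sqrt
from statistics import NormalDist

n, p = 245, 0.25
mu = n * p
sigma = sqrt(n * p * (1 - p))

# 0.5 adjustment: use 69.5 instead of 70 so the bar for 70 is included.
adjusted_observation = 70 - 0.5
z_adj = round((adjusted_observation - mu) / sigma, 2)

upper_tail_adj = 1 - NormalDist().cdf(z_adj)

print(z_adj, round(upper_tail_adj, 4))
```

The adjusted tail area sits much closer to the exact binomial value of about 0.113 than the unadjusted 0.0985 does.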

One other method for calculating binomial probabilities is using an applet. So let's go to this website where the applet can be found and take a look to see how we can calculate this probability.

10:13

We're working with a binomial distribution, so that's the distribution that we're going to pick. Our number of trials here is 245, so we're going to slide n across to 245. Our probability of success is 0.25, so we're going to slide p to 0.25. We're looking for the area above 70, so let's take our cutoff value to 70. And remember that we're looking for the upper tail, and we're looking for greater than or equal to, so we want to pick our bound to be that as well. Once again we can see that same probability, an 11.3% chance of having 70 or more power user friends among a sample of 245 friends.

11:04

In the example we just presented, we plotted the binomial distribution using computation and visually confirmed that it looked unimodal and symmetric, roughly similar to a normal distribution. But what if we couldn't plot the binomial distribution? What are some guidelines that we can use to determine whether the sample size, or the number of trials, is large enough that we can be confident in approximating the binomial distribution using the normal? In other words, how can we tell if the shape of the binomial distribution is going to be unimodal and symmetric, and closely follow the normal distribution?

11:42

The rule of thumb is the success-failure condition, which says that a binomial distribution with at least 10 expected successes and 10 expected failures closely follows a normal distribution. So n times p needs to be greater than or equal to 10, and n times 1 minus p needs to be greater than or equal to 10. In cases where it does, we can approximate the binomial distribution with the normal, where the parameters of the normal distribution are calculated as the mean and standard deviation of the binomial.
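The success-failure condition is easy to express as a small check. The function name `success_failure_ok` and the `threshold` parameter below are hypothetical, introduced only to illustrate the rule of thumb as stated.

```python
def success_failure_ok(n: int, p: float, threshold: float = 10) -> bool:
    """True if both expected successes and expected failures reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

# The Facebook example (n = 245) versus the first plot (n = 10), both with p = 0.25.
print(success_failure_ok(245, 0.25), success_failure_ok(10, 0.25))
```

The Facebook example passes the check, while the n equals 10 case, which looked clearly right skewed, does not.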

We also talked about the 0.5 adjustment, which makes the probabilities calculated using the normal approximation much closer to the exact probabilities from the binomial distribution. But I encourage you not to focus on those details a whole lot, and instead try to focus on the bigger picture. Remember that the binomial distribution with a sufficient sample size starts to look nearly normal. This is important, and we're emphasizing it here, because later on, when we get to doing inference for categorical variables with two outcomes, which are kind of like Bernoulli outcomes that follow a binomial distribution, we're going to make use of the fact that the distributions start to look nearly normal, and we're going to apply methods that are based on the normal distribution to do inference for these variables.

Let's do a quick practice problem. What is the minimum n, or sample size, required for a binomial distribution with probability of success equal to 0.25 to closely follow a normal distribution? We know that n times p needs to be greater than or equal to 10, and n times 1 minus p needs to be greater than or equal to 10 as well. So we want to solve both of these inequalities for n, and then take the maximum of the two, since that's going to be the minimum required sample size. Well, for n times 0.25 to be greater than or equal to 10, n needs to be greater than or equal to 40. For n times 0.75 to be greater than or equal to 10, n needs to be greater than or equal to 13.33. So the answer is, we need at least 40 observations for a binomial distribution with p equals 0.25 to closely follow a normal distribution.
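The practice problem can also be worked through in code. The sketch below solves each inequality for n, takes the larger bound, and rounds up to a whole number of trials; the function name `min_sample_size` is hypothetical, and for some values of p floating-point division could nudge the result by one, so treat it as an illustration rather than a robust tool.

```python
from math import ceil

def min_sample_size(p: float, threshold: float = 10) -> int:
    """Smallest n with n * p >= threshold and n * (1 - p) >= threshold."""
    n_for_successes = threshold / p        # from n * p >= threshold
    n_for_failures = threshold / (1 - p)   # from n * (1 - p) >= threshold
    # Both conditions must hold, so the larger bound is the binding one.
    return ceil(max(n_for_successes, n_for_failures))

print(min_sample_size(0.25))
```

For p equals 0.25 the binding condition is the one on expected successes, which is why the answer is driven by 10 divided by 0.25.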
