In this video, we will discuss a census versus a sample, sources of bias in studies, and a few sampling methods. Previously, we mentioned taking a sample from the population, but one might ask, wouldn't it be better to just include everyone and sample the entire population, in other words, conduct a census? As you can imagine, conducting a census takes lots of resources, but there are other reasons why this might not be a good idea. First, some individuals may be hard to locate or hard to measure, and these people may be different from the rest of the population. For example, in the US Census, illegal immigrants are often not recorded properly, since they tend to be reluctant to fill out census forms out of concern that this information could be shared with Immigration. However, these individuals might possess characteristics different from the rest of the population, and hence not getting information from them might result in very unreliable data from geographical regions with high concentrations of illegal immigrants. Another reason why censuses aren't always a good idea is that populations rarely stand still. Even if you could take a census, the population changes constantly, so it's never really possible to get a perfect measure. If you think about it, sampling is actually quite natural. Think about something you're cooking. We taste, in other words, we examine, a small part of what we're cooking to get an idea about the dish as a whole. We would never eat a whole pot of soup just to check its taste, after all. When you taste a spoonful of soup and decide that the spoonful you tasted isn't salty enough, what you're doing is simply exploratory analysis on the sample at hand. If you then generalize and conclude that your entire pot needs salt, that's making an inference. For your inference to be valid, the spoonful you tasted, your sample, needs to be representative of your entire pot, your population. 
If your spoonful comes only from the surface, and the salt has collected at the bottom of the pot, what you tasted is probably not going to be representative of the whole pot. On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will be more likely to be representative of the whole pot. Let's review a few sources of sampling bias. Convenience sample bias occurs when individuals who are easily accessible are more likely to be included in the sample. For example, say you want to find out how people in your city feel about a recent increase in public transportation costs. If you only poll people in your neighborhood, as opposed to a representative sample from the whole city, your study would suffer from convenience bias. Another sampling bias is called non-response. This happens if only a non-random fraction of the randomly sampled people respond to a survey, such that the sample is no longer representative of the population. For example, say you take a random sample of individuals from your city and attempt to survey them, but certain segments of the population, say those from a lower socioeconomic status, are less likely to respond to the survey. A similar sampling bias is called voluntary response bias, which occurs when the sample consists only of people who volunteer to respond because they have strong opinions on the issue. For example, say you place polling machines at all bus stops and metro stations in your city, but only those who choose to do so actually take the time to vote and express their opinion on the recent increase in public transportation costs. Voluntary response bias clearly exists in online polls like this one from CNN from August 2013, which asked whether the West should intervene in Syria. 
The people who responded to this poll definitely do not make up a representative sample of the world population, since these are people who happened to have visited cnn.com the day the poll was posted and felt strongly enough to vote. Indeed, the poll results say that this is not a scientific poll for this very reason. To recap, the difference between voluntary response bias and non-response bias is that in non-response there is a random sample that is surveyed, but the people who choose to respond are not representative of the sample, while in voluntary response there is no initial random sample. Let's examine a historical example of a biased sample yielding misleading results. In 1936, Landon sought the Republican presidential nomination, opposing the reelection of Franklin Delano Roosevelt, commonly referred to as FDR. A popular magazine of the time, the Literary Digest, polled about 10 million Americans and got responses from about 2.4 million. To put things in perspective, nowadays reliable polls in the US routinely poll about 1,500 people, so this was a huge sample. The poll showed that Landon would likely be the overwhelming winner, and FDR would only get 43% of the votes. In reality, FDR won the election with 62% of the votes. The magazine was completely discredited because of the poll and was soon discontinued. So, if you have never heard of this magazine, this might be the reason why. But what went wrong? The magazine had surveyed its own readers, registered automobile owners, and registered telephone users. These groups had incomes well above the national average of the day. Remember, this was the Great Depression era. This resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time. In other words, the sample was not representative of the American population at the time. 
While The Literary Digest election poll was based on a sample size of 2.4 million, a huge sample, since the sample was biased, it did not yield an accurate prediction. Going back to the soup analogy, if the soup is not well stirred, it doesn't matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to taste the soup. Now that we have a good idea of why we might want to sample, and why it's important for our sample to be representative of the population, let's discuss some sampling methods, namely simple random sampling, stratified sampling, cluster sampling, and multistage sampling. In simple random sampling, we randomly select cases from the population, such that each case is equally likely to be selected. This is similar to randomly drawing names from a hat. In stratified sampling, we first divide the population into homogeneous groups called strata, and then randomly sample from within each stratum. For example, if we wanted to make sure both genders are equally represented in a study, we might divide the population first into males and females, and then randomly sample from within each group. In cluster sampling, we divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters. The clusters, unlike the strata in stratified sampling, are heterogeneous within themselves, and each cluster is similar to the others, such that we can get away with just sampling from a few of the clusters. Lastly, multistage sampling adds another step to cluster sampling. Just like in cluster sampling, we divide the population into clusters and randomly sample a few clusters, but then we randomly sample observations from within these clusters instead of taking all of them. Usually, we use cluster sampling and multistage sampling for economic reasons. 
For example, one might divide a city into geographic regions that are on average similar to each other, randomly sample a few of these regions, travel to the randomly picked regions, and then sample a few people from within each of these regions. This avoids the need to travel to all of the regions in the city.
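To make the four sampling methods concrete, here is a minimal Python sketch on a small made-up population. The `region` and `gender` fields, the group labels, and all of the sample sizes are illustrative assumptions, not details from the video; the point is only how each method selects cases.

```python
import random

random.seed(42)  # for reproducibility of this sketch

# Hypothetical population: 6 regions (our clusters), 10 people each,
# with gender recorded (our strata). 60 people in total.
population = [
    {"id": f"{region}-{i}", "region": region, "gender": g}
    for region in ["A", "B", "C", "D", "E", "F"]
    for i, g in enumerate(["male", "female"] * 5)
]

# Simple random sampling: every case is equally likely to be selected,
# like drawing names from a hat.
srs = random.sample(population, k=12)

# Stratified sampling: divide into homogeneous strata (here, gender),
# then randomly sample from within each stratum.
strata = {}
for person in population:
    strata.setdefault(person["gender"], []).append(person)
stratified = [p for group in strata.values() for p in random.sample(group, k=6)]

# Cluster sampling: randomly pick a few clusters (regions), then keep
# every observation within the chosen clusters.
chosen_regions = random.sample(["A", "B", "C", "D", "E", "F"], k=2)
cluster = [p for p in population if p["region"] in chosen_regions]

# Multistage sampling: pick a few clusters as above, but then randomly
# sample within each chosen cluster instead of taking everyone.
multistage = [
    p
    for region in chosen_regions
    for p in random.sample([q for q in population if q["region"] == region], k=4)
]

print(len(srs), len(stratified), len(cluster), len(multistage))  # → 12 12 20 8
```

Note how the sample sizes fall out of each design: stratified sampling guarantees 6 males and 6 females, cluster sampling keeps all 10 people in each of the 2 chosen regions, and multistage sampling trims that down to 4 per chosen region, which is exactly the money-saving step described above.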