So far, we have learned how to summarize data and learn about distributions. We have even talked about having data from a sample. Now it is time to talk about how to produce the data, which is all about taking samples. Have you ever noticed that sometimes when you're listening to news programs or reading an article, you hear terms such as scientific study or scientific poll? Which, by the way, implies that you can have data that has been produced in a non-scientific way. Basically, when we take a sample, we are using one of two kinds of methods. The first is one of the non-probability methods, and the second is one of the probability methods. So, let's explore these methods in more detail, starting with non-probability methods.
Volunteer sampling is an example of a non-probability method. In this type of sample, individuals have selected themselves to be included. If you leave a review for an item you have purchased, then you are volunteering to be a part of the customer base who leave comments. In the field of statistics, this type of data is considered non-scientific. Why? What is the problem with this? The biggest problem with data gathered this way is what is known as bias. Such a sample is almost guaranteed to be biased, because most often it includes mostly data on individuals who have a particularly strong opinion about an issue and are looking for a way to voice this opinion. As a result, data obtained from a voluntary response sample is quite useless when you think about the big picture, since the sampled individuals only provide information about themselves. Thus, we can't generalize to any larger group at all.
However, you should be aware that in some cases, this is the only way we can obtain a sample. Take clinical trials for medical treatments, as an example. It is unethical to force an individual to participate in a medical study, so doctors look for individuals who would volunteer.
However, in this setting, we use this type of volunteer sample so that we can compare the treatment given to this group with other groups receiving different types of treatment. As it turns out, when comparing several treatments to each other, volunteer sampling is not as problematic. We will discuss this in more detail in the second course in the series.
Sometimes, the data is gathered by simply being at the right place at the right time. We refer to this as convenience sampling. For example, news outlets often stand outside polling stations and ask voters who are leaving whom they voted for. Now, each individual may choose to ignore them or stay back and answer. So in a way, there is volunteerism here as well, and thus it could result in bias. However, based on the variable being collected, it may be that this type of sample could provide a fairly representative group.
So, bias is the most problematic issue with a sample. Let me share with you an example from the 1936 American presidential election. The two opposing candidates were Franklin D. Roosevelt and Alfred Landon. This was right after the Great Depression, and economic issues such as unemployment and government spending were the dominant theme of the campaign. Roosevelt was in favor of continued government spending on public works to increase employment, and his opponent, Landon, was more in favor of limiting it. By and large, Landon was preferred by the more affluent voters rather than the working class. One of the most respected magazines of the time, the Literary Digest, known for predicting elections accurately, predicted defeat for Franklin D. Roosevelt. They were wrong. Roosevelt won in a landslide. Only two states, the two red ones, were won by Landon. So, what went wrong? Well, it's about how the sampling was done. They selected their sample from a list of telephone numbers. In the 1930s, a telephone was a luxury which only the rich could afford.
Therefore, their sample was biased and didn't represent the population of voters accurately. Since the people who were polled were rich, it gave the impression that Landon would be elected. But in fact, the exact opposite was true.
To do data analysis correctly, we must start off with good, unbiased, representative samples. The best way to get a representative sample is to pick members of the population at random. This makes it more likely that the sample is representative of the whole population. It produces a sample whose averages resemble those in the population, and it enables us to infer characteristics of the population from the sample. When you select the sample the right way, you don't need to worry about how large the actual population is. That is, larger populations do not require larger samples. We will learn more about this later on, but this is a great thing. This means that if you want to know something about the Chinese consumer, you don't need to take a larger sample than a study looking at the American consumer just because China has a larger population than the US.
So now let's look at a few sampling techniques that make this more likely, resulting in a scientific form of study. These sampling methods all use probability methods, which give us results that can be generalized to the population from which the sample came. One such method is known as simple random sampling, and this is the simplest probability sampling plan. As the name suggests, it is the equivalent of selecting names out of a hat. Each individual has the same chance of being selected. This is one way of ensuring that bias is not present in the selection process. The only way bias comes into play here is through the response of the selected members. Have you ever received a phone call or a mailing asking you to answer a few questions? If you ignore the request, you're classified as a non-response.
So we can still get some bias if a large group chooses not to respond. Researchers spend a lot of effort on improving response rates to reduce this bias.
Another probability sampling method is known as stratified sampling. This technique is used when our population is naturally divided into subgroups, which we call strata. Imagine that we want to conduct a poll and we would like to make sure that our sample has a fair representation of both genders, males and females, which means we have two strata based on gender. In stratified sampling, we choose a simple random sample from each stratum. So in our study, we will select 200 females from the female stratum and 200 males from the male stratum. Our actual sample then consists of all these simple random samples put together, which would be 400 people in total.
Cluster sampling is another probability sampling method. In this method, we divide the population into groups called clusters. However, with cluster sampling, each cluster is a small-scale version of the population. Therefore, it is expected to behave like the population. After creating the clusters, we first select a few of these clusters, and then we either select all individuals from the chosen clusters to be sampled or choose some members of those clusters at random. Cluster sampling is appropriate for populations that are spread out over a large geographical area, or spread in such a manner that there are differences within each area with respect to the variable of interest. For example, consider surveying income by neighborhood. Researchers may break up a large city into 15 clusters based on some characteristics and then select 8 of the 15 clusters to pull the sample from. The biggest difference between cluster sampling and stratified sampling is that in stratified sampling, we draw a random sample from every stratum, while in cluster sampling, we only pull from some of the clusters.
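
The three probability methods described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1,000 people, the field names `gender` and `neighborhood`, and the helper function names are all made up for the example and are not part of the lecture. The sample sizes (200 per gender stratum, 8 of 15 clusters) follow the numbers mentioned above.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 people, each with a gender and one of 15 neighborhoods.
population = [
    {"id": i, "gender": "F" if i % 2 == 0 else "M", "neighborhood": i % 15}
    for i in range(1000)
]

def group_by(units, key):
    """Split the units into groups (strata or clusters) by the given field."""
    groups = {}
    for unit in units:
        groups.setdefault(unit[key], []).append(unit)
    return groups

def simple_random_sample(units, n, rng):
    """Names out of a hat: every unit has the same chance of being selected."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum, rng):
    """Draw a simple random sample from EVERY stratum and combine them."""
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters, rng):
    """Pick only SOME clusters at random, then take everyone in those clusters."""
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]

srs = simple_random_sample(population, 400, rng)
strat = stratified_sample(population, "gender", 200, rng)   # 200 F + 200 M
clus = cluster_sample(population, "neighborhood", 8, rng)   # 8 of 15 clusters

print(len(srs), len(strat), len(clus))
```

Note how the stratified sample touches both strata, while the cluster sample leaves seven neighborhoods out entirely. That is the difference between the two methods pointed out above.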
Both cluster sampling and stratified sampling require a little work before we can start drawing a random sample. This is the time taken to decide how to divide the population up. But in the end, we can draw a smaller sample which will still be a good representative sample, as compared to doing just a simple random sample.
Let's see if you can determine which type of sampling we have used for the following study. You have a network of fast food restaurants, a total of ten. You want to know if the customers are satisfied with the quality of the food and service. Classify the following sampling techniques, which can be used to do the data collection. One: every customer receives a phone number with their bill and is encouraged to call with their comments. Two: we choose three of the restaurants, contact all customers who visited these three locations within the past week, and survey them. Three: we choose 50 customers at random from each of the restaurants and survey them. Or four: we select 500 customers at random from a database of past customers and survey them.
Every customer receives a phone number with their bill and is encouraged to call with their comments. This would be an example of volunteer sampling, and it is a non-probability method. Here you may get a disproportionate number of customers who are very satisfied calling you, or vice versa, giving you the wrong impression. Sometimes it's possible that dissatisfied customers will not contact you and will just not visit again, not providing the much-needed feedback that you're doing something wrong and a chance to improve.
Choosing three of the restaurants, contacting all customers who visited these three locations within the past week, and surveying them is an example of cluster sampling. The entire customer base is divided into groups by the restaurant they visited. We chose three clusters, and then all customers who visited these three restaurants in the past week were contacted with the survey.
Choosing 50 customers at random from each of the restaurants and surveying them is an example of stratified sampling. There are ten restaurants and therefore ten strata, and we have surveyed customers from each. This way, we are sure not to leave out any restaurant that may have the worst service record or the best service record. We get representation from all ten locations.
Selecting 500 customers at random from a database of past customers and surveying them is an example of simple random sampling, where which restaurant the customer has visited has no bearing on how they are selected for our sample.
So far, we have made no mention of sample size. Our first priority is to make sure that we have a sample which is representative of the population, by using some form of probability sampling plan. Next, we must keep in mind that in order to get a more precise idea about our population, a larger sample does a better job than a smaller one. But there's always a cost involved in collecting a larger sample. Based on a precision requirement, we can calculate the size of the sample needed. We will discuss the issue of sample size in more detail later on in this course, as well as in the second course of this series.
In this lesson, we learned various techniques by which one can choose samples of individuals or data points from an entire population. This is a seemingly simple step, but in reality it is an extremely crucial one. Without a good sample, good analysis is simply not possible. Generally speaking, probability sampling methods will result in an unbiased sample, which can be safely used to generalize our findings. However, there are times when our only choice is to use a non-probability sampling technique. It is important, though, when these techniques are used, to be aware of the type of bias that they introduce and thus the limitations on the conclusions that can be drawn from the resulting samples.
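
The two points about sample size made in this lesson (a larger sample is more precise, and a larger population does not call for a larger sample) can be checked with a small simulation. This is a rough sketch and not a formal sample-size calculation: the populations are made-up values drawn uniformly between 0 and 100, and the function name `spread_of_sample_means` is hypothetical. It repeatedly draws simple random samples and measures how much the sample mean varies from one sample to the next.

```python
import random
import statistics

def spread_of_sample_means(pop_size, sample_size, trials, rng):
    """Standard deviation of the sample mean over repeated simple random samples."""
    # Hypothetical population: values spread uniformly between 0 and 100.
    population = [rng.uniform(0, 100) for _ in range(pop_size)]
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Same sample size (100), very different population sizes.
small_pop = spread_of_sample_means(10_000, 100, 300, rng)
large_pop = spread_of_sample_means(200_000, 100, 300, rng)

# Same population size as the first case, but a sample four times larger.
bigger_sample = spread_of_sample_means(10_000, 400, 300, rng)

print(f"population  10,000, sample 100: spread {small_pop:.2f}")
print(f"population 200,000, sample 100: spread {large_pop:.2f}")
print(f"population  10,000, sample 400: spread {bigger_sample:.2f}")
```

In a run like this, the first two spreads should come out close to each other even though one population is twenty times larger, while the spread for the sample of 400 should be roughly half of that for the sample of 100.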