Date: 2023-04-02
Hello learners. In this video, let's discuss what sampling is all about: why it is needed, how it can be done, and what the ways of doing it are. When we talk about sampling, it is important to distinguish between two terms: population and sample.
Let us say a person is interested in knowing the income of people in the city of Mumbai. They might want to know the average income, the median income, or the standard deviation of income among people in Mumbai. Well, one way to do this is to literally go and talk to every single person living in the city of Mumbai and ask them, “What is your income?” With that data, we know how to calculate the mean, the median if needed, the standard deviation if needed, and so on. However, this is very rarely going to be possible. The cost-effective way of doing this is to select only a few people from the population of Mumbai and ask them, “What is your income?” With this information, we try to understand what the average, the median, or the standard deviation of income is for the entire set of people in Mumbai.
Here, we are distinguishing between two objects. One is the entire set of people in Mumbai, which we call the population. This is something large and complete: the entire set of possibilities. In contrast, we have something called the sample, which corresponds to only a few members of the population from whom we collect data.
Now, why do we need to sample? Why not collect data from the entire population? Well, the most important reason is cost. Given unlimited resources, unlimited time, and unlimited money, maybe it is possible for a person to actually obtain data from every individual living in the city of Mumbai.
However, in many cases, cost is a barrier. Moreover, the additional cost of getting data from everybody in Mumbai is not justified by the additional information or the additional certainty we get from collecting data from the entire population. It is a cost-benefit trade-off in some sense. By allowing for some error in our measurements, we can significantly reduce our costs by asking for this information from only a sample, which means a small number of people from the entire population.
Second, in many cases, a person might not even have access to the entire population. For example, even getting to know who all the people in Mumbai are could be hard, because people might be traveling in and out of Mumbai. People might be staying in Mumbai for some time and outside for some time, so even access to the population may be difficult. Or even worse, maybe you know exactly who lives in Mumbai, but when you ask them, “What is your income?”, they are not happy to share this information with you. In general, access to the entire population might not be available. You might have only limited information, even if you do not have any constraint from a resource perspective. In these cases, we are forced to do sampling, and then we have to make inferences and decisions from the sample that we have collected.
Well, given that the need for sampling is clear, let's look at the characteristics of a good sample. Can any sample be good? No. To begin with, the sample has to be representative of the population.
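To make the population-versus-sample idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the incomes are made-up numbers drawn from a log-normal distribution (not real Mumbai data), and the population size of 100,000 and sample size of 500 are arbitrary choices. It computes the mean, median, and standard deviation once from the whole population and once from a small random sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 made-up incomes (not real data).
population = [rng.lognormvariate(10.0, 0.75) for _ in range(100_000)]

# Sample: 500 people chosen at random, each person at most once.
sample = rng.sample(population, 500)

# For the full population we use the population standard deviation;
# for the sample we use the sample standard deviation.
population_summary = {
    "mean": statistics.mean(population),
    "median": statistics.median(population),
    "stdev": statistics.pstdev(population),
}
sample_summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "stdev": statistics.stdev(sample),
}

for name in ("mean", "median", "stdev"):
    print(
        f"{name:>6}: population = {population_summary[name]:10.0f}   "
        f"sample = {sample_summary[name]:10.0f}"
    )
```

In a run like this, the sample numbers should land reasonably close to the population numbers even though we asked only 500 of the 100,000 people. That is the trade-off described above: some error in exchange for a much lower cost.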
For example, if we are interested in the income level of people in Mumbai and we only talk to people who are staying in a five-star hotel, that might not be a good representation of the population, because people in a five-star hotel are likely to be on the higher-income side of the population. In that case, the sample is not representative of the population, and we say the sample is not good.
Second, the sample has to have a significantly reduced cost compared to collecting data from the entire population. This is obvious in some sense, because we motivated the need for a sample by saying that we have constrained resources. Now, if the cost of collecting data even from a sample is comparable to the cost of collecting data from the entire population, then why sample at all? For example, in certain special cases, the entire population could contain only 10 people or 100 people. Let's take living Nobel laureates in economics. Maybe there are only a handful of people. You can probably go and interview that handful of people and get the required data. It might not be much more expensive than working with a smaller sample. In that case, don't sample; collect the data directly from the population. However, in a context like assessing the income level in the city of Mumbai, the cost of collecting data from the entire population could be prohibitively high compared to the cost of collecting data from a sample. In that case, sampling is good.
We discussed collecting samples from an underlying population. Now, this underlying population could be a finite set, or it could be an infinite set. Appropriate care is required if this underlying population is a finite set. For example, let's consider sampling from a finite population. In particular, let's take the example of collecting income data from the city of Mumbai, or even from a smaller locality within the city of Mumbai.
Now, I could ask people standing at a street corner, “What is your income level?” However, in this case, it is likely that I might ask this question twice or multiple times to the same person. I might not realize it, but what I am doing there is sampling with replacement, which means there is a chance that I will ask the same question to the same person multiple times. A better alternative is to keep track of whom I have already asked and to ensure that I don't survey the same person again. That is the case of sampling without replacement.
However, in many cases, sampling without replacement is very hard or not possible at all. For example, if you are tracking wild animals and you do not have the resources to identify each animal distinctly, then you might be forced to do sampling with replacement. If you are looking at the feeding habits or hunting habits of a wild animal, you might be collecting data on the same animal multiple times, which means there is a risk of bias in your data, and you might not get a representative sample from your population. A single data point could end up over-represented in your sample. Special care has to be taken when we are sampling from a finite population.
Similarly, sampling can also be done from an infinite population. Let us say I am interested in collecting data about the birth weight of babies born in the country. Well, the birth weight has taken a certain set of values based on what the weights have been for children who have already been born, but the next child to be born could have a new value from a continuous range. This means my population is potentially infinite, with a continuous range of values, and my sample is always going to be a finite sample from that infinite population.
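The birth-weight example can be sketched in Python as well. This is only an illustration under a loud assumption: we pretend birth weights follow a normal distribution with a mean of 3.0 kg and a standard deviation of 0.5 kg. Those two numbers are invented for the sketch, not measured values. The point is that every new baby is a fresh draw from a continuous range, so we can never list the whole population; we can only take a finite sample of whatever size we choose.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Assumed model, for illustration only (not real birth-weight data).
ASSUMED_MEAN_KG = 3.0
ASSUMED_SD_KG = 0.5


def next_birth_weight():
    """Draw the weight of the 'next baby' from the assumed continuous distribution."""
    return rng.gauss(ASSUMED_MEAN_KG, ASSUMED_SD_KG)


# The population is never exhausted: we can always draw one more value.
# Each sample, however large, is still a finite set of draws.
sample_means = {}
for n in (10, 100, 10_000):
    weights = [next_birth_weight() for _ in range(n)]
    sample_means[n] = statistics.mean(weights)
    print(f"n = {n:>6}: sample mean = {sample_means[n]:.3f} kg")
```

Larger samples should generally give a sample mean closer to the assumed 3.0 kg, but no finite sample ever becomes the full population.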
A different kind of care has to be taken when sampling is done from an infinite population.
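Going back to the finite-population case for a moment, the difference between sampling with replacement and sampling without replacement can be sketched in a few lines of Python. The locality of 50 residents and the survey size of 30 are hypothetical numbers chosen for illustration. The sketch uses `random.choices`, which may pick the same item more than once, and `random.sample`, which never repeats an item.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical locality: 50 residents, identified by number.
residents = list(range(50))

# With replacement: the same resident can be asked more than once,
# like stopping people at a street corner without keeping track.
with_replacement = rng.choices(residents, k=30)

# Without replacement: we keep track, so nobody is asked twice.
without_replacement = rng.sample(residents, k=30)


def repeated(people):
    """Return the residents who appear more than once, with their counts."""
    return {person: count for person, count in Counter(people).items() if count > 1}


print("asked more than once (with replacement):   ", repeated(with_replacement))
print("asked more than once (without replacement):", repeated(without_replacement))
```

The second line of output is always an empty dictionary, because `random.sample` cannot return the same resident twice. The first will typically show a few residents who were counted two or more times, which is exactly the over-representation risk described for the wild-animal example.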