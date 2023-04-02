Hello learners, in this video, we are going to talk about a very important result in probability and statistics called the central limit theorem. We have all seen the bell curve over and over again. Even people with a relatively modest understanding of statistics have probably heard about the bell curve: the central value is the most likely to happen, and as you go farther away from the center, the values become less and less likely. We had also discussed the normal distribution, whose PDF, the probability density function, looks like this bell curve. But why is the normal distribution so important? Why is it so frequently talked about everywhere? The most important reason is the central limit theorem.
So what does the central limit theorem say? There are some mild technical conditions which might not be very important here, but a sufficient case is when you have a group of random variables that are independent and identically distributed. So suppose you have a random variable which is nothing but the sum or the average of many, many independent and identically distributed random variables. This sum or average is a random variable in itself. The central limit theorem says that this random variable is approximately normally distributed, which means that as you add random variables or take the average of random variables, and as the number of random variables becomes larger and larger, the distribution of the sum or the average gets closer and closer to a normal distribution. It is a very surprising and strong result, because you could have started from a set of uniformly distributed random variables.
You could have started from random variables whose distribution you pulled out of thin air, but as you take sufficiently many of them, and assuming they are independent and identically distributed, their sum or their average is going to be closer and closer to a normal distribution.
Well, there is a caveat here. We are not saying that if you have a large data set, then the distribution of the data is going to be normal. That is not what we are saying. You can have extremely large data sets in which the data follows a uniform distribution, a beta distribution or any other distribution. We are not talking about that. We are specifically talking about the sum or the average of many, many random variables. If those random variables are independent and identically distributed and satisfy the other mild technical conditions, their sum or their average is going to be very close to a normal distribution. In fact, the theorem itself is stated in a limiting sense: as the number of random variables that you add or average tends to infinity, the suitably standardized sum or average follows exactly a normal distribution. In a finite data set you get only an approximation, but it can be an excellent one, and this can turn out to be very useful.
In this context, let us again consider the example of Krishnan trying out lunch at a restaurant in IIMA. We have already spoken about the distribution of the amount of money that Krishnan spends in the restaurant. We said he spends 50 rupees with probability 0.25, 70 rupees with probability 0.35, 100 rupees with probability 0.25 and 150 rupees with probability 0.15. Well, this is the amount of money he spends in one day. Let's consider two days. What is the total amount of money Krishnan will spend over two days, assuming the amount he spends on each day is independent and identically distributed as per the distribution you are looking at on the screen?
How are we going to compute it? Let us try to evaluate this. On the first day, Krishnan could have spent 50 rupees, that is, had a small meal, with probability 0.25, and on day two he could have spent 50, 70, 100 or 150. In each of these cases, he would have spent a total of 100, 120, 150 or 200 rupees over the two days. What are their probabilities? It is 0.25 for spending 50 rupees on the first day, and since the days are independent, we can multiply the probabilities. So with 0.25 again on day two, 100 rupees would have happened with probability 0.25 squared, 120 with probability 0.25 times 0.35, 150 again with probability 0.25 squared, and 200 with probability 0.25 multiplied by 0.15. But it's not just this. He could have spent 70 rupees on day one and again 50, 70, 100 or 150 on day two, which gives a total of 120, 140, 170 or 220 rupees, with probabilities 0.35 times 0.25, 0.35 squared, 0.35 times 0.25 and 0.35 times 0.15 respectively. Again, he could have had a medium meal worth 100 rupees on day one and then each of the four possibilities on day two, and so on. I assume you understand this: you can draw these trees for two days and make a long, laborious calculation, adding up the probabilities of the branches that lead to the same total, to identify the probability mass function of the money spent over the two days.
Well, let me ask you a harder question. What about the average money or the total money that Krishnan would have spent over a month consisting of 30 days? How are we going to draw a tree like this for a period of 30 days? Doesn't that seem like a humongous task? It is. However, if we are satisfied with approximations, which in many cases we are, the central limit theorem can help us. In fact, without even doing any further calculations, I already know that the distribution of the average money he has spent or the total money he has spent over the 30-day period is going to be approximately normal.
That is because the total money spent is basically the sum of these 30 random variables, which are independent and identically distributed, and the average money he has spent over the 30 days is the average of these same 30 random variables. So the central limit theorem already tells me this is going to be approximately normally distributed. Let us look at an Excel example showing the central limit theorem in action.
All right, let us look at this Excel file. Each of the cells corresponds to a realization of the random variable that describes the amount of money spent by Krishnan on a single day. You will see 50, 70, 100 and 150 occurring according to the probability distribution that we talked about earlier, and the cells are independent and identically distributed. For example, the first row corresponds to the amounts of money that Krishnan could have spent on each of the 30 days in a month. It is a single realization for the entire 30-day period. The second row corresponds to the money spent by Krishnan over the entire month in another realization, again assuming that the random variables are independent and identically distributed, and so on. You have a large number of realizations here.
Great, now let us consider how their average behaves. The parameter N days here tells you over how many days the average should be calculated. The fact that this is one basically says it's the average over just the first day. So BL is nothing but a reflection of whatever is in the first column. What if we say two? Then it is going to be the average of the first two days: for 150 and 100, the average is 125, as one would have expected. Well, that is good. Now let us look at the distribution. When we say that the number of days is 1, the distribution we get is exactly the one that we had always worked with. What if it is for two days? Well, it looks like a slightly different, new distribution.
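The two-day distribution can also be worked out exactly instead of being read off the sheet. Here is a minimal Python sketch, under the assumption that the two days are independent and follow the one-day distribution stated earlier. It enumerates every pair of day-one and day-two amounts, multiplies their probabilities, and adds up the pairs that give the same total. The names `DAILY_PMF` and `two_day_total_pmf` are made up for this illustration and are not part of the Excel file.

```python
from itertools import product

# One-day spending distribution from the lecture: amount in rupees -> probability.
DAILY_PMF = {50: 0.25, 70: 0.35, 100: 0.25, 150: 0.15}


def two_day_total_pmf(daily_pmf):
    """Walk every branch of the two-day tree and collect the probability of each total."""
    total_pmf = {}
    for (day_one, p_one), (day_two, p_two) in product(daily_pmf.items(), repeat=2):
        total = day_one + day_two
        # Independence lets us multiply; branches with the same total are added together.
        total_pmf[total] = total_pmf.get(total, 0.0) + p_one * p_two
    return dict(sorted(total_pmf.items()))


total_pmf = two_day_total_pmf(DAILY_PMF)

# The two-day average is the total divided by 2, with the same probabilities.
average_pmf = {total / 2: p for total, p in total_pmf.items()}

for total, p in total_pmf.items():
    print(f"total {total:3d} rupees, average {total / 2:5.1f} rupees: probability {p:.4f}")
```

Note that a total such as 200 rupees can come from more than one branch (50 then 150, 150 then 50, or 100 then 100), which is why the probabilities are accumulated rather than simply assigned.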
Well, this is the one which we attempted to calculate for the sum of the money spent over two days, except that this is not a sum but an average, so you divide everything by 2. What about three days? Well, it has even more parts. Four, even more. Five, even more. What about the average money spent over 10 days? Wow, that looks amazingly close to a normal distribution. What about 20 days? Even more so. 30, even more so. As I refresh this, you'll see that I consistently get the shape of the bell curve.
There was nothing special about the initial distribution that we started with. This distribution is not special in any way. You could have picked any distribution of your choice, as long as those mild conditions which I talked about hold. If you are considering the sum or the average of the random variables over a sufficiently large period of time, and in this case 30 seems to be a sufficiently large number, the distribution becomes closer and closer to normal.
In fact, if you remember the discussion we had about the normal distribution, we said that it is defined by two parameters, the mean and the standard deviation. Likewise, your own initial distribution will also have a mean and a standard deviation. And the central limit theorem not only says that the sum and the average will be approximately normally distributed, it also tells you precisely what the mean and the standard deviation of the sum and of the average will be. And that gives you a lot of power, because you can immediately begin making statements about the average money this person would have spent over a period of time. You can ask, what is the probability that the average money Krishnan would have spent over 30 days will be greater than 100? As such it doesn't look like an easy question to answer, but with very simple calculations and by looking up normal distribution tables, questions of this form can be answered easily.
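To make the question about Krishnan's 30-day average concrete, here is a minimal Python sketch of the calculation. It relies on the standard facts that the average of n independent and identically distributed variables has the same mean as a single variable and a standard deviation equal to the single-variable standard deviation divided by the square root of n, while the sum has n times the mean and the square root of n times the standard deviation. The function names are made up for this illustration, and `math.erfc` stands in for the normal distribution table. The answer is only an approximation, since the true average takes discrete values.

```python
import math

# One-day spending distribution from the lecture: amount in rupees -> probability.
DAILY_PMF = {50: 0.25, 70: 0.35, 100: 0.25, 150: 0.15}


def mean_and_std(pmf):
    """Mean and standard deviation of a discrete distribution."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, math.sqrt(variance)


def normal_tail(x, mean, std):
    """P(value > x) for a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))


n_days = 30
mu, sigma = mean_and_std(DAILY_PMF)

# Parameters of the approximating normal distributions.
average_mean, average_std = mu, sigma / math.sqrt(n_days)
total_mean, total_std = n_days * mu, sigma * math.sqrt(n_days)

# Approximate probability that the 30-day average exceeds 100 rupees.
p_average_above_100 = normal_tail(100, average_mean, average_std)

print(f"one day:        mean {mu:.2f}, std {sigma:.2f}")
print(f"30-day average: mean {average_mean:.2f}, std {average_std:.2f}")
print(f"30-day total:   mean {total_mean:.2f}, std {total_std:.2f}")
print(f"P(average > 100) is approximately {p_average_above_100:.4f}")
```

The threshold of 100 rupees sits a little more than two and a half standard deviations above the mean of the average, so the probability comes out small, well under one percent.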
And this is the power given by the central limit theorem. In fact, this is why the normal distribution is one of the most important distributions studied by practitioners, because in many cases the quantities we consider turn out to be the average of a bunch of other quantities. The central limit theorem gives us the power to forget about the original distribution of the quantities whose average is taken, because we know that, no matter what, the averages are going to be approximately normally distributed. [MUSIC]
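For anyone who wants to try the spreadsheet demonstration without Excel, here is a rough Python sketch of the same idea. It is not the lecture's actual file. The layout (one row per simulated month, 30 daily amounts per row, an `n_days` setting that averages the first few days) follows the description in the video, while the function names, the number of rows and the fixed random seed are assumptions made for this illustration.

```python
import random
import statistics
from collections import Counter

AMOUNTS = [50, 70, 100, 150]
PROBABILITIES = [0.25, 0.35, 0.25, 0.15]


def simulate_months(n_months, days_per_month=30, seed=0):
    """Each row is one realization of a month: one random daily amount per day."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        rng.choices(AMOUNTS, weights=PROBABILITIES, k=days_per_month)
        for _ in range(n_months)
    ]


def average_of_first_days(months, n_days):
    """Average of the first n_days entries of every row, like the N days setting."""
    return [sum(row[:n_days]) / n_days for row in months]


def text_histogram(values, bin_width=2.5, max_bar=50):
    """Print a crude histogram: one row of '#' characters per bin."""
    counts = Counter(round(v / bin_width) * bin_width for v in values)
    peak = max(counts.values())
    for center in sorted(counts):
        print(f"{center:6.1f} {'#' * round(counts[center] / peak * max_bar)}")


months = simulate_months(n_months=5000)

# Sample mean and standard deviation of the average for a growing number of days.
for n_days in (1, 2, 10, 30):
    averages = average_of_first_days(months, n_days)
    print(f"n_days={n_days:2d}  mean={statistics.mean(averages):6.2f}  "
          f"std={statistics.pstdev(averages):5.2f}")

# Shape of the distribution of the 30-day average.
text_histogram(average_of_first_days(months, 30))
```

Changing the seed plays the role of refreshing the sheet: the individual numbers change, but the histogram of the 30-day average should keep its bell shape, and its spread should shrink as `n_days` grows.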