In the previous video, we found out that the mean is not such a great tool when you need to estimate the average of a feature that contains a small number of extreme observations, such as outliers. After watching this video, you will know how to estimate other kinds of averages from a sample using a technique called the bootstrap, one of the most powerful statistical tools.

A quantity that is much more robust to extreme values is called the median. The median of a feature is the level such that the feature takes values above it and below it about 50% of the time each. If you are estimating the median from a sample, you need to sort the whole sample and take the middle element. If the sample size is odd, it is literally the element in the middle. If it is even, you take the average of the two elements closest to the center.

In a sense, the median is an average value of the feature, too. Just like the mean, it points us to the area where the feature typically takes values. Indeed, both means and medians are called averages, but they don't always coincide, and some mean statisticians could use that. A nice example can be found in the book How to Lie with Statistics from 1954. Imagine you have a sample of several people whose yearly income you know. One person earns $45,000 per year, which is probably a lot for the 50s. One person earns $15,000, one $10,000, and so on. There are 12 people whose income is $2,000. Now, if you need to estimate the average income of the population, you could calculate the sample mean, which is $5,700, or the sample median, which is $3,000. Depending on the impression you'd like to make, you may choose one of these quantities and just report it as an average without specifying what kind of average it actually is. Most people will not notice it anyway. I'm telling you this not because I want you to ever do that. It is a mortal sin. I just want you to be aware that averages can be manipulated.
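To make the sort-and-take-the-middle rule concrete, here is a minimal Python sketch. The function name `sample_median` and the income figures are assumptions made up for illustration; they are not the book's data and not code from the lecture.

```python
from statistics import mean


def sample_median(values):
    """Estimate the median: sort the sample and take the middle element."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd size: literally the element in the middle.
        return ordered[mid]
    # Even size: average of the two elements closest to the center.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical yearly incomes with one extreme value (illustrative only).
incomes = [2000, 2000, 2000, 3000, 3000, 4000, 45000]

print("median:", sample_median(incomes))
print("mean:  ", round(mean(incomes), 2))
```

In this made-up sample the single large income pulls the mean far above the median, which is the gap a careless or dishonest report of "the average" can exploit.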
Don't trust them unless you know the whole picture.

Anyway, back to the taxi data. For the first sample of 100 trips, the median trip duration is 10.3 minutes. For the second sample, which contains some incredibly long rides, the sample median is just 11.23 minutes. If you take those 20 super long trips out of the sample, the sample median over the remaining data points is 11.22 minutes. Compare that to the drastic change in the sample mean.

To obtain a confidence interval for the median, we are going to use one of the most powerful statistical techniques, called the bootstrap. It was invented in the late 70s, but became much more relevant with the spread of computers, as it is quite computationally demanding. So we have a sample X of size n. We are going to use it to generate new samples of size n just by sampling from it randomly with replacement. If the original sample was, for example, (1, 2), then after sampling with replacement you might obtain the samples (1, 2), (1, 1), or (2, 2). We are going to repeat this procedure B times, obtaining B bootstrap samples. Over each bootstrap sample, we calculate the quantity of interest, namely the sample median. This is how we get a vector of length B containing the values of the sample median over all bootstrap samples. Let's sort this vector and drop the 2.5% smallest and the 2.5% largest values. The range of the remaining values defines the boundaries of the 95% confidence interval for the median. That's it.

The technique is quite universal. You could, of course, replace 2.5% with alpha over 2 to get an interval with confidence 1 minus alpha. And you could also replace the median with pretty much any function you need to estimate, and the bootstrap procedure would almost always give you a nice confidence interval for it.

Let's see how it goes for the taxi data. First, we need to define two functions. One will generate bootstrap samples from the data, and the other will calculate the interval.
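The two functions could look roughly like the sketch below. This is not the code shown in the video: the names `get_bootstrap_samples` and `percentile_interval`, the choice of B = 1000, and the synthetic log-normal "durations" standing in for the taxi data are all assumptions made for illustration.

```python
import numpy as np


def get_bootstrap_samples(data, n_samples, rng):
    """Draw n_samples resamples of len(data) points each, with replacement."""
    data = np.asarray(data)
    # Each row holds random positions into the original sample.
    indices = rng.integers(0, len(data), size=(n_samples, len(data)))
    return data[indices]


def percentile_interval(stats, alpha=0.05):
    """Cut alpha/2 off each tail of the bootstrap statistics."""
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


rng = np.random.default_rng(0)

# Synthetic stand-in for trip durations in minutes (not the real taxi data).
durations = rng.lognormal(mean=2.4, sigma=0.6, size=100)

# B = 1000 bootstrap samples, one median per sample.
samples = get_bootstrap_samples(durations, n_samples=1000, rng=rng)
medians = np.median(samples, axis=1)

low, high = percentile_interval(medians)
print(f"95% interval for the median: [{low:.2f}, {high:.2f}]")
```

Note that `np.percentile` interpolates between neighboring values by default, so the boundaries may differ slightly from literally sorting the vector and dropping 2.5% from each end. Swapping `np.median` for another statistic is the only change needed to get an interval for a different quantity.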
Using those functions, we calculate the interval for the median over both samples. The sample of 100 data points gives us a 95% confidence interval for the median from 8.7 to 12.4. The second sample, of 10,000, gives an interval from 11 to 11.4. It is interesting to note that the width of this interval decreased about ten times as we increased the sample size 100 times, suggesting that the same square root of n rule applies to the bootstrap as well.

All right. This has been a short but hopefully intense journey. You have learned how to sample a data set and how to use the sample to estimate proportions, means, medians, and other quantities you might be interested in, providing both point and interval estimates. Getting an interval estimate is quite important, as it helps to quantify the degree of uncertainty in the estimate you provide. My favorite example confirming that is from the book The Signal and the Noise by Nate Silver. It is about the 1997 flood in Grand Forks, North Dakota, where the meteorological office predicted the flood level to be 49 feet. The city built a dam 51 feet high, and the actual level of the flood turned out to be 54 feet, resulting in several billion dollars of damage to the city. Silver states that the error margin of the service's flood level forecasts, based on historical data, was about nine feet. Had this information been stated explicitly, the city probably could have built a higher dam and avoided the disaster. So, confidence intervals are important.

I hope you have learned something new in this lesson. I tried to give you recipes for using statistical methods without going under the hood, and I hope you did not think my oversimplifications were gross.