Hey there. This lesson is meant to be a conversational statistics refresher to get you ready to use the statistics formulas necessary for AB testing, which is an important part of data science. It's really helpful to have a strong understanding of statistics for some kinds of data analysis. If you already have a background in statistics and you understand the difference between standard error and standard deviation and how these can be used to compute a confidence interval, you might be ready to jump right into AB testing examples and skip this lesson altogether. If, however, your statistics skills are rusty, this'll give you some intuition for some of the numbers that will be used under the hood in our AB test calculator. Even if you don't have a strong background in stats, it would help if you're already familiar with the concepts of a standard deviation and a distribution. Okay. So here's a question from my real life. I take the train to work, and I enjoy my commute a lot more if I can get a seat on the train. So the question that I think I want to ask is, how many seats are there on the train? Does this question have a numerical answer? No, not exactly. The number of seats available varies. It exists on a distribution. Sometimes I'll get an empty train, and there'll be lots of seats. Sometimes the train will be full. A better question to ask is, what's the average number of seats available on the train? The way I can collect data to answer this question is to make an observation. I can go to the station and observe how many seats are available. So if I just did this once, is that a good way to answer the question? Maybe not. Imagine the case where the trains are either really empty or really full. I would call this distribution bimodal, which means it has two peaks. This won't look like a bell curve. If I took just one observation and used that to estimate the mean, I might be way off. So what's something else I could do?
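To make that concrete, here's a tiny simulation sketch in Python. All the numbers are invented for illustration: a bimodal "open seats" distribution where trains are either crowded or nearly empty, so a single observation lands far from the true mean.

```python
import random

# A made-up bimodal "open seats" distribution: half the time the train
# is crowded (~2 open seats), half the time it's nearly empty
# (~30 open seats). Every number here is invented for illustration.
def observe_seats(rng):
    if rng.random() < 0.5:
        return rng.gauss(2, 1)    # crowded train
    return rng.gauss(30, 2)       # nearly empty train

rng = random.Random(42)
observations = [observe_seats(rng) for _ in range(10_000)]
estimated_mean = sum(observations) / len(observations)  # near (2 + 30) / 2 = 16

# A single observation lands near 2 or near 30 -- either way,
# far from the mean of roughly 16.
one_observation = observe_seats(rng)
```

Notice that no individual train ever has close to 16 open seats, which is exactly why one observation is a bad estimate of the mean here.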
I could collect a bunch of observations and take the average. I would be sampling randomly from the distribution: empty, full, empty, full. The average of these observations would be a better estimate of the mean. One of the cool things that happens when we take an average over several samples is that the central limit theorem unlocks a bunch of cool tools for us. The central limit theorem basically says that even if the original distribution we're sampling from isn't normal, like a bimodal distribution, if I repeatedly take samples and compute the average number of seats from a collection of observations, that set of means I compute will be normally distributed. So it looks like a bell curve. Then we have a normal distribution, and we can use well-developed statistics tools for analyzing it. Great. So here's the situation. I usually go to station A, but I could also go to station B to catch a different train into work. This is where we introduce our thinly-veiled metaphor for AB testing. Suppose I've already collected two observations from station A and my neighbor collected 10 observations from station B. Do we have any business comparing our numbers? Is it a problem that we have a different number of observations? No, because we're talking about the averages here. I'm still curious about which station I should go to because I want to get a seat. So, of course, I'm going to ask for my neighbor's observations. But whose estimate is better? We could both be wrong. In fact, we probably both are off by at least a little bit. So why do we have more confidence in my neighbor's estimate? Well, it'll be a lot easier for me to get lucky or unlucky and catch two trains that are both above or below the mean. But it would take a lot more luck for my neighbor to have that happen 10 times. So the number of observations, often denoted by the letter n, is important, and it will show up in our formula.
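Here's a small sketch of the central limit theorem at work, using the same made-up bimodal distribution: individual observations are never near the overall mean, but averages of n observations pile up in a bell-shaped cluster around it.

```python
import random
import statistics

# Same made-up bimodal distribution: trains are either crowded
# (~2 open seats) or nearly empty (~30 open seats).
def observe_seats(rng):
    return rng.gauss(2, 1) if rng.random() < 0.5 else rng.gauss(30, 2)

rng = random.Random(0)
n = 10  # observations averaged into each sample mean

# Repeatedly take n observations and compute their average.
sample_means = [
    statistics.mean(observe_seats(rng) for _ in range(n))
    for _ in range(5_000)
]
raw = [observe_seats(rng) for _ in range(5_000)]

# Fraction landing in a broad band around the true mean (~16 seats):
def in_band(xs):
    return sum(7 < x < 25 for x in xs) / len(xs)

frac_raw = in_band(raw)             # close to 0: raw draws hug 2 or 30
frac_means = in_band(sample_means)  # close to 1: means cluster near 16
```

The raw draws almost never fall in the middle band, while the sample means almost always do, which is the central limit theorem doing its job even though the underlying distribution is nothing like a bell curve.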
I've been talking about a case where the train is either really empty or really full. But what if the train my neighbor takes is much more consistent, where the most common number of open seats is just one or two, rather than zero or 30? Let's ignore whether we still count it as an open seat if someone has spilled coffee all over it; maybe that's a fractional seat, TBD. So if the distribution for train B has less variation or, I should say, more specifically, a smaller standard deviation, I suspect it might be easier to get close to the true mean with fewer observations. Reminder: we don't actually know the true distribution of seat availability. We can only estimate it from our observations. We have this notion of a confidence interval. This is a plus-or-minus range that we think the true mean is in based on our observations. Remember, my neighbor and I are probably both off by at least a little bit. Using the tools that we have for normal distributions, we can compute a range that the true mean is probably in, based on our observed mean and the two other variables that we've talked about: n, the number of samples, and sigma, the standard deviation. So when I compare my two observations with my neighbor's 10 observations, my confidence interval should probably be wider because the sigma is big and the number of observations is small. Their confidence interval is probably a bit narrower. When we do AB testing, one of the steps is to compute the standard error, which is sigma divided by the square root of n. This is going to get used in our formula. The standard error and the standard deviation are not the same unless n equals one. So what's the difference? The standard deviation refers to the distribution of the seats on the train, one observation at a time. The standard error refers to the distribution of the means of the n observations. So it's the standard deviation of this metric that we have created by grouping random observations together.
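As a quick sketch, here's how you might compute the standard error for the two samples in our running example. The specific seat counts are invented for illustration; the point is that a big sigma with a tiny n gives a large standard error, and a small sigma with more data gives a small one.

```python
import math
import statistics

# Hypothetical seat counts (invented numbers): my 2 observations at
# station A, and my neighbor's 10 observations at station B.
station_a = [3, 29]
station_b = [1, 2, 0, 3, 1, 2, 2, 1, 0, 3]

def standard_error(sample):
    # Standard error = sigma / sqrt(n). It describes the spread of the
    # *mean* of n observations; the standard deviation describes the
    # spread of individual observations. They match only when n = 1.
    sigma = statistics.stdev(sample)  # sample standard deviation
    return sigma / math.sqrt(len(sample))

se_a = standard_error(station_a)  # big sigma, tiny n  -> large SE
se_b = standard_error(station_b)  # small sigma, n = 10 -> small SE
```

With these made-up numbers, my standard error comes out around 13 seats, while my neighbor's is about a third of a seat, so their estimate of the mean is far more trustworthy.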
Then we use the z-score, which is a tool we get to use thanks to the central limit theorem. We pick how much of the probability we want to scoop up. People often choose 95 percent confidence, which means we think that 95 percent of the time, the true mean will be in our interval. This z-score is a multiplier we can use. We'll take the observed mean and then add or subtract the multiplier times our standard error. If I'm going to make a decision about which station to go to, I'm going to be looking for a case where the confidence intervals don't overlap. That would assure me that one of the trains probably did have a higher number of empty seats on average. If I don't have enough information yet, I could continue to collect observations and have my neighbor collect observations until the confidence intervals were small enough. This means that the number n gets bigger, the standard error gets smaller, and the intervals get narrower. Okay. What if I'm overthinking this? Maybe it really doesn't matter which train I take, because the distributions are just the same at both stations. I might call this the null hypothesis, the hypothesis that nothing is different. In this case, I would expect that the confidence intervals would overlap no matter how much time we waste collecting new observations, because we're sampling to approximate the same value. When we compute another term called the p-value, we're going to use those same variables, and it will tell us what the probability is that the difference in means occurred because of this natural variation in the samples versus a difference in the underlying distributions. You can use the p-value to decide whether to reject the null hypothesis. So in real life, this metric, the number of empty seats on the train, might not be the right metric to help me decide which station to go to. Remember, I just have one body. I don't really care how many seats there are. I actually only care if I'm going to get a seat.
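Putting those pieces together, here's a hedged sketch of the 95 percent confidence intervals and a two-sample z-test p-value for the same hypothetical observations. With only two observations at station A, a t-test would honestly be more appropriate, but this mirrors the z-score and standard-error formulas from the lesson.

```python
import math
import statistics

# Same invented observations as earlier in the running example:
station_a = [3, 29]                          # my n = 2
station_b = [1, 2, 0, 3, 1, 2, 2, 1, 0, 3]   # neighbor's n = 10

def confidence_interval(sample, z=1.96):
    # z = 1.96 "scoops up" the middle ~95% of a normal distribution.
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (mean - z * se, mean + z * se)

lo_a, hi_a = confidence_interval(station_a)
lo_b, hi_b = confidence_interval(station_b)
# My interval is huge, so the two intervals overlap: no decision yet.
overlap = lo_a <= hi_b and lo_b <= hi_a

# p-value via a two-sample z-test: how likely is a gap in means this
# big if both stations really share one distribution (the null)?
se_a = statistics.stdev(station_a) / math.sqrt(len(station_a))
se_b = statistics.stdev(station_b) / math.sqrt(len(station_b))
z = abs(statistics.mean(station_a) - statistics.mean(station_b))
z /= math.sqrt(se_a**2 + se_b**2)
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # normal CDF
```

With these numbers the p-value comes out well above 0.05, which agrees with the overlapping intervals: I don't have enough evidence yet to pick a station.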
In that case, a completely empty train every once in a while doesn't do me as much good as a single seat every time. So can you think of a better metric to collect to help me make my decision? As an analyst, you might not be making decisions about which train to take. Instead, it might be a decision about which subject line to use in an e-mail to send to all of your new users, or which of two recommendation algorithms to choose from. It will be up to you to decide things like what data to collect, how much data to collect, and ultimately, whether there's a statistically significant difference between option A and option B. Okay. Hopefully, now you're feeling ready to dig into some AB testing case studies.