Data scientists employ a broad range of statistical tools to reach conclusions from sometimes messy and incomplete data. Many of these tools come from classical statistics and are used before the formal modeling part of the workflow. This unit focuses on the foundational techniques of estimation with probability distributions and simple hypothesis tests in the context of EDA. Statistical inference is a very complex discipline, but fortunately, there are tools that make its application routine. Thinking about your data in terms of statistical populations and random sampling is foundational to the methods described in this section.

We will discuss how statistical inference is used to answer several types of questions, all of which are typical of an investigative analysis. We may be looking to uncover the connection between a business opportunity and the data, or we may be looking to understand a trend or pattern in the data. Hypothesis testing, point estimation, uncertainty estimation, and sensitivity analysis are all examples of where we rely on statistical inference to do the heavy lifting.

Before we jump into the investigational example, let's think for a moment about a simple application of statistical inference. Let's imagine that there is a DevOps unit within Avail that allocates computational resources for other units in the company. Let's say the data are the percentage of time CPUs are allocated each day. For each day, we may have 50 percent, 80 percent, or another number. Any of these subplots could be a representative sample of the population.

First, we're going to use a beta distribution to generate some data for this example. Then we are going to fit it with another distribution and make probability statements about the data. You may recall that the beta distribution, governed by the parameters alpha and beta, is very flexible, as shown here. This function demonstrates the process of statistical inference on a dataset. We first instantiate a beta distribution given the input parameters, create a histogram of 2,000 samples drawn from that distribution, and then evaluate the PDF across the range of possible values. The plotting code takes up most of the function and is less important here than the single line needed for inference. To summarize the function: we use a beta distribution to represent our given dataset, and then we infer a Gaussian using the .fit method. The estimated parameters are denoted with the conventional hat notation. The histogram represents the random samples from the specified beta distribution, and the lines are the corresponding PDFs.

The goal here is to make probability statements that are meaningful even to non-technical stakeholders. For example, what is the probability that more than 90 percent of processors are being used at any one time? We can answer this using the CDF as shown. We see that the probabilities from the assumed and actual distributions are close. Given a reasonable fit, we can make statements like: on average, there is a 12 percent probability that more than 90 percent of processors are being allocated. Let's first see what happens when our assumed distribution is no longer appropriate for the given data. There is a noticeable difference between the two probabilities now. Next, let's align the assumed and actual distributions. We see that the probabilities tend to converge.
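To make this concrete, here is a minimal sketch of that single inference step and the resulting probability statement, assuming numpy and scipy. The shape parameters, seed, and sample size are illustrative choices, not necessarily the values used in the lesson's figures.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# "Observed" CPU allocation fractions: 2,000 samples drawn from a beta distribution
alpha, beta_param = 8.0, 2.0                  # illustrative shape parameters
data = stats.beta(alpha, beta_param).rvs(2000)

# The inference step: fit a Gaussian to the data by maximum likelihood
mu_hat, sigma_hat = stats.norm.fit(data)

# Probability that more than 90% of processors are allocated,
# under the fitted Gaussian and under the true beta distribution
p_fit = 1.0 - stats.norm(mu_hat, sigma_hat).cdf(0.9)
p_true = 1.0 - stats.beta(alpha, beta_param).cdf(0.9)

print(f"P(usage > 0.9), fitted normal: {p_fit:.3f}")
print(f"P(usage > 0.9), true beta:     {p_true:.3f}")
```

When the assumed Gaussian is a reasonable fit, the two printed probabilities are close; with a strongly skewed beta, they diverge, which is the behavior described above.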
Despite the ease with which these statements can be made, it is important to remember that visualization provides credibility and context that matter when using statistical inference to make probability statements. The fit method we just used on the previous slide was computed by maximizing a log-likelihood function. There are many ways to carry out inference, and depending on the choice of method there are inherent advantages and disadvantages, like computational complexity, bias, and flexibility. Let's dive into an example that showcases several of these inference methods in the context of an EDA investigation.

In data science, hypothesis tests often take the form of an A/B test where there are control and treatment groups of samples. We are going to work with the following example for the remainder of this lesson. Visitors to the Avail website are randomly sent version A or version B of the website. Let's assume that version B has a new marketing scheme for getting a user to click "Subscribe", and version A is the default version. In order to investigate whether version B has a greater impact on purchase decisions, we will record the number of visitors to each version and track the proportion that convert to subscribers.

Recall the basic process behind hypothesis testing. If we decide to use a binomial test, for example, then the procedure would look like the steps enumerated here. From a scientific thinking perspective, we're trying to disprove all other possible explanations before accepting that website B is more or less effective than website A. It is important to remember that we decide on a test and the level of significance before collecting the data. In the context of modern data science, collecting the data could refer to the process of loading it into Pandas, because data is often being accumulated in some form for most organizations. Since we are simulating the data, we can specify the unknown conversion rates for both versions of the website. In reality, these are values that we estimate.

In a typical A/B test, we would be comparing two versions of the site running concurrently, because we want to account for as many unmeasured effects as possible, like seasonality and time-of-day effects. This would be a two-sample hypothesis test. Because many organizations are not always willing to run experiments in this way, let's start with a one-sample test and ask the question: is there a difference between site B and the historical baseline? If the p-value is less than 0.05, we reject the null hypothesis that the conversion rate is the same as the historical conversion rate, in favor of the alternative hypothesis. It is important that you do not stop your investigation here, and that you do not make critical business decisions based on a single p-value. We will discuss some limitations of p-values in later sections. This p-value should be considered alongside other forms of evidence before making decisions.

We can also think of the A/B test from a generative perspective; that is, samples are generated by repeated Bernoulli trials, and these follow a binomial distribution. So we can specify the baseline as follows: let p be the long-term conversion rate, in this case the rate observed from site A, and let the parameter n be the number of samples in our experiment. We will use this distribution to give us an idea of what is expected given the null or baseline. The binomial test is an example of an exact solution.
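A minimal sketch of this one-sample exact test, using scipy.stats.binomtest. The conversion rates, visitor count, and seed below are illustrative stand-ins for the simulated values used in the lesson.

```python
import numpy as np
from scipy import stats

np.random.seed(0)

# Illustrative, simulated values (in practice the true rates are unknown)
p_a = 0.10            # historical baseline conversion rate from site A
p_b_true = 0.13       # "true" conversion rate for site B, used only to simulate data
n = 1000              # number of visitors sent to site B

# Simulate site B conversions as repeated Bernoulli trials (a binomial count)
conversions_b = np.random.binomial(n, p_b_true)

# One-sample, two-sided exact binomial test against the historical baseline
result = stats.binomtest(conversions_b, n=n, p=p_a, alternative="two-sided")
print(f"observed rate: {conversions_b / n:.3f}, p-value: {result.pvalue:.4f}")
```

If the resulting p-value is below the pre-specified significance level, we reject the null hypothesis that site B's conversion rate equals the historical baseline; as noted above, that single number should not be the sole basis for a business decision.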
If the number of visitors increases beyond a few thousand, it becomes reasonable to use a normal distribution to approximate the estimated proportion. The test statistic in this case is the z-score shown by the formula above. The numerator is the difference between our estimated conversion rate and the baseline; the one-half is additionally subtracted as a continuity correction, which is necessary when we approximate a discrete distribution with a continuous one. The denominator is the estimate for the standard deviation. We see that the p-value is similar to the exact test in this case.

It is also possible to take a numerical approach to calculating these probabilities. In this example, we repeatedly generate success counts from a binomial distribution with specified n and p. We then track how many of those success counts were greater than or equal to the observed number of conversions from site B. After a large number of simulations, this proportion converges toward the p-value for the hypothesis that site B's conversion rate equals the baseline.

We have already seen an example of maximum likelihood estimation in the example about probabilities and CPU usage. One significant caveat to this kind of estimation is that we are left with a point estimate that has little context. Here we take the point estimation a step further and quantify the distribution of that estimate using the bootstrap.

The Bayesian treatment for comparing conversion rates from sites A and B is very similar to the MLE approach when combined with a bootstrap confidence interval. Point estimates are not obtained directly; instead there is a posterior distribution that corresponds to, in this case, the estimate for p. Bayes' formula and the relevant terms are shown on this slide as a reminder. For this example, we demonstrate an analytical solution that makes use of a conjugate prior for the binomial distribution. For most real-life problems, the necessary statistical models are more complex, and estimation makes use of numerical methods like Markov chain Monte Carlo. The conjugate prior of the binomial is the beta distribution. The prior distribution, in this case a Beta with both parameters equal to one, is a uniform distribution, which happens to be ideal when we want our prior to be uninformative. We encourage you to come back to this function later on, but try not to get caught up in too many of the details your first time through.

We are interested in the question of whether or not the conversion rate from B is different from that of A. Normally, we do not know the actual conversion rate for site B, but we have plotted it here in yellow to see how well it aligns with our dashed blue line, which is our estimate. With more data, these two lines will converge. The historical expected number of conversions is shown in red. As a rule of thumb, if our confidence interval overlaps with it, then we cannot be confident that the two conversion rates are different. It is an intuitive way of essentially running a hypothesis test where there is no need to set a level of alpha. First, let's increase n and see what happens. We see that as the sample size increases, the known and empirically estimated conversion rates start to converge. Let's see what happens when we set n a little bit higher. At even higher sample sizes, we see that the confidence interval begins to shrink to reflect an increased degree of belief, and we can say that there is now a difference between the two conversion rates.
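The three approaches described above can be sketched side by side as follows, assuming numpy and scipy. The visitor count n, the observed conversion count, the one-sided alternative, and the use of the null standard error in the z-score denominator are illustrative assumptions, not necessarily the lesson's exact choices.

```python
import numpy as np
from scipy import stats

np.random.seed(0)

# Illustrative values, continuing the one-sample comparison against the baseline
p0 = 0.10          # historical conversion rate from site A
n = 5000           # visitors sent to site B
x = 560            # hypothetical number of conversions observed on site B
p_hat = x / n

# --- Normal approximation with continuity correction ---------------------
# z = (p_hat - p0 - 1/(2n)) / sqrt(p0 * (1 - p0) / n), standard error under the null
z = (p_hat - p0 - 0.5 / n) / np.sqrt(p0 * (1 - p0) / n)
p_normal = stats.norm.sf(z)                     # one-sided (greater) p-value

# --- Simulation-based p-value ---------------------------------------------
# Repeatedly draw success counts under the null and count how often they
# meet or exceed the observed number of conversions
sims = np.random.binomial(n, p0, size=100_000)
p_sim = np.mean(sims >= x)

# --- Bayesian conjugate update ---------------------------------------------
# Beta(1, 1) (uniform) prior on p; the posterior is Beta(1 + x, 1 + n - x)
posterior = stats.beta(1 + x, 1 + n - x)
ci_low, ci_high = posterior.ppf([0.025, 0.975])  # 95% credible interval

print(f"z = {z:.2f}, normal-approximation p-value: {p_normal:.4f}")
print(f"simulation p-value: {p_sim:.4f}")
print(f"posterior 95% credible interval for p: ({ci_low:.3f}, {ci_high:.3f})")
print(f"baseline p0 = {p0} inside interval? {ci_low <= p0 <= ci_high}")
```

The last print mirrors the rule of thumb above: if the interval around the estimated conversion rate overlaps the baseline, we cannot be confident the two rates differ; as n grows, the interval shrinks and the comparison becomes more decisive.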