Welcome to the Statistical Modeling for Data Science Applications specialization. This specialization is intended to prepare you, the student, to be proficient in the use of statistical models for careers in data science. This course, the first in the specialization, will focus on a particular subset of statistical models, namely linear regression models. As the famous statistician John Tukey once said, statisticians and data scientists get to play in everyone's backyard: they have the opportunity to apply their skills in statistical modeling to problems across the sciences, in business, and in many other domains. Broadly, this course, and the Master's in Data Science offered through Coursera and the University of Colorado Boulder, will provide students with the ability to use data to develop and test theories, to make predictions, and to make data-driven decisions.

These statistical modeling tools can be quite relevant for providing solutions to pressing problems. For example, in the last several years we've seen an increase in wildfire activity in North America, especially in the western United States. Statisticians and data scientists have developed statistical models designed to predict the occurrence of a wildfire of significant size based on some set of input variables, say temperature, relative humidity, fuel or tree density, and many others. How might such a statistical model work? As a very simple analogy, we can think of a statistical model as similar to an audio mixing console, or mixer. The mixer has many possible inputs: some of those inputs might be switches that include or exclude certain instruments, others might be dials for bass, treble, volume, and so on. These inputs are associated with the output, namely the audio signal, and an experienced audio engineer knows how the different dial combinations impact the resulting signal and can manipulate the dials to obtain a desired outcome.

I said that statistical models are something like an audio mixer, but in fact statisticians are in a trickier position than the audio engineer. Unlike the audio engineer, who can control all of the dials, statisticians often can't control the inputs to their system. Instead, they can only observe the dials as nature sets them, and perhaps as nature changes them, and then, at the same time or a bit later, observe changes in the output. So one issue is that statisticians might not have control over the inputs. In some cases they do: in experimental contexts, researchers, statisticians, and data scientists can control the inputs, and in Course 2 we'll learn a set of statistical tools for analyzing data that come from that type of experimental setup. Another way that statisticians and data scientists are in a trickier position than the audio engineer is that the relationships between their inputs and outputs are typically not deterministic. Instead, we could say they're stochastic, or random. That means there may be trends between inputs and outputs, but the relationships are not exact. So statisticians might not be able to manipulate the dials, and even when they can, or when they observe that the dials have been moved, the output doesn't always change in exactly the same way.
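To make that idea of a stochastic, non-deterministic relationship concrete, here is a minimal sketch in Python. The intercept, slope, and noise level are made up purely for illustration; the point is just that the same "dial setting" can produce a different output each time it is observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "dial": a single input x, held at the same setting each time.
x = 5.0

# A made-up underlying trend (intercept 2, slope 3) plus random noise.
# The noise is what makes the relationship stochastic rather than deterministic.
def observe_output(x, noise_sd=1.5):
    return 2.0 + 3.0 * x + rng.normal(0.0, noise_sd)

# Observing the system repeatedly at the same input gives different outputs,
# even though on average the output follows the trend 2 + 3 * 5 = 17.
outputs = [observe_output(x) for _ in range(5)]
print(outputs)
```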
Even with those challenges, statisticians are often successful at either learning and explaining the relationship between inputs and outputs, or learning how to predict the output for a given set of inputs. These two tasks, explanation and prediction, might seem similar, but in fact they're different. In this course we will address both, see how they differ, and see which models are better suited to each.

Thinking back to our example of wildfire modeling, the output of the statistical model might be whether or not a wildfire of significant size, say greater than 1,000 acres, ignites in a particular location. The inputs might be things like temperature, humidity, wind speed, fuel or tree density, elevation, likelihood of lightning, the number of nearby campsites (if we think some wildfires are caused by humans camping in the forest), and perhaps many, many other variables. Notice that these input variables are not subject to manipulation. If we're studying a forest, a wild area, we can't change the temperature at will; we can only observe when the temperature changes and see whether that is associated with some change in the prevalence of large wildfires. Given these inputs and outputs, statisticians are tasked with coming up with a statistical model that can give the probability that a wildfire ignites in a particular location, given the temperature, humidity, fuel density, and so on. This wildfire model might be used for prediction: perhaps we'd like to predict whether a wildfire is likely to ignite in a given area in order to take steps to decrease that probability. But it might also be used for explanation: we might simply like to know the relationship between temperature, humidity, and the likelihood of a wildfire. There, the difference between prediction and explanation might be one of whether or not an action is being taken. If we predict something about the likelihood of a wildfire, we might take actions to try to prevent it; if we're just interested in explaining the relationship between the likelihood of a wildfire and certain input variables, we might be content to understand that relationship without taking any action.

Some important questions that statisticians and data scientists might ask in the context of a statistical model are: Which inputs should we include? For example, does elevation really impact the likelihood of a wildfire? That can be a tricky question to answer, but there are techniques for deciding which variables should be included in the model and which can be safely excluded. We might also ask: How does a specific change in one input variable impact the output? If we increase temperature by 10 degrees and keep all other inputs the same, how would that impact the likelihood of a fire? A further question might be: Does a change from 40 degrees to 50 degrees Fahrenheit result in the same increase in wildfire likelihood as a change from 90 degrees to 100 degrees? Here we're getting at the fact that different regions of the input variable, namely 40 to 50 degrees versus 90 to 100 degrees, might have different impacts on the output variable. Some statistical models treat those 10-degree differences anywhere on that scale as the same.
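As a rough numerical sketch of this idea, suppose the wildfire model were a logistic-regression-style model in temperature alone. The intercept and slope below are invented for illustration and are not fitted to any data; the sketch simply shows how such a model turns an input into a probability, and that a model which treats every 10-degree change identically on one scale (the log-odds) can still imply different changes on the probability scale.

```python
import numpy as np

def wildfire_probability(temp_f, intercept=-8.0, slope=0.08):
    """Hypothetical logistic model: P(large wildfire) as a function of temperature.

    The coefficients are made up for illustration only.
    """
    log_odds = intercept + slope * temp_f    # linear on the log-odds scale
    return 1.0 / (1.0 + np.exp(-log_odds))   # inverse-logit gives a probability

for low, high in [(40, 50), (90, 100)]:
    p_low, p_high = wildfire_probability(low), wildfire_probability(high)
    print(f"{low}F -> {high}F: probability rises from {p_low:.3f} to {p_high:.3f}")

# Both jumps are the same 10-degree change (0.8 on the log-odds scale), but the
# implied increase in probability is about 0.01 for 40 -> 50 versus about 0.19
# for 90 -> 100 under these made-up coefficients.
```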
Other statistical models might be able to account for the fact that there are differences: maybe wildfires become much more likely as you jump from 90 degrees to 100 degrees than as you jump from 40 degrees to 50 degrees. Finally, another question that statisticians and data scientists might ask is: Is there a causal relationship between the input and the output? In what sense do high temperatures cause wildfires? That can be a really tricky question, and later in this specialization we will talk about causal inference and the conditions under which we can draw causal conclusions.

Now that we have a broad overview of the way statistical models work, let's drill down a bit into some details and terminology. First, it's important to note that a statistical model should really start with an important question about the empirical world. That might be a research question, a business question, or a scientific question, but it's some question you'd like to learn the answer to, and one that, at least in theory, you could answer by collecting data; that is, there are measurements that could bear on the question. For example, we might like to know: Is a large wildfire likely next summer in Rocky Mountain National Park? In attempting to answer that question, we could look through history at when wildfires occurred in Rocky Mountain National Park or at similar locations, measure the conditions under which those fires took place, and attempt to model that input-output relationship. Another research question we might consider is: Is cycling safer in the summer than in the winter in the City of Boulder? To answer that question, we would have to collect data about cyclists and about cycling-related injuries across different seasons, and attempt to differentiate between injuries in the summer and in the winter. A third research question we might consider is: Does social media advertising impact the sales of a product in a given market? For this type of question, we would have to research different companies, understand their social media advertising budgets, and also look at how much of their product they've sold.

With our research question in hand, our goal would be to find some data, or collect our own, that bear on that question. We always collect data about specific units, or statistical units, where a unit is defined as a member of the set of entities being studied. In this slide, I've attempted to visualize the relationship between populations and samples, what the units might be, and what our variables might be. Suppose we'd like to study the way that exercise habits relate to weight loss. We might have a population, say all residents of Boulder, Colorado. For each resident, we could record what type of exercise they do: maybe they're cyclists, maybe they're walkers, maybe they do wheelchair sports, maybe they hike. They have other characteristics too: maybe they only cycle for half an hour, three times a week, or maybe they cycle two hours, four times a week; they might be of different ages and have different pre-existing conditions. There are lots of things we can measure about these individuals, and those variables may or may not have an impact on the output variable, which here would be weight loss, the amount of weight lost or gained.
We can't go out and measure all of these characteristics, all of these input and output variables, for the entire population of Boulder, Colorado, but we could collect a sample. A sample is a subset of the entire population, and we could try to make that sample representative in some way by randomly selecting the individuals for whom we measure these data. Once we measure the data in the sample, we could model it using a statistical model. Whether or not they exercise, what type of exercise they do, their age, their pre-existing conditions, how long they exercise: all of these variables can be inputs to the model. We can also track their weight, that is, measure whether they've lost weight, gained weight, or stayed the same. Then we can try to come up with a relationship in the sample and use statistical techniques to generalize it to the full population. What we've just described is really the process of inferential statistics and data science: learning about relationships in a sample in a way that's reliable enough to generalize from the sample to the population of interest. In this course, we'll be concerned with modeling these relationships and trying to make sound generalizations.
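As a closing sketch of that sample-to-population workflow, the code below simulates a hypothetical population in which weight change depends noisily on weekly exercise hours, draws a random sample, fits a simple linear regression with statsmodels, and reports confidence intervals for the coefficients. Every number here is invented; the sketch only illustrates the mechanics of estimating a relationship in a sample and quantifying the uncertainty we carry when generalizing to the population.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulated population: weekly exercise hours and weight change (lbs) over a year.
# The "true" relationship (about -1.5 lbs per weekly hour) is made up for illustration.
pop_size = 100_000
exercise_hours = rng.uniform(0, 10, size=pop_size)
weight_change = 2.0 - 1.5 * exercise_hours + rng.normal(0, 5, size=pop_size)

# Draw a random sample of individuals, as we would in practice.
idx = rng.choice(pop_size, size=200, replace=False)
x_sample, y_sample = exercise_hours[idx], weight_change[idx]

# Fit a simple linear regression of weight change on exercise hours in the sample.
X = sm.add_constant(x_sample)
fit = sm.OLS(y_sample, X).fit()

print(fit.params)      # estimated intercept and slope from the sample
print(fit.conf_int())  # 95% confidence intervals, used to generalize to the population
```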