This demonstration is on using Microsoft Excel 2016 with the data analysis toolkit for doing linear regression. We'll take a look at two examples, one of simple linear regression with just one explanatory variable and one example of multiple regression. The first example is looking at space shuttle launches and damage to the O-rings, this is the context of the Challenger disaster. There were 23 previous launches before the Challenger disaster. Let's take a look at that data. This is a publicly available data set. It's available at the University of Alabama Huntsville at this web address. Here we can see the 23 previous launches, the temperature in degrees Fahrenheit and an index of damage to the O-rings. Most launches had no damage, but some of them had 2, 4, or 11 units of damage. We'll copy all of this data and paste it into Excel. Let's take a look at a plot of the data. We can do this, Insert > Chart > X Y (Scatter) plot. Make the chart a little bit bigger so we can see it better. We can see that all of the flights that had no damage were on warmer days. All the flights on colder days had some amount of damage. We can Add Chart Element > Trendline > Linear to put a linear regression line onto this chart. So here's the fitted line and we can see that as it gets colder, we expect to see more damage. We can quantify that using a linear regression model. So we'll go to Data > Data Analysis > Regression. Our Y variable, the response or dependent variable will be the index of O-ring damage and that goes from B2 to B24. The explanatory variable, the X variable is the temperature and that goes from A2 to A24. We'll put the output into the same spreadsheet so that we can use the results. Here, Excel has computed linear regression for us. An R square of 0.64, Standard Error of 2.1, it's 21 degrees of freedom. The fitted intercept is 18.365 and the fitted slope is minus 0.24. The slope gives the relationship between temperature and predicted damage. It's saying approximately, for every four degree decrease in temperature, we expect to see a one unit increase in damage. That makes sense with this negative slope. Excel also gives us a 95% interval for the slope under the Bayesian analysis, this is opposed to your interval. Using a reference Bayesian analysis, we get the exact same interval as we would for frequentist. We can see how this is computed by hand, for example, the lower portion of the interval, minus 3.75. That's going to be the fitted value minus the standard error times the value from a t-distribution. A t with probability 0.975 and degrees of freedom as given by the regression. Thus, we can replicate by hand what Excel has given us in the table. Suppose we want to predict how much damage we expect to see if we launch at 31 degrees. That's pretty far below the actual data that we've seen so far and we're going to expect to see a fair amount of damage. Predicted value y hat is going to be the intercept plus the slope times the x value we're predicting at, in this case 31. We get a predicted value of 10.82. So at 31 degrees Fahrenheit, we expect to see on average, almost 11 units of damage, which is about as high as we've seen in the data set. We can ask what's a predictive interval for this launch at 31 degrees? So thinking about a posterior predictive interval. These under a reference Bayesian analysis will be the same, again, as the frequentist predictive intervals. And so just thinking about a new observation when we launch at 31 degrees. I going to copy and paste in a formula. This formula is in the materials associated with this lecture online. This will give us the lower end of the interval and then using a plus instead of a minus will give us the upper end of the interval. Thus we get an interval that runs from about 4.05 to 17.59. This would be a 95% predictive interval for all future launches at 31 degrees. You can see that zero was not included in this interval so we're pretty confident that we'll expect to see O-ring damage if we launch at 31 degrees. So far we've gotten pretty much the same answers if we're using a Bayesian analysis or a frequentist analysis. The question we can ask as a Bayesian that's difficult to answer as a frequentist would be, what's the posterior probability that the damage index is greater than zero for a launch at 31 degrees Fahrenheit? Here we'll use the marginal predictive distribution, posterior predictive distribution and that's a t-distribution. T-distribution with center 10.82 and scaling factor 2.1. The T-DIST function will give us the probability of it being less, and so if we want the probability that's greater than zero, we'll take one minus that value. Plugging this in, Excel gives us a probability of 0.998. Thus, we have extremely high probability of seeing O-ring damage if we launch at 31 degrees Fahrenheit. Now let's move on to a second example. This example is a classic data set on predicting the heights of children from the heights of the parents. This was done by Galton in 19th century Britain. All these height measurements are in inches. Let's take a look at the data set. So he measured heights of children in a number of different families. Measured the height of the father in inches, the height of the mother in inches, the gender of the child, male or female, the height of the child in inches, and how many children were in that family. So we'll copy this into a new Excel spreadsheet. In order to fit a regression in Excel, a multiple regression, it wants us to have all of the explanatory variables, all the independent variables together in one block. Because height is somewhere in the middle, we'll make a new block over here with all the explanatory variables. So Kids is a potential explanatory variable, height of Father, height of Mother and Gender of child. I'm going to start by copying Father and Mother, And put them over here in columns G and H. Now I've got Kids, Father and Mother together. Gender, Excel's going to want me to have numbers, not labels of male and female. So I'm going to create a new variable called GenderM, this will take a value of 1 if the child is male and a value of 0 if the child is female. So in particular, we can use the IF function in Excel. And so I can say IF the value in column D is equal to male, then I want the value 1 and otherwise, I want the value 0. With this formula now, I'll cut and paste it into all the rest of the column. This is a fairly large data set. And so we'll see, it goes down to row 899. So now I've created a contiguous block of my explanatory variables and I can use this to make a regression. My Y range, it's going to be Height. It's going to be going from E2 to E899. My X range, is going to be Kids, number of kids in the family, the height of the Father, the height of the Mother, and the Gender, and this is going to go from F2 to I899. The Output Range, we'll need to move over compared to where it was last time And so, here Excel gives us a response. R square, 0.64. 893 degrees of freedom and some coefficients here. Variable 1 is the Kids, Variable 2 is the height of the Father, Variable 3 is the height of the Mother, Variable 4 is the Gender. We can see that for parents' heights and for gender, these are highly significant and important variables. The coefficient is much larger than the standard error. On the other hand, for the size of the family, the coefficient is not much larger than the standard error. So this is a slope that's very close to zero. And we might say for a frequentist, that it's not statistically significant. In general, we don't want to include extra variables that are not helpful with the prediction because they tend to increase the predictive error. So let's fit this again but without the number of kids in the family. Here I'll change the X input range to start in column G. And we'll overwrite. Now we can see our R square is almost the same, still about 0.64. But that all of our coefficients are important for the model. All of the values are much larger than their standard errors and absolute value. We can see that on average, each extra inch of height of a father results in about four-tenths of an each of height in the child. Each extra inch of height in a mother results in a little over three-tenths of an inch of height for the child. And then on average, boys are 5.2 inches taller than girls. Excel gives us a 95% posterior interval for this difference in height by gender. It goes from 4.94 to 5.51. This would be a 95% posterior probability interval if we're doing a Bayesian reference analysis, but it is also the same as the frequentist confidence interval.