This segment will demonstrate linear regression using Excel plus StatPlus free extra software. This is for versions of Excel where the data analysis tool kit is not available. We'll look at two examples of linear regression. The first is a simple linear regression where we only have one explanatory variable. And the second is the multiple regression. This first one is looking for space shuttle data. And so this is going back to the challenger disaster. This freely available data currently hosted at the University of Alabama, Huntsville. What this data shows is before the Challenger fatal flight there were 23 other flights. It has temperature and degrees farenheit and an index of the amount of O ring damage for that flight. You can see that with most flights there was no damage but on some flights there were two, four or 11 units of O ring damage. You can also see that the damage tends to be bigger for colder temperatures. And that there tends to be less damage at warmer temperatures. We'll highlight of the data and copy it here in the StatPlus. StatPlus will allow your regression. We want to predict the damage index. So that's our dependent variable, or response variable, predicting it from the temperature. That's our independent variable or explanatory variable. StatPlus gives us a new sheet with all the linear regression results. You can see we have an r-square of 0.41. Standard error of regression of 2.1. There's 21 degrees of freedom. The estimated intercept is 18.365. The estimated slope is minus 0.243. This estimated slope that says approximately for every four degrees colder at launch we expect to see one additional unit of damage to the O rings. We can look at what's a 95% posterior probability for the slope. StatPlus gives us this actual interval. Interval goes from minus 0.375 to minus 0.111. We could also compute this by hand. Let's just look at the lower limit of this 95% interval. We could see this as The fitted value minus the standard error times. The value from a t distribution, with probability 0.975. And degrees of freedom. The degrees of freedom from the regression in this case 21. Here we can see that doing this by hand gets the same answer as in the table. This is a 95% post to your probability interval using. A reference Bayesian analysis or reference prior. This turns out to be the same interval as we get as a frequentist, creating a ninety-five percent confidence interval. But using a reference Bayesian analysis, we can give this a full probability interpretation, saying that we have a 95 posterior probability, that the true value of the slope will be in this interval. Suppose we want to predict the actual temperature at launch for Challenger that day was 31 degrees Fahrenheit. That's much colder than the coldest flight previously. We'd like to predict how much damage do we expect to see at 31 degrees Fahrenheit. So this predicted value is going to be the fitted intercept plus the fitted slope times this input value, in this case 31 degrees. We can see then that the predicted amount of damage is 10.82 units. This is relatively large, about as big as anything else we've seen in the data set. We can ask, what's a 95% posterior predictive interval? So we're thinking about posterior predictive distribution. Looking at it a new observation. want to ask what's a new interval, such that there's 95% posterior probability that the new observation will fall in that interval. This will be a t distribution that centered at our prediction of 10.82. And while the scale based on the standard error of regression and the rest of the formula that's available in the notes online associated with this lesson. This formula uses partly the distance from this new point to the rest of the data. So our new value of 31 relative to the other data. So we're going to go back to the first sheet where we have the rest of this data and so we can access that. Here's the lower interval. And here's the upper end of the interval. So we can see that a 95% posterior predictive interval runs from about 4.05 to 17.59. Thus we say we have a 95% posterior predictive probability of getting an observation between 4.05 and 17.59 minutes of O ring damage. It's seems pretty likely that there will be a significant amount of damage to the O rings if we were to launch at thirty-one degrees Fahrenheit. Note that again, under a reference Bayesian analysis, this is the same interval as we get as being a frequentist and asking it's a predictor frequentist interval. One thing we can do differently as a Bayesian that's difficult to do as a frequentist is we can ask about poster probabilities. For example, we can ask what's the poster probability that the damage index will be greater than zero for a new observation of a launch 31 degrees Fahrenheit. So again, we're going to use the poster predicted distribution which is a T distribution. To t-distribution with center 10.82, and this somewhat complicated scale factor. So we paste this formula in and in this case, some versions of StatPlus might work. Another version of this may throw an error code. Currently some version of StatPlus, it gets cranky if we have a negative number going into this T-distribution function. T-distribution is symmetric and it should work for a negative number but in this case it doesn't seem to be working. So instead of taking one minus the cumulative at the negative value, we'll look at the value up to 10.82 in this case because it's a symmetric distribution. So we'll just flip this around, we'll get the same answer in this case. So here we can see the posterior probability of getting a damage index greater than zero at 31 degrees Fahrenheit is 0.998. Probability is almost one that we'll have damage if we launch at 31 degrees Fahrenheit. All right, let's move on to a second example of regression. This one is a classic data set back to the 19th century where Galton was trying to predict the heights of children from the heights of the parents Here we have data from a bunch of different families. Some families have more kids than others. This first family had four kids. This family had four, this family had two. We can see for each child, we have the height of the father in inches, the height of the mother in inches, the gender of the child, and the eventual height of the child in inches, as well as the number of kids in the family. So we want to be predicting the height of the children from the heights of the parents. And potentially also the gender and the number of kids in the family. We're going to need to do a little manipulation to this data set before we can stick into the regression. So I'm going to do that first in Excel. In particular, the issue is that gender is coded m and f for male and female, but in terms of running a regression, we're going to need numerical values. So I'm going to create a new variable called gender m and I'm going to make this have the code of one if the gender is m and have the code of zero if the gender is f. So I pasted in this formula using the if command in Excel. And now I can copy and paste this into the rest of the dataset. Turns out there are 899 rows in this file. So we'll go down, we'll paste this into the rest. But the key is, we need to translate the M's and the F's into ones and zeros. So that then Excel or StatPlus can do the regression, because they're both expecting numerical values they can't deal with the categorical variables of m and f. Okay, so now I have zeros and ones. And I'm going to do my regression again is StatPlus. So I will open up a new sheet in StatPlus. I can paste in this dataset. I can now do a regression. I want to predict the height for the children based on the father's height, the mother's height by numerically coded gender variable and the number of kids in the family So here we find our relative regression. We get an r square of 0.64. Standard error regression of 2.15. 893 degrees of freedom. And here are our coefficients. That intercept of 16.18. Slope coefficients for each of the explanatory variables. Standard errors for each of those slopes. You can see that for most of them the coefficient is much larger than the standard error. But in the case of the size of the family, the number of kids in the family. The coefficient is not much larger than the standard error. In fact if we were frequentist we might do a T test and say is this variable significant. In this case we'd fail to reject it, say this variable is not particularly significant. That is basing we might just look at the coefficient relative to standard error and see that it's posterior interval Includes zero. We generally try not to include variables that are not adding much to the regression because they don't help with prediction, and they actually increase the variability of our prediction. So in this case, I'm going to go back and I'm going to fit the regression again without using the number of kids in the family. So predicting the height of the children based on the height of the father, the height of the mother, and the numerically coded gender variable. So now I get basically the same r square of 0.64, same standard error regression 2.15. We have one more degree of freedom, because we have one fewer explanatory variables. Now the intercept on all the slope coefficients are much larger than their standard errors. We can see that the fitted slope for the father site is 0.4. It says that for every extra inch taller a father is and it's correlated to another 0.4 inches of height in the child. Each extra inch of height in mother is correlated with little more than three tens of an inch of height in the child. Then on average, boys grow up to be 5.2 inches taller than girls do. The table gives us a 95% posterior probability interval for the difference in heights by gender. We can see that this interval goes from 4.94 to 5.51. So we can say there is a 95% posterior probability the difference in height by gender is in this interval, under a standard reference basing analysis. This is also, then going to be the same interval we'd get under frequentest confidence interval.