This equation makes a lot of sense when we're working with a quantitative explanatory variable and a quantitative response variable. But what about a categorical explanatory variable and a quantitative response variable? It wouldn't make much sense, for example, to create a scatter plot with gender as the predictor variable. However, a regression model will still be informative. Let's look at the output testing the linear relationship between depression and number of nicotine dependence symptoms, where major depression is a binary categorical explanatory variable and number of nicotine dependence symptoms, ranging from zero to seven, is a quantitative response variable. Our research question is: is having major depression associated with an increased number of nicotine dependence symptoms? In this code, the response variable comes first, then the explanatory variable.
>> Returning to the Python script for the NESARC dataset, we will again use the smf.ols function, and we will assign the model an object name of reg1. Notice that both variables in my model, separated by a tilde, are included within quotation marks. The response variable comes first, before the tilde, and the explanatory variable comes after the tilde. Then, after a comma, the code ends with the specification of the dataset, and .fit() after the closing parenthesis. This tells Python to fit the model and calculate fit statistics for the formula in the parentheses. As always, we need to tell Python to print the output using the print function. In the IPython window, we see the same output format as with the gapminder regression example. We see the name of our response variable and the number of observations with complete data that were used in the model. And here are our parameter estimates and p-values. The parameter estimate for MAJORDEPLIFE is 1.36 and is statistically significant. The intercept is 2.19. Thus we know that our equation is NDSymptoms = 2.19 + 1.36 x MAJORDEPLIFE.
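The model call described above can be sketched roughly as follows. This is a minimal illustration, not the course script: the NESARC data are not reproduced here, so the data frame (given the hypothetical name sub1) is filled with synthetic values, and the coefficients it produces will not match the 2.19 and 1.36 reported in the lecture. Only smf.ols, reg1, NDSymptoms, and MAJORDEPLIFE come from the transcript.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the NESARC subset (assumption: NOT the real data).
rng = np.random.default_rng(0)
n = 500
majordep = rng.integers(0, 2, size=n)  # 0 = no major depression, 1 = major depression
noise = rng.normal(0, 1.5, size=n)
# Symptom counts are rounded and clipped to the 0-7 range described in the lecture.
symptoms = np.clip(np.round(2.0 + 1.5 * majordep + noise), 0, 7)
sub1 = pd.DataFrame({"MAJORDEPLIFE": majordep, "NDSymptoms": symptoms})

# Response variable before the tilde, explanatory variable after it;
# the dataset follows the comma, and .fit() estimates the model.
reg1 = smf.ols("NDSymptoms ~ MAJORDEPLIFE", data=sub1).fit()
print(reg1.summary())

# The two parameter estimates that define the regression equation.
intercept = reg1.params["Intercept"]
slope = reg1.params["MAJORDEPLIFE"]
print(f"NDSymptoms = {intercept:.2f} + {slope:.2f} x MAJORDEPLIFE")
```

With a 0/1 explanatory variable, the intercept is the estimated mean of the group coded zero, and the slope is the difference between the two group means.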
>> Let's consider what this equation actually means, since it's not the best fit line of a scatter plot. We know that MAJORDEPLIFE is our depression variable, and it takes on the value zero if the individual does not have major depression and the value one if the individual does have major depression. Thus we can plug the values zero and one into our MAJORDEPLIFE variable to get the expected number of nicotine dependence symptoms for each group.
>> As we can see, we would expect daily smokers without depression to have 2.19 nicotine dependence symptoms and daily smokers with depression to have 3.55 nicotine dependence symptoms. Remember that we previously subset our data to daily smokers aged 18 to 25.
>> Notice that these are also the mean numbers of nicotine dependence symptoms for each group, which we can see by running summary statistics using the groupby function. To do this, we'll add syntax that creates a data frame that includes only the variables from our regression model, NDSymptoms and MAJORDEPLIFE. Then we use the groupby function to estimate the means for each level of MAJORDEPLIFE in ds1 and the standard deviations for each level of MAJORDEPLIFE in ds2. We can also graph the means using the seaborn.factorplot function.
>> There are a lot of factors that contribute to Internet use rate and to nicotine dependence, the response variables in each of my examples. If we had more information and included those other factors in our model, it is quite possible that our expected values would be even closer to our observed values. We could include several explanatory, or predictor, variables in our model in order to evaluate the independent contribution of each explanatory variable in predicting our response variable, and also to evaluate whether specific variables confound the relationship between our explanatory variable of interest and our response variable.
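Returning to the expected values and group means discussed earlier, the sketch below first plugs zero and one into the reported equation, then shows the shape of the groupby summary. The coefficients 2.19 and 1.36 and the names ds1 and ds2 come from the lecture; the data frame names sub1 and sub2 and the synthetic data are assumptions for illustration, so the printed means will not equal the NESARC values.

```python
import numpy as np
import pandas as pd

# Coefficients reported in the lecture's regression output.
intercept, slope = 2.19, 1.36

# Plug in 0 (no major depression) and 1 (major depression).
expected = {dep: round(intercept + slope * dep, 2) for dep in (0, 1)}
print(expected)

# Synthetic stand-in data (assumption: NOT the NESARC values).
rng = np.random.default_rng(1)
dep = rng.integers(0, 2, size=400)
sym = np.clip(np.round(2.0 + 1.5 * dep + rng.normal(0, 1.5, size=400)), 0, 7)
sub1 = pd.DataFrame({"NDSymptoms": sym, "MAJORDEPLIFE": dep})

# Keep only the two model variables and drop rows with missing values.
sub2 = sub1[["NDSymptoms", "MAJORDEPLIFE"]].dropna()

# Mean and standard deviation of symptoms for each level of MAJORDEPLIFE.
ds1 = sub2.groupby("MAJORDEPLIFE").mean()
ds2 = sub2.groupby("MAJORDEPLIFE").std()
print(ds1)
print(ds2)
```

In recent seaborn releases the plot the lecture calls factorplot is provided as catplot, so a bar chart of these means would likely use that name instead.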