So, in this section, we're going to look at estimating group odds and proportions using multiple logistic regression, and also at odds ratios for groups who differ in more than one predictor. After viewing this section, you will be able to estimate the odds of a binary outcome for a single group, based on a specific set of predictor values, or x values, using multiple logistic regression. This odds estimate can then be converted to an estimated proportion or probability of the group having the outcome. You'll also be able to estimate an odds ratio comparing the odds of a binary outcome for two groups who differ in more than one predictor. So, to remind you about the process we used for simple logistic regression, we're just going to extend it to multiple logistic regression, and it'll be a smooth transition. It's the same idea: we transform the log odds estimates we get from the multiple logistic regression equation into estimated proportions or probabilities. So, take a multiple logistic regression model of the form we've been looking at, where the log odds that y equals one is equal to a linear combination of the intercept, the slopes, and the x's. If you give me a specific set of x values, I can plug those into this equation and get a single value when I add up the intercept plus the slopes multiplied by their respective x's. That single number is the estimated log odds of the binary outcome occurring, or the log odds that y equals one. So, I can take that number, that log odds, and exponentiate it to convert it to the odds scale, and get the estimated odds of the binary outcome occurring. To get from the odds to the estimated proportion or probability, the linkage or formula I use is the one we used before: the estimated probability, or p hat, equals the odds over one plus the odds. Again, we define the odds as p hat over one minus p hat, and if you solve that backwards in terms of p hat, p hat equals the odds over one plus the odds.
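The three-step process just described (linear combination, exponentiate, odds over one plus odds) can be sketched in a few lines of Python. This is only an illustration; the function names `linear_predictor` and `log_odds_to_probability` are made up for this sketch and do not come from any statistics package.

```python
import math

def linear_predictor(intercept, slopes, xs):
    """Estimated log odds: intercept plus each slope times its x value."""
    return intercept + sum(b * x for b, x in zip(slopes, xs))

def log_odds_to_probability(log_odds):
    """Convert a log odds to a probability by way of the odds."""
    odds = math.exp(log_odds)    # log odds -> odds
    return odds / (1.0 + odds)   # odds -> estimated proportion (p hat)

# A log odds of zero means odds of one, which is a probability of one half.
print(log_odds_to_probability(0.0))  # 0.5
```

Any set of x values can be run through `linear_predictor` first and the result passed to `log_odds_to_probability`.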
So, let's look at predictors of obesity. Let's look at predicting the proportion of obese individuals as a function of this second multiple logistic regression model, where we take into account sex, HDL level, and age quartile. The underlying regression model for model two is as follows: the log odds of obesity equals the intercept of 0.87, plus the slope for sex of 0.78 times x_1, plus the slope for HDL of negative 0.044 times x_2, plus the slopes for the three non-reference age quartiles. So, x_3 through x_5 are indicators of age quartile two, age quartile three, and age quartile four. You could get these numbers, if you wanted, by going back to that previous table and taking the log of the respective values, but here I give them to you. So, I want you to use this result to estimate the proportion who are obese among adult females with HDL levels of 75 milligrams per deciliter who are 65 years old, which puts them in the fourth age quartile. So, what we're going to do is use this equation and just plug in. We've got numbers for the intercept and slopes, so we just plug in our x values. The group is female, so their value of x_1 is one. So, we take that slope of 0.78 times one, and add it to the intercept of 0.87. Their HDL level is 75 milligrams per deciliter, so we take the slope for HDL of negative 0.044 times 75. And they're in the fourth age quartile, so the indicators for quartile two and quartile three are both zero, and the indicator for quartile four is one. So, we take the slope for the quartile four indicator, which is 0.47, multiply it by one, and add all these things up. If you do the math, you get a log odds of obesity for this group of negative 1.18. So, we add these things up to get the log odds of negative 1.18, and we exponentiate that to get the odds of obesity for this group of females, which is equal to 0.307.
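As a check on the arithmetic, here is a minimal Python sketch of that plug-in step. It uses the coefficients as read out in the lecture, with 0.47 for the fourth-quartile slope, which is the value that reproduces the stated log odds of negative 1.18. The variable names are just for illustration.

```python
import math

# Coefficients on the log odds scale, as quoted in the lecture.
intercept = 0.87
slope_female = 0.78    # x_1 = 1 for females, 0 for males
slope_hdl = -0.044     # per 1 mg/dL of HDL
slope_age_q4 = 0.47    # indicator for the fourth age quartile

# Group: female, HDL of 75 mg/dL, fourth age quartile.
# The quartile two and three indicators are zero, so those terms drop out.
log_odds = intercept + slope_female * 1 + slope_hdl * 75 + slope_age_q4 * 1
odds = math.exp(log_odds)

print(round(log_odds, 2))  # -1.18
print(round(odds, 3))      # 0.307
```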
Then, to transform this into the estimated proportion or probability of this group who are obese, we take the estimated odds over one plus the odds, 0.307 over 1.307, which turns out to be 0.234, or about 23.4 percent. So, now we've got an absolute probability value to understand where we stand in terms of the risk of obesity, at least for this single group defined by sex, HDL, and age. I want to show you something, and I don't want you to be intimidated by the math. It's actually not so bad, but it looks like a lot here. I just want to show you this because a lot of times, and I'll come back to this, if you're reading a journal article, they won't present things on the log or regression scale; everything will be exponentiated, in odds and odds ratio form. You could take the log of those values to recreate the equation, and if you're interested in estimating different proportions or probabilities from the published results, do what we just did. But I want to show you something. Suppose I looked at this log odds of obesity, and instead of summing it up to get the negative 1.18, I exponentiated the sum written out in its components, so I took e to the sum without combining it all into e to the negative 1.18. It would be e to the first element of the sum, 0.87, times e to the 0.78, times e to the negative 0.044 times 75, times e to the 0.47. This entire product would give us the odds of obesity for this group. Another way to write this out is e to the 0.87, times e to the 0.78, and then, to disentangle this piece here of e to the negative 0.044 times 75, that can be rewritten as e to the negative 0.044, raised to the 75th power, and all of that times e to the 0.47. So, why would I do this? Well, I'll show you explicitly in a minute, but this e to the 0.87 is the exponentiated intercept; 0.87 was the intercept from the model.
This is the exponentiated slope for sex, so this is the adjusted odds ratio for sex. This in parentheses here is the exponentiated slope for HDL, so this is the adjusted odds ratio for HDL, and then we raise that to the 75th power. Then this is the exponentiated slope for age quartile four, so this is the adjusted odds ratio for age quartile four. So, we can actually compute these odds directly on the odds and odds ratio scale, without going back to the log scale, when we are pulling the results from a published paper that already presented them in exponentiated format. So, for example, if we were looking at a table in a published paper that presented things on the odds and odds ratio scale, we could do this right from those exponentiated components if we wanted to. Now, I'll be honest, what I would do if I wanted to make computations is take the time to rewrite this on the log scale, do the addition, and then exponentiate the result, because I find that more comforting and intuitive. But I'm just illustrating that if you did have things exponentiated to start with, you wouldn't necessarily have to go back to the log scale. So, the exponentiated intercept is the baseline odds of 2.38. We multiply it by the adjusted odds ratio for females of 2.18. We take the adjusted odds ratio for HDL of 0.957 and raise it to the 75th power, because we're evaluating it for persons with an HDL of 75, and then we multiply by the adjusted odds ratio for age quartile four. If we do this, we get the exact same odds that we got from adding things first and then exponentiating, 0.307, and we'll end up with the same estimated probability or proportion. I just want to make you aware that you could do this if you wanted to. In terms of presenting things on the probability or proportion scale, sometimes a publication will include a graphic showing predicted probabilities for some or all groups defined by specific predictor values.
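Here is a short Python sketch of that multiplicative route, starting from the exponentiated intercept and the adjusted odds ratios rather than from the log scale. For illustration the odds ratios are recomputed from the quoted slopes; with a published table you would type in the reported values instead, and because the HDL odds ratio is raised to the 75th power, rounding in that reported value can shift the answer a little.

```python
import math

# Exponentiated components (lecture quotes 2.38, 2.18 and 0.957 for the first three).
baseline_odds = math.exp(0.87)   # exponentiated intercept
or_female = math.exp(0.78)       # adjusted odds ratio, female vs male
or_hdl = math.exp(-0.044)        # adjusted odds ratio per 1 mg/dL of HDL
or_age_q4 = math.exp(0.47)       # adjusted odds ratio, quartile four vs quartile one

# Female, HDL of 75 mg/dL, fourth age quartile: multiply instead of add.
odds = baseline_odds * or_female * or_hdl ** 75 * or_age_q4
probability = odds / (1 + odds)

print(round(odds, 3))  # 0.307, same as exponentiating the summed log odds
```

With these rounded coefficients the probability comes out near 0.235, in line with the roughly 23.4 percent quoted in the lecture.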
So, one way to actually present these proportions, and give some absolute context to things that we've only measured on a relative scale with the odds ratios presented in that original table, would be to look at the estimated proportion or probability of being obese. One way to present it when we have these three predictors would be to plot these curves separately by sex, with females on the left-hand side and males on the right, and look at the estimated proportions as a function of HDL, separately for each of the four age quartiles. This is maybe a little bit of a cluttered graph, but at least it gives us some context for what these relative comparisons mean in terms of absolute changes. We can see that on the lower end of HDL, for those with relatively low values, the probabilities or proportions are on the order of 60 percent to 80 percent, but this drops relatively quickly. That roughly 4 percent decrease in the odds per one milligram per deciliter increase in HDL translates to a pretty rapid decrease in the proportions or probabilities. You can see the relative ranking of the age groups in terms of who's got the higher starting odds versus the lower; certainly the group 47 to 62 years old has the highest starting odds of the four groups. Then we can see these are put on the same scale, so this is the graph for males, and we can see that everything is shifted down slightly compared to females, because we saw that females had higher odds even after accounting for age and HDL, hence females have higher estimated probabilities or proportions. This is a nice way to give some absolute context to these relative quantities, so that the reader can get a feel for what the risk of this outcome is, in this case the proportion of persons who are obese, as a function of the predictors used in the multiple regression model.
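To get a feel for how curves like these are produced, here is a rough Python sketch that tabulates predicted proportions over a grid of HDL values. Several things here are assumptions for illustration: the function name `obesity_probability`, the HDL grid of 20 to 100, and the choice to show only the third age quartile, using the slope of 0.75 that is quoted later in the lecture. The second-quartile slope is not read out, so it is left out rather than guessed.

```python
import math

def obesity_probability(female, hdl, q3=0, q4=0):
    """Predicted proportion obese from the quoted model-two coefficients."""
    log_odds = 0.87 + 0.78 * female - 0.044 * hdl + 0.75 * q3 + 0.47 * q4
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Illustrative HDL grid (mg/dL); third age quartile, females versus males.
for hdl in (20, 40, 60, 80, 100):
    p_female = obesity_probability(1, hdl, q3=1)
    p_male = obesity_probability(0, hdl, q3=1)
    print(f"HDL {hdl:3d}: female {p_female:.2f}, male {p_male:.2f}")
```

The odds ratio between females and males is the same at every HDL value, but the gap in predicted proportions is not, which is exactly the absolute context a plot like this adds.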
So, let's talk about using the results to make comparisons on the odds ratio scale between two groups who differ by multiple characteristics used in the model. So, again, this is our model here: x_1 was one for female and zero for male, x_2 was HDL in milligrams per deciliter, and x_3 to x_5 were the indicators of Q_2 to Q_4 in terms of the age quartiles, where the reference that was left out was quartile one. Let's estimate the odds ratio of obesity for females with HDL of 75 milligrams per deciliter who are 65 years old, that's the group we just looked at, compared to males with HDL of 80 milligrams per deciliter who are 50 years old. So, I'm going to write these things out on the log odds scale for both groups. The first line here is what we just did before when getting that probability. Ultimately, we plugged in our values of female, 75, and age quartile four to get an estimated log odds for that group of negative 1.18. We do the same thing for the second group, males with an HDL of 80 milligrams per deciliter who are 50 years old. We start with the intercept; they are males, the reference sex, so their value of x_1 is zero; we plug in 80 for HDL and multiply it by that slope of negative 0.044 for HDL. At 50 years old, if you look it up, they're not in the fourth but in the third age quartile. So, we add in the slope for the third age quartile, 0.75, times one, because their indicator is activated for the third quartile, and if we sum these up we get negative 1.90. So, if you take the difference in the log odds between these two groups, negative 1.18 minus negative 1.90, it turns out to be a positive difference of 0.72.
I want you to notice, though, what happens if we take these things piecewise, if I look at the difference by each component before adding them up in both groups. The intercept is the same in both groups, so it cancels. The piece for sex comes down to 0.78 times the difference of one for the first group of females minus zero for the second group, because they're males. The slope for HDL ultimately gets multiplied by the difference in the two HDL values, 75 minus 80. And then we have to take the slope for the first group's age quartile, 0.47 for quartile four, and subtract the slope for the second group's age quartile, 0.75 for quartile three; this is more clearly written here. If you do it piecewise like that, you get the same result as if you added both up and then took the difference at the end, this 0.72. So, this 0.72 is the difference in the log odds of obesity for the first group compared to the second. So, if we wanted the odds ratio, we would exponentiate this: e to the 0.72 is equal to 2.05. So, the first group has slightly over two times the odds of obesity compared to the second group; that's 105 percent greater estimated odds. So, not to belabor this, but again, I just want to show you what we could do if results were presented on the already exponentiated scale. We wouldn't necessarily have to take things back to the log scale, the regression scale, to do this. I personally would, because I find it easier to keep track of things, but I just want to throw this out there for those of you who would like to try something different. Use whatever approach you're comfortable with when doing it in real life. But we said that we could write the log odds ratio, the difference in the log odds, piecewise. The intercept canceled, and we had the slope for sex of 0.78 times the difference in the sex values, females in the first group coded as one compared to males in the second coded as zero. We have that slope for HDL times the difference in HDL values, 75 minus 80, et cetera.
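Both routes to the log odds ratio, differencing the two summed log odds or working piecewise with the intercept canceled, can be checked with a small Python sketch. The coefficients are the ones quoted in the lecture; the variable names are only for illustration.

```python
import math

intercept = 0.87
b_female, b_hdl, b_q3, b_q4 = 0.78, -0.044, 0.75, 0.47

# Route 1: full log odds for each group, then take the difference.
log_odds_g1 = intercept + b_female * 1 + b_hdl * 75 + b_q4   # females, HDL 75, quartile four
log_odds_g2 = intercept + b_female * 0 + b_hdl * 80 + b_q3   # males, HDL 80, quartile three
diff_total = log_odds_g1 - log_odds_g2

# Route 2: piecewise; the intercept cancels and only differences in x values remain.
diff_piecewise = b_female * (1 - 0) + b_hdl * (75 - 80) + b_q4 - b_q3

odds_ratio = math.exp(diff_piecewise)
print(round(diff_total, 2), round(diff_piecewise, 2), round(odds_ratio, 2))  # 0.72 0.72 2.05
```

Notice that the piecewise route never touches the intercept.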
So, I'll just make it a little less complicated over here. I'll simplify it as 0.78, plus negative 0.044 times negative five, plus 0.47, the slope for quartile four for the first group, minus 0.75, the slope for age quartile three for the second group. As we saw before, this sum is 0.72, the same thing we would get if we had added up the complete predicted log odds for both groups and then taken the difference. We could directly exponentiate that 0.72 to get the odds ratio, but let's look at what happens if we write it piecewise and exponentiate the sum. So, it's e to the 0.78, plus negative 0.044 times negative five, plus 0.47, et cetera. We could rewrite this as e to the 0.78, which is just the adjusted odds ratio for sex, times e to the negative 0.044 times negative five, times e to the 0.47, times e to the negative 0.75. We could rewrite that slightly to re-express the second term: e to the negative 0.044 times negative five is e to the negative 0.044, raised to the negative fifth power. So, now this whole thing is presented in terms of the exponentiated slopes, or the adjusted odds ratios. So, e to the 0.78 is the adjusted odds ratio for sex. This e to the negative 0.044 is the adjusted odds ratio for HDL, and then we raise that to the difference in HDL between the two groups of negative five. Then e to the 0.47 is the adjusted odds ratio for age quartile four, and e to the negative 0.75 is one over the adjusted odds ratio for age quartile three, so we are dividing by that odds ratio. If we do this, we get 2.05, which is exactly what we got when we summed it up first and then exponentiated. Again, I only point this out because some of you may prefer it when you're looking at published papers. If you want to do such a comparison, you don't want to take things back to the log scale, and things are presented on the odds ratio and odds scale by exponentiation, you could do this directly in terms of those adjusted odds ratios.
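The same comparison done entirely on the odds ratio scale might look like the Python sketch below. The adjusted odds ratios are recomputed here from the quoted slopes purely for illustration; from a published table you would enter the reported odds ratios directly.

```python
import math

# Adjusted odds ratios (exponentiated slopes).
or_female = math.exp(0.78)
or_hdl = math.exp(-0.044)     # per 1 mg/dL of HDL
or_age_q3 = math.exp(0.75)
or_age_q4 = math.exp(0.47)

# Females, HDL 75, quartile four versus males, HDL 80, quartile three.
# Differences in x become powers; the second group's quartile enters as a division.
odds_ratio = or_female * or_hdl ** (75 - 80) * or_age_q4 / or_age_q3

print(round(odds_ratio, 2))  # 2.05
```

No intercept, exponentiated or otherwise, appears anywhere in this calculation.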
You can compare two groups and get the odds ratio for two groups who differ by more than one predictor. It's just going to be a function of the adjusted odds ratios that uses multiplication and division, and division is just a form of multiplication. Let's look at predictors of breastfeeding, and let's look at the results from a multiple logistic regression. Let's scale it back, since parity and mother's age didn't appear to have any association even after adjustment for sex and age of the child. Let's just look at sex and age of the child. Again, sex was not statistically significant, but nevertheless we might want to compare our predicted proportions to other studies where they've included sex as a predictor. So, here is the regression equation that underlies the results presented. The three quantities presented were exponentiated: the exponentiated intercept was 1,750, so the intercept itself is 7.47. The slope for sex, which is coded one for females, is negative 0.27, and the slope for age, which is in months, is negative 0.24. So, we could use this, for example, to estimate the proportion of 22-month-old female children who were breastfed. So, to start, we have an equation where we can plug in our x values and crank out the numbers to get the log odds of being breastfed for this group. If we do the math, the log odds of being breastfed for this group is equal to 1.92. We would exponentiate to get the odds for this group. If you exponentiate 1.92, it's 6.82, and then we can convert that to a proportion or probability by taking the odds of 6.82 divided by one plus itself, or 6.82 over 7.82, for an estimated proportion of 87 percent. We estimate that 87 percent of 22-month-old female children are breastfed in the population of Nepalese children from which our sample of 192 was taken. Again, sometimes a publication doing logistic regression, when it's appropriate to show predicted probabilities, will include a graphic like this.
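The breastfeeding calculation for 22-month-old females can be reproduced with a few lines of Python, using the intercept and slopes quoted in the lecture. The variable names are only for illustration.

```python
import math

intercept = 7.47
slope_female = -0.27   # x = 1 for females, 0 for males
slope_age = -0.24      # per month of age

# Group: female children who are 22 months old.
log_odds = intercept + slope_female * 1 + slope_age * 22
odds = math.exp(log_odds)
probability = odds / (1 + odds)

print(round(log_odds, 2))     # 1.92
print(round(odds, 2))         # 6.82
print(round(probability, 2))  # 0.87
```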
So, we can see what the estimated proportion being breastfed is as a function of age. In this case, the model also included sex, so we can look at the curves separately by sex. One of the things to note here is that we saw the odds decreased substantially per month increase in age, but now you can see that over that two-year period from 12 to 36 months, for both boys and girls, the proportion being breastfed starts at close to 100 percent and drops to nearly 20 percent in both groups over only two years. So that really gives absolute grounding to the relative comparison we had in terms of age before, and you can see there's a slight differential between females and males that gets larger as a function of age, even though the odds ratio was constant across the entire age period. What if we wanted to compare that group of 22-month-old female children to 30-month-old male children, and get the odds ratio of being breastfed for those two groups? Well, we can write out the log odds for both groups and take the difference. To get the log odds for the second group, we just plug in their respective x values: zero because they are male, and 30 because they're 30 months old. The difference in the log odds here is 1.65: 1.92 is the log odds for the female 22-month-olds, and 0.27 is the log odds for the male 30-month-olds. Again, we could write that difference piecewise: the difference in the intercepts cancels, the difference in sex is multiplied by the slope for sex, and the difference in ages is multiplied by the slope for age. If you add those all up and do the math, it equals 1.65. That's our log odds ratio. So, if we exponentiate that, we get the relative odds of being breastfed for these two groups of 5.21. So, the first group of younger females has over five times the odds of being breastfed compared to the older group of males, and that difference is primarily being fueled by their age difference.
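Here is a minimal Python sketch of that comparison, again using the quoted breastfeeding coefficients, computing the log odds ratio both as a difference of the two log odds and piecewise.

```python
import math

intercept, slope_female, slope_age = 7.47, -0.27, -0.24

# Log odds for each group.
log_odds_f22 = intercept + slope_female * 1 + slope_age * 22   # females, 22 months
log_odds_m30 = intercept + slope_female * 0 + slope_age * 30   # males, 30 months

# Piecewise: the intercept cancels, leaving slopes times differences in x values.
log_or = slope_female * (1 - 0) + slope_age * (22 - 30)
odds_ratio = math.exp(log_or)

print(round(log_odds_m30, 2))                 # 0.27
print(round(log_odds_f22 - log_odds_m30, 2))  # 1.65
print(round(odds_ratio, 2))                   # 5.21
```

Of the 1.65 on the log scale, 1.92 comes from the eight-month age difference and negative 0.27 from sex, which is why the comparison is driven mostly by age.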
If you wanted to get a sense of what this translated to in terms of absolute differences, we've already computed the proportion being breastfed for the first group at 87 percent; we could do the same thing for the second group to get a sense of the difference in the estimated proportions, or the risk difference, for these groups. We could also, and I'm not showing this for the breastfeeding example, have started these computations with the exponentiated versions of things, the exponentiated intercept and the exponentiated slopes, or adjusted odds ratios. If you're interested in trying that, you can go ahead and see if you can make it work for these data as well. So, in this section, we've seen examples of using the results from multiple logistic regression to estimate proportions or probabilities of a binary outcome occurring for specific groups, as defined by a specific set of predictor values. When we have multiple regression, we have multiple possibilities for different predictor values. These estimates can be computed on the regression scale, because they end up being a linear combination of the intercept and slopes, and then transformed to the odds and then the probability scale. Or, as we showed with the obesity example, we can start with the exponentiated versions, meaning the baseline or reference odds, which is the exponentiated intercept, and the adjusted odds ratios, which are the exponentiated slopes, and multiply and divide these to get the proportion or probability of interest. To be clear here, and we'll look at some examples from the literature in a subsequent section of this lecture set, not every published logistic regression will give the exponentiated intercept, the starting, baseline, or reference odds, as we may call it. If that's the case, we can't actually use the published results to estimate probabilities or proportions, because that intercept is key to doing this.
The results from multiple logistic regression, as we've seen, can also be used to estimate odds ratios comparing the odds of the outcome for groups who differ by more than one predictor. So, we're not stuck with just the adjusted odds ratios when we interpret results. Remember, those only compare groups who differ by one characteristic, and are therefore adjusted for everything else. But we are not limited to that; we can estimate odds ratios comparing the odds of the outcome for groups who differ by more than one predictor. These can be computed on the regression scale, because they are a linear combination of the slopes, the adjusted log odds ratios, and then exponentiated back to the ratio scale. Or, if we have the adjusted odds ratios themselves, we can multiply and divide these to get the comparison of interest, and we showed an example of that. Again, I prefer to go back to the log scale, do my computations, and then exponentiate them, but I just wanted to give that as an option for those interested. We don't need the exponentiated intercept to do this, or the intercept if things are presented on the regression scale. So, that means that in most publications, even if the intercept or exponentiated intercept is not provided in the published results, we can make these comparisons. I didn't show you the results of this, but you should know that confidence intervals can be computed for each of the above quantities, both the estimated probabilities or proportions and the odds ratios comparing groups who differ by more than one predictor. It needs to be handled by the computer, and it's a somewhat complicated computation because, as we've seen, on the regression scale the estimate for a proportion is a linear combination of the intercept and slopes, and the estimate for an odds ratio comparing groups who differ by multiple predictors is a linear combination of the slopes.
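For those curious what the computer is doing, one standard approach is a Wald-type interval built on the log odds scale and then transformed. The sketch below is only that, a sketch under assumptions: the symbols $c$, $\hat\beta$, and $\hat\Sigma$ are notation introduced here, and the estimated covariance matrix $\hat\Sigma$ comes from the software and is not something reported in this lecture.

```latex
% Assumed notation (not from the lecture):
%   \hat\beta  : vector of the estimated intercept and slopes
%   \hat\Sigma : estimated covariance matrix of \hat\beta, from the software
%   c          : for one group, (1, x_1, ..., x_p);
%                for a comparison of groups A and B, (0, x_1^A - x_1^B, ..., x_p^A - x_p^B)
\[
\hat\theta = c^{\top}\hat\beta,
\qquad
\widehat{\mathrm{SE}}(\hat\theta) = \sqrt{c^{\top}\,\hat\Sigma\,c}
\]
% 95% interval on the log scale, then exponentiated to the odds or odds ratio scale
\[
(L,\,U) = \hat\theta \mp 1.96\,\widehat{\mathrm{SE}}(\hat\theta),
\qquad
\text{odds or odds ratio: } \left(e^{L},\; e^{U}\right)
\]
% For a single group's proportion, transform each endpoint of the odds interval
\[
\text{proportion: } \left(\frac{e^{L}}{1+e^{L}},\; \frac{e^{U}}{1+e^{U}}\right)
\]
```

The square root involves the covariances between the coefficient estimates, not just their individual standard errors, which is why this cannot usually be done from a published table alone.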
So, the standard error is going to be a complex computation, and then the result has to be transformed on top of that, but it can be done. In the next lecture section, we'll discuss the underlying linearity assumption when we have continuous predictors in a multiple logistic regression, speak briefly about model selection ideas as we did with linear regression, and talk about the basics of prediction with multiple logistic regression and what it means to do prediction with such models.