In this section, we'll talk about the linearity assumption in multiple logistic regression, and also talk briefly about prediction with multiple logistic regression models. What does it mean to predict? How do we measure that? We'll give some insight, a short overview. Upon completion of this lecture section, you will be able to speak about the linearity assumption as it relates to continuous predictors in multiple logistic regression, and to expect to see comments about investigating this linearity assumption in articles where continuous predictors are used in multiple logistic regression models. You'll also be able to discuss potential strategies for choosing a final, if you will, multiple logistic regression model among competing choices, and to comment on what prediction, and evaluating prediction, entails for logistic regression models.

So in order to talk about the linearity assumption, let's go back to our obesity and HDL cholesterol dataset. These were data from the National Health and Nutrition Examination Survey (NHANES) from 2013 to 2014, which also included lab results. The data include 10,000-plus observations on persons 0 to 80 years old, and we have 5,800 adults older than 18 years who also have body mass index and HDL cholesterol measurements. Obesity is defined by a BMI cut-off, and there's a lot of debate about whether this cut-off is the best measure of obesity, but nevertheless, that's the definition we've been using and will continue to use because it's the current standard. Simple logistic regression is an option for relating obesity to HDL cholesterol as a continuous measure, and we've already reviewed the linearity assumption of simple logistic regression models. So if we were to fit this simple logistic regression model with HDL as our predictor, the assumption we're making is that the relationship between the log odds of obesity and our continuous predictor, HDL, is linear in nature.

So again, we asked, how can we investigate this? One way to investigate this empirically and visually was a lowess smoothing plot: a method that breaks the HDL data up into windows, estimates the log odds of the outcome, obesity, within each window, plots that point at the midpoint of the window, then moves the window over and does it again. We get these locally estimated log odds over small ranges of HDL, the algorithm connects the dots, and we get an empirical, visual estimate of what that relationship looks like. We saw in this case that, for the most part, it was pretty reasonable to fit a line, and that's what we went ahead and did. We talked about the departure in the upper range of HDL levels and said it wasn't something to worry about. So we did our due diligence and proceeded with the linear fit. But what happens when we fit a multiple logistic regression model that includes not only HDL as a continuous predictor, but also biological sex and age? In this case biological sex is binary and age is categorical, but the bigger issue is that we now have a multiple logistic regression model where at least one of our predictors is continuous. So what is the linearity assumption when we've estimated this association after accounting for other factors in the same model? This assumption is a little more restrictive than it was before.
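To make the lowess idea concrete, here is a minimal sketch of how such an empirical check might be coded in Python, assuming the NHANES data sit in a pandas DataFrame called nhanes with a 0/1 column obese and a continuous column hdl (these names are hypothetical, not from the lecture):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Lowess-smoothed probability of obesity as a function of HDL.
smoothed = sm.nonparametric.lowess(nhanes["obese"], nhanes["hdl"], frac=0.3)
hdl_grid, p_hat = smoothed[:, 0], smoothed[:, 1]

# Convert the smoothed probabilities to log odds; an approximately straight
# line supports the linearity assumption of the simple logistic regression.
p_hat = np.clip(p_hat, 0.01, 0.99)   # keep the log odds finite at the extremes
log_odds = np.log(p_hat / (1 - p_hat))

plt.plot(hdl_grid, log_odds)
plt.xlabel("HDL (mg/dL)")
plt.ylabel("Estimated log odds of obesity")
plt.show()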
The assumption regarding HDL in this model is that the relationship between the log odds of obesity and HDL is still linear in nature, but here it's not the overall, unadjusted association; it's the association after adjusting for sex and age. More generally, the assumption for any continuous predictor in a multiple logistic regression model is that the adjusted association between the log odds of the outcome occurring and that predictor is linear in nature. So how are we going to evaluate this? Unfortunately, unlike with linear regression, where we had those adjusted variable plots that gave a visual of the relationship between our outcome and a predictor after adjusting for the other components of a model, there is no analog for multiple logistic regression. There's no adjusted lowess smoothing plot for multiple logistic regression, so we can't go back and create another lowess plot relating the log odds of obesity to HDL after adjusting for sex and age, for example. But an empirical way to check this is to categorize HDL, perhaps into quartiles, or by clinically relevant cut-offs, though quartiles are a good way to start, and check whether the slopes for the resulting ordinal categories (HDL is continuous, so putting it into quartiles yields categories of an ordinal nature) show evidence of a relatively constant change in the log odds, either increasing or decreasing, with increasing HDL.

So I'm going to go ahead and do this. We fit the following multiple logistic regression that now includes HDL as categorical (I did it in quartiles); biological sex is still binary and age is still categorical. Here's the resulting regression model. The first x term is for sex; the slope for sex is 0.76. The next three x terms, x_2, x_3, and x_4, are the indicators for HDL quartiles 2 through 4, so the reference is HDL quartile 1. The last three x's are the indicators for the age quartiles 2 through 4. Let's focus on the piece of the model with the HDL categories and see whether it roughly meets our expectation of what we'd see if the relationship were linear. So what do we see here? These are adjusted differences: the indicator for HDL quartile 2, the indicator for HDL quartile 3, and the indicator for HDL quartile 4, and the reference for these adjusted comparisons is HDL quartile 1. So the first slope measures the adjusted difference in the log odds of obesity for those in HDL quartile 2 compared to HDL quartile 1; that difference is negative 0.37. If we then look at the age- and sex-adjusted difference in the log odds of obesity for those in HDL quartile 3 compared to the same reference, it is negative 0.99. Finally, for those in quartile 4 compared to the same reference, it's negative 1.59. So the log odds certainly decrease with each increasing HDL category. If we look at the differences between contiguous categories: again, the difference between quartile 2 and the reference is negative 0.37; the difference between quartile 3 and quartile 2, taking the difference of negative 0.99 and negative 0.37, is negative 0.62; and the difference between quartile 4 and quartile 3, the log odds for those two groups, taking the difference of negative 1.59 and negative 0.99, is negative 0.60.
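As a rough illustration of this check, here is a sketch in Python of fitting the quartile version of the model and looking at the step-to-step changes in the adjusted log odds, again assuming a DataFrame nhanes with hypothetical column names obese, female, hdl, and age_cat:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Cut HDL into quartiles and fit the model with sex and categorical age.
nhanes["hdl_q"] = pd.qcut(nhanes["hdl"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
fit_cat = smf.logit("obese ~ female + C(hdl_q) + C(age_cat)", data=nhanes).fit()

# Adjusted slopes for HDL quartiles 2-4 versus quartile 1 (the reference).
hdl_slopes = fit_cat.params.filter(like="hdl_q")
print(hdl_slopes)

# Step-to-step changes: quartile 2 versus the reference, then each contiguous
# quartile versus the one below it. Roughly equal steps in one direction
# (the lecture's example gave about -0.37, -0.62, and -0.60) support treating
# HDL as linear in the log odds.
print(np.diff(np.concatenate([[0.0], hdl_slopes.to_numpy()])))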
So we're consistently decreasing: a slightly smaller drop of 0.37 in absolute value for the first step, compared to drops of 0.62 and 0.60 for the subsequent increases in ordinal category, but on the whole we see a consistent decrease, and the choice of quartiles here was arbitrary. I would suggest that these results, well, they don't show perfect evidence of linearity across the four quartiles, but they're a pretty strong indicator that we're not doing a disservice by estimating this adjusted association as linear. So based on this empirical result, I would go back and choose the model where we only needed one x for HDL, where we put it in as continuous. But if we saw something very different, then we might want to stick with the categories instead.

Generally speaking, suppose one wishes to assess whether the adjusted relationship between a binary outcome y and a continuous predictor, which we'll arbitrarily call x_i, is linear in nature for a multiple logistic regression of the form: the log odds of the binary outcome occurring, the log odds that y equals 1, equals a linear combination of an intercept plus slopes times the x's. The empirical approach involves re-estimating this model with a categorical version of x_i: for example, breaking x_i into quartiles, or some other categorization, maybe finer-grained quintiles or deciles, or maybe there are clinical or scientific cut-offs to use. The slopes for the ordinal categorical indicators can then be examined for evidence of a relatively consistent change in the log odds of the outcome with increasing ordinal category. Certainly, having equal-sized categories makes this easier to do, and that's why I generally suggest cutting into quantiles, whether they be quartiles or quintiles, et cetera. At a minimum you would expect to see a consistent change in one direction: the log odds either increasing or decreasing consistently with increasing ordinal category.

So what about when all the dust settles, you've checked for linearity, you've decided how to model the x's, and you've got some competing models? You may want to choose a final multiple logistic regression model to represent your analysis. So what is the final, best model? What constitutes the best model, just as in linear regression (and it will be the same for the other types of regression we do as well), depends on the goals of the research. If the goal is to maximize the precision of our adjusted estimates, maybe the best approach is to keep only those predictors that are statistically significant in the final chosen model. The reason is that we don't want to estimate extra things that aren't adding to our knowledge about the binary outcome above and beyond the other predictors: we'd have to estimate more slopes with the same amount of data, and that will compromise our precision on the slopes for the other predictors if the extra non-significant terms are not adding information back. If the goal is to present results comparable to results of other analyses presented by other researchers on similar or different populations, one might present at least one multiple regression model that includes the same predictor set as the other research, even if some of the predictors they used were not statistically associated with the outcome after adjusting for your other predictors.
So you may show that, and then also the model with only the statistically significant predictors, for comparison. If the goal is to show what happens to the magnitude of the association with different levels of adjustment, you could present the results from several models that include different subsets or combinations of adjustment variables. What happens to the relationship between obesity and HDL when we adjust for demographic characteristics? What happens when we adjust for lifestyle characteristics? What happens when we adjust for both together, et cetera? If the goal is prediction, well, as I said before with linear regression, it just gets more complicated. I'm going to give a very short conceptual overview of what prediction involves for logistic regression just to give you a heads up. It's more complicated and more nuanced than we can cover in this two-term course, but I want to give you an idea of how it can be evaluated.

Let's go back to the model we've been working with, the one we used to investigate the linearity assumption, where we relate the log odds of obesity to sex, HDL (which we're keeping continuous, as per our diagnostics), and age, which is categorized. We've seen how, given a set of x's, we can use this model to estimate a log odds for a particular group and then translate that back into a predicted probability of individuals in that group being obese, the predicted proportion in that group who are obese. For example, we did this when we looked at adult females with HDL of 75 milligrams per deciliter who are 65 years old, in the fourth age quartile, and when all was said and done, we estimated that 23.4 percent of that group would be classified as obese by the BMI cutoff.

But obesity is a binary outcome. So suppose we want to use predicted probabilities to predict for individuals not used in our dataset, based on their characteristics. If I had data on people whose height and weight I couldn't actually measure, but I knew their HDL level, their age, and their sex, how could I predict whether an individual was obese based on the predicted probability for people like them, in their same age, HDL, and sex group? Well, there's no logical way to quantify the proportion of variability in a binary outcome explained, so there's no R squared for logistic regression. There is something you'll sometimes see called pseudo R squared, but given that "pseudo" is the main adjective in that name, I would stay away from it. So to evaluate how well the results of a logistic regression model predict at the individual level, we don't have a measure like R squared, and we need to establish what we mean by predicting at the individual level. We want to predict, for example, whether somebody is obese or not by the BMI cut-off given their age, sex, and HDL. When we put age, sex, and HDL values into our logistic regression model fit on the NHANES dataset, we get a predicted probability that persons like this person are obese, and we need to use that to make a decision about whether this person is likely to be obese or not. For example, we could use this cutoff of 23.4 percent and apply it to everyone in the NHANES data used to estimate our model.
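To show what that calculation looks like in code, here is a minimal sketch, using the same hypothetical DataFrame and column names as above, of fitting the model with HDL kept continuous and obtaining the predicted probability for one covariate pattern (an adult female with HDL of 75 mg/dL in the fourth age quartile; the age-quartile label "Q4" is an assumed coding):

import pandas as pd
import statsmodels.formula.api as smf

# Model with HDL continuous, sex binary, and age categorical.
fit_cont = smf.logit("obese ~ female + hdl + C(age_cat)", data=nhanes).fit()

# One covariate pattern: female, HDL = 75 mg/dL, fourth age quartile.
new_person = pd.DataFrame({"female": [1], "hdl": [75], "age_cat": ["Q4"]})

# statsmodels computes the log odds for this pattern and converts it to a
# probability, p = 1 / (1 + exp(-log odds)); compare the lecture's worked
# value of about 0.234 for this group.
print(fit_cont.predict(new_person))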
We can go back and say, "We're going to see how well this predicts for the same people we used to fit the model" (we'll talk more about that in a minute, but just work with me on this idea). For everyone whose predicted probability of being obese, in other words, the predicted probability for people of the same age, HDL, and sex, is greater than 23.4 percent, we're going to classify those people as obese; we're going to predict that they are obese. For everyone whose group predicted probability is less than 23.4 percent, we're going to predict that they're not obese. The complication is that the range of predicted probabilities of obesity across all age, HDL, and sex groups in our data goes anywhere from less than one percent up to 81 percent, and we could choose any of these values as the cutoff for classifying someone as obese versus not. So let's look at the ramifications of this; there are trade-offs involved.

Let's look at two schemes, and we're going to evaluate them on the same data we used to fit the model, which ultimately we would not want to do, but it gives insight into what the process looks like. We could go back and say, "Let's choose a cutoff of 23.5 percent. For anybody, when we put in their x values, their age category, their HDL level, and their sex, to estimate the log odds and ultimately the proportion of individuals in that group who are obese, if that predicted probability for people like them comes out greater than 23.5 percent, we're going to say they are obese, and if it comes in at less than 23.5 percent, we're going to classify them as not obese." What I'm showing here across the top is the truth: based on our NHANES data, they are obese by the BMI criteria or not obese. Now we're going to see how well our prediction algorithm, which flags as obese anyone whose group probability based on their x's is greater than 23.5 percent, actually works. Among people who are truly obese by the BMI measure, we do pretty well. Of the total number of persons who are obese in the dataset, slightly over 2,000, we classify 1,882 correctly, for a true positive percentage of 92 percent: we correctly predict that someone is obese 92 percent of the time. Unfortunately, in the opposite direction, we don't do so well. This is a pretty low threshold, by the way, so we're going to have a lot of false positives. We actually predict a large proportion of those who are not obese as being obese, and only predict 26 percent correctly; only about a quarter of them are predicted correctly as not being obese. So this threshold gives us a high true positive percentage, but it also comes with a lot of false positives: if you take the converse of this 26 percent, which is the true negative percentage, one minus that is 74 percent, so we have a very high false-positive percentage with this cutoff of 23.5 percent.

Let's up this to 65 percent and do the same exercise. For each of the individuals in our NHANES dataset, we'll plug their x's into the logistic regression equation to estimate the log odds and translate it into a predicted probability, the estimated proportion of people like them who are obese based on their age, sex, and HDL. If that predicted proportion or probability comes in greater than 65 percent, we'll classify them as obese, and if it comes in less than 65 percent, we'll say they're not obese.
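Here is a sketch of that classify-and-tabulate exercise for any chosen cutoff (0.235 or 0.65 in the lecture's two schemes), evaluated, as in the lecture and only for illustration, on the same data used to fit the model; the names are hypothetical, and fit_cont is the continuous-HDL model from the earlier sketch:

import pandas as pd

cutoff = 0.235                                  # or 0.65 for the second scheme
p_hat = fit_cont.predict(nhanes)                # predicted probability for each person
predicted_obese = (p_hat > cutoff).astype(int)

# Predictions cross-tabulated against the true BMI-based obesity status.
table = pd.crosstab(predicted_obese, nhanes["obese"],
                    rownames=["predicted"], colnames=["truth"])
print(table)

# Sensitivity: share of the truly obese classified as obese.
# Specificity: share of the truly non-obese classified as not obese.
print("sensitivity:", table.loc[1, 1] / table[1].sum())
print("specificity:", table.loc[0, 0] / table[0].sum())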
Here, we get the opposite results. We've set the bar high: you have to hit a high threshold to be classified as obese. So of those who are actually obese by the BMI measure, we correctly identify only five percent. We have a very small true positive percentage here; we identify only about five percent of those with the outcome. However, we do much better on the other side: for those who are not obese, we get it right 99 percent of the time. Our true negative percentage is close to 100 percent, so we only misidentify about one percent of those who are not obese. So depending on where you set this cutoff, you're going to tend to maximize one of these percentages at the expense of the other. A truly excellent predictor would do well on both, regardless of the cutoff. If we had enough information, or a rich enough set of x's, to really hone in on specific predicted probabilities for different groups, we might do better than we have done in this example at these two cut-offs.

The proportions of correct predictions shown in the two tables are what are called the sensitivity and specificity, respectively, for the given prediction cutoff. Sensitivity (evaluated, again, with the same data we used to fit the model, and I'll speak to that in a minute) is the proportion of true outcomes classified correctly by the prediction from the logistic regression: the percentage of people who were obese whom we correctly classified as obese. Specificity is the proportion of true non-outcomes, if you will, classified correctly by the prediction: the proportion of those who were not obese who were correctly classified as non-obese. We could redo the tables I just showed for every possible predicted probability across the range of results in our data and look at the trade-off. There's a diagnostic tool that accumulates this information and presents it in visual form, called a receiver operating characteristic curve, or ROC curve. It plots the sensitivity versus the specificity for all possible prediction cutoffs, applying the results from a logistic regression model to a corresponding dataset where the binary outcome is known exactly for each individual. So we evaluate against the NHANES dataset because we know whether or not each individual is obese by the BMI measure. The figure on the left-hand side shows the ROC curve for these data with this regression model: on the vertical axis we have the sensitivity and on the horizontal axis the specificity (if there were a third dimension, it would show the actual cut-off value, but it doesn't appear here). A very good predictive ROC curve would rise quickly and plateau at a very high point, asymptoting near the top, so that some cutoffs give both very high sensitivity and high specificity. Our curve doesn't quite get there. If I were trying to build a predictive model based on this, I would go back, add more predictors, et cetera, and see whether we got an improvement.
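For completeness, here is a sketch of building an ROC curve and its area for this model with scikit-learn, using the common convention of plotting sensitivity against one minus specificity (again evaluated on the fitting data for illustration only, with the same hypothetical names):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

p_hat = fit_cont.predict(nhanes)
fpr, tpr, cutoffs = roc_curve(nhanes["obese"], p_hat)

# tpr is the sensitivity and fpr is 1 - specificity at each possible cutoff.
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")   # the "line of equality" (coin flip)
plt.xlabel("1 - specificity")
plt.ylabel("Sensitivity")
plt.show()

print("AUC:", roc_auc_score(nhanes["obese"], p_hat))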
But one way to measure the collective predictive power of a regression model is to evaluate the sensitivity and specificity trade-off across all possible cutoffs for classifying people as obese: from, on the low end, a cutoff just above zero percent (anybody whose predicted probability of being obese is just greater than zero percent gets called obese) up to almost the polar opposite, calling nobody obese, or only a very slight few, because the threshold on the estimated probability is so high. The ROC curve evaluates the trade-off between sensitivity and specificity for all cutoffs across that range, and what you'll frequently see quoted with an ROC curve is the area under the curve; for ours it was 0.68. This is frequently abbreviated AUC. The AUC for the ROC curve is an estimate of how much better the logistic regression model predicts the outcome than flipping a coin. The diagonal line here is what I call the line of equality; this would be the trade-off for flipping a coin, where the sensitivity always equals one minus the specificity: the true positive and false positive proportions would be the same. How far we get above this line of equality is a measure of how much better our logistic model does at predicting the outcome compared to just guessing or flipping a coin. Really, the measure to look at is the difference between the area under the curve estimate, for example 0.68, and 0.5, which would be the area under the line of equality. So how much we add above and beyond just guessing would be 0.68 minus 0.5, or 0.18. You can read up on different cutoffs for what constitutes a good area under the curve, but ours of 0.68 is somewhere on the upper end of fair, not so good. So if for some reason I wanted to predict obesity for individuals whose other data I had but whose height and weight I couldn't measure, if I wanted to build a predictive model, I would go back and see what other information might improve my prediction.

Here's the tricky thing. Just as we saw with R-squared in linear regression, it's better not to do what I did (I did it only for illustrative purposes): evaluate the predictive power of the model using the same data used to fit the model. A more appropriate thing for me to do with these NHANES data, or for researchers to do in general, is to split the data randomly, for example, into two subsets: a "training" set and a "validation" or "testing" set. Fit and compare the relative prediction abilities of models using the training data and choose a final model; we might compare the area under the curve for different candidate multiple logistic regression models using the same data used to fit them, just to get a ranking. But in order to actually evaluate how well the final model would predict for observations not used to fit it, which is usually what we want prediction for (we want to be able to predict disease for people based on their other characteristics using a model fit on a cohort of persons, for example, or to predict obesity for individuals whose height and weight we couldn't measure), what you want to do is evaluate the prediction of this final model based on how well it actually predicts for the validation set. So you use the ROC curve, and its corresponding area, obtained by applying the model fit on the training set to the validation set.
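A minimal sketch of that training/validation approach, under the same assumptions about the data frame and hypothetical column names:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import statsmodels.formula.api as smf

# Randomly split the data into a training set and a validation ("testing") set.
train, valid = train_test_split(nhanes, test_size=0.3, random_state=0)

# Fit (and, in practice, compare) candidate models on the training data only.
fit_train = smf.logit("obese ~ female + hdl + C(age_cat)", data=train).fit(disp=0)

# Evaluate prediction on the held-out validation set.
p_valid = fit_train.predict(valid)
print("validation AUC:", roc_auc_score(valid["obese"], p_valid))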
I just wanted to give some conceptual grounding here; prediction with logistic regression is actually pretty complicated, and if we had a third term, we could dig more deeply into it. So in summary, what have we talked about in this section? I wanted to point out two things. First, the linearity assumption in multiple logistic regression, which is with respect to continuous predictors. It's not an issue to be concerned with for binary and categorical predictors because, as with linear regression, each slope for a binary or categorical predictor is simply an estimated difference between two groups, and that by definition is linear. The assumption can be investigated empirically by categorizing continuous predictors and seeing whether the results are consistent with a linear relationship: does the log odds of y equals 1 change similarly across increasing ordinal categories of the predictor? Roughly similar magnitude, but certainly the same direction, is necessary. The strategy for choosing a final multiple logistic regression model, even after we've checked for linearity and decided how to model continuous predictors where relevant, depends on the goals of the research; we spoke briefly about that, and again, if we had more time, we could delve more deeply into it. Prediction with logistic regression can be assessed; one possibility used frequently in the literature is the ROC curve, the receiver operating characteristic curve, and its corresponding area under the curve, which measures how much better a logistic regression model predicts, above and beyond flipping a coin, whether somebody is obese or not, for example. But in order to properly evaluate the predictive power of any regression model, it is necessary to compute that prediction measure by applying the fitted model to data not used to fit it. So I just wanted to give a little insight into how prediction is done with logistic regression, to give you an appreciation for the fact that it's complicated, but the big take-home message about prediction, consistent across these different regression lecture sets, is that it's much better to evaluate prediction on data not used to create the predictive model.