Okay. So, welcome to week three. This week, we're going to start talking about statistical models for dependent data, that is, datasets where the observations are correlated with each other for one reason or another. Thus far in this course, we've been talking about regression models for independent observations, and this week we're going to start talking about appropriate statistical models for datasets that are generated by study designs that introduce dependencies in the data. The first type of models that we're going to talk about are multilevel models. So we're going to talk about what multilevel models are in this lecture and why we fit them in practice.

So that is the overview. We're now going to start talking about fitting statistical models to dependent data, where the observations are correlated due to some feature of the study design. For example, several observations might be collected at one time point from each of a number of sampled clusters of analytic units, so maybe from neighborhoods, or schools, or clinics. Observations nested within these clusters may be correlated with each other. People in the same neighborhood, for example, may share the same attitudes or have similar socioeconomic status, or whatever the case may be. We could also have the case where several observations are collected over time from the same individuals in a longitudinal study. In this case, the repeated measurements collected over time from the same person are likely to be correlated with each other.

The models that we fit to these types of datasets need to reflect the correlations. The models that we were talking about in week two basically assume that all the observations in our dataset are independent of each other. Now we're talking about datasets where these observations might be correlated. We have to make choices when specifying models for these data such that the fitted models reflect these correlations in the observations.
So, the first type of models that we're going to talk about are called multilevel models. Multilevel models are a general class of statistical models that can be used to model dependent data, where the observations that arise from a randomly sampled cluster may be correlated with each other. What makes these multilevel models unique is that the regression coefficients that we were talking about in previous weeks are allowed to randomly vary across these randomly sampled higher-level clusters. So the regression coefficients no longer have to be fixed constants that we're trying to estimate. We can allow those coefficients, which could for example describe relationships between predictors and outcomes, to randomly vary across these higher-level units, and we can estimate the amount of variability in these coefficients.

Let's consider an example of a longitudinal study, where time might be a predictor of some outcome of interest. So we're interested in studying trends in different outcomes over time. We could fit a model where the intercept and the slope in our model are allowed to randomly vary across the randomly sampled subjects that we've included in our study. In other words, each subject in our study could have their own unique intercept and their own unique slope, instead of assuming that everybody follows the same general relationship or the same general pattern. Multilevel models allow us to estimate the variability among subjects or clusters in terms of these coefficients of interest. So, in multilevel models, we still estimate regression parameters. These are still regression models.
We're still interested in fixed parameters that describe relationships between variables, specifically between predictors and outcome variables, but we go above and beyond what we talked about in the previous weeks and we estimate the variability of those coefficients across these clusters that have been randomly sampled at higher levels. So in addition to estimating overall relationships, we also estimate parameters that describe the variability of those relationships across these higher-level clusters.

What this means is that multilevel models allow us to expand the types of inferences that we can make from fitting models to the data. First of all, we can still make inference about the relationships between predictor variables and outcomes; that doesn't change. But on top of that, we can make inferences about how variable the coefficients are in the larger population from which these clusters, for example, schools or clinics, were randomly sampled. That's new. Something else that's new about the inferences we can make is that we can try to explain that variability among these higher-level clusters with cluster-level predictor variables. So we can use some feature of those randomly sampled clusters to try to explain that variability in the coefficients. That's another way that we can expand our inference with multilevel models.

So, let's think about how multilevel models look when we're writing down equations. A question: what changes allow the coefficients to randomly vary in the models that we're fitting? The answer to that question is the inclusion of random effects of these higher-level randomly sampled clusters in the model. We explicitly include additional effects of these higher-level randomly sampled clusters. So let's think about this level one equation here. We have a dependent variable y, which is defined for a given observation i, nested within cluster j.
Notice how we write that dependent variable as a function of coefficients, but those coefficients have a subscript denoted by j. What that means is that those regression coefficients are now determined by what cluster j we're referring to. They randomly vary depending on the cluster j. So we now refer to these coefficients as random coefficients. These are not parameters, these are random variables. These variables are allowed to change depending on the cluster. Beta one j captures the relationship of the predictor variable x with y for cluster j, and we still have that error term e associated with observation i within cluster j.

Now, here is what gives the multilevel model its name. That equation defines the regression function for the observations at level one, the values of the dependent variable. Those could be subjects in a clustered study, or they could be repeated measurements in a study where we're collecting repeated measurements over time from the same subjects. In the multilevel model, we then have equations at level two of the data hierarchy. These are equations for those random coefficients at level one. So you see that we have a unique equation for Beta zero j, the intercept that's specific to cluster j, and we have a unique equation for Beta one j, the slope that's specific to cluster j.

In this example, the intercept specific to cluster j is defined by a fixed intercept Beta zero. That's our regression parameter, the fixed parameter that we're trying to estimate. But we add this u term, u zero j, and that u term is a random variable, a random effect. That u term is what allows each cluster j to have its own unique intercept. The same thing is true for the slope for cluster j: we add another random effect called u one j, a random variable that allows each cluster j to have a slope that deviates somewhat from the overall fixed slope defined by Beta one. So thus far in the course, we've been talking about estimating Beta zero and Beta one in regression models.
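The slide with these equations isn't reproduced in this transcript. Going only by the verbal description above, the two-level model could be written roughly as follows. The exact typesetting is an assumption; the symbols follow the names used in the lecture (y, x, e, Beta zero j, Beta one j, u zero j, u one j).

```latex
\begin{align*}
\text{Level one:}\quad y_{ij} &= \beta_{0j} + \beta_{1j}\, x_{ij} + e_{ij} \\
\text{Level two:}\quad \beta_{0j} &= \beta_0 + u_{0j} \\
\beta_{1j} &= \beta_1 + u_{1j}
\end{align*}
% Substituting the level two equations into level one gives a single equation:
\begin{equation*}
y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij} + e_{ij}
\end{equation*}
```

Written in the combined form, the fixed parameters Beta zero and Beta one are the same ones estimated in earlier weeks, and the two u terms are the only additions.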
This shows how multilevel models add these random effects that allow each cluster j to have its own unique coefficient. So, we have this level one and level two model. The random effects that are unique to multilevel models allow each cluster, denoted by j, to have unique coefficients. These random effects are random variables, so the values for different clusters are assumed to be randomly drawn from some larger population of possible random effects. A very common assumption that we make is that those random variables u follow a normal distribution with an average of zero, so the average cluster looks like the overall fixed effect Beta zero, but then there's variability in those u's, and it's that variance that we're trying to estimate. We're interested in estimating the variability of the u's: how variable are those coefficients around the overall coefficient Beta zero and the overall coefficient Beta one?

So, multilevel models are defined by this explicit inclusion of random effects. By including these random effects, we're saying that observations coming from the same cluster are statistically correlated with each other. That's how we model the correlation of these observations. If we didn't include random effects in our model, as was the case in week two when we talked about linear and logistic regression, we'd be making the assumption that observations from the same cluster are independent of each other, and that the correlation of observations within clusters is zero. That can be a really strong assumption when we're working with dependent data, so random effects allow us to model the correlation. Accounting for these correlations in our modeling often substantially improves model fit when we work with dependent data.
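To make the link between random effects and correlation concrete, here is a small simulation sketch. It is not from the lecture: it assumes the simplest case, a random intercept only, with normally distributed u's and errors and made-up values for Beta zero and the two standard deviations. In that setting, two observations from the same cluster share the same u, which implies a correlation of var(u) / (var(u) + var(e)).

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate_pairs(n_clusters, beta0, sd_u, sd_e, rng):
    """Draw two observations per cluster from y = beta0 + u_j + e_ij."""
    first, second = [], []
    for _ in range(n_clusters):
        u = rng.gauss(0.0, sd_u)  # random intercept shared within the cluster
        first.append(beta0 + u + rng.gauss(0.0, sd_e))
        second.append(beta0 + u + rng.gauss(0.0, sd_e))
    return first, second

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical values chosen only for illustration.
with_u = pearson(*simulate_pairs(20000, beta0=10.0, sd_u=1.0, sd_e=1.0, rng=rng))
without_u = pearson(*simulate_pairs(20000, beta0=10.0, sd_u=0.0, sd_e=1.0, rng=rng))
implied = 1.0 ** 2 / (1.0 ** 2 + 1.0 ** 2)

print(f"within-cluster correlation with random intercepts: {with_u:.3f}")
print(f"implied by var(u) / (var(u) + var(e)):             {implied:.3f}")
print(f"within-cluster correlation with no random effect:  {without_u:.3f}")
```

Setting the standard deviation of u to zero is the same as leaving the random effect out, and the correlation between observations in the same cluster then drops to roughly zero, which is the independence assumption from week two.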
So it's important to consider whether we get significant improvements in model fit when we add these random effects, and we're going to talk about how to do that.

Multilevel models also allow us to decompose the variance in a given outcome that isn't accounted for by the predictors into between-cluster and within-cluster variance. The random effects that we include capture the between-cluster variability. The error terms in that level one equation still capture the within-cluster variability in the observations that hasn't been explained by the predictor variables that we're including. So, a key question that we try to answer with multilevel models is how much of the unexplained variance is due to between-cluster variance in the intercepts or the slopes for a given model. In other words, how much of the variability in our observations is actually coming from between-cluster variability in the intercepts and slopes? That's a key research question that we try to answer with multilevel models. If we don't care about that between-cluster variance, we may not need to use multilevel models for our analysis. So, we need explicit research interest in estimating the variances of these random coefficients. If we're not interested in estimating that variance, we can easily consider other models for dependent data, which we're going to be talking more about later this week.

Okay. So, why do we fit multilevel models? I've already hinted at this a little bit, but all of these points need to be true to warrant multilevel modeling.

First of all, we need to have a dataset that's organized into clusters, so clinics, subjects, schools, neighborhoods, et cetera, where there are several correlated observations collected from each of the clusters. So, we have some reason to believe, based on the study design, that the observations on our dependent variable are going to be correlated within one of these sampled clusters.
Second of all, the clusters themselves need to be randomly sampled from a larger population of clusters. In other words, we can't treat variables like gender or race/ethnicity as cluster variables. These are group variables where we have all the possible groups represented in the dataset. When we make the decision to include random effects of higher-level clusters, we're assuming that those higher-level clusters are randomly sampled. We don't randomly sample values of gender or values of race and ethnicity from a larger population of values on these variables. We do randomly sample neighborhoods or clinics or hospitals or whatever the case may be, and the random effects allow us to make inference about that larger population from which the clusters were sampled.

Third, we wish to explicitly model the correlation of observations within the same cluster. The study design gives rise to this kind of dependency, and we want to model that correlation when we fit a statistical model to the data.

Fourth, we have explicit research interest in estimating that between-cluster variance in the selected regression coefficients that define our model. Again, there are other models for dependent data that we could use if we're not explicitly interested in that between-cluster variance.

So, given this explicit research interest in estimating between-cluster variance in these selected regression coefficients, here are some examples of the questions that we might want to answer with multilevel models. For example, how much of the unexplained variance among hospitals in mean patient satisfaction is due to the size of the hospital? So, is there variability among hospitals, and can that be explained by how big the hospital is? Second example: how much variance is there in long-term trends of substance use for a sample of drug users? So, do different drug users follow different trends in terms of their long-term substance use?
We can estimate that variance with multilevel models. Multilevel models also offer advantages over other approaches for dependent data. When we fit these models, we estimate one parameter that represents the variance of a given random coefficient across the clusters, and this is instead of estimating a unique regression coefficient for every possible cluster. So, instead of this purely stratified approach, where every cluster gets its own unique fixed regression coefficient, we just estimate one parameter that describes the variance of those random effects. This is a much more efficient approach to fitting these kinds of models, especially when we have a large number of clusters.

In addition, clusters with smaller sample sizes in our dataset do not have as pronounced an effect on that variance estimate as the larger clusters do. The effects of the smaller clusters shrink toward the overall mean of the outcome when we use this random effects approach. This is called shrinkage, and it really matters when a lot of the clusters have smaller sample sizes: you don't want them to have as large an influence on that overall variance.

So, with multilevel models, we estimate the variance in a given random coefficient across these higher-level clusters, and when we do that, we can add cluster-level predictors to those level two equations for the random coefficients. We do this to explain variance in those random effects. Recall the model that we introduced earlier, where we had the level one equation for the dependent variable, and then we had level two equations for those random coefficients that vary across clusters. Let's take a longitudinal example, where y is our dependent variable, x is our predictor of interest, which in this case might be age, t is a subscript that represents the time point at which the measurement was collected, and i is the subject that's repeatedly measured.
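The slide for this longitudinal example isn't included in the transcript either. Based on the notation just introduced, and on the subject-level predictor T and its parameters Beta zero one and Beta one one that the lecture goes on to describe, the model might be written roughly like this. Carrying over Beta zero and Beta one as the symbols for the fixed intercept and slope is an assumption made here for consistency with the earlier cluster example.

```latex
\begin{align*}
\text{Level one:}\quad y_{ti} &= \beta_{0i} + \beta_{1i}\, x_{ti} + e_{ti} \\
\text{Level two:}\quad \beta_{0i} &= \beta_0 + \beta_{01}\, T_i + u_{0i} \\
\beta_{1i} &= \beta_1 + \beta_{11}\, T_i + u_{1i}
\end{align*}
```

Here the lowercase subscript t indexes the time point and the uppercase T is the subject-level predictor, so the two should not be confused.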
So, you see we have unique intercepts and unique coefficients in this model for each subject. Notice how, at level two, we're adding a subject-level predictor T, with a subscript i. We can use T to explain variability in those u's, just like we would in any other linear regression model. You can think of the level two equations as mini regression models for those random coefficients. So, by adding T and its corresponding regression parameter, Beta zero one or Beta one one, we're trying to explain variance in those random coefficients. We're trying to explain some of that between-cluster variance. We can add these regression parameters for that subject-level covariate T to those level two models to try to explain variability in the random intercepts and random slopes.

We can also test hypotheses about the regression parameters for T. Once we estimate Beta zero one and Beta one one and test hypotheses about those parameters, if those parameters are significant, that means we're explaining some of the between-cluster variance. So, we can make statements like: 45 percent of the between-subject variance in the relationship between age and the dependent variable y can be explained by that subject-level predictor T. This is a unique advantage of multilevel models. We can make inference about how much variance in the random effects gets explained by these higher-level covariates.

So, we've provided a broad overview of multilevel models in this lecture. What's coming up next? Well, next we're going to see how to visualize the idea of fitting multilevel models. You're going to see a very cool website that allows you to visualize what it means to fit models where slopes and intercepts vary across higher-level clusters.
Then we'll get into more details about fitting multilevel models to different kinds of dependent variables, continuous, binary, count, whatever the case may be, and we'll look at a large number of examples. Just as a reminder, when we fit multilevel models, we need to have explicit research interest in estimating between-cluster variance in regression coefficients. There are other modeling approaches for dependent data that don't need random effects.
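As a closing illustration, two of the quantities mentioned in this lecture come down to simple arithmetic, sketched below. Neither calculation is spelled out in the lecture, so treat both as assumptions: the shrinkage weight is the standard result for a random-intercept model when the variances are treated as known, and the proportion explained is one common way of comparing the estimated variance of a random coefficient before and after adding a cluster-level predictor. All numbers are hypothetical.

```python
def shrinkage_weight(var_u, var_e, n_j):
    """Weight placed on a cluster's own mean in a random-intercept model.

    var_u: between-cluster variance of the random intercepts
    var_e: within-cluster (error) variance
    n_j:   number of observations in cluster j
    """
    return var_u / (var_u + var_e / n_j)

def shrunken_mean(cluster_mean, overall_mean, var_u, var_e, n_j):
    """Pull a cluster's observed mean toward the overall mean."""
    w = shrinkage_weight(var_u, var_e, n_j)
    return overall_mean + w * (cluster_mean - overall_mean)

def proportion_explained(var_without, var_with):
    """Share of a random coefficient's variance explained by a level two predictor."""
    return (var_without - var_with) / var_without

# Two hypothetical clusters with the same observed mean but different sizes.
small = shrunken_mean(16.0, 10.0, var_u=1.0, var_e=4.0, n_j=2)
large = shrunken_mean(16.0, 10.0, var_u=1.0, var_e=4.0, n_j=50)

# Hypothetical slope variances estimated without and with the predictor T.
share = proportion_explained(var_without=0.20, var_with=0.11)

print(f"small cluster (n=2):  observed 16.0, shrunken {small:.2f}")
print(f"large cluster (n=50): observed 16.0, shrunken {large:.2f}")
print(f"share of slope variance explained by T: {share:.0%}")
```

The small cluster is pulled much closer to the overall mean than the large one, which is the shrinkage behavior described earlier. In real software the variances are estimated from the data rather than supplied, so this only shows the direction and rough size of the effect.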