Hello, everyone. During this lesson, we are going to explore different models, including the multivariate linear model, the mixed linear model, and the reversible mixed linear model. We'll especially focus on the assumptions of these models and how those assumptions impact power analysis. We have these learning objectives: define the univariate model, the multivariate model, the mixed model, and a residual, and then understand how the model assumptions affect power analysis. We also want to understand the assumptions of each of those models, and in particular the assumptions of what we call the reversible mixed models, which are a general class of mixed models. So the three commonly used models that we're going to talk about are all linear models. There are many possible models, but if you understand power and sample size for these, generalizing to other models is a much smaller step than learning from scratch how to do power analysis for a multilevel design or a longitudinal study. The three common models are the univariate, the multivariate, and the mixed. We're not going to discuss how to do a data analysis, but we are going to talk about identifying which kind of model is appropriate to use. The univariate model is appropriate for cross-sectional designs: one observation per person, and every person is an independent sampling unit. So we have only between-subject factors, either observational or, if it's a randomized experiment, interventional. We have only one outcome; the word univariate here describes the number of outcomes, not the number of predictors. The model only applies to observations that are fully uncorrelated. In fact, most of the analysis methods that statisticians use are based on the assumption of not just independence among observations, but identical distributions as well. So, put simply: there is no clustering, and there is a single outcome. It's the simplest model there is.
The multivariate linear model, or as we might call it, the multivariate general linear model, is used when we have multiple outcomes. It was originally developed for situations such as the study of admissions to college: the SAT verbal and SAT mathematics scores of high school students can be used to predict their performance in college, and to ask whether or not we should admit them. Those are two scores. We could call them multivariate because we have two outcomes, and that's an example of the purely multivariate case, in which the outcomes are different skills. Multivariate, to a statistician, also encompasses what we call repeated measures designs and longitudinal designs. Multivariate is the general category, and within multivariate live repeated measures and longitudinal designs. The next model is more flexible: the mixed linear model. It has fewer restrictions on model structure. However, we pay a price for its flexibility. It requires more specification in fitting the model, it's more trouble to fit, and it requires more knowledge to fit safely and make sure it's behaving well. Its flexibility is what made it very popular. The structure of the next part of the lecture is that we define things in relation to model assumptions. We talk about correlation, residuals, and variance. We want to have a sense of all of those things in these three models. I want to talk about correlation and variance so we can talk about error terms and residuals and understand these models, and that's easier to do with the univariate model before we talk about multivariate models. So we're going to start with the univariate case, and we'll review the correlation material again. Correlation is a measure of the strength of the relationship between random variables. Positive values reflect positive relationships in the data you're looking at, and the farther you move from zero toward one, the stronger the correlation.
With a positive correlation, two variables change in the same direction. Likewise, negative correlations indicate two variables moving in opposite directions. Variables with a correlation equal to zero have rates and directions of change that are unrelated, meaning a change in one variable does not lead to a change in the other. Independent variables always have a correlation of zero. However, mathematically speaking, variables with a correlation equal to zero are not necessarily independent. Statisticians have to keep track of that. In this lesson, we'll focus on cases where we can assume that zero correlation corresponds to independence. Okay. Here we get back to the question we started with: what's the difference among the three models? They have a common structure. We're going to cover it in two pieces: first the common part, and then the distinctions. So all of them have this simple form, a linear model; that's the key idea. A linear model really is just addition signs. We've got our outcome, and we're saying our model is a prediction plus an error term. This is a hypothetical model. The statement this model makes about the science is that there's a random variable, the error term, and there's your prediction, and those two together totally describe the response. All the models share this structure; the complexity of the error, the complexity of the response, and the predictors change as we move among the different models. A model is a statement about the population. A population is a set of interest: it could be the people of the United States, it could be all cells; it's the inference target of interest. A sample is any subset of that population. It can be a good sample or a bad sample. As statisticians, we hope you will get a good sample, but any subset we take of the population is a sample.
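To make the correlation discussion concrete, here is a minimal sketch, with made-up data not from the lecture, of the Pearson correlation coefficient showing a positive, a negative, and a zero relationship:

```python
# Sketch of Pearson correlation with hypothetical data:
# +1 means perfectly positive, -1 perfectly negative, 0 unrelated.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))   # +1.0: both change in the same direction
print(pearson(x, [10, 8, 6, 4, 2]))   # -1.0: they change in opposite directions
print(pearson(x, [3, 5, 3, 5, 3]))    # 0.0: changes are unrelated by construction
```

Note that the last pair has exactly zero correlation yet the sequences are clearly not independent in general, which is the caveat the lecture flags: zero correlation does not guarantee independence.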
Data analysis uses the sample to find estimates of parts of the model. We observe the response variables, and we use those observed variables to produce estimates, and from those estimates come predicted values. So here we have a response, and here we have a predictor. What are the predicted values in this picture? The predicted values are the red line. So the observed Y-X pairs are the circles, and the predicted values are the red line. We can take a single value: here's the line, here is a dot. Here's the observed value and the predicted value. We compute the difference between the observed value and that prediction, which is the vertical distance. That deviation is the residual: the observed minus the predicted. That's the vertical axis on the right-hand plot. The vertical axis was produced by subtracting the predicted values from the observed values. This is an incredibly useful plot. These values are called residuals, and residuals are distinguished from errors. The model says response equals predicted plus error. Residual is the term statisticians use to describe an estimate of the error, so residuals are observed, and they are estimates of the errors. There's a special name for this plot; now we're getting into data analysis. It's sometimes called a residual-versus-predictor plot, and it's often used to check model assumptions. Again, errors are unobserved; they're part of the model. Residuals are observed, because they are computed from the observed response and a predicted value that we compute with a statistical process, which is observed, and therefore observed minus predicted is observed. Again: residual equals observed minus predicted. We don't know the units here because these are raw residuals. They can be positive, or they can be negative. The variance of the residuals summarizes the average squared deviation, the degree to which the observed values differ from the predicted values.
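The observed-minus-predicted computation above can be sketched in a few lines. This is a hypothetical example, not the lecture's data: fit a least-squares line, form predicted values, and subtract to get residuals.

```python
# Sketch of residual = observed - predicted, using a least-squares line
# fit to made-up (x, y) pairs.
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]        # hypothetical observed responses
slope, intercept = fit_line(x, y)
predicted = [slope * a + intercept for a in x]      # the "red line" values
residuals = [obs - pred for obs, pred in zip(y, predicted)]
print(residuals)   # observed minus predicted; sums to ~0 by construction
```

Plotting `x` against `residuals` would give exactly the residual-versus-predictor plot the lecture describes.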
The error variance, estimated by the variance of the residuals, indicates how good your prediction is. If your prediction were perfect, all of the residuals would be zero and the error variance would be zero. Typically, if we see a greater range of values, the variance is greater. Violations of the assumptions of the model can invalidate an analysis. Again, we are focusing on the univariate, multivariate, and mixed models. The assumptions of the three overlap, and then we'll distinguish them. The univariate model has the following five assumptions, which we'll discuss in more detail. First, independence of observations; you have some sense of what this means, I think. Independence means there's no correlation between the independent sampling units, and in fact there's statistical independence, not just zero correlation. In the univariate linear model, that is typically interpreted as independence of the response variables; really, you need to think in terms of errors to keep this straight, so it's independence of the error terms. Independent sampling units suggests the units are unrelated: unrelated people, or unrelated schools in different cities. Next, homogeneity. A very loose version of this assumption is that the data values vary in predictable and consistent ways. The specific requirement for the univariate model, where we only have one outcome, boils down to this: the variance is the same across the data. The way we check that with actual data, the way we estimate whether or not we have homogeneity, is to look at the plot of the residuals. It helps to be able to interpret these plots. If you look at the residuals at predictor value 5, they have some spread, some dispersion; the variance calculation is an estimate of that spread or dispersion. If we look at the variance at 10, we get the same value, and likewise if we look at the variance at 15. That's what we want; that's a picture of homogeneity right there. The spread is constant across the predictor. That's all we mean by homogeneity. It's a statement about errors; here we're illustrating it.
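The homogeneity check described above, comparing the spread of residuals at predictor values 5, 10, and 15, can be sketched numerically. The residual values below are invented for illustration; in practice they would come from a fitted model.

```python
# Sketch of the homogeneity check: estimate the variance of the residuals
# at each predictor value and compare. Roughly equal spreads are what a
# picture of homogeneity looks like. Data here are hypothetical.
from statistics import pvariance

residuals_by_predictor = {   # predictor value -> residuals observed there
    5:  [-1.0, 0.5, 1.2, -0.7],
    10: [0.9, -1.1, 0.6, -0.4],
    15: [-0.8, 1.0, -0.5, 0.3],
}
for value, res in residuals_by_predictor.items():
    # Similar variances across predictor values suggest homogeneity;
    # a trend (e.g. spread growing with the predictor) suggests a violation.
    print(value, round(pvariance(res), 3))
```

This is only an informal eyeball check, matching the lecture's use of the residual plot; it is not a formal test of homogeneity.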
We're estimating it with observed data. Next, linearity: the assumption that the model is correct. We wrote the model that said outcome equals predicted plus error; that's a linear model. The linearity assumption says that that's true. Said more mathematically, it means that the expected change in the response variable is the same for each one-unit increase in the predictor. But it's really saying the model is true. Next, existence: we need a variance less than infinity. We never deal with infinite sets of data, so we are not going to encounter this problem in data analysis. It is a concern when one tries to do proofs or derivations. So as long as you're not doing derivations, if you're doing data analysis, it isn't much of an issue. Finally, normality: we want to see a density function that looks roughly like the visual on the right. It's very hard to tell from a residual plot; it's much easier to see in a frequency histogram. The errors should be normally distributed. So those are the univariate model assumptions: homogeneity, independence, linearity, existence, and normality. The multivariate assumptions are the same five things; we just need to elaborate on them. So what is the difference between the univariate and multivariate model assumptions? We have to now look at multivariate homogeneity: you have to have homogeneity for each of the outcomes. The variance of each outcome has to be the same for all people within the study. Next, we have the complete data assumption. If the sixth row has a missing value, the multivariate model deletes it, and if the ninth row has a missing value, the multivariate model deletes that one too. The multivariate model only allows list-wise deletion. That's in contrast to the mixed model, which is more flexible in allowing complex patterns of missing data. But in order to justify the deletion, the data must be missing at random, and the missingness can't be determined by the outcomes or the predictors.
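The list-wise deletion that the complete data assumption forces can be sketched directly. The rows below are hypothetical, standing in for the slide's data where rows 6 and 9 have a missing outcome:

```python
# Sketch of list-wise deletion: the multivariate model drops any row
# (any independent sampling unit) with a missing value. Rows are
# hypothetical (id, outcome_1, outcome_2) tuples; None marks missing.
rows = [
    (1, 4.2, 3.9), (2, 5.1, 4.4), (3, 3.8, 4.0), (4, 4.9, 5.2),
    (5, 5.5, 5.0), (6, None, 4.1),                # outcome_1 missing
    (7, 4.4, 4.6), (8, 5.0, 4.8), (9, 4.7, None)  # outcome_2 missing
]
complete = [r for r in rows if None not in r]   # list-wise deletion
print(len(rows), len(complete))                 # 9 rows in, 7 rows kept
```

Rows 6 and 9 are discarded entirely, even though each still has one valid outcome; that discarded information is exactly what the mixed model is able to keep.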
This is an assumption that we need to be very aware of. Now, we're going to do the mixed model. Seen these assumptions before? They are similar, with a few elaborations and one exception: the complete data assumption is no longer needed. We can have missing data in the mixed model. The mixed model can accommodate missing and non-missing data within the same independent sampling unit, such as a person. With the multivariate model, we had to delete all observations from an independent sampling unit, while with the mixed model we can use whatever data we observed. Another big feature is that the mixed model allows repeated covariates: predictors that change across time points. Let me give you an example of a repeated covariate. Suppose that we are measuring children at different points in time. The age of the child is a repeated covariate; the children have a different age at each time point in the model. There's a class of mixed models that actually correspond to multivariate models. In study planning, the vast majority of the studies scientists present could actually be analyzed as multivariate models if they didn't have any missing data. These are referred to as reversible models. They are reversible under the following three conditions. First, each independent sampling unit contains the same number of units of observation. Second, the units of observation are measured at the same times, in the same locations, and on the same variables. Third, the predictors are measured only once. So if you're sampling students in schools, you know you're not going to get exactly the same number of students in every classroom. But we start from the assumption of balance, do power for the reversible mixed model, and then deal with missing data and dropout as an adjustment factor afterward. That's the two-step process we're going to use here.
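The contrast between the two data-handling strategies can be sketched with hypothetical repeated measurements on two subjects, one of whom misses a time point:

```python
# Sketch (hypothetical data) contrasting the two strategies:
# the multivariate model drops an entire sampling unit with any missing
# time point, while the mixed model keeps every observed record.
subjects = {
    "A": {1: 10.0, 2: 11.5, 3: 12.1},
    "B": {1: 9.5, 2: None, 3: 10.8},   # time point 2 is missing
}

# Multivariate approach: list-wise deletion of whole subjects.
multivariate_units = {s: t for s, t in subjects.items()
                      if None not in t.values()}

# Mixed-model approach: keep every observed (subject, time, value) record.
mixed_records = [(s, time, v) for s, t in subjects.items()
                 for time, v in t.items() if v is not None]

print(len(multivariate_units))  # 1 subject survives list-wise deletion
print(len(mixed_records))       # 5 observed records remain usable
```

Subject B's two observed measurements are lost to the multivariate analysis but retained by the mixed model, which is the flexibility the lecture is pointing at (justified, as noted, only when the data are missing at random).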
We're going to assume balance, we're going to assume nice studies, and we're going to do power for that; then we're going to adjust the power to deal with the imbalance. So we have the same number of units of observation, they're measured at the same times, in the same locations, and on the same variables, and the predictors have only one measurement. So we're not going to allow predictors like a drug dose changing over time; we're not going to allow those repeated covariates. The reason we can get away with this is that, although cardiovascular trials may record such a covariate, they're not actually interested in testing a hypothesis about it. They want to test the hypothesis that their drug differentially affects your cholesterol levels. So power is done for the tests of interest, not for the covariate. Model assumptions have big implications for power and sample size analysis. The assumptions are homogeneity, independence, linearity, existence, and Gaussian errors. Remember the mnemonic: HILE Gauss. That's all for this lecture. Thank you for your time.