In this video, we'll learn how to implement generalized additive models in R. In particular, we'll use the mgcv package. We'll compare the fit of a generalized additive model to some other methods that we've studied in this course, or that you've studied in a regression course, and we'll show that under certain conditions GAMs have superior performance. Note that this video should be accompanied by a Jupyter notebook, where you can walk through this simulation example.

In this example, where we fit a GAM using the gam() function in the mgcv package, we're going to use simulated data. This allows us to understand the intricacies of the gam() function without having to worry about whether the model is actually the right one for the data. We'll know that it's the right model, because we simulated the data so that a GAM provides a good fit.

Let's first construct the predictor variables that will go into our simulated dataset. The goal here is to construct data with different types of predictors: we'll end up with a factor and two continuous variables. One of the continuous variables will enter the model linearly and the other will enter nonlinearly. In particular, x1 is a continuous predictor that we'll suppose has a nonlinear relationship with the response. x2 is also a continuous predictor, but it will have a linear relationship with the response. Finally, x3 is a factor with three levels, which we'll call A, B, and C.

If you look at the accompanying Jupyter notebook, you'll notice that I'm randomly generating the predictors in a particular way. For example, x1 is drawn from a uniform distribution between negative 3 and 3. But there are lots of ways to do this, and you might explore different choices: if you use, say, a normal distribution instead of a uniform, how might that change the data you simulate? Try to get a sense of what you need in order to recover the relationship: what interval the data should live on, how much variability there should be, and so on. For us, x1 is generated from a uniform, x2 is generated from a normal centered at 3 with standard deviation one-tenth, and the factor is randomly sampled from a vector of A's, B's, and C's.

The next thing we do is construct the response. In particular, we'll make it a nonlinear function of x1 and a linear function of x2. You can see the relationship here: the relationship between x1 and the response is nonlinear; in particular, x1 is wrapped in a sine function. The relationship between x2 and the response is just x2 multiplied by 3, which is linear, and then I'm adding on the factor. Note that in a real-world situation with real data, we wouldn't know these functional relationships, and we would have to estimate them from the data. The nice thing about this simulation is that we know what the relationships are, so we can get a sense of how the model works. Also note that the relationship shown here tells us that the mean of the response is sin((π/2) x1) + 3 x2 + x3. We will, of course, have some noise in the data, so to get each y you would add on some normal noise; you'll notice that the way I simulate the response in the notebook includes a normal noise component. The sketch below shows one way to set up this simulation.
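Here is a minimal sketch of the simulation. The sample size, seed, factor effect sizes, and noise standard deviation are my own assumptions; the notebook's choices may differ.

```r
set.seed(42)  # any seed; just for reproducibility
n <- 500      # sample size (an assumption; the notebook may use a different n)

# Predictors: x1 ~ Uniform(-3, 3), x2 ~ Normal(3, 0.1),
# x3 a factor with levels A, B, and C
x1 <- runif(n, min = -3, max = 3)
x2 <- rnorm(n, mean = 3, sd = 0.1)
x3 <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

# Response: nonlinear in x1, linear in x2, a shift for each factor level,
# plus normal noise (these factor effects and the noise sd are illustrative)
f3 <- c(A = 0, B = 1, C = 2)
y  <- sin(pi / 2 * x1) + 3 * x2 + unname(f3[as.character(x3)]) + rnorm(n, sd = 0.5)

dat <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
```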
Let's take a look at the marginal relationships between x1 and the response and between x2 and the response. What I've done in these plots is split the data by the factor x3, which has three levels, and then for each level I fit a loess model through the data. Consider the plot on the left, which doesn't take x2 into account; we haven't adjusted for x2 yet, so this is just a univariate smoothing situation. The loess shows a clear nonlinear relationship between y and x1 at every level of x3. Similarly, in the plot on the right, without taking x1 into account, the loess shows nonlinear relationships between y and x2 at every level of x3. But that's not what we would expect, because we know the true relationship between x2 and y is linear. Notice, though, that the confidence bands around each of these fits are pretty wide. In fact, it appears that for every level, though perhaps most clearly for A and B, we could draw a straight line through the confidence band, which is sometimes taken to suggest that the null hypothesis of a linear relationship is plausible. We don't really have evidence of a nonlinear relationship here, and of course that's right, because we actually don't have a nonlinear relationship.

Importantly, the analysis we just went through does not allow us to adjust for other predictors. The marginal relationships we looked at did not adjust for the other predictors, including the levels of the factor. To do that, we need some kind of multivariate model: a model that takes in all of the predictors at once and tries to learn the relationship with the response. The most basic choice would be a linear model, and we'll show here that this model doesn't fit all that well. That's to be expected, because we know the true relationships. Notice in this table that the parameter estimate associated with x1 is not significant; the p-value is about 0.74. But we know there's a strong, nonlinear relationship between x1 and y. The reason this p-value is high is that we have something cyclical, we're trying to fit a linear relationship to it, and we're basically coming up with a line of slope zero. We're not picking up on the relationship at all. That's a strong argument for using a generalized additive model, because the additive model framework allows us to pick up relationships that are nonlinear.

I've also got a plot of the residuals against the fitted values for this linear model. There's nothing too remarkable here; the residual plot doesn't pick up on anything strange. Maybe there's a bit of a cyclical pattern as you go from low fitted values to high, or a little clustering down here with these values, but it's not obvious. What might be clearer is what looks like a violation of the normality assumption. Here we have a Q-Q plot, used to diagnose the normality assumption, and we see strong deviations from normality in the tails. Really, the problem is the nonlinear relationship; that's what's getting picked up here. Further, if we plot the predicted values against the actual values, we see quite a bit of scatter around the line y = x, which is not desirable. The sketch below shows one way to produce these plots and this fit.
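A sketch of these steps, assuming the `dat` data frame from the simulation above. I'm using ggplot2 for the per-level loess fits as one convenient option; the notebook may use different plotting code.

```r
library(ggplot2)

# Marginal relationships, split by the factor x3, each with its own loess fit;
# the shaded ribbons are the confidence bands
ggplot(dat, aes(x = x1, y = y, color = x3)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess")

ggplot(dat, aes(x = x2, y = y, color = x3)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess")

# The plain linear model: x1 enters linearly, which we know is wrong
lm_fit <- lm(y ~ x1 + x2 + x3, data = dat)
summary(lm_fit)                       # in the video's data, x1's p-value was about 0.74
plot(fitted(lm_fit), resid(lm_fit))   # residuals vs. fitted values
qqnorm(resid(lm_fit)); qqline(resid(lm_fit))   # Q-Q plot for normality
plot(fitted(lm_fit), dat$y); abline(0, 1)      # predicted vs. actual
mean(resid(lm_fit)^2)                 # in-sample MSE
```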
If we were making really good predictions, if our model fit really well, we would expect to see a tight fit around the line y = x. We might also notice a little bit of nonlinearity here: we're under-predicting for high values of y and over-predicting for low values. In fact, if we fit a smooth to this plot, it would bend away from the line. All of that is bad for the fit, and perhaps for the predictive power, of our linear model. Take note that the MSE is about 0.5; we'll compare that to the MSE for the two models we look at now.

The second model we'll look at is cheating: we'll actually fit the true relationship. We're going to apply a transformation to x1, and amazingly, we'll pick exactly the right one. Of course, it's not that amazing, because we simulated the data and we know what the relationship is. You can imagine that if you had a dataset in front of you that you did not simulate, it would be amazing to pick exactly the right transformation when the data are complicated and you don't have some theory informing that choice. The reason we're going through this model is to show what the exact parametric model would do in terms of goodness-of-fit metrics like the MSE and a plot of predicted versus actual values. It shouldn't be surprising that all of the parameters in this model are statistically significant, with magnitudes close to the true values. For example, the coefficient associated with the x1 term is about 1, because once we transform the predictor, the coefficient sitting outside of sin((π/2) x1) is just 1. Similarly, the coefficient for x2 is very close to 3. Here we have residuals versus fitted values; again, nothing too surprising. We're looking for random scatter around the line y = 0, and that's pretty much what we see. Now that we've transformed x1, the Q-Q plot looks pretty good: out in the tails there's just a slight deviation from normality, which is not a serious issue. Finally, the plot of the predicted values against the actual values looks much better than it did for the plain linear model. The predictions are very close to the actual values, with just a little scatter off of the line y = x. Notice down at the bottom that we have a much lower MSE than we had previously: for the plain linear model our MSE was about 0.5, and this MSE is much lower. That suggests a much better fit, which again is not too surprising.

Let's finally get to the main event of this video, which is to showcase the generalized additive model. When we don't know the true relationship, we can try to estimate it using a generalized additive model. In this case, it's technically just an additive model, because we have a normal response; it would be a generalized additive model if we had, say, some other type of response, like a Poisson or a binomial. But here I'll use the terms generalized additive model, additive model, and GAM interchangeably. The first thing we should do is load the mgcv library, which gives us access to the gam() function along with several other functions and datasets. Importantly, notice that the formula looks a lot like the formulas for linear models and generalized linear models, except for one piece: here we are telling R to fit a smooth, rather than just a linear component, for x1. The sketch below shows the call, alongside the transformed "true" model for comparison.
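Here is a minimal sketch of both fits and the diagnostics we'll walk through next. The object names, and the `dat` data frame, carry over from the simulation sketch above; the notebook's code may differ.

```r
library(mgcv)  # provides gam() and s()

# The "cheating" model: fit the exact parametric form we simulated
true_fit <- lm(y ~ sin(pi / 2 * x1) + x2 + x3, data = dat)
summary(true_fit)        # coefficients land near 1 (sine term) and 3 (x2)
mean(resid(true_fit)^2)  # MSE: much lower than the plain linear model's

# The GAM: s(x1) requests a smooth for x1; x2 and x3 stay parametric
gam_fit <- gam(y ~ s(x1) + x2 + x3, data = dat, family = gaussian)

# The same diagnostics as before
plot(fitted(gam_fit), resid(gam_fit))            # residuals vs. fitted
qqnorm(resid(gam_fit)); qqline(resid(gam_fit))   # Q-Q plot
plot(fitted(gam_fit), dat$y); abline(0, 1)       # predicted vs. actual
plot(gam_fit)                                    # the estimated smooth for x1
mean(resid(gam_fit)^2)   # MSE: close to the true model's
```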
We're allowing for a nonlinear fit for x1, which is exactly what we want. I'm making a decision here to fit x2 parametrically; the reason is that in the explorations above, when we fit our loess, it seemed like a linear relationship was reasonable. Of course, you could fit a smooth here too: you could wrap x2 in s() and allow the gam() function to fit a nonlinear relationship, and then perhaps rule that nonlinear relationship out. But here I'm just assuming that we have no evidence that we need a nonlinear fit for x2. The next argument specifies the dataset, and the last argument specifies the family. The family argument, just as in the glm() function, lets us specify gaussian for Gaussian, or normal, errors; if we had a Poisson response we would specify poisson, and for a binomial response, binomial.

Let's look at the same metrics. First, residuals versus fitted values: again, not super interesting, it looks pretty random, with no major trends. Second, we can look at the Q-Q plot to see whether our normality assumption is met, and this looks pretty good. There's hardly any deviation at the tails, which suggests that we've met the normality assumption to a sufficient degree. Finally, the plot of fitted versus actual values looks really good; it's very close to what we saw when fitting the true relationship. That tells us that, at least in terms of prediction, the GAM is doing very nearly as well as the true model, and much better than the plain linear model. You'll see down at the bottom that the MSE for this model is, up to rounding error, the same as the MSE for the exact model. That tells you the GAM is doing a really good job of modeling these data.

That should give you some basics on working with GAMs in R. In the next lesson, we'll take a look at some inference techniques with generalized additive models. Just note that we'll get a lot more practice implementing GAMs on real data in future lessons.
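As a small pointer toward that next lesson: the usual starting points for inference in mgcv are summary() and gam.check(), applied here to the gam_fit object from the sketch above.

```r
summary(gam_fit)    # approximate significance of the smooth and parametric terms
gam.check(gam_fit)  # residual diagnostics plus a check on the basis dimension
```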