So let's see how we test a lasso regression in SAS. Following my libname statement and data step, which I am using to call in the data set, I'll do a little extra data management. Namely, I want to create a variable for gender called male that is coded 0 for female and 1 for male, like the other binary variables in the data set, and delete observations with missing data on any of the variables. I will use the cmiss function to delete observations with missing data, and the argument of _all_ to tell SAS to do this for every variable in the data set. I will also turn on ODS graphics with the statement ods graphics on. ODS stands for output delivery system, which manages output and displays, such as those in HTML. SAS will not print any plots if ODS graphics is not turned on. Then, I'll use the surveyselect procedure to randomly split my data set into a training data set consisting of 70% of the total observations, and a test data set consisting of the other 30% of the observations. data=new specifies the name of my managed input data set, and out= specifies the name of the randomly split output data set, which I will call traintest. With it, we include the seed option, which allows us to specify a random number seed to ensure that the data are split the same way if I run the code again. The samprate option tells SAS to split the input data set so that 70% of the observations are designated as training observations, and the remaining 30% are designated as test observations. method=srs specifies that the data are to be split using simple random sampling. And the outall option tells SAS to include both the training and test observations in a single output data set that has a new variable called selected, to indicate whether an observation belongs to the training set or the test set. I will use the glmselect procedure to test my lasso regression model.
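The data management and split described above can be sketched in SAS roughly as follows. This is a hedged reconstruction, not the course's exact code: the library path, the source data set name, and the original gender variable (shown here as BIO_SEX with 1 = male, 2 = female) are assumptions; the names new, male, traintest, the selected indicator, and the 70/30 simple random split follow the description above.

```sas
* hedged sketch -- library path and source data set name are assumptions;
libname mydata "/path/to/data" access=readonly;

data new;
  set mydata.sourcedata;
  * recode gender to 0 = female, 1 = male, like the other binary variables;
  * BIO_SEX and its 1 = male, 2 = female coding are assumptions;
  if BIO_SEX = 1 then male = 1;
  else if BIO_SEX = 2 then male = 0;
  * delete observations with missing data on any variable;
  if cmiss(of _all_) then delete;
run;

ods graphics on;

* split into 70% training / 30% test by simple random sampling;
proc surveyselect data=new out=traintest seed=123
     samprate=0.7 method=srs outall;
run;
```

The outall option writes every observation to traintest along with the indicator variable selected (1 = training, 0 = test).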
data=traintest tells SAS to use the randomly split data set, and the plots=all option asks that all plots associated with the lasso regression be printed. With it we include the seed option, which allows us to specify a random number seed to be used in the cross-validation process. The partition statement assigns each observation a role based on the variable called selected, which indicates whether the observation is a training or test observation. Observations with a value of one on the selected variable are assigned the role of training observation, and observations with a value of zero are assigned the role of test observation. The model statement specifies the regression model, in which my response variable, school connectedness, is equal to the list of the 23 candidate predictor variables. After the slash, we specify the options we want to use to test the model. The selection option specifies which method to use to compute the parameters for variable selection. In this example, I will use the LAR algorithm, which stands for Least Angle Regression. This algorithm starts with no predictors in the model and adds a predictor at each step. It first adds the predictor that is most correlated with the response variable and moves its coefficient toward its least squares estimate until another predictor is equally correlated with the model residual. It adds this predictor to the model and starts the least squares estimation process over again with both variables. The LAR algorithm continues this process until it has tested all the predictors. Parameter estimates at any step are shrunk, and predictors with coefficients that are shrunk to zero are removed from the model, and the process starts all over again. The choose=cv option asks SAS to use cross-validation to choose the final statistical model. stop=none ensures that the model doesn't stop running until each of the candidate predictor variables is tested.
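The glmselect call described above can be sketched roughly as follows. This is a hedged reconstruction: the seed value is arbitrary, and the predictor names shown (DEP1, ESTEEM1, DEVIANT1, male) are placeholder assumptions standing in for the full list of 23 candidate predictors, which is not reproduced here.

```sas
proc glmselect data=traintest plots=all seed=123;
  * assign roles from the selected variable created by surveyselect;
  partition rolevar=selected (train="1" test="0");
  * predictor names are placeholders; the full list of 23 candidates goes here;
  model SCHCONN1 = DEP1 ESTEEM1 DEVIANT1 male
        / selection=lar(choose=cv stop=none) cvmethod=random(10);
run;
```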
Finally, cvmethod=random(10) specifies that I use a k-fold cross-validation method with ten randomly selected folds. So, what I'm doing here is using k-fold cross-validation, in which one fold is treated as a validation set and the model is estimated on the training data using the remaining nine folds. At each step of the estimation process, a new predictor is entered into the model, and the mean squared error for the validation fold is calculated. This is repeated so that each of the ten folds serves once as the validation set, and the resulting mean squared errors are averaged. The model with the lowest average mean squared error is selected by SAS as the best model. In lasso regression, the penalty term is not fair if the predictor variables are not on the same scale, meaning that not all the predictors will get the same penalty. The SAS glmselect procedure handles this by automatically standardizing the predictor variables, so that they all have a mean equal to zero and a standard deviation equal to one, which places them all on the same scale. Let's go ahead and run the code and take a look at the results. The first thing we see is some information about the surveyselect procedure we used to split the observations in the total data set into training and test data. Next, we see information about the lasso regression. It shows the dependent variable, SCHCONN1, school connectedness, and the selection method that I used. It also shows that the criterion I used for choosing the best model was k=10-fold cross-validation, with random assignment of observations to the folds. We can also see the total number of observations in the data set and the number of observations used for training and testing the statistical models. The number of parameters to be estimated is 24: the intercept plus the 23 predictors. Next is the table with the LAR selection information. It shows the steps in the analysis and the variable that is entered at each step.
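The scaling issue mentioned above can be made concrete. The lasso estimate minimizes a penalized least squares criterion, and because the single penalty parameter applies to every coefficient alike, glmselect first standardizes each predictor to mean zero and standard deviation one:

```latex
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
\qquad
x_{ij} \;\leftarrow\; \frac{x_{ij} - \bar{x}_j}{s_j}
```

Without standardization, a predictor measured on a large scale would have a numerically small coefficient and so would be penalized less than the same predictor measured on a small scale.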
The ASE and Test ASE are the average squared error, which is the same as the mean squared error, for the training data and the test data. You can see that at the beginning, there are no predictors in the model, just the intercept. Then variables are entered one at a time in order of the magnitude of the reduction in the mean, or average, squared error, so they are ordered in terms of how important they are in predicting school connectedness. According to the lasso regression results, it appears that the most important predictor of school connectedness was depression, followed by self-esteem, and so on. You can also see how the average squared error declines as variables are added to the model, indicating that prediction accuracy improves as each variable is added. The CV PRESS shows the residual sum of squares summed across the cross-validation folds. There's an asterisk at step 16; this is the model selected as the best model by the procedure. You can see that this is the model with the lowest summed residual sum of squares, and that adding other variables to this model actually increases it. Finally, you can also see that the training data ASE continues to decline as variables are added. This is to be expected as model complexity increases, and it is an example of the bias-variance tradeoff. If we go back to our graph from the bias-variance tradeoff video, it shows what happens to prediction error as a model becomes more complex by adding more predictors. We can see that the decrease in the training ASE means that prediction error decreases as more variables are added to the model, and consequently, bias is lower. However, if you look at the curve for the test data, you can see that as the model becomes more complex by adding more predictors, test prediction error eventually increases again, because the increase in variance outweighs the reduction in bias. The model that is selected by the specified selection criteria as the best model is the one that falls somewhere in here.
It is the point where bias and variance in the test prediction error are jointly lowest. If a model with fewer predictors is chosen, then the model is at risk of being underfitted. If a model with more predictors is chosen, then the model is at risk of being overfitted. SAS also provides some nice plots. The first plot shows the change in the regression coefficients at each step, and the vertical line represents the selected model. This plot shows the relative importance of the predictor selected at each step of the selection process, how the regression coefficients changed with the addition of a new predictor at each step, as well as the steps at which each variable entered the model. For example, as also indicated in the summary table above, depression and self-esteem had the largest regression coefficients, followed by engaging in deviant behavior. Depression and deviant behavior were negatively associated with school connectedness, and self-esteem was positively associated with school connectedness. The lower plot shows how the chosen selection criterion, in this example CVPRESS, which is the residual sum of squares summed across all the cross-validation folds in the training set, changes as variables are added to the model. Initially, it decreases rapidly and then levels off to a point at which adding more predictors doesn't lead to much reduction in the residual sum of squares. The next plot shows at which step in the selection process different selection criteria would choose the best model. Interestingly, the other criteria selected more complex models than the criterion based on cross-validation, possibly selecting overfitted models. The final plot shows the change in the average, or mean, squared error at each step in the process. As expected, the selected model was less accurate in predicting school connectedness in the test data, but the test average squared error at each step was pretty close to the training average squared error overall.
This suggests that prediction accuracy was pretty stable across the two data sets. Finally, the output shows the R-square and adjusted R-square for the selected model and the mean squared error for both the training and test data. It also shows the estimated regression coefficients for the selected model.