In the last video, we explored model uncertainty using posterior probabilities of models based on BIC. In this video, we'll continue the example with the kid's cognitive scores, and show how to visualize model uncertainty and account for it in analyses using Bayesian model averaging. In the last video, we used the BAS package in R to find the posterior distribution under model uncertainty for the kid's cognitive scores, using all possible subsets of the predictors: mom's high school status, mom's IQ, whether the mom worked during the first three years of the kid's life, and mom's age. With the four predictors, there are 2 to the 4, or 16, possible models. Now we're going to look at a visualization of the models that illustrates model uncertainty, beyond the top five models that we considered previously. In R, the image function may be used to create an image of the model space that looks like a crossword puzzle. Note that I have not rotated the image, to allow the labels to be more easily visible in the video, while the default option is to rotate by 90 degrees. This image has rows that correspond to each of the variables and the intercept, with labels for the variables on the y-axis. The x-axis corresponds to the possible models. These are sorted by their posterior probability, from the best at the left to the worst at the right, with the rank on the top x-axis. Each column represents one of the 16 models. The variables that are excluded from a model are shown in black for each column, while the variables that are included are colored, with the color related to the log posterior probability. We can use this plot to see that model 1 includes high school and IQ, but not age or work. Mom's IQ is in all of the top eight models. And because it is in exactly half of the models, it has to be black in the last eight, indicating that it's excluded from the eight remaining models, which all have lower probability.
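As a quick check on the count of models, here is a small Python sketch (not the course's R code) that enumerates every subset of the four predictors; the short names hs, iq, work, and age are just illustrative labels.

```python
from itertools import combinations

# Short, illustrative labels for the four predictors: mom's high school
# status, mom's IQ, whether mom worked, and mom's age.
predictors = ["hs", "iq", "work", "age"]

# Enumerate every subset of the predictors, from the intercept-only model
# (the empty subset) up to the full model: 2^4 = 16 models in all.
models = [subset
          for k in range(len(predictors) + 1)
          for subset in combinations(predictors, k)]

print(len(models))  # → 16
```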
The color of each column is proportional to the log of the posterior probability, shown on the lower x-axis of the graph. Models that are the same color have similar log posterior probabilities. This allows us to view models that cluster together, whose marginal likelihoods differ by amounts that are not worth more than a bare mention. Let's see how we can make inferences about a quantity delta using all of the models. Delta could be Y*, a future observation; one of the regression coefficients beta; the indicator that a coefficient is non-zero; or even the posterior density of a regression coefficient. The posterior density for delta is obtained as a weighted average of the densities for delta under each of the individual models, where the weights are the posterior probabilities of the models. Models with high probability receive more weight, while models with low probability are discounted. Similarly, we can find the posterior expected value of delta. Here we use the model-specific expectations weighted by their posterior probabilities. Both of these expressions have the form of a weighted average over models, hence the name Bayesian model averaging. Since the weights are probabilities and have to sum to one, if the best model has posterior probability one, all of the weight will be placed on that single best model, and using BMA would be equivalent to selecting the model with the highest posterior probability. However, if there are several models that receive substantial probability, they would all be included in the inference, accounting for the uncertainty about the true model. For example, under BMA, our best prediction under squared error loss is the posterior predictive mean. This is given by the sum of the posterior predictive means for each model, weighted by their respective posterior probabilities. We can obtain summaries for the coefficients. This produces a table that is a Bayesian analogue to the regression coefficient summary from lm.
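To make the weighted-average idea concrete, here is a minimal Python sketch of the BMA expectation; the posterior probabilities and model-specific means below are made-up numbers for illustration, not values from the cognitive score example.

```python
# Posterior model probabilities p(M_j | data); as probabilities, they sum to 1.
post_probs = [0.50, 0.30, 0.15, 0.05]

# Model-specific posterior expectations E[delta | M_j, data] for some
# quantity delta (e.g. a future observation Y* or a coefficient beta).
model_means = [1.2, 1.0, 0.8, 0.0]

# BMA: weight each model's expectation by its posterior probability.
bma_mean = sum(p * m for p, m in zip(post_probs, model_means))
print(round(bma_mean, 2))  # → 1.02

# If one model had posterior probability 1, this sum would reduce to that
# single model's expectation, i.e. selecting the highest-probability model.
```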
The first column is the posterior mean of the coefficient, or the value that we expect under Bayesian model averaging, which would be used for prediction. The posterior SD, or standard deviation, provides a measure of the variability of the coefficient. An approximate range of plausible values for each of the coefficients may be obtained via the empirical rule, using the mean plus or minus two standard deviations. This applies if the posterior distribution is approximately symmetric and unimodal. Last, we have the posterior probability that the coefficient is non-zero, which replaces the p-value. Here, we can see that we are virtually certain that mom's IQ should be included, with a probability of approximately 1. We're 61% sure that mom's high school status should be included, while the probabilities that the coefficients for working and age are non-zero, taking into account uncertainty in the other variables, are 11% and 7%, respectively. Since each coefficient is either zero or non-zero, this means that there is a 0.93 probability that the coefficient for age is 0, after adjusting for all of the other variables. Now that we've looked at the collection of models, let's turn to visualizing plausible values for the coefficients, taking into account that there is uncertainty about the best model. This plot of the posterior distributions for each of the regression coefficients is displayed in a two-by-two table. Let's focus on the plot for mom's high school status. The vertical bar represents the posterior probability that the coefficient is 0, around 39%. The bell-shaped curve represents the density of plausible values from all the models where the coefficient was non-zero. This is scaled so that the height of the density for non-zero values is the probability that the coefficient is non-zero. For mom's IQ, the probability that the coefficient is 0 is so small that no vertical bar is visible.
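These summaries are easy to reproduce by hand. In the Python sketch below, the posterior mean and SD are made-up illustrative numbers; only the 7% inclusion probability for age comes from the example.

```python
# Empirical rule: an approximate 95% range of plausible values is the
# posterior mean plus or minus two posterior standard deviations
# (reasonable when the posterior is roughly symmetric and unimodal).
post_mean, post_sd = 0.56, 0.06          # illustrative values only
lower, upper = post_mean - 2 * post_sd, post_mean + 2 * post_sd
print(round(lower, 2), round(upper, 2))  # → 0.44 0.68

# Complement rule: if the posterior probability that the coefficient for
# age is non-zero is 7%, the probability that it is zero is 93%.
p_nonzero_age = 0.07
p_zero_age = 1 - p_nonzero_age
print(round(p_zero_age, 2))  # → 0.93
```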
The range of plausible values is centered far from 0, also reflecting our beliefs, after seeing the data, that this variable is important. Mom's age has a much higher probability of being 0, hence the taller bar. And even in the models where it is included, the distribution of plausible values overlaps 0. We have shown how Bayesian model averaging can be used to address model uncertainty, using the ensemble of models for inference rather than selecting a single model. We've applied this to the kid's cognitive score example using software in R. After successful completion of this module, you should be able to interpret the output under BMA. In this example, we've illustrated the concepts using BIC and a reference prior on the coefficients. In the next collection of videos, we will explore alternative prior distributions as part of prior sensitivity. And we'll look at algorithms to explore the space of models when it is no longer possible to enumerate all possible models.