As an example of linear regression, we'll look at the line heart data from the car package in R. Let's first load this package, library("car"). And to access a data set that comes with a package, we could use the data function and tell it which data set we want. It's the Leinhardt data set and before we look at the data, let's learn a little bit about it from the documentation. Through question mark Leinhardt, if we run that line of code it brings up the documentation page for these data. The leinhardt dataset contains data on infant mortality for 105 nations of the world around the year 1970. It contains four variables per capita income in US dollars. Infant mortality rate per 1,000 live births. A factor, or categorical variable indicating which region of the world this country is located. And finally, whether this country exports oil as an indicator variable. If we want to learn more about the data set, we can also look to these different sources were references. Now that we know a little bit about these data set, let's explore it. First, I'm going to look at the head of the data set. The first few lines were we see these four different variables, income, infant mortality rate, region, and whether the country exports oil. For the first several countries in this data set. We can also look at the structure of the data set, to see what types of variables these all are. str(Leinhardt) And here we can see that income is an integer variable. Infant is numeric or floating-point numeric. Region is a factor or a categorical variable with four levels. And oil is a factor variable with two levels, no and yes. To investigate the marginal relationships between each of these four variables, we can plot each of the scatter plots using pairs. So we say pairs(Leinhardt) which creates scatter plots for every pair of variables together. For example, this plot is the scatter plot between infant mortality rate and income where infant mortality rate is on the X axis and income is on the Y axis. The same goes for categorical variables. And you can see that there are two versions of the same plot, this one is between infant and income. And this one is also between infant and income with the axis reversed. Instead of modeling all these variables at once, let's start with a simple linear regression model that relates infant mortality to per capita income. So let's do a plot for that one as well. We're going to plot infant mortality against income where our data set is Leinhardt. As you look at this scatter plot, you might be asking yourself, is a linear model actually appropriate here as a model between income and infant mortality? This relationship certainly doesn't look linear. In fact, both of these variables are extremely right skewed, to see that let's look at a histogram of each one individually. We'll do a hist of the infant variable. And immediately you can see how right skewed it is. And now let's look at a histogram. For income. Again, that one is very right-skewed. Most values are very small, and then they scatter across large values. The short answer is no. A linear regression is not appropriate for these variables. However, we did notice that these variables are positive valued and very strongly right skewed. That's a hint that we could try looking at this on the log scale. So let's recreate these plots now with variables that are transformed. So let's look at Leinhardt and we're going to save a new variable, let's call it logInfant. It's going to be the log of Leinhardt infant mortality rate. We'll run that and let's create another variable that's just like it for the income variable. And we'll run that. Now let's look at the relationship between logincome and loginfant. We'll plot loginfant against logincome. Where the data are still Leinhardt. I would say that a linear model is much more appropriate choice now. You can imagine fitting a line through these data points. Let's start a new section and call it modelling. Remember that the reference Bayesian analysis with the non-informative prior or a flat prior on the coefficient in the linear model is available directly in R using the LM function. Let's fit one of those and call it L-mod for linear model. LM where we used the same syntaxes we did to plot these variablesw loginfont will be our response variable. Against logincome, where the data is Leinhardt. We can run this linear model and look at a summary of the results. Under our non-informative flat prior, here are the estimates, the posterior mean estimates for the coefficients. These estimates are very large relative to their standard error, or in this case, standard deviation of the posterior distribution. And so they appear to be very statistically significant. Residual standard error gives us an estimate of the left over variance after fitting the model. The r squared statistics tell us how much of the variability is explained by the linear model, in this case about half. Also, don't miss this crucial line. It tells us that four of the observations in the dataset were deleted. They were not used in the model because they contained missing values. There are statistical methods for dealing with missing values, but we're not going to discuss them in this course. So we're going to follow the example of the linear model function and we're going to delete the observations that have missing values before we perform analysis. So, what we can do here is save a new data set. We'll cal it dav, and it will come from the data set Leinhardt where we omit missing values. To do this in R we use na.omit and we give it the data set. Let's look at the original dataset really fast just so we can see which countries it was that we omitted. If we run Leinhardt we see that there are missing values for Nepal. There are missing values for Laos. For Haiti. And for Iran. Those four countries will not be included in our analysis.