In the two bivariate graphing examples that we've covered, we filled in the left side of our graphing decisions flow chart. Each example showed a situation in which the response variable was categorical. Let's talk now about the right side of the flow chart, where the response variable is quantitative.
We will now change our research question, using an example from the Gapminder data set. Here, we're interested in the association between the percent of the population living in urban settings within each country and the country's rate of internet use, that is, the percent of people with access to the World Wide Web. Below you can see a full description of these variables from the Gapminder code book. For this research question, both the response and explanatory variables are quantitative. A bar chart would not work here. The graph of choice is a scatter plot.
A scatter plot, by definition, is a graph of plotted points that shows the relationship between two quantitative variables. In a scatter plot, each observation is plotted as a point whose position is given by its values on the explanatory and response variables. This scatter plot shows a sample of 11 observations according to the relationship between height and weight. In the lower left-hand side of the graph, we see individuals with relatively low height and weight, and in the upper right-hand portion, we see individuals with relatively high height and weight.
Turning to the Gapminder data set, let's examine the relationship between the percent of the population living in urban settings and the rate of internet use. Since we're using a different data set, we will begin with a new program. It begins with the library import statements. Then, we load the Gapminder data set. When I run this code, I get an error message that reads "ValueError: Unable to parse string".
What this means is that when Python read in the Gapminder data set, it read in the empty cells, which are missing values, as blanks instead of NaN. This in turn causes the error message when Python encounters an empty cell as it tries to convert the variable to numeric. By default, the pandas read_csv function should convert empty cells to NaN when it reads in the data, so this shouldn't happen, but there is a simple fix if it does: you just need to add one line of code. Here, data is the name of our DataFrame; data.replace tells Python to replace, in parentheses, the pattern for empty cells with NaN; and regex=True tells Python to treat that pattern as a regular expression, so that the replacement is made for every empty cell. Then you can rerun the code that converts the variables to numeric without any errors, and add describe statements in order to examine the central tendency and the spread, or variability, of both urban rate and internet use rate.
We can see that for urban rate, the mean percent of the population living in urban settings is about 57 percent. The standard deviation is about 24 percent, suggesting that there is quite a bit of variability from country to country in terms of the proportion of the population living in urban settings. For internet use rate, on average about 35.6 percent of the population across these individual countries has access to the World Wide Web. Again, with a standard deviation of 27.8 percent, there seems to be quite a bit of variability from country to country.
But is there a relationship between these two variables? We can explore this question visually with a scatter plot. In Python, we can create scatter plots with the seaborn package. This time we use the regplot function from seaborn. We name the quantitative explanatory variable for the x-axis, here urban rate, and the quantitative response variable for the y-axis, internet use rate. We also specify the DataFrame, here called data, where the variables can be found.
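The data-cleaning steps described above (replace blank cells with NaN, convert to numeric, then describe) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the column names urbanrate and internetuserate, the regular-expression pattern, and the four made-up rows are illustrative choices, not the lecture's own code or real Gapminder values.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows standing in for the Gapminder CSV (NOT real Gapminder values).
# Cells holding a single space mimic the blank missing values described above.
csv_text = (
    "country,urbanrate,internetuserate\n"
    "A,24.0,3.5\n"
    "B, ,44.9\n"
    "C,65.2, \n"
    "D,88.9,81.0\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Converting at this point would typically fail on the blank cells.
try:
    pd.to_numeric(data["urbanrate"])
except ValueError as err:
    print("conversion failed:", err)

# Replace cells that are empty or whitespace-only with NaN.
# The pattern r"^\s*$" is an assumed way of matching an "empty cell".
data = data.replace(r"^\s*$", np.nan, regex=True)

# With the blanks gone, the conversion to numeric goes through.
for col in ["urbanrate", "internetuserate"]:
    data[col] = pd.to_numeric(data[col])

# Count, mean, standard deviation, and quartiles for each variable.
print(data["urbanrate"].describe())
print(data["internetuserate"].describe())
```

With the real data set, the StringIO object would be replaced by the path to the CSV file; the rest of the steps stay the same.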
For this example, I will also ask Python to suppress the line of best fit with fit_reg=False, since the default is to add this line. Again, the xlabel function labels the x-axis and the ylabel function labels the y-axis. Titles are created with the title function.
To characterize the relationship that we see in a scatter plot, it can be helpful to also allow Python to draw a line of best fit through the observations, as a way of trying to determine how the dots line up. That is, do they seem to line up in a positive or a negative direction, with a positive or a negative slope? An increasing slope, as we can see here between urban rate and internet use rate, indicates that the relationship is positive: higher values on one of the variables seem to be associated with higher values on the other, and lower values on one are associated with lower values on the other. The code is identical, but I drop fit_reg=False in order to display the line of best fit. Here, we see what looks like a positive relationship between urban rate and internet use rate.
Here's another example from the Gapminder data set, exploring the relationship between income per person in each country and each country's internet use rate. Again, if we consider a linear pattern, the relationship seems to be positive: higher income is associated with higher internet use, and lower income with lower internet use.
The strength of the relationship in a scatter plot is determined by how closely the data points follow the form. In this scatter plot, the data points follow the linear pattern quite closely. This is an example of a very strong relationship. In this other scatter plot, the points also follow the linear pattern, but much less closely, so we can say that this is a weaker relationship. The form of the relationship is its general shape.
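As a rough illustration of the two versions of the plot discussed above, the sketch below draws the scatter plot once with fit_reg=False and once with the default line of best fit. The column names and the ten data points are made up for illustration and are not Gapminder values; the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots on screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values with a loosely positive trend (NOT real Gapminder data).
data = pd.DataFrame({
    "urbanrate": [12.0, 20.5, 31.0, 38.2, 47.9, 55.0, 63.4, 71.8, 80.1, 92.3],
    "internetuserate": [2.1, 9.0, 6.5, 20.3, 18.0, 35.2, 30.9, 52.4, 61.0, 75.5],
})

# Version 1: points only, with the line of best fit suppressed.
fig1, ax_points = plt.subplots()
sns.regplot(x="urbanrate", y="internetuserate", data=data,
            fit_reg=False, ax=ax_points)
plt.xlabel("Urban Rate")
plt.ylabel("Internet Use Rate")
plt.title("Scatter plot without a line of best fit")

# Version 2: identical call, but fit_reg is left at its default (True).
fig2, ax_fit = plt.subplots()
sns.regplot(x="urbanrate", y="internetuserate", data=data, ax=ax_fit)
plt.xlabel("Urban Rate")
plt.ylabel("Internet Use Rate")
plt.title("Scatter plot with a line of best fit")
```

In an interactive session, calling plt.show() after each figure would display it.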
When identifying the form, we try to find the simplest way to describe the shape of the scatter plot. There are many possible forms. As we saw, a positive or increasing relationship means that an increase in one of the variables is associated with an increase in the other, and a negative or decreasing relationship means that an increase in one of the variables is associated with a decrease in the other, as shown in this central scatter plot. Not all relationships can be classified as either positive or negative. Further, if you can't plausibly put a line through the dots, if the dots are just an amorphous cloud of specks on the graph, then there may be no relationship.
For various reasons, the scatter plot is sometimes limited in its ability to let us evaluate a relationship visually. Here is a scatter plot for income per person by rate of HIV among 15 to 49-year-olds. Since most countries have a low HIV rate per 100 people, the dots on the scatter plot seem to clump in the lower part of the graph. To try to get a better sense of whether or not there is a relationship between these two variables, we could categorize, or group, the explanatory variable, income. We need to add the appropriate data management syntax to the program in order to create these categories, INCOMEGRP4. I also include a value_counts statement so that we can examine the distribution of this new variable.
After the program has been saved and run, we can see the distribution for income group. The four ordered groups we created show that there are 51 countries in the lowest income group, 51 countries in the next 25 percent, 50 in the next, and 51 countries in the highest 25 percent in terms of income. With this new categorical explanatory variable, we're now ready to create the last type of bivariate graph, that is, the categorical to quantitative bar chart.
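One way the grouping step described above could look is sketched here with the pandas qcut function, which splits a quantitative variable into equal-sized quantile groups. This is an assumption about the approach, not the lecture's exact syntax; the column name incomeperperson, the group labels, and the eight income values are made up for illustration.

```python
import pandas as pd

# Made-up income values (NOT real Gapminder data).
data = pd.DataFrame({
    "incomeperperson": [300, 600, 1200, 2500, 5000, 9000, 20000, 40000],
})

# Hypothetical labels for the four ordered quartile groups.
labels = ["1=lowest 25%", "2=second 25%", "3=third 25%", "4=highest 25%"]

# Split income into 4 quantile-based groups of (roughly) equal size.
data["INCOMEGRP4"] = pd.qcut(data["incomeperperson"], 4, labels=labels)

# Distribution of the new categorical variable, kept in group order.
counts = data["INCOMEGRP4"].value_counts(sort=False)
print(counts)
```

Because qcut cuts at quantiles rather than at fixed income values, each group ends up with about a quarter of the observations, which matches the near-equal group sizes reported above.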
The code we will use is identical to the code used for the categorical to categorical graph, but what will be plotted on the y-axis is the mean HIV rate. In this bar chart, we can see differences in HIV rate based on countries' income-per-person groups, and the relationship seems to be linear. As you can also see from the y-axis, though, the differences between the mean HIV rates for the income groups are very small, that is, less than two percent. Also, what linear relationship we do see seems to be negative: higher HIV rates are seen in lower-income countries compared to higher-income countries.
We've worked through each type of bivariate, or two-variable, graph, highlighting when and how each should be used to visualize a relationship. Now, let's very briefly summarize. When visualizing a categorical to categorical relationship, we use a bar chart with the explanatory categories on the x-axis and the proportion of our response variable on the y-axis. When visualizing a categorical to quantitative relationship, we use a bar chart with the explanatory categories on the x-axis and the mean of our response variable on the y-axis. When visualizing a quantitative to quantitative relationship, we use a scatter plot, in which each observation is displayed according to its values on the explanatory and response variables. Use these basic guidelines, as well as the graphing decisions flow chart, to visualize the relationships between your own variables of interest.
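A categorical to quantitative bar chart like the one described above can be sketched as follows. The lecture's own plotting call is not shown in this transcript, so this sketch swaps in seaborn's barplot function, whose default estimator is the mean; the column names and all values are made up for illustration and are not Gapminder data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view the plot on screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up income and HIV-rate values (NOT real Gapminder data).
data = pd.DataFrame({
    "incomeperperson": [300, 600, 1200, 2500, 5000, 9000, 20000, 40000],
    "hivrate": [2.0, 1.0, 1.5, 0.5, 0.6, 0.4, 0.3, 0.1],
})

# Four ordered income groups, as in the grouping step (labels are hypothetical).
labels = ["1=lowest 25%", "2=second 25%", "3=third 25%", "4=highest 25%"]
data["INCOMEGRP4"] = pd.qcut(data["incomeperperson"], 4, labels=labels)

# The numbers the bars represent: mean HIV rate within each income group.
group_means = data.groupby("INCOMEGRP4", observed=True)["hivrate"].mean()
print(group_means)

# Bar chart: explanatory categories on the x-axis, mean response on the y-axis.
fig, ax = plt.subplots()
sns.barplot(x="INCOMEGRP4", y="hivrate", data=data, ax=ax)
plt.xlabel("Income group")
plt.ylabel("Mean HIV rate")
plt.title("Mean HIV rate by income group")
```

Printing the group means alongside the chart makes it easy to check how small the differences between bars actually are, which is the caution raised about reading the y-axis.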