0:00

In this video on visualizing numerical data, we will discuss scatter plots for paired data and other visualizations for describing distributions of numerical variables.

The data come from Gapminder, which pulls this information from a variety of data sources. We will be working with two numerical variables: income per person, in US dollars, and life expectancy, in years, for the year 2012. Each observation in this data set is a country. The data set contains data from most but not all countries, since this information wasn't available for certain countries.

A common tool for visualizing the relationship between two numerical variables is a scatter plot. To identify the explanatory variable in a pair of variables, we identify which of the two is suspected of affecting the other and plan an appropriate analysis. Since we might suspect that the economic wealth of a country affects the average life expectancy of its people, we have set up our analysis with income as the explanatory variable and life expectancy as the response variable. Generally, in a scatter plot, we place the explanatory variable on the x axis and the response variable on the y axis.

It's very important to note that labeling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if an association between the two variables is identified. We use these labels only to keep track of which variable we suspect affects the other. In fact, since these data are observational and do not come from a randomized controlled experiment, we know that we can only talk about correlation and not causation between the two variables.

So what is the relationship between these two variables? The best way to answer this question is to visualize a line or a curve going through the cloud of data. So here I'm drawing a curve that first shows an increase in life expectancy as income increases, and then the relationship levels off, such that countries with income levels above a certain point still have roughly 80 to 85 years of average life expectancy.

2:39

The shape of the relationship: is it linear, or does it follow some other form? The strength of the relationship: is the relationship strong, indicated by little scatter, or weak, indicated by lots of scatter? And any potential outliers.

3:08

Let's take a closer look at the outliers. Some of them have pretty high income levels: Luxembourg, a rich country with a small population and a high level of income per person; Macao, a special administrative region of China; and Qatar, a country with a small population and lots of oil. Another potential outlier is Nepal, where the life expectancy is considerably higher than what would be expected for its low income level compared to others. These are countries that we would indeed expect to behave differently than the majority of the countries, so it's not surprising that they stand out from the rest.

One naive way of dealing with outliers in data analysis is to immediately exclude them. But we're calling that approach naive because it's often not the right approach. This is a good example of when the outliers might be very interesting cases, and handling them with careful consideration of the research question and other associated variables is important.

Now, let's take a look at the distributions of the variables individually. One good way of visualizing the distribution of a numerical variable is a histogram. In a histogram, data are binned into intervals, and the height of each bar represents the number of cases that fall into that interval. In other words, a histogram provides a view of the data density: higher bars represent where data are relatively more common. For example, we can see that the majority of the countries have average life expectancies between 65 and 85 years.
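The binning step just described can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the plots in the video: the function name `bin_counts`, the choice of bin edges, and the life expectancy values are all made up for the example and are not the Gapminder data.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals.

    Bin i covers [start + i*width, start + (i+1)*width); values outside
    the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# Hypothetical life expectancies in years (illustrative, not the Gapminder data).
life_exp = [48, 52, 57, 61, 63, 66, 68, 70, 71, 72,
            73, 74, 75, 76, 78, 79, 80, 81, 82, 83]

counts = bin_counts(life_exp, start=45, width=10, n_bins=4)

# Text histogram: longer bars mark intervals where data are more common.
for i, c in enumerate(counts):
    lo = 45 + 10 * i
    print(f"{lo}-{lo + 10}: {'#' * c}")
```

Each bar's length is simply the count for its interval, which is all a histogram's bar height encodes.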

Histograms are also very useful for identifying shapes of distributions. In this case, the distribution of life expectancies appears to be left skewed, which is expected due to the leveling off of life expectancies we identified earlier. There's a physiological limit to how long people live, and in most countries people live up to that time. But there are some countries with much lower life expectancies, and fewer and fewer countries with lower and lower life expectancies, resulting in a long left tail.

The distribution of income, on the other hand, is right skewed. Incomes can't be negative, so we have a natural boundary at zero, but there is no real upper limit to how high incomes can go. However, as we go higher and higher, we have fewer and fewer countries with such high levels of personal income, resulting in a long right tail. A shared characteristic between these two distributions is that they're both unimodal. Let's focus on these statements on skewness and modality for a bit.

5:38

First off, skewness. Distributions are said to be skewed to the side of the long tail. In a left skewed distribution, the longer tail is on the left, the negative end. If no skewness is apparent, then the distribution is said to be symmetric. And in a right skewed distribution, the longer tail is on the right, the positive end. As you can see, the best way to assess the shape of distributions is to step back and imagine a smooth curve outlining the distribution, instead of focusing on the jagged edges of the bars in the histogram.
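The video assesses skewness by eye, but a rough numerical check can back up the visual impression. The sketch below is an addition, not something from the video. It computes the moment coefficient of skewness, which is negative when the long tail is on the left and positive when it is on the right, and also prints the mean and median, since the long tail tends to pull the mean toward it. The function name `sample_skewness` and all data values are made up for illustration.

```python
from statistics import fmean, median


def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments).

    Assumes the values are not all identical (otherwise m2 is zero).
    """
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Made-up values for illustration only (not the Gapminder data).
left_skewed = [50, 62, 70, 74, 76, 78, 79, 80, 81, 82]   # long tail toward low values
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]          # long tail toward high values
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, data in [("left", left_skewed), ("right", right_skewed), ("symmetric", symmetric)]:
    print(name, round(sample_skewness(data), 2),
          "mean:", fmean(data), "median:", median(data))
```

A number like this is a complement to the histogram, not a replacement for it: it says nothing about modality or outliers.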

6:30

The distribution that you will work with most closely in an introductory statistics course, the normal distribution, which you may also know as the bell curve, is unimodal. A bimodal distribution might indicate that there are two distinct groups in your data. For example, here's a distribution of heights of individuals at a preschool. The first peak might be the kids and the second might be the teachers. A uniform distribution means there's no apparent trend in the data: high and low values of the variable are equally likely to occur. Here's a distribution of the last digits of a random sample of people's social security numbers. As expected, the data show no trend, as a social security number is just as likely to end with a zero as with a six or a nine.

7:15

Assessing modality, like assessing shape, is also best done by imagining a smooth curve outlining the distribution. Here is a trick: think of the bars of the histogram as wooden blocks, imagine dropping a limp piece of spaghetti over them, and try to imagine how the limp spaghetti would fall over and between the wooden blocks. Peaks that are farther from each other will likely result in distinct, prominent peaks, and peaks that are close to each other, like the ones around zero and two, may not. Identifying the number of modes is not an exact science, and not one that you should dwell on too much. Usually all you need to do is determine whether the distribution is uniform, unimodal, or something else.
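One crude way to imitate the limp spaghetti in code is to smooth the bar heights with a short moving average and then count the bumps that remain. This is only a sketch of the idea, under assumptions of my own: the three-bar window, the helper names `smooth` and `count_peaks`, and the bar heights are all hypothetical, and a different window could give a different answer, which fits the point that counting modes is not an exact science.

```python
def smooth(counts):
    """Three-bar moving average; end bars average over the neighbours that exist."""
    out = []
    for i in range(len(counts)):
        window = counts[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def count_peaks(heights):
    """Count bars strictly higher than both neighbours (edges compare against 0)."""
    padded = [0] + list(heights) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Made-up bar heights: two nearby bumps on the left, one separate bump on the right.
bars = [2, 6, 5, 7, 3, 1, 0, 1, 4, 8, 4, 1]

print("peaks in raw bars:     ", count_peaks(bars))
print("peaks after smoothing: ", count_peaks(smooth(bars)))
```

In this toy example the two neighbouring bumps merge once the bars are smoothed, while the bump that sits far away stays distinct.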

7:57

We should also note that the chosen bin width of the histogram can alter the story the histogram is telling. When the bin width is too wide, we might lose interesting details. When the bin width is too narrow, it might be difficult to get an overall picture of the distribution. The ideal bin width depends on the data you're working with, so you should try playing with it until you're satisfied with the visualization.
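To see the effect of bin width without any plotting library, the same values can be binned three ways. This is a minimal sketch: the `histogram` helper and the height values are made up, loosely echoing the preschool example, and bins with no values are simply left out of the result.

```python
from collections import Counter


def histogram(values, width):
    """Map each bin's left edge to the number of values in [edge, edge + width)."""
    return dict(sorted(Counter((v // width) * width for v in values).items()))


# Made-up heights in cm (children and teachers), for illustration only.
heights = [101, 102, 104, 105, 107, 110, 112, 158, 162, 165, 168, 171, 175]

for width in (100, 20, 2):
    print(f"bin width {width}: {histogram(heights, width)}")
```

With a width of 100 everything lands in one bar and the two groups disappear. With a width of 2 nearly every bar has height one and no overall shape is visible. A width of 20 shows the two clusters.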

8:38

Yet another visualization technique that is especially useful for highlighting outliers is a box plot. A box plot also readily displays the median, the midpoint of the distribution (this is the thick line inside the box), and the interquartile range, the width of the box. According to this box plot, the median life expectancy is roughly 73 years, and the middle 50% of countries have average life expectancies between 65 and 77 years. In addition, countries with life expectancies that are below 48 years are considered to have unusually low life expectancies.
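The video does not say how the cutoff for "unusually low" is chosen. A common convention, assumed here, is to flag points that lie more than 1.5 times the interquartile range beyond the quartiles. Applied to the rounded quartiles quoted above (65 and 77), that rule gives a lower fence of 47 years, in the same neighbourhood as the 48 years mentioned. Exact agreement is not expected, since the quoted quartiles are rounded. The helper name `iqr_fences` and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; points outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles quoted in the transcript for life expectancy.
lower, upper = iqr_fences(65, 77)
print(lower, upper)

# The same steps on made-up values (not the Gapminder data).
sample = [46, 58, 63, 66, 70, 72, 73, 74, 76, 77, 79, 81, 83]
q1, med, q3 = quantiles(sample, n=4, method="inclusive")
lo_fence, hi_fence = iqr_fences(q1, q3)
outliers = [v for v in sample if v < lo_fence or v > hi_fence]
print(med, (q1, q3), outliers)
```

Quartile definitions differ slightly between software packages, so the fences from another tool may not match these to the decimal.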

A box plot of the income distribution shows the same right skewed distribution we've identified before. And the outlying countries with unusually high per person income levels stand out in this visualization as well.

9:28

One way of determining the skewness of a distribution from a box plot is to imagine what the histogram would look like. The peak of the distribution will be roughly around the median, and the tails will extend out as far as the whiskers and outlying points in the box plot.

There's one more visualization method that we will discuss in this video: an intensity map. For certain types of data, like the ones we've been working with in this video, it might be useful to view the spatial distribution. These displays reveal trends in the data that many of the others did not. For example, we can see that both income and life expectancy are lower in Africa, but higher in North America and Europe.