0:02

This lecture is going to be about constructing exploratory graphs, which are

graphs that you make more or less for yourself, so that

you can look at the data and explore what's

going on in the data sets that you're looking at.

And so the basic question that we're trying to ask

here is: why do we use graphs in data analysis?

There are a few different goals that we want to achieve

here: understanding the data properties,

finding basic patterns in the data,

suggesting some modeling strategies, for example

whether we want to use a linear model or a nonlinear model, and debugging an analysis.

So maybe you started doing an analysis, and you

want to figure out what's going wrong.

And then we want to communicate results, too.

So we want to present things to people in a graphical form.

Exploratory graphs are really about the first four things on this list.

So we want to understand properties of our data,

look at some basic patterns, suggest some

modeling strategies, and debug an analysis.

In this particular lecture, we're not going to

be talking much about how to build graphs

for communicating results; we'll talk about that

a little bit later in other lectures.

1:04

So some of the characteristics of exploratory graphs

are that they tend to be made very

quickly, so they are made on the fly as you are looking through the data.

You tend to make a large number of them because you want to look at

many different aspects of the data, look at lots of variables, and

you often have to go through them one at a time.

1:33

and what are the things that need to be followed up?

It's just so you can get a sense of what the data look like.

So if you can think of an

underlying question for essentially all exploratory

data analysis, it might be: what do my data look like?

And so this is the goal for making

exploratory graphs, and a lot of the things that you typically

worry about later on, in terms of the appearance of

a graph or how it's presented, you don't worry about now.

So things like the axes and the

legends are typically going to be cleaned up later.

Things like color and size are primarily

used for you to separate information.

But

if you were going to present this in a different setting, for example in a

presentation or a talk, you might think a

little bit more carefully about color and size.

So I just want to go through a specific example in this lecture, so

we can talk about the various types of exploratory

graphs that you might make using a real dataset.

This dataset involves

ambient air pollution in the United States.

It comes from the US Environmental Protection Agency,

which sets national ambient air quality standards for outdoor air pollution.

2:39

And the particular type of air pollutant we're going

to look at is called fine particle pollution, or PM2.5.

And the standard in the United States is that the annual mean, averaged

over three years at a given location,

cannot exceed 12 micrograms per meter cubed.

And so any state that exceeds this level of 12 micrograms per meter

cubed, when you take the annual mean and average it over three years,

is considered to be out of compliance with the national standards.

And so there's data available on PM2.5 from the US EPA's Air Quality System.

I've downloaded that data and summarized

it here just for the purposes of this presentation.

And the basic underlying question in this

exploratory analysis is just going to be: are

there any counties in the United States that

exceed the national standard for fine particle pollution?

And so

we have monitoring data from many counties

in the United States where air pollution is a problem.

And we want to see whether or not they exceed the

12 microgram per meter cubed standard that's recently been set.

3:42

So the data here are available.

You can read them in using the read.csv function.

And I've put the code here so you can take a look.

And here are the first couple of lines of this data frame.

The first column is the level of PM2.5.

It's the annual mean averaged over the past three years,

so actually it's the years 2008 through 2010. And

then there's the fips column, which is an identifier column

for the county.

4:16

The longitude and latitude are

the coordinates of the monitor in that county.
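A minimal sketch of the loading step in R. The lecture's actual file and code are on the slide, not in this transcript, so the temporary file written here, its values, and the column names other than fips (pm25, region, longitude, latitude) are assumptions made up for illustration.

```r
# Hypothetical stand-in for the lecture's data file: a few made-up rows
# written to a temporary CSV so the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "pm25,fips,region,longitude,latitude",
  "9.77,01003,east,-87.75,30.59",
  "9.99,01027,east,-85.84,33.27",
  "10.69,06019,west,-119.90,36.64",
  "11.34,06029,west,-118.68,35.36"
), csv_path)

# Read fips as character so leading zeros are kept, and region as a factor.
pollution <- read.csv(
  csv_path,
  colClasses = c("numeric", "character", "factor", "numeric", "numeric")
)

head(pollution)  # first rows of the data frame
str(pollution)   # column types
```

Reading the county identifier as text rather than as a number is a design choice in this sketch: it keeps codes such as 01003 from losing their leading zero.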

So remember, the underlying question is that we

want to see whether any of the counties exceed the standard

of 12 micrograms per meter cubed.

Even in an exploratory analysis, where you're just

looking through the data and seeing if there are any problems,

you still want to have

an underlying question that you're thinking about in

the back of your mind, even if it's a

little bit of a vague question at this moment.

Because the question that you ask will drive your thinking about what the data

look like, and so something that may be a problem for one type of question

may not be a problem for a different type of question.

So when you look through the data, you

have to have a background question in mind.

So we want to see if counties exceed this national ambient air quality standard.

So we can look at one-dimensional summaries

of the data, and here are a few that I list out:

the five-number summary,

boxplots, histograms, density plots, and bar plots.

And I'll illustrate a few here. The first

one is the five-number summary, which is really not a plot at all, obviously.

But it's a summary of

some particular aspects of a given variable.

The summary function in R can produce this summary,

and actually it's a six-number summary because it includes the mean.

The traditional five-number summary is the

minimum, the first quartile, the median, the third

quartile, and the maximum, and the summary function just puts the mean in there, too.

So here you see the median

is ten micrograms per meter cubed, which is under the standard.

The maximum is 18.4, which is over.

So there must be some counties that violate

the standard, at least during this time period.

And you can see the third quartile is

11, the first quartile is 8.5, and the minimum is 3.38 here.
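A rough sketch of this kind of summary in R. The pm25 vector below is simulated, not the EPA data, so the numbers it prints will not match the ones quoted in the lecture; the variable names are assumptions.

```r
# Simulated stand-in for the county-level PM2.5 averages (not the EPA data).
set.seed(1)
pm25 <- rgamma(500, shape = 16, rate = 1.6)  # mean 10, sd 2.5, always positive

summary(pm25)   # min, 1st quartile, median, mean, 3rd quartile, max
fivenum(pm25)   # Tukey's five-number summary (no mean)

# Direct check of the underlying question: does any value exceed 12?
standard <- 12
n_over <- sum(pm25 > standard)
any_over <- n_over > 0
```

Comparing the maximum against the standard answers the yes-or-no question; counting the values above it gives a first idea of how widespread the problem is.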

6:09

So here's a boxplot of the PM2.5 variable,

and I just decided to color it blue.

You can see that the median is about

ten, just as we saw in the five-number summary.

And you can see that there are a number of counties

that appear to be over the 12 microgram per meter cubed limit.

And then there are many that are under it,

so those are in compliance.
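A boxplot like this one might be drawn as follows. This is only a sketch on simulated values; the axis label and variable names are assumptions, not the code from the slide.

```r
# Simulated stand-in for the PM2.5 values (not the EPA data).
set.seed(1)
pm25 <- rgamma(500, shape = 16, rate = 1.6)

# boxplot() draws the plot and invisibly returns the statistics it used.
bp <- boxplot(pm25, col = "blue", ylab = "PM2.5 (micrograms per cubic meter)")
bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # points drawn individually beyond the whiskers
```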

6:30

Here's a histogram of the same data.

The nice thing about the histogram is that you get a little

bit more detail about the shape of the distribution of this variable.

And one

nice thing that I like to do is to put a rug underneath the histogram.

The rug plots all of the

points in your dataset along the axis underneath the histogram,

so you can see exactly where the points are that make up the histogram.

On the one hand you get the histogram,

which is a summary, but you also get some fine detail with the rug underneath.

So you can see where the outliers

are and where the bulk of the data are.

You can see that the bulk of the data seem

to be centered around ten or so,

but there are a couple of outliers above 15.
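The histogram-plus-rug combination could look like this in R. Again the data are simulated and the color and labels are arbitrary choices for illustration.

```r
# Simulated stand-in for the PM2.5 values (not the EPA data).
set.seed(1)
pm25 <- rgamma(500, shape = 16, rate = 1.6)

# hist() draws the bars and returns the bin edges and counts.
h <- hist(pm25, col = "gray", main = "", xlab = "PM2.5")

# rug() adds one tick per observation along the x-axis of the current plot.
rug(pm25)
```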

7:12

One thing you can do with a histogram is change the breaks,

essentially the number of bars that go into the histogram.

You can see in the previous slide the bars are

wider, so the histogram is a little bit smoother.

But in this slide I made the

bars smaller by setting breaks equal to 100,

and you can see

that you get a rougher histogram.

One of the things that you have to worry

about when you set the breaks argument is that you

don't want the number to be too big, because then you

get lots of little bars and the histogram becomes very noisy.

Of course, you don't want the number to be too small either, because then you

might get just a few bars and you can't really see the shape of the distribution.

So it's usually useful to play around

with the breaks argument, to adjust the

number of histogram bars, so as

to get the histogram that you like the best.
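A small sketch of the effect of the breaks argument, on simulated data. One detail worth knowing: when breaks is a single number, R treats it as a suggestion and rounds the bin edges to convenient values, so the number of bars you get is usually close to, but not exactly, the number you asked for.

```r
# Simulated stand-in for the PM2.5 values (not the EPA data).
set.seed(1)
pm25 <- rgamma(500, shape = 16, rate = 1.6)

h_default <- hist(pm25, main = "Default breaks")
h_fine <- hist(pm25, breaks = 100, main = "breaks = 100")
rug(pm25)

# Compare how many bars each call actually produced.
length(h_default$counts)
length(h_fine$counts)
```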

7:58

Here's the boxplot that I showed you before, but now I've overlaid a

feature onto the plot, which is just a horizontal line at the level of 12.

Twelve is the national ambient air quality

standard, so I've overlaid the line right

at 12 just to get a sense of

how many counties are above and below the line.

You can see the bulk of them, over 75%,

are below the line, because the entire blue box is below

that line,

and the upper end of the blue box is the 75th percentile, or the third quartile.

So that's useful for pointing out specific features.

You can do this on a histogram, too.

Here I've put a vertical line at 12; that's the black line.

And I put another vertical line,

in magenta, at the median,

so you can see exactly where the median of the distribution is.

That was easy to see in the boxplot because the boxplot contains

the median as a feature.

But the histogram does not, and so it's sometimes nice

to put a vertical line in there to highlight the median.

So you can see again that the median is below the

standard, but there are a number of counties that are above it.
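The overlays described above, on both the boxplot and the histogram, can be sketched with abline. The data are simulated and the line widths are arbitrary; only the value 12 and the black and magenta colors come from the lecture.

```r
# Simulated stand-in for the PM2.5 values (not the EPA data).
set.seed(1)
pm25 <- rgamma(500, shape = 16, rate = 1.6)
standard <- 12

boxplot(pm25, col = "blue")
abline(h = standard)  # horizontal line at the standard

hist(pm25, col = "gray", main = "", xlab = "PM2.5")
abline(v = standard, lwd = 2)                       # black line at the standard
abline(v = median(pm25), col = "magenta", lwd = 4)  # magenta line at the median

# Share of values at or below the standard, to go with the picture.
share_below <- mean(pm25 <= standard)
```

abline adds to whichever plot was drawn last, which is why each call follows the plot it annotates.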

9:03

A barplot is another graphical summary, for categorical data.

Here I'm plotting the region variable so that you can see how

many counties there are in the east and how many there are in the west.

You see that the majority of the

counties are in the eastern part of the United States,

and there are a little over a hundred

counties in the western part of the United States.

So that just summarizes this

particular variable, which is a categorical variable.
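A barplot of a categorical variable might be built like this. The region labels and counts below are made up for illustration and are not the lecture's numbers; the color and title are arbitrary.

```r
# Made-up region labels; the counts are not the lecture's.
region <- factor(c(rep("east", 7), rep("west", 3)))

# table() counts the observations in each category;
# barplot() draws one bar per category.
counts <- table(region)
barplot(counts, col = "wheat", main = "Number of counties in each region")
```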
