0:01

So there's one thing worth pointing out that differs between ggplot and the base plots: if you have a plot where the data exceed the limits of the plot, the behavior of base plot and ggplot can differ a little bit.

So here on the left-hand side I've just simulated some data, so this is not the MAACS data, and I intentionally introduced a little outlier. For the 50th data point, I changed the value to be 100. So now it's just a random series of noisy data, and right in the middle there's a point that's 100.

So I'm going to make a standard base plot here. I call plot on x and y, and I say type = "l", because I want to make a line plot.

Â 0:45

But typically, if you have an outlier like this, you don't want to look at the outlier; you just want to look at the core of the data. So it's typical to set the y-axis limits to be roughly where the data are and just ignore the outlier. You can see that the time series that gets drawn has all the data connected, and you can see roughly where it shoots off toward 100 and comes back down to roughly where it's supposed to be. So you know that outlier is out there somewhere, but you don't see it in the plot.
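A minimal sketch of the base plot just described. The data frame name testdat, the seed, and the use of rnorm for the noise are assumptions made for illustration, not necessarily what is on the slide.

```r
# Simulate 100 noisy points and plant an outlier at the 50th position.
set.seed(1)                                   # assumed seed, for reproducibility
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100                       # the intentional outlier

# Base graphics: ylim only changes the visible window. The line is still
# drawn through every point, so it runs off the top of the plot near
# x = 50 and then comes back down.
plot(testdat$x, testdat$y, type = "l", ylim = c(-3, 3))
```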

Now, if I do the equivalent plot in ggplot, I create my ggplot object with the test data and map the aesthetics to x and y. Then I add the geom_line function to make a line plot as opposed to a scatterplot. You can see that it just plots all the data, including the outlier. And it's maybe not exactly the kind of plot you want to make, because the outlier is maybe not that interesting.

So if you want to leave the outlier out of view, you have to be careful about how you do it. On the left-hand side, you might think, well, I'll just change the y limits to be in the range of most of the data, between minus 3 and 3. The issue here is that ggplot will subset the data to include only the values that are between minus 3 and 3. And so, of course, the outlier is not included in this data set, and you won't see that data point in the plot. You can see this clearly: where the outlier is missing, the two lines are not connected, but everything else is connected afterwards.
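Here is a hedged sketch of the two ggplot versions discussed so far: the default line plot, and the one with ylim added. It assumes the ggplot2 package is installed; the object names (testdat, g, p_all, p_ylim) and the seed are illustrative.

```r
library(ggplot2)

# Same simulated series as before: noise plus one outlier at position 50.
set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100

g <- ggplot(testdat, aes(x = x, y = y))

# Default behavior: the y axis stretches to include the outlier.
p_all <- g + geom_line()

# ylim() sets limits on the y scale itself. Values outside the limits
# are dropped before drawing, so the line has a gap around x = 50.
p_ylim <- g + geom_line() + ylim(-3, 3)

print(p_all)
print(p_ylim)
```

When the second plot is drawn, ggplot2 typically emits a warning about a removed or missing value, which is a useful hint that the data were subsetted rather than just hidden.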

So if you want to recreate the kind of behavior that you saw with the base plot, you have to add this special function called coord_cartesian, which sets the y-axis limits to be minus 3 and 3. Now you can see in the plot here that the outlier is in fact included in the data set. The data set hasn't been subsetted to include only the points that are in the y-axis range.
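A minimal sketch of the coord_cartesian version, again with simulated data and illustrative names (testdat, p_zoom), assuming ggplot2 is installed.

```r
library(ggplot2)

set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100

# coord_cartesian() zooms the coordinate system without touching the
# data, so the line is drawn through the outlier and simply leaves the
# visible window, as it does in the base plot.
p_zoom <- ggplot(testdat, aes(x = x, y = y)) +
  geom_line() +
  coord_cartesian(ylim = c(-3, 3))

print(p_zoom)
```

The design difference is that ylim acts on the scale (and therefore on which data are kept), while coord_cartesian acts only on what part of the plane is shown.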

So, I just want to go over a slightly more complex example of adding pieces to a plot, just so you can get a sense of how the different layers are added on, and then hopefully get you going from there.

Here I've made the scientific question just a little bit more complex. I want to know how the relationship between PM2.5 and nocturnal symptoms varies by both BMI and nitrogen dioxide, or NO2. So as NO2 or BMI values change, what does the relationship between PM2.5 and nocturnal symptoms look like?

One tricky thing about this is that, unlike our previous BMI variable, which is categorized into normal weight and overweight, the NO2 variable is continuous (it's really the log of NO2). We can't really condition on a continuous variable when we're making plots, because then there would be an infinite number of plots. So we need to categorize this variable into a reasonable series of ranges. And we can use the cut function for this purpose, to literally cut the data into a series of ranges.

Â 3:32

So here is some code to split NO2 into tertiles, so this is going to be three separate categories: from the minimum to the 33rd percentile, from the 33rd to the 66th percentile, and from the 66th percentile to the maximum. The first thing I need to do is use the quantile function to figure out where in the data range the 33rd and 66th percentiles are. Once I've used the quantile function to find these cut points, I pass them to the cut function, and I use the cut function to actually split NO2 into these three different ranges.

What the cut function does is return a factor variable where each of the original data points is replaced by its category, in terms of which tertile it's in: the low, the middle, or the high tertile. It's a very handy function for when you're using things like lattice or ggplot and you have to categorize continuous variables.

So now you can see the levels of the cut variable; there are three different levels: 0.378 to 1.2, 1.2 to 1.42, and 1.42 to 2.55. Those are the three categories that I've split the NO2 variable into.
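The quantile-then-cut step can be sketched roughly as follows. The vector logno2 holds simulated stand-in values, not the study data, and the names cutpoints and no2tert are hypothetical. The include.lowest argument is an addition made here so that the minimum value is not left out of the first interval.

```r
# Simulated stand-in for a continuous log NO2 variable.
set.seed(2)
logno2 <- rnorm(300, mean = 1.3, sd = 0.3)

# Four cut points (0%, 33.3%, 66.7%, 100%) bound the three tertiles.
cutpoints <- quantile(logno2, probs = seq(0, 1, length.out = 4), na.rm = TRUE)

# cut() returns a factor with one level per interval. Intervals are
# left-open by default, so include.lowest = TRUE is needed to keep the
# minimum value from becoming NA.
no2tert <- cut(logno2, breaks = cutpoints, include.lowest = TRUE)

levels(no2tert)   # three interval labels
table(no2tert)    # how many observations fall in each tertile
```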

So here's the final plot, just to show you what I'm going for, and then we'll work backwards and figure out exactly how to do it. You can see that there are eight different plots here. On the top you see all the normal weight children, and on the bottom you see all the overweight children. So those are the two categories of BMI.

Â 5:18

And it's often important to look at the missing data, just to see if there's anything special about those missing values. You don't always want to exclude them right off the bat, because there might be something special about them that you've missed.

So, what does this plot have? Well, first of all, I've modified the transparency of the points. I've made them a little bit transparent so you can see a little bit of the density there. I've added a smoother to each panel; this is a linear regression smoother, so it's not the default. And I've turned off the confidence bands.

Â 5:50

I've changed the default labels and the title to be a little bit more descriptive. And then finally I used a non-default font: the default font is Helvetica, and I've changed the font here to be Avenir. So there are a number of options that I've modified here.

And so, here's the code for doing it. In the first line of code, I just call ggplot. I give it the data and some basic aesthetics in terms of the x and y variables. And then, to this g object, I add a bunch of things. I add points using geom_point. I make the panels using the facet_wrap function, and I add a smoother using geom_smooth, where I specify the lm method and turn off the standard error bars.

Â 6:34

I changed the theme to be the black-and-white theme, and I modified the font to be Avenir instead of the default. I've also made the font a little bit smaller, 10 points instead of the default 12. And then finally, I've called the labs function three different times to change the x label, the y label, and the title of the plot.
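Putting those pieces together, a sketch of the layered plot might look like the following. The study data are not reproduced here, so everything in dat is simulated and every column name (logpm25, logno2, bmicat, symptoms, no2tert) is hypothetical. Some NO2 values are set to missing on purpose, on the assumption that a missing category is what takes the panel count from six to eight. The font is left at the default so the code runs anywhere; adding base_family = "Avenir" to theme_bw would match the description above, provided that font is installed.

```r
library(ggplot2)

# Simulated stand-in data; all names and values are illustrative.
set.seed(3)
n <- 600
dat <- data.frame(
  logpm25 = rnorm(n, mean = 1.2, sd = 0.5),
  logno2  = rnorm(n, mean = 1.3, sd = 0.3),
  bmicat  = sample(c("normal weight", "overweight"), n, replace = TRUE)
)
dat$symptoms <- rpois(n, lambda = exp(0.3 * dat$logpm25))
dat$logno2[sample(n, 60)] <- NA          # some children with missing NO2

# Categorize NO2 into tertiles; missing values stay NA.
cutpoints <- quantile(dat$logno2, seq(0, 1, length.out = 4), na.rm = TRUE)
dat$no2tert <- cut(dat$logno2, cutpoints, include.lowest = TRUE)

# Data and aesthetics first, then one added piece per line.
g <- ggplot(dat, aes(x = logpm25, y = symptoms))
p <- g +
  geom_point(alpha = 1/3) +                            # transparent points
  facet_wrap(~ bmicat + no2tert, nrow = 2, ncol = 4) + # BMI by NO2 panels
  geom_smooth(method = "lm", se = FALSE) +             # linear fit, no band
  theme_bw(base_size = 10) +                           # black-and-white theme
  labs(x = "log PM2.5") +
  labs(y = "Nocturnal symptoms") +
  labs(title = "Simulated cohort")

print(p)
```

With the missing NO2 values kept as their own facet, this gives two rows (the BMI categories) by four columns (three tertiles plus the missing group).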

Â So you can see that I've added all these different things piece by

Â piece to make this plot a little bit more interesting every single time.

Â And it's easy to do this with ggplot, and I, and

Â then, and the nice thing about ggplot which I didn't do here

Â is that you could in fact save this to a new object

Â and then you would have everything stored in a single R object.

Â And then if you wanted to add on more

Â layers, you could add to that, that new object,

Â you could continue to add different things if you wanted to.

Â So it's a very modular, very kind of a, a useful framework.

Â For constructing plots that are new just for your data.
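As a small illustration of saving a plot to an object and adding to it later, here is a sketch with simulated data. The names dat, base_plot, with_fit, and with_labels are made up for the example.

```r
library(ggplot2)

# Simulated data with a roughly linear relationship.
set.seed(4)
dat <- data.frame(x = rnorm(50))
dat$y <- 2 * dat$x + rnorm(50)

# Save the plot so far as a single R object...
base_plot <- ggplot(dat, aes(x, y)) +
  geom_point(alpha = 1/3) +
  theme_bw()

# ...then keep adding to it. Each step returns a new plot object and
# leaves the earlier one unchanged.
with_fit    <- base_plot + geom_smooth(method = "lm", se = FALSE)
with_labels <- with_fit + labs(title = "Points plus linear fit")

print(with_labels)
```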

Â 7:25

So, just to summarize very quickly: I know this has been a very brief introduction to ggplot, and there are a lot of things that you could talk about. But given that this is not a course specifically on ggplot, my hope was to get you started, to get you typing in some basic code and making some basic plots. And then if you want to know more, you can look at some of the references that I mentioned previously.

So, I think in summary, ggplot is very powerful and very flexible if you can learn the grammar and learn the different pieces that you can add to a plot, and how they can be tuned and modified. There are lots of different types of plots you can make. I left out a lot, but you can explore and mess around; I think that's the best way to learn about these things. And take a look at some of the references that I mentioned in part one.
