quiz one versus the final exam score,

we may find that for some reason quiz one is

highly correlated to the final exam scores and students

who did well on quiz one early in the semester stick it out and do well on the final.

These things might help us to identify average students.

If you're down here on quiz one,

then next semester, knowing this is correlated, I might come to people and say, "Hey,

students who did badly on quiz one

really need to improve because they tend to do badly on the final."

So, we can start using this information to make decisions,

talk about interesting facts in the data,

help people understand trends and what those trends mean.
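
To put a number on "highly correlated," one option is the Pearson correlation coefficient. This is a minimal sketch; the scores below are invented for illustration and are not from any real class.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores, invented for illustration only.
quiz_one = [4, 6, 7, 8, 9, 10]
final_exam = [52, 60, 71, 75, 88, 93]

r = pearson_r(quiz_one, final_exam)
print(f"r = {r:.2f}")  # values near +1 mean a strong positive linear relationship
```

A value close to 1 would support the "did well on quiz one, did well on the final" story; it would not by itself say why the two are related.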

So, in creating a scatterplot we can plot

things like the number of runs and the batting average.

So, again, any sorts of variables I have

in my dataset and then each point here is a different player.

So, again, you have to think about what questions you want to answer with the data,

what labels you might want to have,

what interactions you might want to have and what we're trying to see.

So, here we might see some sort of trend. Maybe we could even try

to fit some sort of regression line to the plot (we'll talk about how to do

regression in different lectures) to try to

see: is there some way I can model the data to explain the different patterns within it?

We can identify outliers, for example.

So, this person had a really low number of runs,

but a reasonably high batting average. Same with this person, actually:

one of the higher batting averages,

but some of the lowest number of runs.
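
As a rough sketch of the trend-plus-outlier idea, the code below fits an ordinary least-squares line and flags the point that sits farthest from it. The player numbers are invented, and "largest residual" as the outlier rule is my assumption here, not a method from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical players, invented for illustration only.
# The last player has very few runs but a high batting average.
runs = [20, 35, 50, 65, 80, 95, 10]
batting_avg = [0.230, 0.245, 0.260, 0.270, 0.285, 0.300, 0.310]

slope, intercept = fit_line(runs, batting_avg)

# Residual = observed value minus the value the line predicts.
residuals = [y - (slope * x + intercept) for x, y in zip(runs, batting_avg)]

# Flag the player whose point is farthest from the fitted line.
outlier_index = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print("possible outlier: player", outlier_index)
```

Keep in mind that an outlier also pulls the fitted line toward itself, so a flag like this is a prompt to go look at the point, not a verdict.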

We could think about adding interaction.

So, imagine if this was my mouse tool tip and I can point to the element on the screen,

it might be able to pull up the label to say this is player number seven,

for example and tell us who that is,

tell us more information.

Again, we can think about,

how do we want to fix the aspect ratio,

should I have made this axis much, much longer than the other axis?

How do I manage and manipulate those things?

So, all of these different elements we've talked about in past modules from

aspect ratios to nice numbers to graph labels,

all come into play in creating our scatterplot here.

We don't have to just stick with two variables in a scatterplot.

So, there's a really fun video by Hans Rosling if you go to gapminder.org.

Hans Rosling does this nice animated PowerPoint where each bubble

here is a country and he's showing income versus life expectancy.

As he shows this move over time you get a story of

the world and how income has shifted life expectancy.

So, if we think about a data set like that,

we can think about we have a country,

we have an income,

So, we get color, we get shape, we get position,

we can even think about adding a texture to these circles if we wanted.

It gets sort of busy.

Notice we have different sizes.

So, we have size in the circle,

and here size corresponds to something like population.

Color corresponded to location in the world.

So, we're combining these different visual variables together to

move past just this 2D representation,

where we had variable X versus variable Y,

it's now multiple variables.
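
Here is one way that mapping from data columns to visual variables could look in code. It is only a sketch: the rows, region names, and colors are invented, and scaling the radius with the square root of population (so that circle area tracks population) is a common convention I am assuming, not something confirmed from the Gapminder chart.

```python
from math import sqrt

# Hypothetical palette: one color per world region.
REGION_COLORS = {"Region 1": "tomato", "Region 2": "steelblue"}

def bubble_radius(population, max_population, max_radius=40.0):
    """Radius chosen so that the circle's area is proportional to population."""
    return max_radius * sqrt(population / max_population)

def encode(row, max_population):
    """Map one data row to the visual variables of a bubble."""
    return {
        "x": row["income"],            # position along the x-axis
        "y": row["life_expectancy"],   # position along the y-axis
        "radius": bubble_radius(row["population"], max_population),  # size
        "color": REGION_COLORS[row["region"]],                       # color
    }

# Invented rows, for illustration only.
rows = [
    {"country": "Country A", "income": 1200, "life_expectancy": 58,
     "population": 25, "region": "Region 1"},
    {"country": "Country B", "income": 30000, "life_expectancy": 80,
     "population": 100, "region": "Region 2"},
]

max_pop = max(row["population"] for row in rows)
bubbles = [encode(row, max_pop) for row in rows]
for row, bubble in zip(rows, bubbles):
    print(row["country"], bubble)
```

Animation would then amount to recomputing these bubbles for each year and redrawing.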

Then he even adds in animation.

So we can move our data over time and so animation

provides us with yet another variable to see trends and patterns and changes over time.

There's nice work on looking at animated scatterplots by John Stasko.

So, you can take a look, I encourage you to watch Hans Rosling's Gapminder video,

just to see how his nice presentation goes for these multivariate cases.

Now the real question though is,

if I have all of these datasets which scatterplots should I draw?

If I have all of these variables like income, life expectancy, population,

I can create a ton of different scatterplots for every two variables,

I can make a scatterplot for GDP vs the percentage of trade,

I can make a scatterplot for GDP versus life expectancy,

I can make a scatterplot for GDP versus population.

So, the more variables I have,

the more scatterplots that I can make.
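
To see how fast this grows, a short sketch can enumerate every unordered pair of variables. With n variables there are n(n-1)/2 distinct pairs. The variable list here is just an example drawn from the ones mentioned above.

```python
from itertools import combinations

variables = ["GDP", "percentage of trade", "life expectancy", "population", "income"]

# Every unordered pair of variables is one candidate scatterplot.
pairs = list(combinations(variables, 2))

for x_var, y_var in pairs:
    print(f"{x_var} vs {y_var}")

n = len(variables)
print(len(pairs), "scatterplots for", n, "variables")
```

Five variables already give ten plots, and twenty variables would give 190, which is why we need some way to decide which ones are worth showing.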

So, I want to think about how do I

help people again detect the expected, discover the unexpected?

How do I identify anomalies in these scatterplots?

How do I identify interesting trends?

One way is through what was coined as Scagnostics.

So, Tukey coined this back in the mid-80s,

talking about graph theoretic measures for

detecting structural anomalies in scatterplots.

So, if I draw a scatterplot,

what are the different things that I'm seeing?

So, for example, remember when I drew a scatterplot that had properties like this,

notice this has a clumpiness property.

Is there a way where I can theoretically have some mathematical computation of that?

This may be a really interesting view to say, "Hey,

these particular elements are highly related in these two variables."

So, we can use these graph theoretic measures to help

users pick views to show particular structures of interest.

Tukey introduced this to help us determine

which relationships between variables we should pick

if I don't have the time or capacity to show every possible pairwise combination.

Which is the best pairwise combination to show?

Which is the second best?

And so forth. So, Scagnostics gives us

a bunch of different equations that we can start calculating.

So, for example, we can figure out

the minimum convex hull that will enclose all of our points.

So, if I draw some points on the screen,

the minimum convex hull is

the smallest convex polygon that's going to enclose all of them.

We can measure things like area of this polygon to try to do some measure,

we can have some sort of correlation measures to talk about how correlated the data is,

and other elements like that.
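
As a sketch of the convex hull step, the code below uses Andrew's monotone chain algorithm to find the hull and the shoelace formula to get its area. This is my own illustration of the geometry, not code from Wilkinson's library, and the points are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

# Made-up points: four corners of a 4 x 3 rectangle plus three interior points.
points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 2)]
hull = convex_hull(points)
area = polygon_area(hull)
print(hull, area)
```

The interior points drop out of the hull, so only the rectangle's corners contribute to the area.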

There's a whole list of Scagnostics and people have been

working on those sorts of measures for a very long time.

Wilkinson proposed nine Scagnostic measures to characterize scatterplots.

So, outlying, sparse, striated,

skinny, monotonic, skewed, clumpy, convex, and stringy.

All of these have a different equation associated with them but what's nice is

Wilkinson developed a library where we can calculate all of these automatically,

and start using these to try and rank scatterplots.

Again, trying to think about importance in

showing people what's important and interesting in their data.

We've talked about Shneiderman's information-seeking mantra

where it's overview first, zoom and filter,

details on demand, and Daniel Keim talked

about a visual analytics mantra of analyze first as opposed to overview first.

So, if you have a large dataset you put it into some sort

of analytical framework whether it's going to be deep learning,

whether it's going to be unsupervised learning through clustering,

whether it's going to be supervised learning through decision trees, things like that.

Or whether it's going to be creating a bunch of scatterplots and measuring

how outlying the points are or how skewed the points are.

We can use these measures to then characterize different scatterplots.
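
To illustrate the "analyze first, then rank" idea, here is a small sketch that scores every pair of variables and sorts the pairs by that score. As I understand it, Wilkinson's monotonic measure is the squared Spearman rank correlation, so that is what I use as the score; treat that as an assumption, and note this is not Wilkinson's library. The player data is invented.

```python
from itertools import combinations
from math import sqrt

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def monotonic_score(xs, ys):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    return pearson_r(ranks(xs), ranks(ys)) ** 2

# Invented data for five hypothetical players.
data = {
    "at_bats": [100, 200, 300, 400, 500],
    "runs": [12, 30, 41, 70, 85],
    "batting_avg": [0.280, 0.250, 0.300, 0.240, 0.290],
}

# Score every pairwise scatterplot and show the most monotonic first.
ranked = sorted(
    combinations(data, 2),
    key=lambda pair: monotonic_score(data[pair[0]], data[pair[1]]),
    reverse=True,
)
for a, b in ranked:
    print(f"{a} vs {b}: {monotonic_score(data[a], data[b]):.2f}")
```

The same pattern works for any other measure: swap out the scoring function and the ranking of scatterplots changes accordingly.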

What's nice about scatterplots is we

can also create what's called the scatterplot matrix.

So, even if I have a whole lot of variables for a scatterplot,

I can actually go ahead and organize these into a matrix where I can

do each variable versus each other variable.

So for example, I have At Bats:

every X-axis in this direction is the same.

So, we see we have our At Bats here.

So, every X axis is At Bats across our row.

Across our columns, we get changes in variables.

So, here we get At Bats versus At Bats,

we get Runs versus Runs,

we get Batting Average versus Batting Average.

In this example here,

we've got Batting Average and Runs,

we've got At Bats and Runs.

So, we can start looking and see if there's any interesting trends.

Now the diagonal is always going to be these straight lines.

This is due to the fact that we're plotting the same variable against itself,

so it is perfectly correlated.
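
The layout of a scatterplot matrix can be sketched as a grid of variable pairs. I am assuming the common convention that every plot in a column shares its x variable and every plot in a row shares its y variable; the figure on the slide may be arranged differently.

```python
def splom_cells(variables):
    """Grid where cell [i][j] is the (x variable, y variable) pair to plot."""
    return [[(col_var, row_var) for col_var in variables] for row_var in variables]

variables = ["At Bats", "Runs", "Batting Average"]
grid = splom_cells(variables)

for row in grid:
    print(" | ".join(f"{x} vs {y}" for x, y in row))

# On the diagonal each variable is plotted against itself.
diagonal = [grid[i][i] for i in range(len(variables))]

# Changing the variable order changes the layout but not the set of plots.
reordered = splom_cells(["Runs", "At Bats", "Batting Average"])
```

Every off-diagonal pair appears twice, once on each side of the diagonal with the axes swapped.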

So again, each dot is a baseball player and we can start looking for trends and patterns.

Here we might say well,

there's one outlier in this plot

otherwise it looks like it might have some sort of trend here.

Here we may not see much of a relationship.

Scatterplot matrices let us get a quick overview, but the problem is

I could have rearranged any row and any column in any way I want.

So, I could have swapped these two columns or

these two rows and I would get a different order,

a different layout and how I'm going to go through

these orders and layouts is really important and can take time.

That's where we might want to use these Scagnostic measures to think about

how we can order what we call our scatterplot matrix.

We can even think about adding an interaction,

and adding interaction allows viewers to visualize other combinations of variables.

So, if I have a scatter plot in two dimensions,

I can always extrude a third dimension

and rotate my points and show what this looks like.

Niklas Elmqvist has a nice paper called Rolling the Dice,

and you can take a look at some of the nice interactions he added

in with these scatterplot matrices and

allowing extensions and things to let people visualize this in 3-dimensional space.

Again, we can add color to the points,

we can add some shape or sizes,

all sorts of things to add more information into these variables.

Now, the other thing we should have realized with scatterplots is

that really I'm just showing a whole bunch of the same thing over and over.

I have the same type of plot repeated over and

over but showing different combinations of my variables.

This display is sometimes referred to as small multiples or a trellis display.

Essentially, if you have a bunch of

different data, you might want to try to

organize it in some way so we can look at an overview of it all at once.

So for example, what if I want to look at homicide rates in Canada?

So, Canada has several different provinces.

I've captured homicide rates over time,

and I don't want to maybe just make one plot for

the homicide rate in Canada because I may see something like this.

Or, even worse, I may see my homicide rate go up.

This may not tell the whole story,

if I break this down by province,

I can see some interesting trends.

So for example, in Abbotsford-Mission,

I see this huge spike and decline in 2009.

I can see that Calgary had a downward trend and now it's back on an upswing.

I can see Montreal has been low and slowly declining.

The same for Toronto.

In other cases I'm seeing much more variability.

So, this allows me to quickly compare between different elements of my data,

in this case cities, and look at the changes over time.

Again I can think about how do I want to organize my small multiples here?

What's the different layout that I want to show

to the user to help them compare these quickly over time?

Do I want to put Abbotsford next to Calgary?

Are these geographically located next to each other in Canada?

Do I want to organize these based on most similar trends?

So for example, should I put Toronto next to

Montreal because they have similar sorts of graphs?

So, this allows us yet another mechanism to start thinking about how we can show

multiple variables on one screen with information about trends and changes over time,

to look for relationships.
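
Here is a minimal sketch of the data preparation behind small multiples: split the records into one time series per panel, then pick a panel order. The place names and rates are invented placeholders, and ordering panels by how much each series varies is just one possible choice that I am assuming for illustration.

```python
from collections import defaultdict

# Invented records of (place, year, rate), for illustration only.
records = [
    ("City A", 2007, 2.0), ("City B", 2007, 1.0), ("City C", 2007, 3.0),
    ("City A", 2008, 5.0), ("City B", 2008, 1.1), ("City C", 2008, 2.5),
    ("City A", 2009, 2.5), ("City B", 2009, 1.0), ("City C", 2009, 2.6),
]

# One panel per place: each panel holds that place's (year, rate) series.
panels = defaultdict(list)
for place, year, rate in records:
    panels[place].append((year, rate))
for series in panels.values():
    series.sort()  # make sure each series runs in time order

def spread(series):
    """Range of the rates, used here as a simple measure of variability."""
    rates = [rate for _, rate in series]
    return max(rates) - min(rates)

# Place panels with similar variability next to each other.
panel_order = sorted(panels, key=lambda place: spread(panels[place]))
print(panel_order)
```

Each panel would then be drawn with the same axes so the eye can compare shapes directly; ordering by geography or by trend similarity would only change the sort key.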

Again, we can start thinking about how we might pre-analyze the data,

to identify interesting things so we know where to start

visualizing to help and give the user more information.

So, that takes us through this concept of scatterplots,

extending those to scatterplot matrices,

and even further to small multiples. Thank you.