0:08

So now that we understand information visualization and

information visualization systems, we can start to use them to visualize

data held in databases, and we can use them in concert with data mining.

So often, when we visualize data, the data is going to come from a database,

and it'll be quite useful and effective if, during an

interactive visualization session, when we're trying to investigate data for

our own purposes and gain insight into the data, we can connect the tools for

visualization to queries against that database, so

that it's easier to dig deeper into the data directly.

So there are databases that support this.

Often, modern databases support OLAP, online analytical processing,

which basically allows them to be accessed over the web using standard protocols.

And, conceptually, a good mental model for the data is this data cube metaphor.

If you have, for example, a sales database that represents the amount of

sales on a given day of a given product at a given location,

then you can think of the keys to that data set, the date, the product, and

the location, as dimensions of, for example, a cube, here.

So each row of this data corresponds to a cell in this cube, and so

if the date is the horizontal axis, the product is the vertical axis, and

the location is the depth axis in this cube, then you've got three coordinates,

and that gives you a particular cell in this cube, and that cell will represent

the amount of sales on that date, of that product, at that location.
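
As a rough sketch of this data cube idea in Python: the rows, the names `cube` and `cell`, and the choice to treat an empty cell as zero sales are all made up for illustration and are not taken from any database shown in the lecture.

```python
# Hypothetical sales rows: (date, product, location, amount).
rows = [
    ("2024-01-01", "tea", "Chicago", 120.0),
    ("2024-01-01", "coffee", "Chicago", 300.0),
    ("2024-01-01", "tea", "Boston", 80.0),
    ("2024-01-02", "tea", "Chicago", 95.0),
]

# The three keys act as coordinates; the measure is what the cell holds.
cube = {(date, product, location): amount
        for date, product, location, amount in rows}


def cell(date, product, location):
    """Return the sales at one coordinate (assumed 0.0 for an empty cell)."""
    return cube.get((date, product, location), 0.0)


print(cell("2024-01-01", "tea", "Boston"))  # 80.0
```

A dictionary keyed by the three coordinates is just one convenient way to hold a sparse cube; a dense three-dimensional array would work equally well when most cells are filled.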

The challenge is going to be taking this data set and visualizing it.

I'm using an example of a cube with three dimensions, and

we're used to perceiving three dimensions, but we perceive three dimensions by projecting

them into two dimensions and looking at that.

In real-world applications, you're going to have many more keys, and

this data cube is actually a data hypercube, and

you'll have many more than just three dimensions of data.

And the challenge is going to be taking this high-dimensional data space,

this data hypercube, and

then finding the appropriate two dimensions in which to investigate, and

then you can always add more dimensions as glyphs or other visualization tools.

2:45

So the way we're going to reduce the dimensions of the data set is going to be

largely with data aggregation, and there's a large number of

database visualization tools based on data aggregation.

The common ones, like Tableau, or SAP's Lumira, or

even the pivot tables that are available in Microsoft Excel,

are based on data aggregation as a way of simplifying a data set and

reducing the dimensions of variation by providing summaries of the data.

So, for example, here is some data.

We're plotting some quantitative value vertically as

we change some other quantitative value horizontally.

We've got an independent variable here and a dependent variable here.

This is a two-dimensional plot.

We could always project this onto a one-dimensional axis.

We can also summarize this data.

We can turn these values, these measures changing across one axis,

into a single value by representing them by the mean value, or

the maximum value or the minimum value or the sum of these values.

The mean is just going to be the sum divided by the number of values along

the dimension, the size of the dimension.

And so we have various operators: sum, mean, median, minimum, maximum.

There are other statistical operators, like standard deviation, or variance, or

other characteristics you can use, that will simplify data and

basically remove one dimension.

In this case, we're reducing the change in value over one dimension

to a single value over no dimension.
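
To make these operators concrete, here is a small Python sketch using only the standard library. The list `values` is invented sample data, and the population forms of variance and standard deviation are an arbitrary choice for the illustration.

```python
import statistics

# Hypothetical measure values sampled along one dimension.
values = [4.0, 8.0, 6.0, 2.0, 10.0]

# Each operator collapses the whole dimension into a single number.
summaries = {
    "sum": sum(values),
    "mean": sum(values) / len(values),  # sum divided by the number of values
    "median": statistics.median(values),
    "min": min(values),
    "max": max(values),
    "variance": statistics.pvariance(values),  # population variance
    "stdev": statistics.pstdev(values),        # population standard deviation
}

for name, value in summaries.items():
    print(name, value)
```

Whichever operator is chosen, the result is a single number, so the dimension the values varied over is gone from the summary.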

We can also convert quantitative continuous data into

quantitative discrete data, or ordinal or nominal data.

And the count function does that.

For example, if we have these blue, red, and

green data points, there are three categories here.

They're being plotted over a two-dimensional area that's going

to represent continuous quantitative data horizontally and vertically.

We can then remove those axes and just have a single axis representing the count,

the number of blue items, the number of red items, and the number of green items.

And that reduces the dimensionality of this data set,

from varying over two dimensions with a third category,

into just a category over essentially one dimension.
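
A minimal Python sketch of the count aggregation, assuming some made-up colored points (the coordinates and the name `points` are illustrative only):

```python
from collections import Counter

# Hypothetical points: (x, y, category). The x and y values are continuous.
points = [
    (0.5, 1.2, "blue"), (1.1, 0.4, "red"), (2.3, 2.2, "blue"),
    (0.9, 1.8, "green"), (1.7, 0.6, "red"), (2.8, 1.1, "blue"),
]

# Counting ignores both continuous axes and keeps one number per category.
counts = Counter(color for _x, _y, color in points)

print(counts["blue"], counts["red"], counts["green"])  # 3 2 1
```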

And finally, there's binning, and binning is a discretization method that converts

quantitative continuous data into quantitative discrete data or

ordinal or nominal data, depending on how you use it.

For example, we may have our continuous function.

These are values that are varying over a single dimension.

6:01

And finally, we can use this to create a histogram.

And a histogram is binning, in this case,

in the vertical direction instead of the horizontal direction.

So instead of having bins over the dimension,

you're having bins over the value that you're plotting over that dimension.

In this case, we've separated the values into an A, B, C, D,

and E range, and then we're counting up all the values in the A range,

all the values in the B range, all the values in the C range, and

all the values in the D range, and just plotting those counts.

And there are no values in the E range, so it would be zero for the E range.

And so with this histogram, how you choose these buckets,

the size of these bins, gives you other characteristics of the data that

can reduce the dimensionality of the data, or at least reduce a continuous

variable into discrete categories of a continuous variable.
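
Here is one way the A through E ranges could be sketched in Python. The bin edges, the sample values, and the function name `histogram` are assumptions for the example, and the sketch counts how many values land in each range, which is the usual histogram convention.

```python
def histogram(values, edges, labels):
    """Count how many values fall in each half-open range [lo, hi)."""
    counts = {label: 0 for label in labels}
    for v in values:
        for label, lo, hi in zip(labels, edges, edges[1:]):
            if lo <= v < hi:
                counts[label] += 1
                break
    return counts


# Hypothetical measure values and equal-width ranges A..E over [0, 5).
values = [0.5, 1.5, 1.7, 2.2, 2.9, 3.1, 0.9, 2.4]
edges = [0, 1, 2, 3, 4, 5]
labels = ["A", "B", "C", "D", "E"]

print(histogram(values, edges, labels))
# {'A': 2, 'B': 2, 'C': 3, 'D': 1, 'E': 0}
```

Changing `edges` changes the picture: wider bins give a coarser summary, narrower bins keep more of the original variation.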

7:00

And so our data cubes can use these aggregation

operations to simplify a data set, and

that provides a useful tool for investigating the data set.

So, for example, we can take our data cube.

In this case, it has location horizontally, products vertically,

and time in depth.

If, for example, we want to look at all of our sales of tea,

coffee, espresso, and other products at different times,

but we don't care about the location, then we can project this data cube into,

7:41

basically, this square region, this two-dimensional region.

So this is a two-dimensional projection that's summing up all of

the sales at any given time of a given product regardless of location,

so it's summing them up over all the locations.

And that projection gives us something we can then visualize.

We can further summarize that by projecting this two-dimensional data cube into

a one-dimensional data cube by summing, in this case, over product.

We don't care about the differences between the products.

So if we don't want to differentiate the products, we can aggregate

the product axis, sum over that, and now we have a one-dimensional cube,

basically just a list of numbers, that tells us the amount of sales for

the first quarter, second quarter, third quarter, or fourth quarter.

And if we don't care about the time,

then we can just look at the total sales over a given amount of time,

over a given set of products, over a given set of locations, add all of those up, and

that gives us a zero-dimensional data cube here.
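
The chain of projections from three dimensions down to zero can be sketched in Python as follows. The cell values and the helper name `sum_out` are hypothetical; the key order (quarter, product, location) is just a convention chosen for the example.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (quarter, product, location).
cube = {
    ("Q1", "tea", "north"): 10, ("Q1", "tea", "south"): 5,
    ("Q1", "coffee", "north"): 20, ("Q2", "tea", "north"): 7,
    ("Q2", "coffee", "south"): 12, ("Q3", "espresso", "north"): 9,
}


def sum_out(cells, axis):
    """Sum away one key position, giving a cube with one fewer dimension."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


by_time_product = sum_out(cube, 2)     # 2-D: (quarter, product)
by_time = sum_out(by_time_product, 1)  # 1-D: (quarter,)
total = sum_out(by_time, 0)            # 0-D: a single grand total

print(by_time)  # {('Q1',): 35, ('Q2',): 19, ('Q3',): 9}
print(total)    # {(): 63}
```

Each call removes exactly one axis, so the order in which the axes are summed away does not change the final total.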

So here's an example using Tableau to show how aggregations are used for

data cube operations.

Here we have some dimensions of our data.

The data is collected over ten years,

so we have separate data by year, and over a variety of different countries.

9:17

And you'll see that we're averaging the data and we have a single data point.

We basically have zero-dimensional data.

It's all being projected to a single point, because we're averaging over all

these dimensions, specifically the time dimension, over all the years that we

have data for, and over all the countries that we have data for.

If we want to disaggregate this data,

then we can do that by basically dragging a dimension into this marks area.

For example, country.

And now we spread out the data.

Each one of these data points represents a different country, so

that these averages are just averages over the years that

the data was collected in, and not over the country and the region.

We can further disaggregate over the year, and we get this plot,

which shows you, for each country, how it's changing over the years.

If we want to see correspondences between countries,

then I can drag the country into color so

that each country gets its own color, and you can use the color to

follow the country over the years now that it's been disaggregated.
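
Outside of Tableau, the same disaggregation steps can be imitated with a small group-by in Python. This is only an analogy and not Tableau's API: the records, the field names, and the helper `average_by` are invented for the sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records with two dimensions (country, year) and one measure.
records = [
    {"country": "A", "year": 2001, "value": 2.0},
    {"country": "A", "year": 2002, "value": 4.0},
    {"country": "B", "year": 2001, "value": 6.0},
    {"country": "B", "year": 2002, "value": 8.0},
]


def average_by(rows, *dims):
    """Average the measure, keeping only the listed dimensions separate."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in dims)].append(row["value"])
    return {key: mean(vals) for key, vals in groups.items()}


print(average_by(records))             # {(): 5.0}, a single point
print(average_by(records, "country"))  # {('A',): 3.0, ('B',): 7.0}
print(len(average_by(records, "country", "year")))  # 4, one per country-year
```

Every dimension passed to `average_by` is one that is no longer averaged away, which mirrors dragging another field into the view.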

10:33

And we can start adding more and

more dimensions from our data cube into this visualization.

So when you're looking at an individual cell of these different forms

of data cubes, you can work from a less detailed view to a more detailed view.

A single cell, when it's aggregated, could represent a range of products,

a range of locations, a range of times.

Or it could represent the value at a specific location at a specific time.

And when it's in its disaggregated form, it's representing an actual data point.

In its aggregated form,

it's representing an aggregate of the values over these ranges.

So these could be the total sales of all products, over all markets, over all time.

11:24

And then you can start to drill down to these details by basically focusing,

for example, on a particular instance of time, or

on a particular product, or on a particular type.

And each time you do that, you're replacing an aggregated dimension,

where you have a single value representing a range of one of these

dimensions, with a specific product or specific data point