Welcome back. In this video, we'll demonstrate the use of data visualization to learn more about our data. Before we begin, we need to get our data loaded into our workspace. If you haven't seen this before, we use a classroom setup notebook to load our data into our workspace to make it easily accessible. We can just run this cell, which utilizes the %run magic command to run the classroom setup notebook that's located in our includes folder. Once our data is loaded into our workspace, let's import it into our notebook and display it. Using SQL, we can select all of the rows and columns from two tables here: ht_users and ht_daily_metrics. Because we can see the data displayed in the Databricks notebook, it's clear that we're able to access the data in both of these tables and that the classroom setup notebook worked. Remember from the previous video that we wanted to spend a little time exploring our data, and we started with the record level. We can look at the record level to determine what the measurement level of each record is. For ht_users, it looks like our data is at the user level. This is specified by the names of the users, and we know from an upstream data engineer that there's only one device ID per user. For ht_daily_metrics, it looks like our data is at the user-day level. You can see a unique device ID here, and then a bunch of different dates listed in the dte column. Now remember, our data science objective for this lesson's project is to determine whether there are naturally occurring groups of users in our health tracker dataset. As a result, we need all of our data to be at the user level. So we really have two options here. We can elect to use only the ht_users data, since it's already at the user level. Or, and probably the better option, we can aggregate ht_daily_metrics to be at the user level. This way we can continue our analysis using all of the data that's available to us. And that's exactly what we're going to do.
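To make the record-level idea concrete, here's a minimal sketch using small made-up pandas DataFrames as stand-ins for the two tables (the column names and values are assumptions for illustration; the real tables are loaded by the classroom setup notebook):

```python
import pandas as pd

# Hypothetical stand-in for ht_users: one row per user.
ht_users = pd.DataFrame({
    "device_id": [0, 1],
    "name": ["Ada", "Grace"],
    "lifestyle": ["athlete", "cardio enthusiast"],
})

# Hypothetical stand-in for ht_daily_metrics: one row per user per day.
ht_daily_metrics = pd.DataFrame({
    "device_id": [0, 0, 1, 1],
    "dte": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"],
    "steps": [10000, 12000, 8000, 9000],
})

# ht_users has as many rows as users; ht_daily_metrics repeats each
# device_id once per date, so it has more rows than unique users.
print(len(ht_users), len(ht_daily_metrics))
print(ht_daily_metrics["device_id"].nunique())
```

The key observation is that counting rows versus counting unique device IDs is one quick way to confirm which level a table is measured at.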
In order to get this data at the user level, we can group it by device ID and aggregate the other columns. Notice that we're computing the average for each of our columns except for the lifestyle column. Since lifestyle is of a string type, we're just going to take the max. You could use another command here, like first, and it's going to do the same thing, maybe a little bit more efficiently. This is acceptable in this case because we know that users cannot change their lifestyle in this dataset. So whether we use max or first, it's always going to give us the same result. Now that we've created that table, we can display this new aggregated data below to verify that the data is now at the user level, but that we've retained some information about the original daily metrics of each user by aggregating across those days, taking the average for each user. The next thing we need to do is determine the shape of our data. Remember, this is the dimensions of the dataset, and there are a few ways we can do this. The first uses SQL: we can count the number of rows using the count aggregation function, and we can get the number of columns by showing the columns from the table and determining the number of resulting rows. Now, this works, but it requires multiple steps to get the number of rows and the number of columns. And once we do get the number of columns, we have to scroll down and see how many rows are actually in this table. We can also calculate the shape of our data using Python. This allows us to save the shape into Python variables that we can use later, and it allows us to calculate the shape in a single step. If our data is smaller, like we have here, we can utilize the pandas DataFrame's shape attribute to determine the shape. We start by accessing the table using Spark SQL, then we convert to a pandas DataFrame with the toPandas method, and then we can simply call .shape at the end.
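The aggregation step can be sketched in pandas with made-up data (the lesson performs the same logic in SQL against ht_daily_metrics; the column names here are assumptions): average the numeric columns, and take the max of the string-typed lifestyle column, which is safe only because lifestyle never changes for a user.

```python
import pandas as pd

# Hypothetical daily metrics for two users across two days.
daily = pd.DataFrame({
    "device_id": [0, 0, 1, 1],
    "steps": [10000, 12000, 8000, 9000],
    "resting_heartrate": [60.0, 62.0, 70.0, 68.0],
    "lifestyle": ["athlete", "athlete", "sedentary", "sedentary"],
})

# Group to the user level: mean for numeric columns, max for lifestyle.
# first() would work equally well since lifestyle is constant per user.
user_level = daily.groupby("device_id").agg(
    avg_steps=("steps", "mean"),
    avg_resting_heartrate=("resting_heartrate", "mean"),
    lifestyle=("lifestyle", "max"),
).reset_index()

print(user_level)
```

After the aggregation, there is exactly one row per device_id, but the averaged columns still carry information from the original daily records.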
This returns a tuple with the number of rows as the initial element and the number of columns as the last element. But if our data is too big to pull into the driver node in our Apache Spark cluster, we can also compute the shape in Spark directly. We can start again by accessing the table using Spark SQL, but instead of pulling it into pandas on the driver node, we can compute the number of rows using the Spark DataFrame's count method, and then we get the list of column names and compute the length of that list. We get the same resulting shape across all of these methods, and picking the right one for your analysis comes down to your preferred programming language for working with data, the size of your data, and whether you need the shape values stored in a variable or you're just doing a little bit of interactive exploration. So we know we have eight columns in our dataset, but we don't know what information they contain. We could just view this information by displaying the table and looking at the columns, or, more directly, we can look at the column types by describing the table. This shows us the schema of the table: its columns and its types. As we mentioned earlier during data aggregation, most of our columns are of numeric type. That's what double means here. And we have the lifestyle column as well, which is a string. And finally, we can compute summary statistics on our columns using a few different methods. The one we're using here is the summary method of the Spark DataFrame. It computes information like the mean, standard deviation, and interquartile range, among others, for each of our columns. Now, in the previous video, we ended by introducing the idea of Anscombe's quartet. Anscombe's quartet is a collection of four unique two-column datasets. Each of these datasets contains identical summary statistics, like the mean, standard deviation, and correlations. And they even have the same linear regression lines.
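Here is a small self-contained sketch of those shape and summary checks. In Databricks the table would come from spark.sql("SELECT * FROM ..."); here a toy pandas frame stands in, and the "Spark-style" route of counting rows and column names separately is emulated in pandas:

```python
import pandas as pd

# Toy stand-in for the aggregated user-level table.
df = pd.DataFrame({
    "device_id": [0, 1, 2],
    "avg_steps": [11000.0, 9000.0, 10000.0],
})

# pandas route: .shape gives (rows, columns) as a tuple in one step.
n_rows, n_cols = df.shape

# Spark-style route, emulated: count the rows, then take the length
# of the list of column names. The results agree.
assert n_rows == len(df) and n_cols == len(df.columns)

# Summary statistics, analogous to Spark's summary()/describe():
# count, mean, standard deviation, quartiles, and so on per column.
stats = df["avg_steps"].describe()
print((n_rows, n_cols), stats["mean"])
```

Both routes yield the same shape; the difference is only where the computation happens and whether the values end up in Python variables.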
But the problem is that these datasets are in fact very different, and we really only see that by visualizing the data. You can see the four datasets here in scatter plots, and you'll notice just how different they are. This is why it's important to visualize our data: it can show us things that simple summary statistics can't. There are plenty of ways we can visualize data. Python has plenty of tools like Matplotlib, Seaborn, and Plotly. But Databricks also has built-in data visualization capabilities that make visualizing large-scale distributed datasets really easy. So let's start with viewing the distribution of a single numeric variable. We can do this by just selecting all of the rows from our SQL table like we've done before. Now, by default, this is going to display the data, but we can click on the plot button below the displayed table to instead display a data visualization. Here we're going to need to manipulate a few plot options. The first thing we want to do is change the display type from bar chart to histogram. After that, we want to remove lifestyle and device ID from the keys field. What we're going to be left with is a histogram where steps is on the x-axis. We can also control other options, like the number of bins in the histogram. Next, let's look at the distribution of steps for each unique lifestyle. We can do this again by displaying our data and clicking the plot button to switch to a data visualization. This time, we're going to click plot options, and we're going to select a box plot instead of a bar plot. The next thing we want to do is remove the device ID from our keys field, and we want to change our aggregation from sum to average. Now this gives us a box plot showing the distribution of average steps for sedentary users, weight trainer users, athlete users, and cardio enthusiast users, with average as our aggregation. And finally, let's begin to look at two numeric variables at the record level.
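The built-in plot UI handles binning for us, but the same histogram can be computed programmatically. A minimal sketch with NumPy, using made-up average step counts (the values and bin count are assumptions for illustration):

```python
import numpy as np

# Hypothetical average daily step counts for eight users.
steps = np.array([4000, 5200, 7000, 7500, 9800, 10100, 11900, 12200])

# Bin the values into 4 equal-width bins, mirroring the
# "number of bins" option in the Databricks histogram plot.
counts, bin_edges = np.histogram(steps, bins=4)

print(counts)      # how many users fall into each bin
print(bin_edges)   # the 5 bin boundaries for the 4 bins
```

Every user lands in exactly one bin, so the counts always sum back to the number of rows, which is a handy sanity check on any histogram.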
We can do this with a scatter plot to help us meet our objective of understanding how our records are related to one another. We'll again display our data, then we'll move to a plot by clicking the plot button at the bottom of the cell, and this time, in the plot options, we're going to select a scatter plot. The first thing we're going to do is remove all of the columns from the current fields, and we're going to add the average BMI and the average active heart rate columns to the values section to see how these two columns are related to one another. We can also add more columns by adding them to the values field. So if we wanted to add average workout minutes here as well, we could add that. And then when we apply these changes and rerun our query, what we'll see, if we zoom in, is that this plots each unique pair of columns against one another. This can be more informative, but it can also be more difficult to interpret, especially as you increase the dimensions even further. And this brings us to some of the limitations of data visualization. While it's a useful tool, there are a few drawbacks. The first is that data visualization can bias our understanding of data. We might think we see relationships in data, but they might not actually be that strong. And secondly, as we just saw with our scatter plots, data visualization can be really difficult when working with more than two dimensions. We'll investigate some ways of learning from data in more than two dimensions through the rest of this lesson and the rest of the module. But first, let's take a short quiz on data exploration and visualization.
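One way to complement those pairwise scatter plots, and to guard against eyeballing a relationship that isn't really there, is to compute the pairwise correlations numerically. A hedged pandas sketch with made-up user-level metrics (the column names and values are assumptions, not the course's actual data):

```python
import pandas as pd

# Hypothetical aggregated user-level metrics for four users.
users = pd.DataFrame({
    "avg_bmi": [21.0, 24.5, 27.0, 30.5],
    "avg_active_heartrate": [150.0, 140.0, 132.0, 125.0],
    "avg_workout_minutes": [60.0, 45.0, 30.0, 15.0],
})

# One correlation per unique pair of columns, just as the scatter plot
# matrix draws one panel per pair. Values near -1 or 1 suggest a strong
# linear relationship; values near 0 suggest a weak one.
corr = users.corr()
print(corr.round(2))
```

Like the scatter plot matrix, this grows quadratically in the number of columns, which is exactly the more-than-two-dimensions difficulty noted above.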