Let's look at how we can use summary statistics to explore data in more detail. After this video you will be able to define what a summary statistic is, list three common summary statistics, and explain how summary statistics are useful in exploring data.

Summary statistics are quantities that describe a set of data values. They provide a simple and quick way to summarize a dataset. We will discuss three main categories of summary statistics: measures of location or centrality, measures of spread, and measures of shape.

Measures of location are summary statistics that describe the central or typical value in your dataset. These statistics give a sense of the middle or center of the dataset. Examples are the mean, median, and mode. The mean is simply the average of the values in a dataset. The median is the value in the middle if you sort the values in your dataset: in a sorted list, half of the values will be less than the median and half will be greater than the median. If the number of data values is even, then the median is the mean of the two middle values. The mode is the value that is repeated more often than any other value.

In this example we have a dataset with ten values. For this dataset, the mean is 51.1, which is the sum of all the values divided by 10. The median is 46: if you sort these numbers, the two middle numbers are 42 and 50, and the average of these two numbers is 46. There are two modes for this dataset, 42 and 78, since each occurs twice, more often than any other value in the dataset.

Measures of spread describe how dispersed or varied your dataset is. Common measures of spread are the minimum, maximum, range, standard deviation, and variance. The minimum and maximum are, of course, the smallest and largest values in your dataset, respectively. The range is simply the difference between the maximum and minimum and tells you how spread out your data is. Standard deviation describes the amount of variation in your dataset.
A low standard deviation value means that the samples in your dataset tend to be close to the mean, and a high standard deviation value means that the data samples are spread out. Variance is closely related to standard deviation; in fact, the variance is the square of the standard deviation, so it also indicates how spread out the data samples are from the mean. For the same dataset, the range is 66, which is the difference between the largest number, 87, and the smallest number, 21. The variance is 548.767, which you can calculate using a calculator or a spreadsheet, and the standard deviation is 23.426, which is the square root of the variance.

Measures of shape describe the shape of the distribution of a set of values. Common measures of shape are skewness and kurtosis. Skewness indicates whether the data values are asymmetrically distributed. A skewness value of around zero indicates that the data distribution is approximately normal, as shown in the middle figure in the top diagram. A negative skewness value indicates that the distribution is skewed to the left, as shown in the left figure in the top diagram. A positive skewness value, on the other hand, indicates that the data distribution is skewed to the right. Kurtosis measures the tailedness of the data distribution, or how heavy or fat the tails of the distribution are. A high kurtosis value describes a distribution with longer, fatter tails and a higher, sharper central peak, indicating the presence of outliers. A low kurtosis value, on the other hand, describes a distribution with shorter, lighter tails and a lower, broader central peak, suggesting an absence of outliers. In our age example, the skewness is about 0.3, indicating a slight positive skew, and the kurtosis is -1.2, indicating a distribution with a low, broad central peak and shorter, lighter tails.

Measures of dependence determine whether any relationship exists between variables.
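As a rough illustration of the location, spread, and shape measures above, here is a small Python sketch using only the standard library. The ten values are made up for this example (they are not the dataset from the video, so the results differ from the 51.1 mean and other numbers quoted above):

```python
import statistics

# Made-up ten-value dataset for illustration (not the one from the video)
ages = [21, 34, 42, 42, 50, 55, 62, 78, 78, 87]

# Measures of location
mean = statistics.fmean(ages)          # sum of the values divided by the count
median = statistics.median(ages)       # average of the two middle values (even count)
modes = statistics.multimode(ages)     # all values tied for the highest frequency

# Measures of spread
rng = max(ages) - min(ages)            # range = maximum - minimum
var = statistics.variance(ages)        # sample variance
std = statistics.stdev(ages)           # standard deviation = square root of variance

# Measures of shape, computed from central moments
n = len(ages)
m2 = sum((x - mean) ** 2 for x in ages) / n
m3 = sum((x - mean) ** 3 for x in ages) / n
m4 = sum((x - mean) ** 4 for x in ages) / n
skewness = m3 / m2 ** 1.5              # near zero for a symmetric distribution
excess_kurtosis = m4 / m2 ** 2 - 3     # negative: lighter tails than a normal curve

print(mean, median, modes, rng, round(std, 2),
      round(skewness, 2), round(excess_kurtosis, 2))
```

In practice you would rarely hand-roll the shape measures; libraries such as pandas (`DataFrame.describe`) and SciPy (`scipy.stats.skew`, `scipy.stats.kurtosis`) compute all of these directly.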
Pairwise correlation is a commonly used measure of dependence. This is a table that shows pairwise correlation for a set of variables. Note that correlation applies only to numerical variables. Correlation values range from -1 to 1, with zero indicating no correlation and values near 1 or -1 indicating a strong positive or negative relationship. So a correlation of 0.89 is very strong, and this is expected, since a person's height and weight should be highly correlated.

The summary statistics we just covered are useful for numerical variables. For categorical variables, we want to look at statistics that describe the number of categories and the frequency of each category. This is done using a contingency table. Here's an example that shows the distribution of people's pets and their colors. We can see that the most common pet is a dog and the least common is a fish. Similarly, black is the most common color and orange the least common. The contingency table also shows the distribution between the categories. For example, only fish are orange, while most of the brown pets are dogs.

In addition to the traditional summary statistics for numerical variables and category counts for categorical variables, for machine learning problems we also want to examine some additional statistics to quickly validate the data. One of the first things to check is the number of rows and the number of columns in your dataset. Does the number of rows match the expected number of samples? Does the number of columns match the expected number of variables? These should be very quick and easy checks. Another easy data validation check is to look at the values in the first and last few samples in your dataset to see if they're reasonable. For example, do the temperature values look to be in the right units of measure? Do the values for rainfall look correct, or are there some values that look out of place? Are the data types for your variables correct? For example, is the date field captured as a date or timestamp, or is it captured as a string or a numerical value? This will affect how these fields should be processed.

Another important step is to check for missing values. You need to determine the number of samples with missing values. You also need to determine whether there are any variables with a high percentage of missing values. Handling missing values is a very important step in data preparation, which we will cover in the next module. Having this information will be very helpful in determining how missing values should be handled.

We covered several types of summary statistics useful for exploring data in machine learning. These statistics provide useful information about your dataset and should be thoroughly examined if you want to get a better understanding of your data.
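To make the correlation, contingency-table, and validation checks above concrete, here is a rough standard-library Python sketch over a tiny invented set of records (all field names and values are hypothetical). It counts rows and columns, tallies missing values per column, computes the Pearson correlation between height and weight, and builds a pet-by-color contingency table with `collections.Counter`:

```python
from collections import Counter

# Hypothetical records; None marks a missing value
people = [
    {"height": 150, "weight": 52,   "pet": "dog",  "color": "black"},
    {"height": 160, "weight": 60,   "pet": "cat",  "color": "brown"},
    {"height": 170, "weight": 68,   "pet": "dog",  "color": "brown"},
    {"height": 180, "weight": 80,   "pet": "fish", "color": "orange"},
    {"height": 175, "weight": None, "pet": "dog",  "color": "black"},
]

# Quick validation: do the row and column counts match expectations?
n_rows = len(people)
n_cols = len(people[0])

# Missing values per column
missing = {col: sum(1 for row in people if row[col] is None)
           for col in people[0]}

# Pearson correlation between height and weight (rows with missing values dropped)
pairs = [(r["height"], r["weight"]) for r in people
         if r["height"] is not None and r["weight"] is not None]
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
cov = sum((x - mx) * (y - my) for x, y in pairs)
sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
r = cov / (sx * sy)

# Contingency table: count of each (pet, color) combination
table = Counter((row["pet"], row["color"]) for row in people)

print(n_rows, n_cols, missing, round(r, 2), table)
```

With pandas, the same checks map onto `df.shape`, `df.head()` and `df.tail()`, `df.dtypes`, `df.isna().sum()`, `df.corr()`, and `pd.crosstab` for the contingency table.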