0:10

Hello. This lesson will introduce descriptive, or summary statistics.

This is an important concept because when you're working with data, particularly large data sets, it's often useful to get a quick feel for how your data is distributed. The best way to do this is to use descriptive statistics to get a few measurements that provide you with that information.

Now, many of these statistics are likely familiar, such as the mean, the median, and the variance. This lesson will focus on calculating and using those and other related statistics within a Python script. We'll be using the Introduction to Descriptive Statistics notebook to demonstrate this.

For now, we are focusing solely on one-dimensional data sets. In particular, you can think of this as either a NumPy one-dimensional array or a single column from a pandas DataFrame. In either case, we want to understand: what's the typical value in this data set? What's the spread around that typical value? How is the distribution shaped overall? Is it skewed to one side? Does it have multiple peaks? We want to find statistics that can give us measurements of these and other related quantities.

To do this, we're going to use a data set that is included as part of the seaborn Python package. Seaborn is a visualization library, and we'll be learning more about it in future lessons, but it includes sample data sets for making visualizations. The tips data set represents people visiting a restaurant: parties may come at dinner time or at lunch, on different days of the week; the person who pays may be male or female; the bills and tips vary; and the parties have different numbers of people in them. So this is a nice, simple data set on which you can perform statistical analyses.

One thing we'll want to do is extract columns from our DataFrame, because we're only focusing on a single dimension, a single column, at the moment. To do this easily, we can simply take the total_bill column from the tips DataFrame and extract it as a one-dimensional NumPy array, which we demonstrate by slicing out 10 elements, as shown here.
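As a minimal sketch of that extraction (using a tiny hand-made DataFrame in place of the real tips data; the bill values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for seaborn's tips data: a small DataFrame with a total_bill column.
tips = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68, 24.59]})

# Extract the column as a one-dimensional NumPy array.
total_bill = tips["total_bill"].values

print(type(total_bill))    # a numpy.ndarray
print(total_bill[:3])      # slice out the first few elements
```

With the real data you would load it via seaborn's `load_dataset("tips")` and slice in the same way.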

The first thing we look at is measures of centrality, or location. These include the mean, the median, and the mode. We demonstrate how to compute these with pandas, where we can select a column and compute its mean. This is the arithmetic mean; there are other means as well, including the geometric mean and the harmonic mean. To use those, you actually have to use a different Python library, scipy.stats, which includes both the geometric and harmonic means.
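A quick sketch of the three means side by side (the data values here are chosen only to make the arithmetic easy to follow):

```python
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 4.0, 8.0])

arithmetic = data.mean()       # (1 + 2 + 4 + 8) / 4 = 3.75
geometric = stats.gmean(data)  # (1 * 2 * 4 * 8) ** (1/4), about 2.83
harmonic = stats.hmean(data)   # 4 / (1/1 + 1/2 + 1/4 + 1/8), about 2.13
```

For positive data the three always satisfy harmonic ≤ geometric ≤ arithmetic, which you can verify on this example.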

We can also calculate the median. The main idea here is that if you've sorted your data set, the middle element is the median. If you have an odd number of values, it's easy: there is an actual middle value. But if you have an even number of values, you have to make a choice about how to compute it, because there is no single middle value in your data set. Typically, you take the average of the two values on either side of the middle; in this case, that would be the average of two and three, or two point five. In some cases, however, you want to restrict the median to values that actually lie within your data set, so you would choose either the low value, two, or the high value, three, and there are different ways to do this in Python.
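The built-in statistics module covers all three choices; a sketch with a four-element list whose middle pair is two and three:

```python
import statistics

data = [1, 2, 3, 4]  # even number of values; the middle pair is 2 and 3

med = statistics.median(data)        # averages the middle pair -> 2.5
med_low = statistics.median_low(data)    # lower of the middle pair -> 2
med_high = statistics.median_high(data)  # higher of the middle pair -> 3
```

`median_low` and `median_high` are the options to reach for when the median must be an actual value from the data set.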

The mode is the most common value in a data set. In a plot, it appears as the highest peak in your distribution.

We can also compute more robust statistics, such as the trimmed mean: if our distribution has values far out on either side, what we call outliers, we may want to remove them so they don't bias our measurement, and we can do that with trimmed statistics. We demonstrate this with the scipy.stats module, which we use to compute the mode and a trimmed version of the mean. You can see that, with the bounds used here, the trimmed mean is actually quite different from the ordinary mean.
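A sketch of how trimming changes the answer, using invented data with outliers at both ends:

```python
import numpy as np
from scipy import stats

# Illustrative data: the bulk sits near 10-14, with outliers 1 and 100.
data = np.array([1.0, 10.0, 11.0, 12.0, 13.0, 14.0, 100.0])

plain_mean = data.mean()               # pulled up by the outlier: 23.0
trimmed = stats.trim_mean(data, 0.15)  # cuts ~15% from each end -> 12.0
```

Dropping one value from each tail leaves the central five values, whose mean (12.0) is a much better summary of the bulk of this data than the plain mean.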

We can also demonstrate this with a Python list, where we can use the built-in statistics module to compute things such as the mean, the median, and the mode. The reason I really like this built-in library is that you can apply its mode function to categorical data. Here we have a list of colors: red, blue, blue, brown, brown, brown. What's the modal color, the most common color in the list? It's brown, as you can see.
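A minimal sketch of that categorical mode, with the same list of colors:

```python
import statistics

colors = ["red", "blue", "blue", "brown", "brown", "brown"]

# mode works on non-numeric data: it just counts occurrences.
modal_color = statistics.mode(colors)   # 'brown' appears three times

# The numeric functions work on plain Python lists as well.
mean_value = statistics.mean([1.0, 2.0, 3.0])
```

NumPy's numeric routines can't do this directly, which is why the built-in module is handy for categorical columns.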

We can also compute the low, high, and average medians, as shown here.

The second thing we look at is measures of variability. These are different techniques for figuring out how spread out the data are; in particular, how spread out they are around the mean. One thing we can calculate is the mean deviation, but this has the challenge that individual deviations are sometimes positive and sometimes negative. So what we typically do is take the absolute value of each deviation, or square it; either way, this involves summing up intrinsically positive quantities.

These are the two most important measures because we'll use them all the time, and they have special names: the sum of absolute deviations is related to the L1-norm, and the sum of squared deviations, which gives the variance, to the L2-norm. You will be seeing these when we get into machine learning later in the course.

Now, one challenge with the variance is that, because we've taken a difference and squared it, the units of our variable are squared too. For instance, if X is measured in length, the variance will have units of length squared, which complicates comparing the variance to the values themselves. To simplify this, we generally just take the square root and end up with the standard deviation, or root mean square deviation.
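A sketch of both flavors of spread on a small invented data set (chosen so the numbers come out round):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()                      # 5.0
mad = np.mean(np.abs(data - mean))      # mean absolute deviation (L1-style): 1.5
variance = np.mean((data - mean) ** 2)  # 4.0 -- note the units are squared
std = np.sqrt(variance)                 # 2.0 -- back in the original units
```

Here `np.std(data)` would return the same 2.0 directly; computing it by hand just makes the squared-units issue visible.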

There are some other things that this notebook looks at as well, and you should go through them, but what we're going to talk about next are measures of a distribution. A location measurement and a dispersion, or variation, measurement are only two numbers; we might want a better idea of the full distribution of the data in our data set.

We can do this by dividing the data into chunks and seeing how those chunks are distributed. Dividing the data into four gives us quartiles; into five, quintiles; into ten, deciles; and into one hundred one-percent chunks, percentiles. NumPy has a function that does this, percentile, and we simply pass in which percentiles we want. For instance, the median is the 50th percentile, halfway through the data. We can compute quartiles, quintiles, and percentiles in this manner, and that's what's demonstrated here.
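A sketch of that function on an easy-to-check data set, the integers one through one hundred:

```python
import numpy as np

data = np.arange(1, 101)  # 1, 2, ..., 100

median = np.percentile(data, 50)              # halfway through -> 50.5
quartiles = np.percentile(data, [25, 50, 75])     # three cut points
quintiles = np.percentile(data, [20, 40, 60, 80])  # four cut points
```

Passing a list of percentages returns all the cut points at once; note that k chunks need k - 1 cut points, which is why quartiles produce three values.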

Next, we can look at weighted statistics. If we have errors on our attributes, or features, we can weight the measurements of our mean and standard deviation to account for the fact that some data have higher accuracy or precision, and we want those points to dominate the calculation. That is what the rest of this notebook demonstrates.
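One common scheme (a sketch, with invented values and hypothetical measurement errors) is inverse-variance weighting, where more precise points count for more:

```python
import numpy as np

values = np.array([10.0, 12.0, 14.0])
errors = np.array([1.0, 2.0, 1.0])   # hypothetical per-point uncertainties
weights = 1.0 / errors ** 2          # inverse-variance weights

# Weighted mean: the precise points (errors of 1.0) dominate.
wmean = np.average(values, weights=weights)           # 12.0

# Weighted variance and standard deviation around that mean.
wvar = np.average((values - wmean) ** 2, weights=weights)
wstd = np.sqrt(wvar)
```

`np.average` accepts a `weights` argument directly, which is why it's used here rather than `np.mean`.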

Next, we move on to specific shape parameters that go beyond percentiles or quantiles. These are known as moments of the distribution. The first two moments are the mean and the variance (or standard deviation), but we can also calculate the third- and fourth-order moments, called the skewness and the kurtosis. These measure, respectively, the symmetry of the distribution with respect to the mean and how peaked the distribution is relative to a normal distribution.
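A sketch comparing a symmetric and a skewed sample (both randomly generated here, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=100_000)       # symmetric, bell-shaped
skewed_sample = rng.exponential(size=100_000)  # long right tail

s_norm = stats.skew(normal_sample)     # near 0 for symmetric data
s_exp = stats.skew(skewed_sample)      # clearly positive (right tail)
k_norm = stats.kurtosis(normal_sample) # near 0: scipy reports excess
                                       # kurtosis, normal as the baseline
```

By default scipy's kurtosis subtracts 3, so a normal distribution scores about zero and positive values mean heavier tails than normal.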

Now, the last thing I want to talk about is something that can sometimes cause confusion. Typically you're given a data set and you think about it on its own. But in reality, what we usually have is a large population from which some data has been taken; we analyze that subset and call it a sample. For instance, the tips data set is not all of the restaurant's data; it's a subset of it, and that's what we're analyzing. So we have to be cognizant that if we're trying to use measurements of the sample to infer properties of the general parent population, things are a little bit different.

For instance, we calculate the average, or mean, value of the parent from the sample in the standard way. But when we calculate the variance, we have to account for the fact that we've effectively reduced the number of independent data points by one, so for the sample variance, and likewise the standard deviation, we divide by n - 1 instead of n. In Python we can do this simply by using the ddof parameter and setting it to one; this tells the standard deviation function to calculate the sample version with n - 1 in the denominator.
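A sketch of the ddof parameter on a tiny invented data set, where the n versus n - 1 difference is easy to see:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])  # mean is 5.0, n = 4

pop_var = np.var(data)             # divides by n = 4     -> 5.0
sample_var = np.var(data, ddof=1)  # divides by n - 1 = 3 -> 6.666...
sample_std = np.std(data, ddof=1)  # sqrt of the sample variance
```

ddof stands for "delta degrees of freedom": the denominator becomes n - ddof, so ddof=0 gives the population formula and ddof=1 the sample formula.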

Now, often we are using very large data sets, where the difference between dividing by 1,000,000 and 999,999 is minimal, and in practice we can often ignore it. But it is important to be aware of if you're dealing with smaller data sets, say something on the order of 50 or 100 points.

So that should give you a pretty good introduction to the idea of descriptive statistics and how we use them. If you have any questions, let us know in the course forum. Good luck.
