In this video, we'll be talking about descriptive statistics.

When you begin to analyze data,

it's important to first explore your data before

you spend time building complicated models.

One easy way to do so is to calculate some descriptive statistics for your data.

Descriptive statistical analysis helps to describe the basic features of

a dataset and gives a short summary of the sample and measures of the data.

Let's look at a couple of useful methods.

One way in which we can do this is by using the describe method in pandas.

When you apply describe to your data frame,

it automatically computes

basic statistics for all numerical variables.

It shows the mean, the total number of data points,

the standard deviation, the quartiles and the extreme values.

Any NaN values are automatically skipped in these statistics.

This method will give you a clear idea of the distribution of your different variables.
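
As a minimal sketch of what this looks like in code, assuming a small made-up data frame: the column names and values below are hypothetical and are not the course dataset. It shows that describe covers only the numeric columns by default and that a missing price is left out of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "engine-size": [130, 152, 109, 136, 131],
    "price": [13495.0, 16500.0, 13950.0, np.nan, 15250.0],
    "drive-wheels": ["rwd", "rwd", "fwd", "4wd", "fwd"],
})

# describe() summarizes the numeric columns only by default:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)

# The NaN in "price" is skipped, so its count is one lower than "engine-size".
print(summary.loc["count", "price"])
print(summary.loc["count", "engine-size"])
```

The text column "drive-wheels" does not appear in the summary, which is why categorical variables need a different kind of summary.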

You could also have categorical variables in your dataset.

These are variables that can be divided up into

different categories or groups and have discrete values.

For example, in our dataset we have the drive system as a categorical variable,

which consists of the categories

front wheel drive, rear wheel drive and four wheel drive.

One way you can summarize the categorical data is by using the value_counts method.

We can change the name of the column to make it easier to read.

We see that we have 118 cars in the front wheel drive category,

75 cars in the rear wheel drive category,

and 8 cars in the four wheel drive category.
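
Here is a rough sketch of that step, assuming a hypothetical column called "drive-wheels" with the codes fwd, rwd and 4wd. The six rows are made up for illustration, so the counts are not the ones quoted above.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more rows.
df = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

# Count how many rows fall into each category (largest count first).
counts = df["drive-wheels"].value_counts()

# Turn the result into a data frame and give the column a readable name.
# "value_counts" is just an illustrative choice of name.
drive_wheels_counts = counts.to_frame(name="value_counts")
drive_wheels_counts.index.name = "drive-wheels"
print(drive_wheels_counts)
```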

Box plots are a great way to visualize numeric data,

since you can see the various distributions of the data.

The main features that the box plot shows are the median

of the data, which represents where the middle data point is.

The upper quartile shows where the 75th percentile is.

The lower quartile shows where the 25th percentile is.

The data between the upper and lower quartiles represents the interquartile range.

Next, you have the lower and upper extremes.

These are calculated as 1.5 times the interquartile range above

the 75th percentile and as 1.5 times the IQR below the 25th percentile.

Finally, box plots also display outliers as

individual dots that occur outside the upper and lower extremes.

With box plots, you can easily spot

outliers and also see the distribution and skewness of the data.
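
To make these definitions concrete, here is a small sketch that computes the same quantities by hand with NumPy. The numbers are made up, and the variable names are only illustrative. Note that plotting libraries such as Matplotlib usually draw each whisker only as far as the most extreme data point that still lies inside these limits.

```python
import numpy as np

# Hypothetical values, already sorted to make the result easy to follow.
values = np.array([5, 7, 8, 9, 10, 11, 12, 13, 30], dtype=float)

median = np.median(values)                # middle data point
q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
iqr = q3 - q1                             # interquartile range

# Lower and upper extremes: 1.5 * IQR beyond the quartiles.
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

# Points outside the extremes would be drawn as individual dots.
outliers = values[(values < lower_extreme) | (values > upper_extreme)]

print(median, q1, q3, iqr)
print(lower_extreme, upper_extreme)
print(outliers)
```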

Box plots make it easy to compare between groups.

In this example, using a box plot, we can see the distribution

of price across the different categories of the drive wheels feature.

We can see that the distribution of price for

rear wheel drive is distinct from the other categories,

but the prices for front wheel drive and four wheel drive are almost indistinguishable.
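
One possible way to draw such a grouped box plot is sketched below. It uses plain Matplotlib, which is an assumption, since the plotting call is not named here. The column names and prices are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, three cars per drive-wheels category.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "price": [6500, 7900, 9200, 16500, 21000, 34000, 7600, 9200, 11000],
})

# Collect the prices of each category (groupby sorts the category names).
groups = {name: grp["price"].to_numpy() for name, grp in df.groupby("drive-wheels")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))  # one box per category, at x = 1, 2, 3
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("drive-wheels")
ax.set_ylabel("price")
ax.set_title("Price by drive wheels")
```

In an interactive session you would finish with plt.show() to display the figure.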

Often, we see continuous variables in our data.

These data points are numbers contained in some range.

For example, in our dataset price and engine size are continuous variables.

What if we want to understand the relationship between engine size and price?

Could engine size possibly predict the price of a car?

One good way to visualize this is using a scatter plot.

Each observation in the scatter plot is represented as a point.

This plot shows the relationship between two variables.

The predictive variable is the variable that you are using to predict an outcome.

In this case, our predictive variable is the engine size.

The target variable is the variable that you are trying to predict.

In this case, our target variable is the price, since this would be the outcome.

In a scatter plot, we typically set the predictive variable on the x-axis or

horizontal axis and we set the target variable on the y-axis or vertical axis.

In this case, we will thus plot the engine size on

the x-axis and the price on the y-axis.

We are using the Matplotlib function scatter here,

which takes an x and a y variable.

Something to note is that it's always important to

label your axes and write a general plot title,

so that you know what you're looking at.
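
A minimal sketch of such a plot follows, assuming made-up engine sizes and prices. The values, title and label text are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical values: predictive variable (x) and target variable (y).
engine_size = [97, 109, 130, 152, 209]
price = [7000, 13950, 13495, 16500, 30760]

plt.scatter(engine_size, price)  # x first, then y

# Label the axes and give the plot a title.
plt.title("Scatter plot of engine size vs. price")
plt.xlabel("Engine size")
plt.ylabel("Price")

ax = plt.gca()  # handle to the current axes, useful for inspection
```

In an interactive session you would finish with plt.show() to display the figure.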

Now, how is the variable engine size related to price?

From the scatter plot, we see that as

the engine size goes up, the price of the car also goes up.

This is giving us an initial indication that there is

a positive linear relationship between these two variables.
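
If you want a quick numeric check to go with the visual impression, one option (an addition here, not something shown in the plot itself) is the correlation coefficient between the two columns. The sketch below assumes hypothetical column names and made-up values.

```python
import pandas as pd

# Hypothetical toy data; not the course dataset.
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 209],
    "price": [7000, 13950, 13495, 16500, 30760],
})

# Pearson correlation: values near +1 suggest a positive linear relationship.
r = df["engine-size"].corr(df["price"])
print(r)
```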
