0:05

So we've looked at the histogram as a way of visually exploring our data.

Â If you wanted to summarize the data numerically, let's talk about first

Â the measures of central tendency that we can use to explain the data.

Â So we often look at the mean or the average of the data.

Â Another metric that you might use is the median, or what is the middle value?

Â That is if you rank order your data from highest to lowest,

Â what observation falls in the middle of that data?

Â Such that 50% of observations are greater than that,

Â 50% of observations are going to be less than or equal to that.

Â So we'll look at both of those commands, both mathematically and within Excel.

Â There are also measures of dispersions that we can use to measure the data,

Â the most common being looking at the standard deviation with a variance that's

Â observed in the data.

Â That's giving us a measure of dispersion from the mean so from the average value.

Â You might also look at the range or

Â the interquartile range to get a sense for how much variation is there in your data.

Â So these are the formulas that we can both the mathematical expression for

Â calculating the average as well as how

Â we can derive that using Excel the command equals the average and

Â then you would enter the cell range for where your data resides.

Â What Excel's going to do, it's going to add up all those values in that range,

Â divide by the number of observations to come up with the average.

Â The median, if you wanted to calculate that equals median and

Â then the highlighted data range.

Â One thing to note and we'll take a look at this in example to customer value in

Â a second is that the median is not going to be sensitive to the extreme values.

Â So if you've got a lot of variation in the value of your customers,

Â some are really high but most customers are really low.

Â When you calculate the average,

Â that's going to be highly sensitive to those extreme observations.

Â The medians is not going to be sensitive because it doesn't care with the exact

Â values are.

Â It's just cares about finding the point where 50% are above 50%

Â are below that value.

Â It doesn't matter how far above or how far below,

Â all that it is concerned with is the rank order.

Â 2:09

All right, so here is an example where we've done some work

Â looking at the customer base of new customers signing up for telecom service.

Â And when we calculate their lifetime value,

Â technique that we'll look at when we examine customer analytics with regards to

Â studying retention and driving customer value.

Â Really the story that we get depends on the metrics that we're using.

Â If we look at the average customer lifetime value based on this histogram,

Â turns out that it's around $1,200.

Â But if we look at the median customer lifetime value, that is

Â the point at which half my customers are going to be worth less than this amount.

Â Half my customers are going to be worth more than this amount,

Â turns out that it's only $800.

Â That difference between the median and

Â the mean indicative of having a skewed relationship.

Â In this case the data is going to be right skewed.

Â If you look at our distribution, as we go from left to right,

Â from low value to high value we see fewer and fewer observation.

Â The data seems to fall off so

Â that's indicative of that right skewed relationship.

Â So which customers would we want to target?

Â Well, if there were a way for us to identify the customers who have that

Â high lifetime value, it'd be great for us to focus our efforts on these customers.

Â But we need some way of identifying those individuals.

Â Maybe it's on the basis of demographic variables, maybe it happens to be

Â the region that they're in, maybe it's the method of acquisition.

Â So if we have additional variables that tell us something about

Â the individuals who have that high lifetime value.

Â Then we can make an inform decision of do we want to focus on those individuals?

Â Where they're very valuable but they aren't that many of them or

Â do we want to focus on customers who may not be as valuable to us.

Â But maybe it's easier for

Â us to acquire them and they're a lot more of those customers.

Â That's what we're going to look at when we talk about the customer analytics.

Â How hard is it to acquire these different customer groups.

Â What's the difference in value going to be?

Â And what are the levers that we can try to pull to try to attract them?

Â All right, looking at dispersion, again common measure's going to be the variance.

Â So I've given you here the mathematical derivation for the variance,

Â as well as the excel notation equals var and then highlighting the data_range.

Â And you'll see that In the derivation for

Â the variance, what we're really focused on is the key here is this piece.

Â It's the observation minus the average.

Â The reason that it's squared is we really don't care how whether it's a positive or

Â a negative deviation from the average or from the mean.

Â So what we're concerned about is the magnitude of that deviation.

Â So the more dispersion there is around the mean,

Â the higher the variance is going to be.

Â Variance can also be reported in terms of a standard deviation, so

Â stdev which is really just the square root of the variance.

Â 5:02

Other formulas that you might use to understand how much dispersion is in your

Â data, we can calculate the range.

Â So what's the largest value minus smallest value?

Â So here, we can use the max and

Â min commands to calculate just the size of that range.

Â You might also look at what referred to as the IQR or the interquartile range.

Â So this is the range that contains the middle 50% of the data.

Â There's not a built in command to say give me the IQR, so

Â what we can do is use the percentile command.

Â And so what we're doing in this equation say, let's extract from the data

Â what value is at the 75th percentile and

Â we'll subtract from that the values that's at the 25th percentile,

Â that difference giving us the interquartile range.

Â So these are methods for summarizing data both in terms of the average median,

Â the central tendency and in terms of dispersion.

Â One thing that we always have to contend with are outliers in our data,

Â those extreme observations that can heavily influence our data.

Â In some cases, those outliers might be caused by coding errors.

Â Sometimes in data you'll see that the code 9999 is

Â used to refer to data that's missing.

Â Well, we actually have to remove that before we conduct our analysis,

Â because otherwise it's going to skew things.

Â So when we detect those outliers, it may be a reason to dig in further and

Â understand what's going on.

Â It is possible that they're just mistakes, but

Â any assumptions that you're making about how you deal with those outliers,

Â those are conclusions that should be noted in any report that you're producing.

Â 6:38

All right, another aspect of quantitative data might be time series or

Â a temporal dimension.

Â So these are time series plots that were produced using television tuning data.

Â So data collected from set-top boxes, for

Â two television programs in the, I believe, early 2000s.

Â When we're collecting this data, America's Next Top Model and

Â Pussycat Dolls Present Girlicious, the dips in the audience level or

Â the dips in the ratings caused by commercial breaks.

Â So the audience staying tuned to the program, commercial break comes on,

Â people tune away, come back when the program resumes.

Â Well that's a characteristic of this type of data.

Â Audience levels decline during commercial breaks.

Â So in any type of analysis,

Â that's something that we're going to need to control for.

Â Another popular time series data is stock performance data.

Â So we can see, do stocks or do indices tend to move together or

Â are they tending to move apart?

Â We'll get into looking at time series data and building forecasting

Â models in a little bit, but just to foreshadow the direction we're going in,

Â one component that we're going to look for is the trend in the data.

Â So for a new product that's launched, as time goes on sales may ramp up.

Â Something else that we might also look at is a cyclical pattern.

Â Is it related to seasonality?

Â Is there pattern in here related to the general state of the economy?

Â And these are factors that we need to know about

Â if they're influencing in this case sales.

Â Because if we're going to be making forecasts going forward, those

Â are variables that have to be included in the predictions that we're making.

Â All right, we've looked already at how do we examine the relationships

Â that may exist among categorical variables.

Â Well, we want to do the same thing when we work out relationships that exist on

Â quantitative variable.

Â A scatter plots is common visual way of doing that so in this particular case

Â comparing the daily returns for the S&P 500 with the Verizon stock.

Â So on the x axis, it's the daily return for

Â the S&P, on the y axis, the daily return for Verizon.

Â And you'll see from the trend line that's been layered on here is

Â that there is generally a positive relationship on days when the S&P went up,

Â Verizon also tended to go up.

Â So we can look at the direction of the relationship,

Â in this case we fit a linear trend to that.

Â May not always be a linear, that's one thing that we're going to keep in mind.

Â We're also going to look at how much dispersion there is.

Â There seems to be a lot of dispersion think of around this line.

Â So from this point all the way down to this point.

Â A lot of dispersion that we're seeing in that scatterplot.

Â Means that if I wanted to know what the Verizon price was going to be on a given

Â day, well I can't just use the S&P because it seems like that's not explaining

Â everything.

Â When we get into talking about linear regression,

Â that's one of the pieces that is going to be key for

Â us is trying to reduce the amount of uncertainty that we have.

Â The more information we have, the better a fit we should see.

Â 9:38

If we're looking at trying to quantify the relationship between two variables

Â a common technique that we're going to use is to look at the correlation.

Â What the correlation does is assess the linear relationship and

Â that's going to be key because not all relationships are going to be linear.

Â As I'll show you in a second, sometimes that correlation is going to be indicative

Â of the nature of the relationship.

Â Other times, it might not be that informative, so

Â we really have to conduct exploratory analysis

Â to understand the patterns that exist between quantitative variables.

Â Within Excel equals correlm that's going to be the command that we use

Â to calculate correlation,

Â highlight the X range in the data then we highlight the Y range in the data.

Â Those are going to be separated by a comma.

Â 10:36

Mathematically and

Â this is something that I hope you never have to calculate by hand, but

Â this is the mathematical derivation behind the correlation coefficient.

Â And the piece that I wanted just draw your attention to just to give you some

Â intuition for what this number really means is going to be this set of terms.

Â So what we're doing is for each observation, we're calculating

Â the deviation from the average X value and we're calculating the deviation from

Â the average Y value and multiplying them together, adding that piece up.

Â Well, what does a positive correlation mean?

Â Positive correlation means that when the X value is above average, which

Â would be given by this first term, the Y value also tends to be above average.

Â Both of them are positive.

Â When is the X value is below the average, it's going to be negative.

Â When the Y value below its average value, that's going to be negative.

Â The product of the two giving us a positive relationship.

Â If we have a negative relationship, when X is above its average value,

Â Y is below its average value and vice versa.

Â So that's where the correlation concept is coming from.

Â All right, so again, we can look at this in terms of the financial performance.

Â So I pulled information,

Â this is data from 2010 on stock performance of some telecom companies.

Â And if we look at Verizon, Comcast and AT&T positively correlated with

Â each other, also positively correlated with the S&P 500.

Â So probably no surprise that these stocks and

Â the S&P 500 generally move in the same direction.

Â 12:34

Whereas when we have a lower correlation coefficient,

Â the 0.4 on the right, or the negative 0.7 you see there's more dispersion there.

Â So there still is a linear relationship.

Â If I were to superimpose a trend line we'd capture a negative relationship but

Â the fact that there's a lower correlation coefficient telling us that there is going

Â to be more dispersion there.

Â So these are all examples when the data actually suggests a linear relationship.

Â But as I'd said earlier, it's not always going to be a linear relationship and

Â that's where we need to be careful.

Â 13:03

So one example of how the correlation can be misleading

Â is that we have outliers in our data.

Â We've got a massive points in this case near the origin, near the lower left and

Â then we've got a single outlier.

Â Well, if we try to draw a best fitting line, because that outlier is going

Â to influence the trajectory of that line, we're actually going to be told that,

Â you know what, there is a strong linear relationship here.

Â But in reality, it's that one outlier that's doing it.

Â If we were to discard that and just look at that mass of points,

Â we would not get nearly as strong a linear relationship.

Â The other piece that we have to be careful about

Â is when we're dealing with non-linearities.

Â So in this case if we're looking at the lower right we have a correlation of zero.

Â Well, that doesn't mean that there's no relationship between X and Y.

Â What it means is that there's not a linear relationship between X and Y.

Â This relationship is a negative quadratic relationship so Y is related to X-squared.

Â Well, it's a very strong quadratic relationship, but

Â it's not a linear relationship.

Â That's why we're getting the correlation of zero.

Â So doing this visual inspection, using scatter plots, using histograms,

Â a very important of the data analysis that we need to do.

Â If we were to just look at the correlation coefficient we see that

Â there's no linear relationship,

Â we might erroneously conclude that there's no relationship at all.

Â 14:27

So to recap the techniques that we've looked at, we've talked about techniques

Â for examining individual quantitative variables.

Â We've also looked at techniques for

Â examining the relationships among the quantitative variables.

Â Then correlation and scatterplots good for

Â looking at the relationships among these two quantitative variables.

Â We'll come back to examining these relationships more

Â formally when we start getting into prediction and

Â we start using linear regression as one of our go-to tools in that forecasting.

Â