So we've looked at the histogram as a way of visually exploring our data. If you wanted to summarize the data numerically, let's talk about first the measures of central tendency that we can use to to explain the data. So we can we often look at the mean or the average of the data. Another metric that you might use is the median or what is the middle value. That is if you rank order your data from highest to lowest, one observation falls in the middle of that data such that 50 percent of observations are greater than that 50 percent of observations are going to be less than or equal to that. So we'll look at both of those commands both mathematically and within Excel. There are also measures of dispersions that we can use to summarize the data. The most common probably being looking at the standard deviation or the variance that's observed in the data that's giving us a measure of dispersion from the mean. So from the average value, you might also look at the range or the interquartile range to get a sense for how much variation is there in your data. So these are the formulas that we can both the mathematical expression for calculating the average as well as how we can derive that using Excel the command equals the average and then you would enter the cell range for where your data resides. What Excel is going to do? It's going to add up all of those values in that range, divide by the number of observations to come up with the average. The median if you wanted to calculate that equals median and then the highlighted data range. One thing to note and we'll take a look at this an example to customer value in the second is that the median is not going to be sensitive to the extreme values. So if you've got a lot of variation in the value of your customer some are really high, but most customers are really low. When you calculate the average that's going to be highly sensitive to those extreme observations. The median is not going to be sensitive because it doesn't care what the exact values are. It just cares about finding the point where 50 percent are above 50 percent are below that value. Doesn't matter how far above or how far below all that it's concerned with is the rank order. All right. So here's an example that's been where we've done some work looking at the customer base of new customers signing up for telecom service and then we calculate their lifetime value, techniques that we'll look at when we examine customer analytics with regards to studying, retention, and driving customer value, really the story that we get depends on the metrics that we're using. If we look at the average customer lifetime value based on this histogram turns out that it's around $1200. But if we look at the median customer lifetime value that is you know the point at which half of my customers are going to be worth less than this amount, half my customers are going to be more worth more than this amount. Turns out that it's only $800. That difference between the median and the mean indicative of having a skewed relationship. In this case, the data is going to be right skewed if you look at our distribution as we go from left to right from low value to high value, we see fewer and fewer observations. The data seems to fall off. So that's indicative of that right skewed relationship. So which customers would we want to target? Well, if if there was a way for us to identify the customers who have that high lifetime value it'd be great for us to focus our efforts on these customers but we need some way of identifying those individuals. Maybe it's on the basis of demographic variables, maybe it happens to be the region that they're in, maybe it's the method of acquisition so if we have additional variables that tell us something about the individuals who have that high lifetime value then we can make an informed decision of do we want to focus on those individuals where they are very valuable but there aren't that many of them? Or do we want to focus on customers who may not be as valuable to us, but maybe it's easier for us to acquire them and there are a lot more of those customers. So that's what we're going to look at when we talk about the customer analytics. How hard is it to acquire these different customer groups? What's the difference in value going to be and what are the levers that we can pull to try to attract them? All right. Looking at dispersion, common measure is going to be the variance. So I've given you here the mathematical derivation for the variance as well as the Excel notation equals var. And then highlighting the data range and you'll see that in the derivation for the variance, what we're really focused on is the key here is this piece. It's the observation minus the average. Now the reason that it's squared is we really don't care how whether it's a positive or negative deviation from the average or from the mean. So what we're concerned about is the magnitude of that deviation. So the bigger the more dispersion there is around the mean the higher the variance is going to be. Variance can also be reported in terms of the standard deviation. So stdev which is really just the square root of the variance. Other formulas that you might use to cut to understand how much dispersion is in your data we can calculate the range. So what's the largest value minus the smallest value. So here we can use the max and min commands to calculate just the size of that range. You might also look at what's referred to as the IQR where the interquartile range. So this is the range that contains the middle 50 percent of the data. There's not a built in command to say give me the IQR so we can do is use the percentile command. And so what we're doing in this equation is say let's extract from the data what value is that the 75th percentile and we'll subtract from that the value that's at the 25th percentile that difference giving us the interquartile range. So these are methods for summarizing data both in terms of the average, the median, the central tendency and in terms of dispersion. One thing that we always have to contend with are outliers in our data. Those extreme observations that can heavily influence our data. In some cases those outliers might be caused by coding errors. You know sometimes in data you'll see that the code 9 9 9 9 is used to refer to data that's missing. Well, we actually have to remove that before we conduct our analysis because otherwise it's going to skew things. So when we detect those outliers it may be a reason to dig in further and understand what's going on. It is possible that there are just mistakes but any assumptions that you're making about how you deal with those outliers those are conclusions that should be noted in any report that you're producing. Another aspect of quantitative data might be time series or a temporal dimension. So these are time series plots that were produced using television tuning data. So data collected from set top boxes for two television programs in the- I believe early 2000s when we're collecting the data America's Next Top Model and Pussycat Dolls Presents Girlicious, the dips in the audience level or the dips in the ratings caused by commercial breaks so the audience staying tuned to the program, commercial break comes on, people tune away, come back when the program resumes. Well that's a characteristic of this type of data. Audience levels decline during commercial breaks. So we were in any type of analysis, that's something that we are going to need to control for. Another popular time series data is stock performance data. So we can see how do stocks or do indices tend to move together or are they tending to move apart. Well getting to looking at time series data and building forecasting models in a little bit. But just to foreshadow the direction we're going in, one component that we're going to look for is the trend in the data. So for new product that's launched as time goes on sales may ramp up. Something else that we might also look at is a cyclical pattern. Is it related to seasonality? Is there a pattern in here related to the general state of the economy? These are factors that we need to do that we need to know about if they're influencing in this case sales because if we're going to be making forecasts going forward those are variables that have to be included in the predictions that we're making. All right. We've looked already at how do we examine relationships that may exist among categorical variables. Well we want to do the same thing when we look at relationships that exist among quantitative variables. A scatterplot is the common visual way of doing that. So in this particular case, comparing the daily returns for the S&P 500 with the Verizon stock. So on the x axis it's the daily return for the S&P, on the y axis the daily return for a Verizon. And you'll see from the trend line that's been layered on here that there is generally a positive relationship. On days when the S&P went up, Verizon also tended to go up. So we can look at the direction of the relationship. In this case, we fit a linear curve a linear trend to that may not always be linear. That's one thing that we're going to keep in mind. We're also going to look at how much dispersion there is. There seems to be a lot of dispersion think of around this line. So from this point on all the way down to this point, a lot of dispersion that we're seeing in that scatterplot means that if I wanted to know what the Verizon price was going to be on a given day, well I can't just use the S&P because it seems like that's not explaining everything. When we get into talking about linear regression that's one of the pieces that that's going to be key for us. It's trying to reduce the amount of uncertainty that we have. The more information we have the better fit we should see. If we're looking at trying to quantify the relationship between two variables, a common technique that we're going to use is to look at the correlation. What the correlation does is assess the linear relationship and that's going to be key because not all relationships are going to be linear, as I'll show you in a second, sometimes that correlation is going to be indicative of the nature of the relationship. Other times, it might not be that informative. So we really have to conduct exploratory analysis to understand the patterns that exist between quantitative variables. Within Excel, equals correl, that's going to be the command that we use to calculate correlation, highlight the x range in the data, then we highlight the y range in the data. Those are going to be separated by a comma and all that we're going to get as output is a number between negative 1 and 1. So negative 1 indicating a perfect negative linear relationship, positive one indicating a perfect positive linear relationship. Mathematically and this is something that I hope you never have to calculate by hand but this is the mathematical derivation behind the correlation coefficient and the piece that I want to just draw your attention to just to give you some intuition for what this number really means is going to be this set of terms. So what we're doing is for each observation we're calculating the deviation from the average x value and we're calculating the deviation from the average y value and multiplying them together adding that piece up. Well, what does the positive correlation mean? Well a positive correlation means that when the x value is above average as we see which will be given by this first term the y value also tends to be above average. Both of them are positive. When the value is below the average, it's going to be negative. When the y value is below its average value that's going to be negative. The product of the two giving us a positive relationship. If we have a negative relationship when X is above its average value, Y is below its average value and vice versa. So that's that's where the correlation concept is coming from. So one you know we can look at this in terms of the financial performance so I pulled information. This is data from 2010 on the stock performance of some telecom companies. And if we look at Verizon, Comcast, and AT&T positively correlated with each other, also positively correlated with the S&P 500. So probably no surprise that these finance these stocks and the S&P 500 generally move in the same direction. Right. So this is intended to give you a sense for the extent of dispersion that you might see depending on the level of correlation, the middle plot that has an r value of 0.9, you know, very high correlation there, very tight cluster looks like pretty much a straight line whereas when we have a lower correlation coefficient the 0.4 on the right or the or the -0.7, you see there's more dispersion there, so there still is a linear relationship. If I were to superimpose a trend line, we would capture a negative relationship. But the fact that there is a lower correlation coefficient time is that there is going to be more dispersion there. So these are all examples when the data actually suggests a linear relationship. But as I said earlier it's not always going to be a linear relationship and that's where we need to be careful. So one example of how the correlation can be misleading is that we have outliers in our data. We've got a massive points in this case near the origin, near the lower left and then we've got a single outlier. Well if we try to draw best fitting line because that outlier is going to influence the trajectory of that line, we're actually going to be told that you know what there is a strong linear relationship here, but in reality, it's that one outlier that's doing it. If we were to discard that and just look at that massive points, we would not get nearly as strong a linear relationship. The other piece that we have to be careful about is when we're dealing with non-linearities. So in this case we have if we look at the lower right we have a correlation of zero. Well that doesn't mean that there's no relationship between x and y, what it means is that there is not a linear relationship between x and y. You know this relationship is a negative quadratic relationship. So, y is related to x squared. Well, it's a very strong quadratic relationship but it's not a linear relationship. That's why we're getting the correlation of zero. So doing this visual inspection using scatter plots, using histograms a very important part of the data analysis that we need to do. If we were to just look at the correlation coefficient, we see that there's no linear relationship, we might erroneously conclude that there's no relationship at all. All right. So to recap the techniques that we've looked at we've talked about techniques for examining individual quantitative variables. We've also looked at techniques for examining the relationships among the quantitative variables and correlation and scatterplots good for looking at the relationship among these two quantitative variables. We'll come back to examining these relationships more formally when we start getting into prediction and we start using linear regression as one of our go to tools in that forecasting.