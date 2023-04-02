Hello learners. In this video, we will discuss the importance of data visualization and some ideas on how it can be done effectively. In most cases, the way we come across a lot of data is by means of plots and graphs. They immediately give us a quick sense of how the data can be represented and how it is behaving. For example, even in this week's course content, you looked at certain data points plotted as a graph so that I was able to give you an immediate intuition of what the data is like. Now, can we ask the following question? Can summary statistics, like the mean, median, standard deviation, etc., completely replace the role of plots? We're going to see what the dangers of that are, or at least one clear danger.

Now, let's go to our Excel demonstration. Here, we have four different datasets. They are paired, in the sense that each one consists of x and y data. Let us look at how these datasets behave. To begin with, let us calculate the average, or the mean, in each dataset. The average value of the x's in all four datasets, as we observe here, is exactly equal to nine. What about the average of the y points? Well, the average of the y points is the same again in all four datasets.

We have said that the average, or the mean, is not everything. We should also consider the standard deviation, to look at how the data is spread around the mean, around the central tendency. If we look at the standard deviation of x in the first dataset, that turns out to be 3.32; in the second dataset, 3.32; again 3.32; and again 3.32. What about the standard deviations of all the y's? First dataset 2.03, then 2.03, 2.03, and again 2.03.

Well, another measure that people sometimes use to understand the connection between two different variables within a given dataset is correlation. Let's look at how high the correlation between x and y is.
If we look at the correlation between x and y in the first dataset, that's 0.82; in the second dataset, again 0.82; third, again 0.82; fourth, again 0.82. Which means, it looks to me, that these four datasets should be very similar or very close to each other. Let us see if plots agree with this intuition.

Let us plot. If you plot this, that looks like a neat enough curve for dataset 1. Let's make copies of that plot. Now let's continue plotting each of these four datasets. The second one clearly looks a little different. What about the third dataset? That looks even more different. What about the fourth dataset? That looks very, very different.

These all had the same mean for x, the same mean for y, the same standard deviation for all the x's, the same standard deviation for all the y's, and also the same correlation between x and y. However, a plot immediately tells you that the story in each of these four cases is very, very different. This is a famous set of datasets called Anscombe's quartet. They all have the same or very similar summary statistics; however, they are essentially very different types of data. This directly shows the importance of plotting and the importance of visually understanding data.
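For anyone who wants to repeat the check outside Excel, here is a minimal Python sketch of the same demonstration. It rests on a few assumptions: the transcript never lists the numbers in the spreadsheet, so the values below are the commonly published ones for Anscombe's quartet; the standard deviation is taken to be the sample standard deviation, which is what matches the 3.32 and 2.03 quoted in the video; and the names `QUARTET`, `pearson`, `summarize`, and `ascii_scatter` are made up for this illustration. The text scatter plot is only a rough stand-in for a real chart.

```python
from math import sqrt
from statistics import mean, stdev

# Commonly published values of Anscombe's quartet. The transcript does not
# list the spreadsheet's numbers, so these are an assumption.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """The five summary statistics from the demo, rounded to two decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "sd_x": round(stdev(xs), 2),  # sample standard deviation (n - 1)
        "sd_y": round(stdev(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }


def ascii_scatter(xs, ys, width=40, height=12):
    """Rough text scatter plot on fixed axes (x: 0..20, y: 0..14)."""
    x_lo, x_hi, y_lo, y_hi = 0, 20, 0, 14
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("|" + "".join(line) for line in grid)


for name, (xs, ys) in QUARTET.items():
    print(f"Dataset {name}: {summarize(xs, ys)}")
    print(ascii_scatter(xs, ys))
    print("+" + "-" * 40)
```

With these values, the printed summary is the same for all four datasets once rounded to two decimals, while the four text plots are visibly different. In practice you would draw proper scatter plots with a charting tool, just as we did in Excel.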