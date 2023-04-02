Having discussed numerical data, in this video we will talk about categorical data. Categorical data is distinctly different from numerical data in that it does not take numerical values; instead, each data point takes a value from a typically finite set of categories. For example, if I'm running a movie streaming service and I want to identify the type of movie a person is watching, I might categorize movies into horror, thriller, romance, drama, etc. Similarly, if I am selling apparel, I might organize my clothes as small, medium, large, or extra large. It is still possible to measure the size of the clothes in centimeters, in which case it would become numerical data. However, it's generally convenient to organize them into distinct categories. Similarly, many retail outlets categorize their customers by age group. While age itself is numerical data, people of similar ages are lumped together; for example, many retail stores categorize customers as kids, young adults, seniors, etc. The fundamental difference between numerical data and categorical data is that we lose the ability to take averages, standard deviations, etc. What is the average of a horror movie and a thriller movie? Questions like this cannot be answered in a natural fashion. That means categorical data has to be treated in a completely different manner, using a different set of tools. While each row of data typically corresponds to one category, a data point can also belong to multiple categories at once. For example, in the context of films, it is possible to categorize a film as both a thriller and a horror movie at the same time. That is a sensible categorization.
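As a rough sketch of two ideas from above, here is how a numerical age could be lumped into an age group, and how a film could carry more than one genre at once. The function name `age_group`, the cutoff ages, the extra "adult" group, and the film title are all assumptions made up for illustration; the video does not specify them.

```python
def age_group(age: int) -> str:
    """Map a numerical age to a category label.

    The cutoffs below are illustrative assumptions, not values from the video.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 13:
        return "kid"
    if age < 30:
        return "young adult"
    if age < 60:
        return "adult"
    return "senior"


# Numerical ages become categories; averaging the labels is no longer meaningful.
ages = [8, 24, 41, 67]
groups = [age_group(a) for a in ages]
print(groups)

# A film may belong to several genres at once, so a set of labels is a natural fit.
# The title is a hypothetical placeholder.
film_genres = {"Hypothetical Film": {"thriller", "horror"}}
print("horror" in film_genres["Hypothetical Film"])
```

Once the ages are replaced by labels, the exact numbers are lost, which is the trade-off for the convenience of working with a few distinct groups.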
In some cases, however, each data point can have only one category. For example, if you talk about the country where a certain store is located, it's not possible for a store to be simultaneously located in India and in the US. So if the category is the country of geographical location, there is a unique category associated with each data point. Let us look at some of the calculations and inferences we can make when we have categorical data in hand. Consider this dataset, which records how many hours different users have spent viewing movies from the drama genre, the thriller genre, etc., over a period of time. We can meaningfully take the average of these numbers. That completely ignores the genre each person is watching, but it tells me the average time these people spend watching movies, which turns out to be 14.4 hours. We can also do some summarization. Even without the numerical values, we can count the number of people who watched horror, the number who watched thriller, the number who watched romance, and the number who watched drama. For example, I can count the number of instances I have with drama as the category, or with horror as the category. Categorization helps us identify frequencies. Frequencies are an important piece of information when you are looking at categorical data. For example, in the context of store locations, one might want to know the number of large sales made at a store located in Mumbai versus a store located in New York City, where the city in which the store is located is the category. But that is not all. When you have numerical data along with categorical data, it often makes sense to look at the average of the numerical data within each category. For example, let us now look at the data points corresponding only to thriller.
When we say thriller, we have two data points. There is a user who has spent 17 hours watching thriller movies, and another user who has spent about 14 hours watching thriller movies. These correspond to the two points marked with a yellow horizontal line in this plot. What if we say drama? We can again look at a new dataset corresponding only to drama viewership, and we can compute the average of these numbers. Instead of doing a raw frequency count, we can calculate the average time that users who watched horror spent watching horror movies. We use the Excel formula AVERAGEIF. We want the genre to match horror, and we want the average of the corresponding set of numbers. Doing this for each genre gives the average time users spent watching each genre of movie. We can clearly see from this dataset that people spent much more time watching horror movies than drama. This could be useful information. Maybe the streaming service wants to advertise drama more, or they might want to procure more horror content because the users seem to be more interested in this genre. Once you have categorical data, it immediately lets you make more sense of, and draw more inferences from, your numerical data. Categories of this type, where there is no useful order among the categories, are called nominal data. For example, if I reorder these as thriller, drama, horror, and romance, I am neither gaining nor losing any information. Such data is called nominal data.
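The frequency counts and per-genre averages described above can also be sketched outside Excel. The following is a minimal Python version under stated assumptions: the `records` list is made-up sample data (only the two thriller entries of 17 and 14 hours come from the video), and `average_if` is a hypothetical helper that mimics the idea of AVERAGEIF, not Excel's actual implementation.

```python
from collections import Counter

# (genre, hours watched) pairs. Illustrative data only: apart from the two
# thriller entries (17 and 14 hours), these values are assumptions.
records = [
    ("drama", 8),
    ("thriller", 17),
    ("horror", 20),
    ("romance", 12),
    ("thriller", 14),
    ("horror", 22),
    ("drama", 10),
]

# Frequency count: how many rows fall into each category.
counts = Counter(genre for genre, _ in records)


def average_if(rows, genre):
    """Average the hours of the rows whose category matches `genre`."""
    hours = [h for g, h in rows if g == genre]
    if not hours:
        raise ValueError(f"no rows with genre {genre!r}")
    return sum(hours) / len(hours)


# Average viewing time within each category.
averages = {genre: average_if(records, genre) for genre in counts}

print(counts)
print(averages)
```

Because the genres have no inherent order, listing them in a different order changes neither the counts nor the averages.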