All right, so there's a data file that's posted on the course website. It's an Excel worksheet with data pulled from movies. If we think about the motion picture industry, it's a pretty big industry as far as how many workers are employed, how much it adds to the economy, how much it adds to the tax base. There's also a lot of interest there from a predictive standpoint because the better guess we can get as far as which movies are going to be successful. Well, the better position the studios are going to be in as far as allocating their budgets. Virtual stock markets, that's one way that we can try to predict how popular different movies are going to be. So, this is just a summary of the data that's up on the course website. We've got a sample of movies that were produced during the 2001 through 2005 years, with a lot of information available about those movies. Now, if we look at the information that's available, features such as what genre is it, which studio produced it, what's the movie rating? Is it based on an adaptation of a graphic novel or a novel? Is it based in some other media? Some of these are yes/no answers. Other variables might have multiple options available, not just the two options. These are all categorical outcomes. That's the common thread here. We've also got a block of financial measures. So, how much revenue was brought in? What was the production budget? What was the marketing budget? These are all quantitative variables. So, the nature of the variable, that's going to inform the type of analysis that we can apply to it. All right. So some of the ways that we might start looking at the categorical data, and we'll go through some of these. I'll demonstrate them, I would encourage you to spend some time working with the Excel file to make sure that you're comfortable, not only generating these different reports, but also understanding the trade-offs associated with them. Frequency tables are going to report numerical values to us as our contingency tables, or cross-tabs. We might look at pie charts, bar charts, column charts as ways of visualizing some of this output. All right, so if we wanted to put together a frequency table. And I'll jump over to the Excel file, so that we can see what we're working with in a second. This just gives you a snapshot of the number of movies being produced by each studio in a particular year, that's in the middle column. So, if we look at, of these 199 movies, we can see how they're distributed across the different studios. Notice that they're ranked in descending order. So, Universal had the most movies out in this year. Followed by 20th Century Fox, Warner Brothers, and so forth. As we go further down this list, the other category, lumping all of those smaller studios that had less than five films in that year produced, makes up a total of 44 of the 199 films. And then the column on the side is putting that as a percentage basis. So they're saying Universal produced 9.05% of the films, 20th Century Fox produced 8.54%. Descending percentage until we get to that final row where we've lumped the others together. That percentage adds up to 100%. Now the way that we've reported it here, we're reporting for each studio what percentage of films were produced. You might also want to produce a cumulative column. So Universal produced 9% of the films, 20th Century Fox produced 8.5%. So we might want to say, all right, well, the two studios that produced the largest films or the most films combined how much did they produce. It would be running sums, so lets start off with on the 9.05% for 20th Century Fox and larger studios, we add in the 8.54% for Warner Brothers, add in another 8.54% as we go further down that column, adding in the films produced by the smaller studios. We get closer and closer to 100%, and that's what this cumulative distribution would show us. So, on the X-axis or the horizontal axis, one corresponds to the studio that produce the largest films, 37 corresponds to the studio producing the fewest number of films. And as we include more and more studios as I move from left to right on this graph, it accounts for a larger share of the films that have been produced. Another way of looking at this data, numbers tend to be a bit sterile. We might want to put that into the bar chart. And so we can see how many of these, how many films were produced by each studio. So that frequency table, that can be reformatted and put into the bar chart. And this is just focusing on a subset of those studios, just those that had at least five films. Maybe easier in terms of delivering reports rather than including a massive table to have charts similar to this one. Another way that we might represent the distribution of films across the studios would be with a pie chart. And I have a little bit of a love/hate relationship with pie charts and you start to see why in this case. We've got a lot of studios that make up a very small slice. Well, think of making a more and more narrow slice of the pie. Try splitting that other category into the individual studios. Fitting these data labels onto this chart is going to become very difficult. So in this case, we've included the name of the studio on the pie chart. We've included the percentage of the films that those studios are producing and we can still see it on this chart. As you add more and more studios, as you have a categorical variable with more and more values. These pie charts may become less useful because you can't visualize all of the possible options. Ways around that might be to lump sum the options together. So that's what we've done in this case with the 22% falling into that other category. All right. So just a couple of words of caution with these charts, and we've talked about categorical variables in the sense that a movie falls under a studio. A movie falls only under one of the studios in our data set. Well, when you're making bar charts, when you're making pie charts, that's a requirement in the data that each observation can only fall into one of those categories and all of your options are going to have to add up to 100%. One of the other things to be careful of is, we focused just on studio. What if I wanted to look at studio by rating? So let's look at the movies that are PG-rated movies from the different studios. And I want to draw some comparisons between the PG and the PG-13 movies. Well, a single pie chart is not necessarily the best way to go about doing that. I might have to do side-by-side pie charts or side-by-side bar chart to make those comparisons. All right, but these are just ways of summarizing the categorical data that's available to us and visualizing that, and very helpful from a recording perspective. Bar charts, pie charts, the frequency tables, it's ways of summarizing a single categorical variable. But what about when we want to see the relationships that exist among two categorical variables or even more general than that? What if they're three or more categorical variables? Well, one of the popular tools that we can use to do that is a contingency table or a cross-tab table. So using the data that we have available to us, we can put together these tables. Say, we wanted to look at the studio and the genre of movies that were in there. Perhaps we want to see how many movies of a particular genre are made by different studios. Maybe we want to look at the relationship between studio and ratings. So this is one way of looking at this data. This is the raw count of the data. You'll notice, going across the rows, we have these studios. If we look down the columns, this is looking at the movie rating. So we have G, PG, PG-13 and R, and these are the counts of how many movies of each rating were made by these studios. This is the cross-tabs, so we're trying to look at the relationship that exists in terms of studio and ratings. Couple of different pieces of this table I want to call out for you. Notice, we see if we were to add up all of the entries in this table. And so we get to this 351. And again, that saying, all of these entries in the bulk of this table. If we were to add those up, 351 movies were released by these two studios in total. Of those 351 movies, 23 of those movies were rated G. All right, well, that's coming from adding up that first column, so 23 movies were rated G and they only came from two studios, Buena Vista and Warner Brothers. The 23 out of 351, think of this bottom row here, the grand total row that we're looking at. Think of this if we're looking at that last row, it's the margin of the table. It's the marginal distribution for us. And so, what fraction of movies were rated G? 23 out of 351. What fraction of movies were rated R? 89 out of the 351. If we wanted to look at how many of these movies came from a particular studio with a particular rating? That's when we're going to jump into the individual cells. So Buena Vista movies rated G were 20 out of the 351. If we look at Buena Vista overall, I could add up this entire row. And that's going to tell me that they produced, that studio produced 87 movies out of the 351. So this tool, good for looking at two different variables, in this case we're looking at just those counts. This is produced entirely in Excel, it's using the pivot table feature, very convenient as far as organizing data and providing quick summaries. Same data that we were looking at previously, but in this case, I've just reformatted that data. So instead of saying, let's count up the number of movies, in this case, focuses on the fraction of the total. So you'll recall from the previous slide, there were 351 movies. Well, now we're looking at 100% of those movies. So divide each entry by 351 and we can see what the percentages are. So, movies rated G made up just about 6.55% of movies released by these four studios. Whereas movies released by Universal made up just shy of 22% of movies released by these four studios. All right, I'm going to jump over to Excel to show you how easy it is for us to produce reports like this using the data that we have available.