0:05

So there's a data file that's posted on the course website,

Â it's an Excel worksheet with data pulled from movies.

Â If we think about the motion picture industry

Â it's a pretty big industry as far as how many workers are employed,

Â how much it adds to the economy, how much it adds to the tax base.

Â There's also, a lot of interest there from a predictive standpoint because

Â the better guest we can get as far as which movies are going to be successful.

Â 0:45

Virtual stock markets, that's one way that those

Â we can try to predict how popular different movies are going to be.

Â So this is just a summary of the data that's up on the course website.

Â We got a sample of movies that produced during the 2001 to 2005 years

Â with a lot of information available about those movies.

Â Now if we look at the information that's available,

Â features such as what genre is it?

Â Which studio produced it?

Â What's the movie rating?

Â Is it based on an adaptation of a graphic novel or novel?

Â Is it based in some other medium?

Â Some of these are yes/no answers.

Â Other variables might have multiple options available,

Â not just the two options.

Â These are all categorical outcomes.

Â That's the common thread here.

Â 1:37

We've also got a block of financial measures.

Â So, how much revenue was brought in?

Â What was the production budget?

Â What was the marketing budget?

Â These are all quantitative variables.

Â So, the nature of the variable, that's going to

Â inform the type of analysis that we can apply to it.

Â So some of the ways that we might start looking at the categorical data and

Â we'll go through some of these, I'll demonstrate them.

Â I would encourage you to spend some time working with the Excel file to

Â make sure that you're comfortable not only generating these different reports but

Â also understanding the trade-offs associated with them.

Â Frequency tables are going to report numerical values to us

Â as our contingency tables or cross tabs.

Â We might look at pie charts, bar charts,

Â column charts as ways of visualizing some of this output.

Â 2:31

So if we wanted to put together a frequency table, and I'll jump over to

Â the Excel file so that we can see what we're working with in a second.

Â This just gives you a snapshot of the number of movies being produced by each

Â studio in the particular year, that's in the middle column.

Â So if we look at all of these 199 movies we can see how they're distributed

Â across the different studios, notice that they are ranked in descending order so

Â Universal had most movies out in this year.

Â Followed by 20th Century Fox, Warner Brothers, and so fourth,

Â as we go further down this list the other category lumping

Â all of those smaller studios that had less than 5 films

Â in that year produce makes up a total of 44 of the 199 films.

Â And then the column on the side is putting that as a percentage basis.

Â So this is saying, Universal produced 9.05% of the films.

Â 20th Century Fox produced 8.54%.

Â Descending percentage until we get to that final row

Â where we've lumped the others together.

Â That percentage adds up to 100%.

Â Now the way that we've reported it here, we're reporting for each studio,

Â what percentage of films were produced,

Â you might also want to produce a cumulative column.

Â So, Universal produce 9% of the films, 20th Century Fox produce 8.5%,

Â so we might want to say all right, well the two studios that produce

Â the largest films or the most films combined, how much did they produce?

Â It would be a running sum.

Â So I'd start off with the 9.05% for 20th Century Fox and

Â larger studios we'd add in the 8.54, for Warner Brothers, add in another 8.45.

Â As we go further down that column, adding in the films produced by

Â the smaller studios, we get closer and closer to 100%.

Â And that's what this cumulative distribution would show us.

Â So on the x axis or horizontal axis,

Â 1 corresponds to the studio that produced the largest films,

Â 37 corresponds to the studio producing the fewest number of films.

Â And as we include more and more studios as I move from left to right on this graph.

Â It accounts for a larger share of the films that have been produced.

Â 4:56

Another way of looking at this data,

Â numbers tend to be a bit sterile we might want to put that into the bar chart and

Â so we can see how many films were produced by each studio.

Â So that frequency table, that can be reformatted and put into the bar chart.

Â And this is just focusing on a subset of those studios,

Â just those that had at least five films.

Â Maybe easier in delivering reports

Â rather than including a massive table to have charts similar to this one.

Â Another way that we might represent the distribution

Â of films across the studios would be with a pie chart.

Â And I have a little bit of a love-hate relationship with pie charts, and

Â you start to see why in this case.

Â We've got a lot of studios that make up a very small slice.

Â Well, think of making a more and more narrow slice of the pie.

Â Try splitting that other category into the individual studios

Â fitting these data labels onto this chart is going to become very difficult.

Â So in this case we've included the name of the studio on the pie chart.

Â We've included the percentage of the films that those studios are producing.

Â And we can still see it on this chart.

Â As you add more and more studios, as you have a categorical variable with more and

Â more values these pie charts may become less useful because you cannot visualize

Â all of the possible options.

Â Ways around that might be to lump some of the options together.

Â So that's what we've done in this case with

Â the 22% falling in to that other category.

Â 6:39

So just a couple of words of caution with these charts and we've talked about

Â categorical variables in the sense that a movie falls under a studio.

Â A movie falls only under one of the studios in our dataset.

Â Well, when you're making bar charts,

Â when you're making pie charts, that's a requirement in the data.

Â That each observation can only fall into one of those categories and

Â all of your options are going to have to add up to 100%.

Â One of the other few things to be careful of is we focus just on studio.

Â What if I wanted to look at studio by rating?

Â So let's look at the movies that are PG rated movies from the different studios,

Â and I want to draw some comparisons between the PG and the PG-13 movies.

Â Well, a single pie chart is not necessarily the best way to go

Â about doing that.

Â I might have to do side by side pie charts or

Â side by side bar charts to make those comparisons.

Â But these are just ways of summarizing the categorical data that's available to us

Â and visualizing that, and very helpful from a recording perspective.

Â Bar charts, pie charts, the frequency tables.

Â 7:55

It's ways of summarizing a single categorical variable.

Â But what about when we want to see the relationships that exist among

Â two categorical variables, or even more general than that,

Â what if there are three or more categorical variables?

Â Well, one of the popular tools that we can use to do that is a contingency table or

Â cross tab table.

Â So using the data that we have available to us we can put together these tables

Â say we wanted to look at the studio and the genre of movies that are in there.

Â Perhaps we want to see how many movies of a particular genre are made by different

Â studios maybe we want to look at the relationship between studio and ratings.

Â 8:41

So this is one way of looking at this data, this is the raw count of the data.

Â You'll notice going across the rows we have the studios,

Â if we look down the columns, this is looking at the movie rating.

Â So we have G, PG, PG-13, and R, and

Â these are the counts of how many movies of each rating were made by these studios.

Â This is that cross tab.

Â So we're trying to look at the relationship that exists

Â in terms of studio and ratings.

Â A couple of different pieces of this table I want to call out for you.

Â 9:31

If we were to add those up,

Â 351 movies were released by these two studios in total.

Â Of those 351 movies, 23 of those movies were rated G.

Â Well that's coming from adding up that first column.

Â So, 23 movies were rated G, and

Â they only came from two studios, Buena Vista and Warner Brothers.

Â 10:02

this bottom row here, the grand total row that we're looking at,

Â think of this, if we're looking at that last row, it's the margin of the table.

Â It's the marginal distribution for us.

Â So what fraction of movies were rated G?

Â 23 out of 351.

Â What fraction of movies were rated R?

Â 89 out of the 351.

Â If we wanted to look at how many of these movies came from a particular studio with

Â a particular rating, that's when we're going to jump into the individual cells.

Â So Buena Vista movies rated G were 20 out of the 351.

Â If we look at Buena Vista overall, I could add up this entire row and

Â that's going to tell me that studio produced 87 movies out of the 351.

Â So, this tool good for looking at two different variables.

Â In this case, we're looking at just those counts.

Â This is produced entirely in Excel.

Â It's using the pivot table feature.

Â Very convenient as far as organizing.

Â Data and providing quick summaries.

Â 11:17

Same data that we were looking at previously, but in this case,

Â I've just reformatted that data.

Â So instead of saying, let's count up the number of movies, in this case,

Â focus is on the fraction of the total.

Â So you'll recall from the previous slide there were 351 movies.

Â Well, now we're looking at 100% of those movies.

Â So divide each entry by 351, and we can see what the percentages are.

Â So movies rated G made up just about 6.5% of movies released by these four studios.

Â Whereas, movies released by Universal made up just shy

Â of 22% of movies released by these four studios.

Â I'm going to jump over to Excel to show you how easy it is for

Â us to produce reports like this using the data that we have available.

Â