Data science is certainly a whole body of practice unto itself. I thought I would give you a little bit of intuition about broadly how this process works in practice and what an actual development environment for data scientists looks like. We're going to take a look at a case called Casino Jack. We're going to look at it in the context of a Jupyter notebook, which is a really common exploratory development environment for data scientists. The subject of this case study, which was prepared by some of my colleagues here at Darden, is ATO Pictures, an independent film company. They're releasing this movie, Casino Jack, and the big question is, how much money should they put into promoting this movie in week two? That is particularly important for these folks because they in fact take a more adaptive, more agile approach to promoting their movies. What they generally do is, in week one, they look at what happens: they look at the critical reception and the audience reception of these movies, and then they ask themselves, how much money should we put into promoting this movie in week two and beyond? They try to make better decisions on that basis, because what they're really looking to have happen is that there's a lot of organic, viral, word-of-mouth promotion of these movies, and they're just kind of amplifying that. This is in contrast to a big-budget movie like one of the Marvel Avengers movies, a typical big-budget blockbuster where there's probably more planning in advance and certainly more money spent on promoting the movie. What's the deal here? What's the intervention? This is really the agile analytics question that we should always start with: what do we want to accomplish? We think we can do some analytics to get to a better outcome. What exactly is the intervention? What do we want to be able to do?
Really what's happening here is that at the end of week one, we would be on a team that has people acting in these roles: an analyst, a media buyer or an ad person, and maybe a producer or product manager for this movie. We're going to have to make a decision here at the end of week one about how much money we're going to spend in week two promoting this movie. That's probably a simplification of how this actually works, but generally that's the basic idea here. We have a data set of comparable independent films, how they did, and how much money was spent on advertising. On that basis, we're going to create a model and try to answer this question of how much money we should put into a given movie, in this case Casino Jack, going into week two. So if we frame this in terms of what types of variables we're looking at, what is the dependent variable, and what are the independent variables that are driving the action of this dependent variable? The dependent variable is basically going to be box office revenue: how much were ticket sales in these subsequent weeks? The independent variables that we're going to look at are things like ad spend. What's the relationship between spending money on ads and box office revenue? Then also secondary attributes of the movie itself, like its rating and its critical reception, which we're going to look at on Rotten Tomatoes, and the production budget. There are a few other things, but these are the big ones. So what we're going to do is say, "Can we predict box office revenue based on these things? Because that's a basis we could use to make a better decision about the optimal amount of money to put into promoting this movie." Let's take a minute and look at this data set in a simple spreadsheet for starters. This is our data set, which I've loaded into Google Sheets, which is basically the Google Docs version of Excel. So if we look at this data set, it's two-dimensional.
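To make the dependent and independent variable idea a bit more concrete, here is a minimal sketch of predicting box office revenue from ad spend with a one-variable least-squares line. The numbers are made up purely for illustration and are not from the case data set, and the names `ad_spend`, `box_office`, `fit_line`, and `predict` are hypothetical. The actual analysis would use more independent variables and real, noisier data.

```python
# Toy, invented numbers for illustration only; NOT the case study data.
# Independent variable: ad spend (in thousands of dollars).
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
# Dependent variable: box office revenue (in thousands of dollars).
box_office = [25.0, 45.0, 65.0, 85.0, 105.0]


def fit_line(x, y):
    """Ordinary least squares for one independent variable.

    Returns (slope, intercept) of the line y = intercept + slope * x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ad_spend, box_office)


def predict(spend):
    """Predicted box office revenue for a given ad spend."""
    return intercept + slope * spend


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted box office at ad spend 35: {predict(35.0):.2f}")
```

The slope is the piece that speaks to the decision: under this simple model, it estimates how much additional box office revenue goes with each additional unit of ad spend.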
So we've got these indices, and then there are these various independent films that our analysts in this case study have put together to be able to look at comparable performance based on these various things. So RT critic stands for Rotten Tomatoes critic score, and RT user is the Rotten Tomatoes user score. IMDb is another site which, I think, aggregates ratings of movies. Then all of these things here are ad spending numbers. So there's ad spending on TV in week zero, the week preceding the release, and on print; then TV and print in week one; and the number of screens that the movie is on in week one. I don't think I mentioned that as an independent variable, but it's an important one just because that's our channel: how many movie theaters, or how many screens, is this movie shown on? Then BO doesn't stand for bad odor, it stands for box office, and this is our revenue number. Most of the rest of this data set is the same things through the subsequent weeks. Then we have a few additional attributes like the distributor, the genre of the movie, how long it is, the rating, and the production budget. So that's our data set, and in the subsequent videos, what we're going to do is look at this data in this development environment called Jupyter Notebook. It's not the only development environment that data scientists use, but it's extremely popular, and it's specifically designed around wanting to be able to look at a data set and run little blocks of code with a lot of observation built in, where the output of running the code could be some text like you see here, or it could be a table like this, where they do neat things like make it a little bit interactive so it's easier to look at the rows, or even these graphs and visualizations, which we can invoke from the Jupyter notebook. So this is a development environment that is specifically created for doing this kind of analytics.
This is what we're going to go through, and what I'll do just to close this out is orient you, because this probably looks like a lot of new stuff. Basically, this little block of code here is just saying we've loaded the spreadsheet that you saw over here into a thing called a DataFrame, which is just a type of object in this environment that can hold data, and we're looking at the head: we're invoking this thing that just says take the first five rows and show us those. So really what you're seeing here is the same data that's over here; we're just seeing it in this different development environment, the Jupyter notebook. In the subsequent videos, we're going to look at the data wrangling. So how did we get this data in place? How do we make sure we understand it? How do we make sure it's set up in a way where we can run the analytics and models we want on it? How do we then zoom out and explore the data a little bit to form and focus our hypotheses, things that we might want to put into a model? How do we run models, statistical predictions, on this stuff? Then finally, how do we design this intervention? At the end of week one, our analyst and media buyer team wants to make the best possible decision about how much money to spend on promoting this movie. How do we focus all these analytics into helping them make that decision? That's what we'll look at in the subsequent videos. If this all looks a little bit overwhelming, the idea here is just to give you a little bit of intuition about how this process generally works, not for you to understand all this stuff one by one. That is obviously way, way too much for what we're trying to do here. So don't feel overwhelmed; just try to follow along, and we'll do a big overview of this example of pairing an analytical question with some analysis and designing an intervention.
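For orientation, the DataFrame and head step described above can be sketched roughly like this with the pandas library. This is only a sketch under assumptions: the inline text stands in for the real spreadsheet, and the column names (`title`, `rt_critic`, `rt_user`, `tv_ads_wk0`, `screens_wk1`, `bo_wk1`), the film names, and every value are invented for illustration and are not the case data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the spreadsheet. Column names and all
# values are invented for illustration; they are NOT the case data.
csv_text = """title,rt_critic,rt_user,tv_ads_wk0,screens_wk1,bo_wk1
Film A,72,68,120,45,310
Film B,55,61,80,30,150
Film C,88,79,200,60,520
Film D,40,52,50,22,90
Film E,67,70,110,41,280
Film F,91,85,150,55,610
"""

# Load the tabular data into a DataFrame, a two-dimensional object
# with labeled columns and an index for the rows.
films = pd.read_csv(io.StringIO(csv_text))

# head() with no argument returns the first five rows.
first_rows = films.head()
print(first_rows)
```

In a Jupyter notebook, leaving `films.head()` as the last expression in a cell renders the same five rows as a formatted table, with no need for `print`.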