Now that we understand the basic idea of this case, I'm going to show you a little of the exploratory analysis we might do on the data. The idea is to look at the data we have, form some hypotheses, and do some preliminary testing of how parts of the data might relate to each other, so that we can decide what we want to dial in on and look at with our models.
Back here in the Jupyter Notebook, we're going to look at two main things. One is data wrangling, or prepping the data so that we can run the analytics we want on it. Two, we're going to zoom out, explore, and visualize different parts of the data. We do this early exploratory analysis to help us think about the kinds of hypotheses we want to test and how we're going to interpret the results.
Just looping through this from the top: there's a lot of explanation in here because this notebook is for instruction. You might not see all this explanation and nice documentation in other Jupyter Notebooks, for instance ones your collaborators are using, and that would be pretty normal. To give you a sense of what's going on, this first block of code pulls in a bunch of third-party packages that let us use pre-built functions to do what we want without having to code it ourselves. This is all in a language called Python, which you've probably heard of. It's a popular coding language that's very useful for general-purpose work. The other language you're likely to hear about, and there are certainly others, is R, another really popular language for this kind of work. This notebook is in Python, though. So here we're pulling in these packages, here we're reading in the CSV file version of this data, and then here we're just looking at it, making sure it makes sense.
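To make the "read it in and look at it" step concrete, here is a minimal sketch using only Python's standard library. It is not the notebook's own code: the column names and the three rows are made up for illustration, and an in-memory string stands in for the real CSV file.

```python
import csv
import io

# Made-up stand-in for the CSV file. The column names and values are
# hypothetical and are not taken from the course data.
raw = io.StringIO(
    "title,production_budget,screens_wk1\n"
    "Film A,1500000,12\n"
    "Film B,,40\n"
    "Film C,800000,7\n"
)

# Read every row into a dictionary keyed by column name.
rows = list(csv.DictReader(raw))

# The "shape" of the data: number of rows and number of columns.
n_rows = len(rows)
n_cols = len(rows[0])
print((n_rows, n_cols))

# Look at the first record to check that it makes sense.
print(rows[0])
```

With a real file you would pass an open file handle to `csv.DictReader` instead of the in-memory string. Notice that Film B's budget comes through as an empty string, which is the kind of missing value the wrangling step has to deal with.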
This is a property, the shape of the data, which just says it's got 341 rows, one for each of 341 movies, and 52 columns, all the different attributes of the movie that we talked about earlier. Down here, my colleagues have written a function that allows us to summarize the data and understand it. This is a good prep step for figuring out what kind of data wrangling we might need to do. So now the rows of this summary data frame that we prepped describe the different columns, or features, of the data. It tells us things like the data type, the missing values, and the unique values. It gives us a general sense of what we're looking at overall.
Missing or invalid values are a big thing. One thing this shows us down here is that if we want to look at Production Budget, and we do, there are some missing values, so we need to deal with that. The approach we're going to take is to average the production budgets of all the films that do have a stated production budget and then impute that average to the missing values. This code is basically wrangling that and making sure it worked.
The next thing is that our ad spending is broken out into, as you can see, two different buckets, print and TV, for each week, and for our purposes here we want to look at total ad spending. So we're going to create a variable that's just the total spending for each week, so we can act on that. This code goes through doing that, making sure it's okay, adding it to the data frame, and checking it. We also want to create a total ad spending attribute and a total box office attribute, which aren't in there now, to baseline this. This code is doing that. The last thing we're going to look at here is: what about films that don't have any ad spending at all?
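The two wrangling steps described so far, mean imputation of the production budget and summing print and TV spending per week, can be sketched like this. It's a rough illustration with made-up numbers and hypothetical column names, written with plain Python dictionaries rather than whatever data-frame code the notebook actually uses.

```python
from statistics import mean

# Made-up films. Column names are hypothetical; None marks a missing budget.
films = [
    {"title": "Film A", "production_budget": 2_000_000,
     "print_wk1": 10_000, "tv_wk1": 30_000, "print_wk2": 5_000, "tv_wk2": 0},
    {"title": "Film B", "production_budget": None,
     "print_wk1": 0, "tv_wk1": 0, "print_wk2": 0, "tv_wk2": 0},
    {"title": "Film C", "production_budget": 1_000_000,
     "print_wk1": 2_000, "tv_wk1": 8_000, "print_wk2": 1_000, "tv_wk2": 4_000},
]


def impute_mean(records, column):
    """Fill missing values in `column` with the mean of the stated values."""
    stated = [r[column] for r in records if r[column] is not None]
    fill = mean(stated)
    for r in records:
        if r[column] is None:
            r[column] = fill
    return fill


def add_ad_totals(records, weeks):
    """Add print + TV totals for each week, plus a total across all weeks."""
    for r in records:
        for w in weeks:
            r[f"ad_total_wk{w}"] = r[f"print_wk{w}"] + r[f"tv_wk{w}"]
        r["ad_total"] = sum(r[f"ad_total_wk{w}"] for w in weeks)


fill = impute_mean(films, "production_budget")
add_ad_totals(films, weeks=[1, 2])

# Check that it worked: no budget should be missing any more.
print(fill, [f["production_budget"] for f in films])
print([(f["title"], f["ad_total"]) for f in films])
```

Imputing the mean is only one way to handle missing values. It keeps every film in the data set, at the cost of treating the films with unknown budgets as if they were average.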
Are films with no ad spending comparable to Casino Jack in the way that we intend here? We're going to say no: there's something going on with those movies, where they didn't spend any money on advertising, and we want to at least have the ability to exclude them. So this adds a column, no ad spending, that we can use to exclude those films from our analysis if we want to. Now we summarize our data frame again and we can see all the new columns that we've added down here. Here are our ad columns. We see that production budget no longer has any missing values. We've got these total ad columns, and then we've got this Bool, which is a true/false value. So we've got this attribute now: is this one of the movies that has no ad spending?
Now what we're going to do is go through and do some exploratory analysis where we look at the data. One big-picture question we might ask is, how does this ad spending generally work? This just shows us that ad 0 is before the movie, before week one, and this is week one, week two, week three. So this is the general shape of how people have previously invested in promotion for these independent films.
This next section deals with transformations. This question is a little more nuanced and we're not going to spend as much time on it. The question is: should we do a transform of the data, like taking its log, to make it easier to look at and possibly more amenable to certain statistical models that we might apply to it? For instance, this is the log-log plot of the number of screens and the box office take. This makes it easier to see that, as we would intuitively expect, you can kind of fit a line to this. They do seem to be correlated. If you don't remember what a log is and you don't care, don't worry about it. Again, we're just looping through this at a really high level.
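For anyone who does want to see what "take the log and fit a line" means, here is a small sketch. The screens and box office numbers are made up, the natural log is used, and the line is fit with ordinary least squares written out by hand. The notebook may well do this differently.

```python
import math

# Made-up (number of screens, box office take) pairs for four films.
data = [(5, 20_000), (20, 90_000), (80, 400_000), (300, 1_500_000)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Log-transform both variables. This only works for positive values,
# so films with zero screens or zero box office would need handling first.
log_screens = [math.log(screens) for screens, _ in data]
log_box = [math.log(box) for _, box in data]

slope, intercept = fit_line(log_screens, log_box)
print(f"log(box office) = {slope:.2f} * log(screens) + {intercept:.2f}")
```

On a log-log scale the slope reads as a rough percentage relationship: a slope near 1 would mean box office grows roughly in proportion to the number of screens.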
If you're thinking, I can't remember the log from school, just Google it and check it out, but don't worry about it too much. We're here to learn about analytics in general and just to get a little bit of intuition about this process. Down here, we're just showing that, yes, we could fit a line to this. You may think, hey, this seems like a lot of code just to make this visualization, and that's true, it is a lot of code. A lot of the time, if you hear about data scientists using tools like Tableau or Looker, those are tools that automate a lot of what you might do in an exploratory analysis behind a visual overlay and let you do it in more of a drag-and-drop, plug-and-play kind of environment.
This next question is, should we include a lagged variable to help predict box office from one week to the next? In other words, is the box office in week one a good predictor of the box office in week two? Also, how might we look at that, and should we do this log transform? Again, in all of these it looks like we can fit a line through the points. They look pretty related, rather than, for instance, just spread out all over the plot.
The next thing we're looking at is whether there's a relationship between the effectiveness of ad spending and the critic score. Basically what we're saying is, all right, 60 percent or above is the Rotten Tomatoes threshold for a movie being fresh, or relatively good, and here we're looking at whether ad spending performs differently depending on whether or not you have a good critical reception. That might be an important thing, but does that make sense? How would we act on that?
Well, yes, acting on the critic score does make sense, because we're going to assume, and in fact it is generally the case, that we're going to have our critics' rating in week one. That's something we're going to know. It's part of our baseline state of information going into week two to help us make our decision. And indeed it does look like there is a difference. I believe it's the red ones that are the fresh, or well rated, ones, and we do see a difference in how advertising performs.
So those are some things we can observe that help us decide what variables we might want to include, how we might want to wrangle them, and how we might want to interpret the results as we apply a model to predict, on the basis of these different things, the box office performance in week two. That's what we're going to look at in the next video: first, running these models on the data and seeing how well they predict the box office in week two based on all this. Then secondly and finally, how do we design an intervention? How do we make this decision about optimal ad spending on the basis of what we can predict and observe from this data?
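As a recap of the last two ideas, the lagged variable and the fresh-versus-not split, here is one way they could be set up. This is only a sketch: the films, scores, and column names are made up, and the 60 percent cutoff is the Rotten Tomatoes threshold mentioned above.

```python
FRESH_THRESHOLD = 60  # percent; a score at or above this counts as "fresh"

# Made-up films with hypothetical column names.
films = [
    {"title": "Film A", "critic_score": 85, "ad_total": 40_000,
     "weekly_box": [200_000, 150_000, 90_000]},
    {"title": "Film B", "critic_score": 42, "ad_total": 35_000,
     "weekly_box": [120_000, 60_000, 20_000]},
    {"title": "Film C", "critic_score": 60, "ad_total": 10_000,
     "weekly_box": [50_000, 45_000, 30_000]},
]


def lagged_pairs(series):
    """Pair each week's value with the week before it: (previous, current)."""
    return list(zip(series[:-1], series[1:]))


# Points to plot or fit separately for fresh (True) and not fresh (False).
groups = {True: [], False: []}

for film in films:
    film["fresh"] = film["critic_score"] >= FRESH_THRESHOLD
    # Lagged variable: last week's box office as a predictor of this week's.
    film["box_lag_pairs"] = lagged_pairs(film["weekly_box"])
    # Collect (total ad spending, total box office) by critical reception.
    groups[film["fresh"]].append((film["ad_total"], sum(film["weekly_box"])))

print(films[0]["box_lag_pairs"])
print(groups)
```

From here you could fit a separate line to each group and compare them, which is roughly what the two colors in the plot are showing.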