0:12

Some of the key questions you wanna think about before engaging in an exploratory data analysis process are basically these. Do you have the right data to answer the question that you're interested in? If you don't have the right data, do you need to get other data to answer your question? And finally, given what you see in the data, and given what you make of everything that's going on in there, do you have the right question?

Cuz it could be that you thought you had the right question and the right data set, but when looking around in the data set and exploring what's in there, you realize that the question is better asked a different way. Or maybe you just need to refine the question slightly, make it more specific, or focus on specific variables. So the goal of EDA is to get a chance to look at the data and see: have you got the right data, do you need to get more, and do you need to refine your question at all?

In this lecture, we're gonna talk about whether you have the right data to answer your question. Another goal of exploratory data analysis, assuming you pass that first part, is to think about how you can develop a sketch of what the answer to your question might be. So one of the end products of exploratory data analysis is to really get a sense of whether the solution exists and what it might look like. And then from there, if you wanna continue, you might go on to something more like formal modeling, which we'll talk about later.

So in this lecture we'll talk about figuring out whether you've got the right data for the job. Now, I assume you've already formulated your question, but it's always good to double-check here that it's as sharp as it can be, so that you can reduce the number of variables or features that you might want to look at before you go ahead.

Now the first thing you have to do, of course, is read in the data. You can't do an exploratory data analysis without reading in the data. But assuming you can do that, the thing that I like to do is what I call check the packaging. Imagine you've got some box, and there's something inside you wanna get at. You can't get at it yet, so you kinda shake it around, maybe measure it. You see how big it is. Is it bigger than a bread box? What kind of noises does it make when you shake it? Things like that. That's what you're trying to do with your data set here. Your data set is the thing that's inside the box, and you want to see that everything's the right shape and size.

So some of the things that I like to check are: do you have the right number of rows and columns? If the way that you got this data said to expect 1,000 rows, then there should be 1,000 rows in the table. If there are gonna be 60 features, there should be 60 columns in the table. So just make sure you've got those dimensions right. These are basic details about the packaging of a data set.

Â 2:53

If you were told that certain variables were gonna be included in the data set, just check to see that they are in fact included, right? It's a very simple check. You don't have to look at any data to figure that out. You basically just have to look at the metadata, things like the variable names and the number of rows. And if there's any other metadata that you might need, things like codebooks that describe what the variables are, make sure that comes with the data too. So this can all be done without actually looking at any numbers yet.
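As a rough illustration of checking the packaging, here is a minimal Python sketch that uses only the standard library. The file contents, the column names, and the helper `check_packaging` are all made up for the example; the point is just that the row count, the column count, and the variable names can be compared against what you were told to expect without inspecting any values.

```python
import csv
import io

# Hypothetical file contents; in practice you would open your own file.
RAW = """subject_id,visit,weight_kg
1,1,70.2
1,2,70.8
2,1,82.5
2,2,81.9
"""

def check_packaging(text, expected_rows, expected_columns):
    """Compare a CSV's shape and variable names against what was promised."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_rows = sum(1 for _ in reader)  # count rows without inspecting values
    return {
        "n_rows": n_rows,
        "n_columns": len(header),
        "rows_ok": n_rows == expected_rows,
        "missing_columns": sorted(set(expected_columns) - set(header)),
        "unexpected_columns": sorted(set(header) - set(expected_columns)),
    }

# Assumed expectations: 4 rows and a height variable that was promised.
report = check_packaging(
    RAW,
    expected_rows=4,
    expected_columns=["subject_id", "visit", "weight_kg", "height_cm"],
)
print(report)
```

Here the row count matches, but the report flags `height_cm` as missing, which is exactly the kind of thing you want to find out before doing anything else.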

Â 3:22

Now, the next thing I like to do is check the edges of the data set. For example, if you're looking at a table, I like to check the top and the bottom: just look at the first few rows and then the last few rows. The first few rows are helpful to make sure you've got the right numbers there, the kind of things you were expecting to see. I find it's very useful to look at the last few rows to make sure, for example, that the data set was read in properly and that you've got all of the data you were expecting to get. Often in data files there's some junk at the end, some comments maybe, or some notes that someone put in there, especially if the file was exported from an Excel spreadsheet.

Â 3:59

You usually don't want to read that kind of stuff in, so looking at the bottom of the data set can help you see if there's any of that junk down there. You also want to check whether the data is formatted correctly. Does it look like the right numbers are in the right columns and the right rows? Sometimes those things can be shifted by one or two, so you wanna make sure you've got everything correct there. If you've got data with dates, looking at the top and the bottom can be useful because, if the rows are sorted by date, you can see whether the range is correct, that is, whether the earliest and latest dates are what you expect. So just looking at the very edges of the data set can flag a number of basic problems that occur very often and are usually easy to fix. But if you only discover these things later, once you're into a data analysis, they can be a real pain in the neck to deal with.
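Checking the edges might look something like the following Python sketch. The data, the trailing note, and the helper `looks_valid` are invented for illustration, and the code assumes the rows are sorted by an ISO-formatted date in the first column.

```python
import csv
import io
from datetime import date

# Hypothetical export with a stray note at the bottom,
# as a spreadsheet export might produce.
RAW = """date,temperature_c
2020-01-01,3.1
2020-01-02,2.7
2020-01-03,4.0
Note: sensor replaced in March
"""

rows = list(csv.reader(io.StringIO(RAW)))
header, body = rows[0], rows[1:]

def looks_valid(row):
    """A row is plausible if it has one field per column
    and its first field parses as an ISO date."""
    if len(row) != len(header):
        return False
    try:
        date.fromisoformat(row[0])
    except ValueError:
        return False
    return True

# Look at the top and the bottom of the table.
print("top:", body[:2])
print("bottom:", body[-2:])

# Separate junk from data, then read the date range off the edges
# (assumes the clean rows are sorted by date).
junk = [row for row in body if not looks_valid(row)]
clean = [row for row in body if looks_valid(row)]
first_date, last_date = clean[0][0], clean[-1][0]
print("junk rows:", junk)
print("date range:", first_date, "to", last_date)
```

Printing the bottom rows is what reveals the stray note; the validity check is just one possible way to act on it once you've seen it.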

So the next item that I always try to think about when I'm looking at a new data set, when I'm just getting involved in a data analysis, is what I call ABC: always be checking your Ns. Every aspect of your data set is gonna have some kind of count or number associated with it. For example, there's gonna be a total number of observations, your sample size. Is that what you expect it to be? Are you expecting a certain number of columns? There's going to be a number of columns, and you should always check that N. But then also within the data set, there are going to be certain numbers that you expect. For example, if you have a number of subjects, you wanna count the number of subjects or units in your analysis. If every subject was measured three times, you wanna make sure that every subject has actually got three measurements associated with it, right? So there are all kinds of Ns that you can check within your data set and around your data set to make sure that everything is structured and in place.
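To make the "check your Ns" idea concrete, here is a small Python sketch. The records and the expected counts are hypothetical; it assumes a design where each of three subjects should have exactly three measurements.

```python
from collections import Counter

# Hypothetical long-format records: (subject_id, measurement).
records = [
    ("s01", 5.1), ("s01", 5.3), ("s01", 5.0),
    ("s02", 6.2), ("s02", 6.1),
    ("s03", 4.8), ("s03", 4.9), ("s03", 5.0),
]

EXPECTED_SUBJECTS = 3      # assumed study design
EXPECTED_PER_SUBJECT = 3   # assumed: three measurements per subject

# Count how many measurements each subject has.
counts = Counter(subject for subject, _ in records)
n_subjects = len(counts)

# Keep only the subjects whose count differs from the expectation.
wrong_n = {s: n for s, n in counts.items() if n != EXPECTED_PER_SUBJECT}

print("observations:", len(records))
print("subjects:", n_subjects, "expected:", EXPECTED_SUBJECTS)
print("subjects with the wrong number of measurements:", wrong_n)
```

The number of subjects is right here, but one subject is short a measurement, which the total row count alone would only hint at.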

So the next thing you wanna do is actually just look at your data. And to me, the easiest way to look at the data to determine if there are any problems is to make a plot. Making a plot is useful in two ways. The first is for setting expectations about your data. If you make a scatter plot, you get a sense of how the variables are related to each other. If you make a box plot, you can look at the distributions of the variables to see whether they're skewed. Are there positive and negative values, and were you expecting positive and negative values? Plots can reveal this kind of information very quickly, in a way that tables often cannot.

The second is that a plot gives you a summary plus the deviations from it, whereas tables very often give you only the summary, for example the mean or the median. A plot lets you visualize both the mean and the deviations from the mean. So you'll be able to see if there are very large deviations that are perhaps unexpected, or values, negative or positive, that you weren't expecting and that don't appear correct.

All right, so I think a plot is very important to make. Not that there's no role for tables in data analysis, but a plot has a unique ability, in my opinion, to show you both what to expect and what not to expect, in the sense of what the deviations are from that expectation. So look at the data and make a plot.
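The sketch below illustrates the "summary plus deviation" point with a crude text plot in plain Python, so that it runs without any plotting library; in practice you would likely reach for a real plotting tool. The values and the helper `strip_plot` are made up, and the example assumes the measurements are supposed to be non-negative.

```python
import statistics

# Hypothetical measurements that are supposed to be non-negative;
# one value is suspicious.
values = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0, -3.0, 5.2, 4.7, 5.3]

def strip_plot(data, width=40):
    """Return one text line per value, with a '*' placed
    along a shared horizontal axis from min to max."""
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all values are equal
    lines = []
    for x in data:
        col = round((x - lo) / span * (width - 1))
        lines.append(" " * col + "*" + f"  {x}")
    return lines

# A table would typically report only this summary.
mean = statistics.mean(values)
print(f"table-style summary: mean = {mean:.2f}")

# The plot shows every value, so the deviation stands out.
print("\n".join(strip_plot(values)))

unexpected = [x for x in values if x < 0]
print("values below zero:", unexpected)
```

The mean on its own looks unremarkable, while the plot puts one point far to the left of all the others.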

Â 7:24

The next useful thing that I like to do with a data set is to try to validate it with at least one external data source. Obviously, how you do this will depend on exactly what your problem is, what your question is, and the data that you have at hand. But it's nice to be able to check that certain aspects of your data set match something outside, so that you know the data you got are at least within the realm of reality. Even just a single number can be useful. For example, suppose you know that the average level of some feature in your population is on the order of ten, plus or minus whatever.

Â 8:03

And if you look in your data set, and you have a measurement on that same feature, and the average is around ten, well, that's useful to know: the mean of the distribution in your data set roughly corresponds to what you might expect in the population.

Another thing you can do is take some measurements that you have in your data set and compare them with outside measurements that are similar, maybe not exactly the same feature, but things that you would expect to be correlated with what's in your data set. Then check to see if they're actually correlated, right? That way, you can get a sense that whatever you're measuring in your data set is measuring the same thing that this other metric is looking at, and so there may be some validity there. So just doing a little check to see that your data matches something that's independent of and outside your data set can give you a little more confidence that your data set is properly formatted and that it came to you in the right way.
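Both kinds of external check can be sketched in a few lines of Python. Everything specific here is an assumption for the sake of the example: the values, the external average of 10, the tolerance of 1, and the helper `pearson`, which computes the Pearson correlation directly from its definition.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation computed from its definition."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values: a feature from our data set, and a related measure
# from an independent outside source for the same units.
ours = [9.1, 10.4, 11.2, 9.8, 10.6, 8.9]
external_related = [18.0, 21.1, 22.5, 19.4, 21.0, 17.6]

EXTERNAL_MEAN = 10.0  # assumed published population average
TOLERANCE = 1.0       # assumed: how far off still counts as plausible

# Check 1: does our average land near the external number?
our_mean = statistics.mean(ours)
mean_ok = abs(our_mean - EXTERNAL_MEAN) <= TOLERANCE

# Check 2: is our feature correlated with the related outside measure?
r = pearson(ours, external_related)

print(f"our mean = {our_mean:.2f}, external = {EXTERNAL_MEAN}, plausible: {mean_ok}")
print(f"correlation with external measure: r = {r:.2f}")
```

What counts as "close enough" or "correlated enough" depends on your problem; the thresholds above are placeholders, not recommendations.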

So now you've checked the packaging, you've checked your Ns, and you've looked at the top and the bottom of the data set. You've made a simple plot to visualize the data, and you've validated with one external data source. The next thing to do is to take a stab at your solution. I won't get into that very much right now; my only point here is that you should try the easy solution first. What is the most obvious thing that you would do in the simplest of scenarios? You wanna start with something simple, and then you can get more complex a little bit later. Often the simplest thing can be very revealing. That's the key. Often the more sophisticated approaches will give you a little bit more insight, but not much more than you would have gotten from just looking at a very simple picture or table or whatever it is that you tried first.

And the goal of this is really to start developing what I call a primary model. A primary model is the model around which you base your other analyses. It's not gonna be permanent; you may change what the primary model is later. But it's your initial focal point for the analysis, and then you'll try all kinds of other analyses, secondary analyses I call them, that test whether your primary analysis is appropriate or not. Many times you'll find that your primary analysis actually was not right, and you'll focus on a different model and make that your primary model. But sometimes your primary model will hold up, and then you'll be able to stick with it. It doesn't matter so much at first what you choose. You want to craft a solution, hang on to it for a moment, and do some secondary analyses around it to see if your initial solution holds up.

So the idea is that, if you're making a case for something, you wanna build some prima facie evidence. That prima facie evidence is really this initial solution, the simplest solution that you can think of at the moment. And then you try to pick away at it to see if it falls apart. One of the endpoints of exploratory data analysis is to develop this prima facie case, to develop this primary model.
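As one possible illustration of a primary model and a secondary analysis, the Python sketch below fits a straight line as the "easy solution" and then refits it with each point left out to see whether the slope holds up. The data are invented, and both the choice of a straight line and the leave-one-out check are assumptions made for the example, not something the lecture prescribes.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: a predictor x and an outcome y.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# Primary model: the simplest thing that could answer the question.
intercept, slope = fit_line(xs, ys)

# Secondary analysis: refit with each point left out
# and see how much the slope moves.
loo_slopes = []
for i in range(len(xs)):
    _, s = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    loo_slopes.append(s)

print(f"primary slope = {slope:.2f}")
print(f"leave-one-out slopes range from {min(loo_slopes):.2f} to {max(loo_slopes):.2f}")
```

If the slope barely moves when any single point is dropped, the primary model has survived one attempt to pick it apart; if it swings widely, that's a sign to reconsider what the primary model should be.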

Â 11:08

So once you've gone through this process, you've looked at the data, you've checked to see that everything's valid, you wanna be able to follow up. This is the point where you ask yourself those three questions that we talked about in the beginning. Do you have the right data? Do you need to get other data, that is, do you need to collect more or ask for more in order to answer your question? And do you have the right question, or does it need to be refined a little bit? Once you're here, you may need to slightly revise a few things with your data and with your question. But once you've done that, if you wanna move on, you can start using exploratory data analysis techniques to refine your primary model, and then move on later to a more formal model.
