0:46

So in many ways this is like film-making:

you want to make a rough cut of your movie, right?

The idea is that it's not gonna be the final product, but

it's gonna give you a sense of how things flow and how things work.

And if you're gonna make an argument to someone, if you're gonna argue for

doing this versus that and you wanna build evidence to make your case,

this is gonna be a basic sketch of how that argument's gonna work out.

Okay? So that's what we'll talk

about in this lecture.

This lecture is really about using statistical models

to help you summarize your data, and

eventually to do things like make inferences.

So the first thing we need to talk about is what a model is, and

why we need them.

Models, generally speaking, are just constructs that we build

to help us understand the real world.

For example, in biology people will often use mice as models for humans.

We can't do experiments on humans, so

people sometimes do experiments on mice

to give us a sense of what might happen in a human being,

for example, for things like drug development.

Now, that's a physical type of model.

We're not going to be talking about those kinds of models, of course.

The models that we use are, in many cases, mathematical models.

And we use them

to help us describe the population that we're talking about.

So if there's a population out there that we're trying to make inferences about, or

to describe in some way, we use a model to help us do that,

because often the population is too complex to think about all at once.

Imagine you're trying to make a statement about the entire United States,

everyone in the United States.

All the things that might go on between all the 300 million or so

people in the United States are impossible to think about.

So we need a model to help us simplify that,

to allow us to think about it in a reasonable way.

The idea is that the models will stand in for the population.

They're a much simpler form than what might actually be going on

in the population,

but they represent population features and relationships.

And the models help us by imposing structure on the population. So

we might, for example, assume that things are linearly related to each other.

We don't necessarily know that, but

it helps us simplify how we think about two different variables in the population.

And the important thing to realize, and

this is a well-worn saying in the field of statistics,

is that all models are wrong, but some are useful.

So it's important not to get hung up on finding the right model,

but rather to focus on developing a model that's actually useful

to help you tell your story about the population.

3:24

So it might be useful to start at this point,

I think, by asking: what's it like to have no model at all?

That way you get a sense of how bad things can be, or

how difficult it might be, if you never used a model.

So just take this as a basic example.

Suppose you're developing a new product, and

you want to know how much people would be willing to pay for this new product.

Something you might do is just put out a simple survey.

You might survey 20 people, who may be representative

of the larger population of people who would be willing to buy this product,

and ask them how much they would be willing to pay.

So you do the survey, and then someone comes to you and says,

okay, well, what did the data say?

What did they tell us?

As an example, I recently published a book called R Programming for

Data Science.

Before it was published, there was a website

where people could put in their names and their email addresses and

say how much they'd be willing to pay for the book before it went on sale.

And so here's what the data looked like.

These are 20 numbers from the survey that was put out on

the website about my book, okay?

So this is life without a model, okay?

The answer to "what did the data tell us?"

is in here somewhere, because this is the data.

It has to be in here somewhere.

But the problem is that there's no model to help us think about the data.

This is what I would call the trivial model, meaning that there's no model.

And the problem is that the trivial model is not useful,

because it doesn't provide any summary or any data reduction.

So put it this way: if all models

are gonna be wrong, you might as well try to find something that's useful,

rather than have no model at all, which is almost certainly not going to be useful.

All right?

So just for the sake of example, let's use the normal model.

The normal model is based on the normal distribution, and

it's the familiar bell curve that we've seen many, many times.

The nice thing about the normal model is that it only has

two parameters to estimate:

the mean and the standard deviation, okay?

And we can estimate those from the data by just calculating the mean and

the standard deviation in the usual way.

So the first question we want to ask is, what do we expect to see?

If the population were truly coming from a normal distribution,

what would that look like?

It's always important to set expectations for

models. I know it's very tempting to get right into the data and

see what they look like, but you've gotta set your expectations

appropriately, so that you know whether you're right or wrong in the end, okay?

So here's what we would expect the data to look like if they were drawn

as a representative sample from a population

that was governed by a normal distribution.

So here's the normal curve.

It probably looks very familiar to you.

Now models are very useful,

because they can tell us a lot of different things about the population.

For example, this model, the normal model,

says that 68% of the population of readers would be willing to pay between $6.81

and $27.59. How do we know that?

Because that's what the normal distribution fitted to this data set

tells us about the population, okay?

We can use the model to compute other quantities too. For example, we might want

to know how many people would be willing to pay more than $30.

We can use the normal distribution to say that 11% of

the population would be willing to pay more than $30.

So that's useful to know.
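
The two figures quoted here can be reproduced from the normal model alone. The following is a minimal Python sketch, assuming the 68% interval is the mean plus or minus one standard deviation, so that the mean is the midpoint of $6.81 and $27.59 and the standard deviation is half the width. Those two parameter values are inferred from the interval; the lecture does not state them directly.

```python
from statistics import NormalDist

# Assumption: the quoted 68% interval is mean +/- one standard deviation.
low, high = 6.81, 27.59
mean = (low + high) / 2   # midpoint of the interval
sd = (high - low) / 2     # half the interval width

model = NormalDist(mu=mean, sigma=sd)

# Share of the population the model places inside the interval.
p_inside = model.cdf(high) - model.cdf(low)

# Share of the population the model says would pay more than $30.
p_over_30 = 1 - model.cdf(30)

print(f"mean={mean:.2f}, sd={sd:.2f}")
print(f"P({low} < price < {high}) = {p_inside:.3f}")
print(f"P(price > 30) = {p_over_30:.3f}")
```

Note that nothing in this calculation touches the individual survey responses: once the mean and standard deviation are fixed, every such statement comes from the model.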

Now, the one thing about this picture that you have to remember

is that there is no data in this picture.

We used the data to draw the picture,

because we used the data to calculate the mean and the standard deviation.

But there's no actual data in this picture.

So just keep that in mind.

Eventually we'll look at the data,

and we'll want to know how the data match our expectation,

which is what this picture is giving us.

Â 7:05

Now before we actually get to the data, one of the things I want to do is

show you what data would look like if they came from a normal distribution, okay?

The nice thing about most software packages now is that we can just

simulate data from a normal distribution and see what they look like.

So here's what that picture looks like.

I've made a histogram of 20 data points that come from

exactly a normal distribution.

And I've plotted the theoretical normal curve over it.

You can see that the histogram and

the blue curve match very nicely with each other.

This is all very nice and ideal because it's simulated, okay?

So this is what I call drawing a fake picture, okay?

I find drawing a fake picture to be terribly useful

because it really helps to set expectations.

And sometimes it's okay to even literally just draw it by hand.

You don't necessarily have to use a computer.

But draw a fake picture of what you're expecting to see with

the actual data, okay?

So this is what normal data look like.

If we see a histogram that kinda looks like this,

we might think, okay, a normal distribution is a pretty reasonable approximation for

the dataset.
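
One way to draw a fake picture with a computer is sketched below in Python. It simulates 20 values from a normal distribution, estimates the mean and standard deviation back from that sample, and prints a crude text histogram. The parameter values (17.20 and 10.39) are an assumption, inferred from the 68% interval quoted in the lecture; the seed, the $5 bin width, and the helper name `text_histogram` are arbitrary choices for illustration. No real survey data is involved.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the fake picture is reproducible

# Assumed parameters: the normal model implied by the lecture's 68% interval.
MU, SIGMA = 17.20, 10.39

# Twenty simulated "survey responses" drawn from the model.
fake_prices = [random.gauss(MU, SIGMA) for _ in range(20)]

# Estimate the two parameters back from the sample in the usual way.
est_mean = mean(fake_prices)
est_sd = stdev(fake_prices)
print(f"estimated mean={est_mean:.2f}, sd={est_sd:.2f}")


def text_histogram(values, width=5):
    """Print one row of '#' marks per bin and return the bin counts."""
    counts = {}
    for v in values:
        start = int(v // width) * width  # left edge of the bin containing v
        counts[start] = counts.get(start, 0) + 1
    for start in sorted(counts):
        print(f"{start:>4} to {start + width:<4} {'#' * counts[start]}")
    return counts


counts = text_histogram(fake_prices)
```

Changing the seed gives a different fake picture each time, which is a useful reminder of how much 20 points from a truly normal population can vary.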

Now, one thing that we can see from the fake picture

is that the normal distribution probably isn't going to be perfect from the get-go.

In particular, you can see on the left-hand side there

that there are negative values, okay?

[LAUGH] And it doesn't seem plausible

that people would be willing to pay negative dollars for this book.

So maybe that's not the best model, but it may still be useful.

Remember that no model is going to be right, but

it may actually still be useful for helping us summarize the data.

Okay, so here's what the data actually looked like.

I've got a histogram of all of the data points from the survey.

This is 20 data points.

And I've overlaid it with the blue curve, the normal distribution

that's fitted to the data.

So you have to ask yourself, how do the data match

up with this normal distribution, with this model?

Given what we've seen before with the theoretical normal curve,

with the fake data and the fake picture that we showed,

how does this picture compare to the fake picture, okay?

Â 9:07

Now, you might think it doesn't look that good, actually, [LAUGH] right?

So what's wrong with this picture?

Well, you've got this huge spike in the histogram at around $10, okay?

That's not predicted by the model;

the normal distribution doesn't have a huge spike right there. And furthermore,

there are no values that are either close to zero or negative, whereas the normal

distribution puts some probability on negative values.

So it doesn't look like the model really fits the histogram that well.
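
The comparison between histogram and model can also be made numerically: for each bin of the histogram, the model says how many of the 20 respondents it expects to land there. The Python sketch below computes those expected counts, again assuming the normal model has mean 17.20 and standard deviation 10.39 (inferred from the lecture's 68% interval) and using $5-wide bins chosen purely for illustration.

```python
from statistics import NormalDist

model = NormalDist(mu=17.20, sigma=10.39)  # assumed from the 68% interval
n = 20                                     # survey size from the lecture

edges = [0, 5, 10, 15, 20, 25, 30]         # illustrative bin edges in dollars

# Expected number of respondents per bin = n * probability of the bin.
expected = {"below 0": n * model.cdf(0)}
for lo, hi in zip(edges, edges[1:]):
    expected[f"{lo} to {hi}"] = n * (model.cdf(hi) - model.cdf(lo))
expected["above 30"] = n * (1 - model.cdf(30))

for label, count in expected.items():
    print(f"{label:>9}: {count:4.1f} expected of {n}")
```

A bin where the observed count is far above the expected count, like the spike around $10, or a bin where the model expects respondents and the data have none, like the region below zero, is where the model and the data disagree.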

So what are we going to do about that?

There may be multiple problems.

There may be multiple explanations for why the histogram from the data

doesn't look like what we'd expect from a normal distribution.

For starters, the data may not even be representative of the population.

This is just a website that was up there, and anyone who just happened to come by

could fill in their name and say what price they'd be willing to pay.

Who knows who these people were? Who knows if they were even prospective customers,

people who would actually buy the product?

So the data collection process might have been very skewed.

We have no real way of knowing that.

But on the other hand, it could be that the model just does not fit well, and

we may need to revise the model too.

It may be easier in some circumstances to revise the model than to revise the data,

especially if the data collection process

is very expensive, okay?

So one of the things we can do is try the gamma distribution.

The gamma distribution is another model, and

one of its key features is that it only allows for positive values.

So unlike the normal, which has negative and positive values,

the gamma distribution only allows positive values.

So then we can just repeat all the steps that we just went through.

We can set expectations,

we can draw a fake picture, and then we can compare our expectations to the data.

I'll skip the first two steps there, and I'll just show you

what the picture looks like with the data

and the gamma distribution that's fitted on top of it, okay?

So you can see from this picture that the fit's not perfect either, okay?

Maybe you could argue it's a little bit better;

you've got a little hump where that spike at ten is.

But it doesn't fit perfectly, and still

the curve is covering values where there's no data, in the zero to

five range.

But the important thing is that we have a different model, and

a different model is gonna yield different predictions.

So this model is telling us something completely different about the population

than the normal model was, right?

The normal model told us there was gonna be a big hump around 20.

But this model tells us that the hump's more like between seven and ten.

So the model is telling us something very different about what the population is

willing to pay for this product.

For example,

before, we said that 11% of people would be willing to pay more than $30.

However, if we use the gamma model,

we find that only 7% of people would be willing to pay more than $30.
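
How a tail figure like this comes out of a gamma model can be sketched in Python. The lecture does not say how the gamma distribution was fitted, so this sketch assumes a method-of-moments fit to the same mean and standard deviation as the normal model (17.20 and 10.39, inferred from the 68% interval). The helper names `gamma_pdf` and `gamma_cdf` are hypothetical, the integral is approximated with Simpson's rule, and the result need not match the 7% quoted in the lecture, since a different fitting method gives different parameters.

```python
import math


def gamma_pdf(x, shape, scale):
    """Gamma density; zero for x <= 0, so negative prices get no probability."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)) / (
        math.gamma(shape) * scale ** shape
    )


def gamma_cdf(x, shape, scale, steps=2000):
    """Approximate P(X <= x) with Simpson's rule on [0, x]; steps must be even."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = gamma_pdf(0.0, shape, scale) + gamma_pdf(x, shape, scale)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * gamma_pdf(i * h, shape, scale)
    return total * h / 3


# Assumed method-of-moments fit: match the mean and standard deviation.
mean, sd = 17.20, 10.39
shape = (mean / sd) ** 2   # shape = mean^2 / variance
scale = sd ** 2 / mean     # scale = variance / mean

p_over_30 = 1 - gamma_cdf(30, shape, scale)
print(f"shape={shape:.2f}, scale={scale:.2f}")
print(f"P(price > 30) under this gamma model = {p_over_30:.3f}")
```

The design point is the same as in the lecture: the prediction depends on the model and on how it was fitted, so two reasonable-looking models can give noticeably different answers to the same question.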

So the importance of trying different types of models is that

they tell you very different things about the population, and

they result in very different predictions.

And so, if you're interested in making these predictions and

being accurate about them, you want to make sure you have a model that's

a reasonable approximation of the population.

And you can use the data to help you see whether it fits well.

Now we have looked at two different types of models

to tell us about our data and to tell us about the population, okay?

So now you may want to continue to refine this and think about different models.

Obviously this last one didn't fit perfectly, so you might wanna

either refine your model or do another survey to get more data

and get a better sense. So you think about where you go from here.

The point of this whole exercise is that you get a little sketch

of where you're gonna go and what your solution's gonna be.

If your question was originally, how much are people willing to pay for

this product, you have a better sense now of

what the shape of that distribution might look like,

and what the population might be willing to do.

Where you go from here depends.

You may have enough information as it is to set prices or

to figure out how your marketing campaign's gonna go.

Or you might want to go into more formal modeling,

so you can test the sensitivity of your assumptions

and your expectations to various features.

So that's what we'll talk about more when we talk about formal modeling.
