0:00

Hi, my name is Brian Caffo,

and this is the lecture on what statistics is good for.

Now at the start of every class, we try to define concepts,

so I went to the authority and

asked Google what statistics was good for, and it came up with this.

Statistics: the practice or science of collecting and analyzing numerical data in

large quantities, especially for the purpose of inferring proportions

in a whole from those in a representative sample. That's quite a mouthful.

So I've decided, instead of trying to define statistics, to really just pick out

some of the core activities of statistics and go through some examples of those.

When I think about the core of statistics,

I come up with four key activities that define the field.

There are of course others, and all of these activities are overlapping.

They're not perfectly parcellated, okay? But these four activities

are descriptive analysis, which includes things like exploratory data analysis,

just quantification, like creating tables, summarization, and unsupervised clustering.

1:18

Inference, the second activity, is the process of making conclusions about populations from samples.

Prediction, the third activity that I've associated with statistics,

includes things like machine learning and

supervised learning: any instance where we wanna create a lot of predictions

from maybe a lot of predictors, or even just a few predictors.

And then design. Design is the process of designing experiments.

So again, these four activities to me cover a lot of what I

think of as statistics, and they're overlapping.

Inference and prediction are, of course, highly overlapping.

Descriptive analysis and inference, they're all quite a bit overlapping, but

let me go through some examples of each of these to get you sort of thinking

about some of these topics and how statistics might be useful for you.

2:02

So here let's start with descriptive analysis, and

I put up a picture of the great Roger Peng's Exploratory Data Analysis book,

which you can get for free on Leanpub.

I think that's Roger's actual dog on the cover, by the way.

So let's talk a little bit about descriptive analysis, and

in each case when I described one of these four activities I

tried to come up with an example that's a good defining example of it.

In this case, I came up with an example from my field,

functional magnetic resonance imaging, that really created quite a stir.

In this case, some really good researchers, Power, Barnes,

Snyder, Schlaggar, and Petersen, real heavy hitters in this area,

did this great plot. Let me summarize what's going on in this plot.

What we're interested in a lot in this area

is correlations between different areas in the brain with respect to brain activity.

So we want to know, when brain activity goes up and

down, whether that correlates in different areas of the brain.

We call that connectivity.

And what they figured out by doing this plot

is that when they got rid of some bad scans from people moving their head around,

the estimated correlations from their data changed quite a bit, and the short-range

correlations changed in a different way from the long-range correlations.

And this had a profound impact on our field, because everyone had thought

up to that point that they had been doing a good job of getting rid of head motion,

but this exploratory plot really was a defining moment in

making the field understand that, well, no, there was some motion left over.

And it's possible that a lot of what's being reported in the literature is not

actual brain connectivity of scientific interest,

but whether or not people are moving their head in the scanner.

And I don't know these folks personally, but I can imagine them going through

an exploratory data analysis, seeing this plot, and having an aha moment.

And that, I think, is what exploratory data analysis is best for:

really coming up with hypotheses.

Since this paper was published, mountains of research

have been done on the subject of motion in the fMRI scanner.
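
To make the idea of connectivity as a correlation a bit more concrete, here is a minimal toy sketch in Python. It is not the analysis from the paper: the two "regions", the motion flags, and the artifact sizes are all made-up assumptions, and the simulation only shows one simple way a shared motion artifact could inflate an estimated correlation between two otherwise unrelated signals.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)  # fixed seed so the toy example is repeatable
n_scans = 500

# Hypothetical assumption: about 20% of scans are flagged as "subject moved".
moved = [rng.random() < 0.2 for _ in range(n_scans)]

# Two simulated regions whose true signals are independent noise.
# On flagged scans, the same (made-up) motion artifact is added to both.
region_a, region_b = [], []
for m in moved:
    artifact = rng.gauss(0, 5) if m else 0.0
    region_a.append(rng.gauss(0, 1) + artifact)
    region_b.append(rng.gauss(0, 1) + artifact)

# "Connectivity" estimated from all scans versus only the unflagged scans.
r_all = pearson(region_a, region_b)
kept = [i for i, m in enumerate(moved) if not m]
r_clean = pearson([region_a[i] for i in kept], [region_b[i] for i in kept])

print(f"all scans: r = {r_all:.2f}; motion scans removed: r = {r_clean:.2f}")
```

In this toy setup the correlation computed from all scans is driven mostly by the shared artifact, and it drops toward zero once the flagged scans are removed. Real data are messier, and the sketch does not reproduce the short-range versus long-range pattern described above.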

4:28

So that's a great example in my mind of an exploratory data analysis plot.

This plot made it into their paper and created quite a stir.

Okay, let's now talk about inference.

I have a picture of my much more austere Statistical Inference Leanpub book,

which you can get for free on Leanpub if you wanna read more about the subject of

statistical inference.

I define statistical inference as the process of making conclusions about

populations from samples.

And to me, it was pretty easy to think of a famous example

of statistical inference because we're confronted with one very frequently, and

that is election polling.

In that case, the population we're interested in making inferences about is

the population of voters on election day, and

we want to know the proportion of them that will vote for a candidate.

So we are confronted with a fairly classical statistical inference problem

every two years, four years for presidential elections, and in fact,

in the 2012 presidential election,

there was quite a brouhaha over exactly the process of statistical inference.

In fact, on one of the television news shows on the night

of the election, one of the political pundits,

Karl Rove, just refused to believe,

in fact, their own team's polling results.

And even prior, well prior to that night, the statistician

Nate Silver had been doing a lot of publicity,

really kind of promoting the idea that, well, Obama's really for the most part

locked up this election, to much derision from a lot of the political pundits.

And what happened after the election was a very interesting discussion

about the role of inference, and

about how much we believe inference when discussing polling.

So if you want an interesting collection of reading on statistical inference and

how it plays out in the media, then you can read up on the 2012 election.

6:39

But at any rate, more germane to this class is the idea

that election polling is a great example of statistical inference.

We have a clearly defined population of interest,

a clear parameter that we're interested in, and we can't poll everyone, so

we're gonna get an estimate of that population parameter from a sample.
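
The polling setup just described (a population, a parameter, and an estimate from a sample) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated, the "true" vote share of 0.52 is made up, the poll is a simple random sample, and the interval is the textbook normal-approximation (Wald) interval, not whatever any real pollster uses.

```python
import math
import random


def estimate_proportion(sample, z=1.96):
    """Estimate a population proportion from a list of 0/1 responses.

    Returns the sample proportion and an approximate 95% confidence
    interval based on the normal approximation (z = 1.96).
    """
    n = len(sample)
    p_hat = sum(sample) / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - z * std_err, p_hat + z * std_err)


rng = random.Random(0)  # fixed seed so the example is repeatable

# Hypothetical population of voters: 1 = votes for the candidate, 0 = does not.
# The 0.52 share is an assumption made up for this sketch.
true_share = 0.52
population = [1 if rng.random() < true_share else 0 for _ in range(100_000)]

# We can't poll everyone, so we draw a simple random sample of 1,000 voters.
poll = rng.sample(population, 1000)

p_hat, (low, high) = estimate_proportion(poll)
print(f"estimated share: {p_hat:.3f}, approx. 95% CI: ({low:.3f}, {high:.3f})")
```

The parameter is the share of the whole population that votes for the candidate, and the poll only gives an estimate of it with some uncertainty attached. Real polls also have to deal with nonresponse, weighting, and who actually turns out, none of which this sketch attempts.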

7:10

So, again, this is another example where I didn't have to think too

hard about coming up with a really well-known example of prediction,

and here I thought about stock market prediction.

And I think, to me, one of the characteristics of

prediction over inference, because those two subjects bleed together quite

7:56

So, to me stock market predictions are a great example of this, because for

many people who are predicting the stock market,

what they really care about is simply the losses or

gains, the monetary losses or gains that occur from the predictions, okay?

And this is the modern way to think about predictions, I might add.

And so the frame of mind has shifted, whereas someone who was

an academic studying the markets might be interested in why,

why the market moves in the ways that it does, regardless of whether or

not they personally make a lot of money off of that knowledge.

8:35

Another great example of prediction that's occurring a lot lately, and

probably one of the things that drove you to this class, is how important modern

machine learning and modern prediction algorithms have become in data science.

For example, Amazon wants to recommend

things that you might wanna buy on their site.

Netflix wants to recommend movies.

At the heart of all these activities is a machine learning process

that's coming up with these recommendations.

And again, they care less about the underlying psychology and

9:09

fundamental truths of why you're doing these things, and

care more about: we wanna give the person the most relevant ads, so

they click and then they buy things.

So a huge chunk of marketing,

online retail, et cetera, all rely on machine learning now.

It's been somewhat of a revolution in prediction.

Finally, the last thing that we wanted to talk about was design, and

design is perhaps one of the most important things that we cover,

though it's often overlooked, and it's often overlooked because in many,

many settings we don't have control over the design.

We just get the data that we get, and we don't have a choice over it.

I put up a picture of the famous R.A. Fisher's book.

This is a reprint of it from Oxford. Statistical Methods, Experimental Design,

and Scientific Inference is a real classic, and R.A.

Fisher was the patriarch of the idea of statistical design.

10:10

So when trying to think of an example for statistical design,

the first thing that came to my mind for data science was A/B testing, sort of

the most classic example of statistical design in the data science world.

But I wanna instead talk more about clinical trials,

because I think clinical trials impact our lives more.

When things are really on the line, when the government has to decide whether or

not to allow a drug or a therapy to be released to the population at large.

10:42

There's a demand to have a clinical trial, and it's so important that there

are entities like clinicaltrials.gov to keep track of and monitor clinical trials.

The hallmark of a clinical trial is the randomization of treatment groups, so

that it balances unobserved covariates.

So the idea of randomization is the fundamental hallmark of both clinical

trials and A/B testing, but germane to our discussion today is the fact that

that randomization is part of a carefully controlled experimental design.

In a clinical trial,

they're trying to control as much of the experimental design as possible, and

that's one of these four corners of the field of statistics that is so important.
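
As a rough illustration of why randomization tends to balance covariates we never look at, here is a small Python sketch. Everything in it is assumed for the example: the participants are simulated, "age" stands in for an unobserved covariate, and the assignment is a plain shuffle into two equal arms rather than the more elaborate schemes real trials often use.

```python
import random
from statistics import mean


def randomize(participants, rng):
    """Randomly split participants into two equal-sized arms."""
    shuffled = participants[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical participants. "age" plays the role of a covariate that the
# assignment procedure never looks at.
participants = [{"id": i, "age": rng.gauss(50, 10)} for i in range(2000)]

treatment, control = randomize(participants, rng)

# Because assignment ignores age entirely, average age should come out
# similar in both arms, up to chance variation.
age_gap = abs(mean(p["age"] for p in treatment) - mean(p["age"] for p in control))
print(f"treatment n={len(treatment)}, control n={len(control)}, age gap={age_gap:.2f}")
```

The same shuffle-and-split idea underlies a basic A/B test. The point of the sketch is that the balance comes from the design itself, not from any adjustment made after the data arrive.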

11:24

So just to remind you what these four activities are,

they were exploratory data analysis, inference,

prediction, and experimental design.

So, I look forward to seeing you in some of our future classes, and

we'll talk a little bit more about each of these topics in turn.