Hello and welcome to week four of this MOOC where we start to
delve into the world of statistical inference.
Now inference itself, we could sort of subdivide into two main branches.
Firstly, estimation, our focus for week four.
And secondly, hypothesis testing, our focus in week five.
Now conceptually, what are we trying to achieve?
Well, this word inference means to infer something
about a wider population based on an observed sample of data.
So really, when we do statistical analysis, we tend to view the data we observe
as a sample drawn from some wider population.
Now this word population in the everyday use of the term may refer to
perhaps the population of a country or maybe a city.
Well, indeed, we may be considering those particular types
of populations in our statistical studies, but
we are not confined to that kind of simplistic definition of a population.
Rather, a population doesn't necessarily even have to refer to human beings.
It may be the population of companies whose shares are listed on some stock exchange.
Maybe we're looking at the population of fish in the sea, planets in
the universe, you name it.
Now, at the heart of what we try to do with statistical inference is the assumption
that our sample is fairly representative of that wider population.
And our goal when selecting a sample in the first place is to achieve hopefully
this representativeness.
Now conceptually, that may sound straightforward enough, but
that's perhaps easier said than done.
History is littered with examples where inferences have been drawn
from samples which are very much unrepresentative of the population.
Perhaps we will consider a few famous examples.
In the 1936 US presidential election, the Literary Digest, a magazine to which
generally wealthy people tended to subscribe,
predicted that the Republican candidate would win that 1936 election.
As for the size of the data set they dealt with, well,
they had over 2 million responses to their survey, and that opinion poll
seemed to suggest that the Republican candidate would win.
In the end, FDR, the Democratic candidate, won the election.
So one might think if you base an opinion poll on over two million responses,
that that's going to give you a very accurate result.
Well, it transpires that this was a classic case whereby, the population from
which the sample was drawn was not in fact representative of the target population.
So the target population, in this case, would have been the US electorate.
However, the sample of voters that the Literary Digest considered
was drawn from its own readership.
So here we have an example of coverage bias in the sampled population.
That is, the people whose views were solicited
were drawn only from the Literary Digest subscribers, who themselves were typically
not representative of the US electorate overall. Why?
Well, this was the 1936 election, really in the heart of the Great Depression, so
what sorts of individuals would be subscribing to the Literary Digest?
Typically those on very high incomes, and
hence this would give us a skewed representation of the US electorate.
The sample would tend to have a much greater proportion of individuals from high
socioeconomic groups, who would tend to support the Republican candidate.
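To make the point about sample size versus representativeness concrete, here is a minimal simulation sketch. Every number in it is a hypothetical assumption chosen for illustration (the share of high-income voters, the support rates, the sample sizes); none of it is the actual 1936 data. It compares a large poll drawn only from high-income voters with a much smaller simple random sample from the whole electorate.

```python
import random


def simulate_polls(seed=0):
    """Compare a large poll from a biased frame with a small random poll."""
    rng = random.Random(seed)

    # Hypothetical electorate (assumed numbers, not 1936 data):
    # 20% of voters are high income; support for candidate A is
    # 70% among high-income voters and 35% among everyone else.
    population = []
    for _ in range(200_000):
        high_income = rng.random() < 0.20
        support_prob = 0.70 if high_income else 0.35
        votes_a = rng.random() < support_prob
        population.append((high_income, votes_a))

    true_share = sum(vote for _, vote in population) / len(population)

    # Biased sampling frame: only high-income voters can be reached,
    # mimicking a poll drawn from a wealthy readership.
    frame = [vote for high_income, vote in population if high_income]
    biased_sample = rng.sample(frame, 20_000)
    biased_estimate = sum(biased_sample) / len(biased_sample)

    # Simple random sample from the whole electorate, 20 times smaller.
    random_sample = rng.sample(population, 1_000)
    random_estimate = sum(vote for _, vote in random_sample) / len(random_sample)

    return {
        "true_share": true_share,
        "biased_estimate": biased_estimate,
        "random_estimate": random_estimate,
    }


if __name__ == "__main__":
    for name, value in simulate_polls().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the large poll from the restricted frame should land near the high-income support rate rather than the overall one, while the small random sample should land near the true share. More responses reduce random error, but they do nothing about coverage bias.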
That was 1936. Scroll forward 12 years to the 1948 US presidential election.
Now, most of the mass media at the time were calling for
the Republican candidate, Dewey, to beat the Democratic candidate, Truman.
And there's a very famous image whereby the victorious
President Truman was holding up a copy of the Chicago Tribune, which
had as its headline, Dewey Defeats Truman. Why?
Because the opinion polls which had been conducted seemed to suggest a victory for
the Republican challenger.
And indeed, this is another example whereby the opinion poll was based on a sample
which turned out not to be representative of the population as a whole.
Maybe some more recent examples.
The Brexit referendum of 2016.
Admittedly, the polls were fairly narrow, with
none really indicating a clear lead for either side.
Nonetheless, there was a general expectation that the remain side
would win the referendum itself.
Now, if one actually looks at the opinion polls,
they give slightly different results depending on the contact
method which was used to obtain the sample.
Namely, telephone polls tended to give a slight lead to the remain campaign,
whereas polls conducted online tended to give the reverse
picture, a slight lead for the leave campaign.
Now, once the referendum had been concluded and
we saw that the leave side won, with hindsight you might want to
explain why the online polls were more accurate than those conducted by telephone.
Well, it transpires that the online surveys were better able to reach
a representative sample of voters in terms of educational background,
which turned out to be one of the key predictors as to which way
an individual voted in that referendum.
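As a rough sketch of why the education mix of a sample matters, the short calculation below uses purely hypothetical numbers: the leave rates by education and the graduate fractions are assumptions for illustration, not figures from any actual Brexit poll. It computes the overall leave share implied by a given proportion of graduates in the sample.

```python
def overall_leave_share(graduate_fraction,
                        leave_among_graduates=0.35,
                        leave_among_non_graduates=0.60):
    """Leave share implied by an education mix (hypothetical rates)."""
    non_graduate_fraction = 1.0 - graduate_fraction
    return (graduate_fraction * leave_among_graduates
            + non_graduate_fraction * leave_among_non_graduates)


# Assumed education mix of the voting population: 30% graduates.
population_estimate = overall_leave_share(0.30)

# Assumed mix of a sample that over-represents graduates: 50% graduates.
skewed_sample_estimate = overall_leave_share(0.50)

print(f"population mix:    {population_estimate:.3f}")
print(f"skewed sample mix: {skewed_sample_estimate:.3f}")
```

With these assumed rates, the same within-group opinions point to a leave majority under the population mix and a remain majority under the skewed mix. The only thing that changes between the two is who ends up in the sample.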
So our goal,
hopefully, is to get a sample which is representative of that population.
But we should beware the many instances in history where this has failed to happen,
leading us to erroneous inferences.