The first is a very important dichotomy that we should be aware of, which is

the primary versus secondary data dichotomy.

There are two types of data, broadly speaking, that we would in some sense use

for research and analytics, primary data and secondary data.

What is primary data?

Primary data basically is data collected at source and hence primary in form.

Specifically, primary data is that data that would not exist but

for your research or analytics project.

If you didn't go out and collect it, it would not, in some sense, exist as data.

The source of the data, because it is primary in form, would be individuals,

groups of individuals, organizations, institutions, and so on.

Surveys, interviews, focus groups, all of these fall under the ambit of primary data.

On the other hand, secondary data is data that has already been collected.

Whether or not you're doing research or you're doing analytics,

that data would have existed anyway.

A good example is sales records within a company.

They are collected by the point-of-sale system anyway.

ERP data and accounting data within a firm are going to be collected anyway.

So whether or not you're doing something with it is a secondary issue.

The data would exist regardless.

This is an important dichotomy,

because we will see that the type of data you have will, in some sense,

influence the questions you can ask, and the answers you can hope to get.

All right, what will follow quickly is a multiple choice question on

this dichotomy based on what we've seen so far in the slide.

So now let's get to the four data types.

There are four types of data based on the four primary scales, right?

And the data types, in some sense,

correspond to the scales directly.

So these four are nominal, ordinal, interval, and ratio scale.

Nominal basically means of a name, right?

So it's just a name, it's just a label, and no further information can be gleaned.

And the example I put on the slide is not a good example because in some sense,

Coke and Pepsi are not uninformative.

In a blind test, people would normally prefer a Pepsi, but in a non-blind test,

where the names are visible, people tend to say they prefer Coke more.

I mean this has been, in some sense, documented repeatedly.

Ordinal data, ordinal means having an order, so there is ordered information.

So this conveys not just label information,

this also conveys preference information.

So when you say, I prefer A to B, you know that there are A and B, two entities, and

at the same time you know that you prefer A to B, so there is a direction implied.

The third data type is interval data.

Interval data is not just nominal and ordinal.

Sure, it has labels and it has direction.

And in addition to that, it has magnitude information.

I rate A a 7 and B a 4 on a scale of 10.

So it's telling me not just that I prefer A to B (direction), and

that there are A and B (nominal labels).

It is telling me how much I prefer A to B.

And finally, ratio conveys information on an absolute scale.

So I paid, say, $11 for A and $12 for B.

Ratio basically has all the properties

of all the other scales.

The reason it differs from interval is that this is an absolute scale.

It is understood independent of subject.

So $0 or 0 rupees in this case are understood the same by everybody, right?

So there is a zero point,

and it is considered fixed.
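
As a minimal sketch of what each scale adds, the A and B examples above can be written out in Python. The ratings and prices are the ones just mentioned; the variable names are illustrative assumptions, not something from the slide.

```python
# Values taken from the lecture's A and B examples; the variable
# names are illustrative assumptions.
rating = {"A": 7, "B": 4}        # interval: ratings on a scale of 10
price = {"A": 11.0, "B": 12.0}   # ratio: dollars paid, zero is fixed

# Nominal: all we know is that there are two labels.
labels = sorted(rating)

# Ordinal: we also know which one is preferred.
preferred = max(rating, key=rating.get)

# Interval: we also know by how much.
rating_gap = rating["A"] - rating["B"]

# Ratio: with a fixed zero point, a quotient is meaningful too.
price_ratio = price["B"] / price["A"]

print(labels, preferred, rating_gap, round(price_ratio, 3))
```

Each step reuses the information of the step before it, which is the sense in which the higher scales carry all the properties of the lower ones.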

Here's another quick example from the world of sports.

Nominal would be the numbers assigned to runners in a race.

It doesn't matter what number a particular runner is wearing.

It doesn't affect the performance in any way.

Ordinal would be the rank order of winners.

So I know that A came first and B came second, but

I don't know about the difference between them.

I don't know whether A won by an inch or by a mile, so to say.

Interval would be a performance rating on a 0 to 10 scale.

This is done in gymnastics, for instance, where judges give out ratings.

And ratio would be in a race, say, the time taken to complete it.

So 15.2 seconds for A and 14.1 seconds for B tells me

the exact difference, measured against an absolute zero point.

What does it all mean?

Why should we care about the four primary scales?

Okay, look at the first column there.

If you have nominal data, the most you can do in terms of analysis is mode,

frequencies and percentages.

That's about it.

Nothing else can really be done.

If you have ordinal data, however, because you now have ordered information, you can,

in addition to what you can do with nominal data, get medians to come in.

So half the ordering is above the median, the other half is below the median.
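
Here is a small, hedged Python sketch of the analyses just described for nominal and ordinal data. The brand labels echo the earlier Coke and Pepsi example, but the response lists and the three-level ordering are hypothetical data made up for illustration.

```python
from collections import Counter
from statistics import median_low, mode

# Nominal data (hypothetical responses): labels only.
brands = ["Coke", "Pepsi", "Coke", "Coke", "Pepsi"]
counts = Counter(brands)                       # frequencies
percentages = {b: 100 * n / len(brands) for b, n in counts.items()}
most_common = mode(brands)                     # mode

# Ordinal data (hypothetical responses): labels with an order.
levels = ["low", "medium", "high"]             # assumed ordering
responses = ["medium", "high", "low", "high", "medium", "medium"]
codes = [levels.index(r) for r in responses]   # position in the order
median_level = levels[median_low(codes)]       # median stays a real level

print(counts, percentages, most_common, median_level)
```

`median_low` is used here so that the result is always one of the observed levels rather than an average of two positions, which would not be meaningful for ordered categories.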

When we move from ordinal to interval, we're not just taking a step,

we're actually taking a leap.

Because the moment you get to interval and

to ratio, some very interesting properties come into play.

Well, looking at the slide, can you guess what they are?

So, right, the mean and the variance.

The moment the mean and the variance become meaningful, statistical analysis,

statistical inference of a parametric variety is now in play, right.

A lot is now possible, which basically tells me that if you had a choice

in your data collection, in your research design for your analytics,

you should ideally want data in either interval or

ratio form, as far as possible.

So the first two scales, nominal and ordinal, are called non-metric.

The last two are called metric because the mean and the variance,

and in particular the arithmetic mean, are meaningful there.
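
For metric data, the mean and the variance can be computed directly. The sketch below uses the two race times from the sports example; treating them as a sample of two and using the sample variance is an assumption made only for illustration.

```python
from statistics import mean, variance

# Ratio data: race times in seconds for A and B, from the sports example.
times = [15.2, 14.1]

average_time = mean(times)     # arithmetic mean
spread = variance(times)       # sample variance (divides by n - 1)

print(round(average_time, 2), round(spread, 3))
```

With only two observations this is not a serious estimate of anything, but it shows that both quantities are well defined once the data are on an interval or ratio scale.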

Based on what we've seen,

the four data types corresponding to the four primary scales,

what will follow shortly are four multiple choice questions.

So just to sum up what we did, any psychometric scale that yields interval or

ratio data is a metric scale.

The other two would be non-metric.
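
The summary above can also be sketched as a small lookup in Python. The function and variable names are hypothetical, and the lists contain only the statistics named in this lecture, so the ratio entry adds nothing beyond interval here.

```python
# The four primary scales, in increasing order of information.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that become meaningful at each scale, as named in the lecture.
NEW_STATISTICS = {
    "nominal": ["mode", "frequencies", "percentages"],
    "ordinal": ["median"],
    "interval": ["mean", "variance"],
    "ratio": [],  # the lecture groups ratio with interval
}


def is_metric(scale: str) -> bool:
    """Interval and ratio scales are metric; the other two are non-metric."""
    return scale in ("interval", "ratio")


def permitted_statistics(scale: str) -> list[str]:
    """Collect everything allowed at this scale and at the scales below it."""
    allowed = []
    for s in SCALES[: SCALES.index(scale) + 1]:
        allowed.extend(NEW_STATISTICS[s])
    return allowed


print(permitted_statistics("ordinal"), is_metric("ordinal"))
```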