
The variable is made discrete in the sense that we're talking about 300 degrees,

350 degrees, 400 degrees, 450 degrees.

You can't have a continuous measurement there.

You're going to have it in discrete buckets, in

discrete numbers.

Because at the end of the day, for

analysis of variance, you're calling it four different groups.

So you're calling it four different levels of a factor, so

in terms of when do we use this.

We use this to test differences between

mean values of Y across multiple levels of a factor X.

The conditions more formally stated are that the Xs should be

in discrete groups and the Y should be continuous.

So when you're talking about the dependent variable,

the dependent variable has to be continuous.

The independent variables have to be discrete.

So, in addition to the example of four levels of temperature,

you could also be looking at, for example, three suppliers of pizza dough.

Three suppliers of raw material and what kind of

output difference does it give based on the three suppliers that you have?

Does it take more time for

the pizza to get baked, based on which supplier's materials you are using?

That could be a question you could be addressing using ANOVA.

So some assumptions underlying the use of the analysis of variance model, the idea

of ANOVA, is that the samples are randomly and independently drawn from populations.

So when we're talking about a one-way ANOVA,

what you're saying is that if there are four levels of a factor, or four different

groups, then you're going to pick samples from four different

populations, to use statistics terms.

Those samples should be independently and randomly picked from each of those groups.

There should be no relationship in terms of how you pick the samples from one

group versus the other, they should be randomly picked or randomly assigned.

If you're talking about random assignment into a particular group,

if you're thinking about doing some kind of experiment where you're saying I'm

going to use a certain temperature for measuring its effect on the defect rate.

Then I'm going to randomly assign units or raw material or

whatever I'm using to test to

each of those four levels of temperature at which I'll be testing things.

So the random assignment becomes important.

Now why does it become important?

It's important because you're simply looking at


one particular cause for the effect.

And you want to be able to rule out any other causes, so

the way you can get around ruling out any other causes for

that effect is you say you try to randomly assign.

So what you are hoping with random assignment or what you are aiming for

with random assignment is that all those other factors

are getting equally represented or similarly represented

in each of those four levels of that one factor that you're measuring.

So if I'm looking at three different temperature levels, but I'm also using

different materials from different people, but I'm also using different employees to

do the actual process, but I'm also using different lighting conditions.

All those are getting randomly distributed and

hopefully equally distributed across each of those three levels of temperature which

I'm actually measuring, which I'm actually focused on, which I'm actually addressing.

So that's why random assignment or

random selection from each of those becomes important.

The second assumption that we have here is more of a technical assumption because

ANOVA is robust to its violation.

So we are assuming that when you're measuring three different samples

from three different populations, that all of those are normally distributed.

Now, as it says on the slide, ANOVA is robust to its violation.

So if there is no normal distribution for

each of those populations, it's okay for you to carry on.

The third assumption is that the population variances are equal.

You may not have equal variance in each of those populations.

But to get around it, in order to not let that

have an impact on the results that you're getting,

you can try to make your sample sizes as equal as possible in each of those groups.

So you can say, if I can have equal sample sizes in each of those groups and

sometimes that's difficult to do.

Sometimes it's difficult for

you to get an equal sample size in four different groups.

Especially if you're dependent on something that you're measuring,

let's say from customers, from the market, then it's hard for you to assign things.

So you're simply measuring from each group and

you say this is what I ended up getting.

I didn't end up getting enough of each level of income.

I got 300 people from the lowest level,

250 from the second one and then 400 from the third one.

They might not be equal sample sizes.

But if you can get equal sample sizes, then you are okay with

violating the assumption of equal population variances.

All right, so these are technical assumptions for

us to use analysis of variance.
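The equal-variance assumption can also be checked directly before running the test. As a minimal sketch (the lecture works in Excel, but Python with SciPy makes the check compact; the sample values below are made up for illustration), Levene's test gives a p-value for the null hypothesis that the group variances are equal:

```python
from scipy import stats

# Hypothetical samples from three groups (not data from the lecture).
g1 = [4.8, 5.2, 5.0, 5.1, 4.9, 5.3]
g2 = [6.1, 6.3, 5.9, 6.0, 6.2, 5.8]
g3 = [5.4, 5.6, 5.5, 5.3, 5.7, 5.2]

# Levene's test: the null hypothesis is that all group variances are equal.
stat, p_value = stats.levene(g1, g2, g3)
```

A small p-value would suggest unequal variances, in which case keeping the group sizes as equal as possible, as described above, helps protect the ANOVA results.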

When would you use analysis of variance?

We talked about the technicalities of it in terms of

the y variable should be continuous and the x variable should be categorical.

The effect should be continuous,

the causes should be something that you can put in categories.

So some examples here, gas mileage.

Now, gas mileage,

continuous variable, affected by four formulations of motor oil.

Formulation one, two, three, and four, right?

You have nominal categories there.

You're not even saying one is higher than the other, anything like that.

You're simply saying, a, b, c, and d or 1, 2, 3, and 4 or

p, q, r and s, or whatever you wanna call it.

And you can put them in any order that you want but

you're simply looking at the effect on gas mileage of four formulations of motor oil.


One more thing that helps in establishing that the effect is coming from these causes is sample size.

If you can get a larger sample size and a large random sample for

each of the areas that would be the most beneficial.

All right, so let's get into the idea of doing this kind of analysis

based on an example.

The example you have here and

you have this available to you on your website as well.

It's about training for animation.

So, Julia Sabatini wonders whether the time required to complete

a common animation task differs by four types of employee training.

There are four types of employee training and the data that we have and

it's also available to you on this slide is right here.

The data we have is there are four types of training, we're simply calling them A,

B, C and D.

Which one is called A, B, C or D doesn't matter, and

we have people that were assigned to each of these four different types of training.

And what you have in terms of the data are the number of hours that they take for

a common task.

After they were trained using method A, method B and method C or method D.

These are different people that have been assigned to each of these tasks,

six people assigned to each type of training.

And again, the idea of random assignment being that

you should not be able to figure out

any kind of pattern when you look at people that were assigned to B versus

people that were assigned to C versus people that were assigned to A and D.

You should not be able to see any pattern; they should be completely random.

That's what should be happening in terms of the assignment of

people to these tasks.

That's the data that we have and our task is to figure out

if the type of training has an effect on the amount of time that they take

to complete a common animation task after they get this training.

How would you test this based on ANOVA?

And let's take a look at some basic underlying things

that are at work when you're thinking about ANOVA.

Some things that are actually very intuitive about ANOVA and

what it's trying to do.

What is the null hypothesis?

What is your hypothesis when you're doing any kind of hypothesis test using

analysis of variance, any kind of ANOVA analysis, a one-way ANOVA analysis?

The null hypothesis is that when you have multiple groups,

each of those groups has the same population mean.

The null hypothesis, or H0, is that mu1 equals mu2 equals mu3, and

you can have as many groups as you wish in terms of doing an analysis of variance.

There's no treatment effect if these means are actually equal,

that means that whatever is the cause that you're studying, has no effect on this Y.

Whatever is the X you are studying has no effect on this Y,

our dependent variable.

There's no treatment effect if all these means turn out to be equal.

Your alternative hypothesis is that not all of them are equal.

Here's a thing that you want to note about the alternative hypothesis for

analysis of variance.

It is simply saying that at least one of these means is different from the other.

It's not saying that all of them are different from each other, it's enough for

one of them to be different from the others for

the null hypothesis to be rejected.

We're simply saying one of them may be different from others.

In fact, you should make a note of the fact that it's not even saying

which one is different.

It's simply saying that one of them is gonna be different.

That's the alternative hypothesis.

That's the standard alternative hypothesis when you're talking about an analysis of variance,

a one-way analysis of variance.

Now let's take a look at translating this hypothesis to our particular example.

Remember the example is animation times, and how are they affected by training?

Our null hypothesis is that, training has no effect.

And the animation times are gonna be equal for

the four types of training that we have.

We have A, B, C, and D.

And the alternative hypothesis is that at least one of them is different.

Now how does an ANOVA work?

In terms of figuring out whether there's a difference or not.

What it does, is it takes the total variation in the data.

Now what you notice is that when you look at the data for training type A,

the time taken by all the six people was not exactly the same.

There was variation in that time within that category A.

There was similarly variation within the categories B, C, and D.

So, the task that ANOVA has, or the way it works,

is that it takes the total variation and divides it up between a variation

that is because of the treatment and variation that is because of random error.

What do we mean by variation that's because of random error?

It's saying that the variation that exists within a group is because of random error,

is because of there might be different things that might be explaining it.

It's not because of the fact that it's in that group.

It's not because of the fact that somebody is getting trained

using training method A, or B, or C, or D.

It's because of random error rather than being assigned to A, B, C, or D.

And on the other hand, on the left side you have the treatment effect,

which is a variation that's happening because of the treatment effect.

Here we're talking about the difference between the groups.

These are called by different names, so the variation due to treatment effect,

you can see it's called sum of squares treatment, sum of squares between groups.

Variation due to random error is called sum of squares error,

sum of squares within, or within-groups variation.

The concept for measuring these variations

is the idea that we're going to take the data and

figure out what is the variation within each group and then between the groups.

And then we're not gonna get into the actual calculations for this,

but how does this end up looking in terms of the results when we see ANOVA results?

When we look at ANOVA results, you basically see this

One-Way ANOVA Summary Table.

What is this table telling us?

It's telling us the two sources of variation in the first and

the second row in this table.

The first row is between treatment variation, and

you see something called Degrees of Freedom there.

Degrees of Freedom is K minus one here, and the K stands for the number of levels.

What was the number of levels that we had here?

We had four levels of training, so k minus one would be four minus one, and

degrees of freedom are simply saying how many values are free to vary

and how much is determined once the other values are known.

So that's what degrees of freedom stands for and

without getting into the technicalities of how it's used in statistics,

we simply need to know that it's going to be a k minus 1.

It's simply gonna be the number of levels minus 1, so in our case it'll be three.

The sum of squares is going to be between treatment effects, and

that's going to be the difference between a, b, c and d.

And then before we move on to the mean square column and the F column,

let's go down to the second row, which is the within source of variation.

And there you see the degrees of freedom being n minus k.

N being the sample size.

We had a sample size of 24, so

24 minus 4 is going to give us the degrees of freedom of 20.

And why is degrees of freedom important from an interpretation perspective,

you'll see that we have a decision rule for

deciding whether we accept or retain or reject the null hypothesis.

Technically, we shouldn't be saying "accept the null hypothesis."

We should say reject or retain the null hypothesis, so

when we have to make that decision these degrees of freedom will matter.

The k minus 1, the n minus k will matter.

That's what we'll be using in order to make that decision.

Anyways, moving through the second row of within errors,

we have n minus k as being the degrees of freedom,

the sum of squares is the sum of squares error, sum of squares within.

That is talking about the variation within each of those a,b,c and d groups.

Now, let's move on to the second to last column, which is the mean square.

The mean square is taking the sum of squares between and

then adjusting it by the degrees of freedom.

So it's SSB divided by k minus 1.

The SSB was the sum of squares between treatments; you divide that by the degrees

of freedom and you get the mean square between, and similarly you get

the mean square within by taking SS within and dividing it by n minus k.

Finally you get to what is going to be important as our decision rule for

deciding whether we will reject or retain the null hypothesis and

that is what is called the F statistic.

The F statistic is a statistic that you will calculate and

that follows the F distribution.

We'll see what the F distribution is in a minute, but

the F statistic is calculated as MSB divided by MSW.

And the intuition behind using this F statistic is that if there

is a treatment effect, this F ratio, this F statistic, should have a high value.


If you think about what we're saying there, we're saying the mean square

between should be higher than the mean square within by a substantial quantity.

And that's how we would get a higher ratio here,

mean square between divided by mean square within.

Because that's what will be telling us that when we compare the variation

between treatments, it is much higher

than the variation that we find within each of those groups.

So that's what it would be telling us, intuitively speaking.

So higher F ratios would lead us to say yes,

we can say that there is a difference between these groups.

There is a difference in Y based on the levels of X at which

we have these four different groups, in our case, four different types of training.
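To make this partitioning of variation concrete, here is a minimal sketch in Python (the lecture uses Excel; the task-time values below are invented for illustration, not the lecture's data) that computes the sum of squares between and within, the mean squares, and the F ratio, and cross-checks the result against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical task times (hours) for four training types, six people each.
groups = [
    [9, 12, 10, 8, 15, 12],    # A
    [10, 6, 9, 9, 10, 6],      # B
    [12, 14, 11, 13, 11, 13],  # C
    [9, 8, 11, 7, 8, 5],       # D
]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k = len(groups)   # number of levels of the factor
n = all_obs.size  # total sample size

# Variation due to treatment (between groups) and due to random error (within).
ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ssw = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)

msb = ssb / (k - 1)  # mean square between, df = k - 1
msw = ssw / (n - k)  # mean square within,  df = n - k
f_stat = msb / msw   # a large ratio points to a treatment effect

# Cross-check against SciPy's built-in one-way ANOVA.
f_scipy, p_value = stats.f_oneway(*groups)
```

A large F means the between-treatment variation dominates the within-group variation, which is exactly the intuition described above.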

Now, even before we collect the data and do the statistical test,

we can set up a decision rule for our analysis of variance test.

So here's the decision rule that we can set up.

We can say that we know we're going to use the F distribution for

analysis of variance, and the F distribution has only an upper tail;

unlike the normal distribution, it doesn't go negative, which should

make sense because it's a ratio, and it doesn't make sense for this ratio to be negative.

Both values in the ratio are coming from squared values, and

that's why both of them are going to be positive.

So going back here we have the sum of square, which is a squared value,

sum of square within which is a squared value.

So both positive values and then we get a ratio which is going to be positive.

That's why it's an upper tail test, but

coming back to the idea of what is it in terms of the decision rule?

Well, we have to set up an alpha value even before we collect data and

go out and do the test.

So we say, this is our alpha value, and

once we've set up the alpha value we can set up a rejection region.

The rejection region is going to be based on what is the alpha value.

So we say if it's 0.05 or 5% that becomes our rejection region and

based on that rejection region, we can come up with an

F critical value, which is going to say, here is the F value based on our alpha,

and if you remember we were looking at degrees of freedom earlier.

And this is where the degrees of freedom come into play.

The F critical value is going to come from knowing the alpha value and

knowing the degrees of freedom.

So k minus 1, the number of levels of that factor

minus 1, is going to be the numerator degrees of freedom.

And the denominator degrees of freedom are going to be n minus k or

the sample size minus the number of levels at which you have that particular factor.

So that's what we're going to use.

Now, before we had Excel and other software to do our job for us,

we had to go to this distribution and have a table for it.

So we had a table of F distributions, and the tables were different based

on numerator and denominator degrees of freedom.

So based on the numerator and denominator degrees of freedom,

you looked at a table and you said, given my alpha value of 0.05 or

0.10 or whatever I have decided, I will get an F value.

What does that F value give me?

It basically gives me an anchor.

It gives me a number.

So in our case, we had numerator degrees of freedom 3,

denominator degrees of freedom 20, and we use an alpha value of 0.05.

It's giving us a critical value of 3.1.

It's giving us a value of 3.1, which is saying,

if I get a F ratio based on my data,

which is greater than 3.10, I will reject the null hypothesis.

So as you see before I collected the data I am setting up the rejection rule.

If I get an F observed, which is what we're gonna call what we get from our data.

If I get an F observed value that's greater than 3.10,

I will reject the null hypothesis.

And what you can see on the screen here is that

there is an Excel formula that you can use in order to get the exact F value.

Now, we'll use Excel to do these calculations, so

you need not use the Excel formula to come up with the F critical.

But if you were interested you would be able to come up with the F critical

based on this.
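Outside Excel, the same critical value can be looked up programmatically. A sketch in Python with SciPy (equivalent in spirit to the Excel formula mentioned on the slide, using the example's degrees of freedom):

```python
from scipy import stats

alpha = 0.05
df_num = 3   # k - 1: four levels of training minus one
df_den = 20  # n - k: sample size of 24 minus four levels

# Upper-tail critical value of the F distribution.
f_critical = stats.f.ppf(1 - alpha, df_num, df_den)
# This comes out to roughly 3.10: reject the null if F observed exceeds it.
```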

So now, let's go ahead and use Excel and see what we can do

in terms of the analysis for the data that we have over here.


So here we have the data for the training for animation problem and

you can see that there are four types of training, A, B, C and D.

And what you have in each of those columns are the animation times for

each of the four types of training.

So these are random samples taken and put into each type of training.

And then their times are measured for how long they took on

the animation task.

So first, let's check if we have the Excel add-in for solving this problem.

We go to the options in Excel, and we are looking for

the add-in for the Analysis ToolPak.

So, we make sure we have that added in, and if you don't, then you need to go in

here and add it by clicking on the Go button and then hitting OK.

So, that should add that in.

And then, we can move on to do the analysis of variance for this problem.

So, let's click on Data.

And then, you see data analysis.

This is the add-in that has been activated by you clicking on that option.

We hit data analysis.

The first option on this menu should be, and

here we see it, it should be ANOVA single factor, which is what we're doing.

So, it's ANOVA single factor because it's one independent

variable with multiple levels.

So, we hit OK there.

And that brings us to the menu of Input Range.

Here, you have to pay attention to two things,

you can have the data grouped in columns or in rows.

We have it in four different columns.

So, we are going to let it be on the default which is columns.

We will have labels in the first row.

So, you see a check box for that.

And we're going to click on that when we get to it.

But let's put the input range, and

we're simply going to highlight the input range as being columns A, B, C, and D.


Now, that you've seen how we used Excel to do this analysis, do the one way analysis

of variance, you can see the results here replicated from what you saw in Excel.

And here you can see pointed out that the P value is .0005,

which is less than .001, and that's less than .05.

What you can also see in this output is

the different averages for each of those groups.

But, coming back to first thing that we want to do in terms of testing the null

hypothesis, here our P value, our observed P value is .001 and

that's less than our alpha value that we had set of .05,

so we reject the null hypothesis that these are equal.

That the values of the animation times are equal for the four types of training.

We say, that there's actually going to be at least

one that is different from another one in these times.

So, the task times are not the same for different training regimes.

Now, what you can also see is that we did get the F observed value of 9.0,

so if you look in that analysis of variance table,

you see the column that says F has a value of 9.00335,

or we can call it 9, and that F observed is what we are going

to compare with the F critical to make our decision.

So, the F critical that you see, which is in the last column, is 3.09.

So, what we're saying here is there are two ways in which we can come up with

the decision to reject the null hypothesis.

They will always be giving us the same result.

One is to say that the F critical is 3.09 and the F observed

happens to be much greater than that, so we can reject the null hypothesis.

And the second way of saying it,

is that the P value of .001 is less than the alpha value of .05.

And therefore, we will reject the null hypothesis.
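Both decision rules can be written out in a few lines. A sketch using the values reported in the output above (F observed of about 9.0, P value of .0005, F critical of about 3.10):

```python
from scipy import stats

alpha = 0.05
f_observed = 9.00335  # F value from the ANOVA output
p_value = 0.0005      # P value from the ANOVA output
f_critical = stats.f.ppf(1 - alpha, 3, 20)  # about 3.10

# Rule 1: compare the observed F with the critical F.
reject_by_f = f_observed > f_critical
# Rule 2: compare the p-value with alpha.
reject_by_p = p_value < alpha
# The two rules always agree on the decision.
```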

Now, from a practical point of view.

You'd be saying, well, this told us what?

This told us that one of them is different, but which one is better?

Which one gives us lower animation times?

In terms of completing the task in less time.

So then, we wouldn't be able to say

from a statistical perspective which one is actually better, but

you can at least go and look at the means of each of those groups.

So for A, B, C, and D, you can see that the means are 8.5, 6.5, 8.83, and 10.33.

So, what can you say from there?

That B has the lowest average time

that was taken by people to complete their animation task.

So, people who got trained using method B

were the quickest in completing their animation task.

And we can also say that people who got trained using method D took

the longest in terms of completing the animation task.

So, that's something that we can also say.


So, once more, let's put the decision rule into picture form and

see how that same rule that we saw from the table applies.

So, what we have here is the F critical of 3.10 and

the F observed of 9.0.

If you remember, we had set up the rejection region to be to the right of

3.10 and here, we can see that 9.0 is way out there, so

we will reject the null hypothesis.

And we say the training method has a significant effect.

Now again, you'll be saying well,

what's the practical value in terms of pointing out one of them that's different?

We do have something that we can do once we get this result.

So, there's something called a post hoc analysis that you can do

after you get a significant result in ANOVA.

It only makes sense to do a post hoc analysis,

if you got a significant result in ANOVA.

So, ANOVA is telling you at least one of them is different;

the post hoc analysis will tell you, well, which one of these?

Which pair is different?

A versus B, A versus C, A versus D,

B versus C,

B versus D, or C versus D.

So, you have these pairs that you're looking at.

And you want to find out which of them is different.

And we have something called a post hoc test.

And here is one example of a post hoc test. There are multiple post

hoc tests, and without getting into the specifics of each one of them, some have

advantages for certain types of data and some have advantages for other types of data.

So, there are advantages and

disadvantages of using multiple types of test depending on which software you use.

If you use something like MINITAB or SPSS or SAS or R, you might be

able to get different post hoc tests and results from different post hoc tests.

But the idea is going to be there, you're going to be testing the specific

hypothesis of each mean being equal to the other.

Mu1 being equal to mu2, or mu1 being equal to mu3.

That would be the null hypothesis.

If you reject the null hypothesis,

you would say that one is indeed different than the other.

That's what you'd be able to say, but only based on a Post Hoc test.
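One widely used post hoc test, Tukey's HSD, is available directly in SciPy. A minimal sketch with made-up samples for the four training methods (not the lecture's data):

```python
from scipy.stats import tukey_hsd

# Hypothetical task times for training methods A, B, C, and D.
a = [9, 8, 10, 9, 8, 7]
b = [6, 7, 6, 5, 7, 8]
c = [9, 10, 8, 9, 8, 9]
d = [11, 10, 12, 9, 11, 10]

result = tukey_hsd(a, b, c, d)

# result.pvalue[i, j] is the p-value for the pairwise comparison of
# group i versus group j; small values flag pairs whose means differ.
```

Each off-diagonal entry answers one of the pairwise questions above: A versus B, A versus C, and so on.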


All right, so we've looked at a one-way ANOVA.

We looked at a very simple example of using one particular X and

looking at its effect on Y.

But of course, you can have a situation where you can use more than one

X and look at the Y, so that's called a two way ANOVA and there

what you can do is you can be looking at X1 and X2 as having an effect on Y.

The same rules apply in terms of the types of data.

Y has to be continuous.

X1 and X2 have to be discrete.

So, for example, you could be looking at two different types of pizza dough.

And two different types of training for people.

And the joint effect on the time that it takes for the pizza to get ready.

Something like that is possible using a two-way ANOVA.

And there are the advantages.

You will also be able to look at the interactions between those two effects.

And you're putting those two X's together, so

you can look at the interactions of those two effects.

Of course, you can look at each of those separately using one-way ANOVA, but

the advantage of doing it as a two-way ANOVA is that you can say,

you can answer the question of, does the Y value depend

on X1 and does the Y value depend on X2, and

does the effect of X1 on Y depend on the value of X2?

Does the effect of X2 on Y depend on the value of X1?

You can answer those kinds of questions using a two-way ANOVA.

The other point that I'd like to make, in closing, about this topic of

analysis of variance, is that depending on which kind of software you use,

you might have to structure the data differently.

You might have to structure it as being in one column, and

giving all of the Y values in one column.

And the column next to it stating repeatedly the levels of X.

So, X equals 1, 1, 1, 1, 1, many, many times.

And then, our Y values based on X equals 1, and then X equals 2 many, many times.

And that could be a single column, in which you have the data.

Or you could do it in four columns like we had when we used Excel;

we had it in four different columns,

and we did the calculation with the data structured that way.
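The two layouts are easy to convert between. A sketch with pandas (column names and values are made up for illustration), going from the wide, four-column-style layout used in Excel to the single stacked column some software expects:

```python
import pandas as pd

# Wide layout: one column per level of X, as in the Excel sheet.
wide = pd.DataFrame({
    "A": [9, 12, 10],
    "B": [10, 6, 9],
    "C": [12, 14, 11],
})

# Long layout: all Y values in one column, with the level of X
# repeated in the column next to it.
long = wide.melt(var_name="training", value_name="hours")
```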

So, that's something that you have to keep in mind.

The final thing I'd like to say about analysis of variance is remember that you

do not have to have an equal sample size.

So if you have four columns of data and the four columns are unequal,

that's fine, if there are four different levels of the factor and

you don't have equal sample sizes.

That's fine.

You can carry on with using analysis of variance.

And most software, including Excel, will be able to handle that for

you in terms of giving you the analysis results.
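Most implementations accept unequal group sizes without any extra work. A quick sketch with SciPy and made-up groups of different lengths:

```python
from scipy import stats

# Three groups with unequal sample sizes (hypothetical values).
g1 = [5.1, 4.9, 5.3, 5.0]       # four observations
g2 = [6.2, 6.0, 6.4]            # three observations
g3 = [5.5, 5.7, 5.6, 5.4, 5.8]  # five observations

f_stat, p_value = stats.f_oneway(g1, g2, g3)  # runs fine with unequal n
```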

So, that's ANOVA for you.

Next, we'll look at regression.