0:06

So, here we are in RStudio with our coursera.r file, and we're moving on to the scenario where we're comparing the number of distinct pages visited in an A/B test. We're going to go through a few analyses to do that here. As the comment indicates, what we'll be doing is an independent-samples t-test, and we'll talk more about that as we go. So, as is our usual procedure, we'll read in one of the data files that goes with this work. That is pgviews.csv, the page views file, so we'll read that in.

Â 0:44

And as is our typical process here, we'll take a view of what that is so we can be comfortable with it. As you can see, we have a subject column, and each subject is measured just once, it seems. Then there's a site column for which site they were issued, A or B. As I scroll down here, it kind of refreshes; I'll go all the way to the bottom, and then it'll refresh. So we do have 500 subjects, as we said in the description. And then there's a column called pages, and it looks to be pretty much single-digit

Â 1:17

counts of how many pages were viewed. One would be the minimum, we would guess, and I saw maybe a ten or an 11 in there as a maximum. We can find out more formally what those are. That gives us a sense of what we're dealing with.
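The read-and-inspect step might look like the following minimal sketch. The course file itself isn't reproduced here, so the sketch first writes a tiny made-up stand-in to a temporary file; the column names Subject, Site, and Pages are assumptions based on the description above, and in RStudio you would typically call View(pgviews) rather than head().

```r
# Stand-in data: made-up rows, NOT the course's pgviews.csv.
# Column names (Subject, Site, Pages) are assumed from the walkthrough.
demo <- data.frame(
  Subject = 1:6,
  Site    = c("A", "B", "A", "B", "A", "B"),
  Pages   = c(3, 5, 4, 7, 2, 4)
)
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

# Read the file in; with the real data this would be read.csv("pgviews.csv").
# stringsAsFactors = TRUE makes Site a factor (not the default since R 4.0).
pgviews <- read.csv(path, stringsAsFactors = TRUE)

head(pgviews)         # first rows; View(pgviews) opens the RStudio grid viewer
range(pgviews$Pages)  # formal minimum and maximum of the Pages column
```

Asking for the range directly is one way to "find out more formally" what the minimum and maximum are, rather than scrolling through the viewer.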

Â 1:32

We'll go ahead and recode that subject column as a factor. Since it's just a number, R thinks it's a numeric variable, but as we've now talked about variable types, we know we want it to be a factor. It won't be used directly in this analysis, but we're going to keep doing this good practice because, as we progress in the sophistication of our analyses, we'll see that we end up using the subject later.

And then let's go ahead and take a little summary. We can see that there are 500 distinct levels of subject: these six plus 494 others. That's just the subject identifier. It looks like 245 of those subjects were exposed to site A and 255 to site B. So it's very nearly a 50/50 test, and certainly a realistic outcome, as is often the case. And then here, because pages is a numeric response variable, it computes for us a min and a max, 1 and 11 there, and some other statistics. We can see the mean is right near four and the median is four.
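As a rough sketch of the recode-and-summarize step, assuming the same column names (Subject, Site, Pages) and using a small made-up table in place of the real 500-row file:

```r
# Stand-in data (made up); column names Subject, Site, Pages are assumed.
pgviews <- data.frame(
  Subject = 1:8,
  Site    = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  Pages   = c(3, 4, 3, 4, 5, 2, 7, 4)
)

# Subject is stored as a number; recode it as a factor,
# since it is an identifier rather than a quantity.
pgviews$Subject <- factor(pgviews$Subject)

# summary() reports level counts for factors and
# min / quartiles / median / mean / max for numeric columns.
summary(pgviews)
```

With the real data, the Site counts in this summary are where the 245 versus 255 split shows up.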

We'll also look a little bit more at some descriptive statistics using the plyr library.

Â 2:40

This function, ddply, allows us to apply a function over certain aspects of the table. And I'll remind you, you can always type a question mark and then a function name, assuming that the library for it is loaded, and it'll bring up the help for that name. So ddply is: split the data frame, apply a function, and return the results in a data frame. What we see as input here is the data table itself, the page views. We want to split by site and apply this inline function, where we are summarizing over the pages by site. When we do that, we can see for each site, A and B, some of the same statistics that we saw before overall, but now split by site. The mean for site A is 3.4, and the mean for site B is almost 4.5. That suggests there may be a difference, but we've learned that comparing means directly is not the full story. We need to know something about the variance.

So, this other function allows us to summarize and get the mean number of pages, which we have here, but also the standard deviation, which would be of interest. We can see that the standard deviation in the site A condition was about half the size of the standard deviation in the site B condition. So there were more pages viewed in site B, but also with greater deviation around that mean.
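The per-site summaries could be sketched roughly as below. The data are made up, the column names are assumed, and the inline function is an illustrative one that returns a few named statistics explicitly (the lecture's own function isn't shown in the transcript). The plyr calls are guarded in case the package isn't installed, and a base-R tapply cross-check follows.

```r
# Stand-in data (made up); column names Site and Pages are assumed.
pgviews <- data.frame(
  Site  = factor(rep(c("A", "B"), each = 5)),
  Pages = c(3, 4, 3, 4, 3, 5, 2, 7, 4, 6)
)

if (requireNamespace("plyr", quietly = TRUE)) {
  # Split by Site, apply an inline function to each piece,
  # and return the results as one data frame.
  print(plyr::ddply(pgviews, ~ Site, function(d) {
    c(min = min(d$Pages), median = median(d$Pages),
      mean = mean(d$Pages), max = max(d$Pages))
  }))
  # Mean and standard deviation of Pages for each site.
  print(plyr::ddply(pgviews, ~ Site, plyr::summarise,
                    Pages.mean = mean(Pages), Pages.sd = sd(Pages)))
}

# Base-R cross-check of the same per-site mean and standard deviation.
site_means <- tapply(pgviews$Pages, pgviews$Site, mean)
site_sds   <- tapply(pgviews$Pages, pgviews$Site, sd)
print(site_means)
print(site_sds)
```

Typing ?ddply after library(plyr) brings up the help page described above.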

One way to view that is with a histogram. So we can call the hist function and look at the page views for site A, specifically the number of pages.

Â 4:21

So I think we can just graph that there, and we can see a couple of things about this. We can see the range, from about one to six. And for site A, it looks to be kind of a normal distribution, a bell curve or Gaussian curve. Let's go ahead and look at a histogram of site B. Here we can see something a little bit different: very few visits with high page counts up at seven, eight, and ten, and quite a few down lower. It doesn't quite look like a bell curve. It doesn't look normally distributed, and those kinds of considerations will come up as we go forward in the course. For now, we're going to ignore those differences, but they are relevant and we will talk about them more in the future.
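A minimal sketch of the two histogram calls, again on made-up data with assumed column names (Site, Pages); the titles and axis labels are illustrative additions:

```r
# Stand-in data (made up); column names Site and Pages are assumed.
pgviews <- data.frame(
  Site  = factor(rep(c("A", "B"), each = 5)),
  Pages = c(3, 4, 3, 4, 3, 5, 2, 7, 4, 6)
)

# Histogram of Pages for site A only, then for site B only.
# hist() draws the plot and also returns the bin counts it used.
hA <- hist(pgviews[pgviews$Site == "A", ]$Pages,
           main = "Site A", xlab = "Pages")
hB <- hist(pgviews[pgviews$Site == "B", ]$Pages,
           main = "Site B", xlab = "Pages")
```

Ten rows are far too few to judge a distribution's shape; with the real 500 rows, the same two calls give the pictures discussed above.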

Another way to look at the data is a box plot. So with the plot command, we can see pages by site. And now we understand that notation a little better: pages is the y variable, the outcome, by site, which is our independent variable, or x variable if you will.
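The "pages by site" notation is R's formula syntax. A small sketch on made-up data with assumed column names; note that plot draws box plots here because Site is a factor:

```r
# Stand-in data (made up); column names Site and Pages are assumed.
pgviews <- data.frame(
  Site  = factor(rep(c("A", "B"), each = 5)),
  Pages = c(3, 4, 3, 4, 3, 5, 2, 7, 4, 6)
)

# Response on the left of ~, factor on the right: Pages by Site.
# With a factor on the x side, plot() produces one box per level.
plot(Pages ~ Site, data = pgviews)
```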

In the meantime then, we're going to execute our independent-samples t-test. Why is it independent samples? What does that mean? Remember that factors can be between subjects or within subjects. Between subjects is the type of factor that site would be, because each visitor gets either website A or B, but not both. So it's an independent-samples t-test. In the future we'll see a paired-samples t-test, which is appropriate for a within-subjects situation.

Â 5:57

You can see this parameter at the end of the call to t.test, var.equal. That's saying the variances are equal. We can see in this box plot that's obviously not true, and we'll formalize that consideration in the future, as I said, but for now we'll just do a basic uncorrected t-test assuming that the variances are equal. In reality, t-tests are fairly robust to deviations in variance. The variances don't have to be exactly equal anyway.
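The call described here could be sketched as follows, on made-up data with assumed column names. For contrast, the sketch also runs t.test with its default, var.equal = FALSE, which applies the Welch correction and typically reports fractional degrees of freedom.

```r
# Stand-in data (made up); column names Site and Pages are assumed.
pgviews <- data.frame(
  Site  = factor(rep(c("A", "B"), each = 5)),
  Pages = c(3, 4, 3, 4, 3, 5, 2, 7, 4, 6)
)

# Independent-samples t-test assuming equal variances (the uncorrected test).
pooled <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)
print(pooled)

# R's default instead applies the Welch correction for unequal variances.
welch <- t.test(Pages ~ Site, data = pgviews)
print(welch)
```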

Â 6:24

So, let's go ahead and execute that, and we can see that we have the t-test here. Well, what does this output mean? The data line confirms we're looking at pages by site, and that's in fact exactly the design we talked about. The t-value is the t-statistic, so just like with the chi-squared statistic in the previous analyses we went through, the t-statistic is the value in the t distribution that we are getting from this data. The degrees of freedom is 498, obviously related to the 500 subjects that we have there: 500 minus one for each of the two groups. And then the p-value is very, very small, far less than 0.0001, so that's about all we care about, but very near zero.
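To make those output fields a little more concrete, here is a sketch that recomputes the pooled t-statistic, its degrees of freedom, and the two-sided p-value by hand and compares them with what t.test reports. The data are made up and the column names are assumed; the variable names are illustrative.

```r
# Stand-in data (made up); column names Site and Pages are assumed.
pgviews <- data.frame(
  Site  = factor(rep(c("A", "B"), each = 5)),
  Pages = c(3, 4, 3, 4, 3, 5, 2, 7, 4, 6)
)
res <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)

a <- pgviews$Pages[pgviews$Site == "A"]
b <- pgviews$Pages[pgviews$Site == "B"]

# Degrees of freedom: total subjects minus one per group (500 - 2 = 498 in the lecture).
df_by_hand <- length(a) + length(b) - 2

# Pooled variance, then the t-statistic for the difference mean(A) - mean(B).
sp2 <- ((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) / df_by_hand
t_by_hand <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / length(a) + 1 / length(b)))

# Two-sided p-value from the t distribution with those degrees of freedom.
p_by_hand <- 2 * pt(-abs(t_by_hand), df_by_hand)

print(c(t = t_by_hand, df = df_by_hand, p = p_by_hand))
```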

Â 7:12

There are some other results as well: we can see the means for groups A and B are like we saw before in those summary statistics. So the bottom line here is we have a significant difference between the number of pages visited in website conditions A and B. Okay. So that is the t-test for our simple website A/B test.

Â 7:54

As you know from before, we completed the tests of proportions table previously, and now we've come down to the analysis of variance table. We're in that first row, and what's turned red there is the independent-samples t-test that we just did. If we look at the left column, it has one factor, and that was site. It had two levels and it was a between-subjects factor, so that's what the third column with the B means, and we're in a parametric test. Next time we talk, we'll get more into what the difference between parametric and non-parametric tests is. But you can see the table sets up a sort of equivalence relationship, where if we're in a parametric situation we have certain tests, and if we're in a non-parametric situation we have others. For now, you can think of the difference as whether or not we can make certain assumptions about the data, which are required for parametric tests. For example, that the data is normally distributed is a common assumption we'll have to contend with, and for many measures the data is. We can see in these box plots, however, that for site B the data is clearly not normal, and we saw that in the histogram as well.

Â 9:07

So that's the difference between those columns, and we'll formalize that more as we go. But we've done the independent-samples t-test, and that's where we'll leave it for now. Let's see how we would report that t-test result in writing.

Â 9:29

So we analyzed page views, and our result was a t-test, which we indicate here. It has one parameter for its degrees of freedom, and that was 498. So this is its degrees of freedom, this is the test type, obviously, and the test statistic was 7.21. In our case it came out as negative 7.21. You can put the sign in or not, it's up to you; really it just reflects which order the two levels of the website factor were in. If you compare A to B, you'll get negative 7.21. If you flip that and take the difference in the mean of B to A, it will be positive 7.21, so it really doesn't matter whether you have the minus sign or not. So that's the statistic.
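The point about the sign can be checked with a short sketch: reordering the factor levels flips the sign of t while leaving the p-value unchanged. The data are made up, the column names are assumed, and the report string is just one plausible way to format the result.

```r
# Stand-in data (made up); column names Site and Pages are assumed.
pgviews <- data.frame(
  Site  = factor(rep(c("A", "B"), each = 5)),
  Pages = c(3, 4, 3, 4, 3, 5, 2, 7, 4, 6)
)

# A compared to B: the statistic is based on mean(A) - mean(B).
t_AB <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)

# Make B the first level, so the comparison becomes mean(B) - mean(A).
flipped <- pgviews
flipped$Site <- relevel(flipped$Site, ref = "B")
t_BA <- t.test(Pages ~ Site, data = flipped, var.equal = TRUE)

# One possible written report: test type, degrees of freedom, statistic, p-value.
report <- sprintf("t(%d) = %.2f, p = %.3f",
                  as.integer(t_AB$parameter), t_AB$statistic, t_AB$p.value)
print(report)
```

For the lecture's result, the same pattern would read t(498) = -7.21 with p reported as less than .0001, since the exact p-value is too small to print usefully.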

Â