In this video, I'm going to demonstrate how to run A/B testing in the RStudio environment. So why don't I switch over there, and here's my little program that's going to demonstrate some of the key elements of A/B testing. We're going to install the library mosaicData, which has some data sets, and load the SaratogaHouses data. So it's loaded. Let's take a look at it with that command, and you can see here on the top right, there are 1,728 observations of 16 variables. We can look at the top line here: here's the price, the size of the lot, age. Clearly, this is real estate type of information. Number of bedrooms, fireplace [inaudible] waterfront, et cetera. So we're just going to focus on that first column, price. We can take an average: 211,966.7. So that's the average price of houses in, I believe this is Saratoga County, in New York.

Now we can run a t-test. If I don't put any other values in here, it tests whether or not the mean of our sample is different from zero. We just looked at the mean, 211,000. So we can run this t-test, and why don't we look at that for a second. There's a one-sample t-test and the results. It shows us which data was used to calculate the t-test, so you're going to want to check that. In here, this is the key number you want to look at, the p-value, and it's less than your significance level of 0.05; in this case it's very, very small. So we go with the alternative hypothesis: the true mean is not equal to zero. Recall that the null hypothesis is that it is equal to zero. And here's our confidence interval for the mean. Up here, we calculated the sample mean. It's 211,966. According to this confidence interval, the true mean should fall between 207,322 and 216,611, and that's our 95 percent confidence interval.

Next, it's sort of obvious, you wouldn't need a statistical test to ask whether or not the housing prices are different from zero, especially given the data. But is the mean different from some value that you're interested in, say 200,000?
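The steps so far (loading the data, taking the mean, and running the default one-sample test) might look roughly like this in R. This is a minimal sketch, not the exact script from the video; it assumes the mosaicData package is already installed, and the variable names `mean_price` and `one_sample` are made up for illustration.

```r
# Assumption: mosaicData is already installed, e.g. via install.packages("mosaicData").
library(mosaicData)
data(SaratogaHouses)

# Inspect the data set: number of rows and columns, then the first row.
dim(SaratogaHouses)
head(SaratogaHouses, 1)

# Average of the price column.
mean_price <- mean(SaratogaHouses$price)
mean_price

# One-sample t-test. With no mu given, the null hypothesis is that the true mean is 0.
one_sample <- t.test(SaratogaHouses$price)
print(one_sample)

# The pieces discussed in the video can also be pulled out individually.
one_sample$p.value    # compare against the 0.05 significance level
one_sample$conf.int   # 95 percent confidence interval for the mean
```

Storing the result in a variable is optional; calling `t.test()` directly at the console prints the same report.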
If you're testing housing prices, for example, again, you wouldn't expect the average to be zero. But you might wonder, is the average 200,000? Maybe I just got a weird sample of my data. So you could run that test, there you go, and you have a p-value of 4.8 times 10 to the minus seven. So that's definitely less than 0.05, and the alternative hypothesis is that the true mean is not equal to, and this is scientific notation, but that's 200,000. Then there again is the confidence interval, and it also gives you the mean of your sample. So to wrap up that test, we have a p-value which is less than 0.05. We can reject the null hypothesis that the true mean is equal to 200,000. If you want to play around with this, you could. Just for fun, let's put 211,000. Here we have a p-value of 0.68, and we cannot reject the null hypothesis. The alternative hypothesis is that the true mean is different from 211,000, but in this case, we cannot reject the null. Okay, let me change that back so I don't mess up that file.

Here is a two-sample t-test example. Recall that in the Saratoga housing data, there's the central air-conditioning yes-or-no variable. There it is in the last column. So some homes have central air-conditioning, some houses don't, and we're just going to break it up into two groups. This notation here, the square bracket, gets the prices at the indexes where central air is equal to yes, and this one gets the prices at the indexes where central air is equal to no, so it splits the data into two groups. This is an R convention, an R programming technique, which I encourage you to study and stare at for a while. So basically you have a column of prices and another column, central air, Yes or No, and we're going to split it into two groups. Another way to do this is to actually create two column vectors in a table, one column Yes, one column No, and then we could just compare these two groups.
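Here is a sketch of the two ideas just described: testing the mean against a value you pick with `mu`, and using square-bracket indexing to split the prices by central air. It assumes mosaicData is installed; the names `price`, `test_200k`, and `test_211k` are illustrative, while `x` and `y` follow the names used in the video.

```r
# Assumption: mosaicData is already installed.
library(mosaicData)
data(SaratogaHouses)

price <- SaratogaHouses$price

# One-sample t-tests against a chosen value instead of the default of 0.
test_200k <- t.test(price, mu = 200000)
test_211k <- t.test(price, mu = 211000)
test_200k$p.value   # below 0.05: reject the null that the true mean is 200,000
test_211k$p.value   # above 0.05: cannot reject the null that the true mean is 211,000

# Logical indexing: the comparison yields TRUE/FALSE for every row,
# and the square brackets keep only the prices where it is TRUE.
x <- SaratogaHouses$price[SaratogaHouses$centralAir == "Yes"]
y <- SaratogaHouses$price[SaratogaHouses$centralAir == "No"]

length(x)   # houses with central air
length(y)   # houses without central air
```

The expression inside the brackets is just a vector of TRUE and FALSE values, so you can print `SaratogaHouses$centralAir == "Yes"` on its own to see what is being selected.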
So let's create these two variables x and y representing the two groups. The next thing you're going to want to know is, do they have the same sample size? So we can look at the length of x, it's 635, and the length of y is 1,093. So clearly they don't have the same sample size. We might want to look at a box plot. There's a box plot there, and recall that in a box plot, the top and bottom edges of the box are the third and first quartiles, which span the interquartile range, and the line in the middle is the median, and it shows you how the data is distributed. Then the last point I should mention about the box plot is, do you think these averages are the same or not the same? It's really hard to tell by eye.

But we can run the t-test. In this case, it's not a paired t-test. They are just two samples, assuming unequal variance. Let's run the test. Here we have a p-value of less than 2.2 times 10 to the minus 16, which is way less than 0.05. So we can reject the null hypothesis that the means are the same, and here's the confidence interval for the difference in means, 57,881.43 to 77,883. Here are the two sample means, about 254,000 and 187,000. So what does that mean? Recall that x is houses with central air, so the way to interpret the results is that houses with central air are priced higher on average. So there you have it.

I know I talked a bit about the different types of tests in the previous video, but they're all essentially t-tests. I showed you some code on how to run t-tests under various scenarios. I want to take a quick look at the documentation here so that you are at least aware of how to run t-tests under different conditions. Let's see if I can make this bigger. The command is always t.test, open paren, and then the first parameter is the data set. So your data column name or your column vector, that's x. You can set the alternative, which is two-sided by default, things like that. Oh, did I skip y? y is your second data set, your second column. mu equal to zero is the default.
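To tie the two-sample comparison to the arguments named so far (`x`, `y`, `alternative`, `mu`), here is a sketch with those arguments written out explicitly rather than left at their defaults. It assumes mosaicData is installed; `two_sample` is an illustrative name, and the formula-style `boxplot()` call is one common way to draw the plot, not necessarily the one used in the video.

```r
# Assumption: mosaicData is already installed.
library(mosaicData)
data(SaratogaHouses)

# Split prices into the two groups, as in the video.
x <- SaratogaHouses$price[SaratogaHouses$centralAir == "Yes"]
y <- SaratogaHouses$price[SaratogaHouses$centralAir == "No"]

# Side-by-side box plots of price for houses without and with central air.
boxplot(price ~ centralAir, data = SaratogaHouses,
        xlab = "Central air", ylab = "Price")

# Two-sample t-test with the documented defaults spelled out:
# a two-sided alternative and a hypothesized difference in means of 0.
two_sample <- t.test(x, y, alternative = "two.sided", mu = 0)
print(two_sample)

two_sample$estimate   # sample means of x and y
two_sample$conf.int   # confidence interval for the difference in means
```

Because `x` comes first, the confidence interval is for the mean of `x` minus the mean of `y`; an interval entirely above zero is what supports reading the result as houses with central air being priced higher.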
So in that first example that I showed you with the housing prices, I tested if the average price was different from zero, and then I tested if the average price of the house was different from 200,000, and you can change that value as you wish. If you actually have two columns in there, mu is the hypothesized difference in means, so you're testing the observed difference against that value. Here, paired is equal to FALSE. That means you're not doing a paired t-test. I didn't show a data example here using a paired test, but that's for when you have, say, a pre-test and a post-test situation. The equal-variances setting, var.equal, defaults to FALSE, and then your confidence level default is 0.95. So that's how you would run all the various types of t-tests. It's not very hard at all. You just have to have an understanding of where your data came from and how you collected it, choose the appropriate t-test, and execute. That wraps up A/B testing.
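Since the video does not include a paired example, here is a small sketch of what a pre-test and post-test comparison could look like. The scores below are invented purely for illustration and are not from any real data set; the names `pre`, `post`, `paired_test`, and `diff_test` are also hypothetical.

```r
# Hypothetical scores for the same eight people before and after some change.
# These numbers are made up for illustration only.
pre  <- c(72, 65, 80, 58, 90, 77, 68, 84)
post <- c(75, 70, 82, 63, 91, 80, 74, 85)

# Paired t-test: each post value is matched with the pre value in the same position.
paired_test <- t.test(post, pre, paired = TRUE, conf.level = 0.95)
print(paired_test)

# A paired t-test is the same as a one-sample t-test on the differences.
diff_test <- t.test(post - pre, mu = 0)
diff_test$p.value

# For comparison, the unpaired defaults (paired = FALSE, var.equal = FALSE)
# treat the two vectors as independent samples with unequal variances.
unpaired_test <- t.test(post, pre, paired = FALSE, var.equal = FALSE)
unpaired_test$p.value
```

The paired version only makes sense when both vectors have the same length and the same ordering of subjects, which is why it fits a before-and-after design and not the central air comparison.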