In this video, I am going to show you how we can analyze the results from the wastewater treatment example we were considering in the prior module. Remember how tedious it was to analyze the results by hand? I am going to show you a really fast way of doing it with R in today's class. In the previous video, I showed you how we used the software, RStudio, to analyze the results for a very small two-factor system.

Open RStudio now, and start by creating a new file for this wastewater treatment example. And allow me to work backwards again. Start by creating a least squares model called "water", which is a linear model, where we predict the outcome value "y" from several factors. Remember that we had three factors in the water treatment example: "C", the chemical factor; "T", the temperature; and "S", the stirring speed. We also have several two-factor interactions: CT, CS, and ST. And there's also a three-factor interaction: CTS.

In the water treatment example, recall that we had eight experiments required for this system. We learned that we will always require at least as many experiments as there are parameters being estimated. For example, in the popcorn video, we had four parameters and four experiments. In this example, we have eight experiments, so we are able to estimate eight parameters. However, as before, we see only seven of them represented here: the main effects of C, T and S, the three two-factor interactions, and the three-factor interaction. The eighth parameter, the intercept, is built in. R will automatically calculate that for us. So there are actually eight parameters here in our linear model.

We need to define what C, T and S are, and also provide the outcome variable, "y". We can let R automatically create C, T and S, using the special form of the command shown on the screen. The first line of the code creates three variables in one line. If we inspect the variables, we can see that each one is simply -1, followed by +1.
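The on-screen command is not part of this transcript. As a minimal sketch of what it may have looked like, assuming that the "special form" refers to R's chained assignment, the three factors could be created and inspected like this:

```r
# Assumed reconstruction of the on-screen command: chained assignment
# gives all three factors the same two coded levels in a single line.
C <- T <- S <- c(-1, +1)

# Inspect the variables: each one holds the low (-1) and high (+1) level.
print(C)
print(T)
print(S)
```

One side effect worth knowing: assigning to `T` masks R's built-in shorthand `T` for `TRUE`, so it is safer to spell out `TRUE` and `FALSE` in the rest of such a script.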
Next, expand this into the standard order table, using the code shown here on the screen. We also need to extract from this the C, T, and S columns. These are the columns from our standard order table. Feel free to reuse this code at the start of every experimental analysis you do in R. The last step that we need is the "y" vector containing the eight outcome values. We can take these directly from the standard order table.

Now we are ready to let the software calculate the model. Run this code to create the linear model, and use the "summary" command to display the model on the screen. Notice that the parameters in this prediction model are identical to those we calculated in the prior module: 11.25, 6.25, 0.75, and so on.

I want to show you a quick shortcut that you can use in R. Instead of writing the linear model by hand, where you could make mistakes writing out all those two- and three-factor interactions, rather use this notation shown on the screen. This gives you exactly the same model as you had before.

Now it's time for some advice. Always perform your analysis using software code that you write out by hand. This is a permanent record of your work and is especially helpful if you add many comments.

Devon: Um, that seems like a lot of work. Couldn't I use Excel, or other statistical tools to, sort of, click a few buttons and get the same result?

Kevin: Absolutely, you can use those other software tools. The problem often is, and I've seen this happen so many times in companies that I've worked with, that you do the work and then a few months later you have to come back to it and try to answer questions from your boss. Or another colleague continues on with the project. If you only give them an Excel file, or some document that doesn't record the exact steps you took, it's very hard to reconstruct what you were doing, and what was going through your head at the time.
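A commented script of the kind described here might look like the following sketch. It is a reconstruction, not the on-screen code. In particular, the eight outcome values are not given in this transcript; the ones below are an assumption, chosen because they reproduce the coefficients quoted above (11.25, 6.25, 0.75). The names `design` and `water.short` are also illustrative.

```r
# Wastewater treatment example: full factorial in three factors,
# analysed with a least squares model.

# Coded levels for chemical (C), temperature (T) and stirring speed (S).
C <- T <- S <- c(-1, +1)

# Standard order table: expand.grid varies the first factor fastest.
design <- expand.grid(C = C, T = T, S = S)

# Extract the three columns again, now as vectors of length 8.
C <- design$C
T <- design$T
S <- design$S

# ASSUMPTION: outcome values in standard order. They are not stated in
# the transcript; these were picked to match the quoted coefficients.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# Linear model written out by hand: main effects, the two-factor
# interactions and the three-factor interaction. R adds the intercept.
water <- lm(y ~ C + T + S + C*T + C*S + S*T + C*T*S)
summary(water)

# Shortcut notation: C * T * S expands to the same set of terms.
water.short <- lm(y ~ C * T * S)
print(coef(water.short))
```

Because eight parameters are estimated from eight experiments, the fit is exact and `summary` reports that there are no residual degrees of freedom.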
Writing out the code like this explicitly creates a very traceable and reproducible record of your work. This is so important in many companies where there are regulatory requirements for traceability.

There is one other piece of code I would like to share with you to help you visualize each of the effects in the model. What I mean by this is that we have eight parameters, and sometimes we would like to know which ones are the most influential on our outcome variable. We can get an idea of the important factors in our system by examining the model equation in R and picking out the numbers that are the largest in magnitude. It's easy to visualize this, though, and here's some code that will create a bar plot.

The bar plot shows the absolute value of the model parameters. Why should we use the absolute values? We do this because we want to compare the magnitude of each factor. The sign is important for sure, but it is easier to compare large negatives and large positives if they're on the same side of the plot. In order to retain the sign information, I've used light grey for negative coefficients and black for positive coefficients. This is important: not everyone can perceive colour, or sometimes you have to print a report in black and white. You can modify the code to get alternative colour schemes though. R is really flexible in this way.

Now we need to interpret the plot. We can quickly see that the C times T times S interaction, and the CT and ST interactions, are really small compared to the other terms.

Devon: Why don't you show the intercept in the plot?

Kevin: There are several reasons. We always keep an intercept term in our models. The intent of this Pareto plot is for you to compare the effect of the various factors against each other. But the intercept isn't something that you can really change. These plots are often used to locate variables that are uninteresting and then remove them.
But since the intercept will always be important in our models, we will never remove it, so we don't really need to plot it. Furthermore, the intercept can sometimes be a really large value relative to the other bars, which will distort the plot.

The plot shows us the parameters, ranked from largest absolute value at the top to smallest absolute value at the bottom. This quickly allows us to find the most influential factors in our system. And this plot is called a Pareto plot. The largest magnitude bars correspond to those factors which most strongly affect the outcome. In this case, we have S. This colour here indicates that it has a reducing effect on the outcome. Remember that our objective was to reduce the amount of pollution. So we can quickly see here that increasing S will result in less pollution, which is desirable. We investigated the interpretation of this interaction term, CS, in the prior class, so I'm not going to repeat it over here. And similarly for the bar that represents the chemical effect, C.

Just before finishing this example today, I'd like to quickly share with you what the matrix form of this problem looks like. I know that there are those of you who are more math-oriented, and will like, and actually have a better understanding of, this representation. There's certainly something for everyone in this course. So there it is, the "X" matrix and the "y" vector. And what R is doing behind the scenes is efficiently finding the solution to the least squares problem.

To end this video, we hope that you are enjoying using R to fit your linear models and to make predictions of your outcome values. Please keep using it. There are plenty of practice problems in this module which you should be attempting. Those problems provide full R code in the solutions.
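The matrix form and the Pareto plot discussed in this video can both be sketched in a few lines of base R. This is an illustration under stated assumptions, not the on-screen code: the outcome values in `y` are assumed (they are not in the transcript, but are consistent with the quoted coefficients of 11.25, 6.25 and 0.75), the normal equations are solved directly for clarity, and the colour `grey80` is only one possible choice of light grey.

```r
# Rebuild the standard order table for the three coded factors.
design <- expand.grid(C = c(-1, +1), T = c(-1, +1), S = c(-1, +1))

# ASSUMPTION: outcome values in standard order, not from the transcript.
y <- c(5, 30, 6, 33, 4, 3, 5, 4)

# Matrix form: X has one column per parameter, intercept column first.
X <- model.matrix(~ C * T * S, data = design)

# Least squares solution of the normal equations (X'X) b = X'y.
b <- solve(t(X) %*% X, t(X) %*% y)
b <- setNames(as.vector(b), colnames(X))

# lm() arrives at the same numbers (it uses a QR decomposition internally).
water <- lm(y ~ C * T * S, data = design)
print(all.equal(b, coef(water)))

# Pareto plot: drop the intercept and rank by absolute value.
# barplot() draws the first bar at the bottom, so sort ascending
# to place the largest magnitude at the top.
coefs <- b[-1]
coefs <- coefs[order(abs(coefs))]

# Keep the sign information: light grey = negative, black = positive.
bar.col <- ifelse(coefs < 0, "grey80", "black")
barplot(abs(coefs), horiz = TRUE, las = 1, col = bar.col,
        xlab = "Magnitude of coefficient")
legend("bottomright", legend = c("Negative", "Positive"),
       fill = c("grey80", "black"))
```

Solving the normal equations directly is fine for a small, orthogonal design like this one; for general regression problems, `lm` is the more numerically robust route.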