For our last session this week, we're going to look at the salary-performance relationship in the NHL. We've looked at three leagues already: the NBA, the English Premier League, and Major League Baseball. Now we want to see whether we get the same relationship in the NHL. We're going to do the same thing: look at the relationship between win percentage and relsal, add in the lagged dependent variable, then add in fixed effects, and see how including those potentially omitted variables changes our results. Let's get down to it.
First, as always, import the packages that we might need, then load up the data, and let's look at it. We have fewer observations here, just 301. The data runs from 2009 to 2018, so we've got ten seasons, and we can use the ".info()" method to see what type each column is.
Let's calculate the total salary spending in each season. It's a little different from the leagues we've seen already. There isn't quite as much inflation as we've seen elsewhere. We have a big jump between 2011 and 2012, but since then it's been relatively stable. However, just to be consistent, we're going to continue to think about relative salaries rather than absolute salaries as determining performance, and that's really based on theory. We ought to expect that spending more than your rivals is what matters, rather than absolute spending. Even though we could probably get away with using the salary data in absolute form, it makes much more sense to include it in relative form. As before, we merge the total salary for the season back into our original dataset, then create the variable "relsal" as salaries divided by total salaries, and then we run a plot to see what that looks like. Again, a familiar plot, really, looking like the NBA and Major League Baseball data.
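The relative-salary step described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the lecture's notebook: the column names ("season", "Team", "salaries", "wpc", "allsal") and the toy figures are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data: 3 teams over 2 seasons (all figures are made up).
nhl = pd.DataFrame({
    "season": [2011, 2011, 2011, 2012, 2012, 2012],
    "Team": ["A", "B", "C", "A", "B", "C"],
    "salaries": [50.0, 60.0, 40.0, 66.0, 72.0, 62.0],
    "wpc": [0.50, 0.60, 0.40, 0.52, 0.58, 0.45],
})

# Total salary spending in each season ("allsal" is an assumed column name).
sumsal = (
    nhl.groupby("season")["salaries"]
    .sum()
    .reset_index()
    .rename(columns={"salaries": "allsal"})
)

# Merge the season totals back onto the team-season rows,
# then express each team's salaries as a share of that season's total.
nhl = pd.merge(nhl, sumsal, on="season", how="left")
nhl["relsal"] = nhl["salaries"] / nhl["allsal"]

print(nhl[["season", "Team", "salaries", "allsal", "relsal"]])
```

By construction, relsal sums to one within each season, which is what makes it comparable across seasons even when total spending jumps.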
There's a positive relationship in the scatter diagram: a very wide scatter, but still a clearly discernible upward trend of win percentage in relation to relsal. Let's run the regression and see what that shows us. As we would by now expect, the coefficient on relsal is large and highly significant: a large t-statistic, a p-value of 0.000, and an R-squared of 0.22. That's not that large, but for one variable alone it still indicates that relsal appears to play a significant role.
But we need to control for the possibility of omitted variables, just as we have in all of the other cases. Let's go on to look at that. We can organize the data and then create the lagged dependent variable using the ".shift()" method, and that's what we've done here. Now let's run our regression with the lagged dependent variable included. The lagged dependent variable, as always, is highly significant, and it has pushed down the coefficient on relsal, suggesting that it was a little bit overstated. But the fall in this case is once again not that big, and the coefficient is still highly significant. Notice that the increase in the R-squared is modest, but it has increased, so clearly the lagged dependent variable matters. It was good that we added it in: it was an omitted variable, and it has helped us to get a better estimate of relsal.
Now let's control for heterogeneity amongst the teams. The teams are different, and we should allow for that possibility by adding fixed effects. If we run the regression with fixed effects, we see quite a wide variation in the values of these fixed effects; some are statistically significant, some aren't. One thing to notice is that the R-squared has gone up quite a bit. Adding the fixed effects was probably a good move.
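The two steps just described, building a lagged win percentage and then adding team fixed effects, might look something like the sketch below. It runs on a synthetic panel rather than the NHL data, so the numbers it prints mean nothing in themselves. The column names ("Team", "season", "wpc", "relsal", "wpclag"), the grouping by team before shifting, and the use of `C(Team)` dummies in a statsmodels formula are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: 6 hypothetical teams over 10 seasons (not the real NHL data).
rng = np.random.default_rng(0)
teams = [f"T{i}" for i in range(6)]
seasons = range(2009, 2019)
df = pd.DataFrame(
    [(t, s) for t in teams for s in seasons], columns=["Team", "season"]
)
# relsal is drawn from the 0.02 to 0.05 range mentioned in the lecture.
df["relsal"] = rng.uniform(0.02, 0.05, len(df))
df["wpc"] = 0.25 + 8.0 * df["relsal"] + rng.normal(0, 0.05, len(df))

# Sort by team and season so that shifting within a team
# carries last season's win percentage forward by one row.
df = df.sort_values(["Team", "season"]).reset_index(drop=True)
df["wpclag"] = df.groupby("Team")["wpc"].shift(1)

# Regression with the lagged dependent variable only.
lag_model = smf.ols("wpc ~ relsal + wpclag", data=df).fit()

# Same regression plus one dummy per team (team fixed effects).
fe_model = smf.ols("wpc ~ relsal + wpclag + C(Team)", data=df).fit()

print("lag only:     ", lag_model.params["relsal"], lag_model.rsquared)
print("lag + team FE:", fe_model.params["relsal"], fe_model.rsquared)
```

Each team's first season has no lagged value, so those rows are dropped from both regressions. Because the two models use the same rows and the second only adds regressors, its R-squared cannot be lower; the question of interest is how much it rises and what happens to the relsal coefficient.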
The fixed effects are controlling for the heterogeneity in win percentage across the teams. Let's see what that's done to our coefficients on relsal and lagged win percentage. Now the lagged dependent variable has become insignificant. It doesn't appear to matter anymore. In some ways, we could now drop lagged win percentage, because once the fixed effects are included it is no longer significant. But the main variable of interest, relsal, is still highly significant. The coefficient has gone down a little, but not by that much, so we have reason to believe that we have a fairly reliable estimate of the impact of relsal.
If we now re-run the regression without the lagged dependent variable, since we found that it was insignificant once we added the fixed effects, and retain the fixed effects, then we get the following output. Again, when you drop an insignificant variable, the R-squared barely changes at all. That's because it was insignificant anyway. We find that our estimate of relsal again is not changed by very much. The lagged dependent variable really wasn't doing very much, or to the extent that it had an effect, it's better captured through the fixed effects.
The final thing to do is to ask, "Well, what is the impact of relsal on performance?" Again, we can look at a range of values. This is our equation from our regression, which defines the relationship between win percentage and relsal if we ignore the fixed effects for a moment: win percentage is 0.256 plus 8.76 times relsal. Relsal goes from a relatively low level of 0.02 in our data to a relatively high level of 0.05. Let's calculate what a low, medium, and high value of relsal would do for performance. You can see here that a low value would generate a win percentage of around 0.43, a high value around 0.65, and a moderate value around 0.54.
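The arithmetic behind those three figures can be checked with a few lines of Python. The intercept 0.256 and slope 8.76 come from the equation quoted above; the function name is hypothetical. The medium and high inputs (0.0325 and 0.045) are assumptions back-solved from the reported win percentages of 0.54 and 0.65, since the lecture does not state them exactly. Plugging in the quoted upper end of 0.05 would give roughly 0.69 rather than 0.65.

```python
def predicted_wpc(relsal, intercept=0.256, slope=8.76):
    """Predicted win percentage from the quoted equation, ignoring fixed effects."""
    return intercept + slope * relsal

# Low value is quoted in the lecture; medium and high are back-solved assumptions.
for label, relsal in [("low", 0.02), ("medium", 0.0325), ("high", 0.045)]:
    print(f"{label:>6}: relsal={relsal:.4f} -> wpc={predicted_wpc(relsal):.3f}")

# The quoted upper end of the relsal range, for comparison.
print(f"  0.05: relsal={0.05:.4f} -> wpc={predicted_wpc(0.05):.3f}")
```

Moving from the low to the high input raises predicted win percentage by about 0.22, which is the spread the comparison with other leagues rests on.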
The story is rather similar to the one we saw with Major League Baseball. The difference here is that the variation is much larger, suggesting that salaries are perhaps a little bit more like the NBA, though not quite as strong as the English Premier League, in terms of their capacity to affect the performance of teams.
Let's conclude by talking a little bit about what we've found this week. We've looked at ways of running regressions using salary data to assess the impact of wage spending on team performance, taking into account the possibility of omitted variables and heterogeneity in our data. Those are two of the things you can do in regression analysis. What you probably get a sense of here is that running regressions is really about feeling your way through the data, looking for relationships that have some plausible basis in theory, so that there's a reason to include these variables in your regression analysis, while also looking to find what has the best fit, what looks like the best relationship.
Now, some people might be dissatisfied with this process, because they're going to say, "Well, really, this is not scientific," in the sense that you're not really able to identify truly causal relationships in your data. There is a sense in which that critique is right. What we're looking at here is what's called observational data. We can only see what happened based on the particular events in history. What scientists tend to do in laboratory research is use experimental data where they can control the environment, so they decide what values particular variables are going to take, and then see what the impact is. That is, in some sense, a better way to infer causality. But it's just not a method that's open to us, because we're not able to use experimental data. We don't have experiments that we can run on players and leagues in order to see whether, say, team spending affects performance.
In our analysis, we have to do the best we can with our observational data in order to try to find what looks like a causal relationship. My view is that in order to make sense of it, you really need to have some underlying theory that informs your belief about the relationship. You can never be fully confident that your regression analysis has really captured the true causal link. You should always retain a certain level of skepticism. But if your analysis is founded on a sound theoretical appreciation of what's likely to be going on in the data, in other words, if you've really grasped what the relationships are, then you're going to be able to make some statements and come to some conclusions about what is causing what in your data. Now, whether you find that satisfactory or not is to some extent a matter of taste.
What we will do when we come to MOOC 3 is something completely different, which is actually try to forecast what's going to happen next. In some sense, that doesn't rely on any ability to identify causal relationships. It's a matter of grinding out the data. Some people are more comfortable with thinking of data analysis in those terms. My view is that these are two sides of the same coin. Really, one will produce better forecasts if one has a good theoretical explanation of the data, which can be supported by regression analysis or the kinds of statistical analysis we've been looking at here. But we'll come back to those issues in MOOC 3. For the time being, one thing I think we can at least conclude is that there is clearly a strong correlation between salary spending and team performance in the major leagues that we've looked at. That in itself is a significant conclusion.