In the last session, we looked at the pay-performance relationship in the NBA. We're now going to repeat the same regression analysis for the English Premier League, following exactly the same steps, just using a different dataset. Now clearly the English Premier League is organizationally quite different from the NBA, and there are two things to really bear in mind here. Firstly, in the NBA you have some restrictions on player spending. You have salary caps and you have a draft, which in some sense limit the competition for players. In the Premier League, there are very few restraints on competition at all: teams can hire as many players as they want and they can pay whatever salaries they want. There is no cap and there's no draft. In some sense, the market is much freer, and that's likely to have an impact on our regression analysis. The second thing to say is that the English Premier League, like most soccer leagues around the world, has a system of promotion and relegation. Each season you have new teams coming into the Premier League and teams that were there the previous season being relegated, which affects the turnover in the league. These things might affect some of the relationships that we see, and we're going to look at that as we go along.

Let's get started. As always, we start by loading the packages we need in order to do our statistical analysis, then we load the data for the English Premier League, and then we have a look at some descriptive statistics for that data. We have 380 observations. The data runs from 1997 to 2015, so we've got 19 seasons in the data. We can see what types of objects are in our data using the .info() command. Like we did with the NBA, let's start by looking at the total level of salaries in each season. Now, the numbers are very large, so Python actually returns the data in scientific notation.
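
As a rough illustration of these first steps, here is a minimal sketch using pandas on a tiny made-up table. The column names (season_yr, Club, salaries) and all the figures are assumptions for illustration, not the actual Premier League dataset. It also shows one way to switch the display away from scientific notation.

```python
import pandas as pd

# Hypothetical stand-in for the EPL data: one row per club per season.
# Column names and values are invented for illustration only.
epl = pd.DataFrame({
    "season_yr": [1997, 1997, 1998, 1998],
    "Club": ["A", "B", "A", "B"],
    "salaries": [12_500_000.0, 9_000_000.0, 15_000_000.0, 11_000_000.0],
})

epl.info()             # column types and non-null counts
print(epl.describe())  # descriptive statistics

# Total salary bill per season. Large floats may print in scientific notation.
sumsal = (
    epl.groupby("season_yr")["salaries"]
    .sum()
    .reset_index()
    .rename(columns={"salaries": "allsal"})
)

# Ask pandas to display floats with thousands separators and no decimals.
pd.options.display.float_format = "{:,.0f}".format
print(sumsal)
```

The display option only changes how the numbers are printed. The underlying values are unchanged.
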
Now this is a matter of taste, whether you're comfortable looking at scientific notation or not. But if you would rather see the numbers in conventional format, that's easily enough done in Python. We can change the format using the .format command, and you can now see the values for each season in conventional format. One thing you can see here is that over our 19 seasons we have exactly the same phenomenon that we saw with the NBA: significant increases in salaries from year to year. In fact, since 1997 total salaries in the Premier League have risen from 220 million British pounds to 2,031 million, which is over 2 billion pounds. That's an increase of more than nine-fold over those 19 seasons, so there is very significant salary inflation in our data. Again, it's not that the players have necessarily got that much better. They may have improved, since there has been an influx of talented foreign players into the English Premier League, but what matters in the process of competition is not how much you're paying in absolute terms, but how much you're paying compared to your rivals. There's been the same kind of inflation in revenues that we've seen in the NBA, and the increasing value of broadcast rights really accounts for a lot of this increase.

Again, as we did with the NBA, let's merge this back into the data so we can now calculate each team's salaries relative to total league spending in that season. Here you can see the data for all the teams, 380 rows. Let's give it the same name, relsal, which is the total salary spending of a team relative to the total for the league in that season. Now let's do a plot to see what that relationship looks like. Here's something to note in this plot: we see a very strong correlation in the data, but it's actually a negative relationship, since the higher relsal is, the lower the value of position. But think about it for a minute.
A smaller number for position means a better performance. One is better than two, two is better than three, and so on. So it's worth recognizing that this negative relationship actually conforms to our expectation that higher spending leads to better performance. If you find that slightly confusing, we can easily change it around by defining a new value of position, which is the negative of the position that we had before. If we do that and then run the plot, you can see that now you have what you might think of more conventionally as an upward-sloping relationship between relsal and our performance variable. This hasn't changed anything about the statistics or the relationship. It's just a change of sign, and sometimes that makes it more convenient to explain to people what is going on.

The other thing to say about this chart is that you can see some curvature in the data. What you tend to see is a lot of values at relatively low values of relsal with a relatively low position. Then up at the top, with high values of relsal, the position is concentrated in a very small range of values. You've got almost a dogleg here, a relationship which is non-linear. Now, we can linearize relationships simply by taking the logarithm of the values, and since this is linear regression that we're carrying out, that makes sense. If we take the logarithm of position here and then run the same plot again, we've in some sense linearized the data. You can see that it now looks much more like a straight-line relationship between relsal and league position. That is a transformation which doesn't fundamentally change the relationship in the data, but might give us a more reliable estimate of the relationship as we go along.

Now let's run our first regression. On the left-hand side we have the log of league position, and on the right-hand side we have relsal, the relative salary spending of the team.
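
The steps so far (merging season totals back in, computing relsal, flipping and logging position, and running the first regression) might be sketched as follows. This assumes the statsmodels formula interface and uses a synthetic league generated at random, so the column names, the variable lpos, and every number are illustrative assumptions rather than the actual data. It also assumes the transformed performance variable is the negative of the log of position, so that higher values mean better performance.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic league: 20 hypothetical clubs over 5 seasons (not real EPL data).
rows = []
for season in range(2001, 2006):
    salaries = rng.lognormal(mean=17.0, sigma=0.6, size=20)
    strength = np.log(salaries) + rng.normal(0, 0.4, size=20)
    position = (-strength).argsort().argsort() + 1  # rank 1 = strongest club
    for club in range(20):
        rows.append({
            "season_yr": season,
            "Club": f"club{club:02d}",
            "salaries": salaries[club],
            "Position": position[club],
        })
epl = pd.DataFrame(rows)

# Season totals, merged back so each row knows the league-wide salary bill.
sumsal = (
    epl.groupby("season_yr")["salaries"]
    .sum()
    .reset_index()
    .rename(columns={"salaries": "allsal"})
)
epl = pd.merge(epl, sumsal, on="season_yr", how="left")
epl["relsal"] = epl["salaries"] / epl["allsal"]

# Assumed transformation: negative log of position, so higher = better.
epl["lpos"] = -np.log(epl["Position"])

# A scatter plot could be drawn with epl.plot.scatter(x="relsal", y="lpos")
# if matplotlib is installed.

reg1 = smf.ols("lpos ~ relsal", data=epl).fit()
print(reg1.summary())
```

Because the data here are simulated, the coefficient and R-squared it prints say nothing about the Premier League. The point is only the shape of the workflow.
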
When we run that regression, we see something which is, on the face of it, quite similar to what we found in the case of the NBA. When we look at the coefficient, we find relsal has a large value, a small standard error, a very large t-statistic, and a p-value close to zero. But note here the R-squared. Remember, in the case of the NBA, the R-squared was less than 0.2, and here the R-squared is about 0.66. With this one variable alone, the relative salary, we're explaining something like 66 percent of the variation in team performance over this nearly 20-year period. It's really a much more powerful relationship, and as I said at the beginning, that's likely to be because there are fewer constraints in the market, so spending really does count much more heavily in determining outcomes.

Now, I'm suggesting here that this is a causal relationship, partly based on the theory that I suggested right at the beginning, which is that if you have more resources and spend more, you can hire better players because there's a market for players. That leads me to think that the direction of causation runs from salary spending to performance and not the other way around. Deciding the direction of causation is often a difficult problem. You may have heard the phrase "correlation does not mean causation", and here, strictly, we might say that this is a correlation rather than a causal effect. But there are good reasons, as I've suggested, to think that this is likely to be causal.

What we really want to do here is think about omitted variables and whether they might change our perspective, in the same way that we did with the NBA data. Let's look at the same possible omitted variables: lagged position and the fixed effects. Before we do that, let's look at one other thing that is present in the Premier League data and isn't present in the NBA data, which is the possibility of promotion.
In each season in our data, three teams are present that weren't present in the previous season. Let's see whether being promoted affects your league position. Many people think that promotion means you're going to have a weaker performance because you've just come up from a lower league. Let's see if that's true in the data by adding promotion to our regression. What this shows, on this data, is that being promoted has no effect on performance. In other words, promoted teams seem to do just as well as teams that were present in the league in the previous season. This in turn might be related to the salary effect. It's not really whether you were promoted or not, it's whether you're able to spend money on salaries. It might be that the teams that were promoted aren't able to spend as much on salaries, but that's already captured in our relsal variable. This regression is suggesting that there is no additional effect associated with promotion other than the effect that passes through the relsal variable. When you find a variable is insignificant, typically the response is to drop that variable from the regression analysis, and that's what we're going to do now. I'm not going to include it in our analysis any further.

What I want to do next is what we did with the NBA data, which is to add in the lagged dependent variable and see what effect that has on our relsal coefficient. We sort the data, so you can have a look at all of it, and use the .shift command to create the lagged dependent variable for the log of position. You can see here again that in the first season there's no value for the lagged dependent variable, because we don't have the data for the previous season, ending in 1996. Let's include the lagged dependent variable in our regression and see if that changes anything. Again, if we look at the coefficient on the lagged dependent variable, we find the same thing as we found with the NBA.
It is statistically significant and it has a p-value of zero, so it clearly should be included. It was an omitted variable, since it's significant now that we include it. As with the NBA, our coefficient on relsal has fallen, though not by as much, and it remains highly significant. But it clearly means that our initial estimate does look like it was an overestimate of the true value of the relsal coefficient. It's a good thing that we've added in this omitted variable, which gives us a better estimate of the coefficient on relsal.

But we can go one stage further and add the dummy variables for the clubs, the fixed effects. These have the advantage not just of adding potentially omitted variables, but of really accounting for the heterogeneity, the differences amongst teams, which are clearly important in soccer just as they are in any other sport. When we add those fixed effects, let's look at what our regression is. Let's look at the coefficient on relsal and see how the addition of the fixed effects has worked. In the NBA case, it actually increased the value of that coefficient. But here you can see that it's diminished it yet further, though not by much. Our coefficient on relsal is just under 11. It's still highly significant, with a p-value that gives us confidence that relsal has a statistically significant impact on performance.

But what I want to draw your attention to here is also the value of these fixed effects. You can see that the coefficients on these fixed effects are pretty much all negative, and pretty much all statistically significant. But there is an issue here about how these fixed effects are calculated. When you add fixed effects, there must be one base unit, and the fixed effects are measured in relation to that base unit. The selection of which team will be your base team, in this case, is going to be important. If you just run the fixed effects as we've run them here, Python is going to select the very first team in the list as the base-level team.
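
To make the lagged variable and the default base level concrete, here is a minimal sketch, again assuming the statsmodels formula interface and simulated data. The club names are real but every number attached to them is randomly generated, and the names lpos and lpos_lag are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
clubs = ["Arsenal", "Everton", "Fulham", "Leeds", "Stoke", "Wigan"]

# Simulated mini-league: 6 clubs over 10 seasons (all numbers are invented).
rows = []
for season in range(2001, 2011):
    sal = rng.lognormal(17.0, 0.6, size=len(clubs))
    score = np.log(sal) + rng.normal(0, 0.4, size=len(clubs))
    pos = (-score).argsort().argsort() + 1  # rank 1 = best
    rows += [
        {"season_yr": season, "Club": c, "salaries": s, "Position": p}
        for c, s, p in zip(clubs, sal, pos)
    ]
epl = pd.DataFrame(rows)
epl["relsal"] = epl["salaries"] / epl.groupby("season_yr")["salaries"].transform("sum")
epl["lpos"] = -np.log(epl["Position"])  # assumed: higher value = better finish

# Sort so that shift(1) within each club picks up that club's previous season.
epl = epl.sort_values(["Club", "season_yr"])
epl["lpos_lag"] = epl.groupby("Club")["lpos"].shift(1)

# C(Club) adds one dummy per club except the base level, which by default
# is the first level in sorted order (here "Arsenal").
# Rows with a missing lag (each club's first season) are dropped by the formula API.
reg = smf.ols("lpos ~ relsal + lpos_lag + C(Club)", data=epl).fit()
print(reg.params)
```

In the printed parameters there is no entry for Arsenal, because every other club's fixed effect is measured relative to it.
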
Since the teams are in alphabetical order, as you can see here, the first one happens to be Arsenal. Now, Arsenal is traditionally one of the strongest teams in the Premier League and outspends most of its rivals by quite a lot, so it's not a very good choice of base-level team. The fact that almost all the teams have negative coefficients just says that Arsenal is one of the biggest spenders. It would be better to select a team for our base level that's somewhere in the middle of the league. That will then give us a better sense of which teams are outperforming or underperforming the average level of performance of a team.

We can change the base-level team in Python, and I'm going to do that now. But first let's identify the mean for all of the teams. This is the mean league position in our data for all the clubs. Since there are 20 teams in the Premier League in any one season, a mid-level performance on average is going to be a position around 10th or 11th in the league. There are many teams that we could choose for that level, but let's choose Everton. Up here, Everton's average position over this data period is 10th. That means they could work quite nicely as a base-level team for us to compare to.

Here, when we now run the regression with the fixed effects, you can see that we've added this Treatment text and defined the name of the group which is going to be our base level, or treatment group if you like, and then everything will be estimated relative to that. If we run the regression with that, you can see that in fact most of the fixed effects are not statistically significant. There are a few that are outliers, and indeed you can see that Arsenal, right at the top here, is an outlier in our data and has a positive and significant fixed effect. That's because it has historically been an above-average team, and that is even allowing for its spending.
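
One way to express this change of base level, assuming the statsmodels formula interface with patsy's Treatment coding, is sketched below. The data are simulated, so the mean positions and coefficients it prints are not the real ones. The sketch also checks that switching the base club leaves the relsal coefficient unchanged.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
clubs = ["Arsenal", "Everton", "Fulham", "Leeds", "Stoke", "Wigan"]

# Simulated mini-league (all numbers invented for illustration).
rows = []
for season in range(2001, 2011):
    sal = rng.lognormal(17.0, 0.6, size=len(clubs))
    score = np.log(sal) + rng.normal(0, 0.4, size=len(clubs))
    pos = (-score).argsort().argsort() + 1  # rank 1 = best
    rows += [
        {"season_yr": season, "Club": c, "salaries": s, "Position": p}
        for c, s, p in zip(clubs, sal, pos)
    ]
epl = pd.DataFrame(rows)
epl["relsal"] = epl["salaries"] / epl.groupby("season_yr")["salaries"].transform("sum")
epl["lpos"] = -np.log(epl["Position"])  # assumed: higher value = better finish

# Mean league position per club, to help pick a mid-table base club.
print(epl.groupby("Club")["Position"].mean().sort_values())

# Default base level: first club alphabetically.
fe_default = smf.ols("lpos ~ relsal + C(Club)", data=epl).fit()

# Explicit base level: every fixed effect is now measured relative to Everton.
fe_everton = smf.ols(
    "lpos ~ relsal + C(Club, Treatment(reference='Everton'))", data=epl
).fit()

print(fe_everton.params)
print("relsal (default base):", fe_default.params["relsal"])
print("relsal (Everton base):", fe_everton.params["relsal"])
```

Only the intercept and the club dummies are re-expressed. The fitted values and the coefficients on the other regressors are the same in both versions.
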
Now, what we've done in changing the base-level team has made it possible for us to generate a better estimate of the fixed effects, but it hasn't changed anything about the coefficients of interest, in particular relsal. The value of the coefficient on relsal is exactly the same. That was just an issue about the fixed effects, not about the overall regression. The same thing is true for the lagged dependent variable.

We can now estimate, based on this, the impact that relsal has on the performance of teams, taking into account that we've allowed for potential omitted variables in the form of the lagged dependent variable and the fixed effects for each particular team. Our regression analysis tells us the relationship between league position and spending, and we can simply do the calculation. Since we took the logarithm of league position, in order to work out the impact of spending we have to take the exponent of the value of the coefficient times the amount of spending. We can do that here for three different levels of relsal: 0.02, 0.08, and 0.14. If we run that, we get these results, and these give us an estimate of the league position you'd expect of the teams. Now, one thing this tells us is that even with relatively low spending, you can expect a relatively high league position. Obviously with relatively high spending, you expect to come right at the top of the table, somewhere between first and second. There's something missing here, which might be related to the fixed effects, which we haven't included, and also to the impact of the lagged dependent variable, which affects the spending over the longer term. If we're interested in modeling this relationship, we still might want to do more and consider more variables and more possibilities.
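
The back-transformation from the log scale might be sketched like this. It assumes the dependent variable was the negative of the log of position, and it folds everything other than relsal (intercept, lag term, fixed effects) into a single constant. The intercept and slope below are hypothetical placeholders, not the estimates from the regression above.

```python
import numpy as np


def predicted_position(relsal, intercept, slope):
    """Invert lpos = -log(position) = intercept + slope * relsal."""
    return np.exp(-(intercept + slope * relsal))


# Hypothetical values for illustration only, NOT the lecture's estimates.
intercept = -3.0
slope = 10.0

for share in (0.02, 0.08, 0.14):
    pos = predicted_position(share, intercept, slope)
    print(f"relsal = {share:.2f} -> predicted position {pos:.1f}")
```

With a positive slope, a larger share of league spending gives a smaller predicted position number, that is, a higher finish in the table.
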
But certainly this has given us at least a very strong sense of the significant role that spending plays in the determination of performance in the English Premier League. Now we've looked at two leagues: the NBA, and a very different league in the English Premier League. Next, let's look at a third league, as we turn our attention to Major League Baseball.