What we're going to do in this section is generate some predictions within sample. What that means in this case is using data which relates to events that have already happened, but dividing that data into two groups: a first set of training data, if you will, which we're going to use to estimate the underlying relationship, and then the remaining data, which we use to see whether the model works well in the context of data that, if you like, the model hasn't seen before. Then we see how well our model performs. We can now use the Transfermarkt data, because we've tested that it's reliable relative to the wage data, which we've already checked up on. But we also want to compare how our model performs relative to the bookmakers. So as well as generating our model, we're going to see how reliable bookmaker predictions are in terms of forecasting results, and in fact that's the step we're going to take first. There are going to be three steps in our process here. The first step is to calculate the accuracy of the betting odds in the context of English Premier League games, which is rather similar to what we were doing in the previous week of this course. The second step is to calculate our own predictions using a regression model, an ordered logistic regression model using the TM values. And the third step is to compare the reliability of the betting odds with our model: first checking how good each is at predicting the specific outcome of a game, and then checking the Brier score, okay? So first, as always, we'll import the packages that we need, and then we'll read the datasets that we're going to use. These are the two data files for this session: the TM data that we've already looked at, and now a dataset of game-by-game data for the Premier League over a period of eight seasons.
So if we look at this file, gbygdat, you can see it has the season, the name of the home team, the name of the away team, the home team goals at full time (that's FTHG), the away team goals at full time (that's FTAG), and the full-time result (FTR), which will be A for an away win, H for a home win, or D for a draw. Then come the betting odds taken from the bookmaker bet365: the odds of a home win, the odds of a draw, and the odds of an away win. Bear in mind that these are decimal odds, so the probabilities are related to one divided by the odds. In this dataset we have 3,420 games played over those seasons. So first, to calculate the accuracy of the betting odds, let's begin by calculating the probabilities. Bear in mind that, as I said, when the odds are expressed in decimal form the implied probability is one divided by the odds. However, we have to adjust for the overround, so we have to scale by the sum of the inverse odds in order to get a true probability. You can see here that we've derived the probabilities of a home win, a draw, or an away win from the betting odds. We also want to know what is considered the most likely outcome, given the betting odds. We can do that by defining the predicted outcome as whichever of the three outcomes has the highest probability. We do that here: you can see B365res says D if the highest probability was a draw, H if the highest probability was a home win, and A if the highest probability was an away win, okay? Having done that, we can now create a variable which says when the betting odds were correct in terms of predicting the result. We want to match up the column FTR, the actual result, with B365res, the result predicted from the betting odds, and assign a value of one when these two letters match up and a value of zero when they don't.
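The odds-to-probability step and the accuracy flag can be sketched like this; the column names (FTR, B365H, B365D, B365A) follow the conventions described above, but the three rows of data are invented for illustration:

```python
import pandas as pd

# A toy game-by-game frame; column names follow the session's
# conventions, but the rows are made up.
gbygdat = pd.DataFrame({
    'FTR':   ['H', 'D', 'A'],
    'B365H': [1.50, 2.10, 4.00],
    'B365D': [4.00, 3.30, 3.60],
    'B365A': [6.50, 3.40, 1.90],
})

# Inverse decimal odds include the bookmaker's overround, so we scale
# by their row sum to get probabilities that add up to one.
inv = 1 / gbygdat[['B365H', 'B365D', 'B365A']]
probs = inv.div(inv.sum(axis=1), axis=0)
gbygdat['prH'] = probs['B365H']
gbygdat['prD'] = probs['B365D']
gbygdat['prA'] = probs['B365A']

# The implied prediction is the outcome with the highest probability;
# B365true flags whether it matched the actual full-time result.
pick = gbygdat[['prH', 'prD', 'prA']].idxmax(axis=1)
gbygdat['B365res'] = pick.map({'prH': 'H', 'prD': 'D', 'prA': 'A'})
gbygdat['B365true'] = (gbygdat['B365res'] == gbygdat['FTR']).astype(int)
```

The mean of `B365true` then gives the bookmaker's hit rate directly.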
So we can now see that, check how many results we've got, and look at what the data looks like. But the thing we're interested in is how often the betting odds were correct. You can see here that if we just calculate the mean of this variable, B365true, it is just under 54%, so the bookmakers got it right about 54% of the time. Remember that what we learned in the previous week was that in a three-outcome league like this, with home win, away win, or draw as the possible results, you can work out what your Brier score would be if you picked at random. And if you think about what your success rate would be with three possible outcomes, where you have to name which outcome you are selecting, then your expected success rate in that world should be around one third: you should be right one third of the time. So the fact that the betting odds were right about 54% of the time suggests a reasonably good performance compared to selecting at random. What's also interesting is to run a groupby command on the mean, just to see how the results vary depending on the type of game we're looking at. This summarizes all of our variables based on whether the bookmakers' prediction was correct or not. In the first row the bookmakers got it wrong; in the second row they got it right. Notice that in the games where the bookmakers got it wrong, the home team scored just over one goal on average and the away team scored 1.38 goals on average, so on average those look like fairly close games. If we now compare what happened when the bookmakers got it right, the average number of goals scored by the home team was just under two, whereas the average number of goals scored by the away team was only just over one.
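The groupby comparison works like this; the goal counts here are invented, but the pattern of the command is the one described above:

```python
import pandas as pd

# Toy game data: FTHG/FTAG are full-time home/away goals, and
# B365true flags whether the bookmaker's favourite actually won.
games = pd.DataFrame({
    'FTHG':     [1, 0, 2, 3, 2],
    'FTAG':     [2, 1, 1, 0, 1],
    'B365true': [0, 0, 1, 1, 1],
})

# Average goals in games the bookmakers called wrong (0) vs right (1).
summary = games.groupby('B365true')[['FTHG', 'FTAG']].mean()
```

Each row of `summary` is then the average over one group, which is what lets us compare the typical "wrong" game against the typical "right" game.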
So one thing you can see here is that the bookmakers look pretty good at predicting games where the home team wins, and wins fairly comfortably. Where they're less good is where the average number of goals of the away team is higher, so that these games are more likely to be away wins, and where the game is very close. That tells us a little bit about what makes for successful bookmaking odds. Now we want to look at the Brier score. To generate the Brier score, we need a value for the outcome of each game. So for the actual outcome we define an H outcome, D outcome, and A outcome variable, and we've done that here, with a value of one if that was the outcome and zero if it was not. From this we can calculate the Brier score as the sum of squared differences between these outcome variables and the bookmaker probabilities. The advantage of this is that it takes into account not just whether the bookmaker picked the right outcome, but how close the probabilities were to the correct outcome. You can see a Brier score of 0.568. Again, remember we said last week that if the probabilities were chosen at random, we would expect to see a Brier score of around 0.66, so the bookmaker is better than a random forecast. Remember, a lower Brier score means better performance, okay? So that's the performance of the bookmakers. Step two, then, is to generate our own model and compare that performance against the bookmakers. The model we're going to use is essentially the same one as we used in the first course in this series, when we thought about relative wages as a determinant of outcomes. We want to say that the outcome of each game depends on the relative TM values of the two teams.
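The Brier score calculation can be sketched as follows; the probabilities are invented, and the outcome-column names (`Hout` and so on) are my own labels for the H/D/A outcome variables described above:

```python
import pandas as pd

# Toy forecast probabilities and actual results (rows invented).
df = pd.DataFrame({
    'FTR': ['H', 'D', 'A'],
    'prH': [0.60, 0.40, 0.20],
    'prD': [0.25, 0.35, 0.30],
    'prA': [0.15, 0.25, 0.50],
})

# One-hot encode the actual outcome: 1 if it happened, 0 if not.
for res in ['H', 'D', 'A']:
    df[res + 'out'] = (df['FTR'] == res).astype(int)

# Brier score per game: sum of squared differences between the
# forecast probabilities and the realised outcomes.
df['brier'] = ((df['prH'] - df['Hout']) ** 2
               + (df['prD'] - df['Dout']) ** 2
               + (df['prA'] - df['Aout']) ** 2)
```

Averaging `df['brier']` over all games gives the overall score, and lower is better.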
So a team is more likely to win if its TM value is higher than its opponent's. We should also take into account home advantage, and that will arise naturally in our regression model; we'll look at that in a minute. But the first thing to do is to generate the ratio of the TM values for the two teams. At the moment we don't have the TM values in our data. We need to merge those in, and, as we did in the previous video, we need to merge them on some index which is common across both datasets. So we're going to create an identifier for the home team and the away team, based on the name of the team, and then we're going to add to that the season in which the team was playing, attaching the season as a string rather than as an integer. We tell Python to do that using .map(str). Now bear in mind that each team plays each other team twice, once at home and once away, so each game here will be uniquely identified. If this were baseball, for example, where you can play the same team at home several times in a season, you would need a better identifier; you would have to use the specific date of the game. But here we can just use the season, since in each season each pairing is unique. So we generate those team IDs, and we now do the same thing in the TM data, which we've already loaded. Remember, there we just have a list of TM values for each team, and what we've done is to create an ID for each team both as a home team and as an away team. Of course it's the same TM value whether you're the home team or the away team, but we need to match it with the gbygdat file in a way that associates the TM value with whether the team in question was the home team or the away team. That's why we've duplicated these identifiers and called one htmid
and the other atmid, even though they're actually identical. Now the first thing we do is merge the home team IDs across the two data frames: we match on the home team ID and bring in the TM value. We then rename TMvalue in the merged data frame, which is the gbygdat data frame, the data frame with all of the games in it, because we need to denote that this is the TM value for the home team; so we call it HTM. When we do that, you can see we have the HTM value: in the first row, this TM value relates to Aston Villa playing against Arsenal in this season. And now we repeat the merge, but on the away team ID, so it takes the same TM value but associates it with the away team, and again we rename that TM value ATM, to denote that it's the TM value for the away team. If we run this, we can see the away team values, with the ATM value for Arsenal in 2011 in the first row. Now that we've got the TM values merged in for each team in each game in our dataset, we can create a ratio of the team values, which we're going to use as a predictor of performance in the game. And we'll take the log of that ratio: the algorithm is going to work better, as it's often the case that logarithms work better in situations where you have potentially very large differences in the sums of money involved. And since we don't have any negative sums of money involved, there's no problem with taking logarithms here. So we've got the log of the TM ratio, and before we run our regression, note that the regression is going to say that the result of the game is a function of the log of the TM ratio of the two teams.
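The whole ID-building and double-merge step can be sketched like this; the team names, seasons, and TM values are invented, and the frame and column names (htmid, atmid, TMvalue) follow the conventions used above:

```python
import numpy as np
import pandas as pd

# Toy versions of the two frames; all values invented for illustration.
gbygdat = pd.DataFrame({
    'season':   [2011, 2011],
    'HomeTeam': ['Aston Villa', 'Arsenal'],
    'AwayTeam': ['Arsenal', 'Aston Villa'],
})
tmdat = pd.DataFrame({
    'team':    ['Arsenal', 'Aston Villa'],
    'season':  [2011, 2011],
    'TMvalue': [300.0, 150.0],
})

# Identifiers: team name plus the season converted to a string
# via .map(str). Each pairing is unique within a season.
gbygdat['htmid'] = gbygdat['HomeTeam'] + gbygdat['season'].map(str)
gbygdat['atmid'] = gbygdat['AwayTeam'] + gbygdat['season'].map(str)

# The same TM value gets both a home id and an away id, so it can
# be merged in twice, once for each side of the fixture.
tmdat['htmid'] = tmdat['team'] + tmdat['season'].map(str)
tmdat['atmid'] = tmdat['team'] + tmdat['season'].map(str)

# Merge on the home id and rename the value HTM, then repeat for ATM.
gbygdat = gbygdat.merge(tmdat[['htmid', 'TMvalue']], on='htmid')
gbygdat = gbygdat.rename(columns={'TMvalue': 'HTM'})
gbygdat = gbygdat.merge(tmdat[['atmid', 'TMvalue']], on='atmid')
gbygdat = gbygdat.rename(columns={'TMvalue': 'ATM'})

# Log of the home/away value ratio as the match-level predictor.
gbygdat['lnTMratio'] = np.log(gbygdat['HTM'] / gbygdat['ATM'])
```

A positive `lnTMratio` means the home squad is the more valuable one, so home advantage and squad strength both end up on the right-hand side of the regression.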
So in the regression model we need a win value, and this has to do with the way the ordered logit regression package is set up in Python. Specifically, the win value is going to be two if the home team wins, one if it's a draw, and zero if the away team wins. So we run that to create the win value, and that's going to be the dependent variable in our regression: we're going to regress the win value on the log of the TM ratio. I'm also going to drop some columns here because we don't need them in our data; that just makes the data a little smaller and a little easier to see what's going on.
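The ordered coding of the result can be sketched like this, on an invented set of scorelines:

```python
import numpy as np
import pandas as pd

# Toy full-time scores (rows invented).
gbygdat = pd.DataFrame({'FTHG': [2, 1, 0], 'FTAG': [0, 1, 3]})

# Ordered coding of the result: 2 = home win, 1 = draw, 0 = away win.
gbygdat['winvalue'] = np.where(gbygdat['FTHG'] > gbygdat['FTAG'], 2,
                      np.where(gbygdat['FTHG'] == gbygdat['FTAG'], 1, 0))
```

This ordered variable can then serve as the dependent variable in an ordered logit, for instance via `OrderedModel` in `statsmodels.miscmodels.ordinal_model`; which package the course actually uses is an assumption on my part here.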