Let's start with the NHL data. In this notebook, we'll use the 2016 regular season records from the NHL. Our predictor variable will be the log of the salary ratio between the two teams in a match; let's call this the salary model. Once we fit the salary model, we will evaluate its performance by comparing the fitted values against the actual outcomes, as we did previously. Then we will build another forecasting model using betting odds data. Again, we will use the money line, a form of betting odds obtained from the web, and we will evaluate the betting odds model in the same way, by comparing the fitted values against the actual outcomes. Lastly, we will compare the two forecasting models with respect to their accuracy rate and Brier score. Before we start, note that the salary information for NHL players was obtained from the website below. The data were available before the regular season started, so they can be used to predict game results for that season in advance. In particular, we used the average salary column on the page. Before we proceed to the data analysis, as always, we are going to import all the libraries, such as pandas, NumPy, and so forth. Then let's expand the display so that we can see more columns. Now, let's import the data. We are going to import two datasets: the NHL dataset and the salary dataset obtained from spotrac.com. Now, let's explore the data. Here we display the two datasets. While it's always necessary to explore a dataset, I'm not going to explore the NHL dataset again, because we already explored it in the previous courses, so I'll skip the exploratory analysis for it. Let's just take a look at the salary dataset that we obtained from the web and see how it is structured.
Here we see the salary of each player on each team. We will aggregate these salaries to the team level later on, because we only need team-level information, not individual player-level information. By using the .shape attribute, you can obtain the number of observations and the number of variables in the salary data, and we can also obtain some simple descriptive statistics. We can see the total number of rows, the mean salary among the players in 2016, the standard deviation, the minimum and maximum salary in the league, and some useful quantile values in the DataFrame as well. We can take a look at the column [inaudible] in each dataset. Like I said, we are going to use the 2016 regular season records, so we extract those records here and drop all the unnecessary columns. Let's take a look at the resulting DataFrame. Here you can see hgd, which is the home goal difference, and whether the team played at home or away. You can also see the team ID attached to each team in the league, the binary win variable, and the ordinal-scale win variable. We can use these records to create the extra columns. Another piece of information we need at this stage is the salary of each of the two teams in a match. To get that, we have to match the columns across the two data frames. We are going to change the column names in the salary data frame so that we can pull the salary information for each team. Now, like I said, we aggregate the salary data to the team level. Here you can see the total salary spent by each team in the NHL. Next, we attach this salary information to the teams in the original dataset by merging the resulting salary data into the NHL 2016 data.
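The aggregation and merge steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`team`, `avg_salary`, `tid`, `home_away`, `salary`) and all the numbers are made up for the example.

```python
import pandas as pd

# Hypothetical player-level salary rows (names and values are assumptions).
salary = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "avg_salary": [5.0, 3.0, 4.0, 2.0, 1.0],  # in millions, made up
})

# Hypothetical game rows: one row per team per game.
games = pd.DataFrame({
    "gid": [1, 1, 2, 2],
    "tid": ["A", "B", "B", "A"],
    "home_away": ["home", "away", "home", "away"],
})

# Rename so the key column matches the game data, then sum salaries per team.
salary = salary.rename(columns={"team": "tid", "avg_salary": "salary"})
team_salary = salary.groupby("tid", as_index=False)["salary"].sum()

# Attach each team's total salary to every game row for that team.
games_salary = games.merge(team_salary, on="tid", how="left")
print(games_salary)
```

A left merge keeps every game row even if a team were missing from the salary table, which makes unmatched teams easy to spot as missing values.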
As a result, you can see the salary information of each team. Now let's move on to the next stage and take a close look at the way each game is structured in our data frame. As you can see, the data are structured separately for the home and away teams, so there are two rows for each game, one for home and one for away. If this is confusing, take a look at gid, the game ID column of the data frame, and you will notice that each game ID appears twice, because each row records the variables for the home or the away team separately. This means that we need to place the away team's salary next to the home team's salary in the corresponding match, so that we can use the salary ratio as an independent variable for forecasting. We are going to extract the home team records and the away team records separately from the original dataset. Take a look at the resulting data frames here. You can see the home team records, including the salary of the home team in the match, and you can see the corresponding records for the away team as well. Now we are going to attach the away team's salary next to the home team's salary. This way we can obtain the salary ratio between the home and away teams, so that eventually we can use the log-scaled salary ratio between the two teams in a match as an independent variable for forecasting. First of all, we drop all the unnecessary variables, so that here we have the team ID of the away team and the salary of that team in the specific game. We are going to use gid, the game ID, as the matching column, so that we can attach the salary information of the away team to the home team record. The data frame now contains the salary data for both the home and away teams in the same row. Note that the variables we added have the same names as the existing ones.
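The home/away reshaping just described can be sketched like this. It is a minimal illustration: the frame and its column names (`tid`, `home_away`, `salary`) are hypothetical, while `gid` and `hgd` follow the names mentioned in the lecture.

```python
import pandas as pd

# Hypothetical long-format data: two rows per game (values are made up).
games = pd.DataFrame({
    "gid": [1, 1, 2, 2],
    "tid": ["A", "B", "B", "A"],
    "home_away": ["home", "away", "home", "away"],
    "salary": [8.0, 7.0, 7.0, 8.0],
    "hgd": [2, 2, -1, -1],
})

# Split into home rows and away rows; keep only what we need from the away side.
home = games[games["home_away"] == "home"]
away = games[games["home_away"] == "away"][["gid", "tid", "salary"]]

# Merge on the game ID so both teams' salaries sit in the same row.
# Overlapping column names receive pandas' default suffixes _x (left) and _y (right).
wide = home.merge(away, on="gid")

# Rename the suffixed columns to something more readable.
wide = wide.rename(columns={
    "tid_x": "home_tid", "salary_x": "home_salary",
    "tid_y": "away_tid", "salary_y": "away_salary",
})
print(wide[["gid", "home_tid", "away_tid", "home_salary", "away_salary", "hgd"]])
```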
pandas distinguishes these two variables by appending the suffix _x for the home team records and _y for the away team records. We can simply rename each variable below. Having changed all the column names, we now obtain the log-scaled home team salary and the log-scaled away team salary. Take a look at the resulting data frame here. We have everything we need to calculate the log-scaled salary ratio. We drop all the unnecessary variables again, and take a look at the final data frame. We have the log-scaled home team salary and the log-scaled away team salary for each match. Now we are done with the data cleaning and data organization, and we can fit the forecasting models. For the first forecasting model, we are going to fit an OLS model using the home goal difference as the dependent variable. Notice that we fit an OLS model, not a logistic regression model, because the dependent variable is measured on a continuous scale, as you may recall. Like I said, we are going to use the log-scaled salary ratio between the two teams in a match as the independent variable, so first we compute that ratio for each match. Then let's fit the OLS model using the home goal difference as the dependent variable by running the regression here. You can see the regression coefficient attached to the independent variable, that is, the log-scaled salary ratio between the two teams in a match. Once you obtain the regression coefficient, one thing you have to check is whether or not it is statistically significant. You can also take a look at the t statistic here, and you can see the R-squared value, which is 0.02.
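As a rough sketch of this step, the log-scaled salary ratio is just the difference of the two log salaries, and a one-variable least-squares line can be fitted to it. Note the assumptions: the data are made up, the column names are hypothetical, and `numpy.polyfit` is used here only as a simple stand-in for whichever regression routine the notebook actually calls (it returns the slope and intercept but no significance tests).

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per game (values are made up).
df = pd.DataFrame({
    "home_salary": [80.0, 60.0, 70.0, 65.0, 75.0],
    "away_salary": [60.0, 80.0, 70.0, 75.0, 65.0],
    "hgd": [2, -1, 0, -2, 1],
})

# log(home / away) equals log(home) - log(away).
df["log_home"] = np.log(df["home_salary"])
df["log_away"] = np.log(df["away_salary"])
df["lsr"] = df["log_home"] - df["log_away"]

# Least-squares line hgd = intercept + slope * lsr (stand-in for the OLS fit).
slope, intercept = np.polyfit(df["lsr"], df["hgd"], 1)

# Fitted goal differentials from the linear model.
df["fitted_gd"] = intercept + slope * df["lsr"]
print(round(slope, 3), round(intercept, 3))
```

A positive slope here means that a higher home-to-away salary ratio goes with a larger fitted home goal difference.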
We're going to use this model to get the predictions, I mean the fitted values, for forecasting the game results. But before we do that, let's plot the two variables together. Here you can see the relationship between the home goal difference, the dependent variable, and the log-scaled salary ratio between the two teams, and we can also draw the best-fitting line on the scatter plot. Even though the relationship is not very strong, as you can see from how widely the dots are scattered, we can still see a positive relationship between the two. Now, we are going to obtain the fitted values. At the far right of the DataFrame, you can see the fitted values from the regression model. The fitted goal differentials are expressed as decimals, because they are the linear predictor of the model as a function of the salary ratio. We can use this column to create another set of fitted results that classify whether or not the home team wins the game. As a classification rule, if the fitted goal differential is greater than zero, we consider it a home team win, and if it is smaller than zero, we consider it a home team loss. As you can see in the resulting DataFrame, we now have the fitted results derived from the fitted goal differentials. Now we can evaluate how accurate the model is by comparing the fitted results against the actual outcomes. For the actual outcome we use win.ord, and we save the result of the comparison into a new column. Once you run this line of code, [inaudible], you can get the success rate, that is, how accurate our salary ratio model is in predicting the actual game results, by dividing the number of correct predictions by the total number of games.
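The classification rule and the success rate can be sketched as follows. The fitted values and outcomes below are invented for illustration, the column names other than `win.ord` are hypothetical, and the coding of `win.ord` as 0, 0.5, and 1 is an assumption. A fitted value of exactly zero is treated as a loss here, which the lecture does not specify.

```python
import numpy as np
import pandas as pd

# Hypothetical fitted goal differentials and actual ordinal outcomes (made up).
df = pd.DataFrame({
    "fitted_gd": [0.4, -0.2, 0.1, -0.3, 0.05],
    "win.ord": [1.0, 0.0, 0.5, 0.0, 0.0],
})

# Rule: fitted goal difference above zero means a predicted home win (1), else a loss (0).
df["fitted_res"] = np.where(df["fitted_gd"] > 0, 1.0, 0.0)

# Flag each game where the prediction matches the actual outcome.
df["correct"] = (df["fitted_res"] == df["win.ord"]).astype(int)

# Success rate: correct predictions divided by the total number of games.
accuracy = df["correct"].sum() / len(df)
print(accuracy)
```

Because this rule only ever predicts a win or a loss, any game whose actual outcome is the middle category counts as a miss.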
The GD model predicted 56.9 percent of the game results correctly. Now we move on to the next model, an ordinal logistic regression, in which the ordinal-scale win variable serves as the dependent variable. Before we do that, we have to import the bevel library. Then we specify the model, in which the log-scaled salary ratio between the two teams serves as the independent variable and the ordinal-scale win variable serves as the dependent variable. Because the resulting table doesn't give a clear picture of the regression results, we obtain all the parameters by running this line of code as well. We can see the main parameters of the model: the regression coefficient for the log-scaled salary ratio between the two teams in a match, and the thresholds between the possible outcomes, here loss to draw and draw to win. With the ordinal regression model, we first obtain the linear predictor for each outcome. The linear predictor is expressed on the logit scale, so we transform the logit values back to probabilities, and then we use those probabilities to assign the predicted outcomes from the model. That's basically how we are going to predict the game results. First of all, we obtain the fitted probabilities by applying the regression coefficient estimated above from the ordinal logistic regression model. In the last three columns, created by the Python code above, you will see the predicted results from the model for each match. These are the predicted probabilities of each outcome, and all of them are given from the home team's perspective. Using the predicted probabilities, we are going to classify the game results.
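The transformation from coefficient and thresholds to outcome probabilities can be sketched as below. This assumes the common proportional-odds form, where P(outcome <= k) is the logistic function of (threshold_k - beta * x); the sign convention of a particular library may differ. The function name `ordered_logit_probs` and the parameter values are hypothetical and are not the estimates from the lecture.

```python
import numpy as np

def logistic(z):
    """Inverse logit: maps a logit value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def ordered_logit_probs(x, beta, cut_low, cut_high):
    """Return columns [P(loss), P(draw), P(win)] for each value of x.

    cut_low is the loss-to-draw threshold, cut_high the draw-to-win threshold
    (requires cut_low < cut_high).
    """
    z = beta * np.asarray(x, dtype=float)
    cdf_low = logistic(cut_low - z)    # P(outcome <= loss)
    cdf_high = logistic(cut_high - z)  # P(outcome <= draw)
    return np.column_stack([cdf_low, cdf_high - cdf_low, 1.0 - cdf_high])

# Made-up parameters and log salary ratios, for illustration only.
probs = ordered_logit_probs(x=[-0.3, 0.0, 0.3], beta=1.2, cut_low=-0.4, cut_high=0.2)
print(probs.round(3))
```

Each row sums to one, and with a positive coefficient the win probability rises as the home team's salary advantage grows.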
The classification rule is basically that we pick the outcome with the highest probability out of the three possible outcomes. We have the fitted outcomes here, and we compare them against the actual outcome, win.ord, which was our dependent variable in the model in the first place. This tells us how accurate our forecasting is. Again, we get the total number of correct predictions from the fitted model, and then we get the success rate by dividing the number of correct predictions by the total number of games. The salary ratio model predicted 57.1 percent of the game results correctly.
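The highest-probability rule and the resulting success rate can be sketched like this. The probabilities, the column names `p_loss`, `p_draw`, and `p_win`, and the coding of `win.ord` as 0, 0.5, and 1 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities (home perspective) and actual outcomes.
df = pd.DataFrame({
    "p_loss": [0.30, 0.50, 0.35, 0.20],
    "p_draw": [0.25, 0.20, 0.40, 0.20],
    "p_win": [0.45, 0.30, 0.25, 0.60],
    "win.ord": [1.0, 0.0, 1.0, 1.0],
})

# Assumed coding of the ordinal outcome: loss = 0, draw = 0.5, win = 1.
outcome_codes = np.array([0.0, 0.5, 1.0])

# Pick the outcome with the highest predicted probability in each row.
best = df[["p_loss", "p_draw", "p_win"]].to_numpy().argmax(axis=1)
df["fitted_outcome"] = outcome_codes[best]

# Compare against the actual outcome and compute the success rate.
df["correct"] = (df["fitted_outcome"] == df["win.ord"]).astype(int)
success_rate = df["correct"].sum() / len(df)
print(success_rate)
```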