Here is our last model, using 2018 NBA data. We will follow the same analytical steps in this notebook: the goal of the analysis is to fit a forecasting model using the salary information available before the regular season starts, then use the betting odds data to obtain predictions of the game results as well, and finally compare the performance of the two models with respect to their accuracy rates and their Brier scores. Again, the salary information for the NBA teams was obtained from the website below; if you would like to take a look at the data set, just click on the link here and see how the data is structured. First of all, we import all the libraries such as pandas and NumPy before proceeding to the data analysis, and then we import both data sets, the NBA data set and the salary data set. Now, let's expand the display mode to see the results more clearly, and then take a look at the data. We have been working with the NBA data set a lot, so I assume you are familiar with how it is structured. Here is the salary data for the teams in 2018 as well. Let's take a look at the salary data and obtain some simple statistics. We have a total of 30 observations, meaning we have the salary information for the 30 teams in the league, and we can see the mean salary, the standard deviation, and the minimum amount spent by a team, which I believe is the Chicago Bulls, as well as the team that spent the most on player salaries, which is the Cleveland Cavaliers. That makes sense: in 2018 the Cleveland Cavaliers had the most expensive player in the league, LeBron James. Let's clean the data before we fit the regression model. This proceeds in several steps.
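The setup described above can be sketched as follows. The file names are assumptions about how the notebook stores the data, and the salary figures below are illustrative stand-ins, not the real 2018 numbers:

```python
import numpy as np
import pandas as pd

# Widen the display so wide DataFrames are not truncated in the notebook
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)

# Hypothetical file names -- substitute the paths used in your notebook:
# nba18 = pd.read_csv("nba_18.csv")      # game-by-game records
# salary = pd.read_csv("salary_18.csv")  # one row per team

# Illustrative salary frame standing in for the real 30-team data
salary = pd.DataFrame({
    "team": ["CHI", "CLE", "ATL"],
    "salary": [85_000_000, 137_000_000, 99_000_000],  # made-up figures
})

# Simple summary statistics (count, mean, std, min, max, ...) via describe()
print(salary["salary"].describe())
```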
First of all, like I said, we are going to use the 2018 regular season records. Since our data set contains ten years of regular season records, we have to filter the 2018 records out of the raw data set. Then we will manipulate the matchup column to obtain separate columns for each team, and create a dummy home variable to be used in the regression model as well. Step 1: we filter the 2018 regular season records using two variables, the season ID and the game ID. Filtering on the season ID extracts all the games played in 2018; since that still includes the games played during the pre-season, we have to exclude those games. The pre-season games are recorded in a distinct game ID range, so by applying a filter on the game ID we can exclude all the pre-season games as well. As a result, we now have the regular season records for the teams in 2018. Next, we select the columns used for the forecasting model; we are going to extract four columns. Now, let's move on to the next step. Step 2 is a little trickier than in the previous examples, as the data is structured a little differently from the previous data sets. Basically, here is what we are going to do with the matchup column. The strings in the matchup column are separated at spaces, and this column serves two purposes. Firstly, it distinguishes the two teams that played in a game, as we need to incorporate the salary information for both teams in a match. Secondly, it tells us whether the game was played at home or away, and it should be noted that the data is encoded from the perspective of the team that appears first in the matchup column. If the match is encoded with "vs.", the team appearing before "vs." played at home; if the match is encoded with the "@" special character, that team played away. Let's take "ATL vs. IND" in the matchup column as an example.
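Step 1 might look like the sketch below. The column names (`SEASON_ID`, `GAME_ID`), the season coding, and the game-ID cutoff are all assumptions about how the raw file is encoded, so treat them as placeholders:

```python
import pandas as pd

# Toy stand-in for the ten-year raw data set
nba = pd.DataFrame({
    "SEASON_ID": [22017, 22017, 22017, 22016],   # assumed coding for seasons
    "GAME_ID":   [21700001, 21700002, 11700001, 21600001],
    "MATCHUP":   ["ATL vs. IND", "IND @ ATL", "ATL vs. MIA", "ATL vs. BOS"],
})

# Keep only the 2017-18 season...
nba18 = nba[nba["SEASON_ID"] == 22017]

# ...then drop pre-season games by their GAME_ID range
# (assumed: regular-season IDs sit above this cutoff)
nba18 = nba18[nba18["GAME_ID"] >= 20_000_000]
print(nba18)
```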
Well, it's basically a match between the Atlanta Hawks and the Indiana Pacers. In this case, Atlanta is the home team and Indiana is the away team. One other thing I'd like to highlight is that the point difference in the plus-minus column is also encoded from the perspective of the team appearing first in the matchup column. In the same manner, the binary win variable at the right end of the DataFrame is derived from the plus-minus column. Let's take a look at the data set and see what this means. Look at the matchup column here. The team appears before the special characters "vs." and "@", and all the records are encoded from that first team's perspective, so we can tell whether the team played at home or away. We are going to split those strings at the spaces and give them separate column headers to be used in the regression model later on. To do that, we create a DataFrame called match. We take the matchup column from the NBA 18 data set and split it into a new DataFrame named match, which will contain three separate columns derived from the matchup column. You can see the three columns in the match DataFrame we just created. We then append each of these columns to the NBA 18 DataFrame. We create a column called team, filled with the data under column header 0 of the match DataFrame. We also create a column named opp, meaning the opponent in the match, and fill it with the corresponding data. Then we create one more column in the DataFrame, called home_away, to record whether the team played at home or away.
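The splitting step above can be sketched like this, assuming matchup strings of the form "ATL vs. IND" (home) or "IND @ ATL" (away):

```python
import pandas as pd

nba18 = pd.DataFrame({"MATCHUP": ["ATL vs. IND", "IND @ ATL"]})

# Split each string at the spaces into three columns (headers 0, 1, 2)
match = nba18["MATCHUP"].str.split(" ", expand=True)

nba18["team"] = match[0]        # team the row is recorded from
nba18["opp"] = match[2]         # its opponent in the match
nba18["home_away"] = match[1]   # "vs." = home, "@" = away

# All the information is extracted, so the matchup column can go
nba18 = nba18.drop(columns="MATCHUP")
print(nba18)
```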
This is basically how we create the dummy home variable. Since we now have all the information from the matchup column, we can drop the matchup column at the end and take a look at the resulting DataFrame. We have all the necessary information to move on to the next step. Next, we create the dummy home variable denoting whether the team played at home, which is encoded with "vs.", or away. We create this variable by manipulating home_away. The resulting DataFrame contains the home dummy at the very right end, and then we can simply drop the home_away column from the DataFrame. Now, let's move on to the salary data. First of all, we drop the unnecessary columns, leaving two columns: the salary information and the team information. We are going to use the team abbreviation column as our matching column to obtain the salary information for the two teams in a match. In order to do that, we have to rename the column header so that we can use it for merging the data sets. First, we change the column header to team so that we can obtain the salary information for one team in our original NBA 18 DataFrame. We also have to obtain the salary information for the other team in the same match, so we change the column name again and use opp as the matching column. This way, we obtain the salary information for both teams in a match. Since we used the same column header from the salary data set twice, the resulting columns are distinguished as salary_x for the team and salary_y for the opposing team. We will change these column headers to avoid confusion. Now we have all the information needed for running the forecasting regression. But before we move on, we take the log of the salary ratio between the two teams.
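The dummy variable, the double merge, and the log salary ratio can be sketched together; the column names and salary figures are illustrative assumptions:

```python
import numpy as np
import pandas as pd

nba18 = pd.DataFrame({
    "team": ["ATL", "IND"],
    "opp":  ["IND", "ATL"],
    "home_away": ["vs.", "@"],
})
salary = pd.DataFrame({"team": ["ATL", "IND"],
                       "salary": [99.0, 119.0]})  # made-up figures, $M

# Dummy home variable: 1 if the team played at home ("vs."), 0 if away
nba18["home"] = np.where(nba18["home_away"] == "vs.", 1, 0)
nba18 = nba18.drop(columns="home_away")

# Merge once on team, then rename the key to opp and merge again;
# pandas distinguishes the duplicated column as salary_x / salary_y
nba18 = nba18.merge(salary, on="team")
salary = salary.rename(columns={"team": "opp"})
nba18 = nba18.merge(salary, on="opp")
nba18 = nba18.rename(columns={"salary_x": "salary", "salary_y": "salary_opp"})

# Log-scale salary ratio between the two teams: the independent variable
nba18["lsrat"] = np.log(nba18["salary"] / nba18["salary_opp"])
print(nba18)
```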
At the right end of the DataFrame, you can see the log-scale salary ratio between the two teams in a match. We will use this as the independent variable in our forecasting model. Again, we use linear regression, with the point difference between the two teams as the dependent variable. First of all, we fit a regression model with a continuous dependent variable; in this case, that is the linear regression model. Before fitting it, we draw a scatter plot, putting the independent variable along the x-axis and the dependent variable along the y-axis, and on the scatter plot we can also draw the best-fitting line. Even though the relationship is rather weak, we can still see a positive association between the log-scale salary ratio and the point difference between the two teams. Let's fit the regression model. First, look at the regression coefficient for our independent variable, the log-scale salary ratio between the two teams, which is statistically significant; you can also check the t-statistic here. We can also see the constant, and our R-squared is about 0.019, which is not very impressive. But again, we can still proceed to the next step and make predictions using this regression model. First, we obtain the fitted point differences from the OLS model; as a result, you can see the fitted values at the end of the DataFrame. Based on the fitted values, we classify whether the model predicts the team winning or losing. Based on the prediction, we can see how accurate our model is by comparing the predicted value against the actual outcome under the win column. We then obtain the success rate of our predictions by dividing the total number of correct predictions by the total number of observations.
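A minimal sketch of this step, with simulated data in place of the real season and NumPy's least-squares fit standing in for the notebook's OLS call:

```python
import numpy as np
import pandas as pd

# Simulated games: log salary ratio and a noisy point difference
rng = np.random.default_rng(0)
n = 200
lsrat = rng.normal(0, 0.5, n)                     # log salary ratio
pm = 10 * lsrat + rng.normal(0, 10, n)            # plus-minus (point diff)
df = pd.DataFrame({"lsrat": lsrat, "plus_minus": pm,
                   "win": (pm > 0).astype(int)})

# Fit plus_minus = const + beta * lsrat by ordinary least squares
beta, const = np.polyfit(df["lsrat"], df["plus_minus"], 1)
df["fitted"] = const + beta * df["lsrat"]

# A positive fitted point difference is classified as a predicted win
df["pred"] = (df["fitted"] > 0).astype(int)
accuracy = (df["pred"] == df["win"]).mean()
print(f"slope: {beta:.2f}, accuracy: {accuracy:.2f}")
```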
Our model predicted 54 percent of the total games correctly. Now, let's incorporate the home team advantage into the model and see whether the multiple regression fits the data better than the previous model in terms of the prediction rate. We fit the regression model, now with two independent variables: the log-scale salary ratio and the dummy-coded home variable. As a result of running this regression, we also obtain the fixed effect of the home team advantage. You can see that all the regression coefficients for our independent variables are statistically significant. The log-scale salary ratio is still statistically significant, with a very low p-value and a large t-statistic, and the coefficient on the dummy home variable, the home fixed effect, is statistically significant as well. The R-squared value also improved as a result of running the multiple regression. Using the multiple regression we just fitted, we obtain the fitted values, and from the fitted values we make predictions. Based on the predictions, we again classify the game results, and we calculate the prediction rate of this second model with the home team advantage by dividing the total number of correct predictions by the total number of observations in the DataFrame. As you can see, the second model with two independent variables predicted 59 percent of the games correctly. Let's move on to the logistic regression model, where we now use a binary dependent variable to fit the regression. Here, we use the same independent variable in the model specification, while the dependent variable is the binary win variable.
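The two-variable model can be sketched the same way, again with simulated data and `np.linalg.lstsq` standing in for the notebook's multiple-regression call:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
lsrat = rng.normal(0, 0.5, n)          # log salary ratio
home = rng.integers(0, 2, n)           # dummy home variable
# Simulated point difference with a home-court effect baked in
pm = 10 * lsrat + 6 * home - 3 + rng.normal(0, 10, n)

# Design matrix: constant, salary ratio, home dummy
X = np.column_stack([np.ones(n), lsrat, home])
coef, *_ = np.linalg.lstsq(X, pm, rcond=None)
fitted = X @ coef

# Classify and score exactly as in the simple model
pred = (fitted > 0).astype(int)
win = (pm > 0).astype(int)
print(f"coefficients: {coef.round(2)}, accuracy: {(pred == win).mean():.2f}")
```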
First of all, we specify the model in terms of the independent and dependent variables to be used for fitting the logistic regression, then run this line of code to fit it, and print all the estimated parameters. Here, you can see the intercept and the regression coefficient on the log-scale salary ratio between the two teams, which is also statistically significant, with a large z-statistic, and you can also see the standard error attached to each of the parameters we just estimated. Just like with the linear model, we can obtain the fitted probabilities by applying the logistic regression formula we just fitted; these are the fitted winning probabilities. Using the fitted probabilities, we can classify the game results. We create a confusion matrix to find the number of accurate predictions for winning and losing games, and based on the number of correct predictions we calculate the success rate by dividing the total number of correct predictions by the total number of observations in the data set. Here, we see that our logistic regression model predicted 55 percent of the total games correctly. Here comes another self-test. Even though we can obtain the fitted probabilities very easily using the handy code from the package, there is also a way to obtain them manually by applying the logistic regression formula we just estimated. This is good practice for getting a sense of how to apply a fitted logistic regression model.
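A dependency-free sketch of this step: to avoid assuming the exact library call the notebook uses, the logit is fitted here by plain gradient ascent on the log-likelihood, with simulated data; the fitted model, predictions, and confusion matrix follow the same logic:

```python
import numpy as np

# Simulated games: win probability driven by the log salary ratio
rng = np.random.default_rng(2)
n = 500
lsrat = rng.normal(0, 0.5, n)
p_true = 1 / (1 + np.exp(-(0.2 + 2.0 * lsrat)))
win = (rng.random(n) < p_true).astype(int)

# Fit logit by gradient ascent (stand-in for the package's fit call)
X = np.column_stack([np.ones(n), lsrat])
beta = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 1.0 * X.T @ (win - p) / n

# Fitted winning probabilities and classified game results
p_hat = 1 / (1 + np.exp(-X @ beta))
pred = (p_hat > 0.5).astype(int)

# Confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[np.sum((win == a) & (pred == b)) for b in (0, 1)]
               for a in (0, 1)])
accuracy = np.trace(cm) / n
print(cm)
print(f"accuracy: {accuracy:.2f}")
```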
I just want you to test your coding skills, and also to understand how to obtain the fitted probabilities from the fitted logistic regression model. By running this line of code, you can see, at the very right end of the DataFrame, the fitted winning and losing probabilities. Using the fitted probabilities, we can obtain the predictions from the logistic regression model; it is the same regression model we just estimated. Then we can check whether we got the correct predictions, and the same goes here as well: we count the total number of correct predictions and divide it by the total number of observations. This way we get the same result as the prediction rate above. Just like with the linear regression model, we can also improve the performance of the logistic regression model by incorporating the home-field advantage. Now, as you can see, we have two independent variables, and after fitting this model you can see the regression coefficients for the two independent variables separately. You can also check whether the independent variables in the model are statistically significant by looking at the p-values. Using this model we obtain the fitted probabilities, and based on the fitted probabilities, visible at the very right end of the DataFrame, we classify each game as a win or a loss for the team. Next, we see how accurate our predictions are: we compare the actual values against the fitted values, calculate the total number of correct predictions from this second model, and divide it by the total number of games.
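The manual route can be sketched like this. The intercept and slope below are made-up stand-ins for the estimates printed by the fitted model:

```python
import numpy as np
import pandas as pd

b0, b1 = 0.1, 2.0   # hypothetical fitted intercept and slope

df = pd.DataFrame({"lsrat": [-0.4, 0.0, 0.4], "win": [0, 1, 1]})

# Logistic formula applied by hand: P(win) = 1 / (1 + exp(-(b0 + b1*x)))
df["p_win"] = 1 / (1 + np.exp(-(b0 + b1 * df["lsrat"])))
df["p_lose"] = 1 - df["p_win"]

# Classify, then check each prediction against the actual outcome
df["pred"] = (df["p_win"] > 0.5).astype(int)
df["true"] = (df["pred"] == df["win"]).astype(int)

success_rate = df["true"].sum() / len(df)
print(df)
print(f"success rate: {success_rate:.2f}")
```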
Now, as you can see, our multiple logistic regression model with two independent variables, the log-scale salary ratio between the two teams and the dummy home variable, improved our success rate to 60 percent: it predicted 60 percent of the total games correctly. Now, let's move on to the next model, the forecasting model with the betting odds. First of all, we import the betting odds data set. It should be noted that the betting odds are recorded in the form of money line odds, and the data is recorded from the home team's perspective; this way, we do not have to create a dummy home variable for this data set. In order to evaluate the performance of the betting odds model, we need the actual outcomes of the games. Here we have the home score and the away score, so we obtain the home point difference by subtracting the two. We create a column called HPD, which stands for home point difference. As a result of this line of code, we can see whether the home team won or lost the game, and based on this column we obtain the outcome of each game. In the resulting DataFrame, a value of one under the win column means the home team won the game, and zero otherwise. Now we obtain the fitted probabilities from the betting odds. If you run this line of code, you will see the fitted probabilities of the home team winning and the away team winning. Based on the fitted probabilities from the betting odds data, we create the prediction column, classifying whether the home team is predicted to win. We then compare those predictions with the actual outcomes of the games and save the result in the true column.
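This step can be sketched with the standard American money-line conversion (favorites quoted negative, underdogs positive); the column names and odds values are illustrative assumptions:

```python
import pandas as pd

def implied_prob(ml):
    """Implied win probability from an American money line."""
    if ml < 0:                        # favourite, e.g. -150
        return -ml / (-ml + 100)
    return 100 / (ml + 100)           # underdog, e.g. +130

odds = pd.DataFrame({
    "home_score": [110, 98],
    "away_score": [102, 105],
    "home_ml": [-150, 130],           # made-up money lines, home perspective
})

# Actual outcome: home point difference, then a 1/0 home-win indicator
odds["HPD"] = odds["home_score"] - odds["away_score"]
odds["win"] = (odds["HPD"] > 0).astype(int)

# Fitted probability of a home win from the money line, then classify
odds["p_home"] = odds["home_ml"].apply(implied_prob)
odds["pred"] = (odds["p_home"] > 0.5).astype(int)
odds["true"] = (odds["pred"] == odds["win"]).astype(int)
print(odds[["HPD", "win", "p_home", "pred", "true"]])
```

Note that raw implied probabilities include the bookmaker's margin; for a simple head-to-head comparison like this one they can be used as-is.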
We add up all the correct predictions and divide the total number of correct predictions by the total number of games in the DataFrame. The success rate of the logit model is about 60 percent, while the success rate of the betting odds model is about 67 percent: the betting odds model performs better than the logit model using the salary ratio and the home team advantage as independent variables. Now we can also compare the performance of the two models in terms of the Brier score. Let's obtain the Brier score for our salary ratio model first. First of all, we create a dummy variable for each outcome of the game and then apply the mathematical formula to obtain the Brier score for the logistic regression model; the Brier score for the salary ratio model is 0.49. Now let's move on to the Brier score for the betting odds model. Again, we create a dummy variable for each outcome of the match and then obtain the Brier score for the betting odds model in the same way. To summarize, the betting odds model had a lower Brier score than the salary ratio model: the betting odds model performs much better than the salary ratio model in terms of the Brier score as well. That's the end of the NBA analysis.
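The Brier-score computation can be sketched as follows, using a dummy for each outcome as described above (so each game contributes the squared error for the win category plus the squared error for the loss category); the probabilities below are illustrative, not the notebook's fitted values:

```python
import numpy as np

# Illustrative fitted win probabilities and actual 1/0 outcomes
p_win = np.array([0.60, 0.43, 0.75, 0.30])
outcome = np.array([1, 0, 1, 1])

# Dummy variable for each outcome of the game
d_win, d_lose = outcome, 1 - outcome
p_lose = 1 - p_win

# Brier score: mean squared distance between probabilities and dummies,
# summed over the two outcome categories
brier = np.mean((p_win - d_win) ** 2 + (p_lose - d_lose) ** 2)
print(round(brier, 4))
```

With this two-category form the score ranges from 0 (perfect) to 2, and lower is better, which is why the betting odds model's lower score means better performance.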