Let's talk about some considerations in using this game predictive model that we built. There are lots of considerations that apply, but here are a couple we'll talk about. Does our model generalize well? Will it actually work when we need it to work? Let's say we're going to use this to win the office pool, or to inform our betting strategy in gambling, or just to lord it over others that we know all this about the sport when really we're using data science. We need to think about that as we evaluate, train, and test our model. I think there's another really big consideration: which features are important to the accuracy of this model? As we cram in more features, we start to lose the ability to inspect our model, and we might be learning weird things, things that aren't replicable or generalizable, so we want to get rid of those if they exist. I think there's also a question of bias in the model. This is a very hot topic in the literature, in the scientific community, and in society at large right now. In this case, does the model work better for some teams than for others? Does it predict outcomes equally well for all teams? There are techniques to detect these issues, as well as techniques to mitigate problems that might arise from them. Let's explore this hockey game data a little bit more. Our goal here is sense-making, to actually learn how this model works. Just like before, let's bring in the standard data science data manipulation imports, Pandas and NumPy. Let's bring in a plotting library as well, Matplotlib, and we'll use its pyplot interface. Now, we're going to read those observations in from our saved CSV file. There's our DataFrame. That looks familiar; we saw it in a previous lecture. Perhaps the biggest question is ecological validity, that is, the validity of this model for real work, like winning our office pool bet: how well does it work over time? In our first modeling approach, we broke our training and test sets into two pieces, roughly an 800/450 split. But don't we want to dominate our office colleagues right from the first game of the season? Let's look at the accuracy of this model over time. The index of the observations DataFrame is in the format YYYYMMDD_time. For the sake of the pool, we probably just want to break this down into daily observations. I'm going to use a regular expression to do this. Regular expressions can be a little bit complicated to read, and they're an important part of web scraping and data cleaning, but you could look at the string split methods as well. After we do that, if we browse through the observations DataFrame, we can see that we have a new column in here called date, and it just breaks out the date. Now that we've got observations by day, we can build predictive models on a day-by-day basis. Essentially, for each day, we want to build a model which considers all of the data which comes before it in the season, and use that to predict the outcomes for that day. We're going to use the pandas groupby function along with the apply function. Together, these functions will create a small DataFrame for each date and allow us to apply a single function to that data. The result of our function should be some accuracy values, which tell us how well our model performed.
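Before we write that function, here's a minimal sketch of the setup so far. I'm assuming the saved file is called observations.csv and that the index strings look like 20170104_1900; your file name and exact index format may differ slightly.

```python
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the saved observations back in, using the first column as the index
observations = pd.read_csv("observations.csv", index_col=0)

# Each index value is a string like "20170104_1900" (YYYYMMDD_time), so a
# regular expression can pull out just the date portion before the underscore
observations["date"] = [re.match(r"(\d{8})_", str(idx)).group(1)
                        for idx in observations.index]

# The same thing with string splitting instead of a regular expression:
# observations["date"] = [str(idx).split("_")[0] for idx in observations.index]
```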
Our model_by_date function here is going to take in two different parameters. One is going to be called date_observations and the other is going to be the features list. The date_observations will just be the data that we're trying to operate on now, the sub-frame that groupby gives us, and the features list will be a list of all of the features that we want to use to build our model. The first thing that we're going to do is try to get all of the data before the date that's passed in. We can use this nice date_observations.name attribute. A DataFrame can sometimes be named, and it always is when you use groupby; it's named with the value of the key that you're grouping on. We can get that value out, look at our global observations before that date, and generate our training data from that. Now, we can only apply supervised machine learning if we have data to learn from. Here, I'm just going to add a check to see if the length of the training data is greater than zero. We're going to have to handle the case where we don't have historical data, where we're just starting at the very first observation, differently. I'll give you a hint: it's go with your gut. That's essentially what we'll do for this model. Just like you saw before, we're going to build our model, first cleaning up the missing values, and then bringing in the logistic regression function. Feel free, by the way, to pause this video and play with some of the other regression functions inside of Sklearn. We're going to separate out our features based on the features list, train again with our target, the outcome categorical, and then fit our model. Then we'll import accuracy_score, just like we did previously, and predict on the data we haven't seen, our testing DataFrame, with the same features. Now, I want to return something that's a little bit different. In this case, I want to return the accuracy, but I also want to return a couple of other metrics. You can chart or plot these metrics to get a sense of what they look like. I want to capture how many observations we have trained on at this point in the model. Remember, we're building a model for a specific day of the calendar year, so as the date increases, we'll be able to train on more and more data. We can capture that here just by taking the length of our training DataFrame. Then I think it's important to also look at how many observations we're actually predicting on. On a given day, there might be a single game, or there could be five or ten games being played. That's important if we want to start understanding the value of the accuracy and what it would mean for winning our office pool. I'm actually going to return the whole fitted model as well. This means we could inspect it a little bit: we could look at the coefficients, change the coefficients, look at various things. Then I'm going to return the coefficients of the regression model as well. Now, we haven't talked too much about the coefficients in regression, and we won't in this course, but essentially they are positive or negative numbers for each one of our features that indicate how heavily weighted those features are in predicting the outcome that we're interested in, in this case whether the home or away team won, the classification of zero or one. We're going to use the built-in zip function of Python to bring our features list together with the coefficients, because the coefficients are really just a whole bunch of numbers; Sklearn doesn't encode the label of the feature in there, just the feature data. We'll return all of this as a Pandas Series.
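Putting those pieces together, a sketch of model_by_date might look something like the following. The target column name outcome_categorical and the mean-based imputation are assumptions on my part; adjust them to match the column names and cleaning strategy in your own notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def model_by_date(date_observations, features):
    # groupby() names each sub-frame with the group key, so .name is the date
    current_date = date_observations.name

    # Train on everything that happened strictly before this date
    df_train = observations[observations["date"] < current_date].copy()
    df_test = date_observations.copy()

    if len(df_train) > 0:
        # Impute missing values in the train and test sets independently
        df_train = df_train.fillna(df_train.mean(numeric_only=True))
        df_test = df_test.fillna(df_test.mean(numeric_only=True))

        # A plain logistic regression with the Sklearn defaults
        clf = LogisticRegression()
        clf.fit(df_train[features], df_train["outcome_categorical"])

        accuracy = accuracy_score(df_test["outcome_categorical"],
                                  clf.predict(df_test[features]))

        return pd.Series({
            "accuracy": accuracy,
            "num_training": len(df_train),   # how much history we trained on
            "num_testing": len(df_test),     # how many games that day
            "model": clf,                    # the fitted model itself
            "coefficients": dict(zip(features, clf.coef_[0])),
        })

    # No historical data yet (the first day of the season): return missing values
    return pd.Series({"accuracy": np.nan, "num_training": np.nan,
                      "num_testing": np.nan, "model": np.nan,
                      "coefficients": np.nan})
```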
There is this base case of what to do with the first game of the season. In this case, I'm just going to return some missing values. But of course, you might want to consider building a totally separate model here, one trained only on historical data from last season. Here, I'm going to keep it simple and just return NaNs. Now that we've defined that function, we want to set up a list of features. I want it to be all of the columns in our observations, everything in our CSV, except for the thing that we're predicting, our target, which is the outcome categorical, and the date column that we added at the end. Now, we can take our observations DataFrame and group it by date, so essentially by day. Each day gets segmented into a sub-frame, and then we can call model_by_date, the function that we just wrote, against that sub-frame and pass in the features list. The results of this are going to be all of those Series objects with an accuracy, the number of training observations, and all the different metrics that we're interested in, and we can turn them into a DataFrame. Let's give that a run. Let's take a look at this DataFrame. We've got our date down the side here, and you see the accuracy for the first date is NaN, and everything's NaN there, and that's of course because we don't have any information before the first game. But as soon as we get to the second date, there are multiple games, and we have some predictions from the model. In this case, we trained on two instances and tested on seven, and our accuracy was pretty abysmal, just 0.28, well less than the 0.5 chance accuracy that we might expect from flipping a coin. The model is being returned as well so that we can actually inspect it, which is really nice. It's a logistic regression model, and you can see that the coefficients are in there too, in a dictionary. As the season goes on, or at least over this first week, we seem to improve a little bit. You can see that our training set size increases, but our testing set size not so much, because we're only looking at a single day. We're gaining more and more information about the season. Great. We've productionized this code a bit, or at least simulated that production development. Let's do a bit of analysis on how our model performs as a daily average throughout the time period of observations. What I'm specifically going to do here, and what I like to do, is look at whether the accuracy improves as we have more data in our model for training. I'm just interested in those two pieces, and I'm going to build this as a plot, comparing accuracy and date in Matplotlib. Let's take a look at that.
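Here's roughly what that looks like in code, again assuming the target column is named outcome_categorical. The twin-axis plot puts the training-set size and the daily accuracy on the same timeline.

```python
# Every column except the target and the date column we added becomes a feature
features = [c for c in observations.columns
            if c not in ("outcome_categorical", "date")]

# One sub-frame per day, one model per day, one row of metrics per day
results = observations.groupby("date").apply(model_by_date, features)
results.head()

# Plot the number of training observations and the daily accuracy together
fig, ax1 = plt.subplots(figsize=(10, 5))
dates = pd.to_datetime(results.index)
ax1.plot(dates, results["num_training"], color="tab:blue", label="num_training")
ax1.set_ylabel("number of training observations")

ax2 = ax1.twinx()  # second y-axis sharing the same dates
ax2.plot(dates, results["accuracy"], color="tab:orange", label="accuracy")
ax2.set_ylabel("accuracy")
plt.show()
```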
This was a bit surprising to me. I thought we would find a model regularly increasing, growing in accuracy as the number of training instances grows, but while we see a bit of that, it really doesn't hold for long, and it seems mostly random throughout. Our num_training is this blue line, and we see that the blue line just increases over time as we get more and more observations, more data to train on; that's on one axis. The accuracy of our model, between zero and one, is on the other axis, and it just bounces around. It really looks random. Sometimes it's 100 percent accurate. Now, that could have been just a fluke, one lucky prediction on a day when there was only one game. Similarly, sometimes it's completely inaccurate. We know the model is not very good, but generally, we want to start exploring the model to understand which features might be performing well. The most common way to do this is to inspect the coefficients of the regression models being produced. But given that we've got three different buckets of features, we could also explore the accuracy as it relates to those buckets. I'm going to give that a shot. First, I'm going to silence a warning that will come from Sklearn with respect to convergence of the models. It's important if you're seriously considering using these models, but I really just want to show you how I would look at some of the different features, so I'm going to make things a little quieter. Now, I'm interested in these different models. The first, the current-season model, is just the previously seen performance for this date: home_lost, home_won, away_lost, and away_won. Then we've got last season's information, everything that we would have from the past season, which includes a bunch of data, and then the salary cap information by itself. I want to take a look at all of these, so I'm going to plot each one of these models directly.
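A sketch of that comparison is below. The current-season column names come straight from the lecture, but the way I pick out the last-season and salary-cap columns here (a last_ prefix and a salary_cap column) is purely a placeholder; substitute whatever names your CSV actually uses.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

# Quiet Sklearn's convergence warnings while we explore; worth revisiting
# if you ever plan to use these models for real
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Three feature buckets: this season's record so far, last season's data,
# and the salary cap information on its own
feature_buckets = {
    "current_season": ["home_lost", "home_won", "away_lost", "away_won"],
    "last_season": [c for c in observations.columns if c.startswith("last_")],
    "salary_cap": ["salary_cap"],
}

# Build the day-by-day models for each bucket and plot their accuracy over time
fig, axes = plt.subplots(len(feature_buckets), 1, figsize=(8, 9), sharex=True)
for ax, (name, bucket) in zip(axes, feature_buckets.items()):
    bucket_results = observations.groupby("date").apply(model_by_date, bucket)
    ax.plot(pd.to_datetime(bucket_results.index), bucket_results["accuracy"])
    ax.set_title(name)
    ax.set_ylabel("accuracy")
plt.tight_layout()
plt.show()
```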
Here are our three different buckets of features and the accuracy of the models built from them as the dates increase. Remember, each day we're building a new model that has more information, more observations in it. We can see that they all look random and noisy. None of these feature sets seems to be really useful in a clear way. What I was looking for here is whether there is a chunk of features, say the salary cap, that is meaningful in our understanding of the model and adds some accuracy, or whether some features actually take away from that accuracy and constantly bring our model down. In this case, I don't see anything really meaningful. A question that's actually very hot in the machine learning literature these days is the issue of bias in predictive models. Is the model biased when predicting the outcomes of one or more groups? Generally, the focus is on societal groups where such bias might reinforce inequities where we might not expect it; policing, for instance, is a big one. But the issue of bias can be generalized from there to consider the cases where the model performs poorly and thus shouldn't be used. Actually, our dataset here is interesting in that regard, because one of our teams, the Vegas Golden Knights, has a lot of missing information. You'll recall that we imputed this using the average of all of the other data, like the team salary cap and so forth. Is this reasonable? Probably not, but I did it because it was quick and I wanted to give a demonstration. More generally, with respect to the games the Vegas Golden Knights were playing, do our prediction models have poor accuracy? Let's check it out. We're going to change our modeling: we're going to model by game. I'm going to get all of the data from the games before this one. In this case, we're going to assume the game has ended and that we know all of the information from the previous games. We want to build a model for single games, but we run into a problem like we did with the data before, though this time it's a bit different. We need to make sure that we see at least two different labels to classify, one where the away team has won and one where the home team has won; otherwise, the classifier just doesn't know how to label our data. The approach I'm going to take here is the brute-force approach. I'm just going to try to classify the data that we have, and if there's an error, I'll send back empty data and say our classifier can't be built yet. I'm going to do that in a giant try-except statement here. Now, in this model, we're specifically interested in the teams which played, but we actually scrubbed that information out of our previous data. It turns out, though, that we have a mapping between the teams and this data, because we've captured the league ranking from last year, and that value is NaN for the Golden Knights. Before we impute it, let's make sure we're going to send back these team identifiers. I'm just going to create a list here with these dictionaries inside of it. Let's build our model as before, first cleaning up the missing values. Remember, we split our datasets and we clean or impute missing values independently, and then we train the model. I'm just using a very basic logistic regression. As you can see here, I don't provide any parameters to the logistic regression; I'm just going with the Sklearn defaults. This is obviously a place where our knowledge of both data science, if we have that, and our knowledge of the domain, if we have that, can improve our classification. Then I'm going to evaluate accuracy again and return the accuracy here, as well as the team names. The rest is really just error handling if the classifier can't actually run: if the classifier throws an exception, I'm just going to return some empty data. Let's take a look at this. This is our DataFrame; I just looked at the first five elements of it. We have our teams, and of course at the beginning everything is empty, because we don't have evidence of both outcomes yet. But then here we see what team we predicted was going to win each game and whether we got it correct or not. As we know, our model is pretty bad fairly often. Our teams all have numbers from our previous dataset. Now that we've got this DataFrame with our teams and whether the prediction was correct or not, we can group by the team identifier and apply an aggregation function to determine how many times the model was correct for a given team. We can use the NumPy count_nonzero function to do this, since a True value is represented as a one in NumPy, while a False value is represented as a zero. This is actually quite nice, and it's the core of understanding the bias that our model might have for particular teams.
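A hedged sketch of that aggregation is below. I'm assuming the per-game results ended up in a DataFrame I'll call game_results, with one row per team per game, a team column holding the team identifier, and a boolean correct column; your column names and layout may well differ.

```python
# Count how many predictions were correct for each team: True counts as 1
# and False as 0 under np.count_nonzero, then sort from worst to best
correct_by_team = (game_results
                   .groupby("team")["correct"]
                   .agg(np.count_nonzero)
                   .sort_values())
correct_by_team
```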
If we take a look at this DataFrame, which I've sorted from lowest to highest, in this dataset there are about 84 games that each team has played. We can see that for Team 12, for instance, we were only correct on 30 of those. Our accuracy there is maybe in the 35 percent range, which is really not good at all. I would say our model actually has a lot of bias against this team. Down here, we see that for Team 9 we're correct on 46 of the 80 or so games played, so we're slightly better than chance, a bias in favor of this team. Interestingly, our Vegas Golden Knights, that's the NaN entry here, are right in the middle; we're a chance predictor when it comes to them. You can use this mechanism not just for this analysis, of course, but for any machine learning analysis you're doing, where you do this categorization and then look to see which teams, or which circumstances, your model may or may not be good at predicting. Now, this raises the question: who are Team 12 and Team 30, where the model is most biased? We can actually look this up in the other data file, and we see that it's the Senators and the Avalanche. This would allow us, as data scientists, to understand a bit more about what we have to look at and investigate in those teams, to see whether that source of bias is something we can address by collecting more data or joining another dataset, and thus improve our model further for those groups.
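If you want to close that loop in code, something like the following would do it. Everything here is hypothetical: I'm assuming the original team-level file is called something like team_info.csv and has league_ranking and team_name columns, since we only know the teams by last season's ranking.

```python
# Hypothetical lookup: map the ranking-based team identifiers back to names
teams = pd.read_csv("team_info.csv")
ranking_to_name = dict(zip(teams["league_ranking"], teams["team_name"]))

for team_id in (12, 30):
    print(team_id, "->", ranking_to_name.get(team_id, "unknown"))
```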