With our observations cleaned, we now face another choice: what to do with missing values. Missing values are a signal of their own, but most machine learning techniques want an explicit indication of what missing means. For numeric values like the ones we have here, strategies are usually aggregation functions based on the rest of the data we have. For the Vegas Golden Knights, for instance, what should we expect their performance from last year to look like? The average performance of other teams? Worse than the bottom-performing team, because they're brand new? Again, this is all a place where you need to bring your understanding of the game and the process and set reasonable values. In this case, I'm going to fill all the missing values with the mean value from the observations.

But a moment of caution: remember, we're working through the whole machine learning pipeline here. We're planning on building a model on some training data and then determining how well it fits on some held-out test data. We want to split up our sets before we start replacing missing entries with imputed data like this mean value that I spoke of; we don't want to leak that information. First, we're going to make sure all of our columns are numeric, converting our categoricals here. Then I'm going to save this data file on the Coursera system to observations.csv. Then I'm going to put the first 800 observations or so in our training data and leave the rest of them in our testing set. It's only then that I'm actually going to go and impute the missing data, and I impute it separately, once for the training data and once for the testing data. It's really important that we separate these two steps so that we're imputing only from that set's own data.

Now, let's go on to build our logistic regression model. You've seen this before: a regression technique applied to categorical data. We're going to use the scikit-learn LogisticRegression class to build our model.
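The split-then-impute workflow described here can be sketched as follows. This is a minimal, hypothetical example: the column names, the tiny DataFrame, and the three/one row split all stand in for the real observations file and the 800-row training split from the lecture.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the cleaned observations
# (hypothetical column names and values)
df = pd.DataFrame({
    "division": ["Pacific", "Central", "Pacific", "Central"],
    "goals_last_season": [210.0, np.nan, 195.0, 240.0],
    "wins_last_season": [45.0, 38.0, np.nan, 51.0],
})

# First, make everything numeric by one-hot encoding the categoricals
df = pd.get_dummies(df)

# Split BEFORE imputing, so statistics from one set can't leak into the other
train = df.iloc[:3].copy()   # the lecture uses roughly the first 800 rows
test = df.iloc[3:].copy()

# Impute each split separately, using only that split's own column means
train = train.fillna(train.mean())
test = test.fillna(test.mean())
```

Because each split is filled with its own means, the held-out test set never sees any information derived from the training rows, which is exactly the leakage we're trying to avoid.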
Building a classifier is straightforward. We just create a new instance of this LogisticRegression class, and then we call the fit method, passing in the features that we want to train on and the labels we want to predict. For this first model, we'll pass in all of the observations and all columns except for the target, which is the outcome categorical column; that's our labels. I'd like to create a new features DataFrame here and then our target DataFrame, and then we just run the logistic regression: we create the classifier, we call its fit method, and then we can score it. Note that calling score on a classifier prints out the mean accuracy, not an R squared value. That's a pretty bad model.

Let's see how well this works on our test data. Specifically, let's take a look at the accuracy, the fraction of correct predictions that we can make. We can import the accuracy_score helper function from scikit-learn's metrics module, so I'll bring that in here, and now create one variable which has the correct labels and one which has the predictions. We'll take the label column from our testing DataFrame, so those are our labels, and then for the predictions we just call predict on our fitted classifier. This is going to predict on the new data, where we drop everything except our actual features, and then we take a look at our results. That's a pretty abysmal prediction there too; it's actually slightly worse than just flipping a coin. We're going to have to discuss a little bit more how we can work with some of these features.

That's pretty exciting. Are we done? Well, I think that there's a lot more we could learn about this, but we've gone through the workflow. We've stepped through all of the different pieces, and you've seen how much of it is data processing and how little of it is actually machine learning model building. But I didn't really emphasize any of the bits about tuning the model, choosing different algorithms, exploring different algorithms, and so forth.
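The fit-score-predict sequence above can be sketched like this. The DataFrames here are hypothetical stand-ins (made-up feature names and values, with "outcome" as the binary target), not the real NHL data from the lecture.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical training and test splits: two numeric features plus the target
train = pd.DataFrame({
    "goals_for": [250, 180, 230, 170, 260, 190],
    "goals_against": [200, 240, 210, 250, 190, 230],
    "outcome": [1, 0, 1, 0, 1, 0],
})
test = pd.DataFrame({
    "goals_for": [255, 175],
    "goals_against": [195, 245],
    "outcome": [1, 0],
})

# Features are every column except the target; labels are the outcome column
features = train.drop(columns=["outcome"])
target = train["outcome"]

clf = LogisticRegression()
clf.fit(features, target)

# For a classifier, score() reports mean accuracy on the data you pass in
train_acc = clf.score(features, target)

# On held-out data: true labels from the test frame, predictions from the
# fitted classifier (with the label column dropped before predicting)
y_true = test["outcome"]
y_pred = clf.predict(test.drop(columns=["outcome"]))
acc = accuracy_score(y_true, y_pred)
```

The key habit is the same one from the imputation step: the classifier only ever sees the test features at predict time, never the test labels.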
We'll start getting into those in future weeks of the course. But for now, let's think a little bit more about this model. We acquired data through APIs and lightweight scraping. We cleaned the data, aligning values throughout. We made choices on features, putting our knowledge of the sport into our analysis, and you might well have made different choices than I did. We made decisions on how to represent missing data, and then we ran a fair analysis, building a model on 800 observations and evaluating its accuracy on the remaining 400 or 500 items.

But let's throw a few flags up here. Lots of our choices were pretty arbitrary and naive; I mean, it's just a lecture to demonstrate what can be done. We didn't really inspect our features from the previous season deeply, we just crammed them all into the model. We don't really have a sense as to where this model will likely be good and where it will be bad, and it's a little unclear how we would use it to inform us in the future. I think we can do better, and I don't mean increasing accuracy per se, though that could come as well; I mean improving our understanding of how well this model would really function. Let's go back and dig in a bit more to this model and this task in the next lecture.