Any time we're building predictive models, we need to consider a number of things, and chief among these is how we train the model using this fit function. Now, this function does all the heavy lifting for us: it's going to use all of the data we give it and try to learn the best model it possibly can. And that can lead to overfitting, where the model works really well on the data it was given but doesn't generalize well to new data. We've seen two tactics already to deal with this. First, we can separate our data into train and test sets, which gives us a realistic understanding of how well our model will perform on new data. And second, we can change model parameters like C to penalize overfitting. But really, how big should your train and test splits be? A more common approach in modern machine learning is to not have one strict split of your data, but instead to have many different splits and train many different models to get better estimates of how well the model might work. This is called cross validation, and sklearn has support for it built right in.

So let's bring in our imports and our baseball data like we did before. Okay, so there's our baseball DataFrame. Since we're old hats at this now, I'm going to add in a bit more data. Let's move from this binary prediction to a multiclass prediction. Specifically, it seems we differentiated well between those two kinds of pitches, the fastballs and the curveballs, so let's add in a few more and see what's actually in this data set for us. Okay, we can see there's a fair bit of diversity here, but we have significant class imbalance. It turns out there are very few pitches where the pitcher just lobs the ball out there, the eephus pitch, and I know I've never seen one when I've been watching baseball. But let's try to predict all of these different types, and let's use a host of different features.

So here are a few features that I pulled out that I thought were good. There are some pitch metrics, the release spin rate, the release extension, the position of the release, and so forth; a whole bunch that I just grouped as pitch metrics. We've got the player's name in here, and why not add that? It could be that some pitchers have favorites or strategies. And then pitchers might actually change their strategy when there are already outs in the inning or when they're deep in the game, many innings in. So I want you to take a minute and think about what you know about the game of baseball: what other features might be predictive of the next pitch a pitcher is going to throw? Don't limit yourself to the things we've seen exist in the data set. Think creatively for a moment about the different kinds of indicators you would use if you were sitting there in the stands and could measure everything.

Now let's combine our features and reduce our DataFrame. I'm going to take the pitch metrics, the player metrics, the game details, and of course what we're actually looking to predict, the pitch type, and turn them all into one giant list. Then I pass this to our DataFrame to get our new DataFrame, and we drop anything where we don't know the pitch type, because of course we can't check a prediction against a missing label. I'm also going to factorize the player names, since we need numeric values; this just replaces each player name with an integer value.
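As a rough sketch, that feature selection and cleanup might look something like the following, assuming the baseball data is already loaded in a DataFrame called df and uses Statcast-style column names (release_spin_rate, release_pos_x, outs_when_up, and so on); your notebook's actual column names may differ.

```python
import pandas as pd

# Hypothetical Statcast-style column names -- adjust to whatever your
# baseball DataFrame actually contains.
pitch_metrics = ["release_spin_rate", "release_extension",
                 "release_pos_x", "release_pos_z"]
player_metrics = ["player_name"]
game_details = ["outs_when_up", "inning"]

# One giant list of columns, including the label we want to predict.
features = pitch_metrics + player_metrics + game_details + ["pitch_type"]

# Reduce the DataFrame to those columns and drop rows with no pitch type,
# since we can't check a prediction against a missing label.
df = df[features].dropna(subset=["pitch_type"]).copy()

# Replace each player name with an integer code, because the SVM needs
# numeric inputs. pd.factorize returns (codes, unique_values).
df["player_name"] = pd.factorize(df["player_name"])[0]
```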
And this is going to give us roughly 40,000 observations. Now, this is just one month of data. We're going to prune out the last 35,000 for our validation set: we'll put it to the side, lock it up, and not look at it. Then we're actually going to work with this smaller group of 5,000. The last thing I want to do in this cell is impute the missing data in our training set. There isn't that much actually, only a few hundred values, so I'm just going to use the simple mean approach.

In a cross validation approach, we break the original data into a number of equal subsets, called folds. Then we hold one of those out for testing, train on the rest, and repeat this procedure for each fold. Sometimes this is called K-fold cross validation, where K is the number of folds you use. And of course, sklearn has this built in for us, so we'll just bring in cross_validate from model_selection, and we'll bring in our SVM.

Now we create the model that we're interested in. The linear SVM did well, and it was nice and simple, but it turns out that the default implementation of the linear SVM in sklearn is pretty slow, and our data is actually getting up there in size. The fifth-degree polynomial kernel seemed to have the same accuracy, and when I tested it, it was roughly ten times faster on this data. So I'm going to save you the firsthand experience of learning that, and we're going to train this polynomial kernel instead. Again, we're going to set C to 15 and coef0 to 5, and the random state I'll just set to this integer so that you have it and can see what I see.

Once we've done this, we simply give sklearn the model and the data, tell it how many folds we want to perform, in this case just five, and then what metrics we want to use to evaluate how good the fit was. We're going to keep using accuracy here, but you'll see in the future that that's not really a good idea given the high class imbalance. Feel free to play around with different parameters for the cross folding: change this K value, the cv parameter in the cross_validate function, from 5 to 10, which is quite common as well, or to 8, or to 20. We've got lots of data, so you can change how you want to do that K-fold process and look at the results.

While that's crunching away (and remember, this is almost 5,000 observations we're going to fit five times over, so it's going to take a while), let's talk about what you actually use cross validation for. Cross validation does not improve your model; your model is going to be trained on all of the training data you have when it comes to actually building your final model. Cross validation does not change the hyperparameters you're using either. Instead, cross validation gives you a stronger understanding of how generalizable your model will be to new data. Think of it this way: you might have created a great model on September's baseball data, which is the data we're using, but if it doesn't generalize to October, they're going to laugh at you back in the office bullpen, and you don't want that.

The cross validation in the notebook should be done by now, so let's go check it out. The results of the cross_validate function are a dictionary which has some timing information and our test scores. I'm just going to print the test scores, because that's all I really care about, and we'll compute the mean and standard deviation of these too.
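Sticking with the hypothetical column names from the sketch above, that whole workflow might look roughly like this. The C, coef0, degree, fold count, and scoring metric are the ones discussed above; the split indices and the random_state value are illustrative, not the notebook's literal code.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Work with the first ~5,000 rows; lock the last 35,000 away for validation.
df_train, df_valid = df.iloc[:5000], df.iloc[5000:]

X_train = df_train.drop(columns="pitch_type")
y_train = df_train["pitch_type"]

# Mean imputation for the few hundred missing feature values.
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)

# Fifth-degree polynomial kernel: about the same accuracy as the linear SVM
# on this data, but much faster to train. random_state can be any fixed
# integer so the run is reproducible.
clf = SVC(kernel="poly", degree=5, C=15, coef0=5, random_state=1337)

# Five folds, scored with accuracy (try cv=8, 10, or 20 as well).
results = cross_validate(clf, X_train, y_train, cv=5, scoring="accuracy")

print(results["test_score"])
print(results["test_score"].mean(), results["test_score"].std())
```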
So we did a 5-fold cross validation, and you can see on our first fold we got almost 62% accuracy. Then it jumped up on our second fold to 67% accuracy, and so on. We had a standard deviation of about 2%, with an average score around 64%. You can see the value of this cross validation approach here: if somebody said they had a model that did 67% accuracy and another person had one at 62% accuracy, are they actually any different, or were they just trained with the random state set to a different number? Did they just happen to get lucky? That's actually really important for us.

But what's even better is to see how well this model actually works on our validation data, and it turns out we have this giant validation set that we just put in the vault, about 35, well, in fact exactly 35,000 instances. So let's actually do that. First we're going to impute those missing values again, and then we just pass the data in, fit our classifier, and score it against the validation data (there's a rough code sketch of this step below). Okay, 57 or so percent.

Well, that's an overview of how we can apply support vector machines to this pitch data from the MLB. Now, our accuracy at the end wasn't so great, this 57%, but we are predicting across many different classes of pitches, and it's not really clear where our model goes wrong. Does it mispredict just one class really, really badly, let's say curveballs? Or is it equally bad at all of the classes, missing, let's say, 10% or 20% of each one of them? The answer is that I don't actually know, but I do know how to find out. In the next lecture we're going to take a look at a different data set where this comes up in a nice way, and I'm going to show you the techniques that data scientists use to understand model performance better; then you can bring those techniques back to this data set and work on it.
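For reference, here's a minimal sketch of that validation step, continuing with the hypothetical names from the earlier sketches. One design choice here is to reuse the imputer fitted on the training data rather than fitting a fresh one on the validation set, so the held-out data is filled in using only statistics the model could have known at training time.

```python
X_valid = df_valid.drop(columns="pitch_type")
y_valid = df_valid["pitch_type"]

# Impute the validation features, reusing the imputer fitted on the
# training data.
X_valid = imputer.transform(X_valid)

# Fit the final classifier on all of the training data, then score it
# against the 35,000 held-out observations.
clf.fit(X_train, y_train)
print(clf.score(X_valid, y_valid))
```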