Alright friends, we're back. In this lecture we're going to use a much simpler method to classify documents: a bag-of-words approach. A bag of words is pretty different from a pretrained neural network. A pretrained neural network has already taken a large corpus of English, parsed it into words, and used those parses to learn semantic relationships. A bag-of-words approach just says: whatever words are present in our collection of documents, those are the terms we're going to use. It doesn't start with any pretrained knowledge of the English language or anything fancy. It starts by actually parsing the text into tokens, then taking those tokens and figuring out the linear relationships they have with the target, the thing we're trying to predict, in this case whether a document is about healthy living or not. So, good stuff, right? We're going to use SciPy sparse matrix functionality to see how we can build penalized models that will only keep a few features from all of the words we could possibly extract from this collection of documents. I have to give full credit to the scholars who created this notebook. We have really lightly edited it; it is mainly their work. Give them all of the credit, and thank them, if you have time, for building something that is so useful for doing machine learning. You can see here, if we go back up, that this package is using a lot of scikit-learn and really not much else, plus a little bit of pandas, because we're going to be loading our data into pandas. All of this stuff should import natively. There are some options here that you can review if you go to the original notebook, but we're only going to use one option, which is the hashing option to use a hashing vectorizer. That really just helps it run more efficiently in a notebook environment.
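To make the hashing vectorizer idea concrete, here's a minimal sketch using scikit-learn's HashingVectorizer class; the two toy documents are my own invention, not from the notebook, and the `n_features` value is just an illustrative choice.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical stand-in documents for the healthy-living classification task.
docs = [
    "eat more vegetables and exercise daily",
    "stock prices fell sharply on tuesday",
]

# HashingVectorizer maps tokens to columns with a hash function, so it
# never stores a vocabulary in memory -- that's what makes it efficient
# in a notebook environment.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(docs)

print(X.shape)  # (2, 262144): one row per document, one column per hash bucket
```

The result is a SciPy sparse matrix, which is exactly the representation the penalized linear models downstream expect.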
So again, with any project we do, we need to tell it where our files are, and the best way to do that is to store a variable with our full Google Drive path in it. Everything else here is the same. Now, I have created an evaluation folder, and I am just telling it: if this evaluation directory doesn't exist, go ahead and create it. So the first thing I do is check whether the directory exists; if it doesn't, I make the directory, and then we'll be able to save some files in it. Then I'm going to load my data in, and I'm going to do all of the data transformation that I did last time, exactly the same way, because I want to compare these linear models to the deep learning models we just built. That's really all that's happening there. We're going to train on 90% of the data, and we're using a mask function here. We're saying: mask 90% of the data, apply that mask to get the training data set, and then apply the opposite, the remaining 10%, to get the test data set. So you get a 90/10 split. In this particular code, we take our pandas DataFrames and convert them to lists. This is really easy to do in pandas: all we have to do is call out the specific column that we want and then use the to_list functionality, and it converts that column to a list. Could not be easier. So we're just pulling out our training and test data sets and turning them into lists. There's nothing we really need to do here, though there are things we could tweak. But remember, the number one thing we have to do in a linear approach is find out how many different words there are. You can see that there are 23,000 words in the examples we've given it. That's probably a good ceiling for what our maximum number of features should be, and it matches pretty well to that 20,000 range we specified in ktrain when we were doing our deep learning.
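The mask-based 90/10 split and the to_list conversion can be sketched like this; the DataFrame contents, column names, and random seed are hypothetical stand-ins for the notebook's actual data.

```python
import numpy as np
import pandas as pd

# A made-up two-column frame standing in for the real labeled corpus.
df = pd.DataFrame({
    "text": [f"document number {i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# Build a boolean mask that selects roughly 90% of rows for training,
# then apply its opposite (~10%) to get the test split.
rng = np.random.RandomState(42)
mask = rng.rand(len(df)) < 0.9
train_df, test_df = df[mask], df[~mask]

# Call out the specific column you want, then convert it to a plain list.
x_train = train_df["text"].to_list()
y_train = train_df["label"].to_list()
```

Because the mask is random, the split is approximately, not exactly, 90/10, which is usually fine for a comparison like this.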
We'll leave that as is; if we wanted to put a ceiling on it, there's a parameter to say: only keep the top however many thousand features you want. So this is just taking that vectorizer and actually transforming the data set. Now we're checking that these haven't been changed in any way: we're inspecting X_train and inspecting y_train to make sure all of this is exactly what we'd expect, which is 6,600 by two, and it is exactly that. So we're good there. I've only added a couple of lines of code here. The first line is just parsing out the name of the model we're running, and we're going to be running a ton of different linear models here. Then I'm saving the test predictions and the actual classifications and putting them in an evaluation folder, so that we can run some classification metrics on them later. All of this code is left untweaked, and I'm really doing a poor job of explaining this notebook to you, but there's just so much going on that I think the bigger thing to do is abstract out that we are building the same type of classifier on the same type of data. So we've got an F1 here of .822, and over here .824 and .834; I think that's our winner. So what do we have here? We have a model that trained in just seconds with very little performance decrease. We're talking about the difference between .87 and .834, so not much of a classification difference for a tremendous gain in speed. We get .837 here with elastic net, and elastic net often wins in my experience with text. So we've got some really decent models that all trained in a few clicks of a button. You can see that we have a comparison printed out at the end of all of our models, and most of them perform around the .8 F1 score, pretty gosh darn close to what we saw in our earlier deep learning approach.
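For a sense of what one of those penalized linear models looks like, here's a hedged sketch: scikit-learn's SGDClassifier with penalty="elasticnet" is one standard way to get an elastic net text classifier, though the original notebook may configure things differently. The toy texts and labels are made up, and the F1 computation mirrors the scores discussed above.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

# Made-up, trivially separable data standing in for the real corpus
# (1 = healthy living, 0 = not).
texts = ["yoga and healthy eating", "quarterly earnings report",
         "run five miles a week", "merger talks continue"] * 25
y = np.array([1, 0, 1, 0] * 25)

X = HashingVectorizer(n_features=2**16, alternate_sign=False).transform(texts)

# penalty="elasticnet" mixes L1 (drives weights to exactly zero, i.e.
# feature selection) with L2 (shrinks the rest); l1_ratio sets the blend.
clf = SGDClassifier(penalty="elasticnet", l1_ratio=0.15, random_state=0)
clf.fit(X, y)

preds = clf.predict(X)
print(round(f1_score(y, preds), 3))
```

Training really does finish in well under a second on data this size, which is the speed-versus-accuracy trade the lecture is pointing at.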
These elastic net models are four or five lines of code these days. They're super fast and very easy to deploy, and it would save us a lot of headaches if we adopted a model like this. Are we going to sacrifice a little bit of performance in terms of evaluation metrics? Yeah, this thing won't classify quite as well, but we're saving so much time, energy, and effort that this is something we should explore. If you have questions about this model, please refer to the documentation. Again, we very lightly edited it, and it's so nice that we can get linear models working with so little manipulation. There's so much you can do with these models, and I am barely scratching the surface. But you can see that with very little manipulation, we can almost recreate the accuracy, precision, and recall of much more sophisticated deep learning models.
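To back up the "four or five lines" claim, here is roughly what a deployable version could look like as a single scikit-learn Pipeline; the training texts and labels are hypothetical, and the specific hyperparameters are illustrative rather than the notebook's.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical labeled examples (1 = healthy living, 0 = not).
train_texts = ["healthy living tips", "market news update"] * 10
train_labels = [1, 0] * 10

# Vectorizer and elastic net classifier bundled into one object, so the
# deployed model accepts raw strings directly in predict().
model = make_pipeline(
    HashingVectorizer(alternate_sign=False),
    SGDClassifier(penalty="elasticnet", l1_ratio=0.15, random_state=0),
)
model.fit(train_texts, train_labels)
pred = model.predict(["drink more water and sleep eight hours"])
```

Bundling the vectorizer into the pipeline is what makes deployment easy: you pickle one object and there is no separate preprocessing step to keep in sync.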