Welcome back. Now we've got Google Colab set up. Hopefully, you have taken your notebook and put it into Google Drive where you want it. Remember, we've given you the link in Coursera to get access to all of these notebooks. You can reference them, and whenever there's a change, we're going to change that notebook first, so make sure that you've got the current version of the notebook from Coursera. Now, the next step is to load your notebook into a folder or to start with a new notebook from scratch, depending on how you want to go. If you want to type as you go along, that's a lot of typing, because there's a lot you'd have to type out as you see on the screen. The best practice is to take our notebook, copy it into the folder that you want, and then open it up. I have already downloaded all of my notebooks, put them in a folder locally, and then uploaded them. To open one up from Google Drive, I just click Open With and it opens that notebook. Now, if I wanted to browse My Drive manually, I can do that and upload the file that way too. If I go to Drive, I'm going to find the folder that I want to put my Colab notebook in and I'm just going to upload it. I'm going to put it under, I think it's under MSDS, and then I'm going to go into master files, text classification. I can just click and drag this file into here, or I could use the Upload button as well. Once I've got my notebook in the location that I want, I can open it up and begin to actually do some work. We're going to be importing a ton of packages today because we're going to be doing deep learning. A lot of this works well with Google Colab, but not all of it is installed, and some of it is going to take a little bit of messing around to get working. All of these packages are necessary, but you can see here that ktrain isn't installed. I'm using a little bit of simple Python logic that helps make this process a little easier. I'm saying, hey, go and try to import ktrain, which is what we're going to use, a wrapper around Google's TensorFlow package. It's not going to be installed by default, so on the first run it's going to hit the exception here, because it's going to say, hey, ktrain doesn't exist. What we tell it to do then is install ktrain. Now, some packages that we use in this course are a little bit problematic in that once they're installed, they require the runtime to be restarted before they'll work properly. I've gone ahead and imported os and, from that, I'm using the kill functionality. This is going to automatically terminate your Google Colab runtime so it gets started over again. The first time that I run it, pip is going to install ktrain. Once it does that, it's going to kill the runtime, and the runtime is going to be restarted. When I run this cell the second time, I won't get the exception, because ktrain will already be installed and it will import correctly. If I only install without restarting, ktrain will yell at me at the end and say, you haven't restarted your runtime yet. That's essentially what it's going to say. That's okay, I've just gone ahead and put in the restart so that no one gets that error. But if you wonder why the cell looks the way it does, that's really all we're doing. We're also going to import a couple of other things from ktrain.
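To make that concrete, here's a minimal sketch of the install-then-restart pattern just described, assuming you're running inside a Colab or Jupyter runtime (the cell in our notebook may differ slightly):

```python
# Minimal sketch of the install-then-restart pattern (assumes a Colab/Jupyter runtime).
try:
    import ktrain
except ImportError:
    # Not installed yet on this runtime, so install it with pip...
    !pip install ktrain
    # ...then kill the current process so Colab restarts the runtime with ktrain available.
    import os
    os.kill(os.getpid(), 9)
```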
We're going to be using the text package, and specifically texts_from_df, because we're going to be loading our machine-learning data from a pandas DataFrame. Of course, we've got an import for pandas. This one we don't actually use, so I can delete it; we're not going to be parsing URLs. We're going to be using some things from Keras, and we're going to be using pickle to save some data. These are packages that will help us with the data workflow. Before we get into the first thing we have to do for this project: if you have not reviewed the project yet, go back to the project on Coursera and read through the description, because that's what sets up our problem here. To recap really quickly: we're going to be creating a supervised machine learning algorithm of the kind used in contextual advertising today to determine the type of web content that a user is browsing. We're going to build a tool that classifies documents, in this case web pages, and we're going to use it to classify whether a document is healthy living or not. We're making a binary decision here from data. To do that, we need labeled training data; remember, all supervised machine learning projects need labeled data. How do I get data from the Internet onto Google Drive? Well, I could download it onto my local computer and then just upload it here. That's one way to do it, but it takes a little bit of time. You've got to download it and then upload it, and local upload speeds on home Internet aren't great. It would be great if we could pull the files straight from the web into our Google Drive without having to upload anything locally at all. That'll save us some time. How do we do that? It's actually pretty simple. Remember that the URL for all of your data is in your project, but it's also here in the notebook. This is the news data that we're going to use for the project. I'm using the wget utility inside of Linux to download a file. Remember, all Unix-level commands are run with an exclamation point. That's really important for us because it says: run this outside of Python, run this in the terminal. We're running this in the terminal and we're saying wget, I want you to download this file, and I want you to save it to a specific location, and that's what the capital P parameter does. It says I'm going to specify a path, and then I paste in the path to my Google Drive location. Now, you may have decided to save your notebook somewhere other than this. If you did, this is what you're going to need to change, so please do change it if you're saving things somewhere else. What this is going to do is download that file and put it in that location. It's really important to know that if you have spaces in your Google Drive path, you need to escape those spaces. Let's say that my folder actually looks something like this, where instead of an underscore I just use spaces to separate the words in the folder name, which is fine. If you do that, you need to escape those spaces, and you do so by putting a backslash in front of every space. You only do this for terminal commands, commands that are issued outside of Python. Python will recognize these spaces natively, so you don't have to worry about that when you're working in the Python environment; you only have to worry about escaping spaces when you're issuing these terminal-level commands.
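For reference, the download step might look something like this; the URL and the Drive folder below are placeholders, so swap in the data link from your Coursera project and your own path:

```python
# Placeholder URL and path — use the data URL from the project and your own Drive folder.
!wget https://example.com/news_category_data.json.zip -P /content/drive/MyDrive/MSDS/text_classification/

# If the folder name contains spaces, escape each space with a backslash for the shell:
!wget https://example.com/news_category_data.json.zip -P /content/drive/MyDrive/MSDS/text\ classification/
```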
If you've got spaces in your Google Drive path, you've got to escape them, and you escape them with backslashes. I'm sure someone's going to get tripped up on that, and if you do, I want to point you to this point in the video where I walk you through it. This next part is a must, I think, for any data scientist in Colab: you want to start every project by specifying the folders that you're going to be referencing as you go along. Why do you do that? It makes opening files much, much easier. You don't want to get into a situation where you're referencing the full Google Drive path every time you open a file. If you do that, invariably your notebooks are going to move location, and every single place where you reference your Google Drive path, you're going to have to go in and change it to reflect where you moved your file. If you just specify your root folder here once, all you've got to do is change that drive path once and you'll be set. Remember, if you want to find your drive path, you click on the Drive icon and make sure that your drive is mounted; mine is not. It asks, "Are you sure you want to connect?" I say yes. Then my Drive shows up. I'm going to navigate through my Drive to find the folder that I want, and then I can just copy that. I'm going to find the folder that I'm working in. Once I find it, I can literally right-click on it, or click the little three dots, and click "Copy path". If I do that, it's going to be this exact thing. It's already here, so I'm going to paste it in, and you can see that it's the exact same thing. Wherever you save it, that's how you find your path. But really, best practice: specify your path one time, so that if you ever move things, you can just change this one reference and the rest of the code will still work. I'm just checking my data files here, and there's only one file; I'm just checking it in my root folder. We're going to do some evaluation metrics later in our last lecture, so I'm going to make a subfolder here called Evaluation. You can create that manually, or we'll programmatically create it when the time comes. Then I'm going to make another path for models, where I'm going to save my TensorFlow models so that if I want to deploy them or reload them later to classify data, they'll be stored safe and sound. I went ahead and took the JSON data that was provided to us and transformed it. This is JSON, but it's not JSON that's natively ready to go with pandas, so I transformed their data into a pandas-friendly format. Our JSON file is a little different from theirs. If you want to see how I transformed their data, really, it's elegant code but not efficient code; I think it took something like 15 hours to run, and somebody who is more versed in pandas is going to tell me they could do it in literally minutes, I'm sure, with a one-off pandas line, but this is how I did it. You don't need to transform your data. I've already run this code for you, and in the version that you're uploading to your Google Drive, the data has already been transformed, so you don't need to run that, and that's why it's just left in the comments. I've made it easy for you. I know loading in the data is often the hardest part of machine learning. Maybe I'm taking it easy on you, but this is how we load the data in Python. Notice here, this is why specifying a data directory is so nice.
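A sketch of that folder setup might look like the following; the folder names here are assumptions, so point them at wherever you actually put the notebook and data:

```python
import os

# Root folder on Google Drive — if the project ever moves, this is the only line to change.
data_drive = "/content/drive/MyDrive/MSDS/text_classification"

# Subfolders used later: evaluation metrics and saved TensorFlow models.
eval_drive = os.path.join(data_drive, "evaluation")
model_drive = os.path.join(data_drive, "models")
os.makedirs(eval_drive, exist_ok=True)
os.makedirs(model_drive, exist_ok=True)
```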
All I really have to do is say, hey pandas, I want you to read some JSON that you know how to deal with, and I'm going to load it from my data directory. All you need to specify here is the name of the file. The magic happening in Python is that we're using %s as a placeholder, and we pipe the data_drive variable into the string to recreate the full path. That's the lovely part of it. Every time pandas works in a one-liner, an angel gets its wings. It's a beautiful, if a bit rare, thing, and I've saved you a little bit of a headache here. If we look through the data, what do we get? What does this look like? What does contextual advertising data look like? Well, at its core, you've got a category label: this is a crime article, this is an entertainment article, and so on. You have the actual anatomy of the news articles: the headline, the authors, the link to the actual news article, a short description, and the date it was published. Of course, there could be many more variables in a contextual advertising solution, but for us this is going to work really well. We're really only going to leverage the short description, the headline, and then obviously the label of the document. That's it. One of the things that I think is a limitation of deep learning is that it typically only accepts one field. Whatever that field is, that's what it's going to train on; that's what it's going to create its tokens from, and that's what it's going to create all of those features from to try to predict the y-variable. So we have to concatenate our text into one long string. In pandas, it couldn't be easier; sometimes these pandas one-liners are just magnificent. When they're simple, they're simple, and I love it. I'm just creating a new column here called combined text. I'm adding the headline of the article first, because I think that's the most important feature, then a space, and then the short description right after. You can see I've now created a combined text field, and that's what we're going to train on; that's where our features are going to come from. I've used a little bit of filtering here to say, from the reviews DataFrame, if the category is healthy living, then I want to print it. I'm seeing all the healthy living articles here by using that query, and you can see that there are 6,694 healthy living articles. That's a lot, but when we consider that the entire dataset is about 200,000 articles, we've got a bit of an imbalance, and we'll deal with that in a minute. What I mean by imbalance is that we have a needle-in-a-haystack situation: we've got a ton of articles and only about three percent of them are healthy living. We've got a lot of negative evidence and about 6,000 positive pieces, which is enough. I think we're going to be able to build a good machine learning algorithm here, but we're going to need to do some data balancing to even out the fact that most of these articles aren't healthy living. Indeed, we see that here. What we're going to do is create a new column called healthy. We're actually using NumPy here: we're saying NumPy, where the review's category equals healthy living, I want you to label it as a 1, and where that condition is not satisfied, I want you to label it as a 0.
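As a sketch, the load-and-combine step could look like this; the file name, column names, and the category spelling are my assumptions based on what's described, so match them to your notebook:

```python
import pandas as pd

# Read the pre-transformed JSON out of the data directory (file name is a placeholder).
reviews = pd.read_json("%s/news_category_transformed.json" % data_drive)

# Concatenate headline and short description into one field to train on,
# headline first since it carries the strongest signal.
reviews["combined_text"] = reviews["headline"] + " " + reviews["short_description"]

# Peek at the healthy living articles (assuming the category label is spelled this way).
reviews[reviews["category"] == "HEALTHY LIVING"]
```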
We're creating a new column where all of the articles that we saw up here are going to get 1s and every other article is going to get a 0. It's a really great way to convert a category column, such as what we're dealing with here, into a binary 1/0 column. We could do this for each category in our contextual dataset; we could build an algorithm for each category if we wanted to. This one-liner is just magnificent. Now that we've converted this healthy living label into its own column, we can print out its descriptives and see what it looks like. You can see here that we've got 200,000 articles and three percent of them are actually healthy living. That's exactly what we'd expect. As I said, our dataset is very unbalanced: we have 194,000 articles that are not healthy living. If we give a machine learning algorithm this much negative evidence, more often than not it'll end up tuning itself to label everything as zero. It'll be like, "Look, most of the stuff that I see is not healthy living, so I'm just not going to label anything as healthy living." That's what it ends up doing in practice. We've got to balance our data so that we have a healthier mix of positive to negative articles. I'm going to do a one-to-one mix here, meaning I'm going to have the same number of healthy living articles as I have of every other type of article. I don't need to keep it one-to-one; I could do one part healthy living articles to three parts everything else and probably not run into too many issues. However, as I said, deep learning is slow, and the training process is slow. The more data that we put into this model, the slower it's going to be, and the more frustrated you're going to be as it takes hours to tune. While getting as much data as possible to get the best model possible might be the right thing to do in a marketing business situation, for the sake of today we're going to do a one-to-one ratio. We're going to take 6,694 and make that our split. Our healthy data is going to be split into a positive DataFrame, and our negative data into a negative DataFrame. What I'm going to do here is specify a sample: I don't want all 194,000 negative examples, I'm just going to take 6.6K of them. It's really as easy as splitting these off and taking the sample. Then I'm going to use concat to add them together. Because the shapes are exactly the same, there's no issue with using concat like this; it'll work exactly as expected. If we did it right, when we describe our new sample, we should get a mean on the healthy column of 0.5, and we do: half of the examples are positive and half are negative. Half of them are ones, half of them are zeros, the mean is 0.5. That's great, we're all set. Now we're ready to do deep learning. The one major advantage of deep learning is that we don't have to do much preprocessing on the text. We don't have to split the text into tokens, we don't have to stem it. We let the transformer that's built into the package do the work for us. That's the magic part of this: we get to let our NLP prep work go almost completely for these models. There's really not much we need to do. Now, you can get into the weeds of how features are created for these types of models.
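The binary label and the one-to-one balancing described above might be sketched like this; the "HEALTHY LIVING" spelling and the random_state are assumptions:

```python
import numpy as np

# 1 where the category is healthy living, 0 everywhere else.
reviews["healthy"] = np.where(reviews["category"] == "HEALTHY LIVING", 1, 0)
reviews["healthy"].describe()   # mean is about 0.03 before balancing

# One-to-one balance: keep every positive, sample an equal number of negatives.
positive_df = reviews[reviews["healthy"] == 1]
negative_df = reviews[reviews["healthy"] == 0].sample(n=len(positive_df), random_state=42)
review_sample = pd.concat([positive_df, negative_df])
review_sample["healthy"].describe()   # mean should now come out to 0.5
```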
But honestly, unless you have hundreds of hours to spare, what these pre-trained transformer models are capable of doing is just quite amazing. The first thing that we've got to do is specify our class names. Zero, in this case, corresponds to not healthy living: every other type of news article that's out there. Then we've got articles that are healthy living, and that's the positive class. We just create a little list that reflects that. One thing that you might want to uncomment is this clear session. One thing that you've got to keep in mind when you're working with TensorFlow is that the state of your model is always kept. As long as your runtime is still connected, your model is saved in its current state in memory, in RAM. If you continue to train without resetting your runtime, you'll train on top of training, and remember, if you train on top of training, you're going to have issues with overfitting. A best practice is that every time you run this notebook, you do one of two things. You could keep this code uncommented so that every time the notebook runs, it resets the model from scratch. Or the better approach is to go up to Runtime and click "Restart and Run All". If you just run your models without resetting them, all sorts of weird things will start to happen with overtuning. You've got to make sure that you're resetting the state of TensorFlow, essentially resetting the weights of your model, every time you run. Otherwise, you're going to be asking a lot of, "Why is this not working? Why is the performance so bad all of a sudden?" That's always going to be the cause. We're going to transform our text using a pre-trained model. Remember, pre-trained models have a broad sense of the human language; they know the semantic relationships that words tend to have with each other. That's awesome. The model also knows how to take a document and parse it into the tokens that match that model. We can use distilbert-base-uncased or we can use roberta-base. I think DistilBERT is actually going to be the best one to use here; RoBERTa base is something that we might try later. You can browse the TensorFlow model hub, which I've provided a link for in our conceptual lecture, and check out other models that you can try here. Due to Qwiklabs or RAM limitations, a lot of the bigger models don't tend to work; I'm going to tell you that now. These are the ones that I've tried that tend to work. Any other models you might try, try at your own risk, because not all of these models were trained the same way and not all of them work with ktrain, which is the special wrapper we're using that makes this deep learning so easy to abstract and use. We're going to use this function that we've imported from ktrain called texts_from_df. What do you think it does? It takes a DataFrame and converts it into the training and validation sets needed for this problem. It also creates an object called preprocess, which is a preprocessor that can take documents and preprocess them in the necessary way. It requires our review sample. It requires the combined text column, which is the column it's going to look at to extract features, do the preprocessing, and turn those features into inputs to the neural net. It needs to know the label column, where the ones and zeros are for the training data; it needs to know whether something's healthy or not. Those are the big things that we're feeding it.
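As a small sketch, the class names and the optional session reset might look like this (the label strings are my own wording):

```python
from tensorflow.keras import backend as K

# 0 = every other kind of article, 1 = healthy living (the positive class).
class_names = ["not_healthy_living", "healthy_living"]

# Uncomment to wipe TensorFlow's in-memory state so you never train on top of training.
# K.clear_session()
```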
We don't have a separate validation DataFrame, and we want to start with something around 20,000 features. You can raise this feature count, and as you do, the model performance may improve, but remember, more features doesn't necessarily mean a better model. It's something you can explore as you're doing your project; this is a parameter you can tweak to see if your model gets better as you increase it. The max length that these BERT-style models can handle is 512 tokens, and that's going to be just fine for this project because our little news summaries are so short. You can think of it as the largest document the model can handle being one with 512 tokens in it. That's not a lot, but if you get down to key tokens, 512 can be really informative for a document. We need to say what percentage we're going to use for validation. I'm going to use one tenth of the dataset as validation; I feel like 10 percent of a 13K sample is enough to get a feel for the ground truth. You can also specify an n-gram range, which is: do you want it to only consider unigrams, in this case single words, or do you want it to consider combinations of words? If you have it consider combinations of words, it might perform better. But remember, more features does not necessarily mean a better performing model. It's another thing for you to try, but it won't necessarily make your model better. We also set the preprocess mode. It just needs to know what model you're going to be building, because the preprocessors for the various deep learning algorithms are different. We want to make sure we specify DistilBERT, because that's what we're using. We're going to allow it to print out its progress. If you don't want it to print all this out, you set verbose to zero and it won't print anything. That's good for something like a headless server setup where you don't want this stuff printing, but otherwise you probably want to be able to see it. Once you run this function, you get a little preview of what's going on and what the data looks like. This is the document ID, and this is whether the document is healthy or not. What we want to see here is ones and zeros in both columns. If we don't see these columns set up with ones and zeros, that means something in our data prep got messed up. But this looks good, so it's going to actually process the data. This is going to take a few minutes, guys, so when you run this, be patient. It's going to download actual pre-trained models from the TensorFlow Hub, and that takes a little bit of time. But you should see something that looks like this. You can see it's telling us that the average sequence is only about 30 tokens, or 30 words, and the largest one is 70, so we're well under that 512 limit. That's pretty much it for the preprocessing. To actually get the model, we call get_classifier on the preprocess object we just created. We then use ktrain.get_learner and specify the model, the training data, the validation data, and the batch size. Remember, the batch size corresponds to how many documents we're going to feed through this algorithm at any given time.
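Putting those pieces together, a sketch of the preprocessing and learner setup might look like this; the variable names and parameter values follow the description above but are my reading of ktrain's texts_from_df interface, so check them against the notebook:

```python
import ktrain
from ktrain import text

# Build train/validation sets plus a preprocessor from the balanced DataFrame.
trn, val, preproc = text.texts_from_df(
    train_df=review_sample,
    text_column="combined_text",
    label_columns=["healthy"],
    val_df=None,                  # no separate validation DataFrame
    max_features=20000,           # starting feature count — a knob to tune
    maxlen=512,                   # longest sequence these BERT-style models accept
    val_pct=0.1,                  # hold out 10% for validation
    ngram_range=1,                # unigrams only to start
    preprocess_mode="distilbert",
    verbose=1,
)

# Get the DistilBERT classifier from the preprocessor and wrap it in a learner.
model = preproc.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
```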
If we were to consider just one document at a time, it would be pretty slow to go through 13,000 documents and try to tune those knobs each time. That's a slow process, so in general you use batches. For text, batch sizes are usually 16 or under; it's usually somewhere in the 6-16 range, so 6, 8, 10, 12, 14, and 16 are common batch sizes that I see, and six is probably the most common. I don't see a performance difference here when playing with this, but it's another parameter you can play with to try to increase the performance of your model. The batch size we feed into the model will ultimately color the way the model trains, because with bigger batches we may tune toward the positive class less often, and that could be an issue. I'm going to use a batch size of 16 here, just because I was playing with it and that's what was quickest in Google Colab. You're more than welcome to decrease this value to 6, 8, or 10, and you may see higher performance, or you may not. It's one of those things where there's no one setting that always works; it's unique to the dataset. The next major variable that we can change, or tune, or tweak that really has an impact on the quality of the model is the learning rate. Remember, the learning rate, abstracted out, is this concept of turning the knob. The higher the learning rate, the more we turn the knob, as fast as we can. The lower the learning rate, the more gently we tweak the knob until we get to the right point. The right value varies by model and by dataset; there's no one right learning rate. We run the learning rate finder on our dataset to figure out approximately what learning rate tends to result in an initial improvement in loss. What do we have here? We have epochs. Each epoch is a pass through the data attempting to tune the knobs on the training data so that the loss is as small as possible. We have six epochs, and each one uses a different learning rate. It doesn't tell us what the learning rate is here, but just know that at each epoch it's trying a different learning rate to see whether one of them tends to result in a better initial loss. You can see here that this one is way high, crazy high loss, and this one is the best. Whatever the learning rate was for this pass through the data, this is the one where the model really learned. If we plot the performance, you can see that, remember, the lower the loss, the better the fit to the data. We've got a really low loss here at around 0.3, and that happens at 10 to the negative 4, so translated into a learning rate, that ends up being 1e-4. That's where we want to take this; really, you can just read it off the graph. If the lowest dip were at 10 to the negative 3, then the learning rate would be 1e-3. That's really all there is to interpreting this graph. We take the lowest point and insert that learning rate into our model so that we can do a real training run. I give it a checkpoint folder, which is just a temporary folder inside our temp files, and I give it the number of epochs that we're willing to accept. Remember, this thing could keep going through the data all night and all day, but we don't want it to just keep going. We want it to keep going only as long as the validation loss keeps dropping. Once that stops being true, we want the model to stop running.
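A sketch of the learning-rate finder and the training call, assuming ktrain's lr_find/lr_plot/autofit interface, with 1e-4 read off the plot as described and the epoch and patience numbers as placeholders:

```python
# Try a range of learning rates and plot loss against learning rate.
learner.lr_find(show_plot=True, max_epochs=6)
learner.lr_plot()

# Train with the rate at the lowest dip in the plot (1e-4 here). Stop early once the
# validation loss stops improving and checkpoint weights to a temporary folder.
learner.autofit(
    1e-4,
    epochs=10,                    # upper bound on epochs (placeholder value)
    early_stopping=1,             # patience: stop after one epoch without improvement
    checkpoint_folder="/tmp/healthy_living_checkpoints",
)
```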
We want it to be done so that we can use it. How do we do that? We specify early stopping. If we tell it to stop once the validation loss has been minimized, then we'll have a model that runs and eventually says, "I've learned." It's really fun to watch these train; as you see yours run, you'll see it tune. What you're really looking at here are a few key things. Remember, accuracy is usually a misleading metric, but not for this use case. This is the one rare case where accuracy is actually valid, because we perfectly balanced the data: 50 percent positive, 50 percent negative. The benchmark accuracy, the null-hypothesis accuracy, is 50 percent. If we had a garbage algorithm randomly guessing zeros and ones, we'd expect the accuracy to be around 50 percent. If we had an algorithm that classified everything as one, it would be 50 percent; classify everything as zero, also 50 percent. Ninety percent right out of the gate is pretty good for the first pass through the data, and you can see that on the second pass through the data it actually got better. What we're looking for is this training loss to drop, but we're also looking for the validation loss to drop as well. Let's just recap what's happening here. The training loss was 0.24. The accuracy we don't really care about. The validation loss was 0.34. We're already getting a better fit on our training data than on our validation data; that's an indication that we either should keep tuning or that we're beginning to overfit. On the next pass through the data, we see that the loss on the training set dropped more, by 0.07, but the validation loss actually went up. Do you see that? This earlier number is lower than this one. What does that mean? Practically, it means we've begun to overfit our data. On the training set we did better, but on real-world data that the model never saw, we actually did worse. So we're going to restore the weights from our first pass through the data, because that first pass was the most generalizable, the one most representative of data the model had never seen. That's the one we're going to use, and that's the beauty of early stopping. The training loss keeps getting better, and in most models it's going to keep getting better until it's nearly zero, but the validation loss at some point is going to start going up. That's when we lose sight of reality and actually start overfitting our data. In this case, that happened right after the first epoch. It's not always the case that it happens after the first epoch; often it's after the third or fourth, and with some huge datasets it can be after many epochs. However, we reached convergence here at the end of our first epoch. The model stops and reverts back to where it was at the end of the first epoch. Now, to save the model in ktrain, you can't make it up: it's a one-liner, and you've got to love it when it's this simple. Again, this is where specifying a data folder is nice; we're just saving the model into that folder. We could make a separate folder for models and save them there as well. Now that the model is saved, we can begin to use it to predict new data. It's built; it's that simple. Isn't it great, guys? Let's go ahead and call get_predictor. We give it the learner's model and we give it the preprocessing ability; these are the two things it needs. When it sees new documents, it needs to know how to separate them into tokens so they can be used.
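In code, the save and predictor steps might look like this; the folder paths and file names are placeholders, and passing learner.model is my reading of ktrain's get_predictor:

```python
# One-liner save of the trained weights into the models folder.
learner.save_model("%s/healthy_living_model" % model_drive)

# Bundle the trained model with the preprocessor so that new documents can be
# tokenized and classified in one step, then save the predictor too.
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save("%s/healthy_living_predictor" % model_drive)
```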
When it sees new documents, it also needs to have the model that we trained. We can save this predictor as well; we need to save both of these pieces to be able to reload them and classify new documents later on. The most important thing that we can do now is look at the validation and actually see how well this thing performed on validation data. This is the validation data that we split out earlier, and we want to print the report because we really want to see these metrics. Now, remember, the three metrics that we should focus on are for the positive class, because we want to know about the documents it positively classifies. If we're a contextual advertising solution, we do not care about all the times that it does not label a document; we only care about the times that it does. When it identifies an article as healthy living, how well does it do? Remember: precision, recall, F1 score. Those are the basics, and I think they work the best for classification. Precision, interpreted in this case: when our healthy living algorithm classifies a news article as healthy living, it's right 85 percent of the time. So if we label 100 articles as healthy living, 85 of them are actually healthy living, and 15 of them were labeled positive but are really negative. Not bad. I mean, 85 out of 100, that's a good exam score. I think that's pretty good for this use case. I like to go for 95 in the stuff that I build in my startup, but this is pretty good. Recall is: of all the documents that it could have classified, how many did it get? Of the total universe of documents that it could classify as healthy living, how many did it catch? We got 89 percent of those; if there are 100 out there, it got 89 of them. That's not bad either. From a contextual advertising standpoint, I would think precision matters more: when we say something is healthy living and we want to put an ad on it, we want to be darn sure that it is. But recall is also helpful; the more documents we positively classify, the more ads we can place. So both of these are pretty acceptable, and you can see here that the F1 score is the harmonic mean of the two, which, since precision and recall are close, works out to about 87. If we report in our methodology that this algorithm got an F1 score of 0.87, most data scientists are going to look at us and be like, "Well done, go best friend." If we had something in the 50s or 60s, it's probably something we'd need more data for; it's probably not something we've totally cracked. I would say this is a pretty decent model for four or five lines of code. Nice, good work. We've built a contextual advertising solution. I think it's funny: there are contextual advertising solutions out there being sold by companies with market values in the billions of dollars, and we can create the basic technology on our computers with a few blocks of code. It's an amazing time to be alive in that sense. Next, we're going to try to inspect the drivers of prediction; that is, we're going to try to get a peek under the hood of what's going on in these algorithms. The one major limitation of deep learning is that it's hard to get under the hood and figure out what is actually driving the prediction: what words, what features, what are those weights hidden in the network? You would think it would be easy to extract those weights and interpret them as usual, but it's very hard to do so with the human eye.
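ktrain's learner can print that per-class report in one call; a minimal sketch:

```python
# Precision, recall, and F1 for each class on the held-out validation data;
# the row for the positive (healthy living) class is the one we care about.
learner.validate(class_names=class_names)
```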
We need to be careful doing that. So what did I do? I went out to The New York Times and grabbed a bunch of articles. The first four are actually healthy living; the last two are not, one political and one just another kind of article. What we should see is four high probabilities and two low probabilities. All we've got to do is enumerate through the docs, feed each document to the predictor, and print out the probability. The probability is stored in the output: the first item is for one class and the second is the probability of the positive class, so we're taking the second item, the actual probability score, and printing it out. Ninety-four percent, 94 percent, 94 percent, 90 percent. Remember, the default probability cutoff is 0.5, and we're well north of that. We are locked onto something; clearly this thing understands some of the basic terminology associated with healthy living articles. Well done, guys. Let's make a company. But we'll see it's not perfect here in a minute. For the other two articles, just as we'd hoped, the probabilities are really low. The model is saying, nah, this isn't healthy living, and it shouldn't be. These are pretty obvious examples that work as expected, but this is not a perfect model, so let's take a look at a document that I made up: "I am a believer in diversity, equity and inclusion. I think it's important that we include so many other people in data science and computer science and marketing. I think the more perspectives we have the better we are as a company. So I think diversity is healthy to society." I wrote this and I wanted to see if I could fool it, because I'm using the word healthy in a context that isn't specific to a healthy living article. So healthy is a trick here; does it fall for the trick? Well, if we look at this, it's saying that it's 73 percent likely. That's markedly lower than 0.94, but it's still higher than 0.5, so it ultimately says, yes, this is a healthy living article. We fooled it. Remember, setting custom probability cutoffs can get around this. If we notice, when we're manually looking at these results, that the algorithm starts to get fooled anywhere south of 90 percent, we can set that cutoff for ourselves and say: anything that has a probability lower than 0.75, even though it might be healthy living, we just see too much error in that probability range, so we're not going to include it. That's a manual choice that we can make when we're building these classification algorithms. It's not that this model can't be used as is, but anything below the 90s so far looks like it may or may not actually be healthy living. We'd need to do a lot more inspection of articles that have these trick words in them to see how easily we can trick it. But remember, inside of our neural network each word has a weight, and each weight contributes to the score. If you see red here, the darker the red, the more negative the weight. The word diversity is negatively correlated with this class, and America is negatively correlated as well. But obviously the words healthy and live are positive, which totally makes sense, because live, life, and all of those related words are going to be positive words used in healthy living articles. The model gets tricked on the word healthy; there must be a really strong weight associated with it.
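A sketch of scoring and inspecting new documents with the predictor; the example documents below are placeholders rather than the articles used in the video, and explain() relies on ktrain's optional eli5/lime support:

```python
# Placeholder documents — substitute the articles you want to score.
docs = [
    "Ten simple stretches you can do each morning to sleep better and feel healthier.",
    "The Senate passed the spending bill after a late night of negotiations.",
]

for i, doc in enumerate(docs):
    probs = predictor.predict(doc, return_proba=True)
    # probs[1] is the probability of the positive (healthy living) class.
    print(i, round(float(probs[1]), 2))

# Highlight which words pushed the prediction toward or away from healthy living.
predictor.explain(docs[0])
```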
The final probability gets pushed north of 0.5 by a little bit, but there are still some negative weights in here suggesting that it's not healthy living. So this algorithm is not perfect, but it's certainly starting to lock onto some of these keywords. This makes me really excited; that's pretty much deep learning in a nutshell. There's so much more you could do. I gave you four parameters throughout this notebook that you can tweak and tune to try to make this model even better. Of course you can label more data, and of course trying different pre-trained models might give you different results. There's a lot more you can do: you could try undersampling the data with a different ratio; instead of one-to-one, you could do one-to-two or one-to-three and allow more negative examples into your classifier to give it more evidence to get those weights down. I think if it saw more articles with the word healthy in them that weren't healthy living articles, this model would actually perform better. There's a lot here to take in, but this is deep learning in a nutshell. I expect you to be able to run this code with very little intervention, but be patient; it does take time to run deep learning. Remember, our runtime here has got to be set to GPU or none of this is going to work. That's one thing: make sure your runtime is set to GPU or this won't run. Other than that, this is deep learning in a nutshell. Remember, we're using TensorFlow with pre-trained models from the TensorFlow Hub, and we're using the ktrain wrapper to make TensorFlow a lot easier to manage. You can obviously build these with the TensorFlow API itself, but it's a lot harder, with a lot more lines of code. I use ktrain as an abstracted way to teach deep learning. One of the downsides of using ktrain for our contextual advertising startup is that the ktrain models we save can't easily be loaded in and made to classify at the scale we'd want. Remember, there are trade-offs to every approach. Here, the trade-off is that this model is very easy to use, but the downside is that it's slow. It probably won't deploy easily on AWS, and it takes a lot to train this thing, to be honest. Once you see us training linear models in split seconds, you'll notice that this is a much heftier workflow to go through. That being said, I think it's definitely the most accurate way to do contextual advertising classification, or news article classification, or any kind of textual document classification, because it's got that pre-trained understanding of the English language baked right in. We're going to come back next lecture, build a much simpler linear model, and compare the performance. For now, remember that our algorithm got a positive-class F1 of around 0.87. We're going to see how well we can do with a really simple linear model that could scale, and compare the pros and cons.