Hi, we're back and we're talking about text classification using supervised machine learning. I'm Chris Vargo and I'm excited to have you here today. Let's dive into a time when I used supervised machine learning to extract some data from a much larger dataset. I set out to analyze the degree to which fact checks were shared on social media. To do that, I had to take a large collection of news and figure out which articles were indeed fact checks. It was a pretty big data problem. I had about 800 US consumers and all of the news articles that they had shared on social media, about 9,000 news articles in all, and I really wanted to go inside those articles and pull out just the fact checks. I didn't want to read all 9,000 articles, although I could have. Instead, I decided I would read a small sample of them and use a supervised machine learning algorithm to solve the problem.

This is what the data looked like: each row had a URL and a piece of text, and the machine needed to give us a 1 or a 0 for that row, 1 if it was a fact check, 0 if it was not. We can look at the text of a news article, but even just looking at the URL, do you think there are words that will be associated with fact checks here? The first question you have to ask yourself with a problem like this is: can features be extracted from the data that would reasonably help a computer lock onto the problem and solve it? If the answer is yes, then sometimes the features need to be engineered, created, or preprocessed so that they'll work as well as possible. We'll go through that process when we get into code, but this problem was pretty straightforward. I had to extract the words inside the URL. Terms like "fact check" are definitely going to correlate with the positive class, and that was what I was interested in.

We call this preprocessing in the NLP world, and preprocessing is simply taking the data and putting it in the most ideal format for the machine learning algorithm to begin learning. We had to preprocess the data here: the text of the article, obviously, but also the URL, which had the title of the article inside it. It just so happens that most news articles have the title of the article right in the URL. We used a standard bag-of-words approach, where every word in our corpus received a term in what's called a document-term matrix, but we also split up the URLs so we could extract the titles as well. This was done by simply splitting the URL string on dashes and slashes, just some simple preprocessing in Python, as sketched below.
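To make that concrete, here's a minimal sketch of the idea, assuming the data sits in simple (url, text) pairs; the example rows here are mine, not the project's actual files. It splits the URL on dashes, slashes, and other separators and then builds a document-term matrix with scikit-learn's CountVectorizer.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example rows: (url, article_text)
rows = [
    ("https://example.com/politics/fact-check-candidate-tax-claim", "The claim is false ..."),
    ("https://example.com/sports/local-team-wins-title", "The home team won the title ..."),
]

def url_to_words(url):
    """Split a URL into words on dashes, slashes, and other separators."""
    return " ".join(re.split(r"[-/._?=:]+", url.lower()))

# Fold the URL words in with the article text so both become features
documents = [url_to_words(url) + " " + text for url, text in rows]

# Standard bag of words: every word in the corpus becomes a column
# of the document-term matrix
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```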
We have a bunch of news articles, but none of them are labeled. The first thing we need to do is create gold-standard data. How would we do that? A fellow professor and I, who were working on this project together, started by each labeling 100 URLs. We had a discussion after we did this: not all of our labels were actually the same, we disagreed on a couple, and we talked it through. We came up with a better strategy so that we would both label consistently. Once we got on the same page, we went out and each labeled over 1,000 unique articles. We had 2,000-plus articles in our training set, and as far as most classification problems go, that's on the lighter side, but given that this was a simple problem, we felt it was a good starting point for building a machine learning algorithm. We only looked at about 23 percent of the URLs, and I think that's important to say: we saved ourselves from reading 7,000 news articles, which, when you're doing a tedious task like this, is time well saved.

So 2,000 articles, is that enough? It depends on how many positive examples we have. Most news articles are not fact checks. A machine learning algorithm needs to observe enough positive examples to understand what separates them from the negative class. While we might think we have a lot of data here, we're really limited by the number of positive examples we have.

My rule of thumb is to never use accuracy as a classification metric unless the positive and negative classes are perfectly balanced. Why? It's easy to game accuracy. If I made an algorithm that simply labeled all the news here as not a fact check, I'd have an algorithm that was 99 percent accurate. Why? Because 99 percent of the data is not fact checks. I'd be right 99 percent of the time if I said everything was not a fact check; I'd only be wrong one percent of the time. What's the problem with that? Of course, the whole point of this analysis is to extract fact checks. If I just labeled everything as zero, that would be a total waste of time. Don't use accuracy; use precision and recall instead. Which one is more important? I would say recall in this case: because there are few fact checks, we want to surface as many as possible. You've always got to develop an evaluation strategy for your machine learning algorithm, and it's going to differ based on your challenge. Since we care about recall, that's definitely a measure to keep in mind, but we also care about precision; we don't want a bunch of articles mislabeled. An F1 score makes some sense here. An F1 score is simply a blend of precision and recall. It's the most common performance score because it is an average (specifically, a harmonic mean) of the two.

I ran a model on DataRobot, and when you run a model on DataRobot, it actually runs a whole slate of models. It's cool; it's a workbench for machine learning. It gives you the ability to go through and understand what it is about a model, or a set of models, that makes it perform better than the others. How did I choose the model? I started with linear models, that is, traditional machine learning models with linear relationships based on words extracted from the text of the documents. Why? Because I suspected the problem was simple, and penalized linear regression models tend to do a really good job on simple problems. The penalization process on a linear model means that it keeps fewer features in the final model, and fewer features really corresponds to a simpler problem. I ran 38 models through DataRobot. The best-performing one was a simple elastic net model that can be implemented in scikit-learn in just a few lines of code.
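Just to give you a feel for how few lines that is, here's a hedged sketch of that kind of elastic net text classifier, built on toy stand-in data rather than my actual labeled corpus, and scored with precision, recall, and F1 rather than accuracy. The specific hyperparameters here are illustrative, not the ones DataRobot chose.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labeled gold-standard data: 1 = fact check, 0 = not
texts = [
    "fact check is the tax claim true", "fact check viral photo is misleading",
    "fact check rating false on vaccine claim", "fact check candidate debate statement",
    "local team wins championship game", "city council approves new budget",
    "recipe for the best summer salad", "stock market closes higher today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Elastic net here means logistic loss with a mix of L1 and L2 penalties
# (loss="log_loss" in scikit-learn >= 1.1; older versions call it "log")
model = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.5, random_state=42),
)
model.fit(X_train, y_train)

# Report precision, recall, and F1 rather than accuracy,
# since the real data is heavily imbalanced
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```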
We've got a model with a log loss of 0.15. In my prior experience with binary classification, that's pretty good, but we should still try to contextualize that loss. Specifically, we should look to see whether the loss would decrease dramatically if we coded more data. I hate labeling more news because it's a tedious task, but I'll do it if I know it's going to make my model a lot better. When we want to answer that kind of question, whether we need more training data, a learning curve is what we need. DataRobot calculates this for us, which is incredibly nice.

I have about 2,000 labeled news articles. The first time DataRobot tries to build a series of models, it might only train on 300 of those articles. The next time it builds a series of models, it'll use twice that, 600 articles. Then again, it'll build a series of models using 1,200 articles in its training set. Notice that it never trains on all 2,000, because we always need to hold out validation data. Each time it builds a specific type of machine learning model, such as an elastic net, it records the performance.

There are two assumptions you want to see hold as you look at this performance graph. First, you want to see the loss decrease as more data is introduced. That generally suggests that, hey, you're onto something here: as the machine learning algorithm sees more data, it starts to get the gist of the classification project. You can see that that assumption holds for all the models here except the one on the green line, a blender model that combines a bunch of other linear models. Generalized linear models that are blended are complex; it's likely that something just went wrong with that specific model, because all the other models here improve as more data gets introduced. Most models start around 0.20 and then settle in around 0.15. That might not sound like a lot, but it's about a 25 percent reduction in loss. These models really do appear to be learning as they get more data. That suggests the training data didn't get messed up in transformation, and it also suggests there is a concept we can lock onto.

Second, you want to see that performance improvement decay over time. That is, you want to see the performance of the models settle in and not improve too much as new data is introduced. Here, half of the improvement happens after just half of the data is introduced. The loss improvement is about 10 percent from the second sample size, about 33 percent of the data, to the third sample size, which is double that, about 66 percent of the data. That slope really flattens out from step one to two compared to step two to three. It looks like we've settled in here a bit, and as such, I'm not going to assign myself more coding; I'm not going to label more news articles. I'm going to say, okay, I think we've got enough here to get the gist of what's going on.
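DataRobot drew that learning curve for me, but if you're working straight in scikit-learn you can sketch the same diagnostic yourself with its learning_curve helper. This is an illustration under my own assumptions, a simple TF-IDF plus logistic regression pipeline with five-fold cross-validation, not a reproduction of the DataRobot chart.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

def print_loss_curve(texts, labels):
    """Train on growing slices of the labeled data and report validation log loss."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

    # Always hold out validation folds; never train on everything at once
    sizes, _, val_scores = learning_curve(
        model, texts, labels,
        train_sizes=np.linspace(0.15, 1.0, 5),
        cv=5, scoring="neg_log_loss",
    )

    # scikit-learn reports negative log loss, so flip the sign back
    val_loss = -val_scores.mean(axis=1)
    for n, loss in zip(sizes, val_loss):
        print(f"{n:>5} training docs -> validation log loss {loss:.3f}")
    # You want the loss to fall as n grows and then flatten out;
    # once it flattens, more labeling probably isn't worth the effort.

# Usage (hypothetical): print_loss_curve(article_texts, fact_check_labels)
```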
I encourage you not to use log loss as your final model evaluation metric. Why? Because precision, recall, and F1 scores better contextualize your model in terms of specific error outcomes. Most folks in data science can quickly interpret an F1 score of 0.97 as very good; to reach a score that high, both the precision and the recall of the classifier on unseen data must be very high. A loss score lacks that kind of interpretation. It measures the distance of the predicted probabilities from the actual labels; it does not weigh in on how the model actually classifies documents in binary terms.

Remember that every machine learning model has its limitations. The sad truth is that a model, the moment it's launched, begins to degrade in performance as the world, and the text the world produces, begins to change. That being said, the nature of a fact check is pretty blatant, and fact checks haven't changed that much in the last few years. Could I use this model today? Well, I would probably try it on a few news articles, manually checking it to make sure it seemed to be doing its job. Then I would probably formally label a few hundred to make sure that my labels matched its predictions. If they still did, then I would probably proceed with caution. But be warned: one day there will come a time when the algorithm has drifted too far from current reality. Maybe that day hasn't come yet, but language and human nature change. All text-based machine learning algorithms get worse as time goes on, and they need to be updated with new data reflecting current realities.

Let's go through one more supervised machine learning project that I conducted and highlight the major lessons. This time, I had a very large collection of Facebook messages from everyday people like you and me, and I wanted to pull the political messages out of the dataset. I didn't feel comfortable just using a keyword approach to do this, because the political lexicon of America is diverse. A lot of different things are tied to our political system, and I just didn't see a keyword list as a viable way to capture all of them.

Remember when I said that every good classification project begins with an objective definition? Well, it does. What's political? Well, I mean, I guess everything, right? No, really though: political talk is subjective. Without a formal definition, how I would identify something as political would likely vary from how you would label it. This is exactly what definitions are for. We call these things codebooks in the social sciences. They can be anywhere from one to six pages long; codebooks can be big and have lots of concepts, or they can be just a paragraph like this. To give you a better idea of what a codebook looks like, I'm attaching, in our supplementary resources folder, a codebook that I wrote when I was doing my dissertation all those years ago. I was trying to detect specific linguistic aspects of text-based messages and tweets, linguistic characteristics called concreteness, arousal, and valence. I've included this codebook because I want you to see how meticulous you should be when coming up with your definitions, because those definitions matter more than anything else. If you're labeling data for machine learning without good definitions, you cannot label good data.

The process was the same as before. We wrote a definition, two people coded a small sample of the posts, we came back and looked at where we disagreed, and then we went on to code 5,000 posts. We coded more this time because we suspected that politics would be a broad concept and felt we would need a lot more examples. Moreover, we suspected that most Facebook posts would not be political, so we knew we were up against a needle in a haystack; we needed to find enough needles to make this algorithm work.

Here we see a blender model, that is, the score this algorithm produces is actually the result of three other models that were then averaged. This is several steps more complicated than what we previously saw in our elegant penalized regression approach. It suggests that the problem is more complicated, that not all models could pick up on all of the relevant features needed for detection, and that several together were our best bet. The downside is that when we use three models instead of one, it's going to be slower and it's going to cost more.
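If you ever wanted to reproduce that blending idea outside DataRobot, scikit-learn's soft-voting ensemble averages predicted probabilities across models in much the same spirit. The three base models below are my own illustration, not the three that DataRobot actually blended.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_blender():
    """Blend three text classifiers by averaging their predicted probabilities."""
    base_models = [
        ("logreg", make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))),
        ("nb", make_pipeline(TfidfVectorizer(), MultinomialNB())),
        ("elastic", make_pipeline(TfidfVectorizer(),
                                  SGDClassifier(loss="log_loss", penalty="elasticnet"))),
    ]
    # voting="soft" averages the three models' probabilities, like a blender model;
    # the cost is that every prediction now runs three models instead of one
    return VotingClassifier(estimators=base_models, voting="soft")

# Usage (hypothetical): blender = build_blender(); blender.fit(posts, labels)
```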
As we expected, political talk was really rare: only 1.1 percent of all messages. We wanted to be able to observe every possible example of political talk on Facebook to better understand how people express themselves politically online. So what evaluation metric should I optimize for? If we want to capture everything we possibly can, the answer is recall. We want as much political talk as we can possibly dredge up. We're even willing to sacrifice a bit of precision, falsely labeling posts as political when they're not, if that means we capture more political posts overall.

Probabilities play a big role in allowing us to optimize what data we return. Remember that machine learning classifiers return probabilities, not binary classifications, and that's a good thing. Here we have the probabilities that two documents belong to a specific class: political talk. Do you see why the algorithm is more sure that the second example is about politics? It has two key terms that give it evidence, whereas the first example only has one, and that term, debate, is likely to have usages that aren't even related to politics. Most models are going to be more sure about the second example because it has more readily measurable evidence.

We, as data scientists, have the ability to choose what our probability cutoff is. If we set it at greater than or equal to 0.62, we are going to classify both of these documents as political. If we set it to anything higher, that first document gets excluded; the further down we go, the more false positives we're going to get. Remember that you have the power to pick the probability threshold that best suits your classification problem. If we lower our probability cutoff, we're going to allow more posts to be classified as political. This chart shows the probability of political talk ranging from 0 to 1, and you can see that most of the documents classified as political sit at 0.6 or greater, in the green, but there are some in the middle of the probability range. There's an inherent trade-off as I lower my probability threshold by sliding the orange dot to the left: I get more data, more political talk, but I also get more false positives. Here's what it looked like on the actual political classifier. By lowering the probability threshold from 0.51 to 0.25, I invited 30 percent more false positives into my data. But look, the F1 actually improved slightly, because recall was boosted so greatly. So in this case, lowering that probability cutoff was the right thing to do. That won't be the case for every project, though. You won't always care most about recall; sometimes precision will be more important, and in that case you'll want to increase your probability cutoff.
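Here's a small sketch of what choosing your own cutoff looks like in code. It assumes you already have a fitted scikit-learn classifier with predict_proba and a labeled validation set; the function name and the example thresholds are mine for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_thresholds(model, X_val, y_val, thresholds=(0.25, 0.51, 0.62)):
    """Show the precision/recall trade-off as the probability cutoff moves."""
    # Probability that each validation document belongs to the positive class
    probs = model.predict_proba(X_val)[:, 1]

    for cutoff in thresholds:
        # We pick the cutoff ourselves instead of accepting the default 0.5
        preds = (probs >= cutoff).astype(int)
        p = precision_score(y_val, preds, zero_division=0)
        r = recall_score(y_val, preds, zero_division=0)
        f = f1_score(y_val, preds, zero_division=0)
        print(f"cutoff {cutoff:.2f}: precision {p:.2f}  recall {r:.2f}  F1 {f:.2f}")
    # Lowering the cutoff surfaces more political talk (higher recall)
    # at the cost of more false positives; raise it when precision matters more.
```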
Let's close this lecture by surveying some high-level ways you can do supervised machine learning. There are different platforms you can use, some free, some not, and all have trade-offs. Remember, there's no one best algorithm. Each supervised classifier comes with trade-offs in regard to how much it costs to deploy, the speed at which you can classify data, the performance it's able to achieve, and how easy it is to build and deploy on the web at scale. For example, if you just want model performance, you'll be hard-pressed to find anything better than the TensorFlow and Hugging Face Transformers approach that we'll use in this class. But that approach uses GPU power, it's slow, and it's very hard to deploy these algorithms at scale on the web. There are limitations and trade-offs to every approach.

If you just want to get up and running at your organization, building and deploying models right away, no one will get you there faster than DataRobot. It's great for linear-based supervised models and for deep learning models that do not incorporate transfer learning, but as of this lecture, they don't have transfer learning models available. As a result, its out-of-the-box results don't really rival the solutions we'll use. Why? Because those solutions come with a pre-built, native understanding of the semantic relationships in the English language. That being said, the majority of DataRobot models are linear, they're quick, they're easy to replicate in a few lines of code, and you can deploy them right on the DataRobot platform, where they work wonderfully at scale. I do worry about people using these tools with limited knowledge of, or desire to improve, them. At some point these abstracted machine learning workbenches can be really risky: if you don't really know what you're doing, you can create a model that's bad and think that it's good. Literally anyone can build a model in DataRobot. That said, DataRobot gives you all the tools you need to make the right decisions; we've looked at a number of charts in this class that were created by DataRobot. If you use DataRobot, you've got to inspect your algorithms and make sure they're doing what you expect them to do. Manual inspection of model performance is always a must, no matter what method you use; you can't get around it.

Every big tech company now offers machine learning and deep learning solutions. They all vary in cost, with deep learning costing dramatically more due to GPU usage. All of these services offer low-code solutions in Python. Google's models and tools are free to use offline, while AWS and IBM require that you use their computing resources when using their technology. That's why I like Google's ecosystem. Google tends to be the most frequent innovator in deep learning, and rightfully so: they're an information retrieval company. That's how they got their start, and they'll always be interested in innovating in these spaces because it's at the core of what they do. Finally, within the Python ecosystem, we have several native packages that do supervised machine learning. You can't throw a stone in the Python community and not run into scikit-learn. As we'll see in the next series of coding lectures, it has a suite of linear models that work really well on classification problems. Statsmodels also has models, and there are dozens of other packages that do machine learning in Python.

That is all I have to say for now about supervised text classification. Let's jump into actually doing this in Python. It's a wonderfully simple method, and I'm ready to show it to you. Remember, now that you know conceptually what supervised text classification is, it's time to go review the class project that we're about to embark on. You're ready to figure out how to use text classification for this problem. Find it under your Assignments tab in Coursera. We're going to walk through how to complete this project step by step, going over the code, so there's no need to panic. But for now, I want you to get comfortable with the project, so that as we go through it, you'll know exactly what you should be taking away. Thanks for learning conceptually about what supervised machine learning is for text classification. Let's get into that Python code.