Hi again, we're back and we're talking more conceptually about topic modeling and how it can be used in a marketing context. When you're fitting a topic model, what are you asking the computer to do in concrete terms? The assumption here is that documents within a cluster should be similar to each other, and that the topics should be dissimilar from one another. Documents in the same topic should share many of the same words, and documents from different clusters should look different. These are two things that I hope to see qualitatively when I'm inspecting a topic model. We don't want the topics to overlap too much. That is, when multiple topics seem to contain the same words, there's probably an overlap we can address by changing parameters. My only critique of the Yik Yak topic model that I showed you is that the topics seem pretty similar. There's a decent amount of overlap there. Overall, you'll have to accept some level of error in topic modeling. I spent multiple hours trying to fit that model, and that was the best I could do. A topic model can lock onto spurious correlations, just like linear regression can lock onto a spurious relationship between two variables. We're going to have to figure out what we mean when we say similarity, and we're going to have to model it computationally. Similarity here means putting documents in a vector space and selecting the documents that go together in that space; we're measuring how close the documents are to one another in that vector space. If we can represent a document in space, then we can see what documents are in close proximity to it. Don't think about documents as anything more than terms. What we're going to do for our topic models is pre-process the documents so that they're no longer continuous lines of text but instead bags of terms, just like we did in the supervised machine learning class.
Each term that we find across our collection of documents will get a dimension in vector space. One dimension could be the word cat, another could be dog, and a third could be octopus. That's one of my toddler's favorite words. Now we have three terms. If a document mentions the words cat and dog, it gets put somewhere in vector space. If it mentions cat, dog, and octopus, it gets put in a different spot. If it mentions just octopus, it gets put over to the right. Anything beyond three dimensions gets really hard to visualize, just because it's hard to display many dimensions in a single chart. But just because it's hard to display doesn't mean it's hard for a computer to calculate these vector space dimensions for each document in the corpus. This way we are essentially recording locations for documents: where they live in vector space. The idea is that we're trying to find clusters of documents that exist together, and we're drawing imaginary boundaries around those clusters to say that these documents represent a topic. That's what we're doing with topic modeling. Here are two more vector space examples, with axes X and Y. Term X could be cat, and term Y could be dog. If a document mentions both at the same time, it might be represented in one spot in vector space; if it mentions one term or the other, it would be down there or over there, and so forth. It doesn't have to be binary, either: we can count up the number of times a term is used in a document and represent that count in vector space as well, so documents with multiple mentions end up further out. As we'll learn, there are even better measures for representing a term in vector space that more accurately capture the uniqueness of that term to a specific document, called term frequency-inverse document frequency. We'll cover that in just a few slides.
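The cat/dog/octopus idea can be sketched in a few lines of Python. This is a toy sketch with a made-up three-document corpus, not the lecture's actual data: each document becomes a vector of term counts, one dimension per vocabulary term.

```python
from collections import Counter

# Hypothetical toy corpus; the vocabulary mirrors the lecture's example.
docs = [
    "cat dog",          # mentions cat and dog
    "cat dog octopus",  # mentions all three terms
    "octopus",          # mentions only octopus
]

vocabulary = ["cat", "dog", "octopus"]  # one vector-space dimension per term

def to_vector(text, vocab):
    """Count how many times each vocabulary term appears in the document."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

vectors = [to_vector(d, vocabulary) for d in docs]
print(vectors)  # each document now has a location in 3-D term space
```

Because `to_vector` returns counts rather than a 0/1 flag, a document that says "cat" twice lands further out along the cat dimension than one that says it once, which is the non-binary representation described above.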
This is what we refer to as a two-dimensional vector space: there are two terms in the vector. On the right is a three-dimensional vector space: there are three terms. What we're looking at are the values for each document across these terms. How many times does the document mention octopus? How many times does it mention cat? How many times does it mention dog? One blue dot represents one document: one tweet, one news article, and so on. If we plot all of the documents for an entire collection, then we can begin to see how they cluster in space. Here we're looking at two terms again: x is cat and y is dog. We can see three points. Let's just say they're books, because the counts are so high. We'll get into what these values actually represent in a little bit; they relate to a vector space weighting called term frequency-inverse document frequency. The first book mentions the word cat 101 times and dog 43 times. The second book has 96 mentions of cat and 30 mentions of dog. We've labeled these three documents and put them in vector space. Now we've got to measure the distance between the documents. We want to form collections of documents where the distance between the documents inside a collection is as small as possible; that's what we're optimizing for, along with the space in between the topics. That's where learning occurs in unsupervised topic modeling. We could add up the distance between these three documents by simply subtracting their locations from one another, capturing the actual space represented inside of this vector space topic. The idea is that we want to select topics where we're minimizing this space; the broader the topics, the more space there will be inside them. You don't need to worry about this math, just like you don't need to worry about tuning Keras neural network nodes.
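Here's roughly how the distance between the two books would be computed from the cat/dog counts given above (the straight-line distance the lecture goes on to call Euclidean distance). The third book's counts weren't given, so only two are shown.

```python
import math

# Cat/dog counts for the two books described in the lecture:
# (cat mentions, dog mentions)
book1 = (101, 43)
book2 = (96, 30)

def euclidean(a, b):
    """Straight-line distance between two points in term space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclidean(book1, book2), 2))  # sqrt((101-96)^2 + (43-30)^2)
```

The same function works unchanged for three or more terms, since it just sums the squared differences across however many dimensions the vectors have.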
But we do want to observe these distances as our topic model learns, to make sure that the distances inside of our topics shrink. We're just trying to figure out how far away these points are, and then we're trying to represent that as an average distance. This is what we mean when we say Euclidean distance: the straight-line distance between documents in vector space. There are three clusters in this set of documents; we could actually just draw them. Remember, we're trying to do two things: one, minimize the space inside of the topics, and two, maximize the distance between topics. We draw these topic lines by setting a central point for each topic and then setting a boundary outside of that center point. The central point of a topic is called a cluster centroid. For this topic, it would be here in the middle of this triangle. The machine is trying to find, or learn, where the central points are. There are three clear clusters here, and for us as humans, it's easy to see where the centroid is located for each cluster. Our algorithm, though, doesn't have this intuitive vision. It must test a ton of points to figure out where the centroids are. Think of it as playing darts until you get the bullseye or something really close to it. We throw a dart on the board, and then we measure the Euclidean distance to the closest documents. We throw another dart and measure again. We throw another and another, and we keep going for a set number of throws. Once we've thrown all the darts we have time for, the throw with the smallest distance wins as the best fit for a topic centroid: we look at the Euclidean distances and pick the smallest one after all of our random tries. This approach requires us to tell the computer how many topics to optimize for, that capital K you see in topic modeling.
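The dart-throwing search can be sketched like this. It's a hypothetical toy example, not the lecture's code: six made-up document locations, one candidate centroid at a time, and each "dart" is scored by its total Euclidean distance to the documents (a simplification of the closest-documents measure described above).

```python
import math
import random

# Six made-up document locations forming two visible clusters.
docs = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

def total_distance(dart, points):
    """Sum of Euclidean distances from a candidate centroid to every document."""
    return sum(math.dist(dart, p) for p in points)

random.seed(0)  # make the throws reproducible
best_dart, best_score = None, float("inf")
for _ in range(200):  # throw 200 darts at the board
    dart = (random.uniform(0, 10), random.uniform(0, 10))
    score = total_distance(dart, docs)
    if score < best_score:  # the smallest distance wins
        best_dart, best_score = dart, score

print(best_dart, best_score)
```

With enough throws, the best dart tends to land near the middle of the point cloud, which is the intuition behind random search for a centroid.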
That means we have to specify the number of topics we want returned in a model before we start training it. It's often a heuristic because, let's be honest, we don't know anything about a collection of one million tweets. We don't know how many topics are going to be in there. Statistical tools like the elbow method can help us figure out when the distance is being maximized across topics. I think of these tools as helpful for getting started with model fit, and we'll cover how to use them in our code walk-throughs. But I can't stress this enough: intuition is so important. The best topic model is one that makes sense to you, the scientist. Once we give that number to the model, it'll try to find the centroid points where the most central point for each cluster exists, the distance is minimized inside of the topic, and the distance is maximized outside of the topic. If you see these boundary points here, I think that's a fairly good job. Where's the centroid point in this collection of documents? Right in the middle. It's easy for us to see this as humans, but it's not easy for a computer. How does this process work? We randomly select a few documents as starting centroids. Our topic model then assigns each document to its nearest centroid; if we have five documents, each one gets assigned to the centroid it's closest to. Then it computes the distance for all of the topics in the data. Then it goes through this process again: it re-assigns documents, re-computes the Euclidean distance, and keeps going until the best Euclidean distance doesn't change very much. Eventually, by selecting and re-selecting centers, it converges on a centroid for each topic in our collection of documents.
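The assign-and-recompute loop just described is, in spirit, k-means clustering. Here's a minimal sketch of it under made-up assumptions: the document coordinates are invented, and the `kmeans`/`kmeans_once` helpers and their parameters are my own illustration, not the lecture's code.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_once(points, k, rng, iters=20):
    """One run: assign each document to its nearest centroid, recompute, repeat."""
    centroids = rng.sample(points, k)  # start from k randomly chosen documents
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # move each centroid to the mean of its documents
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # total within-topic distance, the quantity we're trying to minimize
    inertia = sum(min(euclidean(p, c) for c in centroids) for p in points)
    return centroids, inertia

def kmeans(points, k, restarts=10):
    """Try several random starts and keep the one with the smallest distance."""
    runs = [kmeans_once(points, k, random.Random(seed)) for seed in range(restarts)]
    return min(runs, key=lambda r: r[1])[0]

docs = [(1, 1), (2, 1), (1, 2), (9, 9), (8, 9), (9, 8)]  # toy term counts
centroids = sorted(kmeans(docs, k=2))
print(centroids)  # one centroid per cluster, near the middle of each group
```

Note that `k` has to be handed in up front, exactly the capital K discussed above, and the restarts mirror the keep-trying-until-it-stops-improving behavior the lecture describes.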