All right, so let's continue on with our conceptual learning of topic modeling. If we're going to do topic modeling, we need a data structure to represent our documents. We've conceptualized what we need to measure, which up to this point is just the occurrence of terms in documents, but we haven't actually made a data table. One way to represent vector space is called a document-term matrix: each column represents a word, and each row represents a document. Word one, or W1, is stored in a dictionary where we can look up its meaning; word two could be cat or dog or octopus. We don't see these words in the raw document-term matrix itself, so we have to look them up. Often the columns of a document-term matrix are just identified by ID numbers such as 1, 2, 3, and so on. Most topic models use a bag-of-words approach, just like the scikit-learn method we went through earlier: each term in our corpus, or collection of documents, gets its own column. We create the bag of words by scanning over the entire collection of documents, and then we go back and count up the occurrences of those words in each article. In this set of lectures, we're going to cover basic text preprocessing, something we didn't have to spend much time on with deep learning because it does that for us. But for these bag-of-words approaches, we really do need to take some time to make sure our features are ideal. Before we can do traditional machine learning, features need to be created from the text, and that's exactly what we're doing here with the document-term matrix. The documents themselves have IDs by default. So if the third document in this matrix mentions W2, we can say that represents dog, and W4, we can say that represents octopus. A document-term matrix is a simple but efficient way to represent a text corpus; it's been around for decades and it's still used today.
Treating each term in a document with only a 1 or a 0 ignores the number of times a feature is mentioned in that document. Document-term matrices are more commonly made with counts, and even more commonly with term frequency-inverse document frequency (TF-IDF) scores. A term frequency is simply the number of times a word is mentioned in a document. We could easily put that in vector space: just sum up the number of times cat was mentioned in each tweet and put that number right there for that document. If we use the word cat twice, it gets a two in the column that represents cat. This is a document-term matrix in 10-dimensional vector space. What does that mean again? Simply that there are 10 columns, or 10 terms, in our document-term matrix. So this is an IMDB dataset that's commonly used for unsupervised machine learning: a database of movie reviews for a ton of different movies. We can do topic modeling for the entire collection of movies, but first let's look at some important document-specific scores that can help us figure out which common words actually matter to a given document. So this is a Rocky review. What we've done here is tokenization: we've taken the words, split them up, and counted them. Now we have word counts for the words that appear in the document, which helps us capture what's unique about it. But are these words really unique? Here are the most frequent words for Rocky, and they're not really descriptive or meaningful. The words at the top are what we call stop words: general parts of speech that don't carry any unique meaning. If we give this data to a topic model, we really can't expect it to pull much meaningful structure out of the document; the features themselves aren't very descriptive. I've highlighted the potentially descriptive words here in blue.
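The tokenize-and-count step for a single document can be sketched in a few lines; the review text here is a hypothetical stand-in, not the actual IMDB review from the slides:

```python
from collections import Counter
import re

# Hypothetical snippet standing in for a movie review.
review = "Rocky is a boxer. The boxer trains hard and Rocky wins."

# Tokenize: lowercase, split on non-letter characters, then count.
tokens = re.findall(r"[a-z]+", review.lower())
tf = Counter(tokens)

print(tf.most_common(5))  # raw term frequencies, most frequent first
```

Notice that even in this toy example, generic stop words like "the" and "and" show up alongside the genuinely descriptive terms, which is exactly the problem raw term frequency has.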
So term frequency alone isn't a very descriptive way to extract the meaningful terms for a document. That's why it's not commonly used on its own in topic modeling. There have been several advancements in information retrieval that help us address this. Inverse document frequency has stood the test of time. Why? Because it's good at extracting what makes the terms in a given document novel. Inverse document frequency is just a measure of how rare a term is in a corpus. Imagine the word octopus. If we had a corpus with 100 documents and octopus appeared once in that collection of documents, the IDF score for octopus would be 100. If the term appeared in two documents, the IDF score would be 50; in four documents, 25; and so on. So the rarer a term is across the collection of documents, the higher its IDF score. If we use IDF scores to look at this review, the results are somewhat better, right? The words are certainly more unique, but they don't necessarily map to what makes this document unique. Doesn't is the top word, but it isn't really descriptive of the Rocky movie at all. Other words like touting and managers are certainly used in relationship to the Rocky movie, but they don't really describe what this movie is about. So what can we do to further improve the uniqueness of our representation? It turns out that IDF is just part of the picture. When we interact, or multiply, term frequency with inverse document frequency, we get a document-specific metric. I'll say that again: a document-specific metric, something that is unique to each document. That is, TF-IDF scores change from document to document, because term frequency is a document-level score: the number of times a term was used in a specific document changes as we look at different documents. Extracting the logic here, we've built a metric that is higher for words that are mentioned often in a document, and also higher for terms that are rare. So we've built a measure for rare words being used often in a specific document.
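The simple ratio definition of IDF used above can be sketched directly. Note this is the lecture's intuitive formulation, IDF(t) = N / df(t); libraries like scikit-learn use a smoothed, log-scaled variant, but the intuition is the same:

```python
# Simple ratio IDF: IDF(t) = N / df(t), where N is the number of
# documents and df(t) is how many documents contain term t.

def idf(term, docs):
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return n / df if df else 0.0

# 100 documents; "octopus" appears in exactly one of them.
docs = [{"octopus", "arms"}] + [{"movie", "review"}] * 99

print(idf("octopus", docs))  # rare term -> high IDF: 100.0
print(idf("movie", docs))    # common term -> IDF near 1
```

The rarer the term across the collection, the higher the score, which is exactly the behavior described above.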
Another way to think of it: TF-IDF accentuates terms that are frequent in a document but not frequent in the collection of documents. So if we look at the TF-IDF scores for this document, we see that it does a wonderful job. It really is pulling out the key descriptive words that describe the movie, the kind of key search terms you might type into Google. All of the top words here are going to be helpful in deciding what makes this document unique, and therefore they will be good features for topic modeling. So if you take TF multiplied by IDF, you get term frequency-inverse document frequency. Take a look at how these scores are calculated. It's no surprise that Rocky is the top TF-IDF term: it's used 19 times in the review, and it has a fairly high IDF score, meaning it's not used much in reviews generally. Philadelphia is even rarer, but it's only mentioned five times, so it scores about a third as much. Boxer is also unique, but it's only used four times. So what do we do with these TF-IDF scores? We use them as locations in vector space. For the Rocky term, no document is going to sit as high in vector space as the Rocky review will. Remember that each document has a location in space for each term in our document-term matrix; now we're using the TF-IDF score as a representation of that location instead of term frequency. If two documents score approximately the same in terms of TF-IDF, it says: these two documents both used this rare word a lot, and not a lot of other documents did. So they might be related, and that's a good intuition, of course. Remember, this is what our final data format looks like: our final document-term matrix. This means that ultimately these TF-IDF scores get calculated and then stored in their appropriate locations. So each term in each document gets a TF-IDF score, and those TF-IDF scores are a core sense of similarity in most topic models. So again, as I've said, topic models are far from perfect.
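Putting the pieces together, here is a minimal sketch of a TF-IDF document-term matrix and the similarity intuition, using scikit-learn's `TfidfVectorizer` on a toy corpus (these example reviews are illustrative, not the actual IMDB data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "rocky the boxer trains in philadelphia",
    "rocky wins the boxing match",
    "a romantic comedy set in paris",
]

# Each row is a document's location in TF-IDF vector space.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Documents that share distinctive terms ("rocky") land closer together.
sims = cosine_similarity(X)
print(sims.round(2))
```

The two Rocky reviews score as more similar to each other than either does to the unrelated review, which is the core sense of similarity that topic models build on.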
They produce fuzzy results: commonly you'll see topics bleed into each other or contain words that just don't belong. Topic modeling is a blunt tool, and a model's best solution for topical fit does not perfectly correlate with how humans would assign topics to documents. That doesn't mean they can't be informative. They can unlock meaning and understanding within huge collections of documents, and they can cluster similar documents together. That said, there's no true right answer to what the right number of topics is. It's something you must specify before you build the model, and it's something that's generally not tuned. How many topics are in a corpus is really a subjective question, because you're ultimately accepting higher granularity when you choose a higher number of topics, and lower granularity when you choose a lower number. Once you've locked in your K, the number of topics you want from your model, there are good evaluation metrics that can help you determine how to best fit it. However, as we will see in our coding lectures, these evaluation metrics are far from perfect; you'll almost always fail at optimizing all of the metrics available to you. At some point, topic modeling is actually more qualitative than anything. You have to be able to read the topics and critically assess for yourself whether they make sense: does each one represent a collection of words that you as a human can read and interpret as a topic? If they don't, you've got to keep working on your topic model; luckily, these problems are easy to fix and address. But if you don't take the time to ask these questions, the topic models you create will be bad and ultimately not useful. Sadly, even as a research professor, I regularly see bad topic models come through academic peer review. Very few topic models that my colleagues make live up to these standards.
Making good topic models takes time. It's an iterative process that requires a lot of tuning, a lot of patience, and, honestly, being critical of the results that come out. That said, I can't wait to show you my topic modeling workbench, how I use it, and how you can get started in just a few lines of code. If you just can't wait to get started, I strongly encourage you to spend 30 minutes making a no-code topic model using a really cool website built by David Letterman. He's a software engineer specializing in machine learning and natural language processing, and he has visualized the creation of a topic model better than anyone else. If you've got a collection of documents on your computer, such as research papers or a collection of anything, you can upload it to this tool manually and start to build topic models. So now that you know what topic modeling is conceptually, you're ready to review the central course project on topic modeling. You can find it under our assignments in Coursera; we're going to walk through the code required to complete the project step by step in the upcoming lectures. There's no need to panic; for now, I just want you to get comfortable with how we're going to be using topic modeling. All right, I'll see you back in lecture, where we'll actually get started with Python, get our hands dirty, and start clustering some documents.