Let's talk in more detail about generative models and LDA. A generative model for text basically starts with a magic chest. Suppose you have this chest and words come out of it magically, and you pick words from this chest to create your document. When you start pulling out words, you see words like Harry, Potter, is, and then other words like movie and the, and so on. Just by looking at the first two words, you know that this chest gives out words about Harry Potter, so it represents some distribution that favors words from Harry Potter. You can then use the words that come out to generate a document, and this document will be something like "the movie Harry Potter is based on books from J.K. Rowling." So in the generation process you have a model that gives out words, and you use the words coming from that model to generate the document.

But you could also go the other way. You could start from the document and count how many times the word "the" occurs, or "harry" occurs, or "potter" occurs, and then create a distribution of words, that is, a probability distribution of how likely it is to see the word "harry" in this document, or the word "movie" in this document. When you infer this model, you'll notice that the word "the" is the most frequent; its probability is 0.1, meaning one in ten words is "the". Then you have "is", and then "harry" and "potter", and so on. Notice that because the documents were about Harry Potter, the model favors the words "harry" and "potter". It is very unlikely that you would see "harry" and "potter" this frequent in any other topic model or in any other corpus of documents.

So here you had a very simple generative process: you had one topic model and you pulled words out of that topic model to create your document. That was the generation story. In most cases, however, the generation story is more complex. Suppose that instead of one topic you have four topic models, four chests, and a magic hat that pulls words from these chests at random, or according to its own policy of choosing one chest over another. You still use the words that come out to create the same document, but now the model is more complex: the generation first decides which chest a word comes from, and once that choice is made, the word is drawn from that chest's own distribution. When you use such documents to infer your models, you need to infer four models, and you need to somehow figure out in what combination words came from these four chests, these four topics. So you not only have to figure out the individual topic models, the individual word distributions, but also the mixture of how these four topics were combined to create one document.

This is typically called a mixture model. The first one, which we saw on the previous slide, was a single-topic (unigram) model, where you have one topic distribution and you draw words from it; here, it's a mixture of topics. The same document is generated by four different topics, some of them represented with a higher proportion and others less so. This should remind you of the example we started this topic modeling discussion with.
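To make the two generation stories concrete, here is a toy sketch in Python. The word probabilities and mixture weights are made-up numbers for illustration (not learned from any data), and only two chests are used here for brevity where the lecture uses four.

```python
import random

# Toy word distributions for two "chests" (topics); all probabilities are invented.
harry_potter_topic = {"the": 0.25, "harry": 0.20, "potter": 0.20, "movie": 0.15, "is": 0.20}
genetics_topic     = {"the": 0.25, "gene": 0.25, "dna": 0.20, "genome": 0.15, "is": 0.15}

def draw_word(topic):
    # Pull one word out of a single chest according to its distribution.
    words, probs = zip(*topic.items())
    return random.choices(words, weights=probs, k=1)[0]

# Single-topic (unigram) story: every word of the document comes from one chest.
single_topic_doc = [draw_word(harry_potter_topic) for _ in range(10)]

# Mixture story: the "magic hat" first picks a chest, then a word from that chest.
topics = [harry_potter_topic, genetics_topic]
mixture = [0.7, 0.3]  # the hat's own policy of choosing one chest over the other

def draw_word_from_mixture():
    chest = random.choices(topics, weights=mixture, k=1)[0]
    return draw_word(chest)

mixture_doc = [draw_word_from_mixture() for _ in range(10)]
print(single_topic_doc)
print(mixture_doc)
```

Going the other way, from documents back to the chests and the hat's mixing weights, is exactly the inference problem described above.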
That was the "bare necessities" Science article, where you saw a topic model for computation, another topic model for genetics, a third topic model for anatomy that was not represented as well in the document, and so on. This is a similar kind of model.

LDA is another such generative model. The generative story for a document d is: first, you choose the length of the document you are generating; then you choose a mixture of topics for that document; and then you use each topic's multinomial distribution, that is, its word distribution, to output words to fill that topic's quota. Suppose you decide that for a particular document 40% of the words come from topic A; then you use topic A's multinomial distribution to output those 40% of the words. A small code sketch of this generation story follows below. This is a very simplistic explanation of LDA. Some of you might have seen more complex mathematical notation, called plate notation, used to define topic models; that is something we'll leave for future study, and I can point to some of it in the reading list. For now, it is enough to understand that LDA is also a generative model, and that it creates its documents based on some notion of the length of the document, the mixture of topics in that document, and the individual topics' multinomial distributions.

In practice, the question you need to ask when you create a model such as LDA is how many topics you want. There is no good answer; finding or even guessing that number is actually very hard. So you have to say, for example, that you believe there might be five topics, or that you would prefer learning five distinct topics over 25 topics that are very similar to each other. You make that choice based on a guess of how distinct the topics could be. But if you are in a domain where you know the topics a little better, for example all medical documents, and you know that these documents come from radiology, pathology, urology, and other such streams, then you might say, I'm interested in these seven streams of medicine, and those are my topics. There, at least, you have some sense of how many topics there should be.

The other big problem is interpreting the topics. You get topics, but topics are just word distributions: they only tell you which words are more frequent or more probable coming from a particular topic and which ones are less probable. Making sense of that, or generating a coherent label for a topic, is a subjective decision. There has been some work on generating names for these topics, but most of the time, when you see a name in a topic model, it was assigned manually: people look at words like genetics and genes and say this is a genetics topic, or they see computation, model, data, and information and say this has something to do with computation or computer science or informatics. Those names are fairly subjective. The actual topics you learn from LDA, though, are basically the solution of an optimization problem, so they are more deterministic in that sense.

To summarize, topic modeling is a great tool for exploratory text analysis that helps you answer the question of what these documents are about, what this corpus is about. The corpus could be a corpus of tweets, a corpus of reviews, of news articles.
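Here is the sketch of the LDA generation story mentioned above. The vocabulary, the document length, and the Dirichlet parameters are all made-up assumptions so the sketch runs; in real LDA the per-topic word distributions are what gets learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny made-up vocabulary and four topics.
vocab = ["the", "harry", "potter", "movie", "gene", "dna", "model", "data"]
n_topics = 4

# Per-topic word distributions (each row sums to 1).  Sampled here only so the
# example is self-contained; in real LDA these are learned.
topic_word = rng.dirichlet(alpha=[0.5] * len(vocab), size=n_topics)

def generate_document(doc_length=20):
    # 1. Choose the length of the document (passed in here).
    # 2. Choose a mixture of topics for this document.
    topic_mixture = rng.dirichlet(alpha=[0.1] * n_topics)
    words = []
    for _ in range(doc_length):
        # 3. Pick a topic according to the mixture, then a word from that
        #    topic's multinomial (word) distribution, filling its quota word by word.
        z = rng.choice(n_topics, p=topic_mixture)
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return words

print(generate_document())
```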
So you might get a big dump of tweets and ask, what are people talking about in these tweets? What are the different themes that come out of them? There are many tools available to do this fairly effortlessly in Python, so let's take an example of how to do it. There are many packages available, among them gensim and lda; we are going to talk about gensim more in the next few slides.

But before you use any of these packages, you need to pre-process the text. I would encourage you to recall what we talked about early on in this course, right in the first module, about pre-processing text. You need to tokenize the text and normalize it, which usually means making it all lowercase (or at least deciding whether you should lowercase it or not), and you remove stop words. Stop words are common words that occur frequently in a particular domain and are not meaningful in that domain. In general English, for example, words like "the" and "is" might be words you want to remove, while in the area of medical documents, say clinical notes, you would always see the words "patient" and "doctor", and they may not be as important as other words, like what the medication is and what the disease is. Then you may want to treat "patient" and "doctor" as stop words in that context. The other pre-processing step is stemming, which means normalizing derivationally related forms to the same word: meet, meeting, and met should all become meet, let's say.

Once you have done the pre-processing steps, you convert the tokenized documents into a document-term matrix, going from which document has which words to which words occur in which documents. Getting that document-term matrix is the important first step in working with LDA, and once you have built it, you build the LDA model on top of it.

So once you have the mapping between terms and documents, suppose you have a set of pre-processed text documents in the variable doc_set. Then you can use gensim to learn LDA this way. You import gensim, and specifically you import corpora and models. First you create a dictionary; the dictionary is a mapping between IDs and words. Then you create the corpus by going through all the documents in doc_set and converting each one to a bag-of-words representation; this is the step that creates the document-term matrix. Once you have that, you pass it into the LdaModel call, gensim.models.ldamodel.LdaModel, where you also specify the number of topics you want to learn (in this case we said four topics) and the id2word mapping, which is the dictionary learned two steps earlier. You can also say how many passes it should make over the data, and there are other parameters that I would encourage you to read up on. Once you have defined this ldamodel, you can use it to print the topics. In this particular case we learned four topics, and you can say, give me the top five words of each of these four topics, and it will print them out for you.
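A minimal sketch of that pipeline with gensim might look like the following. The raw documents, the tiny stop-word list, and the parameter choices (four topics, 50 passes) are illustrative assumptions, not fixed recommendations; a real pipeline would also stem the tokens (for example with NLTK's PorterStemmer) and use a fuller stop-word list.

```python
import gensim
from gensim import corpora, models

# A toy corpus standing in for the raw documents (illustrative only).
raw_docs = [
    "The movie Harry Potter is based on books from J.K. Rowling.",
    "Genes and DNA sequences are compared across genomes.",
    "The model uses data and computation to predict outcomes.",
    "Doctors study anatomy and the organ systems of patients.",
]

# Pre-processing: tokenize and lowercase, then drop a small (made-up) stop-word list.
stop_words = {"the", "is", "on", "from", "and", "are", "of", "to", "uses"}
doc_set = [
    [w for w in gensim.utils.simple_preprocess(doc) if w not in stop_words]
    for doc in raw_docs
]

# Dictionary: the mapping between word IDs and words.
dictionary = corpora.Dictionary(doc_set)

# Corpus: each document as a bag of words -- this plays the role of the
# document-term matrix.
corpus = [dictionary.doc2bow(doc) for doc in doc_set]

# Learn the LDA model; num_topics and passes are choices you tune.
ldamodel = models.ldamodel.LdaModel(
    corpus, num_topics=4, id2word=dictionary, passes=50
)

# Top five words of each of the four topics.
print(ldamodel.print_topics(num_topics=4, num_words=5))
```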
The ldamodel can also be used to find the topic distributions of documents. When you have a new document, you apply the ldamodel to it, that is, you infer on it, and you can ask what the topic distribution across these four topics is for that new document.

So the take-home concepts here are these. Topic modeling is an exploratory tool that is frequently used in text mining. LDA, or Latent Dirichlet Allocation, is a generative model that is used extensively in modeling large text corpora; there are other topic models available, PLSA being another one. In addition to being an exploratory tool, LDA can also be used as a feature selection technique for text classification and other tasks. For example, if you want to remove all features coming from words that are fairly common across your corpus, or you want to restrict your features to those coming from specific topics, then you could first train an LDA model and then generate features based only on the words that come from the specific topics of interest (there is a brief sketch of this idea at the very end). So in general, LDA is a very powerful tool and a text clustering tool that is fairly commonly used as a first step to understand what a corpus is about.

I hope you learned how we could use topic modeling, and that this gives you a brief introduction to topic models. There are many, many things you could go into in more detail, but for this course, I think we'll leave it there.
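And here is the brief sketch promised above, continuing from the gensim example. It infers the topic mixture of a new document and uses it as a feature vector, and it pulls out the top words of one topic of interest; the new document text is invented, and the variable names carry over from the earlier sketch.

```python
# Continues from the dictionary / ldamodel built in the earlier sketch.
new_doc = "the new harry potter movie"   # made-up example document
new_bow = dictionary.doc2bow(gensim.utils.simple_preprocess(new_doc))

# Topic distribution across the four topics for this new document.
doc_topics = ldamodel.get_document_topics(new_bow, minimum_probability=0.0)
print(doc_topics)   # list of (topic_id, proportion) pairs

# Use that distribution as a dense feature vector for a downstream classifier.
features = [0.0] * ldamodel.num_topics
for topic_id, prob in doc_topics:
    features[topic_id] = prob

# Or restrict word features to the vocabulary of one topic of interest,
# e.g. the top ten words of topic 0.
topic_vocab = {word for word, _ in ldamodel.show_topic(0, topn=10)}
```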