Welcome back. In this video, we are going to talk about topic modeling. Let's take as an example this article from Science, "Seeking Life's Bare (Genetic) Necessities." As you look through it, you'll notice that some words have been highlighted. You have words such as genes and genomes highlighted in yellow; words such as computer, predictions, computer analysis, and computation in blue; and then organism, survive, or life in pink. This demonstrates that any article you see is likely to be formed of different topics, or sub-units, that intermingle very seamlessly in weaving the article together. This observation is the basis of one of the leading lines of research in text mining, topic modeling, and this particular example comes from the Latent Dirichlet Allocation paper. Here you have three topics: genetics in yellow, computation in blue, and life-related, or life science, let's say, in pink. This shows that documents are typically a mixture of topics. So you have topics coming from genetics, computation, or even anatomy, and each of these topics is basically a set of words that are more probable under that topic. When you're talking about genes and DNA and so on, you are mostly in the genetics realm, while if you're talking about brain and neuron and nerve, you are in anatomy. If you're talking about computers and numbers and data and so on, you're most likely in computation. So when a new document comes in, in this case this article on seeking life's bare genetic necessities, it comes with its own topic distribution. For that particular article, there is some topic distribution over these topics. Assume there are only four topics in the world: genetics, computation, life sciences, and anatomy. Obviously, that's not true.
But let's assume, for the sake of this example, that these are the only four topics there are, and that this particular article is generated by these four topics in some combination, where anatomy, the green one, is absent, and computation, for example, is the most probable. But then genetics also contributes a percentage, along with a little bit of life sciences. So what is topic modeling? Topic modeling is a coarse-level analysis of what is in a text collection. When you have a large corpus and you want to make sense of what this collection is about, you would probably use topic modeling, because you would say: let's figure out what kinds of documents are in this collection. Are they all about sports? Are they all about business? Or are they all about computers? And if they are all about computers, are they about architecture, or are they about algorithms, which are different sub-units within a larger unit? A topic is a subject or theme of a discourse, and topics are represented by a word distribution. That means each word has some probability of appearing in that topic, and different words have different probabilities within the same topic. So, for example, if you see basketball, or a player, or a score, you are more likely to be in the topic of sports. And if you are in the topic of sports, then words such as player, team, and score are more likely to appear. The word team may also appear in social science studies, but not as frequently; it is less probable there, though still possible. So a particular word has a different probability of occurring in each topic, and a topic is essentially a probability distribution over all words. A document, in turn, is assumed to be a mixture of topics. So, for example, you will have a topic with words like human, genome, DNA, and so on. That is probably the genetics topic.
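The two ideas above, a topic as a word distribution and a document as a mixture of topics, can be made concrete with a tiny sketch. All topics, words, and probabilities here are made up for illustration:

```python
# Toy sketch (hypothetical numbers): each topic is a probability
# distribution over words, and each distribution sums to 1.0.
topics = {
    "genetics":    {"gene": 0.4, "dna": 0.3, "genome": 0.2, "data": 0.1},
    "computation": {"computer": 0.4, "data": 0.3, "number": 0.2, "model": 0.1},
}

# A hypothetical document-level topic mixture (also sums to 1.0).
doc_topic_mix = {"genetics": 0.7, "computation": 0.3}

def word_prob(word, topics, mix):
    """P(word | document) = sum over topics of
    P(topic | document) * P(word | topic)."""
    return sum(mix[t] * topics[t].get(word, 0.0) for t in mix)

# "data" appears in both topics, so both contribute:
# 0.7 * 0.1 + 0.3 * 0.3 = 0.16
print(word_prob("data", topics, doc_topic_mix))
```

A word like "data" picks up probability from both topics, which is exactly how the highlighted article mixes genetics and computation vocabulary in one text.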
You have another topic with evolution, species, organism, life, and biology. You have a third topic on disease, host, bacteria, and new strains, and another one on computer, model, information, and data. And you can see these topics as word distributions, where, for example, a topic is what is in a column, and the words are sorted by how probable they are. So computer or model is the most probable word in the fourth topic. So when you're doing topic modeling, what's known, and what's given to you? What you're given is a text collection, or corpus, and you are somehow given the number of topics. Let's say we are interested in 20 topics, and we must somehow group the words and find 20 topics in this large collection. What's not known are the actual topics. You are not told that you should find these specific 20 topics; all you're given is that you should find 20 of them. But you could find any 20 topics, and you want to find topics that are coherent, so that's part of the problem. You're also not given the topic distribution for each document, so you're not told that this particular document is all about sports, or that that one is 50% sports and 50% genetics. That distribution is not known either. Essentially, topic modeling is a text clustering problem. However, in this particular case, the documents and words are clustered simultaneously. You need to figure out which words come together: which words are similar, or semantically related, to each other. Recall one of the previous videos, where we talked about semantic similarity of words. And then you also need to figure out which documents come together: which documents are about the same topic, or mostly about the same topic, and how the word clusters get derived from those documents.
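Since a learned topic is just a word distribution, presenting it as a column of words amounts to sorting by probability. A small sketch, with invented probabilities:

```python
# Hypothetical learned topic: a word distribution, shown here as a dict.
topic = {"computer": 0.30, "model": 0.25, "information": 0.20,
         "data": 0.15, "species": 0.05, "gene": 0.05}

def top_words(topic, n=3):
    """Return the n most probable words of a topic, highest first."""
    return sorted(topic, key=topic.get, reverse=True)[:n]

print(top_words(topic))  # ['computer', 'model', 'information']
```

This is exactly how topic-model output is usually displayed: each column is one topic, listed from its most probable word downward.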
So how do you build such a topic model, to learn both the distribution of topics in a particular document and the probability of each word in a topic? Different topic modeling approaches are available, and new models are proposed regularly in the computer science literature. The most common ones, and the ones that started this field, are Probabilistic Latent Semantic Analysis (PLSA), first proposed in 1999, and Latent Dirichlet Allocation (LDA), proposed in 2003. LDA is by far one of the most popular topic models, and we're going to talk about it in more detail in the next video.
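As a rough preview of what fitting such a model looks like, here is a minimal toy sketch of LDA inference via collapsed Gibbs sampling, one common way to fit LDA (the original 2003 paper uses variational inference instead). The tiny corpus and the hyperparameter values are made up for illustration:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. Returns smoothed
    per-document topic distributions and per-topic word counts."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size

    # z[d][i]: topic currently assigned to word i of document d.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables the sampler maintains.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Take the current assignment out of the counts...
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...and resample it proportionally to
                # P(topic | document) * P(word | topic).
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][w] + beta)
                           / (topic_total[t] + V * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic distributions (each row sums to 1).
    theta = [[(c + alpha) / (len(docs[d]) + num_topics * alpha)
              for c in row]
             for d, row in enumerate(doc_topic)]
    return theta, topic_word

# A made-up four-document corpus mixing two themes.
docs = [
    ["gene", "dna", "genome", "gene", "dna"],
    ["computer", "data", "model", "computer", "data"],
    ["gene", "genome", "dna", "genome"],
    ["model", "data", "computer", "model"],
]
theta, topic_word = lda_gibbs(docs, num_topics=2)
```

Note what matches the problem setup above: the corpus and the number of topics are given, while the topics (topic_word) and the per-document mixtures (theta) are what the sampler has to discover.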