In this hands-on lecture, I will discuss the most widely used of the basic topic modeling techniques, LDA, which stands for Latent Dirichlet Allocation. Let's create a Java file called LDA/Main.java and put it in our test.main package. I have already created LDA/Main.java for you, and it should look like this. Same as before, we are going to use New York Times news articles. We use a Scanner object to read in the articles, stop once we have read 1,000 of them from the file, close the Scanner, and then call createLDA. Now, let me explain this function in more detail. createLDA takes two arguments: the first is the number of topics and the second is the number of iterations. In practice, between 1,000 and 2,000 iterations produces the best results. For this demonstration I use only 100 iterations, for performance: more iterations means slower training, fewer iterations means faster training, and 100 is enough to show the outcome reasonably quickly. Most of the logic is coded into this createLDA function. After it runs, we are going to see what kinds of topics are generated and what the top words for each topic are. So let's take a look at createLDA, which is the core function of this LDA implementation. If you recall the lecture notes for LDA, we instantiate a series of pipes: CharSequenceLowercase, then CharSequence2TokenSequence, third TokenSequenceRemoveStopwords, and fourth TokenSequence2FeatureSequence. There is a dependency between the pipes: you cannot instantiate TokenSequence2FeatureSequence before you have instantiated CharSequenceLowercase.
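The read-then-train flow described above can be sketched in plain Java. Note this is only an illustration: the one-article-per-line format and the createLDA stand-in are assumptions here, not the exact code in the course project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ReadArticlesSketch {
    // Read at most `max` articles from the Scanner (one article per line
    // in this simplified sketch), then close the Scanner.
    static List<String> readArticles(Scanner scan, int max) {
        List<String> articles = new ArrayList<>();
        while (scan.hasNextLine() && articles.size() < max) {
            articles.add(scan.nextLine());
        }
        scan.close(); // close the Scanner once reading is done
        return articles;
    }

    public static void main(String[] args) {
        // Simulate the New York Times file with an in-memory string.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1500; i++) sb.append("article ").append(i).append('\n');
        List<String> articles = readArticles(new Scanner(sb.toString()), 1000);
        // In the real project this is where createLDA(10, 100) would run:
        // 10 topics, 100 iterations for the demo.
        System.out.println("Read " + articles.size() + " articles");
    }
}
```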
CharSequenceLowercase basically takes the input and lowercases it. Once you have a character sequence, CharSequence2TokenSequence makes tokens out of it; it takes one argument, a pattern, that is, a regular expression. After that, if you want to remove stop words from the tokenized words, you simply instantiate TokenSequenceRemoveStopwords. This pipe is optional: if you don't want to remove stop words, just comment it out. Once you have created the pipes, you instantiate an InstanceList object by passing it the list of pipes, and then you loop through the documents using an iterator. After that, you are ready to execute LDA. As I explained in the lecture notes, MALLET implements a multi-threaded LDA called ParallelTopicModel. Here you set it up with the data and the number of topics, and you add the InstanceList to the ParallelTopicModel. You can also set the number of threads; it is set to 2 here, but you can raise it to 5, 10, or 20, depending on the computing resources you have available. You also set the number of iterations; as I told you before, between 1,000 and 2,000 is ideal. After that, you call the estimate function, which runs the inference algorithm of LDA. Once this step is done, you save the topic model and your instances. Why do this? If you have a huge amount of text data, then processing it, creating the InstanceList, and generating the topic model takes time, so you don't want to regenerate the topic model every time you execute LDA. Instead, you train once and save the topic model and the InstanceList; you can save either one of the two, or both. After that we create the topics.
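To make the pipe sequence concrete, here is a plain-Java illustration of what the three preprocessing pipes do (lowercasing, regex tokenization, stop-word removal). This is not the MALLET API itself, just a sketch of the transformations; the regular expression and the tiny stop-word list are assumptions for the example (MALLET ships its own, much larger default list).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipeSketch {
    // Toy stop-word list, standing in for MALLET's default stoplist.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "to", "in");

    // Mimics CharSequenceLowercase -> CharSequence2TokenSequence
    //        -> TokenSequenceRemoveStopwords applied in order.
    static List<String> preprocess(String text, Pattern tokenPattern) {
        String lowered = text.toLowerCase();        // CharSequenceLowercase
        List<String> tokens = new ArrayList<>();
        Matcher m = tokenPattern.matcher(lowered);  // CharSequence2TokenSequence
        while (m.find()) {
            String tok = m.group();
            if (!STOPWORDS.contains(tok)) {         // TokenSequenceRemoveStopwords
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A pattern similar in spirit to the one passed to the tokenizer pipe.
        Pattern p = Pattern.compile("[\\p{L}\\p{N}_]+");
        System.out.println(preprocess("The Price of Oil Fell in 2015", p));
        // → [price, oil, fell, 2015]
    }
}
```

If you skip the stop-word step, as the transcript notes, you simply drop the `STOPWORDS.contains` check, just as you would comment out the pipe in the real pipeline.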
And topics is basically this: if you set ten topics, then we have ten topics, and for each topic we have n words. Here, if you want to print out the result of the topic model, you simply instantiate a File with the location of the output file, in this case lda_out.txt under your project, which is yTextMiner, and you call the LDA's printTopWords function. It takes three arguments: the first is the output file, the second is the number of top words, and the third is whether or not to put each word on a new line. If you want to print every single word belonging to a particular topic, you would uncomment the other call, which prints everything. But if you only want to see the top n words per topic, use the second one. All right. Now, let's go back to LDA/Main.java and simply call this. Remember, I only use 100 iterations, but you should raise it to 1,500 or so. Also, the right number of topics depends entirely on your data, on the number of documents you have. You may want to try several topic numbers and then settle on whichever one produces the best result. Let's execute LDA/Main.java. The red-font debugging messages all come from MALLET's LDA. It prints out the initial result of ten topics: for each topic, its probability value and the top words belonging to it. Right now it prints topics zero to nine, which is ten topics. Quite fast, because we have only 1,000 documents with 100 iterations. We see each topic: the topic number and its top words, which depend on the data we have. In our case, we used New York Times news articles, so the first topic consists of words like company, city, york, according, former, deal, volkswagen, plus some numeric tokens. What we may want to do here is eliminate those numeric values.
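As a rough illustration of what the top-words printout contains (this is not MALLET's printTopWords implementation, just a sketch of the idea), you can think of each topic as a map from words to weights; printing the top n words means sorting that map by weight and taking the first n entries:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TopWordsSketch {
    // Return the n highest-weighted words for one topic.
    static List<String> topWords(Map<String, Double> wordWeights, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(wordWeights.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical weights for one topic, echoing the words seen in the demo.
        Map<String, Double> topic0 = Map.of(
                "company", 0.05, "city", 0.04, "york", 0.03, "deal", 0.01);
        System.out.println(topWords(topic0, 3)); // → [company, city, york]
    }
}
```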
What we can do is put those numeric values in the stop-word list, or we can adjust the pattern, the regular expression, so that we keep only letters. There are several ways to do it, but I won't go through them here. So we have the ten-topic result printed.
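For the regular-expression approach, a pattern like `\p{L}+` matches runs of letters only, so numeric tokens never enter the vocabulary. This is a sketch under the assumption that your tokenizer pattern is the one place tokens are defined; the exact pattern you choose for your own data may differ:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LettersOnlySketch {
    // Tokenize with the given pattern; with a letters-only pattern,
    // numerics like "2015" are simply never matched.
    static List<String> tokenize(String text, Pattern pattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text.toLowerCase());
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        Pattern lettersAndDigits = Pattern.compile("[\\p{L}\\p{N}]+"); // keeps "2015"
        Pattern lettersOnly = Pattern.compile("\\p{L}+");              // drops "2015"
        String s = "Volkswagen deal 2015";
        System.out.println(tokenize(s, lettersAndDigits)); // → [volkswagen, deal, 2015]
        System.out.println(tokenize(s, lettersOnly));      // → [volkswagen, deal]
    }
}
```

The alternative mentioned above, adding the numeric strings to the stop-word list, gives the same end result but forces you to enumerate them; tightening the pattern handles all numbers at once.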