Mallet is a Java-based package for statistical natural language processing, document classification, topic modeling, and many other text mining applications. The details of Mallet can be found at its homepage, mallet.cs.umass.edu. Mallet can be used in two modes. The first one is the command-line script, which can be executed in a terminal window such as CMD or a Linux terminal by typing bin/mallet followed by the name of the command you want to use, along with the option names and the values associated with them. These commands invoke Mallet's text user interface (tui) classes. If you want to use Mallet as part of your own software, you may want to use the Mallet API. In our lab session, we will use the Mallet API. To learn more about Mallet, you can consult the Quick Start guides, which focus on command-line processing, or the developers' guides, which come with Java examples.

Several topic models are implemented in Mallet. The first one is Parallel LDA, a simple parallel threaded implementation of LDA with a sparse LDA sampling scheme. DMR LDA is an implementation of Mimno and McCallum's Dirichlet-multinomial regression (DMR) algorithm, and it is not multi-threaded. Topical N-grams is like latent Dirichlet allocation but with integrated phrase discovery. The Polylingual Topic Model is latent Dirichlet allocation for loosely parallel corpora in arbitrary languages. The Hierarchical Pachinko Allocation Model is a hierarchical PAM where each node has a distribution over all topics on the next level, plus one additional node-specific topic. For more information about these algorithms, read through their package, cc.mallet.topics.

Let's take a look at how Mallet handles input text. Given a set of documents, shown on the left side of the diagram, Mallet first transforms the text documents into a set of vectors, like X1, X2, and so on. While it keeps track of the index position of each X, it also retains the meaning of each vector index. The outcome of this transformation is a sparse matrix. The elements of the vectors are called feature values. For example, suppose there is a feature at row 345 from the top of the index in the diagram, and its value is 10. The feature in this case is the term "dog", and 10 is the number of times "dog" appears in the document.

Now, let me give you more explanation of how Mallet transforms documents into vectors. As shown in the figure, there is a very simple document that contains "Call me Ishmael." This document is fed into Mallet, which tokenizes it and lowercases the tokens. The tokens are then converted into features while preserving their sequence in the document, and they can be further transformed into bag-of-words features. This sequence of steps is executed by Mallet's Pipe classes.

Once the transformation of the input text is finished, the transformed data is stored in a Mallet object called Instance. A Mallet Instance is composed of the following four fields. The first one is name; this field acts as the name of the Instance and is used to identify it. The second field is label, which is primarily used for classifying Instances; labels represent the classes in the classification module. The third one is data, which is generally a feature vector or feature sequence, for example, the features of the words. The last one is source, which carries information about the source of the Mallet Instance.

The Alphabet class represents the mapping between integers and objects. Integers are assigned consecutively, starting at zero, as objects are added to the Alphabet. Objects cannot be deleted from the Alphabet, and thus the integers are never reused.
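To make these pieces concrete, here is a minimal sketch of constructing an Instance and an Alphabet through the Mallet API. The document text, identifier, and source file name are hypothetical; in practice the data field is usually produced by a pipe rather than set by hand.

```java
import cc.mallet.types.Alphabet;
import cc.mallet.types.Instance;

public class InstanceSketch {
    public static void main(String[] args) {
        // An Instance bundles data, target (label), name, and source.
        // Here the data is just the raw text; normally a Pipe would
        // convert it into a feature sequence before modeling.
        Instance doc = new Instance(
                "Call me Ishmael.",    // data
                null,                  // target/label (none needed for topic modeling)
                "moby-dick-0",         // name (hypothetical identifier)
                "melville.txt");       // source (hypothetical origin)

        // An Alphabet maps objects (here, word types) to consecutive
        // integer indices starting at zero; entries are never removed.
        Alphabet alphabet = new Alphabet();
        int callId    = alphabet.lookupIndex("call");     // 0
        int meId      = alphabet.lookupIndex("me");       // 1
        int ishmaelId = alphabet.lookupIndex("ishmael");  // 2
        int callAgain = alphabet.lookupIndex("call");     // still 0

        System.out.println(alphabet.size());              // 3
        System.out.println(alphabet.lookupObject(2));     // "ishmael"

        // Freeze the vocabulary so no new entries can be added
        // (the stopGrowth function mentioned below).
        alphabet.stopGrowth();
    }
}
```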
When classifying documents using Mallet, all the unique words in a document become unique entries in the Alphabet, each with a unique integer associated with it. Feature vectors use these integers to represent a subset of the Alphabet: a feature vector is stored as a sparse vector, and a location in a feature vector represents an index into the Alphabet. Internally, Mallet uses a TObjectIntHashMap to store the list of terms and their integer IDs. For performance reasons, Mallet can stop the vocabulary from growing by calling the stopGrowth function.

The Instance object is instantiated with four arguments, data, target, name, and source, which I explained two slides before. An Instance corresponds to one document. Given the number of documents to be processed, we need to transform them into Instances efficiently. Mallet takes care of this with Instance iterators. Mallet implements several iterators, including FileIterator, CsvIterator, ArrayIterator, and so on; which iterator to use is determined by how the input data is formatted.

An InstanceList contains Mallet Instances, which are typically used for training and testing machine learning algorithms, including topic models. The InstanceList class is instantiated with one argument, a Pipe object, which I will explain in the next slide. All Instances in an InstanceList must pass through the same sequence of pipes and hence share the same data and target Alphabets.

Based on the elements that I have explained so far, Mallet is ready to do topic modeling. Let's put it all together. Mallet uses different types of pipes to preprocess the data. For example, Mallet provides TokenSequenceLowercase, which converts the incoming tokens to lowercase. Pipe is the abstract superclass of all these pipes, and the slide shows just some of the important pipes that Mallet provides. The list of pipes is kept in a list container and passed as an argument to the constructor of Mallet's InstanceList object. Data importing, or preprocessing, is achieved by using a series of pipes: pipes take an input instance list and modify the data field based on the series of pipes specified.

The ParallelTopicModel class is the multi-threaded implementation of LDA in Mallet. It takes an argument specifying the number of topics, so you need to set the number of topics for your dataset before you execute LDA. Once you instantiate LDA, you pass it the InstanceList object you created, as I explained in the previous slide. The estimate function runs LDA, attempting to estimate the topic model given the data and settings you have already set up. It uses a sampling method as the inference algorithm, which requires an initial step that randomly assigns words to different topics; later iterations refine the assignments according to an optimization function. After this, you can write the result of topic modeling to a file by calling the printTopicWordWeights function, which prints an unnormalized weight for every word in each topic.
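Putting this workflow together, the following is a minimal sketch of building a pipe list, loading documents into an InstanceList, and running ParallelTopicModel. The input file docs.txt, its one-document-per-line format, the tokenization pattern, and the topic, thread, and iteration counts are illustrative assumptions, not prescribed values.

```java
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.TokenSequenceRemoveStopwords;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class LdaSketch {
    public static void main(String[] args) throws Exception {
        // Preprocessing pipeline: tokenize, lowercase, remove stopwords,
        // and map tokens to a feature sequence over a shared Alphabet.
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new CharSequence2TokenSequence(
                Pattern.compile("\\p{L}[\\p{L}\\p{P}]*\\p{L}")));
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequenceRemoveStopwords(false, false));
        pipes.add(new TokenSequence2FeatureSequence());

        // All Instances in this list pass through the same pipes,
        // so they share the same data and target Alphabets.
        InstanceList instances = new InstanceList(new SerialPipes(pipes));

        // docs.txt (hypothetical): one document per line, formatted as
        // "name label text", matching the capture groups used below.
        instances.addThruPipe(new CsvIterator(
                new FileReader(new File("docs.txt")),
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                3, 2, 1)); // data, label, name group indices

        // Multi-threaded LDA; 20 topics is an arbitrary example choice.
        ParallelTopicModel lda = new ParallelTopicModel(20);
        lda.addInstances(instances);
        lda.setNumThreads(4);
        lda.setNumIterations(1000);
        lda.estimate();

        // Unnormalized topic-word weights for every word in each topic.
        lda.printTopicWordWeights(new File("topic-word-weights.txt"));
    }
}
```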
The DMRTopicModel class is the implementation of Dirichlet-multinomial regression (DMR) in Mallet. It takes an argument specifying the number of topics. Once you instantiate DMR, you pass it the InstanceList object and call the estimate function to estimate the topic model with DMR. Once it finishes, you can print the topics along with their top words by calling the printTopWords function. The first argument is the output file, the second one is the number of words to print, and the third one is whether new lines are used when printing each topic. By calling the writeParameters function, you can write out the parameters of the topic model conditioned on the metadata, or observed features, such as timestamp, journal name, company name, and so on, that you are interested in.
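As a rough sketch of the DMR workflow just described: the serialized instance file name and the topic count are assumptions, and the instances' targets are assumed to already hold the metadata features that DMR conditions on.

```java
import java.io.File;

import cc.mallet.topics.DMRTopicModel;
import cc.mallet.types.InstanceList;

public class DmrSketch {
    public static void main(String[] args) throws Exception {
        // dmr.instances (hypothetical file name): a serialized InstanceList
        // whose targets hold the document metadata features, for example
        // timestamps or journal names, that DMR conditions on.
        InstanceList training = InstanceList.load(new File("dmr.instances"));

        DMRTopicModel dmr = new DMRTopicModel(20); // 20 topics, arbitrary
        dmr.addInstances(training);
        dmr.estimate();

        // Top 10 words per topic; the boolean controls whether new lines
        // are used when printing each topic.
        dmr.printTopWords(new File("dmr.topics.txt"), 10, false);

        // Regression parameters relating the metadata features to topics.
        dmr.writeParameters(new File("dmr.parameters.txt"));
    }
}
```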