Suppose you want to learn something about a corpus that is too big to read. Let me give you some examples of big data that we need to make sense of. We know that about half a billion tweets are generated every day. Given this, how can we answer questions like: what topics are trending today on Twitter? A second example is the hundreds of bills introduced each year. Given this data, how can we answer questions like: what issues are considered by Congress, and which politicians are interested in which topic? Or suppose that we have 10,000 active NIH grants. What research topics receive grant funding, and from whom? Is there a way to discover interesting patterns and answers to those questions in these documents?

Topic modeling is a rapidly developing branch of statistical text analysis. It uncovers the hidden semantic structure of a text collection and finds a highly compressed representation of each document as a small set of topics. From the statistical point of view, each topic is a set of words or phrases that frequently occur together in many documents. The topical representation of a document captures the most important information about its semantics, so we can annotate documents according to their topics. This is useful for many applications, including information retrieval, classification, categorization, summarization, and segmentation of text.

What is a topic, anyway? A general definition is a grouping of words that are likely to appear in the same context. From the perspective of topic modeling, it is a hidden structure that helps determine what words are likely to appear in a corpus. But this underlying structure is different from what you have seen before: it is not so much about syntax as about semantics. For example, if war and military appear in a document, you probably won't be surprised to find that troops appears later on. Why?
It's not because they're all nouns; rather, based on the semantic association you perceive, you might say that they all belong to the same topic. The hidden topical structure is also long-range context, as opposed to local dependencies such as n-grams and syntax.

An early topic modeling algorithm is LSA, which stands for Latent Semantic Analysis, proposed by Deerwester and his colleagues in 1990. The idea behind LSA is that text can be explained as a mixture of latent topics, and LSA attempts to discover this underlying structure. LSA first measures the occurrence frequency of terms in documents. It then writes these frequencies as a term-document matrix and analyzes it using Singular Value Decomposition, or SVD. This procedure factors the matrix into term-topic, topic-topic, and topic-document matrices. LSA is not explicitly a topic model, but it is the foundation for much of the later work, including LDA.

LDA stands for Latent Dirichlet Allocation. It is the de facto standard topic modeling algorithm: a generative statistical and graphical model for topic discovery, proposed by David Blei, Andrew Ng, and Michael Jordan in 2003. The fundamental assumptions are that documents have a latent semantic structure, that topics can be inferred from word-document co-occurrences, that words are related to topics, and that topics are related to documents. In LDA, each document may be viewed as a mixture of various topics. This is similar to Latent Semantic Analysis, except that in LDA the topic distribution is assumed to have a Dirichlet prior.

Let me describe the topic model with a graphical representation. There are two starting points for explaining this diagram. The first is a collection of texts, which is the data in the bottom-left corner of the diagram. The second is the assumption, in the upper-right corner of the diagram, that the texts have been generated according to some kind of model. With plate notation, the dependencies among the many variables can be captured concisely.
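The LSA pipeline described above can be sketched in a few lines of NumPy. The tiny four-document corpus and the choice of two latent topics are made-up illustrations, not from the lecture; the point is the factorization itself.

```python
# A minimal LSA sketch: build a term-document count matrix, factor it
# with SVD, and keep the top-k singular vectors as latent "topics".
# The toy corpus and k=2 are illustrative assumptions.
import numpy as np

docs = [
    "war military troops war",
    "war troops soldier",
    "gene protein cell",
    "protein cell genome",
]

# Vocabulary and term-document frequency matrix X (terms x documents)
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# SVD: X = U @ diag(s) @ Vt
#   U  : term-topic matrix,  s : topic strengths,  Vt : topic-document matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent topics to keep
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # each row: a document in topic space

# Documents about the same theme end up close together in topic space
print(np.round(doc_coords, 2))
```

Note that the two war-related documents land near each other in the reduced space even though they do not share every word, which is exactly the compression LSA is after.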
The boxes are plates representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. After an inference algorithm such as Gibbs sampling is applied, the output is the model that generated the text.

Let me share some topic modeling results with you. The top figure is the most typical kind: it shows ten topics in table format, generated by MALLET LDA from the primary records in the field of bioinformatics. The bottom figure is a network representation of topics on the Ebola virus from news articles. A link between two topics was made if the two topics are similar enough, as measured by the similarity between their lists of top words.
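To make the Gibbs sampling step concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. The four-document corpus, the priors alpha and beta, and the 200 sweeps are all illustrative assumptions; real tools such as MALLET implement a far more refined version of the same idea.

```python
# Toy collapsed Gibbs sampler for LDA (illustrative, not MALLET's code).
# Each token gets a topic assignment z; we repeatedly resample each z
# from its conditional distribution given all other assignments.
import random
from collections import defaultdict

random.seed(0)

docs = [
    "war military troops war".split(),
    "troops war soldier military".split(),
    "gene protein cell gene".split(),
    "cell protein genome gene".split(),
]
K, alpha, beta = 2, 0.1, 0.01  # topics and symmetric Dirichlet priors (assumed)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Random initial topic assignments, plus the count tables Gibbs needs
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]               # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # tokens per topic
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = z[di][wi]
        ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Collapsed Gibbs sweeps: p(z=k | rest) ∝ (ndk+alpha) * (nkw+beta)/(nk+V*beta)
for _ in range(200):
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[di][wi] = t
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

# The learned model: top words per topic, recovered from the count tables
for k in range(K):
    print(f"topic {k}:", sorted(vocab, key=lambda w: -nkw[k][w])[:3])
```

On this clearly separable corpus the sampler tends to pull the military words into one topic and the biology words into the other, which is the "output model" the diagram's inference arrow refers to.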
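The topic-network construction behind the bottom figure can be sketched as follows. The transcript does not say which similarity measure was used, so this sketch assumes Jaccard overlap between top-word lists with a made-up threshold of 0.25; the three example topics are invented for illustration.

```python
# Sketch of a topic-similarity network: link two topics when the Jaccard
# overlap of their top-word lists exceeds a threshold (both assumed here).
topics = {
    "t1": ["ebola", "virus", "outbreak", "health", "guinea"],
    "t2": ["virus", "outbreak", "vaccine", "trial", "health"],
    "t3": ["economy", "travel", "ban", "airline", "flight"],
}

def jaccard(a, b):
    """Size of the intersection over size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

threshold = 0.25
names = sorted(topics)
edges = [(u, v) for i, u in enumerate(names) for v in names[i + 1:]
         if jaccard(topics[u], topics[v]) > threshold]
print(edges)  # → [('t1', 't2')]
```

Only the two Ebola-coverage topics share enough top words to be linked, so the resulting graph visually groups related topics, as in the figure.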