Before we get into the preprocessing techniques, I want to share with you some interesting projects I previously conducted that are related to this course.

The first project applied several text mining techniques to news articles to examine whether and how a particular event influences the stock prices of Korean companies. As shown in the overview of the research design, we first collected about 4,500 news articles from the Factiva database. The primary goal of this project was to examine the so-called Korean discount by classifying US news articles by sentiment during periods when North Korea posed a geopolitical risk to the South Korean economy. In particular, we were interested in whether geopolitical threats posed by North Korea influence the stock prices of Korean companies listed on the NYSE. For sentiment classification, we divided the collected data into training and test sets to train a maximum entropy-based classification algorithm. After that, we compared the overall sentiment scores per day with the stock market index of selected Korean companies. What we found was that there were some interesting correlations between the two for several Korean companies.

The second case study is also related to social media mining. As shown in the overview of the research design, we first collected about 16,000 news articles and 7 million tweets. We applied stopword removal, lemmatization, and tokenization in the preprocessing stage. Before applying lemmatization and tokenization to the collected data, we split the text into sentences and assigned part-of-speech tags to the tokenized text using Stanford CoreNLP. In addition, since tweets are likely to contain jargon and acronyms, we applied a vocabulary control technique to minimize term variation problems. The primary goal of this paper was to examine the topic coverage and sentiment dynamics of two different media sources, Twitter and news publications, on the issue of Ebola.
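To make the preprocessing stage of the Ebola study a little more concrete, here is a minimal sketch in plain Python. It is not the Stanford CoreNLP pipeline we actually used: the stopword list, the lemma table, and the vocabulary control table are tiny hypothetical stand-ins, the function names are made up for illustration, and part-of-speech tagging is left out entirely. It only shows the order of the steps: split into sentences, tokenize, normalize variants, lemmatize, and drop stopwords.

```python
import re

# Hypothetical, tiny resources for illustration only; a real pipeline would
# use a full stopword list, a proper lemmatizer, and a curated variant table.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and"}
LEMMAS = {"cases": "case", "spreading": "spread", "infected": "infect"}
VOCAB_CONTROL = {"evd": "ebola", "govt": "government"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def preprocess(text):
    """Return one list of cleaned tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        # Vocabulary control: map acronyms and variants to one canonical term.
        tokens = [VOCAB_CONTROL.get(t, t) for t in tokens]
        # Lemmatization via a lookup table (a stand-in for a real lemmatizer).
        tokens = [LEMMAS.get(t, t) for t in tokens]
        # Stopword removal.
        tokens = [t for t in tokens if t not in STOPWORDS]
        result.append(tokens)
    return result


if __name__ == "__main__":
    print(preprocess("EVD cases are spreading. The govt is worried!"))
```

Vocabulary control is applied before lemmatization here so that variants such as "EVD" are merged into a single term before any further normalization; that ordering is my assumption for the sketch, not something prescribed by the study.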
As far as text mining techniques go, we conducted content analysis with n-gram-based LDA to identify the specific topics. We also employed sentiment analysis with a sentiment dictionary to track the sentiment changes of each topic. We also constructed a keyword network consisting of extracted entities such as persons, organizations, locations, dates, and digits. In the sentiment analysis, we scrutinized the sentiment changes of each topic with respect to the main theme of the Ebola virus.

The third case was about Twitter mining. As shown in the architecture of the system, we collected about 1.7 million tweets, along with user IDs and timestamps, related to the 2012 Korean presidential election. We used the Twitter Streaming API provided by Twitter. We stored keywords and dimensions, as well as pairs of users and pairs of keywords, in a Redis database. In addition, we employed a MySQL relational database to store tweets and timestamp information on disk for further analysis. The goal of this paper was to understand how social and political issues related to the 2012 Korean presidential election were discussed on Twitter by employing several text mining techniques. Regarding text mining techniques, we used a multinomial topic modeling technique to analyze topic trends over time. For network analysis, we conducted mention-based social network analysis. In addition, we analyzed term co-occurrence. Given a search term, our Twitter mining system retrieved the list of terms that co-occurred with it. Once the list was obtained, we sorted the co-occurring terms by their co-occurrence frequency and displayed them on the result page.

The next project was about applying central analysis to the research problems of bibliometrics on the comment databases of YouTube videos. As shown in the overview of the research design, we collected YouTube videos using the Google API. We used keywords such as K-pop, Korean pop, SM Entertainment, entertainment, and JYP.
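Going back for a moment to the term co-occurrence lookup in the Twitter mining system, the idea can be sketched as follows. This is only an illustration under my own assumptions: the function name `cooccurring_terms` is hypothetical, the tweets are assumed to be already tokenized, a term is counted at most once per tweet, and everything is held in memory rather than in Redis as in the actual system.

```python
from collections import Counter


def cooccurring_terms(search_term, tokenized_tweets, top_n=10):
    """Return (term, count) pairs for terms sharing a tweet with search_term.

    Each tweet contributes at most 1 to a term's count, so the count is the
    number of tweets in which the term appears together with search_term.
    """
    counts = Counter()
    for tokens in tokenized_tweets:
        unique = set(tokens)
        if search_term in unique:
            counts.update(unique - {search_term})
    # Sort by descending frequency; break ties alphabetically so the
    # ordering is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    tweets = [
        ["election", "debate", "park"],
        ["election", "park", "vote"],
        ["debate", "moon"],
        ["election", "moon"],
    ]
    for term, count in cooccurring_terms("election", tweets):
        print(term, count)
```

Counting per tweet rather than per occurrence keeps a single repetitive tweet from dominating the ranking; whether the original system counted this way is not something I am asserting here.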
As of August 16, 2013, we had collected about 3,004 YouTube videos related to K-pop. From these videos, we gathered about 5 million user comments. The primary goal of this paper was to investigate whether and how user commenting behavior impacts the topology of the K-pop video community through an analysis of co-commenting behavior on these videos. We first built a network based on the number of co-comments. We extracted both user information from public profiles and the content of the comments. We then computed the sentiment score of each comment using a sentiment dictionary. To compute the sentiment score of each comment, we aggregated the sentiment values of the tokenized words in the comment.

The last case study was about analyzing public opinion on political issues online by automatically detecting polarity in Twitter data. It consists of two stages. The first stage detected polarity in Twitter data using enriched models of shrinkage regression. The second stage identified the major topics via an LDA topic model and estimated the degree of polarity on the LDA topics using the top sentiment score.

So far, I have presented several related projects so that you can draw some useful ideas from them. For more information about these projects, I have provided a reference list in the next two slides.
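As one small illustration to take away, the dictionary-based sentiment scoring from the K-pop comment study can be sketched like this. The lexicon below is a hypothetical four-word stand-in, not the dictionary we used, and I am assuming that "aggregated" means a plain sum of the word values; an average or a weighted sum would be equally plausible variants.

```python
import re

# Hypothetical mini-lexicon for illustration; a real sentiment dictionary
# would contain thousands of scored entries.
SENTIMENT_LEXICON = {"love": 2.0, "great": 1.5, "boring": -1.0, "hate": -2.0}


def comment_sentiment(comment, lexicon=SENTIMENT_LEXICON):
    """Sum the lexicon values of the tokens in a comment.

    Words that are not in the lexicon contribute 0 to the score.
    """
    tokens = re.findall(r"[a-z']+", comment.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)


if __name__ == "__main__":
    for text in ["I love this song, great video!", "boring and I hate it"]:
        print(text, "->", comment_sentiment(text))
```

A sum favors longer comments, since more words mean more chances to hit the lexicon; dividing by the number of tokens would be one way to normalize if comment length varies a lot.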