Hello everyone, my name is Min Song. I'm a professor in the Department of Library and Information Science at Yonsei University, and I'm the instructor of this course.

Let me talk about the problems of making sense of data. We live in the era of data deluge. Ordinary people, especially the digital-native generation, generate tons of data in their daily lives. One familiar example is social network sites such as Facebook and Twitter. Facebook alone generates 250 million posts per hour. At a similar rate, Twitter pours out 21 million tweets per hour. This sheer volume of data simply makes it difficult to understand what is buried in it. Another serious issue with data is heterogeneity. Differences in the characteristics of data, including language, domain, style, format, and so on, prevent us from making sense of it. The third issue is how long data remains valid and how long it should be stored. In this world of real-time data, we need to determine at what point data is no longer relevant to the current analysis. The fourth issue is semantics. Data semantics is about discovering the meaning and the use of data in a programmatic fashion. Thus, the focus of data semantics is on how data represents a concept or object in the real world. The last, but not least, issue is complexity. Today's data comes from multiple sources, and it is still a major undertaking to link, match, cleanse, and transform data across systems. Together with the ambiguity of text, the complexity issue becomes even more difficult.

So what is this course about? As you well know, this course is a hands-on text mining class. I will talk about the definition of text mining in the next session, but to give you a very brief introduction, text mining is about finding interesting nuggets in vast collections of unstructured text. This course will give you a decent chance to get your feet wet with text mining. You're required to work on a team-based project related to text mining. As far as data for the project is concerned, I provide three data sets for you: GitHub, New York Times, and Twitter data. In addition, I provide a codebase called yTextMiner. It is written in Java and it covers major components of text mining, including text preprocessing, topic modeling, sentiment analysis, document classification, and so forth.

So let me talk about the first data set, which is called the GitHub data set. GitHub is a repository hosting site for Git, which is a version control system; other popular version control systems are CVS and SVN. GitHub provides many unique features. As of May 2016, it consists of 18,717,384 repositories and 15,335,221 users. The official URL for GitHub is http://github.com, and the description of the APIs can be found at http://developer.github.com/v3/. GitHub provides a variety of features that are helpful for understanding the projects it hosts. Example features include user information, repository descriptions, repository files, and details of how to use a repository. As you see in this slide, if you're interested in social network analysis from the perspective of text mining, GitHub is an excellent choice. To collect data from GitHub, we recommend that you use the GitHub APIs. All APIs are based on REST services and require authentication, so you need to obtain a developer key from GitHub; the details of how to do this are provided in the lab session. To give you a feel for the data, I provided sample GitHub data collected by those APIs.
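To make that concrete, here is a minimal sketch in Java (the language of yTextMiner) of what such a search call can look like: it sends one request to the repository search endpoint and pulls a few fields out of the JSON response. The org.json parser, the placeholder credentials, and the choice of printed fields are my own assumptions for illustration; this is not part of the yTextMiner codebase.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    import org.json.JSONArray;
    import org.json.JSONObject;

    public class GitHubSearchExample {
        public static void main(String[] args) throws Exception {
            // Placeholder credentials: replace with the client ID/secret issued by GitHub.
            String clientId = "YOUR_CLIENT_ID";
            String clientSecret = "YOUR_CLIENT_SECRET";
            String query = URLEncoder.encode("text mining", "UTF-8");

            // Search repositories matching the query, authenticating with the client ID/secret.
            URL url = new URL("https://api.github.com/search/repositories?q=" + query
                    + "&client_id=" + clientId + "&client_secret=" + clientSecret);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/vnd.github.v3+json");

            // Read the raw JSON response body.
            StringBuilder body = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line);
                }
            }

            // Extract the repository full names and descriptions from the "items" array.
            JSONObject json = new JSONObject(body.toString());
            JSONArray items = json.getJSONArray("items");
            for (int i = 0; i < items.length(); i++) {
                JSONObject repo = items.getJSONObject(i);
                System.out.println(repo.getString("full_name") + " : "
                        + repo.optString("description", "(no description)"));
            }
        }
    }

The same request-then-parse pattern applies to the New York Times API described later; only the endpoint, the authentication parameters, and the JSON field names change.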
I used 10 programming languages as search terms to collect GitHub repositories and their README files, which are the descriptions of GitHub repositories. The retrieved results consist of 10,000 repositories, and the sample data comes with yTextMiner. All returned results are in JSON format, so later on we need a JSON parser to extract the needed information. One very simple sample API call is api.github.com/search/repositories?, to which you add your client ID and client secret along with a query; in my case, the query is "text mining".

The second data set is a collection of news articles from the New York Times, a very popular daily newspaper in the United States. The New York Times digital repository contains more than 13 million articles in total. Articles published before 1923 or after 1981 are free. This means that our data set consists of relatively recent news articles that are freely accessible; if you are interested in historical news articles, you would have to use data published before 1923. The online version of the New York Times can be accessed at http://www.nytimes.com. As with GitHub, to get New York Times data we'll use APIs. The detailed descriptions of the APIs are found at developer.nytimes.com. Before you use the data, you need to get an API key from the New York Times. To get an API key, you simply push the button in the upper-right corner shown on the slide, which takes you to a simple registration page; once you complete the form, the API key will be sent to you. With the APIs, you can retrieve all accessible articles as well as more specific data, such as movie reviews only, books only, community content only, and many other kinds of data. To give you a feel for the data, I provided sample New York Times data collected by those APIs. It comes with the yTextMiner toolkit and consists of a total of 8,777 news articles. To retrieve these articles, I did not use any search query or other restriction, except that I limited the dates from April 20, 2016 to June 30, 2016. The returned articles are in JSON format, and several metadata fields such as author, article type, and URLs are included in the returned result. Again, a simple API call is api.nytimes.com/svc/search/v2/articlesearch.json?, to which you add your API key and another parameter, fq; its value filters by document type, which in my case is news.

The third data set provided in this course is Twitter data. Twitter plays an important role in understanding how a certain issue propagates through a social network, so I believe that playing with Twitter data will be a great exercise. The description of the Twitter APIs can be found at the following URL: dev.twitter.com/overview/api. In order to use the APIs, you need to register with Twitter and go through several steps to get the API key. Compared to the other two data sets, Twitter data requires some complex steps of authentication; I will provide detailed instructions on how to get the key in the lab session. With the Twitter APIs, you can collect a lot of useful information about Twitter users and their tweets. As far as user information is concerned, you can get various user-related data, including the demographics of a particular user, their followers, and their friends. For instance, using Twitter's Friendship REST API, you can get the number of friends of a particular user given his or her screen name. In addition, you can search tweets with the keyword search API, either in stream mode or non-stream mode, as sketched below.
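As a rough illustration of the non-stream (REST) mode, here is a minimal sketch using the Twitter4J Java library, which is one common way to call these APIs from Java. Twitter4J is an assumption on my part rather than part of the course toolkit, and the sketch assumes your OAuth credentials are already stored in a twitter4j.properties file. It looks up a user's friend count by screen name and then runs a keyword search; the screen name and keyword are just examples.

    import twitter4j.Query;
    import twitter4j.QueryResult;
    import twitter4j.Status;
    import twitter4j.Twitter;
    import twitter4j.TwitterFactory;
    import twitter4j.User;

    public class TwitterRestExample {
        public static void main(String[] args) throws Exception {
            // Reads the consumer key/secret and access token/secret from twitter4j.properties.
            Twitter twitter = TwitterFactory.getSingleton();

            // Number of friends (accounts the user follows) for a given screen name.
            User user = twitter.showUser("nytimes");  // example screen name
            System.out.println(user.getScreenName() + " follows "
                    + user.getFriendsCount() + " accounts");

            // Keyword search in non-stream (REST) mode.
            Query query = new Query("text mining");
            QueryResult result = twitter.search(query);
            for (Status status : result.getTweets()) {
                System.out.println("@" + status.getUser().getScreenName()
                        + ": " + status.getText());
            }
        }
    }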
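The stream mode works differently: instead of one request and one response, you register a listener and tweets are pushed to you as they arrive. A minimal Twitter4J sketch, under the same credential assumption, is shown below; the track keyword is again just an example.

    import twitter4j.FilterQuery;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    public class TwitterStreamExample {
        public static void main(String[] args) {
            // Credentials are read from twitter4j.properties, as in the REST example.
            TwitterStream stream = new TwitterStreamFactory().getInstance();

            // Print each incoming tweet; StatusAdapter lets us override only the callback we need.
            stream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    System.out.println("@" + status.getUser().getScreenName()
                            + ": " + status.getText());
                }
            });

            // Keep only tweets matching the keyword; the connection stays open until stopped.
            stream.filter(new FilterQuery().track("text mining"));
        }
    }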
Since there is a rate limit on data collection, you may want to adjust the rate-related attributes in your API call to retrieve more tweets. As sample data, I provided 170,000 tweets that I collected over four hours using the Twitter Streaming API. There are two major types of Twitter APIs: the first type is the search-related API, and the second type is the stream-related API. In theory, you can collect Twitter data 24/7 by using the stream API.

All right, I believe that gives you enough material to start with. Let me tell you what I expect from you in this course. First, I want you to be enthusiastic and passionate about learning new things and working hard. Second, I want you to learn text processing techniques so that you can utilize them for your thesis and your own work. At the end of this journey of learning text mining, you will probably become a good information and/or data specialist or scientist; that is my belief. No sweat, no gain. Thank you.