Let us recall last week's lecture note on the definition of text mining. According to Usama Fayyad and Marti Hearst, text mining is about discovering interesting, useful, and previously unknown facts and objects, and the relations among them, from large collections of unstructured text. Text mining differs from data mining in that the data source for text mining is unstructured, whereas the data source for data mining is structured. In most cases, the data source for data mining is stored in a relational database.

As the figure shows, there is a steep increase in the number of web pages over time. This statistic is just a microcosm of what is happening with the growth of big data and unstructured data across the globe. Enterprises including Google, IBM, and Microsoft generate tons of unstructured data. Moreover, ordinary people produce text data on various social media sites, including blogs, Twitter, Facebook, YouTube, and so on. So let's talk about some statistics on unstructured data. News articles represent about 25 terabytes annually, magazines about 10 terabytes a year, and office documents about 195 terabytes. 610 billion emails are sent each year, representing 11,000 terabytes in 2010. There were 152 million blogs in 2010. Twitter has 200 million tweets per day, and Facebook has about 640 million users, with 50% logging in daily as of March 2011. Facebook collects an average of about 15 terabytes of data every day. Google has more than 50 billion pages in its index as of December 2011. YouTube has 3 billion views a day, and 48 hours of video are uploaded per minute as of May 2011.

If we classify text mining techniques according to their purpose, they fall into one of three classes. The first is descriptive, the second is predictive, and the third is prescriptive text analytics. Let's start with descriptive. The primary goal of descriptive analytics is to make sense of data. 
Most raw data are not suitable for human consumption, but the information derived from the data is. Because of this, it is critical to transform raw data into some kind of information. Given that more than 80% of business analytics is descriptive, descriptive text analytics accounts for a significant portion of text mining. By condensing text data into smaller and more useful nuggets of information, people can digest big data more easily.

The second category is predictive text analytics. The primary goal of predictive analytics is to forecast what might happen in the future based on learning from pre-identified answers. Because of this, all predictive analytics is probabilistic in nature. Predictive analytics uses a variety of statistical, modeling, and machine learning techniques to study recent and historical data.

The last type of text mining is prescriptive text analytics. It goes beyond descriptive and predictive analytics by recommending one or more courses of action. Another feature of prescriptive text analytics is the ability to show the likely outcome of each decision. Prescriptive analytics is particularly useful when we need to prescribe an action so that a business decision-maker or policy-maker can take this information into consideration for the final decision.

Some ready-to-use application areas include information access, where text mining provides access points to data sets. For example, on the result page of a search engine, text mining can better summarize search results. Another area is information organization. For example, topic modeling results can be used to automatically generate an ontology from a huge amount of unstructured data. The third application area is visualization. Coupled with visualization tools like D3.js, text mining results can be nicely integrated into visualizations.

There are a number of text mining techniques. Natural language understanding is one of them. 
It is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension. Topic modeling is another important text mining technique; it generates the topic topology of a given data set. Sentiment analysis is a text mining technique that is drawing increasing attention, since it can be applied to many different domains, including product reviews, public opinion detection, and so on. There are also traditional techniques, namely document classification and clustering, but those traditional text mining techniques are not covered in depth in this course. Instead, in this course, we'll focus on the techniques that I just described.

The text mining process can be divided into three major steps. The first step is collecting data from various data sources. In the collection stage, useful documents are gathered, selected, and filtered for the next step. The next step is the preprocessing stage. Preprocessing refines miscellaneous text into analyzable units of text. The third stage is the application of text mining techniques to find facts and events of interest to users. It includes extracting relevant concepts, or facts about those concepts, and discovering relations among them.
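
To make the three steps concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the course's actual tooling: the function names `collect`, `preprocess`, and `mine`, the tiny stop-word list, and the sample sentences are all hypothetical, and simple term counting with per-document co-occurrence stands in for the richer techniques (topic modeling, sentiment analysis) mentioned in the lecture.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "for"}


def collect(sources):
    """Step 1: gather documents and filter out empty or whitespace-only ones."""
    return [doc.strip() for doc in sources if doc and doc.strip()]


def preprocess(document):
    """Step 2: refine raw text into analyzable units (lowercase word tokens)."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def mine(token_lists, top_n=3):
    """Step 3: find frequent concepts and pairs of concepts that co-occur."""
    term_counts = Counter()
    pair_counts = Counter()
    for tokens in token_lists:
        term_counts.update(tokens)
        # Count each unordered pair of distinct terms once per document.
        pair_counts.update(combinations(sorted(set(tokens)), 2))
    return term_counts.most_common(top_n), pair_counts.most_common(top_n)


# Made-up sample input; the blank entry is dropped in the collection step.
raw_sources = [
    "Text mining is the discovery of facts in unstructured text.",
    "   ",
    "Data mining works on structured data in a relational database.",
    "Text mining and data mining both discover useful facts.",
]

documents = collect(raw_sources)
token_lists = [preprocess(doc) for doc in documents]
top_terms, top_pairs = mine(token_lists)

print(top_terms)
print(top_pairs)
```

Co-occurrence counting is a very rough proxy for "discovering relations among concepts"; in practice this last step is where a dedicated technique would be plugged in, while the collection and preprocessing steps stay largely the same.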