[SOUND] In this lecture we give an overview of Text Mining and Analytics. First, let's define the term text mining, and the term text analytics. The title of this course is called Text Mining and Analytics. But the two terms text mining, and text analytics are actually roughly the same. So we are not really going to really distinguish them, and we're going to use them interchangeably. But the reason that we have chosen to use both terms in the title is because there is also some subtle difference, if you look at the two phrases literally. Mining emphasizes more on the process. So it gives us a error rate medical view of the problem. Analytics, on the other hand emphasizes more on the result, or having a problem in mind. We are going to look at text data to help us solve a problem. But again as I said, we can treat these two terms roughly the same. And I think in the literature you probably will find the same. So we're not going to really distinguish that in the course. Both text mining and text analytics mean that we want to turn text data into high quality information, or actionable knowledge. So in both cases, we have the problem of dealing with a lot of text data and we hope to. Turn these text data into something more useful to us than the raw text data. And here we distinguish two different results. One is high-quality information, the other is actionable knowledge. Sometimes the boundary between the two is not so clear. But I also want to say a little bit about these two different angles of the result of text field mining. In the case of high quality information, we refer to more concise information about the topic. Which might be much easier for humans to digest than the raw text data. For example, you might face a lot of reviews of a product. A more concise form of information would be a very concise summary of the major opinions about the features of the product. Positive about, let's say battery life of a laptop. Now this kind of results are very useful to help people digest the text data. And so this is to minimize a human effort in consuming text data in some sense. The other kind of output is actually more knowledge. Here we emphasize the utility of the information or knowledge we discover from text data. It's actionable knowledge for some decision problem, or some actions to take. For example, we might be able to determine which product is more appealing to us, or a better choice for a shocking decision. Now, such an outcome could be called actionable knowledge, because a consumer can take the knowledge and make a decision, and act on it. So, in this case text mining supplies knowledge for optimal decision making. But again, the two are not so clearly distinguished, so we don't necessarily have to make a distinction. Text mining is also related to text retrieval, which is a essential component in many text mining systems. Now, text retrieval refers to finding relevant information from a large amount of text data. So I've taught another separate MOOC on text retrieval and search engines. Where we discussed various techniques for text retrieval. If you have taken that MOOC, and you will find some overlap. And it will be useful To know the background of text retrieval of understanding some of the topics in text mining. But, if you have not taken that MOOC, it's also fine because in this MOOC on text mining and analytics, we're going to repeat some of the key concepts that are relevant for text mining. But they're at the high level and they also explain the relation between text retrieval and text mining. Text retrieval is very useful for text mining in two ways. First, text retrieval can be a preprocessor for text mining. Meaning that it can help us turn big text data into a relatively small amount of most relevant text data. Which is often what's needed for solving a particular problem. And in this sense, text retrieval also helps minimize human effort. Text retrieval is also needed for knowledge provenance. And this roughly corresponds to the interpretation of text mining as turning text data into actionable knowledge. Once we find the patterns in text data, or actionable knowledge, we generally would have to verify the knowledge. By looking at the original text data. So the users would have to have some text retrieval support, go back to the original text data to interpret the pattern or to better understand an analogy or to verify whether a pattern is really reliable. So this is a high level introduction to the concept of text mining, and the relationship between text mining and retrieval. Next, let's talk about text data as a special kind of data. Now it's interesting to view text data as data generated by humans as subjective sensors. So, this slide shows an analogy between text data and non-text data. And between humans as subjective sensors and physical sensors, such as a network sensor or a thermometer. So in general a sensor would monitor the real world in some way. It would sense some signal from the real world, and then would report the signal as data, in various forms. For example, a thermometer would watch the temperature of real world and then we report the temperature being a particular format. Similarly, a geo sensor would sense the location and then report. The location specification, for example, in the form of longitude value and latitude value. A network sends over the monitor network traffic, or activities in the network and are reported. Some digital format of data. Similarly we can think of humans as subjective sensors. That will observe the real world and from some perspective. And then humans will express what they have observed in the form of text data. So, in this sense, human is actually a subjective sensor that would also sense what's happening in the world and then express what's observed in the form of data, in this case, text data. Now, looking at the text data in this way has an advantage of being able to integrate all types of data together. And that's indeed needed in most data mining problems. So here we are looking at the general problem of data mining. And in general we would Be dealing with a lot of data about our world that are related to a problem. And in general it will be dealing with both non-text data and text data. And of course the non-text data are usually produced by physical senses. And those non-text data can be also of different formats. Numerical data, categorical, or relational data, or multi-media data like video or speech. So, these non text data are often very important in some problems. But text data is also very important, mostly because they contain a lot of symmetrical content. And they often contain knowledge about the users, especially preferences and opinions of users. So, but by treating text data as the data observed from human sensors, we can treat all this data together in the same framework. So the data mining problem is basically to turn such data, turn all the data in your actionable knowledge to that we can take advantage of it to change the real world of course for better. So this means the data mining problem is basically taking a lot of data as input and giving actionable knowledge as output. Inside of the data mining module, you can also see we have a number of different kind of mining algorithms. And this is because, for different kinds of data, we generally need different algorithms for mining the data. For example, video data might require computer vision to understand video content. And that would facilitate the more effective mining. And we also have a lot of general algorithms that are applicable to all kinds of data and those algorithms, of course, are very useful. Although, for a particular kind of data, we generally want to also develop a special algorithm. So this course will cover specialized algorithms that are particularly useful for mining text data. [MUSIC]