Welcome to the course on applied text mining in Python. I'm glad you're here. Today we're going to start with working with text. Text is everywhere: you see it in books and in printed material. You have newspapers, you have Wikipedia and other encyclopedias. You have people talking to each other in online forums, discussion groups, and so on. You have Facebook and Twitter, and that's mostly text, too.

And this text data is growing really fast. It grows exponentially and continues to do so. It's estimated at about 2.5 exabytes, that is 2.5 million TB, of new data a day. According to recent estimates, it'll grow to about 40 zettabytes, that is 40 billion TB, by 2020. And that is 50 times what it was just 10 years ago. Approximately 80% of all of this data is estimated to be unstructured, free text. That includes over 40 million articles in Wikipedia, and just over 5 million of them are in English. There are 4.5 billion web pages and about 500 million tweets a day, which comes to about 200 billion tweets a year. And there are over 1.5 trillion queries on Google in a year.

So when we look at data and look at what is hidden in text in plain sight, you'll see that it says a lot. For example, this is the Twitter profile of the UN Spokesperson. You have the author there, and you have the description and the location of where they are. You have the tweets themselves, the actual content, if you think about it, which carries the topic and the sentiment around each of them. For each tweet, you have the timestamp, when it was sent out. You have the popularity, that is, how many times it was retweeted or liked by others. And in general, this also gives you some idea of the social network: how many people are following this account, and how many accounts are being followed by this particular account of the UN Spokesperson.

So what can be done with all of this text? You could parse the text, try to understand what it says.
You could find and extract relevant information from text, or even define what information is. You could classify text documents. You could search for relevant text documents; this is information retrieval. You could also do some sort of sentiment analysis: you could see whether something is positive or negative, whether something is happy, sad, or angry. These are all sentiments associated with a particular piece of text. And then you could do topic modeling: identify what topic is being discussed, how many topics are being discussed in a document, and so on. We are going to talk about each of these in the next few modules, and see what we can do using text mining in Python.
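
To make the tweet example a bit more concrete, here is a minimal sketch of what a first pass over one tweet might look like, using only the Python standard library. Everything in it is an assumption for illustration: the `tweet` record and all of its values are invented (this is not real data from the UN Spokesperson account), the function names are hypothetical, and the word-list "sentiment" is a toy stand-in, not the method the later modules use.

```python
import re

# Hypothetical tweet record: every field value here is invented for illustration.
tweet = {
    "author": "example_account",
    "description": "Placeholder profile description",
    "location": "Example City",
    "text": "Briefing today at noon on the new #climate report. Thanks @example_user! #UN",
    "timestamp": "2017-01-01T12:00:00",
    "retweets": 12,
    "likes": 30,
    "followers": 1000,
    "following": 50,
}

# Toy word lists, assumed for this sketch only.
POSITIVE_WORDS = {"thanks", "good", "happy", "great"}
NEGATIVE_WORDS = {"bad", "sad", "angry", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens,
    keeping a leading # or @ attached to its word."""
    return re.findall(r"[#@]?\w+", text.lower())


def extract_hashtags(text):
    """Return hashtags such as '#climate' in order of appearance."""
    return re.findall(r"#\w+", text)


def extract_mentions(text):
    """Return user mentions such as '@example_user' in order of appearance."""
    return re.findall(r"@\w+", text)


def naive_sentiment(tokens):
    """Label tokens by counting matches against the toy word lists."""
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
        t in NEGATIVE_WORDS for t in tokens
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = tokenize(tweet["text"])
print("tokens:   ", tokens)
print("hashtags: ", extract_hashtags(tweet["text"]))
print("mentions: ", extract_mentions(tweet["text"]))
print("sentiment:", naive_sentiment(tokens))
```

The point here is only that the same record carries both structured fields (timestamp, retweets, followers) and free text, and that even simple parsing of that free text starts to pull out signals like topics, mentions, and sentiment.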