In this video, we're going to talk about basic NLP tasks and introduce you to NLTK. What is NLTK? NLTK stands for Natural Language Toolkit. It is an open-source library in Python, and we're going to use it extensively in this video and the next. The advantage of NLTK is that it has support for most NLP tasks and also provides access to numerous text corpora.

Let's set it up. We first bring NLTK in using the import statement: import nltk. Then we can download the text corpora using nltk.download(). It's going to take a little while, but once it comes back, you can issue a command like from nltk.book import *, and it will show you the corpora it has downloaded and made available. You can see that there are nine text corpora. Text1 is Moby Dick, text2 is Sense and Sensibility. You have a Wall Street Journal corpus in text7, some personals in text8, and a chat corpus in text5, so it is quite diverse.

As I said, text1 is Moby Dick. If you look at the sentences, you'll see one sentence from each of these nine text corpora. "Call me Ishmael" is from text1. If you look at how sentence 1 looks, it's sent1, and you'll see that it has four tokens: Call, me, Ishmael, and then the full stop.

Now that we have access to multiple text corpora, we can look at counting the vocabulary of words. Text7, if you recall, was the Wall Street Journal, and sent7, which is one sentence from text7, is this: "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29." The words are already parsed out, so the commas and the full stop are separate tokens. The length of sent7 is the number of tokens in this sentence, and that's 18. But if you look at the length of text7, the entire corpus, you will see that the Wall Street Journal corpus has 100,676 words. It's clear that not all of these are unique.
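The token counting described above can be sketched in plain Python. Here the tokens of the Wall Street Journal sentence are written out by hand, so the example runs without downloading the NLTK corpora; in the video itself you would get sent7 via import nltk, nltk.download(), and from nltk.book import *.

```python
# Tokens of the example sentence, transcribed by hand so this runs
# without the NLTK corpora (in the course, sent7 comes from nltk.book).
sent7 = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will',
         'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director',
         'Nov.', '29', '.']

# The commas and the full stop are separate tokens, so the length is 18.
print(len(sent7))  # -> 18
```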
We can see in the previous example that the comma is repeated twice, the full stop is there, and words such as "the" are so frequent that they take up a big chunk of this 100,000 count. If you count the unique words using len(set(text7)), you will get 12,408. That means the Wall Street Journal corpus really has only about 12,400 unique words, even though it is a 100,000-word corpus.

Now that we know how to count words, let's look at these words and understand how to get their individual frequencies. If you want to print the first 10 of these unique words, you'd say list(set(text7))[:10]. That will give you the first 10 words. In this corpus, the first 10 words in the set are Mortimer, foul, heights, four, and so on. You'll notice there is a u and a quote before each word. Do you recall what it stands for? The u prefix here is Python 2 notation indicating a Unicode string: each token is represented as a Unicode string.

Now, to find the frequency of words, you use a frequency distribution, FreqDist. You create this frequency distribution from text7, the Wall Street Journal corpus, and store it in a variable called dist. Then you can start pulling statistics from this data structure. len(dist) gives you 12,408, the number of unique words in the corpus. dist.keys() gives you the actual words, and that would be your vocab1. If you take the first 10 words of vocab1, you'll get the same 10 words we saw at the top of the slide. Then, if you want to find out how many times a particular word occurs, you can ask for the count of the word four, dist[u'four'], and you'll get the answer 20.
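A minimal stand-in for FreqDist, using only the standard library: collections.Counter behaves much like it (length is the number of unique tokens, indexing gives a token's count). The tokens below are again transcribed by hand from the Wall Street Journal sentence quoted earlier, so this runs without NLTK.

```python
from collections import Counter

# Counter plays the role of nltk's FreqDist on the example sentence.
tokens = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will',
          'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director',
          'Nov.', '29', '.']
dist = Counter(tokens)

print(len(dist))   # unique tokens: 17, because the comma repeats
print(dist[','])   # the comma occurs twice -> 2
```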
That means the word four appears 20 times in this Wall Street Journal corpus. What if you want to find words that occur often and also satisfy a condition on word length? Say I call a word frequent if it is longer than five characters and occurs more than 100 times. Then I can use the command [w for w in vocab1 if len(w) > 5 and dist[w] > 100], and I get the list of words that satisfy both conditions. You'll see that million, market, president, and trading are words that satisfy this.

Why did we have a restriction on the length of the word? Because if you don't, then words like "the", or the comma, or the full stop are going to be very frequent. Those occur far more than 100 times, and they would come up as frequent words. This is one way to say that the real interesting words are ones that are fairly long, more than five characters, and occur fairly often. There are, of course, other ways to do it.

Now let's look at the next task. We know how to find unique words; the next task is normalizing and stemming words. What are normalization and stemming? Normalization means finding the different forms of a single word and bringing them to the same form. There are multiple ways to do that. The first would be to bring them to the same casing, say all lowercase.

Let's take this example. For the string input1, we have five words. They are all slight variations of the word list: you have List, then the plural form lists, you have listing and listings, and then the verb form listed. The first step is to take this input string and use the function lower() to bring it to lowercase. Why would we do that? So that we don't distinguish the capitalized word List from the lowercase word list.
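The "long and frequent" filter above uses the same list-comprehension pattern on vocab1 and dist. Here is the same pattern on a small made-up word list, with the thresholds scaled down to fit the toy data (the real corpus uses len(w) > 5 and dist[w] > 100).

```python
from collections import Counter

# Toy stand-in for text7; thresholds are scaled down for the small sample.
words = ('the market rose and the market fell while trading in the '
         'market stayed heavy and trading slowed').split()
dist = Counter(words)
vocab = set(words)

# Same filter shape as in the transcript: long enough AND frequent enough.
freqwords = [w for w in vocab if len(w) > 5 and dist[w] > 1]
print(sorted(freqwords))  # -> ['market', 'trading']
```

Without the length condition, very frequent short tokens like "the" and "and" would dominate the result, which is exactly the point made above.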
If you run the function lower() on input1 and then split it on space, you get all five words in lowercase: list, listed, lists, listing, and listings. That is step 1.

Another step you could do as part of normalization is called stemming. Stemming takes a word and removes common suffixes to bring the word to its base form. There are multiple existing algorithms to do that, and NLTK already provides different ways to stem a string or a word. One example is the Porter stemmer. If you create nltk.PorterStemmer() and store it in a variable porter, you can run this Porter stemmer on all the words in words1: for t in words1, run porter.stem(t). When we do that, we see that each of these five words converts to the word list, because the Porter stemmer knows that -ed is a common suffix associated with verbs, plurals end in -s, -ing is a common suffix as well, and all of these get removed.

Now, it is still up to us, as developers of these NLP models, to decide whether we want to do it that way. Should we actually remove the distinction between lists and listing, when they have different meanings? Whether to collapse them is a decision we have to make, but the Porter stemmer is one way to do it when we decide to.

A slight variant of stemming is lemmatization. Lemmatization is where you want the words that come out to be actually meaningful. Let's take an example. NLTK includes the Universal Declaration of Human Rights as one of its corpora. If you say nltk.corpus.udhr.words('English-Latin1'), that is, the Universal Declaration of Human Rights in its English, Latin-1 encoding, it gives you the entire declaration, which we store in a variable udhr.
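The lowercasing and stemming steps above can be sketched as follows, assuming nltk is installed (the Porter stemmer itself needs no extra data downloads):

```python
import nltk

input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')   # Step 1: lowercase, then split on space

porter = nltk.PorterStemmer()        # Step 2: strip common suffixes
stems = [porter.stem(t) for t in words1]
print(stems)  # -> ['list', 'list', 'list', 'list', 'list']
```

All five variants collapse to the same stem, which is exactly the behavior (and the design decision) discussed above.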
If we print out the first 20 words, you'll see "Universal Declaration of Human Rights", then "Preamble", and then it starts: "Whereas recognition of the inherent dignity and of the equal and inalienable rights of ..." and so on, and it continues that way. Now, if you run the Porter stemmer on these words, you will see that it strips out the common suffixes. Universal became univers, without the -al at the end. Declaration became declar. Rights became right, and so on. But now you see that univers and declar are not really valid words.

Lemmatization does this kind of stemming, but makes sure the resulting stems are valid words. That is sometimes useful, because you want to normalize, but normalize to something that is also meaningful. We can use the WordNetLemmatizer that NLTK provides: nltk.WordNetLemmatizer(). If you lemmatize the words from the set we have been looking at so far, what you get is "Universal Declaration of Human Rights Preamble Whereas recognition of the inherent dignity ...". Basically, all of these words are valid.

How do you know the lemmatizer has worked? If you compare the first string up there with the last string down here, rights has changed to right, so it has been lemmatized. But you will also notice that the fifth word, Rights in "Universal Declaration of Human Rights", was not lemmatized, because it is capitalized: it was treated as a different word and was not reduced to right. If you had lowercased it first, then Rights would become right as well. So there are rules for why something gets lemmatized and something is kept as is.

Once we have handled stemming and lemmatization, let's take a step back and look at the tokens themselves: the task of tokenizing something. Recall that we looked at how to split a sentence into words and tokens, and we said we could just split on space.
If you take a text string like "Children shouldn't drink a sugary drink before bed." and split it on space, you get these words: Children, shouldn't as one word, drink, a, sugary, drink, before, and then, unfortunately, the full stop goes with bed, so it's bed-full-stop. You get eight words out of this sentence. But you can already see that this is not doing a good job, because, for example, it keeps the full stop attached to the word.

You could instead use NLTK's built-in tokenizer. The way to call it is nltk.word_tokenize: you pass in the string and get a nicely tokenized sentence. In fact, it differs in two places. Not only is the full stop pulled out as a separate token, but you'll notice that shouldn't became should plus n't, where n't stands for not. That is important in quite a few NLP tasks, because you want to detect negation, and the way you'd do it is to look for tokens that are representations of not; n't is one such representation. So now you know that this particular sentence does not really have eight tokens but ten, because n't and the full stop are two new tokens.

We've talked about tokenizing a sentence and the fact that punctuation marks have to be separated, and that some word forms, like the apostrophe-t in n't, should also be separated, and so on. But there is an even more fundamental question: what is a sentence, and how would we know sentence boundaries? This matters because you often want to split a long text into sentences. Suppose our example text is: "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!" Already, you know that a sentence can end with a full stop, a question mark, an exclamation mark, and so on. But not all full stops end sentences.
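The whitespace-split pitfall described above can be reproduced with plain Python:

```python
text = "Children shouldn't drink a sugary drink before bed."

tokens = text.split(' ')   # naive whitespace split
print(tokens)
# -> ['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']
print(len(tokens))         # -> 8, with the full stop stuck to 'bed.'
```

By contrast, nltk.word_tokenize(text) would separate "n't" and the final full stop into their own tokens, giving ten tokens (it requires the Punkt tokenizer models to be downloaded, so it is not executed here).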
For example, U.S., which stands for United States, is just one word that contains two full stops, but neither of them ends the sentence. The same goes for $2.99, where the full stop is part of a number, not the end of a sentence. We can use NLTK's built-in sentence splitter here. If you say nltk.sent_tokenize instead of word_tokenize and pass in the string, it will give you the sentences. If you count them, in this particular case we get four, as we should, and the sentences themselves are exactly what we expect: "This is the first sentence." is the first one; "A gallon of milk in the U.S. costs $2.99." is the second; "Is this the third sentence?" is the third; and "Yes, it is!" is the fourth.

What did we learn here? NLTK is a widely used toolkit for text and natural language processing. It has quite a few handy tools to tokenize text and split it into sentences, and to go from there to lemmatize and stem, and so on. It gives access to many text corpora as well. These tasks of sentence splitting, tokenization, and lemmatization are important pre-processing tasks, and they are non-trivial. You cannot just write a trivial regular expression and expect it to work well. NLTK gives you access to the best algorithms, or at least highly suitable algorithms, for these tasks.
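The claim that a trivial regular expression won't work well can be checked directly. A naive splitter that breaks on any full stop, question mark, or exclamation mark followed by whitespace over-splits on the U.S. abbreviation (a sketch using only the standard library):

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. "
        "costs $2.99. Is this the third sentence? Yes, it is!")

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
naive = re.split(r'(?<=[.!?])\s+', text)
print(len(naive))  # -> 5: it wrongly splits after the abbreviation 'U.S.'
```

nltk.sent_tokenize(text) returns the expected four sentences on this input (it needs the Punkt models downloaded, so it is not executed here).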