In this video we're going to talk about handling text in Python. Let's start by looking at the primitive constructs in text. You have sentences or strings, which are formed of words or tokens, and words are formed of characters. On the other side, you have documents and larger files, and we'll talk about all of these constructs and their properties. So let's try it out. Let's pull a sentence from the U.N. spokesperson's Twitter profile and call it text1. So text1 here is "Ethics are built right into the ideals and objectives of the United Nations." If you take the length of text1, it tells you how many characters there are in this string. That's 76. What if you want to know the words? Then you have to split this text on space. Let's say that is our primitive tokenization. So you split the sentence on space to get the words or tokens, and let's call that list text2. The length of text2 is 13, so there are 13 tokens in the sentence. And what are those? Ethics, are, built, right, into, and so on. These all look very good. They are all valid words, so it looks like this splitting works. Now, if you want to find specific words, for example long words that are more than three characters long, you would say w for w in text2 if the length of w is greater than three. That will give you all the words in text2 that are more than three characters long: Ethics, built, right, into, and so on. What if we want to find capitalized words? Capitalized words are those that start with a capital letter A to Z, and you can use istitle for that, because istitle is a function that checks whether the first character is capitalized and the others are lowercase. So w for w in text2 if w.istitle() keeps words like Ethics, United and Nations, for which istitle is true, and drops the rest. What if you want to find words that end with s? You can say w for w in text2 if w.endswith('s'), and that will give you Ethics, ideals, objectives, Nations. Great.
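The steps so far can be sketched in a few lines of Python. This is a minimal sketch, assuming the sentence is typed exactly as quoted above, with its final period. The names long_words, title_words and s_words are only illustrative. Because the last token is then "Nations." with the period attached, the sketch strips a trailing period before testing for a final s; that extra step is an assumption of this example, not something shown in the video.

```python
# Sentence as quoted above, including the final period.
text1 = "Ethics are built right into the ideals and objectives of the United Nations."

# Length of a string is its number of characters.
print(len(text1))   # 76

# Primitive tokenization: split on a single space.
text2 = text1.split(' ')
print(len(text2))   # 13

# Words that are more than three characters long.
long_words = [w for w in text2 if len(w) > 3]

# Title-case words: first letter uppercase, the rest lowercase.
title_words = [w for w in text2 if w.istitle()]

# Words ending in "s". Assumption: drop a trailing period first,
# because the last token here is "Nations." rather than "Nations".
s_words = [w for w in text2 if w.rstrip('.').endswith('s')]

print(long_words)
print(title_words)
print(s_words)
```

Splitting on a space leaves punctuation attached to the neighbouring word, which is one reason this counts only as a primitive tokenization.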
So now we have seen how to find individual words. Now let's look at finding unique words and how to use the set function for that. Let's take another example, text3, which is the famous phrase "To be or not to be." If you split it on space and call the result text4, you get six words: to, be, or, not, to, be. Now if you use the set function, it finds all the unique words in this list. So when you take the set of text4, you expect the unique words to be to, be, or, not. So we expect four, but we get an answer of 5. What happened? If you look at the set of text4, you'll see that you do have to, be, or, not, but you have "to" occurring twice: once with a capital T and once with a small t. That's a problem, because you don't want two variants just because one was the first word and was capitalized. To fix that, we should lowercase the text. So we say w.lower() for w in text4, take the set of that, and the length of that set is 4. If you print the entire set, it is indeed to, be, or, not, in some order. Great. Now let's look in more detail at some of the word comparison functions. We have s.startswith(t) and s.endswith(t), as we saw with endswith s. We can also use the expression t in s to check for substrings, that is, whether a particular substring t is in a larger string s. Then you have functions that check the case of a string: s.isupper() checks whether it is all uppercase, s.islower() checks whether it is all lowercase, and s.istitle() checks for title case, where the first character is capitalized and the others are lowercase. In the same way you can check for other patterns. s.isalpha() is true if the string is made only of letters, s.isdigit() is true if the string is made only of the digits 0 to 9, and s.isalnum() is true if the string is made of letters or digits. Once we have done these checking operations, we can look at more of the string operations.
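Here is a small sketch of the unique-word count and of the comparison functions just listed. It assumes the phrase is typed without its final period, so that the last token is a plain "be". The sample strings used for the checks ("UN", "abc123" and so on) are made up for illustration.

```python
# Phrase typed without the final period (assumption for this sketch).
text3 = "To be or not to be"
text4 = text3.split(' ')

print(len(text4))        # 6 tokens
print(len(set(text4)))   # 5, because "To" and "to" count as different strings

# Lowercase every word before building the set.
unique_words = set(w.lower() for w in text4)
print(len(unique_words)) # 4

# Word comparison functions on made-up sample strings.
checks = {
    "startswith": "Ethics".startswith("Eth"),
    "endswith": "Ethics".endswith("s"),
    "substring": "thic" in "Ethics",
    "isupper": "UN".isupper(),
    "islower": "ethics".islower(),
    "istitle": "Ethics".istitle(),
    "isalpha": "abc".isalpha(),
    "isdigit": "2024".isdigit(),
    "isalnum": "abc123".isalnum(),
}
print(checks)            # every value here is True

# A mixed string is alphanumeric but neither purely letters nor purely digits.
print("abc123".isalpha(), "abc123".isdigit())   # False False
```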
We have already seen s.lower(), which takes a string s and gives back the lowercase version of that string. You can use s.upper() to make the entire string uppercase, or s.title() to make it title case. You can split a string s on a smaller string t with s.split(t). So if you split something on space, then t is that space, one character, and we have seen that this gives you the words in a sentence. In the same way you can use splitlines. s.splitlines() splits a string on the newline or end-of-line character, which is \n in many cases. s.join is the opposite of splitting. s.join(t) takes the words held in a list or a set t and joins them using the string s. You can also do some cleaning operations on a string. s.strip() takes out all the whitespace characters, meaning spaces, tabs and so on, from both the front and the back of the string, and s.rstrip() takes them out only from the end of the string. s.find(t) finds a particular substring t in s, searching from the front, while s.rfind(t) finds t in s searching from the end of the string. Finally, s.replace(u, v) takes two parameters, u and v, and every occurrence of u, a smaller string in s, is replaced by v, another small string. So let's take some examples and see how these work. First, let's look at going from words to characters. text5 is ouagadougou. For those who know, this is the capital city of Burkina Faso, and I like this word in general because of its repetition of characters. So you split that word, text5, on ou. What do you expect to see? When you split text5 on ou, you get four groups. The first is an empty string, because text5 starts with ou, so there is nothing before that.
That's what that empty string means. Then between the first occurrence of ou and the second occurrence of ou, you have agad. That's the second element in this list. Then you have g as the third. And finally, ou is the last pair of characters in ouagadougou, so there is nothing after it, and the fourth element is also empty. So when a particular string like ou occurs three times in a text, in this case text5, splitting on it gives you four parts: before the first, between the first and the second, between the second and the third, and after the third. That's why you have these four. Now call that list of four elements text6. If you join text6 using the string ou, you get back ouagadougou. That's where we started, so split and join are opposites of each other. Now suppose we want to find all the characters in the word. We would imagine that we have to split on something, so let's split on the empty string. text5.split('') should give you all the characters, but actually what you get is an error. It says that it's an empty separator, so it doesn't work. So what should we do? The way to do it is to take list of text5. text5 is a string, and list(text5) gives you all its characters. The other way to do it is to say c for c in text5, which also gives you the list of individual characters. So there are really two ways to get characters out of words. One is to use the list function and the other is to say c for c in text5. Now let's take some examples of cleaning text. We take an example, text8, which is the string "a quick brown fox jumped over the lazy dog" but with some whitespace characters before and after. When you split it on space, because of those whitespace characters before and after, you get empty strings at the start, or a tab at the start, and so on.
That happens because there are multiple spaces and tabs up in front, and there's also a space at the end, which gives us an empty string at the end. This is not how you want to get the words when you just have stray spaces around some text. So what we'll do is say text8.strip() and call the result text9. Note that strip removes whitespace characters from both the start and the end. Then if you split on space, you get the sentence as it is, and you get the words right: a, quick, brown, fox, and so on. What if you want to change the text, by finding and replacing? Remember text9 is "a quick brown fox jumped over the lazy dog", and let's say we want to find the character o in this string. text9.find('o') gives you the character offset of the first o it found. That is character 10. You might count a as character one and the space as character two, but in fact a is character zero, because indexing is zero-based. So a is 0, the space is 1, quick takes two, three, four, five, six, then the space is at seven, and then eight, nine, and 10. So the o in the middle of brown is character number 10. In the same way you can use rfind, which is reverse find, and that gives you 40, because if you start from the first a as zero and go to the o in dog, that's character number 40. Finally, you can replace. Let's replace the small o with a capital O. That gives you the same sentence, a quick brown fox jumped over the lazy dog, but every occurrence of the small o, and there are four of them, has been replaced by a capital O. So this demonstrates how you can use find, rfind and replace to change text. What about handling larger texts? Larger texts are typically going to be in files. So you're going to read the files, and you have to read them, let's say, line by line. In this case let's take an example of a file that I have called UNDHR.txt. This is the United Nations' Universal Declaration of Human Rights.
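Before moving on to files, the splitting, joining and cleaning steps described above can be sketched as follows. The exact leading whitespace in text8 (three spaces, a tab and a space) and the capital A are assumptions for this sketch, since the video only says that there are stray spaces and tabs around the sentence. The names joined, chars and split_error are illustrative.

```python
text5 = 'ouagadougou'

# "ou" occurs three times, so splitting on it gives four parts.
text6 = text5.split('ou')
print(text6)                 # ['', 'agad', 'g', '']

# join is the opposite of split.
joined = 'ou'.join(text6)
print(joined)                # ouagadougou

# Splitting on an empty string is an error.
try:
    text5.split('')
except ValueError as err:
    split_error = str(err)   # the message mentions an empty separator

# Two ways to get the characters of a word.
chars = list(text5)
chars_again = [c for c in text5]

# Assumed stray whitespace: three spaces, a tab and a space in front, one space at the end.
text8 = '   \t A quick brown fox jumped over the lazy dog '
print(text8.split(' '))      # starts and ends with empty strings, and contains '\t'

# strip removes whitespace from both ends.
text9 = text8.strip()
words = text9.split(' ')
print(words)

print(text9.find('o'))       # 10, the o in "brown"
print(text9.rfind('o'))      # 40, the o in "dog"
print(text9.replace('o', 'O'))
```

Calling split() with no argument at all would also drop the stray whitespace, but the sketch stays with strip followed by split on a space to mirror the walk-through.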
You open it using open with the file name UNDHR.txt, and you have to open it in read mode, so that is r. Let's call the resulting file handle f. Once you have that, f.readline() reads the first line, and that is "Universal Declaration of Human Rights" with the \n to indicate that it's the end of the line. If you are reading the full file, you can do it in a loop, reading the lines one by one. Or you can read it all at once. Because we have already opened the file and I want to reset the reading point to the start of the file, I'm going to say f.seek(0). That resets the reading position. Then I can just use f.read(), which reads the entire file and gives it back, and we call that text12. If you look at the length of text12, it has read the entire file, so there are 10,891 characters; that is your length. So it's not just one sentence or one line out of this file. Then text12.splitlines() gives you 158 sentences. In fact, they are not sentences; I should say they are lines that are delimited by \n, the newline character. So for all purposes here, a line means something that ends with \n, and there are 158 such lines. The first line is "Universal Declaration of Human Rights". You will notice that when you use splitlines and read the line that way, the \n goes missing, because when you split on a particular character, that character is not included in the result. Whereas f.readline() up at the top was reading one line at a time, so its result has the \n at the end. In general, these are the file operations that you would want to use. You have open, which takes a file name and a mode, so read mode would be r, write mode would be w, and so on. You can read using f.readline(), or f.read(), or f.read(n), which reads n characters rather than the entire file. Or you can use a loop, say, for line in f, doSomething(line). You can use f.seek to reset the reading position.
So f.seek(0) resets the reading position back to the start of the file. f.write is how you would write a particular message into a file, if you opened it in the appropriate mode, the write mode, say. Then once you've done all the operations, the counterpart of opening is closing, so f.close() closes that file handle. And you can check whether that particular file handle has been closed by using f.closed. When you read this file you'll notice that f.readline() gave you the \n at the end. That is not necessarily something we want to keep. So how would you take that last newline character out? You could use rstrip. Remember, rstrip removes whitespace characters from the end of the string, and \n is one of them. So rstrip would just remove it and give you "Universal Declaration of Human Rights" without the \n. And rstrip also works for DOS newline characters that show up as \r or \r\n and so on. So it is the function you would want to use universally, rather than saying find \n, because \n may not be the newline character in the kind of file you have. So those are the take-home messages here, as we looked at how to handle text sentences. We saw how to split a sentence into words, and words into characters, and we saw two ways to do that. We used set to find unique words, and we looked briefly at how to handle text from documents or from large files. Next, we're going to go into more detail about how you can process the text to find some interesting concepts within it.
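The file operations described above can be sketched as follows. Since UNDHR.txt may not be at hand, this sketch first writes a small stand-in file with three short lines into a temporary directory. The file name sample.txt and its contents are assumptions for illustration and are not the real document, so the counts differ from the 10,891 characters and 158 lines mentioned in the video.

```python
import os
import tempfile

# Stand-in file (assumption): three short lines, not the real UNDHR.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

f = open(path, "w")                 # write mode
f.write("Universal Declaration of Human Rights\nPreamble\nArticle 1\n")
f.close()

f = open(path, "r")                 # read mode
first_line = f.readline()           # keeps the trailing "\n"
print(repr(first_line))

f.seek(0)                           # reset the reading position to the start
text12 = f.read()                   # read the entire file as one string
print(len(text12))

lines = text12.splitlines()         # the "\n" is not included in the results
print(len(lines), lines[0])

f.seek(0)
# Loop over the file line by line, removing trailing whitespace such as "\n".
cleaned = [line.rstrip() for line in f]

f.close()
print(f.closed)                     # True once the handle is closed
```

In everyday code a with open(...) as f block is a common alternative that closes the file automatically; the explicit open and close calls are used here only to mirror the operations named in the walk-through.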