In this video, we are going to continue where we left off in the previous video. Recall that we were talking about the baby steps of linguistic analysis: a five-step process of sentence splitting, tokenization, lemmatization, part-of-speech tagging, and parsing. In the last video we looked at sentence splitting. Now, let's look at tokenization.

Tokenization is the process of identifying tokens, or words, in text. "Token" and "word" are typically used synonymously, but I prefer the word "token" because it means one indivisible unit, which may contain more than one word. Let's look at that in more detail.

The task is to split sentences into tokens. But how would you do it? The typical rule is white space, the space character: take a sentence, split it on spaces, and you should get the words. But it's not that straightforward, because we also have to handle special characters.

Specifically, what do you do with hyphens? Is "co-operate" one word or two? We would agree that "co-operate" should stay together as one word. But what if it is a two-word phrase like "hocus-pocus", or something like "within-without", where the individual components have meaning on their own? In those cases, maybe you do want to split on the hyphen.

The next example is currency. In this particular case I'm showing the US dollar, but in general, when you have a currency symbol followed by a number, they are written together. Last time we saw that the full stop here should not split the token, so "200.00" should stay together. If it is "2,000", we probably also want to keep the digits together rather than split on the comma. But we probably do want to split the currency symbol from the number, to then be able to work with the fact that this is US currency and not, say, the euro or the pound sterling.

Another example is other kinds of numbers, such as numbers written in scientific notation, or a range like "1-3". Should that be kept together? Probably not: we still want "1" and "3" as separate tokens, to indicate a range going from the number 1 to the number 3.

Other examples are special characters like the at sign (@) or the hash (#), especially when we're working with social media data. Percentages might come in when we're talking about scientific literature or accounting data.

Speaking of social media, there is heavy use of repeated characters. An ellipsis, which shows up as three dots, is not really three dots: it has a meaning of its own; it is a character of its own. In social media you will also see lots of repeated exclamation marks, or an exclamation mark together with a question mark. Together they indicate the end of a sentence, and they form a token in themselves: each individual exclamation mark is not a token to be separated out; we want to keep the run as a group. And when an exclamation mark and a question mark appear together, we may want to distinguish that combination, to indicate surprise mixed with a question, and so on. You have to decide what counts as a token depending on the application that will use this tokenization.

The question then becomes: what happens if we don't have a white space? If tokens are not separated by spaces, how do we split them? We have to come up with specific rules for that too. If there is no white space, we have to add white space, that is, split an existing token. Take the example of the dollar symbol followed by a monetary value, like "$200.00": we want to add a space after the dollar symbol and leave the number together. We can similarly add a space after the last word and before the end-of-sentence marker, again to indicate that the end-of-sentence marker has a meaning of its own, and so does the word it is attached to. The same goes for question marks and parentheses.

Contractions like "don't", "wouldn't", and "couldn't" are similar: we have a word, and then the "n't", the "not", is an important modification. With "shouldn't" and "should not", that negation is important to identify so that we can infer that whatever is being talked about after it is actually not happening. "The patient shouldn't wait longer": you have to recognize that "wait longer" is being negated by the "n't" or "not" before it. We touched on commas in the context of numbers, but in general, commas are also attached to the end of the preceding word without a space, and we may need to separate them out as their own tokens so that the word itself stays intact. So we can see that there is a real need to add white spaces. And this is why we say token and not word: a comma is a token in itself, and a parenthesis is a separate token.
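To make these decisions concrete, here is a minimal sketch of a rule-based tokenizer in Python. This is not a tool from the course; the rules, their ordering, and the function name are illustrative assumptions that encode the choices we just discussed.

```python
import re

# A minimal rule-based tokenizer sketch (illustrative assumptions only).
# Each alternative encodes one of the decisions discussed above.
TOKEN_PATTERN = re.compile(r"""
      [$€£]                 # a currency symbol is its own token
    | \d+(?:[.,]\d+)*       # keep 200.00 and 2,000 intact
    | \w+(?=n't)            # split contractions: "should" out of "shouldn't"
    | n't                   # ... and keep "n't" as one negation token
    | \w+(?:-\w+)*          # words, keeping internal hyphens (co-operate)
    | \.\.\.                # an ellipsis is one token, not three
    | [!?]+                 # runs like "!!!" or "?!" stay grouped
    | [^\w\s]               # any other punctuation stands alone
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("The co-pay shouldn't exceed $200.00, right?!"))
# ['The', 'co-pay', 'should', "n't", 'exceed', '$', '200.00', ',', 'right', '?!']
```

Notice that changing the application's needs, for example splitting hyphens or keeping "1-3" together, is just a matter of reordering or editing these rules; real tokenizers, such as the ones in NLTK mentioned later, implement many more rules of this kind.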
The third step is stemming. Stemming is reducing all morphological variants of a word to a single form. What that means is, when you have inflections such as "go", "goes", "going", "gone", you want to convert all of these into the root form, "go", because they are all variants of "go": one is the present tense, one is the gerund form, one is the past participle, and so on. Speaking of the past tense of "go": the past-tense word is "went", which has a very different form. But if you are mapping all the other variants, you might as well map the past tense too, even though its form is so different: "went" should also get mapped to "go".

Related to this are derivationally related forms, and it's not only about tenses. It can also be adjectives like "kind": you have "unkind", which is the antonym, the opposite of "kind", or "kindness", which is a derivationally related form going from the adjective to the noun. There are common ways in which verbs get modified depending on the tense, and common ways in which words get modified by adding prefixes and suffixes that change their part of speech. All of these should be grouped, or clumped, together, and that is done using stemming. The reason is that you want to know that they are actually talking about the same concept, maybe in a slightly different way. "Kind" and "unkind" are the same concept, one positive and one negative. "Going" and "gone" are the same concept, except that the timing is different; when we want to ignore that timing variation, we want to group them as the same concept, the concept of going.

How do you do it? The way to do that is to use rules, or heuristics, to combine these together. For example, anything that ends with "s" or "es" is probably the present-tense form if it is a verb, or the plural form if it is a noun, and so on.
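Here is a deliberately naive sketch of that heuristic idea. The rule list is a made-up toy, not a real algorithm, but it shows what suffix stripping looks like in code.

```python
# A deliberately naive suffix-stripping stemmer, just to illustrate the
# rule-based idea. The rule list is a made-up toy, not a real algorithm.
SUFFIX_RULES = ["ation", "ness", "ing", "es", "ed", "s"]  # longest first

def naive_stem(word):
    for suffix in SUFFIX_RULES:
        # Strip the first matching suffix, keeping at least two characters.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ["goes", "going", "kindness", "went"]:
    print(w, "->", naive_stem(w))
# goes -> go, going -> go, kindness -> kind, went -> went
```

Note that "went" comes through unchanged: irregular forms like this need a lookup table rather than a suffix rule, which is one reason real stemmers and lemmatizers are more involved than this sketch.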
There are rules of thumb like these that you can use to remove common suffixes. More generally, you can discard affixes, that is, prefixes and suffixes, to bring words into their root form. There are also algorithms that have been defined for this, especially coming from linguistics, such as the Porter stemmer algorithm; there are other stemming algorithms too. They take a word and convert it, through a series of operations, into the root form. It is a deterministic way of getting there, so you can always use it. Specifically, the Porter stemmer will take the words "abominable" and "abomination" and truncate both to "abomin", because it strips off the suffixes, "able", "ation", and so on. But then what you notice is that "abomin" is not a valid English word. So there is a variation of this stemming approach called lemmatization, where you convert the word into another word that is itself valid. "Abominable" and "abomination" get converted to "abominate", which is a valid English word. Hence you can actually group them based on meaning: you can understand the meaning of this word, and the fact that "abominable" is a slightly different variation of that meaning. Great. This is the stemming process. Stemming and lemmatization both take a word and convert it into a root form or a normalized form, and that form is called a lemma.
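NLTK, the toolkit we will use in the demos, ships implementations of both approaches. A minimal sketch, assuming NLTK is installed and the WordNet data has been downloaded:

```python
# Stemming vs. lemmatization with NLTK.
# May require a one-time: nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("abominable"))    # 'abomin' -- not a valid English word
print(stemmer.stem("abomination"))   # 'abomin' -- but the two variants now match

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("going", pos="v"))  # 'go'
print(lemmatizer.lemmatize("gone", pos="v"))   # 'go'
print(lemmatizer.lemmatize("went", pos="v"))   # 'go' -- handles the irregular form
```

The lemmatizer needs the part of speech ("v" for verb here) because the right lemma depends on it, which is one reason part-of-speech tagging sits alongside lemmatization in the pipeline.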
The fourth step is to go beyond sentences, words, and tokens into a broader scheme: the document structure. This structure changes with every application. For health data, documents take forms like these. This is an old image of something typed out, but you see variations of these in electronic health record forms too. For example, this would be a laboratory report, a surgical pathology report, that indicates what happened: the diagnosis, what the examination was, and what the results were. What you'll notice, at least in the second one, the one on the right, is that there are clear sections: a section with information about the patient, then a section about the diagnosis, then a section about the examination, the details, and the results. The report on the left side has something like that too: a section with information about the patient, and then the major remaining part of the document is really about the diagnosis, and it is pretty well structured. Some things are in capitals, some are underlined and bold, and so on. That shows you that this is a section, followed by the details of that section; then after some vertical blank space you have the next section, and so on. At the end there is information about the doctor and other details, which is also there on the right-hand one, further down.

What I want to show you through these two examples is that even though the document structure might be slightly different for every application, and every institution might use a slightly different form, these documents have roughly the same structure: some information about the patient, some metadata; then information about what procedure was done, what instructions were given, or what the results of the procedure were; and then some information about who the author of the document is.

Identifying these sections becomes important. In the clinical domain, for example, you might want to know whether a diagnosis or discussion is about the patient or about the patient's family: is it family history, or social history? That information can only be obtained if you can identify the section and determine that it belongs to family history as opposed to social history or patient history. So this is an important structure, and as you can see from these examples, there are clues in the text itself that can be used.

The key takeaways from what we've seen so far: text processing is an important step that includes both text cleaning and preprocessing, to identify sentence boundaries, words, and so on. The early-stage NLP tasks, whenever you're creating a pipeline, are to split the text and normalize it, and that is an important part of any downstream processing. Clinical texts, and biomedical texts in general, have unique challenges, but they also have some unique resources and solutions to address those challenges. There are toolkits available for all of these tasks that are not specific to medical data; for example, the Natural Language Toolkit, or NLTK, is used for sentence splitting, parsing, tokenization, part-of-speech tagging, and so on. In the demos this week, we are going to talk about NLTK and use it to process text data and then work with it for a specific medical information extraction task. There are also tools available for specific tasks related to clinical data. In the next videos in this module, we are going to look at what those resources are, and how we can use those knowledge resources to build out clinical and medical natural language processing pipelines.
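As a small preview of the kind of thing those demos will do, here is a sketch that ties this video together: a made-up miniature report is split into sections using the all-caps header clue discussed above, then NLTK handles sentence splitting, tokenization, and part-of-speech tagging. The note text and the header pattern are my own illustrative assumptions, not from the course materials.

```python
import re
import nltk  # may need nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

# A made-up miniature report, only to exercise the pipeline end to end.
note = """PATIENT: Jane Doe, 54F.
DIAGNOSIS: Left breast mass. The patient shouldn't wait longer for a biopsy.
FAMILY HISTORY: Mother with breast cancer."""

# Use the all-caps "HEADER:" lines as section boundaries -- one of the
# textual clues discussed above. The pattern is an illustrative assumption.
sections = re.findall(r"^([A-Z ]+):\s*(.*?)(?=^[A-Z ]+:|\Z)",
                      note, re.MULTILINE | re.DOTALL)

for header, body in sections:
    print("SECTION:", header)
    for sentence in nltk.sent_tokenize(body):
        tokens = nltk.word_tokenize(sentence)  # splits "shouldn't" into "should" + "n't"
        print("  ", nltk.pos_tag(tokens))      # (token, part-of-speech) pairs
```

With the sections recovered, downstream code can tell that the breast cancer mention under FAMILY HISTORY is about the patient's mother, not the patient, which is exactly the distinction described above.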