In this video, we are going to go into a little more detail about how some of what we have learned in this course can all be put together in specific natural language processing tasks that are really important for understanding free text. Information, as we have seen so many times, is hidden in free text in very interesting ways. Most traditional information is structured, while a lot of information now is in unstructured free-text form. As we saw earlier when I was introducing this course, about 80% of data is now in unstructured form, in blogs, on websites, and so on. And it's growing. You have tweets and other social media that are adding more information here. So the question then becomes, how do you convert this unstructured text to structured form? You don't necessarily have to convert it all; what I'm really getting at is, how do you extract relevant information from unstructured text? And if you want to make it searchable or usable later on, you would probably put it in a structured form, as it has traditionally been kept. So in that sense it is a conversion from unstructured to structured text. And that's where information extraction comes in. The goal of this task is to identify and extract fields of interest from free text. So let's take an example of a medical article. This one is from WebMD from quite some time back, and it says Erbitux helps treat advanced lung cancer. By looking at this, you know that one of the pieces of information you want to extract would be the headline. But very importantly, you also want the components of it, saying that Erbitux is a particular treatment for lung cancer, and the fact that advanced lung cancer is a specific form of lung cancer. So you have to understand which modifier could be dropped. In this case, advanced could be dropped but lung couldn't be.
Because lung cancer treatments are different from, let's say, blood cancer or breast cancer treatments, and so forth. But this webpage has more information than just the headline. It has some information about who wrote this article or who reviewed it, when it was published, the place it was published from, and so on. And you can see that this happens all the time in news reports and other places when you see some piece of information, a news article or a tweet. There is more than just the text that is important and of interest to people. So who wrote a particular news report? Where is it coming from? What does it say? Who or what are the other persons involved in this story? This is the typical representation of an information extraction task. The first thing to notice here is something about fields of interest. These fields of interest are named entities. If it's news, we are talking about people, places, dates, organizations, and other geopolitical entities. When you say White House, you mean something completely different from a white house. Or when you say 10 Downing Street, it means something quite different from the literal address 10 Downing Street. If you're talking about finance, you're talking about money or monetary values: how much something is worth or sold for, the names of companies, or the stock price index of a particular company. When talking about medicine and health, you're talking about diseases, drugs, and procedures. And especially if you're talking about protected health information, then you're talking about addresses and unique identifiers, sometimes even emails, URLs, professions, and so on. So you can see that these named entities can be very diverse, and they don't all follow the same patterns or share the same features. So for example, in news it's fairly common for people's names and places to be capitalized.
So you have title case there; you would see China with a capital C more often than with a small c. Dates typically follow a particular format; we have seen some examples of that early on in this course. But those rules don't apply in medicine. So, for example, when you're talking about lung cancer, you don't have a capital L and a capital C. When you're writing it out, it's going to be all lowercase. So the particular rule that you might have for named entities being title cased is not applicable to some data sets, like those in medicine. Once you have understood what named entities are, the next thing is relations. Relations are basically what happened to whom, when, where, and so on. So again, the individual pieces of information here are named entities: who is basically a person or an organization, when is a particular time or date, and where is a place. But you also need the relationship between these named entities to get meaningful semantics out of it. So let's take these two tasks in more detail and see what that means. Named entity recognition relies on something called named entities. Named entities are noun phrases that are of a specific type and refer to specific individuals, places, organizations, and so on. And named entity recognition is a set of techniques and methods that help identify all mentions of predefined named entities in text. So for example, you have to identify the mention, the actual phrase where something is happening. You need to know where that mention starts and where it ends. This is the boundary detection subtask within named entity recognition. Once you have understood the boundary, the next task is to type it, to tag it, or to classify it as one of the named entity types that you are interested in. So for example, if you have the word Chicago, it could be a place, it could be an album, it could be a font.
So depending on which sense it is, you need to know what label should be assigned to this word, once you have identified that Chicago is your relevant phrase. So let's take an example of a named entity task in the context of a medical document. I'm going to leave this one up and let you work out which named entities here you would be interested in. Okay, now that you've had a chance to see what the named entities are, let's look at some of them. One big one is the disease or the condition or the symptom, right? So you have bilateral hand numbness, or occasional weakness, or C5-6 disc herniation. Okay, but you also have age here. So you could say age is 63, or you could say age is 63 year old. You have gender, female. You have the mention of a medical specialty in neurologist. You have a procedure that was done, so you have a workup or an MRI. And in relation to all of this, you also have body parts like feet and hand. And you quickly come to see that these named entities don't have to be separate from each other; they can be nested. So for example, bilateral hand numbness is a completely valid named entity in itself. But hand is also a named entity that is of interest. So then the question becomes, at what granularity do you need to annotate this? And if you're annotating hand separately, why not C5-6 disc or cord, because you're talking about the spinal cord in this case, you're talking about an anatomical part. So these are the decisions that need to be made when you're defining this named entity task. So in this particular case, suppose you define the task to be one where you say we are only interested in the condition, the diagnosis, the age, the gender, the procedures that were done, and the specialty. That's it. In that case, you're not going to focus on body parts such as hand and feet; even if they are there, that's fine. And some decisions have already been made that way.
So for example, nobody said that patient is not a valid entity here; it very well could be. If neurologist is a valid entity, patient could be a valid entity, and that's a decision that has to be made when you're defining this named entity task. Suppose you have these goals. Then you know that you have not only identified the boundary, but also colored it in some way, tagged it to show what is the type of this [INAUDIBLE] that we have identified. Okay, now that we know the task well, what are the approaches that people use to identify named entities? It really depends on the kind of entities that you have. So for example, if you have well-formatted fields like dates and phone numbers, then you would use regular expressions. Recall that we wrote regular expressions for dates, for example, in week one of this course. And that is a very valid named entity identification task. In fact, the assignment was really asking you to do an information extraction task for dates from the given text file. For other fields, it's fairly common to use a machine learning approach. In fact, even for dates and phone numbers you might want to use a machine learning approach, where you use these regular expressions as features. But you have other features that help determine whether a particular extraction is valid. So for example, phone numbers and fax numbers are very similar. The only thing that differentiates them is what the number is preceded by. So is it "phone." or is it "fax."? That kind of surrounding information helps here, and you could either include it in a regular expression or let it be a feature in a machine learning model that you then train over large data sets. The standard NER task in natural language processing is a four-class model: person, organization, location, and everything else, or everything outside. So, suppose you have a sentence like, John met Brenda.
You want to say that John and Brenda are named entities, so you have those two persons. And then met is a word that is outside of any named entity, so you are going to use these two classes. So you're going to say John is a person, met is outside, and Brenda is a person again. Once you have identified these, you can go on and ask the question about relations and relation extraction. The relation extraction task is identifying the relationships between named entities. So let's take the example that we saw earlier, Erbitux helps treat lung cancer. This sentence has a relationship between two named entities. One is Erbitux. Erbitux is colored yellow here to represent that it is a treatment of some kind. And then you have lung cancer, which is in green to represent a different named entity, in this particular case a disease or a diagnosis. And then you have a relationship between them. So Erbitux and lung cancer are linked by a treatment relation, going from Erbitux to lung cancer, that says Erbitux is a treatment for lung cancer. You could just as easily have a link going the other way, from lung cancer to Erbitux, saying lung cancer is treated by Erbitux. So this is the relation, a very simple relation, a binary relation between a disease and a treatment, or a drug. The other task is co-reference resolution. That is to disambiguate mentions in text and group mentions together if they refer to the same entity. An example would be, Anita met Joseph at the market and he surprised her with a rose. Here you have two named entities, Anita and Joseph. But then the second part of the sentence uses pronouns, he to refer to Joseph and her to refer to Anita. So in this case, it's pronoun resolution, where you are making the inference that if Anita met Joseph at the market, then Joseph surprised Anita with a rose. Why is that important?
It is important in a question answering task, where you're given a question and you need to find the most appropriate answer from text. Now appropriateness is something that we can talk about in more detail. But let's say that in this case the correct answer, or the set of correct answers, is one definition of it. And you would need both named entity recognition and relation extraction to answer these questions. So, for example, suppose the question was, what does Erbitux treat? You first have to identify that Erbitux is a treatment. The relation is treat, and then you somehow fill a slot in this relation to say Erbitux is a treatment for something, and that something comes from text, and it is lung cancer. Or, who gave Anita the rose, where you are doing some sort of pronoun resolution to then identify that Joseph was the person who gave Anita the rose at the market. So this builds on the named entity recognition task and the relation extraction and co-reference resolution that we have seen earlier. So, to conclude, we see that information extraction is an important task for natural language understanding and for making sense of textual data. It is the first step in converting this unstructured text into a more structured form. Named entity recognition becomes a key building block in addressing these more advanced NLP tasks. And named entity recognition systems use the supervised machine learning approaches and text mining approaches that we have discussed in the course. So for example, if the entity that you need to recognize is a date, you are typically using the regular expressions that we talked about in week one. If you are talking about extracting people's names, so person names or organization names, you are not only using a machine learning model to identify what is an entity and what label you should give it, but the features that you're going to use are also coming from what we talked about in week two.
So, for example, we want to know, yes, whether a word is capitalized or not, but also what the part of speech of a particular word is. Is it a noun or a verb? What is the semantic role that a particular word is playing in a given context in a sentence? And these could be features that you would then put into a named entity recognition model. NLTK has a built-in NER model that is trained on news datasets and so on, for the standard person, organization, location task. But as I have shown in this video, the named entity recognition problem goes beyond just news. If you're extending it even to finance, or if you go all the way to medicine, it's a completely open problem, one where the topics that we have talked about in this course all come together. I hope you had fun with the course; I had fun presenting it to you.
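As a small illustration of the earlier point about well-formatted fields, here is a minimal sketch of how a regular expression plus the word preceding a number could separate phone numbers from fax numbers. Everything specific in it is an assumption made for illustration: the hyphenated three-three-four digit number format, the cue words "phone" and "fax", the sample text, and the helper name extract_numbers.

```python
import re

# Assumed number format: 555-123-4567. Real documents vary far more than this.
# The optional "label" group captures the cue word that precedes the number.
PATTERN = re.compile(
    r"(?:(?P<label>phone|fax)\s*[.:]?\s*)?"  # optional cue word, e.g. "Phone:" or "Fax."
    r"(?P<number>\d{3}-\d{3}-\d{4})",        # the number itself
    re.IGNORECASE,
)

def extract_numbers(text):
    """Return (label, number) pairs; label is 'unknown' when no cue word precedes."""
    results = []
    for match in PATTERN.finditer(text):
        label = (match.group("label") or "unknown").lower()
        results.append((label, match.group("number")))
    return results

# Hypothetical sample text.
sample = "Phone: 555-123-4567, Fax. 555-987-6543; call 555-000-1111"
print(extract_numbers(sample))
# [('phone', '555-123-4567'), ('fax', '555-987-6543'), ('unknown', '555-000-1111')]
```

In a machine learning setup, the regular expression match and the preceding cue word would more likely be two features among many, rather than a hard rule as they are here.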