Welcome to Module 4: Automatic corpus annotation with computational linguistic tools. Today we'll talk about the recognition, classification and linking of named entities. If we want to recognise and tag named entities, we first need to identify the named entity mention in the text. Then, we want to determine the class or category that the named entity belongs to. Finally, there is the so-called "linking" or "grounding" of named entities, which connects the named entity mention in the text to a unique reference entity in the real world.
Let's talk about the task of Named Entity Recognition (NER). The term "named entity" was coined in the research field of information extraction, where NER represents an important subtask. The task involves the recognition of names, for example personal names, the proper names of geographical entities (toponyms), and also names of organisations, companies or products. The task can also cover the recognition of temporal expressions, dates or currency amounts. Here is an example: "1950 plante Graf Dino Lora Totino eine Seilbahn von Cervinia zum Gipfel des Matterhorns." (In 1950, Count Dino Lora Totino planned a cable car from Cervinia to the summit of the Matterhorn.) This sentence contains the temporal expression "1950", a person with a title, "Graf Dino Lora Totino" (Count Dino Lora Totino), and two places, "Cervinia" and "Matterhorn". Note that "Gipfel" (summit) could also be described as a geographical entity, but when dealing with places [ORT], named geographical entities are more interesting.
What are the difficulties in the task of Named Entity Recognition? The German word "Kohl" (cabbage) is ambiguous: it can be a normal noun, tagged with the STTS POS tag "NN" and referring to a vegetable, but it can also be a last name, for example that of the former German Chancellor Kohl. In German, the NER tagging task is far more difficult than in English, since normal nouns and proper nouns are both capitalised.
Let's make a small experiment with the name "Kohl": We want to find all instances of the word that have been tagged as a proper noun in the 1999 volume of the Austrian newspaper "Oberösterreichische Nachrichten", which is available pre-tagged in the German corpus DeReKo. If we search the texts that have been tagged using the TreeTagger (the query is shown on the slide), we get 273 hits for the word "Kohl" together with the POS tag for proper nouns. But the corpus was also analysed morphologically using the CONNEXOR tagger (the query is again displayed on the slide), and this time we find 105 instances. That makes a clear difference. Let's now analyse how many of the hits are actually correct and how many are false positives. The result of the TreeTagger contains 22 false positives out of 273 hits. The CONNEXOR tagger has 0 false positives out of 105 hits for proper nouns. That means all of them are actually proper nouns. In other words, the TreeTagger returns 251 correct hits along with some errors, whereas the CONNEXOR tagger returns only 105 hits but no errors. But which one of the two is better?
How do simple rule-based methods for named entity recognition work? Very important ingredients are so-called "gazetteers", that is, name lists, together with context rules that control and refine the application of such lists. To recognise personal names such as "Bundeskanzler Helmut Kohl" (Chancellor Helmut Kohl), we need a list of professional titles that can optionally occur in the text, and then a list of first names, which usually don't correspond to normal nouns in German. As soon as we have identified the first name, we can identify the following capitalised word as the last name. It is relatively easy to recognise proper names with such patterns or rules. Using this method, all last names can be learned from the text and added to a list of last names that is specific to this text. Once we have identified such document-specific last names, the occurrences of "Kohl" standing alone in the text can also be identified without risking too many false positives. Usually, a word is only used in one sense within a document.
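The rule just described (optional title, first name from a list, following capitalised word as last name) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two small name lists and the function names `find_person_names` and `find_learned_last_names` are made up for this example and are not taken from any real gazetteer or tool.

```python
# Toy gazetteers; the entries are illustrative assumptions, not a real resource.
TITLES = {"Bundeskanzler", "Graf"}
FIRST_NAMES = {"Helmut", "Dino"}


def find_person_names(tokens):
    """Apply the rule: [title] + first name + capitalised word.

    Returns the full names found and the set of last names learned
    from this document.
    """
    names = []
    last_names = set()
    i = 0
    while i < len(tokens):
        start = i
        # The professional title is optional.
        if tokens[i] in TITLES and i + 1 < len(tokens):
            i += 1
        # A known first name followed by a capitalised word gives a last name.
        if (tokens[i] in FIRST_NAMES and i + 1 < len(tokens)
                and tokens[i + 1][0].isupper()):
            names.append(" ".join(tokens[start:i + 2]))
            last_names.add(tokens[i + 1])
            i += 2
        else:
            # No match: continue with the token after the start position.
            i = start + 1
    return names, last_names


def find_learned_last_names(tokens, last_names):
    """Second pass: positions of all tokens that match a learned last name."""
    return [i for i, token in enumerate(tokens) if token in last_names]


# "Chancellor Helmut Kohl spoke yesterday. Later Kohl travelled to Vienna."
tokens = "Bundeskanzler Helmut Kohl sprach gestern . Später reiste Kohl nach Wien .".split()
names, last_names = find_person_names(tokens)
positions = find_learned_last_names(tokens, last_names)
```

In this sketch the second pass also finds "Kohl" where it stands alone, because the last name was learned from the full mention earlier in the same document. Multi-word last names such as "Lora Totino" are not handled and would need an additional rule.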
To recognise geographical names, gazetteers (name lists) are also very important. There are huge lists, for example geonames.org with hundreds of thousands of geographical names. The problem is that these names cannot just be used blindly, since there are many names in the world that are identical to normal words in a language, both in English and in German.
How can named entities be classified? An important method is the so-called "NER tagging". That's a statistical method similar to part-of-speech tagging, which is why the names of the two methods are also quite similar. In this task, we try to learn from annotated data which parts of the text refer to named entities and what type they belong to. The named entity recognition task can thus be modelled as a tagging task. On the left of the slide, you can see an example of an annotated sentence. The token column contains all tokens of the sentence, and on the right, the NER tag column encodes all named entity classes in the so-called IOB format. This encoding has two parts: "B" stands for the beginning of a name, and after the "B" the type or class of the named entity is specified, "TIME" in this case. If a name consists of two or more parts, the "I" marks that this item is inside a name, i.e. the token "Dino" is a continuation of the personal name "Graf Dino Lora Totino". All tokens that aren't part of a name are marked with an "O" for "OUT", outside the name. Now we are able to train so-called sequence-tagging programs on this annotated training material, and they will return NER-tagged output in which the classified named entities can be identified.
The best statistical methods reach an accuracy of approximately 90% on English test data. That means that one out of 10 entities is identified incorrectly or isn't found at all. These statistical approaches already achieve a fairly good performance. But this approach cannot be used in all cases.
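To make the IOB encoding described above more concrete, here is a small Python sketch that reads the entities back out of a tagged sentence. The tag spelling with a hyphen ("B-TIME") and the class names "PER" and "LOC" are assumptions made for this example; only "TIME" is named on the slide, and a real tagset may look different. The function name `decode_iob` is also just illustrative.

```python
def decode_iob(tokens, tags):
    """Collect (entity_text, entity_type) pairs from parallel token and IOB tag lists."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new name begins; close the previous one if there is one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the name that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an "I" tag without a matching open name) ends the current name.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["1950", "plante", "Graf", "Dino", "Lora", "Totino", "eine", "Seilbahn",
          "von", "Cervinia", "zum", "Gipfel", "des", "Matterhorns", "."]
tags = ["B-TIME", "O", "B-PER", "I-PER", "I-PER", "I-PER", "O", "O",
        "O", "B-LOC", "O", "O", "O", "B-LOC", "O"]
entities = decode_iob(tokens, tags)
```

For the example sentence, this groups "Graf Dino Lora Totino" into one personal name, because the first token carries the "B" tag and the following three tokens carry the matching "I" tag.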
Here's an example from the language of football to illustrate some difficulties. Just take a moment to think about potential difficulties.
Now we move on to the third subtask of Named Entity Recognition, the so-called "entity linking", "entity normalisation" or "disambiguation of named entities". We try to resolve the real-world referent that the named entity mention in the text refers to. Back to the ambiguous name "Kohl": We know that this is a personal name, but which person is actually meant? Sometimes this task is also formulated as a "wikification" task, which means you try to annotate the URL of the wiki-page that corresponds to the name. It might be Helmut Kohl, the German Chancellor, or it might be Helmut Kohl, a referee, or any other namesake without a wiki-page. Here you see an example of the wiki-page of Helmut Kohl. The text displayed on such a wiki-page helps to disambiguate the name, since we want to know which person we are talking about. The referee Helmut Kohl is far less prominent, which is also reflected in the length of his wiki-page. It is also possible that the Helmut Kohl mentioned in the text is no celebrity at all and thus doesn't even have a wiki-page. This entity linking or wikification task also helps to find out whether the name can be identified unambiguously.
Entity linking can also involve the reference to so-called linked data. There is, for instance, the tradition of libraries to list personal names in the so-called "Integrated Authority File" (IAF). Helmut Kohl is listed there, and a unique URL is provided along with additional information about the person. Another source for facts and named entities is wikidata.org, which is a successor of Google's Freebase database and contains millions of entities together with structured information. In the case of Helmut Kohl, the following information is available: he was married twice and had children, the names of the children are listed, and so on.
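The idea that the text of a wiki-page helps to decide which "Helmut Kohl" is meant can be illustrated with a very simple word-overlap heuristic in Python. This is a rough sketch under stated assumptions and not how a real entity linking system works: the candidate identifiers, the keyword sets and the function name `link_entity` are all hypothetical and were made up for this example.

```python
# Hypothetical candidate entries. The identifiers and keyword sets are
# made-up stand-ins for the text one would find on the respective wiki-pages.
CANDIDATES = {
    "Helmut Kohl": [
        {"id": "example:kohl-chancellor",
         "keywords": {"chancellor", "government", "politician"}},
        {"id": "example:kohl-referee",
         "keywords": {"referee", "match", "football"}},
    ]
}


def link_entity(mention, context_tokens, candidates=CANDIDATES):
    """Pick the candidate whose keywords overlap most with the mention's context.

    Returns None if no candidate shares any word with the context, for
    example when a namesake without a wiki-page is meant.
    """
    context = {token.lower() for token in context_tokens}
    best_id, best_score = None, 0
    for candidate in candidates.get(mention, []):
        score = len(context & candidate["keywords"])
        if score > best_score:
            best_id, best_score = candidate["id"], score
    return best_id


linked = link_entity("Helmut Kohl", "The referee Helmut Kohl stopped the match".split())
unlinked = link_entity("Helmut Kohl", "Helmut Kohl opened a bakery".split())
```

In the first call, the context words "referee" and "match" point to the referee. In the second call, no candidate is supported by the context, so the sketch returns no link at all rather than guessing.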
For geographical names, this disambiguation is realised using geographical coordinates as a reference system. Here we have 27 places in Switzerland that are called "Schafberg", and depending on which "Schafberg" we're interested in, the corresponding coordinates can be selected in this topographic information system and linked to the name. In this case, we have the "Schafberg" in the canton of Uri, close to Realp in Switzerland.
When dealing with entity linking, grounding or normalisation, personal names also need to be mapped to an unambiguous identifier, like a wiki-page or an entry in a linked data database. For geographical entities, it is important to define the precise place or region of a toponym in the coordinate system. When dealing with temporal expressions, normalised formats like "TimeML" are very popular for annotating a standardised representation of time or date specifications. Overall, entity linking is a very active field of research. There is a lot going on here, but not many systems can be used as "out-of-the-box" tools.
In the Text+Berg corpus (yearbooks of the Swiss Alpine Club), named entities are also annotated in a so-called "standoff annotation format" for personal names, toponyms and temporal expressions. Here you see a sentence from an article from 1935 that reviews the life of a famous late mountain guide. What are the difficulties that we might encounter when annotating named entities? In this sentence, we have two personal names, the German last names "Kaufmann" and "Maurer", which can also be normal nouns (merchant and bricklayer). It is thus important to be able to disambiguate. The toponym "Petite Dent de Veisivi" was recognised as a mountain name, but in the geographical coordinate database only the name abbreviated with "Pte." can be found. The recognition is therefore limited in this case.
We were able to identify that it is a mountain, since it follows a frequent French pattern for mountain names, but we're not able to look up the coordinates in the database. Another difficulty is temporal expressions, in this example "August, 29". If we want to find out which year is meant, we would need to read more than two pages of the text to understand that somebody is reviewing a life in the 19th century, although the text itself is taken from the 1935 yearbook. Our system proposes the year 1888. That's three years after the correct date that is mentioned in the text, but at least we're in the right period of time.
I will briefly summarise today's contents. We've learned that the main problem when dealing with Named Entity Recognition is ambiguous terms and expressions. Depending on the context, one word might refer to a real-world person, be part of a personal name, or be used as a completely normal noun. The next problem is to identify the type or class of a named entity. Automatic, statistical machine learning approaches are able to solve this problem by learning from annotated training material. That's the so-called "NER tagging" approach. Once we have identified the type of a named entity, the next step is to connect the named entity mention with a specific person or a unique geographical entity by referring to linked data or wiki-pages. It is a challenging task that requires the combination of knowledge-based and statistical approaches.
Thank you very much for your attention. Named Entity Recognition is a very important and exciting topic for the field of Digital Humanities, and I have therefore added a couple of bibliographic references where you can find out more about this topic.