Welcome to Module 4! We are dealing with the topic of automatic corpus annotation with computational linguistic tools. First of all, we will address part-of-speech tagging, which in computational linguistics is usually called POS tagging. We will focus on statistical approaches to automatic part-of-speech tagging and look at potential problems, possible solutions, and the performance of such systems. Then we will also briefly discuss the automatic identification of base forms (lemmas).

If we look at a sentence in the Swiss Text+Berg corpus, we see that there is a column "lemma" that contains the corresponding base forms of all words in the text. Additionally, an attribute specifies the "POS" (part-of-speech) tag for each word, and there we find the perhaps somewhat cryptic abbreviations of the German "STTS" part-of-speech tag set. We will learn about these things in this module. But where does the information in the lemma attribute and in the POS attribute come from? The answer is quite simple: it is provided by the statistical computational linguistic tool "TreeTagger" and its model for Standard German.

When we run the TreeTagger, we need to feed it a specific input: a verticalised format with each token on its own line. From this input the TreeTagger generates two additional output columns, on the one hand a column with the POS tags and on the other hand a column with the base forms or lemmas. I would like to draw your attention to three things here.

First, we see the base form "schieben" (push) derived from the inflected word form "schob" (pushed). Note that the separated verb prefix "heran" (toward) is not attached to the main verb. That step is performed in the Text+Berg corpus after the automatic lemmatisation, which then gives us "heranschieben" (push towards) as the base form of "schob" (pushed).

Second, the TreeTagger does not know all words or word forms. "Gewölk" (clouds), a rather unusual word in Standard German, is not known to the TreeTagger model. That is why the TreeTagger assigns the label "unknown" as its base form. Nevertheless, the TreeTagger is able to identify the part-of-speech tag correctly: it can derive from the context and the capitalised word form that it has to be a common noun.

A third aspect concerns the so-called function words, such as reflexive pronouns or indefinite pronouns like "jede" (each, every). It is not always clear which type of base form should be chosen, and it may happen that one is not satisfied with the base form suggested by the TreeTagger and needs to modify it.

If you use the TreeTagger, you have a language-independent system with models for different languages that can be used directly. Many language models also provide the automatic computation of lemmas during tagging, which can be very useful. It is also possible to add words from an external lexicon in case they are unknown to the TreeTagger model; in this way, all possible POS tags and the corresponding lemmas can be added. Something very special about the TreeTagger is the fact that some words can be POS-tagged in advance, before handing them over to the TreeTagger, e.g. proper names that are difficult for the TreeTagger to guess. These pre-tagged words then serve as reliable context, which helps the tagger assign the POS tags of the neighbouring words correctly. The TreeTagger can also be trained from scratch for a new language or a new text genre, but in that case you need manually annotated corpora. This topic will be addressed in Module 5.
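To make the verticalised format a bit more concrete, here is a minimal Python sketch of how such TreeTagger output could be read back into token, POS tag and lemma triples. The file path, the tab-separated three-column layout and the "<unknown>" lemma marker are assumptions for illustration based on the description above, not an official specification.

```python
# Sketch: read one-token-per-line TreeTagger-style output into (token, pos, lemma) triples.
# Assumed layout per line: token <TAB> POS tag <TAB> lemma
def read_tagged_vertical(path):
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:                     # blank lines may separate sentences
                continue
            token, pos, lemma = line.split("\t")
            if lemma == "<unknown>":         # lemma could not be determined
                lemma = token                # a common fallback: keep the word form
            rows.append((token, pos, lemma))
    return rows

# Expected lines for the example discussed above (illustrative):
#   schob   VVFIN   schieben
#   Gewölk  NN      <unknown>
```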
The TreeTagger is not only available as a standalone tool but is also integrated into various text analysis environments such as GATE or UIMA. There are, however, two main disadvantages when using the TreeTagger: your tokenisation needs to match the tokeniser that was used to create the language model, and precise instructions for this are often hard to find; in addition, the POS tag categories are not always fully documented. On the other hand, the TreeTagger provides a kind of "out-of-the-box" system that performs quite well at determining POS tags.

Let's take a closer look at the process of POS tagging in the proper sense: a part-of-speech tagger is a program that identifies the part-of-speech class of each token in a text by assigning an abbreviation, also called a "tag" or "POS tag". Part-of-speech tagging has become a well-established, independent application in the field of natural language processing. The tags can be identified efficiently and reliably, with accuracies of 90 to 98%. This helps with lemmatisation and with corpus-linguistic preprocessing in general, for example as a basis for further syntactic analysis, the computation of lexicographic data, or the automatic recognition of technical terms.

A small example: the word form "eine" (a, feminine) in the expression "eine Kuh" (a cow) or in "Jetzt eine dich mit ihr!" (Now come to an agreement with her!) is ambiguous. If we know that "eine" is an article, it is obvious that the lemma has to be "ein" (a, masculine). If "eine" is a verb, then the lemma should be "einen" (to unite, to come to an agreement). In this way, it is possible to disambiguate an ambiguous word and assign the correct lemma just by identifying the POS tag correctly. Another example, for English speech synthesis: the character sequence "lead" has to be pronounced "[li:d]" if it is a verb, but as a noun (the metal) it has a totally different meaning and needs to be pronounced "[lɛd]".

What kind of part-of-speech tag sets are available? There are several, very different tag sets for different languages. In recent years, the tag set by Petrov et al. with a total of 12 classes, a so-called universal tag set, has become established. It contains the traditional part-of-speech categories: adjectives, prepositions, adverbs, conjunctions, articles, nouns and proper nouns together in one category, cardinal numbers, pronouns, particles and other function words, and verbs, as well as a residual category "X" to which foreign words, abbreviations and also typos belong, and a separate category for punctuation marks. For more than 20 languages, linguistically annotated corpora with this tag set have been made available for scientific research.

Now I will show you another tag set that is slightly more complicated and is usually used for German: the STTS tag set. It is a refined tag set with more than 50 different tags for part-of-speech categories. It is the most important tag set for German, since linguistically annotated corpora of different sizes have been annotated at the POS level with these STTS tags, for example the "TüBa Baumbank" (German treebank) with newspaper texts and around 85,000 sentences, and, slightly smaller but still quite large, the TIGER treebank with 50,000 sentences.

What actually happens when such a tag set is refined? Main categories like nouns are divided, for instance into common nouns ("appellativa") and proper nouns; this is a typical and frequently applied subdivision.
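To give an impression of how a fine-grained tag set relates to the 12-class universal tag set, here is a small illustrative excerpt of an STTS-to-universal mapping in Python. The selection of tags is mine and the table is only a sketch; the authoritative mapping is the one published together with the universal tag set by Petrov et al.

```python
# Illustrative excerpt: collapsing fine-grained STTS tags into the 12 universal classes.
STTS_TO_UNIVERSAL = {
    "NN": "NOUN",     # common noun
    "NE": "NOUN",     # proper noun (merged with common nouns in the universal set)
    "ADJA": "ADJ",    # attributive adjective
    "ADJD": "ADJ",    # predicative/adverbial adjective
    "ART": "DET",     # article
    "APPR": "ADP",    # preposition
    "ADV": "ADV",     # adverb
    "KON": "CONJ",    # coordinating conjunction
    "CARD": "NUM",    # cardinal number
    "PPER": "PRON",   # personal pronoun
    "VVFIN": "VERB",  # finite full verb
    "VVPP": "VERB",   # past participle of a full verb
    "FM": "X",        # foreign-language material
    "XY": "X",        # non-words, e.g. symbols and typos
    "$.": ".",        # sentence-final punctuation
}

def to_universal(stts_tag):
    # Fall back to the residual class "X" for tags not listed in this excerpt.
    return STTS_TO_UNIVERSAL.get(stts_tag, "X")
```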
This subdivision is partly a semantic distinction, at least in German: the proper name "Schweiz" (Switzerland), as in "die Schweiz", behaves in many respects like a common noun, so semantic criteria need to be applied in order to identify the correct POS tag. Another type of refinement can be applied to verbs: finite forms are distinguished from imperative or non-finite forms, resulting in a morphological distinction within the POS tag set. A last example are the pronouns in the German STTS tag set: a very fine-grained system is applied to distinguish substituting from attributive pronouns and to guarantee a consistent division according to how the pronouns are used, which largely depends on their syntactic function.

Let us now focus on the statistical assignment of part-of-speech tags. The special thing about statistical taggers is the fact that they need to be "trained". What does "training" mean here? It means we have so-called training material, which is usually an annotated corpus. In the following example I was working with the annotated TüBa treebank corpus, and here you can see the first two sentences, beginning with "Veruntreute die AWO Spendengeld" (Did the AWO misappropriate donation money?). In the first column you find the tokens, the word forms, and in the second column you find the manually validated POS tags.

We want to know how a statistical tagger is able to learn from annotated material. If you want to measure the performance of your tagger, you need to split the annotated material: about nine tenths serve as training material for creating the tagging model, and one tenth of the corpus is held out in order to measure the performance of the learned model. In machine learning, it is almost a cardinal sin to use data from the training material as evaluation material. We don't want to train a statistical system to memorise things but to generalise from the training data to new, unseen data. The main point is: how well can new, unseen material be processed with the trained model?

What are the difficulties in part-of-speech tagging? As so often in natural language processing, the main difficulty is ambiguity. Looking at the TüBa training material, we can record which POS tags can be assigned to each word and then count, in the test set, how many of the words actually have more than one known POS tag. In the TüBa, around 16% of all distinct word forms are affected; however, these word forms account for more than 50% of all 140,000 tokens in the test set. For English it is very similar: in the Brown corpus, approximately 11% of all distinct word forms have more than one possible POS tag, and again these words represent more than 40% of the data.

How are statistical POS tagging systems able to decide in those cases where they encounter such ambiguous words? A first answer is the so-called lexical baseline: the system simply assigns to each token the tag that was most frequently assigned to it in the training material. With this simple strategy alone, we can achieve up to 90% correct decisions. An example from the TüBa training corpus, the sequence of tokens "die Mehrheit bestimmt" (the majority decides): two of these words have more than one possible POS tag. For "die" (the, feminine), the interpretation as an article is correct here and is also the most frequent one in the training set; all other possible interpretations occur considerably more rarely. The other ambiguous word, "bestimmt" (decides vs. certain), unfortunately is not classified correctly by the baseline, since the past participle is clearly more frequent than the finite verb, which occurred only 12 times in the training material. That is why the lexical baseline fails to assign the correct finite-verb interpretation to "bestimmt" (decides); a small code sketch of this baseline follows below.

For humans, it is relatively obvious that "bestimmt" (decides) has to be a finite verb in this case; we understand that by reading the sentence carefully and by looking at the word in context. POS tagging systems are not able to "understand" the text semantically, but they work fairly well when computing statistics, for example statistical evidence that certain sequences of POS tags are more frequent than others. Such statistical information can be used to revise the baseline's decisions: in this case, it is more frequent and therefore more likely that a noun is followed by a finite verb rather than by a past participle. To combine these two types of statistical information, we need, on the one hand, the tagger's lexicon, i.e. the probability with which certain tags are associated with certain words. On the other hand, we look one or two words to the left in our context window: how likely is a certain tag given the tags at position minus two and position minus one? This information is used to make the correct decision for the current position t_n (see slide). We see in this example that statistical taggers do not need a large amount of information to come to a decision in such cases; the context window is very small. Nevertheless, these approaches perform very well, and it takes quite an effort to improve on such systems.

There is a second difficulty that needs to be solved by statistical taggers: unknown words. Every lexicon in a system, every training set, is always incomplete with regard to what one would like to do with it. Especially in the test set, or later in real applications, proper names or foreign words will be encountered. Words containing numbers are also difficult, since there are so many possible combinations, and in German, compounding is a very productive process. How can a tagger make a reasonable decision here? There are two main techniques for this. On the one hand, there are heuristics based on the character shape of a token, for example whether the word begins with a capital or a lower-case letter, or what the ending of the word looks like; technically, such ending patterns are stored in so-called "suffix trees", which we will look at in a moment. Another possibility is to define certain POS categories that are very likely to contain new words, e.g. proper nouns or common nouns. Other POS categories, such as auxiliary verbs, are much more conservative, and it is very unlikely that new forms suddenly appear there.

Here is the illustration of the "suffix trees" I mentioned before, an example of how they are implemented in the TreeTagger: the three-character ending "ous" indicates an adjective with 96% probability and a common noun with only 4% probability. When we evaluate how well our tagger deals with unknown word forms, for example here on the TüBa test set, we see that approximately 7% of all tokens were unknown word forms. Still, around 89% of these cases were tagged correctly. Overall, the tagger achieves an accuracy of 97%.
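Here is a very reduced sketch, in the spirit of the suffix trees just described, of how word endings can be used to guess a tag for unknown words. Only the 96%/4% figures for the ending "ous" come from the lecture example; the second entry, the simplified tag labels and the capitalisation fallback are invented for illustration.

```python
# Sketch: guess a POS tag for an unknown word from its ending (longest suffix first).
SUFFIX_TAG_PROBS = {
    "ous": {"ADJ": 0.96, "NOUN": 0.04},   # figures taken from the lecture example
    "ung": {"NOUN": 0.98, "VERB": 0.02},  # invented illustrative values
}

def guess_tag(word, max_suffix_len=3):
    for length in range(max_suffix_len, 0, -1):
        probs = SUFFIX_TAG_PROBS.get(word[-length:].lower())
        if probs:
            return max(probs, key=probs.get)
    # Capitalised unknown words are most likely (proper) nouns in German.
    return "NOUN" if word[:1].isupper() else "X"

print(guess_tag("glorious"))  # -> ADJ
print(guess_tag("Gewölk"))    # -> NOUN (via the capitalisation heuristic)
```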
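And here is a minimal Python sketch of the lexical baseline discussed earlier: during training we count how often each POS tag was assigned to each word form, and at tagging time we always pick the most frequent tag. The corpus format (one "token TAB tag" pair per line), the toy counts and the noun fallback for unseen words are my own assumptions; in a real experiment the counts would of course come from the nine tenths of training material.

```python
from collections import Counter, defaultdict

def train_lexical_baseline(tagged_lines):
    """Count tag frequencies per word form and keep the most frequent tag."""
    counts = defaultdict(Counter)
    for line in tagged_lines:
        if not line.strip():
            continue
        token, tag = line.split("\t")[:2]
        counts[token][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}

def tag_with_baseline(tokens, lexicon, fallback="NN"):
    # Unseen tokens fall back to a frequent open class (here: common noun).
    return [(tok, lexicon.get(tok, fallback)) for tok in tokens]

# Toy training data mimicking the tendencies mentioned above:
# "die" is mostly an article, "bestimmt" mostly a past participle.
toy_corpus = ["die\tART", "die\tART", "die\tPRELS",
              "bestimmt\tVVPP", "bestimmt\tVVPP", "bestimmt\tVVFIN"]
model = train_lexical_baseline(toy_corpus)
print(tag_with_baseline(["die", "Mehrheit", "bestimmt"], model))
# -> [('die', 'ART'), ('Mehrheit', 'NN'), ('bestimmt', 'VVPP')]
#    The baseline gets "die" right but misses the finite-verb reading of "bestimmt".
```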
Of course, the simplest case is tagging those word forms that were seen with exactly one POS tag in the training corpus. As soon as many word forms are ambiguous, the performance decreases. But it is not always the case that tokens are tagged worse merely because they are ambiguous; it is a bit more complicated than that.

I also want to show you how the performance of a statistical tagger can be improved by adding more training material. These are the so-called training curves. Here we see the training curve that we get for the entire STTS POS tag set with a total of 54 different tags. If we use only 1,000 training tokens, the accuracy on known word forms already reaches 95%. The recognition of unknown word forms does not work as well, though; there the accuracy reaches only 55%. With such a small training corpus there are many unknown word forms, and it is also difficult to guess them correctly. By adding more material it becomes easier to guess these unknown word forms, and of course there are also fewer word forms that the system has not already observed during training. The green line, which covers both known and unknown words, shows the potential for improvement over the 1,000-token model: with 1,000 training tokens an overall accuracy of about 75% is achieved, which improves by about 22 percentage points if we provide 1.4 million tokens of annotated material. We can see that the factor "unknown words" is a central criterion for the performance of our tagger.

If we radically simplify the tag set and use the universal POS tag set with only 12 different tags instead, the performance increases slightly, from 97 to 98%. Even then, however, a relatively high error rate remains. So we see that simplifying the tag set is not a universal remedy for reaching a tagging accuracy of, say, 99%.

When looking at the errors in a so-called confusion matrix, we see that the confusion of common nouns and proper nouns is very frequent, as is the confusion of finite full verbs with infinitives or past participles. The errors look different when using the universal POS tag set for German: this time the confusion of articles and pronouns is most frequent, followed by the confusion of verbs and adjectives. The distinction between foreign words or non-word expressions and common nouns is also frequently incorrect.

Let's move on to the topic of lemmatisation. A relatively primitive and simple, but also efficient, way of normalising inflected word forms comes from the field of information retrieval. This crude form of normalisation is referred to as "stemming", and a well-known tool is the Porter stemmer. Stemmers cut off the inflectional endings, but also the derivational suffixes, of inflected word forms, in many cases producing linguistically unmotivated forms. We see in this example that "französischen" (French) is reduced to "franzos" (not a word in itself), which is also what "Franzose" (Frenchman) is reduced to. That is a very strong normalisation. Additionally, conventional simple stemming approaches often fail when the base form is very different from the inflected form, e.g. "stieg" (rose) vs. "steigen" (to rise), where stemming leaves the faulty base form "stieg". (A short demonstration with an off-the-shelf stemmer follows below.)

Morphological analysis and lemmatisation systems were already developed in the 1990s for several languages; the company Xerox was very active in this area. Here is an example output of a German tool provided by Xerox. Lemmatisation tries to identify a linguistically motivated base form of a word form.
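If you want to see the aggressive normalisation of stemming in practice, you can try a Snowball-family stemmer, for example the German stemmer shipped with NLTK, a successor of the Porter approach. This is only a small sketch; the exact stems may vary between stemmer versions, but the behaviour discussed above should be visible.

```python
# Demonstration of rule-based stemming for German (requires: pip install nltk).
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")
for word in ["französischen", "Franzose", "stieg", "steigen"]:
    print(word, "->", stemmer.stem(word))

# Expected behaviour: "französischen" and "Franzose" collapse to the artificial
# stem "franzos", while "stieg" stays "stieg" and "steigen" becomes "steig" --
# the strong verb forms are not unified, as discussed above.
```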
In the case of "eine" (a), a different base form should be assigned depending on whether it is used as a pronoun, an article or a verb. Such tools are often capable of giving a detailed morphological analysis covering all morphological properties of the word form, for instance gender, number and other features. Ambiguous words like "eine" (a) can usually be lemmatised without problems if the POS tag and the output of a morphological analyser are available. This is also a typical way in which lemmas enter the TreeTagger models if they are not manually annotated in the training material: the POS tags are annotated manually, the word forms are then matched automatically against a morphological analysis system, and the resulting lemmas are added to the annotated material.

Just a few more words on machine learning techniques that can be applied to the lemmatisation task. They can work very well if one focuses on a particular language or a particular POS category, for instance the "Durm Lemmatizer" for German nouns. Machine learning also works fairly well if further resources are at hand, such as large lexicons containing lemmas together with their inflected forms; from these, general lemmatisation rules can be derived, as has been done in a paper for several European languages. However, knowledge-based approaches are usually better. You can also build your own system with data from Wiktionary, for instance, which is freely accessible and not subject to restrictive copyright.

I will now summarise the contents of this module. We have learned that statistical POS taggers like the TreeTagger can work very efficiently and reliably with predefined models and achieve a tagging accuracy of 94 to 97%. For this purpose, annotated material of at least 100,000 tokens is sufficient to train such models. Typical statistical part-of-speech taggers use the so-called lexical probability, i.e. which POS tag is most frequent for a given word form according to the training material. Usually, you also need a context window of one or two positions to the left and suffix-based predictions for unknown words in order to reach such performance. This is a language-independent approach that can be used for many languages. Decisive for the performance of such systems is the number of unknown words: if this number is high, the performance of a statistical part-of-speech tagger decreases. Although part-of-speech tagging refers to the classification into word classes, a lot of morphological information is often extracted in this task as well, for example the grammatical number of nouns or the finiteness of verbs.

Then we talked about lemmatisation. Lemmatisation is usually computed using knowledge-based morphological analysis systems. There are also statistical systems, but they typically need large amounts of linguistic resources and knowledge to work reliably, and statistical lemmatisation systems for languages with a very rich morphology and complex inflection do not work very well. Stemming can be seen as an inferior substitute for real lemmatisation, but stemming systems are easy to develop and readily available.

I hope I was able to illustrate the tasks of part-of-speech tagging and lemmatisation, especially with regard to statistical approaches. Thank you for your attention.