Welcome to the third part of Module 4. Today we will talk about automatic corpus annotation with computational linguistic tools, focusing on syntactic analysis. There are two main aspects: on the one hand, partial syntactic analysis, the grouping of words into chunks, so-called "chunking"; on the other hand, complete syntactic analysis, "dependency analysis", which identifies the syntactic dependencies between all the words of a sentence.

Here is an example of partial syntactic analysis: a chunk parser divides the sentence displayed on the slide into coordinated, non-nested chunks, i.e. nominal chunks, prepositional chunks, or verbal chunks. The phrase "einer Farbeinstellung" (of a colour setting) is tagged as a nominal chunk, but we do not know whether it depends on the verb or on another noun. The subject is likewise not connected to any other chunk in the sentence, so its syntactic function remains undetermined.

This contrasts with so-called "dependency parsing". Here we see the same sentence analysed with the Mate tools, a statistical dependency parser. In this parse tree, we see binary dependencies between word pairs of the sentence. Each binary relation is visualised as an arrow on the slide: the arrow starts at the "head" element and ends at the dependent element, which sits at the arrowhead. Each arrow carries a label specifying the type of relation between the head and the dependent.

How does chunking work? One possibility are rule-based chunkers built on part-of-speech tag patterns. We need to specify which tags can occur inside a chunk. In a very simple rule like the one shown here, a nominal chunk consists of an article followed by a common noun. Such rules are too simple for larger systems and applications. What we also need are so-called "operators" that express optionality, repetition, or any number of occurrences. A more realistic rule could look like this: a nominal chunk consists of a definite or indefinite article or a possessive pronoun, followed by any number of attributive adjectives and a sequence containing at least one common noun or proper noun. With such more complex rules, high performance can be achieved quite easily.

Statistical approaches are also very important for chunking. Again, a small trick is used that we have already seen for NER tagging, the automatic recognition of named entities: the chunking problem can be reformulated as a tagging problem. For chunks composed of multiple words, an encoding in the IOB format is employed. In the example here, the pronoun "we" is a nominal chunk consisting of a single word and is tagged with the chunk tag "B-NP", whereas the more complex nominal phrase "the yellow dog" gets three chunk tags: a begin tag for "the", an "I" tag marking that "yellow" is inside the chunk, and another "I" tag specifying that "dog" also belongs to the chunk. All words that are not part of any chunk are marked with "O" for "outside". Using this encoding, so-called statistical chunkers can be trained: we need annotated data in which these chunk tags have been assigned to every token, and then machine-learning techniques can be applied to predict such tags, a so-called sequence tagging.
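To make the rule format and the IOB encoding just described more concrete, here is a minimal sketch using NLTK's RegexpParser. The toolkit, the example sentence, and the exact tag patterns are my own illustrative choices, not taken from the lecture slides.

```python
# Minimal sketch of a rule-based chunker with NLTK (assumes nltk is installed
# and the usual tokenizer/tagger data packages have been downloaded).
import nltk

# Two NP rules in the spirit of the lecture: an optional article or possessive
# pronoun, any number of attributive adjectives, and at least one common or
# proper noun; plus a bare personal pronoun such as "we".
grammar = r"""
NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # article/possessive + adjectives + noun(s)
    {<PRP>}                     # a personal pronoun forming its own chunk
"""
chunker = nltk.RegexpParser(grammar)

sentence = "We saw the yellow dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunker.parse(tagged)

# Convert the resulting chunk tree into the IOB encoding discussed above:
# every token receives B-NP, I-NP, or O.
for word, pos, iob in nltk.chunk.tree2conlltags(tree):
    print(f"{word}\t{pos}\t{iob}")
```

Run on this sentence, the sketch yields "We" as a one-word B-NP chunk and "the yellow dog" as B-NP, I-NP, I-NP, with the verb marked O, exactly the pattern shown on the slide.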
"Hidden Markov Models" are very popular and fast while "Conditional Random Fields" are more precise but also a bis slower during training. Great amounts of text data can accurately be analysed syntactically (partial) by such statistical chunking systems. These chunkers are quite robust against imperfect language or also very complex sentences, which are very difficult for complete syntactic analysis. In the year 2000, scientific challenges have been made to compare English chunkers. The best systems achieved an accuracy of 94%, that means that approximately every 20th chunk is not recognised or wrongly recognised. A bit of progress has been made in the meantime but the performances of the chunkers basically stayed the same. In the past years, statistical dependency parsers, i. e. programs that automatically calculate dependency analyses, have become very popular in natural language processing. The fastest systems only need require only a few milliseconds of computing time, to analyse a sentence. Another advantage of statistical dependency parsers is the fact that different types of languages can be analysed very precisely. In 2007 a large evaluation study has been made with structurally very different languages such as Chinese, English, Italian, Catalan and the best systems achieved an accuracy of up to 90% for correctly labeled dependencies. The labels on the arrows must match exactly. For Arabic, Basque, Czech, Greek, Hungarian and Turkish the results were slightly worse and achieved an accuracy of 76-80% of correctly labeled dependencies. Meanwhile, a relatively large amount of annotated material is available thousands of sentences with labeled dependencies. The so-called "universal dependency format" is particularly noteworthy, it is based on the universal part-of-speech tag set and allows very uniform annotations for different types of languages that can be used for science and research. But not only statistical dependency parser deliver fairly good results there are also rule-based approaches for syntactic analysis based on a so-called "constraint grammar" that achieves very good results. Here, the same sentence has been analysed with the CONNEXOR parser. These types of parses try to select a final and hopefully correct analysis out of different possible solutions on the level of syntax and morphology by using different selection and deletion rules. There are also parsers that use a statistical component for disambiguation but are mainly based on hand-written grammar rules. Such a product has been developed at our Institute (Zurich University, Computational Linguistics) the "Parzu-Parser" and here you see an example sentence: Syntactic analyses are normally conducted to access the syntactic dependencies and the content in a sentence automatically. Natural language, of course, contains many ambiguities that make it very difficult for a machine to decode the syntactic and semantic properties of a sentence and to analyse them correctly. In the current slide, you see a relatively long sentence. Take a moment and think about all the possible ways this sentence could be interpreted correctly or incorrectly. Let's see how many possibilities there are in theory to interpret this sentence. The first ambiguity is in the word "stellten" (put) that could be indicative or subjunctive. Then either "Frauen" (women) or "Kopftücher" (headscarfs) can be subject and object of the first clause. 
The prepositional phrase "am Wochenende" (on the weekend) can modify "Inseln" (island), "Frauen" (women) or the verb. So this gives us three more possibilities. "mit Blumenmotiven" (with flower motif) can be ornative and modify "Kopftücher" (headscarfs), or be interpreted as instrumental together with "herstellen" (produce) or comitative for "gemeinsam mit" (together with) with regard to the women. "Her" can be a separated verb suffix of "herstellen" (produce) or a directional adverb "wohin" (where). Also the relative clause introduced by "die" (the) can be ambiguous. All four plural nouns could be the centre of the relative phrase. In the second clause "die" (the) or "ihre Männer" (their husbands) could be subject and object respectively. The possessive pronoun "ihre" (her) can also refer to all four nominal phrases in the first clause. The word "Montagen" can be interpreted in two different ways (mondays vs. installation). "Der Hauptinsel" (of the main island) can be genitive attribute regarding "Zentrum" (centre) or regarding the dative object of the verb. And there are several prepositional phrases that can be attached to the preceding nominal phrases or the verb. And last but not least - the verb "verkauften" (sold) is morphologically ambiguous: it can be indicative or subjunctive. All these possibilities can be combined freely and by multiplying all possibilities we get approximately 130.000 possible readings of the sentence. How can we calculate the intended meaning if there are so many possible interpretations? Especially long sentences have incredibly many possibilites for interpretation and syntactic analysis. Often we don't even realise all these possibilities because we can directly access the intended semantic interpretation thanks to our language and world knowledge. For computers dealing with symbolic numbers and units this task becomes very challenging! It is still very difficult and faulty when calculating the most likely interpretation of a sentence even with present systems. The systems are still not performant enough although statistical approaches that are very popular, can be used to learn the intended interpretations from annotated treebanks Here's an example from the TIGER corpus where 50.000 German sentences have been completely annotated with their constituent and dependency information. Modern, "learning" parsing systems can calculate such parse trees with an accuracy of approximately 70%. What are the benefits of such syntactic dependency analyses for corpus linguistics? Here's an example from dwds.de where such syntactic dependency analyses are applied to the provided corpora. And we see, how often "Krieg" (war) is associated with a verb as a passive subject and the strength of association. Thanks to dependency analyses such collocation analyses can be made for syntactically connected words. This can be done over short and long distance. We saw that chunk parsing is an efficient and robust method for syntactic analysis that has a high reliability. Statistical approaches require relatively little training material and reach, however, an accuracy of approximately 94%. Dependency parsing allows a more fine-grained syntactic analysis. Annotated data has become available for several dozen languages to train statistical dependency parsers. Dependency parsers trained in this way often don't work reliably on very complex sentences. There are still many errors in the analyses. 
We saw that chunk parsing is an efficient and robust method for syntactic analysis with high reliability. Statistical approaches require relatively little training material and nevertheless reach an accuracy of approximately 94%. Dependency parsing allows a more fine-grained syntactic analysis, and annotated data has become available for several dozen languages to train statistical dependency parsers. Dependency parsers trained in this way often do not work reliably on very complex sentences, however, and there are still many errors in the analyses. If a rule-based parser is available for a given language, such systems are certainly able to compete. Thank you very much for your attention; I hope I have been able to give you some insight into how syntactic analysis works.