Hello and welcome to the second part of Module 3: a hands-on class on corpus analysis using concordance software, namely the Open Corpus Workbench. We will give a brief introduction to concordance and corpus analysis tools, especially the Open Corpus Workbench and how you can work with and analyse text data using this online tool.

What is a concordance or corpus analysis software? What does concordance mean? The word concordance derives from the Latin "concordare" (to match, to correspond). The concept has a long history: concordances were already compiled in biblical scholarship. Those were alphabetically sorted lists of important words and phrases from the Bible, used to learn more about the use of certain terms. Literary studies also worked with such concordances, of course manual ones, not digital ones. In the world of digital text corpora, there are completely new possibilities for working with concordances, and we already learned about the KWIC (keyword in context) view in our last module. That is basically the same thing: a clear representation of concordances in digital form.

A corpus analysis software should thus provide tools to manage and analyse large text corpora. We want to be able to query the corpus and to display the results in different ways. We want to carry out statistical analyses of the results, to extract collocations, and to examine the distribution of a certain phenomenon according to the texts' metadata.

Here, we will be using the Open Corpus Workbench, a concordance and corpus management software. It is basically a collection of different tools that can be used to manage and analyse large text corpora. The corpora can be enriched with linguistic information and annotation; that is what makes them so special. These annotations can be accessed on various levels. The CQP (Corpus Query Processor) is a powerful engine that allows us to run complex queries. There is also a web-based application called CQPweb that can simply be used in the browser and provides a clear and intuitive user interface. The software is open source and freely available, and the interface can be adjusted to one's individual needs. Under this link: cwb.sourceforge.net you can find the up-to-date version of the software, which you can also install and administer yourself.

I will show you some important steps for analysing text corpora with this software. We will use the Text+Berg corpus (Swiss Alpine texts) that you already encountered in the first module. You will need to register to access the corpus. You can do so for free via this link: www.textberg.ch where you switch to the menu "Corpus" and register. It would be best to follow my demonstration on your own computer and try everything out directly, or you take a look at it first and try it out afterwards.

This is the query page of CQPweb where we can access the corpora. We work with the Text+Berg corpus, namely the yearbooks of the Swiss Alpine Club, Release 151. In the middle, you can see the query window where we can formulate our query. You can select several options here, but I will come back to them at a later stage. On the left side, you can see different menus where you can call up various commands and retrieve information. To begin with, you might want to take a look at the Corpus Info section. Under "View Corpus Metadata" you can find important corpus information as well as further statistical information.

You see the total number of texts available in this corpus (21,000 texts), the total number of words (or tokens), 38.9 million in all, and the number of types, i.e. the number of distinct word forms in the corpus. In addition, you see the various metadata that are accessible in this corpus and the different kinds of annotation on the level of tokens. For each token, lemma information has been added, which represents the base word form. Each token has an ID, which is not really interesting for us now, and we have information about the part-of-speech class. In our case, the part-of-speech tagging follows the conventions of the Stuttgart-Tübingen-Tagset (STTS). You can also take a look at the official documentation of the STTS to see the different part-of-speech classes and how cases of doubt have been handled; that is an important basis for corpus queries. Depending on the corpus, you can find additional information in the corpus documentation, which in this case explains explicitly what kind of linguistic information can be queried.

Let's go back to the standard query view where we can enter our search terms, and let's start with a simple query: I enter the word form "Freiheit" (freedom) into the search window using the predefined settings of the platform and click "Start query". We get a KWIC view with occurrences of the word "Freiheit". In the middle you see the search term in blue, surrounded by its immediate context. In total, we got 706 hits in 475 texts, and next to this information you can see the total figures for the entire corpus. We can access the corresponding metadata for each hit by clicking on it: the title and author of the article and the publication date. We can also look at the individual text passages where the search term was found and inspect a broader context to get a more complete picture of the text.

You can scroll through the entire hit list to the last page, but you probably don't want to look at all 706 hits one at a time. There is also the possibility to sort the occurrences. If we click on "Sort" and then "Go!", different possibilities for sorting the hit list are shown. We can, for example, sort all the instances according to the first word to the left of the search term, and by clicking "Update sort" the new hit list is displayed. Apart from all the preceding punctuation marks, we get "absolute Freiheit" (absolute freedom) in different German declension forms; the next entries are "similar", "all", "as freedom", and so on. By clicking "New query" we get back to the search window.

Now we want to try a more complex query. By choosing "CQP query syntax", the query mode is switched to complex queries. This query syntax refers to the query language of the Corpus Query Processor. We have already seen that linguistic information is annotated in this corpus, namely lemma and part-of-speech information. That means we know the part-of-speech tag and the base form for each word form, and now we want to query this information. First of all, let me show you how the CQP query language works if we want to search for single tokens. If we want to reproduce the query from before in CQP syntax, we need to use square brackets and type in: [word="Freiheit"]. That means we are looking for a token with the word form "Freiheit". If we use this pattern, we find exactly the same number of hits as before.
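While we are in CQP mode, a few more token-level patterns are worth knowing. This is only a quick sketch of standard CQP operators to try out on your own; the example words are illustrations and not part of the demonstration:

    [word="freiheit" %c]             (the %c flag makes the match case-insensitive)
    [word="Frei.*"]                  (attribute values are regular expressions, so this also finds "Freiheiten", "Freitag", etc.)
    [lemma="Freiheit" & pos="NN"]    (several attribute tests can be combined with Boolean operators)
    [pos="ADJA"] [word="Freiheit"]   (two bracketed tokens in a row match a sequence, here an attributive adjective followed by "Freiheit")

ADJA is the STTS tag for attributive adjectives; the STTS documentation mentioned above lists all the other tags you can use in the pos attribute.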
But we can also search for base forms (lemmas) instead of single word forms. We do so by replacing word with lemma, which results in: [lemma="Freiheit"]. Looking at the results, not only singular forms are displayed but also plural forms ("Freiheiten"), since German nouns change through inflection and the addition of plural morphemes.

As already mentioned, we can access further meta-information in this corpus, like lemmas and part-of-speech annotations. In this corpus there are additional annotations that can be accessed in the metadata. There we see that, alongside these "positional attributes", there are also "structural attributes" associated with word forms. Mountain names are one example, but also the geographical names of lakes or valleys. And this is how you can query such information: if you have selected CQP syntax, you can search for any desired mountain name by stating that you are looking for the level information <mountain>, and the system returns all words that have been annotated as mountain names in the corpus. We can also add to the query that we are looking for multiword mountain names and not only single names, by inserting the end tag that marks the end of a sequence of mountain-name tokens. By this means, we get a list of mountain names and expressions for mountains, provided they were automatically tagged during preprocessing. Again, we can do a so-called "frequency breakdown" to see which mountain names are particularly frequent in this corpus, and we get "Les Alpes", "Mont Blanc", "Jungfrau", "Himalaya", "Tödi" and "Monte Rosa" as the most frequent mountain names.

You can also combine queries and look for adjectives that are immediately followed by a mountain name. If you want to find particularly frequent combinations, you can again do a "frequency breakdown". We get "Zermatter Breithorn", with "Zermatter" (a place in Switzerland) tagged as an adjective, and the following hits: "peruanische Anden", "der große Gendarm", "indischer Himalaya", "nepalesischer Himalaya" and "westlichen Berner Alpen".
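In concrete CQP syntax, the queries just described look roughly like this. This is a sketch following the standard CQP notation for structural attributes; the attribute name mountain is the one shown in this corpus's metadata listing, and other corpora will use different names:

    <mountain> [] </mountain>                  (a mountain name consisting of exactly one token)
    <mountain> []+ </mountain>                 (a complete mountain name of one or more tokens, e.g. "Monte Rosa")
    [pos="ADJA"] <mountain> []+ </mountain>    (an attributive adjective immediately followed by a mountain name)

The empty brackets [] match any token, and the + means "one or more", which is what lets the multiword names through.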
You have learned that you can display your results in a KWIC view and also generate frequency breakdowns. The interface also provides other standard analysis methods that are widely used in corpus linguistics, for example the analysis of collocations. Let's search for the lemma "Berg" (mountain), so that we only look for the base form in the corpus. By clicking "New query" we return to the main menu, where we can select the function "Collocations". There are several options here. We can decide whether we want to use the base form (lemma) for the extraction of the collocations; we can leave aside the token ID; and we can use the part-of-speech information, and I will show you in a minute how that works. We can also define a maximum span, a context window within the text in which we want to look for co-occurring items. The predefined setting is usually ±5 words. By clicking "Create collocation database" a table is created.

What we get is basically a list of all items co-occurring with "Berg", sorted by significance score. In this case, the collocations are based on word forms, so different inflected word forms are counted separately: the different German articles are treated separately, and "highest", "high" and "higher" are all treated as separate word forms. We selected a context window of three words left and right; let's change that to five words left and five words right, switch from word forms to base forms (lemmas), and select "Go!", and our table is updated automatically. This can take a moment, but now we see base forms, i.e. "d" standing for "der/die/das", the German articles. This means the article is considered the most significant co-occurring word. Here we also see the overall frequency of each co-occurring item; the article appears 2 million times in our corpus. We also see the expected frequency, i.e. how often we would expect this item to co-occur with "Berg", and in the next column the observed number of times the collocation was actually seen in the corpus. On the right side, we get the statistical significance score. We would have expected "der Berg" (the mountain) to occur about 14,000 times, but in reality we observed it 39,000 times; that is why this combination is highly significant.
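As an aside, it may help to see the arithmetic behind that expected value. CQPweb offers several association measures, and their details differ, but they share the same basic idea; the following is a sketch of the usual independence model, not necessarily the exact formula the platform applies:

    expected frequency ≈ f(node) × window size × f(collocate) / corpus size

In words: if "Berg" and the article were distributed independently of each other, the article would land in the ±5-word window around "Berg" roughly 14,000 times by chance alone. Observing 39,000 co-occurrences, almost three times the expected value, is the surplus that the significance score quantifies.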
"in" (in), "hoch" (high) and "und" (and) are significant collocations as well. At the bottom of our list, we see the less significant combinations, which are, however, still significant. You can also set a minimum frequency threshold for the collocations, and the list can be filtered by part-of-speech category. You can choose to display only adjectives that appear as significant collocations with "Berg". You will then get these adjectives: "hoch" (high), "umliegend" (surrounding), "heilig" (holy), "heimatlich" (local), "geliebt" (beloved), "zweithoch" (second highest), "weiss" (white), "gewaltig" (enormous) and so on.

I'm going back to the view we had earlier, with all significant collocations. If you are interested in a specific collocation, for example "heimatlich" (local), you can click on its frequency, here "68", and you will get all text references for this collocation. If you want, you can also click on the co-occurring word directly, and you will get further information about where the co-occurring item appears relative to the search term. In this table you see the distance: −5 means that the co-occurring word appears five words before the search term "Berg"; we see words at positions −4 and −3 and, of course, all items that appear after the search term rather than before. In this case, 65 of all hits for this collocation appear exactly one position before the search term, so "heimatlich" is typically placed directly before "Berg". There are three other cases where the item is found at other positions, and after "Berg" there are no hits at all. Such collocation profiles help to evaluate whether a collocation can be considered a fixed expression that tends to appear in the same form, or whether it is rather variable. If you look at the collocation "Berg und Mensch" (mountain and man), the profile shows that the positions are much more variable: the item "Mensch" often appears at a long distance, five words before "Berg", most frequently at the third position before it, but also quite frequently after "Berg". Thus "Berg und Mensch" is not really a fixed expression, but rather a combination whose parts occur together significantly often, even in different syntactic patterns.

I will also show you how to look at the distribution of your results and how to restrict the search to specific subcorpora. Let's search for the lemma "Heimat" (home). We choose CQP syntax again and look at the results we get. This corpus contains not only German texts but also French and Italian texts, and we want to make sure that we are searching only within the German texts. I can select the mode "Restricted query" and insert my query [lemma="Heimat"]. Below, all available metadata is shown, depending on the corpus. It is a bit confusing, but when you scroll all the way down to the end of the page, you can see that there is also information regarding the languages. We can specify that we only want to search those texts that are tagged with the abbreviation "de" for German. You can also select particular years in which to search, in order to restrict the time frame of the query. If we run the query now, we will only get results from the text categories and languages that we selected.
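For reference, standalone CQP can express the same restriction directly in the query as a so-called global constraint, provided the metadata is encoded as attributes of the text regions. The attribute name text_lang below is an assumption on my part; check the corpus metadata listing for the actual names used in your corpus:

    [lemma="Heimat"] :: match.text_lang="de"

Within CQPweb itself, though, the "Restricted query" form with its checkboxes does the same job and is the more convenient route.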
From the results menu, we can then access the distribution, which again depends on the metadata available in the corpus. By default you will see the distribution over all available metadata categories, which does not always make sense; you can choose the metadata category under the menu "Categories". In this corpus, it is particularly interesting to see how the frequencies are spread over the years and whether we can observe diachronic change. That is why we choose "text_year" and "Go!", and we get all results per year that can be found for "Heimat" in the different yearbooks. We also see how many words are in the corresponding category and the relative frequency in relation to the respective subcorpus size. You can also display the results as a clean bar chart, which gives you an overview of the diachronic change in the use of the word "Heimat". In the time span from 1864 to 1867, slightly higher frequencies can be observed, while lower frequencies are visible in the later years. We can also observe some peaks, for example in the 1920s, or in 1943 during the Second World War, where we see a strong surge in the use of the lexeme "Heimat" before its use decreases again. In 1963 we get slightly higher frequencies, while in recent times only rather low frequencies can be observed.

Now you have learned the most important functions. There are other interesting possibilities: you can create subcorpora, for example to extract keywords; you can categorise and store the hits manually; and you can store your queries temporarily and get back to them at a later stage. I suggest you play around with CQPweb to get to know the platform, and if you are interested in additional functions, you will find further instructions in the online manual for corpus linguistics, as well as in the official documentation and in further video tutorials provided by Andrew Hardie, who programmed the interface.

I hope you have seen how powerful this software is and what you can do with it. We could only show a small part of it, and you should try it out for yourself; there are several sources of tutorials and documentation. You have seen how concordance and corpus analysis software works and what the basic functions are, using the Open Corpus Workbench. Further assistance can be found in the online corpus linguistics course at bubenhofer.com/korpuslinguistik. There is a separate chapter on the Corpus Workbench, which also explains how to process and prepare your own text corpora before importing them into the Corpus Workbench.

That brings us to the end of this part of Module 3. Thank you for your attention! I am looking forward to seeing you again in the third part of Module 3.