Hello and welcome to the first part of module 2: Structured and sustainable representation of corpus data. Today we are mainly concerned with XML standards for text representation. First, I will give an introduction to XML and discuss the sustainable representation that can be achieved with XML documents. In the second part, I will present the TEI P5 standard for the encoding of natural language texts and its most important elements.

What does sustainable digital representation mean? My personal credo can be summed up in three points. First, applications, web applications and databases are never sustainable, but data can be. Second, binary data is never sustainable; text-based data formats, especially openly specified ones based on XML, can be. Third, digital sustainability requires universal conventions for cultural signs, and the Unicode character encoding and the XML data format are a good basis for that.

What do I mean by "signs"? We all know characters through their visual form, the so-called glyphs: graphic representations of whole characters or of their components, meant for the human eye. But there are also characters for the fingertips, as in Braille, and characters made of audio signals, as in Morse code. These are all representations of signs designed for people. For the machine, signs are essentially character codes, which means that there must be a convention by which these culturally defined signs are mapped to numbers in the machine.

Such character code tables, or just code tables, exist in different forms. The most common standard is ASCII. Here we see a code table covering ASCII, ISO-8859 and a small part of Unicode. In this table, the code for the capital letter "A" is hexadecimal 41, which is 65 in decimal. In this way, a numeric code can be defined for every character.
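To make the mapping in the code table concrete, here is a minimal sketch in Python (chosen only for illustration; it is not part of the lecture). It looks up the numeric code of the capital letter "A" and shows it in decimal and hexadecimal. The helper name `code_of` is an assumption of this sketch.

```python
def code_of(char):
    """Return the character code of one character as (decimal, hexadecimal)."""
    number = ord(char)            # the number the machine uses for this sign
    return number, format(number, "X")


decimal, hexadecimal = code_of("A")
print(decimal, hexadecimal)       # 65 41

# The mapping also works in the other direction: from the number back to the sign.
print(chr(0x41))                  # A
```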
The ASCII standard is relatively old and also limited: it covers only the characters needed for English and provides just 128 numeric codes. For different languages and operating systems, encodings were developed that work with 8 bits, that is 1 byte, corresponding to 256 possible codes, but these systems are not satisfactory from a universal perspective. Only Unicode, which corresponds to an ISO standard, attempts to assign a unique numeric code to every character of every human writing system, current and past. The current Unicode standard, version 7.0, defines more than 113,000 graphic characters or parts of characters and supports approximately 123 writing systems. With Unicode it is possible to create character code tables for Egyptian hieroglyphs, for example, as shown on the current slide.

It is important not to confuse Unicode with a specific storage format for texts. An important storage format is UTF-8, which belongs to a whole family of storage formats. These formats map Unicode numbers onto the physical bytes that are stored on the computer. UTF-8 is compact and has become very popular: more than 84% of all websites are currently encoded in UTF-8. One advantage of UTF-8 is that every ASCII file is also a valid UTF-8 file, which gives strong backward compatibility.

In the table below you see the difference between Unicode and UTF-8. The column "UNICODE binary" shows the binary representation of the numeric codes, for example for the letters "y" or "ä" (German). The column "UTF-8 binary" shows that a character which already exists in the ASCII table needs only 1 byte. If it does not exist in the ASCII table, 2 or even 3 bytes are needed; the "€" sign, for example, is encoded as a 3-byte sequence.

Recently, I found a very interesting word. Wait a minute, [Rustle] I want to show you. It should be here ...
Where was it again ... Maybe it is here ... Recently, I found an interesting word in the 1904 yearbook of the Text+Berg corpus. It was in article 4, I am going to show you ... Here we have the yearbook 1904. We don't need the first article, nor the second or the third ... and here we have the fourth article, precisely. Now, let's see: article 4, and it was in the first sentence. I remember it was the word at position 4. And now, yes, I remember ... The word was "Lichtbildner". That is an old German expression for "photographer" that I didn't know before.

What do we mean by structured representation? In XML, structuring essentially means nesting: XML is a hierarchical data model. Let's imagine that a document is a box containing smaller boxes with texts, and that these texts contain even smaller boxes for each section and each paragraph. Each paragraph contains boxes with sentences, each sentence contains boxes with words, and the words are, finally, what we usually call the text data that makes up our corpus. In XML each of these boxes is normally represented by two tags, an opening and a closing tag. If we reduce each box to these two tags, the hierarchical data model naturally yields a serialisation as text.

Let's take a closer look at the basic XML constructs. Typically, an XML file starts with an XML declaration in which we explicitly state the encoding of the file. This declaration is optional; if it is missing, the file is assumed to be encoded in UTF-8. Elements are introduced by so-called start tags and closed by end tags, and it is important to nest these elements properly. There are also elements without any content, that is, without any text data or nested elements inside, the so-called "empty tags". These empty tags need to be marked with a special notation, namely a slash right before the right angle bracket. This should not be confused with a normal end tag, where the slash follows the left angle bracket.
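The box model and the basic constructs described above can be sketched with the XML parser from Python's standard library. The element names (`book`, `article`, `sentence`, `w`, `pb`) and the sample sentence are invented for this illustration; they are not the actual Text+Berg markup or text.

```python
import xml.etree.ElementTree as ET

# XML declaration, nested start/end tags, and one empty tag (<pb/>).
# The element names and the sentences are invented for this sketch.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8"?>
<book>
  <article><sentence><w>Eins</w></sentence></article>
  <article><sentence><w>Zwei</w></sentence></article>
  <article><sentence><w>Drei</w></sentence></article>
  <article>
    <pb/>
    <sentence>
      <w>Der</w><w>Verfasser</w><w>ist</w><w>Lichtbildner</w>
    </sentence>
  </article>
</book>
"""

# Parsing bytes lets the parser honour the declared encoding.
root = ET.fromstring(DOCUMENT.encode("utf-8"))


def word_at(book, article_no, sentence_no, word_no):
    """Open the boxes one by one; all positions count from 1."""
    article = book.findall("article")[article_no - 1]
    sentence = article.findall("sentence")[sentence_no - 1]
    return sentence.findall("w")[word_no - 1].text


print(word_at(root, 4, 1, 4))  # Lichtbildner
```

With such nesting, "article 4, first sentence, word at position 4" becomes an address that a program can follow directly instead of leafing through the book.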
Inside the start tag of an element, you can attach labels, the so-called attributes, and set a value for each of them. Each attribute can appear only once within one element. Text data appears between the tags, and each letter there stands for the corresponding character of the text. Comments are ignored by the XML processor. An important rule when working with XML: there is only one root element, one outer box.

XML is a markup language, and everything that does not count as text data counts as markup. In green you see five characters that count as markup. These characters have a special meaning because they delimit the markup. If we want to use these five characters literally, we cannot write them directly. We either use a numeric character reference, which is also available for all other characters without a special function, or one of the so-called predefined entities. XML has five predefined entity references: &lt;, &gt;, &amp;, &apos; and &quot;.

XML has some standard attributes that are defined in the XML namespace. An important one is the language attribute, xml:lang. This attribute allows us to state the (natural) language that is used within an element. The second important standard attribute, xml:id, holds so-called "identifiers" or "IDs". These are unique names that can be assigned to elements. Such identifiers have to start with a letter, not with a digit. After that, they can contain any alphanumeric characters as well as underscores, periods or hyphens. If you assign such unique names, you can refer to them in the attributes of other elements. These so-called ID references make it possible not only to model a nested tree structure in XML, but also to handle data structures of any complexity.

This leads us to an important distinction in how annotations are made in XML. So far, we have seen what is sometimes called "inline annotation". It can be used in a simple model to mark person names or toponyms in a text.
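As a minimal sketch of inline annotation, the following Python snippet marks one "London" as part of a person name and another as a place name. The element names `s`, `persName` and `placeName`, the `xml:lang` value and the sample sentence are assumptions chosen for illustration; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# ElementTree spells names from the XML namespace (xml:lang, xml:id) like this.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Inline annotation: the markup sits directly around the words it describes.
# Element names and the sentence are invented for this sketch.
SENTENCE = (
    '<s xml:lang="en">'
    "<persName>Jack London</persName> visited "
    "<placeName>London</placeName>."
    "</s>"
)

s = ET.fromstring(SENTENCE)

print(s.get(XML_NS + "lang"))                  # en
print([e.text for e in s.iter("persName")])    # ['Jack London']
print([e.text for e in s.iter("placeName")])   # ['London']

# Stripping the markup gives back the plain text data.
print("".join(s.itertext()))                   # Jack London visited London.
```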
In this example, we want to distinguish whether "London" refers to the person, as in "Jack London", or to the geographical name, namely the city of London. This can be expressed by appropriate inline markup. A more flexible solution for named entity tagging is provided by so-called "stand-off annotations", which were used in the Text+Berg corpus, for instance. We decided to encode both geographical mentions and person names as references to elements. Each word has a unique identifier, and the stand-off annotation can be stored in the same document or in a separate file. If we record that "Val Suvretta" (a valley in Switzerland) is a valley, and list the ID references of the corresponding words in the span attribute, the link between the annotation and the words in the text is established.

Let's summarise: XML, the eXtensible Markup Language, is a standard for adding markup to text data. The XML standard specifies how well-formed XML documents must be structured; documents that are not well-formed are not XML documents at all. In addition, the nested structure and the allowed attributes together with their values can be described in detail in a so-called XML schema language. An XML document can then be validated against this specification.

On the slides you see a graphical visualisation: a text file is a well-formed XML file only if it is nested properly, if there is exactly one root element, if each attribute appears only once per element, and if each attribute value is enclosed in quotation marks. Another important point for well-formedness is that each identifier is assigned only once throughout the entire document. That's crucial for the well-formedness of XML. Only then is a text file that looks somewhat like XML a real, well-formed and good XML document.
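The nesting, root and attribute rules just listed can be tried out with a parser. In this minimal sketch, the XML parser from Python's standard library either accepts a text or rejects it. The helper name `is_well_formed` and the sample snippets are assumptions of the sketch. Note that this parser checks nesting, the single root element and the attribute rules, but it does not check whether identifiers are unique.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Invented snippets, one per rule mentioned above.
samples = {
    "properly nested": "<text><p><s>Hello</s></p></text>",
    "badly nested": "<text><p><s>Hello</p></s></text>",
    "two root elements": "<text/><text/>",
    "attribute used twice": '<w pos="NN" pos="NE">London</w>',
    "value without quotes": "<w pos=NN>London</w>",
}

for label, xml_text in samples.items():
    print(label, "->", is_well_formed(xml_text))
```

Only the first snippet is accepted; each of the others breaks exactly one of the rules.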
In addition, XML documents can be valid if there are specific nesting rules formulated in a machine-readable schema language. In older standards, DTDs were very popular. Nowadays, XML Schema (XSD) or RELAX NG schemas are widely used. Such schema languages contain rules about which contents an element may have, how often and in which order. XML Schema and RELAX NG are more powerful than DTDs, because they allow very precise rules about the values permitted in obligatory or optional attributes.

To summarise: in natural language processing, XML is a central text-based standard for nesting texts together with their meta information and for storing them in a structured, program-independent way. Next, we will focus on the Text Encoding Initiative (TEI), which provides specific standards for lexicons, text corpora and digital editions as XML schemas. There are also several other XML standards that cover the needs of natural language processing. Because text-based data is standardised in XML documents, there is a wide range of tools and programming interfaces to read, create, modify and visualise such XML files.

Let's move on to the topic of TEI P5. TEI P5 is a set of guidelines for the encoding of texts. It is a mature standard, first released in 2007, and it is applied in different scenarios to represent various natural language documents together with their meta information and text annotations. TEI P5 can be used in two different ways. First, as a supermarket with predefined boxes, that is, elements, labels and attributes. You can use the full TEI or the slimmed-down versions TEI Lite or TEI Tite, which offer a uniform representation for particular types of content. Second, TEI can be seen as a DIY store where end users can configure, design and document their own nested box system that conforms to the TEI guidelines. The documentation and the design can be stored in so-called ODD documents ("One Document Does it all").
There is a web interface where you can easily combine these files with a few clicks and use the various modules provided by TEI P5. What needs to be done when writing a so-called ODD specification? On the one hand, you specify which of the TEI modules you want to use. On the other hand, the content of these modules can be modified: existing elements and attributes can be altered and new ones can be created. This specification needs to be documented at the same time. When all this is done, you can turn the ODD into different automatically generated XML schema descriptions that can be used to validate XML documents.

Let's take a look at the basic structure of TEI corpora. Here you see a model that allows you to insert different texts into a corpus. The entire corpus has a header for the meta information and one TEI element for each sub-corpus. In each of these sub-corpora, a header and the text body can be specified for the text in question.

What does a header look like? Here is an example from the Text+Berg corpus, yearbook 1904. You can find information about the type of book and about the book's title, specified in the title statement. There is information about who published this corpus electronically and about the sources these texts originally come from.

In the Text+Berg corpus we have a very simple and purely structural text model. That means that there is no information about the layout. As already mentioned, this is a hierarchical model: the text consists of a text body, which in turn consists of paragraphs. These paragraphs contain sentences, and the sentences contain words. We will say more about the meaning of the attributes "lemma" and "pos" (for part of speech) in the course of this MOOC, and we will also discuss the automatic recognition and annotation of such categories.

This brings us to the conclusion: we saw that universal character encoding is a crucial point for sustainable data formats.
On this basis, we can store text-based XML documents as well as annotated corpora and their metadata. XML schema languages allow an exact specification and automatic validation of the stored data. The TEI P5 standard provides a wide range of predesigned modules for the structured representation of corpus data. It can be seen as a kind of logistics supermarket, providing different boxes and description templates. On the other hand, TEI P5 can also be seen as a DIY store that allows anyone to assemble their own flexible nested box system. ODD specifications can be used to generate XML schemas for validation automatically.

Thank you for your attention. I would like to recommend the background text on these topics, where the concepts presented in this module are explained and illustrated in more detail.