By and large, there are three levels of text representation. The first is the lexical level, which includes character-, word-, phrase-, and part-of-speech-level analysis. We'll focus on lexical-level text representation in this course. The second is the syntactic level. It includes the vector space model, which results from transforming raw text into preprocessed text in the form of a term-document matrix, as well as language models and full parsing. A language model is a probability distribution over sequences of words; full parsing represents a sentence by a parse tree. The third, semantic level includes collaborative tagging, frames, and ontologies.

Character-level representation is the very first level of text representation. Characters are divided into letters and delimiters, so a sequence of letters between delimiters makes a token. Characters play an important role in part-of-speech tagging and named entity recognition. In named entity recognition, a capital versus lowercase letter makes a difference in predicting the entity type of a term. Application areas of character-level representation include spell checking, part-of-speech tagging, word segmentation, named entity recognition, and so on.

The second level of text representation is the word level. It is the most common representation of text, used in many different text mining techniques. From the linguistic perspective, a word is the smallest element that can be uttered in isolation with semantic or pragmatic content. In English, words are separated by white space. Almost every text mining software package includes tokenization, which splits text into words. One thing to be aware of is that the word is a well-defined unit in Western languages, whereas Chinese has a different notion of the semantic unit, one that cannot be identified by delimiters alone. This is the so-called word segmentation problem.
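As a minimal sketch of the character-level idea above, the following hypothetical `tokenize` helper treats maximal runs of letters between delimiters as tokens. It is only an illustration under that simple letter/delimiter assumption; real tokenizers handle numbers, punctuation, and Unicode far more carefully.

```python
import re

def tokenize(text):
    """Toy character-level tokenizer: a token is a maximal run of
    letters; everything else (spaces, punctuation) is a delimiter."""
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("Text mining, level by level!"))
# → ['Text', 'mining', 'level', 'by', 'level']
```

Note that a rule this simple already fails for Chinese, where tokens are not bounded by delimiters, which is exactly the word segmentation problem mentioned above.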
Now, let's define some important notions regarding words. A word is a delimited string of characters as it appears in the text. A term is a word normalized with respect to case, morphology, spelling, and so on; it is an equivalence class of words. A token is an instance of a word or term occurring in a document. A type is, in most cases, treated the same as a term; it is an equivalence class of tokens.

The next level of text representation is the phrase, or N-gram. In a linguistic sense, a phrase is a group of words, or possibly a single word, that functions as a constituent in the syntax of a sentence. Text chunking is an intermediate step toward full parsing, which identifies phrase groups. Let me take some examples. A noun phrase is a segment that can be the subject or object of a verb. A verb phrase is a segment that contains a verb with any associated modals, auxiliaries, and modifiers. Regarding N-grams, N stands for how many consecutive words are used: a unigram is one word, a bigram is two consecutive words, and a trigram is three consecutive words. N-grams are used to help determine the context in which some linguistic phenomenon happens.

Let's move on to preprocessing. The most important preprocessing technique is normalization. Three major techniques belong to normalization: tokenization, lemmatization, and stemming. Other important preprocessing techniques include stop word removal and part-of-speech tagging. Named entity recognition is also useful for identifying and keeping meaningful units of text, and shallow parsing, such as text chunking, is helpful in the preprocessing stage as well. Normalization improves the quality of text mining techniques as well as information retrieval. In information retrieval, normalization puts the terms in the indexed text, as well as the query terms, into the same form. For a given text mining task, we define equivalence classes of terms in a generic sense; for example, we match U.S.A. and USA.
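The two ideas above can be sketched in a few lines: extracting N-grams from a token sequence, and a toy normalization rule under which U.S.A. and USA fall into the same equivalence class. Both helper names are illustrative, not from any particular toolkit, and real normalization involves many more rules than case-folding and period removal.

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def normalize(word):
    """Toy term normalization: case-fold and drop periods, so that
    'U.S.A.' and 'USA' map to the same term."""
    return word.replace(".", "").lower()

tokens = ["text", "mining", "is", "fun"]
print(ngrams(tokens, 2))
# → [('text', 'mining'), ('mining', 'is'), ('is', 'fun')]
print(normalize("U.S.A.") == normalize("USA"))
# → True
```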
In addition, as an alternative to normalization, we can do asymmetric expansion, a form of de-normalization. De-normalization is used for query expansion in information retrieval; for example, the word window can be mapped into window and windows. This de-normalization is conducted with a set of expansion rules. It is a powerful approach, but less efficient, since it requires rather lengthy rules.

Tokenization is a preprocessing step in which the input text is automatically divided into units called tokens. Tokenization is an important step for any text mining technique, and one of the most comprehensive sets of tokenizers is implemented in the Apache Lucene toolkit. There are several issues with tokenization. One issue is how to detect sentence boundaries. Do punctuation marks such as quotation marks around a sentence indicate a sentence boundary? Does a period mark the end of a sentence? To tackle this issue, several sentence boundary detection techniques have been proposed. Another issue is proper names: for example, how do we tokenize "California Governor Arnold Schwarzenegger"? Do we split on single white spaces, or treat the proper noun as one word? This is not an easy task.

From the computational linguistic perspective, lemmatization is the algorithmic process of determining the lemma for a given word. The lemmatization process involves complex tasks such as understanding context and determining the part of speech of a word in a sentence. The goal of lemmatization is to properly reduce words to their dictionary headword form. It involves two types of morphology. The first is inflectional morphology: for example, the word cutting is reduced to the word cut. The second is derivational morphology: for example, the word destruction is mapped onto the word destroy. Other examples: words like am, are, and is are reduced to the word be, and words like car, cars, car's, and cars' are mapped to the word car.
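The examples above can be illustrated with a dictionary-lookup sketch. This is only a toy: the lemma table below is hypothetical and hand-built from the examples in the text, whereas real lemmatizers (for instance, WordNet-based ones) combine full dictionaries with part-of-speech and context information.

```python
# Toy lemma table built from the examples above; a real lemmatizer
# would use a full morphological dictionary plus POS tagging.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",          # inflectional
    "cutting": "cut",                              # inflectional
    "car's": "car", "cars'": "car", "cars": "car", # inflectional
    "destruction": "destroy",                      # derivational
}

def lemmatize(token):
    """Look up the lemma; fall back to the token itself if unknown."""
    t = token.lower()
    return LEMMAS.get(t, t)

print([lemmatize(w) for w in ["am", "is", "cutting", "cars'", "destruction"]])
# → ['be', 'be', 'cut', 'car', 'destroy']
```

A lookup table like this cannot disambiguate context-dependent cases (e.g. "saw" as a noun versus the past tense of "see"), which is why lemmatization needs part-of-speech information in practice.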