It is known that common words do not convey much value. In particular, from a non-linguistic, practical point of view, common words do not carry any useful information, and they do not help select documents matching a user's need. Rather, they take up memory and storage when computing word or document vectors. The general application of a stopword list is to remove a tokenized word if it matches one of the stopwords in the list; a short code sketch of this appears at the end of this discussion. Since the list of stopwords is pre-selected, it is language dependent. For example, in English, there are about 500 or so general stopwords such as a, the, about, which, and so on.

Stemming is a process of transforming a word into its normalized form. It has been widely used in information retrieval tools. For grammatical reasons, documents use different forms of a word, such as organise, organises, and organising. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. Since different languages have different linguistic rules, stemming algorithms are language dependent. A popular stemming algorithm for English is the Porter algorithm, and we provide it in our yTextMiner package. As I mentioned earlier, for English, the Porter stemmer is the most widely used one. More information about the stemmer is available at www.tartarus.org/~martin/PorterStemmer/. The Porter stemmer was originally written in C and has since been ported to many different programming languages. The algorithm goes through six steps, from step one, which gets rid of plurals and the -ed and -ing suffixes at the end, all the way down to step six, which removes a final -e at the end. With these six steps, an original word is stemmed. Let me take one example. If a word ends with ational, A-T-I-O-N-A-L, like relational, then after the stemmer is applied, the suffix changes to A-T-E. In this case, it becomes relate.

There are many different approaches to stemming, including brute-force lookup, affix stripping, and statistical algorithms like n-gram models and hidden Markov models. The Porter algorithm uses suffix stripping, and it does not address prefixes. The Porter algorithm, as the name tells us, was proposed by Martin Porter in 1980. It is still the default go-to stemmer. There is a good trade-off between speed, readability, and accuracy in using the Porter stemmer. The way it works is that it stems words using a set of rules, or transformations, applied in a succession of steps. There are about 60 rules in the six steps.

It is a controversial issue whether stemming improves the quality of text mining results and the effectiveness of information retrieval. In general, the evidence is mixed. For text mining, it significantly reduces the size of the bag of words, together with stopword removal. But, due to the lack of consideration of syntax, the context information attached to the original word is eliminated by stemming. For information retrieval, stemming increases effectiveness for some queries. However, in other cases, it decreases effectiveness. Words starting with O-P-E-R, such as operator, operating, operation, and operatives, would be stemmed to operate or oper depending on the stemming algorithm used, as the sketches below show.
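To make the stopword removal step concrete, here is a minimal sketch in Python. The course package is yTextMiner, but since its API is not shown here, the sketch uses NLTK instead; the tokenizer, the English stopword list (which, at roughly 180 words, is smaller than the 500-word lists mentioned above), and the example sentence are all illustrative assumptions.

```python
# Minimal stopword-removal sketch (illustrative; uses NLTK rather than yTextMiner).
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer model
nltk.download("stopwords")  # pre-selected, language-dependent stopword lists

text = "The results of the study do not carry any useful information about the query."
tokens = word_tokenize(text.lower())

# Remove a tokenized word if it matches one of the stopwords in the list.
stops = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t.isalpha() and t not in stops]

print(content_tokens)  # ['results', 'study', 'carry', 'useful', 'information', 'query']
```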
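And here is a hedged sketch of Porter stemming itself, again using NLTK's implementation rather than yTextMiner's; the word list is my own illustration. Note that the full algorithm keeps applying rules after the ATIONAL-to-ATE rule fires, so in NLTK the final stem of relational comes out slightly shorter than relate.

```python
# Porter stemming sketch (illustrative; NLTK's PorterStemmer, not yTextMiner's).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["organise", "organises", "organising",               # inflectional variants
         "relational",                                        # ATIONAL -> ATE rule
         "operator", "operating", "operation", "operatives"]  # the O-P-E-R family

for w in words:
    print(f"{w:12s} -> {stemmer.stem(w)}")

# The ATIONAL -> ATE rule rewrites relational to relate; a later step may also
# strip the final e, so the printed stem can be the shorter form 'relat'.
# The oper- family is typically conflated to truncated stems like 'oper'/'operat'.
```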
Let me move on to part-of-speech tagging. Part-of-speech tagging is an automatic technique for assigning one of the parts of speech to a given word. It is commonly referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and other sub-categories.

POS tagging is used as a basic element of other text mining techniques. For example, POS tagging makes dependency parsing easier and more accurate. Tagging works better when the grammar and spelling of the given text are correct. POS tagging annotates each word in a sentence with a part-of-speech marker. For example, take the sentence, John saw the saw and decided to take it to the table. Given this sentence, POS tags are assigned to each word by a POS tagger: NNP for John, which is a proper noun; VBD for the verb saw; and so on. The POS tag for the same word saw, after the determiner the, is noun, NN. The first code sketch at the end of this section runs a tagger over this sentence.

There are many different POS training corpora. For English POS tagsets, the Brown Corpus was used first, with a large set of 87 POS tags. After the Brown Corpus, the most common tagset in NLP today is the Penn Treebank set of 45 tags. The Penn Treebank set was reduced from the Brown set for use in the context of a parsed corpus. Another common tagset is the C5 tagset, which consists of 61 tags and is based on the British National Corpus.

Now let's talk about the utility of POS tagging. As I mentioned earlier, POS tagging is useful as a preprocessing step for parsing sentences. In transforming a raw sentence into a parse tree, the unique tag assigned to each word by POS tagging reduces the number of parses. In addition, it is useful for many other text mining tasks such as information retrieval, text-to-speech conversion, and word sense disambiguation. Word sense disambiguation is a technique to identify which sense, or which meaning, of a word is used in a sentence when the word has more than one meaning. For example, the word G-O, go, can be the verb go or the acronym of gene ontology. POS tagging helps disambiguate the meaning of the word.

POS tagging has to deal with the ambiguity problem. For example, the word like can be either a verb or a preposition. In the sentence, I like candy, the word like is a verb. In the sentence, time flies like an arrow, the word like is a preposition; the second sketch below illustrates this. The number of possible tags for a word can be even larger. As the example in the slide shows, the word around can be a preposition, a particle, or an adverb. Which POS tag is assigned to the word is largely dependent on which POS tagging algorithm is used and which corpus is used to train the algorithm.

By and large, there are two major approaches to POS tagging. The first one is rule-based. Rule-based POS tagging is the oldest approach; it uses hand-written rules for tagging. Rule-based taggers depend on a dictionary, or lexicon, to get the possible tags for each word to be tagged. Hand-written rules are used to identify the correct tag when a word has more than one possible tag. The second one is learning-based. A learning-based POS tagger is trained on human-annotated corpora like the Penn Treebank. Learning is based on either a statistical model or a rule-based model. Statistical-model-based learning includes the hidden Markov model, the maximum entropy Markov model, the conditional random field, and so on and so forth. Rule-based learning includes transformation-based learning, which is also called TBL. Generally speaking, learning-based approaches have been found to be more effective overall compared to rule-based algorithms.
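As a concrete illustration of POS tagging, the sketch below runs NLTK's default tagger (a perceptron model that emits Penn Treebank tags) over the example sentence from the lecture. NLTK stands in for whatever tagger yTextMiner wraps; exact tags can vary with the tagger and its training corpus, as noted above.

```python
# POS-tagging sketch (illustrative; NLTK's default tagger, Penn Treebank tagset).
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "John saw the saw and decided to take it to the table."
tokens = nltk.word_tokenize(sentence)

# Each token is annotated with a part-of-speech marker, e.g. John/NNP
# (proper noun), the first saw/VBD (past-tense verb), the second saw/NN (noun).
print(nltk.pos_tag(tokens))
```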
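The ambiguity of like can be checked the same way. The sketch below also uses nltk.help.upenn_tagset to print the definition of a tag from the 45-tag Penn Treebank set. The two sentences come from the lecture; whether the tagger actually resolves like to a preposition in the second sentence depends on its training data.

```python
# Tag-ambiguity sketch: the same word can receive different POS tags in context.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("tagsets")  # needed for the tag definitions below

for s in ["I like candy.", "Time flies like an arrow."]:
    # 'like' should surface as VBP (verb) in the first sentence and,
    # ideally, IN (preposition) in the second.
    print(nltk.pos_tag(nltk.word_tokenize(s)))

# Look up what a Penn Treebank tag means.
nltk.help.upenn_tagset("IN")  # preposition or subordinating conjunction
```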