In today's session, I'm going to talk about term weighting schemes. Term weighting is widely used in information retrieval and in supervised learning tasks such as text classification. Let me tell you why it is popular. The basic assumption is that a word that appears often in a document may be very descriptive of what the document is about. By assigning a weight to each term in a document, one that depends on the number of occurrences of that term in the document, we make the useful terms of a document more discriminative than the other terms. Why can't all terms be treated equally, weighted only by their frequencies in the document collection? Simply because Zipf's law tells us that functional words such as stopwords dominate the term distribution over documents, and thus simple word counting is not very useful. Let tf(t, d) denote the term frequency of term t in document d; it is defined as the number of times that t occurs in document d. Term frequency is widely used in many different applications. Two typical examples are building a document-term matrix and computing query-document match scores in information retrieval. As I mentioned on the previous slide, raw term frequency is not very useful on its own, because any text corpus obeys Zipf's law. For example, a document with 10 occurrences of a term is more relevant than a document with one occurrence of that term, but it is not 10 times more relevant. In information retrieval and text mining, relevance does not increase proportionally with term frequency. In terms of discriminative power, unknown or uncommon terms are often more discriminative, and thus more informative, than frequent terms, as we observe in the case of stopwords. Let's look at this from the information retrieval perspective. Suppose there is a term in a query that is rare in the collection; for example, the query terms "genomics" and "tumor".
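The definition of tf(t, d) above can be sketched in a few lines of Python. This is a minimal sketch: splitting on whitespace stands in for a real tokenizer, and the document is a toy example.

```python
from collections import Counter

def term_frequency(term, doc_tokens):
    """tf(t, d): the number of times term t occurs in document d."""
    return Counter(doc_tokens)[term]

# A toy document, tokenized by whitespace splitting for illustration.
doc = "genomics and tumor research genomics".split()
print(term_frequency("genomics", doc))  # prints 2
print(term_frequency("tumor", doc))     # prints 1
```

Counting every term at once with `Counter(doc_tokens)` is also how one row of a document-term matrix would be built.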
If a document contains these query terms, the document is highly likely to be relevant to the query "genomics and tumor". In this case, what we want is to assign higher weights to rare terms like "genomics" and "tumor" than to other, more common terms. Let's take another example. Suppose there is a query term that appears frequently in the collection, such as "common", "use", or "do". If a document contains such a term, it is more likely to be relevant than a document that doesn't contain it, but unfortunately this is not a reliable indicator of relevance. For commonly appearing terms such as "high", "increase", and "line", we may still want to assign positive weights, but those weights must be lower than the weights of rare terms. Document frequency comes into play to capture this in a scoring function. df(t), the document frequency of term t, is the number of documents that contain term t; df measures the informativeness of term t. The conventional definition of idf, which stands for the inverse document frequency of term t, is as follows: idf(t) = log(N / df(t)), where N is the number of documents in the collection. The base of the log is immaterial here. Why? Because regardless of the base of the log, natural, binary, or anything else, the ranking of terms does not change. As I mentioned earlier, we also want to reduce the effect of multiple occurrences of a term. Say a document is about Clinton; then the term "Clinton" will occur many times in that document, so it is better to use the log of the frequency than the raw frequency. Likewise, we use log(N / df(t)) instead of N / df(t) to dampen the effect of idf. What happens if we don't dampen it? The raw ratio N / df(t) still ranks terms in the exact reverse order of their document frequency, but rare terms receive disproportionately large weights; the log dampens this effect. Let me describe TF-IDF in plainer language again.
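The df and idf definitions above can be sketched as follows. This is a minimal sketch under two assumptions: each document is represented as the set of its terms, and df(t) > 0 for every term we score.

```python
import math

def idf(term, documents):
    """idf(t) = ln(N / df(t)), where N is the collection size and
    df(t) is the number of documents containing the term.
    The natural log is used here; the base only rescales the
    weights, it never changes the ranking of terms."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n / df)  # assumes df > 0 for the given term

# Four toy documents, each given as a set of terms.
docs = [{"genomics", "tumor"},
        {"common", "use"},
        {"common", "tumor"},
        {"common", "do"}]
print(idf("genomics", docs))  # rare term: ln(4/1) ~ 1.386
print(idf("common", docs))    # frequent term: ln(4/3) ~ 0.288
```

As the printout shows, the rare term "genomics" receives a much higher idf weight than the common term "common", which is exactly the behavior we wanted.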
TF-IDF assigns a value to each word in a document based on the frequency of the term in that particular document, inversely weighted by the percentage of documents in which the word appears. Words with high TF-IDF values imply a strong relationship with the document they appear in. From the information retrieval perspective, if such a word appears in a query, the document could be of interest to the user. Let me recap the tf-idf weight mathematically. The tf-idf weight of a term is the product of its tf weight and its idf weight, as the formula tf-idf(t, d) = tf(t, d) × idf(t) shows. It is one of the best weighting schemes in information retrieval. Note that the dash in this case is not a minus sign; it is a hyphen. Alternative notations for tf-idf are tf.idf and tf × idf. The tf-idf value increases with the number of occurrences of a term within a document, and it also increases with the rarity of the term across the collection. A TF-IDF-based IR system first builds an inverted index with the tf and idf values of terms. TF-IDF offers superior precision and recall compared to other weighting schemes, and it is treated as the de facto baseline for comparing IR performance. TF-IDF is the dominant weighting scheme in IR systems and is also widely used in text mining applications.
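Putting the two pieces together, the tf-idf weight is just the product tf(t, d) × idf(t). The sketch below combines the counting and the log-ratio from the earlier definitions; it is a minimal illustration with toy data, and production systems usually also log-dampen tf and smooth idf to avoid division by zero.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, documents):
    """tf-idf(t, d) = tf(t, d) * idf(t), with idf(t) = ln(N / df(t))."""
    tf = Counter(doc_tokens)[term]
    n = len(documents)
    df = sum(1 for d in documents if term in d)
    return tf * math.log(n / df)  # assumes df > 0 for the given term

# Three toy documents (as term sets) and one tokenized document to score.
docs = [{"genomics", "tumor", "study"},
        {"common", "use", "study"},
        {"common", "study"}]
query_doc = ["genomics", "study", "genomics"]
# The rare term "genomics" outweighs the ubiquitous term "study".
print(tf_idf("genomics", query_doc, docs))  # 2 * ln(3/1) ~ 2.197
print(tf_idf("study", query_doc, docs))     # 1 * ln(3/3) = 0.0
```

Note how "study", which occurs in every document, gets a weight of exactly zero: a term that appears everywhere carries no discriminative information, which is the stopword behavior discussed above.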