Next, we'll look at a data model that has been successfully used to retrieve data from large collections of text and images. Let's stay with text for now. Text is often thought of as unstructured data, primarily because it doesn't really have attributes and relationships. Instead, it is a sequence of strings punctuated by line and paragraph breaks. So one has to think of a different way to find and analyze text data. In this video, we'll describe how finding text in a huge collection of text data is a little different from the data models we have seen so far. To find text, we not only need the text data itself, but we need a different structure that is computed from the text data. To create this structure, we'll introduce the notion of the document vector model, which we'll call a vector model for short. Further, you will see that finding a document is not really an exact search problem. Here, we'll give a query document and ask the system to find all documents that are similar to it. After this video, you'll be able to describe how the similarity is computed and how it is used to search documents. Finally, you will see that search engines use some form of vector models and similarity search to locate text data, and that the same principle can be used for finding similar images.

Let us describe the concept of a document vector through an example. Let's consider the three documents shown here. Now we'll create a matrix. The rows of the matrix stand for the documents and the columns represent the words in the documents. We put the number of occurrences of each term in each document in the appropriate cell of the matrix. In this case, the count of each term in each document happens to be one. This is called the term frequency matrix, which we'll call tf for short.

Now that we have created the term frequency matrix, we'll create a new vector called the inverse document frequency for each term. We'll explain why we need this vector on the next slide. First, let's see how it's computed. The number of documents n here is 3. The term new occurs twice in the collection. So the inverse document frequency, or IDF, of the term new is log to the base 2 of n divided by the term count, that is, log to the base 2 of 3 divided by 2, which is 0.584. We show the IDF scores for all six terms here.

Now some of you may wonder why we use log to the base 2 instead of, let's say, log to the base 10. There is no deep scientific reason for it. It's more of a convention in many areas of computer science, where many important numbers are powers of two. In reality, log to the base 2 of x is the same number as log to the base 10 of x times log to the base 2 of 10. The second number, log to the base 2 of 10, is a constant. So the relative score of IDF does not change regardless of the base we use.

Now let's understand this number one more time. The document frequency of a term is the count of that term in the whole collection divided by the number of documents. Here, we take the inverse of the document frequency, so that n, the number of documents, is in the numerator.

Before we continue, let's understand the intuition behind the IDF vector. Suppose you have 100 random newspaper articles, and let's say 10 of them cover elections, which means all the others cover other subjects. Now in these articles, let's say we find the term election 50 times in total. How often do you think you'll find the term "is", as in the verb is? You can imagine that it will occur in all hundred of them, and that too, multiple times.
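To make this construction concrete, here is a minimal Python sketch of the term frequency matrix and the IDF vector. The three document titles are not spelled out in the transcript, so the ones used below are an assumption chosen for illustration; they reproduce the numbers quoted in the lecture, for example idf of new = log base 2 of 3/2, roughly 0.584.

```python
import math

# Assumed toy documents (not given explicitly in the transcript); chosen so
# that, e.g., idf("new") = log2(3/2), matching the lecture's 0.584.
docs = ["new york times", "new york post", "los angeles times"]

# Term frequency matrix: one row per document, one column per vocabulary term.
vocab = sorted({w for d in docs for w in d.split()})
tf = [[d.split().count(term) for term in vocab] for d in docs]

# Inverse document frequency with log base 2, as in the lecture:
# idf(term) = log2(N / number of documents containing the term).
N = len(docs)
idf = {term: math.log2(N / sum(1 for d in docs if term in d.split()))
       for term in vocab}

print(vocab)                 # ['angeles', 'los', 'new', 'post', 'times', 'york']
print(tf)                    # every count here happens to be 1
print(round(idf["new"], 3))  # 0.585 (quoted as 0.584 in the lecture)
```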
We can safely assume that the number of occurrences of is will be 300 at the very least. Thus the document frequency of is is six times the document frequency of election. But that doesn't sound right, does it? Is is such a common word that its prevalence has a negative impact on its informativeness. So now, if we compute the IDF of is and election, the IDF of is will be far lower. IDF acts like a penalty factor for terms which are too widely used to be considered informative.

Now that we have understood that IDF is a penalty factor, we will multiply the tf numbers by the IDF numbers, giving us this. This is a column-wise multiplication of the tf numbers with the IDF numbers, giving us what we call the tf-idf matrix. Therefore, for each document we have a vector, represented here as a row. That row represents the relative importance of each term in the vocabulary, where the vocabulary means the collection of all words that appear in this collection. If the vocabulary has 3 million entries, then this vector can get quite long. Also, if the number of documents grows, let's just say to 1 billion, then it becomes a big data problem. The last column after each document vector here is the length of the document vector, which is really the square root of the sum of squares of the individual term scores, as shown in the formula.

To perform a search in the vector space, we write a query just like we type terms in Google. Here, the number of terms is three, out of which the term new appears two times. In fact, this is the maximum frequency out of all terms in the query. So we form the query vector by taking the IDF of each query term and multiplying it by the number of occurrences divided by two, which is the maximum term frequency. In this case, that gives us two non-zero terms: 0.584 and 0.292 for new and york. Then we compute the length of the query vector just like we did for the document vectors on the previous slide.

Next, we will compute the similarity between the query vector and each document, with the idea that we'll measure how far the query vector is from each document. Now there are many similarity functions defined and used for different things. A popular similarity measure is the cosine function, which measures the cosine of the angle between these two vectors. The mathematical formula for computing the function is given here. The intuition is that if the vectors are identical, then the angle between them is zero, and therefore the cosine function evaluates to one. As the angle increases, the value of the cosine function decreases, making the vectors more dissimilar. The way to compute the function is to multiply the corresponding elements of the two vectors, that is, the first element of one with the first element of the second one and so forth, and then sum up these products. Here, the only contributing terms are from new and york, because these are the only two non-zero terms in the query vector. This sum is then divided by the product of the document length and the query length that we computed earlier. Look at the result of the similarity function and you will notice that document 1 is much more similar to the query than the other two.

So while the similarity scoring and document ranking process works effectively, the method is a little cut and dried. More often than not, users would like a little more control over the ranking. One way of accomplishing this is to put different weights on each query term.
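Putting the pieces together, here is a hedged sketch of the pipeline described above: tf-idf document vectors, a query vector scaled by the maximum term frequency, and the cosine similarity, which is the dot product of the query and document vectors divided by the product of their lengths. The documents and the exact query wording ("new new york") are assumptions chosen to reproduce the query weights 0.584 and 0.292 mentioned in the lecture.

```python
import math

# Assumed toy documents, as in the earlier sketch.
docs = ["new york times", "new york post", "los angeles times"]
vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)
idf = {t: math.log2(N / sum(1 for d in docs if t in d.split())) for t in vocab}

def tfidf(text):
    """tf-idf vector of a document: term count times idf, per vocabulary term."""
    words = text.split()
    return [words.count(t) * idf[t] for t in vocab]

def length(vec):
    """Vector length: square root of the sum of squared term scores."""
    return math.sqrt(sum(x * x for x in vec))

def query_vector(text):
    """Query vector: idf of each term, scaled by (count / max count in the query)."""
    words = text.split()
    max_tf = max(words.count(w) for w in set(words))
    return [(words.count(t) / max_tf) * idf[t] for t in vocab]

def cosine(q, d):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return sum(qi * di for qi, di in zip(q, d)) / (length(q) * length(d))

q = query_vector("new new york")   # assumed query wording; yields 0.584 and 0.292
for doc in docs:
    print(doc, round(cosine(q, tfidf(doc)), 3))
# Roughly 0.775, 0.439, 0.0: document 1 ranks far above the other two,
# matching the ranking described in the lecture.
```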
In this example, the query term york has a default weight of one, times has a weight of two, and post has a weight of five, as specified by the user. So, relatively speaking, york has a weight of 1 divided by (1 + 2 + 5), which is equal to 0.125, times has a weight of 0.25, and post has a weight of 0.625. Now the scoring method we showed before changes a bit. The query vector and its length are computed exactly as before. However, now each term in the query vector is further multiplied by these relative weights. In our case, the term post now has a much higher weight. So, as expected, this will change the ranking of the documents, and new york post will have the highest rank (see the first sketch below).

Now, similarity search is often used for images using a vector space model. One can compute features from images, and one common feature is a color histogram. Consider the image here. One can create the histogram of the red, green and blue channels, where the histogram is the count of pixels having a certain intensity value. This picture is mostly bright, so the count of dark pixels is relatively small. Now, one can think of a histogram as a vector. Very often the pixel values will be binned before creating the vector. The table shown is a feature vector where the numbers in each row have been normalized by the size of the image to make the row sum equal to one. Similar vectors can be computed for the image texture, the shapes of objects and any other properties, thus making the vector space model significant for unstructured data.
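To illustrate the re-weighting step, here is a small sketch. The raw weights (york: 1, times: 2, post: 5) come from the lecture; the query-vector entries are assumed to be the idf values from the earlier example, since each query term appears once.

```python
# User-specified raw weights for the query terms, from the lecture.
raw = {"york": 1, "times": 2, "post": 5}
total = sum(raw.values())                          # 1 + 2 + 5 = 8
relative = {t: w / total for t, w in raw.items()}
print(relative)                  # {'york': 0.125, 'times': 0.25, 'post': 0.625}

# Assumed query-vector entries: each term appears once, so the entry is just
# its idf from the earlier example.
query = {"york": 0.585, "times": 0.585, "post": 1.585}

# Each entry is further multiplied by its relative weight before cosine
# similarity is computed; "post" now dominates the query.
weighted = {t: round(query[t] * relative[t], 3) for t in query}
print(weighted)                  # {'york': 0.073, 'times': 0.146, 'post': 0.991}
```

And here is a sketch of a binned, normalized color-histogram feature vector for an image. A synthetic random image stands in for the real photo on the slide, and the bin count of 8 is an arbitrary choice for illustration.

```python
import numpy as np

# Synthetic RGB "image" (H x W x 3) standing in for a real photo.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)

n_bins = 8                      # bin the 0-255 intensity range into 8 buckets
rows = []
for channel in range(3):        # red, green, blue
    counts, _ = np.histogram(image[:, :, channel], bins=n_bins, range=(0, 256))
    rows.append(counts / counts.sum())   # normalize so each row sums to one

# Each row is one channel's histogram; stacking the rows gives a feature
# vector that can be compared with cosine similarity, just like a document vector.
feature_vector = np.concatenate(rows)
print(feature_vector.round(3))
```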