Hi. In this video we'll talk about the spam filtering task. Let me remind you that when we want to use a bag-of-words representation, for every n-gram of our text we actually need to find the feature index, that is, the index of the column where we will put the value, say the TF-IDF value. For that purpose we need to maintain the correspondence from n-gram to feature index, and usually you use a hash map, or a dictionary in Python, for that.

Now let's assume that your dataset is huge, and that's where this can become a problem. Say we have one terabyte of texts distributed across 10 computers, and you need to vectorize each text, that is, replace the text with a vector of TF-IDF values. You still have to maintain that correspondence from n-gram to feature index, and that becomes a problem when 10 computers are doing the same thing. The first problem is that the hash map may simply not fit in memory on one machine, so you need some kind of database where you store that correspondence and all machines query it, and that doesn't scale well. Another problem is that it is difficult to synchronize that hash map: when a new n-gram appears, you have to introduce a new index for it, and 10 machines are doing that in parallel, which means they have to synchronize somehow. Say I'm machine number one, I found a new n-gram, I'm taking the next free index and adding that correspondence to the hash map. But that particular n-gram should be converted to the same feature index on all the other machines as well, so all the other machines should somehow learn that the first machine introduced a new feature index. That can become a problem, and it can lead to bad scaling of this workload.

There is actually an easier way: you can throw the hash map away and just replace it with hashing. That means you take an n-gram, compute a hash value of that n-gram, and take that value modulo two to the 20, or two to the 22, or any other huge number. A hash is a function that converts an input into some number, so you can give it different strings and it will output different numbers. For some strings it can sometimes output the same value, and that is known as a collision. Hash functions do have collisions, but in practice we will see later that if you take the hash value modulo two raised to a high power, those collisions can be neglected. This hashing vectorizer is implemented in scikit-learn, where it's called HashingVectorizer, obviously, and it's also implemented in the Vowpal Wabbit library that we will overview later.

Okay, so let's take the spam filtering task. As you might guess, this is a huge task: even on a medium-sized mail server people send a lot of emails, and if you have millions of users, you end up with a terabyte of text that you need to analyze. There is actually a paper on arXiv that introduces a proprietary dataset for spam filtering. It has half a million users, three million letters, and 40 million unique words seen in those letters. Let's say we map each token to an index using some hash function phi. It works like this: phi(x) = hash(x) mod 2^b, that is, it takes our token x, hashes it, and takes that value modulo two to the b; for b equal to 22 we get about four million features.
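As a rough sketch of that idea in Python (Python's built-in hash here is just a stand-in for whatever hash function is used in practice, and the parameters are illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# The hashing trick by hand: map an n-gram straight to a column index,
# with no shared n-gram -> index dictionary to maintain or synchronize.
b = 22                                   # 2**22 is roughly four million columns

def feature_index(ngram: str) -> int:
    return hash(ngram) % (2 ** b)        # phi(x) = hash(x) mod 2^b

# The same idea as packaged in scikit-learn:
texts = ["good movie", "not a good movie", "did not like"]
vectorizer = HashingVectorizer(n_features=2 ** b)
X = vectorizer.fit_transform(texts)      # sparse matrix of shape (3, 2**22)
```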
And that is actually a huge improvement over the 40 million features we originally had: we somehow mapped 40 million features into four million. And thanks to the fact that hash collisions are quite unlikely (they do happen, but there are not a lot of them), we can train a model on top of these hashed features and still get a pretty decent result.

So let's look at an example of how that hashing vectorizer works, and first let me introduce some hash function for an arbitrary string. We have a string S. We take the character code of its first character, which is a number from 0 to 255, for example. Then we take the next character code and multiply it by some fixed prime number, take the third character code and multiply it by that prime squared, and so forth. What we obtain is an integer number, and that is the hash value of our string. Of course, some strings can hash to the same value, and that is known as a collision, but in practice we will not see a lot of them.

So let's see what we might get. For this particular dataset with three reviews ('good movie', 'not a good movie', 'did not like'), we take all the tokens we have and pass them through our hash function phi with a pretty small b, and what we get is the following: zero, one, two, three and three. For 'a' and 'did' we get the same hash value, but that is fine, and for 'like' we get the hash value of four.

So how does vectorization work now? In our columns, instead of different tokens, we now have different hash values, all the numbers from zero to four. Let's see how 'good movie' is vectorized. We look at the hash value of 'good', which is zero, so we add one to the column corresponding to that value. Then we take the next word, 'movie', hash it as well, get the value of one, and put one in the column that corresponds to that hash value. That is how we proceed with all the other reviews, and the result looks pretty similar to bag-of-words, but now instead of tokens we have hash values.

Okay, let's try to train a linear model on top of these features. But first, let's look at what we have achieved: we proposed a way to squeeze the number of features we originally had. In a bag-of-words manner you had 40 million features, and if you hash them you have four million, and you can control the number of output features by adjusting that b parameter in the power of two. What that actually means is that we can now introduce a lot of features, even a trillion, and if we hash them we still have a fixed number, two to the b, of features to analyze.

Let's see how that might work. Phi_0 is the old hashing trick that we used: we just take the hash value of our token and take that value modulo two to the b. Another thing we can do is personalized hashing. That means we want a feature that says: for that particular user u and that particular token, if we see that user and that token together in an email, we want to learn some personalized preference of that user for spam or non-spam emails. If you take the user, add an underscore and the token, hash that new string, and take the value modulo two to the b, you get some new features.
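Here is a small sketch of both ideas, the toy polynomial string hash and the personalized features, with a prime base and names that I picked just for illustration:

```python
PRIME = 31                               # fixed prime base, chosen for illustration

def toy_hash(s: str) -> int:
    # First character code, plus the second times the prime,
    # plus the third times the prime squared, and so forth.
    return sum(ord(ch) * PRIME ** i for i, ch in enumerate(s))

b = 22                                   # 2**b columns in the hashed feature space

def phi_0(token: str) -> int:
    """Plain hashed feature: phi_0(token) = hash(token) mod 2^b."""
    return toy_hash(token) % (2 ** b)

def phi_u(user: str, token: str) -> int:
    """Personalized feature: hash 'user_token' into the same 2^b columns."""
    return toy_hash(user + "_" + token) % (2 ** b)

# Global and personalized features share one fixed-size feature space:
print(phi_0("newsletter"))               # column for the word itself
print(phi_u("user123", "newsletter"))    # column for this particular user's copy of it
```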
And actually, in that dataset, if you take all pairs of user and word, you get 16 trillion pairs, and it is not possible to handle those features in an explicit bag-of-words representation because it would take 16 terabytes of data; it's just not feasible. But if you take the hash of those features modulo two to the b, you have a fixed number of features.

So here is our pipeline: we have a text document, we extract tokens, we add personalized tokens by prefixing every token we've seen for that user with, say, 'user123_', we hash all those tokens, and we get a sparse vector of size two to the b (a rough code sketch of this pipeline follows at the end of this part).

Okay, now let's see whether that hashing actually hurts our performance or not. On this graph you can see three different models. The first one, drawn in black, is the baseline: a model trained on the original tokens without any hashing, just a linear model on top of bag-of-words. The blue one is the hashed version, where you replace TfidfVectorizer with, say, HashingVectorizer, so you have a smaller number of features. You can see that starting from some value of b, say 22, you don't lose anything in terms of quality, while if you take b equal to 18 you do lose some quality. So it's pretty okay to use hashing as long as you have a large enough number of hash values. Then there is a red curve which corresponds to personalized hashing: the model where you introduce personalized tokens, hash them, and use them in the linear model as well. You can see that this gives a significant boost in miss-rate: on the y axis we have the miss-rate, and we want it to be as low as possible.

Okay, so let's understand why those personalized features actually work. The answer is pretty obvious: because they are personalized, they capture some local, user-specific preference. Say some users consider newsletters to be spam; then if we see words that are frequently used in newsletters, for those particular users the letter is spam as well, while for the rest of the people, for the majority of them, those newsletters are okay. So if you add personalized features, you can actually understand what makes a letter a spam letter for a particular user.

But how will it work for new users? Let's look at users with different numbers of emails in the training set. Users with 64 emails in training have a very low miss-rate, because we know really well what is a spam letter for them and what is not. As they have fewer and fewer examples, the quality of the model starts to suffer, because we have fewer examples of what a spam letter looks like for them. But one surprising thing is that even for users who had no letters in the training set at all, we get better quality than the baseline model. Why does that happen? For those users nothing changes: we don't add any user-specific tokens, so you would expect nothing to change for them, and you would expect results close to the baseline; yet the model actually performs better than the baseline. Let's find out why.
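First, here is the promised sketch of the personalized hashing pipeline: tokenize, add user-prefixed copies of every token, and hash everything into one sparse vector of size two to the b. The tokenization and names are my own simplifications, with Python's built-in hash standing in for the toy hash above:

```python
from collections import Counter
from scipy.sparse import csr_matrix

B = 22                                             # 2**B columns, as before

def vectorize(text: str, user: str) -> csr_matrix:
    """Hash one document into a sparse count vector of size 2**B,
    adding a user-prefixed (personalized) copy of every token."""
    tokens = text.lower().split()                  # crude whitespace tokenization
    tokens += [user + "_" + tok for tok in tokens] # personalized tokens
    counts = Counter(hash(tok) % (2 ** B) for tok in tokens)
    cols = sorted(counts)
    data = [counts[c] for c in cols]
    return csr_matrix((data, ([0] * len(cols), cols)), shape=(1, 2 ** B))

x = vectorize("weekly newsletter new offers inside", user="user123")
print(x.nnz, x.shape)                              # a few nonzeros in a four-million-column row
```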
You can think of it in the following way. It turns out that we learn a better global model when we also have features that capture personalized, local user preferences. Take the same example of people who hate newsletters. There could be a small number of such people, and for them we can use their personalized features to learn that they hate newsletters; for the majority of people, newsletters are fine. Having those personalized features, the linear model can learn: okay, I will look at the personalized features; this particular person hates newsletters, that is okay; but for all the rest, I will use the features that contain the words seen in newsletters, like 'newsletter' itself, and for all the rest I will learn a better model. That is what actually happens in practice, and that's how we can explain why this happens.

Another question is: why do we deal with such a huge dataset? Why take one terabyte of data? Why can't we take, say, a thousand emails and just train our classifier? It turns out that with the same simple linear classifier and the same simple features, you can learn a better classifier when you have more data. This can be seen clearly in ad click prediction. There is a paper on arXiv, also using a proprietary dataset, with trillions of features and billions of training examples. Its authors showed that if you sample your dataset, taking, say, a one percent or a ten percent sample of the huge terabyte-scale dataset, it actually hurts your model in terms of area under the ROC curve, and it hurts it at any sampling rate you take. You may think that a difference in the third digit of the AUC is not that much, so why bother with a one-terabyte dataset? But when we are talking about ad click prediction, any improvement can lead to millions of dollars of revenue, so people actually want to squeeze the maximum they can from those models.

At last, I want to overview the Vowpal Wabbit library. It is a well-known machine learning library used for training linear models. It uses the feature hashing that we described previously internally, it has lots of different features, it's really fast, and it scales very well. What's more wonderful is that as input to this library you can give your raw text: it will tokenize that text on whitespace, take the hash value of each token, and use those hash values internally for hashing vectorization. You can also pass features for which you already know the hash value: you write something like '13:' followed by some real value, which means that the column corresponding to hash value 13 will contain that value.
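As a rough illustration of what that input can look like (the file name, labels, tokens, and command-line options here are my own example, following Vowpal Wabbit's usual 'label | features' line format):

```
$ cat train.vw
1 | free prize click now user123_free user123_prize
-1 | meeting notes attached user123_meeting
-1 | weekly digest unsubscribe 13:2.5
$ vw train.vw -b 22 --loss_function logistic -f model.vw
```

Tokens after the bar are hashed automatically, a numeric feature like 13:2.5 supplies a value for a column whose index you already know, and -b 22 sets the number of bits, that is, the two to the b size of the hashed feature space.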
Okay, let's summarize. We've taken a look at a different application, spam filtering, which uses feature hashing. Thanks to hashing, you can handle a huge number of features, even a trillion, and you can add personalized features, which is a really nice trick to further boost the performance of your model. Linear models over bag-of-words scale well for production; that is a well-known thing, and that's why we overview them: most likely you will have to implement a linear model as a baseline when you work somewhere, in some corporation or anywhere else. In the next video, we will take a look at the text classification problem using deep learning techniques.