When building an anomaly detection algorithm, I've found that choosing a good set of features turns out to be really important. In supervised learning, if you don't have the features quite right, or if you have a few extra features that are not relevant to the problem, that often turns out to be okay, because the algorithm has the supervised signal, that is, enough labels y, to figure out which features to ignore, or how to rescale features, and to take the best advantage of the features you do give it. But for anomaly detection, which runs, or learns, just from unlabeled data, it is harder for the algorithm to figure out which features to ignore. So I've found that carefully choosing the features is even more important for anomaly detection than for supervised learning approaches. Let's take a look in this video at some practical tips for how to tune the features for anomaly detection, to try to get you the best possible performance.
One step that can help your anomaly detection algorithm is to try to make sure the features you give it are more or less Gaussian. And if your features are not Gaussian, sometimes you can change them to make them a little bit more Gaussian. Let me show you what I mean. If you have a feature x, I will often plot a histogram of the feature, which you can do using the Python command plt.hist. You'll see this in the practice lab as well, as a way to look at the histogram of the data. This distribution here looks pretty Gaussian, so this would be a good candidate feature, if you think this is a feature that helps distinguish between anomalies and normal examples. But quite often when you plot a histogram of your features, you may find that the feature has a distribution like this. This does not look at all like that symmetric bell-shaped curve. When that is the case, I would consider whether you can take this feature x and transform it in order to make it more Gaussian.
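As a rough, dependency-free sketch of that histogram check: the lecture plots with plt.hist, while the code below only counts values per bin and prints text bars, so that it runs without matplotlib. The helper names (`histogram_counts`, `print_histogram`) and both synthetic samples are assumptions made for illustration, not data from the practice lab.

```python
import random

def histogram_counts(values, bins=10):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs in the last bin, not one past it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def print_histogram(values, bins=10, max_bar=40):
    """Print one text bar per bin, scaled so the tallest bar has max_bar characters."""
    counts = histogram_counts(values, bins)
    tallest = max(counts)
    for c in counts:
        print("#" * round(max_bar * c / tallest), c)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
bell_shaped = [rng.gauss(50.0, 10.0) for _ in range(5000)]  # synthetic, roughly Gaussian
skewed = [rng.expovariate(1.0) for _ in range(5000)]        # synthetic, long right tail

print_histogram(bell_shaped)
print()
print_histogram(skewed)
```

In a notebook, `plt.hist(x, bins=10)` would show essentially the same bin counts as a chart. The first sample should print a roughly symmetric bump, while the second piles up in the lowest bins and trails off to the right.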
For example, maybe if you were to compute the log of x and plot a histogram of log of x, it would look like this, and this looks much more Gaussian. So if this feature was feature x1, then instead of using the original feature x1, which looks like this on the left, you might instead replace that feature with log of x1, to get this distribution over here. Because when x1 is made more Gaussian, then when anomaly detection models p of x1 using a Gaussian distribution like that, it is more likely to be a good fit to the data. Other than the log function, other things you might do are, given a different feature x2, you may replace it with log of x2 plus 1, that is, log(x2 + 1). This would be a different way of transforming x2. And more generally, log(x2 + C) would be one example of a formula you can use to change x2 to try to make it more Gaussian. Or for a different feature, you might try taking the square root, and really the square root of x3 is x3 to the power of one half, and you may change that exponent term. So for a different feature x4, you might use x4 to the power of one third, for example. So when I'm building an anomaly detection system, I'll sometimes take a look at my features, and if I see any that are highly non-Gaussian when I plot a histogram, I might choose transformations like these or others in order to try to make them more Gaussian. It turns out a larger value of C will end up transforming this distribution less. But in practice I just try a bunch of different values of C, and then take a look to pick one that looks better in terms of making the distribution more Gaussian.
Now, let me illustrate how I actually do this in a Jupyter notebook. So this is what the process of exploring different transformations of the features might look like. When you have a feature x, you can plot a histogram of it as follows. It actually looks like that's a pretty coarse histogram.
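The candidate transformations mentioned above can be compared numerically as well as by eye. The sketch below is an illustration under assumptions: the data is a synthetic right-skewed sample, and sample skewness (about 0 for a symmetric distribution) is used as a stand-in for judging the histogram, which is not something the lecture itself does.

```python
import math
import random

def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

rng = random.Random(1)
# Synthetic, strictly positive, right-skewed feature (hypothetical data).
x = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

candidates = {
    "x": x,
    "log(x)": [math.log(v) for v in x],
    "log(x + 1)": [math.log(v + 1) for v in x],
    "log(x + 10)": [math.log(v + 10) for v in x],
    "x ** (1/2)": [v ** 0.5 for v in x],
    "x ** (1/3)": [v ** (1 / 3) for v in x],
}

skews = {name: skewness(vals) for name, vals in candidates.items()}
for name, s in skews.items():
    print(f"{name:12s} skewness = {s:+.2f}")
```

On data like this, the skewness for log(x + 10) stays closer to that of the raw feature than the skewness for log(x + 1) does, which matches the remark that a larger C transforms the distribution less.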
Let me increase the number of bins in my histogram to 50. So bins equals 50 there. That's the histogram with 50 bins. And by the way, if you want to change the color, you can also do so as follows. And if you want to try a different transformation, you can try, for example, plotting the square root of x. So x to the power of 0.5, with again 50 histogram bins, in which case it might look like this. And this actually looks somewhat more Gaussian, but not perfectly, so let's try a different parameter. Let me try to the power of 0.25. Maybe I adjusted a little bit too far. Let's try 0.4. That looks pretty Gaussian. So one thing you could do is replace x with x to the power of 0.4. You would set x to be equal to x to the power of 0.4, and just use that value of x in your training process instead.
Or let me show you another transformation. Here, I'm going to try taking the log of x. So log of x plotted with 50 bins, and I'm going to use the numpy log function as follows. It turns out you get an error, because x in this example has some values that are equal to zero, and, well, log of zero is negative infinity, so it is not defined. So a common trick is to add just a very tiny number there. So x plus 0.001 becomes strictly positive, and you get a histogram that looks like this. But if you want the distribution to look more Gaussian, you can also play around with this parameter, to try to see if there's a value of it that causes your data to look more symmetric and maybe more Gaussian, as follows. And just as I'm doing right now in real time, you can see that you can very quickly change these parameters and plot the histogram, in order to take a look and try to get something a bit more Gaussian than the original data x that you saw in the histogram up above. If you read the machine learning literature, there are some ways to automatically measure how close these distributions are to Gaussian.
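The notebook steps above might look roughly like the following sketch. It uses the standard library's `math.log`, which raises `ValueError` on zero, while `np.log` returns negative infinity with a warning. The feature values are synthetic, and choosing by smallest absolute skewness is an assumed stand-in for eyeballing each histogram; it is one simple example of an automatic measure, not the lecture's method.

```python
import math
import random

def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

rng = random.Random(2)
# Synthetic non-negative feature, rounded so that some values are exactly 0.
x = [round(rng.expovariate(1.0), 1) for _ in range(5000)]

# Taking the log directly fails because of the zeros.
try:
    [math.log(v) for v in x]
    log_failed = False
except ValueError:
    log_failed = True
print("plain log failed:", log_failed)

# Candidate transformations: power transforms and log with a small offset.
candidates = {}
for p in (0.5, 0.25, 0.4):
    candidates[f"x ** {p}"] = [v ** p for v in x]
for c in (0.001, 0.01, 0.1, 1.0):
    candidates[f"log(x + {c})"] = [math.log(v + c) for v in x]

skews = {name: skewness(vals) for name, vals in candidates.items()}
for name, s in skews.items():
    print(f"{name:16s} skewness = {s:+.2f}")

# Pick the candidate whose skewness is closest to 0 (most symmetric).
best = min(skews, key=lambda name: abs(skews[name]))
print("closest to symmetric:", best)
```

Skewness only captures symmetry, so a histogram is still worth a look before settling on a transformation.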
But I've found that in practice it doesn't make a big difference. If you just try a few values and pick something that looks right to you, that will work well for all practical purposes. So, by trying things out in a Jupyter notebook, you can try to pick a transformation that makes your data more Gaussian. And just as a reminder, whatever transformation you apply to the training set, please remember to apply the same transformation to your cross validation and test set data as well.
Other than making sure that your data is approximately Gaussian, after you've trained your anomaly detection algorithm, if it doesn't work that well on your cross validation set, you can also carry out an error analysis process for anomaly detection. In other words, you can try to look at where the algorithm is not yet doing well, where it is making errors, and then use that to try to come up with improvements. So as a reminder, what we want is for p of x to be large, so greater than or equal to epsilon, for normal examples x, and for p of x to be small, so less than epsilon, for the anomalous examples x. When you've learned the model p of x from your unlabeled data, the most common problem that you may run into is that p of x is comparable in value, say it is large, for both normal and anomalous examples. As a concrete example, if this is your data set, you might fit that Gaussian to it. And if you have an example in your cross validation set or test set that is over here, that is anomalous, then this has a pretty high probability. In fact, it looks quite similar to the other examples in your training set. And so, even though this is an anomaly, p of x is actually pretty large, and the algorithm will fail to flag this particular example as an anomaly. In that case, what I would normally do is try to look at that example and try to figure out what it is that made me think it is an anomaly, even if this feature x1 took on values similar to other training examples.
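The failure case just described can be reproduced in a few lines. This is a minimal sketch under assumptions: a single feature x1 modeled with one Gaussian, synthetic training values, and a hypothetical threshold `epsilon`; none of these numbers come from the lecture.

```python
import math
import random

def fit_gaussian(values):
    """Estimate the mean and variance of one feature from training data."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

rng = random.Random(3)
# Synthetic training feature x1 (hypothetical values).
x1_train = [rng.gauss(20.0, 4.0) for _ in range(1000)]
mu1, var1 = fit_gaussian(x1_train)

epsilon = 0.01     # hypothetical threshold
x1_anomaly = 21.0  # a known anomaly whose x1 value looks ordinary

p = gaussian_pdf(x1_anomaly, mu1, var1)
flagged = p < epsilon
print(f"p(x1) = {p:.4f}, flagged as anomaly: {flagged}")
```

Because the anomaly's x1 value sits near the training mean, p of x1 is well above epsilon and the example is not flagged.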
And if I can identify some new feature, say x2, that helps distinguish this example from the normal examples, then adding that feature can help improve the performance of the algorithm. Here's a picture showing what I mean. Suppose I can come up with a new feature x2. Say I'm trying to detect fraudulent behavior, and x1 is the number of transactions a user makes. Maybe this user looks like they're making a similar number of transactions to everyone else. But if I discover that this user has some insanely fast typing speed, and if I were to add a new feature x2 that is the typing speed of this user, and if it turns out that when I plot this data using the old feature x1 and this new feature x2, this example stands out over here because of its value of x2, then it becomes much easier for the anomaly detection algorithm to recognize that this is an anomalous user. Because when you have this new feature x2, the learning algorithm may fit a Gaussian distribution that assigns high probability to points in this region, a bit lower in this region, and a bit lower still in this region. And so this example, because of the very anomalous value of x2, becomes easier to detect as an anomaly.
So just to summarize, the development process I will often go through is to train the model, and then to see what anomalies in the cross validation set the algorithm is failing to detect, and then to look at those examples to see if that can inspire the creation of new features on which those examples take on unusually large or unusually small values, so that you can now successfully flag those examples as anomalies.
Just as one more example, let's say you're building an anomaly detection system to monitor computers in a data center, to try to figure out if a computer may be behaving strangely and deserves a closer look, maybe because of a hardware failure, or because it's been hacked into or something.
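To make the fraud example concrete, here is a hedged sketch that assumes p of x is modeled as the product of independent per-feature Gaussians, p(x1) times p(x2). The transaction counts, typing speeds, and both thresholds are hypothetical values chosen only for illustration.

```python
import math
import random

def fit_gaussian(values):
    """Estimate the mean and variance of one feature from training data."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

rng = random.Random(4)
x1_train = [rng.gauss(20.0, 4.0) for _ in range(1000)]  # synthetic: transactions per user
x2_train = [rng.gauss(40.0, 8.0) for _ in range(1000)]  # synthetic: typing speed

mu1, var1 = fit_gaussian(x1_train)
mu2, var2 = fit_gaussian(x2_train)

def p_x1_only(x1):
    """Model that only knows about the number of transactions."""
    return gaussian_pdf(x1, mu1, var1)

def p_both(x1, x2):
    """Model with the added typing-speed feature (assumed independent of x1)."""
    return gaussian_pdf(x1, mu1, var1) * gaussian_pdf(x2, mu2, var2)

epsilon_1d = 0.01  # hypothetical threshold for the one-feature model
epsilon_2d = 1e-4  # hypothetical threshold for the two-feature model

fraud_user = (21.0, 160.0)  # ordinary transaction count, extreme typing speed
normal_user = (19.0, 42.0)

missed_with_x1_only = p_x1_only(fraud_user[0]) >= epsilon_1d
flagged_with_x2 = p_both(*fraud_user) < epsilon_2d
normal_still_ok = p_both(*normal_user) >= epsilon_2d

print("missed using x1 only:", missed_with_x1_only)
print("flagged after adding x2:", flagged_with_x2)
print("normal user still passes:", normal_still_ok)
```

The two models use different thresholds because a product of two densities lives on a different scale than a single density.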
So what you'd like to do is choose features that might take on unusually large or small values in the event of an anomaly. You might start off with features like x1 is the memory use, x2 is the number of disk accesses per second, then x3 is the CPU load, and x4 is the volume of network traffic. And if you train the algorithm, you may find that it detects some anomalies but fails to detect some other anomalies. In that case, it's not unusual to create new features by combining old features. So, for example, suppose you find that there's a computer that is behaving very strangely, but neither its CPU load nor its network traffic is that unusual. What is unusual is that it has a really high CPU load while having a very low network traffic volume. If you're running a data center that streams videos, then computers may have high CPU load and high network traffic, or low CPU load and low network traffic. But what's unusual about this one machine is a very high CPU load despite a very low traffic volume. In that case, you might create a new feature x5, which is the ratio of CPU load to network traffic. And this new feature would help the anomaly detection algorithm flag future examples like the specific machine you are seeing as anomalous. Or you can also consider other features, like the square of the CPU load divided by the network traffic volume. And you can play around with different choices of these features, in order to try to get it so that p of x is still large for the normal examples but becomes small for the anomalies in your cross validation set.
So that's it. Thanks for sticking with me to the end of this week. I hope you enjoyed hearing about both clustering algorithms and anomaly detection algorithms, and that you also enjoy playing with these ideas in the practice labs. Next week, we'll go on to talk about recommender systems. When you go to a website and it recommends products, or movies, or other things to you, how does that algorithm actually work?
This is one of the most commercially important algorithms in machine learning, and it gets talked about surprisingly little. But next week we'll take a look at how these algorithms work, so that the next time you go to a website and it recommends something to you, you'll understand how that may have come about. As well, you'll be able to build algorithms like that for yourself. So have fun with the labs, and I look forward to seeing you next week.
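As a closing sketch of the combined-feature idea from the data center example, the code below builds the ratio of CPU load to network traffic on synthetic machines and checks one hypothetical machine with high CPU load and low traffic. The units, the 0.8 relationship between the two features, and the threshold `epsilon` are all assumptions for illustration.

```python
import math
import random

def fit_gaussian(values):
    """Estimate the mean and variance of one feature from training data."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

rng = random.Random(5)
# Synthetic video-streaming machines: CPU load rises with network traffic.
traffic = [rng.uniform(20.0, 100.0) for _ in range(1000)]
cpu = [0.8 * t + rng.gauss(0.0, 3.0) for t in traffic]

# New feature x5: ratio of CPU load to network traffic.
# Traffic is never 0 in this synthetic data; real data would need a guard
# against division by zero.
ratio = [c / t for c, t in zip(cpu, traffic)]
mu5, var5 = fit_gaussian(ratio)

epsilon = 0.01                      # hypothetical threshold
odd_cpu, odd_traffic = 75.0, 25.0   # high CPU load despite low traffic

# Taken one at a time, both raw values fall inside the range seen in training.
cpu_in_range = min(cpu) <= odd_cpu <= max(cpu)
traffic_in_range = min(traffic) <= odd_traffic <= max(traffic)

p_ratio = gaussian_pdf(odd_cpu / odd_traffic, mu5, var5)
flagged = p_ratio < epsilon
print("cpu in range:", cpu_in_range, "| traffic in range:", traffic_in_range)
print("flagged using the ratio feature:", flagged)
```

The same pattern would apply to other combinations, such as the square of the CPU load divided by the network traffic.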