[MUSIC] In this video, we will discuss techniques for detecting anomalies. The first topic we will discuss is point anomaly detection. Several approaches have been proposed, and we will discuss them one by one. The first category is based on classification and includes algorithms such as rule-based classifiers, neural networks, and SVMs. Nearest neighbor based approaches use density or distance criteria to detect anomalies. Clustering-based algorithms divide the data into clusters, and the points which do not belong to the major clusters are declared anomalies. There are several statistical approaches as well, which can be parametric or nonparametric. Other approaches are based on information theory, spectral techniques, or visualization. So now let's look at classification-based techniques. These techniques train a classification model for normal versus anomalous data using labeled training data. In the fully supervised setting, training data for both the normal and anomaly classes is available, whereas in the semi-supervised setting we only have training data for the normal class, and any deviation from normal behavior is classified as anomalous. Now, the classification model must handle an imbalanced class distribution, because anomalies are far fewer than normal instances. So class imbalance is a problem which is encountered in all anomaly detection tasks. Techniques that we can use are oversampling the anomalies or undersampling the normal data. We can also generate artificial examples. Classification-based techniques have the advantage that they are highly accurate in detecting known and unknown anomalies, but they require labeled training data, which sometimes may be difficult to obtain. An example of these types of algorithms is PN-rule learning.
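As a concrete illustration of the oversampling idea just mentioned, here is a minimal sketch, not from the lecture itself: a toy numpy example that balances a rare anomaly class by resampling its rows with replacement. The data and proportions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(0, 1, (1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1  # only 1% anomalies: a heavily imbalanced class distribution

# Oversample the anomaly class: draw its rows with replacement
# until both classes are the same size.
anom = np.where(y == 1)[0]
norm = np.where(y == 0)[0]
extra = rng.choice(anom, size=len(norm) - len(anom), replace=True)
idx = np.concatenate([norm, anom, extra])
X_bal, y_bal = X[idx], y[idx]
print(np.bincount(y_bal))  # both classes now equally represented
```

Undersampling would instead subsample `norm` down to the size of `anom`; which of the two is preferable depends on how much normal data you can afford to discard.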
So this is a rule-based classifier model for each target class, where we have P-rules, the positive rules for the positive class, which predict the presence of a class, and N-rules, short for negative rules, which predict the absence of a class. It is a two-phase learning scheme. In phase one, we discover a few P-rules that capture most of the positive cases with high support, keeping in mind that the false positive rate should be reasonably low. In phase two, we discover a few N-rules that remove most of the false positives introduced by the union of all P-rules. This keeps the detection rate above an acceptable level while providing high accuracy and significant support. The set of P- and N-rules are then ranked according to certain statistical measures. To show it using an example, consider an original training dataset which is, let's say, a square, in which the white area represents negative examples and the blue represents positive examples. We first discover our first P-rule, shown here by the red box. Once we have found it, we remove the examples it covers, and for the remaining blue area we discover the second P-rule, P1. So here we have removed the examples which are covered by P0 and P1. For the choice of the third P-rule, we have three options, shown here as q1 and q2, each a blue square, and P2, a red rectangle. We can see that P2 is chosen over q1 or q2 because its support is higher; it includes more positive examples. Once we have found P2, we can construct the dataset for the second stage, in which the examples that were covered by the P-rules are now treated as negative, and the regions which were covered but were not actually positive, shown in blue, become the positive examples for the second stage. We could also have learned a fourth P-rule if the accuracy threshold is low.
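The two-phase scheme can be sketched on a toy 1-D dataset. This is a deliberately simplified illustration under assumed details: rules here are just intervals, phase 1 greedily picks P-rules whose false positives do not exceed their true positives, and phase 2 picks narrow N-rules that carve false positives out of the P-rule union. The actual PN-rule algorithm uses more refined scoring and ranking than this.

```python
import numpy as np

# Toy 1-D dataset: points 0..19; positives form two clusters plus one stray.
X = np.arange(20.0)
y = np.zeros(20, dtype=int)
y[[3, 4, 5, 8, 12, 13]] = 1

def covers(a, b, X):
    """Boolean mask of points falling in the interval [a, b)."""
    return (X >= a) & (X < b)

# Phase 1: greedily add P-rules (width-3 intervals) that capture many
# still-uncovered positives while keeping false positives low.
p_rules, uncovered = [], (y == 1)
while True:
    best, best_tp = None, 0
    for a in range(18):
        mask = covers(a, a + 3, X)
        tp = int(np.sum(mask & uncovered))
        fp = int(np.sum(mask & (y == 0)))
        if tp > best_tp and fp <= tp:  # crude false-positive check
            best, best_tp = (a, a + 3), tp
    if best is None:
        break
    p_rules.append(best)
    uncovered &= ~covers(*best, X)

# Phase 2: add N-rules (width-1 intervals) that remove false positives
# from the union of all P-rules without losing true positives.
p_union = np.zeros(20, dtype=bool)
for r in p_rules:
    p_union |= covers(*r, X)
n_rules = []
for a in range(20):
    mask = covers(a, a + 1, X) & p_union
    if mask.any() and np.sum(mask & (y == 0)) > 0 and np.sum(mask & (y == 1)) == 0:
        n_rules.append((a, a + 1))

pred = p_union.copy()
for r in n_rules:
    pred &= ~covers(*r, X)
print(np.where(pred)[0].tolist())  # positives recovered by the rule set
```

On this toy data the P-rules capture the two positive clusters (the stray positive at 8 is left out, as no low-false-positive rule covers it), and a single N-rule removes the one false positive the P-rules introduced.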
Again, here we have two options, P3 and q3. But since q3 excludes some positive examples, we choose P3, which includes them as well. This is the starting data, shown on the right, for the second stage if the first stage stops after P3. Here negative examples are shown in white, and the blue are the positive examples for which we need to find the rules. The next category is nearest neighbor based techniques. These make the assumption that normal points have close neighbors while anomalies are located far from other points. Here again, it is a two-step approach: first, compute the neighborhood for each data record, and then analyze the neighborhood to determine whether the record is an anomaly or not. In distance-based methods, anomalies are the data points which are most distant from other points. Density-based methods assume anomalies are data points which lie in low-density regions. The advantages of nearest neighbor based algorithms are that they can be used in an unsupervised or semi-supervised setting and do not make any assumptions about the data distribution. However, they are computationally expensive, as the distance or density of each data point needs to be computed, and they are not suitable for high-dimensional data, since data is very sparse in high dimensions. In distance-based outlier detection, that is, the nearest neighbor approach, for each data point d we compute the distance to its kth nearest neighbor and denote it d_k. Here k is a design parameter, but usually we choose k to be around 5 to 10. Then we sort all data points according to this distance d_k. The larger d_k is, the more likely the point is an outlier, because it is located in a sparser neighborhood. So usually the data points with the top n% of d_k values are identified as outliers. Again, here n is a design parameter.
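The distance-based procedure just described, compute d_k for every point, sort, and flag the points with the largest values, can be sketched in a few lines of numpy. The function name and toy data below are my own, not from the lecture.

```python
import numpy as np

def knn_outliers(X, k=5, top_frac=0.05):
    """Flag the top fraction of points by distance to their k-th nearest neighbor."""
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)           # ignore each point's zero self-distance
    d_k = np.sort(dist, axis=1)[:, k - 1]    # distance to the k-th nearest neighbor
    n_out = max(1, int(round(top_frac * len(X))))
    return np.argsort(d_k)[-n_out:]          # indices of the most isolated points

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),  # one dense normal cluster
               [[5.0, 5.0]]])                # one far-away point (index 50)
print(knn_outliers(X, k=5, top_frac=0.02))
```

Note the quadratic pairwise-distance matrix is exactly the computational cost the lecture warns about; for large datasets one would use a spatial index such as a k-d tree instead.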
The kth-nearest-neighbor approach is not suitable for datasets that have modes with varying density, that is, regions where some parts are dense and some are sparse. For these types of datasets, another algorithm is proposed, called the local outlier factor (LOF). In this, for each data point d we first compute the distance to the kth nearest neighbor, d_k, just as in the nearest neighbor approach. Then we compute what we call the reachability distance, given by reach-dist(d, p) = max(d_k(p), dist(d, p)). Once we have that, we compute the local reachability density of d based on its MinPts nearest neighbors, given as MinPts divided by the sum of the reachability distances to those neighbors. Based on it, we compute LOF, the local outlier factor of d, as the ratio of the average local reachability density of its k nearest neighbors to its own. So essentially we are comparing point d to its neighbors in terms of local reachability density. Let's see an example for LOF. In this figure we can see that the data consists of two clusters, one sparse and one very dense. There are also three points of interest, denoted p1, p2, and p3. As shown by the red lines, the distance of p2 and of p3 to their nearest neighbors is almost the same. So in the NN approach, p2 is not considered an outlier, while LOF finds both p1 and p2 as outliers. Now, NN may consider p3 an outlier, but LOF does not. The reason is that p3 is very close to its neighbors, whereas p2 is far away relative to the density of its local neighborhood. So this algorithm is suitable for datasets that have modes with varying density. In this video, we discussed a few algorithms for detecting anomalies, based first on classification and then on nearest neighbors. Thank you. [MUSIC]
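To make the LOF computation concrete, here is a minimal numpy sketch of the three formulas above, reachability distance, local reachability density, and the LOF ratio, taking MinPts equal to k. The toy dataset and the test point are illustrative assumptions, not from the lecture.

```python
import numpy as np

def lof(X, k=3):
    """Local outlier factor: reach-dist, then lrd, then the neighbor/own ratio."""
    n = len(X)
    dist = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :k]        # indices of k nearest neighbors
    d_k = np.sort(dist, axis=1)[:, k - 1]        # k-distance of each point
    # reach-dist(d, p) = max(d_k(p), dist(d, p)) for each neighbor p of d
    reach = np.maximum(d_k[knn], dist[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)                  # local reachability density
    return lrd[knn].mean(axis=1) / lrd           # avg neighbor lrd / own lrd

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.1, (30, 2))    # very dense cluster around the origin
sparse = rng.normal(5, 1.0, (30, 2))   # sparse cluster far away
p2 = np.array([[0.8, 0.8]])            # just outside the dense cluster
X = np.vstack([dense, sparse, p2])
scores = lof(X, k=5)
print("LOF of p2:", round(float(scores[-1]), 2))
```

Points deep inside either cluster get scores near 1, because their density matches their neighbors'; the point sitting just outside the dense cluster gets a score well above 1, mirroring the p2 case in the figure, even though its absolute nearest-neighbor distance is modest.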