In this video, we'll continue our discussion of anomaly detection techniques, beginning with clustering-based techniques. The key assumption these techniques make is that normal data points belong to large, dense clusters, whereas anomalies are points that do not fit into any cluster (residuals from clustering), points that form small clusters, points in low-density clusters, or local anomalies that lie far from the other points within their own cluster.

Clustering-based techniques are categorized according to the labels available. In the semi-supervised setting, we cluster normal data to create models of normal behavior; if a new instance does not belong to, or is not close to, any of these clusters, we call it an anomaly. In the unsupervised setting, post-processing is needed to distinguish normal clusters from anomalies, based on the sizes of the clusters and the distances between them.

The advantages of clustering-based techniques are, first, that they are unsupervised and do not require ground-truth labels, and second, that they are easily adapted to an online or incremental mode, which makes them suitable for time-series data. The drawbacks are that they are computationally expensive, and that if the normal points do not form clusters, the technique may fail. They are also not well suited to high-dimensional data, where clustering algorithms may not produce meaningful clusters.

An example of a clustering-based technique is the FindOut algorithm, which finds outliers in very large data sets. The main idea is to remove the clusters from the original data and then identify the outliers; before clustering, a signal-processing technique called the wavelet transform is applied. So the algorithm first transforms the data into multidimensional signals using the wavelet transform. High-frequency signals correspond to regions with a rapid change in the distribution, which are the boundaries of the clusters.
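To make the basic clustering-based idea concrete, here is a minimal sketch on hypothetical 2-D data: run a tiny k-means, then flag as anomalies the points that lie far from every centroid, i.e., the residuals from clustering. The data, function names, and the distance threshold are all illustrative choices of ours, not part of FindOut.

```python
# A minimal sketch of clustering-based anomaly detection on
# hypothetical 2-D data: run a tiny k-means, then flag any point
# whose distance to its nearest centroid exceeds a threshold.
import math

def kmeans(points, k, iters=20):
    # Naive init: seed the centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return centroids

def cluster_outliers(points, k=2, threshold=2.0):
    # Residuals from clustering: points far from every centroid.
    centroids = kmeans(points, k)
    return [p for p in points
            if min(math.dist(p, c) for c in centroids) > threshold]

# Two dense clusters plus one far-away point.
data = [(x / 10, y / 10) for x in range(5) for y in range(5)]
data += [(5 + x / 10, 5 + y / 10) for x in range(5) for y in range(5)]
data.append((20.0, 20.0))
print(cluster_outliers(data))  # → [(20.0, 20.0)]
```

Note how this sketch also exposes the drawbacks mentioned above: it needs a distance threshold, and it only works because the normal points actually form clusters.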
Low-frequency signals correspond to regions where the data is concentrated, that is, the clusters themselves. The algorithm then removes these high- and low-frequency parts, and all remaining points are outliers. To show this with an example: in the top-left figure, we have the original data space. When we apply the wavelet transform, we get the figure on the bottom left. Back in the original data space, the clusters are shown in white and the cluster boundaries as black regions; once we remove these, the remaining points are the outliers, shown here as black dots.

Statistics-based techniques assume that the data points are modeled by a stochastic distribution, and points are determined to be outliers based on this model. There are two types of statistical techniques. Parametric techniques assume that the normal data is drawn from an underlying parametric distribution, and the algorithm aims to learn the parameters from the normal samples. Non-parametric techniques, such as kernel density estimation, do not assume any knowledge of the parameters. The advantage of statistics-based techniques is that they can use existing statistical machinery to model various types of distributions. The challenges are that it is difficult to estimate the distribution, especially for high-dimensional data, and that parametric assumptions often do not hold for real-world datasets.

SmartSifter is an algorithm that uses a probabilistic model as a representation of the underlying mechanism of data generation. A histogram density represents the probability density for categorical attributes, and a finite mixture model represents the probability density for continuous attributes. It detects outliers in an online manner, using incremental learning for the probabilistic model: a score is given to each new data point based on how much the model changes after its inclusion.
If the score is high, the data point is an anomaly, since the previous model was not able to explain it; a small score means it is a normal point. SmartSifter is also adaptive to non-stationary data sources.

Now let's talk about contextual anomaly detection, in which we identify a context, or neighborhood, around a data instance. This context could be a spatial context, such as the latitude and longitude of neighboring regions; a graph context, in which we consider edges and their weights; a sequential context, which considers position or time; or a profile context, which uses user demographics. We then determine whether the data instance is anomalous with respect to its context using a set of behavioral attributes. The advantage is that we can detect anomalies that are hard to spot from a global perspective. The challenges are identifying a good set of contextual attributes and determining a context from them.

A few contextual anomaly detection techniques are: reduction to point outlier detection, where we segment the data using the contextual attributes and then apply a traditional point outlier detection technique within each context using the behavioral attributes; or techniques that utilize structure in the data, building models from the data using the contextual attributes, such as the time-series models ARIMA, ARMA, and so on.

Collective anomaly detection detects collective anomalies by exploiting the relationships among data instances. For example, sequential anomaly detection detects anomalous sequences in a database of sequences, or anomalous subsequences within a sequence; applications include system-call intrusion detection, climate data, and so on. Collective anomaly detection can also take other forms, such as spatial anomaly detection, in which we detect anomalous sub-regions within spatial data, and graph anomaly detection, which detects anomalous sub-graphs.
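As a minimal illustration of the parametric, statistics-based approach described above, the sketch below fits a one-dimensional Gaussian to hypothetical normal training data and flags new points whose z-score exceeds a threshold. The data, the function name, and the 3-sigma threshold are our illustrative assumptions, not a specific published method.

```python
# Parametric statistics-based sketch: learn the parameters (mean and
# standard deviation) of a Gaussian from normal samples, then flag
# points the fitted distribution cannot explain.
import statistics

train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]  # hypothetical normal data
mu = statistics.mean(train)
sigma = statistics.stdev(train)

def is_anomaly(x, z_threshold=3.0):
    # A point more than z_threshold standard deviations from the
    # mean is considered poorly explained by the model.
    return abs(x - mu) / sigma > z_threshold

print(is_anomaly(10.1))  # → False: well explained by the model
print(is_anomaly(15.0))  # → True: far outside the fitted distribution
```

This also shows why the parametric assumption matters: if the normal data were not roughly Gaussian, the fitted mean and standard deviation would give misleading scores.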
In the Python example, we use PyOD, a Python toolkit for anomaly detection. It consists of various modules for outlier detection. Its models include linear models such as PCA, MCD (minimum covariance determinant), and the one-class support vector machine. It also has proximity-based outlier detection models such as LOF (local outlier factor), the clustering-based local outlier factor (CBLOF), k-nearest-neighbor variants such as average KNN and median KNN, and HBOS (histogram-based outlier score); probabilistic models such as ABOD (angle-based outlier detection) and its fast version, FastABOD; and support for outlier ensembles and combination frameworks such as isolation forest and feature bagging.

For this example, we first import the models from PyOD, such as ABOD, CBLOF, HBOS, and so on. Next, we generate some synthetic data. The data consists of two clusters, represented by X1 and X2, which are random points offset to the left and to the right. Then we add some random outliers, which are shown in this figure: the white circles are the normal data points belonging to the two clusters, and the black circles represent anomalies. Then we fit each model on the generated data and compare model performance. This figure shows the model of normal behavior and the anomalies for the different algorithms; the orange part represents the normal regions. As we can see, different algorithms model the normal region differently: some represent it as square blocks, some by a boundary, and some with elliptical models. The regions shown in blue are the anomalies; the darker the blue, the more anomalous the data point.

In this video we discussed a few anomaly detection techniques, based on clustering and parametric models. Thank you.
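As a supplement to the PyOD demo, here is a dependency-free sketch of the idea behind one of its proximity-based detectors: the KNN outlier score, where each point is scored by the distance to its k-th nearest neighbor and the largest scores are the anomalies. The data and function name are our illustrative choices; PyOD's actual `KNN` class offers the same idea via its `fit`/`decision_scores_` interface.

```python
# Sketch of the k-th-nearest-neighbor outlier score: points deep
# inside a cluster have close neighbors and low scores; an isolated
# point's k-th neighbor is far away, giving it a high score.
import math

def knn_scores(points, k=3):
    scores = []
    for p in points:
        # Distances from p to every other point, ascending.
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbor
    return scores

# Two small clusters (as in the PyOD demo) plus one injected outlier.
cluster1 = [(-2 + dx / 10, -2 + dy / 10) for dx in range(3) for dy in range(3)]
cluster2 = [(2 + dx / 10, 2 + dy / 10) for dx in range(3) for dy in range(3)]
data = cluster1 + cluster2 + [(0.0, 8.0)]

scores = knn_scores(data)
print(data[scores.index(max(scores))])  # → (0.0, 8.0), the injected outlier
```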