Now, for the sake of variety, let's leave neural networks alone for a moment and come back to the general diagram of clustering methods that we introduced earlier. In the last video we spoke about the K-means algorithm as an example of so-called flat clustering, where all points in a cluster are equal, as clusters do not have any further internal structure. This time let's talk about hierarchical clustering methods, where the resulting clusters do have such internal structure via sub-clusters, sub-sub-clusters and so on. The first thing that should be said here is that nested hierarchical structures are very common among many different complex physical, biological and social systems. Even our own perception and information encoding reflect the hierarchical structure of the outside world. For example, we locate a formula in a book by a chapter, then a section, then a paragraph, or we define the address of a person by a country, state, city, street and house number. All of these are examples of nested hierarchical encodings of some data. But even more interesting is the fact that the hierarchical structure of interactions in complex systems also affects the dynamics of such systems. This was discussed at length in a beautiful paper by Phil Anderson, the winner of the Nobel Prize in Physics in 1977, which he published in Science magazine in 1972. I highly recommend that you read this paper, entitled "More Is Different", especially if you are interested in complex systems in the natural sciences, which is the main topic of the paper. But Anderson also touches upon economics and finance at the very end of his paper, and I can't resist the temptation to quote this for you here. Anderson wrote: "In closing, I offer two examples from economics of what I hope to have said. Marx said that quantitative differences become qualitative ones, but a dialogue in Paris in the 1920s sums it up even more clearly." Fitzgerald: "The rich are different from us."
Hemingway: "Yes, they have more money." So, after this elegant illustration of the link between quantitative and qualitative differences in society, let's go back to our main theme, which is finance. What would be the reasons to consider hierarchical structures, or hierarchical models, in quantitative finance? Well, one reason is that some financial instruments are hierarchical in nature. For example, there are global indices such as the S&P 500 in the marketplace, as well as sector indices or geography-based indices. These are examples of hierarchical structures in market instruments and market data. This is also reflected in how asset managers and risk managers tend to think about portfolio risks. They often define risk factors at the level of separate industries and geographical sectors, and then aggregate risks along both sectors and geographies, to obtain a view of the composition of the total portfolio risk along these dimensions. One way to model such hierarchical structures of risk factors in classical finance is via nested hierarchical factor models, where factors are observable and can be identified with market-provided values of global and sector indices. As opposed to this approach, here we follow a machine learning paradigm, and try to learn hierarchical structures directly from data, instead of having them imposed by some predefined model. It turns out that a hierarchical factor model, with independent factors at different levels of the hierarchy, can be extracted from a hierarchical clustering found by such a machine learning approach. One particular way to find a hierarchical clustering structure is called agglomerative hierarchical clustering. Here is how it works. You start with each data point forming its own cluster, then you update your clusters by adding to each point its closest neighbor. After that, you merge similar clusters. Then you repeat all these steps again, and so on, until convergence or until the needed number of clusters is found.
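As a quick illustration of this bottom-up merging loop, here is a minimal sketch in NumPy. The toy points and the choice of single linkage (minimum pairwise distance between clusters) are my own assumptions for the example, not part of the lecture:

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Bottom-up greedy clustering: start with each point as its own
    cluster, then repeatedly merge the two closest clusters until
    only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: cluster distance = min pairwise distance
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] = clusters[a] + clusters[b]  # greedy local merge
        del clusters[b]
    return clusters

# two well-separated toy groups (made up for illustration)
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(agglomerate(pts, 2))  # recovers the two groups
```

Note that each merge here is chosen only because it is locally the cheapest at that step, which is exactly the greedy property discussed next.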
Now, the algorithm that I just described is an example of a bottom-up greedy algorithm. It's called bottom-up because it starts with all points being in their own clusters and gradually builds more general clusters. It's called greedy because each cluster merger is only guaranteed to be optimal locally, for the given step, but not necessarily for the final objective. Now, it turns out that the complexity of agglomerative clustering algorithms is in general rather high, of the order of N squared times log of N. There are two special cases of agglomerative clustering that reduce the complexity to N squared. These methods are called single linkage and complete linkage algorithms; I will explain them shortly. So, what I described above is a very generic agglomerative algorithm. To make it more specific, we need to define the notion of distance between points and distance between clusters. Let's start with the first one. The choice of a metric defines the distance between two points x1 and x2 in a multi-dimensional space. The most popular choice in machine learning is the Euclidean metric, given in the first line here. An alternative is to use its square, which gives the squared Euclidean distance. Now, instead of summing the squares of the differences of x1 and x2 along different dimensions, we can sum their absolute values. This is equivalent to using the so-called Manhattan distance, named so in reference to the geometry of Manhattan, with its grid of perpendicular streets and avenues. Yet another alternative is to use the so-called maximum distance, which corresponds to the maximum of absolute differences across different dimensions. And finally, another interesting alternative is provided by the Mahalanobis distance, which is the same as the Euclidean distance, but computed after a rotation of the input vectors by some orthogonal matrix. Such an orthogonal matrix can even be made a free parameter and learned from the data.
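All of these point-to-point distances are available in `scipy.spatial.distance`; the two vectors below are illustrative values I chose for the example, not from the lecture:

```python
import numpy as np
from scipy.spatial import distance

x1 = np.array([1.0, 2.0])  # illustrative points
x2 = np.array([4.0, 6.0])  # componentwise differences are (3, 4)

d_euc = distance.euclidean(x1, x2)     # sqrt(3^2 + 4^2) = 5.0
d_sq  = distance.sqeuclidean(x1, x2)   # 3^2 + 4^2 = 25.0
d_man = distance.cityblock(x1, x2)     # |3| + |4| = 7.0  (Manhattan)
d_max = distance.chebyshev(x1, x2)     # max(|3|, |4|) = 4.0 (maximum distance)
# Mahalanobis takes an inverse covariance matrix VI; with VI = identity
# it reduces to the plain Euclidean distance (in general, VI can be
# estimated from data, or even treated as a learnable parameter)
d_mah = distance.mahalanobis(x1, x2, np.eye(2))
print(d_euc, d_sq, d_man, d_max, d_mah)
```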
This is called metric learning, and it constitutes another class of supervised learning algorithms. Now, after we discussed how to define the distance between individual points, let's discuss how we can define distances between clusters of points. I remind you that we need a measure of distance between clusters in order to merge different clusters in the agglomerative clustering method. There exist different ways to define the distance between two clusters, or sets of points, A and B. One way would be to define the distance between two clusters as the minimum of all pairwise distances between points in the two clusters. This is schematically shown in this diagram. If we adopt this rule, we get single linkage clustering, which, as we said before, has a quadratic complexity in the number of points. Another possible choice is to take the maximum of all pairwise distances instead of the minimum. This corresponds to another specification of agglomerative clustering called complete linkage clustering. This method also has quadratic complexity, similar to the single linkage method. Yet another method is to take the average of all pairwise distances instead of taking the minimum or maximum. Such an algorithm is expected to behave in some ways in between these two cases. This definition leads to the average linkage clustering method. In your homework you will have a chance to try all three such specifications of linkage on stock market data and see the differences that they produce. Now I want to show you how you can represent the results of hierarchical clustering. This is done using a dendrogram. A dendrogram is a binary tree that represents the process of merging sub-clusters in hierarchical clustering. Steps in drawing a dendrogram correspond to steps in the construction of the hierarchical clustering. You start from the bottom up, with the bottom-most leaves, which in this case represent individual items in the data set.
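The three linkage rules, and the dendrogram built from the resulting merge sequence, can be tried with `scipy.cluster.hierarchy`. The toy one-dimensional data below (two tight groups) is made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# toy data: two tight groups (made up for illustration)
X = np.array([[0.0], [0.2], [0.3], [5.0], [5.1]])

# the three linkage rules discussed above
Z_single   = linkage(X, method='single')    # min of all pairwise distances
Z_complete = linkage(X, method='complete')  # max of all pairwise distances
Z_average  = linkage(X, method='average')   # mean of all pairwise distances

# each row of a linkage matrix records one merge; column 2 is the
# merge height (dissimilarity), so the last row is the top of the tree
print(Z_single[-1, 2], Z_complete[-1, 2], Z_average[-1, 2])

# dendrogram() turns a linkage matrix into the tree drawing; no_plot=True
# returns the layout data without requiring matplotlib
tree = dendrogram(Z_single, no_plot=True)
print(tree['ivl'])  # leaf order along the bottom of the dendrogram
```

For the same data, the final merge height differs by linkage: single gives the smallest value, complete the largest, and average sits in between, just as described above.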
Sub-clusters that are merged are connected by lines, where the height of the vertical line measures the amount of dissimilarity between the different sub-clusters. Therefore, the y-axis on this graph measures the dissimilarity between different sub-clusters. Now, how many clusters we have in agglomerative clustering really depends on where we draw a horizontal line on this diagram. Such a line corresponds to the resolution with which we look at our hierarchical clusters. If we draw such a line exactly at the level of the x-axis, then we get all the original data points as separate clusters. If, on the other hand, we draw it at the very top of the dendrogram, just below the final merger, we will have only two clusters. For any intermediate level of dissimilarity between different sub-clusters, we may have some intermediate number of resulting clusters. So, the number of clusters would be an output of agglomerative clustering methods. And this is very different from K-means clustering, which requires us to specify the number K of clusters beforehand. At this point, let's check what we learned so far, and then move on to applications of clustering methods to finance.
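Cutting the dendrogram with a horizontal line corresponds to `fcluster` with a distance threshold; the number of clusters then comes out as a result rather than going in as an input. The toy data here is again made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy one-dimensional data: two tight groups (made up for illustration)
X = np.array([[0.0], [0.2], [0.3], [5.0], [5.1]])
Z = linkage(X, method='single')

# "drawing the horizontal line" at dissimilarity level t = cutting the
# dendrogram at that height; lower t -> more, finer clusters
for t in [0.05, 1.0, 10.0]:
    labels = fcluster(Z, t=t, criterion='distance')
    print(t, len(set(labels)))
```

Here a cut below the smallest merge height leaves every point as its own cluster, an intermediate cut recovers the two groups, and a cut above the top of the tree lumps everything into one cluster.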