We will now talk about hierarchical clustering, the last clustering approach we will discuss in this lecture. In practice, clusters often have subclusters, these subclusters often have subclusters of their own, and we want to capture this stratification of clusters. For example, in this case these are two clusters, but they can be split further into smaller and smaller clusters. To capture this stratification, biologists construct trees where every leaf corresponds to a data point. Every horizontal line crossing this tree results in a set of clusters. For example, this line crossing the tree at four points results in four clusters, while the line crossing the tree at six points corresponds to six clusters. To construct the hierarchical clustering tree, biologists first transform the n x m expression matrix into an n x n similarity matrix, or distance matrix. For example, they can construct the distance matrix by simply computing the Euclidean distances between the points in the data set. As soon as this matrix is constructed, hierarchical clustering starts by representing all data points as leaves of the tree that is to be constructed. Afterwards, hierarchical clustering finds the smallest distance in the distance matrix. In this case, it finds that this distance is realized as the distance between the third point and the fifth point in the data set, so it connects the corresponding leaves and forms a new node that is the parent of these two leaves. Afterwards, it merges these two leaves into a single row (and column) of the matrix; I will describe later how the distances in this newly constructed matrix are computed. It once again finds the smallest distance. In this case, the smallest distance is realized between element two and element four, and they too are connected by edges to a parent node. I will describe later how to define distances from this parent node.
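The merging procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not the pseudocode from the slides: it assumes points in the plane, Euclidean distances, and (one possible choice, discussed below) the minimum pairwise distance between clusters.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hierarchical_clustering(points):
    """Repeatedly merge the two closest clusters until one remains.

    Returns a nested-tuple tree whose leaves are point indices.
    Cluster-to-cluster distance here is the minimum distance between
    their elements (single linkage); other rules are possible.
    """
    # Each cluster is a pair: (tree built so far, member point indices).
    clusters = [(i, [i]) for i in range(len(points))]
    while len(clusters) > 1:
        # Find the pair of clusters realizing the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Connect the two clusters to a new parent node and collapse them.
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]  # the root of the tree

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.2, 0.2)]
tree = hierarchical_clustering(points)
```

On this toy data set, points 0 and 4 are merged first, then point 1 joins them, then points 2 and 3 are merged, and finally the two remaining clusters are joined at the root.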
Once again, it collapses the leaves two and four into a single node in the tree and recomputes the matrix. In the newly constructed matrix, it once again finds the smallest element, and it continues in this way until the whole tree is constructed, in other words, until there is only a single cluster left, corresponding to the root of the tree. Here is the pseudocode for constructing a tree from a distance matrix. It pretty much follows the algorithm that I just described, while leaving out some details on how to compute the distances. The only thing that remains unaddressed in this pseudocode is how to compute the distance between a newly formed cluster and the remaining clusters in the current partition of the data into clusters. There are different approaches. For example, one can define the distance between two clusters as the average distance between their elements, or as the minimum distance between their elements. Of course, this may result in different clustering outcomes. These are six clusters built by hierarchical clustering for the same dataset that we looked at when we talked about k-means clustering. These clusters are often a starting point for biologists to decode difficult regulatory puzzles. For example, if you look at the first cluster, you will see that there is a sudden surge of expression at the last checkpoint for generating this data set. When biologists started looking more carefully at these genes, they saw that the upstream regions of many genes in this cluster contain the carbon source response element, with the motif shown on this slide. By noticing this, they started to decode the biological puzzle of yeast being able to switch from the glucose-metabolizing pathway to the ethanol-metabolizing pathway. Indeed, since yeast prefer glucose over ethanol as an energy source, genes responsible for metabolizing the less-preferred ethanol are repressed in the presence of glucose.
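Before moving on, the two linkage rules mentioned above, the minimum and the average distance between elements of two clusters, can be contrasted directly. A minimal sketch, assuming 2-D points and Euclidean distances (the function names here are ours, not from the pseudocode):

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # Minimum distance over all pairs of elements from the two clusters.
    return min(euclidean(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    # Average distance over all pairs of elements from the two clusters.
    dists = [euclidean(p, q) for p, q in product(c1, c2)]
    return sum(dists) / len(dists)

c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (9.0, 0.0)]
print(single_linkage(c1, c2))   # 3.0: the closest pair dominates
print(average_linkage(c1, c2))  # about 6.21: all four pairs contribute
```

Single linkage lets one close pair of points pull two clusters together, while average linkage takes the whole clusters into account, which is why the two rules can produce different trees on the same distance matrix.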
So finding the carbon source response element helped biologists decode the purpose of CSRE. Indeed, it turns out that CSRE activates these genes when the yeast runs out of glucose and starts consuming ethanol. I think we are done with the diauxic shift. Phillip, should we try this bottle of wine? >> Sure, I think so. >> Thank you. Our students may think that we do this after every class. >> Well, we actually do, don't we? Actually, Pavel, come to think of it, you mentioned cancer at the beginning. What happened with that discussion? >> Back to recording. In this lecture, we focused on the diauxic shift rather than on medical applications of expression arrays. In fact, expression arrays raise both hopes and concerns when it comes to oncology. For example, genetic studies reveal that breast cancer has more grades than previously thought, and as a result, there is a possibility of adding a genomic grade to the traditional histological classification of breast cancer. Likewise, microarrays allow doctors to distinguish between morphologically identical types of leukemia, leading to better clinical outcomes. However, there are also concerns about applications of expression arrays in clinical oncology. First, tumor sample preparation is critical for the accuracy and reproducibility of expression arrays, and standard procedures for tumor sampling and storage are still in development. Most importantly, bioinformatics models for classifying tumors must be extremely accurate if bioinformaticians want them to be applicable in medicine. Indeed, they should have extremely low false positive and false negative rates, because false positives lead to unnecessary surgeries, while false negatives lead to missing early stages of cancer.