We're still looking for a way of dealing with non-additive matrices. So in this section, let's try and develop a heuristic. In practice, researchers often assume that every internal node of a phylogeny corresponds to a speciation event, which is defined as an event where one species divides into two. In this case, we need to place limits on the internal nodes of the tree. For example, if we're working with unrooted trees, our requirement is that every internal node needs to have degree 3. This means that regardless of where the root is, when we progress from the root to a leaf, every time we encounter an internal node, the tree splits into two pieces. For example, if we place the root on the limb of the squirrel monkey here, then as we progress through the tree, it splits into two pieces, splits, splits, and so on. If we root the tree, then, at that location, we obtain the following rooted binary tree. Our goal, then, is to design a heuristic that models a "molecular clock" that assigns each leaf an age. The age of each leaf node is 0. And then we need to assign ages to the internal nodes. The age of an internal node is going to represent how long ago the speciation event represented at that internal node occurred. So for this monkey phylogeny, the ages are in millions of years. Once we assign ages to the nodes of a rooted binary tree, we get the weights of the edges of this tree (here shown in red) simply by taking the difference between the ages. So for example, the age of this node is 33. The age of that node is 23. We get the edge weight 10 for free just by subtracting the two ages. Furthermore, if we assign ages to the nodes of the tree, then it automatically implies that the tree is what's called "ultrametric", meaning that the length of every path from the root to a leaf is the same. So, here the age of the root is 33 million years, and the length of any path from the root to a leaf node is 33. The heuristic that we're going to introduce for constructing an ultrametric, rooted binary tree is called "UPGMA", which stands for "Unweighted Pair Group Method with Arithmetic Mean". UPGMA is going to construct an evolutionary tree by clustering the species from the distance matrix into larger and larger clusters, beginning with single element clusters. At each step, it looks for the two closest clusters, according to the average distance among all pairs of elements taken from the two clusters. Now, at this stage of the algorithm, we're dealing with single element clusters, so if we're looking for the closest clusters, that's just the smallest element of the distance matrix, which corresponds to k and l. Now, I hope you're worried because we've seen the perils of choosing the minimum element of a distance matrix as neighbors. But UPGMA is a heuristic, and it's not going to break like adding additive phylogeny did. So once we've found the two closest clusters, C1 and C2, we can merge them into a single cluster, C. Here we put k and l into a cluster together because they're the closest. We then conform a node for C and connect that node to both C1 and C2. We then set the age of this internal node equal to half of this average distance between C1 and C2. So we saw that the distance from k to l was 2, so the age of this node is just 1. We then recompute the average distance between each pair of clusters and we form an updated distance matrix. Now, we then just begin this process over, choosing a minimum element of our matrix, merging the two associated clusters, forming a new node for the merged clusters (which here is (i,j)) and then assigning the age of this node as half of the distance value from the matrix, which was 3 here, which gives us an age of 1.5. After one more update of the distance matrix, our only choice is to merge the remaining two clusters, assigning an age of 2, and this yields the final UPGMA tree. Now, here is the outline of UPGMA provided on a single slide. Please feel free to just pause the video here if you need a little more time to understand the algorithm. You should now be ready to implement UPGMA and apply it to the coronavirus data. Of course, Son will complain because we never have really stated a computational problem that UPGMA is trying to provide an approximate solution for, but that's okay. The goal of UPGMA is not to fit a distance matrix, but to provide a reliable method that always can construct an ultrametric tree, regardless of what the input data is. So, we know that the tree produced by UPGMA can't possibly fit this matrix, because the matrix is non-additive. You can see, for example, that the distance from i to l is 3 in the distance matrix, but 4 according to the UPGMA tree. So in summary, we've encountered two methods for constructing an evolutionary tree from a distance matrix. Additive Phylogeny works if you give it an additive matrix as input. But then, on the other hand, it fails completely on non-additive matrices. Then, UPGMA, in contrast, will produce a tree for any input that you like, but the tree produced doesn't necessarily fit an additive matrix in a natural way. So, can we possibly hope for an algorithm, one single algorithm, that's going to combine the better features of both of these algorithms, in that it always produces the tree that fits an additive matrix, but it also offers a good heuristic for non-additive matrices?