The method that we're looking for, for distance-based phylogeny construction, is called the "Neighbor-Joining Algorithm", which was introduced in 1987. It's a result so fundamental to bioinformatics, in fact, that it has been cited 30,000 times, and it's one of the top 20 most cited papers across all scientific fields. So first, to see how this works, define the neighbor-joining matrix of a given distance matrix D as the matrix D*, whose (i,j)-th entry is given by the formula shown on the slide.

Let me give you an example. Here's the matrix that we encountered before, where j and k corresponded to the minimum element of the matrix, but j and k were not neighbors in the tree that fit this matrix. The total distance of a leaf is just the sum of distances from this leaf to all other leaves. So, for example, the total distance of k in this matrix is 21 plus 12 plus 13, or 46. We then define D*(i,j) by taking n minus 2 times D(i,j) and then subtracting the total distance of i and the total distance of j. In this case, n, the number of leaves (and the number of rows in our distance matrix), is equal to 4. So D*(k,l) is equal to 2 times D(k,l), which is 13, minus 46, minus 48, which gives us a value of negative 68 for D*(k,l).

Now, if it was obvious to you that defining this crazy matrix D* was going to be a productive endeavor, then you are probably far smarter than I am. Like the Burrows-Wheeler transform, which we encounter in another chapter and which makes matching multiple patterns to a text a lightning-fast endeavor, this matrix D* strikes me as black magic. Whereas taking a minimum element of the original distance matrix did not guarantee that we were going to find neighbors, taking a minimum element of the neighbor-joining matrix D* does find a pair of neighbors. You can find a proof of this theorem in a detour in the text. It's not a short proof.
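To make the definition concrete, here is a short sketch of the neighbor-joining matrix computation. Since the slide itself isn't shown here, the 4 x 4 matrix below (rows and columns ordered i, j, k, l) is my reconstruction from the numbers quoted above — row k is (21, 12, 13), D(k,l) = 13, and the total distances of k and l are 46 and 48 — so treat the exact values as an assumption.

```python
# Reconstructed example matrix (rows/columns ordered i, j, k, l);
# the entries are an assumption inferred from the values quoted in the text.
D = [
    [0, 13, 21, 22],   # i
    [13, 0, 12, 13],   # j
    [21, 12, 0, 13],   # k
    [22, 13, 13, 0],   # l
]

def total_distance(D, i):
    """Sum of distances from leaf i to every other leaf."""
    return sum(D[i])

def neighbor_joining_matrix(D):
    """D*(i,j) = (n - 2) * D(i,j) - TotalDistance(i) - TotalDistance(j)."""
    n = len(D)
    td = [total_distance(D, i) for i in range(n)]
    return [[0 if i == j else (n - 2) * D[i][j] - td[i] - td[j]
             for j in range(n)] for i in range(n)]

D_star = neighbor_joining_matrix(D)
# D*(k,l) = 2 * 13 - 46 - 48 = -68, matching the calculation above
```

Running this also shows that the minimum off-diagonal element of D* is -68, attained by both the pair (i, j) and the pair (k, l).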
In fact, although D* was introduced in 1987, it took a decade before a proof of this theorem was found. The Neighbor-Joining Theorem is also part of the reason why I led you down the wrong path in the first place, when we first attempted to construct a phylogeny from a distance matrix. If we use D* instead of D, then we will be able to successfully identify a pair of neighbors from a minimum element of the matrix.

So let's walk through the neighbor-joining algorithm. First, we construct the matrix D*, and I'll move that up from the previous slide. There are two pairs of leaves that correspond to a minimum element of D*: i and j must be neighbors, and k and l must be neighbors as well. Let's work with, say, i and j. We set delta(i,j) equal to the total distance of i minus the total distance of j, all divided by n - 2. When we plug those values in, we see that delta(i,j) is equal to 9.

Now, you may remember that we had a formula for LimbLength(i) before. The formula that neighbor-joining uses gives the same result when the underlying distance matrix D is additive, but it offers better results when we deal with non-additive distance matrices, and that's why this formula differs from the one that we had before. Here, LimbLength(i) is equal to (1/2)(D(i,j) + delta(i,j)), and LimbLength(j) is equal to (1/2)(D(i,j) - delta(i,j)).

Now that we have these limb lengths, we can remove the i-th and j-th rows and columns from D, and then add a row and a column corresponding to m, the parent node of i and j; we denote this smaller matrix as D'. For every other leaf k, we compute the distance between k and this parent node m as the distance from i to k, plus the distance from j to k, minus the distance from i to j, all divided by 2. Does this formula look familiar? I hope so.
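One step of this process can be sketched directly. As before, the 4 x 4 matrix is my reconstruction from the numbers quoted in the text (an assumption), with rows and columns ordered i, j, k, l:

```python
# One neighbor-joining step on the reconstructed example matrix
# (rows/columns ordered i, j, k, l; the entries are an assumption).
D = [
    [0, 13, 21, 22],   # i
    [13, 0, 12, 13],   # j
    [21, 12, 0, 13],   # k
    [22, 13, 13, 0],   # l
]
n = len(D)
i, j = 0, 1  # the pair of neighbors chosen from a minimum element of D*

td = [sum(row) for row in D]                    # total distances
delta = (td[i] - td[j]) / (n - 2)               # (56 - 38) / 2 = 9
limb_i = (D[i][j] + delta) / 2                  # (13 + 9) / 2 = 11
limb_j = (D[i][j] - delta) / 2                  # (13 - 9) / 2 = 2

# Distance from each remaining leaf k to the new parent node m:
# D(k, m) = (D(i, k) + D(j, k) - D(i, j)) / 2
d_km = {k: (D[i][k] + D[j][k] - D[i][j]) / 2
        for k in range(n) if k not in (i, j)}
```

With these values, delta(i,j) comes out to 9 as on the slide, the limbs of i and j have lengths 11 and 2, and the remaining leaves k and l sit at distances 10 and 11 from the new node m in D'.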
It's the same formula that we derived several sections ago, when we first attempted to reconstruct a tree from an additive matrix. So, returning to neighbor-joining, now that we have a smaller matrix, we can apply the neighbor-joining algorithm recursively to D' and obtain the tree on the right that fits this matrix D'. Once we have this tree, we can simply re-attach the limbs of i and j using the limb lengths that we previously computed.

So here's the summary of the neighbor-joining algorithm: we apply it recursively until we obtain a 2 x 2 distance matrix, at which point we connect the two remaining nodes by a single edge, and then we retrace our steps by adding limbs successively. Go ahead, try implementing it yourself. In particular, see what happens when the neighbor-joining algorithm is given a non-additive matrix, because we know what's going to happen when it's given an additive matrix. The tree constructed by UPGMA on the same non-additive matrix is shown on the right. How does the tree that you create by running the neighbor-joining algorithm differ from this UPGMA tree?

Of course, we can also apply neighbor-joining to the non-additive distance matrix that we derive from coronaviruses. But we can also apply it to a distance matrix that we obtain from pairwise alignment of Spike proteins taken from different patients, as well as from the palm civet coronavirus. When we do this, just from this small sample of ten coronaviruses, we can see a very clear pattern of how the disease was passed from the palm civet to residents in Southern China at the end of 2002, and then how it spread to other residents in early 2003. And then we see the cases from the Metropole Hotel grouping together as the disease went global. It probably seems like we've come a long way in our discussion of distance-based approaches for constructing evolutionary trees.
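If you want a starting point for your own implementation, the summary above can be sketched recursively as follows. This is only a minimal sketch: the node names (m4, m3, ...), the adjacency-list tree format, and the tie-breaking when several elements of D* are minimal are all my own choices, not fixed by the text.

```python
# Minimal recursive sketch of neighbor joining; the tree is returned as
# an adjacency list {node: {neighbor: edge_length}} (a format chosen here
# for illustration, not prescribed by the text).
def neighbor_joining(D, labels):
    n = len(D)
    if n == 2:  # base case: connect the last two nodes by a single edge
        a, b = labels
        return {a: {b: D[0][1]}, b: {a: D[0][1]}}
    td = [sum(row) for row in D]
    # choose i, j minimizing D*(i,j) = (n-2)*D(i,j) - td[i] - td[j]
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: (n - 2) * D[p[0]][p[1]] - td[p[0]] - td[p[1]])
    delta = (td[i] - td[j]) / (n - 2)
    limb_i = (D[i][j] + delta) / 2
    limb_j = (D[i][j] - delta) / 2
    m = "m" + str(n)  # fresh name for the parent of i and j
    rest = [k for k in range(n) if k not in (i, j)]
    # build D': remaining leaves plus the new node m, using
    # D(k, m) = (D(i, k) + D(j, k) - D(i, j)) / 2
    D2 = [[D[a][b] for b in rest] + [(D[i][a] + D[j][a] - D[i][j]) / 2]
          for a in rest]
    D2.append([(D[i][b] + D[j][b] - D[i][j]) / 2 for b in rest] + [0])
    tree = neighbor_joining(D2, [labels[k] for k in rest] + [m])
    # re-attach the limbs of i and j using the precomputed limb lengths
    tree.setdefault(m, {})
    tree[m][labels[i]] = limb_i
    tree[m][labels[j]] = limb_j
    tree[labels[i]] = {m: limb_i}
    tree[labels[j]] = {m: limb_j}
    return tree
```

On the reconstructed 4 x 4 example from earlier, this sketch recovers the tree that fits the matrix: limbs of length 11 and 2 for i and j, limbs of length 6 and 7 for k and l, and an internal edge of length 4.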
But we should point out that these algorithms don't allow us to infer anything about the ancestors that are assigned to the internal nodes of the tree. The reason is that when we passed from a multiple alignment to a distance matrix, we inherently lost information about the sequences themselves. In order to say anything about ancestral states, we're going to need an approach that can analyze the alignment directly.