All right, in this week's lab, we will be doing phylogenetics. Phylogenetics is the study of evolutionary relationships. And basically what we're doing is we're converting DNA or protein sequence data into a branching diagram or a tree that shows the relationships between the sequences. In terms of the anatomy of a tree, we've got certain parts to it. We've got branches or edges. We've also got nodes. The terminal nodes, on the right in this case, are also called the taxa or the sequences or the operational taxonomic units. A and B would form a clade, in this case. They're closely related. A, B, and C could also be considered a clade if we move further back along the tree. The time vector for this tree is from earlier on, on the left-hand side of the screen, to present day on the right-hand side of the screen. Trees can have many shapes. It's not really so much how they're displayed. What we're looking for is the relationship between the sequences. So, if we have a tree that describes the relationship of A and B to C and D, in this case on the left-hand side of the screen, we could also represent that tree as follows. It looks a bit different but the relationships between A and B and C and D are the same. We can also rotate around any given node: the tree would look different but the relationships remain the same. So we could also rotate around internal nodes, and the tree topology remains the same and relationships remain the same. The tree looks a bit different but, again, it's the relationship between the sequences that we're interested in. For any possible rooting, there are 2 to the power of N-1 possible arrangements, where N is the number of terminal nodes, because each of the N-1 internal nodes of a bifurcating tree can be rotated. Tree growth happens as follows. So let's consider an ancestor to sequences A, B, C, and D. This is way back in evolutionary time. Over time, there are speciation events, and for instance A could go off on its own. And the ancestor of B, C, and D could go off on its own.
And, again, there could be a speciation event leading to the ancestor of C and D. A and B are maintained as species. And then at each node, we can say that that's the most recent common ancestor of whatever's below that node. So in this case, this node would represent the ancestor of B, C, and D. And those changes that happen over time are actually mutations in the DNA sequence. And those mutations appear in all subsequent generations or in subsequent speciation events. And so a mutation that would happen up here would be in the ancestor of B, C, and D. If another mutation were to occur here, we would have that appearing in the sequence associated with the ancestor of C and D. And then finally, a mutation that would occur leading to species D might be something that would occur in this sequence, but not in any of the other sequences. And you can see that as time progresses, mutations get maintained in certain sets of sequences according to the evolutionary relationships. There are different kinds of representations of trees. Here's an unrooted tree. We can also root trees, and rooted trees have one node from which all the other nodes descend. And this implies a direction corresponding to evolutionary time. So if we had an outgroup here, we could have this particular topology. If we had a different rooting point, we could have that particular topology or this particular topology, that particular topology or that particular topology. You can also represent a tree with this Newick format here. And this is just a bracketed notation that denotes the relationships as described by the tree. We can have midpoint rooting, where basically we take the center of gravity of the tree and come up with something like this. Or we could have an outgroup rooting, where we choose a species that's quite distant from the species that we're interested in, and that's called the outgroup. So there is some terminology associated with phylogenetics.
The ancestral state is also known as plesiomorphy. The derived state is called apomorphy. Autapomorphy is a unique derived state, synapomorphy is a shared derived state. Homoplasy is similarity due to parallel evolution, convergent evolution or secondary loss. And homology is similarity due to common ancestry. We've already talked about that in previous lectures. So, just to put that in the context of a tree, the ancestral state occurs at the start node. And then we can have derived characters where mutations are acquired along a certain branch, and then these other terminal nodes here would have the ancestral character. Here in this case, we've got homoplasy occurring, which means that a character appears on different branches that aren't directly related to one another. In terms of homoplasy, we can have parallel evolution, which is the independent evolution of the same character from the same ancestral state. So here we've got an ancestral state. And this is the independent evolution of a character along two different branches of our tree. Convergent evolution is the independent evolution of the same character from a different ancestral state. So here, we've got one ancestral state and this branch has gone on to have some character that also appears in a different branch, that has a different ancestral state. We can also have a secondary loss, which is the reversion to the ancestral state. So here's the ancestral state. At this point here, we've acquired a mutation and then there's a back mutation to the ancestral state. So these two characters are homoplasic. At the DNA sequence level, we can achieve these homoplasies in a couple of different ways. We could have parallel substitutions, in one sequence and in another sequence. So here, the T is being mutated to an A. And in this case, the T is also being mutated to an A. We could have a convergent substitution, so... in this case, the A mutates to a C which then mutates to a T.
And here, we've just got an A being mutated to a T. We can also have a back substitution to the ancestral state. Start off with a C, mutate to a T, and then mutate back to a C. So in terms of phylogenetics, and what we actually need to do phylogenetics, we need to have good sampling across taxa to be able to construct a tree that's representative of the taxa that we're interested in. We need to have homologous sequences. We need to have some variation in those sequences. And those sequences should be independent from one another, not too closely related. We also need data, some sequence alignments. We need to use a phylogenetic method. And it would be nice to have statistical support. So there are a couple of different tree-building methods that we'll be exploring. One of these is a distance-based method called neighbour-joining. We'll briefly mention UPGMA and why this isn't a good method. Then, we can also use character-based methods, such as maximum parsimony or maximum likelihood, to construct our tree. And one thing to keep in mind when we're building trees is recombination. We need to think about how recombination would affect interpretation of the tree. So what happens under recombination is that, say, one region of DNA that contains a lot of changes might recombine with a region of DNA that doesn't contain as many changes. And that would skew our interpretation of the phylogenetic relationships between the sequences. So distance-based methods are based on sequence similarity. The advantages of distance-based methods are that they're computationally fast and that a single best tree is found. The disadvantages are the assumptions that distances are additive and that there's a molecular clock. And also, information loss occurs due to data transformation. Sometimes the branch lengths are uninterpretable. And it can also be considered a disadvantage that only a single best tree is found.
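As a rough illustration of what a "distance" between sequences can mean, here is a minimal sketch that computes the proportion of differing sites (often called the p-distance) for every pair of aligned sequences. This is a simpler measure swapped in for illustration, not the scoring-matrix approach mentioned in the lecture, and the four sequences are made up.

```python
from itertools import combinations

# Hypothetical aligned sequences (equal length, no gaps), made up for illustration.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGGAT",
    "D": "TCGAACGGAT",
}


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


# Pairwise distance matrix stored as a dict keyed by (name, name) pairs.
distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(sorted(alignment), 2)
}

for (x, y), d in distances.items():
    print(f"{x}-{y}: {d:.2f}")
```

Collapsing each pair of sequences to a single number like this is what makes distance methods fast, and it is also where the information loss mentioned above comes from.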
In the case of neighbour-joining, the way it works mechanistically is that we calculate all pairwise distances between sequences. And here, we're using a BLOSUM matrix or a PAM matrix, and asking how similar the sequences are to one another. We create a distance matrix. We determine the net divergence for each terminal node, that is, for each sequence. And then, we create a rate-corrected distance matrix. We identify the taxa with the minimum rate-corrected distance. And then, we connect the taxa with the minimum rate-corrected distance to create a new node, and then determine their distance from this new node. Then, we determine the distance of the new node from the rest of the taxa or nodes. We regenerate the distance matrix, and then we return to step two, and continue cycling through the process until all the nodes are connected. So in terms of our distance-based phylogenetic methods, I mentioned UPGMA versus neighbour-joining. A UPGMA tree typically will not reflect the true phylogenetic relationships. This is just an example showing the rate-corrected distances on the lower left of this diagonal over here. And these are the non-rate-corrected distances. And in the case of the non-rate-corrected distances, A and B show the minimum distance, and they're connected to one another via this particular node here, and are most closely related. However, if we're looking at the rate-corrected distances, actually sequences C and D are most closely related to one another. So we really should be using neighbour-joining to generate trees if we're going to use a distance-based phylogenetic method. Another kind of method that we can use is a character-based phylogenetic method, as exemplified by maximum likelihood. Maximum likelihood attempts to answer the question: what is the probability of observing the data, so the sequence data, given a particular model of evolution and evolutionary history?
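Before continuing with maximum likelihood, the rate-correction step of neighbour-joining described above can be illustrated with a minimal sketch. It covers only the pair-selection step, not the full algorithm. The distances are made up (they are not the values from the lecture's slide), and the rate-corrected distance is assumed to take the usual form d(i,j) - (r(i) + r(j)) / (n - 2), where r is the net divergence.

```python
from itertools import combinations

taxa = ["A", "B", "C", "D"]

# Hypothetical additive distances from a tree in which A pairs with C and
# B pairs with D, with short branches to A and B and long branches to C and D.
raw = {
    ("A", "B"): 3, ("A", "C"): 6, ("A", "D"): 7,
    ("B", "C"): 7, ("B", "D"): 6, ("C", "D"): 11,
}


def dist(i, j):
    """Symmetric lookup into the raw distance table."""
    if i == j:
        return 0
    return raw[(i, j)] if (i, j) in raw else raw[(j, i)]


def net_divergence(i):
    """Sum of distances from taxon i to every other taxon."""
    return sum(dist(i, j) for j in taxa)


def rate_corrected(i, j):
    """Raw distance minus the average net divergence of the two taxa."""
    n = len(taxa)
    return dist(i, j) - (net_divergence(i) + net_divergence(j)) / (n - 2)


pairs = list(combinations(taxa, 2))
raw_pair = min(pairs, key=lambda p: dist(*p))           # closest by raw distance
nj_pair = min(pairs, key=lambda p: rate_corrected(*p))  # pair joined first

print("closest by raw distance:", raw_pair)
print("closest by rate-corrected distance:", nj_pair)
```

In this made-up case the raw distances group the two slowly evolving sequences A and B, whereas the rate-corrected distances pick A and C (tied with B and D), which are the true neighbours in the tree the distances were generated from.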
So the data are multiple sequence alignments, and the model is transition probability, base frequencies, rate heterogeneity, and I'll talk about those in a bit. The evolutionary history is the phylogenetic tree. And what maximum likelihood does is it evaluates the likelihood of every possible tree. All possible trees are considered, and for each one the substitutions that must have occurred are evaluated under the model. So the tree with the highest likelihood is assumed to be the correct tree. So, roughly speaking, trees that need fewer substitutions to explain the data tend to come out as more likely. So what is likelihood? Here's an example with a coin toss. Likelihood is the probability of observing the data given a model. So if the data are six coin tosses with the results of heads, heads, tails, heads, tails, heads, we could have three models that we could use in our likelihood method. We could have a fair coin model, where the probability of observing heads is equal to the probability of observing tails. We could have model two, which is a two-headed coin. The probability of observing a head in this case is 1. The probability of observing a tail in this case is 0. And we could also have model three, which is a two-tailed coin. And here, the probability of observing a head on a two-tailed coin is 0. And the probability of observing a tail is 1. So what is the likelihood of the data given model one, which is our fair coin? That would simply be the probability of observing a head at the first position given model one, times the probability of observing a head at the second position, and so on. Multiply all six of those probabilities together, 0.5 to the 6th, to come up with an overall probability of 0.0156. So the likelihood of observing these data here given model two, our two-headed coin, is 0, because we can't actually observe any tails with the two-headed coin. And likewise, the likelihood of observing the data given model three, our two-tailed coin, is also 0.
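The coin-toss example above can be written out as a minimal sketch. The three models and the six tosses are the ones from the example; the variable and function names are just illustrative.

```python
# The six observed tosses: heads, heads, tails, heads, tails, heads.
tosses = "HHTHTH"

# Each model gives the probability of a head (H) and of a tail (T).
models = {
    "fair coin": {"H": 0.5, "T": 0.5},
    "two-headed coin": {"H": 1.0, "T": 0.0},
    "two-tailed coin": {"H": 0.0, "T": 1.0},
}


def likelihood(data, model):
    """Probability of the whole sequence of tosses under one model."""
    result = 1.0
    for outcome in data:
        result *= model[outcome]
    return result


likelihoods = {name: likelihood(tosses, m) for name, m in models.items()}
best_model = max(likelihoods, key=likelihoods.get)

for name, value in likelihoods.items():
    print(f"{name}: {value}")
print("maximum likelihood model:", best_model)
```

The fair coin gives 0.5 multiplied by itself six times, which is 0.015625, and both of the other models give 0, so the fair coin is the maximum likelihood model for these data.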
So with maximum likelihood, what we're trying to do is find the model that maximizes the likelihood of the observed data. And if this is our DNA sequence here, this is just another example of how likelihood might work. The data are GGACGCCT and so on. If model one is an equal base composition, whereby the probability of each base is 0.25, we can compute the likelihood of these data given model one as the probability of observing a G in the first position given the model, which is 0.25, times the probability of observing a G at the second position given model one, which is also 0.25, and so on. So basically, the overall likelihood is 0.25 to the 20th, because there are 20 nucleotides here, which is about 9 times 10 to the -13. We can also compute the likelihood given some other kinds of models. So here's a model that has a GC bias, whereby the Gs and Cs are over-represented, and the likelihood would be 0.4 to the 16 times 0.1 to the 4, which is 4.3 times 10 to the -11. And then we could also compute the likelihood for a model which gives an AT bias, and here our computation will give a likelihood of 2.6 times 10 to the -18. So the maximum likelihood is this one, which is the GC bias. So what we can say is that this sequence likely comes from a GC-biased region of the DNA. So that's how we would use likelihood and models to figure out the maximum likelihood. So, with maximum likelihood models in phylogenetics, what we're doing is we're finding the tree topology with the highest likelihood, given a particular evolutionary model. Our models, our nucleotide substitution models, can have two components, which are the composition of nucleotides, so the nucleotide proportions, and how the nucleotides change over time. The advantages of maximum likelihood-based methods are that they are based on explicit evolutionary models, they permit statistical evaluation of the likelihood of specific tree topologies, they often return many equally likely trees, and they usually outperform other methods.
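As a brief aside on the base-composition example earlier in this passage, here is a minimal sketch of the same arithmetic done on a log scale. The full 20-nucleotide sequence is not reproduced in the transcript, so the sketch works from the counts implied by the quoted numbers (16 G or C sites and 4 A or T sites). The AT-biased probabilities (0.4 for A and T, 0.1 for G and C) are an assumption inferred from the quoted result of 2.6 times 10 to the -18.

```python
import math

# Site counts implied by the lecture's arithmetic: 16 G/C sites and 4 A/T sites.
counts = {"GC": 16, "AT": 4}

# Per-base probability of a G or C site and of an A or T site under each model.
# The AT-biased values are an assumption inferred from the quoted likelihood.
models = {
    "equal": {"GC": 0.25, "AT": 0.25},
    "GC-biased": {"GC": 0.4, "AT": 0.1},
    "AT-biased": {"GC": 0.1, "AT": 0.4},
}


def log_likelihood(per_base_prob):
    """Sum of log probabilities over all sites, grouped by base class."""
    return sum(n * math.log(per_base_prob[k]) for k, n in counts.items())


log_liks = {name: log_likelihood(p) for name, p in models.items()}
best = max(log_liks, key=log_liks.get)

for name, value in log_liks.items():
    print(f"{name}: log-likelihood {value:.2f}, likelihood {math.exp(value):.2e}")
print("maximum likelihood model:", best)
```

Working with the sum of logs rather than the raw product gives the same ranking of models and avoids multiplying many very small numbers together.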
However, the disadvantages are that maximum likelihood methods, as you can imagine, are computationally very intensive. And they also return many equally likely trees, which can be unsatisfying, but that's the way it is. So how does it work, when we use an actual example? So consider these four sequences and the position j in those sequences, where we see Cs in position j for sequences 1 and 2, and an A and a G in sequences 3 and 4. We can come up with a given tree for the taxa, and we can arbitrarily root that tree, and the nucleotides are the terminal nodes in this case. And then we compute the likelihood of observing those outcomes, the CCAG, given various ancestral states and changes. We ask what changes are required to achieve these outcomes, given certain ancestral states for those nucleotide sequences. And we compute that by multiplying the probabilities of the changes along the branches for each possible set of ancestral states and adding those up, and we do that for all sites, and we come up with an overall product here, which is depicted in this formula. And then we usually evaluate that as the sum of the log of the likelihoods to make the computation easier. Maximum likelihood evaluates all possible ancestral states at all variable sites, and in all possible tree topologies. And the most likely tree is the topology that has the highest overall likelihood. And as I mentioned, we can have a number of different models of sequence evolution, and these are depicted here. We've got Jukes-Cantor, where the base frequencies are equal and all substitutions are equally likely. Then we can have a couple of different models here that allow for a transition/transversion bias, and these are kinds of changes in DNA sequences. Here we can allow the base frequencies to vary in the Felsenstein 81 model. We can actually allow for different transition/transversion rates and unequal base frequencies in the General Time Reversible model.
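To make the site-likelihood calculation above more concrete, here is a minimal sketch for one site on a rooted four-taxon tree ((1,2),(3,4)) under the Jukes-Cantor model. It is a brute-force sum over every combination of ancestral states, not an efficient implementation. The branch length of 0.1 expected substitutions per site on every branch and the extra alignment columns are assumptions chosen for illustration.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor probability of base a becoming base b along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(tips, t=0.1):
    """Likelihood of one alignment column on the tree ((1,2),(3,4)).

    tips holds the bases of sequences 1 to 4. The sum runs over all states of
    the root and of the two internal nodes; all six branches have length t.
    """
    s1, s2, s3, s4 = tips
    total = 0.0
    for root, left, right in product(BASES, repeat=3):
        total += (
            0.25  # equal base frequencies at the root
            * jc69(root, left, t) * jc69(left, s1, t) * jc69(left, s2, t)
            * jc69(root, right, t) * jc69(right, s3, t) * jc69(right, s4, t)
        )
    return total


# The column from the example: C, C, A, G in sequences 1 to 4.
print("site likelihood for CCAG:", site_likelihood("CCAG"))

# Over several columns (hypothetical ones here), the per-site likelihoods are
# multiplied, which is evaluated as a sum of logs.
columns = ["CCAG", "CCCC", "AAGG"]
log_likelihood = sum(math.log(site_likelihood(col)) for col in columns)
print("log-likelihood over columns:", log_likelihood)
```

A full maximum likelihood search would repeat this for each candidate topology and set of branch lengths and keep the tree with the highest total log-likelihood.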
When we generate a tree, we would like to have some confidence that the relationships that we're seeing are in fact robust, i.e. they're real. What we can do is we can use bootstrapping to actually compute the confidence that we have in a given tree. So the way bootstrapping works is that we would take our original sequences and tree and compute some relationships here. So here we see that A and B are related more closely to one another than they are to C and D, and conversely, C and D are more closely related to one another than they are to A and B. Here's a sort of cartoon of some part of the sequence that we're interested in. And what we can do is we can create pseudo-replicates, that's the PR representation there. We can create 500 to 1000 pseudo-replicates, whereby we sample each column some number of times or no times, so pseudo-replicate one might contain one instance of the first column, three instances of the second column, one instance of the third column, no instances of the fourth column and so on. And each pseudo-replicate would be a slightly different sampling of the columns. And then with those pseudo-replicates, here is pseudo-replicate 1, where you can see that we've sampled the first column once, the second column three times, the third column once and so on. And what we do with those pseudo-replicates is we also generate trees. And then for each tree that's generated with those pseudo-replicates, we actually ask, how many times do we see the same groupings that we see with our original tree? And we just count that and denote that at each particular node, so in the case of this particular example, we see that A and B are always grouping together. C and D, however, do not always group together. They only group together, in this particular example, 75% of the time. So the bootstrap score for this particular node would be 75, and the bootstrap score for the A-B node would actually be 100.
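The resampling and counting steps just described can be sketched as follows. This is a minimal illustration, not a full bootstrap: the alignment is made up, tree building is left out, and the clades listed for the four pseudo-replicate trees are hypothetical values chosen to reproduce the 100 and 75 scores from the example.

```python
import random

# Hypothetical alignment, made up for illustration.
alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACTTGCGT",
    "D": "ACTTGCGA",
}


def pseudo_replicate(aln, rng):
    """Resample alignment columns with replacement, keeping the original length.

    Returns the resampled alignment and the list of sampled column indices.
    """
    n_cols = len(next(iter(aln.values())))
    sampled = [rng.randrange(n_cols) for _ in range(n_cols)]
    resampled = {
        name: "".join(seq[i] for i in sampled) for name, seq in aln.items()
    }
    return resampled, sampled


def bootstrap_support(clade, replicate_clades):
    """Percentage of pseudo-replicate trees that contain the given clade."""
    target = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if target in clades)
    return 100.0 * hits / len(replicate_clades)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate, columns = pseudo_replicate(alignment, rng)
print("sampled columns:", columns)

# Hypothetical clades found in trees built from four pseudo-replicates.
replicate_clades = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("ABC")},  # here C groups with A and B instead
]

print("A-B support:", bootstrap_support("AB", replicate_clades))
print("C-D support:", bootstrap_support("CD", replicate_clades))
```

In practice the number of pseudo-replicates would be in the hundreds, and the clades would come from trees built from each resampled alignment with the same method used for the original tree.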
So there are a couple of assumptions that we make when we bootstrap. One is that the data size is large enough to accurately reflect the true error distribution, and the other is that the data are identically and independently distributed. Typically, if you see bootstrap values above 90%, you would say that the node is strongly supported. If the bootstrap values are between 70 and 90%, you would say the node is well supported, between 50 and 70% you could say the node is weakly supported, and less than 50% is not supported. And what you can do is you can actually create a bootstrap consensus tree, which collapses nodes where we don't have good bootstrap support. So for instance, we don't have good bootstrap support at this node, so we can't really be sure of the relationship of D to sequences A, F, and E, in this case. So we would depict that kind of relationship with this trifurcation instead of a bifurcation. It can make a difference, so this is just a tree that was published several years ago. It looked at the control region of mitochondrial DNA, a relatively small amount of sequence information, and the initial tree was published without bootstrap support. The tree describes the relationship of various human populations to one another, and you would think that, seemingly, there are some subgroups within that tree that might make sense. But when you actually bootstrap this tree and look at the bootstrap scores for these nodes, they're actually quite low, so our bootstrapped consensus tree actually looks like this. Which is not to say that there aren't differences between human groups that exist. However, with the data as published in this paper, this small part of the mitochondrial DNA control region, we can't actually reliably make inferences about the groupings of these human populations. And just keep that in mind when you try to interpret your phylogenetic trees. I hope you enjoy the lab and see you next week.