[MUSIC] So how do we construct a phylogeny from molecular data? Describing how organisms look, let alone extracting the characters from the description, as we have just done, is often the work of specialists. It takes a totally different vocabulary to describe a bird, an earthworm, a butterfly, and a jellyfish, just to name a few of the endless options. Even things that are called the same may have virtually no similarity. Thus, the central bone of the thigh of a human, called the femur, has not even a superficial resemblance to the femur or thigh of an insect, except perhaps in its function. In contrast, at least the most commonly encountered molecular data seem deceptively simple. DNA, or deoxyribonucleic acid, consists of long strands of only four building blocks, or nucleotides, named after the four bases they contain: A for adenine, C for cytosine, G for guanine, and T for thymine. RNA is composed of the same building blocks, except that thymine is replaced by uracil. DNA stores the genetic information used by the organism to function and develop into a fully grown organism; it carries the blueprint for the organism and its function. All organisms have DNA, and in both animal and plant cells it is found in the nucleus. Additionally, DNA is found inside the energy factories of the cells, the mitochondria, found in most cells of both animals and plants. Green plants, at least, have even a third place where DNA is stored, the plastids, most notably the chloroplasts, where assimilation, the creation of sugar and oxygen from carbon dioxide and water, takes place. Thus animal cells have two, and plant cells three, compartments that contain DNA. Briefly, to return to RNA, it is found all over the cell. The information in both RNA and DNA can be used to reconstruct phylogenies. Here we will concentrate on DNA, which is by far the most widely used for the purpose. As indicated, a DNA molecule can be written as a long, straight string of only four letters, say AGTTGGGC, etc, etc.
Even though there are only four letters, how many different DNA strands can be made of only 10 nucleotides? 40, 10 raised to the power of four, or four raised to the power of 10? Well, it's actually four raised to the power of 10, which corresponds to more than 1,000,000 different strands. So the possibilities seem endless, as a 10 nucleotide long DNA sequence is unbelievably small. Even a small DNA molecule like a chloroplast DNA strand is composed of around 150,000 nucleotides on average. Doesn't it strike you as odd that there is DNA in two different compartments of animal cells and three in plants? What could be the possible explanation for this? Well, think about it: is it because there has always been DNA in all three compartments? Are the plastids and mitochondria perhaps bacteria that were engulfed by a larger organism earlier in time? Or is it just pieces of DNA broken off from the nucleus? Well, unfortunately, we do not have the time to look closer into this. But I would recommend that you look up the endosymbiotic theory on the web and read about it. It is a truly fantastic story. So let's get back on track. There are a couple of things worth knowing about the structure of DNA. Inside the cell, DNA usually occurs as a double-stranded molecule. It is composed of two strands that, technically speaking, run in opposite directions; they are said to be anti-parallel. On the left side of this slide, one runs from the top to the bottom and the other in reverse. You might be surprised that molecules have direction. But it is a matter of how they are built: they can only be extended in one direction. The two strands are not only paired, they are also intertwined in a double helix, almost like a spiral staircase, where the steps are base pairs, one base from each strand of the DNA molecule, as can be seen in the middle and on the right side of this slide.
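As a small aside, the counting question from this segment can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_sequences` and the constant `DNA_ALPHABET` are assumptions made for the example and are not part of the lecture.

```python
from itertools import product

DNA_ALPHABET = "ACGT"  # the four bases named in the lecture


def count_sequences(length: int, alphabet: str = DNA_ALPHABET) -> int:
    """Number of distinct strands of a given length: four choices per position."""
    return len(alphabet) ** length


# Brute-force check for a short length: write out every strand explicitly.
all_3mers = ["".join(p) for p in product(DNA_ALPHABET, repeat=3)]
assert len(all_3mers) == count_sequences(3)

print(count_sequences(10))  # 4**10 = 1048576, just over a million
```

The same function applied to a length of 150,000, roughly the size mentioned for a chloroplast genome, returns a number with tens of thousands of digits, which is why enumerating the possibilities is never an option.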
We talk about base pairs because a rule of nature applies such that a T is always facing an A, and a C is always facing a G. T's and A's, and G's and C's, are not connected by strong chemical bonds, but held together by weaker intermolecular attraction, so-called hydrogen bonds. As the genetic information is stored in linear sequences of four letters, the information is stored in complementary form on each of the two strands. One of the strands carries a mirror image of the information on the other. We do not understand the significance of the astronomical amount of information that is stored in DNA strands. As a matter of fact, we are, despite our own claims to the contrary, only scratching the surface of what the information in the DNA strand really means. Huge stretches of DNA have no known meaning for us; others are genes, control regions, etc. The part we do not know anything about is often called junk DNA, which is a peculiar way of expressing our lack of knowledge, though some of it may truly lack any function. You need to know one other thing: a gene is usually translated into a protein, which is composed of amino acids. As standard there are 20 different amino acids, and the information on the DNA strand is read in 3-letter groups, where three letters correspond to one amino acid. So, if a 3-letter string codes for an amino acid, how many different amino acids could be coded for if we, for the sake of the argument, agree that all chains of amino acids that are actually proteins start with a start codon and end with a stop codon? Would it be 25, 62, or 64? Well, three letters correspond to 64 different sequences of letters, so the actual answer is 64. As there are only 20 amino acids, the code is degenerate: several 3-letter combinations code for the same amino acid. Changing the first of the three bases usually gives rise to a new amino acid being inserted, changes in the second position practically always do, and changing the third is often irrelevant, as the amino acid remains the same.
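To make the 3-letter code a little more tangible, here is a minimal Python sketch that writes out the standard genetic code and counts how often a single-base change at each codon position leaves the amino acid unchanged. The compact table string is the standard genetic code written in the conventional T, C, A, G order; the names `CODON_TABLE` and `synonymous_fraction` are assumptions made for this illustration, and treating a stop signal like an amino acid when comparing codons is a simplification.

```python
from itertools import product

BASES = "TCAG"  # conventional ordering for writing out the standard genetic code
# One letter per codon in TCAG order (third position varies fastest); '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the encoded amino acid (or stop signal) unchanged."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            mutated = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutated] == aa
    return same / total


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE), "codons,", len(amino_acids), "amino acids")
print("stop codons:", stop_codons)
for pos in range(3):
    print(f"position {pos + 1}: {synonymous_fraction(pos):.2f} of changes are silent")
```

Running it shows the pattern described in the lecture: changes at the third position are silent far more often than changes at the first position, and changes at the second position are almost never silent.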
The third letter in the 3-letter code is, in such cases, said to be four-fold degenerate. Additionally there is one start codon and three stop codons. Often only one of the two DNA strands codes for a series of amino acids, but this, and which of them it is, may change as you move along the molecule. Now, let's go to some problems unique to molecular data. Variation in DNA strands is, despite the apparently endless possibilities, constrained within limits, and only a fraction of the potential variation is realized. Thus, the parts of the DNA strand that code for proteins, which in turn are composed of amino acids, usually vary less than the parts of the DNA strand between the protein-coding parts, which may or may not have a known function, such as coding for regulatory mechanisms. What is of interest to us in the construction of phylogenies is the variation, the mutations in DNA strands that are accumulated over time, exactly as when we spoke about morphology: changes in how organisms look over time. DNA strands change, or mutate, in two different manners. Either individual bases change: an A may be changed to a C, a G to a C, etc, etc. Such changes are called point mutations. Alternatively, the DNA strand may change by deletions or insertions, together called indels, of stretches of nucleotides, which may be either removed from or inserted into the DNA strand. Occasionally, large chunks of DNA strands may be repeated, including whole genes or, quite often in plants, whole genomes, resulting in gene families or a complete duplication of all genes. After duplication, the two members of the gene family, or the two genomes, usually behave independently, and may or may not acquire different functions. Precisely as when dealing with morphology, comparison has to be done between homologous stretches of DNA, stretches that we can justify as being the same across the group of organisms you look at, no matter whether you work with genes or with junk DNA.
We simply have to sample or create a DNA matrix similar to the one we created from morphological data. When it comes to protein-coding genes this is often quite simple, but here gene families are a major problem. The two (or more) copies started out identical, but diverge over time, and you've got to compare the right set of copies, even if they are both found within the same organism. A more severe problem is how to deal with insertion-deletion events. If you look at this figure, I have inserted a new species compared to the previous slide, X. epsilon, which has just recently been found. This species has a sequence that is shorter than the others. The new sequence can be aligned with the others in several different manners, as can be seen in examples one and two on the same slide. We tend to think by default that a piece of the sequence has dropped out, but we do not know; equally likely, the four species with the longer sequence could have had a piece of DNA inserted. This problem with insertion or deletion of sequences is fundamentally of the same magnitude as the problem of inspecting all trees. You remember that the number of trees increased exponentially with the number of taxa, as does the number of possible alignments with the number of sequences. So for molecular data sets, we have two in principle unsolvable problems superimposed on each other: finding the best tree and creating the best alignment, and actually you have to do this simultaneously. An added but unrelated problem is the fact that DNA cannot be used to create a tree of life that includes a lot of extinct taxa. Even though we become better and better at extracting degraded DNA from relatively recently extinct animals and plants, we are nowhere near able to produce DNA sequences for extinct organisms like the one on this slide, a trilobite. The trilobites are a large group of arthropods that consists only of extinct animals and has no known living relatives. So then why is it that we want to use molecular data?
Well, one of the main reasons is the extent of information. A human has a limited array of morphological characters to choose from. Its genome, in contrast, has 3.3 times 10 raised to the ninth base pairs and approximately 21,000 genes. When it comes to other organisms, say rice, the number of base pairs is smaller, 4.6 times 10 raised to the eighth, but the number of genes is at least twice, and probably three times, as big as in humans. So how do we analyze molecular data? You have seen how to analyze morphological data by creating the tree that requires the least amount of change. This method can also be applied to molecular data. However, due to the very simple nature of DNA, it has become possible to develop other methods, methods which currently do not work convincingly for morphology. This is because the alternative methods developed to deal with DNA data build on the creation of models of evolutionary change. Or, to put it simply: how does one, for example, change a C to an A, and how frequent is the change? Evidently, developing models of molecular change is much easier than creating the same type of models for the development of morphological traits. How do we describe the creation of a bird wing, the compound eye of an insect, or the tiger's stripes? Frequently, we cannot even describe in great detail what we want to model. Here I will only scratch the surface of how models for the evolution of DNA strands are developed. It can become exceedingly complicated mathematically to describe even these simple models, and this one is in fact a very simple, fundamental model. This figure shows the chemical structure of the four bases found in the DNA strand. It's very easy to see that the chemicals fall into two categories: A's and G's are very different from T's and C's. Just look at the extra five-membered ring on the A and G molecules.
Intuitively, you would expect that it is easier to change an A to a G, or the reverse, than to change an A or G to either T or C. The same applies to changes between C and T, in contrast to changes to A or G. Changes between bases of the same category are called transitions; changes between bases of different categories are called transversions. Hence, in creating a phylogeny you could decide that transversions should have a higher weight than transitions, because they are more difficult. This can easily be seen in this table, where the frequency of all changes is empirically estimated, based on the sequences of a gene from the plant plastids. You can just compare the A to C and A to T changes with the A to G changes. However, as you can see, the table is quite elaborate; it is done for all possible changes. These changes can all be incorporated in the model, together with other models or actual observations, like the starting frequency of the individual bases and the ratio of invariable to variable sites. However, if the model is wrong, the end result is likely to be equally wrong. That said, methods differ in how susceptible they are to violations of the model. It is important to stress that no phylogenetic method is model-free; even the simplest ones make assumptions. If you look at this slide, this is the kind of situation you have to solve for, in principle, all trees for a given problem. No matter which method you use, different types of data sets, or different partitions of a data set, for instance DNA data versus morphology, or any combination you can think of, may give different results. In case of conflict, it's not necessarily easy to decide which, if any, of the results is correct. On the tree on the left-hand side, the hippo is placed together with the cow and the camel, whereas on the other tree, the molecular tree, the hippo goes together with the whales and the dolphins.
Most people today would agree that the hippo and the dolphins are more closely related to each other than either of them is to the camel and the cow, but it's still an open hypothesis which one is correct. [MUSIC]
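As a supplement to the lecture, the distinction between transitions and transversions can be sketched in Python. This is a hedged illustration, not the method used for the table on the slide: the function names, the two example sequences, and the transversion weight of 2.0 are all assumptions chosen for the example, and the sequences are assumed to be already aligned and free of gaps.

```python
PURINES = frozenset("AG")      # A and G, the two-ring bases
PYRIMIDINES = frozenset("CT")  # C and T, the one-ring bases


def classify_change(a: str, b: str) -> str:
    """Classify a difference between two bases as 'transition' or 'transversion'."""
    if a not in "ACGT" or b not in "ACGT":
        raise ValueError("only the bases A, C, G and T are handled")
    if a == b:
        raise ValueError("bases are identical, nothing to classify")
    same_category = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_category else "transversion"


def count_changes(seq1: str, seq2: str) -> dict:
    """Count transitions and transversions between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a != b:
            counts[classify_change(a, b)] += 1
    return counts


def weighted_cost(seq1: str, seq2: str, transversion_weight: float = 2.0) -> float:
    """Total cost of the differences, with transversions weighted more heavily.
    The default weight is an arbitrary illustrative value."""
    counts = count_changes(seq1, seq2)
    return counts["transition"] + transversion_weight * counts["transversion"]


# Two made-up sequences that differ at three sites.
print(count_changes("AGTTGGGC", "AATTGCGT"))
print(weighted_cost("AGTTGGGC", "AATTGCGT"))
```

In a real analysis the weights would come from an estimated model of change, such as the empirical frequencies in the table discussed in the lecture, rather than from a fixed number.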