Welcome to “Bioinformatics: Introduction and Methods”. I’m Liping Wei from the Center for Bioinformatics at Peking University. So far we have learned about many bioinformatic methods, from sequence alignment and sequence search to next-generation sequencing data analysis and pathway networks. We have also reviewed many online bioinformatic resources, including databases and software tools. This week, we will use two case studies to show how you can integrate bioinformatic data, methods and analyses to study biological questions. I am pleased to have Dr. Manyuan Long, Edna K. Papazian Distinguished Service Professor from the University of Chicago Department of Ecology and Evolutionary Biology, here with us. Dr. Long and I will discuss how to use bioinformatic data and methods to study the origination, evolution, and function of new genes.
Dr. Manyuan Long received his Ph.D. degree from the laboratory of Dr. Chuck Langley at the University of California, Davis. He then did his postdoctoral work in the laboratories of Walter Gilbert and Richard Lewontin at Harvard University. In 1997 he joined the faculty of the Department of Ecology and Evolutionary Biology at the University of Chicago as Assistant Professor and was subsequently promoted to Associate Professor and Full Professor. Manyuan was named the prestigious Edna K. Papazian Distinguished Service Professor in 2011. Since 2006 we have had the pleasure of having Manyuan as a Chang Jiang Chair Adjunct Professor at the School of Life Sciences, Peking University. Manyuan is a world-renowned expert and pioneer in the study of the origination, evolution, and function of new genes. I am delighted that Dr. Long is here today to tell us about the biological background of the study of new genes and how his lab has combined computational and experimental approaches to study them. Welcome!
So, today, we are going to talk about new gene evolution detected by genomic computation.
In the first section, I am going to introduce the basic concepts and some examples. Today, bioinformatic analyses are common in biology and medicine. Biological and medical studies are undergoing a rapid paradigm shift toward genomic analyses of both gene identification and gene expression, and these analyses generate data on an astronomical scale. Bioinformatics is a very important tool for data analysis at various levels, from preliminary data presentation to advanced interpretation of scientific problems, with unprecedented power to detect natural phenomena and their underlying mechanisms. The biological rules and the correlations among the factors involved, as detected by bioinformatic analysis, illuminate our progress toward understanding basic biological and medical problems. In this section, I am going to apply bioinformatic analyses to a basic biological problem: the origin and evolution of new genes as a general concept, and what it tells us about the evolution of humans and other mammals. These results are valuable for solving relevant biological and medical problems, as the case analyses will illustrate.
This is a compilation I prepared, which includes many species whose genomes have been sequenced. From this table, we see that the gene numbers among different species differ in the two ??? higher. For example, the soybean has more than 50,000 genes. At the other extreme is a ??? species, Candidatus Hodgkinia, which has only about 189 genes. So the species with the smallest number of genes and the species with the highest number can differ by more than 265 times. If we talk about differences in genome size, the gap can be even bigger. This comparison immediately tells us that organisms evolve in the number of genes and the size of their genomes.
This suggests that there is a general process of birth and death of genes in evolution; that is, the origination of new genes becomes a very important general problem. Here I give a definition of what we call new genes. It is essentially a synteny-based definition. Assume there are four species that diverged some time ago, splitting at times T3, T2 and T1, usually measured in millions of years. Two genes, drawn in green and yellow, existed in the most recent common ancestor of all the species, which suggests that they are old genes; we call them G1 and G3. If we sequence this region of genomic DNA, we see that in species S1 and S2 there is G1, followed by G2, followed by G3. G1 shows up in all four species, while G2 shows up only in S1 and S2. Given this phylogenetic distribution, one interpretation is that G2 is a new gene that originated in the most recent common ancestor of species S1 and S2, as the red line shows.
After hearing this definition, you may wonder why we cannot infer that G2 is a very old gene that species S3 and S4 simply lost. So this is the question: why do we not interpret the absence of G2 in S3 and S4 as the consequence of gene loss, with G2 already present in the ancestor before the divergence of S1 and S2? The answer relies on a principle in evolutionary analysis called the parsimony principle. This is the principle of accounting for observations with the hypothesis that requires the fewest or simplest assumptions that lack evidence. In evolution, it means invoking the minimal number of evolutionary changes to infer the more likely possibility. Let us analyze the two hypotheses. In the gene-loss scenario, we have to assume that gene G2 was already present in the common ancestor at some time before T3, as the red dashed line shows.
I use three solid red lines to indicate the presence of the gene, because the gene must eventually be passed down to the first two species. This scenario has to invoke two gene loss events. In the scenario on the right, gene gain or new gene origination, we invoke only a single origination event. So the gene-loss hypothesis invokes two independent gene loss events, while the gene-origination hypothesis invokes only one gene gain event, in the most recent common ancestor. For this reason, we say it is more likely that the gene originated in the most recent common ancestor of S1 and S2.
You can do an exercise. Assuming equal probabilities of gene gain and gene loss for each evolutionary change, infer the ancestral state of the gene (present or absent) at times T1, T2 and T3 under the two hypotheses: gain of a new gene, or ancestral loss of an old gene. Then choose the most parsimonious hypothesis by calculating the total number of evolutionary changes each hypothesis requires. In evolutionary analysis, S4 is called the outgroup species; it can be used to help infer the ancestral state of G2 at T2. Repeat the exercise after adding one more outgroup species that also lacks G2, and see whether you become more confident in our previous inference that G2 is a new gene that originated between T1 and T2, as shown below.
Here is one example that has been reported as a new gene. Sdic is a new gene in Drosophila melanogaster that exists in a single lineage in the group of Drosophila species. It encodes a sperm-specific axonemal dynein subunit and is immediately flanked by two parental genes, Cdic and Annx. The gene appears only in the single lineage of Drosophila melanogaster, flanked by two old genes that are in fact its parental genes. Sdic is a chimera between the two, and it is repeated a number of times.
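The parsimony exercise described above can also be tried on a computer. The following is a minimal Python sketch, not the method used in the lecture: it assumes the tree shape (((S1, S2), S3), S4), gives every gain or loss a cost of one, and counts the fewest changes needed when the root ancestor either lacks G2 (the new-gene hypothesis) or carries G2 (the old-gene hypothesis). The function name `min_changes` and the extra outgroup label `S5` are illustrative choices.

```python
from math import inf


def min_changes(tree, root_state, presence):
    """Fewest gain/loss events on `tree` if its root has `root_state`.

    `tree` is a species name (leaf) or a tuple of subtrees.
    `presence` maps each species name to 1 (G2 present) or 0 (absent).
    Each change of state along a branch costs one event.
    """
    if isinstance(tree, str):
        # A leaf must match the observed presence/absence of the gene.
        return 0 if presence[tree] == root_state else inf
    total = 0
    for child in tree:
        # Pick the cheaper state for the child; pay 1 if it differs.
        total += min(
            min_changes(child, s, presence) + (s != root_state)
            for s in (0, 1)
        )
    return total


# Assumed topology and observations: G2 is present only in S1 and S2.
presence = {"S1": 1, "S2": 1, "S3": 0, "S4": 0}
tree = ((("S1", "S2"), "S3"), "S4")

gain_events = min_changes(tree, 0, presence)  # root lacks G2
loss_events = min_changes(tree, 1, presence)  # root already has G2
print("new-gene hypothesis:", gain_events, "change(s)")
print("old-gene hypothesis:", loss_events, "change(s)")

# Add one more outgroup (hypothetical S5) that also lacks G2.
presence5 = dict(presence, S5=0)
tree5 = (tree, "S5")
gain_events5 = min_changes(tree5, 0, presence5)
loss_events5 = min_changes(tree5, 1, presence5)
print("with extra outgroup:", gain_events5, "vs", loss_events5)
```

Under these assumptions the new-gene hypothesis needs one change and the old-gene hypothesis needs two; with the extra outgroup the gap widens to one versus three, which is why a second outgroup lacking G2 strengthens the inference.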
Because Sdic is found only in the Drosophila melanogaster lineage, we infer that it originated between 3 million and 1 million years ago, perhaps in the middle, about 1.5 million years ago.
Using this definition, we can now identify new genes in the 12 Drosophila species, because all of their genomic sequences have been reported in databases and are publicly available. On the left is the computational pipeline we designed to identify new genes. I suggest you read it; I will not repeat it here. But I will say that in this pipeline we defined very conservative criteria. For two genes to be identified as duplicates, we required their similarity to be at least 50% and their overall coverage to be no less than 70%. These are very conservative criteria. We ran this pipeline on the 12 genomic sequences shown in the chart on the right, which shows how the 12 species diverged and their divergence times.
Here is the result. We identified almost 1000 new genes along the red lineage, which runs from the common ancestor of all Drosophila to Drosophila melanogaster. You can see that in Branch 2, 161 genes originated; these exist in the species from Drosophila pseudoobscura to Drosophila melanogaster. Even in the most recent branch, Branch 6, which diverged from the most closely related species, Drosophila sechellia and Drosophila simulans, about 3 million years ago, about 6 genes originated.
Today we know of more than 11 molecular mechanisms that can create new genes. One of the early examples of a gene created by such mechanisms is the Jingwei gene. From phylogenetic analysis of Jingwei in Drosophila, we know that this gene originated before 2 West African species diverged, about 2.5 million years ago.
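The two thresholds stated for the pipeline (similarity of at least 50% and coverage of no less than 70%) can be written as a simple filter. This is only a sketch of that single step, not the pipeline itself, whose other steps are on the slide and are not described here. The `Hit` record, its field names and the example gene names are hypothetical.

```python
from dataclasses import dataclass

MIN_SIMILARITY = 50.0  # percent, threshold stated in the lecture
MIN_COVERAGE = 70.0    # percent, threshold stated in the lecture


@dataclass
class Hit:
    """One pairwise alignment between two genes (hypothetical record)."""
    gene_a: str
    gene_b: str
    similarity: float  # percent similarity over the aligned region
    coverage: float    # percent of the gene covered by the alignment


def is_duplicate_pair(hit: Hit) -> bool:
    """Apply the two conservative thresholds to one alignment."""
    return hit.similarity >= MIN_SIMILARITY and hit.coverage >= MIN_COVERAGE


# Made-up alignments, for illustration only.
hits = [
    Hit("geneA", "geneB", similarity=82.0, coverage=95.0),  # passes both
    Hit("geneA", "geneC", similarity=45.0, coverage=90.0),  # too dissimilar
    Hit("geneD", "geneE", similarity=60.0, coverage=55.0),  # too short
    Hit("geneF", "geneG", similarity=50.0, coverage=70.0),  # on the boundary
]

kept = [(h.gene_a, h.gene_b) for h in hits if is_duplicate_pair(h)]
print(kept)
```

Both comparisons are inclusive, matching "higher or equal to 50%" and "no less than 70%", so a pair sitting exactly on the thresholds is kept.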
By inspecting the sequence of Jingwei, we see that this gene actually used two mechanisms, gene duplication and RNA-based duplication (retroposition), to create a chimera. The protein includes an N-terminal peptide that comes from yellow emperor and a C-terminal part that comes from Adh, which carries an important biochemical function that helps Drosophila in West Africa survive in a different ecological environment.
How important are these young genes in nature, or in an organism? Using genetic methods, we can today knock out or knock down their expression, or disrupt the genes, which lets us detect their functional importance. Here is an example: a new gene called YLL1, which was created by DNA-level gene duplication from a parental gene called CG7627. It exists only in a group of Drosophila species that diverged about 6 to 8 million years ago, from Drosophila erecta and Drosophila yakuba to Drosophila melanogaster. We applied three genetic analyses to this new gene. The first is P-element insertion, i.e., inserting an element into the gene. The second is chemical mutagenesis, which created mutant lines at positions 717 and 765, both of which change the amino acid sequence, from G to S and from T to I. It also created mutant lines that change only synonymous sites and do not change the protein sequence. The third, which we used more recently, is RNAi technology, which silences the gene using a constitutive enhancer to drive GAL4. To our surprise, all three genetic approaches produced a lethal phenotype. Only the changes at synonymous sites left individuals viable. So this experiment shows that YLL1, a very young gene that is only about 6 to 8 million years old, has evolved an essential function: any genetic change that abolishes its function makes the fly die.
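The distinction between changes that alter the protein and synonymous changes that do not can be illustrated with the standard genetic code. The sketch below is only an illustration: the lecture gives the amino acid changes (G to S, T to I) but not the underlying codons, so the codons used here are hypothetical examples chosen to produce those changes, and `classify` is an illustrative helper name.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(ref_codon: str, alt_codon: str):
    """Return (kind, ref_aa, alt_aa) for a single codon change."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa


# Hypothetical codons; only the amino acid changes come from the lecture.
print(classify("GGC", "AGC"))  # glycine to serine
print(classify("ACA", "ATA"))  # threonine to isoleucine
print(classify("GGC", "GGT"))  # glycine stays glycine
```

In the experiment described above, only changes of the third kind, which leave the protein sequence intact, gave viable flies.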
Here I list a number of reported new genes with interesting phenotypes and functions, including Sdic, sphinx, jingwei and p24-2, which are between about 0 and 3 million years old. The youngest one is perhaps only 0.01 million years old; that one is in mammals. In plants, I also list about 3 genes that show very important phenotypic or functional effects.
In summary, in this section we introduced the idea that a new gene is a gene that originated recently in a genome and can be identified by syntenic alignment of genomic sequences from a group of closely related species. We also showed that a number of molecular mechanisms can generate new genes, and that more than one mechanism can be involved in making a new gene. Finally, new genes can be as biologically important as old or ancient genes. In insects, essential functions can evolve rapidly at any time in evolution.