This lecture is about why we study genomics and what it can teach us. Genomics is the study of genomes, including the genomes inside of us. Let's talk about human genomics. Everybody on the planet has a genome that has governed their development and governs a lot of their biology, and as you can see by looking at any crowd of people, we all look really different. However, we've discovered through sequencing in recent years that we're actually 99.9% identical, or even more than that. So it's really remarkable how much diversity you can create from a very small number of changes in your genome. But of course, now that we know that we're 99.9% identical, we still want to know: what is it that's driving all these differences? Why is one person tall and another person short? Why does one person live to be 100 and another person not? Why does one person get cancer and another person not? Many of these things we suspect are driven by our genomes, and we want to understand that. One of the most basic things that our genome determines is how our bodies develop. We start off, as you all know, as a single cell, which divides into a few apparently identical cells, which quickly develop into an embryo and eventually grow into a whole person. Somehow that entire program of development is encoded in our genome, and this is something that we don't yet understand. In addition, the code in our cells determines all the different cell types. For example, it determines how to make a neuron, which is a very complicated cell, obviously a very different kind of cell from, say, a skin cell; it does very different things. And yet the genome inside of a neuron in your body is identical to the genome inside of any of your skin cells. So we want to understand what's going on in that cell: even though it has the same code, somehow it's executing a different program to make it into a neuron versus a skin cell.
Another big area of research in genomics is cancer. Cancer is essentially a genetic disease, we know now. Cancer cells are simply, again, cells in your body that have the same genetic code, the same genome in them, but somehow they've gone haywire, and they've started replicating without control. That's what makes something cancerous: basically, it's cells that are dividing without any check on their division. In fact, we define cancers by the type of cell that started the cancer. So there's skin cancer, where a skin cell starts dividing without control; one form of it is called melanoma. There's lung cancer. There are blood cancers that are called leukemias. These are all defined by the cells that started the cancer, and they all have a common phenotype, that is, they all have the common feature that they're dividing without control. But the consequences of different cancers are very different, and in fact, the mutations in our DNA that cause these cells to become cancerous are also different. So what do our genes have to do with any of this? I just mentioned the word mutation. A mutation is a change in your genome. It can happen because your DNA is damaged, or it can happen because of an accident in replication. To explain that latter point: every time your cells divide, the entire genome has to be copied. Our cells are really, really good at this, fortunately; otherwise we wouldn't exist, or we wouldn't survive for very long. But once in a while they make an error, probably only one to three errors per cell division. And once in a while, that error causes something bad to happen, and we believe a lot of cancers are caused by these sorts of accidental errors. Understanding that is a matter of asking: okay, my cell makes an error, so what does it mean for a mutation or an error in replication to turn a cell cancerous?
What we usually think happens is that the mutation affects a gene, which now doesn't function properly. That might, for example, be a gene that controls cell division, and now you've sort of turned off the check on cell division. Now the cell starts replicating without control and you have a cancer. So that's the kind of thing we're looking at when we're using genomics to study cancer. So how does this all work, this program that I'm talking about that's encoded in our DNA? Well, there's something called the central dogma. I didn't make that phrase up; it was created by Francis Crick, one of the co-discoverers of the structure of DNA, over fifty years ago. And it's still used, even though, as with many dogmas, it's not absolute. The central dogma of biology, or molecular biology, says that information flows in a single direction from your genome, that is your DNA, to RNA, to proteins. We give different names to the processes that govern that. When a gene in your DNA is expressed, the first step is that you take pieces of it called exons and you transcribe them, that's the copying process, into RNA. RNA is essentially an exact copy of the DNA where all the letters are the same, with the only difference being that the letter T, or thymine, becomes the letter U, which is uracil. But otherwise it carries the same sequence. That RNA then has to be turned into a protein. Now, proteins are not composed of these four nucleic acid letters. They're composed of 20 letters, which are abbreviations for amino acids. Proteins are also long molecules, though not nearly as long as DNA; a typical protein might be 300 or 400 amino acids long. The way you get a protein is you take a piece of RNA and you read it three letters at a time, and each triplet encodes an amino acid. If you think about it for a second, there are four possible RNA nucleotides, so there are four to the third, or 64, possible combinations.
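As a minimal sketch of the copying and reading steps just described, the Python below transcribes a DNA string into RNA, reads it three letters at a time, and enumerates the possible triplets. The function names and the nine-letter DNA string are made up for illustration, and the sketch assumes the DNA is given as the strand that reads the same as the RNA, which is a simplification.

```python
from itertools import product

RNA_BASES = "ACGU"  # the four RNA nucleotides


def transcribe(dna: str) -> str:
    """Copy a DNA sequence into RNA: same letters, except T becomes U."""
    return dna.upper().replace("T", "U")


def split_codons(rna: str) -> list[str]:
    """Read RNA three letters at a time, dropping an incomplete trailing triplet."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


# Every possible triplet of four letters: 4 ** 3 combinations.
ALL_CODONS = ["".join(triplet) for triplet in product(RNA_BASES, repeat=3)]

example_dna = "ATGGCTTGA"  # hypothetical sequence, for illustration only
example_rna = transcribe(example_dna)

print(example_rna)                # AUGGCUUGA
print(split_codons(example_rna))  # ['AUG', 'GCU', 'UGA']
print(len(ALL_CODONS))            # 64
```

The example stops at the triplets themselves; mapping each triplet to an amino acid would need the standard codon table, which is left out here to keep the sketch short.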
Each of those 64 triplets gets translated either into an amino acid or not. There are three special ones called stop codons, which indicate the end of a protein. So that's basically how DNA goes and becomes a protein. And the proteins kind of do all the work of your cells. The proteins in your body are what are actually doing most of the functional work of, say, metabolizing things, digesting your food, moving things around in the cells. That fundamental dogma has been around for many decades now, and it more or less describes how information flows most of the time from your genome to proteins. However, that's not the whole picture, we now know. Over time, we've learned that information can flow the other way, and as scientists got more familiar with the whole model, they realized that it had to flow the other way. As I was saying a little earlier in this lecture, there are many different cell types in your body, and every cell has the same exact DNA. So if everything just flowed from the DNA to the proteins, it would seem sort of fundamentally impossible for the cells to behave differently, yet we know that neurons don't act like skin cells. So what's going on? Some of the proteins that are created from the DNA go back and bind to that DNA and modify it, and change which genes get turned on and off. So proteins can self-regulate in this way. And there are other things that can happen with DNA, other modifiers, some of which are called methylation marks, that can change DNA as well. So there are features on the DNA that are affected by the proteins themselves. There are feedback loops in this information flow, and as a result, information is actually flowing backwards as well. So in the genomics field, how do we make these measurements that I'm talking about? If we want to understand cancer, then we have to go and get some cancer cells and figure out what mutations happened in those cells.
So how do we do that? We do that with sequencing. Sequencing is sort of at the heart of genomics, and of the genomics revolution that we've been in for about the past 20 years, which has really accelerated over the past ten years. One reason for this acceleration is that genome technology has gotten incredibly fast and efficient. What you're looking at here are some of the latest sequencing machines. The highest-throughput sequencer we have today can sequence, in a single run of the machine, as many as a trillion nucleotides of DNA. To give you a sense of what that means, the Human Genome Project was started in 1989 with the goal of sequencing one human genome in 15 years. It beat that goal: we actually published the human genome in 2001, so in just 12 years we finished the project. I was part of that project. It was a massive effort involving thousands of scientists from around the world. Sequencers were employed at half a dozen huge genome sequencing centers in the US, and at large sequencing centers in the UK, in France, in China, all over the world. Today you can get a sequencer in a single lab, one of these machines run by a single investigator, and in just a few days, you can sequence on the order of several hundred human genome equivalents. So now we're maybe a little more than a dozen years after the completion of the human genome, which was a 12-year project involving thousands of scientists, and a single scientist in one day can do far more sequencing than that entire consortium did. That's allowed us to start looking at things like cancer genomics. When the human genome was published in 2001, no one at that time thought it was even remotely feasible to start sequencing the entire genome of a single tumor, and yet today, we have literally tens of thousands of projects going on around the world doing exactly that. The result is that we are generating these enormous, enormous data sets.
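The "several hundred human genome equivalents" figure follows from simple arithmetic, sketched below in Python. The constant and function names are made up for illustration, and the genome size is rounded to three billion base pairs, so the result is an order-of-magnitude estimate rather than a measured value.

```python
HUMAN_GENOME_BP = 3_000_000_000   # assumed round figure, about three billion base pairs
RUN_YIELD_BP = 1_000_000_000_000  # "as many as a trillion nucleotides" in a single run


def genome_equivalents(run_yield_bp: int, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Number of genome-sized amounts of raw sequence produced by one run."""
    return run_yield_bp / genome_bp


print(round(genome_equivalents(RUN_YIELD_BP)))  # 333
```

Note that this counts raw sequenced bases only; it says nothing about how many times each position of a genome has to be read to analyze it reliably.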
So sure, we can sequence all that data, but what I didn't say was that towards the end of the Human Genome Project, when we were at the point where we were writing the paper, and I was part of one of the teams that was doing that, we had hundreds of scientists frantically trying to analyze all this data from a single genome and figure out what we could say about it in a scientific paper. Today, one investigator, one lab, can generate multiple genomes in the space of a week, but that doesn't mean that in the space of a week, or a few days, you can analyze all that data. Not at all. You need powerful computers running for days or even weeks just to churn through the data and turn it into something that a person can look at. And there are many different questions you can ask about it. One question that I already alluded to is: what are the mutations in this cell versus other cells from the same person? That's one kind of question you could ask. It requires significant amounts of computing to take that bewildering mass of data and turn it into something comprehensible to a group of scientists who can then analyze it. Another thing that's driven this revolution is not just the efficiency but the cost. At the same time that things have gotten faster and more efficient, they've also gotten much cheaper. This plot that you're looking at now shows you the rough cost per human genome equivalent going back to around the time the human genome was completed. When the human genome was finished in 2001, the scientific community then proceeded with several other important mammalian genomes that are about the same size, such as the mouse genome and the cow genome. These are genomes that, like human, are around two and a half to three billion base pairs long. And those projects cost on the order of $25 or $30 million to sequence.
From that point on, the cost dropped very rapidly, and then around 2007 there was the introduction of a new technology from a company called Solexa, now called Illumina, that led to even more rapid drops in cost, because the sequencing technology itself changed really dramatically. We'll talk about that a little bit later in this course. As a result, the sequencing cost today for a human genome is on the order of $1,000. So we've gone from $25 to $30 million to $1,000 in the space of about a dozen years. That opens up a world of experiments that we didn't think were feasible before, not only because of the time involved but also because of the cost. So finally, where is all this data? There are now trillions of bases of data that have already been generated. You and I can go and download this data and study it ourselves. Even though this data has been published and deposited in public archives, that doesn't mean that there's nothing more to learn from it. The convention in the field is that once you publish a paper describing some genomic data set, you're required to release it, and generally to release it with no restrictions. So there's a terrific set of repositories of all this data. The biggest one is the National Center for Biotechnology Information, or NCBI. The raw data is deposited there in something called the Sequence Read Archive, or SRA. But many more databases are contained within NCBI that hold, for example, the names and locations of all the genes that are present in all the genomes that we've been sequencing. So this is a great resource for people who want to go and try to make new discoveries, not only about the human genome, but about the many thousands of other species that we're engaged in sequencing.
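To put the cost figures quoted earlier in this lecture into numbers, here is a minimal Python sketch. The function names are made up for illustration, and the halving-time calculation assumes a smooth exponential decline over about twelve years, which the lecture itself says is not quite what happened (the drop sped up around 2007), so treat it as a rough average only.

```python
import math


def fold_drop(old_cost: float, new_cost: float) -> float:
    """How many times cheaper the new cost is than the old one."""
    return old_cost / new_cost


def average_halving_time(old_cost: float, new_cost: float, years: float) -> float:
    """Average time for the cost to halve, assuming a steady exponential decline."""
    halvings = math.log2(fold_drop(old_cost, new_cost))
    return years / halvings


# Figures from the lecture: $25 to $30 million down to about $1,000 in about a dozen years.
print(fold_drop(25_000_000, 1_000))  # 25000.0
print(fold_drop(30_000_000, 1_000))  # 30000.0
print(average_halving_time(25_000_000, 1_000, years=12))  # a little under one year
```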