This lecture is about how we store data: data structures and computer memory. Data structures are an important part of computer science that let us store data efficiently, and especially when we're dealing with very large data sets, we have to think carefully about how we store data. For example, in the context of genomics, we deal with large amounts of sequence data, sometimes aligned, as you can see on this slide. We have sequences for many people or many species, and we have to think about how to get those into memory. Now, the way sequences come to us from a sequencing machine is just as a long string of text characters. Computers are very good at storing text, since in many cases that's what they were originally designed to do, so computers already have a way to store that. But sometimes we can think about more efficient ways to store it. For example, when we're looking at a multiple alignment of lots of sequences, those sequences have a great deal in common, so we might want to store something a little bit smaller than all the separate sequences: just the differences between them. Another important aspect of designing data structures and memory layouts is that we want to be able to find things again. DNA sequences all pretty much look alike: a long string of As, Cs, Gs, and Ts. And in the human genome we have three billion base pairs to search through. So an interesting problem from the data structure point of view in genomics is this: I've got a little sequence of, say, 100 nucleotides, 100 DNA letters, and I want to store it in such a way that I can go and quickly find it again.
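The idea of storing only the differences between similar sequences can be sketched in a few lines. This is a minimal illustration, not a real alignment format; the reference string and function names are made up for the example.

```python
# Sketch: store aligned sequences as differences from a shared reference
# instead of as full strings. Names and the toy reference are illustrative.

REFERENCE = "ACGTACGTAC"

def diff_encode(seq, ref=REFERENCE):
    """Return the (position, letter) pairs where seq differs from ref."""
    return [(i, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b]

def diff_decode(diffs, ref=REFERENCE):
    """Rebuild the full sequence from the reference plus the differences."""
    letters = list(ref)
    for i, b in diffs:
        letters[i] = b
    return "".join(letters)

sample = "ACGTTCGTAA"          # differs from the reference at positions 4 and 9
diffs = diff_encode(sample)    # only two pairs stored instead of ten letters
assert diff_decode(diffs) == sample
```

When sequences are nearly identical, as rows of a multiple alignment usually are, the difference list is far smaller than the sequences themselves.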
And I'm storing it in the context of a three-billion-base-pair sequence. So, rather than just throwing it in a random pile, I'd like to store it in a way where I can go back and quickly find it. For example, with a DNA sequence from the human genome, we might want to store some kind of tag that says what chromosome it's on and at what location in that chromosome it starts. Now, the kinds of data structures that computer scientists have designed over the years vary widely, but one of the most common is a tree. There's also something called a list, and something called a linked list, and many variants on these data structures. These are simply ways of keeping track of the objects you've stored, and of having objects point to one another, so that once you get to one object you can quickly find another. To understand how these structures work, it's important to understand that in computer memory we have not only the data itself but also an address. Every piece of memory has an address, and we can find any object in memory if we know that address. In programming-language terms, those addresses are called pointers, and a pointer takes us directly to a piece of memory. So if we store a piece of sequence data somewhere in the computer's memory, to retrieve it again we simply need a pointer. And one way to make a data structure efficient is this: if we have lots of sequences that are near each other in the genome, we might store pointers from each of those sequences to the next one, so that once we're in the right place we can quickly find the other sequences without having to start over from scratch. So another aspect of thinking about data structures is making them efficient.
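A linked list of sequence records can illustrate the pointer idea. In Python, object references play the role of pointers; the field names (`chrom`, `start`, `next`) and the coordinates below are illustrative, not from any particular genomics library.

```python
# Sketch of a linked list of sequence records: each node carries the
# chromosome tag, the start position, the sequence, and a reference
# ("pointer") to the next nearby sequence.

class SeqNode:
    def __init__(self, chrom, start, seq):
        self.chrom = chrom      # which chromosome the sequence is on
        self.start = start      # where it starts on that chromosome
        self.seq = seq          # the sequence data itself
        self.next = None        # pointer to the next node, if any

# Link three neighboring sequences on chromosome 1.
a = SeqNode("chr1", 100, "ACGT")
b = SeqNode("chr1", 250, "GGTA")
c = SeqNode("chr1", 400, "TTAC")
a.next = b
b.next = c

# Once we've found the first node, following pointers reaches the rest
# without searching the whole genome again.
node, starts = a, []
while node is not None:
    starts.append(node.start)
    node = node.next
```

After the walk, `starts` holds the positions in order, showing how one lookup plus pointer-following replaces repeated searches from scratch.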
As I said before, in the genomics world we're mostly dealing with sequence data. Sequence data has a natural representation as letters, and computers typically represent letters as one byte each. Any letter of the alphabet, a through z, and any numeral, zero through nine, is represented the same way inside the computer, using one byte. The word byte, by the way, comes from the word bit. A bit is a binary digit, just a zero or a one, and that's how computers represent information at the most fundamental level. If you take eight bits in a row, you can consider that an eight-bit binary number, which can store 256 distinct values. The standard representation of text inside the computer, ASCII, actually uses only seven of those bits, so every character is one of the 128 values between 0 and 127. With that much space to represent information, we can represent all 26 lower-case letters, all 26 upper-case letters, the ten single digits, and still have room for all the special characters. So basically everything on your computer keyboard is represented as a single byte. However, if you look at DNA, you see right away that there are actually only four letters. So we can do much, much better when we're representing DNA, and this is how most serious, highly efficient programs for processing lots of DNA operate internally. Instead of representing the four DNA letters as one byte each, we can represent them as just two bits: take A and represent it by the two bits 00, C is 01, G is 10, and T is 11. By doing it this way, we get a fourfold compression: instead of using eight bits per letter of DNA, we're only using two.
So, because we're storing gigabytes or even terabytes of DNA sequence data, a fourfold compression right out of the box is an important efficiency we gain from that representation. Finally, let's look at a slightly more sophisticated way of representing DNA. One thing we'd like to capture in analyzing DNA is patterns of sequence that have some biological function. Here I'm showing you a picture of the ends of an intron. Introns are the interrupting sequences in the genes of our genome that don't actually encode proteins, but get snipped out and thrown away in the process of going from DNA to RNA to protein. Introns almost always start with the letters GT, and they almost always end with the letters AG. If you collect lots of them together and notice what these patterns have in common, you can create a probabilistic picture of which letters are most likely to occur at the beginnings and ends of introns. These two pictures show exactly that for the beginning of an intron, which is called the donor site, and the end, which is called the acceptor site. Now, we could represent all the donor sites we've ever seen as a big set of strings, say ten letters long each, if we chopped out a window of ten bases around those sites. Or we could be much more efficient about it, and capture much more interesting information, by computing for every position in that little window the probability that the letter is A, C, G, or T. The logos you see across the top use the height of each letter to represent the probability that the letter appears at that location.
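Computing that per-position probability table can be sketched in a few lines. The four donor-site windows below are made up for the example; the only property they share with real donor sites is the invariant GT.

```python
# Sketch: from a set of aligned windows around donor sites, compute, for
# each position, the probability of seeing A, C, G, or T. The windows
# here are toy data, not real donor sites.

def position_probs(windows):
    """Return one {base: probability} dict per position in the window."""
    n = len(windows)
    width = len(windows[0])
    table = []
    for i in range(width):
        counts = {"A": 0, "C": 0, "G": 0, "T": 0}
        for w in windows:
            counts[w[i]] += 1
        table.append({b: c / n for b, c in counts.items()})
    return table

# Four toy 8-letter windows; positions 2 and 3 carry the invariant GT.
donors = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGT"]
probs = position_probs(donors)
assert probs[2]["G"] == 1.0 and probs[3]["T"] == 1.0
```

The table this produces is exactly what a sequence logo draws: at each position, the letter heights are these probabilities, so thousands of training sequences collapse into one small matrix.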
And with this kind of representation, we've essentially compressed the information from hundreds or even thousands of sequences into a simple pattern, which we can then use to process other data, for example to recognize these patterns when we see them again.