To finish up our study of the use of abstract data types, we're going to take a look at string processing in a bit more detail. So, as mentioned, we've already been using strings. It's a sequence of Unicode characters and Java's ADT defines operations on string that allows us to write programs that manipulate them. We've already been using a few simple ones like the constructor or in Java, if you put a sequence of characters inside quotes it creates a string. You can get the length of a string or you can get the ith character but there's many other operations that are defined. For example, you can get a sub-string by between index i and index j minus one by calling this method. You can find out if a given string, if our string contains a given string. You can find out if the string starts with a certain given string or ends with a certain given string. You can, this method index of searches through the string to find the first occurrence of a given pattern or the first occurrence of the pattern after a given index. Concat we've been using with Java's built-in plus operator but it creates this string with a new string T appended. You can compare strings, you can do very powerful things by changing all occurrences of a given string A to another string B within a string, replace all. Split, if you give a delimiter then you can get an array of strings that are in the original string. Every occurrence of the delimiter defines as a sub-string and you get an array of those sub-strings. We need those for examples to take a text where words are separated by blanks and turn that into an array of words. You can check whether a string is equal to another string. Actually, equals takes an object as its argument. Of course, if the object is not a string it's not going to be equals. If it is a string then that's a check for whether the value is the same. Well, take a look at now some programs that use these methods. They are very useful and powerful and string processing is a very important paradigm in modern programming, as we'll see. So, here's some typical examples, just a couple of toy examples. So, is the string a palindrome? Does it read the same forwards and backwards? Well, we get the length of the string, then we go through half of the characters and then we just check that if the character i from the beginning is equal to the character i from the end; if it's not, you return false. If you get all the way through then you return true. So, is the string a palindrome. Let's find the lines containing a specified string in standard input. So, we take a query from the command line while standard input is not empty. Read a line. Check if that line contains the query. If it does, print it out and there we go. That's a simple way to look in a file for the lines that contain certain words, for example, in a program, the lines that uses a certain variable. We'll look at later in the course at the expanded version of that operation. So maybe the text file on standard in you want to look for the hyperlinks and end in.edu. Again, so we can read the strings from standard input and if it's a hyperlink it's going to start with HTTP colon slash slash and it ends with.edu, then that's the string that you want. So, all kinds of things like this, all of those methods provide extremely flexible basis for processing strings. We'll see plenty of examples as we move through later in the course. Here's a very important example from molecular biology. For many years, the challenge was to find out the sequence of DNA in the human genome but that happened maybe about a decade ago, and now the thing is that we have all this data, that huge amount of data that characterizes the genome and what we want to do is try to understand the structure. That's what scientists do nowadays in genomics. So the idea is to represent the genome as just a string, this four different types of things in the genome, and we call them A, C, T and G. We can think of the thing as just a string of A's, T's, C's and G's. It might be a hugely long string, might be billions of characters depending on the organism and so forth and there's many, many of these out there that scientists have sequenced and discovered the actual sequence. But now they need to write programs to understand the structure and that's a string processing. So, for now, we'll talk about the idea of a gene which is a substring of a genome that represents a functional unit. What scientists know about genes is that they're made of things called codons. The A, C, T and G elements are called nucleotides, and if you have three of those in a row, you have something called a codon. The codons for a gene, just before a gene, there's the sequence A, T, and G just those three letters. Then at the end after the gene there's one of these occurrences of three letters and codon TAG, TAA or TGA. So, we have a start codon, then we have a bunch of codons and then we have a stop codon and everything in between is a gene as long as that thing is the length of it is a multiple of three. That is it has to be a sequence of codons. So, in this genome, that start and that stop we can identify that gene. There's actually another one, a short one. It might not be a gene but it's a potential gene that are worthy of further study. So, our challenge is to write a Java program that can find genes in a given genome. Again, you want to think of this genome as being something extremely long, maybe millions or billions of A's, Cs, T's and G's. As a warm up, we'll do a little bit of a simpler task which is to identify whether a given string follows these rules and is a potential gene. So, given that string, we want to check, does it begin with a start codon, does it end with a stop codon, and if so then what's in between, is a G. So let's look at that program. So, we're going to use a static method that takes a string as argument and then it's going to return either true or false depending on whether it's got the proper structure. So, the first thing is length of thing has to be a multiple of three. So, divide by three check the remainder. If it's not 0 then return false. Can't be good. Then the other thing is it's got to start with ATG. It doesn't start with ATG, return false. Again, each one of these lines is enabled by a method in Java string datatype. We're computing with strings using the set of operations in the datatype. So, now we go through the rest of the string and jump three at a time to check all the codons and pick out the given codon, if it's equal to a stop codon, then, and we're not at the end, then we have to return false. The only time that it's legal to have a stop codon is right at the end. If it ends with TAA, TAG, or TGA, then return true, and if we don't find the stop codon at the end then we return false. So, now this is a main that just reads a string from the command line and calls this function, and so if we call it with our example, we get true. If we call it with a slightly larger example, now when we get false in that case, it's because the length is not a multiple of three and here's another one where there's a stop codon in the middle that returns false as well. So, that's the kind of processing that we can do with java string data type that helps us address a scientific problem. The real problem of interest gene finding, it's a little more complicated but not a lot more complicated. We can go ahead and scan left to right through the data. When we find a start codon, we can set a variable to the index where that thing is and then if we find a stop codon and then were a multiple of three, then we found a potential gene. We can print that and then reset are we looking at a gene or not index to minus one which indicates that we're not, and so this is just a trace that you can study for how this method would work. Every time it finds this start codon, it keeps track of that index until it finds a stop and if it finds a stop and it's on a multiple of three, it prints out the gene. So, it finds those two putative genes in this example. The code for this is a bit more complicated than the one I just showed, but not a lot more complicated and we'll leave it as an entertaining programming exercise for you that uses string operations. Again, it's interesting to think about how Java might represent strings, but it's an abstract data type and our program should not be dependent on it, but it's good to have a mental model of what's going on particularly when it comes to performance and we'll come back to that. So, here's a possible way that it might be represented in the Java system. So, our variable genome maybe we'll point to an area of memory that has two things in it. It's got the length of the string. So, we can get that quickly and then it's got a memory address, in this case x, which is the place where all the characters are laid out in sequence. With that representation then, when the Java has to implement the various methods, lots of them can do very efficiently getting at the ith character and so forth with only a few operations. Then for example sub-string, a way that it could work is it looks just like a string. So sub-string s starts at one and then goes to index five. So, this says that s starts at x plus one which is the first place and it's of length four and then it can point to that same area where those letters are and that's a very efficient way to represent sub-strings and then t starts at nine is of length four and so forth. So that's one way to represent strings and again, maybe our programs don't depend on that, but it does have some implications. One thing is that s and t are not the same string, there different strings. It might be another memory representation where someone decides it's better to implement sub-string such that s and t actually would be the same strings. If you say s equal equal t, then that's going to be false, because those things are different. Equal equal just would compare the addresses, the references. But you do have to say that s.equals(t), then that's the whole purpose of the equals method. It's going to compare those character sequences and tell you whether or not those strings have the same character sequences. So, the point of this is that the API doesn't always have all the details. You need to know a little bit about the context in order to fully understand how all the methods work. All right. So, here's the summary. For object-oriented programming, what you're going to be able to do is to create your own data types, your own sets of values and operations on them and then use them in your programs. Object-oriented programming it's about manipulating objects which are things that hold the data type value. Variable names refer to objects, but really what we're doing is manipulating the objects in the datatype values that they hold. In Java, what programs do is manipulate references to object. All of these things that we've talked about are what's called reference types which is different than a primitive type. Primitive type like a double, the variable refers actually to the value. All of these others variable refers to a reference to the value. So boolean int and double so far there are exceptions to this rule and actually, in some object-oriented programming languages, there's no separate primitive types and some people believe that language shouldn't have them, but the great efficiency gains that you get with them has proved very effective for Java, so the purists versus the practitioner struggle continues. If you're a practical program, you want those primitive types. But the main idea is in this lecture is that you can write programs that can manipulate abstractions like colors, sounds, pictures, strings. In the next lecture, you're going to learn how to define your own abstractions and write programs that manipulate them and that's when you've really brought your ability to program up to a whole new level and we'll see many examples of that.