Next, we're going to talk about applications of this theory, and then we'll go back to the fundamental theoretical questions. So, the most important application is a solution to the pattern matching problem, it's called GREP, and it's based on the idea of cleanness theorem which we are given a regular expression, we build a corresponding Deterministic Finite Automaton, and then simulate the operation of that machine. Now in practice, there is a difficulty in that the DFA might have exponentially many states for some kinds of REs, so that limits the use of this specific algorithm in practice. But very soon after a much more efficient algorithm that uses what's called a Nondeterministic Finite Automaton, NFA, was developed. And that's the one that's the basis of GREP. In NFA, we'll talk briefly about later on. It's kind of an imaginary device where the finite automata can go of one of two different states for any given character, and it figures out the right one to go to. It's kind of a mind blowing concept, and you can learn more about that in algorithms course. We will talk about non-determinism again later. So that method is called GREP, Generalized Regular Expression Pattern matcher. It was developed by Ken Thompson who won the Turing Award in 1983 in part for this work. And GREP is everywhere. It's been an indispensable tool for programmers and anyone using computer effectively for decades. And now you find it built into systems of all kinds. It's in many development environments including Java. And we'll look briefly at how it's found within Java next. How you can get a t-shirt that says GREP will find you. And there's many many other products you can find out there. So the first thing is the Java's String class actually implements GREP. So there's a method called matches that takes argument of a string and that string can be a regular expression that's built according to the rules that we've described, and then it returns true if this String matches that regular expression and false otherwise. So if we put in a string re and this is our description of the finger domain protein thing, then we have a specific string, then we can say zincFinger.matches that regular expression. And in this case, we'll get true. So right away the capability of recognizing regular expressions is built right into Java, and it's definitely worthwhile not to use it in these sorts of applications. So here's just an example of a client. You can write a program that takes a regular expression from the command line and it takes strings from standard input and then validates whether or not those strings are in the set defined by the regular expression. So we take our regular expression from the command line, while StdIn is not empty, we'll read an input string and then we'll just print out the result of input.matches, the regular expression given on the command line. So all kinds of applications for this that we've discussed in our examples of regular expressions. So you might want to type the zincFinger domain thing, and there is a few things that you have to worry about in terms of quotes around the string. But there are details and then type in a string and it says yep that one's in there or a type in another one it says, no it's not. So if you know the regular expression and you might not have to go to the website, you could deal with problems like this on your own. This is another example that maybe useful in programming languages as an example of what's a legal Java identifier. So ident123 is legal but 123ident is not, you can't start with a number. It's just another example. And here's our email example, our simplified email address example. And again, this stuff is all built into Java. So for a broad variety of applications you can put a regular expression on the command line and then a bunch of strings and try to understand how that regular expression is working to validate whether or not the strings that you have are in the language described by the regular expression. That's a very simple client. There's actually many other useful methods related to regular expressions built into Java. So you can use search and replace defined by regular expressions. And then what's called parsing is taking a string and dividing it up according to the appearance of substrings matching a regular expression. I'm not going to spend a lot of time with this. It's not appropriate to cover this level of detail in lecture, but wanted to make people aware that there's quite a bit of powerful string processing built right into Java's String class. So for example, you can call replace of a given string and it's first argument is a regular expression and the second argument is a string, and you can just replace all occurrences of the substrings match in the regular expression with your given string. This is useful to clean up data of all kinds. We'll see some examples. Or the other thing is to split the string around occurrences of rematches of the given regular expression and then there is a lot of occurrences of the regular expression in our big string. Then this will make an array of the substrings that appear in between. And again we use this in all kinds of applications. So here's an example of cleaning up data. So we want to clean up a white space in a text string. So we want a process, a text string but we don't care if there's three spaces between words or one or if there's a tab or a new line, we want all of those to be treated the same. So we have this tricky notation that often comes up in string processing where we have to use a backslash or two backslashes to specify a single one. So what this means is one or more whitespace characters. And so then if we have a string that is going to read everything on standard input, say the contents of a web page, then we can call replace all to replace all contiguous occurrences of whitespace with a single space. So that's really useful, that's where we get our examples of text for various algorithms or we can do use split around the whitespace and just get the distinct words in a big input text. So a little bit for experts but not so much. Once you've seen a few examples like that, you can see ways to take big strings and clean them up and process them. And we do this quite often in our examples for this course. You can even go way beyond, this is an interface defined for professionals but it's still worth a study where actually what you can do is special classes for taking a regular expression and creating really the machine, the pattern matching machine that can find substrings, matching the pattern in the given input string. And this gives even more flexibility in processing strings. And again, I'm not going to take the time to go through the details of these implementations. But if you're interested, they're worth studying, and they're quite related to the basic concepts that we've been talking about. So here's an example of a client that uses this stuff, and I won't again go through the details of the client just to show the types of things that are possible, just a little bit more study. So maybe you went to harvest information from a huge input stream. So what we do is take the regular expression from the command line, take our input from standard input, file or a web page and then print all the substrings matching the regular expression. And this is not built into the string class, we have to call these special classes built for regular expressions. You can see there in Java util regex. And so here's what the code looks like. So first thing is to just take a regular expression from the command line, then build an input stream, and then read the whole thing, and that's our input stream. Then we build a pattern which takes the regular expression and compiles it into basically an automaton that's going to go through the string and then we build a matcher out of that. And as long as matcher finds the occurrence, we print out what we've seen since the last occurrence, that's what the group is. Pretty simple code and again not my intent that everybody should understand every line of that code. But if you study it, you find it to be useful. So if we have a chromosome and we go ahead and look for our fragile x syndrome we can print out all the occurrences of it in that one chromosome, and that can be a huge file thing like that. And then again it matters how many occurrences of the of triplet appear but the cgg and the agg appear between the two endpoints. Or you might want to get email addresses. So this is a command that we've shown in class for many years for harvesting email addresses off a web page, and you might expect to get our emails out of there. Actually that doesn't work anymore, probably because we've been showing the slide in class because there's no email addresses on that site anymore. And that's true of many sites just for this reason it's so easy now to harvest emails in this way. So in summary, regular expressions and DFAs which we use to recognize language are defined by regular expressions are useful in all kinds of applications. These are just a few that we were listing. You certainly will encounter them as you use computers for more and more sophisticated applications. So our point of this last section is to convince you that the theory is useful. But next, we have to really get into the basic theoretical questions surrounding this. So sure GREP and facilities like that are built into many many systems that you'll be using. But it's only the tip of the iceberg as far as the theory goes. And that's what we'll look at next.