0:15

We want to know is a given string an element of a given set of strings or

Â is it in a language, a given language?

Â So here's an example from computational biochemistry.

Â One way to represent an amino acid is with one of

Â the characters CABLI that are listed there.

Â And each one of those is a standard representation of an amino acid.

Â A protein is a string of amino acids, so

Â it's just a string consisting of those letters.

Â So there's a particular type of protein that is referred

Â as a C2H2 type zinc finger domain signature.

Â And that's well defined by these specific rules.

Â It's a C followed by 2, 3, or 4 amino acids, that could be any one of them,

Â followed by another C followed by exactly three amino acids,

Â followed one of these 9 specific ones L, I, V, M, F, Y,

Â W, C, or X, followed by 8 more amino acids,

Â followed by H then followed by 3, 4 or 5 amino acids followed by H.

Â It's a pretty complicated set of rules, but, well, life is complicated.

Â 1:31

So we might have the question how a given protein is.

Â Does it follow these rules?

Â Is it in the C2-H2 type zinc finger domain?

Â So there's a protein, so let's see if it actually follows these rules.

Â Whether it actually follows these rules or not.

Â Is it in the language defined by those rules?

Â Is it in the set of strings defined by those rules?

Â And the answer in this case is yes.

Â Starts with a C, then there is 3 amino acids, so that is the first bullet point.

Â Then there is another C, and then exactly 3 amino acids.

Â And then the next letter is Y, so that's one of the,

Â makes the last bullet, that's one of the ones listed.

Â And then there is 8 amino acids.

Â And then the next one is H, and then there's 3.

Â That's okay the last bullet point 3, 4, or 5.

Â And then finally an H.

Â So in this case this string is in the language.

Â And clearly we want to be able to answer this type of question for

Â a broad variety of types of proteins.

Â 2:44

Here's another example from commercial computing.

Â Suppose you want to define what an email address is.

Â It's a sequence of letters followed by an @,

Â followed by another non-empty sequence of letters followed by a dot.

Â May be any number of occurrences of that.

Â And then ending up with edu or com or gov or some other legal suffix,

Â where you would expect to find in the end of an e-mail address.

Â So you can ask then, which of the following are e-mail addresses?

Â So this one's got a sequence of letter, at sign, some more letters a dot,

Â some more letters a dot, ending in edu.

Â So that one is in the language defined by these rules.

Â So this phrase, not an email address, it's not an email address.

Â It doesn't have the at sign or any dots.

Â 3:35

There's another one that fits.

Â This one doesn't end in edu or com, so it's not an email address.

Â This one doesn't fit according to those rules, but

Â maybe we need to fix the description to allow for some numbers, say.

Â In actually specifying precisely what's a legal email address and

Â what is not a legal email address can get pretty complicated.

Â So this is just to indicate that it's not always so easy to specify the rules.

Â So the challenge is we want to develop a precise description of the set of

Â strings in question.

Â And that's true for many of these types of problems.

Â And we'll look at a way to do that soon.

Â 4:28

So nucleic acid is one of the letters A, C, T, or G.

Â And we've seen examples before with this representation.

Â A genome is a string of nucleic acids.

Â So there's a thing called a fragile X syndrome pattern.

Â And so that's a genome that is defined by the following rules.

Â It's got gcg somewhere inside, followed by any

Â number of cgg or agg triplets, followed by ctg.

Â And, in fact, this is important because the number of triplets correlates

Â with this fragile X Syndrome, which is a common cause of mental retardation.

Â So if you have a genome,

Â you might want to know this in order to deal with this illness.

Â 6:02

So regular expression, an RE, is either the empty set or an empty string,

Â or it's a single character or a wildcard symbol that represents all characters.

Â Or it's an RE enclosed in parenthesis, or it's the concatenation of two or more REs.

Â So here's examples of these rules.

Â So aabaab, that's a concatenation of a bunch of

Â single characters that specifies the set consisting of that one string.

Â So only that string, aabaab is in the set specified by that regular expression, or

Â in the language defined by that regular expression.

Â And every other string is not in that set.

Â And if we use a wildcard character, then we can specify multiple strings.

Â So we say .u.u.u., that's all those seven letter words with

Â alternating u's, starting at the second character.

Â And there maybe lots of words in the dictionary,

Â this are the ones in the dictionary but

Â of course it specifies any string of several letters with those use in there.

Â So those are examples of strength that are in the sets specified by that

Â regular expression.

Â And then if you don't have the u strictly alternating like that,

Â then it's not matching.

Â So that's simple examples of regular expressions using concatenation and

Â wild card.

Â And then we can have more complicated operations for

Â the union of two or more regular expressions.

Â That's just one or the other.

Â And we represent that with a vertical bar.

Â So in this simple example, only the two strings aa and baab are in the language

Â defined by that regular expression, and every other string is not in the language.

Â And then there's what's called the closure.

Â And that's any number of occurrences of BRE.

Â So ab*a means a followed by zero or

Â more occurrences of b followed by a.

Â So aa, abbba, a followed by any number of b's followed

Â by a is in the set defined by that string, that regular expression.

Â But these other strings are not in the set defined by ab*a.

Â First one doesn't end in an a, the second one's got an a in the middle there,

Â that's not allowed by the definition.

Â Just interesting to note that the closure operation already takes us to infinity,

Â that's specifying an infinite set of strings.

Â The number of b's is not specified, could go over that bound.

Â And then we can use parenthesization to make up complicated

Â expressions describing specific sets.

Â So for example, a followed by parenthesized a over b, aab.

Â Again, there's two strings that are in that set, aaaab and abaab.

Â And every other string is not in the set defined by that regular expression.

Â Or if we say ab*a, that's a way of saying alternating ab's ending in H.

Â And if you have a two a's in a row or two b's in a row anywhere,

Â it says not in the set defined by that regular expression.

Â Again, an infinite number of strings.

Â 10:03

So the raspberry and

Â crispbread are in the language defined by that regular expression, but

Â these other ones do not have spb in that order contiguous anywhere inside.

Â And the .* just allows matching of any number of characters,

Â so wherever the spb is, that would be a match,

Â as long as it's somewhere in the string.

Â Here's a much more complicated one, a* or

Â (a*ba*ba*ba*)*, whole thing starred.

Â 10:54

aaa, we can take the star on the right to be zero occurrences.

Â So it's got zero b's, we'll count that a multiple of three.

Â But if you have a multiple of three b's anywhere in there,

Â you can match it to that regular expression.

Â It's in the language defined by that regular expression.

Â If you don't have a multiple of three b's, then it's not in the set.

Â Already we're starting to see a little bit of computing

Â just with this simple notation.

Â 12:37

We can specify a class of characters by putting a range.

Â So little a to little z would be lowercase characters.

Â Capital A to capital Z is capitalized characters.

Â And inside the brackets it says, take a character from that sequence.

Â And with that we can check capitalization, or

Â we can specify languages that have proper capitalization.

Â Rather than + or *, you might want to say exactly a certain number of occurrences.

Â So that's within curly braces.

Â So this says take a five-digit number, followed by a four-digit number.

Â So maybe that specifies whether the string is a plus-four code.

Â And there's all kinds of other strings that are not in the language, but

Â at least you can test whether it's legal according to the rule.

Â Or maybe between j and k occurrences.

Â So that's {2,4}, that says two, three or four occurrences of dot,

Â which would mean a wildcard character.

Â So that's a and b separated by two, three or four characters.

Â 14:08

So a typical convention is to have \s match any whitespace character like space,

Â tab, newline, and other things like that.

Â And that's useful when processing text.

Â Now, it's important to note that in applications we're going to want to

Â use these things all the time.

Â From the point of view of the theory, they're just shorthand.

Â They're not essential to understanding the fundamental theoretical points that

Â we're going to talk about.

Â So you can always replace one of these shorthand notations by a regular

Â expression that follows the rules, using parentheses or star concatenation only.

Â 15:08

developing the generalized RE for that is pretty straightforward.

Â So we have a C and then we have a .2 through 4, and

Â we have the proviso that, in this case,

Â only the allowed letters representing amino acids are in the alphabet.

Â So that's 2 to 4 of them and then a C followed by exactly 3.

Â And then something from one of the characters L, I, V, M, F, Y, W, C, X.

Â 16:01

Here's a real world application having to do with proteins, and

Â actually scientists, biologists use this website as an integral part of their work.

Â And searching in there has to do with typing a regular expression in

Â that window and getting the result.

Â And more and more other search type tasks are involving regular expressions,

Â because it's an easy to use and compact notation for

Â specifying what it is that you want to see.

Â 16:42

And so this is, again, a generalized regular expression,

Â it says at least one lowercase character followed by @ sign.

Â Followed by a sequence of at least one string of lowercase

Â characters, followed by a dot followed by an edu or com.

Â And again, as an exercise you can figure out how to add other types of extensions,

Â or numbers, or whatever else.

Â 17:10

Again, a very expressive way of specifying a set of strings

Â that you want to consider to be legal, or

Â a language that you want to know is a given string in that language.

Â So now what we want to do is, given that set up, now we have a formal and

Â specific way to specify sets of strings or languages.

Â And now we want to know how we can determine whether or

Â not a given string is in the language, or matches the given regular expression.

Â That's going to be our next task.

Â 17:46

Here's just a couple of pop quizzes, and again, worthwhile to study these

Â types of questions to make sure that you understand the basic idea.

Â So in any kind of quiz or exam on this material,

Â you might be asked which of the following strings match a*bb(ab|ba)*?

Â Well, you have to figure out a way to get through this question by just

Â working through.

Â In this case, what do you think about abb?

Â Well, looking at the RE, the star says take any number of a's.

Â I can certainly just take one, and then I take the bb, and

Â then the rest of it, I'll take zero occurences of.

Â So that one certainly is in the set it describes.

Â What about the second one?

Â Second one doesn't work because every string in that language

Â has to have two consecutive b's in it.

Â What about the third one?

Â Again, after this one,

Â after you get the bb there is no way to get just a single a at the end.

Â You have to have ab or ba.

Â So that one doesn't work, that one does.

Â No a's, two b's, then take a ba and

Â an ab and that works.

Â There's no c anywhere, so number 5 is easy.

Â And this one is more complicated reasoning to figure out why that string is not in

Â there, don't have an even number of characters after bb.

Â So you can see immediately, not so

Â easy to know whether given string matches a given regular expression.

Â Okay, here's another one type of question that you would have to answer

Â in any kind of quiz on this material.

Â 19:48

So think about how you would create a regular expression that implements

Â those rules, describe the set of strings that satisfy those rules.

Â Here is the answer, at the beginning there has to be an atg.

Â At the end there has to be one of the three codons, tag, taa, or ttg.

Â In the middle you have a, or c, or t, or g, three times any number of occurrences.

Â