0:37

very easy to understand and we'll, look at some examples. in a little summary from

Â basic theoretical computer science, there's a very important theorem called

Â Kleene's theorem that, was, developed actually here at Princeton in the 30's and

Â it says that, for any discrete finite state automaton, there exists a regular

Â expression that describes the same set of strings. And equivalently, for any regular

Â expression, there's a DFA that recognizes the same set of strings. and this is just,

Â an example. and if you have seen this, this sort, this sort of thing will be

Â familiar and if we have, if you haven't you know, and you will understand better

Â as we get into it. So, a discreet finite state machine is a machine that has states

Â that are labeled and it has transitions from state to state that are labeled with

Â characters. And, the way that it works is, if it's in a state and it reads like for

Â every state is well defined is the next character is zero or one it's another

Â state to go to. So for example at state zero if you see a zero you stay in state

Â zero, if you see a one you go to state one. In state one if you see a zero you

Â stay in state one, if you see a one you go to state two, and so forth. So at every

Â state. read a character and go to the well defined next state. That's a discreet a

Â deterministic final state automaton. and so, that's the machine you start it up on

Â a string, and it reads all of the characters in the string. and then, if it,

Â ends up in a state that's so called terminal state, and started on the start

Â state the terminal state is just in dicated by this X that are on this

Â example. It ends up in the right state you say that it recognizes a set of strings.

Â and so that's a way to determine if a given string is in a specified set. The

Â machine specifies the set. In an RE the, we specify the set by writing characters

Â and stars and meta-characters in parenthesis and this regular expression

Â recognizes these same set of strings. Describes these set of same, same set of

Â strings that's recognized by this DFA. Hank Leany's theorem showed that it's

Â always possible, to, construct such a machine. So that gives a, a basic plan for

Â developing a pattern matching implementation. and this is developed by

Â Ken Thompson at Bell labs, one of the developers of UNIX, and, and this facility

Â was an important part of early UNIX and developed this idea based on the

Â well-known theorem from theoretical computer science. is let's try to build a

Â machine, the same way as we did for Knuth-Morris-Pratt, where we have no

Â backup in the text input string. so we just got a finite state machine. With

Â Knuth-Morris-Pratt we built a, a machine for recognizing the one string how about

Â building one for multiple strings and that would give us a linear time guarantee. so

Â the unruling extractions is determine if it's FSA or DFA and the basic plan is to

Â just go ahead and take your pattern. It describes a set of strings and use that to

Â build a DFA and then simulate the DFA with the Texas input, the same way we did for

Â Knuth-Morris-Pratt And if it accepts, we say we have a match. If it rejects, we

Â don't have a match. this is a, a fine plan but it's got a flaw. the bad news is that

Â The plan is infeasible because the number of states in the DFA might be exponential

Â in the length of the RE. So, for it's got too many states Kleene's, the proof of

Â Kleene's theorem, standard proof of Kleene's theorem , involves the

Â exponential number of states. and it's not that difficult to prove if you're

Â interested be sure to look it up. But from a practical, standpoint, too many states,

Â you can't use that as a basis for algorithm. But there's an easy revision.

Â And again, this gets back. It will, it will give us a quadratic time guarantee,

Â and actually, in, in real life, it's usually linear, and all we do is change

Â the abstraction to use a non-deterministic finite state machine, an NFA, rather than

Â a DFA. so, in the same basic plan, we're going to build an NFA. It's a, it's a

Â different kind of machine, but actually, it's also the case that, that, oh yeah.

Â For any regular expression we can build on NFA so and in vice versa cleaning stand to

Â this and so we're going to simulate with the NFA with the text as input. And so,

Â what do we, what do we mean by non-determined as a finite state machine?

Â That's what we have to talk about next. and we'll just do it with this example

Â that we used throughout this lecture. So, It's very similar to the DFA that we have

Â before, now we're going to put the characters in the states and actually, the

Â kind of NFA that we're going to build, we're going to have one state for every

Â character in a regular expression. So this is an NFA, corres-, corresponding to this

Â regular expression. just, to, We, we always enclose the regular expressions in

Â parentheses. just to, make everything work. and then, Got this regular

Â expression here. A star B or ACD. then, we're going show how to build, this NFA.

Â and then, simulate it, to recognize the regular expression. And, how is it

Â different than a DFA So, there's a character in every state, and if the

Â character's a text character, it's the same as before. We read that text

Â character and moved to the next state. But it's a more general kind of machine,

Â because states also have what's called epsilon transition. And with an epsilon

Â transition the machine is allowed to change the state without scanning any

Â text. So, at the beginning the machine go from zero to one, to two, back to one, I'm

Â sorry, two to three, back to two or it go from zero to one over to six there's lots

Â of places the machine could go, without scanning any text character. but we do

Â have the black matc h transitions that scan text characters and so those are the

Â rules the machine operates by. And, the, final rule is when does it accept, when

Â does it decide a strings in the pattern. It accepts if there's any sequence of

Â transitions that scans all the text characters and ends in the accept case.

Â it's a way of specifying an infinite set of strings. but it's got this

Â non-determinism. It's not always determined what the machine, will do next.

Â It's a little bit of a mind blowing concept in theoretical computer science.

Â But this particular example actually shows how such, such a concept can be made

Â concrete and actually give us a widely useful algorithm. One way to think of a

Â non-deterministic machine is a machine that has super powers and can guess the

Â proper sequence of state transitions to get to the end. Get to the accept state.

Â another way to think about is the sequences if you provide us particular

Â sequences that's a proof that the machine accepts the text. and so this is a real

Â machine. We don't have a real machines that can guess the right answer but it's a

Â completely well specified abstract machine, and we can write a program to

Â stimulate it's operation, and that's what we are going to do. So let's just make

Â sure that everyone's got the concept down. So, say we have the question is 4A is

Â followed by BD, is that accepted by this NFA down here? and the answer is yes

Â because there is a sequence of legal transitions that ends up in state eleven.

Â So in this case, we'll take an epsilon transition from zero to one to two and

Â then we'll... We've got 4A's so we'll chew up four A's, one, tw-, I'm sorry. One, and

Â then go back. Two, and then go back. Three, and then go back. Four, and then

Â we'll be in state three. And then at that point, we'll decide to take this epsilon

Â transition. We'll assume your machine decides to take this one. then it

Â 10:35

sequences that the machine might guess and go to the wrong state or stall, it doesn't

Â matter as long as there's some sequence and we are going to assume that the

Â machine always guesses the right one. So for example, if the machine just

Â recognizes three A's, one, two, three, and then went to state four, it would get

Â stuck because there's no way for it to get out of state four because state four is

Â looking for a B, and it's sitting on an A, and so forth. So, there's definitely,

Â things that the machine could do that would be wrong, but we don't care, as long

Â as there's some way for it to get through. and then, what about if it's a string

Â that's not in a language recognized by the state. the machine, well, so we have to

Â argue about all possible sequences and, prove that, no sequence of legal

Â transition, is transitions ends in state eleven. and that, seems to be, a fairly

Â complicated sort of argument. So in this case the machine could recognize a bunch

Â of A's, and then go to state four. But again, there's no B, so there's no way

Â it's going to get out of state four. and, so you can make a general argument like

Â looking at the machine that's any string that it recognizes, it has to end in D and

Â this one doesn't end in D. that's a, much more complicated thing, than we're talking

Â about, is to try to come up with, a, simple machine that will decide whether or

Â not a string, is in the, language that it specifies. so the question, so question

Â that we have is, we have this non deterministic machine, how are we going to

Â decide whether a given string is in the language that it recognizes. So for

Â deterministic machines, like we used for Knuth-Morris-Pratt it's very easy, because

Â at every time, there's only one possible transition. But for non-deterministic, if

Â you're, in some states, there's, multiple ways to go, and you have to, figure out

Â the right transition. But actually, the situation isn't so bad. and what we can do

Â is, to simulate the NFA, is just to make sure that we consider all possible,

Â transition sequences. And that's what our algorithm for a regular expression pattern

Â matching is going to be based on. that's what we will look at next.

Â