0:02

Hey, everyone. In this lesson, we are going to discuss the sequence tagging task. We'll start with some examples and then investigate one model that can be used to solve it, called the Hidden Markov Model. Let us get started.

So the problem is as follows. You are given a sequence of tokens, and you would like to generate a sequence of labels for them. Examples would be part-of-speech tagging, named entity recognition, or semantic slot filling, which we have briefly seen in the introduction. Now, for example, I am given a sequence of words, and I want to produce part-of-speech tags like here. For example, "saw" is a verb and "I" is a pronoun, and so on.
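The input and output of the task can be pictured as two parallel sequences. A minimal illustration in Python, with a hypothetical sentence and tag names in the style of a universal tag set:

```python
# A toy tagged sentence: each token is paired with a part-of-speech tag.
# The sentence and the tag names (PRON, VERB, DET, NOUN) are illustrative.
words = ["I", "saw", "a", "cat"]
tags = ["PRON", "VERB", "DET", "NOUN"]

# The tagger's job: map the word sequence to the tag sequence.
tagged = list(zip(words, tags))
print(tagged)
```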

There are different systems to list all possible tags and different systems to decide which tags are important and which are not. One such system is shown here, and you can see that there are some open class words and closed class words and some other labels. For example, we also have some tags for punctuation and symbols, if we see them in the texts.

Another example would be named entity recognition. So, here, I have a sequence, and "last Friday" or "Winnie-the-Pooh" would be some named entities. Sometimes we really need to find them in the texts and use them as features, or maybe we need them to generate answers to some question. What kinds of named entities can we see? First of all, there would be persons, organization names, or locations, but not only them. For example, we can also have dates, units, and any other entities that you see in the texts.

What kinds of approaches work well for this type of task? The first would be a rule-based approach, which we do not cover in detail here. The second would be to take classifiers like the naive Bayes classifier, or maybe logistic regression, and use them separately at each position to classify the label for that position. This is obviously not ideal, because you do not use the information about the sequence, right? So a better idea would be to do sequence modeling, and this is what our lesson is about. Yet another idea would be to use neural networks, and you will hear about them in another lesson.

To come to our model, the hidden Markov model, let us define everything properly. So please remember this notation, and let us go through it. We have a sequence of words, denoted by x, and a sequence of tags, denoted by y. The length of the sequence is big T, and by little t, we denote a position in the sequence.

Now, the task is to produce y given x, that is, to find the most probable y. So we need to find the most probable tags for our words in our model, but to find something most probable, we should first define the model and see what the probabilities are. So we are going to define the joint probability of x and y. Do you understand the equation at the bottom of the slide? Take a moment to see that the right-hand side and the left-hand side can both be used to find the argmax, just because they differ only by the probability of x, which does not depend on y. So if I multiply the left-hand side by the probability of x, it is just a constant for me, and I don't care whether I do it or not.

Let us define the model. This is probably the most important slide of our video, because it tells us what the hidden Markov model is. So we need to produce the probability of x and y, where x are our observable variables and y are our hidden variables. First, we apply the product rule to decompose it into two parts. Then each part should be further simplified, and we can do this by applying two different assumptions.

The first assumption, which is called the Markov assumption, helps us model the probability of the sequence of tags. We say that this probability can be factorized into separate pieces, so we just model the probability of the next tag given the previous tag. You have already seen this in the n-gram language models; there, we applied exactly the same formula to model the probabilities of words. Now, we do this to model the probabilities of tags. Also, maybe you remember that there we had to introduce a special start token so that we could write down the product in this uniform way and handle the first position. So, here, the first term would be the probability of the first tag given some special zero tag, which is just the start tag.

Awesome.

Now, the second assumption is about output probabilities. We need to produce the probability of x given y, which is rather complicated. But we can say that we will factorize it into probabilities for separate positions, and for every position, we will have the probability of the current word given the current tag. So, given these two assumptions, you can write down the formula at the top of the slide, and this is the definition of the hidden Markov model. Please take a moment to remember it.
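As a sketch, the definition just given, p(x, y) = ∏ₜ p(yₜ | yₜ₋₁) · p(xₜ | yₜ), can be computed directly from two probability tables. All tag names and probability values below are toy assumptions, not numbers from the slide:

```python
# Transition probabilities A[prev][next] = p(next tag | previous tag),
# with "<s>" as the special start tag. Toy values for illustration.
A = {
    "<s>":  {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3},
    "PRON": {"PRON": 0.1, "VERB": 0.7, "NOUN": 0.2},
    "VERB": {"PRON": 0.3, "VERB": 0.1, "NOUN": 0.6},
    "NOUN": {"PRON": 0.4, "VERB": 0.4, "NOUN": 0.2},
}
# Output probabilities B[tag][word] = p(word | tag). Toy values again.
B = {
    "PRON": {"I": 0.5, "saw": 0.0, "bees": 0.0},
    "VERB": {"I": 0.0, "saw": 0.4, "bees": 0.0},
    "NOUN": {"I": 0.0, "saw": 0.1, "bees": 0.2},
}

def joint_probability(words, tags, A, B, start="<s>"):
    """p(x, y): product over positions of transition * output probability."""
    p = 1.0
    prev = start
    for word, tag in zip(words, tags):
        p *= A[prev][tag] * B[tag][word]
        prev = tag
    return p

# 0.6*0.5 * 0.7*0.4 * 0.6*0.2
print(joint_probability(["I", "saw", "bees"], ["PRON", "VERB", "NOUN"], A, B))
```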

Now, the hidden Markov model can be used to model texts, so it can be used to generate texts. Let us see how it happens. First, we need to generate a sequence of tags, and after that, we will generate a word given each tag. For example, we will start with our fake start tag, and then we'll generate some tags using the transition probabilities. We generate them and continue like that, and after we are done, we start to generate words given these tags. So we generate a sequence, and we see that the model has generated some nice example from Winnie-the-Pooh here.
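The generation procedure just described can be sketched as follows. The transition and output tables here are toy assumptions, deliberately chosen so that the sampled output is deterministic:

```python
import random

# Toy transition table with a start tag "<s>" and a stop tag "</s>",
# and a toy output table; all probabilities set to 1.0 for determinism.
A = {
    "<s>":  {"PRON": 1.0},
    "PRON": {"VERB": 1.0},
    "VERB": {"NOUN": 1.0},
    "NOUN": {"</s>": 1.0},
}
B = {
    "PRON": {"I": 1.0},
    "VERB": {"saw": 1.0},
    "NOUN": {"Pooh": 1.0},
}

def sample_from(dist):
    """Draw one item from a {item: probability} dictionary."""
    items, weights = zip(*dist.items())
    return random.choices(items, weights=weights)[0]

def generate(A, B, start="<s>", stop="</s>"):
    """First sample tags with A, then sample a word for each tag with B."""
    words, tags = [], []
    tag = sample_from(A[start])
    while tag != stop:
        tags.append(tag)
        words.append(sample_from(B[tag]))
        tag = sample_from(A[tag])
    return words, tags

print(generate(A, B))
```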

Now, let us define once again what the parameters of the model are. A hidden Markov model is defined by the following five components. First, we have some hidden states, those y in our previous notation: a finite set of states, plus a special start state s0. We have transition probabilities that model the probability of the next state given the previous one. Then we have some vocabulary of output words, and for them, we have output probabilities, the probabilities of words given tags.

Now, can you compute how many parameters we have? Well, actually, we have just two matrices of parameters. The A matrix is N plus one by N, because you also have the special start state, and for this start state, you need to model the probabilities of transitioning to all other states. For the output probabilities, the dimension will be the number of states multiplied by the number of output words. So we have lots of parameters, and we need to learn them somehow.

So how can we train the model? Well, let us imagine for now that we have a supervised training data set, which means that we have a sequence of words and a sequence of tags for these words. Then we could just count: we would count how many times we see the tag si followed by the tag sj, and we would normalize this count by the number of times we see the tag si. This way, we get the conditional estimate of seeing the tag sj after the tag si. We could do a similar thing with the output probabilities. There, we would count how many times some particular tag generates some particular output and normalize this by the number of times we see this particular tag. This way, we get conditional probabilities for the output words. I can tell you that these intuitive estimates are also maximum likelihood estimates, but let us go into a little more detail and make sure that we understand how we compute these counts.

So we have many sentences. Let us just concatenate all those sentences into one long corpus of length big T. Then we can compute those counts just by running through the corpus from t equal to one to big T and computing the indicator functions. There, in green, I write down that I see the tag si at position t minus one and then see the tag sj that follows, and there is a similar indicator function in the denominator. So this is still the same formula, just in another notation.
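The counting estimates can be sketched in a few lines. The tiny tagged corpus below is an illustrative assumption, and for simplicity the transition denominator uses the raw tag count, ignoring the sentence-final-position subtlety:

```python
from collections import Counter

# A toy supervised corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("I", "PRON"), ("saw", "VERB"), ("Pooh", "NOUN")],
    [("Pooh", "NOUN"), ("saw", "VERB"), ("me", "PRON")],
]

transitions = Counter()  # count of tag si followed by tag sj
emissions = Counter()    # count of tag si generating word w
tag_counts = Counter()   # count of tag si
for sentence in corpus:
    prev = "<s>"
    for word, tag in sentence:
        transitions[(prev, tag)] += 1
        emissions[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

def a(prev, nxt):
    """Estimate of p(next tag | previous tag); "<s>" occurs once per sentence."""
    prev_total = len(corpus) if prev == "<s>" else tag_counts[prev]
    return transitions[(prev, nxt)] / prev_total

def b(tag, word):
    """Estimate of p(word | tag)."""
    return emissions[(tag, word)] / tag_counts[tag]

print(a("VERB", "NOUN"))  # 1 of 2 VERB occurrences is followed by NOUN
print(b("VERB", "saw"))   # both VERB tokens emit "saw"
```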

But the thing is that in real life, usually, you do not have tags in your training data. The only thing that you see is plain text, and you still need to train the hidden Markov model somehow. How can you do it? Well, obviously, you cannot evaluate those indicator functions, because you don't see the tags at the positions, but you could try to approximate those indicators by some probabilities. So compare this formula with the formula at the bottom of the slide. The only thing that has changed is that instead of indicators, I now have probabilities: something in between zero and one. But how can we get these probabilities?

Well, if we have some trained hidden Markov model, we can apply it to our texts to produce those probabilities. The E-step at the top of the slide says that, given a trained hidden Markov model, we can produce the probability of seeing the tags si and sj at position t. This is something like a three-dimensional array, with t, i, and j as the indices, and it can actually be computed efficiently with dynamic programming. The only thing is that we need to have a trained model there.

So the clever idea is to alternate two steps. The E-step says: let us fix the current parameters of the model, the A and B matrices, and use them to generate those estimates for the probabilities of tags at every position. And the M-step says: let us fix those estimates and use them to update our parameters, and those parameter updates are actually still maximum likelihood estimates in this case. In the slide, I have everything written down for the A matrix, but you can obviously do very similar things to compute the B matrix.

So this is just a sketch of the algorithm; it is called the Baum-Welch algorithm. And it is a special case of a more general algorithm, called the EM algorithm, which we will see several times during our course, every time we have some hidden variables that we do not see in the training data and some observable variables.
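A rough sketch of the E-step/M-step alternation is below. This is not the real Baum-Welch: instead of forward-backward dynamic programming, the E-step posteriors are computed by brute-force enumeration over all tag sequences, which is only feasible for a toy example. States, words, and initial parameter values are all assumptions:

```python
from itertools import product

# Toy setup: two hidden states, two output words, an unlabeled corpus.
states = ["S1", "S2"]
vocab = ["a", "b"]
corpus = [["a", "b", "a"], ["b", "b"]]

# Initialize A (with a start state "<s>") and B non-uniformly.
A = {"<s>": {"S1": 0.6, "S2": 0.4},
     "S1":  {"S1": 0.7, "S2": 0.3},
     "S2":  {"S1": 0.4, "S2": 0.6}}
B = {"S1": {"a": 0.8, "b": 0.2},
     "S2": {"a": 0.3, "b": 0.7}}

def joint(words, tags):
    """p(x, y) under the current A and B."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= A[prev][t] * B[t][w]
        prev = t
    return p

for iteration in range(20):
    # E-step: expected transition and emission counts under current A, B,
    # using exact posteriors p(y | x) obtained by enumerating all y.
    trans = {(i, j): 0.0 for i in ["<s>"] + states for j in states}
    emit = {(i, w): 0.0 for i in states for w in vocab}
    for words in corpus:
        seqs = list(product(states, repeat=len(words)))
        z = sum(joint(words, tags) for tags in seqs)  # p(x)
        for tags in seqs:
            post = joint(words, tags) / z  # p(y | x)
            prev = "<s>"
            for w, t in zip(words, tags):
                trans[(prev, t)] += post
                emit[(t, w)] += post
                prev = t
    # M-step: re-normalize the expected counts into probabilities.
    for i in A:
        total = sum(trans[(i, j)] for j in states)
        for j in states:
            A[i][j] = trans[(i, j)] / total
    for i in states:
        total = sum(emit[(i, w)] for w in vocab)
        for w in vocab:
            B[i][w] = emit[(i, w)] / total

# Each row of A and B is still a probability distribution (sums to ~1).
print(sum(A["S1"].values()), sum(B["S1"].values()))
```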

This is all about training the hidden Markov model. I hope you have got some flavor of how it can be done. In the next video, we will discuss how to apply the hidden Markov model once it's trained.
