0:02

In this video, we will cover several probabilistic graphical models for sequence tagging tasks.

You already know one of them: the Hidden Markov Model, and you briefly know how to train and apply it.

In this video, we'll cover a few more and apply them to Named Entity Recognition, which is a good example of a sequence tagging task.

So, this is a recap of the Hidden Markov Model. You may remember the formula, and one important thing to note is that it is a generative model, which means it models the joint probability of x and y.

The picture at the bottom of the slide illustrates the formula: every arrow corresponds to a particular probability in the formula. We have transition probabilities going from one y to the next, and output probabilities going to the x variables.
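As a concrete illustration of how the joint probability decomposes into transition and output factors, here is a tiny sketch in Python. All tags, words, and probability values are made up for illustration:

```python
# Toy HMM: p(x, y) = product over t of p(y_t | y_{t-1}) * p(x_t | y_t).
# Tags, words, and numbers below are invented for illustration only.
transition = {  # p(current tag | previous tag); "<s>" marks the sequence start
    ("<s>", "PER"): 0.4, ("<s>", "O"): 0.6,
    ("PER", "O"): 0.8, ("PER", "PER"): 0.2,
    ("O", "O"): 0.7, ("O", "PER"): 0.3,
}
emission = {  # p(word | tag)
    ("PER", "john"): 0.5, ("PER", "runs"): 0.0,
    ("O", "john"): 0.01, ("O", "runs"): 0.2,
}

def hmm_joint(words, tags):
    """Joint probability p(x, y): multiply a transition and an output
    probability at every position, exactly as the arrows in the picture."""
    prob = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        prob *= transition[(prev, tag)] * emission[(tag, word)]
        prev = tag
    return prob

p = hmm_joint(["john", "runs"], ["PER", "O"])  # 0.4 * 0.5 * 0.8 * 0.2
```

Each factor in the product corresponds to one arrow in the graph on the slide.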

Now, another one is the Maximum Entropy Markov Model. It is very similar. So, do you see what is different in the picture? Only one thing has changed: the arrows now go from the x's to the y's.

Okay? And in the formula, we can see that the factorization is now a little different: we have the probability of the current tag given the previous tag and the current x.

What is also important is that this model is discriminative, which means it models the conditional probability of y given x.

So, it doesn't try to describe how the text is generated. It says that the text is observed, and we just need to produce probabilities for our hidden variables, y.

Another important thing to mention is that the model is still factorized nicely, right? You have separate probabilities, and you take a product of these probabilities.

Let us look in more detail at how each probability can be written down. In this model, every probability looks like this. It may look a little scary, but let us see what is here.

We have an exponent and a normalization constant, so this is actually just a softmax: a softmax applied to whatever is in the brackets. And what do we see in the brackets? Something linear: weights multiplied by features. You can think of a vector of weights and a vector of features, and you take their dot product.

You probably have a feeling that you have already seen something similar in machine learning.

So, do you remember a similar model? Well, logistic regression is actually very similar to the Maximum Entropy Markov Model: there, you also have a softmax applied to a dot product of features and weights. The only difference is that here, you have rather complicated features that can depend on the current and the previous tags. So, the model knows about the sequence, and it knows that the tags are not just separate.

Actually, these features are a very interesting question, because it is our job to generate them somehow, so we will get back to this question in a few slides.
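To make the analogy with logistic regression concrete, here is a sketch of one local MEMM factor: a softmax over tags of the dot product between a weight vector and features that depend on the previous tag and the current word. The feature names and weight values are my own invention, not from the lecture:

```python
import math

TAGS = ["PER", "O"]

def features(prev_tag, word, tag):
    """Binary features depending on (previous tag, current word, candidate tag).
    Hypothetical feature templates, for illustration only."""
    return {
        "word=%s|tag=%s" % (word, tag): 1.0,
        "prev=%s|tag=%s" % (prev_tag, tag): 1.0,
        "capitalized|tag=%s" % tag: 1.0 if word[0].isupper() else 0.0,
    }

# Made-up "learned" weights; any feature not listed implicitly has weight 0.
weights = {
    "word=John|tag=PER": 2.0,
    "capitalized|tag=PER": 1.5,
    "prev=O|tag=O": 0.5,
}

def memm_prob(prev_tag, word, tag):
    """p(tag | prev_tag, word): softmax over tags of the dot product w . f."""
    scores = {
        t: sum(weights.get(name, 0.0) * value
               for name, value in features(prev_tag, word, t).items())
        for t in TAGS
    }
    z = sum(math.exp(s) for s in scores.values())  # local normalization
    return math.exp(scores[tag]) / z
```

Just as in logistic regression, each position has its own locally normalized softmax; the sequence enters only through the previous-tag features.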

But now, let us write down one more probabilistic graphical model that can be even more powerful: the Conditional Random Field.

First, you can see that it is still a discriminative model: it models the probability of y given x. You can also see that it is similar to the previous one; for example, in the brackets you still have the dot product of weights and features.

But what is the difference? Do you see the important difference between the CRF and the Maximum Entropy Markov Model? The thing is that now you have only one normalization constant, and it stays outside of the product. So you don't have any probabilities inside at all: the model is not factorized into probabilities. Instead, we have a product of potential functions, and then we normalize the whole product to get the final probability.

And this normalization is actually complicated, because we have many different sequences, and we have to normalize in such a way that the probability sums to one over all possible sequences of tags.
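The difference from the MEMM is easy to see in code: we score whole tag sequences with unnormalized potentials and divide by a single partition function that sums over every possible tag sequence. A brute-force sketch with toy scores (real implementations use dynamic programming instead of enumerating all sequences):

```python
import math
from itertools import product

TAGS = ["PER", "O"]

def score(words, tags):
    """Unnormalized sequence score: sum of toy transition and output weights.
    The weights are invented for illustration."""
    s = 0.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        s += 1.0 if (prev, tag) != ("PER", "PER") else -1.0        # transition feature
        s += 2.0 if (word[0].isupper() and tag == "PER") else 0.0  # output feature
        prev = tag
    return s

def crf_prob(words, tags):
    """p(tags | words): one GLOBAL normalization over all tag sequences."""
    z = sum(math.exp(score(words, t))
            for t in product(TAGS, repeat=len(words)))
    return math.exp(score(words, tags)) / z

words = ["John", "runs"]
total = sum(crf_prob(words, t) for t in product(TAGS, repeat=len(words)))
```

The single division by z is exactly the one normalization constant that stays outside the product; `total` comes out as 1 because the normalization runs over all tag sequences at once.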

Now, when we depict this model as a graph, it is an undirected graph: there are no arrows at all, just links between the nodes.

Actually, in this picture I write down a more specific case than the formula at the top of the slide. At the top of the slide, you can see that the features can depend on three things, while here I kept only two: one type of features for transitions and another type of features for outputs. Obviously, I could have something more complicated.

So, the general form of a Conditional Random Field looks like this: you have some arbitrary factors that depend on some groups of y variables and x variables. In the picture, the small green boxes stand for these different factors; you multiply them, then normalize, and you get your probabilities.

So, this is a rather general model, and maybe you are already lost with all those options. If this is the case, I have good news for you: you will probably not need to implement this model yourself, because there are lots of black-box implementations of CRFs. Here are just some links to check out; for example, Mallet is a nice library that has a CRF implementation.

But what we do have to do is generate features to feed into the CRF model. So, in the rest of the video, we'll discuss how to generate those features.

From the formulas, you might remember that those "f" features can depend on three things: the current tag, the previous tag, and the current output.

Now, so as not to be overwhelmed by the variety of possible features, there is a very nice common technique called label-observation features. It says that you are only allowed to have features of this kind. The observation part depends on the output; we will get to it in a moment. The green part, the label part, is an indicator: you check whether the current label is equal to y, and you check it for all possible labels.

Okay? So, it means that you have as many features as you have labels, multiplied by the number of different observation functions that you invent. And in the case of the second and the third lines, you will have even more features, because there you check these indicators for both the current and the previous tags. So, we are going to have lots of them.

Now, how will those observation parts look? This is just an example taken from a paper, and it shows that you can be as creative as you want.

First, you can check whether your current word is equal to some predefined word, and you can check it for every word in the vocabulary. So again, you get, let's say, 100,000 extra features just from the first line. Then, you may want to check the part-of-speech tag of the current word, produced by some external part-of-speech tagger; again, you get many binary features here that tell you whether the tag is a noun, or a verb, and so on for all possible tags.

And there are lots of other ways to define features: for example, you can check whether the word is capitalized, whether it has a dash in the middle, whether it consists only of capital letters, or whether it appears in some predefined list of stop words or some predefined list of names.
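Here is what such observation functions could look like in code. The feature names and word lists are my own toy examples, not from any specific paper; each of these binary observations would then be crossed with every possible label to form label-observation features:

```python
STOP_WORDS = {"the", "a", "of"}       # toy predefined lists,
FIRST_NAMES = {"john", "mary"}        # invented for illustration

def observations(word):
    """Binary observation features for one word: word identity,
    surface shape, and membership in predefined lists."""
    return {
        "word=%s" % word.lower(): 1.0,                          # word-identity feature
        "is_capitalized": 1.0 if word[0].isupper() else 0.0,
        "has_dash": 1.0 if "-" in word else 0.0,
        "all_caps": 1.0 if word.isupper() else 0.0,
        "in_stop_words": 1.0 if word.lower() in STOP_WORDS else 0.0,
        "in_first_names": 1.0 if word.lower() in FIRST_NAMES else 0.0,
    }
```

The word-identity line alone contributes one feature per vocabulary word, which is where the huge feature counts come from.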

So, this is what your work would actually look like if you decided to use a CRF for some sequence tagging task: you would take some implementation, and you would generate as many useful features as you can think of.

Now, one trick that I want to mention here: even though we say that a feature can depend only on the current output, x_t, it is not really restricted like that. Honestly, we can put into this x_t anything we want. For example, we can say that our current x_t consists of the current word, the previous word, and the next word, and we have features for all of them. So, you should multiply that tremendous number of features by three right now. And that is okay.

The model will not break down, precisely because it is a discriminative model: we do not care about modeling the sequence of x, which means we can do basically whatever we want. This picture depicts the idea: every feature can actually have access to all the words in the sentence.
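The trick of packing the neighboring words into x_t can be sketched like this (a hypothetical helper, with invented feature names, that looks at the whole sentence while producing features for one position):

```python
def window_observations(words, t):
    """Observation features for position t that also look at the previous
    and next words -- allowed because the model is discriminative and
    never has to generate x."""
    feats = {"word=%s" % words[t].lower(): 1.0}
    if t > 0:
        feats["prev_word=%s" % words[t - 1].lower()] = 1.0
    if t + 1 < len(words):
        feats["next_word=%s" % words[t + 1].lower()] = 1.0
    return feats
```

Each position now contributes roughly three times as many features, and nothing in the model needs to change.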

It can look a little messy, but just be prepared that it happens; people think this is okay, this is nice, and this is how it actually works.

So, just to sum up what we have discussed. We have discussed different graphical models for sequence tagging. One of them, the Hidden Markov Model, we covered in more detail: we have seen how to train this model and how to do decoding in it. And in practice, you will usually just need to do feature generation and use some existing implementation of one of these models.

So, this is all for the probabilistic approach to sequence tagging, but we will also discuss neural networks for this task later.
