Hey, let us see how many different tasks in NLP can be solved as sequence-to-sequence tasks. We have talked a lot about machine translation, that is the obvious one, but there are many other options. For example, you can do speech recognition, and there is a model called Listen, Attend and Spell. Or you can do image caption generation; this is also an encoder-decoder architecture, and the paper is called Show, Attend and Tell. So these tasks are very similar; however, every task can be solved better if you think a little bit about the specific constraints that you have in that task. In this video, we'll speak in more detail about text simplification, and we will see that, well, we can use just the encoder-decoder architecture, but if we think a little bit about the specific objectives of this task, we can improve.

Okay, let us start with summarization for now. Summarization is the task where you need to produce a short summary of some document, and it can be of several types. We can speak about extractive summarization, which means that we just extract some pieces of the original text. Or we can speak about abstractive summarization, which means that we want to generate a summary that is not necessarily built from pieces of our text, but that nicely summarizes the whole text. This is obviously better, but it is also obviously more difficult, so most production systems use extractive summaries.

Now, let us try to do summarization with a sequence-to-sequence model. You get your full text as the input and your summary as the output, with an encoder and a decoder as usual. Next, you need a good dataset to train your model. For example, the English Gigaword dataset is really huge, and it contains examples of articles and their headlines. Now you apply your model; there is even an open-source implementation of the model, so you can just use it right away and get some results. The results are rather promising: you can see the first sentence of the article and, on the right, the headline generated by our model. Actually, there are some problems with this model, and we will speak about them in another video of our course, but for now we can just say that it works somehow.

Let us move forward and discuss another very related task, which is called simplification. The text simplification task also needs a good dataset to train on, and one good candidate is Simple Wikipedia. You can see that you have some normal Wikipedia sentences and some Simple Wikipedia sentences; what can be different there? For example, you can have deletions: in the second example, you just delete two pieces, while in the first example you rephrase some pieces. So what kinds of operations can you use to modify these sentences? Well, as I have already said, you can delete, you can paraphrase, or you can split one sentence into two simpler, smaller sentences. Paraphrasing is a rather general approach, and you can do different things: you can reorder words, or you can do some syntactic analysis of your sentence, understand that some syntactic structures are simpler and more usual, and just substitute one syntactic structure with another.

A straightforward way to do this would be a rule-based approach. Actually, we do not cover rule-based approaches a lot in our course; well, maybe they are not as fancy as deep neural networks. Usually people want to hear about deep neural networks more, but to be honest, a rule-based approach is a very good thing that works in production. So if you just want to be sure that your model performs okay, it's a good idea to start with implementing some specific rules for your task. For text simplification, these can be simple substitutions, some context-free grammar rules that tell you that, for example, "solely" can be simplified to "only", or that instead of saying "something population", you should better say "the population of something", okay? There are lots of such rules; you can either know them, for example if you have some linguists, or you can learn them from data. The Paraphrase Database (PPDB) is one big data source that also contains such learned rules.
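To make this concrete, here is a minimal sketch of such a substitution-based simplifier. The two rules mirror the examples above; a real system would load thousands of rules, for instance from PPDB, rather than hard-code them:

```python
import re

# Toy substitution rules in the spirit of the lecture:
# each rule rewrites a "complex" pattern into a simpler phrasing.
RULES = [
    (r"\bsolely\b", "only"),
    # "(the) X population" -> "the population of X"
    (r"\b(?:the )?(\w+) population\b", r"the population of \1"),
]

def simplify(sentence: str) -> str:
    """Apply every substitution rule to the sentence, in order."""
    for pattern, replacement in RULES:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

print(simplify("Growth depends solely on the France population."))
# -> "Growth depends only on the population of France."
```

The appeal of this approach is exactly what the lecture points out: every rule is transparent, so you can be sure how the system behaves, which is hard to guarantee with a neural model.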
Another approach would be to still do some deep learning, and even reinforcement learning. It is not easy to make that model work, but I just want to give you a general, very hand-wavy idea of how it could be done. You can do just an encoder-decoder architecture as usual, but this architecture is likely not to simplify well, because it doesn't have any simplification objective built in. One way to build in this objective is weak supervision by reinforcement learning. What do I mean by that? In reinforcement learning, we usually have some agent that performs actions. Here, the action would be to generate the next word. Usually we also have some policy, which is the probability distribution over actions; in this case, it will be the probabilities of the next word given everything else. The agent performs actions according to the policy and gets some rewards: if the generated sentence is good, then the reward should be high.

One very creative step is how we estimate this reward, and the idea is to do it in three parts. Adequacy is about whether the simplified sentence still expresses the same facts as the original sentence. Fluency is about the coherence of the sentence under a language model. And simplicity is about whether the simplified version is indeed simpler than the original one.

A super high-level architecture would be as follows. You have your encoder-decoder agent that can generate some sequences. For every sequence, you get some rewards based on simplicity, relevance, and fluency. These rewards go into the REINFORCE algorithm, which we do not cover right now, but you need to know that this algorithm can use the rewards to update the policy of the agent, so that on the next step the agent will be more likely to take those actions that give higher rewards. In some sense it is similar to gradient descent, you would say, but the important distinction is that the rewards are usually not differentiable. Reinforcement learning is really helpful when you cannot just say that you have some loss function and you need to optimize it, but can only say: well, this is simple, this is not simple, so here the reward is high, and here the reward is low. If the reward is like that, you cannot just take gradients and do stochastic gradient descent, and that's why you apply something a little bit more magical, which is called the REINFORCE algorithm; a rough sketch follows below.
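Just to make the update a bit less magical, here is a minimal, hand-wavy sketch of one REINFORCE step for such an agent, written with PyTorch. The `agent` interface and `reward_fn` are placeholder names assumed for illustration, not the actual implementation from any paper:

```python
import torch

def reinforce_step(agent, optimizer, source, reward_fn):
    """One REINFORCE update for a seq2seq agent (sketch).

    Assumes agent.sample(source) returns the sampled output tokens
    and a tensor with the log-probability of each sampled token, and
    that reward_fn scores the whole output, e.g. as a combination of
    adequacy, fluency, and simplicity terms.
    """
    tokens, log_probs = agent.sample(source)   # roll out the current policy
    reward = reward_fn(source, tokens)         # a scalar; NOT differentiable

    # REINFORCE trick: the gradient of the expected reward is
    # reward * grad(log pi). We minimize the negative, so samples
    # with high reward get their log-probabilities pushed up.
    loss = -reward * log_probs.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

A real implementation would also subtract a baseline from the reward to reduce the variance of this estimate, but the core idea is the same: high-reward sequences become more probable under the policy.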
Now, I just want to go into the details of one piece on this slide: simplicity. How do we measure simplicity? Well, we have three kinds of information: the input, which is the normal text; the references, which are the golden simplified sentences; and the output of our system. We need to compare all of them to understand whether we perform well. For machine translation, you would compare just the human references with the system outputs, right? Because the input is usually in some other language. But here it is very important to compare all of them.

For example, one nice measure that can be used is called SARI, which compares the system output against the references and against the input. It computes precision scores for different types of operations: addition, copying, and deletion. For example, what would be the precision for the addition operation? Well, what are the terms that we add? These are the terms that occur in the output but do not occur in the input, and this is exactly what we see in the denominator. Now, how many of them occur in the references? This is exactly what we see in the numerator. So we have a precision score that measures how many of the added terms are indeed correct. You can likewise think about recall for addition, and about precision and recall for the other operations, and somehow average them all to get the final score.

I want to show you that this score actually works. For example, we have the input, three references, and three outputs. You can see that the second output is definitely better than the third one, because its "now" is a simplification of "currently" from the input. And this score can distinguish them, because we compare everything with the input. This does not happen with the BLEU score from machine translation: there, we compare just the output and the references, so BLEU thinks that system number two and system number three behave just the same.
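Since the definition is easy to miss when spoken, here is a minimal word-level sketch of that addition precision. The real SARI metric works on n-grams, also handles recall and the other operations, and averages everything, so treat this as an illustration only; the example sentences are made up:

```python
def addition_precision(input_text: str, references: list[str], output: str) -> float:
    """Precision of the 'addition' operation, word-level sketch.

    Added terms = words in the output that do not occur in the input.
    Precision   = fraction of added terms that occur in some reference.
    """
    input_words = set(input_text.split())
    reference_words = {w for ref in references for w in ref.split()}
    added = set(output.split()) - input_words   # denominator: what we added
    if not added:
        return 1.0                              # nothing added, nothing wrong
    correct = added & reference_words           # numerator: additions the references confirm
    return len(correct) / len(added)

# Made-up example: replacing "currently" with "now" is an addition
# the references confirm; adding "recently" is not.
inp = "About 95 species are currently known ."
refs = ["About 95 species are now known .",
        "About 95 species are presently known ."]
print(addition_precision(inp, refs, "About 95 species are now known ."))       # 1.0
print(addition_precision(inp, refs, "About 95 species are recently known ."))  # 0.0
```

Here the score rewards added words that the references confirm ("now") and penalizes ones they do not ("recently"), and it can only do that because the input sentence enters the computation.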