
In this video, we will learn how the LSTM and GRU architectures work and understand why they are built the way they are. Up to this point, we have been talking mostly about the simple recurrent neural network. In such a network, the dependence between the hidden units at neighboring time steps is given by a very simple formula: we just compute a non-linear function of a linear combination of the inputs.
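As a concrete sketch of that formula (a minimal NumPy illustration; the names and dimensions are made up, not from the lecture), one step of a simple recurrent layer looks like this:

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, V, W, b):
    """One step of a simple recurrent layer:
    h_t = tanh(V @ x_t + W @ h_prev + b)."""
    return np.tanh(V @ x_t + W @ h_prev + b)

# Toy dimensions: n = 3 inputs, m = 4 hidden units.
rng = np.random.default_rng(0)
n, m = 3, 4
V = rng.normal(size=(m, n))
W = rng.normal(size=(m, m))
b = np.zeros(m)

h = np.zeros(m)                      # initial hidden state
for x_t in rng.normal(size=(5, n)):  # a sequence of 5 inputs
    h = simple_rnn_step(x_t, h, V, W, b)
print(h.shape)  # (4,)
```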

But, as you may remember from the beginning of the week, we can actually use something more sophisticated to compute the next hidden units from the previous ones: for example, an MLP with more than one hidden layer, where a layer is a primitive function. That is what we will do now: we will construct a more effective primitive function to compute the hidden units. You might also think of this as the construction of a new type of recurrent layer.

Let's start from the function we use in the simple recurrent neural network and construct the function for the LSTM step by step. On the slide, you can see the diagram of a simple recurrent layer. Here, the non-linearity and the summation are shown as circles, multiplications by the weight matrices are shown on the corresponding edges, and the bias vectors are dropped from the picture for simplicity.

When we do backpropagation through such a layer, the gradients need to go through the non-linearity and through the multiplication by the recurrent weight matrix W. Both of these things can cause the vanishing gradient problem. The main idea here is to create a short way for the gradients, without any non-linearities or multiplications. The authors of the LSTM propose to do it by adding a new, separate way through the recurrent layer.

So the LSTM layer has its own internal memory C, which other layers of the network don't have access to. In the LSTM layer, at each time step, we compute not only the vector of hidden units H but also a vector of memory cells C of the same dimension. As a result, we have two ways through such a layer: one between the hidden units Ht-1 and Ht, and a second one between the memory cells Ct-1 and Ct.

Now, let's understand what is going on inside the LSTM layer. In the simple recurrent neural network, we compute the hidden units as some non-linear function of a linear combination of the inputs. Here, we do the same, but we do not return this value as the output of the layer: we want to use the internal memory too.

To work with the memory, the LSTM needs at least two controllers: an input gate and an output gate. These gates are vectors of the same dimension as the hidden units. To compute them, we use a similar formula: a non-linearity over a linear combination. We can even rewrite the formulas for the information vector g and the gates as one vector formula, in which the weight matrices V and W and the bias vector b are concatenations of the corresponding parameters.

Suppose there are n inputs x and m hidden units h in the network. Then what size do the matrices V and W and the vector b have? Yep: the matrix V is 3m by n, the matrix W is 3m by m, and the vector b contains 3m elements.
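This shape bookkeeping can be checked directly (a sketch with made-up toy dimensions; the slicing layout for g, the input gate, and the output gate is an illustrative assumption):

```python
import numpy as np

# Toy dimensions: n = 3 inputs x, m = 4 hidden units h.
n, m = 3, 4
rng = np.random.default_rng(1)

# Concatenated parameters for g, the input gate i, and the output gate o:
V = rng.normal(size=(3 * m, n))   # acts on the input x_t
W = rng.normal(size=(3 * m, m))   # acts on the previous hidden state
b = np.zeros(3 * m)

x_t, h_prev = rng.normal(size=n), rng.normal(size=m)
pre = V @ x_t + W @ h_prev + b    # one linear combination for all three

g   = np.tanh(pre[:m])                 # information vector
i_t = 1 / (1 + np.exp(-pre[m:2 * m]))  # input gate, in (0, 1)
o_t = 1 / (1 + np.exp(-pre[2 * m:]))   # output gate, in (0, 1)
print(g.shape, i_t.shape, o_t.shape)   # (4,) (4,) (4,)
```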

For the gates, we use only the sigmoid non-linearity. This is important because we want the elements of the gates to take values from zero to one. In this case, the value one can be interpreted as an open gate and the value zero as a closed gate. If we multiply some information by the gate vector, we get either the same information, or zero, or something in between.

As you've probably already guessed, the input gate controls what to store in the memory: the information vector g is multiplied by the input gate and then added to the memory. The multiplication here is element-wise. The output gate controls what to read from the memory and return to the outer world: the memory cells C are multiplied by the output gate and then returned as the new hidden units.

Now we see that the LSTM has a pretty nice interpretation, with an internal memory and a suite of controllers. But how does this help with the vanishing gradient problem?

As I already mentioned, there is now more than one way through the recurrent layer, and the important thing is that we have at least one short way for the information and for the gradients: the one between the memory cells Ct-1 and Ct. There is no non-linearity or multiplication on this way, so if we calculate the Jacobian of Ct with respect to Ct-1, we see that it is equal to one (the identity). So there is no vanishing problem anymore.
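A tiny scalar illustration of why this additive path matters (the numbers here are made up for the demo): along a plain recurrent path the gradient is multiplied at every step by the weight times the tanh derivative and shrinks, while along the additive cell path the factor stays exactly one.

```python
import numpy as np

T = 50    # number of time steps
w = 0.8   # a recurrent weight with |w| < 1
h = 0.5   # some hidden activation

# Path through the simple recurrent layer:
# each step multiplies the gradient by w * tanh'(pre).
grad_simple = 1.0
for _ in range(T):
    pre = w * h
    grad_simple *= w * (1 - np.tanh(pre) ** 2)  # chain-rule factor
    h = np.tanh(pre)

# Path through the LSTM memory cells c_{t-1} -> c_t (no forget gate):
# dc_t / dc_{t-1} = 1 at every step.
grad_cell = 1.0 ** T

print(grad_simple, grad_cell)  # the first is tiny, the second is exactly 1.0
```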

Is this an ideal architecture? Unfortunately, it's not. We can only write something new into the memory at each time step; we can't erase anything. What if we want to work with really long sequences? The memory cells C have finite capacity, so their values will be a mess after a lot of time steps. We need to be able to erase information from the memory sometimes.

One more gate, which is called the forget gate, will help us with this. We compute it in the same manner as the previous two gates and apply it to the incoming memory cells before doing anything else with them. If the forget gate is closed, we erase the information from the memory. This version of the LSTM, with three gates, is the most standard one nowadays.
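Putting the pieces together, here is a minimal NumPy sketch of this standard three-gate step. It follows the lecture's equations, so the hidden units are o_t * c_t directly; note, as an aside, that many library implementations apply a tanh to the cell first. The names and slicing layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, V, W, b, m):
    """One LSTM step with input, forget, and output gates.
    Parameters V (4m x n), W (4m x m), b (4m,) stack g, i, f, o."""
    pre = V @ x_t + W @ h_prev + b
    g   = np.tanh(pre[:m])            # candidate information
    i_t = sigmoid(pre[m:2 * m])       # input gate
    f_t = sigmoid(pre[2 * m:3 * m])   # forget gate
    o_t = sigmoid(pre[3 * m:])        # output gate
    c_t = f_t * c_prev + i_t * g      # erase, then write (element-wise)
    h_t = o_t * c_t                   # read (libraries often use o_t * tanh(c_t))
    return h_t, c_t

n, m = 3, 4
rng = np.random.default_rng(2)
V = rng.normal(size=(4 * m, n))
W = rng.normal(size=(4 * m, m))
b = np.zeros(4 * m)

h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(6, n)):
    h, c = lstm_step(x_t, h, c, V, W, b, m)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the parameters now stack four blocks (g, i, f, o), which is the "four times more parameters" point made later in the video.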

The forget gate is very important in tasks with long sequences, but because of it, we now have a multiplication on the short way through the LSTM layer. If we compute the Jacobian of Ct with respect to Ct-1, it is now equal to ft, not one, and the forget gate ft takes values from zero to one. So it is usually less than one and may cause the vanishing gradient problem. To deal with this, proper initialization can be used. If the bias of the forget gate is initialized with high positive numbers, for example five, then the forget gate at the first iterations of training is almost equal to one. At the beginning, the LSTM doesn't forget, and it can find long-range dependencies in the data. Later, it learns to forget when necessary.
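A sketch of this initialization trick (the stacked-bias layout is an illustrative assumption; for comparison, Keras exposes a milder version of the same idea as `unit_forget_bias`, which initializes the forget-gate bias to one):

```python
import numpy as np

n, m = 3, 4

# Stacked bias for g, i, f, o (layout is illustrative).
b = np.zeros(4 * m)
b[2 * m:3 * m] = 5.0   # forget-gate slice: start almost fully open

# With zero pre-activations from the inputs, the forget gate is:
f_t = 1 / (1 + np.exp(-b[2 * m:3 * m]))
print(f_t)  # each element ~0.993: the LSTM barely forgets at first
```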

To get some intuition about how the LSTM may behave in practice, let's look at the different extreme regimes in which it can work. On the slide, you can see a different picture of the LSTM layer. Here, the internal memory C is pictured inside the layer as a yellow circle, and the gates are represented by green circles. The regime depends on the states of the gates: each of them is either open or closed.

If only the input and forget gates are open, then the LSTM reads all the information from the inputs and stores it. In the opposite situation, when only the forget and output gates are open, the LSTM carries the information through time and releases it to the next layer. If both the input and output gates are closed, the LSTM either erases all the information or stores it; but in either case, nothing is directed to the outer world, so it doesn't read or write anything.

Now, I have a question for you: which combination of gate values makes the LSTM very similar to the simple recurrent neural network? Yeah, it is when the input and output gates are open and the forget gate is closed. In this case, the LSTM reads everything into the memory and returns the whole memory as an output, so the memory and the hidden units here are essentially the same entities.

Of course, all of these are just extreme cases; in practice, the model's behavior is somewhere in between. Because of all the different regimes, the LSTM can work with the information more accurately. For example, when the simple recurrent neural network reads something from the data, it outputs this information at each time step and gradually forgets it over time. In the same situation, the LSTM can carry the information through time much longer and output it only at particular time steps.

The LSTM has a lot of advantages compared with the simple recurrent neural network, but, at the same time, it has four times more parameters, because each gate and the information vector g has its own set of parameters V, W, and b. This makes the LSTM less efficient in terms of memory and time, and it also makes the GRU architecture more appealing. The GRU architecture is the LSTM's strongest competitor: it has only three times more parameters compared to the simple recurrent neural network, and in terms of quality it works pretty much the same as the LSTM.

The GRU layer doesn't contain an additional internal memory, but it does contain two gates, which are called the reset and update gates. These gates are computed in the same manner as the ones from the LSTM, so they are equal to the sigmoid function of a linear combination of the inputs. As a result, they can take values from zero to one. On the slide, the diagram for computing the gates is pictured as a separate step for simplicity.

The reset gate controls which part of the hidden units from the previous time step we use as an input to the information vector g. It acts quite similarly to the input gate in the LSTM. The update gate controls the balance between storing the previous values of the hidden units and writing the new information into them, so it works as a combination of the input and forget gates from the LSTM.
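A minimal NumPy sketch of one GRU step, following the lecture's convention that an open update gate keeps the previous hidden state (library implementations sometimes swap the roles of u and 1 - u; the names and dimensions here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x_t, h_prev, Vg, Wg, bg, V, W, b, m):
    """One GRU step: reset gate r, update gate u, information vector g.
    An open update gate (u ~ 1) keeps the previous hidden state."""
    pre = Vg @ x_t + Wg @ h_prev + bg
    r = sigmoid(pre[:m])                          # reset gate
    u = sigmoid(pre[m:])                          # update gate
    g = np.tanh(V @ x_t + W @ (r * h_prev) + b)   # information vector
    return u * h_prev + (1 - u) * g               # balance old vs. new

n, m = 3, 4
rng = np.random.default_rng(4)
Vg, Wg, bg = rng.normal(size=(2 * m, n)), rng.normal(size=(2 * m, m)), np.zeros(2 * m)
V,  W,  b  = rng.normal(size=(m, n)),     rng.normal(size=(m, m)),     np.zeros(m)

h = np.zeros(m)
for x_t in rng.normal(size=(6, n)):
    h = gru_step(x_t, h, Vg, Wg, bg, V, W, b, m)
print(h.shape)  # (4,)
```

Here there are three parameter blocks (r, u, g) instead of the LSTM's four, which matches the "three times more parameters" count above.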

The situation with the vanishing gradient problem in the GRU is very similar to the one in the LSTM. Here, we have a short way through the layer with only one multiplication, by the update gate, on it. This short way is actually an identity skip connection from ht-1 to ht, which is additionally controlled by the update gate.

Which initialization trick should be used to make the GRU stable to vanishing gradients? We should initialize the bias vector of the update gate with some high positive numbers. Then, at the beginning of training, the gradients go through this multiplication very easily, and the network is capable of finding long-range dependencies in the data. But you should not use numbers that are too high here, since if the update gate is open, the GRU layer doesn't pay much attention to the inputs x.

We discussed two architectures, LSTM and GRU. How do we understand which of them to use in a particular task? There is no obvious answer that one of them is better than the other, but there is a rule of thumb. First, train the LSTM, since it has more parameters and can be a little more flexible. Then train the GRU, and if it works the same or the quality difference is negligible, use the GRU; otherwise, return to the LSTM.

Also, you can use a multi-layer recurrent neural network: you can stack several recurrent layers, as shown in the picture. In this case, for the last layer, it's better to use the LSTM, since the GRU doesn't have an analogue of the output gate and cannot work with the output as accurately as the LSTM. For all the other layers, you can use either LSTM or GRU; it doesn't matter that much.
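The stacking idea itself can be sketched as follows, with simple recurrent layers standing in for LSTM or GRU (a minimal illustration; the names and dimensions are made up): each layer's sequence of hidden states becomes the input sequence of the next layer.

```python
import numpy as np

def rnn_layer(xs, V, W, b):
    """Run a simple recurrent layer over a whole sequence,
    returning the hidden state at every time step."""
    h = np.zeros(W.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(V @ x_t + W @ h + b)
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(5)
n, m1, m2 = 3, 4, 2          # input size and two hidden sizes
xs = rng.normal(size=(6, n)) # a sequence of 6 inputs

# Layer 1 consumes the inputs; layer 2 consumes layer 1's hidden states.
h1 = rnn_layer(xs, rng.normal(size=(m1, n)),  rng.normal(size=(m1, m1)), np.zeros(m1))
h2 = rnn_layer(h1, rng.normal(size=(m2, m1)), rng.normal(size=(m2, m2)), np.zeros(m2))
print(h2.shape)  # (6, 2)
```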

Let's summarize what we have learnt in this video. We discussed two gated recurrent architectures: LSTM and GRU. These models are more resistant to the vanishing gradient problem because there is an additional short way for the gradients through them. In the next video, we will discuss how to use recurrent neural networks to solve different practical tasks.
