0:00

In the last video, you learned about the GRU, the gated recurrent unit, and how that can allow you to learn very long range connections in a sequence. The other type of unit that allows you to do this very well is the LSTM, or the long short-term memory unit, and this is even more powerful than the GRU. Let's take a look.

Here are the equations from the previous video for the GRU. For the GRU, we had a_t equals c_t, and two gates, the update gate and the relevance gate; c_tilde_t, which is a candidate for replacing the memory cell; and then we used the update gate, gamma_u, to decide whether or not to update c_t using c_tilde_t.
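These GRU equations can be sketched as a single NumPy step. This is an illustrative sketch, not code from the lecture: the parameter names (Wu, Wr, Wc and their biases) and the dimensions are my own assumptions, and the full GRU applies the relevance gate to c_{t-1} inside the candidate, as shown here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """One GRU step (with a_t = c_t, as in the lecture).

    params holds illustrative weights: Wu, bu for the update gate,
    Wr, br for the relevance gate, Wc, bc for the candidate.
    """
    concat = np.concatenate([c_prev, x_t])                      # [c_{t-1}, x_t]
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])     # relevance gate
    # Candidate for replacing the memory cell, using the gated c_{t-1}.
    c_tilde = np.tanh(params["Wc"] @ np.concatenate([gamma_r * c_prev, x_t])
                      + params["bc"])
    # gamma_u decides how much of c_tilde replaces the old memory.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t                                                  # a_t = c_t
```

All gate operations here are element-wise on vectors, which is why the update line uses `*` rather than a matrix product.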

The LSTM is an even slightly more powerful and more general version of the GRU, and is due to Sepp Hochreiter and Jurgen Schmidhuber. This was a really seminal paper, with a huge impact on sequence modelling. I think this paper is one of the more difficult ones to read; it goes quite a long way into the theory of vanishing gradients. And so I think more people have learned about the details of the LSTM from other places than from this particular paper, even though it has had a wonderful impact on the deep learning community. But these are the equations that govern the LSTM.

So, we continue to have the memory cell, c, and the candidate value for updating it, c_tilde_t, will be this, and so on. Notice that for the LSTM, we will no longer have the case that a_t is equal to c_t. So, this is what we use. And this is just like the equation on the left, except that we now use a_t minus one instead of c_t minus one, and we're not using this relevance gate, gamma_r. You could have a variation of the LSTM that puts it back in, but the more common version of the LSTM doesn't bother with that. And then we will have an update gate, same as before: sigmoid of W_u, and we're going to use a_t minus one here, x_t, plus b_u.

And one new property of the LSTM is that, instead of having one update gate control both of these terms, we're going to have two separate terms. So instead of gamma_u and one minus gamma_u, we're going to have gamma_u here, and a forget gate, which we're going to call gamma_f. So this gate, gamma_f, is going to be sigmoid of pretty much what you'd expect: a_t minus one, x_t, plus b_f. And then we're going to have a new output gate, which is sigmoid of W_o, and then again pretty much what you'd expect, plus b_o.

And then the update value to the memory cell will be c_t equals gamma_u, and this asterisk denotes element-wise multiplication, a vector-vector element-wise multiplication, times c_tilde_t, plus, and instead of one minus gamma_u, we're going to have a separate forget gate, gamma_f, times c_t minus one. So this gives the memory cell the option of keeping the old value c_t minus one and then just adding to it this new value, c_tilde_t, using separate update and forget gates. So these stand for the update, forget, and output gates. And then finally, instead of a_t equals c_t, a_t is equal to the output gate element-wise multiplied by c_t. So these are the equations that govern the LSTM, and you can tell it has three gates instead of two. So it's a bit more complicated, and it places the gates in slightly different places.
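Putting the pieces together, one full LSTM step from the lecture could be sketched as follows. The weight names and dimensions are illustrative assumptions; the output line follows the transcript's a_t = gamma_o * c_t (many presentations instead use gamma_o * tanh(c_t)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step following the lecture's equations.

    p holds illustrative weights for the candidate (Wc, bc) and the
    update, forget, and output gates (Wu/bu, Wf/bf, Wo/bo).
    """
    concat = np.concatenate([a_prev, x_t])          # [a_{t-1}, x_t]
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])   # candidate value
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])   # update gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])   # forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])   # output gate
    # Separate update and forget gates, instead of gamma_u and 1 - gamma_u.
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * c_t                             # a_t no longer equals c_t
    return a_t, c_t
```

Note that all three gates are computed from the same [a_{t-1}, x_t] concatenation; only the weights differ.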

So, here again are the equations governing the behavior of the LSTM. Once again, it's traditional to explain these things using pictures, so let me draw one here. And if these pictures are too complicated, don't worry about it; I personally find the equations easier to understand than the picture, but I'll just show the picture here for the intuitions it conveys. The diagram here was very much inspired by a blog post by Chris Olah, titled Understanding LSTM Networks, and the diagram drawn here is quite similar to one that he drew in his blog post.

But the key things to take away from this picture are maybe that you use a_t minus one and x_t to compute all the gate values. In this picture, you have a_t minus one and x_t coming together to compute the forget gate, to compute the update gate, and to compute the output gate. They also go through a tanh to compute c_tilde_t. And then these values are combined in these complicated ways, with element-wise multiplies and so on, to get c_t from the previous c_t minus one.

Now, one element of this that is interesting is when you have a bunch of these in parallel. So, that's one of them, and you connect them; you connect these temporally. So you take the input x_1, then x_2, then x_3. You can take these units and just hook them up as follows, where the output a at the previous timestep is the input a at the next timestep, and similarly for c. I've simplified the diagrams a little bit at the bottom. And one cool thing you'll notice is that there's this line at the top that shows how, so long as you set the forget and the update gates appropriately, it is relatively easy for the LSTM to have some value c_0 and have that be passed all the way to the right, to have, maybe, c_3 equal c_0. And this is why the LSTM, as well as the GRU, is very good at memorizing certain values even for a long time, for certain real values stored in the memory cell even for many, many timesteps.
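This long-range memorization follows directly from the cell update c_t = gamma_u * c_tilde_t + gamma_f * c_{t-1}: if the forget gate saturates at 1 and the update gate at 0, the cell value is copied forward unchanged. A small illustrative check, with arbitrary candidate values at each step:

```python
import numpy as np

# With the forget gate saturated at 1 and the update gate at 0,
# c_t = gamma_u * c_tilde + gamma_f * c_{t-1} carries c forward unchanged.
c0 = np.array([0.5, -1.2, 3.0])  # some initial memory cell value
gamma_u = np.zeros(3)            # update gate fully closed
gamma_f = np.ones(3)             # forget gate fully open (keep everything)

c = c0.copy()
for t in range(3):                          # three timesteps, so c_3 vs c_0
    c_tilde = np.tanh(np.random.randn(3))   # arbitrary candidate each step
    c = gamma_u * c_tilde + gamma_f * c     # cell update
assert np.allclose(c, c0)                   # c_3 equals c_0
```

In practice the gates are sigmoid outputs, so they only approach 0 and 1, but a near-saturated gate preserves the value over many timesteps in the same way.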

So, that's it for the LSTM. As you can imagine, there are also a few variations on this that people use. Perhaps the most common one is that, instead of just having the gate values depend only on a_t minus one and x_t, sometimes people also sneak in the value c_t minus one as well. This is called a peephole connection. Not a great name, maybe, but you'll see: peephole connection. What that means is that the gate values may depend not just on a_t minus one and on x_t, but also on the previous memory cell value, and the peephole connection can go into all three of the gate computations. So that's one common variation you see of LSTMs.

One technical detail is that these are, say, 100-dimensional vectors. So if you have a 100-dimensional hidden memory cell unit, then so is this. And the, say, fifth element of c_t minus one affects only the fifth element of the corresponding gates. So that relationship is one-to-one: it's not that every element of the 100-dimensional c_t minus one can affect all elements of the gates. Instead, the first element of c_t minus one affects the first element of the gates, the second element affects the second element, and so on. So if you ever read a paper and see someone talk about the peephole connection, that's what they mean: that c_t minus one is used to affect the gate value as well.
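A peephole gate could be sketched like this; the function and parameter names are my own, not from the lecture. The key point is that the peephole term is a per-element weight vector multiplied element-wise with c_{t-1}, so element i of the memory cell only affects element i of the gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate_peephole(a_prev, x_t, c_prev, Wf, bf, pf):
    """Forget gate with a peephole connection (illustrative names).

    pf is a per-element peephole weight vector: the c_{t-1} contribution
    is the element-wise product pf * c_prev, so the relationship between
    memory cell elements and gate elements is one-to-one (diagonal).
    """
    concat = np.concatenate([a_prev, x_t])
    return sigmoid(Wf @ concat + pf * c_prev + bf)
```

Perturbing one element of c_prev changes only the corresponding gate element, unlike the W term, where every input element can affect every gate element.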

So, that's it for the LSTM. When should you use a GRU, and when should you use an LSTM? There isn't widespread consensus on this. And even though I presented GRUs first, in the history of deep learning, LSTMs actually came much earlier, and GRUs were a relatively recent invention that was maybe derived as a simplification of the more complicated LSTM model. Researchers have tried both of these models on many different problems, and on different problems, different algorithms will win out. So there isn't a universally superior algorithm, which is why I want to show you both of them.

But I feel like, when I am using these, the advantage of the GRU is that it's a simpler model, and so it is actually easier to build a much bigger network. It only has two gates, so computationally it runs a bit faster, and it scales to building somewhat bigger models. But the LSTM is more powerful and more effective, since it has three gates instead of two.
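The "simpler model" point can be made concrete by counting parameters. Under the weight shapes used in these equations (each gate or candidate is a W applied to [a_{t-1}, x_t] plus a bias; the sizes below are illustrative), the GRU has three such blocks and the LSTM four:

```python
def gate_params(n, m):
    """Parameters in one W [a, x] + b block: W is n x (n + m), b is n."""
    return n * (n + m) + n

def gru_param_count(n, m):
    # update gate, relevance gate, and candidate: three blocks
    return 3 * gate_params(n, m)

def lstm_param_count(n, m):
    # update, forget, and output gates, plus the candidate: four blocks
    return 4 * gate_params(n, m)

# For example, hidden size 100 and input size 50:
assert gru_param_count(100, 50) < lstm_param_count(100, 50)
```

So for the same hidden size, the GRU has about three quarters as many parameters as the LSTM, which is where the speed and scaling advantage comes from.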

If you want to pick one to use, I think the LSTM has been the historically more proven choice. So if you had to pick one, I think most people today will still use the LSTM as the default first thing to try. Although, I think in the last few years, GRUs have been gaining a lot of momentum, and I feel like more and more teams are also using GRUs because they're a bit simpler but often work just as well, and it might be easier to scale them to even bigger problems. So, that's it for LSTMs. With either GRUs or LSTMs, you'll be able to build neural networks that can capture much longer-range dependencies.
