Welcome to this interview series. To kick it off, I'm delighted to have with us today Chris Manning. Chris is, I believe, the most highly cited NLP researcher in the world. He is a Professor of Computer Science and Linguistics at Stanford University, and he's also the Director of the Stanford AI Lab, which is [inaudible] as well. Chris is well known globally as a leader in applying deep learning to natural language processing, and has done well-known research on tree recursive neural networks, sentiment analysis, neural network dependency parsing, the GloVe algorithm, and a lot more. He's taught at Carnegie Mellon, the University of Sydney, and at Stanford University. Welcome, Chris. Thank you, Andrew. It's great to have a chance to chat. Even though you and I, Chris, have had the opportunity to work together on many occasions, one thing I've actually never asked you before is: how did you get started working in AI? I know that one of the unusual aspects of your background is that you majored in linguistics and then wound up being a computer science professor doing NLP. Tell us about your arc getting started in AI. Sure. My background, in some sense, isn't as an AI person; that's not really where I began. As an undergrad, I did a major in computer science and math, but I also got very interested in linguistics and did a major and honors in linguistics. Coming off of that, to a fair extent, my starting point was much more a cognitive science viewpoint: human language seemed fascinating. These teeny little human beings somehow manage to learn it, at an age when, in general, their cognitive abilities don't seem that great. How could human language learning take place? In linguistics, by far the dominant belief in the second half of the 20th century was the thinking of Noam Chomsky. Noam Chomsky was just the eminent person in linguistics, in the same way that, I guess, for the first half of the 20th century, R. A. Fisher was the dominant person in statistics. Chomsky had this very strong position that humans cannot possibly just be learning languages from data alone, and that there must be innate machinery in people's brains that allows them to learn languages. That's a very big topic. Even back then, it seemed to me just unbelievable, given the evolutionarily extremely recent development of human language. I was interested in the idea of how you could go about learning languages, and that led me to start to look at machine learning, starting off at the end of my undergrad, which was the late 1980s. These days, machine learning is such a big and dominant field. It's so big and dominant that the terms artificial intelligence and machine learning are two-thirds of the same thing, because the vast majority of what you see in AI is machine learning. But at that time, it just wasn't like that at all. Machine learning was this very scruffy offshoot of AI, on the side, that almost no one worked in. There was this series of two or three books edited by Jaime Carbonell and Tom Mitchell from CMU, who'd started to put together some papers on machine learning, and there were the early decision tree algorithms in AI, like the ID3 algorithm from [inaudible], still in Australia, which was what decision tree learning was then. Machine learning barely existed. But I was interested in these ideas of how you could go about getting computers to learn. That was the entree that led me down the path that's turned me into an AI researcher.
It was a big deal, or maybe non-intuitive at the time, that you should use data to learn a language, rather than code up by hand a CFG grammar or something to try to understand a language, which is what people were trying to do back then. The dominant way in which people did natural language processing was by hand, and of course that wasn't unique to NLP; it corresponded to the dominant thinking in AI at that time as well. This was the era of knowledge-based systems, where the view was that what we needed was to get subject-matter experts, whose knowledge, knowledge engineers would encode in knowledge representation systems, and all of this hand engineering would lead us to intelligence. Even back then you were an early believer in machine learning for NLP? Absolutely. Right at the moment, transformer-based architectures have really become the dominant thing in neural networks. We're getting a bit out of order here, so maybe we should come back to them later. The interesting thing about transformer architectures is that they're built around an idea of attention. You can think of attention as giving you a soft tree structure, where you can point from one word to another, and that allows you to build a tree structure. We've done some really interesting work, especially my PhD student John Hewitt, looking at what transformer models learn when trained on billions of words of a human language. You can actually show that these models learn all kinds of things about the structure of a language. One of the things they learn is some of these co-reference facts, so they'll learn that 'she' refers back to 'Susan', and that 'it' is referring to 'the bottle', in sentences. But they also actually do learn the hierarchical, context-free-grammar-like structure of languages, just from sequences of words in text alone, which is actually a really neat result. Because language can be reasonably described with this nested, context-free-grammar-like tree structure, a sufficiently large neural network, a transformer network, discovers aspects of that just from the data. Yeah. In fact, in the lead-up to modern views on transformers, your group at Stanford did some of the very influential early work. I know that for a long time, in the pre-deep-learning days, you did a lot of work on statistical machine translation. Then, as deep learning started to make inroads into NLP, you and your PhD student [inaudible] actually published one of the earliest papers on neural machine translation, and with the bilinear attention matrix, you helped lay some of the foundations of the modern transformer model. Do you want to just tell us a bit about that? Sure. Really, before starting work on neural networks again — I actually did a tiny bit of neural network work back in the 90s, in the days when Dave Rumelhart was at Stanford, but I really barely got into it. In the 2000s decade, certainly, everything that I was doing used probabilistic modeling techniques, putting probabilities over symbolic structures to describe human languages, which was by far the dominant approach in that decade. One part of that work, which I worked on for about a decade, was building machine translation models. The dominant models were referred to as statistical phrase-based machine translation in those days. A lot of techniques were worked out pretty well in that time, so there was a fairly well-worked-out architecture where you had these factorized machine translation models. Part of that was that you had phrase tables, which gave you probabilities of translating a phrase in one language into a phrase in another language, so they did the local parts of the translation. Then you combined that with what's called, in natural language processing, a language model. 'Language model' is a term of art that's very dominant in natural language processing: a language model means something that gives you probability distributions over sequences of words in a language. That's just been a really dominant, powerful idea in NLP, because whenever you want to put words next to words, it lets you tell what other words are likely or unlikely. This basic idea of a language model is used for context-sensitive spelling correction: when you have something like Google correcting your spelling in context beautifully, there's a language model. Speech recognition systems use language models, and these machine translation systems also used language models.
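Concretely, a language model assigns a probability to any sequence of words, which the chain rule factors into a product of next-word predictions — this is the standard definition, stated here for reference:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

An n-gram model of the kind used in phrase-based MT approximates each factor by conditioning on only the previous few words.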
We had these architectures and they worked reasonably well. When Google first came out with machine-learned translation — systems just learned from data — they were using these statistical phrase-based machine translation systems. If I back up for just a tiny little bit of the story: when Google first launched machine translation, they licensed a very traditional, old rule-based machine translation system. That was the system originally developed by the company SYSTRAN, whose roots really go back to the 1950s and the earliest explorations of machine translation. They had that for a couple of years, and then, seeing the advances that were being made in probabilistic models for machine translation, they switched over to that, and then things got much better. I remember Franz Och was a real thought leader in helping Google scale the training of those statistical models on tons more data, which just dramatically improved the performance of Google Translate. Yeah, absolutely. At that point, Franz Och was one of, or maybe even the, leading people doing statistical phrase-based MT models, and he went to Google and led a team, so that Google then had the leading large-scale implementation of statistical phrase-based MT models. They actually worked pretty reasonably. They already achieved the goal that you could feed in just about any web page, in dozens of languages, and get something that was two-thirds comprehensible. You could work out basically what it was saying and what topic it was about, and that was reasonable; that was great for 2007 to 2010. But in 2010 to 2014, which was the same period when you and I, Andrew, were doing that tree recursive neural network work we were just talking about, statistical phrase-based machine translation stalled. There weren't really very good ideas for making further progress, and a little bit of progress was being made just by throwing more data in — more data helps. That's still true of modern machine learning, but those models didn't have enough capacity for it to help a lot.
An idea that people, including me, were working on a huge amount in those years was to say: well, surely the solution is to make more use of the grammatical structure of human languages in machine translation systems. So really, the dominant research area was trying to build syntax-based machine translation systems. That seemed a good idea, but it barely ever worked. Basically, the result was that for some language pairs it didn't work at all; it just wasn't better than the statistical phrase-based machine translation systems. Whereas for some other language pairs with more different grammatical structures — English–Chinese machine translation was a good example — it definitely did help a bit; you could show some real gains. But the solution ironically turned out to be to pay less attention to the syntax and more attention to the data. Correct, yeah. That work kind of got blown out of the water when people started exploring neural methods for machine translation. I was going to say it was the first big success of neural methods in NLP; whether that's true or not depends on whether you count speech as part of NLP or not, because really, speech recognition was the first huge success of neural network methods applied to human language problems. But for text-based work, really, the first thing that just sort of knocked you out was building neural machine translation systems. It was a very successful domain because it's a domain where there's lots of data available — there's a lot of text in pairs of languages — that you could use to start training big neural network models, and it was first done by [inaudible] and a couple of colleagues at Google. For modeling a sequence of anything, like a sequence of words or a sequence of DNA, the dominant model then was the recurrent neural network, which is just a model that works on sequences and remembers what it's seen before in a limited way; it's the continuous, neural version of a hidden Markov model. Essentially, what they showed — and this was undermining all that work on syntax-based models — was that you could make no use of the structure of human languages at all and just build very large recurrent neural networks that were deep. Until that point, in most of the neural network modeling in NLP, if we had two layers we called it deep; if we had three or four layers we were really pushing it. They immediately pushed it out to eight-layer-deep recurrent neural networks. This is where you start getting into systems issues of needing to run on a machine with eight GPUs, which is a trend that's continued, and which we can maybe talk more about. They showed that much larger neural networks — just training big sequence models, two sequence models, where one of them, the encoder, encodes the source language, and the other one runs as a generator and generates a sequence of words in the target language — could already give you a pretty good machine translation system. Not quite as good as the state of the art at that point, but it was close enough to the state of the art to seem tantalizing, since all they were doing was plugging two neural networks together. And if you're not counting the library code for the recurrent neural net units — in particular LSTMs, long short-term memory units, which were very influential in allowing this work to work — you only needed maybe 500 lines of Python code around the neural network library.
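Here is a minimal sketch of that encoder-decoder idea, assuming PyTorch; the class name, dimensions, and vocabulary sizes are illustrative toy choices, not from the original papers:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM reads the source; its final state seeds a decoder LSTM."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # reads source words
        self.decoder = nn.LSTM(dim, dim, batch_first=True)  # generates target words
        self.out = nn.Linear(dim, tgt_vocab)                # scores over target vocab

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence; the final hidden state is the only
        # thing the decoder sees about the source (no attention yet).
        _, state = self.encoder(self.src_embed(src_ids))
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.out(dec_out)  # logits for the next target word at each step

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 7))  # batch of 2 source sentences, 7 tokens each
tgt = torch.randint(0, 8000, (2, 5))  # shifted target sentences for teacher forcing
logits = model(src, tgt)              # shape: (2, 5, 8000)
```

The "crude and missing" piece Chris mentions next is exactly that single-state bottleneck, which attention removes.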
You could have an almost state-of-the-art machine translation system, and that seemed just super intriguing. But there was also something crude and missing there. Then very quickly after that, [inaudible], who was working with Yoshua Bengio in Montreal — well, actually not only [inaudible]; I should also mention [inaudible], who is the first author on the paper and was then a student in Montreal. It was really actually Dima who developed the idea that you could build an attention-based model. The idea of an attention-based model is that at any point in the sequence, you can calculate a connection to other words in perhaps the same, perhaps a different, sequence. Then, using that attention, you can calculate a new vector to influence what happens next. In particular, in this machine translation context, you'd have started your translation — say your translation was starting to say 'the pilot'. Then you'd calculate attention back into the source sentence in the other language and work out essentially what words in the source you wanted to be translating next, based on what you'd translated so far. Rather than having to remember the entirety of the source sentence in your recurrent neural network state, you could do what a human translator actually does: dynamically look back at the source sentence and work out what to translate next. This idea of attention has just been transformative. It's increasingly being used also in vision systems, systems with knowledge graphs, and other areas of work on neural networks. Then shortly after the Cho and Dima paper, you and Thang wrote a paper on bilinear attention. How did that come about? In earlier work, really starting with another student, [inaudible], we'd looked at this idea of neural tensor networks, where what we wanted to do was be able to combine vectors together and have them influence each other to produce another vector. We'd done that by putting a tensor, which is the multi-dimensional generalization of matrices, in between them. That idea was going on in other pieces of work in my group at the time. But here we just wanted to get an attention score. In Dima and Cho's work, what they'd done was to say: okay, we want an attention score between these two vectors; let's feed them through a little neural net, a little multi-layer perceptron, and calculate the attention score. Whereas it seemed to me: wait, no, I could just do this simple thing of bilinear attention, where there are these two vectors, and if I put a matrix between them and multiply vector times matrix times vector, I just get a number out. So that is bilinear attention, sometimes referred to as multiplicative attention. It's a simpler and more directly interpretable idea of attention, because in some sense the simplest idea of attention is to say, "well, you have two vectors; just dot-product them together and you get a score of their similarity." But that's too rigid, because you want to say, "well, maybe I only want to pay attention to parts of the vector, and maybe I want to know if the top part of one vector is similar to the bottom part of the other vector." So by sticking a matrix into the middle, you can then modulate the similarity calculation. That's a natural measure of similarity and very easily learnable by neural networks; in some sense, it's a generalized dot product. How ideas develop and spread is always complex. There are a lot of ideas in the air at any one time.
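As a concrete illustration of the bilinear (multiplicative) attention score Chris describes — and of the low-rank refinement discussed next — here is a small sketch; the dimensions are toy values:

```python
import torch

d = 4
x = torch.randn(d)     # e.g. the current decoder state
y = torch.randn(d)     # e.g. a source-word representation
W = torch.randn(d, d)  # learned matrix that modulates the similarity

score_dot = x @ y           # plain dot-product attention: too rigid
score_bilinear = x @ W @ y  # bilinear attention: x^T W y, a single number out

# The refinement discussed next: factor W into two low-rank matrices.
# Projecting each vector first and then taking a dot product is exactly
# equivalent but cheaper -- close to the query/key form in transformers.
r = 2
A = torch.randn(d, r)
B = torch.randn(r, d)
score_lowrank = (x @ A) @ (B @ y)  # equals x @ (A @ B) @ y
```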
But in some sense, what's become dominant in the modern work with transformer-based models essentially does build off that notion, adding an extra idea on top of it. It's a similar idea, where instead of having a giant matrix in the middle, which requires a lot of parameters, you use a low-rank approximation to that matrix in the middle, and then that comes very close to the modern transformer model. Exactly. So in our initial work, we just had a full-rank matrix in the middle. If you have a full-rank matrix with a lot of parameters, the obvious way to have fewer parameters is to say, "well, no, I can regard that matrix as being the product of two low-rank matrices." Then, once you have that idea, rather than multiplying them together, you can say, "well, I can apply those two low-rank matrices to the vectors on each side, which is more efficient computationally," and that's exactly what modern transformer models do. You take your two vectors, you multiply each by a low-rank matrix, and then you take the dot product between those, which is exactly equivalent to saying, "I form the matrix by multiplying two low-rank matrices, and then I do my vector-matrix-vector product," but done more efficiently. In fact — not to jump around too much — I feel like this is not the first time in your career that you've advanced the field by making observations about matrix multiplications. If I look at the work that your team did on GloVe, the word embeddings, I feel like the heart of that idea was also simplifying what was previously a complex set of neural-network-like stuff into just a set of matrix multiplications. On top of that, I still find GloVe to this day a very elegant paper, because it really simplified what was previously a very complicated set of ideas for learning word embeddings into a simpler, take-the-inner-product-between-the-word-embeddings formulation. Thank you. I do think that for the GloVe paper, in some sense, one of its main contributions was giving a better understanding. I mean, it's not that it worked better than the other methods that had then recently been developed, but it was interesting to say: let's try and think about what's going on here and how these methods relate to other things that have been explored. So the [inaudible] topic here is word vectors: coming up with a real vector representation that gives you the meaning of a word, which now, in your NLP deep learning courses, is essentially often the first idea you really see, because it's a very useful notion and a fairly simple one. This had been done quite successfully by a couple of pieces of work in the few years around 2010 to 2013, but it had been done in very mechanistic neural net ways: here's the architecture and the algorithm; run it — in those days, often for weeks, because we didn't have very fast, powerful computers — and out will pop great word vectors. We were interested — this was work with the postdoc Jeffrey Pennington — in actually trying to understand better what was happening in terms of the math of these models.
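The formulation the GloVe paper arrives at, stated here from the published version for reference, trains word vectors $w_i$ and context vectors $\tilde{w}_j$ so that their inner product matches the log co-occurrence count $X_{ij}$, with a weighting function $f$ that damps very rare and very frequent pairs:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$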
Something that we were intrigued by was that there was actually an older tradition, the LSA or Latent Semantic Analysis tradition, of having vector representations of word meaning which exploited classical linear algebra. The latent semantic analysis models, in linear algebra terms, were neither more nor less than the singular value decomposition: using the singular value decomposition on word co-occurrence count matrices, and then reducing the rank by getting rid of the small singular values. That's fascinating.
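A minimal sketch of that LSA idea, assuming NumPy; the co-occurrence counts here are made up for illustration:

```python
import numpy as np

# Toy word-word co-occurrence counts for a 3-word vocabulary.
X = np.array([[10., 2., 0.],
              [ 2., 8., 3.],
              [ 0., 3., 6.]])

U, S, Vt = np.linalg.svd(X)      # singular value decomposition
k = 2                            # reduced rank: drop the smallest singular value
word_vectors = U[:, :k] * S[:k]  # k-dimensional vector for each word

# The rank-k reconstruction is the best rank-k approximation to X in the
# least-squares sense (Eckart-Young), which is what "reducing rank" buys you.
X_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
```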
One trend that overlays all of this work that you've observed and participated in over the last many years is the scaling of NLP models. It seems like year by year the models are getting larger, like GPT-3; there are really many models in that sequence. For how long do you think this will continue? Do you think someday, if ever, we'll go back to building smaller models? In 2018, BERT came out, and it showed that you could do fantastically well on a bunch of tasks — including question answering, natural language inference, text classification, named entity recognition, and parsing — by making use of a pre-trained large language model, where you're training a large transformer just on a few billion words of text to predict gapped words in sentences. Just doing that simple task gave these really good language representations, which could then be used in a very simple manner, with something like a softmax classifier on top, to provide great solutions for downstream NLP tasks. That was an amazing success for the idea of representation learning that many of us had been talking about for a decade or so — saying that neural networks are about representation learning, about learning useful intermediate representations. This was really showing that it worked for building representations of language that can then be very easily applied to higher natural language understanding downstream tasks. But BERT already used a huge amount of data and compute: it was trained on billions of words of language, on large numbers of computers, for quite a long time. Since BERT there's been a lot of further progress, but it's essentially just come from scaling up the compute more and more. As always, I'm leaving a few things out — there have also been a couple of new ideas, like relative attention, which have improved things. But to a first approximation, what's really been driving the gains is just throwing more and more compute at the problem, running even bigger models on even more data. Andrew, you often use the slogan 'AI is the new electricity'. In some of the talks I've given recently, discussing this trend of bigger and bigger models, I [inaudible] turning it around and saying that electricity is the new AI, because if you look at what's been happening, people are now training models not just 10 or 100 times bigger than the BERT model; they're training models 10,000 or 100,000 times bigger than the BERT model, and the computational energy demands to train those models are going up accordingly. That's certainly pushed progress, but I think that trend can't possibly continue much further. I mean, part of it is just that we're running out of text, and we're running out of computers to train up ever bigger models on. Going from BERT to GPT-3, if I remember, it was a massive scaling of compute and a more modest scaling of data. So maybe we have enough text data, and we just need another 100 or 1,000 or 10,000 times faster computers? You're right that the scaling of compute was vastly bigger than the scaling of data. But nevertheless, the amount of data that's now being used is actually quite substantial. I mean, there's certainly more data out there, but they are already using a substantial quantity of quality data. But yeah, I buy the point that to the extent that clever systems people can give us three orders of magnitude more powerful TPUs, even without more text data, these models are going to be better, and probably will be better in the coming years. But I don't ultimately believe that's the path to artificial intelligence, or the interesting path at this point for further improving natural language processing. Let me ask a slightly controversial question. Several weeks ago, one of our mutual friends made a comment about this direction that said, maybe scale is a path to AGI. You know the conversation where one of our mutual friends made that comment. What are your thoughts on that? I think there's a place that that idea is coming from. The really cool thing about the GPT-3 model that was recently released by OpenAI was the discovery — which then really motivated their subsequent work — that with these really humongous language models, you can actually achieve a generality where they can be used for all kinds of tasks without actually having to train them to do any task. Conventionally, if we wanted to do different tasks with a neural network, we'd just take data for that task and train a model for that task. With these large pre-trained language models, we'd sort of gotten to a halfway house, where we had a baseline: very good language representations had been pre-trained, but still we'd then fine-tune for every different task. So we'd fine-tune a question answering model, and then we'd start again and fine-tune for something else, like summarization. What they discovered with GPT-3 is that you actually didn't need to do that anymore. Instead, you could just hint to the model what you'd like it to do by giving it a couple of examples. You'd say, "okay, I'm interested in translation: here's a sentence, and what I'd like you to produce is this sentence, which is a translation of it into Spanish." If you gave it a couple of examples of what you wanted it to do, the model would get the idea, and then you could give it more sentences and it would translate them. But it wasn't limited to translation: you could give it some questions and tell it you wanted answers, and it would give answers to other questions. The mind-blowing thing about GPT-3 is that, to a first approximation, you can give it all kinds of tasks — even weird ones where you wouldn't expect it to really know what it knows — and it'll just do them. Sometimes the things linguists are interested in are weird sentence manipulations: you know, if I give you a sentence, can you turn it into a question, or can you turn it into a relative clause that modifies a person? For any of those things, you can just give a little task to GPT-3 and it will do it. That's the sense in which it actually offers a vision of general intelligence, rather than what's been the main state of AI over the last decade, which was these very narrow AIs, where you just trained a system for one particular thing: here is a movie recommender, here is an object recognizer for images, here is a natural language question answerer. It's intriguing because of its general intelligence.
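An illustrative few-shot prompt of the kind Chris describes — the example text below is hypothetical, written for this transcript, not taken from OpenAI's materials:

```python
# Two demonstrations establish the task; the model continues the pattern.
prompt = """English: The house is blue.
Spanish: La casa es azul.

English: Where is the library?
Spanish: ¿Dónde está la biblioteca?

English: I would like a coffee, please.
Spanish:"""

# A sufficiently large language model typically completes this with the
# translation (e.g. "Me gustaría un café, por favor.") with no fine-tuning.
```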
But I don't actually believe it represents a path to the goal of [inaudible], which is to have a flexible cognitive agent like human beings, because it is just this humongous pre-trained model, trained on an amount of data that is just massively more than a little human being sees before they are competent in their human language. To have something that's like a human being, you have to have something that can flexibly learn different tasks as it's exposed to them. The big GPT-3 model is no longer learning at all. It's been exposed to so much data at training time that it can do all sorts of different tasks by effectively pattern-matching to something it saw somewhere along the line. That's actually an area that there's also starting to be a lot of work on right at the moment in deep learning, the idea of meta-learning: instead, how do you build systems that are good at learning how to learn new tasks? I think that's actually closer to the kind of intelligence we have to be seeking if we are aiming at artificial general intelligence. Chris, in your career, you've mentored tons of very successful undergraduate students, Masters students, and PhD students, and today there are many people applying to be a student in your lab. I'm curious: in your view, what makes a good researcher in your lab at Stanford? What characteristics do you like to see in people applying to be a member of your lab? There are definitely people that come from different backgrounds. I certainly do like some of my students to know something about, and really be interested in, human languages and their structure. I certainly like to have some people who have backgrounds in human languages and some linguistics, but that's certainly not all of them. I've had other people who are AI and machine learning, or other [inaudible], students who have no background at all apart from, like all human beings, being native speakers of some language or another. I think the dominant thing is that you have to have creativity and scientific thinking. There are lots of people around who can read papers and do whatever is in them. The secret of being successful is that you can think a bit differently and say, "Wait a minute, these people are doing things like this — in fact, almost everyone is doing things like this — but maybe there's another way it could be approached that is better." Really, I think that's core scientific training: aiming to break things, in the sense of finding why these ideas aren't actually a great way to do things, and thinking of different approaches which might work better. How does a student become creative? That's a good question. Yeah, I believe that you can make progress on this, and I think it's a matter of taking on a mindset of being critical when you read things, and taking time to explore around and do different things, rather than just reading papers and thinking, "Oh, this is a good way to do things." I think you really want to be concentrating on being awake as you read, and trying to think, "Well, what are they assuming? Why are they doing it this way rather than some other way?" A lot of times your experiments will fail, but I think when they fail, you're already learning more than if you were simply implementing the algorithm that's in the paper, because you don't really learn very much by doing that.
Occasionally you might see something that half works, and then that might give you an idea of, "Well, maybe there's something here, and if I modified it and pushed it in a different direction from what people do now, then it might work interestingly better." I think the practice of exploring out and thinking in those directions is the way that you build these skills of coming up with something new. I find that most creative people read incredibly widely and wind up making strange connections that someone else wouldn't have made — between linguistics and computer science, or between some other weird thing and deep learning. Yeah, I think that's also a really good strategy. Thanks, Chris. This has been great. Before we wrap up: I think a lot of the learners watching this video are looking to build a career in AI and looking to build a career in NLP. Do you have any final words of advice for someone watching this, looking to advance their career in AI? Sure. It's a great time for you to do this: there are just huge opportunities all over industry, and also in academia, for people with skills in AI, machine learning, and natural language processing. You'll be greatly in demand; this is a great thing to do. But I think nevertheless you want to think thoughtfully about how to develop your career, and this is maybe connected with what you were just saying, Andrew, about reading widely, because the reality is that the world and the skills that are useful develop quickly. We've talked earlier in this hour about how, really, when I started off, I was being taught rule-based NLP, and then there started to be machine learning models and probabilistic models. There were other things that went by in the meantime, like support vector machines and large-margin models. Then there was the return of neural networks to dominance — but neural networks were actually an idea that had been around since the 1950s. At the moment, deep learning and neural networks seem so dominant, it seems like nothing else can possibly be relevant to know; the only thing I should learn is how to be state of the art in building neural network models. But that's not how the world has been in the past. Around 2008, that's what we thought of probabilistic AI models: they were obviously the dominant thing and the thing to know. But really, progress is made by taking old ideas and rediscovering them, putting them together with some stuff that's been learned in the meantime, and further raising the bar on where we can get to. It's sure to be the case that lots of different ideas are going to be useful in various ways in the future. To be successful long-term, it helps to have a richness of background: it definitely helps to have some knowledge of areas of computer science, areas of math, statistics, linguistics — different ideas. But you also have to accept that this is an area where you have to adapt and move on to new ideas. I think one of the ways in which I've been very successful is not by inventing a whole new field single-handedly, but by being able to see where promising ideas were emerging, starting to think about those, and moving to do work on them fairly quickly. Being adept at keeping your antennae up for interesting ideas that are starting to appear, and being adaptable and willing to explore and make use of new ideas — that's the way that you keep your thinking vibrant. Yeah. I think AI, machine learning, and NLP all obviously move so fast that all of us just have to keep learning. Absolutely. Thanks, Chris.
This has been great, and it's really interesting to hear the story of how you've managed to do all this work over these many years. I find it inspiring to think that maybe there's someone watching this who will follow in some of your footsteps and themselves end up a professor somewhere, or doing great research, much as you have. Thank you. Thank you, Andrew. It's been fun chatting. A final fact I can tell people listening is that for a whole bunch of years, Andrew and I were actually office mates — well, we had offices next door to each other on the corridor. Once upon a time, I often had opportunities to see Andrew in the corridor and pass a few thoughts. I don't get those chances as often anymore, so it's fun getting a chance to chat. Thanks, Chris. Those were good days. I remember we shared a wall, so if I hit the wall in my office, you were on the other side. Those were great times. Thanks, Chris. Okay, see you. For more interviews with NLP thought leaders, check out the deeplearning.ai YouTube channel or enroll in the NLP specialization on Coursera.