Okay. Thanks very much. Today we have Pat Bosshart, who spent about 30 years at TI and was a TI Fellow, and who has recently left TI for other pursuits related to SDN. He is one of the architects of the RMT chip, which was a SIGCOMM paper last year. So today we have Pat talking to us, and thanks very much, Pat, for joining us. >> Yeah, actually I suppose one thing I'd like to put in, just as a refinement of what you said, is that the RMT actually isn't a chipset; it's a research project. At one time we were giving a talk and someone asked what technology this is implemented in, and I said, currently the only technology it's actually implemented in is PowerPoint. >> [LAUGH] Okay. Okay, good. So that's a good overview of the first thing I wanted to ask you, which is: what's the history of RMT? For those people watching who might not know what RMT is, I guess we could think of it as kind of an extension of OpenFlow's basic match-action primitives, extended into the hardware. But I think students of the class will have read the paper, so they'll know about it. But what was the history of it? How did you start working on this, and what led you to think about designing the next-generation chip for OpenFlow, if you want? >> Well, I had of course been at TI for many years, and kind of specialized in CPU design and big ASIC design. The section of TI I was in was doing large designs, and a lot of those were for networking customers. But of course, we would only see: well, you'd have to lay out this thing that has this much RAM, this many wires, and all that sort of stuff.
But eventually TI closed down the ASIC business, because the business is the business, and the whole VLSI world is changing. And that'll be kind of a recurring theme here: the cost of design. But when I was looking around for other things to do at TI, this research project popped up that was just beginning. It turns out there's been a long history of interaction between TI and Stanford on networking sorts of things. You know, serial links, which provide the very high-bandwidth I/O, were done at TI, and the connection between Martin Izzard and Nick McKeown kind of started from figuring out that these things are really good for networking. So when this project happened, I figured, well, this is natural. I mean, I enjoyed designing, and it was a brand new area. One way to look at it is that when you design a chip, these things are so big that usually what you're doing is taking two of what you had before and sticking it on a single chip, or doing something with more memory, or adding this bell or whistle. But this was something that was completely new, so it looked like it was going to be fun. Now of course, the fact that it was completely new was a little bit on the bad side: in all of my history, I was not a networking person. So I was approaching this essentially knowing nothing about networking, and I used to joke that in my first week on the job on this new project, my weekly progress report said, this week I learned to spell TCP. [LAUGH] But a lot of times, innovation in an area kind of happens by accident. And it happens, sometimes, when you bring in an outsider who doesn't know that the questions he's asking are stupid questions. It's kind of like the innocence of a child.
So, knowing nothing about networking, I figured out there were 7,000 RFCs which specified the internet, and I didn't want my weekly reports to read: all right, I've now read 127, only 6,873 to go. So I figured I'd better come up with something where I wasn't going to have to learn all the internet behaviors, and that kind of naturally fed into a very programmable architecture. And that was very much in keeping with the TI culture of digital signal processing. As a matter of fact, there's a really good analogy between the processing that you do for networking and digital signal processing. Ten or fifteen years ago there were the same arguments: do you need special hardware to do any given signal processing algorithm, or can you do it on something like a general-purpose processor, or a processor that's specially tailored for it? And eventually the cost of design led you to where you don't get to design a special architecture for everything; you'd rather design one architecture which could handle all the applications. So that kind of naturally led me into a very regular, very programmable sort of architecture. And the action engine in this is a VLIW, meaning kind of one processor, one ALU, for every word. TI built a DSP which is a VLIW, so I had just been exposed to all those things. So it kind of naturally fell into place: the only thing that I knew is that I was not going to understand all the ways in which this was going to be used, so I'd better go with something that's simple, general, and regular.
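The VLIW action model Pat describes, one ALU per word of the packet header vector, all firing in the same cycle, can be sketched roughly in Python. The opcodes and four-word vector here are invented for illustration, not taken from the actual chip:

```python
# Toy model of a VLIW action stage: one ALU per header-vector word,
# all executing their own instruction in the same "clock cycle".

def vliw_stage(header_vector, instructions):
    """Apply one (op, operand) instruction to every word in parallel.

    Because each word has its own ALU, the new vector is computed
    entirely from the OLD vector: no word sees another word's updated
    value within the same cycle.
    """
    def alu(word, instr, old_vector):
        op, arg = instr
        if op == "set":                 # load an immediate
            return arg
        if op == "add":                 # add an immediate
            return word + arg
        if op == "copy":                # copy another word (old value!)
            return old_vector[arg]
        return word                     # "nop"

    return [alu(w, instructions[i], header_vector)
            for i, w in enumerate(header_vector)]

# Example: 4-word header vector, one instruction per word.
phv = [10, 20, 30, 40]
prog = [("add", 1), ("copy", 0), ("set", 99), ("nop", None)]
print(vliw_stage(phv, prog))   # [11, 10, 99, 40] -- word 1 copies the OLD word 0
```

The detail worth noticing is that word 1 receives 10, not 11: every ALU reads the pre-cycle state, which is what makes the parallelism safe.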
>> Yeah, that's pretty interesting, actually. Another thing I wanted to ask you is that, in describing RMT, you talk about needing this kind of general, reconfigurable match-action architecture, as you describe it. And I'm wondering, even with this architecture being a lot more general than OpenFlow, do you think there will ultimately be a need for a more extensive set of action primitives besides just the ones that RMT supports? For example, do you think that some of these hardware pipelines should eventually support more extensive operations like transcoding or encryption? And if not, how should those kinds of things be implemented alongside these pipelines, and if so, where do you see that integration taking place? >> Well, it's almost like we included in this architecture everything that we could, until you get to the point where the degree of difficulty in adding new functionality just turns upward like a hockey stick. And it's because some of the things that you mentioned are very computationally intensive. Encryption, for example, is meant to be computationally intensive. >> [INAUDIBLE] Yeah. >> And so it's very difficult to put those into an architecture that's much more directed at high-volume but relatively simple packet header manipulation. So I suspect that things that do those special functions will always be done in a different sort of way. Well, maybe "always" is kind of a dangerous word to use, but it's easy to envision them staying separate somehow.
Just because, if you look at it by market segment, a lot of people want switching, but if you make the decision, well, I'm going to double the cost of everybody's switch because this group of people wants encryption, or this group of people wants regex processing, things like that, it just doesn't pay. So those sorts of things will have a tendency to remain separate. But there's a common factor in there, in that they're all stateful processing, or most of them are stateful processing. You're carrying state between successive packets of a flow, and that state can be very arbitrary. And the amount of computation you're doing on each packet is arbitrary, or varying, and possibly significant. It's very difficult to wedge those into a pipeline where the name of the game, and of course that's one of the characteristics of the RMT pipeline, is that every clock cycle you send a new packet through it. So there's room for one clock cycle of computation. And actually, you can do as much as you want to all the different words in the packet header vector, as long as they're in parallel. >> Yeah. >> But you only get one cycle, so there's really a line there that's very difficult to cross. >> That makes sense. So I guess, in some instances, you're sort of ruling out, or leaving the job of stateful processing to, some other part of the architecture. >> Yeah, I mean, there are simple forms of stateful processing that you can do, but it's kind of a slippery slope. As a matter of fact, one of the things I believe is mentioned in the RMT paper is the introduction of kind of a stateful table. And a simple example would be GRE encapsulation: I guess it's optional, but there's a sequence number, kind of the equivalent of a TCP sequence number.
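A stateful table like this amounts to a per-tunnel counter that is read and bumped on every matching packet. A minimal sketch, with an invented table layout purely for illustration:

```python
# Toy stateful table for GRE sequence numbers: each tunnel keeps a
# counter that is read and incremented every time a packet matches,
# so state is carried between successive packets of the flow.

class SeqTable:
    def __init__(self):
        self.seq = {}                     # tunnel id -> next sequence number

    def next_seq(self, tunnel_id):
        n = self.seq.get(tunnel_id, 0)
        self.seq[tunnel_id] = n + 1       # state survives to the next packet
        return n

table = SeqTable()
print([table.next_seq("tun0") for _ in range(3)])  # [0, 1, 2]
print(table.next_seq("tun1"))                      # 0 -- an independent flow
```

Trivial in software, but as the interview notes, even this one read-modify-write per packet is a hard constraint in a pipeline that admits a new packet every clock cycle.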
So when you're doing GRE encapsulation, you'd like to maintain a stateful table, and you just increment it every time a packet matching that flow comes by. Well, all right, that's stateful processing, and if you don't have it, you're out of luck. It's the difference between being able to do it or not. But really, on the scale of things, that's a really easy piece of stateful processing. There's essentially some low-hanging fruit, but it can get difficult very quickly. >> When you talk about this sort of hockey stick of complexity, if you will, and about where to cut things off, in some sense, how do you know when you're done, or how do you know when your architecture is complete enough? It seems like there's sort of a complexity factor that you might use to make that decision, or there might also be, I think you talked about, the class of users, or the class of people who might possibly want to use this. Is that a technical decision, or is there a market decision that you have to make there, or is it something else? I mean, how do you- >> Well, I- >> sort of figure out where to strike a balance? >> The technical decision is actually reasonably clear, because essentially, when you go up to a certain point, let's say the inclusion of the VLIW, that allows you to do actually quite a bit of very useful stuff, and the next step is just so difficult and so different that it seems like a natural boundary. And fortunately, I suppose, that aligns with the marketing lines: a lot of people want to do the sort of things that OpenFlow allows you to do.
Essentially, I want to create a bunch of different tables of whatever I want, and do this sort of action processing where I'm going to set values in fields, move things around, encapsulate and decapsulate, and play with the packet header one way or another. That's actually a pretty vast area, and so building a chip which is limited to that, well, that's not such a bad limit. >> Yeah, that makes sense. Actually, I wanted to ask you a little bit about this whole notion of reconfigurability. You talk a lot in the paper about the advantages of being able to physically reconfigure the functions that are enabled on the chip, on the fly, in the field. And you talk about the ability to add new fields, change the topology on the chip itself, change the widths and depths of tables, define new actions, and so forth. And I had a couple of questions about that. One is: how often do you think it will be necessary to change the allocation of resources on the chip, actually in the field, on the fly? Is this sort of a once-a-day type of thing? Is it a once-every-six-months thing? Is it a once-every-two-years type of thing? Is that somewhat constrained by the decisions you make on the chip design? And I guess a specific question I had there was about changing the topology, because that was something where I wasn't able to think of cases where you'd need to do it very often. How often will these kinds of reconfigurations happen, do you think? >> Yeah, well, it turns out that [COUGH], and I suppose it's kind of a flip answer, but at one level, the answer is at least once. [LAUGH] >> [LAUGH] >> And kind of a very obvious-
A very obvious application of that is: let's pick a data center. You have a bunch of different things that you need to do. There are top-of-rack switches with their requirements. There are core switches, there are edge routers, there are load balancers, there's at least some sort of firewall stuff, there are things that route traffic to special-purpose network appliances, CPUs, and so on. And all of those things, you think of them as completely different functions. But if you can wheel in the same box and program it differently, then that's the at-least-once. You set up one as a firewall, you set up one as a top-of-rack switch, you set up one as an edge router, or a NAT box, or something. And actually that's really important, because when you think about it, if each one of these is a different box, potentially from a different vendor, that sounds bad, but the story gets worse from there, because they all have different software environments. And the cost of networking is actually, to me surprisingly, not so much about the cost of the hardware, but the cost of the software. >> Mm-hm, yep. Yeah, that makes sense. So it's sort of like: buy it, configure it at least once, and then place it anywhere, depending on how you- >> Yeah. So now all of your boxes can have the same software environment. >> I see. Actually, that leads to another question that I had, too, which was- >> Oh, actually, there was one more thing I wanted to get at. >> Yeah. >> And it basically has to do with the trends in VLSI design: every two or three years you get the next generation of process technology, which means you get twice as many devices on a chip.
And that's great; that's taken us from a $200 calculator to really powerful computers in your cell phone. But from the designer's point of view, basically what that means is that every two or three years, your productivity has to double. If it doesn't, the cost of design goes up. All right? And that's what's been happening: the cost of design goes up, and up, and up. And of course the cost of manufacturing, kind of the ante to get in, is up and up and up. A set of photomasks for a 16-nanometer chip is $5 million. That's more than entire designs used to cost twenty years ago. So you have to reduce the number of chips you design, and make them more general-purpose. Take Broadcom, for example, who makes the switches that are most commonly used today: high-volume, high-bandwidth, let's say 640 or 1,280 gigabit-per-second switches. You have to essentially put all your eggs in one basket and try to make it serve as many markets as you can. And eventually that's just much more easily done by making kind of a general-purpose programmable box, and then letting the software do the customization. It just gets to be a harder and harder game to play, doing things at the gate, hardware level. >> Yeah, that makes sense. So in some sense, by making the hardware design a bit more general-purpose, you're kind of simplifying that aspect of the design, because you don't have to design a new chip with a sort of increasing number of transistors and gates, and so on, over time. So you've got this general-purpose thing that you can reuse for a number of different purposes. But I guess, doesn't that in some sense move the complexity away from hardware design?
You know, maybe in some sense you're dealing with the complexity by building this [INAUDIBLE] design into a general-purpose chip, but then you've got to somehow program it, or compile something to it, right? So I know you've done a little bit of work on a language where you can describe how you want the layout to happen. But it creates a new problem, doesn't it, in some sense? Because you've got to somehow configure how you use the chip. >> Yeah, no, I guess I wouldn't say it's a new problem. It's the same problem you always had, except now you solve it in software rather than hardware. >> Okay. >> And I suppose the good news about that is that if you make a mistake, you fix it and recompile the program. Versus, if you make a mistake on the chip, it costs you five million for the photomasks, six months of delay- >> Mm-hm. >> Those are career-breaking problems. >> [CROSSTALK] Yes. >> Versus the software build-and-debug cycle, which is something you go through many times a day. >> Yeah. >> But in general, it's easier to express complex behavior in software. So you've kind of taken the industry and given it a higher-productivity lever to get this stuff done. The problem has not gone away, but you've got better tools to attack it with now. >> Yeah. And a much tighter development cycle, too, right? Yeah, now that makes a lot of sense. It seems, though, like you need new tools then, right?
I mean, you need things that basically take you from the software development cycle, if you will, to what actually gets instantiated in hardware, right? Of course, there are things at the level of HDLs like Verilog, but most people are not going to want to be developing in those if they're software developers, right? So it seems like you need a new set of tools. >> Yeah, well, we think that P4, which you mentioned, is a really good step in that direction. If you look at the evolution of OpenFlow, it started out with a single table and 12 fields. Currently, OpenFlow 1.x gives you as many tables as you want, and the number of fields keeps escalating; it's at 40-something now. And actually, what happened in OpenFlow was, let's say it was OpenFlow 1.3, that added PBB encapsulation. It was the first type of encapsulation that they added. And there was a lot of discussion inside, trying to figure out, well, PBB isn't the only type of encapsulation; there's GRE and VXLAN and such. And of course, I joked: why did PBB get included and not VXLAN? Well, the PBB guy happened to be there at the meeting and volunteered to do it, or the VXLAN guy stepped out to the restroom while they took the vote, or something. [LAUGH] But the real issue is that we couldn't figure out a way to specify something general-purpose enough that would let you do any sort of encapsulation. And we certainly didn't want to say, we'll be the Gestapo: you want to add an encapsulation, well, we'll do it for you. We really weren't offering any improvement, so instead of doing the wrong thing, they did nothing. Which is the right choice.
And of course, what's needed, and one of the aspects of P4 is just recognizing this, is that you want the language itself to be protocol-independent. So you can define fields as you want. And of course there'll be a standard set of fields that everybody uses, and then to that, people can add or subtract, or do whatever they want. That's, in a sense, kind of a clean break and a clean restart, but it's very necessary to get things going. And really, the only other major break was the inclusion of a more general sense of being able to do actions. The OpenFlow actions were very limited. For example, suppose you're encapsulating an IP header within another IP header, so the existing one becomes the inner header, and you're adding a new outer header. IP has a length subfield in it, and of course, what that length is going to be is the whole length of the packet that you've encapsulated inside, plus some kind of offset. So you have to be able to do an add as an operation. Well, there was no add. [LAUGH] So it doesn't take a huge number of primitives, but they have to be the right primitives. And people building CPUs solved this problem decades ago. But by including a relatively small set of the usual suspects as primitive operations, you can build a general action capability. >> Mm-hm. So yeah, there's this whole notion of generalized actions, and being able to generalize the fields that you match on, and sort of broaden the scope of what can happen in hardware, and then the use of a language like P4 where we can define what's going on in software.
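The length example above is easy to make concrete: the outer header's length field is the inner packet's total length plus the size of the new header, which is exactly the add primitive early OpenFlow lacked. A sketch, assuming a minimal IPv4 header with no options:

```python
# Encapsulating IP-in-IP: the outer header's total-length field must be
# computed as inner length + outer header size. That computation needs
# an "add" primitive in the action set.

OUTER_HDR_BYTES = 20   # minimal IPv4 header, no options (assumption)

def encapsulate(inner_packet):
    inner_len = inner_packet["total_length"]
    outer = {
        "total_length": inner_len + OUTER_HDR_BYTES,  # the add operation
        "protocol": 4,          # protocol 4 = IP-in-IP
        "payload": inner_packet,
    }
    return outer

pkt = {"total_length": 1480, "protocol": 6, "payload": b"..."}
print(encapsulate(pkt)["total_length"])   # 1500
```

One integer add is all it takes, but without that primitive in the action set, the encapsulation cannot be expressed at all.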
In some sense, it seems to blur the boundary. In the old-school days of OpenFlow, which wasn't so long ago, right, you basically had a very narrow interface to a chip that could do pretty simple match-action, and that pretty much was it. Anything else that needed to happen was going to happen in software. And now it seems like, with this ability to define what's going on in the chip, or what the chip looks like, from software, you've kind of blurred the boundary between your software controller, if you will, and the software that's defining what the chip looks like. >> Yeah, I think- >> [CROSSTALK] So- >> Oh, so go ahead. Yeah. >> Yeah, I think there are a couple of interesting aspects in that. There's really a lot of change going on in networking right now, and one way to express an aspect of that change is that OpenFlow started to bring it about, even in its very first version. >> Mm-hm. >> Before OpenFlow, switches were always fixed-function hardware devices. And OpenFlow started to bring in programmability, and that improved and increased through the versions of OpenFlow. And I think P4 then continues that evolution. But there's a very important mindset change: what's happening is people are stopping thinking about switches as fixed-function things, and starting to think of them as programmable devices. An analogy is: you have your PC at home, or the laptop or workstation that you work at at work. A lot of people, most people, don't program their PCs. But those of us in the industry, well, we write code all the time.
And eventually, I think the environment will be such that people will naturally think: well, I have a switch, it's a programmable device. Why am I going to just use the code that's provided to me? I'm perfectly capable of writing code myself; let me program it to do what I want. So we have computers that we use and program every day; well, why not have switches that we use and program every day? >> Absolutely. Do you think, I mean, it sounds already like you think it's going to change the way that we program the network. But that almost begs the question: we've got these controller architectures now, you know, OpenDaylight, and Floodlight, and POX, and all of these things, which have a certain notion of what forwarding looks like, right? And now, once you start adding the ability to reconfigure what the hardware's doing from software, it sort of begs the question of: are we going to have the same kind of abstraction? Is the controller going to be dealing with the same kinds of abstractions? Or do you think that controllers are going to look totally different, because of the types of functions that the underlying hardware, through something like P4, could expose to this control software? It seems like that might necessarily change as well, and that might change the notion of what a controller is supposed to do. >> Yeah, well, there's definitely an aspect of the sorcerer's apprentice getting a brand new toy. If you think about it, right now switches generally, let's say an OpenFlow switch, are programmable, but they have a fixed architecture. The controller can then essentially put a system architecture into a network of devices.
So these are happening at two different levels. But another way of looking at it, kind of a third dimension coming in, is that now with these switches, we can completely change the personality of the switch. It's almost like you have switching, you have network configuration, and then you have architecting the whole thing. Let me give an example. One thing which has been proposed many times in the past, and for all I know is being implemented, is source routing. Let's say you're in a data center and you realize you have five hops from one top-of-rack switch to the next; let me just put in five labels which, at each switch, tell it where to go. All right, [INAUDIBLE] that's something simple and primitive. At the very lowest level, it demands the capability in a switch to set up tables to handle it. But at the very highest level, it's re-architecting your entire system. And so now, do you think of your controller as a different kind of beast to do this? It's almost like, I would build, or program, a controller to run this way versus run that way. It definitely is providing a lot of new capability. And actually, I think it's going to make networking much more interesting over the next several years, because there are all these new different ways that it can expand. >> Mm-hm. Mm-hm. And I guess that certainly addresses what you might call the up direction, towards controller applications. What about the direction from P4 down to the hardware?
I mean, with something that takes a language like P4, where someone can specify how various resources should be allocated on the chip, how aware does the compiler have to be of the underlying hardware? I assume you sort of designed P4 with something like RMT in mind, but what if I wanted to slip in, you know, an FPGA underneath, or what if I wanted to slip in a software switch? Do any of those kinds of forwarding planes, if you will, make any sense at all, or do you think the general-purpose chip is the only thing that's going to be under there? >> Yeah, well, I think that question breaks down into several levels. Really, at the top level, the utility of something like P4, or let's just call it a common programming language to express and codify switch behavior, that in itself is a really useful concept. It turns out, once we'd submitted this P4 paper, we had feedback from a bunch of different places: oh yeah, we're doing something kind of like that; we're doing something kind of like that. Turns out everybody's doing the same thing. Well, if we're all doing the same thing, let's do it in the same language. >> Yes. >> Just getting that would kind of coalesce the industry together and provide a fundamental productivity lift. So that in itself is a very useful feature of the language.
But now, on the way down, looking toward the switch, that task broadly breaks down into two aspects, and they're, I suppose you can say, target-independent and target-dependent. There are a bunch of things that you can do, or sometimes have to do, that remain entirely in the target-independent form. For example, there are lots of techniques which have been published already. Let's say you can take two tables and squash them together. You may get cross-producting, which may be bad or may not be; if they're two small tables, you don't care. If you don't have enough tables, you do that. Or you may break them apart: let's say, in RMT, if a table doesn't fit in one stage, you split it in two and put it into two different stages, and there's a level at which you just express that as two different tables. Well, for those sorts of optimizations, both the original view and the modified view are expressible in P4. You're just manipulating the table graph. And of course, P4 is both the language and kind of the underlying object representation that you can play in. So you can think of this as a toolbox: you start out with a P4 representation, you manipulate it and improve it in some way, and you get another thing in the same representation that you can now bring other tools in your toolbox to manipulate. And that helps to build kind of an industry-wide ecosystem for those sorts of capabilities, because they can all interoperate.
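Cross-producting two tables means building a single table whose entries pair up every entry of the originals, so the merged table's size is the product of the two. Fine for two small tables, explosive for large ones. A toy sketch with invented table contents:

```python
from itertools import product

# Squashing two exact-match tables into one by cross-producting: the
# merged table matches on both keys at once, at the cost of holding
# |A| * |B| entries.

vlan_table = {10: "tenant-a", 20: "tenant-b"}     # key: VLAN id
port_table = {80: "web", 443: "web-tls"}          # key: destination port

merged = {
    (vlan, port): (vlan_table[vlan], port_table[port])
    for vlan, port in product(vlan_table, port_table)
}

print(len(merged))           # 4 entries = 2 x 2
print(merged[(10, 443)])     # ('tenant-a', 'web-tls')
```

Both views, the two separate tables and the one merged table, describe the same forwarding behavior, which is why this kind of rewrite can stay entirely in the target-independent representation.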
Now, once you get down to a certain level, things become very target-dependent. Say you're building a software switch; in some sense it's, hey, I can do whatever sorts of tables I want. >> Unlimited resources, right? >> Yeah, I'm done. >> Yeah. >> But on the other hand, say somebody represents something as a TCAM. All right, TCAMs don't map well onto general-purpose compute. So you might have a bunch of code which is a TCAM optimizer: all right, here's a whole bunch of entries that have the same don't-care pattern, so I'm going to pull them off into an exact-match table. There are all sorts of ways to attack that, but that's essentially the software equivalent of all the special optimizations you have to do. One example: Huawei has an architecture they call POF, Protocol Oblivious Forwarding, and one of the things they do is parse a little bit, then do a little bit of match-action, then parse a little bit. Essentially they parse as needed, rather than parsing everything up front, which is what's described in the RMT paper. Well, in software, why do a task before you need it? That's just a simple optimization: don't expend extra compute resources. Whereas in hardware, a parser is a very different animal from the match-action stages, so you want to separate the two. Stringing parsing out in between match-action stages is an optimization you would do on the software side of things. >> Interesting. Yeah, that brings up a question which, at least from my perspective, is somewhat hotly debated, which is this performance gap between hardware and software.
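The TCAM-optimizer pass mentioned above, pulling entries that share a don't-care pattern out into an exact-match table, can be pictured as a grouping pass like this. A toy sketch with made-up entries; real optimizers are far more involved.

```python
from collections import defaultdict

def split_by_mask(tcam_entries):
    """Group ternary (value, mask, action) entries by their
    don't-care mask. Entries sharing a mask can be served by one
    exact-match table keyed on only the cared-about bits, which
    a software switch can do with a plain hash lookup."""
    groups = defaultdict(dict)
    for value, mask, action in tcam_entries:
        groups[mask][value & mask] = action  # key on cared bits only
    return groups

entries = [
    (0x0A000001, 0xFFFFFFFF, "drop"),    # exact /32 match
    (0x0A000002, 0xFFFFFFFF, "permit"),  # exact /32 match
    (0x0A010000, 0xFFFF0000, "count"),   # /16 prefix
]
groups = split_by_mask(entries)
# Two groups: the /32 entries become one exact-match table,
# the /16 entry becomes its own (or stays ternary).
```

This ignores priority ordering between overlapping masks, which a real pass would have to preserve.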
I think the RMT paper itself kind of plants a stake there and says these things are always going to be two orders of magnitude apart. And then there seem to be other camps that say, oh no, software's catching up; it's almost as fast as hardware. I imagine the performance gap must depend on the types of processing you're doing, the types of functions you need, kind of like you were just alluding to with how parsing happens. But I guess the question is: what is the current performance gap? Do you think software switches are going to catch up for certain kinds of processing, or catch up in general, or is there always going to be a gap? And what are the actual numbers and trends? >> Well, I do believe that gap has stayed constant, or pretty constant, for a very long time, and I can't see how it's really going to close significantly. One of the fundamental drivers behind it is that CPUs get built in semiconductor technology, and that same technology is available for switches. So as the CPUs improve, so do the switches; you're chasing a moving target. If you're running a 5K and somebody is in front of you by 100 yards and you're running as fast as them, you're not closing; you're both just covering ground. But it's also the case that there are some things which are very easily done in hardware but which just take compute cycles in software. For example, parsing. There's a loop of: I inspect this field, decide what to do next; inspect this field, decide what to do next. And that's a series of branches, all right?
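The inspect-a-field, branch, inspect-a-field loop described here looks roughly like this in software, where each step is a data-dependent branch. A minimal sketch covering only Ethernet, IPv4, and TCP; the sample frame's fields are not filled in realistically.

```python
def parse(packet: bytes):
    """Walk header types: each state inspects a field and branches
    to the next state. This serial chain of branches is what burns
    CPU cycles, while a hardware parser does it cheaply."""
    headers, state, offset = [], "ethernet", 0
    while state is not None:
        if state == "ethernet":
            ethertype = int.from_bytes(packet[offset + 12:offset + 14], "big")
            headers.append("ethernet")
            offset += 14
            state = "ipv4" if ethertype == 0x0800 else None
        elif state == "ipv4":
            proto = packet[offset + 9]
            ihl_bytes = (packet[offset] & 0x0F) * 4
            headers.append("ipv4")
            offset += ihl_bytes
            state = "tcp" if proto == 6 else None
        elif state == "tcp":
            headers.append("tcp")
            state = None
    return headers

# Minimal Ethernet + IPv4 + TCP frame: only the branch-deciding
# fields (ethertype, IHL, protocol) are set.
pkt = bytes(12) + b"\x08\x00" + b"\x45" + bytes(8) + b"\x06" + bytes(40)
```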
Well, it's very hard to optimize that past a certain point on a CPU, and you're burning an entire CPU for it. Doing that in hardware, even programmable parsing, is just not that expensive. One of the things the RMT paper mentioned, from the results shown in Glen Gibb's thesis at Stanford, was that going from a fixed-function parser to a programmable parser might cost you a factor of two, but if that's going from half a percent to 1% of an entire switch, who cares. If I can spend 1% of the chip area on a parser that would otherwise burn a bunch of CPU time, there's just a fundamental difference in what it costs to do something. Now, there are some things, say if you need gigabyte match tables, well, those are very expensive. Essentially, on switches, the thing you pay for most dearly is RAM bits. So really big tables? All right, then maybe you have to use a CPU, or a switch with external RAM that you can get at in volume. On the other hand, the RMT paper describes this pipeline of 32 stages, and you can have 16 tables in a stage. So now you can have hundreds of tables in a switch. And really the cost there is mostly the total number of bits in your tables: the switch can do two small tables without much more difficulty than one big table. On a CPU, every single table costs you compute time and resources. So that sort of scaling is another dimension.
And I suppose one way to look at that is: if switches like RMT, or general OpenFlow switches, become the common way of doing things, then if you want to build a switch with 50 tables in it, do it. You want to build a switch with 150 tables in it, do it. A table is no longer thought of as an expensive resource; its expense is, well, how big is it? If you want a bunch of small tables to do fancy behaviors, tables that don't have tens or hundreds of thousands of flows in them, they're very simple to do. That's another way in which hardware switch behavior is going to leap ahead, and it's very difficult to see how CPUs are even going to keep up, let alone not fall behind. Now, of course, the CPU folks are busy optimizing all sorts of things, from data IO to instructions, so they have their own bags of tricks. But everybody has their secret sauce that they think is going to conquer the world; well, it turns out their competitors are busy conquering the world at the same time. So. >> Mm-hm. Yeah, some of what you were talking about: okay, some things are definitely always going to be better in software, and some things software will never catch up on. Certainly there's a place for RMT and the like, and you designed it in such a way that it's sufficiently general to cover a specific set of applications and types of processing. And then you rule out things where the complexity curve takes off: encryption, transcoding, and so forth. I'm wondering what you see in terms of hybrid architectures.
I mean, you talked about how RAM is pricey in hardware, so maybe we should deal with that elsewhere. There are other things too, right, like FPGAs and so forth. In the RMT paper you kind of rule out FPGAs as prohibitively expensive, but I'm wondering whether, even with the ASIC, even if you get it right in a sufficiently general sense, you've got a software development cycle with P4 or whatever it happens to be in the context of that ASIC. What about the other stuff? Do you think there's room to play with FPGAs and software in conjunction with something like RMT? >> Well, one way to look at it is to break down the problem into different application spaces. The thing that hard-wired switches are really good at is lots and lots of parallel computation: all these VLIWs, lots of matches, a lot of memory, a lot of TCAM. And of course large tables are valuable. That's very hard to do in an FPGA. They have much less RAM than you can put on a chip, and TCAM essentially has to be emulated by building gates, which makes it very costly in terms of area. So with FPGAs you have a much more difficult time; they can't even get close to the table capacity that you get with a hard-wired switch. And the same thing is true with respect to throughput, the massive parallelism you can get by laying down gates. Essentially, every gate in an FPGA is, and I don't know the exact number, an order of magnitude more expensive and an order of magnitude slower.
It's a very rough guess; it might be 5, it might be 50, but it's significant. So if you're trying to win the high-volume switching game in an FPGA, that's just not going to have a happy ending. But one place where FPGAs are seen is, say, going in on a particular port: you can put a bump-in-the-wire FPGA there to do some particular type of processing, and that's pretty common these days. Of course, one of the reasons is that the switches themselves aren't that flexible. But it certainly is the case that you can get an FPGA off the shelf, and the turnaround time to develop for it is small on the scale of real chip design. And who knows? Let me back up and say that in a high-volume switch, of the sort the RMT paper describes, you're basically processing a packet every nanosecond. So you can get at the memory you have on chip, but if you want to get at off-chip memory, you have to make that transaction every nanosecond. And that's if you want one table, one access; what if you want several? So past a certain bandwidth it just isn't feasible to get on and off chip. Okay, now suppose you lower the bandwidth a little bit. If you have an FPGA that's processing, say, one 10-gigabit stream instead of 64 of them, the bandwidth is lower. Now you can afford to hang a bunch of SRAM, or maybe even DRAM, off of it.
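The back-of-envelope that makes off-chip tables infeasible at full switch bandwidth but plausible for a single 10G port goes roughly like this. The numbers (64 ports of 10G, 64-byte minimum packets, one lookup per packet) are illustrative, in the spirit of the RMT paper's switch.

```python
def lookups_per_second(ports, gbps_per_port, min_packet_bytes=64):
    """Table accesses per second needed if every minimum-size packet
    triggers one lookup. Ignores inter-frame gap and preamble, so
    it's a rough upper bound, not a precise figure."""
    bits_per_packet = min_packet_bytes * 8
    return ports * gbps_per_port * 1e9 / bits_per_packet

full_switch = lookups_per_second(64, 10)  # ~1.25e9/s: roughly one lookup
                                          # per nanosecond, per table
one_port = lookups_per_second(1, 10)      # ~2e7/s: within reach of
                                          # off-chip SRAM or DRAM
```

A random off-chip access every nanosecond per table is what no external memory interface sustains, which is why lowering the bandwidth by 64x changes the architecture options.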
So if you have a particular application that has some fancy behavior but needs access to a whole pile of memory, well, all right, there's an architecture that works, because that's solving a real problem you just can't get at in the switch. Now I may have 64 of these things, each with its own RAM hanging off it, and I've got a system where I can build in lots and lots of gigabytes of RAM and use it effectively. Essentially, when you can't solve it on a single chip, there's still a way to get at it. So for special-purpose processing, it's entirely possible. >> Interesting. So I guess another question I had was, as far as the P4 compiler goes, do you think there are hard problems there as far as layout is concerned? You mentioned before that the programmer in P4 could specify one table with a cross-product of fields, or they could specify two. >> Yeah. >> In one sense you could let the compiler make those optimization choices, or you could let the programmer do it. >> Mm-hm. >> So what do you think is going to happen there? Do you see compilers doing really fancy optimizations? Do you see the programmer figuring that all out themselves? >> Right. I think there's a lot of undiscovered territory out there. One of the things I would say about RMT is that it's as close as you can get in hardware to directly expressing the table graph. >> Right. >> Which, you would like to think, makes the compile problem simpler. I used to joke that this is not an NP-complete problem.
Well, of course, we haven't proved that it's not an NP-complete problem. It's good table banter, but that doesn't constitute a proof. I would like to think this problem is fairly simple, but in actuality, the approach you might take at first is: for each table in, let's call it the logical table graph, the table graph the user has, figure out how much RAM I need, how much match memory, how much action memory. You start at the beginning of the table graph, you put tables into a stage, when you fill up a stage you move to the next stage, and you just go that way. That sounds like linear complexity, okay. But in a real switch, your constraint isn't just that each stage has a certain amount of table capacity. There are other constraints described in the RMT paper: you can only extract a certain width of bits to do matching on, you only have 16 tables in a stage, you have a certain amount of action width. And different hardware will have other constraints, which might get more and more Byzantine as you go into the guts of the machine. Optimizing for one goal is pretty simple; all of a sudden, when you have a pile of goals you're trying to optimize simultaneously, it's hard to prove there's a simple way to do that in general. It might work for most cases, but if somebody's biggest customer turns out not to fall into the most-cases bucket, well, you still have a problem.
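The first-fit placement sketched above, walking the table graph and filling each stage until a constraint trips, might look like this with only two of the many real constraints modeled (RAM blocks per stage and tables per stage). The stage and RAM-block counts echo the RMT paper's 32-stage, 16-tables-per-stage pipeline, but the table sizes and the two-constraint model are purely illustrative.

```python
def place_tables(tables, stages=32, ram_per_stage=106, tables_per_stage=16):
    """Greedy first-fit: walk tables in dependency order, packing
    each into the current stage until the RAM or table-count limit
    trips, then open the next stage. Real compilers juggle many more
    constraints (match width, action width, crossbar bits), which is
    what makes the multi-goal version hard."""
    placement, stage, ram_used, count = {}, 0, 0, 0
    for name, ram_blocks in tables:
        if ram_used + ram_blocks > ram_per_stage or count >= tables_per_stage:
            stage, ram_used, count = stage + 1, 0, 0
        if stage >= stages:
            raise ValueError(f"table {name!r} does not fit in the pipeline")
        placement[name] = stage
        ram_used += ram_blocks
        count += 1
    return placement

# Hypothetical tables sized in RAM blocks.
demo = [("ipv4_lpm", 80), ("acl", 40), ("nexthop", 20)]
# ipv4_lpm -> stage 0; acl overflows stage 0's RAM -> stage 1;
# nexthop still fits in stage 1.
```

Note this sketch can't split a table across stages; a table bigger than one stage's RAM would simply fail, which is exactly the case where the break-apart optimization discussed earlier kicks in.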
So I'd like to think the prognosis is good, but there's a long way to go to actually show that all of this works. And I think a lot of that will be in the realm of target-specific optimizations. Somebody will design a chip with a certain architecture and set of limits, and then there are the ways they cheat at the end, because this grand idea just can't quite work, so they build in a bunch of things that solve their problems and toss a bunch of problems over the wall to the compiler. And not tell them for a little while. >> Yeah. That's a great place to wrap things up, actually. You mentioned this green field, if you will, of target-specific optimizations; that sounds like a great place for people thinking of doing more work in this area. And I guess, in closing, I might ask about other things people should be thinking about, and in what context. I could put it this way: we can read the RMT paper and ask, do you think you made all the right choices there? You certainly made choices between, say, using a pipeline versus a [INAUDIBLE]. Do you think you made all the right decisions? If not, do you think RMT is a reasonable context within which to think about other problems? And what are the big problems in this area? Do they remain in chip design, or in other parts, like compilation? >> Actually, I think one of the most interesting areas is actually higher up.
And that is: say you express something in P4. I've described a particular set of header fields, and I describe a table graph which can do the typical L2, L3, whatever sort of standard processing somebody wants. What you really want is an ecosystem where this is object-oriented, so company A can plug in one enhancement: all right, here's our super-duper MPLS processing. And you can buy another module from company B that says, well, we not only have VXLAN, we've got VXYZ LAN. And plug those two things in. Make the language, and the underlying object representation, workable so that you can plug in orthogonal enhancements from different vendors and have them work together. Now we have the beginnings of an ecosystem where, hey, VXYZ LAN, yeah, I've got an app for that. But in the end you have to think about what that means. Adding the fields to the parser, saying I now recognize MPLS, that's actually fairly easy. But when I add MPLS processing or VXYZ LAN processing, I'm altering the table graph: maybe I'm breaking something apart and sticking something in there. And exactly what does that really mean? So I think there's an area there which is interesting and fun: how do I add a capability and express that in a composable sort of way? That's entirely at the P4 language level, operating on, say, a table graph representation underlying it.
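The splicing operation described here, breaking the table graph apart and sticking a vendor's module in, can be pictured as a graph edit. This is a toy representation (a dict of successor lists, hypothetical table names); how to do this composably and semantically safely in real P4 is exactly the open question being raised.

```python
def splice(graph, before, after, new_table):
    """Insert new_table on the edge before -> after in a table
    graph stored as {table: [successor tables]}. The mechanical
    edit is easy; deciding where such an edit is semantically
    valid is the hard, unsolved part."""
    succs = graph[before]
    if after not in succs:
        raise ValueError(f"no edge {before!r} -> {after!r}")
    succs[succs.index(after)] = new_table
    graph[new_table] = [after]
    return graph

# A generic L2/L3 pipeline as a table graph.
g = {"parser": ["l2"], "l2": ["l3"], "l3": ["egress"], "egress": []}
splice(g, "l2", "l3", "mpls")  # vendor's MPLS module between L2 and L3
# g["l2"] is now ["mpls"], and g["mpls"] is ["l3"].
```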
But I think that's an area where a successful solution means a lot to the industry and opens up a lot of capability. >> That's super interesting. It's almost like an app store for hardware. >> That's right. >> That's very cool. Well, thanks a lot. I don't know if there's anything else you want to close out with, but I really appreciate you taking the time and speaking to the students. This clearly is an area that's becoming super hot, one that basically wasn't even there a year ago, so it's really great to have the opportunity to chat with you today. >> Yeah. Oh, well, thanks for the opportunity. >> Thanks a lot! >> All right.