What we have done up until now is understand how DPDK, as an exemplar for user space packet processing, lets us develop virtual network functions efficiently, bypassing both the hypervisor and the kernel, and resulting in an implementation that is appropriate for packet processing. So what we're going to do now is look at implementing a VNF on commodity hardware using DPDK. This is where we're going to see some of the facilities that are available in DPDK to exploit the features that may be present in commodity hardware. The DPDK library provides several optimizations, and these opportunities for optimization ultimately improve packet processing. Each optimization available in DPDK is meant to eliminate a particular source of performance loss in an operating system like the Linux kernel; that may be one motivation, or the optimization may in fact exploit some hardware feature of the NICs and modern CPUs in order to make packet processing very efficient. So now let's dive in and look at using DPDK to implement a network function on commodity hardware. First of all, we need to understand what commodity servers look like these days. Modern commodity servers contain multi-core CPUs, and what we are able to do now is use multiple cores for packet processing, which allows us to match the increasing capacities of the NICs. Remember what I said about CPU performance scaling breaking down, and the CPU therefore not being able to keep up with network speeds. Now parallelism is available in the form of multiple cores, and so we can process incoming packets in parallel and thereby match the increasing capacities of the NICs. DPDK gives you the ability to exploit this hardware, and we'll see how that works. The other thing is that these servers are increasingly NUMA servers: there are multiple sockets.
When I say multiple sockets, each socket is a CPU, a particular CPU may have multiple cores, and associated with every CPU is a local RAM. In other words, on a multi-socket board you have multiple CPUs, there are cores within each CPU, and there is a RAM associated with each of those CPUs. Accessing a remote RAM is much more expensive than accessing the local RAM: if this is one CPU and that is another CPU, then this CPU accessing the other CPU's RAM is going to be more expensive than accessing its own RAM. So that's something you have to pay attention to. The upshot is that you have to carefully design the packet processing path, from the NIC to the network function, taking these hardware trends into account. What are the hardware trends? The first is that you have multiple cores in any CPU. The other is that on a motherboard you have multiple sockets, meaning multiple CPUs, each CPU has a local RAM, and the local RAM is faster to access than a remote RAM associated with a different CPU. All of these are considerations you have to take into account, and this is where you need a partnership between system software and hardware. DPDK is a vehicle by which we can get this partnership between the system software and the hardware. If you're implementing a network function on top of the DPDK library, the application model that you choose can be one of two types. One is the run-to-completion model. Basically, what is going on in a run-to-completion model is that you're polling for an incoming packet (this is the code snippet that I showed you earlier as well), and once you have a packet, you process it and then transmit it out on the network card, with all of this being done on the same core. The same core is getting the incoming packet, processing it, and transmitting it out.
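To make this concrete, a run-to-completion core in DPDK might look roughly like the following. This is a sketch, not a complete program: EAL arguments, port configuration, queue setup, and error handling are elided, and the port number, pool sizes, and burst size are illustrative values, not anything prescribed by DPDK. Note how the packet buffer pool is created with `rte_socket_id()`, which ties back to the NUMA point above: the buffers land in the RAM local to the socket this code runs on.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Run-to-completion loop: the SAME core polls the NIC, processes
 * each packet, and transmits it back out. */
static int lcore_main(void *arg)
{
    uint16_t port = 0;                 /* illustrative port id */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the receive queue -- no interrupts involved. */
        uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... per-packet network function logic goes here ... */
        }
        /* Transmit from the same core; free anything the NIC refused. */
        uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(bufs[nb_tx++]);
    }
    return 0;
}

int main(int argc, char **argv)
{
    rte_eal_init(argc, argv);

    /* NUMA-aware allocation: create the mbuf pool on the memory of
     * the socket this core lives on, so packet buffers sit in local,
     * not remote, RAM. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", 8191, 256, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    (void)pool;  /* would be passed to rte_eth_rx_queue_setup() */

    lcore_main(NULL);
    return 0;
}
```

The key property to notice is that there is no handoff anywhere: receive, process, and transmit all happen in one loop on one core.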
And each packet is handled by a unique core. There are multiple cores deployed for a particular network function, and any one core does all of the things I mentioned in this particular bullet. The second model is what is called a pipelined model, where we take the network function and compartmentalize it into different functionalities: we dedicate some cores for polling and separate cores for processing the packet. In that case, you have to do inter-core packet transfer using ring buffers. So those are the two different models. In terms of where you might use each one: the run-to-completion model is one in which every core is responsible for both the I/O and the packet processing, so it is the simplest model to implement. What you have is the network card here, and each of these boxes is a core (a core, as opposed to an entire processor, just from the point of view of this figure). The network function running on a core is polling the ring buffer, and when a packet comes in, it processes it, doing all the work associated with that packet, and once it is done, it transmits the packet out on the transmit queue. This is how each core operates, so you can see that there is a complete path from the NIC receive queue to the transmit queue through a single core, and each packet sees only one core. This works for monolithic packet processing code, where all the packet processing logic is contained within a single thread. It is simple to implement, but it is less expressive, obviously, because you're not breaking the functionality down into packet I/O and the actual function being implemented by the network function. The second model is the pipelined execution model, and here we dedicate cores specifically for processing the network function logic.
Some cores are dedicated to reading the packets, and other cores run the logic associated with the network function. As a result, each packet sees multiple cores. This is particularly useful when you have to chain multiple pieces of processing logic within a network function. Remember, in the first lecture I talked about the different network functions that may have to exist in an enterprise: you may need a firewall, you may need a router, you may need a load balancer, and all of these sit in the path of a packet coming in. In that case, you may want to separate the packet I/O from the functions that implement a particular network function. So what's going on here is that a packet comes in, and here is the thread that is doing the polling for new packets. It polls and gets a packet out, and once it has a packet, it puts it in a ring buffer for another network function to pick up, for instance a firewall. The firewall may process the packet, and maybe there are other intermediate hops as well, if there are additional network functions that have to sit between the packet coming in and the packet going out. That's the pipelined execution model. The inter-core communication is done using these ring buffers in memory that we already talked about. This model is also useful when packet processing is CPU bound. If packet processing can happen at line speed, then it's perfectly fine to use a run-to-completion model. But when packet processing is CPU bound, a single core may not be able to sustain the rate at which packets are coming in from the NIC, and in that case we may want to farm the packets out to different cores. Having a number of polling cores and a number of processing cores is a good choice in that case.
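A pipelined arrangement might be sketched in DPDK as follows, again as an illustration rather than a complete program: initialization details and error handling are elided, the port number and lcore ids are made up, and the two-stage split (one polling core, one processing core) is the simplest possible instance of the model. The `rte_ring` is exactly the in-memory ring buffer used for the inter-core handoff described above.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_ring.h>

#define BURST_SIZE 32

/* The in-memory ring used for inter-core packet transfer. */
static struct rte_ring *xfer_ring;

/* I/O core: only polls the NIC and hands packets off; it never
 * runs the (possibly expensive) network function logic. */
static int rx_core(void *arg)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0, bufs, BURST_SIZE);
        if (nb_rx > 0)
            rte_ring_enqueue_burst(xfer_ring, (void **)bufs, nb_rx, NULL);
    }
    return 0;
}

/* Processing core: the CPU-bound stage (firewall, IDS, ...) runs
 * here, decoupled from packet I/O. */
static int worker_core(void *arg)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        unsigned n = rte_ring_dequeue_burst(xfer_ring, (void **)bufs,
                                            BURST_SIZE, NULL);
        for (unsigned i = 0; i < n; i++) {
            /* ... expensive per-packet work, then transmit or free ... */
        }
    }
    return 0;
}

int main(int argc, char **argv)
{
    rte_eal_init(argc, argv);

    /* Create the ring in local NUMA memory; single-producer /
     * single-consumer flags, since exactly one core enqueues and
     * exactly one core dequeues. */
    xfer_ring = rte_ring_create("pipe", 1024, rte_socket_id(),
                                RING_F_SP_ENQ | RING_F_SC_DEQ);

    /* Pin the processing stage to its own lcore (id is illustrative);
     * the main lcore does the polling. */
    rte_eal_remote_launch(worker_core, NULL, 2);
    rx_core(NULL);
    return 0;
}
```

More stages can be chained the same way, with one ring between each pair of consecutive stages, and a stage that becomes the bottleneck can be given several worker cores (dropping the SP/SC flags on its ring accordingly).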
A good example of such a network function is an intrusion detection system, where the processing you're doing, looking for signatures of malicious traffic or malware attacks, may have to be done in a CPU-intensive manner. You may not be able to do this processing at line speed if you dedicate one core to the entire pipeline, and therefore a pipelined execution model may be a good way to exploit parallelism in order to do the packet processing and keep up with the line speed.