[MUSIC] So this would work, but there's a problem. This plot, from a former student here at UW who worked on this problem, shows time in seconds on the x-axis, and each row is one of the tasks. You can see that the reduce tasks can't start until all the map tasks have finished. Most of the map tasks finish quite quickly, in maybe 20 seconds, but one or two of them take a very long time, on the order of 270 seconds. So all of this space in between is wasted time.

There are a couple of problems here. One is that, fundamentally, in many applications you logically could start doing some work early, based on the map output that has already finished, but that's just not the way MapReduce is designed. It can't take advantage of those applications because it can't guarantee that doing so is safe. For example, if you're adding up numbers, the reduce phase could start adding early. But if you're doing something more complicated, you may actually need all the results to be present in order to get the correct answer. Since the system can't guarantee it's safe, it makes you wait.

So this skew problem ends up killing parallelism, turning a job that could take 50 seconds into one that takes 350 seconds. If you added up all these pieces, that's barely better than just running the whole thing sequentially; it's still faster than sequential, but you certainly lose most of your bandwidth and parallelism. One task might take five times longer than the average, and the whole job is gated on it. I'll show you one way to partially solve this problem on the next slide.
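The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration with hypothetical task times (the 20s/270s numbers come from the lecture; the task counts are made up): with a synchronization barrier between phases, the job's completion time is the slowest map task plus the slowest reduce task, no matter how many machines you have.

```python
# Hypothetical task durations: 99 fast mappers and one 270-second straggler.
map_times = [20] * 99 + [270]
reduce_times = [30] * 20

# No reduce task may start until every map task finishes, so with unlimited
# parallel slots the job's makespan is max(map) + max(reduce):
job_time = max(map_times) + max(reduce_times)

# If the straggler had run at the typical 20-second speed instead:
balanced_time = 20 + max(reduce_times)

print(job_time)       # 300
print(balanced_time)  # 50
```

The point is that adding machines doesn't help: the barrier makes one slow task dominate the entire job.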
The takeaway here: if someone asks you what one of the biggest performance bottlenecks of MapReduce is, you should say stragglers, or skew. "Skew" is more the term for this in the database community; "straggler" is, I think, a little more common. A straggler is simply a task that takes much longer than the others.

All right, so what can we do about this? Here's the situation: a lot of keys all ended up in the same reducer, and that reducer is taking too long. One thing we can do is split its data across more reducers. So allocate a few more reduce tasks, move that data over, and split it three ways. Then broadcast the little red relation. I made it disappear from the other side, recognizing that it's small, and replicated it to all three of these reducers. So this is a combination of the replicated (broadcast) join and the regular reduce-side hash join. Now you've got three reducers working on this problem instead of just one, and things are a little better balanced. This skew join is something Pig can do if you specify it; it won't do it automatically. And in the end we still get the entire joined relation, which is what we want.

So finally, merge join is the third special case. Let's see: the first special case, the broadcast join, was an opportunity to do things much faster when one relation is very large and the other is small enough to fit in memory. The second, the skew join, was more of a problem that can occur, plus a special trick to address it. Merge join is more like the former: it's an opportunity to use a higher-performance algorithm when certain conditions are met. So what are those conditions? Once you realize that the red relation and the blue relation have already been co-partitioned on the appropriate join key, then the map phase alone has enough information to compute the join by itself.
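The skew-join trick described above can be sketched concretely. In Pig you request it with `JOIN ... USING 'skewed'`; the sketch below simulates the idea in plain Python with made-up data (the relation names `R` and `S` and the key `'red'` are hypothetical): the over-represented key's tuples from the big relation are split across several reducers, and the small relation is replicated to all of them, so the union of the partial joins equals the ordinary single-reducer join.

```python
# Hypothetical skewed input: in R, the key 'red' is heavily over-represented.
R = [('red', i) for i in range(9)] + [('blue', 0)]
S = [('red', 'x'), ('red', 'y'), ('blue', 'z')]

# Split R's tuples round-robin across 3 reduce tasks (for brevity the rare
# 'blue' key just rides along in one of the partitions)...
k = 3
partitions = [R[i::k] for i in range(k)]

# ...and broadcast (replicate) the entire small relation S to every one.
def reducer(r_part, s_all):
    return [(key, a, b)
            for (key, a) in r_part
            for (key2, b) in s_all
            if key == key2]

skew_join = [t for part in partitions for t in reducer(part, S)]

# Same output as the plain one-reducer hash join, computed three ways in parallel.
plain_join = reducer(R, S)
assert sorted(skew_join) == sorted(plain_join)
```

Each R-tuple lands on exactly one reducer, so no join result is duplicated; replicating S is what makes the split safe.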
It doesn't actually need to assign each record to a join key's reducer and shuffle it across the network. You know that for every order on this machine, all of its line items are also here on this machine. If you know that to be true, then you can do the join in the map phase. Okay, so the question is: when is that true? Well, think back to the co-group operator. It's possible that you've already partitioned these two tables on the appropriate join key because of a previous command in Pig, and so you can be aware of that and use a merge join. You still have to specify explicitly that you want a merge join, so Pig's not doing it automatically, but you can take advantage of this situation when it arises. Since each mapper already has local access to the records from both relations, and they're already grouped and sorted by the join key, it can just read both relations from disk in order and compute the join. [MUSIC]
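The per-mapper work just described is the classic sorted-merge join. In Pig you request it with `JOIN ... USING 'merge'`; here is a minimal sketch of what each mapper would do locally, using hypothetical `orders` and `lineitems` tuples keyed by order id in their first field (both inputs assumed already sorted on that key, as the co-partitioning guarantees):

```python
def merge_join(orders, lineitems):
    """Join two lists of tuples on their first field.
    Both inputs must already be sorted by that field."""
    out, i, j = [], 0, 0
    while i < len(orders) and j < len(lineitems):
        ok, lk = orders[i][0], lineitems[j][0]
        if ok < lk:
            i += 1          # this order has no (more) line items
        elif ok > lk:
            j += 1          # this line item's order is not in our slice
        else:
            # Emit every line item sharing this order key, then advance.
            jj = j
            while jj < len(lineitems) and lineitems[jj][0] == ok:
                out.append(orders[i] + lineitems[jj][1:])
                jj += 1
            i += 1
    return out

orders = [(1, 'alice'), (2, 'bob'), (4, 'carol')]
lineitems = [(1, 'pen'), (1, 'ink'), (2, 'mug')]
print(merge_join(orders, lineitems))
# [(1, 'alice', 'pen'), (1, 'alice', 'ink'), (2, 'bob', 'mug')]
```

A single sequential pass over each sorted file, no hashing and no shuffle; that's exactly why this case is so much faster when the co-partitioning precondition holds.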