Welcome to the fifth video in the Introduction to MapReduce. In this lesson we want to continue looking at the vector multiplication example and consider computational trade-offs related to MapReduce. We can ask ourselves: how many keys come out of the mapper? We can also ask how many groups have to be shuffled around by Hadoop? These reflect the computational costs of using Hadoop.

For the first question, recall that each vector has N numbers, so the total number of mapper outputs is 2N. Hadoop has to put the 2N outputs into groups by index and then shuffle them around to the reducers. The actual cost of this operation is going to depend on how Hadoop partitions the data, how many reducers there are, how many compute nodes there are, and the communication cost between nodes. So there is a lot that goes into the total cost, but in general, the more groups, the more shuffling there will be.

So the question arises whether we can reduce the shuffling cost. One way is to reduce the amount of mapper output by doing some combining inside the mapper. This makes the mapper do some of the reducer's work. For some tasks, like WordCount, that can work well, especially because pieces of text will very often have many words that occur more than once, so those can be summed up into one word-count output. But for vector multiplication, the mapper will not necessarily see matching index pairs.

An alternative is to renumber the indices into bins. Let's see how that works. For example, let's say we bin the indices into ranges of length ten. We start with the original N keys and put each range of ten consecutive indices into a bin, then we use the bin as the key. So instead of N keys, there are now N/10 keys. The new key is the bin number, and the original index can be put into the value field to be used later.
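To make the binning concrete, here is a minimal Python sketch of such a mapper. The names `BIN_SIZE` and `map_vector_entry`, and the idea of tagging each value with a vector ID like "A" or "B", are illustrative assumptions, not code from the course:

```python
# Hedged sketch of a binned mapper for vector multiplication.
# BIN_SIZE corresponds to the range length (ten in the lesson's example).
BIN_SIZE = 10

def map_vector_entry(vector_id, index, value):
    """Emit a (key, value) pair whose key is the bin number.

    The original index moves into the value field, so the reducer
    can still match up entries from the two vectors later.
    """
    bin_number = index // BIN_SIZE
    return (bin_number, (vector_id, index, value))
```

With a bin size of ten, an entry of vector A at index 13 and an entry of vector B at index 17 both get key 1, so Hadoop groups them together even though their original indices differ.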
Bin one from vector A will still group with bin one from vector B, and the reducer will still be able to finish the multiplication by using the original indices that are now part of the value field.

So now we can rewrite the shuffling cost as N/R, where R is the bin size. If R equals one, each index gets its own bin and the cost doesn't change. But if R is greater than one, meaning we group indices into ranges of values, then some indices will bin together and there will be less shuffling for Hadoop to do. So in general, we can say that if N/R is large, the shuffling cost is also relatively high. But the trade-off is that there is more work for the reducer, because the reducer then needs to partition the entries that share a bin back out by their original index values. Keep in mind that the specific trade-off depends on other factors, like how much data there is and how many compute nodes there are. But you can imagine testing different bin ranges if you wanted to try to optimize something like vector multiplication.

It's worth pointing out that in real data, vectors are typically not so huge that you need to use HDFS. But matrices more often get very large, and in those cases MapReduce can be very helpful. There are many analyses that involve matrix multiplication, as we will see in later courses when we talk about big data analytics, machine learning, and data mining. The same techniques and the same trade-offs that we had in vector multiplication apply to matrix multiplication, except that one has to deal with both row and column indices.

The next video is the last video in our lesson; it will summarize and put together the highlights from the examples we've been going through.
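Before we move on, the reducer-side re-partitioning described in this lesson can be sketched in Python as well. This is a hedged illustration under assumptions of my own (the helper name `reduce_bin` and the vector labels "A" and "B" are not from the course):

```python
from collections import defaultdict

def reduce_bin(bin_number, entries):
    """Finish the multiplication for one bin.

    `entries` is the shuffled list of (vector_id, index, value) tuples
    that share this bin. The reducer re-partitions them by original
    index, multiplies matching A/B pairs, and sums the products,
    yielding this bin's partial contribution to the dot product.
    """
    by_index = defaultdict(dict)
    for vector_id, index, value in entries:
        by_index[index][vector_id] = value

    partial_sum = 0.0
    for vals in by_index.values():
        if "A" in vals and "B" in vals:
            partial_sum += vals["A"] * vals["B"]
    return (bin_number, partial_sum)
```

This is where the extra reducer work shows up: the grouping by original index, which Hadoop did for free when every index was its own key, now happens inside the reducer.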