Welcome to the third video in the introduction to MapReduce. The previous video presented the framework and the user requirements for using MapReduce. This video will work through the word count example in some detail.

Imagine you have a lot of documents, and you need to count the number of occurrences of each word across all the documents. This may seem like an arbitrary task, but the basic idea is related to having, say, a lot of web pages that you want to make available for search queries. The word count task has become the paradigmatic example for MapReduce, so it's really worth understanding it in full detail.

For the sake of concreteness, let's imagine you have to count all the words in Star Wars. If you only had to count one episode, you could imagine doing this on one computer, and the basic process would be something like the following. First, you get a word, and you look up the word in some kind of table, where the word is the index into the table, that is, the key. If the word is not there, you insert it. Then you add 1 to whatever the count is in that table. Your results would look something like this: words and their counts, which you would then use for whatever you're going to do with them.

Now imagine you have to count all the words in all the Star Wars episodes, and pretty soon there are going to be a lot of episodes. There are already lots and lots of books out there, plus fan fiction and blogs. One website I saw had something like 35,000 fan fiction stories, so it's a lot of words. To process all this data, you would of course use something like MapReduce.

The strategy for designing map and reduce functions, and for designing key-value combinations, is really to keep things simple. For the word count task, use, for example, the word as the key and the number one as the value. In other words, each pair just marks one occurrence of the word. Hadoop is going to do the hard work for us: the shuffling and the grouping by key, so that the reducer only needs to aggregate.

So let's write some pseudocode. The mapper could look something like this: you get a word, you emit (word, 1) as the key-value pair, and you loop until you have processed all the words. For example, if the input were the first line from Star Wars, "a long time ago in a galaxy far far away", you could read that line, split it into words, and output each word as a key with the number one as the value. That output is going to be sent to the reducers. Notice that in this case words will appear multiple times, each time with just the number one.

The reducer is guaranteed to get all the pairs with the same key together, so as it reads key-value pairs it just needs to keep a running total. The pseudocode would look something like this: when it gets a word, it checks whether the word is the same as the last word read in. If it is the same, it adds one to the running total; when the word changes, it prints out that total. You loop over this, and there's some other housekeeping you would have to do, but that is essentially the reducer function. A sketch of both functions appears below.

Let's look at this in our flow diagram. Imagine we have some data, and this data is spread out among different disks; picture lots and lots of Star Wars stories. The map functions take that data, split it into key-value pairs, and produce output.
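As a concrete version of that pseudocode, here is a minimal sketch in Python, written in the style of Hadoop Streaming, where the mapper and reducer read lines from standard input and emit tab-separated key-value pairs on standard output. The script names and the tab-separated format are assumptions for illustration; the video itself only shows pseudocode.

```python
# mapper.py -- a minimal word-count mapper sketch (Hadoop Streaming style).
# Reads raw text lines from stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

And the matching reducer, which relies on the shuffle having brought all pairs with the same key together:

```python
# reducer.py -- a minimal word-count reducer sketch.
# Hadoop's shuffle guarantees that all pairs with the same key arrive
# together, so we only need a running total per word.
import sys

current_word = None
running_total = 0

for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        running_total += int(count)  # same key: extend the running total
    else:
        if current_word is not None:
            # key changed: print the finished total for the previous word
            print(f"{current_word}\t{running_total}")
        current_word = word
        running_total = int(count)

# housekeeping: emit the total for the last word
if current_word is not None:
    print(f"{current_word}\t{running_total}")
```

You can try the pair locally without Hadoop by letting `sort` play the role of the shuffle: `cat input.txt | python mapper.py | sort | python reducer.py`.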
Hadoop then takes that output and does the shuffling and grouping by key. So imagine we have this output from the map function: the word "a" might go to some particular reducer, and the word "long" might go to another reducer. As Hadoop goes through this list of key-value pairs, it sorts them so that all the "a" pairs are together, all the "far" pairs are together, and so forth and so on. Notice that this is not necessarily a global sort at this point. The reducer just aggregates that data; it receives all the key-value pairs for the same key together. So reduce() takes the output from the Hadoop shuffling process and does the aggregation, and your final output is each word with the total count of that word. A small end-to-end simulation of these phases is sketched below. In the next video, we're going to look at an example of how to actually run this in the Cloudera VM.
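To tie the flow diagram together, here is a small, self-contained Python simulation of the map, shuffle, and reduce phases, using the example line from the video. This is a toy sketch, not how Hadoop is implemented; the shuffle is modeled with a plain sort, which, as noted above, only has to bring equal keys together, not produce a globally meaningful order.

```python
# A toy single-machine simulation of the MapReduce word-count pipeline.

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Mimic Hadoop's shuffle/group-by-key: sorting is enough,
    since it brings all pairs with identical keys together."""
    return sorted(pairs)

def reduce_phase(pairs):
    """Keep a running total per word; emit the total when the key changes."""
    current_word, running_total = None, 0
    for word, count in pairs:
        if word == current_word:
            running_total += count
        else:
            if current_word is not None:
                yield (current_word, running_total)
            current_word, running_total = word, count
    if current_word is not None:
        yield (current_word, running_total)

lines = ["a long time ago in a galaxy far far away"]
for word, total in reduce_phase(shuffle_phase(map_phase(lines))):
    print(word, total)
# Prints a count of 2 for "a" and "far", and 1 for each remaining word.
```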