Hi everybody. This is Enrique again, and today I'm going to present the final workshop of this module. Here, you're going to implement the final refinements for your MapReduce framework. The objectives of this workshop are to implement a heartbeat between the master and the workers, handle worker failure, and replicate the master's data. By the end of this workshop, you should have an implementation that periodically pings the workers to obtain their status, has the master verify completion of each phase and, if required, restart a task on a different worker, and has the master replicate its local data structures.

So let's discuss how you're going to implement this workshop. What is the meaning of a heartbeat? This extract is from the MapReduce paper from 2004: the master pings every worker periodically. If no response is received from a worker, the master marks the worker as failed, and any map or reduce task that was in progress on that node is reset to idle so that it can be rescheduled. So basically, a heartbeat is a way for the master to know that a worker is still available and can be used for computations.

We're going to implement this heartbeat using RPC: a periodic function calls the client stub. Notice the direction of this operation. The master pings the worker, so in RPC terms the worker is the server and the master is the client. Beware that this terminology can feel backwards from the way we use "master" and "worker" in MapReduce, but it simply defines the direction in which the communication happens in gRPC. Make sure you're not confused when implementing this and that your calls go in the right direction; otherwise, the implementation will feel tougher than it actually is.

How are we going to handle worker failure, and also stragglers?
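The heartbeat loop just described can be sketched as follows. This is a minimal illustration, not the workshop's required API: `ping` stands in for the gRPC client-stub call, and names like `HEARTBEAT_INTERVAL`, `FAILURE_THRESHOLD`, and `WorkerState` are my own assumptions.

```python
# Hedged sketch of the master-side heartbeat. In the real implementation,
# `ping(worker_id)` would be a gRPC call to the worker's server stub.

HEARTBEAT_INTERVAL = 2.0     # assumed: seconds between ping rounds
FAILURE_THRESHOLD = 3        # assumed: missed pings before a worker is marked failed

class WorkerState:
    def __init__(self):
        self.missed = 0      # consecutive missed pings
        self.failed = False

def heartbeat_round(workers, ping):
    """Ping every worker once; mark those that stop responding as failed.

    `workers` maps worker_id -> WorkerState.
    `ping(worker_id)` returns True on success; False or an exception on failure.
    """
    for wid, state in workers.items():
        if state.failed:
            continue
        try:
            ok = ping(wid)
        except Exception:
            ok = False
        if ok:
            state.missed = 0
        else:
            state.missed += 1
            if state.missed >= FAILURE_THRESHOLD:
                # Per the paper: tasks in progress on this node go back to idle.
                state.failed = True

# In the real master this runs periodically, e.g.:
#   while True:
#       heartbeat_round(workers, grpc_ping)
#       time.sleep(HEARTBEAT_INTERVAL)
```

Note that marking a worker failed is what triggers rescheduling; the heartbeat itself only detects liveness.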
Failure handling is needed not only when a node goes down, but also when a node is taking too long to execute the corresponding phase. Each task has a timestamp associated with it in the data structure that the master keeps in memory. A periodic function then checks for any expired execution and changes the status of the task back to idle if the execution has not finished by the time that timer expires. The timing of this operation does not need to be tight, because we are just reacting to a failure. Periodically checking the tasks being processed is acceptable as long as the checking period is shorter than the execution time. This means that during the execution of a given phase, we will check at least several times whether each worker is still alive. As I mentioned before, we are going to use gRPC for the implementation.

Next, we are going to replicate the master's data structures. There are two main ways of doing this. The easier one is to keep all the required information in ZooKeeper. Alternatively, we could use RPC, so that the leader master talks to all the other masters (the shadow masters) each time a data structure is modified, and does not reply to the worker nodes or to the clients until the change has been replicated in all the replicas of the master. In other words, each time a call modifies any data, it must wait until the data has been persisted before it can complete. In practice, it's better to use a combination of both approaches.

Let's discuss a little more why. ZooKeeper may look easier, but it can only hold a limited amount of information. Data on the order of the number of map tasks (M) or the number of reduce tasks (R) may be too much for ZooKeeper.
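The timestamp-and-expiry check described above can be sketched like this. Names such as `TASK_TIMEOUT`, `Task`, and `reap_expired` are illustrative assumptions, not a required interface; the `now` parameter exists only to make the sketch deterministic.

```python
import time

# Hedged sketch: each task record carries the timestamp at which it was
# handed to a worker. A periodic check resets any in-progress task whose
# deadline has passed back to idle, so the scheduler can reassign it.

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"
TASK_TIMEOUT = 30.0  # assumed deadline; must exceed a normal task's duration

class Task:
    def __init__(self):
        self.status = IDLE
        self.worker = None
        self.started_at = None

def assign(task, worker_id, now=None):
    """Hand the task to a worker and record when it started."""
    task.status = IN_PROGRESS
    task.worker = worker_id
    task.started_at = time.time() if now is None else now

def reap_expired(tasks, now=None):
    """Reset expired in-progress tasks to idle; return how many were reset."""
    now = time.time() if now is None else now
    reset = 0
    for t in tasks:
        if t.status == IN_PROGRESS and now - t.started_at > TASK_TIMEOUT:
            t.status = IDLE
            t.worker = None
            t.started_at = None
            reset += 1
    return reset
```

As the text notes, the reaper's period just has to be comfortably shorter than `TASK_TIMEOUT`; exact timing is not critical because we are only reacting to failures.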
The reason is that, because ZooKeeper provides strict consistency, it can only store a certain amount of information before it starts to slow down, at least relative to the speed at which we want the master to operate. But if the information we store in ZooKeeper is on the order of the number of workers, that is, the number of machines doing the computation, it is still acceptable. Think about which queries are on the critical path. For the purposes of this project, doing an RPC to the followers is going to be faster than doing a request to ZooKeeper. Discuss your decisions with your fellow students, and consider when these assumptions may break: the assumptions about the size of M, the size of R, and RPC being faster than ZooKeeper are not necessarily true for all scenarios. So please discuss with other students which approach is better for which scenario and which combinations are most interesting.

At the end of this workshop, you'll have an implementation of MapReduce that complies with the Google MapReduce paper. You should have the complete source code and an explanation of your replication decisions; I hope you gained some ideas while discussing them with your fellow students. Your code should be able to ping the workers from the master, start a MapReduce execution, survive having a worker killed in the middle of a computation, and still complete the execution successfully. Finally, you will have learned how to replicate the master's data structures. This is the last workshop of this module. I hope you have enjoyed all these projects, and that you now understand how big data and MapReduce can be used to process millions of key-value pairs stored in the cloud.
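As a closing illustration of the replication rule discussed earlier (reply only after all shadow masters have acknowledged the change), here is a minimal sketch. `send_to_follower` stands in for an RPC to a shadow master; all names here are assumptions for illustration, not a prescribed design.

```python
# Hedged sketch: apply a mutation to the leader's in-memory state, then
# block until every shadow master acknowledges it. Only then may the RPC
# handler reply to the worker or client, so no acknowledged update can be
# lost by a leader failure.

def replicated_update(state, key, value, followers, send_to_follower):
    """Apply a mutation locally and wait for every follower to ack it.

    `send_to_follower(follower_id, key, value)` returns True on ack.
    Returns True only once all shadow masters have stored the update.
    """
    state[key] = value
    for f in followers:
        if not send_to_follower(f, key, value):
            return False  # replication incomplete; caller must not reply yet
    return True
```

A real implementation would retry or demote unreachable followers rather than simply returning False, and small, worker-scale metadata could live in ZooKeeper instead, per the trade-offs discussed above.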