Now that we've learned about computation graphs, let's see how they actually get mapped onto real multi-core computers. Let me start with an example of a computation graph. Suppose it has a first step S1, followed by three parallel branches: S2 followed by S3, S4 followed by S5, and S6 on its own. A last step, S7, joins all three branches. Now, the great thing about a computation graph is that when we think about it as a graph, we don't have to know where the edges came from, whether they were continue, fork, or join edges, even though those are the ways in which the edges get created. As we discussed earlier, we can reason about performance directly from the graph. So let's give these steps some execution times: say they are all one unit of work each, except S6, which takes 10 units.

We're very interested in the execution time on P processors. If you had a multi-core machine with P processors, say 2, 4, 8, or 16, what execution time can we get for a given computation graph? Let's call T_P the execution time on P processors. To reason about it, we have to think about how the steps in the computation graph actually get scheduled on the processors. This is not something you have to worry about as a programmer: when you write a parallel program it has an underlying computation graph, and it's the job of the runtime software and hardware to schedule the steps in that graph across the processors.

So let's take two processors, P0 and P1, and see how the steps can be scheduled. S1 has to be executed first no matter what, because the whole point of the graph is that it indicates the ordering relationships, and nothing else can execute until S1 completes. Let's say it runs on P0. Then S2, S4, and S6 could all run. Say P0 does S2 and P1 gets S4, so S2 and S4 run in parallel with each other, which is good. Next, P0 does S3 and P1 does S5. After this we cannot execute S7 yet, because S6 still remains to be done. S6 can run on either processor, say P0, and after that S7 can run. That's our schedule, and we can now reason about its execution time. For this schedule, T_2 on two processors would be 1 + 1 + 1 for S1, S2, and S3 (that's 3), plus 10 for S6, plus 1 for S7, which is 14. So T_2 = 14. We can also see some idle slots for P1, where it sits with no work to do, but the whole computation graph took 14 units of time on two processors.

But this was not the only way to schedule the graph. We could take another approach and build a second schedule, where we are smart and realize that S6 is the bottleneck causing P1 to be idle. Our goal in the second schedule is to run S6 as soon as possible. We still have to do S1 first, but after that we can prioritize S6. While P0 is working on S6, which goes on for 10 units of time, P1 is idle at the very start but can then do the other steps: S2, S3, S4, S5. Their order doesn't matter, because they add up to four units of work no matter which order you do them in, and S6 is going to take longer at 10. At the end we do S7 on either core. So what's the execution time for this schedule? On the critical path we have 1 for S1, 10 for S6, and 1 for S7, which is 12. We got a smaller execution time.

So this definition of the execution time on P processors actually depends on the schedule. But there are certain properties of this execution time that we can determine. First: what happens if P equals 1?
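To make this concrete, here is a small sketch in Java (my own illustration, not code from the lecture; the class name GreedySchedule, the array encoding of the graph, and the makespan helper are all hypothetical). It simulates a greedy scheduler that, whenever a processor is idle, starts the highest-priority ready step, and it reproduces both schedules above: picking steps in naive order gives T_2 = 14, while picking the longest step first gives T_2 = 12.

```java
import java.util.*;

// Hypothetical illustration: a greedy scheduler for the example graph,
// simulated on P identical processors. Step i takes cost[i] time units;
// succ[i] lists the steps that can only start after i finishes.
public class GreedySchedule {
    // S1..S7; index 0 is unused so indices match step names.
    static final int[] cost = {0, 1, 1, 1, 1, 1, 10, 1};
    static final int[][] succ = {{}, {2, 4, 6}, {3}, {7}, {5}, {7}, {7}, {}};

    // Makespan of the greedy schedule that always starts the ready step
    // ranked first by 'priority' whenever a processor is idle.
    static int makespan(int P, Comparator<Integer> priority) {
        int n = cost.length - 1;
        int[] indeg = new int[n + 1];
        for (int u = 1; u <= n; u++)
            for (int v : succ[u]) indeg[v]++;
        PriorityQueue<Integer> ready = new PriorityQueue<>(priority);
        for (int u = 1; u <= n; u++) if (indeg[u] == 0) ready.add(u);
        // Running steps as {finishTime, step}, ordered by finish time.
        PriorityQueue<int[]> running =
            new PriorityQueue<int[]>(Comparator.comparingInt(a -> a[0]));
        int time = 0, idle = P;
        while (!ready.isEmpty() || !running.isEmpty()) {
            // Start as many ready steps as there are idle processors.
            while (idle > 0 && !ready.isEmpty()) {
                int u = ready.poll();
                running.add(new int[]{time + cost[u], u});
                idle--;
            }
            // Advance to the next completion(s) and release successors.
            time = running.peek()[0];
            while (!running.isEmpty() && running.peek()[0] == time) {
                int u = running.poll()[1];
                idle++;
                for (int v : succ[u]) if (--indeg[v] == 0) ready.add(v);
            }
        }
        return time;
    }

    public static void main(String[] args) {
        // Naive order leaves S6 for last, reproducing the first schedule.
        System.out.println(makespan(2, Comparator.naturalOrder()));   // 14
        // Longest-step-first starts S6 early, as in the second schedule.
        System.out.println(makespan(2,
            Comparator.comparingInt((Integer u) -> -cost[u])));       // 12
    }
}
```

The only difference between the two runs is the priority rule, which is exactly why T_P is a property of the schedule, not just of the graph.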
If you only have one processor, we can assert that T_1 equals WORK. Why do I say that? Because if you're running on one processor, that processor has to do all the work in the computation graph, which is what you get when you add up the execution times. In this case that's 10 for S6 plus 1 each for the other six steps, so T_1 = 16.

The other thing we can think about, even though we don't have it in practice, is what happens with an infinite number of processors. The interesting claim here is that the time on an infinite number of processors, T_infinity, is the SPAN that we learned about earlier: the length of the longest path in the computation graph. If we have enough processors, there's no reason for a step to wait other than being on a path behind its predecessors, so the longest path determines the finish time. In this case the span is 1 + 10 + 1 = 12, along the path S1, S6, S7.

Given any P processors, we know we must be in the range T_infinity <= T_P <= T_1. That makes sense because P can vary from one to infinity, so the time can go from T_1 down to T_infinity.

Another very interesting concept when talking about parallel programs is speedup. The whole motivation for parallelism is to get your program to run faster with all those cores that the hardware vendors are giving us. The speedup with P processors is the ratio T_1 / T_P. Let's think about that: T_1 is the sequential execution time, T_P is the execution time we get on P processors, and the ratio is the factor by which the parallel version ran faster. We can now see that the speedup must be less than or equal to P. The other bound is that the speedup must also be less than or equal to the WORK/SPAN ratio we discussed earlier, the ideal parallelism. So the speedup is of course bounded by how many processors are available, but it is also bounded by this really important property of the computation graph, the ideal parallelism. Our goal in parallel algorithms is to generate computation graphs whose ideal parallelism is much larger than the number of processors you have, so that you have the flexibility of running that parallel program efficiently on a large number of processors.

In summary, we've taken the computation graph concept and talked about how to execute it on a fixed number of processors. We defined T_P, the execution time on P processors, and reasoned about the bounds on that execution time. This is a very good way for us to evaluate parallel algorithms: it lets us understand what the inherent parallelism in an algorithm is, and what's the best we may hope for in terms of speedup, which is bounded by either the ideal parallelism or the number of processors that you have.
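Here is a matching sketch in Java (again my own illustration, with a hypothetical class name and the same hypothetical array encoding of the graph as before) that computes WORK and SPAN for the example graph and then prints the combined upper bound on speedup, min(P, WORK/SPAN), for a few processor counts.

```java
// Hypothetical illustration: compute WORK and SPAN for the example graph,
// then the two upper bounds on speedup.
public class WorkSpan {
    static final int[] cost = {0, 1, 1, 1, 1, 1, 10, 1}; // index 0 unused
    static final int[][] succ = {{}, {2, 4, 6}, {3}, {7}, {5}, {7}, {7}, {}};

    public static void main(String[] args) {
        int n = cost.length - 1;

        // WORK = sum of all step costs: this is T_1.
        int work = 0;
        for (int u = 1; u <= n; u++) work += cost[u];

        // SPAN = length of the longest path: this is T_infinity. The steps
        // happen to be numbered in topological order, so one pass suffices;
        // finish[u] is the earliest time u can complete with unlimited
        // processors, i.e. the longest path ending at u.
        int[] finish = new int[n + 1];
        int span = 0;
        for (int u = 1; u <= n; u++) {
            finish[u] += cost[u];
            span = Math.max(span, finish[u]);
            for (int v : succ[u]) finish[v] = Math.max(finish[v], finish[u]);
        }

        double idealParallelism = (double) work / span;
        System.out.println("WORK (T_1)       = " + work);             // 16
        System.out.println("SPAN (T_inf)     = " + span);             // 12
        System.out.println("WORK/SPAN        = " + idealParallelism); // ~1.33

        // Speedup(P) = T_1 / T_P <= min(P, WORK/SPAN).
        for (int P : new int[]{1, 2, 4, 8}) {
            System.out.println("P = " + P + ": speedup <= "
                + Math.min(P, idealParallelism));
        }
    }
}
```

For this graph the ideal parallelism is 16/12, or about 1.33, so no matter how many processors we add, the speedup can never exceed 1.33: the heavy S6 step dominates the longest path.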