We will next examine the question of performance of parallel programs. Performance is particularly important for parallel programs because it is one of the main motivations for introducing parallelism in the first place. How can we estimate performance? We can do empirical measurements by running our parallel program on a particular architecture, or we can do asymptotic analysis. Asymptotic analysis allows us to understand how the performance of our algorithm changes when the inputs get larger or when we have more parallel hardware available to solve the problem. We will mostly be considering the worst case, that is, upper bounds on the performance of our programs. This kind of analysis is something you have already seen for sequential programs: you count the number of primitive steps that a program takes as a function of its argument. For example, if you are inserting an integer into a sorted list of n integers, you can characterize the performance of such a program using O notation as O(n); it takes time linear in the length of the list to insert a number into it. On the other hand, suppose we have a balanced binary tree, also storing n integers. Because we only need to traverse one path in this balanced tree, the running time is O(log n).

Let's see how we can do this analysis in our examples. We will first consider the simple building block that we defined previously, namely sumSegment. Recall that this function takes the array, a parameter p, and, crucially, the bounds s and t that indicate which part of the array to process, and its body is a while loop. Can you find a bound on the sequential execution time of this function? The answer is that the running time is linear in the difference between t and s, because we process the array elements from s to t and spend a constant amount of time on each element, which gives us a linear function of the form c1 (t - s) + c2. The constants c1 and c2 may be important in practice but are irrelevant for asymptotic analysis, which only looks at the rate of growth of the function describing the performance.

Now, let's look at the segmentRec function. Recall that this function computes essentially the same result as sumSegment, but it does so by dividing longer array segments into smaller ones. Here I am still looking at the sequential version of this function. When the difference between the bounds is below some threshold, we invoke the sumSegment function that we have just analyzed, whose running time we concluded is linear in t - s. Otherwise, we compute the middle index m of the segment and invoke ourselves recursively on the intervals from s to m and from m to t. So now we have a recursive function, and the question is what its running time is as a function of the parameters s and t. Observe that this function can be described using a computation tree. For parameters s and t, the computation reduces to two recursive calls, one taking parameters s and m1, where m1 is the middle, and the other taking parameters m1 and t. This decomposition continues until the length of the segment drops below the threshold. So the computation of our recursive function can be described using such a tree.
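For reference, here is a minimal Scala sketch of the two functions under discussion; the exact signatures, the use of p as an exponent (as in the p-norm example), and the threshold value are assumptions made for illustration, since the cost analysis only depends on the overall structure.

```scala
// Sketch of the sequential building block: constant work per array element,
// so the running time is c1 * (t - s) + c2 for some constants c1, c2.
def sumSegment(a: Array[Int], p: Double, s: Int, t: Int): Double = {
  var i = s
  var sum = 0.0
  while (i < t) {
    sum += math.pow(math.abs(a(i)), p) // constant amount of work per element
    i += 1
  }
  sum
}

val threshold = 100 // assumed cutoff below which we stop splitting

// Sequential recursive version: split the segment in half until it is small.
def segmentRec(a: Array[Int], p: Double, s: Int, t: Int): Double = {
  if (t - s < threshold)
    sumSegment(a, p, s, t)          // base case: linear in t - s
  else {
    val m = s + (t - s) / 2         // middle of the segment
    segmentRec(a, p, s, m) + segmentRec(a, p, m, t) // two sequential recursive calls
  }
}
```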
This description allows us to write down and analyze the running time of the function. We can describe the running time using the following recurrence equation. When t - s is below the threshold, the work, denoted W, that this function performs is the linear expression c1 (t - s) + c2, because it simply reduces to sumSegment, whose cost we have already determined. Otherwise, it equals the sum of the costs of the two recursive calls, because we need to compute both of them and we are doing so sequentially, plus some constant overhead c3 corresponding to the other operations we perform, including the cost of making the recursive calls. The midpoint m is computed as s plus half of t - s, or equivalently (s + t) / 2.

Now, let's try to solve this equation, or at least estimate an upper bound for the values of the function it defines. To simplify the analysis, let us look at the special case where the difference between t and s, that is, the width of the array segment we are processing, is of the form 2^N (threshold - 1). This is just a conveniently round number for our analysis, and we will see that it also gives us the result for the general case. What is this N? N is the depth of the tree shown on the previous slide. The computation is then a tree of depth N with 2^N leaves and 2^N - 1 internal nodes; this is just a property of such a balanced tree. If we sum up the costs in this tree, we find the following. Because we have 2^N leaves and each leaf costs c1 (threshold - 1) + c2, we accumulate 2^N times that amount; this corresponds to all the base cases taken together. In addition, we perform some work to decompose the problem. Every decomposition costs a constant amount c3, and the number of decompositions is the number of internal nodes in the tree, that is, 2^N - 1. If we look at what is constant in this big expression, we conclude that N is the variable: it is what depends on t - s. So we can describe the work performed by our function as a linear function of 2^N. We have thus concluded that when t - s is of this form, the work is in fact linear in t - s, because 2^N and t - s differ by just a constant factor.

Now, let's see why essentially the same asymptotic bound also applies for other values of t - s. For that purpose, consider (t - s) / (threshold - 1), and pick N to be the smallest number for which this value is less than or equal to 2^N. This means that 2^(N - 1) is actually smaller than it. We pick N in this way so that, even if t - s is not exactly of the form 2^N (threshold - 1), we find the closest N. We then observe that our function performs more work when given a larger segment; that is easy to see by inspecting its definition. This means that the work performed for arbitrary parameters s and t is bounded by the work for a segment of width 2^N (threshold - 1), so the bound W(s, t) = O(2^N) applies.
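Written out, the recurrence for the work just described, and its solution in the special case t - s = 2^N (threshold - 1), are:

$$
W(s, t) =
\begin{cases}
c_1 (t - s) + c_2 & \text{if } t - s < \text{threshold}, \\
W(s, m) + W(m, t) + c_3 & \text{otherwise, where } m = s + \lfloor (t - s)/2 \rfloor,
\end{cases}
$$

$$
W(s, t) = 2^N \bigl(c_1 (\text{threshold} - 1) + c_2\bigr) + \bigl(2^N - 1\bigr) c_3 = O\!\left(2^N\right) = O(t - s).
$$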
All that remains now is to relate this N back to t - s. For that we use the other inequality: since 2^(N - 1) < (t - s) / (threshold - 1), we have 2^N < 2 (t - s) / (threshold - 1). This inequality gives us a bound on the work of the function for arbitrary s and t of the form c5 (t - s) plus a constant; because threshold is a constant, c5 is also a constant. We therefore conclude that the work is indeed linear in t - s. So the running time, the work performed by our sequential function segmentRec, is linear in t - s.

Let's now look at what happens when we add parallelism. Recall that the only difference in the definition of the function is that the two recursive calls are performed in parallel. But this has a profound impact on the worst-case running time. Let us first examine the case where we have an unbounded amount of parallelism, and assume that when we invoke the two recursive calls, they can proceed in parallel without any interference. The running time is then the maximum of the running times of the two recursive calls. In terms of our computation tree, we still have the same tree, because we perform the same recursive calls, but the two branches at each node are now executed in parallel.

What, then, is the recurrence for the time taken under such unbounded parallelism? We will denote this value by D(s, t); D stands for depth, as you will see. In the base case nothing has changed: we still have the linear function of t - s, which applies when t - s is below the threshold. In the recursive case, however, the two recursive calls run in parallel, so the running time is the maximum of their costs, plus again some constant overhead. So instead of a sum, we now have an equation with a max.

What is the solution of this equation? Let's assume again that t - s has the convenient form 2^N (threshold - 1), where N is, as before, the depth of the tree, which has 2^N leaves and 2^N - 1 internal nodes. The value of the function in the base case is just c1 (threshold - 1) + c2, because when we reach the base case the width of the interval is threshold - 1. One level above the leaves, we take the maximum of two identical values, which is just that value itself, and add c3; that is why all the nodes one level above the leaves have this cost. As we move up the tree towards the root, all that happens is that we keep adding the cost of one decomposition per level. When we reach the root, we have something like c1 (threshold - 1) + c2 + N c3, where N is the depth of the tree. So this time the cost of our function, D(s, t), is bounded by O(N), as opposed to O(2^N).

As before, the running time, even under the assumption of unbounded parallelism, is monotonic in t - s. So for an arbitrary value of t - s, we pick N such that 2^(N - 1) and 2^N form the tightest interval containing (t - s) / (threshold - 1). From the corresponding inequality we see that N is less than 1 plus the logarithm of (t - s) / (threshold - 1). The term log(threshold - 1) is a constant, so collecting it together with the 1 into a single constant c6, we obtain N ≤ log(t - s) + c6.
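For reference, the depth recurrence just described, and its solution in the same special case t - s = 2^N (threshold - 1), are:

$$
D(s, t) =
\begin{cases}
c_1 (t - s) + c_2 & \text{if } t - s < \text{threshold}, \\
\max\bigl(D(s, m), D(m, t)\bigr) + c_3 & \text{otherwise, where } m = s + \lfloor (t - s)/2 \rfloor,
\end{cases}
$$

$$
D(s, t) = c_1 (\text{threshold} - 1) + c_2 + N c_3 = O(N) = O(\log (t - s)).
$$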
Because the bound O(N) applies and N is bounded by log(t - s) plus a constant, it turns out that D(s, t) is in fact bounded by O(log(t - s)). So we have an exponentially smaller value for this function D. This is therefore an example of a function that, with the use of parallelism, we can speed up exponentially, assuming an unbounded amount of parallel resources.

Now that we have seen this example, let's look at asymptotic complexity analysis of parallel programs in general. As you have observed, the complexity depends not only on the size of the inputs but also on the amount of parallel hardware resources. For this purpose we will use two measures to quantify the behavior of a program. The first one is the work W(e) of an expression or computation e: the number of steps that need to be taken by all the processing units involved in the computation. This corresponds to the sequential execution time; it cannot be better than the sequential execution time. It also means that, up to a constant, we treat the invocation parallel(e1, e2) simply as the pair (e1, e2). So W(e) is essentially the sequential execution time of e. What is specific to parallel programs is the notion of depth, D(e), also sometimes called span: the number of steps we would take if we had an unbounded amount of parallelism. In this case, for parallel(e1, e2), we take the maximum of the running times of the two arguments.

Given this, what are the rules for computing work and depth for the parallel construct? For work, we need to evaluate both arguments e1 and e2, so the total work is the sum of the work for e1 and the work for e2. In terms of the tree, if a node denotes an invocation of parallel, then both arguments need to be counted: we add them up. So if we have a tree where each internal node corresponds to the parallel construct, then all nodes and leaves of the tree are summed up, and at each step there is some constant overhead as well. The key difference is for depth. The depth of the construct parallel(e1, e2) is the maximum of the depth of e1 and the depth of e2. That is, we compute the cost of one branch and the cost of the other branch, and then simply take the larger one. The assumption is that the two computations execute in parallel and that we are looking at the total elapsed time: whichever computation finishes sooner waits for the one that takes longer, and that is why we take the maximum. Of course, there is some overhead here as well, which we will assume is constant. So it is in the parallel construct that work and depth differ; in all other respects they behave the same. But in the case of parallel, if we manage to take the work given by the pair e1, e2 and split it into two equal parts, then we actually finish twice as fast, because the maximum is then half of the sum. For all the other parts of the code, where we do not use parallel explicitly, we assume that the computation proceeds sequentially.
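As an illustration of the parallel construct and of these rules, here is how the parallel version of segmentRec might look in Scala, reusing sumSegment and threshold from the earlier sketch; the name segmentRecPar and the trivial stand-in for parallel are just for illustration.

```scala
// Stand-in for the parallel construct used in the lectures, so the sketch
// compiles on its own; a real implementation would run the two computations
// on separate threads or tasks.
def parallel[A, B](taskA: => A, taskB: => B): (A, B) = (taskA, taskB)

// Identical to the sequential segmentRec except that the two recursive calls
// are wrapped in parallel(...).
def segmentRecPar(a: Array[Int], p: Double, s: Int, t: Int): Double = {
  if (t - s < threshold)
    sumSegment(a, p, s, t)    // work and depth are both linear in t - s here
  else {
    val m = s + (t - s) / 2
    val (sum1, sum2) = parallel(segmentRecPar(a, p, s, m),
                                segmentRecPar(a, p, m, t))
    // work:  W(s, m) + W(m, t) + constant  ->  O(t - s)
    // depth: max(D(s, m), D(m, t)) + constant  ->  O(log (t - s))
    sum1 + sum2
  }
}
```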
In particular, if we have some operation or function call f with arguments e1 through en, what do we need to do? We need to evaluate each of the arguments and then perform the computation given by f. For work, that means we add up the work for e1, and so on, up to the work for en, and then, with the values v1 through vn obtained by evaluating e1 through en, we add the cost of invoking f on them. That cost depends on what f is: if f is a primitive operation on integers, it is constant; if it is a user-defined function, then we need to estimate the work of that function. For depth, the computation is exactly the same, because we do not assume that arguments are somehow evaluated in parallel implicitly; if we want parallelism, we have to use the parallel construct or something equivalent. So the computation of work and depth is conceptually straightforward and proceeds recursively on the structure of the expressions we are evaluating. Of course, when we have recursive definitions, we get recurrence equations, as we did above. Let us just remark that we assume the constant c1 in the depth rule is less than or equal to the constant c2 in the work rule, which makes the depth always less than or equal to the work. This is also to be expected; in most cases, work will in fact be strictly greater than depth.

Now that we have seen how to compute work and depth, let's see how we might use this information to obtain a reasonable estimate of the running time of a program on a concrete platform that has P parallel threads. Observe that, no matter how large P is, we cannot finish sooner than D(e): we defined the depth of e to be the running time with an infinite amount of parallelism, so that is the best case. At the same time, regardless of depth, we cannot finish sooner than the total amount of work divided by P, because, realistically, every piece of work needs to be done by some piece of parallel hardware. Combining these two observations, we conclude that it is reasonable to use the following estimate for the running time of a parallel program on P parallel processing units: the depth of the expression plus the work divided by P.

Now we can examine what happens if the amount of computation grows or if the amount of parallelism grows. Suppose, for the moment, that the amount of parallelism is constant, so we have a fixed machine or a fixed cluster, but the input grows: we are considering, for example, larger and larger amounts of data to process. Because P is constant, the parallel program has the same asymptotic complexity as the sequential one: D(e) is asymptotically no larger than W(e), and since P is constant, D(e) + W(e)/P still has the same asymptotic complexity as W(e). Let's now look at another scenario, one where we put more resources into the problem as we process larger inputs, or where we keep the same amount of data but increase parallelism. As parallelism increases, the work term vanishes and what is left is the depth. It is important to realize that the depth is typically still non-trivial, because it reflects the cost of decomposing the computation and is bounded below by the inherent dependencies among the data. That is why, even if P goes to infinity, the running time does not go to 0; it reduces to the depth.
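Collected in one place, the rules described above and the running time estimate take roughly the following form (here c1 and c2 are the constants of the depth and work rules for parallel, the ones in the assumption c1 ≤ c2, not the constants from the sumSegment analysis):

$$
\begin{aligned}
W(\mathrm{parallel}(e_1, e_2)) &= W(e_1) + W(e_2) + c_2, \\
D(\mathrm{parallel}(e_1, e_2)) &= \max\bigl(D(e_1), D(e_2)\bigr) + c_1, \\
W(f(e_1, \ldots, e_n)) &= W(e_1) + \cdots + W(e_n) + W(f)(v_1, \ldots, v_n), \\
D(f(e_1, \ldots, e_n)) &= D(e_1) + \cdots + D(e_n) + D(f)(v_1, \ldots, v_n),
\end{aligned}
$$

with the estimated running time on $P$ processing units given by

$$
D(e) + \frac{W(e)}{P}.
$$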
Let's see the consequences of this analysis for our segmentRec example. There, we observed that the work is O(t - s), so it is linear in the amount of input to process, and the depth is O(log(t - s)), logarithmic in the same value. If we expand what this O notation means, the estimate corresponds to functions of the form b1 log(t - s) + b2 + (b3 (t - s) + b4) / P for some constants b1 through b4. If P is bounded but we vary the amount of input data, we obtain a function that is linear in t - s. On the other hand, if we increase P fast enough as a function of t - s, we get logarithmic behavior in t - s. So if we keep throwing more computational resources at the problem, segmentRec behaves logarithmically in the amount of input.

Let us now finish by establishing a correspondence with Amdahl's Law, which is well known in performance evaluation and also arises for parallel computation. Suppose we have two parts of a computation that must be done sequentially: part1 needs to be done first and then part2, because part2 depends on the result of part1. Suppose that part1 takes a fraction f of the original computation time, say 40%, and part2 takes the remaining 1 - f. Now suppose we speed up part2, say by a factor of P. What happens to the speedup of the entire computation? If the original computation took one unit of time, the new computation takes f, the part we could not speed up, plus (1 - f) / P, the part we did speed up, so the overall speedup is 1 / (f + (1 - f) / P). Now we observe that this function has a plateau: as we increase parallelism more and more, we get diminishing returns. Even for P = 100 and our value of f = 40%, we obtain only about 2.46 as the speedup. In fact, even as P goes to infinity, the speedup only approaches 1 / f, which in this case is 2.5. This tells us that at some point we face diminishing returns, because there will always be parts, such as part1, that we cannot speed up. That is one reason why we introduced the notion of depth: to account for them.
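A few lines of Scala make the diminishing returns concrete; amdahlSpeedup is just an illustrative name for the formula above.

```scala
// Amdahl's Law: overall speedup when a fraction f of the computation cannot
// be sped up and the remaining fraction (1 - f) is sped up by a factor of p.
def amdahlSpeedup(f: Double, p: Double): Double =
  1.0 / (f + (1.0 - f) / p)

println(amdahlSpeedup(0.4, 100.0))                    // ~2.46
println(amdahlSpeedup(0.4, Double.PositiveInfinity))  // 2.5, i.e. 1 / f
```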