What would be the simplest way to indicate that, if we have two expressions e1 and e2, we wish to compute them in parallel? Let's consider a construct that simply takes two expressions, and whose meaning is that these two expressions are evaluated at the same time.

To understand how to use this construct, let's look at an example: computing a p-norm. The p-norm is a generalization of the notion of the length of a vector in geometry. When p equals 2, the 2-norm is simply the length of a vector given by its coordinates: we compute it by summing up the squares of the coordinates and then raising the result to the power 1/2. For arbitrary p, we take the sum of the absolute values of the coordinates of the vector, each raised to the power p, and then we raise the resulting sum to the power 1/p: (|x_1|^p + ... + |x_N|^p)^(1/p).

So how would we compute the p-norm of a vector? Let's look at a subproblem that computes the sum of the p-th powers of a segment of an array. This array represents our vector. Suppose that p is given as a double floating-point number, and that the segment is given using two indices: s, which indicates the first index of the array where we should start the sum, and t, which indicates the index where we should stop. So it is t-1 that indicates the last element of the array that we should consider.

What is the Scala function that computes this expression? Here is a simple version of it. We use a while loop to traverse all elements of the array, raise each element to the power p, and accumulate the result in the variable sum. The sum is an integer, so we will be rounding the powers to integer values. The implementation of power is not particularly important; it just uses a couple of math library functions.

Now that we have the sum of the p-th powers of our array elements, how would we compute the p-norm? We can simply invoke the sumSegment function we defined and give it the entire segment, from 0 to the length of the array. This gives us the sum, and we then raise the result to the power 1/p. This is the simple sequential version of the function.

Now how would we go about arriving at a parallel version? First, observe that we can write this big sum from 0 to N-1 in two parts: pick some element m in the middle of the array, then first sum up the powers of the elements up to m, and then the elements starting from m. This is of course equivalent to our original expression.

How would we implement this version of the code? We can reuse the sumSegment function and invoke it twice to obtain two intermediate sums, which we add up and then raise to the power 1/p. So we have grouped the sum into two components. The function that corresponds to these two separate sums simply invokes sumSegment twice. It computes a pair of values: sum1, corresponding to the first part of the array, and sum2, corresponding to the second part. Once we have these two sums, we add them up and again raise the result to the power 1/p.

This is still a sequential version. How would we obtain a parallel variant of this function? It turns out the only thing we need to do is insert the parallel combinator in front of this pair. Sketches of all these functions follow below.
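Here is a minimal sketch of what the sequential sumSegment just described might look like. The particular power helper, rounding to Int via exp and log, is an assumption based on the "couple of math library functions" mentioned above:

```scala
// Rounds |x|^p to an Int using exp/log from the math library
// (an assumption about which "math library functions" are meant).
def power(x: Int, p: Double): Int =
  math.exp(p * math.log(math.abs(x))).toInt

// Sums the p-th powers of the array elements in [s, t).
def sumSegment(a: Array[Int], p: Double, s: Int, t: Int): Int = {
  var i = s
  var sum = 0
  while (i < t) {
    sum = sum + power(a(i), p)
    i = i + 1
  }
  sum
}
```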
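Building on that, the sequential p-norm and its two-part grouping might be written as follows; pNormTwoPart is an illustrative name for the grouped version:

```scala
// Sequential p-norm: sum over the whole array, then raise to 1/p.
def pNorm(a: Array[Int], p: Double): Int =
  power(sumSegment(a, p, 0, a.length), 1 / p)

// Still sequential, but the sum is grouped into two halves.
def pNormTwoPart(a: Array[Int], p: Double): Int = {
  val m = a.length / 2
  val (sum1, sum2) = (sumSegment(a, p, 0, m), sumSegment(a, p, m, a.length))
  power(sum1 + sum2, 1 / p)
}
```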
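And the parallel variant, assuming a parallel combinator of type (=> A, => B) => (A, B) is in scope (the course library provides one; a sketch of it appears further below):

```scala
// Identical to pNormTwoPart except for the parallel combinator,
// which evaluates its two arguments at the same time.
def pNormParallel(a: Array[Int], p: Double): Int = {
  val m = a.length / 2
  val (sum1, sum2) = parallel(sumSegment(a, p, 0, m),
                              sumSegment(a, p, m, a.length))
  power(sum1 + sum2, 1 / p)
}
```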
Now let's compare the execution of these two versions. In the first case, we simply compute the first element of the pair and then the second element of the pair, so the computation proceeds sequentially. In the second case, we start by setting up two parallel computations, which may incur some initial overhead, but then we proceed to sum up the elements of the array potentially twice as fast as in the sequential case. As a result, we may finish sooner than in the sequential version.

Now how would we generalize this to process four array segments in parallel? We can divide the array into four segments; we take the first two segments and sum them up in parallel, and we take the second two segments and sum them up in parallel. We then have two computations which can themselves run in parallel, which we achieve by invoking yet another parallel instance.

It is natural to consider how to generalize this idea to an algorithm that works with an arbitrary number of threads: a function that processes very large arrays and exploits, in a natural way, as much parallelism as we might have. Here is one version of it; let's call it segmentRec. segmentRec is called in the same way as sumSegment from before. In the special case when the length of the segment is small, less than some threshold value, it simply invokes the same sequential computation that we have seen before. Otherwise, it divides the segment into two: if we have the starting point of the segment, s, and the end point, t, we find a value m that is approximately in the middle. This way we obtain two smaller segments, one from s to m and one from m to t, which we then process in parallel by invoking the same function recursively. When we obtain the resulting partial sums, we add them up, and this gives us the overall sum, which, as before, we can simply raise to the power 1/p.

In general, what is the signature of parallel? It takes two computations, let's call them taskA and taskB, and returns the values of those two computations. So from the point of view of the value computed, parallel behaves as an identity function. The benefit, of course, is that the result may be computed faster than by just computing the pair of these two computations sequentially.

Now observe that parallel takes its arguments by name, which is indicated by the two arrows in the signature of the function. Why is this? Suppose that instead of the signature that we really have for parallel, we had an alternative one, called parallel1, where taskA and taskB simply have types A and B. Suppose that we now, on the one hand, invoke parallel(a, b), and on the other hand, parallel1(a, b). What is the difference between these two computations? It turns out that the second computation in fact evaluates a and b sequentially, much as if we had simply written the assignment (va, vb) = (a, b). This is because the semantics of passing arguments by value is such that we first evaluate a, then evaluate b, and only then give the values obtained to the parallel1 function. To obtain the benefits of parallelism, we need to indicate that we are passing not the values of these expressions but the computations themselves, and this is indicated by the arrow before each type. In other words, we can say that parallel is, much like if or while statements, a control structure, except that it is defined by the user. Sketches of these ideas follow below.
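Returning to the four-segment idea described above, a sketch with nested parallel calls might look like this (pNormFourPart is an illustrative name, with quarter boundaries m1, m2, m3):

```scala
// Four segments: two parallel sums per half, and the two halves
// themselves run in parallel via an outer parallel call.
def pNormFourPart(a: Array[Int], p: Double): Int = {
  val (m1, m2, m3) = (a.length / 4, a.length / 2, 3 * a.length / 4)
  val ((sum1, sum2), (sum3, sum4)) = parallel(
    parallel(sumSegment(a, p, 0, m1), sumSegment(a, p, m1, m2)),
    parallel(sumSegment(a, p, m2, m3), sumSegment(a, p, m3, a.length)))
  power(sum1 + sum2 + sum3 + sum4, 1 / p)
}
```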
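The recursive segmentRec just described might be sketched as follows; the concrete threshold value is an assumption and would be tuned empirically:

```scala
val threshold = 10000 // hypothetical cutoff; tune it empirically

def segmentRec(a: Array[Int], p: Double, s: Int, t: Int): Int =
  if (t - s < threshold)
    sumSegment(a, p, s, t) // small segment: plain sequential sum
  else {
    val m = s + (t - s) / 2 // split the segment roughly in the middle
    val (sum1, sum2) = parallel(segmentRec(a, p, s, m),
                                segmentRec(a, p, m, t))
    sum1 + sum2
  }

def pNormRec(a: Array[Int], p: Double): Int =
  power(segmentRec(a, p, 0, a.length), 1 / p)
```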
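To make the by-name point concrete, here is the hypothetical by-value parallel1 next to a simplistic thread-based sketch of parallel. The course's real implementation is more sophisticated; this only illustrates why the arrows in the signature matter:

```scala
// By-value version: a and b are fully evaluated, one after the
// other, before parallel1 even starts. No parallelism is possible.
def parallel1[A, B](taskA: A, taskB: B): (A, B) = (taskA, taskB)

// By-name version: => A and => B delay evaluation, so taskB can be
// handed to another thread before it runs.
def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
  var b: Option[B] = None
  val thread = new Thread { override def run(): Unit = b = Some(taskB) }
  thread.start()   // taskB is evaluated on the new thread
  val a = taskA    // taskA is evaluated on the current thread meanwhile
  thread.join()    // join makes the write to b visible here
  (a, b.get)
}
```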
How did we in fact define parallel? At a very high level, it turns out that efficient support for parallelism requires support from the programming language and its libraries, from the virtual machine (in case the programming language runs on top of one), as well as from the operating system and the hardware. Our specific implementation of parallel uses Java Virtual Machine threads, and these typically map to operating system threads. The operating system in turn can schedule different threads on multiple cores, which means that if we do have multiple hardware cores, a parallel program can in fact run faster. If we happen to be running the computation on a machine that does not have multiple cores, or its cores are busy doing some other computation, then these different layers of the software stack will still allow us to run the computation, but we will not get the benefits of parallel execution.

One example where we can see that there is some complexity underneath our parallel abstraction is the following. Consider a function sum, which is similar to sumSegment, but instead of raising the elements of the array to some power p, it simply sums them up directly: it computes the sum of the elements of the array in a given segment from s to t. Now suppose we invoke four instances of sum on four disjoint segments of an array, much like in the example before, and we try to execute this in parallel on a commodity desktop machine that has, say, four hardware cores, and perhaps reports eight threads with hyperthreading (a sketch follows at the end of this section). We may find that it is in fact difficult to obtain any speedup, and this happens even if we have a very long array. Why would this be the case?

It turns out that this computation is bound by the memory bandwidth. The array is stored in random access memory, which, even if we have multiple processors, is shared across them. Whether we have one or more cores working, they will all end up accessing the array elements through the same memory unit, and that means that the time the computation takes cannot be less than the time it takes to fetch the entire array from memory into the processor. Therefore, when considering opportunities for speedup, we must take into account not only the number of cores, but also the parallelism available for any other shared resources that we might need in order to perform the computation, such as memory in this case.

Let's observe another aspect of parallel computation, as expressed with the construct parallel. Suppose that we try to apply it to expressions e1 and e2 whose computations take very different lengths of time. Because we need the result of the first expression and the result of the second expression to obtain the entire pair, we will be left waiting until the longer computation finishes. This means that the minimum time the parallel expression can take is in fact the maximum of the running times of e1 and e2: max(time(e1), time(e2)).
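Here is a sketch of the memory-bound experiment described above: a plain sum with almost no arithmetic per element, invoked on four disjoint quarters of the array (sumFourPart is an illustrative name, reusing the parallel combinator sketched earlier):

```scala
// A plain sum over [s, t): almost no arithmetic per element, so the
// shared memory bus, not the cores, becomes the bottleneck.
def sum(a: Array[Int], s: Int, t: Int): Int = {
  var i = s
  var total = 0
  while (i < t) {
    total = total + a(i)
    i = i + 1
  }
  total
}

// Four parallel instances over disjoint quarters of the array;
// on a typical desktop this yields little speedup on long arrays.
def sumFourPart(a: Array[Int]): Int = {
  val (m1, m2, m3) = (a.length / 4, a.length / 2, 3 * a.length / 4)
  val ((s1, s2), (s3, s4)) = parallel(
    parallel(sum(a, 0, m1), sum(a, m1, m2)),
    parallel(sum(a, m2, m3), sum(a, m3, a.length)))
  s1 + s2 + s3 + s4
}
```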