We will now start discussing parallel operations on collections of elements. Processing elements of collections in parallel is, in fact, one of the main applications of parallelism today. We will start by discussing the conditions under which this can be done, in terms of the kinds of collections we should consider. Here we need to pay attention to whether we can efficiently split a collection into multiple collections to be processed in parallel, and possibly later combine multiple collections into a new one. We also need to pay attention to the properties of the operations we wish to perform on the elements of the collection; examples are associativity, or independence of the operations in case they perform side effects.

Operations on collections are, in fact, key to functional programming even in the sequential case. Here are some of the common operations on collections. A map operation applies a given function to each element of the collection. Consider, for example, the list containing the elements 1, 3, and 8. If we map this list using the squaring function that maps x to x squared, we obtain the list containing 1, 9, and 64, each element corresponding to the appropriate element of the original list and obtained by applying our function. What other operations are of interest? An important operation is fold. Fold combines the elements of a collection using a given operation, and many variants of fold also take an initial element for this combination. Let's look again at the same list containing 1, 3, and 8, and suppose that we want to start from 100 and sum up the elements. What is the result? The result is 100 plus 1 plus 3 plus 8, which is 112. Another relevant function, which can be used to express a number of other algorithms of interest, is scan. You can think of scan as applying fold to all list prefixes, or alternatively, as recording the intermediate results of computing the fold of a list.
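These three operations can be tried out directly on our example list, using Scala's standard sequential collection methods map, foldLeft, and scanLeft (a quick sketch of the sequential case only):

```scala
val xs = List(1, 3, 8)

// map: apply the squaring function to each element
val squares = xs.map(x => x * x)       // List(1, 9, 64)

// fold: combine the elements with +, starting from 100
val sum = xs.foldLeft(100)(_ + _)      // 100 + 1 + 3 + 8 == 112

// scan: record the intermediate results of the fold;
// the output has one more element than the input
val sums = xs.scanLeft(100)(_ + _)     // List(100, 101, 104, 112)
```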
Let's now apply scan to the same input list, with the same initial element and the same operation of summation. What we get is a sequence whose length is one more than that of the original sequence, containing the elements 100 (the initial element), then 101 (the initial element plus the first element of the list), then 104, obtained by adding 3, and then, after adding 8, we obtain 112. These operations already exist in the sequential case, but they become even more important in the parallel case, because implementing them from scratch is then more difficult, so there is even more value in reusing implementations from a library. We have been using lists to specify the intended results of these operations, but in fact lists themselves are not a very good implementation of parallel collections: we cannot efficiently split them in half, since we would need to find the position of the middle of the list, and it is also not efficient to combine them, because concatenation takes linear time. For simplicity, we will here consider two alternatives to lists. The first is imperative arrays, as you may recall from the array sum example; the second is trees, which can in fact be implemented in a purely functional manner as well. In subsequent lectures, you will see more about Scala's parallel collection libraries, which include many other data structures that are implemented efficiently and that offer additional benefits. Let's start by discussing map. If you have a list of elements a1 through an and you apply map with a function f, you get the list of elements f(a1) through f(an). One property that map satisfies is that if we happen to map with the identity function, we get a list that has the same number and the same ordering of elements as the original list.
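This identity property is easy to observe on our example list (a quick check using the sequential list map):

```scala
val xs = List(1, 3, 8)

// mapping with the identity function leaves the number and
// ordering of the elements unchanged
val same = xs.map(x => x)   // List(1, 3, 8)
```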
Another property that list map happens to have is that mapping with the composition of two functions, which first applies one function and then the other, is the same as mapping first with one of the functions and then mapping again with the second function. This can be relevant because it allows us to fuse multiple applications of map into a single application of map with a more complex function. It should be straightforward for you to come up with a recursive definition of a sequential map, which takes a list of elements of type A and produces a list of elements of type B, the resulting list being obtained by applying the function f. Clearly, if the input list is empty, the result is the empty list as well. Otherwise, we have a list containing some head element and the remainder of the list, in which case we need to produce a list that applies the function f to that head element and then continues recursively on the tail of the list. Now we would like to have parallel versions of such a map operation. That means we would like to perform the computations of f applied to different list elements in parallel, and we would also like to parallelize the transformation of the list itself. This means that the choice of lists is no longer ideal, because even finding the middle element of a list takes linear time, so we would not be able to efficiently parallelize operations on long lists. We will therefore start by looking at implementations of map on arrays. Here is one signature of such an operation. It takes an input array, and it also takes as an argument an output array to which the results should be written. To indicate which part of the array should be processed, we have the indices left and right: the processing should start at left and stop at right-1. The function to be applied, f, is again passed as an argument. Let's look first at a sequential implementation of such a function.
We can do that using a simple while loop that starts from left and writes to the output array, at index i, the result of applying the function f to the input element at index i. We do that for increasingly large indices i, going from left up to, and not including, right. The effect of this function is that the content of the out array is changed between the indices left and right-1. For example, suppose we take the array containing 2, 3, 4, 5, and 6, we have an array filled with zeros that we will use as output, and we take the squaring function as f. If we fill in the indices from 1 up to, but not including, 3 using the result of applying f to the corresponding elements of the input array, then, because this is zero-based indexing, we will process the elements at index 1 and index 2. That is why the result will contain 9 and 16, and the remaining elements will stay at 0. So that was the sequential version; how could we do this in parallel? Let's use the parallel construct to express this. When the difference between the left index and the right index is small enough, below some threshold, we can invoke the sequential function that we defined previously. Otherwise, we compute the middle index and then invoke the function recursively, in parallel, from left to middle and from middle to right. There are two things that we need to pay attention to. One is that we are now invoking, in parallel, computations that write their output to some array. This means we need to be careful that the parallel operations write to disjoint parts of memory. In this case, what these calls write to are elements of the out array, so we need to track to which indices these calls write. Here we have a specification, written as an informal comment, saying that our recursive function, just like its sequential counterpart, writes to the elements out(i) for indices i starting from and including left, and going up to and including right-1.
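A sketch of both versions follows. The names mapASegSeq and mapASegPar, and the threshold value, are the ones I use here for illustration, and parallel is given a minimal thread-based stand-in so that the sketch is self-contained:

```scala
// Minimal stand-in for the lectures' `parallel` construct: run the two
// computations on separate threads and wait for both results.
def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
  var res: Option[B] = None
  val t = new Thread { override def run(): Unit = res = Some(taskB) }
  t.start()
  val a = taskA
  t.join()
  (a, res.get)
}

// Writes f(inp(i)) to out(i) for every i with left <= i < right.
def mapASegSeq[A, B](inp: Array[A], left: Int, right: Int,
                     f: A => B, out: Array[B]): Unit = {
  var i = left
  while (i < right) {
    out(i) = f(inp(i))
    i += 1
  }
}

val threshold = 100  // illustrative; must be large enough to amortize overhead

// Same specification as mapASegSeq, but the two halves run in parallel;
// the recursive calls write to disjoint index ranges of out.
def mapASegPar[A, B](inp: Array[A], left: Int, right: Int,
                     f: A => B, out: Array[B]): Unit =
  if (right - left < threshold)
    mapASegSeq(inp, left, right, f, out)
  else {
    val mid = left + (right - left) / 2
    parallel(mapASegPar(inp, left, mid, f, out),
             mapASegPar(inp, mid, right, f, out))
  }
```

On the example above, calling mapASegSeq(Array(2, 3, 4, 5, 6), 1, 3, (x: Int) => x * x, out) on an output array of zeros leaves out containing 0, 9, 16, 0, 0.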
Because of this property, we can see that these recursive calls will, in fact, not interfere, because the highest index to which the first argument of parallel will write is mid-1, and the first index to which the second call will write is mid. Moreover, we can see that if we assume the specification holds for these calls, then the specification also holds for the entire function. So, by induction, we can actually verify that the property we have written in the comment holds. In terms of performance, another point that we need to take into account is that the threshold needs to be large enough. This is particularly important for an example such as this one, where the only thing we are doing is writing to certain elements of the array. If the function f is relatively simple, then performing a write to one index of an array is going to be several orders of magnitude cheaper than invoking parallel computations. So the overhead of parallelization needs to be amortized over the number of writes to individual indices of the array, which is why the threshold needs to be several orders of magnitude above one as well. Once we have defined such a parallel map, we can use it for various concrete functions. Say we want to raise all elements of an array to a power p: take elements a1 through an and compute the absolute value of each element raised to the power p, as in our array sum example from before. We can then invoke either the sequential or the parallel map on the arrays. Several questions on performance arise in this example. The first one is, of course: do we gain from parallel execution? The second question you may ask is: wouldn't it be more efficient if we reimplemented, for this particular operation of raising to the power p, the application of the operation to each element of the array? So let's see how those examples would look if we wrote them from scratch. Let's call the resulting function normsOf.
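One plausible shape for the specialized sequential version is sketched below; the name normsOfSeq and the exact signature are illustrative assumptions:

```scala
// Specialized sequential version: the higher-order function is gone,
// and |inp(i)|^p is computed directly inside the loop.
def normsOfSeq(inp: Array[Int], p: Double,
               left: Int, right: Int, out: Array[Double]): Unit = {
  var i = left
  while (i < right) {
    out(i) = math.pow(math.abs(inp(i)), p)
    i += 1
  }
}
```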
It is simply the map specialized to this power function. The function is no longer passed as an argument; the loop just traverses all elements and fills in the array with powers of the input. There is also a corresponding parallel version, which performs the computation directly in the base case, when the range of indices is small enough, and otherwise invokes itself recursively in parallel, similarly to the generic case. Now, what do you think is the relative performance of these different versions? Here are results of measurements on a particular machine, for a relatively large array and a correspondingly high threshold. The slowest case is when we run sequentially and use the higher-order function: we get 174 milliseconds in this example. When we perform the parallelization on this machine, which has multiple cores, we obtain quite a nice speedup. On the other hand, if we implement this operation from scratch but run it sequentially, we obtain performance very close to that of invoking the generic sequential map. And if we implement it by hand and do the parallelization by hand, we also obtain only a negligible improvement compared to invoking the generic parallel operation. So we can conclude that parallelization pays off, but manually removing higher-order functions does not really pay off, given the effort, the loss of maintainability, and the hardly measurable improvement in running time. Now that we have seen parallel map on arrays, let's look at implementing parallel map on trees. We will consider trees whose leaves store array segments and whose non-leaf nodes contain references to the left and right subtrees. It is also going to be convenient for both leaves and non-leaves to store the total number of elements. We will assume that our trees are approximately balanced; that will allow us to explore the branches in parallel while obtaining the benefits of parallelization.
For map, we do not need to concern ourselves with how we maintain this balancing. Here, then, is the implementation of a parallel map. It is surprisingly simple: we do pattern matching on the shape of the tree, and in case it is a leaf, we need to perform the map on the array. In that case, we allocate a new array b, whose size is the same as the size of the array a stored in the current leaf. We then fill in this newly allocated array with the appropriately mapped elements of the original array, using one sequential loop. The idea is that these array segments are small enough that it would not pay off to perform this mapping in parallel; the benefits of parallelization are obtained at the nodes of the tree. If a node has a left and a right subtree, we simply invoke the parallel map recursively on the left and on the right subtree; the result is two trees. Note that, unlike the operations on arrays that we had before, this parallel map returns a new tree. Once we have these two trees, we simply combine them into a new node. If we now measure the speedup relative to a sequential version that does not use parallel, we obtain improvements similar to those for the array. It is instructive to compare using arrays versus immutable trees as the collections on which we perform operations such as map. Arrays are very appealing because of their simplicity and because we can access their elements in arbitrary order. When we have tasks executing on the same shared memory, we just need to pass a pointer or reference to the beginning of the array, along with some indication of which indices of the array the appropriate task is supposed to process. And when we process a particular region of the array, because array elements are stored contiguously in memory, we obtain good memory locality.
On the other hand, because we need to create these arrays in an imperative way, we have to be careful that tasks executing in parallel write to disjoint parts of the array; otherwise, we will obtain unpredictable results that depend on the order in which the writes are performed. That is one disadvantage of arrays. Another disadvantage is that if we obtain arrays from completely different computations and later want to put them together, we will necessarily have to copy some parts of the arrays. It is interesting to compare these properties of arrays with the properties of immutable trees. In this approach, we have seen that we produce new trees as the result of operations such as map, and we keep the old ones. This means that we can continue using the old versions of the data, which is useful for many applications. Moreover, because an operation such as map produces a new tree, it is easier to ensure that it does not write to the same parts of memory as some other operation executing in parallel. Next, it is easy to combine two trees, because all we need to do is create a new node that has the two trees as subtrees; even if we need to ensure balancing, this can be done reasonably efficiently. Among the negative aspects of immutable trees are a high memory allocation overhead and also bad locality, because different parts of the tree may, in principle, be stored in completely different parts of memory.
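To summarize the tree-based approach, here is a sketch of such a tree and its map. The names Tree, Leaf, Node, and mapTreePar are illustrative, and the two recursive calls are written sequentially where the lectures' parallel construct would be used:

```scala
import scala.reflect.ClassTag

// Leaves store array segments; inner nodes store the two subtrees.
// Both cache the total number of elements, as described above.
sealed abstract class Tree[A] { val size: Int }
case class Leaf[A](a: Array[A]) extends Tree[A] {
  override val size: Int = a.length
}
case class Node[A](l: Tree[A], r: Tree[A]) extends Tree[A] {
  override val size: Int = l.size + r.size
}

def mapTreePar[A, B: ClassTag](t: Tree[A], f: A => B): Tree[B] = t match {
  case Leaf(a) =>
    // segments are small, so map them with one sequential loop
    val b = new Array[B](a.length)
    var i = 0
    while (i < a.length) { b(i) = f(a(i)); i += 1 }
    Leaf(b)
  case Node(l, r) =>
    // in the real implementation these two calls run in parallel
    val (lb, rb) = (mapTreePar(l, f), mapTreePar(r, f))
    Node(lb, rb)  // combining the results is constant-time node creation
}
```

Note how the Node case makes the advantages discussed above concrete: the result is a new tree, so no shared memory is overwritten, and combining the two mapped subtrees is a single allocation rather than a copy.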