In this lecture, we will see how to evaluate parallel program performance through benchmarking. Before we start, it is important to differentiate the act of benchmarking computer programs from testing them. Testing ensures that parts of the program behave according to the intended specification, that is, the intended behavior. For example, assume we want to test that calling the reverse method on a list works correctly. We would create a unit test that instantiates a list with some numbers and then checks whether the result of the reverse method is what we expect. By contrast, benchmarking is used to evaluate various performance metrics of the program. A performance metric could be the running time, memory footprint, network traffic, disk usage, or latency. Often, such a value is a random variable whose actual value varies from run to run. For instance, if we want to measure the running time of the reverse method, we could instantiate a list, record the starting time, call the reverse method, and then record the time again to compute the total running time. At the end, we print the running time to the standard output. This is a very naive way of evaluating the running time of the reverse method, but we start with it for illustration purposes; a minimal sketch of this naive approach appears a bit further below.

There is one important difference between benchmarking and testing. Typically, testing yields a binary yes-or-no answer: the program is either correct or it is not. In the previous example, the resulting list is either equal to the expected list or it is not; there is no other possibility, aside from raising an exception, which is also considered incorrect. Benchmarking, on the other hand, usually results in a continuous value, which quantifies how well the program performs. In our benchmark example, the running time is a continuous value and will be slightly different each time we run the benchmark.

We should also remember why we write parallel programs in the first place. The main reason is performance benefits, in particular, improving the running time. If performance were not the benefit, we could just continue writing sequential programs, which are easier to both produce and understand. For this reason, benchmarking parallel programs is much more important than benchmarking sequential programs. In sequential programming, we usually only measure the performance of the bottlenecks in the system. Conversely, a parallel program is typically a bottleneck in the system to begin with.

We will focus on the running time metric, which is subject to, first of all, processor speed: the faster the processor, the faster the resulting program. The second factor is the number of processors. A parallel program can, to some extent, distribute its computational workload across different processors, so the number of available processors is a necessary precondition for improving performance. Main memory also plays an important role. Since processors are separated from the main memory by a bus, they sometimes have to wait while fetching data from memory. Here we differentiate between latency, the amount of time that a processor must wait from the moment it requests data from main memory until the data arrives, and throughput, the amount of data that can be retrieved from main memory per unit of time. These two properties affect the degree of memory contention, an effect that we will study later in this course.
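To make the naive approach described above concrete, here is a minimal sketch of timing the reverse call by hand; the list size and names are arbitrary illustrations rather than code from the lecture.

  // Naive timing of List.reverse: record the time before and after the call.
  object NaiveReverseBenchmark {
    def main(args: Array[String]): Unit = {
      val xs = (1 to 1000000).toList       // instantiate a list with some numbers
      val start = System.nanoTime()        // record the starting time
      val reversed = xs.reverse            // call the reverse method
      val end = System.nanoTime()          // record the time again
      println(s"reverse of ${reversed.length} elements took ${(end - start) / 1000000.0} ms")
    }
  }

A single measurement like this is affected by all of the factors above and, as we will see, is not very reliable on its own.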
To decrease the negative effects of limited latency and throughput, most computer systems today use caches: smaller amounts of high-speed memory close to the processor cores, which mirror parts of the main memory. Caches are usually divided into several levels, which allow the processor to fetch and modify recently used data without going through the bus to reach main memory. In doing so, they boost the performance of a program by orders of magnitude. However, caches make performance analysis more complicated, and although they exist to increase program performance, they can sometimes negatively impact running time through effects such as false sharing. Finally, almost all programs coexist with, or run within, some runtime environment, such as a virtual machine, a memory management subsystem, or an operating system. These components introduce effects such as garbage collection, just-in-time compilation, or concurrently executing tasks, which also affect the observed running time. This is a very high-level overview of a typical computer system; in reality, there are many other factors that drive performance. For those who would like to learn more, we recommend the article What Every Programmer Should Know About Memory, which discusses these concepts in greater depth.

As a consequence of these different performance factors, it is difficult to accurately measure a metric such as the running time. Two runs of the same program usually have similar but different running times. How can we be sure what the real running time is? Well, let's take a look at some measurement recipes. The first obvious approach is to repeat the measurement multiple times to get a sense of how the program behaves; however, this is not enough. Once we obtain a sequence of measurements, we need to do some statistical analysis. In its simplest form, this involves computing the mean value, the variance, and confidence intervals. We can also use statistics to eliminate outliers: a handful of measurements that deviate from the majority by some large and unusual amount. To fight against various runtime environment effects, we should ensure that our program is in a steady state: measurements should be taken when there are no concurrently executing programs and when the runtime effects of dynamic compilation and memory management have been eliminated. This usually happens after the program has been running long enough, which we refer to as warm-up. Finally, we sometimes need to prevent anomalies that impair our measurements. Garbage collection can be avoided by allocating sufficient memory beforehand, and JIT compilation can be disabled for the purposes of running the benchmark. In some cases, the compiler optimizes parts of the benchmark program that would not be optimized in a real application because the benchmark is too simple; to prevent this, we often need to restructure the code manually.

If we turn back to our earlier example, we can see that we neglected to do most of these steps. First of all, we did not repeat the benchmark multiple times, we only ran it once. We did not eliminate outliers, nor did we prevent various anomalies or ensure that the program was in a steady state before doing the measurement. Another problem is that this benchmark is so short, for this particular list, that the output would be very noisy. If you are interested in learning more about accurate performance measurement, we recommend reading the article Statistically Rigorous Java Performance Evaluation. Having convinced ourselves that measuring performance is a complex task, we turn to an existing tool to do the job for us.
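Before introducing that tool, here is roughly what the recipe above might look like if implemented by hand; the warm-up and repetition counts are arbitrary, and this sketch can simply be pasted into the Scala console.

  // Hand-rolled measurement: warm up first, then repeat and compute simple statistics.
  def measureReverse(warmupRuns: Int = 20, measuredRuns: Int = 20): Unit = {
    val xs = (1 to 1000000).toList
    def once(): Double = {
      val start = System.nanoTime()
      val reversed = xs.reverse
      val elapsed = (System.nanoTime() - start) / 1000000.0    // milliseconds
      if (reversed.isEmpty) println("unexpected")               // use the result so it is not optimized away
      elapsed
    }
    for (_ <- 1 to warmupRuns) once()                           // warm-up runs, results discarded
    val times = (1 to measuredRuns).map(_ => once())
    val mean = times.sum / times.length
    val variance = times.map(t => (t - mean) * (t - mean)).sum / times.length
    println(f"mean = $mean%.2f ms, stddev = ${math.sqrt(variance)}%.2f ms")
  }

Even a sketch like this quickly becomes tedious to maintain, which is exactly why a dedicated framework is preferable.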
ScalaMeter is a benchmarking and performance regression testing framework for the JVM. Here we differentiate between performance regression testing, which compares the performance of the current program run against known previous runs, and benchmarking, which measures the performance of the current program. In this course, we will focus on the benchmarking features of the ScalaMeter framework. To start using ScalaMeter, we need to add it as a library dependency to our SBT project; the dependency line is shown in the sketch after this paragraph. After that, we only need to import the ScalaMeter package and use the measure method to get the running time of a piece of code. In ScalaMeter version 0.6, this method returns a double value for the running time. In this snippet, we measure the time required to convert a range of one million integers into an array, and finally we print the array initialization time.

So let's see how we can do this from the SBT interactive shell. First, we start the Scala console, then we import the ScalaMeter package, and finally we measure the running time of our snippet. The result we get is 21 milliseconds. However, let's run the measurement again. Oddly enough, the running time is now 10 milliseconds, which is quite a big deviation from the initial measurement. Let's rerun the snippet several more times. The numbers we get are quite interesting. At first, the running time was 10 milliseconds several times in a row. Then, for some strange reason, the running time grew to 53 milliseconds, which is quite unusual. What is even more interesting is that after that, the running time is no longer 10 milliseconds but seven and a half. We conclude that during this run of the benchmark, several things must have happened. First, the JVM started dynamic compilation. The evidence for this is that the running time after this long run is almost always around seven milliseconds, which is faster than before; obviously, the JVM optimized something. Another thing that could have occurred is a garbage collection cycle. During its execution, the program allocates more and more memory. This memory is deallocated and returned to the memory subsystem only after memory occupancy rises above a certain threshold; when this happens, a mechanism known as garbage collection kicks in. If we scroll down a little bit, we can notice that garbage collection happens quite periodically, roughly every third run of the benchmark. The normal running time is quite consistent, around seven and a half milliseconds, while the running time with garbage collection deviates by quite a large amount.

Initially, the demo showed two very different running times on two consecutive runs of the program. After doing more measurements, we observed several other effects. We noticed first-hand that when a JVM program starts, it undergoes a period of warm-up, after which it achieves its maximum performance. First, the program runs in interpreted mode: the bytecodes, which are the output of the Scala compiler, are executed directly by a software component called the interpreter. Then, parts of the program are compiled into machine code. The JVM is smart enough to figure out which parts of the program are the hot parts, those that are executed most often, and exactly those parts are compiled. However, it turns out that later the JVM may choose to apply additional dynamic optimizations to program segments that run really often.
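For reference, the dependency line and the basic measure call used in this demo look roughly as follows; the artifact coordinates and version string are assumptions for ScalaMeter 0.6, so check the ScalaMeter documentation for your setup.

  // In build.sbt: add ScalaMeter as a library dependency (coordinates assumed for version 0.6).
  libraryDependencies += "com.storm-enroute" %% "scalameter-core" % "0.6"

  // In the Scala console started from SBT:
  import org.scalameter._
  val time = measure {
    (0 until 1000000).toArray           // convert a range of one million integers into an array
  }
  println(s"Array initialization time: $time ms")

Running this snippet repeatedly on a fresh JVM should reproduce the kind of variation described above.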
For such frequently executed segments, it makes sense to apply further analysis and optimizations to ensure that they are as fast as possible, and eventually the program hopefully reaches some steady state. Usually, we want to measure steady-state program performance, and it turns out that ScalaMeter can help us here. Warmer objects run the benchmarked code until they detect a steady state, making sure that the JVM is properly warmed up before a measurement is executed. We specify the warmer with the withWarmer clause; in this case, we use the default warmer implementation. Let's ensure that the runtime reaches a steady state before we run our benchmark. We do this with a withWarmer statement, which precedes the measure call. Recall that in the last demo, the running time eventually became seven and a half milliseconds. We could have falsely concluded that the system had reached a steady state, but as shown in this example, when we actually run the ScalaMeter warmer, the real steady state is reached: the actual running time of this benchmark in the steady state is between four and four and a half milliseconds.

In some cases, we are not entirely satisfied with the default parameters that ScalaMeter uses. For instance, the behavior of the warmer object can be governed with a configuration clause. Inside the configuration clause, we specify several keys mapped to different values. In this case, we change the minimum and the maximum number of warm-up runs that ScalaMeter executes, and we also increase the verbosity of the standard output.

Finally, ScalaMeter can measure more than just the running time. So far, we have seen examples that use the default measurer, which measures the plain running time. The IgnoringGC measurer measures the running time without GC pauses, and the OutlierElimination measurer removes statistical outliers from the measurements. We can also measure different values, not just the running time: MemoryFootprint measures the memory footprint of an object, and GarbageCollectionCycles measures the total number of GC pauses during the execution of the benchmark. Newer ScalaMeter versions can also measure method invocation counts and primitive value boxing counts.

Let's see an example of using a different measurer. Instead of measuring the running time, we will measure the memory footprint: the total amount of memory occupied by the object returned from the snippet, which in this case is an array with a million integers. Strangely enough, we get a negative value at first. This is because we did not use a warmer to ensure a steady state, so we run the benchmark several more times until some steady state has been reached. We can see that measuring the memory footprint is much less noisy than measuring the running time. The total memory footprint is around 4,000 kilobytes, or four megabytes. Since integers are 32-bit, they occupy four bytes of memory each, and we have a million integers, so we expect four megabytes of memory, which is exactly what ScalaMeter measured here. The sketch below pulls these pieces together.
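The following sketch combines the warmed-up measurement, the configuration clause, and the memory-footprint measurer discussed in this part; it assumes ScalaMeter 0.6's org.scalameter API, and the specific warm-up counts are only example values.

  import org.scalameter._

  // Running time, measured only after the default warmer detects a steady state.
  val warmedTime = withWarmer(new Warmer.Default) measure {
    (0 until 1000000).toArray
  }

  // The same measurement with custom warm-up bounds and verbose output.
  val configuredTime = config(
    Key.exec.minWarmupRuns -> 20,
    Key.exec.maxWarmupRuns -> 60,
    Key.verbose -> true
  ) withWarmer(new Warmer.Default) measure {
    (0 until 1000000).toArray
  }

  // Memory footprint (in kilobytes) of the array returned from the snippet, instead of running time.
  val footprint = withMeasurer(new Measurer.MemoryFootprint) measure {
    (0 until 1000000).toArray
  }

  println(s"warmed: $warmedTime ms, configured: $configuredTime ms, footprint: $footprint kB")

An array of a million 4-byte integers should show up as roughly 4,000 kilobytes, matching the value observed in the demo.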