Welcome to our course on parallel programming. In the following weeks, we will look at some of the cutting-edge technologies for parallel programming, see how to write efficient parallel programs, and learn the basic principles and algorithms of this fast-moving and exciting field of computing. In this first lecture, we give a general introduction to parallel computing, study various forms of parallelism, and summarize what to expect in the rest of this course.

The first big question that you need to answer is: what is parallel computing? So, let's define it. We'll say that parallel computing is a type of computation in which many calculations execute at the same time. Parallel computing is based on the following principle: a computational problem can be divided into smaller subproblems, which can then be solved simultaneously. Parallel computing assumes the existence of some sort of parallel hardware, which is capable of undertaking these computations simultaneously, that is, in parallel.

Having said that, we will take a short glance into the history of parallel computing to better understand its origin. Despite its recent rise in popularity, parallel execution has been present since the early days of computing, when computers were still mechanical devices. It is difficult to identify the exact conception of the first parallel computing machine, but we can agree on the following. In the 19th century, Charles Babbage invented the concept of a programmable computer, which he called the analytical engine. The analytical engine was able to execute various user-written programs, and for this, Babbage is credited as a pioneer in the field of programmable computers. After Babbage presented his ideas about the analytical engine, the Italian general and mathematician Luigi Menabrea wrote down a description of Babbage's computer. Menabrea envisioned an extension in which independent computations occur simultaneously; his idea was to speed up Babbage's analytical engine.

Modern computers, based on transistor technology, appeared much later, in the 20th century. Researchers from IBM, Gene Amdahl among them, were early evangelists of parallel processing. Amdahl's famous quote comes from this era. He said: for over a decade prophets have voiced the contention that the organization of a single computer has reached its limits, and that truly significant advances can be made only by interconnection of a multiplicity of computers. Regardless of Amdahl's visionary views, at the time parallel computing did not become mainstream, but it found a niche within the high-performance computing community. In high-performance computing, the main areas of focus are parallel algorithms for physics and biology simulations, weather forecasting, cryptography, and other applications that require a lot of processing power. The performance requirements of regular users were satisfied by increasing the clock frequency of the CPU, which meant that the CPU could execute a higher number of instructions per second. At the beginning of the 21st century, it became obvious that scaling the processor frequency is no longer feasible after a certain point. The reason is that the power required by the processor starts to grow non-linearly with the frequency. So, instead of increasing the CPU clock frequency, processor vendors turned to providing multiple CPU cores on the same processor chip, and named these new devices multi-core processors.
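To make the divide-into-subproblems principle concrete, here is a minimal sketch, not taken from the lecture itself, of splitting one computation into two subproblems that are solved simultaneously. It sums an array by halving it; the ParallelSum object and the use of Scala's standard Futures are illustrative choices, not the course's own primitives:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelSum {
  def main(args: Array[String]): Unit = {
    val numbers = (1 to 10000).toArray
    val mid = numbers.length / 2

    // Each half is an independent subproblem that can be solved
    // simultaneously on a multi-core processor.
    val left  = Future(numbers.slice(0, mid).sum)
    val right = Future(numbers.slice(mid, numbers.length).sum)

    // Combine the partial results into the final answer.
    val total = Await.result(left.zip(right).map { case (a, b) => a + b }, Duration.Inf)
    println(total) // 50005000, the same as numbers.sum
  }
}
```

On a multi-core machine, the two partial sums can genuinely run at the same time, which is exactly the opportunity that parallel hardware provides.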
It is interesting to note that these three stories share a common theme: whether the need for faster computation arose from the sluggishness of a mechanical computer, the limitations of the available technology, or the physical boundaries imposed by the power wall, additional computing power was provided through parallelization.

This brings us to the next big question: why do we need parallel computing? As you will learn in this course, parallel programming is much harder than sequential programming. There are several reasons for this. First, separating a sequential computation into parallel subcomputations is a challenging mental task for the programmer, and is sometimes not even possible. Usually this means identifying independent parts of the program and executing them at the same time; however, some problems cannot be divided this way, as shown in this figure. The second source of difficulty is ensuring correctness. We will see that many new types of errors can arise in parallel programming, making the programmer's life harder; a small sketch of one such error follows at the end of this passage.

With these increased complexities, why then would we bother writing parallel programs at all? As the earlier anecdotes show, the main benefit of parallelization is increased program performance. When we write a parallel program, we expect it to be faster than its sequential counterpart. Thus, speedup is the main reason why we want to write parallel programs. In fact, achieving speedup is the third source of difficulty in parallel computing.

The speedup discussion brings us to the next big topic: the relationship of parallel programming to a closely related discipline called concurrent programming. In spite of their connection, parallel and concurrent programming are not the same thing. There are many contrasting definitions of these programming disciplines, and people sometimes disagree on what they are. We will decide on one particular definition and stick to it. A parallel program is a program that uses the provided parallel hardware to execute a computation more quickly. As such, parallel programming is concerned mainly with efficiency. Parallel programming answers questions such as: how do we divide a computational problem into subproblems that can be executed in parallel? Or, how do we make the best use of the underlying parallel hardware to obtain the optimal speedup? A concurrent program is a program in which multiple executions may or may not execute at the same time. Concurrent programming serves to structure programs better, in order to improve modularity, responsiveness to input and output events, or understandability of the program. As such, concurrent programming is concerned with questions like: when can a specific execution start? When and how can two concurrent executions exchange information? And how do computations manage access to shared resources, such as files or database handles?

So, parallel programming is mainly concerned with algorithmic problems, numerical computations, or big data applications. Examples of such applications include matrix multiplication, data processing, computer graphics rendering, or simulation of fluid movement. Concurrent programming is targeted at writing asynchronous applications, such as web servers, user interfaces, or databases. While parallel programming is concerned mainly with speedup, concurrent programming is concerned with convenience, better responsiveness, and maintainability. The two disciplines often overlap: parallel programming may rely on insights from concurrent programming, and vice versa.
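To illustrate the kind of correctness error mentioned above, here is a small hypothetical sketch, not from the lecture, of a data race in Scala: two threads update a shared counter without synchronization, so increments can be lost and the result varies between runs.

```scala
object RaceDemo {
  def main(args: Array[String]): Unit = {
    var counter = 0

    // Two threads increment the shared counter without synchronization.
    // counter += 1 is not atomic (read, add, write), so updates can be lost.
    val t1 = new Thread(() => for (_ <- 1 to 100000) counter += 1)
    val t2 = new Thread(() => for (_ <- 1 to 100000) counter += 1)
    t1.start(); t2.start()
    t1.join(); t2.join()

    // Often prints less than 200000, and a different value on each run --
    // a nondeterministic error with no sequential counterpart.
    println(counter)
  }
}
```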
Concurrent programming may be used to solve parallel programming problems; however, neither discipline is a superset of the other.

Having more clearly established what parallel programming is, let's take a look at various forms of parallelism. It turns out that parallel computing manifests itself at various granularity levels. For example, some microprocessors historically had a four-bit word size. Numbers that did not fit into four bits had to be represented with multiple words. This meant that variables in the eight-bit range, such as x and y, had to be represented with eight bits, that is, two words. Adding these two variables together required two separate add instructions, each operating on four-bit words. Word size subsequently increased to 8 bits, then 16 bits, and eventually 32 bits. Each time this happened, several instructions could be replaced by a single one, which meant processing more data in the same time interval. In effect, performance was improved by parallelizing operations at the bit level, so we call this form of parallelization bit-level parallelism. A decade ago, we witnessed the switch from 32-bit to 64-bit architectures in commercial processors. More recently, a similar effect was achieved through the introduction of vector instructions in Intel and ARM processors.

On a more coarse-grained scale, parallelism can be achieved by executing multiple instructions at the same time. Consider the following example, sketched in the code after this passage. Here, the processor has no reason not to compute b and c in parallel, because neither b nor c depends on the other to be computed. This form of parallelism is thus called instruction-level parallelism. Note that the computation of d, however, can only start after both b and c have been computed. So, in summary, instruction-level parallelism executes different instructions from the same instruction stream in parallel whenever this is possible. Finally, task-level parallelism deals with the parallel execution of entirely separate sequences of instructions, which can execute on the same or on entirely different data. Bit-level and instruction-level parallelism are exploited by the underlying parallel hardware; in other words, they are in most cases implemented inside the processor itself. This course will mainly focus on task-level parallelism, which is usually achieved through software support. This level of granularity will allow us to express general parallel algorithms.

The different forms of parallelism that we just saw manifest themselves on concrete parallel hardware. It turns out that there are many classes of parallel computers out there, and we will take a look at some of them. A multi-core processor is a processor that contains multiple processing units, called cores, on the same chip. Processor vendors such as Intel, AMD, ARM, Oracle, IBM, and Qualcomm produce different models of multi-core processors. A symmetric multiprocessor, or SMP, is a computer system with multiple identical processors that share memory and connect to it via a bus. Here, the multiple execution units are not on the same chip. An SMP itself can contain multi-core processors, as is the case in this illustration. A general-purpose graphics processing unit, or GPGPU, is a form of co-processor originally intended for graphics processing. As a co-processor, it does not execute all user programs by default, but can execute a program when this is explicitly requested by the host processor.
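The instruction-level example referred to above is not reproduced in this transcript; it presumably resembled the following sketch, where the names b, c, and d come from the narration and the concrete arithmetic is an assumption:

```scala
def compute(a: Int): Int = {
  val b = a + 10 // b does not depend on c,
  val c = a * 2  // and c does not depend on b: the processor
                 // may execute these two instructions in parallel
  val d = b + c  // d needs both b and c, so it must wait for them
  d
}
```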
A field-programmable gate array, or FPGA, is another form of co-processor, one which can rewire itself for a given task. Its advantage is improved performance when it is optimized for a particular application. Usually, FPGAs are programmed in hardware description languages, such as Verilog and VHDL. The final class of parallel computers that we will mention are computer clusters: groups of computers connected via a network that do not have a common shared memory. Sometimes the computers in a cluster also have co-processors, such as GPUs, in which case we call them heterogeneous clusters.

In this course, we focus mainly on multi-core processors and symmetric multiprocessors, which are closely related. All our examples and code will be optimized for multi-cores and SMPs. However, the algorithms that we will learn about generalize, and they can be implemented on most forms of parallel hardware.

In this lecture, we learned some basics of parallel computing. The rest of this week will focus on simple parallel programming examples and on the performance analysis of parallel programs. The second week deals with task parallelism and some basic parallel algorithms. The third week focuses on data parallelism and data-parallel computations, with a preview of Scala parallel collections. Finally, the fourth week will focus on data structures that are tailored for parallel computations. In the next lecture, we're going to learn how parallel computations are implemented on the JVM. We will see the basic primitives used for parallelism and learn about some concrete foundations for parallel programming.
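As an informal preview of those primitives, here is a minimal sketch, assuming nothing beyond the standard JVM API, of starting a computation on a separate thread from Scala; the lecture's own examples may differ:

```scala
object HelloThreads {
  def main(args: Array[String]): Unit = {
    // java.lang.Thread is the JVM's most basic unit of parallel execution.
    val t = new Thread(() => println("Hello from a separate thread"))
    t.start()                             // begins running in parallel
    println("Hello from the main thread") // runs at the same time as t
    t.join()                              // wait for t to finish
  }
}
```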