Hi everybody! In this lecture I am going to introduce you to Xilinx SDAccel. This toolchain was released a few years ago and has been attracting a lot of attention, thanks also to its integration inside the Amazon EC2 F1 instances. So, let’s start!

The first thing we have to introduce is the Hardware Design Flow, or HDF. If we consider the particular case of High Performance Computing, from a high-level perspective it can be defined as a two-step process that starts with a high-level description of the FPGA code, written for example in C or C++. Once this high-level representation has been provided, the first step is called High Level Synthesis, or HLS. This step involves the optimization of the FPGA code and the production of a lower-level representation of it. In particular, it generates code written in HDL, which, as we know, stands for Hardware Description Language. Most of the time, this step produces what is known as an IP core, a packaged hardware component that can then be used in the following step of the Hardware Design Flow.

Now that we have an IP core, we can move forward to the second step, namely the System Level Design. This second step allows us to integrate the IP core we defined in the previous step into our FPGA, and to generate the final bitstream that will then be used to configure the FPGA for the final execution. Furthermore, during this step the runtime drivers for the final execution, as well as a host for our FPGA, are defined.

Let us now have a closer look at the flow, trying to highlight in a bit more detail all the steps it is composed of. As previously mentioned, at the very beginning we have to provide as input a high-level representation of our code in C or C++. After providing this high-level code, the first step is usually a profiling of all the functions available within the provided code. This operation is performed to identify the bottlenecks of the functions we want to accelerate. In particular, we are interested in the runtime of the different functions inside our application, trying to identify the portions of code where the application spends most of its runtime. Basically, we look for the functions that respect the “90/10” rule: 90% of the application runtime is spent in 10% of the code, as described in the book by Hennessy and Patterson, “Computer Architecture: A Quantitative Approach”. Such functions may be good candidates for hardware acceleration on the FPGA.

Now that good candidates for acceleration have been identified, we can start with the High Level Synthesis step by translating our candidate functions to HDL. The translation produces the aforementioned IP cores, which are basically hardware cores that can be implemented on our FPGA. With these representations, we can test the generated code by performing a functional simulation. This simulation works by creating test vectors for the generated IP cores and by checking the correctness of the results of the computation. At this point, it is also possible to have a first estimate of the performance. So let’s say that we got to this point, but we are not fully satisfied with the solution that has been implemented, or with its estimated performance. The HLS step is an iterative process; hence, we may need several iterations before converging to a good solution! What we can do is revise our solution, for example by changing the optimizations we are applying or by increasing the level of optimization.
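To make this a bit more concrete, here is a minimal sketch of what such a candidate function and its functional-simulation testbench might look like. This is a hypothetical vector-addition example, not code from the lecture: the kernel name vadd, the interface choices, and the specific pragmas are only one possible way to annotate a C++ function for Vivado HLS.

```cpp
#include <iostream>

// Hypothetical candidate function: a simple vector addition written in C++.
// The #pragma HLS directives are standard Vivado HLS annotations; the interface
// choices (AXI master ports for the arrays, AXI-Lite for control) are only one
// possible configuration.
extern "C" void vadd(const int *a, const int *b, int *out, int size) {
#pragma HLS INTERFACE m_axi     port=a      offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=b      offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out    offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=a      bundle=control
#pragma HLS INTERFACE s_axilite port=b      bundle=control
#pragma HLS INTERFACE s_axilite port=out    bundle=control
#pragma HLS INTERFACE s_axilite port=size   bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1  // ask HLS to start a new loop iteration every clock cycle
        out[i] = a[i] + b[i];
    }
}

// A tiny testbench for the functional (C) simulation: it creates test vectors,
// runs the candidate function, and checks the results against expected values.
int main() {
    const int N = 1024;
    static int a[N], b[N], out[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    vadd(a, b, out, N);

    for (int i = 0; i < N; i++) {
        if (out[i] != a[i] + b[i]) {
            std::cout << "Mismatch at index " << i << std::endl;
            return 1;
        }
    }
    std::cout << "Functional simulation passed" << std::endl;
    return 0;
}
```

In the C simulation this code runs entirely in software; the same testbench can later be reused to check the generated HDL during co-simulation, and the pragmas are the typical place where we iterate when the estimated performance is not yet satisfactory.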
If, on the other hand, we are happy with the output of this step, we can move on to defining all the interfaces with the memory, as well as integrating the produced module into the surrounding system infrastructure. In this phase we define how the core communicates with the rest of the system, how the computation is performed, and various other details, such as whether the core will need a host device to manage the communication or not. Once all this information has been specified, we are able to generate the final bitstream, which will then be necessary to execute our code on the FPGA. This phase involves multiple steps, among which synthesis, floorplanning, and place and route, to determine how the different parts of the algorithm, now defined as a hardware architecture, can be placed on the FPGA chip.

It is important to notice that the bitstream is usually not the only artifact needed to execute our code on the FPGA. In fact, to start the execution on the FPGA or to provide data for the computation, we also need to generate multiple drivers. These drivers are used to manage the runtime of the application, as well as to read the final result of the computation provided as output by our FPGA (we will see a small sketch of what this host-side code looks like at the end of this lecture).

Over the years, multiple tools have been developed to ease the design process. Considering the tools produced by Xilinx, the first step of the Hardware Design Flow, highlighted in green in this picture, is performed by using Vivado HLS, a tool that helps the designer in the process of performing the High Level Synthesis. As for the second step, the System Level Design, it is performed by using Xilinx Vivado, which, given the IP cores produced by Vivado HLS, allows the user to specify the surrounding architecture and different strategies to produce the final bitstream. Although Vivado and Vivado HLS have been around for many years, only recently has Xilinx released a tool that is capable of completely abstracting the two steps of the Hardware Design Flow. This tool is exactly what we were looking for: “our” Xilinx SDAccel. It automates the great majority of both the High Level Synthesis and the System Level Design steps, increasing the overall productivity.
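Since SDAccel exposes the FPGA through the OpenCL programming model, it may help to see, very roughly, what the host side of the final execution looks like. The following is a hedged sketch only: the file name vadd.xclbin and the kernel name vadd refer to the hypothetical example above, and all error handling is omitted for brevity.

```cpp
#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>
#include <iostream>

int main() {
    const int N = 1024;
    std::vector<int> a(N, 1), b(N, 2), out(N, 0);

    // 1. Discover the platform and the FPGA device.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);

    // 2. Load the bitstream container (xclbin) produced by the System Level Design step.
    std::ifstream bin("vadd.xclbin", std::ios::binary);
    std::vector<unsigned char> binary((std::istreambuf_iterator<char>(bin)),
                                      std::istreambuf_iterator<char>());
    size_t binary_size = binary.size();
    const unsigned char *binary_ptr = binary.data();
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &binary_size,
                                                   &binary_ptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "vadd", nullptr);

    // 3. Allocate device buffers and move the input data to the FPGA memory.
    cl_mem d_a   = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(int), nullptr, nullptr);
    cl_mem d_b   = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(int), nullptr, nullptr);
    cl_mem d_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(int), nullptr, nullptr);
    clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, N * sizeof(int), a.data(), 0, nullptr, nullptr);
    clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, N * sizeof(int), b.data(), 0, nullptr, nullptr);

    // 4. Set the kernel arguments and launch the accelerator.
    int size = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_out);
    clSetKernelArg(kernel, 3, sizeof(int), &size);
    clEnqueueTask(queue, kernel, 0, nullptr, nullptr);

    // 5. Read back the result of the computation from the FPGA.
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, N * sizeof(int), out.data(), 0, nullptr, nullptr);
    clFinish(queue);

    std::cout << "out[0] = " << out[0] << std::endl;  // expect 3 with these test inputs
    return 0;
}
```

The bitstream, packaged in the xclbin container, is loaded by the host at runtime; the host program then moves the data, starts the accelerator, and reads back the final result, which is exactly the role of the runtime support mentioned above.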