Hi. Welcome to lesson 3 of module 5, the last lesson in this course. Here we are in our class agenda. In this lesson, we will cover FPGA-specific features that allow for flexibility and customization to design highly optimized systems. These include single work-item execution, channels, controlled hardware generation, libraries, SoC platforms, and custom boards.

Single-threaded kernels differ from NDRange kernels. In the former, a kernel is launched with one thread or work-item at a time. Meaning, if you use the clEnqueueNDRangeKernel host call to launch kernels, then the global work size, local work size, and dimensions are all set to one. In OpenCL, this execution is called a task. Single-threaded execution of kernels utilizes a sequential programming model that closely matches many algorithms written in C, making them faster and easier to port. This model is very powerful when applied to FPGA compilation, because with single work-item execution, you can rely on the compiler to pipeline your loops and extract parallelism instead of writing your kernels in the NDRange data-parallel fashion that may not match your execution model. NDRange requires you to have data available and divide it into workgroups prior to kernel launch. This is not feasible for many applications, especially streaming ones. In these cases, single-threaded kernels are appropriate to take advantage of loop parallelism. If you have an algorithm where results consistently depend on previous results, such as compression algorithms, or sequential algorithms such as FIR filters, then single-threaded kernel execution is perfect for that.

Here's how the compiler works when it performs the analysis. For a single work-item kernel, the compiler performs automatic analysis and pipelining of kernel loops. It first examines your kernel code, which looks like traditional C programming code, and then notices the data dependencies between iterations of your loops.
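As a concrete sketch of the kind of loop the compiler analyzes, here is an accumulator with a loop-carried dependency written in plain C style. The function name and data are illustrative, not taken from the course materials:

```c
/* Single work-item style sketched as plain C: the accumulator `sum`
 * carries a dependency from one loop iteration to the next. On the
 * FPGA, the compiler implements this as a hardware feedback path
 * rather than serializing the whole loop. */
int accumulate(const int *data, int n) {
    int sum = 0;                /* initial value held in a register */
    for (int i = 0; i < n; i++) {
        sum += data[i];         /* iteration i depends on iteration i-1 */
    }
    return sum;
}
```

Because the body reads like ordinary sequential C, the offline compiler treats it as a task and pipelines the loop itself.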
In this example, we have an accumulator with an initial value. The compiler will recognize that this accumulator depends on prior results. It will then connect the proper FPGA resources to make a physical hardware connection to feed back that value. If the operation were more complex and required more than one clock cycle, the compiler would automatically build the hardware with that timing accounted for. These connections are easy and cheap with FPGA resources, and there's no need for the user to worry about building up any buffers; it happens automatically. When the offline compiler analyzes your code, it will check for dependencies. When the last dependency is reached, it is then free to launch another iteration of the loop.

Here's a code example of that type of analysis. In this example, we see that the code in orange is implementing a shift register. Every iteration of the shift register is going to depend on the one before it, so that means iterations of the orange loop are dependent on one another. Furthermore, since the code in blue is operating on the result of the orange code, the blue code is also dependent on the orange code completing. However, the blue code is implementing a matrix dot product and summation, and the iterations of the blue loop are not dependent on each other in order to complete. We'll see how the compiler intelligently launches iterations of the loop based on this in the next slide.

This diagram shows what loop pipelining looks like. Without loop pipelining, on the left, loops are executed serially. Each successive iteration must wait for the previous iteration to exit the pipeline completely before entering. With loop pipelining, assuming dependencies are resolved by the compiler in one clock cycle, the pipeline is fully populated and we can launch an iteration every clock cycle.
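The two kinds of loops just described, a dependent shift register followed by an independent multiply-accumulate, can be sketched in plain C as a small FIR-style step. Names and tap count are illustrative, not from the course slides:

```c
#define TAPS 4

/* Sketch of the dependency analysis example: the shift-register loop's
 * iterations depend on one another (each element takes the previous
 * element's value), while the dot-product loop's multiplications are
 * independent of each other and can be launched every clock cycle. */
int fir_step(int shift_reg[TAPS], const int coeff[TAPS], int sample) {
    /* "orange" loop: dependent iterations, shifting in the new sample */
    for (int i = TAPS - 1; i > 0; i--)
        shift_reg[i] = shift_reg[i - 1];
    shift_reg[0] = sample;

    /* "blue" loop: dot product and summation; iterations independent */
    int sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += shift_reg[i] * coeff[i];
    return sum;
}
```

The blue loop still waits on the orange loop's result, but within the blue loop the compiler is free to overlap iterations.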
Notice on the right, the dependencies denoted in orange do not overlap, since each successive iteration must wait for the dependency to be resolved, but we can easily overlap the non-dependent logic execution and achieve much faster throughput.

This diagram compares loop pipelining to parallel threads. With NDRange parallel execution, each thread is launched on successive clock cycles. With loop pipelining, assuming dependencies can be resolved in one clock cycle, iterations can also be launched every clock cycle. Loop pipelining looks like multi-threaded NDRange pipeline execution but with the added ability to resolve dependencies across iterations. With loop pipelining, the data dependencies are resolved without adding extra compute time. Of course, with NDRange threads, you do have the ability to widen the pipeline as well.

When comparing single work-items to NDRange kernels, one approach is not always better than the other; it really depends on the application. Single work-item kernels with loop pipelining work better in situations ill-suited for NDRange data-parallel execution. These include cases where sequential data processing is critical due to dependencies. Scenarios for a single work-item include: if an algorithm cannot be easily broken down into work-items, if not all data is available prior to kernel launch, such as in streaming applications, or in cases where data cannot be easily partitioned into workgroups. On the other hand, NDRange kernels are ideal if the kernel does not have loop and memory dependencies from thread to thread. With NDRange kernels, the FPGA OpenCL compiler has the ability to further improve the computation bandwidth of the kernel beyond that allowed by a pipelined single work-item. This is achieved by constructing SIMD vector lanes that can process multiple work-items in lockstep, or even by allowing multiple compute units to process workgroups in parallel.
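The contrast between the two styles can be sketched in plain C, with illustrative names. In the NDRange style, the kernel body handles one element identified by its global ID (emulated here by the gid parameter standing in for get_global_id(0)); in the single work-item style, the kernel is one ordinary loop that the compiler pipelines:

```c
/* NDRange style: the body processes a single element; the runtime
 * launches one instance of this body per work-item. */
void vec_add_ndrange_body(const int *a, const int *b, int *c, int gid) {
    c[gid] = a[gid] + b[gid];   /* one work-item, one element */
}

/* Single work-item (task) style: one kernel, one loop; the offline
 * compiler pipelines the loop and launches iterations every cycle
 * when dependencies allow. */
void vec_add_task(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

For an embarrassingly parallel operation like this, either style works; the single work-item form becomes preferable once iterations depend on earlier results.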
In order to implement an OpenCL single work-item kernel, which is also called a task, in your host code you can simply launch a kernel with an NDRange of one in every dimension: the work dimension, the global work size, and the local work size, as seen in the parameters passed in the command above. An easier equivalent is to launch a single work-item kernel as a task using clEnqueueTask, as seen in the second command. On the kernel side, the offline compiler will know your kernel is a single work-item kernel if it does not query any work-item information such as the global ID; in other words, if your kernel looks like traditional C code. The compiler works automatically on tasks to parallelize their operations. It will automatically pipeline and unroll loops when it can. When it can't, it will perform analysis and is capable of reporting the analysis back to you so that you can change your code and increase its performance. This is an Intel FPGA-specific feature, and it can increase the performance of some algorithms that don't map well to conventional NDRange kernels.

Another important implementation technique for using OpenCL with Intel FPGAs is channels. In the default OpenCL execution model, the host processor controls data movement and kernel launches. However, many times data enters the FPGA through a streaming I/O standard such as 10 Gigabit Ethernet. With streaming data, having the host processor control all data movement can incur significant penalties. Pipes and channels allow kernel compute units to run and access data as it becomes available. Pipes are part of the OpenCL standard, while channels, which are supersets of pipes, are an Intel FPGA vendor extension. Both are implemented using hardware FIFOs.
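To make the FIFO behavior concrete, here is a minimal software model of a channel: a fixed-depth first-in-first-out buffer, which is the structure the compiler builds in FPGA fabric. This is an illustrative sketch only, not the actual Intel channel API (kernel code uses the vendor extension's read/write channel calls rather than anything like this):

```c
#include <stdbool.h>

#define DEPTH 8   /* illustrative channel depth */

/* Software model of a hardware FIFO channel. */
typedef struct {
    int buf[DEPTH];
    int head, tail, count;
} channel_t;

/* Returns false when the FIFO is full: in hardware, the writing
 * kernel stalls until space is available. */
bool channel_write(channel_t *ch, int v) {
    if (ch->count == DEPTH) return false;
    ch->buf[ch->tail] = v;
    ch->tail = (ch->tail + 1) % DEPTH;
    ch->count++;
    return true;
}

/* Returns false when the FIFO is empty: in hardware, the reading
 * kernel stalls until data arrives. */
bool channel_read(channel_t *ch, int *v) {
    if (ch->count == 0) return false;
    *v = ch->buf[ch->head];
    ch->head = (ch->head + 1) % DEPTH;
    ch->count--;
    return true;
}
```

The stall-on-full and stall-on-empty behavior is what lets producer and consumer kernels run concurrently without the host coordinating every transfer.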
There are three types of channels: I/O channels allow data access from an interface of the FPGA, kernel-to-kernel channels allow running kernels to access data between each other, and host pipes allow the host to write data directly into the kernels without the data first having to go into global memory. Host pipes let the host send and receive data to and from the kernels without global memory, which provides a performance advantage and improves peak host-to-device bandwidth. In this pseudocode example, two pipes are created in the host code: a read pipe, where the host reads data from, and a write pipe, where the host writes data to. Flags in the clCreatePipe call specify pipe permissions. The host reads and writes to and from these pipes to communicate with the kernel. There's additional code in the host that sets the kernel arguments to bind the host pipes to the kernel. The kernel can then access the pipes to read and write accordingly.

The next Intel FPGA-specific feature is OpenCL attributes. Attributes allow a kernel developer to control hardware generation. With the autorun kernel attribute, kernels start running without the host initiating them. With pragma unroll, we can specify loop hardware unrolling, replicating hardware to control the number of loop iterations that execute in parallel. You can also control memory implementation and system topology. For example, the local_mem_size attribute declares the size of the local memory attached to pointer B.

OpenCL libraries, another Intel FPGA-specific feature, allow you to incorporate functions written in an RTL language such as VHDL or Verilog into your FPGA design. This is done by using libraries. To create a library from your RTL code, there are a few steps required, which include writing an XML file that describes the interfaces of your component. You then call the component as a function in your OpenCL kernel code and point to the library when you compile.
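Returning to the unroll attribute mentioned above, its effect can be sketched by writing the replication out by hand in plain C. This is an illustrative sketch of what an unroll factor of four means for the generated hardware, with an assumed input length that is a multiple of four for brevity:

```c
/* Hand-unrolled equivalent of applying an unroll factor of 4 to a
 * simple accumulation loop: four additions per trip correspond to
 * four replicated adders operating in parallel in the fabric.
 * Assumes n is a multiple of 4 (illustrative simplification). */
int sum_unrolled(const int *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }
    return sum;
}
```

In a real kernel you would keep the simple rolled loop and let the pragma direct the compiler to replicate the hardware, trading area for throughput.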
You will need an XML file that describes the properties of the RTL component, the RTL source file, as well as an OpenCL emulation model file for when you're not running on the actual hardware.

The Intel FPGA SDK for OpenCL supports writing kernels for SoC FPGAs. When running OpenCL on these parts, you will use an embedded ARM processor as your host instead of relying on an external host. The CPU and the FPGA can also share DDR memory on the SoC device. Using shared memory instead of dedicated FPGA DDR is recommended. All you have to do is mark the buffers shared between kernels as volatile. This ensures that buffer modification by one kernel is visible to the other.

Finally, if you choose to, you can build your own custom platform for a custom board to be compatible with the Intel FPGA SDK for OpenCL. Building a custom BSP (Board Support Package) is not trivial. You will need all the traditional FPGA skills, including PCIe, memory interfaces, timing closure, interconnects, Platform Designer, system integration, and incremental compilation, among other things. In addition, you'll need software drivers and board bring-up skills. Here's an overview of the custom platform deliverables. On the left, you see the host application running on the host processor, and on the right, in gray, is the hardware accelerator board containing the FPGA. Everything in red needs to be provided by the designer for your custom platform. On the FPGA side, your custom platform will need to include a post-place-and-route netlist including all the hardware necessary to communicate with the host and the memory. That will include DDR and/or QDR memory interfaces, DMA, a host interface, which can be PCIe, and any streaming interface intended to be implemented as channels. When an OpenCL kernel is compiled by the offline compiler, a custom dataflow circuit representing your kernel will be generated and connected to the board support package hardware.
This concludes Module 5, Introduction to OpenCL for Intel FPGAs. Module 5 summary. Here are the key takeaways from this module. In this module, we learned how to use the Intel FPGA SDK for OpenCL to compile OpenCL kernels to target an FPGA. We described the tools in the Intel FPGA SDK for OpenCL used to analyze the results of OpenCL compilation. We learned how to debug an OpenCL kernel for functionality and performance.

This concludes the course, Introduction to OpenCL for Intel FPGAs. Thank you very much for attending. Here are the key takeaways from this course. In Module 1, we learned the challenges of parallel systems and how OpenCL solves these challenges. In Module 2, we then learned the framework for running OpenCL programs, in particular the OpenCL platform and host-side software. In Modules 3 and 4, we learned how to write different types of kernels to execute on devices. Lastly, in Module 5, we delved deeper into the Intel FPGA SDK for OpenCL and learned how to use it to target the FPGA. Thank you.