We're now going to explore two different types of OpenCL kernels: NDRange kernels and single work-item kernels. Both are important. Before talking about NDRange kernels, it is important to understand data parallelism. NDRange kernels are the traditional way of writing OpenCL. They provide a way of explicitly expressing data parallelism when doing computation. The goal when implementing an NDRange kernel is to break up the problem into fine-grained components so that each individual computation becomes simple and quick. Think of the classic vector add once again as an example. When the problem is broken down into many identical concurrent threads, the individual threads are called work items. The complete amount of work to do is called the NDRange.

One way to look at this is by picturing a factory whose purpose is to build cars. We can break this down into a two-step process: one step assembles the engine while the other assembles the body. In this scenario one worker assembles one part of the engine or the body. For each car, a group of workers assembles the engine and a group of workers assembles the body. Once the work is complete, the workers can leave the workshop. It's very important that each worker only works on one process for one car. The car parts are stored in a central warehouse. Workers go to the central warehouse for parts and return there to deliver the finished car. We can then say that the total number of cars to build is the global dimensions, a worker is a work item, and a group of workers is a work group. The work that is done on the assembly line, in this case the two processes, one for the engine and the other for the body, corresponds to the kernels. The entire factory is a device, and finally the workshops and machinery inside are the compute units.

For the kernels, OpenCL C code is written to run on the OpenCL device. Kernels provide data parallelism with NDRange launches, and the same kernel is executed by all the different data-parallel threads of a single launch. The key concept when launching the kernel is the work item. When we launch a kernel in a data-parallel fashion, the unit of concurrent execution is the work item, so each work item will execute the exact same kernel function independently. When writing the kernel, we usually map a single iteration of a loop in standard C to a work item, and usually we would generate as many work items as there are elements in the input and output arrays. The kernel operations are mapped to hardware at runtime.

Here's example code for a simple vector add represented as an NDRange kernel. Notice again that the kernel function is preceded by the keyword kernel and that it returns the type void. Here we have three arguments. They are all integer arrays: A and B are the inputs and C the output. The body of the function represents a single iteration of the loop we would have written in standard C programming. When we launch the kernel, we do so with N work items, N matching the array size. In the first line of the body of the kernel, we use get_global_id(0) to retrieve the position of the work item across the N items. This is like the index counter in a loop. In the second line of the kernel body the computation is done. We take the elements at the same index in A and B, add them together, and store the result in the corresponding index of C. So each execution of the kernel performs one add of the vector add. The global dimensions encapsulate all the work items.
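A minimal sketch of what that kernel might look like is shown here; the kernel name and the argument names A, B and C are illustrative assumptions based on the description above.

kernel void vector_add(global const int *A,
                       global const int *B,
                       global int *C)
{
    // Position of this work item across the N work items, like a loop counter
    int i = get_global_id(0);

    // One iteration of the original loop: add the elements and store the result
    C[i] = A[i] + B[i];
}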
Now, the global dimensions, or NDRange, form an N-dimensional range where N can be one, two or three. You can think of it as a three-dimensional array of work items. We use it to launch kernels, and most often its dimensions match up to the dimensions of the input or output data array. For the example vector add we just saw that operated over N items, the global dimension size in the first dimension was N, and in the second and third dimensions the size was one. In the example on the slide, we have 512 work items that have been divided into a space of 512, one and one for each dimension. The global dimensions are set when we launch the kernel, per the OpenCL standard.

As for how data divides across dimensions: in the case of audio, you have a one-dimensional series of samples and process each sample independently; the global size is the total number of samples. Images map well to two-dimensional data, and the global dimensions are the total number of pixels. For a physics stress simulation we use three-dimensional data to model the behavior of materials, and the global dimensions are a representation of the 3D space.

We can divide the entire global dimensions into smaller, equally-sized workgroups. These are also called the local dimensions. The global dimensions and the workgroups must have the same number of dimensions, and the size of the global dimensions must be evenly divisible by the workgroup size in each dimension. Let's continue the example from the last slide. If we have 512 work items in the first dimension, we can create eight workgroups by setting the workgroup size to be 64 in the first dimension and one in the second and third dimensions. Like the global dimensions, the workgroup size is set at launch time. If we don't specify a workgroup size, the size is automatically generated. However, this can lead to a suboptimal workgroup size, so it is recommended to always specify the size. It's important to note as well that within the kernel, synchronization can only be done amongst work items within the same work group and not across work group boundaries.

Here's a graphical representation of two-dimensional global dimensions. On the left we see the global dimensions comprising workgroups. Then on the right you see each work-item within the work group, also represented in a 2-D manner. You can see the structure is a hierarchy: the global dimensions are composed of workgroups, and each workgroup is composed of work-items. So to get the total number of work-items, multiply the area of the square on the right, in work items, by the area of the square on the left, in workgroups.

Launching an NDRange kernel involves more arguments than launching a task, and here is the host API call to do that. clEnqueueNDRangeKernel needs the command queue as the first argument. The second argument is the kernel object. The third argument is the number of dimensions of the global dimensions, which can be one, two or three. The fourth argument is the offset for the global IDs in each dimension; you can pass NULL if you are not using any offsets. The fifth argument is the global work size, or the global dimensions, which is expressed as an array when the number of dimensions exceeds one. The next argument is the local work size, or the workgroup size. Arguments to control events are also present, as they are in every clEnqueue command.

Here's example code to launch two NDRange kernels. In the top kernel launch, we're launching the kernel with a work group size of eight, which corresponds to the local work size, and global dimensions of 1920, which correspond to the global work size.
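As a rough sketch of what those two host-side launches could look like, covering both the one-dimensional launch just described and the three-dimensional launch discussed next (the command queue name queue and the kernel object names kernel1d and kernel3d are illustrative assumptions):

// 1-D launch: 1920 work items in total, work groups of 8
size_t global1 = 1920;
size_t local1  = 8;
clEnqueueNDRangeKernel(queue, kernel1d, 1, NULL, &global1, &local1,
                       0, NULL, NULL);

// 3-D launch: 512 x 512 x 512 work items, work groups of 16 x 8 x 2
size_t global3[3] = {512, 512, 512};
size_t local3[3]  = {16, 8, 2};
clEnqueueNDRangeKernel(queue, kernel3d, 3, NULL, global3, local3,
                       0, NULL, NULL);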
In the second half, we are launching a three-dimensional NDRange kernel. We're specifying global dimensions of 512 in each dimension and a work group size of 16 by eight by two.

OpenCL kernels have functions to query the NDRange properties determined at kernel launch time. Most take the dimension as a uint argument, zero to two. get_work_dim() returns the number of dimensions in use. get_global_size(dim) returns the total number of work items in a dimension. get_local_size(dim) returns the size of the work group in a dimension, and get_num_groups(dim) returns the number of work groups in a dimension.

OpenCL kernels also have functions to identify the current work item executing the kernel. These functions are most often used to dereference data pointers, essentially providing an index into the input and output data arrays. The functions available are get_global_id(dim), which provides the index of the work item within the global space, get_local_id(dim), which provides the index of the work item within the local or work group space, and get_group_id(dim), which provides the index of the current workgroup. The argument of these functions indicates the dimension within the space. The functions to query properties of the global dimensions are shown at the top in the box.

The second type of kernel we'll discuss is the single work-item kernel, or task. In order to implement this type of kernel in your host code, you simply launch a kernel with global dimensions of one in every dimension, or launch it as a task using clEnqueueTask. On the kernel side, the offline compiler will know your kernel is a single work-item kernel if it does not query any work-item information such as the global ID, or in other words, if it looks like traditional C code. The compiler works automatically on tasks to parallelize operations. It will automatically pipeline and unroll loops when it can. This is an Intel FPGA-specific feature. It can increase the performance of some algorithms that don't map well to conventional NDRange kernels.

Traditionally, the model of OpenCL kernel execution has been to launch kernels across a large NDRange. But this requires you to have the data available and divided into work groups prior to the kernel launch. However, this is not feasible for many applications, especially streaming ones. In these cases, single-threaded kernels are appropriate because they take advantage of loop parallelism. If you have an algorithm where results consistently depend on previous results, such as a compression algorithm, or a sequential algorithm such as an FIR filter, then single-threaded kernels are perfect. Single-threaded kernels are also similar to a traditional C coding style, making them even faster and easier to port.

Here's an example of how single work-item kernels are optimized by the offline compiler. The type of parallelism that is achieved is called pipeline parallelism. It is also sometimes called loop parallelism since it originates from the compiler optimizing loops in your kernel. This example is showing a vector add.
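As a hedged sketch, the same vector add written as a single work-item kernel might look like the following; the kernel never queries a work-item ID, so the offline compiler treats it as a task and pipelines the loop (the kernel name and the length argument N are assumptions for illustration).

kernel void vector_add_task(global const int *A,
                            global const int *B,
                            global int *C,
                            int N)
{
    // Traditional C-style loop: the load, add and store of each iteration
    // become stages of the pipeline generated by the offline compiler
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

On the host, such a kernel could be launched with clEnqueueTask(queue, kernel, 0, NULL, NULL) or with clEnqueueNDRangeKernel using global dimensions of one in every dimension.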
On each clock cycle, input data for one iteration of the loop enters the pipeline, and different portions of the pipeline perform processing at the same time. After an initial latency to fill up the pipeline, you can see that on each cycle every pipeline component is working on an iteration. In the next clock cycle of the animation, you can see that iteration 3 gets loaded, iteration 2 passes through the adder, and iteration 1 gets stored. Subsequent cycles work in the same manner, and this makes very efficient use of your hardware.

Here's how the compiler works when it performs the analysis and pipelining of single work-item kernel loops. It first examines your kernel code, which looks like traditional C programming code. It then notices the data dependencies between iterations of your loops. In this example, you have an accumulator with an initial value. The compiler will recognize that the accumulator depends on the prior result. It will then connect the proper FPGA resources to make a physical hardware connection to feed back that value. If the operation were more complex and required more than one clock cycle, the compiler would automatically build the hardware with that timing accounted for. These connections are easy and cheap with FPGA resources. There is no need for the user to worry about building up any buffers; it happens automatically.

The offline compiler analyzes your code and checks for dependencies. Then, when the last dependency is resolved, it is free to launch another iteration of your loop. Here's an example of that type of analysis. In the example, we see the code in orange is implementing a shift register. Every iteration of the shift register is going to depend on the one before. That means iterations of the orange loop are dependent on one another. Furthermore, since the code in blue is operating upon the results of the orange code, the blue code is also dependent on the orange code completing. However, the blue code is implementing a dot product and summation, so the iterations of the blue loop are not dependent on each other to complete (see the sketch at the end of this section). We'll see how the compiler intelligently launches the iterations of the loops based on this in the next slide. Since the compiler intelligently analyzes the loop dependencies within a single work-item kernel, it knows it does not need to wait for the blue loop to finish each time before launching another iteration of the entire kernel, as shown on the right. So we get fast execution, and it is done with less hardware than an NDRange kernel would take.

When mapping parallel executions of kernels to hardware, the simplest solution at first appears to be replicating the hardware for each work item. An example of this would be a vector add of two arrays where each pair of array elements is fed its own adder. This would enable everything to be done in parallel. With FPGAs, however, this is often not the best method, due to several problems. First, the global dimensions of most OpenCL applications tend to be very large, even in the millions. It is not feasible to implement this many kernel pipelines at once in the FPGA. Secondly, since the FPGA compute bandwidth is not usually the bottleneck in the system, it is inefficient and wasteful to duplicate the compute engines; it would be difficult to keep them all busy since memory bandwidth is often the bottleneck. Lastly, it's unknown at compile time how many work items will be run. Usually the better method to map multi-threaded kernels is to use pipeline parallelism.
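A rough sketch of the loop structure being described, assuming a shift-register length N and illustrative names (not taken from the slide):

#define N 64

kernel void sr_dot(global const float *in,
                   global const float *coef,
                   global float *out,
                   int count)
{
    // Shift register held in FPGA registers
    float shift_reg[N];
    for (int i = 0; i < N; i++)
        shift_reg[i] = 0.0f;

    for (int j = 0; j < count; j++) {
        // "Orange" loop: each iteration depends on the one before it
        for (int i = N - 1; i > 0; i--)
            shift_reg[i] = shift_reg[i - 1];
        shift_reg[0] = in[j];

        // "Blue" loop: dot product and summation over the shift register;
        // its iterations do not depend on one another
        float sum = 0.0f;
        for (int i = 0; i < N; i++)
            sum += shift_reg[i] * coef[i];

        out[j] = sum;
    }
}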
In the early stages of design planning, consider whether constructing your OpenCL kernel as a single work-item kernel might improve performance. Intel recommends that you construct your OpenCL kernel targeting FPGAs as a single work-item kernel if possible. This is especially important if your kernel has data dependencies or if the input data does not arrive all at once before the kernel launch. However, a single work-item kernel is not always the best answer. If your kernel program does not have loop and memory dependencies, you may want to structure your application as an NDRange kernel, because such a kernel can execute multiple work items in parallel efficiently. For more information, please see the single-threaded versus multithreaded kernels online training.

Another important implementation technique for using OpenCL with Intel FPGAs is channels. In the default OpenCL execution model, the host processor controls data movement and kernel launches. However, many times data enters the FPGA through a streaming I/O standard such as 10 Gigabit Ethernet. With streaming data, having the host processor control all data movement can incur significant penalties. Pipes and channels allow kernel compute units to run and access data as it becomes available. Pipes are part of the OpenCL standard, and channels, which are a superset of pipes, are an Intel FPGA vendor extension. Both are implemented using hardware FIFOs. There are three types of channels: I/O channels allow access to the interfaces of the FPGA, kernel-to-kernel channels allow running kernels to pass data between one another, and host pipes allow the host to write data directly into the kernel without the data needing to go to global memory.
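As a hedged sketch of a kernel-to-kernel channel, assuming a recent Intel FPGA SDK for OpenCL release where channels are enabled with the cl_intel_channels extension and accessed with write_channel_intel and read_channel_intel (older releases use the _altera forms); the names producer, consumer and data_chan are illustrative:

#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hardware FIFO connecting the two kernels, with no trip through global memory
channel int data_chan;

kernel void producer(global const int *src, int n)
{
    for (int i = 0; i < n; i++)
        write_channel_intel(data_chan, src[i]);   // push data as it becomes available
}

kernel void consumer(global int *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = read_channel_intel(data_chan);   // consume directly from the channel
}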