Hi, and welcome to this new lesson. In the last one, we saw an example where unrolling by a factor of two provided a 2x reduction in the latency of the loop, at the cost of 2x extra resources for the implementation. That was an ideal case. In fact, it is not always possible to achieve such an ideal latency improvement. When performing loop optimizations, there are two potential issues that need to be considered: constraints on the number of available memory ports and available hardware resources, and loop-carried dependencies. To understand these potential issues, let's consider some more examples. What loop latency would you expect by unrolling the loop four times instead of two? Well, if we assume that the loop body latency remains constant at 10 cycles and the number of loop iterations reduces by a factor of four, we would get a latency of 10 times 256, which is equal to 2560 cycles, four times smaller than the latency of the original loop. Well, this is not totally correct. Indeed, by looking at the synthesis report, we can see that the iteration latency of the loop is now 11 cycles, not 10 as expected. Hence, our loop latency is 2816 cycles instead of 2560. But why do we get this extra cycle, considering that all the loop iterations should be able to be performed in parallel? The issue resides in the load and store operations. As we can see from the analysis report, two read operations on array local_a are scheduled on the first cycle of the loop, while the remaining two read operations are scheduled in the next cycle. The same also happens for the read operations on array local_b and for the write operations on array local_s. The reason for this comes from the fact that local arrays are stored in RAM resources on the FPGA. Each BRAM provides up to two memory ports that can be used to perform read and write operations.
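As a reference, a vector-sum kernel matching the numbers above (1024 iterations, unrolled by a factor of four) might be sketched as follows; the function and array names are illustrative and not taken verbatim from the course code:

```cpp
#define N 1024  // 1024 elements; unrolled by 4 -> 256 loop iterations

// Hypothetical vector-sum kernel. With UNROLL factor=4 the scheduler must
// issue four reads per array per iteration, but a dual-port BRAM serves at
// most two accesses per cycle, so two of the reads slip to the next cycle
// and the iteration latency grows from 10 to 11 cycles.
void vector_sum(const float local_a[N], const float local_b[N],
                float local_s[N]) {
sum_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=4
        local_s[i] = local_a[i] + local_b[i];
    }
}
```

On a regular compiler the pragma is simply ignored, so the same source can be functionally tested in software before synthesis.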
In other words, it is not possible to load more than two memory elements per cycle, and hence Vivado HLS schedules the multiple load operations in different cycles. These delays push two of the four floating-point additions back by one cycle, and the final loop iteration latency becomes 11 cycles. One way to overcome this issue is to store different elements of the arrays in different local memories, in order to increase the number of elements that can be read in parallel. However, this will be discussed when dealing with the array partitioning optimization. So for now, just recall that we have to pay this extra cycle. So far, we have considered an embarrassingly parallel code, where all the iterations of the loop are completely independent from each other. Let's now consider a different code, namely a simple kernel that performs the dot product of two vectors, also known as the scalar product. By looking at the product loop, which performs the actual scalar product, we can see that each iteration of the loop depends on the previous iteration. Indeed, the variable called product holds the accumulation of the products of the components of the two vectors, local_a and local_b. In other words, the variable accumulates its previous value with the product of the two vector components at iteration i. Looking at the annotated analysis report, we can see that we have a loop-carried dependency. The result of the floating-point adder, FADD, is sent to a multiplexer that provides the value as an input to the same floating-point adder in the next iteration of the loop. The multiplexer is needed because we need to distinguish between the first iteration of the loop, in which the value of the accumulation is zero, and the following iterations, in which the value of the accumulation is the one from the previous loop iteration.
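A dot-product kernel of the kind described here can be sketched like this (a minimal illustration; the identifiers are assumptions, not the actual course material):

```cpp
#define N 1024

// Hypothetical dot-product kernel. The accumulation into 'product' creates
// a loop-carried dependency: iteration i cannot start its floating-point
// addition before the addition of iteration i-1 has produced the updated
// value of 'product'.
float dot_product(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;
product_loop:
    for (int i = 0; i < N; i++) {
        product += local_a[i] * local_b[i];
    }
    return product;
}
```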
Notice that the loop also contains the read operations that get the values from the two vectors, and the floating-point multiplier that computes the product of the vector components, which is provided as input to the floating-point adder. Overall, the iteration latency of the loop is 13 cycles and the loop is iterated 1024 times, which yields a total loop latency of 13,312 cycles. What iteration latency and trip count would you expect by unrolling the loop by a factor of two? Well, let's try it out. The trip count has halved as expected, but the iteration latency has increased from 13 to 20 cycles. Overall, the loop latency is now 10,240 cycles, which is only 23% less than the original 13,312 cycles. This is far from the 50% reduction that we achieved for our vector sum example. To understand why this is the case, let's look at the annotated analysis report side by side with the manually unrolled code. After the unrolling, we get an extra floating-point addition and an extra floating-point multiplication to schedule. Nevertheless, the two floating-point additions cannot be scheduled in parallel, since the second addition requires the result of the first addition. Overall, Vivado HLS manages to hide the execution time of the second floating-point multiplication. However, the length of the loop-carried dependency cycle has increased by one floating-point addition, which accounts for the seven extra cycles in our loop iteration latency. In this situation, the loop-carried dependency severely limits the performance that we can extract from our loop simply by applying the unrolling optimization. Indeed, even if we unrolled the loop four times, we would get a loop-carried dependency path consisting of four floating-point additions, and the loop iteration latency would increase accordingly.
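To make the dependency chain explicit, here is a sketch of what the manually unrolled (factor-two) body looks like; the parenthesization shows why the two additions are serialized. Again, names are illustrative:

```cpp
#define N 1024

// Unrolled by two: the two multiplications can run in parallel, but the
// second addition needs the result of the first, so the loop-carried
// dependency path now contains two FADDs instead of one.
float dot_product_unroll2(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;
    for (int i = 0; i < N; i += 2) {
        product = (product + local_a[i] * local_b[i])
                          + local_a[i + 1] * local_b[i + 1];
    }
    return product;
}
```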
For this particular example, one possible solution to reduce the impact of the loop-carried dependency is to manually unroll the code and perform the additions as a tree reduction, in order to expose higher instruction-level parallelism.
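One way such a tree reduction might be sketched (unrolling by four here; names are assumptions): the partial products are first summed pairwise off the carried path, so only a single addition remains on the loop-carried dependency per iteration:

```cpp
#define N 1024

// Manually unrolled by four with a tree-shaped reduction of the partial
// products. 't0', 't1' and 't0 + t1' do not depend on 'product', so only
// the final accumulation sits on the loop-carried dependency path.
float dot_product_tree(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;
    for (int i = 0; i < N; i += 4) {
        float t0 = local_a[i]     * local_b[i]
                 + local_a[i + 1] * local_b[i + 1];
        float t1 = local_a[i + 2] * local_b[i + 2]
                 + local_a[i + 3] * local_b[i + 3];
        product += t0 + t1;  // single FADD on the carried path
    }
    return product;
}
```

Note that reassociating floating-point additions can change the numerical result slightly; accepting that trade-off is a design decision that has to be validated against the application's accuracy requirements.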