Hello and welcome back. In the last video, we discussed the network core and preprocessing parts of the Inference Engine library. In this video, we will discuss how to run inference on a device. In the previous video, we showed how to create an instance of the executable network object. This is what handles the actual deployment of inference on the targeted device. Inference with an executable network can be done in one of two modes: synchronous and asynchronous. Synchronous means that the main code execution waits for the inference to complete before continuing, while asynchronous means that the main code execution can continue while inference happens in parallel. Another common term for this is blocking versus non-blocking. In the example in the workflow video, we used the infer method. This is the synchronous version of inference. The input for the infer method is a dictionary that contains all the inputs for the network. The keys of the dictionary are the names of the input layers, and the values are the NumPy array inputs for the respective layers. The easiest way to find what to use for the dictionary keys is to check the inputs attribute of the IENetwork class. We looked at this attribute in the last video, but to recap, it is also a dictionary that contains the Inference Engine representation of the input layers. The keys in this inputs attribute are the keys that you should use for the dictionary input of the infer method. The values of the dictionary are the NumPy arrays that we preprocessed in the last video. If you have a complex network with multiple input layers, your dictionary will have to include entries for all the input layers. Beyond the input, executing the infer method is very straightforward. The output of the infer method is also a dictionary. The procedure for getting the result of the network is similar to how we set the input, except this time the keys can be found in the outputs attribute of the IENetwork class.
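The synchronous flow described above can be sketched as follows. Since we cannot assume the real library is installed here, this is a minimal stand-in that mimics the dictionary-in, dictionary-out contract of the executable network; the class, the layer names ("data", "prob"), and the toy "inference" are invented for illustration, while in real code the object comes from IECore's load_network method.

```python
import numpy as np

class FakeExecutableNetwork:
    """Stand-in for an executable network: infer() blocks until
    the result is ready, then returns an output dictionary."""
    def __init__(self, input_names, output_names):
        # Mimics the inputs/outputs attributes whose keys name the layers.
        self.inputs = {name: None for name in input_names}
        self.outputs = {name: None for name in output_names}

    def infer(self, inputs):
        # Toy "inference": sum every input blob. A real network
        # would run the loaded model on the device here.
        result = sum(float(blob.sum()) for blob in inputs.values())
        return {name: np.array([result]) for name in self.outputs}

# The dictionary keys come from the network's inputs attribute.
exec_net = FakeExecutableNetwork(["data"], ["prob"])
input_blob = next(iter(exec_net.inputs))            # "data"
image = np.ones((1, 3, 4, 4), dtype=np.float32)     # preprocessed NCHW array

res = exec_net.infer({input_blob: image})           # blocks until done
output_blob = next(iter(exec_net.outputs))          # "prob"
print(res[output_blob])                             # NumPy array to post-process
```

The same key-lookup pattern extends directly to networks with several input layers: one dictionary entry per layer name.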
Similar to the input case, the dictionary will have an entry for every output layer of the model you use. The value in the dictionary is the output of the network as a NumPy array, which you can then post-process. Next, let's move on to the asynchronous case. Once again, this is where the inference happens in parallel with the rest of the main code. This feature is especially useful when you are using an accelerator device like the onboard GPU. With asynchronous mode, you can offload the work to the accelerator while the CPU continues to work on other tasks. A common example of this is the CPU preprocessing the next image while the GPU works on inference. Asynchronous inference is started with the start_async method. The setup for this method is the same as in the synchronous mode, with the same executable network object and the same input arguments. Calling this method starts the inference in parallel, and the main execution is allowed to continue. The method returns an infer request object. You can think of this as a service ticket for the inference workload; it is used to get the status of the inference. So once the main code is ready to get the result of the inference, call the wait method of the infer request. This will hold the execution until the inference is done. After you call this method, the result of the inference is available in the outputs attribute of the infer request object, which is a dictionary just like the output of the infer method. The wait method itself returns an integer status code. Zero means the inference was successful, whereas a non-zero number means something went wrong. It is a good idea to check for and handle non-zero status codes. Now, as I mentioned earlier, the asynchronous mode is useful for cases when you have a compute accelerator like an FPGA. But some systems have two or more accelerators, and in that case it is beneficial to run multiple inference requests in parallel.
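The start_async/wait contract can be pictured with the sketch below. It uses a background thread as a stand-in for the device doing inference in parallel; the class names and the toy computation are invented, but the shape of the calls (start, keep working, wait for status 0, read the outputs dictionary) follows the flow just described.

```python
import threading
import numpy as np

class FakeInferRequest:
    """Stand-in 'service ticket': holds the worker and the result."""
    def __init__(self, worker, outputs):
        self._worker = worker
        self.outputs = outputs          # filled in by the worker thread

    def wait(self):
        self._worker.join()             # block until inference finishes
        return 0                        # 0 = success, mirroring the status code

class FakeExecutableNetwork:
    def start_async(self, inputs):
        outputs = {}
        def run():                      # pretend inference in the background
            total = sum(float(b.sum()) for b in inputs.values())
            outputs["prob"] = np.array([total])
        worker = threading.Thread(target=run)
        worker.start()                  # returns immediately: non-blocking
        return FakeInferRequest(worker, outputs)

exec_net = FakeExecutableNetwork()
request = exec_net.start_async({"data": np.ones((1, 3, 4, 4), dtype=np.float32)})
# ... the main code keeps working here, e.g. preprocessing the next frame ...
status = request.wait()                 # hold execution until the result is ready
if status == 0:                         # always check the status code
    print(request.outputs["prob"])
```

Note that nothing useful happens between start_async and wait in this tiny sketch; the benefit appears when the main thread has real work, such as preparing the next input, to overlap with the inference.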
For such a use case, asynchronous inference allows for multiple concurrent requests. There's an important distinction here that I want to make: multiple requests is different from batch size. Batch size controls how many images are processed together inside one inference workload, whereas multiple requests run several inferences in parallel. Enabling multiple requests requires a little backtracking, as the allowed number of concurrent requests is determined when you create the executable network object. So let's go back to the load_network method of IECore, where we created the executable network. The allowed number of concurrent requests is set by the num_requests argument. Setting this to two allows for two concurrent requests. You can think of the multiple requests of an executable network as request slots. So when you run the start_async method, you need to specify which slot to use by passing an integer request ID. This is a zero-indexed integer value, so a value of zero uses the first slot. For getting the infer request objects to wait on and to get the results from, there are two approaches. One is to save every infer request returned by the start_async method; alternatively, you can get the infer requests from the requests attribute of the executable network object. This requests attribute is an array containing each infer request, and the index in this array corresponds to the request ID you specified. The rest of the workflow is the same: get the infer request you are interested in, use the wait method to block until inference is complete, recover the result from the outputs attribute, and then repeat for each request slot.
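The request-slot idea can be sketched like this. Again this is a self-contained stand-in rather than the real library: the classes, the "data"/"prob" layer names, and the toy computation are invented, but the shape of the workflow (fix the slot count up front, launch one workload per zero-indexed request ID, then wait on each slot via the requests array) mirrors the description above.

```python
import threading
import numpy as np

class FakeInferRequest:
    """One request slot: wait() blocks until its workload finishes."""
    def __init__(self):
        self._worker = None
        self.outputs = {}

    def start(self, inputs):
        def run():  # pretend inference for this slot, in the background
            total = sum(float(b.sum()) for b in inputs.values())
            self.outputs["prob"] = np.array([total])
        self._worker = threading.Thread(target=run)
        self._worker.start()

    def wait(self):
        self._worker.join()
        return 0                        # 0 = success status code

class FakeExecutableNetwork:
    """The slot count is fixed at creation time, mimicking a
    num_requests argument passed when the network is loaded."""
    def __init__(self, num_requests):
        self.requests = [FakeInferRequest() for _ in range(num_requests)]

    def start_async(self, request_id, inputs):
        self.requests[request_id].start(inputs)
        return self.requests[request_id]

exec_net = FakeExecutableNetwork(num_requests=2)
frames = [np.ones((1, 3, 4, 4), dtype=np.float32),
          np.full((1, 3, 4, 4), 2.0, dtype=np.float32)]

# Launch one workload per slot (zero-indexed request IDs).
for slot, frame in enumerate(frames):
    exec_net.start_async(request_id=slot, inputs={"data": frame})

# Wait on each slot in turn via the requests array and collect results.
results = []
for slot in range(len(frames)):
    if exec_net.requests[slot].wait() == 0:
        results.append(float(exec_net.requests[slot].outputs["prob"][0]))
print(results)
```

Waiting on slot 0 first, then slot 1, is exactly the sequential pattern whose limitation is discussed next: if slot 0 happens to be the slow one, the loop sits idle even though slot 1 finished long ago.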
This flow, however, is not necessarily efficient. If the first request you wait for happens to be the slowest, then the code will be waiting on this request even if the others finished first. There are several ways around this issue, but they will be covered in the next part of the tutorial. With that, we've come to the end of the videos on using the Inference Engine and the first part of the intermediate Intel Distribution of OpenVINO Toolkit tutorial. In the next course of this tutorial, we will focus more on the actual process of deployment. Thank you for your attention, and see you in the next course.