In this section, you'll learn more about SageMaker hosting. If you look at your toolbox, SageMaker hosting includes SageMaker endpoints, and you might recall that these are persistent endpoints that can be used for real-time inference. You've already worked with SageMaker endpoints throughout your labs for this specialization, but let's go a little deeper and talk about some of their more advanced features.

As a reminder, SageMaker endpoints can be used to serve your models for predictions in real time with low latency. Serving predictions in real time requires a model serving stack that includes not only your trained model, but also a hosting stack that can serve those predictions. That hosting stack typically includes some type of proxy and a web server that can interact with your loaded serving code and your trained model. Your model can then be consumed by client applications through real-time invoke API requests. The request payload sent when you invoke the endpoint is routed to a load balancer and then routed to the machine learning instance or instances that are hosting your models for prediction. SageMaker has several built-in serializers and deserializers that you can use depending on your data formats. As an example, for serialization on the prediction request, you can use the JSON Lines serializer, which serializes your inference request data into a JSON Lines formatted string. For deserialization on the prediction response, the JSON Lines deserializer deserializes JSON Lines data from the inference endpoint response. Finally, the response payload is routed back to the client application.

With SageMaker model hosting, you choose the machine learning instance type and count, combined with the Docker container image and, optionally, the inference code, and SageMaker takes care of creating the endpoint and deploying the model to it. The type of machine learning instance you choose really comes down to the amount of compute and memory you need.

I just covered the high-level architecture of a deployed SageMaker endpoint, but let's now cover some of the deployment options as they relate to the actual components that are deployed inside your machine learning instance. SageMaker has three basic scenarios for deployment when you use it to train and deploy your model: you can use prebuilt inference code, prebuilt serving containers, or a mixture of the two.

I'll start with deploying a model that was trained using a built-in algorithm. In this option, you use prebuilt inference code combined with a prebuilt serving container. The container includes the web proxy and the serving stack, combined with the code that's needed to load and serve your model for real-time predictions. This scenario would be valid for some of the SageMaker built-in algorithms, where you need only your trained model and the configuration for the machine learning instances that host it behind the endpoint. To deploy your endpoint in this scenario, you identify the prebuilt container image to use and the location of your trained model artifact in S3. Because SageMaker provides these built-in container images, you don't have any container images to build.
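To make this first scenario concrete, here's a minimal sketch of deploying a model trained with a built-in algorithm using the SageMaker Python SDK. Everything specific here is a placeholder assumption: XGBoost is just an example algorithm, the S3 path and instance type are illustrative, and the right serializer and deserializer depend on the content types your serving container actually accepts.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONLinesSerializer
from sagemaker.deserializers import JSONLinesDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you're running with a SageMaker execution role

# Prebuilt serving container image for a built-in algorithm (XGBoost shown as an example).
container_uri = image_uris.retrieve(
    framework="xgboost",
    region=session.boto_region_name,
    version="1.5-1",
)

model = Model(
    image_uri=container_uri,                                # prebuilt container image
    model_data="s3://<your-bucket>/<prefix>/model.tar.gz",  # trained model artifact in S3
    role=role,
    predictor_cls=Predictor,   # so deploy() returns a Predictor we can attach serializers to
    sagemaker_session=session,
)

# SageMaker creates the endpoint and deploys the model to the ML instances you specify.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=JSONLinesSerializer(),      # serializes the prediction request payload
    deserializer=JSONLinesDeserializer(),  # deserializes the prediction response payload
)

# predictor.predict(payload) then issues the real-time invoke request described above.
```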
Let's now cover deploying a model using a built-in framework like TensorFlow or PyTorch. This next option still uses a prebuilt container that's purpose-built for the framework, but you can optionally bring your own serving code as well. In this option, you'll notice that while you're still using a prebuilt container image, you may still need or want to bring your own inference code. You'll have the opportunity to work with this option specifically in the lab for this week.

Finally, the last option I'll cover is bringing your own container image and inference code for hosting a model on a SageMaker endpoint. In this case, you'll have some additional work to do to create a container that's compatible with SageMaker for inference, but this also offers the flexibility to choose and customize the underlying container that's hosting your model.

You just learned about the three different deployment options for SageMaker. All of these options deploy your model to the number of machine learning instances that you specify when you configure your endpoint. You typically want to use smaller instances and more than one machine learning instance; in that case, SageMaker will automatically distribute those instances across AWS Availability Zones for high availability.

But once your endpoints are deployed, how do you ensure that you're able to scale up and down to meet the demands of your workloads without overprovisioning your ML instances? This is where autoscaling comes in. It allows you to scale the number of machine learning instances that are hosting your endpoints up or down based on your workload demands. This is important for meeting the demands of your workload, which means you can increase the number of instances that serve your model when you reach a capacity threshold that you've established. It's also important for cost optimization, for two reasons. First, not only can you scale your instances up to meet higher workload demands when you need to, you can also scale back down to a lower level of compute when it's no longer needed. Second, autoscaling allows you to maintain a minimum footprint during normal traffic workloads, versus overprovisioning and paying for compute that you don't need. The on-demand access to compute and storage resources that the cloud provides is what makes it possible to scale up and down quickly.

Let's take a look at how autoscaling works conceptually. When you deploy your endpoint, the machine learning instances that back that endpoint emit a number of metrics to Amazon CloudWatch. For those who are unfamiliar with it, CloudWatch is the managed AWS service for monitoring your AWS resources. SageMaker emits a number of metrics about the deployed endpoint, such as utilization metrics and invocation metrics. Invocation metrics indicate the number of times an invoke-endpoint request has been run against your endpoint, and that's the default scaling metric for SageMaker autoscaling. You can also define a custom scaling metric, such as CPU utilization. Let's assume you've set up autoscaling on your endpoint and you're using the default scaling metric based on the number of invocations. Each instance emits that metric to CloudWatch, and as part of the scaling policy that you configure, if the number of invocations exceeds the threshold that you've identified, SageMaker applies the scaling policy and scales out by the number of instances that you've configured.
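If you're curious what that invocation metric looks like for one of your own endpoints, you can query it directly from CloudWatch. Here's a minimal sketch with boto3; the endpoint name and variant name are placeholder assumptions, and it simply prints the per-five-minute invocation counts for the last hour.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# SageMaker publishes endpoint invocation metrics to the AWS/SageMaker namespace,
# keyed by endpoint name and production variant name (placeholders below).
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,           # 5-minute buckets
    Statistics=["Sum"],   # total invocations per bucket
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```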
After the scaling policy takes effect for your endpoint, the new instances come online, and your load balancer is able to distribute traffic to those new instances automatically. You can also add a cooldown period for scaling out your model, which is the value in seconds that you specify to wait for a previous scale-out activity to take effect. The scale-out cooldown period is intended to allow instances to scale out continuously, but not excessively. Finally, you can specify a cooldown period for scaling in your model as well. This is the amount of time, in seconds, after a scale-in activity completes before another scale-in activity can start, which allows instances to scale in slowly.

I just covered the concept of autoscaling SageMaker endpoints, but let's now cover how you actually set it up. First, you register your scalable target. A scalable target is an AWS resource, and in this case you want to scale a SageMaker resource, as indicated by the service namespace that's accepted as an input parameter. Because autoscaling is used by other AWS resources as well, you'll see a few parameters that specifically indicate that you want to scale a SageMaker endpoint resource. Similarly, the scalable dimension is a set value for SageMaker endpoint scaling. Some of the additional input parameters that you need to configure include the resource ID, which in this case is the endpoint variant that you want to scale. You'll also need to specify a few key parameters that control the minimum and maximum number of machine learning instances. The minimum capacity indicates the minimum number of instances that you plan to scale in to, and the maximum capacity is the maximum number of instances that you want to scale out to. In this case, you always want to have at least one instance running, and a maximum of two during peak periods.

After you register your scalable target, you need to define the scaling policy. The scaling policy provides additional information about the scaling behavior for your instances. In this case, you have your predefined metric, which is the number of invocations on your instance, and then your target value, which indicates the number of invocations per machine learning instance that you want to allow before invoking your scaling policy. You'll also see the scale-out and scale-in cooldown settings that I mentioned previously. In this case, you see a scale-out cooldown of 60 seconds, which means that after autoscaling successfully scales out, it starts to calculate the cooldown time; the scaling policy won't increase the desired capacity again until the cooldown period ends. The scale-in cooldown setting of 300 seconds means that SageMaker will not attempt to start another scale-in activity until 300 seconds after the last one completed.

In your final step to set up autoscaling, you apply the autoscaling policy, which means you attach that policy to your endpoint. Your endpoint will now be scaled in and scaled out according to the scaling policy that you've defined. You'll notice here that you refer to the configuration that was just discussed, and you'll also see a new parameter called policy type. Target tracking scaling refers to the specific autoscaling type that is supported by SageMaker; it uses a scaling metric and a target value as the indicator to scale. You'll have the opportunity to get hands-on in your lab for this week setting up and applying autoscaling to SageMaker endpoints.
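Pulling those steps together, here's a minimal sketch of that setup with the boto3 Application Auto Scaling client. The endpoint name, variant name, policy name, and target value are placeholder assumptions, and note that in this API the put_scaling_policy call both defines the target tracking policy and attaches it to the registered scalable target.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# The resource ID identifies the endpoint production variant to scale (names are placeholders).
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Step 1: register the endpoint variant as a scalable target with min/max instance counts.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",                                # scaling a SageMaker resource
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # set value for endpoint scaling
    MinCapacity=1,   # always keep at least one instance running
    MaxCapacity=2,   # scale out to at most two instances during peak periods
)

# Step 2: define and apply a target tracking scaling policy on that target.
autoscaling.put_scaling_policy(
    PolicyName="endpoint-invocations-target-tracking",  # placeholder policy name
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # invocations per instance to allow before scaling (placeholder)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # seconds to wait after a scale-out activity
        "ScaleInCooldown": 300,   # seconds to wait after a scale-in activity
    },
)
```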
You just learned about how SageMaker handles deployment to your machine learning instances across a variety of options, and I also walked you through how to apply autoscaling to dynamically provision resources to meet the demands of your workload. I'll now quickly cover a few additional capabilities for SageMaker endpoints that you should be aware of, including multi-model endpoints and inference pipelines.

I'll start with multi-model endpoints. Until now, you've learned about SageMaker endpoints that serve predictions for one model. However, you can also host multiple models behind a single endpoint. Instead of downloading your model from S3 to the machine learning instance immediately when you create the endpoint, with multi-model endpoints SageMaker dynamically loads your models when you invoke them. You invoke them through your client applications by explicitly identifying the model that you're invoking; in this case, you see the predict function identifying Model 1 for this prediction request (there's also a short sketch of this at the end of this section). SageMaker will keep that model loaded until resources are exhausted on that instance. If you remember, I previously discussed the deployment options around the container image that is used for inference when you deploy a SageMaker endpoint. All of the models that are hosted on a multi-model endpoint must share the same serving container image. Multi-model endpoints are an option that can improve endpoint utilization when your models are of similar size, share the same container image, and have similar invocation latency requirements.

Here, you'll see another feature called inference pipelines. An inference pipeline allows you to host multiple models behind a single endpoint, but in this case the models are a sequential chain of models with the steps that are required for inference. This allows you to take your data transformation model, your predictor model, and your post-processing transformer, and host them so they can be run sequentially behind a single endpoint. As you can see in this picture, the inference request comes into the endpoint, and the first model that's invoked is your data transformation. The output of that model is then passed to the next step, which here is your XGBoost model, or your predictor model. That output is passed to the next step, and in that final step of the pipeline, it provides the final, post-processed response to the inference request. This allows you to couple your pre- and post-processing code behind the same endpoint and helps ensure that your training and your inference code stay synchronized.

In this section, you learned more about using SageMaker hosting to deploy models to a fully managed endpoint for your real-time inference use cases. You also learned about hosting your endpoint on machine learning instances, where you can take advantage of capabilities like autoscaling to dynamically increase or decrease the number of machine learning instances hosting your models so that they can meet the demands of your prediction request traffic. Finally, you learned about some of the advanced deployment options, such as multi-model endpoints and inference pipelines. These won't be in your labs for this week, but they are advanced deployment options to be aware of when you're looking at the best option for deploying your models.
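Here's the multi-model endpoint sketch mentioned above: invoking one specific model hosted on a multi-model endpoint by naming its artifact in the request. It uses the boto3 SageMaker runtime client, and the endpoint name, target model artifact name, content type, and payload are all placeholder assumptions.

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Ask the multi-model endpoint to route this request to one specific model by
# passing TargetModel, the artifact name relative to the endpoint's S3 model prefix.
response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",          # placeholder endpoint name
    ContentType="application/json",                  # depends on your serving container
    TargetModel="model-1.tar.gz",                    # placeholder model artifact name
    Body=json.dumps({"features": [0.5, 1.2, 3.4]}),  # placeholder request payload
)

prediction = response["Body"].read().decode("utf-8")
print(prediction)
```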