Now that you have a high-level understanding of machine learning pipelines, I'll cover how to create a machine learning pipeline using SageMaker Pipelines. As I mentioned before, there are a lot of tools and services that can be used to create your machine learning pipelines, but for this section, I'm going to focus on a service from your toolbox that is native to Amazon SageMaker, called SageMaker Pipelines. It's also the service that you'll have the opportunity to work with in your lab for this week. SageMaker Pipelines allows you to create automated workflows using a Python SDK that's purpose-built for automating model-building tasks. You can also visualize your workflows inside Amazon SageMaker Studio. Pipelines also includes the ability to natively integrate with SageMaker Model Registry. This allows you to capture some of that model metadata that I previously discussed, like the location of your trained model artifact in S3, or key information about your trained model, such as model evaluation metrics. Model Registry also allows you to choose the best-performing model that you want to approve for deployment. Finally, SageMaker Projects allows you to extend your pipelines and incorporate CI/CD practices into your machine learning pipelines. This includes things like source and version control for true end-to-end traceability.
Let's take a look at the main components or features of Pipelines and see how it all fits together. SageMaker Pipelines has three primary components. First, you have pipelines, which allow you to build automated model-building workflows using the Python SDK. Again, these workflows can be visualized inside SageMaker Studio. Second, you have SageMaker Model Registry, which stores metadata about the model and has built-in capabilities to include model deployment approval workflows as well. Finally, you have Projects, which includes built-in project templates, as well as the ability to bring your own custom templates, that establish a pre-configured pattern for incorporating CI/CD practices into your model-building pipelines and your model deployment pipelines. Although SageMaker Pipelines provides the ability to work with all three of these components, for this section I'm primarily going to focus on pipelines and Model Registry, because you'll be working with those inside your lab for this week.
Let's start with pipelines. SageMaker Pipelines allows you to take those machine learning workflow tasks that I've been talking about and automate them together into a pipeline that's built through code. A Python SDK is provided so that you can build and configure these workflows. The pipeline visualizations, which are similar to what you see here, are all provided through SageMaker Studio. Pipelines provides a serverless option for creating and managing automated machine learning pipelines, meaning you don't have to worry about any of the infrastructure or managing any of the servers that are hosting the actual pipeline. Let's take a look at some of the steps. I won't spend a lot of time on the features that have been introduced in previous weeks, but if you look at the first step in your pipeline, you have a data processing step. For this, SageMaker Pipelines supports Amazon SageMaker processing jobs that you can use to transform your raw data into your training datasets. As a reminder, SageMaker Processing expects your input to be in Amazon Simple Storage Service, or S3.
Your input in this case is your raw data, or more specifically, your product review dataset. SageMaker Processing also expects your data processing script to be in S3 as well. In this case, your script is the scikit-learn data processing script that will be run by your processing job and used to transform your data and split it into your training, validation, and test datasets. How do you configure this step in your pipeline? Your first step is to configure the inputs and outputs that are expected for the step in the pipeline. Again, you can see here you're identifying your S3 bucket as the storage for your input; this is your raw data. Once you've configured your inputs and your outputs for the step, you now need to configure the actual step. To do this, you'll use the Python SDK that's specifically built for SageMaker Pipelines, and it includes a built-in processing step that indicates that you want to run a SageMaker processing job as part of this step. You can see here, you provide the name of the script that you want to run, which, in this case, is your data preparation Python script. You also have to indicate the processor that you're using, which in this case is the SKLearn processor, since you're running a scikit-learn processing script as part of your data processing step. You also refer to the inputs and outputs that you previously configured.
Now that you've configured the data processing step, if you look at your pipeline configuration, you can see that the output of your processing step is then fed into the input of your training step. Again, I won't spend a lot of time on SageMaker training jobs since they've been covered in previous weeks, but in this step, you're going to set up and configure a step to run a SageMaker training job that's going to train your model using the input from your previous task. In this case, you'll use your training dataset to train the model, and then you'll use the validation dataset to evaluate how well the model is actually learning during training. The output of this particular step is going to be a trained model artifact that gets stored in S3. Here, you're configuring the hyperparameters that will be used as input into the training job step; again, hyperparameters are specific to the algorithm that you're using. Let's take a look at how to set up and configure this task inside your pipeline, starting with setting up your estimator. Here you're going to configure your PyTorch estimator, which is what tells SageMaker that you want to use the built-in PyTorch framework to train this model. You'll also specify multiple configurations for your PyTorch estimator, including things like the hyperparameters you want to use, the PyTorch version you want to use, and the configuration for your compute resources, that is, what you want the compute environment for your training jobs to look like. You'll also provide the training script that you want to use for training. Once you've configured the estimator, you now want to configure the actual pipeline step. To do this, you will again use the Python SDK, specifically the built-in training step, which indicates that for this particular step in your pipeline you want to run a SageMaker training job. You can see here that you provide the estimator as input, which is what you just configured and what defines your PyTorch training job.
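To make this concrete, here is a minimal sketch of how the data processing and training steps described here might look with the SageMaker Python SDK. This is not the exact lab code: the script names (prepare_data.py, train.py), output channel names, instance types, framework versions, and hyperparameter values are illustrative assumptions, and role and raw_data_s3_uri are placeholders for your execution role and the S3 location of your raw data. The training step's inputs, which come from the processing step's outputs, are discussed next, but they're included in the sketch for completeness.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# Scikit-learn processor that runs the data preparation script.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",        # assumed scikit-learn version
    role=role,                         # your SageMaker execution role (placeholder)
    instance_type="ml.c5.2xlarge",     # assumed compute configuration
    instance_count=1,
)

# Data processing step: raw product reviews in, train/validation/test splits out.
processing_step = ProcessingStep(
    name="Processing",
    processor=sklearn_processor,
    code="prepare_data.py",            # assumed name of the data processing script
    inputs=[
        ProcessingInput(
            source=raw_data_s3_uri,    # S3 location of the raw product review dataset
            destination="/opt/ml/processing/input/data",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/output/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
    ],
)

# PyTorch estimator that defines how the training job runs.
estimator = PyTorch(
    entry_point="train.py",            # assumed name of the training script
    role=role,
    framework_version="1.6.0",         # assumed PyTorch version
    py_version="py3",
    instance_type="ml.c5.9xlarge",     # assumed compute configuration
    instance_count=1,
    hyperparameters={"epochs": 3, "learning_rate": 1e-5},  # algorithm-specific values
)

# Training step: trains on the processing step's train split and monitors
# learning progress on its validation split.
training_step = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        ),
        "validation": TrainingInput(
            s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri
        ),
    },
)
```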
You also supply an input, which is the output of your previous processing step, as you can see here: you're providing your training and validation datasets as input to this step within your pipeline. You also provide the output configuration; in this case, it's the S3 location where you want to store your trained model artifact. The trained model artifact then becomes input into the next step inside your pipeline, which is the evaluation step. In this particular step, you're going to use a SageMaker processing job again, but in this case you're going to use it for model evaluation by providing your test holdout dataset as input. If you remember, your test holdout dataset was generated in your data processing step, which was the very first step inside your pipeline. Here you're going to provide that dataset as input into this evaluation step, where you're also providing your trained model artifact that's stored in S3 and was produced as a result of your previous training step. You're also providing the Python script that you want to use for model evaluation. The processing job will then load your model, run your test holdout dataset against that model in batch mode, and then output the evaluation metrics to the S3 bucket that you specify.
Let's take a moment to talk specifically about model evaluation as it relates to your product review dataset and what you're actually doing inside this step. As discussed earlier, an important step in model development is to evaluate that final model with another holdout dataset that the model hasn't seen before. These final model metrics are used to compare the performance of competing models: the higher the final model score, the better the model generalizes. You normally start with a single dataset, and then, depending on the number of samples in your dataset, you might want to keep 20 percent as a holdout dataset. You can then use the larger portion of your data for your model training; this is the data that's actually used to fit the model. It's also common to create a test holdout dataset that is used for that final evaluation of your model. Once you have this test holdout dataset, how do you then use it for model evaluation in your SageMaker processing job? Let's discuss specifically how to evaluate a BERT-based text classifier using a SageMaker processing job. First, you need to write your model evaluation code. Here is an extract from a sample model evaluation script. If you remember, this evaluation script is actually provided as input into your SageMaker processing job for model evaluation. You can use the popular Python scikit-learn libraries to calculate model metrics such as the classification report or accuracy score. To evaluate your model, you need to provide a model predict function; you can then use this predict function with the holdout test dataset to calculate the model metrics. For your BERT text classifier, you want to calculate test accuracy. Once that processing job has completed as a step inside your pipeline, you can then analyze the results as shown here. The sample evaluation code will write the output to a file called evaluation.json, and you'll find this output in the S3 bucket that you specified in the configuration for your SageMaker processing job. You can also use the output of this step to determine if the model should be deployed.
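As a rough illustration, a minimal version of such an evaluation script might look like the sketch below. Here predict_fn, test_texts, and y_true are placeholders for your model's predict function and for the review texts and labels loaded from the test holdout dataset; the layout of evaluation.json (metrics.accuracy.value) is an assumption that simply needs to match whatever path your condition step reads later.

```python
import json
from sklearn.metrics import accuracy_score, classification_report

# Placeholders: predict_fn is your model-predict function; test_texts and y_true
# are the review texts and labels loaded from the test holdout dataset in S3.
y_pred = [predict_fn(text) for text in test_texts]   # batch predictions

print(classification_report(y_true, y_pred))          # per-class metrics
accuracy = accuracy_score(y_true, y_pred)             # overall test accuracy

# Write the metrics to evaluation.json so downstream pipeline steps
# (for example, a condition step) can read them from the processing output in S3.
report = {"metrics": {"accuracy": {"value": accuracy}}}
with open("/opt/ml/processing/output/metrics/evaluation.json", "w") as f:
    json.dump(report, f)
```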
Now that I've taken a step back to explain model evaluation, specifically evaluating a BERT-based text classifier, let's circle back to how you include this step inside your pipeline. This step is going to look similar to the data processing step that I previously showed, but with one exception: in this case, you're going to configure a property file. In SageMaker Pipelines, a property file stores information about the output of a processing job. This is useful when you want to analyze the results of a processing step so that you can then decide how a conditional step should be executed. In this case, the property file will include your evaluation metrics, which will then be used in a conditional step that determines whether or not you want to deploy this model based on the metrics in that file. You also need to configure the pipeline step, again using the built-in processing step that indicates that this step is going to kick off a SageMaker processing job. The job uses the inputs that you identified, including the model artifact and the test holdout dataset, in combination with the Python script for model evaluation. You also need to specify the S3 location for the processing output and the property file configuration for your evaluation report. Again, this is the report that can be integrated with a condition step in your pipeline.
In this condition step, you're using the accuracy metric that was produced in your previous evaluation step and establishing a condition that will determine which step to perform next inside your pipeline. In this case, the condition is: if accuracy is above 99 percent, you will register that model and then create the model, which essentially packages that model for deployment. Let's take a look at how a condition step gets configured. First, you're using a built-in function that allows you to configure the condition, which in this case is minimum accuracy. To create the condition, you've established a minimum accuracy value, which you previously assigned. After you configure this condition, you now need to configure the actual step. To configure the step, you're going to use a built-in condition step, where you refer back to the condition that you configured, which in this case is minimum accuracy. You're using this to determine if model accuracy is above the specified threshold. If it is, then you're going to proceed to the register model and create model steps inside your pipeline, as you can see here in the if_steps. If the accuracy is below that threshold, you'll mark that pipeline execution as failed.
Now that I've covered the pipeline steps that are used for your data and your model building or training tasks, I'll cover the last two steps in your pipeline, which are registering the model and creating the model package that can then be used for deployment. If you recall, one of the key components of SageMaker Pipelines is SageMaker Model Registry. It's very difficult to manage machine learning models at scale without a model registry; it's often one of the first conversations that I have with teams that are looking to scale and manage their machine learning workloads more effectively.
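Before moving on to the registry, here is a hedged sketch of the evaluation and condition steps just described, continuing from the earlier snippets. The script name (evaluate_model.py), output name, JSON path, and the reuse of the same scikit-learn processor are assumptions; min_accuracy_value is a placeholder for the threshold you assign, and register_step and create_step refer to the registration and create-model steps configured next. Note that the import location of JsonGet varies across SageMaker SDK versions.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet   # older SDKs: sagemaker.workflow.condition_step

# Property file that exposes the metrics written to evaluation.json.
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="metrics",             # must match the ProcessingOutput name below
    path="evaluation.json",
)

# Evaluation step: a processing job that loads the trained model artifact,
# scores the test holdout dataset, and writes evaluation.json.
evaluation_step = ProcessingStep(
    name="EvaluateModel",
    processor=sklearn_processor,       # reusing the processor configured earlier (assumption)
    code="evaluate_model.py",          # assumed name of the evaluation script
    inputs=[
        ProcessingInput(
            source=training_step.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/input/model",
        ),
        ProcessingInput(
            source=processing_step.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/input/data",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="metrics", source="/opt/ml/processing/output/metrics"),
    ],
    property_files=[evaluation_report],
)

# Condition: only proceed to registration if test accuracy meets the minimum threshold.
minimum_accuracy_condition = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=evaluation_step.name,
        property_file=evaluation_report,
        json_path="metrics.accuracy.value",   # must match the evaluation.json layout
    ),
    right=min_accuracy_value,          # threshold you assigned earlier, e.g. 0.99 (placeholder)
)

condition_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[minimum_accuracy_condition],
    if_steps=[register_step, create_step],    # register and create the model (defined next)
    else_steps=[],                            # or a FailStep in newer SDK versions
)
```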
There are different tools and different ways to implement a model registry, which is why I covered the concept of a model registry in an earlier section. But in this particular section, I'm going to look specifically at SageMaker Model Registry and incorporating it as a step inside your pipeline. SageMaker Model Registry contains a central catalog of models with their corresponding metadata. It also gives you the ability to manage the approval status of a model as part of a workflow, by marking a model as either approved or rejected. Let's say the model accuracy is still lower than required for production deployment; as a machine learning engineer, you may want to mark that model as rejected so it's not a candidate for deployment to a higher-level environment. You could also use the model registry as a trigger for a downstream deployment pipeline, so that when you approve a model, it automatically kicks off a deployment pipeline to deploy your model to downstream environments.
But now let's go back to your pipeline and look at how you can set up and configure the last two steps of your pipeline. When you register your model, you want to indicate which serving image should be used when you decide to deploy that model. This ensures that not only do you know how the model was trained, through the metadata that you capture in the model registry, but you also know how you can host that same model, because you've defined the image to use for inference. You also need to define your model metrics, where you're essentially pulling data that already exists about your model but ensuring that it's stored as metadata in that central model registry. Finally, you configure the actual step inside SageMaker Pipelines using the built-in RegisterModel step. In your configuration, you can see that you include the container image that should be used for inference, the location of your model artifact in S3, the target configuration for the compute resources that you would use for deployment of the model, as well as metrics that are specific to this model version. All of this metadata will be used to populate the model registry. You can see here that when you register the model, the approval status is also a configuration parameter that you can optionally use to set the approval status for the model when you register it. The default is to set the approval status to pending manual approval, which is more in line with a continuous delivery strategy than a continuous deployment strategy, because you're indicating that you still want a human to approve that model manually before you start any downstream deployment activities.
I just walked through each of these steps and explained the configuration of each one, but how do you link all of these steps together? Once you have all of these steps configured within your pipeline, you need to link them together to create an end-to-end machine learning pipeline. To do this, you configure the pipeline using the pipeline function that's part of the SDK. You can see in this example that you're specifying several parameters as input across all of the steps in your pipeline. You also define the steps that you configured previously. Now that you've configured all those different steps and you've configured the pipeline itself, you want to be able to take this instantiation of a pipeline and actually run it. To actually run the pipeline, you'll use the Python SDK to start that pipeline.
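Putting it together, here is a hedged sketch of the registration and create-model steps, the pipeline definition, and starting an execution, again continuing from the earlier snippets. The inference image, content types, instance types, model package group name, pipeline and parameter names, and the S3 location of the evaluation report (inference_image_uri, evaluation_metrics_s3_uri) are placeholders or assumptions, not the lab's exact values. Starting the pipeline with the raw-data input is shown at the end and described next.

```python
from sagemaker.model import Model
from sagemaker.inputs import CreateModelInput
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.steps import CreateModelStep
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline

# Model metrics pulled from the evaluation report so they're stored as
# metadata alongside this model version in the model registry.
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=evaluation_metrics_s3_uri,   # S3 location of evaluation.json (placeholder)
        content_type="application/json",
    )
)

# Register the trained model, including the serving image to use for inference.
register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    image_uri=inference_image_uri,          # container image to use for inference (placeholder)
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/jsonlines"],      # assumed request format
    response_types=["application/jsonlines"],     # assumed response format
    inference_instances=["ml.m5.large"],          # assumed deployment compute
    transform_instances=["ml.m5.large"],
    model_package_group_name="product-reviews-classifier",   # assumed group name
    approval_status="PendingManualApproval",      # default: a human approves before deployment
    model_metrics=model_metrics,
)

# Create-model step: packages the trained artifact with the serving image for deployment.
model = Model(
    image_uri=inference_image_uri,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
)
create_step = CreateModelStep(
    name="CreateModel",
    model=model,
    inputs=CreateModelInput(instance_type="ml.m5.large"),
)

# Pipeline parameter for the raw data location (in a full pipeline you would pass
# this parameter as the source of the first processing input).
input_data = ParameterString(name="InputData", default_value=raw_data_s3_uri)

# Link all of the configured steps into one end-to-end pipeline definition.
# (The register and create-model steps are reached through the condition step's if_steps.)
pipeline = Pipeline(
    name="product-reviews-pipeline",        # assumed pipeline name
    parameters=[input_data],
    steps=[processing_step, training_step, evaluation_step, condition_step],
)

# Create (or update) the pipeline definition, then start an execution.
pipeline.upsert(role_arn=role)
execution = pipeline.start(parameters={"InputData": raw_data_s3_uri})
execution.describe()      # overall execution status
execution.list_steps()    # status of each step in the pipeline
```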
You're going to provide the input that corresponds to the first step in your pipeline. In this case, your first step is the data processing step, so when you start your pipeline, you need to specify your raw dataset as input, along with the S3 location of that raw dataset. Once you start your pipeline, you can visualize the status of each of your steps through SageMaker Studio, or you can describe the status of your steps using the Python SDK. I just walked through the core components of SageMaker Pipelines, focusing primarily on pipelines and SageMaker Model Registry. I also explained how you configure each step inside your pipeline, as well as how you instantiate and run your pipeline. Once you have your pipeline set up, you can make changes to the code and the configuration and quickly iterate across your experiments using your automated machine learning pipeline. In the next section, I'll cover the third component, which is SageMaker Projects. Projects allows you to incorporate CI/CD practices such as source control, and to set up workflows that automatically initiate downstream deployment processes based on an approved model in your model registry. This is not covered in the lab specifically, but it's included to show you how you can continuously improve and evolve your machine learning pipelines with new capabilities.