Before I walk you through some of the resources and artifacts generated by SageMaker Autopilot, I want to show you a couple of things inside SageMaker Studio. First, I previously walked you through how to configure and launch an Autopilot job programmatically, but you can also launch an Autopilot job directly from inside the Studio user interface. As you can see here, there is an option to create a new Autopilot experiment. If you click this option, you'll see an interface that walks you through the same inputs and configuration options that I previously defined programmatically. You can accomplish the same thing by defining your configuration options here and then clicking 'Create Experiment.' Remember, whether you're using one of the SDKs or the Studio interface, you're still hitting the same Autopilot APIs in the background.

Let's now circle back to the scenario where I've already created my experiment and want to look at the resources and artifacts produced by the Autopilot job. All of the tasks that happen within an Autopilot job happen within an experiment. To view this, if you don't already have it on your console, go to the last icon you see here on the left, called 'Components and registries,' and make sure that in the dropdown you've selected 'Experiments and trials.' I'll select the Autopilot experiment that I previously ran for this demo. If I right-click on this experiment and then select 'Describe AutoML Job,' you'll now see the trials that were run as part of this experiment. You can also see that, in this case, all of the trials have a status of Completed, meaning they have already run; you may see In Progress here if your jobs are still running.

As you can see, there are three trials for this experiment, and each trial is a tuning job. A tuning job takes one candidate pipeline, which, if you remember, contains your data pre-processing code and your algorithm with hyperparameter ranges for that algorithm, and uses a feature of SageMaker called hyperparameter tuning, which automatically searches for the combination of hyperparameter settings that optimizes your model against your objective metric. You can also see here that Autopilot has clearly identified the best-performing model on the leaderboard: this is where you see the yellow star and the word 'Best,' which maps to the highest accuracy that you see over here on the right.

You can dig into the details of your trials by clicking on a trial and then selecting 'Open in model details.' This will provide more details about model explainability, the hyperparameters that were chosen, and additional metrics such as training and validation error. You'll also be able to view a list of the inputs, artifacts, and resources, such as the S3 links to your input data, the automatically generated training and validation splits, and the feature engineering and algorithm model artifacts.

For this demo, I want to show you the data exploration notebook and walk through the candidate generation notebook. To do this, I'll go back to our experiments, and we'll start with the data exploration notebook, which provides insight into the data that you provided as input to the Autopilot job. This notebook provides details on the type of ML problem that was identified by Autopilot as well as details of the input dataset.
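If you'd rather pull the same information programmatically, here's a minimal sketch using boto3's DescribeAutoMLJob call; the job name and region are placeholders for whatever you used when launching the job:

import boto3

# Assumes an Autopilot job named 'automl-dm' has already been launched (placeholder name).
sm = boto3.client("sagemaker", region_name="us-east-1")

response = sm.describe_auto_ml_job(AutoMLJobName="automl-dm")

# S3 locations of the two auto-generated notebooks.
artifacts = response["AutoMLJobArtifacts"]
print(artifacts["DataExplorationNotebookLocation"])
print(artifacts["CandidateDefinitionNotebookLocation"])

# Summary of the best-performing candidate from the leaderboard.
best = response["BestCandidate"]
print(best["CandidateName"], best["FinalAutoMLJobObjectiveMetric"])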
Back in Studio, the data exploration notebook also includes details on how the job analyzed features to select the candidate pipelines. Let's now go back to our experiments and look next at the candidate generation notebook. This notebook uses what was learned during data exploration to suggest different candidates. If you remember, candidates include feature transformations, the algorithm, and the model tuning strategy. This notebook is automatically generated, and you can see it is in read-only mode. So, if you want to further optimize the options selected by Autopilot, or download some of the resources that are generated and stored in S3 to your local Studio environment, you can create a local copy of this notebook that can then be executed.

Let's go ahead and create a local copy by selecting 'Import notebook.' I'll then select the Python 3 (Data Science) kernel. Now I have a local copy that can be used to copy resources from S3 to our local notebook, as well as to make any further optimizations to the candidate pipelines that were identified by Autopilot. I'm going to scroll down here and execute this cell to download the generated data transformation modules from S3 to our local Studio notebook environment, so I can look further into the data transformations that were applied. This cell just copied those data transformation modules to my Studio environment, so if I click over here on the file browser, I now see a folder called automl-dm-artifacts. I'll double-click on that folder, and you'll see two folders inside: generated_module, which contains the candidate data processors, and sagemaker_automl, which contains a notebook helper library.

I want to look at the generated feature code, so I'll double-click on generated_module and then on the candidate_data_processors folder. You'll now see the three sets of data pre-processing code that were automatically produced and run by Autopilot for your trials: data pre-processors dpp0, dpp1, and dpp2, mapping to the three candidate pipelines that we saw. Let's double-click on one of these and take a look at what Autopilot generated. I'm going to click on the first data pre-processor here, called dpp0. In this view, you can see the Python code that was generated for your first candidate model, which uses the SageMaker Scikit-Learn extension. In this case, you can see Autopilot is using the MultiColumnTfidfVectorizer that I talked about before and has automatically identified a few configuration parameters.

Some of those configuration parameters include max_df, which defines the document-frequency threshold above which terms are ignored; this is basically identifying the stop words that are specific to your text data. These are words that appear too frequently, which could be words like 'is' or 'the,' or, for this specific use case, you may see the word 'product' a lot as well. In this case, it means ignore terms that appear in more than 99.41% of the reviews. min_df is used for removing terms that appear too infrequently; here, 0.0007 means ignore terms that appear in less than 0.07% of the reviews. Autopilot also identifies the analyzer, which determines whether features should be made from word or character n-grams; in this case, Autopilot is going to use word n-grams to generate the features. Finally, you have max_features, which tells the vectorizer to consider only a specific number of features when generating the vocabulary; here, Autopilot has specified 10,000, which means only the top 10,000 terms, ordered by term frequency, will be considered.
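To make those parameters concrete, here's a minimal sketch using scikit-learn's standard TfidfVectorizer, which exposes the same max_df, min_df, analyzer, and max_features options as the multi-column variant Autopilot generates; the sample reviews are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy review snippets standing in for the real input data.
reviews = [
    "the product works great and is easy to set up",
    "the product stopped working after a week",
    "great value, would buy this product again",
]

vectorizer = TfidfVectorizer(
    max_df=0.9941,       # ignore terms in more than 99.41% of documents
    min_df=0.0007,       # ignore terms in fewer than 0.07% of documents
    analyzer="word",     # build features from word n-grams
    max_features=10000,  # keep only the top 10,000 terms by frequency
)

features = vectorizer.fit_transform(reviews)
print(features.shape)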
Back in dpp0, this data transformation strategy first transforms text features using the MultiColumnTfidfVectorizer. It then merges the generated features and applies RobustStandardScaler, which uses that same SageMaker Scikit-Learn extension to perform standardization of your features based on the sparsity or density of the input data. So, this is the data processing code for one of the candidate pipelines generated by Autopilot.

Let's check out the others. I'm going to go into dpp1. In this second data pre-processor, Autopilot is again using the MultiColumnTfidfVectorizer; however, you can see the parameters are different this time. In this case, the values for max_df and min_df have changed, in addition to the analyzer. In the previous candidate, we were using the 'word' analyzer, but in this candidate, Autopilot has identified char_wb, which creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. Here too, the data transformation strategy first transforms text features using the MultiColumnTfidfVectorizer. Then it merges all the generated features and applies RobustPCA followed by RobustStandardScaler. RobustPCA performs dimensionality reduction for dense and sparse input, so basically it's going to attempt to reduce the number of input variables first. Then RobustStandardScaler again performs standardization of the features based on the sparsity or density of the input data.

Let's now take a look at the final candidate's feature engineering code that was generated by Autopilot. I'm going to click on dpp2. In this candidate, the analyzer that's chosen is 'word,' similar to our first candidate, but there are some modifications to min_df and max_df. While these may look like minor changes, taking max_df from 99.41% in our first candidate to 99.83% in this example, those refinements can have an impact on properly engineering our features to obtain the most optimal results for our model. This data transformation strategy first transforms text features using the MultiColumnTfidfVectorizer, then merges all of the generated features and applies RobustStandardScaler, which again uses that same SageMaker Scikit-Learn extension to perform standardization of your features based on the sparsity or density of the input data.

So, let's now look at how Autopilot is using these data pre-processors in your candidate pipelines. For this, I'll go back to the candidate generation notebook that we imported into our local environment, so I can show you how the candidate pipelines were identified, and scroll down to a section called Candidate Pipelines. This section contains the three machine learning pipelines, or trials, that you saw on the leaderboard before. Remember, each pipeline includes data pre-processing, an algorithm, and hyperparameter tuning jobs to find the right combination that results in the best-performing model according to your objective metric. The first candidate pipeline that we see here is dpp0-xgboost. This candidate pipeline uses the data pre-processing code for dpp0 that I just showed you: it transforms text features using the MultiColumnTfidfVectorizer and then applies RobustStandardScaler to the merged features, and that final transformed data is used to tune an XGBoost model. A rough sketch of that end-to-end shape follows.
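Here's a minimal sketch of that dpp0-style pipeline using plain scikit-learn and XGBoost. MultiColumnTfidfVectorizer and RobustStandardScaler live in the SageMaker Scikit-Learn extension, so I'm approximating them with TfidfVectorizer and a sparse-friendly StandardScaler, and the reviews and labels are made up:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Toy stand-ins for the review text and sentiment labels.
reviews = ["love this product", "terrible, broke in a day", "works as expected"]
labels = [1, 0, 1]

pipeline = Pipeline([
    # Text -> TF-IDF features (stand-in for MultiColumnTfidfVectorizer).
    ("tfidf", TfidfVectorizer(analyzer="word", max_features=10000)),
    # Standardize without centering so sparse input stays sparse
    # (stand-in for RobustStandardScaler).
    ("scale", StandardScaler(with_mean=False)),
    # The algorithm whose hyperparameters Autopilot tunes in each trial.
    ("xgb", XGBClassifier(n_estimators=50, max_depth=3)),
])

pipeline.fit(reviews, labels)
print(pipeline.predict(["great product, works great"]))

In the real trials, Autopilot doesn't fix the XGBoost hyperparameters like this; it hands ranges for them to a SageMaker hyperparameter tuning job.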
In the second candidate pipeline that you see here, called dpp1-xgboost, the pipeline uses the data pre-processing code that I just showed you for dpp1, which transforms text features using the MultiColumnTfidfVectorizer and then applies RobustPCA followed by RobustStandardScaler to the merged features. The final transformed data here will again be used to run tuning of your XGBoost model. The final candidate pipeline is dpp2-xgboost. This candidate pipeline uses the data pre-processing code for dpp2 that I showed you, which transforms text features again using the MultiColumnTfidfVectorizer and applies RobustStandardScaler to the merged features, and again the final transformed data is used to tune your XGBoost model.

If you continue to explore this notebook, you'll also see additional details about the hyperparameter tuning job inputs and ranges that were identified by Autopilot, the results of the tuning jobs, details on model selection, and also configuration code to deploy the best-performing model to a hosted SageMaker endpoint. So, let's go back to the leaderboard. You can also choose to deploy a SageMaker endpoint directly through the Studio UI from the leaderboard, here with this 'Deploy model' button.

As you can see, the notebooks that were automatically generated by Autopilot, combined with the generated resources and artifacts, provide visibility into how each model is built. You can use these notebooks, in combination with the generated code, to continue to refine the best-performing model identified by Autopilot or to deploy that model to a SageMaker-managed endpoint.
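If you'd rather deploy from code than from the Studio UI, here's a minimal sketch using the SageMaker Python SDK's AutoML class; the job and endpoint names are placeholders for whatever you used when launching the job:

from sagemaker.automl.automl import AutoML

# Attach to the completed Autopilot job (placeholder job name).
automl = AutoML.attach(auto_ml_job_name="automl-dm")

# Inspect the best candidate from the leaderboard.
best = automl.best_candidate()
print(best["CandidateName"])

# Deploy the best candidate to a real-time SageMaker endpoint.
predictor = automl.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="automl-dm-endpoint",  # placeholder name
)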