In this video, I will introduce the libraries and APIs to generate BERT embeddings from raw text programmatically. For this purpose, you will use the scikit-learn library. With scikit-learn, you will use the RoBERTa tokenizer class. Before jumping into the code, I will briefly introduce the RoBERTa model. The RoBERTa model is built on top of the BERT model, but it modifies a few hyperparameters and the way the model is trained. It also uses a lot more training data than the original BERT model. This results in significant performance improvements on a variety of NLP tasks when compared to the original BERT model. You will learn more about the RoBERTa model architecture in week two. If you would like to read more about this model, please have a look at the research paper; you will find the link in the additional resources section for this week.

Time to look at some code. To start using the RoBERTa tokenizer, you first import the class and then construct a tokenizer object. To construct the tokenizer, you specify the pretrained model; the pretrained model used here is RoBERTa base. Once you have the tokenizer object in hand, you run the encode_plus method. The encode_plus method expects a few parameters, as you can see here. One of the parameters is the review. This is the raw review text from your product review dataset, which is the text that needs to be encoded. You will also see a true-or-false flag indicating whether or not to add special tokens to the embeddings. You will also see the parameter that specifies the maximum sequence length, along with a few other parameters.

A brief note on the maximum sequence length parameter. This is a hyperparameter that is available on both the BERT and RoBERTa models. The max sequence length parameter specifies the maximum number of tokens that can be passed into the BERT model in a single sample. To determine the right value for this hyperparameter, you can analyze your dataset.
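As a sketch, the tokenizer flow just described might look like this, assuming the RoBERTa tokenizer from the Hugging Face transformers library (the review string and parameter values here are illustrative, not the lab's exact code):

```python
# Sketch: encode one raw review into BERT-style input features.
# Assumes the Hugging Face transformers library is installed and
# can fetch the pretrained "roberta-base" vocabulary.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

review = "I absolutely love this product, it works great!"  # sample review text

encoding = tokenizer.encode_plus(
    review,
    add_special_tokens=True,   # wrap the sequence in <s> ... </s>
    max_length=128,            # maximum sequence length hyperparameter
    padding="max_length",      # pad shorter reviews out to max_length
    truncation=True,           # trim longer reviews down to max_length
    return_attention_mask=True,
)

print(len(encoding["input_ids"]))  # one token id per position, 128 in total
```

The result is a dictionary holding the token ids and an attention mask, each padded or truncated to exactly the maximum sequence length.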
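That dataset analysis can be as simple as counting words per review and rounding up to a padding-friendly bound — a minimal plain-Python sketch with made-up sample reviews:

```python
# Sketch: estimate a max sequence length from the word-count
# distribution of the reviews (sample data here is made up).
reviews = [
    "This product works great and arrived quickly.",
    "Terrible quality, broke after two days of light use.",
    "Okay for the price, but I expected better battery life.",
]

word_counts = [len(r.split()) for r in reviews]
max_words = max(word_counts)
print(max_words)  # longest review in this tiny sample: 10 words

# Round up to the next power of two as a padding-friendly bound.
def next_power_of_two(n):
    p = 1
    while p < n:
        p *= 2
    return p

# For a dataset whose reviews top out around 115 words,
# this bound works out to 128.
print(next_power_of_two(115))  # 128
```

Word count is only a proxy for token count, since the tokenizer may split one word into several subword tokens, so treat the result as a starting point to experiment from.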
Here you can see the word distribution of the product review dataset. This analysis indicates that all of our reviews consist of 115 words or fewer. Now, there isn't a one-to-one mapping between the word count and the input token count, but it can be a good indication. You can definitely experiment with different values for this parameter. For the purposes of this use case and the product review dataset, setting the max sequence length to a value of 128 has proven to work well.

Once you determine all the necessary parameters, generating the embeddings is really very straightforward: you simply run the encode_plus method. However, the real challenge comes in when you have to generate these embeddings at scale. This is exactly the challenge that you will tackle in this week's lab. The challenge is performing feature engineering at scale, and to address it, you will use Amazon SageMaker Processing. Amazon SageMaker Processing allows you to perform data-related tasks such as preprocessing, postprocessing, and model evaluation at scale. SageMaker Processing provides this capability by using a distributed cluster. By specifying some parameters, you can control how many nodes, and the type of the nodes, that make up the distributed cluster. The SageMaker Processing job executes on the distributed cluster that you configure. SageMaker Processing provides a built-in container for scikit-learn, so the code that you use with scikit-learn and the RoBERTa tokenizer should work out of the box on SageMaker Processing. As you can see here, SageMaker Processing expects data to be in an S3 bucket. You specify the S3 location where your raw input data is stored, and SageMaker Processing executes the scikit-learn script on the raw data. Finally, the output, which consists of the embeddings, is persisted back into an S3 bucket. Let's look at some code again.
To use SageMaker Processing with scikit-learn, you start by importing a class called SKLearnProcessor, along with a couple of other classes that capture the input and the output of the processing job. Then you set up the processing cluster using the SKLearnProcessor object. To this object, you pass in the scikit-learn framework version you would like to use, as well as the instance count and the instance type that make up the distributed cluster. Once you configure the cluster, you simply call the run method with a few parameters. As expected, these parameters include a script to execute. This is a Python script that consists of the scikit-learn code to generate the embeddings. Additionally, you provide the processing input that specifies the location of the input data in the S3 bucket. And finally, you specify where in S3 the output should go, and you declare that using the processing output construct. Pulling all of this together, this is what your lab for this week is going to look like: you will convert the review text from the product review dataset into BERT embeddings, using the scikit-learn container on SageMaker Processing.
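A configuration sketch of that flow, using the SageMaker Python SDK: the script name, S3 paths, IAM role, and instance settings below are placeholders, and the job only runs inside an AWS account with the appropriate permissions.

```python
# Sketch: run a scikit-learn script at scale with SageMaker Processing.
# Assumes the SageMaker Python SDK; role, paths, and sizes are placeholders.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="0.23-1",     # scikit-learn version of the built-in container
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_count=2,               # number of nodes in the cluster
    instance_type="ml.c5.2xlarge",  # type of each node
)

processor.run(
    code="preprocess.py",  # hypothetical script holding the tokenizer code
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/raw-reviews/",    # raw review data in S3
            destination="/opt/ml/processing/input",  # mounted path in the container
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",        # where the script writes embeddings
            destination="s3://my-bucket/embeddings/",  # persisted back to S3
        )
    ],
)
```

The script itself reads from the input mount path, generates the embeddings with the tokenizer, and writes its results to the output mount path; SageMaker Processing handles moving the data between S3 and those container paths.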