In this session, we're going to talk about the Transcribe service and how you can improve the confidence level with a feedback loop. My name is Ken Shek. I'm a specialist SA, and I also have my coworker Emmanuel Etheve, who is a specialized solutions architect. Before we go into the details, I wanted to show you a couple of common use cases in the media and entertainment field. If you look at the architecture here, there are a couple of use cases you could build with our AI/ML services. For example, you can use Rekognition celebrity detection and face matching to generate sequence marker information, and you could feed that into video-editing software to help the editing process. You could also use the moderation API to apply a redaction to any kind of unsafe content. And of course you also have the Transcribe service, which does speech-to-text to generate subtitles, and that is what we're focusing on today. Then, last, you could use Comprehend to extract the sentiment of the subtitle, or you could use Translate to do multi-language translation as well, so you can have different subtitles in multiple languages. At the end, all of this metadata can go into an Elasticsearch engine, and [INAUDIBLE] you could build a search capability across all this metadata. For example, a search for flowers would show all the videos that contain the flower metadata. While there are many different use cases in the AI/ML area, some of them are more tolerant of low-confidence results, for example, search. You could do a search for spaceship, and the returned list of videos might contain a misdetection, but that shouldn't be too much of a problem because it's just a search or recommendation result. But there are also use cases that require much higher accuracy and much higher confidence in the result. For example, if you're generating subtitles with Transcribe, you really want much higher accuracy and confidence in the subtitle. And this is why we're doing this talk: to help you train the Transcribe service with a feedback loop to get better confidence and accuracy in the result. With that, I wanted to show you one of the case studies that we did, which is actually how we started with this prototype. There is a company, an in-home fitness company, and what they do is live-stream their instructor-led classes to their customers. They needed to automate the process of generating subtitles and closed captioning for two reasons. One is that in the US there is a regulatory reason they need to provide subtitles and closed captions, and the second is that they wanted to automate the process to minimize operational cost and overhead. With that, I'm going to introduce Emmanuel, who is going to walk through the demo with you, and after that I will come back to talk about what the architecture looks like behind the scenes. Thank you. >> Hi everybody, and thank you, Ken. Before we dive into the subtitle use case that Ken mentioned, I'd like to give you a quick disclaimer. The AI/ML tools that we have on the AWS platform are no magic trick, so remember that you need to customize them and adapt them to your customer's use case for them to be effective.
With that said, let's have a look at what we have here. This is a platform that Ken developed in order to generate subtitles automatically from a transcription. We already have a video here; I'm just going to upload a second one. There are a few steps that happen when we upload the video. We are going to transcode this video in order to have a proxy format. Basically, we are just extracting the audio track, because the video content is of no use to Transcribe, and we are also generating the transcription from this audio track. So we can see here the different steps that we are going through in order to generate that content. Before I look at this specific content, I'm going to show you a short video that has already been processed in our platform, and I'll come back to it afterward to show you what has been done. I just want you to pay attention to the capitalized words in the subtitles. >> Just wanted to run through a quick proof-of-concept demo. So we have Amazon Connect. We have a call coming in that's going to hit Amazon Connect, and the first thing we're going to do is quickly play a prompt that says press one to start streaming. That's going to be done by Polly. Once the customer presses one, we're going to associate the call, the customer audio, with a Kinesis video stream and start actually streaming that audio. The next thing I'm going to do is invoke a Lambda function to take the KVS details, like the ARN, the start fragment, the timestamp and so forth, and put that in a Dynamo table with the customer's phone number and contact ID. I'm going to have a Java application on my computer that I will then start, and the first thing that that Java application will do is talk to Dynamo and retrieve the Kinesis information based on the phone number that I have programmed in that application. And then it will actually go out to Kinesis Video Streams and start consuming that stream, feed it over to Amazon Transcribe, take the transcription of the customer audio in real time, and put it out on my console. So let's give it a try. >> All right, so in this particular video, the capitalized words are what we've been providing through the feedback loop. We've built our own dictionary, adding some extra words to help Transcribe better recognize a certain set of words. If I go back into the content, for instance here, we have Java application. If we have a look at what was originally identified, we would have found "job application", which is quite different from what we are expecting. There are also other interesting detections here. If I browse my content, I would like you to pay attention to the acronyms. Recognizing an acronym is a very hard task for Transcribe, because acronyms are not actual words, they're abbreviations, and by default Transcribe would not be able to recognize an acronym. So we have a way to work around that caveat, by telling Transcribe to pay attention to a sequence of letters. When we create our custom dictionary, we tell Transcribe to pay attention to the sequence K space V space S, and we spell it that way in the dictionary. This way, when Transcribe parses the audio file, it is going to look for this sequence, and if it matches that sequence, then it's going to use the acronym that has been provided by the custom dictionary.
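The custom dictionary itself isn't shown on screen here, so purely as an illustration of the acronym trick just described: the letters can be spelled out as entries passed to the CreateVocabulary API, with a small post-processing step collapsing them back into the acronym. This is a sketch, not the demo's code; the names, the regular expression, and the exact phrase separators (spaces, periods, or hyphens, depending on the vocabulary format your Transcribe version expects) are assumptions.

```javascript
// Sketch only, using the AWS SDK for JavaScript (v2). Names are illustrative.
const AWS = require('aws-sdk');
const transcribe = new AWS.TranscribeService({ region: 'us-east-1' });

// Spell the acronym out as a sequence of letters so Transcribe listens for it.
// The exact separator depends on the custom-vocabulary format; check the docs.
const phrases = [
  'K. V. S.',
  'Java-application'   // multi-word phrase that was misheard as "job application"
];

transcribe.createVocabulary({
  VocabularyName: 'connect-demo-vocabulary',  // must be unique per account/region
  LanguageCode: 'en-US',
  Phrases: phrases
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Vocabulary state:', data.VocabularyState);
});

// Post-processing idea: collapse the spelled-out letters back into the acronym.
const collapseKvs = text => text.replace(/K\.? ?V\.? ?S\.?/g, 'KVS');
```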
Before I move to the next video, I'd like to show you how we achieved this result. We did it by running several iterations of the Transcribe service with different sets of vocabulary in order to improve the average confidence level of the transcription. If you have a look at the results here, we started at 94.3% average confidence and we ended up at 95.3% average confidence. So, over several iterations with custom sets of vocabulary, we've been able to improve the average confidence level of the transcription by one percent. I'm going to move to the next video and, while I explain to you why this video is useful, I'm also going to paste in the custom dictionary that we're going to use for this particular video. Here you have some examples of the words that we've identified that are going to help us improve the subtitles for this video. I'm just going to submit it, and while this is processed, I would like you to pay attention to what is being said in this video. I'm just going to restart it. >> Okay, tell us which AWS service [CROSSTALK] >> Absolutely, I love Lumberyard, which is the game engine, that's really good. Most likely S3, because the storage is fantastic. And of course you should really Lambda the whole thing. >> Cognito, Cognito [INAUDIBLE] technical way. >> Yes, Cognito is fantastic, API Gateway, >> MediaConvert, >> MediaConvert is quite nice, but it's not my favorite, obviously, like Elastic MapReduce, or EMR. No doubt the database services are good: Redshift, I love Aurora, and I really like all the different flavors of RDS. >> Thank you, David. >> You're welcome. >> So you can see we have a lot of acronyms in this video, and I'd like to go back to the beginning of the video and pause. >> Absolutely, I love Lumberyard, which is the game engine, that's really good, most likely S3, because, >> So here we're talking about S3, and another limitation that we have with the Transcribe service is the recognition of numbers; it just simply doesn't work currently with Transcribe. The way around this is to spell it out as a word. So instead of using the digit, you are going to spell out the number three in writing, so that Transcribe is able to recognize and identify the number. In the particular case of S3, because it's an acronym and a number, you need to spell the S, spell out the word three, and tell Transcribe to look for this specific sequence in order to recognize the acronym. If I look at the result of the transcription that has been done here, you'll be able to see that S3 has now been identified, compared to what was detected before. You can also see that the average confidence level of the recognition is a lot higher once we've used the custom dictionary. So remember, when you are providing a demo to your customer, you need to train Transcribe in order to get a better result. You need to adapt it to the use case of your customer, and training the Transcribe service doesn't need to happen all the time. You will only train the Transcribe service at the beginning, and once you reach an acceptable confidence level you will be able to roll the service into production and let it work by itself. You will always need human interaction to correct some of the words identified by Transcribe, but this tool will help you speed up the transcription of your content. I'm going to give the stage back to Ken, and he will dive deep into what's happening behind the scenes.
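As an aside on where those average-confidence numbers come from: Transcribe returns a confidence value for every recognized word, so the average can be computed directly from the result JSON of each iteration. A minimal sketch follows; the file names are hypothetical, while the result layout is the standard Transcribe output.

```javascript
// Sketch: compute the average word confidence of a Transcribe result,
// so successive feedback-loop iterations can be compared.
const fs = require('fs');

function averageConfidence(resultFile) {
  const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
  // Only "pronunciation" items carry a confidence; punctuation items do not.
  const words = result.results.items.filter(item => item.type === 'pronunciation');
  const total = words.reduce(
    (sum, item) => sum + parseFloat(item.alternatives[0].confidence), 0);
  return words.length ? total / words.length : 0;
}

// Compare two iterations of the feedback loop (hypothetical file names):
console.log('baseline  :', averageConfidence('transcript-no-vocab.json'));
console.log('with vocab:', averageConfidence('transcript-with-vocab.json'));
```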
>> Thank you, Emmanuel. So let us now look at what is happening behind the scenes, as you can see in the architecture here. When Emmanuel uploads the video, this is the path it actually follows. It uploads the video to the S3 bucket, and that triggers a state machine. The state machine calls out to a transcoder, MediaConvert, which extracts the audio stream from the video and then sends the audio bitstream to Transcribe to do the transcription. Once that is done, the second pass is where Emmanuel actually provides the custom vocabulary dictionary to train Transcribe a little bit. This is the path it goes through: it calls out another Step Functions state machine, and the next thing it does is create a vocabulary dictionary with the Transcribe service. Once that is done, it reruns the transcription by starting a transcription job and waiting for it to complete. At the end, we collect the result from Transcribe and convert the Transcribe result into the WebVTT subtitle. One more thing I wanted to mention is that in this particular demo we also use another service, AWS IoT, and we use it as a message broker. It is a publish/subscribe service: the backend posts all its messages to IoT, and all the connected web clients that subscribe to the IoT topic are able to get these messages asynchronously and at the same time. Let's move on and look at the actual implementation of the state machines. What you're looking at right now is the transcoding state machine. As I mentioned earlier, the transcoding state machine does two things. One, it goes through MediaConvert to extract the audio bitstream. Second, once the audio bitstream is extracted, it sends the audio to the Transcribe service so that Transcribe generates the transcription result. That is exactly what is happening here: there is a start-transcode-job state, and then we wait for the job to complete. Once the transcode is completed, we invoke the next sub-state machine, which is the state machine doing the transcription. The reason we do that is so you can reuse the same state machine and build other use cases on top of it. So let's look at the transcribe state machine. When we start the transcribe state machine, the very first thing it checks is whether a custom vocabulary or dictionary has been provided. If not, it just starts the transcription and waits for it to complete. If a custom dictionary or vocabulary is provided, it first creates the vocabulary set in the Transcribe service and waits for the vocabulary to be ready, and once this is done, it goes on to start the Transcribe job as well. With that, I wanted to show you how easy it is in the backend with Lambda functions and Step Functions. This is how you create the vocabulary with Node.js. A couple of things I wanted to highlight: of course, you have to give a vocabulary name, and that vocabulary name needs to be unique. If it's not unique, then instead of creating that vocabulary in Transcribe, you'll be updating the vocabulary. The way you do that is in the parameters.
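Here is a small sketch of that create-or-update logic, assuming the AWS SDK for JavaScript (v2); the error-code check is an assumption about how a duplicate vocabulary name surfaces, not the demo's exact code.

```javascript
// Sketch of the create-or-update vocabulary logic described above.
const AWS = require('aws-sdk');
const transcribe = new AWS.TranscribeService();

async function createOrUpdateVocabulary(baseName, phrases) {
  const params = {
    VocabularyName: baseName,   // must be unique; a reused name means "update"
    LanguageCode: 'en-US',
    Phrases: phrases
  };
  try {
    return await transcribe.createVocabulary(params).promise();
  } catch (err) {
    // If the vocabulary already exists, update it instead of creating it.
    if (err.code === 'ConflictException') {
      return transcribe.updateVocabulary(params).promise();
    }
    throw err;
  }
}
```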
You set the vocabulary name equal to the base name, and the base name will be your unique vocabulary name for that particular video, or maybe a set of videos. The next thing is how you actually start the Transcribe service. Same thing, it's only a few lines of code. First, you need the audio file name, and you set the transcription job name equal to the audio file name. At the same time, as you can see in this particular code, you also want to generate a unique job name for every transcription job you create. After that, you have an option: if you have a custom vocabulary or dictionary, you provide it through the Settings parameter, and then you start the transcription. The Transcribe service will look at this parameter to see whether you have a custom vocabulary it should use, to make it pay attention and focus on those words, or not. The next thing I wanted to show you is the transcription result from Transcribe. A couple of things to highlight: you get a list of items in an array, each item contains the start time and end time of a specific word, and you also get a confidence level along with the content. For example, in this case the word "tell" has a confidence level of 100%, 1.0, the start time is 0, and the end time is 0.25. So how do we take this result and convert it into the WebVTT subtitle? If you look at the WebVTT subtitle format, you also have a start time and an end time, and then the sentence, "tell us which AWS service". What we do is take the list of items returned from Transcribe, stitch the words together, and split them using the start time and end time of each sentence. This is how we create the WebVTT track for the subtitle. The next thing I wanted to show you is IoT, the message broker. As I mentioned earlier, it's a publish/subscribe service, and we use it to send asynchronous messages from the backend to the web client. The way you set it up is quite simple. The first thing you do is create a thing, by calling aws iot create-thing and giving it a name. The next thing is to create an IoT policy; in this particular policy, we give it an action of iot:* and a resource of asterisk. The third thing is to define a message topic. This is the message topic that your web clients, your connected clients, will subscribe to, and the backend will be posting messages to this particular topic. It could be any name of your choice. The very last thing, which is also very important, is that in order for the web client, the connected client, to subscribe to the topic, you need to give permission to that connected client. So you attach a policy to the connected client, and this is how you do it. The way you attach a policy to a Cognito user is to run the command aws iot attach-policy, giving it the policy name that we created earlier, and the target is the unique identifier of the Cognito user. So let's look at how the backend publishes a message to IoT. Here's a little code snippet. If you look at it, what you need is a topic, the topic that we defined earlier; then you have the payload; and you also have the endpoint, which is the IoT endpoint. Then what you do is just call publish, providing that topic and the payload; in this case, the payload is the text message.
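Going back to the transcription-job call described a moment ago: here is a minimal sketch of it, assuming the AWS SDK for JavaScript (v2) and an m4a proxy sitting in S3; the naming scheme is illustrative, not the demo's actual code.

```javascript
// Sketch: start a transcription job, optionally pointing it at a custom vocabulary.
const AWS = require('aws-sdk');
const transcribe = new AWS.TranscribeService();

function startTranscription(audioFileUri, vocabularyName) {
  const params = {
    // Job names must be unique, so derive one from the audio file plus a timestamp.
    TranscriptionJobName: `${audioFileUri.split('/').pop()}-${Date.now()}`,
    LanguageCode: 'en-US',
    MediaFormat: 'mp4',                       // the m4a proxy produced by MediaConvert
    Media: { MediaFileUri: audioFileUri }
  };
  if (vocabularyName) {
    // Tell Transcribe to pay attention to the words in the custom vocabulary.
    params.Settings = { VocabularyName: vocabularyName };
  }
  return transcribe.startTranscriptionJob(params).promise();
}
```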
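And for the item-list-to-WebVTT conversion: the demo's converter isn't shown in full, but a simplified sketch that stitches words into cues and cuts a new cue at sentence-ending punctuation might look like this.

```javascript
// Simplified sketch: turn Transcribe result items into WebVTT cues.
function toTimestamp(seconds) {
  const date = new Date(parseFloat(seconds) * 1000);
  return date.toISOString().substr(11, 12); // HH:MM:SS.mmm
}

function toWebVtt(items) {
  const cues = [];
  let words = [], start = null, end = null;
  for (const item of items) {
    if (item.type === 'pronunciation') {
      if (start === null) start = item.start_time;
      end = item.end_time;
      words.push(item.alternatives[0].content);
    } else if (words.length) {
      // Attach punctuation to the previous word; end the cue on . ? !
      const punct = item.alternatives[0].content;
      words[words.length - 1] += punct;
      if (['.', '?', '!'].includes(punct)) {
        cues.push(`${toTimestamp(start)} --> ${toTimestamp(end)}\n${words.join(' ')}`);
        words = []; start = null;
      }
    }
  }
  if (words.length) {
    cues.push(`${toTimestamp(start)} --> ${toTimestamp(end)}\n${words.join(' ')}`);
  }
  return ['WEBVTT', ...cues].join('\n\n');
}
```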
Next, we will look at how your connected web client actually subscribes to the topic. Before we do that, I highly recommend that you download the AWS IoT JavaScript SDK; here's the link to the GitHub repo. It does all the heavy lifting for you. The way we use it is to construct an instance of the device, providing the host, which is the IoT endpoint; the client ID, which can be any name, in this case the Cognito user; the protocol, which is WSS, web socket; and your usual credentials: access key ID, secret key, and session token. Then you have to listen to two events. The first event is connect: on connect, you call subscribe, and that's where you subscribe to a particular topic, the topic we defined earlier. When you subscribe to the message event, you will get the message whenever the backend posts a message to IoT, and the payload is where you can process the rest of your workflow. Now let's take a look at the other module that we use, MediaConvert, for the transcoding. To create a job, the very first thing you need to do is create an IAM service role, and we'll talk about why we need that. Then we'll define a job template; in our case we are extracting the audio bitstream, so the job template will have an audio-only output. And then we'll submit the job. So the first step is to create the IAM service role. Why is it important? Because you need to give the MediaConvert service permission to access our S3 buckets as well as API Gateway. The way you do it through the command line is aws iam create-role, giving the assume-role permission to MediaConvert, and then you attach the S3 policy to that role as well as the API Gateway policy. What you are actually doing is allowing MediaConvert to assume a role that has access to the S3 buckets as well as permission to call API Gateway. Then you define a job template. In our demo we have two outputs. The very first output, called m4a, is the audio-only output, and we do that so we can feed that audio output to the Transcribe service. We also create a second output, which is the MP4 proxy. Why are we doing that? Because we generate a proxy file: you could upload any kind of video file format, for example a MOV, MP4, MXF, or WMV file, and it will process it and generate a proxy file that you can play back on the web client. And here is a simple JSON file; you can scan this QR code to download the code of the template. The last step is to submit a job. The MediaConvert service actually has a concept of a per-region, per-account endpoint, so what you need to do is find out the exact endpoint where you can submit the job. To do that, you run the command aws mediaconvert describe-endpoints, providing the region, because as I said, it's region specific and account specific. It will return the endpoint to you, and that is the endpoint you will have to use to submit a job.
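Tying together the publish side shown earlier and the subscribe side just described, here is a rough sketch of both, assuming the AWS SDK's IotData client in the backend and the aws-iot-device-sdk package in the web client; the topic name, endpoint variable, and credential handling are placeholders.

```javascript
// Backend side (e.g. inside a Lambda function): publish a status message.
const AWS = require('aws-sdk');
const iotData = new AWS.IotData({ endpoint: process.env.IOT_ENDPOINT });

iotData.publish({
  topic: 'subtitle-demo/status',   // the message topic defined earlier
  qos: 0,
  payload: JSON.stringify({ stage: 'transcribe', status: 'COMPLETED' })
}, err => { if (err) console.error(err); });

// Client side: subscribe over MQTT/WebSocket with aws-iot-device-sdk.
const awsIot = require('aws-iot-device-sdk');

// Temporary credentials, e.g. obtained for the user's Cognito identity.
const credentials = { accessKeyId: '...', secretAccessKey: '...', sessionToken: '...' };

const device = awsIot.device({
  host: process.env.IOT_ENDPOINT,   // the IoT endpoint
  clientId: 'cognito-web-client',   // any unique name
  protocol: 'wss',                  // web socket
  accessKeyId: credentials.accessKeyId,
  secretKey: credentials.secretAccessKey,
  sessionToken: credentials.sessionToken
});

device.on('connect', () => device.subscribe('subtitle-demo/status'));
device.on('message', (topic, payload) => {
  // The payload is the message the backend posted; drive the UI from it.
  console.log(topic, JSON.parse(payload.toString()));
});
```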
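Ken walks through the CLI and Node.js versions of the job submission next; as a sketch of that endpoint-discovery pattern with the AWS SDK for JavaScript (v2), where the role ARN and job settings stand in for whatever your template defines:

```javascript
// Sketch: look up the per-account MediaConvert endpoint, then submit the job to it.
const AWS = require('aws-sdk');

async function submitMediaConvertJob(roleArn, settings) {
  // 1. Ask the regional MediaConvert API for this account's endpoint.
  const generic = new AWS.MediaConvert({ apiVersion: '2017-08-29' });
  const { Endpoints } = await generic.describeEndpoints({ MaxResults: 1 }).promise();

  // 2. Create a client bound to that endpoint and submit the job.
  const mediaConvert = new AWS.MediaConvert({
    apiVersion: '2017-08-29',
    endpoint: Endpoints[0].Url
  });
  return mediaConvert.createJob({
    Role: roleArn,       // the IAM service role with S3 / API Gateway access
    Settings: settings   // the audio-only (m4a) and MP4 proxy outputs
  }).promise();
}
```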
So to actually submit or create a job, you run aws mediaconvert create-job, providing the role, the service role we created earlier that grants access to the S3 buckets; the endpoint, which is the one returned by the describe-endpoints command we ran earlier; as well as the region. Here is the equivalent way of doing it in Node.js. First, you create an instance and call describeEndpoints to get the endpoint. Once this is done, you use that endpoint to create a new instance of the MediaConvert client, and then you call createJob. By doing that, we are telling the MediaConvert instance that instead of using the default endpoint, it should use the endpoint we got from the describeEndpoints API. Okay, there are a few gotchas that I wanted to share with you. Transcribe actually supports MP4 and other file formats, but it has a one-gigabyte file-size limitation, so if you have an MP4 file over one gigabyte, Transcribe will not process it. The way to handle that is to use MediaConvert, Elastic Transcoder, or even FFmpeg to simply extract the audio bitstream from your source file. Transcribe also has another limitation: a two-hour duration limit. The demo doesn't support this today, but what you could do is split a long file into smaller chunks, maybe an hour each, and then run the transcriptions in parallel. Once you get the results back, you can stitch them back together with a post-processing Lambda function. The third example is working with acronyms. You can train Transcribe to recognize acronyms, but the way to do it is to provide Transcribe with a sequence of letters and ask it to pay attention to that sequence. For example, for AWS, you would say A space W space S. In that case, Transcribe will pay much more attention to that particular sequence of letters, and as a result, when you get the results back from Transcribe, you can either remove the spaces yourself or just leave them as they are. In addition to that, you can also train Transcribe to work with numbers. For example, if you have S3 or EC2, what you really want to do is spell it out: for S3, you use S space and then T-H-R-E-E to spell out the number, and create that custom vocabulary. Then, once you get the result, you have a post-processing Lambda function that maps the spelled-out "S three" back into S3, with the number, and puts that into the subtitle track. So if accuracy is the top priority, consider using Amazon Mechanical Turk to outsource the error-checking and final-edit process. Last but not least, also consider creating a dictionary or a collection of vocabularies per instructor or speaker, or per series of movies. For example, if you have Star Wars or Lord of the Rings, you could pre-create those vocabularies and dictionaries and then feed them into the Transcribe service to get a much better transcription result. This concludes the presentation. My name is Ken Shek, and my colleague is Emmanuel Etheve. Thank you for watching.