- [Morgan] At this point in the course, you know which services are generally used for data storage and data movement. But what about data processing and analytics? You'll certainly want to actually use the data once it has been ingested and stored. In this video, we'll talk at a high level about which services you could use for processing and analytics.

First, let's discuss the idea of transforming data. It's common to need to tweak the data in your data lake before analyzing it. You might find that the data you are ingesting is not in a very useful format, that it includes things like line breaks your tooling can't handle, or that it is dirty and full of errors such as duplicates or missing fields. You might want to process your data in batches, or live, as the data comes in from stream ingestion; it depends on the use case.

Let's focus on processing data in batches first. To process data in batches, you can create scripts that fix errors, reformat, or otherwise process the data after it has been uploaded into your data lake. Using a Hadoop cluster for this kind of data processing is common. Apache Hadoop is an open-source framework used to efficiently store and process large datasets, ranging in size from gigabytes to petabytes. Instead of using one big, powerful computer to store and process the data, Hadoop lets you cluster multiple computers together to analyze massive datasets in parallel, quickly.

Let's say you have an existing Hadoop cluster that you've been running on premises, and now you want to shift to AWS using the same tooling in order to minimize refactoring. In this case, you can use Amazon EMR. Amazon EMR is a managed cluster platform that allows you to process and analyze the data in your data lake. Unlike on-premises clusters with their fixed infrastructure, EMR decouples compute and storage, giving you the ability to scale each independently by using Amazon S3 as the storage layer. Amazon EMR is great for processing data in batches because it lets you process massive amounts of data efficiently, and it's likely that you will be working with massive amounts of data in your data lake.

With EMR, you can create long-running or transient clusters. The benefit of a transient cluster is that you can spin it up, and it shuts down when your processing job is complete, so you only pay for what you use. In contrast, persistent clusters continue to run after data processing is complete. If you determine that your cluster would be idle the majority of the time, it is best to use transient clusters. For example, if you have a batch processing job that pulls your web logs from Amazon S3 and processes the data once an hour, it is probably more cost effective to use a transient cluster to process your data and then shut down the nodes when the processing is complete.
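To make that concrete, here is a minimal sketch of what launching a transient cluster like that could look like with the AWS SDK for Python (boto3). The bucket names, script path, instance types, and counts are hypothetical placeholders, not values from this course.

```python
import boto3

# Minimal sketch: launch a transient EMR cluster that runs one Spark
# step and then terminates. Bucket names and the script path below
# are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hourly-web-log-processing",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-example-bucket/emr-logs/",  # hypothetical bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False makes the cluster transient: it terminates
        # automatically once all steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "Process web logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit",
                     "s3://my-example-bucket/scripts/process_logs.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

The key setting is `KeepJobFlowAliveWhenNoSteps: False`, which is what makes the cluster transient rather than long-running.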
So although EMR manages the Hadoop cluster for you, wouldn't it be really nice if you could just run your processing jobs without worrying about the configuration and management of a cluster? This is where AWS Glue can be handy. You already know about the AWS Glue Data Catalog, but Glue has other features that you can use in your data lake. I mentioned in a previous video that Glue is a managed ETL tool that makes it easier to categorize your data, transform it, enrich it, and move it reliably between various data sources and data streams.

You also know about the classifiers and the crawlers, and that you can use the output created by the Glue crawlers and classifiers to run your ETL jobs. But let's talk more about what this really means. AWS Glue has a concept called jobs. Jobs are the business logic required to perform data processing work, the type of work you would otherwise do with an Amazon EMR or Hadoop cluster. With Glue jobs, instead of creating a cluster to run your processing logic on, you provide a set of configurations. A job is composed of data sources and data targets, which come from the tables created in the AWS Glue Data Catalog, along with a transformation script and other customizations that you provide. A job runs on a schedule or is triggered by an event. When the job runs, it transforms or moves your data, and Glue handles all of the orchestration and management of the underlying infrastructure needed to run your data processing logic. Think of it as serverless Hadoop. Amazon EMR gives you greater control over the entire process, but Glue jobs are a very convenient tool. We will dive deeper into Glue later in this course.

Now, let's explore another option for processing, with a focus on real-time data. Imagine you are collecting data in real time using one of the Kinesis services, either Kinesis Data Streams or Kinesis Data Firehose. The data you are collecting is raw data, and it might not be in the best format for analysis. In this scenario, you want to transform the data as it comes in, instead of running processing jobs on it in batches later. What options exist for this use case? The first service that comes to mind is AWS Lambda. AWS Lambda is a serverless compute service that runs your code on demand in response to triggers, and Kinesis Data Streams is an available trigger for your Lambda functions. This allows you to pre-process the data before sending it downstream to its ultimate destination.
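As a rough illustration, here is a minimal sketch of what such a pre-processing Lambda handler might look like. The cleanup logic, the `user_id` and `message` field names, and the downstream destination are hypothetical examples, not part of this course's material.

```python
import base64
import json

# Minimal sketch of a Lambda handler triggered by Kinesis Data Streams.
# The specific cleanup rules (dropping records with a missing field and
# stripping line breaks) are hypothetical examples of pre-processing.
def lambda_handler(event, context):
    cleaned = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)

        # Skip dirty records that are missing a required field.
        if "user_id" not in data:
            continue

        # Strip line breaks that downstream tooling can't handle.
        if isinstance(data.get("message"), str):
            data["message"] = data["message"].replace("\n", " ")

        cleaned.append(data)

    # In a real function, you would send the cleaned records on to a
    # downstream destination, for example another stream or Amazon S3.
    print(f"Cleaned {len(cleaned)} of {len(event['Records'])} records")
    return {"processed": len(cleaned)}
```

Because the function is invoked per batch of stream records, the transformation happens as the data flows in, with no cluster or job scheduling to manage.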