So now, let's look at using ETL to solve data quality issues. Unless you have specific needs, we recommend that you use Dataflow and BigQuery. What could those needs be?

First, latency and throughput. BigQuery queries are subject to latency on the order of a few hundred milliseconds, less if you're leveraging BI Engine, and you can stream on the order of a million rows per second into a BigQuery table. If your latency and throughput requirements are more stringent, then Cloud Bigtable might be the better sink for your data processing pipeline.

Second, you might want to reuse Spark pipelines. If you already have a significant investment in Hadoop and Spark, you might be more productive in a familiar technology. Use Spark if that's what you know well.

Lastly, you might need visual pipeline building. Dataflow requires you to code data pipelines in Java or Python. If you want data analysts and non-technical users to create data pipelines, use Cloud Data Fusion. They can drag and drop and visually build pipelines.

We'll look at the last two options briefly now, and in greater detail in the remainder of this course. Cloud Dataproc is a managed service for batch processing, querying, streaming, and machine learning. It provides a managed service for Hadoop workloads and is quite cost-effective: only about a cent more than the cost of running the same workload on bare metal, while eliminating all the typical Hadoop maintenance activities. It also has useful features like autoscaling and out-of-the-box integration with GCP products like BigQuery. Among its benefits: it's fast and scalable, open source, fully managed, versioned, integrated with the GCP and open-source ecosystems, and very cost-effective.

Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines. You can use it to populate a data warehouse, but you can also use it for transformations, cleanup, and ensuring data consistency.
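Whichever tool and sink you choose, the transform step of an ETL pipeline is where data quality issues get fixed. As a minimal, tool-agnostic sketch in plain Python (not the actual Beam or Spark APIs; the field names and rules here are hypothetical), a cleanup transform might look like this:

```python
from datetime import datetime

def clean_record(raw):
    """Apply typical data-quality fixes to one raw record.

    Hypothetical rules: reject records missing a required field,
    trim whitespace, normalize casing, and parse timestamps.
    """
    if not raw.get("user_id"):  # required field missing: reject the record
        return None
    return {
        "user_id": raw["user_id"].strip(),
        "country": (raw.get("country") or "unknown").strip().lower(),
        "event_time": datetime.strptime(raw["event_time"], "%Y-%m-%d %H:%M:%S"),
    }

raw_rows = [
    {"user_id": " u1 ", "country": " US ", "event_time": "2024-01-15 12:00:00"},
    {"user_id": "",     "country": "DE",   "event_time": "2024-01-15 12:01:00"},
]

# Keep only the records that survive cleaning.
cleaned = [r for r in (clean_record(row) for row in raw_rows) if r is not None]
```

In a Dataflow pipeline, per-element logic like this would typically run inside a Beam ParDo; in Spark or Data Fusion the equivalent step is a map or wrangler transform.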
Users who are part of the business can build visual pipelines to address business imperatives like regulatory compliance, without having to wait for an IT team to code up a Dataflow pipeline. Data Fusion also has an API to code against, so IT folks can use it to script and automate.

Regardless of which ETL tool you use, whether Dataflow, Dataproc, or Data Fusion, there are some crucial aspects to keep in mind. First, maintaining data lineage is important. What do we mean by lineage? Things like where the data came from, what processes it has been through, and what condition it is in. Lineage tells you what uses the data is suited for, as well as the current condition of the data and the processes it might need to undergo to be suitable for an intended use. If you find the data gives odd results, you can check the lineage to find out if there's a cause that can be corrected. Lineage also helps with trust and regulatory compliance.

The other cross-cutting concern is that you need to keep metadata around. You need a way to track the lineage of data in your organization for discovery and to identify its suitability for users. On Google Cloud, Data Catalog provides discoverability, but you have to do your bit by adding labels. A label is a key-value pair that helps you organize your resources. In BigQuery, you can attach labels to datasets, tables, and views. Labels are useful for managing complex resources because you can filter resources based on their labels.

Labels are a first step toward a data catalog. Among the things labels help with is Cloud Billing. If you attach labels to Compute Engine instances, to buckets, and to Dataflow pipelines, information about those labels is forwarded to the billing system, so you can break down your billing charges by label and get a fine-grained look at your Cloud bill. Think of Data Catalog as metadata-as-a-service.
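To make the billing breakdown concrete, here is a small sketch in plain Python of how key-value labels enable cost attribution. The line items, costs, and label keys are hypothetical; the real mechanism is the labeled billing export, not this code:

```python
from collections import defaultdict

# Hypothetical billing line items; real Cloud Billing exports carry
# resource labels alongside each charge in a similar fashion.
line_items = [
    {"service": "Compute Engine", "cost": 12.50,
     "labels": {"team": "analytics", "env": "prod"}},
    {"service": "Dataflow", "cost": 30.00,
     "labels": {"team": "analytics", "env": "dev"}},
    {"service": "Cloud Storage", "cost": 5.25,
     "labels": {"team": "marketing", "env": "prod"}},
]

def cost_by_label(items, key):
    """Aggregate cost by the value of one label key."""
    totals = defaultdict(float)
    for item in items:
        totals[item["labels"].get(key, "unlabeled")] += item["cost"]
    return dict(totals)

# Break the bill down by team; the same call with "env" would
# break it down by environment instead.
by_team = cost_by_label(line_items, "team")
```

The same grouping logic works for any label key, which is why consistent labeling across instances, buckets, and pipelines pays off.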
It provides metadata management services for cataloging data assets via custom APIs and the UI, thereby providing a unified view of data wherever it is. It supports schematized tags, with typed fields like enum, bool, and datetime, not just simple text tags, giving organizations rich, organized business metadata. It's serverless and requires no infrastructure to set up or manage. Data Catalog empowers users to annotate business metadata in a collaborative manner and provides a foundation for data governance.

So what did you learn in this module? First, you got a quick refresher on when to use EL and ELT. Then you learned about the power of BigQuery SQL to solve many data quality issues and perform transformations. Finally, we discussed the use of ETL for circumstances where EL or ELT might not suffice. Tune in to the following modules to learn more about batch and streaming data pipelines.
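Before moving on, here is a small sketch of the schematized-tag idea from the Data Catalog discussion, modeled in plain Python with a dataclass and an enum. The tag and its field names are hypothetical, not Data Catalog's API; the point is that typed fields can be validated and filtered reliably, unlike free-form text:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

# Hypothetical schematized tag, mirroring how Data Catalog tags carry
# typed fields (enum, bool, datetime) rather than free-form text.
class DataSensitivity(Enum):
    PUBLIC = "public"
    CONFIDENTIAL = "confidential"

@dataclass
class GovernanceTag:
    sensitivity: DataSensitivity  # enum field
    contains_pii: bool            # bool field
    last_reviewed: date           # date/time field

tag = GovernanceTag(
    sensitivity=DataSensitivity.CONFIDENTIAL,
    contains_pii=True,
    last_reviewed=date(2024, 3, 1),
)

# Typed fields support exact comparisons, so governance rules can be
# expressed as code instead of string matching.
needs_review = tag.contains_pii and tag.sensitivity is DataSensitivity.CONFIDENTIAL
```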