Let's start with a discussion about what data lakes are and where they fit in as a critical component of your overall data engineering ecosystem. What is a data lake, after all? It's a fairly broad term, but it generally describes a place where you can securely store various types of data, at any scale, for processing and analytics. Data lakes are typically used to drive data analytics, data science, and ML workloads, or batch and streaming data pipelines. Data lakes will accept all types of data. Finally, data lakes are portable: they can live on-premises or in the Cloud. Here is where data lakes fit into the overall data engineering ecosystem for your team. You have to start with some originating system or systems that are the source of all your data: those are your data sources. Then, as a data engineer, you need to build reliable ways of retrieving and storing that data: those are your data sinks. The first line of defense in an enterprise data environment is your data lake. Again, it's the central "give me whatever data you've got, at whatever volume, variety of formats, and velocity, and I can take it" destination. We'll cover the key considerations and options for building a data lake in this module. Once your data is off the source systems and inside your environment, considerable cleanup and processing is generally required to transform that data into a useful format for the business. It will then end up in your data warehouse; that's our focus for the next module. What actually performs the cleanup and processing of data? Those are your data pipelines. They are responsible for doing the transformations and processing on your data at scale, and they bring your entire system to life with freshly processed data for analysis. An additional abstraction layer above your pipelines is what I will call your overall workflow. You will often need to coordinate efforts between many different components at a regular or event-driven cadence.
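To make that flow concrete, here is a minimal, purely illustrative Python sketch of the path just described: source systems land raw records in the lake as-is, and a pipeline transforms them into warehouse-ready rows. Every function and field name here is hypothetical, not a real Google Cloud API.

```python
# Hypothetical sketch: source systems -> data lake (raw) -> pipeline -> warehouse.
# Plain lists stand in for storage; no real Cloud services are called.

def ingest_to_lake(source_records, lake):
    """Land raw records in the lake exactly as produced; no schema enforced."""
    lake.extend(source_records)

def transform(raw_record):
    """Clean one raw record into the structure the warehouse expects."""
    return {"user": raw_record["u"].strip().lower(),
            "amount": float(raw_record["amt"])}

def pipeline_to_warehouse(lake, warehouse):
    """The pipeline step: process each raw record and load it downstream."""
    for raw in lake:
        warehouse.append(transform(raw))

lake, warehouse = [], []
ingest_to_lake([{"u": " Ada ", "amt": "9.50"}, {"u": "BOB", "amt": "3"}], lake)
pipeline_to_warehouse(lake, warehouse)
print(warehouse)  # structured rows, ready for analysis
```

Note how the lake keeps the messy raw records untouched while only the warehouse copy is cleaned; that separation is the core of the architecture described above.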
While your data pipeline may process data from your lake to your warehouse, your orchestration workflow may be the one responsible for kicking off that data pipeline in the first place when it notices that new raw data is available from a source. Before we move into which Cloud products can fit which roles, I want to leave you with an analogy that helps disambiguate these components. Picture yourself in the world of civil engineering for a moment. You're tasked with building an amazing skyscraper in a downtown city. Before you break ground, you need to ensure you have all the raw materials you're going to need to achieve your end objective. Sure, some materials could be sourced later in the project, but let's keep this example simple. The act of bringing the steel, the concrete, the water, the wood, the sand, and the glass from their sources elsewhere in the city onto your construction site is analogous to data coming from source systems into your lake. Great. Now you have all these raw materials, but you can't use them as is to build your building. You need to cut the wood and metal and measure and shape the glass before they are suited to the purpose of building the building. The end result, the cut glass and shaped metal, is the formatted data that is stored in your data warehouse. It is ready to be used to directly add value to your business, which in our analogy is building the building. How did you transform these raw materials into useful pieces? On a construction site, that's the job of the worker. As you'll see later when we talk about data pipelines, the individual unit behind the scenes is literally called a worker: a virtual machine that takes some small piece of data and transforms it for you. What about the building itself? That's whatever end goal or goals you have for this engineering project.
In the data engineering world, the shiny new building could be a brand new analytical insight that wasn't possible before, or a machine learning model, or whatever else you want to achieve now that you have the cleaned data available. The last piece of the analogy is the orchestration layer. On a construction site, you have a manager or supervisor who directs when work is to be done and manages any dependencies. They could say, once the new metal gets here, send it to this area of the site for cutting and shaping, and then alert this other team that it's available for building. In the data engineering world, that's your orchestration layer, or overall workflow. You might say, every time a new piece of CSV data drops into this Cloud Storage bucket, I want you to automatically pass it to our data pipeline for processing, and once it's done processing, I want you, the pipeline, to stream it into the data warehouse. Once it's in the data warehouse, I will notify the machine learning model that new cleaned training data is available for training and direct it to start training a new model version. Can you see the graph of actions building? What if one step fails? What if you want to run that every day? You're beginning to see the need for an orchestrator, which in our solution will be Apache Airflow running on Cloud Composer, as you'll see later. Let's bring back one example solution architecture diagram that you saw earlier in the course. The data lake here is Cloud Storage buckets, right in the center of the diagram. It's your consolidated location for raw data that is durable and highly available. In this example, our data lake is Cloud Storage, but that doesn't mean Cloud Storage is your only option for data lakes. Cloud Storage is one of a few good options to serve as a data lake. In other examples we will look at, BigQuery may be both your data lake and your data warehouse, and you won't use Cloud Storage buckets at all.
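As a rough sketch of that graph of actions, here is a toy, dependency-ordered workflow in plain Python. In the actual solution this role is played by an Apache Airflow DAG on Cloud Composer; the task names and the simple failure handling below are hypothetical stand-ins for Airflow operators and retries, not real Airflow code.

```python
# Illustrative only: a toy orchestration chain mirroring the workflow above.
# Each step runs only after its upstream step succeeds.

def process_csv(path):           # the data pipeline step
    return f"processed:{path}"

def load_to_warehouse(data):     # stream results into the warehouse
    return f"loaded:{data}"

def trigger_training(table):     # notify the ML model that clean data exists
    return f"training_on:{table}"

TASKS = [process_csv, load_to_warehouse, trigger_training]

def run_workflow(new_file):
    """Run tasks in dependency order; stop (and report) on the first failure."""
    result = new_file
    for task in TASKS:
        try:
            result = task(result)
        except Exception:
            # A real orchestrator like Airflow would retry, alert, and record state.
            return f"failed at {task.__name__}"
    return result

print(run_workflow("gs://my-bucket/new_data.csv"))
```

The questions in the transcript (what if one step fails? what if it runs daily?) are exactly what an orchestrator adds on top of this bare chain: retries, scheduling, and alerting.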
This is why it's so important to understand what you want to do first and then find which solutions best meet your needs. Regardless of which Cloud tools and technologies you use, your data lake generally serves as that single consolidated place for all your raw data. Think of it as a durable staging area. The data may end up in many other places, like a transformation pipeline that cleans it up and moves it to the warehouse, where it's then read by a machine learning model, but it all starts with getting that data into your lake first. Let's do a quick overview of some of the core Google Cloud big data products that you need to know as a data engineer and that you'll practice with later in your labs. Here is a list of big data and ML products organized by where you would likely find them in a typical data processing workload: from storing the data on the left, to ingesting it into your Cloud-native tools for analysis, training machine learning models, and serving up insights. In this data lake module, we will focus on two of the foundational storage products which will make up your data lake: Cloud Storage, and Cloud SQL for your relational data. Later in the course, you will practice with Cloud Bigtable as well, when you build high-throughput streaming pipelines. You may be surprised not to see BigQuery in the storage column. Generally, BigQuery is used as a data warehouse. What's the core difference between a data lake and a data warehouse, then? A data lake is essentially the place where you capture every aspect of your business operations. Because you want to capture every aspect, you tend to store the data in its natural raw format, the format in which it is produced by your application. So you may have a log file, and that log file is stored as is in the data lake. You can basically store anything that you want, and because you want to store it all, you tend to store these things as object blobs or files.
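To illustrate that "store it as produced" property, here is a tiny hypothetical sketch. A plain dict stands in for a Cloud Storage bucket (this is not the real client library): the raw log bytes are written unchanged, with no parsing and no schema applied.

```python
# Hypothetical sketch of lake-style object storage: raw bytes in, raw bytes out.

def write_to_lake(bucket: dict, object_name: str, raw_bytes: bytes):
    """Stand-in for an object store upload, e.g. to a Cloud Storage bucket."""
    bucket[object_name] = raw_bytes  # stored exactly as the application produced it

log_line = b'127.0.0.1 - - [10/Oct/2024:13:55:36] "GET /index.html" 200\n'
lake_bucket = {}
write_to_lake(lake_bucket, "logs/app/2024-10-10.log", log_line)

# Nothing was parsed or altered on the way in:
assert lake_bucket["logs/app/2024-10-10.log"] == log_line
```

Because nothing is interpreted at write time, any file format the application emits is accepted, which is exactly the flexibility (and, as we'll see next, the problem) of a data lake.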
The advantage of the data lake's flexibility as a central collection point is also its problem. With a data lake, the data format is very much driven by the application that writes the data, and it is in whatever format that happens to be. The upside is that whenever the application gets upgraded, it can start writing the new data immediately, because the lake is just a capture of whatever raw data exists. How do you take this flexible and large amount of raw data and do something useful with it? Enter the data warehouse. A data warehouse, on the other hand, is much more thoughtful. You might load data into a data warehouse only after you have a schema defined and the use case identified. You might take the raw data that exists in a data lake and transform it, organize it, process it, clean it up, and then store it in the data warehouse. Why do you load data into a data warehouse? Because the data in the data warehouse is used to generate charts, reports, dashboards, and so on. The idea is that because the schema is consistent and shared across all of the applications, someone can analyze the data and derive insights from it much faster. A data warehouse tends to hold structured and semi-structured data that is organized and stored in a format conducive to querying and analysis.
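As a contrast with the lake's anything-goes storage, here is a hypothetical illustration of that schema-first idea: the warehouse only accepts rows matching a previously agreed schema, which is what lets everyone query it consistently. The schema, fields, and helper below are invented for illustration, not a real warehouse API.

```python
# Hypothetical schema-on-load sketch: rows are validated before they land
# in the warehouse, unlike the lake, which accepts anything as-is.

WAREHOUSE_SCHEMA = {"timestamp": str, "path": str, "status": int}

def load_row(warehouse: list, row: dict):
    """Reject any row that doesn't match the agreed schema before loading."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"schema mismatch: {sorted(row)}")
    for field, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(row[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    warehouse.append(row)

warehouse = []
load_row(warehouse, {"timestamp": "2024-10-10T13:55:36",
                     "path": "/index.html",
                     "status": 200})
# load_row(warehouse, {"raw": "unparsed log line"}) would raise ValueError:
# raw lake-format data is not allowed in without transformation first.
```

The validation step is the "thoughtful" part of the warehouse: the cost of transforming and checking data up front buys a shared, consistent schema that makes downstream charts, reports, and dashboards fast to build.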