Let's look at some of the challenges that a data engineer faces. As a data engineer, you'll usually encounter a few problems when building data pipelines. You might find it difficult to access the data that you need. You might find that the data, even after you access it, doesn't have the quality required by the analytics or machine learning model you plan to build. Even if the data quality is there, you might find that the transformations require computational resources that aren't available to you. And finally, you might run into challenges around query performance and being able to run all of the queries and all of the transformations that you need with the computational resources that you have.

Let's take the first challenge: consolidating disparate data sets and data formats, and managing access at scale. For example, suppose you want to compute the customer acquisition cost: how much does it cost, in terms of marketing, promotions, and discounts, to acquire a customer? That data might be scattered across a variety of marketing products and customer relationship management software, and finding a tool that can analyze all of it might be difficult, because the data comes from different organizations, different tools, and different schemas, and some of it may not even be structured. So to answer something as essential to your business as what acquiring a new customer costs, so that you can figure out what kind of discounts to offer to keep customers from churning, you can't have your data exist in silos.

So what makes data access so difficult? Primarily, it's because data in many businesses is siloed by department, and each department creates its own transactional systems to support its own business processes. For example, you might have operational systems that correspond to your store systems, a different operational system maintained by your product warehouses that manages your inventory, and a marketing department that manages all the promotions. Now suppose you need to run an analytic query such as: give me all the in-store promotions for recent orders and their inventory levels. You need to combine data from the stores, from the promotions, and from the inventory levels. Because these are all stored in separate systems, some of which have restricted access, building an analytics system that uses all three of these data sets to answer an ad hoc query like this can be very difficult. We'll sketch what such a join might look like in a moment.

The second challenge is that cleaning and formatting the data and getting it ready for insights requires you to build ETL pipelines. ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake but in a data warehouse. A data warehouse is a consolidated place to store the data, where all the data are easily joinable and queryable. Unlike a data lake, where the data is kept in its raw format, the data in a data warehouse is stored in a way that makes it efficient to query. Because data becomes useful only after you clean it up, you should assume that any raw data you collect from source systems needs to be cleaned and transformed. And if you are transforming it, you might as well transform it into a format that makes it efficient to query. In other words, ETL the data and store it in a data warehouse.

Let's say you're a retailer and you have to consolidate data from multiple source systems.
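Here is what that kind of cross-system join might look like once extracts from the three departments sit in one place. This is a minimal sketch, assuming hypothetical file names and column names (such as promo_code, product_id, and order_ts); it is not the output of any particular product.

```python
# Minimal sketch of the cross-system join described above. The extracts,
# file names, and column names are illustrative assumptions.
import pandas as pd

# Extracts pulled separately from each department's siloed system.
orders = pd.read_csv("store_orders.csv")      # order_id, store_id, product_id, promo_code, order_ts
promotions = pd.read_csv("promotions.csv")    # promo_code, description, discount_pct, channel
inventory = pd.read_csv("inventory.csv")      # product_id, warehouse_id, units_on_hand

# "In-store promotions for recent orders and their inventory levels":
# start from recent orders that carried an in-store promotion code...
orders["order_ts"] = pd.to_datetime(orders["order_ts"])
recent = orders[orders["order_ts"] >= pd.Timestamp("2024-01-01")]
in_store = recent.merge(
    promotions[promotions["channel"] == "in_store"], on="promo_code", how="inner"
)

# ...then attach total inventory on hand for each product involved.
result = in_store.merge(
    inventory.groupby("product_id", as_index=False)["units_on_hand"].sum(),
    on="product_id",
    how="left",
)
print(result[["order_id", "promo_code", "description", "units_on_hand"]].head())
```

The hard part in practice is not the join itself but getting all three extracts out of their restricted systems and into a common format in the first place, which is exactly the consolidation problem described above.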
Think about what the use case is. Suppose the use case is to get the best-performing in-store promotions in France. You need to get the data from the stores, and you have to get the data from the promotions. But perhaps the store data is missing information. Maybe some of the transactions are in cash, and for those there is no information on who the customer is. Or some transactions might be spread over multiple receipts, and you need to combine those transactions because they come from the same customer. Or perhaps the timestamps of the transactions are stored in local time, whereas your stores are spread across the globe, so before you do anything you need to convert everything into UTC. Similarly, the promotions may not be stored in the transaction database at all. They might be just a text file that somebody loads onto a web page, holding a list of codes that the web application uses to apply discounts. It can be extremely difficult to run a query like finding the best-performing in-store promotions because the data has so many problems. Whenever you have data like this, you need to take the raw data and transform it into a form with which you can actually carry out the necessary analysis; we'll sketch those cleanup steps at the end of this section. It is obviously best if you can do this sort of cleanup and consolidation just once and store the resulting data to make further analysis easy. That's the point of a data warehouse.

If you need to do so much consolidation and cleanup, a common problem that arises is where to carry out this compute. The availability of computational resources can be a challenge. If you're on an on-premises system, data engineers need to manage server and cluster capacity and make sure that enough capacity exists to carry out the ETL jobs. The problem is that the compute needed by these ETL jobs is not constant over time. Very often it varies week to week, depending on factors like holidays and promotional sales. This means that when traffic is low, you're wasting money, and when traffic is high, your jobs are taking way too long.

Once your data is in your data warehouse, you need to optimize the queries your users are running to make the most efficient use of your compute resources. If you're managing an on-premises data analytics cluster, you're responsible for choosing a query engine, installing the query engine software and keeping it up to date, and provisioning any more servers for additional capacity. Isn't there a better way to manage server overhead so we can focus on insights?
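To ground the cleanup steps from the France promotions example, here is a minimal sketch of what such an ETL pass might look like. The file names, column names, time zones, and the one-day rule for merging split receipts are all illustrative assumptions, not a prescribed implementation.

```python
# Minimal ETL sketch for the France promotions example. All inputs and
# column names are hypothetical.
import pandas as pd

# Raw store transactions: receipt_id, customer_id, store_id, store_country,
# store_tz, promo_code, amount, local_ts
tx = pd.read_csv("raw_transactions.csv")

# 1. Normalize timestamps: convert each store's local time to UTC.
tx["local_ts"] = pd.to_datetime(tx["local_ts"])
tx["utc_ts"] = tx.apply(
    lambda r: r["local_ts"].tz_localize(r["store_tz"]).tz_convert("UTC"), axis=1
)

# 2. Cash transactions may have no customer id; keep them, but flag them so
#    customer-level analysis can exclude them later.
tx["has_customer"] = tx["customer_id"].notna()

# 3. One purchase may be split across several receipts; roll receipts from
#    the same customer at the same store on the same day into one order
#    (an assumed rule for this sketch).
tx["order_day"] = tx["utc_ts"].dt.date
orders = tx.groupby(
    ["customer_id", "store_id", "store_country", "order_day", "promo_code"],
    dropna=False, as_index=False,
)["amount"].sum()

# 4. The in-store promotion codes live in a plain text file, one code per
#    line, rather than in the transaction database.
with open("in_store_promo_codes.txt") as f:
    promo_codes = {line.strip() for line in f if line.strip()}

# 5. With clean, consolidated data, the question becomes a simple
#    aggregation: best-performing in-store promotions in France by revenue.
france = orders[
    (orders["store_country"] == "FR") & (orders["promo_code"].isin(promo_codes))
]
best = france.groupby("promo_code")["amount"].sum().sort_values(ascending=False)
print(best.head())
```

Running steps like these once, and writing the cleaned result into the data warehouse, is what keeps every downstream analysis from having to repeat the same cleanup; the open question that remains is where the compute for these jobs comes from, which is the challenge discussed above.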