Let's look at some of the challenges that a data engineer faces. As a data engineer, you'll usually encounter a few problems when building data pipelines. You might find it difficult to access the data that you need. You might find that the data, even after you access it, doesn't have the quality required by the analytics or machine learning model you plan to build. Even if the data quality is there, you might find that the transformations require computational resources that might not be available to you. Finally, you might run into challenges around query performance and being able to run all of the queries and all of the transformations that you need with the computational resources that you have.

Let's take the first challenge: consolidating disparate datasets and data formats, and managing access at scale. For example, you want to compute the customer acquisition cost: how much does it cost, in terms of marketing, promotions, and discounts, to acquire a customer? That data might be scattered across a variety of marketing products and customer relationship management software. Finding a tool that can analyze all of this data might be difficult because it comes from different organizations, different tools, and different schemas. Maybe some of that data is not even structured. In order to find something as essential to your business as how much acquiring a new customer costs, so that you can figure out what discounts to offer to keep customers from churning, you can't have your data exist in silos.

What makes data access so difficult? Primarily, it's because data in many businesses is siloed by department, and each department creates its own transactional systems to support its own business processes. For example, you might have operational systems that correspond to store systems, and a different operational system, maintained by your product warehouses, that manages your inventory.
You might also have a marketing department that manages all the promotions. Given that you need to run an analytic query such as "give me all the in-store promotions for recent orders and their inventory levels," you need to know how to combine data from the stores, from the promotions, and from the inventory levels. Because these are all stored in separate systems, some of which have restricted access, building an analytic system that uses all three of these datasets to answer an ad hoc query like this can be very difficult.

The second challenge is that cleaning, formatting, and getting the data ready for insights requires you to build ETL pipelines. ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake but in a data warehouse. A data warehouse is a consolidated place to store the data, where all the data are easily joinable and queryable. Unlike a data lake, where the data is in its raw format, in a data warehouse the data is stored in a way that makes it efficient to query. Because data becomes useful only after you clean it up, you should assume that any raw data you collect from source systems needs to be cleaned and transformed. And if you're transforming it anyway, you might as well transform it into a format that makes it efficient to query. In other words, ETL the data and store it in a data warehouse.

Let's say you're a retailer and you have to consolidate data from multiple source systems. Think about what the use case is. Suppose the use case is to get the best-performing in-store promotions in France. You need to get the data from the stores and you have to get the data from the promotions. But perhaps the store data is missing information. Maybe some of the transactions are in cash, and for those there is no information on who the customer is. Or some transactions might be spread over multiple receipts.
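As a tiny illustration of the kind of clean-up such an ETL pipeline performs, here is a sketch in Python that merges split receipts belonging to the same customer and sets aside anonymous cash transactions. The record layout, field names, and values are all invented for illustration:

```python
# Hypothetical raw receipts exported from a store system;
# every field name and value here is invented for illustration.
raw_receipts = [
    {"receipt_id": "r1", "customer_id": "c42", "amount": 20.0},
    {"receipt_id": "r2", "customer_id": "c42", "amount": 5.0},   # same customer, split receipt
    {"receipt_id": "r3", "customer_id": None,  "amount": 12.5},  # cash sale, no customer info
]

def consolidate(receipts):
    """Merge receipts per known customer; keep anonymous cash sales separate."""
    by_customer = {}
    anonymous = []
    for r in receipts:
        cid = r["customer_id"]
        if cid is None:
            anonymous.append(r)  # cannot attribute: set aside for downstream handling
        else:
            by_customer[cid] = by_customer.get(cid, 0.0) + r["amount"]
    return by_customer, anonymous

totals, unattributed = consolidate(raw_receipts)
print(totals)        # {'c42': 25.0}
print(unattributed)  # the one cash receipt with no customer
```

Doing this once, and storing the consolidated result, is exactly the "clean it up and make it queryable" step that a warehouse-bound ETL pipeline performs.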
You might need to combine those transactions because they come from the same customer. Or perhaps the timestamps of the transactions are stored in local time, whereas your stores are spread across the globe, and so before you do anything, you need to convert everything into UTC. Similarly, the promotions may not be stored in the transaction database at all. They might be just a text file that somebody loads onto a web page, with a list of codes that are used by the web application to apply discounts. It can be extremely difficult to run a query like finding the best-performing in-store promotions because the data has so many problems. Whenever you have data like this, you need to take the raw data and transform it into a form with which you can actually carry out the necessary analysis. It is obviously best if you can do this clean-up and consolidation just once and store the resulting data to make further analysis easy. That's the point of a data warehouse.

If you need to do so much consolidation and clean-up, a common problem that arises is where to carry out this compute. The availability of computational resources can be a challenge. If you're on an on-premises system, data engineers will need to manage server and cluster capacity and make sure that enough capacity exists to carry out the ETL jobs. The problem is that the compute needed by these ETL jobs is not constant over time. Very often, it varies week to week, depending on factors like holidays and promotional sales. This means that when traffic is low, you're wasting money, and when traffic is high, your jobs are taking way too long. Once your data is in your data warehouse, you also need to optimize the queries your users are running to make the most efficient use of your compute resources.
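Returning briefly to the timestamp problem mentioned above: normalizing local times to UTC is a routine transformation that can be sketched with Python's standard `zoneinfo` module (Python 3.9+). The timestamps and time zones below are invented for illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

# Hypothetical transaction timestamps stored in each store's local time,
# tagged with the store's IANA time zone name; values invented for illustration.
transactions = [
    {"ts": "2024-03-01 14:30:00", "tz": "Europe/Paris"},
    {"ts": "2024-03-01 09:15:00", "tz": "America/New_York"},
]

def to_utc(local_ts: str, tz_name: str) -> str:
    """Interpret a naive local timestamp in the given zone and convert it to UTC."""
    local = datetime.strptime(local_ts, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=ZoneInfo(tz_name)
    )
    return local.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d %H:%M:%S")

normalized = [to_utc(t["ts"], t["tz"]) for t in transactions]
print(normalized)  # ['2024-03-01 13:30:00', '2024-03-01 14:15:00']
```

Once every timestamp is in UTC, transactions from stores around the globe become directly comparable and joinable.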
If you're managing an on-premises data analytics cluster, you will be responsible for choosing a query engine, installing the query engine software and keeping it up to date, as well as provisioning more servers for additional capacity. Isn't there a better way to manage server overhead so that we can focus on insights?