Welcome back. In this lesson, we'll be discussing the history of data management. In particular, we'll discuss two major trends, data warehouses and data lakes, and along the way we'll also talk about data models. Now, there's a well-known statistic in data science that data scientists spend about 80 percent of their time on data management. Or, to put it another way, data scientists spend about 80 percent of their time cleaning data and 20 percent of their time complaining about cleaning data. Jokes aside, these issues really matter because they set the foundation for any downstream task. By the end of this lesson, you'll be able to explain the major trends in how we manage data.

If you're a fan of the sci-fi author William Gibson, you might have heard this quote before: "The future is already here. It's just not evenly distributed." When I began working with data about half a decade or so ago, it was really only the big five tech companies that were using data well. Companies other than Amazon, Google, and a few others were really struggling with data. Here are some stats to reinforce that: there's great value in advanced analytics like artificial intelligence, but the majority of data projects fail. Why do they fail? Well, often it has to do with the data.

As I mentioned in the module introduction, the early days of data meant building your own data center and then buying proprietary software to store your data. That meant databases, but also data warehouses. While databases are used for online transactions, data warehouses are common for all sorts of business intelligence, or BI, applications, like nightly reports on various business outcomes. There are clear downsides to data warehouses, though. First, they are often expensive and closed source. Second, they really only support structured data: you couldn't work with images, videos, or other unstructured data formats, and semi-structured data types like JSON also pose some challenges. Finally, these systems were intended for reporting, not more advanced workloads like machine learning.

Enter data lakes. Recall that data warehouses were really created for your own custom data center. Then the cloud came along, where anybody could rent resources in a flexible way from companies like Amazon and Microsoft. That meant, among other things, scalable and cheap storage. This enabled data lakes, where you can land data in any format you need. They support machine learning workloads because they're so flexible, and they integrate better with open formats like Parquet and Delta. But there are downsides too. Namely, how can you ensure that your data is what you expect it to be? If you can't trust that data, you'll have poor support for BI and little ability to govern the data you do have. Because of that flexibility, data lakes can easily become data swamps.

Now, to make it even more complex, there's more to consider than just data warehousing and data science workloads. On the bottom left-hand side of the screen, we have a typical pipeline for working with data warehouses: you start with structured data, load it into your data warehouse, and then create data marts to serve different business use cases. Then your data engineers are working on their ETL pipelines: they take some raw data, transform it, and load it into some target databases (a rough sketch of such a step follows below). If you wanted to do streaming, you'd need a different technology stack to land your data and make it available.
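To make that extract-transform-load idea a bit more concrete, here is a minimal sketch in Python. It isn't taken from any particular tool discussed in this course; the file name, table name, and column names are all made up for illustration.

```python
# A minimal, illustrative ETL step (all names and paths are hypothetical).
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: fix types and drop rows with missing amounts."""
    cleaned = []
    for row in rows:
        if row.get("amount"):
            cleaned.append({"order_id": row["order_id"],
                            "amount": float(row["amount"])})
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the cleaned rows into a target table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
        conn.executemany(
            "INSERT INTO orders VALUES (:order_id, :amount)", rows)

# Chained together, the pipeline would run like this:
# load(transform(extract("raw_orders.csv")))
```

Real ETL pipelines add scheduling, error handling, and monitoring on top of this, but the three stages themselves are exactly this simple in outline.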
Finally, for data science, you would need separate data lakes to enable those workloads. Now, try to stitch all of this together. Each of these data stacks has its own technologies, and integrating them all is no easy task, to put it lightly. Each comes with its own protocols and limitations, and keeping everything up to date is truly a nightmare. In brief, this means that most enterprises struggle with data because these various systems are siloed from each other. The lakehouse paradigm solves many of these problems, and we'll discuss it in the next video. Suffice it to say for now that lakehouses join the robustness and guarantees of data warehouses with the flexibility, scalability, and cheap storage of data lakes.

But before we talk about lakehouses, I also want to address data models. You can structure data in many different ways. We'll start with the model most commonly used in databases: the relational model. This is what's used in traditional RDBMSs, or relational database management systems. The core idea is that you normalize your data into different tables that you can join back together using keys. This reduces data redundancy and improves the integrity of your data. While this works well in many contexts, creating a data model can take a good deal of time, and updating that model often poses challenges.

Next up is NoSQL. These are non-relational models. There are many approaches out there, but document stores, for storing newspaper articles, for instance, and key-value stores are common NoSQL databases. Beyond relational and non-relational models, there's also the idea of query-first design. This is common in distributed environments. In this case, you model your data around the queries you need to optimize. If you know you'll need a user's name in multiple tables, it's okay to copy it into each of those locations, because doing so can improve your query speed.

Then finally, there are star schemas. These are used in data warehouses, where you organize your data into fact and dimension tables. A fact could be an individual event, like a sales transaction, while dimension tables hold related information, like geographical or time attributes. A small sketch at the end of this lesson shows what that looks like.

Now, this was a high-level introduction to data management and some of its complexities. In the next video, we'll discuss the lakehouse paradigm as a way of overcoming many of the limitations of these various design patterns by balancing flexibility and reliability.
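Here is the star schema sketch mentioned above, using Python's built-in sqlite3 module. The table and column names are invented for illustration rather than taken from the course; the point is simply that a fact table of sales events references dimension tables via keys, and BI queries join them back together.

```python
# A tiny star-schema sketch (table and column names are hypothetical).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/where/when" of an event.
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_date  (date_id  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- The fact table records individual events (here, sales transactions)
    -- and points to the dimensions via foreign keys.
    CREATE TABLE fact_sales (
        sale_id  INTEGER PRIMARY KEY,
        store_id INTEGER REFERENCES dim_store(store_id),
        date_id  INTEGER REFERENCES dim_date(date_id),
        amount   REAL
    );

    INSERT INTO dim_store VALUES (1, 'EMEA'), (2, 'AMER');
    INSERT INTO dim_date  VALUES (20240101, 2024, 1);
    INSERT INTO fact_sales VALUES (1, 1, 20240101, 99.50), (2, 2, 20240101, 20.00);
""")

# A typical BI query: join the facts to a dimension and aggregate.
for region, total in conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.region
"""):
    print(region, total)
```

Notice how the fact table stays narrow and append-friendly, while descriptive attributes live in the dimensions; that is the trade-off star schemas make to keep warehouse reporting queries fast and simple.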