Medallion Architecture: How It Works and Why It Matters in Modern Data Engineering

Written by Coursera Staff

Explore what medallion architecture is and how it works in data lakehouse environments. Learn how data quality improves at each layer of the medallion architecture and how this enhances lakehouse scalability.

Key takeaways

  • The medallion architecture is a multi-layered data management approach for data lakehouses, where data quality gradually improves across each layer.

  • Medallion architecture includes bronze, silver, and gold data layers: the bronze layer for raw data ingestion, the silver layer for basic data cleaning and refinement, and the gold layer for delivering business-ready data.

  • In medallion architecture, you can organize your data logically for different use cases, ensuring the availability of the right data at the right time while maintaining data quality and governance. 

  • You can use the medallion architecture in data lakehouse environments to address the data organization and quality challenges of data lakes.

Explore how the medallion architecture works in data lakehouses and how it compares to traditional extract, transform, load (ETL) processes. If you’re ready to start building expertise in data engineering, enroll in the IBM Data Warehouse Engineer Professional Certificate. You’ll have the opportunity to gain experience with fundamental data warehousing concepts like building ETL pipelines, querying databases using SQL, and analyzing data with business intelligence tools in as little as four months. Upon completion, you’ll have earned a career certificate for your resume.

What is the medallion architecture?

The medallion architecture is a multi-layered data management approach for data lakehouses, where data quality gradually improves as data moves through each layer. The name draws on the image of a medallion, in which raw metal undergoes successive stages of refinement to become a structured, usable piece. Similarly, in a medallion architecture, raw data passes through several stages to become fully processed, business-ready data. The layer names (bronze, silver, and gold) reflect the data quality in each layer. Medallion architecture gives organizations reliable, consistent data for advanced analytics and reporting, business intelligence (BI), artificial intelligence (AI), and machine learning.

Bronze, silver, and gold data layers: How each transforms and improves data

As data progresses through the medallion architecture, the different layers process, store, and manage it in different ways, incrementally enhancing data quality, organization, and reliability for analytics applications. The three data layers and their specific data-related functions are:

  • Bronze: Raw, unchecked data of diverse formats enters the system from various sources and is stored here without any transformations.

  • Silver: Data from the bronze layer undergoes cleaning, transformation, and refinement to prepare a more structured, usable format for analysis.

  • Gold: This layer fully refines and aggregates data from the silver layer, producing highly optimized, structured, business-ready data for decision-making and AI models.

Why is this layered approach used in lakehouse environments?

Data lakehouse platforms combine the best features of data warehouses and data lakes, providing an organized, scalable solution for storing both raw and structured data for real-time analytics and business decision-making. For organizations that manage vast amounts of data from multiple sources, keeping data accurate and accessible at every stage of the lakehouse is a priority.

This is where medallion architecture helps by providing a structured approach to data storage and management. In a medallion architecture, you can organize your data logically for different use cases, ensuring the right data is available at the right time while maintaining data quality and governance.

Who invented medallion architecture?

Databricks first coined the term “medallion architecture” [1]. The open analytics platform also popularized its use, describing a multi-layered data storage and processing architecture in which data quality improves at each layer.

How to build data pipelines with a medallion architecture

A medallion architecture is well-suited for building data pipelines because its layered structure already provides a pre-organized pipeline. In turn, it facilitates integration with larger pipelines and provides flexibility for building scalable pipelines. 

Bronze layer: Raw data ingestion

The bronze layer contains raw, unprocessed data from external sources, including databases, real-time pipelines, and messaging systems like Apache Kafka. This is the landing point for ingested data, which is stored “as-is” in the source format without transformation. The purpose of this layer is to provide a historical archive of data, serving as a single source of truth to ensure the accessibility of original data for auditing or reprocessing.

To build this layer, consider the following steps:

  • Integrate all data sources to form an ingestion layer of data in its original format.

  • Store data in append-only tables to maintain accuracy, schema fidelity, and original metadata.

  • Apply minimal initial data validation and quarantine invalid data for later investigation.

  • Ensure you capture all source-specific metadata to allow reverting to previous versions.
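The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production ingestion job: the in-memory lists stand in for append-only bronze and quarantine tables, and the names ingest, bronze_table, quarantine_table, and orders_api are hypothetical. The only validation assumed here is that the payload parses as JSON.

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for append-only bronze and quarantine tables.
bronze_table = []
quarantine_table = []


def ingest(raw_payload: str, source: str) -> None:
    """Append one raw record to bronze, or quarantine it if it cannot be parsed."""
    record = {
        "raw": raw_payload,  # stored as-is, with no transformation
        "source": source,    # source-specific metadata
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        json.loads(raw_payload)  # minimal validation: is this parseable JSON?
    except json.JSONDecodeError:
        quarantine_table.append(record)  # keep invalid data for later investigation
        return
    bronze_table.append(record)  # append only; existing rows are never updated


ingest('{"order_id": 1, "amount": "19.99"}', source="orders_api")
ingest('{"order_id": 1, "amount": "19.99"}', source="orders_api")  # duplicate is kept
ingest('{"order_id": 2, "amount": null}', source="orders_api")     # null is kept
ingest("not json at all", source="orders_api")                     # quarantined

print(len(bronze_table), len(quarantine_table))  # prints: 3 1
```

Note that duplicates and null values are deliberately left alone at this stage. Cleaning them is the job of the silver layer.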

Silver layer: Cleaning and enriching data

The silver layer involves data transformation, validation, and cleaning to create enriched data for the gold layer. In this layer, you’ll remove duplicate data; fix null values, missing fields, and corrupted data; perform data normalization and cleansing; and structure data into a usable format that’s “just enough” for data engineers and scientists to implement for reporting or analysis on an as-needed basis. You might also apply schema enforcement to ensure data adheres to a specific structure or type, and schema evolution to manage data changes over time.
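As a rough sketch of these silver-layer tasks, the example below removes exact duplicates, enforces a simple schema, and normalizes one column. The sample rows, the SILVER_SCHEMA mapping, and the to_silver function are hypothetical, and rejecting rows with null or corrupted values is just one possible policy; you might instead choose to fill in defaults.

```python
# Hypothetical rows as they might arrive from the bronze layer.
bronze_rows = [
    {"order_id": "1", "amount": "19.99", "country": " us "},
    {"order_id": "1", "amount": "19.99", "country": " us "},  # exact duplicate
    {"order_id": "2", "amount": None, "country": "DE"},       # null amount
    {"order_id": "3", "amount": "abc", "country": "fr"},      # corrupted amount
    {"order_id": "4", "amount": "5.00", "country": "fr"},
]

# Hypothetical silver schema: column name -> function that casts to the target type.
SILVER_SCHEMA = {"order_id": int, "amount": float, "country": str}


def to_silver(rows):
    """Deduplicate, enforce the schema, and normalize; return (clean, rejected)."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        key = (row.get("order_id"), row.get("amount"), row.get("country"))
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        try:
            # Schema enforcement: every column must exist and cast cleanly.
            typed = {col: cast(row[col]) for col, cast in SILVER_SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            rejected.append(row)  # null, missing, or corrupted values
            continue
        typed["country"] = typed["country"].strip().upper()  # normalization
        clean.append(typed)
    return clean, rejected


clean_rows, rejected_rows = to_silver(bronze_rows)
print(clean_rows)
print(len(rejected_rows))  # prints: 2
```

Keeping the rejected rows, rather than silently discarding them, makes it possible to inspect them later and adjust the rules.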

Gold layer: Delivering analytics-ready data

The gold layer is the final layer of the medallion architecture, where you can apply final transformations, aggregations, and data modeling to align with project-specific business applications. Data in this layer is analytics-ready, meaning it is organized and optimized so that data scientists, data engineers, and business analysts, as well as non-technical teams like marketing and product development teams, can use it to inform business decisions. 

In the gold layer, you’ll perform data modeling to align data with business needs by defining relationships between data and ensuring analysts can find what they’re looking for. You’ll also apply aggregate functions, such as averages and counts, tailored to specific business needs, such as financial reporting and marketing analytics.
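To make the aggregation step concrete, here is a small sketch that turns cleaned order rows into a per-country summary with a count and an average. The sample data and the build_gold_sales_by_country function are hypothetical; in practice this would more likely be a SQL query or a job on your lakehouse platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cleaned rows from the silver layer.
silver_orders = [
    {"order_id": 1, "amount": 20.0, "country": "US"},
    {"order_id": 2, "amount": 40.0, "country": "US"},
    {"order_id": 3, "amount": 10.0, "country": "DE"},
]


def build_gold_sales_by_country(orders):
    """Aggregate order amounts into one business-ready row per country."""
    amounts_by_country = defaultdict(list)
    for order in orders:
        amounts_by_country[order["country"]].append(order["amount"])
    return {
        country: {"order_count": len(amounts), "avg_amount": mean(amounts)}
        for country, amounts in sorted(amounts_by_country.items())
    }


gold_sales = build_gold_sales_by_country(silver_orders)
print(gold_sales["US"])  # prints: {'order_count': 2, 'avg_amount': 30.0}
```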

What is the difference between ETL and medallion architecture?

While traditional extract, transform, load (ETL) transforms data before loading it, the medallion pattern mainly follows ELT (extract, load, transform), which prioritizes loading raw data first in the bronze layer, applying minimal transformations in the silver layer, and then applying complex transformations when loading the gold layer. 

In traditional ETL processes, you perform data cleaning before loading to ensure data quality, which can be time-intensive. The medallion architecture improves on this method by incrementally enhancing data quality as the data moves through each layer. 

The medallion architecture also addresses the issue of data changing as it moves through traditional ETL pipelines, which requires complex intervention at later stages to fix the problem because the original files are often overwritten. In contrast, the medallion architecture’s bronze layer stores the original file in the exact source format, facilitating easier fixes.
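The value of keeping the original data can be shown with a toy example. In this sketch, a first parsing rule (parse_v1) mishandles amounts that contain a thousands separator. Because the raw strings are still available in bronze, a corrected rule (parse_v2) can simply be rerun over them. All names and values here are hypothetical.

```python
# Raw amounts exactly as received, retained in the bronze layer.
bronze_amounts = ["19.99", "1,250.00", "7.50"]


def parse_v1(raw):
    """First attempt: fails on thousands separators and yields None instead."""
    try:
        return float(raw)
    except ValueError:
        return None


def parse_v2(raw):
    """Corrected rule: strip thousands separators before casting."""
    return float(raw.replace(",", ""))


silver_v1 = [parse_v1(raw) for raw in bronze_amounts]
silver_v2 = [parse_v2(raw) for raw in bronze_amounts]  # rebuilt from bronze

print(silver_v1)  # prints: [19.99, None, 7.5]
print(silver_v2)  # prints: [19.99, 1250.0, 7.5]
```

If the pipeline had overwritten the source values with the output of parse_v1, the lost amount could not be recovered without going back to the source system.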

Benefits and challenges of medallion architecture

Implementing a medallion architecture means balancing the tradeoffs between its benefits and limitations. Key benefits of a medallion architecture include:

  • Easy to use: Its clear, intuitive structure makes it easy to understand, adopt, and scale. 

  • Data quality: Multiple data refinement stages ensure only highly accurate data reaches the final stage, enhancing confidence in decision-making.

  • Traceability and governance: Data moving through the various stages is easily trackable, ensuring lineage traceability for auditing and compliance purposes.

  • Optimized performance: Refined and denormalized data in the gold layer enables more efficient querying and real-time insights.

Despite the benefits, it’s important to be mindful of the limitations of medallion architecture.

  • Increased storage and costs: Storing data across three layers can potentially triple data storage costs, especially for data-heavy applications. Overcoming this might require compressed object storage and long-term retention policies for duplicate data.

  • Complex data management: You must still model and manage schemas and tables separately for each layer, which can increase the complexity of data management. Training data engineers and automating processes can partly offset this.

  • Compatibility issues: If a lakehouse architecture is impractical for your organization, you might not find a medallion architecture helpful, since they typically go hand-in-hand. However, medallion architecture can still be effective if you use a combination of data lakes and data warehouses.
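To put the storage limitation in perspective, the arithmetic can be sketched as follows. The layer sizes are hypothetical numbers chosen only for illustration. The tripling case assumes every layer holds a full-size copy of the data, while the second case assumes that cleaning and aggregation make the silver and gold layers smaller.

```python
def storage_multiplier(bronze_gb: int, silver_gb: int, gold_gb: int) -> float:
    """Return total stored volume as a multiple of the raw (bronze) volume."""
    return (bronze_gb + silver_gb + gold_gb) / bronze_gb


# Hypothetical sizes in gigabytes; measure your own tables before drawing conclusions.
full_copies = storage_multiplier(1000, 1000, 1000)  # each layer is a full copy
smaller_layers = storage_multiplier(1000, 600, 100)  # refined layers shrink

print(full_copies, smaller_layers)  # prints: 3.0 1.7
```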

How medallion architecture supports lakehouse scalability

Medallion architecture can inherently support large, growing data sets in diverse formats from multiple sources, ensuring scalability with business needs without compromising on data quality or performance. Through clear layer separation and incremental data quality improvements, the medallion architecture promotes independent scaling to meet data needs. This structure allows organizations to start with minimal implementations and expand as their data requirements change. 

Data lakes: Common challenges addressed with medallion architecture

The primary challenge with data lakes is that data easily becomes disorganized. Although data lakes are scalable and flexible, large amounts of data can quickly turn them into “data swamps,” making the data difficult to navigate. Additionally, structured, unstructured, and semi-structured data stored in data lakes require separate types of processing. Finally, since data lakes use the schema-on-read method, raw data is transformed only when it is queried, which slows down real-time analytics.

Using a data lakehouse architecture solves some of these limitations. A lakehouse uses both the schema-on-read model and the schema-on-write model. With the schema-on-read model, data is stored in its raw form, then structured when it is “read” or accessed. With the schema-on-write model, data is structured before storage. Additionally, lakehouses store structured and unstructured data within a single ecosystem, eliminating the need for separate processing. However, a lakehouse on its own doesn’t organize data or offer a system for data refinement.
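The difference between the two schema models can be illustrated with a short sketch. The SCHEMA mapping, the sample events, and the function names are hypothetical. With schema-on-read, every write succeeds and problems surface only when the data is read. With schema-on-write, invalid records are caught before they are stored.

```python
import json

# Hypothetical schema: column name -> function that casts to the target type.
SCHEMA = {"user_id": int, "score": float}

raw_events = [
    '{"user_id": "7", "score": "0.5"}',
    '{"user_id": "8", "score": "oops"}',  # does not fit the schema
]


def apply_schema(record):
    """Cast each expected column; raises if a value is missing or malformed."""
    return {col: cast(record[col]) for col, cast in SCHEMA.items()}


# Schema-on-read: store the raw text untouched, apply structure at query time.
raw_store = list(raw_events)


def read_with_schema(store):
    rows = []
    for line in store:
        try:
            rows.append(apply_schema(json.loads(line)))
        except (KeyError, TypeError, ValueError):
            pass  # the bad record is only discovered (and skipped) at read time
    return rows


# Schema-on-write: apply structure first, store only what passes.
structured_store, write_errors = [], []
for line in raw_events:
    try:
        structured_store.append(apply_schema(json.loads(line)))
    except (KeyError, TypeError, ValueError):
        write_errors.append(line)  # rejected before storage

print(len(raw_store), read_with_schema(raw_store))
print(len(structured_store), len(write_errors))  # prints: 1 1
```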

A medallion architecture improves upon the lakehouse model by providing a framework for incrementally refining data quality and for organizing and managing data in a lakehouse, enabling more efficient decision-making.

Is medallion architecture still used?

Yes, professionals, including data engineers, data scientists, and BI analysts, still widely use medallion architecture. Large companies also use medallion architecture for their data management. For example, Microsoft adopted the medallion architecture for its central data storage platform, OneLake, which is part of Microsoft’s Fabric implementation [2].

Getting started in data engineering

If you’re looking to start building data engineering knowledge or start a career in data engineering, start by considering what type of position you’re looking for and whether it will require formal education or if you can satisfy the requirements with online training. If you want to develop a strong foundation of data engineering skills, getting a bachelor’s degree in computer science, software engineering, or a related field might be the way to go. 

Alternatively, you can start building your data engineering skills through online courses or training. Data engineering requires a strong command of programming languages like Java and Python, database technologies like SQL, ETL tools, and big data technologies. Industry leaders like Microsoft and Databricks offer online training courses for data engineering using their platforms, which can be helpful if you’re trying to learn specific skills. You can also take up beginner-friendly online courses, like the IBM Data Engineering Professional Certificate on Coursera, to develop your understanding of foundational concepts.

Once you feel confident in your skills, consider getting a certification to validate your abilities. Popular certifications for data engineers include the Google Cloud Professional Data Engineer, AWS Certified Data Engineer - Associate, and the Databricks Certified Data Engineer Associate. Make sure you check the certification prerequisites, as some credentials require a few years of on-the-job experience.

Article sources

1. Redpanda. “Implementing the Medallion Architecture with Redpanda,” https://www.redpanda.com/blog/medallion-architecture-redpanda/. Accessed February 18, 2026.

This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.