November 15, 2024

What is Medallion Architecture? Understanding the Medallion Framework for Better Data Management

Learn what medallion architecture is, how it works, and why it’s useful for ensuring data accuracy and building data pipelines.


Landon Iannamico

Content Strategist

Data is becoming increasingly crucial for businesses across all industries. As data usage grows, so does its complexity—which is why creating proper data management systems, platforms, and storage models has become so important. 

One of the most popular and effective data management structures is medallion architecture, created and popularized by Databricks. Keep reading to learn what medallion architecture is, how it works, why it’s so popular, and the benefits it offers for organizations using data lakehouses. 

What is Medallion Architecture? 

Medallion Architecture is a data management framework that logically organizes data into 3 layers within a data lakehouse—bronze, silver, and gold. Each layer contains data at a different processing stage. 

Raw data enters the system in the bronze layer, gets cleaned and transformed in the silver layer, and then turns into fully processed, business-ready data that can be modeled and analyzed at the gold layer.

Why is It Called “Medallion” Architecture?

Just as raw precious metals pass through stages of refinement to create a medallion, raw data passes through the stages of refinement of medallion architecture to create fully processed, ready-to-use data. The name symbolizes the purification and curation of a raw material into something useful and valuable.

What About Other 3-Tier Data Architecture Systems?

It is important to note that the basic premise of medallion architecture has existed since at least the late 1990s (although Databricks only coined the term in the 2010s) and has since spawned many slight variations. It’s sometimes called a “multi-hop” structure, and many companies use terms like “raw, processed, refined” instead of “bronze, silver, gold.” Some also add sub-layers to the 3 main layers or use 2 or 4 layers instead of 3.

In any case, whether or not a system technically qualifies as “medallion” architecture doesn’t really matter. It’s not meant to be a strict step-by-step system you absolutely must follow—it’s just a general framework you can use as a starting point. 

Why Use the Medallion Framework? 

Medallion architecture adds an official structure and workflow to your data processing and management within a data lakehouse. With the medallion framework, you have a better chance of keeping data for different use cases organized, preserving data quality and governance, and getting more use out of your data. 

How Medallion Architecture Works: Breaking Down the Three Layers

As data flows through medallion architecture, each layer stores, processes, and manages the data during a different stage in its lifecycle. Here’s a summarized breakdown:

  1. Bronze Layer: Raw data is ingested from various sources and stored in its original form, whether structured, semi-structured, or unstructured.
  2. Silver Layer: The data is cleaned, transformed, and structured to make it more usable for analytics and reporting.
  3. Gold Layer: Refined and business-ready data, ready for deeper insights, AI models, and decision-making. 
How medallion architecture works: raw data enters the bronze layer, then goes through stages of processing until it's ready to be used in business data applications.

Let’s break these layers down further.

Bronze Layer (Raw Data)

The bronze layer is the foundational stage of the Medallion Architecture. It acts as a repository for raw, unprocessed data that flows in from various sources, such as databases, APIs, and real-time data pipelines. This data is ingested in its original format without any transformation, cleaning, or processing—it is pure, unmodified, raw data. 

Although bronze layer data may occasionally be used as-is for crude, rudimentary insights, it typically goes on to be refined further in the silver and gold layers. 

Example: In an e-commerce business, the bronze layer might include raw customer interaction logs, clickstream data, and purchase records collected from the e-commerce website.
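
As a minimal sketch of this idea, the Python snippet below appends raw payloads to a bronze store without parsing or altering them. The function name `ingest_to_bronze`, the record fields, and the sample payloads are hypothetical and chosen only for illustration. A real lakehouse would write to durable table storage rather than an in-memory list.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for an append-only bronze table.
bronze_store = []

def ingest_to_bronze(raw_payload, source):
    """Store a payload exactly as received, plus ingestion metadata."""
    record = {
        "raw": raw_payload,  # untouched: no parsing, cleaning, or validation
        "source": source,    # which system the payload came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    bronze_store.append(record)
    return record

# Illustrative payloads, including a malformed one that bronze still keeps.
ingest_to_bronze('{"user_id": "u1", "event": "click", "page": "/cart"}', "clickstream")
ingest_to_bronze('{"user_id": "u1", "event": "purchase", "amount": "19.99"}', "orders_api")
ingest_to_bronze("not valid json", "clickstream")

print(len(bronze_store), "raw records stored")
```

Keeping the malformed payload is deliberate in this sketch: deciding what counts as valid is left to later layers, so nothing is lost at ingestion time.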

Silver Layer (Cleaned, Consolidated Data)

After raw data enters the bronze layer, it moves on to the silver layer to undergo standard transformation and cleaning procedures. Removing duplicates, filling in missing values, and ensuring consistency across a dataset are all common steps. The silver layer may also involve merging similar data points from multiple datasets or sources.

At this stage, the data is “good enough” to be used by data scientists and engineers for reporting and analytics on an ad-hoc basis. However, it isn’t quite refined enough to be streamed into machine learning applications or used by non-data-oriented professionals. 

Example: In an e-commerce business, the silver layer could contain cleaned and normalized data about user behavior, such as deduplicated customer purchase histories and interactions tied to unique user IDs. It can also include similarly cleaned data from competing e-commerce platforms.
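
The cleaning steps described for this layer could look roughly like the Python sketch below, which deduplicates purchase rows, normalizes text fields, and fills missing values. The function `to_silver`, the field names, and the chosen defaults (0.0 for a missing amount, "UNKNOWN" for a missing country) are assumptions made for illustration, not rules prescribed by the medallion framework.

```python
def to_silver(bronze_rows):
    """Deduplicate purchases, normalize fields, and fill missing values."""
    seen_order_ids = set()
    silver_rows = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        if order_id is None or order_id in seen_order_ids:
            continue  # drop rows without a key and repeated orders
        seen_order_ids.add(order_id)
        silver_rows.append({
            "order_id": order_id,
            "user_id": str(row.get("user_id", "")).strip().lower(),
            "amount": float(row.get("amount") or 0.0),             # assumed default
            "country": (row.get("country") or "unknown").upper(),  # assumed default
        })
    return silver_rows

# Illustrative bronze rows with a duplicate and missing values.
bronze_rows = [
    {"order_id": 1, "user_id": " U1 ", "amount": "19.99", "country": "us"},
    {"order_id": 1, "user_id": " U1 ", "amount": "19.99", "country": "us"},
    {"order_id": 2, "user_id": "u2", "amount": None, "country": None},
]
silver_rows = to_silver(bronze_rows)
for row in silver_rows:
    print(row)
```

Whether a missing amount should become 0.0, be dropped, or be flagged for review is a business decision; the default here only keeps the example short.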

Gold Layer (Business-Level Curated Data)

The gold layer is the highest tier in the Medallion Architecture. Here, the data goes through final transformations and gets aggregated for specific projects. From there, the fully refined, aggregated, and contextualized data is streamed into data modeling, dashboards, and reporting applications.

Unlike silver data, which is mainly suited to data scientists and engineers, gold data is contextualized and modeled so it’s ready to be used by non-data professionals like marketing, sales, and product development teams.

Example: For an e-commerce company, the gold layer might include aggregated reports showing sales trends, customer segmentation analysis, and predictive analytics for future sales.
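
To make the aggregation step concrete, here is a small Python sketch that rolls cleaned order rows up into a per-day sales summary of the kind a dashboard might read. The function `daily_sales` and the sample rows are hypothetical; real gold tables are usually built with a query engine rather than plain Python.

```python
from collections import defaultdict

def daily_sales(silver_rows):
    """Aggregate order amounts into revenue and order count per day."""
    totals = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
    for row in silver_rows:
        day = totals[row["order_date"]]
        day["revenue"] += row["amount"]
        day["orders"] += 1
    # Round once at the end so the report shows currency-style values.
    return {
        date: {"revenue": round(v["revenue"], 2), "orders": v["orders"]}
        for date, v in sorted(totals.items())
    }

# Illustrative silver rows: already cleaned and deduplicated.
silver_rows = [
    {"order_id": 1, "order_date": "2024-11-01", "amount": 20.0},
    {"order_id": 2, "order_date": "2024-11-01", "amount": 5.5},
    {"order_id": 3, "order_date": "2024-11-02", "amount": 12.25},
]
gold_report = daily_sales(silver_rows)
for date, summary in gold_report.items():
    print(date, summary)
```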


What Is a Data Lakehouse? A Quick History of Data Management

Medallion architecture is typically applied in a data lakehouse, a data storage and management model that combines aspects of data warehouses and data lakes.

Here’s a breakdown of what that means and why they exist. 

Data Warehouses: The Early Days 

Data warehouses were first developed in the late 1980s to provide centralized repositories for structured data optimized for online analytical processing (OLAP). They have rigid organization, can only store structured data, and support complex queries, data aggregation, and analytics. 

Data warehouses are highly effective at managing relatively small amounts of uniform data with predictable uses and formats, which is what organizations needed at the time. However, they aren’t very scalable and can’t support large quantities of diverse, unstructured, and semi-structured data.

Data Lakes: The Solution to the Data Revolution

The rise of the internet and social media in the 2000s and 2010s sparked a data revolution. Companies needed to store and process vast quantities of diverse data with different sources, formats, and uses. They needed to store raw, unstructured data for later use, making the old-school warehouse structure inefficient. 

To solve this problem, data lakes were invented. Data lakes are large repositories that store raw, unstructured, and semi-structured data in its native format until needed. This flexible, scalable management model enabled companies to store nearly unlimited data without spending hours sorting and cleaning it. 

Data Lakehouses: The Modern Approach 

Although data lakes were scalable, companies soon realized they could easily become disorganized. Many data lake systems turned into “data swamps”: large, confusing masses of data that are incredibly difficult to navigate and sort through.

In response to this problem, data lakehouses, a hybrid solution, were created in the 2010s. Data lakehouses store unstructured and structured data and support real-time processing, analytics, and queries—marrying the organization of data warehouses with the diversity and scalability of data lakes. 

How Medallion Architecture Ties into Data Lakehouses

Databricks developed medallion architecture as a method of organizing, sorting, and processing data specifically within a data lakehouse. By implementing medallion architecture in a data lakehouse, you get a logical, structured approach to data management that makes it easier to find and use the data you need.

Benefits of Using Medallion Architecture

Easy to Understand and Implement

Medallion architecture's straightforward and intuitive structure offers an easy way to formalize processes your data team is likely already using. The separation of data into the three tiers of bronze, silver, and gold offers a logical progression for data processing, where data teams can follow clear steps to transform raw data into refined, analysis-ready datasets. 

This clarity and intuitive structure make it easy to onboard new team members and ensure data processing workflows can be quickly adopted and scaled. 

Promotes Better Data Quality and Governance

Creating three distinct, progressive layers of data processing inherently promotes data governance and quality because it requires organizations to trace data as it flows from bronze to gold. Each stage acts as a checkpoint that ensures only validated and accurate data proceeds to the next stage, with the silver layer acting as a particularly important checkpoint in this process.

This structured approach reduces the risk of errors in analysis and allows organizations to easily enforce data lineage and auditing practices, ensuring compliance with industry regulations and internal policies. 
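
One way to picture such a checkpoint is a small quality gate that sits between two layers and only lets rows through when they pass every rule. The Python sketch below is an illustration under assumptions: the function `quality_gate`, the rule names, and the sample rows are all hypothetical.

```python
def quality_gate(rows, rules):
    """Split rows into those that pass every rule and those that fail any."""
    passed, rejected = [], []
    for row in rows:
        failures = [name for name, check in rules.items() if not check(row)]
        if failures:
            # Keep the failing row with its reasons so it can be audited later.
            rejected.append({"row": row, "failed_rules": failures})
        else:
            passed.append(row)
    return passed, rejected

# Hypothetical rules for promoting purchase rows to the next layer.
rules = {
    "has_user_id": lambda r: bool(r.get("user_id")),
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}
rows = [
    {"user_id": "u1", "amount": 19.99},
    {"user_id": "", "amount": 5.0},
    {"user_id": "u3", "amount": -1},
]
passed, rejected = quality_gate(rows, rules)
print(len(passed), "passed;", len(rejected), "rejected")
```

Rejected rows are kept with the names of the rules they failed instead of being silently dropped, which supports the auditing practices mentioned above.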

Scalability and Flexibility

Medallion architecture was created with data lakehouses in mind, which means it was built to scale and adapt to changing business needs. Medallion architecture offers a system to deal with a wide variety of data sources, types, and levels of processing.

The modular structure of medallion architecture also means organizations can start with basic implementations and expand their architecture as data requirements evolve.

Improved Security and Access Control

Because data in medallion architecture is divided into layers, it is easy to implement access control policies and limitations that compartmentalize data exposure, making it easier to avoid unauthorized data manipulation and prevent data breaches.

For example, you could build access controls so that bronze-level raw data is only accessible to data engineers, silver data can be accessed by a broader group of data scientists and analysts, and gold data is accessible to specific stakeholders or end users on a role-based basis.
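
A layer-based policy like this can be sketched as a simple lookup from role to readable layers. The role names and the mapping below are hypothetical, and granting data engineers access to every layer is an assumption of this example. In practice, the policy would be enforced by the platform's own permission system rather than application code.

```python
# Hypothetical role-to-layer mapping mirroring the example above.
LAYER_ACCESS = {
    "data_engineer": {"bronze", "silver", "gold"},
    "data_scientist": {"silver", "gold"},
    "analyst": {"silver", "gold"},
    "business_user": {"gold"},
}

def can_read(role, layer):
    """Return True if the role may read the layer; unknown roles get no access."""
    return layer in LAYER_ACCESS.get(role, set())

for role in ("data_engineer", "analyst", "business_user"):
    readable = [layer for layer in ("bronze", "silver", "gold") if can_read(role, layer)]
    print(role, "can read:", readable)
```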

Enables Building of Data Pipelines

Medallion architecture is well-suited for building data pipelines—an absolute must for organizations that wish to have consistent, real-time data to fuel AI and BI applications. 

Because medallion architecture already provides an organized, pre-determined pipeline structure, integrating into larger data pipelines is easy. The flexibility of the bronze layer also allows for easy additions of new data sources as needed. This combination of organization and flexibility ensures teams can build reusable and scalable data pipelines.

Medallion Architecture Best Practices

Here are some best practices to follow when implementing medallion architecture.

Design Robust Pipelines

When creating data pipelines, it’s essential to ensure they are scalable, fault-tolerant, and efficient. Use automation to handle common tasks, such as data ingestion and transformation, and ensure each layer is optimized for its specific function. 

Ensure Data Lineage and Auditing

Tracking data transformations is crucial for understanding how data moves through the system and flagging issues or malfunctions before they become larger problems. Implementing data lineage and auditing helps ensure transparency and enables teams to trace data from its raw origins to its final gold-level state.  
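
As a rough illustration of lineage tracking, the Python sketch below records one entry per transformation and then walks back from a gold dataset to its raw origin. The names `record_step` and `trace`, the dataset names, and the single-parent assumption (each dataset derives from exactly one upstream dataset) are simplifications made for this example.

```python
from datetime import datetime, timezone

# Hypothetical in-memory lineage log; real systems usually persist this.
lineage_log = []

def record_step(dataset, layer, derived_from, transformation):
    """Append one lineage entry describing how a dataset was produced."""
    entry = {
        "dataset": dataset,
        "layer": layer,
        "derived_from": derived_from,  # None for data ingested from outside
        "transformation": transformation,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    lineage_log.append(entry)
    return entry

def trace(dataset):
    """Walk back from a dataset to its raw origin, newest step first."""
    by_name = {entry["dataset"]: entry for entry in lineage_log}
    chain, visited = [], set()
    current = dataset
    # The visited set guards against accidental cycles in the log.
    while current in by_name and current not in visited:
        visited.add(current)
        chain.append(by_name[current])
        current = by_name[current]["derived_from"]
    return chain

record_step("orders_raw", "bronze", None, "ingested from orders API")
record_step("orders_clean", "silver", "orders_raw", "deduplicated, filled missing values")
record_step("daily_sales", "gold", "orders_clean", "aggregated revenue per day")

for step in trace("daily_sales"):
    print(step["layer"], step["dataset"], "<-", step["derived_from"])
```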

Common Mistakes to Avoid

Some common pitfalls during the implementation of medallion architecture include:

  • Overcomplicating your pipelines. 
  • Failing to enforce data quality rules early.
  • Not ensuring the scalability of systems. 

Conclusion: Medallion Architecture Is Popular for a Reason

Medallion architecture is popular for a reason: It’s a highly effective method for organizing data within a data lakehouse, and it helps ensure data quality, clear workflows, and scalability. It’s also great for building and integrating data pipelines.

However, if you want to use data to inform business decisions, organizing and cleaning data is only half the battle: you also need to ensure you’re getting accurate, relevant data from good sources.

Nimble’s web API makes gathering large quantities of accurate data from any public web source easy, while our customizable Online Pipelines that intake data from hundreds of different sources offer a pre-built way to funnel the data you need directly into your data storage system. Both solutions integrate seamlessly with the Databricks lakehouse and medallion architecture.

To learn more about our data collection solutions, contact us today. 
