
What is the Medallion Lakehouse Architecture?

Published on May 28, 2025


Medallion architecture as a data design pattern

A medallion architecture is a data design pattern for logically organizing data in a lakehouse. With each layer of the architecture (from Bronze to Silver to Gold tables), the goal is to incrementally improve the structure and quality of the data.

Medallion architectures are also known as multi-hop architectures. By moving data through these stages, businesses can gradually improve its quality and reliability, making it better suited for business intelligence and machine learning.

It is a good idea to follow the medallion design, but it is not required.

Bronze

  • What happens in this layer? Captures and stores raw, unprocessed data directly from source systems.

  • Who is the intended user? Data engineers; operations and compliance teams.

Silver

  • What happens in this layer? Performs data cleaning, de-duplication, validation, and basic transformations.

  • Who is the intended user? Data engineers and analysts; data scientists who require structured yet detailed datasets.

Gold

  • What happens in this layer? Applies aggregations, business logic, and dimensional modeling for analytics.

  • Who is the intended user? Business analysts; decision-makers; BI developers; machine learning engineers.

Building data pipelines with medallion architecture

With just a few lines of code, Databricks gives customers tools like Delta Live Tables (DLT) that let them build data pipelines with Bronze, Silver, and Gold tables right away. And with streaming tables and materialized views, users can build streaming DLT pipelines on Apache Spark Structured Streaming that are refreshed and updated incrementally.

To learn more about how to combine streaming tables and materialized views in a single pipeline, check out the Databricks documentation.
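As a rough illustration, a three-layer DLT pipeline can look like the sketch below. The landing path, table names, and columns (order_id, amount, country) are assumptions for the example; the dlt decorators and read functions are from the Delta Live Tables Python API, and spark is the session DLT provides.

```python
# Minimal DLT sketch, assuming a hypothetical JSON landing path and columns.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw orders ingested as-is, plus load metadata")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/landing/orders/")                    # hypothetical path
        .withColumn("_load_ts", F.current_timestamp())
    )

@dlt.table(comment="Silver: cleaned and de-duplicated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .dropDuplicates(["order_id"])
        .withColumn("amount", F.col("amount").cast("double"))
    )

@dlt.table(comment="Gold: daily revenue per country for reporting")
def revenue_by_country_gold():
    return (
        dlt.read("orders_silver")
        .groupBy("country", F.to_date("_load_ts").alias("load_date"))
        .agg(F.sum("amount").alias("total_revenue"))
    )
```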

Bronze layer (raw data)

The Bronze layer is where all the data from external source systems lands. The table structures in this layer mirror the source system's table structures "as is," plus extra metadata columns that store information such as the load date and time, process ID, and so on.

This layer is all about quick Change Data Capture and providing a historical archive of source data (cold storage), data lineage, auditability, and the ability to reprocess data without reading it from the source system again if needed.
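A hedged sketch of a standalone Bronze ingestion job is shown below: it lands source files unchanged and adds audit metadata columns. The source path, checkpoint location, table name, and process ID value are illustrative assumptions.

```python
# Land raw CSV files "as is" into a Bronze Delta table with audit metadata.
from pyspark.sql import functions as F

bronze_df = (
    spark.readStream.format("cloudFiles")            # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .load("/landing/crm/customers/")                 # hypothetical source path
    .withColumn("_ingest_ts", F.current_timestamp()) # load date and time
    .withColumn("_source_file", F.input_file_name()) # basic lineage
    .withColumn("_process_id", F.lit("crm_customers_v1"))  # example process ID
)

(bronze_df.writeStream
    .option("checkpointLocation", "/chk/bronze/customers")
    .trigger(availableNow=True)                      # incremental, batch-style run
    .toTable("bronze.customers"))
```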

Silver layer (cleansed and conformed data)

The Silver layer in the medallion architecture serves as a critical stage where raw data from the Bronze layer is refined to create a trusted and structured dataset. This layer focuses on enhancing data quality and preparing it for advanced analytics and business intelligence.

Data Cleansing and Validation: Identifies and rectifies errors, inconsistencies, and duplicates in the data to ensure accuracy.

Conformation and Standardization: Aligns data from diverse sources into a consistent format, facilitating a unified enterprise view.

Minimal Transformations (ELT Approach): Applies essential transformations post data loading, prioritizing speed and flexibility over extensive pre-processing.

Structured Data Modeling: Organizes data into normalized structures, often following Third Normal Form (3NF) or Data Vault methodologies, to support efficient querying and scalability.

Preparation for Advanced Analytics: Provides a reliable dataset for data scientists and analysts to perform machine learning, reporting, and other analytical tasks.

By performing these functions, the Silver layer ensures that data is accurate, consistent, and ready for consumption in the Gold layer, where it can be further transformed and utilized for specific business needs.
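The short sketch below shows what this typically looks like in practice, assuming a bronze.customers table and columns (customer_id, email, country_code) that stand in for whatever your sources actually provide.

```python
# Typical Silver-layer work: validate, standardize, conform, and de-duplicate.
from pyspark.sql import functions as F

silver_df = (
    spark.read.table("bronze.customers")
    .filter(F.col("customer_id").isNotNull())                    # validation
    .withColumn("email", F.lower(F.trim(F.col("email"))))        # cleansing
    .withColumn("country_code", F.upper(F.col("country_code")))  # standardization
    .dropDuplicates(["customer_id"])                             # de-duplication
)

silver_df.write.mode("overwrite").saveAsTable("silver.customers")
```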

Gold layer (curated business-level tables)

In the Gold layer of the lakehouse, data is usually organized into consumption-ready, "project-specific" databases. Because the Gold layer is used for reporting, it has fewer joins, and its data models are more de-normalized and read-optimized. This is where the final layer of transformations and data quality rules is applied.

This is the final presentation layer of a project, serving use cases such as Customer Analytics, Product Quality Analytics, Inventory Analytics, Customer Segmentation, Product Recommendations, Marketing/Sales Analytics, and more. Most star schema-based data models and Inmon-style data marts sit in this Gold layer of the lakehouse.
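As a small illustration of a read-optimized Gold table, the sketch below builds a de-normalized daily sales summary from assumed silver.orders and silver.customers tables; the table and column names are placeholders.

```python
# Gold: aggregate Silver tables into a de-normalized, reporting-friendly summary.
from pyspark.sql import functions as F

orders = spark.read.table("silver.orders")
customers = spark.read.table("silver.customers")

gold_df = (
    orders.join(customers, "customer_id")             # one of the few joins left
    .groupBy("region", F.to_date("order_ts").alias("order_date"))
    .agg(
        F.sum("amount").alias("total_sales"),
        F.countDistinct("customer_id").alias("active_customers"),
    )
)

gold_df.write.mode("overwrite").saveAsTable("gold.daily_sales_by_region")
```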

So you can see that the data is progressively curated as it moves through the layers of the lakehouse. In some cases, many data marts and EDWs from an older RDBMS technology stack are migrated into the lakehouse. This lets enterprises run "pan-EDW" advanced analytics and machine learning that were previously impossible, or unaffordable, on the old stack.

For example, IoT/manufacturing data can be linked with sales and marketing data for debugging, or healthcare genomics and EMR/HL7 clinical data marts can be linked with financial claims data to create a healthcare data lake for better and faster patient-care analytics.

Benefits of a lakehouse architecture

Data Structures Simplified

Teams find it easier to collaborate with Lakehouse’s unified and simple data model, particularly when working across departments.

Simple Implementation for Novices

Even teams without extensive data engineering experience can use it because it is simpler to set up and comprehend than traditional complex architectures.

Supports Incremental ETL

You can save time and money by processing and updating data in smaller portions rather than starting over from scratch.

Recovery of Data from Unprocessed Sources

You can regenerate your processed tables at any time, guaranteeing flexibility and dependability, because raw data is always preserved.

Built-in Data Reliability with ACID Transactions

It guarantees that your data updates are secure and dependable, even in the event of failures, by supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions.
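For instance, a Delta Lake MERGE either commits as a whole or not at all, which is what makes upserts safe. The sketch below is illustrative only; the table names and join key are assumptions.

```python
# ACID-safe upsert with Delta Lake MERGE: the whole operation commits or rolls back.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")
updates = spark.read.table("bronze.customers_changes")   # hypothetical change feed

(target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # update existing rows...
    .whenNotMatchedInsertAll()   # ...and insert new ones, atomically
    .execute())
```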

Data Time Travel

For auditing, debugging, or reversing errors, you can view and restore your data as it was at a particular moment in time.
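A quick sketch of Delta Lake time travel follows; the table name, version number, and timestamp are illustrative.

```python
# Query the table as of an earlier version or timestamp, e.g. for auditing.
v5 = spark.sql("SELECT * FROM silver.customers VERSION AS OF 5")
yesterday = spark.sql("SELECT * FROM silver.customers TIMESTAMP AS OF '2025-05-27'")

# RESTORE rolls the table back to a previous state, e.g. to undo a bad load.
spark.sql("RESTORE TABLE silver.customers TO VERSION AS OF 5")
```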

Validate and de-duplicate data in the silver layer

Raw data is cleaned, verified, and made more trustworthy at the silver layer. To get the data ready for analysis, duplicate records are eliminated, missing or inconsistent values are fixed, and simple transformations are used. This guarantees that the data is reliable and prepared for further analysis or modeling.

Build silver tables from the bronze layer

To create the silver layer, you should read data from existing bronze (or sometimes other silver) tables and write the processed results into new silver tables.

Avoid writing directly to silver tables during initial data ingestion. Doing so may cause issues like schema mismatches or corrupt records. Instead, it’s best to stream data from the bronze layer – especially when dealing with large, append-only sources. Use batch reads only for small and manageable datasets, like lookup or reference tables.
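The sketch below shows this pattern under assumed names: a silver table built by streaming from an existing bronze table, with small reference data read in batch.

```python
# Build a silver table by streaming from bronze rather than writing silver at ingest.
from pyspark.sql import functions as F

(spark.readStream.table("bronze.events")            # large, append-only bronze source
    .filter(F.col("event_id").isNotNull())
    .writeStream
    .option("checkpointLocation", "/chk/silver/events")
    .trigger(availableNow=True)
    .toTable("silver.events"))

# Small lookup/reference data can simply be read in batch and joined where needed.
country_codes = spark.read.table("bronze.country_codes")
```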

The silver layer holds cleaned, validated, and slightly enriched data. It plays a key role in improving data quality and making the data more structured and usable. Key characteristics of the silver layer include:

  • It stores a reliable, non-aggregated version of each record.

  • Cleaning, deduplication, and basic transformation steps are performed here.

  • Data inconsistencies and errors are resolved.

  • It prepares the data in a well-organized format for use in the gold layer or by downstream applications.

This stage bridges raw data from the bronze layer to refined insights delivered in the gold layer.

Enforce data quality

Silver tables refine and clean raw data from the bronze layer to make it accurate, consistent, and ready for analysis. Key operations include the following (a combined sketch follows the list):

  • Schema Enforcement: Ensures data follows a defined structure and catches format issues.

  • Handling Missing Values: Deals with null or incomplete data using default values or filtering.

  • Data Deduplication: Removes duplicate entries to maintain accuracy.

  • Managing Late or Out-of-Order Data: Corrects the timing of records for reliable analysis.

  • Quality Checks: Validates data against rules to catch errors or inconsistencies.

  • Schema Evolution: Adjusts to changes in data structure over time.

  • Type Casting: Converts data types (e.g., string to integer) for compatibility.

  • Joins: Merges datasets to enrich records or provide missing context.
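The compact sketch below combines several of these operations on an assumed bronze.clicks stream with columns (user_id, event_ts, amount); schema enforcement itself is handled by Delta when the result is written to a table.

```python
# Type casting, missing-value handling, late-data watermarking, and de-duplication.
from pyspark.sql import functions as F

clean_df = (
    spark.readStream.table("bronze.clicks")
    .withColumn("event_ts", F.col("event_ts").cast("timestamp"))    # type casting
    .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))  # missing values
    .filter(F.col("user_id").isNotNull())                           # quality check
    .withWatermark("event_ts", "2 hours")                           # late/out-of-order data
    .dropDuplicates(["user_id", "event_ts"])                        # de-duplication
)
```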

Start modeling data

Data modeling typically begins in the silver layer, especially when working with nested or semi-structured data. Common approaches include:

  • Using the VARIANT data type for flexibility in handling mixed formats.

  • Storing complex values as JSON strings.

  • Building structured data using arrays, maps, or structs.

  • Flattening deeply nested data or normalizing it across multiple related tables for better organization and easier querying (see the sketch after this list).
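As a small example of the flattening approach, the sketch below assumes a bronze table whose payload column holds a customer struct with a nested address and an array of orders; the schema is purely illustrative.

```python
# Flatten nested/semi-structured data into a tabular silver shape.
from pyspark.sql import functions as F

raw = spark.read.table("bronze.customers_json")   # "payload" holds a nested struct

flat = (
    raw.select(
        F.col("payload.customer_id").alias("customer_id"),
        F.col("payload.address.city").alias("city"),   # pull nested fields up
        F.explode("payload.orders").alias("order"),    # one row per array element
    )
    .select("customer_id", "city", "order.order_id", "order.amount")
)
```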

Conclusion

Implementing the medallion architecture—progressing data through Bronze, Silver, and Gold layers—ensures a structured and efficient approach to data management. This methodology enhances data quality, supports scalable analytics, and facilitates advanced business intelligence applications.

To deepen your expertise in modern data engineering and lakehouse architectures, consider enrolling in Edureka’s Microsoft Fabric Data Engineer Associate (DP-700) course. This program offers hands-on experience with Microsoft Fabric’s integrated tools, including OneLake, Delta Live Tables, and Power BI, preparing you for the DP-700 certification and advancing your career in data engineering.
