The Azure Databricks architecture is designed as a robust framework for data analytics on the Microsoft Azure platform. It combines the capabilities of Apache Spark and Azure to provide a scalable, secure environment that helps organizations process large datasets efficiently and supports collaborative work.
Azure Databricks simplifies data engineering and data science workflows, and lets users build and deploy machine learning models quickly. This overview looks at the high-level architecture, the serverless compute plane, and the classic compute plane of Azure Databricks.
High-Level Architecture
Internally, Azure Databricks uses a two-plane architecture: the control plane and the compute plane.
- Control Plane: This consists of the backend services managed by Azure Databricks. It handles user interaction through the web application and manages the overall system, including user authentication, resource allocation, and job scheduling.
- Compute Plane: This is where data processing takes place. It has two variants, serverless compute and classic compute, which cater to different use cases and operational requirements. The compute plane executes data processing tasks such as ETL pipelines, machine learning model training, and batch processing.
Serverless Compute Plane
In Azure Databricks, the serverless compute plane is a modern approach to resource management. In this model, compute resources run within the Azure Databricks account rather than in the customer’s Azure subscription. Resources scale up automatically when workload demand increases and scale down when it drops. Key features:
- Automatic Scaling: The serverless compute plane manages resources on behalf of applications, which helps optimize cost and performance. When the workload increases, it adds resources to match; when the workload drops, it releases resources to save on costs.
- Security: Serverless compute operates within a defined network perimeter, which means customer workloads and data are kept isolated. Azure Databricks secures the workspace and data at multiple levels, using Azure security features such as role-based access control and network security groups to keep data protected from unauthorized access.
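The scale-up and scale-down behavior described under Automatic Scaling can be pictured with a small sketch. This is not the policy Azure Databricks actually uses, which the article does not describe; the function name `target_workers` and the parameters `tasks_per_worker`, `min_workers`, and `max_workers` are hypothetical and chosen only for illustration.

```python
import math


def target_workers(pending_tasks: int, tasks_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Hypothetical policy: size the pool to the pending work,
    then clamp the result to the range [min_workers, max_workers]."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    # Workers needed if each one handles tasks_per_worker tasks.
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # Idle, moderate, and heavy load with the same limits.
    for load in (0, 35, 400):
        workers = target_workers(load, tasks_per_worker=10,
                                 min_workers=1, max_workers=20)
        print(f"pending={load:>3} -> workers={workers}")
```

With these illustrative limits, an idle system falls back to the minimum, moderate load gets just enough workers, and heavy load is capped at the maximum, which is the cost-versus-performance trade-off the bullet describes.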
Classic Compute Plane
The classic compute plane is the traditional model for deploying compute resources in Azure Databricks. In this model, compute resources run inside the customer’s Azure subscription, which gives more control over resource management and network configuration.
Here are some of the key features:
- Isolation: Each classic compute plane resides in the customer’s own Azure subscription. This builds in a layer of isolation for security and compliance: by keeping data and resources in their own subscriptions, organizations can isolate them from other tenants and meet rigorous security and compliance requirements.
- Resource Management: Organizations can manage virtual networks and resources directly, so configurations can be customized to their needs. This flexibility is helpful for organizations with specific performance or compliance requirements.
- Flexibility: The classic compute plane supports different types of workloads. Organizations can choose the virtual machine sizes and types that fit their performance requirements.
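To make the choice of VM size and worker count concrete, here is a minimal sketch that builds a cluster specification as a Python dictionary. The field names (`cluster_name`, `spark_version`, `node_type_id`, `autoscale`) follow the Databricks Clusters API; the helper `build_cluster_spec` and every value passed to it (cluster name, runtime version, VM type, worker counts) are assumptions for illustration, not recommendations from the article.

```python
import json


def build_cluster_spec(name: str, spark_version: str, node_type_id: str,
                       min_workers: int, max_workers: int) -> dict:
    """Hypothetical helper: assemble a classic-compute cluster spec.
    Sending it to a workspace is out of scope for this sketch."""
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,   # Databricks Runtime version string
        "node_type_id": node_type_id,     # Azure VM size used for the nodes
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
    }


if __name__ == "__main__":
    # All values below are illustrative assumptions.
    spec = build_cluster_spec(
        name="etl-nightly",
        spark_version="13.3.x-scala2.12",
        node_type_id="Standard_DS3_v2",
        min_workers=2,
        max_workers=8,
    )
    print(json.dumps(spec, indent=2))
```

A memory-heavy workload might swap in a larger `node_type_id`, while a small scheduled job might lower `max_workers`; the point is that these knobs stay in the customer’s hands under the classic model.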
Frequently Asked Questions
What is the architecture of Databricks?
The architecture of Databricks comprises a control plane and a compute plane. The control plane is responsible for the backend services, while the compute plane processes the data. The compute plane can be serverless or classic, depending on the user’s needs. The control plane thus handles user interaction, resource management, and job scheduling, while data processing tasks execute in the compute plane.
What are the components in Azure Databricks?
Some of the critical components of Azure Databricks are:
- Control Plane: Handles the backend services and interacts with users.
- Compute Plane: Processes data on serverless or classic compute resources.
- Workspace Storage Account: Stores workspace system data, including the files and metadata associated with user workspaces, notebooks, and logs.
- Delta Lake: An open-source storage layer that brings reliability to data lakes through ACID transactions, schema enforcement, and time travel based on data versioning.
- Azure Integration: Azure Databricks integrates with a variety of Azure services, such as Azure Data Lake Storage, Azure Event Hubs, and Azure SQL Database, to help ingest, process, and store data.
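The Delta Lake versioning mentioned in the list above can be sketched with the PySpark DataFrame reader and writer. The two helper functions and the table path used in the note below are hypothetical; the sketch assumes a Spark session with Delta Lake available, as is the case in a Databricks notebook.

```python
def append_rows(df, path: str) -> None:
    """Append a DataFrame to a Delta table at `path`.
    Each successful write is recorded as a new table version."""
    df.write.format("delta").mode("append").save(path)


def read_as_of_version(spark, path: str, version: int):
    """Time travel: load the table as it existed at an earlier version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )
```

In a Databricks notebook, where `spark` is predefined, calling `read_as_of_version(spark, "/tmp/events_delta", 0)` would return the table as of its first commit, assuming a Delta table exists at that illustrative path.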
What are the three layers of the data reference architecture in Azure Databricks?
The three layers of the data reference architecture in Azure Databricks are:
- Data Ingestion Layer: Brings data into the Azure Databricks ecosystem from sources such as event streams, databases, and files, using services including Azure Event Hubs and Azure Data Factory.
- Data Processing Layer: Apache Spark performs the data processing in Azure Databricks. This layer is where data is transformed, cleansed, and enriched using Spark’s processing power. Users can write Spark code directly in a notebook or use Databricks’ native libraries and frameworks for tasks such as ETL, machine learning, and streaming.
- Information Access Layer: This layer, also called the consumption layer, provides access to transformed data through dashboards and business analytics tools. Users consume the processed data here for analysis, reporting, and visualization. It integrates with business intelligence tools and dashboarding solutions so end users can build interactive reports and draw insights from the data.
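The flow through the three layers can be traced with a toy pipeline. This sketch uses plain Python as a stand-in for Spark so it runs anywhere; the function names `ingest`, `process`, and `serve`, and the `device,reading` record format, are assumptions made for illustration only.

```python
from collections import defaultdict


def ingest(raw_lines):
    """Ingestion layer: parse raw 'device,reading' lines, skipping malformed ones."""
    records = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        device, value = parts
        try:
            records.append({"device": device, "reading": float(value)})
        except ValueError:
            continue  # reading is not numeric
    return records


def process(records):
    """Processing layer: compute the average reading per device."""
    totals = defaultdict(lambda: [0.0, 0])  # device -> [sum, count]
    for rec in records:
        totals[rec["device"]][0] += rec["reading"]
        totals[rec["device"]][1] += 1
    return {device: s / n for device, (s, n) in totals.items()}


def serve(averages, threshold):
    """Access layer: list devices whose average exceeds the threshold, sorted by name."""
    return sorted(d for d, avg in averages.items() if avg > threshold)


if __name__ == "__main__":
    raw = ["a,1.0", "a,3.0", "b,10", "bad line", "c,x"]
    print(serve(process(ingest(raw)), threshold=5))
```

In a real deployment each stage would be replaced by its Azure counterpart: an ingestion service feeding the data in, Spark jobs doing the transformation, and a dashboard or BI tool reading the result.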
Is Azure Databricks based on Spark?
Yes, Azure Databricks is based on Apache Spark. It offers an interactive, Spark-powered workspace where data engineers, data scientists, and machine learning practitioners can collaborate. Spark is a general-purpose, open-source framework for distributed data processing, and Azure Databricks builds on it for large-scale data processing and analytics.
Conclusion
Azure Databricks’s architecture provides a robust foundation for data analytics and machine learning. Its high-level design pairs the control plane with the compute plane, combining a flexible framework with scalable execution. Within this framework, the serverless compute plane simplifies resource management, while the classic compute plane offers greater control and customization.