Distributed Caching With Broadcast Variables: Apache Spark

Contributed by Prithviraj Bose

Broadcast variables are useful when large datasets need to be cached on executors. This blog explains how to get started.

What are Broadcast Variables?

Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables are serialized with each task's closure and shipped to the executors for every transformation and action, which can cause network overhead. With broadcast variables, they are shipped once to all executors and cached there for future reference.
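The mechanism can be sketched in a few lines. This is a minimal illustration rather than code from the original post: the object name `BroadcastBasics`, the helper `resolveNames`, and the tiny lookup table are assumptions made for the example. `SparkContext.broadcast` and `Broadcast.value` are the standard Spark calls.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object BroadcastBasics {
  // Hypothetical helper: resolves country codes through a broadcast lookup table.
  def resolveNames(sc: SparkContext, codes: Seq[String]): Array[String] = {
    // A small read-only table standing in for a much larger one.
    val lookup = Map("IN" -> "India", "FR" -> "France")

    // Ship the table to the executors once; tasks read it through .value.
    val lookupBc: Broadcast[Map[String, String]] = sc.broadcast(lookup)

    sc.parallelize(codes)
      .map(code => lookupBc.value.getOrElse(code, "unknown"))
      .collect()
  }
}
```

Inside the `map`, the tasks reference only the broadcast handle `lookupBc`, not the table itself, so the table is not re-serialized with each task.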

Broadcast Variables Use case

Imagine that while doing a transformation we need to look up a large table of zip codes/pin codes. Here, it is neither feasible to send the large lookup table to the executors every time, nor can we query the database every time. The solution is to convert this lookup table to a broadcast variable, which Spark will cache in every executor for future reference.

Let’s take a simple example to understand the above concepts. We have a CSV file with names of countries and their capitals. The CSV file can be found here.

Assume we are processing demographic data of countries and need to get the capital of each country. In this case, we can convert the data in the CSV file to a broadcast variable.

First, we load the CSV file into a map. If the file is found, the method returns Some(countries); otherwise it returns None.
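The original listing for this step is not reproduced here, so the following is only a sketch of what such a loader could look like. The object name `CountryLoader`, the method name `loadCSV`, and the file layout (one `country,capital` pair per line, no header, no quoted commas) are assumptions.

```scala
import java.io.File
import scala.io.Source

object CountryLoader {
  // Returns Some(country -> capital map) if the file exists, else None.
  // Assumes each line looks like "country,capital"; other lines are skipped.
  def loadCSV(path: String): Option[Map[String, String]] = {
    val file = new File(path)
    if (!file.isFile) None
    else {
      val source = Source.fromFile(file, "UTF-8")
      try {
        val countries = source.getLines()
          .map(line => line.split(",", 2).map(_.trim))
          .collect { case Array(country, capital) if country.nonEmpty => country -> capital }
          .toMap // forces the iterator before the source is closed
        Some(countries)
      } finally {
        source.close()
      }
    }
  }
}
```

Returning an `Option` lets the caller handle the missing-file case with a pattern match instead of catching an exception.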

After successfully loading the CSV file, we convert the map to a broadcast variable and use it in our programme.
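A sketch of this step, under stated assumptions: the object name `CountryCache`, the helper `build`, the local master setting, and the in-memory map standing in for the loaded CSV data are all illustrative and not taken from the original source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object CountryCache {
  // Broadcasts the country -> capital map and builds an RDD from its keys.
  def build(sc: SparkContext, countries: Map[String, String]): (Broadcast[Map[String, String]], RDD[String]) = {
    val countriesCache = sc.broadcast(countries)
    val countryNames = sc.parallelize(countries.keys.toSeq)
    (countriesCache, countryNames)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration.
    val conf = new SparkConf().setAppName("country-cache").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // In-memory stand-in for the Option[Map] returned by the CSV loading step.
      val loaded: Option[Map[String, String]] =
        Some(Map("India" -> "New Delhi", "Italy" -> "Rome"))

      loaded match {
        case Some(countries) =>
          val (countriesCache, countryNames) = build(sc, countries)
          println(s"Broadcast ${countriesCache.value.size} entries; RDD has ${countryNames.count()} keys")
        case None =>
          println("CSV file not found; nothing to broadcast")
      }
    } finally {
      sc.stop()
    }
  }
}
```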

The programme loads the CSV file into a map, countries, and then converts that map to a broadcast variable, countriesCache. Subsequently, we create an RDD from the keys of countries. In the searchCountryDetails method, we search for all the countries starting with a user-defined letter, and the method returns an RDD of countries along with their capitals. The broadcast variable countriesCache is used for looking up the capitals.
This way, we need not send the whole CSV data every time we need to search.

The code for searchCountryDetails is shown below:
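The original listing did not survive on this page, so what follows is a hedged reconstruction from the description above, not the author's exact code. The object name `CountrySearch`, the parameter names, the case-insensitive match, and the "unknown" fallback are assumptions.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object CountrySearch {
  // Returns (country, capital) pairs for countries starting with the given letter.
  // Capitals are looked up in the broadcast map, not shipped with each task.
  def searchCountryDetails(
      countryNames: RDD[String],
      countriesCache: Broadcast[Map[String, String]],
      letter: Char): RDD[(String, String)] = {
    countryNames
      .filter(name => name.headOption.exists(first => first.toUpper == letter.toUpper))
      .map(country => (country, countriesCache.value.getOrElse(country, "unknown")))
  }
}
```

Because the closure passed to `map` captures only the broadcast handle, repeated searches reuse the copy already cached on each executor.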

The whole source code can be found here.

