Apache Spark combineByKey Explained


Contributed by Prithviraj Bose

Spark is a cluster computing framework designed for fast computation. One of its more powerful APIs is combineByKey.

Scala API: org.apache.spark.rdd.PairRDDFunctions.combineByKey.

Python API: pyspark.RDD.combineByKey.

The API takes three functions (as lambda expressions in Python or anonymous functions in Scala), namely,

Create combiner function: x
Merge value function: y
Merge combiners function: z

and the API format is combineByKey(x, y, z).
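
As a minimal sketch in Python, the three functions for a per-key average could look like the following. The names create_combiner, merge_value and merge_combiners are illustrative, and the (running sum, count) tuple is an assumption about the combiner's shape, chosen to fit the averaging example in this post.

```python
# Illustrative combiner functions for a per-key average.
# The combiner is assumed to be a (running_sum, count) tuple.

def create_combiner(value):
    """x: turn the first value seen for a key into a combiner."""
    return (value, 1)


def merge_value(combiner, value):
    """y: fold another value for the same key into an existing combiner."""
    total, count = combiner
    return (total + value, count + 1)


def merge_combiners(left, right):
    """z: merge two combiners built for the same key in different partitions."""
    return (left[0] + right[0], left[1] + right[1])


# With PySpark available, these would be passed as
#   pair_rdd.combineByKey(create_combiner, merge_value, merge_combiners)
# where pair_rdd is a hypothetical RDD of (key, numeric_value) pairs.
```

The same three roles apply in Scala, where each function would be written as an anonymous function.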

Let’s see an example (in Scala). The full Scala source can be found here.

Our objective is to find the average score per student.

Here’s a placeholder class, ScoreDetail, that stores a student’s name along with the score for a subject.

Some test data is generated and converted to key-value pairs where key = student’s name and value = ScoreDetail instance.
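
A rough Python equivalent of the placeholder class and the test data might look like this. The field names (student_name, subject, score) and all the records are hypothetical stand-ins, not the data from the original Scala source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreDetail:
    # Field names are assumptions made for this sketch.
    student_name: str
    subject: str
    score: float


# Hypothetical test data (not the article's original records).
scores = [
    ScoreDetail("A", "Math", 98.0),
    ScoreDetail("A", "English", 88.0),
    ScoreDetail("B", "Math", 75.0),
    ScoreDetail("B", "English", 78.0),
    ScoreDetail("C", "Math", 90.0),
    ScoreDetail("C", "English", 80.0),
    ScoreDetail("D", "Math", 91.0),
    ScoreDetail("D", "English", 80.0),
]

# key = student's name, value = ScoreDetail instance
score_pairs = [(detail.student_name, detail) for detail in scores]
```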

Then we create a Pair RDD as shown in the code fragment below. Just for experimentation, I have created a hash partitioner of size 3, so the three partitions will contain 2, 2 and 4 key-value pairs respectively. This is highlighted in the section where we explore each partition.

Now we can explore each partition. The first line prints the length of each partition (the number of key-value pairs per partition) and the second line prints the contents of each partition.
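
The partitioning and exploration steps can be imitated in plain Python without Spark. In this sketch, toy_partition is a hypothetical, deterministic stand-in for a hash partitioner (it is not Spark's implementation), and the pairs carry a bare score instead of a ScoreDetail instance to keep the example short.

```python
# Hypothetical (student name, score) pairs; the scores stand in for ScoreDetail values.
score_pairs = [
    ("A", 98.0), ("A", 88.0),
    ("B", 75.0), ("B", 78.0),
    ("C", 90.0), ("C", 80.0),
    ("D", 91.0), ("D", 80.0),
]

NUM_PARTITIONS = 3


def toy_partition(key, num_partitions=NUM_PARTITIONS):
    """Deterministic stand-in for a hash partitioner (not Spark's implementation)."""
    return sum(key.encode("utf-8")) % num_partitions


# Assign every pair to a partition by its key, so equal keys share a partition.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in score_pairs:
    partitions[toy_partition(key)].append((key, value))

# First line: number of key-value pairs per partition.
partition_sizes = [len(partition) for partition in partitions]
print(partition_sizes)  # [2, 2, 4] for this toy data

# Second line: contents of each partition.
for index, partition in enumerate(partitions):
    print(index, partition)
```

Because the partition is chosen from the key alone, all records for one student end up in the same partition here.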

And here’s the finale, where we compute the average score per student after combining the scores across the partitions.

The code flow is as follows:
First, the create combiner function is called the first time a key is encountered in a partition and turns its value into a tuple (value, 1). After this phase, the first (key, value) for each key in a partition has become (key, (value, 1)).

Then, for every further value of the same key in that partition, the merge value function folds the value into the existing combiner. After this phase, every (key, (value, 1)) has become (key, (total, count)) in every partition.

Finally, the merge combiners function merges the combiners for each key across the partitions on the executors; the result reaches the driver only when an action such as collect is called. After this phase, every per-partition (key, (total, count)) has become
(key, (totalAcrossAllPartitions, countAcrossAllPartitions)).
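
The three phases above can be traced with a small plain-Python imitation. This is only a sketch of the semantics, not how Spark implements combineByKey: the helper combine_by_key and the partition contents are hypothetical, and student "A" and student "C" are deliberately split across partitions so that the merge combiners step has something to do.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python imitation of the three combineByKey phases (no Spark involved)."""
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key not in combiners:
                # First value for this key in this partition: create combiner.
                combiners[key] = create_combiner(value)
            else:
                # Later value for the same key in this partition: merge value.
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key in merged:
                # Same key already seen in another partition: merge combiners.
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Hypothetical partitions in which "A" and "C" each appear in two partitions.
partitions = [
    [("A", 98.0), ("B", 75.0), ("B", 78.0)],
    [("A", 88.0), ("C", 90.0)],
    [("C", 80.0)],
]

totals = combine_by_key(
    partitions,
    lambda score: (score, 1),
    lambda acc, score: (acc[0] + score, acc[1] + 1),
    lambda left, right: (left[0] + right[0], left[1] + right[1]),
)
print(totals)  # {'A': (186.0, 2), 'B': (153.0, 2), 'C': (170.0, 2)}
```

Student "B" is handled entirely by create combiner and merge value inside one partition, while "A" and "C" also pass through merge combiners.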

The map then converts each
(key, tuple) = (key, (totalAcrossAllPartitions, countAcrossAllPartitions))
into the average per key, (key, tuple._1/tuple._2).

The last line prints the average scores for all the students at the driver’s end.
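
The final map and print can be sketched in plain Python as well. The combined list below is hypothetical data in the (key, (total, count)) shape described above, not output from the original program.

```python
# Hypothetical output of the combine step: (key, (total, count)) pairs.
combined = [("A", (186.0, 2)), ("B", (153.0, 2)), ("C", (170.0, 2))]

# Equivalent of the final map: divide the total by the count for every key.
averages = [(name, total / count) for name, (total, count) in combined]

# Print one line per student.
for name, average in averages:
    print(f"{name}: {average}")
```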

