Prithviraj Bose
Published on Dec 07,2018
41.5K Views
Email Post

Contributed by Prithviraj Bose

Spark’s Resilient Distributed Datasets (the programming abstraction) are evaluated lazily and the transformations are stored as directed acyclic graphs (DAG). So every action on the RDD will make Spark recompute the DAG. This is how the resiliency is attained in Spark because if any worker node fails then the DAG just needs to be recomputed.

It is also mandatory to cache (persist with appropriate storage level) the RDD such that frequent actions on the RDD do not force Spark to recompute the DAG.

Why Use a Partitioner?

In cluster computing, the central challenge is to minimize network traffic. When the data is key-value oriented, partitioning becomes imperative because for subsequent transformations on the RDD, there’s a fair amount of shuffling of data across the network. If similar keys or range of keys are stored in the same partition then the shuffling is minimized and the processing becomes substantially fast.

Transformations that require shuffling of data across worker nodes greatly benefit from partitioning. Such transformations are cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey andlookup.

Partitions are configurable provided the RDD is key-value based.

Properties of Partition

  1. Tuples in the same partition are guaranteed to be in the same machine.
  2. Each node in a cluster can contain more than one partition.
  3. The total number of partitions are configurable, by default it is set to the total number of cores on all the executor nodes.

Types of Partitioning in Spark

Spark supports two types of partitioning,

  • Hash Partitioning: Uses Java’s Object.hashCodemethod to determine the partition as partition = key.hashCode() % numPartitions.

hash-partitioning-demystifying-partitioning-in-spark

  • Range Partitioning: Uses a range to distribute to the respective partitions the keys that fall within a range. This method is suitable where there’s a natural ordering in the keys and the keys are non negative. The below code snippet shows the usage of range partitioner.

range-partitioning-demystifying-partitioning-in-spark

Code Example

Let’s see an example on how to partition data across worker nodes. The full Scala code is available here.

Here’s some test data of 12 coordinates (as tuples),

test-data-demystifying-partitioning-in-spark

Create an org.apache.spark.HashPartitioner of size 2, where the keys will be partitioned across these two partitions based on the hash code of the keys.

hash-partitioner-demystifying-partitioning-in-spark

Then we can inspect the pairs and do various key based transformations like foldByKey and reduceByKey.

Summarizing, partitioning greatly improves speed of execution for key based transformations.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

Get Started with Apache Spark and Scala

Why You Should Learn Spark After Mastering Hadoop

Your Guide to Career Opportunities in Spark

Apache Spark Vs Hadoop MapReduce

About Author
Prithviraj Bose
Prithviraj Bose
Published on Dec 07,2018
Prithviraj has spent close to two decades in the software development industry designing and developing applications ranging from Level 5 process control software at M N Dastur & Co., stock trading & allocation software at Lehman Brothers to Electronic Program Guides for Set Top Boxes. At the moment he is curious about Design Patterns, Python, Java, C++, REST, Agile Methodologies and Cluster Computing.

Share on

Browse Categories

Comments
8 Comments