Contributed by Prithviraj Bose
Spark’s Resilient Distributed Datasets (RDDs, the core programming abstraction) are evaluated lazily, and their transformations are recorded as a directed acyclic graph (DAG). Every action on an RDD therefore makes Spark re-evaluate the DAG. This is also how Spark attains resiliency: if a worker node fails, the lost partitions only need to be recomputed from the DAG.
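The lazy evaluation described above can be observed with a small experiment. The following is a minimal sketch, assuming a local Spark installation; the object name, the accumulator name and the `local[2]` master are illustrative choices, not part of the original post. An accumulator counts how many times the `map` function actually runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns the number of map-function calls observed before any action,
  // after the first action, and after the second action.
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("map calls")
      // Transformation only: this records a step in the DAG, nothing runs yet.
      val squares = sc.parallelize(1 to 10).map { x => calls.add(1L); x * x }
      val beforeAction = calls.value.longValue()
      squares.count() // first action: evaluates the DAG
      val afterFirst = calls.value.longValue()
      squares.count() // second action: evaluates the DAG again
      val afterSecond = calls.value.longValue()
      (beforeAction, afterFirst, afterSecond)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In a local run without task retries, the counter should stay at zero until the first `count()` and then grow by ten with each action, because the uncached RDD is recomputed every time.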
It is therefore important to cache the RDD (persist it with an appropriate storage level) so that repeated actions on it do not force Spark to recompute the DAG. The topics covered in this blog are essential for the Apache Spark and Scala Certification.
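As a minimal sketch of the effect of caching, assuming a local Spark installation (the object name and the choice of `StorageLevel.MEMORY_ONLY` are illustrative), the example below persists an RDD and counts how often its `map` function runs across two actions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns the number of map-function calls after the first and second action.
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("map calls")
      val squares = sc.parallelize(1 to 10)
        .map { x => calls.add(1L); x * x }
        .persist(StorageLevel.MEMORY_ONLY) // keep computed partitions in memory
      squares.count() // computes the partitions and caches them
      val afterFirst = calls.value.longValue()
      squares.count() // served from the cache, map is not re-run
      val afterSecond = calls.value.longValue()
      squares.unpersist()
      (afterFirst, afterSecond)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With this small data set the cached partitions fit in memory, so the second action should not trigger any further calls to the `map` function.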
In cluster computing, a central challenge is to minimize network traffic. When the data is key-value oriented, partitioning becomes imperative because subsequent transformations on the RDD shuffle a fair amount of data across the network. If identical keys, or keys within the same range, are stored in the same partition, the shuffling is minimized and the processing becomes substantially faster.
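To make the idea of key placement concrete, here is a plain-Scala sketch with no Spark dependency. It mirrors the rule used by hash partitioning, where a key goes to the partition given by the non-negative remainder of its hash code divided by the number of partitions. The object and function names are hypothetical.

```scala
object HashPlacementSketch {
  // Remainder that is never negative, even for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key; a null key is sent to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k    => nonNegativeMod(k.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Equal keys always land in the same partition, so records that
    // share a key never need to cross the network to meet each other.
    Seq(4, 7, -3, 4).foreach { k =>
      println(s"key $k -> partition ${partitionFor(k, 2)}")
    }
  }
}
```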
Transformations that shuffle data across worker nodes benefit greatly from partitioning. These include cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey and lookup.
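Spark supports two types of partitioning: hash partitioning, which assigns a key to a partition based on its hash code, and range partitioning, which places keys that fall within the same range in the same partition.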
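Both kinds of partitioner are available in Spark as `org.apache.spark.HashPartitioner` and `org.apache.spark.RangePartitioner`. The following is a minimal sketch of how each one places keys, assuming a local Spark installation; the sample keys and the object name are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}

object PartitionerTypesSketch {
  // Returns the hash-partition index of keys 1..4, and the keys held by
  // each partition after range partitioning.
  def run(): (Seq[Int], Seq[Seq[Int]]) = {
    val conf = new SparkConf().setAppName("partitioner-types-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative key-value pairs with integer keys.
      val pairs = sc.parallelize(Seq(9, 2, 7, 4, 1, 8, 3, 6).map(k => (k, k.toString)))

      // Hash partitioning: the partition depends only on the key's hash code.
      val hashPartitioner = new HashPartitioner(2)
      val hashPlacement = Seq(1, 2, 3, 4).map(k => hashPartitioner.getPartition(k))

      // Range partitioning: Spark samples the keys to pick range boundaries,
      // so each partition holds a contiguous range of the sorted keys.
      val rangePartitioner = new RangePartitioner(2, pairs)
      val byRange = pairs.partitionBy(rangePartitioner).keys.glom().collect()
        .map(_.toSeq).toSeq

      (hashPlacement, byRange)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The exact boundary chosen by the range partitioner depends on sampling, so the sketch only relies on the property that every key in one partition is smaller than every key in the next.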
Let’s look at an example of how to partition data across worker nodes. The full Scala code is available here.
The test data consists of 12 coordinates, stored as tuples.
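As a hypothetical stand-in for such test data, the sketch below builds 12 coordinate tuples in plain Scala. The values and the object name are illustrative assumptions, not the author's data; in a Spark program the sequence would typically be turned into an RDD with `sc.parallelize`.

```scala
object CoordinateData {
  // 12 illustrative (x, y) coordinates: x in 1..3, y in 1..4.
  // The x value acts as the key and the y value as the value.
  val coordinates: Seq[(Int, Int)] =
    for {
      x <- 1 to 3
      y <- 1 to 4
    } yield (x, y)

  def main(args: Array[String]): Unit = coordinates.foreach(println)
}
```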
Next, create an org.apache.spark.HashPartitioner of size 2, so that the keys are distributed across two partitions based on their hash codes.
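A minimal sketch of this step, assuming a local Spark installation and 12 illustrative coordinate tuples (not the author's data), could look like this. The object name is hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object HashPartitionSketch {
  // Returns whether the RDD reports a HashPartitioner of size 2, and the
  // set of keys found in each partition.
  def run(): (Boolean, Seq[Set[Int]]) = {
    val conf = new SparkConf().setAppName("hash-partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative data: 12 (x, y) tuples, keyed by x.
      val coordinates = for { x <- 1 to 3; y <- 1 to 4 } yield (x, y)
      val pairs = sc.parallelize(coordinates)

      // Redistribute the pairs so that equal keys share a partition.
      val partitioned = pairs.partitionBy(new HashPartitioner(2))

      val usesHashPartitioner = partitioned.partitioner == Some(new HashPartitioner(2))
      val keysPerPartition = partitioned.keys.glom().collect().map(_.toSet).toSeq
      (usesHashPartitioner, keysPerPartition)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because the hash code of an integer key is the integer itself, the even key 2 should end up in partition 0 and the odd keys 1 and 3 in partition 1.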
Then we can inspect the pairs in each partition and apply various key-based transformations such as foldByKey and reduceByKey.
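The sketch below shows one way to carry out this inspection and the two transformations, assuming a local Spark installation and 12 illustrative coordinate tuples (not the author's data). The object name and the choice of summing with `reduceByKey` and taking the maximum with `foldByKey` are assumptions made for the example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object KeyedTransformationsSketch {
  // Returns the pairs held by each partition, the per-key sums and the
  // per-key maxima.
  def run(): (Seq[Seq[(Int, Int)]], Map[Int, Int], Map[Int, Int]) = {
    val conf = new SparkConf().setAppName("keyed-transformations-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val coordinates = for { x <- 1 to 3; y <- 1 to 4 } yield (x, y)
      val partitioned = sc.parallelize(coordinates)
        .partitionBy(new HashPartitioner(2))
        .persist() // reused by several actions below

      // glom() turns each partition into an array so its contents can be inspected.
      val layout = partitioned.glom().collect().map(_.toSeq).toSeq

      // reduceByKey merges the values of each key with the given function.
      val sums = partitioned.reduceByKey(_ + _).collectAsMap().toMap

      // foldByKey does the same but starts from a neutral zero value.
      val maxima = partitioned
        .foldByKey(Int.MinValue)((a, b) => math.max(a, b))
        .collectAsMap().toMap

      (layout, sums, maxima)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (layout, sums, maxima) = run()
    layout.zipWithIndex.foreach { case (pairs, i) => println(s"partition $i: $pairs") }
    println(s"sums: $sums, maxima: $maxima")
  }
}
```

The zero value passed to `foldByKey` may be applied more than once, so it should be neutral for the operation, as `Int.MinValue` is for `max`.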
In summary, partitioning greatly improves the speed of execution for key-based transformations.