Lineage Graph in Spark

What is a lineage graph in Spark?
Jun 19, 2018 in Apache Spark by Data_Nerd

3 answers to this question.

Most RDDs in Spark are derived from one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Spark uses the lineage graph to compute each RDD on demand, so whenever part of a persisted RDD is lost, the lost data can be recomputed from the lineage information.
answered Jun 19, 2018 by Ashish

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.


The execution DAG or physical execution plan is the DAG of stages.


The following diagram uses cartesian or zip for learning purposes only. You may use other operators to build an RDD graph.

Figure 1. RDD lineage

The above RDD graph could be the result of the following series of transformations:

val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00 cartesian r01
val r11 = r00.map(n => (n, n))
val r12 = r00 zip r01
val r13 = r01.keyBy(_ / 20)
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

An RDD lineage graph is hence a graph of the transformations that need to be executed once an action has been called.

You can inspect an RDD's lineage graph with the RDD.toDebugString method.
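
As a rough sketch of how toDebugString can be used, assuming Spark is on the classpath and a local master is acceptable. The object name, the app name, and the small keyBy/reduceByKey pipeline are illustrative choices, not part of the answer above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDebugSketch {
  // Builds a small pipeline and returns its lineage as text.
  // Only transformations are used, so no job is run here.
  def describeLineage(sc: SparkContext): String = {
    val numbers = sc.parallelize(0 to 9)      // source RDD, no parents
    val keyed   = numbers.keyBy(_ % 3)        // narrow dependency on `numbers`
    val sums    = keyed.reduceByKey(_ + _)    // shuffle dependency on `keyed`
    sums.toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master is enough for this demo.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-debug-sketch")
    val sc = new SparkContext(conf)
    try println(describeLineage(sc))
    finally sc.stop()
  }
}
```

The printed text lists the final RDD first and its ancestors below it, so reading it from the bottom up follows the order in which the transformations were applied.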

answered Jul 12, 2018 by zombie

When a series of transformations is applied to an RDD, they are not evaluated immediately but lazily.

When a new RDD is created from an existing RDD, the new RDD holds a pointer to its parent RDD. In this way the dependencies between RDDs, rather than the actual data, are recorded in a graph. This graph is called the lineage graph.

For example, consider the operations below:

1. Create a new RDD from a text file - first RDD
2. Apply map operation on first RDD to get second RDD
3. Apply filter operation on second RDD to get third RDD
4. Apply count operation on third RDD to get the result (count is an action, so it returns a number to the driver rather than a fourth RDD)

The lineage graph of these operations looks like:

First RDD ---> Second RDD (applying map) ---> Third RDD (applying filter) ---> result (applying count, the action that triggers the computation)
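
The steps above could be sketched in Scala roughly as follows, assuming Spark is on the classpath. The object name, the temporary file, and the particular map and filter functions are made up for illustration. Note that count is an action: it returns a Long to the driver and is the point at which the earlier transformations actually run.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object LazyPipelineSketch {
  // Steps 1 to 3 only record lineage; step 4 triggers the computation.
  def countLongLines(sc: SparkContext, path: String): Long = {
    val first  = sc.textFile(path)             // step 1: RDD from a text file
    val second = first.map(_.toUpperCase)      // step 2: map (illustrative function)
    val third  = second.filter(_.length > 3)   // step 3: filter (illustrative predicate)
    third.count()                              // step 4: action, returns a Long
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local file stands in for the real input.
    val file = Files.createTempFile("lineage-demo", ".txt")
    Files.write(file, "spark\nrdd\nlineage\n".getBytes("UTF-8"))

    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-pipeline-sketch")
    val sc = new SparkContext(conf)
    try println(countLongLines(sc, file.toUri.toString))
    finally sc.stop()
  }
}
```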

The lineage graph is useful when partitions are lost. Instead of replicating the data across different nodes as HDFS does, Spark can use the lineage graph, which is a DAG (Directed Acyclic Graph), to replay the transformations for just the lost partition and arrive at the same result.
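
To make the recovery idea concrete, here is a toy model in plain Scala. It is not Spark's implementation, and every name in it (Node, Source, Mapped, Filtered, ToyLineage) is hypothetical. Each node keeps only its parent and the function that derives it, so any single partition can be rebuilt by walking back through the parents.

```scala
// Toy model only: a node stores its parent and a function, never derived data.
sealed trait Node[A] {
  def compute(partition: Int): Seq[A]   // rebuild one partition from lineage
  def describe: String                  // human-readable chain of parents
}

final case class Source[A](partitions: Vector[Seq[A]]) extends Node[A] {
  def compute(partition: Int): Seq[A] = partitions(partition)
  def describe: String = "Source"
}

final case class Mapped[A, B](parent: Node[A], f: A => B) extends Node[B] {
  def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  def describe: String = s"${parent.describe} -> map"
}

final case class Filtered[A](parent: Node[A], p: A => Boolean) extends Node[A] {
  def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  def describe: String = s"${parent.describe} -> filter"
}

object ToyLineage {
  def main(args: Array[String]): Unit = {
    val source = Source(Vector(Seq(1, 2, 3), Seq(4, 5, 6)))
    val result = Filtered(Mapped(source, (n: Int) => n * 10), (n: Int) => n > 20)
    println(result.describe)

    // Pretend both partitions were cached, then partition 1 was lost.
    val cache = scala.collection.mutable.Map(
      0 -> result.compute(0),
      1 -> result.compute(1)
    )
    cache.remove(1)

    // Only the lost partition is recomputed; partition 0 stays untouched.
    val rebuilt = cache.getOrElseUpdate(1, result.compute(1))
    println(rebuilt)
  }
}
```

The design point is that recovery costs some recomputation but needs no second copy of the data, which is the trade-off the answer contrasts with HDFS replication.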

answered Aug 27, 2018 by shams