Lineage Graph in Spark

Question

Can somebody please explain what the lineage graph in Spark is?

Ashish · Answer 1 · Jun 19, 2018

RDDs in Spark depend on one or more other RDDs. The lineage graph is the representation of these dependencies between RDDs. Spark uses the lineage information to compute each RDD on demand, so that whenever part of a persisted RDD is lost, the lost data can be recomputed from the lineage graph.


zombie · Answer 2 · Jul 12, 2018

RDD lineage (also known as the RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and forms a logical execution plan.

The execution DAG, or physical execution plan, is the DAG of stages.

[Figure: the RDD lineage graph]

The above RDD graph could be the result of the following series of transformations:

val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00 cartesian r01
val r11 = r00.map(n => (n, n))
val r12 = r00 zip r01
val r13 = r01.keyBy(_ / 20)
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.

You can inspect the RDD lineage graph using the RDD.toDebugString method.
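As a minimal sketch of that, assuming spark-core is on the classpath and Spark runs locally (the object name LineageDemo and the local[2] master are illustrative choices, not part of the original answer), the lineage of r20 from the snippet above could be printed like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads.
    val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Same series of transformations as in the answer.
      val r00 = sc.parallelize(0 to 9)
      val r01 = sc.parallelize(0 to 90 by 10)
      val r10 = r00 cartesian r01
      val r11 = r00.map(n => (n, n))
      val r12 = r00 zip r01
      val r13 = r01.keyBy(_ / 20)
      val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

      // toDebugString returns a text description of this RDD and its parents.
      // No job runs here, because no action has been called on r20.
      println(r20.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

The exact text printed varies with the Spark version and the number of partitions, so no output is shown here.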

edited Jun 8, 2020 by Sirajul

What is the number of stages in the DAG of the result?

commented Sep 18, 2020 by anonymous

Hi,

I think it depends on the number of stages. The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions in the input (for example, an RDD created with textFile).

commented Sep 18, 2020 by MD

shams · Answer 3 · Aug 28, 2018

Whenever a series of transformations are performed on an RDD, they are not evaluated immediately, but lazily.


When a new RDD is created from an existing RDD, the new RDD contains a pointer to the parent RDD. In the same way, all the dependencies between the RDDs are recorded in a graph, rather than the actual data. This graph is called the lineage graph.
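The pointer to the parent can be inspected through the dependencies method on an RDD. The following is a small sketch under assumed names (ParentPointerDemo, parent, and child are hypothetical, and a local master is assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParentPointerDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads.
    val conf = new SparkConf().setAppName("ParentPointerDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val parent = sc.parallelize(1 to 10)
      val child = parent.map(_ * 2)

      // Each Dependency holds a reference to a parent RDD, not to its data.
      val parents = child.dependencies.map(_.rdd)
      parents.foreach(p => println(s"child depends on: $p"))

      // Reference comparison: checks whether the recorded parent is the
      // very same object as `parent`.
      println(parents.exists(_ eq parent))
    } finally {
      sc.stop()
    }
  }
}
```

Following dependencies from parent to parent until an RDD with no dependencies is reached is one way to walk the whole lineage by hand.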

For example, consider the following operations:

1. Create a new RDD from a text file - first RDD
2. Apply map operation on first RDD to get second RDD
3. Apply filter operation on second RDD to get third RDD
4. Apply count operation on third RDD to get the result (count is an action, so it returns a number rather than a new RDD)

The lineage graph of these operations looks like:

First RDD ---> Second RDD (applying map) ---> Third RDD (applying filter) ---> result (applying count)
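The steps above could be sketched roughly as follows. This is only an illustration: the temporary file, its contents, the trim and nonEmpty functions, and the object name LazyLineageDemo are all assumptions made so the example can run on its own.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object LazyLineageDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a small temporary file stands in for the input text file.
    val path = Files.createTempFile("lineage-demo", ".txt")
    Files.write(path, "spark\nrdd\n\nlineage\n".getBytes("UTF-8"))

    val conf = new SparkConf().setAppName("LazyLineageDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val first = sc.textFile(path.toUri.toString) // step 1: RDD of lines
      val second = first.map(_.trim)               // step 2: transformation, lazy
      val third = second.filter(_.nonEmpty)        // step 3: transformation, lazy

      // Only the lineage has been recorded so far; the file has not been processed.
      println(third.toDebugString)

      // Step 4: count is an action. It triggers the job and returns a Long.
      val n: Long = third.count()
      println(s"non-empty lines: $n")
    } finally {
      sc.stop()
      Files.deleteIfExists(path)
    }
  }
}
```

Nothing is read from the file until count is called, which is the lazy evaluation described above.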

This lineage graph is useful when any of the partitions are lost. Spark can replay the transformations on the lost partition using the lineage graph, which is a DAG (Directed Acyclic Graph), to reproduce the same computation, instead of replicating the data across different nodes as HDFS does.