Can somebody please explain as to what is Lineage Graph in Spark?
Jun 19, 2018

## 3 answers to this question.

RDDs in Spark depend upon one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persisted RDD is lost, the lost data can be recovered using the lineage graph information.

RDD lineage (also known as the RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a consequence of applying transformations to the RDD, and it forms a logical execution plan.

The execution DAG, or physical execution plan, is the DAG of stages.

As an example, an RDD lineage graph could be the result of the following series of transformations:

```scala
val r00 = sc.parallelize(0 to 9)        // ParallelCollectionRDD
val r01 = sc.parallelize(0 to 90 by 10) // ParallelCollectionRDD
val r10 = r00.cartesian(r01)            // CartesianRDD
val r11 = r00.map(n => (n, n))          // MapPartitionsRDD
val r12 = r00.zip(r01)                  // ZippedPartitionsRDD2
val r13 = r01.keyBy(_ / 20)             // MapPartitionsRDD
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _) // UnionRDD
```

An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.

You can inspect the RDD lineage graph using the `RDD.toDebugString` method.
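For instance, a minimal sketch of inspecting the lineage of the chain above (assuming a local Spark context created just for this demo; the app name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local context for this sketch only
val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))

val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00.cartesian(r01)
val r11 = r00.map(n => (n, n))
val r12 = r00.zip(r01)
val r13 = r01.keyBy(_ / 20)
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

// toDebugString renders the lineage: the union at the top, with its
// parents (cartesian, map, zip, keyBy) indented beneath it
println(r20.toDebugString)
```

The output lists the operator RDDs (e.g. `UnionRDD`, `CartesianRDD`, `MapPartitionsRDD`), each prefixed with its number of partitions, with indentation marking stage boundaries.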


What is the number of stages in the DAG of the result?

Hi,

I think it depends on the number of shuffle (wide) dependencies in the job, since each shuffle introduces a new stage. The DAG scheduler then submits the stages to the task scheduler, and the number of tasks submitted per stage depends on the number of partitions in the underlying RDD (for example, the partitions of a file read with `textFile`).
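As a rough sketch of this (assuming a local Spark context; names are arbitrary), the task count follows the partition count, and a shuffle introduces a stage boundary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("stages-demo").setMaster("local[2]"))

// 4 partitions => a stage over this RDD runs 4 tasks
val rdd = sc.parallelize(1 to 100, 4)
println(rdd.getNumPartitions) // 4

// reduceByKey requires a shuffle, so the resulting job has two stages
val sums = rdd.map(n => (n % 2, n)).reduceByKey(_ + _)
println(sums.toDebugString) // shows a ShuffledRDD above the stage boundary
```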

Whenever a series of transformations are performed on an RDD, they are not evaluated immediately, but lazily.
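A small sketch of this laziness (assuming a local Spark context; the accumulator is used only to observe when the map function actually runs):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))

val calls = sc.longAccumulator("map-calls")

// Declaring the transformation runs nothing: the map function is not invoked yet
val doubled = sc.parallelize(1 to 10).map { n => calls.add(1); n * 2 }
println(calls.value) // 0 - still lazy

// The count action triggers evaluation of the whole lineage
doubled.count()
println(calls.value) // 10
```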

When a new RDD is created from an existing RDD, the new RDD contains a pointer to the parent RDD. Similarly, all the dependencies between the RDDs are logged in a graph, rather than the actual data. This graph is called the lineage graph.

For example, consider the operations below:

1. Create a new RDD from a text file - first RDD
2. Apply a map operation on the first RDD to get a second RDD
3. Apply a filter operation on the second RDD to get a third RDD
4. Apply a count action on the third RDD to get the result (count is an action, so it triggers evaluation)

The lineage graph of all these operations looks like:

First RDD ---> Second RDD (applying map) ---> Third RDD (applying filter) ---> count (action)
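The steps above can be sketched as follows (a self-contained example assuming a local Spark context; it writes a small temporary file first, and the file contents are made up for illustration):

```scala
import java.nio.file.Files
import java.util.Arrays
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-steps").setMaster("local[2]"))

// A small throwaway input file so the example runs anywhere
val path = Files.createTempFile("lines", ".txt")
Files.write(path, Arrays.asList("spark", "lineage", "graph", "rdd"))

val first  = sc.textFile(path.toString)  // first RDD, from the text file
val second = first.map(_.toUpperCase)    // second RDD (map)
val third  = second.filter(_.length > 5) // third RDD (filter)
val result = third.count()               // count is an action: it triggers the whole lineage
println(result) // 1 ("LINEAGE" is the only line longer than 5 characters)
```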

This lineage graph is useful in case any of the partitions are lost. Spark can replay the transformations on that partition using the lineage graph stored in the DAG (Directed Acyclic Graph) to achieve the same computation, rather than replicating the data across different nodes as in HDFS.
