What is Lineage Graph in Spark? Jun 19, 2018

## 3 answers to this question.

RDDs in Spark can depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Spark uses lineage information to compute each RDD on demand, so whenever a partition of a persisted RDD is lost, the lost data can be recovered by recomputing it from the lineage graph.

answered Jun 19, 2018

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.

> **Note:** The execution DAG, or physical execution plan, is the DAG of stages.

> **Note:** The following diagram uses `cartesian` or `zip` for learning purposes only. You may use other operators to build an RDD graph.

*Figure 1. RDD lineage*

The above RDD graph could be the result of the following series of transformations:

```
// Two source RDDs
val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
// Transformations, each adding a node to the lineage graph
val r10 = r00 cartesian r01
val r11 = r00.map(n => (n, n))
val r12 = r00 zip r01
val r13 = r01.keyBy(_ / 20)
// Union everything into a single RDD
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)
```

An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
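As a sketch (assuming the `r00`…`r20` definitions above and an active `SparkContext` named `sc`, e.g. inside `spark-shell`), you can print the lineage with `toDebugString`; the exact output format varies by Spark version:

```
// Print the dependency graph of the final RDD.
// The output is a tree showing the UnionRDD, CartesianRDD,
// and other parent RDDs that r20 was built from.
println(r20.toDebugString)
```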

You can inspect the lineage graph of an RDD using the `RDD.toDebugString` method.

answered Jul 12, 2018

Whenever a series of transformations is performed on an RDD, they are not evaluated immediately, but lazily.

When a new RDD is created from an existing RDD, the new RDD holds a pointer to its parent RDD. All the dependencies between RDDs are recorded in a graph, rather than the actual data. This graph is called the lineage graph.
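For instance (a sketch assuming an active `SparkContext` named `sc`), a transformation such as `map` returns immediately without touching the data; only an action forces evaluation of the recorded lineage:

```
val nums    = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)   // lazy: only the lineage is recorded
doubled.collect()               // action: the computation actually runs
```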

For example, consider the operations below:

1. Create a new RDD from a text file - first RDD
2. Apply map operation on first RDD to get second RDD
3. Apply filter operation on second RDD to get third RDD
4. Apply count action on the third RDD to trigger the computation (count is an action, so it returns a value rather than a new RDD)

The lineage graph of these operations looks like:

First RDD ---> Second RDD (applying map) ---> Third RDD (applying filter) ---> count (action triggering execution)
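The steps above can be sketched in Spark Scala as follows (a hedged sketch: the file path `input.txt` and the map/filter logic are illustrative, and it assumes an active `SparkContext` named `sc`):

```
// 1. First RDD: load a text file (the path is illustrative)
val first  = sc.textFile("input.txt")
// 2. Second RDD: map each line to its length
val second = first.map(line => line.length)
// 3. Third RDD: keep only non-empty lines
val third  = second.filter(len => len > 0)
// 4. count is an action: it triggers execution of the whole lineage
val total  = third.count()
```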

This lineage graph is useful if any partition is lost: Spark can replay the transformations on just that partition, using the lineage information in the DAG (Directed Acyclic Graph), to recompute it, rather than replicating the data across different nodes as HDFS does.

answered Aug 27, 2018
