Lineage Graph in Spark

0 votes
Can somebody please explain as to what is Lineage Graph in Spark?
Jun 19, 2018 in Apache Spark by Data_Nerd
• 2,390 points

edited Jun 8 by Sirajul 4,837 views

3 answers to this question.

0 votes
RDDs in Spark depend on one or more other RDDs. The lineage graph is the representation of these dependencies between RDDs. Spark uses the lineage information to compute each RDD on demand, so that whenever part of a persisted RDD is lost, the lost data can be recomputed from the lineage graph.
answered Jun 19, 2018 by Ashish
• 2,650 points
0 votes

RDD lineage (also known as the RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a consequence of applying transformations to the RDD, and it forms a logical execution plan.

The execution DAG, or physical execution plan, is the DAG of stages.

[Diagram: RDD lineage graph (image not reproduced here)]

An RDD lineage graph like the one in the diagram could be the result of the following series of transformations:

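// Assumes a SparkContext named sc, as provided by spark-shell.
val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00 cartesian r01        // every (r00 element, r01 element) pair
val r11 = r00.map(n => (n, n))
val r12 = r00 zip r01              // pairs elements by position
val r13 = r01.keyBy(_ / 20)
// Union r10 with r11, r12 and r13 in turn.
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)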

An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.

You can inspect the RDD lineage graph using the RDD.toDebugString method.
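
As a minimal sketch of that method, the snippet below builds a short chain of transformations and prints its lineage. The RDD names and the local[2] master are assumptions chosen for illustration; in spark-shell the existing context is reused. The exact text printed varies by Spark version, so no output is shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse a running context (e.g. in spark-shell), otherwise start a local one.
// The app name and master are illustrative assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))

val numbers = sc.parallelize(0 to 9)                        // root of the lineage
val pairs   = numbers.map(n => (n, n))                      // child of numbers
val evens   = pairs.filter { case (k, _) => k % 2 == 0 }    // child of pairs

// toDebugString returns the lineage as an indented, multi-line String.
// Calling it does not run a job; no action has been invoked.
val lineage: String = evens.toDebugString
println(lineage)
```

Each indented line in the printed string is one parent RDD, with the RDD the method was called on at the top.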

answered Jul 12, 2018 by zombie
• 3,750 points

edited Jun 8 by Sirajul
What is the number of stages in the DAG of the result?

Hi,

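I think it depends on the number of shuffle boundaries: Spark starts a new stage at each wide (shuffle) dependency in the lineage. The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted for a stage depends on the number of partitions, for example the partitions of a file read with textFile.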
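
To make the link between dependencies, stages, and partitions concrete, here is a rough sketch that inspects an RDD's direct dependencies and partition count. The data, the names, and the choice of 4 partitions are assumptions for illustration only.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

// Reuse a running context, otherwise start a local one (illustrative settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("stages-demo").setMaster("local[2]"))

// Ask explicitly for 4 partitions.
val words  = sc.parallelize(Seq("a", "b", "a", "c"), 4)
val pairs  = words.map(w => (w, 1))      // narrow dependency on words
val counts = pairs.reduceByKey(_ + _)    // shuffle dependency on pairs

// True when any direct dependency of the RDD is a shuffle dependency.
def hasShuffle(deps: Seq[org.apache.spark.Dependency[_]]): Boolean =
  deps.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

val pairsShuffles  = hasShuffle(pairs.dependencies)
val countsShuffles = hasShuffle(counts.dependencies)
val wordPartitions = words.getNumPartitions

println(s"pairs shuffles: $pairsShuffles, counts shuffles: $countsShuffles")
println(s"partitions in words: $wordPartitions")
```

In this sketch an action on counts would run as two stages, split at the reduceByKey shuffle, and the first stage would run one task per partition of words.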

0 votes
Whenever a series of transformations is performed on an RDD, the transformations are not evaluated immediately, but lazily.

When a new RDD is created from an existing RDD, the new RDD contains a pointer to its parent RDD. In this way, all the dependencies between the RDDs are recorded in a graph, rather than the actual data. This graph is called the lineage graph.

For example, consider the following operations:

1. Create a new RDD from a text file (first RDD)
2. Apply a map operation on the first RDD to get the second RDD
3. Apply a filter operation on the second RDD to get the third RDD
4. Apply a count operation on the third RDD to get the result (count is an action, so it returns a number, not a fourth RDD)

The lineage graph of these operations looks like:

First RDD ---> Second RDD (applying map) ---> Third RDD (applying filter) ---> Result (applying count)
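
The steps above can be sketched in Scala roughly as follows. The temporary file, its contents, and the particular map and filter functions are hypothetical, chosen only so the snippet runs on its own. Note that count is an action that returns a Long, so the chain ends in a value rather than another RDD.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse a running context, otherwise start a local one (illustrative settings).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pipeline-demo").setMaster("local[2]"))

// Hypothetical input: a small temporary text file with one blank line.
val path = Files.createTempFile("lineage-demo", ".txt")
Files.write(path, "spark\nlineage\n\ngraph\n".getBytes("UTF-8"))

val first  = sc.textFile(path.toUri.toString)   // 1. first RDD, from the file
val second = first.map(_.trim)                  // 2. second RDD, via map
val third  = second.filter(_.nonEmpty)          // 3. third RDD, via filter

// Nothing has been read or computed so far; only the lineage is recorded.
// 4. count is an action: it triggers the job and returns a Long.
val total: Long = third.count()
println(total)
```

Calling third.toDebugString before the count shows the recorded chain from the filter back to the file.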

This lineage graph is useful if any of the partitions are lost. Spark can replay the transformations on the lost partition, using the lineage graph held in the DAG (Directed Acyclic Graph), to reproduce the same computation, rather than replicating the data across different nodes as HDFS does.
answered Aug 27, 2018 by shams
• 3,630 points
