What is the difference between RDDs and DataFrames in Apache Spark?

Question

zombie · Answer 1 · Jul 27, 2018

DataFrame: A DataFrame stores data in tabular form. It is conceptually equivalent to a table in a relational database, but with richer optimizations. It is a data abstraction and domain-specific language (DSL) applicable to structured and semi-structured data. It is a distributed collection of data organized into rows and named columns. The columns may have different types (for example numeric, boolean, or string). We can think of a DataFrame as a two-dimensional structure in which each column contains the values of one variable and each row contains one set of values, one for each column.

RDD: An RDD (Resilient Distributed Dataset) represents a set of records as an immutable, distributed collection of objects. It can be thought of as a large collection of data, or as an array of references to partitioned objects. Each RDD is logically partitioned across many servers, so that its partitions can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. lost partitions are recomputed automatically in the case of failure. The data can be loaded from external sources by the user, for example a JSON file, a CSV file, a text file, or a database via JDBC, and does not need to have any specific structure.
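To make the fault-tolerance idea concrete, here is a toy model in plain Python. It is not the Spark API: `ToyRDD`, `compute_partition`, and `lose_partition` are hypothetical names invented for this sketch, and it assumes the source data can always be re-read. It only illustrates how a lost partition can be rebuilt by replaying the recorded transformations.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, rebuilt from lineage."""

    def __init__(self, source: List[List[int]],
                 lineage: Optional[List[Callable[[int], int]]] = None) -> None:
        self._source = source                     # source partitions (assumed re-readable)
        self._lineage = lineage or []             # recorded transformations, in order
        self._cache: Dict[int, List[int]] = {}    # partition index -> computed values

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never mutates; it returns a new ToyRDD with a longer lineage.
        return ToyRDD(self._source, self._lineage + [fn])

    def compute_partition(self, index: int) -> List[int]:
        # Compute one partition by replaying the lineage over its source data.
        if index not in self._cache:
            values = self._source[index]
            for fn in self._lineage:
                values = [fn(v) for v in values]
            self._cache[index] = values
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a node failure that drops a computed partition.
        self._cache.pop(index, None)


base = ToyRDD([[1, 2], [3, 4]])
transformed = base.map(lambda x: x * 2).map(lambda x: x + 1)

before = transformed.compute_partition(1)
transformed.lose_partition(1)
after = transformed.compute_partition(1)   # rebuilt from lineage, same values
```

Only the lost partition is recomputed here; partition 0 is never touched, which is the point of partition-level recovery.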

nitinrawat895 · Answer 2 · Aug 3, 2018

A data frame is a table, or a two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
In general, it is recommended to use a DataFrame where possible due to the built-in query optimization.

shams · Answer 3 · Aug 28, 2018

Comparison between Spark RDD and DataFrame

1. Release
RDD- The RDD API has been part of Spark since its earliest releases and was the core API in Spark 1.0.

DataFrame- The Spark 1.3 release introduced the DataFrame API.

2. Data Formats
RDD- Through RDDs, we can process structured as well as unstructured data. However, an RDD cannot infer a schema on its own; the user needs to specify the schema of the ingested data.

DataFrame- In a DataFrame, data is organized into named columns. Through DataFrames, we can process structured and semi-structured data efficiently; unstructured data has to be converted to a structured form first. A DataFrame also allows Spark to manage the schema.

3. Data Representations
RDD- It is a distributed collection of data elements spread across many machines in the cluster. The elements are a set of Scala or Java objects representing the data.

DataFrame- As discussed above, in a DataFrame data is organized into named columns. Conceptually, it is the same as a table in a relational database.

4. Compile-Time Type Safety
RDD- The RDD API supports an object-oriented programming style with compile-time type safety.

DataFrame- If we try to access a column that is not present in the table, the error surfaces only at runtime (for example, as an AnalysisException in Scala or Java). A DataFrame does not provide compile-time type safety in such a case.

5. Immutability and Interoperability
RDD- RDDs are immutable by nature, which means we cannot change an RDD once it is created. We can only create a new one by applying a transformation to an existing RDD. Due to immutability, all the computations performed are consistent. If an RDD is in tabular format, we can move from the RDD to a DataFrame with the toDF() method. We can also do the reverse with the .rdd method.

DataFrame- One cannot regenerate a domain object after transforming it into a DataFrame. For example, if we create a test DataFrame from an RDD of a test class, we cannot recover the original RDD of that class again; the .rdd method returns an RDD of generic Row objects.

6. Data Sources API
RDD- An RDD can come from any data source, e.g. text files, a database via JDBC, etc. It can also easily handle data with no predefined structure.

DataFrame- The Data Sources API allows processing data in different formats, such as Avro, CSV, and JSON, and from storage systems such as HDFS, Hive tables, and MySQL.

7. Optimization

RDD- RDDs have no built-in optimization engine. Developers have to optimize each RDD by hand, based on its attributes.

DataFrame- Optimization takes place through the Catalyst optimizer. DataFrames use Catalyst's tree transformation framework in 4 phases:

Analysis
Logical plan optimization
Physical planning
Code generation, to compile parts of the query to Java bytecode.
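The "tree transformation" idea can be sketched with a toy in plain Python. This is not Catalyst's actual Scala API: the `Literal`, `Column`, and `Add` node types and the `fold_constants` rule are hypothetical names made up for illustration. The sketch shows one kind of rule an optimizer can apply to an expression tree, constant folding, which rewrites `x + (1 + 2)` to `x + 3`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Column, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with one Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and columns are leaves; there is nothing to rewrite.
    return expr


# x + (1 + 2) is rewritten to x + 3
plan = Add(Column("x"), Add(Literal(1), Literal(2)))
optimized = fold_constants(plan)
```

An optimizer of this kind works only because the query is available as a tree of known node types. An RDD holds arbitrary functions, which is why it cannot be rewritten this way.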

8. Serialization
RDD- By default, Spark uses Java serialization whenever it needs to distribute data over the cluster. Serializing individual Scala and Java objects is expensive, and it requires sending both the data and its structure between nodes.

DataFrame- A DataFrame can serialize data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, because Spark understands the schema. There is no need to use Java serialization to encode the data.
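As a rough analogy in plain Python (not Spark code), the difference between generic object serialization and a schema-aware binary layout can be sketched as follows. Here `pickle` stands in for Java serialization and `struct` for a compact binary row format; the two-field schema (`id`, `score`) is an assumption made up for the example.

```python
import pickle
import struct

records = [{"id": i, "score": i * 0.5} for i in range(1000)]

# Generic object serialization: every record carries its own structure.
generic_bytes = pickle.dumps(records)

# Schema-aware encoding: the layout (int32 id, float64 score) is declared once,
# so only the raw values are written, 12 bytes per row.
row_format = struct.Struct("<id")
packed_bytes = b"".join(row_format.pack(r["id"], r["score"]) for r in records)

# With a known schema, a row can be read at a fixed offset
# without rebuilding all the Python objects first.
row_42_id, row_42_score = row_format.unpack_from(packed_bytes, 42 * row_format.size)
```

The packed form is smaller and can be read in place, which mirrors why knowing the schema lets Spark operate on serialized data.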

9. Efficiency/Memory use
RDD- Efficiency decreases when serialization is performed individually on Java and Scala objects. It also takes a lot of time.

DataFrame- Using off-heap memory for serialization reduces the overhead. Spark also generates bytecode, so that many operations can be performed directly on the serialized data. There is no need to deserialize the data for small operations.

10. Lazy Evaluation
RDD- Spark evaluates RDDs lazily; it does not compute their results right away. Instead, Spark remembers the transformations applied to the base data set. Computation runs only when an action requires a result to be returned to the driver program.

DataFrame- Similarly, Spark evaluates DataFrames lazily: computation happens only when an action is called.
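Lazy evaluation can be illustrated with a small toy in plain Python. This is not the Spark API: `LazyDataset` and its methods are hypothetical names chosen to resemble Spark's `map`, `filter`, and `collect`. Transformations are only recorded; the work happens when the action `collect()` is called.

```python
class LazyDataset:
    """Toy pipeline: transformations are recorded, an action triggers the work."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step, compute nothing.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step, compute nothing.
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run all recorded steps in order and return the result.
        result = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []


def square(x):
    calls.append(x)   # side effect lets us observe when work actually happens
    return x * x


pipeline = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has run yet
result = pipeline.collect()        # the action runs the recorded steps
```

Because the whole chain of steps is known before anything runs, an engine has the chance to plan the work as a whole instead of executing each step in isolation.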

11. Language Support
RDD- The RDD API is available in Java, Scala, and Python, which gives developers flexibility. SparkR does not expose RDDs as a public API.

DataFrame- The DataFrame API is available in 4 languages: Java, Scala, Python, and R.

12. Schema Projection
RDD- The RDD API does not carry a schema, so any schema projection is explicit: the user needs to define the structure of the data manually.

DataFrame- With a DataFrame, there is often no need to specify a schema by hand. For many data sources, it discovers the schema automatically.

13. Aggregation
RDD- The RDD API is slower for simple grouping and aggregation operations.

DataFrame- DataFrames are faster for exploratory analysis and for computing aggregated statistics on data.

14. Usage
RDD- We use RDDs when we want low-level transformations and actions, or when the data is unstructured, such as media streams or streams of text.

DataFrame- We use DataFrames when we need a high level of abstraction over structured or semi-structured data.

I hope this will help!!

Bro, Spark DataFrame is applicable only to structured and semi-structured but not un-structured data. Please correct it. — Nov 18, 2018
Yes @Sri, you're right. But reading that answer again, I think he meant to say that DataFrames can be used to process unstructured data after converting it to structured data. — Nov 20, 2018