Optimal column count for ORC and Parquet

+1 vote

I am using PySpark to read the files and have a query regarding the maximum number of columns that can be handled:

  • What is the optimal column count for ORC and Parquet?
  • If we have 3000+ columns and 10 lakh+ (1 million+) records in a file, which of the two (ORC or Parquet) is more appropriate?
May 8, 2020 in Apache Spark by Amey

1 answer to this question.

+1 vote

Hi@Amey,

It depends on your use case. Both Parquet and ORC have their own advantages and disadvantages. If you are on Spark, Parquet is usually the better choice because Spark ships a mature vectorized Parquet reader (a native vectorized ORC reader only arrived later, in Spark 2.3+). It also depends on how nested your data is and how many columns there are: both formats store nested data in a hierarchical, tree-like structure, and the more nesting, the deeper the tree.

ORC, on the other hand, is designed around a flattened file layout. So if your data is flat with fewer columns, you can go with ORC; otherwise, Parquet will serve you better. Compression on flattened data works remarkably well in ORC.
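The rule of thumb above can be sketched as a toy helper — the thresholds here are illustrative assumptions, not Spark or file-format defaults:

```python
def suggest_format(num_columns: int, max_nesting_depth: int) -> str:
    """Pick between ORC and Parquet per the heuristic above:
    ORC favours flat data with fewer columns; Parquet handles wide
    and deeply nested schemas better. Thresholds are illustrative."""
    if max_nesting_depth == 0 and num_columns <= 1000:
        return "orc"
    return "parquet"

# The flat, 3000+ column table from the question:
print(suggest_format(3000, 0))  # parquet
# A small flat table:
print(suggest_format(200, 0))   # orc
```

For the asker's case (3000+ columns, flat), the heuristic points at Parquet; in practice, benchmarking both formats on a sample of the real data is the only reliable tiebreaker.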

Hope this will give you some idea.

answered May 8, 2020 by MD
