Optimal column count for ORC and Parquet

+1 vote

I am using PySpark to read the files and have a query regarding the maximum number of columns that can be handled:

  • What is the optimal column count for ORC and Parquet?
  • If we have 3000+ columns and 10 lakh+ (1 million+) records in a file, which of the two (ORC or Parquet) is more appropriate?
May 7 in Apache Spark by Amey
• 210 points
196 views

1 answer to this question.

+1 vote

Hi @Amey,

It depends on your use case. Both Parquet and ORC have their own advantages and disadvantages. If you are working in Spark, Parquet is better because Spark has a vectorized Parquet reader but no vectorized ORC reader. It also depends on how nested your data is and how many columns there are: both formats use a hierarchical, tree-like structure to store data, and the more nesting, the deeper the tree.

ORC, on the other hand, is designed for a flat file store. So if your data is flat with fewer columns, you can go with ORC; otherwise, Parquet would be fine for you. Compression on flat data works very well in ORC.

Hope this will give you some idea.

answered May 7 by MD
• 56,480 points
