Optimal column count for ORC and Parquet


I am using PySpark to read the files and have a query regarding the maximum number of columns that can be handled:

  • What is the optimal column count for ORC and Parquet?
  • If a file has 3,000+ columns and 10 lakh+ (1 million+) records, which of the two (ORC or Parquet) is more appropriate?
May 8, 2020 in Apache Spark by Amey


Hi @Amey,

It depends on your use case; both Parquet and ORC have their own advantages and disadvantages. If you are using Spark, Parquet is generally the better choice because Spark ships with a vectorized Parquet reader but no vectorized ORC reader. It also depends on how nested your data is and how many columns it has: both formats store data in a hierarchical, tree-like structure, so the more nesting, the deeper the tree.

ORC, by contrast, is designed for a flattened file store. So if your data is flat with fewer columns, you can go with ORC; otherwise, Parquet would be fine for you. Compression on flattened data works remarkably well in ORC.

Hope this will give you some idea.

answered May 8, 2020 by MD
