Is it better to have one large Parquet file or lots of smaller Parquet files?

I know that HDFS will split files into 64 MB chunks. We have streaming data coming in, and we can store it in either large files or medium-sized files.
So my question is: what is the optimum file size for columnar storage?
Would smaller files save any computation time over having, say, 1 GB files?

May 23, 2018 in Apache Spark by Shubham
• 13,490 points • 14,178 views

1 answer to this question.

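Ideally, you would use snappy compression (the default), since it is fast to compress and decompress. Parquet compresses the data inside each file rather than the file as a whole, so Parquet files stay splittable with either snappy or gzip.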

Snappy-compressed files are noticeably larger than gzip-compressed ones, so if storage space is an issue, that trade-off needs to be considered.

To override the default snappy compression, set .option("compression", "gzip") on the writer.

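If you need to change the number of partitions of your Dataset/DataFrame/RDD (and with it the number of output files), call .coalesce(<num_partitions>) or, in the worst case, .repartition(<num_partitions>). Prefer coalesce when reducing the partition count, because it avoids the full shuffle that repartition triggers.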
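
As a rough sketch of that choice, the helper below picks between the two calls. The name `resize_partitions` is hypothetical and not part of Spark; the code relies only on the PySpark DataFrame methods `rdd.getNumPartitions()`, `coalesce()`, and `repartition()`.

```python
def resize_partitions(df, num_partitions):
    """Return df with num_partitions partitions, avoiding a shuffle when possible.

    Hypothetical helper for illustration; df is expected to be a PySpark DataFrame.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    current = df.rdd.getNumPartitions()
    if num_partitions < current:
        # coalesce merges existing partitions without a full shuffle
        return df.coalesce(num_partitions)
    if num_partitions > current:
        # coalesce cannot increase the partition count, so shuffle with repartition
        return df.repartition(num_partitions)
    return df  # already at the requested partition count
```

A call such as `resize_partitions(df, 8).write.option("compression", "snappy").parquet(out_path)` would then write up to 8 part files, where `df` and `out_path` are placeholders for your own DataFrame and output location.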

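Also, Parquet files, and for that matter files on HDFS in general, should be at least as large as the HDFS block size (default 128 MB). Many small files add NameNode metadata and task-scheduling overhead without making reads faster.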
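
One way to turn that guideline into a partition count is to divide the estimated output size by a target file size. The sketch below is only an illustration under stated assumptions: `target_file_count` and its parameters are hypothetical names, the 128 MB default mirrors the block size mentioned in this answer, data is assumed to be spread evenly across files, and you have to estimate the total output size yourself (compressed Parquet output is usually smaller than the raw input).

```python
BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # HDFS default block size cited in this answer


def target_file_count(total_bytes, target_file_bytes=BLOCK_SIZE_BYTES):
    """Return how many files to write so each is at least target_file_bytes.

    Hypothetical helper: assumes data is spread evenly across the files.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Floor division rounds the count down, which keeps every file at or
    # above the target size; always write at least one file.
    return max(1, total_bytes // target_file_bytes)


# About 1 GiB of output with the default 128 MiB target
print(target_file_count(1024 ** 3))  # → 8
```

The result can be passed as the `<num_partitions>` argument to `.coalesce()` or `.repartition()` before writing.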

Hope this will help you!

answered May 23, 2018 by nitinrawat895
• 11,380 points
