Is it better to have one large parquet file or lots of smaller parquet files?

I know that HDFS splits files into 64 MB blocks. We have streaming data coming in, and we can store it either as large files or as medium-sized files.
My question is: what is the optimum file size for columnar (Parquet) storage?
Would smaller files save any computation time over having, say, 1 GB files?
May 23, 2018 in Apache Spark by Shubham

1 answer to this question.


Ideally, you would use snappy compression (the default), since snappy-compressed Parquet files remain splittable.

Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered.

.option("compression", "gzip") is the option to override the default snappy compression.

If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>) or, worst case, .repartition(<num_partitions>); a short sketch follows below.
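
A rough sketch of how that might look before writing (the DataFrame df and the target of 8 partitions are illustrative assumptions):

// Assumes an existing DataFrame `df`; 8 is an illustrative target partition count.
// coalesce() only merges existing partitions and avoids a full shuffle.
df.coalesce(8)
  .write
  .parquet("/data/events_coalesced")

// repartition() performs a full shuffle but balances partition sizes evenly.
df.repartition(8)
  .write
  .parquet("/data/events_repartitioned")

coalesce is the cheaper option because it skips the shuffle, which is why repartition is described as the worst case.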

Also, Parquet files (and files in general) should be at least as large as the HDFS block size (128 MB by default).
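
As a back-of-the-envelope sketch, you can derive a partition count from the total data size and the block size (the 4 GB input size here is an assumed example, not a measurement):

// Assumes roughly 4 GB of input data; measure your own dataset size in practice.
val totalSizeBytes  = 4L * 1024 * 1024 * 1024   // ~4 GB (assumed)
val targetFileBytes = 128L * 1024 * 1024        // HDFS default block size
val numPartitions   = math.max(1, (totalSizeBytes / targetFileBytes).toInt)  // = 32

df.repartition(numPartitions)
  .write
  .parquet("/data/events_sized")

This aims for output files of roughly one HDFS block each, which keeps the file count manageable without producing lots of small files.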

Refer to the following links to know more:

https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html 

http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

Hope this will help you!

answered May 23, 2018 by nitinrawat895
