Is it better to have one large Parquet file or lots of smaller Parquet files?

Question

I know that HDFS splits files into 64 MB blocks. We have streaming data coming in, and we can store it in large files or medium-sized files.
So my question is: what is the optimum size for columnar file storage?
Would smaller files save any computation time over having, say, 1 GB files?

nitinrawat895 · Answer 1 · May 23, 2018

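Ideally, you would use Snappy compression (the default), since Snappy-compressed Parquet files are splittable. Parquet compresses the pages inside each row group rather than the file as a whole, so a reader can still split the file at row-group boundaries.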

Using Snappy instead of gzip will significantly increase the file size, so consider that if storage space is an issue.

To override the default Snappy compression, use .option("compression", "gzip").

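If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>) or, in the worst case, .repartition(<num_partitions>). coalesce merges existing partitions without a full shuffle, so it can only reduce the partition count; repartition always shuffles the data.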
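As a rough sketch of how the compression option and the two resizing calls above fit together in PySpark: the helper name write_parquet and its parameters are hypothetical, and it assumes a non-partitioned write, where Spark typically produces one output file per partition.

```python
def write_parquet(df, path, num_files, codec="snappy"):
    """Resize a Spark DataFrame to num_files partitions, then write Parquet.

    write_parquet is a hypothetical helper for illustration, not part of Spark.
    df is expected to be a pyspark.sql.DataFrame.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")

    current = df.rdd.getNumPartitions()
    if num_files < current:
        # Fewer partitions wanted: coalesce avoids a full shuffle.
        resized = df.coalesce(num_files)
    elif num_files > current:
        # More partitions wanted: only repartition (full shuffle) can do that.
        resized = df.repartition(num_files)
    else:
        resized = df

    # "snappy" is the default codec; pass codec="gzip" to trade speed for size.
    resized.write.option("compression", codec).mode("overwrite").parquet(path)
    return resized


# Example usage (assumes an active SparkSession named spark; paths are made up):
#   df = spark.read.json("/data/incoming")
#   write_parquet(df, "/data/parquet/events", num_files=8, codec="gzip")
```

Choosing coalesce when shrinking keeps the cheap path as the default and reserves the shuffle for cases that actually need more partitions.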

Also, Parquet files, and for that matter files in general, should be larger than the HDFS block size (128 MB by default; 64 MB was the default in older Hadoop versions).
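To turn that sizing guidance into a partition count, one option is a small calculation like the one below. This is only a sketch: the function name target_file_count is made up, and the 1 GB target is an illustrative assumption borrowed from the question, not a recommendation from Spark or HDFS.

```python
HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # default HDFS block size mentioned above


def target_file_count(total_bytes, target_file_bytes=1024 ** 3,
                      block_bytes=HDFS_BLOCK_BYTES):
    """Estimate how many output files to write for a dataset of total_bytes.

    Hypothetical helper: each file ends up at least as large as the target
    size, and the target itself is never allowed to drop below one block.
    """
    if total_bytes <= 0:
        return 1
    # Never aim for files smaller than a single HDFS block.
    file_bytes = max(target_file_bytes, block_bytes)
    # Floor division rounds the count down, so files are not undersized.
    return max(1, total_bytes // file_bytes)


# Example: a 10 GiB dataset with the assumed 1 GiB target.
n = target_file_count(10 * 1024 ** 3)
print(n)
```

The result can be passed as <num_partitions> to .coalesce or .repartition. Note that the sizes here refer to the written, compressed output, which is usually smaller than the in-memory data, so the estimate is approximate.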

Refer to the following links to learn more:

https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html

http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

Hope this will help you!