What is the best way to merge multi-part HDFS files into a single file?

Question

I am using Spark 2.2.1. My application code creates several 0-byte or very small part files like the ones below.

part-04498-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
part-04499-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet

All of these files are either 0-byte files with no actual data or very small files.
1. What is the best way to merge all of these files into a single HDFS file?
2. If all of these are 0-byte files, I want to get rid of them. Can I achieve that with some setting in Spark?

I tried the below but the multi-part files were still there.

sc.textFile("hdfs://nameservice1/data/refined/lzcimp/mbr/part*").coalesce(1).saveAsTextFile("hdfs://nameservice1/data/refined/lzcimp/mbr/final.snappy.parquet")

Gitika · Answer 1 · Jul 29, 2019

1. In order to merge two or more files into one single file and store it in hdfs, you need to have a folder in the hdfs path containing the files that you want to merge.

Here, I have a folder named merge_files that contains the files I want to merge.

Then you can execute the following command to merge the files and store the result in HDFS:

hadoop fs -cat /user/edureka_425640/merge_files/* | hadoop fs -put - /user/edureka_425640/merged_files

The merged_files output does not need to be created manually. The command above creates it automatically as a single file that stores the merged result. You can view the output using the following command:

hadoop fs -cat merged_files
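
The same pipeline can be tried on the local filesystem before running it against HDFS. The sketch below is only an illustration: plain `cat -` stands in for `hadoop fs -put -` (both read their input from stdin when given `-`), and the function name `merge_parts`, the temporary directory, and the part file contents are all hypothetical. Note that byte-level concatenation like this suits line-oriented text files; Parquet files such as the ones in the question are unlikely to form a readable Parquet file when joined this way.

```shell
#!/bin/sh
# Local stand-in for: hadoop fs -cat DIR/* | hadoop fs -put - TARGET
# `cat -` plays the role of `hadoop fs -put -`: both take their input from stdin.
merge_parts() {
    src_dir=$1
    target=$2
    cat "$src_dir"/* | cat - > "$target"
}

# Hypothetical sample data: two small parts and one zero-byte part.
work=$(mktemp -d)
mkdir "$work/merge_files"
printf 'a,1\n' > "$work/merge_files/part-00000"
: > "$work/merge_files/part-00001"
printf 'b,2\n' > "$work/merge_files/part-00002"

# Write the merged file outside the source folder so the glob never picks it up.
merge_parts "$work/merge_files" "$work/merged_files"
cat "$work/merged_files"
```

The zero-byte part contributes nothing to the output, so empty part files do not need to be removed before merging.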

2. Suppose we have a folder with some empty files and some non-empty files, and we want to delete the empty ones. We can use the command below:

hdfs dfs -rm $(hdfs dfs -ls -R /user/A/ | grep -v "^d" | awk '{if ($5 == 0) print $8}')

Here I have a folder, temp_folder, with three files: two are empty and one is non-empty.
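
To see which paths the filter in that command would select before anything is deleted, the `grep`/`awk` part can be run on its own. The following is a minimal sketch: the function names and the sample listing are hypothetical, and it assumes the column layout that the command above relies on (size in column 5, path in column 8). Paths containing spaces would be cut off, because `awk` splits fields on whitespace.

```shell
#!/bin/sh
# Print the paths of zero-byte regular files, given `hdfs dfs -ls -R`-style
# output on stdin. Directory lines (starting with "d") are dropped first,
# because directories are also listed with size 0.
list_empty_files() {
    grep -v '^d' | awk '$5 == 0 { print $8 }'
}

# Hypothetical listing: perms, replication, owner, group, size, date, time, path.
sample_listing() {
cat <<'EOF'
drwxr-xr-x   - hdfs hdfs          0 2019-07-29 10:00 /user/A/temp_folder
-rw-r--r--   3 hdfs hdfs          0 2019-07-29 10:01 /user/A/temp_folder/empty1
-rw-r--r--   3 hdfs hdfs          0 2019-07-29 10:02 /user/A/temp_folder/empty2
-rw-r--r--   3 hdfs hdfs        128 2019-07-29 10:03 /user/A/temp_folder/data.txt
EOF
}

sample_listing | list_empty_files
```

On a real cluster, piping `hdfs dfs -ls -R /user/A/` into `list_empty_files` gives a preview of what the `-rm` command would remove.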

answered Jul 29, 2019 by Tina

ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE

commented Oct 17, 2019 by Pedro Donis

Hello @Pedros,

What does this code do?

commented Oct 30, 2019 by Roman

Hello, can I ask how I can add 100+ text files into your merge_files folder?

commented Apr 18, 2020 by anonymous

Hey,

This is technically what cat ("concatenate") is supposed to do, even though most people just use it to output files to stdout. If you give it multiple filenames, it will output them all sequentially, and you can then redirect that into a new file. To include all files, just use * (or /path/to/directory/* if you're not already in the directory) and your shell will expand it to all the filenames:

$ cat * > merged-file

commented Apr 21, 2020 by Gitika

score 0 · Answer 2 · Oct 1, 2020

I figured out a way using hadoop fs commands - hadoop fs -cat [dir]/* | hadoop fs ... If you are working in Hortonworks cluster and want to merge multiple file ... filename) This will merge all part files into one and save it again into hdfs location ... of weeks ago, so if you have a better design or solution, feel free to let me.