What is the best way to merge multi-part HDFS files into a single file?


I am using Spark 2.2.1. My application code creates several 0-byte or very small part files like the ones below.

part-04498-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
part-04499-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet

All of these files are either 0-byte files with no actual data or very small files.
1. What is the best way to merge all of these files into a single HDFS file?
2. If any of these are 0-byte files, I want to get rid of them. Can I achieve this with some setting in Spark?

I tried the below, but the multi-part files were still there.

sc.textFile("hdfs://nameservice1/data/refined/lzcimp/mbr/part*").coalesce(1).saveAsTextFile("hdfs://nameservice1/data/refined/lzcimp/mbr/final.snappy.parquet")
Jul 29 in Big Data Hadoop by Karan

1 answer to this question.


1. To merge two or more files into a single file and store it in HDFS, you need a folder in the HDFS path containing the files that you want to merge.

Here, I have a folder named merge_files which contains the files that I want to merge:

[Screenshot: contents of the merge_files folder]

Then you can execute the following command to merge the files and store the result in HDFS:

hadoop fs -cat /user/edureka_425640/merge_files/* | hadoop fs -put - /user/edureka_425640/merged_files

The merged_files output need not be created manually; hadoop fs -put creates it automatically when you run the above command. You can view the result with the following command (here merged_files holds the merged output):

hadoop fs -cat merged_files
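The pipe above simply concatenates the part files in listing order and writes the stream back to HDFS. As a minimal local sketch of the same idea (the directory and file names here are placeholders, not taken from the question):

```python
import glob
import os


def merge_part_files(src_dir: str, dest_path: str) -> None:
    """Concatenate every part-* file in src_dir into one file at dest_path,
    mirroring what `hadoop fs -cat ... | hadoop fs -put -` does on HDFS."""
    parts = sorted(glob.glob(os.path.join(src_dir, "part-*")))
    with open(dest_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                out.write(src.read())
```

Note that this byte-level concatenation (and the hadoop fs -cat pipe it mirrors) is only safe for plain formats like text or CSV. Parquet files, as in the question, carry per-file footers and cannot be merged by concatenation; they have to be rewritten with a Parquet-aware tool such as Spark.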

2. Suppose we have a folder with a mix of empty and non-empty files, and we want to delete only the empty ones. We can use the command below:

hdfs dfs -rm $(hdfs dfs -ls -R /user/A/ | grep -v "^d" | awk '{if ($5 == 0) print $8}')

Here I have a folder, temp_folder, with three files: two empty and one non-empty. Please refer to the screenshots below:

[Screenshots: temp_folder listing before and after deleting the empty files]
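The awk filter above keys on the size column ($5) of the recursive listing and passes only zero-size paths to hdfs dfs -rm. The same cleanup logic, sketched against a local directory in Python (the directory name is a placeholder):

```python
import os


def remove_empty_files(root: str) -> list:
    """Walk root recursively and delete every regular file whose size is 0,
    mirroring the `hdfs dfs -ls -R ... | awk '$5 == 0'` filter above.
    Returns the list of paths that were removed."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
    return removed
```

As with the shell one-liner, run it against a listing first (or log what it returns) before trusting it on a production path.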

answered Jul 29 by Tina
