How compression works in Hadoop


In my MapReduce job, suppose I specify LZO compression for the map or reduce output. How does the data get compressed? Is the entire output of the map or reduce task first produced uncompressed and then compressed as a whole, or is it compressed incrementally as it is written? If it is compressed incrementally, how is that done?

Jul 27, 2018 in Big Data Hadoop by Neha

1 answer to this question.


It basically depends on the file format you use. If it is a plain text file, compression happens at the whole-file level.

But if it is a SequenceFile, compression can happen at the record level or at the block level.

Note that "block" here means the SequenceFile's internal buffer, not the HDFS block.
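The granularity for SequenceFile compression is controlled by the `io.seqfile.compression.type` property, which can be set per job or cluster-wide. A minimal sketch (the `BLOCK` value shown is an example choice, not something the answer above prescribes):

```xml
<!-- mapred-site.xml (or set per job): SequenceFile compression granularity -->
<property>
  <name>io.seqfile.compression.type</name>
  <!-- NONE, RECORD, or BLOCK; BLOCK compresses batches of records together -->
  <value>BLOCK</value>
</property>
```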

With block compression, multiple records are compressed into a block at once. Records are added to a block until it reaches a minimum size in bytes. The maximum amount of input data compressed at a time is calculated by subtracting the compression algorithm's maximum overhead from the buffer size. The default buffer size is 512 bytes, and for the zlib algorithm the overhead is 18 bytes (1% of the buffer size, rounded up, plus 12 bytes).
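The buffer-budget arithmetic above can be sketched as follows. This is not Hadoop source code, just the calculation the answer describes, using the quoted numbers (512-byte default buffer, zlib overhead of 1% plus 12 bytes):

```java
// Sketch of the block-compression budget described above: a compressed block
// is flushed once the buffered input would exceed bufferSize minus the
// codec's worst-case overhead.
public class BlockBudget {
    // Worst-case zlib overhead per the answer: 1% of the buffer size,
    // rounded up to a whole byte, plus 12 bytes.
    static int zlibOverhead(int bufferSize) {
        return (int) Math.ceil(bufferSize * 0.01) + 12;
    }

    // Maximum bytes of raw input compressed into one block.
    static int maxInputSize(int bufferSize) {
        return bufferSize - zlibOverhead(bufferSize);
    }

    public static void main(String[] args) {
        int bufferSize = 512; // default buffer size quoted in the answer
        System.out.println("overhead  = " + zlibOverhead(bufferSize));  // 18
        System.out.println("max input = " + maxInputSize(bufferSize));  // 494
    }
}
```

So with the defaults, at most 494 bytes of raw record data go into each compressed block.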

A BlockCompressorStream is then created with the given output stream and compressor, and the compressed data is written through it.

If you specify compression for the map stage (mapreduce.map.output.compress=true), the intermediate map output will be compressed using whatever codec you've specified (mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.*) and saved to disk at the end of each map task (or earlier, if the map task exceeds its serialization buffer limit and begins to spill to disk). The compressed data is then read from disk and sent to the appropriate nodes during the Shuffle & Sort stage of your MapReduce job.
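Putting the two properties from the answer together, a minimal per-job configuration might look like this. The codec class name shown is an assumption: the answer doesn't name one, and for LZO it is typically com.hadoop.compression.lzo.LzoCodec from the separately installed hadoop-lzo package:

```xml
<!-- Enable compression of intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Codec class; requires the hadoop-lzo package on the cluster -->
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```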

I hope this answer helps :)

answered Jul 27, 2018 by Frankie
