How does compression work in Hadoop?

0 votes

In my MapReduce job, suppose I set the compression for the map or reduce output to LZO. How does the data get compressed? Is the entire output of the map or reduce task first produced uncompressed and then compressed in one pass, or is it compressed incrementally as it is written? If it is compressed incrementally, how is that done?

Jul 26, 2018 in Big Data Hadoop by Neha
• 6,180 points
85 views

1 answer to this question.

0 votes

It basically depends on the file format you use. If it is a plain text file, compression happens at the whole-file level.

But if it is a SequenceFile, compression can happen at the record level or at the block level.

Note that "block" here means a buffer used by the SequenceFile, not an HDFS block.
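As a minimal sketch of choosing block-level compression when writing a SequenceFile (the output path, key/value types, and the DefaultCodec choice here are just illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // CompressionType.BLOCK groups many records into one compressed buffer;
        // CompressionType.RECORD would compress each value individually instead.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/demo.seq")),  // hypothetical path
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            for (long i = 0; i < 1000; i++) {
                writer.append(new LongWritable(i), new Text("record-" + i));
            }
        }
    }
}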

With block compression, multiple records are compressed into a block at once. Records are added to the block until it reaches a minimum size in bytes. The maximum amount of input data that can be compressed at a time is the buffer size minus the maximum overhead of the compression algorithm. The default buffer size is 512 bytes, and the compression overhead for the zlib algorithm is 18 bytes (1% of the buffer size plus 12 bytes).

Then a BlockCompressorStream is created with the given output stream and compressor, and the compressed data is written through it. This is why compression is incremental: data is buffered and compressed block by block as it arrives, rather than all at once at the end.
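Here is a rough sketch of that incremental path using BlockCompressorStream directly (the buffer size and overhead values come from the description above; the output path is hypothetical):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BlockCompressorStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressDemo {
    public static void main(String[] args) throws Exception {
        DefaultCodec codec = new DefaultCodec();  // zlib-backed codec
        codec.setConf(new Configuration());
        Compressor compressor = codec.createCompressor();

        int bufferSize = 512;  // default buffer size mentioned above
        int overhead = 18;     // ~1% of bufferSize + 12 bytes for zlib

        // Writes are buffered and compressed block by block: once roughly
        // (bufferSize - overhead) bytes accumulate, the block is compressed
        // and flushed to the underlying stream.
        try (OutputStream out = new BlockCompressorStream(
                new FileOutputStream("/tmp/demo.deflate"),  // hypothetical path
                compressor, bufferSize, overhead)) {
            for (int i = 0; i < 100; i++) {
                out.write(("record-" + i + "\n").getBytes("UTF-8"));
            }
        }  // close() compresses and writes the final partial block
    }
}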

If you specify compression for the map stage (mapreduce.map.output.compress=true), the intermediate map output is compressed using whatever codec you've specified (mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.*) and written to disk when each map task completes (or earlier, if the map task exceeds its serialization buffer limit and begins to spill to disk). The compressed data is then read from disk and sent to the appropriate nodes during the Shuffle & Sort stage of your MapReduce job.
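In driver code that might look like the following sketch. The LzoCodec class comes from the third-party hadoop-lzo library, so this assumes it is installed on the cluster; otherwise substitute a built-in codec such as org.apache.hadoop.io.compress.SnappyCodec:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;

public class LzoMapOutputDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output (spills and final map output files).
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Codec for the intermediate data; assumes hadoop-lzo is on the classpath.
        conf.setClass("mapreduce.map.output.compress.codec",
                      com.hadoop.compression.lzo.LzoCodec.class,
                      CompressionCodec.class);
        Job job = Job.getInstance(conf, "lzo-map-output-demo");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
    }
}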

I hope this answer helps :)

answered Jul 26, 2018 by Frankie
• 9,710 points
