How does compression work in Hadoop?


In my MapReduce job, suppose I set the compression for either the map or reduce output to LZO. How does the data get compressed? Is the entire output of the map or reduce task first produced uncompressed and then compressed, or is it compressed incrementally as it is written? If it is compressed incrementally, how is that done?

Jul 26, 2018 in Big Data Hadoop by Neha

1 answer to this question.


It basically depends on the file type you use. If it is a plain text file, compression happens at the whole-file level.

If it is a SequenceFile, compression can happen at the record level or at the block level.

Note that here "block" means the in-memory buffer the SequenceFile uses, not an HDFS block.
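For example, here is a minimal sketch of writing a block-compressed SequenceFile with the Hadoop 2.x API; the output path and the choice of DefaultCodec (zlib) are placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedSeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq"); // hypothetical output path
        // CompressionType.RECORD compresses each value on its own;
        // CompressionType.BLOCK buffers many records and compresses them together.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK,
                                                new DefaultCodec()))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }
    }
}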

With block compression, multiple records are compressed into a block at once. Records are added to a block until it reaches a minimum size in bytes. The maximum amount of input data that can be compressed at a time is the buffer size minus the maximum overhead of the compression algorithm. The default buffer size is 512 bytes, and for the zlib algorithm the compression overhead is 18 bytes (1% of the buffer size plus 12 bytes), so at most 512 - 18 = 494 bytes of input are compressed per call.

A BlockCompressorStream is then created with the given output stream and compressor, and the compressed data is written through it.
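To make that concrete, the snippet below is a rough illustration (not the exact SequenceFile internals) of driving a BlockCompressorStream directly with a zlib-backed compressor; the output file name is made up, and the 512/18 sizing simply mirrors the defaults mentioned above:

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BlockCompressorStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressorStreamDemo {
    public static void main(String[] args) throws Exception {
        DefaultCodec codec = new DefaultCodec(); // zlib-backed codec
        codec.setConf(new Configuration());
        Compressor compressor = codec.createCompressor();
        // bufferSize = 512 bytes, compressionOverhead = 18 bytes (1% + 12),
        // so each compressed block holds at most 494 bytes of input.
        // The output uses Hadoop's internal length-prefixed block format.
        try (BlockCompressorStream out = new BlockCompressorStream(
                new FileOutputStream("blocks.bin"), compressor, 512, 18)) {
            out.write("record one".getBytes("UTF-8"));
            out.write("record two".getBytes("UTF-8"));
            out.finish(); // compress and flush any data still buffered
        }
    }
}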

If you enable compression for the map stage (mapreduce.map.output.compress=true), the intermediate map output is compressed with whatever codec you have specified (mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.*) and written to disk when each map task completes (or earlier, if the map task exceeds its serialization buffer limit and begins to spill to disk). The compressed data is then read from disk and sent to the appropriate nodes during the Shuffle & Sort stage of your MapReduce job.
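As a sketch, enabling this from a driver might look like the following; the LzoCodec class name assumes the separately installed hadoop-lzo library (the Snappy and default zlib codecs ship with Hadoop itself):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it is spilled to disk.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Codec to use; assumes hadoop-lzo is on the classpath.
        conf.set("mapreduce.map.output.compress.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        Job job = Job.getInstance(conf, "compressed-map-output");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}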

I hope this answer helps :)

answered Jul 26, 2018 by Frankie
