Multiple Output format in Hadoop

Question

I'm new to Hadoop.

I am trying out the WordCount program in Hadoop.

Now, to try out multiple output files, I use MultipleOutputs. This link helped me in doing it.

I have the following code in my driver class.

    MultipleOutputs.addNamedOutput(conf, "even",
            org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
            IntWritable.class);

    MultipleOutputs.addNamedOutput(conf, "odd",
            org.apache.hadoop.mapred.TextOutputFormat.class, Text.class,
            IntWritable.class);

The Reducer class is as follows:

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        MultipleOutputs mos = null;

        @Override
        public void configure(JobConf job) {
            mos = new MultipleOutputs(job);
        }

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            if (sum % 2 == 0) {
                mos.getCollector("even", reporter).collect(key, new IntWritable(sum));
            } else {
                mos.getCollector("odd", reporter).collect(key, new IntWritable(sum));
            }
            //output.collect(key, new IntWritable(sum));
        }

        @Override
        public void close() throws IOException {
            mos.close();
        }
    }

It worked, but I get a lot of files (one odd and one even file for every reduce task).

My question is: how can I have just two output files, so that the odd output of every reduce task gets written into a single odd file, and likewise for even?

ravikiran · Answer 1 · Jul 26, 2019

Each reducer uses an OutputFormat to write records. So you are getting a set of odd and even files per reducer. This is by design so that each reducer can perform writes in parallel.

If you want just a single odd and single even file, you'll need to set mapred.reduce.tasks to 1. But performance will suffer because all the mappers will be feeding into a single reducer.
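With the old `mapred` API used in the question, one way to set this is in the driver. This is only a fragment, and it assumes `conf` is the same `JobConf` the named outputs were registered on:

```java
// Route all map output through a single reduce task, so that only one
// "even" file and one "odd" file are written.
conf.setNumReduceTasks(1);
```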

Another option is to change the process that reads these files so that it accepts multiple input files, or to write a separate process that merges these files together.
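A separate merge step could look roughly like the sketch below. It is not Hadoop code: it assumes the job output has already been copied to a local directory, and that the part files are named `<name>-r-<partition>` (for example `odd-r-00000`), which you should check against your actual output. The class name `PartMerger` and its `merge` method are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: concatenates the per-reducer part files of one
// named output into a single local file.
final class PartMerger {

    // Merges every regular file in dir matching "<name>-r-*" (assumed naming
    // pattern) into target, in file-name order. Returns the number of files merged.
    static int merge(Path dir, String name, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(dir, name + "-r-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        // Sort so the merged file follows partition order (00000, 00001, ...).
        Collections.sort(parts);

        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the target
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PartMerger <local-output-dir>
        Path dir = Paths.get(args[0]);
        System.out.println("odd parts merged: " + merge(dir, "odd", dir.resolve("odd.txt")));
        System.out.println("even parts merged: " + merge(dir, "even", dir.resolve("even.txt")));
    }
}
```

This keeps the job itself parallel and pays the merge cost once, after the reducers finish. If the output still lives on HDFS, `hadoop fs -getmerge` serves a similar purpose without any custom code.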