Hadoop DistributedCache is deprecated - what is the preferred API?

+3 votes

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.

The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:

// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf); 

// In the mapper (the JobConf is passed in via configure(JobConf job))
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...

However, DistributedCache is marked as deprecated in Hadoop 2.2.0.

What is the new preferred way to achieve this? Is there an up-to-date example or tutorial covering this API?

Sep 27, 2018 in Big Data Hadoop by digger
• 26,550 points
335 views

5 answers to this question.

+1 vote

The APIs for the distributed cache can now be found in the Job class itself. Check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html. The code should look something like this:

// Note: the Job() constructor is itself deprecated; prefer Job.getInstance(conf)
Job job = new Job();
...
job.addCacheFile(new Path(filename).toUri());

In your mapper code:

Path[] localPaths = context.getLocalCacheFiles();
...
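For completeness, here is a minimal sketch of reading one of those localized files in setup(); the mapper's type parameters and the plain-text format of the cached file are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        // getLocalCacheFiles() returns paths on the node's local
        // filesystem, so plain java.io can read them directly
        Path[] localPaths = context.getLocalCacheFiles();
        if (localPaths != null && localPaths.length > 0) {
            try (BufferedReader reader = new BufferedReader(
                    new FileReader(localPaths[0].toString()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // parse each line of the cached config file here
                }
            }
        }
    }
}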
answered Sep 27, 2018 by slayer
• 29,170 points
+1 vote

The preferred way of using DistributedCache for YARN/MapReduce 2 is as follows:

In your driver, use Job.addCacheFile():

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = Job.getInstance(conf, "MyJob");

    job.setMapperClass(MyMapper.class);

    // ...

    // Mind the # sign after the absolute file location.
    // You will be using the name after the # sign as your
    // file name in your Mapper/Reducer
    job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
    job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));

    return job.waitForCompletion(true) ? 0 : 1;
}

And in your Mapper/Reducer, override the setup(Context context) method:

@Override
protected void setup(
        Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {

        File some_file = new File("./some");
        File other_file = new File("./other");

        // Do things to these two files, like read them
        // or parse as JSON or whatever.
    }
    super.setup(context);
}
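To make the "do things" part concrete, here is one hedged way to read the symlinked file inside that if block (imports: java.io.BufferedReader, java.io.FileReader). Only the "some" symlink name comes from the driver above; treating the file as line-oriented text is an assumption:

// Inside the if block above, instead of just creating the File objects:
try (BufferedReader reader = new BufferedReader(new FileReader("./some"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // parse the line (e.g. feed it to your JSON parser)
    }
}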
answered Oct 12, 2018 by Tenali
+1 vote

The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class:

   Job.addCacheFile()

Unfortunately, there aren't yet many comprehensive tutorial-style examples of this API.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29
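Until such a tutorial exists, a bare-bones driver-side sketch would look like the following (the file path and job name are placeholders, and the enclosing method must declare throws Exception because of new URI):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "cache-example");
// addCacheFile() takes a java.net.URI pointing at the file on HDFS
job.addCacheFile(new URI("/user/yourname/cache/config.txt"));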

answered Oct 12, 2018 by Jeevith
+1 vote

I did not use job.addCacheFile(). Instead, I used the -files option, e.g. "-files /path/to/myfile.txt#myfile", as before. Then in the mapper or reducer code I use the method below:

/**
 * Finds a distributed cache file by its symlink (the name after the
 * # sign). Works for both local execution and HDFS execution.
 *
 * @param context the job context, used to look up the cache file URIs
 * @param symLink the symlink name the file was registered under
 * @param throwExceptionIfNotFound whether to throw if no match is found
 * @return the resolved File, or null if not found and not throwing
 * @throws IOException if the FileSystem cannot be obtained
 */
public static File findDistributedFileBySymlink(JobContext context,
        String symLink, boolean throwExceptionIfNotFound) throws IOException {
    URI[] uris = context.getCacheFiles();
    if (uris == null || uris.length == 0) {
        if (throwExceptionIfNotFound) {
            throw new RuntimeException("Unable to find file with symlink '"
                    + symLink + "' in distributed cache");
        }
        return null;
    }
    URI symlinkUri = null;
    for (URI uri : uris) {
        if (symLink.equals(uri.getFragment())) {
            symlinkUri = uri;
            break;
        }
    }
    if (symlinkUri == null) {
        if (throwExceptionIfNotFound) {
            throw new RuntimeException("Unable to find file with symlink '"
                    + symLink + "' in distributed cache");
        }
        return null;
    }
    // If we run this locally, the file system URI scheme will be "file";
    // otherwise the file is reachable through its symlink name.
    return "file".equalsIgnoreCase(
            FileSystem.get(context.getConfiguration()).getScheme())
            ? new File(symlinkUri.getPath())
            : new File(symLink);
}

Then in mapper/reducer:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);

    File file = HadoopUtils.findDistributedFileBySymlink(context, "myfile", true);
    ... do work ...
}

Note that if I used "-files /path/to/myfile.txt" directly, then I would need to use "myfile.txt" to access the file, since that is the default symlink name.
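One caveat worth flagging: the -files option is only parsed when the driver goes through GenericOptionsParser, which is what ToolRunner does for you. A minimal sketch of such a driver (the class and job names are placeholders):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains whatever -files registered
        Job job = Job.getInstance(getConf(), "my-job");
        // ... set mapper, reducer, input/output paths ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips -files (and -D, -libjars) before run() sees args
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}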

answered Oct 12, 2018 by Rohit
+1 vote

I had the same problem. And not only is DistributedCache deprecated, but getLocalCacheFiles and "new Job" are too. So what worked for me is the following:

Driver:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
...
job.addCacheFile(new Path(filename).toUri());

In Mapper/Reducer setup:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    // URIs of the files added with job.addCacheFile() in the driver
    URI[] files = context.getCacheFiles();

    Path file1path = new Path(files[0]);
    ...
}
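Since getCacheFiles() returns the original URIs rather than local paths, one hedged way to actually read the file from setup() is through the FileSystem API (this assumes a plain text file):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inside setup(), after obtaining file1path as above:
FileSystem fs = FileSystem.get(context.getConfiguration());
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file1path)))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // use each line of the cached file
    }
}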
answered Oct 12, 2018 by Rohan
