Hadoop DistributedCache is deprecated - what is the preferred API

+3 votes

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.

The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:

// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf); 

// In the mapper
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...

However, DistributedCache is marked as deprecated in Hadoop 2.2.0.

What is the new preferred way to achieve this? Is there an up-to-date example or tutorial covering this API?

Sep 27, 2018 in Big Data Hadoop by digger
• 26,720 points
982 views

5 answers to this question.

+1 vote

The APIs for the Distributed Cache can be found in the Job class itself. Check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html The code should be something like

Job job = new Job();
...
job.addCacheFile(new Path(filename).toUri());

In your mapper code:

Path[] localPaths = context.getLocalCacheFiles();
...
answered Sep 27, 2018 by slayer
• 29,310 points
+1 vote

The preferred way of using DistributedCache for YARN/MapReduce 2 is as follows:

In your driver, use the Job.addCacheFile()

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = Job.getInstance(conf, "MyJob");

    job.setMapperClass(MyMapper.class);

    // ...

    // Mind the # sign after the absolute file location.
    // You will be using the name after the # sign as your
    // file name in your Mapper/Reducer
    job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
    job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));

    return job.waitForCompletion(true) ? 0 : 1;
}

And in your Mapper/Reducer, override the setup(Context context) method:

@Override
protected void setup(
        Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {

        File some_file = new File("./some");
        File other_file = new File("./other");

        // Do things to these two files, like read them
        // or parse as JSON or whatever.
    }
    super.setup(context);
}
answered Oct 12, 2018 by Tenali
+1 vote

The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Jobclass.

   Job.addCacheFile()

Unfortunately, there aren't as of yet many comprehensive tutorial-style examples of this.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29

answered Oct 12, 2018 by Jeevith
+1 vote

I did not use job.addCacheFile(). Instead I used -files option like "-files /path/to/myfile.txt#myfile" as before. Then in the mapper or reducer code I use the method below:

/**
 * This method can be used with local execution or HDFS execution. 
 * 
 * @param context
 * @param symLink
 * @param throwExceptionIfNotFound
 * @return
 * @throws IOException
 */
public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
{
    URI[] uris = context.getCacheFiles();
    if(uris==null||uris.length==0)
    {
        if(throwExceptionIfNotFound)
            throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
        return null;
    }
    URI symlinkUri = null;
    for(URI uri: uris)
    {
        if(symLink.equals(uri.getFragment()))
        {
            symlinkUri = uri;
            break;
        }
    }   
    if(symlinkUri==null)
    {
        if(throwExceptionIfNotFound)
            throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
        return null;
    }
    //if we run this locally the file system URI scheme will be "file" otherwise it should be a symlink
    return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())?(new File(symlinkUri.getPath())):new File(symLink);

}

Then in mapper/reducer:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    File file = HadoopUtils.findDistributedFileBySymlink(context,"myfile",true);
    ... do work ...
}

Note that if I used "-files /path/to/myfile.txt" directly then I need to use "myfile.txt" to access the file since that is the default symlink name.

answered Oct 12, 2018 by Rohit
+1 vote

I had the same problem. And not only is DistributedCach deprecated but getLocalCacheFiles and "new Job" too. So what worked for me is the following:

Driver:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
...
job.addCacheFile(new Path(filename).toUri());

In Mapper/Reducer setup:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    URI[] files = context.getCacheFiles(); // getCacheFiles returns null

    Path file1path = new Path(files[0])
    ...
}
answered Oct 12, 2018 by Rohan

Related Questions In Big Data Hadoop

0 votes
1 answer

What is the difference between Hadoop API and Streaming?

Usually we have Map/Reduce pair written in ...READ MORE

answered Dec 12, 2018 in Big Data Hadoop by Neha
• 6,300 points
242 views
0 votes
10 answers

What is the difference between Mongodb and Hadoop?

MongoDB is a NoSQL database, whereas Hadoop is ...READ MORE

answered Jun 20, 2018 in Big Data Hadoop by jenny_code
8,261 views
0 votes
13 answers

What is the difference between Hadoop/HDFS & HBase?

HDFS is a distributed file system whereas ...READ MORE

answered Apr 26, 2019 in Big Data Hadoop by Arihar
• 160 points
24,941 views
0 votes
1 answer

What is the use of sequence file in Hadoop?

Sequence files are binary files containing serialized ...READ MORE

answered Apr 5, 2018 in Big Data Hadoop by Ashish
• 2,650 points
6,481 views
+1 vote
1 answer

Hadoop Mapreduce word count Program

Firstly you need to understand the concept ...READ MORE

answered Mar 16, 2018 in Data Analytics by nitinrawat895
• 11,380 points
7,802 views
+2 votes
11 answers

hadoop fs -put command?

Hi, You can create one directory in HDFS ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by nitinrawat895
• 11,380 points
60,767 views
–1 vote
1 answer

Hadoop dfs -ls command?

In your case there is no difference ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by kurt_cobain
• 9,390 points
2,939 views
0 votes
1 answer
0 votes
1 answer

Hadoop Java Error: java.lang.NoClassDefFoundError: WordCount (wrong name: org/myorg/WordCount)

Hey, try this code import java.io.IOException; import java.util.Iterator; import java.util.StringTokenizer; import ...READ MORE

answered Sep 19, 2018 in Big Data Hadoop by slayer
• 29,310 points
3,629 views
0 votes
1 answer

Hadoop: intervals and JOIN

Hey, a solution was given on Biostar: http://biostar.stackexchange.com/questions/8821. Hope ...READ MORE

answered Sep 24, 2018 in Big Data Hadoop by slayer
• 29,310 points
169 views