What Distributed Cache is actually used for in Hadoop

Question

I am confused about the functionality of distributed cache? I have referred to multiple blogs but I still not able to properly understand what actually distibuted cache does?

Basically, I would like to know that by caching a file using distributed cache, is it stored or copied into every datanode? If not, does it mean copying the data into memory of each data node?

Ashish · Answer 1 · Apr 3, 2018

Basically distributed cache allows you to cache files that are need for your Map-Reduce jobs. Once you use distributed cache to cache a file for a job, it will be accessed by all the slave nodes (worker nodes) of the hadoop cluster. Basically, hadoop copies the cached files on the slave nodes' file system so that it can be accessed on that node. This copying of the file happens before any task w.r.t a job is executed on that node.

For further read and usage, you can refer to the official documentation: https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/filecache/DistributedCache.html

Let me know in case you need anything else...