Integration of Hadoop with MongoDB concept


Hi, I am new to Hadoop and NoSQL technologies. I started learning with the word-count program, reading a file stored in HDFS and processing it. Now I want to use Hadoop with MongoDB. I started with the program from here.

My confusion is that it stores MongoDB data on my local file system, reads data from the local file system into HDFS in map/reduce, and then writes it back to MongoDB on the local file system. When I studied HBase, I learned it can be configured to store its data on HDFS, so Hadoop can process it directly there (map/reduce). How do I configure MongoDB to store its data on HDFS?

I think it is a better approach to store the data in HDFS for fast processing, not in the local file system. Am I right? Please clarify my concept if I am going in the wrong direction.

Sep 24, 2018 in Big Data Hadoop by Neha

1 answer to this question.

MongoDB isn't built to work on top of HDFS, and it isn't really necessary, since MongoDB already has its own approach for scaling horizontally and working with data stored across multiple machines (sharding and replica sets).

A better approach, if you need to work with both MongoDB and Hadoop, is to use MongoDB as the source of your data but process everything in Hadoop (which will use HDFS for any temporary storage). Once you're done processing the data, you can write it back to MongoDB, S3, or wherever you want.
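To make that concrete, here is a minimal word-count sketch that reads documents straight out of MongoDB and writes the counts back, using the mongo-hadoop connector. It assumes mongo-hadoop-core (and the MongoDB Java driver) is on the classpath; the URIs, the test.documents and test.wordcounts collections, and the "text" field are placeholders, not anything from your setup.

// Word count over a MongoDB collection via the mongo-hadoop connector.
// Documents are read directly from MongoDB; nothing is staged on the
// local file system first.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoWordCount {

    // Each input record is one MongoDB document: key = _id, value = the document.
    public static class TokenizerMapper
            extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, BSONObject doc, Context context)
                throws IOException, InterruptedException {
            Object text = doc.get("text"); // hypothetical field holding the text
            if (text == null) return;
            StringTokenizer it = new StringTokenizer(text.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read from one collection, write the counts to another (placeholder URIs).
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/test.documents");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/test.wordcounts");

        Job job = Job.getInstance(conf, "mongo word count");
        job.setJarByClass(MongoWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it like any other Hadoop job; the connector computes input splits against the collection itself, so there is no need to export the data into HDFS first, and Hadoop still uses HDFS for its intermediate storage.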

OR

HDFS is a distributed file system, while HBase is a NoSQL database that uses HDFS as its file system, providing fast and efficient integration with Hadoop that has been proven to work at scale. Being able to work with HBase data directly in Hadoop, or push it into HDFS, is one of the big advantages of picking HBase as a NoSQL database solution. I don't believe MongoDB provides such tight integration with Hadoop and HDFS, which would mitigate the performance and efficiency concerns of moving data to and from a database.

Please look at this blog post for a detailed analysis: https://www.edureka.co/blog/mongodb-the-database-for-big-data-processing/
answered Sep 25, 2018 by Frankie
