For a long time, relational databases were enough to handle small and medium datasets. But the colossal rate at which data is growing makes the traditional approach to data storage and retrieval unfeasible. This problem is being solved by newer technologies that can handle Big Data. Hadoop, Hive, and HBase are popular platforms for operating on large datasets of this kind. NoSQL, or Not Only SQL, databases such as MongoDB® provide a mechanism to store and retrieve data under a looser consistency model, with advantages like:
- Horizontal scaling
- Higher availability
- Faster access
The MongoDB® engineering team has recently updated the MongoDB® Connector for Hadoop for tighter integration, making it easier for Hadoop users to integrate real-time data from MongoDB® with Hadoop for deep, offline analytics. The Connector:
- Exposes the analytical power of Hadoop’s MapReduce to live application data from MongoDB®, driving value from Big Data faster and more efficiently.
- Presents MongoDB® as a Hadoop-compatible file system, allowing a MapReduce job to read from MongoDB® directly without first copying the data to HDFS (the Hadoop Distributed File System), thereby removing the need to move terabytes of data across the network.
- Lets MapReduce jobs pass queries as filters, avoiding the need to scan entire collections, and take advantage of MongoDB®’s rich indexing capabilities, including geospatial, text-search, array, compound, and sparse indexes.
- Allows the results of Hadoop jobs to be written back out to MongoDB®, to support real-time operational processes and ad-hoc querying.
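The filter-pushdown idea above can be sketched in plain Python. This is a hypothetical illustration, not connector code: the document shape, the `status`/`amount` field names, and the `matches` helper are all assumptions standing in for MongoDB®’s query engine, which would return only matching documents to the job instead of the whole collection.

```python
# Stand-in for documents stored in a MongoDB collection.
docs = [
    {"_id": 1, "status": "active", "amount": 40},
    {"_id": 2, "status": "archived", "amount": 10},
    {"_id": 3, "status": "active", "amount": 25},
]

# The query the MapReduce job would pass down as a filter, so only
# matching documents ever leave the database.
query = {"status": "active"}

def matches(doc, query):
    """Toy equality matcher standing in for MongoDB's query engine."""
    return all(doc.get(k) == v for k, v in query.items())

filtered = [d for d in docs if matches(d, query)]
total = sum(d["amount"] for d in filtered)
print(len(filtered), total)  # the job sees 2 documents instead of 3
```

With an index on the filtered field, the database can serve this query without a full collection scan, which is the saving the Connector exposes to Hadoop jobs.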
Hadoop and MongoDB® Use Cases:
Let’s look at a high-level description of how MongoDB® and Hadoop can fit together in a typical Big Data stack. Primarily we have:
- MongoDB® used as the “Operational” real-time data store
- Hadoop for offline batch data processing and analysis
Application of MongoDB® with Hadoop in Batch Aggregation:
In most scenarios, the built-in aggregation functionality provided by MongoDB® is sufficient for analyzing data. In certain cases, however, significantly more complex data aggregation may be necessary. This is where Hadoop can provide a powerful framework for complex analytics.
In this scenario:
- Data is pulled from MongoDB® and processed within Hadoop via one or more MapReduce jobs. Data may also be sourced from other places within these MapReduce jobs to develop a multi-data source solution.
- Output from these MapReduce jobs can then be written back to MongoDB® for querying at a later stage and for any ad-hoc analysis.
- Applications built on top of MongoDB® can then use the results of this batch analytics to present information to the end client or to enable other downstream features.
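The batch-aggregation flow above can be sketched as a miniature MapReduce in plain Python. This is an illustrative simulation, not connector code: the event documents, the `user`/`pages` field names, and the map/reduce functions are assumptions; in a real job the input would be read from MongoDB® and the results written back to an output collection.

```python
from collections import defaultdict

# Stand-in for documents pulled from MongoDB.
events = [
    {"user": "a", "pages": 3},
    {"user": "b", "pages": 5},
    {"user": "a", "pages": 2},
]

def map_phase(doc):
    # Emit (key, value) pairs, as a Hadoop Mapper would.
    yield doc["user"], doc["pages"]

def reduce_phase(key, values):
    # Combine all values for one key, as a Hadoop Reducer would.
    return {"_id": key, "total_pages": sum(values)}

# Shuffle: group emitted values by key.
grouped = defaultdict(list)
for doc in events:
    for k, v in map_phase(doc):
        grouped[k].append(v)

results = [reduce_phase(k, vs) for k, vs in sorted(grouped.items())]
# In production, these results would be written back to MongoDB
# for later querying by the application.
print(results)
```

The same map/shuffle/reduce shape scales out across a Hadoop cluster, which is what makes it suitable for aggregations too heavy for the database itself.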
Application in Data Warehousing:
In a typical production setup, an application’s data may reside in multiple data stores, each with its own query language and functionality. To reduce complexity in these scenarios, Hadoop can be used as a data warehouse and act as a centralized repository for data from the various sources.
In this kind of scenario:
- Periodic MapReduce jobs load data from MongoDB® into Hadoop.
- Once the data from MongoDB® and other sources is available in Hadoop, the larger dataset can be queried against.
- Data analysts now have the option of using either MapReduce or Pig to create jobs that query the larger datasets that incorporate data from MongoDB®.
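The warehousing flow above can be sketched in plain Python: a periodic batch from MongoDB® is combined with records from a second source into one queryable dataset. This is a hypothetical illustration; the `order_id`, `total`, and `region` fields and the CRM source are assumptions, and the dictionary join stands in for a MapReduce or Pig join over the warehouse.

```python
# Periodic batch loaded from MongoDB (stand-in documents).
mongo_batch = [{"order_id": 1, "total": 30.0},
               {"order_id": 2, "total": 55.0}]

# Records from a second source, e.g. a hypothetical CRM export.
crm_batch = [{"order_id": 1, "region": "EU"},
             {"order_id": 2, "region": "US"}]

# "Warehouse" join keyed on order_id, standing in for a cluster-side join.
warehouse = {}
for rec in mongo_batch:
    warehouse[rec["order_id"]] = dict(rec)
for rec in crm_batch:
    warehouse[rec["order_id"]].update(rec)

# Analysts can now query across both sources at once.
eu_revenue = sum(r["total"] for r in warehouse.values()
                 if r["region"] == "EU")
print(eu_revenue)  # 30.0
```

Centralizing the join in the warehouse means each analyst query works against one combined dataset instead of two stores with different query languages.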
The team behind MongoDB® has ensured that, through rich integration with Big Data technologies like Hadoop, MongoDB® fits well into the Big Data stack and helps solve some complex architectural issues around data storage, retrieval, processing, aggregation, and warehousing. Stay tuned for our upcoming post on career prospects for those who take up Hadoop with MongoDB®. If you are already working with Hadoop or just picking up MongoDB®, do check out the courses we offer for MongoDB® here.