How is Hadoop/Spark used for building large analytics reports?

0 votes
I have a huge table in a relational database that users work with every day (CRUD operations and search).

Now there is a new task: make it possible to build a huge aggregate report for a one-to-two-year period, on demand, and to do it fast. All the table's records for the last two years are too big to fit in memory, so I should split the computation into chunks, right?

I don't want to reinvent the wheel, so my question is: are distributed processing systems like Hadoop suited to this kind of task?
Aug 7, 2018 in Big Data Hadoop by Neha
• 6,180 points
135 views

2 answers to this question.

0 votes
The non-Hadoop way would be to create semi-aggregate reports that you can then combine into larger aggregates, e.g. using 30 daily aggregates to build 1 monthly aggregate.
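
That roll-up idea can be sketched in plain Python. The `daily_totals` data below is hypothetical stand-in data; in a real system each daily value would be pre-computed once per day from the database:

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily aggregates (e.g. total sales per day),
# pre-computed once per day instead of re-scanning raw records.
daily_totals = {
    date(2018, 1, 1) + timedelta(days=i): 100 + i
    for i in range(60)  # two months of daily roll-ups
}

# Roll daily aggregates up into monthly aggregates: each month's
# value is the sum of its ~30 daily values, so a two-year report
# only touches ~730 small rows instead of the full table.
monthly_totals = defaultdict(int)
for day, total in daily_totals.items():
    monthly_totals[(day.year, day.month)] += total

print(monthly_totals[(2018, 1)])  # prints 3565 (sum of 100..130)
```

The same pattern extends upward: monthly aggregates can be combined into yearly ones, so the on-demand report never has to read the raw table.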

In some cases that may not be possible, so you can pull the data into your Spark cluster (or similar) and do the aggregation there. A relational database usually won't give you data-locality features, so you may want to move the data into a NoSQL store like Cassandra, HBase, or Elasticsearch. Another key question is whether you need the answer in real time: unless you go through some effort (a job server, etc.), Spark and Hadoop jobs are usually batch jobs, meaning you submit the job and get the answer later (Spark Streaming is an exception).

I hope this answer helps you :)
answered Aug 7, 2018 by Frankie
• 9,710 points
0 votes
The best-suited framework for this task is Spark (or PySpark). Let me tell you why: since Spark works on top of Hadoop, you can leverage HDFS for storage and split the data across the cluster, and Spark's in-memory computation along with its lazy evaluation will help you achieve your goal easily.

Hope this helps
answered Aug 7, 2018 by kurt_cobain
• 9,260 points
