How are Hadoop/Spark used for building large analytics reports?

0 votes
I have a huge table in a relational database. Users work with it every day (CRUD operations and search).

Now there is a new task: to be able to build a huge aggregate report over a one- to two-year period on demand, and to do it fast. The table's records for the last two years are too big to fit in memory, so I should split the computation into chunks, right?

I don't want to reinvent the wheel, so my question is: are distributed processing systems like Hadoop suited to this kind of task?
Aug 7, 2018 in Big Data Hadoop by Neha
• 6,280 points
146 views

2 answers to this question.

0 votes
The non-Hadoop way would be to create semi-aggregated (intermediate) reports that you can use as input for a further aggregate, for example using 30 daily aggregates to create 1 monthly aggregate.
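To illustrate the daily-to-monthly idea, here is a minimal sketch in plain Python. The `daily_aggregates` rows and the `roll_up_to_months` helper are made up for this example; in practice the daily rows would come from a summary table that you maintain in your database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical pre-computed daily aggregates: (day, total_amount, row_count).
daily_aggregates = [
    (date(2018, 6, 29), 120.0, 3),
    (date(2018, 6, 30), 80.0, 2),
    (date(2018, 7, 1), 50.0, 1),
    (date(2018, 7, 2), 100.0, 3),
]

def roll_up_to_months(daily_rows):
    """Combine daily (total, count) pairs into monthly (total, count, average)."""
    monthly = defaultdict(lambda: [0.0, 0])
    for day, total, count in daily_rows:
        key = (day.year, day.month)
        monthly[key][0] += total   # sums of sums are still sums
        monthly[key][1] += count   # counts add up the same way
    # The average is derived from the combined sum and count,
    # not by averaging the daily averages.
    return {
        key: (total, count, total / count if count else 0.0)
        for key, (total, count) in monthly.items()
    }

monthly_report = roll_up_to_months(daily_aggregates)
```

This only works directly for measures that combine cleanly, such as sums, counts, minimums and maximums. An average has to be rebuilt from a stored sum and count, and something like a distinct count cannot be obtained by adding daily values together.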

In some cases that may not be possible, so you can pull the data into a Spark cluster or similar and do the aggregation there. A relational database usually won't give you data locality features, so you can move the data to a NoSQL store such as Cassandra, HBase, or Elasticsearch. Also, a key question is whether you need the answer in real time. Unless you go to some extra effort (a job server, for example), Spark or Hadoop jobs are usually batch jobs. That means you submit the job and get the answer later (Spark Streaming is an exception).

I hope this answer helps you :)
answered Aug 7, 2018 by Frankie
• 9,810 points
0 votes
In my opinion, the best framework for this task is Spark (or PySpark, its Python API). Let me tell you why:
Spark can run on top of Hadoop, so you can use HDFS for storage and split the data across the cluster. Spark's in-memory computation, along with its lazy evaluation, will then help you achieve your goal.
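To give a feel for what "split the data" and "lazy evaluation" mean here, below is a rough single-machine sketch in plain Python. It is not Spark code: `read_rows`, `chunks`, `aggregate_chunk` and `aggregate_all` are hypothetical helpers written for this example. A generator stands in for lazily read data, and each chunk is reduced to a small partial result before the partial results are merged, which is roughly the pattern a distributed engine applies across many machines.

```python
from collections import Counter
from itertools import islice

def read_rows():
    """Hypothetical row source: yields (category, amount) one row at a time."""
    for i in range(10):
        yield ("even" if i % 2 == 0 else "odd", i)

def chunks(rows, size):
    """Lazily group an iterable into lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def aggregate_chunk(chunk):
    """Reduce one chunk to a small partial result (sum of amount per category)."""
    partial = Counter()
    for category, amount in chunk:
        partial[category] += amount
    return partial

def aggregate_all(rows, chunk_size):
    """Merge per-chunk partial sums; only one chunk is held in memory at a time."""
    total = Counter()
    for chunk in chunks(rows, chunk_size):
        total.update(aggregate_chunk(chunk))  # Counter.update adds the values
    return total

totals = aggregate_all(read_rows(), chunk_size=3)
```

Nothing is read until `aggregate_all` starts pulling rows from the generator, and the full data set never has to fit in memory. Spark does the same kind of thing at cluster scale, with partitions in place of chunks.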

Hope this helps
answered Aug 7, 2018 by kurt_cobain
• 9,240 points
