Relationship between Spark, Hadoop and Cassandra?

0 votes

Is Spark an alternative to Hadoop? When I try to install Spark, though, the installation page asks for an existing Hadoop installation. I can't find anything that clarifies the relationship.

Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces, but Spark also has its own SQL. Why would one use Cassandra or Hive instead of Spark's native SQL, assuming this is a brand new project with no existing installation?

Mar 26, 2018 in Big Data Hadoop by kurt_cobain
• 9,260 points
77 views

1 answer to this question.

0 votes

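Spark is a distributed in-memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop stores its data in HDFS (Hadoop Distributed File System), so Spark can read data from HDFS and save results back to it. This is also why the download page mentions Hadoop: Spark packages are built against Hadoop's client libraries so they can talk to HDFS, but Spark can run in its own standalone mode without a Hadoop cluster.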

For speed, Spark keeps its working data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, HBase, or a Cassandra database. Once the data is loaded into memory, Spark can run many transformations on it to calculate a desired result. The final result is then typically written back to durable storage.

As an alternative to Hadoop, Spark can be much faster at certain operations. For example, a multi-pass MapReduce operation can be dramatically faster in Spark than in Hadoop MapReduce, because Spark avoids most of the disk I/O that Hadoop performs between passes. Spark can also read data formatted for Apache Hive, so Spark SQL can be much faster than running the same query in HQL (Hive Query Language).
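To make the disk I/O point concrete, here is a toy model in plain Python. It is not Spark or Hadoop code: the helper names (`read`, `write`, `run_on_disk`, `run_in_memory`) and the JSON files are made up for illustration, and the sketch only counts how often each strategy touches disk during a three-pass job.

```python
import json
import os
import tempfile


def read(path):
    """Load a list of numbers from a JSON file (one disk read)."""
    with open(path) as f:
        return json.load(f)


def write(path, rows):
    """Save a list of numbers to a JSON file (one disk write)."""
    with open(path, "w") as f:
        json.dump(rows, f)


def run_on_disk(input_path, passes, workdir):
    """MapReduce-like strategy: every pass reads its input from disk
    and writes its output back to disk."""
    disk_ops = 0
    path = input_path
    rows = []
    for i, fn in enumerate(passes):
        rows = [fn(x) for x in read(path)]
        path = os.path.join(workdir, f"pass_{i}.json")
        write(path, rows)
        disk_ops += 2  # one read plus one write per pass
    return rows, disk_ops


def run_in_memory(input_path, passes, workdir):
    """Spark-like strategy: read once, chain all passes in memory,
    write the final result once."""
    rows = read(input_path)
    for fn in passes:
        rows = [fn(x) for x in rows]
    write(os.path.join(workdir, "result.json"), rows)
    return rows, 2  # one initial read plus one final write


# Three hypothetical passes over the data.
passes = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

with tempfile.TemporaryDirectory() as workdir:
    input_path = os.path.join(workdir, "input.json")
    write(input_path, [1, 2, 3])
    disk_rows, disk_ops = run_on_disk(input_path, passes, workdir)
    mem_rows, mem_ops = run_in_memory(input_path, passes, workdir)

print(disk_rows, disk_ops)
print(mem_rows, mem_ops)
```

Both strategies produce the same rows, but the on-disk version's I/O grows with the number of passes while the in-memory version's stays constant. Real clusters add network shuffles, caching limits, and fault tolerance that this sketch ignores.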

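Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL: it has no joins, and it is quite limited for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature-rich query language and lets you do data analytics that native CQL doesn't provide. In other words, Cassandra and Hive are where the data is stored, while Spark SQL is a way of querying it, so they complement each other rather than compete.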

Another use case for Spark is stream processing. Spark can be set up to ingest incoming real-time data, process it in micro-batches, and then save the results to durable storage such as HDFS or Cassandra.
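The micro-batch idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Spark's streaming API: the function names `micro_batches` and `process_stream` are hypothetical, the list `sink` stands in for durable storage, and batches are cut by item count here, whereas Spark cuts them by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into small batches of at most batch_size items."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_stream(events: Iterable[int], batch_size: int) -> List[int]:
    """Process each micro-batch as a unit and append its result to a sink."""
    sink: List[int] = []  # stand-in for HDFS, Cassandra, etc.
    for batch in micro_batches(events, batch_size):
        sink.append(sum(batch))  # the per-batch "job": here, a simple sum
    return sink


print(process_stream(range(1, 8), batch_size=3))
```

Each batch is handled like a small batch job, which is how the same engine can serve both batch and streaming workloads.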

So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.

answered Mar 26, 2018 by nitinrawat895
• 9,310 points


© 2018 Brain4ce Education Solutions Pvt. Ltd. All rights Reserved.