
Apache Spark Ecosystem

Last updated on Feb 09, 2021 · 11K Views

The Spark ecosystem is still a work in progress: some of its components have not even reached their beta releases. They are still at the alpha stage and are being tested by their respective developers.


Components of Spark Ecosystem

The components of the Spark ecosystem are under active development, and new contributions are being made all the time. Primarily, the Spark ecosystem comprises the following components:

  1. Shark (SQL)
  2. Spark Streaming (Streaming)
  3. MLlib (Machine Learning)
  4. GraphX (Graph Computation)
  5. SparkR (R on Spark)
  6. BlinkDB (Approximate SQL)

These components are built on top of the Spark Core Engine, which lets you write raw Spark programs in Scala or Java and launch them; everything is ultimately executed by the Spark Core Engine. On top of it, a number of fast and efficient projects have sprung up.


Shark

Shark is one of the components of the Spark ecosystem. It is used to perform structured data analysis, especially when the data is voluminous, and it allows running unmodified Hive queries on an existing Hadoop deployment.
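Shark's Hive-compatible interface was later carried forward by Spark SQL, which keeps the same style of querying. A minimal sketch in Scala (the `employees.json` file, its schema, and the salary threshold are made up purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

object StructuredQueryExample {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster you would drop .master(...)
    val spark = SparkSession.builder()
      .appName("StructuredQueryExample")
      .master("local[*]")
      .getOrCreate()

    // Load structured data and expose it to SQL as a temporary view
    val employees = spark.read.json("employees.json") // hypothetical input file
    employees.createOrReplaceTempView("employees")

    // Hive-style SQL runs unmodified against the view
    spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").show()

    spark.stop()
  }
}
```

With `.enableHiveSupport()` on the builder (and the Hive dependencies on the classpath), the same session can also query tables in an existing Hive metastore, which is how unmodified Hive queries reach an existing Hadoop deployment.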


BlinkDB

BlinkDB is an approximate SQL query engine. If a huge amount of data is pouring in and you are not really interested in exactitude, or in exact results, but just want a rough or approximate picture, BlinkDB gives you exactly that: it fires the query against a sample of the data and returns an approximate answer. Isn't that an interesting concept? Many a time, when you do not require accurate results, sampling will certainly do.
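BlinkDB itself was a research project and never merged into mainline Spark, but its sampling idea can be approximated in plain Spark SQL with `TABLESAMPLE`: aggregate over a small random sample instead of the full table. A rough sketch (the table, file, and column names are hypothetical, and the sampled average is only an estimate):

```scala
import org.apache.spark.sql.SparkSession

object ApproximateQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ApproximateQueryExample")
      .master("local[*]")
      .getOrCreate()

    spark.read.parquet("events.parquet") // hypothetical input
      .createOrReplaceTempView("events")

    // Scan roughly 1% of the rows and average over that sample only;
    // far cheaper than the exact query, at the cost of some error.
    spark.sql(
      "SELECT AVG(latency_ms) AS approx_latency FROM events TABLESAMPLE (1 PERCENT)"
    ).show()

    spark.stop()
  }
}
```

Spark also ships approximate aggregates such as `approx_count_distinct`, which trade exactness for speed in the same spirit.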

Spark Streaming

Spark Streaming is one of the unique features that empower Spark to potentially take over the role of Apache Storm. It mainly enables you to create analytical and interactive applications over live streaming data: you stream the data in, and Spark runs its operations on the streamed data itself.
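As a sketch of the programming model, the classic DStream word count below reads lines from a local socket (fed by, say, `nc -lk 9999`) and counts words in five-second micro-batches; the host, port, and batch interval are illustrative choices, not requirements:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each 5-second micro-batch of lines becomes an RDD of (word, count) pairs
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```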


MLlib

MLlib is a machine learning library, much like Mahout. It is built on top of Spark and supports many machine learning algorithms. The key difference from Mahout is that it runs almost 100 times faster than MapReduce-based implementations. It is not yet as feature-rich as Mahout, but it is coming along well, even though it is still at an early stage of growth.
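A small taste of the API: the sketch below fits a logistic regression on a four-row toy DataFrame (the labels and feature values are invented purely for illustration):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MllibExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training set: a label column and a single-feature vector column
    val training = Seq(
      (0.0, Vectors.dense(0.0)),
      (0.0, Vectors.dense(1.0)),
      (1.0, Vectors.dense(2.0)),
      (1.0, Vectors.dense(3.0))
    ).toDF("label", "features")

    val model = new LogisticRegression().setMaxIter(10).fit(training)
    println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}
```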


GraphX

For graphs and graph computation, Spark has its own graph computation engine, called GraphX. It addresses the same space as widely used graph processing tools and databases, such as Neo4j and Apache Giraph, and other distributed graph systems.
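As a sketch of what GraphX code looks like, the toy follower graph below runs GraphX's built-in PageRank to rank users by incoming links; the users and edges are invented for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphxExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GraphxExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Vertices are (id, name); an edge from a to b means "a follows b"
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
    val graph = Graph(users, follows)

    // Run PageRank to convergence and print each user's rank
    val ranks = graph.pageRank(0.0001).vertices
    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s $rank%.4f")
    }

    spark.stop()
  }
}
```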


SparkR

Many people on the data science track are aware that, for statistical analysis, R is among the best tools available, and R already integrates with Hadoop. SparkR is a package for the R language that enables R users to leverage the power of Spark from the R shell.

Got a question for us? Mention it in the comments section and we will get back to you.
