Hadoop, the data processing framework that’s become a platform unto itself, becomes even better when good components are connected to it. Some of its existing components have shortcomings, however: Hadoop’s MapReduce component, for example, has a reputation for being too slow for real-time data analysis.
Enter Apache Spark, a Hadoop-based data processing engine designed for both batch and streaming workloads, now in its 1.0 version and outfitted with features that exemplify the kinds of work Hadoop is being pushed to include. Spark runs on top of existing Hadoop clusters to provide enhanced and additional functionality.
Let’s look at Spark’s key features and how it works alongside Hadoop.
Apache Spark Key Features:
- Hadoop Integration – Spark can work with files stored in HDFS.
- Spark’s Interactive Shell – Spark is written in Scala, and has its own version of the Scala interpreter.
- Spark’s Analytic Suite – Spark comes with tools for interactive query analysis, large-scale graph processing and analysis, and real-time analysis.
- Resilient Distributed Datasets (RDDs) – RDDs are distributed objects that can be cached in memory across a cluster of compute nodes. They are the primary data objects used in Spark.
- Distributed Operators – Beyond the map and reduce operations of MapReduce, there are many other operators one can use on RDDs.
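To make the idea of chaining operators a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Spark’s API: the helper names `flat_map` and `reduce_by_key` and the sample `lines` are assumptions made up for illustration, and an ordinary list stands in for an RDD that Spark would spread across a cluster.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data; in Spark this would be an RDD, possibly read from HDFS.
lines = [
    "spark runs on hadoop",
    "spark caches data in memory",
    "hadoop stores data in hdfs",
]


def flat_map(func, items):
    """Apply func to each item and flatten the resulting sequences into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key and combine each key's values with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Chain several operators: flat-map, filter, map, then a keyed reduce.
words = flat_map(str.split, lines)
long_words = [w for w in words if len(w) > 2]       # filter out very short words
pairs = [(w, 1) for w in long_words]                # map each word to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # sum the 1s per word
```

The point of the sketch is only that a job can be expressed as a pipeline of several small operators rather than a single map step followed by a single reduce step.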
Advantages of Using Apache Spark with Hadoop:
Apache Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly.
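The benefit of loading data once and querying it repeatedly can be sketched on a single machine. The following plain-Python example is an illustration under stated assumptions, not Spark code: `CachedDataset` and `slow_loader` are hypothetical names, and the "expensive read" is simulated. It shows an iterative algorithm (a simple gradient descent that converges to the mean) reusing data held in memory instead of reloading it on every pass.

```python
class CachedDataset:
    """Loads data on first access, then serves every later access from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_count = 0  # how many times the expensive load actually ran

    def get(self):
        if self._data is None:
            self._data = list(self._loader())
            self.load_count += 1
        return self._data


def slow_loader():
    # Stand-in for an expensive read, for example from disk or HDFS.
    return (float(x) for x in range(1, 11))


dataset = CachedDataset(slow_loader)

# Iterative algorithm: gradient descent on squared error, which converges
# to the mean of the data. Every iteration scans the same cached points.
estimate = 0.0
for _ in range(50):
    points = dataset.get()
    gradient = sum(estimate - p for p in points) / len(points)
    estimate -= 0.5 * gradient
```

Fifty passes over the data trigger only one load, which is the pattern that makes in-memory computing attractive for iterative machine learning workloads.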
Runs up to 100 times faster – Spark can also speed up jobs that run on the Hadoop data-processing platform. Dubbed the “Hadoop Swiss Army knife,” Apache Spark makes it possible to create data-analysis jobs that run up to 100 times faster than those running on standard Apache Hadoop MapReduce. MapReduce has been widely criticized as a bottleneck in Hadoop clusters because it executes jobs in batch mode, which means that real-time analysis of data is not possible.
Alternative to MapReduce – Spark provides an alternative to MapReduce. It executes jobs in short bursts of micro-batches that are five seconds or less apart. It also provides more stability than real-time, stream-oriented Hadoop frameworks such as Twitter Storm. The software can be used for a variety of jobs, such as ongoing analysis of live data and, thanks to a software library, more computationally intensive jobs involving machine learning and graph processing.
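The micro-batch idea can be illustrated with a small plain-Python sketch. This is not Spark Streaming’s API: the function `micro_batches`, the five-second interval, and the sample `events` are assumptions chosen for illustration. Timestamped events are grouped into consecutive fixed-length windows, and each window is then processed as a small batch job.

```python
from collections import Counter


def micro_batches(events, interval_seconds):
    """Group (timestamp, value) events into consecutive fixed-length windows.

    Windows that receive no events are simply skipped in this sketch.
    """
    batches = {}
    for timestamp, value in events:
        window = int(timestamp // interval_seconds)
        batches.setdefault(window, []).append(value)
    return [batches[w] for w in sorted(batches)]


# Hypothetical stream of (seconds since start, log level) events.
events = [
    (0.5, "error"),
    (1.2, "ok"),
    (4.9, "ok"),
    (5.1, "error"),
    (7.3, "error"),
    (11.0, "ok"),
]

# Run the same small batch computation (a count per level) on each window.
batch_counts = [Counter(batch) for batch in micro_batches(events, interval_seconds=5)]
```

Each element of `batch_counts` is the result for one window, so results become available shortly after each interval closes rather than only after the whole data set has been processed.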
Support for Multiple Languages – Using Spark, developers can write data-analysis jobs in Java, Scala or Python, using a set of more than 80 high-level operators.
Library Support – Spark’s libraries are designed to complement the types of processing jobs being explored more aggressively with the latest commercially supported deployments of Hadoop. MLlib implements a slew of common machine learning algorithms, such as naïve Bayesian classification or clustering; Spark Streaming enables high-speed processing of data ingested from multiple sources; and GraphX allows for computations on graph data.
Stable API – With version 1.0, Apache Spark offers a stable API (application programming interface), which developers can use to interact with Spark through their own applications. This makes it easier to use Spark in Hadoop-based deployments.
Spark SQL Component – The Spark SQL component for accessing structured data allows that data to be interrogated alongside unstructured data in analysis work. Spark SQL, which is only in alpha at the moment, allows SQL-like queries to be run against data stored in Apache Hive. Extracting data from Hadoop via SQL queries is yet another variant of the real-time querying functionality springing up around Hadoop.
Apache Spark Compatibility with Hadoop [HDFS, HBase and YARN] – Apache Spark is fully compatible with Hadoop’s Distributed File System (HDFS), as well as with other Hadoop components such as YARN (Yet Another Resource Negotiator) and the HBase distributed database.
IT companies such as Cloudera, Pivotal, IBM, Intel and MapR have all folded Spark into their Hadoop stacks. Databricks, a company founded by some of the developers of Spark, offers commercial support for the software. Both Yahoo and NASA, among others, use the software for daily data operations.
What Spark has to offer is bound to be a big draw for both users and commercial vendors of Hadoop. Users who are looking to implement Hadoop and who have already built many of their analytics systems around Hadoop are attracted to the idea of being able to use Hadoop as a real-time processing system.
For vendors, Spark 1.0 provides another set of functionality to support or build proprietary products around. In fact, one of the big three Hadoop vendors, Cloudera, has already been providing commercial support for Spark via its Cloudera Enterprise offering. Hortonworks has also been offering Spark as a component of its Hadoop distribution. The large-scale adoption of Spark by top companies indicates its success and its potential when it comes to real-time processing.