9 Oct 2015

5 Things One Must Know About Spark

Contents of the Webinar

1. Low Latency

2. Streaming support

3. Machine Learning and Graph

4. Data Frame API Introduction

5. Spark Integration with Hadoop

Spark Architecture

Like Hadoop, Spark is a framework. In the image below, Spark Core is the processing engine that exposes the core Spark API; internally, it is written in Scala.

Low Latency

Spark cuts down read/write I/O to disk

Spark stores its data as RDDs (Resilient Distributed Datasets), which are in-memory collections of data distributed across the machines in a cluster, although this approach has its limitations. A distinguishing feature of Spark is that the way it stores data depends on the kind of infrastructure it runs on.
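To make the idea of an in-memory, partitioned collection more concrete, here is a minimal sketch in plain Python. It is only a toy model, not Spark's API: the class `ToyRDD` and its methods are hypothetical names made up for this illustration, and all the "partitions" live in one process rather than on separate machines.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: a list of in-memory partitions (not Spark's API)."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Round-robin split; on a real cluster each partition would sit on a different machine.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Each partition is transformed independently, so the work could run in parallel.
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def reduce(self, fn):
        # Combine within each partition first, then combine the partial results.
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)


rdd = ToyRDD.parallelize(range(1, 11), num_partitions=3)
squares = rdd.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b)
print(total)  # 385
```

Because every intermediate result stays in memory as a list of partitions, nothing is written to disk between the `map` and the `reduce`, which is the intuition behind the reduced disk I/O described above.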

Streaming support

Event Processing

Used for processing real-time streaming data.

It uses the DStream (discretized stream), a series of RDDs, to process real-time data.
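The "series of RDDs" idea can be sketched in plain Python as grouping incoming events into small fixed-length batches and then processing each batch like an ordinary collection. This is a toy illustration only: the function `to_micro_batches`, the event tuples, and the one-second interval are assumptions made for the example, and intervals with no events are simply skipped to keep it short.

```python
from collections import defaultdict


def to_micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval land in the same batch.
        batches[int(timestamp // batch_seconds)].append(value)
    # Simplification: only intervals that received at least one event are returned.
    return [batches[key] for key in sorted(batches)]


# Hypothetical stream of (arrival time in seconds, payload) pairs.
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]

batches = to_micro_batches(events, batch_seconds=1)
counts_per_batch = [len(batch) for batch in batches]
print(batches)           # [['a', 'b'], ['a'], ['c', 'a']]
print(counts_per_batch)  # [2, 1, 2]
```

Each inner list plays the role of one small batch in the stream, so the same batch-style operations can be reused on streaming data.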

Cyclic Data flows

1. All jobs in Spark comprise a series of operators and run on a set of data.

2. All the operators in a job are used to construct a DAG (directed acyclic graph).

3. The DAG is optimized by rearranging and combining operators where possible.
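The three steps above can be illustrated with a small plain-Python sketch. It is not how Spark is implemented: `ToyPipeline` is a hypothetical class, the job is a simple linear chain (the simplest possible DAG), and the only "optimization" shown is combining consecutive per-element operators into a single pass over the data instead of building an intermediate list after every step.

```python
class ToyPipeline:
    """Records operators lazily, then runs them fused in one pass (toy model)."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed yet; the operator is only added to the plan.
        return ToyPipeline(self.source, self.ops + (("map", fn),))

    def filter(self, predicate):
        return ToyPipeline(self.source, self.ops + (("filter", predicate),))

    def collect(self):
        # Combined execution: every element flows through all operators
        # before the next element is read, with no intermediate collections.
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


plan = (
    ToyPipeline(range(10))
    .map(lambda x: x + 1)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * 10)
)
result = plan.collect()
print(len(plan.ops))  # 3
print(result)         # [20, 40, 60, 80, 100]
```

The point of the sketch is the separation between describing the job (the chain of `map` and `filter` calls) and running it (`collect`), which is what gives an engine room to rearrange or combine operators before any data is touched.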

Support for data frames

Data frame features

Ability to scale from kilobytes (KB) to petabytes (PB) of data.
Support for a wide array of data formats and storage systems.
Seamless integration with all big data tooling and infrastructure via Spark.

Questions asked during the webinar

Mesos Vs YARN

Mesos and YARN are both resource managers. YARN is popular because it is part of Hadoop; Mesos is less widely used, although its functionality is the same.

Got a question for us? Please mention it in the comments section and we will get back to you.
