Stateful Transformations in Apache Spark Streaming

Contributed by Prithviraj Bose

In my previous blog I have discussed stateful transformations using the windowing concept of Apache Spark Streaming. You can read it here.

In this post I am going to discuss cumulative stateful operations in Apache Spark Streaming. If you are new to Spark Streaming then I strongly recommend you to read my previous blog in order to understand how windowing works.

Types of Stateful Transformation in Spark Streaming (Continued…)

> Cumulative tracking

We had used the reduceByKeyAndWindow(…) API to track the states of keys, however windowing poses limitations for certain use cases. What if we want to accumulate the states of the keys throughout rather than limiting it to a time window? In that case we would need to use updateStateByKey(…) API.

This API was introduced in Spark 1.3.0 and has been very popular. However this API has some performance overhead, its performance degrades as the size of the states increases over time. I have written a sample to show the usage of this API.

Spark 1.6.0 introduced a new API mapWithState(…) which solves the performance overheads posed by updateStateByKey(…). In this blog I am going to discuss this particular API using a sample program that I have written.

Before I dive into a code walk-through, let’s spare a few words on checkpointing. For any stateful transformation, checkpointing is mandatory. Checkpointing is a mechanism to restore the state of the keys in case the driver program fails. When the driver restarts, the state of the keys are restored from the checkpointing files. Checkpoint locations are usually HDFS or Amazon S3 or any reliable storage. While testing the code, one can also store in the local file system.

In the sample program, we listen to socket text stream on host = localhost and port = 9999. It tokenizes the incoming stream into (words, no. of occurrences) and tracks the word count using the 1.6.0 API mapWithState(…). Additionally, keys with no updates are removed using StateSpec.timeout API. We are checkpointing in HDFS and the checkpointing frequency is every 20 seconds.

Let’s first create a Spark Streaming session,

We create a checkpointDir in the HDFS and then call the object method getOrCreate(…). The getOrCreate API checks the checkpointDir to see if there are any previous states to restore, if that exists, then it recreates the Spark Streaming session and updates the states of the keys from the data stored in the files before moving on with new data. Otherwise it creates a new Spark Streaming session.

The getOrCreate takes the checkpoint directory name and a function (which we have named createFunc) whose signature should be () => StreamingContext.

Let’s examine the code inside createFunc.

Line# 2: We create a streaming context with job name to “TestMapWithStateJob” and batch interval = 5 seconds.

Line# 5: Set the checkpoint directory.

Line# 8: Set the state specification using the classorg.apache.streaming.StateSpec object. We first set the function that will track the state, then we set the number of partitions for the resulting DStreams that are to be generated during subsequent transformations. Finally we set the timeout (to 30 seconds) where if any update for a key is not received in 30 seconds then the key state will be removed.

Line 12#: Setup the socket stream, flatten the incoming batch data, create a key-value pair, call mapWithState, set the checkpointing interval to 20s and finally print the results.

The Spark framework calls the createFunc for every key with the previous value and the current state. We compute the sum and update the state with the cumulative sum and finally we return the sum for the key.

Github Sources -> TestMapStateWithKey.scala, TestUpdateStateByKey.scala

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

Get Started with Apache Spark & Scala

Stateful Transformations in Apache Spark Streaming

Types of Stateful Transformation in Spark Streaming (Continued…)

> Cumulative tracking

Recommended videos for you

Webinar: Introduction to Big Data & Hadoop

HBase Tutorial – A Complete Guide On Apache HBase

Filtering on HBase Using MapReduce Filtering Pattern

Distributed Cache With MapReduce

Pig Tutorial – Know Everything About Apache Pig Script

When not to use Hadoop

Apache Kafka With Spark Streaming: Real-Time Analytics Redefined

Ways to Succeed with Hadoop in 2015

MapReduce Tutorial – All You Need To Know About MapReduce

Big Data Processing With Apache Spark

Introduction to Big Data TDD and Pig Unit

Apache Spark Will Replace Hadoop ! Know Why

Top Hadoop Interview Questions and Answers – Ace Your Interview

Introduction to Apache Solr-1

Is It The Right Time For Me To Learn Hadoop ? Find out.

Hive Tutorial – Understanding Hive In Depth

What Is Hadoop – All You Need To Know About Hadoop

Apache Spark Redefining Big Data Processing

Boost Your Data Career with Predictive Analytics! Learn How ?

Power of Python With BigData

Recommended blogs for you

Infographics: How Big is Big Data?

Big Data In Healthcare: How Hadoop Is Revolutionizing Healthcare Analytics

Why Scala is getting Popular?

Insights on HBase Architecture

Hadoop Streaming: Writing A Hadoop MapReduce Program In Python

Operators in Apache Pig: Part 1- Relational Operators

Cloudera Hadoop: Getting started with CDH Distribution

5 Reasons When to and When not to use Hadoop

Scala Functional Programming

HDFS Commands: Hadoop Shell Commands to Manage HDFS

Hadoop Cluster Configuration Files

Splunk Knowledge Objects: Splunk Events, Event Types And Tags

How to become a Hadoop Developer? Job Trends and Salary

What are the Roles and Responsibilities of a Hadoop Developer?

Apache Storm Use Cases

Big Data Testing: A Perfect Guide You Need to Follow

Brief Introduction to Oozie

Microsoft Fabric vs. Databricks

Splunk Careers – Your Pathway To Hot Big Data Jobs

Introduction to Spark with Python – PySpark for Beginners

Join the discussionCancel reply

Trending Courses in Big Data

Microsoft Azure Data Engineering Training Cou ...

Microsoft Fabric DP-700 Certification Trainin ...

PySpark Certification Training Course

Big Data Hadoop Certification Training Course

Applied Data Engineering on Azure Cloud Cours ...

Apache Kafka Certification Training Course

Apache Spark and Scala Certification Training ...

ELK Stack Training & Certification

Splunk Certification Training: Power User and ...

Comprehensive MapReduce Certification Trainin ...

Browse Categories

Subscribe to our Newsletter, and get personalized recommendations.

Stateful Transformations in Apache Spark Streaming