Stateful Transformations with Windowing in Spark Streaming

Contributed by Prithviraj Bose

In this blog we will discuss the windowing concept of Apache Spark’s stateful transformations.

What is stateful transformation?

Spark streaming uses a micro batch architecture where the incoming data is grouped into micro batches called Discretized Streams (DStreams) which also serves as the basic programming abstraction. The DStreams internally have Resilient Distributed Datasets (RDD) and as a result of this standard RDD transformations and actions can be done.

In streaming if we have a use case to track data across batches then we need state-ful DStreams.

For example we may track a user’s interaction in a website during the user session or we may track a particular twitter hashtag across time and see which users across the globe is talking about it.

Types of state-ful transformation.

State-ful DStreams are of two types – window based tracking and full session tracking.

For stateful tracking all incoming data should be transformed to key-value pairs such that the key states can be tracked across batches. This is a precondition.

Further we should also enable checkpointing, a concept which we will discuss in the later blogs.

> Window based tracking

In window based tracking the incoming batches are grouped in time intervals, i.e. group batches every ‘x’ seconds. Further computations on these batches are done using slide intervals.

For example if the window interval = 3 secs and slide interval = 2 secs, then all incoming data will be grouped in batches every 3 seconds and the computations on these batches will happen every 2 seconds. Alternatively we can say, do computations every 2 seconds on the batches that arrived in the last 3 seconds.

In the above diagram we see that the incoming batches are grouped every 3 units of time (window interval) and the computations are done every 2 units of time (slide interval).
Note: Unlike Apache Flink, Apache Spark does not have a concept of tumbling window, all windows are sliding.

API

A popular API for window based transformations is

PairDStreamFunctions.reduceByKeyAndWindow.

There are several overloaded versions of this API, let’s see the one that has the most number of parameters. After this explanation the rest of the overloaded versions of this API should be self explanatory.

Returns: The transformed DStream[(K, V)]

reduceFunc: The associative reduce function.

invReduceFunc: The inverse of the above reduce function. This is required for efficient computing of incoming and outgoing batches. With the help of this function the value of the batches which are outgoing is deducted from the accumulated value of the above reduce function. For example, if we are computing the sum of the incoming values for the respective keys then for the outgoing batches we will subtract the values for the respective keys (provided they are present in the current batch else ignore).

windowDuration: Units of time for grouping the batches, this should be a multiple of the batch interval.

slideDuration: Units of time for computation, this should be a multiple of the batch interval.partitioner: The partitioner to use for storing the resulting DStream. For more information on partitioning read this.

filterFunc: Function to filter out expired key-value pairs, i.e. for example if we don’t get an update for a key for sometime we may wish to remove it.

Here’s a program to count the words coming from a socket stream. We have used the an overloaded version of the above function with a window interval of 4 secs and a slide interval of 2 secs.

In my next blog I will write about full session tracking and checkpointing.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

Stateful Transformations with Windowing in Spark Streaming

Recommended videos for you

Distributed Cache With MapReduce

Advanced Security In Hadoop Cluster

Is Hadoop A Necessity For Data Science?

Apache Kafka With Spark Streaming: Real-Time Analytics Redefined

Administer Hadoop Cluster

Hadoop for Java Professionals

Hadoop Architecture – Hadoop Tutorial on HDFS Architecture

5 Things One Must Know About Spark

Big Data – XML Parsing With MapReduce

Hadoop Tutorial – A Complete Tutorial For Hadoop

Pig Tutorial – Know Everything About Apache Pig Script

Secure Your Hadoop Cluster With Kerberos

Filtering on HBase Using MapReduce Filtering Pattern

Top Hadoop Interview Questions and Answers – Ace Your Interview

Python for Big Data Analytics

HBase Tutorial – A Complete Guide On Apache HBase

Introduction to Apache Solr-1

Spark SQL | Apache Spark

Improve Customer Service With Big Data

Apache Spark For Faster Batch Processing

Recommended blogs for you

Overview of Hadoop 2.0 Cluster Architecture Federation

Apache Spark Architecture – Spark Cluster Architecture Explained

What are the Roles and Responsibilities of a Hadoop Developer?

Drilling Down On Apache Drill, The New-Age Query Engine (Part 2)

Why Scala is getting Popular?

Big Data Processing with Apache Spark & Scala

How To Install MongoDB On Windows Operating System?

Overview of HBase Storage Architecture

What is a Data Engineer? – A Comprehensive Guide

What are the Best books for Hadoop?

Transfer files from Windows to Cloudera Demo VM

Explaining Hadoop Configuration

Introduction to Real-time Analytics with Apache Storm

Do You Need Java To Learn Hadoop?

Top Hadoop Developer Skills You Need to Master in 2025

Install Apache Hadoop Cluster on Amazon EC2 free tier Ubuntu server in 30 minutes

Switching Careers: From Java to Big Data / Hadoop

Hadoop Developer-Job Responsibilities & Skills

Apache Falcon: New Data Management Platform For The Hadoop Ecosystem

Spark SQL Tutorial – Understanding Spark SQL With Examples

Join the discussionCancel reply

Trending Courses in Big Data

Microsoft Azure Data Engineering Training Cou ...

Microsoft Fabric DP-700 Certification Trainin ...

PySpark Certification Training Course

Big Data Hadoop Certification Training Course

Applied Data Engineering on Azure Cloud Cours ...

Apache Kafka Certification Training Course

Apache Spark and Scala Certification Training ...

ELK Stack Training & Certification

Splunk Certification Training: Power User and ...

Comprehensive MapReduce Certification Trainin ...

Browse Categories

Subscribe to our Newsletter, and get personalized recommendations.

Stateful Transformations with Windowing in Spark Streaming