18 Sep 2015

Apache Kafka With Spark Streaming: Real-Time Analytics Redefined

Apache projects like Kafka and Spark continue to be popular when it comes to stream processing. Engineers have started integrating Kafka with Spark streaming to benefit from the advantages both of them offer. This webinar discusses the advantages of Kafka, different components and use cases along with Kafka-Spark integration.

The above video includes the following topics. You can check out the presentation at the end of the post:

What is Kafka?
Why do we need Kafka?
Kafka components
How does Kafka work
Which companies are using Kafka?
Kafka and Spark integration

What is Kafka & why do we need it?

Kafka is an open-source message broker project developed by the Apache Software Foundation and is written in Scala. Kafka’s objective is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka’s design is predominantly based on transaction logs.

Kafka is used in scenarios where its features like scalability, data partitioning, low latency, and the ability to handle large data can be utilized. Basically, Kafka is a good player when it comes to data integration related use cases.

Kafka is adopted in the following use cases:

Transportation of logs
Activity stream in real-time
CPU/IO/Memory usage
Time taken to load a web page
Time taken by multiple services while building a web page
Number of requests
Number of hits on a particular page/URL

Why LinkedIn built Kafka?

Kafka was originally developed at LinkedIn and became an Apache project in July 2011. It is used by LinkedIn for applications including log aggregation, queuing, and real-time monitoring and event processing.

LinkedIn uses Apache Kafka as a central publish-subscribe log for integrating data between applications, stream processing, and Hadoop data ingestion.

Kafka benchmarks

LinkedIn relies heavily on the scalability and reliability of Kafka and a surrounding ecosystem of both open source and internal components. A test was conducted where the Kafka cluster was set up on three machines. The six drives are directly mounted with no RAID (JBOD style).

How fast is Kafka?

Kafka works up to two million writes/sec on three cheap machines. This was done using three producers on three different machines, 3x async replication and only one producer/machine because NIC was already saturated. It has shown sustained throughput as stored data grows.

Why is Kafka so fast?

Fast writes:

Apache Kafka persists all data to disk, essentially all writes go to the page cache of OS, i.e. RAM and the hardware specifications and OS are tuned in such a way that it is fast.

Fast reads:

Kafka efficiently transfers data from page cache to a network socket, through sendfile() system call in Linux.

The Combination of fast writes and fast reads have resulted in fast Kafka!

On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.

Example: According to loggly.com, which run Kafka on Amazon AWS, “99.99999% of the time the data is coming from disk cache and RAM; only very rarely do we hit the disk.”

“One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across three brokers.”

Kafka Components

In Kafka, a message stream is defined by a topic, divided into one or more partitions. Replication happens at the partition level and each partition has one or more replicas.

The replicas are assigned evenly to different servers (called brokers) in a Kafka cluster. Each replica maintains a log on disk. Published messages are appended sequentially in the log and each message is identified by a monotonically increasing offset within the log.

When a consumer subscribes to a topic, it keeps track of an offset in each partition for consumption and uses it to issue fetch requests to the broker.

Kafka adoption and use cases

Here’s a list of companies that have adopted Kafka and their corresponding use cases:

LinkedIn – Activity streams, operational metrics, data bus
Netflix – Real-time monitoring and event processing
Twitter – Part of their Storm real-time data pipelines
Spotify – Log delivery (from 4h down to 10s), Hadoop
Loggly – Log collection and processing
Mozilla – Telemetry data

Other companies like Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber and many more have also adopted Kafka.

Webinar Presentation

Questions asked during the webinar

1. Does Kafka help to build lineage?
No, Kafka will just publish the data.
2. What is the example configuration of the scalable system?
It depends on the size of the entire message which is going to be passed. Plus the RAM configuration available at the nodes where you are going to setup Kafka.
3. Is aggregation possible on Kafka?
Aggregation is not something that Kafka was designed to do.

Got a question for us? Please mention them in the comments section and we will get back to you.

Apache Kafka With Spark Streaming: Real-Time Analytics Redefined

What is Kafka & why do we need it?

Why LinkedIn built Kafka?

Kafka benchmarks

How fast is Kafka?

Why is Kafka so fast?

Kafka Components

Kafka adoption and use cases

Webinar Presentation

Questions asked during the webinar

Recommended blogs for you

Microsoft Fabric vs. Databricks

Copy Activity in Azure Data Factory and Azure Synapse Analytics

Azure Data Factory Vs Databricks

Data Engineer Salary in India

What is a Data Engineer? – A Comprehensive Guide

How to Create a Pipeline in Azure Data Factory Step-by-Step

What is Azure Cosmos DB? – Types, Features, Benefits

What is integration runtime in Azure data factory?

Azure Databricks Architecture Overview

What is Delta Lake?

Azure Synapse vs. Databricks – What Are the Differences?

What is Azure Data Factory – Here’s Everything You Need to Know

Azure Synapse: Unlocking the Power of Your Data

Azure Data Engineer Roadmap in 2025

30+ Azure Data Engineer Interview Questions

Azure Data Engineer Salary in India 2025

What are Kafka Streams and How are they implemented?

What are the Best books for Hadoop?

How to become an Apache Spark Developer?

How to Plan the Capacity of a Hadoop Cluster?

Playlist & Videos

Join the discussionCancel reply

Trending Courses in Big Data

Microsoft Azure Data Engineering Training Cou ...

Microsoft Fabric DP-700 Certification Trainin ...

PySpark Certification Training Course

Big Data Hadoop Certification Training Course

Applied Data Engineering on Azure Cloud Cours ...

Apache Kafka Certification Training Course

ELK Stack Training & Certification

Apache Spark and Scala Certification Training ...

Splunk Certification Training: Power User and ...

Comprehensive MapReduce Certification Trainin ...

Browse Categories

Subscribe to our Newsletter, and get personalized recommendations.