18 Sep 2015

Apache Kafka With Spark Streaming: Real-Time Analytics Redefined


Apache projects like Kafka and Spark continue to be popular when it comes to stream processing. Engineers have started integrating Kafka with Spark Streaming to benefit from the advantages both offer. This webinar discusses the advantages of Kafka, its components, and common use cases, along with Kafka-Spark integration.

The video above covers the following topics; you can check out the presentation at the end of the post:

  • What is Kafka?

  • Why do we need Kafka?

  • Kafka components

  • How does Kafka work?

  • Which companies are using Kafka?

  • Kafka and Spark integration

What is Kafka & why do we need it?

Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. Kafka’s objective is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka’s design is predominantly based on a distributed commit log.

Kafka is used in scenarios where its features like scalability, data partitioning, low latency, and the ability to handle large volumes of data can be utilized. Basically, Kafka is a good fit for data-integration use cases.

Kafka is adopted in the following use cases:

  • Log transportation

  • Real-time activity streams

  • CPU/IO/memory usage metrics

  • Time taken to load a web page

  • Time taken by multiple services while building a web page

  • Number of requests

  • Number of hits on a particular page/URL
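As a rough sketch of how such metric and activity events map onto Kafka's publish-subscribe model, here is a minimal in-memory stand-in for a topic. This is plain Python with no Kafka client; the event names and fields are illustrative, not from the webinar:

```python
from collections import defaultdict

# A minimal in-memory stand-in for a Kafka topic: producers append
# metric events, a consumer reads and aggregates them.
topic = []  # append-only message log

def publish(event):
    topic.append(event)

# Producers publish the kinds of events listed above.
publish({"type": "page_load_ms", "page": "/home", "value": 120})
publish({"type": "page_load_ms", "page": "/home", "value": 180})
publish({"type": "request", "page": "/home"})
publish({"type": "request", "page": "/about"})

# A consumer drains the log and computes simple aggregates.
hits = defaultdict(int)
load_times = defaultdict(list)
for event in topic:
    if event["type"] == "request":
        hits[event["page"]] += 1
    elif event["type"] == "page_load_ms":
        load_times[event["page"]].append(event["value"])

print(hits["/home"])                                        # 1
print(sum(load_times["/home"]) / len(load_times["/home"]))  # 150.0
```

The key idea carried over from Kafka is the decoupling: producers only append events; any number of consumers can later read the same log and compute whatever aggregates they need.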


Why did LinkedIn build Kafka?

Kafka was originally developed at LinkedIn and became an Apache project in July 2011. It is used by LinkedIn for applications including log aggregation, queuing, and real-time monitoring and event processing.

Before delving into the reasons behind the creation of Kafka, we must understand the processes that happen at LinkedIn.

LinkedIn uses Apache Kafka as a central publish-subscribe log for integrating data between applications, stream processing, and Hadoop data ingestion.

Kafka benchmarks

LinkedIn relies heavily on the scalability and reliability of Kafka and a surrounding ecosystem of both open-source and internal components. A test was conducted in which a Kafka cluster was set up on three machines, with six drives directly mounted with no RAID (JBOD style). The test results were presented in the webinar slides.

How fast is Kafka?

Kafka sustains up to two million writes per second on three inexpensive machines. This was achieved using three producers on three different machines with 3x asynchronous replication, and only one producer per machine because the NIC was already saturated. Kafka has shown sustained throughput as stored data grows. Results from a slightly different test configuration than the 2M writes/sec run above were also presented in the webinar slides.

Why is Kafka so fast?

Fast writes:

Apache Kafka persists all data to disk, but writes first land in the operating system’s page cache, i.e. RAM; with the hardware and OS tuned appropriately, these buffered sequential appends are very fast.
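A minimal sketch of this write path, assuming a single partition log file (the file name and record format are made up; real Kafka segment files differ):

```python
import os
import tempfile

# Sketch of Kafka's write path: messages are appended sequentially to a
# log file. Each write() lands in the OS page cache (RAM) and is flushed
# to disk later by the kernel, which is why appends are fast. Calling
# os.fsync() per message would force the expensive disk write; Kafka
# instead leans on replication for durability.
log_path = os.path.join(tempfile.mkdtemp(), "partition-0.log")

with open(log_path, "ab") as log:
    for i in range(1000):
        log.write(f"message-{i}\n".encode())  # buffered, sequential append
        # note: no log.flush()/os.fsync() here -- data sits in the page cache

with open(log_path, "rb") as log:
    lines = log.readlines()
print(len(lines))  # 1000
```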

Fast reads:

Kafka efficiently transfers data from the page cache to a network socket through the sendfile() system call on Linux.
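A small sketch of this zero-copy read path, using Python's socket.sendfile() (which delegates to os.sendfile() where available) and a socket pair standing in for a consumer connection:

```python
import socket
import tempfile

# sendfile() hands file data from the page cache straight to the socket
# without copying it through user space -- the same mechanism Kafka uses
# to serve consumers.
payload = b"x" * 16384

src = tempfile.TemporaryFile()
src.write(payload)
src.seek(0)

server, client = socket.socketpair()
sent = server.sendfile(src)   # kernel moves file -> socket directly
server.close()

received = b""
while True:
    chunk = client.recv(4096)
    if not chunk:              # empty read: peer closed, transfer done
        break
    received += chunk
client.close()

print(sent == len(payload), received == payload)  # True True
```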

The combination of fast writes and fast reads results in a fast Kafka!

On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as the brokers will be serving data entirely from cache.

Example: According to loggly.com, which runs Kafka on Amazon AWS, “99.99999% of the time the data is coming from disk cache and RAM; only very rarely do we hit the disk.”

“One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across three brokers.”

How Kafka Works

The webinar slides include a pictorial representation of how Kafka works.

Kafka Components

In Kafka, a message stream is defined by a topic, divided into one or more partitions. Replication happens at the partition level and each partition has one or more replicas.
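A quick sketch of how a message key maps to one of a topic's partitions. Kafka's default partitioner hashes the key modulo the partition count (it uses a murmur2 hash; crc32 here is just an illustrative stand-in):

```python
import zlib

NUM_PARTITIONS = 3  # illustrative partition count

def partition_for(key: bytes) -> int:
    # Hash the key, then map it onto one of the partitions.
    return zlib.crc32(key) % NUM_PARTITIONS

# The same key always lands in the same partition,
# which preserves per-key message ordering.
p1 = partition_for(b"user-42")
p2 = partition_for(b"user-42")
print(p1 == p2, 0 <= p1 < NUM_PARTITIONS)  # True True
```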

The replicas are assigned evenly to different servers (called brokers) in a Kafka cluster. Each replica maintains a log on disk. Published messages are appended sequentially in the log and each message is identified by a monotonically increasing offset within the log.

When a consumer subscribes to a topic, it keeps track of an offset in each partition for consumption and uses it to issue fetch requests to the broker.
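The log-and-offset mechanics described above can be sketched in a few lines of plain Python. This is a toy model, not the Kafka protocol: a partition assigns monotonically increasing offsets on append, and the consumer tracks its own offset to issue fetches:

```python
class Partition:
    def __init__(self):
        self.log = []                      # append-only message log

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1           # offset of the new message

    def fetch(self, offset, max_messages=10):
        return self.log[offset:offset + max_messages]

class Consumer:
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0                    # position tracked by the consumer

    def poll(self):
        batch = self.partition.fetch(self.offset)
        self.offset += len(batch)          # advance past consumed messages
        return batch

p = Partition()
for i in range(3):
    p.append(f"msg-{i}")

c = Consumer(p)
print(c.poll())   # ['msg-0', 'msg-1', 'msg-2']
print(c.poll())   # [] -- caught up
p.append("msg-3")
print(c.poll())   # ['msg-3']
```

Because the consumer, not the broker, owns the offset, it can rewind and re-read old messages at any time, which is one of the properties that distinguishes Kafka from traditional message queues.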

Putting it all together, the webinar slides show what a complete Kafka cluster looks like.

Real-time Analysis with Spark Streaming

There are numerous data ingestion sources besides Kafka, such as Flume, HDFS, Kinesis, and Twitter. The webinar slides show how an ingestion source like Kafka fits into a streaming application and how the results are stored.
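To illustrate what Spark Streaming does with a Kafka feed, here is a pure-Python sketch of the micro-batch model: incoming events are grouped into small batches and a computation (here, a count by log level) runs on each batch. No Spark APIs are used; this only illustrates the model:

```python
from collections import Counter

def micro_batches(events, batch_size):
    # Group the incoming stream into fixed-size micro-batches,
    # the way Spark Streaming slices a stream into time intervals.
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

stream = ["error", "info", "error", "warn", "info", "error"]

results = []
for batch in micro_batches(stream, batch_size=3):
    results.append(Counter(batch))       # per-batch aggregation

print(results[0]["error"])  # 2
print(results[1]["error"])  # 1
```

In a real deployment, the batching would be driven by time intervals rather than counts, the source would be a Kafka topic, and the per-batch computation would run distributed across a Spark cluster.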

Streaming In Detail:

The step-by-step streaming flow is illustrated in the webinar slides.

Kafka adoption and use cases

Here’s a list of companies that have adopted Kafka and their corresponding use cases:

  • LinkedIn – Activity streams, operational metrics, data bus
  • Netflix – Real-time monitoring and event processing
  • Twitter – Part of their Storm real-time data pipelines
  • Spotify – Log delivery (from 4h down to 10s), Hadoop
  • Loggly – Log collection and processing
  • Mozilla – Telemetry data

Other companies like Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber and many more have also adopted Kafka.

Webinar Presentation

Questions asked during the webinar

1. Does Kafka help to build lineage?
No, Kafka just publishes the data; it does not track lineage.
2. What is an example configuration for a scalable system?
It depends on the size of the messages that will be passed, as well as the RAM available on the nodes where Kafka will be set up.
3. Is aggregation possible in Kafka?
Aggregation is not something that Kafka was designed to do.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

Apache Kafka – What you need for a career in real-time analytics

Get Started With Real-Time Analytics With Apache Kafka
