Apache Kafka Interview Notes

Posted on: June 24, 2025 Posted by: rahulgite

What is Apache Kafka?

Apache Kafka is an open-source, distributed event streaming platform originally developed at LinkedIn and open-sourced in 2011. It is primarily used for building real-time data pipelines and streaming applications. Kafka follows a publish-subscribe model on top of a durable, replicated log, which makes it highly scalable and fault-tolerant.

Key Characteristics:

Asynchronous Communication: Producers and consumers are decoupled. Multiple producers can write to a topic while multiple consumers subscribe and read at their own pace.
Durability: Messages are persisted to disk for a configurable retention period (by time or by size), so they remain available to multiple consumers even after being read.
Ordering and Fault Tolerance: Kafka preserves message order within a partition (not across partitions) and replicates each partition across brokers.

Real-World Example:

In an e-commerce application:

A payment service produces transaction records.
A fraud detection service and a notification service consume the same topic to take appropriate actions (e.g., alerting or flagging transactions).
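
The example above can be sketched as a toy in-memory model. This is not the Kafka client API; `Topic`, `publish`, and `poll` are hypothetical names, used only to show that each consuming service keeps its own read position over the same stored records.

```python
from collections import defaultdict


class Topic:
    """Toy single-partition topic: an append-only list plus one read position per group."""

    def __init__(self):
        self.log = []                        # records stay here after being read
        self.positions = defaultdict(int)    # group name -> next index to read

    def publish(self, record):
        self.log.append(record)

    def poll(self, group):
        """Return every record the group has not seen yet and advance its position."""
        start = self.positions[group]
        self.positions[group] = len(self.log)
        return self.log[start:]


payments = Topic()
payments.publish({"txn": 1, "amount": 40})
payments.publish({"txn": 2, "amount": 9500})

# Both services read the full stream; neither removes records for the other.
fraud_seen = payments.poll("fraud-detection")
notify_seen = payments.poll("notification")
flagged = [r["txn"] for r in fraud_seen if r["amount"] > 5000]
```

The 5000 threshold is an arbitrary value chosen for the illustration.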

Kafka Architecture Overview

Kafka’s architecture revolves around a clustered environment composed of the following components:

1. Kafka Cluster

A Kafka cluster consists of one or more Kafka brokers (servers) whose shared metadata is managed by a controller.

2. Kafka Broker

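A broker is responsible for storing data, handling producer and consumer requests, and replicating messages.
For each partition, one broker holds the leader replica, which handles writes and, by default, reads; the other brokers holding that partition act as followers and copy the leader's data.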
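
The leader/follower split can be pictured with a small placement sketch. It is a simplification under stated assumptions: the function name `assign_replicas` and the broker IDs are hypothetical, and real Kafka placement is more involved (for example, it can take racks into account).

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers; the first replica is the leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment


layout = assign_replicas(num_partitions=3, brokers=[101, 102, 103], replication_factor=2)
```

With three brokers and three partitions, every broker leads one partition and follows another, so both load and failure risk are spread across the cluster.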

3. Topics

A topic is a logical channel to which records are published.
Topics are split into multiple partitions to enable scalability.

4. Partitions

Each partition is an ordered, immutable, append-only sequence of records.
Kafka replicates each partition across multiple brokers so that data stays available if a broker fails.

Diagram: Kafka Topic Partitioning

              +-------------+     +-------------+
Producer -->  | Partition 0 | <-- |   Consumer  |
              +-------------+     +-------------+
Producer -->  | Partition 1 | <-- |   Consumer  |
              +-------------+     +-------------+
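
How a record ends up in a particular partition can be sketched as follows. Kafka's default partitioner hashes the key with murmur2; `zlib.crc32` is used here only as a stand-in from the standard library, and `choose_partition` is a hypothetical helper name.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 stands in for Kafka's murmur2 hash)."""
    return zlib.crc32(key) % num_partitions


orders = [(b"customer-1", "created"), (b"customer-2", "created"), (b"customer-1", "paid")]
partitions = {p: [] for p in range(2)}
for key, event in orders:
    partitions[choose_partition(key, 2)].append((key, event))
```

Because the same key always maps to the same partition, all events for `customer-1` land in one partition and keep their relative order there.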

5. Offset

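An offset is a sequential, monotonically increasing number that identifies each message's position within a partition. It is unique only within that partition.
Kafka uses committed offsets to track how far into the log each consumer group has read.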
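
A minimal sketch of offset tracking, assuming a toy `Partition` class (hypothetical, not a Kafka API). It follows the Kafka convention that the committed offset is the offset of the next record to read.

```python
class Partition:
    """Toy partition: the offset of a record is its index in the append-only log."""

    def __init__(self):
        self.log = []

    def append(self, value):
        self.log.append(value)
        return len(self.log) - 1             # offset assigned to the new record

    def read_from(self, offset):
        return list(enumerate(self.log))[offset:]   # (offset, value) pairs


partition = Partition()
offsets = [partition.append(v) for v in ["a", "b", "c", "d"]]

committed = 0                                # next offset the consumer should read
for offset, value in partition.read_from(committed)[:2]:
    committed = offset + 1                   # commit after processing each record

# After a restart the consumer resumes from the committed offset, not from the start.
resumed = partition.read_from(committed)
```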

6. Producers

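Producers create a ProducerRecord specifying:
- Topic (mandatory)
- Message content, i.e. the record value (mandatory)
- Partition, Key, Headers (optional)
When no partition is set, the producer's partitioner picks one; records with the same key go to the same partition.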
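
The shape of such a record can be illustrated with a small data class. `Record` is a hypothetical Python stand-in that mirrors the topic, value, key, partition, and headers fields; it is not the Java `ProducerRecord` class itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for the fields a producer sets on a record."""
    topic: str                           # mandatory
    value: bytes                         # mandatory message content
    key: Optional[bytes] = None          # optional; commonly used to pick the partition
    partition: Optional[int] = None      # optional; explicit partition override
    headers: tuple = ()                  # optional (name, value) pairs


minimal = Record(topic="payments", value=b'{"txn": 1}')
full = Record(topic="payments", value=b'{"txn": 2}', key=b"customer-1",
              partition=0, headers=(("source", b"checkout"),))
```

Leaving out `topic` or `value` raises a `TypeError`, which matches the mandatory/optional split in the list.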

7. Consumers

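Consumers subscribe to topics, process messages, and commit offsets.
They may belong to Consumer Groups to distribute load: within a group, each partition is assigned to exactly one consumer, so consumers beyond the partition count sit idle.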
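
How a group spreads partitions over its members can be sketched with a simple round-robin assignment. This is a simplified illustration, not the actual assignor implementation shipped with Kafka; `assign_round_robin` and the member names are hypothetical.

```python
def assign_round_robin(partitions, members):
    """Give each partition to exactly one group member, cycling through members in order."""
    assignment = {m: [] for m in members}
    ordered = sorted(members)
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment


two_members = assign_round_robin(range(4), ["c1", "c2"])
after_join = assign_round_robin(range(4), ["c1", "c2", "c3"])     # rebalance when c3 joins
overprovisioned = assign_round_robin(range(2), ["c1", "c2", "c3"])
```

In the last case the third consumer receives no partitions, which is why adding consumers beyond the partition count does not increase throughput.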

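8. ZooKeeper

Coordinates Kafka brokers in ZooKeeper-based clusters (newer clusters use KRaft instead)
Manages:
- Controller election (the controller then assigns partition leaders)
- Metadata (topics, partitions)
- Broker registration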

Why Use Kafka?

1. Scalability

Kafka scales horizontally via more brokers and partitions.

2. Durability

Messages remain on disk after consumption (until the retention period expires), allowing for multi-subscriber models.

3. Real-Time Processing

Kafka enables low-latency streaming because it writes and reads its log sequentially, relies on the operating system's page cache, and lets consumers fetch directly from a known offset.

4. High Throughput

A well-provisioned Kafka cluster can handle millions of messages per second, making it well suited to high-traffic scenarios.

5. Retention Policy

Kafka retains data for a configurable period, so a consumer that was offline can catch up after restarting, provided the outage is shorter than the retention window.
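
Time-based retention can be sketched as below. The sketch filters individual records for clarity, whereas real brokers delete whole log segments; `apply_retention` is a hypothetical helper, and the window plays the role of the `retention.ms` topic setting.

```python
def apply_retention(log, now_ms, retention_ms):
    """Drop records older than the retention window, whether or not anyone has read them."""
    return [(ts, value) for ts, value in log if now_ms - ts <= retention_ms]


HOUR_MS = 60 * 60 * 1000
log = [(0, "old"), (2 * HOUR_MS, "recent"), (3 * HOUR_MS, "newest")]

# A consumer that was offline for the last hour still finds the records it missed.
kept = apply_retention(log, now_ms=3 * HOUR_MS, retention_ms=2 * HOUR_MS)
```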

6. Dynamic Configuration

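Topic-level settings can be changed at runtime without restarting brokers, and a topic's partition count can be increased (but not decreased) after creation.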

7. Open Source

Strong community support and wide adoption by companies like LinkedIn, Netflix, Uber, etc.

Role of ZooKeeper in Kafka

In ZooKeeper-based deployments, ZooKeeper acts as a central coordinator for the Kafka cluster.

Responsibilities:

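Controller Election: Elects one broker as the controller; the controller then chooses a new partition leader when a leader fails.
Broker Registration: Maintains the list of active brokers.
Metadata Management: Stores topic and partition metadata, including topic-partition mappings.
Consumer Group Management: Stored consumer offsets for older consumer clients; modern consumers commit offsets to the internal __consumer_offsets topic on the brokers instead.
Health Monitoring: Detects broker failures through expired sessions and triggers new elections when needed.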
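
The failure-handling path can be sketched as choosing a new partition leader from the replicas that are still alive and in sync. This is a simplified illustration that assumes out-of-sync replicas are never promoted; `elect_leader` and the broker IDs are hypothetical.

```python
def elect_leader(replicas, in_sync, live_brokers):
    """Return the first replica that is both alive and in sync, or None if there is no candidate."""
    for broker in replicas:
        if broker in live_brokers and broker in in_sync:
            return broker
    return None


replicas = [101, 102, 103]      # replica list for one partition; 101 was the leader
in_sync = {101, 103}            # 102 has fallen behind
live = {102, 103}               # 101 has just failed

new_leader = elect_leader(replicas, in_sync, live)
```

Broker 102 is skipped even though it is alive, because promoting a replica that has fallen behind would lose acknowledged messages.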

Kafka Without ZooKeeper

Kafka Evolution:

Prior to Kafka 2.8.0, ZooKeeper was mandatory.
Kafka 2.8.0 shipped an early-access KRaft mode (the KIP-500 initiative), which lets Kafka manage its own metadata without ZooKeeper.

Benefits:

Simplified architecture
Fewer moving parts
Faster controller failover and better scalability to large partition counts

Current Status:

KRaft has been production-ready since Kafka 3.3, and Kafka 4.0 (released in March 2025) removed ZooKeeper support entirely, so 4.x clusters run in KRaft mode only.

Diagram: Kafka Evolution Without ZooKeeper

[Old Architecture]         [New Architecture]
Kafka <--> ZooKeeper   =>   Kafka (KRaft: self-managed metadata)

Summary Table

Component	Description
Producer	Publishes messages to topics
Consumer	Subscribes to topics and processes messages
Broker	Kafka server handling requests and data storage
Topic	Logical channel for messages
Partition	Unit of parallelism and ordering within a topic
Offset	Sequential position of a message within a partition
ZooKeeper	Cluster coordinator (replaced by KRaft; removed in Kafka 4.0)
