Apache Kafka 101
1. Introduction
1.1. Real-life context
• Businesses often have multiple data sources with varying formats.
• Various target systems use this data for insights; some require immediate processing.
• Engineers must create custom integrations between these sources and targets to achieve a unified view of the business.
• Direct one-to-one integrations can lead to complex systems.
1.2. Apache Kafka to the rescue
• Placing Apache Kafka as an integration layer allows us to decouple data streams and systems:
• Data sources will publish their data to Apache Kafka.
• Target systems will source their required data from Apache Kafka.
• Data sources do not need to know about the target systems, and vice versa.
1.3. “When it comes to data event streaming, Kafka is the de facto standard.”
• At its core, Apache Kafka acts as a data event streaming platform:
• A data stream is a potentially unbounded sequence of data.
• A streaming platform allows us to process the data as soon as it arrives.
• Each application is a potential data stream creator.
• Apache Kafka stores these data streams and allows systems to perform stream processing.
1.4. In conclusion: Why is Apache Kafka so popular?
2. Kafka Architecture
The underlying structure that makes Kafka performant.
Overview
2.1. Storage Layer.
• One Kafka distributed system is called a Cluster:
• A Cluster contains multiple Brokers (Kafka servers).
• Each Broker holds multiple Partitions of a Topic.
• A Topic is:
• A particular data stream in Kafka.
• We can have as many topics as we want, and a Topic can contain any kind of data format.
• A topic is identified by its name, so topics are usually described as data categories.
• Topics cannot be queried; their data can only be written by Kafka Producers or read by Kafka Consumers (see the example below).
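To make this concrete, topics are typically created with the AdminClient API (or the kafka-topics.sh CLI) rather than queried. A minimal Java sketch, assuming a broker at localhost:9092 and a hypothetical topic named "orders" with 3 partitions and a replication factor of 2:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, each replicated on 2 brokers.
            admin.createTopics(List.of(new NewTopic("orders", 3, (short) 2))).all().get();
        }
    }
}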
• One Topic is separated into multiple Partitions:
• Partitions are numbered from `0` to `N-1`, where `N` is the number of partitions.
• Within each partition, every message is assigned an incremental id called its Offset (sketched below).
• Each new piece of data (or Message) can only be appended to the end of a partition.
• Every partition has multiple replicas, so-called `Followers`, spread across different brokers.
• This ensures that if a broker shuts down unexpectedly, another broker holding a replica of the partition can serve as an alternative.
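A minimal sketch of a topic with `N = 3` partitions (bracketed numbers are offsets; the topic name is hypothetical):

Topic `orders`
  Partition 0: [0][1][2][3]        <- new messages are appended here
  Partition 1: [0][1][2]
  Partition 2: [0][1][2][3][4]

Offsets are independent per partition, which is why ordering is only guaranteed within a single partition.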
• The underlying structure of a Partition consists of multiple log segment files.
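For example, on a broker's disk, partition 0 of the hypothetical `orders` topic is a directory of segments, each named after the offset of its first message (file names and offsets below are illustrative):

orders-0/
  00000000000000000000.log      <- messages with offsets 0 .. 170052
  00000000000000000000.index    <- maps offsets to byte positions in the .log file
  00000000000000170053.log      <- next segment, starting at offset 170053
  00000000000000170053.index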
2.2. Compute Layer.
• The Compute Layer is built around four major pillars (or modules):
• Kafka Producer API
• Kafka Consumer API
• Kafka Streams API
• Kafka Connect API
• Kafka Producer:
• Kafka replicates at the partition level; a producer pushes data to the primary (leader) partition only.
• Kafka picks the partition that stores each producer message by a round-robin policy (if no key is provided) or by hashing the key the user provides.
• Each producer has its own buffer.
• A client id is used to distinguish producers running on the same host (see the sketch below).
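A minimal producer sketch in Java, assuming a broker at localhost:9092 and the hypothetical `orders` topic; it sends one keyless and one keyed message and prints where each landed:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("client.id", "producer-host1-app1");      // distinguishes producers on the same host
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the partitioner spreads messages across partitions.
            RecordMetadata a = producer.send(new ProducerRecord<>("orders", "no-key message")).get();
            // With a key: "user-42" is hashed, so it always lands in the same partition.
            RecordMetadata b = producer.send(new ProducerRecord<>("orders", "user-42", "keyed message")).get();
            System.out.printf("partition=%d offset=%d | partition=%d offset=%d%n",
                    a.partition(), a.offset(), b.partition(), b.offset());
        }
    }
}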
• Kafka Consumer:
• Consumers can read from both primary (leader) and secondary (replica) partitions.
• Each consumer can read from multiple topics and partitions at the same time.
• Consumers belonging to the same group will not read duplicate data: each partition is assigned to only one consumer in the group.
• A client id is used to distinguish consumers running on the same host.
• Consumers work in a consumer group; this enables parallelism, which increases throughput (see the sketch below).
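A minimal consumer sketch in Java, assuming the same broker; the topic names and group id are hypothetical. Every consumer started with the same group.id splits the partitions among the group members:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "order-processors");          // consumers sharing this id form one group
        props.put("client.id", "consumer-host1-1");         // distinguishes consumers on the same host
        props.put("auto.offset.reset", "earliest");         // start from the oldest message if no offset is stored
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One consumer may subscribe to several topics at once.
            consumer.subscribe(List.of("orders", "payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                            r.topic(), r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}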
• Kafka Connect:
• A free, open-source component of Apache Kafka that serves as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.
• Key concepts of Kafka Connect:
• Connectors: The high-level abstraction that coordinates data streaming by managing tasks
• Tasks: The implementation of how data is copied to or from Kafka
• Workers: The running processes that execute connectors and tasks
• Converters: The code used to translate data between Connect and the end system
• Transforms: Simple logic to alter each message produced by or sent to a connector
• Dead Letter Queue: How Connect handles connector errors
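As an illustration, a connector is registered by POSTing its JSON config to a Connect worker's REST API. A sketch using the FileStreamSource connector that ships with Apache Kafka, assuming a worker at localhost:8083 (connector name, file path, and topic are hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // FileStreamSource tails a file and streams each new line into a topic.
        String config = """
            {
              "name": "local-file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/input.txt",
                "topic": "file-lines"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // assumed Connect worker address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}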
• Connectors: The high-level abstraction that coordinates data streaming by managing tasks:
• Source Connectors ingest entire databases and stream table updates to Kafka topics.
• Sink Connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or to batch systems, such as Hadoop.
• Confluent encourages users to leverage existing connectors. However, it is possible to write a new connector plugin from scratch, following the connector development workflow.
• Kafka Streams:
• A client library that exposes a high-level API for processing, transforming, and enriching data in real time (see the sketch below).
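A minimal Kafka Streams sketch in Java: it reads the hypothetical `orders` topic, uppercases each value, and writes the result to another topic (broker address, application id, and topic names are assumptions):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enricher");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every message from "orders", transform it, write to "orders-uppercased".
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase())
              .to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}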
3. Use-case and Demo
3.1. Kafka Tiered Storage at Uber.
3.2. Demo