Apache Kafka 101

Mid-term Seminar Report on PIP

Advisor(s): Thoai Nam

Team member(s):
La Quoc Nhut Huan
Dinh Phuc Hung
Tran Tuan Kiet
Pham Duy Tuong Phuoc
Phan Hong Quan

1. Introduction
Explaining why we need Apache Kafka.

2. Kafka Architecture
The underlying structure that makes Kafka performant.

3. Use-case and Demo
How we utilize Apache Kafka in real-life projects.

1. Introduction
1.1. Real-life context
• Businesses often have multiple data sources with varying formats.
• Various target systems use this data for insights; some require immediate processing.
• Engineers must create custom integrations between these sources and targets for a unified business view.
• Direct one-to-one integrations quickly lead to complex systems: connecting N sources to M targets can require up to N × M integrations.

1. Introduction
1.2. Apache Kafka to the rescue
• Using Apache Kafka as an integration layer allows us to decouple data streams and systems:
• Data sources publish their data to Apache Kafka.
• Target systems source their required data from Apache Kafka.
• The data sources do not need to know about the target systems, and vice versa.

1. Introduction
1.3. “When it comes to data event streaming, Kafka is the de facto standard.”
• At its core, Apache Kafka acts as a data event streaming platform:
• A data stream is a potentially unbounded sequence of data.
• A streaming platform allows us to process the data as soon as it arrives.
• Each application is a potential data stream creator.
• Apache Kafka stores these data streams and allows systems to perform stream processing on them.

1. Introduction
1.4. In conclusion: why is Apache Kafka so popular?

- Simplifies data integration
- Ensures scalability and flexibility
- Enables real-time data processing
- Maintains data consistency and reliability

2. Kafka Architecture
Overview

2. Kafka Architecture
2.1. Storage Layer.
• A Kafka deployment is a distributed system called a cluster:
• A Cluster contains multiple Brokers (a Broker is a single Kafka server).
• Each Broker holds multiple Partitions of a Topic.

2. Kafka Architecture
2.1. Storage Layer.

• A Topic is:
• A particular data stream in Kafka.
• We can have as many topics as we want, and a Topic can contain any kind of data format.
• A topic is identified by its name, so topics are usually described as data categories.
• Topics cannot be queried; their data can only be written by Kafka Producers and read by Kafka Consumers.

2. Kafka Architecture
2.1. Storage Layer.
• One Topic is separated into multiple Partitions (see the sketch after this list):
• Partitions are numbered starting from `0` to `N-1`, where `N` is the number of partitions.
• Within a partition, each Message is assigned a sequential number called its Offset.
• Each new piece of data (or Message) can only be appended to the end of a partition.
• Every partition can have multiple replications, so-called `Followers`, spread across different brokers.
• This ensures that if a broker shuts down unexpectedly, another broker containing a replication of the partition can be used as an alternative.
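
The slides show no code, so here is a minimal, illustrative sketch of creating such a topic with Kafka's Java AdminClient; the broker address, topic name, and partition/replication counts are assumptions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders" is a hypothetical topic: 3 partitions, each on 2 brokers (1 leader + 1 follower).
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(List.of(orders)).all().get(); // block until the topic is created
        }
    }
}
```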

2. Kafka Architecture
2.1. Storage Layer.
• On disk, the underlying structure of a Partition consists of multiple append-only log segment files.

2. Kafka Architecture
2.2. Compute Layer.
• The Compute Layer is built around four major pillars (or modules):
• Kafka Producer API
• Kafka Consumer API
• Kafka Streams API
• Kafka Connect API

2. Kafka Architecture
2.2. Compute Layer.
• Kafka Producer (a minimal sketch follows this list):
• Kafka replicates data at the partition level; a Producer pushes data into the primary (leader) partition only.
• Kafka identifies the partition to save producer messages using a round-robin policy (if no key is provided) or by hashing the key the user provides.
• Each producer has its own buffer.
• A client ID is used to distinguish producers from the same host.
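
A minimal Java producer sketch (not from the original slides); the broker address, client ID, topic, key, and value are illustrative assumptions. Records with the same key always hash to the same partition, while keyless records are spread by the partitioner:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "demo-producer");          // distinguishes producers on one host
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: every record with key "user-42" lands in the same partition.
            producer.send(new ProducerRecord<>("orders", "user-42", "{\"item\": \"book\"}"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("delivered to %s[%d] @ offset %d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
            producer.flush(); // drain the producer's buffer before exiting
        }
    }
}
```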

2. Kafka Architecture
2.2. Compute Layer.
• Kafka Consumer (a minimal sketch follows this list):
• Consumers read from the primary (leader) partition by default; since Kafka 2.4 they can also fetch from secondary (replica) partitions.
• Each consumer can read from multiple topics and partitions at the same time.
• Consumers belonging to the same group will not read duplicate data: each partition is assigned to at most one consumer in the group.
• A client ID is used to distinguish consumers from the same host.
• Consumers work in a consumer group; this enables parallelism and increases throughput.
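
A minimal Java consumer sketch (not from the original slides); the broker address, group ID, client ID, and topic are illustrative assumptions. Two copies of this program started with the same `group.id` would split the topic's partitions between them:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");     // members of this group share the partitions
        props.put(ConsumerConfig.CLIENT_ID_CONFIG, "demo-consumer");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Poll the brokers; each record carries its partition and offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s[%d] @ %d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```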

2. Kafka Architecture
2.2. Compute Layer.
• Kafka Connect:
• A free, open-source component of Apache Kafka that serves as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.
• Key concepts of Kafka Connect:
• Connectors: the high-level abstraction that coordinates data streaming by managing tasks.
• Tasks: the implementation of how data is copied to or from Kafka.
• Workers: the running processes that execute connectors and tasks.
• Converters: the code used to translate data between Connect and the end system.
• Transforms: simple logic to alter each message produced by or sent to a connector.
• Dead Letter Queue: how Connect handles connector errors.
2. Kafka Architecture
2.2. Compute Layer.
• Connectors: the high-level abstraction that coordinates data streaming by managing tasks.
• Source Connectors ingest entire databases and stream table updates to Kafka topics.
• Sink Connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or to batch systems, such as Hadoop, …
• Confluent encourages users to leverage existing connectors. However, it is possible to write a new connector plugin from scratch. In distributed mode, a connector is registered with a running Connect worker through its REST API, as sketched below.
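
As an illustrative sketch (assuming a local Connect worker on its default REST port 8083), the FileStreamSource connector that ships with Apache Kafka can be registered like this; the connector name, file path, and topic are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnectorDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector: stream lines of /tmp/input.txt into the "file-lines" topic.
        String config = """
                {
                  "name": "file-source-demo",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/tmp/input.txt",
                    "topic": "file-lines"
                  }
                }
                """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // Connect REST API, default port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```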

2. Kafka Architecture
2.2. Compute Layer.
• Kafka Streams:
• A client library that exposes a high-level API for processing, transforming, and enriching data in real time (a word-count sketch follows).
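
Kafka Streams is a Java library, so a sketch of the classic word-count topology fits here; the broker address, input topic `text-lines`, and output topic `word-counts` are assumptions:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-lines"); // assumed input topic
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+"))) // split lines into words
                .groupBy((key, word) -> word)  // re-key each record by the word itself
                .count();                      // running count per word
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long())); // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```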

3. Use-case and Demo

3. Use-case and Demo
3.1. Kafka Tiered Storage at Uber.

Uber’s Data Pipeline.

3. Use-case and Demo
3.1. Kafka Tiered Storage at Uber.

End-to-end interaction of a Kafka broker with tiered storage.

3. Use-case and Demo

3.2. Demo

