Generalized Streaming Data Architecture

Pravin Y Pawar
Streaming Data Systems
Defined again
• Streaming data systems are layered systems that rely on several loosely coupled subsystems
• The layered, loosely coupled design helps in achieving high availability
• It helps in managing the system
• It helps in keeping the cost under control

• All subsystems / components can reside on individual physical servers or can
be co-hosted on one or more servers
• Not all components need to be present in every system

Contents adapted from Real-Time Analytics, Byron Ellis

Streaming Data System Components

• Streaming Data System Architecture Components

• Collection
• Data Flow
• Processing
• Storage
• Delivery
Generalized Architecture

Altered version, original concept : Andrew G. Psaltis

Architecture Components (1)
Collection System

• Communication is mostly over a TCP/IP network using HTTP

• Website log data was the earliest use case
• The W3C standard log data format was used
• Newer formats such as JSON, Avro and Thrift are available now

• Collection happens at specialized servers called edge servers

• The collection process is usually application specific
• Newer servers integrate directly with data flow systems
• Older servers may or may not integrate directly with data flow systems
Architecture Components (2)
Data Flow Tier
• Separation between the collection tier and the processing tier is required
• The rates at which these systems work are different
• What if one system is not able to keep pace with the other?

• An intermediate layer is required that takes responsibility for

• accepting messages / events from the collection layer
• providing those messages / events to the processing layer

• Real-time interface to the data layer for both producers and consumers of data
• Helps in guaranteeing “at least once” delivery semantics
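
The buffering and "at least once" behaviour described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list stands in for the data flow tier; the names `FlowBuffer`, `publish`, `poll`, `commit` and `consume` are hypothetical and do not belong to any real messaging library. The consumer acknowledges a message only after processing it, so a crash before the acknowledgement leads to the same message being delivered again.

```python
class FlowBuffer:
    """Toy data flow tier: an append-only log plus one committed offset."""

    def __init__(self):
        self._log = []          # messages accepted from the collection tier
        self._committed = 0     # index of the next message not yet acknowledged

    def publish(self, message):
        # Called by the collection tier (producer).
        self._log.append(message)

    def poll(self):
        # Called by the processing tier (consumer): returns (offset, message)
        # for the oldest unacknowledged message, or None if fully caught up.
        if self._committed < len(self._log):
            return self._committed, self._log[self._committed]
        return None

    def commit(self, offset):
        # Acknowledge that the message at `offset` was fully processed.
        self._committed = max(self._committed, offset + 1)


def consume(buffer, handler):
    """Process pending messages; commit only after the handler succeeds."""
    while (item := buffer.poll()) is not None:
        offset, message = item
        handler(message)        # if this raises, the offset is never committed
        buffer.commit(offset)   # so the same message is delivered again later
```

Because a message may be seen twice after a failure, handlers in such a design generally need to tolerate duplicates.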
Architecture Components (3)
Processing / Analytics Tier
• Based on the “data locality” principle
• Move the software / code to the location of the data
• Relies on distributed processing of data
• The framework does most of the heavy lifting: data partitioning, job
scheduling and job management
• Available frameworks
• Apache Storm
• Apache Spark (Streaming)
• Apache Kafka Streams, etc.
Architecture Components (4)
Storage Tier
• In memory or permanent
• Usually in memory, as data is processed once
• But there can be use cases where events / outcomes need to be persisted as
well

• NoSQL databases are becoming a popular choice for permanent storage

• MongoDB
• Cassandra

• Usage varies with the use case; there is still no database that fits all use cases
Architecture Components (5)
Delivery Layer
• Usually a web-based interface
• Nowadays mobile interfaces are becoming quite popular
• Dashboards are built with streaming visualizations that get continuously
updated as underlying events are processed
• HTML + CSS + JavaScript + WebSockets can be used to create interfaces
and update them
• HTML5 elements can be used to render interfaces
• SVG and PDF formats are used to render the outcomes

• Monitoring / alerting use cases

• Feeding data to downstream applications
Lambda Architecture

Pravin Y Pawar
Lambda Architecture
Defined
• Proposed by Nathan Marz, based on his experience working on distributed
data processing systems at BackType and Twitter
• A generic, scalable and fault-tolerant data processing architecture

• The Lambda Architecture aims to satisfy the needs for a robust system that
• is fault-tolerant, both against hardware failures and human mistakes
• is able to serve a wide range of workloads and use cases
• serves use cases in which low-latency reads and updates are required.

• The resulting system should be linearly scalable, and it should scale out
rather than up.
Lambda Architecture (2)
Block Diagram

Source : http://lambda-architecture.net/
Lambda Architecture (3)
Basic Flow of Events
1. All data entering the system is dispatched to both the batch layer and the
speed layer for processing.
2. The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data)
(ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a
low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving
layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views
and real-time views.
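
Step 5 of the flow above can be illustrated with a small sketch. It assumes a page-view counting use case and events shaped as dictionaries with a `page` field; the function names `batch_view`, `realtime_view` and `query` are hypothetical and chosen only for this example.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the whole master dataset."""
    return Counter(event["page"] for event in master_dataset)


def realtime_view(recent_events):
    """Speed layer: counts for events the last batch run has not yet covered."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, realtime):
    """Serving side: answer a query by merging the two views."""
    return batch[page] + realtime[page]


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}]          # arrived after the last batch run
print(query("/home", batch_view(master), realtime_view(recent)))  # 3
```

Once the next batch run absorbs the recent events into the master dataset, the corresponding real-time counts would be discarded, which is how the batch layer eventually corrects the speed layer.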
Architectural Components (1)
Batch layer
• New data comes in continuously, as a feed to the data system.
• It gets fed to the batch layer and the speed layer simultaneously.
• The batch layer looks at all the data at once and eventually corrects the data
in the stream layer.
• This is where a lot of ETL and traditional data warehousing is found.
• This layer runs on a predefined schedule, usually once or twice a day.

• The batch layer has two very important functions:

• To manage the master dataset
• To pre-compute the batch views

Source : https://databricks.com/glossary/lambda-architecture
Architectural Components (2)
Speed Layer (Stream Layer)
• This layer handles the data that has not yet been delivered in the batch view
due to the latency of the batch layer.
• In addition, it deals only with recent data, creating real-time views so that
the user gets a complete view of the data.

• The speed layer produces its outputs through an enrichment process and
helps the serving layer reduce the latency of responding to queries.
• As its name suggests, the speed layer has low latency because it deals
with real-time data only and has less computational load.
Architectural Components (3)
Serving Layer
• The outputs from the batch layer, in the form of batch views, and from the speed
layer, in the form of near-real-time views, are forwarded to the serving layer.

• This layer indexes the batch views so that they can be queried with low latency
on an ad-hoc basis.
Applications of Lambda Architecture

• User queries need to be served on an ad-hoc basis using the immutable
data storage.
• Quick responses are required, and the system should be capable of handling
various updates in the form of new data streams.
• None of the stored records should be erased, and the system should allow
updates and new data to be added to the database.
Pros and Cons of Lambda Architecture

• Pros
• The batch layer of the Lambda architecture manages historical data in
fault-tolerant distributed storage, which keeps the possibility of errors low
even if the system crashes.
• It is a good balance of speed and reliability.
• It is a fault-tolerant and scalable architecture for data processing.

• Cons
• It can result in coding overhead due to the comprehensive processing
involved in both the batch and speed layers.
• It re-processes the data in every batch cycle, which is not beneficial in
certain scenarios.
• Data modelled with the Lambda architecture is difficult to migrate or
reorganize.
Source :
https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data
Kappa Architecture

Pravin Y Pawar
Drawback of Lambda Architecture

• The problem with the Lambda Architecture is that maintaining code that
needs to produce the same result in two complex distributed systems is
exactly as painful as it seems like it would be.

• Programming in distributed frameworks like Storm and Hadoop is complex.
Inevitably, code ends up being specifically engineered toward the framework
it runs on. The resulting operational complexity of systems implementing the
Lambda Architecture is the one thing that seems to be universally agreed on
by everyone doing it.

Interesting read : https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Kappa Architecture
Defined
• First described by Jay Kreps at LinkedIn.
• It focuses on processing data only as a stream.
• It is not a replacement for the Lambda Architecture, except where your
use case fits.
• In this architecture, incoming data is streamed through a real-time layer and
the results are placed in the serving layer for queries.
Kappa Architecture (2)
Block Diagram
Kappa Architecture (3)

• The idea is to handle both real-time data processing and continuous
reprocessing in a single stream processing engine.
• This requires that the incoming data stream can be replayed (very quickly),
either in its entirety or from a specific position.
• If there are any code changes, a second stream process replays all
previous data through the latest real-time engine and replaces the data stored
in the serving layer.
• This architecture attempts to simplify things by keeping only one code base
rather than managing one each for the batch and speed layers of the Lambda
Architecture.
• In addition, queries only need to look in a single serving location instead of
going against batch and speed views.
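
The replay idea above can be sketched as follows. This is a minimal illustration, assuming the retained stream is a plain Python list and the serving layer is a dictionary; `replay`, `count_v1` and `count_v2` are hypothetical names, and the bot-filtering rule is an invented example of a code change.

```python
def replay(log, process, from_offset=0):
    """Rebuild a serving view by streaming the retained log through `process`."""
    view = {}
    for event in log[from_offset:]:
        process(view, event)
    return view


def count_v1(view, event):
    # Original job: count events per user.
    view[event["user"]] = view.get(event["user"], 0) + 1


def count_v2(view, event):
    # Changed job: same count, but ignore events flagged as bots.
    if not event.get("bot", False):
        count_v1(view, event)


log = [{"user": "ann"}, {"user": "bob", "bot": True}, {"user": "ann"}]
serving = replay(log, count_v1)      # view produced by the old code
serving = replay(log, count_v2)      # code changed: replay, then swap the view
print(serving)                       # {'ann': 2}
```

The same `replay` function serves both live processing and reprocessing, which is the single-code-base property the slide describes.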

Interesting read : https://www.talend.com/blog/2017/08/28/lambda-kappa-real-time-big-data-architectures/

Pros and Cons of Kappa architecture

• Pros
• Kappa architecture can be used to develop data systems that are online
learners and therefore don’t need the batch layer.
• Re-processing is required only when the code changes.
• It can be deployed with fixed memory.
• It can be used for horizontally scalable systems.
• Fewer resources are required, as the machine learning is done on
a real-time basis.

• Cons
• The absence of a batch layer might result in errors during data processing or
while updating the database, which requires an exception manager
to reprocess or reconcile the data.

Interesting read :
https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-
Real Time System Characteristics

Pravin Y Pawar
BITS Pilani, Pilani Campus

BA ZC420, Real Time Analytics

Lecture No. 1.4
Agenda

• Distinguishing Features of Streaming Data

• Data always in motion
• Data structuring
• Data cardinality

• Features of Real-Time Architecture

• High Availability
• Low Latency
• Horizontal Scalability


Distinguishing Features of Streaming Data
• Data always in motion
• Streaming data
• is generated continuously
• is always flowing

• Two critical requirements

• The collection system should be robust
• Processing should be able to keep pace with collection

• Solutions
• Horizontal scalability
• Algorithmic handling of streaming data


Distinguishing Features of Streaming Data - II

• Data Structuring
• Loosely structured

• Various data sources

• Structured and unstructured data
• Forming a joint schema is difficult
• For example, social media streams

• Young, evolving projects

• Add many dimensions to the data
• Collect as much data as possible to enable interesting analysis


Distinguishing Features of Streaming Data - III

• Data Cardinality
• Number of unique values in the data
• Very few values appear often; many are very sparse

• Challenges with streaming data

• Processing
• Streaming data can be processed only once
• Difficult to identify the state of the data
• Batch processing on processed data can be used for estimation

• Storage
• Memory requirements are high while processing data
• A linear amount of space is required for storing state information

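
The cardinality and storage points on this slide can be made concrete with a short sketch. It assumes a small, invented stream of page names; `exact_cardinality` is a hypothetical helper. It counts unique values in a single pass, and the set it keeps as state grows with the number of unique values, which is the linear space cost mentioned above.

```python
from collections import Counter


def exact_cardinality(stream):
    """Count unique values in one pass; the state set grows with that count."""
    seen = set()
    for value in stream:
        seen.add(value)
    return len(seen)


# Skewed stream: one value dominates, the rest appear once each.
stream = ["home"] * 6 + ["a", "b", "c", "d"]
print(exact_cardinality(stream))        # 5
print(Counter(stream).most_common(1))   # [('home', 6)]
```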

Features of Real-Time Architecture
• High Availability
• Key distinguishing factor from batch / BI systems
• Very critical for collection, flow and processing systems

• Two approaches
• Distribution
• Use multiple physical servers to distribute the load

• Replication
• Write to several machines
• Master-slave configuration
• Automatic failover
• Masterless configuration
• Recovery is difficult in case of failure

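
A toy sketch of the replication approach with a master-slave configuration follows. It assumes each machine is just an in-memory dictionary and that failures are signalled by calling a method; `ReplicatedStore` and its methods are hypothetical names, and the promotion rule (first surviving node in sorted order) is an arbitrary choice made for this example.

```python
class ReplicatedStore:
    """Toy master-slave replication: each write is copied to every live node."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}   # one dict per machine
        self.alive = set(node_names)
        self.master = node_names[0]

    def write(self, key, value):
        # Replication: write to several machines, not just the master.
        for name in self.alive:
            self.nodes[name][key] = value

    def fail(self, name):
        # Simulate a crash; promote a surviving node if the master died.
        self.alive.discard(name)
        if name == self.master and self.alive:
            self.master = sorted(self.alive)[0]          # automatic failover

    def read(self, key):
        return self.nodes[self.master][key]


store = ReplicatedStore(["node-a", "node-b", "node-c"])
store.write("clicks", 41)
store.fail("node-a")                        # the master goes down
print(store.master, store.read("clicks"))   # node-b 41
```

The sketch does not handle a node rejoining or the case where every node has failed.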

Features of Real-Time Architecture - II
• Low Latency
• Time taken to service a request

• Streaming system latency

• Time taken to process an event from the moment it entered the system

• Many streaming systems work in batches

• Micro-batching: processing in very small batches, on the order of milliseconds
• Collection systems care about the first definition of latency
• Flow and processing components care about the second one

• Tradeoff between speed and safety

• If data can be safely lost, latency can be very small
• If not, the system needs to live with a lower limit on latency

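
Micro-batching, as mentioned on this slide, can be sketched with a small generator. This is an illustration under stated assumptions, not how any particular framework implements it: `micro_batches`, `max_size`, `max_wait_s` and the injectable `clock` parameter are hypothetical names, and the time limit is only checked when a new event arrives.

```python
import time


def micro_batches(events, max_size, max_wait_s, clock=time.monotonic):
    """Group an event iterator into small batches.

    A batch is emitted when it holds `max_size` events or when `max_wait_s`
    seconds have passed since its first event, whichever comes first. The
    time check only runs when a new event arrives.
    """
    batch, started = [], None
    for event in events:
        if not batch:
            started = clock()          # the batch window opens with its first event
        batch.append(event)
        if len(batch) >= max_size or clock() - started >= max_wait_s:
            yield batch
            batch = []
    if batch:
        yield batch                    # flush whatever is left when the stream ends


for batch in micro_batches(range(7), max_size=3, max_wait_s=60.0):
    print(batch)
```

Smaller values of `max_size` and `max_wait_s` lower the per-event latency at the cost of more frequent, smaller units of work.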

Features of Real-Time Architecture - III
• Horizontal Scalability
• Adding more physical servers to a cluster
• Need to consider the amount of coordination required between the systems
• Use of partitioning techniques
• Use the principle of data locality: move the program to the data

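
One common partitioning technique is hashing each event key to a partition, sketched below. The sketch assumes events are `(key, value)` pairs with string keys; `partition_for` and `route` are hypothetical names, and CRC32 is used only because it is a stable hash available in the Python standard library.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32, not Python's hash())."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Split events so each partition (server) owns a disjoint set of keys."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [("user-1", 10), ("user-2", 5), ("user-1", 7), ("user-3", 1)]
for index, part in enumerate(route(events, num_partitions=3)):
    print(index, part)
```

Because all events for a key land on the same partition, per-key state needs no coordination between servers. With this simple modulo scheme, changing `num_partitions` remaps most keys, which is a known limitation of the approach.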

Reference

• Real-Time Analytics, Byron Ellis

• Chapter 1: Introduction to Streaming Data
• Chapter 2: Designing Real-Time Streaming Architecture


