Lecture 5

Uploaded by

2022da04739
Copyright
© © All Rights Reserved
Generalized Streaming Data Architecture

Pravin Y Pawar
Streaming Data Systems
Defined again
• … are layered systems that rely on several loosely coupled systems
• Helps in achieving high availability
• Helps in managing the system
• Helps in keeping costs under control

• All subsystems / components can reside on individual physical servers or can
be co-hosted on one or more servers
• Not all components need to be present in every system

Contents adapted from Real-Time Analytics, Byron Ellis


Streaming Data System Components

• Streaming Data System Architecture Components


• Collection
• Data Flow
• Processing
• Storage
• Delivery
Generalized Architecture

Altered version, original concept : Andrew G. Psaltis


Architecture Components (1)
Collection System

• Mostly communication over a TCP/IP network using HTTP
• Website log data was the use case in the early days
• The W3C standard log data format was used
• Newer formats like JSON, Avro and Thrift are available now

• Collection happens at specialized servers called edge servers
• The collection process is usually application specific
• New servers integrate directly with data flow systems
• Old servers may or may not integrate directly with data flow systems
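As a rough illustration of the move from W3C-style log lines to JSON events, the sketch below converts one space-delimited log line into a JSON record. The `#Fields` directive follows the W3C Extended Log File Format; the sample line, the chosen fields and the function name `w3c_line_to_json` are assumptions made for this example only.

```python
import json

def w3c_line_to_json(fields_directive: str, line: str) -> str:
    """Convert one W3C extended-log line into a JSON string.

    fields_directive is the '#Fields: ...' header line that names the columns.
    """
    names = fields_directive.split(":", 1)[1].split()
    values = line.split()
    if len(names) != len(values):
        raise ValueError("log line does not match the #Fields directive")
    return json.dumps(dict(zip(names, values)))

# Hypothetical sample data, not taken from a real server.
header = "#Fields: date time c-ip cs-method cs-uri-stem sc-status"
entry = "2024-01-15 10:32:01 192.0.2.10 GET /index.html 200"
event = w3c_line_to_json(header, entry)
```

An edge server could emit such JSON events to the data flow tier instead of writing raw log files, although the exact hand-off is application specific.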
Architecture Components (2)
Data Flow Tier
• Separation between the collection tier and the processing layer is required
• The rates at which these systems work are different
• What if one system is not able to keep up with the other?

• Requires an intermediate layer that takes responsibility for
• accepting messages / events from the collection layer
• providing those messages / events to the processing layer

• Real-time interface to the data layer for both producers and consumers of data
• Helps in guaranteeing “at least once” semantics
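The buffering role of the data flow tier, and how acknowledgements lead to "at least once" delivery, can be sketched with a small in-memory stand-in. The class `DataFlowBuffer` and its method names are hypothetical and do not correspond to the API of any real message broker.

```python
from collections import deque

class DataFlowBuffer:
    """In-memory stand-in for a data flow tier (illustrative only)."""

    def __init__(self):
        self._pending = deque()   # accepted but not yet handed out
        self._in_flight = {}      # handed out, waiting for an ack
        self._next_id = 0

    def publish(self, event):
        """Called by the collection tier."""
        self._pending.append((self._next_id, event))
        self._next_id += 1

    def poll(self):
        """Called by the processing tier; returns (id, event) or None."""
        if not self._pending:
            return None
        msg_id, event = self._pending.popleft()
        self._in_flight[msg_id] = event
        return msg_id, event

    def ack(self, msg_id):
        """Processing succeeded, so the message can be forgotten."""
        self._in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        """Re-queue everything that was handed out but never acknowledged."""
        for msg_id, event in sorted(self._in_flight.items()):
            self._pending.append((msg_id, event))
        self._in_flight.clear()

buf = DataFlowBuffer()
for e in ["click", "view", "purchase"]:
    buf.publish(e)

first = buf.poll()        # consumer takes "click" ...
buf.ack(first[0])         # ... and confirms it
second = buf.poll()       # consumer takes "view" but crashes before acking
buf.redeliver_unacked()   # "view" is queued again, behind "purchase"
delivered = [buf.poll()[1], buf.poll()[1]]
```

Here "view" is handed out twice in total, which is why the guarantee is "at least once" rather than "exactly once": the processing tier has to tolerate duplicates.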
Architecture Components (3)
Processing / Analytics Tier
• Based on the “data locality” principle
• Move the software / code to the location of the data
• Relies on distributed processing of data
• The framework does most of the heavy lifting of data partitioning, job
scheduling and job management
• Available frameworks
• Apache Storm
• Apache Spark (Streaming)
• Apache Kafka Streams, etc.
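The data locality idea can be sketched in plain Python: a small function is sent to every partition, runs against the data held there, and only the compact partial results are combined. The node names, sample values and the helper `run_on_partitions` are assumptions for illustration; a real framework would handle scheduling and network transfer.

```python
# Each "node" holds its own partition of the data (hypothetical sample values).
partitions = {
    "node-a": [3, 5, 8],
    "node-b": [1, 4],
    "node-c": [7, 2, 6],
}

def local_sum_and_count(values):
    """The code shipped to each node; it only touches local data."""
    return sum(values), len(values)

def run_on_partitions(func, parts):
    """Stand-in for a framework scheduler: run func where the data lives."""
    return {node: func(values) for node, values in parts.items()}

partials = run_on_partitions(local_sum_and_count, partitions)

# Only the small partial results need to travel to be combined.
total = sum(s for s, _ in partials.values())
count = sum(c for _, c in partials.values())
mean = total / count
```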
Architecture Components (4)
Storage Tier
• In memory or permanent
• Usually in memory, as data is processed once
• But there can be use cases where events / outcomes need to be persisted as
well

• NoSQL databases are becoming a popular choice for permanent storage
• MongoDB
• Cassandra

• But usage varies by use case; there is still no database that fits all use cases
Architecture Components (5)
Delivery Layer
• Usually a web-based interface
• Nowadays mobile interfaces are becoming quite popular
• Dashboards are built with streaming visualizations that get continuously
updated as underlying events are processed
• HTML + CSS + JavaScript + WebSockets can be used to create interfaces
and update them
• HTML5 elements can be used to render interfaces
• SVG and PDF formats are used to render the outcomes

• Monitoring / alerting use cases
• Feeding data to downstream applications
Lambda Architecture

Pravin Y Pawar
Lambda Architecture
Defined
• Proposed by Nathan Marz based on his experience working on distributed
data processing systems at BackType and Twitter
• A generic, scalable and fault-tolerant data processing architecture

• Lambda Architecture
• aims to satisfy the need for a robust system that is fault-tolerant against both
hardware failures and human mistakes
• is able to serve a wide range of workloads and use cases
• supports use cases in which low-latency reads and updates are required

• The resulting system should be linearly scalable, and it should scale out
rather than up.
Lambda Architecture (2)
Block Diagram

Source : http://lambda-architecture.net/
Lambda Architecture (3)
Basic Flow of Events
1. All data entering the system is dispatched to both the batch layer and the
speed layer for processing.
2. The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data)
(ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a
low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving
layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views
and real-time views.
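The flow above can be sketched as a toy page-view counter. All names (`ingest`, `run_batch`, `query`) and the sample events are hypothetical, and the sketch assumes that no event arrives while the batch run is in progress, which a real system cannot assume.

```python
from collections import Counter

master_dataset = []        # immutable, append-only log of raw events
realtime_view = Counter()  # speed layer: events not yet covered by a batch view
batch_view = Counter()     # serving layer: result of the last batch run

def ingest(page):
    """Step 1: every event goes to both the batch and the speed layer."""
    master_dataset.append(page)
    realtime_view[page] += 1

def run_batch():
    """Steps 2-3: recompute the batch view from all raw data."""
    batch_view.clear()
    batch_view.update(master_dataset)
    realtime_view.clear()  # these events are now covered by the batch view

def query(page):
    """Step 5: merge the batch view and the real-time view."""
    return batch_view[page] + realtime_view[page]

for p in ["/home", "/cart", "/home"]:
    ingest(p)
run_batch()
ingest("/home")            # arrives after the batch run
home_views = query("/home")
```

The same counting logic appears twice, once in `run_batch` and once in `ingest`, which hints at the duplicated-code burden of keeping two layers consistent.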
Architectural Components (1)
Batch layer
• New data comes continuously, as a feed to the data system.
• It gets fed to the batch layer and the speed layer simultaneously.
• The batch layer looks at all the data at once and eventually corrects the data
in the speed (stream) layer.
• This is where much of the ETL work and a traditional data warehouse are
typically found.
• This layer runs on a predefined schedule, usually once or twice a day.

• The batch layer has two very important functions:
• To manage the master dataset
• To pre-compute the batch views.

Source : https://databricks.com/glossary/lambda-architecture
Architectural Components (2)
Speed Layer (Stream Layer)
• This layer handles data that has not yet been delivered in the batch view
because of the latency of the batch layer.
• In addition, it deals only with recent data, creating real-time views so that
the user gets a complete view of the data.

• The speed layer produces its outputs through an enrichment process and
helps the serving layer reduce the latency of responding to queries.
• As its name suggests, the speed layer has low latency because it deals
with real-time data only and carries less computational load.
Architectural Components (3)
Serving Layer
• The outputs from the batch layer, in the form of batch views, and from the
speed layer, in the form of near-real-time views, are forwarded to the serving layer.

• This layer indexes the batch views so that they can be queried with low latency
on an ad-hoc basis.
Applications of Lambda Architecture

• User queries need to be served on an ad-hoc basis using immutable
data storage.
• Quick responses are required, and the system should be capable of handling
various updates in the form of new data streams.
• None of the stored records should be erased, and the system should allow
updates and new data to be added to the database.
Pros and Cons of Lambda Architecture

• Pros
• The batch layer of the Lambda architecture manages historical data in
fault-tolerant distributed storage, which keeps the possibility of errors low
even if the system crashes.
• It is a good balance of speed and reliability.
• Fault-tolerant and scalable architecture for data processing.

• Cons
• It can result in coding overhead due to the comprehensive processing
involved.
• It re-processes data in every batch cycle, which is not beneficial in certain
scenarios.
• Data modelled with the Lambda architecture is difficult to migrate or
reorganize.
Source : https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data
Kappa Architecture

Pravin Y Pawar
Bad thing about Lambda Architecture

• The problem with the Lambda Architecture is that maintaining code that
needs to produce the same result in two complex distributed systems is
exactly as painful as it seems like it would be.

• Programming in distributed frameworks like Storm and Hadoop is complex.
Inevitably, code ends up being specifically engineered toward the framework
it runs on. The resulting operational complexity of systems implementing the
Lambda Architecture is the one thing that seems to be universally agreed on
by everyone doing it.

Interesting read : https://www.oreilly.com/ideas/questioning-the-lambda-architecture


Kappa Architecture
Defined
• First described by Jay Kreps at LinkedIn.
• It focuses on only processing data as a stream.
• It is not a replacement for the Lambda Architecture, except where your
use case fits.
• In this architecture, incoming data is streamed through a real-time layer, and
the results are placed in the serving layer for queries.
Kappa Architecture (2)
Block Diagram
Kappa Architecture (3)

• The idea is to handle both real-time data processing and continuous
reprocessing in a single stream processing engine.
• This requires that the incoming data stream can be replayed (very quickly),
either in its entirety or from a specific position.
• If there are any code changes, then a second stream process replays all
previous data through the latest real-time engine and replaces the data stored
in the serving layer.
• This architecture attempts to simplify things by keeping only one code base
rather than managing one for each of the batch and speed layers in the Lambda
Architecture.
• In addition, queries only need to look in a single serving location instead of
going against batch and speed views.
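Reprocessing by replay can be sketched with a list standing in for the replayable stream. The job functions `total_v1` and `total_v2`, the helper `replay` and the sample events are hypothetical; the "code change" shown (ignoring negative amounts) is only an example.

```python
# Replayable, append-only input stream (hypothetical events).
event_log = [("alice", 5), ("bob", 3), ("alice", -2), ("bob", 4)]

def total_v1(state, event):
    """First version of the job: sums every amount per user."""
    user, amount = event
    state[user] = state.get(user, 0) + amount

def total_v2(state, event):
    """Changed logic: negative amounts (refunds) are ignored."""
    user, amount = event
    if amount > 0:
        state[user] = state.get(user, 0) + amount

def replay(job, log, from_offset=0):
    """Run a job over the log from a given position and return its output table."""
    table = {}
    for event in log[from_offset:]:
        job(table, event)
    return table

serving_table = replay(total_v1, event_log)   # view built by the old code
new_table = replay(total_v2, event_log)       # second job replays all past data
# Swap the serving table once the replay has caught up.
old_table, serving_table = serving_table, new_table
```

Only one job definition is live at a time, and queries read from a single `serving_table`, which mirrors the single code base and single serving location described above.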

Interesting read : https://www.talend.com/blog/2017/08/28/lambda-kappa-real-time-big-data-architectures/


Pros and Cons of Kappa architecture

• Pros
• Kappa architecture can be used to develop data systems that are online
learners and therefore don’t need the batch layer.
• Re-processing is required only when the code changes.
• It can be deployed with fixed memory.
• It can be used for horizontally scalable systems.
• Fewer resources are required, as the machine learning is done on
a real-time basis.

• Cons
• The absence of a batch layer might result in errors during data processing or
while updating the database, which requires an exception manager
to reprocess or reconcile the data.

Interesting read : https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-
Real Time System Characteristics
BITS Pilani Pravin Y Pawar
Pilani Campus

BA ZC420, Real Time Analytics


Lecture No. 1.4
Agenda

• Distinguishing Features of Streaming Data
• Data always in motion
• Data structuring
• Data Cardinality

• Features of Real-Time Architecture
• High Availability
• Low Latency
• Horizontal Scalability

BITS Pilani, Pilani Campus


Distinguishing Features of Streaming Data
• Data always in motion
• Streaming data
• is generated continuously
• is always flowing

• Two critical requirements
• Collection system should be robust
• Processing should be able to keep pace with collection

• Solutions
• Horizontal Scalability
• Algorithmic handling of streaming data

Distinguishing Features of Streaming Data - II

• Data Structuring
• Loosely structured

• Various data sources
• Structured and unstructured data
• Forming a joint schema is difficult
• For example, social media streams

• Young, evolving projects
• Add many dimensions to the data
• Collect as much data as possible to enable interesting analysis

Distinguishing Features of Streaming Data - III

• Data Cardinality
• Number of unique values in data
• Very few values appear often, many are very sparse

• Challenges with streaming data
• Processing
• Streaming data can be processed only once
• Difficult to identify the state of the data
• Batch processing on processed data can be used for estimation

• Storage
• Memory requirements are high while processing data
• A linear amount of space is required for storing state information
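The link between cardinality and state size can be seen in a one-pass counter: each event is read exactly once, but the counter keeps one entry per unique value, so its memory grows linearly with cardinality. The function name and the skewed sample stream are assumptions for illustration.

```python
from collections import Counter

def count_in_one_pass(stream):
    """Consume each event exactly once while tracking per-value counts."""
    counts = Counter()
    for value in stream:
        counts[value] += 1
    return counts

# Hypothetical skewed stream: one hot value and several rare ones.
stream = ["home"] * 6 + ["cart", "help", "faq", "about"]
counts = count_in_one_pass(iter(stream))

cardinality = len(counts)   # one state entry per unique value seen
most_common_value, hits = counts.most_common(1)[0]
```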

Features of Real-Time Architecture
• High Availability
• Key distinguishing factor from batch / BI systems
• Very critical for collection, flow and processing systems

• Two approaches
• Distribution
• Use multiple physical servers to distribute the load

• Replication
• Write to several machines
• Master-slave configuration
• Automatic failover
• Masterless configuration
• Recovery is difficult in case of failure
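A minimal sketch of master-slave replication with automatic failover is shown below, assuming synchronous copies to every live replica and a trivial "first live replica wins" promotion rule. The classes `Replica` and `ReplicatedStore` are hypothetical, and the sketch does not cover bringing a failed node back in sync.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

class ReplicatedStore:
    """Master-slave replication with automatic failover (illustrative only)."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.master = self.replicas[0]

    def write(self, key, value):
        """Copy the write to every live replica, including the master."""
        self._ensure_master()
        for r in self.replicas:
            if r.alive:
                r.data[key] = value

    def read(self, key):
        self._ensure_master()
        return self.master.data[key]

    def _ensure_master(self):
        """Failover: promote the first live replica if the master is down."""
        if not self.master.alive:
            live = [r for r in self.replicas if r.alive]
            if not live:
                raise RuntimeError("no live replica available")
            self.master = live[0]

store = ReplicatedStore(["s1", "s2", "s3"])
store.write("events_seen", 42)
store.replicas[0].alive = False      # simulate the master failing
value = store.read("events_seen")    # served by the promoted replica
```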

Features of Real-Time Architecture - II
• Low Latency
• Time taken to service a request

• Streaming system latency
• Time taken to process an event from the moment it entered the system

• Many streaming systems work in batches
• Micro-batching: processing in very small batches, on the order of milliseconds
• Collection systems care about the first definition of latency
• Flow and processing components care about the second one

• Tradeoff between speed and safety
• If data can be safely lost, latency can be very small
• If not, the system has to live with a lower limit on latency
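Micro-batching and the latency it adds can be sketched as a generator that closes a batch on size or on elapsed time. The function `micro_batches`, its parameters and the sample timestamps are assumptions for illustration and do not reflect how any particular framework schedules its batches.

```python
def micro_batches(events, max_size, max_wait_ms):
    """Group (timestamp_ms, payload) events into small batches.

    A batch is closed when it holds max_size events or when the next event
    arrives max_wait_ms or more after the first event in the batch.
    """
    batch = []
    for ts, payload in events:
        if batch and (len(batch) >= max_size or ts - batch[0][0] >= max_wait_ms):
            yield batch
            batch = []
        batch.append((ts, payload))
    if batch:
        yield batch

# Hypothetical events with arrival times in milliseconds.
events = [(0, "a"), (2, "b"), (4, "c"), (30, "d"), (31, "e")]
batches = list(micro_batches(events, max_size=3, max_wait_ms=10))

# Longest time the first event of a batch waited for the last event
# of that batch to arrive, in this sample.
added_latency_ms = max(b[-1][0] - b[0][0] for b in batches)
```

Smaller values of `max_size` and `max_wait_ms` reduce the waiting time per event at the cost of more, smaller batches.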

Features of Real-Time Architecture - III
• Horizontal Scalability
• Adding more physical servers to a cluster
• Needs care about the amount of coordination required between the systems
• Use of partitioning techniques
• Use the principle of data locality: move the program to the data
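One common partitioning technique is hash partitioning by key, sketched below with a stable CRC32 hash (Python's built-in `hash` is randomized per process for strings, so it is avoided here). The key names and the helpers `partition_for` and `assign` are hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(keys, num_partitions):
    """Group keys by the partition (server) that would own them."""
    owners = {p: [] for p in range(num_partitions)}
    for k in keys:
        owners[partition_for(k, num_partitions)].append(k)
    return owners

keys = [f"user-{i}" for i in range(12)]   # hypothetical event keys
three_servers = assign(keys, 3)
four_servers = assign(keys, 4)            # scale out by adding one server

# Keys whose owner changes when the cluster grows have to be moved, which is
# part of the coordination cost of scaling out.
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 4)]
```

With simple modulo partitioning many keys can change owner when a server is added, which is one reason the amount of coordination needs attention when scaling horizontally.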

Reference

• Real-Time Analytics, Byron Ellis
• Chapter 1 : Introduction to Streaming Data
• Chapter 2 : Designing Real-Time Streaming Architecture

Thank You!
In our next session: Streaming Data Systems Components
Thank You!
In our next session: Kappa Architecture
Thank You!
In our next session: Lambda Architecture
