Lecture 5
Architecture
Pravin Y Pawar
Streaming Data Systems
Defined again
• … are layered systems that rely on several loosely coupled systems
• Helps in achieving high availability
• Helps in managing the system
• Helps in keeping costs under control
• Real time interface to data layer for both producers and consumers of data
• Helps in guaranteeing the “at least once” semantics
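The “at least once” guarantee can be illustrated with a minimal sketch. The `Broker` class and its `poll` / `ack` methods are hypothetical stand-ins for the data layer, not the API of any real messaging system. The assumption is that a message stays pending until the consumer acknowledges it, so a crash before the acknowledgement leads to redelivery.

```python
from collections import deque


class Broker:
    """Hypothetical in-memory broker: a message stays pending until acknowledged."""

    def __init__(self, messages):
        self.pending = deque(messages)

    def poll(self):
        # Return the oldest unacknowledged message without removing it.
        return self.pending[0] if self.pending else None

    def ack(self, message):
        # Only an explicit acknowledgement removes the message.
        if self.pending and self.pending[0] == message:
            self.pending.popleft()


def consume(broker, handler):
    """Deliver every message at least once; retry when the handler fails."""
    deliveries = 0
    while (message := broker.poll()) is not None:
        deliveries += 1
        try:
            handler(message)
        except RuntimeError:
            continue  # no ack was sent, so the same message is delivered again
        broker.ack(message)
    return deliveries


def make_flaky_handler(processed):
    """Handler that fails once on 'b', after processing it but before the ack."""
    failed = set()

    def handler(message):
        processed.append(message)
        if message == "b" and message not in failed:
            failed.add(message)
            raise RuntimeError("crash before ack")

    return handler


processed = []
deliveries = consume(Broker(["a", "b", "c"]), make_flaky_handler(processed))
print(processed)  # 'b' shows up twice
```

No message is lost, but `"b"` is processed twice. Consumers in an at-least-once system therefore usually need to tolerate duplicates, for example by making their processing idempotent.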
Architecture Components (3)
Processing / Analytics Tier
• Based on “data locality” principle
• Move the software / code to the location of the data
• Rely on distributed processing of data
• The framework does most of the heavy lifting: data partitioning, job
scheduling, and job management
• Available Frameworks
• Apache Storm
• Apache Spark (Streaming)
• Apache Kafka Streams, etc.
Architecture Components (4)
Storage Tier
• In memory or permanent
• Usually in memory as data is processed once
• But there can be use cases where events / outcomes need to be persisted as
well
• Usage varies by use case; there is still no single database that fits all use cases
Architecture Components (5)
Delivery Layer
• Usually web based interface
• Nowadays mobile interfaces are becoming quite popular
• Dashboards are built with streaming visualizations that get continuously
updated as the underlying events are processed
• HTML + CSS + JavaScript + WebSockets can be used to create interfaces
and update them
• HTML5 elements can be used to render interfaces
• SVG, PDF formats used to render the outcomes
Lambda Architecture
Defined
• Proposed by Nathan Marz based on his experience working on distributed
data processing systems at BackType and Twitter
• A generic, scalable and fault-tolerant data processing architecture
• Lambda Architecture
• aims to satisfy the needs for a robust system that is fault-tolerant, both against
hardware failures and human mistakes
• being able to serve a wide range of workloads and use cases
• in which low-latency reads and updates are required.
• The resulting system should be linearly scalable, and it should scale out
rather than up.
Lambda Architecture (2)
Block Diagram
Source : http://lambda-architecture.net/
Lambda Architecture (3)
Basic Flow of Events
1. All data entering the system is dispatched to both the batch layer and the
speed layer for processing.
2. The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data)
(ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in
a low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving
layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views
and real-time views.
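The five steps above can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions: the views are simple per-key counts, and the names `ingest`, `run_batch`, and `query` are hypothetical, not part of any framework.

```python
from collections import Counter

master_dataset = []        # immutable, append-only raw events (batch layer)
batch_view = Counter()     # precomputed by the batch layer, indexed for serving
realtime_view = Counter()  # maintained incrementally by the speed layer


def ingest(event):
    """Step 1: dispatch each event to both the batch layer and the speed layer."""
    master_dataset.append(event)
    realtime_view[event] += 1


def run_batch():
    """Steps 2-3: recompute the batch view from the full master dataset.

    Once the batch view covers an event, the speed layer no longer needs it,
    so the real-time view is reset (a simplification of real expiry logic).
    """
    global batch_view
    batch_view = Counter(master_dataset)
    realtime_view.clear()


def query(key):
    """Step 5: answer a query by merging the batch view and the real-time view."""
    return batch_view[key] + realtime_view[key]


for page in ["home", "cart", "home"]:
    ingest(page)
run_batch()       # the batch view now covers the first three events
ingest("home")    # step 4: only the speed layer has seen this event so far
print(query("home"))  # 3
```

The query result is correct both before and after a batch run because every event is counted in exactly one of the two views at any time.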
Architectural Components (1)
Batch layer
• New data comes continuously, as a feed to the data system.
• It gets fed to the batch layer and the speed layer simultaneously.
• It looks at all the data at once and eventually corrects the data in the stream
layer.
• Here we can find lots of ETL and a traditional data warehouse.
• This layer is built using a predefined schedule, usually once or twice a day.
Source : https://databricks.com/glossary/lambda-architecture
Architectural Components (2)
Speed Layer (Stream Layer)
• This layer handles data that has not yet been delivered in the batch view
because of the latency of the batch layer.
• In addition, it deals only with recent data, creating real-time views so that
the user gets a complete view of the data.
• The speed layer produces its outputs through an enrichment process and
helps the serving layer reduce query response latency.
• As its name suggests, the speed layer has low latency because it deals only
with real-time data and has a lighter computational load.
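One way to picture "recent data only" is a speed layer that drops events as soon as a batch run has covered them. The sketch below assumes integer timestamps and a hypothetical `on_batch_complete` callback; both are illustrative choices, not part of the Lambda Architecture definition.

```python
from collections import Counter


class SpeedLayer:
    """Keeps a real-time view over events newer than the last batch run."""

    def __init__(self):
        self.recent = []        # (timestamp, key) pairs not yet covered by a batch view
        self.batch_horizon = 0  # timestamp up to which the batch view is complete

    def on_event(self, timestamp, key):
        # Only recent data is kept, which keeps the computational load small.
        if timestamp > self.batch_horizon:
            self.recent.append((timestamp, key))

    def on_batch_complete(self, horizon):
        # The batch view now covers everything up to `horizon`; drop those events.
        self.batch_horizon = horizon
        self.recent = [(t, k) for t, k in self.recent if t > horizon]

    def realtime_view(self):
        return Counter(k for _, k in self.recent)


speed = SpeedLayer()
for timestamp, key in [(1, "a"), (2, "b"), (3, "a")]:
    speed.on_event(timestamp, key)

speed.on_batch_complete(horizon=2)  # batch view caught up to timestamp 2
view = speed.realtime_view()
print(view)  # only the event at timestamp 3 remains
```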
Architectural Components (3)
Serving Layer
• The outputs from batch layer in the form of batch views and from speed layer
in the form of near-real time views are forwarded to the serving layer.
• This layer indexes the batch views so that they can be queried with low
latency on an ad-hoc basis.
Applications of Lambda Architecture
• User queries must be served on an ad-hoc basis from immutable data
storage.
• Quick responses are required, and the system should be capable of handling
updates that arrive as new data streams.
• No stored record should be erased, and the system should allow updates
and new data to be added to the database.
Pros and Cons of Lambda Architecture
• Pros
• The batch layer of the Lambda architecture manages historical data in
fault-tolerant distributed storage, which keeps the likelihood of errors low
even if the system crashes.
• It is a good balance of speed and reliability.
• Fault tolerant and scalable architecture for data processing.
• Cons
• It can result in coding overhead because of the comprehensive processing
involved.
• It re-processes data on every batch cycle, which is not beneficial in certain
scenarios.
• Data modelled with the Lambda architecture is difficult to migrate or
reorganize.
Source : https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data
Kappa Architecture
Bad thing about Lambda Architecture
• The problem with the Lambda Architecture is that maintaining code that
needs to produce the same result in two complex distributed systems is
exactly as painful as it seems like it would be.
• Pros
• Kappa architecture can be used to develop data systems that are online
learners and therefore don’t need the batch layer.
• Re-processing is required only when the code changes.
• It can be deployed with fixed memory.
• It can be used for horizontally scalable systems.
• Fewer resources are required because the machine learning is done in
real time.
• Cons
• The absence of a batch layer might result in errors during data processing
or while updating the database, which requires an exception manager to
reprocess or reconcile the data.
Interesting read : https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-
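The point that Kappa re-processes only when the code changes can be sketched as replaying a retained event log through a new version of the processing function. In this sketch the in-memory list `log` is an assumed stand-in for a durable, replayable log, and `replay`, `count_v1`, and `count_v2` are hypothetical names used only for illustration.

```python
def replay(log, process, from_offset=0):
    """Rebuild a view by running `process` over the log from `from_offset`."""
    view = {}
    for event in log[from_offset:]:
        process(view, event)
    return view


def count_v1(view, event):
    # Original logic: count events per user.
    view[event["user"]] = view.get(event["user"], 0) + 1


def count_v2(view, event):
    # Changed logic: sum amounts per user instead of counting events.
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]


log = [
    {"user": "u1", "amount": 5},
    {"user": "u2", "amount": 7},
    {"user": "u1", "amount": 3},
]

view_v1 = replay(log, count_v1)  # the view served before the code change
view_v2 = replay(log, count_v2)  # rebuilt from the same log after the change
print(view_v1, view_v2)
```

There is one code path for both live processing and re-processing, which removes the need to keep two implementations in sync as in the Lambda Architecture.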
Real Time System Characteristics
BITS Pilani Pravin Y Pawar
Pilani Campus
Solutions
Horizontal Scalability
Algorithmic handling of streaming data
Data Structuring
Loosely structured
Data Cardinality
Number of unique values in data
A few values appear very often; many are very sparse
Storage
Memory requirements are high while processing data
Linear amount of space required for storing state information
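A small sketch makes the cardinality and storage points concrete. It tracks exact distinct values and exact frequencies over a made-up stream of country codes (the data is purely illustrative). Both structures hold one entry per unique value, so the state grows linearly with the cardinality.

```python
from collections import Counter

# Made-up stream: one value dominates, several appear only once.
stream = ["us", "us", "in", "us", "de", "us", "in", "fr", "us", "jp"]

seen = set()           # exact distinct tracking: one entry per unique value
frequency = Counter()  # exact frequency tracking: one counter per unique value

for value in stream:
    seen.add(value)
    frequency[value] += 1

cardinality = len(seen)
top_value, top_count = frequency.most_common(1)[0]
singletons = sorted(v for v, c in frequency.items() if c == 1)

print(cardinality, top_value, top_count, singletons)
```

On an unbounded stream with high cardinality this exact approach becomes expensive, which is one reason the algorithmic handling of streaming data is listed as a concern.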
Two Approaches
Distribution
Use multiple physical servers to distribute the load
Replication
Write to several machines
Master-slave configuration
Automatic failover
Masterless configuration
Recovery is difficult in case of failure
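The two approaches can be sketched together. `partition` illustrates distribution by hashing a key to one of several servers, and `ReplicatedStore` illustrates replication in a master-slave configuration with automatic failover. All names are hypothetical, the failover rule (promote the first surviving replica by name) is an assumption made for the example, and real systems need far more care around consistency.

```python
import hashlib


def partition(key, servers):
    """Distribution: pick one server for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class ReplicatedStore:
    """Replication with a master-slave layout and automatic failover."""

    def __init__(self, names):
        self.replicas = {name: {} for name in names}
        self.alive = set(names)
        self.master = names[0]

    def failover(self):
        # Promote a surviving replica; fails with IndexError if none is left.
        self.master = sorted(self.alive)[0]

    def write(self, key, value):
        # Writes go through the master and are copied to every live replica.
        if self.master not in self.alive:
            self.failover()
        for name in self.alive:
            self.replicas[name][key] = value

    def read(self, key):
        if self.master not in self.alive:
            self.failover()
        return self.replicas[self.master].get(key)


servers = ["s1", "s2", "s3"]
chosen = partition("user-42", servers)  # same key always maps to the same server

store = ReplicatedStore(["n1", "n2", "n3"])
store.write("k", "v1")
store.alive.discard("n1")  # simulate a master failure
value = store.read("k")    # triggers failover; the data survives on a replica
print(store.master, value)
```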