SA Unit 1 PPT 2
The Stream-Processing Model
Bridging Data Streams with Programming Constructs
Agenda
• Introduction to Stream-Processing Concepts
• Data Sources
• Stream-Processing Pipelines
• Data Sinks
Examples:
• Enhanced User Experience: Real-time recommendations and updates improve customer engagement.
• Operational Efficiency: Streamlining processes like monitoring IoT devices or server logs reduces downtime and manual effort.
• Timely Insights: Provides up-to-the-moment analytics for dynamic environments like financial markets or traffic systems.
Difference between data streams and
data at rest
Components of Stream Processing
1. Data Sources
• Definition: Where data enters the streaming framework.
• Examples: Apache Kafka, Flume, Twitter, TCP sockets.
2. Stream-Processing Pipelines
• Definition: Logical flow of transformations from sources
to sinks.
3. Data Sinks
• Definition: Where processed data exits the streaming
framework.
• Examples: Databases, files, dashboards.
Stream processing libraries, such as Streamz, help build pipelines to manage streams of
continuous data, allowing applications to respond to events as they occur.
Stream processing pipelines often involve multiple actions such as filters, aggregations,
counting, analytics, transformations, enrichment, branching, joining, flow control, feedback into
earlier stages, back pressure, and storage.
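For instance, a minimal Streamz sketch of such a pipeline (the temperature readings and unit conversion are hypothetical, chosen only to show a filter, a transformation, a running count, and sinks):

    from streamz import Stream

    # Hypothetical stream of temperature readings (None = bad sensor read).
    source = Stream()

    valid = source.filter(lambda t: t is not None)         # filter out bad reads
    fahrenheit = valid.map(lambda t: t * 9 / 5 + 32)       # transformation
    count = valid.accumulate(lambda n, _: n + 1, start=0)  # running count (aggregation)

    fahrenheit.sink(print)  # sink: emit each transformed reading
    count.sink(print)       # sink: emit the running count

    for reading in [20.5, None, 23.1]:
        source.emit(reading)  # events enter the pipeline as they occur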
Apache Spark Stream-
Processing Model
Key Points:
• Frameworks: Structured Streaming and Spark
Streaming.
• APIs available in Scala, Java, Python, and R.
• Sources and sinks define system boundaries.
Frameworks: Structured Streaming and
Spark Streaming
Structured Streaming
• High-Level API: Provides a declarative, SQL-like
approach to stream processing.
• Batch-Like Semantics: Processes streams as incremental micro-batches, or via low-latency continuous execution.
• Fault-Tolerant: Ensures exactly-once processing with
state management.
• Ease of Use: Ideal for developers familiar with
DataFrame/Dataset APIs in Spark.
The components of Structured Streaming
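A minimal word-count sketch in Python, following the standard Structured Streaming example, shows these components end to end (the socket source on localhost:9999 is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

    # Source: lines of text arriving on a socket.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999)
             .load())

    # Transformation: split lines into words, count occurrences incrementally.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    word_counts = words.groupBy("word").count()

    # Sink: print updated counts to the console after each micro-batch.
    query = (word_counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()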
Frameworks: Structured Streaming and
Spark Streaming
Spark Streaming
• Original Framework: Older, DStream-based API for
stream processing.
• Micro-Batch Processing: Splits streams into small
batches for processing.
• Customizable: Allows more control for lower-level
stream processing.
• Transition Phase: Gradually being replaced by
Structured Streaming for modern applications.
Overview Of Spark Streaming
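For comparison, a sketch of the same word count using the older DStream API (the socket address is again a placeholder):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "DStreamWordCount")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    # Source: lines of text arriving on a socket.
    lines = ssc.socketTextStream("localhost", 9999)

    # Classic RDD-style transformations applied to each micro-batch.
    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    counts.pprint()  # sink: print each batch's counts

    ssc.start()
    ssc.awaitTermination()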
APIs Available in Scala, Java,
Python, and R
• Multi-Language Support: Apache Spark supports a wide range of programming languages to cater to diverse developer communities:
• Scala: The native language of Spark, offering concise syntax and seamless integration with Spark’s
core.
• Java: Robust and widely used for enterprise-level applications, with extensive libraries.
• Python: Popular for data science and machine learning due to its simplicity and rich ecosystem (e.g.,
Pandas, NumPy).
• R: Designed for statisticians, providing advanced statistical capabilities for data analysis.
Sources and Sinks
• Operations like map, filter, and reduce derive new streams from existing ones.
• Each stream can be traced back to its inputs via a sequence of transformations.
• Out-of-Order Events: Events may not arrive in the order they were generated.
Solutions:
• Windowing: Groups events into fixed or sliding time intervals for processing.
• Watermarking: Specifies a threshold for how late data can arrive and still be included.
• Buffering: Temporarily holds events to ensure proper order or completeness before processing.
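For instance, windowing and watermarking combine in Structured Streaming roughly as follows (the events streaming DataFrame and its timestamp and word columns are assumed for illustration):

    from pyspark.sql.functions import window, col

    # Assumes a streaming DataFrame `events` with `timestamp` and `word` columns.
    windowed_counts = (
        events
        .withWatermark("timestamp", "10 minutes")        # tolerate up to 10 min of lateness
        .groupBy(window(col("timestamp"), "5 minutes"),  # fixed 5-minute windows
                 col("word"))
        .count()
    )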
Case Study: Real-Time Dashboard
Example:
Source:
• User Activity Logs: Captures real-time events like page views, clicks, and user sessions.
Processing:
• Aggregations: Compute metrics like total page views, active users, or average session duration.
• User Trends: Identify popular pages, geographic distribution, or user engagement patterns.
Sink:
• Live Visualization Dashboard: Displays metrics and trends for instant insights.
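A sketch of the processing and sink stages in Structured Streaming might look like this (the activity streaming DataFrame and its user_id and event_time columns are hypothetical):

    from pyspark.sql.functions import window, count, approx_count_distinct

    # Hypothetical `activity` streaming DataFrame with user_id and event_time columns.
    metrics = (
        activity
        .groupBy(window("event_time", "1 minute"))
        .agg(count("*").alias("page_views"),
             approx_count_distinct("user_id").alias("active_users"))
    )

    # Sink: an in-memory table that a live dashboard could poll.
    query = (metrics.writeStream
             .outputMode("complete")
             .format("memory")
             .queryName("dashboard_metrics")
             .start())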
When working with both unbounded and bounded streams of data, there are generally two
ways to work with the events received:
1. Process each event as received, in isolation, with no knowledge of any other events
2. Process each event as received, but including or taking into account history/context
(other received and/or processed events)
• With the first workflow, we have no idea about other events; we receive an event and
process it as is. But with the second, we store information about other events, i.e. the
state, and use this information to process the current event!
• Therefore the first workflow is stateless streaming while the second is stateful
streaming!
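The contrast is easy to see in a small Streamz sketch (the numeric events are arbitrary):

    from streamz import Stream

    source = Stream()

    # Stateless: each event is handled on its own, with no memory of the past.
    source.map(lambda x: x * 2).sink(print)

    # Stateful: a running sum carries state from one event to the next.
    source.accumulate(lambda total, x: total + x, start=0).sink(print)

    for x in [1, 2, 3]:
        source.emit(x)
    # The stateless branch prints 2, 4, 6; the stateful branch prints 1, 3, 6.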
The limitations of stateless stream processing
The lack of context when processing new events — that is, not being able to relate this event
with other events — means you lack the ability to do quite a bit:
1. You can’t aggregate your data
• Computing counts, sums, or averages requires remembering earlier events, which a stateless system cannot do.
2. You can’t power recommendation systems
• These systems need to infer the best response based on your current
interactions vs previous interactions. For instance, Amazon will recommend other
items to buy based on your current cart contents and items you’ve viewed before.
Example use cases of stateful stream
processing
Device (IoT) monitoring
• When working with IoT devices, monitoring their health becomes quite important.
• The health of the device can be defined as: for a given window of time, the device
sends a number of pings that is no more than 5% fewer than in the previous window.
Basically, if we expect our device to send 100 pings every hour and the following
hour we receive fewer than 95, we have a problem.
• In this case, we need to store the state of the previous hour to process the
current hour!
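A minimal plain-Python sketch of this check (the PingMonitor class, its 5% threshold, and the hourly counts are illustrative):

    class PingMonitor:
        """Flags a device when a window's ping count drops >5% below the previous window."""
        def __init__(self, threshold=0.05):
            self.previous = None       # state: previous window's ping count
            self.threshold = threshold

        def on_window(self, ping_count):
            alert = (self.previous is not None
                     and ping_count < self.previous * (1 - self.threshold))
            self.previous = ping_count  # update state for the next window
            return alert

    monitor = PingMonitor()
    for hour, pings in enumerate([100, 98, 90]):
        if monitor.on_window(pings):
            print(f"hour {hour}: device unhealthy ({pings} pings)")
    # Flags hour 2, since 90 < 98 * 0.95.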
Thank you