SA Unit 1 PPT 2

The document provides an overview of stream processing concepts, including data streams, their characteristics, and the importance of real-time processing for decision-making and operational efficiency. It discusses components such as data sources, processing pipelines, and sinks, along with frameworks like Apache Spark's Structured Streaming and Spark Streaming. Additionally, it covers stateful processing, handling event order, and practical use cases like fraud detection and IoT monitoring.

Uploaded by pg0145

Stream-Processing Model
Bridging Data Streams with Programming Constructs
Agenda
• Introduction to Stream-Processing Concepts

• Components of Stream Processing

• Data Sources

• Stream-Processing Pipelines

• Data Sinks

• Apache Spark’s Stream-Processing Model

• Stateful Stream Processing

• Handling Event Order and Timeliness

• Summary and Q&A


Introduction to Stream-Processing Concepts
• What is a data stream?
• The importance of stream processing in real-time data
scenarios.
• Difference between data streams and data at rest.
What is a Data Stream?
• A data stream is a continuous flow of real-time
data generated by various sources.
• It differs from traditional datasets by being dynamic and
unbounded, requiring immediate or near-immediate
processing.
Key Characteristics:
1. Continuous Flow: Data is generated in real time and arrives sequentially.

2. Unbounded Size: The stream grows indefinitely over time.

3. Time-Sensitive: Data value often diminishes if not processed promptly.

4. Examples:
   • IoT sensor readings (e.g., temperature, humidity).
   • Real-time stock prices.
   • Social media feeds (e.g., tweets, posts).

Real-Life Use Case:

• Analyzing live tweets during an event for sentiment or trend detection.


The importance of stream processing in real-time data scenarios
• Immediate Decision-Making: Enables organizations to respond to events as they happen, such as fraud detection or system alerts.

• Enhanced User Experience: Real-time recommendations and updates improve customer engagement, e.g., live content suggestions.

• Operational Efficiency: Streamlining processes like monitoring IoT devices or server logs to reduce downtime and increase productivity.

• Timely Insights: Provides up-to-the-moment analytics for dynamic environments like financial markets or traffic systems.
Difference between data streams and data at rest
• Data streams are dynamic and unbounded, arriving continuously and requiring immediate or near-immediate processing.
• Data at rest is static and bounded (e.g., files or database tables), typically processed in scheduled batches.
Components of Stream Processing
1. Data Sources
• Definition: Where data enters the streaming framework.
• Examples: Apache Kafka, Flume, Twitter, TCP sockets.
2. Stream-Processing Pipelines
• Definition: Logical flow of transformations from sources
to sinks.
3. Data Sinks
• Definition: Where processed data exits the streaming
framework.
• Examples: Databases, files, dashboards.
Stream processing libraries, such as Streamz, help build pipelines to manage streams of
continuous data, allowing applications to respond to events as they occur.

Stream processing pipelines often involve multiple actions such as filters, aggregations,
counting, analytics, transformations, enrichment, branching, joining, flow control, feedback into
earlier stages, back pressure, and storage.
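The source → pipeline → sink flow above can be sketched with plain Python generators. This is a toy illustration of the concepts, not the API of Streamz or Spark; the event fields ("type", "agent") are made up for the example.

```python
# Toy stream pipeline: source -> filter -> enrichment -> sink.
# Plain Python generators stand in for a real streaming framework.

def source(events):
    """Source: yields events one at a time, like a live feed."""
    for event in events:
        yield event

def keep_clicks(stream):
    """Filter stage: pass through only 'click' events."""
    return (e for e in stream if e["type"] == "click")

def enrich(stream):
    """Enrichment stage: tag each event with a derived field."""
    for e in stream:
        yield {**e, "is_mobile": e["agent"] == "mobile"}

def run_pipeline(events, sink):
    """Wire the stages together and push results into a sink (here, a list)."""
    for e in enrich(keep_clicks(source(events))):
        sink.append(e)

events = [
    {"type": "click", "agent": "mobile"},
    {"type": "view", "agent": "desktop"},
    {"type": "click", "agent": "desktop"},
]
out = []
run_pipeline(events, out)
print(len(out))  # 2 -- only the two click events survive the filter
```

Because each stage consumes the previous one lazily, events flow through the whole pipeline one at a time, which is what lets real frameworks react to events as they occur.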
Apache Spark Stream-Processing Model
Key Points:
• Frameworks: Structured Streaming and Spark
Streaming.
• APIs available in Scala, Java, Python, and R.
• Sources and sinks define system boundaries.
Frameworks: Structured Streaming and Spark Streaming
Structured Streaming
• High-Level API: Provides a declarative, SQL-like
approach to stream processing.
• Batch-Like Semantics: Processes streams as
incremental micro-batches or continuous execution.
• Fault-Tolerant: Ensures exactly-once processing with
state management.
• Ease of Use: Ideal for developers familiar with
DataFrame/Dataset APIs in Spark.
The components of Structured Streaming
Frameworks: Structured Streaming and Spark Streaming
Spark Streaming
• Original Framework: Older, DStream-based API for
stream processing.
• Micro-Batch Processing: Splits streams into small
batches for processing.
• Customizable: Allows more control for lower-level
stream processing.
• Transition Phase: Gradually being replaced by
Structured Streaming for modern applications.
Overview Of Spark Streaming
APIs Available in Scala, Java, Python, and R
• Multi-Language Support: Apache Spark supports a wide range of programming languages to cater to diverse developer preferences.

• Scala: The native language of Spark, offering concise syntax and seamless integration with Spark’s core.

• Java: Robust and widely used for enterprise-level applications, with extensive libraries.

• Python: Popular for data science and machine learning due to its simplicity and rich ecosystem (e.g., Pandas, NumPy).

• R: Designed for statisticians, providing advanced statistical capabilities for data analysis.
Sources and Sinks

• Sources: Consuming data streams into Spark (e.g., Kafka).

• Sinks: Sending processed data out of Spark (e.g., databases, files).

• Chaining: One framework’s sink can be another’s source (pipelines).


Immutable Streams Defined from One Another
Streams are Immutable:

• Once created, individual elements in a stream cannot be modified.

Transformations Create New Streams:

• Operations like map, filter, and reduce derive new streams from existing ones.

• Original data remains intact, ensuring consistency.

Traceability and Reproducibility:

• Each stream can be traced back to its inputs via a sequence of transformations.

• Guarantees that computations are unambiguous and repeatable.
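The same principle can be shown with ordinary Python: transformations like map and filter produce new sequences and leave the original untouched. Here a tuple stands in for an (immutable) stream snapshot; the values are illustrative.

```python
# Transformations derive new streams; the original is never modified.
readings = (10, 12, 15, 9)  # an immutable "stream" snapshot

doubled = tuple(map(lambda x: x * 2, readings))     # map -> new stream
high    = tuple(filter(lambda x: x > 10, readings))  # filter -> new stream

print(readings)  # (10, 12, 15, 9) -- unchanged, so results are reproducible
print(doubled)   # (20, 24, 30, 18)
print(high)      # (12, 15)
```

Because `doubled` and `high` are defined purely in terms of `readings`, each can be traced back to its input and recomputed at any time, which is what makes the computation unambiguous and repeatable.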


Stateful Stream Processing
Definition:

• Maintains intermediate states to process incoming data based on prior computations.

• Enables advanced operations that require context from historical data.

Key Use Cases:

• Counting: Tracking the number of events in a stream (e.g., page views).

• Aggregations: Computing running totals, averages, or other cumulative metrics.

• Session-Based Computations: Identifying and analyzing user sessions based on activity.
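A minimal sketch of such a stateful operator, in plain Python rather than any particular framework: it carries a count and a running sum between events, so each new event is processed in the context of everything seen before.

```python
# Minimal stateful operator: a running count and running average.
class RunningStats:
    def __init__(self):
        self.count = 0    # state: number of events seen so far
        self.total = 0.0  # state: sum of values seen so far

    def update(self, value):
        """Process one event against the accumulated state."""
        self.count += 1
        self.total += value
        return self.count, self.total / self.count  # (count, running average)

stats = RunningStats()
for v in [10, 20, 30]:
    count, avg = stats.update(v)
print(count, avg)  # 3 20.0
```

Real engines keep equivalent state per key (e.g., per user or per device) and checkpoint it for fault tolerance, but the core idea is the same: state in, event in, updated state and result out.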


Handling Event Order and Timeliness
Challenges:

• Out-of-Order Events: Events may not arrive in the order they were generated.

• Late-Arriving Data: Data delayed due to network or processing latency.

Solutions:

• Windowing: Groups events into fixed or sliding time intervals for processing.

• Watermarking: Specifies a threshold for how late data can arrive and still be included.

• Buffering: Temporarily holds events to ensure proper order or completeness before processing.
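Windowing and watermarking can be sketched together in a few lines of plain Python. This is a simplified model, not a real engine: events are bare event-time integers, the watermark is the maximum event-time seen minus an allowed lateness, and windows are fixed (tumbling) intervals.

```python
# Sketch: tumbling-window counts with a simple watermark.
from collections import defaultdict

def window_counts(events, window_size=10, allowed_lateness=5):
    windows = defaultdict(int)  # window start time -> event count
    max_time = 0
    late = []
    for t in events:  # t is the event's event-time
        max_time = max(max_time, t)
        watermark = max_time - allowed_lateness
        if t < watermark:
            late.append(t)  # arrived after the watermark passed: dropped
            continue
        windows[(t // window_size) * window_size] += 1
    return dict(windows), late

wins, late = window_counts([1, 3, 12, 14, 2, 25, 4])
print(wins)  # {0: 2, 10: 2, 20: 1}
print(late)  # [2, 4] -- too old once the watermark advanced past them
```

The events with times 2 and 4 are out of order; they are accepted only while the watermark (max time minus 5) has not yet moved past them, which is exactly the trade-off watermarking expresses: how long to wait for stragglers before finalizing a window.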
Case Study: Real-Time Dashboard
Example:

• Streaming website analytics to a real-time dashboard for monitoring user behavior.

Source:

• User Activity Logs: Captures real-time events like page views, clicks, and user sessions.

Processing:

• Aggregations: Compute metrics like total page views, active users, or average session duration.

• User Trends: Identify popular pages, geographic distribution, or user engagement patterns.

Sink:

• Live Visualization Dashboard: Displays metrics and trends for instant insights.
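The processing stage of this case study can be sketched as a single aggregation function. The event fields ("user", "action", "page") are an assumed, illustrative schema, not from a real log format.

```python
# Sketch of the dashboard's processing stage: aggregate raw activity events
# into metrics (total page views, distinct active users, most popular page).
def dashboard_metrics(events):
    page_views = 0
    users = set()
    popular = {}
    for e in events:
        users.add(e["user"])
        if e["action"] == "view":
            page_views += 1
            popular[e["page"]] = popular.get(e["page"], 0) + 1
    top_page = max(popular, key=popular.get) if popular else None
    return {"page_views": page_views,
            "active_users": len(users),
            "top_page": top_page}

metrics = dashboard_metrics([
    {"user": "a", "action": "view", "page": "/home"},
    {"user": "b", "action": "view", "page": "/home"},
    {"user": "a", "action": "click", "page": "/home"},
    {"user": "c", "action": "view", "page": "/pricing"},
])
print(metrics)  # {'page_views': 3, 'active_users': 3, 'top_page': '/home'}
```

In a live system this aggregation would run continuously over a window of the stream and push each updated result to the dashboard sink.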
When working with both unbounded and bounded streams of data, there are generally two
ways to work with the events received:

1. Process each event as received.

2. Process each event as received, but taking into account history/context
(other received and/or processed events).

• With the first workflow, we have no idea about other events; we receive an event and
process it as is. But with the second, we store information about other events, i.e. the
state, and use this information to process the current event!

• Therefore the first workflow is stateless streaming while the second is stateful
streaming!
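The two workflows can be contrasted directly: the same events handled statelessly (each one alone) and statefully (with a running total carried between events). A toy sketch in plain Python:

```python
# The same stream of events handled two ways.

def stateless(event):
    # No memory of other events: we can only transform this one event.
    return event * 2

class Stateful:
    def __init__(self):
        self.total = 0           # the "state": context from earlier events

    def process(self, event):
        self.total += event      # current event interpreted against history
        return self.total

s = Stateful()
stateless_out = [stateless(e) for e in [1, 2, 3]]
stateful_out  = [s.process(e) for e in [1, 2, 3]]
print(stateless_out)  # [2, 4, 6] -- each result depends only on its own event
print(stateful_out)   # [1, 3, 6] -- each result depends on everything before it
```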
The limitations of stateless stream processing
The lack of context when processing new events — that is, not being able to relate this event
with other events — means you lack the ability to do quite a bit:
1. You can’t aggregate your data.

2. You can’t join or compare data streams.

3. You can’t identify patterns.

4. You’re limited in use cases.


It’s clear that stateless streaming has limitations that might make it hard to answer your
business questions.
Example use cases of stateful stream processing
Fraud and anomaly detection
• In transactional systems, detecting fraudulent and anomalous behavior in real
time is critical.
• Anomalies (and fraud) are behaviors that are outside the norm… “the norm”
being previous habits in the system!
• Therefore, when processing the current transaction, you need to be able to
compare its attributes in real time with previous habits!
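One simple way to "compare against previous habits" is to keep a running mean of past transaction amounts as state and flag transactions far above it. This is a toy sketch, not a production detector; the 3x threshold is an arbitrary illustrative choice.

```python
# Sketch: flag a transaction as anomalous when its amount is far outside
# the user's previous habits (here, a running mean of past amounts).
class HabitTracker:
    def __init__(self):
        self.count = 0
        self.mean = 0.0  # state: running mean of amounts seen so far

    def check(self, amount):
        # Compare the current transaction with prior habits in real time.
        suspicious = self.count > 0 and amount > 3 * self.mean
        # Then fold it into the state with an incremental mean update.
        self.count += 1
        self.mean += (amount - self.mean) / self.count
        return suspicious

tracker = HabitTracker()
for amt in [20.0, 25.0, 30.0]:   # build up the user's habit (mean = 25.0)
    tracker.check(amt)
flagged = tracker.check(500.0)   # far outside previous habits
normal  = tracker.check(25.0)    # in line with previous habits
print(flagged, normal)  # True False
```

Real systems track much richer per-user state (merchants, locations, device fingerprints), but the pattern is the same: state from past events is what makes the current event interpretable.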
Example use cases of stateful stream processing
ML/AI systems

• In the age of immediate responses, we expect on-the-fly recommendations when making purchases from Amazon, customer support that answers our questions as quickly and accurately as possible, and social media feeds that always recommend the best content.

• These systems need to infer the best response based on your current
interactions vs previous interactions. For instance, Amazon will recommend other
items to buy based on your current cart contents and items you’ve viewed before.
Example use cases of stateful stream processing
Device (IoT) monitoring

• When working with IoT devices, monitoring their health becomes quite important.

• Device health can be defined as: within a given window of time, the device sends a number of pings no more than 5% below the previous window’s count. For example, if we expect our device to send 100 pings every hour and the following hour we receive fewer than 95, we have a problem.

• In this case, we need to store the state of the previous hour to process the
current hour!
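The health rule from this slide is a one-line comparison once the previous window's count is stored as state. A minimal sketch:

```python
# Health check per the slide's definition: the current window is unhealthy
# if its ping count is more than 5% below the previous window's count.
# Storing previous_pings between windows is exactly the state this needs.
def is_healthy(previous_pings, current_pings, max_drop=0.05):
    return current_pings >= previous_pings * (1 - max_drop)

print(is_healthy(100, 96))  # True: within 5% of the previous window
print(is_healthy(100, 94))  # False: fewer than 95 pings -> raise an alert
```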
Thank you
