SA Unit 1 PPT 2
The Stream-Processing Model
Bridging Data Streams with Programming Constructs
Agenda
• Introduction to Stream-Processing Concepts
• Data Sources
• Stream-Processing Pipelines
• Data Sinks
Examples:
• Enhanced User Experience: Real-time recommendations and updates improve customer engagement.
• Operational Efficiency: Streamlining processes like monitoring IoT devices or server logs reduces downtime and manual effort.
• Timely Insights: Provides up-to-the-moment analytics for dynamic environments like financial markets or traffic systems.
Difference between data streams and
data at rest
Components of Stream Processing
1. Data Sources
• Definition: Where data enters the streaming framework.
• Examples: Apache Kafka, Flume, Twitter, TCP sockets.
2. Stream-Processing Pipelines
• Definition: Logical flow of transformations from sources
to sinks.
3. Data Sinks
• Definition: Where processed data exits the streaming
framework.
• Examples: Databases, files, dashboards.
Stream processing libraries, such as Streamz, help build pipelines to manage streams of
continuous data, allowing applications to respond to events as they occur.
Stream processing pipelines often involve multiple actions such as filters, aggregations,
counting, analytics, transformations, enrichment, branching, joining, flow control, feedback into
earlier stages, back pressure, and storage.
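For instance, a minimal Streamz sketch of such a pipeline (the temperature readings and unit conversion are hypothetical, chosen only to show a filter, a transformation, a running count, and sinks):

    from streamz import Stream

    # Hypothetical stream of temperature readings (None = bad sensor read).
    source = Stream()

    valid = source.filter(lambda t: t is not None)         # filter out bad reads
    fahrenheit = valid.map(lambda t: t * 9 / 5 + 32)       # transformation
    count = valid.accumulate(lambda n, _: n + 1, start=0)  # running count (aggregation)

    fahrenheit.sink(print)  # sink: emit each transformed reading
    count.sink(print)       # sink: emit the running count

    for reading in [20.5, None, 23.1]:
        source.emit(reading)  # events enter the pipeline as they occur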
Apache Spark Stream-
Processing Model
Key Points:
• Frameworks: Structured Streaming and Spark
Streaming.
• APIs available in Scala, Java, Python, and R.
• Sources and sinks define system boundaries.
Frameworks: Structured Streaming and
Spark Streaming
Structured Streaming
• High-Level API: Provides a declarative, SQL-like
approach to stream processing.
• Batch-Like Semantics: Processes streams as incremental micro-batches, or via low-latency continuous execution.
• Fault-Tolerant: Ensures exactly-once processing with
state management.
• Ease of Use: Ideal for developers familiar with
DataFrame/Dataset APIs in Spark.
The components of Structured Streaming
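A minimal word-count sketch in Python, following the standard Structured Streaming example, shows these components end to end (the socket source on localhost:9999 is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

    # Source: lines of text arriving on a socket.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999)
             .load())

    # Transformation: split lines into words, count occurrences incrementally.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    word_counts = words.groupBy("word").count()

    # Sink: print updated counts to the console after each micro-batch.
    query = (word_counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()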
Frameworks: Structured Streaming and
Spark Streaming
Spark Streaming
• Original Framework: Older, DStream-based API for
stream processing.
• Micro-Batch Processing: Splits streams into small
batches for processing.
• Customizable: Allows more control for lower-level
stream processing.
• Transition Phase: Gradually being replaced by
Structured Streaming for modern applications.
Overview Of Spark Streaming
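For comparison, a sketch of the same word count using the older DStream API (the socket address is again a placeholder):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "DStreamWordCount")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    # Source: lines of text arriving on a socket.
    lines = ssc.socketTextStream("localhost", 9999)

    # Classic RDD-style transformations applied to each micro-batch.
    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    counts.pprint()  # sink: print each batch's counts

    ssc.start()
    ssc.awaitTermination()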
APIs Available in Scala, Java,
Python, and R
• Multi-Language Support: Apache Spark supports a wide range of programming languages to cater to diverse developer communities:
• Scala: The native language of Spark, offering concise syntax and seamless integration with Spark’s
core.
• Java: Robust and widely used for enterprise-level applications, with extensive libraries.
• Python: Popular for data science and machine learning due to its simplicity and rich ecosystem (e.g.,
Pandas, NumPy).
• R: Designed for statisticians, providing advanced statistical capabilities for data analysis.
Sources and Sinks
• Operations like map, filter, and reduce derive new streams from existing ones.
• Each stream can be traced back to its inputs via a sequence of transformations.
• Out-of-Order Events: Events may not arrive in the order they were generated.
Solutions:
• Windowing: Groups events into fixed or sliding time intervals for processing.
• Watermarking: Specifies a threshold for how late data can arrive and still be included.
• Buffering: Temporarily holds events to ensure proper order or completeness before processing.
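For instance, windowing and watermarking combine in Structured Streaming roughly as follows (the events streaming DataFrame and its timestamp and word columns are assumed for illustration):

    from pyspark.sql.functions import window, col

    # Assumes a streaming DataFrame `events` with `timestamp` and `word` columns.
    windowed_counts = (
        events
        .withWatermark("timestamp", "10 minutes")        # tolerate up to 10 min of lateness
        .groupBy(window(col("timestamp"), "5 minutes"),  # fixed 5-minute windows
                 col("word"))
        .count()
    )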
Case Study: Real-Time Dashboard
Example:
Source:
• User Activity Logs: Captures real-time events like page views, clicks, and user sessions.
Processing:
• Aggregations: Compute metrics like total page views, active users, or average session duration.
• User Trends: Identify popular pages, geographic distribution, or user engagement patterns.
Sink:
• Live Visualization Dashboard: Displays metrics and trends for instant insights.
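A sketch of the processing and sink stages in Structured Streaming might look like this (the activity streaming DataFrame and its user_id and event_time columns are hypothetical):

    from pyspark.sql.functions import window, count, approx_count_distinct

    # Hypothetical `activity` streaming DataFrame with user_id and event_time columns.
    metrics = (
        activity
        .groupBy(window("event_time", "1 minute"))
        .agg(count("*").alias("page_views"),
             approx_count_distinct("user_id").alias("active_users"))
    )

    # Sink: an in-memory table that a live dashboard could poll.
    query = (metrics.writeStream
             .outputMode("complete")
             .format("memory")
             .queryName("dashboard_metrics")
             .start())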
When working with both unbounded and bounded streams of data, there are generally two
ways to work with the events received:
1. Process each event as received, in isolation, with no knowledge of any other events
2. Process each event as received, but including or taking into account history/context
(other received and/or processed events)
• With the first workflow, we have no idea about other events; we receive an event and
process it as is. But with the second, we store information about other events, i.e. the
state, and use this information to process the current event!
• Therefore the first workflow is stateless streaming while the second is stateful
streaming!
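The contrast is easy to see in a small Streamz sketch (the numeric events are arbitrary):

    from streamz import Stream

    source = Stream()

    # Stateless: each event is handled on its own, with no memory of the past.
    source.map(lambda x: x * 2).sink(print)

    # Stateful: a running sum carries state from one event to the next.
    source.accumulate(lambda total, x: total + x, start=0).sink(print)

    for x in [1, 2, 3]:
        source.emit(x)
    # The stateless branch prints 2, 4, 6; the stateful branch prints 1, 3, 6.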
The limitations of stateless stream processing
The lack of context when processing new events — that is, not being able to relate this event
with other events — means you lack the ability to do quite a bit:
1. You can’t aggregate your data
• Computing counts, sums, or averages requires remembering earlier events, which a stateless system cannot do.
2. You can’t power recommendation systems
• These systems need to infer the best response based on your current
interactions vs previous interactions. For instance, Amazon will recommend other
items to buy based on your current cart contents and items you’ve viewed before.
Example use cases of stateful stream
processing
Device (IoT) monitoring
• When working with IoT devices, monitoring their health becomes quite important.
• The health of the device can be defined as: for a given window of time, the device
sends a number of pings that is no more than 5% fewer than in the previous window.
Basically, if we expect our device to send 100 pings every hour and the following
hour we receive fewer than 95, we have a problem.
• In this case, we need to store the state of the previous hour to process the
current hour!
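A minimal plain-Python sketch of this check (the PingMonitor class, its 5% threshold, and the hourly counts are illustrative):

    class PingMonitor:
        """Flags a device when a window's ping count drops >5% below the previous window."""
        def __init__(self, threshold=0.05):
            self.previous = None       # state: previous window's ping count
            self.threshold = threshold

        def on_window(self, ping_count):
            alert = (self.previous is not None
                     and ping_count < self.previous * (1 - self.threshold))
            self.previous = ping_count  # update state for the next window
            return alert

    monitor = PingMonitor()
    for hour, pings in enumerate([100, 98, 90]):
        if monitor.on_window(pings):
            print(f"hour {hour}: device unhealthy ({pings} pings)")
    # Flags hour 2, since 90 < 98 * 0.95.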
Thank you