Data Analytics and Visualization Unit-III
In this article, we discuss the concept of data streams in data analytics in detail: what data
streams are, why they are important, and how they are used in fields like finance,
telecommunications, and IoT (Internet of Things).
2. Image Data –
Satellites frequently send down to Earth streams containing many terabytes of images per day.
Surveillance cameras generate images with lower resolution than satellites, but there can be
numerous of them, each producing a stream of images at intervals of about one second.
A DSMS consists of various layers, each dedicated to performing a particular operation, as
follows:
1. Data Source Layer
The first layer of a DSMS is the data source layer. As its name suggests, it comprises all the data
sources, which include sensors, social media feeds, financial markets, stock markets, etc.
Capturing and parsing of the data stream happen in this layer. Basically, it is the collection layer
that collects the data.
2. Data Ingestion Layer
You can consider this layer a bridge between the data source layer and the processing layer. Its
main purpose is to handle the flow of data, i.e., data flow control, data buffering, and data
routing.
3. Processing Layer
This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS
applications. It processes the data streams in real time using processing engines like
Apache Flink or Apache Storm. The main function of this layer is to filter, transform,
aggregate, and enrich the data stream in order to derive insights and detect patterns.
4. Storage Layer
Once the data is processed, it needs to be stored in a storage unit. The storage layer
consists of various stores such as NoSQL databases, distributed databases, etc. It helps ensure
data durability and the availability of data in case of system failure.
5. Querying Layer
As mentioned above, a DSMS supports two types of queries: ad hoc queries and standing
queries. This layer provides the tools that can be used for querying and analyzing the stored
data stream. It also offers SQL-like query languages or programming APIs. These queries can be
questions such as: how many entries have been made? What type of data was inserted?
6. Visualization and Reporting Layer
This layer provides tools for visualization such as charts, pie charts, histograms, etc. On the
basis of these visual representations, it also helps generate reports for analysis.
7. Integration Layer
This layer is responsible for integrating the DSMS application with traditional systems, business
intelligence tools, data warehouses, ML applications, and NLP applications. It helps improve
applications that are already running.
Together, these layers are responsible for the working of DSMS applications. They provide a
scalable and fault-tolerant application that can handle huge volumes of streaming data. The
layers can change according to business requirements: some deployments may include all the
layers, while others may exclude some.
5 Main Components Of Data Streaming Architecture
A. Stream Processing Engine
This is the core component that processes streaming data. It can perform various operations
such as filtering, aggregation, transformation, enrichment, windowing, etc. A stream processing
engine can also support Complex Event Processing (CEP), which is the ability to detect
patterns or anomalies in streaming data and trigger actions accordingly.
Some popular stream processing tools are Apache Spark Streaming, Apache Flink, Apache
Kafka Streams, etc.
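To make the idea of windowed aggregation concrete, here is a minimal pure-Python sketch (not tied to Spark, Flink, or Kafka Streams) that groups a stream of (timestamp, value) events into tumbling windows and emits a count and sum per window; the 10-second window size and the sample events are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # assumed tumbling-window size

def tumbling_window_aggregate(events):
    """Group (timestamp, value) events into fixed windows and aggregate them.

    `events` is any iterable of (unix_timestamp, numeric_value) pairs,
    assumed to arrive roughly in time order.
    """
    windows = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for ts, value in events:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start]["count"] += 1
        windows[window_start]["sum"] += value
    return dict(windows)

# Example usage with a hypothetical sensor stream
sample_events = [(1, 4.0), (3, 5.5), (12, 2.0), (14, 3.0), (25, 7.5)]
for start, agg in sorted(tumbling_window_aggregate(sample_events).items()):
    print(f"window [{start}, {start + WINDOW_SECONDS}): "
          f"count={agg['count']}, sum={agg['sum']:.1f}")
```

A real engine would add out-of-order handling, state checkpointing, and distribution across workers; the sketch only shows the core windowing logic.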
How To Choose A Stream Processing Engine?
Honestly, there is no definitive answer to this question, as different stream processing engines
have different strengths and weaknesses, and different use cases have different requirements
and constraints.
However, there are some general factors that you must consider when choosing a
stream processing engine. Let’s take a look at them:
•Data volume and velocity: Look at how much data you need to process per second
or per minute and how fast you need to process it. Depending on their architecture and
design, some stream processing engines can handle higher throughput and lower
latency than others.
•Data variety and quality: The type of data you need to process is a very important
factor. Depending on their schema support and data cleansing capabilities, different
stream processing engines can handle data of different complexity or diverse data
types.
•Processing complexity and functionality: The type and complexity of data
processing needed are also very important. Some stream processing engines can
support more sophisticated or flexible processing logic than others, depending on
their programming model and API.
•Scalability and reliability: The level of reliability you need for your stream
processing engine depends on how you plan to use it. Other factors like your company’s
future plans determine how scalable the stream processing engine needs to be. Some
stream processing engines can scale up or down more easily than others.
•Integration and compatibility: Stream processing engines don’t work alone. Your
stream processing engine must integrate with other components of your data streaming
architecture. Depending on their connectors and formats, some stream processing
engines can interoperate more seamlessly than others.
B. Message Broker
Message brokers act as buffers between the data sources and the stream processing
engine. A message broker collects data from various sources, converts it to a standard message
format (such as JSON or Avro), and then streams it continuously for consumption by other
components.
A message broker also provides features such as scalability, fault tolerance, load
balancing, partitioning, etc. Some examples of message brokers are Apache Kafka, Amazon
Kinesis Streams, etc.
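As a hedged illustration of how producers hand events to a message broker, the snippet below uses the kafka-python client to publish a JSON event to an assumed local Kafka broker and topic; the broker address (localhost:9092), topic name (clickstream), and event fields are placeholders, not part of the text above.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address and topic; replace with your own cluster settings.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A hypothetical click event converted to a standard JSON message format.
event = {"user_id": 42, "page": "/pricing", "ts": 1700000000}
producer.send("clickstream", value=event)
producer.flush()  # block until the message has been handed to the broker
```

Consumers (for example, the stream processing engine) would subscribe to the same topic and read the events at their own pace, which is exactly the buffering role described above.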
C. Data Storage
This component stores the processed or raw streaming data for later use. Data
storage can be either persistent or ephemeral, relational or non-relational, structured or
unstructured, etc. Because of the large amount and diverse format of event streams, many
organizations opt to store their streaming event data in cloud object stores as an
operational data lake. A standard method of loading data into the data storage is using ETL
pipelines.
Some examples of data storage systems are Amazon S3, Hadoop Distributed File System
(HDFS), Apache Cassandra, Elasticsearch, etc.
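As a small, hedged example of landing processed stream output in a cloud object store, the snippet below uses boto3 to write a batch of processed events to an assumed S3 bucket as a JSON object; the bucket name, key layout, and credential handling are illustrative only.

```python
import json
import boto3  # pip install boto3

s3 = boto3.client("s3")  # credentials are assumed to come from the environment

# Hypothetical output of the stream processing engine.
processed_batch = [
    {"window_start": 1700000000, "count": 128, "sum": 530.5},
    {"window_start": 1700000010, "count": 97, "sum": 402.0},
]

# The date-partitioned key prefix is just one common data-lake convention.
s3.put_object(
    Bucket="my-stream-lake",
    Key="aggregates/2023/11/14/window-1700000000.json",
    Body=json.dumps(processed_batch).encode("utf-8"),
)
```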
Here are some factors to consider when choosing a data storage system for your data
streaming architecture:
•Data volume and velocity: When it comes to data storage, you need a service that
can keep up with your needs – without sacrificing performance or reliability. How much
do you have to store and how quickly must it be accessed? Your selection should meet
both of these criteria for the best results.
•Data variety and quality: It pays to choose a storage service that can accommodate
your data type and format, as well as offer features like compression for smoother
operation. What sort of data are you looking to store? How organized and consistent is
it? Are there any issues with quality or integrity? Encryption and deduplication also
mean improved security, so make sure they are on the table.
•Data access and functionality: What kind of access do you need to your data? Do
you just want basic read/write operations or something more intricate like queries and
analytics? If it’s the latter, a batch or real-time processing system might be necessary.
You’ll have to find a service that can provide whatever access pattern you require, as
well as features such as indexing, partitioning, and caching; all these extras can make
up for any slack in functionality.
•Integration and compatibility: Pick something capable of smooth integration and
compatibility across different components, like message brokers to data stream
processing engines. Plus, it should support common formats/protocols for optimal
performance.
D. Data Ingestion Layer
The data ingestion layer is a crucial part that collects data from various sources and
transfers it to a data storage system for further data manipulation or analysis. This
layer is responsible for processing different types of data, including structured, unstructured, or
semi-structured, and formats like CSV, JSON, or XML. It is also responsible for ensuring that the
data is accurate, secure, and consistent.
Some examples of data ingestion tools are Apache Flume, Logstash, Amazon Kinesis Firehose,
etc.
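To illustrate what an ingestion layer does at its simplest, here is a hedged sketch (not Flume, Logstash, or Kinesis Firehose) that parses newline-delimited JSON records, validates and normalizes them, and hands them to a downstream sink; the field names, default values, and the sink function are assumptions.

```python
import json

def normalize(record):
    """Basic validation and normalization for an assumed event schema."""
    if "user_id" not in record or "ts" not in record:
        return None                      # drop malformed records
    record["user_id"] = int(record["user_id"])
    record.setdefault("source", "web")   # fill a missing field with a default
    return record

def ingest(lines, sink):
    """Parse, clean, and forward newline-delimited JSON records."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                     # skip records that are not valid JSON
        cleaned = normalize(record)
        if cleaned is not None:
            sink(cleaned)

# Example usage with an in-memory "stream" and a print sink
raw_lines = ['{"user_id": "7", "ts": 1700000000}', 'not json', '{"ts": 1}']
ingest(raw_lines, sink=print)
```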
Here are some factors that you can consider when selecting a tool or technology for your data
ingestion layer:
•Data source and destination compatibility: When searching for the right data
ingestion tool, make sure it can easily link up with your data sources and destinations.
To get this done without a hitch, double-check if there are any connectors or adapters
provided along with the tech. These will allow everything to come together quickly and
seamlessly.
•Data transformation capability: When selecting a tool or technology for data
transformation, it is important to consider whether it supports your specific logic. Check
if it offers pre-built functions or libraries for common transformations, or if it allows you
to write your custom code for more complex transformations. Your chosen tool or
technology should align with your desired data transformation logic to ensure smooth
and efficient data processing.
•Scalability and reliability: It is important to consider its scalability and reliability
when selecting your data ingestion tool. Consider if it can handle the amount of data
you anticipate without affecting its performance or dependability. Check if it offers
features like parallelism, partitioning, or fault tolerance to ensure it can scale and
remain dependable.
While you can use dedicated data ingestion tools such as Flume or Kafka, a better option would
be to use a tool like Estuary Flow that combines multiple components of a streaming
architecture. It includes data ingestion, stream processing, and a message broker, and it
provides data-lake-style storage in the cloud.
Estuary Flow supports a wide range of streaming data sources and formats so you can
easily ingest and analyze data from social media feeds, IoT sensors, clickstream data,
databases, and file systems.
This means you can get access to insights from your data sources faster than ever before.
Whether you need to run a historic analysis or react quickly to changes, our stream processing
engines will provide you with the necessary support.
You don’t need to be a coding expert to use Estuary Flow. Our powerful no-code solution
makes it easy for organizations to create and manage data pipelines so you can focus on
getting insights from your data instead of wrestling with code.
E. Data Visualization and Reporting Tools
Some examples of data visualization and reporting tools are Grafana, Kibana, Tableau, Power
BI, etc.
How To Choose Data Visualization & Reporting Tools For Your Data Streaming Architecture
Here are some factors to consider when choosing data visualization and reporting tools for your
data streaming architecture.
•Type and volume of data you want to stream: When choosing data visualization
and reporting tools, you might need different tools for different types of data, such as
structured or unstructured data, and for different formats, such as JSON or CSV.
•Latency and reliability requirements: If you’re looking to analyze data quickly and
with precision, check whether the tools match your latency and reliability requirements.
•Scalability and performance requirements: Select data visualization and reporting
tools that can adapt and scale regardless of how much your input increases or
decreases.
•Features and functionality of the tools: If you’re trying to make the most of your
data, you must select tools with features catered specifically to what you need. These
might include filtering abilities and interactive visualization options along with alerts for
collaboration purposes.
Now that we are familiar with the main components, let’s discuss the data streaming
architecture diagram.
Data streaming architecture is a powerful way to unlock insights and make real-time decisions
from continuous streams of incoming data. This innovative setup includes three key
components: data sources that provide the raw information, pipelines for processing it all in an
orderly manner, and finally applications or services consuming the processed results.
With these elements combined, you can track important trends as they happen.
•Data sources: Data sources, such as IoT devices, web applications, and social media
platforms, are the lifeblood of data pipelines. They can also use different protocols, like
HTTP or MQTT, to continuously feed information into your pipeline, whether through
push mechanisms or pull processes.
•Data pipelines: Data pipelines are powerful systems that enable you to receive and
store data from a multitude of sources while having the ability to scale up or down.
These tools can also transform your raw data into useful information through operations
like validation, enrichment, filtering, and aggregation.
•Data consumers: Data consumers are the individuals, organizations, or even
computer programs that access and make use of data from pipelines for a variety of
applications, like real-time analytics, reporting, visualization, decision making, and
more.
With such information at their fingertips, users can analyze it further through
descriptive analysis, which examines trends in historical data. Others additionally utilize
predictive analytics as well as prescriptive tactics, both crucial aspects of any business
decision-making process.
There are many ways to design and implement a data streaming architecture depending on
your needs and goals. Here are some examples of a modern streaming data architecture:
I. Lambda Architecture
This is a hybrid architecture that uses two layers: a batch (historical) layer built on traditional
technologies like Spark, and a speed layer for near-real-time processing with quickly responding
streaming tools such as Kafka or Storm.
These are unified in an extra serving layer for optimal accuracy, scalability, and fault
tolerance, though this complexity does come at some cost in latency and
maintenance needs.
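The sketch below illustrates, under assumptions, the serving-layer idea in Lambda architecture: a precomputed batch view is merged with a fresh speed-layer (real-time) view at query time. Both views, the keys, and the merge rule are simplified placeholders rather than a real implementation.

```python
# Hypothetical precomputed views: page-view counts per page.
batch_view = {"/home": 10_000, "/pricing": 2_500}      # output of a periodic batch job
speed_view = {"/home": 42, "/pricing": 7, "/blog": 3}  # counts from recent, unbatched events

def query(page):
    """Serving layer: combine batch and real-time counts for one key."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("/home"))   # 10042
print(query("/blog"))   # 3 (only seen by the speed layer so far)
```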
II. Kappa Architecture
In this architecture, the processed data is stored in queryable storage that supports both batch
and stream queries. This approach offers low latency, simplicity, and consistency, but it
requires high performance, reliability, and idempotency.
Stream Computing :-
•Stream computing is a computing paradigm that reads data from collections of software or hardware
sensors in stream form and computes continuous data streams.
•Stream computing uses software programs that compute continuous data streams.
•Stream computing uses software algorithms that analyze the data in real time.
•Stream computing is one effective way to support Big Data by providing extremely low-latency velocities
with massively parallel processing architectures.
•It is becoming the fastest and most efficient way to obtain useful knowledge from Big Data.
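As a minimal illustration of this one-pass, low-latency style of computation, the generator-based sketch below keeps a running average over a hypothetical sensor stream, updating its answer as each reading arrives instead of re-scanning stored data; the readings are invented for the example.

```python
def running_average(readings):
    """Yield the average after every new reading, in a single pass over the stream."""
    total, count = 0.0, 0
    for value in readings:
        total += value
        count += 1
        yield total / count

# Example usage with a small, assumed temperature stream
for avg in running_average([21.0, 22.5, 20.5, 23.0]):
    print(f"current average: {avg:.2f}")
```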
Data Sampling is a statistical method used to analyze and observe a subset of data from a
larger dataset and extract meaningful information from that subset, which helps in gaining
insight into, or drawing conclusions about, the larger (parent) dataset.
•Sampling in data science helps in finding better and more accurate results and works best
when the data size is big.
•Sampling helps in identifying the pattern on which the subset of the dataset is based; on the
basis of that smaller dataset, the entire sample space is presumed to hold the same properties.
•It is a quicker and more effective method of drawing conclusions.
Why is Data Sampling important?
Data sampling is important for a couple of key reasons:
1.Cost and Time Efficiency: Sampling allows researchers to collect and analyze a subset of
data rather than the entire population. This reduces the time and resources required for data
collection and analysis, making it more cost-effective, especially when dealing with large
datasets.
2.Feasibility: In many cases, it's impractical or impossible to analyze the entire population due
to constraints such as time, budget, or accessibility. Sampling makes it feasible to study a
representative portion of the population while still yielding reliable results.
3.Risk Reduction: Sampling helps mitigate the risk of errors or biases that may occur when
analyzing the entire population. By selecting a random or systematic sample, researchers can
minimize the impact of outliers or anomalies that could skew the results.
4.Accuracy: In some cases, examining the entire population might not even be possible. For
instance, testing every single item in a large batch of manufactured goods would be
impractical. Data sampling allows researchers to get a good understanding of the whole
population by examining a well-chosen subset.
Types of Data Sampling Techniques
There are mainly two types of data sampling techniques, each of which is further divided into 4
sub-categories. They are as follows:
Probability Data Sampling Technique
The probability data sampling technique involves selecting data points from a dataset in such
a way that every data point has a known, non-zero chance of being chosen. Probability sampling
techniques ensure that the sample is representative of the population from which it is drawn,
making it possible to generalize the findings from the sample to the entire population with a
known level of confidence. A short Python sketch after the four techniques below illustrates them.
1. Simple Random Sampling: In simple random sampling, every data point has an equal chance
or probability of being selected. For example, in the selection of heads or tails, both outcomes of
the event have equal probabilities of being selected.
2. Systematic Sampling: In systematic sampling, a regular interval is chosen, after which each
data point is picked for sampling. It is easier and more regular than the previous method of
sampling and reduces inefficiency while improving speed. For example, in a series of 10 numbers,
we sample every 2nd number; here we use the process of systematic sampling.
3. Stratified Sampling: In stratified sampling, we follow a divide-and-conquer strategy. We
divide the data into groups (strata) on the basis of similar properties and then perform
sampling within each group. This ensures better accuracy. For example, in workplace data, the
total number of employees is divided between men and women.
4. Cluster Sampling: Cluster sampling is more or less like stratified sampling. However, in
cluster sampling we randomly choose whole groups of data, whereas in stratified sampling an
orderly division into strata takes place. For example, picking the users of a few randomly chosen
networks from the total combination of users.
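Here is a hedged sketch of the first three probability techniques on a small synthetic "employee" dataset; the dataset, sample sizes, strata, and random seed are invented purely for illustration.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible example

# A small synthetic dataset: (employee_id, gender) pairs.
population = [(i, "F" if i % 3 == 0 else "M") for i in range(1, 31)]

# 1. Simple random sampling: every record has an equal chance of selection.
simple_random = random.sample(population, k=5)

# 2. Systematic sampling: take every 2nd record after a random start.
start = random.randrange(2)
systematic = population[start::2][:5]

# 3. Stratified sampling: group by gender, then sample within each stratum.
strata = defaultdict(list)
for record in population:
    strata[record[1]].append(record)
stratified = [rec for group in strata.values() for rec in random.sample(group, k=2)]

print(simple_random, systematic, stratified, sep="\n")
```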
Non-Probability Data Sampling
Non-probability data sampling means that the selection happens on a non-random basis, and it
depends on the individual which data they want to pick. There is no random selection, and
every selection is made with a thought and an idea behind it.
1. Convenience Sampling: As the name suggests, the data checker selects the data based on
his or her convenience. They may choose the datasets that require fewer calculations and save
time, while bringing results on par with the probability data sampling techniques. For example,
in a dataset involving recruitment of people in the IT industry, the convenient choice would be
the most recent data, and the data that covers more young people.
2. Voluntary Response Sampling: As the name suggests, this sampling method depends on
the voluntary response of the audience for the data. For example, if a survey is being conducted
on the blood groups found in the majority at a particular place, and the data is collected only
from the people who are willing to take part in the survey, then the sampling is referred to as
voluntary response sampling.
3. Purposive Sampling: A sampling method that serves a special purpose falls under
purposive sampling. For example, if we need to assess the need for education, we may conduct
a survey in rural areas and then create a dataset based on people's responses. Such sampling
is called purposive sampling.
4. Snowball Sampling: The snowball sampling technique proceeds via contacts. For example, if
we wish to conduct a survey of people living in slum areas, and one person connects us to
another, and so on, this process is called snowball sampling.
Data Sampling Process
2. Find the confidence level values that represent the required accuracy of the data.
3. Find the margin of error, if any, with respect to the sample space or dataset.
4. Calculate the deviation from the mean or average value using the calculated standard
deviation.
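As a hedged worked example of how confidence level and margin of error feed into this process, the standard Cochran formula n = z² · p · (1 − p) / e² gives the sample size needed to estimate a proportion; the 95% confidence level, p = 0.5, and 5% margin of error below are assumed values, not figures from the text.

```python
import math

z = 1.96   # z-score for an assumed 95% confidence level
p = 0.5    # assumed population proportion (0.5 is the most conservative choice)
e = 0.05   # assumed margin of error (±5%)

n = (z ** 2) * p * (1 - p) / (e ** 2)   # = 384.16
print(math.ceil(n))                     # 385: round up so the sample is at least this large
```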
Best Practices for Effective Data Sampling
Before performing data sampling, one should keep in mind the considerations below for
effective data sampling.
1. Statistical Regularity: A larger sample space, or parent dataset, means more accurate
results, because the probability of each data point being chosen is then equal, i.e., regular.
When picked at random, a larger dataset ensures regularity among all the data.
2. The dataset must be accurate and verified from the respective sources.
3. In the stratified data sampling technique, one needs to be clear about the kind of strata or
groups that will be made.
4. Inertia of Large Numbers: Like the first principle, this states that the parent dataset must
be large enough to give better and clearer results.
Estimating Moments :-
•Estimating moments is a generalization of the problem of counting distinct elements in a stream. The
problem, called computing "moments," involves the distribution of frequencies of different elements in the
stream.
•Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so
we can speak of the i-th element for any i.
•Let m_i be the number of occurrences of the i-th element for any i. Then the k-th-order moment of
the stream is the sum over all i of (m_i)^k. For example:
•The 0th moment is the sum of 1 for each m_i that is greater than 0, i.e., the 0th moment is a count of the
number of distinct elements in the stream.
•The 1st moment is the sum of the m_i’s, which must be the length of the stream. Thus, first moments are
especially easy to compute: just count the length of the stream seen so far.
•The second moment is the sum of the squares of the m_i’s. It is sometimes called the surprise number,
since it measures how uneven the distribution of elements in the stream is.
•To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear.
The most even distribution of these eleven elements would have one appearing 10 times and the other ten
appearing 9 times each.
•In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven
elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would
be 90^2 + 10 × 1^2 = 8110.
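To make the moment definitions concrete, here is a small exact computation (no streaming approximation such as the AMS sketch) that counts element frequencies with a Counter and evaluates the k-th moment; the two synthetic streams are built to reproduce the 910 and 8110 surprise numbers from the example above.

```python
from collections import Counter

def kth_moment(stream, k):
    """Sum of (m_i)^k over the frequency m_i of each distinct element."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

# Streams of length 100 with 11 distinct elements, as in the examples above.
even_stream = ["a"] * 10 + [f"x{i}" for i in range(10) for _ in range(9)]
skewed_stream = ["a"] * 90 + [f"x{i}" for i in range(10)]

print(kth_moment(even_stream, 0))    # 11   (number of distinct elements)
print(kth_moment(even_stream, 1))    # 100  (length of the stream)
print(kth_moment(even_stream, 2))    # 910  (surprise number, even distribution)
print(kth_moment(skewed_stream, 2))  # 8110 (surprise number, skewed distribution)
```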
In a decaying window algorithm, you assign more weight to newer elements. When a new element arrives, you
first reduce the weight of all the existing elements by a constant factor (1 − c) and then assign the new
element a specific weight. The aggregate sum of the decaying exponential weights can be calculated using the
following formula:
∑_{i=0}^{t−1} a_{t−i} · (1 − c)^i
In a data stream consisting of various elements, you maintain a separate sum for each distinct element. For every
incoming element, you multiply the sum of all the existing elements by a value of (1−c). Further, you add the weight of
the incoming element to its corresponding aggregate sum.
A threshold can be kept so that elements with weight less than the threshold are ignored.
Finally, the element with the highest aggregate score is listed as the most popular element.
Example
For example, consider the sequence of Twitter tags below:
fifa, ipl, fifa, ipl, ipl, ipl, fifa
At the end of the sequence, the score of fifa is 2.135 while that of ipl is 3.7264.
So, ipl is trending more than fifa.
Even though both tags occur a comparable number of times in the input, their scores differ because the
decaying window gives more weight to ipl's more recent occurrences.
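Here is a hedged sketch of the decaying-window scoring described above, run on the same tag sequence. The decay constant c = 0.1 is an assumed value, so the absolute scores will differ from the figures quoted above, but the ordering (ipl ahead of fifa) comes out the same.

```python
def decaying_window_scores(stream, c=0.1):
    """Maintain an exponentially decayed count per distinct element.

    For every arriving element, all existing scores are multiplied by (1 - c)
    and the arriving element's score is then increased by 1.
    """
    scores = {}
    for element in stream:
        for key in scores:
            scores[key] *= (1 - c)
        scores[element] = scores.get(element, 0.0) + 1.0
    return scores

tags = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
for tag, score in sorted(decaying_window_scores(tags).items(),
                         key=lambda item: item[1], reverse=True):
    print(f"{tag}: {score:.4f}")   # ipl ends up with the higher decayed score
```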
Advantages of Decaying Window Algorithm:
1. Sudden spikes or spam data are taken care of.
2. New elements are given more weight by this mechanism, which helps achieve the right trending output.
What Is Real-Time Sentiment Analysis?
Real-time sentiment analysis is a machine learning (ML) technique that automatically recognizes and extracts the
sentiment in a text whenever it occurs. It is most commonly used to analyze brand and product mentions in live social
comments and posts. An important thing to note is that real-time sentiment analysis can be done only on live data feeds.
The real-time sentiment analysis process uses several ML tasks such as natural language processing, text analysis, and
semantic clustering to identify opinions expressed about brand experiences in live feeds and extract business insights.
Real-time sentiment analysis has several applications for brand and customer analysis. These include the following.
2. Real-time sentiment analysis of text feeds from platforms such as Twitter. This is immensely helpful in
promptly addressing negative or wrongful social mentions, as well as in threat detection in cyberbullying.
4. Live video streams of interviews, news broadcasts, seminars, panel discussions, speaker events, and lectures.
5. Live audio streams such as in virtual meetings on Zoom or Skype, or at product support call centers.
7. Up-to-date scanning of news websites for relevant news through keywords and hashtags.
Live sentiment analysis is done through machine learning algorithms that are trained to recognize and analyze all data
types from multiple data sources, across different languages, for sentiment.
A real-time sentiment analysis platform needs to be first trained on a data set based on your industry and needs. Once
this is done, the platform performs live sentiment analysis of real-time feeds effortlessly.
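As a hedged, simplified stand-in for such a trained real-time platform, the sketch below scores a small stream of hypothetical social comments with NLTK's pretrained VADER analyzer; a production system would instead use models trained on your industry data and pull comments through live platform APIs, as described next.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()

# A hypothetical live feed of brand mentions.
live_comments = [
    "Absolutely loving the new update, great job!",
    "Support has been ignoring my ticket for three days.",
    "The product is okay, nothing special.",
]

for comment in live_comments:
    scores = analyzer.polarity_scores(comment)       # neg / neu / pos / compound
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05 else "neutral")
    print(f"{label:8s} {scores['compound']:+.3f}  {comment}")
```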
To extract sentiment from live feeds from social media or other online sources, we first need to add live APIs of those
specific platforms, such as Instagram or Facebook. In the case of a platform or online scenario that does not have a live
API, as can be the case with Skype or Zoom, repeat, time-bound data pull requests are carried out. This gives the
solution the ability to constantly track relevant data based on your set criteria.
All the data thus gathered from the various platforms is now analyzed. All text data in comments is cleaned up and
processed for the next stage. All non-text data from live video or audio feeds is transcribed and also added to the text
pipeline. In this case, the platform extracts semantic insights by first converting the audio, and the audio in the video,
into text.
This transcript has timestamps for each word and is indexed section by section based on pauses or changes in the
speaker. A granular analysis of the audio content like this gives the solution enough context to correctly identify entities,
themes, and topics based on your requirements. This time-bound mapping of the text also helps with semantic search.
Even though this may seem like a long drawn-out process, the algorithms complete it in seconds.
All the data is now analyzed using native natural language processing (NLP), semantic clustering, and aspect-based
sentiment analysis. The platform derives sentiment from the aspects and themes it discovers in the live feed, giving you
granular, aspect-level insights.
It can also give you an overall sentiment score in percentile form and tell you the sentiment based on language and data
sources, thus giving you a break-up of audience opinions based on various demographics.
All the intelligence derived from the real-time sentiment analysis is now showcased on a reporting dashboard
in the form of statistics, graphs, and other visual elements. It is from this sentiment analysis dashboard that you can set
alerts and track the metrics you care about.
A live feed sentiment analysis solution must have certain features that are necessary to extract and determine real-time
sentiment accurately:
•Multiplatform
One of the most important features of a real-time sentiment analysis tool is its ability to analyze multiple social media
platforms. This multiplatform capability means that the tool is robust enough to handle API calls from different
platforms, which have different rules and configurations so that you get accurate insights from live data.
This gives you the flexibility to choose whether you want a combination of platforms for live feed analysis, such as
a TED talk, a live seminar, and Twitter, or just a single platform, say, live YouTube video analysis.
•Multimedia
Being multi-platform also means that the solution needs to have the capability to process multiple data types such as
audio, video, and text. In this way, it allows you to discover brand and customer sentiment through live TikTok social
listening, real-time Instagram social listening, or live Twitter feed analysis, effortlessly, regardless of the data format.
•Multilingual
Another important feature is a multilingual capability. For this, the platform needs to have part-of-speech taggers for
each language that it is analyzing. Machine translations can lead to a loss of meanings and nuances when translating
non-Germanic languages such as Korean, Chinese, or Arabic into English. This can lead to inaccurate insights from live
conversations.
•Web scraping
While metrics from a social media platform can tell you numerical data like the number of followers, posts, likes,
dislikes, etc., a real-time sentiment analysis platform can perform data scraping for more qualitative insights. The tool’s
in-built web scraper automatically extracts data from the social media platform you want to extract sentiment from. It
does so by sending HTTP requests to the different web pages it needs to target for the desired information and
downloading the responses.
It then parses the saved data and applies various ML tasks such as NLP, semantic classification, and sentiment analysis,
and in this way gives you customer insights beyond the numerical metrics that you are looking for.
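As a general, hedged illustration of this scraping step (not the tool's own built-in scraper), the snippet below fetches a public page with requests and extracts text snippets with BeautifulSoup; the URL and the CSS selector are placeholders, and real platforms typically require their official APIs and compliance with their terms of service.

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

URL = "https://example.com/brand-mentions"   # placeholder page to scrape

response = requests.get(URL, timeout=10)
response.raise_for_status()                  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
# Assumed markup: each mention sits in a <p class="comment"> element.
comments = [p.get_text(strip=True) for p in soup.select("p.comment")]

for comment in comments:
    print(comment)   # these snippets would next be fed to the sentiment pipeline
```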
•Alerts
The sentiment analysis tool for live feeds must have the capability to track and simplify complex data sets as it conducts
repeat scans for brand mentions, keywords, and hashtags. These repeat scans, ultimately, give you live updates based on
comments, posts, and audio content on various channels. Through this feature, you can set alerts for particular keywords
or when there is a spike in your mentions. You can get these notifications on your mobile device or via email.
•Reporting
Another major feature of a real-time sentiment analysis platform is the reporting dashboard. The insights visualization
dashboard is needed to give you the insights that you require in a manner that is easily understandable. Color-coded pie
charts, bar graphs, word clouds, and other formats make it easy for you to assess sentiment in topics, aspects, and the
data as a whole.
The user-friendly customer experience analysis solution, Repustate IQ, has a very comprehensive reporting dashboard
that gives numerous insights based on various aspects, topics, and sentiment combinations. In addition, it is also
available as an API that can be easily integrated with a dashboard such as Power BI or Tableau that you are already
using. This gives you the ability to leverage a high-precision sentiment analysis API without having to invest in yet
another standalone tool.