(BIT-601) Data Analytics Question Bank
UNIT-1
1. Which are the technical tools that you have used for analysis and presentation
purposes?
2. What are the common problems that data analysts encounter during analysis?
● Handling duplicates
● Collecting the meaningful data at the right time
● Handling data purging and storage problems
● Making data secure and dealing with compliance issues
3. What are your strengths and weaknesses as a data analyst?
Some general strengths of a data analyst may include strong analytical skills, attention to
detail, proficiency in data manipulation and visualization, and the ability to derive insights
from complex datasets.
Weaknesses could include limited domain knowledge, a lack of experience with certain
data analysis tools or techniques, or challenges in effectively communicating technical
findings to non-technical stakeholders.
4. What are the steps involved when working on a data analysis project?
Many steps are involved when working end-to-end on a data analysis project. Some of the
important steps are listed below:
● Problem statement
● Data cleaning/preprocessing
● Data exploration
● Modeling
● Data validation
● Implementation
● Verification
5. What are the best methods for data cleaning?
● Create a data cleaning plan by understanding where the common errors take place and keeping all communications open.
● Before working with the data, identify and remove the duplicates. This will lead to
an easy and effective data analysis process.
● Focus on the accuracy of the data. Set cross-field validation, maintain the value types
of data, and provide mandatory constraints.
● Normalize the data at the entry point so that it is less chaotic. You will be able to
ensure that all information is standardized, leading to fewer errors on entry.
6. Explain the different types of data: structured, unstructured, and semi-structured.
1. Structured Data
● Data having a pre-defined structure or schema, which can also be categorized as quantitative data and is well-organized, is defined as Structured Data.
● Because it has a pre-defined structure property, data can be organized into tables—
columns and rows—just like in spreadsheets.
● Most of the time, when the data has relationships or is too large to be stored in spreadsheets, structured data is stored in relational databases, such as:
1. PostgreSQL
2. SQLite
3. MySQL
4. Oracle Database
5. Microsoft SQL Server
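As an illustration of structured data, the following sketch uses the DBI and RSQLite packages (assumed to be installed); the table name and values are made up for the example.

library(DBI)
# create an in-memory SQLite database and store a small structured table in it
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "employees",
             data.frame(id = 1:3,
                        name = c("Asha", "Ravi", "Meena"),
                        salary = c(52000, 61000, 58000)))
# structured data can be queried with SQL because rows and columns are pre-defined
dbGetQuery(con, "SELECT name, salary FROM employees WHERE salary > 55000")
dbDisconnect(con)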
2. Unstructured Data
● Data with no pre-defined structure or schema, such as free text, images, audio, and video, is defined as Unstructured Data; it is mostly qualitative and cannot be organized directly into rows and columns.
3. Semi-Structured Data
● Semi-Structured Data contains elements of both structured and unstructured data; its schema is not fixed like that of structured data, but with the help of metadata (which enables users to define some partial structure or hierarchy), it can be organized to some extent, so it is not as unorganized as unstructured data.
● Metadata includes tags and other markers, just like in JSON, XML, or CSV, which separate the elements and enforce the hierarchy, but the size of the elements varies and their order is not important.
● Tools for working with Semi-Structured Data
1. Cassandra
2. MongoDB
● Semi-Structured Data Use Cases
1. E-commerce
2. For mobile phones: {“storage”: “64GB”, “network”: “5G”, “color”: “black”}
3. For books: {“publisher”: “Oxford Press”, “writer”: “John Doe”, “pages”: 250}
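A small sketch of reading such semi-structured records in R, assuming the jsonlite package is installed; the field values are the illustrative ones shown above.

library(jsonlite)
# parse JSON records into R lists; the tags (keys) carry the partial structure
phone <- fromJSON('{"storage": "64GB", "network": "5G", "color": "black"}')
book  <- fromJSON('{"publisher": "Oxford Press", "writer": "John Doe", "pages": 250}')
phone$storage   # "64GB"
book$pages      # 250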
7. List the tools useful for data analysis and analytics reporting.
1. R and Python
2. Microsoft Excel
3. Tableau
4. RapidMiner
5. KNIME
6. Power BI
7. Apache Spark
8. QlikView
9. Talend
10. Splunk
Programming Languages: R & Python
● R and Python are the top programming languages used in the Data Analytics field.
● R is an open-source tool used for Statistics and analytics, whereas Python is a high-level,
interpreted language that has easy syntax and dynamic semantics.
● Products: Both R and Python are completely free, and you can easily download both of
them from their respective official websites.
● Companies using R: Companies such as ANZ, Google, and Firefox use R, while other multinational companies such as YouTube, Netflix, and Facebook use Python.
● Recent Advancements and Features: Python and R are developing their features and
functionalities to ease the process of Data Analysis with high speed and accuracy. They are
coming up with various releases on a frequent basis with their updated features.
● Pros (R): It works on any platform, is very compatible, and has a lot of packages.
● Cons (R): It is slower, less safe, and harder to pick up than Python.
Microsoft Excel
● Microsoft Excel is a platform that will help you get better insights into your data.
● Being one of the most popular tools for data analytics, Microsoft Excel provides users with
features such as sharing workbooks, working on the latest version for real-time
collaboration, adding data to Excel directly from a photo, and so on.
● Products: Microsoft Excel offers products in the following three categories:
● For Home
● For Business
● For Enterprises
● Companies using: Almost all organizations use Microsoft Excel on a daily basis to gather
meaningful insights from the data. A few of the popular names are McDonald's, IKEA, and Marriott.
● Recent Advancements and Features: The recent advancements vary depending on the platform. A few of the recent advancements on the Windows platform are as follows:
● You can get a snapshot of your workbook with Workbook Statistics
● You can give your documents more flair with backgrounds and high-quality stock images
absolutely for free
● Pros: It’s used by a lot of people and has a lot of useful features and plug-ins.
● Cons: Cost, mistakes in calculations, and being bad at handling big amounts of data.
Power BI
● Power BI is a Microsoft product used for business analytics.
● Named as a leader for the 13th consecutive year in the Gartner 2024 Magic Quadrant, it
provides interactive visualizations with self-service business intelligence capabilities,
where end users can create dashboards and reports by themselves without having to depend
on anybody.
● Products: Power BI provides the following products:
● Power BI Desktop
● Power BI Pro
● Power BI Premium
● Power BI Mobile
● Power BI Embedded
● Power BI Report Server
● Multinational organizations such as Adobe, Heathrow, Worldsmart, and GE Healthcare are using Power BI to achieve powerful results from their data.
● Recent Advancements/ Features:
Power BI has recently come up with solutions such as Azure + Power BI and Office 365 +
Power BI to help users analyze the data, connect the data, and protect the data across various
Office platforms.
Talend
Talend is a data integration and data management platform that delivers complete and clean data at the moment you need it by maintaining data quality, providing big data integration, cloud API services, data preparation, a data catalog, and the Stitch data loader.
Recently, Talend has also accelerated the journey to the lakehouse paradigm and the path to
revealing intelligence in data. Not only this, but the Talend Cloud is now available in the Microsoft
Azure Marketplace.
QlikView
Recently, Qlik has launched an intelligent alerting platform called Qlik Alerting for Qlik Sense®, which helps organizations handle exceptions, notify users of potential issues, help users analyze the data further, and prompt actions based on the derived insights.
● Pros: Quick Data Visualization, Drag-and-Drop Functionality, Robust Analytics.
● Cons: Steep Learning Curve, Limited Customization, Higher Cost.
Apache Spark
● Apache Spark is one of the most successful projects in the Apache Software Foundation
and is a cluster computing framework that is open-source and is used for real-time
processing.
● Being the most active Apache project currently, it comes with a fantastic open-source
community and an interface for programming.
● This interface makes sure of fault tolerance and implicit data parallelism.
● Products: Apache Spark keeps coming up with new releases that add new features. You can also choose from the various package types for Spark. The most recent version is 2.4.5, and 3.0.0 is in preview.
● Companies using: Companies such as Oracle, Hortonworks, Verizon, and Visa use Apache
Spark for real-time computation of data with ease of use and speed.
● Recent Advancements and Features
● In today’s world, Spark runs on Kubernetes, Apache Mesos, standalone, Hadoop, or in the
cloud.
● It provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written
in any of these four languages.
● Spark's MLlib machine learning component is handy when it comes to big data processing.
● Pros: Fast, dynamic, and easy to use.
● Cons: no file management system; rigid user interface.
10. Define the various phases of the data analytics life cycle.
To address the specific demands for conducting analysis on big data, a step-by-step
methodology is required to plan the various tasks associated with the acquisition, processing,
analysis, and recycling of data.
Phase 1: Discovery
● The team learns about the business domain, frames the business problem as an analytics challenge, and formulates the initial hypotheses to be tested with data.
Phase 2: Data Preparation
● The team prepares an analytics sandbox and performs extract, load, and transform (ELT/ETL) operations to get the data into the sandbox and condition it for analysis.
Phase 3: Model Planning
● The team studies the data to discover the connections between variables and then selects the most significant variables as well as the most effective models.
Phase 4: Model Building
● The team creates datasets that can be used for training, testing, and production purposes.
● The team builds and implements models based on the work completed in the model planning phase.
● The team also evaluates whether its current tools are sufficient to run the models or whether an even more robust environment is required.
● [AKTU] Free or open-source tools: R and PL/R, Octave, WEKA.
● Commercial tools: MATLAB, STATISTICA.
Phase 5: Communicate Results
● Following the execution of the model, team members need to evaluate the outcomes of the model to establish criteria for its success or failure.
● The team is considering how best to present findings and outcomes to the various
members of the team and other stakeholders while taking into consideration cautionary
tales and assumptions.
● The team should determine the most important findings, quantify their value to the
business, create a narrative to present findings, and summarize them for all stakeholders.
Phase 6: Operationalize:
● The team distributes the benefits of the project to a wider audience. It sets up a pilot
project that will deploy the work in a controlled manner prior to expanding the project to
the entire enterprise of users.
● This technique allows the team to gain insight into the performance and constraints
related to the model within a production setting on a small scale and then make necessary
adjustments before full deployment.
● The team produces the last reports, presentations, and codes.
● Open-source or free tools such as WEKA, SQL, MADlib, and Octave.
UNIT-2
1. Explain univariate, bivariate, and multivariate analysis with examples.
● Univariate analysis is the simplest and easiest form of data analysis, where the data being
analyzed contains only one variable.
Example: studying the heights of players in the NBA.
Univariate analysis can be described using central tendency, dispersion, quartiles, bar
charts, histograms, pie charts, and frequency distribution tables.
● Bivariate analysis involves the analysis of two variables to find causes, relationships, and
correlations between the variables.
Example: analyzing the sale of ice cream based on the temperature outside.
The bivariate analysis can be explained using correlation coefficients, linear regression,
logistic regression, scatter plots, and box plots.
● Multivariate analysis involves the analysis of three or more variables to understand the
relationship of each variable with the other variables.
Example: revenue based on expenditure.
Multivariate analysis can be performed using Multiple regression, Factor analysis,
Classification & regression trees, Cluster analysis, Principal component analysis, Dual-
axis charts, etc.
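A minimal base-R sketch of the three kinds of analysis on the built-in mtcars dataset, which is used here purely for illustration.

# Univariate: one variable summarized by central tendency and dispersion
summary(mtcars$mpg)
sd(mtcars$mpg)
hist(mtcars$mpg)

# Bivariate: the relationship between two variables
cor(mtcars$mpg, mtcars$wt)          # correlation coefficient
plot(mtcars$wt, mtcars$mpg)         # scatter plot
lm(mpg ~ wt, data = mtcars)         # simple linear regression

# Multivariate: three or more variables analyzed together
summary(lm(mpg ~ wt + hp + disp, data = mtcars))   # multiple regression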
2. Differentiate between overfitting and underfitting.
Overfitting:
● The model fits the training data well using the training set.
● The performance drops considerably over the test set.
● It happens when the model learns the random fluctuations and noise in the training dataset in detail.
Underfitting:
● The model neither fits the training data well nor can generalize to new data.
● It performs poorly on both the training and the test set.
● It happens when there is too little data to build an accurate model, or when we try to develop a linear model using non-linear data.
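A small base-R sketch of the difference, using simulated data (an assumption for illustration): a straight line underfits the curved signal, while a degree-10 polynomial overfits the training noise and does worse on the held-out points.

set.seed(7)
x <- seq(0, 3, length.out = 30)
y <- sin(x) + rnorm(30, sd = 0.2)        # non-linear signal plus noise
idx <- sample(30, 20)
train <- idx; test <- setdiff(1:30, idx)

under <- lm(y ~ x, subset = train)               # too simple: underfits
over  <- lm(y ~ poly(x, 10), subset = train)     # too flexible: overfits

rmse <- function(m, i) sqrt(mean((y[i] - predict(m, data.frame(x = x[i])))^2))
c(under_train = rmse(under, train), under_test = rmse(under, test),
  over_train  = rmse(over, train),  over_test  = rmse(over, test))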
3. Explain the concept of outlier detection. How would you identify outliers in a dataset, and how do you treat them?
An outlier is a data point that is distant from other similar points. They may be due to
variability in the measurement or may indicate experimental errors.
Outlier detection is the process of identifying observations or data points that significantly
deviate from the expected or normal behavior of a dataset. Outliers can be valuable sources
of information or indications of anomalies, errors, or rare events.
It's important to note that outlier detection is not a definitive process, and the identified
outliers should be further investigated to determine their validity and potential impact on
the analysis or model. Outliers can be due to various reasons, including data entry errors,
measurement errors, or genuinely anomalous observations, and each case requires careful
consideration and interpretation.
For example, in a scatter plot or box plot of the dataset, points that lie far away from the rest of the observations are flagged as outliers.
To deal with outliers, one can use the following four methods:
● Drop the outlier records if they are caused by data entry or measurement errors.
● Cap (winsorize) the outliers at a reasonable upper or lower limit.
● Assign a new value, for example by imputing the outlier with the mean, median, or a predicted value.
● Apply a transformation (such as a log transformation), or analyze the outliers separately.
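A minimal base-R sketch of identifying outliers with the IQR (1.5 × IQR) rule and treating them by capping; the data vector is made up for illustration.

x <- c(10, 12, 11, 13, 12, 95, 11, 14, 12, -40)   # illustrative data with outliers

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

x[x < lower | x > upper]                 # identified outliers (95 and -40)
x_capped <- pmin(pmax(x, lower), upper)  # treatment: cap values at the fences
x_capped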
Regression vs. Classification:
● Regression: The output must be a continuous value, such as price, age, etc.
● Classification: The output must be a categorical value, such as 0 or 1, Yes or No, etc.
4. What is fuzzy logic? Discuss its advantages and the architecture of a fuzzy logic system.
Advantages of Fuzzy Logic:
1. This concept is flexible, and we can easily understand and implement it.
2. It is used for helping minimize the logic created by humans.
3. It is the best method for finding the solution to those problems that are suitable for
approximate or uncertain reasoning.
4. It always offers two values, which denote the two possible solutions to a problem or
statement.
5. It allows users to build or create functions that are non-linear of arbitrary complexity.
Architecture of a Fuzzy Logic System: A fuzzy logic system has the following four main components.
1. Rule Base
2. Fuzzification
3. Inference Engine
4. Defuzzification
Rule Base
Rule Base is a component used for storing the set of rules, and the If-Then conditions given
by the experts are used for controlling the decision-making systems. There have been so
many updates to the fuzzy theory recently, which offers effective methods for designing
and tuning fuzzy controllers. These updates or developments decrease the number of fuzzy
sets of rules.
Fuzzification
Fuzzification is a module or component for transforming the system inputs, i.e., it converts
the crisp number into fuzzy steps. The crisp numbers are those inputs that are measured by
the sensors and then fuzzified and passed into the control systems for further processing.
This component divides the input signal into the following five states in any Fuzzy Logic system: Large Positive (LP), Medium Positive (MP), Small (S), Medium Negative (MN), and Large Negative (LN).
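A small base-R sketch of how a crisp sensor reading can be fuzzified with triangular membership functions; the state names and ranges are illustrative assumptions, not part of the original answer.

# triangular membership: rises from a to the peak at b, falls to c, 0 outside [a, c]
triangle <- function(x, a, b, c) {
  pmax(pmin((x - a) / (b - a), (c - x) / (c - b)), 0)
}

# fuzzify a crisp temperature reading into degrees of membership
fuzzify <- function(temp) {
  c(cold = triangle(temp, -10, 0, 15),
    warm = triangle(temp, 10, 20, 30),
    hot  = triangle(temp, 25, 40, 55))
}

fuzzify(18)   # a crisp input of 18 becomes partial memberships (mostly "warm")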
Inference Engine
This component is the main component of any fuzzy logic system (FLS), because all the information is processed in the inference engine. It allows users to find the matching degree between the current fuzzy input and the rules. After the matching degree is found, the system determines which rules are to be fired according to the given input field. When all rules are fired, they are combined to develop the control actions.
Defuzzification
Defuzzification is a module or component that takes the fuzzy set inputs generated by the
Inference Engine and then transforms them into a crisp value. It is the last step in the
process of developing a fuzzy logic system. The crisp value is a type of value that is acceptable to the user. Various techniques are available to do this, but the user has to select the best one to reduce the error.
Reinforcement Learning
Reinforcement learning is a type of machine learning in which an agent learns by interacting with an environment: it receives rewards or penalties for its actions and adjusts its behaviour so as to maximize the cumulative reward over time.
Applications of NN
● Signal processing
● Pattern recognition, e.g., handwritten character or face identification.
● Diagnosis or mapping symptoms to a medical case.
● Speech recognition
● Human Emotion Detection
● Educational Loan Forecasting
7. State the concept of the competitive learning rule. How do you select the best regression method?
There would be competition among the output nodes, so the main concept is that during
training, the output unit that has the highest activation of a given input pattern will be
declared the winner. This rule is also called winner-take-all because only the winning
neuron is updated and the rest of the neurons are left unchanged. Competitive learning is a
form of unsupervised learning in artificial neural networks in which nodes compete for the
right to respond to a subset of the input data. A variant of Hebbian learning, learning
works by increasing the specialization of each node in the network. It is well suited to
finding clusters within data.
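A tiny base-R sketch of the winner-take-all update: for each input, only the closest (winning) weight vector is moved toward it. The simulated data, learning rate, and number of output units are assumptions for illustration.

set.seed(3)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))   # two natural clusters
w <- x[sample(nrow(x), 2), ]                        # two output units, random init
lr <- 0.1                                           # learning rate

for (epoch in 1:20) {
  for (i in sample(nrow(x))) {
    d <- rowSums((w - matrix(x[i, ], nrow(w), ncol(w), byrow = TRUE))^2)
    winner <- which.min(d)                          # unit with the highest activation
    w[winner, ] <- w[winner, ] + lr * (x[i, ] - w[winner, ])  # update only the winner
  }
}
w   # the weight vectors end up near the two cluster centres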
1. Data exploration is an inevitable part of building a predictive model. It should be your first step before selecting the right model, for example identifying the relationships among variables and their impact.
2. To compare the goodness of fit for different models, we can analyse different metrics like
statistical significance of parameters, R-square, adjusted r-square, AIC, BIC, and error
term. Another one is Mallow’s Cp criterion. This essentially checks for possible bias in
your model by comparing the model with all possible submodels (or a careful selection of
them).
3. Cross-validation is the best way to evaluate models used for prediction. Here, you divide
your data set into two groups (train and validate). A simple mean squared difference
between the observed and predicted values gives you a measure of the prediction accuracy.
4. If your data set has multiple confounding variables, you should not choose an automatic model selection method, because you do not want to put these in a model at the same time.
5. It also depends on your objective: it can happen that a less powerful model is easier to implement than a highly statistically significant model.
6. Regression regularization methods (Lasso, Ridge, and ElasticNet) work well in cases of
high dimensionality and multicollinearity among the variables in the data set.
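A minimal base-R sketch of two of the ideas above: comparing goodness-of-fit metrics (adjusted R-squared, AIC) and checking prediction accuracy on a held-out validation set; the mtcars dataset and the two candidate models are purely illustrative.

set.seed(1)
idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
validate <- mtcars[-idx, ]

m1 <- lm(mpg ~ wt, data = train)              # simpler candidate model
m2 <- lm(mpg ~ wt + hp + disp, data = train)  # richer candidate model

# goodness-of-fit metrics on the training data
c(adjR2_m1 = summary(m1)$adj.r.squared, adjR2_m2 = summary(m2)$adj.r.squared)
c(AIC_m1 = AIC(m1), AIC_m2 = AIC(m2))

# mean squared prediction error on the validation set (cross-validation idea)
mse <- function(m) mean((validate$mpg - predict(m, validate))^2)
c(mse_m1 = mse(m1), mse_m2 = mse(m2))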
8. What is regularization? Why is regularization important? Discuss types of
regularization. Which regularization is better, and why?
Regularization refers to techniques that are used to calibrate machine learning models in
order to minimize the adjusted loss function and prevent overfitting or underfitting. Using
regularization, we can fit our machine learning model appropriately on a given test set and,
hence, reduce the errors in it.
You should use regularization if the gap in performance between train and test is large.
This means the model grasps too many details of the train set. Overfitting is related to high
variance, which means the model is sensitive to specific samples of the training set.
Regularization aims to balance the trade-off between bias and variance and improve the
prediction accuracy and robustness of the model. Regularization prevents overfitting,
making models generalize better on unseen data by penalizing complexity. There is a range of different regularization techniques. The most common approaches rely on
statistical methods such as Lasso regularization (also called L1 regularization), Ridge
regularization (L2 regularization), and Elastic Net regularization, which combines both
Lasso and Ridge techniques.
Ridge regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). In multicollinearity, even though the least
squares estimates (OLS) are unbiased, their variances are large, which deviates the
observed value far from the true value. By adding a degree of bias to the regression
estimates, ridge regression reduces the standard errors. A simple linear regression can be represented as y = a + b*x. This equation also has an error term, so the complete equation becomes
y = a + b*x + e (error term)
[the error term is the value needed to correct for the prediction error between the observed and predicted value]. For multiple independent variables, it becomes
y = a + b1*x1 + b2*x2 + ... + e
In a linear equation, prediction errors can be decomposed into two sub components. The
first is due to the bias, and Second is due to the variance. A prediction error can occur due
to any one of these two or both components.
Ridge regression solves the multicollinearity problem through the shrinkage parameter λ (lambda). The ridge objective is to minimize
Σ (y − ŷ)² + λ Σ β²
In this equation, we have two components. The first one is the least-squares term. The second one is lambda times the summation of β² (beta squared), where β is the coefficient. This is added to the least-squares term in order to shrink the parameters so that they have a very low variance.
The main advantage is to avoid overfitting. Our ultimate model is the one that can generalize patterns, i.e., works well on both the training and the testing datasets. Overfitting occurs when the trained model performs well on the training data but poorly on the testing data. Ridge regression works by applying a penalizing term (reducing the weights and biases) to overcome overfitting. Similar to Ridge Regression, Lasso (Least Absolute
Shrinkage and Selection Operator) also penalizes the absolute size of the regression
coefficients. It is capable of reducing the variability and improving the accuracy of linear
regression models.
Lasso regression differs from ridge regression in a way that it uses absolute values in the
penalty function instead of squares. This leads to penalizing (or equivalently constraining)
the sum of the absolute values of the estimates, which causes some of the parameter
estimates to turn out to be exactly zero. The larger the penalty applied, the further the estimates shrink toward zero.
This results in a variable selection out of the given n variables. L1 regularization is more
robust than L2 regularization, for a fairly obvious reason. L2 regularization takes the square
of the weights, so the cost of outliers present in the data increases quadratically. L1
regularization takes the absolute values of the weights, so the cost only increases linearly.
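A hedged sketch using the glmnet package (assumed to be installed): alpha = 0 gives Ridge (L2) and alpha = 1 gives Lasso (L1). The mtcars data is used only for illustration.

library(glmnet)

x <- as.matrix(mtcars[, -1])   # predictors
y <- mtcars$mpg                # response

ridge <- cv.glmnet(x, y, alpha = 0)   # cross-validated Ridge regression
lasso <- cv.glmnet(x, y, alpha = 1)   # cross-validated Lasso regression

coef(ridge, s = "lambda.min")   # coefficients shrunk, but none exactly zero
coef(lasso, s = "lambda.min")   # some coefficients driven exactly to zero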
9. What do you mean by SVM? Discuss the various kernel methods used in SVM.
An SVM (Support Vector Machine) is a supervised learning algorithm that analyses the trends and characteristics of the data set. SVM solves problems related to classification, regression, and outlier detection. All of these are common tasks in machine learning. SVM is based on the learning framework of VC theory (Vapnik-Chervonenkis theory). Examples: detecting cancerous cells based on millions of images, or predicting future driving routes with a well-fitted regression model. These are
just math equations tuned to give you the most accurate answer possible as quickly as
possible.
SVMs are different from other classification algorithms because of the way they choose
the decision boundary that maximizes the distance from the nearest data points of all the
classes. The decision boundary created by SVMs is called the maximum margin classifier
or the maximum margin hyperplane. Support Vector Machines is a strong and powerful
algorithm that is best used to build machine learning models with small data sets. SVM is
able to generalize the characteristics that differentiate the training data that is provided to
the algorithm. This is achieved by checking for a boundary that differentiates the two
classes by the maximum margin.
The boundary that separates the 2 classes is known as a hyperplane. Even if the name has
a plane, if there are two features, this hyperplane can be a line; in 3D, it will be a plane;
and so on. The data points that are closest to the separable line are known as the support
vectors. The distance of the support vectors from the separable line determines the best
hyperplane. We followed some basic rules for the most optimum separation, and they are:
Maximum Separation: We selected the line that is able to segregate all the data points
into corresponding classes, and that is how we narrowed it down to A, B, C, and D.
Best Separation: Out of the lines A, B, C, and D, we chose the one that has the maximum
margin to classify the data points. This is the line that has the maximum width from the
corresponding support vectors (the data points that are the closest). When the data is not linearly separable in its original space, it is projected into a higher-dimensional space where such a separating hyperplane can be found; this is what is known as a kernel trick.
In a kernel trick, the data is projected into higher dimensions, and then a plane is
constructed so that the data points can be segregated. Some of these kernels are RBF,
Sigmoid, Poly, etc. Using the respective kernel, we will be able to tune the data set to get
the best hyperplane to separate the training data points.
Kernels, or kernel methods (also called kernel functions), are sets of different types of
algorithms that are being used for pattern analysis. They are used to solve a non-linear
problem by using a linear classifier. Kernel methods are employed in SVM (Support
Vector Machines), which are used in classification and regression problems. The SVM
uses what is called a “Kernel Trick,” where the data is transformed and an optimal
boundary is found for the possible outputs.
Linear Kernel
These are commonly recommended for text classification because most of these types of
classification problems are linearly separable. The linear kernel works really well when
there are a lot of features, and text classification problems have a lot of features. Linear
kernel functions are faster than most of the others and you have fewer parameters to
optimize.
If there are two vectors named x1 and x2, then the linear kernel is defined by the dot product of these two vectors: K(x1, x2) = x1 · x2.
Polynomial Kernel
The polynomial kernel isn't used in practice very often because it isn't as computationally
efficient as other kernels, and its predictions aren't as accurate. A polynomial kernel is
defined by the following equation:
K(x1, x2) = (x1 · x2 + 1)^d, where d is the degree of the polynomial and x1 and x2 are vectors.
Gaussian (RBF) Kernel
The Gaussian (radial basis function) kernel is defined as K(x1, x2) = exp(−γ ||x1 − x2||²), where γ is often written as 1/(2σ²). The given sigma plays a very important role in the performance of the Gaussian kernel: it should neither be overestimated nor underestimated, and it should be carefully tuned according to the problem. In this equation, gamma specifies how much influence a single training point has on the other data points around it, and ||x1 − x2|| is the Euclidean distance between the two feature vectors.
Sigmoid
The sigmoid kernel is more useful in neural networks than in support vector machines, but there are occasional specific use cases. The function for a sigmoid kernel is K(x1, x2) = tanh(α · (x1 · x2) + c). In this function, alpha is a scaling (weight) parameter and c is an offset value to account for some misclassification of data that can happen.
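A hedged sketch using the e1071 package (assumed to be installed) to fit SVMs with different kernels on the built-in iris data; the kernel parameters are illustrative, not tuned.

library(e1071)

m_linear <- svm(Species ~ ., data = iris, kernel = "linear")
m_rbf    <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.5)
m_poly   <- svm(Species ~ ., data = iris, kernel = "polynomial", degree = 3)
m_sigm   <- svm(Species ~ ., data = iris, kernel = "sigmoid")

# compare training accuracy of the four kernels
sapply(list(linear = m_linear, rbf = m_rbf, poly = m_poly, sigmoid = m_sigm),
       function(m) mean(predict(m, iris) == iris$Species))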
10. What do you mean by dimensionality reduction? Explain its advantages. Discuss steps
of PCA.
In pattern recognition, Dimension Reduction is defined as
● It is the process of converting a data set with vast dimensions into a data set with
smaller dimensions.
● It ensures that the converted data set conveys similar information concisely.
Example: Consider the following example.
● The following graph shows two dimensions, x1 and x2.
● x1 represents the measurement of several objects in cm.
● x2 represents the measurement of several objects in inches.
It can also be described as the process of selecting a subset of features for use in model construction. It is:
1. A preprocessing step for data in machine learning, and
2. Useful for both supervised and unsupervised learning problems.
It is needed because
True Dimension <<< Observed Dimensions
● i.e., there is an abundance of redundant and irrelevant features.
Curse of Dimensionality
● With a fixed number of training samples, the predictive power reduces as the dimensionality increases [Hughes Phenomenon].
● With d binary variables, the number of possible combinations is O(2^d).
Value of Analytics
● Descriptive → Diagnostic → Predictive → Prescriptive (in increasing order of value and complexity).
Benefits
Dimension reduction offers several benefits, such as:
● It compresses the data and thus reduces the storage space required.
● It reduces the time required for computation, since fewer dimensions require less computation.
● It eliminates redundant features.
● It improves the model's performance.
The two popular and well-known dimension reduction techniques
1. Principal Component Analysis (PCA)
2. Fisher Linear Discriminant Analysis (LDA)
Principal Component Analysis is a well-known dimension reduction technique. It transforms the variables into a new set of variables called principal components. It eliminates multicollinearity, but explicability is compromised. These principal components are linear combinations of the original variables and are orthogonal. The first principal component accounts for most of the possible variation in the original data; the second principal component does its best to capture the remaining variance in the data. There can be only two principal components for a two-dimensional data set. Principal component analysis (PCA) is thus a technique for reducing the dimensionality of datasets, exploiting the fact that the observations in these datasets have something in common.
When to use
● Excessive multicollinearity.
● Explanation of the predictors is not important.
● A slight overhead in implementation is okay.
● More suitable for unsupervised learning.
● A scree plot (elbow method) can be used to choose the number of components.
PCA is a 4 step process.
Starting with a dataset containing n dimensions (requiring n-axes to be represented):
1. Find a new set of basis functions (n-axes) where some axes contribute to most of the
variance in the dataset while others contribute very little.
2. Arrange these axes in the decreasing order of variance contribution.
3. Now, pick the top k axes to be used and drop the remaining n-k axes.
4. Now, project the dataset onto these k axes.
After these 4 steps, the dataset will be compressed from n-dimensions to just k-dimensions
(k<n).
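A minimal sketch of the four steps above using base R's prcomp() on the built-in mtcars data; keeping k = 2 components is an illustrative choice.

pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)  # find the new orthogonal axes
summary(pca)                      # axes are already ordered by variance explained
screeplot(pca, type = "lines")    # scree (elbow) plot to help choose k
k <- 2
reduced <- pca$x[, 1:k]           # keep the top-k axes and project the data onto them
head(reduced)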
UNIT-III: Mining Data Streams
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an
element is a member of a set.
For example, checking the availability of a username is a set membership problem, where the
set is the list of all registered usernames.
The price we pay for efficiency is that it is probabilistic in nature, which means, there might
be some False positive results.
False positive means, it might say that the given username is already taken, but actually it’s
not.
If the filter reports that an item is not present, then it is definitely not present (a true negative). If the filter reports that an item might be present, the result can be either a false positive or a true positive.
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k number of hash functions to calculate the hashes for a given input. When we
want to add an item in the filter, the bits at k indices h1(x), h2(x), … hk(x) are set, where
indices are calculated using hash functions.
insert(x) : To insert an element in the Bloom Filter.
lookup(x) : to check whether an element is already present in the Bloom filter, with a certain false-positive probability.
NOTE : We cannot delete an element in Bloom Filter.
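A small illustrative Bloom filter in base R; the hash functions below are simple polynomial hashes chosen for the sketch and are not production quality.

bloom_new <- function(m = 32L, k = 3L) list(bits = integer(m), m = m, k = k)

# k simple hash functions: polynomial rolling hashes with a different seed per i
bloom_hashes <- function(x, m, k) {
  codes <- utf8ToInt(x)
  sapply(seq_len(k), function(i) {
    h <- 0
    for (cc in codes) h <- (h * (31 + i) + cc) %% m
    h + 1                      # 1-based index into the bit array
  })
}

bloom_insert <- function(bf, x) {
  bf$bits[bloom_hashes(x, bf$m, bf$k)] <- 1L   # set the k bits for this item
  bf
}

bloom_lookup <- function(bf, x) {
  all(bf$bits[bloom_hashes(x, bf$m, bf$k)] == 1L)  # TRUE = possibly present
}

bf <- bloom_new()
bf <- bloom_insert(bf, "alice")
bf <- bloom_insert(bf, "bob")
bloom_lookup(bf, "alice")   # TRUE  (possibly present)
bloom_lookup(bf, "carol")   # FALSE (definitely not present), barring a false positive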
Medium uses Bloom filters for recommending posts to users by filtering out posts that have already been seen by the user.
Quora implemented a shared bloom filter in the feed backend to filter out stories that
people have seen before.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs
Google Bigtable, Apache HBase and Apache Cassandra, and PostgreSQL use Bloom
filters to reduce the disk lookups for non-existent rows or columns
2. Image Data –
Satellites frequently send down to Earth streams containing many terabytes of images per day. Surveillance cameras generate images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals of one second.
Real-Time Analytics:
Real-time data analytics lets users see, examine, and understand data as it enters a system. Logic and mathematics are applied to the data so that it can give users the insight needed for making real-time decisions.
Overview:
Real-time analytics permits businesses to gain awareness and take action on data immediately or soon after the data enters their system. Real-time app analytics answer queries within seconds. They handle a large amount of data with high velocity and low response time. For example, real-time big data analytics uses data in financial databases to inform trading decisions. Analytics can be on-demand or continuous. On-demand analytics delivers results when the user requests them. Continuous analytics updates users as events happen and can be programmed to respond automatically to certain events. For example, real-time web analytics might alert an administrator if page-load performance goes beyond the preset threshold.
Examples –
Examples of real-time customer analytics include the following:
1. Viewing orders as they happen for better tracking and to identify trends.
2. Continually updating customer activity, such as page views and shopping cart use, to understand user behaviour.
3. Targeting customers with promotions as they shop for items in a store, enabling real-time decisions.
The functioning of real-time analytics :
Real-time data analytics tools can either push or pull data. Streaming requires the ability to push huge amounts of fast-moving data. When streaming consumes too many resources and isn't practical, data can be pulled at intervals that can range from seconds to hours. The pull can happen in between business needs, so computing resources have to be planned in order not to disrupt operations. The response times for real-time analytics can vary from nearly instantaneous to a few seconds or minutes. The components of real-time data analytics include the following:
· Aggregator
· Broker
· Analytics engine
· Stream processor
Stream Computing:
Stream computing refers to a high-performance computer system that analyzes multiple data streams from many sources. The word "stream" in stream computing is used to mean pulling in streams of data, processing the data, and streaming it back out as a single flow. Stream computing uses software algorithms that analyze the data in real time as it streams in, to increase speed and accuracy when dealing with data handling and analysis.
· Stream computing is a computing paradigm that reads data from collections of software
or hardware sensors in stream form and computes continuous data streams.
· Stream computing uses software programs that compute continuous data streams.
· Stream computing uses software algorithms that analyze the data in real time.
· Stream computing is one effective way to support Big Data by providing extremely low-
latency velocities with massively parallel processing architectures.
· It is becoming the fastest and most efficient way to obtain useful knowledge from Big
Data.
UNIT IV: Frequent Itemsets and Clustering
Q.3 Explain the different types of market basket analysis.
Ans- Predictive market basket analysis- Although “predict” and “analysis” make up the word
predictive analysis, it actually works in reverse. It first analyses and then predicts what the
future holds. This type utilizes supervised learning models like regression and
classification. It is a valuable tool for marketers, even if it is less used than descriptive
market basket analysis. It considers items purchased in sequence to evaluate cross-sell. For
instance, when a consumer purchases a laptop, they are more likely to buy an extended
warranty with it. This analysis thus helps in recognizing those considered items in a
sequence so they can be sold together. It finds application in the retail industry, mainly to
determine the item baskets that are purchased together.
Differential market basket analysis- Differential market basket analysis is a great tool
for competitive analysis that can help you determine why consumers prefer to purchase the
same product from a particular platform even when they are labelled with the same price
on both platforms. This decision of the consumers is often based on several factors, such
as-
● Delivery time
● User experience
● Purchase history between stores, seasons, time periods, and others.
Q.4 Discuss the Apriori algorithm for frequent item set mining.
Ans- Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent item
sets in a dataset for the Boolean association rule. Apriori is designed to operate on databases
containing transactions (collections of items bought by customers, details of website accesses,
etc.). For example, the items customers buy at a big bazaar. The Apriori algorithm helps customers
buy their products with ease and increases the sales performance of the particular store.
Since it makes use of previous knowledge about common item features, the method is referred to
as apriori. This is achieved by the use of an iterative technique or level-wise approach in which k-
frequent itemsets are utilized to locate k+1 itemsets. An essential feature known as the apriori
property is utilized to boost the effectiveness of level-wise production of frequent itemsets. This
property helps by minimizing the search area, which in turn serves to maximize the productivity
of level-wise creation of frequent patterns. Each transaction is seen as a set of items (an itemset).
Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C
transactions in the database. Apriori uses a "bottom-up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation), and groups of candidates are
tested against the data. The algorithm terminates when no further successful extensions are found.
Algorithm
Step 1: Create a list of all the elements that appear in every transaction and create a frequency
table.
Step 2: Set the minimum level of support. Only those elements whose support exceeds or equals
the threshold support are significant.
Step 3: All potential pairings of important elements must be made, bearing in mind that AB and
BA are interchangeable.
Step 4: Tally the number of times each pair appears in a transaction.
Step 5: Only those sets of data that meet the criterion of support are significant.
Step 6: Now, find a set of three things that may be bought together. A rule, known as self-join, is
needed to build a three-item set. The item pairings OP, OB, PB, and PM state that two
combinations with the same initial letter are sought from these sets. OPB is the result of OP and
OB, and PBM is the result of PB and PM.
Step 7: When the threshold criterion is applied again, get the significant itemset.
Q. 5 Find the frequent item sets from given data using Apriori algorithm.
Ans- Step-1: In the first step, we index the data and then calculate the support for each item; if the support is less than the minimum value, we eliminate that item from the table.
Step-2: Calculate the support for each one
Step-3: Continue to calculate the support and select the best answer
Step-4: Find frequent item sets and make association rules.
FIS= {Milk, Bread, Eggs}
Association Rules are:
{Milk^Bread→Eggs},{Milk^Eggs→Bread},{Bread^Eggs→Milk}
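A hedged sketch using the arules package (assumed to be installed); the small transaction list and the support/confidence thresholds are illustrative, not the values from the question's table.

library(arules)

txns <- list(
  c("Milk", "Bread", "Eggs"),
  c("Milk", "Bread"),
  c("Bread", "Eggs"),
  c("Milk", "Eggs", "Butter"),
  c("Milk", "Bread", "Eggs")
)
trans <- as(txns, "transactions")

# mine frequent itemsets and association rules with minimum support and confidence
items <- apriori(trans, parameter = list(supp = 0.4, target = "frequent itemsets"))
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))
inspect(items)
inspect(rules)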
Q.7 Apply the PCY algorithm on the following transaction to find the candidate sets
(frequent sets) with threshold minimum value as 3 and Hash function as (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
Ans- Step 1: Find the frequency of each element and remove the candidate set having length 1.
Item:       1  2  3  4  5  6
Frequency:  4  7  7  8  6  4
Step 2: Hash each pair (i, j) that occurs in the transactions into a bucket using the hash function (i*j) mod 10.
Step 3: Count the support of the pairs in each bucket and keep the buckets that meet the minimum threshold of 3 (bit vector = 1).
Bucket no.   Pair
2            (3, 4)
3            (1, 3)
4            (4, 6)
5            (3, 5)
6            (2, 3)
8            (2, 4)
Step 4: Prepare the candidate set from the frequent buckets.
Bit vector   Bucket no.   Highest support count   Pair      Candidate set
1            2            4                        (3, 4)    (3, 4)
1            3            3                        (1, 3)    (1, 3)
1            4            4                        (4, 6)    (4, 6)
1            5            3                        (3, 5)    (3, 5)
1            6            3                        (2, 3)    (2, 3)
1            8            4                        (2, 4)    (2, 4)
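A small base-R sketch of the PCY bucket-counting step on the transactions given above, using the stated hash function (i*j) mod 10 and threshold 3; note that PCY counts the total number of pairs hashed into each bucket.

transactions <- list(
  c(1,2,3), c(2,3,4), c(3,4,5), c(4,5,6), c(1,3,5), c(2,4,6),
  c(1,3,4), c(2,4,5), c(3,4,6), c(1,2,4), c(2,3,5), c(2,4,6)
)

bucket_counts <- integer(10)              # buckets 0..9 kept at indices 1..10
for (t in transactions) {
  pairs <- combn(sort(t), 2)              # every item pair in the transaction
  for (p in seq_len(ncol(pairs))) {
    b <- (pairs[1, p] * pairs[2, p]) %% 10
    bucket_counts[b + 1] <- bucket_counts[b + 1] + 1
  }
}

# bit vector: a bucket is kept if its count meets the minimum threshold of 3
data.frame(bucket = 0:9, count = bucket_counts, bit = as.integer(bucket_counts >= 3))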
Q.8 Explain how frequent item sets can be found using MapReduce (the SON algorithm).
Ans- Pass 1:
Map (key, value)
• Find the item sets that are frequent in the partition (sample) p, using the lowered support threshold p*s.
• The output is a key-value pair (F, 1), where F is a frequent item set from the sample.
Reduce (key, value)
• Values are ignored, and the reduce task produces those keys (candidate item sets) that appear one or more times.
Pass 2:
Map (key, value)
• Count the occurrences of the candidate item sets. For each item set in the candidate item sets:
• Emit (item set, support(item set))
Reduce (key, value)
• result = 0
• For each value in values:
• result = result + value
• If result >= s, then emit (key, result)
Q.9 Explain the working of the K-means clustering algorithm.
Ans- 1. Initialization: Start by randomly selecting K points from the dataset. These points will act
as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that point
and each of the K centroids. Assign the data point to the cluster whose centroid is closest
to it. This step effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids
no longer change significantly or when a specified number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.
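The steps above can be reproduced with base R's kmeans(); the dataset (iris measurements) and K = 3 are illustrative choices.

set.seed(42)                               # reproducible random initialization
x <- iris[, c("Sepal.Length", "Petal.Length")]
km <- kmeans(x, centers = 3, nstart = 10)  # 10 random starts, best result kept
km$centers                                 # final cluster centroids
table(km$cluster)                          # how many points fall in each cluster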
Q.10 What is Hierarchical Clustering? Discuss its applications, types and algorithmic steps.
Ans- Hierarchical clustering is a popular method for grouping objects. It creates groups so that
objects within a group are similar to each other and different from objects in other groups. Clusters
are visually represented in a hierarchical tree called a dendrogram. Hierarchical clustering has a
couple of key benefits:
1. There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut
at the appropriate level to obtain the desired number of clusters.
2. Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms
make it easy to examine and interpret clusters.
Applications
There are many real-life applications of Hierarchical clustering. They include:
● It helps to obtain confidence in the data to a point where you’re ready to engage a
machine learning algorithm.
● It allows us to refine the selection of feature variables that will be used later for model building.
Types of hierarchical clustering
1. Agglomerative (bottom-up): each observation starts in its own cluster, and the two closest clusters are merged repeatedly until only one cluster remains.
2. Divisive (top-down): all observations start in one cluster, which is split recursively into smaller clusters.
Algorithmic steps (agglomerative)
1. Treat each data point as a single cluster and compute the distance (proximity) matrix.
2. Merge the two closest clusters.
3. Update the distance matrix to reflect the distance between the new cluster and the remaining clusters.
4. Repeat steps 2 and 3 until only a single cluster remains; the sequence of merges forms the dendrogram.
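A minimal base-R sketch of agglomerative hierarchical clustering following the steps above; the built-in USArrests data, Euclidean distance, and complete linkage are illustrative choices.

x <- scale(USArrests)                 # standardize the variables
d <- dist(x, method = "euclidean")    # step 1: distance (proximity) matrix
hc <- hclust(d, method = "complete")  # steps 2-4: repeatedly merge the closest clusters
plot(hc)                              # dendrogram
cutree(hc, k = 4)                     # cut the dendrogram to obtain 4 clusters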
UNIT-V: Frameworks and Visualization
Q2. Write a short note on Hadoop and also list its advantages.
Hadoop is an open-source framework from the Apache Software Foundation that stores and processes very large datasets across clusters of commodity hardware, using HDFS for distributed storage and the MapReduce programming model (with YARN) for distributed processing.
Advantages of Hadoop:
● 1. Fast:
a. In HDFS (Hadoop Distributed File System), data is dispersed across the cluster and mapped, allowing for faster retrieval.
b. Even the tools used to process the data are frequently hosted on the same servers, decreasing processing time.
● 2. Scalable: Hadoop clusters can be extended by just adding nodes to the cluster.
● 3. Cost-effective: Hadoop is open source and uses commodity technology to store data,
making it significantly less expensive than traditional relational database management
systems.
● 4. Resilient to failure: HDFS has the ability to duplicate data across networks, so if one
node fails or there is another network failure, Hadoop will use the other copy of the data.
● 5. Flexible:
a. Hadoop allows businesses to readily access new data sources and tap into various sorts of data in order to derive value from that data.
b. It aids in the extraction of useful business insights from data sources such as social
media, email discussions, data warehousing, fraud detection, and market campaign
analysis.
Q3. Explain the architecture of Hive.
Hive architecture: The following architecture explains the flow of query submission into Hive.
Hive client: Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients, including:
● 1. Thrift Server: It is a platform for cross-language service providers that serves requests
from any programming languages that implement Thrift.
● 2. JDBC Driver: It is used to connect Java applications and Hive. The class org.apache.hadoop.hive.jdbc.HiveDriver contains the JDBC driver.
● 3. ODBC Driver: It allows the applications that support the ODBC protocol to connect to
Hive.
Q4. List the features of HBase.
Features of HBase:
1. It is both linearly and modularly scalable across several nodes because it is spread across
multiple nodes.
2. HBase provides consistent read and write performance.
3. It enables atomic read and write, which implies that while one process is reading or
writing, all other processes are blocked from completing any read or write actions.
4. It offers a simple Java API for client access.
5. It offers Thrift and REST APIs for non-Java front ends, with options for XML, Protobuf,
and binary data encoding.
6. It has a Block Cache and Bloom Filters for real-time query optimisation as well as large
volume query optimisation.
7. HBase supports automatic failover across region servers.
8. It supports exporting measurements to files using the Hadoop metrics subsystem.
9. It does not enforce data relationships.
10. It is a data storage and retrieval platform with random access.
Q5. Write a short note on R programming language and its features.
R is an open-source programming language and software environment for statistical computing, data analysis, and graphics. Its main features include: it is open source and cross-platform; it has a rich ecosystem of packages (CRAN) for statistics, machine learning, and visualization; it provides built-in data structures such as vectors, lists, matrices, and data frames; it has strong data visualization capabilities; and it integrates easily with other languages and data sources.
Q6. Explain the interaction techniques used in data visualization.
Interaction approaches enable the data analyst to interact with the visuals and dynamically adjust them based on the exploration objectives. They also enable the linking and combining of multiple independent visualizations.
a. Dynamic projection: the projections of multidimensional data are changed dynamically so that the data can be explored from different viewpoints.
b. Interactive filtering: subsets of the data are interactively selected (browsing) or filtered out (querying) so the analyst can focus on interesting regions.
c. Zooming: the level of detail is varied; zooming in shows a smaller portion of the data at a higher level of detail.
d. Distortion: a portion of the data is shown with high detail while the rest is shown with low detail (for example, fisheye views), which preserves an overview during drill-down.
Q7. Differentiate between MapReduce and Apache Pig.
● MapReduce: It is difficult to perform data operations in MapReduce. Apache Pig: It has built-in operators for performing data operations such as union, sorting, and ordering.
● MapReduce: It does not allow nested data types. Apache Pig: It provides nested data types such as tuple, bag, and map.
Q8. What are the main differences between HDFS and S3?
Q9. Write R function to check whether the given number is Prime or not.
Find_Prime_No <- function(n1) {
  if (n1 <= 1) {
    return(FALSE)
  }
  if (n1 == 2) {
    return(TRUE)
  }
  for (i in 2:(n1 - 1)) {
    if (n1 %% i == 0) {
      return(FALSE)
    }
  }
  return(TRUE)
}
numb_1 <- 13
if (Find_Prime_No(numb_1)) {
  print(paste(numb_1, "is a prime number"))
} else {
  print(paste(numb_1, "is not a prime number"))
}