
Question Bank

Course: Data Analytics Course Code: KIT-601


Programme: B. Tech. (IT) Semester: 6th

Unit 1 Introduction to Data Analytics

UNIT-1

1. Which are the technical tools that you have used for analysis and presentation
purposes?

Some of the popular tools you should know are:


● MS SQL Server, MySQL
For working with data stored in relational databases
● MS Excel, Tableau
For creating reports and dashboards
● Python, R, SPSS
For statistical analysis, data modeling, and exploratory analysis
● MS PowerPoint
For presentation, display the final results and important conclusions

2. What are the common problems that data analysts encounter during analysis?

Some of the common problems that data analysts encounter during analysis are:

● Handling duplicate data
● Collecting meaningful data at the right time
● Handling data purging and storage problems
● Making data secure and dealing with compliance issues

3. What are your strengths and weaknesses as a data analyst?

Some general strengths of a data analyst may include strong analytical skills, attention to
detail, proficiency in data manipulation and visualization, and the ability to derive insights
from complex datasets.
Weaknesses could include limited domain knowledge, a lack of experience with certain
data analysis tools or techniques, or challenges in effectively communicating technical
findings to non-technical stakeholders.

4. What are the steps involved when working on a data analysis project?
Many steps are involved when working end-to-end on a data analysis project. Some of the
important steps are listed below:

● Problem statement
● Data cleaning/preprocessing
● Data exploration
● Modeling
● Data validation
● Implementation
● Verification

5. What are the best methods for data cleaning?

● Create a data cleaning plan by understanding where the common errors take place
and keeping all communications open.
● Before working with the data, identify and remove the duplicates. This will lead to
an easy and effective data analysis process.
● Focus on the accuracy of the data. Set cross-field validation, maintain the value types
of data, and provide mandatory constraints.
● Normalize the data at the entry point so that it is less chaotic. You will be able to
ensure that all information is standardized, leading to fewer errors on entry.
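
The de-duplication and normalization steps above can be sketched in a few lines of pandas. This is only a minimal illustration; the column names (name, city) and the sample data frame are hypothetical, not part of the original answer.

```python
import pandas as pd

# Hypothetical raw data with duplicates and inconsistent formatting
df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Bob"],
    "city": ["Delhi", "delhi", "Mumbai", "Mumbai"],
})

# Normalize at the entry point: trim whitespace and standardize case
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()

# Identify and remove duplicates before analysis
df = df.drop_duplicates()

# Basic accuracy check: enforce mandatory (non-null) constraints
assert df["name"].notna().all() and df["city"].notna().all()
print(df)
```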

6. What are the ethical considerations of data analysis?

Some of the most important ethical considerations in data analysis include:

● Privacy: Safeguarding the privacy and confidentiality of individuals' data and
ensuring compliance with applicable privacy laws and regulations.
● Informed Consent: Obtaining informed consent from individuals whose data is
being analyzed, explaining the purpose and potential implications of the analysis.
● Data Security: Implementing robust security measures to protect data from
unauthorized access, breaches, or misuse.
● Data Bias: Being mindful of potential biases in data collection, processing, or
interpretation that may lead to unfair or discriminatory outcomes.
● Transparency: Being transparent about the data analysis methodologies, algorithms,
and models used enables stakeholders to understand and assess the results.
● Data Ownership and Rights: Respecting data ownership rights and intellectual
property, using data only within the boundaries of legal permissions or agreements.
● Accountability: taking responsibility for the consequences of data analysis and
ensuring that actions based on the analysis are fair, just, and beneficial to individuals
and society.
● Data Quality and Integrity: Ensuring the accuracy, completeness, and reliability of
data used in the analysis to avoid misleading or incorrect conclusions.
● Social Impact: Considering the potential social impact of data analysis results,
including potential unintended consequences or negative effects on marginalized
groups.
● Compliance: Adhering to legal and regulatory requirements related to data analysis,
such as data protection laws, industry standards, and ethical guidelines.

7. Explain the classification of data?

1. Structured Data
● Data having a pre-defined structure or schema, which can also be categorized as
quantitative data and is well-organized, is defined as Structured Data.
● Because it has a pre-defined structure property, data can be organized into tables—
columns and rows—just like in spreadsheets.
● Most of the time, when the data has relationships or is too large to be stored in
spreadsheets, structured data is stored in relational databases.

Tools for working with structured Data

1. PostgreSQL
2. SQLite
3. MySQL
4. Oracle Database
5. Microsoft SQL Server

Structured Data Use Cases

1. Customer Relationship Management


2. Online Booking
3. Accounting

2. Unstructured Data

● Unstructured data is typically categorized as qualitative rather than quantitative.


● It doesn’t have a pre-defined structure or specific format.
● Data in this category includes audio, video, images, and text file contents. These
have different properties, cannot be stored in relational databases (how would you
store images in interrelated spreadsheets?), and, as you can see, lack attributes and
relationships between each other.
● So these are stored in their raw format, and analysis is done by applying image
processing, natural language processing, and machine learning.
● Tools for working with Unstructured Data
1. MongoDB
2. Hadoop
3. Data Lake
● Unstructured Data Use Cases
1. Data Mining
2. Predictive Data Analytics
3. Chatbots

3. Semi-Structured Data

● Semi-structured data contains elements of both structured and unstructured data; its
schema is not fixed like that of structured data, but with the help of metadata (which
enables users to define some partial structure or hierarchy), it can be organized to some
extent, so it is not as unorganized as unstructured data.
● Metadata includes tags and other markers, just like in JSON, XML, or CSV, which
separate the elements and enforce a hierarchy, but the size of the elements varies and
order is not important.
● Tools for working with Semi-Structured Data
1. Cassandra
2. MongoDB
● Semi-Structured Data Use Cases
1. E-commerce
2. For mobile phones: {“storage”: “64GB”, “network”: “5G”, “color”: “black”}
3. For books: {“publisher”: “Oxford Press”, “writer”: “John Doe”, “pages”: 250}
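
As a small illustration of how such tagged, semi-structured records are handled in practice, the two JSON-style examples above can be parsed with Python's standard json module. The field names are taken from the examples; everything else is a minimal sketch.

```python
import json

# The two semi-structured records from the use cases above
records = [
    '{"storage": "64GB", "network": "5G", "color": "black"}',
    '{"publisher": "Oxford Press", "writer": "John Doe", "pages": 250}',
]

for raw in records:
    doc = json.loads(raw)      # metadata tags become dictionary keys
    print(sorted(doc.keys()))  # each record can have a different set of keys
```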

8. Differentiate between analysis and reporting.

Analytics: Analytics is the method of examining and analyzing summarized data to make business decisions.
Reporting: Reporting is an action that includes all the needed information and data, put together in an organized way.

Analytics: Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics.
Reporting: Identifying business events, gathering the required information, organizing, summarizing, and presenting existing data are all part of reporting.

Analytics: The purpose of analytics is to draw conclusions based on data.
Reporting: The purpose of reporting is to organize the data into meaningful information.

Analytics: Analytics is used by data analysts, scientists, and business people to make effective decisions.
Reporting: Reporting is provided to the appropriate business leaders to perform effectively and efficiently within a firm.

9. What are the Modern Data Analytic Tools?

1. R and Python
2. Microsoft Excel
3. Tableau
4. RapidMiner
5. KNIME
6. Power BI
7. Apache Spark
8. QlikView
9. Talend
10. Splunk

1. Programming Languages: R & Python
● R and Python are the top programming languages used in the Data Analytics field.
● R is an open-source tool used for Statistics and analytics, whereas Python is a high-level,
interpreted language that has easy syntax and dynamic semantics.
● Products: Both R and Python are completely free, and you can easily download both of
them from their respective official websites.
● Companies using R and Python: Companies such as ANZ, Google, and Firefox use R, while
other multinational companies such as YouTube, Netflix, and Facebook use Python.
● Recent Advancements and Features: Python and R keep developing their features and
functionalities to ease the process of data analysis with high speed and accuracy. They
come up with frequent releases containing updated features.
● Pros: They work on any platform, are very compatible, and have a lot of packages.
● Cons: R is slower, less safe, and harder to pick up than Python.
2. Microsoft Excel
● Microsoft Excel is a platform that will help you get better insights into your data.
● Being one of the most popular tools for data analytics, Microsoft Excel provides users with
features such as sharing workbooks, working on the latest version for real-time
collaboration, adding data to Excel directly from a photo, and so on.
● Products: Microsoft Excel offers products in the following three categories:
● For Home
● For Business
● For Enterprises
● Companies using: Almost all organizations use Microsoft Excel on a daily basis to gather
meaningful insights from the data. A few of the popular names are McDonald’s, IKEA,
and Marriot.
● Recent Advancements and Features: The recent advancements vary depending on the
platform. A few of the recent advancements on the Windows platform are as follows:
● You can get a snapshot of your workbook with Workbook Statistics
● You can give your documents more flair with backgrounds and high-quality stock images
absolutely for free
● Pros: It’s used by a lot of people and has a lot of useful features and plug-ins.
● Cons: Cost, mistakes in calculations, and being bad at handling big amounts of data.
3. Power BI
● Power BI is a Microsoft product used for business analytics.
● Named as a leader for the 13th consecutive year in the Gartner 2024 Magic Quadrant, it
provides interactive visualizations with self-service business intelligence capabilities,
where end users can create dashboards and reports by themselves without having to depend
on anybody.
● Products: Power BI provides the following products:
● Power BI Desktop
● Power BI Pro
● Power BI Premium
● Power BI Mobile
● Power BI Embedded
● Power BI Report Server
● Companies using Power BI: Multinational organizations such as Adobe, Heathrow, Worldsmart,
and GE Healthcare are using Power BI to achieve powerful results from their data.
● Recent Advancements/ Features:

Power BI has recently come up with solutions such as Azure + Power BI and Office 365 +
Power BI to help users analyze the data, connect the data, and protect the data across various
Office platforms.

● Pros: It's fast, interactive, and works on mobile devices.
● Cons: No pre-processing of data and poor version control.
4. Tableau
● Tableau is a market-leading business intelligence tool used to analyze and visualize data in
an easy format.
● Having been named a leader in the Gartner Magic Quadrant 2024 for the eighth consecutive
year, Tableau allows you to work on live data sets and spend more time on data analysis
than on data wrangling.
● Products in the Tableau Product Family include the following:
● Tableau Desktop
● Tableau Server
● Tableau Online
● Tableau Reader
● Tableau Public
● Companies using Tableau: Multinational organizations such as Citibank, Deloitte, Skype,
and Audi use Tableau to visualize their data and generate meaningful insights.
● Recent Advancements and Features: Tableau is coming up with frequent updates to provide
users with the following:
● Fast Analytics
● Smart Dashboards
● Update Automatically
● Ease of Use
● Explore any data
● Publish a dashboard and share it live on the web and on mobile devices.
● Pros: It's fast, interactive, and works on mobile devices.
● Cons: No pre-processing of data and poor version control.
5. KNIME
● Konstanz Information Miner, most commonly known as KNIME, is a free and open-source
data analytics, reporting, and integration platform built for analytics on a GUI-based
workflow.
● Products: KNIME provides the following two products:
● KNIME Analytics Platform is open-source and used to clean & gather data, make reusable
components accessible to everyone, and create Data Science workflows.
● KNIME Server is a platform used by enterprises for the deployment of data science
workflows, team collaboration, management, and automation.
● Companies such as Siemens, Novartis, Deutsche Telekom, and Continental use KNIME to
make sense of their data and leverage meaningful insights.
● Recent Advancements and Features: You do not need prior programming knowledge to
use KNIME and derive insights. You can work all the way from gathering data and creating
models to deployment and production.
● Pros: It is an open-source platform that is great for GUI-based, visual programming.
● Cons: It can't be scaled up easily, and some functions need technical knowledge.
6. RapidMiner
● Being named a Visionary in 2024 Gartner Magic Quadrant for Data Science and Machine
Learning Platforms, RapidMiner is a platform for data processing, building Machine
Learning models, and deployment.
● Products: The products of RapidMiner are as follows:
● Studio
● GO
● Server
● Real-Time Scoring
● Radoop
● Companies such as BMW, Hewlett-Packard Enterprise, EZCater, and Sanofi use
RapidMiner for their Data Processing and Machine Learning models.
● Recent Advancements and Features: Recently, RapidMiner has launched RapidMiner 9.6,
which has extended the platform to full-time coders and BI Users. It is a fully transparent,
end-to-end Data Science platform that enables data preparation, Machine Learning, and
model operations.
● Pros: User-Friendly, Extensive Data Analytics Capabilities, Strong Community Support.
● Cons: Limited Customization, Steep Learning Curve for Advanced Features, Resource-
Intensive.
7. Splunk
● Splunk is a platform used to search, analyze, and visualize the machine-generated data
gathered from applications, websites, etc.
● Being named by Gartner as a Visionary in the 2024 Magic Quadrant for APM, Splunk has
evolved products in various fields such as IT, security, DevOps, and analytics.
● Products
● Splunk Free
● Splunk Enterprise
● Splunk Cloud
● Companies using Splunk: Trusted by 92 of the Fortune 100 companies, Splunk is used by
companies such as Dominos, Otto Group, Intel, and Lenovo in their day-to-day practices to
discover processes and correlate data in real time.
● Recent Advancements and Features: Since almost all organizations need to deal with data
across various divisions, according to the official website Splunk aims to bring data to
every part of your organization by helping teams use Splunk to prevent and predict
problems with its monitoring experience, detect and diagnose issues with clear visibility,
explore and visualize business processes, and streamline the entire security stack.
● Pros: Scalable, Real-time Monitoring, Rich Ecosystem of Add-ons
● Cons: High Cost, Steep Learning Curve, Resource-Intensive
8. Talend
● Talend is one of the most powerful data integration ETL tools available on the market and
is developed in the Eclipse graphical development environment.
● Being named as a Leader in Gartner’s Magic Quadrant for Data Integration Tools and Data
Quality tools 2019, this tool lets you easily manage all the steps involved in the ETL
process and aims to deliver compliant, accessible and clean data for everyone.
● Products: Talend comes with the following five products:
1. Talend Open Source
2. Stitch Data Loader
3. Talend Pipeline Designer
4. Talend Cloud Data Integration
5. Talend Data Fabric
● Companies ranging from small startups to multinational companies such as ALDO,
ABInBev, EuroNext, and AstraZeneca are using Talend to make critical decisions.
● Recent Advancements and Features:

Talend delivers complete and clean data at the moment you need it by maintaining data quality,
providing big data integration, cloud API services, data preparation, a data catalog, and the
Stitch Data Loader.
Recently, Talend has also accelerated the journey to the lakehouse paradigm and the path to
revealing intelligence in data. Not only this, but the Talend Cloud is now available in the Microsoft
Azure Marketplace.

● Pros: Highly customizable, real-time data tracking, seamless integration with other tools.
● Cons: Steeper learning curve, limited out-of-the-box templates, higher initial setup costs.
9. QlikView
● QlikView is a self-service business intelligence, data visualization, and data analytics tool.
● Being named a leader in Gartner Magic Quadrant 2024 for analytics and BI platforms, it
aims to accelerate business value through data by providing features such as data
integration, data literacy, and data analytics.
● Products: QlikView comes with a variety of products and services for data integration, data
analytics, and developer platforms, of which few are available for a free trial period of 30
days.
● Companies using QlikView: Trusted by more than 50,000 customers worldwide, a few of
the top customers of QlikView are CISCO, NHS, KitchenAid, and Samsung.
● Recent advancements and features

Recently, Qlik has launched an intelligent alerting platform called Qlik Alerting for Qlik Sense®,
which helps organizations handle exceptions, notify users of potential issues, help users analyze
further, and also prompt actions based on the derived insights.
● Pros: Quick Data Visualization, Drag-and-Drop Functionality, Robust Analytics.
● Cons: Steep Learning Curve, Limited Customization, Higher Cost.
10. Apache Spark
● Apache Spark is one of the most successful projects in the Apache Software Foundation
and is a cluster computing framework that is open-source and is used for real-time
processing.
● Being the most active Apache project currently, it comes with a fantastic open-source
community and an interface for programming.
● This interface makes sure of fault tolerance and implicit data parallelism.
● Products: Apache Spark keeps releasing new versions with new features. You can also
choose among various package types for Spark. The most recent version is 2.4.5, and 3.0.0 is
in preview.
● Companies using: Companies such as Oracle, Hortonworks, Verizon, and Visa use Apache
Spark for real-time computation of data with ease of use and speed.
● Recent Advancements and Features
● In today’s world, Spark runs on Kubernetes, Apache Mesos, standalone, Hadoop, or in the
cloud.
● It provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written
in any of these four languages.
● Spark's MLlib machine learning component is handy when it comes to big data processing.
● Pros: Fast, dynamic, and easy to use.
● Cons: No file management system; rigid user interface.
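
The bullet points above note that Spark code can be written in Python through its high-level API. A minimal PySpark word-count sketch follows; it assumes the pyspark package is installed and is only an illustration of the API, not tied to any particular deployment (standalone, Kubernetes, Mesos, Hadoop, or cloud).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-demo").getOrCreate()

# A tiny, made-up text collection distributed as an RDD
lines = spark.sparkContext.parallelize([
    "spark provides high level apis",
    "spark supports java scala python and r",
])

# Classic word count: split, map to (word, 1) pairs, then reduce by key
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```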

10. Define the various phases of the data analytics life cycle.

To address the specific demands for conducting analysis on big data, a step-by-step
methodology is required to plan the various tasks associated with the acquisition, processing,
analysis, and recycling of data.
Phase 1: Discovery

● The data science team learns the business domain and researches the issue.


● Create context and gain understanding.
● Learn about the data sources that are needed and accessible to the project.
● The team comes up with an initial hypothesis, which can be later confirmed with
evidence.

Phase 2: Data Preparation

● The team investigates the possibilities of pre-processing, analysing, and preparing data
before analysis and modeling.
● It is required to have an analytic sandbox. The team executes extract, load, and transform
(ELT) steps to bring data into the sandbox.
● Data preparation tasks can be repeated, but not in a predetermined sequence.
● Some of the tools used commonly for this process include Hadoop, Alpine Miner, Open
Refine, etc.

Phase 3: Model Planning

● The team studies the data to discover the connections between variables. Later, it selects the
most significant variables as well as the most effective models.
● In this phase, the data science team creates data sets that can be used for training, testing,
and production purposes.
● The team builds and implements models based on the work completed in the model
planning phase.
● Some of the tools commonly used for this stage are MATLAB and STATISTICA.

Phase 4: Model Building

● The team creates datasets for training, testing, and production use.
● The team is also evaluating whether its current tools are sufficient to run the models or if
they require an even more robust environment to run models.
● [AKTU] Free or open-source tools: R and PL/R, Octave, WEKA.
● Commercial tools: MATLAB, STATISTICA.

Phase 5: Communication Results

● Following the execution of the model, team members will need to evaluate the outcomes
of the model to establish criteria for its success or failure.
● The team is considering how best to present findings and outcomes to the various
members of the team and other stakeholders while taking into consideration cautionary
tales and assumptions.
● The team should determine the most important findings, quantify their value to the
business, create a narrative to present findings, and summarize them for all stakeholders.

Phase 6: Operationalize:

● The team distributes the benefits of the project to a wider audience. It sets up a pilot
project that will deploy the work in a controlled manner prior to expanding the project to
the entire enterprise of users.
● This technique allows the team to gain insight into the performance and constraints
related to the model within a production setting on a small scale and then make necessary
adjustments before full deployment.
● The team produces the last reports, presentations, and codes.
● Open-source or free tools used in this phase include WEKA, SQL, MADlib, and Octave.

Unit 2 : Data Analysis

1. Describe univariate, bivariate, and multivariate analyses.

● Univariate analysis is the simplest and easiest form of data analysis, where the data being
analyzed contains only one variable.
Example: studying the heights of players in the NBA.
Univariate analysis can be described using central tendency, dispersion, quartiles, bar
charts, histograms, pie charts, and frequency distribution tables.

● Bivariate analysis involves the analysis of two variables to find causes, relationships, and
correlations between the variables.
Example: analyzing the sale of ice cream based on the temperature outside.
The bivariate analysis can be explained using correlation coefficients, linear regression,
logistic regression, scatter plots, and box plots.

● Multivariate analysis involves the analysis of three or more variables to understand the
relationship of each variable with the other variables.
Example: revenue based on expenditure.
Multivariate analysis can be performed using Multiple regression, Factor analysis,
Classification & regression trees, Cluster analysis, Principal component analysis, Dual-
axis charts, etc.
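
As a quick illustration of the three levels of analysis described above, the sketch below computes a univariate summary, a bivariate correlation, and a multivariate regression with NumPy and scikit-learn. The small temperature/ad-spend/sales arrays are made-up illustrative data, not taken from the question bank.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: temperature, advertising spend, and ice cream sales
temperature = np.array([21, 24, 27, 30, 33, 36])
ad_spend    = np.array([5, 4, 6, 7, 6, 8])
sales       = np.array([120, 135, 160, 190, 210, 245])

# Univariate analysis: describe a single variable (central tendency, dispersion)
print("mean:", sales.mean(), "std:", sales.std(), "median:", np.median(sales))

# Bivariate analysis: relationship between two variables (correlation coefficient)
print("corr(temperature, sales):", np.corrcoef(temperature, sales)[0, 1])

# Multivariate analysis: multiple regression with two or more predictors
X = np.column_stack([temperature, ad_spend])
model = LinearRegression().fit(X, sales)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```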

2. How is overfitting different from underfitting?

Overfitting: The model trains the data well using the training set.
Underfitting: Here, the model neither trains the data well nor can generalize to new data.

Overfitting: The performance drops considerably over the test set.
Underfitting: Performs poorly on both the train and the test set.

Overfitting: It happens when the model learns the random fluctuations and noise in the training dataset in detail.
Underfitting: This happens when there is less data to build an accurate model and when we try to develop a linear model using non-linear data.

3. Explain the concept of outlier detection and how you would identify outliers in a
dataset? How do you treat outliers in a dataset?

An outlier is a data point that is distant from other similar points. They may be due to
variability in the measurement or may indicate experimental errors.
Outlier detection is the process of identifying observations or data points that significantly
deviate from the expected or normal behavior of a dataset. Outliers can be valuable sources
of information or indications of anomalies, errors, or rare events.
It's important to note that outlier detection is not a definitive process, and the identified
outliers should be further investigated to determine their validity and potential impact on
the analysis or model. Outliers can be due to various reasons, including data entry errors,
measurement errors, or genuinely anomalous observations, and each case requires careful
consideration and interpretation.
Plotting the dataset (for example, as a box plot or scatter plot) makes such outliers easy to spot visually.

To deal with outliers, one can use the following four methods:

● Drop the outlier records


● Cap your outliers' data
● Assign a new value
● Try a new transformation
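
One common way to identify the outliers discussed above is the interquartile range (IQR) rule. The sketch below flags and then caps outliers; the 1.5 x IQR threshold is the conventional choice, and the sample values are hypothetical.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, -40])  # hypothetical data

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Identify outliers: points outside the 1.5 * IQR fences
outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)

# One treatment option: cap (winsorize) the outliers instead of dropping them
capped = np.clip(values, lower, upper)
print("capped:", capped)
```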

4. Differentiate between logistic and linear regression.

Linear Regression: Used to predict the continuous dependent variable using a given set of independent variables.
Logistic Regression: Used to predict the categorical dependent variable using a given set of independent variables.

Linear Regression: Used for solving regression problems.
Logistic Regression: Used for solving classification problems.

Linear Regression: Predicts the value of continuous variables.
Logistic Regression: Predicts the value of categorical variables.

Linear Regression: Finds the best-fit line.
Logistic Regression: Finds the S-curve.

Linear Regression: The least squares estimation method is used for estimation of accuracy.
Logistic Regression: The maximum likelihood estimation method is used for estimation of accuracy.

Linear Regression: The output must be a continuous value, such as price, age, etc.
Logistic Regression: The output must be a categorical value, such as 0 or 1, Yes or No, etc.

Linear Regression: It requires a linear relationship between the dependent and independent variables.
Logistic Regression: It does not require a linear relationship.

Linear Regression: There may be collinearity between the independent variables.
Logistic Regression: There should not be collinearity between the independent variables.
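
A short scikit-learn sketch contrasting the two models follows; the tiny arrays are illustrative only, with a continuous target for linear regression and a 0/1 target for logistic regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Linear regression: continuous target (e.g., price), fits a best-fit line
y_continuous = np.array([1.9, 4.1, 6.0, 8.2, 9.9, 12.1])
lin = LinearRegression().fit(X, y_continuous)
print("linear prediction:", lin.predict([[7.0]]))

# Logistic regression: categorical target (0 or 1), fits an S-shaped curve
y_class = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y_class)
print("class prediction:", log.predict([[7.0]]),
      "probability:", log.predict_proba([[7.0]])[0, 1])
```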
5. What do you mean by fuzzy logic? State its characteristics. Discuss the architecture
of a fuzzy logic system.
The 'Fuzzy' word means the things that are not clear or are vague. Sometimes, we cannot
decide in real life that the given problem or statement is either true or false. At that time,
this concept provides many values between true and false and gives the flexibility to find
the best solution to that problem.
The following are the characteristics of fuzzy logic:

1. This concept is flexible, and we can easily understand and implement it.
2. It is used for helping minimize the logic created by humans.
3. It is the best method for finding the solution to those problems that are suitable for
approximate or uncertain reasoning.
4. It always offers two values, which denote the two possible solutions to a problem or
statement.
5. It allows users to build or create functions that are non-linear of arbitrary complexity.

6. In fuzzy logic, everything is a matter of degree.


7. In fuzzy logic, any system that is logical can be easily fuzzified.
8. It is based on natural language processing.
9. It is also used by quantitative analysts to improve their algorithm's execution.
10. It also allows users to integrate with the programming.

Architecture of a Fuzzy Logic System


In the architecture of the fuzzy logic system, each component plays an important role. The
architecture consists of the four different components, which are given below.

1. Rule Base
2. Fuzzification
3. Inference Engine
4. Defuzzification

Rule Base
Rule Base is a component used for storing the set of rules, and the If-Then conditions given
by the experts are used for controlling the decision-making systems. There have been so
many updates to the fuzzy theory recently, which offers effective methods for designing
and tuning fuzzy controllers. These updates or developments decrease the number of fuzzy
sets of rules.
Fuzzification
Fuzzification is a module or component for transforming the system inputs, i.e., it converts
the crisp number into fuzzy steps. The crisp numbers are those inputs that are measured by
the sensors and then fuzzified and passed into the control systems for further processing.
This component divides the input signals into following five states in any Fuzzy Logic
system:

● Large Positive (LP)


● Medium Positive (MP)
● Small (S)
● Medium Negative (MN)
● Large negative (LN)

Inference Engine
This component is a main component in any fuzzy logic system (FLS), because all the
information is processed in the inference engine. It allows users to find the matching degree
between the current fuzzy input and the rules. After the matching degree, this system
determines which rule is to be added according to the given input field. When all rules are
fired, they are combined to develop control actions.
Defuzzification
Defuzzification is a module or component that takes the fuzzy set inputs generated by the
Inference Engine and then transforms them into a crisp value. It is the last step in the
process of developing a fuzzy logic system. The crisp value is a type of value that is
acceptable to the user. Various techniques are available to do this, but the user has to select
the best one for reducing the error.

6. Discuss the learning algorithms. State the applications of neural networks.


Supervised Learning

● Learning is performed by presenting input patterns along with their targets.
● During learning, the output produced is compared with the desired output. The
difference between the two outputs is used to modify the network weights according
to the learning algorithm.
● Examples: recognizing hand-written digits, pattern recognition, etc.
● Neural network models: perceptron, feed-forward (FF), radial basis function
(RBF), support vector machine (SVM)
Un-Supervised Learning

● Targets are not provided.
● Appropriate for clustering tasks, e.g., finding similar groups of documents on the
web, content-addressable memory, and clustering.
● Neural network models: Kohonen self-organizing maps (SOM), Hopfield
networks.

Reinforcement Learning

● Some feedback is provided, but the exact desired output is absent.
● The network is only given guidance on whether the produced output is correct or
incorrect.
● Weights are modified in the units that have errors.
● Based on reward and penalty concepts.

Applications of NN

● Signal processing
● Pattern recognition, e.g., handwritten characters or face identification.
● Diagnosis or mapping symptoms to a medical case.
● Speech recognition
● Human Emotion Detection
● Educational Loan Forecasting

7. State the concept of the competitive learning rule. How do you select the best
regression method?
There would be competition among the output nodes, so the main concept is that during
training, the output unit that has the highest activation of a given input pattern will be
declared the winner. This rule is also called winner-take-all because only the winning
neuron is updated and the rest of the neurons are left unchanged. Competitive learning is a
form of unsupervised learning in artificial neural networks in which nodes compete for the
right to respond to a subset of the input data. A variant of Hebbian learning, competitive
learning works by increasing the specialization of each node in the network. It is well suited to
finding clusters within data.

Within multiple types of regression models, it is important to choose the best-suited


technique based on the type of independent and dependent variables, the dimensionality of
the data, and other essential characteristics of the data. Below are the key factors that you
should practice to select the right regression model:

1. Data exploration is an inevitable part of building a predictive model. It should be your first
step before selecting the right model; for example, identify the relationships and the impact of variables.
2. To compare the goodness of fit for different models, we can analyse different metrics like
statistical significance of parameters, R-square, adjusted r-square, AIC, BIC, and error
term. Another one is Mallow’s Cp criterion. This essentially checks for possible bias in
your model by comparing the model with all possible submodels (or a careful selection of
them).
3. Cross-validation is the best way to evaluate models used for prediction. Here, you divide
your data set into two groups (train and validate). A simple mean squared difference
between the observed and predicted values gives you a measure of the prediction accuracy.
4. If your data set has multiple confounding (surprising) variables, you should not choose
automatic model selection method because you do not want to put these in a model at the
same time.
5. It depends on your objective. It can happen that a less powerful model is easier to implement
as compared to a highly statistically significant model.
6. Regression regularization methods (Lasso, Ridge, and ElasticNet) work well in cases of
high dimensionality and multicollinearity among the variables in the data set.
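
Point 3 above recommends cross-validation for comparing candidate regression models. A minimal scikit-learn sketch is shown below; the synthetic data and the choice of Ridge versus plain linear regression are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                   # synthetic predictors
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 0.0]) + rng.normal(scale=0.5, size=100)

# Compare models by cross-validated mean squared error (lower is better)
for name, model in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, "CV MSE:", -scores.mean())
```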
8. What is regularization? Why is regularization important? Discuss types of
regularization. Which regularization is better, and why?
Regularization refers to techniques that are used to calibrate machine learning models in
order to minimize the adjusted loss function and prevent overfitting or underfitting. Using
regularization, we can fit our machine learning model appropriately on a given test set and,
hence, reduce the errors in it.
You should use regularization if the gap in performance between train and test is large.
This means the model grasps too many details of the train set. Overfitting is related to high
variance, which means the model is sensitive to specific samples of the training set.
Regularization aims to balance the trade-off between bias and variance and improve the
prediction accuracy and robustness of the model. Regularization prevents overfitting,
making models generalize better on unseen data by penalizing complexity. There is a
range of different regularization techniques. The most common approaches rely on
statistical methods such as Lasso regularization (also called L1 regularization), Ridge
regularization (L2 regularization), and Elastic Net regularization, which combines both
Lasso and Ridge techniques.

Ridge regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). In multicollinearity, even though the least
squares (OLS) estimates are unbiased, their variances are large, which deviates the
observed value far from the true value. By adding a degree of bias to the regression
estimates, ridge regression reduces the standard errors caused by multicollinearity. A simple
linear regression can be represented as y = a + b*x. Including the error term e (the value
needed to correct for the prediction error between the observed and predicted value), the
equation becomes y = a + b*x + e. For multiple independent variables it becomes
y = a + b1*x1 + b2*x2 + ... + e.

In a linear equation, prediction errors can be decomposed into two sub-components: the
first is due to bias, and the second is due to variance. A prediction error can occur due to
either one of these two components or both.

Ridge regression solves the multicollinearity problem through the shrinkage parameter λ
(lambda). Its objective has two components: the first one is the least squares term, and the
second one is λ times the summation of β² (beta squared), where β denotes the coefficients.
The penalty is added to the least squares term in order to shrink the parameters so that they
have very low variance.

The advantage is that it avoids overfitting. Our ultimate model is the one that can generalize
patterns, i.e., works well on both the training and testing datasets. Overfitting occurs when the
trained model performs well on the training data but poorly on the testing dataset.

Ridge regression works by applying a penalizing term (reducing the weights and bias) to
overcome overfitting. Similar to ridge regression, Lasso (Least Absolute Shrinkage and
Selection Operator) also penalizes the absolute size of the regression coefficients. It is
capable of reducing the variability and improving the accuracy of linear regression models.

Lasso regression differs from ridge regression in that it uses absolute values in the penalty
function instead of squares. This leads to penalizing (or equivalently constraining) the sum
of the absolute values of the estimates, which causes some of the parameter estimates to
turn out to be exactly zero. The larger the penalty applied, the further the estimates shrink
towards zero.

This results in a variable selection out of the given n variables. L1 regularization is more
robust to outliers than L2 regularization, for a fairly obvious reason: L2 regularization takes
the square of the weights, so the cost of outliers present in the data increases quadratically,
whereas L1 regularization takes the absolute values of the weights, so the cost only increases
linearly.
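
A brief scikit-learn sketch of the two penalties discussed above follows; the synthetic data, the alpha values, and the coefficient comparison are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0, 0.0])   # several truly-zero coefficients
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Ridge (L2): shrinks coefficients towards zero but rarely makes them exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients:", np.round(ridge.coef_, 2))

# Lasso (L1): can drive some coefficients to exactly zero (variable selection)
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", np.round(lasso.coef_, 2))
```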

9. What do you mean by SVM? Discuss the various kernel methods used in SVM.

SVM (Support Vector Machine) is a data science algorithm that belongs to the class of
supervised learning and analyses the trends and characteristics of the data set. SVM solves
problems related to classification, regression, and outlier detection, all of which are common
tasks in machine learning. SVM is based on the learning framework of VC theory (Vapnik-
Chervonenkis theory). Examples: detecting cancerous cells based on millions of images, or
predicting future driving routes with a well-fitted regression model. These are just math
equations tuned to give you the most accurate answer possible as quickly as possible.

SVMs are different from other classification algorithms because of the way they choose
the decision boundary that maximizes the distance from the nearest data points of all the
classes. The decision boundary created by SVMs is called the maximum margin classifier
or the maximum margin hyperplane. Support Vector Machines is a strong and powerful
algorithm that is best used to build machine learning models with small data sets. SVM is
able to generalize the characteristics that differentiate the training data that is provided to
the algorithm. This is achieved by checking for a boundary that differentiates the two
classes by the maximum margin.

The boundary that separates the 2 classes is known as a hyperplane. Even if the name has
a plane, if there are two features, this hyperplane can be a line; in 3D, it will be a plane;
and so on. The data points that are closest to the separable line are known as the support
vectors. The distance of the support vectors from the separable line determines the best
hyperplane. We followed some basic rules for the most optimum separation, and they are:

Maximum Separation: We select a line that is able to segregate all the data points into their
corresponding classes; this is how candidate boundaries (say A, B, C, and D) are shortlisted.

Best Separation: Out of the candidate lines A, B, C, and D, we choose the one that has the
maximum margin to classify the data points, i.e., the line with the maximum width from the
corresponding support vectors (the data points that are the closest). When the data is not
linearly separable, SVMs rely on what is known as the kernel trick.

In a kernel trick, the data is projected into higher dimensions, and then a plane is
constructed so that the data points can be segregated. Some of these kernels are RBF,
Sigmoid, Poly, etc. Using the respective kernel, we will be able to tune the data set to get
the best hyperplane to separate the training data points.

Kernels, or kernel methods (also called kernel functions), are sets of different types of
algorithms that are being used for pattern analysis. They are used to solve a non-linear
problem by using a linear classifier. Kernel methods are employed in SVM (Support
Vector Machines), which are used in classification and regression problems. The SVM
uses what is called a “Kernel Trick,” where the data is transformed and an optimal
boundary is found for the possible outputs.
Linear Kernel
These are commonly recommended for text classification because most of these types of
classification problems are linearly separable. The linear kernel works really well when
there are a lot of features, and text classification problems have a lot of features. Linear
kernel functions are faster than most of the others and you have fewer parameters to
optimize.

If there are two vectors named x1 and x2, then the linear kernel is defined by the
dot product of these two vectors: K(x1, x2) = x1 . x2.

The function for a linear kernel is:


f(X) = w^T * X + b
In this equation, w is the weight vector that you want to minimize, X is the data that you're
trying to classify, and b is the linear coefficient estimated from the training data. This
equation defines the decision boundary that the SVM returns

Polynomial Kernel
The polynomial kernel isn't used in practice very often because it isn't as computationally
efficient as other kernels, and its predictions aren't as accurate. A polynomial kernel is
defined by the following equation:

K(x1, x2) = (x1 . x2 + 1)^d, where d is the degree of the polynomial and x1 and x2 are
vectors.

The function for a Polynomial kernel is:


f(X1, X2) = (a + X1^T * X2) ^ b
This is one of the simpler polynomial kernel equations you can use. f(X1, X2) represents
the polynomial decision boundary that will separate your data. X1 and X2 represent your
data.
Gaussian Kernel
This kernel is an example of a radial basis function kernel. One of the most powerful and
commonly used kernels in SVMs. Usually the choice for non-linear data.

The kernel can be written as K(X1, X2) = exp(-gamma * ||X1 - X2||^2). The value of sigma
(or, equivalently, gamma) plays a very important role in the performance of the Gaussian
kernel. It should neither be overestimated nor underestimated; it should be carefully tuned
according to the problem. Gamma specifies how much influence a single training point has
on the other data points around it, and ||X1 - X2|| is the Euclidean distance between your
feature vectors.

Sigmoid
More useful in neural networks than in support vector machines, but there are occasional
specific use cases. The function for a sigmoid kernel is:

f(X, y) = tanh(alpha * X^T * y + C)

In this function, alpha is a weight vector and C is an offset value to account for some
misclassification of data that can happen.
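
The kernel choices above map directly onto scikit-learn's SVC kernel parameter. The following sketch is a minimal illustration on made-up two-dimensional data; the C, degree, and gamma settings are arbitrary illustrative values.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linear (circular) class boundary

# Try the kernels discussed above: linear, polynomial, RBF (Gaussian), and sigmoid
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0, degree=3, gamma="scale").fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 2))
```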

10. What do you mean by dimensionality reduction? Explain its advantages. Discuss steps
of PCA.
In pattern recognition, Dimension Reduction is defined as
● It is the process of converting a data set with vast dimensions into a data set with
smaller dimensions.
● It ensures that the converted data set conveys similar information concisely.
Example: Consider a data set with two dimensions, x1 and x2, where:
● x1 represents the measurement of several objects in cm.
● x2 represents the measurement of several objects in inches.
Feature selection, the process of selecting a subset of features for use in model construction, is
1. a preprocessing step for data in machine learning, and
2. useful for both supervised and unsupervised learning problems.
It is needed because the true dimension of the data is often much smaller than the observed
dimension (True Dimension <<< Observed Dimensions), owing to an abundance of redundant
and irrelevant features.
Curse of Dimensionality
● With a fixed number of training samples, the predictive power reduces as the
dimensionality increases [Hughes phenomenon]
● With d binary variables, the number of possible combinations is O(2^d)

Benefits
Dimension reduction offers several benefits, such as
● It compresses the data and thus reduces the storage
● It reduces the time required for computation since less dimensions require less
computation.
● It eliminates the redundant features.
● It improves the model's performance.
The two popular and well-known dimension reduction techniques
1. Principal Component Analysis (PCA)
2. Fisher Linear Discriminant Analysis (LDA)
Principal Component Analysis is a well-known dimension reduction technique. It
transforms the variables into a new set of variables called principal components. It
eliminates multicollinearity, but explicability is compromised. These principal
components are linear combinations of the original variables and are orthogonal. The first
principal component accounts for most of the possible variation in the original data; the
second principal component does its best to capture the remaining variance. There can
be only two principal components for a two-dimensional data set. Principal component
analysis (PCA) is a technique for reducing the dimensionality of datasets, exploiting the
fact that the images in these datasets have something in common.

When to use PCA:
● Excessive multicollinearity
● Explanation of the predictors is not important
● A slight overhead in implementation is okay
● More suitable for unsupervised learning
● Scree plot / elbow method for choosing the number of components
PCA is a 4 step process.
Starting with a dataset containing n dimensions (requiring n-axes to be represented):
1. Find a new set of basis functions (n-axes) where some axes contribute to most of the
variance in the dataset while others contribute very little.
2. Arrange these axes in the decreasing order of variance contribution.
3. Now, pick the top k axes to be used and drop the remaining n-k axes.
4. Now, project the dataset onto these k axes.
After these 4 steps, the dataset will be compressed from n-dimensions to just k-dimensions
(k<n).
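
The four steps above correspond closely to what scikit-learn's PCA does internally. A minimal sketch follows; the random 5-dimensional data and the choice of k = 2 components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))                    # n = 5 observed dimensions
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # add a nearly redundant feature

# Steps 1-3: find new axes, rank them by explained variance, keep the top k
pca = PCA(n_components=2)

# Step 4: project the dataset onto those k axes
X_reduced = pca.fit_transform(X)

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 2))
print("reduced shape:", X_reduced.shape)          # compressed from 5 to 2 dimensions
```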

UNIT III: Mining Data Streams

1. Explain the architecture of the data stream model.


A data stream is an existing, continuous, ordered (implicitly by entrance time or explicitly
by timestamp) chain of items/data.
It is infeasible to control the order in which items arrive, nor is it feasible to locally capture
a stream in its entirety.
Streams involve enormous volumes of data, and items arrive at a high rate.

In analogy to a database-management system, we can view a stream processor as a
kind of data-management system.
Any number of streams can enter the system.
Each stream can provide elements on its own schedule; streams need not have the same data
rates or data types, and the time between elements of one stream need not be uniform.
The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.
The latter system controls the rate at which data is read from the disk, and therefore never
has to worry about data getting lost as it attempts to execute queries.
Streams may be archived in a large archival store, but we assume it is not possible
to answer queries from the archival store.
It could be examined only under special circumstances using time-consuming
retrieval processes.
There is also a working store, into which summaries or parts of streams may be placed,
and which can be used for answering queries.
The working store might be disk, or it might be main memory, depending on how fast we
need to process queries.
But either way, it is of sufficiently limited capacity that it cannot store all the data from
all the streams.

2. Explain the database stream management system in detail.


A data stream is a real-time, continuous, and ordered sequence of items.
It is not possible to control the order in which the items arrive, nor is it feasible to locally
store a stream in its entirety in any memory device.
Data stored into 3 partitions
1. Temporary working storage
2. Summary storage
3. Static storage for meta-data
The data model and query processor must allow both order-based and time-based operations.
Inability to store a complete stream indicates that some approximate summary structures
must be used.
Streaming query plans must not use any operators that require the entire input before any
results are produced.
Any query that requires backtracking over a data stream is infeasible. This is due to
storage and performance constraints imposed by a data stream
Applications that monitor streams in real-time must react quickly to unusual data values.
Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.

3. What are the various sampling techniques in data stream?


4. Explain any one algorithm to count number of distinct elements in a data
stream.

Flajolet Martin Algorithm:


Flajolet Martin algorithm, also known as FM algorithm, is used to approximate the
number of unique elements in a data stream or database in one pass.
The highlight of this algorithm is that it uses less memory space while executing.

Pseudo Code-Stepwise Solution:


1. Select a hash function h so that each element in the set is mapped to a string of at least
log2(n) bits.
2. For each element x, let r(x) = the number of trailing zeroes in h(x).
3. Let R = max(r(x)); then the estimated number of distinct elements = 2^R.
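
A minimal Python sketch of this procedure is given below. It uses a hash function of the form (a*x + b) mod c with a = 1, b = 6, c = 32, which is the form the drawbacks section below refers to; the sample stream is hypothetical.

```python
def trailing_zeros(n):
    """Number of trailing zero bits in n (0 is treated as having no trailing zeros)."""
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin(stream, a=1, b=6, c=32):
    # Hash each element with h(x) = (a*x + b) mod c and track the maximum
    # number of trailing zeros R; the estimate of distinct elements is 2^R.
    R = 0
    for x in stream:
        h = (a * x + b) % c
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]   # hypothetical data stream
print(flajolet_martin(stream))                   # rough estimate of distinct elements
```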

Drawbacks of the Flajolet Martin Algorithm:


It is important to choose the hash parameters wisely while implementing this algorithm, as it
has been proven practically that the FM algorithm is very sensitive to the hash function
parameters.
The hash function used in the above code is of the form ax+b mod c where x is an element in
the stream and a, b, and c are 1, 6, and 32, respectively, where a,b,c are the hash function
parameters.
For any other values of a, b, or c, the algorithm may not give the same result. Thus, it is
important to observe the stream and then assign proper values to a,b and c.

4. Why is filtering required in a data stream? Explain.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an
element is a member of a set.
For example, checking the availability of a username is a set membership problem, where the
set is the list of all registered usernames.
The price we pay for this efficiency is that the filter is probabilistic in nature, which means
there might be some false positive results.
A false positive means the filter might say that a given username is already taken when
actually it is not.
If the filter reports that an item is not present, that answer is always correct (a true negative).
If the filter reports that an item might be present, the result can be either a false positive or a
true positive.

5.Explain Bloom filter and its applications

An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the hashes for a given input. When we want to add an
item to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set to 1, where the indices
are calculated using the hash functions.
insert(x): to insert an element into the Bloom filter.
lookup(x): to check whether an element is already present in the Bloom filter, with a false
positive probability.
NOTE: We cannot delete an element from a Bloom filter.
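
A minimal sketch of insert(x) and lookup(x) is shown below. Deriving the k indices from slices of a single SHA-256 digest is just one illustrative choice of hash functions, and the parameters m = 64, k = 3 are arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array with k hash functions."""

    def __init__(self, m=64, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k indices from slices of one SHA-256 digest (illustrative choice)
        digest = hashlib.sha256(str(item).encode()).hexdigest()
        for i in range(self.k):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int(chunk, 16) % self.m

    def insert(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def lookup(self, item):
        # True means "possibly present" (false positives possible);
        # False means "definitely not present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.insert("alice")
print(bf.lookup("alice"))  # True (possibly present)
print(bf.lookup("bob"))    # usually False (definitely not present)
```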
Medium uses Bloom filters for recommending posts to users by filtering out posts that have
already been seen by the user.
Quora implemented a shared bloom filter in the feed backend to filter out stories that
people have seen before.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs
Google Bigtable, Apache HBase and Apache Cassandra, and PostgreSQL use Bloom
filters to reduce the disk lookups for non-existent rows or columns

6. Explain the concept of Decaying window.


7.Explain the working of DGIM ALGORITHM WITH EXAMPLE.
8. Give some examples of Stream Sources-
1. Sensor Data –
In navigation systems, sensor data is used. Imagine a temperature sensor floating about in the
ocean, sending back to the base station a reading of the surface temperature each hour. The
data generated by this sensor is a stream of real numbers. We may have 3.5 terabytes arriving
every day, and we certainly need to think about what can be kept for continued processing and
what can only be archived.

2. Image Data –
Satellites frequently send down to Earth streams containing many terabytes of images per day.
Surveillance cameras generate images with lower resolution than satellites, but there can be
numerous of them, each producing a stream of images at intervals of about one second.

3. Internet and Web Traffic –


A switching node in the middle of the internet receives streams of IP packets from many inputs
and routes them to its outputs. Websites receive streams of heterogeneous types. For example,
Google receives a hundred million search queries per day.

9. Give the full form of RTAP and discuss its applications.

RTAP stands for Real-Time Analytics Platform.
Real-time data analytics lets users see, examine, and understand data as it enters a system. Logic
and mathematics are applied to the data so that it can give users insight for making real-time
decisions.
Overview :
Real-time analytics permits businesses to gain awareness of and take action on data immediately
or soon after the data enters their system. Real-time analytics answers queries within seconds; it
handles a large amount of data with high velocity and low response time. For example, real-time
big data analytics uses data in financial databases to inform trading decisions. Analytics can be
on-demand or continuous. On-demand analytics delivers results when the user requests them.
Continuous analytics updates users as events happen and can be programmed to respond
automatically to certain events. For example, real-time web analytics might alert an administrator
if page load performance goes beyond a preset threshold.

Examples –
Examples of real-time customer analytics include the following.
1. Viewing orders as they happen for better tracking and to identify trends.
2. Continually updating customer activity, such as page views and shopping cart use, to
understand user behavior.
3. Targeting customers with promotions as they shop for items in a store, enabling real-time
decisions.
The functioning of real-time analytics :
Real-time data analytics tools can either push or pull data. Streaming requires the capacity to
push huge amounts of fast-moving data. When streaming takes too many resources and isn't
practical, data can be pulled at intervals that can range from seconds to hours. The pull can
happen between business needs, which requires managing resources so as not to disrupt
operations. The response times for real-time analytics can vary from nearly immediate to a
few seconds or minutes. The components of real-time data analytics include the following.
· Aggregator
· Broker
· Analytics engine
· Stream processor

Benefits of using real-time analytics :


1. Speed is the main benefit of real-time data analytics. The less time a business must
wait to access data between the time it arrives and is processed, the more it can use data insights
to make changes and act on critical decisions.
2. Similarly, real-time data analytics tools let companies see how users interact with a product
upon release, which means there is no delay in understanding user behavior and making the
needed adaptations.

Advantages of Real-Time Analytics:


Real-time analytics offers the following advantages over traditional analytics as follows.
· Create custom interactive analytics tools.
· Share information through transparent dashboards.
· Customize monitoring of behavior.
· Make immediate changes when needed.
· Apply machine learning.

10. Explain the concept of stream computing.

Stream computing refers to a high-performance computer system that analyzes multiple data
streams from many sources. The word stream in stream computing is used to mean pulling in
streams of data, processing the data, and streaming it back out as a single flow. Stream computing
uses software algorithms that analyze the data in real time as it streams in, to increase speed and
accuracy when dealing with data handling and analysis.

· Stream computing is a computing paradigm that reads data from collections of software
or hardware sensors in stream form and computes continuous data streams.
· Stream computing uses software programs that compute continuous data streams.
· Stream computing uses software algorithms that analyze the data in real time.
· Stream computing is one effective way to support Big Data by providing extremely low-
latency velocities with massively parallel processing architectures.
· It is becoming the fastest and most efficient way to obtain useful knowledge from Big
Data.
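As a rough illustration of the idea (a purely hypothetical sketch in R with simulated data), a
continuous stream can be processed batch by batch while a running aggregate is maintained, so that
the full stream never has to be stored:

# Illustrative sketch only: batches of values "arrive" in a loop and a running mean
# is updated incrementally, as a stream processor would do.
set.seed(1)
running_count <- 0
running_sum   <- 0

for (batch_id in 1:5) {
  batch <- rnorm(100, mean = 50, sd = 10)        # pretend this batch just streamed in
  running_count <- running_count + length(batch)
  running_sum   <- running_sum + sum(batch)
  cat(sprintf("after batch %d: running mean = %.2f (n = %d)\n",
              batch_id, running_sum / running_count, running_count))
}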
UNIT IV: Frequent Itemsets and Clustering

Q.1: What is market basket analysis, and how is it used by retailers?


Ans- Market basket analysis is a data mining technique that analyzes patterns of co-
occurrence and determines the strength of the link between products purchased together.
We also refer to it as frequent itemset mining or association analysis. It leverages these
patterns recognized in any retail setting to understand the behavior of the customer by
identifying the relationships between the items bought by them. For retailers to increase sales
and maximize profit, they need to know their customers and understand their needs and behaviours.
To do this, retailers use a technique called market basket analysis. Market basket analysis
consists of analyzing large data sets that include purchase history, revealing product groupings,
and identifying products that are likely to be purchased together. Put simply, market basket
analysis helps retailers learn which products are frequently bought together so that those items
can always be kept available in inventory.
The source from which these patterns are found is the vast amount of data that is
continually collected and stored. With frequent mining of the item set, it becomes easy to
discover the correlation between items in huge relational or transactional datasets. It
considerably helps in decision-making processes related to cross-marketing, catalogue
design, and consumer shopping analytics. The key question in the market basket analysis
is what products are most frequently purchased together.
Q.2 Explain Association Rules and how they are used in retail basket analysis?
Ans- Market basket analysis utilizes association rules of the form {IF} -> {THEN} to estimate the
probability of certain products being purchased together. These methods count how frequently items
occur together and seek associations that occur more often than expected. Algorithms that leverage
association rules include AIS, Apriori, and SETM. Apriori is the most commonly cited algorithm
among data scientists for identifying frequent itemsets in a database. It is an unsupervised
learning technique, so it requires no training data and makes no predictions, and it is used
especially for large data sets where useful relationships among the items are to be determined, as
illustrated by the small example below.
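A tiny base-R sketch (the transaction list and the rule are made-up examples) showing how the
support, confidence, and lift of one IF -> THEN rule can be measured:

# Made-up toy transactions; each vector is one customer's basket.
transactions <- list(
  c("Milk", "Bread", "Eggs"),
  c("Milk", "Bread"),
  c("Bread", "Eggs"),
  c("Milk", "Eggs"),
  c("Milk", "Bread", "Eggs")
)

contains <- function(t, items) all(items %in% t)   # does transaction t contain all items?
n <- length(transactions)

lhs <- c("Milk", "Bread")   # IF part (antecedent)
rhs <- "Eggs"               # THEN part (consequent)

support_lhs  <- sum(sapply(transactions, contains, items = lhs)) / n
support_rhs  <- sum(sapply(transactions, contains, items = rhs)) / n
support_rule <- sum(sapply(transactions, contains, items = c(lhs, rhs))) / n

confidence <- support_rule / support_lhs   # P(THEN | IF)
lift       <- confidence / support_rhs     # > 1 means bought together more often than expected

c(support = support_rule, confidence = confidence, lift = lift)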
Q.3 Discuss the different types of market basket analysis and their application areas.
Ans- Market basket analysis comprises the following types.

1. Descriptive market basket analysis


2. Predictive market basket analysis
3. Differential market basket analysis
Descriptive market basket analysis- This type of market basket analysis offers actionable
insights based on historical data. It is the most frequently used approach; it does not make any
predictions but rates the association between products using statistical techniques. Because of
the way it is modelled, it is also referred to as unsupervised learning.

Predictive market basket analysis- Although “predict” and “analysis” make up the word
predictive analysis, it actually works in reverse. It first analyses and then predicts what the
future holds. This type utilizes supervised learning models like regression and
classification. It is a valuable tool for marketers, even if it is less used than descriptive
market basket analysis. It considers items purchased in sequence to evaluate cross-sell. For
instance, when a consumer purchases a laptop, they are more likely to buy an extended
warranty with it. This analysis thus helps in recognizing those considered items in a
sequence so they can be sold together. It finds application in the retail industry, mainly to
determine the item baskets that are purchased together.

Differential market basket analysis- Differential market basket analysis is a great tool
for competitive analysis that can help you determine why consumers prefer to purchase the
same product from a particular platform even when they are labelled with the same price
on both platforms. This decision of the consumers is often based on several factors, such
as-

● Delivery time
● User experience
● Purchase history between stores, seasons, time periods, and others.

Q.4 Discuss the Apriori algorithm for frequent item set mining.
Ans- Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent item
sets in a dataset for the Boolean association rule. Apriori is designed to operate on databases
containing transactions (collections of items bought by customers, details of website accesses,
etc.). For example, the items customers buy at a big bazaar. The Apriori algorithm helps customers
buy their products with ease and increases the sales performance of the particular store.
The method is called Apriori because it makes use of prior knowledge of frequent itemset
properties. It applies an iterative, level-wise approach in which frequent k-itemsets are used to
find frequent (k+1)-itemsets. An essential property known as the Apriori property (every subset of
a frequent itemset must itself be frequent) is used to reduce the search space, which in turn
improves the efficiency of level-wise generation of frequent itemsets. Each transaction is seen as
a set of items (an itemset).
Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C
transactions in the database. Apriori uses a "bottom-up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation), and groups of candidates are
tested against the data. The algorithm terminates when no further successful extensions are found.
Algorithm
Step 1: Create a list of all the elements that appear in every transaction and create a frequency
table.
Step 2: Set the minimum level of support. Only those elements whose support exceeds or equals
the threshold support are significant.
Step 3: All potential pairings of important elements must be made, bearing in mind that AB and
BA are interchangeable.
Step 4: Tally the number of times each pair appears in a transaction.
Step 5: Only those sets of data that meet the criterion of support are significant.
Step 6: Now find sets of three items that may be bought together. A rule known as self-join is
used to build three-item sets: from the significant pairs, two pairs that share the same first item
are combined. For example, from the item pairings OP, OB, PB, and PM, OP and OB give OPB, and PB
and PM give PBM.
Step 7: When the threshold criterion is applied again, get the significant itemset.
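A compact sketch of this level-wise search using R's arules package (assuming the package is
installed; the toy transactions are made up and the support/confidence values are arbitrary):

library(arules)

# Hypothetical toy transactions (each vector is one customer's basket)
txns <- list(
  c("Milk", "Bread", "Eggs"),
  c("Milk", "Bread"),
  c("Bread", "Eggs"),
  c("Milk", "Eggs"),
  c("Milk", "Bread", "Eggs")
)
trans <- as(txns, "transactions")

# Level-wise mining of frequent itemsets with minimum support 0.4
freq_sets <- apriori(trans, parameter = list(supp = 0.4, target = "frequent itemsets"))
inspect(sort(freq_sets, by = "support"))

# Association rules with minimum support 0.4 and minimum confidence 0.6
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))
inspect(rules)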

Q. 5 Find the frequent item sets from given data using Apriori algorithm.
Ans- Step-1: In the first step, index the data and calculate the support of each individual item;
if an item's support is less than the minimum support, eliminate it from the table.
Step-2: Calculate the support of each candidate pair in the same way and keep the frequent ones.
Step-3: Continue calculating support for larger candidate itemsets and select the frequent ones.
Step-4: Report the frequent itemsets and generate association rules from them.
FIS= {Milk, Bread, Eggs}
Association Rules are:
{Milk^Bread→Eggs},{Milk^Eggs→Bread},{Bread^Eggs→Milk}

Q.6 Discuss PCY Algorithm and illustrate its steps.


Ans- The PCY algorithm (Park-Chen-Yu algorithm) is a data mining algorithm that is used to find
frequent itemsets in large datasets. It is an improvement over the Apriori algorithm. The PCY
algorithm uses hashing to count itemset frequencies efficiently and reduce the overall
computational cost. The basic idea is to use a hash function to map candidate pairs to hash
buckets, together with a hash table that counts how many pairs fall into each bucket.
Algorithm:
Step 1: Find the frequency of each element and remove candidate sets of length 1 that do not meet
the minimum support.
Step 2: Going through the transactions one by one, create all possible pairs and write their
frequencies. (Note: pairs should not be repeated; skip pairs that have already been written.)
Step 3: List all pairs whose count is greater than or equal to the threshold and then apply the
hash function to each of them. The hash function gives the bucket number, i.e., the bucket into
which that particular pair falls.
Step 4: In the last step, create a table with the following columns:
Bit vector – 1 if the frequency of the candidate pair is greater than or equal to the threshold,
otherwise 0 (mostly 1).
Bucket number – found in the previous step.
Highest support count – the frequency of this candidate pair, found in Step 2.
Pairs – the candidate pair itself.
Candidate set – if the bit vector is 1, the pair is listed here as a candidate.
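A minimal base-R sketch of this two-pass idea (purely illustrative and unoptimized; the function
name, the (i*j) mod n hash, and the parameter values are assumptions made for the example):

pcy_pairs <- function(transactions, min_support = 3, n_buckets = 10) {
  # Pass 1: count single items and hash every pair into a bucket
  item_count   <- table(unlist(transactions))
  bucket_count <- integer(n_buckets)
  for (t in transactions) {
    if (length(t) < 2) next
    prs <- combn(sort(t), 2)
    for (k in seq_len(ncol(prs))) {
      b <- (prs[1, k] * prs[2, k]) %% n_buckets + 1    # +1 for R's 1-based indexing
      bucket_count[b] <- bucket_count[b] + 1
    }
  }
  frequent_items <- as.integer(names(item_count[item_count >= min_support]))
  bit_vector     <- bucket_count >= min_support         # compressed bucket summary

  # Pass 2: count only pairs of frequent items that hash to a frequent bucket
  pair_count <- list()
  for (t in transactions) {
    items <- sort(intersect(t, frequent_items))
    if (length(items) < 2) next
    prs <- combn(items, 2)
    for (k in seq_len(ncol(prs))) {
      b <- (prs[1, k] * prs[2, k]) %% n_buckets + 1
      if (!bit_vector[b]) next
      key <- paste(prs[1, k], prs[2, k], sep = ",")
      pair_count[[key]] <- if (is.null(pair_count[[key]])) 1 else pair_count[[key]] + 1
    }
  }
  counts <- unlist(pair_count)
  counts[counts >= min_support]                          # frequent candidate pairs
}

# Example call with the transactions from Q.7
txns <- list(c(1,2,3), c(2,3,4), c(3,4,5), c(4,5,6), c(1,3,5), c(2,4,6),
             c(1,3,4), c(2,4,5), c(3,4,6), c(1,2,4), c(2,3,5), c(2,4,6))
pcy_pairs(txns, min_support = 3, n_buckets = 10)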

Q.7 Apply the PCY algorithm on the following transaction to find the candidate sets
(frequent sets) with threshold minimum value as 3 and Hash function as (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
Ans- Step 1: Find the frequency of each element and remove any single item that does not meet the
minimum support of 3 (here, every item qualifies).

Item        1   2   3   4   5   6
Frequency   4   7   7   8   6   4

Step 2: Going through the transactions one by one, create all possible new pairs and write the
frequency of each pair (pairs already listed are not repeated).

Transaction   New pairs        Frequency
T1            (1,2), (1,3)     2, 3
T2            (2,3), (2,4)     3, 4
T3            (3,4), (3,5)     4, 3
T4            (4,5), (4,6)     3, 4
T5            (1,5)            1
T6            (2,6)            2
T7            (1,4)            2
T8            (2,5)            2
T9            (3,6)            1
T10           –                –
T11           –                –
T12           –                –
Step 3: List all pairs whose count is greater than or equal to the threshold and then apply the
hash function (it gives us the bucket number). Hash function = (i * j) mod 10
(1, 3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4
Bucket no.   Pair
0            (4,5)
2            (3,4)
3            (1,3)
4            (4,6)
5            (3,5)
6            (2,3)
8            (2,4)
Step 4: Prepare the candidate set.

Bit Vector   Bucket No.   Highest Support Count   Pairs   Candidate Set
1            0            3                       (4,5)   (4,5)
1            2            4                       (3,4)   (3,4)
1            3            3                       (1,3)   (1,3)
1            4            4                       (4,6)   (4,6)
1            5            3                       (3,5)   (3,5)
1            6            3                       (2,3)   (2,3)
1            8            4                       (2,4)   (2,4)

Q.8 Discuss SON algorithm in detail and outline its characteristics.


Ans- SON Algorithm:

● It is an improvement over PCY for counting frequent itemsets.
● The idea is to divide the input file into chunks.
● Each chunk is treated as a sample, and the set of frequent itemsets in that chunk is found.
● If each chunk is a fraction p of the whole file and s is the overall support threshold, we use
p·s as the support threshold for each chunk.
● Store on disk all the frequent itemsets found for each chunk.
● Once all the chunks have been processed in this way, take the union of all the itemsets that have
been found frequent in one or more chunks. These are the candidate itemsets.
● Every itemset that is frequent in the whole file is frequent in at least one chunk, so there are
no false negatives.
● This makes one full pass through the data, since each chunk is read and processed once.
● In the second pass, we count all the candidate itemsets and select those that have support at
least s as the frequent itemsets.
● The MapReduce version of SON is as follows:

Pass 1:
Map (key, value)
• Find the itemsets frequent in the partition (sample) using the lowered support threshold p·s.
• The output is a key-value pair (F, 1), where F is a frequent itemset of the chunk.
Reduce (key, value)
• Values are ignored; the reduce task produces those keys (itemsets) that appear one or more times.
These are the candidate itemsets.
Pass 2:
Map (key, value)
• Count the occurrences of the candidate itemsets in the chunk.
• For each itemset in the candidate itemsets:
• Emit (itemset, support(itemset))
Reduce (key, value)
• result = 0
• For each value in values:
result = result + value
• If result >= s, then emit (key, result)
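A single-machine base-R simulation of the two SON passes (no real MapReduce; the function names,
the chunking scheme, and the restriction to single items and pairs are simplifying assumptions made
for illustration):

count_itemsets <- function(transactions, min_count) {
  # Count single items and pairs within one chunk and return the frequent ones
  singles <- table(unlist(transactions))
  pairs <- unlist(lapply(transactions, function(t) {
    if (length(t) < 2) return(character(0))
    apply(combn(sort(t), 2), 2, paste, collapse = ",")
  }))
  counts <- c(singles, table(pairs))
  names(counts[counts >= min_count])
}

son <- function(transactions, s, n_chunks = 3) {
  chunks <- split(transactions, cut(seq_along(transactions), n_chunks, labels = FALSE))
  p <- 1 / n_chunks                                  # each chunk is fraction p of the file

  # Pass 1: frequent itemsets per chunk with lowered threshold p*s; their union = candidates
  candidates <- unique(unlist(lapply(chunks, count_itemsets, min_count = ceiling(p * s))))

  # Pass 2: count each candidate over the whole file and keep those with support >= s
  full_counts <- sapply(candidates, function(key) {
    items <- strsplit(key, ",")[[1]]
    sum(sapply(transactions, function(t) all(items %in% as.character(t))))
  })
  full_counts[full_counts >= s]
}

txns <- list(c(1,2,3), c(2,3,4), c(3,4,5), c(4,5,6), c(1,3,5), c(2,4,6),
             c(1,3,4), c(2,4,5), c(3,4,6), c(1,2,4), c(2,3,5), c(2,4,6))
son(txns, s = 4)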

Q.9 What is Clustering? Discuss k-means clustering algorithm with an example.


Ans- Cluster analysis is a technique used in data mining and machine learning to group similar
objects into clusters. K-means clustering is a widely used method for cluster analysis where the
aim is to partition a set of objects into K clusters in such a way that the sum of the squared distances
between the objects and their assigned cluster mean is minimized.
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a
dataset into a pre-defined number of clusters. The goal is to group similar data points together and
discover underlying patterns or structures within the data. K-means is a centroid-based algorithm
or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-
Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the
points and their respective cluster centroid. Optimization plays a crucial role in the k-means
clustering algorithm. The goal of the optimization process is to find the best set of centroids that
minimizes the sum of squared distances between each data point and its closest centroid.
Algorithm:

1. Initialization: Start by randomly selecting K points from the dataset. These points will act
as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that point
and each of the K centroids. Assign the data point to the cluster whose centroid is closest
to it. This step effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids
no longer change significantly or when a specified number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.
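The steps above can be tried out directly with R's built-in kmeans() function; the two-dimensional
data below is a made-up toy example:

# K-means on simulated data: two well-separated groups of 25 points each.
set.seed(42)                                     # reproducible random initial centroids
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
colnames(x) <- c("f1", "f2")

km <- kmeans(x, centers = 2, nstart = 10)        # K = 2 clusters, 10 random restarts

km$centers                                       # final cluster centroids
table(km$cluster)                                # points assigned to each cluster
km$tot.withinss                                  # total within-cluster sum of squares (the objective)

plot(x, col = km$cluster, pch = 19)              # quick visual check
points(km$centers, col = 1:2, pch = 8, cex = 2)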

Q.10 What is Hierarchical Clustering? Discuss its applications, types and algorithmic steps.
Ans- Hierarchical clustering is a popular method for grouping objects. It creates groups so that
objects within a group are similar to each other and different from objects in other groups. Clusters
are visually represented in a hierarchical tree called a dendrogram. Hierarchical clustering has a
couple of key benefits:

1. There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut
at the appropriate level to obtain the desired number of clusters.
2. Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms
make it easy to examine and interpret clusters.

Applications
There are many real-life applications of Hierarchical clustering. They include:

● Bioinformatics: grouping animals according to their biological features to reconstruct


phylogeny trees
● Business: dividing customers into segments or forming a hierarchy of employees based on
salary.
● Image processing: grouping handwritten characters in text recognition based on the
similarity of the character shapes.
● Information Retrieval: categorizing search results based on the query.

Hierarchical clustering types


There are two main types of hierarchical clustering:

1. Agglomerative: Initially, each object is considered to be its own cluster. According to a


particular procedure, the clusters are then merged step by step until a single cluster remains.
At the end of the cluster merging process, a cluster containing all the elements will be
formed.
2. Divisive: The divisive method is the opposite of the agglomerative method. Initially, all
objects are considered to be in a single cluster. The division process is then performed step by
step until each object forms its own cluster. The cluster division or splitting procedure is
carried out according to some principle, such as the maximum distance between neighbouring
objects in the cluster.

Between agglomerative and divisive clustering, agglomerative clustering is generally the preferred
method. The steps below focus on agglomerative clustering algorithms because they are the most
popular and the easiest to implement.
Hierarchical clustering steps
Steps for Agglomerative clustering can be summarized as follows:
Step 1: Compute the proximity matrix using a particular distance metric
Step 2: Each data point is assigned to a cluster
Step 3: Merge the clusters based on a metric for the similarity between clusters
Step 4: Update the distance matrix
Step 5: Repeat Step 3 and Step 4 until only a single cluster remains
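These steps map directly onto R's built-in dist(), hclust(), and cutree() functions; the small
two-dimensional data set below is invented for illustration:

# Agglomerative hierarchical clustering on simulated data with complete linkage.
set.seed(7)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

d  <- dist(x, method = "euclidean")    # Step 1: proximity (distance) matrix
hc <- hclust(d, method = "complete")   # Steps 2-5: iterative merging of closest clusters

plot(hc)                               # dendrogram of the merge hierarchy
clusters <- cutree(hc, k = 2)          # cut the dendrogram to obtain 2 clusters
table(clusters)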

UNIT V: Frame Works and Visualization

Q1. What is the significance of Exploratory Data Analysis (EDA)?

● Exploratory data analysis (EDA) helps to understand the data better.

● It helps to obtain confidence in the data to a point where you’re ready to engage a
machine learning algorithm.
● It allows us to refine the selection of feature variables that will be used later for model
building.

● It helps discover hidden trends and insights from the data.
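A brief illustration of EDA in R using the built-in iris data set (the particular summaries and
plots chosen here are just examples):

data(iris)

str(iris)              # structure: variable names, types, and a preview of values
summary(iris)          # five-number summaries and counts per species
colSums(is.na(iris))   # quick check for missing values

# Simple visual exploration
hist(iris$Sepal.Length, main = "Sepal length distribution", xlab = "Sepal length")
boxplot(Sepal.Length ~ Species, data = iris, main = "Sepal length by species")
pairs(iris[, 1:4], col = as.integer(iris$Species))   # pairwise scatter plots to spot relationships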

Q2. Write a short note on Hadoop and also list its advantages.

Hadoop is an open-source software framework designed for the development of scalable, dependable,
and distributed applications that process massive amounts of data. It is a distributed,
batch-processing, fault-tolerant system that can store massive amounts of data and process that
data as well.

Advantages of Hadoop:

● 1. Fast:
· a. In HDFS (Hadoop Distributed File System), data is dispersed across the cluster and mapped,
allowing for faster retrieval.
· b. Even the tools used to process the data are frequently hosted on the same servers, decreasing
processing time.
● 2. Scalable: Hadoop clusters can be extended by just adding nodes to the cluster.
● 3. Cost-effective: Hadoop is open source and uses commodity technology to store data,
making it significantly less expensive than traditional relational database management
systems.
● 4. Resilient to failure: HDFS has the ability to duplicate data across networks, so if one
node fails or there is another network failure, Hadoop will use the other copy of the data.
● 5. Flexible:
· a. Hadoop allows businesses to readily access new data sources and tap into various sorts of
data in order to derive value from that data.

b. It aids in the extraction of useful business insights from data sources such as social
media, email discussions, data warehousing, fraud detection, and market campaign
analysis.

Q3. Draw and discuss the architecture of Hive in detail.

Hive architecture: The following architecture explains the flow of query submission into Hive.
Hive client: Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients, including:

● 1. Thrift Server: It is a platform for cross-language service providers that serves requests
from any programming languages that implement Thrift.
● 2. JDBC Driver: It is used to connect Java applications and Hive. The class
org.apache.hadoop.hive.jdbc.HiveDriver provides the JDBC driver.
● 3. ODBC Driver: It allows the applications that support the ODBC protocol to connect to
Hive.

Hive services: The following are the services provided by Hive:


● 1. Hive CLI: The Hive CLI (Command Line Interface) is a command shell from which
we can run Hive queries and commands.
● 2. Hive Web User Interface: Hive Web UI is a replacement for Hive CLI. It provides a
web-based interface for running Hive queries and commands.
● 3. Hive MetaStore:
· a. It is a central repository that maintains all of the warehouse’s structure
information for various tables and partitions.
· b. It also provides column and type metadata that is needed to read and write data,
as well as the related HDFS files where the data is kept.
● 4. Hive server:
· a. It is referred to as the Apache Thrift Server.
· b. It accepts requests from different clients and provides them to Hive Driver.
● 5. Hive driver:
· a. It receives queries from a variety of sources, including the web UI, CLI, Thrift,
and JDBC/ODBC driver.
· b. It sends the requests to the compiler.
● 6. Hive compiler:
· a. The compiler’s job is to parse the query and perform semantic analysis on the
various query blocks and expressions.
· b. It converts HiveQL statements into MapReduce jobs.
● 7. Hive execution engine:
· a. The optimizer creates a logical plan in the form of a DAG of MapReduce and
HDFS tasks.
· b. Finally, the execution engine executes the incoming tasks in the order of their
dependencies.

Q4. Write about the features of HBase.

Features of HBase:

1. It is both linearly and modularly scalable, because data is spread across multiple nodes of the
cluster.
2. HBase provides consistent read and write performance.
3. It enables atomic reads and writes, which means that while one process is reading or writing a
row, other processes do not see a partially completed operation on that row.
4. It offers a simple Java API for client access.
5. It offers Thrift and REST APIs for non-Java front ends, with options for XML, Protobuf,
and binary data encoding.
6. It has a Block Cache and Bloom Filters for real-time query optimisation as well as large
volume query optimisation.
7. HBase supports automatic failover across region servers.
8. It supports exporting measurements to files using the Hadoop metrics subsystem.
9. It does not enforce data relationships.
10. It is a data storage and retrieval platform with random access.
Q5. Write a short note on R programming language and its features.

● a. R is a programming language that comes combined with a large collection of packages.

● b. It is used to process and visualize data.
● c. It is a multi-functional language that allows for data manipulation, calculation, and
graphical display.
● d. It can store data values and execute computations on them with the goal of assembling an
ideal data set.
● e. It provides the following features to assist data operations:

· 1. R includes data-handling functions such as declaration and definition, and it also


enables in-memory data storage.
· 2. It allows data gathering procedures such as set and matrix.
· 3. There are numerous tools available for data analysis in R.
· 4. R-generated visual representations can be printed as well as displayed on the
screen.
· 5. The ‘S’ programming language is accessible online to help simplify R’s
functions.
· 6. A large number of packages for various data processing functions in the R
language are available in the repository.
· 7. R has a graphical illustration function for data analysis that may be exported to
external files in a variety of formats.
· 8. R can meet data analytics requirements from start to finish. It can be used to
generate any analysis quickly.
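A minimal, hypothetical R session illustrating several of these features (in-memory data handling,
matrix operations, built-in analysis functions, and graphics exported to an external file):

# In-memory data handling with a data frame
scores <- data.frame(student = c("A", "B", "C", "D"),
                     marks   = c(72, 85, 90, 66))

mean(scores$marks)         # simple statistical analysis
summary(scores$marks)      # distribution summary

m <- matrix(1:6, nrow = 2) # matrix data structure
t(m) %*% m                 # matrix computation

png("marks_barplot.png")   # export a graphical illustration to an external file
barplot(scores$marks, names.arg = scores$student,
        main = "Marks per student", ylab = "Marks")
dev.off()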

Q6. Explain different interaction techniques.

Interaction approaches enable the data analyst to interact with the visuals and dynamically adjust
them based on the exploration objectives. They also enable the linking and combining of multiple
independent visualizations.

Different interaction techniques are:

a. Dynamic projection:

● 1. Dynamic projection is a type of automated navigation.


● 2. The core idea is to change the projections dynamically in order to study a multi-
dimensional data set.
● 3. The GrandTour system, which attempts to display all interesting two-dimensional
projections of a multidimensional data set as a series of scatter plots, is a well-known
example.
● 4. The sequence of projections shown can be random, manual, precomputed, or data driven.
● 5. Examples of dynamic projection techniques include XGobi, XLispStat, and ExplorN.

b. Interactive filtering:

● 1. Interactive filtering combines selection with view enhancement.


● 2. While exploring huge data sets, it is critical to partition the data set interactively into
segments and focus on interesting subsets.
● 3. This can be accomplished through either a direct selection of the desired subset
(browsing) or by specifying the desired subset’s attributes (querying).
● 4. The Magic Lens is an example of a tool that can be used for interactive filtering.
● 5. The core idea behind Magic Lens is to filter data directly in the visualization using a
tool comparable to a magnifying glass. The filter processes the data under the magnifying
glass and displays it differently than the rest of the data set.
● 6. Magic Lens changes the view of the specified location while leaving the rest of the
display alone.
● 7. Examples of interactive filtering techniques include InfoCrystal, Dynamic Queries, and
Polaris.

c. Zooming:

● 1. Zooming is a frequently utilized visual modification method in a variety of


applications.
● 2. When dealing with vast volumes of data, it is critical to show the data in a highly
compressed form to provide an overview of the data while also allowing for varied
display at different resolutions.
● 3. Zooming means not only making the data objects larger, but also changing the data
representation to provide more details at higher zoom levels.
● 4. The items can be displayed as single pixels at a low zoom level, icons at an
intermediate zoom level, and named objects at a high resolution, for example.
● 5. The TableLens approach is an intriguing example of applying the zooming concept to
huge tabular data sets.
● 6. TableLens’ core concept is to display each numerical value with a tiny bar.
● 7. The lengths of all bars are controlled by the attribute values and have a one-pixel
height.
● 8. Examples of zooming techniques include PAD++, IVEE/Spotfire, and DataSpace.

d. Brushing and Linking:

● 1. Brushing is an interactive selection procedure that communicates the selected data to


other views of the data set.
● 2. The concept of linking and brushing is to integrate several visualization methods in
order to overcome the inadequacies of individual techniques.
● 3. Linking and brushing can be applied to visualizations generated by different
visualization approaches. As a result, the brushing points are highlighted in all
representations, allowing dependencies and correlations to be identified.
● 4. Changes made interactively in one visualization are automatically reflected in the
others.

e. Distortion:

● 1. Distortion is a visual modification technique that aids in data exploration by


maintaining an overview of the data during drill-down activities.
● 2. The main idea is to display some of the data in high detail while others are displayed in
low detail.
● 3. Hyperbolic and spherical distortions are popular distortion techniques.
● 4. They are commonly employed on hierarchies and graphs, but they can be used to any
other type of visualization technique.

● 5. Examples of distortion techniques include Bifocal Displays, Perspective Wall, Graphical
Fisheye Views, Hyperbolic Visualization, and Hyperbox.

Q7. Differentiate between MapReduce and Apache Pig

S.No.   MapReduce                                                   Apache Pig
1.      It is a low-level data processing tool.                     It is a high-level data flow tool.
2.      It requires developing complex programs in Java or Python.  It does not require developing complex programs.
3.      It is difficult to perform data operations in MapReduce.    It has built-in operators for data operations such as union, sorting, and ordering.
4.      It does not allow nested data types.                        It provides nested data types like tuple, bag, and map.

Q8. What are the main differences between HDFS and S3?

The main differences between HDFS and S3 are:


● 1. S3 is more scalable than HDFS.
● 2. When it comes to durability, S3 has the edge over HDFS.
● 3. Data in S3 is always persistent, unlike data in HDFS.
● 4. S3 is more cost-efficient and likely cheaper than HDFS.
● 5. HDFS excels when it comes to performance, outshining S3.

Q9. Write R function to check whether the given number is Prime or not.

Find_Prime_No <- function(n1) {
  # Handle the smallest cases first
  if (n1 == 2) {
    return(TRUE)
  }
  if (n1 <= 1) {
    return(FALSE)
  }
  # Check divisibility by every integer from 2 to n1 - 1
  for (i in 2:(n1 - 1)) {
    if (n1 %% i == 0) {
      return(FALSE)
    }
  }
  return(TRUE)
}

numb_1 <- 13

if (Find_Prime_No(numb_1)) {
  # Using paste() to include the number in the output
  print(paste(numb_1, "is a prime number"))
} else {
  print(paste(numb_1, "is not a prime number"))
}

Q10. What is the classification of visualization techniques?

The visualization technique may be classified as:


● 1. Standard 2D/3D displays
● 2. Geometrically-transformed displays
● 3. Iconic displays
● 4. Dense pixel displays
● 5. Stacked displays
