
ASSIGNMENT – DWDM

Submitted by- Tanya Sikka


1719210284

Ques 1. Give some examples of data preprocessing techniques.
1. Data Cleaning:
Real-world data often has irrelevant and missing parts. Data cleaning handles these problems; it involves handling missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some values are absent from the dataset. It can be handled in various ways, some of which are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill in the missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
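As an illustration of filling with the attribute mean, here is a minimal sketch in Python; the values are hypothetical:

```python
# Fill missing values (None) in a numeric attribute with the attribute mean.
# The data below is hypothetical.
ages = [25, None, 30, None, 45, 20]

known = [v for v in ages if v is not None]
mean_age = sum(known) / len(known)  # (25 + 30 + 45 + 20) / 4 = 30.0

filled = [v if v is not None else mean_age for v in ages]
print(filled)  # [25, 30.0, 30, 30.0, 45, 20]
```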
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment mean, or by the nearest segment boundary value.
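The binning step above can be sketched as follows; the data and bin size are hypothetical, and ties between the two boundaries go to the lower one:

```python
# Smooth sorted data by equal-size bins: replace each bin by its mean,
# or by the closer bin boundary. Values are hypothetical.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [round(sum(b) / len(b), 2) for b in bins for _ in b]

# Smoothing by bin boundaries: every value becomes the closer of the
# bin's minimum and maximum (ties go to the lower boundary).
by_boundaries = []
for b in bins:
    lo, hi = b[0], b[-1]
    by_boundaries += [lo if v - lo <= hi - v else hi for v in b]

print(by_means)       # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
print(by_boundaries)  # [4, 9, 9, 15, 21, 21, 24, 24, 26]
```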
2. Regression:
Here data is smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
3. Clustering:
This approach groups similar data into clusters; values that fall outside all clusters can be treated as outliers.
2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following ways:
1. Normalization:
It is done to scale the data values into a specified range (e.g. -1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This replaces the raw values of a numeric attribute with interval labels or conceptual labels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
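Min-max normalization to the 0.0 to 1.0 range can be sketched as follows (hypothetical salary values):

```python
# Min-max normalization: rescale an attribute's values into [0.0, 1.0].
# The salary values are hypothetical.
salaries = [30000, 45000, 60000, 90000]
lo, hi = min(salaries), max(salaries)
normalized = [(v - lo) / (hi - lo) for v in salaries]
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```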
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube.
2. Attribute Subset Selection:
Only highly relevant attributes should be kept; the rest can be discarded. For attribute selection one can use a significance level and the p-value of each attribute: attributes whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction:
This stores a model of the data instead of the whole data, for example a regression model.
4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
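As a sketch of PCA-style dimensionality reduction, the following reduces hypothetical 2-D points to 1-D using the closed-form eigendecomposition of a 2x2 covariance matrix (stdlib Python only; the point values are made up):

```python
import math

# PCA sketch: project 2-D points onto the first principal component.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n

# Sample covariance matrix [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
c = sum((y - my) ** 2 for _, y in points) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

# Largest eigenvalue and its unit eigenvector = first principal component.
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
vx, vy = b, lam - a          # (A - lam*I) v = 0  =>  v is parallel to (b, lam - a)
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each centred point onto the component: 2-D reduced to 1-D.
projected = [round((x - mx) * vx + (y - my) * vy, 3) for x, y in points]
print(projected)
```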

Ques 2. Explain the concept of classification in data mining.
Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying which of a set of categories (subpopulations) a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.

Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and further approving it. Classification is a two-step process:
1. Learning step (training phase): construction of the classification model.
Different algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained to predict accurate results.
2. Classification step: the constructed model is used to predict class labels and is tested on test data to estimate the accuracy of the classification rules.

Ques 3. What are attribute selection measures?

Attribute subset Selection is a technique which is used for data reduction in data mining
process. Data reduction reduces the size of data so that it can be used for analysis purposes more
efficiently.
Need of Attribute Selection-
The data set may have a large number of attributes, but some of them can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant ones does not significantly affect the utility of the data and the cost of data analysis is reduced. Mining on a reduced data set also makes the discovered patterns easier to understand.
Process of Attribute Selection-
The brute force approach can be very expensive in which each subset (2^n possible subsets) of
the data having n attributes can be analysed.
A better way is to use statistical significance tests so that the best (or worst) attributes can be recognized. A statistical significance test assumes that attributes are independent of one another. This is a greedy approach: a significance level is chosen (commonly 5%), and the model is tested repeatedly until the p-value (probability value) of every remaining attribute is less than or equal to the selected significance level. Attributes whose p-value is higher than the significance level are discarded. This procedure is repeated until every attribute in the data set has a p-value less than or equal to the significance level, which gives a reduced data set with no irrelevant attributes.
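The greedy selection idea can be sketched as follows. As an assumption, a simple absolute-correlation score stands in for a formal significance test, and the data, attribute names, and threshold are all hypothetical:

```python
# Attribute subset selection sketch: keep only attributes whose relevance
# to the class clears a threshold. A stand-in score (absolute Pearson
# correlation with the class label) replaces a formal p-value test here.
# Hypothetical data: attributes a1..a3 (columns 0..2), binary class y.
rows = [
    # a1,  a2,  a3,  y
    (1.0, 5.0, 0.2, 0),
    (2.0, 4.0, 0.9, 0),
    (3.0, 5.0, 0.1, 1),
    (4.0, 4.0, 0.8, 1),
    (5.0, 5.0, 0.3, 1),
]

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

y = [r[-1] for r in rows]
threshold = 0.5  # hypothetical relevance bar
kept = [i for i in range(3)
        if abs(corr([r[i] for r in rows], y)) >= threshold]
print(kept)  # [0] — only a1 is strongly related to the class
```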

Ques 4. Discuss association rules.
Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in transactions. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together.
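The two basic measures behind an association rule, support and confidence, can be computed directly; the transaction list and the candidate rule {bread} -> {butter} below are hypothetical:

```python
# Support and confidence for the candidate rule {bread} -> {butter},
# computed over a small hypothetical transaction list.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n          # fraction of transactions with bread AND butter
confidence = both / bread   # P(butter | bread)
print(support, confidence)  # 0.6 0.75
```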

Ques 5. Write a short note on decision tree based algorithms.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.
A typical example is a decision tree for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.

The benefits of having a decision tree are as follows:

- It does not require any domain knowledge.
- It is easy to comprehend.
- The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm


J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980, and later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D:

Algorithm: Generate_decision_tree

Input:
    Data partition D, a set of training tuples and their
    associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the
    splitting criterion that best partitions the data tuples
    into individual classes. This criterion includes a
    splitting_attribute and either a split point or a
    splitting subset.

Output:
    A decision tree.

Method:
    create a node N;
    if the tuples in D are all of the same class C then
        return N as a leaf node labeled with class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the
        majority class in D;                       // majority voting
    apply Attribute_selection_method(D, attribute_list)
        to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and
            multiway splits are allowed then       // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;
                                                   // remove the splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by
            Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
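In ID3, Attribute_selection_method is information gain: the split that most reduces class entropy is chosen. A minimal sketch of that computation on a hypothetical (outlook, class) dataset:

```python
import math

# Attribute selection via information gain (the ID3 criterion),
# computed on a tiny hypothetical dataset of (outlook, class) pairs.
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "no")]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)) if c)

labels = [c for _, c in rows]
base = entropy(labels)  # 4 yes vs 4 no -> 1.0 bit

# Expected entropy after splitting on the attribute.
split = 0.0
for v in set(a for a, _ in rows):
    subset = [c for a, c in rows if a == v]
    split += len(subset) / len(rows) * entropy(subset)

gain = base - split
print(round(gain, 3))  # 0.656
```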
Ques 6. What are neural networks?

An Artificial Neural Network, often just called a neural network, is a mathematical model inspired by biological neural networks. A neural network
consists of an interconnected group of artificial neurons, and it processes
information using a connectionist approach to computation. In most cases a
neural network is an adaptive system that changes its structure during a
learning phase. Neural networks are used to model complex relationships
between inputs and outputs or to find patterns in data.

The inspiration for neural networks came from examination of central nervous
systems. In an artificial neural network, simple artificial nodes, called
“neurons”, “neurodes”, “processing elements” or “units”, are connected
together to form a network which mimics a biological neural network.

There is no single formal definition of what an artificial neural network is. Generally, it involves a network of simple processing elements that exhibit
complex global behavior determined by the connections between the
processing elements and element parameters. Artificial neural networks are
used with algorithms designed to alter the strength of the connections in the
network to produce a desired signal flow.

Neural networks are also similar to biological neural networks in that functions
are performed collectively and in parallel by the units, rather than there being
a clear delineation of subtasks to which various units are assigned. The term
“neural network” usually refers to models employed in statistics, cognitive
psychology and artificial intelligence. Neural network models which emulate
the central nervous system are part of theoretical neuroscience and
computational neuroscience.

Real-life applications
The tasks artificial neural networks are applied to tend to fall within the
following broad categories:

- Function approximation, or regression analysis, including time series prediction, fitness approximation, and modeling.
- Classification, including pattern and sequence recognition, novelty detection, and sequential decision making.
- Data processing, including filtering, clustering, blind source separation, and compression.
- Robotics, including directing manipulators and computer numerical control.

Ques 7. What are grid-based methods?
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

Grid-based Method

In a grid-based method, the object space is quantized into a finite number of cells that form a grid structure, and clustering operations are performed on this grid.
Advantages
- The major advantage of this method is its fast processing time.
- The processing time depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
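A minimal sketch of the quantization step, using hypothetical 2-D points, cell size, and density threshold:

```python
# Grid-based quantization sketch: map 2-D points into square cells,
# then treat dense cells as cluster material. All values hypothetical.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),   # dense region near the origin
          (2.1, 2.2), (2.4, 2.3),               # dense region near (2, 2)
          (5.0, 0.5)]                           # isolated point
cell = 1.0  # side length of each grid cell

grid = {}
for x, y in points:
    key = (int(x // cell), int(y // cell))  # cell coordinates
    grid.setdefault(key, []).append((x, y))

# Keep cells holding at least 2 points; sparse cells are ignored.
dense = {k: v for k, v in grid.items() if len(v) >= 2}
print(sorted(dense))  # [(0, 0), (2, 2)]
```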

Ques 8. What are distance-based algorithms?
Distance-based algorithms are machine learning algorithms that classify queries by computing
distances between these queries and a number of internally stored exemplars. Exemplars that are
closest to the query have the largest influence on the classification assigned to the query. Two
specific distance-based algorithms, the nearest neighbor algorithm and the nearest-hyperrectangle
algorithm, are studied in detail.

It is shown that the k-nearest neighbor algorithm (kNN) outperforms the first-nearest-neighbor algorithm only under certain conditions: data sets must contain moderate amounts of noise, and training examples from the different classes must belong to clusters that allow an increase in the value of k without reaching into clusters of other classes. Methods for choosing the value of k for kNN are investigated. It is shown that one-fold cross-validation on a restricted number of values for k suffices for best performance. It is also shown that, for best performance, the votes of the k nearest neighbors of a query should be weighted in inverse proportion to their distances from the query.
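The distance-weighted kNN voting described above can be sketched as follows (the exemplars and queries are hypothetical):

```python
import math

# Distance-weighted k-nearest-neighbor classification: each of the k
# closest exemplars votes with weight 1/distance. Data is hypothetical.
exemplars = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
             ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

def classify(query, k=3):
    # Sort exemplars by Euclidean distance to the query.
    dists = sorted((math.dist(query, p), label) for p, label in exemplars)
    votes = {}
    for d, label in dists[:k]:
        # Inverse-distance weighting; small epsilon avoids division by zero.
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

print(classify((1.1, 1.1)))  # A
print(classify((4.0, 4.0)))  # B
```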

Ques 9. What is the difference between agglomerative and divisive hierarchical clustering?
- The agglomerative hierarchical clustering method builds clusters from the bottom up: the program always processes the sub-components first and then moves to the parent. Divisive clustering, in contrast, uses a top-down approach in which the parent is visited first and then the children.
- In the agglomerative method, each object initially forms its own cluster, and these clusters are merged step by step into larger clusters. The merging process carries on until all the single clusters are merged into one complete big cluster consisting of all the objects of the child clusters. In the divisive method, the parent cluster is divided into smaller clusters, and the division continues until each cluster contains a single object.
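The bottom-up merging process can be sketched for 1-D points with single-linkage distance (all values hypothetical):

```python
# Agglomerative (bottom-up) clustering sketch: start with singleton
# clusters and repeatedly merge the two closest ones (single linkage,
# 1-D points) until the requested number of clusters remains.
points = [1.0, 1.5, 5.0, 5.4, 12.0]

def agglomerate(values, k):
    clusters = [[v] for v in values]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return [sorted(c) for c in clusters]

print(agglomerate(points, 3))  # [[1.0, 1.5], [5.0, 5.4], [12.0]]
```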
