
ASSIGNMENT – DWDM

Submitted by- Tanya Sikka


1719210284

Ques 1. Give some examples of data preprocessing techniques.
1. Data Cleaning:
Real-world data often has irrelevant and missing parts. Data cleaning handles these problems; it involves handling missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some values are absent from the dataset. It can be handled in various ways, some of which are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill in the missing values:
There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
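As an illustration of filling with the attribute mean, here is a minimal sketch in Python; the values are hypothetical:

```python
# Fill missing values (None) in a numeric attribute with the attribute mean.
# The data below is hypothetical.
ages = [25, None, 30, None, 45, 20]

known = [v for v in ages if v is not None]
mean_age = sum(known) / len(known)  # (25 + 30 + 45 + 20) / 4 = 30.0

filled = [v if v is not None else mean_age for v in ages]
print(filled)  # [25, 30.0, 30, 30.0, 45, 20]
```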
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately: all values in a segment can be replaced by the segment mean, or by the nearest segment boundary value.
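The binning step above can be sketched as follows; the data and bin size are hypothetical, and ties between the two boundaries go to the lower one:

```python
# Smooth sorted data by equal-size bins: replace each bin by its mean,
# or by the closer bin boundary. Values are hypothetical.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [round(sum(b) / len(b), 2) for b in bins for _ in b]

# Smoothing by bin boundaries: every value becomes the closer of the
# bin's minimum and maximum (ties go to the lower boundary).
by_boundaries = []
for b in bins:
    lo, hi = b[0], b[-1]
    by_boundaries += [lo if v - lo <= hi - v else hi for v in b]

print(by_means)       # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
print(by_boundaries)  # [4, 9, 9, 15, 21, 21, 24, 24, 26]
```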
2. Regression:
Here data is smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
3. Clustering:
This approach groups similar data into clusters; values that fall outside all clusters can be treated as outliers.
2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following ways:
1. Normalization:
It is done to scale the data values into a specified range (e.g. -1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This replaces the raw values of a numeric attribute with interval labels or conceptual labels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
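Min-max normalization to the 0.0 to 1.0 range can be sketched as follows (hypothetical salary values):

```python
# Min-max normalization: rescale an attribute's values into [0.0, 1.0].
# The salary values are hypothetical.
salaries = [30000, 45000, 60000, 90000]
lo, hi = min(salaries), max(salaries)
normalized = [(v - lo) / (hi - lo) for v in salaries]
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```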
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube.
2. Attribute Subset Selection:
Only highly relevant attributes should be kept; the rest can be discarded. For attribute selection one can use a significance level and the p-value of each attribute: attributes whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction:
This stores a model of the data instead of the whole data, for example a regression model.
4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
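As a sketch of PCA-style dimensionality reduction, the following reduces hypothetical 2-D points to 1-D using the closed-form eigendecomposition of a 2x2 covariance matrix (stdlib Python only; the point values are made up):

```python
import math

# PCA sketch: project 2-D points onto the first principal component.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n

# Sample covariance matrix [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
c = sum((y - my) ** 2 for _, y in points) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

# Largest eigenvalue and its unit eigenvector = first principal component.
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
vx, vy = b, lam - a          # (A - lam*I) v = 0  =>  v is parallel to (b, lam - a)
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each centred point onto the component: 2-D reduced to 1-D.
projected = [round((x - mx) * vx + (y - my) * vy, 3) for x, y in points]
print(projected)
```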

Ques 2. Explain the concept of classification in data mining.
Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying which of a set of categories (subpopulations) a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.

Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and further approving it. Classification is a two-step process:
1. Learning step (training phase): construction of the classification model.
Different algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained to predict accurate results.
2. Classification step: the constructed model is used to predict class labels and is tested on test data to estimate the accuracy of the classification rules.

Ques 3. What are attribute selection measures?

Attribute subset Selection is a technique which is used for data reduction in data mining
process. Data reduction reduces the size of data so that it can be used for analysis purposes more
efficiently.
Need of Attribute Selection-
The data set may have a large number of attributes, but some of them can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant ones does not significantly affect the utility of the data and the cost of data analysis is reduced. Mining on a reduced data set also makes the discovered patterns easier to understand.
Process of Attribute Selection-
The brute force approach can be very expensive in which each subset (2^n possible subsets) of
the data having n attributes can be analysed.
A better way is to use statistical significance tests so that the best (or worst) attributes can be recognized. A statistical significance test assumes that attributes are independent of one another. This is a greedy approach: a significance level is chosen (commonly 5%), and the model is tested repeatedly until the p-value (probability value) of every remaining attribute is less than or equal to the selected significance level. Attributes whose p-value is higher than the significance level are discarded. This procedure is repeated until every attribute in the data set has a p-value less than or equal to the significance level, which gives a reduced data set with no irrelevant attributes.
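The greedy selection idea can be sketched as follows. As an assumption, a simple absolute-correlation score stands in for a formal significance test, and the data, attribute names, and threshold are all hypothetical:

```python
# Attribute subset selection sketch: keep only attributes whose relevance
# to the class clears a threshold. A stand-in score (absolute Pearson
# correlation with the class label) replaces a formal p-value test here.
# Hypothetical data: attributes a1..a3 (columns 0..2), binary class y.
rows = [
    # a1,  a2,  a3,  y
    (1.0, 5.0, 0.2, 0),
    (2.0, 4.0, 0.9, 0),
    (3.0, 5.0, 0.1, 1),
    (4.0, 4.0, 0.8, 1),
    (5.0, 5.0, 0.3, 1),
]

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

y = [r[-1] for r in rows]
threshold = 0.5  # hypothetical relevance bar
kept = [i for i in range(3)
        if abs(corr([r[i] for r in rows], y)) >= threshold]
print(kept)  # [0] — only a1 is strongly related to the class
```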

Ques 4. Discuss association rules.
Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in transactions. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together.
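The two basic measures behind an association rule, support and confidence, can be computed directly; the transaction list and the candidate rule {bread} -> {butter} below are hypothetical:

```python
# Support and confidence for the candidate rule {bread} -> {butter},
# computed over a small hypothetical transaction list.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n          # fraction of transactions with bread AND butter
confidence = both / bread   # P(butter | bread)
print(support, confidence)  # 0.6 0.75
```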

Ques 5. Write a short note on decision tree based algorithms.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node.
A typical example is a decision tree for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.

The benefits of having a decision tree are as follows:

- It does not require any domain knowledge.
- It is easy to comprehend.
- The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm


J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980, and later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D:

Algorithm: Generate_decision_tree

Input:
    Data partition D, a set of training tuples and their
    associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the
    splitting criterion that best partitions the data tuples
    into individual classes. This criterion includes a
    splitting_attribute and either a split point or a
    splitting subset.

Output:
    A decision tree.

Method:
    create a node N;
    if the tuples in D are all of the same class C then
        return N as a leaf node labeled with class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the
        majority class in D;                       // majority voting
    apply Attribute_selection_method(D, attribute_list)
        to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and
            multiway splits are allowed then       // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;
                                                   // remove the splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by
            Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
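In ID3, Attribute_selection_method is information gain: the split that most reduces class entropy is chosen. A minimal sketch of that computation on a hypothetical (outlook, class) dataset:

```python
import math

# Attribute selection via information gain (the ID3 criterion),
# computed on a tiny hypothetical dataset of (outlook, class) pairs.
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "no")]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)) if c)

labels = [c for _, c in rows]
base = entropy(labels)  # 4 yes vs 4 no -> 1.0 bit

# Expected entropy after splitting on the attribute.
split = 0.0
for v in set(a for a, _ in rows):
    subset = [c for a, c in rows if a == v]
    split += len(subset) / len(rows) * entropy(subset)

gain = base - split
print(round(gain, 3))  # 0.656
```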
Ques 6. What are neural networks?

An Artificial Neural Network, often just called a neural network, is a mathematical model inspired by biological neural networks. A neural network
consists of an interconnected group of artificial neurons, and it processes
information using a connectionist approach to computation. In most cases a
neural network is an adaptive system that changes its structure during a
learning phase. Neural networks are used to model complex relationships
between inputs and outputs or to find patterns in data.

The inspiration for neural networks came from examination of central nervous
systems. In an artificial neural network, simple artificial nodes, called
“neurons”, “neurodes”, “processing elements” or “units”, are connected
together to form a network which mimics a biological neural network.

There is no single formal definition of what an artificial neural network is. Generally, it involves a network of simple processing elements that exhibit
complex global behavior determined by the connections between the
processing elements and element parameters. Artificial neural networks are
used with algorithms designed to alter the strength of the connections in the
network to produce a desired signal flow.

Neural networks are also similar to biological neural networks in that functions
are performed collectively and in parallel by the units, rather than there being
a clear delineation of subtasks to which various units are assigned. The term
“neural network” usually refers to models employed in statistics, cognitive
psychology and artificial intelligence. Neural network models which emulate
the central nervous system are part of theoretical neuroscience and
computational neuroscience.

Real-life applications
The tasks artificial neural networks are applied to tend to fall within the
following broad categories:

- Function approximation, or regression analysis, including time series prediction, fitness approximation, and modeling.
- Classification, including pattern and sequence recognition, novelty detection, and sequential decision making.
- Data processing, including filtering, clustering, blind source separation, and compression.
- Robotics, including directing manipulators and computer numerical control.

Ques 7. What are grid-based methods?
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

Grid-based Method

In a grid-based method, the object space is quantized into a finite number of cells that form a grid structure, and clustering operations are performed on this grid.
Advantages
- The major advantage of this method is its fast processing time.
- The processing time depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
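A minimal sketch of the quantization step, using hypothetical 2-D points, cell size, and density threshold:

```python
# Grid-based quantization sketch: map 2-D points into square cells,
# then treat dense cells as cluster material. All values hypothetical.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),   # dense region near the origin
          (2.1, 2.2), (2.4, 2.3),               # dense region near (2, 2)
          (5.0, 0.5)]                           # isolated point
cell = 1.0  # side length of each grid cell

grid = {}
for x, y in points:
    key = (int(x // cell), int(y // cell))  # cell coordinates
    grid.setdefault(key, []).append((x, y))

# Keep cells holding at least 2 points; sparse cells are ignored.
dense = {k: v for k, v in grid.items() if len(v) >= 2}
print(sorted(dense))  # [(0, 0), (2, 2)]
```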

Ques 8. What are distance-based algorithms?
Distance-based algorithms are machine learning algorithms that classify queries by computing
distances between these queries and a number of internally stored exemplars. Exemplars that are
closest to the query have the largest influence on the classification assigned to the query. Two
specific distance-based algorithms, the nearest neighbor algorithm and the nearest-hyperrectangle
algorithm, are studied in detail.

It is shown that the k-nearest neighbor algorithm (kNN) outperforms the first-nearest-neighbor algorithm only under certain conditions: data sets must contain moderate amounts of noise, and training examples from the different classes must belong to clusters that allow an increase in the value of k without reaching into clusters of other classes. Methods for choosing the value of k for kNN are investigated. It is shown that one-fold cross-validation on a restricted number of values for k suffices for best performance. It is also shown that, for best performance, the votes of the k nearest neighbors of a query should be weighted in inverse proportion to their distances from the query.
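The distance-weighted kNN voting described above can be sketched as follows (the exemplars and queries are hypothetical):

```python
import math

# Distance-weighted k-nearest-neighbor classification: each of the k
# closest exemplars votes with weight 1/distance. Data is hypothetical.
exemplars = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
             ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

def classify(query, k=3):
    # Sort exemplars by Euclidean distance to the query.
    dists = sorted((math.dist(query, p), label) for p, label in exemplars)
    votes = {}
    for d, label in dists[:k]:
        # Inverse-distance weighting; small epsilon avoids division by zero.
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

print(classify((1.1, 1.1)))  # A
print(classify((4.0, 4.0)))  # B
```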

Ques 9. What is the difference between agglomerative and divisive hierarchical clustering?
- The agglomerative hierarchical clustering method builds clusters from the bottom up: the program always processes the sub-components first and then moves to the parent. Divisive clustering, in contrast, uses a top-down approach in which the parent is visited first and then the children.
- In the agglomerative method, each object initially forms its own cluster, and these clusters are merged step by step into larger clusters. The merging process carries on until all the single clusters are merged into one complete big cluster consisting of all the objects of the child clusters. In the divisive method, the parent cluster is divided into smaller clusters, and the division continues until each cluster contains a single object.
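The bottom-up merging process can be sketched for 1-D points with single-linkage distance (all values hypothetical):

```python
# Agglomerative (bottom-up) clustering sketch: start with singleton
# clusters and repeatedly merge the two closest ones (single linkage,
# 1-D points) until the requested number of clusters remains.
points = [1.0, 1.5, 5.0, 5.4, 12.0]

def agglomerate(values, k):
    clusters = [[v] for v in values]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return [sorted(c) for c in clusters]

print(agglomerate(points, 3))  # [[1.0, 1.5], [5.0, 5.4], [12.0]]
```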
