Data Mining Unit 3
• Requirements
• Quality Data
• Major Tasks in Data Pre-processing
• Data cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data discretization
Requirement of Data Pre-processing
Noise and Outliers
Outlier analysis by box plot
Data (sorted): 10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Lower quartile (Q1) = 14.4, Median (Q2) = 14.6, Upper quartile (Q3) = 14.9
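The quartiles and the usual 1.5 × IQR outlier fences for this example can be sketched in Python; quartiles are computed as medians of the lower and upper halves, which matches the values above:

```python
# IQR-based outlier detection for the box-plot example above.
from statistics import median

data = sorted([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
               14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

n = len(data)
q2 = median(data)
lower_half = data[: n // 2]            # values below the median
upper_half = data[(n + 1) // 2 :]      # values above the median
q1, q3 = median(lower_half), median(upper_half)

iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3)   # 14.4 14.6 14.9
print(outliers)     # values outside the 1.5*IQR fences
```

Values outside the fences (here 10.2, 15.9, and 16.4) are the points a box plot would draw as individual outlier marks.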
Chi-square test (gender vs. preferred reading)
The observed counts are laid out in a contingency table with rows for
Gender (Male, Female), columns for preferred reading, and row and
column totals.
Expected count: e = (Count_gender X Count_preferred_reading) / N
Degrees of freedom = (Rows - 1) X (Columns - 1)
                   = (2 - 1) X (2 - 1)
                   = 1
Chi-Square Table (test for independence)
Example 2: A researcher wants to know whether there is a significant
association between the variables gender and soft-drink choice (Coke
and Pepsi were considered). The null hypothesis is:
H0: There is no significant association between gender and soft-drink
choice (gender and preferred soft drink are independent).
Significance level: 5%
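A minimal sketch of this test in Python; the observed counts below are hypothetical (the notes give no data), and 3.841 is the standard chi-square critical value for df = 1 at the 5% level:

```python
# Chi-square test of independence for a 2x2 table.
# The observed counts are hypothetical, for illustration only.
observed = {("Male", "Coke"): 60, ("Male", "Pepsi"): 40,
            ("Female", "Coke"): 35, ("Female", "Pepsi"): 65}

genders = ["Male", "Female"]
drinks = ["Coke", "Pepsi"]
N = sum(observed.values())

row_tot = {g: sum(observed[(g, d)] for d in drinks) for g in genders}
col_tot = {d: sum(observed[(g, d)] for g in genders) for d in drinks}

# Expected count e = (row total x column total) / N
chi2 = 0.0
for g in genders:
    for d in drinks:
        e = row_tot[g] * col_tot[d] / N
        chi2 += (observed[(g, d)] - e) ** 2 / e

critical = 3.841  # chi-square critical value, df = 1, alpha = 0.05
print(round(chi2, 3), chi2 > critical)   # 12.531 True
```

Since the statistic exceeds the critical value here, the null hypothesis of independence would be rejected for these (hypothetical) counts.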
Data Transformation
In data transformation, the data are transformed or
consolidated into appropriate forms for mining.
Strategies used for data transformation are:
1. Smoothing: removes noise from the data.
2. Attribute construction: a new set of attributes is generated from
the given attributes.
3. Aggregation: summarization of data values.
4. Normalization: attribute values are scaled into a small range
(e.g., -1 to +1, or 0 to 1).
5. Discretization: data are divided into discrete intervals (e.g.,
0-100, 101-200, 201-300).
6. Concept hierarchy generation: attribute values are generalized to
higher-level concepts. For example, an address hierarchy is:
street < city < state < country.
Data Transformation by Normalization
Normalization changes the unit of measurement (for example, meters
to kilometres).
For better performance, data should be scaled into a small interval
such as [-1, +1] or [0, 1].
Following methods can be used for data normalization:
1. Min-Max Normalization
2. Z-Score(zero mean) Normalization
3. Normalization By decimal Scaling
Example:
Suppose that the minimum and maximum values for the attribute
income are $12,000 and $98,000, respectively. Map an income of
$73,600 to the range [0.0, 1.0].
Here min_A = 12,000, max_A = 98,000,
new_min_A = 0, new_max_A = 1, and v = 73,600.
v' = [(73,600 - 12,000) / (98,000 - 12,000)] x (1.0 - 0) + 0
   = 0.716
So 73,600 is represented as 0.716 in the new range [0, 1].
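The min-max formula above, as a small Python sketch:

```python
# Min-max normalization:
# v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

v_new = min_max(73_600, 12_000, 98_000)
print(round(v_new, 3))   # 0.716
```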
Example:
Suppose that the mean and standard deviation of the values
for the attribute income are $54,000 and $16,000,
respectively. With z-score normalization, a value of $73,600 for
income is transformed to:
v' = (73,600 - 54,000) / 16,000
   = 1.225
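The z-score formula as code, plus a sketch of the third listed method, decimal scaling (v' = v / 10^j for the smallest integer j such that max(|v'|) < 1); the values passed to the decimal-scaling call are illustrative:

```python
# Z-score normalization: v' = (v - mean) / std_dev
def z_score(v, mean, std):
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))   # 1.225

# Decimal-scaling normalization: divide by the smallest power of 10
# that brings every |value| below 1.
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([917, -13, 86]))   # [0.917, -0.013, 0.086]
```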
Numerosity reduction
• Numerosity reduction reduces the data volume by choosing
alternative, 'smaller' forms of data representation.
• These techniques may be parametric or nonparametric.
3. Sampling: a subset of the data is selected (randomly or
systematically) to represent the entire dataset.
4. Data cube aggregation: multidimensional data are summarized by
computing aggregates (e.g., sum, average) for different
combinations of dimensions in a data cube.
Histograms
• A histogram for an attribute, A, partitions the data
distribution of A into disjoint subsets, or buckets.
• If each bucket represents only a single attribute-value/
frequency pair, the buckets are called singleton buckets.
• There are several partitioning rules, including the following:
• Equal-width: In an equal-width histogram, the width of each bucket
range is uniform. The range of the attribute A is divided into
buckets of equal width. For example, if A ranges from 0 to 100 and
we use 10 buckets, each bucket has a width of 10 (e.g., [0-10),
[10-20), ..., [90-100]).
• Equal-frequency (or equi-depth): In an equal-frequency histogram,
the buckets are created so that each bucket contains roughly the
same number of contiguous data samples. For example, with 100 data
points and 10 buckets, each bucket holds roughly 10 data points.
Histograms
• Example :The following data are a list of prices of commonly
sold items at AllElectronics (rounded to the nearest dollar):
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
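The two partitioning rules applied to this price list, as a small Python sketch (3 equal-width buckets of width 10, and 4 equal-frequency buckets):

```python
# Equal-width and equal-frequency bucketing of the AllElectronics
# price list from the example above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
          20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

# Equal-width: width 10 -> buckets [1-10], [11-20], [21-30]
width = 10
equal_width = {}
for p in prices:
    lo = ((p - 1) // width) * width + 1      # bucket start: 1, 11, 21
    equal_width[(lo, lo + width - 1)] = equal_width.get((lo, lo + width - 1), 0) + 1
print(equal_width)   # {(1, 10): 13, (11, 20): 25, (21, 30): 14}

# Equal-frequency: 4 buckets, each with ~len(prices)/4 values
k = 4
n = len(prices)
equal_freq = [prices[i * n // k : (i + 1) * n // k] for i in range(k)]
print([len(b) for b in equal_freq])   # [13, 13, 13, 13]
```

Note how the equal-width bucket counts vary (13, 25, 14) while the equal-frequency buckets are forced to hold the same number of values.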
Sampling
Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
1. Simple random sample without replacement (SRSWOR) of size s:
created by drawing s of the N tuples from D (s < N), where the
probability of drawing any tuple in D is 1/N, that is, all tuples
are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s: similar
to SRSWOR, except that each time a tuple is drawn from D, it is
recorded and then placed back in D, so that it may be drawn again.
This allows the same tuple to be sampled multiple times.
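Both schemes map directly onto the standard library, as a sketch over a toy dataset:

```python
# SRSWOR and SRSWR sketches using Python's random module.
import random

random.seed(42)                  # for reproducible output
D = list(range(1, 101))          # a toy dataset of N = 100 tuples
s = 10

srswor = random.sample(D, s)     # without replacement: no repeats
srswr = random.choices(D, k=s)   # with replacement: repeats possible

print(srswor)
print(srswr)
```

`random.sample` guarantees distinct tuples (SRSWOR), while `random.choices` draws independently each time, so the same tuple may appear more than once (SRSWR).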
• 3. Cluster sampling: a probabilistic sampling technique in which
the population is divided into groups, or clusters, and a random
selection of these clusters is made. All members within the
selected clusters are included in the sample. This method is
particularly effective when the population is large and widely
dispersed geographically.
• 4. Stratified sampling: a probabilistic sampling technique in
which the population is divided into smaller homogeneous groups,
known as strata, based on shared characteristics, and a sample is
drawn from each stratum. This ensures that specific subgroups of
the population are adequately represented in the final sample,
which improves precision and helps researchers make better
estimates.
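A proportional stratified-sampling sketch; the population below (stratified by a gender-like key) is hypothetical example data:

```python
# Stratified sampling sketch: draw proportionally from each stratum.
import random

random.seed(0)
population = [("M", i) for i in range(60)] + [("F", i) for i in range(40)]

def stratified_sample(pop, key, total):
    strata = {}
    for rec in pop:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for members in strata.values():
        k = round(total * len(members) / len(pop))  # proportional share
        sample.extend(random.sample(members, k))
    return sample

s = stratified_sample(population, key=lambda r: r[0], total=10)
print(len(s))   # 10: 6 from "M", 4 from "F"
```

Each stratum contributes in proportion to its size, so the 60/40 split in the population is preserved in the 10-record sample.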
Data Cube Aggregation
• It is used to summarize multidimensional data in a data cube.
• Data aggregation is any process in which information is
gathered and expressed in a summary form, for purposes such
as statistical analysis.
• A common aggregation purpose is to get more information
about particular groups based on specific variables such as
age, profession, sales or income.
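A sketch of aggregating along one dimension of a cube; the sales records below are hypothetical:

```python
# Data cube aggregation sketch: summing sales along the 'year'
# dimension of hypothetical (year, item, amount) records.
sales = [
    ("2023", "TV", 500), ("2023", "Phone", 300),
    ("2024", "TV", 700), ("2024", "Phone", 400),
]

by_year = {}
for year, item, amount in sales:
    by_year[year] = by_year.get(year, 0) + amount
print(by_year)   # {'2023': 800, '2024': 1100}
```

The same loop with a different grouping key (e.g., item instead of year) yields the aggregate for another combination of dimensions.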
Data Discretization
Data discretization is the process of converting continuous data
attributes into discrete intervals or bins. This transformation is
particularly useful for simplifying data analysis, visualizing
patterns, and preparing data for algorithms that work better with
discrete values (e.g., certain classification algorithms).
Data discretization can be categorized by how it is performed:
1. Supervised discretization uses class information.
2. Unsupervised discretization does not use prior class information.
3. Other techniques are:
• Binning methods
• Histogram analysis
• Cluster analysis
• Decision tree analysis
• Correlation analysis
Discretization refers to transforming continuous numerical data into
discrete categories or intervals.
Concept hierarchy generation involves replacing raw attribute values
with higher-level categories or concepts (e.g., replacing an age
value such as "25" with "young").
Purpose: 1) These techniques simplify the representation of data for
analysis. 2) They are essential in mining data at multiple
abstraction levels, enabling pattern detection and insights.
Discretization and Concept Hierarchy Generation for Numerical Data
Numerosity reduction: discretization reduces the volume of data by
grouping continuous values into intervals, lowering complexity
without significant information loss.
Example: replace individual age values (1, 2, 3, ...) with intervals
(e.g., "1-10", "11-20").
4. Cluster analysis
5. Discretization by intuitive partitioning
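The age example above as a sketch, with a hypothetical concept hierarchy supplying the higher-level labels:

```python
# Discretization sketch: map raw ages to intervals, and then to
# higher-level concepts (a simple concept hierarchy).
ages = [3, 12, 25, 37, 48, 61, 74]

def age_interval(age, width=10):
    lo = ((age - 1) // width) * width + 1    # intervals 1-10, 11-20, ...
    return f"{lo}-{lo + width - 1}"

def age_concept(age):
    # Hypothetical concept hierarchy, for illustration only.
    if age <= 17:
        return "minor"
    if age <= 40:
        return "young"
    if age <= 65:
        return "middle-aged"
    return "senior"

print([age_interval(a) for a in ages])
print([age_concept(a) for a in ages])
```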
Attributes defined:
Brand: Apple, Samsung, Sony.
Type: Smartphone, Tablet, Laptop.
Model: specific models like iPhone 14, Galaxy S21, etc.
Dynamic hierarchy generation:
Case 1: Group by Brand first
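A sketch of Case 1, building the hierarchy Brand > Type > Model by grouping on Brand first; the product tuples are hypothetical, assembled from the attribute values named above (entries not named in the notes are marked as such):

```python
# Case 1 sketch: group products by Brand first, then Type, then Model.
products = [
    ("Apple", "Smartphone", "iPhone 14"),
    ("Samsung", "Smartphone", "Galaxy S21"),
    ("Apple", "Tablet", "iPad"),       # hypothetical entry
    ("Sony", "Laptop", "VAIO"),        # hypothetical entry
]

hierarchy = {}
for brand, ptype, model in products:
    hierarchy.setdefault(brand, {}).setdefault(ptype, []).append(model)

print(hierarchy["Apple"])
```

Grouping on a different attribute first (e.g., Type) would simply reorder the nesting, which is the point of dynamic hierarchy generation.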