02 Pre Processing
Course Outline
Chapter 1: Introduction to Data Mining
Chapter 2: Data Processing
Chapter 3: Classification and Prediction
Chapter 4: Mining Association Rules
Chapter 5: Cluster Analysis
Given the following data (in increasing order)
for the attribute age: 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35,
35, 35, 35, 36, 40, 45, 46. Use smoothing by bin
boundaries to smooth these data, using
equal-depth bins with a depth of 5.
Bins and their boundaries (minimum and maximum of each bin):
•Bin 1: 13, 15, 16, 16, 19; boundaries [13, 19]
•Bin 2: 20, 20, 21, 22, 22; boundaries [20, 22]
•Bin 3: 25, 25, 25, 25, 30; boundaries [25, 30]
•Bin 4: 33, 33, 35, 35, 35; boundaries [33, 35]
•Bin 5: 35, 36, 40, 45, 46; boundaries [35, 46]
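Replacing each value by the closest boundary of its bin (a value
equidistant from both boundaries, such as 16 in Bin 1 or 21 in Bin 2,
is assigned to the lower boundary here):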
•Bin 1: 13, 13, 13, 13, 19
•Bin 2: 20, 20, 20, 22, 22
•Bin 3: 25, 25, 25, 25, 30
•Bin 4: 33, 33, 35, 35, 35
•Bin 5: 35, 35, 35, 46, 46
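The worked example above can be reproduced with a short Python sketch. The function names `equal_depth_bins` and `smooth_by_boundaries` are illustrative, not from any library, and the tie rule (equidistant values go to the lower boundary) is an assumption chosen to match the results shown above.

```python
def equal_depth_bins(values, depth):
    """Split the sorted values into consecutive bins of `depth` items each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's minimum and maximum.

    Assumption: a value equidistant from both boundaries goes to the lower one.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46]

boundary_smoothed = smooth_by_boundaries(equal_depth_bins(ages, 5))
for number, b in enumerate(boundary_smoothed, start=1):
    print(f"Bin {number}: {b}")
```

A different tie rule (for example, sending equidistant values to the upper boundary) would change the results for 16 in Bin 1 and 21 in Bin 2 only.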
Given the following data (in increasing order)
for the attribute age: 13, 15, 16, 16, 19, 20, 20,
21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46. Use smoothing by bin means to
smooth these data, using equal-depth bins with a depth of 5.
Bins and their means (rounded to the nearest integer):
•Bin 1: 13, 15, 16, 16, 19; mean 15.8 ≈ 16
•Bin 2: 20, 20, 21, 22, 22; mean 21
•Bin 3: 25, 25, 25, 25, 30; mean 26
•Bin 4: 33, 33, 35, 35, 35; mean 34.2 ≈ 34
•Bin 5: 35, 36, 40, 45, 46; mean 40.4 ≈ 40
Replacing each value by its bin mean:
•Bin 1: 16, 16, 16, 16, 16
•Bin 2: 21, 21, 21, 21, 21
•Bin 3: 26, 26, 26, 26, 26
•Bin 4: 34, 34, 34, 34, 34
•Bin 5: 40, 40, 40, 40, 40
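The same exercise can be checked in Python. This is a minimal sketch: `smooth_by_means` is an illustrative name, and rounding each mean to the nearest integer is an assumption made to match the values above (the exact means of Bins 1, 4, and 5 are 15.8, 34.2, and 40.4).

```python
from statistics import mean


def smooth_by_means(values, depth):
    """Replace every value with the rounded mean of its equal-depth bin."""
    ordered = sorted(values)
    bins = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    # Assumption: means are rounded to the nearest integer, as in the slides.
    return [[round(mean(b))] * len(b) for b in bins]


ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46]

mean_smoothed = smooth_by_means(ages, 5)
for number, b in enumerate(mean_smoothed, start=1):
    print(f"Bin {number}: {b}")
```

Note that Python's `round` rounds exact halves to the nearest even integer; none of the bin means here falls on a half, so this does not affect the result.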
Regression: Data smoothing can also be done by
regression, a technique that conforms data values
to a function.
Linear regression involves finding the “best” line to
fit two attributes (or variables) so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved and the data are fit to a multidimensional
surface.
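As a rough illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces each observed value with the value on the line. The data points and the helper name `fit_line` are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements of y for five values of x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: each observed y is replaced by the fitted value on the line.
regression_smoothed = [intercept + slope * x for x in xs]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print([round(y, 2) for y in regression_smoothed])
```

Multiple linear regression follows the same idea with more than one predictor attribute; in practice a numerical library would usually be used for that case rather than hand-written sums.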
Clustering: Clustering groups similar data
values into clusters and can also be used to
find outliers.
Outliers may be detected by clustering, for
example, where similar values are organized
into groups, or “clusters.”
Intuitively, values that fall outside of the set of
clusters may be considered outliers.
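The idea can be sketched for one-dimensional data. This is a deliberately simple gap-based grouping, not a specific clustering algorithm from the course: sorted values are placed in the same cluster while consecutive values differ by at most `max_gap`, and clusters smaller than `min_size` are treated as outliers. The data, the function names, and both thresholds are assumptions made for illustration.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Values in clusters with fewer than min_size members count as outliers."""
    return [v for c in gap_clusters(values, max_gap)
            if len(c) < min_size
            for v in c]


# Hypothetical readings: two dense groups and one isolated value.
readings = [21, 22, 23, 22, 48, 50, 51, 49, 95]

clusters = gap_clusters(readings, max_gap=3)
outliers = find_outliers(readings, max_gap=3, min_size=2)
print(clusters)
print(outliers)
```

The result depends heavily on the chosen thresholds, so values flagged this way are candidates for inspection rather than confirmed errors.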
Some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored
in a single place. Having all data in one place can simplify access
and analysis. This step typically involves using data warehouse
software.
• Data virtualization: In this approach, an interface provides a
unified and real-time view of data from multiple sources. In other
words, the data stays in its original sources but can be accessed
through a single point.
• Data propagation: Involves copying data from one location to
another with the help of specific applications. This process can be
synchronous or asynchronous and is usually event-driven.
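The difference between the three approaches can be illustrated with a toy Python sketch in which plain dictionaries stand in for data sources. All names here (`consolidate`, `UnifiedView`, `propagate`, and the sample records) are hypothetical; real systems rely on dedicated data warehouse, virtualization, or replication software.

```python
def consolidate(*sources):
    """Consolidation: copy all records into one physical store (a snapshot)."""
    store = {}
    for source in sources:
        store.update(source)
    return store


class UnifiedView:
    """Virtualization: one access point; reads go to the live sources."""

    def __init__(self, *sources):
        self.sources = sources

    def get(self, key):
        for source in self.sources:
            if key in source:
                return source[key]
        return None


def propagate(key, source, target):
    """Propagation: copy one record to another location when an event occurs."""
    target[key] = source[key]


crm = {"alice": "alice@example.com"}
billing = {"bob": "bob@example.com"}

warehouse = consolidate(crm, billing)   # physical copy of both sources
view = UnifiedView(crm, billing)        # no copy, just a single interface

crm["carol"] = "carol@example.com"      # a new record arrives in one source
print(view.get("carol"))                # visible through the view immediately
print("carol" in warehouse)             # not in the earlier snapshot

propagate("carol", crm, warehouse)      # event-driven copy into the warehouse
print("carol" in warehouse)
```

In this sketch the consolidated store only reflects new data after a copy step, while the virtual view always reads from the sources directly.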
Group Work
Grp 1: Discuss the following methods of attribute subset
selection:
- Stepwise forward selection
- Stepwise backward elimination
Grp 2: Discuss min-max normalization.
Grp 3: Discuss z-score normalization.
Grp 4: Discuss normalization by decimal scaling.
Grp 5: Discuss discretization.
Grp 6: Discuss concept hierarchy generation.