L2 Data Preparation

The document discusses data preprocessing techniques including data cleaning, integration, reduction, and transformation. It explains why preprocessing is important to ensure data quality and prepare data for analysis. Specific techniques covered include data cleaning methods like handling missing values, smoothing noisy data through binning, clustering, and regression. Examples are provided to illustrate binning noisy temperature values.

Uploaded by

Vy Phan Thị Thanh


6/28/2023

Data Preparation & Preprocessing

Instructor: Dr. Võ Thị Hồng Thắm

Quick check

Briefly describe what you know about:
Data quality
Data cleaning
Data integration
Data reduction
Data transformation
Data discretization


The Knowledge Discovery Process

- The KDD Process

Data Preprocessing
 Why do we need to prepare the data?
 In real-world applications, data can be inconsistent, incomplete, and/or noisy
Data entry, data transmission, or data collection problems
Discrepancies in naming conventions
Duplicated records
Incomplete or missing data
Contradictions in data
 What happens when the data cannot be trusted?
 Can the decision be trusted? Decision making is jeopardized

 There is a better chance of discovering useful knowledge when the data is clean


Data Preprocessing
Data Cleaning
Data Integration
Data Transformation (e.g., -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48)
Data Reduction

Data Cleaning
 Real-world application data can be incomplete, noisy, and
inconsistent
 No recorded values for some attributes
 Not considered at time of entry
 Random errors
 Irrelevant records or fields

 Data cleaning attempts to:

 Fill in missing values
 Smooth out noisy data
 Correct inconsistencies
 Remove irrelevant data


Dealing with Missing Values

 Solving the Missing Data Problem
 Ignore the record with missing values;
 Fill in the missing values manually;
 Use a global constant to fill in missing values (NULL, unknown, etc.);
 Use the attribute mean to fill in missing values of that attribute;
 Use the attribute mean for all samples belonging to the same class to fill in the missing values;
 Infer the most probable value to fill in the missing value; this may require methods such as Bayesian classification or decision trees to infer missing attribute values automatically.
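Three of the strategies above (global constant, attribute mean, class-conditional mean) can be sketched in a few lines. This is only an illustration: the records, the "income" attribute, and the "cls" class label are hypothetical and do not come from the slides.

```python
from statistics import mean

# Hypothetical records; None marks a missing "income" value.
records = [
    {"id": 1, "cls": "low", "income": 20},
    {"id": 2, "cls": "low", "income": None},
    {"id": 3, "cls": "low", "income": 30},
    {"id": 4, "cls": "high", "income": 80},
    {"id": 5, "cls": "high", "income": None},
    {"id": 6, "cls": "high", "income": 100},
]

def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the known values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean of the known values in the same class.

    Assumes every class has at least one known value.
    """
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    means = {c: mean(vs) for c, vs in known.items()}
    return [
        {**r, attr: means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

constant_filled = fill_constant(records, "income")
global_filled = fill_mean(records, "income")
class_filled = fill_class_mean(records, "income", "cls")
```

With this toy data the global mean fills both gaps with 57.5, while the class mean fills record 2 with 25 and record 5 with 90, which shows why the class-conditional variant tends to distort the data less.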

Smoothing Noisy Data

 The purpose of data smoothing is to eliminate noise and
“smooth out” the data fluctuations.

Ex: Original Data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

Binning: partition into equidepth bins

    Bin1: 4, 8, 15
    Bin2: 21, 21, 24
    Bin3: 25, 28, 34

Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.

    Bin1: 9, 9, 9
    Bin2: 22, 22, 22
    Bin3: 29, 29, 29

Smoothing by bin boundaries: the min and max values in each bin are identified (boundaries). Each value in a bin is replaced with the closest boundary value.

    Bin1: 4, 4, 15
    Bin2: 21, 21, 24
    Bin3: 25, 25, 34
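The price example can be reproduced with a short sketch. The function names are illustrative, and two details are assumptions rather than something the slide specifies: bin means are rounded to integers, and a value equally close to both boundaries goes to the lower one.

```python
def equidepth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Here `by_means` is [[9, 9, 9], [22, 22, 22], [29, 29, 29]] and `by_boundaries` is [[4, 4, 15], [21, 21, 24], [25, 25, 34]], matching the bins listed above.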


Smoothing Noisy Data

 Other Methods

Clustering: Similar values are organized into groups (clusters). Values falling outside of clusters may be considered “outliers” and may be candidates for elimination.

Regression: Fit data to a function. Linear regression finds the best line to fit two variables. Multiple regression can handle multiple variables. The values given by the function are used instead of the original values.

Smoothing Noisy Data - Example

Want to smooth “Temperature” by bin means with bins of size 3:
1. First sort the values of the attribute (keep track of the ID or key so that the transformed values can be put back into the original table).
2. Divide the data into bins of size 3 (or less in case of last bin).
3. Convert the values in each bin to the mean value for that bin
4. Put the resulting values into the original table

Original table:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Temperature values sorted and divided into bins:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5


Smoothing Noisy Data - Example

ID  Temperature  Bin   Temperature (bin mean)
7   58           Bin1  64
6   65           Bin1  64
5   68           Bin1  64
9   69           Bin2  70
4   70           Bin2  70
10  71           Bin2  70
8   72           Bin3  73
12  73           Bin3  73
11  75           Bin3  73
14  75           Bin4  79
2   80           Bin4  79
13  81           Bin4  79
3   83           Bin5  84
1   85           Bin5  84

Value of every record in each bin is changed to the mean value for
that bin. If it is necessary to keep the value as an integer, then the
mean values are rounded to the nearest integer.

Smoothing Noisy Data - Example

The final table with the new values for the Temperature attribute.

ID Outlook Temperature Humidity Windy

1 sunny 84 85 FALSE
2 sunny 79 90 TRUE
3 overcast 84 78 FALSE
4 rain 70 96 FALSE
5 rain 64 80 FALSE
6 rain 64 70 TRUE
7 overcast 64 65 TRUE
8 sunny 73 95 FALSE
9 sunny 70 70 FALSE
10 rain 70 80 FALSE
11 sunny 73 70 TRUE
12 overcast 73 90 TRUE
13 overcast 79 75 FALSE
14 rain 79 80 TRUE
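The four steps of this example (sort while keeping the ID, cut into bins of size 3, replace by the rounded bin mean, write back by ID) can be sketched as follows. The function name is illustrative; the data is the Temperature column from the original table.

```python
# Temperature column of the original table, keyed by record ID.
temperature = {1: 85, 2: 80, 3: 83, 4: 70, 5: 68, 6: 65, 7: 58,
               8: 72, 9: 69, 10: 71, 11: 75, 12: 73, 13: 81, 14: 75}

def smooth_by_bin_means(values_by_id, bin_size):
    # Step 1: sort by value, keeping each ID next to its value.
    ordered = sorted(values_by_id.items(), key=lambda item: item[1])
    smoothed = {}
    # Step 2: walk the sorted list in chunks of bin_size (the last bin may be smaller).
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        # Step 3: compute the bin mean, rounded to the nearest integer.
        bin_mean = round(sum(value for _, value in chunk) / len(chunk))
        # Step 4: store the mean under each original ID.
        for record_id, _ in chunk:
            smoothed[record_id] = bin_mean
    return smoothed

smoothed = smooth_by_bin_means(temperature, 3)
```

Looking up `smoothed` by ID gives the Temperature column of the final table, for example 84 for ID 1 and 64 for ID 7.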


Data Integration
 Data analysis may require a combination of data from multiple
sources into a coherent data store
 Challenges in Data Integration:
 Schema integration: CID = C_number = Cust-id = cust#
 Semantic heterogeneity
 Data value conflicts (different representations or scales, etc.)
 Synchronization (especially important in Web usage mining)
 Redundant attributes (redundant if it can be derived from other attributes) --
may be able to identify redundancies via correlation analysis:

Pr(A,B) / (Pr(A) · Pr(B))
= 1: independent,
> 1: positive correlation,
< 1: negative correlation.

 Meta-data is often necessary for successful data integration
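The ratio Pr(A,B) / (Pr(A) · Pr(B)) used for redundancy detection can be estimated from co-occurrence counts. The sketch below assumes two boolean attributes recorded as 0/1 lists of equal length; the data is made up for illustration.

```python
def correlation_ratio(a, b):
    """Estimate Pr(A,B) / (Pr(A) * Pr(B)) from two equal-length 0/1 lists."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    p_a = sum(1 for x in a if x) / n
    p_b = sum(1 for y in b if y) / n
    p_ab = sum(1 for x, y in zip(a, b) if x and y) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("ratio is undefined when A or B never occurs")
    return p_ab / (p_a * p_b)

# Hypothetical attribute pairs.
positive = correlation_ratio([1, 1, 0, 0], [1, 1, 0, 0])     # always occur together
independent = correlation_ratio([1, 1, 0, 0], [1, 0, 1, 0])  # no relationship
negative = correlation_ratio([1, 1, 0, 0], [0, 0, 1, 1])     # never occur together
```

The three calls give 2.0, 1.0, and 0.0, which fall in the "> 1", "= 1", and "< 1" cases respectively.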

Data Transformation: Normalization

 Min-max normalization: linear transformation from v to v’
 v’ = [(v - min)/(max - min)] x (newmax - newmin) + newmin
 Note that if the new range is [0..1], then this simplifies to
v’ = [(v - min)/(max - min)]
 Ex: transform $30000 between [10000..45000] into [0..1] ==>
[(30000 – 10000) / 35000] = 0.571

 z-score normalization: normalization of v into v’ based on the attribute mean and standard deviation
 v’ = (v - Mean) / StandardDeviation

 Normalization by decimal scaling

 moves the decimal point of v by j positions, where j is the minimum number of positions such that the maximum absolute value falls in [0..1].
 v’ = v / 10^j
 Ex: if v in [-56 .. 9976] and j=4 ==> v’ in [-0.0056 .. 0.9976]
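Min-max normalization and decimal scaling can be sketched directly from the formulas above. The function names are illustrative, and the decimal-scaling loop follows the slide's convention that the maximum absolute value must end up in [0..1] (some texts require it to be strictly below 1).

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def decimal_scaling(values):
    """Return (j, scaled values), with j the smallest shift putting max |v| in [0, 1]."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j > 1:
        j += 1
    return j, [v / 10 ** j for v in values]

salary = min_max(30000, 10000, 45000)
j, scaled = decimal_scaling([-56, 9976])
```

Here `salary` is about 0.571, `j` is 4, and `scaled` is [-0.0056, 0.9976].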

Normalization: Example
 z-score normalization: v’ = (v - Mean) / Stdev
 Example: normalizing the “Humidity” attribute:

Mean = 80.3, Stdev = 9.84

Humidity  Normalized Humidity
85         0.48
90         0.99
78        -0.23
96         1.60
80        -0.03
70        -1.05
65        -1.55
95         1.49
70        -1.05
80        -0.03
70        -1.05
90         0.99
75        -0.54
80        -0.03
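The Humidity example can be checked with the standard library. The stated Stdev of 9.84 matches the sample standard deviation (n - 1 denominator), so the sketch uses `statistics.stdev`. The tabulated values appear to have been computed from the rounded mean and standard deviation (80.3 and 9.84); with unrounded statistics the entry for 95 comes out as 1.50 instead of 1.49. That reading is an inference from the numbers, not something the slide states.

```python
from statistics import mean, stdev

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

def z_scores(values, m, s):
    """Apply v' = (v - mean) / stdev to every value."""
    return [(v - m) / s for v in values]

m = mean(humidity)
s = stdev(humidity)  # sample standard deviation

# Using the rounded statistics shown on the slide.
slide_style = [round(z, 2) for z in z_scores(humidity, round(m, 1), round(s, 2))]

# Using the unrounded statistics.
exact = [round(z, 2) for z in z_scores(humidity, m, s)]
```

`slide_style` reproduces the normalized column of the table, while `exact` differs only in the entry for 95.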

Normalization: Example II
 Min-Max normalization on an employee database
 max distance for salary: 100000-19000 = 81000
 max distance for age: 52-27 = 25
 New min for age and salary = 0; new max for age and salary = 1

Original data:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

After min-max normalization (Gender encoded as F = 1, M = 0):

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32


Data Transformation: Discretization

 3 Types of attributes
 nominal - values from an unordered set (also “categorical” attributes)
 ordinal - values from an ordered set
 numeric/continuous - real numbers (but sometimes also integer values)

 Discretization is used to reduce the number of values for a given continuous attribute
 usually done by dividing the range of the attribute into intervals
 interval labels are then used to replace actual data values

 Some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values
 Discretization can also be used to generate concept hierarchies
 reduce the data by collecting and replacing low level concepts (e.g., numeric
values for “age”) by higher level concepts (e.g., “young”, “middle aged”, “old”)

Discretization - Example
 Example: discretizing the “Humidity” attribute using 3
bins.
Bins: Low = 60-69, Normal = 70-79, High = 80+

Humidity  Discretized Humidity
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
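Replacing values by interval labels amounts to a lookup against the cut points. The sketch below uses the cut points 70 and 80 implied by the three bins; treating anything below 70 as Low (including values under 60, which the slide does not define) is an assumption.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Return the label of the interval that contains value.

    cut_points must be sorted, and labels needs one more entry than cut_points.
    """
    return labels[bisect_right(cut_points, value)]

cut_points = [70, 80]               # Low < 70 <= Normal < 80 <= High
labels = ["Low", "Normal", "High"]

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
discretized = [discretize(v, cut_points, labels) for v in humidity]
```

The resulting list starts High, High, Normal, High, High, Normal, Low, in the same order as the table.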


Data Discretization Methods

Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis
Unsupervised, top-down split or bottom-up merge
Decision-tree analysis
Supervised, top-down split
Correlation (e.g., χ²) analysis
Supervised, bottom-up merge

Simple Discretization: Binning

 Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate
presentation
Skewed data is not handled well
 Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing
approximately same number of samples
Good data scaling
Managing categorical attributes can be tricky
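The two partitioning schemes can be compared side by side with a small sketch. The function names are illustrative; the first data set is the price list from the smoothing example, and the second is a made-up list with one outlier.

```python
def equal_width_partition(values, n):
    """Assign values to n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    if a == b:
        raise ValueError("all values are equal; the interval width would be zero")
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value B would fall into index n, so clamp it into the last bin.
        index = min(int((v - a) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_partition(values, n):
    """Split the sorted values into n bins of (nearly) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_partition(prices, 3)
depth_bins = equal_depth_partition(prices, 3)

# Hypothetical data with one outlier.
skewed = [1, 2, 3, 4, 5, 100]
skewed_width = equal_width_partition(skewed, 3)
skewed_depth = equal_depth_partition(skewed, 3)
```

For `skewed`, equal-width partitioning gives [[1, 2, 3, 4, 5], [], [100]], so the outlier stretches the range and leaves one interval empty, while equal-depth partitioning gives [[1, 2], [3, 4], [5, 100]].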


Discretization by Classification & Correlation Analysis
 Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
 Correlation analysis (e.g., Chi-merge: χ²-based discretization)
Supervised: use class information
Bottom-up merge: merge the best neighboring intervals (those with similar distributions of classes, i.e., low χ² values)
Merge performed recursively, until a predefined stopping condition is met
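The entropy-based choice of a split point can be sketched as a search over midpoints between adjacent sorted values, keeping the one with the lowest weighted class entropy. This shows a single split only (the recursive, top-down part is omitted), and the values and labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, weighted_entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_entropy = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if weighted < best_entropy:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_entropy = weighted
    return best_point, best_entropy

# Hypothetical measurements with class labels.
sizes = [1, 2, 3, 10, 11, 12]
classes = ["benign", "benign", "benign", "cancerous", "cancerous", "cancerous"]
split_point, split_entropy = best_split(sizes, classes)
```

For this toy data the best split is at 6.5 with a weighted entropy of 0, because both resulting intervals contain a single class.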

Converting Categorical Attributes to Numerical Attributes

Attributes:
    Outlook (overcast, rain, sunny)
    Temperature real
    Humidity real
    Windy (true, false)

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Standard Spreadsheet Format: create separate columns for each value of a categorical attribute (e.g., 3 values for the Outlook attribute and two values of the Windy attribute). There is no change to the numerical attributes.

Outlook   Outlook  Outlook  Temp  Humidity  Windy  Windy
overcast  rain     sunny                    TRUE   FALSE
0         0        1        85    85        0      1
0         0        1        80    90        1      0
1         0        0        83    78        0      1
0         1        0        70    96        0      1
0         1        0        68    80        0      1
0         1        0        65    70        1      0
1         0        0        58    65        1      0
.         .        .        .     .         .      .
.         .        .        .     .         .      .
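Creating one 0/1 column per category value can be sketched as below. The function name and the generated column names (such as `Outlook_sunny`) are illustrative choices; the rows are the first three records of the table.

```python
def one_hot(rows, attribute, categories):
    """Replace one categorical attribute with a 0/1 column per category."""
    encoded_rows = []
    for row in rows:
        encoded = {}
        for name, value in row.items():
            if name == attribute:
                # Expand the categorical column in place, one column per category.
                for category in categories:
                    encoded[f"{attribute}_{category}"] = int(value == category)
            else:
                encoded[name] = value  # other attributes are left unchanged
        encoded_rows.append(encoded)
    return encoded_rows

rows = [
    {"Outlook": "sunny", "Temperature": 85, "Humidity": 85, "Windy": False},
    {"Outlook": "sunny", "Temperature": 80, "Humidity": 90, "Windy": True},
    {"Outlook": "overcast", "Temperature": 83, "Humidity": 78, "Windy": False},
]

encoded = one_hot(rows, "Outlook", ["overcast", "rain", "sunny"])
encoded = one_hot(encoded, "Windy", [True, False])
```

The first encoded row holds the values 0, 0, 1, 85, 85, 0, 1 in column order, the same as the first row of the spreadsheet-format table.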


Data Reduction
 Data is often too large; reducing data can improve performance
 Data reduction consists of reducing the representation of the data
set while producing the same (or almost the same) results
 Data reduction includes:
 Data cube aggregation
 Dimensionality reduction
 Discretization
 Numerosity reduction
Regression
Histograms
Clustering
Sampling

Data Cube Aggregation

 Reduce the data to the concept level needed in the analysis
 Use the smallest (most detailed) level necessary to solve the problem

 Queries regarding aggregated information should be answered using the data cube when possible


Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Principal Component Analysis
 Attribute subset selection
 Attribute or feature generation

Principal Component Analysis (PCA)

 Find a projection that captures the largest amount of variation
in data
 The original data are projected onto a much smaller space,
resulting in dimensionality reduction
 Done by finding the eigenvectors of the covariance matrix, and these
eigenvectors define the new space


Principal Component Analysis (Steps)

 Given N data vectors (rows in a table) from n dimensions
(attributes), find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 The size of the data can be reduced by eliminating the weak
components, i.e., those with low variance
Using the strongest principal components, it is possible to
reconstruct a good approximation of the original data
 Works for numeric data only
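For two attributes, the strongest principal component can be computed in closed form from the 2x2 covariance matrix, which keeps a sketch free of external libraries. Several simplifications are assumptions made for illustration: the data is only centered (the normalization step is skipped), only the first component is kept, and the points are made up so that they lie exactly on a line.

```python
from math import atan2, cos, sin, sqrt
from statistics import mean

def first_component_2d(points):
    """Return (center, unit direction, explained variance ratio) for 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Largest eigenvalue and the angle of its eigenvector (symmetric 2x2 case).
    largest = (sxx + syy) / 2 + sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return (mx, my), (cos(theta), sin(theta)), largest / (sxx + syy)

def project(points, center, direction):
    """Reduce each 2-D point to one coordinate along the principal component."""
    return [(x - center[0]) * direction[0] + (y - center[1]) * direction[1]
            for x, y in points]

def reconstruct(scores, center, direction):
    """Approximate the original points from their 1-D coordinates."""
    return [(center[0] + s * direction[0], center[1] + s * direction[1])
            for s in scores]

# Hypothetical points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
center, direction, explained = first_component_2d(points)
scores = project(points, center, direction)
approx = reconstruct(scores, center, direction)
```

Because these points are perfectly collinear, one component explains all of the variance and the reconstruction recovers the original points up to floating-point error. With noisy data the reconstruction would only be an approximation.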

Attribute Subset Selection

 Another way to reduce dimensionality of data
 Redundant attributes
Duplicate much or all of the information contained in one or
more other attributes
E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
Contain no information that is useful for the data mining task at
hand
E.g., students' ID is often irrelevant to the task of predicting
students' GPA


Heuristic Search in Attribute Selection

 There are 2^d possible attribute combinations of d attributes
 Typical heuristic attribute selection methods:
Best single attribute under the attribute independence assumption: choose by significance tests
Best step-wise feature selection:
The best single attribute is picked first, then the next best attribute conditioned on the first, ...
{} → {A1} → {A1, A3} → {A1, A3, A5}
Step-wise attribute elimination:
Repeatedly eliminate the worst attribute: {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}, ...
Combined attribute selection and elimination
Decision Tree Induction
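Step-wise selection and step-wise elimination are both greedy searches over attribute subsets. The sketch below takes the scoring function as a parameter, since the slides do not fix one. The `usefulness` weights and the additive `toy_score` are invented purely so that the example follows the same sequences of subsets as above; a real score would come from a significance test or from model accuracy.

```python
def forward_selection(attributes, score, k):
    """Greedily add the attribute that most improves the score until k are chosen."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_elimination(attributes, score, k):
    """Greedily drop the attribute whose removal hurts the score least until k remain."""
    selected = list(attributes)
    while len(selected) > k:
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        selected.remove(worst)
    return selected

# Hypothetical per-attribute usefulness and a toy additive score.
usefulness = {"A1": 5, "A2": 1, "A3": 4, "A4": 2, "A5": 3}

def toy_score(subset):
    return sum(usefulness[a] for a in subset)

attributes = ["A1", "A2", "A3", "A4", "A5"]
forward = forward_selection(attributes, toy_score, 3)
backward = backward_elimination(attributes, toy_score, 3)
```

With this toy score, forward selection picks A1, then A3, then A5, and backward elimination drops A2 and then A4, so both end at {A1, A3, A5}. With a non-additive score the two searches can end at different subsets.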

Decision Tree Induction

Use information theoretic techniques to select the most
“informative” attributes


Attribute Creation (Feature Generation)

 Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
 Three general methodologies
Attribute extraction
 Domain-specific
Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation, etc.
Attribute construction
Combining features
Data discretization

Data Reduction: Numerosity Reduction

 Reduce data volume by choosing alternative, smaller forms
of data representation
 Parametric methods (e.g., regression)
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
Ex.: Log-linear models—obtain value at a point in m-D
space as the product on appropriate marginal subspaces
 Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling, …


Regression Analysis
 Collection of techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors)
 The parameters are estimated to obtain a "best fit" of the data
 Typically the best fit is evaluated by using the least squares method, but other criteria have also been used
 Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

[Figure: regression line y = x + 1 in the x-y plane, with an observed value Y1 and its fitted value Y1’ at X1]

Regression Analysis
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
 Using the least squares criterion on known values of Y1, Y2, …, X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point in a multi-dimensional space for a
set of discretized attributes, based on a smaller subset of dimensions
 Useful for dimensionality reduction and data smoothing
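For simple linear regression, the least squares estimates of w and b have a closed form. The sketch below uses hypothetical points that lie exactly on y = x + 1, so the fitted coefficients are easy to check by hand.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares estimates (w, b) for the model Y = w X + b."""
    mx, my = mean(xs), mean(ys)
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

def predict(x, w, b):
    """Value given by the fitted line for input x."""
    return w * x + b

# Hypothetical data on the line y = x + 1.
xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]
w, b = fit_line(xs, ys)
fitted = [predict(x, w, b) for x in xs]
```

Here w and b are both 1.0. As a parametric reduction, only these two numbers need to be stored, and the y values can be regenerated from the x values.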


Numerosity Reduction
 Reduction via histograms:
 Divide data into buckets and store
representation of buckets (sum, count, etc.)

 Reduction via clustering

 Partition data into clusters based on
“closeness” in space
 Retain representatives of clusters (centroids)
and outliers

 Reduction via sampling

 Will the patterns in the sample represent the
patterns in the data?
 Random sampling can produce poor results
 Stratified sample (stratum = group based on
attribute value)
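A stratified sample draws from each stratum separately, so small groups are not lost by chance. In the sketch below, the function name, the fixed seed, the rule of keeping at least one record per stratum, and the 90/10 age-group data are all assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from every stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one record from every stratum.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 90 "young" records and 10 "old" records.
rows = ([{"id": i, "group": "young"} for i in range(90)]
        + [{"id": 90 + i, "group": "old"} for i in range(10)])

sample = stratified_sample(rows, key=lambda r: r["group"], fraction=0.1)
young = sum(1 for r in sample if r["group"] == "young")
old = sum(1 for r in sample if r["group"] == "old")
```

The sample contains 9 young and 1 old record, preserving the 90/10 proportion. A simple random sample of 10 records could easily contain no old records at all.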

Sampling Techniques

[Figure: raw data compared with a cluster/stratified sample]
