0% found this document useful (0 votes)

2 views

Slide 05 Chapter3 Data Preprocessing

Chapter 3 discusses data preprocessing, emphasizing the importance of data quality and outlining major tasks such as data cleaning, integration, reduction, and transformation. It highlights issues like missing and noisy data, methods for handling these problems, and the significance of dimensionality reduction techniques like PCA. The chapter concludes with strategies for effective data reduction to enhance analytical efficiency.

Uploaded by

a19910207

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

2 views

Slide 05 Chapter3 Data Preprocessing

Uploaded by

a19910207

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 58

Chapter 3.

Data
Preprocessing
HUI-YIN CHANG (張彙音)

1
Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

◦ Data Quality

◦ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

2
Data Quality: Why Preprocess the Data?

Measures for data quality: A multidimensional view

◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: timely update?
◦ Believability: how trustable the data are correct?
◦ Interpretability: how easily the data can be understood?

3
Major Tasks in Data Preprocessing
Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies

Data integration
◦ Integration of multiple databases, data cubes, or files

Data reduction
◦ Dimensionality reduction
◦ Numerosity reduction
◦ Data compression

Data transformation and data discretization

◦ Normalization
◦ Concept hierarchy generation

4
Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

◦ Data Quality

◦ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary
5
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human
or computer error, transmission error
◦ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
◦ e.g., Occupation=“ ” (missing data)
◦ noisy: containing noise, errors, or outliers
◦ e.g., Salary=“−10” (an error)
◦ inconsistent: containing discrepancies in codes or names, e.g.,
◦ Age=“42”, Birthday=“03/07/2010”
◦ Was rating “1, 2, 3”, now rating “A, B, C”
◦ discrepancy between duplicate records
◦ Intentional (e.g., disguised missing data)
◦ Jan. 1 as everyone’s birthday?

6
Incomplete (Missing) Data
Data is not always available
◦ E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
◦ equipment malfunction (故障)
◦ inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data may not be considered important at the time of
entry
◦ not register history or changes of the data
Missing data may need to be inferred

7
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
◦ a global constant : e.g., “unknown”, a new class?!
◦ the attribute mean
◦ the attribute mean for all samples belonging to the same class:
smarter
◦ the most probable value: inference-based such as Bayesian
formula or decision tree
8
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
◦ faulty data collection instruments
◦ data entry problems
◦ data transmission problems
◦ technology limitation
◦ inconsistency in naming convention
Other data problems which require data cleaning
◦ duplicate records
◦ incomplete data
◦ inconsistent data

9
How to Handle Noisy Data?
Binning
◦ first sort data and partition into (equal-frequency) bins
◦ then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
◦ smooth by fitting the data into regression functions
Clustering
◦ detect and remove outliers
Combined computer and human inspection
◦ detect suspicious values and check by human (e.g., deal with
possible outliers)

10
Data Cleaning as a Process
Data discrepancy (差異) detection
◦ Use metadata (e.g., domain, range, dependency, distribution)
◦ Check field overloading
◦ Check uniqueness rule, consecutive rule and null rule
◦ Use commercial tools
◦ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors
and make corrections
◦ Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers)
Data migration and integration
◦ Data migration tools: allow transformations to be specified
◦ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
Integration of the two processes
◦ Iterative and interactive (e.g., Potter’s Wheels 陶輪)
11
Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

◦ Data Quality

◦ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

12
Data Integration
Data integration:
◦ Combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id  B.cust-#

◦ Integrate metadata from different sources

Entity identification problem:

◦ Identify real world entities from multiple data sources, e.g., Bill Clinton = William
Clinton

Detecting and resolving data value conflicts

◦ For the same real world entity, attribute values from different sources are different
◦ Possible reasons: different representations, different scales, e.g., metric vs. British
units

13
Handling Redundancy in Data Integration
Redundant data occur often when integration of multiple
databases
◦ Object identification: The same attribute or object may have
different names in different databases (中/英文名字)
◦ Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue (月薪vs年薪)
Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality

14
Correlation Analysis (Nominal Data)
Χ2 (chi-square) test
For nominal data, a correlation relationship between two
attributes, A and B, can be discovered by a Χ2 (chi-square) test .

ID Gender Book Type

1 male fiction
2 female non_fiction
3 male non_fiction
4 female fiction
5 female fiction
… … …
1500 male non_fiction

We can use X2 to calculate the correlation between the two attributes.

15
Correlation Analysis (Nominal Data)
Χ2 (chi-square) test (hypothesis that A and B are independent -> no correlation
between them)

2
(𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑 − 𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑) 𝑐𝑜𝑢𝑛𝑡 𝐴 = 𝑎𝑖 × 𝑐𝑜𝑢𝑛𝑡 𝐵 = 𝑏𝑖
𝜒2 = ෍ 𝑒𝑖𝑗 =
𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑 𝑛

The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
Correlation does not imply causality
◦ # of hospitals and # of car-theft in a city are correlated
◦ Both are causally linked to the third variable: population

16
Chi-Square Calculation: An Example
male female Sum (row) 300 × 450
𝑒11 = = 90
fiction 250(90) 200(360) 450 1500
non_fiction 50(210) 1000(840) 1050 𝑒21 =
300 × 1050
= 210 𝑒12 =
1200 × 450
= 360
1500 1500
Sum(col.) 300 1200 1500
1200 × 1050
𝑒22 = = 840
1500

Χ2 (chi-square) calculation (numbers in parenthesis are expected

counts calculated based on the data distribution in the two
categories)
(250 − 90) 2 (50 − 210) 2 (200 − 360) 2 (1000 − 840) 2
 =
2
+ + + = 507.93
90 210 360 840

17
Chi-Square Calculation: An Example

18
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson’s product moment
coefficient)
i=1 (ai − A)(bi − B) 
n n
(ai bi ) − n AB
rA, B = = i =1
(n − 1) A B (n − 1) A B

where n is the number of tuples, A and B are the respective means of A and B,
σA and σB are the respective standard deviation of A and B, and Σ(aibi) is the
sum of the AB cross-product.

If rA,B > 0, A and B are positively correlated (A’s values increase as

B’s). The higher, the stronger correlation.
rA,B = 0: independent; rAB < 0: negatively correlated

19
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

20
Correlation (viewed as linear relationship)
Correlation measures the linear relationship between objects
To compute correlation, we standardize data objects, A and B, and then take their dot product

a'k = (ak − mean( A)) / std ( A)

b'k = (bk − mean( B)) / std ( B)

correlation( A, B) = A'• B'

21
Covariance (Numeric Data)
Covariance is similar to correlation

Correlation coefficient:

where n is the number of tuples, A and B are the respective mean or expected values
of A and B, σA and σB are the respective standard deviation of A and B.
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to
be smaller than its expected value.
Independence: CovA,B = 0 but the converse is not true:
◦ Some pairs of random variables may have a covariance of 0 but are not independent. Only under
some additional assumptions (e.g., the data follow multivariate normal distributions) does a
covariance of 0 imply independence

22
Co-Variance: An Example

It can be simplified in computation as

Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).

Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
◦ E(A) =𝐴ҧ = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
◦ E(B) = 𝐵ത = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
◦ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

Thus, A and B rise together since Cov(A, B) > 0.

23
Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

◦ Data Quality

◦ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

24
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is much smaller in volume
but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
Data reduction strategies
◦ Dimensionality reduction, e.g., remove unimportant attributes
◦ Wavelet transforms
◦ Principal Components Analysis (PCA)
◦ Feature subset selection, feature creation
◦ Numerosity reduction (some simply call it: Data Reduction)
◦ Regression and Log-Linear Models
◦ Histograms, clustering, sampling
◦ Data cube aggregation
◦ Data compression

25
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
◦ When dimensionality increases, data becomes increasingly sparse
◦ Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful
◦ The possible combinations of subspaces will grow exponentially

Dimensionality reduction
◦ Avoid the curse of dimensionality
◦ Help eliminate irrelevant features and reduce noise
◦ Reduce time and space required in data mining
◦ Allow easier visualization

Dimensionality reduction techniques

◦ Wavelet transforms
◦ Principal Component Analysis
◦ Supervised and nonlinear techniques (e.g., feature selection)

26
Mapping Data to a New Space
◼ Fourier transform
◼ Wavelet transform

Two Sine Waves Two Sine Waves + Noise Frequency

27
What Is Wavelet Transform?
Decomposes a signal into different
frequency subbands
◦ Applicable to n-dimensional
signals
Data are transformed to preserve
relative distance between objects at
different levels of resolution
Allow natural clusters to become
more distinguishable
Used for image compression

28
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space

x1
29
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
◦ Normalize input data: Each attribute falls within the same range
◦ Compute k orthonormal (unit) vectors, i.e., principal components
◦ Each input data (vector) is a linear combination of the k principal component
vectors
◦ The principal components are sorted in order of decreasing “significance” or
strength
◦ Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)

Works for numeric data only

30
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
◦ Duplicate much or all of the information contained in one or more other
attributes
◦ E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
◦ Contain no information that is useful for the data mining task at hand
◦ E.g., students' ID is often irrelevant to the task of predicting students' GPA

31
Heuristic (啟發式) Search in Attribute Selection
There are 2d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
◦ Best single attribute under the attribute independence
assumption: choose by significance tests
◦ Best step-wise feature selection:
◦ The best single-attribute is picked first
◦ Then next best attribute condition to the first, ...

◦ Step-wise attribute elimination:

◦ Repeatedly eliminate the worst attribute

◦ Best combined attribute selection and elimination

◦ Optimal branch and bound:
◦ Use attribute elimination and backtracking

32
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
Three general methodologies
◦ Attribute extraction
◦ Domain-specific

◦ Mapping data to new space (see: data reduction)

◦ E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

◦ Attribute construction
◦ Combining features (see: discriminative frequent patterns in Chapter 7)
◦ Data discretization

33
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of
data representation
Parametric methods (e.g., regression)
◦ Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except
possible outliers)
◦ Ex.: Log-linear models—obtain value at a point in m-D space as
the product on appropriate marginal subspaces
Non-parametric methods
◦ Do not assume models
◦ Major families: histograms, clustering, sampling, …

34
Parametric Data Reduction: Regression and Log-Linear Models
Linear regression
◦ Data modeled to fit a straight line
◦ Often uses the least-square method to fit the line
Multiple regression
◦ Allows a response variable Y to be modeled as a linear function
of multidimensional feature vector
Log-linear model
◦ Approximates discrete multidimensional probability
distributions

35
Regression Analysis
Regression analysis: A collective name for y
techniques for the modeling and analysis of
numerical data consisting of values of a dependent Y1
variable (also called response variable or
measurement) and of one or more independent Y1’
variables (aka. explanatory variables or predictors)
y=x+1

The parameters are estimated so as to give a "best

fit" of the data X1 x

Most commonly the best fit is evaluated by using Used for prediction (including
the least squares method, but other criteria have forecasting of time-series data),
inference, hypothesis testing, and
also been used
modeling of causal relationships

36
Histogram Analysis
40
Divide data into buckets and store
average (sum) for each bucket 35
30
Partitioning rules:
◦ Equal-width: equal bucket range 25

◦ Equal-frequency (or equal-depth) 20

15
10
5
0
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000

37
Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only

Can be very effective if data is clustered but not if data is

“smeared (塗抹)”

Can have hierarchical clustering and be stored in multi-

dimensional index tree structures

There are many choices of clustering definitions and clustering

algorithms

Cluster analysis will be studied in depth in Chapter 10

38
Sampling
Sampling: obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear

to the size of the data

Key principle: Choose a representative subset of the data

◦ Simple random sampling may have very poor performance in the
presence of skew
◦ Develop adaptive sampling methods, e.g., stratified sampling:

Note: Sampling may not reduce database I/Os (page at a time)

39
Types of Sampling
Simple random sampling
◦ There is an equal probability of selecting any particular
item
Sampling without replacement
◦ Once an object is selected, it is removed from the
population
Sampling with replacement
◦ A selected object is not removed from the population
Stratified sampling:
◦ Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
◦ Used in conjunction with skewed data

40
Sampling: With or without Replacement

Raw Data

41
Sampling: Cluster or Stratified (分層) Sampling

Raw Data Cluster/Stratified Sample

42
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
◦ The aggregated data for an individual entity of interest
◦ E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
◦ Further reduce the size of data to deal with
Reference appropriate levels
◦ Use the smallest representation which is enough to solve the
task
Queries regarding aggregated information should be answered
using data cube, when possible

43
Data Reduction 3: Data Compression
String compression
◦ There are extensive theories and well-tuned algorithms
◦ Typically lossless, but only limited manipulation is possible
without expansion
Audio/video compression (e.g., JPEG/.jpg)
◦ Typically lossy compression, with progressive refinement
◦ Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Time sequence is not audio
◦ Typically short and vary slowly with time
Dimensionality and numerosity reduction may also be considered as
forms of data compression

44
Data Compression

Original Data Compressed

Data
lossless

Original Data
Approximated

45
Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

◦ Data Quality

◦ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

46
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement
values s.t. each old value can be identified with one of the new values
Methods
◦ Smoothing: Remove noise from data
◦ Attribute/feature construction
◦ New attributes constructed from the given ones
◦ Aggregation: Summarization, data cube construction
◦ Normalization: Scaled to fall within a smaller, specified range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
◦ Discretization: Concept hierarchy climbing

47
Normalization
Min-max normalization: to [new_minA, new_maxA]
v − minA
v' = (new _ maxA − new _ minA) + new _ minA
maxA − minA
◦ Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].
73,600 − 12,000
Then $73,000 is mapped to (1.0 − 0) + 0 = 0.716
98,000 − 12,000

Z-score normalization (μ: mean, σ: standard deviation):

v − A
v' =
 A

73,600 − 54,000
◦ Ex. Let μ = 54,000, σ = 16,000. Then = 1.225
16,000

Normalization by decimal scaling (e.g., 紙鈔幣值有如玩具鈔票！委內瑞拉貨幣砍6個零百萬紙鈔變成1塊錢)

v
v'= j Where j is the smallest integer such that Max(|ν’|) < 1
10 48
Discretization
Three types of attributes
◦ Nominal—values from an unordered set, e.g., color, profession
◦ Ordinal—values from an ordered set, e.g., military or academic rank
◦ Numeric—real numbers, e.g., integer or real numbers

Discretization: Divide the range of a continuous attribute into intervals

◦ Interval labels can then be used to replace actual data values
◦ Reduce data size by discretization
◦ Supervised vs. unsupervised
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepare for further analysis, e.g., classification

49
Data Discretization Methods
Typical methods: All the methods can be applied recursively
◦ Binning
◦ Top-down split, unsupervised

◦ Histogram analysis
◦ Top-down split, unsupervised

◦ Clustering analysis (unsupervised, top-down split or bottom-up

merge)
◦ Decision-tree analysis (supervised, top-down split)
◦ Correlation (e.g., 2) analysis (unsupervised, bottom-up merge)

50
Simple Discretization: Binning
Equal-width (distance) partitioning
◦ Divides the range into N intervals of equal size: uniform grid

◦ if A and B are the lowest and highest values of the attribute, the width of intervals will be: W =
(B –A)/N.

◦ The most straightforward, but outliers may dominate presentation

◦ Skewed data is not handled well

Equal-depth (frequency) partitioning

◦ Divides the range into N intervals, each containing approximately same number of samples

◦ Good data scaling

◦ Managing categorical attributes can be tricky

51
Discretization by Classification & Correlation Analysis
Classification (e.g., decision tree analysis)
◦ Supervised: Given class labels, e.g., cancerous vs. benign

◦ Using entropy to determine split point (discretization point)

◦ Top-down, recursive split

◦ Details to be covered in Chapter 7

Correlation analysis (e.g., Chi-merge: χ2-based discretization)

◦ Supervised: use class information

◦ Bottom-up merge: find the best neighboring intervals (those having similar distributions
of classes, i.e., low χ2 values) to merge

◦ Merge performed recursively, until a predefined stopping condition

52
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is
usually associated with each dimension in a data warehouse

Concept hierarchies facilitate drilling and rolling in data warehouses to view data
in multiple granularity

Concept hierarchy formation: Recursively reduce the data by collecting and

replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)

Concept hierarchies can be explicitly specified by domain experts and/or data

warehouse designers

Concept hierarchy can be automatically formed for both numeric and nominal
data. For numeric data, use discretization methods shown.

53
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the
schema level by users or experts
◦ street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
◦ {Urbana (厄巴納), Champaign (香檳), Chicago} < Illinois
Specification of only a partial set of attributes
◦ E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
◦ E.g., for a set of attributes: {street, city, state, country}

54
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis
of the number of distinct values per attribute in the data set
◦ The attribute with the most distinct values is placed at the lowest level
of the hierarchy
◦ Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

55
Summary
Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
◦ Entity identification problem
◦ Remove redundancies
◦ Detect inconsistencies
Data reduction
◦ Dimensionality reduction
◦ Numerosity reduction
◦ Data compression
Data transformation and data discretization
◦ Normalization
◦ Concept hierarchy generation

56
References
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-
78, 1999
A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model,
and algorithms. VLDB'01
M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on
Data Engineering, 20(4), Dec. 1997
H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective.
Kluwer Academic, 1998
J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and Transformation,
VLDB’2001
T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge
and Data Engineering, 7:623-640, 1995
57
Thanks for Your Attention
Q&A

Hadoop
77% (13)
Hadoop
65 pages
Xarios 350 Описание
No ratings yet
Xarios 350 Описание
2 pages
Concepts and Techniques: - Chapter 3
No ratings yet
Concepts and Techniques: - Chapter 3
63 pages
03 Preprocessing
No ratings yet
03 Preprocessing
63 pages
Concepts and Techniques: - Chapter 3
No ratings yet
Concepts and Techniques: - Chapter 3
64 pages
Module 2
No ratings yet
Module 2
62 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
66 pages
data mining 3
No ratings yet
data mining 3
57 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
61 pages
Data Mining: Dosen: Dr. Vitri Tundjungsari
No ratings yet
Data Mining: Dosen: Dr. Vitri Tundjungsari
64 pages
Concepts and Techniques: - Chapter 3
No ratings yet
Concepts and Techniques: - Chapter 3
63 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
63 pages
03 Preprocessing
No ratings yet
03 Preprocessing
64 pages
Data Preprocessing (DWDM MOD 2)
No ratings yet
Data Preprocessing (DWDM MOD 2)
62 pages
Lecture 3
No ratings yet
Lecture 3
47 pages
Lecture 2.3.1-2.3.3
No ratings yet
Lecture 2.3.1-2.3.3
67 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
50 pages
03Preprocessing_20160222
No ratings yet
03Preprocessing_20160222
65 pages
03 Pre Processing
No ratings yet
03 Pre Processing
89 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
56 pages
Unit 2 Data Preprocessing
No ratings yet
Unit 2 Data Preprocessing
40 pages
Wk6 Preprocessing
No ratings yet
Wk6 Preprocessing
64 pages
Data Pre Processing
No ratings yet
Data Pre Processing
63 pages
03 Preprocessing
No ratings yet
03 Preprocessing
54 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
62 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
52 pages
Lec7
No ratings yet
Lec7
45 pages
Module 5 03preprocessing
No ratings yet
Module 5 03preprocessing
63 pages
03Preprocessing
No ratings yet
03Preprocessing
65 pages
Unit2 Part2
No ratings yet
Unit2 Part2
67 pages
Chapter 3 - Tagged
No ratings yet
Chapter 3 - Tagged
63 pages
03 Preprocessing
No ratings yet
03 Preprocessing
63 pages
PPT1
No ratings yet
PPT1
93 pages
DM_merged
No ratings yet
DM_merged
169 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
54 pages
IT446 Wk03.2 HanKamberPei 03preprocessing PDF
No ratings yet
IT446 Wk03.2 HanKamberPei 03preprocessing PDF
64 pages
Chapter 3
No ratings yet
Chapter 3
56 pages
Chapter 3
No ratings yet
Chapter 3
63 pages
_03Preprocessing
No ratings yet
_03Preprocessing
60 pages
Mining
No ratings yet
Mining
63 pages
Data Preprocessing
No ratings yet
Data Preprocessing
63 pages
Chapter 3: Data Preprocessing
100% (1)
Chapter 3: Data Preprocessing
41 pages
2020 Preprocessing
No ratings yet
2020 Preprocessing
63 pages
03 Pre Processing
No ratings yet
03 Pre Processing
63 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
30 pages
Data Mining and Knowledge Discovery
No ratings yet
Data Mining and Knowledge Discovery
65 pages
03Preprocessing
No ratings yet
03Preprocessing
38 pages
Data_Preprocessing-1-19
No ratings yet
Data_Preprocessing-1-19
19 pages
CH 03-01 Data Preprocessing
No ratings yet
CH 03-01 Data Preprocessing
27 pages
03preprocessing DMDW
No ratings yet
03preprocessing DMDW
81 pages
03 Preprocessing
No ratings yet
03 Preprocessing
59 pages
Data Preprocessing (Sagar)
No ratings yet
Data Preprocessing (Sagar)
31 pages
P4-5 Preprocessing
No ratings yet
P4-5 Preprocessing
60 pages
Unit 3
No ratings yet
Unit 3
164 pages
03Preprocessing (2)
No ratings yet
03Preprocessing (2)
80 pages
Preprocessing Techniques
No ratings yet
Preprocessing Techniques
63 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
80 pages
' 3 IT326 - Ch2 - Pre-Processing
No ratings yet
' 3 IT326 - Ch2 - Pre-Processing
48 pages
Data Mining P5
No ratings yet
Data Mining P5
32 pages
IT Specialist: Data Analytics Certification Prep - 500 Exam Questions and Explanations
From Everand
IT Specialist: Data Analytics Certification Prep - 500 Exam Questions and Explanations
Steve Brown
No ratings yet
Illuminating Data: A hands on guide to data visualization in R
From Everand
Illuminating Data: A hands on guide to data visualization in R
Eman Ahmad
No ratings yet
Coa Presentation
No ratings yet
Coa Presentation
8 pages
Math A (I) - June2017
No ratings yet
Math A (I) - June2017
20 pages
Comp Rog
No ratings yet
Comp Rog
2 pages
How To Replace Timing Belt On Peugeot 307 2.0 HDi 2005-2007
No ratings yet
How To Replace Timing Belt On Peugeot 307 2.0 HDi 2005-2007
8 pages
Cambridge Primary Checkpoint English As A Second Language Specimen Paper 3 2017
100% (1)
Cambridge Primary Checkpoint English As A Second Language Specimen Paper 3 2017
8 pages
Interference Analysis: Modelling Radio Systems for Spectrum Management 1st Edition John Pahl pdf download
100% (4)
Interference Analysis: Modelling Radio Systems for Spectrum Management 1st Edition John Pahl pdf download
66 pages
LT Panel
No ratings yet
LT Panel
1 page
Be Winter 2022
No ratings yet
Be Winter 2022
1 page
Immediate download INFORMATION RETRIEVAL a biomedical and health perspective 4th Edition William Hersh ebooks 2024
100% (3)
Immediate download INFORMATION RETRIEVAL a biomedical and health perspective 4th Edition William Hersh ebooks 2024
55 pages
2X2.5 Cable PDF
No ratings yet
2X2.5 Cable PDF
1 page
Manual Use and Maintenance Gearbox Montanari ENG
No ratings yet
Manual Use and Maintenance Gearbox Montanari ENG
28 pages
Physical Database Design
No ratings yet
Physical Database Design
13 pages
BIT Paper Question
No ratings yet
BIT Paper Question
4 pages
Tarea Monitoring - Matias Figueredo
0% (1)
Tarea Monitoring - Matias Figueredo
3 pages
Project Name: Structural Design of Subways: Effective Flange Width, Beff
No ratings yet
Project Name: Structural Design of Subways: Effective Flange Width, Beff
1 page
X20 POWERLINK Kabel-ENG V2.20
No ratings yet
X20 POWERLINK Kabel-ENG V2.20
3 pages
ZOO212 - Session 2 - Graphs Powerpoint
No ratings yet
ZOO212 - Session 2 - Graphs Powerpoint
21 pages
("D C Motor") : Maharashtra State Board of Technical Education (Mumbai)
No ratings yet
("D C Motor") : Maharashtra State Board of Technical Education (Mumbai)
20 pages
Service Manual q5 v2 0
0% (1)
Service Manual q5 v2 0
52 pages
Bronkhorst Manual EL FLOW Select
No ratings yet
Bronkhorst Manual EL FLOW Select
51 pages
Quantum Differential Cryptanalysis To The Block Ciphers
No ratings yet
Quantum Differential Cryptanalysis To The Block Ciphers
11 pages
Dairy Technology: Vol.01: Milk and Milk Processing PDF: (Pub.20zwt) Get Download
100% (1)
Dairy Technology: Vol.01: Milk and Milk Processing PDF: (Pub.20zwt) Get Download
2 pages
Anna University, Chennai Non-Autonomous Affiliated Colleges Regulations 2021 Choice Based Credit System B.E. Computer Science and Engineering
No ratings yet
Anna University, Chennai Non-Autonomous Affiliated Colleges Regulations 2021 Choice Based Credit System B.E. Computer Science and Engineering
86 pages
JAF structure and JD (SW)
No ratings yet
JAF structure and JD (SW)
3 pages
Green House Monitoring and Control System Using Iot
No ratings yet
Green House Monitoring and Control System Using Iot
4 pages
Z-135-70 Operators Manual
No ratings yet
Z-135-70 Operators Manual
68 pages
Model Paper 1 MCQ
No ratings yet
Model Paper 1 MCQ
8 pages
Module 9. Digital Self
No ratings yet
Module 9. Digital Self
3 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Slide 05 Chapter3 Data Preprocessing

Uploaded by

Slide 05 Chapter3 Data Preprocessing

Uploaded by

Chapter 3.

Data Preprocessing: An Overview

◦ Major Tasks in Data Preprocessing

Data Transformation and Data Discretization

Measures for data quality: A multidimensional view

Data transformation and data discretization

Data Preprocessing: An Overview

◦ Major Tasks in Data Preprocessing

Data Transformation and Data Discretization

Data Preprocessing: An Overview

◦ Major Tasks in Data Preprocessing

Data Transformation and Data Discretization

Schema integration: e.g., A.cust-id  B.cust-#

Entity identification problem:

Detecting and resolving data value conflicts

ID Gender Book Type

We can use X2 to calculate the correlation between the two attributes.

Χ2 (chi-square) calculation (numbers in parenthesis are expected

If rA,B > 0, A and B are positively correlated (A’s values increase as

a'k = (ak − mean( A)) / std ( A)

correlation( A, B) = A'• B'

It can be simplified in computation as

Thus, A and B rise together since Cov(A, B) > 0.

Data Preprocessing: An Overview

◦ Major Tasks in Data Preprocessing

Data Transformation and Data Discretization

Dimensionality reduction techniques

Two Sine Waves Two Sine Waves + Noise Frequency

Works for numeric data only

◦ Step-wise attribute elimination:

◦ Best combined attribute selection and elimination

◦ Mapping data to new space (see: data reduction)

The parameters are estimated so as to give a "best

◦ Equal-frequency (or equal-depth) 20

Can be very effective if data is clustered but not if data is

Can have hierarchical clustering and be stored in multi-

There are many choices of clustering definitions and clustering

Cluster analysis will be studied in depth in Chapter 10

Allow a mining algorithm to run in complexity that is potentially sub-linear

Key principle: Choose a representative subset of the data

Note: Sampling may not reduce database I/Os (page at a time)

Raw Data Cluster/Stratified Sample

Original Data Compressed

Data Preprocessing: An Overview

◦ Major Tasks in Data Preprocessing

Data Transformation and Data Discretization

Z-score normalization (μ: mean, σ: standard deviation):

Normalization by decimal scaling (e.g., 紙鈔幣值有如玩具鈔票！委內瑞拉貨幣砍6個零百萬紙鈔變成1塊錢)

Discretization: Divide the range of a continuous attribute into intervals

◦ Clustering analysis (unsupervised, top-down split or bottom-up

◦ The most straightforward, but outliers may dominate presentation

◦ Skewed data is not handled well

Equal-depth (frequency) partitioning

◦ Good data scaling

◦ Managing categorical attributes can be tricky

◦ Using entropy to determine split point (discretization point)

◦ Top-down, recursive split

◦ Details to be covered in Chapter 7

Correlation analysis (e.g., Chi-merge: χ2-based discretization)

◦ Merge performed recursively, until a predefined stopping condition

Concept hierarchy formation: Recursively reduce the data by collecting and

Concept hierarchies can be explicitly specified by domain experts and/or data

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.