DWDM LS3 Fall 24 25
— Chapter 3 —
© Jiawei Han, Micheline Kamber, and Jian Pei
1
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
2
Data Quality: Why Preprocess the Data?
3
Major Tasks in Data Preprocessing
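◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files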
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
5
Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., due to
faulty instruments, human or computer error, or transmission error
◼ incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ intentional: e.g., disguised missing data
◼ Jan. 1 as everyone’s birthday?
7
Incomplete (Missing) Data
9
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ technology limitation
◼ incomplete data
◼ inconsistent data
10
How to Handle Noisy Data?
◼ Binning
◼ first sort data and partition into (equal-frequency) bins
◼ Clustering
◼ detect and remove outliers
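◼ Combined computer and human inspection
◼ detect suspicious values and have a human check them (e.g., deal with
possible outliers)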
11
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency,
distribution)
◼ Check uniqueness rule, consecutive rule and null rule
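The three rules named above can be sketched as small checks. This is only an illustration under stated assumptions: the sample values, the function names, and the set of null markers are hypothetical, and the consecutive rule is read here as "no gaps between the lowest and highest integer value".

```python
def violates_unique_rule(values):
    """Return the set of values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return duplicates


def missing_for_consecutive_rule(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Hypothetical null markers; a real null rule is defined per data set.
NULL_MARKERS = {"", "?", "N/A", None}


def null_positions(values):
    """Return the indices whose value matches one of the null markers."""
    return [i for i, v in enumerate(values) if v in NULL_MARKERS]


check_numbers = [101, 102, 104, 104, 106]   # made-up check numbers
occupations = ["engineer", "", "teacher", "?"]

print(violates_unique_rule(check_numbers))          # values that repeat
print(missing_for_consecutive_rule(check_numbers))  # gaps in the sequence
print(null_positions(occupations))                  # positions of null markers
```

Each function only reports candidates; deciding whether a reported value is a real error is left to a later cleaning step.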
12
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real-world entity, attribute values from different sources are
different
◼ Possible reasons: different representations, different scales, e.g., metric vs.
British units
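A minimal sketch of schema integration and value-conflict resolution, assuming two hypothetical sources: source A keys customers by "cust-id" and stores weight in kilograms, source B keys them by "cust-#" and stores weight in pounds. The shared field names and the sample records are invented for illustration.

```python
# Hypothetical records from two sources: A uses "cust-id" and kilograms,
# B uses "cust-#" and pounds.
source_a = [{"cust-id": 1, "weight_kg": 70.0}]
source_b = [{"cust-#": 2, "weight_lb": 154.0}]

LB_TO_KG = 0.45359237  # exact definition of the international pound


def from_a(record):
    """Map a source A record onto the shared schema."""
    return {"customer_id": record["cust-id"], "weight_kg": record["weight_kg"]}


def from_b(record):
    """Map a source B record onto the shared schema, converting pounds to kg."""
    return {
        "customer_id": record["cust-#"],
        "weight_kg": round(record["weight_lb"] * LB_TO_KG, 2),
    }


# One coherent store with a single key name and a single unit.
integrated = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
print(integrated)
```

Entity identification (deciding that two records describe the same real-world entity) is not covered by this sketch; it only aligns field names and units.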
14
Data Integration Challenges
Challenge | Description | Solution | Relevant Tools
Data Quality Issues | Incomplete, incorrect, or redundant data from different sources. | Implement data quality checks like cleansing, deduplication, and normalization. | Informatica Data Quality, Talend Data Quality
Data Format and Schema Incompatibility | Different data formats and schemas across systems (e.g., JSON vs. CSV, different field names). | Use ETL tools to map fields, standardize formats, and reconcile schema differences. | Informatica, Talend, MuleSoft
Data Latency | Delays in data availability affect real-time analytics and decision-making. | Use data streaming solutions or micro-batching to reduce latency. | Apache Kafka, AWS Kinesis, Google Dataflow
Data Security and Privacy | Risk of exposure when sharing sensitive data across systems. | Apply data encryption, access control, data masking, and comply with data protection regulations. | Informatica, IBM InfoSphere, Talend
Data Consistency and Synchronization | Inconsistent data across sources, leading to conflicting information. | Use MDM to enforce data consistency and CDC to synchronize changes in real-time. | SAP Master Data Governance, Informatica, IBM InfoSphere
Scalability Issues | Large datasets slow down integration processes, impacting performance. | Use scalable cloud platforms and distributed processing frameworks. | Google BigQuery, Amazon Redshift, Apache Spark
Heterogeneous Data Sources | Combining structured, semi-structured, and unstructured data is complex. | Use integration tools that support multiple data types and establish ETL workflows to transform data. | Talend, Informatica, MuleSoft
Handling Redundancy in Data Integration
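◼ Correlation coefficient (Pearson's product-moment coefficient) for numeric attributes A and B:
\[ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \, \sigma_A \sigma_B} \]
where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and
B, σA and σB are the respective standard deviations of A and B (computed with divisor n),
and Σ(a_i b_i) is the sum of the AB cross-product.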
◼ If rA,B > 0, A and B are positively correlated (A's values increase
as B's do). The higher the value, the stronger the correlation.
◼ rA,B = 0: no linear correlation (this alone does not establish independence)
◼ rA,B < 0: negatively correlated
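A short sketch of computing the correlation coefficient from the quantities named above (n, the two means, the two standard deviations, and the sum of the cross-product). It assumes population standard deviations (divisor n); the function name and sample data are made up for illustration.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n).
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)


a = [1, 2, 3, 4, 5]
print(pearson_r(a, [2, 4, 6, 8, 10]))   # close to 1: positively correlated
print(pearson_r(a, [10, 8, 6, 4, 2]))   # close to -1: negatively correlated
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0, although b = a**2
```

The last call is a reminder that a coefficient of 0 only rules out a linear relationship: the second attribute there is fully determined by the first.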
17
Visually Evaluating Correlation
Scatter plots showing correlation coefficients ranging from –1 to 1.
19
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but produces the same (or almost the same) analytical results
◼ Why data reduction?
A database/data warehouse may store terabytes of data. Complex data analysis
may take a very long time to run on the complete data set.
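◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes (PCA, attribute subset selection)
◼ Numerosity reduction (regression, histograms, clustering, sampling, data cube aggregation)
◼ Data compression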
21
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
◼ Dimensionality reduction techniques
◼ Principal Component Analysis (PCA)
◼ Supervised and nonlinear techniques (e.g., feature selection)
22
Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in data
◼ The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space
[Figure: data points plotted on axes x1 and x2]
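For two attributes x1 and x2, the eigenvector step can be sketched without any library, because a 2x2 symmetric covariance matrix has a closed-form largest eigenvalue. This is an illustrative sketch only: the function name and sample points are assumptions, and real PCA on many attributes would normally use a linear algebra library.

```python
import math


def pca_2d(points):
    """Return (mean, first principal axis, projected 1-D coordinates)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Matching eigenvector; fall back to an axis when sxy is zero.
    if sxy != 0:
        vx, vy = sxy, lam - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    # Project the centered points onto the principal axis (2-D -> 1-D).
    projected = [(p[0] - mx) * axis[0] + (p[1] - my) * axis[1] for p in points]
    return (mx, my), axis, projected


points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # made-up points on the line x2 = x1
mean, axis, projected = pca_2d(points)
print(axis)       # direction of largest variance
print(projected)  # 1-D coordinates along that direction
```

Keeping only the projected coordinate replaces two attributes with one, which is the dimensionality reduction the slide describes.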
24
Attribute Subset Selection
◼ Another way to reduce dimensionality of data
◼ Redundant attributes
◼ Duplicate much or all the information contained in one or
more other attributes
◼ E.g., purchase price of a product and the amount of sales tax
paid
◼ Irrelevant attributes
◼ Contain no information that is useful for the data mining
task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting
students' GPA
25
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of
data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
◼ Non-parametric methods
◼ Do not assume models
26
Parametric Data Reduction: Regression and Log-Linear Models
◼ Linear regression: y = mx + c
◼ Data modeled to fit a straight line
◼ Often uses the least-squares method to fit the line
◼ Multiple regression: y = b0 + b1 X1 + b2 X2 + ... + bn Xn
◼ Allows a response variable Y to be modeled as a linear function of a
multidimensional feature vector
[Figures: two plots of a dependent variable against an independent variable]
27
Regression Analysis
[Figure: regression plot; vertical axis labeled y, with a marked value Y1]
28
Regression Analysis Models
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Apply the least squares criterion to the known values Y1, Y2, …, X1, X2, …
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above
Follow Slide 27
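A minimal sketch of estimating the two coefficients w and b of Y = w X + b with the least squares criterion. The data below are made up (generated from y = 2x + 1), and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the model Y = w X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - w * mean_x
    return w, b


# Made-up data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(w, b)
```

For numerosity reduction, only the pair (w, b) would be stored in place of the original (x, y) values.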
29
Histogram Analysis
[Figure: histogram; vertical axis from 0 to 40 in steps of 5, horizontal axis from 10000 to 100000 in steps of 10000]
30
Clustering
◼ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and clustering
algorithms
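A small sketch of storing only a cluster representation. It assumes the clusters have already been found by some clustering algorithm, that points are 2-D, and that "diameter" means the largest distance between any two points in a cluster; the cluster names and coordinates are invented.

```python
import math


def summarize_cluster(points):
    """Return (centroid, diameter) for a list of 2-D points."""
    n = len(points)
    centroid = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    # Diameter here means the largest pairwise distance in the cluster.
    diameter = max(
        (math.dist(p, q) for p in points for q in points), default=0.0
    )
    return centroid, diameter


# Hypothetical clusters, assumed to come from an earlier clustering step.
clusters = {
    "c1": [(0, 0), (0, 2), (2, 0), (2, 2)],
    "c2": [(10, 10), (11, 10)],
}
summaries = {name: summarize_cluster(pts) for name, pts in clusters.items()}
print(summaries)  # two numbers per cluster replace all of its points
```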
31
Sampling
32
Types of Sampling
◼ Sampling without replacement
◼ Once an object is selected, it is removed
from the population
◼ Sampling with replacement
◼ A selected object is not removed from the
population
◼ Stratified sampling:
◼ Partition the data set, and draw samples
from each partition (proportionally, i.e.,
approximately the same percentage of the
data)
◼ Used in conjunction with skewed data
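The sampling types can be sketched with Python's standard `random` module. The population, the strata labels, and the 10% fraction are assumptions chosen for illustration; `stratified_sample` is a hypothetical helper, not a library function.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))

# Without replacement: no object can be drawn twice.
srswor = rng.sample(population, k=10)

# With replacement: the same object may be drawn again.
srswr = rng.choices(population, k=10)


def stratified_sample(records, key, fraction, rng):
    """Draw about the same fraction from every stratum defined by key."""
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data: 90 "young" records and 10 "senior" records.
records = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
print(srswor, srswr, picked, sep="\n")
```

With skewed data, a simple random sample of 10 records could easily miss the small "senior" group; the stratified draw keeps it represented.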
33
Sampling: With or without Replacement
Raw Data
34
Sampling: Cluster or Stratified Sampling
35
Data Cube Aggregation
◼ Data Cube
◼ A data cube is a multidimensional array or structure that represents data in a way that
supports fast querying and analysis, typically in the context of Online Analytical
Processing (OLAP).
◼ It helps with the aggregation, summarization, and analysis of large datasets by
organizing them across multiple dimensions.
◼ Dimensions represent different attributes or categories of the data, such as time,
geography, or product; for example, a Customer Location dimension, a Product
dimension, and a Time dimension.
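A minimal sketch of aggregating over the three example dimensions. The fact rows, the sales measure, and the `aggregate` helper are hypothetical; a real OLAP system would precompute and index these aggregates rather than scan the rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (location, product, quarter, sales amount).
facts = [
    ("North", "phone", "Q1", 100),
    ("North", "phone", "Q2", 150),
    ("North", "laptop", "Q1", 300),
    ("South", "phone", "Q1", 80),
    ("South", "laptop", "Q2", 200),
]
DIMENSIONS = ("location", "product", "quarter")


def aggregate(facts, group_by):
    """Sum the sales measure over every dimension not listed in group_by."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        cell = tuple(row[i] for i in idx)  # coordinates of the target cell
        totals[cell] += row[-1]
    return dict(totals)


print(aggregate(facts, ["location"]))            # totals per location
print(aggregate(facts, ["product", "quarter"]))  # totals per product and quarter
print(aggregate(facts, []))                      # grand total
```

Each call returns a smaller summary of the same data, which is how cube aggregation reduces the volume an analysis has to touch.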
36
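37
Data Reduction 3: Data Compression
◼ String compression
◼ Typically lossless, but only limited manipulation is possible
without expansion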
◼ Audio/video compression
◼ Typically, lossy compression, with progressive refinement
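To contrast lossless and lossy compression, here is a small sketch. The lossless part uses Python's standard `zlib` module; the lossy part is only a stand-in (rounding made-up readings to one decimal place) and is not how audio or video codecs actually work.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
text = b"data preprocessing " * 50
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(len(text), len(packed), restored == text)

# Lossy (illustrative only): keep one decimal place of each reading.
signal = [0.123, 0.456, 0.789, 1.012]
approx = [round(x, 1) for x in signal]
print(approx)  # close to the original values, but not identical
```

The lossy version cannot be turned back into the original readings, which is the trade-off accepted in exchange for a smaller representation.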
39
Data Compression
[Figure: Original Data compared with its Approximated version]
40
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a new set of
replacement values such that each old value can be identified with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
42
Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - \min_A}{\max_A - \min_A} (\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
◼ Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} (1.0 - 0) + 0 = 0.716 \]
◼ Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
◼ Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
◼ Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1
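The three normalization methods can be sketched as follows, reusing the income figures from the examples above. The function names are hypothetical, the values -986 and 917 are made up for the decimal scaling call, and decimal scaling assumes a non-negative j.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Z-score normalization of v."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
print(decimal_scaling([-986, 917]))               # j = 3 for these values
```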
43
Discretization
◼ Three types of attributes
◼ Nominal: values from an unordered set, e.g., color, profession
◼ Ordinal: values from an ordered set, e.g., military or academic rank
◼ Numeric: quantitative values, e.g., integer or real numbers
◼ Discretization: Divide the range of a continuous attribute into intervals
◼ Interval labels can then be used to replace actual data values
◼ Reduce data size by discretization
44
Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or bottom-
up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
45
Simple Discretization: Binning
47
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
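The worked example above can be reproduced with a short sketch. It assumes the number of values divides evenly into the bins, that bin means are rounded to the nearest integer, and that a value exactly halfway between the two boundaries goes to the lower one; the function names are hypothetical.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```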
48
Summary
◼ Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
◼ Data cleaning: e.g., missing/noisy values, outliers
◼ Data integration from multiple sources:
◼ Entity identification problem
◼ Remove redundancies
◼ Detect inconsistencies
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
50