
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 3 —
© Jiawei Han, Micheline Kamber, and Jian Pei

Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing


◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary

Data Quality: Why Preprocess the Data?

◼ Measures for data quality: A multidimensional view


◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, …
◼ Timeliness: timely update?
◼ Believability: how much are the data trusted to be correct?
◼ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary
Data Cleaning
◼ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
◼ incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ intentional (e.g., disguised missing data)
◼ Jan. 1 entered as everyone's birthday?
Incomplete (Missing) Data

◼ Data is not always available


◼ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the time of
entry
◼ history or changes of the data were not registered
◼ Missing data may need to be inferred
How to Handle Missing Data?
◼ Ignore the tuple
◼ usually done when class label is missing (when doing classification)
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
◼ a global constant : e.g., “unknown”, a new class?!
◼ the attribute mean
◼ the attribute mean for all samples belonging to the same class:
smarter

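To make the automatic options concrete, here is a minimal pandas sketch (not from the slides; the toy frame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: 'income' has missing values; 'class' stands in for a class label
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 30_000, np.nan, 40_000],
})

# Fill with a global constant (effectively a new category)
df["income_const"] = df["income"].fillna(-1)

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of the same class (the "smarter" option)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(df)
```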
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments

◼ data entry problems

◼ data transmission problems

◼ technology limitation

◼ inconsistency in naming convention

◼ Other data problems which require data cleaning


◼ duplicate records

◼ incomplete data

◼ inconsistent data

How to Handle Noisy Data?

◼ Binning
◼ first sort data and partition into (equal-frequency) bins
◼ then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
◼ Regression
◼ smooth by fitting the data into regression functions
◼ Clustering
◼ detect and remove outliers
◼ Combined computer and human inspection
◼ detect suspicious values and check by human (e.g., deal with possible outliers)
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check uniqueness rule, consecutive rule, and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
◼ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real-world entity, attribute values from different sources are
different
◼ Possible reasons: different representations, different scales, e.g., metric vs.
British units
Data Integration Challenges

◼ Data quality issues: Incomplete, incorrect, or redundant data from different sources. Solution: implement data quality checks such as cleansing, deduplication, and normalization. Relevant tools: Informatica Data Quality, Talend Data Quality
◼ Data format and schema incompatibility: Different data formats and schemas across systems (e.g., JSON vs. CSV, different field names). Solution: use ETL tools to map fields, standardize formats, and reconcile schema differences. Relevant tools: Informatica, Talend, MuleSoft
◼ Data latency: Delays in data availability affect real-time analytics and decision-making. Solution: use data streaming solutions or micro-batching to reduce latency. Relevant tools: Apache Kafka, AWS Kinesis, Google Dataflow
◼ Data security and privacy: Risk of exposure when sharing sensitive data across systems. Solution: apply data encryption, access control, and data masking, and comply with data protection regulations. Relevant tools: Informatica, IBM InfoSphere, Talend
◼ Data consistency and synchronization: Inconsistent data across sources, leading to conflicting information. Solution: use master data management (MDM) to enforce consistency and change data capture (CDC) to synchronize changes in real time. Relevant tools: SAP Master Data Governance, Informatica, IBM InfoSphere
◼ Scalability issues: Large datasets slow down integration processes, impacting performance. Solution: use scalable cloud platforms and distributed processing frameworks. Relevant tools: Google BigQuery, Amazon Redshift, Apache Spark
◼ Heterogeneous data sources: Combining structured, semi-structured, and unstructured data is complex. Solution: use integration tools that support multiple data types and establish ETL workflows to transform data. Relevant tools: Talend, Informatica, MuleSoft
Handling Redundancy in Data Integration

◼ Redundant data often occur when integrating multiple databases
◼ Object identification: The same attribute or object may have different names in different databases
◼ Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue
◼ Redundant attributes may be detected by correlation analysis and covariance analysis
◼ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson's product-moment coefficient):

    r_{A,B} = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)
            = (Σ_{i=1..n} (a_i b_i) − n Ā B̄) / ((n − 1) σ_A σ_B)

    where n is the number of tuples, Ā and B̄ are the respective means of A and B,
    σ_A and σ_B are the respective standard deviations of A and B, and
    Σ (a_i b_i) is the sum of the AB cross-product.
◼ If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
◼ r_{A,B} = 0: no linear correlation (uncorrelated)
◼ r_{A,B} < 0: negatively correlated
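A short NumPy sketch of the formula above (illustrative data; np.corrcoef serves as a cross-check):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson product-moment correlation, following the slide's formula."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    num = np.sum((a - a.mean()) * (b - b.mean()))
    # ddof=1 gives the sample standard deviation, matching the (n - 1) term
    return num / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

a = [2, 4, 6, 8]  # made-up values
b = [1, 3, 5, 7]
print(pearson_r(a, b))          # 1.0: perfectly positively correlated
print(np.corrcoef(a, b)[0, 1])  # same result from NumPy directly
```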
Visually Evaluating Correlation

[Figure: scatter plots illustrating correlation values ranging from –1 to 1.]
Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but produces the same (or almost the same) analytical results
◼ Why data reduction?
A database/data warehouse may store terabytes of data. Complex data analysis
may take a very long time to run on the complete data set.
◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
◼ Dimensionality reduction techniques
◼ Principal Component Analysis (PCA)
◼ Supervised and nonlinear techniques (e.g., feature selection)

Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in data
◼ The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space

[Figure: data points in the original (x1, x2) space, with the principal component axes overlaid.]
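A minimal sketch of the eigenvector construction described above (assumes NumPy; the random data is purely illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(Xc, rowvar=False)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top_k                        # n x k reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2): 5 dimensions reduced to 2
```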
Attribute Subset Selection
◼ Another way to reduce dimensionality of data
◼ Redundant attributes
◼ Duplicate much or all of the information contained in one or more other attributes
◼ E.g., purchase price of a product and the amount of sales tax
paid
◼ Irrelevant attributes
◼ Contain no information that is useful for the data mining
task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting
students' GPA

Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of
data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
◼ Non-parametric methods
◼ Do not assume models

◼ Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models
◼ Linear regression: y = mx + c
◼ Data modeled to fit a straight line
◼ Often uses the least-squares method to fit the line
◼ Multiple regression: y = b0 + b1X1 + b2X2 + … + bnXn
◼ Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

[Figure: scatter plots of a dependent variable against an independent variable, with fitted regression lines.]
Regression Analysis

◼ Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (Y) (also called response variable or measurement) and of one or more independent variables (X) (also called explanatory variables or predictors)
◼ The parameters are estimated to give a "best fit" of the data
◼ Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
◼ Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

[Figure: data points with the fitted line y = x + 1; the vertical gap between an observed Y1 and its prediction Y1' is the residual.]
Regression Analysis Models
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
◼ Estimated using the least squares criterion on the known values of Y1, Y2, …, and X1, X2, …
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above
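As a sketch of the least-squares estimation (synthetic data; the column of ones supplies the intercept b):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)  # noisy y = 2x + 1

# Least-squares fit of Y = wX + b
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"w = {w:.2f}, b = {b:.2f}")  # close to the true 2 and 1

# For parametric data reduction, store only (w, b) and discard the raw points
```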
Histogram Analysis

◼ Divide data into buckets and store the average (or sum) for each bucket
◼ Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

[Figure: histogram with $10,000-wide buckets spanning 10,000–100,000.]
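A small NumPy sketch contrasting the two partitioning rules (it reuses the price list from the binning example later in this chapter):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: three buckets of equal range
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)  # uniform boundaries; counts may be very uneven

# Equal-frequency (equal-depth): boundaries at the 1/3 and 2/3 quantiles
print(np.quantile(prices, [0, 1/3, 2/3, 1]))  # ~4 values per bucket
```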
Clustering
◼ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and clustering
algorithms

Sampling

◼ Sampling: obtaining a small sample s to represent the whole data set N


◼ Allow a mining algorithm to run in complexity that is potentially sub-
linear to the size of the data
◼ Key principle: Choose a representative subset of the data
◼ Simple random sampling may have very poor performance in the
presence of skew
◼ Develop adaptive sampling methods, e.g., stratified sampling

Types of Sampling

◼ Simple random sampling
◼ There is an equal probability of selecting any item
◼ Sampling without replacement
◼ Once an object is selected, it is removed from the population
◼ Sampling with replacement
◼ A selected object is not removed from the population
◼ Stratified sampling
◼ Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
◼ Used in conjunction with skewed data
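A pandas sketch of the three schemes (the 90/10 toy split is made up to show how stratification preserves group proportions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"group": ["A"] * 90 + ["B"] * 10,
                   "value": rng.normal(size=100)})

srswor = df.sample(n=10, replace=False, random_state=0)  # without replacement
srswr  = df.sample(n=10, replace=True,  random_state=0)  # with replacement

# Stratified: the same 10% fraction drawn from each group
strat = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=0)
print(strat["group"].value_counts())  # 9 from A, 1 from B
```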
Sampling: With or without Replacement

[Figure: raw data sampled with and without replacement.]

Sampling: Cluster or Stratified Sampling

[Figure: raw data alongside the corresponding cluster/stratified sample.]
Data Cube Aggregation

◼ Data Cube
◼ A data cube is a multidimensional array or structure used to represent data in a way that
supports fast querying and analysis, typically in the context of Online Analytical
Processing (OLAP). It is designed to help with the aggregation, summarization, and
analysis of large datasets by organizing them across multiple dimensions. These
dimensions could represent different attributes or categories of the data, like time,
geography, products, etc. For example, Customer Location dimension, Product
dimension, and Time dimension.


Data Cube Aggregation

◼ The lowest level of a data cube (base cuboid)


◼ The aggregated data for an individual entity of interest
◼ E.g., a customer in the context of a phone calling data
warehouse
◼ Multiple levels of aggregation in data cubes
◼ Further reduce the size of data to deal with
◼ Reference appropriate levels
◼ Use the smallest representation which is enough to solve the
task
◼ Queries regarding aggregated information should be answered
using data cube, when possible

Data Cube Aggregation

[Figure: a sales data cube aggregated across dimensions such as time, product, and customer location.]
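A small pandas sketch of cube-style aggregation (toy sales table; the pivot table plays the role of a two-dimensional cuboid):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":   [2023, 2023, 2024, 2024],
    "region": ["East", "West", "East", "West"],
    "amount": [100, 150, 120, 180],
})

# Aggregate away the region dimension: one total per year
print(sales.groupby("year")["amount"].sum())

# A year x region cuboid of total sales
print(sales.pivot_table(index="year", columns="region",
                        values="amount", aggfunc="sum"))
```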


Data Reduction 3: Data Compression
◼ String compression
◼ There are extensive theories and well-tuned algorithms

◼ Typically lossless, but only limited manipulation is possible

without expansion
◼ Audio/video compression
◼ Typically, lossy compression, with progressive refinement

◼ Sometimes small fragments of signal can be reconstructed

without reconstructing the whole


◼ Dimensionality and numerosity reduction may also be considered
as forms of data compression

Data Compression

[Figure: lossless compression maps the original data to a compressed form and back exactly; lossy compression recovers only an approximation of the original data.]
Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing

Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]

    v' = ((v − min_A) / (max_A − min_A)) (new_max_A − new_min_A) + new_min_A

◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000))(1.0 − 0) + 0 = 0.716
◼ Z-score normalization (μ_A: mean, σ_A: standard deviation of A):

    v' = (v − μ_A) / σ_A

◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
◼ Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
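The three methods in a short NumPy sketch (the income array is a toy built around the slide's numbers):

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
print((income - income.min()) / (income.max() - income.min()))  # 73,600 -> ~0.716

# Z-score normalization, using the slide's mu = 54,000 and sigma = 16,000
print((73_600 - 54_000) / 16_000)  # 1.225

# Decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))  # j = 5 for this data
print(income / 10**j)  # all values fall in (-1, 1)
```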
Discretization
◼ Three types of attributes
◼ Nominal—values from an unordered set, e.g., color, profession
◼ Ordinal—values from an ordered set, e.g., military or academic rank
◼ Numeric—quantitative values, e.g., integer or real numbers
◼ Discretization: Divide the range of a continuous attribute into intervals
◼ Interval labels can then be used to replace actual data values
◼ Reduce data size by discretization

Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or bottom-
up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)

Simple Discretization: Binning

◼ Equal-width (distance) partitioning


◼ Divides the range into N intervals of equal size: uniform grid
◼ if A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B − A)/N
◼ The most straightforward, but outliers may dominate presentation
◼ Skewed data is not handled well

◼ Equal-depth (frequency) partitioning


◼ Divides the range into N intervals, each containing approximately same
number of samples
◼ Good data scaling
◼ Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

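The same smoothing, scripted as a NumPy sketch (the boundary rule assigns each value to whichever bin edge is closer; rounding the means matches the slide's values):

```python
import numpy as np

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = np.array_split(np.sort(prices), 3)   # equal-frequency bins of 4

for b in bins:
    means = np.full_like(b, round(b.mean()))  # smooth by bin means
    # Smooth by bin boundaries: snap each value to the nearer of min/max
    bounds = np.where(b - b[0] <= b[-1] - b, b[0], b[-1])
    print(list(b), "means:", list(means), "boundaries:", list(bounds))
```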
Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary
Summary
◼ Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
◼ Data cleaning: e.g., missing/noisy values, outliers
◼ Data integration from multiple sources:
◼ Entity identification problem

◼ Remove redundancies

◼ Detect inconsistencies

◼ Data reduction
◼ Dimensionality reduction

◼ Numerosity reduction

◼ Data compression

◼ Data transformation and data discretization


◼ Normalization

