Chapter 2
Data Pre-Processing
Jnaneshwar Bohara
Objects
• An attribute is a property of an object, also known as a characteristic, dimension, or feature
• A collection of attributes describes an object
• An object is also known as a record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes
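In code, an object (record) can be represented as a mapping from attribute names to values. A minimal sketch, with attribute names modeled on the table above:

```python
# A record (object) is described by a collection of attribute-value pairs.
record = {
    "Tid": 4,
    "Refund": "Yes",
    "Marital Status": "Married",
    "Taxable Income": 120_000,
    "Cheat": "No",
}

# Attribute values are numbers or symbols assigned to each attribute.
for attribute, value in record.items():
    print(f"{attribute}: {value}")
```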
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for a particular object

Important Characteristics of Data
• Sparsity
  • Only presence counts
• Resolution
  • Patterns depend on the scale
• Size
  • The type of analysis may depend on the size of the data
Document Data
• Each document becomes a vector of term counts:

             timeout  season  coach  game  score  play  team  win  ball  lost
Document 1      3       0       5     0      2     6     0     2    0     2
Document 2      0       7       0     2      1     0     0     3    0     0
Document 3      0       1       0     0      1     2     2     0    3     0
Chapter 2 Data Pre-Processing | BoharaG 13
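A document-term table like the one above can be built by counting term occurrences over a fixed vocabulary. A minimal sketch (the sample document is invented for illustration):

```python
from collections import Counter

terms = ["timeout", "season", "coach", "game", "score",
         "play", "team", "win", "ball", "lost"]

def term_vector(document, vocabulary):
    """Turn a document into a term-frequency vector over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

doc = "coach called a timeout and the team played to win the game"
print(term_vector(doc, terms))
```

A real implementation would also handle stemming ("played" vs. "play") and punctuation; this sketch counts exact word matches only.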
Transaction Data
• A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items.
• Can represent transaction data as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
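Representing transaction data as record data typically means one binary attribute per item. A sketch using the transactions above:

```python
# One-hot (binary) record representation of the transaction data above.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# The full item set, sorted so every record uses the same column order.
items = sorted(set().union(*transactions.values()))

def as_record(tid):
    """One binary attribute per item: 1 if the transaction contains it."""
    return [1 if item in transactions[tid] else 0 for item in items]

for tid in transactions:
    print(tid, as_record(tid))
```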
Graph Data
• Examples: Generic graph, a molecule, and webpages
[Figure: a generic graph with numbered nodes]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Duplicate Data
• Examples:
  • Same person with multiple email addresses
• Data cleaning
  • The process of dealing with duplicate-data issues
• r = 2: Euclidean distance
• Do not confuse r with n, the number of dimensions: all of these distances are defined for any number of dimensions.
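The r above is the order of the Minkowski distance, of which Euclidean distance is the r = 2 case. A minimal sketch:

```python
# Minkowski distance of order r. Note that r is the norm parameter,
# while n (the number of dimensions) is just len(x) -- they are independent.
def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0, 0), (1, 2, 2)
print(minkowski(x, y, 1))  # r = 1: Manhattan distance -> 5.0
print(minkowski(x, y, 2))  # r = 2: Euclidean distance -> 3.0
```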
Sampling
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time-consuming.
• Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time-consuming.
• Using a sample will work almost as well as using the entire data set if the sample is representative.
  • A sample is representative if it has approximately the same properties (of interest) as the original set of data.
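A simple random sample being representative can be checked empirically: a property of interest (here, the mean) of the sample should be close to that of the population. A sketch with synthetic data:

```python
import random

random.seed(0)

# Hypothetical population of 100,000 measurements.
population = [random.gauss(50, 10) for _ in range(100_000)]

# A simple random sample of 1% of the data.
sample = random.sample(population, k=1_000)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(pop_mean, sample_mean)  # the two means should be close
```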
• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Principal Component Analysis (PCA)
[Figure: data points in the (x1, x2) plane, with the first principal component along the direction of greatest variance]
Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (the principal components) that best represent the data
• Normalize the input data so that each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., the principal components
• Each input data vector is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance; using the strongest principal components, it is possible to reconstruct a good approximation of the original data
• Works for numeric data only
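The steps above can be sketched with NumPy via the eigenvectors of the covariance matrix (one common way to compute PCA; SVD is another):

```python
import numpy as np

def pca(X, k):
    """Project an N x n data matrix X onto its top-k principal components."""
    X = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(X, rowvar=False)           # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    components = eigvecs[:, order[:k]]      # keep the k strongest components
    return X @ components                   # each row is now a k-dim combination

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The first projected column carries the most variance, the second the next most, which is what lets the weak components be dropped.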
Major Tasks in Data Pre-Processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
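The normalization item under data transformation usually means rescaling attributes, commonly min-max normalization or z-score standardization. A minimal sketch of both (income values invented for illustration):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [60_000, 75_000, 85_000, 90_000, 120_000]
print(min_max(incomes))  # smallest value maps to 0.0, largest to 1.0
```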
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., Occupation = “ ” (missing data)
• Noisy: containing noise, errors, or outliers
  • e.g., Salary = “−10” (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
  • Age = “42”, Birthday = “03/07/2010”
  • Was rating “1, 2, 3”, now rating “A, B, C”
  • Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
  • Jan. 1 as everyone’s birthday?
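A first cleaning pass often just flags the kinds of problems listed above. A sketch over hypothetical records modeled on the slide's examples (field names and the current-year check are assumptions for illustration):

```python
# Hypothetical dirty records modeled on the examples above.
records = [
    {"occupation": "", "salary": -10, "age": 42, "birth_year": 2010},
    {"occupation": "teacher", "salary": 52_000, "age": 15, "birth_year": 2010},
]

def problems(rec, current_year=2025):
    """Flag incomplete, noisy, and inconsistent values in one record."""
    issues = []
    if not rec["occupation"].strip():
        issues.append("missing occupation")          # incomplete
    if rec["salary"] < 0:
        issues.append("noisy salary")                # noisy / error
    if abs((current_year - rec["birth_year"]) - rec["age"]) > 1:
        issues.append("inconsistent age/birthday")   # inconsistent
    return issues

for rec in records:
    print(problems(rec))
```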
Data Reduction Strategies
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? A database or data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
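The sparsity effect above can be demonstrated empirically: as dimensionality grows, the spread between the nearest and farthest points shrinks relative to the distances themselves, so "near" and "far" lose meaning. A small sketch:

```python
import random

random.seed(1)

def relative_contrast(dim, n_points=200):
    """(max - min) / min distance-to-origin over random points in [0,1]^dim.

    Large in low dimensions; shrinks as distances concentrate in high
    dimensions -- one face of the curse of dimensionality.
    """
    dists = []
    for _ in range(n_points):
        p = [random.random() for _ in range(dim)]
        dists.append(sum(c * c for c in p) ** 0.5)
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100):
    print(dim, relative_contrast(dim))
```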
OLTP vs. OLAP
• OLTP systems solved a critical business problem: automating daily business functions and running real-time reports and analysis.
• OLAP data is typically historical, consolidated, and multi-dimensional (e.g., product, time, location).
• OLAP involves lots of full database scans, across terabytes or more of data.
• A data warehouse is based on a multidimensional data model which views data in the form
of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
From Tables and Spreadsheets to
Data Cubes
• In the data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
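The lattice of cuboids can be enumerated directly: every subset of the n dimensions defines one cuboid, so (ignoring concept hierarchies) there are 2^n cuboids in total. A minimal sketch with the four dimensions from the figure:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Every subset of the dimensions is one cuboid: the empty subset is the
# 0-D apex cuboid ("all"), the full set is the n-D base cuboid.
cuboids = [
    combo
    for k in range(len(dimensions) + 1)
    for combo in combinations(dimensions, k)
]

print(len(cuboids))  # 2^4 = 16 cuboids in the lattice
print(cuboids[0])    # () -> the apex cuboid
```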
[Figure: lattice of cuboids, from the 0-D apex cuboid (“all”) down through 3-D cuboids such as (time, item, location), (time, item, supplier), (time, location, supplier), and (item, location, supplier)]
[Figure: schema with a sales fact table (measures: units_sold, dollars_sold, avg_sales) keyed by branch_key to a branch dimension table (branch_name, branch_type) and by location_key to a location dimension table (street, city_key), which in turn references a city table (city, state_or_province, country)]
Example of Fact Constellation
[Figure: fact constellation — a Sales Fact Table and a Shipping Fact Table sharing dimension tables such as time (time_key, day, day_of_the_week, month, quarter, year) and item (item_key, item_name, brand, type, supplier_type); the shipping fact table also carries shipper_key and from_location]
A Sample Data Cube
[Figure: a 3-D data cube of sales with dimensions Date (1Qtr–4Qtr), product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico), with sum aggregates along each dimension; e.g., the sum cell for TV and U.S.A. over all quarters gives the total annual sales of TVs in the U.S.A.]