Chapter 2
Data Pre-Processing
Jnaneshwar Bohara
Objects
• An attribute is a property of an object, also known as a characteristic, dimension, or feature
• A collection of attributes describes an object
• An object is also known as a record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes
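In code, an object (record) can be represented as a mapping from attribute names to values. A minimal sketch, with attribute names modeled on the table above:

```python
# A record (object) is described by a collection of attribute-value pairs.
record = {
    "Tid": 4,
    "Refund": "Yes",
    "Marital Status": "Married",
    "Taxable Income": 120_000,
    "Cheat": "No",
}

# Attribute values are numbers or symbols assigned to each attribute.
for attribute, value in record.items():
    print(f"{attribute}: {value}")
```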
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for a particular object

Important Characteristics of Data
• Sparsity
  • Only presence counts
• Resolution
  • Patterns depend on the scale
• Size
  • The type of analysis may depend on the size of the data
Document Data
• Each document becomes a vector of term counts:

             timeout  season  coach  game  score  play  team  win  ball  lost
Document 1      3       0       5     0      2     6     0     2    0     2
Document 2      0       7       0     2      1     0     0     3    0     0
Document 3      0       1       0     0      1     2     2     0    3     0
Chapter 2 Data Pre-Processing | BoharaG 13
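A document-term table like the one above can be built by counting term occurrences over a fixed vocabulary. A minimal sketch (the sample document is invented for illustration):

```python
from collections import Counter

terms = ["timeout", "season", "coach", "game", "score",
         "play", "team", "win", "ball", "lost"]

def term_vector(document, vocabulary):
    """Turn a document into a term-frequency vector over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

doc = "coach called a timeout and the team played to win the game"
print(term_vector(doc, terms))
```

A real implementation would also handle stemming ("played" vs. "play") and punctuation; this sketch counts exact word matches only.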
Transaction Data
• A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items.
• Can represent transaction data as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
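Representing transaction data as record data typically means one binary attribute per item. A sketch using the transactions above:

```python
# One-hot (binary) record representation of the transaction data above.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# The full item set, sorted so every record uses the same column order.
items = sorted(set().union(*transactions.values()))

def as_record(tid):
    """One binary attribute per item: 1 if the transaction contains it."""
    return [1 if item in transactions[tid] else 0 for item in items]

for tid in transactions:
    print(tid, as_record(tid))
```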
Graph Data
• Examples: Generic graph, a molecule, and webpages
[Figure: a generic graph with numbered nodes]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Duplicate Data
• Examples:
  • Same person with multiple email addresses
• Data cleaning
  • The process of dealing with duplicate-data issues
• r = 2: Euclidean distance
• Do not confuse r with n, the number of dimensions: all of these distances are defined for any number of dimensions.
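The r above is the order of the Minkowski distance, of which Euclidean distance is the r = 2 case. A minimal sketch:

```python
# Minkowski distance of order r. Note that r is the norm parameter,
# while n (the number of dimensions) is just len(x) -- they are independent.
def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0, 0), (1, 2, 2)
print(minkowski(x, y, 1))  # r = 1: Manhattan distance -> 5.0
print(minkowski(x, y, 2))  # r = 2: Euclidean distance -> 3.0
```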
Sampling
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time-consuming.
• Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time-consuming.
• Using a sample will work almost as well as using the entire data set if the sample is representative.
  • A sample is representative if it has approximately the same properties (of interest) as the original set of data.
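A simple random sample being representative can be checked empirically: a property of interest (here, the mean) of the sample should be close to that of the population. A sketch with synthetic data:

```python
import random

random.seed(0)

# Hypothetical population of 100,000 measurements.
population = [random.gauss(50, 10) for _ in range(100_000)]

# A simple random sample of 1% of the data.
sample = random.sample(population, k=1_000)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(pop_mean, sample_mean)  # the two means should be close
```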
• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Principal Component Analysis (PCA)
[Figure: data points in the (x1, x2) plane, with the first principal component along the direction of greatest variance]
Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (the principal components) that best represent the data
• Normalize the input data so that each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., the principal components
• Each input data vector is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance; using the strongest principal components, it is possible to reconstruct a good approximation of the original data
• Works for numeric data only
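The steps above can be sketched with NumPy via the eigenvectors of the covariance matrix (one common way to compute PCA; SVD is another):

```python
import numpy as np

def pca(X, k):
    """Project an N x n data matrix X onto its top-k principal components."""
    X = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(X, rowvar=False)           # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    components = eigvecs[:, order[:k]]      # keep the k strongest components
    return X @ components                   # each row is now a k-dim combination

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The first projected column carries the most variance, the second the next most, which is what lets the weak components be dropped.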
Major Tasks in Data Pre-Processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
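The normalization item under data transformation usually means rescaling attributes, commonly min-max normalization or z-score standardization. A minimal sketch of both (income values invented for illustration):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [60_000, 75_000, 85_000, 90_000, 120_000]
print(min_max(incomes))  # smallest value maps to 0.0, largest to 1.0
```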
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., Occupation = “ ” (missing data)
• Noisy: containing noise, errors, or outliers
  • e.g., Salary = “−10” (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
  • Age = “42”, Birthday = “03/07/2010”
  • Was rating “1, 2, 3”, now rating “A, B, C”
  • Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
  • Jan. 1 as everyone’s birthday?
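A first cleaning pass often just flags the kinds of problems listed above. A sketch over hypothetical records modeled on the slide's examples (field names and the current-year check are assumptions for illustration):

```python
# Hypothetical dirty records modeled on the examples above.
records = [
    {"occupation": "", "salary": -10, "age": 42, "birth_year": 2010},
    {"occupation": "teacher", "salary": 52_000, "age": 15, "birth_year": 2010},
]

def problems(rec, current_year=2025):
    """Flag incomplete, noisy, and inconsistent values in one record."""
    issues = []
    if not rec["occupation"].strip():
        issues.append("missing occupation")          # incomplete
    if rec["salary"] < 0:
        issues.append("noisy salary")                # noisy / error
    if abs((current_year - rec["birth_year"]) - rec["age"]) > 1:
        issues.append("inconsistent age/birthday")   # inconsistent
    return issues

for rec in records:
    print(problems(rec))
```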
Data Reduction Strategies
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? A database or data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
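The sparsity effect above can be demonstrated empirically: as dimensionality grows, the spread between the nearest and farthest points shrinks relative to the distances themselves, so "near" and "far" lose meaning. A small sketch:

```python
import random

random.seed(1)

def relative_contrast(dim, n_points=200):
    """(max - min) / min distance-to-origin over random points in [0,1]^dim.

    Large in low dimensions; shrinks as distances concentrate in high
    dimensions -- one face of the curse of dimensionality.
    """
    dists = []
    for _ in range(n_points):
        p = [random.random() for _ in range(dim)]
        dists.append(sum(c * c for c in p) ** 0.5)
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100):
    print(dim, relative_contrast(dim))
```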
OLTP vs. OLAP
• OLTP systems solved a critical business problem: automating daily business functions and running real-time reports and analysis.
• OLAP data is typically historical, consolidated, and multi-dimensional (e.g., product, time, location).
• OLAP involves lots of full database scans, across terabytes or more of data.
• A data warehouse is based on a multidimensional data model which views data in the form
of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
From Tables and Spreadsheets to
Data Cubes
• In the data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
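The lattice of cuboids can be enumerated directly: every subset of the n dimensions defines one cuboid, so (ignoring concept hierarchies) there are 2^n cuboids in total. A minimal sketch with the four dimensions from the figure:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Every subset of the dimensions is one cuboid: the empty subset is the
# 0-D apex cuboid ("all"), the full set is the n-D base cuboid.
cuboids = [
    combo
    for k in range(len(dimensions) + 1)
    for combo in combinations(dimensions, k)
]

print(len(cuboids))  # 2^4 = 16 cuboids in the lattice
print(cuboids[0])    # () -> the apex cuboid
```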
[Figure: lattice of cuboids, from the 0-D apex cuboid (“all”) down through 3-D cuboids such as (time, item, location), (time, item, supplier), (time, location, supplier), and (item, location, supplier)]
[Figure: schema with a sales fact table (measures: units_sold, dollars_sold, avg_sales) keyed by branch_key to a branch dimension table (branch_name, branch_type) and by location_key to a location dimension table (street, city_key), which in turn references a city table (city, state_or_province, country)]
Example of Fact Constellation
[Figure: fact constellation — a Sales Fact Table and a Shipping Fact Table sharing dimension tables such as time (time_key, day, day_of_the_week, month, quarter, year) and item (item_key, item_name, brand, type, supplier_type); the shipping fact table also carries shipper_key and from_location]
A Sample Data Cube
[Figure: a 3-D data cube of sales with dimensions Date (1Qtr–4Qtr), product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico), with sum aggregates along each dimension; e.g., the sum cell for TV and U.S.A. over all quarters gives the total annual sales of TVs in the U.S.A.]