Introduction To Data Cleaning
References
No single reference!
Data Quality: Concepts, Methodologies and Techniques, C. Batini
and M. Scannapieco, Springer-Verlag, 2006 (Chapters 1, 2, and 4)
Slides “Data Quality and Data Cleansing” course, Felix Naumann,
Winter 2014/15
Foundations of Data Quality Management, W. Fan and F. Geerts,
2012
Oliveira, P. (2009). Detecção e correcção de problemas de
qualidade de dados: Modelo, Sintaxe e Semântica. PhD thesis, U.
do Minho.
So far…
Example (1)
Table R Table S
Example (2)
<country>
<name> United States of America </name>
<cities> New York, Los Angeles, Chicago </cities>
<lakes>
<name> Lake Michigan </name>
</lakes>
</country>
and
<country>
United States
<city> New York </city>
<city> Los Angeles </city>
<lakes>
<lake> Lake Michigan </lake>
</lakes>
</country>
are the same object?
Example (3)
P. Bernstein, D. Chiu: Using Semi-Joins to Solve
Relational Queries. JACM 28(1): 25-40(1981)
The three examples refer to the same problem that
is known under different names:
approximate duplicate detection
record linkage
entity resolution
merge-purge
data matching …
Outline
Why Data Cleaning?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
e.g., occupation=“”
noisy: containing errors (spelling, phonetic and typing errors,
word transpositions, multiple values in a single free-form
field) or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
(synonyms and nicknames, prefix and suffix variations,
abbreviations, truncation and initials)
e.g., Age=“42” Birthday=“03/07/1997”
e.g., was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between approximate duplicate records
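The three kinds of dirtiness above can be flagged mechanically on a single record. The following is only a minimal sketch: the field names, the DD/MM/YYYY date format, and the reference date passed as `as_of` are assumptions made for illustration, not part of the slides.

```python
from datetime import date, datetime

def find_problems(record, as_of):
    """Return a list of (kind, message) pairs for one record (a dict of strings)."""
    problems = []

    # Incomplete: an attribute of interest has no value.
    if not record.get("occupation", "").strip():
        problems.append(("incomplete", "occupation is empty"))

    # Noisy: the value lies outside any plausible range.
    if float(record["salary"]) < 0:
        problems.append(("noisy", "salary is negative"))

    # Inconsistent: two attributes of the same record contradict each other.
    born = datetime.strptime(record["birthday"], "%d/%m/%Y").date()
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    age_from_birthday = as_of.year - born.year - (0 if had_birthday else 1)
    if int(record["age"]) != age_from_birthday:
        problems.append(("inconsistent", "age does not match birthday"))

    return problems

# Hypothetical record combining the examples from the slide.
record = {"occupation": "", "salary": "-10", "age": "42", "birthday": "03/07/1997"}
problems = find_problems(record, as_of=date(2015, 1, 1))
```

With the assumed reference date, all three checks fire for this record, one per kind of problem.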
Impact of Data Quality Problems
Incorrect prices in inventory retail
databases [English 1999]
Costs for consumers: $2.5 billion
80% of barcode scan errors are to the
disadvantage of the consumer
IRS 1992: almost 100,000 tax refunds not
deliverable [English 1999]
50% to 80% of computerized criminal
records in the U.S. were found to be
inaccurate, incomplete, or ambiguous.
[Strong et al. 1997a]
US Postal Service: of 100,000 mass mailings,
up to 7,000 are undeliverable due to
incorrect addresses [Pierce 2004]
Application contexts
Integrate data from different sources
E.g., populating a data warehouse (DW) from different operational data stores, or a
mediator-based architecture
Why is Data Cleaning Important?
Activity of converting source data into target data without
errors, duplicates, and inconsistencies, i.e.,
Cleaning and Transforming to get…
High-quality data!
Quality
179 Dimensions
Felix Naumann | Data Profiling and Data Cleansing | Winter 2014/15
Accuracy
Closeness between a value v and a value v', considered as the
correct representation of the real-world phenomenon that v
aims to represent.
Ex: for a person named John, v' = John is correct, v = Jhn is incorrect
Metrics for quantifying accuracy
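One common way to quantify accuracy is syntactic accuracy: the distance between a value v and the closest value of a reference domain. The following is a minimal sketch, assuming Levenshtein edit distance as the comparison function and a hypothetical domain of valid names; neither choice is taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def closest_domain_value(v, domain):
    """Return the domain value with the smallest edit distance to v."""
    return min(domain, key=lambda d: edit_distance(v, d))

names = ["John", "Jack", "Mary"]  # hypothetical reference domain
closest = closest_domain_value("Jhn", names)
distance = edit_distance("Jhn", closest)
```

A distance of 0 means v is syntactically accurate with respect to the domain; for v = Jhn the closest value is John, one edit away.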
Completeness
The extent to which data are of sufficient
breadth, depth, and scope for the task at
hand.
Three types:
Schema completeness: degree to which concepts
and their properties are not missing from the
schema
Column completeness: evaluates the missing
values for a specific property or column in a table.
Population completeness: evaluates missing
values with respect to a reference population
Completeness of relational data
The completeness of a table characterizes the extent to
which the table represents the real world.
Can be characterized with respect to:
The presence/absence and meaning of null values
Example: In Person(name, surname, birthdate, email), a null
email may indicate that the person has no email (no
incompleteness), that the email exists but is not known (incompleteness), or that it is
not known whether the person has an email (incompleteness may not be
the case)
Validity of open world assumption (OWA) or closed world
assumption (CWA)
OWA: assumes that in addition to missing values, some tuples
representing real-world entities may also be missing
CWA: assumes the database has collected all the tuples
representing real-world entities, but the values of some attributes
in those tuples are possibly missing
Metrics for quantifying completeness (2)
Model with null values with CWA: specific
definitions for different granularities:
Values: to capture the presence of null values for
some fields of a tuple
Tuple: to characterize the completeness of a tuple
wrt the values of all its fields:
Evaluates the % of specified values in the tuple wrt the
total number of attributes of the tuple itself
Example: Student(stID, name, surname, vote,
examdate)
Equal to 1 for (6754, Mike, Collins, 29, 7/17/2004)
Equal to 0.8 for (6578, Julliane, Merrals, NULL, 7/17/2004)
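The tuple completeness metric can be sketched in a few lines of Python. This is an illustration only; representing NULL as `None` and a tuple as a Python tuple are assumptions of the sketch.

```python
def tuple_completeness(values):
    """Fraction of the attributes of a tuple whose value is specified (not None)."""
    if not values:
        raise ValueError("a tuple must have at least one attribute")
    return sum(v is not None for v in values) / len(values)

# The two Student tuples from the example, with NULL written as None.
complete = tuple_completeness((6754, "Mike", "Collins", 29, "7/17/2004"))
partial = tuple_completeness((6578, "Julliane", "Merrals", None, "7/17/2004"))
```

The second tuple has 4 specified values out of 5 attributes, which gives the 0.8 of the example.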
Time-related dimensions
Currency: concerns how promptly data are updated
Example: if the residential address of a person is updated (it
corresponds to the address where the person lives) then the
currency is high
Consistency
Other dimensions
Interpretability: concerns the documentation and
metadata that are available to correctly interpret
the meaning and properties of data sources
Synchronization between different time series:
concerns proper integration of data having
different time stamps.
Accessibility: measures the ability of the user to
access the data from his/her own culture,
physical status/functions, and technologies
available.
Value level
Missing value: value not filled in for a NOT NULL attribute
Ex: birth date = ‘’
Syntax violation: value does not satisfy the syntax
rule defined for the attribute
Ex: zip code = 27655-175; syntactical rule: xxxx-xxx
Spelling error
Ex: city = ‘Lsboa’, instead of ‘Lisboa’
Domain violation: value does not belong to the
valid domain set
Ex: age = 240; valid domain for age: [0, 120]
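Most value-level problems can be detected by looking at one value at a time. The sketch below checks three of them. The field names are hypothetical, the syntax rule xxxx-xxx is read as "four digits, a hyphen, three digits", and the age domain is taken to be the range 0 to 120; spelling errors are left out because they need a dictionary of valid values.

```python
import re

# Assumed reading of the syntactic rule xxxx-xxx: each x is a digit.
ZIP_PATTERN = re.compile(r"\d{4}-\d{3}")

def check_value_level(record):
    """Return the value-level problems found in one record (a dict)."""
    problems = []
    # Missing value: a mandatory attribute is empty.
    if record.get("birth_date") in (None, ""):
        problems.append("missing value: birth_date")
    # Syntax violation: the value does not match the attribute's pattern.
    if not ZIP_PATTERN.fullmatch(record.get("zip_code") or ""):
        problems.append("syntax violation: zip_code")
    # Domain violation: the value lies outside the valid domain.
    age = record.get("age")
    if age is not None and not 0 <= age <= 120:
        problems.append("domain violation: age")
    return problems

dirty = check_value_level({"birth_date": "", "zip_code": "27655-175", "age": 240})
clean = check_value_level({"birth_date": "03/07/1997", "zip_code": "2765-175", "age": 42})
```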
Relation level
Heterogeneous data representations: different ways of
representing the same real world entity
Ex: name = ‘John Smith’; name = ‘Smith, John’
Functional dependency violation
Ex: (2765-175, ‘Estoril’) and (2765-175,
‘Oeiras’)
Existence of approximate duplicates
Ex: (1, André Fialho, 12634268) and (2,
André Pereira Fialho, 12634268)
Integrity constraint violation
Ex: the sum of salaries exceeds the established maximum
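Unlike value-level problems, relation-level problems only show up when several tuples are compared. As an illustration, a functional dependency violation such as the zip code example can be found by grouping tuples on the left-hand side of the dependency. This is a minimal sketch: the attribute names `zip` and `city` and the third row are assumptions.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs.

    Returns {lhs value: set of rhs values} for every lhs value
    that is associated with more than one rhs value.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

rows = [
    {"zip": "2765-175", "city": "Estoril"},
    {"zip": "2765-175", "city": "Oeiras"},
    {"zip": "1000-001", "city": "Lisboa"},  # hypothetical extra row
]
violations = fd_violations(rows, "zip", "city")
```

Only the zip code that maps to two different cities is reported; which of the two cities is correct cannot be decided from the data alone.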
Data quality auditing
Constituted by:
Data profiling – analysing data sources to identify data
quality problems
Data analysis – statistical evaluation, logical study and
application of data mining algorithms to define data
patterns and rules
Main goals:
To obtain a definition of the data: metadata collection
To check for violations of the metadata definition
To detect other data quality problems that belong to a given
taxonomy
To supply recommendations concerning the data cleaning
task
Data Profiling
Data source discovery
Metadata
Schema discovery
Schema matching and mapping
Profiling for metadata (keys, foreign keys, data types, …)
Data discovery
Column-level: Null-values, domains, patterns, value
distributions / histograms
Table-level: Data mining, rules
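Column-level profiling can be illustrated with a short Python sketch that collects null counts, value patterns, and a value histogram for one column. The pattern abstraction (digits become 9, letters become A) and the sample zip codes are assumptions chosen for illustration, not a description of any particular profiling tool.

```python
import re
from collections import Counter

def value_pattern(value):
    """Abstract a string into a pattern: digits become 9, letters become A."""
    pattern = re.sub(r"\d", "9", value)
    return re.sub(r"[^\W\d_]", "A", pattern)

def profile_column(values):
    """Collect simple column-level statistics for a list of string values."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "patterns": Counter(value_pattern(v) for v in non_null),
        "histogram": Counter(non_null),
    }

zip_codes = ["2765-175", "1000-001", None, "27655-175", "2765-175"]  # hypothetical column
profile = profile_column(zip_codes)
```

A rare pattern, such as the single value with five leading digits here, is a typical hint of a syntax violation worth inspecting.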
Typical techniques used in data
quality auditing
Dictionaries of words: so that attribute values are compared
with one or more dictionaries of the domain
Ex: WordNet
Algorithms to detect functional dependencies and their
violations
Ex: Locality => Postal code
Algorithms to detect duplicates
String matching for string fields
Character-based
Token-based
Phonetic algorithms
Record matching
Rule-based
Probabilistic
...
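Two of the string matching families listed above can be sketched briefly: a token-based measure (Jaccard similarity over word tokens) and a phonetic algorithm (American Soundex). Both are common representatives chosen for illustration; the slides do not prescribe specific algorithms, and character-based measures such as edit distance are omitted here.

```python
import re

def jaccard(a, b):
    """Token-based similarity: shared word tokens divided by all word tokens."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

SOUNDEX_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
SOUNDEX_CODES = {c: d for d, letters in SOUNDEX_GROUPS.items() for c in letters}

def soundex(name):
    """American Soundex: first letter plus three digits for the following consonants."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "hw":  # h and w do not separate letters with the same code
            previous = digit
    return (code + "000")[:4]

same_tokens = jaccard("John Smith", "Smith, John")
same_sound = soundex("britney") == soundex("brittany")
```

Token-based measures are insensitive to word order, which suits the 'John Smith' versus 'Smith, John' case, while phonetic codes group spellings that sound alike.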
Typical techniques used in data
cleaning and transformation
Dictionaries of words
Data Cleaning Tasks
1. Extraction from sources
Technical and syntactic obstacles
2. Transformation
Schematic obstacles
3. Standardization
Syntactic and semantic obstacles
4. Duplicate detection
Similarity functions
Algorithms
5. Data fusion / consolidation
Semantic obstacles
6. Loading into warehouse / presenting to user
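Steps 3 to 5 of the task list can be sketched end to end on a toy source. Everything in this example is an assumption made for illustration: the record layout, the deliberately naive similarity function (exact match on the standardized name), and the fusion rule (first non-empty value wins).

```python
import re
from collections import defaultdict

def standardize(record):
    """Step 3: one representation per value (lowercase, 'surname, name' -> 'name surname')."""
    name = record["name"].strip().lower()
    if "," in name:
        surname, given = [part.strip() for part in name.split(",", 1)]
        name = f"{given} {surname}"
    return {**record, "name": re.sub(r"\s+", " ", name)}

def detect_duplicates(records):
    """Step 4: group records whose standardized names are identical."""
    groups = defaultdict(list)
    for record in records:
        groups[record["name"]].append(record)
    return list(groups.values())

def fuse(group):
    """Step 5: build one record per group, keeping the first non-empty value per attribute."""
    fused = {}
    for record in group:
        for key, value in record.items():
            if value and not fused.get(key):
                fused[key] = value
    return fused

source = [
    {"name": "John Smith", "email": ""},
    {"name": "Smith, John", "email": "john@example.org"},
    {"name": "Mary Jones", "email": "mary@example.org"},
]
cleaned = [fuse(g) for g in detect_duplicates([standardize(r) for r in source])]
```

In practice, step 4 would use an approximate similarity function rather than exact equality, and step 5 would need an explicit policy for conflicting values.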
Existing technology for ensuring
data quality
Ad-hoc programs written in a programming language like
C or Java or using an RDBMS proprietary language
Programs difficult to optimize and maintain
Criteria for comparing commercial
data quality tools (2)
Profiling:
Rules: A rule is a piece of business logic that defines conditions
applied to data. Rules are used to validate the data and to
measure data quality
Filters: A filter is used to split the data tuples into different
groups. Each group should be validated by a different set of
rules.
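The interplay of rules and filters can be sketched as follows. This is not the API of any of the commercial tools; the function name, the country-based filter, and the zip code rules are hypothetical.

```python
import re

def measure_quality(tuples, filter_key, rules_by_group):
    """Split tuples into groups with filter_key, validate each tuple against
    the rules of its group, and return the fraction of tuples that pass."""
    if not tuples:
        return 1.0
    passed = 0
    for t in tuples:
        rules = rules_by_group.get(filter_key(t), [])
        if all(rule(t) for rule in rules):
            passed += 1
    return passed / len(tuples)

# Hypothetical rules: each country group has its own zip code format.
rules_by_group = {
    "PT": [lambda t: re.fullmatch(r"\d{4}-\d{3}", t["zip"]) is not None],
    "US": [lambda t: re.fullmatch(r"\d{5}", t["zip"]) is not None],
}
customers = [
    {"country": "PT", "zip": "2765-175"},
    {"country": "PT", "zip": "27655-175"},
    {"country": "US", "zip": "10001"},
    {"country": "US", "zip": "1000"},
]
quality = measure_quality(customers, lambda t: t["country"], rules_by_group)
```

Here the filter is the country attribute, so Portuguese and US tuples are validated by different rule sets, and the returned fraction serves as a simple quality measure.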
Commercial Data Cleaning Tools (2014) (1/3)
                                 Debugger                                Profiling       Execution
Tools                            Data lineage  Breakpoints  Edit values  Rules  Filters  User involvement  Incremental updates
Informatica PowerCenter          Y             Y            Y            Y      Y        N                 Y
IBM Information Server           Y             Y            N            Y      Y        N                 Y
Oracle Data Integrator           Y             Y            N            Y      Y        N                 Y
SQL Server Integration Services  Y             Y            N            Y      N        N                 Y
SAS Data Integration Studio      Y             N            N            Y      Y        N                 Y
Pentaho Data Integration         N             N            N            Y      N        N                 Y
Clover ETL                       N             N            Y            Y      Y        N                 Y
Commercial Data Cleaning Tools (2014) (2/3)
       Extensibility                       User Interface
Tools  Create Operators  Modify Operators  Drag and Drop  Graphical Editor
Commercial Data Cleaning Tools (2014) (3/3)
                         Scalability                                Others
Tools                    Grid  Partitioning  Pushdown Optimization  Free version
Informatica PowerCenter  Y     Y             Y                      Y
Cleenex QCs N N N Y
Llunatic EGDs N Y N N
Scare N Y N Y N
Eracer N Y N Y N
Continuous data cleaning FDs N Y Y N
Criteria for comparing research
data cleaning tools (1)
Detection:
Constraints – use of rules and/or conditions
EGDs - equality generating dependencies
QCs - quality constraints
CFDs - Conditional functional dependencies
MDs - Matching dependencies
Statistical – dirty tuples are detected based on simple
statistics or on complex data analysis
Criteria for comparing research
data cleaning tools (3)
User Interface:
Graphical interface: the system provides a visualizing tool and
menus to interact
User edition: the system allows the user to edit data values
Others:
Scalability: the system execution time grows linearly with the
number of input tuples
Streaming: the system receives tuples and processes each of
them individually (as opposed to batch processing)
Extensible: the system allows the user to modify and/or insert
new algorithms
Scare N N N N Y
Eracer N N N N N
Death by Typo
Google searches for Britney Spears
488941 britney spears 29 britent spears 9 brinttany spears 5 brney spears 3 britiy spears 2 brirreny spears
40134 brittany spears 29 brittnany spears 9 britanay spears 5 broitney spears 3 britmeny spears 2 brirtany spears
36315 brittney spears 29 britttany spears 9 britinany spears 5 brotny spears 3 britneeey spears 2 brirttany spears
24342 britany spears 29 btiney spears 9 britn spears 5 bruteny spears 3 britnehy spears 2 brirttney spears
7331 britny spears 26 birttney spears 9 britnew spears 5 btiyney spears 3 britnely spears 2 britain spears
6633 briteny spears 26 breitney spears 9 britneyn spears 5 btrittney spears 3 britnesy spears 2 britane spears
2696 britteny spears 26 brinity spears 9 britrney spears 5 gritney spears 3 britnetty spears 2 britaneny spears
1807 briney spears 26 britenay spears 9 brtiny spears 5 spritney spears 3 britnex spears 2 britania spears
1635 brittny spears 26 britneyt spears 9 brtittney spears 4 bittny spears 3 britneyxxx spears 2 britann spears
1479 brintey spears 26 brittan spears 9 brtny spears 4 bnritney spears 3 britnity spears 2 britanna spears
1479 britanny spears 26 brittne spears 9 brytny spears 4 brandy spears 3 britntey spears 2 britannie spears
1338 britiny spears 26 btittany spears 9 rbitney spears 4 brbritney spears 3 britnyey spears 2 britannt spears
1211 britnet spears 24 beitney spears 8 birtiny spears 4 breatiny spears 3 britterny spears 2 britannu spears
1096 britiney spears 24 birteny spears 8 bithney spears 4 breetney spears 3 brittneey spears 2 britanyl spears
991 britaney spears 24 brightney spears 8 brattany spears 4 bretiney spears 3 brittnney spears 2 britanyt spears
991 britnay spears 24 brintiny spears 8 breitny spears 4 brfitney spears 3 brittnyey spears 2 briteeny spears
811 brithney spears 24 britanty spears 8 breteny spears 4 briattany spears 3 brityen spears 2 britenany spears
811 brtiney spears 24 britenny spears 8 brightny spears 4 brieteny spears 3 briytney spears 2 britenet spears
664 birtney spears 24 britini spears 8 brintay spears 4 briety spears 3 brltney spears 2 briteniy spears
664 brintney spears 24 britnwy spears 8 brinttey spears 4 briitny spears 3 broteny spears 2 britenys spears
664 briteney spears 24 brittni spears 8 briotney spears 4 briittany spears 3 brtaney spears 2 britianey spears
601 bitney spears 24 brittnie spears 8 britanys spears 4 brinie spears 3 brtiiany spears 2 britin spears
601 brinty spears 21 biritney spears 8 britley spears 4 brinteney spears 3 brtinay spears 2 britinary spears
544 brittaney spears 21 birtany spears 8 britneyb spears 4 brintne spears 3 brtinney spears 2 britmy spears
544 brittnay spears 21 biteny spears 8 britnrey spears 4 britaby spears 3 brtitany spears 2 britnaney spears
364 britey spears 21 bratney spears 8 britnty spears 4 britaey spears 3 brtiteny spears 2 britnat spears
364 brittiny spears 21 britani spears 8 brittner spears 4 britainey spears 3 brtnet spears 2 britnbey spears
329 brtney spears 21 britanie spears 8 brottany spears 4 britinie spears 3 brytiny spears 2 britndy spears
269 bretney spears 21 briteany spears 7 baritney spears 4 britinney spears 3 btney spears 2 britneh spears
269 britneys spears 21 brittay spears 7 birntey spears 4 britmney spears 3 drittney spears 2 britneney spears
244 britne spears 21 brittinay spears 7 biteney spears 4 britnear spears 3 pretney spears 2 britney6 spears
244 brytney spears 21 brtany spears 7 bitiny spears 4 britnel spears 3 rbritney spears 2 britneye spears
220 breatney spears 21 brtiany spears 7 breateny spears 4 britneuy spears 2 barittany spears 2 britneyh spears
220 britiany spears 19 birney spears 7 brianty spears 4 britnewy spears 2 bbbritney spears 2 britneym spears
199 britnney spears 19 brirtney spears 7 brintye spears 4 britnmey spears 2 bbitney spears 2 britneyyy spears
163 britnry spears 19 britnaey spears 7 britianny spears 4 brittaby spears 2 bbritny spears 2 britnhey spears
147 breatny spears 19 britnee spears 7 britly spears 4 brittery spears 2 bbrittany spears 2 britnjey spears
147 brittiney spears 19 britony spears 7 britnej spears 4 britthey spears 2 beitany spears 2 britnne spears
147 britty spears 19 brittanty spears 7 britneyu spears 4 brittnaey spears 2 beitny spears 2 britnu spears
147 brotney spears 19 britttney spears 7 britniey spears 4 brittnat spears 2 bertney spears 2 britoney spears
Source: http://www.google.com/jobs/britney.html
Direct marketing by The
Economist
FIFA registration form (2010)
German Umlaute
Next lecture
Data Matching