Introduction to Data Cleaning

Helena Galhardas

DEI/IST

References
 No single reference!
 C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques, Springer-Verlag, 2006 (Chapters 1, 2, and 4)
 F. Naumann, slides of the course “Data Quality and Data Cleansing”, Winter 2014/15
 W. Fan and F. Geerts, Foundations of Data Quality Management, 2012
 P. Oliveira, Detecção e correcção de problemas de qualidade de dados: Modelo, Sintaxe e Semântica, PhD thesis, U. do Minho, 2009

So far…

 We’ve studied how to perform string matching efficiently and effectively
 We’ve seen how string matching is important in data integration
 Now, we’ll see how string matching is important in data cleaning

Example (1)

Table R
Name           SSN           Addr
Jack Lemmon    430-871-8294  Maple St
Harrison Ford  292-918-2913  Culver Blvd
Tom Hanks      234-762-1234  Main St
…              …             …

Table S
Name           SSN           Addr
Ton Hanks      234-162-1234  Main Street
Kevin Spacey   -             Frost Blvd
Jack Lemon     430-817-8294  Maple Street
…              …             …

 Find records from different datasets that could be the same entity
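
The matching task in this example can be sketched with a character-based similarity from Python's standard library. This is only an illustration under stated assumptions, not the method taught in the course: the scoring rule (unweighted mean of the per-field similarities), the 0.8 threshold and the function names are all hypothetical choices.

```python
from difflib import SequenceMatcher

# Records from the example tables R and S: (name, ssn, address).
table_r = [
    ("Jack Lemmon", "430-871-8294", "Maple St"),
    ("Harrison Ford", "292-918-2913", "Culver Blvd"),
    ("Tom Hanks", "234-762-1234", "Main St"),
]
table_s = [
    ("Ton Hanks", "234-162-1234", "Main Street"),
    ("Kevin Spacey", "-", "Frost Blvd"),
    ("Jack Lemon", "430-817-8294", "Maple Street"),
]


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] (difflib ratio, not edit distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r, s) -> float:
    """Unweighted mean of the per-field similarities (an assumed scoring rule)."""
    return sum(similarity(x, y) for x, y in zip(r, s)) / len(r)


THRESHOLD = 0.8  # assumed cut-off; in practice it has to be tuned

# Compare every record of R with every record of S (quadratic, fine for a toy example).
candidate_pairs = [
    (r[0], s[0])
    for r in table_r
    for s in table_s
    if record_similarity(r, s) >= THRESHOLD
]
print(candidate_pairs)
```

With these inputs the two pairs that differ only by typos and abbreviations are reported as candidate duplicates, while Harrison Ford and Kevin Spacey are left unmatched.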

Example (2)
<country>
  <name> United States of America </name>
  <cities> New York, Los Angeles, Chicago </cities>
  <lakes>
    <name> Lake Michigan </name>
  </lakes>
</country>

and

<country>
  United States
  <city> New York </city>
  <city> Los Angeles </city>
  <lakes>
    <lake> Lake Michigan </lake>
  </lakes>
</country>

are the same object?

Example (3)
P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)

Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

 These two bibliographic references concern the same publication!

The three examples refer to the same problem, which is known under different names:
 approximate duplicate detection
 record linkage
 entity resolution
 merge-purge
 data matching …

It is one of the data quality problems addressed by data cleaning

Outline

 Introduction to data cleaning

 Application contexts of data cleaning
 Data quality dimensions
 Taxonomy of data quality problems
 Data quality process
 Main data quality tools
 Real-world examples

Why Data Cleaning?
Data in the real world is dirty:
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   e.g., occupation=“”
 noisy: containing errors (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field) or outliers
   e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
   e.g., Age=“42” Birthday=“03/07/1997”
   e.g., was rating “1,2,3”, now rating “A, B, C”
   e.g., discrepancy between approximate duplicate records

Data Quality Problems (Dirty Data)

CUST
CNr   Name        Birthday  Age  Sex  Phone      ZIP
1234  Costa, Rui  18.2.80   37   m    999999999  1000
1234  Ana Costa   32.2.70   37   m    965432123  55555
1235  Rui Costa   18.2.80   27   m    963124568  1000

ADDRESS
ZIP   Place
1000  Lisboa
1000  Lsiboa
1024  Portugal

Problems annotated in the figure: representation, contradictions, referential integrity, uniqueness, missing values, duplicates, typos, incorrect values
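
Some of the problems in the CUST and ADDRESS tables can be detected mechanically. The sketch below is one possible set of checks, written with the Python standard library; the assumed rules (CNr is a key, birthdays follow day.month.year, every customer ZIP must appear in ADDRESS, one place per ZIP) are inferred from the figure and are not stated on the slide.

```python
from collections import Counter, defaultdict
from datetime import datetime

cust = [  # (CNr, Name, Birthday, Age, Sex, Phone, ZIP)
    (1234, "Costa, Rui", "18.2.80", 37, "m", "999999999", "1000"),
    (1234, "Ana Costa", "32.2.70", 37, "m", "965432123", "55555"),
    (1235, "Rui Costa", "18.2.80", 27, "m", "963124568", "1000"),
]
address = [("1000", "Lisboa"), ("1000", "Lsiboa"), ("1024", "Portugal")]

# Uniqueness: CNr is assumed to identify a customer, so no value may repeat.
cnr_counts = Counter(row[0] for row in cust)
duplicate_keys = [cnr for cnr, n in cnr_counts.items() if n > 1]


def is_valid_date(text: str) -> bool:
    """True if text parses as day.month.two-digit-year."""
    try:
        datetime.strptime(text, "%d.%m.%y")
        return True
    except ValueError:
        return False


# Incorrect values: birthdays that are not real calendar dates.
invalid_birthdays = [row[2] for row in cust if not is_valid_date(row[2])]

# Referential integrity: customer ZIP codes that do not exist in ADDRESS.
known_zips = {zip_code for zip_code, _ in address}
dangling_zips = [row[6] for row in cust if row[6] not in known_zips]

# Contradictions / typos: the same ZIP code mapped to more than one place.
places_by_zip = defaultdict(set)
for zip_code, place in address:
    places_by_zip[zip_code].add(place)
conflicting_zips = {z: p for z, p in places_by_zip.items() if len(p) > 1}

print(duplicate_keys, invalid_birthdays, dangling_zips, conflicting_zips)
```

Checks like these only flag suspicious rows; deciding which of two conflicting values is the right one still needs reference data or a human.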

Impact of Data Quality Problems
 Incorrect prices in retail inventory databases [English 1999]
   Cost consumers $2.5 billion
   80% of barcode scan errors are to the disadvantage of the consumer
 IRS 1992: almost 100,000 tax refunds not deliverable [English 1999]
 50% to 80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous [Strong et al. 1997a]
 US Postal Service: of 100,000 mass mailings, up to 7,000 are undeliverable due to incorrect addresses [Pierce 2004]

Why Is Data Dirty?

 Incomplete data comes from:
   data values not available when the data was collected
   different criteria between the time when the data was collected and the time when it is analyzed
   human/hardware/software problems
 Noisy data comes from:
   data collection: faulty instruments
   data entry: human or computer errors
   data transmission
 Inconsistent (and duplicate) data comes from:
   different data sources, hence non-uniform naming conventions/data codes
   functional dependency and/or referential integrity violations

Application contexts
 Integrate data from different sources
   E.g., populating a data warehouse (DW) from different operational data stores, or a mediator-based architecture
 Eliminate errors and duplicates within a single source
   E.g., duplicates in a file of customers
 Migrate data from a source schema into a different fixed target schema
   E.g., discontinued application packages
 Convert poorly structured data into structured data
   E.g., processing data collected from the Web

When materializing the integrated data (data warehousing)…

SOURCE DATA ... → Data Extraction → Data Transformation → Data Loading → ... TARGET DATA

ETL: Extraction, Transformation and Loading

70% of the time in a data warehousing project is spent on the ETL process

Why is Data Cleaning Important?
Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get high-quality data!

 No quality data, no quality decisions!
 Quality decisions must be based on good quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)

Quality

“Even though quality cannot be defined, you know what it is.”
Robert Pirsig

What is Data of Good Quality?

Accuracy, Objectivity, Believability, Reputation, Accessibility, Security, Relevance, Value-Added, Timeliness, Completeness, Amount of Data, Interpretability, Understandability, Consistency, Concise Representation

179 dimensions

Felix Naumann | Data Profiling and Data Cleansing | Winter 2014/15

Data Quality Dimensions (classical)

Accuracy
 Refers to the closeness of values in a database to the true values of the entities that the data represent; if accuracy is below 100%, there are errors in the data
 Example: “Jhn” vs. “John”
Completeness
 Concerns whether the database has complete information to answer queries
 Partial knowledge of the records in a table or of the attributes in a record
Currency
 Aims at identifying the current values of entities represented by tuples in a database, and at answering queries using those values
 Example: residence (permanent) address: outdated vs. up to date
Consistency
 Refers to the validity and integrity of data representing real-world entities; if it is violated, discrepancies and conflicts arise in the data
 Example: ZIP code and city are inconsistent

Accuracy
 Closeness between a value v and a value v', where v' is considered the correct representation of the real-world phenomenon that v aims to represent.
   Ex: for a person named John, v' = John is correct, v = Jhn is incorrect

Syntactic accuracy: closeness of a value v to the elements of the corresponding definition domain D
 Ex: if v = Jack, even if v' = John, v is considered syntactically correct, because it is an admissible value in the domain of people's names.
 Measured by means of comparison functions (e.g., edit distance) that evaluate the distance between v and the values of the domain

Semantic accuracy: closeness of the value v to the true value v'
 Measured with a <yes, no> or <correct, not correct> domain
 Coincides with correctness
 The corresponding true value has to be known
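
The difference between the two notions can be made concrete with a small sketch. It assumes a tiny, hypothetical name domain D and uses the classic Levenshtein edit distance as the comparison function; the function names are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


NAME_DOMAIN = ["John", "Jack", "Mary"]  # hypothetical definition domain D


def syntactic_distance(v: str, domain) -> int:
    """Distance from v to the closest admissible value; 0 means syntactically accurate."""
    return min(edit_distance(v, d) for d in domain)


def is_semantically_accurate(v: str, true_value: str) -> bool:
    """Semantic accuracy is a yes/no check and needs the true value v'."""
    return v == true_value


print(syntactic_distance("Jhn", NAME_DOMAIN))      # one edit away from "John"
print(syntactic_distance("Jack", NAME_DOMAIN))     # admissible value
print(is_semantically_accurate("Jack", "John"))    # admissible but wrong
```

"Jack" illustrates the slide's point: it is at distance 0 from the domain, so it is syntactically accurate, yet it is semantically inaccurate when the true value is "John".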

Granularity of accuracy definition

 Accuracy may refer to:

 a single value of a relation attribute
 an attribute or column
 a relation
 the whole database

Metrics for quantifying accuracy

 Weak accuracy error
   Characterizes accuracy errors that do not affect the identification of tuples
 Strong accuracy error
   Characterizes accuracy errors that affect the identification of tuples
 Percentage of accurate tuples
   Characterizes the fraction of accurate tuples matched with a reference table

Completeness
 The extent to which data are of sufficient breadth, depth, and scope for the task at hand.
 Three types:
   Schema completeness: degree to which concepts and their properties are not missing from the schema
   Column completeness: evaluates the missing values for a specific property or column in a table
   Population completeness: evaluates missing values with respect to a reference population

Completeness of relational data
 The completeness of a table characterizes the extent to which the table represents the real world.
 Can be characterized with respect to:
   The presence/absence and meaning of null values
    Example: in Person(name, surname, birthdate, email), a null email may indicate that the person has no email (no incompleteness), that the email exists but is not known (incompleteness), or that it is not known whether the person has an email (incompleteness may not be the case)
   Validity of the open world assumption (OWA) or the closed world assumption (CWA)
     OWA: assumes that, in addition to missing values, some tuples representing real-world entities may also be missing
     CWA: assumes that the database has collected all the tuples representing real-world entities, but the values of some attributes in those tuples are possibly missing

Metrics for quantifying completeness (1)

 Model without null values with OWA
   Needs a reference relation ref(r) for a relation r, which contains all the tuples that satisfy the schema of r
  C(r) = |r| / |ref(r)|

Example: according to a registry of the Lisbon municipality, the number of citizens is 2 million. If a company stores data about Lisbon citizens for the purpose of its business and the number of stored citizens is 1,400,000, then C(r) = 0.7
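
The ratio can be checked with a few lines of code. This is a minimal sketch of the formula above; the function name and the guard against an empty reference relation are illustrative assumptions.

```python
def completeness_owa(relation_size: int, reference_size: int) -> float:
    """C(r) = |r| / |ref(r)| for a model without null values under OWA."""
    if reference_size <= 0:
        raise ValueError("the reference relation must contain at least one tuple")
    return relation_size / reference_size


# Lisbon example: 1,400,000 stored citizens out of 2,000,000 registered ones.
c = completeness_owa(1_400_000, 2_000_000)
print(c)
```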

Metrics for quantifying completeness (2)
 Model with null values with CWA: specific definitions for different granularities:
   Values: to capture the presence of null values for some fields of a tuple
   Tuple: to characterize the completeness of a tuple wrt the values of all its fields
     Evaluates the % of specified values in the tuple wrt the total number of attributes of the tuple itself
    Example: Student(stID, name, surname, vote, examdate)
    Equal to 1 for (6754, Mike, Collins, 29, 7/17/2004)
    Equal to 0.8 for (6578, Julliane, Merrals, NULL, 7/17/2004)
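
The two values in the example can be reproduced as follows. The sketch represents NULL as Python's None, which is an assumption about how the data would be loaded.

```python
def tuple_completeness(row) -> float:
    """Fraction of attributes of the tuple that hold a specified (non-null) value."""
    return sum(value is not None for value in row) / len(row)


# Student(stID, name, surname, vote, examdate); None stands for NULL.
complete_row = (6754, "Mike", "Collins", 29, "7/17/2004")
partial_row = (6578, "Julliane", "Merrals", None, "7/17/2004")

print(tuple_completeness(complete_row))  # 5 of 5 values specified
print(tuple_completeness(partial_row))   # 4 of 5 values specified
```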

Metrics for quantifying completeness (3)

 Attribute: to measure the number of null values of a specific attribute in a relation
   Evaluates the % of specified values in the column corresponding to the attribute wrt the total number of values that should have been specified
  Example: for calculating the average of votes in Student, a notion of the completeness of vote is useful
 Relations: to capture the presence of null values in the whole relation
   Measures how much information is represented in the relation by evaluating the content of the information actually available wrt the maximum possible content, i.e., without null values
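
Attribute and relation completeness follow the same counting idea at coarser granularities. The sketch below reuses the two Student tuples from the previous slide as a toy relation; None stands for NULL and the function names are hypothetical.

```python
ATTRIBUTES = ("stID", "name", "surname", "vote", "examdate")

students = [
    (6754, "Mike", "Collins", 29, "7/17/2004"),
    (6578, "Julliane", "Merrals", None, "7/17/2004"),
]


def attribute_completeness(relation, attribute: str) -> float:
    """Fraction of tuples in which the given attribute is specified."""
    index = ATTRIBUTES.index(attribute)
    return sum(row[index] is not None for row in relation) / len(relation)


def relation_completeness(relation) -> float:
    """Fraction of specified cells over all cells of the relation."""
    cells = [value for row in relation for value in row]
    return sum(value is not None for value in cells) / len(cells)


print(attribute_completeness(students, "vote"))  # 1 of 2 votes specified
print(relation_completeness(students))           # 9 of 10 cells specified
```

An average of votes computed over this relation would rest on half of the students only, which is why the completeness of vote matters for that query.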

Time-related dimensions
Currency: concerns how promptly data are updated
 Example: if the residential address of a person is updated (it corresponds to the address where the person lives), then the currency is high

Volatility: characterizes the frequency with which data vary in time
 Example: birth dates (volatility zero) vs. stock quotes (high degree of volatility)

Timeliness: expresses how current data are for the task at hand
 Example: the timetable for university courses can be current by containing the most recent data, but it is not timely if it is available only after the start of the classes

Metrics of time-related dimensions

 Currency: last-update metadata
   Straightforward for data types that change with a fixed frequency
 Volatility: length of time that data remain valid
 Timeliness: currency plus a check that data are available before the planned usage time
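
These three metrics can be sketched with plain timestamps. The code below is one possible reading of the slide, not a standard definition: it treats data as current while their age is within the validity period (the volatility measure), and as timely when they are also available before the planned usage time. The dates describe a hypothetical course timetable.

```python
from datetime import datetime, timedelta


def age_of_data(last_update: datetime, at: datetime) -> timedelta:
    """Currency proxy: time elapsed since the last update."""
    return at - last_update


def is_current(last_update: datetime, at: datetime, valid_for: timedelta) -> bool:
    """Data count as current while their age does not exceed the validity period."""
    return age_of_data(last_update, at) <= valid_for


def is_timely(last_update: datetime, available_at: datetime,
              planned_use: datetime, valid_for: timedelta) -> bool:
    """Timely = available before the planned usage and still current at that time."""
    return available_at <= planned_use and is_current(last_update, planned_use, valid_for)


# Hypothetical timetable: updated on 1 Sep, published on 20 Sep, classes start on 15 Sep.
last_update = datetime(2014, 9, 1)
available_at = datetime(2014, 9, 20)
classes_start = datetime(2014, 9, 15)
semester = timedelta(days=150)  # assumed validity period of a timetable

print(is_current(last_update, classes_start, semester))               # True
print(is_timely(last_update, available_at, classes_start, semester))  # False
```

The timetable is current but not timely, which matches the university example given for timeliness.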

Consistency

 Captures the violation of semantic rules defined over a set of data items, where data items can be tuples of relational tables or records in a file
 Integrity constraints in relational data
   Domain constraints, key definitions, inclusion and functional dependencies

Other dimensions
 Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources
 Synchronization between different time series: concerns the proper integration of data having different time stamps
 Accessibility: measures the ability of the user to access the data from his/her own culture, physical status/functions, and technologies available

Taxonomy of data quality problems

[Oliveira 2009]
 Value-level
 Value-set (attribute/column) level
 Record level
 Relation level
 Multiple relations level

Value level
Missing value: value not filled in for a mandatory (not null) attribute
 Ex: birth date = ‘’
Syntax violation: value does not satisfy the syntax rule defined for the attribute
 Ex: zip code = 27655-175; syntax rule: xxxx-xxx
Spelling error
 Ex: city = ‘Lsboa’, instead of ‘Lisbon’
Domain violation: value does not belong to the valid domain set
 Ex: age = 240; valid domain of age: [0, 120]
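
Three of these value-level problems can be detected with simple per-field checks. The sketch below assumes records are dictionaries with the hypothetical keys birth_date, zip_code and age; spelling errors are left out because they need a dictionary of valid values.

```python
import re

ZIP_RULE = re.compile(r"\d{4}-\d{3}")  # syntax rule xxxx-xxx
AGE_DOMAIN = range(0, 121)             # valid ages: 0 to 120


def value_level_problems(record: dict) -> list:
    """Return the value-level problems found in one record."""
    problems = []
    if not record.get("birth_date"):
        problems.append("missing value: birth_date")
    if not ZIP_RULE.fullmatch(record.get("zip_code") or ""):
        problems.append("syntax violation: zip_code")
    if record.get("age") not in AGE_DOMAIN:
        problems.append("domain violation: age")
    return problems


dirty = {"birth_date": "", "zip_code": "27655-175", "age": 240}
clean = {"birth_date": "18.2.80", "zip_code": "2765-175", "age": 37}

print(value_level_problems(dirty))
print(value_level_problems(clean))
```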

Value-set and Record levels

Value-set level
 Existence of synonyms: the attribute takes different values with the same meaning
   Ex: job = ‘footballer’; job = ‘football player’
 Existence of homonyms: the same word is used with different meanings
   Ex: the same name refers to different authors of a publication
 Uniqueness violation: a unique attribute takes the same value more than once
   Ex: two clients have the same ID number
 Integrity constraint violation
   Ex: the sum of the values of a percent attribute is more than 100
Record level
 Integrity constraint violation
   Ex: the total price of a product is different from price plus taxes
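
The uniqueness and integrity constraint examples translate directly into checks over a column and over single records. This is a minimal sketch with made-up client IDs and order records; amounts are kept as integers to avoid floating-point comparison issues.

```python
from collections import Counter


def uniqueness_violations(values) -> list:
    """Values of a supposedly unique attribute that occur more than once."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


def percent_sum_violated(percentages, limit: int = 100) -> bool:
    """Value-set constraint: the percentages of a column must not exceed the limit."""
    return sum(percentages) > limit


def inconsistent_records(records) -> list:
    """Record-level constraint: total must equal price plus taxes."""
    return [r for r in records if r["total"] != r["price"] + r["taxes"]]


client_ids = [101, 102, 101, 103]  # hypothetical ID column
orders = [                         # hypothetical records
    {"price": 100, "taxes": 23, "total": 123},
    {"price": 50, "taxes": 10, "total": 65},
]

print(uniqueness_violations(client_ids))
print(percent_sum_violated([40, 35, 30]))
print(inconsistent_records(orders))
```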

Relation level
Heterogeneous data representations: different ways of representing the same real-world entity
 Ex: name = ‘John Smith’; name = ‘Smith, John’
Functional dependency violation
 Ex: (2765-175, ‘Estoril’) and (2765-175, ‘Oeiras’)
Existence of approximate duplicates
 Ex: (1, André Fialho, 12634268) and (2, André Pereira Fialho, 12634268)
Integrity constraint violation
 Ex: the sum of salaries exceeds the established maximum
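
A functional dependency violation such as the zip code example can be found by grouping tuples on the left-hand side and looking for groups with more than one right-hand value. The sketch assumes the dependency zip → city; the third row is a hypothetical tuple added for contrast.

```python
from collections import defaultdict


def fd_violations(rows, lhs: str, rhs: str) -> dict:
    """Return the lhs values that are associated with more than one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


rows = [
    {"zip": "2765-175", "city": "Estoril"},
    {"zip": "2765-175", "city": "Oeiras"},
    {"zip": "1000-001", "city": "Lisboa"},  # hypothetical consistent tuple
]

print(fd_violations(rows, "zip", "city"))
```

The check only reports that the two tuples disagree; it cannot tell which city is the correct one.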

Multiple tables level

Heterogeneous data representations
 Ex: one table stores meters, another stores inches
Existence of synonyms
Existence of homonyms
Different granularities: the same real-world entity is represented with different granularity levels
 Ex: age: {0-30, 31-60, >60}; age: {0-25, 26-40, 40-65, >65}
Referential integrity violation
Existence of approximate duplicates
Integrity constraint violation

Data Quality Process

1. Data Quality Auditing (Assessment)

 Data Profiling
 Data Analysis

2. Data Quality Improvement

 Data Cleaning
 Data Enrichment

Data quality auditing
 Consists of:
   Data profiling – analysing data sources to identify data quality problems
   Data analysis – statistical evaluation, logical study and application of data mining algorithms to define data patterns and rules

 Main goals:
   To obtain a definition of the data: metadata collection
   To check for violations of the metadata definition
   To detect other data quality problems that belong to a given taxonomy
   To supply recommendations for the data cleaning task

Data Profiling
 Data source discovery
 Metadata
 Schema discovery
 Schema matching and mapping
 Profiling for metadata (keys, foreign keys, data types, …)
 Data discovery
 Column-level: null values, domains, patterns, value distributions / histograms
 Table-level: Data mining, rules
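
Column-level profiling can be illustrated with a short sketch that counts nulls, distinct values, value patterns and the most frequent values of one column. The pattern encoding (digits become 9, letters become A) is an assumed convention, and the zip code column is made up for the example.

```python
import re
from collections import Counter


def pattern_of(value: str) -> str:
    """Abstract a value into a pattern: digits become 9, ASCII letters become A."""
    value = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", value)


def profile_column(values) -> dict:
    """Basic column profile; None stands for a null value."""
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "patterns": Counter(pattern_of(v) for v in present),
        "top_values": Counter(present).most_common(2),
    }


zip_codes = ["2765-175", "1000-001", None, "27655-175", "1000-001"]
profile = profile_column(zip_codes)
print(profile)
```

A rare pattern such as 99999-999 next to a dominant 9999-999 is a typical hint of a syntax violation worth inspecting.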

Typical techniques used in data quality auditing
 Dictionaries of words: attribute values are compared with one or more dictionaries of the domain
   Ex: WordNet
 Algorithms to detect functional dependencies and their violations
   Ex: Localidade => Cod.Postal (locality determines postal code)
 Algorithms to detect duplicates
   String matching for string fields
     Character-based
     Token-based
     Phonetic algorithms
   Record matching
     Rule-based
     Probabilistic
 ...

Data quality improvement

 Often includes:
   Data transformation – set of operations that source data must undergo to fit the target schema
   Data cleaning – detecting, removing and correcting dirty data (including approximate duplicate elimination)
   Data enrichment – use of additional information to improve data quality
 Main goal:
   To correct the data quality problems detected during the data quality auditing process

Typical techniques used in data cleaning and transformation
 Dictionaries of words

 Libraries of pre-defined cleaning functions

 Machine learning techniques

 Techniques for consolidating approximate duplicates

Methodology for data cleaning

1. Extraction of the individual fields that are relevant
2. Standardization of record fields
3. Correction of data quality problems at value level
   Missing values, syntax violations, etc.
4. Correction of data quality problems at value-set level and record level
   Synonyms, homonyms, uniqueness violations, integrity constraint violations, etc.
5. Correction of data quality problems at relation level
   Violation of functional dependencies, duplicate elimination, etc.
6. Correction of data quality problems at multiple relations level
   Referential integrity violations, duplicate elimination, etc.
 User feedback
   To solve instances of data quality problems not addressed by automatic methods
 The effectiveness of the data cleaning and transformation process must always be measured for a sample of the data set

Data Cleaning Tasks
1. Extraction from sources
   Technical and syntactic obstacles
2. Transformation
   Schematic obstacles
3. Standardization
   Syntactic and semantic obstacles
4. Duplicate detection
   Similarity functions
   Algorithms
5. Data fusion / consolidation
   Semantic obstacles
6. Loading into warehouse / presenting to user

Human Interaction is Needed

 Components to implement
 Wrappers for technical heterogeneity
 Schema integration based on correspondences
 Similarity measure for schema elements
 Similarity measure for records
 Knobs to turn
 Thresholds for similarity measures
 Partition size / window size
 Expert guidance
 Rule selection / rule specification
 Schema matching
 Duplicate detection
 Data fusion
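
To show how the two knobs interact, here is a sketch of windowed duplicate detection in the style of the sorted neighbourhood method. The slide does not name a specific algorithm, so the choice of method, the sorting key, the window of 2 and the threshold of 0.85 are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def sorted_neighbourhood(records, key, window: int, threshold: float) -> list:
    """Sort records by key and compare each one only with its window - 1 successors."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            score = SequenceMatcher(None, key(record), key(other)).ratio()
            if score >= threshold:
                pairs.append((record, other))
    return pairs


names = ["Tom Hanks", "Jack Lemmon", "Ton Hanks",
         "Harrison Ford", "Jack Lemon", "Kevin Spacey"]

pairs = sorted_neighbourhood(names, key=str.lower, window=2, threshold=0.85)
print(pairs)
```

A larger window finds duplicates whose sort keys are further apart at the cost of more comparisons, and a lower threshold trades precision for recall; both usually need expert tuning on a sample of the data.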

Existing technology for ensuring data quality

 Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
   Programs difficult to optimize and maintain
 RDBMS mechanisms for guaranteeing integrity constraints
   Do not address important data instance problems
 Data transformation workflow scripts using a data cleaning/profiling tool

Criteria for comparing commercial data quality tools (1)
Debugger:
 Data lineage: data lineage or provenance identifies the set of source data items that produced a given data item
 Breakpoints: a breakpoint is an intentional stopping or pausing place in a cleaning program, put in place for debugging purposes
 Edit values: the user can edit values during debugging

Criteria for comparing commercial data quality tools (2)
Profiling:
 Rules: a rule is a piece of business logic that defines conditions applied to data. Rules are used to validate the data and to measure data quality
 Filters: a filter is used to split the data tuples into different groups. Each group should be validated by a different set of rules

Criteria for comparing commercial data quality tools (3)
Execution:
 User involvement: support for user interaction in a data cleaning process
 Incremental updates: the ability to incrementally update data targets, instead of rebuilding them from scratch every time

Commercial Data Cleaning Tools (2014) (1/3)

Column groups: Debugger (Data lineage, Breakpoints, Edit values); Profiling (Rules, Filters); Execution (User involvement, Incremental updates)

Tool                             Data lineage  Breakpoints  Edit values  Rules  Filters  User involvement  Incremental updates
Informatica PowerCenter          Y             Y            Y            Y      Y        N                 Y
IBM Information Server           Y             Y            N            Y      Y        N                 Y
Talend Open Studio               N             Y            N            Y      Y        N                 Y
Oracle Data Integrator           Y             Y            N            Y      Y        N                 Y
SQL Server Integration Services  Y             Y            N            Y      N        N                 Y
SAS Data Integration Studio      Y             N            N            Y      Y        N                 Y
Pentaho Data Integration         N             N            N            Y      N        N                 Y
Clover ETL                       N             N            Y            Y      Y        N                 Y

Criteria for comparing commercial data quality tools (4)
Extensibility:
 Create operators: the user can define new operators
 Modify operators: the user can modify standard operators
User Interface:
 Drag and drop: the user can define data quality processes using a drag and drop interface
 Editor: the user can define and edit data quality processes modeled as workflows using a graphical interface

Commercial Data Cleaning Tools (2014) (2/3)

Column groups: Extensibility (Create Operators, Modify Operators); User Interface (Drag and Drop, Graphical Editor)

Tool                             Create Operators  Modify Operators  Drag and Drop  Graphical Editor
Informatica PowerCenter          Y (Java)          N                 Y              Y
IBM Information Server           Y (Java)          N                 Y              Y
Talend Open Studio               Y (Java, Groovy)  Y (Java)          Y              Y
Oracle Data Integrator           Y                 Y                 Y              Y
SQL Server Integration Services  Y (C#, VB)        N                 Y              Y
SAS Data Integration Studio      Y (SAS)           Y (SAS)           Y              Y
Pentaho Data Integration         Y (Javascript)    N                 Y              Y
Clover ETL                       Y (CTL)           N                 Y              Y

Criteria for comparing commercial data quality tools (5)
Scalability:
 Grid: the tool can run a cleaning process on a collection of computer resources from multiple locations
 Partitioning: the user can partition the data and run each partition independently (on different CPUs or cores)
 Pushdown optimization: the tool translates the transformation logic into SQL queries and sends the SQL queries to the database. The database engine executes the SQL queries to process the transformations
Others:
 Free version: the tool has a free version

Commercial Data Cleaning Tools (2014) (3/3)

Column groups: Scalability (Grid, Partitioning, Pushdown Optimization); Others (Free version)

Tool                             Grid  Partitioning  Pushdown Optimization  Free version
Informatica PowerCenter          Y     Y             Y                      Y
IBM Information Server           Y     Y             Y                      N
Talend Open Studio               Y     N             Optional ELT           Y
Oracle Data Integrator           Y     N             ELT                    Y
SQL Server Integration Services  N     Y             -                      Y (IST)
SAS Data Integration Studio      Y     Y             Y                      Y (IST)
Pentaho Data Integration         Y     Y             N                      Y
Clover ETL                       Y     Y             N                      Y

Research Data Cleaning Tools (2014) (1/2)

Column groups: Detection of DQ problems (Constraints, Statistical); Repair of DQ problems (Search, ML/St, Data Transformations)

Tool                      Constraints  Statistical  Search  ML/St  Data Transformations
Cleenex                   QCs          N            N       N      Y
Llunatic                  EGDs         N            Y       N      N
Nadeef                    CFDs, MDs    N            Y       N      N
Guided data repair        CFDs         N            Y       Y      N
Scare                     N            Y            N       Y      N
Eracer                    N            Y            N       Y      N
Continuous data cleaning  FDs          N            Y       Y      N

Criteria for comparing research data cleaning tools (1)
Detection:
 Constraints – use of rules and/or conditions
   EGDs - equality generating dependencies
   QCs - quality constraints
   CFDs - conditional functional dependencies
   MDs - matching dependencies
 Statistical – dirty tuples are detected based on simple statistics or on complex data analysis

Criteria for comparing research data cleaning tools (2)
Repair:
 Search: the system explores the space of possible clean tables and heuristically selects the best table
 ML/St: the system uses machine learning and/or statistical models to infer data values or to prune the search
 Data transformations: the system models the data cleaning process as a data transformation graph

Criteria for comparing research data cleaning tools (3)
User Interface:
 Graphical interface: the system provides a visualization tool and menus for interaction
 User edition: the system allows the user to edit data values
Others:
 Scalability: the system execution time grows linearly with the number of input tuples
 Streaming: the system receives tuples and processes each of them individually (as opposed to batch processing)
 Extensible: the system allows the user to modify and/or insert new algorithms

Research Data Cleaning Tools (2014) (2/2)

Column groups: User Interface (Graphical Interface, User edition); Others (Extensible, Streaming, Scalability)

Tool                      Graphical Interface  User edition  Extensible           Streaming  Scalability
Cleenex                   Y                    Y             Matching algorithms  N          N
Llunatic                  Y                    Y             Cost Managers        N          Y
Nadeef                    Y                    N             Repair algorithms    N          N
Guided data repair        N                    Y             N                    N          N
Scare                     N                    N             N                    N          Y
Eracer                    N                    N             N                    N          N
Continuous data cleaning  N                    N             N                    Y          Y

Death by Typo

Google searches for Britney Spears
488941 britney spears 29 britent spears 9 brinttany spears 5 brney spears 3 britiy spears 2 brirreny spears
40134 brittany spears 29 brittnany spears 9 britanay spears 5 broitney spears 3 britmeny spears 2 brirtany spears
36315 brittney spears 29 britttany spears 9 britinany spears 5 brotny spears 3 britneeey spears 2 brirttany spears
24342 britany spears 29 btiney spears 9 britn spears 5 bruteny spears 3 britnehy spears 2 brirttney spears
7331 britny spears 26 birttney spears 9 britnew spears 5 btiyney spears 3 britnely spears 2 britain spears
6633 briteny spears 26 breitney spears 9 britneyn spears 5 btrittney spears 3 britnesy spears 2 britane spears
2696 britteny spears 26 brinity spears 9 britrney spears 5 gritney spears 3 britnetty spears 2 britaneny spears
1807 briney spears 26 britenay spears 9 brtiny spears 5 spritney spears 3 britnex spears 2 britania spears
1635 brittny spears 26 britneyt spears 9 brtittney spears 4 bittny spears 3 britneyxxx spears 2 britann spears
1479 brintey spears 26 brittan spears 9 brtny spears 4 bnritney spears 3 britnity spears 2 britanna spears
1479 britanny spears 26 brittne spears 9 brytny spears 4 brandy spears 3 britntey spears 2 britannie spears
1338 britiny spears 26 btittany spears 9 rbitney spears 4 brbritney spears 3 britnyey spears 2 britannt spears
1211 britnet spears 24 beitney spears 8 birtiny spears 4 breatiny spears 3 britterny spears 2 britannu spears
1096 britiney spears 24 birteny spears 8 bithney spears 4 breetney spears 3 brittneey spears 2 britanyl spears
991 britaney spears 24 brightney spears 8 brattany spears 4 bretiney spears 3 brittnney spears 2 britanyt spears
991 britnay spears 24 brintiny spears 8 breitny spears 4 brfitney spears 3 brittnyey spears 2 briteeny spears
811 brithney spears 24 britanty spears 8 breteny spears 4 briattany spears 3 brityen spears 2 britenany spears
811 brtiney spears 24 britenny spears 8 brightny spears 4 brieteny spears 3 briytney spears 2 britenet spears
664 birtney spears 24 britini spears 8 brintay spears 4 briety spears 3 brltney spears 2 briteniy spears
664 brintney spears 24 britnwy spears 8 brinttey spears 4 briitny spears 3 broteny spears 2 britenys spears
664 briteney spears 24 brittni spears 8 briotney spears 4 briittany spears 3 brtaney spears 2 britianey spears
601 bitney spears 24 brittnie spears 8 britanys spears 4 brinie spears 3 brtiiany spears 2 britin spears
601 brinty spears 21 biritney spears 8 britley spears 4 brinteney spears 3 brtinay spears 2 britinary spears
544 brittaney spears 21 birtany spears 8 britneyb spears 4 brintne spears 3 brtinney spears 2 britmy spears
544 brittnay spears 21 biteny spears 8 britnrey spears 4 britaby spears 3 brtitany spears 2 britnaney spears
364 britey spears 21 bratney spears 8 britnty spears 4 britaey spears 3 brtiteny spears 2 britnat spears
364 brittiny spears 21 britani spears 8 brittner spears 4 britainey spears 3 brtnet spears 2 britnbey spears
329 brtney spears 21 britanie spears 8 brottany spears 4 britinie spears 3 brytiny spears 2 britndy spears
269 bretney spears 21 briteany spears 7 baritney spears 4 britinney spears 3 btney spears 2 britneh spears
269 britneys spears 21 brittay spears 7 birntey spears 4 britmney spears 3 drittney spears 2 britneney spears
244 britne spears 21 brittinay spears 7 biteney spears 4 britnear spears 3 pretney spears 2 britney6 spears
244 brytney spears 21 brtany spears 7 bitiny spears 4 britnel spears 3 rbritney spears 2 britneye spears
220 breatney spears 21 brtiany spears 7 breateny spears 4 britneuy spears 2 barittany spears 2 britneyh spears
220 britiany spears 19 birney spears 7 brianty spears 4 britnewy spears 2 bbbritney spears 2 britneym spears
199 britnney spears 19 brirtney spears 7 brintye spears 4 britnmey spears 2 bbitney spears 2 britneyyy spears
163 britnry spears 19 britnaey spears 7 britianny spears 4 brittaby spears 2 bbritny spears 2 britnhey spears
147 breatny spears 19 britnee spears 7 britly spears 4 brittery spears 2 bbrittany spears 2 britnjey spears
147 brittiney spears 19 britony spears 7 britnej spears 4 britthey spears 2 beitany spears 2 britnne spears
147 britty spears 19 brittanty spears 7 britneyu spears 4 brittnaey spears 2 beitny spears 2 britnu spears
147 brotney spears 19 britttney spears 7 britniey spears 4 brittnat spears 2 bertney spears 2 britoney spears

Source: http://www.google.com/jobs/britney.html
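
A frequency table like the one above suggests a simple way to consolidate misspellings: take the most frequent spelling as the canonical form and attach every sufficiently similar variant to it. The sketch below applies this idea to the five most frequent queries copied from the table; the 0.85 cutoff and the use of difflib's similarity ratio are assumptions, and nothing here claims to describe how the search engine actually corrects queries.

```python
from difflib import SequenceMatcher

# The five most frequent queries from the table, with their counts.
query_counts = {
    "britney spears": 488941,
    "brittany spears": 40134,
    "brittney spears": 36315,
    "britany spears": 24342,
    "britny spears": 7331,
}

# Assume the most frequent spelling is the intended one.
canonical = max(query_counts, key=query_counts.get)


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


CUTOFF = 0.85  # assumed similarity cut-off

variants = [
    query
    for query in query_counts
    if query != canonical and similarity(query, canonical) >= CUTOFF
]
print(canonical, variants)
```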

Direct marketing by The Economist

FIFA registration form (2010)

German Umlaute

Next lecture

 Data Matching

COS10022 - Lecture 03 - Data Preparation PDF
No ratings yet
COS10022 - Lecture 03 - Data Preparation PDF
61 pages
UNIT - 2 .DataScience 04.09.18
No ratings yet
UNIT - 2 .DataScience 04.09.18
53 pages
3 Data Preprocessing
No ratings yet
3 Data Preprocessing
33 pages
JAVA Advanced 3
No ratings yet
JAVA Advanced 3
19 pages
Correlation
No ratings yet
Correlation
14 pages
dm unit 3
No ratings yet
dm unit 3
15 pages
The Ultimate Guide To Data Cleaning
No ratings yet
The Ultimate Guide To Data Cleaning
18 pages
03preprocessing Part1
No ratings yet
03preprocessing Part1
21 pages
Preview - AIAG+MSA 4 2010
No ratings yet
Preview - AIAG+MSA 4 2010
52 pages
Data Cleaning 2021
No ratings yet
Data Cleaning 2021
61 pages
Group 2 Web Based Cemetery Mapping and Information System Final 1
No ratings yet
Group 2 Web Based Cemetery Mapping and Information System Final 1
50 pages
Data Cleaning and Data Transformation
No ratings yet
Data Cleaning and Data Transformation
13 pages
Day-4 Preprocessing
No ratings yet
Day-4 Preprocessing
11 pages
Data Quality and Preprocessing Concepts ETL
No ratings yet
Data Quality and Preprocessing Concepts ETL
64 pages
Data Preprocessing Part 1
No ratings yet
Data Preprocessing Part 1
14 pages
Master in Big Data and Business Intelligence
No ratings yet
Master in Big Data and Business Intelligence
28 pages
Aspects of Data Quality (Excellent!)
No ratings yet
Aspects of Data Quality (Excellent!)
2 pages
Google File System Paper - Summary
50% (2)
Google File System Paper - Summary
4 pages
M. Information Systems Paige Baltzan instant download
100% (1)
M. Information Systems Paige Baltzan instant download
54 pages
Oracle RDBMS & SQL Tutorial (Very Good)
100% (8)
Oracle RDBMS & SQL Tutorial (Very Good)
66 pages
Mini Project
No ratings yet
Mini Project
12 pages
Data Cleansing
No ratings yet
Data Cleansing
5 pages
Advanced SQL and PL/SQL: Guide To Oracle 10g
No ratings yet
Advanced SQL and PL/SQL: Guide To Oracle 10g
22 pages
Window Functions and Syntax (Slides)
No ratings yet
Window Functions and Syntax (Slides)
14 pages
Data Security in CC
No ratings yet
Data Security in CC
16 pages
Data Warehousing Data Mining Lecture Notes On UNIT 1
No ratings yet
Data Warehousing Data Mining Lecture Notes On UNIT 1
22 pages
Part 1 - Game Data Analyst PDF
No ratings yet
Part 1 - Game Data Analyst PDF
3 pages
Fanatic Magazine 2
No ratings yet
Fanatic Magazine 2
12 pages
Evebox
No ratings yet
Evebox
23 pages
Hadoop Installation For Windows
No ratings yet
Hadoop Installation For Windows
10 pages
Ap CSP study guide - Google Docs
No ratings yet
Ap CSP study guide - Google Docs
5 pages
Statistics and Computer Science: CATALOG 2019/2020
No ratings yet
Statistics and Computer Science: CATALOG 2019/2020
5 pages
Measures of Variation
No ratings yet
Measures of Variation
8 pages
UNIT 3 CHAPTER 9
No ratings yet
UNIT 3 CHAPTER 9
3 pages
Cloud Storage Market Size, Share, Trends & Growth Forecast Report - Segmentation By Type (Solutions and Services), Deployment Model (Public Cloud, Private Cloud and Hybrid Cloud), Organization Size (Large Enterpris
No ratings yet
Cloud Storage Market Size, Share, Trends & Growth Forecast Report - Segmentation By Type (Solutions and Services), Deployment Model (Public Cloud, Private Cloud and Hybrid Cloud), Organization Size (Large Enterpris
5 pages
HFM Hyperion Financial Management Syllabus
No ratings yet
HFM Hyperion Financial Management Syllabus
3 pages
Wireshark Lab 4: TCP: Joshua Larkin CSC 251 Net-Centric Spring 2012
No ratings yet
Wireshark Lab 4: TCP: Joshua Larkin CSC 251 Net-Centric Spring 2012
4 pages
ACIF File Generation: Formdefs, Pagedefs and Fonts) ACIF Indexing and AFP Output File
No ratings yet
ACIF File Generation: Formdefs, Pagedefs and Fonts) ACIF Indexing and AFP Output File
3 pages
Resume of Ref No: Cja646938 - Informatica Consultant With 3-4 Years Exp
No ratings yet
Resume of Ref No: Cja646938 - Informatica Consultant With 3-4 Years Exp
2 pages
Big Data Engineer Resume Example
No ratings yet
Big Data Engineer Resume Example
1 page
Database Warehouse and Data Lake.
No ratings yet
Database Warehouse and Data Lake.
3 pages
Bol Model
No ratings yet
Bol Model
15 pages
THE SQL LANGUAGE: Master Database Management and Unlock the Power of Data (2024 Beginner's Guide)
From Everand
THE SQL LANGUAGE: Master Database Management and Unlock the Power of Data (2024 Beginner's Guide)
JAMIE POWERS
No ratings yet
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
From Everand
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
Janet Laane Effron
No ratings yet
The Power of Big Data: Transforming Industries and Shaping the Future
From Everand
The Power of Big Data: Transforming Industries and Shaping the Future
Tom Henricksen
No ratings yet
Learn Hadoop in 24 Hours
From Everand
Learn Hadoop in 24 Hours
Alex Nordeen
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.