
Data Quality and Data Cleaning
2021/10/18

Based on:
• Recent book: Exploratory Data Mining and Data Cleaning, Dasu and Johnson (Wiley, 2004)
• SIGMOD 2003 tutorial

Tutorial Focus
• Overview
  – Data quality process
    • Where do problems come from?
    • How can they be resolved?
  – Disciplines
    • Management
    • Statistics
    • Database
    • Metadata

Overview
• The meaning of data quality (1)
• The data quality continuum
• The meaning of data quality (2)
• Data quality metrics
• Technical tools
  – Management
  – Statistical
  – Database
  – Metadata

The Meaning of Data Quality (1)


Meaning of Data Quality (1)
• Generally, you have a problem if the data doesn't mean what you think it does, or should
  – Data not up to spec: garbage in, glitches, etc.
  – You don't understand the spec: complexity, lack of metadata.
• Many sources and manifestations
  – As we will see.
• Data quality problems are expensive and pervasive
  – DQ problems cost hundreds of billions of dollars each year.
  – Resolving data quality problems is often the biggest effort in a data mining study.

Example
  T.Das|97336o8327|24.95|Y|-|0.0|1000
  Ted J.|973-360-8779|2000|N|M|NY|1000
• Can we interpret the data?
  – What do the fields mean?
  – What is the key? The measures?
• Data glitches
  – Typos, multiple formats, missing / default values
• Metadata and domain expertise
  – Field three is Revenue. In dollars or cents?
  – Field seven is Usage. Is it censored?
    • Field 4 is a censored flag. How to handle censored data?
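A minimal sketch of inspecting the two example records above. The field names are taken from the slide's hints (field 3 = Revenue, field 4 = censored flag, field 7 = Usage); the remaining names are placeholders, not a real spec.

# Minimal glitch-spotting sketch for the pipe-delimited records above.
# Field names are assumptions from the slide's hints, not an actual spec.
import re

records = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]
fields = ["name", "phone", "revenue", "censored_flag", "field5", "field6", "usage"]

for rec in records:
    row = dict(zip(fields, rec.split("|")))
    # Phone numbers should contain only digits and dashes ("97336o8327" hides a letter 'o').
    if not re.fullmatch(r"[\d-]+", row["phone"]):
        print(f"suspicious phone value: {row['phone']!r}")
    # Revenue looks like dollars in one record (24.95) and like cents in the other (2000)?
    # Without metadata we can only flag the ambiguity, not resolve it.
    print(f"{row['name']}: revenue={row['revenue']}, usage={row['usage']}, censored={row['censored_flag']}")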

Data Glitches
• Systemic changes to data which are external to the recorded process.
  – Changes in data layout / data types
    • Integer becomes string, fields swap positions, etc.
  – Changes in scale / format
    • Dollars vs. euros
  – Temporary reversion to defaults
    • Failure of a processing step
  – Missing and default values
    • Application programs do not handle NULL values well …
  – Gaps in time series
    • Especially when records represent incremental changes.

Conventional Definition of Data Quality
• Accuracy
  – The data was recorded correctly.
• Completeness
  – All relevant data was recorded.
• Uniqueness
  – Entities are recorded once.
• Timeliness
  – The data is kept up to date.
  – Special problems in federated data: time consistency.
• Consistency
  – The data agrees with itself.
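To make the conventional dimensions concrete, here is a rough sketch that operationalizes completeness, uniqueness, and consistency as simple metrics on a toy table; the column names and the consistency rule are illustrative assumptions.

# Toy metrics for three of the conventional dimensions.
# Columns and the "revenue >= 0" rule are illustrative assumptions.
rows = [
    {"id": 1, "state": "NY", "revenue": 24.95},
    {"id": 2, "state": None, "revenue": 2000.0},
    {"id": 2, "state": "NJ", "revenue": -5.0},   # duplicate id, negative revenue
]

completeness = sum(r["state"] is not None for r in rows) / len(rows)
uniqueness = len({r["id"] for r in rows}) / len(rows)
consistency_violations = [r for r in rows if r["revenue"] < 0]

print(f"completeness(state) = {completeness:.2f}")
print(f"uniqueness(id)      = {uniqueness:.2f}")
print(f"consistency: {len(consistency_violations)} rows violate revenue >= 0")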

Problems …
• Unmeasurable
  – Accuracy and completeness are extremely difficult, perhaps impossible to measure.
• Context independent
  – No accounting for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy.
• Incomplete
  – What about interpretability, accessibility, metadata, analysis, etc.?
• Vague
  – The conventional definitions provide no guidance towards practical improvements of the data.

Finding a Modern Definition
• We need a definition of data quality which
  – Reflects the use of the data
  – Leads to improvements in processes
  – Is measurable (we can define metrics)
• First, we need a better understanding of how and where data quality problems occur
  – The data quality continuum


The Data Quality Continuum

• Data and information are not static; they flow through a data collection and usage process
  – Data gathering
  – Data delivery
  – Data storage
  – Data integration
  – Data retrieval
  – Data mining/analysis

Data Gathering
• How does the data enter the system?
• Sources of problems:
  – Manual entry
  – No uniform standards for content and formats
  – Parallel data entry (duplicates)
  – Approximations, surrogates – SW/HW constraints
  – Measurement errors

Solutions
• Potential solutions:
  – Preemptive:
    • Process architecture (build in integrity checks)
    • Process management (reward accurate data entry, data sharing, data stewards)
  – Retrospective:
    • Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization); see the sketch below
    • Diagnostic focus (automated detection of glitches)
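A minimal sketch of the retrospective cleaning focus: standardize field values, then remove exact duplicates. The record layout and the normalization rules are assumptions for illustration; real merge/purge uses much richer matching.

# Toy field value standardization plus naive duplicate removal.
def standardize(record):
    name, phone, state = record
    return (
        " ".join(name.upper().split()),               # collapse whitespace, unify case
        "".join(ch for ch in phone if ch.isdigit()),  # keep digits only
        state.strip().upper(),
    )

raw = [
    ("Ted  J.", "973-360-8779", "ny"),
    ("TED J.",  "9733608779",   "NY"),   # same entity, different formatting
    ("T. Das",  "973-360-8327", "NJ"),
]

seen, deduped = set(), []
for rec in map(standardize, raw):
    if rec not in seen:    # exact match after standardization; merge/purge would be fuzzier
        seen.add(rec)
        deduped.append(rec)

print(deduped)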

Data Delivery
• Destroying or mutilating information by inappropriate pre-processing
  – Inappropriate aggregation
  – Nulls converted to default values
• Loss of data:
  – Buffer overflows
  – Transmission problems
  – No checks

Solutions
• Build reliable transmission protocols
  – Use a relay server
• Verification
  – Checksums, verification parser
  – Do the uploaded files fit an expected pattern? (see the sketch below)
• Relationships
  – Are there dependencies between data streams and processing steps?
• Interface agreements
  – Data quality commitment from the data stream supplier.
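A sketch of receiver-side verification along these lines: a checksum to catch corruption in transit, plus a cheap structural check that the file fits the agreed pattern. The file name, field count, and published digest are hypothetical.

# Receiver-side verification: checksum plus a structural sanity check.
import hashlib

def verify_upload(path, expected_sha256, expected_fields=7, delimiter="|"):
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        return "checksum mismatch: file corrupted or truncated in transit"
    # Expected pattern check: every line carries the agreed number of fields.
    bad = [i for i, line in enumerate(data.decode().splitlines(), 1)
           if line.count(delimiter) != expected_fields - 1]
    return f"{len(bad)} malformed lines" if bad else "ok"

# Example call (hypothetical file and digest):
# print(verify_upload("feed_2021-10-18.txt", expected_sha256="..."))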


Data Storage
• You get a data set. What do you do with it?
• Problems in physical storage
  – Can be an issue, but terabytes are cheap.
• Problems in logical storage (ER → relations)
  – Poor metadata.
    • Data feeds are often derived from application programs or legacy data sources. What does it mean?
  – Inappropriate data models.
    • Missing timestamps, incorrect normalization, etc.
  – Ad-hoc modifications.
    • Structure the data to fit the GUI.
  – Hardware / software constraints.
    • Data transmission via Excel spreadsheets, Y2K

Solutions
• Metadata
  – Document and publish data specifications.
• Planning
  – Assume that everything bad will happen.
  – Can be very difficult.
• Data exploration
  – Use data browsing and data mining tools to examine the data (see the sketch below).
    • Does it meet the specifications you assumed?
    • Has something changed?
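A small sketch of "does the data meet the specification you assumed?": profile each column against a hand-written expectation. The spec and rows below are illustrative assumptions, not a published standard.

# Profile columns against an assumed specification.
spec = {
    "account_id": {"type": int,   "nullable": False},
    "state":      {"type": str,   "nullable": True},
    "revenue":    {"type": float, "nullable": False},
}

rows = [
    {"account_id": 1,   "state": "NY", "revenue": 24.95},
    {"account_id": "2", "state": None, "revenue": None},   # type drift + unexpected NULL
]

for col, rule in spec.items():
    for i, row in enumerate(rows, 1):
        value = row.get(col)
        if value is None:
            if not rule["nullable"]:
                print(f"row {i}: unexpected NULL in {col}")
        elif not isinstance(value, rule["type"]):
            print(f"row {i}: {col} has type {type(value).__name__}, expected {rule['type'].__name__}")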

Data Integration
• Combine data sets (acquisitions, across departments).
• Common sources of problems
  – Heterogeneous data: no common key, different field formats
    • Approximate matching (see the sketch below)
  – Different definitions
    • What is a customer: an account, an individual, a family, …
  – Time synchronization
    • Does the data relate to the same time periods? Are the time windows compatible?
  – Legacy data
    • IMS, spreadsheets, ad-hoc structures
  – Sociological factors
    • Reluctance to share – loss of power.

Solutions
• Commercial tools
  – Significant body of research in data integration
  – Many tools for address matching and schema mapping are available.
• Data browsing and exploration
  – Many hidden problems and meanings: must extract metadata.
  – View before and after results: did the integration go the way you thought?
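A sketch of approximate matching when two sources share no common key: normalize names, then score them with a generic string-similarity measure. The sample names, the normalization, and the 0.8 cut-off are illustrative assumptions; production record linkage is considerably more involved.

# Approximate matching across two sources with no shared key.
import re
from difflib import SequenceMatcher

def normalize(name):
    # lowercase, drop punctuation, sort tokens so "Das, T." and "T. Das" align
    return " ".join(sorted(re.sub(r"[^\w\s]", " ", name.lower()).split()))

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

billing   = ["T.Das", "Ted Johnson"]
marketing = ["Das, T.", "JOHNSON, TED", "A. Smith"]

for name in billing:
    best = max(marketing, key=lambda m: similarity(name, m))
    if similarity(name, best) >= 0.8:     # illustrative cut-off, would need tuning
        print(f"{name!r} matches {best!r}")
    else:
        print(f"{name!r}: no confident match")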

Data Retrieval
• Exported data sets are often a view of the actual data. Problems occur because:
  – Source data not properly understood.
  – Need for derived data not understood.
  – Just plain mistakes.
    • Inner join vs. outer join (see the sketch below)
    • Understanding NULL values
  – Computational constraints
    • E.g., too expensive to give a full history, we'll supply a snapshot.
  – Incompatibility
    • EBCDIC?

Data Mining and Analysis
• What are you doing with all this data anyway?
• Problems in the analysis:
  – Scale and performance
  – Confidence bounds?
  – Black boxes and dart boards
    • "fire your statisticians"
  – Attachment to models
  – Insufficient domain expertise
  – Casual empiricism
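A tiny sketch of how an exported "view" can silently lose rows: an inner join drops customers with no usage record, while an outer join keeps them with NULLs that must then be handled downstream. The tables and column names are made up for illustration.

# Inner join vs. outer join on a toy pair of tables.
customers = {"c1": "T.Das", "c2": "Ted J.", "c3": "A. Smith"}
usage     = {"c1": 1000, "c2": 1000}          # no usage row for c3

inner = {cid: (customers[cid], usage[cid]) for cid in customers if cid in usage}
outer = {cid: (customers[cid], usage.get(cid)) for cid in customers}

print("inner join:", inner)   # c3 vanishes from the export
print("outer join:", outer)   # c3 kept, usage is None (NULL)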


Solutions
• Data exploration
  – Determine which models and techniques are appropriate, find data bugs, develop domain expertise.
• Continuous analysis
  – Are the results stable? How do they change? (see the sketch below)
• Accountability
  – Make the analysis part of the feedback loop.
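One way to read "continuous analysis", sketched under assumptions of my own: recompute a summary statistic on each new batch of data and flag large shifts before anyone trusts the new result. The statistic, the feed, and the 10% tolerance are illustrative.

# Flag unstable results between successive data batches.
from statistics import mean

history = []

def check_batch(values, tolerance=0.10):
    m = mean(values)
    if history and abs(m - history[-1]) / abs(history[-1]) > tolerance:
        print(f"result shifted from {history[-1]:.2f} to {m:.2f} -- investigate before trusting it")
    history.append(m)

check_batch([24.9, 25.1, 24.8])   # baseline
check_batch([25.0, 24.7, 25.3])   # stable
check_batch([2500, 2480, 2510])   # e.g. an upstream scale change from dollars to cents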
