02-DataQuality Compressed
Data Quality
DB-hard Queries
Company_Name | Address | Market_Cap
Google | Googleplex, Mtn. View, CA | $210Bn
Intl. Business Machines | Armonk, NY | $200Bn
Microsoft | Redmond, WA | $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'Apple'
Number of Rows: 0
Problem: Missing Data
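A minimal sketch of the missing-data case, using Python's built-in sqlite3 module with an in-memory table. The column name Market_Cap_Bn and the helper lookup_market_cap are illustrative assumptions, not part of the slides. The point it tries to show is that the query succeeds and returns zero rows, so the caller has to treat "no row" as a separate outcome rather than as an answer.

```python
import sqlite3

# In-memory copy of the slide's Companies table (market cap in $Bn).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Companies (Company_Name TEXT, Address TEXT, Market_Cap_Bn REAL)"
)
conn.executemany(
    "INSERT INTO Companies VALUES (?, ?, ?)",
    [
        ("Google", "Googleplex, Mtn. View, CA", 210),
        ("Intl. Business Machines", "Armonk, NY", 200),
        ("Microsoft", "Redmond, WA", 250),
    ],
)


def lookup_market_cap(name):
    """Return the market cap in $Bn, or None if the company is not in the table."""
    row = conn.execute(
        "SELECT Market_Cap_Bn FROM Companies WHERE Company_Name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


print(lookup_market_cap("Microsoft"))  # a row exists
print(lookup_market_cap("Apple"))      # no row: the data is missing, not zero
```

Returning None instead of 0 keeps "unknown" distinct from a real value in any later aggregation.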
DB-hard Queries
Company_Name | Address | Market_Cap
Google | Googleplex, Mtn. View, CA | $210Bn
Intl. Business Machines | Armonk, NY | $200Bn
Microsoft | Redmond, WA | $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'IBM'
Number of Rows: 0
Problem: Entity Resolution
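One simple way to approach the entity-resolution case is an alias table that maps name variants to a canonical name before the lookup. The sketch below assumes a hand-built alias dictionary; ALIASES, canonical_name, and market_cap are hypothetical names, and real entity resolution usually needs fuzzier matching than this.

```python
# Hypothetical alias table: maps known variants to one canonical name.
ALIASES = {
    "ibm": "Intl. Business Machines",
    "international business machines": "Intl. Business Machines",
    "intl. business machines": "Intl. Business Machines",
}

# Market cap in $Bn, keyed by the canonical name used in the slide's table.
MARKET_CAP_BN = {
    "Google": 210,
    "Intl. Business Machines": 200,
    "Microsoft": 250,
}


def canonical_name(name):
    """Resolve a company name to its canonical form, if an alias is known."""
    key = name.strip().lower()
    return ALIASES.get(key, name.strip())


def market_cap(name):
    """Look up market cap ($Bn) after resolving the name; None if unknown."""
    return MARKET_CAP_BN.get(canonical_name(name))


print(market_cap("IBM"))    # resolves to the "Intl. Business Machines" row
print(market_cap("Apple"))  # still missing: resolution cannot invent data
```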
DB-hard Queries
Company_Name | Address | Market_Cap
Google | Googleplex, Mtn. View | $210
Intl. Business Machines | Armonk, NY | $200
Microsoft | Redmond, WA | €250
Sally’s Lemonade Stand | Alameda, CA | $260
SELECT MAX(Market_Cap)
FROM Companies
Number of Rows: 1
Problem: Unit Mismatch
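The unit-mismatch case can be sketched by comparing a naive maximum over the bare numbers with a maximum taken after normalizing units. The slide does not say which rows are in billions or what the exchange rate is, so SCALE and USD_PER_UNIT below are assumptions made only for illustration.

```python
# Rows as they appear on the slide: (name, raw value).
rows = [
    ("Google", "$210"),
    ("Intl. Business Machines", "$200"),
    ("Microsoft", "€250"),
    ("Sally's Lemonade Stand", "$260"),
]

# ASSUMPTIONS (not in the slide): which rows are in billions, and the exchange rate.
SCALE = {
    "Google": 1e9,
    "Intl. Business Machines": 1e9,
    "Microsoft": 1e9,
    "Sally's Lemonade Stand": 1.0,
}
USD_PER_UNIT = {"$": 1.0, "€": 1.1}  # hypothetical rate


def naive_value(raw):
    """What MAX() effectively compares: the bare number, units ignored."""
    return float(raw[1:])


def value_in_usd(name, raw):
    """Normalize to US dollars using the assumed currency rate and scale."""
    return float(raw[1:]) * USD_PER_UNIT[raw[0]] * SCALE[name]


naive_max = max(rows, key=lambda r: naive_value(r[1]))
fixed_max = max(rows, key=lambda r: value_in_usd(r[0], r[1]))
print(naive_max[0])  # Sally's Lemonade Stand
print(fixed_max[0])  # Microsoft
```

Both queries return exactly one row; only the normalized one returns a row that means what the question intended.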
WHO’S CALLING WHOSE DATA DIRTY?
Dirty Data
• The Statistics View:
– There is a process that produces data
– Any dataset is a sample of the output of that process
– Results are probabilistic
– You can correct bias in your sample
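As one illustration of correcting sample bias, the sketch below uses post-stratification: each group's mean is weighted by its share of the population rather than its share of the sample. The data, the group labels, and the population shares are all hypothetical, and this is only one of several possible correction techniques.

```python
# Hypothetical sample: (group, measured value). Group "a" is over-sampled.
sample = [("a", 10.0), ("a", 12.0), ("a", 11.0), ("b", 20.0)]

# ASSUMPTION: the true population shares are known from another source.
population_share = {"a": 0.5, "b": 0.5}


def group_means(rows):
    """Mean of the measured value within each group."""
    totals, counts = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def naive_mean(rows):
    """Unweighted mean: inherits whatever bias the sample has."""
    return sum(v for _, v in rows) / len(rows)


def poststratified_mean(rows, shares):
    """Weight each group's mean by its population share, not its sample share."""
    means = group_means(rows)
    return sum(shares[g] * means[g] for g in shares)


print(naive_mean(sample))                             # 13.25
print(poststratified_mean(sample, population_share))  # 15.5
```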
Dirty Data
• The Database View:
– I got my hands on this data set
– Some of the values are missing, corrupted, wrong, duplicated
– Results are absolute (relational model)
– You get a better answer by improving the quality of the values in your dataset
Dirty Data
• The Domain Expert’s View:
– This data doesn’t look right
– This answer doesn’t look right
– What happened?
Dirty Data
• The Data Scientist’s View:
– Some combination of all of the above
Data Quality Problems
• Data is dirty on its own
[Diagram: pipeline stages Integrate, Clean, Extract, Transform, Load]
Example Data Quality Problems
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
Semantic mappings
Which problems does schema on write help with?
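The two example records can be checked against per-field rules, which is roughly what a schema-on-write system would enforce at load time. The slide gives no schema, so the field names in FIELDS and the patterns in RULES below are guesses made for illustration only.

```python
import re

RAW = [
    "T.Das|97336o8327|24.95|Y|-|0.0|1000",
    "Ted J.|973-360-8779|2000|N|M|NY|1000",
]

# ASSUMPTION: the slide gives no schema; these field names are guesses.
FIELDS = ["name", "phone", "amount", "flag", "gender", "state", "usage"]

# Hypothetical per-field rules, one regular expression each.
RULES = {
    "phone": re.compile(r"\d{3}-?\d{3}-?\d{4}"),
    "amount": re.compile(r"\d+(\.\d+)?"),
    "flag": re.compile(r"[YN]"),
    "gender": re.compile(r"[MF]"),
    "state": re.compile(r"[A-Z]{2}"),
}


def check_record(line):
    """Return the names of fields whose values violate the assumed rules."""
    values = dict(zip(FIELDS, line.split("|")))
    return [
        field
        for field, pattern in RULES.items()
        if not pattern.fullmatch(values[field])
    ]


for line in RAW:
    print(line, "->", check_record(line))
```

Syntactic rules like these catch the letter "o" in the phone number and the placeholder values, but not semantic problems: both amounts (24.95 and 2000) pass the check even if one is in dollars and the other in cents.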
DATA QUALITY
Meaning of Data Quality (1)
• Generally, you have a problem if the data doesn’t mean what you think it does, or should
– Data not up to spec: garbage in, glitches, etc.
– You don’t understand the spec: complexity, lack of metadata.
• Many sources and manifestations
– As we have discussed
• Data quality problems are expensive and pervasive
– DQ problems cost hundreds of billions of dollars each year.
– Resolving data quality problems is often the biggest effort in a data mining study.
Physical Data Cleaning
[Figure: raw readings pass through a smoothing filter to produce smoothed output, plotted over time]
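The slide does not say which smoothing filter is meant. As one common choice, the sketch below uses a trailing moving average; the function name, the window size, and the sample readings are assumptions for illustration.

```python
def moving_average(readings, window=3):
    """Smooth a series with a trailing moving average of the given window size."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(readings)):
        # Use fewer points at the start, where a full window is not yet available.
        start = max(0, i - window + 1)
        chunk = readings[start : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical raw sensor readings with one spike at index 3.
raw = [20.0, 21.0, 20.0, 35.0, 21.0, 20.0]
print(moving_average(raw, window=3))
```

A larger window suppresses spikes more strongly but also delays and flattens genuine changes in the signal.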
The Eyes Have It: A Task by Data Type Taxonomy for Information Visualization [Shneiderman, 96]
Tamara Munzner, 2013
Semantics vs. Types
• Data Semantics: The real-world meaning
e.g., company name, day of the month, person height, etc.
• Data Type: Interpretation in terms of scales of measurement
e.g., quantity or category, sensible mathematical operations, data structure, etc.
Nominal
Categorical
Qualitative
Ordinal
Interval
Ratio
Data | Conceptual
1D floats | temperature
3D vector of floats | space
Data vs. Conceptual Model
• From data model...
32.5, 54.0, -17.3, … (floats)
• using conceptual model...
Temperature
• to data type
Continuous to 4 significant figures (Q)
Hot, warm, cold (O)
Burned vs. Not burned (N)
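The mapping from data model to data type can be sketched by reading the same floats as temperatures at three levels of measurement: quantitative (Q), ordinal (O), and nominal (N). The threshold values and function names below are illustrative assumptions; the slides give no cut-offs.

```python
# The same raw floats, read as temperatures at three levels of measurement.
readings = [32.5, 54.0, -17.3]

# ASSUMPTION: the cut-offs below are illustrative; the slides give none.
COLD_BELOW = 0.0
HOT_FROM = 40.0
BURN_FROM = 50.0


def as_quantitative(t):
    """Q: keep the number, rounded to 4 significant figures."""
    return float(f"{t:.4g}")


def as_ordinal(t):
    """O: ordered categories cold < warm < hot."""
    if t < COLD_BELOW:
        return "cold"
    return "hot" if t >= HOT_FROM else "warm"


def as_nominal(t):
    """N: an unordered label."""
    return "burned" if t >= BURN_FROM else "not burned"


for t in readings:
    print(as_quantitative(t), as_ordinal(t), as_nominal(t))
```

Each step down discards information: Q supports arithmetic, O supports only comparison of order, and N supports only equality tests.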
http://www.smartmoney.com/marketmap/