Improving Data Quality: Why Is It So Difficult?: Larissa T. Moss
presented by
Larissa T. Moss
President, Method Focus, Inc.
DAMA Oakland, CA
May 7, 2003
Larissa T. Moss
Method Focus Inc. www.methodfocus.com methodfocus@earthlink.net (626) 355-8167
Ms. Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She frequently speaks at Data Warehouse, Business Intelligence, CRM, and Information Quality conferences around the world on the topics of information asset management, data quality, data modeling, project management, and organizational realignment. She lectures worldwide on the BI topics of spiral development methodology, data modeling, data audit and control, project management, as well as organizational issues. Her articles are frequently published in DM Review, TDWI Journal of Data Warehousing, Cutter IT Journal, Analytic Edge, and The Navigator. She coauthored the books Data Warehouse Project Management, Addison Wesley 2000; Impossible Data Warehouse Situations, Addison Wesley 2002; and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision Support Applications, Addison Wesley 2003. Ms. Moss is a member of the IBM Gold Group, a Friend of Teradata, a senior consultant at the Cutter Consortium, and a contributing member of Ask The Experts on www.dmreview.com. She has been a lecturer at DCI, TDWI, MISTI, and at the Extension of the California Polytechnic University, Pomona. She can be reached at lmoss@methodfocus.com.
Presentation Outline
How do we change?
12 steps to [DQ] recovery
Data is integrated
Data values follow the business rules
Data corresponds to established domains
Data is well defined and understood
Inability to write straightforward queries without knowing how to filter data
Missing Values
Operational systems do not always require informational or demographic data
Gender
Ethnicity
Age
Income
Referring Source
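A quick way to see how much of this informational data is missing is to profile the fields. The following is a minimal sketch, assuming records arrive as dictionaries and that both `None` and blank strings count as missing; the field names and sample records are hypothetical.

```python
from collections import Counter

# Hypothetical customer records; None and "" both count as missing here.
records = [
    {"gender": "F", "ethnicity": None, "age": 34, "income": None, "referring_source": "web"},
    {"gender": "", "ethnicity": None, "age": None, "income": 52000, "referring_source": ""},
    {"gender": "M", "ethnicity": "Hispanic", "age": 41, "income": None, "referring_source": None},
]

def missing_rates(rows, fields):
    """Return the fraction of rows in which each field is missing."""
    missing = Counter()
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return {field: missing[field] / len(rows) for field in fields}

FIELDS = ["gender", "ethnicity", "age", "income", "referring_source"]
rates = missing_rates(records, FIELDS)

# List the fields with the most missing values first.
for field, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{field:18s} {rate:.0%} missing")
```

A profile like this only shows where values are absent; whether a missing value matters is still a business decision.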
Multi-purpose fields
ONE field explicitly has MANY meanings
Which business unit enters the data
At what time in history it was entered
A value in one or more other fields
Appraisal Amount
redefined as
Advertised Amount
redefined as
25 redefines = 25 attributes!
Not mutually exclusive!
Only the value of one is known for each record!
Master_Cd
{A, B, C, D, E, F, G, H, I}
Type of customer
Type of supplier
Regional constraints
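One way to untangle a multi-purpose field such as Master_Cd is to split it into one attribute per meaning. The sketch below is purely illustrative: the slide does not say which code values carry which meaning, so the assignment in `HYPOTHETICAL_MEANINGS` is invented, as are the attribute names.

```python
# Invented assignment of codes to meanings; the slide does not give the real one.
HYPOTHETICAL_MEANINGS = {
    "customer_type": {"A", "B", "C"},
    "supplier_type": {"D", "E", "F"},
    "regional_constraint": {"G", "H", "I"},
}

def split_master_cd(master_cd):
    """Return one attribute per meaning; meanings that do not apply stay None."""
    attributes = {meaning: None for meaning in HYPOTHETICAL_MEANINGS}
    for meaning, codes in HYPOTHETICAL_MEANINGS.items():
        if master_cd in codes:
            attributes[meaning] = master_cd
    return attributes

print(split_master_cd("E"))
```

With the overloaded field, a record can state only one of these facts at a time. Once the meanings are separate attributes, each can be defined, validated, and populated independently.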
Contradicting values
Values in one field are inconsistent with values in another related field
1488 Flatbush Avenue, New York, NY 75261 (Texas Zip)
Type of real property: Single Family Residence
Number of rental units: four
Income property
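Contradictions like these can be caught with cross-field checks. The following is a minimal sketch under two assumptions that are not on the slide: a simplified ZIP prefix table (a real check would use a complete postal reference file) and a rule that a single family residence should report no rental units.

```python
# Simplified, illustrative ZIP prefix ranges (first three digits) per state.
# A real check would use a complete postal reference table.
ZIP3_RANGES = {
    "NY": [(100, 149)],
    "TX": [(750, 799)],
}

def zip_matches_state(state, zip_code):
    """Return True or False, or None when the check cannot be made."""
    ranges = ZIP3_RANGES.get(state)
    if ranges is None or len(zip_code) < 3 or not zip_code[:3].isdigit():
        return None
    prefix = int(zip_code[:3])
    return any(low <= prefix <= high for low, high in ranges)

def property_conflicts(property_type, rental_units):
    """Assumed rule: a single family residence should report no rental units."""
    return property_type == "Single Family Residence" and rental_units > 0

address = {"street": "1488 Flatbush Avenue", "city": "New York", "state": "NY", "zip": "75261"}
loan = {"property_type": "Single Family Residence", "rental_units": 4}

print(zip_matches_state(address["state"], address["zip"]))   # False: not a New York ZIP
print(zip_matches_state("TX", address["zip"]))               # True: the ZIP is in the Texas range
print(property_conflicts(loan["property_type"], loan["rental_units"]))  # True: contradiction
```

A check of this kind shows that two fields disagree. It cannot say which of the two is wrong.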
ceiling-interest-rate: 8.25
floor-interest-rate: 14.75
switched?
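The ceiling and floor example suggests a simple business-rule check: a ceiling rate should not be lower than its floor. A minimal sketch, with hypothetical field names:

```python
def rates_look_switched(ceiling_rate, floor_rate):
    """A ceiling below its floor suggests the two values were swapped."""
    return ceiling_rate < floor_rate

record = {"ceiling_interest_rate": 8.25, "floor_interest_rate": 14.75}
suspect = rates_look_switched(
    record["ceiling_interest_rate"], record["floor_interest_rate"]
)
print(suspect)  # True
```

The check only flags the record. Whether the values were really switched, or one of them is simply wrong, has to be confirmed against the source.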
August 97:
Multiple employee numbers
Date            Employee Name   Department   Empl. Number
July 1995       Bob Smith       213 (HR)     21304762
January 1996    Bob Smith       432 (SRV)    43218221
August 1999     Bob Smith       206 (MKT)    20684762
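The Bob Smith example can be detected by grouping employee numbers per person. The sketch below matches on name alone, which is an assumption made for illustration; real matching would need more attributes. It also checks whether each number begins with the department code, which appears to be the case in the slide's three rows.

```python
from collections import defaultdict

# Rows from the slide: (date, employee name, department, employee number).
history = [
    ("July 1995", "Bob Smith", "213 (HR)", "21304762"),
    ("January 1996", "Bob Smith", "432 (SRV)", "43218221"),
    ("August 1999", "Bob Smith", "206 (MKT)", "20684762"),
]

def employees_with_multiple_numbers(rows):
    """Group employee numbers by name and keep names with more than one."""
    numbers_by_name = defaultdict(set)
    for _date, name, _dept, empl_number in rows:
        numbers_by_name[name].add(empl_number)
    return {name: sorted(nums) for name, nums in numbers_by_name.items() if len(nums) > 1}

def number_embeds_department(dept, empl_number):
    """True when the employee number starts with the department code."""
    return empl_number.startswith(dept.split()[0])

duplicates = employees_with_multiple_numbers(history)
embedded = [number_embeds_department(dept, number) for _d, _n, dept, number in history]
print(duplicates)
print(embedded)
```

If the key embeds the department, every transfer produces a new key, and the employee's history is split across several identities.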
[Figure: total $5,163,980]
Category
Infrastructure Basis:
  Enterprise architected DBs
  Enterprise reusable create/update programs +
  Total Infrastructure expenses
Value Basis:
  Total retrieve equivalent pgms +
  Total value-adding expenses
Cost-adding Basis:
  Redundant create/update pgms
  Interface/extract programs
  Redundant database files
  Total cost-adding expenses
% of Budget Expenses
200 300
0.75 1.50
$ 15,000 $ 30,000
300
1.00
$ 20,000
$ 6,000,000 $ 6,000,000
12%
Lifetime Total **
3,800
$50,000,000
100%
* Determine relative effort to develop average unit of each category using effort to develop a retrieve program as 1.00
+ For programs that retrieve some data and create/update other data, determine the percent of retrieve only attributes and percent of create/update attributes (e.g., to retrieve customer data to create an order)
** Based on 3,800 application programs and database files in portfolio and $50 Million in development
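The arithmetic behind the surviving figures can be sketched as follows. This assumes that the 300 / 1.00 / $20,000 figures belong to the retrieve-equivalent programs row, which the footnote's 1.00 baseline suggests but the extracted slide does not confirm. The constant and function names are hypothetical.

```python
RETRIEVE_UNIT_COST = 20_000       # unit cost at relative weight 1.00 (from the slide)
DEVELOPMENT_BUDGET = 50_000_000   # lifetime total from the footnote

def unit_cost(relative_weight):
    """Average unit cost of a category, scaled from the retrieve-program baseline."""
    return relative_weight * RETRIEVE_UNIT_COST

def share_of_budget(units, relative_weight):
    """Fraction of the development budget spent on a category."""
    return units * unit_cost(relative_weight) / DEVELOPMENT_BUDGET

print(unit_cost(0.75))                        # 15000.0
print(unit_cost(1.50))                        # 30000.0
print(300 * unit_cost(1.00))                  # 6000000.0
print(f"{share_of_budget(300, 1.00):.0%}")    # 12%
```

Under that assumption the figures are consistent with one another: the weights 0.75 and 1.50 give the $15,000 and $30,000 unit costs, and 300 units at $20,000 give the $6,000,000 that is 12% of the $50 million budget.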
Business Units
[Figure: organization chart. Business side: Chief Operating Officer and Business Managers. Technology side: Chief Information Officer and Technology Managers. Business managers are paired with technology managers. Each business unit client (Marketing, Sales, Product Pricing, Customer Support, Distribution, Inventory) has its own IT group.]
[Figure: project constraints: TIME, SCOPE, BUDGET, PEOPLE, QUALITY]
YAH DDD Copyright 2003, Larissa T. Moss, Method Focus, Inc.
Customer Relationship Management
Enterprise Resource Planning
Enterprise Application Integration
Knowledge Management
Data Warehousing
DW delivers... a collection of integrated data used to support the strategic decision making process for the enterprise.
The Promise:
- data integration
- no redundancy
- consistency
- historical data
- ad-hoc reporting
- trend analysis reporting
- faster data delivery
- faster data access
The Reality:
- stove pipe marts
- departmental views
- swim lane development approach
- too time consuming to integrate
- too costly to cleanse data
- increased data redundancy
the organizational lifeline, creating competitive advantage through customer service excellence.
The Promise:
- data integration
- data quality
- customer intimacy
- customer wallet share
- product pricing customization
- knowing your competition
- geographic market potential
The Reality:
- more stovepipe systems
- departmental views
- dirty customer data
- purchased packages not integrated
- focus is too narrow
- privacy issues
Knowledge Management
KM delivers ... a process for capturing, editing, verifying (for accuracy), disseminating, and utilizing tacit and explicit information about the organization.
The Promise:
- utilize organizational info
- data integration
- historical data
- faster data delivery
- faster data access
- first & only customer contact
- reduction of customer calls
- less re-solving same problems
Reality of KM:
- too difficult to build
- too time consuming
- too costly
- technology challenges
- non-sharing culture
- isolated applications
- difficult to disseminate information
Business intelligence
is a cross-organizational discipline and an enterprise architecture for an integrated collection of operational as well as decision support applications and databases, which provides the business community easy access to its business data and allows it to make accurate business decisions.
20%
Data Delivery
Provide intuitive access to business information
BI ?
[Figure: DW and DM components]
Business Units
[Figure: business unit clients (Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Marketing, Sales), each with its own IT group; QUALITY]
(John Zachman)
Cheaper, faster, better!!! But how?
Don't scrap and rework.
Reuse what you already have.
[Figure: project constraints: QUALITY, BUDGET, PEOPLE, TIME, SCOPE]
Information Age:
Reassemble the entire enterprise
Reuse assets from inventory
Projects
Application
Fifth Release Fourth Release Third Release
Refactoring
- Kent Beck
/ Project = Application
The quality of the release deliverables (and ultimately the quality of the applications) will be higher!
BI/DW Development Steps
1.   Business Case Assessment .................. Cross-organizational
2.A  Enterprise Technical Infrastructure ....... Cross-organizational
2.B  Enterprise Non-Technical Infrastructure ... Cross-organizational
3.   Project Planning .......................... Project-specific
4.   Project Requirements Definition ........... Project-specific
5.   Data Analysis ............................. Cross-organizational
6.   Application Prototyping ................... Project-specific
7.   Meta Data Repository Analysis ............. Cross-organizational
8.   Database Design ........................... Cross-organizational
9.   ETL Design ................................ Cross-organizational
10.  Meta Data Repository Design ............... Cross-organizational
11.  ETL Development ........................... Cross-organizational
12.  Application Development ................... Project-specific
13.  Data Mining ............................... Cross-organizational
14.  Meta Data Repository Development .......... Cross-organizational
15.  Implementation ............................ Project-specific
16.  Release Evaluation ........................ Cross-organizational
Involving business representatives from all business units to ratify the data models and meta data
[Figure: Operational Environment. Clients (Inventory, Sales) and their IT groups; Operational Systems; BI/DW Databases (EDW, DM)]
EIM responsibilities
Business architecture inventory: Process models, Data models
Application inventory: Programs, Databases
Policy inventory: Standards, Procedures, Guidelines
IT asset inventory management
Architects
Data stewardship
Guardians of the data while it is being created or maintained by them
Create standards and procedures to ensure that policies and business rules are known and followed
Enforce adherence to policies and business rules that govern the data while the data is in their custody
Periodically monitor (audit) the quality of the data in their custody
Data ownership
Authority to establish policies and set business rules for the data under their control
Decide what the official enterprise definition and domain is for the data under their control
Monitor and advise other end users on proper usage of their data
Frequently, but not always, the data originator
Can be a person or a committee
Enterprise architecture
Mission and Objective
Business Principles
Business Functions
Program Management
Enterprise Data Model
- Data Standardization
- Data Integration
- Data Reconciliation
- Data Quality
Operational Applications
Data Access Applications
Data Analysis Applications
Application Databases
Technology Platform
Network
Middleware
DBMS, Tools
Content
1. Data Management
- data integration
- data cleansing
2. Data Delivery
- data access
- data manipulation
TopDown
[Figure: logical data model with the entities Account, Payment, Payment Method, Customer, Existing Customer, Potential Customer, Salesperson, Salaried Salesperson, Commissioned Salesperson, Org Unit, Org Structure, Supplier, Shipment, Warehouse]
BottomUp
Integrity Violations:
Contradicting values
Violation of business rules
Reused primary keys
Non-unique primary keys
Missing data relationships
Inappropriate data relationships
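Two of these violations, non-unique primary keys and missing data relationships, lend themselves to simple bottom-up checks against the source data. The following is a minimal sketch; the customer and order tables and their column names are hypothetical.

```python
from collections import Counter

# Hypothetical source tables.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Bolt"},
    {"customer_id": 2, "name": "Bolt Inc."},   # non-unique primary key
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 7},        # no matching customer
]

def non_unique_keys(rows, key):
    """Return key values that occur on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)

def orphaned_rows(children, parents, foreign_key, parent_key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {row[parent_key] for row in parents}
    return [row for row in children if row[foreign_key] not in parent_keys]

print(non_unique_keys(customers, "customer_id"))
print(orphaned_rows(orders, customers, "customer_id", "customer_id"))
```

Reused primary keys and inappropriate relationships are harder to find this way, because the data looks structurally valid and only its history or business meaning reveals the problem.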
It may not be worth the time and money to cleanse every data element
Not all data is equally significant
Not all data can be cleansed
How do you know what to cleanse?
Business decision!
Critical data
Not all data is equally critical to all end users
All critical data must be cleansed
Usually includes amount fields
Important data
Important to the organization, but not absolutely critical
Further prioritize important data elements
Cleanse as many as time allows
Those that cannot be cleansed should be bumped to critical for the next release
Insignificant data
Informational data, which is nice to have
Cleansing is optional if time allows
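The triage rules above can be sketched as a small planning function. This is an illustration only: the effort estimates, the element names, and the assumption that important elements arrive already sorted by priority are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DataElement:
    name: str
    category: str        # "critical", "important", or "insignificant"
    effort_days: float   # hypothetical cleansing estimate

def plan_release(elements, days_available):
    """Cleanse all critical elements, then important ones as time allows."""
    cleanse, next_release_critical, optional = [], [], []
    remaining = days_available
    # All critical data must be cleansed, whatever the time budget.
    for element in elements:
        if element.category == "critical":
            cleanse.append(element.name)
            remaining -= element.effort_days
    for element in elements:
        if element.category == "important":
            if element.effort_days <= remaining:
                cleanse.append(element.name)
                remaining -= element.effort_days
            else:
                # Bumped to critical for the next release.
                next_release_critical.append(element.name)
        elif element.category == "insignificant":
            optional.append(element.name)
    return cleanse, next_release_critical, optional

elements = [
    DataElement("loan_amount", "critical", 5),
    DataElement("customer_address", "important", 4),
    DataElement("referring_source", "important", 6),
    DataElement("fax_number", "insignificant", 1),
]
plan = plan_release(elements, days_available=10)
print(plan)
```

The categories themselves are a business decision. The function only applies them once they have been assigned.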
When should it be cleansed?
Retroactively?
At data entry time?
How should it be cleansed?
Use data cleansing or ETL tools?
Write procedural (COBOL/C++) code?
What will we do to prevent dirty data in the future?
Source Data Reengineering
Total [Data] Quality Management (TQM)
Staging Area
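[Figure: staging area architecture. Column headings: Operational; Staging Area (Cleansing, Transforms); Operational Data Store / Oper Marts (Tactical rpts); Staging Area (Cleansing, Transforms); Enterprise Data Warehouse (Strategic rpts). Components shown: Daily StA, Mo StA, loads (L), ODS, OM, EDW, EXW, and DMs labeled Customer Support, Marketing, Engineering, Legal, Operational CRM, and Analytical CRM. Transformation and cleansing take place in the staging areas.]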
[Figure: ETL process flow. Files: Customer Master, Customer Info File, Customers, Sorted Customers, Prospects, All Customers. Processes: Sort Customers, Merge Customers, Match Accounts, Extract Prospects, Merge Prospects.]
coordinated
Extract
Cleanse
Transform
Prepare
Load
ETL Reconciliation
[Figure: loads (L) into DM (monthly)]
INPUT RECORDS -> PROCESS MODULE -> OUTPUT RECORDS, REJECTED RECORDS
# Input Records = # Output Records + # Rejected Records
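The record count tie-out can be sketched as follows: every input record must end up either in the output or in the rejects. The process module, the validity rule, and the sample records are hypothetical.

```python
def process_module(records, is_valid):
    """Split input records into output and rejected lists."""
    output, rejected = [], []
    for record in records:
        if is_valid(record):
            output.append(record)
        else:
            rejected.append(record)
    return output, rejected

def record_count_difference(input_count, output_count, rejected_count):
    """Return input minus (output + rejected); zero means the counts tie out."""
    return input_count - (output_count + rejected_count)

# Hypothetical input: records without an amount are rejected.
input_records = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}, {"id": 3, "amount": 7}]
output_records, rejected_records = process_module(
    input_records, lambda record: record["amount"] is not None
)
difference = record_count_difference(
    len(input_records), len(output_records), len(rejected_records)
)
print(difference)  # 0: every input record is accounted for
```

A nonzero difference means the module dropped or duplicated records without reporting them.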
INPUT CODES -> PROCESS MODULE -> OUTPUT CODES, OUTPUT CODES, OUTPUT CODES, REJECTED CODES
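For codes, the tie-out compares how often each code value occurs on input with how often it occurs across all outputs and the rejects. The sketch below assumes the process module passes code values through unchanged; if it translates codes, the comparison has to be made after applying the same translation. The sample values are hypothetical.

```python
from collections import Counter

def domain_count_differences(input_codes, output_streams, rejected_codes):
    """Return {code: input count minus accounted-for count} for codes that do not tie out."""
    expected = Counter(input_codes)
    actual = Counter(rejected_codes)
    for stream in output_streams:
        actual.update(stream)
    codes = set(expected) | set(actual)
    return {
        code: expected[code] - actual[code]
        for code in codes
        if expected[code] != actual[code]
    }

input_codes = ["A", "A", "B", "C", "C", "C"]
output_streams = [["A", "A"], ["C", "C"]]
rejected_codes = ["B"]

differences = domain_count_differences(input_codes, output_streams, rejected_codes)
print(differences)  # {'C': 1}: one record with code C is unaccounted for
```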
INPUT AMOUNTS -> PROCESS MODULE -> OUTPUT AMOUNTS + OUTPUT AMOUNTS + REJECTED AMOUNTS
Total $ Per First Input Amount + Total $ Per Second Input Amount + Total $ Per Rejected Amounts
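The amount tie-out follows the same pattern with dollar totals: the total going in should equal the totals coming out plus the total rejected. A minimal sketch, assuming amounts are held as `Decimal` values to avoid floating-point rounding; the function name and sample amounts are hypothetical.

```python
from decimal import Decimal

def amount_difference(input_amounts, output_streams, rejected_amounts):
    """Return total input minus (total outputs + total rejected)."""
    zero = Decimal("0")
    total_in = sum(input_amounts, zero)
    total_out = sum((sum(stream, zero) for stream in output_streams), zero)
    total_rejected = sum(rejected_amounts, zero)
    return total_in - (total_out + total_rejected)

input_amounts = [Decimal("100.00"), Decimal("250.50"), Decimal("75.25"), Decimal("10.00")]
output_streams = [[Decimal("100.00")], [Decimal("250.50")]]
rejected_amounts = [Decimal("75.25")]

difference = amount_difference(input_amounts, output_streams, rejected_amounts)
print(difference)  # 10.00 went in but never came out
```

Record counts can tie out while amounts do not, for example when a value is truncated during transformation, so the two checks complement each other.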
[Figure: scale of 1 .. 5, from short term to long term; labels shown include Discovery by accident, Limited data analysis (2), and Optimization (5)]
Organizational impact
Cross-organizational tasks and responsibilities are not well defined
Data quality responsibility is not clear or ignored
Value of data is not understood or appreciated
Projects are often cost justified using the industrial-age mental model
Resource requirements are not well defined
Impact on application development empire
No reward for data sharing
Resistance to change
Organizational changes
Business and IT collaboration (partnership)
Business and business collaboration (partnership)
IT and IT collaboration (partnership)
Increased end user involvement
Cross-organizational activities
Architecture and standardization
New leadership
New leadership
[Figure: organization chart. CEO; CFO, COO, CKO (Chief Knowledge Officer), CTO; LOB Execs and IT Execs linked by collaboration; EIM under the CKO, with EA, DA, DQA, MDA]