Improving Data Quality: Why is it so difficult?

presented by

Larissa T. Moss
President, Method Focus, Inc.

DAMA Oakland, CA
May 7, 2003

Copyright 2003, Larissa T. Moss, Method Focus, Inc.

Larissa T. Moss
Method Focus Inc.
www.methodfocus.com
methodfocus@earthlink.net
(626) 355-8167

Ms. Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She frequently speaks at Data Warehouse, Business Intelligence, CRM, and Information Quality conferences around the world on the topics of information asset management, data quality, data modeling, project management, and organizational realignment. She lectures worldwide on the BI topics of spiral development methodology, data modeling, data audit and control, project management, as well as organizational issues. Her articles are frequently published in DM Review, TDWI Journal of Data Warehousing, Cutter IT Journal, Analytic Edge, and The Navigator.

She coauthored the books Data Warehouse Project Management (Addison Wesley, 2000), Impossible Data Warehouse Situations (Addison Wesley, 2002), and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision Support Applications (Addison Wesley, 2003).

Ms. Moss is a member of the IBM Gold Group, a Friend of Teradata, a senior consultant at the Cutter Consortium, and a contributing member of Ask The Experts on www.dmreview.com. She has been a lecturer at DCI, TDWI, MISTI, and at the Extension of the California Polytechnic University, Pomona. She can be reached at lmoss@methodfocus.com.

Presentation Outline

What do we mean by data quality?
- Dirty data categories

How are we addressing it today?
- Ineffective technology solutions

What do we have to change?
- Approaches and techniques

How do we change?
- 12 steps to [DQ] recovery

What do we mean by data quality?


- Data is correct
- Data is accurate
- Data is consistent
- Data is complete
- Data is integrated
- Data values follow the business rules
- Data corresponds to established domains
- Data is well defined and understood

Symptoms of poor-quality data


- Do your programs abend with data exceptions?
- Are your users confused about the meaning of data?
- Is some of your data too stale for reporting?
- Is your data being shared? Is it sharable?
- Are reports inconsistent?
- Does it take your IT staff or the end users hours to reconcile inconsistent reports?
- Does merging data often cause the system to fail?
- Do beepers go off at night?

Dirty data categories


- Dummy (default) values
- Intelligent dummy values
- Missing values
- Multi-purpose fields
- Cryptic values
- Free-form address lines
- Contradicting values
- Violation of business rules
- Reused primary key
- Non-unique primary key
- Missing data relationships
- Inappropriate data relationships

Dummy (default) values


Defaults for mandatory fields:
- SSN 999-99-9999
- Age 999
- Zip 99999
- Income 9,999,999.99

Inability to determine customer profiles
Inability to determine customer demographics

Intelligent dummy values


Defaults with meaning:

Field        Value        Meaning
SSN          888-88-8888  Non-resident alien
Income       999,999.99   Employee
Age          000          Corporate customer
Source Code  FF           Account closed prior to 1991

Inability to write straightforward queries without knowing how to filter data
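
The filtering problem can be illustrated with a minimal Python sketch. The sentinel values come from the two slides above; the record layout, the field names, and the helper `clean_value` are hypothetical and only show why every query has to know the filter rules.

```python
from statistics import mean

# Sentinel values from the slides; field names are illustrative assumptions.
DUMMY_VALUES = {
    "ssn": {"999-99-9999"},
    "age": {999},
    "zip": {"99999"},
    "income": {9_999_999.99},
}
INTELLIGENT_DUMMIES = {
    ("ssn", "888-88-8888"): "Non-resident alien",
    ("income", 999_999.99): "Employee",
    ("age", 0): "Corporate customer",
    ("source_code", "FF"): "Account closed prior to 1991",
}


def clean_value(field, value):
    """Return (usable_value, hidden_meaning); usable_value is None for any sentinel."""
    if value in DUMMY_VALUES.get(field, set()):
        return None, None
    meaning = INTELLIGENT_DUMMIES.get((field, value))
    if meaning is not None:
        return None, meaning
    return value, None


# Hypothetical customer records: one real, one with dummies, one with intelligent dummies.
customers = [
    {"age": 34, "income": 52_000.0},
    {"age": 999, "income": 9_999_999.99},
    {"age": 0, "income": 999_999.99},
]

# A naive average treats the sentinels as real ages; the filtered one does not.
naive_average_age = mean(c["age"] for c in customers)
real_ages = [a for a in (clean_value("age", c["age"])[0] for c in customers) if a is not None]
filtered_average_age = mean(real_ages)
```

Without the filter, two of the three ages in this sample are placeholders, which distorts any demographic profile built on them.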

Missing values

Operational systems do not always require informational or demographic data:
- Gender
- Ethnicity
- Age
- Income
- Referring Source

Inability to analyze marketing channels

Multi-purpose fields

ONE field explicitly has MANY meanings, depending on:
- Which business unit enters the data
- At what time in history it was entered
- A value in one or more other fields

Appraisal Amount
  redefined as Advertised Amount
  redefined as Sold Date
  redefined as Loan Type Code
  redefined as ...

25 redefines = 25 attributes!
- Not mutually exclusive!
- Only the value of one is known for each record!

Inability to judge product profitability

Cryptic values (1)


Often found in "kitchen sink" fields:
- Usually one byte (if not one bit)
- Highly cryptic (A, B, C, 1, 2, 3, ...)
- Non-intelligent, non-intuitive codes
- Often not mutually exclusive

Inability to empower end users to write their own queries

Cryptic values (2)

ONE field implicitly has MANY meanings

Master_Cd {A, B, C, D, E, F, G, H, I}
- {A, B, C}: Type of customer
- {D, E, F}: Type of supplier
- {G, H, I}: Regional constraints
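
One way to untangle such a field is to split it into one attribute per hidden meaning. The sketch below is illustrative Python: the code ranges come from the slide, while the attribute names and the function `split_master_cd` are assumptions.

```python
# Code ranges from the slide; attribute names are hypothetical.
MASTER_CD_DOMAINS = {
    "customer_type": {"A", "B", "C"},
    "supplier_type": {"D", "E", "F"},
    "regional_constraint": {"G", "H", "I"},
}


def split_master_cd(code):
    """Map one Master_Cd value to the single attribute it actually describes."""
    for attribute, domain in MASTER_CD_DOMAINS.items():
        if code in domain:
            return {attribute: code}
    raise ValueError(f"Master_Cd {code!r} is outside every known domain")
```

For example, `split_master_cd("E")` yields a supplier-type attribute, and a code outside all three ranges raises an error instead of being loaded silently.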

Free-form address lines


Unstructured text:
- no discernible pattern
- cannot be parsed

address-line-1: ROSENTHAL, LEVITZ, A
address-line-2: TTORNEYS
address-line-3: 10 MARKET, SAN FRANC
address-line-4: ISCO, CA 95111

Inability to perform market analysis

Contradicting values
Values in one field are inconsistent with values in another related field:
- 1488 Flatbush Avenue, New York, NY 75261 (Texas Zip)
- Type of real property: Single Family Residence; Number of rental units: four (Income property)

Inability to make reliable business decisions

Violation of business rules


Business Rule: Adjustable Rate Mortgages must have
- Maximum Interest Rate (Ceiling)
- Minimum Interest Rate (Floor)

Business Rule: A Ceiling is always higher than a Floor

ceiling-interest-rate: 8.25
floor-interest-rate: 14.75
(switched?)

Inability to calculate product profitability
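
The two business rules above can be expressed as a simple record-level check. This is a minimal Python sketch; the dictionary keys and the function `check_arm_rates` are hypothetical names, and the sample values are the ones shown on the slide.

```python
def check_arm_rates(loan):
    """Return the list of business-rule violations for one adjustable rate mortgage."""
    violations = []
    ceiling = loan.get("ceiling_interest_rate")
    floor = loan.get("floor_interest_rate")
    if ceiling is None or floor is None:
        # Rule 1: an ARM must carry both a ceiling and a floor.
        violations.append("ARM must have both a ceiling and a floor")
    elif ceiling <= floor:
        # Rule 2: the ceiling is always higher than the floor.
        violations.append(f"ceiling {ceiling} is not higher than floor {floor}")
    return violations


slide_example = {"ceiling_interest_rate": 8.25, "floor_interest_rate": 14.75}
slide_violations = check_arm_rates(slide_example)
```

The check only flags the record. Whether the two values were switched or one of them is simply wrong still has to be decided by someone who knows the loan.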

Reused primary keys


Little history, if any, stored in operational files:
- primary keys are customarily re-used
- may have a different rollup structure

January 94: branch 501 = San Francisco Main, region 1, area SW
August 97:  branch 501 = San Luis Obispo, region 2, area SW

Inability to evaluate organizational performance

Non-unique primary keys


Duplicate identification numbers

Multiple customer numbers:

Customer Name      Phone Number  Cust. Number
Philip K. Sherman  818.357.5166  960601
Philip K. Sherman  818.357.7711  960105
Philip K. Sherman  818.357.8911  960003

Multiple employee numbers:

Date          Employee Name  Department  Empl. Number
July 1995     Bob Smith      213 (HR)    21304762
January 1996  Bob Smith      432 (SRV)   43218221
August 1999   Bob Smith      206 (MKT)   20684762

Inability to determine customer relationships
Inability to analyze employee benefits trends
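
A first pass at finding such duplicates might group identification numbers by a normalized name. The Python sketch below uses the customer rows from the slide. Matching on the name alone is a simplifying assumption (real matching would weigh more attributes), and `candidate_duplicates` is a hypothetical helper.

```python
from collections import defaultdict

# Customer rows from the slide: (name, phone number, customer number).
customers = [
    ("Philip K. Sherman", "818.357.5166", "960601"),
    ("Philip K. Sherman", "818.357.7711", "960105"),
    ("Philip K. Sherman", "818.357.8911", "960003"),
]


def candidate_duplicates(rows):
    """Group customer numbers by normalized name; keep names with more than one number."""
    numbers_by_name = defaultdict(set)
    for name, _phone, cust_number in rows:
        normalized = " ".join(name.upper().split())
        numbers_by_name[normalized].add(cust_number)
    return {
        name: sorted(numbers)
        for name, numbers in numbers_by_name.items()
        if len(numbers) > 1
    }


suspects = candidate_duplicates(customers)
```

The result is a list of candidates to review, not proof that the three records are the same person.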

Missing data relationships


Data that should be related to other data in a dependent (parent-child) relationship

Branch -> Employee -> Benefit

Branch number 0765 does not exist in the BRANCH table

Inability to produce accurate rollups
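
An orphan check of this kind can be sketched in a few lines of Python. Only branch number 0765 comes from the slide; the other sample rows, the field names, and the function `find_orphans` are hypothetical.

```python
def find_orphans(child_rows, parent_keys, foreign_key):
    """Return the child rows whose foreign key has no matching parent key."""
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


# Hypothetical BRANCH keys and EMPLOYEE rows; 0765 is the missing branch from the slide.
branch_numbers = {"0501", "0502"}
employees = [
    {"employee": "E1", "branch_number": "0501"},
    {"employee": "E2", "branch_number": "0765"},
]

orphans = find_orphans(employees, branch_numbers, "branch_number")
```

Any employee returned here would be dropped or misplaced in a rollup by branch.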

Inappropriate data relationships


Data that is inadvertently related, but should not be:
- two entity types with the same key values

Purchaser: Jackie Schmidt  837221
Seller:    Robert Black    837221

Inability to determine customer or vendor relationships

Impact of erroneous data


- Extra time it takes to correct data problems
- Extra resources needed to correct data problems
- Time and effort required to re-run jobs that abend
- Time wasted arguing over inconsistent reports
- Lost business opportunities due to unavailable data
- Unable to demonstrate business potential in a buyout
- Fines may be paid for noncompliance with government regulations
- Shipping products to the wrong customers
- Bad public relations with customers leads to alienated and lost customers

Cost of erroneous data


Direct Costs of Non-Quality Information: Marketing Campaign
(Source: Larry English, Improving DW and BI Quality)

Item                                  Per Instance  Number of Instances  Number Per Year  Total Cost Per Year
Time ($60/hour loaded rate):
  Creating redundant occurrence       2.4 min       167,141              1                $  401,138
  Researching correct address         10 min        5,000/mo             12               $  600,000
  Correcting address errors           0.3 min       6,000/mo             12               $   21,600
  Handling complaints from customers  5.5 min       974/yr               1                $    5,357
  Mail preparation                    0.1 min       393,273              4                $  157,309
Materials, Facilities, Equipment:
  Marketing brochure                  $1.96         393,273              4                $3,083,260
  Postage                             $0.52         393,273              4                $  818,008
  Warehouse storage                   $0.01         393,273              4                $   15,731
  Shipping equipment and maintenance  $5,000/yr     36%                  1                $    1,800
Computing resources:
  CPU transactions                    $0.02/trans   393,273              4                $   31,462
  Data storage                        $0.001/mo     393,273              12               $    4,719
  Data backup                         $0.005/mo     393,273              12               $   23,596

Total Annual Costs                                                                        $5,163,980
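
The arithmetic behind the table can be re-derived with a short Python sketch: each line is cost per instance times number of instances times occurrences per year, with minutes priced at the $60/hour loaded rate. The figures are the ones on the slide; the variable and function names are illustrative.

```python
DOLLARS_PER_MINUTE = 60 / 60  # $60/hour loaded rate from the slide

# (item, cost per instance in dollars, number of instances, number per year)
LINE_ITEMS = [
    ("Creating redundant occurrence", 2.4 * DOLLARS_PER_MINUTE, 167_141, 1),
    ("Researching correct address", 10 * DOLLARS_PER_MINUTE, 5_000, 12),
    ("Correcting address errors", 0.3 * DOLLARS_PER_MINUTE, 6_000, 12),
    ("Handling complaints from customers", 5.5 * DOLLARS_PER_MINUTE, 974, 1),
    ("Mail preparation", 0.1 * DOLLARS_PER_MINUTE, 393_273, 4),
    ("Marketing brochure", 1.96, 393_273, 4),
    ("Postage", 0.52, 393_273, 4),
    ("Warehouse storage", 0.01, 393_273, 4),
    ("Shipping equipment and maintenance", 5_000, 0.36, 1),  # 36% of $5,000/yr
    ("CPU transactions", 0.02, 393_273, 4),
    ("Data storage", 0.001, 393_273, 12),
    ("Data backup", 0.005, 393_273, 12),
]


def annual_cost(cost_per_instance, instances, times_per_year):
    """Yearly cost of one line item."""
    return cost_per_instance * instances * times_per_year


costs = {name: annual_cost(cost, n, freq) for name, cost, n, freq in LINE_ITEMS}
total_annual_cost = sum(costs.values())
```

The computed total agrees with the slide's $5,163,980 to within a couple of dollars, the difference coming from rounding each line to whole dollars. The marketing brochure and postage alone account for about three quarters of it.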

Impact of redundant data


- Hardware (CPU, disks) and software (program maintenance) costs incurred as a result of uncontrolled redundant data
- Extra time it takes to reconcile inconsistencies
- Extra resources needed to reconcile inconsistencies
- Unwise business decisions made due to redundant and inconsistent data
- Lost opportunities due to unreliable data
- Overcharging or overpayment for products
- Duplicate shipping of products
- Money wasted on sending redundant marketing material

Cost of redundant data


Information Development Cost Analysis
(Source: Larry English, Improving DW and BI Quality)

Category                                       Portfolio     Relative Weight  Average Unit     Total Dev/Maint  % of
                                               Total Number  Factor*          Dev/Maint Costs  Expenses**       Budget
Infrastructure Basis:
  Enterprise architected DBs                   200           0.75             $15,000          $ 3,000,000
  Enterprise reusable create/update programs+  300           1.50             $30,000          $ 9,000,000
  Total infrastructure expenses                                                                $12,000,000      24%
Value Basis:
  Total retrieve equivalent pgms+              300           1.00             $20,000          $ 6,000,000
  Total value-adding expenses                                                                  $ 6,000,000      12%
Cost-adding Basis:
  Redundant create/update pgms                 500           1.50             $30,000          $15,000,000
  Interface/extract programs                   400           1.00             $20,000          $ 8,000,000
  Redundant database files                     600           0.75             $15,000          $ 9,000,000
  Total cost-adding expenses                   1,500                                           $32,000,000      64%

Lifetime Total**                               3,800                                           $50,000,000      100%

* Determine relative effort to develop average unit of each category using effort to develop a retrieve program as 1.00
+ For programs that retrieve some data and create/update other data, determine the percent of retrieve only attributes and percent of create/update attributes (e.g., to retrieve customer data to create an order)
** Based on 3,800 application programs and database files in portfolio and $50 Million in development
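
The budget shares in the table follow from the unit counts and weight factors. In the Python sketch below, the counts and weights are taken from the slide, and the $20,000 base is the slide's unit cost for a retrieve program (weight 1.00). The names `PORTFOLIO` and `category_totals` are illustrative.

```python
BASE_UNIT_COST = 20_000  # unit cost at weight 1.00 (retrieve program) on the slide

# (category, item, number of units, relative weight factor)
PORTFOLIO = [
    ("infrastructure", "Enterprise architected DBs", 200, 0.75),
    ("infrastructure", "Enterprise reusable create/update programs", 300, 1.50),
    ("value-adding", "Retrieve equivalent programs", 300, 1.00),
    ("cost-adding", "Redundant create/update programs", 500, 1.50),
    ("cost-adding", "Interface/extract programs", 400, 1.00),
    ("cost-adding", "Redundant database files", 600, 0.75),
]


def category_totals(portfolio):
    """Sum units x weight x base cost for each category."""
    totals = {}
    for category, _item, units, weight in portfolio:
        totals[category] = totals.get(category, 0) + units * weight * BASE_UNIT_COST
    return totals


totals = category_totals(PORTFOLIO)
grand_total = sum(totals.values())
shares = {category: round(100 * amount / grand_total) for category, amount in totals.items()}
```

On these figures roughly two thirds of the development and maintenance budget goes to cost-adding work: redundant programs, interfaces, and redundant files.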

Dirty data: How did it happen?

[Organization chart: the Chief Executive Officer oversees the Business side (Chief Operating Officer) and the Technology side (Chief Information Officer). Each business unit (Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Sales), with its own clients and business manager, is paired with its own IT unit and technology manager.]

Result: data redundancy, process redundancy, dirty data

Major cause for data deficiencies


Priority of project constraints (highest to lowest):
1. TIME
2. SCOPE
3. BUDGET
4. PEOPLE
5. QUALITY

Wrong priority on project constraints!

Industrial Age: Cheaper, faster, better
- Automate as quickly as possible
- Cost-based value proposition

Time is getting shorter, scope is getting bigger

- Everyone on the business side and in IT wants quality, but rarely is the extra time given or taken to achieve it.
- Quality and time are polarized constraints.
- The higher the quality, the more effort (time) it takes to deliver.
- Companies are driven by shorter and shorter schedules.

How are we addressing it today?


- Data Warehousing
- Customer Relationship Management
- Enterprise Resource Planning
- Enterprise Application Integration
- Knowledge Management

Why can't technology fix this?

Ineffective Technology Solutions

Data Warehousing
DW delivers... a collection of integrated data used to support the strategic decision making process for the enterprise.

The Promise:
- data integration
- no redundancy
- consistency
- historical data
- ad-hoc reporting
- trend analysis reporting
- faster data delivery
- faster data access

The Reality:
- stovepipe marts
- departmental views
- swim lane development approach
- too time consuming to integrate
- too costly to cleanse data
- increased data redundancy

If it sounds too good to be true, it is too good to be true.

Customer Relationship Management


CRM delivers...
- seamless coordination between back-office systems, front-office systems, and the Web.
- the organizational lifeline, creating competitive advantage through customer service excellence.

The Promise:
- data integration
- data quality
- customer intimacy
- customer wallet share
- product pricing customization
- knowing your competition
- geographic market potential

The Reality:
- more stovepipe systems
- departmental views
- dirty customer data
- purchased packages not integrated
- focus is too narrow
- privacy issues

If it sounds too good to be true, it is too good to be true.

Enterprise Resource Planning


ERP delivers... a collection of functional modules used to integrate operational data to support seamless operational business processes for the enterprise.

The Promise:
- data integration
- no redundancy
- consistency
- data quality
- easy reporting
- easy maintenance
- Y2K compliance

The Reality:
- system conversion, not cross-organizational analysis
- same dirty data
- operational focus
- poor quality (unusable) reports
- one-size-fits-all data warehouse
- too costly

If it sounds too good to be true, it is too good to be true.

Enterprise Application Integration


EAI delivers... integration of disparate applications into a unified set of business processes through centrally managed rules and middleware technologies.

The Promise:
- fast & automated integration
- leverage existing data
- bridge islands of automation
- easy cross-system reporting
- faster data delivery
- faster data access

The Reality:
- dirty data
- no true integration
- still data redundancy
- still islands of automation
- easier access to the current data mess

If it sounds too good to be true, it is too good to be true.

Knowledge Management
KM delivers... a process for capturing, editing, verifying (for accuracy), disseminating, and utilizing tacit and explicit information about the organization.

The Promise:
- utilize organizational info
- data integration
- historical data
- faster data delivery
- faster data access
- first & only customer contact
- reduction of customer calls
- less re-solving same problems

Reality of KM:
- too difficult to build
- too time consuming
- too costly
- technology challenges
- non-sharing culture
- isolated applications
- difficult to disseminate information

If it sounds too good to be true, it is too good to be true.

What's the lesson?

You cannot keep doing what you have always done and expect the results to be different. Not even with new technology.

"That wouldn't be logical." - Spock, Star Trek

What do we have to change?


1. Assess the current state of data quality at your company
2. Understand and fix the root causes for data contamination
3. Perform data audits regularly (monthly, quarterly)
4. Stop working in isolated swim lanes
   > Stop recreating data
5. Centrally manage your data like a business asset (Enterprise Information Management [EIM])
   > Assemble data as needed from the data inventory (enterprise data model and meta data)
   > Standardize and reconcile data transformations for BI/DW applications (coordinated ETL staging area)
6. Scale down project scopes to incorporate data quality and EIM activities
7. Embed data quality and EIM activities in all projects

Business intelligence

- is a cross-organizational discipline and an enterprise architecture for an integrated collection of operational as well as decision support applications and databases, which provides the business community with easy access to its business data and allows it to make accurate business decisions.
- is not business as usual

BI goals and objectives


80% Data Management
- Get control over the existing data chaos
- Data Reengineering (Enterprise Information Management)

20% Data Delivery
- Provide intuitive access to business information

Proliferation of data quality problems


[Diagram: legacy systems (L) feed data warehouses (DW), data marts (DM), and "LegaMarts" (Doug Hackney), with transformation and cleansing left in question. Users in Marketing, Finance, Customer Support, Product, Sales, and Engineering each draw on their own stores.]

BI?

Industrial-age mental model


Priority of project constraints (highest to lowest):
1. TIME
2. SCOPE
3. BUDGET
4. PEOPLE
5. QUALITY

[Diagram: each business unit (Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Sales) and its clients is served by its own separate IT unit.]

Scrap and rework

The game has changed


... but our mental model has not

1. Enormous degree of complexity
2. Extremely high rate of change
(John Zachman)

Cheaper, faster, better!!! But how?
- Don't scrap and rework.
- Reuse what you already have.

Information-age mental model


Priority of project constraints (highest to lowest):
1. QUALITY
2. BUDGET
3. PEOPLE
4. TIME
5. SCOPE

Information Age: Reassemble the entire enterprise
- Reuse assets from inventory
- Reassemble reusable components
- Investment-based value proposition

Software release concept (1)


[Diagram: one application delivered through a series of projects, each producing a release (First Release, Second Release, Third Release, Fourth Release, Fifth Release, ... Final Release). The application is reusable and expanding. Related ideas: "Extreme scoping" (Larissa Moss) and "Refactoring" (Kent Beck).]

Project ≠ Application

Software release concept (2)


- Requirements can be tested, and implemented in small increments
- Scope is very small and manageable
- Technology infrastructure can be tested and proven
- Data volumes (per release) are relatively small
- Project schedules are easier to estimate because the scope is very small
- Development activities can be iteratively refined, honed, and adapted

AND: The quality of the release deliverables (and ultimately the quality of the applications) will be higher!

Cross-organizational development approach (1)

(Larissa Moss and Shaku Atre, Business Intelligence Roadmap)

Data Quality Touch Points

BI/DW Development Steps
 1.  Business Case Assessment ................. Cross-organizational
 2.A Enterprise Technical Infrastructure ...... Cross-organizational
 2.B Enterprise Non-Technical Infrastructure .. Cross-organizational
 3.  Project Planning ......................... Project-specific
 4.  Project Requirements Definition .......... Project-specific
 5.  Data Analysis ............................ Cross-organizational
 6.  Application Prototyping .................. Project-specific
 7.  Meta Data Repository Analysis ............ Cross-organizational
 8.  Database Design .......................... Cross-organizational
 9.  ETL Design ............................... Cross-organizational
10.  Meta Data Repository Design .............. Cross-organizational
11.  ETL Development .......................... Cross-organizational
12.  Application Development .................. Project-specific
13.  Data Mining .............................. Cross-organizational
14.  Meta Data Repository Development ......... Cross-organizational
15.  Implementation ........................... Project-specific
16.  Release Evaluation ....................... Cross-organizational

Cross-organizational development approach (2)


- Commitment to data quality embedded in the methodology
- Cross-organizational program management
- Enterprise information management group
- Standards that include a common information architecture (enterprise data model)
- Involving down-stream information consumers in the requirements definition step
- Involving data owners in the data analysis step
- Involving business representatives from all business units to ratify the data models and meta data
- Coordinating the development/ETL processes
  > Disallowing stovepipe development
  > Extracting and cleansing source data only once
  > Reconciling data transformations and storing the reconciliation totals as meta data

Enterprise information management


[Diagram: each business unit (Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Sales) and its clients is paired with an IT unit. Enterprise Information Management spans the operational environment (operational systems) and the decision support environment (BI/DW databases), with the ODS, OM, EDW, and DM databases shown.]

Enterprise Information Management: Discover, Coordinate, Integrate, Document, Control

EIM responsibilities
Business architecture inventory
- Process models
- Data models

Application inventory
- Programs
- Databases

Meta data inventory
- Business meta data
- Technical meta data

Policy inventory
- Standards
- Procedures
- Guidelines

IT asset inventory management

Architects and Managers: Discover, Coordinate, Integrate, Document, Control

Data stewardship
- Guardians of the data while it is being created or maintained by them
- Create standards and procedures to ensure that policies and business rules are known and followed
- Enforce adherence to policies and business rules that govern the data while the data is in their custody
- Periodically monitor (audit) the quality of the data in their custody
- Also known as custodians
- Can be a business person or an IT person

"One who manages another's property."

Data ownership
- Authority to establish policies and set business rules for the data under their control
- Decide what the official enterprise definition and domain are for the data under their control
- Monitor and advise other end users on proper usage of their data
- Frequently, but not always, the data originator
- Can be a person or a committee

"One who has the legal right to the possession of a property."

Enterprise architecture
Business Architecture
- Mission and Objective
- Business Principles
- Business Functions
- Program Management

Information Architecture
- Enterprise Data Model
  - Data Standardization
  - Data Integration
  - Data Reconciliation
  - Data Quality

Application Architecture
- Operational Applications
- Data Access Applications
- Data Analysis Applications
- Application Databases

Technology Architecture
- Technology Platform
- Network
- Middleware
- DBMS, Tools

Content; Storage & Presentation
1. Data Management: data integration, data cleansing
2. Data Delivery: data access, data manipulation

Enterprise data model (data inventory)


Top-Down

[Entity-relationship diagram. Entity labels include: Customer, Existing Customer, Potential Customer, Customer Account, Account, Payment, Payment Method, Order, Product, Product Category, Product Part, Part, Salesperson, Salaried Salesperson, Commissioned Salesperson, Org Unit, Org Structure, Supplier, Shipment, Warehouse.]

Supported by common data definitions, domains, and business rules.

Source data analysis


Bottom-Up

Domain Violations:
- Dummy values
- Intelligent dummy values
- Missing values
- Multi-purpose fields
- Cryptic values
- Free-form address lines

Integrity Violations:
- Contradicting values
- Violation of business rules
- Reused primary keys
- Non-unique primary keys
- Missing data relationships
- Inappropriate data relationships

To cleanse or not to cleanse: that is the question

- You probably cannot cleanse it all (takes too long)
- It may not be worth the time and money to cleanse every data element
- Not all data is equally significant
- Not all data can be cleansed
- How do you know what to cleanse?

Triaging questions (1)


Can the data be cleansed?
- Does the correct data exist anywhere?
- Is it easily accessible?

Should the data be cleansed?
- How extensive is the problem?
- How elaborate will the cleansing process be?
- Is it cost-effective?

Triaging questions (2)


Why are we building the application?
- What business questions cannot be answered today?

Why are we not able to answer the business questions?
- Is it because of this dirty data?
- Is it because of these missing relationships?

Will the benefits of cleansing outweigh the cost of the effort?

Categories of data significance


Critical data
- Not all data is equally critical to all end users
- All critical data must be cleansed
- Usually includes amount fields

Important data
- Important to the organization, but not absolutely critical
- Further prioritize important data elements
- Cleanse as many as time allows
- Those that cannot be cleansed should be bumped to critical for the next release

Insignificant data
- Informational data, which is nice to have
- Cleansing is optional if time allows

Business decision!

Cleansing, repairing, prevention

Where should the dirty data be cleansed?
- In the staging area of the BI application?
- In the source (legacy) files?

When should it be cleansed?
- Retroactively?
- At data entry time?

How should it be cleansed?
- Use data cleansing or ETL tools?
- Write procedural (COBOL/C++) code?

What will we do to prevent dirty data in the future?
- Source Data Reengineering
- Total [Data] Quality Management (TQM)

Coordinated ETL staging


[Diagram: legacy sources (L), which also produce operational reports, pass through a daily staging area (cleansing and transforms) into the Operational Data Store (ODS) and oper marts (OM) for tactical reports, and through a monthly staging area (cleansing and transforms) into the Enterprise Data Warehouse (EDW) and data marts (DM) for strategic reports. Other stores shown: analytical and operational CRM, EXW. Clients shown: Customer Support, Product Pricing, Finance, Marketing, Engineering, Legal. The whole flow rests on the Enterprise Architecture & Meta Data Repository.]

ETL process flow


[Flow diagram: a coordinated ETL process in five stages: Extract, Cleanse, Transform, Prepare, Load. Sources shown include the Sales File, Account Tran File, Customer Master, Customer Info File, and Prospects. Steps shown include Extract New Sales, Filter Accounts, Extract Accounts, Sort Accts, Match Accounts, Associate Accounts, Merge Customers, Merge Prospects, Sort Customers, Extract Prospects, and Profile Customers, with Account Errors as a reject output.]

ETL Reconciliation
[Diagram: legacy sources (L) feed the monthly staging area, which produces load files for the ODS (daily), the EDW (monthly), and the data marts (DM, monthly).]

ETL tie-outs: record counts

INPUT RECORDS -> PROCESS MODULE -> OUTPUT RECORDS + REJECTED RECORDS

# Input Records = # Output Records + # Rejected Records
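
The record-count tie-out can be sketched as a small Python check. The function name and the sample counts are hypothetical. In practice the three counts would be measured independently, for example from the input file, the load file, and the reject file.

```python
def record_count_discrepancy(input_count, output_count, rejected_count):
    """Return input minus (output + rejected); zero means the counts tie out."""
    return input_count - (output_count + rejected_count)


# Hypothetical counts for one process module.
balanced_run = record_count_discrepancy(10_000, 9_850, 150)
leaky_run = record_count_discrepancy(10_000, 9_800, 150)
```

A non-zero result, as in the second run, means records were lost or duplicated inside the module and the load should not be trusted until the gap is explained.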

ETL tie-outs: domain counts


INPUT CODES -> PROCESS MODULE -> OUTPUT CODES (first, second, and third output domain) + REJECTED CODES

# Records Per Input Domain = # Records Per First Output Domain + # Records Per Second Output Domain + # Records Per Third Output Domain + # Rejected Data Values
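
A domain-count tie-out compares how many records carried an input code with how many records landed in each output domain, plus rejects. The Python sketch below assumes the output codes have been read back from the targets after the load; the domain names, code values, and the function `domain_count_discrepancy` are hypothetical.

```python
from collections import Counter


def domain_count_discrepancy(input_codes, output_codes_by_domain, rejected_codes):
    """Return input count minus (sum of per-domain output counts + rejects)."""
    output_counts = Counter(
        {domain: len(codes) for domain, codes in output_codes_by_domain.items()}
    )
    return len(input_codes) - (sum(output_counts.values()) + len(rejected_codes))


# Hypothetical run: one cryptic input field split into three output domains.
input_codes = ["A", "B", "D", "G", "H", "X"]
output_codes_by_domain = {
    "customer_type": ["A", "B"],
    "supplier_type": ["D"],
    "regional_constraint": ["G", "H"],
}
rejected_codes = ["X"]

discrepancy = domain_count_discrepancy(input_codes, output_codes_by_domain, rejected_codes)
```

A discrepancy of zero says every input code was either mapped to exactly one output domain or rejected.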

ETL tie-outs: amount counts


INPUT AMOUNTS -> PROCESS MODULE -> OUTPUT AMOUNTS (first and second output amount) + REJECTED AMOUNTS

Total $ Input Amounts = Total $ Per First Output Amount + Total $ Per Second Output Amount + Total $ Rejected Amounts

Total $ Per First Input Amount + Total $ Per Second Input Amount + Total $ Per Rejected Amounts

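
An amount tie-out works the same way as the count tie-outs, except that it sums money. The Python sketch below uses `Decimal` so that cents do not drift; the function name and the sample totals are hypothetical.

```python
from decimal import Decimal


def amount_discrepancy(input_total, output_totals, rejected_total):
    """Return input total minus (all output totals + rejected total); zero ties out."""
    return input_total - (sum(output_totals, Decimal("0")) + rejected_total)


# Hypothetical totals: one input amount field split into two output amounts.
difference = amount_discrepancy(
    Decimal("1000.00"),
    [Decimal("600.25"), Decimal("350.00")],
    Decimal("49.75"),
)
```

Storing the resulting totals with each load, as the earlier slide on coordinated ETL suggests for reconciliation totals, makes it possible to show later that no money went missing between source and target.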

Data quality improvements


- Source data repairs
- Increased program edits
- Enhanced data entry procedures
- Improved data quality training
- Regular data audits
- Data usage monitoring
- Enterprise-wide end user surveys
- Continuous validation of enterprise data model
- Continuous validation of meta data, especially definitions and domains
- Involvement of data owners, information consumers, and business sponsors

Data quality maturity


At what level of DQ maturity is your organization? (Scale of 1 .. 5)

1. Discovery by accident: program abends
2. Limited data analysis: data profiling, data cleansing during ETL
3. Addressing root causes: repairing source data and programs (short term)
4. Proactive prevention: enterprise-wide DQ methods & techniques
5. Optimization: continuous DQ process improvements (long term)

DQ capability maturity model (1)


(Source: Larry English)

CMM Level 1. Uncertainty - Unconscious and unaware
- Data quality problems are denied.
- No formal data quality processes defined.
- Data quality initiatives are ad hoc and chaotic.
- Any success is dependent on individual efforts.

CMM Level 2. Awakening - The big "Aha!" and lip service
- Data quality problems are acknowledged.
- Major problems are attacked as they come up.
- Minimum funding for a formal data quality initiative.
- Capability is a characteristic of the individual rather than the organization.

DQ capability maturity model (2)


(Source: Larry English)

CMM Level 3. Enlightenment - Let's do something

Data quality initiative takes off.
Enterprise-wide data quality assessment is performed.
Data quality problems are corrected at the source (where possible).
Data quality improvement process is institutionalized.

CMM Level 4. Wisdom - Making a difference

Management accepts personal responsibility for data quality.
Data quality group reports to a chief officer (CIO, CKO, COO).
Data quality correction changes to data defect prevention.
All business areas are involved.


DQ capability maturity model (3)


(Source: Larry English)

CMM Level 5. Certainty - Nirvana

Data defect prevention is the main focus.
Data quality is an integral part of the business processes.
All business areas are continuously improving the processes.
The culture of the organization has changed.


Organizational impact
Cross-organizational tasks and responsibilities are not well defined
Data quality responsibility is not clear or ignored
Value of data is not understood or appreciated
Projects are often cost justified using the industrial-age mental model
Resource requirements are not well defined
Impact on application development empire
No reward for data sharing
Resistance to change


Organizational changes
Business and IT collaboration (partnership)
Business and business collaboration (partnership)
IT and IT collaboration (partnership)
Increased end user involvement
Cross-organizational activities
Architecture and standardization
Software release concept
New charge-back system
New incentives
New leadership


New leadership
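[Organization chart: CEO at the top, with CFO, COO, CKO, and CTO on the next level. Below them are LOB Execs, EIM, and IT Execs (EA, ...), linked by "collaboration" arrows. DA, DQA, and MDA appear under EIM.]

CKO = Chief Knowledge Officer
EIM = Enterprise Information Management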


How do we change? 12 steps to [DQ] recovery (1)


1. Become aware
Every cultural transformation process begins with an Aha.
Understand the root causes for your current data chaos.

2. Accept responsibility
Yes, it is our fault for being in this mess.
Accepting responsibility is a prerequisite for change.


12 steps to [DQ] recovery (2)


3. Decide to change
Now that you know better, the decision is yours: Stay stuck or change.
There can be no more false hopes for any silver bullet technology solutions.

4. Identify root causes
What are the specific root causes for nonquality data in your organization?
Some root causes are common, some are not.


12 steps to [DQ] recovery (3)


5. Collaborate
It doesn't matter whose fault it is that the root causes exist.
IT must collaborate with the business community to effect changes.
Business units must also collaborate with one another.

6. Identify change agents
Who will be the couriers?
Changes must be systemic and holistic, not isolated and sporadic.

12 steps to [DQ] recovery (4)


7. Spread the word
To embrace changes, there must be something in it for everybody.
Otherwise, changes trigger anxiety and anxiety results in resistance or rejection.

8. Plan changes
Big changes do not get implemented in one Big Bang.
Involve people in change planning.
Cross-organizational changes are phased in.

12 steps to [DQ] recovery (5)


9. Prioritize changes
Some changes are easier to implement than others.
Some changes have a higher payback.

10. Implement changes
Everyone affected by the changes must have an opportunity to review and approve the plan before implementation.


12 steps to [DQ] recovery (6)


11. Measure effectiveness
Solicit feedback from the trenches.
Are the changes affecting anyone adversely?

12. Refine changes
Nothing is perfect the first time around.
What might work in one organization may not work in another.


Bibliography
Adelman, Sid, and Larissa Terpeluk Moss. Data Warehouse Project Management. Boston, MA: Addison-Wesley, 2000.
Aiken, Peter H. Data Reverse Engineering: Slaying the Legacy Dragon. New York: McGraw-Hill, 1995.
Brackett, Michael H. Data Resource Quality: Turning Bad Habits into Good Practices. Boston, MA: Addison-Wesley, 2000.
Brackett, Michael H. The Data Warehouse Challenge: Taming Data Chaos. New York: John Wiley & Sons, 1996.
English, Larry P. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. New York: John Wiley & Sons, 1999.
Hoberman, Steve. Data Modeler's Workbench: Tools and Techniques for Analysis and Design. New York: John Wiley & Sons, 2001.
Huang, Kuan-Tsae, Yang W. Lee, and Richard Y. Wang. Quality Information and Knowledge Management. Upper Saddle River, NJ: Prentice Hall, 1998.
Marco, David. Building and Managing the Meta Data Repository: A Full Lifecycle Guide. New York: John Wiley & Sons, 2000.
Moss, Larissa T., and Shaku Atre. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Boston, MA: Addison-Wesley, 2003.
Reingruber, Michael C., and William W. Gregory. The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models. New York: John Wiley & Sons, 1994.
Ross, Ronald G. The Business Rule Concepts. Houston, TX: Business Rule Solutions, Inc., 1998.
Simsion, Graeme. Data Modeling Essentials: Analysis, Design, and Innovation. Boston, MA: International Thomson Computer Press, 1994.
Von Halle, Barbara. Business Rules Applied: Building Better Systems Using the Business Rules Approach. New York: John Wiley & Sons, 2001.