0% found this document useful (0 votes)
63 views43 pages

10 Data Quality and Integration

The document discusses data quality and integration. It defines key terms like data governance and data stewardship. It describes the importance of data quality for business success and outlines characteristics of quality data. The document also examines causes of poor data quality and provides steps for improving data quality, such as establishing a data stewardship program and improving data capture processes.

Uploaded by

hiro
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
63 views43 pages

10 Data Quality and Integration

The document discusses data quality and integration. It defines key terms like data governance and data stewardship. It describes the importance of data quality for business success and outlines characteristics of quality data. The document also examines causes of poor data quality and provides steps for improving data quality, such as establishing a data stewardship program and improving data capture processes.

Uploaded by

hiro
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 43

Data Quality and

Integration
IT 221
Information
SLIDESMANIA.COM

Management
● Define terms
● Describe importance and goals
of data governance
● Describe importance and
measures of data quality
Objectives ● Define characteristics of
quality data
● Describe reasons for poor data
quality in organizations
● Describe a program for
improving data quality
SLIDESMANIA.COM
● Describe three types of data
integration approaches
● Describe the purpose and role
of master data management
● Describe four steps and
Objectives activities of ETL for data
integration for a data
warehouse
● Explain various forms of data
transformation for data
warehouses
SLIDESMANIA.COM
● Data governance
○ High-level organizational

Data groups and processes


overseeing data
Governance stewardship across the
organization
SLIDESMANIA.COM
● Data steward
○ A person responsible for

Data ensuring that


organizational
Governance applications properly
support the
organization’s data
quality goals
SLIDESMANIA.COM
For the Data Steward:
● Resolving Data Integration Issues
● Determining Data Security
Roles ● Documenting Data Definitions,
and Calculations, Summarizations,
etc.
Responsibilities
● Maintaining/Updating Business
Rules
● Analyzing and Improving Data
Quality
● Implementing Data Security
SLIDESMANIA.COM

Requirements
For the Data Administrators:
● Translating the Business Rules
into Data Models
Roles ● Maintaining Conceptual, Logical
and and Physical Data Models
● Assisting in Data Integration
Responsibilities
Resolution
● Maintaining Meta data Repository
SLIDESMANIA.COM
For the Database Administrators:
● Generating Physical DB Schema
● Performing Database Tuning
Roles ● Creating Database Backups
and ● Planning for Database Capacity

Responsibilities
SLIDESMANIA.COM
● Sponsorship from both senior
management and business units
● A data steward manager to
Requirements support, train, and coordinate
for Data data stewards
● Data stewards for different
Governance business units, subjects, and/or
source systems
● A governance committee to
provide data management
guidelines and standards
SLIDESMANIA.COM
● If the data are bad,
the business fails.
Importance Period.
of Data ○ GIGO – garbage in,
garbage out
Quality ○ Sarbanes-Oxley (SOX)
compliance by law
sets data and
metadata quality
standards
SLIDESMANIA.COM
● Purposes of data
quality
Importance ○ Minimize IT project
risk
of Data
○ Make timely business
Quality decisions
○ Ensure regulatory
compliance
SLIDESMANIA.COM

○ Expand customer
base
● Uniqueness
● Accuracy
● Consistency
● Completeness
Characteristics ● Timeliness
of Data Quality ● Currency
● Conformance
● Referential integrity
SLIDESMANIA.COM
● External data sources
Causes ○ Lack of control over
data quality
of Poor ● Redundant data
storage and
Data inconsistent metadata

Quality ○ Proliferation of
databases with
uncontrolled
redundancy and
SLIDESMANIA.COM

metadata
● Data entry
Causes ○ Poor data capture
controls
of Poor ● Lack of organizational
commitment
Data ○ Not recognizing poor
data quality as an
Quality organizational issue
SLIDESMANIA.COM
● Get business buy-in
● Perform data quality audit
● Establish data stewardship
program
Steps in Data ● Improve data capture processes
● Apply modern data management
principles and technology
Quality ● Apply total quality management
(TQM) practices
Improvement
SLIDESMANIA.COM
● Executive sponsorship
● Building a business case
● Prove a return on investment

Business (ROI)
● Avoidance of cost

Buy-in ● Avoidance of opportunity loss


SLIDESMANIA.COM
● Statistically profile all data files
● Document the set of values for all
Data fields
● Analyze data patterns

Quality (distribution, outliers,


frequencies)

Audit ● Verify whether controls and


business rules are enforced
● Use specialized data profiling
tools
SLIDESMANIA.COM
● Roles:
○ Oversight of data
Data stewardship program
Stewardship ○ Manage data subject
Program area
○ Oversee data definitions

○ Oversee production of
data
SLIDESMANIA.COM

○ Oversee use of data


● Report to: business unit
vs. IT organization?
Data
Stewardship
Program
SLIDESMANIA.COM
● Automate data entry as much as
possible
Improving ● Manual data entry should be

Data selected from preset options


● Use trained operators when
Capture possible
● Follow good user interface
Process design principles
● Immediate data validation for
entered data
SLIDESMANIA.COM
● Software tools for
Apply
analyzing and correcting
Modern Data
data quality problems:
Management
○ Pattern matching
Principles
and ○ Fuzzy logic

Technology ○ Expert systems


● Sound data modeling
SLIDESMANIA.COM

and database design


● TQM – Total Quality
Management
TQM ● TQM Principles:
Principles ○ Defect prevention
and
○ Continuous improvement
Management
○ Use of enterprise data
standards
○ Strong foundation of
SLIDESMANIA.COM

measurement
● Balanced focus
TQM ○ Customer
Principles ○ Product/Service
and
Management
SLIDESMANIA.COM
● Disciplines,
technologies, and
methods to ensure the
Master Data
currency, meaning, and
Management
quality of reference
data within and across
various subject areas
SLIDESMANIA.COM
● Three main architectures
○ Identity registry – master data remains
in source systems; registry provides

Master Data applications with location

Management ○ Integration hub – data changes


broadcast through central service to
subscribing databases

○ Persistent – central “golden record”


maintained; all applications have
access. Requires applications to push
data. Prone to data duplication.
SLIDESMANIA.COM
● Data integration creates a unified
view of business data
● Other possibilities:

Data ○ Application integration

Integration ○ Business process integration

○ User interaction integration


SLIDESMANIA.COM
● Any approach requires changed
data capture (CDC)

○ Indicates which data have


Data changed since previous data
integration activity
Integration
SLIDESMANIA.COM
● Consolidation (ETL)

○ Consolidating all data into a


centralized database (like a
data warehouse)
Techniques ● Data federation (EII)

for Data ○ Provides a virtual view of


data without actually creating
Integration one centralized database
● Data propagation (EAI and EDR)

○ Duplicate data across


databases, with near real-
SLIDESMANIA.COM

time delay
We will talk
about this first.
SLIDESMANIA.COM
● Typical operational data is:

○ Transient–not historical

Reconciled ○ Not normalized (perhaps due


to denormalization for
Data Layer performance)

○ Restricted in scope–not
comprehensive

○ Sometimes poor quality–


SLIDESMANIA.COM

inconsistencies and errors


● After ETL, data should be:

○ Detailed–not summarized yet

○ Historical–periodic

Reconciled ○ Normalized–3rd normal form or


higher
Data Layer ○ Comprehensive–enterprise-wide
perspective

○ Timely–data should be current


enough to assist decision-making
SLIDESMANIA.COM

○ Quality controlled–accurate with


full integrity
● Capture/Extract
● Scrub or data cleansing
● Transform

The ETL ● Load and Index

Process ETL = Extract, transform, and load

 During initial load of Enterprise Data Warehouse


(EDW)
SLIDESMANIA.COM

 During subsequent periodic updates to EDW


Capture/Extract…obtaining a snapshot of a chosen subset of the source data for
loading into the data warehouse

Steps in data
reconciliation

Great Idea!
Static extract = capturing a snapshot Incremental extract = capturing
of the source data at a point in time changes that have occurred since the
last static extract
SLIDESMANIA.COM

3333
Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade
data quality

Steps in data
reconciliation
(cont.)

Great Idea!
Fixing errors: misspellings, erroneous Also: decoding, reformatting, time
dates, incorrect field usage, stamping, conversion, key generation,
SLIDESMANIA.COM

mismatched addresses, missing data, merging, error detection/logging,


duplicate data, inconsistencies locating missing data

3434
Transform … convert data from format of operational system to
format of data warehouse

Steps in data
reconciliation
(cont.)

Record-level: Field-level:
SLIDESMANIA.COM

Selection–data partitioning single-field–from one field to one field


Joining–data combining multi-field–from many fields to one, or
Aggregation–data summarization one field to many
Load/Index…place transformed data into the
warehouse and create indexes

Steps in data
reconciliation
(cont.)
SLIDESMANIA.COM

Refresh mode: bulk rewriting Update mode: only changes in


of target data at periodic intervals source data are written to data
warehouse
● Selection – the process of
partitioning data according to
predefined criteria
Record Level ● Joining – the process of
Transformation combining data from various
sources into a single table or
Functions
view
SLIDESMANIA.COM
● Normalization – the process of
decomposing relations with
anomalies to produce smaller,
Record Level well-structured relations
Transformation ● Aggregation – the process of
transforming data from detailed
Functions
to summary level
SLIDESMANIA.COM
Single-field transformation
a) Basic Representation

In general, some transformation function translates data


SLIDESMANIA.COM

from old form to new form


Single-field transformation (cont.)

b) Algorithmic

Algorithmic transformation uses a formula or logical


SLIDESMANIA.COM

expression

40
Single-field transformation (cont.)
c) Table lookup
SLIDESMANIA.COM

Table lookup uses a separate table keyed by source record


code
Multi-field transformation
a) Many sources to one target
SLIDESMANIA.COM
Multi-field transformation (cont.)
b) One source to many targets
SLIDESMANIA.COM

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy