Lecture 3

Data Warehouse

Objective
• Survey broadly the various aspects of the data extraction, transformation, and loading (ETL) functions
• Examine the data extraction function, its challenges, and its techniques
• Review the tasks and types of the data transformation function
• Understand the meaning of data integration and consolidation
• Perceive the importance of the data load function and probe the major methods for applying data to the warehouse
• Gain a true insight into why ETL is crucial, time-consuming, and arduous
Challenges

1. To change data into information, first capture the data.
2. Perform transformations so that the data is fit to be converted into strategic information.
3. The data is still not useful to end users until it is moved to the data warehouse repository.
Challenges
• The DWH environment has three functional areas:
  1. Data acquisition
  2. Data storage
  3. Information delivery
• ETL functions are challenging primarily because of:
  1. The nature of the source systems
  2. The disparities among the source operational systems
Challenges
• Source systems are very diverse and disparate.
• You must deal with source systems on multiple platforms and different operating systems.
• Many sources are older legacy applications.
• Historical information is critical in a data warehouse.
• The quality of data is dubious in many old source systems that have evolved over time.
• Source system structures keep changing over time because of new business conditions.

Challenges
• Lack of consistency: for example, salary may be represented as a monthly salary, a weekly salary, or a bimonthly salary in different source payroll systems (see the sketch after this list).
• For inconsistent data, there is often no ready means for resolving mismatches.
• Most source systems do not represent data in types or formats that are meaningful to users; many representations are cryptic and ambiguous.

ETL
• The process of culling out the data required for the data warehouse from the source systems.
• Getting data out of the source and loading it into the data warehouse is not simply a process of copying data from one database to another.
• Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
ETL
• Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.
• When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation.
• ETL may involve some degree of cleansing and transformation.
• ETL requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.
Time Consuming and Arduous
• Designing the ETL functions, testing the various processes, and deployment can take 50% to 70% of the project effort.
• Data extraction effort depends on the nature and complexity of the source systems.
• The metadata must contain information on every database and data structure.
• You need detailed information, including database size.
• You must establish the time window for extracting the data.
• You must determine the mechanism for capturing the changes.

Time Consuming and Arduous
• Transformation involves many methods within the transformation activity (a sketch follows):
  • Reformat internal data structures, re-sequence data, apply various forms of conversion techniques, supply default values wherever values are missing, and design the whole set of aggregates needed for performance improvement.
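A minimal sketch of a few of these transformation tasks in Python, with hypothetical record fields: it supplies a default for a missing value, reformats a date, re-sequences the rows, and builds a simple aggregate.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source records; the field names are illustrative assumptions.
records = [
    {"cust_id": 1, "region": "N", "sale_date": "31/12/2024", "amount": 120.0},
    {"cust_id": 2, "region": None, "sale_date": "01/01/2025", "amount": 80.0},
]

def transform(rec):
    out = dict(rec)
    # Supply a default value where the region is missing.
    out["region"] = rec["region"] or "UNKNOWN"
    # Reformat the date from DD/MM/YYYY to the ISO format assumed for the DW.
    out["sale_date"] = datetime.strptime(rec["sale_date"], "%d/%m/%Y").date().isoformat()
    return out

# Re-sequence the transformed rows by sale date.
transformed = sorted((transform(r) for r in records), key=lambda r: r["sale_date"])

# Build a simple aggregate (total sales per region) for query performance.
totals = defaultdict(float)
for r in transformed:
    totals[r["region"]] += r["amount"]
print(dict(totals))  # {'UNKNOWN': 80.0, 'N': 120.0}
```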

Time Consuming and Arduous
• Initial loading can populate millions of rows in the data warehouse database.
• Even more difficult is the task of testing and applying the load images to actually populate the physical files in the data warehouse.
• It may take two or more weeks to complete the initial physical loading.

Why ETL
• There are many reasons for adopting ETL in the organization:
1. It helps companies analyze their business data to make critical business decisions.
2. Transactional databases cannot answer the complex business questions that can be answered with ETL; for example, a data warehouse provides a common data repository for analysis.
3. ETL provides a method of moving data from various sources into a data warehouse.
Why ETL
• A well-designed and documented ETL system is almost essential to the success of a data warehouse project.
• It allows verification of data transformation, aggregation, and calculation rules.
• It allows sample data comparison between the source and the target system.
• It can perform complex transformations, though it requires an extra area (the staging area) to store the data.
• It helps migrate data into a data warehouse, converting the various formats and types to adhere to one consistent system.
Why ETL
• ETL is a predefined process for accessing source data and manipulating it into the target database.
• ETL in a data warehouse offers deep historical context for the business.
• It helps improve productivity because it codifies and reuses transformation logic without requiring additional technical skills.
Refreshment Workflow
ETL Requirements and Steps
• The most underestimated process in DW development
• The most time-consuming process in DW development
  • 50-70% of development time is spent on ETL!
• Extract: extract the relevant data
• Transform: transform the data to DW format
  • Build keys, etc.
  • Cleanse the data
• Load: load the data into the DW
  • Build aggregates, etc.
(A minimal end-to-end sketch of these three steps follows.)
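A minimal, self-contained sketch of extract, transform, and load in Python, using in-memory SQLite databases as stand-ins for both the source and the warehouse; all table and column names are illustrative assumptions.

```python
import sqlite3

# Stand-in source (OLTP) and warehouse databases; schemas are assumptions.
src = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

src.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "us"), (2, 20.0, None), (3, 5.0, "de")])
dw.execute("CREATE TABLE fact_orders (order_key INTEGER, amount REAL, country TEXT)")

# Extract: pull the relevant rows from the source.
rows = src.execute("SELECT id, amount, country FROM orders").fetchall()

# Transform: cleanse (default missing country) and reformat to DW conventions.
cleansed = [(oid, amt, (country or "unknown").upper()) for oid, amt, country in rows]

# Load: apply the transformed rows to the warehouse table.
dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleansed)
dw.commit()

print(dw.execute("SELECT * FROM fact_orders").fetchall())
# [(1, 10.0, 'US'), (2, 20.0, 'UNKNOWN'), (3, 5.0, 'DE')]
```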
ETL Requirements and Steps
ETL Requirements and Steps
• For the initial bulk refresh as well as for the incremental data loads, the sequence is simply as noted:
  • Triggering for incremental changes, filtering for refreshes and incremental loads, data extraction, transformation, integration, cleansing, and applying to the data warehouse database.
ETL Architecture
Data Staging Area (DSA)
• Transit storage for data in the ETL process
  • Transformations/cleansing done here
  • No user queries
  • Sequential operations on large data volumes
  • Performed by central ETL logic
  • No need for locking, logging, etc.
  • RDBMS or flat files? (DBMSs have become better at this)
  • Finished dimensions are copied from the DSA to the relevant marts
Data Staging Area (DSA)
• Allows centralized backup/recovery
  • It is often too time-consuming to re-run the initial load of all data marts after a failure
  • Backup/recovery facilities are needed
  • Better to provide these centrally in the DSA than in all data marts
Key Factors
• The complexity of the data extraction and transformation functions
• The diversity of the sources
• The data loading function
• Data refresh: load jobs can run too long
  • A full refresh requires a proper time window
  • An incremental refresh needs a special technique to capture changes at the proper time
Data Extraction
• Two major factors for a data warehouse:
1. Extract data from many disparate sources.
2. Extract data on the changes, for ongoing incremental loads as well as for a one-time initial full load.
Data Extraction
• The previous factors, therefore, warrant the use of third-party data extraction tools in addition to in-house programs or scripts.
  • Third-party tools are more expensive but record their own metadata.
• In-house programs increase the cost of maintenance and are hard to maintain as source systems change.
• Effective data extraction is a key to the success of your data warehouse.
Third-Party Tools
1. Teradata Warehouse Builder from Teradata
2. DataStage from Ascential Software
3. SAS System from SAS Institute
4. PowerMart/PowerCenter from Informatica
5. Sagent Solution from Sagent Software
List of Issues in Data Extraction

1. Source identification
2. Method of extraction
3. Extraction frequency
4. Time window
5. Job sequencing
6. Exception handling
Source Identification
Data in Operational Systems
• Current value
  • Applies to most attributes
  • Holds the value at the current time
  • May change as transactions occur
• Periodic status
  • The value of the attribute and its history are preserved
  • Each value is stored with a reference to time
  • For an insurance policy, each event, such as claim initiation, verification, assessment, and settlement, is usually recorded at the point of time when something in the policy changes (see the sketch after this list).
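A minimal sketch of the two representations with hypothetical fields: a current-value record holds only the latest state, while a periodic-status history keeps every status with its time reference, so any past state can be reconstructed.

```python
from datetime import date

# Current value: only the latest state of the policy is kept.
policy_current = {"policy_id": "P-100", "status": "settled"}

# Periodic status: every status change is preserved with its time reference.
policy_history = [
    {"policy_id": "P-100", "status": "claim_initiated", "on": date(2024, 1, 5)},
    {"policy_id": "P-100", "status": "verified",        "on": date(2024, 1, 12)},
    {"policy_id": "P-100", "status": "assessed",        "on": date(2024, 2, 1)},
    {"policy_id": "P-100", "status": "settled",         "on": date(2024, 2, 20)},
]

def status_as_of(history, when):
    """Reconstruct the policy status at any past date from the history."""
    events = [e for e in history if e["on"] <= when]
    return max(events, key=lambda e: e["on"])["status"] if events else None

print(status_as_of(policy_history, date(2024, 1, 31)))  # verified
```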
Data in Operational Systems
Data Extraction
• The data warehouse must be kept updated so that the history of changes and statuses is reflected in the DWH.
• There are two major types of data extraction techniques:
1. Static data (as is)
2. Data of revisions
Data Extraction
• Static data (as is): capture of data at a given point in time.
  • For current and periodic data, includes all transient data identified for extraction.
  • Used for the initial load.
• Data of revisions (incremental data capture)
  • Captures the revisions since the last time data was captured.
  • If the source data is transient/current, then capture of revisions is not easy.
  • If periodic, extracts the statuses and events that have been recorded since the last data extraction.
Incremental Data Capture
• Immediate incremental data capture
• Deferred data capture
Immediate Data Capture
• Happens in real time.
• Uses three strategies:
1. Capture through transaction log
2. Capture through database trigger
3. Capture in source application
Capture Through Transaction Log File
• The DBMS writes a log file to recover from failures.
• Read the log file and select all committed transactions.
• The log contains all the changes to the various source database tables.
• Replication can be used to capture changes to source data (see the sketch after this list):
1. Identify the source system database table
2. Identify and define target files in the staging area
3. Create the mapping between the source table and target files
4. Define the replication mode
Capture Through Transaction Log File
5. Schedule the replication process
6. Capture the changes from the transaction logs
7. Transfer captured data from logs to target files
8. Verify transfer of data changes
9. Confirm success or failure of replication
10. In metadata, document the outcome of replication
11. Maintain definitions of sources, targets, and mappings
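A minimal sketch of log-based capture, assuming a hypothetical change_log table that stands in for the DBMS transaction log; a real system would read the actual log or use a replication tool, so every name here is an illustrative assumption.

```python
import sqlite3

# Hypothetical source with a change_log table standing in for the DBMS log.
src = sqlite3.connect(":memory:")
src.execute("""CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY, tbl TEXT, op TEXT, pk INTEGER, payload TEXT)""")
src.executemany("INSERT INTO change_log VALUES (?, ?, ?, ?, ?)", [
    (1, "customer", "INSERT", 7, '{"name": "Ada"}'),
    (2, "customer", "UPDATE", 7, '{"name": "Ada L."}'),
])

last_applied_seq = 0  # checkpoint kept in ETL metadata between runs

# Capture only the changes recorded after the last replication run.
changes = src.execute(
    "SELECT seq, tbl, op, pk, payload FROM change_log WHERE seq > ? ORDER BY seq",
    (last_applied_seq,),
).fetchall()

# Transfer the captured changes to a target file in the staging area.
with open("staging_customer_changes.txt", "w") as target:
    for seq, tbl, op, pk, payload in changes:
        target.write(f"{seq}\t{tbl}\t{op}\t{pk}\t{payload}\n")

last_applied_seq = changes[-1][0] if changes else last_applied_seq
print(f"Replicated {len(changes)} changes; new checkpoint = {last_applied_seq}")
```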
Immediate Data Capture
Capture Through Trigger
• Triggers are special stored procedures that are stored in the database and fired when certain predefined events occur.
• They execute actions on INSERT/UPDATE/DELETE.
• Operational applications need not be changed.
• Triggers enable real-time update of the DW.
• For example, if you need to capture all changes to the records in the customer table, write a trigger program to capture all updates and deletes in that table, as sketched below.
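A minimal sketch of the customer-table example using SQLite triggers; the table and trigger names are illustrative assumptions. Each update or delete writes a row to a change table that an ETL job can later extract.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
-- Change table read later by the ETL extraction job (illustrative name).
CREATE TABLE customer_changes (id INTEGER, op TEXT, changed_at TEXT);

-- Fire on updates and deletes; the application code is untouched.
CREATE TRIGGER trg_customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes VALUES (OLD.id, 'UPDATE', datetime('now'));
END;

CREATE TRIGGER trg_customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_changes VALUES (OLD.id, 'DELETE', datetime('now'));
END;
""")

db.execute("INSERT INTO customer VALUES (1, 'Ada')")
db.execute("UPDATE customer SET name = 'Ada L.' WHERE id = 1")
db.execute("DELETE FROM customer WHERE id = 1")

print(db.execute("SELECT * FROM customer_changes").fetchall())
# [(1, 'UPDATE', '...'), (1, 'DELETE', '...')]
```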
Capture in Source Application
• The source application is made to assist in the data capture for the data warehouse.
• You have to modify the relevant application programs that write to the source files and databases (a sketch follows).
• Works for all types of updates and systems.
• Drawback: operational applications must be changed, which adds operational overhead.
Deferred Data Extraction

• The techniques under deferred data extraction do not capture the changes in real time; the capture happens later.
• There are two types (a timestamp-based sketch follows this list):
1. Capture based on date and time stamp
   • A timestamp is recorded for each update.
   • Captures the latest state of the source data.
2. Capture by comparing files (snapshot differential)
   • Compares copies of the data; simple and straightforward.
   • May be the only feasible option for some legacy data sources that have neither transaction logs nor time stamps on source records.
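A minimal sketch of timestamp-based capture, assuming a hypothetical updated_at column on the source rows; each run extracts only the rows touched since the last extraction time.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE product (id INTEGER, price REAL, updated_at TEXT)")
src.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (1, 9.99, "2025-01-10T08:00:00"),
    (2, 4.50, "2025-02-01T12:30:00"),
    (3, 7.25, "2025-02-03T09:15:00"),
])

# Time of the previous extraction, kept in ETL metadata between runs.
last_extract_time = "2025-01-31T00:00:00"

# Deferred capture: pull only rows updated since the last extraction.
# ISO-formatted timestamps compare correctly as strings.
delta = src.execute(
    "SELECT id, price, updated_at FROM product WHERE updated_at > ?",
    (last_extract_time,),
).fetchall()

print(delta)  # rows 2 and 3 only; row 1 was already extracted earlier
```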
Snapshot Differential
• F1: the existing set of data in the data warehouse.
  • Each entry of snapshot F1 is denoted by a tuple <ki, bi>, where ki is the key and bi is the set of fields.
• F2: the data source as a snapshot.
  • Each entry of snapshot F2 is denoted by a tuple <ki, bj>, where ki is the key and bj is the corresponding set of fields holding the present value of the tuple.
Snapshot Differential
For each key ki in F2, find the tuple with the matching key in F1 (the sketch below implements this comparison):
1. If we find a tuple <ki, bi> in F1 and bj differs from bi, the tuple in F1 needs to be updated:
   UPDATE <ki, bj> in F1
2. If there is no tuple in F1 corresponding to the tuple in F2, insert the tuple into F1:
   INSERT <ki, bj> into F1
3. If there is no tuple in F2 with key ki corresponding to a tuple <ki, bi> in F1, the tuple in F1 has to be deleted:
   DELETE <ki, bi> from F1
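A minimal sketch of this comparison in Python, representing each snapshot as a dict from key ki to its field set; the sample values are illustrative.

```python
# F1 is the warehouse snapshot; F2 is the current source snapshot.
F1 = {1: {"name": "Ada", "city": "London"},
      2: {"name": "Alan", "city": "Leeds"},
      3: {"name": "Mary", "city": "York"}}
F2 = {1: {"name": "Ada", "city": "Paris"},
      2: {"name": "Alan", "city": "Leeds"},
      4: {"name": "Grace", "city": "Hull"}}

# Case 1: key in both snapshots but fields differ -> UPDATE.
updates = {k: b for k, b in F2.items() if k in F1 and F1[k] != b}
# Case 2: key only in the source snapshot -> INSERT.
inserts = {k: b for k, b in F2.items() if k not in F1}
# Case 3: key only in the warehouse snapshot -> DELETE.
deletes = {k: b for k, b in F1.items() if k not in F2}

print("UPDATE:", updates)  # {1: {'name': 'Ada', 'city': 'Paris'}}
print("INSERT:", inserts)  # {4: {'name': 'Grace', 'city': 'Hull'}}
print("DELETE:", deletes)  # {3: {'name': 'Mary', 'city': 'York'}}
```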
Fast Data Load
• Extract/ETL only the changes since the last load (the delta).
• Extracts only the data that has changed since the last time a build was run.
Fast Data Load
• Delta = changes since the last load.
• Strategy 1: store sorted total extracts in the DSA (Data Staging Area); compute the delta by comparing the current extract with the last one (the snapshot differential above).
  • Always possible
  • Handles deletions
  • High extraction time
• Strategy 2: put an update timestamp on all rows (in the sources), maintained by a DB trigger, and extract only where "timestamp > time of last extract".
  • Reduces extract time
  • Cannot (alone) handle deletions
Deferred CDC