PSK DWH Material
DWH-Informatica Material
Version 1.0
REVISION HISTORY
Table of Contents
1 Introduction
1.1 Purpose
2 ORACLE
2.1 DEFINITIONS
NORMALIZATION
First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
ORACLE SET OF STATEMENTS
Data Definition Language (DDL)
Data Manipulation Language (DML)
Data Querying Language (DQL)
Data Control Language (DCL)
Transactional Control Language (TCL)
Syntaxes
ORACLE JOINS
Equi Join/Inner Join
Non-Equi Join
Self Join
Natural Join
Cross Join
Outer Join
Left Outer Join
Right Outer Join
Full Outer Join
What's the difference between View and Materialized View?
View
Materialized View
Inline view
Indexes
Why are hints required?
Explain Plan
Stored Procedure
Packages
Triggers
Data files Overview
2.2 IMPORTANT QUERIES
3 DWH CONCEPTS
What is BI?
4 ETL-INFORMATICA
4.1 Informatica Overview
4.2 Informatica Scenarios
4.3 Development Guidelines
4.4 Performance Tips
4.5 Unit Test Cases (UTP)
5 UNIX
1 Introduction
1.1 Purpose
The purpose of this document is to provide detailed information about Oracle, DWH concepts, Informatica and UNIX, based on real-time project experience.
2 ORACLE
2.1 DEFINITIONS
Organizations can store data on various media and in different formats, such as a hard-copy document or a spreadsheet. A database management system stores, retrieves, and modifies data in the database on request. There are four main types of databases: hierarchical, network, relational, and object-relational.
NORMALIZATION:
Some Oracle databases were modeled according to the rules of normalization
that were intended to eliminate redundancy.
Page 3 of 134
www.pskinfo.com
Obviously, the rules of normalization require you to understand your relationships and functional dependencies.
First Normal Form:
A row is in first normal form (1NF) if all underlying domains contain atomic values only.
Second Normal Form:
A row is in second normal form (2NF) if, and only if, it is in first normal form and every non-key attribute is fully dependent on the key. In practice this means the table either does not have a composite primary key (the primary key cannot be subdivided into separate logical entities), or all the non-key columns are functionally dependent on the entire primary key.
Third Normal Form:
A row is in third normal form (3NF) if and only if it is in second normal form and attributes that do not contribute to a description of the primary key are moved into a separate table. An example is creating look-up tables.
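As a simple illustration (a sketch, assuming an emp table that originally repeated the department name and location on every row), the look-up table approach moves those descriptive attributes into their own table:
CREATE TABLE dept_lookup (
  deptno  NUMBER(2) PRIMARY KEY,
  dname   VARCHAR2(30),
  loc     VARCHAR2(30)
);
-- emp keeps only the deptno column and references the look-up table
ALTER TABLE emp ADD CONSTRAINT fk_emp_dept
  FOREIGN KEY (deptno) REFERENCES dept_lookup (deptno);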
Boyce-Codd Normal Form:
Boyce-Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A row is in Boyce-Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are already in BCNF.
ORACLE SET OF STATEMENTS:
Data Definition Language (DDL):
Create
Alter
Drop
Truncate
Data Manipulation Language (DML):
Insert
Update
Delete
Data Querying Language (DQL):
Select
Data Control Language (DCL)
Grant
Revoke
Transactional Control Language (TCL)
Commit
Rollback
Savepoint
Syntaxes:
Refreshing a materialized view:
EXECUTE DBMS_SNAPSHOT.REFRESH('MV_EMP_PK', 'F');   -- fast refresh
or
EXECUTE DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');     -- complete refresh
A materialized view such as EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW can also be defined with the REFRESH COMPLETE clause in its CREATE MATERIALIZED VIEW ... AS SELECT statement.
Case Statement:
Select NAME,
       (CASE
          WHEN CLASS_CODE = 'Subscription' THEN ATTRIBUTE_CATEGORY
          ELSE TASK_TYPE
        END) TASK_TYPE,
       CURRENCY_CODE
From EMP;
Decode():
Select empname, Decode(address, 'HYD', 'Hyderabad', 'Bang', 'Bangalore', address) as address from emp;
Procedure (skeleton):
CREATE OR REPLACE PROCEDURE <procedure_name> (
  cust_id_IN IN NUMBER,
  ...
)
AS
BEGIN
  ...
END;
Trigger (skeleton):
CREATE OR REPLACE TRIGGER <trigger_name>
  AFTER INSERT OR UPDATE ON <table_name>
  REFERENCING NEW AS NEW OLD AS OLD
  FOR EACH ROW
DECLARE
BEGIN
  IF <condition> THEN
    ...
  ELSE
    -- Exec procedure
    update_sysdate;
  END IF;
END;
ORACLE JOINS:
Equi join
Non-equi join
Self join
Natural join
Cross join
Outer join
Left outer
Right outer
Full outer
USING CLAUSE
ON CLAUSE
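For the Equi Join/Inner Join listed above, a sketch on the same emp and dept sample tables used in the examples that follow (the '=' operator in the join condition is what makes it an equi join):
Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno;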
Non-Equi Join
A join which contains an operator other than ‘=’ in the joins condition.
Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where
e.deptno > d.deptno;
Self Join
A self join is a join of a table to itself; the same table appears twice in the FROM clause with different aliases.
Natural Join
Cross Join
Outer Join
Outer join gives the non-matching records along with matching records.
Left Outer Join
This will display all matching records, plus the records in the left-hand side table that are not in the right-hand side table.
Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept
d on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno(+);
Right Outer Join
This will display all matching records, plus the records in the right-hand side table that are not in the left-hand side table, as the example below shows.
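Following the pattern of the left outer join example above, the equivalent right outer join can be written as (a sketch on the same emp and dept tables; in the (+) form the operator moves to the left-hand side):
Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+)=d.deptno;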
Full Outer Join
This will display all matching records and the non-matching records from both tables.
Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);
View:
A view is a virtual table based on a SQL query; it does not store data itself, only the query definition.
– Contains functions or groups of data.
Materialized View:
We can keep aggregated data in a materialized view. We can schedule the MV to refresh, but a table can't. An MV can be created based on multiple tables.
Inline view:
If we write a select statement in from clause that is nothing but inline view.
Ex:
Get dept wise max sal along with empname and emp no.
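A sketch of such a query on the emp sample table (the inline view in the FROM clause computes the department-wise maximum salary):
SQL> select e.empno, e.ename, e.sal, e.deptno
     from emp e,
          (select deptno, max(sal) max_sal from emp group by deptno) m
     where e.deptno = m.deptno
     and e.sal = m.max_sal;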
What is the difference between view and materialized view?
DELETE
The DELETE command is used to remove rows from a table. A WHERE clause
can be used to only remove some rows. If no WHERE condition is specified, all
rows will be removed. After performing a DELETE operation you need to
COMMIT or ROLLBACK the transaction to make the change permanent or to
undo it.
TRUNCATE
TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUNCATE is faster and doesn't use as much undo space as a DELETE.
DROP
The DROP command removes a table from the database. All the tables' rows,
indexes and privileges will also be removed. The operation cannot be rolled
back.
ROWID
A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when it is removed from a table. It has the format 'BBBBBBBB.RRRR.FFFF', where BBBBBBBB is the block number, RRRR is the slot (row) number, and FFFF is a file number.
ROWNUM
For each row returned by a query, the ROWNUM pseudo column returns a
number indicating the order in which Oracle selects the row from a table or set
of joined rows. The first row selected has a ROWNUM of 1, the second has 2,
and so on.
You can use ROWNUM to limit the number of rows returned by a query, as in
this example:
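For instance, a sketch that returns only the first 10 rows from the emp sample table:
SQL> select * from emp where rownum <= 10;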
Rowid: A globally unique identifier for a row in a database. It is created at the time the row is inserted into the table, and destroyed when it is removed from the table.
Rownum: For each row returned by a query, ROWNUM returns a number indicating the order in which Oracle selects the row from a table or set of joined rows.
SELECT column_list
FROM table
[WHERE condition]
[GROUP BY group_by_expression]
[HAVING group_condition]
[ORDER BY column];
Both the WHERE clause and the HAVING clause can be used to filter data. The WHERE clause is used to restrict rows, but it cannot be used to restrict groups; to restrict groups you use the HAVING clause.
MERGE Statement
You can use merge command to perform insert and update in a single
command.
On (s1.no=s2.no)
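A minimal sketch of the MERGE syntax, assuming two tables s1 and s2 that both have NO and NAME columns (matching the ON clause fragment above):
MERGE INTO s1
USING s2
ON (s1.no = s2.no)
WHEN MATCHED THEN
  UPDATE SET s1.name = s2.name
WHEN NOT MATCHED THEN
  INSERT (no, name) VALUES (s2.no, s2.name);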
Sub Query:
Example:
Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade='A' or sal_grade='B');
Example:
Find all employees who earn more than the average salary in their department.
Select * from employees A
where salary > (select avg(salary) from employees B
                where B.department_id = A.department_id
                group by B.department_id);
EXISTS:
Example (IN):
Select * from emp where deptno in (select deptno from dept);
Example (correlated subquery):
Select e.* from emp e where sal >= (select avg(sal) from emp a where a.deptno = e.deptno group by a.deptno);
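The first example above can equally be written with EXISTS (a sketch):
Select * from emp e where exists (select 1 from dept d where d.deptno = e.deptno);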
Indexes:
1. Bitmap indexes are most appropriate for columns having low distinct
values—such as GENDER, MARITAL_STATUS, and RELATION. This
assumption is not completely accurate, however. In reality, a bitmap
index is always advisable for systems in which data is not frequently
updated by many concurrent systems. In fact, as I'll demonstrate here,
a bitmap index on a column with 100-percent unique values (a column
candidate for primary key) is as efficient as a B-tree index.
7. The table is large and most queries are expected to retrieve less than 2
to 4 percent of the rows
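For the bitmap index discussion in point 1 above, a sketch of creating a bitmap index (assuming a GENDER column on the emp sample table), alongside an ordinary B-tree index for comparison:
CREATE BITMAP INDEX emp_gender_bix ON emp (gender);
CREATE INDEX emp_ename_idx ON emp (ename);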
Why are hints required?
It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all. Sometimes, however, the statistics it relies on may be out of date. In this case, a hint could help.
You should first get the explain plan of your SQL and determine what changes
can be done to make the code operate without using hints if possible.
However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance.
The ANALYZE statement can be used to gather statistics for a specific table,
index or cluster. The statistics can be computed exactly, or estimated based on
a specific number of rows, or a percentage of rows:
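For example (sketches of the three forms mentioned, using the emp sample table):
ANALYZE TABLE emp COMPUTE STATISTICS;
ANALYZE TABLE emp ESTIMATE STATISTICS SAMPLE 1000 ROWS;
ANALYZE TABLE emp ESTIMATE STATISTICS SAMPLE 10 PERCENT;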
Hint categories:
ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing
systems.
(/*+ ALL_ROWS */)
FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.
CHOOSE
One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.
Hints for parallel execution, e.g. (/*+ parallel(a,4) */); specify the degree as either 2, 4 or 16.
Additional Hints
HASH
Hashes one table (full scan) and creates a hash index for that table. Then
hashes other table and uses hash index to find corresponding records.
Therefore not suitable for < or > join conditions.
/*+ use_hash */
ORDERED- This hint forces tables to be joined in the order specified. If you
know table X has fewer rows, then ordering it first may speed execution in a
join.
PARALLEL (table, instances): This specifies that the operation is to be done in parallel.
If an index cannot be created or cannot be used (for example, when the WHERE clause uses LIKE, NOT IN, >, < or <>), then we go for /*+ parallel(table, 8) */ for SELECT and UPDATE statements.
Explain Plan:
Explain plan tells us whether the query is using indexes properly, what the cost of the query is, and whether it is doing a full table scan; based on these statistics we can tune the query.
The explain plan process stores data in the PLAN_TABLE. This table can be located in the current schema or a shared schema and is created in SQL*Plus as follows:
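A sketch of the typical sequence (utlxplan.sql, which ships with Oracle, creates PLAN_TABLE; DBMS_XPLAN.DISPLAY formats the stored plan):
SQL> @?/rdbms/admin/utlxplan.sql
SQL> EXPLAIN PLAN FOR
     select e.empno, e.ename, d.dname from emp e, dept d where e.deptno = d.deptno;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);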
What is your tuning approach if SQL query taking long time? Or how do u
tune SQL query?
If a query is taking a long time, first I will run the query through Explain Plan; the explain plan process stores data in the PLAN_TABLE.
It gives us the execution plan of the query, for example whether the query is using the relevant indexes on the joining columns or whether indexes to support the query are missing.
If the joining columns don't have indexes, the query will do a full table scan; if it is a full table scan the cost will be higher, so I will create indexes on the joining columns and rerun the query, which should give better performance. We also need to analyze the tables if they were last analyzed long back. The ANALYZE statement can be used to gather statistics for a specific table, index or cluster.
If there is still a performance issue then I will use HINTS; a hint is nothing but a clue. We can use hints like:
ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing
systems.
FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.
CHOOSE
One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.
HASH
Hashes one table (full scan) and creates a hash index for that table. Then
hashes other table and uses hash index to find corresponding records.
Therefore not suitable for < or > join conditions.
/*+ use_hash */
Stored Procedure:
What are the differences between stored procedures and triggers?
Stored procedures must be called explicitly by the user in order to execute, but a trigger is called implicitly based on the events defined on the table.
Using a stored procedure we can access and modify data present in many tables.
Stored procedures are not run automatically; they have to be called explicitly by the user. But triggers get executed when the particular event associated with the table gets fired.
Packages:
A package contains several procedures and functions that process related transactions.
A package is a group of related procedures and functions, together with the cursors and variables they use, stored together in the database as a unit.
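A minimal sketch of a package specification and body (the package, procedure, and column names here are illustrative):
CREATE OR REPLACE PACKAGE emp_pkg AS
  PROCEDURE raise_sal (p_empno IN NUMBER, p_pct IN NUMBER);
END emp_pkg;
/
CREATE OR REPLACE PACKAGE BODY emp_pkg AS
  PROCEDURE raise_sal (p_empno IN NUMBER, p_pct IN NUMBER) IS
  BEGIN
    UPDATE emp SET sal = sal * (1 + p_pct / 100) WHERE empno = p_empno;
  END raise_sal;
END emp_pkg;
/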
Triggers:
Oracle lets you define procedures called triggers that run implicitly when an
INSERT, UPDATE, or DELETE statement is issued against the associated table
Triggers are similar to stored procedures. A trigger stored in the database can
include SQL and PL/SQL
Types of Triggers
INSTEAD OF Triggers
Row Triggers
A row trigger is fired each time the table is affected by the triggering
statement. For example, if an UPDATE statement updates multiple rows of a
table, a row trigger is fired once for each row affected by the UPDATE
statement. If a triggering statement affects no rows, a row trigger is not run.
When defining a trigger, you can specify the trigger timing--whether the trigger
action is to be run before or after the triggering statement. BEFORE and AFTER
apply to both statement and row triggers.
BEFORE and AFTER triggers fired by DML statements can be defined only on
tables, not on views.
Stored Procedure vs Function:
- A stored procedure may or may not return values; a function should return at least one value, and can return more than one value using OUT arguments.
- A stored procedure accepts any number of IN and OUT arguments; a function conventionally accepts only IN arguments.
- Stored procedures are mainly used to process tasks; functions are mainly used to compute values.
- A stored procedure cannot be invoked from SQL statements (e.g. SELECT); a function can be invoked from SQL statements (e.g. SELECT).
- A stored procedure can affect the state of the database using COMMIT; a function cannot affect the state of the database.
Table Space:
A database is divided into one or more logical storage units called tablespaces.
Tablespaces are divided into logical units of storage called segments.
Control File:
2.2 IMPORTANT QUERIES
Query to find duplicate records:
Select empno, count (*) from EMP group by empno having count (*)>1;
Query to delete duplicate records:
Delete from EMP where rowid not in (select max (rowid) from EMP group by empno);
UNION
select
emp_id,
max(decode(row_id,0,address))as address1,
max(decode(row_id,1,address)) as address2,
max(decode(row_id,2,address)) as address3
group by emp_id
Other query:
select
emp_id,
max(decode(rank_id,1,address)) as add1,
max(decode(rank_id,2,address)) as add2,
max(decode(rank_id,3,address))as add3
from
group by
emp_id
5. Rank query:
Select empno, ename, sal, r from (select empno, ename, sal, rank () over (order
by sal desc) r from EMP);
The DENSE_RANK function works like the RANK function except that it assigns consecutive ranks:
Select empno, ename, sal, r from (select empno, ename, sal, dense_rank () over (order by sal desc) r from emp);
Or
Select * from (select * from EMP order by sal desc) where rownum<=5;
8. 2nd highest Sal:
Select empno, ename, sal, r from (select empno, ename, sal, dense_rank ()
over (order by sal desc) r from EMP) where r=2;
9. Top sal:
Select * from EMP where sal= (select max (sal) from EMP);
11.Hierarchical queries
Starting at the root, walk from the top down, and eliminate employee Higgins in the result, but process his child rows.
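A sketch of such a hierarchical query on the standard HR employees table (START WITH/CONNECT BY perform the top-down walk; the WHERE clause removes Higgins from the result while his child rows are still processed):
SELECT employee_id, last_name, manager_id
FROM employees
WHERE last_name <> 'Higgins'
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;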
3 DWH CONCEPTS
What is BI?
Business Intelligence refers to a set of methods and techniques that are used by
organizations for tactical and strategic decision making. It leverages methods
and technologies that focus on counts, statistics and business objectives to
improve business performance.
The objective of Business Intelligence is to better understand customers and
improve customer service, make the supply and distribution chain more
efficient, and to identify and address business problems and opportunities
quickly.
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management decision making. In terms of design, a data warehouse and a data mart are almost the same.
Subject Oriented:
Data is organized around the major subjects of the enterprise (such as customer, product and sales) rather than around the applications that produce it.
Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data is never
removed.
What is a DataMart?
A data mart is a subset of the data warehouse that is focused on a particular subject area or department. In terms of design, a data warehouse and a data mart are almost the same.
What is a Factless Fact Table?
A fact table that contains only the primary keys from the dimension tables, and does not contain any measures, is called a factless fact table.
What is a Schema?
A schema is the arrangement of fact and dimension tables in the warehouse, for example a star schema or a snowflake schema.
What is Grain?
In data warehousing, grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.
It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.
Star schema is a data warehouse schema where there is only one "fact table" and many denormalized dimension tables.
The fact table contains the primary keys of all the dimension tables (as foreign keys) and other columns of additive, numeric facts.
What is the difference between snowflake and star schema?
Star schema: The star schema is the simplest data warehouse schema. Only one join establishes the relationship between the fact table and any one of the dimension tables.
Snowflake schema: The snowflake schema is a more complex data warehouse model than a star schema. Since there are relationships between the dimension tables, it has to do many joins to fetch the data.
A "fact" is a numeric value that a business wishes to count or sum. A
"dimension" is essentially an entry point for getting at the facts. Dimensions are
things of interest to the business.
A set of level properties that describe a specific aspect of a business, used for
analyzing the factual measures.
Types of facts?
Additive: Additive facts are facts that can be summed up through all of
the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions in the fact table.
What is Granularity?
Principle: create fact tables with the most granular data possible to support
analysis of the business process.
In Data warehousing grain refers to the level of detail available in a given fact
table as well as to the level of detail provided by a star schema.
It is usually given as the number of records per key within the table. In general,
the grain of the fact table is the grain of the star schema.
Facts: Facts must be consistent with the grain; all facts are at a uniform grain.
Dimensions: Each dimension associated with the fact table must take on a single value for each fact row.
Dimensional Model
Slowly Changing Dimension (SCD): a dimension in which the attributes of a particular row change over time, for example an employee whose designation changes or whose department changes.
Conformed Dimensions (CD): these dimensions are something that is built once
in your model and can be reused multiple times with different fact tables. For
example, consider a model containing multiple fact tables, representing
different data marts. Now look for a dimension that is common to these fact tables. In this example, let's consider that the product dimension is common and hence can be reused by creating shortcuts and joining the different fact tables. Some examples are the time dimension, customer dimension and product dimension.
Junk Dimension: when you consolidate lots of small dimensions, instead of having hundreds of small dimensions that each hold a few records and clutter your database with mini 'identifier' tables, all records from these small dimension tables are loaded into one dimension table, and we call this the junk dimension table (since we are storing all the junk in this one table). For example, a company might have a handful of manufacturing plants, a handful of order types, and so on, and we can consolidate them into one dimension table called the junk dimension table.
Degenerated Dimension: an item that is in the fact table but is stripped of its description, because the description belongs in a dimension table, is referred to as a degenerated dimension. Since it looks like a dimension but is really in the fact table and has been degenerated of its description, it is called a degenerated dimension.
Degenerated Dimension: a dimension which is located in fact table known as
Degenerated dimension
Dimensional Model:
Data modeling
There are three levels of data modeling. They are conceptual, logical, and
physical. This section will explain the difference among the three, the order
with which each one is created, and how to go from one level to the other.
Conceptual data model: includes the important entities and the relationships among them; no attribute is specified and no primary key is specified.
Logical data model: includes all entities and relationships, and all attributes and the primary key for each entity are specified.
At this level, the data modeler attempts to describe the data in as much detail
as possible, without regard to how they will be physically implemented in the
database.
In data warehousing, it is common for the conceptual data model and the
logical data model to be combined into a single step (deliverable).
The steps for designing the logical data model are as follows:
6. Normalization.
At this level, the data modeler will specify how the logical data model will be
realized in the database schema.
1. Convert entities into tables.
9. http://www.learndatamodeling.com/dm_standard.htm
The differences between a logical data model and a physical data model are shown below.
Logical Data Model      Physical Data Model
Entity                  Table
Attribute               Column
Definition              Comment
Below is the SQ (Source Qualifier) query for one of the dimension table loads.
EDIII – Logical Design
[Logical design diagram: staging tables (ACW_DF_FEES_STG, ACW_PCBA_APPROVAL_STG, ACW_DF_APPROVAL_STG) load fact tables (ACW_DF_FEES_F, ACW_PCBA_APPROVAL_F, ACW_DF_APPROVAL_F), which reference the dimension tables ACW_ORGANIZATION_D, ACW_USERS_D, ACW_PRODUCTS_D, ACW_PART_TO_PID_D, ACW_SUPPLY_CHANNEL_D and the EDW_TIME_HIERARCHY.]
EDII– Physical Design
[Physical design diagram: the same tables with their physical columns and data types, e.g. ACW_PRODUCTS_D (PRODUCT_KEY NUMBER(10) PK, PRODUCT_NAME CHAR(30), BUSINESS_UNIT_ID NUMBER(10), ...), ACW_DF_APPROVAL_STG, ACW_DF_APPROVAL_F, ACW_PART_TO_PID_D and ACW_SUPPLY_CHANNEL_D.]
Types of SCD Implementation:
Type 1 Slowly Changing Dimension
In Type 1, the new information simply overwrites the original information. After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem,
since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history.
Usage:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
Type 2 Slowly Changing Dimension
After Christina moved from Illinois to California, we add the new information as
a new row into the table:
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of
rows for the table is very high to start with, storage and performance can
become a concern.
Usage:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one
indicating the current value. There will also be a column that indicates when
the current value becomes active.
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets
updated, and we have the following table (assuming the effective date of
change is January 15, 2003):
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more
than once. For example, if Christina later moves to Texas on December 15,
2003, the California information will be lost.
Usage:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
If the target and source databases are different and the target table volume is high (it contains some millions of records), then without a staging table we need to design the Informatica mapping using a lookup to find out whether the record exists in the target table or not. Since the target has huge volumes, it is costly to create the cache and it will hit performance.
If we create staging tables in the target database, we can simply do an outer join in the source qualifier to determine insert/update; this approach will give good performance.
Data cleansing, also known as data scrubbing, is the process of ensuring that a
set of data is correct and accurate. During data cleansing, records are checked
for accuracy and consistency.
Data cleansing
Data merging
Data scrubbing
My understanding of an ODS is that it is a replica of the OLTP system, and the need for it is to reduce the burden on the production system (OLTP) while fetching data for loading targets. Hence it is a mandatory requirement for every warehouse.
So every day do we transfer data to ODS from OLTP to keep it up to date?
OLTP is a sensitive database; it should not allow multiple heavy select statements, as they may impact performance, and if something goes wrong while fetching data from OLTP to the data warehouse it will directly impact the business.
A surrogate key is any column or set of columns that can be declared as the
primary key instead of a "real" or natural key. Sometimes there can be several
natural keys that could be declared as the primary key, and these are all called
candidate keys. So a surrogate is a candidate key. A table could actually have
more than one surrogate key, although this would be unusual. The most
common type of surrogate key is an incrementing integer, such as an auto
increment column in MySQL, or a sequence in Oracle, or an identity column in
SQL Server.
4 ETL-INFORMATICA
4.1 Informatica Overview
Informatica is a powerful ETL (Extraction, Transformation, and Loading) tool developed by Informatica Corporation. Informatica comes with the following clients to perform various tasks: PowerCenter Designer, Workflow Manager, Workflow Monitor and Repository Manager.
Informatica Transformations:
Mapplet:
A mapplet is a reusable object that contains a set of transformations, so that the same transformation logic can be used in multiple mappings. When you add transformations to a mapplet, keep the following restrictions in mind; a mapplet cannot contain:
o Normalizer transformations
o COBOL sources
o XML sources
o Target definitions
o Other mapplets
System Variables
$$$SessStartTime returns the initial system date value on the machine hosting
the Integration Service when the server initializes a session. $$$SessStartTime
returns the session start time as a string value. The format of the string
depends on the database you are using.
Session: A session is a set of instructions that tells informatica Server how to
move data from sources to targets.
Filter: The Filter transformation is used to filter the data based on a single condition and pass it to the next transformation.
Router: The Router transformation is used to route the data based on multiple conditions and pass it to the next transformations. A Router transformation has the following groups:
1) Input group
2) User-defined output groups
3) Default group
Lookup: The Lookup transformation is used to look up data in a flat file or a relational table. Lookups can be:
1) Connected
2) Unconnected
Connected Lookup: Passes multiple output values to another transformation. Link lookup/output ports to another transformation.
Unconnected Lookup: Passes one output value to another transformation. The lookup/output/return port passes the value to the transformation calling the :LKP expression.
Lookup Caches:
When configuring a lookup cache, you can specify any of the following options:
Persistent cache
Static cache
Dynamic cache
Shared cache
Dynamic cache: When you use a dynamic cache, the PowerCenter Server
updates the lookup cache as it passes rows to the target.
If you configure a Lookup transformation to use a dynamic cache, you can only
use the equality operator (=) in the lookup condition.
The NewLookupRow port can take the following values:
0 - The Integration Service does not update or insert the row in the cache.
1 - The Integration Service inserts the row into the cache.
2 - The Integration Service updates the row in the cache.
Static cache: It is a default cache; the PowerCenter Server doesn’t update the
lookup cache as it passes rows to the target.
Persistent cache: If the lookup table does not change between sessions,
configure the Lookup transformation to use a persistent lookup cache. The
PowerCenter Server then saves and reuses cache files from session to session,
eliminating the time required to read the lookup table.
Dynamic cache: The NewLookupRow port is enabled automatically. The best example of where we need to use a dynamic cache is this: suppose the first record and the last record coming from the source are the same, but there is a change in the address. What the Informatica mapping has to do here is insert the first record and update the last record in the target table.
Static cache: If we use a static lookup, the first record will go to the lookup and check the lookup cache; based on the condition it will not find a match, so it returns a null value, and the router sends that record to the insert flow. But this record is still not available in the cache memory, so when the last record comes to the lookup it again checks the cache, does not find a match, and again returns a null value. It again goes to the insert flow through the router, but it is supposed to go to the update flow, because the cache did not get refreshed when the first record was inserted into the target table.
Rank: The Rank transformation allows you to select only the top or bottom
rank of data. You can use a Rank transformation to return the largest or
smallest numeric value in a port or group.
Sequence Generator: The Sequence Generator transformation is used to
generate numeric key values in sequential order.
Union Transformation:
The Union transformation is a multiple input group transformation that you can
use to merge data from multiple pipelines or pipeline branches into one
pipeline branch. It merges data from multiple sources similar to the UNION ALL
SQL statement to combine the results from two or more SQL statements.
Similar to the UNION ALL statement, the Union transformation does not remove duplicate rows. Input groups should have a similar structure.
1) Mapping level
2) Session level.
Aggregator Transformation:
Transformation type:
Active
Connected
Aggregate cache: The Integration Service stores data in the aggregate cache
until it completes aggregate calculations. It stores group values in an index
cache and row data in the data cache.
Group by port: Indicate how to create groups. The port can be any input,
input/output, output, or variable port. When grouping data, the Aggregator
transformation outputs the last row of each group unless otherwise specified.
Sorted input: Select this option to improve session performance. To use sorted
input, you must pass data to the Aggregator transformation sorted by group by
port, in ascending or descending order.
Aggregate Expressions:
Aggregate Functions
(AVG,COUNT,FIRST,LAST,MAX,MEDIAN,MIN,PERCENTAGE,SUM,VARIANCE and
STDDEV)
When you use any of these functions, you must use them in an expression
within an Aggregator transformation.
Use sorted input to increase mapping performance, but we need to sort the data before sending it to the Aggregator transformation.
SQL Transformation
Transformation type:
Active/Passive
Connected
For example, you might need to create database tables before adding new
transactions. You can create an SQL transformation to create the tables in a
workflow. The SQL transformation returns database errors in an output port.
You can configure another workflow to run if the SQL transformation returns
no errors.
When you create an SQL transformation, you configure the following options:
Script mode. The SQL transformation runs ANSI SQL scripts that are externally
located. You pass a script name to the transformation with each input row. The
SQL transformation outputs one row for each input row.
Query mode. The SQL transformation executes a query that you define in a
query editor. You can pass strings or parameters to the query to define
dynamic queries or change the selection parameters. You can output multiple
rows when the query has a SELECT statement.
Database type. The type of database the SQL transformation connects to.
Script Mode
An SQL transformation configured for script mode has the following default
ports:
ScriptName (Input): Receives the name of the script to execute for the current row.
ScriptResult (Output): Returns PASSED if the script execution succeeds for the row; otherwise contains FAILED.
ScriptError (Output): Returns errors that occur when a script fails for a row.
Java Transformation
Transformation type:
Active/Passive
Connected
For example, you can define transformation logic to loop through input rows
and generate multiple output rows based on a specific condition. You can also
use expressions, user-defined functions, unconnected transformations, and
mapping variables in the Java code.
Transaction Control Transformation
Transformation type:
Active
Connected
PowerCenter lets you control commit and roll back transactions based on a set
of rows that pass through a Transaction Control transformation. A transaction
is the set of rows bound by commit or roll back rows. You can define a
transaction based on a varying number of input rows. You might want to define
transactions based on a group of rows ordered on a common key, such as
employee ID or order entry date.
Within a session. When you configure a session, you configure it for user-
defined commit. You can choose to commit or roll back a transaction if the
Integration Service fails to transform or write any row to the target.
When you run the session, the Integration Service evaluates the expression for
each row that enters the transformation. When it evaluates a commit row, it
commits all rows in the transaction to the target or targets. When the
Integration Service evaluates a roll back row, it rolls back all rows in the
transaction from the target or targets.
If the mapping has a flat file target you can generate an output file each time
the Integration Service starts a new transaction. You can dynamically name
each target flat file.
Transaction control
expression
The expression contains values that represent actions the Integration Service
performs based on the return value of the condition. The Integration Service
evaluates the condition on a row-by-row basis. The return value determines
whether the Integration Service commits, rolls back, or makes no transaction
changes to the row. When the Integration Service issues a commit or roll back
based on the return value of the expression, it begins a new transaction. Use the following built-in variables in the Expression Editor when you create a transaction control expression: TC_CONTINUE_TRANSACTION (the default), TC_COMMIT_BEFORE, TC_COMMIT_AFTER, TC_ROLLBACK_BEFORE and TC_ROLLBACK_AFTER.
Joiner vs Lookup:
- In a Joiner, on multiple matches it will return all matching records; in a Lookup it will return either the first record, the last record, any value, or an error value.
- We can't apply any filters along with the join condition in a Joiner transformation; in a Lookup transformation we can apply filters along with the lookup conditions using the lookup query override.
Source Qualifier vs Lookup:
- A source qualifier will push all the matching records, whereas in a Lookup we can restrict whether to return the first value, the last value, or any value.
- When both the source and the lookup table are in the same database we can use a source qualifier; when the source and lookup table exist in different databases then we need to use a Lookup.
Source Qualifier vs Joiner:
- We use a source qualifier to join tables if the tables are in the same database; we use a Joiner to join tables that are in different databases.
- In a source qualifier we can use any type of join between two tables, whereas in a Joiner we can't use anything other than its four join types (normal, master outer, detail outer and full outer).
Stopped:
You choose to stop the workflow or task in the Workflow Monitor or through pmcmd. The Integration Service stops processing the task and all other tasks in its path. The Integration Service continues running concurrent tasks, like backend stored procedures.
Abort:
You choose to abort the workflow or task in the Workflow Monitor or through
pmcmd. The Integration Service kills the DTM process and aborts the task.
2nd Approach
Use Mod() function in routers based on Seq.Next values we can route the data
into multiple targets.
1) Yes. One of my mappings was taking 3-4 hours to process 40 million rows into a staging table; we didn't have any transformation inside the mapping, it was a 1-to-1 mapping. There was nothing to optimize in the mapping itself, so I created session partitions using key range on the effective date column. It improved performance a lot: rather than 4 hours it was running in 30 minutes for the entire 40 million rows. Using partitions, the DTM creates multiple reader and writer threads.
2) There was one more scenario where I got very good performance at the mapping level. Rather than using a lookup transformation, if we can do an outer join in the source qualifier query override, this will give good performance when both the lookup table and the source are in the same database. If the lookup table has huge volumes then creating the cache is costly.
4) If any mapping is taking a long time to execute, first we need to look into the source and target statistics in the monitor for the throughput, and also find out where exactly the bottleneck is by looking at the busy percentage in the session log; that tells us which transformation is taking more time. If the source query is the bottleneck, then it will show at the end of the session log as "query issued to database", which means there is a performance issue in the source query, and we need to tune the query.
If we look into the session log it shows the busy percentage; based on that we need to find out where the bottleneck is.
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] ****
[ACW_PCBA_APPROVAL_F1, ACW_PCBA_APPROVAL_F] has completed: Total
Run Time = [0.806521] secs, Total Idle Time = [0.000000] secs, Busy Percentage
= [100.000000]
Suppose I have to load 40 lakh records into the target table and the workflow is taking about 10-11 hours to finish. I've already increased the cache size to 128MB, and there are no joiners, just lookups and expression transformations.
Ans: In this case, drop the constraints and indexes on the target table before you run the session and re-create them after the load completes.
What is Constraint based loading in informatica?
Generally what it does is load the data into the parent table first, and then load it into the child table.
Let's assume we have imported some source and target definitions into a shared folder, and we are using those source and target definitions in other folders as shortcuts in some mappings.
If any modifications occur in the backend (database) structure, like adding new columns or dropping existing columns in either the source or the target, and we re-import the definition into the shared folder, those new changes are automatically reflected in all folders/mappings wherever we used those source or target definitions.
If we don't have a primary key on the target table, we can perform updates using the Target Update Override option. By default, the Integration Service updates target tables based on key values. However, you can override the default UPDATE statement for each target in a mapping. You might want to update the target based on non-key columns.
You can override the WHERE clause to include non-key columns. For example,
you might want to update records for employees named Mike Smith only. To
do this, you edit the WHERE clause as follows:
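A sketch of such an override (the T_SALES table and its columns are illustrative; :TU references the values coming from the ports in the target instance):
UPDATE T_SALES
SET DATE_SHIPPED = :TU.DATE_SHIPPED,
    TOTAL_SALES = :TU.TOTAL_SALES
WHERE EMP_NAME = :TU.EMP_NAME AND EMP_NAME = 'Mike Smith'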
If you modify the UPDATE portion of the statement, be sure to use :TU to
specify ports.
4) Because it is a mapping variable, it stores the max last_upd_date value in the repository; in the next run our source qualifier query will fetch only the records updated or inserted after the previous run.
Logic in the mapping variable is
Logic in the SQ is
In the expression, assign the max last update date value to the variable using the SETMAXVARIABLE() function.
Logic in the update strategy is below
Approach_2: Using parameter file
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_
WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]
$DBConnection_Source=DMD2_GEMS_ETL
$DBConnection_Target=DMD2_GEMS_ETL
Main mapping
Sql override in SQ Transformation
Workflow Design
3. Create two stored procedures, one to update cont_tbl_1 with the session start time; set the stored procedure type property to Source Pre-load.
SCD Type-II Effective-Date Approach
Update the previous record eff-end-date with sysdate and insert as a new
record with source data.
Once we fetch the record from the source qualifier, we send it to a lookup to find out whether the record is present in the target or not, based on the source primary key column.
Once we find a match in the lookup, we take the SCD columns from the lookup and the source columns from the SQ into an expression transformation.
If the source and target data is same then I can make a flag as ‘S’.
If the source and target data is different then I can make a flag as ‘U’.
If source data does not exists in the target that means lookup returns null
value. I can flag it as ‘I’.
Based on the flag values in router I can route the data into insert and
update flow.
Complex Mapping
The source file directory contains files older than 30 days, with timestamps in the file names.
For this requirement if I hardcode the timestamp for source file name it
will process the same file every day.
Then I am going to use the parameter file to supply the values to session
variables ($InputFilename).
This mapping will update the parameter file with appended timestamp to
file name.
I make sure to run this parameter file update mapping before my actual
mapping.
How to handle errors in informatica?
We need to send those records to flat file after completion of 1st session
run. Shell script will check the file size.
If the file size is greater than zero then it will send email notification to
source system POC (point of contact) along with deno zero record file and
appropriate email subject and body.
If file size<=0 that means there is no records in flat file. In this case shell
script will not send any email notification.
Or
We are expecting a not null value for one of the source column.
Source qualifier will select the data from the source table.
Parameter file it will supply the values to session level variables and mapping
level variables.
Session level variables
$DBConnection_Source
$DBConnection_Target
$InputFile
$OutputFile
Variable
Parameter
What is the difference between mapping level and session level variables?
Flat files are of two types:
1) Delimited
2) Fixed width
In fixed width we need to know the file format first, i.e. how many characters to read for each particular column.
If the file contains the header then in definition we need to skip the first row.
List file:
If you want to process multiple files with the same structure, we don't need multiple mappings and multiple sessions.
We can use one mapping one session using list file option.
First we need to create the list file for all the files. Then we can use this file in
the main mapping.
It is a text file; below is the format for a parameter file. We place this file in the Unix box where we have installed our Informatica server.
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_
WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]
$InputFileName_BAAN_SALE_HIST=/interface/dev/etl/apo/srcfiles/HS_025_20070921
$DBConnection_Target=DMD2_GEMS_ETL
$$CountryCode=AT
$$CustomerNumber=120165
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_
WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_BELUM]
$DBConnection_Sourcet=DEVL1C1_GEMS_ETL
$OutputFileName_BAAN_SALES=/interface/dev/etl/apo/trgfiles/HS_002_20070921
$$CountryCode=BE
$$CustomerNumber=101495
Power Center 8.X Architecture.
Developer Changes:
• Client applications are the same, but work on top of the new services
framework
4) grid concept is additional feature
8) concurrent cache creation and faster index building are additional feature
in lookup transformation
13) Flat file names can be populated to the target while processing through a list file.
14) For flat files, headers and footers can be populated using advanced options at the session level in 8.x.
Effective in version 8.0, you create and configure a grid in the Administration
Console. You configure a grid to run on multiple nodes, and you configure one
Integration Service to run on the grid. The Integration Service runs processes
on the nodes in the grid to distribute workflows and sessions. In addition to
running a workflow on a grid, you can now run a session on a grid. When you
run a session or workflow on a grid, one service process runs on each available
node in the grid.
[Pictorial representation of workflow execution: the Integration Service (IS) starts the Integration Service Process (ISP), the Load Balancer dispatches tasks, and the DTM manages the data from source system to target system within memory and disk.]
Data Transformation Manager
The Integration Service starts one or more Integration Service processes to run
and monitor workflows. When we run a workflow, the ISP starts and locks the
workflow, runs the workflow tasks, and starts the process to run sessions. The
functions of the Integration Service Process are,
Load Balancer
the node relative to the total physical memory size called Maximum
Memory %. The maximum number of running Session and Command
tasks allowed for each Integration Service process running on the node
called Maximum Processes
2. The Load Balancer dispatches all tasks to the node that runs the master
Integration Service process
1. The Load Balancer verifies which nodes are currently running and enabled
2. The Load Balancer identifies nodes that have the PowerCenter resources
required by the tasks in the workflow
3. The Load Balancer verifies that the resource provision thresholds on each
candidate node are not exceeded. If dispatching the task causes a
threshold to be exceeded, the Load Balancer places the task in the
dispatch queue, and it dispatches the task later
4. The Load Balancer selects a node based on the dispatch mode
When the workflow reaches a session, the Integration Service Process starts
the DTM process. The DTM is the process associated with the session task. The
DTM process performs the following tasks:
Adds partitions to the session when the session is configured for dynamic
partitioning.
Sends a request to start worker DTM processes on other nodes when the
session is configured to run on a grid.
Runs post-session stored procedures, SQL, and shell commands and sends
post-session email
4.2 Informatica Scenarios:
How to load the source records into three targets in round-robin fashion?
We can do this using a sequence generator by setting end value = 3 and enabling the cycle option. Then in the router take 3 groups:
In the 1st group specify the condition as seq next value = 1 and pass those records to the 1st target.
Similarly, in the 2nd group specify the condition as seq next value = 2 and pass those records to the 2nd target.
In the 3rd group specify the condition as seq next value = 3 and pass those records to the 3rd target.
Since we have enabled the cycle option, after reaching the end value the sequence generator will start again from 1; for the 4th record seq next value is 1, so it will go to the 1st target.
I want to generate the separate file for every State (as per state, it should
generate file).It has to generate 2 flat files and name of the flat file is
corresponding state name that is the requirement.
Below is my mapping.
Source:
AP 2 HYD
AP 1 TPT
KA 5 BANG
KA 7 MYSORE
KA 3 HUBLI
This functionality was added in Informatica 8.5 onwards; in earlier versions it was not there.
We can achieve it with the use of a Transaction Control transformation and the special "FileName" port in the target file.
In order to generate the target file names from the mapping, we should make
use of the special "FileName" port in the target file. You can't create this
special port from the usual New port button. There is a special button with
label "F" on it to the right most corner of the target flat file when viewed in
"Target Designer".
When you have different sets of input data with different target files created,
use the same instance, but with a Transaction Control transformation which
defines the boundary for the source sets.
In the target flat file there is an option in the Columns tab, i.e. FileName as a column. When you click that, one non-editable column gets created in the metadata of the target.
Implementation Procedure:
Double-click on the target definition and click on the Ports tab; on the right side there is a label 'F'. Click on that label and a new port (FileName) is automatically created.
b)Mapping overview:
2.sorter transformation
3.Expression transformation
c) ports&expression in Expression transformation
d)condition in transaction control Transformation:
e) linking between Transaction control and target ports:
At session level specify any name as a output file name with valid output
directory or path.
3) How to concatenate row data through informatica?
Source:
Ename EmpNo
Stev 100
methew 100
John 101
Tom 101
Target:
Ename          EmpNo
Stev,methew    100
John,Tom       101
Approach 1: If the record doesn't exist, do an insert into the target. If it already exists, then get the corresponding Ename value from the lookup, concatenate it in an expression with the current Ename value, and then update the target Ename column using an update strategy.
Approach 2: Sort the data in the SQ based on the EmpNo column, then use an expression to store the previous record's information using variable ports; after that use a router to insert the record if it is the first occurrence, otherwise update the target Ename with the concatenated value of the previous name and the current name.
Implementation Procedure:
a)Mapping overview:
SELECT EMP.NO, EMP.NAME FROM EMP order by EMP.NO
Conditions:
V1=iif(NO != no_v,'i','u' )
Prename= iif(NO=no_v,concat(concat1,concat(',',NAME)),NAME)
c)Router transformations conditions:
4) How to generate Sequence numbers without Seq generator
transformation .
Solution:
We can use a mapping variable and one variable port for increment purposes in an expression, and assign the incremented value to the mapping variable using the SETMAXVARIABLE() function.
a)Mapping overview:
$$SEQ_NO
In 1st Variable port ( SEQ_NO_v ) expression:
SETMAXVARIABLE ($$SEQ_NO,INC_v)
$$SEQ_NO + 1
Create one output port and assign first variable port(SEQ_NO_v) and link to
Target Surrogate key column.
1) How to send Unique (Distinct) records into One target and duplicates
into another tatget?
Source:
Ename EmpNo
stev 100
Stev 100
john 101
Mathew 102
Output:
Target_1:
Ename EmpNo
Stev 100
John 101
Mathew 102
Target_2:
Ename EmpNo
Stev 100
Approach 1: If the record doesn't exist, do an insert into Target_1. If it already exists, then send it to Target_2 using a router.
Approach 2: Sort the data in the SQ based on the EmpNo column, then use an expression to store the previous record's information using variable ports; after that use a router to route the data into the targets: if it is the first occurrence send it to the first target, and if it has already been inserted send it to Target_2.
a. How to process multiple flat files into a single target table through Informatica if all the files have the same structure?
We can process all the flat files through one mapping and one session using a file list. First we need to create the list file for all the flat files using a UNIX script; the extension of the list file is .LST (see the sketch below).
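A minimal UNIX sketch for building the list file (the directory, file pattern and list file name are illustrative):
  ls -1 /data/incoming/emp_*.dat > /data/incoming/emp_files.LST
The session then points to emp_files.LST as the source file name, with the source file type set to Indirect so the Integration Service reads each file named inside the list.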
If both workflows exist in the same folder, we can create two worklets rather than creating two workflows, and control the order within a single workflow.
Setting the dependency between these two workflows using a shell script is one approach.
If the workflows exist in different folders or in different repositories, then we can use the approaches below.
As soon as the first workflow completes, it creates a zero-byte file (indicator file).
If the indicator file is not available, we wait for 5 minutes and then check for the indicator again; we continue this loop for 5 more attempts, i.e. roughly 30 minutes in total.
If the file still does not exist after 30 minutes, we send out an email notification. A shell sketch of this polling logic follows.
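A minimal shell sketch of that polling logic (the indicator file path, mail address and the way the dependent job is started are all illustrative):

#!/bin/ksh
IND_FILE=/data/flags/wf1_done.ind      # zero-byte file created by the first workflow
count=0
while [ $count -lt 6 ]
do
  if [ -f "$IND_FILE" ]
  then
     echo "Indicator file found - starting the dependent workflow"
     # start the second workflow here, e.g. with pmcmd startworkflow
     exit 0
  fi
  sleep 300                            # wait 5 minutes before checking again
  count=`expr $count + 1`
done
echo "Indicator file $IND_FILE not found after 30 minutes" | mailx -s "Workflow dependency failed" support@example.com
exit 1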
Alternatively, we can put an Event Wait task before the actual session in the second workflow to wait for the indicator file. If the file is available the session runs; otherwise the Event Wait task waits indefinitely until the indicator file becomes available.
How to load the cumulative salary into the target?
Solution:
Using variable ports in an Expression transformation we can load the cumulative salary into the target.
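A sketch of the variable-port logic (port names are illustrative; SAL is assumed to be the incoming salary column, and the running total starts at 0 because numeric variable ports initialise to 0):
  SAL        (input)
  v_CUM_SAL  (variable) = v_CUM_SAL + SAL
  o_CUM_SAL  (output)   = v_CUM_SAL
Link o_CUM_SAL to the cumulative salary column of the target.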
SQL Transformation:
E. How to generate multiple records in the target based on a source column value?
Solution:
We can use a SQL transformation in Query mode, passing the source column value into the query dynamically.
Source table:
Name    Address   No
Kiran   Tpt       2
Somu    Kkd       3
Target table:
Name    Address   No
Kiran   Tpt       2
Kiran   Tpt       2
Somu    Kkd       3
Somu    Kkd       3
Somu    Kkd       3
Below is the query used in the SQL transformation (Query mode), where ?NUM1? is the parameter bound from the source number column:
SELECT NAME, ADDR, NUM
FROM (SELECT A.NAME, A.ADDR, A.NUM FROM EMP A, EMP B WHERE A.NUM = ?NUM1?)
WHERE ROWNUM <= ?NUM1?
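If the source is Oracle, an alternative query for the same row multiplication is sometimes used (a sketch; EMP, its columns and the upper bound of 100 are assumptions):
SELECT E.NAME, E.ADDR, E.NUM
FROM   EMP E,
       (SELECT LEVEL AS L FROM DUAL CONNECT BY LEVEL <= 100) T
WHERE  T.L <= E.NUM
Each EMP row is repeated NUM times, as long as NUM does not exceed the generated upper bound.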
4.3 Development Guidelines
General Development Guidelines
The starting point of development is the logical model created by the Data Architect. This logical model forms the foundation for the metadata, which will be continuously maintained throughout the Data Warehouse Development Life Cycle (DWDLC). The logical model is derived from the requirements of the project. At the completion of the logical model, technical documentation is produced defining the sources, targets, requisite business rule transformations, mappings and filters. This documentation serves as the basis for creating the Extraction, Transformation and Loading objects that actually move the data from the application sources into the Data Warehouse/Data Mart.
To start development on any data mart, you should have the following things set up by the Informatica Load Administrator:
Transformation Specifications
While estimating the time required to develop mappings, the following rule of thumb applies.
It is an accepted best practice to always load a flat file into a staging table before any transformations are done on the data in the flat file.
Always use LTRIM, RTRIM functions on string columns before loading data into
a stage table.
You can also use the UPPER function on string columns, but before using it you need to ensure that the data is not case sensitive (e.g. 'ABC' is different from 'Abc'). For example:
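A typical Expression transformation port for this cleanup might look like the following (in_CUST_NAME is an illustrative input port name):
  UPPER(LTRIM(RTRIM(in_CUST_NAME)))
The same expression also works in a SQL override if the trimming is pushed to the database instead.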
If you are loading data from a delimited file then make sure the delimiter is not
a character which could appear in the data itself. Avoid using comma-
separated files. Tilde (~) is a good delimiter to use.
Failure Notification
Once in production, your sessions and batches need to send out a notification to the support team when they fail. You can do this by configuring an email task at the session level (for example, the on-failure email option).
Port Standards:
Input Ports – It will be necessary to change the names of input ports for Lookups, Expressions and Filters where ports might have the same name. If ports do have the same name, they will default to having a number appended after the name. Change this default to a prefix of "in_". This will allow you to keep track of input ports throughout your mappings.
Prefixed with: IN_
Variable Ports – Variable ports within a transformation should be prefixed with "v_". This will allow the developer to distinguish between input/output and variable ports. For more explanation of variable ports see the section "VARIABLES".
Prefixed with: V_
Output Ports – Prefixed with: O_
Quick Reference
Aggregator AGG_<Purpose>
Expression EXP_<Purpose>
Filter FLT_<Purpose>
Rank RNK_<Purpose>
Router RTR_<Purpose>
Mapplet MPP_<Purpose>
4.4 Performance Tips
Tune mappings and sessions for performance so that sessions run within the available load window.
1. Cache lookups if the lookup table has fewer than 500,000 rows, and don't cache lookups on tables with more than 500,000 rows.
3. If a value is used in multiple ports, calculate the value once (in a variable)
and reuse the result instead of recalculating it for multiple ports.
7. Avoid using Stored Procedures, and call them only once during the
mapping if possible.
8. Remember to turn off Verbose logging after you have finished debugging.
9. Use default values where possible instead of using IIF(ISNULL(X), <default>, X) logic in Expression ports.
10. When overriding the Lookup SQL, always ensure you put a valid ORDER BY clause in the SQL. This makes the database perform the ordering, rather than the Informatica Server, while building the cache.
16. Define the source with the smaller number of rows as the master source in Joiner transformations, since this reduces the search time and also the cache size.
19. If the lookup table is on the same database as the source table, join the tables in the Source Qualifier transformation itself instead of using a Lookup transformation, if possible.
20. If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The Informatica Server saves and reuses the cache files from session to session, eliminating the time required to read the lookup table.
24. Reduce the number of rows being cached by using the Lookup SQL Override option to add a WHERE clause to the default SQL statement, for example as in the sketch below.
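A sketch of such a lookup SQL override (the table, columns and filter are illustrative; the trailing comment is the usual way to suppress the ORDER BY that the Integration Service appends, after supplying our own ORDER BY on the lookup ports):
SELECT CUST_ID, CUST_NAME, CUST_STATUS
FROM   CUSTOMER_DIM
WHERE  CUST_STATUS = 'ACTIVE'
ORDER BY CUST_ID --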
Testing regimens:
1. Unit Testing
2. Functional Testing
Later test phases validate the application as it would function in production; this includes security, volume and stress testing.
UTP Template:
Columns: Step | Description | Test Conditions | Expected Results | Actual Results | Pass or Fail (P or F) | Tested By
Interface: SAP-CMS Interfaces

Step 1
Description: Check that the total count of records fetched from the source tables matches the total records in the PRCHG table for a particular session timestamp.
Test Conditions:
  SOURCE: SELECT count(*) FROM XST_PRCHG_STG
  TARGET: SELECT count(*) FROM PRCHG
Expected Results: Both the source and target table load record counts should match.
Actual Results: Same as expected.
Pass or Fail: Pass
Tested By: Stev

Step 2
Description: Check whether all the target columns are getting populated correctly with source data.
Test Conditions:
  SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE
  FROM T_PRCHG
  MINUS
  SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE
  FROM PRCHG
Expected Results: The source-minus-target query should return zero records.
Actual Results: Same as expected.
Pass or Fail: Pass
Tested By: Stev

Step 3
Description: Check the insert strategy used to load records into the target table.
Test Conditions: Identify one record from the source which is not in the target table, then run the session.
Expected Results: It should insert the record into the target table with the source data.
Actual Results: Same as expected.
Pass or Fail: Pass
Tested By: Stev

Step 4
Description: Check the update strategy used to load records into the target table.
Test Conditions: Identify one record from the source which is already present in the target table with a different PRCHG_ST_CDE or PRCHG_TYP_CDE value, then run the session.
Expected Results: It should update the existing record in the target table with the source data.
Actual Results: Same as expected.
Pass or Fail: Pass
Tested By: Stev
5 UNIX
cd /pmar/informatica/pc/pmserver/
2) If we are supposed to process flat files using Informatica but those files exist on a remote server, then we have to write a script to FTP them onto the Informatica server before we start processing those files.
3) File watch: if the indicator file is available in the specified location then we start our Informatica jobs; otherwise we send an email notification using the mailx command saying that the previous jobs did not complete successfully.
4) Using a shell script, update the parameter file with the session start time and end time (see the sketch below). This kind of scripting knowledge I do have; if any new UNIX requirement comes up, I can Google for the solution and implement it the same way.
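A minimal sketch for point 4, writing a timestamp into a workflow parameter file (the folder, workflow, session and parameter names are all illustrative):

#!/bin/ksh
PARAM_FILE=/data/pmparams/wf_load_emp.param
# standard parameter file section header: [Folder.WF:workflow.ST:session]
echo '[MyFolder.WF:wf_load_emp.ST:s_m_load_emp]'        >  $PARAM_FILE
echo '$$LOAD_START_TIME='"$(date '+%m/%d/%Y %H:%M:%S')" >> $PARAM_FILE
# the end time can be appended in the same way after the session finishes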
Basic Commands:
cat file1 – displays the contents of file1 (using cat > file1 you can create a non-zero-byte file: type the content and end with Ctrl+D).
cat file1 file2 > all – combines file1 and file2 into the file 'all' (the file is created if it doesn't exist).
cat file1 >> file2 – appends the contents of file1 to file2.
> redirects output from standard out (the screen) to a file, printer, or whatever you like.
ps -A – lists all running processes.
crontab command:
The crontab command is used to schedule jobs. You must have permission from the UNIX administrator to run this command. Jobs are scheduled using five fields, as follows:
Minutes (0-59)  Hour (0-23)  Day of month (1-31)  Month (1-12)  Day of week (0-6, where 0 is Sunday)
So, for example, if you want to schedule a job that runs the script named backup_jobs in the /usr/local/bin directory on Sunday (day 0) at 22:25 on the 15th of the month, the entry in the crontab file will be (* represents all values):
25 22 15 * 0 /usr/local/bin/backup_jobs
who | wc -l – counts the number of users currently logged in.
ls -l | grep '^d' – lists only the directories in the current directory.
Pipes:
The pipe symbol "|" is used to direct the output of one command to the input
of another.
To display hidden files:
ls -a
find command:
find -name aaa.txt – finds all the files named aaa.txt in the current directory and below.
find / -name vimrc – finds all the files named 'vimrc' anywhere on the system.
find /usr/local/games -name "*xpilot*" – finds all files whose names contain the string 'xpilot' within the '/usr/local/games' directory tree.
sed (the usual sed command for global string search and replace):
If you want to replace 'foo' with the string 'bar' globally in a file:
sed 's/foo/bar/g' inputfile > outputfile
You can find out what shell you are using with the command:
echo $SHELL
The first line of a shell script, for example:
#!/usr/bin/sh
or
#!/bin/ksh
tells the script which interpreter to use. As you know, the bash shell has some specific features that other shells do not have, and vice versa; the same applies to Perl, Python and other languages. In short, it tells your shell which shell (interpreter) to use when executing the statements in your shell script.
Interactive History
A feature of bash and tcsh (and sometimes other shells): you can use the up and down arrow keys to recall and re-run previously typed commands, and the history command to list them.
Basics of the vi editor
Opening a file:
vi filename
Creating text – edit modes (these keys enter an editing mode so you can type the text of your document):
r  Replace 1 character
R  Replace mode (until Esc is pressed)
Deletion of text:
x  Delete the character under the cursor
dd Delete the current line
Saving and quitting:
:w! existing.file  Overwrite an existing file with the file currently being edited.
:q  Quit.
You have successfully completed the Data Warehousing training. Best of luck!