Major Components of Teradata Architecture
The Teradata RDBMS has a "shared-nothing" architecture, which means that the vprocs
(which are the PEs and AMPs) do not share common components. For example, each
AMP manages its own dedicated memory space (taken from the memory pool) and the
data on its own vdisk -- these are not shared with other AMPs. Each AMP uses system
resources independently of the other AMPs so they can all work in parallel for high system
performance overall.
Massively Parallel Processing (MPP): When multiple SMP nodes are connected to form a larger configuration, we refer to this as a Massively Parallel Processing (MPP) system.
Parsing Engine:
A Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components:
1. Session Control,
2. Parser,
3. Optimizer,
4. Dispatcher.
Session Control
The major functions performed by Session Control are logon and logoff. Logon takes a textual request for
session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and
deletes the session’s context.
Parser
The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically.
The PE also consults the Data Dictionary to ensure that all objects and columns exist and that the user has
authority to access these objects.
Optimizer
The Optimizer is responsible for developing the least expensive plan to return the requested response set.
Processing alternatives are evaluated and the fastest alternative is chosen. This alternative is converted to
executable steps, to be performed by the AMPs, which are then passed to the dispatcher.
Dispatcher
The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the
BYNET. It is composed of execution control and response-control tasks. Execution control receives the step
definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status
reports from the AMPs as they process the steps, and passes the results on to response control once the
AMPs have
completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs
have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step
will be sent to one AMP, or broadcast to all AMPs.
The BYNET handles the internal communication of the Teradata RDBMS. All communication between PEs
and AMPs is done via the BYNET.
When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. The
messages are routed to the appropriate AMP(s) where results sets and status information are generated.
This response information is also routed back to the requesting PE via the BYNET.
Once the message is on a participating node, PDE handles the multicast (carrying the message to just the AMPs that should get it). So, while a Teradata system does do multicast messaging, the BYNET hardware alone cannot do it; the BYNET can only do point-to-point and broadcast between nodes.
FEATURES OF BYNET:
The BYNET has several unique features:
Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either
network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally,
in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are
re-routed to BYNET 1 (or equally distributed if there are more than two BYNETs present), and vice versa.
Load balanced: traffic is automatically and dynamically distributed between both BYNETs.
Scalable: as you add nodes to the system, overall network bandwidth scales linearly - meaning an increase
in system size without loss of performance.
High Performance: an MPP system typically has two or more BYNET networks. Because all networks are
active, the system benefits from the full aggregate bandwidth of all networks. Since the number of networks
can be scaled, performance can also be scaled to meet the needs of demanding applications. The
technology of the BYNET is what makes the Teradata parallelism possible.
The Database Manager subsystem resides on each AMP. The Database Manager:
• Receives the steps from the Dispatcher and processes the steps. It has the ability to:
− Lock databases and tables.
− Create, modify, or delete definitions of tables.
− Insert, delete, or modify rows within the tables.
− Retrieve information from definitions and tables.
• Collects accounting statistics, recording accesses by session so
users can be billed appropriately.
• Returns responses to the Dispatcher.
The Database Manager provides a bridge between the logical organization of the data (databases, tables, and rows) and the physical organization of
the data on disks. The Database Manager performs a space-management function that controls the use and
allocation of space.
A disk array is a configuration of disk drives that utilizes specialized controllers to manage and distribute
data and parity across the disks while providing fast access and data integrity.
Each AMP vproc must have access to an array controller that in turn accesses the physical disks. AMP
vprocs are associated with one or more ranks of data. The total disk space associated with an AMP vproc is
called a vdisk. A vdisk may have up to three ranks.
Teradata supports several protection schemes:
• RAID Level 5—Data and parity protection striped across multiple disks.
• RAID Level 1—Each disk has a physical mirror replicating the data.
• RAID Level S—Data and parity protection similar to RAID 5, but used for EMC disk arrays.
The disk array controllers are referred to as dual active array controllers, which means that both
controllers are actively used in addition to serving as backup for each other.
Query Parallelism:
Breaking the request into smaller components, all components being worked on at the same time,
with one single answer delivered. Parallel execution can incorporate all or part of the operations
within a query, and can significantly reduce the response time of an SQL statement, particularly if the
query reads and analyzes a large amount of data.
Query parallelism is enabled in Teradata by hash-partitioning the data across all the VPROCs defined in the system. A VPROC provides all the database services on its allocation of data blocks. All relational operations such as table scans, index scans, projections, selections, joins, aggregations, and sorts execute in parallel across all the VPROCs simultaneously and unconditionally. Each operation is performed on a VPROC's data independently of the data associated with the other VPROCs.
To locate a row, the AMP file system searches through a memory-resident structure called the Master Index.
An entry in the Master Index will indicate that if a row with this Table ID and row hash exists, then it must be
on a specific disk cylinder.
The file system will then search through the designated Cylinder Index. There it will find an
entry that indicates that if a row with this Table ID and row hash exists, it must be in one specific data block
on that cylinder.
The file system then searches the data block until it locates the row(s) or returns a No Rows Found condition
code.
Data Retrieval:
Retrieving data from the Teradata RDBMS simply reverses the storage model process. A request made for data is passed on to a Parsing Engine (PE). The PE optimizes the request for efficient processing and creates tasks for the AMPs to perform, which results in the request being satisfied. Tasks are then dispatched to the AMPs via the BYNET. Often, all AMPs must participate in creating the answer set, such as when returning all rows of a table to a client application. Other times, only one or a few AMPs need participate. The PE will ensure that only the AMPs that need to participate are assigned tasks. Once the AMPs have been given their assignments, they retrieve the desired rows from their respective disks. The AMPs will sort, aggregate, or format the rows if needed. The rows are then returned to the requesting PE via the BYNET. The PE takes the returned answer set and returns it to the requesting client application.
When a user writes an SQL query that has a SI in the WHERE clause, the Parsing Engine will hash the
Secondary Index Value. The output is the Row Hash of the SI. The PE creates a request containing the Row
Hash and gives the request to the Message Passing Layer (which includes the BYNET software and
network). The Message Passing Layer uses a portion of the Row Hash to point to a bucket in the Hash Map.
That bucket contains an AMP number to which the PE's request will be sent. The AMP gets the request and
accesses the Secondary Index Subtable pertaining to the requested SI information. The AMP will check to
see if the Row Hash exists in the subtable and double check the subtable row with the actual secondary
index value. Then, the AMP will create a request containing the Primary Index Row ID and send it back to
the Message Passing Layer. This request is directed to the AMP with the base table row, and the AMP easily
retrieves the data row.
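A small sketch of the access path just described (badge_no is a hypothetical column; the employee table and its other columns follow the examples elsewhere in these notes):
CREATE UNIQUE INDEX (badge_no) ON employee;
/* A query that supplies the USI value in the WHERE clause follows the
   two-AMP path above: hash the SI value, read the SI subtable on one AMP,
   then fetch the base row on the AMP identified by the Primary Index Row ID. */
SELECT lname, salary
FROM   employee
WHERE  badge_no = 1234;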
Queries you make that do not specify database name will be made against your default database.
For example:
DATABASE birla;
This sets your default database to birla; subsequent unqualified queries are made against the birla database.
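For instance, assuming a table named employee exists in birla:
DATABASE birla;
SELECT COUNT(*) FROM employee;   /* resolved as birla.employee */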
12.What is a cluster?
A cluster is a group of AMPs that act as a single fallback unit. Clustering has no effect on primary row
distribution of the table, but the fallback row copy will always go to another AMP in the same cluster.
Should an AMP fail, the primary and fallback row copies stored on that AMP cannot be accessed.
However, their alternate copies are available through the other AMPs in the same cluster.
The loss of an AMP in one cluster has no effect upon other clusters. It is possible to lose one AMP in each
cluster and still have full access to all fallback-protected table data. If there are two AMP failures in the same
cluster, the entire Teradata system halts. While an AMP is down, the remaining AMPs in the cluster must do
their own work plus the work of the down AMP.
For example, an 8-AMP system might be set up in two clusters of 4 AMPs each.
The Call Level Interface (CLI) is the lowest level interface to the Teradata RDBMS. It consists of system
calls which create sessions, allocate request and response buffers, create and de-block “parcels” of
information, and fetch response information to the requesting client.
The Teradata Director Program (TDP) is a Teradata-supplied program that must run on any client system
that will be channel-attached to the Teradata RDBMS. The TDP manages the session traffic between the
Call-Level Interface and the RDBMS. Its functions include session initiation and termination, logging,
verification, recovery, and restart, as well as physical input to and output from the PEs, (including session
balancing) and the maintenance of queues. The TDP may also handle system security.
The Host Channel Adapter is a mainframe hardware component that allows the mainframe to connect to an
ESCON or Bus/Tag channel.
The PBSA (PCI Bus ESCON Adapter) is a PCI adapter card that allows a WorldMark server to connect to an
ESCON channel.
The PBCA (PCI Bus Channel Adapter) is a PCI adapter card that allows a WorldMark server to connect to a
Bus/Tag channel.
The Call Level Interface (CLI) is a library of routines that resides on the client side. Client
application programs use these routines to perform operations such as logging on and off, submitting SQL
queries and receiving responses which contain the answer set. These routines are 98% the same in a
network-attached environment as they are in a channel-attached one.
The Teradata ODBC™ (Open Database Connectivity) driver uses an open standards-based
ODBC interface to provide client applications access to Teradata across LAN-based
environments. NCR has ODBC drivers for both UNIX and Windows-based applications.
The Micro Teradata Director Program (MTDP) is a Teradata-supplied program that must be linked to any
application that will be network-attached to the Teradata RDBMS. The MTDP performs many of the functions
of the channel based TDP including session management. The MTDP does not control session balancing
across PEs. Connect and Assign Servers that run on the Teradata system handle this activity.
The Micro Operating System Interface (MOSI) is a library of routines providing operating system
independence for clients accessing the RDBMS. By using MOSI, we only need one version of the MTDP to
run on all network-attached platforms.
15.How do you replace a null value with a default value while loading?
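One common approach (a sketch only; target_table, staging_table and the default values are assumptions) is to substitute the default in the INSERT/SELECT, using COALESCE (covered later in these notes) or ZEROIFNULL for numeric columns:
INSERT INTO target_table (emp, dept, salary)
SELECT emp
     , COALESCE(dept, 999)      -- replace a NULL department with the default 999
     , ZEROIFNULL(salary)       -- replace a NULL salary with 0
FROM   staging_table;
A DEFAULT clause on the column definition achieves the same effect when the loaded value is omitted.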
16.What is COMPRESS?
COMPRESS: By default, COMPRESS compresses NULL values. To compress other values explicitly, list the characters or values to be compressed in the COMPRESS clause of the column definition.
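A small sketch of the syntax (the values are chosen only for illustration):
CREATE TABLE employee_c
( emp    INTEGER
, dept   INTEGER  COMPRESS (100, 200, 300)     -- NULLs plus these values are compressed
, lname  CHAR(20) COMPRESS ('Smith', 'Jones')
, salary DECIMAL(10,2))
PRIMARY INDEX (emp);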
17.How many values can we compress in Teradata?
Up to 255 distinct values (plus NULL) can be compressed per column. Any column can be compressed except index columns; compression is also not allowed on volatile tables.
18. What is the difference between a volatile table and a global temporary table?
A volatile table's definition and rows exist only for the duration of the session that creates it; nothing is stored in the Data Dictionary and the rows are held in spool space. A global temporary table's definition is stored in the Data Dictionary, but each session that references it gets its own materialized instance in temporary space. In both cases the rows are discarded when the session ends.
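A small sketch of the two DDL forms (table and column names are assumptions):
CREATE VOLATILE TABLE vt_demo
( c1 INTEGER )
ON COMMIT PRESERVE ROWS;

CREATE GLOBAL TEMPORARY TABLE gt_demo
( c1 INTEGER )
ON COMMIT PRESERVE ROWS;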
Primary Key:
A relational concept used to determine relationships among entities and to define referential
constraints.
Not required, unless referential integrity checks are to be performed.
Defined by the CREATE TABLE statement.
Unique.
Identifies each row uniquely.
Values cannot be changed.
Cannot be null.
Not related to the access path.
Primary Index:
Used to store rows on disk.
Defined by the CREATE TABLE statement.
Unique or non-unique.
Used to distribute the rows to the AMPs.
Values can be changed.
Can be null.
Related to the access path.
21.What is TDPID?
The TDPID identifies the Teradata system that a client connects to; in a network-attached environment it resolves to the host name or IP address of the Teradata server machine.
22.What is tenacity?
Specifies the number of hours that Teradata FastLoad keeps trying to log on when the maximum number of load jobs is already running on the Teradata database.
23.What is Sleep?
Specifies the number of minutes that Teradata FastLoad pauses before retrying a logon operation.
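A hedged sketch of how these appear at the top of a FastLoad script, before the LOGON (the logon string and the values are placeholders): TENACITY 4 keeps retrying the logon for up to 4 hours, and SLEEP 6 waits 6 minutes between attempts.
SESSIONS 4;
TENACITY 4;
SLEEP 6;
LOGON tdpid/username,password;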
MLOAD:
MultiLoad allows non-unique secondary indexes and automatically rebuilds them after loading.
MultiLoad can load at most five tables at a time and can also update and delete data.
FastLoad:
FastLoad loads data in two phases and does not need a work table, so it is faster. It loads a table as follows:
Phase 1 (acquisition): the records are sent to the AMPs in blocks; each AMP then hashes its rows and redistributes them to the appropriate AMPs.
Phase 2 (application): after the END LOADING command is given, each AMP sorts its rows by row hash and writes them into the target table.
FastLoad can only load empty tables, is very fast, and can load one table at a time.
When the partitioning column of a partitioned table is also the primary index column, a query that supplies the PI value can be directed straight to the relevant partition. So even if there are many partitions, the table can be scanned quickly using the PI value itself.
29.Teradata joins?
Join Processing
A join is the combination of two or more tables in the same FROM clause of a single SELECT statement. When writing a join, the key is to locate a column in both tables that comes from a common domain. Like the correlated subquery, joins are normally based on an equal comparison between the join columns.
The JOIN keyword is used in an SQL statement to query data from two or more tables, based on a relationship between certain columns in these tables.
Self Join
A Self Join is simply a join that uses the same table more than once in a single join operation. The
first requirement for this type of join is that the table must contain two different columns of the same
domain. This may involve de-normalized tables.
For instance, if the Employee table contained a column for the manager's employee number and
the manager is an employee, these two columns have the same domain. By joining on these two
columns in the Employee table, the managers can be joined to the employees.
Example:
SELECT Mgr.Last_name (Title 'Manager Name', format 'X(10)')
,Department_name (Title 'For Department ')
FROM Employee_table AS Emp
INNER JOIN Employee_table AS Mgr
ON Emp.Manager_Emp_ID = Mgr.Employee_Number
INNER JOIN Department_table AS Dept
ON Emp.Department_number = Dept.Department_number
ORDER BY 2 ;
INNER JOIN:
The INNER JOIN keyword returns rows when there is at least one match in both tables:
SELECT column_name(s)
FROM table_name1
INNER JOIN table_name2
ON table_name1.column_name=table_name2.column_name
The LEFT OUTER JOIN keyword returns all rows from the left table (table_name1), even if there are
no matches in the right table (table_name2).
SELECT column_name(s)
FROM table_name1
LEFT OUTER JOIN table_name2
ON table_name1.column_name=table_name2.column_name
The RIGHT OUTER JOIN keyword returns all rows from the right table (table_name2), even if there
are no matches in the left table (table_name1).
SELECT column_name(s)
FROM table_name1
RIGHT OUTER JOIN table_name2
ON table_name1.column_name=table_name2.column_name
The FULL OUTER JOIN keyword returns rows when there is a match in either of the tables.
FULL OUTER JOIN Syntax:
SELECT column_name(s)
FROM table_name1
FULL OUTER JOIN table_name2
ON table_name1.column_name=table_name2.column_name
A FULL OUTER JOIN uses both of the tables as outer tables. The non-matching (exception) rows are returned from both tables, and the missing column values from either table are extended with NULL.
Product Join
It is very important to use an equality condition in the WHERE clause; otherwise you get a product join. This means that every qualifying row of one table is joined to every qualifying row of the other table. It is called a product join because the number of row comparisons is the multiplication (product) of the two tables' row counts.
30. Difference between primary index and secondary index?
1. A primary index cannot be created after the table is created (it is defined at CREATE TABLE time), whereas a secondary index can be created and dropped dynamically.
2. Access via a primary index is a one-AMP operation; access via a unique secondary index is a two-AMP operation; access via a non-unique secondary index is an all-AMP operation.
There are two types of journals: (1) the permanent journal and (2) the transient journal.
The purpose of the permanent journal is to provide selective or full database recovery to a
specified point in time. It permits recovery from unexpected hardware or software disasters. The
permanent journal also reduces the need for full table backups that can be costly in both time and
resources.
1. Permanent journals are explicitly created during database and/or table creation time. This
journaling can be implemented depending upon the need and available disk space.
PJ processing is a user-selectable option on a database which allows the user to select extra journaling for changes made to a table. There are more options, and the data can be rolled forward or backward (depending on whether you selected the correct options) at points of the customer's choosing. They are permanent because the changes are kept until the customer deletes them or unloads them to a backup tape. They are usually kept in conjunction with backups of the database and allow partial rollback or roll-forward for corrupted data or an operational error, such as someone deleting a month's worth of data because they messed up the WHERE clause.
2.Transient Journal
The transient journal permits the successful rollback of a failed transaction (TXN). Transactions
are not committed to the database until the AMPs have received an End Transaction request,
either implicitly or explicitly. There is always the possibility that the transaction may fail. If
so, the participating table(s) must be restored to their pre-transaction state.
The transient journal maintains a copy of before images of all rows affected by the transaction. In
the event of transaction failure, the before images are reapplied to the affected tables, then are
deleted from the journal, and a rollback operation is completed. In the event of transaction
success, the before images for the transaction are discarded from the journal at the point of
transaction commit.
The following is a FastExport script fragment. Completed into a minimal runnable script (the logon string, file names, and table/column names are placeholders), it would look like this:
.LOGTABLE RestartLog1_fxp;
.LOGON tdpid/username,password;
.BEGIN EXPORT;
.LAYOUT Record_Layout;
.FIELD in_City 1 CHAR(20);
.FIELD in_Zip * CHAR(5);
.IMPORT INFILE input_data LAYOUT Record_Layout;
.EXPORT OUTFILE export_data;
SELECT City, Zip FROM Address_Table
WHERE City = :in_City AND Zip = :in_Zip;
.END EXPORT;
.LOGOFF;
33.Teradata statistics.
Statistics collection is essential for the optimal performance of the Teradata query optimizer. The query
optimizer relies on statistics to help it determine the best way to access data. Statistics also help the
optimizer ascertain how many rows exist in tables being queried and predict how many rows will qualify for
given conditions. Lack of statistics, or out-dated statistics, might result in the optimizer choosing a less-than-
optimal method for accessing data tables.
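Typical statements for collecting statistics (a sketch; the table and column names reuse the employee examples elsewhere in these notes):
COLLECT STATISTICS ON employee COLUMN dept;
COLLECT STATISTICS ON employee INDEX (emp);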
Points:
1: Once COLLECT STATISTICS is done on a table (on an index or column), where is this information stored so that the optimizer can refer to it?
Ans: Collected statistics are stored in DBC.TVFields or DBC.Indexes. However, you cannot query these two tables directly.
2: How often do statistics have to be re-collected for a table that is frequently updated?
Answer: You need to refresh statistics when 5 to 10% of the table's rows have changed. Collecting statistics can be quite resource-consuming for large tables, so it is advisable to schedule the job for an off-peak period, normally after approximately 10% of the data has changed.
3: Once COLLECT STATISTICS has been done on a table, how can I be sure that the optimizer is considering it before execution? That is, until the next COLLECT STATISTICS, will the optimizer refer to it?
Ans: Yes, the optimizer will use the statistics data for the query execution plan if it is available. That is why stale statistics are dangerous: they may mislead the optimizer.
When the workload is not distributed across all the AMPs, only a few AMPs end up overburdened with the
work. This is a hot AMP condition.
This typically occurs when the volume of data you are dealing with is high and
(a). You are trying to retrieve the data in a TERADATA table which is not well distributed across the AMPs on
the system (bad Primary Index)
OR
(b). When you are trying to join on a column with highly non-unique values
OR
(c). When you apply the DISTINCT operator on a column with highly non-unique values
4: How can I know the tables for which COLLECT STATISTICS has been done?
Ans: Run the HELP STATISTICS command on the table, e.g. HELP STATISTICS table_name; This gives the date and time when statistics were last collected, as well as the statistics for the columns on which they were defined. You can use Teradata Manager too.
5: To what extent will there be performance issues when statistics are not collected? Can a performance issue be due only to missing statistics? Probably a hot AMP could be the reason for the lack of spool space that is leading to performance degradation.
Ans: First part: Teradata uses a cost-based optimizer, and cost estimates are based on statistics. If you don't have statistics collected, the optimizer uses a dynamic AMP sampling method to estimate them. If the table is big and the data is unevenly distributed, dynamic sampling may not get the right information and performance will suffer.
Second part: No, performance can also be related to a bad choice of indexes (most importantly the PI) and the access path of a particular query.
6: What else can lead to a lack of spool space, apart from a hot AMP?
Ans: One reason that comes to mind: a product join on two big data sets may lead to a lack of spool space.
Example: You have a database with total disk space of 100GB. You have
10GB of user data and an additional 10GB of overhead. What is the
maximum amount of spool space available for queries?
Answer: 80GB. All of the remaining space in the system is available for spool
Temp Space:
The third type of space is temporary space. Temp space is used for global temporary tables, and these results remain available to the user until the session is terminated. Tables created in temp space will survive a restart. (Volatile tables, by contrast, are held in spool space.)
Therefore, to eliminate confusion it is important to explicitly define which one is desired. Otherwise, you must know in which mode the CREATE TABLE will execute so that the correct type is used for each table. The implications of using a SET or MULTISET table are discussed further below.
A SET table does not allow duplicate rows so Teradata checks to ensure that no two rows in a table are
exactly the same. This can be a burden. One way around the duplicate row check is to have a column in
the table defined as UNIQUE. This could be a Unique Primary Index (UPI), Unique Secondary Index
(USI) or even a column with a UNIQUE or PRIMARY KEY constraint. Since all must be unique, a
duplicate row may never exist. Therefore, the check on either the index or constraint eliminates the
need for the row to be examined for uniqueness. As a result, inserting new rows can be much faster by
eliminating the duplicate row check.
However, if the table is defined with a NUPI and the table uses SET as the table type, now a duplicate
row check must be performed. Since SET tables do not allow duplicate rows a check must be
performed every time a NUPI DUP (duplicate of an existing row NUPI value) value is inserted or
updated in the table. Do not be fooled! A duplicate row check can be a very expensive operation in
terms of processing time. This is because every new row inserted must be checked to see if it is a
duplicate of any existing row with the same NUPI Row Hash value. The number of checks increases
exponentially as each new row is added to the table.
What is the solution? There are two: either make the table a MULTISET table (only if you want duplicate
rows to be possible) or define at least one column or composite columns as UNIQUE. If neither is an
option then the SET table with no unique columns will work, but inserts and updates will take more time
because of the mandatory duplicate row check.
The following is an example of creating the same table as before, but this time as a MULTISET table:
CREATE MULTISET TABLE employee
( emp INTEGER
,dept INTEGER
,lname CHAR(20)
,fname VARCHAR(20)
,salary DECIMAL(10,2)
,hire_date DATE )
PRIMARY INDEX(emp);
Notice also that the PI is now a NUPI because it does not use the word UNIQUE. This is important! As
mentioned previously, if the UPI is requested, no duplicate rows can be inserted. Therefore, it acts more
like a SET table. This MULTISET example allows duplicate rows. Inserts will take longer because of the
mandatory duplicate row check.
Points:
When a fallback-protected AMP goes down during a write operation, the update is applied to the fallback copy on another AMP in the same cluster, and is applied to the original AMP later, when it recovers. While an AMP is down, updates are also recorded in the Down-AMP Recovery Journal so that they can be applied when the AMP recovers.
Doubt: when an AMP goes down, are the updates made in both the fallback rows and the Down-AMP Recovery Journal?
Answer:
1. The Down-AMP Recovery Journal starts when an AMP goes down, to record the changes needed to restore the data for the down AMP.
2. Fallback provides redundant data: if one AMP in the cluster goes down, your queries are not affected because they use the fallback rows.
So if you run an update while the AMP is down, the fallback rows are updated. If you run a query while the AMP is still down, it uses the updated fallback rows. Whenever the down AMP becomes active again, the Down-AMP Recovery Journal is used to bring its data up to date.
41. What takes care of things when an AMP goes down?
The Down-AMP Recovery Journal starts when an AMP goes down and records the changes needed to restore the data for the down AMP. Together with the fallback rows in the same cluster (as described above), this lets queries keep running while the AMP is down and lets the AMP be brought up to date when it recovers.
The EXPLAIN facility allows you to preview how Teradata will execute a requested query. It
returns a summary of the steps the Teradata RDBMS would perform to execute the request.
EXPLAIN also discloses the strategy and access method to be used, how many rows will be
involved, and the estimated cost in minutes and seconds. Use EXPLAIN to evaluate query performance
and to develop an alternative processing strategy that may be more efficient. EXPLAIN works on
any SQL request. The request is fully parsed and optimized, but not run. The complete plan is
returned to the user in readable English statements.
EXPLAIN provides information about locking, sorting, row selection criteria, join strategy and
conditions, access method, and parallel step processing. EXPLAIN is useful for performance
tuning, debugging, pre-validation of requests, and for technical training.
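A minimal illustration, using the employee table from the examples elsewhere in these notes (the request is parsed and optimized but not run):
EXPLAIN
SELECT lname, salary
FROM   employee
WHERE  dept = 100;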
The newer ANSI standard COALESCE can also convert a NULL to a zero. However, it can convert a NULL
value to any data value as well. The COALESCE searches a value list, ranging from one to many values,
and
returns the first Non-NULL value it finds. At the same time, it returns a NULL if all values in the list are
NULL.
To use COALESCE, the SQL must pass the name of a column to the function. The data in the column is then compared against NULL. Although one column name is all that is required, normally more than one column is passed to it. Additionally, a literal value, which is never NULL, can be included to provide a default value if all of the preceding column values are NULL.
The syntax is COALESCE(<column-list>), where <column-list> is a list of columns written as a series of column names separated by commas.
SELECT COALESCE(NULL,0) AS Col1
,COALESCE(NULL,NULL,NULL) AS Col2
,COALESCE(3) AS Col3
,COALESCE('A',3) AS Col4 ;
A role can be assigned a collection of access rights in the same way a user can.
You then grant the role to a set of users, rather than grant each user the same rights.
This cuts down on maintenance, adds standardisation (hence reducing erroneous access to sensitive data)
and reduces the size of the dbc.allrights table, which is very important in reducing DBC blocking in a large
environment.
Profiles assign different characteristics on a User, such as spool space, permspace and account strings.
Again this helps with standardisation. Note that spool assigned to a profile will overrule spool assigned on a
create user statement. Check the on line manuals for the full lists of properties
Data Control Language is used to restrict or permit a user's access. It can selectively limit a user's ability to
retrieve, add, or modify data. It is used to grant and revoke access privileges on tables and views.
Both may own objects such as tables, views, macros, procedures, and functions. Both users and databases
may hold privileges. However, only users may log on, establish a session with the Teradata Database, and
submit requests.
A user performs actions whereas a database is passive. Users have passwords and startup strings;
databases do not. Users can log on to the Teradata Database, establish sessions, and submit SQL
statements; databases cannot.
Creator privileges are associated only with a user because only a user can log on and submit a CREATE
statement. Implicit privileges are associated with either a database or a user because each can hold an
object and an object is owned by the named space in which it resides
47.How many mload scripts are required for the below scenario
First I want to load data from source to volatile table.
After that I want to load data from volatile table to Permanent table.
48.What are the types of CASE statements available in Teradata?
The CASE function provides an additional level of data testing after a row is accepted by the WHERE
clause. The additional test allows for multiple comparisons on multiple columns with multiple outcomes.
It also incorporates logic to handle a situation in which none of the values compares equal.
When using CASE, each row retrieved is evaluated once by every CASE function. Therefore, if two
CASE operations are in the same SQL statement, each row has a column checked twice, or two
different values each checked one time.
Types:
1.Flexible Comparisons within CASE
When it is necessary to compare more than just equal conditions within the CASE, the format is
modified slightly to handle the comparison. Many people prefer to use the following format because it is
more flexible and can compare inequalities as well as equalities.
This is a more flexible form of the CASE syntax and allows for inequality tests:
CASE
WHEN <condition-test1> THEN <true-result1>
WHEN <condition-test2> THEN <true-result2>
WHEN <condition-testN> THEN <true-resultN>
[ ELSE <false-result> ]
END
The above syntax shows that multiple tests can be made within each CASE. The value stored in the
column continues to be tested until it finds a true condition. At that point, it does the THEN portion and
exits the CASE logic by going directly to the END.
In this section, we will investigate adding more power to the CASE statement. In the above examples, a
literal value was returned. In most cases, it is necessary to return data. The returned value can come
from a column name just like any selected column or a mathematical operation.
Additionally, the above examples used a literal ‘=’ as the comparison operator. The CASE comparisons
also allow the use of IN, BETWEEN, NULLIF and COALESCE. In reality, the BETWEEN is a compound
comparison. It checks for values that are greater than or equal to the first number and less than or equal
to the second number.
The next example uses both formats of the CASE in a single SELECT with each one producing a
column display. It also uses AS to establish an alias after the END:
SELECT CASE WHEN Grade_pt IS NULL THEN 'Grade Point Unknown'
WHEN Grade_pt IN (1,2,3) THEN 'Integer GPA'
WHEN Grade_pt BETWEEN 1 AND 2 THEN 'Low Decimal value'
WHEN Grade_pt < 3.99 THEN 'High Decimal value'
ELSE '4.0 GPA'
END AS Grade_Point_Average
,CASE Class_code
WHEN 'FR' THEN 'Freshman'
WHEN 'SO' THEN 'Sophomore'
WHEN 'JR' THEN 'Junior'
WHEN 'SR' THEN 'Senior'
ELSE 'Unknown Class'
END AS Class_Description
FROM Student_table
ORDER BY Class_code ;
Another interesting usage for the CASE is to perform horizontal reporting. Normally, SQL does vertical
reporting. This means that every row returned is shown on the next output line of the report as a
separate line. Horizontal reporting shows the output of all information requested on one line as columns
instead of vertically as rows.
Previously, we discussed aggregation. It eliminates detail data and outputs only one line or one line per
unique value in the non-aggregate column(s) when utilizing the GROUP BY. That is how vertical
reporting works, one output line below the previous. Horizontal reporting shows the next value on the
same line as the next column, instead of the next line.
Using the next SELECT statement, we achieve the same information in a horizontal reporting format by
making each value a column:
SELECT AVG(CASE Class_code
WHEN 'FR' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Freshman_GPA
,AVG(CASE Class_code
WHEN 'SO' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Sophomore_GPA
,AVG(CASE Class_code
WHEN 'JR' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Junior_GPA
,AVG(CASE Class_code
WHEN 'SR' THEN Grade_pt
ELSE NULL END) (format 'Z.ZZ') AS Senior_GPA
FROM Student_Table
WHERE Class_code IS NOT NULL ;
After becoming comfortable with the previous examples of the CASE, it may become apparent that a single check on a column is not sufficient for more complicated requests. When that is the situation, one CASE can be embedded within another. These are called nested CASE statements.
The CASE may be nested to check data in a second column in a second CASE before determining
what value to return. It is common to have more than one CASE in a single SQL statement. However, it
is powerful enough to have a CASE statement within a CASE statement.
Example:
SELECT Last_name
,CASE Class_code WHEN 'JR'
THEN 'Junior ' ||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding' END)
ELSE 'Senior ' ||(CASE WHEN Grade_pt < 2 THEN 'Failing'
WHEN Grade_pt < 3.5 THEN 'Passing'
ELSE 'Exceeding' END)
END AS Current_Status
FROM Student_Table
WHERE Class_code IN ('JR','SR')
ORDER BY class_code, last_name;
51. What is the SQL to find the AMP and the number of records stored for a particular table?
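One common answer (a sketch, assuming the table is employee with primary index column emp; HASHROW, HASHBUCKET and HASHAMP are the standard hashing functions):
SELECT HASHAMP(HASHBUCKET(HASHROW(emp))) AS amp_no
     , COUNT(*)                          AS row_count
FROM   employee
GROUP BY 1
ORDER BY 1;
This shows, per AMP, how many of the table's rows hash to it via its primary index.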
52 When a PI is not mentioned on a table, how will Teradata consider the PI for that table?
If you don't specify a PI at table create time then Teradata must chose one. For instance, if the DDL is
ported from another database that uses a Primary Key instead of a Primary Index, the CREATE TABLE
contains a PRIMARY KEY (PK) constraint. Teradata is smart enough to know that Primary Keys must
be unique and cannot be null. So, the first level of default is to use the PRIMARY KEY column(s) as a
UPI.
If the DDL defines no PRIMARY KEY, Teradata looks for a column defined as UNIQUE. As a second
level default, Teradata uses the first column defined with a UNIQUE constraint as a UPI.
If none of the above attributes are found, Teradata uses the first column defined in the table as a
NON-UNIQUE PRIMARY INDEX (NUPI).
If a SELECT query references only columns that are defined in the join index, such a query is called a covered query.
Multi-column NUSI columns can also be used to cover a query.
After all business requirements have been gathered for a proposed database, they must be modeled. Models
are created to visually represent the proposed database so that business requirements can easily be
associated with database objects to ensure that all requirements have been completely and accurately
gathered. Different types of diagrams are typically produced to illustrate the business processes, rules,
entities, and organizational units that have been identified. These diagrams often include entity relationship
diagrams, process flow diagrams, and server model diagrams. An entity relationship diagram (ERD)
represents the entities, or groups of information, and their relationships maintained for a business. Process
flow diagrams represent business processes and the flow of data between different processes and entities
that have been defined. Server model diagrams represent a detailed picture of the database as being
transformed from the business model into a relational database with tables, columns, and constraints.
Basically, data modeling serves as a link between business needs and system requirements.
There are two types of data modeling: logical modeling and physical modeling.
If you are going to be working with databases, then it is important to understand the difference between
logical and physical modeling, and how they relate to one another. Logical and physical modeling are
described in more detail in the following subsections.
Logical Modeling
Logical modeling deals with gathering business requirements and converting those requirements into a
model. The logical model revolves around the needs of the business, not the database, although the needs
of the business are used to establish the needs of the database. Logical modeling involves gathering
information about business processes, business entities (categories of data), and organizational units. After
this information is gathered, diagrams and reports are produced including entity relationship diagrams,
business process diagrams, and eventually process flow diagrams. The diagrams produced should show the
processes and data that exists, as well as the relationships between business processes and data. Logical
modeling should accurately render a visual representation of the activities and data relevant to a particular
business.
The diagrams and documentation generated during logical modeling are used to determine whether the
requirements of the business have been completely gathered. Management, developers, and end users alike
review these diagrams and documentation to determine if more work is required before physical modeling
commences.
Physical Modeling
Physical modeling involves the actual design of a database according to the requirements that were
established during logical modeling. Logical modeling mainly involves gathering the requirements of the
business, with the latter part of logical modeling directed toward the goals and requirements of the database.
Physical modeling deals with the conversion of the logical, or business model, into a relational database
model. When physical modeling occurs, objects are being defined at the schema level. A schema is a group
of related objects in a database. A database design effort is normally associated with one schema.
During physical modeling, objects such as tables and columns are created based on entities and attributes
that were defined during logical modeling. Constraints are also defined, including primary keys, foreign keys,
other unique keys, and check constraints. Views can be created from database tables to summarize data or
to simply provide the user with another perspective of certain data. Other objects such as indexes and
snapshots can also be defined during physical modeling. Physical modeling is when all the pieces come
together to complete the process of defining a database for a business.
Physical modeling is database software specific, meaning that the objects defined during physical modeling
can vary depending on the relational database software being used. For example, most relational database
systems have variations with the way data types are represented and the way data is stored, although basic
data types are conceptually the same among different implementations. Additionally, some database
systems have objects that are not available in other database systems.
Derived tables are always local to a single SQL request. They are built dynamically using an additional
SELECT within the query. The rows of the derived table are stored in spool and discarded as soon as
the query finishes. The DD has no knowledge of derived tables. Therefore, no extra privileges are
necessary. Its space comes from the user's spool space.
Following is a simple example using a derived table named DT with a column alias called avgsal and its
data value is obtained using the AVG aggregation:
SELECT *
FROM (SELECT AVG(salary) FROM Employee_table) DT(avgsal) ;
In Teradata, the additional key phase: WITH CHECK OPTION, indicates that the WHERE clause
conditions should be applied during the execution of an UPDATE or DELETE against the view.
This is not a concern if views are not used for maintenance activity due to restricted privileges.
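A small sketch (column names follow the employee table created earlier in these notes; the view name and department value are assumptions):
CREATE VIEW dept_401_emp AS
SELECT emp, dept, lname, salary
FROM   employee
WHERE  dept = 401
WITH CHECK OPTION;
With this, an UPDATE issued through the view cannot change a row so that it no longer satisfies dept = 401.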
Soft RI is just an indication that there is a PK-FK relationship between the columns; the constraint is not actually enforced on the Teradata side.
But having it helps in cases like join processing (join elimination), etc.
Batch:
- Tests an entire insert, delete, or update batch operation for referential integrity.
- If insertion, deletion, or update of any row in the batch violates referential integrity, then parsing engine
software rolls back the entire batch and returns an abort message.
Let's say that I had a table called X with some number of rows and I wanted to insert these rows into table Y (INSERT INTO Y SELECT * FROM X). However, some of the rows violated an RI constraint that table Y had. From reading the manuals, it seemed to me that if using standard RI, all of the valid rows would be inserted but the invalid ones would not. But with batch RI (which is "all or nothing") I would expect nothing to get inserted, since it would check for problem rows up front and return an error right away.
If in fact there is no difference except in how Teradata processes things internally (i.e. where it checks for invalid rows), then why would you want to use one over the other? Wouldn't you always want to use batch RI, since it does the checking up front and saves processing time?
Points:
Let's suppose that we have 3 dimension tables and 1 fact table (as in the example above), and that the join index (or AJI) is based on the 3 dimensions and the fact table (all tables inner-joined).
"Soft" referential integrity is a feature that is more about accessing the data than about loading it.
Soft referential integrity does not enforce any RI constraints. However, when you
specify soft RI, you are telling the optimizer that the foreign key references do exist. Therefore, it
is your job to make sure that is true.
Soft Referential Integrity (Soft RI) is a mechanism by which you can tell the optimizer that
even though no formal RI constraints have been placed on the table(s), the data in the tables
conform to the requirements of RI enforced tables.
This means that the user has ensured the following:
The PK of the parent table has unique, not null values.
The FK of the child table contains only values which are contained in the PK column of
the parent table.
Soft RI
Does not create or maintain reference indexes
Does not validate referencing constraints
By allowing the optimizer to assume that RI constraints are implicitly in force, (even though no
formal RI is assigned to the table), you enable the optimizer to eliminate join steps in queries
such as the one seen previously.
Implementing Soft RI
Soft RI is implemented using slightly different syntax than standard RI. The
REFERENCES clause for the column definition will add the key words 'WITH NO CHECK
OPTION'.
Examples
Create the employee table with a soft RI reference to the department table.
CREATE TABLE employee
( employee_number INTEGER NOT NULL,
manager_employee_number INTEGER,
department_number INTEGER ,
job_code INTEGER,
last_name CHAR(20) NOT NULL,
first_name VARCHAR(30) NOT NULL,
hire_date DATE NOT NULL,
birthdate DATE NOT NULL,
salary_amount DECIMAL(10,2) NOT NULL
, FOREIGN KEY ( department_number ) REFERENCES WITH NO CHECK OPTION
department( department_number))
UNIQUE PRIMARY INDEX (employee_number);
The parent table must be created with a unique, not null referenced column. Either of the
examples below may be used.
CREATE TABLE department
( department_number INTEGER NOT NULL CONSTRAINT primary_1 PRIMARY KEY
,department_name CHAR(30) UPPERCASE NOT NULL UNIQUE
,budget_amount DECIMAL(10,2)
,manager_employee_number INTEGER);
CREATE TABLE department
( department_number INTEGER NOT NULL
,department_name CHAR(30) UPPERCASE NOT NULL UNIQUE
,budget_amount DECIMAL(10,2)
,manager_employee_number INTEGER)
UNIQUE PRIMARY INDEX (department_number);
Executing the same query as before, notice the join elimination step takes place just as it did
when standard RI was enforced.
Find all employees in valid departments.
EXPLAIN SELECT employee_number
, department_number
FROM employee e, department d
WHERE e.department_number = d.department_number
ORDER BY 2,1;
An EXPLAIN of this query produces the following partial result:
CHECKSUM:
Problems in disk drives and disk arrays can corrupt data. This type of corruption is hard to detect, but queries against the corrupted data will return wrong answers; the corruption can be found with the SCANDISK and CHECKTABLE utilities. Such errors, called disk I/O errors, reduce the availability of the data warehouse.
To guard against this, Teradata provides the disk I/O integrity check. A checksum is used to perform the check at the table level; it is a protection technique by which you can select various levels of corruption checking. The feature detects and logs disk I/O errors.
Checksums can be enabled at the table level in the CREATE TABLE DDL, and at the system level through the DBSControl utility.
For more hands-on diagnosis, use the SCANDISK and CHECKTABLE utilities; run CHECKTABLE at level 3 so that it diagnoses all rows byte by byte.
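A hedged sketch of enabling a checksum at the table level (assuming the CHECKSUM table option of this release; typical settings include DEFAULT, NONE, LOW, MEDIUM, HIGH and ALL):
CREATE TABLE sales_chk, FALLBACK, CHECKSUM = DEFAULT
( sale_id   INTEGER
, sale_date DATE
, amount    DECIMAL(10,2))
PRIMARY INDEX (sale_id);
At the system level, the corresponding defaults are set through the DBSControl utility.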
60. What is an identity column?
Teradata supports identity columns (a column, for example of INTEGER data type, whose values are generated by the system) beginning around V2R5.1/V2R6.x. These columns differ from Oracle's sequence concept in that the number assigned is not guaranteed to be sequential. The identity column in Teradata is simply used to guarantee row uniqueness.
Example:
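(A hedged sketch of such DDL; ColA matches the column referred to below, and the identity parameters are assumptions.)
CREATE MULTISET TABLE ident_demo
( ColA INTEGER GENERATED ALWAYS AS IDENTITY
       (START WITH 1
        INCREMENT BY 1
        MINVALUE 1
        MAXVALUE 2147483647
        NO CYCLE)
, ColB VARCHAR(30))
PRIMARY INDEX (ColA);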
Granted, ColA may not be the best primary index for data access or joins with other tables in the
data model. It just shows that you could use it as the PI on the table.
We have the MERGE-INTO option available in Teradata, which works as upsert logic (update the row if it matches, otherwise insert it). The ON clause supplies the match condition, for example ON (Target.dept_no = 20); a fuller sketch follows.
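A hedged sketch of the general form, reusing the department table from the soft RI example above (the source values are assumptions; older releases restrict the ON clause to an equality on the target table's primary index, as in the fragment above):
MERGE INTO department AS tgt
USING (SELECT 20 AS department_number, 'Marketing' AS department_name) AS src
ON (tgt.department_number = src.department_number)
WHEN MATCHED THEN
   UPDATE SET department_name = src.department_name
WHEN NOT MATCHED THEN
   INSERT (department_number, department_name)
   VALUES (src.department_number, src.department_name);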
63. What different options are available with the SAMPLE function in Teradata?
The SAMPLE function is used to retrieve a random sample of data from a table.
Example 1:
SELECT * FROM emp SAMPLE 10;
Example 2 (stratified sampling):
SELECT * FROM tab SAMPLE
WHEN prod_code = 'AS' THEN 10
WHEN prod_code = 'CM' THEN 10
WHEN prod_code = 'DQ' THEN 10
END;
Sample Function
For instance, suppose an order table contains order details along with a Product Code of 'AS', 'BU', 'CM', 'DQ', 'ER', or 'FN', and you want a random sample of 10 records for each of the product codes 'AS', 'CM', and 'DQ'. The stratified SAMPLE feature shown above achieves exactly that in a single query, returning 30 records: 10 for each of the three product codes.
RANDOM Function
The RANDOM function may be used to generate a random number between a specified
range.
RANDOM (Lower limit, Upper limit) returns a random number between the lower and upper
limits inclusive. Both limits must be specified, otherwise a random number between 0 and
approximately 4 billion is generated.
department_number
-----------------
501
301
201
600
100
402
403
302
401
Example
department_number Random(1,9)
----------------- -----------
501 2
301 6
201 3
600 7
100 3
402 2
403 1
302 5
401 1
Note: it is possible for random numbers to repeat. The RANDOM function is activated for
each row processed, thus duplicate random values are possible.
66.consider Mload or Tpump according to volume of the data.,diffrent situations where Tpump and
Mload should be used ?
In general, the more you tend to accumulate your updates into large batches before applying them to your
tables, the more likely it is that you'll want to use Mload. Mload is more efficient at applying a large number of
updates. However, Mload has certain limitations like it can't update unique secondary indexes or join
indexes, it can't fire triggers, and you can't use it on a table with referential integrity defined. Also, Mload will
lock the entire table with a write lock when it's in the APPLY phase (when it's applying the updates).
Tpump, on the other hand, is best used if you are applying updates throughout the day in small batches (or
using a queue). Tpump is not as fast, especially as the update volumes grow. Its advantages are that it
doesn't lock the entire table for write, but only locks the specific row-hash values that are being updated, and
it only locks them for the duration of the update. Also, since there is no special code inside the DBMS for
Tpump, it supports all DBMS features (updates unique secondary indexes, join indexes, fires triggers, etc.).
If you are applying updates on a weekly or daily basis, I would tend to use Multiload. As you start to apply
updates more frequently throughout the day, you may start to find that Tpump is the better option.
Teradata is an MPP system which can process complex queries very fast. Another advantage is the uniform distribution of data through unique primary indexes, without any overhead.
Recently we had an evaluation with experts from both Oracle and Teradata for an OLAP system, and they were really impressed with the performance of Teradata over Oracle.
Oracle supports MPP in the form of grid computing. Uniform distribution of data based on the primary key is not much use when accessing a huge amount of data requires a full scan. So far we found Teradata almost equal in performance to Oracle 10g. Based on benchmarks and after consulting different people, we found the following problems with Teradata:
It is too expensive; you need deep pockets to work with Teradata.
It has only one type of index, while Oracle has many types of indexes, especially its bitmap index.
Teradata does not have materialized views; Oracle has materialized views, which decrease the I/O bandwidth and make the system more scalable.
Oracle has a very wide variety of analytic functions for SQL.
Oracle offers 3 types of partitioning, and in Oracle 11g there are some new additions to partitioning.
Oracle has the ability to use clusters without having to statically partition data.
Further, these are remarks found on some Oracle discussion forums:
"the largest databases in the world run on Oracle"
http://biz.yahoo.com/prnews/031114/sff029_1.html
Still, we saw that the best database is the one you have the technical resources to work with and, especially, to tune.
Database Query Log (DBQL) tables are tables in the DBC database that store a history of the queries executed against the database (when query logging is enabled).
The history can get very large, so these tables should be purged when the data is no longer needed.
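A hedged sketch (the BEGIN QUERY LOGGING statement and the DBC.DBQLogTbl table are the usual mechanism; the column names shown are assumptions and reading DBC requires the appropriate access rights):
BEGIN QUERY LOGGING WITH SQL ON ALL;          -- start logging queries for all users

SELECT UserName, StartTime, TotalIOCount
FROM   DBC.DBQLogTbl
WHERE  CAST(StartTime AS DATE) >= DATE - 7;   -- last week's logged queries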
There are many forms of disk array protection in Teradata. RAID 1 and RAID 5 are commonly used and will
be discussed here. The disk array controllers manage both.
RAID 1 is a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the
array controllers to write all data to two separate locations, which means data can be read from two locations
as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and
performance is unchanged. RAID 1 may be configured as RAID 1 + 0 that uses mirrored striping.
RAID 5 is a parity-checking technique. For every three blocks of data (spread over three disks), there is a
fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be
reconstructed by using the information on the other three. If two of the disks fail, the rank becomes
unavailable. The array controller does the recalculation of the information for the missing block.
Recalculation will have some impact on performance, but at a much lower cost in terms of disk space.
TOP Clause
The TOP clause is used to specify the number of records to return.
The TOP clause can be very useful on large tables with thousands of records, where returning a large number of records can impact performance.
Example:
There is a TOP function in V2R6, but if you want the same result in V2R5 you need to use an ordered analytical function:
SELECT *
FROM vinod_1
QUALIFY ROW_NUMBER() OVER (ORDER BY empno) <= 5;
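In V2R6 and later, per the note above, the same result can be written with TOP (same hypothetical table):
SELECT TOP 5 *
FROM   vinod_1
ORDER BY empno;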
85. What is a checkpoint?
A checkpoint records the progress of a load job (for example FastLoad or MultiLoad) in its restart log table, so that after a failure the job can be restarted from the last checkpoint instead of from the beginning.
86.When do you use BTEQ. What other softwares have you used or can we use rather than BTEQ.
BTEQ is used when queries operate on a smaller amount of data in a table.
It supports any kind of SQL operation: SELECT, UPDATE, INSERT and DELETE.
It can be used for import, export and reporting purposes.
Macros and stored procedures can also be run using BTEQ.
The other utilities which we can use instead of BTEQ for loading purposes are FASTLOAD and MLOAD,
and for exporting, FASTEXPORT. These are used when accessing large amounts of data.
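A minimal BTEQ sketch for a small report extract, with hypothetical logon details and the same hypothetical table vinod_1:
.LOGON tdpid/etl_user,password;
.EXPORT REPORT FILE = emp_report.txt;
SELECT empno, ename FROM vinod_1 ORDER BY empno;
.EXPORT RESET;
.LOGOFF;
.QUIT;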
87.How many type of files have you loaded and their differences. (Fixed and Variable) ?
In a channel environment (i.e. mainframes), the load utilities can be executed through JCL.
In a network environment (i.e. from a command prompt), a load script can be run by redirecting it into the
utility:
<utility name> < <script name>
89.What was the environment of your latest project (Number of Amps, Nodes, Teradata Server
Number etc)
Indexing is a way to physically organize the records so that frequently used queries run faster.
An index acts as a pointer into the large table. It helps locate the required row quickly and return it to the
user.
or
The frequently used queries need not hit the large table for data; they can get what they want from the index
itself (covered queries).
An index comes with maintenance overhead. Teradata maintains its indexes itself: each time an
insert/update/delete is done on the table, the indexes also need to be updated and maintained.
Indexes cannot be accessed directly by users. Only the Optimizer has access to the index.
92.What is difference between Multiload, FastLoad and TPUMP
93.what are the different functions you do in BTEQ (Errorcode, ErrorLevel, etc) ?
Error Level: assigns severity to errors.
You can assign an error level (severity) for each error code returned.
Decisions can then be based on the error level.
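A minimal BTEQ sketch of ERRORCODE/ERRORLEVEL handling, with hypothetical table names (error 3807 is "object does not exist"):
.SET ERRORLEVEL 3807 SEVERITY 0;
DROP TABLE stage_emp;
.IF ERRORLEVEL > 0 THEN .QUIT 8;
CREATE TABLE stage_emp (empno INTEGER, ename VARCHAR(30));
.IF ERRORCODE <> 0 THEN .QUIT 8;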
96.Explain PPI?
PPI :-
Partitioned Primary Indexes are created to divide a table into partitions based on a range or set of values, as
required. The rows are first hashed to the AMPs, then stored on each AMP ordered by partition. Retrieving a
single partition (or several partitions) is an all-AMP operation, but not a full-table scan. This is especially
effective for large tables partitioned on a date column. There is no extra overhead on the system (no special
subtables are created, etc.).
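A minimal DDL sketch, with hypothetical table and column names:
CREATE TABLE sales_fact (
  sale_id   INTEGER,
  sale_date DATE,
  sale_amt  DECIMAL(10,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2005-01-01' AND DATE '2005-12-31'
                      EACH INTERVAL '1' MONTH);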
Both are set operators on two tables generally. UNION gives all rows from both tables eliminating duplicate
rows.
MINUS gives the records from the first table, excluding the records common to both tables. It is just like
EXCEPT in Teradata.
Operator Returns
UNION All rows selected by either query.
UNION ALL All rows selected by either query, including all duplicates.
MINUS All distinct rows selected by the first query but not the second.
UNION Example
The following statement combines the results with the UNION operator, which eliminates duplicate selected
rows. It also shows that you must match datatypes when a column does not exist in one of the two tables
(this example uses Oracle-style TO_DATE and TO_NUMBER conversions; in Teradata you would use CAST):
SELECT part, partnum, to_date(null) date_in FROM orders_list1
UNION
SELECT part, to_number(null), date_in FROM orders_list2;
The following simpler statement selects only the part column from each table:
SELECT part
FROM orders_list1
UNION
SELECT part
FROM orders_list2;
PART
----------
SPARKPLUG
FUEL PUMP
TAILPIPE
CRANKSHAFT
MINUS Example
The following statement combines results with the MINUS operator, which returns only rows returned by the
first query but not by the second:
SELECT part
FROM orders_list1
MINUS
SELECT part
FROM orders_list2;
PART
----------
SPARKPLUG
FUEL PUMP
no
112.What is Join Index in TD and How it works?
ANS : JOIN INDEX:
-----------
A Join Index is essentially a pre-join of 2 or more tables or views that are commonly joined, created in order
to reduce the joining overhead.
Teradata can then use the join index instead of resolving the joins against the participating base tables.
Join indexes increase the efficiency and performance of join queries.
They can have different primary indexes than the base tables, they are automatically updated as and when
the base rows are updated, and they can have repeating values.
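A minimal sketch, with hypothetical tables customer and orders:
CREATE JOIN INDEX cust_ord_jix AS
SELECT c.cust_id, c.cust_name, o.order_id, o.order_amt
FROM customer c
INNER JOIN orders o ON c.cust_id = o.cust_id
PRIMARY INDEX (cust_id);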
113.I have two tables and one of the table index is defined as UPI or USI. The second table is having
any of the indexes like UPI,NUPI,USI OR NUSI. In this scenario what type of join strategy optimizer
will use ?
Merge Join Strategy
114.I have two tables. Most of the time I am joining on the same columns. Which type of join index
will improve the performance in this scenario ?
Multi table join index
115.When will you create PPI and when will you create secondary indexes?
Partitioned Primary Indexes are created to divide a table into partitions based on a range or set of values, as
required. This is effective for large tables partitioned on date or integer columns. There is no extra overhead
on the system (no special subtables are created, etc.).
Secondary indexes are created on a table as an alternate way to access the data. This is the second fastest
method of retrieving data from a table, next to the primary index. Secondary index subtables are created.
Neither PPI access nor secondary index access performs a full table scan; they access only a defined set of
data on the AMPs.
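A minimal sketch of creating and dropping a secondary index, with hypothetical names:
CREATE INDEX last_name_idx (last_name) ON employee;
DROP INDEX last_name_idx ON employee;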
116.what is an optimization and performance tuning and how does it really work in practical
projects. can i get any example to better understand.
118.When you chose primary index and when will you choose secondary index?
Primary index will be chosen at the time of table creation. This will help us in data distribution, data retrieval
and join operations.
Secondary indexes can be created and dropped at any time. They are used as an alternate path to access
data other than the primary index.
When we have two tables that are joined on the same join condition very frequently, we go for a
Join Index.
120.When will you go for hash index?
a.A hash index organizes the search keys with their associated pointers into a hash file structure.
b.We apply a hash function on a search key to identify a bucket,
and store the key and its associated pointers in the bucket (or in overflow buckets).
c.Strictly speaking, hash indices are only secondary index structures,
since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.
121. In case of replacement loading which utility you prefer? Mload or Fload?
Fload.
122.I have a scenario where I update one column in a table using flat file as source. At the same time,
the same column is getting updated because of another flat file. Which utility will be more applicable
in this case?
TPump is better, as it locks at the row level.
The table got loaded with wrong data using FastLoad and it failed. The error message shown was:
"RDBMS error 2652: Operation not allowed: _db_._table_ is being Loaded." How do we release the lock on
this table?
If the data got loaded completely and the table is still locked, submit another FastLoad script containing only
the BEGIN LOADING and END LOADING statements.
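A minimal sketch of such a lock-release script, with hypothetical logon, table, and error-table names:
LOGON tdpid/etl_user,password;
BEGIN LOADING stage_db.stage_tbl ERRORFILES stage_db.stage_tbl_e1, stage_db.stage_tbl_e2;
END LOADING;
LOGOFF;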
I need to create a delimited file using FastExport. As FastExport does not support delimited format, I have
written the following SELECT to get the delimited output:
select
trim(col1) || '|' ||
trim(col2) || '|' ||
trim(col3) || '|' || ...........
...............................
trim(col50)
from table
but the above script prefixes each line with 2 junk characters.
How can we get the data without the junk characters?
When the fastload checkpoint value is <= 60 versus > 60, how does that matter?
When the checkpoint value is 60 or less, it is interpreted as a time interval in minutes. If the value
is more than 60, it is treated as a number of records rather than a time interval.
123. I am loading a delimited flat file with a time format as the following:
HH:MM AM/PM
Examples would be:
9:45 AM
10:25 PM
And there is no leading zero if the hour is a single digit.
Is there any way to get the MLOAD acquisition-phase counts in the MLOAD script? The MLOAD
support environment provides different variables (total inserts, updates, deletes, etc.) at the application
phase, but not at the acquisition phase.
Is there any way other than scanning the log file?
There are several system variables available for this purpose:
SYSAPLYCNT
SYSNOAPLYCNT
SYSRCDCNT
SYSRJCTCNT
124. I have a requirement that when an error table gets generated during the MLOAD, I want to send an
email. How can I achieve this?
After the MLOAD, use a BTEQ step to query the error table; if rows are present, quit with some value, say
'99', and have the OS send the mail when the return code is 99.
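A minimal BTEQ sketch of that step, assuming a hypothetical error table stage_db.target_et:
.LOGON tdpid/etl_user,password;
SELECT COUNT(*) FROM stage_db.target_et HAVING COUNT(*) > 0;
.IF ACTIVITYCOUNT > 0 THEN .QUIT 99;
.LOGOFF;
.QUIT 0;
The calling shell script or JCL then checks the return code and sends the email when it is 99.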
126.Can we make an MLOAD script fail when the error tables are created?
Currently the MLOAD script exits with a return code of 0, which means the load is treated as successful even
though it is not: it has created some error tables, which indicates that some data has been rejected.
The MLOAD system variables can be used to build the logoff return code, for example:
.logoff &SYSUVCNT + &SYSRJCTCNT + &SYSETCNT + &SYSRC;
TROUBLESHOOTING
Solution : Whenever you want to open a fresh batch id, you should first close the existing batch id and then
open a fresh batch id.
2) The source is a flat file and I am staging this flat file in Teradata. I found that the leading zeros are being
truncated in Teradata. What could be the reason?
Solution : The reason is that in Teradata you have defined the column datatype as INTEGER; that is why the
leading zeros are truncated. Change the target table datatype to VARCHAR; the VARCHAR datatype will not
truncate the leading zeros.
Solution : For any fresh stage load you should open a batch id for the current
data source id.
Solution : First find all the NOT NULL columns in the target table, cross-verify them with the corresponding
source columns, identify which source column is supplying the NULL value, and take the necessary action.
7) I am passing one record to the target lookup but the lookup is not returning the matching record. I know
that the record is present in the lookup. What action will you take?
9) Accti_id is a NOT NULL column in the AGREEMENT table. You are getting a NULL value from the
CFDW_AGREEMENT_XREF lookup. What will you do to eliminate the NULL records?
11) When will you use the ECTL_PGM_ID column in the target lookup SQL override?
Solution : When you are populating a single target table (the AGREEMENT table) from multiple mappings in
the same Informatica folder, we use ECTL_PGM_ID in the target lookup SQL override. This eliminates
unnecessary updating of records.
12) You have defined the primary keys as per the ETL spec but you are getting duplicate records. How will
you handle this?
Solution : Apart from the primary key columns in the spec, I will first add another column (beyond the primary
key columns in the spec) as a primary key and check for duplicate records. If I do not get any duplicates, I
will ask the modeller to add this column as a primary key.
13) In Teradata the error is reported as: "no more room in database".
Solution: I spoke with the DBA to add space for that database.
14) Though the column is available in the target table, when I am trying to load using Mload it shows that
the column is not available in the table. Why?
Solution: The loading process was happening through a view, and the view had not been refreshed to
include the new column, hence the error message. So, refresh the view definition to add the new column.
15) When deleting from the target table, though I wanted to delete only some data from the target table, by
mistake all the data got deleted from the Development table.
16) While updating the target table, it shows an error message saying multiple rows are trying to update a
single row.
Solution: There are duplicates in the table matching the WHERE condition of the update query. These
duplicate records need to be eliminated.
17) I have a file with a header, data records, and a trailer. The data records are comma-delimited and the
header and trailer are fixed width. The header and trailer records start with HDR and TRA respectively.
I need to skip the header and trailer while loading the file with Multiload.
Please help me in this case.
Solution: Code the Mload utility to consider only the data records, excluding the header and trailer records:
APPLY label WHERE REC_TD_IN NOT IN('HDR','TRA')
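A minimal sketch of how that condition sits in the .IMPORT statement (hypothetical file, layout, and label
names; REC_TD_IN is assumed to be the layout field holding the record-type indicator):
.IMPORT INFILE input_file
  LAYOUT file_layout
  APPLY insert_label WHERE REC_TD_IN NOT IN ('HDR','TRA');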
****MORE ON JOINS & INDEXES****
* Teradata makes the decision itself whether or not to use an index - if you are not careful you spend time in
table updates keeping up an index which is not used at all (one cannot give the query optimizer hints
to use a particular index, though collecting statistics may affect the optimizer's strategy).
* In the MP-RAS environment, look at the script "/etc/gsc/bin/perflook.sh". This will provide a
system-wide snapshot in a series of files. The GSC uses this data for incident analysis.
* When using an index, one must make sure that the index condition is met in the subqueries (using
IN, nested queries, or derived tables).
* An indication of proper index use is the EXPLAIN entry "a ROW HASH MATCH SCAN across ALL-AMPS".
* If the index is not used, the result is a FULL TABLE SCAN, where the response time grows as the size of
the history table grows.
* Keeping index information up to date is a time/space-consuming issue. Sometimes Teradata performs
much better when you "manually" imitate the index by building it from scratch.
* Keeping a join index might help, but you cannot MultiLoad to a table which is part of a join index - loading
with TPump or plain SQL is OK but does not perform as well. Dropping and re-creating a join index on a big
table takes time and space.
* Watch out when your Teradata EXPLAIN gives some 25 steps for your query (even without the update of
the results) and the actual query is a join of six or more tables.
Case e.g.
We had already given up updating the secondary indexes - because we have not had much use for
them.
After some trial and error we ended up with the strategy where the actual "purchase frequency
analysis" is never made "directly" against the history table.
Instead:
1) There is a "one-shot" run to build the initial "customer's previous purchase" from the "purchase
history" - it takes time, but that time is saved later
2) The purchase frequency is calculated by joining the "latest purchase" with the "customer's
previous purchase".
3) When the "latest purchase" rows are inserted to the "purchase history" the "customer's previous
purchase" table is dropped and recreated by merging the "customer's previous purchase" with the
"latest purchase"
4) By following these steps the performance is not too fast yet (about 25 minutes on our two-node
system) for a batch of almost 1,000,000 latest receipts - but it is tolerable now.
(We also tested adding both the previous and the latest purchases to the same table, but because its
size was on average much bigger than the pure "latest purchase" table, the self-join was slower in
that case.)
*********
How do you avoid bottlenecks when the query coordinator must retrieve
information from the data dictionary?
In Teradata, the DBMS itself manages the data dictionary. Each dictionary table is simply a
relational table, parallelized across all nodes. The same query engine that manages user workloads
also manages the dictionary access, using all nodes for processing dictionary information to spread
the load and avoid bottlenecks. The PE even caches recently used dictionary information in
memory. Because each PE has its own cache, there is no coordination overhead. The cache for each
PE learns the dictionary information most likely to be needed by the sessions assigned to it.
With a large volume of work, how can all requests execute at once?
As in any computer system, the total number of items that can execute at the same time is always
limited to the number of CPUs available. Teradata uses the scheduling services Unix and NT
provide to handle all the threads of execution running concurrently. Some requests might also exist
on other queues inside the system, waiting for I/O from the disk or a message from the BYNET, for
example. Each work item runs in a thread; each thread gets a turn at the CPU until it needs to wait
for some external event or until it completes the current work. Teradata configures several units of
parallelism in each SMP node. Each unit of parallelism contains many threads of execution that
aren't restricted to a particular CPU; therefore, every thread gets to compete equally for the CPUs in
the SMP node.
There is a limit, of course, to the number of pieces of work that can actually have a thread allocated
in a unit of parallelism. Once that limit is reached, Teradata queues work for the threads. Each
thread is context free, which means that it is not assigned to any session, transaction, or request.
Therefore, each thread is free to work on whatever is next on the queue. The unit of work on the
queue is a processing step for a request. Combining the queuing of steps with context-free threads
allows Teradata to share the processing service equally across all the concurrent requests in the
system. From the users' point of view, all the requests in the system are running, receiving service,
and sharing system resources.
How does Teradata avoid resource contention and the resulting performance and
management problems?
Teradata algorithms are very resource efficient. Other DBMSs optimize for single-query
performance by giving all resources to the single query. But Teradata optimizes for throughput of
many concurrent queries by allocating resources sparingly and using them efficiently. This kind of
optimization helps avoid wide performance variations that can occur depending on the number of
concurrent queries.
When faced with a workload that requires more system resources than are available, Teradata tunes
itself to that workload. Thrashing, a common performance failure mode in computer systems,
occurs when the system has fewer resources than the current workload requires and begins using
more processing time to manage resources than to do the work. With most databases, a DBA would
tune the system to avoid thrashing. However, Teradata adjusts automatically to workload changes
by adjusting the amount of running work and internally pushing back incoming work. Each unit of
parallelism manages this flow control mechanism independently.
If all concurrent work shares resources evenly, how are different service levels
provided to different users?
The Priority Scheduler Facility (PSF) in Teradata manages service levels among different parts of
the workload. PSF allows granular control of system resources. The system administrator can define
up to five resource partitions; each partition contains four available priorities. Together, they
provide 20 allocation groups (AGs) to which portions of the workload are assigned by an attribute
of the logon ID for the user or application. The administrator assigns each AG a portion of the total
system resources and a scheduling policy.
For example, the administrator can assign short queries from the Web site a guaranteed 20 percent
of system resources and a high priority. In contrast, the administrator might assign medium priority
and 10 percent of system resources to more complex queries with lower response-time
requirements. Similarly, the administrator might assign data mining queries a low priority and five
percent of the total resources, effectively running them in the background. You can define policies
so that the resources adjust to the work in the system. For example, you could allow data mining
queries to take up all the resources in the system if nothing else is running.
Unlike other scheduling utilities, PSF is fully integrated into the DBMS, not managed at the task or
thread level, which makes it easier to use for parallel database workloads. Because PSF is an
attribute of the session, it follows the work wherever it goes in the system. Whether that piece of
work is executed by a single thread in a single unit of parallelism or in 2,000 threads in 500 units of
parallelism, PSF manages it without system administrator involvement.
CPU scheduling is a primary component of PSF, using all the normal techniques (such as quantum
size, CPU queues by priority, and so on). However, PSF is endemic throughout the Teradata DBMS.
There are many queues inside a DBMS handling a large volume mixed workload. All of those
queues are prioritized based on the priority of the work. Thus, a high priority query entered after
several lower priority requests that are awaiting their turn to run will go to the head of the queue
and will be executed first. I/O is managed by priority. Data warehouse workloads are heavy I/O
users, so a large query performing a lot of I/O could hold up a short, high-priority request. PSF puts
the high-priority request I/Os to the head of the queue, helping to deliver response time goals.
Data warehouse databases often set the system environment to allow for fast scans.
Does Teradata performance suffer when the short work is mixed in?
Because Teradata was designed to handle a high volume of concurrent queries, it doesn't count on
sequential scans to produce high performance for queries. Although other DBMS products see a
large fall in request performance when they go from a single large query to multiple queries or
when a mixed workload is applied, Teradata sees no such performance change. Teradata never plans
on sequential access in the first place. In fact, Teradata doesn't even store the data for sequential
accesses. Therefore, random accesses from many concurrent requests are just business as usual.
Sync scan algorithms provide additional optimization. When multiple concurrent requests are
scanning or joining the same table, their I/O is piggybacked so that only a single I/O is performed to
the disk. Multiple concurrent queries can run without increasing the physical I/O load, leaving the
I/O bandwidth available for other parts of the workload.
STAYING ACTIVE
The active warehouse is a busy place. It must handle all decision making for the organization,
including strategic, long-range data mining queries, tactical decisions for daily operations, and
event-based decisions necessary for effective Web sites. Nevertheless, managing this diversity of
work does not require a staff of hundreds running a complex architecture with multiple data marts,
operational data stores, and a multitude of feeds. It simply requires a database management system
that can manage multiple workloads at varying service levels, scale with the business, and provide
24x7 availability year round with a minimum of operational staff.
2. Use COMPRESS on whichever columns possible. This helps in reducing I/O and hence
improves performance, especially for columns having lots of NULL values or a small set of known
values (see the sketch after this list).
3. COLLECT STATISTICS on a daily basis (after every load) in order to improve performance.
4. Drop and recreate secondary indexes before and after every load. This helps in improving load
performance (if critical).
5. Regularly check for EVEN data distribution across all AMPs using Teradata Manager or through
Queryman.
6. Check the combination of CPUs, AMPs, PEs and nodes for performance optimization.
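A minimal sketch of points 2 and 3, with hypothetical table and column names (value-list COMPRESS on
fixed-width columns; COMPRESS with no list compresses NULLs):
CREATE TABLE customer_dim (
  cust_id     INTEGER,
  cust_status CHAR(1) COMPRESS ('A','I'),
  middle_init CHAR(1) COMPRESS
)
PRIMARY INDEX (cust_id);
COLLECT STATISTICS ON customer_dim COLUMN cust_id;
COLLECT STATISTICS ON customer_dim COLUMN cust_status;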
Each AMP can handle 80 tasks and each PE can handle 120 sessions.
MLOAD – Customize the number of sessions for each MLOAD job depending on the
number of concurrent MLOAD jobs and the
number of PEs in the system.
e.g.
SCENARIO 1
# of AMPs = 10
# of max load jobs handled by Teradata = 5 (parameter which can be
set to values 5 to 15)
# of sessions per load job = 1 (parameter that can be set globally
or at each MLOAD script level)
# of PEs = 1
SCENARIO 2
# of AMPs = 16
# of max load jobs handled by Teradata = 15
# of sessions per load job = 1
# of PEs = 1
Use the SLEEP and TENACITY features of MLOAD for scheduling MLOAD jobs.
Check the TABLEWAIT parameter; if omitted, it can cause an immediate load job failure when you
submit two MLOAD jobs that try to update the same table.
JOIN INDEX - Check the limit on the number of fields for a join index (max 16 fields). It may
vary by version.
A join index is like building the table physically. Hence it has advantages such as BETTER
performance, since the data is physically stored and not calculated ON THE FLY. The cons are
loading time (MLOAD needs join indexes to be dropped before loading) and additional
space, since it is a physical table.