Best practices
Troubleshooting DB2 servers
Nikolaj Richers
Information Architect
IBM
Amit Rai
Advisory Software Engineer
IBM
Serge Boivin
Senior Writer
IBM
The result? When things do break, you are well prepared to make troubleshooting as
quick and painless as possible.
For each scenario, this paper shows you how to identify the problem symptoms, how
to collect the diagnostic data with minimal impact to your database environment, and
how to diagnose the cause of the problem.
The target audience for this paper is database and system administrators who have
some familiarity with operating system and DB2 commands.
This paper applies to DB2 V10.1 FP2 and later, but many of the features that are
described here are available in earlier DB2 versions as well. For example, some of the
serviceability functionality for large database environments was introduced in DB2
V9.7 FP4, and user-defined threshold detection for problem scenarios was introduced
in DB2 V9.7 FP5. If you are not sure whether specific functionality is supported for
your DB2 version, check the information center for that version.
For information about how to contact IBM and the available support options, see
“Contacting IBM Software Support” in the DB2 information center
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admi
n.trb.doc/doc/t0053716.html ).
For information about how to use EcuRep, see “Enhanced Customer Data Repository
(EcuRep)” (http://www-05.ibm.com/de/support/ecurep/index.htm).
To see how your server is configured to behave during diagnostic data collection
when a critical error occurs, issue the db2pdcfg command. The output shows how
your data server responds to critical events such as trap conditions and what the
current state is. Significant events, such as critical errors, trigger automatic data
capture through first occurrence data capture (FODC, sometimes also referred to as
db2cos), which is described elsewhere in this paper.
Database Member 0 (sample db2pdcfg output abridged)
For more information about the db2pdcfg command, see “db2pdcfg - Configure DB2
database for problem determination behavior command”
(http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admin.cmd.
doc/doc/r0023252.html).
You typically make configuration changes for troubleshooting in one of two places:
• Database manager configuration parameters, which you change by using the
UPDATE DBM CFG command
• DB2 profile registry variables, such as DB2FODC, which you change by using the
db2set command
For DB2 profile registry variables, there is an important difference between the
methods that you can use to make configuration changes:
• You can make changes permanently by using the db2set command, which
requires an instance restart for changes to become effective.
• For settings that are related to FODC, you can make changes dynamically by using
the db2pdcfg command. These changes take effect immediately but remain in effect
only until the instance is stopped.
The command output includes all database manager configuration values. The
following sample output has been abridged to show only values that are related to
problem determination:
The diagnostic error capture level (indicated by DIAGLEVEL) determines the
level of detail that is recorded in the db2diag log file, and the notify level (indicated by
NOTIFYLEVEL) determines the level of detail that is recorded in the notification log file.
The diagnostic data directory path (indicated by DIAGPATH) and the alternate
diagnostic data directory path (indicated by ALT_DIAGPATH) determine where
diagnostic data is stored.
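On UNIX and Linux systems, a quick way to check these settings is to filter the GET DBM CFG
output, as in the following minimal example:
db2 get dbm cfg | grep -iE "diaglevel|notifylevel|diagpath"
The filtered output includes the current values of the DIAGLEVEL, NOTIFYLEVEL, DIAGPATH,
and ALT_DIAGPATH parameters.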
Unless you are guided by IBM Support, do not change the default settings of
parameters or registry variables that are specific to problem determination but are not
described in this paper. For example, do not change the settings of the diaglevel
configuration parameter or of DB2FODC registry variable settings. If you set the
level of detail for these parameters or registry variables too high, very large amounts
of diagnostic data can be generated in a very short time, which in turn can negatively
affect the performance of your data server. If you set the level of detail too low,
insufficient data to troubleshoot a problem might be available, requiring further
diagnostic data collection before you can diagnose and resolve a problem.
If you notice that the values for configuration parameters such as diaglevel and
notifylevel are not set to the defaults and you are not troubleshooting a problem,
you can use the UPDATE DBM CFG command to reset them to their defaults. In the
following example, with abridged output, the GET DBM CFG command shows that
the diaglevel parameter is set to the highest value possible:
The default value is 3, which captures all errors, warnings, event messages, and
administration notification messages. To reset the value for the diaglevel parameter
to the default, issue the following command:
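db2 update dbm cfg using diaglevel 3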
The reason for redirecting diagnostic data away from the installation path is that the
various types of diagnostic data can use significant amounts of space in the file
system. By default, the db2diag and administration notification logs, core dump files,
trap files, an error log, a notification file, an alert log file, and FODC packages are all
written to the installation path. These files can negatively affect data server
availability if the data fills up all the space in the file system.
Redirect diagnostic data away from the DB2 installation path by using the following
command, replacing /var/log/db2diag with a location on your system:
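db2 update dbm cfg using diagpath /var/log/db2diag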
The different types of diagnostic files are described in more detail in the section
“db2diag and administration notification logs“.
To see the value for the alt_diagpath configuration parameter, issue the following
command:
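db2 get dbm cfg | grep -i alt_diagpath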
To reduce the amount of diagnostic data that is sent to the directory path that you
specify for the diagpath parameter, redirect FODC packages and the core file to a
different directory path. You use the following DB2FODC registry variable settings to
change the setting for the FODC packages and core files:
• FODCPATH: Specifies the absolute path name for the FODC package. The size
of a FODC package depends on the type of collection, the operating system,
and the sizes of the files that are collected. The size can reach several
gigabytes.
• DUMPDIR: Specifies the absolute path name of the directory for core file
creation. A core file can become as large as the amount of physical memory of
the machine where the core file is generated. For example, a machine with 64
GB of physical memory requires at least 64 GB of space in the directory path
where the core file will be stored. You can limit the size of the core file, but
you should instead configure core file behavior to point to a file system with
enough space to avoid lost or truncated diagnostic data.
You use the db2set command to make changes to these registry variable settings. For
example, to redirect core files to the /tmp path permanently, issue the following
db2set command, which takes effect after you restart the instance:
db2set DB2FODC="DUMPDIR=/tmp"
You can also specify multiple registry variable settings for the db2set command,
separating the settings with a space. For example, to set both the FODCPATH and the
DUMPDIR registry variables at the same time, you can issue the following command,
replacing the variable values with values that apply to your own system:
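db2set DB2FODC="FODCPATH=/home/db2inst1/fodc DUMPDIR=/home/db2inst1/fodc"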
For more information about the registry variables that are supported by the db2set
command, see “General registry variables”
(http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp?
topic=/com.ibm.db2.luw.admin.regvars.doc/doc/r0005657.html).
When you run the db2support command to collect environment data, it searches a
number of paths for FODC packages, including the path that is indicated by the
FODCPATH registry variable. You can specify an additional existing directory for the
db2support command to search for FODC packages by using the -fodcpath
command parameter. For more information about the parameters for the
db2support command, see “db2support - Problem analysis and environment
collection tool command”
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admi
n.cmd.doc/doc/r0004503.html).
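For example, the following command (the output path, database name, and FODC path are
illustrative) collects environment data, connects to the database, and also searches the
specified directory for FODC packages:
db2support /tmp/db2support_out -d SAMPLE -c -fodcpath /home/db2inst1/fodc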
When you specify that you want to use rotating diagnostic and notification logs, a
series of rotating diagnostic log files and a series of rotating administration
notification log files are used that fit into the size that you defined for the diagsize
parameter. As log files fill up, the oldest files are deleted, and new files are created.
To see the current diagnostic logging setting, use the GET DBM CFG command:
When the value of the diagsize parameter is the default of 0, as shown in the
output, there is only one diagnostic log file, called the db2diag.log file. There is also
only one notification log file, which is named after the instance and has a .nfy file
extension. If configured as in the above example, these files grow in size indefinitely.
To configure for rotating diagnostic and notification logs, set the diagsize
configuration parameter to a nonzero value. The value that you specify depends on
your system. Most importantly, you want to avoid losing information too quickly
because of rapid file rotation (the deletion of the oldest log file) before you can archive
the old files. Generally, set the diagsize parameter to at least 50 MB, and make sure
that there is enough free space in the directory path that you specify for the
diagpath parameter. For example:
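db2 update dbm cfg using diagsize 512
In this example, 512 MB is an illustrative value; choose a size that fits the amount of
diagnostic data that your system generates.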
After you configure for rotating diagnostic logs, spend some time observing the
rotation of these files. The DB2 diagnostic and notification log files should be rotated
by the system every seven to 14 days. If they are rotated out too often, increase the
value of the diagsize parameter. If they are rotated too infrequently, decrease the
value of the parameter.
For more information about rotating diagnostic logs, see “DB2 diagnostic (db2diag)
log files”
(http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admin.trb.d
oc/doc/c0054462.html).
To archive the log files, use the db2diag -A command. To avoid filling up the
diagnostic directory path with the archived diagnostic data, archive the diagnostic log
files to a different file system or to backup storage. After archiving a file, retain the
diagnostic data for two to four weeks, for example, by backing it up to a storage
solution. After this retention period has passed, you can automatically delete the
diagnostic data. If you do not archive the log files to the intended location by
specifying a directory path, make sure that you move the archived diagnostic data to a
different location to free up the space in the diagnostic path.
The following example demonstrates how to archive. The directory listing shows
which db2diag log file is in use. You can tell that this is a rotating diagnostic log file
because a numerical identifier (0) is part of the file name.
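For example, on UNIX and Linux systems you can list the diagnostic directory (the path
shown here is the example path that is used earlier in this paper):
ls -ltr /var/log/db2diag
With rotating logs in use, the active diagnostic log file has a name such as db2diag.0.log.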
Now issue the db2diag -A command and include a destination path for the
archived logs:
db2diag -A /home/testuser/archive/
Having a good policy for regularly archiving and deleting the diagnostic and
notification logs takes care of diagnostic data that is regularly generated, but it does
not take care of all types of diagnostic data. You might need to remove other types of
data after you no longer need it. For example, if you run the db2support command
to prepare for uploading diagnostic data to the IBM Support site, you end up with a
compressed archive that takes up space. Remember to remove this archive after your
problem report is resolved. You must also manually remove any additional data
dump or FODC packages that are generated.
Diagnostic and notification logs: For both the primary diagnostic path that you
specify for the diagpath parameter and the alternate diagnostic path that you
specify for the alt_diagpath parameter, provide at least 20% more free space than
the value of the diagsize parameter.
Minimum space for diagnostic and notification logs = value of the diagsize parameter x 1.2
Core file dumps and FODC data: For free space, provide at least twice the amount of
physical memory of the machine, plus 20%. Providing this much space ensures that
you can store at least two full core file dumps or several FODC packages without
running the risk of truncated diagnostic data.
Minimum space for core files and FODC packages = 2 x physical memory x 1.2
Diagnostic data that you are uploading to the IBM Support site: If you run the
db2support command to prepare to upload diagnostic data to the IBM Support site,
make sure that enough space is available. The size of the db2support.zip file depends
on what parameters you specify for the db2support command, but the size of the
db2support.zip file can range from several megabytes to tens of gigabytes.
If you do not specify an output path, the resulting compressed archive is stored in the
directory path that you specified for the diagpath parameter.
• What information to collect from DB2 tools and logs and from operating
system tools and logs and what environmental information to collect
The first step is to characterize the issue by asking the following questions:
You might also ask whether there were any recent changes that might be
implicated in the problem. Some problems, such as performance problems or
problems that occur only intermittently or only after some time has elapsed, are much
more open ended and require an iterative approach to troubleshooting.
After you have characterized the issue, you can use a number of tools and logs. In the
sections that follow, the main tools and diagnostic logs are described. These tools
include FODC, DB2 diagnostic and administration notification logs, DB2 tools, and
operating system tools and logs.
The DB2 monitoring infrastructure can also provide a wealth of information about the
health and performance of DB2 servers. Using table functions, you can access a broad
range of real-time operational data (in-memory metrics) about the current workload
and activities, along with average response times. Using event monitors, you can
capture detailed activity information and aggregate activity statistics for historical
analysis.
You can also perform some of the troubleshooting and monitoring tasks that are
covered in this paper by using the IBM InfoSphere® Optim™ and IBM Data Studio
tools. Information about how to use all of these tools is outside the scope of this paper,
but you might consider the following Optim tools:
By default, FODC invokes a db2cos callout script to collect diagnostic data. The
db2cos callout script is located in the bin directory in the DB2 installation path (in the
sqllib/bin directory, for example). You can modify the db2cos callout script to
customize diagnostic data collection.
In DB2 V9.7 FP5 and later (excluding V9.8), FODC supports defining your own
threshold rules for detecting a specific problem scenario and collecting diagnostic data
in response. You define threshold rules by specifying the -detect parameter for the
db2fodc command. To detect a threshold condition and to trigger automatic
diagnostic data collection when the threshold condition is exceeded multiple times,
create an FODC threshold rule such as the following one:
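The following rule is a sketch only; the threshold condition names (free and connections)
and all of the values are illustrative, so check the db2fodc command documentation for the
conditions that are supported for your version:
db2fodc -memory basic -detect free"<=10" connections">=1000" sleeptime="30" iteration="10" interval="10" triggercount="4" duration="5"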
When the threshold conditions that are defined by the -detect parameter are met
and FODC memory collection is triggered, a message similar to the following one is
displayed in the command window:
Messages are also written in the db2diag.log file, as shown in the following example.
To get details about a triggered threshold, you can use tools and scripts to scan the
db2diag.log file and look for the string pdFodcDetectAndRunCollection,
probe:100.
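For example, on UNIX and Linux systems (adjust the path and file name to match your
diagnostic path and log file):
grep -n "pdFodcDetectAndRunCollection, probe:100" /var/log/db2diag/db2diag.log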
You can also gather diagnostic performance data selectively without defining
threshold rules by using the -cpu, -connections, or -memory parameter. These
parameters are alternatives to collecting diagnostic data more extensively and
expensively with the -perf and -hang parameters when you already have a
preliminary indication of where a problem might be occurring.
As of DB2 V9.7 FP4, FODC collects diagnostic data at the member level to provide
more granular access to diagnostic data. Member-level FODC settings provide greater
control than the instance-level or host-level settings that were supported in previous
releases and fix packs.
You can also use the db2diag command to merge multiple log files.
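For example, in an environment with per-member diagnostic directories, the following
command merges the members' db2diag log files, sorted by timestamp (a minimal sketch; see
the db2diag command documentation for the supported merge options):
db2diag -merge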
You should monitor the administration notification log to determine whether any
administrative or maintenance activities require manual intervention. For example, if
the directory where transaction logs are kept is full, this blocks new transactions from
being processed, resulting in an apparent application hang. In that situation, the DB2
process writes the error ADM1826E to the administration notification log, as shown in
this example:
The error condition is also written to the db2diag log file, as shown here:
For more information, see “db2pd - Monitor and troubleshoot DB2 database
command”
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.l
uw.admin.cmd.doc/doc/r0011729.html).
You can archive rotating diagnostic log files to retain diagnostic data that
would otherwise be eventually overwritten and move the files to a different
location for storage.
For more information, see “db2diag - db2diag logs analysis tool command”
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.l
uw.admin.cmd.doc/doc/r0011728.html).
• db2top command (UNIX and Linux operating systems): This command uses
the snapshot monitor to provide a single-system view for partitioned database
environments. The db2top command can help you identify performance
problems across the whole database system or in individual partitions. You
can also use the db2top command on single-partition environments.
• db2support command: This command archives all the diagnostic data from
the directory that you specify for the diagpath configuration parameter into
a compressed file archive. You typically use the db2support command to
prepare to upload the data to the IBM Support site or to analyze the diagnostic
data locally. You can limit the amount of data that is collected to a specific time
interval by using the -history or -time parameter, and you can decompress
the compressed file archive.
For more information, see “db2caem - Capture activity event monitor data
tool command”
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.l
uw.admin.cmd.doc/doc/r0057282.html).
You typically run the following commands under the guidance of IBM Support
personnel. These commands help you gather the information that support personnel
need to diagnose and correct problems.
• db2trc command: This command collects traces through the DB2 trace
facility. The process requires setting up the trace facility, reproducing the
error, and collecting the data. In V9.7 FP4 and later, the db2trcon and
db2trcoff scripts simplify using the db2trc command. The db2trc
command can have a significant performance impact unless you limit what
you trace to specific application IDs or top EDUs.
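A minimal sketch of a typical trace cycle, with an illustrative 8 MB trace buffer
(-l 8388608) and illustrative output file names, looks like the following example:
db2trc on -l 8388608
(Reproduce the problem.)
db2trc dmp trace.dmp
db2trc off
db2trc flw trace.dmp trace.flw
db2trc fmt trace.dmp trace.fmt
You then review the flow and format files or send them to IBM Support as directed.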
▪ To repair a database
For more information, see “db2dart - Database analysis and reporting tool
command”
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.l
uw.admin.cmd.doc/doc/r0003477.html).
On UNIX and Linux operating systems, the following system tools are available:
• vmstat command: This is a good overall tool for showing whether processor
or memory bottlenecks exist. You can run it continuously.
• iostat command: You can use this command to find out whether any disk
I/O bottleneck exists and what the I/O throughput is.
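For example, the following commands report statistics every 5 seconds; the interval is
illustrative, and the exact flags vary slightly by platform:
vmstat 5
iostat -x 5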
On Windows operating systems, you can use the following system tools and
commands:
• db2pd -vmstat command: Use the db2pd command with the -vmstat
parameter to show whether processor or memory bottlenecks exist.
• db2pd -iostat command: Use the db2pd command with the -iostat
parameter to show whether any disk I/O bottleneck exists.
• Task Manager: You can use this tool to show memory consumption and
processor usage. Alternatively, you can use the db2pd -edus command to
return similar information.
• Process Explorer: This tool is similar to the Task Manager but provides
additional information and functionality for processes.
• Windows Performance Monitor: You can use this tool to monitor system and
application performance in real time and historically. The tool supports data
collection that you can customize, and you can define automatic thresholds for
alerts and actions to take in response. The tool can also generate performance
reports.
The following operating system error logs, which differ by operating system, can
contain important diagnostic information:
• Windows operating systems: Event logs and the Dr. Watson log
Monitoring infrastructure
The lightweight, metrics-based monitoring infrastructure that was introduced in DB2
Version 9.7 and is available in DB2 Version 10.5 provides pervasive and continuous
monitoring of both system and query performance. Compared to the older snapshot
and system monitor, the monitoring infrastructure provides real-time in-memory
aggregation and accumulation of metrics within the DB2 system at different levels,
with a relatively low impact on the system.
You can use the DB2 monitoring infrastructure to gain an understanding of the typical
workloads that your data server processes. Understanding your typical workloads
makes it easier to recognize quickly when an atypical event that requires further
investigation might be happening. The monitoring infrastructure provides information
such as the following:
• Time-spent metrics that identify how the time that is spent breaks down into
time spent waiting (lock wait time, buffer pool I/O time, and direct I/O time)
and time spent doing processing.
• In-memory metrics. SQL table functions provide highly granular access to
these metrics.
• Section explain information, which shows the access plan that was executed for a
statement without the need to recompile the statement.
• Section actuals, which can shorten the time to discover problem areas in an
access plan when you compare them to the estimated access plan values. You
use the db2caem command to get section actuals.
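For example, the following db2caem invocation (the database name and statement are
illustrative) captures activity event monitor data, including section actuals, for a single
statement:
db2caem -d sample -st "select count(*) from t1"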
For more information about the available configuration parameters and the metrics
that are returned for each parameter setting, see “Configuration parameters” (
http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/index.jsp?
topic=/com.ibm.db2.luw.admin.config.doc/doc/c0004555.html).
To further reduce the system impact of high-volume event monitors and to reduce
storage requirements, DB2 Version 9.7 introduced a new event monitor target type,
the unformatted event table. You can format unformatted event table data by using
the db2evmonfmt Java-based tool or the EVMON_FORMAT_UE_TO_TABLES and
EVMON_FORMAT_UE_TO_XML routines.
Event monitors that use unformatted event tables to store their data include the
following ones:
● Unit of work event monitor: This replaces the transaction event monitor. You
can control the granularity of the information that is returned at the workload
or at the database level. Data that is captured includes in-memory metrics. The
following statement, shown after this list, creates a unit of work event monitor:
● Package cache event monitor: This captures both dynamic and static SQL
entries when they are removed from the package cache. Entries begin to be
captured as soon as the event monitor is activated. You can control the
granularity of the information that is returned by using the WHERE clause
when you define the event monitor. The WHERE clause for the event monitor
can include one or more of the following predicates (ANDed): the number of
executions, overall aggregate execution time, and evicted entries whose
metrics were updated since last boundary time set using the
MON_GET_PKG_CACHE_STMT function. You can also use the event monitor
definition to control the level of information that is captured; options include
BASE and DETAILED. The following statement shows an example of a package
cache event monitor:
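The following statements are minimal sketches of the two event monitor definitions that
are described in the preceding list; the event monitor names are illustrative, and the
optional WHERE and COLLECT clauses are omitted:
CREATE EVENT MONITOR uowevmon FOR UNIT OF WORK WRITE TO UNFORMATTED EVENT TABLE
CREATE EVENT MONITOR pkgcacheevmon FOR PACKAGE CACHE WRITE TO UNFORMATTED EVENT TABLE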
Consider the following example. The users of a new Java application are complaining
about slow performance. Expected results are not being returned within the expected
time frame.
As a first step, you can run the MONREPORT.DBSUMMARY summary report for six
minutes (as an example) to get a picture of the system and application performance
metrics for the database while the application is running. To run the summary report
for six minutes, issue the following command:
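db2 "call monreport.dbsummary(360)"
The value 360 is the monitoring interval in seconds (6 minutes).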
-----------------------------------------------------------------------------
-- Detailed breakdown of TOTAL_WAIT_TIME --
% Total
---------------------------------------------
TOTAL_WAIT_TIME 100 711546
The output shows that most of the wait is happening while data is being sent back to
the client (the value in the TCPIP_SEND_WAIT_TIME row is high). Further
investigation with JDBC tracing reveals an improper override setting by the
Statement.setFetchSize method. After you set the fetch size (FetchSize
parameter) correctly, application performance not only improves but exceeds
expectations.
Other monitoring reports that you might find useful in troubleshooting include the
MONREPORT.CONNECTION, the MONREPORT.CURRENTAPPS, the
MONREPORT.CURRENTSQL, the MONREPORT.PKGCACHE, and the
MONREPORT.LOCKWAIT reports.
Full coverage of the monitoring infrastructure is beyond the scope of this paper,
although the monitoring infrastructure is used in some of the scenarios. For more
information about DB2 monitoring, see “Database monitoring”
(http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.admi
n.mon.doc/doc/c0001138.html) in the DB2 Information Center.
For example, to collect FODC data during a performance issue on members 10, 11, 12,
13, and 15, issue the following command:
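The following command is a sketch; it assumes a basic performance collection (-perf basic)
and uses the member list syntax of the db2fodc command:
db2fodc -perf basic -member 10-13,15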
Scenarios
The following scenarios show how you can apply some of the troubleshooting best
practices. The list of scenarios represents only a small sample of possible scenarios
that you might encounter while troubleshooting a DB2 server. For links to additional
scenarios and recommendations, see the “Additional information” section of this
paper's web page in the DB2 best practices developerWorks community.
You log on to the database server to do a preliminary investigation. You run some
operating system tools, such as the top command or the Windows Task Manager, to
see what the current processor usage is. As these tools are running, you observe that
the processor usage occasionally spikes to above 90% and remains high for several
minutes before dropping down to what seem to be more typical usage levels:
To capture diagnostic data for the intermittent processor usage spikes that you
observed, you define an FODC threshold rule. An FODC threshold rule is a tool that
waits for the resource conditions that you define to occur. In this case, you have some
preliminary information that points to high processor usage. If you don't know what
system resources are constrained, you can adapt this scenario to collect data about
additional system resources, such as connections and memory. After you set up the
FODC threshold rule, it triggers an FODC collection whenever the threshold
conditions that you specified for processor usage are exceeded, for as many
occurrences of the problem as you specified.
You define an FODC threshold rule for processor usage by using the db2fodc
-detect command. The db2fodc -detect command performs detection at regular
intervals for as long as you tell it to, if you specify a duration. If you do not specify a
duration, detection runs until the threshold conditions are triggered. The term
threshold conditions in this context refers to both a specific frequency of the problem
and a duration that must be met before a collection is triggered.
The following threshold rule is a good start for detecting processor usage spikes:
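The following rule is a sketch; the threshold condition that is shown (us, user processor
time) is a simplified placeholder for the combined user and system processor usage rate
that is described next, so check the db2fodc command documentation for the exact condition
names that your version supports:
db2fodc -cpu basic -detect us">=90" triggercount="4" interval="10" sleeptime="30" iteration="10" duration="5"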
In this case, the threshold rule is used to detect a combined user and system processor
usage rate that is higher than 90%. For FODC collection to be triggered, the threshold
conditions must exist for 40 seconds for each iteration (triggercount value of 4 x
interval value of 10 seconds = 40 seconds). The detection process sleeps for 30
seconds between each iteration. The total time that detection is enabled is 5 hours or
10 iterations of successful detection in total, whichever comes first. If FODC collection
is triggered, a new directory with a name that is prefixed with FODC_CPU_ is created
in the current diagnostic path. Only the lighter-weight, basic collection of diagnostic
data is performed.
This output specifies where to look for the diagnostic data for the particular FODC
package (/var/log/db2diag/db2dump/FODC_Cpu_2013-07-14-11.09.51.739430_0000).
The output also provides a bit of information about what types of data are collected.
After the collection is finished, the db2fodc -detect command either stops
running or executes the next iteration after sleeping for some time. The amount of
time to sleep is determined by the value of the sleeptime option if you specify it, or 1
second if you do not specify a value. Whether detection continues depends on how often
the threshold trigger conditions have been met at this point and how much time has
passed (that is, on the values of the iteration and duration options that you used).
As defined in the previous example, detection and FODC collection continue until
either all 10 iterations of detection are complete or the end of the threshold duration is
reached.
Data collected
The diagnostic data that is collected is stored in an FODC package (a directory path).
This path is created inside the path that you specified for the FODCPATH parameter
when you configured your data server. If you did not configure the paths where
diagnostic data is stored ahead of time, see the section “Be prepared: configure your
data server ahead of time“ to learn about how to configure your system.
The contents of the FODC package directory path might look like the following
example:
Data analysis
An analysis of the output of the vmstat command (FODC_Cpu_2013-07-14-
15.15.40.604026_0000/vmstat_2013-07-14.15.15.40.000000/db2v10.vmstat.out) shows a
combined user and system processor usage rate of 100%, which in turn triggered the
FODC collection:
Now, investigate the cause of these high processor usage rates. To narrow down the
cause, you must use both the stack trace log and the output of the db2pd command.
Two stack trace logs are created during the FODC collection. These indicate the top
DB2 consumers of processor resources, in descending order over an interval of 30
seconds. The information that is given is for the top coordinator agents (db2agents),
which perform all database requests on behalf of the application.
Here are the top coordinator agents from the stack trace log (FODC_Cpu_2013-07-14-
15.15.40.604026_0000/FODC_Perf_2013-07-14-
15.15.44.474725_0000/StackTrace.0075/StackTrace.log.0):
Look for one or several coordinator agents that use significantly more processor
resources than other agents use, which gives you a clue for the next step. In this
output, db2agent EDU 54 looks promising, based on the amount of processor
resources that it used:
...
54 2863655792 9135 db2agent (SAMPLE) 0 130.200000 1.170000
57 2643454832 9193 db2agent (idle) 0 2.040000 1.070000
53 2864704368 9109 db2agent (idle) 0 1.880000 0.550000
...
You can see that db2agent EDU 54 uses far more resources than the next two
coordinator agents use. The other stack trace log (which is not shown but looks very
similar to the previous sample output) also shows db2agent EDU 54 at the top of the
list.
The db2agent EDU number is only an intermediate piece of information and is not
useful by itself. However, you can use it to gain additional insight into the
application that the db2agent is working for. Look at the output folder of the
db2pd command and see whether you can correlate the db2agent number with a
specific application ID (alternatively, use the snapshot output for the same purpose).
Several db2pd command output files are created during FODC, each showing similar
output; you need the information from only one of them. Searching for the
coordinator agent 54 in one of the db2pd command output files (FODC_Cpu_2013-07-
14-15.15.40.604026_0000/DB2PD_2013-07-14.15.15.40.000000) yields the following
result:
Address AppHandl [nod-index] AgentEDUID Priority Type State ClientPid Userid ClientNm ...
0x13A37F80 195 [000-00195] 54 0 Coord Inst-Active 8724 db2inst1 db2bp ...
Note the process ID, 8724. This process ID is another important clue and gets you
closer to determining the query statement that is the likely culprit behind the spikes in
processor usage. All you have to do now is to search for additional occurrences of the
process ID in the db2pd command output:
Application :
Address : 0x13BB0060
AppHandl [nod-index] : 195 [000-00195]
TranHdl : 3
Application PID : 8724
Application Node Name : db2v10
IP Address: n/a
Connection Start Time : (1373840099)Sun Jul 14 15:14:59 2013
Client User ID : db2inst1
System Auth ID : DB2INST1
Coordinator EDU ID : 54
Coordinator Member : 0
Number of Agents : 1
Locks timeout value : NotSet
Locks Escalation : No
Workload ID : 1
Workload Occurrence ID : 1
Trusted Context : n/a
Connection Trust Type : non trusted
Role Inherited : n/a
Application Status : UOW-Executing
Application Name : db2bp
Application ID : *LOCAL.db2inst1.130714221459
ClientUserID : n/a
ClientWrkstnName : n/a
ClientApplName : CLP longquery.db2
ClientAccntng : n/a
CollectActData: N
CollectActPartition: C
SectionActuals: N
How can you address the impact of this other query? You might be able to rewrite the
query so that it becomes less expensive to run. Alternatively, you can use some
standard DB2 workload management practices to run the query in a more controlled
fashion, without using excessive system resources.
There might be cases where it is not easy to determine the cause of a problem. Even if
you cannot resolve an issue yourself, you can set up an FODC threshold rule to collect
the required diagnostic data for different system resources, which you can then
provide to IBM Support for further analysis. IBM Support needs the diagnostic data to
be able to help, especially with intermittent problems. If you have the diagnostic data
ready, you can reduce the amount of time that it takes to diagnose the underlying
issue.
A general performance slowdown that is perceived by users can have many different
causes. In this case, the focus is on the performance impact that a large number of sort
overflows, also known as sort spills, can cause. If you don't know whether sort
overflows are a problem on your system, use this scenario to find out.
Queries often require a sort operation. A sort is performed when no index exists that
would satisfy the sort order or when an index exists, but sorting is determined to be
more efficient. Sort overflows occur when the data that is being sorted does not fit into
the memory that is allocated for the sort heap. During the sort overflow, the data to
be sorted is divided into several smaller sort runs and stored in a temporary table
space. When sort overflows that are stored in the temporary table space also require
writing to disk, they can negatively impact the performance of your data server.
26 record(s) selected.
Look at the SORT_OVERFLOWS column in the output; any nonzero value indicates
that a query performed a sort operation that spilled over to disk. In this example, there
is one SELECT statement on table T1 that resulted in a sort overflow. Also, the
TOTAL_SECTION_SORT_TIME column for the same statement indicates that the
section sort operation took a very long time. A section in this context is the compiled
query plan that was generated by the SQL statement that was issued. The unit of
measurement is milliseconds; when converted, the total sort time is around 32
minutes, which makes the query long running.
The MON_GET_PKG_CACHE_STMT table function can return other useful columns that
you can include in your query. For example, the NUM_EXECUTIONS column can
show how often a statement was executed. It might also be useful to return more of
the statement text by modifying the sample query. For more information about
metrics, see “MON_GET_PKG_CACHE_STMT table function - Get SQL statement
activity metrics in the package cache”
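The following query is a sketch of the kind of statement that you can use; the column
selection, the substring length, and the ordering are illustrative:
SELECT SUBSTR(STMT_TEXT, 1, 50) AS STATEMENT,
       NUM_EXECUTIONS,
       SORT_OVERFLOWS,
       TOTAL_SECTION_SORT_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
ORDER BY SORT_OVERFLOWS DESC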
The db2pd command is also useful in this context, because you can investigate sort
performance while a perceived query performance problem is happening. You do not
have to wait for the monitoring information to be updated before you can begin
troubleshooting the problem. This feature is helpful if your queries are very long
running, as in the example that is used here.
To monitor sort performance with the db2pd command, use the parameters in the
following example:
The following sample output is abridged to highlight how you can determine whether
sort overflows are happening and which applications and SQL statements are involved:
Database Member 0 -- Database SAMPLE -- Active -- Up 2 days 01:36:58 -- Date 04/22/2013 20:34:29
AppHandl [nod-index]
950 [000-00950]
SortCB MaxRowSize EstNumRows EstAvgRowSize NumSMPSorts NumSpills
0xE7750430 133 109155368 136 1 187711
KeySpec
CHAR:128
Applications:
Address AppHandl [nod-index] ... Status C-AnchID ... Appid
0xF15F0060 950 [000-00950] ... UOW-Executing 542 ... *LOCAL.DB2.130420233443
In this case, the NumSpills column indicates that there are sort overflows. The
NumSpilledRows column shows that these sort overflows resulted in writing a large
number of rows to disk.
To determine the application and SQL statement, first use the ApplHandl value, 950
in this example, to locate the application information, and note the value in the C-
AnchID column, here 542. Next, locate the same C-AnchID value in the output for
SQL statements to find the statement text.
There might be no way to avoid sort overflows by increasing the value of the
sortheap parameter, because of the amount of memory that is required. However,
you can still take some actions to minimize the impact of sort overflows. Ensure that
the buffer pool for temporary table spaces is large enough to minimize the amount of
disk I/O that sort overflows cause. Furthermore, to achieve I/O parallelism during the
merging of sort runs, you can define temporary table spaces in multiple containers,
each on a different disk. To assess how well temporary data is used in the buffer pool,
use the db2pd command with the -bufferpool parameter. A section of the output
shows the cache hit ratios of temporary table space data and indexes.
If more than one index is defined on a table, memory usage increases proportionally,
because the sort operation keeps all index keys in memory. To keep memory usage to
a minimum, create only the indexes that you need.
The Transaction Throughput and Statement Throughput graphs show that the system
was originally working fine with good throughput. At a specific time, the system
experienced a very significant drop in throughput, almost to zero. However, there
was still some activity on the system during that period, as shown by the Row
Throughput and Rows Read per Fetched Row graphs. They show that a high number
of rows were read even though almost no transactions were completed. These
symptoms suggest that one transaction might have held locks and blocked most other
transactions from executing. These symptoms might also indicate that a very large
query was reading a very large number of rows. To diagnose the cause, you must
investigate further.
You can configure OPM to monitor locking events and notify you when particular
events occur or exceed a threshold. You can use the Locking configuration dialog,
which is shown in Figure 2, to configure this monitoring.
Figure 2. You can use the Locking configuration dialog to specify locking alerts and
the amount of detail to collect for locking events
The Overview dashboard displays average values across the selected Time Slider
interval for a wide range of metrics. Two metrics are of particular interest in this
scenario: both the DB2 Lock Wait Time and the Average Lock Wait Time metrics show
significant increases from the baseline. These increases provide further evidence of a
locking problem.
You can investigate the problem further with the Locking dashboard. You can access
the Locking dashboard from the Overview dashboard. The Locking dashboard
provides detailed information for all locking events on a separate tab. Figure 4
highlights a section of the Locking dashboard showing locking events.
Figure 4. Locking dashboard highlighting large values of Lock Wait Alerts and Block
Time
The current Maximum Block Time and Lock Wait Alerts metrics are of particular
interest when you compare them with the baseline, which is shown in the dashed
border box in the figure. OPM has recorded a much higher number of lock wait alerts
than is typical. The baseline shows zero lock wait alerts and a very short maximum
block time.
You can investigate the problem further by selecting an individual lock timeout event
from the dashboard and displaying detailed information for it. Figure 5 highlights
some of the key information that is displayed about the lock event after you double-
click to select it.
The lock timeout event details show information for both participants in the lock
event: the owner of the lock and its requestor. To see the lock owner's SQL statement,
select the Statements details. The information includes the complete text of the SQL
statement, the details of the lock that is being held, and the isolation level of the
transaction. In this example, the isolation level is repeatable read (RR). This is the
likely cause of the multiple lock timeout events and slowdown in transaction
throughput. A transaction using the RR isolation level can hold a large number of
locks during a unit of work (UOW) and cause many other transactions to be blocked,
waiting for locks to be released.
When you click the red icon, it opens the locking alert list for the database. If you
select a specific alert, you can see the details of the alert, as shown in the following
example.
To drill down into the full details for this event, click Analyze .
The event details window provides more information about each participant in the
event. This information helps you pinpoint the cause of the problem and determine a
course of action to correct the issue.
• Use the scenarios in this paper as examples for how you can use
the various troubleshooting and monitoring tools.
• Know when you are faced with a problem that you cannot resolve
on your own and therefore must engage with IBM for technical
support.
The trend is toward more granular diagnostic data collection, especially on large
database systems. On these systems, end-to-end diagnostic data collection is often too
expensive and carries the risk of affecting database availability. To lessen the impact
of diagnostic data collection, DB2 tools such as FODC can collect data about ongoing
problems locally and selectively.
To prepare for a possible problem, it is important that you configure your data server
before problems occur. Troubleshooting is much easier if the data is readily
available and the impact to the performance of the system is well controlled.
Part of being prepared also means knowing the typical workloads that your data
server processes. If you understand your typical workloads, you are much more likely
to know quickly when an atypical event might be happening that requires further
investigation.
Contributors
Dmitri Abrashkevich, DB2 Development, IBM
Contacting IBM
To provide feedback about this paper, write to db2docs@ca.ibm.com.
To contact IBM in your country or region, see the IBM Directory of Worldwide
Contacts at http://www.ibm.com/planetwide.
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program, or
service is not intended to state or imply that only that IBM product, program, or service
may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you any
license to these patents. You can send license inquiries, in writing, to:
The following paragraph does not apply to the United Kingdom or any other country
where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS
MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF
ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain
transactions, therefore, this statement may not apply to you.
This document and the information contained herein may be used solely in connection
with the IBM products discussed in this document.
Any references in this information to non-IBM websites are provided for convenience only
and do not in any manner serve as an endorsement of those websites. The materials at
those websites are not part of the materials for this IBM product and use of those websites
is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM has not
tested those products and cannot confirm the accuracy of performance, compatibility
or any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business operations.
To illustrate them as completely as possible, the examples include the names of
individuals, companies, brands, and products. All of these names are fictitious and any
similarity to the names and addresses used by an actual business enterprise is entirely
coincidental.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and
other IBM trademarked terms are marked on their first occurrence in this information with
a trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks
may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.