CICS Recovery and Restart Guide
CICS® Transaction Server for OS/390®
SC33-1698-02
Note!
Before using this information and the product it supports, be sure to read the general information under “Notices” on page xi.
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
What this book is about . . . . . . . . . . . . . . . . . . . . . xiii
Who should read this book . . . . . . . . . . . . . . . . . . . . xiii
What you need to know to understand this book . . . . . . . . . . . . xiii
How to use this book . . . . . . . . . . . . . . . . . . . . . . xiii
Determining if a publication is current . . . . . . . . . . . . . . . . xiii
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . xv
CICS Transaction Server for OS/390 . . . . . . . . . . . . . . . . xv
CICS books for CICS Transaction Server for OS/390 . . . . . . . . . xv
CICSPlex SM books for CICS Transaction Server for OS/390 . . . . . . xvi
Other CICS books . . . . . . . . . . . . . . . . . . . . . . xvi
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . 3
Faults and their effects . . . . . . . . . . . . . . . . . . . . . 3
Comparison of batch and online systems . . . . . . . . . . . . . . 3
Recovery requirements in a transaction processing system . . . . . . . . 4
Maintaining the integrity of data . . . . . . . . . . . . . . . . . 4
Minimizing the effect of failures . . . . . . . . . . . . . . . . . 4
The role of CICS . . . . . . . . . . . . . . . . . . . . . . . . 5
Recoverable resources . . . . . . . . . . . . . . . . . . . . 5
CICS backward recovery (backout) . . . . . . . . . . . . . . . . . 6
Dynamic transaction backout . . . . . . . . . . . . . . . . . . 7
Emergency restart backout . . . . . . . . . . . . . . . . . . . 7
CICS forward recovery. . . . . . . . . . . . . . . . . . . . . . 8
Forward recovery of CICS data sets. . . . . . . . . . . . . . . . 8
Forward recovery of other (non-VSAM) resources. . . . . . . . . . . 9
Failures that require CICS recovery processing . . . . . . . . . . . . 9
CICS recovery processing following a communication failure. . . . . . . 9
CICS recovery processing following a transaction failure . . . . . . . . 11
CICS recovery processing following a system failure. . . . . . . . . . 11
Contents v
Communications between application and user. . . . . . . . . . . . 99
Security . . . . . . . . . . . . . . . . . . . . . . . . . . 100
System definitions for recovery-related functions . . . . . . . . . . . . 100
System recovery table (SRT) . . . . . . . . . . . . . . . . . . 100
CICS-required resource definitions . . . . . . . . . . . . . . . . 100
Definition of system log streams and general log streams . . . . . . . . 101
System initialization parameters . . . . . . . . . . . . . . . . . 101
Files . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Transient data queues . . . . . . . . . . . . . . . . . . . . . 102
Temporary storage table (TST). . . . . . . . . . . . . . . . . . 102
Program list table (PLT) . . . . . . . . . . . . . . . . . . . . 102
Transaction list table (XLT) . . . . . . . . . . . . . . . . . . . 102
Documentation and test plans . . . . . . . . . . . . . . . . . . . 103
Chapter 16. Moving recoverable data sets that have retained locks . . . 175
Procedure for moving a data set with retained locks . . . . . . . . . . . 175
Using the REPRO method . . . . . . . . . . . . . . . . . . . 175
Using the EXPORT and IMPORT functions . . . . . . . . . . . . . 178
Rebuilding alternate indexes . . . . . . . . . . . . . . . . . . 178
Recovery of data set with loss of volume . . . . . . . . . . . . . . 182
Forward recovery of data sets accessed in non-RLS mode . . . . . . . . 190
Procedure for failed RLS mode forward recovery operation . . . . . . . . 191
Procedure for failed non-RLS mode forward recovery operation . . . . . . 193
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Notices
This information was developed for products and services offered in the U.S.A. IBM
may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program,
or service is not intended to state or imply that only that IBM product, program, or
service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However,
it is the user’s responsibility to evaluate and verify the operation of any non-IBM
product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not give you any
license to these patents. You can send license inquiries, in writing, to:
For license inquiries regarding double-byte (DBCS) information, contact the IBM
Intellectual Property Department in your country or send inquiries, in writing, to:
The following paragraph does not apply in the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A
PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore this statement may not apply to
you.
Licensees of this program who wish to have information about it for the purpose of
enabling: (i) the exchange of information between independently created programs
and other programs (including this one) and (ii) the mutual use of the information
which has been exchanged, should contact IBM United Kingdom Laboratories,
MP151, Hursley Park, Winchester, Hampshire, England, SO21 2JN. Such
information may be available, subject to appropriate terms and conditions, including
in some cases, payment of a fee.
Trademarks
The following terms are trademarks of International Business Machines Corporation
in the United States, or other countries, or both:
Other company, product, and service names may be trademarks or service marks
of others.
The information in this book is generally restricted to a single CICS region. For
information about ISC and MRO, see the CICS Intercommunication Guide. For
information about XRF systems, see the CICS/ESA 3.3 XRF Guide.
FEPI: This book does not describe recovery and restart for the CICS front end
programming interface. For information on this topic, see the CICS Front End
Programming Interface User’s Guide.
For CICS Transaction Server books, these softcopy updates appear regularly on the
Transaction Processing and Data Collection Kit CD-ROM, SK2T-0730-xx. Each
reissue of the collection kit is indicated by an updated order number suffix (the -xx
part). For example, collection kit SK2T-0730-06 is more up-to-date than
SK2T-0730-05. The collection kit is also clearly dated on the cover.
If you have any questions about the CICS Transaction Server for OS/390 library,
see CICS Transaction Server for OS/390: Planning for Installation, which discusses
both hardcopy and softcopy books and the ways that the books can be ordered.
The new topics in this edition are covered in the following chapters:
v “Chapter 8. Unit of work recovery and abend processing” on page 69
v “Chapter 11. Defining system and general log streams” on page 105, which
replaces the obsolete “Logging and journaling” chapter
v “Chapter 15. Resolving retained locks on recoverable resources” on page 157
v “Chapter 16. Moving recoverable data sets that have retained locks” on page 175
v “Chapter 17. Forward recovery procedures” on page 179
v “Chapter 20. Disaster recovery” on page 237
Because of the extensive changes and additions to this edition, new material is not
marked, and you should regard this as a new book.
In batch systems, input data is usually prepared before processing begins, and jobs
can be rerun, either from the start of the job or from some intermediate checkpoint.
In mixed systems, where both batch and online processing can occur against data
at the same time, the recovery requirements for batch processing and online
systems are similar.
Logging changes
One way of maintaining the integrity of a resource is to keep a record, or log, of all
the changes made to a resource while the system is executing normally. If a failure
occurs, the logged information can help recover the data.
If processing for the entire system stops, there may be many users whose updating
work is interrupted. On a subsequent startup of the system, only those data set
updates in process (in-flight) at the time of failure should be backed out. Backing
out only the in-flight updates makes restart quicker, and reduces the amount of data
to reenter.
Automatic backout is provided for most CICS resources (such as databases, files,
and auxiliary temporary storage queues), either following a transaction failure or
during an emergency restart of CICS. The logging functions necessary to support
backout are performed for you by the CICS recovery manager and the log manager.
If the backout of a VSAM file fails, CICS backout failure processing ensures that all
locks on the backout-failed records are retained, and the backout-failed parts of the
unit of work (UOW) are shunted to await retry. The VSAM file remains open for
use. (For an explanation of shunted units of work and retained locks, see “The
shunted state” on page 13.)
If the cause of the backout failure is a physically damaged data set, and provided
the damage affects only a localized section of the data set, you can choose a time
when it is convenient to take the data set offline for recovery. You can then use the
forward recovery log with a forward recovery utility, such as CICSVR, to restore the
data set and re-enable it for CICS use.
Note: In many cases, a data set failure also causes a processing failure. In this
event, forward recovery must be followed by backward recovery.
You don’t need to shut CICS down to perform these recovery operations. For data
sets accessed by CICS in VSAM record-level sharing (RLS) mode, you can
quiesce the data set to allow you to perform the forward recovery offline. On
completion of forward recovery, setting the data set to unquiesced causes CICS to
perform the backward recovery automatically.
For files accessed in non-RLS mode, you can issue an EXEC CICS or CEMT SET
DSNAME RETRY after the forward recovery, which causes CICS to perform the
backward recovery online.
Another way is to shut down CICS with an immediate shutdown and perform the
forward recovery, after which a CICS emergency restart performs the backward
recovery.
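The quiesce, retry, and restart alternatives described above might be issued by an
operator as follows. This is an illustrative sketch only; the data set name is
invented for the example:

```
Quiesce an RLS-mode data set so that forward recovery can be
performed offline; unquiescing it afterwards causes CICS to
perform the backward recovery automatically:

   CEMT SET DSNAME(ACCT.CICSTS.DBASE1) QUIESCED
   ... run the forward recovery utility offline ...
   CEMT SET DSNAME(ACCT.CICSTS.DBASE1) UNQUIESCED

For a data set accessed in non-RLS mode, retry the shunted
backout online after forward recovery completes:

   CEMT SET DSNAME(ACCT.CICSTS.DBASE1) RETRY
```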
Recoverable resources
In CICS, a recoverable resource is any resource with recorded recovery
information that can be recovered by backout.
Chapter 1. Introduction 5
v Coupling facility data tables
v The CICS system definition (CSD) file
v Intrapartition transient data destinations
v Auxiliary temporary storage queues
v Resource definitions dynamically installed using resource definition online (RDO)
If a data set failure occurs, you can use a backup of the data set and a forward
recovery utility, such as CICS VSAM Recovery MVS/ESA™ (CICSVR), to recover
the VSAM file.
Before a change is made to a resource, the recovery information for backout, in the
form of a before-image, is recorded on the CICS system log. A before-image is a
record of what the resource was like before the change. These before-images are
used by CICS to perform backout in two situations:
v In the event of failure of an individual in-flight transaction, which CICS backs out
dynamically at the time of failure (dynamic transaction backout)
v In the event of an emergency restart, when CICS backs out all those transactions
that were in-flight at the time of the CICS failure (emergency restart backout).
Note: Although these occur in different situations, CICS uses the same backout
process in each case—CICS does not distinguish between dynamic backout
and emergency restart backout. See “Chapter 6. CICS emergency restart” on
page 57 for an explanation of how CICS re-attaches failed in-flight units of
work in order to perform transaction backout following an emergency restart.
Each CICS region has only one system log, which cannot be shared with any other
CICS region. The system log is written to a unique MVS system logger log stream.
The CICS system log is intended for use only for recovery purposes—for example,
during dynamic transaction backout, or during emergency restart. It is not meant to
be used for any other purpose.
For example, when any updates made to a recoverable data set are to be backed
out, file control uses the system log records to reverse the updates. When all the
updates made in the unit of work have been backed out, the unit of work is
completed. The locks held on the updated records are freed if the backout is
successful.
For data sets open in RLS mode, CICS requests VSAM RLS to release the locks;
for data sets open in non-RLS mode, the CICS enqueue domain releases the locks
automatically.
During emergency restart, the recovery manager uses the system log data to drive
backout processing for any UOWs that were in-flight at the time of the failure. The
backout of UOWs during emergency restart is the same as a dynamic backout;
there is no distinction between the backout that takes place at emergency restart
and that which takes place at any other time. At this point, while recovery
processing continues, CICS is ready to accept new work for normal processing.
The recovery manager drives these backout and commit processes because the
condition that caused them to fail may be resolved by the time CICS restarts. If the
condition that caused a failure has not been resolved, the UOW remains in backout-
or commit-failed state. See “Backout-failed recovery” on page 75 and “Commit-failed
recovery” on page 79 for more information.
CICS writes the after-images of changes made to a data set to a forward recovery
log, which is a general log stream managed by the MVS system logger.
CICS obtains the log stream name of a VSAM forward recovery log in one of two
ways:
1. For files opened in VSAM record level sharing (RLS) mode, the explicit log
stream name is obtained directly from the VSAM ICF catalog entry for the data
set.
2. For files in non-RLS mode, the log stream name is derived from:
v The VSAM ICF catalog entry for the data set if it is defined there (and if
RLS=YES is specified as a system initialization parameter). In this case,
CICS file control manages writes to the log stream directly.
v A journal model definition referenced by a forward recovery journal name
specified in the file resource definition.
Forward recovery journal names are of the form DFHJnn where nn is a
number in the range 1–99 and is obtained from the forward recovery log id
(FWDRECOVLOG) in the FILE resource definition.
In this case, CICS creates a journal entry for the forward recovery log, which
can be mapped by a JOURNALMODEL resource definition. Although this
method enables user application programs to reference the log, and write
user journal records to it, you are recommended not to do so—you should
ensure that forward recovery log streams are reserved for forward recovery
data only.
Note: You cannot use a CICS system log stream as a forward recovery log.
For details of procedures for performing forward recovery, see “Chapter 17. Forward
recovery procedures” on page 179.
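As an illustration of the second (non-RLS) method, hypothetical RDO definitions
such as the following associate a file's forward recovery journal, DFHJ07, with a
general log stream. All of the resource, data set, and log stream names here are
invented for the example:

```
DEFINE FILE(ACCTFIL) GROUP(RECOVGRP)
       DSNAME(CICSTS.ACCOUNTS.DBASE1)
       RECOVERY(ALL) FWDRECOVLOG(07)

DEFINE JOURNALMODEL(FWDMOD) GROUP(RECOVGRP)
       JOURNALNAME(DFHJ07)
       TYPE(MVS)
       STREAMNAME(CICSTS.FWDRECOV.ACCOUNTS)
```

With these definitions, FWDRECOVLOG(07) in the FILE definition names journal
DFHJ07, and the journal model maps that journal to the named log stream.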
If the link fails and is later reestablished, CICS and its partners use the SNA
set-and-test-sequence-numbers (STSN) command to find out what they were doing
(backout or commit) at the time of link failure. For more information on link failure,
see the CICS Intercommunication Guide.
When communication fails, the communication system access method either retries
the transmission or notifies CICS. If a retry is successful, CICS is not informed.
Information about the error can be recorded by the operating system. If the retries
are not successful, CICS is notified.
Both dummy and sample versions of these programs are provided by CICS. The
dummy versions do nothing; they simply allow the default actions selected by CICS
to proceed. The sample versions show how to write your own NEP or TEP to
change the default actions.
The loss of an MVS image in a sysplex is detected by XCF in another MVS, and
XCF issues message IXC402D. If the failed MVS is running CICS regions
connected through XCF/MRO to CICS regions in another MVS, tasks running in the
active regions are initially suspended in an IRLINK WAIT state.
XCF/MRO-connected regions do not detect the loss of an MVS image and its
resident CICS regions until an operator replies to the XCF IXC402D message.
When the operator replies to IXC402D, the CICS interregion communication
program, DFHIRP, is notified and the suspended tasks are abended, and MRO
connections closed. Until the reply is issued to IXC402D, an INQUIRE
CONNECTION command continues to show connections to regions in the failed
MVS as in service and normal.
After dynamic transaction backout has completed, the transaction can restart
automatically without the operator being aware of it happening. This function is
especially useful in those cases where the cause of transaction failure is temporary
and an attempt to rerun the transaction is likely to succeed (for example, DL/I
program isolation deadlock). The conditions when a transaction can be
automatically restarted are described under “Abnormal termination of a task” on
page 90.
“Chapter 8. Unit of work recovery and abend processing” on page 69 gives more
details about CICS processing of a transaction failure.
During normal execution, CICS stores recovery information on its system log
stream, which is managed by the MVS system logger. If you specify
START=AUTO, CICS automatically performs an emergency restart when it restarts
after a system failure.
During an emergency restart, the CICS log manager reads the system log
backward and passes information to the CICS recovery manager.
The CICS recovery manager then uses the information retrieved from the system
log to:
v Back out recoverable resources.
v Recover changes to terminal resource definitions. (All resource definitions
installed at the time of the CICS failure are initially restored from the CICS global
catalog.)
Unit of work
The period between the start of a particular set of changes and the point at which
they are complete is called a unit of work (UOW). The unit of work is a
fundamental concept of all CICS backout mechanisms.
From the application designer’s point of view, a UOW is a sequence of actions that
needs to be complete before any of the individual actions can be regarded as
complete. To ensure data integrity, a unit of work must be atomic, consistent,
isolated, and durable (see ACID properties in the CICS Glossary).
The CICS recovery manager operates with units of work. If a transaction that
consists of multiple UOWs fails, or the CICS region fails, committed UOWs are not
backed out.
The CICS recovery manager (see page 18) shunts a UOW from the primary system
log stream to the secondary log stream if it cannot be successfully completed or
backed out.
These situations can persist for some time, depending on how long it takes to
resolve the cause of the failure. Because it is undesirable for transaction resources
to be held up for too long, CICS attempts to release as many resources as possible
while a UOW is shunted. This is generally achieved by abending the user task to
which the UOW belongs, resulting in the release of the following:
v Terminals
v User programs
v Working storage
v Any LU6.2 sessions
v Any LU6.1 links
v Any MRO links
Locks
For files opened in RLS mode, VSAM maintains a single central lock structure using
the lock-assist mechanism of the MVS coupling facility. This central lock structure
provides sysplex-wide locking at a record level—control interval (CI) locking is not
used.
For files accessed in non-RLS mode, the locks are CI locks, and their scope is
limited to a single CICS region. However, the CICS enqueue domain also holds a
record lock for each record accessed within the CI.
For both RLS and non-RLS recoverable files, CICS releases all locks on completion
of a unit of work. For recoverable coupling facility data tables, the locks are
released on completion of a unit of work by the CFDT server.
Active and retained states for locks: CICS supports active and retained states
for locks.
When a lock is first acquired, it is an active lock. It remains an active lock until the
unit of work completes successfully, when it is released. If the unit of work fails, or
if CICS or an SMSVSAM server fails, the lock is converted into a retained lock:
v If a unit of work fails, VSAM RLS or the CICS enqueue domain continues to hold
the record locks on recoverable data sets that were owned by the failed unit of
work, but converts them into retained locks. Retaining locks ensures that data
integrity for those records is maintained until the unit of work is completed.
v If a CICS region fails, locks are converted into retained locks to ensure that data
integrity is maintained while CICS is being restarted.
v If an SMSVSAM server fails, locks are converted into retained locks (with the
conversion being carried out by the other servers in the sysplex, or by the first
server to restart if all servers have failed). This means that a UOW that held
active RLS locks will hold retained RLS locks following the failure of an
SMSVSAM server.
Converting active locks into retained locks not only protects data integrity. It also
ensures that new requests for locks owned by the failed unit of work do not wait,
but instead are rejected with the LOCKED response.
Synchronization points
The end of a UOW is indicated to CICS by a synchronization point (usually
abbreviated to syncpoint).
A UOW that does not change a recoverable resource has no meaningful effect for
the CICS recovery mechanisms. Nonrecoverable resources are never backed out.
A unit of work can also be ended by backout, which causes a syncpoint in one of
the following ways:
v Implicitly, when a transaction terminates abnormally and CICS performs dynamic
transaction backout
v Explicitly, by an EXEC CICS SYNCPOINT ROLLBACK command issued by an
application program to back out the changes made by the UOW
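For example, an application program might end a unit of work with either of the
following commands, committing or backing out its changes respectively (shown
here in COBOL command-level form):

```
EXEC CICS SYNCPOINT END-EXEC.

EXEC CICS SYNCPOINT ROLLBACK END-EXEC.
```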
Examples
In Figure 1, task A is a nonconversational (or pseudoconversational) task with one
UOW, and task B is a multiple UOW task (typically a conversational task in which
each UOW accepts new data from the user). The figure shows how UOWs end at
syncpoints. During the task, the application program can issue syncpoints explicitly,
and, at the end, CICS issues a syncpoint.
[Figure 1 shows task A, a single unit of work ending with a syncpoint at end of
task, and task B, a sequence of units of work, each ended by an explicit syncpoint,
with a final syncpoint at end of task. Abbreviations: SOT, start of task; EOT, end of
task; SP, syncpoint; UOW, unit of work.]
[Figure 3 is a diagram showing the CICS recovery manager at the center, writing
to the system log, and the resources it works with: local resource managers (such
as file control, transient data, and temporary storage), communications resource
managers (LU6.1, LU6.2, and MRO), and resource manager interfaces to external
managers such as DB2 and MQM.]
Figure 3. CICS recovery manager and resources it works with
The identity of a UOW and its state are owned by the CICS recovery manager, and
are recorded in storage and on the system log. The system log records are used by
the CICS recovery manager during emergency restart to reconstruct the state of the
UOWs in progress at the time of the earlier system failure.
The execution of a UOW can be distributed over more than one CICS system in a
network of communicating systems.
The CICS recovery manager supports SPI commands that provide information
about UOWs.
Each local resource manager can write UOW-related log records to the local
system log, which the CICS recovery manager may subsequently be required to
re-present to the resource manager during recovery from failure.
To enable the CICS recovery manager to deliver log records to each resource
manager as required, the CICS recovery manager adds additional information when
the log records are created. Therefore, all logging by resource managers to the
system log is performed through the CICS recovery manager.
During syncpoint processing, the CICS recovery manager invokes each local
resource manager that has updated recoverable resources within the UOW. The
local resource managers then perform the required action. This provides the means
of coordinating the actions performed by individual resource managers.
If the commit or backout of a file resource fails (for example, because of an I/O
error or the inability of a resource manager to free a lock), the CICS recovery
manager takes appropriate action with regard to the failed resource:
v If the failure occurs during commit processing, the UOW is marked as
commit-failed and is shunted awaiting resolution of whatever caused the commit
failure.
v If the failure occurs during backout processing, the UOW is marked as
backout-failed, and is shunted awaiting resolution of whatever caused the
backout to fail.
Note that a commit failure can occur during the commit phase of a completed
UOW, or during the commit phase that takes place after a successfully completed
backout. (These two types, or ‘directions’, of commit processing, commit after
normal completion and commit after backout, are sometimes referred to as
‘forward commit’ and ‘backward commit’ respectively.) Note also that a UOW can
be backout-failed with respect to some resources and commit-failed with respect
to others.
These events leave one data set commit-failed, and the other backout-failed. In this
situation, the overall status of the UOW is logged as backout-failed.
During emergency restart following a CICS failure, each UOW and its state is
reconstructed from the system log. If any UOW is in the backout-failed or
commit-failed state, CICS automatically retries the UOW to complete the backout or
commit.
Note: In this context, the non-CICS equivalent of a CICS recovery manager could
be the recovery component of a database manager, such as DBCTL or DB2®, or
any equivalent function where one of a pair of connected systems is not
CICS.
As remote resources are accessed during UOW execution, the CICS recovery
manager keeps track of data describing the status of its end of the conversation
with that RMC. The CICS recovery manager also assumes responsibility for the
coordination of two-phase syncpoint processing for the RMC.
If a session fails at any time during the running of a UOW, it is the RMC’s
responsibility to notify the CICS recovery manager, which takes appropriate action
with regard to the unit of work as a whole. If the failure occurs during syncpoint
processing, the CICS recovery manager may be in doubt and unable to determine
immediately how to complete the UOW. In this case, the CICS recovery manager
causes the UOW to be shunted awaiting UOW resolution, which follows notification
from its RMC of successful resynchronization on the failed session.
Note: A unit of work is said to be atomic when the changes it makes to resources
within the UOW are either all committed or all backed out. See also ACID
properties in the Glossary.
The system log is the only place where CICS records information for use when
backing out transactions, either dynamically or during emergency restart processing.
CICS automatically connects to its system log stream during initialization, unless
you have specified a journal model definition that defines the system log as
DUMMY (in which case CICS can perform only an initial start).
The CICS System Definition Guide tells you how to specify CICS system log
streams, and how you can use journal model definitions to map the CICS journal
names for the primary system log stream (DFHLOG) and the secondary system log
stream (DFHSHUNT) to specific log stream names. If you don’t specify journal
model definitions, CICS uses default log stream names.
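For example, journal model definitions such as the following map the system log
journal names to explicit log stream names using the &USERID and &APPLID
symbols. The model and group names are invented for this sketch:

```
DEFINE JOURNALMODEL(SYSLOG) GROUP(LOGGRP)
       JOURNALNAME(DFHLOG)
       TYPE(MVS)
       STREAMNAME(&USERID..&APPLID..DFHLOG)

DEFINE JOURNALMODEL(SHUNT) GROUP(LOGGRP)
       JOURNALNAME(DFHSHUNT)
       TYPE(MVS)
       STREAMNAME(&USERID..&APPLID..DFHSHUNT)
```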
Your application programs can write user-defined recovery records to the system
log using EXEC CICS WRITE JOURNALNAME commands. Any user-written log
records to support your own recovery processes are made available to global user
exit programs enabled at the XRCINPT exit point.
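For example, a COBOL program might write a user recovery record to the system
log as follows. The data area name and the 'UR' type identifier are invented for
the example:

```
EXEC CICS WRITE JOURNALNAME('DFHLOG')
          JTYPEID('UR')
          FROM(WS-RECOVERY-RECORD)
          FLENGTH(LENGTH OF WS-RECOVERY-RECORD)
          WAIT
END-EXEC.
```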
In the event of an uncontrolled termination of CICS, records on the system log are
used as input to the emergency restart process.
During an initial phase of CICS restart, the recovery manager uses this information,
together with UOW-related log records, to restore the CICS system to its state at
the time of the previous shutdown. This is done on a single backward scan of the
system log.
The following additional actions are taken for files accessed in non-RLS mode that
use backup while open (BWO):
v Tie-up records are recorded on the forward recovery log stream. A tie-up record
associates a CICS file name with a VSAM data set name.
v Recovery points are recorded in the integrated catalog facility (ICF) catalog.
These define the starting time for forward recovery. Data recorded on the forward
recovery log before that time does not need to be used.
See “Defining forward recovery log streams” on page 114 for information about the
use of forward recovery log streams.
Automatic journaling is used for user-defined purposes; for example, for an audit
trail. Automatic journaling is not used for CICS recovery purposes.
Shutdown
CICS can stop executing as a result of:
v A normal (warm) shutdown initiated by a CEMT, or EXEC CICS, PERFORM
SHUT command
v An immediate shutdown initiated by a CEMT, or EXEC CICS, PERFORM SHUT
IMMEDIATE command
v An abnormal shutdown caused by a CICS system module encountering an
irrecoverable error
v An abnormal shutdown initiated by a request from the operating system (arising,
for example, from a program check or system abend)
v A machine check or power failure
The SDTRAN option specified on the PERFORM SHUT command overrides any
SDTRAN option specified as a system initialization parameter.
v The DFHCESD program started by the CICS-supplied transaction, CESD,
attempts to purge and back out long-running tasks using increasingly stronger
methods (see “The shutdown assist transaction” on page 30).
v Tasks that are automatically initiated are run—if they start before the second
quiesce stage.
v Any programs listed in the first part of the shutdown program list table (PLT) are
run sequentially. (The shutdown PLT suffix is specified in the PLTSD system
initialization parameter, which can be overridden by the PLT option of the CEMT
or EXEC CICS PERFORM SHUTDOWN command.)
v A new task started as a result of terminal input is allowed to start only if its
transaction code is listed in the current transaction list table (XLT) or has been
defined as SHUTDOWN(ENABLED) in the transaction resource definition. The
XLT list of transactions restricts the tasks that can be started by terminals and
allows the system to shut down in a controlled manner. The current XLT is the
one specified by the XLT=xx system initialization parameter, which can be
overridden by the XLT option of the CEMT or EXEC CICS PERFORM
SHUTDOWN command.
Certain CICS-supplied transactions are, however, allowed to start whether their
code is listed in the XLT or not. These transactions are CEMT, CESF, CLR1,
CLR2, CLQ2, CLS1, CLS2, CSAC, CSTE, and CSNE.
v Finally, at the end of this stage and before the second stage of shutdown, CICS
unbinds all the VTAM terminals and devices.
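For example, the shutdown PLT suffix, the XLT suffix, and the shutdown-assist
transaction can all be specified on the shutdown command itself. The suffixes
shown here are illustrative:

```
CEMT PERFORM SHUTDOWN PLT(SD) XLT(01) SDTRAN(CESD)
```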
The first quiesce stage is complete when the last of the programs listed in the first
part of the shutdown PLT has executed and all user tasks are complete. If the
CICS-supplied shutdown transaction CESD is used, this stage does not wait
indefinitely for all user tasks to complete.
The second quiesce stage ends when the last of the programs listed in the PLT has
completed executing.
Warm keypoints
The CICS-provided warm keypoint program (DFHWKP) writes a warm keypoint to
the global catalog, for terminal control and profile resources only, during the third
quiesce stage of shutdown processing when all system activity is quiesced. The
remainder of the warm keypoint information, for all other resources, is written to the
CICS system log stream, under the control of the CICS recovery manager. This
system log warm keypoint is written by the activity keypoint program as a special
form of activity keypoint that contains information relating to shutdown.
The warm keypoints contain information needed to restore the CICS environment
during a subsequent warm or emergency restart. Thus CICS needs both the global
catalog and the system log to perform a restart. If you run CICS with a system log
that is defined by a journal model specifying TYPE(DUMMY), you cannot restart
CICS with START=AUTO following a normal shutdown, or with START=COLD.
See “Chapter 4. CICS cold start” on page 43 for information about a cold start if
CICS has issued message DFHRM0203 at the previous shutdown.
During an immediate shutdown, the call to the log manager domain is bypassed
and journal records are not flushed. This also applies to an immediate shutdown
that is initiated by the shutdown-assist transaction because a normal shutdown has
stalled. Therefore, any user journal records in a log manager buffer at the time of
an immediate shutdown are lost. This does not affect CICS system data integrity.
The system log and forward recovery logs are always synchronized with regard to
I/O and unit of work activity. If user journal data is important, you should take
appropriate steps to ensure that journal buffers are flushed at shutdown.
You should resort to using an immediate shutdown only if you have a special
reason for doing so. For instance, you might need to stop and restart CICS
during a particularly busy period, when the slightly faster immediate shutdown
may be of benefit. Also, you can use VTAM persistent sessions support with
an immediate shutdown.
Uncontrolled termination
An uncontrolled shutdown of CICS can be caused by:
v Power failure
v Machine check
v Operating system failure
In each case, CICS cannot perform any shutdown processing. In particular, CICS
does not write a warm keypoint or a warm-start-possible indicator to the global
catalog.
The operation of CESD, for both normal and immediate shutdowns, takes place
over a number of stages. CESD controls these stages by sampling the number of
tasks present in the system, and proceeds to the next stage if the number of
in-flight tasks is not reducing quickly enough.
The operation of CESD is quicker for an immediate shutdown, with the number of
tasks in the system being sampled only four times instead of eight.
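The staged escalation that CESD performs can be pictured with a small sketch. The following Python is illustrative only, not CICS code: the stage names, the sampling loop, and the escalation rule are assumptions made for this example; only the idea of repeatedly sampling the in-flight task count and moving to a stronger method when the count stops falling mirrors the behavior described above.

```python
# Conceptual sketch (not actual CICS code) of a shutdown-assist
# transaction that escalates through increasingly strong stages.
# Stage names and thresholds are illustrative assumptions.

def run_shutdown_stages(sample_task_count, stages, samples_per_stage=8):
    """Escalate to the next (stronger) stage when the in-flight task
    count stops falling between samples. An immediate shutdown would
    use fewer samples per stage (four instead of eight)."""
    for stage in stages:
        previous = sample_task_count()
        for _ in range(samples_per_stage):
            current = sample_task_count()
            if current == 0:
                return "shutdown complete"
            if current < previous:      # still draining: keep waiting
                previous = current
            else:                       # stalled: move to a stronger stage
                break
    return "stages exhausted"
```

For an immediate shutdown, a caller would pass `samples_per_stage=4`, matching the faster sampling described above.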
It is recommended that you always use the CESD shutdown-assist transaction
when shutting down your CICS regions. You can use the DFHCESD program “as is”, or
use the supplied source code as the basis for your own customized version (CICS
supplies versions in assembler, VS COBOL II, and PL/I). For more information
about the operation of the CICS-supplied shutdown assist program, see the CICS
Operations and Utilities Guide.
The CICS System Definition Guide tells you how to create and initialize these CICS
catalog data sets.
While CICS is running, the catalogs (and the system log) receive information
passed from one execution of CICS, through a shutdown, to the next execution of
CICS. This information is used for warm and emergency restarts, and to a lesser
extent for cold starts also. If the global catalog fails (for reasons other than filling
the available space), the recovery manager control record is lost. Without this, it is
impossible to perform a warm, emergency, or cold start, and the only possibility is
then an initial start. For example, if the failure is due to an I/O error, you
cannot restart CICS in any of these ways; an initial start is required.
Usually, if the global catalog fills, CICS abnormally terminates, in which case you
could define more space and attempt an emergency restart.
Consider putting the catalog data sets on the most reliable DASD available—RAID
or dual-copy devices—to ensure maximum protection of the data. Taking ordinary
copies is not recommended because of the risk of getting out of step with the
system log.
From a restart point of view, the system log and the CICS catalog (both data
sets) form one logical set of data, and all of them are required for a restart.
Global catalog
The global catalog contains information needed at restart, and CICS uses it to store
the following information:
v The names of the system log streams.
v Copies of tables of installed resource definitions, and related information, for
various resource types.
Most resource managers update the catalog whenever they make a change to
their table entries. Terminal and profile resource definitions are exceptions (see
the next list item about the catalog warm keypoint). Because of the typical
volume of changes, terminal control does not update the catalog, except when:
– Running a VTAM query against a terminal
– A generic connection has bound to a remote system
– Installing a terminal
– Deleting a terminal.
v A partial warm keypoint at normal shutdown. This keypoint contains an image
copy of the TCT and profile resource definitions at shutdown for use during a
warm restart.
Note: The image copy of the TCT includes all the permanent devices installed
by explicit resource definitions. Except for some autoinstalled APPC
connections, it does not include autoinstalled devices. Autoinstalled
terminal resources are cataloged initially, in case they need to be
recovered during an emergency restart, but only if the AIRDELAY system
initialization parameter specifies a nonzero value. Therefore, apart from
the APPC exceptions mentioned above, autoinstalled devices are
excluded from the warm keypoint, and are thus not recovered on a warm
start.
v Statistics options.
v Monitoring options.
All this information is essential for a successful restart following any kind of
shutdown.
Local catalog
The CICS local catalog data set represents just one part of the CICS catalog, which
is implemented as two physical data sets. The two data sets are logically one set of
cataloged data managed by the CICS catalog domain. Although minor in terms of
the volume of information recorded on it, the local catalog is of equal importance
with the global catalog, and the data should be equally protected when restarts are
performed.
If you ever need to redefine and reinitialize the CICS local catalog, you should also
reinitialize the global catalog. After reinitializing both catalog data sets, you must
perform an initial start.
Warm restart
If you shut down a CICS region normally, CICS restarts with a warm restart if you
specify START=AUTO. For a warm start to succeed, CICS needs the information
stored in the CICS catalogs at the previous shutdown, and the information stored in
the system log.
For more information about the warm restart process, see “Chapter 5. CICS warm
restart” on page 51.
Emergency restart
If a CICS region fails, CICS restarts with an emergency restart if you specify
START=AUTO. An emergency restart is similar to a warm start but with additional
recovery processing—for example, to back out any transactions that were in-flight at
the time of failure, and thus free any locks protecting resources.
If the failed CICS region was running with VSAM record-level sharing, SMSVSAM
converts into retained locks any active exclusive locks held by the failed system,
pending the CICS restart. This means that the records are protected from being
updated by any other CICS region in the sysplex. Retained locks also ensure that
other regions trying to access the protected records do not wait on the locks
until the failed region restarts; instead, their requests are rejected immediately.
See the CICS Application Programming Guide for information about active and
retained locks.
For non-RLS data sets (including BDAM data sets), any locks (ENQUEUES) that
were held before the CICS failure are re-acquired.
For more information about the emergency restart process, see “Chapter 6. CICS
emergency restart” on page 57.
Cold start
On a cold start, CICS reconstructs the state of the region from the previous run for
remote resources only. For all resources, the region is built from resource
definitions specified on the GRPLIST system initialization parameter and those
resources defined in control tables.
The following is a summary of how CICS uses information stored in the global
catalog and the system log on a cold start:
v CICS preserves, in both the global catalog and the system log, all the information
relating to distributed units of work for partners linked by:
– APPC
– MRO connections to regions running under CICS Transaction Server for
OS/390 Release 1
– The resource manager interface (RMI) (for example, to DB2 and DBCTL).
v CICS does not preserve any information in the global catalog or the system log
that relates to local units of work.
Generally, to perform a cold start you specify START=COLD, but CICS can also
force a cold start in some circumstances when START=AUTO is specified. See the
table in the CICS System Definition Guide for details of the effect of the START
parameter in conjunction with various states of the global catalog and the system
log.
See the CICS Operations and Utilities Guide for information about the DFHRMUTL
utility program.
When an SMSVSAM server fails, any locks for which it was responsible are
converted to retained locks by another SMSVSAM server within the sysplex, thus
preventing access to the records until the situation has been recovered. CICS
detects that the SMSVSAM server has failed the next time it tries to perform an
RLS-mode operation. RLS-mode open requests, and RLS-mode record access requests
issued by new units of work, receive error responses from VSAM while the server
is unavailable. The
SMSVSAM server normally restarts itself without any manual intervention. After the
SMSVSAM server has restarted, it uses the MVS event notification facility (ENF) to
notify all the CICS regions within its MVS image that the SMSVSAM server is
available again.
CICS performs a dynamic equivalent of emergency restart for the RLS component,
and drives backout of the deferred work.
CICS support of persistent sessions includes the support of all LU-LU sessions,
except LU0 pipeline and LU6.1 sessions. CICS determines for how long the
sessions should be retained from the time on the PSDINT system initialization
parameter. This is a user-defined time interval. If a failed CICS is restarted within
this time, it can use the retained sessions immediately—there is no need for
network flows to re-bind them.
You can change the interval using the CEMT SET VTAM command, or the EXEC
CICS SET VTAM command, but the changed interval is not stored in the CICS
global catalog, and therefore is not restored in an emergency restart.
During emergency restart, CICS restores those sessions pending recovery from the
CICS global catalog and the CICS system log to an “in session” state. This
happens when CICS opens its VTAM ACB.
Before specific terminal types and levels of service are discussed, note that many
factors can affect the performance of a terminal at takeover, including:
v The type of terminal
v The total number of terminals connected
v What the end user is doing at the time of takeover
v The type of failure of the CICS system
v How the terminal is defined by the system programmer
The end user of a terminal sees different symptoms of a CICS failure following a
restart, depending on whether CICS is initialized with VTAM persistent sessions
support or XRF support:
v If CICS is running without VTAM persistent sessions or XRF, and fails, the
terminal user sees the VTAM logon panel followed by the “good morning”
message (if AUTOCONNECT(YES) is specified for the TYPETERM resource
definition).
v If CICS does have persistent session support and fails, and the terminal user
enters data while CICS is recovering, it appears as if CICS is “hanging”; the
screen on display at the time of the failure remains until persistent session
recovery is complete. You can use the RECOVOPTION and RECOVNOTIFY
keywords to customize CICS so that either a successful emergency restart can
be transparent to terminal users, or terminal users can be notified of the CICS
failure, allowing them to take the appropriate recovery actions.
If APPC sessions are active at the time CICS fails, persistent sessions recovery
appears to APPC partners as CICS “hanging”. VTAM saves requests issued by
the APPC partner, and passes them to CICS when the persistent recovery is
complete. After a successful emergency restart, recovery of terminal sessions is
determined by the options defined in PSRECOVERY of the CONNECTION
definition and RECOVOPTION of the session definition. If the appropriate
recovery options have been selected (see the CICS Resource Definition Guide),
and the APPC sessions are in the correct state, CICS performs an ISSUE
ABEND (see the CICS Distributed Transaction Programming Guide) to inform the
partner that the current conversation has been abnormally terminated.
APPC synclevel 2 sessions are not restored. They are unbound when the ACB is
opened.
When the VTAM failure occurs and the TPEND failure exit is driven, the
autoinstalled terminals that are normally deleted at this point are retained by CICS.
If the session is not restored and the terminal is not reused within the AIRDELAY
interval, CICS deletes the TCTTE when the AIRDELAY interval expires after the
ACB is re-opened successfully.
Unbinding sessions
CICS does not always reestablish sessions held by VTAM in a recovery pending
state. CICS (or VTAM) unbinds recovery pending sessions in the following
situations:
v If CICS does not restart within the specified persistent session delay interval
v If you perform a COLD start after a CICS failure
v If CICS restarts with XRF=YES (when the failed CICS was running with
XRF=NO)
v If CICS cannot find a terminal control table terminal entry (TCTTE) for a session
(for example, because the terminal was autoinstalled with AIRDELAY=0
specified)
v If a terminal or session is defined with the recovery option (RECOVOPT) set to
UNCONDREL or NONE
v If CICS determines that it cannot recover the session without unbinding and
re-binding it
In all these situations, the sessions are unbound, and the result is as if CICS has
restarted following a failure without VTAM persistent session support.
There are some other situations where APPC sessions are unbound. For example,
if a bind was in progress at the time of the failure, sessions are unbound.
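The conditions listed above under which a recovery-pending session is unbound rather than restored amount to a simple predicate. The following sketch is illustrative Python, not a CICS interface; every field name is invented for the example, and the real checks are internal to CICS and VTAM.

```python
# Illustrative predicate for the unbind conditions described above.
# All field names are hypothetical; only the list of conditions
# follows the text.

def should_unbind(sess):
    """True if CICS (or VTAM) would unbind this recovery-pending
    session instead of restoring it."""
    return (
        sess["delay_expired"]            # persistent session delay passed
        or sess["cold_start"]            # COLD start after the failure
        or sess["xrf_mismatch"]          # restart with a different XRF setting
        or sess["tctte"] is None         # e.g. autoinstalled with AIRDELAY=0
        or sess["recovopt"] in ("UNCONDREL", "NONE")
        or sess["needs_rebind"]          # cannot recover without re-binding
        or sess["bind_in_progress"]      # APPC bind in flight at failure
    )
```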
For more information about persistent session support, see the CICS System
Definition Guide.
Note: If CICS detects that there were shunted units of work at the previous
shutdown (that is, it had issued message DFHRM0203), CICS issues a
warning message, DFHRM0154, to let you know that local recovery data
has been lost, and initialization continues. The only way to avoid this loss
of data from the system log is not to perform a cold start after CICS has
issued DFHRM0203.
Note: The system log information preserved does not include before-images of
any file control data updated by a distributed unit of work. Any changes
made to local file resources are not backed out and, because all their locks are
freed, they are effectively committed. To preserve data integrity, perform a warm
or emergency restart (using START=AUTO).
v CICS retrieves its logname token from the recovery manager control record for
use in the “exchange lognames” process during reconnection to partner systems.
Thus, by using the logname token from the previous execution, CICS ensures a
warm start of those connections for which there is outstanding resynchronization
work.
To perform these actions on a cold start, CICS needs the contents of the catalog
data sets and the system log from a previous run.
See the CICS System Definition Guide for details of the actions that CICS takes for
START=COLD in conjunction with various states of the global catalog and the
system log.
Files
All previous file control state data, including file resource definitions, is lost.
If RLS support is specified, CICS connects to the SMSVSAM, and when connected
requests the server to:
v Release all RLS retained locks
v Clear any “lost locks” status
v Clear any data sets in “non-RLS update permitted” status
Attention: If you use the SHCDS REMOVESUBSYS command for a CICS region
that uses RLS access mode, ensure that you perform a cold start the
next time you start the CICS region. The SHCDS REMOVESUBSYS
command causes SMSVSAM to release all locks held for the region that
is the subject of the command, allowing other CICS regions and batch
jobs to update records released in this way. If you restart a CICS region
with either a warm or emergency restart, after specifying it on a
REMOVESUBSYS command, you risk losing data integrity. See the
DFSMS/MVS Access Method Services for ICF, SC26-4906 for more
information about the REMOVESUBSYS parameter.
Temporary storage
All temporary storage queues from a previous run are lost, including
CICS-generated queues (for example, for data passed on START requests).
If the auxiliary temporary storage data set was used on a previous run, CICS opens
the data set for update. If CICS finds that the data set is newly initialized, CICS
closes it, reopens it in output mode, and formats all the control intervals (CIs) in the
primary extent. When formatting is complete, CICS closes the data set and reopens
it in update mode. The time taken for this formatting operation depends on the size
of the primary extent, but it can add significantly to the time taken to perform a cold
start.
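The open-format-reopen sequence described above can be summarized in a short sketch. This is illustrative Python only; the data-set object and its methods are hypothetical, and only the order of operations (open for update, detect a newly initialized data set, format the primary extent in output mode, then reopen for update) follows the text.

```python
# Conceptual sketch of the cold-start formatting sequence for the
# auxiliary temporary storage data set. The object and its methods
# are assumptions made for this example.

def prepare_aux_ts(ds):
    """Open the auxiliary TS data set for update, formatting it first
    if CICS finds it newly initialized. Returns the final open mode."""
    ds.open("update")
    if ds.newly_initialized:
        ds.close()
        ds.open("output")
        for ci in range(ds.primary_extent_cis):
            ds.format_ci(ci)     # cost grows with the primary extent size
        ds.close()
        ds.open("update")
    return ds.mode
```

The loop over control intervals is why, as noted above, a large primary extent can add significantly to cold start time.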
Shared TS pools are managed by a temporary storage server, and stored in the
coupling facility. Stopping and restarting a TS data sharing server does not affect
the contents of the TS pool, unless you clear the coupling facility structure in which
the pool resides.
If you want to cause a server to reinitialize its pool, use the MVS SETXCF FORCE
command to clean up the structure:
SETXCF FORCE,STRUCTURE,STRNAME(DFHXQLS_poolname)
Transient data
All transient data queues from a previous run are lost.
Finally, all the newly-installed TD queue definitions are written to the global catalog.
All TD queues are installed as enabled.
Note: If, during the period when CICS is installing the TD queues, an attempt is
made to write a record to a CICS-defined queue that has not yet been
installed (for example, CSSL), CICS writes the record to the CICS-defined
queue CXRF.
Transactions
All transaction and transaction class resource definitions are installed from the CSD,
and are cataloged in the global catalog.
Note: The CICS log manager retrieves the system log stream name from the
global catalog, ensuring that, even on a cold start, CICS uses the same log
stream as on a previous run.
Programs
All programs, mapsets, and partitionsets are installed from the CSD, and are
cataloged in the global catalog.
Any data associated with START requests is also lost, even if it was stored in a
recoverable TS queue.
The initial recording status for CICS statistics is determined by the statistics system
initialization parameter (STATRCD). If STATRCD=YES is specified, interval statistics
are recorded at the default interval of every three hours.
Committing and cataloging resources installed from the CSD: CICS has two
ways of installing and committing terminal resource definitions:
v Some VTAM terminal control resource definitions must be installed in groups and
are committed in “installable sets” (for example, connections and sessions).
v Other resource definitions can be installed in groups or individually, and are
committed at the individual resource level.
If the install of a resource (or of an installable set, such as a CONNECTION and its
associated SESSIONS definitions) is successful, CICS writes the resource
definitions to the global catalog during commit processing.
Single resource install: All except the resources that are installed in installable sets
are committed individually. CICS writes each single resource definition to the global
catalog as the resource is installed. If a definition fails, it is not written to the catalog
(and therefore is not recovered at a restart).
Installable set install: The following VTAM terminal control resources are
committed in installable sets:
v Connections and their associated sessions
v Pipeline terminals—all the terminal definitions sharing the same POOL name
If one definition in an installable set fails, the set fails. However, each installable set
is treated independently within its CSD group. If an installable set fails as CICS
installs the CSD group, it is removed from the set of successful installs. Logical sets
that are not successfully installed do not have catalog records written and are not
recovered.
This is effective only if both the system log stream and the global catalog from the
previous run of CICS are available at restart.
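The commit rules above, with installable sets succeeding or failing as a unit while single resources commit individually, can be sketched as follows. This is illustrative Python, not a CICS interface; the data shapes and the `install` callback are assumptions made for the example.

```python
# Illustrative sketch of committing resources from a CSD group:
# installable sets (e.g. a connection plus its sessions) are written
# to the global catalog only if every member installs; other
# resources are cataloged individually as they install.

def commit_group(singles, installable_sets, install):
    """Return the resource names whose definitions would be written
    to the global catalog (and hence recovered at restart)."""
    cataloged = []
    for name in singles:
        if install(name):                  # individual commit
            cataloged.append(name)
    for members in installable_sets:
        if all(install(m) for m in members):
            cataloged.extend(members)      # whole set committed together
        # a failing set is dropped, but other sets in the same CSD
        # group are still treated independently
    return cataloged
```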
Dump table
The dump table that you use for controlling system and transaction dumps is not
preserved in a cold start. If you have built up over a period of time a number of
entries in a dump table, which is recorded in the CICS catalog, you have to
recreate these entries following a cold start.
Note: An initial start can also result from a START=COLD parameter if the global
catalog is newly initialized and does not contain a recovery manager control
record. If the recovery manager finds that there is no control record on the
catalog, it issues a message to the console prompting the operator to reply
with a GO or CANCEL response. If the response is GO, CICS performs an
initial start as if START=INITIAL was specified.
For more information about the effect of the state of the global catalog and the
system log on the type of start CICS performs, see the CICS System Definition
Guide.
See “Chapter 6. CICS emergency restart” on page 57 for the restart processing
performed if the type-of-restart indicates “emergency restart needed”.
Note: CICS needs both the catalogs and the system log from the previous run of
CICS to perform a warm restart—the catalogs alone are not sufficient. If you
run CICS with the system log defined as TYPE(DUMMY), CICS appears to
shut down normally, but only the global catalog portion of the warm keypoint
is actually written. Therefore, without the warm keypoint information from the
system log, CICS cannot perform a warm restart. CICS startup fails unless
you specify an initial start with START=INITIAL.
Recovering their own state is the responsibility of the individual resource managers
(such as file control) and the CICS domains. This section discusses the process of
rebuilding their state from the catalogs and system log, in terms of the following
resources:
v Files
v Temporary storage queues
v Transient data queues
v Transactions
v Programs, including mapsets and partitionsets
v Start requests
v Monitoring and statistics
v Journals and journal models
v Terminal control resources
v Distributed transaction resources
Files
File control information from the previous run is recovered from information
recorded in the CICS catalog only.
File resource definitions for VSAM and BDAM files, data tables, and LSR pools are
installed from the global catalog, including any definitions that were added
dynamically during the previous run. The information recovered and reinstalled
Note: An exception to the above rule occurs when there are updates to a file to be
backed out during restarts, in which case the file is opened regardless of the
OPENTIME option. At a warm start, there cannot be any in-flight units of
work to back out, so this backout can only occur when retrying backout-failed
units of work against the file.
CICS closes all files at shutdown, and, as a general rule, you should expect your
files to be re-installed on restart as either:
v OPEN and ENABLED if the OPENTIME option is STARTUP
v CLOSED and ENABLED if the OPENTIME option is FIRSTREF.
The FCT and the CSDxxxx system initialization parameters are ignored.
File control uses the system log to reconstruct the internal structures, which it uses
for recovery.
Temporary storage
Auxiliary temporary storage queue information (for both recoverable and
non-recoverable queues) is retrieved from the warm keypoint. Note that TS READ
pointers are recovered on a warm restart (which is not the case on an emergency
restart).
Transient data
Transient data initialization on a warm restart depends on the TDINTRA system
initialization parameter, which specifies whether or not TD is to initialize with empty
intrapartition queues. The different options are discussed as follows:
CICS opens any extrapartition TD queues that need to be opened—that is, any that
specify OPEN=INITIAL.
Note: If, during the period when CICS is installing the TD queues, an attempt is
made to write a record to a CICS-defined queue that has not yet been
installed (for example, CSSL), CICS writes the record to the CICS-defined
queue CXRF.
The recovery manager returns log records and keypoint data associated with TD
queues. CICS applies this data to the installed queue definitions to return the TD
queues to the state they were in at normal shutdown. Logically recoverable,
physically recoverable, and non-recoverable intrapartition TD queues are recovered
from the warm keypoint data.
Trigger levels (for TERMINAL and SYSTEM only): After the queues have been
recovered, CICS checks the trigger level status of each intrapartition TD queue that
is defined with FACILITY(TERMINAL|SYSTEM) to determine whether a start
request needs to be rescheduled for the trigger transaction. If a trigger transaction
failed to complete during the previous run (that is, did not reach the empty queue
(QZERO) condition) or the number of items on the queue is greater than the trigger
level, CICS schedules a start request for the trigger transaction.
This does not apply to trigger transactions defined for queues that are associated
with files (FACILITY(FILE)).
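The rescheduling check described above reduces to a small decision. The following sketch is illustrative Python with invented field names; only the conditions (queue facility, whether QZERO was reached on the previous run, and item count versus trigger level) follow the text.

```python
# Illustrative sketch of the trigger-transaction rescheduling check
# performed at warm restart. Field names are assumptions.

def needs_trigger_restart(queue):
    """True if CICS would schedule a start request for the queue's
    trigger transaction after recovering the queue."""
    if queue["facility"] not in ("TERMINAL", "SYSTEM"):
        return False                 # FACILITY(FILE) queues are skipped
    if not queue["trigger_reached_qzero"]:
        return True                  # previous run never drained the queue
    return queue["items"] > queue["trigger_level"]
```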
TDINTRA=EMPTY
The transient data queues are 'cold started', but the resource definitions are 'warm
started':
v All intrapartition TD queues are initialized empty.
v The queue resource definitions are installed from the global catalog, but they are
not updated by any log records or keypoint data. They are always installed
enabled.
You cannot specify a general cold start of transient data while the rest of CICS
performs a warm restart, as you can for temporary storage.
Transactions
All transaction and transaction class resource definitions are installed from the CSD,
and updated with information from the warm keypoint in the system log. The
resource definitions installed from the catalog include any that were added
dynamically during the previous run.
Programs
The recovery of program, mapset, and partitionset resource definitions depends on
whether you are using program autoinstall and, if you are, whether you have
requested autoinstall cataloging (specified by the system initialization parameter
PGAICTLG=ALL|MODIFY).
The resource definitions installed from the catalog include any that were added
dynamically during the previous run.
Start requests
In general, start requests are recovered together with any associated start data.
Recovery can, however, be suppressed by specifying explicit cold start system
initialization parameters. The rules governing the operation of these explicit
cold requests are:
v ICP=COLD suppresses all starts that do not have both data and a terminal
associated with them. It also suppresses any starts that had not expired at
shutdown. This includes BMS starts.
v TS=COLD (or TS main only) suppresses all starts that had data associated with
them.
v BMS=COLD suppresses all starts relating to BMS paging.
Start requests that have not been suppressed for any of the above reasons either
continue to wait if their start time or interval has not yet expired, or they are
processed immediately. For start requests with terminals, consider the effects of the
CICS restart on the set of installed terminal definitions. For example, if the terminal
specified on a start request is no longer installed after the CICS restart, CICS
invokes an XALTENF global user exit program (if enabled), but not the XICTENF
exit.
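The suppression rules above can be expressed as a single predicate. This is an illustrative Python sketch, not CICS code: the request fields are invented for the example, and only the three rules (ICP=COLD, TS=COLD, BMS=COLD) follow the text.

```python
# Illustrative sketch of the warm-restart start-request suppression
# rules. Field names are assumptions made for this example.

def suppress_start(req, icp_cold, ts_cold, bms_cold):
    """True if a recovered start request would be discarded."""
    if icp_cold:
        # suppresses starts lacking both data and a terminal, and any
        # start (including BMS starts) not yet expired at shutdown
        if not (req["has_data"] and req["has_terminal"]):
            return True
        if not req["expired_at_shutdown"]:
            return True
    if ts_cold and req["has_data"]:
        return True
    if bms_cold and req["is_bms_paging"]:
        return True
    return False
```

A request that survives this check either waits for its start time to expire or is processed immediately, as described above.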
Only the global catalog is referenced for terminals defined in the CSD.
To add a terminal after initialization, use the CEDA INSTALL or EXEC CICS
CREATE command, or the autoinstall facility. To delete a terminal definition, use the
DISCARD command or, if autoinstalled, allow it to be deleted by the autoinstall
facility after the interval specified by the AILDELAY system initialization parameter.
If you specify START=AUTO, CICS determines what type of start to perform using
information retrieved from the recovery manager’s control record in the global
catalog. If the type-of-restart indicator in the control record indicates “emergency
restart needed”, CICS performs an emergency restart.
See “Chapter 5. CICS warm restart” on page 51 for the restart processing
performed if the type-of-restart indicates “warm start possible”.
Overview
The additional processing performed for an emergency restart is mainly related to
the recovery of in-flight transactions. There are two aspects to the recovery
operation:
1. Recovering information from the system log
2. Driving backout processing for in-flight units of work
For non-RLS data sets and other recoverable resources, any locks (ENQUEUES)
that were held before the CICS failure are re-acquired during this initial phase.
For data sets accessed in RLS mode, the locks that were held by SMSVSAM for
in-flight tasks are converted into retained locks at the point of abnormal termination.
Any non-RLS locks associated with in-flight (and other failed) transactions are
acquired as active locks for the tasks attached to perform the backouts. This means
that, if any new transaction attempts to access non-RLS data that is locked by a
backout task, it waits normally rather than receiving the LOCKED condition.
Retained RLS locks are held by SMSVSAM, and these do not change while
backout is being performed. Any new transactions that attempt to access RLS
resources locked by a backout task receive a LOCKED condition.
For both RLS and non-RLS resources, the backout of in-flight transactions after an
emergency restart is indistinguishable from dynamic transaction backout.
The recovery manager drives these backout and commit processes because the
condition that caused them to fail may be resolved by the time CICS restarts. If the
condition that caused a failure has not been resolved, the UOW remains in backout-
or commit-failed state. See “Backout-failed recovery” on page 75 and “Commit-failed
recovery” on page 79 for more information.
Files
All file control state data and resource definitions are recovered in the same way as
on a warm start.
CICS uses the information it receives from SMSVSAM to eliminate orphan locks.
Orphan locks can occur if a CICS region acquires an RLS lock, but then fails before
logging it. Records associated with orphan locks that have not been logged cannot
have been updated, and CICS can safely release them.
Note: Locks that fail to be released during UOW commit processing cause the
UOW to become a commit-failed UOW. CICS automatically retries commit
processing for these UOWs, but if the locks are still not released before the
CICS region terminates, these also are treated as orphan locks during the
next restart.
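The orphan-lock reasoning above can be sketched briefly: a lock acquired by the failed region but never logged cannot protect a committed-but-unrecovered update, so it is safe to release. The following Python is illustrative only; the data shapes are assumptions, not the SMSVSAM interface.

```python
# Conceptual sketch of orphan-lock elimination during restart.
# An RLS lock with no corresponding log record cannot cover a real
# update, so CICS can safely ask SMSVSAM to release it.

def release_orphan_locks(rls_locks, logged_records):
    """Return the locks that can be freed because no log record
    exists for the protected record."""
    logged = set(logged_records)
    return [lock for lock in rls_locks if lock["record"] not in logged]
```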
Temporary storage
Auxiliary temporary storage queue information for recoverable queues only is
retrieved from the warm keypoint. The TS READ pointers are not recovered and are
set to zero.
If a nonzero TSAGE parameter is specified in the temporary storage table (TST), all
queues that have not been referenced for this interval are deleted.
Transactions
As for warm restart.
Programs
As for warm restart.
Start requests
In general, start requests are recovered if, and only if:
v They are associated with recoverable data
v They are protected (by the PROTECT option on the START command) and the
issuing UOW is in-doubt
Recovery can, however, be further limited by the use of the specific COLD option
on the system initialization parameter for TS, ICP, or BMS. If you suppress start
requests by means of the COLD option on the appropriate system initialization
parameter, any data associated with the suppressed starts is discarded. The rules
are:
v ICP=COLD suppresses all starts including BMS starts.
v TS=COLD (or TS main only) suppresses all starts that had data associated with
them.
v BMS=COLD suppresses all starts relating to BMS paging.
Start requests that have not been suppressed for any of the above reasons either
continue to wait if their start time or interval has not yet expired, or are processed
immediately.
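The suppression rules above can be expressed as a single predicate (an illustrative Python sketch, not a CICS interface; each start request is reduced to two invented flags):

```python
# Sketch of the COLD suppression rules listed above. Each start request
# carries flags for whether it has associated data and whether it relates
# to BMS paging. Illustrative helper, not a CICS API.

def start_suppressed(icp_cold, ts_cold, bms_cold, has_data, is_bms):
    if icp_cold:                  # ICP=COLD suppresses all starts
        return True
    if ts_cold and has_data:      # TS=COLD suppresses starts with data
        return True
    if bms_cold and is_bms:       # BMS=COLD suppresses BMS paging starts
        return True
    return False

assert start_suppressed(icp_cold=True, ts_cold=False, bms_cold=False,
                        has_data=False, is_bms=False)
assert not start_suppressed(icp_cold=False, ts_cold=True, bms_cold=False,
                            has_data=False, is_bms=False)
```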
For start requests with terminals, consider the effects of the CICS restart on the set
of installed terminal definitions. For example, if the terminal specified on a start
request is no longer installed after the CICS restart, CICS invokes an XALTENF
global user exit program (if enabled), but not the XICTENF exit.
The state of the catalog may have been modified for some of the above resources
by their removal with a CEMT, or an EXEC CICS DISCARD, command.
CICS uses records from the system log, written when any terminal resources were
being updated, to perform any necessary recovery on the cataloged data. This may
be needed if terminal resources are installed or deleted while CICS is running, and
CICS fails before the operation is completed.
In this way, CICS ensures that the terminal entries recovered at emergency restart
consist of complete logical sets of resources (for connections, sessions, and
pipelines), and complete terminal resources and autoinstall models, and that the
catalog reflects the real state of the system accurately.
The main benefits of the MVS automatic restart manager are that it:
v Enables CICS to preserve data integrity automatically in the event of any system
failure.
v Eliminates the need for operator-initiated restarts, or restarts by other automatic
packages, thereby:
– Improving emergency restart times
– Reducing errors
– Reducing complexity.
v Provides cross-system restart capability. It ensures that the workload is restarted
on MVS images with spare capacity, by working with the MVS workload
manager.
v Allows all elements within a restart group to be restarted in parallel. Restart
levels (using the ARM WAITPRED protocol) ensure the correct starting sequence
of dependent or related subsystems.
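The level-based ordering can be sketched as follows (hypothetical Python; the element names and levels are invented, placing CICS regions at level 2 behind DB2 and DBCTL as described later in this chapter):

```python
# Sketch of the restart ordering guarantee described above: elements
# restart in waves by level (WAITPRED-style), with everything in one level
# eligible to restart in parallel before the next level begins.
from itertools import groupby

def restart_waves(elements):
    """elements: list of (name, level). Returns a list of waves (lists)."""
    ordered = sorted(elements, key=lambda e: e[1])
    return [[name for name, _ in group]
            for _, group in groupby(ordered, key=lambda e: e[1])]

workload = [("CICSAOR1", 2), ("DB2", 1), ("CICSTOR1", 2), ("DBCTL", 1)]
waves = restart_waves(workload)
assert waves == [["DB2", "DBCTL"], ["CICSAOR1", "CICSTOR1"]]
```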
Overview
MVS automatic restart management is a sysplex-wide integrated automatic restart
mechanism that:
v Restarts an MVS subsystem in place if it abends (or if a monitor program notifies
ARM of a stall condition)
v Restarts all the elements of a workload (for example, CICS TORs, AORs, FORs,
DB2, and so on) on another MVS image after an MVS failure
v Restarts a failed MVS image
You cannot use MVS automatic restart for CICS regions running with XRF—it is
available only to non-XRF CICS regions. If you specify XRF=YES, CICS
de-registers from ARM and continues initialization with XRF support.
MVS automatic restart management is available only to those MVS subsystems that
register with ARM. CICS regions register with ARM automatically as part of CICS
system initialization. If a CICS region fails before it has registered for the first time
with ARM, it will not be restarted. After a CICS region has registered, it is restarted
by ARM according to a predefined policy for the workload.
CICS always registers with ARM because CICS needs to know whether it is being
restarted by ARM and, if it is, whether or not the restart is with persistent JCL. (The
ARM registration response to CICS indicates whether or not the same JCL that
started the failed region is being used for the ARM restart.) You indicate whether
MVS is to use the same JCL or command text that previously started CICS by
specifying PERSIST as the restart_type operand on the RESTART_METHOD
parameter in your automatic restart management policy.
When it registers with ARM, CICS passes the value ‘SYSCICS’ as the element
type, and the string ‘SYSCICS_aaaaaaaa’ as the element name, where aaaaaaaa is
the CICS applid. Using the applid in the element name means that only one CICS
region can successfully register with ARM for a given applid. If two CICS regions try
to register with the same applid, the second register is rejected by ARM.
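The applid-based element naming can be modeled like this (an illustrative Python sketch; `ArmRegistry` and the applid value are invented, not the real ARM interface):

```python
# Sketch of the registration rule above: the element name embeds the
# applid, so a second registration for the same applid is rejected.
# Hypothetical model, not the real ARM interface.

class ArmRegistry:
    def __init__(self):
        self.elements = set()

    def register(self, applid):
        name = f"SYSCICS_{applid}"      # element name: SYSCICS_aaaaaaaa
        if name in self.elements:
            return "rejected"
        self.elements.add(name)
        return "registered"

arm = ArmRegistry()
assert arm.register("CICSHA11") == "registered"   # applid is invented
assert arm.register("CICSHA11") == "rejected"     # second region, same applid
```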
Some error situations that occur during CICS initialization cause CICS to issue a
message, with an operator prompt to reply GO or CANCEL. If you reply CANCEL,
CICS de-registers from ARM before terminating, because if CICS remained
registered, an automatic restart would probably encounter the same error condition.
For other error situations, CICS does not de-register, and automatic restarts follow.
To control the number of restarts, specify in your ARM policy the number of times
ARM is to restart a failed CICS region.
Note: A CICS restart can have been initiated by ARM, even though CICS
registration with ARM has failed in the restarted CICS.
See OS/390 MVS Setting Up a Sysplex for information about ARM couple data sets
and ARM policies.
You cannot specify XRF=YES if you want to use ARM support. If the XRF system
initialization parameter is changed to XRF=YES for a CICS region being restarted
by ARM, CICS issues message DFHKE0407 to the console, then terminates.
If the ARM policy specifies different JCL for an automatic restart, and that JCL
specifies START=COLD, CICS obeys this parameter but risks losing data integrity.
Therefore, if you need to specify different JCL to ARM, specify START=AUTO to
ensure data integrity.
Workload policies
Workloads are started initially by scheduling or automation products.
The components of the workload, and the MVS images capable of running them,
are specified as part of the policies for MVS workload manager and ARM. The MVS
images must have access to the databases, logs, and program libraries required for
the workload.
Implementing ARM support for CICS generally involves the following steps:
v Ensure that the MVS images available for automatic restarts have access to the
databases, logs, and program libraries required for the workload.
v Define ARM policies for your CICS regions:
– Allow the RESTART_ORDER LEVEL to default to 2 (CICS regions should
generally be level 2 in the ARM restart order, after DB2 and DBCTL).
Connecting to VTAM
VTAM is at restart level 1, the same as DB2 and DBCTL. However, VTAM is not
restarted when failed subsystems are being restarted on another MVS, because
ARM expects VTAM to be running on all MVS images in the sysplex. For this
reason, CICS and VTAM are not generally part of the same restart group.
In a VTAM network, the session between CICS and VTAM is started automatically if
VTAM is started before CICS. If VTAM is not active when you start (or restart)
CICS, you receive the following messages:
+DFHSI1589D 'applid' VTAM is not currently active.
+DFHSI1572 'applid' Unable to OPEN VTAM ACB - RC=xxxxxxxx, ACB CODE=yy.
CICS provides a new transaction, COVR, to open the VTAM ACB automatically
when VTAM becomes available. See “The COVR transaction” for more information
about this.
You cannot run the COVR transaction from a terminal. If you invoke COVR from a
terminal, it abends with an AZCU transaction abend.
DFHKE0401 DFHKE0402 DFHKE0403 DFHKE0404 DFHKE0405 DFHKE0406
DFHKE0407 DFHKE0408 DFHKE0410 DFHKE0411 DFHZC0200 DFHZC0201
For the text of these messages, see the CICS Transaction Server for OS/390:
Migration Guide.
The following events can cause the abnormal termination of transactions, all of
which cause CICS to initiate transaction backout:
v A transaction ABEND request issued by a CICS management module.
v A program check or operating system abend (this is trapped by CICS and
converted into an ASRA or ASRB transaction abend).
v An ABEND request issued by a user application program.
v A CEMT, or EXEC CICS, command such as SET TASK PURGE or
FORCEPURGE.
Note: Unlike the EXEC CICS ABEND command above, these EXEC CICS
commands cause other tasks to abend, not the task issuing the command.
v A transaction abend request issued by DFHZNEP or DFHTEP following a
communication error. This includes the abnormal termination of a remote CICS
during processing of in-flight distributed UOWs on the local CICS.
v An abnormal termination of CICS, in which all in-flight transactions are effectively
abended as a result of the CICS region failing.
In-flight transactions are recovered during a subsequent emergency restart to
enable CICS to complete the necessary backout of recoverable resources, which
is performed in the same way as if the task abended while CICS was running.
For recovery purposes, CICS recovery manager is concerned only with the units of
work that have not yet completed a syncpoint because of some failure. This section
discusses how CICS handles these failed units of work.
Note: Although the failed backout may have been attempted as a result of
the abnormal termination of a transaction, the backout failure itself
does not cause the transaction to terminate abnormally.
3. Two-phase commit. The protocol used by CICS when taking a syncpoint in a distributed unit of work, where the first prepare
phase is followed by the actual commit phase. See two-phase commit in the Glossary.
In general, the same process of transaction backout is used for individual units of
work that abend while CICS is running and for in-flight tasks recovered during
emergency restart. One difference is that dynamic backout of a single abnormally
terminating transaction takes place immediately. Therefore, it does not cause any
active locks to be converted into retained locks. In the case of a CICS region
abend, in-flight tasks have to wait to be backed out when CICS is restarted, during
which time the locks are retained to protect uncommitted resources.
To restore the resources to the state they were in at the beginning of the unit of
work, CICS preserves a description of their state at that time:
v For tables maintained by CICS, information is held in the tables themselves.
v For recoverable auxiliary temporary storage, CICS maintains information on the
system log about all new items written to TS queues. CICS maintains information
about TS queues for backout purposes in main storage.
v For transient data, CICS maintains cursors that indicate how much has been
read and written to the queue, and these cursors are logged. CICS does not log
before- or after-images for transient data.
v For CICS files, the before-images of deleted or changed records are recorded in
the system log. Although they are not strictly “before-images”, CICS also logs
newly added records, because CICS needs information about them if they have
to be removed during backout.
This section discusses the way the individual resource managers handle their part
of the backout process in terms of the following resources:
v Files
v Intrapartition transient data
v Auxiliary temporary storage
v Start requests
v Cancel START requests
v Basic mapping support (BMS) messages
Files
CICS file control is presented with the log records of all the recoverable files that
have to be backed out. File control:
v Restores the before-images of updated records
v Restores deleted records
v Removes new records added by the unit of work
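The three actions listed above can be sketched in miniature (hypothetical Python over an in-memory dictionary; the log-record shapes are assumptions, not the CICS system log format):

```python
# A minimal sketch of the three file-control backout actions described
# above, applied to an in-memory "data set". Log-record shapes are assumed.

def backout(dataset, log_records):
    # Undo the UOW's changes in reverse order of original execution.
    for rec in reversed(log_records):
        if rec["op"] == "update":
            dataset[rec["key"]] = rec["before"]   # restore before-image
        elif rec["op"] == "delete":
            dataset[rec["key"]] = rec["before"]   # restore deleted record
        elif rec["op"] == "add":
            del dataset[rec["key"]]               # remove newly added record

ds = {"A": "new", "C": "v1"}
log = [{"op": "update", "key": "A", "before": "old"},
       {"op": "delete", "key": "B", "before": "b0"},
       {"op": "add", "key": "C"}]
backout(ds, log)
assert ds == {"A": "old", "B": "b0"}
```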
BDAM files and VSAM ESDS files: In the special case of the file access
methods that do not support delete requests (VSAM ESDS and BDAM), CICS
cannot remove new records added by the unit of work. In this case, CICS invokes
the global user exit program enabled at the XFCLDEL exit point whenever a WRITE
to a VSAM ESDS, or to a BDAM data set, is being backed out. This enables your
exit program to perform a logical delete by amending the record in some way that
flags it as deleted.
If you do not have an XFCLDEL exit program, CICS handles the unit of work as
backout-failed, and shunts the unit of work to be retried later (see “Backout-failed
recovery” on page 75). For information about resolving backout failures, see
“Logical delete not performed” on page 77.
Such flagged records can be physically deleted when you subsequently reorganize
the data set offline with a utility program.
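The logical-delete idea can be sketched as follows (illustrative Python; the flag byte and record layout are assumptions, not a CICS or XFCLDEL convention):

```python
# Sketch of a "logical delete": because VSAM ESDS and BDAM do not support
# physical deletes, an exit program amends the record to flag it as
# deleted. The flag byte used here is an assumption for illustration.

DELETED_FLAG = b"\xff"

def logical_delete(record):
    """Flag a record as deleted by overwriting its first byte."""
    return DELETED_FLAG + record[1:]

def is_logically_deleted(record):
    return record[:1] == DELETED_FLAG

rec = logical_delete(b"\x00CUSTOMER0001")
assert is_logically_deleted(rec)

# An offline reorganization utility could later drop flagged records:
live = [r for r in [rec, b"\x00CUSTOMER0002"]
        if not is_logically_deleted(r)]
assert live == [b"\x00CUSTOMER0002"]
```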
CICS data tables: For CICS-maintained data tables, the updates made to the
source VSAM data set are backed out. For user-maintained data tables, the
in-storage data is backed out.
Intrapartition transient data
Transient data does not provide any support for the concept of transaction backout,
which means that:
v Any records retrieved by the abending unit of work are not available to be read
by another task, and are therefore lost.
v Any records written by the abending unit of work are not backed out. This means
that these records are available to be read by other tasks, although they might
be invalid.
Temporary storage
CICS does not back out changes to temporary storage queues held in main storage
or in a TS server temporary storage pool.
START requests
Recovery of EXEC CICS START requests during transaction backout depends on
some of the options specified on the request, such as the PROTECT option and
whether the request has associated data on a recoverable temporary storage
queue.
When designing your applications, consider the recoverability of data that is being
passed to a started transaction.
During transaction backout of a failed task that has canceled a START request that
has recoverable data associated with it, CICS recovers both the temporary storage
queue and the start request. Thus the effect of the recovery is as if the CANCEL
command had never been issued.
If there is no data associated with the START command, or if the temporary storage
queue is not recoverable, neither the canceled started task nor its data is
recovered, and it stays canceled.
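The recovery rule for canceled START requests can be summarized as a predicate (hypothetical Python; the field names are invented for illustration):

```python
# Sketch of the rule above: backing out a task that canceled a START
# restores both the start request and its TS queue only if the request
# had data and the queue is recoverable. Names are illustrative.

def backout_cancel(start_request, ts_recoverable):
    """State of the canceled start request after backout of the canceler."""
    if start_request["has_data"] and ts_recoverable:
        return "restored"      # as if CANCEL had never been issued
    return "canceled"          # neither the task nor its data is recovered

assert backout_cancel({"has_data": True}, ts_recoverable=True) == "restored"
assert backout_cancel({"has_data": True}, ts_recoverable=False) == "canceled"
assert backout_cancel({"has_data": False}, ts_recoverable=False) == "canceled"
```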
Note: If backout fails, CICS does not try to restart regardless of the setting of the
restart program.
Backout-failed recovery
In principle, backout failure support can apply to any resource that performs
backout, but the support is currently provided only by CICS file control.
Files
If backout to a VSAM data set fails for any reason, CICS:
v Invokes the backout failure global user exit program at XFCBFAIL, if this exit
is enabled. If the user exit program chooses to bypass backout failure
processing, the remaining actions below are not taken.
v Issues message DFHFC4701, giving details of the update that has failed
backout, and the type of backout failure that has occurred.
v Converts the active exclusive locks into retained locks. This ensures that no
other task in any CICS region (including the region that owns the locks) waits for
a lock that cannot be granted until the failure is resolved. (In this situation, CICS
returns the LOCKED condition to other tasks that request a lock.) Preserving
locks in this way also prevents other tasks from updating the records until the
failure is resolved.
– For data sets open in RLS mode, CICS requests SMSVSAM to retain the
locks.
– For VSAM data sets open in non-RLS mode, the CICS enqueue domain
provides an equivalent function.
Creating retained locks also ensures that other requests do not have to wait on
the locks until the backout completes successfully.
v Keeps the log records that failed to be backed out (by shunting4 the unit of
work) so that the failed records can be presented to file control again when
backout is retried. (See “The shunted state” on page 13 for more information
about shunted units of work.)
If a unit of work updates more than one data set, the backout may fail for only one,
or some, of the data sets. When this occurs, CICS converts to retained locks only
those locks held by the unit of work for the data sets for which backout has failed.
When the unit of work is shunted, CICS releases the locks for records in data sets
that are backed out successfully. The log records for the updates made to the data
sets that fail backout are kept for the subsequent backout retry. CICS does not
keep the log records that are successfully backed out.
4. Shunting. The process of suspending a unit of work in order to allow time to resolve the problem that has caused the suspension.
Shunting releases the user’s terminal, virtual storage, and CPU resources, and allows completion of the unit of work to be deferred
for as long as necessary.
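The partial-failure handling described above can be sketched like this (illustrative Python; the data-set names and structures are invented, not CICS internals):

```python
# Sketch of partial backout failure: only data sets that fail backout keep
# retained locks and log records; successful ones release both. The input
# structure is an assumption for illustration.

def shunt_after_partial_backout(uow):
    """uow maps data-set name -> {'backed_out': bool}. Returns what is kept."""
    retained_locks, kept_log_records = [], []
    for dsname, info in uow.items():
        if info["backed_out"]:
            continue                      # locks released, log records dropped
        retained_locks.append(dsname)     # active locks become retained
        kept_log_records.append(dsname)   # records kept for backout retry
    return retained_locks, kept_log_records

locks, records = shunt_after_partial_backout(
    {"PAYROLL.KSDS": {"backed_out": True},     # invented data-set names
     "ORDERS.ESDS": {"backed_out": False}})
assert locks == ["ORDERS.ESDS"] and records == ["ORDERS.ESDS"]
```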
For BDAM data sets, there is only limited backout failure support: the backout
failure exit, XFCBFAIL, is invoked (if enabled) to take installation-defined action, and
message DFHFC4702 is issued.
Disposition of data sets after backout failures: Because individual records are
locked when a backout failure occurs, CICS need not set the entire data set into a
backout-failed condition. CICS may be able to continue using the data set, with
only the locked records being unavailable. Some kinds of backout failure can be
corrected without any need to take the data set offline (that is, without needing to
stop all current use of the data set and prevent further access). Even for those
failures that cannot be corrected with the data set online, it may still be preferable
to schedule the repair at some future time and to continue to use the data set in the
meantime, if this is possible.
Message DFHFC4701 with a failure code of X'24' indicates that an I/O error (a
physical media error) has occurred while backing out a VSAM data set. This
indicates that there is some problem with the data set, but it may be that the
problem is localized. A better indication of the state of a data set is given by
message DFHFC0157 (followed by DFHFC0158), which CICS issues whenever
an I/O error occurs (not just during backout). Depending on the data set
concerned, and other factors, your policy may be to repair the data set:
v After a few I/O errors
v After the first backout failure
5. Unshunting. The process of attaching a transaction to provide an environment under which to resume the processing of a shunted
unit of work.
It might be worth initially deciding to leave a data set online for some time after
a backout failure, to evaluate the level of impact the failures have on users.
To recover from a media failure, recreate the data set by applying forward
recovery logs to the latest backup. The steps you take depend on whether the
data set is opened in RLS or non-RLS mode:
v For data sets opened in non-RLS mode, set the data set offline to all CICS
applications by closing all open files against the data set.
Perform forward recovery using a forward recovery utility.
When the new data set is ready, use the CEMT (or EXEC CICS) SET
DSNAME RETRY command to drive backout retry against the data set for all
the units of work in backout-failed state.
v For data sets opened in RLS mode, use the CEMT (or EXEC CICS) SET
DSNAME QUIESCED command to quiesce the data set.
Perform forward recovery using CICSVR as your forward recovery utility.
CICS regions are notified through the quiesce protocols when CICSVR has
completed the forward recovery. This causes backout to be automatically
retried. The backout retry fails at this attempt because the data set is still
quiesced, and the UOWs are again shunted as backout-failed.
Unquiesce the data set as soon as you know that forward recovery is
complete. Completion of the unquiesce is notified to the CICS regions, which
causes backout to be automatically retried again, and this time it should
succeed.
This mechanism, in which the backout retry is performed within CICS,
supersedes the batch backout facility supported by releases of CICSVR
earlier than CICSVR 2.3. You do not need a batch backout utility.
Logical delete not performed
This error occurs if, during backout of a write to an ESDS, the XFCLDEL logical
delete exit was either not enabled, or requested that the backout be handled as
a backout failure.
You can correct this by enabling a suitable exit program and manually retrying
the backout. There is no need to take the data set offline.
Open error
Investigate the cause of any error that occurs in a file open operation. A data
set is normally already open during dynamic backout, so an open error should
occur only during backout processing if the backout is being retried, or is being
carried out following an emergency restart. Some possible causes are:
v The data set has been quiesced, in which case the backout is automatically
retried when the data set is unquiesced.
v It is not possible to open the data set in RLS mode because the SMSVSAM
server is not available, in which case the backout is automatically retried
when the SMSVSAM server becomes available.
For other cases, manually retry the backout after the cause of the problem has
been resolved. There is no need to take the data set offline.
DFSMSdss makes use of the VSAM quiesce protocols when taking non-BWO
backups of data sets that are open in RLS mode. While a non-BWO backup is
in progress, the data set does not need to be closed, but updates to the data
set are not allowed. This error means that the backout request was rejected
because it was issued while a non-BWO backup was in progress.
Take the data set offline to reallocate it with more space. (See “Chapter 16.
Moving recoverable data sets that have retained locks” on page 175 for
information about preserving retained locks in this situation.) You can then retry
the backout manually, using the CEMT, or EXEC CICS, SET DSNAME(...)
RETRY command.
Non-unique alternate index full
Take the data set offline to rebuild the data set with a larger record size for the
alternate index. (See “Chapter 16. Moving recoverable data sets that have
retained locks” on page 175 for information about preserving retained locks in
this situation.) You can then retry the backout manually, using the CEMT, or
EXEC CICS, SET DSNAME(...) RETRY command.
Deadlock detected
This error can occur only for VSAM data sets opened in non-RLS access mode.
This situation can be resolved only by deleting the rival record with the
duplicate key value.
Lock structure full error
The backout required VSAM to acquire a lock for internal processing, but it was
unable to do so because the RLS lock structure was full. This error can occur
only for VSAM data sets opened in RLS access mode.
To resolve the situation, you must allocate a larger lock structure in an available
coupling facility, and rebuild the existing lock structure into the new one. The
failed backout can then be retried using SET DSNAME RETRY.
None of the above
If any other error occurs, it indicates a possible error in CICS or VSAM code, or
a storage overwrite in the CICS region. Diagnostic information is given in
message DFHFC4700, and a system dump is provided.
If the problem is only transient, a manual retry of the backout should succeed.
Transient data
All updates to logically recoverable intrapartition queues are managed in main
storage until syncpoint, or until a buffer must be flushed because all buffers are in
use. TD always commits forwards; therefore, TD can never suffer a backout failure
on DFHINTRA.
Commit-failed recovery
Commit failure support is provided only by CICS file control, because it is the only
CICS component that needs this support.
Files
A commit failure is one that occurs during the commit stage of a unit of work (either
following the prepare phase of two-phase commit, or following backout of the unit of
work). It means that the unit of work has not yet completed, and the commit must
be retried successfully before the recovery manager can forget about the unit of
work.
When a failure occurs during file control’s commit processing, CICS ensures that all
the unit of work log records for updates made to data sets that have suffered the
commit failure are kept by the recovery manager. Preserving the log records
ensures that the commit processing for the unit of work can be retried later when
conditions are favorable.
However, it is also possible for a file control commit failure to occur as a result of
some other error when CICS is attempting to release RLS locks during commit
processing, or is attempting to convert some of the locks into retained locks during
the commit processing that follows a backout failure. In this case it may be
necessary to retry the commit explicitly using the SET DSNAME RETRY command.
Such failures should be rare, and may be indicative of a more serious problem.
It is possible for a unit of work that has not performed any recoverable work, but
which has performed repeatable reads, to suffer a commit failure. If the SMSVSAM
server fails while holding locks for repeatable read requests, it is possible to access
the records when the server recovers, because all repeatable read locks are
released at the point of failure. If the commit failure is not due to a server failure,
the locks are held as active shared locks. The INQUIRE UOWDSNFAIL command
distinguishes between a commit failure where recoverable work was performed, and
one for which only repeatable read locks were held.
The CICS recovery manager shunts a unit of work if all the following conditions
apply:
v The unit of work has entered the in-doubt period.
v The recovery manager detects loss of connectivity to its coordinator for the unit
of work.
v The in-doubt attribute on the transaction resource definition under which the unit
of work is running specifies WAIT(YES).
v The conditions exist that allow shunting. For example, recovery manager does
not shunt a UOW that has an MRO link to a CICS region running under CICS for
MVS/ESA 4.1 or earlier. See the CICS Intercommunication Guide for a complete
list of conditions.
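The four conditions can be checked together; CICS shunts the UOW only when all of them hold (an illustrative Python helper, with invented parameter names):

```python
# Sketch of the shunting decision described above: all four conditions
# must be true before the recovery manager shunts an in-doubt UOW.

def should_shunt(in_doubt, coordinator_lost, wait_yes, shunt_allowed):
    return all([in_doubt, coordinator_lost, wait_yes, shunt_allowed])

assert should_shunt(True, True, True, True)
# WAIT(NO) on the transaction definition: take the in-doubt action instead.
assert not should_shunt(True, True, False, True)
```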
Files
When file control shunts its resources for the unit of work, it detects that the shunt
is being issued during the first phase of two-phase commit, indicating an in-doubt
failure. Any active exclusive lock held against a data set updated by the unit of
work is converted into a retained lock. The result of this action is as follows:
v No CICS region, including the CICS region that obtained the locks, can update
the records that are awaiting in-doubt resolution because the locks have not
been freed.
For data sets opened in RLS mode, interfaces to VSAM RLS are used to retain the
locks. For VSAM data sets opened in non-RLS mode, and for BDAM data sets, the
CICS enqueue domain provides an equivalent function. It is not possible for some
of the data sets updated in a particular unit of work to be failed in-doubt and for the
others not to be.
It is possible for a unit of work that has not performed any recoverable work, but
which has performed repeatable reads, to be shunted when an in-doubt failure
occurs. In this event, repeatable read locks are released. Therefore, for any data
set against which only repeatable reads were issued, it is possible to access the
records, and to open the data set in non-RLS mode for batch processing, despite
the existence of the in-doubt failure. The INQUIRE UOWDSNFAIL command
distinguishes between an in-doubt failure where recoverable work has been
performed, and one for which only repeatable read locks were held. If you want to
open the data set in non-RLS mode in CICS, you need to resolve the in-doubt
failure before you can define the file as having RLSACCESS(NO). If the unit of
work has updated any other data sets, or any other resources, you should try to
resolve the in-doubt correctly, but if the unit of work has only performed repeatable
reads against VSAM data sets and has made no updates to other resources, it is
safe to force the unit of work using the SET DSNAME or SET UOW commands.
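The guidance on forcing an in-doubt UOW can be expressed as a predicate (hypothetical Python; the parameter names are invented for illustration):

```python
# Sketch of the safety rule above: forcing an in-doubt UOW (SET DSNAME or
# SET UOW) is safe only if it updated nothing and held only repeatable-read
# locks against VSAM data sets. Illustrative helper, not a CICS API.

def safe_to_force(updated_datasets, updated_other_resources,
                  repeatable_reads_only):
    return (not updated_datasets and not updated_other_resources
            and repeatable_reads_only)

assert safe_to_force([], False, True)
# Any update means the in-doubt should be resolved correctly instead:
assert not safe_to_force(["ACCTS.KSDS"], False, True)   # invented name
```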
CICS saves enough information about the unit of work to allow it to be either
committed or backed out when the in-doubt unit of work is unshunted when the
coordinator provides the resolution (or when the transaction wait time expires). This
information includes the log records written by the unit of work.
When CICS has re-established communication with the coordinator for the unit of
work, it can resynchronize all in-doubt units of work. This involves CICS first
unshunting the units of work, and then proceeding with the commit or backout. All
CICS enqueues and VSAM RLS record locks are released, unless a commit failure
or backout failure occurs.
For information about the resynchronization process for units of work that fail
in-doubt, see the CICS Intercommunication Guide.
You can use the WAITACTION option on the TD queue resource definition to
control the action that CICS takes when an update request is made against a
shunted in-doubt UOW that has updated the queue. In addition to the default
After resynchronization, the shunted updates to the TD queue are either committed
or backed out, and the retained locks are released.
After resynchronization, the shunted updates to the TS queue are either committed
or backed out, and the retained locks are released.
To retrieve information about a unit of work (UOW), you can use either the CEMT,
or EXEC CICS, INQUIRE UOW command. For the purposes of this illustration, the
CEMT method is used. You can filter the command to show only UOWs that are
associated with a particular transaction. For example, Figure 4 shows one UOW
(AC0CD65E5D990800) associated with transaction UPDT.
Figure 4. The CEMT INQUIRE UOW command showing UOWs associated with a transaction
Each UOW identifier is unique within the local CICS system. To see more
information about the UOW, move the cursor to the UOW row and press ENTER.
This displays the following screen:
The UOWSTATE for this UOW is Indoubt. The TRANSACTION definition attribute
WAIT(YES|NO) controls the action that CICS takes when a UOW fails in-doubt.
CICS does one of two things:
v Makes the UOW wait, pending recovery from the failure. (In other words, the
UOW is shunted.) Updates to recoverable resources are suspended.
v Takes an immediate decision to commit or back out the recoverable resource
updates.
The WAITSTATE of Shunted shows that this UOW has been suspended.
When a UOW has been shunted in-doubt, CICS retains locks on the recoverable
resources that the UOW has updated. This prevents further tasks from changing the
resource updates while they are in-doubt. To display CICS locks held by a UOW
that has been shunted in-doubt, use the CEMT INQUIRE UOWENQ command. You
can filter the command to show only locks that are associated with a particular
UOW. (Note that the INQUIRE UOWENQ command operates only on non-RLS
resources on which CICS has enqueued, and for RLS-accessed resources you
should use the INQUIRE UOWDSNFAIL command.) For example:
To see more information about this UOWENQ, put the cursor alongside it and press
ENTER:
Because CIND was used to create this in-doubt failure, it can also be used to
resolve the in-doubt UOW. For an example of how to resolve a real in-doubt failure,
see the CICS Intercommunication Guide.
SMSVSAM supports cache set definitions that allow you to define multiple cache
structures6 within a cache set across one or more coupling facilities. To insure
against a cache structure failure, use at least two coupling facilities and define each
cache structure, within the cache set, on a different coupling facility.
The support for rebuilding cache structures enables coupling facility storage to be
used effectively. It is not necessary to reserve space for a rebuild to recover from a
cache structure failure—SMSVSAM uses any available space.
If RLS is unable to recover from the cache failure for any reason, the error is
reported to CICS when it tries to access a data set that is bound to the failed
cache, and CICS issues message DFHFC0162 followed by DFHFC0158. CICS
defers any activity on data sets bound to the failed cache by abending UOWs that
attempt to access the data sets. When “cache failed” responses are encountered
during dynamic backout of the abended UOWs, CICS invokes backout failure
support (see “Backout-failed recovery” on page 75). RLS open requests for data
sets that must bind to the failed cache, and RLS record access requests for open
data sets that are already bound to the failed cache, receive error responses from
SMSVSAM.
When either the failed cache becomes available again, or SMSVSAM is able to
connect to another cache in a data set’s cache set, CICS is notified by the
SMSVSAM quiesce protocols. CICS then retries all backouts that were deferred
because of cache failures.
Whenever CICS is notified that a cache is available, it also drives backout retries
for other types of backout failure, because this notification provides an opportunity
to complete backouts that may have failed for some transient condition.
CICS recovers after a cache failure automatically. There is no need for manual
intervention (other than the prerequisite action of resolving the underlying cause of
the cache failure).
6. Cache structure. One of three types of coupling facility data structure supported by MVS. SMSVSAM uses its cache structure to
perform buffer pool management across the sysplex. This enables SMSVSAM to ensure that the data in the VSAM buffer pools in
each MVS image remains valid.
If the rebuild fails, all SMSVSAM servers abend and restart, but they are not
available for service until they can successfully connect to a new coupling facility
lock structure. Thus a lock structure failure is initially detected by CICS as an
SMSVSAM server failure, and CICS issues message DFHFC0153.
When the SMSVSAM servers abend because of this failure, the sharing control
data set is updated to reflect the lost locks condition. The sharing control data set
records:
v The data sets whose locks have been lost.
v The CICS regions that must complete lost locks recovery for each data set.
These are the CICS regions that had a data set open for update (in RLS-mode)
for which the locks have been lost.
If a lost-locks condition occurs and is not resolved when a CICS restart (warm or
emergency) occurs, CICS is notified during file control restart about any data sets
for which it must perform lost locks recovery. On a cold start, CICS does not
perform any lost locks recovery, and the information in the sharing control data set,
which records that CICS must complete lost locks recovery, is cleared for each data
set. This does not affect the information recorded for other CICS regions.
Only UOWs performing lost locks recovery can use data sets affected by lost locks.
Error responses are returned on open requests issued by any CICS region that was
not sharing the data set at the time the lost locks condition occurred, and on RLS
access requests issued by any new UOWs in CICS regions that were sharing the
data set.
CICS takes the following actions during dynamic RLS restart to expedite lost locks
recovery:
v It drives backout-failed UOWs for backout retry. If backout retry for a data set in
the lost locks condition fails, lost locks recovery for that data set cannot complete
until the cause of the backout failure is resolved and the retry succeeds.
When a CICS region has completed lost locks recovery for a data set, it informs
SMSVSAM. This is done once for each data set. When all CICS regions have
informed SMSVSAM that they have completed their lost locks recovery for a
particular data set, that data set is no longer in a lost locks condition, and is made
available for general access. Although the lost locks condition simultaneously
affects all data sets in use when the lock structure fails, each data set can
be restored to service individually as soon as all its sharing CICS regions have
completed lost locks recovery.
When the MVS image restarts, recovery for all resources is through CICS
emergency restart. If any CICS region completes emergency restart before the
SMSVSAM server becomes available, it performs dynamic RLS restart as soon as
the server is available.
The surviving MVS images should be affected by the failure only to the extent that
more work is routed to them. Also, tasks that attempt to access records that are
locked by CICS regions in the failed MVS image receive the LOCKED response.
If all the MVS images in a sysplex fail, the first SMSVSAM server to restart
reconnects to the lock structure in the coupling facility and converts all the locks
into retained locks for the whole sysplex.
Recovery from the failure of a sysplex is equivalent to multiple MVS failure
recoveries.
Only one abend exit can be active at any given logical level 7 within a task. This
means that:
1. When one program LINKs to another program, the LINKed-from program and
the LINKed-to program can each have one active exit.
2. When an exit is activated (at a particular program level), any other exit that may
already be active at the same level automatically becomes deactivated.
Reasons that an application programmer might have for coding a program level
abend exit, and functions that might be incorporated, are discussed in “Handling
abends and program level abend exits” on page 143.
When an abend request is issued for a task, CICS immediately passes control to
the exit that is active at the current logical level:
v If no exit is active at the current logical level, CICS checks progressively up
through higher logical levels and passes control to the first exit code found to be
active.
v If CICS finds no active exit at, or higher than, the current logical level, the task
terminates abnormally (see “Abnormal termination of a task” on page 90).
When control is transferred to any exit code, CICS deactivates the exit before any
of its code is executed. (This means that, in the event of another abend request,
the exit will not be reentered, and control is passed to activated exit code (if any) at
the next higher level.)
The exit code then executes as an extension of the abending task, and runs at the
same level as the program that issued the HANDLE ABEND command that
activated the exit.
After any program-level abend exit code has been executed, the next action
depends on how the exit code ends:
v If the exit code ends with an ABEND command, CICS gives control to the next
higher level exit code that is active. If no exit code is active at higher logical
levels, CICS terminates the task abnormally. The next section describes what
may happen after abnormal termination of a task.
v If the exit code ends with a RETURN command, CICS returns control to the next
higher logical level at the point following the LINK command (not to any exit code
that may be active) just as if the RETURN had been issued by the lower level
application program. This leaves the task in a normal processing state, and it
does not terminate at this point.
7. Logical level. A LINKed-to program is said to be at a lower logical level than the program that issues the LINK command. The
concept of logical levels is explained in the CICS Application Programming Guide.
Transaction restart
The transaction restart user-replaceable module (DFHREST) enables you to
participate in the decision as to whether a transaction should be restarted or not.
The default program requests restart under certain conditions; for example, in the
event of a program isolation deadlock (for instance, when two tasks each wait for
the other to release a particular DL/I database segment), one of the tasks is backed
out and automatically restarted, and the other is allowed to complete its update.
For programming information about how to write your own code for DFHREST, see
the CICS Customization Guide.
Notes:
1. CICS invokes DFHREST only when RESTART(YES) is specified in a
transaction’s resource definition.
2. Ensure that resources used by restartable transactions, such as files, temporary
storage, and intrapartition transient data queues, are defined as recoverable.
3. When transaction restart occurs, a new task is attached that invokes the initial
program of the transaction. This is true even if the task abended in the second
or subsequent unit of work, and DFHREST requested a restart.
4. CICS keeps statistics on the total number of restarts against each transaction.
5. Emergency restart does not restart any user tasks that were in-flight when CICS
abnormally terminated. Instead, recovery manager attaches a special task for
each in-flight task that had recovery records in the system log at abnormal
termination. This task invokes each resource manager to back out recoverable
resources.
6. Making a transaction restartable can involve slightly more overhead, because
copies of the TCTUA, COMMAREA, and terminal input/output area (TIOA) have
to be kept in case the transaction needs to be restarted. For more information
about making transactions restartable, see the CICS Customization Guide.
Except for transaction failures that occur during syncpoint processing, the CICS
transaction failure program, before sending the abend message to the CSMT queue,
links to the user-replaceable program error program (DFHPEP). This occurs after all
program-level abend exit code has been executed by the abnormally terminating
task, and after dynamic transaction backout (if any) has been performed.
Notes:
1. DFHPEP is not given control when the task abend is part of the processing
done by CICS to avoid a system stall.
2. DFHPEP is not given control if transaction manager detects that the abended
transaction is to be restarted by DFHREST.
3. DFHPEP processing takes place after a transaction dump has been taken.
DFHPEP cannot prevent a dump being taken.
4. DFHPEP is not given control if the transaction failure occurs during syncpoint
processing.
5. DFHPEP is not given control when the conditions causing the task to be
terminated are handled by the CICS abnormal condition program (ACP). The
conditions handled by ACP are attach failures; for instance, when the
transaction does not exist, or when a security violation is detected.
6. DFHPEP is not given control when a task has abended and CICS is short on
storage.
The CICS-provided DFHPEP program executes no functions, but you can include in
it your own code to carry out installation-level action following a transaction abend
(see “The CICS-supplied PEP” on page 155). There is only one program error
program for the whole system.
All CICS facilities are available to the DFHPEP program. You can, for example:
v Send messages to the terminal
v Send messages to the master terminal
v Record information or statistics about the abend
v Request the disabling of the transaction entry associated with this task
If the program check or abend is associated with any domain other than the
application domain, you have no further part in processing the error.
If the program check or abend is in the application domain, one of the following can
occur:
v CICS remains operational, but the task currently in control terminates.
v CICS terminates (see “Shutdown requested by the operating system” on
page 29).
If a program check occurs when a user task is processing, the task abends with an
abend code of ASRA. If a program check occurs when a CICS system task is
processing, CICS terminates.
If an operating system abend has occurred, CICS searches the system recovery
table, DFHSRT. The SRT contains a set of operating system abend codes that you
want CICS to recover from. CICS searches the SRT looking for the system abend
code issued by the system:
v If a match is not found, CICS is terminated.
v If a match is found, and a CICS system task is processing, CICS is terminated.
v If a match is found, and a user task is processing, the default action is to abend
the task with an abend code of ASRB. However, you can change this action by
coding a global user exit program at exit point XSRAB. The value of the return
code from XSRAB determines which of the following happens next:
– The task terminates with the ASRB abend code.
– The task terminates with the ASRB abend code and CICS cancels any
program-level abend exits that are active for the task.
– CICS terminates.
For programming information about the XSRAB exit point, see the CICS
Customization Guide.
CICS supplies a sample SRT, DFHSRT1$, that has a default set of abend codes.
You can modify the sample table to define abend codes that suit your own
requirements. The source of DFHSRT1$ is supplied in the
CICSTS13.CICS.SDFHSAMP library. For more information about the SRT, see the
CICS Resource Definition Guide.
CICS controls terminals by using VTAM (in conjunction with NCP for remote
terminals) and TCAM. These communication access methods detect transmission
errors between the central processing complex (CPC) and a remote terminal, and
automatically invoke error recovery procedures, if specified. These error recovery
procedures generally involve:
v Retransmission of data a defined number of times or until data is transmitted
error-free.
v Recording of information about the error on a data set, or internally in control
blocks. You can, at times, access data recorded in control blocks using
communication system commands.
If the data is not transmitted successfully after the specified number of retries:
v CICS terminal management is notified.
v One of the following CICS terminal error transactions is initiated:
– Control can pass to a user-written node error program (DFHZNEP).
– Control can pass to a user-written terminal error program (DFHTEP).
For programming information about coding your own NEPs and TEPs, see the
CICS Customization Guide.
The NEP and the TEP are each entered once for each terminal error; therefore,
each should be designed to process only one error for each invocation.
If the failure occurs between syncpoints, CICS and the remote system can back out
any updates of recoverable resources either dynamically or following an emergency
restart.
If a failure occurs during the syncpointing process while a distributed unit of work is
in-doubt, CICS handles the unit of work according to the in-doubt attributes defined
on the transaction resource definition. One possible course of action is to shunt the
UOW as failed-indoubt to await resynchronization when communications are
restored. When designing applications, remember that if a communications failure
occurs and units of work are shunted as failed-indoubt, resources remain locked
until after resynchronization.
For information about the resolution of in-doubt units of work, see the CICS
Intercommunication Guide.
Question 1: Does the application update data in the system? If the application is to
perform no updating (that is, it is an inquiry-only application), recovery and restart
functions are not needed within CICS. (But you should take backup copies of
non-updated data sets in case they become unreadable.) The remaining questions
assume that the application does perform updates.
Question 2: Does this application update data sets that other online applications
access? If yes, does the business require updates to be made online, and then to
be immediately available to other applications—that is, as soon as the application
has made them? This could be a requirement in an online order entry system
where it is vital for inventory data sets8 to be as up-to-date as possible for use by
other applications at all times.
Alternatively, can updates be stored temporarily and used to modify the data set(s)
later—perhaps using offline batch programs? This might be acceptable for an
application that records only data not needed immediately by other applications.
Question 3: Does this application update data sets that batch applications access?
If yes, establish whether the batch applications are to access the data sets
concurrently with the online applications. If accesses made by the batch
applications are limited to read-only, the data sets can be shared between online
and batch applications, although read integrity may not be guaranteed. If you intend
to update data sets concurrently from both online and batch applications, consider
using DL/I or DB2, which ensure both read and write integrity.
Question 4: Does the application access any confidential data? Files that contain
confidential data, and the applications having access to those files, must be clearly
identified at this stage. You may need to ensure that only authorized users can
access confidential data when service is resumed after a failure, by asking for
re-identification in a sign-on message.
8. In the context of these questions, the term “data sets” includes databases.
The acceptable waiting time will vary depending on the value of the data and the
number of users whom you expect to be affected. If the data is of low value or
infrequently accessed, the acceptable waiting time will be longer than if the data is
very valuable or accessed by many business-critical processes.
Question 8: How long can the business tolerate being unable to use the application
in the event of a failure? Indicate (approximately) the maximum time that the
business can allow the system to be out of service after a failure. Is it minutes or
hours? The time allowed may have to be negotiated according to the types of
failure and the ways in which the business can continue without the online
application.
Question 9: How is the user to continue or restart entering data after a failure? This
is an important part of a recovery requirements statement because it can affect the
amount of programming required. The terminal user’s restart procedure will depend
largely on what is feasible—for example:
v Must the user be able to continue business by other means—for example,
manually?
v Does the user still have source material (papers, documents) that allow the
continued entry (or reentry) of data? If the source material is transitory (received
over the telephone, for example), more complex procedures may be required.
v Even if the user still has the source material, does the quantity of data preclude
its reentry?
Such factors define the point where the user restarts work. This could be at a point
that is as close as possible to the point reached before the system failure. The best
point could be determined with the aid of a progress transaction9. Or it could be at
some point earlier in the application—even at the start of the transaction.
Question 10: During what periods of the day do users expect online applications to
be available? This is an important consideration when applications (online and
batch) require so much of the available computer time that difficulties can arise in
scheduling precautionary work for recovery (taking backup copies, for example).
See “The RLS quiesce and unquiesce functions” on page 157.
9. A progress transaction here means one that enables users to determine the last actions performed by the application on their
behalf.
When designing the user’s restart procedure (including the progress transaction, if
used) include precautions to ensure that each input data item is processed once
only.
Security
Decide the security procedures for an emergency restart or a break in
communications. For example, when confidential data is at risk, specify that the
users should sign on again and have their passwords rechecked.
Bear in mind the security requirements when a user needs to use an alternative
terminal if a failure is confined to one terminal (or to a few terminals).
Note: The sign-on state of a user is not retained after a persistent sessions restart.
For information about individual resource recoverability, see “Chapter 12. Defining
recoverability for CICS-managed resources” on page 119.
You are recommended to specify a system recovery table on the SRT system
initialization parameter. If you do not specify a table suffix on this system
initialization parameter, it defaults to YES, which means that CICS tries to load an
unsuffixed table (which probably won’t exist in your load libraries). There is a
pregenerated sample table, DFHSRT1$, supplied in CICSTS13.CICS.SDFHLOAD,
and if this is adequate for your needs, specify SRT=1$. This table adds some extra
system abend codes to the built-in list that CICS handles automatically, even if you
define only the basic DFHSRT TYPE=INITIAL and TYPE=FINAL macros.
If you want to add additional system or user entries of your own, modify the sample
table. For information about modifying an SRT, see the CICS Resource Definition
Guide.
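A minimal SRT source member, modified to add installation-chosen system abend
codes, might look like the following sketch (the suffix and abend codes are
illustrative assumptions; check the CICS Resource Definition Guide for the exact
macro operands):

   DFHSRT TYPE=INITIAL,SUFFIX=Z1
   DFHSRT TYPE=SYSTEM,ABCODE=(806,80A)
   DFHSRT TYPE=FINAL
   END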
Included in DFHLIST are the following groups, which provide basic recovery
functions:
DFHRSEND
DFHSTAND
Ensure that your CICS region includes all these CICS-required groups by including
DFHLIST as one of the lists specified on the GRPLIST system initialization
parameter. For information about the contents of these groups, see the CICS
Resource Definition Guide.
CICS uses the services of the MVS system logger for all its system and general
logging requirements. The CICS log manager writes system log and general log
data to log streams defined to the MVS system logger. For more information, see
“Chapter 11. Defining system and general log streams” on page 105.
Activity keypoints are strongly recommended to reduce the time taken to perform an
emergency restart. Therefore, you should specify a nonzero value for AKPFREQ
(the default is 4000). For a discussion about how to set the activity keypoint
frequency, see “Activity keypointing” on page 110.
If you code NEWSIT=YES at a warm start, the values in the SIT that would
normally be ignored on a warm restart take effect, overriding any system
initialization parameters stored in the warm keypoint at normal shutdown. The SIT is
still modified, even with NEWSIT, by run-time system initialization parameters
specified in the PARM string, from SYSIN, or at the console.
For VSAM files defined to be accessed in non-RLS mode, you can define the
recovery attributes in the CSD file resource definition, or, optionally, in the ICF
catalog, providing your level of DFSMS supports this.
For BDAM files, you define the recovery attributes in the FCT.
Use the TST system initialization parameter to specify the suffix of the temporary
storage table that you want CICS to load at initialization.
If you have global user exit programs that are invoked for recovery purposes, they
must be enabled during the second stage of CICS initialization. You can enable
these global user exit recovery programs in application programs specified in the
first part of the PLTPI.
See the CICS Resource Definition Guide for information about defining program list
tables.
CICS facilities, such as the execution diagnostic facility (CEDF) and command
interpreter (CECI), can help to create exception conditions and to interpret program
and system reactions to those conditions.
The ability of the installed CICS system, application programs, operators, and
terminal users to cope with exception conditions depends on the designer and the
implementer being able to:
v Forecast the exceptional conditions that can be expected
v Document what operators and users should do in the process of recovery, and
include escape procedures for problems or errors that persist
It is essential that recovery and restart procedures are tested and rehearsed
in a controlled environment by everyone who might have to cope with a
failure. This is especially important in installations that have temporary operators.
All CICS system logging and journaling is controlled by the CICS log manager,
which uses MVS system logger log streams to store its output.
CICS logging and journaling can be divided into four broad types of activity:
System logging
CICS maintains a system log to support transaction backout for recoverable
resources. CICS implements system logging automatically, but you can
define the log stream as DUMMY to inhibit this function. However, if you
specify TYPE(DUMMY) for the system log, you are running without any
transaction backout facilities and without any restart capability, and you can
start CICS with START=INITIAL only.
CICS also supports programming interfaces that enable you to write your
own data to the system log, but user-written records should be for
recovery purposes only.
See “Defining system log streams” on page 106.
Forward recovery logging
CICS supports forward recovery logs for VSAM data sets.
Forward recovery logging is not automatic—you must specify that you want
this facility for your files, and also define the forward recovery log streams.
See “Defining forward recovery log streams” on page 114.
Autojournaling
CICS supports autojournaling of file control data and terminal control
messages. Autojournaling is generally used for audit trails.
Autojournaling is not automatic—you must specify that you want this facility
for your files and terminal messages, and also define the general log
streams to be used.
See the CICS Resource Definition Guide for information about specifying
autojournaling on file and terminal resource definitions.
User journaling
CICS supports programming interfaces to enable CICS applications to write
user-defined records to user journals, which are held on general log
streams.
See the CICS System Definition Guide for information about defining
general log streams for user journals.
Autojournaling and user journals play no part in CICS recovery and therefore are
not discussed here.
For information on how CICS handles the different error conditions detected by the
CICS log manager, see the CICS Problem Determination Guide.
All log streams needed by CICS must be defined to the MVS system logger before
CICS can use them. You can either define log streams explicitly, or you can let
CICS create them dynamically when they are first used. To enable CICS to create
log streams dynamically, you first define model log streams to the MVS system
logger. To define explicit log streams and model log streams, use the MVS
IXCMIAPU utility.
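For example, the IXCMIAPU input to define one explicit log stream and one model
log stream might look like the following sketch (the log stream names, structure
name, and size are illustrative assumptions, and only a subset of the available
keywords is shown):

   //DEFLOGS  JOB ...
   //LOGDEF   EXEC PGM=IXCMIAPU
   //SYSPRINT DD SYSOUT=*
   //SYSIN    DD *
     DATA TYPE(LOGR) REPORT(NO)
     DEFINE LOGSTREAM NAME(CICSHA.CICSP1.DFHLOG)
            STRUCTNAME(LOG_DFHLOG_001)
            LS_SIZE(1180)
     DEFINE LOGSTREAM NAME(MV10.DFHLOG.MODEL)
            MODEL(YES)
            STRUCTNAME(LOG_DFHLOG_001)
   /*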
For information about defining coupling facility log streams and DASD-only log
streams, see the CICS Installation Guide. For more information about the coupling
facility and defining log structures generally, see OS/390 MVS Setting Up a
Sysplex.
The CICS log manager connects to its system log streams automatically during
system initialization, unless the system log is defined as TYPE(DUMMY) in a CICS
JOURNALMODEL resource definition.
During an initial start, CICS uses default log stream names, unless you specify
otherwise on a JOURNALMODEL resource definition. The process of selecting and
connecting to a system log stream is as follows:
Without a JOURNALMODEL definition
If CICS can’t find a JOURNALMODEL resource definition for DFHLOG and
DFHSHUNT, it issues calls to the MVS system logger to connect to its system
log streams using default names. These are:
region_userid.applid.DFHLOG
region_userid.applid.DFHSHUNT
where region_userid is the RACF® userid under which the CICS address space
is running, and applid is the region’s VTAM APPL name (taken from the APPLID
system initialization parameter). The CICS-supplied JOURNALMODEL
definitions for default DFHLOG and DFHSHUNT log streams are in the CSD
group DFHLGMOD, which is included in DFHLIST.
If you don’t want to use a system log (for example, in a CICS test region), specify
JOURNALMODEL resource definitions for DFHLOG and DFHSHUNT with
TYPE(DUMMY). Note that running CICS with the system log defined as
TYPE(DUMMY) forces you to perform an initial start, and CICS does not support
dynamic transaction backout.
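Such definitions might look like the following sketch (the JOURNALMODEL and
group names are illustrative assumptions):

   CEDA DEFINE JOURNALMODEL(DUMMYLOG) GROUP(TESTLOGS)
        JOURNALNAME(DFHLOG) TYPE(DUMMY)
   CEDA DEFINE JOURNALMODEL(DUMMYSHN) GROUP(TESTLOGS)
        JOURNALNAME(DFHSHUNT) TYPE(DUMMY)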
Otherwise, it is probably less work to define the log streams explicitly using
IXCMIAPU. Generally, dynamic creation using model log streams is best suited for
test and development purposes, and explicit definition for production regions.
The default model log stream names that CICS uses for dynamic creation of its
system log are always of the form &SYSNAME.DFHLOG.MODEL and
&SYSNAME.DFHSHUNT.MODEL, where &SYSNAME is the MVS symbol that
resolves to the system name of the MVS image.
Example: If a CICS region, running in an MVS image with a sysid of MV10, issues
a request to create a log stream for its primary system log, the system logger
expects to find a model log stream named MV10.DFHLOG.MODEL.
CICS invokes the XLGSTRM global user exit immediately before calling the MVS
system logger to create a log stream. If you don’t want to use CICS default values
for the creation of a log stream, you can write an XLGSTRM global user exit
program to modify the request details, including the model log stream name (in
parameter UEPMLSN).
Recovery considerations
If you are using coupling facility log streams, sharing structures between MVS
images provides some recovery advantages. If an MVS image or logger address
space fails, another surviving MVS image using the same log stream structures (not
necessarily the same log streams) is notified of the failure and can start immediate
log stream recovery for the log streams used by the failing MVS. Otherwise,
recovery is delayed until the next time a system connects to a log stream in the
affected structures, or until the failed system logger address space restarts.
However, model log streams defined with the CICS default name are always
assigned to the same structure within an MVS image. This may not give you the
best allocation in terms of recovery considerations if you are using structures
defined across two or more coupling facilities.
For example, consider a two-way sysplex that uses two coupling facilities, each with
one log structure defined for use by CICS system logs: structure LOG_DFHLOG_001
on coupling facility CF1, and structure LOG_DFHLOG_002 on CF2. In this situation,
it is better, from a recovery point of view, for the CICS regions in each MVS image
to balance their system log streams across both structures, rather than to have all
the regions in an image connect to the same structure.
[Figure 8 on page 109 shows this configuration, with the system log streams of the
CICS regions in each MVS image spread across LOG_DFHLOG_001 on CF1 and
LOG_DFHLOG_002 on CF2.]
Varying the model log stream name: To balance log streams across log
structures as shown in Figure 8 on page 109 while using model log streams, you
must customize the model log stream names; you cannot achieve the distribution of
log streams shown in this scenario using the CICS default model name.
You can use an XLGSTRM global user exit program to vary, in a number of ways,
the model log stream name to be passed to the system logger. One such way is to
store appropriate values in the exit’s global work area. For example, you can use
the INITPARM system initialization parameter to specify a parameter string for use
by the exit. This can be retrieved, using the EXEC CICS ASSIGN INITPARM
command, in the first-phase PLT program that you use to enable the XLGSTRM
global user exit program. Having obtained the relevant model log stream information
from the INITPARM command, you can store this in the global work area for later
use by your XLGSTRM global exit program. Varying the model log stream details in
this way enables you to balance log streams across different log structures in a
coupling facility.
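As a sketch, assuming a PLT initialization program named WLGSETUP is the
program that enables the exit (the program name and parameter string are
illustrative assumptions):

   INITPARM=(WLGSETUP='LOG_DFHLOG_002')

WLGSETUP can retrieve this string with EXEC CICS ASSIGN INITPARM, and store
the model log stream details in the exit's global work area when it enables the
XLGSTRM global user exit program.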
See the CICS Customization Guide for information about writing an XLGSTRM
global user exit program, and for information about PLT initialization programs.
Activity keypointing
During a restart, the CICS log manager scans the log backwards to recover unit of
work information. The log is scanned backwards as follows:
1. To begin with, CICS reads all records sequentially as far back as the last
complete activity keypoint written before termination. To minimize the time taken
for this scan, it is important that you do not specify an activity keypoint
frequency of zero. For information about setting a suitable activity keypoint
frequency for your CICS region, see the CICS System Definition Guide.
2. When CICS reaches the last-written complete activity keypoint, it extracts all the
information relating to in-flight units of work, including in-doubt units of work.
With this information, CICS continues reading backwards, but this time reading
only the records for those in-flight and in-doubt units of work, continuing until it
has read all the records for the oldest of them.
[A figure here shows the CICS system log stream as a sequence of data blocks
(…, n−1, n) leading up to the abnormal termination of CICS.]
Here are some steps you can take to ensure that system log stream sizes, and thus
restart times, are kept to a minimum:
v Keep to a minimum the amount of data that has to be read. This means
specifying an activity keypoint frequency that is non-zero, and which is:
– Long enough to avoid excessive keypointing
– Short enough to avoid large volumes of data between keypoints.
For information about calculating system log stream structure sizes, see the
CICS Installation Guide.
v Except for your own recovery records that you need during emergency restart, do
not write your own data to the system log (for example, audit trail data).
v In particular, do not write information to the system log that needs to be kept.
(See “Avoiding retention periods on the system log” on page 113).
v If you write a long-running transaction that updates recoverable resources,
ensure that it takes a syncpoint at regular intervals.
Note: Do not specify AKPFREQ=0, because without activity keypoints CICS cannot
perform log tail deletion until shutdown, by which time the system log will
have spilled onto DASD.
Log-tail deletion
The log tail is the oldest end of the log. At each activity keypoint, the CICS recovery
manager requests the log manager to delete the tail of the system log by
establishing a point on the system log before which all older data blocks can be
deleted. Thus, if the oldest “live” unit of work is in data block x, the CICS log
manager requests the system logger to delete all data blocks older than x (x−1 and
older).
Note: Long-running units of work that regularly initiate writes to the system log can
prevent CICS from deleting completed units of work stretching back over
many activity keypoints. See “Long-running transactions” on page 113.
Moving units of work to the secondary log: All units of work start on the
primary log stream. From this, they are either deleted by the log-tail deletion
mechanism, or shunted to the secondary log stream. Eligible long-running units of
work, including failed units of work, are shunted to the secondary log during activity
keypointing, leaving a pointer in front of the latest activity keypoint in the primary log
stream. For the purpose of log management, a long-running unit of work is eligible
for shunting if it persists for two complete activity keypoints without initiating
any writes to the system log.
After it has been shunted, a unit of work remains on the secondary log until it is
completed. Note that, after a unit of work is shunted to the secondary log,
subsequent writes to the system log by the unit of work are directed to the primary
log.
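The delete-or-shunt decision made for each unit of work at an activity keypoint can be summarized in a short sketch (illustrative Python, not CICS code; the two-keypoint eligibility rule is taken from the text above, and the data layout is invented):

```python
def keypoint_disposition(uow, current_akp):
    """Decide what happens to a unit of work's primary log records at an
    activity keypoint: completed work is removed by log-tail deletion,
    a long-quiet live unit of work is shunted to the secondary log
    stream, and recently active work stays on the primary log."""
    if uow["complete"]:
        return "delete"
    if current_akp - uow["last_write_akp"] >= 2:
        return "shunt-to-secondary"
    return "keep-on-primary"
```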
User-written recovery records are managed in the same way as recovery manager
records, and do not prevent CICS from performing log-tail deletion:
v User-written recovery records written by user transactions are treated as part of
the unit of work in which they are written—that is, they form part of the UOW
and are deleted or shunted along with the rest of that UOW's log records.
Retrieving user records from the system log: During warm and emergency
restarts, CICS scans the system log backwards. If, during this backwards scan,
CICS finds active user records, they are presented to a global user exit program
enabled at the XRCINPT exit point.
If CICS completes its scan of the system log and finds there are no active user
records, it issues message DFHER5731.
You are strongly recommended not to use the system log for records that need to
be kept. Any log and journal data that needs to be preserved should be written to a
general log stream. See the CICS System Definition Guide for advice on how to
manage log stream data sets.
Long-running transactions
Do not design long-running transactions in such a way that they write frequently to
the system log stream, in a single unit of work, across multiple activity keypoints.
(Figure: a long-running unit of work writing records to the system log stream
across several activity keypoints (AKPs).)
To recover from a DASD failure, first restore the most recent backup to a new data
set. Then use a forward recovery utility, such as CICS VSAM Recovery (CICSVR),
to apply all the updates that were written to a forward recovery log stream after the
backup date.
Note: Define a data set as recoverable if you want forward recovery logging.
Neither CICS nor VSAM provides any support for forward recovery logging
for a nonrecoverable data set.
Otherwise, it is probably less work to define the log streams explicitly using
IXCMIAPU.
The default model log stream names that CICS uses for dynamic creation of a
general log stream are of the form LSN_qualifier1.LSN_qualifier2.MODEL where the
first and second qualifiers are the CICS region userid and the CICS APPLID,
respectively.
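For example, the default name can be built as follows (illustrative Python; the region userid and APPLID values shown are invented):

```python
def default_model_log_stream_name(region_userid, applid):
    """Default model log stream name used for dynamic creation of a
    general log stream: region userid, then APPLID, then MODEL."""
    return f"{region_userid}.{applid}.MODEL"

# A region running under userid CICSRS1 with APPLID CICSHA11 would use
# the model log stream CICSRS1.CICSHA11.MODEL.
name = default_model_log_stream_name("CICSRS1", "CICSHA11")
```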
See “Model log streams for CICS system logs” on page 108 for information about
using an XLGSTRM global user exit program to modify requests to create a log
stream.
Your decision on how to define and allocate forward recovery log streams is a
trade-off between transaction performance, fast recovery, and having a large
number of log streams to manage.
The log of logs contains records that are written each time a file is opened or
closed. At file-open time, a tie-up record is written that identifies:
v The name of the file
v The name of the underlying VSAM data set
v The name of the forward recovery log stream
v The name of the CICS region that performed the file open.
The log of logs helps forward recovery utilities maintain an index of log data
sets.
Depending on the options specified on the SUBSYS parameter, general log stream
records are presented either:
v In a record format compatible with utility programs written for releases of CICS
that use the journal control program, DFHJUP, for logging and journaling
v In a format compatible with utility programs written for versions of CICS that use
the log manager for logging and journaling.
See the CICS Operations and Utilities Guide for more information about using the
LOGR SSI to access log stream data, and for sample JCL.
If you plan to write your own utility program to read log stream data, see the CICS
Customization Guide for information about log stream record formats.
CICS generally obtains and stores the local time offset at specific times of day only
(for example, at startup, and midnight). Thus you should execute this command
whenever you change the system date or time-of-day while CICS is running, to
ensure that the correct local time is used by all CICS functions, including the API.
For example, whenever an application program issues an EXEC CICS ASKTIME
command, CICS obtains the current time from the MVS TOD clock, and modifies
this by the stored local time difference. CICS then updates the EIBTIME field in the
exec interface block with the local time.
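The effect of the stored offset can be sketched as follows (illustrative Python, not CICS code; the offset value is invented). If the system offset changes while CICS is running, the stored value is stale until it is refreshed:

```python
from datetime import datetime, timedelta, timezone

def eibtime_local(now_utc, stored_offset_minutes):
    """Local time as CICS computes it for EIBTIME: the current UTC time
    from the TOD clock, adjusted by the *stored* local time offset
    (which CICS obtains only at specific times, such as startup)."""
    return now_utc + timedelta(minutes=stored_offset_minutes)

noon_utc = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
local = eibtime_local(noon_utc, -300)  # a stored offset of -5 hours
```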
For general logs, in addition to time-stamping as in system logs, CICS also includes
local time in the journal records.
During a restart, for system recovery purposes, CICS reads the youngest—most
recently written—record from the primary log stream. Thereafter, CICS uses only
direct reads using block ids and does not rely upon time stamps. CICS also uses
direct read with block ids to retrieve the logged data for transaction backout
purposes, again without any dependence on time stamps.
Changing local time backwards will not affect the operation of DFHJUP provided
you specify the GMT option on the SUBSYS parameter of the log stream DD
statement in the DFHJUP JCL.
However, if you use local time on the SUBSYS parameter to specify the partitioning
of a log stream for processing by DFHJUP, you must take steps to ensure the
chronological sequence of time-stamps when adjusting clocks backwards. You can
do this by stopping CICS regions until the new local time passes the old time at
which the change was made.
User- or vendor-written journal utilities and DFHJUP exit programs may also be
sensitive to local time changes. These should be checked to ensure that there are
no problems posed by backwards time changes.
Forward recovery utilities (but not CICS VSAM Recovery 2.3) may also be sensitive
to the time sequence of forward recovery log data. If you are not using CICSVR
2.3, check that your forward recovery utility can handle discontinuities in logged
records.
If you use a backup taken earlier than the new local time, or if you specify GMT,
CICSVR handles forward recovery successfully.
For remote transactions, in addition to the above, CICS provides in-doubt options to
specify the recovery action in the event of in-doubt failures.
If you’ve specified WAIT(YES), the action is not taken unless the interval
specified on WAITTIME expires before recovery occurs.
You can use WAIT and WAITTIME to allow an opportunity for normal recovery
and resynchronization to take place, while ensuring that a transaction commits
or backs out within a reasonable time. See CICS Intercommunication Guide for
information about defining wait times for distributed transactions.
For more information about recovery of distributed units of work, see the CICS
Intercommunication Guide.
For more information about options on the TRANSACTION definition, see the CICS
Resource Definition Guide.
More than one file can refer to the same data set.
When designing your applications, ensure that application data is protected from
transaction failures that could lead to data corruption, and that you can recover from
accidental damage to storage devices.
When deciding on the access method, consider the recovery and restart facilities
provided by each. These considerations are discussed in the following sections.
VSAM files
CICS file control supports three VSAM access modes—local shared resources
(LSR), non-shared resources (NSR), and record-level sharing (RLS). This section
discusses recovery considerations for VSAM files accessed in these modes.
Forward recovery
For VSAM files, you can use a forward recovery utility, such as CICSVR, when
online backout processing has failed as a result of some physical damage to the
data set. For forward recovery:
v Create backup copies of data sets.
v Record after-images of file changes in a forward recovery stream. CICS does this
for you automatically if you specify that you want forward recovery support for the
file.
v Write a job to run a forward recovery utility, and keep control of backup data sets
and log streams that might be needed as input. CICSVR can construct the
recovery jobs for you automatically.
Backward recovery
To ensure that VSAM files are backward recoverable, consider the following
points:
v Key-sequenced data sets (KSDS) and both fixed- and variable-length relative
record data sets (RRDS):
– If the files referring to KSDS or RRDS data sets are designated as
recoverable with LOG(ALL) specified, CICS can back out any updates,
additions, and deletions made by an interrupted unit of work.
– For information about backout failures, see “Backout-failed recovery” on
page 75.
v Entry-sequenced data sets (VSAM-ESDS):
– New records are added to the end of a VSAM-ESDS. After they have been
added, records cannot be physically deleted. A logical deletion can be made
only by modifying data in the record; for example, by flagging the record with
a “logically deleted” flag, using an XFCLDEL global user exit program.
See “Transaction backout” on page 71 for more information.
Backout for BDAM data sets is the same as for ESDS data sets in that you cannot
delete records from the data set.
You specify recovery options, including forward recovery, either in the integrated
catalog facility (ICF) catalog (if you are using DFSMS 1.3 or later), or in the CICS
file resource definition, as follows:
v If your VSAM data sets are accessed by CICS in RLS mode, the recovery
attributes must be defined in the ICF catalog.
v If your VSAM data sets are accessed by CICS in non-RLS mode, recovery
attributes can be defined in either the file resource definition or the ICF catalog.
Note that CICS uses the ICF catalog definitions only when you specify RLS=YES
as a system initialization parameter—if you specify RLS=NO, recovery attributes
are always taken from the CICS file definition.
If you use the ICF catalog to define attributes for data sets accessed in non-RLS
mode, the ICF catalog entry recovery attributes override the CICS file resource
definition.
v You define the recovery attributes for BDAM files in file entries in the file control
table (FCT).
File control also writes the following to the forward recovery log:
FILE_OPEN tie-up records
FILE_CLOSE tie-up records
TAKE_KEYPOINT tie-up records
Data set BACKUP tie-up records
See the CICS Customization Guide for details of all these forward recovery
records.
Note: The use of autojournal records for forward recovery purposes is not
recommended for VSAM files.
RECOVERY(BACKOUTONLY)
If you specify RECOVERY(BACKOUTONLY), CICS provides backout recovery
only, writing only before-images to the system log.
BACKUPTYPE(DYNAMIC)
If you specify BACKUPTYPE(DYNAMIC), CICS supports the DFSMS
backup-while-open (BWO) facility, enabling the data set to be backed up while
open. If you specify STATIC, all CICS files open against a data set must be
closed before it can be backed up. For information about the backup-while-open
(BWO) facility, see “Chapter 19. Backup-while-open (BWO)” on page 217.
The VSAM parameters LOG and LOGSTREAMID, on the access methods services
DEFINE CLUSTER and ALTER commands, determine recoverability for the entire
sphere. Locating these recovery parameters in the ICF catalog enforces the same
options, for all CICS regions in the sysplex, for all the files opened against a given
sphere.
LOG({NONE|UNDO|ALL})
Specifies the type of recovery required for the VSAM sphere. Specify the LOG
parameter for data sets that are to be used by CICS in RLS mode.
NONE
The sphere is not recoverable.
UNDO
The sphere is recoverable. CICS must maintain system log records for
backout purposes.
ALL
The sphere is recoverable for both backout and forward recovery. CICS
must maintain system log records (as for UNDO) and forward recovery log
records. If you specify LOG(ALL), also specify LOGSTREAMID to indicate
the name of the forward recovery log.
If you omit the LOG parameter when defining your VSAM data sets, recovery is
assumed to be UNDEFINED, and the data set cannot be opened in RLS mode. You
can also set the UNDEFINED status explicitly by specifying NULLIFY(LOG).
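The effect of the three LOG values, and of omitting LOG, can be tabulated in a small sketch (illustrative Python; the dictionary layout is invented):

```python
def logging_required(log=None):
    """What CICS must maintain for a VSAM sphere, given its LOG
    attribute. An omitted LOG leaves recovery UNDEFINED, and the data
    set cannot be opened in RLS mode."""
    if log is None:
        return {"status": "UNDEFINED", "backout": False,
                "forward": False, "rls_open": False}
    backout, forward = {"NONE": (False, False),
                        "UNDO": (True, False),
                        "ALL":  (True, True)}[log]
    return {"status": log, "backout": backout,
            "forward": forward, "rls_open": True}
```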
For information about the access methods services DEFINE and ALTER
commands, see DFSMS/MVS Access Method Services for ICF, SC26-4906, and
DFSMS/MVS Using Data Sets, SC26-4922.
Inquiring on recovery attributes: You can use CEMT, or EXEC CICS, INQUIRE
FILE and INQUIRE DSNAME commands to determine the recovery options that are
specified for files and data sets. The INQUIRE FILE command shows the options
from the CICS file definition until the first file for the data set is opened. If the
options are obtained from the ICF catalog when the first file is opened, the ICF
catalog values are returned. The INQUIRE DSNAME command returns values from
the VSAM base cluster block (BCB). However, because BCB recovery values are not
set until the first open, if you issue an INQUIRE DSNAME command before the
first file is opened, CICS returns NOTAPPLIC for RECOVSTATUS.
Although CICS does not provide forward recovery support for BDAM files, you
can use the autojournal records to provide your own facility. JREQ=(WU,WN) is
the equivalent of the CSD file definition parameters JNLUPDATE(YES)
combined with JNLADD(BEFORE), providing the necessary images for forward
recovery to a journal specified by JID=nn.
For information about defining file resource definitions, see the CICS Resource
Definition Guide.
The first file open for the base data set determines the base data set recovery
attributes, and these are stored in the base cluster block (BCB). If, on a subsequent
file open request, CICS detects an inconsistency between the file definition recovery
attributes and those stored in the BCB, the open request fails.
See “Inquiring on recovery attributes” on page 124 for information about
finding the recovery attributes for files and data sets.
The order in which files are opened for the same base data set determines the
content of the message received on suppression of an open failure using
XFCNREC. If the base cluster block is set as unrecoverable and a mismatch has
been allowed, access to the data set could be allowed through an unrecoverable
file before the data set is fully recovered.
See the CICS Customization Guide for programming information about XFCNREC.
CICS takes the actions shown in the following list when opening a file for update
processing in non-RLS mode—that is, when the file definition specifies
RLSACCESS(NO), and the operations parameters specify ADD(YES), DELETE(YES), or
UPDATE(YES) (or the equivalent SERVREQ parameters in the FCT entry). If you set
only READ(YES) and/or BROWSE(YES), CICS does not make these consistency checks.
These checks are not made at resource definition or install time.
v If a file definition refers to an alternate index (AIX) path and RECOVERY is ALL
or BACKOUTONLY, the AIX must be in the upgrade set for the base. This means
that any changes made to the base data set are also reflected in the AIX. If the
AIX is not in the upgrade set, the attempt to open the ACB for this AIX path fails.
v If a file is the first to be opened against a base cluster after the last cold start,
the recovery options of the file definition are copied into the base cluster block.
v If a file is not the first to be opened for update against a base cluster after the
last cold start, the recovery options in the file definition are checked against
those copied into the base cluster block by the first open. There are the following
possibilities:
– Base cluster has RECOVERY(NONE):
- File is defined with RECOVERY(NONE): the open proceeds.
- File is defined with RECOVERY(BACKOUTONLY): the attempt to open the
file fails, unless overridden by an XFCNREC global user exit program,
which can allow inconsistencies in backout settings for files that are
associated with the same base data set.
- File is defined with RECOVERY(ALL): the open fails.
– Base cluster has RECOVERY(BACKOUTONLY):
- File is defined with RECOVERY(NONE): the attempt to open the file fails
unless overridden by an XFCNREC global user exit program to allow
inconsistencies in backout settings for files associated with the same base
data set.
- File is defined with RECOVERY(BACKOUTONLY): the open proceeds.
- File is defined with RECOVERY(ALL): the open fails.
– Base cluster has RECOVERY(ALL):
- File is defined with RECOVERY(NONE): the open fails.
- File is defined with RECOVERY(BACKOUTONLY): the open fails.
- File is defined with RECOVERY(ALL): the open proceeds.
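The checks in the list above reduce to a small decision function. The following is a sketch in Python, not CICS code; it also assumes, by symmetry with the other cases, that a file defined with RECOVERY(ALL) opens successfully against a base cluster recorded with RECOVERY(ALL):

```python
def non_rls_open_check(base_recovery, file_recovery, xfcnrec_allows=False):
    """Consistency check made when a file is opened for update in
    non-RLS mode against a base cluster whose recovery options are
    already recorded in the base cluster block (BCB).

    A mismatch between NONE and BACKOUTONLY can be overridden by an
    XFCNREC global user exit; any mismatch involving ALL always fails.
    """
    if base_recovery == file_recovery:
        return "open proceeds"
    if "ALL" in (base_recovery, file_recovery):
        return "open fails"
    # NONE versus BACKOUTONLY, in either direction:
    return "open proceeds" if xfcnrec_allows else "open fails"
```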
Any failure to open a file against a data set results in a message to the console. If
necessary, the recovery options must be changed. To change the recovery
attributes (held in the base cluster block) of a VSAM data set, you can use the
CEMT, or EXEC CICS, SET DSNAME REMOVE command. This deletes the base
cluster block, so CICS has no record of prior recovery settings for the VSAM data
set. The next file to open against this data set causes a new base cluster block to
be built and, if the file is opened for update, the data set takes on the recovery
attributes of this file.
The base cluster block, together with its recovery attributes, and the inconsistency
condition that may be set if you are using XFCNREC, are preserved even when all
the files relating to the block are closed, and across warm and emergency restarts.
For details of procedures for performing forward recovery, see “Chapter 17. Forward
recovery procedures” on page 179.
For programming information about the format of log and journal records, see the
CICS Customization Guide.
Backward recovery
CICS can recover only intrapartition transient data. The intrapartition data set is a
VSAM-ESDS data set, with a file name of DFHINTRA. (For more information about
allocation and space requirements, see the CICS System Definition Guide.) For
extrapartition transient data considerations, see “Recovery for extrapartition
transient data” on page 129.
You must specify the name of every intrapartition transient data queue that you
want to be recoverable in the queue definition. The recovery attributes you can
specify for an intrapartition transient data queue are:
v Logical
v Physical
v No recovery
Logical recovery
If you request logical recovery on an intrapartition queue definition, changes to a
transient data queue by an interrupted UOW are backed out. Backout occurs
dynamically in the case of a task abend, or at a CICS emergency restart in the
case of a CICS failure.
As a general rule, you should request logical recoverability. For example, if you
make related changes to a set of resources that includes intrapartition transient
data, and you want to commit (or back out) all the changes, you require logical
recovery.
Physical recovery
Physical recoverability is unique to transient data and is effective on both warm and
emergency restarts. By requesting physical recovery on an intrapartition queue
definition, you ensure that changes to the queue are committed immediately and,
with one exception, are not backed out.
The exception is in the case of the last read from a physically recoverable queue
before a unit of work fails. CICS always backs out the last read from a physically
recoverable transient data queue. In terms of the read and write pointers that CICS
maintains for TD queues, this means that the read pointer is reset, but the write
pointer never changes. This is illustrated by the diagram in Figure 12. The
sequence of TD actions in this example, and the subsequent recovery, is as follows:
v The unit of work has read items 1 and 2, leaving the read pointer at item 3.
v The unit of work has written item 4, leaving the write pointer ready for the next
item to be written.
v CICS abends and is restarted with an emergency restart.
v As a result of the transient data recovery, the read pointer is reset to item 2,
queue items 2, 3, and 4 are still available, and the write pointer is restored.
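The pointer behaviour in this example can be modelled with a short sketch (illustrative Python; the class and method names are invented):

```python
class PhysicallyRecoverableTDQueue:
    """Model of the read/write pointers described above: writes are
    committed immediately and never backed out; only the last read
    before the failure is backed out, by stepping the read pointer
    back one item."""
    def __init__(self, items):
        self.items = list(items)   # item numbers already on the queue
        self.read_ptr = 0          # index of the next item to read

    def readq(self):
        item = self.items[self.read_ptr]
        self.read_ptr += 1
        return item

    def writeq(self, item):
        self.items.append(item)    # the write pointer only moves forward

    def emergency_restart(self):
        # Back out only the last read: reset the read pointer by one.
        if self.read_ptr > 0:
            self.read_ptr -= 1

q = PhysicallyRecoverableTDQueue([1, 2, 3])
q.readq(); q.readq()   # the unit of work reads items 1 and 2
q.writeq(4)            # the unit of work writes item 4
q.emergency_restart()  # CICS abends; restart backs out the last read
# Items 2, 3, and 4 are still available; the write of item 4 is kept.
```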
No recovery
Recovery is not performed if you specify NO on the recovery attribute of an
intrapartition transient data definition.
Forward recovery
CICS does not provide forward recovery support for transient data. If you want
forward recovery of intrapartition transient data, provide application programs to
record the changes made to the contents of your intrapartition transient data
queues while CICS is running. Changes are recorded in a user journal. The
information journaled must include:
v Each WRITEQ, including the data that is written
v Each READQ
v Each DELETEQ of a queue
v For logically recoverable queues, each backout, syncpoint, or syncpoint rollback
You must provide the application program to rebuild the data by reading the
journaled information and applying that information to the transient data queue. Your
application program could run in the program list table (PLT) phase or after
emergency restart. Until the data set is fully recovered, do not write to the queue,
because that would probably result in wrongly-ordered data, and a read might not
provide valid data (or any data at all). For these reasons, running the recovery
program in the PLT phase is probably preferable to running it after the restart.
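Assuming journal records of the kind listed above, a recovery program might rebuild a queue's content as follows (illustrative Python; the record layout is invented, and backout and syncpoint records for logically recoverable queues are omitted for brevity):

```python
def rebuild_queue(journal):
    """Reapply journaled transient data operations to reconstruct the
    unread content of a queue. `journal` is a list of (operation, data)
    pairs: WRITEQ (with the data written), READQ, or DELETEQ."""
    queue = []
    reads = 0
    for op, data in journal:
        if op == "WRITEQ":
            queue.append(data)
        elif op == "READQ":
            reads += 1
        elif op == "DELETEQ":
            queue, reads = [], 0
    return queue[reads:]   # items written but not yet read
```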
If you do not have such a recovery strategy and you perform a cold start with a
corrupted intrapartition data set, you will lose the contents of the intrapartition data
set. You also lose the contents of the intrapartition data set if you specify
TDINTRA=EMPTY as a system initialization parameter.
The application programs then issue READQ TD commands to read and process
extrapartition input records. In this way, they accumulate the total of input records
read and processed during execution for each queue. The total number of READQ
operations is written to a journal data set, together with the relevant destination
identifications. This journaling should be done immediately before RETURN or
SYNCPOINT commands.
Following output of the journal record, each application program dequeues itself
from the extrapartition input queues to permit other application programs to access
those queues.
If uncontrolled shutdown occurs before this journaling, no records will appear on the
journal data set for that unit of work. The effect of that in-flight task is, therefore,
automatically backed out on emergency restart. However, if the journal record is
written before uncontrolled shutdown, this completed input data set processing will
be recognized on emergency restart.
You can identify an extrapartition input recovery program in the PLT for execution
during the initialization phase. This program reads the journal data set forward.
Each journaled record indicates the number of READQ operations performed on the
relevant extrapartition input data set during previous execution of application
programs. The same number of READQ TD commands is issued again by the
recovery program, to the same input queue that was referenced previously.
On reaching the end of the journal data set, the extrapartition input data sets are
positioned at the same point they had reached before the initiation of tasks that
were in-flight at uncontrolled shutdown. The result is the logical recovery of these
input data sets with in-flight task activity backed out.
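The replay step can be sketched as follows (illustrative Python; the journal record layout and destination names are invented). Only reads that were journaled, that is, committed, are counted, so in-flight activity is automatically excluded:

```python
def readqs_to_reissue(journal_records):
    """Total the journaled READQ counts per destination. On restart,
    the recovery program reissues this many READQ TD commands against
    each extrapartition input queue, repositioning the data sets to
    the point that committed work had reached."""
    totals = {}
    for destination, readq_count in journal_records:
        totals[destination] = totals.get(destination, 0) + readq_count
    return totals
```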
For a tape output data set, use a new output tape on restart. You can then use the
previous output tape if you need to recover information recorded before termination.
To avoid losing data in tape output buffers on termination, you can write unblocked
records. Alternatively, write the data to an intrapartition disk destination (recovered
by CICS on a warm start or emergency restart) and periodically copy it to the
extrapartition tape destination through an automatically initiated task. On
termination, the data is still available to be recopied on restart.
If a controlled shutdown of CICS occurs, the previous output tape closes correctly
and writes a tape mark. However, on an uncontrolled shutdown such as a power
failure or machine check, a tape mark is not written to indicate the end of the tape.
On restart, the page that was being processed at the time of failure can be
identified from the journal data set, and that page can be reprocessed to reproduce
the same output. Alternatively, use an intermediate intrapartition destination (as
previously described) for tape output buffers.
There are two PLT phases. The first phase occurs before the system initialization
task is attached, and should not use CICS resources, because initialization is
incomplete. The first phase is intended solely to enable exits that are needed during
recovery processing. The second phase occurs after CICS initialization is complete
and, at this point, you may use PLT programs to customize the environment.
For information on how to code the PLT, see the CICS Resource Definition Guide.
For programming information about the special conditions that apply to PLT
programs, see the CICS Customization Guide.
| Backward recovery
| Temporary storage queues that are to be recoverable by CICS must be on auxiliary
| temporary storage.
| CICS continues to support temporary storage tables (TST) and you could define
| recoverable TS queues in a TST, as shown in the following example:
| DFHTST TYPE=RECOVERY,
| DATAID=(DF,**,
| $$(,character-string)...)
| The DATAID DF makes the temporary storage queues used on CICS start requests
| recoverable.
| The DATAID character string represents the leading characters of each temporary
| storage queue identifier that you want to be recoverable. For example,
| DATAID=(R,ZIP) makes recoverable all temporary storage queues that have
| identifiers starting with the character “R” or the characters “ZIP”.
| For more information on allocation and space requirements, see the CICS
| Operations and Utilities Guide.
Forward recovery
If you want forward recovery of temporary storage:
1. Record the changes made to temporary storage during the current CICS run;
you must provide application programs to do this. At emergency restart time,
you can then delay the emergency restart (by using PLTPI, for example) and,
again using application programs, rebuild as much as possible of the temporary
storage data using the records previously read.
2. Repeat the emergency restart but with the system initialization parameters
amended to cold-start temporary storage (TS=(COLD)). Note, however, that this
loses the contents of the entire temporary storage data set.
Application design
This section tells you how to design your applications so that they take advantage
of the CICS recovery facilities.
Of the files and databases that can be accessed, specify those that are to be
updated (as distinct from those that are only to be read).
Consistency of updates: Specify which (if any) updates must happen in step with
each other to ensure integrity of data. For example, in an order-entry application, it
may be necessary to ensure that a quantity subtracted from the inventory file is, at
the same time, added to the to-be-shipped file.
Immediacy of updates: Specify when newly entered data must or can be applied
to the files or databases. Possibilities include:
v The application processing unit updates the files and databases as soon as the
data is accepted from the user.
v The application processing unit accumulates updates for later action, for
example:
– By a later processing unit within the same application.
– By a batch application that runs overnight. (If you choose this option, make
sure that there is enough time for the batch work to complete the number of
updates.)
You will need the above information when deciding on the internal design of
application processing units.
This information is needed when deciding what resources are required by each
processing unit (see “Mechanisms for passing data between transactions” on
page 137).
To use the SAA resource recovery interface, you need to include SAA resource
recovery commands in your applications in place of EXEC CICS SYNCPOINT
commands. This book refers only to CICS API resource recovery commands; for
information about the SAA resource recovery interface, see the CPI Resource
Recovery Reference manual.
Program design
This section tells you how to design your programs to use the CICS recovery
facilities effectively.
Conversational processing
With conversational processing, the transaction continues to run as a task across all
terminal interactions—including the time it takes for the user to read output and
enter input. While it runs, the task retains resources that may be needed by other
tasks. For example:
v The task occupies storage, and locks database records, for a considerable period
of time. Also, in the event of a failure and subsequent backout, all the updates to
files and databases made up to the moment of failure have to be backed out
(unless the transaction has been subdivided into UOWs).
v If the transaction uses DL/I, and the number of scheduled PSBs reaches the
maximum allowed, tasks needing to schedule further PSBs have to wait.
Pseudoconversational processing
With pseudoconversational processing, successive terminal interactions with the
user are processed as separate tasks—usually consisting of one UOW each. (This
approach can result in a need to communicate between tasks or transactions (see
“Mechanisms for passing data between transactions” on page 137) and the
application programming can be a little more complex than for conversational
processing.)
However, at the end of each task, the updates are committed, and the resources
associated with the task are released for use by other tasks. For this reason,
pseudoconversational transactions are generally preferred to the conversational
type.
When several terminal interactions with the user are related to each other, data for
updates should accumulate on a recoverable resource (see “CICS recoverable
resources for communication between transactions” on page 137), and then be
applied to the database in a single task (for example, in the last interaction of a
conversation). In the event of a failure, emergency restart or dynamic transaction
backout would need to back out only the updates made during that individual step;
the application would be responsible for restarting at the appropriate point in the
conversation. This may involve re-creating a screen format.
Bear in mind, however, that other tasks may try to update the database between
the time when update information is accepted, and the time when it is applied to the
database. Design your application to ensure that no other application can update
the database at a time when it would corrupt your updating.
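One common way to guard against such interference is an optimistic check. The COBOL sketch below (all file and field names, such as CUSTFILE and CUST-VERS, are hypothetical) carries a version number in the COMMAREA and rejects the update if another task has changed the record since it was first displayed:

```cobol
      * Sketch only: CUSTFILE, CUST-REC, CUST-VERS, and the CA- fields
      * are hypothetical names. The version number saved when the data
      * was first displayed travels in the COMMAREA; if another task
      * has updated the record in the meantime, the update is rejected
      * rather than applied over the newer data.
           EXEC CICS READ FILE('CUSTFILE') UPDATE
                RIDFLD(CUST-KEY) INTO(CUST-REC)
           END-EXEC.
           IF CUST-VERS NOT = CA-SAVED-VERS
      *       Another task changed the record: release the lock and
      *       ask the user to review and re-enter the update
              EXEC CICS UNLOCK FILE('CUSTFILE') END-EXEC
              PERFORM SEND-RETRY-MESSAGE
           ELSE
              ADD 1 TO CUST-VERS
              MOVE CA-NEW-DATA TO CUST-DATA
              EXEC CICS REWRITE FILE('CUSTFILE') FROM(CUST-REC)
              END-EXEC
           END-IF.
```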
CICS does not log changes to these areas (except as noted later in this section).
Therefore, in the event of an uncontrolled shutdown, data stored in any of these
areas is lost, which makes them unsuitable for applications needing to retain data
between transactions across an emergency restart. Also, some of these storage
areas can cause inter-transaction affinities, which are a hindrance to dynamic
transaction routing. To avoid inter-transaction affinities, use either a COMMAREA or
the TCTUA. For information about intertransaction affinities, see the CICS
Application Programming Guide.
The advantages of main storage areas are realized only where recovery is not
important, or when passing data between programs servicing the same task.
Note: Design programs so that they do not rely on the presence or absence of
data in the COMMAREA to indicate whether or not control has been passed
to the program for the first time (for example, by testing for a data length of
zero). Consider the abend of a transaction where dynamic transaction
backout and automatic restart are specified. After the abend, a COMMAREA
could be passed to the next transaction from the terminal, even though the
new transaction is unrelated. Similar considerations apply to the terminal
control table user area (TCTUA).
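One defensive pattern is to verify a specific eye-catcher field set by the previous transaction in the sequence, rather than relying on EIBCALEN alone. In this sketch, CA-EYE and the value 'MYAP' are hypothetical names:

```cobol
      * Sketch: treat the COMMAREA as valid only if it carries the
      * eye-catcher set by the previous transaction in this sequence.
      * CA-EYE and 'MYAP' are hypothetical names.
           IF EIBCALEN = 0 OR CA-EYE NOT = 'MYAP'
              PERFORM FIRST-TIME-PROCESSING
           ELSE
              PERFORM CONTINUE-CONVERSATION
           END-IF.
```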
CICS can return all these to their status at the beginning of an in-flight UOW if an
abnormal task termination occurs.
User files and DL/I and DB2 databases: You can dedicate files or database
segments to communicating data between transactions.
Transactions can record the completion of certain functions on the dedicated file or
database segment. A progress transaction (whose purpose is to tell the user what
updates have and have not been performed) can examine the dedicated file or
segment.
In the event of physical damage, user VSAM files, DL/I, and DB2 databases can be
forward recovered.
Coupling facility data tables: Coupling facility data tables updated using the
locking model, and which are recoverable after a unit of work failure, can be a
useful means of passing data between transactions. Unlike UMTs, coupling facility
data tables are recoverable in the event of a CICS failure, a CFDT server failure, or
an MVS failure. However, they are not forward recoverable.
Of the three methods, the second (sorting data items into an ascending
sequence by programming) is most widely accepted.
Note that, if you allow updates on a data set through the base and one or more
alternate index (AIX) paths, or through multiple AIX paths, sequencing record
updates may not provide protection against transaction deadlock. You are not
protected because the different base key sequences will probably not all be in
ascending (or descending) order. If you do allow updates through multiple paths,
and if you need to perform several record updates, always use a single path or
the base. Define such a procedure in your installation standards.
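The key-sequencing technique can be sketched as follows. This is an illustration only; KEY-TABLE, ACCTFILE, and the other names are hypothetical:

```cobol
      * Sketch: sort the keys to be updated into ascending order, then
      * perform the READ UPDATE/REWRITE pairs in that order, so that
      * concurrent tasks always acquire record locks in the same
      * sequence and cannot deadlock on these records.
           SORT KEY-TABLE ASCENDING KEY-VALUE.
           PERFORM VARYING I FROM 1 BY 1 UNTIL I > KEY-COUNT
              MOVE KEY-VALUE(I) TO RID-KEY
              EXEC CICS READ FILE('ACCTFILE') UPDATE
                   RIDFLD(RID-KEY) INTO(ACCT-REC)
              END-EXEC
              PERFORM APPLY-UPDATE
              EXEC CICS REWRITE FILE('ACCTFILE') FROM(ACCT-REC)
              END-EXEC
           END-PERFORM.
```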
The PROTECT option on a START request ensures that, if the task issuing the
START fails during the UOW, the new task will not be initiated, even though its start
time may have passed. (See “START requests” on page 72 for more information
about the PROTECT option.)
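A minimal sketch of a protected START follows; the transaction name 'TRNB' and the interval are hypothetical:

```cobol
      * Sketch: PROTECT defers the START until this task's next
      * syncpoint; if the task fails first, the START is backed out
      * and the new task is never initiated.
           EXEC CICS START TRANSID('TRNB')
                INTERVAL(000030) PROTECT
           END-EXEC.
```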
Consider also the possibility of a started task that fails. Unless you include abend
processing in the program, only the master terminal will know about the failure. The
abend processing should analyze the cause of failure as far as possible, and restart
the task if appropriate. Ensure that either the user or the master terminal operator
can take appropriate action to repeat the updates. You could, for example, allow the
user to reinitiate the task.
In cases where the application requires the reply to consist of a large amount of
data that cannot all be viewed at one time (such as data required for browsing),
several techniques are available, including:
v Terminal paging through BMS
v Using transient data queues
The application program should then send a committed output message to the user
to say that the task is complete, and that the output data is available in the form of
terminal pages.
If an uncontrolled termination occurs while the user is viewing the pages of data,
those pages are not lost (assuming that temporary storage for BMS is designated
as recoverable). After emergency restart, the user can resume terminal paging by
using the CSPG CICS-supplied transaction and terminal paging commands. (For
more information about CSPG, see the CICS Supplied Transactions manual.)
Such queuing can be done on a transient data queue associated with a terminal. A
special transaction, triggered when the terminal is available, can then format and
present the data.
The actions taken by CICS are described under “Chapter 8. Unit of work recovery
and abend processing” on page 69 and “Processing operating system abends and
program checks” on page 92.
Transaction failures
When a transaction fails, the following CICS facilities can be invoked during and
after the abend process:
v CICS condition handling
v HANDLE ABEND commands, and user exit code
v The SYNCPOINT ROLLBACK command
v Dynamic transaction backout (DTB)
v Transaction restart after DTB
v The program error program (DFHPEP)
These facilities can be used individually or together. During the internal design
phase, specify which facilities to use and determine what additional (application or
systems) programming may be involved.
For example, if file input and output errors occur (where the default action is merely
to abend the task), you may wish to inform the master terminal operator, who may
decide to terminate CICS, especially if one of the files is critical to the application.
Your installation may have standards relating to the use of RESP options or
HANDLE CONDITION commands. Review these for each new application.
Your installation may have standards relating to the use of HANDLE ABEND
commands; review these for each new application.
Remember that:
v For transactions that access a recoverable resource, DTB helps to preserve
logical data integrity.
v Resources that are to be updated should be made recoverable.
v DTB takes place only after program level abend exits (if any) have attempted
cleanup or logical recovery.
Even if transaction restart is specified, a task will restart automatically only under
certain default conditions (listed under “Abnormal termination of a task” on page 90).
These conditions can be changed, if absolutely necessary, by modifying the restart
program DFHREST.
System failures
Specify how an application is to be restarted after an emergency restart.
Depending on how far you want to automate the restart process, application and
system programming could provide the following functions:
v User exits for transaction backout processing to handle:
– Logically deleting records added to BDAM or VSAM-ESDS files (see the CICS
Customization Guide for details of the XFCLDEL global user exit point)
– Backing out file control log records (see the CICS Customization Guide for
details of the XFCBOUT global user exit point)
– File errors during transaction backout (see the CICS Customization Guide for
details of the XFCBFAIL global user exit point)
– User recovery records read from the system log during emergency restart
(see the CICS Customization Guide for details of the XRCINPT global user
exit point).
v A progress transaction to help the user discover what updates have and have not
been performed. For this purpose, application code can be written to search
existing files or databases for the latest record or segment of a particular type.
Notes:
1. If an abend occurs during the invocation of a CICS service, issuing a further
request for the same service may cause unpredictable results because the
reinitialization of pointers and work areas and the freeing of storage areas in the
exit routine may not have been completed.
2. ASPx abends, which occur when a task abends during syncpoint processing,
cannot be handled by an application program.
In program-level abend exit code, you might want to perform actions such as the
following (although it is best to keep abend exit code to a minimum):
v Record application-dependent information relating to that task in case it
terminates abnormally.
If you want to initiate a dump, do so in the exit code at the same program level
as the abend. If you initiate the dump at a program level higher than where the
abend occurred, you may lose valuable diagnostic information.
v Attempt local recovery, and then continue running the program.
v Send a message to the terminal operator if, for example, you believe that the
abend is due to bad input data.
For transactions that are to be dynamically backed out if an abend occurs, beware
of writing exit code that ends with a RETURN command. This would indicate to
CICS that the transaction had ended normally and would therefore prevent dynamic
transaction backout (and automatic transaction restart where applicable). (See
description of program level abend processing at “How CICS handles transaction
abends” on page 88.)
Exit programs can be coded in any supported language, but exit routines must be in
the same language as the program of which they are a part.
See the CICS Messages and Codes manual for the transaction abend codes for
abnormal terminations that CICS initiates, their meanings, and the recommended
actions.
Use of a recoverable DATAID also ensures that, if a system failure occurs after the
START-issuing task has completed its syncpoint, the START command is
preserved. CICS starts the transaction specified on a recoverable START command
after an emergency restart, when the expiry time is reached and provided that the
terminal specified in the TERMID option is available. Note that a DATAID is relevant
only if data is being passed to the started transaction.
Note: Consider using EXEC CICS RETURN TRANSID(...) with the IMMEDIATE
option if the purpose is to start the next transaction in a sequence on the
same terminal. This does not unlock the terminal, incurs less overhead, and, in a
dynamic transaction routing (DTR) environment, leaves the transaction eligible for
DTR.
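A minimal sketch of this form of RETURN follows; the transaction name 'TRNC' and the COMMAREA fields are hypothetical:

```cobol
      * Sketch: end this task and name the next transaction in the
      * sequence for the same terminal. IMMEDIATE initiates it without
      * waiting for further terminal input.
           EXEC CICS RETURN TRANSID('TRNC') IMMEDIATE
                COMMAREA(CA-AREA) LENGTH(CA-LEN)
           END-EXEC.
```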
Note that, under CICS, PL/I execution-time options can be specified only by the
PLIXOPT character string.
For details of PL/I coding restrictions in a CICS environment, see the appropriate
PL/I programmer’s guide for your compiler.
Note: Locking (implicit or explicit) data resources protects data integrity in the
event of a failure, but can affect performance if several tasks attempt to
operate on the same data resource at the same time. The effect of locking
on performance, however, is minimized by implementing applications with
short UOWs, as discussed under “Dividing transactions into units of work” on
page 135.
Nonrecoverable files
For BDAM files that are nonrecoverable (that is, LOG=NO is specified in the FCT
entry), CICS does not lock records that are being updated. By default, you get
BDAM exclusive control, which operates on a physical block and is system-wide, but
lasts only until the update is complete. If a transaction reads a record for update
under BDAM exclusive control, and the transaction subsequently decides not to
change the data, it must release the BDAM exclusive control. To do this, issue an
EXEC CICS UNLOCK command, which causes CICS to issue a RELEX macro.
If you don’t want BDAM exclusive control, specify SERVREQ=NOEXCTL on the file
entry in the FCT.
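The release of BDAM exclusive control described above can be sketched as follows; the file name 'BDAMFILE' and data areas are hypothetical:

```cobol
      * Sketch: the record was read for update but the task decides
      * not to change it, so UNLOCK releases BDAM exclusive control
      * (CICS issues a RELEX macro on the task's behalf).
           EXEC CICS READ FILE('BDAMFILE') UPDATE
                RIDFLD(BLK-REF) INTO(REC-AREA)
           END-EXEC.
      *    ... examine REC-AREA; no change is needed ...
           EXEC CICS UNLOCK FILE('BDAMFILE') END-EXEC.
```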
Figure 13 on page 147 illustrates the extent of locking for nonrecoverable files.
Figure 13. Locking during updates to nonrecoverable files. This figure illustrates two tasks
updating the same record or control interval. Task A is given a lock on the record or control
interval between the READ UPDATE and WRITE commands. During this period, task B
waits. (Abbreviations: SOT, start of task; SP, syncpoint. Note: for BDAM and VSAM
non-RLS, this locking grants exclusive control.)

Figure 14. Locking (enqueuing on a resource) during updates to recoverable files. This figure
illustrates two tasks updating the same record or control interval. Task A is given an exclusive
lock on the record until the update is committed (at the end of the UOW). During this period,
task B waits. (Abbreviations: SOT, start of task; SP, syncpoint.)
Recoverable files
For VSAM or BDAM files designated as recoverable, the duration of the locking
action is extended as shown in Figure 14. For VSAM files, the extended locking is
on the updated record only, not the whole control interval.
The file control commands that invoke automatic locking in this way are:
v READ (for UPDATE)
v WRITE
v DELETE
Notes:
1. Enqueuing as described above can lead to transaction deadlock (see
“Possibility of transaction deadlock” on page 151).
2. The scope of locks varies according to the access method, the type of access,
and who obtains the lock:
v BDAM exclusive control applies to the physical block
v Non-RLS VSAM exclusive control applies to the control interval
v CICS locks for BDAM (with NOEXCTL specified) apply to the record only
v CICS locks for non-RLS VSAM apply to the record only
v SMSVSAM locks for RLS apply to the record only
3. VSAM exclusive control. The CICS enqueuing action on recoverable files
opened in non-RLS mode, which always lasts until the end of the UOW, does
not affect VSAM’s exclusive control actions. When a transaction issues a READ
UPDATE command (for any file, recoverable or not), VSAM maintains its
exclusive control of the control interval containing the record until a REWRITE
(or UNLOCK, DELETE, or SYNCPOINT) command is issued. Unless it specifies
the TOKEN keyword, a second READ UPDATE command for records in the
same control interval without an intervening REWRITE, DELETE, or UNLOCK
command raises the INVREQ condition.
4. For recoverable files, do not use unique key alternate indexes (AIXs) to allocate
unique resources (represented by the alternate key). If you do, backout may fail
in the following set of circumstances:
a. A task deletes or updates a record (through the base or another AIX) and
the AIX key is changed.
b. Before the end of the first task’s UOW, a second task inserts a new record
with the original AIX key, or changes an existing AIX key to that of the
original one.
c. The first task fails and backout is attempted.
The backout fails because a duplicate key is detected in the AIX indicated by
message DFHFC4701, with a failure code of X'F0'. There is no locking on the
The foregoing rules for program isolation scheduling can be overridden using the ‘Q’
command code in a segment search argument (this command extends enqueuing
to the issue of a DL/I TERM call), or by using PROCOPT=EXCLUSIVE in the PCB
(this option gives exclusive control of specified segment types throughout the period
that the task has scheduled the PSB).
These commands can be useful in certain applications when, for example, you want
to:
v Protect data written into the common work area (CWA), which is not
automatically protected by CICS
v Prevent transaction deadlock by enqueuing on records that might be updated by
more than one task concurrently
v Protect a temporary storage queue from being read and updated concurrently.
After a task has issued an ENQ RESOURCE(data-area) command, any other task
that issues an ENQ RESOURCE command with the same data-area parameter is
suspended until the task issues a matching DEQ RESOURCE(data-area)
command, or until the UOW ends.
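A minimal sketch of this explicit enqueuing follows; the resource name 'CWA-TOTALS' is an arbitrary string that all cooperating programs must agree on, and WS-ENQNAME is a hypothetical working-storage field:

```cobol
      * Sketch: serialize access to a shared CWA field, which CICS
      * does not protect automatically. LENGTH makes the enqueue
      * name-based: any task using the same 10-byte name is serialized.
           MOVE 'CWA-TOTALS' TO WS-ENQNAME.
           EXEC CICS ENQ RESOURCE(WS-ENQNAME) LENGTH(10) END-EXEC.
      *    ... update the shared CWA area ...
           EXEC CICS DEQ RESOURCE(WS-ENQNAME) LENGTH(10) END-EXEC.
```

Issuing the DEQ as soon as the shared area has been updated, rather than waiting for the end of the UOW, shortens the time other tasks are suspended.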
As shown in Figure 15, transaction deadlock means that two (or more) tasks cannot
proceed because each task is waiting for the release of a resource that is
enqueued upon by the other. (The enqueuing, DL/I program isolation scheduling
action, or VSAM RLS locking action protects resources until the next
synchronization point is reached.)
TASK A                           TASK B
Update resource 1
                                 Update resource 2
Update resource 2
(waits for resource 2,           Update resource 1
held by task B)                  (waits for resource 1,
                                 held by task A)
Syncpoint                        Syncpoint

Figure 15. Transaction deadlock
If transaction deadlock occurs, one task abends and the other proceeds.
v If both deadlocked resources are non-RLS resources, CICS file control detects
the deadlock and abends one of the transactions with an AFCF abend code.
v If both deadlocked resources are VSAM RLS resources, deadlock detection is
performed by VSAM. If VSAM detects an RLS deadlock condition, it returns a
deadlock exception condition to CICS, causing CICS file control to abend the
transaction with an AFCW abend code. CICS also writes messages and trace
entries that identify the members of the deadlock chain.
(Transaction deadlock is sometimes known as enqueue deadlock, enqueue interlock, or deadly embrace.)
The abended task may then be backed out by dynamic transaction backout, as
described in “Transaction backout” on page 71. (Under certain conditions, the
transaction can be restarted automatically, as described under “Abnormal
termination of a task” on page 90. Alternatively, the terminal operator may restart
the abended transaction.)
For more information, see “Designing to avoid transaction deadlock” on page 138.
For detailed programming information about the exits described in this chapter, see
the CICS Customization Guide.
You may want to include functions in global user exit programs that run during
emergency restart to:
v Process file control log records containing the details of file updates that are
backed out
v Deal with the backing-out of additions to data set types that do not support
deletion (VSAM ESDS and BDAM), by flagging the records as logically deleted
v Handle file error conditions that arise during transaction backout
v Process user recovery records (in the XRCINPT exit of the user log record
recovery program during emergency restart)
v Deal with the case of a non-RLS batch program having overridden RLS retained
locks (this should not occur very often, if at all)
Exit details
For programming information on the generalized interface for exits, how to write exit
programs, and the input parameters and return codes for each exit, see the CICS
Customization Guide.
Do not set the UERCPURG return code for these exits, because the exit tasks
cannot be purged.
XRCINIT exit
XRCINIT is invoked at warm and emergency restart:
1. When the first user log record is read from the system log, and before it is
passed to the XRCINPT exit
2. After the last user log record has been read from the system log and passed to
XRCINPT
The XRCINIT exit code must always end with a return code of UERCNORM. No
choice of processing options is available to this exit.
CICS passes information in the global user exit parameter list about user-written log
records presented at the XRCINPT exit. The parameter list includes a flag byte that
indicates the disposition of the user log records. This can indicate the state of the
unit of work in which the user log records were found, or that the record is an
activity keypoint record. The possible values indicate:
v The record is an activity keypoint record
v The UOW was committed
v The UOW was backed out
v The UOW was in-flight
v The UOW was in-doubt
XRCINPT exit
XRCINPT is invoked at warm and emergency restart, once for each user log record
read from the system log.
If you want to ignore the log record, return with return code UERCBYP. This frees
the record area immediately and reads a new record from the system log. Take
care that this action does not put data integrity at risk.
As described under “Actions taken at transaction failure” on page 91, the program
error program (PEP) gains control after all program-level ABEND exit code has
executed and after dynamic transaction backout has been performed. The PEP can
be:
v The CICS-supplied PEP
v Your own PEP created by modifying the CICS-supplied version
The CICS-supplied PEP performs no processing, and is effectively a no-op. Its only
effect, when CICS links to it, is to avoid the DFHAC2259 message that would
otherwise be issued.
The program error program is a command-level program that can be written in any
of the languages that CICS supports. The CICS abnormal condition program
passes to the PEP a communications area (COMMAREA) containing information
about the abend. You can add code to take appropriate actions to suit your
requirements.
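As an illustration of the kind of code you might add, the sketch below disables the failing transaction so that no further instances run until the error is corrected. The COMMAREA field name PEP-ABEND-CODE is hypothetical (check the DFHPEP COMMAREA mapping for the actual field names):

```cobol
      * Sketch of user logic in the program error program: if the
      * abend indicates a program check, disable the transaction so
      * that terminals cannot run it again until it is repaired.
      * PEP-ABEND-CODE is a hypothetical COMMAREA field name.
           IF PEP-ABEND-CODE = 'ASRA'
              EXEC CICS SET TRANSACTION(EIBTRNID) DISABLED
              END-EXEC
           END-IF.
```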
When you have corrected the error, you can re-enable the relevant installed
transaction definition to allow terminals to use it. You can also disable transaction
identifiers when transactions are not to be accepted for application-dependent
reasons, and can enable them again later. The CICS Supplied Transactions manual
tells you more about the master terminal operator functions.
If logic within DFHPEP determines that it is unsafe to continue CICS execution, you
can force a CICS abend by issuing an operating system ABEND macro. If DFHPEP
abends (transaction abend), CICS produces message DFHAC2263.
The quiesce function enables you, with a single command, to close in an orderly
manner throughout the sysplex any data sets that are open in RLS mode, and
prevent the data sets being opened in RLS mode while they are in the quiesced
state. This function is required in the data sharing environment because many CICS
regions can have the same data set open for update at the same time. You can use
the quiesce function to take a data set offline throughout the sysplex when:
v You want to switch between RLS and non-RLS VSAM access modes.
v You want to prevent data set access during forward recovery.
CICS supports the VSAM RLS quiesce interface by providing an RLS quiesce exit
program that is driven for the quiesce and unquiesce functions. CICS regions
initiating quiesce and unquiesce requests propagate these to other CICS regions,
through the SMSVSAM control ACB and the CICS RLS quiesce exit program.
Other products such as DFSMSdss and CICSVR also communicate with CICS
through their SMSVSAM control ACBs and the RLS quiesce exit.
Normally, a data set cannot be opened for update in non-RLS access mode if
SMSVSAM is holding retained locks against the data set. Thus, to enable a
non-RLS batch program to update a recoverable data set that CICS regions have
open in RLS mode, take the following steps:
1. Resolve shunted UOWs that are holding retained locks
2. Quiesce the data set
3. Run the batch update job
4. Unquiesce the data set
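Steps 2 and 4 can be sketched as follows; the data set name is hypothetical, and the exact option syntax should be checked against the system programming reference:

```cobol
      * Sketch: quiesce the data set sysplex-wide before the non-RLS
      * batch run, then unquiesce it afterwards. BUSY(WAIT) holds
      * control until SMSVSAM reports the quiesce complete.
           EXEC CICS SET DSNAME('PROD.CICS.ACCTFILE')
                QUIESCED BUSY(WAIT)
           END-EXEC.
      *    ... the batch update job runs here, outside CICS ...
           EXEC CICS SET DSNAME('PROD.CICS.ACCTFILE')
                UNQUIESCED
           END-EXEC.
```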
CICS transactions that try to access a quiesced data set in RLS mode fail with a
NOTOPEN condition. Non-RLS accesses are permitted subject to the data set’s
VSAM SHAREOPTIONS.
For more information, see “Switching from RLS to non-RLS access mode” on
page 164.
Note: If enabled, the CICS XFCVSDS exit is invoked for this function, with
the UEPVSACT parameter set to a value of UEQUIES.
If enabled, the CICS XFCQUIS exit is also invoked for this function,
with the UEPQSTAT parameter set to a value of UEQSD or
UEIMQSD, for a normal or an immediate quiesce respectively.
Issuing an unquiesce request
This is initiated in a CICS region (by CFQS) in response to a CEMT, or
EXEC CICS, SET DSNAME(...) command to unquiesce the specified data
set. SMSVSAM updates the ICF catalog to indicate that the data set is
unquiesced, and then propagates the request, through the CICS RLS
quiesce exit program, to all CICS regions that are registered with an
SMSVSAM control ACB. As soon as they receive the notification, CICS
regions re-enable the data set, making it eligible to be reopened in RLS
mode, and re-try any backout-failed shunted UOWs for that data set. (See
also “Completion notification”.)
If enabled, the CICS XFCQUIS exit is also invoked for this function,
with the UEPQSTAT parameter set to a value of UEUNQSD.
Notifying completion
The required completion notification for the quiesce function and the
unquiesce function is as follows:
Quiesce: SMSVSAM does not change the ICF catalog state until it
receives notification from all CICS regions with open RLS ACBs that
they have completed their quiesce processing. For this purpose, CICS
issues a quiesce completed code (using an IDAQUIES macro QUICMP
function) when it has finished its required processing for a data set
quiesce operation.
SMSVSAM then issues a normal response to the initiating CICS region
to signify that the operation is complete.
Unquiesce: SMSVSAM changes the ICF catalog state immediately on
receipt of an unquiesce request from a CICS region. SMSVSAM then
notifies all CICS regions that are registered with an SMSVSAM control
ACB, through the CICS RLS quiesce exit program, that the data set is
unquiesced. In this case, the CICS RLS quiesce exit issues a quiesce
completed code to SMSVSAM immediately, with any associated
unquiesce processing taking place asynchronously.
SMSVSAM then issues a normal response to the initiating CICS region
to signify that the operation is complete.
Canceling a quiesce request
If SMSVSAM receives an unquiesce request before an in-progress quiesce
request completes, it has the effect of canceling the quiesce request.
Timing out a quiesce
If the CICS region initiating a quiesce request does not receive a normal
response from SMSVSAM within the specified quiesce time-out limit, CICS
issues an unquiesce request to cancel the quiesce operation. You specify
the quiesce time-out limit on the QUIESTIM system initialization parameter,
for which the default value is 240 seconds.
Figure 16. The CICS RLS quiesce operation with the CICS quiesce exit program. (The
original figure shows the MVS images, SMSVSAM servers, coupling facility, and ICF catalog
involved; the numbered notes below describe each step.)
Notes:
1. A suitably-authorized user application program issues an EXEC CICS SET
DSNAME(...) QUIESCED command (or a terminal operator issues the
equivalent CEMT command).
If the command specifies the BUSY(NOWAIT) option, all phases of the quiesce
operation are asynchronous, and the user application program continues
processing.
If the command specifies BUSY(WAIT), control is not returned to the user
application program until SMSVSAM replies to CFQS that the quiesce function
has completed (or failed).
2. CICS file control invokes CFQS to send the quiesce request to SMSVSAM.
3. The long-running CICS CFQS task passes the quiesce request to SMSVSAM
across the control ACB interface using the IDAQUIES macro QUICLOSE
function.
4. SMSVSAM drives the CICS RLS quiesce exit program of each CICS region that
has an open RLS ACB for the specified data set.
Note: This also applies to the CICS region in which a quiesce request is
initiated if it has open RLS ACBs for the data set. Thus an initiator can
also be a recipient of a quiesce request.
(4a) SMSVSAM uses the coupling facility to propagate the request to the other
SMSVSAM servers in the sysplex.
The other functions provided on the RLS quiesce interface for data-set-related
activities are as follows:
Non-BWO data set backup start
A quiesce interface function initiated by DFSMSdss in readiness for
non-BWO backup processing for a data set that is open in RLS mode. This
function prevents CICS file control issuing RLS update requests against a
sphere so that the VSAM sphere can be backed up.
SMSVSAM invokes the CICS RLS quiesce exit program in each region that
has an open RLS ACB for the data set.
If any in-flight UOWs are using the data set when a QUICOPY notification is
received, CICS allows them to complete (or be shunted). CICS then flags
its data set name block (DSNB) for the data set to disable further updates.
Any UOW that attempts to update the data set thereafter is abended with
an AFCK abend, and SMSVSAM prevents any new file opens for update.
Note: If enabled, the CICS XFCVSDS exit is invoked for this function, with
the UEPVSACT parameter set to a value of UENBWST.
Note: If enabled, the CICS XFCVSDS exit is invoked for this function, with
the UEPVSACT parameter set to a value of UENBWCMP.
BWO backup start
A quiesce interface function initiated by DFSMSdss in readiness for BWO
backup processing for a data set that is open in RLS mode. This function
enables CICS to ensure the data set is in a suitable state for a BWO
backup to be taken.
SMSVSAM invokes the CICS RLS quiesce exit program in each region that
has an open RLS ACB for the data set.
In response to this form of request, CICS writes tie-up records to the
forward recovery log and log of logs, and waits for any in-flight UOWs to
complete (or be shunted). New units of work can then update the data set.
Note: If enabled, the CICS XFCVSDS exit is invoked for this function, with
the UEPVSACT parameter set to a value of UEBWOST.
BWO backup end
A quiesce interface function initiated by DFSMSdss at the end of BWO
backup processing (or to cancel a BWO backup request). It notifies CICS
that a BWO backup of a data set is complete.
SMSVSAM invokes the CICS RLS quiesce exit program in each region that
is registered with an SMSVSAM control ACB.
CICS does not perform any processing for this form of request.
Note: If enabled, the CICS XFCVSDS exit is invoked for this function, with
the UEPVSACT parameter set to a value of UEBWOCMP.
Forward recovery complete
A quiesce interface function initiated by VSAM in response to a request
from CICSVR. VSAM takes action associated with a sphere having
completed forward recovery, which includes notifying CICS.
SMSVSAM invokes the CICS RLS quiesce exit program in each region that
is registered with an SMSVSAM control ACB.
CICS retries any backout-failed shunted UOWs for the data set.
Lost locks recovery complete
A quiesce interface function initiated by VSAM. VSAM takes action
Non-RLS access mode is a generic term embracing NSR, LSR, and GSR access
modes. It is likely that your existing batch programs open VSAM data sets in NSR
access mode, although it is also possible for batch programs to use LSR or GSR.
As a general rule, do not switch a data set between RLS and non-RLS access
within CICS. After a data set has been accessed in RLS mode by CICS, it should
always be accessed in RLS mode by CICS.
Note: If your file definitions specify an LSR pool id that is built dynamically
by CICS, consider using the RLSTOLSR system initialization
parameter.
v Open the files non-RLS read-only mode in CICS.
v Concurrently, run batch non-RLS.
v When batch work is finished:
– Close the read-only non-RLS mode files in CICS.
– Redefine the files as RLS-mode files with update operations enabled. You can
do this using the CEMT, or EXEC CICS, SET FILE command.
– Unquiesce the data sets.
– Open the files in CICS, if not using open on first reference.
– Resume normal running.
You should also take data set copies for recovery purposes before and after a
batch run as you would normally, regardless of whether you are switching from
RLS to non-RLS access mode.
The quiesce mechanism cannot inform batch programs that have the data set open
in RLS access mode about the quiesce request. If you have such programs, use
the access method services SHCDS LIST subcommand to check whether any
Quiescing a data set sets the quiesce flag in the ICF catalog so that the data set
can be opened in non-RLS mode only. This way of making data sets available for
batch programs without having to shut down all the CICS regions is particularly
suited to the Parallel Sysplex environment. Note that if you do shut CICS regions
down, it is best to avoid immediate shutdowns when you have not specified a
shutdown transaction, because this could cause large numbers of locks to be
retained. A shutdown transaction can enable you to perform a quick but controlled
shutdown.
Even if a data set has been quiesced, you still cannot open it for update in non-RLS
access mode if SMSVSAM is holding retained locks against the data set. This is
because the locks are needed to preserve data integrity: they protect changes that
are waiting to be either committed or backed out. It is possible to open a data set
for input (read-only) in non-RLS mode even when there are retained locks, because
a reader cannot corrupt the data integrity that such locks are preserving. Note that,
when you are opening a data set in a batch job, there cannot be any active locks
because, when an ACB is closed, any active locks are converted automatically into
retained locks.
The remainder of this topic describes the options that are available if you need to
switch to non-RLS access mode but are prevented from doing so by retained locks.
The procedure described here lists the steps you can take to switch from RLS to
non-RLS mode in readiness for batch processing. The procedure is for a single
data set, and should be repeated for each data set that is to be accessed from
batch in non-RLS mode. CICS provides a suite of sample programs designed to
help you to automate most of this procedure, which you can tailor to suit your own
requirements (see “The batch-enabling sample programs” on page 170):
1. Prevent further use of the applications that access the data set, so that no more
locks can be created. One way to do this is to disable transactions that use the
data set; but you may have your own technique for quiescing applications.
2. Check for any retained locks by using the INQUIRE DSNAME command on
each CICS region that has been accessing the data set (see “INQUIRE
DSNAME” on page 167). Generally there should not be any, and you can
continue by quiescing the data set and running your batch program(s).
3. If there are retained locks, the RLS-in-use indicator prevents the data set from
being opened in non-RLS mode. In this case, use the CAUSE and REASON options
on the INQUIRE UOWDSNFAIL command to find out what has caused them.
INQUIRE DSNAME
You can use the INQUIRE DSNAME command to find out whether the data set has
any retained locks or if it is waiting for lost locks recovery by any CICS region (see
“A special case: lost locks” on page 172).
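For example, to check a single data set (the name is taken from the examples later
in this chapter):
CEMT INQUIRE DSNAME(RLS.ACCOUNTS.ESDS.DBASE1)
In the expanded display, the Retlocks field shows whether retained locks exist, and
the Lostlocks field shows whether lost locks recovery is outstanding.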
INQUIRE UOWDSNFAIL
CEMT INQUIRE UOWDSNFAIL DSN(datasetname) can return information about
shunted UOWs that have failed to commit changes to the named data set. The
information includes the resource that caused the failure and, where there can be
more than one reason, the reason for the failure. The EXEC CICS INQUIRE
UOWDSNFAIL command returns the same information under program control.
Backout failures can be retried after the cause of the failure has been corrected.
In some cases, the retry occurs automatically.
v Commit failure, where a unit of work has failed during the commit action. The
commit action may be either to commit the changes made by a completed unit of
work, or to commit the successful backout of a unit of work. This failure is
caused by a failure of the SMSVSAM server, which is returned as RLSSERVER
on the CAUSE option and COMMITFAIL or RRCOMMITFAIL on the REASON
option of the INQUIRE UOWDSNFAIL command. (RLSSERVER is also returned
as the WAITCAUSE on an INQUIRE UOW command.)
Commit failures can be retried after the cause of the failure has been corrected.
The retry usually occurs automatically.
Note that INQUIRE UOWDSNFAIL displays information about UOWs that are
currently failed with respect to one or more data sets. The command does not
display information for a unit of work that was in backout-failed or commit-failed
state and is being retried, or for a unit of work that was in-doubt and is being
completed. The retry, or the completion of the in-doubt unit of work, could fail, in
which case a subsequent INQUIRE UOWDSNFAIL displays the relevant
information.
The SHCDS LIST commands could also be useful in the unlikely event of a
mismatch between the information that CICS holds about uncommitted changes,
and the information that the VSAM RLS share control data set holds about retained
locks.
When the unit of work completes, the locks are released. If the action chosen is
backout (either by explicitly specifying BACKOUT or as a result of the FORCE
option), diagnostic information is written to the CSFL transient data queue.
Therefore BACKOUT is normally the best option, because you can use the
diagnostic information to correct the data if backout was the wrong decision.
Diagnostic messages DFHFC3004 and DFHFC3010 are issued for each backed-out
update. Choose COMMIT for the UOWs only if you know this is the decision that
would be communicated by the coordinator when resynchronization takes place
following reconnection.
Note: The CEMT or EXEC CICS SET UOW command operates on a single unit of
work and therefore gives better control of each unit of work. The SET
DSNAME(...) UOWACTION command operates on all in-doubt UOWs that have updated the
specified data set. Because these may have also updated resources other
than those in the specified data set, you may prefer to use the SET UOW
command and consider the effect of your actions separately for each
individual unit of work. However, if you are confident that the ACTION option
specified on your transaction definitions indicates the best direction for each
individual transaction, you can use SET DSNAME FORCE.
If a data set has both in-doubt-failed and other (backout- or commit-) failed UOWs,
deal with the in-doubt UOWs first, using SET DSNAME UOWACTION, because this
might result in further failures, which can then be cleared by the SET DSNAME
RESETLOCKS command.
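For example, using the data set name from the examples later in this chapter, the
sequence might be:
CEMT SET DSNAME(RLS.ACCOUNTS.ESDS.DBASE1) FORCE
CEMT SET DSNAME(RLS.ACCOUNTS.ESDS.DBASE1) RESETLOCKS
The first command forces completion of the in-doubt UOWs in the direction
indicated by the transaction definitions; the second clears any retained locks left
by resulting backout or commit failures. Use FORCE and RESETLOCKS only if you
accept the possible loss of data integrity.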
The DFH0BAT3 sample program is also useful for resolving pending backouts
after a failure to forward recover a data set. See “Procedure for failed RLS
mode forward recovery operation” on page 191.
For more information about these batch-enabling sample programs, see the CICS
Operations and Utilities Guide, which describes how to install and use the
programs.
The display shows a REASON code of DELEXITERROR (Del) for one unit of
work, and INDEXRECFULL (Ind) for the other.
2. A CEMT SET DSNAME(...) RETRY command, after fixing the delete exit error
(caused because an XFCLDEL exit program was not enabled) invokes a retry of
both UOWs with the following result:
SET DSNAME('RLS.ACCOUNTS.ESDS.DBASE1') RETRY
STATUS: RESULTS - OVERTYPE TO MODIFY
Dsn(RLS.ACCOUNTS.ESDS.DBASE1 ) Vsa NORMAL
Fil(0001) Val Bas Rec Sta Ava Ret
3. Another CEMT INQ UOWDSNFAIL command shows that one of the two UOWs
failed again. It is the unit of work with the INDEXRECFULL error, as follows:
INQUIRE UOWDSNFAIL
STATUS: RESULTS
Dsn(RLS.ACCOUNTS.ESDS.DBASE1 ) Dat Ind
Uow(AA6DB08AC66B4000) Dat Ind Rls
This type of error can occur when adding a value to a non-unique alternate
index while trying to back out a change to a record; after the original update
was made, other UOWs have added values to the alternate index, filling all the
available space. Such an error should be extremely rare. The solution is to
define a larger alternate index record size for the data set.
4. However, the data set with the remaining failed unit of work is needed urgently
for a batch run, so you may decide not to complete the unit of work now, and to
resolve the problem later. In the meantime, to allow the batch job to start, you
issue SET DSN('RLS.ACCOUNTS.ESDS.DBASE1') RESETLOCKS to release the retained
locks associated with the unit of work.
5. To check that there are no retained locks left, you can issue another INQUIRE
DSNAME command. The expanded display shows:
Dsname(RLS.ACCOUNTS.ESDS.DBASE1)
Accessmethod(Vsam)
Action( )
Filecount(0001)
Validity(Valid)
Object(Base)
Recovstatus(Recoverable)
Backuptype(Static)
Frlog()
Availability( Available )
Lostlocks()
Retlocks(Noretained)
Lost locks recovery is complete when all uncommitted changes, which were
protected by the locks that were lost, have been committed. Therefore, after a lost
locks condition occurs, you may need to follow the same procedures as those for
preparing for batch jobs (described in “The batch-enabling sample programs” on
page 170), to ensure that the data set is made available again as soon as possible.
Preserving data integrity should be the priority, but you may decide to force in-doubt
UOWs and to reset locks in order to make the data set available sooner.
Post-batch processing
After a non-RLS program has been permitted to override retained locks, the
uncommitted changes that were protected by those locks must not normally be
allowed to back out. This is because the non-RLS program may have changed the
protected records. To ensure this, all CICS regions that held retained locks are
notified by VSAM that a non-RLS update may have occurred because of the
PERMITNONRLSUPDATE override. The CICS regions receive this notification when
they open the affected data set in RLS mode.
CICS stores information about any uncommitted changes that are currently
outstanding, and then informs VSAM RLS that the PERMITNONRLSUPDATE state
can be cleared with respect to this CICS region. CICS uses the stored information
to prevent automatic backouts during retries, because applying backouts to data
modified by a batch program could produce unpredictable results.
Whenever a change made by one of these UOWs is about to be backed out, CICS
detects its special status and invokes the XFCBOVER global user exit program.
This allows you to determine what action to take over the uncommitted changes.
The default action for these UOWs is not to retry the backouts.
Note that neither CICS nor VSAM knows whether the non-RLS program did change
any of the locked records, only that it had the potential to do so. With your
knowledge of the batch application, you may know that the records, or certain of the
records, could not have been changed by batch and could therefore be safely
backed out. For example, the batch application might only ever add records to the
data set. CICS provides a global user exit point, XFCBOVER, which allows you to
request that records are backed out, and which you can also use to perform other
actions. If you choose to back out an update, CICS issues diagnostic message
DFHFC3002 instead of DFHFC3001.
The SHCDS LISTDS subcommand shows whether the non-RLS batch update
PERMITNONRLSUPDATE state is currently set. It also shows the status of a
PERMIT first-time switch, which is a switch that is cleared as soon as a data set for
which non-RLS update has been allowed is next opened for RLS access. The
non-RLS permitted state is used to inform each contributing CICS region that a
non-RLS program has potentially overridden its retained locks. The first-time switch
A recoverable CFDT supports in-doubt and backout failures. If a unit of work fails
when backing out an update to a CFDT, or if it fails in-doubt during syncpoint
processing, the locks are converted to retained locks and the unit of work is
shunted. If a CFDT record is locked by a retained lock, an attempt to modify the
record fails with the LOCKED condition. This can be returned on the following API
file control requests issued against CFDT records:
v DELETE with RIDFLD
v READ with UPDATE
v READNEXT with UPDATE
v READPREV with UPDATE
v WRITE
To resolve retained locks against CFDT records, use the CEMT INQUIRE
UOWLINK command to find the unresolved units of work associated with links to
the coupling facility data table pool containing the table. When you have found the
UOW links, take the appropriate action to resolve the link failure or force completion
of the UOWs.
The owner of the failed UOW could be another CICS region in the sysplex, and the
cause of the retained lock could be one of the following:
v The CICS region cannot resolve the UOW because its CFDT server has failed.
v The CICS region that owns the UOW has failed to connect to its CFDT server.
v The CICS region that owns the UOW cannot resynchronize with its CFDT
server.
v An in-doubt failure caused by the loss of the CICS region that is coordinating a
distributed UOW.
v An in-doubt failure caused by the loss of connectivity to the CICS region that is
coordinating a distributed UOW.
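A hedged sketch of the investigation follows; the UOW identifier is illustrative, in
the format shown by the displays elsewhere in this chapter:
CEMT INQUIRE UOWLINK
CEMT INQUIRE UOW(AA6DB08AC66B4000)
CEMT SET UOW(AA6DB08AC66B4000) FORCE
The INQUIRE UOWLINK display identifies the units of work with links to the
coupling facility data table pool; if the link failure cannot be resolved, SET UOW
with COMMIT, BACKOUT, or FORCE completes the unit of work.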
Recoverable data sets may have retained locks associated with them. For data sets
accessed in RLS mode, SMSVSAM associates locks with the physical location of
the data set on disk. If you move the data set from one disk location to another,
ensure that any retained locks are associated with the new data set location after it
is moved. DFSMS access method services provide some SHCDS subcommands
that enable you to preserve these locks when moving a data set.
There are two basic methods that you can use to move a VSAM data set:
1. Use the REPRO function of the access method services utility, then delete the
original data set.
2. Use the EXPORT and IMPORT functions of access method services.
With either of these methods, you need to take extra steps to preserve any retained
locks for data sets accessed in RLS mode.
In the case of a non-RLS mode data set, retained locks are not a problem and no
other special action is needed.
To enable you to move data and keep the locks associated with the data, access
method services provide the following SHCDS subcommands. These are
subcommands of the VSAM sharing control SHCDS command, and must always be
preceded by the SHCDS keyword:
SHCDS FRSETRR
This command marks the data set as being under maintenance, by setting a
flag in the data set’s ICF catalog entry. (This flag is shown as ‘Recovery
Required’ in a catalog listing (LISTCAT).)
SHCDS FRUNBIND
This command unbinds any retained locks held against the data set. The locks
are no longer associated with any specific disk location or data set.
SHCDS FRBIND
This command re-binds to the new data set all retained locks that were
unbound from the old data set. The locks are now associated with the named
data set. The data set name used in FRBIND must match the name used in the
earlier FRUNBIND.
SHCDS FRRESETRR
This frees the data set from being under maintenance, resetting the flag in the
ICF catalog.
Using the REPRO method described above to move a VSAM data set, the complete
procedure for a data set accessed in RLS mode, preserving the RLS locks, is as
follows.
1. Quiesce the data set
Quiesce the data set that is being moved, to prevent access by CICS regions
while maintenance is in progress.
2. Create a new data set
Create a new data set into which the data is to be copied. At this stage, it
cannot have the same name as the old data set. For example:
DEFINE CLUSTER (NAME(CICS.DATASET.B) ...
3. Issue SHCDS FRSETRR
Use this access method services (AMS) SHCDS subcommand to mark the old
data set as being under maintenance. This makes the data set unavailable
while the move from old to new is in progress, and also allows the following
unbind operation to succeed. For example:
SHCDS FRSETRR(CICS.DATASET.A)
4. Issue SHCDS FRUNBIND
Use this SHCDS subcommand to unbind any retained locks against the old
data set. This enables SMSVSAM to preserve the locks ready for re-binding
later to the new data set. For example:
SHCDS FRUNBIND(CICS.DATASET.A)
Note: You can include the SHCDS FRSETRR and FRUNBIND subcommands
of steps 3 and 4 in the same IDCAMS execution, but they must be in the
correct sequence. For example, the SYSIN input to IDCAMS would look
like this:
//SYSIN DD *
SHCDS FRSETRR(old_dsname)
SHCDS FRUNBIND(old_dsname)
/*
5. Copy (REPRO) the data
After the unbind, use REPRO to copy the data from the old data set to the new
data set created in step 2. For example:
REPRO INDATASET(CICS.DATASET.A) OUTDATASET(CICS.DATASET.B)
6. Issue SHCDS FRSETRR
Use this SHCDS subcommand to mark the new data set as being under
maintenance. This is necessary to allow the later (step 9) re-bind operation to
succeed. For example:
SHCDS FRSETRR(CICS.DATASET.B)
7. Delete the old data set
Delete the old data set to enable you to rename the new data set to the name
of the old data set. For example:
DELETE CICS.DATASET.A
8. Alter the new data set name
Use access method services to rename the new data set to the name of the old
data set. For example:
ALTER CICS.DATASET.B NEWNAME(CICS.DATASET.A)
You must give the new data set the name of the old data set to enable the
following bind operation to succeed.
9. Issue SHCDS FRBIND
Use this SHCDS subcommand to re-bind to the recovered data set all the
retained locks that were unbound from the old data set. For example:
SHCDS FRBIND(CICS.DATASET.A)
10. Issue SHCDS FRRESETRR
Use this SHCDS subcommand after the re-bind to reset the maintenance flag
and enable the data set for use. For example:
SHCDS FRRESETRR(CICS.DATASET.A)
Note: You can include the SHCDS FRBIND and FRRESETRR subcommands
of steps 9 and 10 in one IDCAMS execution, but they must be in the
correct sequence. For example, the SYSIN input to IDCAMS would look
like this:
//SYSIN DD *
SHCDS FRBIND(dataset_name)
SHCDS FRRESETRR(dataset_name)
/*
Chapter 16. Moving recoverable data sets that have retained locks 177
Using the EXPORT and IMPORT functions
Similar considerations to those for the REPRO method also apply to the use of
IMPORT when recovering data from a copy created by EXPORT. The following
steps are required to restore a data set copy that has been created by an access
method services EXPORT command:
v Create a new empty data set into which the copy is to be restored, and use
IMPORT to copy the data from the exported version of the data set to the new
empty data set.
v Use SHCDS FRSETRR to mark the original data set as being under
maintenance.
v Use SHCDS FRUNBIND to unbind the locks from the original data set.
v Use SHCDS FRSETRR to mark the new data set as being under maintenance.
v Delete the original data set.
v Rename the new data set back to the old name.
v Use SHCDS FRBIND to associate the locks with the new data set.
v Use SHCDS FRRESETRR to take the data set out of maintenance.
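By analogy with the REPRO procedure, these steps can be summarized in the
following commands. The data set names and the EXPDD ddname are illustrative,
and the IMPORT operands shown are a sketch; check the access method services
reference for the exact syntax:
DEFINE CLUSTER(NAME(CICS.DATASETB) ...
IMPORT INFILE(EXPDD) OUTDATASET(CICS.DATASETB) INTOEMPTY
SHCDS FRSETRR(CICS.DATASETA)
SHCDS FRUNBIND(CICS.DATASETA)
SHCDS FRSETRR(CICS.DATASETB)
DELETE CICS.DATASETA
ALTER CICS.DATASETB NEWNAME(CICS.DATASETA)
SHCDS FRBIND(CICS.DATASETA)
SHCDS FRRESETRR(CICS.DATASETA)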
This chapter deals with general forward recovery procedures. See “Chapter 18.
Forward recovery with CICSVR” on page 195 for details of forward recovery using
CICSVR.
In the event of a data set failure, it is important to ensure that you preserve any
retained locks as part of the data set recovery process. This is to enable the locks
associated with the original data set to be attached to the new data set. If the data
set failure is caused by anything other than a volume failure, retained locks can be
“unbound” using the SHCDS FRUNBIND subcommand. The data set can then be
recovered and the locks rebound to the recovered data using the SHCDS FRBIND
subcommand.
Note: If you use CICSVR to recover a data set, the unbinding and subsequent
binding of locks is handled by CICSVR. CICSVR also uses the SHCDS
FRSETRR and FRRESETRR commands to prevent general access to the
data set during the recovery process.
If a data set failure is caused by the loss of a volume, it is not possible to preserve
retained locks using FRUNBIND and FRBIND because SMSVSAM no longer has
access to the failed volume. When recovering from the loss of a volume, you can
ensure data integrity only by deleting the entire IGWLOCK00 lock structure, which
forces CICS to perform lost locks recovery. CICS uses information from its system
log to perform lost locks recovery. (For more information about lost locks
processing, see “Lost locks recovery” on page 85.) Recovering data from the loss of
a volume requires a different procedure from the simple loss of a data set.
The procedures to recover data sets that could have retained locks are described in
the following topics:
v For recovery of failed data sets where the volume remains operational, see
“Recovery of data set with volume still available” on page 180
The following steps outline the procedure to forward recover a data set accessed in
RLS mode. Note that the procedure described here refers to two data sets—the
failed data set, and the new one into which the backup is restored. When building
your JCL to implement this process, be sure you reference the correct data set at
each step.
1. Quiesce the data set
To prevent further accesses to the failed data set, quiesce it using the CEMT, or
EXEC CICS, SET DSNAME QUIESCED command.
2. Create a new data set
Create a new data set into which the backup is to be restored. At this stage, it
cannot have the same name as the failed production data set.
3. Issue FRSETRR
Use this access method services SHCDS subcommand to mark the failed data
set as being subject to a forward recovery operation. This makes the data set
unavailable to tasks other than those performing recovery functions, and also
allows the following unbind operation to succeed.
4. Issue FRUNBIND
Use this access method services SHCDS subcommand to unbind any retained
locks against the failed data set. This enables SMSVSAM to preserve the locks
ready for re-binding later to the new data set used for the restore. This is
necessary because there is information in the locks that relates to the old data
set, and it must be updated to refer to the new data set. Unbinding and
re-binding the locks takes care of this.
Note: You can include the access method services SHCDS FRSETRR and
FRUNBIND subcommands of steps 3 and 4 in the same IDCAMS
execution, but they must be in the correct sequence. For example, the
SYSIN input to IDCAMS would look like this:
//SYSIN DD *
SHCDS FRSETRR(old_dsname)
SHCDS FRUNBIND(old_dsname)
/*
5. Restore the backup
After the unbind, restore a full backup of the data set to the new data set
created in step 2. You can use the recovery function (HRECOVER) of
DFSMShsm™ to do this.
6. Issue the FRSETRR subcommand
Use this access method services SHCDS subcommand to mark the new data
set as being subject to a forward recovery operation. This is necessary to allow
the later bind operation to succeed.
7. Run the forward recovery utility
Apply the updates from the forward recovery log to the restored data set, using
CICSVR or another forward recovery utility.
8. Delete the failed data set
Delete the failed data set so that you can rename the restored data set to the
name of the old data set.
9. Rename the restored data set
Use access method services to rename the restored data set to the name of the
old data set.
You must give the restored data set the name of the old data set to enable the
following bind operation to succeed.
10. Issue the FRBIND subcommand
Use this access method services SHCDS subcommand to re-bind to the
recovered data set all the retained locks that were unbound from the old data
set.
11. Issue the FRRESETRR subcommand
Use this access method services SHCDS subcommand after the re-bind to
re-enable access to the data set by applications other than the forward recovery
utility.
Note: You can include the SHCDS FRBIND and FRRESETRR subcommands
of steps 10 and 11 in one IDCAMS execution, but they must be in the
correct sequence. For example, the SYSIN input to IDCAMS would look
like this:
//SYSIN DD *
SHCDS FRBIND(dataset_name)
SHCDS FRRESETRR(dataset_name)
/*
These steps are summarized in the following commands, where the data set names
are labeled with A and B suffixes:
CEMT SET DSNAME(CICS.DATASETA) QUIESCED
DEFINE CLUSTER(NAME(CICS.DATASETB) ...
SHCDS FRSETRR(CICS.DATASETA)
SHCDS FRUNBIND(CICS.DATASETA)
HRECOVER (CICS.DATASETA.BACKUP) ... NEWNAME(CICS.DATASETB)
SHCDS FRSETRR(CICS.DATASETB)
EXEC PGM=fwdrecov_utility
DELETE CICS.DATASETA
ALTER CICS.DATASETB NEWNAME(CICS.DATASETA)
SHCDS FRBIND(CICS.DATASETA)
SHCDS FRRESETRR(CICS.DATASETA)
If you use CICSVR, the SHCDS functions are performed for you (see “Chapter 18.
Forward recovery with CICSVR” on page 195).
After successful forward recovery, CICS can carry out any pending backout
processing against the restored data set. Backout processing is necessary because
the forward recovery log contains after images of all changes to the data set,
including those that are uncommitted, and were in the process of being backed out
when the data set failed.
There are several methods you can use to recover data sets after the loss of a
volume. Whichever method you use (whether a volume restore, a logical data set
recovery, or a combination of both), you need to ensure SMSVSAM puts data sets
into a lost locks state to protect data integrity. This means that, after you have
carried out the initial step of recovering the volume, your data recovery process
must include the following command sequence:
1. ROUTE *ALL,VARY SMS,SMSVSAM,TERMINATESERVER
2. VARY SMS,SMSVSAM,FORCEDELETELOCKSTRUCTURE
3. ROUTE *ALL,VARY SMS,SMSVSAM,ACTIVE
The first command terminates all SMSVSAM servers in the sysplex and temporarily
disables the SMSVSAM automatic restart facility. The second command (issued
from any MVS) deletes the lock structure. The third command restarts all
SMSVSAM servers, as a result of which SMSVSAM records, in the sharing control
data set, that data sets are in lost locks state. The automatic restart facility is also
reenabled.
Each CICS region detects that its SMSVSAM server is down as a result of the
TERMINATESERVER command, and waits for the server event indicating the
server has restarted before it can resume RLS-mode processing. This occurs at
step 3 in the above procedure.
It is important to realize the potential impact of these commands. Deleting the lock
structure puts all RLS-mode data sets that have retained locks, or are open at the
time the servers are terminated, into the lost locks condition. A data set which is in
lost locks condition is not available for general access until all outstanding recovery
on the data set is complete. This is because records are no longer protected by the
lost locks, and new updates can only be permitted when all shunted UOWs with
outstanding recovery work for the data set have completed.
When CICS detects that its server has restarted, it performs dynamic RLS restart,
during which it is notified that it must perform lost locks recovery. During this
recovery process, CICS does not allow new RLS-mode work to start for a given
data set until all backouts for that data set are complete. Error responses are
returned on open requests issued by any CICS region that was not sharing the data
set at the time SMSVSAM servers were terminated, and on RLS access requests
issued by any new UOWs in CICS regions that were sharing the data set. Also,
in-doubt UOWs must be resolved before the data set can be taken out of lost locks
state.
For RLS-mode data sets that are not on the lost volume, the CICS regions can
begin lost locks recovery processing as soon as they receive notification from their
SMSVSAM servers. For the data sets on these other volumes, recovery processing
completes quickly and the data sets are removed from lost locks state.
If you physically restore the volume, however, data sets that have yet to be forward
recovered become immediately available for backout. In this case you need to issue
CFVOL QUIESCE before the volume restore, to prevent access to the restored
volume until that protection can be transferred to CICS (by using the CICS SET
DSNAME(...) QUIESCED command). When all the data sets that need to be
forward recovered have been successfully quiesced, you can enable the volume
again (CFVOL ENABLE). The volume is then usable for other SMSVSAM data
sets.
The command D SMS,CFVOL(volser) can be used to display the CFVOL state of the
indicated volume.
CICS must not perform backouts until forward recovery is completed. The following
outline procedure, which includes the three VARY SMS commands described
above, prevents CICS opening for backout a data set on a restored volume until it
is safe to do so. In this procedure volser is the volume serial of the lost volume:
1. VARY SMS,CFVOL(volser),QUIESCE
Perform this step before volume restore. Quiescing the volume ensures that the
volume remains unavailable, even after the restore, so that attempts to open
data sets on the volume in RLS mode will fail with RC=8,
ACBERFLG=198(X'C6'). Quiescing the volume also ensures that CICS cannot
perform backouts for data sets after the volume is restored, until it is re-enabled.
2. ROUTE *ALL,VARY SMS,SMSVSAM,TERMINATESERVER
3. VARY SMS,SMSVSAM,FORCEDELETELOCKSTRUCTURE
4. ROUTE *ALL,VARY SMS,SMSVSAM,ACTIVE
Note that at this point, as soon as they receive the “SMSVSAM available” event
notification (ENF), CICS regions are able to run backouts for the data sets that
are available. RLS-mode data sets on the lost volume, however, remain
unavailable until a later ENABLE command.
5. At this point the procedure assumes the volume has been restored. This step
transfers the responsibility of inhibiting backouts for those data sets to be
forward recovered from SMSVSAM to CICS. Quiescing the data sets that need
to be forward recovered is a first step to allowing the restored volume to be
used for recovery work for other data sets:
a. SET DSNAME(...) QUIESCED
Use this command for all of the data sets on the lost volume that are to be
eventually forward recovered. Issue the command before performing any of
the forward recoveries.
The following are two examples of forward recovery after the loss of a volume,
based on the procedure outline above:
Example of recovery using data set backup: For this illustration, involving two
data sets, we simulated the loss of a volume by varying the volume offline. The two
data sets (RLSADSW.VF04D.DATAENDB and RLSADSW.VF04D.TELLCTRL) were
being updated in RLS mode by many CICS AORs at the time the volume was taken
offline. The CICS file names used for these data sets were F04DENDB and
F04DCTRL.
The failed data sets were recovered onto another volume without first recovering
the failed volume. For this purpose, you have to know what data sets are on the
volume at the time of the failure. In “Example of recovery using volume backup” on
page 188, we describe the recovery process by performing a volume restore before
the forward recovery of data sets. Here are the steps followed in this example:
1. We simulated the volume failure using the MVS command:
ROUTE *ALL,VARY 4186,OFFLINE,FORCE
The loss of the volume caused I/O errors and transaction abends, producing
messages on the MVS system log such as these:
DFHFC0157 ADSWA04B 030
TT1P 3326 CICSUSER An I/O error has occurred on base data set
RLSADSW.VF04D.TELLCTRL accessed via file F04DCTRL component code
X'00'.
DFHFC0158 ADSWA04B 031
96329,13154096,0005EDC00000,D,9S4186,A04B ,CICS
,4186,DA,F04DCTRL,86- OP,UNKNOWN COND. ,000000A5000403,VSAM
The impact of the recovery process is greater if there are in-flight tasks updating
RLS-mode files. For this reason, it is recommended at this point that you quiesce
the data sets that are being accessed in RLS mode on other volumes before
terminating the SMSVSAM servers. To determine which data sets are being
accessed in RLS-mode by a CICS region, use the SHCDS LISTSUBSYSDS
subcommand. For example, the following command lists those data sets that are
being accessed in RLS-mode by CICS region ADSWA01D.
SHCDS LISTSUBSYSDS('ADSWA01D')
For the purpose of this example, we did not quiesce data sets; hence there is no
sample output to show.
Note: You can issue SHCDS subcommands as a TSO command or from a batch
job.
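For example, to run the LISTSUBSYSDS subcommand from a batch job, the JCL
might look like the following sketch; the step name is invented, and the region
name is the one used in this example:
//LISTDS   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  SHCDS LISTSUBSYSDS('ADSWA01D')
/*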
4. We terminated the SMSVSAM servers using the MVS command:
ROUTE *ALL,VARY SMS,SMSVSAM,TERMINATESERVER
In our example, terminating the servers caused abends of all in-flight tasks that
were updating RLS-mode data sets. This, in turn, caused backout failures and
shunted UOWs, which were reported by CICS messages. For example, the
effect in CICS region ADSWA03C was shown by the following response to an
INQUIRE UOWDSNFAIL command for data set
RLSADSW.VF01D.BANKACCT:
INQUIRE UOWDSNFAIL DSN(RLSADSW.VF01D.BANKACCT)
STATUS: RESULTS
Dsn(RLSADSW.VF01D.BANKACCT ) Dat Ope
Uow(ADD19B8166268E02) Rls
Dsn(RLSADSW.VF01D.BANKACCT ) Rls Com
Uow(ADD19B9D93DE1200) Rls
After the SMSVSAM servers terminated, all RLS-mode files were automatically
closed by CICS and further RLS access prevented.
5. When we were sure that all servers were down, we deleted the IGWLOCK00
lock structure with the MVS command:
VARY SMS,SMSVSAM,FORCEDELETELOCKSTRUCTURE
The SMSVSAM server reported that there were no longer any retained locks
but that instead there were data sets in the “lost locks” condition:
IGW414I SMSVSAM SERVER ADDRESS SPACE IS NOW ACTIVE.
IGW321I No retained locks
IGW321I 45 spheres in Lost Locks
CICS was informed during dynamic RLS restart about the data sets for which it
must perform lost locks recovery. In our example, CICS issued messages such
as the following to tell us that lost locks recovery was needed on one or more
data sets:
DFHFC0555 ADSWA04A One or more data sets are in lost locks status.
CICS will perform lost locks recovery.
If there were many data sets in lost locks, it would take some time for lost locks
recovery to complete. It might also be necessary to explicitly open files that
suffer open failures during lost locks recovery.
Each data set in a lost locks state is protected from new updates until all CICS
regions have completed lost locks recovery for the data set. This means that all
shunted UOWs must be resolved before the data set is available for new work.
Assuming that all CICS regions are active and there are no in-doubt UOWs, lost
locks processing should complete quickly for all data sets except those on the
failed volume.
7. In this example, CEMT INQUIRE UOWDSNFAIL on CICS region ADSWA01D
showed UOW failures only for the RLSADSW.VF04D.TELLCTRL and
RLSADSW.VF04D.DATAENDB data sets:
INQUIRE UOWDSNFAIL
STATUS: RESULTS
Dsn(RLSADSW.VF04D.TELLCTRL ) Dat Ope
Uow(ADD18C2DA4D5FC03) Rls
Dsn(RLSADSW.VF04D.DATAENDB ) Dat Ope
Uow(ADD18C2E693C7401) Rls
All CICS regions are automatically notified when CICSVR processing for a data set
is complete. CICSVR preserves the lost locks state for the recovered data set, and
CICS disallows all new update requests until all CICS regions have completed lost
locks recovery for the data set.
These commands are issued to each CICS AOR that requires access.
10. All data sets were now available for general access. We confirmed this using
the SHCDS subcommand LISTSUBSYS(ALL), which showed that no CICS
region had lost locks recovery outstanding.
If you follow the above example, but find that a CICS region still has a data set in
lost locks, you can investigate the UOW failures on that particular CICS region
using the CEMT commands INQUIRE UOWDSNFAIL and INQUIRE UOW. For
in-doubt UOWs that have updated a data set that is in a lost locks condition, CICS
waits for in-doubt resolution before allowing general access to the data set. In such
a situation you can still release the locks immediately, using the SET DSNAME
command, although in most cases you will lose data integrity. See “Lost locks
recovery” on page 85 for more information about resolving in-doubt UOWs following
lost locks processing.
Note: It is important to ensure that CICS cannot retry the shunted UOWs when the
volume is restored, until after the forward recovery work is complete. This is
done by quiescing the volume before it is restored, as described under
“Volume recovery procedure using CFVOL QUIESCE” on page 183 (step 1).
Many of the steps in this second example are the same as those described under
the “Example of recovery using data set backup” on page 184, and are listed here
in summary form only.
1. We simulated the volume failure using the MVS command:
ROUTE *ALL,VARY 4186,OFFLINE,FORCE
2. We stopped the I/O errors by closing the files that were open against failed
data sets. In our example, file F04DENDB was open against data set
RLSADSW.VF04D.DATAENDB and file F04DCTRL was open against data set
RLSADSW.VF04D.TELLCTRL.
3. Because the failed data sets were restored from the same volume, there was
no need to delete the catalog entries for these data sets.
4. Before restoring the failed volume, we quiesced the volume to ensure that
CICS could not access the restored data sets, by issuing the command:
VARY SMS,CFVOL(9S4186),QUIESCE
In this example, for volume serial 9S4186, the command produced the
message:
IGW462I DFSMS CF CACHE REQUEST TO QUIESCE VOLUME 9S4186 IS ACCEPTED
We confirmed that the volume was quiesced by displaying its SMS volume
status, which produced the message:
IGW531I DFSMS CF VOLUME STATUS
VOLUME = 9S4186
DFSMS VOLUME CF STATUS = CF_QUIESCED
VOLUME 9S4186 IS NOT BOUND TO ANY DFSMS CF CACHE STRUCTURE
5. We simulated the volume restore for this example by using the MVS VARY
command to bring the volume back online:
ROUTE *ALL,VARY 4186,ONLINE
Because the volume was quiesced, attempts to open files on this volume
failed, with messages such as the following:
DFHFC0500 ADSWA02A RLS OPEN of file F04DENDB failed. VSAM has
returned code X'0008' in R15 and reason X'00C6'.
The impact of the recovery process is greater if there are in-flight tasks updating
RLS-mode files. To minimize the impact, we recommend that at this point you
quiesce all data sets that are being accessed in RLS mode.
6. We terminated the SMSVSAM servers with the MVS command:
ROUTE *ALL,VARY SMS,SMSVSAM,TERMINATESERVER
7. When all SMSVSAM servers were down, we deleted the IGWLOCK00 lock
structure with the MVS command:
VARY SMS,SMSVSAM,FORCEDELETELOCKSTRUCTURE
8. We restarted the SMSVSAM servers with the MVS command:
ROUTE *ALL,VARY SMS,SMSVSAM,ACTIVE
CICS was informed during dynamic RLS restart about the data sets for which it
must perform lost locks recovery. CICS issued messages such as the following
to inform you that lost locks recovery was being performed on one or more
data sets:
+DFHFC0555 ADSWA04A One or more data sets are in lost locks status.
CICS will perform lost locks recovery.
If we had quiesced data sets prior to terminating the servers, this is the point at
which we would unquiesce those data sets before proceeding.
If there were many data sets in lost locks it would take some time for lost locks
recovery to complete. It may be necessary to explicitly open files which suffer open
failures during lost locks recovery.
9. At this point, there might have been data sets on the restored volume that
did not require forward recovery. To make these data sets available, we
needed to re-enable access to the volume. Before doing so, however, we
first had to quiesce the data sets that still required forward recovery,
transferring responsibility for preventing backouts from SMSVSAM to CICS.
In our example, we quiesced our two data sets using the CEMT commands:
SET DSN(RLSADSW.VF04D.DATAENDB) QUIESCED
SET DSN(RLSADSW.VF04D.TELLCTRL) QUIESCED
10. When we were sure that all data sets requiring forward recovery were
quiesced, we used the following MVS command to allow access to the
restored volume:
VARY SMS,CFVOL(9S4186),ENABLE
Catalog recovery
If a user catalog is lost, follow the procedures documented in DFSMS/MVS
Managing Catalogs. Before making the user catalog available, run the SHCDS
CFREPAIR command to reconstruct critical RLS information in the catalog. Note
that before running SHCDS CFREPAIR, the restored user catalog must be import
connected to the master catalog on all systems (see the “Recovering Shared
Catalogs” topic in DFSMS/MVS Managing Catalogs).
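As a sketch only (the catalog name, volume, and ddname here are hypothetical, and the exact syntax should be verified against the access method services documentation for your DFSMS level), the import-connect and CFREPAIR steps might look like:

```
/* On each system: import-connect the restored user catalog   */
  IMPORT OBJECTS((UCAT.APPL1  -
          VOLUMES(VOL001)     -
          DEVICETYPE(3390)))  -
         CONNECT
/* Then reconstruct the RLS information in the catalog        */
  SHCDS CFREPAIR(INFILE(UCATDD))
```

Here UCATDD is assumed to be a DD statement that allocates the restored user catalog.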
These steps are summarized in the following commands, where the data set names
are labeled with A and B suffixes:
CEMT SET FILE(...) CLOSED (for each file opened against
the failed data set)
DEFINE CLUSTER(NAME(CICS.DATASETB) ...
HRECOVER (CICS.DATASETA.BACKUP) ... NEWNAME(CICS.DATASETB)
EXEC PGM=fwdrecov_utility
DELETE CICS.DATASETA
ALTER CICS.DATASETB NEWNAME(CICS.DATASETA)
In these cases, you can resolve the cause of the failure and try the whole process
again.
This section describes what to do when the failure in forward recovery cannot be
resolved. In this case, where you are unsuccessful in applying all the forward
recovery log data to a restored backup, you are forced to abandon the forward
recovery, and revert to your most recent full backup. For this situation, the access
method services SHCDS command provides the FRDELETEUNBOUNDLOCKS
subcommand, which allows you to delete the retained locks that were associated
with the data set, instead of re-binding them to the recovered data set as in the
case of a successful forward recovery.
The most likely cause of a forward recovery failure is the loss or corruption of one
or more forward recovery logs. In this event, you probably have no alternative other
than to restore the most recent backup and reapply lost updates to the data set
manually. In this case, it is important that you force CICS to discard any pending
(shunted) units of work for the data set that has failed forward recovery before you
restore the most recent backup. This is because, during recovery processing, CICS
assumes that it is operating on a data set that has been correctly forward
recovered.
CICS performs most of its recovery processing automatically: when the region
is restarted, when files are opened, or when a data set is unquiesced. There is no
way to be sure of preventing CICS from attempting this recovery
processing. How you force recovery processing before restoring the backup
depends on whether or not the affected CICS regions are still running:
v For a CICS region that is still running, issue the appropriate CICS commands to
initiate the retry of pending units of work.
v For a CICS region that is shut down, restart it to cause CICS to retry
automatically any pending units of work.
In the event of a failed forward recovery of a data set, use the following procedure:
1. Tidy up any outstanding CICS recovery work, as follows:
a. Make sure that any CICS regions that are not running, and which could
have updated the data set, are restarted to enable emergency restart
processing to drive outstanding backouts.
b. Using information returned from the INQUIRE UOWDSNFAIL command
issued in each CICS region that uses the data set, compile a list of all
shunted UOWs that hold locks on the data set.
c. If there are shunted in-doubt UOWs, try to resolve the in-doubts before
proceeding to the next step. This is because the in-doubt UOWs may have
updated resources other than the failed data set, and you don’t want to
corrupt these other resources.
If the resolution of an in-doubt unit of work results in backout, this will fail for
the data set that is being restored, because it is still in a recovery-required
state. The (later) step to reset locks for backout-failed UOWs allows you to
tidy up any such backout failures that are generated by the resolution of
in-doubts.
d. In all CICS regions that could have updated the failed data set:
1) Force shunted in-doubt UOWs using SET DSNAME(...)
UOWACTION(COMMIT | BACKOUT | FORCE).
Before issuing the next command, wait until the SET DSNAME(...)
UOWACTION has completed against all shunted in-doubt units of work.
If the UOWACTION command for an in-doubt unit of work results in
backout, this will fail for the data set that is being restored, because it is
still in a recovery-required state. The (next) step to reset locks for
backout-failed UOWs allows you to tidy up any such backout failures
that are generated by the resolution of in-doubts.
2) Reset locks for backout-failed UOWs using SET DSNAME(...)
RESETLOCKS.
Do not issue this command until the previous UOWACTION command
has completed. RESETLOCKS operates only on backout-failed units of
work, and does not affect units of work that are in the process of being
backed out. If you issue RESETLOCKS too soon, and shunted in-doubt
units of work fail during backout, the data set will be left with recovery
work pending.
There should not now be any shunted units of work on any CICS region with
locks on the data set.
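For a single data set in one region, steps 1) and 2) above might look like the following CEMT commands; the data set name is hypothetical, and you must wait for the UOWACTION to complete before issuing RESETLOCKS:

```
CEMT SET DSNAME(CICS.DATASETA) UOWACTION(BACKOUT)
CEMT INQUIRE UOWDSNFAIL DSN(CICS.DATASETA)
CEMT SET DSNAME(CICS.DATASETA) RESETLOCKS
```

The intervening INQUIRE UOWDSNFAIL is one way to check that no shunted units of work remain for the data set before you reset the locks.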
2. When you are sure that all CICS regions have completed any recovery work for
the data set, delete the unbound locks using SHCDS
FRDELETEUNBOUNDLOCKS.
Note: It is very important to enter this command (and the following SHCDS
FRRESETRR) at this stage in the procedure. If you do not, and the failed
data set was in a lost locks condition, the data set will remain in a lost
locks condition unless you cold start all CICS regions which have
lost locks recovery outstanding for it.
If the restored data set is eligible for backup-while-open (BWO) processing, you
may need to reset the BWO attributes of the data set in the ICF catalog. This is
because the failed forward recovery may have left the data set in a
‘recovery-in-progress’ state. You can do this using the CEMT, or EXEC CICS, SET
DSNAME RECOVERED command.
If you do not follow this sequence of operations, the restored backup could be
corrupted by CICS backout operations.
All the parts of step 1 of the above procedure may also be appropriate in similar
situations where you do not want CICS to perform pending backouts. An example of
this might be before you convert an RLS SMS-managed data set to non-SMS when
it has retained locks, because the locks will be lost.
If you are not able to complete forward recovery on a data set, ensure that all CICS
regions tidy up any pending recovery processing on the data set (as described
below) before you restore the backup copy from which you intend to work. You can
do this in the following way:
1. Make sure that any CICS regions that are not running, and which could have
updated the data set, are restarted to enable emergency restart processing to
drive outstanding backouts.
2. If there are shunted in-doubt UOWs, try to resolve the in-doubts before
proceeding to the next step. This is because the in-doubt UOWs may have
updated resources other than the failed data set, and you don’t want to corrupt
these other resources.
3. In all CICS regions that could have updated the failed data set:
v Force shunted in-doubt UOWs using SET DSNAME(...)
UOWACTION(COMMIT | BACKOUT | FORCE).
Before issuing the next command, wait until the SET DSNAME(...)
UOWACTION has completed against all shunted in-doubt units of work.
v Reset locks for backout-failed UOWs using SET DSNAME(...)
RESETLOCKS.
Do not issue this command until the previous UOWACTION command has
completed. RESETLOCKS operates only on backout-failed units of work, and
does not affect units of work that are in the process of being backed out. If
you issue RESETLOCKS too soon, and shunted in-doubt units of work fail
during backout, the data set will be left with recovery work pending.
If you do not follow this sequence of operations, the restored backup could be
corrupted by CICS backout operations.
If you have high transaction volumes between backups, you want to avoid long
work interruptions due to lost data. CICSVR minimizes your recovery labor and
eliminates the possibility of introducing data errors as you recover your data sets.
CICSVR has automated functions that help you recover your CICS VSAM data sets
if they are lost or damaged.
CICSVR features
CICSVR:
v Automatically determines the DFSMShsm backups that are available for forward
recovery. Once you have selected the backup, CICSVR selects the appropriate
CICS logs automatically.
v Uses the ISPF dialog interface, which complies with common user access
(CUA®) standards, to generate automatically the JCL to run the forward recovery
job.
v Interfaces to DFSMShsm to recover the data set and apply the CICS logs to
forward recover the data set. Data sets that have been lost or damaged generally
require forward recovery.
v Works with backups taken with or without the backup-while-open (BWO) facility,
and with or without concurrent copy.
v Provides the necessary RLS support to:
1. Set recovery required
2. Unbind the locks
3. Forward recover the data set
4. Bind the locks
5. Indicate recovery complete.
v Supports the following logs:
– CICS log of logs (used as an index to all forward recovery logs)
– CICS forward recovery log streams
CICSVR benefits
CICSVR:
v Helps you recover your data quickly and easily
v Supports 24-hour CICS availability
v Reduces the risk of human error by creating the job for you
v Saves time by reducing the VSAM data set outage time
v Stores all the information you need for recovery
CICSVR:
v Restores the VSAM sphere from the DFSMShsm backup. When CICSVR
recovers a CICS data set, it preserves all RLS retained locks.
v Applies the after-images from the logs to the restored VSAM sphere.
v Sets a state of ‘forward recovery in progress’ during the forward recovery process
until the forward recovery is complete.
When the forward recovery run is complete, CICSVR has re-created the updated
data.
You can set up a periodic automatic invocation of CICSVR SCAN, which keeps
track of all the CICS forward recovery logs.
For more information about the CICSVR functions, see the CICS VSAM Recovery
MVS/ESA User’s Guide and Reference.
CICSVR lets you manage recovery control data set (RCDS) information using the dialog interface.
You can use the dialog interface to remove information you no longer need about:
v Data sets
v Logs
v Copies of MVS log streams
v Log of logs
You can set an automatic function to delete and uncatalog logs after a specified
time, and you can use CICSVR to delete parts of an MVS log stream. For added
security, CICSVR maintains three copies of the RCDS.
Using CICSVR with DFSMShsm ensures that you have all information about
backups. You need not keep a separate record of backups or logs for each data
set.
For more information about the CICSVR archive utility and the MVS log stream
copy utility, see the IBM CICS VSAM Recovery MVS/ESA Implementation Guide.
Use default values stored by CICSVR on previous runs, or use your own values.
You decide which values to save for next time. You can browse, save, and, if
necessary, edit the JCL before you submit it.
With BWO and the concurrent copy function, you can back up data while the data
sets are open and being updated. CICSVR can recover using BWO copies,
ensuring that the data is protected from loss at all times. If the data should become
damaged or inaccessible, CICSVR can use BWO backups to retrieve the data.
Because you can make backups without interrupting your CICS regions, you can
make them more often.
If your CICS regions are operating only for a certain number of hours in each day,
make backups of your VSAM spheres outside these hours, when the spheres are
unavailable to users. Consider using BWO and the concurrent copy function to
make VSAM spheres continuously available. To take full advantage of CICSVR
recovery control, use DFSMShsm to make your backups. CICSVR automatically
restores the DFSMShsm backup before starting the recovery.
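For example, a backup of an individual data set might be requested with the DFSMShsm TSO command HBACKDS; the data set name here is hypothetical:

```
HBACKDS 'RLSADSW.VF01D.BANKACCT'
```

CICSVR then locates such DFSMShsm backups automatically when it builds a recovery job.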
Journaling considerations
The CICS log manager uses the services of the MVS system logger to write CICS
log and journal data to MVS log streams. A log stream is a sequence of data
blocks, with each log stream identified by its own log stream identifier—the log
stream name (LSN). The CICS system log, forward recovery logs (including the log
of logs), autojournals, and user journals map onto specific MVS log streams. CICS
forward recovery logs, user journals, and autojournals are referred to as general
logs, to distinguish them from system logs.
Both the MVS system logger and CICS log manager have extensive recovery
procedures to prevent errors from causing the loss of your VSAM data. For CICS
Transaction Server, CICSVR provides support only for the log of logs and forward
recovery logs. You cannot use CICSVR to perform forward recovery from
autojournals written by CICS Transaction Server regions. However, for earlier
releases of CICS, CICSVR continues to support system logs and autojournals.
For more information about defining CICS logs and journals, see the CICS System
Definition Guide.
To use CICSVR to recover your data, specify the required recovery attributes when
you define your files to CICS. For details about the recovery attributes required by
CICSVR, see the IBM CICS VSAM Recovery MVS/ESA Implementation Guide. If
your files are accessed in RLS mode, you define the recovery attributes with the
data set definition in the ICF catalog. For more information about RLS recovery
definitions, see the DFSMS/MVS Access Method Services for ICF, SC26-4906.
Except for the ESDS delete exit, the use of CICSVR exits is optional.
For more information about the CICSVR exits, see the IBM CICS VSAM Recovery
MVS/ESA Implementation Guide.
Technical responsibilities
v Who is responsible for the different duties associated with CICSVR recovery?
These duties include:
– Monitoring CICS for messages indicating a VSAM problem
– Notifying users when a VSAM data set becomes unavailable and when it
becomes available again after successful recovery
– Authorizing a CICSVR recovery run
– Locating logs
– Performing the CICSVR run
– Examining the CICSVR reports
– Investigating the reasons for an unsuccessful CICSVR run, if necessary
The panels and secondary windows shown here are in the sequence that they
appear for a CICSVR run if you accept the CICSVR default values:
1. Select option 1 from the main menu. A list of VSAM spheres appears.
2. Select the VSAM sphere(s) you want to recover.
3. Press the CompRec key (F4).
4. If you are using DFSMShsm, select the backup you need. Otherwise, enter the
start time for recovery.
5. Accept the CICSVR defaults in the secondary windows that appear.
6. Submit the job that CICSVR creates for you.
Note: References to CICSVR backout in the ISPF dialog are for compatibility with
CICS releases earlier than CICS Transaction Server for OS/390 Release 1.
For information about other CICSVR functions, see the IBM CICS VSAM Recovery
MVS/ESA User’s Guide and Reference, and the IBM CICS VSAM Recovery
MVS/ESA Implementation Guide.
Figure 17. The CICSVR main menu panel
From the panel in Figure 17 you can choose from these objects:
v A list of VSAM spheres
v A list of archived logs
v A list of MVS log streams
When the action is completed, the letter S appears in the select column beside
each selected sphere on the CICSVR VSAM sphere list panel:
From the CICSVR VSAM sphere list panel, you can select an action by using one
of the shortcut function keys, or select one of these pull-downs from the menu bar:
v Administrate
v Utilities
v Tools
v List
v View
v Help
From this pull-down, you can select “Complete recovery” for the VSAM spheres that
you selected by using one of these methods:
v Select option 1
v Press the CompRec key (F4)
v Type comprec on the command line
v Move the cursor to the “Complete recovery” item in the pull-down, and press
Enter
To get information about an item in the pull-down menu, move the cursor to the
item and press the Help key (F1).
If you have the log of logs registered in the RCDS, CICSVR scans the logs and
presents the results as an ISPF browse of the DWWMSG and DWWPRINT data
sets. For an example of these reports, see “Report of log of logs scan” on
page 215.
Press F4 when the cursor is in the Backup time field to get a list of
backup times from DFSMShsm. Press Enter to continue.
Here you can specify VSAM sphere parameters for inclusion in the recovery run.
These parameters are:
v A new name for the recovered VSAM sphere
v The start time for forward recovery
v The stop time for forward recovery
v The backup time
v The backup type
v The time format used on the log (if you are using MVS log streams or QSAM
copies of MVS log streams)
v The volume for the restored copy of the data if you are using full volume dumps
v The unit for the restored copy of the data if you are using full volume dumps
When you first see this secondary window, the CICSVR default values are
displayed. Press:
v F4, to get a list of DFSMShsm backups (Figure 24 on page 205).
v F5, to get the default values from the recovery control data set (RCDS).
v F6, to save the currently displayed values. The secondary window for default
update verification (Figure 27 on page 207) appears.
v F7, to go back to the previous VSAM sphere.
While CICSVR is constructing the recovery job, the wait secondary window shown
in Figure 21 on page 204 appears.
CICSVR wait
CICSVR is constructing your recovery job. This might take a few minutes.
If CICSVR detects errors while constructing the recovery job, the CICSVR recovery
job error list (Figure 22) appears.
Select one or more errors, then press Enter to get more information about
the error.
Specify log stream type. Press Enter to continue the job creation.
Use this secondary window to specify the type of MVS log stream for this recovery
job.
When you first see this secondary window, the CICSVR default values are
displayed. Press F5 to get the default values from the recovery control data set
(RCDS). Press F6 to save the currently displayed values, and display the default
update verification secondary window (Figure 27 on page 207).
For detailed help information about a particular field, move the cursor to the field
and press the Help key (F1).
This secondary window appears once for every VSAM sphere that you select, and
shows a list of DFSMShsm backups for that sphere. Select the backup that you
need from this list.
After you press Enter in the secondary window (Figure 20), you can enter the
CICSVR recovery parameters here:
Press Enter to create a job with default values. Or, select one or more
choices below, and press Enter to override current values.
From this secondary window (Figure 25), you can create a recovery job for the
VSAM sphere you selected. You can accept the CICSVR default values, or change
the forward recovery parameters before continuing the recovery.
When you select “Complete recovery” from the Utilities pull-down, you can define
these recovery parameters for your run:
v Sequence checking
v VSAM buffer pools
v CICSVR exits
For detailed help information about any of these choices, move the cursor to the
field and press the Help key (F1).
Sequence checking
Use this secondary window to set sequence checking parameters for your CICSVR
run:
When you first see this secondary window, the CICSVR default values are
displayed. Press F5 to get the default values from the recovery control data set
(RCDS). Press F6 to save the currently displayed values, and display the default
update verification secondary window (Figure 27).
Note: Use this secondary window only if you want to force CICSVR to use logs
that are not in time sequence order, or logs that contain old data.
For detailed help information about a particular field, move the cursor to the field
and press the Help key (F1).
Press Enter to update stored defaults, or press F12 to cancel the request.
Specify the number of buffers needed for each buffer pool. Press Enter to
use the displayed values in the recovery.
When you first see this secondary window, the CICSVR default values are
displayed. Press F5 to get the default values from the RCDS. Press F6 to save the
currently displayed values, and display the default update verification secondary
window (Figure 27 on page 207).
For detailed help information about a particular field, move the cursor to the field
and press the Help key (F1).
Defining exits
Use this secondary window to define which CICSVR exits to use in this CICSVR
run:
CICSVR exits
Specify member names for the CICSVR exits. Press Enter to use the
displayed member names in the recovery.
Preapply . . . ________
Error . . . . . ________
ESDS delete . . ________
Termination . . ________
When you first see this secondary window, the CICSVR default values are
displayed. Press F5 to get the default values from the RCDS. Press F6 to save the
currently displayed values, and display the default update verification secondary
window (Figure 27 on page 207).
For detailed help information about any of these options, move the cursor to the
option description and press the Help key (F1).
Type a member name for the CICSVR generated JCL. Press Enter to save the
generated JCL as this member name.
Here you can give a member name for the saved CICSVR JCL. The member will
be saved in the data set you allocated to the ddname ISPFILE.
NO OF RECORDS NO OF NO OF NO OF NO OF
NAME OF MVS LOG STREAM PROCESSED DSNAME UPD-AFTER ADD-AFTER DEL-AFTER
-------------------------------------------- ------------- ----------- ----------- ----------- -----------
RETAIL.ACCOUNTS.MVSLG1.CUST 11 3 2 5 1
RETAIL.ACCOUNTS.MVSLG2.CUST 44 12 5 21 6
-------------------------------------------- ------------- ----------- ----------- ----------- -----------
TOTAL 55 15 7 26 7
-------------------------------------------- ------------- ----------- ----------- ----------- -----------
:--------- RECORDS FOUND ON THE LOG(S) ----------: :----- CHANGE RECORDS APPLIED -----: :-- CHANGES
DATASET FCT ENTRY IGNORED
TYPE NAME DSNAME UPD-AFTER ADD-AFTER DEL-AFTER ADDS UPDATES DELETES BY EXIT
------- --------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
BASE MAIN 4 2 14 3 2 2 1 0
------- --------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
TOTAL 4 2 14 3 2 2 1 0
------- --------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
------- --------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
OVERALL TOTAL 4 2 14 3 2 2 1 0
------------------ ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
THE PREVIOUS FORWARD RECOVERY REGISTERED FOR THIS VSAM SPHERE WHICH WAS
VBFR.CICS390.TEST.AIX1
Archive reports
The CICSVR archive utility and log of logs scan utility can produce up to four
different reports, shown here:
v Report of automatic de-register statistics
v Report of information for a forward recovery
v Report of archive statistics
v Report of log of logs scans
JOB STEP 1
START TIME STOP TIME
--------------- ---------------
96.159 10:15:30 96.159 15:01:12
LOGS NEEDED
-----------
CICSPROD.LOG03.D96158.T101530
CICSPROD.LOG03.D96158.T121530
CICSPROD.LOG03.D96158.T141530
RELATED PATHS
-------------
PAYROLL.PATH1
PAYROLL.PATH2
CICSID : CICSPROD
MVSID : MVS2
LOG ID : 02
CICS VERSION : 3
FIRST TIME : 96.159 14:15:30
LAST TIME : 96.159 17:23:28
FIRST SEQUENCE NUMBER : 737
LAST SEQUENCE NUMBER : 994
INPUT LOG NAME : CICSPROD.DFHJ02A
OUTPUT LOG NAME(S) : CICSPROD.LOG02.D96158.T141530
VSAM DATA SET NAME CICSID FCT NAME OPEN DATE/TIME CLOSE DATE/TIME MVS LOG STREAM NAME
-------------------------------------------- -------- -------- ---------------- --------------- -----------------------
CICSPROD.ACC.VSAMA CICSPROD BASEA 96.157 12:00:00 96.159 12:11:10 CICSVR1.MVSLOG
CICSPROD.ACC.VSAMB CICSPROD BASE2 96.157 12:00:00 CICSVR1.MVSLOG
CICSPROD.ACC.VSAMC CICSPROD BASE3 96.157 12:00:00 CICSVR1.MVSLOG
JOB STEP 1
START TIME GMT STOP TIME GMT
---------------- ---------------
96.157 12:00:00 96.158 12:11:10
MVS LOG STREAMS NEEDED
----------------------
CICSVR1.MVSLOG
CICSVR - LOG OF LOGS SCAN UTILITY DATE : 96/06/07 TIME : 11:01:09 PAGE : 3
INFORMATION FOR A FORWARD RECOVERY OF CICSPROD.ACC.VSAMB
========================================================
JOB STEP 1
Benefits of BWO
Many CICS applications depend on their data sets being open for update over a
long period of time. Normally, you cannot take a backup of the data set while the
data set is open. Thus, if a failure occurs that requires forward recovery, all updates
that have been made to the data set since it was opened must be recovered. This
means that you must keep all forward recovery logs that have been produced since
the data set was opened. A heavily used data set that has been open for update for
several days or weeks may need much forward recovery.
The BWO facility, together with other system facilities and products, allows you to
take a backup copy of a VSAM data set while it remains open for update. Then,
only the updates that have been made since the last backup copy was taken need
to be recovered. This could considerably reduce the amount of forward recovery
that is needed.
To use concurrent copy, specify the CONCURRENT keyword when you use
DFSMShsm to dump BWO data sets.
When used with BWO, the DFSMSdss concurrent copy function allows backups of
VSAM key-sequenced data sets to be taken with integrity, even while control-area
and control-interval splits and data set additions (new extents or add-to-end) are
occurring.
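When DFSMSdss is invoked directly, an equivalent concurrent copy request might look like the following job step sketch; the data set name and output DD are hypothetical:

```
//DSSDUMP  EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//BACKUP   DD DSN=HSM.BACKUP.DATASETA,DISP=(NEW,CATLG),
//            UNIT=TAPE
//SYSIN    DD *
  DUMP DATASET(INCLUDE(CICS.DATASETA)) -
       OUTDDNAME(BACKUP)               -
       CONCURRENT
/*
```

The CONCURRENT keyword requests that the dump use the concurrent copy function, subject to the hardware support described above.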
The DFSMSdfp™ IGWAMCS2 callable services module must be installed in the link
pack area (LPA). The IGWABWO module, supplied in SYS1.CSSLIB, must be
installed in the LPA, or SYS1.CSSLIB must be included in the link list (do not
include the library in the STEPLIB or JOBLIB library concatenations).
Hardware requirements
The concurrent copy function is supported by the IBM 3990 Model 3 with the
extended platform and the IBM 3990 Model 6 control units.
BWO is supported at the VSAM sphere level; thus you cannot take BWO backup
copies of some sphere components and not others. The first data set opened for
update against a VSAM base cluster determines the BWO eligibility for the sphere.
This includes base clusters that are accessed through a VSAM path key. For
example, if the first data set is defined as eligible for BWO, CICS fails the file-open
operation for any subsequent data set that is opened for update against that cluster
and which is not defined as eligible for BWO.
You can take BWO volume backups if all data sets that are open for update on the
volume are eligible for BWO.
DFSMSdfp indicates in the ICF catalog that a split has occurred. DFSMShsm
and DFSMSdss check the ICF catalog at the start and end of a backup. If a
split is in progress at the start of a backup, the backup is not taken. If a split
has occurred during a backup, or a split is still in progress at the end of a
backup, the backup is discarded.
So, to take a BWO backup successfully, the normal time between splits must
be greater than the time taken for DFSMShsm and DFSMSdss to take a
backup of the data set.
Data tables: You can use BWO with CICS-maintained data table base clusters.
However, you cannot use BWO with user-maintained data tables, because no
forward recovery support is provided.
AIX: CICS normally uses a base key or a path key to access data in a VSAM base
cluster data set. It is also possible, but not normal, for CICS to access AIX records
by specifying the AIX name as the data set name. If an AIX data set is used in this
way, you cannot define the AIX as eligible for BWO. Instead, the AIX adopts the
BWO characteristics already defined for the VSAM sphere.
Defining BWO in the ICF catalog requires DFSMS 1.3. For data sets that are
accessed in RLS mode, the BWO option must be defined in the ICF catalog.
Data sets that are accessed only in non-RLS mode can be defined either in the ICF
catalog or in the CICS file definition. If BWO is defined in the ICF catalog definition,
it overrides any BWO option defined in a CICS file definition.
If you specify BWO(TYPECICS), you must also specify LOG(ALL) and a forward
recovery log stream name, LOGSTREAMID(logstream_name).
If you omit the BWO parameter from the DEFINE statement, by default it is
UNDEFINED in the ICF catalog, and the BWO attribute from the CICS file resource
definition is used.
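For example, an access method services DEFINE of a BWO-eligible, forward-recoverable cluster might include the following (the cluster and log stream names are invented for the example):

```
DEFINE CLUSTER (NAME(CICSTS.ACCOUNTS.FILEA) -
        INDEXED -
        KEYS(6 0) -
        RECORDSIZE(80 80) -
        CYLINDERS(5 1) -
        LOG(ALL) -
        LOGSTREAMID(CICSUSER.FWDLOG.FILEA) -
        BWO(TYPECICS)) -
       DATA (NAME(CICSTS.ACCOUNTS.FILEA.DATA)) -
       INDEX (NAME(CICSTS.ACCOUNTS.FILEA.INDEX))
```

If the BWO parameter were omitted here, the catalog value would default to UNDEFINED and the CICS file definition would apply, as described above.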
BACKUPTYPE(STATIC), the default, defines a file as not eligible for BWO. In this
case, if DFSMShsm is to back up a data set, all CICS files currently open for
update against that data set must be closed before the backup can start.
All files opened against the same VSAM base cluster must have the same
BACKUPTYPE value. That value is established by the first file opened against the
cluster; it is stored in the CICS data set name block (DSNB) for the cluster. If the
value for a subsequent file does not match, the file-open operation fails.
The BACKUPTYPE value in the DSNB persists across warm and emergency
restarts. It is removed by a CICS cold start (unless a backout failure occurs) or by
issuing EXEC CICS SET DSNAME ACTION(REMOVE) (or the CEMT equivalent)
for the base cluster data set. Before you do this, all files that are open against the
base cluster (directly or through a path definition) must be closed, and the DSNB
must have a FILECOUNT of zero and be in NORMALBKOUT state.
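For example, using the CEMT equivalent (the data set name is invented for the example):

```
CEMT SET DSNAME(CICSTS.ACCOUNTS.FILEA) REMOVE
```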
The BACKUPTYPE attribute is ignored for user-maintained data table base clusters,
because no forward recovery support is provided.
The procedures are simpler when using BWO than when not, because:
v Backups can be taken more frequently, so there are fewer forward recovery logs
to manage. This also reduces the amount of processing that is required to
forward recover the data set.
v The point from which forward recovery should start is recorded in the ICF
catalog. The forward recovery utility uses this value to automate this part of the
forward recovery process. This recovery point is saved with the backup copy and
subsequently replaced in the ICF catalog when the backup copy is restored. For
more information, see “Recovery point (non-RLS mode)” on page 230.
v During data set restore and forward recovery processing, CICS does not allow
files to be opened for the same data set.
Batch jobs
During the batch window between CICS sessions, it is possible for batch jobs
to update a data set. Because batch jobs do not create forward recovery logs,
any update that is made while a BWO backup is in progress, or after it has
completed, would not be forward recoverable. Therefore, non-BWO backups
should be taken, at least:
v At the start of the batch window so that, if a batch job fails, it can be
restarted; and
v At the end of the batch window, for use with CICS forward recovery
processing.
All update activity against the data set must be quiesced while the backups
are taken, so that DFSMShsm can have exclusive control of the data set.
Data set security: CICS must have RACF ALTER authority for all data sets that
are defined as BACKUPTYPE(DYNAMIC), because CICS needs to update the
BWO attributes in the ICF catalog. The authority must apply either to the data set or
to the ICF catalog in which the data set is cataloged. For information on defining
RACF ALTER authority, see the CICS RACF Security Guide.
The attribute flags and recovery point are managed by VSAM in the primary data
VSAM volume record (VVR) of the base cluster, which is in the VSAM volume data
set (VVDS). There is only one primary base cluster VVR for each VSAM sphere,
which is why BWO eligibility is defined at the sphere level. For more information,
see DFSMS/MVS Managing Catalogs.
File opening
Different processing is done for each of the following three cases when a file is
opened for update:
v First file opened for update against a cluster
v Subsequent file opened for update against a cluster while a previous file is still
open (that is, the update use count in the DSNB is not zero)
v Subsequent file opened for update against a cluster after all previous files have
been closed (that is, the update use count in the DSNB is zero)
Also, if the file-open operation fails during BWO processing, the ACB remains
open, so CICS closes the ACB before indicating that the file-open operation has
failed. This affects CICS statistics.
If the file is opened for read-only, and the data set ICF catalog indicates that the
data set is back-level, the file-open operation fails.
CICS calls the DFSMSdfp IGWABWO callable service to inquire on the BWO
attributes in the ICF catalog.
v If the file is defined with BACKUPTYPE(DYNAMIC), CICS calls IGWABWO to
make the data set eligible for BWO and to set the recovery point to the current
time. CICS also sets the BACKUPTYPE attribute in the DSNB to indicate
eligibility for BWO.
However, if the ICF catalog indicates that the data set is already eligible for
BWO, IGWABWO just sets the recovery point to the current time. CICS issues a
message, and you can discard any BWO backups already taken in a previous
batch window.
v If the file was defined with BACKUPTYPE(STATIC) and the ICF catalog indicates
that the data set is already ineligible for BWO, CICS sets the BACKUPTYPE
attribute in the DSNB to indicate ineligibility for BWO.
11. On an initial access to the data set after a restart, this check also helps CICS to detect a previous abend of an application or
system during control-area or control-interval splits, and helps prevent possible data-integrity problems, particularly when
alternate indexes exist as part of the upgrade set. This helps the administrator to decide whether to recover from forward-recovery
logs or perform transaction backout.
If BWO support is requested and the appropriate level of DFSMSdfp (as described
in “What is needed to use BWO” on page 218) is not correctly installed on the
processor where CICS is running, the first file-open operation fails with error
message DFHFC5811. Subsequent file-open operations are allowed, but CICS
issues an attention message.
CICS also issues an attention message (DFHFC5813) for the first file-open
operation if the appropriate levels of DFSMShsm and DFSMSdss are not installed
on the processor where CICS is running. Ensure that they are installed on the
processor where the BWO backup is to be made.
The ICF catalog has already been validated and set by the first file-open operation,
so CICS just checks the BACKUPTYPE attributes in the FCT and the DSNB. If they
are not consistent, the file-open operation fails with error messages. You must then
either correct the CEDA definition, or REMOVE the DSNB after closing all files that
are open against the base cluster data set.
CICS checks the BACKUPTYPE attributes in the FCT and the DSNB. If they are
inconsistent, the file-open operation fails with error messages. Either correct the
CEDA definition, or REMOVE the DSNB after closing all files that are open against
the base cluster data set. If the BACKUPTYPE attributes are consistent, CICS uses
the DFSMSdfp IGWABWO callable service to inquire on the BWO attributes in the
ICF catalog.
v If the file was defined with BACKUPTYPE(DYNAMIC), IGWABWO makes the
data set eligible for BWO and sets the recovery point to the current time.
However, if the ICF catalog indicates that the data set is already eligible for
BWO, IGWABWO resets the recovery point to the current time. CICS issues an
attention message; you can discard any BWO backup copies already taken in a
previous batch window.
v If the file was defined with BACKUPTYPE(STATIC) and the ICF catalog indicates
that the data set is already ineligible for BWO, the ICF catalog is not updated.
However, if the ICF catalog indicates that the data set is currently eligible for
BWO, IGWABWO makes it ineligible for BWO and sets the recovery point to the
current time.
If a VSAM split has occurred while a file was open, CICS calls IGWABWO at
file-close time to update the ICF catalog to prevent further BWO backups. If
DFSMShsm is currently taking a BWO backup, it will discard the backup at the end
of the backup operation.
The BWO attributes indicating that a split has occurred and that the data set is
eligible for BWO are restored when the next file is opened for update against the
data set. This ensures that DFSMShsm takes the correct action if a split occurs
during backup processing, which spans CICS updating a file (causing a VSAM
split), the file being closed, and then the file being reopened.
When CICS is terminated by a normal shutdown, all CICS files are closed. The ICF
catalog is updated to suppress BWO activity during the batch window between
CICS sessions. After an uncontrolled or immediate shutdown, or if there is a failure
during file closing, the data set remains open and the BWO flags are not reset. See
“Shutdown and restart”.
The data set is now ineligible for BWO backups because CICS file control has
reset the BWO attributes in the ICF catalog. But, until all open ACBs in the
sphere are closed, VSAM will not close the internal ACBs that are open for
update, and thus it is not possible to take non-BWO backups either.
If a failure occurs during shutdown so that CICS is unable to close a file, CICS
issues warning message DFHFC5804. In this case, check the BWO attributes and,
if necessary, either use DFSMSdfp IGWABWO callable service to set the attributes,
or discard any BWO backups that are taken in the batch window that follows the
shutdown.
Use the DFSMSdfp IGWABWO callable service to set the attributes (see “An
assembler program that calls DFSMS callable services” on page 232 for an
example of how to do this). Do not run any batch jobs before the next CICS restart.
If you do, for releases prior to DFSMS 1.2, discard any BWO backups that are
taken in the batch window.
For DFSMS 1.2 onward, the controls in DFSMS allow DFSMSdss to detect a
backup that is invalidated if CICS applications are shut down (normally or
abnormally) and if batch programs are executed that update the data set while the
BWO backup is in progress. This allows DFSMSdss to discard the backup, which
prevents DFSMShsm from erroneously discarding the oldest valid backup from the
inventory maintained by DFSMShsm.
Restart
At the next CICS restart, the following BWO-dependent actions can occur when a
data set is opened for update:
v If the BWO attributes in the ICF catalog are set to the ‘BWO enabled’ state, CICS
issues warning message DFHFC5808.
v If the file has been redefined as BACKUPTYPE(STATIC), and:
– CICS has been cold started
– The original base cluster DSNB has been discarded
– The BWO attributes in the ICF catalog are set to the ‘BWO enabled’ state
CICS issues warning message DFHFC5809.
v If the file has been redefined as BACKUPTYPE(STATIC), and:
– CICS has been warm or emergency restarted
– The original base cluster DSNB has been kept
CICS fails the file-open operation with message DFHFC5807.
The DFSMS processing at the start of backup is dependent on the DFSMS release
level. For releases prior to DFSMS 1.2, DFSMSdss first checks the BWO attributes
in the ICF catalog to see if the data set is eligible for BWO. If it is, the backup is
made without attempting to obtain exclusive control and serialize updates to this
data set.
For DFSMS 1.2 onward, DFSMSdss first tries to obtain exclusive control of the data
set. If DFSMSdss succeeds, an enqueued form of backup takes place. If this
serialization fails, DFSMSdss checks the BWO attributes in the ICF catalog to see if
the data set is eligible for BWO. If it is, a BWO backup is attempted. If it is not
eligible, the backup attempt fails.
This change prevents DFSMSdss from starting a BWO backup after CICS has
terminated abnormally.
At the end of the BWO backup, DFSMSdss again checks the BWO attributes. If the
data set is no longer eligible for BWO, the backup is discarded. Events that cause
this situation are:
v File closing during BWO, which sets the ‘BWO disabled’ state
v Start of VSAM split, which sets the ‘BWO enabled and VSAM split in progress’
state
v End of VSAM split, which sets the ‘BWO enabled/disabled and VSAM split
occurred’ state.
At the start of a backup, if the state is ‘BWO enabled and VSAM split occurred’,
DFSMSdss resets the state to ‘BWO enabled’. Then, if another VSAM split occurs,
the backup will be discarded at the end of the backup operation.
For DFSMS 1.2 onward, access method services supports the import and export of
BWO attributes.
DFSMSdfp must now disallow the pending change to ‘BWO enabled’ (and
DFSMSdss must fail the backup) to prevent the possibility of a BWO backup
being taken during a subsequent batch window.
DFSMSdss also resets the recovery point in the ICF catalog to the value it
contained when the backup was made. This ensures that forward recovery starts at
the correct point. This value should not be used for forward recovery of a non-BWO
backup.
Data sets
Each data set after-image record on the log is associated with a file name.
However, there might be many files associated with the same data set; therefore,
when a file is opened, the association between the file and the data set is recorded
on the forward recovery log by a tie-up record. This information is also written to the
log of logs. For non-BWO backups, the forward recovery utility uses this tie-up
record to apply the log records to the correct data sets.
When a BWO backup is taken of a data set opened in RLS mode, DFSMSdss
notifies each CICS region that has an open ACB for the data set. On receipt of this
notification, each CICS region allows all units of work with updates for the data set
to complete, writes the tie-up records to the forward recovery log and the log of
logs, and then replies to DFSMSdss.
The recovery point is stored in the ICF catalog. It is initialized when the first file is
opened for update against the data set, and updated during activity-keypoint
processing and when the file is closed.
The recovery point is not the time of the current keypoint, as there might still be
some uncommitted log records that have not been forced. Instead, it is the time of
the start of the last keypoint that wrote a complete set of tie-up records and that
completed earlier than the oldest uncommitted write to a forward recovery log.
Notes:
1. Only one new recovery point is calculated during an activity keypoint. It is used
for all data sets that are open for update and eligible for BWO. Thus a
long-running task updating a data set that uses BWO will affect the amount of
forward recovery needed for all data sets.
2. If you disable activity keypointing in your system (by specifying the AKPFREQ
system initialization parameter as zero), BWO support is seriously affected
because, after the file-open operation, no more tie-up records are written and
the recovery point is not updated. So, forward recovery of a BWO data set must
take place from the time that the data set was first opened for update.
Forward recovery
CICSVR fully supports BWO and the log of logs. If you do not use CICSVR, ensure
that your forward recovery utility is able to:
v Recognize whether a backup was made with BWO or not. The DFSMShsm
ARCXTRCT macro can be used to determine this.
v Use the BWO attributes and recovery point in the ICF catalog. It should use the
DFSMSdfp IGWABWO callable service to do this. See “An assembler program
that calls DFSMS callable services” on page 232 for a sample program.
v Recognize the additional tie-up records on the forward recovery logs and,
optionally, recognize tie-up records on the log of logs. These are written so that
the forward recovery utility can quickly find the correct position without having to
scan the entire forward recovery log.
v Recognize after-images that have already been applied to the data set.
The forward recovery utility should ALLOCATE, with DISP=OLD, the data set that is
to be recovered. This prevents other jobs from accessing a back-level data set, and
ensures that data managers such as CICS are not still using the data set.
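A minimal JCL sketch, assuming a hypothetical forward recovery utility program named FWDRECOV and an invented data set name:

```
//RECOVER  EXEC PGM=FWDRECOV
//* DISP=OLD requests exclusive use of the data set for this job
//FILEA    DD DSN=CICSTS.ACCOUNTS.FILEA,DISP=OLD
```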
Before the data set is opened, the forward recovery utility should set the BWO
attribute flags to the ‘Forward recovery started but not ended’ state. This prevents
CICS from opening the data set while forward recovery is in progress.
The forward recovery utility should use the BWO time stamp for the data set in the
ICF catalog, set by DFSMSdss when the data set is restored, to determine the
point in the forward recovery log at which to start forward recovery.
If forward recovery completes successfully, the utility should set the BWO attributes
to the ‘BWO disabled’ state before the data set is closed.
If forward recovery does not complete successfully, the utility should leave the BWO
attributes in the ‘Forward recovery started but not ended’ state to ensure that CICS
does not open a back-level data set.
Alternatively, if you use a VSAM forward recovery utility that does not update the
BWO attributes during forward recovery, you can use the EXEC CICS SET DSNAME
commands (or the CEMT equivalent) to reset the backup-restored-by-DFSMShsm
state before any subsequent CICS file control access.
To forward recover such a data set, after the restore, use the access method
services ALTER or DELETE command to remove or delete the AIXs from the
upgrade set. After forward recovery has completed successfully, you can re-create
the upgrade set by rebuilding the AIXs with the access method services BLDINDEX
command.
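In access method services commands, the sequence might look like this (the names are invented; ALTER NOUPGRADE is shown as one way to take an AIX out of the upgrade set without deleting it):

```
  /* Before forward recovery: take the AIX out of the upgrade set */
  ALTER CICSTS.ACCOUNTS.FILEA.AIX NOUPGRADE

  /* (restore the backup and run forward recovery here)           */

  /* Afterward: rebuild the AIX and return it to the upgrade set  */
  BLDINDEX INDATASET(CICSTS.ACCOUNTS.FILEA) -
           OUTDATASET(CICSTS.ACCOUNTS.FILEA.AIX)
  ALTER CICSTS.ACCOUNTS.FILEA.AIX UPGRADE
```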
This chapter describes what you need to consider when planning for disaster
recovery in a CICS environment. It covers the following topics:
v “Why have a disaster recovery plan?”
v “Disaster recovery testing” on page 239
v “Six tiers of solutions for off-site recovery” on page 239
v “Disaster recovery and high availability” on page 250
v “Disaster recovery facilities” on page 254
v “CICS emergency restart considerations” on page 256
v “Final summary” on page 257
You may consider some, or all, of your CICS applications as vital to the operations
of your business. If all applications are vital, you need to recover all the data that
your CICS systems use. If only some of your applications are vital, you have to
determine what data is associated with those applications.
The length of time between the disaster and recovery of your vital applications is a
key factor. If your business cannot continue without access to your CICS data, your
disaster recovery plan must take this into account.
The time-sensitive nature of your recovered data can be an overriding factor. If your
vital application is a high volume, high change application, recovering week-old data
may not be acceptable—even hour-old data may be unacceptable. You may need
to recover right up to the point of the disaster.
When you are planning for disaster recovery, consider the cost of being unable to
operate your business for a period of time. You have to consider the number of lost
transactions, and the future loss of business as your customers go elsewhere. Your
disaster recovery solution should not be more expensive than the loss from the
disaster, unless your business would fail as a result of the outage caused by a
disaster.
What is the real cost of your disaster recovery plan? Keeping track of the total cost
of your disaster recovery procedures allows you to look at available options and
judge the benefits and cost of each.
Your disaster recovery plan should include some performance considerations once
you have recovered. Unless your disaster site mirrors your production site, you
should determine acceptable levels of throughput and transaction times while
operating from the disaster recovery site. The length of time it takes to recover your
primary site can also determine what your disaster recovery site has to support in
the interim.
Figure 40 shows that risk, speed of recovery, and completeness of recovery have to
be balanced against cost.
Whenever possible, you should choose a remote site recovery strategy that you can
test frequently. Testing your disaster recovery process has the following benefits:
v You know that your recovery plan works.
v You discover problems, mistakes, and errors, and can resolve them before you
have to use the procedures.
v Your staff are educated in executing tests and managing disaster recovery
situations.
v Your recovery plan becomes a living document.
v Members of your IT organization recognize the necessity of such a disaster
recovery concept, and plan accordingly.
v Awareness of your disaster recovery strategy is increased.
After each test, use the detailed logs and schedules to identify any errors in your
procedures, and eliminate them. Retest the changed procedures, and then
incorporate them into your recovery plan. After changing the recovery plan,
completely revise all existing disaster recovery documents.
Make frequent tests early in the implementation of your disaster recovery plan.
Once you have removed the major problems, you can test less frequently. The
frequency will depend on:
v The interval between major changes in your hardware and software
v How current you want to keep the recovery plan
v How critical and sensitive your business processes are: the more critical they
are, the more frequently testing may be required.
12. In a paper presented to IBM SHARE, prepared by the SHARE automated remote site recovery task force.
Approach: data is not sent off-site; any recovery depends on on-site local records.
Recovery: this is the least expensive option, but it provides no disaster recovery
capability.
Any disaster recovery capability would depend on recovering on-site local records.
For most true disasters, such as fire or earthquake, you would not be able to
recover your data or systems if you implemented a tier 0 solution.
Your disaster recovery plan has to include information to guide the staff responsible
for recovering your system, from hardware requirements to day-to-day operations.
The backups required for off-site storage must be created periodically. After a
disaster, your data can only be as up-to-date as the last backup—daily, weekly,
monthly, or whatever period you chose—because your recovery action is to restore
the backups at the recovery site (when you have one).
This method may not meet your requirements if you need your online systems to be
continuously available.
v If you require data from two or more subsystems to be synchronized, for
example, from DB2 and VSAM, you would have to stop updates to both, then
copy both sets of data.
v Such subsystems would both be unavailable for update until the longest running
copy is finished.
v If you require a point-in-time copy for all your data, your application may be
unavailable for updates for a considerable time.
The major benefit of tier 1 is the low cost. The major costs are the storage site and
transportation.
Tier 1
Tier 1 provides a very basic level of disaster recovery. You will lose data in the
disaster, perhaps a considerable amount. However, tier 1 allows you to
recover and provide some form of service at low cost. You must assess
whether the loss of data and the time taken to restore a service will prevent
your company from continuing in business.
Approach: backups, procedures, and an inventory are kept off-site; recovery
consists of restoring the system and data at the standby site and reconnecting to
the network.
Recovery: standby site costs; recovery time is reduced.
Figure 43. Disaster recovery tier 2: physical removal to a ‘hot’ standby site
Tier 2 is similar to tier 1. The difference in tier 2 is that a secondary site already has
the necessary hardware installed, which can be made available to support the vital
applications of the primary site. The same process is used to back up and store the
vital data; therefore the same availability issues exist at the primary site as for tier
1.
The benefits of tier 2 are the elimination of the time it takes to obtain and set up the
hardware at the secondary site, and the ability to test your disaster recovery plan.
The drawback is the expense of providing, or contracting for, a ‘hot’ standby site.
Tier 2
Tier 2, like tier 1, provides a very basic level of disaster recovery. You will lose
data in the disaster, perhaps a considerable amount. However, tier 2 allows
you to recover and provide some form of service at low cost and more rapidly
than tier 1. You must assess whether the loss of data and the time taken to
restore a service will prevent your company from continuing in business.
Standby site
Approach: backups, procedures, and an inventory are kept off-site; recovery
consists of restoring the system and data and reconnecting to the network.
Recovery: standby site plus bulk data transfer costs; recovery in hours.
The drawbacks are the cost of reserving the DASD at the hot standby site, and that
you must have a link to the hot site, and the required software, to transfer the data.
Procedures and documentation still have to be available at the hot site, but this can
be achieved electronically.
Tier 3
Tier 3, like tiers 1 and 2, provides a basic level of disaster recovery. You will
lose data in the disaster, perhaps a considerable amount of data. The
advantage of tier 3 is that you should be able to provide a service to your
users quite rapidly. You must assess whether the loss of data will prevent your
company from continuing in business.
Tiers 0 to 3:
v Tier 0: no off-site data
v Tier 1: truck access method, tapes taken to a cold site
v Tier 2: truck access method, tapes taken to a hot site
v Tier 3: electronic vaulting
Tiers 0 to 3 cover the disaster recovery plans of many CICS users. With the
exception of tier 0, they employ the same basic design using a point-in-time copy of
the necessary data. That data is then moved off-site to be used when required after
a disaster.
(Figure: two sites, each with VTAM and a 3745 communication controller,
connected by channel extenders and ESCON links between 3990 DASD controllers.)
Tier 4 closes the gap between the point-in-time backups and current online
processing recovery. Under a tier 4 recovery plan, site one acts as a backup to site
two, and site two acts as a backup to site one.
Tier 4 duplicates the vital data of each system at the other’s site. You must transmit
image copies of data to the alternate site on a regular basis. You must also transmit
CICS system logs and forward recovery logs, after they have been archived.
Similarly, you must transmit logs for IMS and DB2 subsystems. Your recovery action
is to perform a forward recovery of the data at the alternate site. This allows
recovery up to the point of the latest closed log for each subsystem.
You must also copy to the alternate site other vital data that is necessary to run
your system. For example, you must copy your load libraries and JCL. You can do
this on a regular basis, or when the libraries and JCL change.
Tier 4
Tier 4 provides a more advanced level of disaster recovery. You will lose data
in the disaster, but only a few minutes’ or hours’ worth. You must assess
whether the loss of data will prevent your company from continuing in
business, and what the cost of lost data will be.
(Figure: two sites, each with VTAM and a 3745 communication controller,
connected by channel extenders and ESCON links between 3990 DASD controllers.)
Approach: shadow data is synchronized by remote two-phase commit.
Recovery: recovery time in minutes; only in-flight data is lost; reconnect the
network.
Other data required to run your vital application has to be sent to the secondary site
as well. For example, current load libraries and documentation have to be kept
up to date at the secondary site.
The benefits of tier 5 are fast recovery using vital data that is current. The
drawbacks are:
v The cost of maintaining and running two sites.
Tier 5
A Tier 5 solution is appropriate for a custom-designed recovery plan with
special applications. Because these applications must be designed to use this
solution, it cannot be implemented at most CICS sites.
(Figure: two sites, each with VTAM and a 3745 communication controller,
connected by channel extenders and ESCON links between 3990 DASD controllers.)
Approach: local and remote copies are updated; dual online storage; network
switching capability.
Recovery: the most expensive option; instantaneous recovery; non-disruptive
terminal switch.
Tier 6, minimal to zero data loss, is the ultimate level of disaster recovery.
There are two tier 6 solutions, one hardware-based and the other software-based.
For details of the hardware and software available for these solutions, see
“Peer-to-peer remote copy (PPRC) and extended remote copy (XRC)” on page 250
(hardware) and “Remote Recovery Data Facility” on page 252 (software).
The hardware solution involves the use of IBM 3990-6 DASD controllers with
remote and local copies of vital data. There are two flavors of the hardware
solution: (1) peer-to-peer remote copy (PPRC), and (2) extended remote copy
(XRC).
The software solution involves the use of Remote Recovery Data Facility (RRDF).
RRDF applies to data sets managed by CICS file control and to the DB2, IMS,
IDMS, CPCS, ADABAS, and SuperMICR database management systems, collecting
real-time log and journal data from them. RRDF is supplied by E-Net Corporation
and is available from IBM as part of the IBM Cooperative Software Program.
The drawbacks are the cost of running two sites and the communication overhead.
If you are using the hardware solution based on 3990-6 controllers, you are limited
in how far away your recovery site can be. If you use PPRC, updates are sent from
the primary 3990-6 directly to the 3990-6 at your recovery site using enterprise
systems connection (ESCON®) links between the two 3990-6 devices. The 3990-6
devices can be up to 43 km (26.7 miles) apart subject to quotation.
If you use XRC, the 3990-6 devices at the primary and recovery sites can be
attached to the XRC DFSMS/MVS host at up to 43 km (26.7 miles) using ESCON
links (subject to quotation). If you use three sites, one for the primary 3990, one to
support the XRC DFSMS/MVS host, and one for the recovery 3990, this allows a
total of 86 km (53.4 miles) between the 3990s. If you use channel extenders with
XRC, there is no limit on the distance between your primary and remote site.
For RRDF there is no limit to the distance between the primary and secondary
sites.
Tier 6
Tier 6 provides a very complete level of disaster recovery. You must assess
whether the cost of achieving this level of disaster recovery is justified for your
company.
This summary shows the three tiers and the various tools for each that can help
you reach your required level of disaster recovery.
Tier 4 relies on automation to send backups to the remote site. NetView® provides
the ability to schedule work in order to maintain recoverability at the remote site.
Tier 6 is divided into two sections: software solutions for specific access methods
and database management systems, and hardware solutions for any data.
RRDF can provide very high currency and recoverability for a wide range of data.
However, it does not cover all the data in which you may be interested. For
example, RRDF does not support load module libraries.
The 3990-6 hardware solution is independent of the data being stored on the
DASD. PPRC and XRC can be used for databases, CICS files, logs, and any other
data sets that you need to ensure complete recovery on the remote system.
Because PPRC and XRC are hardware solutions, they are application- and
data-independent. The data can be DB2, VSAM, IMS, or any other type of data. All your
vital data on DASD can be duplicated off-site. This reduces the complexity of
recovery. These solutions can also make use of redundant array of independent
disks (RAID) DASD to deliver the highest levels of data integrity and availability.
PPRC synchronously shadows updates from the primary to the secondary site. This
ensures that no committed data is lost: the data committed on the secondary DASD
always matches the data committed on the primary. The time taken for the
synchronous write to the secondary unit
has an impact on your application, increasing response time. This additional time
(required for each write operation) is approximately equivalent to a DASD fastwrite
operation. Because the implementation of PPRC is almost entirely in the 3990-6,
you must provide enough capacity for cache and non-volatile storage (NVS) to
ensure optimum performance.
In the event of a disaster, check the state of all secondary volumes to ensure data
consistency against the shadowed log data sets. This ensures that the same
sequence of updates is maintained on the secondary volumes as on the primary
volumes up to the point of the disaster. Because PPRC and XRC do not require
restores or forward recovery of data, your restart procedures on the secondary
system may be the same as for a short-term outage at the primary site, such as a
power outage.
When running with PPRC or XRC, the data you replicate along with the databases
includes:
v CICS logs and forward recovery logs
v CICS system definition (CSD) data sets, SYSIN data sets, and load libraries
v Recovery control (RECON) and restart data set (RDS) for IMS
v IMS write-ahead data set (WADS) and IMS online log data set (OLDS)
v ACBLIB for IMS
CICS applications can use non-DASD storage for processing data. If your
application depends on this type of data, be aware that PPRC and XRC do not
handle it.
For more information on PPRC and XRC, see Planning for IBM Remote Copy,
SG24-2595-00, and DFSMS/MVS Remote Copy Administrator’s Guide and
Reference.
PPRC or XRC?
You need to choose between PPRC and XRC for transmitting data to your backup
site. This section compares the two methods to help you make your choice.
The synchronous nature of PPRC ensures that, if you have a disaster at your main
site, you lose only inflight transactions. The committed data recorded at the remote
site is the same as that at the primary site.
The asynchronous nature of XRC means that the remote site may have no
knowledge of transactions that ran at the primary site, or may not know that they
completed successfully. XRC ensures that the data recorded at the remote site is
consistent (that is, it looks like a snapshot of the data at the primary site, but the
snapshot may be several seconds old).
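The contrast can be illustrated with a simple sketch (plain Python, not an IBM interface; the classes and names are invented for illustration): a synchronous mirror completes each write on both copies before returning, while an asynchronous mirror acknowledges the write immediately and ships updates later, in order.

```python
# Illustrative sketch only: contrasts PPRC-style synchronous mirroring,
# where a write completes only after the secondary is updated, with
# XRC-style asynchronous mirroring, where the secondary applies updates
# later but in sequence, so it is always a consistent (if older) snapshot.

class SyncMirror:
    """PPRC-style: primary and secondary are updated in lock-step."""
    def __init__(self):
        self.primary, self.secondary = {}, {}

    def write(self, key, value):
        self.primary[key] = value
        self.secondary[key] = value   # synchronous remote write adds latency

class AsyncMirror:
    """XRC-style: updates are queued and drained in order later."""
    def __init__(self):
        self.primary, self.secondary = {}, {}
        self._queue = []              # ordered queue of unsent updates

    def write(self, key, value):
        self.primary[key] = value
        self._queue.append((key, value))   # acknowledged before remote apply

    def drain(self, n=None):
        """Apply queued updates in sequence; the secondary lags but stays consistent."""
        n = len(self._queue) if n is None else n
        for key, value in self._queue[:n]:
            self.secondary[key] = value
        del self._queue[:n]

sync_m, async_m = SyncMirror(), AsyncMirror()
sync_m.write("acct1", 100)
async_m.write("acct1", 100)
async_m.write("acct2", 200)
async_m.drain(1)   # disaster strikes before the second update is shipped
# The synchronous secondary matches its primary exactly; the asynchronous
# secondary is a consistent but older snapshot, missing the last update.
```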
XRC supports the running of concurrent copy sessions to its secondary volumes.
This enables you to create a point-in-time copy of the data.
PPRC and XRC also allow you to migrate data to another or larger DASD of similar
geometry, behind the same or different control units at the secondary site. This can
be done for workload management or DASD maintenance, for example.
Forward recovery
Whether you use PPRC or XRC, you have two fundamental choices:
1. You can pair volumes containing both the data and the log records
2. You can pair only the volumes containing the log records
In the first case you should be able to perform an emergency restart of your
systems, and restore service very rapidly. In the second case you would need to
use the log records, along with an image copy transmitted separately, to perform a
forward recovery of your data, followed by an emergency restart.
Pairing the data volumes, as well as the log volumes, costs more because you
have more data flowing between the sites, and therefore you need a greater
bandwidth to support the flow. In theory you can restart much faster than if you
have to perform a forward recovery. When deciding which to use, you must
determine whether this method is significantly faster, and whether you think it is
worth the additional costs.
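The forward recovery path can be sketched as follows (an illustrative Python model, not a CICS or DFSMS utility; all names are invented): restore the image copy, then re-apply, in sequence, the after-images logged since the copy was taken.

```python
# Illustrative sketch only: forward recovery rebuilds a data set by
# restoring a recent image copy and then re-applying the after-images
# recorded on the forward recovery log, in the order they were written.

def forward_recover(image_copy, log_records, copy_time):
    """Replay logged updates made after the image copy was taken."""
    data = dict(image_copy)                      # start from the backup
    for timestamp, key, after_image in log_records:
        if timestamp > copy_time:                # only post-copy updates
            data[key] = after_image
    return data

image_copy = {"rec1": "A", "rec2": "B"}          # copy taken at time 100
log = [(90, "rec1", "A"),                        # already in the copy
       (120, "rec2", "B2"),                      # update after the copy
       (130, "rec3", "C")]                       # insert after the copy
recovered = forward_recover(image_copy, log, copy_time=100)
# recovered == {"rec1": "A", "rec2": "B2", "rec3": "C"}
```

An emergency restart then backs out any updates belonging to units of work that were in flight at the point of failure.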
The programs that run in the CICS or database management system address
space use MVS cross-memory services to move log data to the RRDF address
space. The RRDF address space maintains a virtual storage queue at the primary
site for records awaiting transmission, with provision for spill files if communication
between the primary and secondary sites is interrupted. Remote logging is only as
effective as the currency of the data that is sent off-site; RRDF transports log
stream data to a remote location in real time, within seconds of the log operation at
the primary site.
When the RRDF address space at the remote site receives the log data, it formats
it into archived log data sets. Once data has been stored at the remote site, you
can use it as needed to meet business requirements. The recovery process uses
standard recovery utilities. For most data formats, first use the log data transmitted
by RRDF in conjunction with a recent image copy of the data sets and databases
that you have to recover. Then perform a forward recovery. If you are using DB2,
you have the option of applying log records to the remote copies of your databases
as RRDF receives the log records.
If you use DB2, you can use the optional RRDF log apply feature. With this feature
you can maintain a logically consistent “shadow” copy of a DB2 database at the
remote site. The RRDF log apply feature updates the shadow database at selected
intervals, using log data transmitted from the primary site. Thus restart time is
shortened because the work needed after a disaster is minimal. The currency of the
data depends on the log data transmitted by RRDF and on how frequently you run
the RRDF log apply feature. The RRDF log apply feature also enhances data
availability, as you have read access to the shadow copy through a remote site DB2
subsystem. RRDF supports DB2 remote logging for all environments, including
TSO, IMS, CICS, batch, and call attach.
At least two RRDF licenses are required to support the remote site recovery
function, one for the primary site and one for the remote site. For details of RRDF
support needed for the CICS Transaction Server, see “Remote Recovery Data
Facility support” on page 255.
You should ensure that a senior manager is designated as the disaster recovery
manager. The recovery manager must make the final decision whether to switch to
a remote site, or to try to rebuild the local system (this is especially true if you have
adopted a solution that does not have a warm or hot standby site).
You must decide who will run the remote site, especially during the early hours of
the disaster. If your recovery site is a long way from the primary site, many of your
staff will be in the wrong place.
Finally, and to underline the seriousness of disaster recovery, some of your key
staff may be severely injured and unable to take part in the recovery operation.
Your plans need to identify backups for all your key staff.
See also the ITSC Disaster Recovery Library: Planning Guide for information that
should help you set up a detailed disaster recovery plan if you use a combination of
databases, such as DB2 and IMS.
The remote site needs to be able to import log streams transmitted by the recovery
resource manager; this too is provided by MVS system logger services. Importation
is the process whereby log blocks resident in one log stream (the source log
stream) are created in (or copied into) another log stream (the target log stream)
while maintaining (in the target log stream) the same MVS system logger-assigned
log block id and GMT time stamp that are assigned in the source log stream. The
result of a log stream import is a copy of the source log stream in the target log
stream.
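A minimal sketch of importation (illustrative Python, not the MVS system logger interface; the record layout is an assumption):

```python
# Illustrative sketch only: importing a log stream copies each log block
# into the target stream while preserving the logger-assigned block id
# and GMT time stamp, so the target is an exact copy of the source.

def import_log_stream(source, target):
    """Copy source log blocks into target, keeping ids and time stamps."""
    for block in source:
        target.append({"block_id": block["block_id"],   # preserved, not reassigned
                       "gmt_time": block["gmt_time"],   # preserved
                       "data": block["data"]})

source_stream = [{"block_id": 1, "gmt_time": "12:00:01", "data": "update A"},
                 {"block_id": 2, "gmt_time": "12:00:05", "data": "update B"}]
target_stream = []
import_log_stream(source_stream, target_stream)
# target_stream is now a copy of source_stream, block for block
```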
It is possible for system log records to be transmitted to the remote site for
units-of-work that subsequently become indoubt-failed or backout-failed. The log
records for failed units of work could be moved to the secondary log stream at the
local site. However, resource managers such as RRDF are aware of such
movements and act accordingly.
If a disaster occurs at the primary site, your disaster recovery procedures should
include recovery of VSAM data sets at the designated remote recovery site. You
can then emergency restart the CICS regions at the remote site so that they can
backout any uncommitted data. Special support is needed for RLS because record
locks, which were protecting uncommitted data from being updated by other
transactions at the primary site, are not present at the remote site. You invoke CICS
RLS support for off-site recovery using the OFFSITE system initialization parameter.
The OFFSITE system initialization parameter protects all RLS data sets until all
emergency restart recovery is complete. You can specify this OFFSITE system
initialization parameter at run-time only—it cannot be specified and assembled in
the SIT—and it is valid only when START=AUTO is specified. You specify
OFFSITE=YES when restarting CICS regions at a remote site when recovery
involves VSAM data sets opened in RLS mode.
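As an illustrative sketch only (the SYSIN content and comment layout shown here are assumptions, not taken from this book), the run-time overrides for an off-site emergency restart might look like this:

```
* SYSIN overrides for restarting a CICS region at the remote site.
* OFFSITE can be specified only at run time (it cannot be assembled
* in the SIT) and is valid only with START=AUTO.
START=AUTO,
OFFSITE=YES,
.END
```

With these overrides, file control holds all new RLS access during startup until recovery is complete and an operator replies to the prompting message.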
When you specify OFFSITE=YES, file control performs RLS off-site recovery
processing, under which file control does not allow any new RLS access during
startup. With RLS recovery in operation during an emergency restart, CICS does
not allow any data sets to be accessed in RLS mode until:
v CICS has completed all outstanding RLS recovery work.
v CICS file control has issued a message requesting a “GO” response from an
operator when all CICS regions have completed RLS recovery processing.
v An operator has replied to the message.
Operators should reply to the message only when all the CICS regions being
restarted with OFFSITE=YES have issued the message, indicating that they have
all completed their RLS recovery.
Final summary
Your disaster recovery plan will be truly tested only at a very difficult time for your
business—during a disaster. Careful planning and thorough testing may mean the
difference between a temporary inconvenience and going out of business.
The goal of your disaster recovery plan is to get your CICS systems back online.
The currency of the data and the time it takes to get back online is a function of
which disaster recovery tier you use. Unless legal requirements for disaster
recovery dictate the type of disaster recovery you must have, the choice of tiers is
usually a business decision.
backup-while-open (BWO). A facility that allows a backup copy of a VSAM data
set to be made while the data set remains open for update.
When you take a backup-while-open (BWO) copy of a data set, only the updates
that are made after the BWO need to be recovered in the event of a disk failure.
This considerably reduces the amount of forward recovery that is needed.
BWO. See backup-while-open.
CICSplex. (1) A CICS complex—a set of interconnected CICS regions acting as
resource managers, and combining to provide a set of coherent services for a
customer’s business needs. In its simplest form, a CICSplex operates within a
single MVS image. Within a Parallel Sysplex environment, a CICSplex can be
configured across all the MVS images in the sysplex.
The CICS regions in the CICSplex are generally linked through the CICS
interregion communication (IRC) facility, using either the XM or IRC access method
general log. A general purpose log stream used by CICS for any of the following:
v Forward recovery logs
v Autojournals
v User journals
Contrast with system log.
heuristic decision. A decision that enables a transaction manager to complete a
failed in-doubt unit of work (UOW) that cannot wait for resynchronization after
recovery from the failure.
Under the two-phase commit protocol, the loss of the coordinator (or loss of
connectivity) that occurs while a UOW is in-doubt theoretically forces a participant
in the UOW to wait forever for resynchronization. While a subordinate waits in
doubt, resources remain locked and, in a CICS Transaction Server region, the
failed UOW is shunted pending resolution.
Applying a heuristic decision provides an arbitrary solution for resolving a failed
in-doubt UOW as an alternative to waiting for the return of the coordinator. In
CICS, the heuristic decision can be made in advance by
13. Transaction Processing: Concepts and Techniques (1993)
system log. … User transactions are allowed to write their own recovery records
to the system log for use in an emergency restart, but the system log cannot be
used for forward recovery log or autojournal records.
Contrast with general log.
system logger. A central logging facility provided by OS/390 (and also MVS/ESA
SP 5.2). The MVS system logger provides an integrated MVS logging facility that
can be used by system and subsystem components. For example, it is used by the
CICS log manager.
two-phase commit. In CICS, the protocol observed when taking a syncpoint in a
distributed UOW. At syncpoint, all updates to recoverable resources must be either
committed or backed out. At this point, the coordinating recovery manager gives
each subordinate participating in the UOW an opportunity to vote on whether its
part of the UOW is in a consistent state and can be committed. If all participants
vote “yes”, the
unit of work (UOW). … Thus a UOW is completed when a transaction takes a
syncpoint, which occurs either when a transaction issues an explicit syncpoint
request, or when CICS takes an implicit syncpoint at the end of the transaction. In
the absence of user syncpoints explicitly taken within the transaction, the entire
transaction is one UOW.
In earlier releases of CICS, this was referred to as a logical unit of work (LUW).
UOW. See unit-of-work.
UOW id. Unit-of-work identifier. CICS uses two unit of work identifiers for two
purposes, one short and one long:
Short UOW id
An 8-byte value that CICS passes to resource managers, such as DB2 and
VSAM, for lock management purposes.
Long UOW id
A 27-byte value that CICS uses to identify a distributed UOW. This is built
from a short UOW id prefixed by two 1-byte length fields and by the
fully-qualified NETNAME of the CICS region.
Sending your comments to IBM
If you especially like or dislike anything about this book, please use one of the
methods listed below to send your comments to IBM.
Feel free to comment on what you regard as specific errors or omissions, and on
the accuracy, organization, subject matter, or completeness of this book.
Please limit your comments to the information in this book and the way in which the
information is presented.
When you send comments to IBM, you grant IBM a nonexclusive right to use or
distribute your comments in any way it believes appropriate, without incurring any
obligation to you.
You can send your comments to IBM in any of the following ways:
v By mail, to this address:
Information Development Department (MP095)
IBM United Kingdom Laboratories
Hursley Park
WINCHESTER,
Hampshire
United Kingdom
v By fax:
– From outside the U.K., after your international access code use
44–1962–870229
– From within the U.K., use 01962–870229
v Electronically, use the appropriate network ID:
– IBM Mail Exchange: GBIBM2Q9 at IBMMAIL
– IBMLink: HURSLEY(IDRCF)
– Internet: idrcf@hursley.ibm.com