Troubleshooting Guide - Solaris
April 2016
Veritas Infoscale™ Troubleshooting Guide
Last updated: 2016-04-28
Legal Notice
Copyright © 2016 Veritas Technologies LLC. All rights reserved.
Veritas, the Veritas Logo, and NetBackup are trademarks or registered trademarks of Veritas
Technologies LLC or its affiliates in the U.S. and other countries. Other names may be
trademarks of their respective owners.
This product may contain third party software for which Veritas is required to provide attribution
to the third party (“Third Party Programs”). Some of the Third Party Programs are available
under open source or free software licenses. The License Agreement accompanying the
Software does not alter any rights or obligations you may have under those open source or
free software licenses. Refer to the third party legal notices document accompanying this
Veritas product or available at:
https://www.veritas.com/about/legal/license-agreements
The product described in this document is distributed under licenses restricting its use, copying,
distribution, and decompilation/reverse engineering. No part of this document may be
reproduced in any form by any means without prior written authorization of Veritas Technologies
LLC and its licensors, if any.
THE DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED
CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED
WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR
NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH
DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. VERITAS TECHNOLOGIES LLC
SHALL NOT BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES IN
CONNECTION WITH THE FURNISHING, PERFORMANCE, OR USE OF THIS
DOCUMENTATION. THE INFORMATION CONTAINED IN THIS DOCUMENTATION IS
SUBJECT TO CHANGE WITHOUT NOTICE.
The Licensed Software and Documentation are deemed to be commercial computer software
as defined in FAR 12.212 and subject to restricted rights as defined in FAR Section 52.227-19
"Commercial Computer Software - Restricted Rights" and DFARS 227.7202, et seq.
"Commercial Computer Software and Commercial Computer Software Documentation," as
applicable, and any successor regulations, whether delivered by Veritas as on premises or
hosted services. Any use, modification, reproduction release, performance, display or disclosure
of the Licensed Software and Documentation by the U.S. Government shall be solely in
accordance with the terms of this Agreement.
Veritas Technologies LLC
500 E Middlefield Road
Mountain View, CA 94043
http://www.veritas.com
Technical Support
Technical Support maintains support centers globally. All support services will be delivered
in accordance with your support agreement and the then-current enterprise technical support
policies. For information about our support offerings and how to contact Technical Support,
visit our website:
https://www.veritas.com/support
You can manage your Veritas account information at the following URL:
https://my.veritas.com
If you have questions regarding an existing support agreement, please email the support
agreement administration team for your region as follows:
Japan: CustomerCare_Japan@veritas.com
Documentation
Make sure that you have the current version of the documentation. Each document displays
the date of the last update on page 2. The document version appears on page 2 of each
guide. The latest documentation is available on the Veritas website:
https://sort.veritas.com/documents
Documentation feedback
Your feedback is important to us. Suggest improvements or report errors or omissions to the
documentation. Include the document title, document version, chapter title, and section title
of the text on which you are reporting. Send feedback to:
doc.feedback@veritas.com
You can also see documentation information or ask a question on the Veritas community site:
http://www.veritas.com/community/
Gathering LLT and GAB information for support analysis .............. 174
Gathering IMF information for support analysis ........................... 175
Message catalogs ................................................................. 175
Troubleshooting the VCS engine .................................................... 177
HAD diagnostics ................................................................... 177
HAD is not running ................................................................ 177
HAD restarts continuously ...................................................... 178
DNS configuration issues cause GAB to kill HAD ........................ 178
Seeding and I/O fencing ......................................................... 178
Preonline IP check ................................................................ 179
Troubleshooting Low Latency Transport (LLT) .................................. 179
LLT startup script displays errors .............................................. 179
LLT detects cross links usage .................................................. 180
LLT link status messages ....................................................... 180
Unexpected db_type warning while stopping LLT that is configured
over UDP ...................................................................... 183
Troubleshooting Group Membership Services/Atomic Broadcast
(GAB) ................................................................................. 184
Delay in port reopen .............................................................. 184
Node panics due to client process failure ................................... 184
Troubleshooting VCS startup ........................................................ 185
"VCS: 10622 local configuration missing" and "VCS: 10623 local
configuration invalid" ....................................................... 185
"VCS:11032 registration failed. Exiting" ..................................... 185
"Waiting for cluster membership." ............................................. 186
Troubleshooting Intelligent Monitoring Framework (IMF) ..................... 186
Troubleshooting service groups ..................................................... 188
VCS does not automatically start service group ........................... 188
System is not in RUNNING state .............................................. 189
Service group not configured to run on the system ....................... 189
Service group not configured to autostart ................................... 189
Service group is frozen .......................................................... 189
Failover service group is online on another system ...................... 189
A critical resource faulted ....................................................... 189
Service group autodisabled .................................................... 190
Service group is waiting for the resource to be brought online/taken
offline ........................................................................... 190
Service group is waiting for a dependency to be met. ................... 191
Service group not fully probed. ................................................ 191
Service group does not fail over to the forecasted system ............. 191
Service group does not fail over to the BiggestAvailable system
even if FailOverPolicy is set to BiggestAvailable .................... 192
Restoring metering database from backup taken by VCS .............. 193
Prepare for your next installation or upgrade:
■ List product installation and upgrade requirements, including operating system versions, memory, disk space, and architecture.
■ Analyze systems to determine if they are ready to install or upgrade Veritas products.
■ Download the latest patches, documentation, and high availability agents from a central repository.
■ Access up-to-date compatibility lists for hardware, software, databases, and operating systems.
Improve efficiency:
■ Find and download patches based on product version and platform.
■ List installed Veritas products and license keys.
■ Tune and optimize your environment.
Note: Certain features of SORT are not available for all products. Access to SORT
is available at no extra cost.
See “Letting vxgetcore find debugging data automatically (the easiest method)”
on page 16.
■ If you know the path to the core file (and optionally the binary file), you can
specify them on the command line.
See “Running vxgetcore when you know the location of the core file” on page 18.
■ You can run vxgetcore without any options, and the script prompts you for all
the information it needs.
See “Letting vxgetcore prompt you for information” on page 18.
When you work with vxgetcore, keep in mind the following:
■ vxgetcore is contained in the VRTSspt support package, which is a collection
of tools to analyze and troubleshoot systems. When you install VRTSspt, the
vxgetcore script is installed on the path /opt/VRTSspt/vxgetcore/.
■ Before you run vxgetcore, contact Veritas Technical Support and get a case
ID for your issue.
If you know the case number before you run the script, you can specify it on the
command line. Getting the case ID first saves you time later. When you send
the tar file to Veritas for analysis, the tar file name must include a case number. This
approach is faster than generating the file, getting the case ID, and renaming
the file later.
■ You do not have to analyze or work with the tar file vxgetcore generates.
Make sure that the file name includes the case ID, and FTP the file to your local
FTP site.
■ For the latest information on vxgetcore, see the README.vxgetcore file at
/opt/VRTSspt/vxgetcore/.
Note: Because you do not specify the core file name or binary file name with this
option, vxgetcore makes its best effort to find the correct files. If vxgetcore finds
more than one core file or binary file, it chooses the latest file in the first directory
it finds. If you do not think that these are the correct files, and you know the location
and names of the files, run vxgetcore specifying the core file and binary file names.
See “Running vxgetcore when you know the location of the core file” on page 18.
Before you run vxgetcore, contact Veritas Technical Support and get a case ID
for your issue. You'll need to include the case ID in the tar file name before you
send it to Veritas.
To let vxgetcore find data automatically
1 If you do not know the location of the core file, enter the following command.
vxgetcore looks for a core file in the current working directory, followed by
other core locations. If you use the -C option, substitute your information for
the given syntax:
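The command itself is not reproduced in this copy. Assuming that the -a option triggers this automatic search (an assumption based on the description of -a later in this chapter), the invocation would look something like this:
# /opt/VRTSspt/vxgetcore/vxgetcore -a [-C Veritas_case_ID]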
2 vxgetcore finds the core file and searches a predetermined list of directories
for the associated binary file, library files, and other debugging data. It then
creates a tar file in this format:
/tmp/VRTSgetcore.xxxx/coreinfo.CASEID.hostname.date_time.tar.gz
3 Review the system output. vxgetcore lists the core file name, binary file name,
and the other files it gathers. If these are not the file you intended, rerun the
command and specify the file names.
4 In the tar file creation message, note the checksum of the new tar file.
5 (Optional) If you did not specify your case ID on the command in step 1, rename
the tar file name to include your case ID number.
6 FTP the file to your local FTP site.
7 Contact your Veritas Technical Support representative and tell them the
checksum and the FTP site to which you uploaded the file.
If you know the location of the core file, you can use the -c option along with the
-a option. In this case, vxgetcore uses the specified core file and automatically
finds the debugging information that is related to this core file. If you are running
vxgetcore as part of a script, the script does not pause for user input.
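A sketch of this combined usage, assuming this option order; the core file path and case ID are placeholders:
# /opt/VRTSspt/vxgetcore/vxgetcore -a -c /path/core_file \
[-C Veritas_case_ID]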
Running vxgetcore when you know the location of the core file
If you know the location of the core file or binary file, specify them on the vxgetcore
command line. If you only specify the core file name, vxgetcore searches for the
corresponding binary file.
Before you run vxgetcore, contact Veritas Technical Support and get a case ID
for your issue. You'll need to include the case ID in the tar file name before you
send it to Veritas.
To gather debugging data when you know the location of the core file
1 Enter one of the following commands, substituting your information for the
given syntax:
# /opt/VRTSspt/vxgetcore/vxgetcore -c /path/core_file \
[-C Veritas_case_ID]
or
# /opt/VRTSspt/vxgetcore/vxgetcore -c /path/core_file \
-b /path/binary_file [-C Veritas_case_ID]
2 After vxgetcore displays a WARNING message about its usage, press Enter
to continue.
3 (Optional.) If you did not specify a binary file name in step 1, and vxgetcore
finds more than one binary that matches the core file, it prompts you to select
one binary file from the list and enter the full path.
vxgetcore gathers the core file, binary file, library file, and any other available
debugging information. It creates a tar file in this format:
/tmp/VRTSgetcore.xxxx/coreinfo.CASEID.hostname.date_time.tar.gz
4 In the tar file creation message, note the checksum of the new tar file.
5 (Optional.) If you did not specify your case ID on the command in step 1, rename
the tar file name to include your case ID number.
6 FTP the file to your local FTP site.
7 Contact your Veritas Technical Support representative and tell them the
checksum and the FTP site to which you uploaded the file.
If you run vxgetcore without specifying file locations, it searches a predetermined list of directories for the core file and its matching binary, starting with the present working directory. If vxgetcore finds more than one binary that matches the core file, it prompts you to select one binary file from the list and enter the full path.
Before you run vxgetcore, contact Veritas Technical Support and get a case ID
for your issue. You'll need to include the case ID in the tar file name before you
send it to Veritas.
To let vxgetcore prompt you for binary file information
1 Enter the following command. If you use the -C option, substitute your
information for the given syntax:
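The command is not reproduced in this copy. Because this method runs vxgetcore without any file options, the invocation would presumably be as simple as the following; the case ID is a placeholder:
# /opt/VRTSspt/vxgetcore/vxgetcore [-C Veritas_case_ID]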
2 After vxgetcore displays a WARNING message about its usage, press Enter
to continue.
3 (Optional.) If vxgetcore finds more than one binary file, it displays a list of
possible matches. Enter the file name you want included in the tar file, and
press Enter.
vxgetcore gathers the core file, binary file, library file, and any other available
debugging information. It creates a tar file in this format:
/tmp/VRTSgetcore.xxxx/coreinfo.CASEID.hostname.date_time.tar.gz
4 In the tar file creation message, note the checksum of the new tar file.
5 (Optional.) If you did not specify your case ID on the command in step 1, rename
the tar file name to include your case ID number.
6 FTP the file to your local FTP site.
7 Contact your Veritas Technical Support representative and tell them the
checksum and the FTP site to which you uploaded the file.
Section 1
Troubleshooting Veritas File System
Marking an inode bad: Inodes can be marked bad if an inode update or a directory-block update fails. In these types of failures, the file system does not know what information is on the disk, and considers all the information that it finds to be invalid. After an inode is marked bad, the kernel still permits access to the file name, but any attempt to access the data in the file or change the inode fails.
Disabling transactions: If the file system detects an error while writing the intent log, it disables transactions. After transactions are disabled, the files in the file system can still be read or written, but no block or inode frees or allocations, structural changes, directory entry changes, or other changes to metadata are allowed.
Disabling a file system: If an error occurs that compromises the integrity of the file system, VxFS disables itself. If the intent log fails or an inode-list error occurs, the super-block is ordinarily updated (setting the VX_FULLFSCK flag) so that the next fsck does a full structural check. If this super-block update fails, any further changes to the file system can cause inconsistencies that are undetectable by the intent log replay. To avoid this situation, the file system disables itself.
Warning: You can use fsck to check and repair a VxFS file system; however,
improper use of the command can result in data loss. Do not use this command
unless you thoroughly understand the implications of running it. If you have any
questions about this command, contact Veritas Technical Support.
■ Use the metasave script for your operating system platform to generate a copy of the file system's metadata, then replay the metasave and run fsck with the -y option to evaluate possible structural damage.
Warning: If you have any questions about these procedures or do not fully
understand the implications of the fsck command, contact Technical Support.
For more information on the fsck command, see the fsck_vxfs(1M) manual page.
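For reference, a full structural check of a VxFS file system on Solaris typically takes a form like the following; the disk group and volume names are placeholders:
# fsck -F vxfs -y /dev/vx/rdsk/mydg/vol1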
■ Restarting volumes after recovery when some nodes in the cluster become
unavailable
Note: Apparent disk failure may not be due to a fault in the physical disk media or
the disk controller, but may instead be caused by a fault in an intermediate or
ancillary component such as a cable, host bus adapter, or power supply.
The hot-relocation feature in VxVM automatically detects disk failures, and notifies
the system administrator and other nominated users of the failures by electronic
mail. Hot-relocation also attempts to use spare disks and free disk space to restore
redundancy and to preserve access to mirrored and RAID-5 volumes.
For more information about administering hot-relocation, see the Storage Foundation
Administrator’s Guide.
Recovering from failures of the boot (root) disk and repairing the root (/) and usr file systems require the use of special procedures.
See “VxVM and boot disk failure” on page 62.
The following example output shows one volume, mkting, as being unstartable:
The following example shows a disabled volume, vol, which has two clean
plexes, vol-01 and vol-02, each with a single subdisk:
(Figure: the plex state cycle, showing plexes marked CLEAN at shutdown (vxvol stop). PS = plex state; PKS = plex kernel state.)
For more information about plex states, see the Storage Foundation Administrator’s
Guide.
At system startup, volumes are started automatically and the vxvol start task
makes all CLEAN plexes ACTIVE. At shutdown, the vxvol stop task marks all
ACTIVE plexes CLEAN. If all plexes are initially CLEAN at startup, this indicates
that a controlled shutdown occurred and optimizes the time taken to start up the
volumes.
Figure 3-2 shows additional transitions that are possible between plex states as a
result of hardware problems, abnormal system shutdown, and intervention by the
system administrator.
(Figure 3-2: additional plex state transitions, involving shutdown (vxvol stop), putting a plex online (vxmend on), resynchronizing data (vxplex att), uncorrectable I/O failures, and failed resynchronizations; these can leave a plex in the IOFAIL or STALE state with plex kernel state DETACHED. PS = plex state; PKS = plex kernel state.)
When first created, a plex has state EMPTY until the volume to which it is attached
is initialized. Its state is then set to CLEAN. Its plex kernel state remains set to
DISABLED and is not set to ENABLED until the volume is started.
After a system crash and reboot, all plexes of a volume are ACTIVE but marked
with plex kernel state DISABLED until their data is recovered by the vxvol resync
task.
A plex may be taken offline with the vxmend off command, made available again
using vxmend on, and its data resynchronized with the other plexes when it is
reattached using vxplex att. A failed resynchronization or uncorrectable I/O failure
places the plex in the IOFAIL state.
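As an illustration of the commands named above, where mydg, vol, and vol-02 are placeholder disk group, volume, and plex names:
# vxmend -g mydg off vol-02
# vxmend -g mydg on vol-02
# vxplex -g mydg att vol vol-02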
There are various actions that you can take if a system crash or I/O error leaves
no plexes of a mirrored volume in a CLEAN or ACTIVE state.
See “Recovering an unstartable mirrored volume” on page 31.
See “Failures on RAID-5 volumes” on page 36.
2 To recover the other plexes in a volume from the CLEAN plex, the volume must
be disabled, and the other plexes must be STALE. If necessary, make any other
CLEAN or ACTIVE plexes STALE by running the following command on each of
these plexes in turn:
2 Place the plex into the STALE state using this command:
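The command does not appear in this copy; it would normally be the vxmend fix operation, with the disk group and plex names as placeholders:
# vxmend -g mydg fix stale vol-02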
3 If there are other ACTIVE or CLEAN plexes in the volume, use the following
command to reattach the plex to the volume:
4 If the volume is not already enabled, use the following command to start it, and perform any resynchronization of the plexes in the background:
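The command is missing from this copy; based on the -o bg behavior described later in this chapter, it would typically look like this, with placeholder disk group and volume names:
# vxvol -g mydg -o bg start vol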
If the data in the plex was corrupted, and the volume has no ACTIVE or CLEAN
redundant plexes from which its contents can be resynchronized, it must be
restored from a backup or from a snapshot image.
replacing the failed disk. Any volumes that are listed as Unstartable must be
restarted using the vxvol command before restoring their contents from a backup.
To forcibly restart a disabled volume
◆ Type the following command:
The -f option forcibly restarts the volume, and the -o bg option resynchronizes
its plexes as a background task. For example, to restart the volume myvol so
that it can be restored from backup, use the following command:
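The example command is not reproduced in this copy; combining the options described above, it would typically take this form (the disk group name is a placeholder):
# vxvol -g mydg -o bg -f start myvol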
Warning: Do not unset the failing flag if the reason for the I/O errors is unknown.
If the disk hardware truly is failing, and the flag is cleared, there is a risk of data
loss.
# vxdisk list
2 Use the vxedit set command to clear the flag for each disk that is marked
as failing (in this example, mydg02):
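The command does not appear in this copy; clearing the failing flag with vxedit typically looks like this (the disk group name is a placeholder):
# vxedit -g mydg set failing=off mydg02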
3 Use the vxdisk list command to verify that the failing flag has been
cleared:
# vxdisk list
# vxdisk list
DEVICE TYPE DISK GROUP STATUS
c1t1d0s2 auto:sliced mydg01 mydg online
c1t2d0s2 auto:sliced mydg02 mydg online
- - mydg03 mydg failed was: c1t3d0s2
- - mydg04 mydg failed was: c1t4d0s2
2 Once the fault has been corrected, the disks can be discovered by using the
following command to rescan the device list:
# /usr/sbin/vxdctl enable
# /etc/vx/bin/vxreattach
Reattachment can fail if the original (or another) cause for the disk failure still
exists.
You can use the vxreattach -c command to check whether reattachment is possible without actually performing the operation; instead of reattaching, it displays the disk group and disk media name where the disk can be reattached.
You can use the command vxreattach -br to recover STALE volumes.
See the vxreattach(1M) manual page.
System failures
RAID-5 volumes are designed to remain available with a minimum of disk space
overhead, if there are disk failures. However, many forms of RAID-5 can have data
loss after a system failure. Data loss occurs because a system failure causes the
data and parity in the RAID-5 volume to become unsynchronized. Loss of
synchronization occurs because the status of writes that were outstanding at the
time of the failure cannot be determined.
If a loss of sync occurs while a RAID-5 volume is being accessed, the volume is
described as having stale parity. The parity must then be reconstructed by reading
all the non-parity columns within each stripe, recalculating the parity, and writing
out the parity stripe unit in the stripe. This must be done for every stripe in the
volume, so it can take a long time to complete.
Besides the vulnerability to failure, the resynchronization process can tax the system
resources and slow down system operation.
RAID-5 logs reduce the damage that can be caused by system failures, because
they maintain a copy of the data being written at the time of the failure. The process
of resynchronization consists of reading that data and parity from the logs and
writing it to the appropriate areas of the RAID-5 volume. This greatly reduces the
amount of time needed for a resynchronization of data and parity. It also means
that the volume never becomes truly stale. The data and parity for all stripes in the
volume are known at all times, so the failure of a single disk cannot result in the
loss of the data within the volume.
Disk failures
An uncorrectable I/O error occurs when a disk failure, cabling problem, or other fault causes the data on a disk to become unavailable. For a RAID-5 volume, this means that a
subdisk becomes unavailable. The subdisk cannot be used to hold data and is
considered stale and detached. If the underlying disk becomes available or is
replaced, the subdisk is still considered stale and is not used.
If an attempt is made to read data contained on a stale subdisk, the data is
reconstructed from data on all other stripe units in the stripe. This operation is called
a reconstructing-read. This is a more expensive operation than simply reading the
data and can result in degraded read performance. When a RAID-5 volume has
stale subdisks, it is considered to be in degraded mode.
A RAID-5 volume in degraded mode can be recognized from the output of the
vxprint -ht command as shown in the following display:
The volume r5vol is in degraded mode, as shown by the volume state, which is
listed as DEGRADED. The failed subdisk is disk02-01, as shown by the MODE flags;
d indicates that the subdisk is detached, and S indicates that the subdisk’s contents
are stale.
A disk containing a RAID-5 log plex can also fail. The failure of a single RAID-5 log
plex has no direct effect on the operation of a volume provided that the RAID-5 log
is mirrored. However, loss of all RAID-5 log plexes in a volume makes it vulnerable
to a complete failure. In the output of the vxprint -ht command, failure within a
RAID-5 log plex is indicated by the plex state being shown as BADLOG rather than
LOG.
In the following example, the RAID-5 log plex r5vol-02 has failed:
■ Any existing log plexes are zeroed and enabled. If all logs fail during this process,
the start process is aborted.
■ If no stale subdisks exist or those that exist are recoverable, the volume is put
in the ENABLED volume kernel state and the volume state is set to ACTIVE. The
volume is now started.
Parity resynchronization and stale subdisk recovery are typically performed when
the RAID-5 volume is started, or shortly after the system boots. They can also be
performed by running the vxrecover command.
See “Unstartable RAID-5 volumes” on page 43.
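For example, a manual invocation might look like the following sketch; the disk group and volume names are placeholders:
# vxrecover -g mydg r5vol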
If hot-relocation is enabled at the time of a disk failure, system administrator
intervention is not required unless no suitable disk space is available for relocation.
Hot-relocation is triggered by the failure and the system administrator is notified of
the failure by electronic mail.
Hot relocation automatically attempts to relocate the subdisks of a failing RAID-5
plex. After any relocation takes place, the hot-relocation daemon (vxrelocd) also
initiates a parity resynchronization.
In the case of a failing RAID-5 log plex, relocation occurs only if the log plex is
mirrored; the vxrelocd daemon then initiates a mirror resynchronization to recreate
the RAID-5 log plex. If hot-relocation is disabled at the time of a failure, the system
administrator may need to initiate a resynchronization or recovery.
Note: Following severe hardware failure of several disks or other related subsystems underlying a RAID-5 plex, it may only be possible to recover the volume by removing the volume, recreating it on hardware that is functioning correctly, and restoring the contents of the volume from a backup.
This output lists the volume state as NEEDSYNC, indicating that the parity needs to
be resynchronized. The state could also have been SYNC, indicating that a
synchronization was attempted at start time and that a synchronization process
should be doing the synchronization. If no such process exists or if the volume is
in the NEEDSYNC state, a synchronization can be manually started by using the
resync keyword for the vxvol command.
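A sketch of the manual resynchronization, with the disk group name as a placeholder:
# vxvol -g mydg resync r5vol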
A RAID-5 volume that has multiple stale subdisks can be recovered in one
operation. To recover multiple stale subdisks, use the vxvol recover command
on the volume:
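The command is not reproduced in this copy; it would typically take this form (the disk group name is a placeholder):
# vxvol -g mydg recover r5vol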
Only the third case can be overridden by using the -o force option.
Subdisks of RAID-5 volumes can also be split and joined by using the vxsd split
command and the vxsd join command. These operations work the same way as
those for mirrored volumes.
RAID-5 subdisk moves are performed in the same way as subdisk moves for other
volume types, but without the penalty of degraded redundancy.
VxVM vxvol ERROR V-5-1-1236 Volume r5vol is not startable; RAID-5 plex
does not map entire volume length.
There are four stripes in the RAID-5 array. All parity is stale and subdisk disk05-00
has failed. This makes stripes X and Y unusable because two failures have occurred
within those stripes.
This qualifies as two failures within a stripe and prevents the use of the volume. In
this case, the output display from the vxvol start command is as follows:
This situation can be avoided by always using two or more RAID-5 log plexes in RAID-5 volumes. RAID-5 log plexes prevent the parity within the volume from becoming stale, which prevents this situation from occurring.
See “System failures” on page 36.
This causes all stale subdisks to be marked as non-stale. Marking takes place
before the start operation evaluates the validity of the RAID-5 volume and
what is needed to start it. You can mark individual subdisks as non-stale by
using the following command:
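The command does not appear in this copy. Assuming the vxmend fix keyword applies to subdisks in the same way (an assumption), it would look something like this, with placeholder names:
# vxmend -g mydg fix unstale subdisk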
If some subdisks are stale and need recovery, and if valid logs exist, the volume
is enabled by placing it in the ENABLED kernel state and the volume is available
for use during the subdisk recovery. Otherwise, the volume kernel state is set
to DETACHED and it is not available during subdisk recovery. This is done
because if the system were to crash or if the volume were ungracefully stopped
while it was active, the parity becomes stale, making the volume unusable. If
this is undesirable, the volume can be started with the -o unsafe start option.
The volume state is set to RECOVER, and stale subdisks are restored. As the
data on each subdisk becomes valid, the subdisk is marked as no longer stale.
If the recovery of any subdisk fails, and if there are no valid logs, the volume
start is aborted because the subdisk remains stale and a system crash makes
the RAID-5 volume unusable. This can also be overridden by using the -o
unsafe start option.
If the volume has valid logs, subdisk recovery failures are noted but they do
not stop the start procedure.
When all subdisks have been recovered, the volume is placed in the ENABLED
kernel state and marked as ACTIVE.
Automatic recovery depends on being able to import both the source and target
disk groups. However, automatic recovery may not be possible if, for example, one
of the disk groups has been imported on another host.
To recover from an incomplete disk group move
1 Use the vxprint command to examine the configuration of both disk groups.
Objects in disk groups whose move is incomplete have their TUTIL0 fields set
to MOVE.
2 Enter the following command to attempt completion of the move:
This operation fails if one of the disk groups cannot be imported because it
has been imported on another host or because it does not exist:
VxVM vxdg ERROR V-5-1-2907 diskgroup: Disk group does not exist
Use the following command on the other disk group to remove the objects that
have TUTIL0 fields marked as MOVE:
4 If only one disk group is available to be imported, use the following command
to reset the MOVE flags on this disk group:
to start. If node C has a volume that is unavailable, then vxrecover fails to start the
other volumes too.
To resolve the issue, manually start the volumes using the following command:
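A sketch of starting a volume manually; the disk group and volume names are placeholders:
# vxvol -g mydg start vol1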
This output shows the mirrored volume, vol1, its snapshot volume, SNAP-vol1,
and their respective DCOs, vol1_dco and SNAP-vol1_dco. The two disks, mydg03
and mydg04, that hold the DCO plexes for the DCO volume, vol1_dcl, of vol1 have
failed. As a result, the DCO volume, vol1_dcl, of the volume, vol1, has been
detached and the state of vol1_dco has been set to BADLOG. For future reference,
note the entries for the snap objects, vol1_snp and SNAP-vol1_snp, that point to
vol1 and SNAP-vol1 respectively.
You can use such output to deduce the name of a volume’s DCO (in this example,
vol1_dco), or you can use the following vxprint command to display the name of
a volume’s DCO:
You can use the vxprint command to check if the badlog flag is set for the DCO
of a volume as shown here:
For example:
In this example, the command returns the value on, indicating that the badlog flag
is set.
Use the following command to verify the version number of the DCO:
For example:
The command returns a value of 0, 20, or 30. The DCO version number determines
the recovery procedure that you should use.
See “Recovering a version 0 DCO volume” on page 49.
See “Recovering an instant snap DCO volume (version 20 or later)” on page 51.
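The exact vxprint invocations are not reproduced in this copy. One plausible form uses vxprint print-format fields, although the field names shown here (badlog and version) are assumptions:
# vxprint -g mydg -F%badlog vol1_dco
# vxprint -g mydg -F%version vol1_dco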
For the example output, the command would take this form:
The entry for vol1_dco in the output from vxprint now looks like this:
dc vol1_dco vol1 - - - -
For the example output, the command would take this form:
4 Use the vxassist snapclear command to clear the FastResync maps for the
original volume and for all its snapshots. This ensures that potentially stale
FastResync maps are not used when the snapshots are snapped back (a full
resynchronization is performed). FastResync tracking is re-enabled for any
subsequent snapshots of the volume.
Warning: You must use the vxassist snapclear command on all the
snapshots of the volume after removing the badlog flag from the DCO.
Otherwise, data may be lost or corrupted when the snapshots are snapped
back.
If a volume and its snapshot volume are in the same disk group, the following
command clears the FastResync maps for both volumes:
If a snapshot volume and the original volume are in different disk groups, you
must perform a separate snapclear operation on each volume:
5 To snap back the snapshot volume on which you performed a snapclear, use
the following command (after using the vxdg move command to move the
snapshot plex back to the original disk group, if necessary):
For the example output, the command would take this form:
You cannot use the vxassist snapback command because the snapclear
operation removes the snapshot association information.
For the example output, the command would take this form:
For the example output, the command would take this form:
For the example output, the command would take this form:
For the example output, the command might take this form:
The command adds a DCO volume with two plexes, and also enables DRL
and FastResync (if licensed).
For full details of how to use the vxsnap prepare command, see the Storage
Foundation Administrator’s Guide and the vxsnap(1M) manual page.
Chapter 4
Recovering from instant snapshot failure
This chapter includes the following topics:
■ Recovering from the failure of vxsnap make for full-sized instant snapshots
■ Recovering from the failure of vxsnap make for break-off instant snapshots
■ Recovering from failure of vxsnap upgrade of instant snap data change objects
(DCOs)
3 If the snapshot volume is in DISABLED state, start the snapshot volume. Enter
the following command:
4 Prepare the snapshot volume again for snapshot operations. Enter the following
command:
5 Clear the volume’s tutil0 field (if it is set). Enter the following command.
Be sure to specify the -r (recursive) option and the -f (force) option because
the volume is enabled and has plexes and logs attached to it.
Alternatively, the snapshot volume is removed automatically when the system
is next restarted.
To recover from the failure of the vxsnap make command for space-optimized
instant snapshots
1 Type the following command:
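The command is missing from this copy; given the -r and -f options described below, removing the snapshot volume would typically look like this (the disk group and snapshot volume names are placeholders):
# vxedit -g mydg -rf rm SNAP-vol1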
Be sure to specify the -r (recursive) option and the -f (force) option because
the volume is enabled and has plexes and logs attached to it.
Alternatively, the snapshot volume is removed automatically when the system
is next restarted.
If the vxsnap make operation was being performed on a prepared cache object
by specifying the cache attribute, the cache object remains intact after you
delete the snapshot. If the cachesize attribute was used to specify a new
cache object, the cache object does not exist after you delete the snapshot.
2 Clear the volume’s tutil0 field (if it is set). Enter the following command.
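The command is not reproduced in this copy; assuming the usual vxmend syntax for clearing a tutil0 field (an assumption), it would look like this, with placeholder names:
# vxmend -g mydg clear tutil0 volume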
3 Clear the source volume's tutil0 field (if it is set). Enter the following command:
After correcting the source of the error, restart the resynchronization operation.
To recover from I/O errors during resynchronization
◆ Type the following command:
If the I/O failure also affects the data volume, it must be recovered before its DCO
volume can be recovered.
See “Recovering an instant snap DCO volume (version 20 or later)” on page 51.
Use the alloc attribute to specify storage to be used for the new DCO. VxVM
creates the new DCO on the specified storage. See the vxsnap(1M) manual page
for information about storage attributes.
If both the alloc attribute and the -f option are specified, VxVM uses the storage
specified to the alloc attribute.
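A sketch of combining -f and the alloc attribute; the disk group, volume, and disk names are placeholders:
# vxsnap -g mydg -f prepare vol1 alloc=mydg05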
Chapter 5
Recovering from failed vxresize operation
This chapter includes the following topics:
The operation results in a reduced file system size, but does not change the volume
size. This behavior is expected; however, you need to correct the volume size to
match the file system size.
Workaround:
Repeat the vxresize command with the required size but without any disk
parameters. The file system is not resized again when the current file system size
and new file system size are the same. The vxresize command then calls the
vxassist command, which reduces the volume size. The file system size and the
volume size are now the same.
Chapter 6
Recovering from boot disk failure
This chapter includes the following topics:
■ Recovery by reinstallation
Note: The examples assume that the boot (root) disk is configured on the device
c0t0d0s2. Your system may be configured to use a different device.
■ usr is on a disk other than the root disk. In this case, a volume is created for
the usr partition only if you use VxVM to encapsulate the disk. Note that
encapsulating the root disk and having mirrors of the root volume is ineffective
in maintaining the availability of your system if the separate usr partition becomes
inaccessible for any reason. For maximum availability of the system, it is
recommended that you encapsulate both the root disk and the disk containing
the usr partition, and have mirrors for the usr, rootvol, and swapvol volumes.
The rootvol volume must exist in the boot disk group.
There are other restrictions on the configuration of rootvol and usr volumes.
See the Storage Foundation Administrator’s Guide.
VxVM allows you to put swap partitions on any disk; it does not need an initial swap
area during early phases of the boot process. By default, the Veritas Volume
Manager installation chooses partition 0 on the selected root disk as the root
partition, and partition 1 as the swap partition. However, the swap partition does not have to be located on the root disk. In such cases, you are advised to encapsulate the disk that contains the swap partition and to create mirrors for the swap volume. If you do not, damage to the swap partition eventually causes the system to crash; it may still be possible to boot the system, but mirroring the swapvol volume prevents such failures.
The OBP names specify the OpenBoot PROM designations. For example, on
Desktop SPARC systems, the designation sbus/esp@0,800000/sd@3,0:a indicates
a SCSI disk (sd) at target 3, lun 0 on the SCSI bus, with the esp host bus adapter
plugged into slot 0.
You can use Veritas Volume Manager boot disk alias names instead of OBP names.
Example aliases are vx-rootdisk or vx-disk01. To list the available boot devices,
use the devalias command at the OpenBoot prompt.
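For example, at the OpenBoot prompt:
ok devalias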
The filename argument is the name of a file that contains the kernel. The default
is /kernel/unix in the root partition. If necessary, you can specify another program
(such as /stand/diag) by specifying the -a flag. (Some versions of the firmware
allow the default filename to be saved in the nonvolatile storage area of the system.)
Warning: Do not boot a system running VxVM with rootability enabled using all the
defaults presented by the -a flag.
The system dump device is usually configured to be the swap partition of the root
disk. Whenever a swap subdisk is moved (by hot-relocation, or using vxunreloc)
from one disk to another, the dump device must be re-configured on the new disk.
You can use the dumpadm command to view and set the dump device.
See the dumpadm(1M) manual page.
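For example, to display the current dump configuration and then point the dump device at a new swap slice (the device path is a placeholder):
# dumpadm
# dumpadm -d /dev/dsk/c0t1d0s1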
The following are common causes for the system PROM being unable to read the
boot program from the boot drive:
■ The boot disk is not powered on.
■ The SCSI bus is not terminated.
■ There is a controller failure of some sort.
■ A disk is failing and locking the bus, preventing any disks from identifying
themselves to the controller, and making the controller assume that there are
no disks attached.
The first step in diagnosing this problem is to check carefully that everything on the
SCSI bus is in order. If disks are powered off or the bus is unterminated, correct
the problem and reboot the system. If one of the disks has failed, remove the disk
from the bus and replace it.
If no hardware problems are found, the error is probably due to data errors on the
boot disk. In order to repair this problem, attempt to boot the system from an
alternate boot disk (containing a mirror of the root volume). If you are unable to
boot from an alternate boot disk, there is still some type of hardware problem.
Similarly, if switching the failed boot disk with an alternate boot disk fails to allow
the system to boot, this also indicates hardware problems.
it, and then halts the system. For example, if the plex rootvol-01 of the root volume
rootvol on disk rootdisk is stale, vxconfigd may display this message:
VxVM vxconfigd ERROR V-5-1-1049: System boot disk does not have
a valid root plex
Please boot from one of the following disks:
Disk: disk01 Device: c0t1d0s2
vxvm:vxconfigd: Error: System startup failed
The system is down.
This informs the administrator that the alternate boot disk named disk01 contains
a usable copy of the root plex and should be used for booting. When this message
is displayed, reboot the system from the alternate boot disk.
Once the system has booted, the exact problem needs to be determined. If the
plexes on the boot disk were simply stale, they are caught up automatically as the
system comes up. If, on the other hand, there was a problem with the private area
on the disk or the disk failed, you need to re-add or replace the disk.
If the plexes on the boot disk are unavailable, you should receive mail from VxVM
utilities describing the problem. Another way to determine the problem is by listing
the disks with the vxdisk utility. If the problem is a failure in the private area of root
disk (such as due to media failures or accidentally overwriting the VxVM private
region on the disk), vxdisk list shows a display such as this:
If this message appears during the boot attempt, the system should be booted from
an alternate boot disk. While booting, most disk drivers display errors on the console
about the invalid UNIX partition information on the failing disk. The messages are
similar to this:
This indicates that the failure was due to an invalid disk partition. You can attempt
to re-add the disk.
See “Re-adding a failed boot disk ” on page 77.
However, if the reattach fails, then the disk needs to be replaced.
See “Replacing a failed boot disk” on page 78.
It is recommended that you first run fsck on the root partition as shown in this
example:
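The example is not reproduced in this copy; for a UFS root partition on the device used elsewhere in this chapter, it would typically be:
# fsck -F ufs /dev/rdsk/c0t0d0s0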
At this point in the boot process, / is mounted read-only, not read/write. Since the
entry in /etc/vfstab was either incorrect or deleted, mount / as read/write manually,
using this command:
After mounting / as read/write, exit the shell. The system prompts for a new run
level. For multi-user mode, enter run level 3:
ok boot cdrom -s
# mount /dev/dsk/c0t0d0s0 /a
3 Edit /a/etc/vfstab, and ensure that there is an entry for the /usr file system,
such as the following:
4 Shut down and reboot the system from the same root partition on which the
vfstab file was restored.
ok boot -a
2 Press Return to accept the default for all prompts except the following:
■ The default pathname for the kernel program, /kernel/unix, may not be
appropriate for your system’s architecture. If this is so, enter the correct
pathname, such as /platform/sun4u/kernel/unix, at the following prompt:
■ Enter the name of the saved system file, such as /etc/system.save at the
following prompt:
ok boot cdrom -s
# mount /dev/dsk/c0t0d0s0 /a
set vxio:vol_rootdev_is_volume=1
forceload: drv/driver
...
forceload: drv/vxio
forceload: drv/vxspec
forceload: drv/vxdmp
rootdev:/pseudo/vxio@0:0
Lines of the form forceload: drv/driver are used to forcibly load the drivers
that are required for the root mirror disks. Example driver names are pci, sd,
ssd, dad and ide. To find out the names of the drivers, use the ls command
to obtain a long listing of the special files that correspond to the devices used
for the root disk, for example:
# ls -al /dev/dsk/c0t0d0s2
This produces output similar to the following (with irrelevant detail removed):
This example would require lines to force load both the pci and the sd drivers:
forceload: drv/pci
forceload: drv/sd
4 Shut down and reboot the system from the same root partition on which the
configuration files were restored.
■ Mount one plex of the root or /usr file system, repair it, unmount it, and use
dd to copy the fixed plex to all other plexes. This procedure is not recommended
as it can be error prone.
■ Restore the system from a valid backup. This procedure does not require the
operating system to be re-installed from the base CD-ROM. It provides a simple,
efficient, and reliable means of recovery when both the root disk and its mirror
are damaged.
ok boot cdrom -s
2 Use the format command to create partitions on the new root disk (c0t0d0s2).
These should be identical in size to those on the original root disk before
encapsulation unless you are using this procedure to change their sizes. If you
change the size of the partitions, ensure that they are large enough to store
the data that is restored to them.
See the format(1M) manual page.
A maximum of five partitions may be created for file systems or swap areas
as encapsulation reserves two partitions for Veritas Volume Manager private
and public regions.
3 Use the mkfs command to make new file systems on the root and usr
partitions. For example, to make a ufs file system on the root partition, enter:
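The command is not reproduced in this copy; a typical invocation, with the size in sectors as a placeholder, would be:
# mkfs -F ufs /dev/rdsk/c0t0d0s0 size_in_sectors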
# mount /dev/dsk/c0t0d0s0 /a
5 Restore the root file system from tape into the /a directory hierarchy. For
example, if you used ufsdump to back up the file system, use the ufsrestore
command to restore it.
See the ufsrestore(1M) manual page.
6 Use the installboot command to install a bootblock device on /a.
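A sketch of a typical installboot invocation on Solaris SPARC, with the root slice from this example as the target device:
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0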
7 If the /usr file system is separate from the root file system, use the mkdir
command to create a suitable mount point, such as /a/usr/, and mount
/dev/dsk/c0t0d0s6 on it:
# mkdir -p /a/usr
# mount /dev/dsk/c0t0d0s6 /a/usr
8 If the /usr file system is separate from the root file system, restore the /usr
file system from tape into the /a/usr directory hierarchy.
9 Disable startup of VxVM by modifying files in the restored root file system.
10 Create the file /a/etc/vx/reconfig.d/state.d/install-db to prevent the
configuration daemon, vxconfigd, from starting:
# touch /a/etc/vx/reconfig.d/state.d/install-db
set vxio:vol_rootdev_is_volume=1
rootdev:/pseudo/vxio@0:0
* set vxio:vol_rootdev_is_volume=1
* rootdev:/pseudo/vxio@0:0
16 Shut down the system cleanly using the init 0 command, and reboot from
the new root disk. The system comes up thinking that VxVM is not installed.
17 If there are only root disk mirrors in the old boot disk group, remove any volumes
that were associated with the encapsulated root disk (for example, rootvol,
swapvol and usrvol) from the /dev/vx/dsk/bootdg and /dev/vx/rdsk/bootdg
directories.
18 If there are other disks in the old boot disk group that are not used as root disk
mirrors, remove files involved with the installation that are no longer needed:
# rm -r /etc/vx/reconfig.d/state.d/install-db
# vxiod set 10
# vxconfigd -m disable
# vxdctl init
Enable the old boot disk group, excluding the root disk that VxVM interprets as failed:
# vxdctl enable
Use the vxedit command (or the Veritas Enterprise Administrator (VEA)) to
remove the old root disk volumes and the root disk itself from Veritas Volume
Manager control.
19 Use the vxdiskadm command to encapsulate the new root disk and initialize
any disks that are to serve as root disk mirrors. After the required reboot, mirror
the root disk onto the root disk mirrors.
# vxdisk list
Note that the disk disk01 has no device associated with it, and has a status of
failed with an indication of the device that it was detached from. It is also possible
for the device (such as c0t0d0s2 in the example) not to be listed at all should the
disk fail completely.
In some cases, the vxdisk list output can differ. For example, if the boot disk
has uncorrectable failures associated with the UNIX partition table, a missing root
partition cannot be corrected but there are no errors in the Veritas Volume Manager
private area. The vxdisk list command displays a listing such as this:
However, because the error was not correctable, the disk is viewed as failed. In
such a case, remove the association between the failing device and its disk name
using the vxdiskadm “Remove a disk for replacement” menu item.
See the vxdiskadm (1M) manual page.
You can then perform any special procedures to correct the problem, such as
reformatting the device.
where diskname is the name of the disk that failed or the name of one of its mirrors.
The following is sample output from running this command:
From the resulting output, add the DISKOFFS and LENGTH values for the last subdisk
listed for the disk. This size is in 512-byte sectors. Divide this number by 2 for the
size in kilobytes. In this example, the DISKOFFS and LENGTH values for the subdisk
rtdg01-02 are 1,045,296 and 16,751,952, so the disk size is (1,045,296 +
16,751,952)/2, which equals 8,898,624 kilobytes or approximately 8.5 gigabytes.
2 Remove the association between the failing device and its disk name using
the “Remove a disk for replacement” function of vxdiskadm.
See the vxdiskadm (1M) manual page.
3 Shut down the system and replace the failed hardware.
4 After rebooting from the alternate boot disk, use the vxdiskadm “Replace a
failed or removed disk” menu item to notify VxVM that you have replaced the
failed disk.
5 Use vxdiskadm to mirror the alternate boot disk to the replacement boot disk.
6 When the volumes on the boot disk have been restored, shut down the system,
and test that the system can be booted from the replacement boot disk.
Recovery by reinstallation
Reinstallation is necessary if all copies of your boot (root) disk are damaged, or if
certain critical files are lost due to file system damage.
If these types of failures occur, attempt to preserve as much of the original VxVM
configuration as possible. Any volumes that are not directly involved in the failure
do not need to be reconfigured. You do not have to reconfigure any volumes that
are preserved.
Warning: You should assume that reinstallation can potentially destroy the contents
of any disks that are touched by the reinstallation process.
Note: During reinstallation, you can change the system’s host name (or host ID).
It is recommended that you keep the existing host name, as this is assumed by the
procedures in the following sections.
2 If required, use the vxlicinst command to install the Veritas Volume Manager
license key.
See the vxlicinst(1) manual page.
# touch /etc/vx/reconfig.d/state.d/install-db
# exec init S
# rm -rf /etc/vx/reconfig.d/state.d/install-db
8 Start some Veritas Volume Manager I/O daemons using the following command:
# vxiod set 10
# vxconfigd -m disable
The configuration preserved on the disks not involved with the reinstallation
has now been recovered. However, because the root disk has been reinstalled,
it does not appear to VxVM as a VM disk. The configuration of the preserved
disks does not include the root disk as part of the VxVM configuration.
If the root disk of your system and any other disks involved in the reinstallation
were not under VxVM control at the time of failure and reinstallation, then the
reconfiguration is complete at this point.
If the root disk (or another disk) was involved with the reinstallation, any volumes
or mirrors on that disk (or other disks no longer attached to the system) are
now inaccessible. If a volume had only one plex contained on a disk that was
reinstalled, removed, or replaced, then the data in that volume is lost and must
be restored from backup.
Repeat this command, using swapvol, standvol, and usr in place of rootvol,
to remove the swap, stand, and usr volumes.
2 After completing the rootability cleanup, you must determine which volumes
need to be restored from backup. The volumes to be restored include those
with all mirrors (all copies of the volume) residing on disks that have been
reinstalled or removed. These volumes are invalid and must be removed,
recreated, and restored from backup. If only some mirrors of a volume exist
on reinstalled or removed disks, these mirrors must be removed. The mirrors
can be re-added later.
Establish which VM disks have been removed or reinstalled using the following
command:
# vxdisk list
This displays a list of system disk devices and the status of these devices. For
example, for a reinstalled system with three disks and a reinstalled root disk,
the output of the vxdisk list command is similar to this:
The display shows that the reinstalled root device, c0t0d0s2, is not associated
with a VM disk and is marked with a status of error. The disks disk02 and
disk03 were not involved in the reinstallation and are recognized by VxVM
and associated with their devices (c0t1d0s2 and c0t2d0s2). The former disk01,
which was the VM disk associated with the replaced disk device, is no longer
associated with the device (c0t0d0s2).
If other disks (with volumes or mirrors on them) had been removed or replaced
during reinstallation, those disks would also have a disk device listed in error
state and a VM disk listed as not associated with a device.
3 After you know which disks have been removed or replaced, locate all the
mirrors on failed disks using the following command:
where disk is the access name of a disk with a failed status. Be sure to
enclose the disk name in quotes in the command. Otherwise, the command
returns an error message. The vxprint command returns a list of volumes
that have mirrors on the failed disk. Repeat this command for every disk with
a failed status.
The following is sample output from running this command:
4 Check the status of each volume and print volume information using the
following command:
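The command is not reproduced in this copy; for the volume discussed here it would be:
# vxprint -th v01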
The only plex of the volume is shown in the line beginning with pl. The STATE
field for the plex named v01-01 is NODEVICE. The plex has space on a disk
that has been replaced, removed, or reinstalled. The plex is no longer valid
and must be removed.
5 Because v01-01 was the only plex of the volume, the volume contents are
irrecoverable except by restoring the volume from a backup. The volume must
also be removed. If a backup copy of the volume exists, you can restore the
volume later. Keep a record of the volume name and its length, as you will
need it for the backup procedure.
Remove irrecoverable volumes (such as v01) using the following command:
# vxedit -r rm v01
6 It is possible that only part of a plex is located on the failed disk. If the volume
has a striped plex associated with it, the volume is divided between several
disks. For example, the volume named v02 has one striped plex striped across
three disks, one of which is the reinstalled disk disk01. The vxprint -th v02
command produces the following output:
The display shows three disks, across which the plex v02-01 is striped (the
lines starting with sd represent the stripes). One of the stripe areas is located
on a failed disk. This disk is no longer valid, so the plex named v02-01 has a
state of NODEVICE. Since this is the only plex of the volume, the volume is
invalid and must be removed. If a copy of v02 exists on the backup media, it
can be restored later. Keep a record of the volume name and length of any
volume you intend to restore from backup.
Remove invalid volumes (such as v02) using the following command:
# vxedit -r rm v02
7 A volume that has one mirror on a failed disk can also have other mirrors on
disks that are still valid. In this case, the volume does not need to be restored
from backup, since the data is still valid on the valid disks.
The output of the vxprint -th command for a volume with one plex on a failed
disk (disk01) and another plex on a valid disk (disk02) is similar to the
following:
This volume has two plexes, v03-01 and v03-02. The first plex (v03-01) does
not use any space on the invalid disk, so it can still be used. The second plex
(v03-02) uses space on invalid disk disk01 and has a state of NODEVICE.
Plex v03-02 must be removed. However, the volume still has one valid plex
containing valid data. If the volume needs to be mirrored, another plex can be
added later. Note the name of the volume to create another plex later.
To remove an invalid plex, use the vxplex command to dissociate and then
remove the plex from the volume. For example, to dissociate and remove the
plex v03-02, use the following command:
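A sketch of the dissociate-and-remove operation as a single vxplex invocation:
# vxplex -o rm dis v03-02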
8 After you remove all invalid volumes and plexes, you can clean up the disk
configuration. Each disk that was removed, reinstalled, or replaced (as
determined from the output of the vxdisk list command) must be removed
from the configuration.
To remove a disk, use the vxdg command. For example, to remove the failed
disk disk01, use the following command:
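A sketch of the removal, assuming disk01 is in the default (boot) disk group; add -g diskgroup if it is not:
# vxdg rmdisk disk01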
9 After you remove all the invalid disks, you can add the replacement or reinstalled
disks to Veritas Volume Manager control. If the root disk was originally under
Veritas Volume Manager control or you now wish to put the root disk under
Veritas Volume Manager control, add this disk first.
To add the root disk to Veritas Volume Manager control, use the vxdiskadm
command:
# vxdiskadm
From the vxdiskadm main menu, select menu item 2 (Encapsulate a disk).
Follow the instructions and encapsulate the root disk for the system.
10 When the encapsulation is complete, reboot the system to multi-user mode.
11 After the root disk is encapsulated, any other disks that were replaced should
be added using the vxdiskadm command. If the disks were reinstalled during
the operating system reinstallation, they should be encapsulated; otherwise,
they can be added.
12 After all the disks have been added to the system, any volumes that were
completely removed as part of the configuration cleanup can be recreated and
their contents restored from backup. The volume recreation can be done by
using the vxassist command or the graphical user interface.
For example, to recreate the volumes v01 and v02, use the following command:
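A sketch of the recreation with vxassist, using illustrative lengths and the layouts recorded before the volumes were removed:
# vxassist make v01 volume_length
# vxassist make v02 volume_length layout=stripe ncol=3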
After the volumes are created, they can be restored from backup using normal
backup/restore procedures.
13 Recreate any plexes for volumes that had plexes removed as part of the volume
cleanup. To replace the plex removed from volume v03, use the following
command:
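A sketch of the plex replacement with vxassist (the target disk argument is optional and illustrative):
# vxassist mirror v03 disk02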
After you have restored the volumes and plexes lost during reinstallation,
recovery is complete and your system is configured as it was prior to the failure.
14 Start up hot-relocation, if required, by either rebooting the system or manually
starting the relocation watch daemon, vxrelocd (this also starts the vxnotify
process).
Warning: Hot-relocation should only be started when you are sure that it will
not interfere with other reconfiguration procedures.
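A typical manual start of the daemon, assuming the default mail recipient root (confirm the path and options against the vxrelocd(1M) manual page):
# nohup vxrelocd root &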
Managing commands, tasks, and transactions
This chapter includes the following topics:
■ Command logs
■ Task logs
■ Transaction logs
Command logs
The vxcmdlog command allows you to log the invocation of other Veritas Volume
Manager (VxVM) commands to a file.
The following examples demonstrate the usage of vxcmdlog:
vxcmdlog -s 512k Set the maximum command log file size to 512K.
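Other typical invocations, assuming the standard vxcmdlog options (confirm against the vxcmdlog(1M) manual page):
vxcmdlog -l Show the current command logging settings.
vxcmdlog -m on Turn on command logging.
vxcmdlog -n 10 Set the maximum number of historic command log files to 10.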
By default command logging is turned on. Command lines are logged to the file
cmdlog, in the directory /etc/vx/log. This path name is a symbolic link to a directory
whose location depends on the operating system. If required, you can redefine the
directory which is linked.
If you want to preserve the settings of the vxcmdlog utility, you must also copy the
settings file, .cmdlog, to the new directory.
The size of the command log is checked after an entry has been written so the
actual size may be slightly larger than that specified. When the log reaches a
maximum size, the current command log file, cmdlog, is renamed as the next
available historic log file, cmdlog.number, where number is an integer from 1 up to
the maximum number of historic log files that is currently defined, and a new current
log file is created.
A limited number of historic log files is preserved to avoid filling up the file system.
If the maximum number of historic log files has been reached, the oldest historic
log file is removed, and the current log file is renamed as that file.
Each log file contains a header that records the host name, host ID, and the date
and time that the log was created.
The following are sample entries from a command log file:
Each entry usually contains a client ID that identifies the command connection to
the vxconfigd daemon, the process ID of the command that is running, a time
stamp, and the command line including any arguments.
If the client ID is 0, as in the third entry shown here, this means that the command
did not open a connection to vxconfigd.
The client ID is the same as that recorded for the corresponding transactions in the
transactions log.
Task logs
The tasks that are created on the system are logged for diagnostic purposes in the
tasklog file in the /etc/vx/log/ directory. This file logs an entry for all task-related
operations (creation, completion, pause, resume, and abort). The size of the task
log is checked after an entry has been written, so the actual size may be slightly
larger than the size specified. When the log reaches the maximum size, the current
task log file, tasklog, is renamed as the next available historic log file. Historic task
log files are named tasklog.1, tasklog.2, and so on up to tasklog.5. A maximum
of five historic log files are tracked.
Each log file contains a header that records the host name, host ID, and the date
and time that the log was created. The following are sample entries from a task log
file:
Each entry contains two lines. The first line contains the following:
■ The client ID that identifies the connection to the vxconfigd daemon.
■ The process ID of the command that causes the task state to change.
■ A timestamp.
The client ID is the same as the one recorded for the corresponding command and
transactions in the command log and the transaction log, respectively.
The second line contains task-related information in the following format:
Transaction logs
The vxtranslog command allows you to log VxVM transactions to a file.
The following examples demonstrate the usage of vxtranslog:
vxtranslog -s 512k Set the maximum transaction log file size to 512K.
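Other typical invocations, assuming the standard vxtranslog options (confirm against the vxtranslog(1M) manual page):
vxtranslog -l Show the current transaction logging settings.
vxtranslog -m on Turn on transaction logging.
vxtranslog -n 10 Set the maximum number of historic transaction log files to 10.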
By default, transaction logging is turned on. Transactions are logged to the file
translog, in the directory /etc/vx/log. This path name is a symbolic link to a
directory whose location depends on the operating system. If required, you can
redefine the directory which is linked. If you want to preserve the settings of the
vxtranslog utility, you must also copy the settings file, .translog, to the new
directory.
The size of the transaction log is checked after an entry has been written so the
actual size may be slightly larger than that specified. When the log reaches a
maximum size, the current transaction log file, translog, is renamed as the next
available historic log file, translog.number, where number is an integer from 1 up
to the maximum number of historic log files that is currently defined, and a new
current log file is created.
A limited number of historic log files is preserved to avoid filling up the file system.
If the maximum number of historic log files has been reached, the oldest historic
log file is removed, and the current log file is renamed as that file.
Each log file contains a header that records the host name, host ID, and the date
and time that the log was created.
The following are sample entries from a transaction log file:
The first line of each log entry is the time stamp of the transaction. The Clid field
corresponds to the client ID for the connection that the command opened to
vxconfigd. The PID field shows the process ID of the utility that is requesting the
operation. The Status and Abort Reason fields contain error codes if the transaction
does not complete normally. The remainder of the record shows the data that was
used in processing the transaction.
The client ID is the same as that recorded for the corresponding command line in
the command log.
See “Command logs” on page 91.
See “Association of command, task, and transaction logs” on page 96.
If there is an error reading from the settings file, transaction logging switches to its
built-in default settings. This may mean, for example, that logging remains enabled
after being disabled using the vxtranslog -m off command. If this happens, use the
vxtranslog utility to recreate the settings file, or restore the file from a backup.
Association of command, task, and transaction logs
In this example, the following request was recorded in the transaction log:
To locate the utility that issued this request, the command would be:
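A sketch of the search, assuming the transaction entry showed Clid 16728 and PID 21142 (hypothetical values):
# egrep -n 16728 /etc/vx/log/cmdlog | egrep 21142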
The output from the example shows a match at line 7310 in the command log.
Examining lines 7310 and 7311 in the command log indicates that the vxdg import
command was run on the foodg disk group:
If there are multiple matches for the combination of the client and process ID, you
can determine the correct match by examining the time stamp.
If a utility opens a conditional connection to vxconfigd, its client ID is shown as
zero in the command log, and as a non-zero value in the transaction log. You can
use the process ID and time stamp to relate the log entries in such cases.
Using the same method, you can use the PID and CLID from the task log to correlate
the entries in the task log with the command log.
Associating CVM commands issued from slave to master node
On the CVM slave node, enter the following command to identify the shipped
command from the transaction log (translog):
In this example, the following entry was recorded in the transaction log on slave
node:
To locate the utility that issued this request on the slave node, use this syntax:
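A sketch, again assuming hypothetical Clid and PID values taken from the slave node's transaction log entry:
# egrep -n 16728 /etc/vx/log/cmdlog | egrep 21142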
The output from the example shows a match at line 7310 in the command log.
Examining lines 7310 and 7311 in the command log indicates that the vxassist
make command was run on the shareddg disk group:
If the command uses disk access (DA) names, the shipped command converts the
DA names to unique disk IDs (UDID) or Disk media (DM) names. On the CVM
master node, the vxconfigd log shows the entry for the received command. To
determine the commands received from slave nodes on the master, enter the
command:
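A sketch of the search, assuming the vxconfigd messages go to the system log (as the note below says, the actual file may differ):
# egrep -i received /var/adm/messages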
Note: The file to which the vxconfigd messages are logged may differ, depending
on where the messages are redirected. These messages are logged by default.
There is no need to set the vxconfigd debug level.
In this example, the following received command is recorded in the vxconfigd log
on the master node:
From the above output on the master node, you can determine the slave from which
the command is triggered based on the SlaveID. The SlaveID is the cluster monitor
nodeid (CM nid) of the node in the cluster.
To determine the slave node from which the command was triggered, enter the following
command and find the slave node with the matching SlaveID:
# /etc/vx/bin/vxclustadm nidmap
For example:
The CVM master node executes the command and sends the response to the slave
node.
To find the response that the master node sent to the slave node, enter a command
such as the following on the master node:
In this example, enter the following command to find the response that the master
node sent:
Command completion is not enabled
If tab-key completion for Veritas commands is not enabled in your bash session,
source the completion script:
# . /etc/bash_completion.d/vx_bash
Chapter 8
Backing up and restoring disk group configurations
This chapter includes the following topics:
■ Backing up and restoring Flexible Storage Sharing disk group configuration data
Warning: The backup and restore utilities act only on VxVM configuration data.
They do not back up or restore any user or application data that is contained within
volumes or other VxVM objects. If you use vxdiskunsetup and vxdisksetup on a
disk, and specify attributes that differ from those in the configuration backup, this
may corrupt the public region and any data that it contains.
If VxVM cannot update a disk group’s configuration because of disk errors, it disables
the disk group and displays the following error:
If such errors occur, you can restore the disk group configuration from a backup
after you have corrected any underlying problem such as failed or disconnected
hardware.
Configuration data from a backup allows you to reinstall the private region headers
of VxVM disks in a disk group whose headers have become damaged, to recreate
a corrupted disk group configuration, or to recreate a disk group and the VxVM
objects within it. You can also use the configuration data to recreate a disk group
on another system if the original system is not available.
Note: To restore a disk group configuration, you must use the same physical disks
that were configured in the disk group when you took the backup.
Here diskgroup is the name of the disk group, and dgid is the disk group ID. If a
disk group is to be recreated on another system, copy these files to that system.
Warning: Take care that you do not overwrite any files on the target system that
are used by a disk group on that system.
Restoring a disk group configuration
-l The -l option lets you specify a directory for the location of the
backup configuration files other than the default location,
/etc/vx/cbr/bk.
Warning: None of the disks or VxVM objects in the disk groups should be open or
in use by any application while the restoration is being performed.
You can choose whether or not to reinstall any corrupted disk headers at the
precommit stage. If any of the disks’ private region headers are invalid, restoration
may not be possible without reinstalling the headers for the affected disks.
See the vxconfigrestore(1M) manual page.
At the precommit stage, you can use the vxprint command to examine the
configuration that the restored disk group will have. You can choose to proceed
to commit the changes and restore the disk group configuration. Alternatively,
you can cancel the restoration before any permanent changes have been
made.
To abandon restoration at the precommit stage
◆ Type the following command:
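Based on the vxconfigrestore -d usage shown later in this chapter, the command takes the following form:
# vxconfigrestore -d diskgroup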
Note: Between the precommit and commit stages, do not attempt any operation
that results in a change to the disk group configuration, as this may lead to
unexpected behavior. Either abandon the restoration or commit the operation.
If no disk headers are reinstalled, the configuration copies in the disks’ private
regions are updated from the latest binary copy of the configuration that was
saved for the disk group.
If any of the disk headers are reinstalled, a saved copy of the disks’ attributes
is used to recreate their private and public regions. These disks are also
assigned new disk IDs. The VxVM objects within the disk group are then
recreated using the backup configuration records for the disk group. This
process also has the effect of creating new configuration copies in the disk
group.
Volumes are synchronized in the background. For large volume configurations,
it may take some time to perform the synchronization. You can use the vxtask
-l list command to monitor the progress of this operation.
Disks that are in use or whose layout has been changed are excluded from
the restoration process.
If the backup was taken of a shared disk group, the vxconfigrestore command
restores it as a private disk group. After the disk group is restored, run the following
commands to make the disk group shared.
To make the disk group shared
1 Deport the disk group:
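A sketch of the deport and shared re-import, assuming the commands are run from the CVM master node (-s imports the disk group as shared):
# vxdg deport diskgroup
# vxdg -s import diskgroup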
1049135264.31.xxx.veritas.com
The solution is to specify the disk group by its ID rather than by its name to perform
the restoration. The backup file, /etc/vx/cbr/bk/diskgroup.dgid/ dgid.dginfo,
contains a timestamp that records when the backup was taken.
The following is a sample extract from such a backup file that shows the timestamp
and disk group ID information:
TIMESTAMP
Tue Apr 15 23:27:01 PDT 2003
.
.
.
DISK_GROUP_CONFIGURATION
Group: mydg
dgid: 1047336696.19.xxx.veritas.com
.
.
.
Use the timestamp information to decide which backup contains the relevant
information, and use the vxconfigrestore command to restore the configuration
by specifying the disk group ID instead of the disk group name.
In addition to restoring the configuration data on the secondary nodes, you must restore the
configuration data on the primary (master) node that will import the disk group.
To back up FSS disk group configuration data
◆ To back up FSS disk group configuration data on all cluster nodes that have
connectivity to at least one disk in the disk group, type the following command:
# /etc/vx/bin/vxconfigbackup -T diskgroup
# vxclustadm nidmap
2 Check if the primary node has connectivity to at least one disk in the disk group.
The disk can be a direct attached storage (DAS) disk, a partially shared disk, or
a fully shared disk.
3 If the primary node does not have connectivity to any disk in the disk group,
switch the primary node to a node that has connectivity to at least one DAS or
partially shared disk, using the following command:
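The master switch is typically performed with vxclustadm; a sketch assuming the target node is named node_name:
# vxclustadm setmaster node_name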
# vxconfigrestore diskgroup
Note: You must restore the configuration data on all secondary nodes that
have connectivity to at least one disk in the disk group.
# vxconfigrestore diskgroup
# vxprint -g diskgroup
# vxconfigrestore -c diskgroup
# vxclustadm nidmap
# vxconfigrestore -d diskgroup
# vxconfigrestore -d diskgroup
Note: You must abort or decommit the configuration data on all secondary
nodes that have connectivity to at least one disk in the disk group, and all
secondary nodes from which you triggered the precommit.
# vxdisk scandisks
2 If the disks are part of an imported disk group, deport the disk group.
3 Clear the udid_mismatch flag on all non-clone disks identified during step 1.
Use one of the following methods:
Method 1: Clear the flags for each disk and then import the disk group.
1 Clear the udid_mismatch and the clone disk flag on all non-clone disks you
identified in step 1:
# vxdisk -cf updateudid diskname
Method 2: Clear the flags individually for each disk and then import the disk group.
1 Clear the udid_mismatch flag on all non-clone disks you identified in step 1:
# vxdisk -f updateudid diskname
329 Cannot join a non-CDS disk group and a CDS disk group
Recovery action: Change the non-CDS disk group into a CDS disk group (or vice
versa), then retry the join operation.
330 Disk group is for a different platform
Recovery action: Import the disk group on the correct platform. It cannot be
imported on this platform.
331 Volume has a log which is not CDS compatible
Recovery action: To get a log which is CDS compatible, you need to stop the
volume, if currently active, then start the volume. After the volume has been
successfully started, retry setting the CDS attribute for the disk group.
332 License has expired, or is not available for CDS
Recovery action: Obtain a license from Veritas that enables the usage of CDS
disk groups.
334 Disk group alignment not CDS compatible
Recovery action: Change the alignment of the disk group to 8K and then retry
setting the CDS attribute for the disk group.
335 Sub-disk length violates disk group alignment
Recovery action: Ensure that the sub-disk length value is a multiple of 8K.
336 Sub-disk offset violates disk group alignment
Recovery action: Ensure that the sub-disk offset value is a multiple of 8K.
337 Sub-disk plex offset violates disk group alignment
Recovery action: Ensure that the sub-disk plex offset value is a multiple of 8K.
338 Plex stripe width violates disk group alignment
Recovery action: Ensure that the plex stripe width value is a multiple of 8K.
339 Volume or log length violates disk group alignment
Recovery action: Ensure that the length of the volume is a multiple of 8K.
340 Last disk media offset violates disk group alignment
Recovery action: Reassociate the DM record prior to upgrading.
341 Too many device nodes in disk group
Recovery action: Increase the number of device nodes allowed in the disk group,
if not already at the maximum. Otherwise, you need to remove volumes from the
disk group, possibly by splitting the disk group.
342 Map length too large for current log length
Recovery action: Use a smaller map length for the DRL/DCM log, or increase the
log length and retry.
343 Volume log map alignment violates disk group alignment
Recovery action: Remove the DRL/DCM log, then add it back after changing the
alignment of the disk group.
344 Disk device cannot be used as a CDS disk
Recovery action: Only a SCSI device can be used as a CDS disk.
345 Disk group contains an old-style RVG which cannot be imported on this platform
Recovery action: Import the disk group on the platform that created the RVG. To
import the disk group on this platform, first remove the RVG on the creating
platform.
347 User transactions are disabled for the disk group
Recovery action: Retry the command, as it was temporarily disallowed by the
vxcdsconvert command executing at the same time.
■ Types of messages
Note: Some error messages described here may not apply to your system.
You may find it useful to consult the VxVM command and transaction logs to
understand the context in which an error occurred.
See “Command logs” on page 91.
How error messages are logged
If syslog output is enabled, messages with a priority higher than Debug are written
to /var/log/syslog.
See “Configuring logging in the startup script” on page 117.
Alternatively, you can use the following command to change the debug level:
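Based on the vxdctl debug usage described below, a sketch that sets level 3 and a hypothetical alternate log file:
# vxdctl debug 3 /var/vxconfigd_debug.out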
There are 10 possible levels of debug logging with the values 0 through 9. Level 1
provides the least detail, and 9 the most. Level 0 turns off logging. If a path name
is specified, this file is used to record the debug output instead of the default debug
log file. If the vxdctl debug command is used, the new debug logging level and
debug log file remain in effect until the VxVM configuration daemon, vxconfigd, is
next restarted.
The vxdctl debug command logs the host name, the VxVM product name, and
the VxVM product version at the top of vxconfigd log file. The following example
shows a typical log entry:
opts="$opts -x syslog"              # use syslog for console messages
#opts="$opts -x log"                # messages to vxconfigd.log
#opts="$opts -x logfile=/foo/bar"   # specify an alternate log file
#opts="$opts -x timestamp"          # timestamp console messages
#debug=1                            # enable debugging console output
Types of messages
VxVM is fault-tolerant and resolves most problems without system administrator
intervention. If the configuration daemon, vxconfigd, recognizes the actions that
are necessary, it queues up the transactions that are required. VxVM provides
atomic changes of system configurations; either a transaction completes fully, or
the system is left in the same state as though the transaction was never attempted.
If vxconfigd is unable to recognize and fix system problems, the system
administrator needs to handle the task of problem solving using the diagnostic
messages that are returned from the software. The following sections describe error
message numbers and the types of error messages that may be seen. They also provide
a list of the more common errors, a detailed description of the likely cause of each
problem, and suggestions for any actions that can be taken.
Messages have the following generic format:
For Veritas Volume Manager, the product is set to VxVM. The component can be
the name of a kernel module or driver such as vxdmp, a configuration daemon such
as vxconfigd, or a command such as vxassist.
Note: For full information about saving system crash information, see the Solaris
System Administration Guide.
Messages are divided into the following types of severity in decreasing order of
impact on the system:
The unique message number consists of an alpha-numeric string that begins with
the letter “V”. For example, in the message number, V-5-1-3141, “V” indicates that
this is a Veritas InfoScale product error message, the first numeric field (5) encodes
the product (in this case, VxVM), the second field (1) represents information about
the product component, and the third field (3141) is the message index. The text
of the error message follows the message number.
Messages
This section contains a list of messages that you may encounter during the operation
of Veritas Volume Manager. However, the list is not exhaustive, and the second
field may contain the name of a different command, driver, or module from that shown
here.
Descriptions are included to elaborate on the situation or problem that generated
a particular message. Wherever possible, a recovery procedure is provided to help
you to locate and correct the problem.
If you encounter a product error message, record the unique message number
preceding the text of the message. Search on the message number at the following
URL to find the information about that message:
http://sort.veritas.com/
When contacting Veritas Technical Support, either by telephone or by visiting the
Veritas Technical Support website, be sure to provide the relevant message number.
Veritas Technical Support will use this message number to quickly determine if
there are TechNotes or other information available for you.
Using VxLogger for kernel-level logging
VxLogger logs messages in the kernel buffer based on the log level and category
settings. Messages outside these settings are discarded by VxLogger. These
settings can be changed cluster-wide using the tunable parameters provided by
VxLogger.
See “Configuring tunable settings for kernel-level logging” on page 121.
# vxtune vol_log_level 3
Chapter 12
Troubleshooting Veritas Volume Replicator
This chapter includes the following topics:
If the RLINK does not change to the CONNECT/ACTIVE state within a short time,
there is a problem preventing the connection. This section describes a number of
possible causes. An error message indicating the problem may be displayed on
the console.
■ If the following error displays on the console:
Make sure that the vradmind daemon is running on the Primary and the
Secondary hosts; otherwise, start the vradmind daemon by issuing the following
command:
# /usr/sbin/vxstart_vvr
For an RLINK in a shared disk group, make sure that the virtual IP address of
the RLINK is enabled on the logowner.
■ If there is no self-explanatory error message, issue the following command on
both the Primary and Secondary hosts:
# ping -s remote_host
Note: This command is only valid when ICMP ping is allowed between the VVR
Primary and the VVR Secondary.
After 10 iterations, type Ctrl-C. There should be no packet loss or very little
packet loss. To ensure that the network can transmit large packets, issue the
following command on each host for an RLINK in a private disk group.
For an RLINK in a shared disk group, issue the following command on the
logowner on the Primary and Secondary:
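A sketch of the large-packet test, assuming an 8K packet size and the Solaris ping syntax:
# ping -s remote_host 8192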
The packet loss should be about the same as for the earlier ping command.
■ Issue the vxiod command on each host to ensure that there are active I/O
daemons. If the output is 0 volume I/O daemons running, activate I/O daemons
by issuing the following command:
# vxiod set 10
Issue the following command to ensure that the heartbeat port number in the
output matches the port displayed by vxprint command:
# vrport
Confirm that the state of the heartbeat port is Idle by issuing the following
command:
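A sketch, assuming the default VVR heartbeat port 4145 that is mentioned below:
# netstat -an | grep 4145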
UDP: IPv4
Local Address Remote Address State
-------------------- -------------------- -------
*.port-number Idle
Check whether the required VVR ports are open. Check for UDP 4145, TCP
4145, TCP 8199, and the anonymous port. Enter the following commands:
Perform a telnet test to check for open ports. For example, to determine if port
4145 is open, enter the following:
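A sketch of the telnet test, assuming the remote host is named london (illustrative):
# telnet london 4145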
■ Use the netstat command to check if vradmind daemons can connect between
the Primary site and the Secondary site.
::1 localhost
129.148.174.232 seattle
Recovery from configuration errors
Note that on the Secondary, the size of volume hr_dv01 is small, hr_dv2 is
misnamed (must be hr_dv02), and hr_dv03 is missing. An attempt to attach the
Primary RLINK to this Secondary using the attach command fails.
3 Associate a new volume, hr_dv03, of the same size as the Primary data volume
hr_dv03.
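A sketch of the create-and-associate sequence, assuming an illustrative disk group hrdg and RVG hr_rvg, and the volume size recorded from the Primary:
# vxassist -g hrdg make hr_dv03 volume_length
# vxvol -g hrdg assoc hr_rvg hr_dv03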
Alternatively, the problem can be fixed by altering the Primary to match the
Secondary, or any combination of the two. When the Primary and the Secondary
match, retry the attach.
On the Primary:
Run the vxrlink verify rlink command at either node to check whether this
has occurred. When the configuration error has been corrected, the affected RLINK
can be resumed.
On the Secondary:
To correct the problem, either create and associate hr_dv04 on the Secondary or
alternatively, dissociate vol04 from the Primary, and then resume the Secondary
RLINK. To resume the Secondary RLINK, use the vradmin resumerep rvg_name
command.
If hr_dv04 on the Primary contains valid data, copy its contents to hr_dv04 on the
Secondary before associating the volume to the Secondary RVG.
On the Secondary:
To correct the problem, increase the size of the Secondary data volume, or shrink
the Primary data volume:
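A sketch of growing the Secondary data volume to match the Primary, using vxassist on the Secondary host (the disk group hrdg and volume names are illustrative):
# vxassist -g hrdg growto hr_dv01 primary_volume_length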
After resizing a data volume, resume the Secondary RLINK by issuing the following
command on any host in the RDS:
On the Secondary:
■ Rename either the Primary or Secondary data volume, and resume the RLINK
using the vradmin resumerep rvg_name command.
OR
■ Set the primary_datavol field on the Secondary data volume to refer to the
new name of the Primary data volume as follows, and resume the RLINK using
the vradmin resumerep rvg_name command.
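A sketch, assuming an illustrative disk group hrdg, a Secondary volume hr_dv01_sec, and a renamed Primary volume hr_dv01_new:
# vxedit -g hrdg set primary_datavol=hr_dv01_new hr_dv01_sec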
On the Secondary:
VxVM VVR vxrlink ERROR V-5-1-0 VSet name vset_name of secondary datavol
vol_name does not match VSet name vset_name of primary datavol
vol_name
To correct the problem, rename the volume set on either the Primary or the
Secondary, using the following command:
VxVM VVR vxrlink ERROR V-5-1-0 VSet index (index_name) of secondary datavol
vol_name does not match VSet index (index_name) of primary datavol
vol_name
When you remove the last volume, the volume set is also removed.
2 Create the volume set using the following command:
3 Associate each of the remaining volumes to the volume set, specifying the
index of the corresponding volumes on the Primary using the following
command:
Similarly, if a data volume is removed from the volume set on the Secondary RVG
only or added to the volume set on the Primary RVG only, the following error
displays:
To correct the problem, add or remove data volumes from either the Secondary or
the Primary volume sets. The volume sets on the Primary and the Secondary should
have the same component volumes.
To add a data volume to a volume set, do one of the following:
■ To add a data volume to a volume set in an RVG:
If the RVG contains a database, recovery of the failed data volume must be
coordinated with the recovery requirements of the database. The details of the
database recovery sequence determine what must be done to synchronize
Secondary RLINKs.
Detailed examples of recovery procedures are given in the examples:
■ See “Example - Recovery with detached RLINKs” on page 135.
■ See “Example - Recovery with minimal repair” on page 136.
■ See “Example - Recovery by migrating the primary” on page 136.
If the data volume failed due to a temporary outage such as a disconnected cable,
and you are sure that there is no permanent hardware damage, you can start the
data volume without dissociating it from the RVG. The pending writes in the SRL
are replayed to the data volume.
See “Example - Recovery from temporary I/O error” on page 137.
5 Make sure the data volume is started before restarting the RVG. If the data
volume is not started, start the data volume:
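A sketch, assuming an illustrative disk group hrdg and data volume hr_dv01:
# vxvol -g hrdg start hr_dv01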
Any outstanding writes in the SRL are written to the data volume.
4 Associate a new SRL with the RVG. After associating the new SRL, the RVG
PASSTHRU mode no longer displays in the output of the command vxprint
-lV.
After this error has occurred and you have successfully recovered the RVG, if you
dissociate a volume from the RVG, you may see the following message:
It is recommended that you detach the RLINKs and synchronize them after the restore
operation is complete.
3 Repair or restore the SRL. Even if the problem can be fixed by repairing the
underlying subdisks, the SRL must still be dissociated and reassociated to
initialize the SRL header.
4 Make sure the SRL is started, and then reassociate the SRL:
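A sketch, assuming an illustrative disk group hrdg, RVG hr_rvg, and SRL volume hr_srl (aslog associates the SRL with the RVG):
# vxvol -g hrdg start hr_srl
# vxvol -g hrdg aslog hr_rvg hr_srl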
6 Restore the data volumes from backup if needed. Synchronize all the RLINKs.
3 Restore data from the Secondary Storage Checkpoint backup to all the volumes.
If all volumes are restored from backup, the Secondary will remain consistent
during the synchronization. Restore the RLINK by issuing the following
command:
3 Restore data from the Primary Storage Checkpoint backup to all data volumes.
Unlike restoration from a Secondary Storage Checkpoint, the Primary Storage
Checkpoint data must be loaded onto all Secondary data volumes, not just the
failed volume. If a usable Primary Storage Checkpoint does not already exist,
make a new Storage Checkpoint.
Fix or replace the SRL. Be sure the SRL is started before associating it:
2 Repair the SRL volume. Even if the problem can be fixed by repairing the
underlying subdisks, the SRL volume must still be dissociated and re-associated
to initialize the SRL header.
3 Start the SRL volume. Then, re-associate it.
5 If the integrity of the data volumes is not suspect, just resume the RLINK.
OR
If the integrity of the data volumes is suspect, and a Secondary Storage
Checkpoint backup is available, restore from the Secondary Storage
Checkpoint.
Restore the Secondary Storage Checkpoint backup data on to the data volumes.
OR
If the integrity of the data volumes is suspect and no Secondary Storage
Checkpoint is available, synchronize the Secondary using a block-level backup
and Primary Storage Checkpoint.
As an alternative, you can also use automatic synchronization.
On the Secondary, restore the Primary Storage Checkpoint backup data to the
data volumes.
VxVM vxvol WARNING V-5-1-5291 WARNING: Rvg rvgname has not been
recovered because the SRL is not available. The data volumes may
be out-of-date and inconsistent
VxVM vxvol WARNING V-5-1-5296 The data volumes in the rvg rvgname
cannot be recovered because the SRL is being dissociated.
Restore the data volumes from backup before starting the applications
If replication was frozen due to receipt of an IBC, the data in the SRL is lost
but there is no indication of this problem. To see whether this was the case,
examine the /var/adm/messages file for a message such as:
If this is the last message for the RLINK, that is, if there is no subsequent
message stating that replication was unfrozen, the Primary RLINK must be
completely resynchronized.
Section 3
Troubleshooting Dynamic Multi-Pathing
You can migrate the LUNs from the control of the native multi-pathing driver to DMP
control.
■ To migrate to DMP with Veritas Volume Manager, refer to the section on disabling
MPxIO in the Storage Foundation Administrator's Guide.
■ To migrate to DMP with OS native volume support, refer to the section on
migrating to DMP from MPxIO in the Dynamic Multi-Pathing Administrator's
Guide.
DMP saves the corrupted file with the name vxvm.exclude.corrupt. DMP creates
a new vxvm.exclude file. You must manually recover from this situation.
2 View the saved vxvm.exclude.corrupt file to find any entries for the excluded
paths that are relevant.
# cat /etc/vx/vxvm.exclude.corrupt
exclude_all 0
paths
controllers
c4 /pci@1f,4000/pci@4/scsi@4/fp@0,0
3 Reissue the vxdmpadm exclude command for the paths that you noted in step
2.
# cat /etc/vx/vxvm.exclude
exclude_all 0
paths
#
controllers
c0 /pci@1f,4000/scsi@3
c4 /pci@1f,4000/pci@4/scsi@4/fp@0,0
#
product
#
Downgrading the array support
If you have issues with an updated VRTSaslapm package, Veritas may recommend
that you downgrade to a previous version of the ASL/APM package. You can only
revert to a package that is supported for the installed release of Veritas InfoScale
products. To perform the downgrade while the system is online, do not remove the
installed package. Instead, you can install the previous version of the package over
the new package. This method prevents multiple instances of the VRTSaslapm
package from being installed.
Use the following method to downgrade the VRTSaslapm package.
To downgrade the ASL/APM package for Solaris 10
1 Create a response file to the pkgadd command that specifies
instance=overwrite. The following example shows a response file:
#
# Copyright 2004 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
#ident "@(#)default 1.7 04/12/21 SMI"
#
mail=
instance=overwrite
partial=ask
runlevel=ask
idepend=ask
rdepend=ask
space=ask
setuid=ask
conflict=ask
action=ask
networktimeout=60
networkretries=3
authentication=quit
keystore=/var/sadm/security
proxy=
basedir=default
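The parameters shown are pkgadd admin-file keywords, so a sketch of the subsequent installation step passes the file with -a (the file path and package location are hypothetical):
# pkgadd -a /tmp/overwrite.admin -d /tmp/pkgs VRTSaslapm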
3 Verify that the root pool is imported using the zpool status command.
# init 6
■ Troubleshooting CFS
Troubleshooting CFS
This section discusses troubleshooting CFS problems.
LD_LIBRARY_PATH=/opt/SUNWspro/lib:\
/app/oracle/orahome/lib:/usr/lib:/usr/ccs/lib
In the above example, /app/oracle is a CFS file system, and if the user tries to
change the primary node for this file system, the system will hang. The user is still
able to ping and telnet to the system, but simple commands such as ls will not
respond. One of the first steps required during the changing of the primary node is
freezing the file system cluster wide, followed by a quick issuing of the fsck
command to replay the intent log.
Since the initial entry in <library> path is pointing to the frozen file system itself, the
fsck command goes into a deadlock situation. In fact, all commands (including ls)
which rely on the <library> path will hang from now on.
The recommended procedure to correct this problem is as follows: move any
entries pointing to a CFS file system in any user's (especially root's) <library> path
toward the end of the list, after the entry for /usr/lib.
Therefore, the above example of a <library path> would be changed to the following:
LD_LIBRARY_PATH=/opt/SUNWspro/lib:\
/usr/lib:/usr/ccs/lib:/app/oracle/orahome/lib
Because the fencing module operates identically on each system, both nodes
assume the other has failed, and carry out fencing operations to ensure the other node
is ejected. The VCS GAB module on each node determines the peer has failed due
to loss of heartbeats and passes the membership change to the fencing module.
Each side “races” to gain control of the coordinator disks. Only a registered node
can eject the registration of another node, so only one side successfully completes
the command on each disk.
The side that successfully ejects the peer from a majority of the coordinator disks
wins. The fencing module on the winning side then passes the membership change
up to VCS and other higher-level packages registered with the fencing module,
allowing VCS to invoke recovery actions. The losing side forces a kernel panic and
reboots.
Example Scenario I
Figure 14-2 shows a scenario that could cause similar symptoms on a two-node cluster
with one node shut down for maintenance. During the outage, the private interconnect
cables are disconnected.
[Figure 14-2: First, the network interconnect is severed; Node 0 wins the coordinator
race. Second, Node 1 panics and reboots.]
Example Scenario II
Similar to example scenario I, if the private interconnect cables are disconnected in a
two-node cluster, Node 1 is fenced out of the cluster, panics, and reboots. If Node 1
rejoins the cluster before the private interconnect cables are fixed, Node 0 also panics
and reboots (or just reboots). No node can write to the data disks until the
private networks are fixed. This is because GAB membership cannot be formed;
therefore, the cluster cannot be formed.
Similar to example scenario I, if private interconnect cables are disconnected in a
two-node cluster, Node 1 is fenced out of the cluster, panics, and reboots. No node
can write to the data disks until the private networks are fixed. This is because GAB
membership cannot be formed; therefore, the cluster cannot be formed.
Suggested solution: Shut down both nodes, reconnect the cables, restart the nodes.
# /opt/VRTSvcs/vxfen/bin/vxfenclearpre
# gabconfig -cx
CVM group is not online after adding a node to the Veritas InfoScale
products cluster
The possible causes for the CVM group being offline after adding a node to the
cluster are as follows:
■ The cssd resource is configured as a critical resource in the cvm group.
■ Other resources configured in the cvm group as critical resources are not online.
To resolve the issue if cssd is configured as a critical resource
1 Log onto one of the nodes in the existing cluster as the root user.
2 Configure the cssd resource as a non-critical resource in the cvm group:
# haconf -makerw
# hares -modify cssd Critical 0
# haconf -dump -makero
To resolve the issue if other resources in the group are not online
1 Log onto one of the nodes in the existing cluster as the root user.
2 Bring the resource online:
# haconf -makerw
# hares -modify resource_name Critical 0
# haconf -dump -makero
vxvm:vxconfigd:ERROR:vold_vcs_getnodeid(/dev/vx/rdmp/disk_name):
local_node_id < 0
First, make sure that CVM is running. You can see the CVM nodes in the cluster
by running the vxclustadm nidmap command.
# vxclustadm nidmap
Name CVM Nid CM Nid State
system01 1 0 Joined: Master
system02 0 1 Joined: Slave
The above output shows that CVM is healthy, with system system01 as the CVM
master. The error message shown earlier is displayed when CVM cannot retrieve
the node ID of the local system from the vxfen driver. This usually happens when
port b is not configured.
To verify vxfen driver is configured
◆ Check the GAB ports with the command:
# gabconfig -a
# hastop -all
2 Make sure that the port h is closed on all the nodes. Run the following command
on each node to verify that the port h is closed:
# gabconfig -a
4 If you have any applications that run outside of VCS control that have access
to the shared storage, then shut down all other nodes in the cluster that have
access to the shared storage. This prevents data corruption.
5 Start the vxfenclearpre script:
# /opt/VRTSvcs/vxfen/bin/vxfenclearpre
6 Read the script’s introduction and warning. Then, you can choose to let the
script run.
The script cleans up the disks and displays the following status messages.
...................
[10.209.80.194]:50001: Cleared all registrations
[10.209.75.118]:443: Cleared all registrations
Cleaning up the data disks for all shared disk groups ...
You can retry starting fencing module. In order to restart the whole
product, you might want to reboot the system.
# hastart
For example:
# ls -ltr /dev/rdsk/disk_name
lrwxrwxrwx 1 root root 81 Aug 18 11:58
c2t5006016141E02D28d4s2
-> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0/fp@0,
0/ssd@w5006016141e02d28,4:c,raw
lrwxrwxrwx 1 root root 81 Aug 18 11:58
c2t5006016141E02D28d3s2
-> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0/fp@0,
0/ssd@w5006016141e02d28,3:c,raw
lrwxrwxrwx 1 root root 81 Aug 18 11:58
c2t5006016141E02D28d2s2
-> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0/fp@0,
0/ssd@w5006016141e02d28,2:c,raw
lrwxrwxrwx 1 root root 81 Aug 18 11:58
c2t5006016141E02D28d1s2
-> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0/fp@0,
0/ssd@w5006016141e02d28,1:c,raw
If not all LUNs are discovered by SCSI, the problem might be corrected by specifying
the dev_flags or default_dev_flags and max_luns parameters for the SCSI driver.
If the LUNs are not visible in /dev/rdsk/* files, it may indicate a problem with SAN
configuration or zoning.
Perform the following additional steps:
■ Check the file /kernel/drv/sd.conf to see if the new LUNs were added.
■ Check the format to see if the LUNs have been labeled in the server.
■ Check to see if the disk is seen, using the following command:
# prtvtoc /dev/rdsk/disk_name
Section 5
Troubleshooting Cluster Server
■ Troubleshooting resources
■ Troubleshooting notification
■ Troubleshooting licensing
■ Verifying the metered or forecasted values for CPU, Mem, and Swap
VCS message logging
Note that the logs on all nodes may not be identical because:
■ VCS logs local events on the local nodes.
■ All nodes may not be running when an event occurs.
Note: Irrespective of the value of LogViaHalog, logs from script entry points that are
executed in the container go into the engine log file.
From VCS 6.2 onwards, the capture of FFDC information on unexpected events
has been extended to the resource level and to cover VCS events. This means that if
a resource exhibits unexpected behavior, FFDC information is generated.
The current version enables the agent to log detailed debug information during
unexpected events with respect to a resource, such as:
■ Monitor entry point of a resource reported OFFLINE/INTENTIONAL OFFLINE
when it was in ONLINE state.
■ Monitor entry point of a resource reported UNKNOWN.
■ If any entry point times out.
■ If any entry point reports failure.
■ When a resource is detected as ONLINE or OFFLINE for the first time.
Whenever an unexpected event occurs, FFDC information is generated automatically
and logged in the respective agent log file.
To view the GAB binary log files, use the gabread_ffdc utility:
/opt/VRTSgab/gabread_ffdc binary_logs_files_location
You can change the values of the following environment variables that control the
GAB binary log files:
■ GAB_FFDC_MAX_INDX: Defines the maximum number of GAB binary log files
The GAB logging daemon collects the defined number of log files, each 8 MB in
size. The default value is 20, and the files are named gablog.1 through
gablog.20. At any point in time, the most recent file is the gablog.1 file.
■ GAB_FFDC_LOGDIR: Defines the log directory location for GAB binary log files
The default location is:
/var/adm/gab_ffdc
Note that the gablog daemon writes its log to the glgd_A.log and glgd_B.log
files in the same directory.
You can either define these variables in the following GAB startup file or use the
export command. You must restart GAB for the changes to take effect.
/etc/default/gab
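A sketch of setting the variables with the export command (the values are illustrative):
# export GAB_FFDC_MAX_INDX=40
# export GAB_FFDC_LOGDIR=/var/adm/gab_ffdc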
# haconf -makerw
2 Enable logging and set the desired log levels. Use the following command
syntax:
The following example shows the command line for the IPMultiNIC resource
type.
The following example shows the command line for the Sybase resource type.
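A sketch of the syntax using the LogDbg attribute named below (the debug tags chosen here are illustrative):
# hatype -modify IPMultiNIC LogDbg DBG_1
# hatype -modify Sybase LogDbg DBG_3 DBG_4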
For more information on the LogDbg attribute, see the Cluster Server
Administration Guide.
3 For script-based agents, run the halog command to add the messages to the
engine log:
If DBG_AGDEBUG is set, the agent framework logs for an instance of the agent
appear in the agent log on the node on which the agent is running.
4 For CVMvxconfigd agent, you do not have to enable any additional debug logs.
5 For AMF driver in-memory trace buffer:
If you had enabled AMF driver in-memory trace buffer, you can view the
additional logs using the amfconfig -p dbglog command.
# export VCS_DEBUG_LOG_TAGS="DBG_IPM"
# hagrp -list
Note: Debug log messages are verbose. If you enable debug logs, log files might
fill up quickly.
DBG_AGDEBUG
DBG_AGINFO
DBG_HBFW_DEBUG
DBG_HBFW_INFO
# /opt/VRTSvcs/bin/hagetcf
The command prompts you to specify an output directory for the gzip file. You
may save the gzip file to either the default /tmp directory or a different directory.
Troubleshoot and fix the issue.
See “Troubleshooting the VCS engine” on page 177.
See “Troubleshooting VCS startup” on page 185.
See “Troubleshooting service groups” on page 188.
See “Troubleshooting resources” on page 195.
See “Troubleshooting notification” on page 213.
See “Troubleshooting and recovery for global clusters” on page 214.
See “Troubleshooting the steward process” on page 217.
If the issue cannot be fixed, then contact Veritas technical support with the file
that the hagetcf command generates.
# /opt/VRTSvcs/bin/vcsstatlog
--dump /var/VRTSvcs/stats/copied_vcs_host_stats
■ To get the forecasted available capacity for CPU, Mem, and Swap for a
system in cluster, run the following command on the system on which you
copied the statlog database:
# /opt/VRTSgab/getcomms
The script uses rsh by default. Make sure that you have configured
passwordless rsh. If you have passwordless ssh between the cluster nodes,
you can use the -ssh option. To gather information on the node that you run
the command, use the -local option.
Troubleshoot and fix the issue.
See “Troubleshooting Low Latency Transport (LLT)” on page 179.
See “Troubleshooting Group Membership Services/Atomic Broadcast (GAB)”
on page 184.
If the issue cannot be fixed, then contact Veritas technical support with the file
/tmp/commslog.<time_stamp>.tar that the getcomms script generates.
# /opt/VRTSamf/bin/getimf
Message catalogs
VCS includes multilingual support for message catalogs. These binary message
catalogs (BMCs), are stored in the following default locations. The variable language
represents a two-letter abbreviation.
/opt/VRTS/messages/language/module_name
HAD diagnostics
When the VCS engine HAD dumps core, the core is written to the directory
$VCS_DIAG/diag/had. The default value for variable $VCS_DIAG is /var/VRTSvcs/.
When HAD core dumps, review the contents of the $VCS_DIAG/diag/had directory.
See the following logs for more information:
■ Operating system console log
■ Engine log
■ hashadow log
VCS runs the script /opt/VRTSvcs/bin/vcs_diag to collect diagnostic information
when HAD and GAB encounter heartbeat problems. The diagnostic information is
stored in the $VCS_DIAG/diag/had directory.
When HAD starts, it renames the directory to had.timestamp, where timestamp
represents the time at which the directory was renamed.
# svcs -l vcs
fmri svc:/system/vcs:default
name Veritas Cluster Server (VCS) Init service
enabled true
state online
next_state none
state_time Tue Feb 10 11:27:30 2009
restarter svc:/system/svc/restarter:default
If the output does not resemble the previous output, refer to Solaris documentation
for Service Management Facility (SMF).
It is recommended to let the cluster automatically seed when all members of the
cluster can exchange heartbeat signals to each other. In this case, all systems
perform the I/O fencing key placement after they are already in the GAB
membership.
Preonline IP check
You can enable a preonline check of a failover IP address to protect against network
partitioning. The check pings a service group's configured IP address to verify that
it is not already in use. If it is, the service group is not brought online.
A second check verifies that the system is connected to its public network and
private network. If the system receives no response from a broadcast ping to the
public network and a check of the private networks, it determines the system is
isolated and does not bring the service group online.
To enable the preonline IP check, do one of the following:
■ If preonline trigger script is not already present, copy the preonline trigger script
from the sample triggers directory into the triggers directory:
# cp /opt/VRTSvcs/bin/sample_triggers/VRTSvcs/preonline_ipc
/opt/VRTSvcs/bin/triggers/preonline
Check the log files that get generated in the /var/svc/log directory for any errors.
Recommended action: Ensure that all systems on the network have unique
clusterid-nodeid pair. You can use the lltdump -f device -D command to get
the list of unique clusterid-nodeid pairs connected to the network. This utility is
available only for LLT-over-ethernet.
LLT INFO V-14-1-10205 link 1 (link_name) node 1 in trouble
This message implies that LLT did not receive any heartbeats on the indicated link
from the indicated peer node for the LLT peertrouble time. The default LLT
peertrouble time is 2s for hipri links and 4s for lo-pri links.
Recommended action: If these messages sporadically appear in the syslog, you
can ignore them. If these messages flood the syslog, then perform one of the
following:
lltconfig -T peertrouble:<value> for hipri links
lltconfig -T peertroublelo:<value> for lo-pri links
LLT INFO V-14-1-10024 link 0 (link_name) node 1 active
This message implies that LLT started seeing heartbeats on this link from that node.
LLT INFO V-14-1-10032 link 1 (link_name) node 1 inactive 5 sec (510)
This message implies that LLT did not receive any heartbeats on the indicated link
from the indicated peer node for the indicated amount of time. If the peer node has
not actually gone down, check for the items listed under the next message.
LLT INFO V-14-1-10510 sent hbreq (NULL) on link 1 (link_name) node 1. 4 more to go.
LLT INFO V-14-1-10510 sent hbreq (NULL) on link 1 (link_name) node 1. 3 more to go.
LLT INFO V-14-1-10510 sent hbreq (NULL) on link 1 (link_name) node 1. 2 more to go.
LLT INFO V-14-1-10032 link 1 (link_name) node 1 inactive 6 sec (510)
LLT INFO V-14-1-10510 sent hbreq (NULL) on link 1 (link_name) node 1. 1 more to go.
LLT INFO V-14-1-10510 sent hbreq (NULL) on link 1 (link_name) node 1. 0 more to go.
LLT INFO V-14-1-10032 link 1 (link_name) node 1 inactive 7 sec (510)
LLT INFO V-14-1-10509 link 1 (link_name) node 1 expired
These messages imply that LLT did not receive any heartbeats on the indicated
link from the indicated peer node for more than the LLT peerinact time. LLT attempts
to request heartbeats (sends 5 hbreqs to the peer node) and if the peer node does
not respond, LLT marks this link as “expired” for that peer node.
Recommended action: If the peer node has not actually gone down, check for the
following:
■ Check if the link has got physically disconnected from the system or switch.
■ Check for the link health and replace the link if necessary.
LLT INFO V-14-1-10499 recvarpreq link 0 for node 1 addr change from
00:00:00:00:00:00 to 00:18:8B:E4:DE:27
This message is logged when LLT learns the peer node’s address.
Recommended action: No action is required. This message
is informational.
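For example, to raise the peer-trouble thresholds described for the first message above, values can be passed to lltconfig as shown below; the values are illustrative, and LLT timer tunables are typically expressed in hundredths of a second (so 400 corresponds to 4 seconds), so confirm the unit and defaults for your release:
# lltconfig -T peertrouble:400
# lltconfig -T peertroublelo:800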
On peer nodes:
You can ignore this kind of warning if the packet received is ICMP. In this example,
the 10th byte 0x01 indicates that this is an ICMP packet. The 21st and 22nd bytes
indicate the ICMP type and code for the packet.
Recommended Action: If this issue occurs during a GAB reconfiguration, and does
not recur, the issue is benign. If the issue persists, collect commslog from each
node, and contact Veritas support.
GAB's attempt (five retries) to kill the VCS daemon fails if the VCS daemon is stuck
in the kernel in an uninterruptible state or if the system is so heavily loaded that the
VCS daemon cannot die with a SIGKILL.
Recommended Action:
■ In case of performance issues, increase the value of the VCS_GAB_TIMEOUT
environment variable to allow VCS more time to heartbeat.
■ In case of a kernel problem, configure GAB to not panic but continue to attempt
killing the VCS daemon.
Do the following:
■ Run the following command on each node:
# gabconfig -k
■ Add the “-k” option to the gabconfig command in the /etc/gabtab file:
gabconfig -c -k -n 6
■ If the problem persists, collect sar or similar output, collect crash dumps,
run the Veritas Services and Operations Readiness Tools (SORT) data collector
on all nodes, and contact Veritas Technical Support.
# cd /etc/VRTSvcs/conf/config
# hacf -verify .
GAB can become unregistered if LLT is set up incorrectly. Verify that the
configuration is correct in /etc/llttab. If the LLT configuration is incorrect, make the
appropriate changes and reboot.
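As a point of reference, a minimal Solaris /etc/llttab resembles the following sketch; the node ID, cluster ID, link tags, and device paths are assumptions for your environment:
set-node system01
set-cluster 57069
link net1 /dev/net/net1 - ether - -
link net2 /dev/net/net2 - ether - -
The set-cluster value must match on every node of the cluster, each set-node value must be unique, and each link line must reference a working private interconnect.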
Troubleshooting Intelligent Monitoring Framework (IMF)
Intelligent resource monitoring has not reduced system utilization
If the system is busy even after intelligent resource monitoring is enabled, troubleshoot as
follows:
■ Check the agent log file to see whether the imf_init agent function has failed.
If the imf_init agent function has failed, then do the following:
■ Make sure that the AMF_START environment variable value is set to 1.
■ Make sure that the AMF module is loaded.
■ Make sure that the IMF attribute values are set correctly for the following attribute keys:
■ The value of the Mode key of the IMF attribute must be set to 1, 2, or 3.
■ The value of the MonitorFreq key of the IMF attribute must be set to either 0 or a
value greater than 0.
For example, the value of the MonitorFreq key can be set to 0 for the Process agent.
Refer to the appropriate agent documentation for configuration recommendations
corresponding to the IMF-aware agent.
Note that the IMF attribute can be overridden. So, if the attribute is set for an individual
resource, then check the value for that individual resource.
■ Verify that the resources are registered with the AMF driver. Check the amfstat
command output.
■ Check the LevelTwoMonitorFreq attribute settings for the agent. For example, the Process
agent must have this attribute value set to 0.
Refer to the appropriate agent documentation for configuration recommendations
corresponding to the IMF-aware agent.
Enabling the agent's intelligent monitoring does not provide immediate performance results
The actual intelligent monitoring for a resource starts only after a steady state is achieved.
So, it takes some time before you can see a positive performance effect after you enable
IMF. This behavior is expected.
For more information on when a steady state is reached, see the following topic:
Agent does not perform intelligent monitoring despite setting the IMF mode to 3
For the agents that use the AMF driver for IMF notification, if intelligent resource monitoring
has not taken effect, do the following:
■ Make sure that the IMF attribute's Mode key value is set to three (3).
■ Review the agent log to confirm that the imf_init() agent registration with AMF has
succeeded. The AMF driver must be loaded before the agent starts because the agent
registers with AMF at agent startup time. If this was not the case, start the AMF
module and restart the agent.
AMF module fails to unload despite changing the IMF mode to 0
Even after you change the value of the Mode key to zero, the agent still continues to have
a hold on the AMF driver until you kill the agent. To unload the AMF module, all holds on it
must get released.
If the AMF module fails to unload after changing the IMF mode value to zero, do the following:
■ Run the amfconfig -Uof command. This command forcefully removes all holds on
the module and unconfigures it.
■ Then, unload AMF.
When you try to enable IMF for an agent, the haimfconfig -enable -agent <agent_name>
command returns a message that IMF is enabled for the agent. However, when VCS and
the respective agent are running, the haimfconfig -display command shows the status
for agent_name as DISABLED.
A few possible reasons for this behavior are as follows:
■ The agent might require some manual steps to make it IMF-aware. Refer to the agent
documentation for these manual steps.
■ The agent is a custom agent and is not IMF-aware. For information on how to make a
custom agent IMF-aware, see the Cluster Server Agent Developer’s Guide.
■ If the preceding steps do not resolve the issue, contact Veritas technical support.
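A quick way to verify the AMF prerequisites mentioned in the preceding entries on Solaris is sketched below; the /etc/default/amf location of the AMF_START setting is an assumption for your release:
# grep AMF_START /etc/default/amf
# modinfo | grep amf
# amfstat
The first command confirms that AMF_START is set to 1, the second confirms that the AMF kernel module is loaded, and amfstat lists the resources that are currently registered with the AMF driver.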
Warning: To bring a group online manually after VCS has autodisabled the group,
make sure that the group is not fully or partially active on any system that has the
AutoDisabled attribute set to 1 by VCS. Specifically, verify that all resources that
may be corrupted by being active on multiple systems are brought down on the
designated systems. Then, clear the AutoDisabled attribute for each system:
# hagrp -autoenable service_group -sys system
Warning: Exercise caution when you use the -force option. It can lead to situations
where a resource status is unintentionally returned as FAULTED. In the time interval
that a resource transitions from ‘waiting to go offline’ to ‘not waiting’, if the agent
has not completed the offline agent function, the agent may return the state of the
resource as OFFLINE. VCS considers such an unexpected offline of the resource as a
FAULT and starts a recovery action that was not intended.
Service group does not fail over to the BiggestAvailable system even
if FailOverPolicy is set to BiggestAvailable
Sometimes, a service group might not fail over to the biggest available system even
when FailOverPolicy is set to BiggestAvailable.
To troubleshoot this issue, check the engine log located in
/var/VRTSvcs/log/engine_A.log to find out the reasons for not failing over to the
biggest available system. This may be due to the following reasons:
■ If one or more of the systems in the service group’s SystemList did not have
forecasted available capacity, you see the following message in the engine log:
One of the systems in SystemList of group group_name, system
system_name does not have forecasted available capacity updated
■ If the hautil -sys command does not list forecasted available capacity for the
systems, you see the following message in the engine log:
Failed to forecast due to insufficient data
This message is displayed due to insufficient recent data to be used for
forecasting the available capacity.
The default value for the MeterInterval key of the cluster attribute MeterControl
is 120 seconds. There will be enough recent data available for forecasting after
3 metering intervals (6 minutes) from the time the VCS engine was started on the
system. After this, the forecasted values are updated every ForecastCycle *
MeterInterval seconds. The ForecastCycle and MeterInterval values are specified
in the cluster attribute MeterControl.
■ If one or more of the systems in the service group’s SystemList have stale
forecasted available capacity, you can see the following message in the engine
log:
System system_name has not updated forecasted available capacity since
last 2 forecast cycles
This issue is caused when the HostMonitor agent stops functioning. Check if the
HostMonitor agent process is running by issuing one of the following commands
on the system which has stale forecasted values:
■ # ps -aef | grep HostMonitor
Even if HostMonitor agent is running and you see the above message in the
engine log, it means that the HostMonitor agent is not able to forecast, and
# rm /var/VRTSvcs/stats/.vcs_host_stats.data \
/var/VRTSvcs/stats/.vcs_host_stats.index
# cp /var/VRTSvcs/stats/.vcs_host_stats_bkup.data \
/var/VRTSvcs/stats/.vcs_host_stats.data
# cp /var/VRTSvcs/stats/.vcs_host_stats_bkup.index \
/var/VRTSvcs/stats/.vcs_host_stats.index
# /opt/VRTSvcs/bin/vcsstatlog --setprop \
/var/VRTSvcs/stats/.vcs_host_stats rate 120
# /opt/VRTSvcs/bin/vcsstatlog --setprop \
/var/VRTSvcs/stats/.vcs_host_stats compressto \
/var/VRTSvcs/stats/.vcs_host_stats_daily
# /opt/VRTSvcs/bin/vcsstatlog --setprop \
/var/VRTSvcs/stats/.vcs_host_stats compressmode avg
# /opt/VRTSvcs/bin/vcsstatlog --setprop \
/var/VRTSvcs/stats/.vcs_host_stats compressfreq 24h
# rm /var/VRTSvcs/stats/.vcs_host_stats.data \
/var/VRTSvcs/stats/.vcs_host_stats.index
Troubleshooting resources
This topic cites the most common problems associated with bringing resources
online and taking them offline. Bold text provides a description of the problem.
Recommended action is also included, where applicable.
The Monitor entry point of the disk group agent returns ONLINE even
if the disk group is disabled
This is expected agent behavior. VCS assumes that data is being read from or
written to the volumes and does not declare the resource as offline. This prevents
potential data corruption that could be caused by the disk group being imported on
two hosts.
You can deport a disabled disk group when all I/O operations are completed or
when all volumes are closed. You can then reimport the disk group to the same
system. Reimporting a disabled disk group may require a system reboot.
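For example, once all volumes in the disabled disk group are closed, the group can be deported and imported again with the standard VxVM commands; the disk group name is illustrative:
# vxdg deport mydg
# vxdg import mydg
If the disk group is under VCS control, offline and online the corresponding DiskGroup resource instead of running vxdg manually, so that the resource state remains consistent.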
Note: A disk group is disabled if data including the kernel log, configuration copies,
or headers in the private region of a significant number of disks is invalid or
inaccessible. Volumes can perform read-write operations if no changes are required
to the private regions of the disks.
If you see these messages when the new node is booting, the vxfen startup script
on the node makes up to five attempts to join the cluster.
To manually join the node to the cluster when I/O fencing attempts fail
◆ If the vxfen script fails in its attempts to allow the node to join the cluster,
restart the vxfen driver with the command:
The vxfentsthdw utility fails when SCSI TEST UNIT READY command
fails
While running the vxfentsthdw utility, you may see a message that resembles
the following:
The disk array does not support returning success for a SCSI TEST UNIT READY
command when another host has the disk reserved using SCSI-3 persistent
reservations. This happens with the Hitachi Data Systems 99XX arrays if bit 186
of the system mode option is not enabled.
Note: If you want to clear all the pre-existing keys, use the vxfenclearpre utility.
# vi /tmp/disklist
For example:
/dev/rdsk/c1t0d11s2
3 If you know on which node the key (say A1) was created, log in to that node
and enter the following command:
# vxfenadm -m -k A2 -f /tmp/disklist
6 Remove the first key from the disk by preempting it with the second key:
/dev/rdsk/c1t0d11s2
A node experiences the split-brain condition when it loses the heartbeat with its
peer nodes due to failure of all private interconnects or node hang. Review the
behavior of I/O fencing under different scenarios and the corrective measures to
be taken.
See “How I/O fencing works in different event scenarios” on page 199.
Event: Both private networks fail.
Node A: Races for a majority of the coordination points. If Node A wins the race
for the coordination points, Node A ejects Node B from the shared disks and
continues.
Node B: Races for a majority of the coordination points. If Node B loses the race
for the coordination points, Node B panics and removes itself from the cluster.
Operator action: When Node B is ejected from the cluster, repair the private
networks before attempting to bring Node B back.
Event: Both private networks function again after the event above.
Node A: Continues to work.
Node B: Has crashed. It cannot start the database since it is unable to write to the
data disks.
Operator action: Restart Node B after the private networks are restored.
Event: Nodes A and B and the private networks lose power. The coordination points
and data disks retain power. Power returns to the nodes and they restart, but the
private networks still have no power.
Node A: Restarts and the I/O fencing driver (vxfen) detects that Node B is registered
with the coordination points. The driver does not see Node B listed as a member
of the cluster because the private networks are down. This causes the I/O fencing
device driver to prevent Node A from joining the cluster. The Node A console
displays:
Potentially a preexisting split brain. Dropping out of the cluster.
Refer to the user documentation for steps required to clear
preexisting split brain.
Node B: Restarts and the I/O fencing driver (vxfen) detects that Node A is registered
with the coordination points. The driver does not see Node A listed as a member
of the cluster because the private networks are down. This causes the I/O fencing
device driver to prevent Node B from joining the cluster. The Node B console
displays the same message.
Operator action: Resolve the preexisting split-brain condition.
See “Fencing startup reports preexisting split-brain” on page 204.
Event: Node A crashes while Node B is down. Node B comes up and Node A is
still down.
Node A: Is crashed.
Node B: Restarts and detects that Node A is registered with the coordination points.
The driver does not see Node A listed as a member of the cluster. The I/O fencing
device driver prints the following message on the console:
Potentially a preexisting split brain. Dropping out of the cluster.
Refer to the user documentation for steps required to clear
preexisting split brain.
Operator action: Resolve the preexisting split-brain condition.
See “Fencing startup reports preexisting split-brain” on page 204.
Event: The disk array containing two of the three coordination points is powered
off. No node leaves the cluster membership.
Node A: Continues to operate as long as no nodes leave the cluster.
Node B: Continues to operate as long as no nodes leave the cluster.
Operator action: Power on the failed disk array so that a subsequent network
partition does not cause cluster shutdown, or replace the coordination points.
Event: The disk array containing two of the three coordination points is powered
off. Node B gracefully leaves the cluster and the disk array is still powered off.
Leaving gracefully implies a clean shutdown so that vxfen is properly unconfigured.
Node A: Continues to operate in the cluster.
Node B: Has left the cluster.
Operator action: Power on the failed disk array so that a subsequent network
partition does not cause cluster shutdown, or replace the coordination points.
Event: The disk array containing two of the three coordination points is powered
off. Node B abruptly crashes or a network partition occurs between Node A and
Node B, and the disk array is still powered off.
Node A: Races for a majority of the coordination points. Node A fails because only
one of the three coordination points is available. Node A panics and removes itself
from the cluster.
Node B: Has left the cluster due to the crash or the network partition.
Operator action: Power on the failed disk array and restart the I/O fencing driver
to enable Node A to register with all coordination points, or replace the coordination
points.
See “Replacing defective disks when the cluster is offline”
on page 207.
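After you take the operator action for any of these scenarios, the fencing and membership state can be confirmed from each node with commands that appear elsewhere in this guide:
# vxfenadm -d
# gabconfig -a
vxfenadm -d shows the fencing mode and the cluster members as seen by the fencing driver, and gabconfig -a should list port b membership for every node.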
Cluster ID on the I/O fencing key of coordinator disk does not match
the local cluster’s ID
If you accidentally assign coordinator disks of a cluster to another cluster, then the
fencing driver displays an error message similar to the following when you start I/O
fencing:
The warning implies that the local cluster, with the cluster ID 57069, has keys.
However, the disk also has keys for the cluster with ID 48813, which indicates that nodes
from the cluster with cluster ID 48813 potentially use the same coordinator disk.
You can run the following commands to verify whether these disks are used by
another cluster. Run the following commands on one of the nodes in the local
cluster. For example, on system01:
system01> # lltstat -C
57069
Where disk_7, disk_8, and disk_9 represent the disk names in your setup.
Recommended action: You must use a unique set of coordinator disks for each
cluster. If the other cluster does not use these coordinator disks, then clear the keys
using the vxfenclearpre command before you use them as coordinator disks in the
local cluster.
However, the same error can occur when the private network links are working and
both systems go down, system 1 restarts, and system 2 fails to come back up. From
system 1's view of the cluster, system 2 may still have the registrations on
the coordination points.
Assume the following situations to understand preexisting split-brain in server-based
fencing:
■ There are three CP servers acting as coordination points. One of the three CP
servers then becomes inaccessible. While in this state, one client node leaves
the cluster, whose registration cannot be removed from the inaccessible CP
server. When the inaccessible CP server restarts, it has a stale registration from
the node which left the cluster. In this case, no new nodes can join the cluster.
Each node that attempts to join the cluster gets a list of registrations from the
CP server. One CP server includes an extra registration (of the node which left
earlier). This makes the joiner node conclude that there exists a preexisting
split-brain between the joiner node and the node which is represented by the
stale registration.
■ All the client nodes have crashed simultaneously, due to which fencing keys
are not cleared from the CP servers. Consequently, when the nodes restart, the
vxfen configuration fails, reporting a preexisting split brain.
These situations are similar to the preexisting split-brain condition with coordinator
disks, where you can solve the problem by running the vxfenclearpre command. A
similar solution is required in server-based fencing, using the cpsadm command.
See “Clearing preexisting split-brain condition” on page 206.
Scenario Solution
2 Clear the keys on the coordinator disks as well as the data disks in all shared disk
groups using the vxfenclearpre command. The command removes SCSI-3
registrations and reservations.
4 Restart system 2.
Scenario Solution
2 Clear the keys on the coordinator disks as well as the data disks in all shared disk
groups using the vxfenclearpre command. The command removes SCSI-3
registrations and reservations.
After removing all stale registrations, the joiner node will be able to join the cluster.
4 Restart system 2.
■ When you add a disk, add the disk to the disk group vxfencoorddg and retest
the group for support of SCSI-3 persistent reservations.
■ You can destroy the coordinator disk group such that no registration keys remain
on the disks. The disks can then be used elsewhere.
To replace a disk in the coordinator disk group when the cluster is offline
1 Log in as superuser on one of the cluster nodes.
2 If VCS is running, shut it down:
# hastop -all
Make sure that port h is closed on all the nodes. Run the following command
to verify that port h is closed:
# gabconfig -a
where:
-t specifies that the disk group is imported only until the node restarts.
-f specifies that the import is to be done forcibly, which is necessary if one or
more disks are not accessible.
-C specifies that any import locks are removed.
6 To remove disks from the disk group, use the VxVM disk administrator utility,
vxdiskadm.
You may also destroy the existing coordinator disk group. For example:
■ Verify whether the coordinator attribute is set to on.
7 Add the new disk to the node and initialize it as a VxVM disk.
Then, add the new disk to the vxfencoorddg disk group:
■ If you destroyed the disk group in step 6, then create the disk group again
and add the new disk to it.
■ If the disk group already exists, then add the new disk to it.
8 Test the recreated disk group for SCSI-3 persistent reservations compliance.
9 After replacing disks in a coordinator disk group, deport the disk group:
12 Verify that the I/O fencing module has started and is enabled.
# gabconfig -a
Make sure that port b membership exists in the output for all nodes in the
cluster.
Make sure that port b and port o memberships exist in the output for all nodes
in the cluster.
# vxfenadm -d
Make sure that I/O fencing mode is not disabled in the output.
13 If necessary, restart VCS on each node:
# hastart
The vxfenswap utility exits if rcp or scp commands are not functional
The vxfenswap utility displays an error message if rcp or scp commands are not
functional.
To recover the vxfenswap utility fault
◆ Verify whether rcp or scp functions properly.
Make sure that you do not use echo or cat to print messages in the .bashrc
file for the nodes.
If the vxfenswap operation is unsuccessful, use the vxfenswap -a cancel
command if required to roll back any changes that the utility made.
Troubleshooting CP server
All CP server operations and messages are logged in the /var/VRTScps/log directory
in a detailed and easy to read format. The entries are sorted by date and time. The
logs can be used for troubleshooting purposes or to review for any possible security
issue on the system that hosts the CP server.
The following files contain logs and text files that may be useful in understanding
and troubleshooting a CP server:
■ /var/VRTScps/log/cpserver_[ABC].log
■ /var/VRTSvcs/log/vcsauthserver.log (Security related)
■ If the vxcpserv process fails on the CP server, then review the following
diagnostic files:
■ /var/VRTScps/diag/FFDC_CPS_pid_vxcpserv.log
■ /var/VRTScps/diag/stack_pid_vxcpserv.txt
Note: If the vxcpserv process fails on the CP server, these files are present in
addition to a core file. VCS restarts vxcpserv process automatically in such
situations.
See “Issues during fencing startup on VCS nodes set up for server-based fencing”
on page 212.
See “Issues during online migration of coordination points” on page 212.
cpsadm command on the VCS gives connection error
If you receive a connection error message after issuing the cpsadm command on the VCS,
perform the following actions:
■ Ensure that the CP server is reachable from all the VCS nodes.
■ Check the /etc/vxfenmode file and ensure that the VCS nodes use the correct CP server
virtual IP or virtual hostname and the correct port number.
■ For HTTPS communication, ensure that the virtual IP and ports listed for the server can
listen to HTTPS requests.
Authorization failure
Authorization failure occurs when the nodes on the client clusters and/or users are not added
in the CP server configuration. Therefore, fencing on the VCS (client cluster) node is not
allowed to access the CP server and register itself on the CP server. Fencing fails to come
up if it fails to register with a majority of the coordination points.
To resolve this issue, add the client cluster node and user in the CP server configuration
and restart fencing.
Authentication failure
If you had configured secure communication between the CP server and the VCS (client
cluster) nodes, authentication failure can occur due to the following causes:
■ The client cluster requires its own private key, a signed certificate, and a Certification
Authority's (CA) certificate to establish secure communication with the CP server. If any
of the files are missing or corrupt, communication fails.
■ If the client cluster certificate does not correspond to the client's private key,
communication fails.
■ If the CP server and client cluster do not have a common CA in their certificate chain of
trust, then communication fails.
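For a first-level check of client-to-CP-server communication, the following commands are commonly used; the CP server name is illustrative:
# cat /etc/vxfenmode
# cpsadm -s cps1.example.com -a ping_cps
The first command shows the coordination points that the client node is configured to use, and the ping_cps action of cpsadm verifies that the CP server responds to the client.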
Thus, during vxfenswap, when the vxfenmode file is being changed by the user,
the Coordination Point agent does not move to FAULTED state but continues
monitoring the old set of coordination points.
As long as the changes to the vxfenmode file are not committed or the new set of
coordination points is not reflected in the vxfenconfig -l output, the Coordination
Point agent continues monitoring the old set of coordination points that it read from
the vxfenconfig -l output in every monitor cycle.
The status of the Coordination Point agent (either ONLINE or FAULTED) depends
upon the accessibility of the coordination points, the registrations on these
coordination points, and the fault tolerance value.
When the changes to vxfenmode file are committed and reflected in the vxfenconfig
-l output, then the Coordination Point agent reads the new set of coordination
points and proceeds to monitor them in its new monitor cycle.
Troubleshooting notification
Occasionally you may encounter problems when using VCS notification. This section
cites the most common problems and the recommended actions. Bold text provides
a description of the problem.
Troubleshooting and recovery for global clusters
Disaster declaration
When a cluster in a global cluster transitions to the FAULTED state because it can
no longer be contacted, failover executions depend on whether the cause was due
to a split-brain, temporary outage, or a permanent disaster at the remote cluster.
If you choose to take action on the failure of a cluster in a global cluster, VCS
prompts you to declare the type of failure.
■ Disaster, implying permanent loss of the primary data center
■ Outage, implying the primary may return to its current form in some time
■ Disconnect, implying a split-brain condition; both clusters are up, but the link
between them is broken
■ Replica, implying that data on the takeover target has been made consistent
from a backup source and that the RVGPrimary can initiate a takeover when
the service group is brought online. This option applies to VVR environments
only.
You can select the groups to be failed over to the local cluster, in which case VCS
brings the selected groups online on a node based on the group's FailOverPolicy
attribute. It also marks the groups as being offline in the other cluster. If you do not
select any service groups to fail over, VCS takes no action except implicitly marking
the service groups as offline on the downed cluster.
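If you prefer the command line to the interactive prompt, the declaration can typically be made with the haclus command; a sketch, assuming the faulted remote cluster is named cluster2:
# haclus -declare outage -clus cluster2 -failover
The -failover option asks VCS to bring the affected global groups online on the local cluster as part of the declaration; omit it to declare the failure without failing groups over.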
VCS alerts
VCS alerts are identified by the alert ID, which comprises the following
elements:
■ alert_type—The type of the alert
■ object—The name of the VCS object for which this alert was generated. This
could be a cluster or a service group.
Alerts are generated in the following format:
alert_type-cluster-system-object
For example:
GNOFAILA-Cluster1-oracle_grp
This is an alert of type GNOFAILA generated on cluster Cluster1 for the service
group oracle_grp.
Types of alerts
VCS generates the following types of alerts.
■ CFAULT—Indicates that a cluster has faulted
■ GNOFAILA—Indicates that a global group is unable to fail over within the cluster
where it was online. This alert is displayed if the ClusterFailOverPolicy attribute
is set to Manual and the wide-area connector (wac) is properly configured and
running at the time of the fault.
■ GNOFAIL—Indicates that a global group is unable to fail over to any system
within the cluster or in a remote cluster.
Some reasons why a global group may not be able to fail over to a remote
cluster:
■ The ClusterFailOverPolicy is set to either Auto or Connected and VCS is
unable to determine a valid remote cluster to which to automatically fail the
group over.
■ The ClusterFailOverPolicy attribute is set to Connected and the cluster in
which the group has faulted cannot communicate with one or more remote
clusters in the group's ClusterList.
■ The wide-area connector (wac) is not online or is incorrectly configured in
the cluster in which the group has faulted.
Managing alerts
Alerts require user intervention. You can respond to an alert in the following ways:
■ If the reason for the alert can be ignored, use the Alerts dialog box in the Java
console or the haalert command to delete the alert. You must provide a
comment as to why you are deleting the alert; VCS logs the comment to the engine
log.
■ Take an action on administrative alerts that have actions associated with them.
■ VCS deletes or negates some alerts when a negating event for the alert occurs.
An administrative alert will continue to live if none of the above actions are performed
and the VCS engine (HAD) is running on at least one node in the cluster. If HAD
is not running on any node in the cluster, the administrative alert is lost.
Negating events
VCS deletes a CFAULT alert when the faulted cluster goes back to the running
state.
VCS deletes the GNOFAILA and GNOFAIL alerts in response to the following
events:
■ The faulted group's state changes from FAULTED to ONLINE.
■ The group's fault is cleared.
■ The group is deleted from the cluster where the alert was generated.
Troubleshooting licensing
This section cites problems you may encounter with VCS licensing. It provides
instructions on how to validate license keys and lists the error messages associated
with licensing.
# /opt/VRTSvcs/bin/vcsstatlog --dump \
/var/VRTSvcs/stats/copied_vcs_host_stats
■ To get the forecasted available capacity for CPU, Mem, and Swap for a
system in cluster, run the following command on the system on which you
copied the statlog database:
Index
A
ACTIVE plex state 28
ACTIVE volume state 39
agent log
format 166
location 166
B
backup
primary Storage Checkpoint 141
backups
of FSS disk group configuration 107
badlog flag
clearing for DCO 49
BADLOG plex state 38
binary message catalogs
about 175
location of 175
C
CD-ROM
booting 74
CLEAN plex state 28
client ID
in command logging file 91
in task logging file 93
in transaction logging file 94
cmdlog file 91
commands
associating with transactions 96
logging 91
configuration
backing up for disk groups 101, 103
backup files 103
resolving conflicting backups 107
restoring for disk groups 101, 104
configuration errors
recovery from 126
I
I/O fencing
testing and scenarios 199
INFO messages 119
inode list error 21
install-db file 75, 82
IOFAIL plex state 28
L
license keys
troubleshooting 217
listing
unstartable volumes 26
log failure 21
log file
default 116
syslog error messages 116
vxconfigd 116
log files 210
LOG plex state 38
log plexes
importance for RAID-5 36
recovering RAID-5 41
logging
agent log 166
associating commands and transactions 96
directory 91, 94
O
OpenBoot PROMs (OPB) 64
P
PANIC messages 118
parity
regeneration checkpointing 40
resynchronization for RAID-5 40
stale 36
partitions
invalid 68
plex kernel states
DISABLED 28, 39
ENABLED 28
plex states
ACTIVE 28
BADLOG 38
CLEAN 28
EMPTY 28
IOFAIL 28
LOG 38
STALE 31
plexes
defined 28
displaying states of 27
in RECOVER state 32
mapping problems 43
vxplex command 41
vxprint
displaying volume and plex states 27
vxreattach
reattaching failed disks 34
vxsnap make
recovery from failure of 54
vxsnap prepare
recovery from failure of 53
vxsnap refresh
recovery from failure of 57
vxsnap restore
recovery from failure of 56
vxtranslog
controlling transaction logging 94
vxunreloc command 65
VxVM
disabling 75
RAID-5 recovery process 39
recovering configuration of 82
vxvol command
aslog keyword
associating the SRL 138
assoc keyword
associating a new volume 128
dis keyword
dissociating the SRL 137
start keyword
starting data volume 135
starting SRL 137
vxvol recover command 42
vxvol resync command 40
vxvol start command 31
W
WARNING messages 119
warning messages
corrupt label_sdo 68
Plex for root volume is stale or unusable 68
unable to read label 68