Storage Administration Guide
SUSE Linux Enterprise Server 15 SP2
Provides information about how to manage storage devices on a SUSE Linux Enterprise Server.
https://documentation.suse.com
Copyright © 2006– 2022 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/ . All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates.
Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not
guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable
for possible errors or the consequences thereof.
Contents
4 Support xvii
Support Statement for SUSE Linux Enterprise Server xvii • Technology
Previews xviii
1.2 Btrfs 3
Key Features 4 • The Root File System Setup on SUSE Linux
Enterprise Server 4 • Migration from ReiserFS and Ext File Systems
to Btrfs 9 • Btrfs Administration 10 • Btrfs Quota Support
for Subvolumes 10 • Swapping on Btrfs 11 • Btrfs send/
receive 12 • Data Deduplication Support 15 • Deleting Subvolumes
from the Root File System 16
1.3 XFS 17
XFS formats 18
1.4 Ext2 19
1.5 Ext3 20
Easy and Highly Reliable Upgrades from Ext2 20 • Converting an Ext2 File
System into Ext3 20
1.7 ReiserFS 26
4.3 bcache 46
Main Features 46 • Setting Up a bcache Device 46 • bcache
Configuration Using sysfs 48
4.4 lvmcache 48
Configuring lvmcache 49 • Removing a Cache Pool 50
5 LVM Configuration 53
5.1 Understanding the Logical Volume Manager 53
6.6 Merging a Snapshot with the Source Logical Volume to Revert Changes
or Roll Back to a Previous State 87
8.2 Setting Up the System with a Software RAID Device for the Root (/)
Partition 99
16.5 Managing FCoE Instances with the FCoE Administration Tool 181
1 Available Documentation
Bug Reports
Report issues with the documentation at https://bugzilla.suse.com/ . To simplify this process, you can use the Report Documentation Bug links next to headlines in the HTML version of this document. These preselect the right product and category in Bugzilla and add a link to the current section. You can start typing your bug report right away. A Bugzilla account is required.
Contributions
To contribute to this documentation, use the Edit Source links next to headlines in the
HTML version of this document. They take you to the source code on GitHub, where you
can open a pull request. A GitHub account is required.
For more information about the documentation environment used for this documentation,
see the repository's README at https://github.com/SUSE/doc-sle/blob/main/README.adoc
Mail
You can also report errors and send feedback concerning the documentation to doc-team@suse.com . Include the document title, the product version, and the publication date of the document. Additionally, include the relevant section number and title (or provide the URL) and provide a concise description of the problem.
Alt , Alt – F1 : a key to press or a key combination; keys are shown in uppercase as on
a keyboard
AMD/Intel This paragraph is only relevant for the AMD64/Intel 64 architecture. The ar-
rows mark the beginning and the end of the text block.
IBM Z, POWER This paragraph is only relevant for the architectures IBM Z and POWER .
The arrows mark the beginning and the end of the text block.
Commands that must be run with root privileges. Often you can also prefix these commands with the sudo command to run them as a non-privileged user.
# command
> sudo command
> command
Notices
4 Support
Find the support statement for SUSE Linux Enterprise Server and general information about
technology previews below. For details about the product lifecycle, see Book “Upgrade Guide”,
Chapter 2 “Life Cycle and Support”.
If you are entitled to support, find details on how to collect information for a support ticket in Book “Administration Guide”, Chapter 42 “Gathering System Information for Support”.
L1
Problem determination, which means technical support designed to provide compatibility information, usage support, ongoing maintenance, information gathering and basic troubleshooting using available documentation.
L2
Problem isolation, which means technical support designed to analyze data, reproduce customer problems, isolate the problem area and provide a resolution for problems not resolved by Level 1 or prepare for Level 3.
For contracted customers and partners, SUSE Linux Enterprise Server is delivered with L3 support for all packages, except for the following:
technology previews.
some packages shipped as part of the module Workstation Extension are L2-supported only.
packages with names ending in -devel (containing header files and similar developer resources) will only be supported together with their main packages.
SUSE will only support the usage of original packages. That is, packages that are unchanged
and not recompiled.
Technology previews are still in development. Therefore, they may be functionally incomplete, unstable, or in other ways not suitable for production use.
Technology previews can be dropped at any time. For example, if SUSE discovers that a preview does not meet the customer or market needs, or does not prove to comply with enterprise standards. SUSE does not commit to providing a supported version of such technologies in the future.
For an overview of technology previews shipped with your product, see the release notes at
https://www.suse.com/releasenotes/ .
SUSE Linux Enterprise Server ships with different file systems from which to choose, including Btrfs, Ext4, Ext3, Ext2 and XFS. Each file system has its own advantages and disadvantages. For a side-by-side feature comparison of the major file systems in SUSE Linux Enterprise Server, see https://www.suse.com/releasenotes/x86_64/SUSE-SLES/15-SP2/#allArch-filesystems (File System Support and Sizes). This chapter contains an overview of how these file systems work and what advantages they offer.
With SUSE Linux Enterprise 12, Btrfs is the default file system for the operating system and XFS is the default for all other use cases. SUSE also continues to support the Ext family of file systems, and OCFS2. By default, the Btrfs file system will be set up with subvolumes. Snapshots will be automatically enabled for the root file system using the snapper infrastructure. For more information about snapper, refer to Book “Administration Guide”, Chapter 7 “System Recovery and Snapshot Management with Snapper”.
1.1 Terminology
metadata
A data structure that is internal to the file system. It ensures that all of the on-disk data is properly organized and accessible. Almost every file system has its own structure of metadata, which is one reason the file systems show different performance characteristics. It is extremely important to maintain metadata intact, because otherwise all data on the file system could become inaccessible.
inode
A data structure on a file system that contains a variety of information about a file, including size, number of links, pointers to the disk blocks where the file contents are actually stored, and date and time of creation, modification, and access.
journal
In the context of a file system, a journal is an on-disk structure containing a type of log in which the file system stores what it is about to change in the file system’s metadata. Journaling greatly reduces the recovery time of a file system because it has no need for the lengthy search process that checks the entire file system at system start-up. Instead, only the journal is replayed.
1.2 Btrfs
Btrfs is a copy-on-write (COW) file system developed by Chris Mason. It is based on COW-friendly B-trees developed by Ohad Rodeh. Btrfs is a logging-style file system. Instead of journaling the block changes, it writes them in a new location, then links the change in. Until the last write, the new changes are not committed.
Writable snapshots that allow you to easily roll back your system if needed after applying updates, or to back up files.
Subvolume support: Btrfs creates a default subvolume in its assigned pool of space. It allows you to create additional subvolumes that act as individual file systems within the same pool of space. The number of subvolumes is limited only by the space allocated to the pool.
The online check and repair functionality scrub is available as part of the Btrfs command line tools. It verifies the integrity of data and metadata, assuming the tree structure is fine. You can run scrub periodically on a mounted file system; it runs as a background process during normal operation.
Different checksums for metadata and user data to improve error detection.
Integration with the YaST Partitioner and AutoYaST on SUSE Linux Enterprise Server. This also includes creating a Btrfs file system on Multiple Devices (MD) and Device Mapper (DM) storage configurations.
Offline migration from existing Ext2, Ext3, and Ext4 file systems.
Boot loader support for /boot , allowing you to boot from a Btrfs partition.
Multivolume Btrfs is supported in RAID0, RAID1, and RAID10 profiles in SUSE Linux Enterprise Server 15 SP2. Higher RAID levels are not supported yet, but might be enabled with a future service pack.
1.2.2 The Root File System Setup on SUSE Linux Enterprise Server
By default, SUSE Linux Enterprise Server is set up using Btrfs and snapshots for the root partition. Snapshots allow you to easily roll back your system if needed after applying updates, or to back up files. Snapshots can easily be managed with the SUSE Snapper infrastructure as explained in Book “Administration Guide”, Chapter 7 “System Recovery and Snapshot Management with Snapper”. Subvolumes are created for the following directories so that they are excluded from snapshots:
/home
If /home does not reside on a separate partition, it is excluded to avoid data loss on rollbacks.
/opt
Third-party products usually get installed to /opt . It is excluded to avoid uninstalling
these applications on rollbacks.
/srv
Contains data for Web and FTP servers. It is excluded to avoid data loss on rollbacks.
/tmp
All directories containing temporary files and caches are excluded from snapshots.
/usr/local
This directory is used when manually installing software. It is excluded to avoid uninstalling these installations on rollbacks.
/var
This directory contains many variable files, including logs, temporary caches, third-party products in /var/opt , and is the default location for virtual machine images and databases. Therefore this subvolume is created to exclude all of this variable data from snapshots and has Copy-On-Write disabled.
Warning: Support for Rollbacks
Rollbacks are only supported by SUSE if you do not remove any of the preconfigured
subvolumes. You may, however, add subvolumes using the YaST Partitioner.
The Btrfs file system supports transparent compression. While enabled, Btrfs compresses file data when written and uncompresses file data when read.
Use the compress or compress-force mount option and select the compression algorithm, zstd , lzo or zlib (the default). zlib compression has a higher compression ratio while lzo is faster and takes less CPU load. The zstd algorithm offers a modern compromise, with performance close to lzo and compression ratios similar to zlib.
For example:
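A minimal sketch of mounting with compression enabled (the device name /dev/sdx and the mount point /mnt are assumptions):
# mount -o compress=zstd /dev/sdx /mnt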
In case you create a file, write to it, and the compressed result is greater or equal to the uncompressed size, Btrfs will skip compression for future write operations forever for this file. If you do not like this behavior, use the compress-force option. This can be useful for files that have some initial non-compressible data.
Note, compression takes effect for new files only. Files that were written without compression are not compressed when the file system is mounted with the compress or compress-force option. Furthermore, files with the nodatacow attribute never get their extents compressed:
# chattr +C FILE
# mount -o nodatacow /dev/sdx /mnt
Compression is independent of any encryption. After you have written some data to this partition, print the details:
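For example, with btrfs filesystem show (the mount point is an assumption):
# btrfs filesystem show /mnt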
Label: 'Test-Btrfs' uuid: 62f0c378-e93e-4aa1-9532-93c6b780749d
Total devices 1 FS bytes used 3.22MiB
devid 1 size 2.00GiB used 240.62MiB path /dev/sdb1
If you want this to be permanent, add the compress or compress-force option into the /etc/fstab configuration file. For example:
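A sketch of such an entry (the UUID, mount point, and subvolume are assumptions):
UUID=1a2b3c4d-5678-90ab-cdef-1234567890ab /home btrfs subvol=@/home,compress=zstd 0 0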
A system rollback from a snapshot on SUSE Linux Enterprise Server is performed by booting from the snapshot first. This allows you to check the snapshot while running before doing the rollback. Being able to boot from snapshots is achieved by mounting the subvolumes (which would normally not be necessary).
In addition to the subvolumes listed in Section 1.2.2, “The Root File System Setup on SUSE Linux Enterprise Server” a volume named @ exists. This is the default subvolume that will be mounted as the root partition ( / ). The other subvolumes will be mounted into this volume.
When booting from a snapshot, the snapshot is used instead of the @ subvolume. The parts of the file system included in the snapshot will be mounted read-only as / . The other subvolumes will be mounted writable into the snapshot. This state is temporary by default: the previous configuration will be restored with the next reboot. To make it permanent, execute the snapper rollback command. This will make the snapshot that is currently booted the new default subvolume, which will be used after a reboot.
File system usage is usually checked by running the df command. On a Btrfs file system, the output of df can be misleading, because in addition to the space the raw data allocates, a Btrfs file system also allocates and uses space for metadata.
Consequently a Btrfs file system may report being out of space even though it seems that plenty of space is still available. In that case, all space allocated for the metadata is used up. Use the following commands to check for used and available space on a Btrfs file system:
btrfs filesystem show
Label: 'ROOT' uuid: 52011c5e-5711-42d8-8c50-718a005ec4b3
Total devices 1 FS bytes used 10.02GiB
devid 1 size 20.02GiB used 13.78GiB path /dev/sda3
Shows the total size of the file system and its usage. If these two values in the last line match, all space on the file system has been allocated.
btrfs filesystem df
Shows values for allocated ( total ) and used space of the file system. If the values for total and used for the metadata are almost equal, all space for metadata has been allocated.
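A sketch of checking these values on the root file system (the mount point is an assumption):
# btrfs filesystem df /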
1.2.3 Migration from ReiserFS and Ext File Systems to Btrfs
You can migrate data volumes from existing ReiserFS or Ext (Ext2, Ext3, or Ext4) to the Btrfs file system using the btrfs-convert tool. This allows you to do an in-place conversion of unmounted (offline) file systems, which may require a bootable install media with the btrfs-convert tool. The tool constructs a Btrfs file system within the free space of the original file system, directly linking to the data contained in it. There must be enough free space on the device to create the metadata or the conversion will fail. The original file system will be intact and no free space will be occupied by the Btrfs file system. The amount of space required is dependent on the content of the file system and can vary based on the number of file system objects (such as files, directories, extended attributes) contained in it. Since the data is directly referenced, the amount of data on the file system does not impact the space required for conversion, except for files that use tail packing and are larger than about 2 KiB in size.
To convert the original file system to the Btrfs file system, run:
# btrfs-convert /path/to/device
When converted, the contents of the Btrfs file system will reflect the contents of the source file system. The source file system will be preserved until you remove the related read-only image created at fs_root/reiserfs_saved/image . The image file is effectively a 'snapshot' of the ReiserFS file system prior to conversion and will not be modified as the Btrfs file system is modified. To remove the image file, remove the reiserfs_saved subvolume:
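A sketch of deleting the subvolume ( fs_root stands for the mount point of the converted file system, as above):
# btrfs subvolume delete fs_root/reiserfs_saved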
To revert the file system back to the original one, use the following command:
# btrfs-convert -r /path/to/device
The size can either be specified in bytes (5000000000), kilobytes (5000000K), megabytes
(5000M), or gigabytes (5G). The resulting values in bytes slightly differ, since 1024 Bytes
= 1 KiB, 1024 KiB = 1 MiB, etc.
4. To list the existing quotas, use the following command. The column max_rfer shows the
quota in bytes.
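A sketch of such a listing, assuming quotas were set on the root file system:
# btrfs qgroup show -r /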
To disable quota support for a partition and all its subvolumes, use btrfs quota disable :
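For example (the mount point is an assumption):
# btrfs quota disable /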
See man 8 btrfs-qgroup and man 8 btrfs-quota for more details. The UseCases page on the Btrfs wiki (https://btrfs.wiki.kernel.org/index.php/UseCases ) also provides more information.
the swap file must have the NODATACOW and NODATASUM mount options.
the swap file cannot be compressed—you can accomplish this by setting the NODATACOW and NODATASUM mount options. Both options disable swap file compression.
the swap file cannot be activated while exclusive operations are running—such as device resizing, adding, removing or replacing, or when a balancing operation is running.
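A minimal sketch of creating a swap file that satisfies these constraints (path and size are assumptions; setting the NOCOW attribute on the empty file disables copy-on-write and compression for it):
# truncate -s 0 /swapfile
# chattr +C /swapfile
# fallocate -l 2G /swapfile
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile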
1.2.7.1 Prerequisites
To use the send/receive feature, the following requirements need to be met:
A Btrfs file system is required on the source side ( send ) and on the target side ( receive ).
Btrfs send/receive operates on snapshots, therefore the respective data needs to reside in a Btrfs subvolume.
SUSE Linux Enterprise 12 SP2 or better. Earlier versions of SUSE Linux Enterprise do not support send/receive.
The following procedure shows the basic usage of Btrfs send/receive using the example of creating incremental backups of /data (source side) in /backup/data (target side). /data needs to be a subvolume.
1. Create the initial snapshot (called snapshot_0 in this example) on the source side and
make sure it is written to the disk:
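A sketch of this step (the snapshot name follows the example; the sync call makes sure the snapshot hits the disk):
# btrfs subvolume snapshot -r /data /data/bkp_data
# sync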
A new subvolume /data/bkp_data is created. It will be used as the basis for the next
incremental backup and should be kept as a reference.
2. Send the initial snapshot to the target side. Since this is the initial send/receive operation,
the complete snapshot needs to be sent:
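A sketch of the initial full send (paths as in the example):
# btrfs send /data/bkp_data | btrfs receive /backup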
When the initial setup has been finished, you can create incremental backups and send the
differences between the current and previous snapshots to the target side. The procedure is
always the same:
1. Create a new snapshot on the source side and make sure it is written to the disk. In the
following example the snapshot is named bkp_data_ CURRENT_DATE :
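A sketch of this step, together with the subsequent incremental send that uses the previous snapshot as parent ( -p ); the date-based name is an assumption matching the example below:
# btrfs subvolume snapshot -r /data /data/bkp_data_2016-07-07
# sync
# btrfs send -p /data/bkp_data /data/bkp_data_2016-07-07 | btrfs receive /backup
Afterward, the following snapshots exist on the source and target sides: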
/data/bkp_data
/data/bkp_data_2016-07-07
/backup/bkp_data
/backup/bkp_data_2016-07-07
Now you have three options for how to proceed:
Keep all snapshots on both sides. With this option you can roll back to any snapshot
on both sides while having all data duplicated at the same time. No further action
is required. When doing the next incremental backup, keep in mind to use the next-
to-last snapshot as parent for the send operation.
Only keep the last snapshot on the source side and all snapshots on the target side. This also allows you to roll back to any snapshot on both sides—to do a rollback to a specific snapshot on the source side, perform a send/receive operation of a complete snapshot from the target side to the source side. Do a delete/move operation on the source side.
Only keep the last snapshot on both sides. This way you have a backup on the target
side that represents the state of the last snapshot made on the source side. It is not
possible to roll back to other snapshots. Do a delete/move operation on the source
and the target side.
a. To only keep the last snapshot on the source side, perform the following commands:
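A sketch of these commands, keeping only the newest snapshot as the reference for the next incremental backup (snapshot names as in the example):
# btrfs subvolume delete /data/bkp_data
# mv /data/bkp_data_2016-07-07 /data/bkp_data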
b. To only keep the last snapshot on the target side, perform the following commands:
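A sketch of these commands (snapshot names as in the example):
# btrfs subvolume delete /backup/bkp_data
# mv /backup/bkp_data_2016-07-07 /backup/bkp_data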
The first command will delete the previous backup snapshot, the second command renames the current backup snapshot to /backup/bkp_data . This ensures that the latest backup snapshot is always named /backup/bkp_data .
It operates in two modes: read-only and de-duping. When run in read-only mode (that is, without the -d switch), it scans the given files or directories for duplicated blocks and prints them. This works on any file system.
Running duperemove in de-duping mode is only supported on Btrfs file systems. After having scanned the given files or directories, the duplicated blocks will be submitted for deduplication.
For more information see man 8 duperemove .
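A sketch of a de-duplication run over a directory tree (the path is an assumption):
# duperemove -d -r /srv/data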
1. Identify the subvolume you need to delete (for example @/opt ). Notice that the root path always has subvolume ID '5'.
3. Mount the root file system (subvolume with ID 5) on a separate mount point (for example /mnt ):
4. Delete the @/opt subvolume from the mounted root file system:
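A sketch of the whole procedure (the device name /dev/sda2 is an assumption):
# btrfs subvolume list /
# mount -o subvolid=5 /dev/sda2 /mnt
# btrfs subvolume delete /mnt/@/opt
# umount /mnt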
1.3 XFS
SGI started XFS development in the early 1990s, originally intending it as the file system for their IRIX OS. The idea behind XFS was to create a high-performance 64-bit journaling file system to meet extreme computing challenges. XFS is very good at manipulating large files and performs well on high-end hardware. XFS is the default file system for data partitions in SUSE Linux Enterprise Server.
A quick review of XFS’s key features explains why it might prove to be a strong competitor for other journaling file systems in high-end computing.
High scalability
XFS offers high scalability by using allocation groups.
At the creation time of an XFS file system, the block device underlying the file system is divided into eight or more linear regions of equal size. Those are called allocation groups. Each allocation group manages its own inodes and free disk space. Practically, allocation groups can be seen as file systems in a file system. Because allocation groups are rather independent of each other, more than one of them can be addressed by the kernel simultaneously. This feature is the key to XFS’s great scalability. Naturally, the concept of independent allocation groups suits the needs of multiprocessor systems.
If you see the message above in the output of the dmesg command, it is recommended that you update your file system to the V5 format:
1.4 Ext2
The origins of Ext2 go back to the early days of Linux history. Its predecessor, the Extended File System, was implemented in April 1992 and integrated in Linux 0.96c. The Extended File System underwent several modifications and, as Ext2, became the most popular Linux file system for years. With the creation of journaling file systems and their short recovery times, Ext2 became less important.
A brief summary of Ext2’s strengths might help understand why it was—and in some areas still is—the favorite Linux file system of many Linux users.
Easy Upgradability
Because Ext3 is based on the Ext2 code and shares its on-disk format and its metadata
format, upgrades from Ext2 to Ext3 are very easy.
1.5 Ext3
Ext3 was designed by Stephen Tweedie. Unlike all other next-generation file systems, Ext3 does not follow a completely new design principle. It is based on Ext2. These two file systems are very closely related to each other. An Ext3 file system can be easily built on top of an Ext2 file system. The most important difference between Ext2 and Ext3 is that Ext3 supports journaling. In summary, Ext3 has three major advantages to offer:
2. Edit the file /etc/fstab as the root user to change the file system type specified for the corresponding partition from ext2 to ext3 , then save the changes.
This ensures that the Ext3 file system is recognized as such. The change takes effect after the next reboot.
3. To boot a root file system that is set up as an Ext3 partition, add the modules ext3 and jbd to the initrd , as shown in the sketch below.
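A minimal sketch of one way to do this with dracut (the configuration file name and module list follow the Ext4 procedure later in this chapter and are assumptions):
# echo 'force_drivers+=" ext3 jbd "' > /etc/dracut.conf.d/filesystem.conf
# dracut -f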
1.6 Ext4
In 2006, Ext4 started as a fork from Ext3. It is the latest file system in the extended file system family. Ext4 was originally designed to increase storage size by supporting volumes with a size of up to 1 exbibyte, files with a size of up to 16 tebibytes and an unlimited number of subdirectories. Ext4 uses extents, instead of the traditional direct and indirect block pointers, to map the file contents. Usage of extents improves both storage and retrieval of data from disks. Ext4 also introduces several performance enhancements such as delayed block allocation and a much faster file system checking routine. Ext4 is also more reliable by supporting journal checksums and by providing time stamps measured in nanoseconds. Ext4 is fully backward compatible to Ext2 and Ext3—both file systems can be mounted as Ext4.
number of inodes = total size of the file system divided by the number of bytes per inode
The number of inodes controls the number of files you can have in the file system: one inode for each file.
When you make a new Ext4 file system, you can specify the inode size and bytes-per-inode ratio to control inode space usage and the number of files possible on the file system. If the block size, inode size, and bytes-per-inode ratio values are not specified, the default values in the /etc/mke2fs.conf file are applied. For information, see the mke2fs.conf(5) man page.
Inode size: The default inode size is 256 bytes. Specify a value in bytes that is a power of 2 and equal to 128 or larger in bytes and up to the block size, such as 128, 256, 512, and so on. Use 128 bytes only if you do not use extended attributes or ACLs on your Ext4 file systems.
Bytes-per-inode ratio: The default bytes-per-inode ratio is 16384 bytes. Valid bytes-per-inode ratio values must be a power of 2 equal to 1024 or greater in bytes, such as 1024, 2048, 4096, 8192, 16384, 32768, and so on. This value should not be smaller than the block size of the file system, because the block size is the smallest chunk of space used to store data. The default block size for the Ext4 file system is 4 KiB.
In addition, consider the number of files and the size of files you need to store. For example, if your file system will have many small files, you can specify a smaller bytes-per-inode ratio, which increases the number of inodes. If your file system will have very large files, you can specify a larger bytes-per-inode ratio, which reduces the number of possible inodes.
Generally, it is better to have too many inodes than to run out of them. If you have too few inodes and very small files, you could reach the maximum number of files on a disk that is practically empty. If you have too many inodes and very large files, you might have free space reported but be unable to use it because you cannot create new files in space reserved for inodes.
Use any of the following methods to set the inode size and bytes-per-inode ratio:
Modifying the default settings for all new ext4 file systems: In a text editor, modify the defaults section of the /etc/mke2fs.conf file to set the inode_size and inode_ratio to the desired default values. The values apply to all new Ext4 file systems. For example:
blocksize = 4096
inode_size = 128
inode_ratio = 8192
At the command line: Pass the inode size ( -I 128 ) and the bytes-per-inode ratio ( -i 8192 ) to the mkfs.ext4(8) command or the mke2fs(8) command when you create a new ext4 file system. For example, use either of the following commands:
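A sketch of both variants (the device name is an assumption):
# mkfs.ext4 -b 4096 -I 128 -i 8192 /dev/sda2
# mke2fs -t ext4 -b 4096 -I 128 -i 8192 /dev/sda2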
During installation with YaST: Pass the inode size and bytes-per-inode ratio values when you create a new Ext4 file system during the installation. In the Expert Partitioner, select the partition, click Edit. Under Formatting Options, select Format device and choose Ext4, then click Options. In the Format options dialog, select the desired values from the Block Size in Bytes, Bytes-per-inode, and Inode Size drop-down boxes.
For example, select 4096 for the Block Size in Bytes drop-down box, select 8192 from the Bytes per inode drop-down box, select 128 from the Inode Size drop-down box, then click OK.
1.6.3 Upgrading to Ext4
extents
contiguous blocks on the hard disk that are used to keep files close together and prevent fragmentation
uninit_bg
lazy inode table initialization
has_journal (on Ext2 only)
enable journaling on your Ext2 file system.
on Ext3:
on Ext2:
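A sketch of the corresponding tune2fs invocations, in the same order as the two cases above (the device name /dev/sda1 is an assumption):
# tune2fs -O extents,uninit_bg /dev/sda1
# tune2fs -O extents,uninit_bg,has_journal /dev/sda1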
2. As root edit the /etc/fstab file: change the ext3 or ext2 record to ext4 . The change takes effect after the next reboot.
3. To boot a file system that is set up on an Ext4 partition, add the modules ext4 and jbd to the initramfs . Open or create /etc/dracut.conf.d/filesystem.conf and add the following line:
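force_drivers+=" ext4 "
(The exact content of the line above is an assumption.) Then regenerate the initramfs by running: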
dracut -f
1.7 ReiserFS
ReiserFS support was completely removed with SUSE Linux Enterprise Server 15. To migrate
your existing partitions to Btrfs, refer to Section 1.2.3, “Migration from ReiserFS and Ext File Systems
to Btrfs”.
msdos fat , the file system originally used by DOS, is today used by various operating systems.
nfs Network File System: Here, data can be stored on any machine in a network and access might be granted via a network.
vfat Virtual FAT: Extension of the fat file system (supports long file names).
exfat Extensible File Allocation Table. File system optimized for use with flash memory, such as USB flash drives and SD cards.
If you try to mount a device with a blacklisted file system using the mount command, the command outputs an error message, for example:
mount: /mnt/mx: unknown filesystem type 'minix' (hint: possibly blacklisted, see mount(8)).
To enable the mounting of the file system, you need to remove the particular file system from the blacklist. Each blacklisted file system has its own configuration file, for example, for efs it is /etc/modules.d/60-blacklist_fs-efs.conf . To remove the file system from the blacklist, you have the following options:
Create a symbolic link to /dev/null , for example, for the efs file system:
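A sketch, using the configuration file path given above:
# ln -s /dev/null /etc/modules.d/60-blacklist_fs-efs.conf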
Alternatively, comment out the blacklist statement in the configuration file:
# blacklist omfs
Even though a file system is blacklisted, you can load the corresponding kernel module for the file system directly using modprobe :
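A sketch (the verbose flag makes modprobe print what it loads):
# modprobe -v cramfs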
For example, for the cramfs file system, the output looks as follows:
TABLE 1.2: MAXIMUM SIZES OF FILES AND FILE SYSTEMS (ON-DISK FORMAT, 4 KIB BLOCK SIZE)
File System (4 KiB Block Size) | Maximum File System Size | Maximum File Size
Important: Limitations
Table 1.2, “Maximum Sizes of Files and File Systems (On-Disk Format, 4 KiB Block Size)” describes the limitations regarding the on-disk format. The Linux kernel imposes its own limits on the size of files and file systems handled by it. These are as follows:
File Size
On 32-bit systems, files cannot exceed 2 TiB (2^41 bytes).
Maximum number of paths per single LUN: No limit by default. Each path is treated as a normal LUN.
Maximum number of paths with device-mapper-multipath (in total) per operating system: Approximately 1024. The actual number depends on the length of the device number strings for each multipath device. It is a compile-time variable within multipath-tools, which can be raised if this limit poses a problem.
systemd creates a unit file in /usr/lib/systemd/system . By default, the service runs once a week, which is usually sufficient. However, you can change the frequency by configuring the OnCalendar option to a required value.
The default behaviour of fstrim is to discard all blocks in the file system. You can use options when invoking the command to modify this behaviour. For example, you can pass the offset option to define the place where to start the trimming procedure. For details, see man fstrim .
The fstrim command can perform trimming on all devices stored in the /etc/fstab file that support the TRIM operation—use the -A option when invoking the command for this purpose.
Alternatively, on the Ext4 file system you can use the tune2fs command to set the discard option in /etc/fstab :
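A sketch (the device name is an assumption):
# tune2fs -o discard /dev/sda1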
The discard option is also added to /etc/fstab in case the device was mounted by mount
with the discard option:
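For example, such an entry might look as follows (device, mount point, and file system are assumptions):
/dev/sda1 /data ext4 defaults,discard 0 2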
1. Open a terminal.
3. Enter
Ensure that you delete the oldest snapshots first. The older a snapshot is, the more disk space it occupies.
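A sketch of listing snapshots and deleting one by number with Snapper (the snapshot number is an assumption):
# snapper list
# snapper delete 10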
To help prevent this problem, you can change the Snapper cleanup algorithms. See Book “Administration Guide”, Chapter 7 “System Recovery and Snapshot Management with Snapper”, Section 7.6.1.2 “Cleanup Algorithms” for details. The configuration values controlling snapshot cleanup are EMPTY_* , NUMBER_* , and TIMELINE_* .
If you use Snapper with Btrfs on the file system disk, it is advisable to reserve twice the amount of disk space of the standard storage proposal. The YaST Partitioner automatically proposes twice the standard disk space in the Btrfs storage proposal for the root file system.
If the system disk is filling up with data, you can try deleting files from /var/log , /var/crash , /var/lib/systemd/coredump and /var/cache .
The Btrfs root file system subvolumes /var/log , /var/crash and /var/cache can use all of the available disk space during normal operation, and cause a system malfunction. To help avoid this situation, SUSE Linux Enterprise Server offers Btrfs quota support for subvolumes. See Section 1.2.5, “Btrfs Quota Support for Subvolumes” for details.
On test and development machines, especially if you have frequent crashes of applications, you
may also want to have a look at /var/lib/systemd/coredump where the coredumps are stored.
Assume you have a 1 TB drive with 600 GB used by data and you add another 1 TB drive. The balancing will theoretically result in having 300 GB used space on each drive.
You have a lot of near-empty chunks on a device. Their space will not be available until
the balancing has cleared those chunks.
You need to compact half-empty block groups based on the percentage of their usage. The following command will balance block groups whose usage is 5 % or less:
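A sketch, run against the root file system (the mount point is an assumption):
# btrfs balance start -dusage=5 /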
Tip
The /usr/lib/systemd/system/btrfs-balance.timer timer takes care of cleaning up unused block groups on a monthly basis.
You need to clear out non-full portions of block devices and spread data more evenly.
You need to migrate data between different RAID types. For example, to convert data on
a set of disks from RAID1 to RAID5, run the following command:
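A sketch of such a conversion (the mount point and target profile follow the example):
# btrfs balance start -dconvert=raid5 -mconvert=raid5 /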
Tip
To fine-tune the default behavior of balancing data on Btrfs file systems—for example, how frequently or which mount points to balance—inspect and customize /etc/sysconfig/btrfsmaintenance . The relevant options start with BTRFS_BALANCE_ .
For details about the btrfs balance command usage, see its manual pages ( man 8 btrfs-
balance ).
An in-depth comparison of file systems (not only Linux file systems) is available from the Wikipedia project in Comparison of File Systems (http://en.wikipedia.org/wiki/Comparison_of_file_systems#Comparison ).
After having manually resized partitions (for example by using fdisk or parted ) or logical volumes (for example by using lvresize ).
When wanting to shrink Btrfs file systems (as of SUSE Linux Enterprise Server 12, YaST only supports growing Btrfs file systems).
The new size must be greater than the size of the existing data; otherwise, data loss occurs.
The new size must be equal to or less than the current device size because the file system size cannot extend beyond the space available.
If you plan to also decrease the size of the logical volume that holds the file system, ensure that you decrease the size of the file system before you attempt to decrease the size of the device or logical volume.
Important: XFS
Decreasing the size of a file system formatted with XFS is not possible, since such a feature is not supported by XFS.
1. Open a terminal.
3. Change the size of the file system using the btrfs filesystem resize command with one of the following methods:
To extend the file system size to the maximum available size of the device, enter
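A sketch (the mount point is an assumption):
# btrfs filesystem resize max /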
4. Check the effect of the resize on the mounted file system by entering
> df -h
The Disk Free ( df ) command shows the total size of the disk, the number of blocks used, and the number of blocks available on the file system. The -h option prints sizes in human-readable format, such as 1K, 234M, or 2G.
1. Open a terminal.
3. Increase the size of the file system using the xfs_growfs command. The following example expands the size of the file system to the maximum value available. See man 8 xfs_growfs for more options.
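A sketch (the mount point is an assumption):
# xfs_growfs /mnt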
4. Check the effect of the resize on the mounted file system by entering
> df -h
The Disk Free ( df ) command shows the total size of the disk, the number of blocks used, and the number of blocks available on the file system. The -h option prints sizes in human-readable format, such as 1K, 234M, or 2G.
1. Open a terminal.
3. Change the size of the file system using one of the following methods:
To extend the file system size to the maximum available size of the device called /dev/sda1 , enter
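A sketch using resize2fs (the device name follows the step above):
# resize2fs /dev/sda1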
If a size parameter is not specified, the size defaults to the size of the partition.
The SIZE parameter specifies the requested new size of the file system. If no units are specified, the unit of the size parameter is the block size of the file system. Optionally, the size parameter can be suffixed by one of the following unit designations: s for 512 byte sectors; K for kilobytes (1 kilobyte is 1024 bytes); M for megabytes; or G for gigabytes.
5. Check the effect of the resize on the mounted file system by entering
> df -h
The Disk Free ( df ) command shows the total size of the disk, the number of blocks used, and the number of blocks available on the file system. The -h option prints sizes in human-readable format, such as 1K, 234M, or 2G.
3 Mounting storage devices
This section gives an overview of which device identifiers are used during mounting of devices, and provides details about mounting network storage.
For more information about using udev for managing devices, see Book “Administration Guide”,
Chapter 24 “Dynamic Kernel Device Management with udev”.
A multi-tier cache is a replicated/distributed cache that consists of at least two tiers: one is represented by slower but cheaper rotational block devices (hard disks), while the other is more expensive but performs faster data operations (for example SSD flash disks).
SUSE Linux Enterprise Server implements two different solutions for caching between flash and rotational devices: bcache and lvmcache .
Migration
Movement of the primary copy of a logical block from one device to the other.
Promotion
Migration from the slow device to the fast device.
Demotion
Migration from the fast device to the slow device.
Origin device
The big and slower block device. It always contains a copy of the logical block, which may
be out of date or kept in synchronization with the copy on the cache device (depending
on policy).
Cache device
The small and faster block device.
Metadata device
A small device that records which blocks are in the cache, which are dirty, and extra hints
for use by the policy object. This information could be put on the cache device as well,
but having it separate allows the volume manager to configure it differently, for example
as a mirror for extra robustness. The metadata device may only be used by a single cache
device.
Cache miss
A request for I/O operations is pointed to the cached device's cache first. If it cannot find the requested values, it looks in the device itself, which is slow. This is called a cache miss.
Cache hit
When a requested value is found in the cached device's cache, it is served fast. This is
called a cache hit.
Cold cache
Cache that holds no values (is empty) and causes cache misses. As the cached block device operations progress, it gets filled with data and becomes warm.
Warm cache
Cache that already holds some values and is likely to result in cache hits.
write-back
Data written to a block that is cached goes to the cache only, and the block is marked dirty. This is the default caching mode.
write-through
Writing to a cached block will not complete until it has hit both the origin and cache
devices. Clean blocks remain clean with write-through cache.
write-around
A similar technique to write-through cache, but write I/O is written directly to permanent storage, bypassing the cache. This can prevent the cache from being flooded with write I/O that will not subsequently be re-read, but the disadvantage is that a read request for recently written data will create a 'cache miss', must be read from slower bulk storage, and experiences higher latency.
4.3 bcache
bcache is a Linux kernel block layer cache. It allows one or more fast disk drives (such as SSDs) to act as a cache for one or more slower hard disks. bcache supports write-through and write-back, and is independent of the file system used. By default it caches random reads and writes only, which SSDs excel at. It is suitable for desktops, servers, and high-end storage arrays as well.
A single cache device can be used to cache an arbitrary number of backing devices. Backing
devices can be attached and detached at runtime, while mounted and in use.
Recovers from unclean shutdowns—writes are not completed until the cache is consistent
with regard to the backing device.
Highly efficient write-back implementation. Dirty data is always written out in sorted or-
der.
2. Create a backing device (typically a mechanical drive). The backing device can be a whole
device, a partition, or any other standard block device.
In this example, the default block and bucket sizes of 512 B and 128 KB are used. The
block size should match the backing device's sector size which will usually be either 512
or 4k. The bucket size should match the erase block size of the caching device with the
intention of reducing write amplification. For example, using a hard disk with 4k sectors
and an SSD with an erase block size of 2 MB this command would look as follows:
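A sketch of creating the backing and caching devices with make-bcache, using the block and bucket sizes from the example (the device names are assumptions):
# make-bcache -B /dev/sdb
# make-bcache --block 4k --bucket 2M -C /dev/sdc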
/dev/bcacheN
and as
/dev/bcache/by-uuid/UUID
/dev/bcache/by-label/LABEL
5. After both the cache and backing devices are registered, you need to attach the backing
device to the related cache set to enable caching:
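A sketch of attaching via sysfs (the cache set UUID, listed under /sys/fs/bcache/ , is an assumption):
# echo CACHE_SET_UUID > /sys/block/bcache0/bcache/attach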
6. By default bcache uses a pass-through caching mode. To change it to for example write-
back, run
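A sketch of switching the caching mode via sysfs:
# echo writeback > /sys/block/bcache0/bcache/cache_mode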
4.4 lvmcache
lvmcache is a caching mechanism consisting of logical volumes (LVs). It uses the dm-cache
kernel driver and supports write-through (default) and write-back caching modes. lvmcache
improves performance of a large and slow LV by dynamically migrating some of its data to a
faster and smaller LV. For more information on LVM, see Part II, “Logical Volumes (LVM)”.
LVM refers to the small, fast LV as a cache pool LV. The large, slow LV is called the origin LV.
Because of requirements from dm-cache, LVM further splits the cache pool LV into two devices:
the cache data LV and cache metadata LV. The cache data LV is where copies of data blocks
are kept from the origin LV to increase speed. The cache metadata LV holds the accounting
information that specifies where data blocks are stored.
1. Create the origin LV. Create a new LV or use an existing LV to become the origin LV:
2. Create the cache data LV. This LV will hold data blocks from the origin LV. The size of this
LV is the size of the cache and will be reported as the size of the cache pool LV.
3. Create the cache metadata LV. This LV will hold cache pool metadata. The size of this LV
should be approximately 1000 times smaller than the cache data LV, with a minimum
size of 8MB.
4. Create a cache pool LV. Combine the data and metadata LVs into a cache pool LV. You can
set the cache pool LV's behavior at the same time.
CACHE_POOL_LV takes the name of CACHE_DATA_LV .
CACHE_DATA_LV is renamed to CACHE_DATA_LV _cdata and becomes hidden.
CACHE_META_LV is renamed to CACHE_DATA_LV _cmeta and becomes hidden.
5. Create a cache LV. Create a cache LV by linking the cache pool LV to the origin LV.
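A sketch of the whole procedure, closely following the lvmcache(7) man page (the volume group name vg , LV names, sizes, and device names are assumptions):
# lvcreate -n origin_lv -L 100G vg /dev/slow_dev
# lvcreate -n cache_data_lv -L 10G vg /dev/fast_dev
# lvcreate -n cache_metadata_lv -L 12M vg /dev/fast_dev
# lvconvert --type cache-pool --poolmetadata vg/cache_metadata_lv vg/cache_data_lv
# lvconvert --type cache --cachepool vg/cache_data_lv vg/origin_lv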
You can disconnect a cache pool LV from a cache LV, leaving an unused cache pool LV and an
uncached origin LV. Data are written back from the cache pool to the origin LV when necessary.
This writes back data from the cache pool to the origin LV when necessary, then removes the
cache pool LV, leaving the uncached origin LV.
An alternative command that also disconnects the cache pool from the cache LV, and deletes
the cache pool:
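A sketch of the three variants described above (volume group and LV names are assumptions): detaching the cache pool, uncaching (write back, then remove the cache pool), and removing the cache pool LV directly:
# lvconvert --splitcache vg/origin_lv
# lvconvert --uncache vg/origin_lv
# lvremove vg/cache_data_lv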
5 LVM Configuration
This chapter describes the principles behind Logical Volume Manager (LVM) and its basic features that make it useful under many circumstances. The YaST LVM configuration can be reached from the YaST Expert Partitioner. This partitioning tool enables you to edit and delete existing partitions and create new ones that should be used with LVM.
Warning: Risks
Using LVM might be associated with increased risk, such as data loss. Risks also include application crashes, power failures, and faulty commands. Save your data before implementing LVM or reconfiguring volumes. Never work without a backup.
FIGURE 5.1: PHYSICAL PARTITIONING VERSUS LVM
In LVM, the physical disk partitions that are incorporated in a volume group are called physical
volumes (PVs). Within the volume groups in Figure 5.1, “Physical Partitioning versus LVM”, four
logical volumes (LV 1 through LV 4) have been defined, which can be used by the operating
system via the associated mount points (MP). The border between different logical volumes need
not be aligned with any partition border. See the border between LV 1 and LV 2 in this example.
LVM features:
Provided the configuration is suitable, an LV (such as /usr ) can be enlarged when the
free space is exhausted.
Using LVM, it is possible to add hard disks or LVs in a running system. However, this
requires hotpluggable hardware that is capable of such actions.
It is possible to activate a striping mode that distributes the data stream of a logical volume
over several physical volumes. If these physical volumes reside on different disks, this can
improve the reading and writing performance like RAID 0.
The snapshot feature enables consistent backups (especially for servers) in the running
system.
With these features, using LVM already makes sense for heavily used home PCs or small servers. If you have a growing data stock, as in the case of databases, music archives, or user directories, LVM is especially useful. It allows file systems that are larger than the physical hard disk.
However, keep in mind that working with LVM is different from working with conventional
partitions.
You can manage new or existing LVM storage objects by using the YaST Partitioner. Instructions
and further information about configuring LVM are available in the official LVM HOWTO (http://
tldp.org/HOWTO/LVM-HOWTO/) .
a. To use an entire hard disk that already contains partitions, delete all partitions on
that disk.
4. At the lower left of the Volume Management page, click Add Volume Group.
c. In the Available Physical Volumes list, select the Linux LVM partitions that you want to
make part of this volume group, then click Add to move them to the Selected Physical
Volumes list.
d. Click Finish.
The new group appears in the Volume Groups list.
6. On the Volume Management page, click Next, verify that the new volume group is listed,
then click Finish.
7. To check which physical devices are part of the volume group, open the YaST Partitioner
at any time in the running system and click Volume Management Edit Physical Devices.
Leave this screen with Abort.
Thin pool: The logical volume is a pool of space that is reserved for use with thin volumes.
The thin volumes can allocate their needed space from it on demand.
2. In the left panel, select Volume Management. A list of existing Volume Groups opens in
the right panel.
3. Select the volume group in which you want to create the volume and choose Logical Vol-
umes Add Logical Volume.
4. Provide a Name for the volume and choose Normal Volume (refer to Section 5.3.1, “Thinly
Provisioned Logical Volumes” for setting up thinly provisioned volumes). Proceed with Next.
5. Specify the size of the volume and whether to use multiple stripes.
Using a striped volume, the data will be distributed among several physical volumes. If
these physical volumes reside on different hard disks, this generally results in a better
reading and writing performance (like RAID 0). The maximum number of available stripes
is equal to the number of physical volumes. The default ( 1 ) is to not use multiple stripes.
8. Click Finish.
9. Click Next, verify that the changes are listed, then click Finish.
Thin pool
The logical volume is a pool of space that is reserved for use with thin volumes. The thin
volumes can allocate their needed space from it on demand.
Thin volume
The volume is created as a sparse volume. The volume allocates needed space on demand
from a thin pool.
Such a logical volume is a linear volume (without striping) that provides three copies of the file system. The m option specifies the count of mirrors. The L option specifies the size of the logical volumes.
The logical volume is divided into regions of the 512 KB default size. If you need a different size of regions, use the -R option followed by the desired region size in megabytes. Or you can configure the preferred region size by editing the mirror_region_size option in the lvm.conf file.
LVM maintains a fully redundant bitmap area for each mirror image, which increases its
fault handling capabilities.
Mirror images can be temporarily split from the array and then merged back.
On the other hand, this type of mirroring implementation does not allow you to create a logical volume in a clustered volume group.
To create a mirror volume by using RAID, issue the command
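A sketch of such a command (volume group name, LV name, and size are assumptions):
# lvcreate --type raid1 -m 1 -L 1G -n mirror_lv vg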
--type - you need to specify raid1 , otherwise the command uses the implicit segment
type mirror and creates a non-raid mirror.
LVM creates a logical volume of one extent size for each data volume in the array. If you have two mirrored volumes, LVM creates another two volumes that store metadata.
After you create a RAID logical volume, you can use the volume in the same way as a common
logical volume. You can activate it, extend it, etc.
By default, non-root LVM volume groups are automatically activated on system restart by Dracut.
This parameter allows you to activate all volume groups on system restart, or to activate only
specified non-root LVM volume groups.
It is also possible to reduce the size of the volume group by removing physical volumes. YaST only allows removing physical volumes that are currently unused. To find out which physical volumes are currently in use, run the following command. The partitions (physical volumes) listed in the PE Ranges column are the ones in use:
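A sketch of such a query (the field list is an assumption; seg_pe_ranges is reported under the PE Ranges header):
# pvs --segments -o pv_name,lv_name,seg_pe_ranges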
2. In the left panel, select Volume Management. A list of existing Volume Groups opens in
the right panel.
3. Select the volume group you want to change, activate the Physical Volumes tab, then click
Change.
Add: Expand the size of the volume group by moving one or more physical volumes
(LVM partitions) from the Available Physical Volumes list to the Selected Physical Vol-
umes list.
Remove: Reduce the size of the volume group by moving one or more physical vol-
umes (LVM partitions) from the Selected Physical Volumes list to the Available Physical
Volumes list.
5. Click Finish.
6. Click Next, verify that the changes are listed, then click Finish.
2. In the left panel, select Volume Management. A list of existing Volume Groups opens in
the right panel.
3. Select the logical volume you want to change, then click Resize.
Maximum Size. Expand the size of the logical volume to use all space left in the
volume group.
Minimum Size. Reduce the size of the logical volume to the size occupied by the
data and the le system metadata.
Custom Size. Specify the new size for the volume. The value must be within the
range of the minimum and maximum values listed above. Use K, M, G, T for Kilo-
bytes, Megabytes, Gigabytes and Terabytes (for example 20G ).
5. Click OK.
6. Click Next, verify that the change is listed, then click Finish.
2. In the left panel, select Volume Management. A list of existing volume groups opens in the
right panel.
3. Select the volume group or the logical volume you want to remove and click Delete.
4. Depending on your choice warning dialogs are shown. Confirm them with Yes.
5. Click Next, verify that the deleted volume group is listed (deletion is indicated by a red
colored font), then click Finish.
LVM COMMANDS
pvcreate DEVICE
Initializes a device (such as /dev/sdb1 ) for use by LVM as a physical volume. If there is any file system on the specified device, a warning appears. Bear in mind that pvcreate checks for existing file systems only if blkid is installed (which is done by default). If blkid is not available, pvcreate will not produce any warning and you may lose your file system without any warning.
pvdisplay DEVICE
Displays information about the LVM physical volume, such as whether it is currently being
used in a logical volume.
complete - only the logical volumes that are not affected by missing physical vol-
umes can be activated, even though the particular logical volume can tolerate such
a failure.
partial - the LVM tries to activate the volume group even though some physical
volumes are missing. If a non-redundant logical volume is missing important physical
volumes, then the logical volume usually cannot be activated and is handled as an
error target.
vgremove VG_NAME
Removes a volume group. Before using this command, remove the logical volumes, then
deactivate the volume group.
vgdisplay VG_NAME
Displays information about a specified volume group.
To find the total physical extent of a volume group, enter
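A sketch (the volume group name is an assumption; vgdisplay reports the value as Total PE):
# vgdisplay VG_NAME | grep "Total PE"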
complete - the logical volume can be activated only if all its physical volumes are
active.
partial - the LVM tries to activate the volume even though some physical volumes
are missing. In this case part of the logical volume may be unavailable and it might
cause data loss. This option is typically not used, but might be useful when restoring
data.
You can specify the activation mode also in /etc/lvm/lvm.conf by specifying one of the
above described values of the activation_mode configuration option.
lvremove /dev/VG_NAME/LV_NAME
Removes a logical volume.
Before using this command, close the logical volume by unmounting it with the umount
command.
lvremove SNAP_VOLUME_PATH
Removes a snapshot volume.
export DM_DISABLE_UDEV=1
This will also disable notifications from udev. In addition, all udev related settings
from /etc/lvm/lvm.conf will be ignored.
If you extend an LV, you must extend the LV before you attempt to grow the file system.
If you shrink an LV, you must shrink the file system before you attempt to shrink the LV.
1. Open a terminal.
2. If the logical volume contains an Ext2 or Ext4 file system, which do not support online growing, dismount it. In case it contains file systems that are hosted for a virtual machine (such as a Xen VM), shut down the VM first.
3. At the terminal prompt, enter the following command to grow the size of the logical
volume:
For SIZE , specify the amount of space you want to add to the logical volume, such as 10
GB. Replace /dev/VG_NAME/LV_NAME with the Linux path to the logical volume, such as
/dev/LOCAL/DATA . For example:
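A sketch of growing the volume by 10 GB (the + adds to the current size):
# lvextend -L +10G /dev/LOCAL/DATA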
4. Adjust the size of the file system. See Chapter 2, Resizing File Systems for details.
1. Open a terminal.
3. Adjust the size of the file system. See Chapter 2, Resizing File Systems for details.
4. At the terminal prompt, enter the following command to shrink the size of the logical volume to the size of the file system:
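A sketch, assuming the file system was shrunk to 5 GB in the previous step:
# lvreduce -L 5G /dev/LOCAL/DATA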
Tip: Resizing the Volume and the File System with a Single
Command
Starting with SUSE Linux Enterprise Server 12 SP1, lvextend , lvresize , and lvreduce support the option --resizefs which will not only change the size of the volume, but will also resize the file system. Therefore the examples for lvextend and lvreduce shown above can alternatively be run as follows:
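A sketch of the combined operations (sizes and LV path follow the earlier examples):
# lvextend --resizefs -L +10G /dev/LOCAL/DATA
# lvreduce --resizefs -L 5G /dev/LOCAL/DATA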
Note that the --resizefs option is supported for the following file systems: ext2/3/4, Btrfs, XFS. Resizing Btrfs with this option is currently only available on SUSE Linux Enterprise Server, since it is not yet accepted upstream.
1. Create the original volume (on a slow device) if not already existing.
2. Add the physical volume (from a fast device) to the same volume group the original volume
is part of and create the cache data volume on the physical volume.
3. Create the cache metadata volume. The size should be 1/1000 of the size of the cache
data volume, with a minimum size of 8 MB.
4. Combine the cache data volume and metadata volume into a cache pool volume:
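A sketch of this step, following the scheme from lvmcache(7) (the volume group and volume
names are placeholders):

> sudo lvconvert --type cache-pool --poolmetadata VG/CACHE_META_LV VG/CACHE_DATA_LV

The resulting cache pool is then typically attached to the original volume, for example with
sudo lvconvert --type cache --cachepool VG/CACHE_DATA_LV VG/ORIGIN_LV .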
For more information on LVM caching, see the lvmcache(7) man page.
Select LVM objects for processing according to the presence or absence of specific tags.
Use tags in the configuration file to control which volume groups and logical volumes are
activated on a server.
A tag can be used in place of any command line LVM object reference that accepts:
a list of objects
Replacing the object name with a tag is not supported everywhere yet. After the arguments are
expanded, duplicate arguments in a list are resolved by removing the duplicates and
retaining the first instance of each argument.
Wherever there might be ambiguity of argument type, you must prefix a tag with the commercial
at sign (@) character, such as @mytag . Elsewhere, using the “@” prefix is optional.
Supported Characters
An LVM tag word can contain the ASCII uppercase characters A to Z, lowercase characters
a to z, numbers 0 to 9, underscore (_), plus (+), hyphen (-), and period (.). The word
cannot begin with a hyphen. The maximum length is 128 characters.
--deltag TAG_INFO
Remove a tag from (or untag) an LVM2 storage object. Example:
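A sketch (the tag and volume group names are placeholders):

> sudo vgchange --deltag @db1 vg1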
--tag TAG_INFO
Specify the tag to use to narrow the list of volume groups or logical volumes to be activated
or deactivated.
Enter the following to activate the volume if it has a tag that matches the tag provided
(example):
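One way to do this is to use the tag in place of the volume name, as described above (a sketch;
it assumes logical volumes have been tagged with @database ):

> sudo lvchange -ay @database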
tags {
# Enable hostname tags
hosttags = 1
}
You place the activation code in the /etc/lvm/lvm_<HOSTNAME>.conf file on the host. See
Section 5.9.4.3, “Defining Activation”.
tags {
tag2 {
# If no exact match, tag is not set.
host_list = [ "hostname1", "hostname2" ]
}
}
activation {
volume_list = [ "vg1/lvol0", "@database" ]
}
Replace @database with your tag. Use "@*" to match the tag against any tag set on the host.
The activation command matches against VGNAME , VGNAME/LVNAME , or @TAG set in the meta-
data of volume groups and logical volumes. A volume group or logical volume is activated only
if a metadata tag matches. If there is no match, the default is not to activate.
If volume_list is not present and tags are defined on the host, then it activates the volume
group or logical volumes only if a host tag matches a metadata tag.
If volume_list is defined, but empty, and no tags are defined on the host, then it does not
activate.
If volume_list is undefined, it imposes no limits on LV activation (all are allowed).
lvm.conf
lvm_<HOST_TAG>.conf
At start-up, LVM loads the /etc/lvm/lvm.conf file and processes any tag settings in the file. If any host
tags were defined, it loads the related /etc/lvm/lvm_<HOST_TAG>.conf file. When it searches
for a specific configuration file entry, it searches the host tag file first, then the lvm.conf file,
and uses the first match it finds.
tags {
hosttags = 1
}
3. From any machine in the cluster, add db1 to the list of machines that activate vg1/lvol2 :
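A sketch of the tagging command (it assumes the host tag db1 and the volume names shown):

> sudo lvchange --addtag @db1 /dev/vg1/lvol2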
Activate volume group vg1 only on the database hosts db1 and db2 .
Activate volume group vg2 only on the file server host fs1 .
Activate nothing initially on the file server backup host fsb1 , but be prepared for it to
take over from the file server host fs1 .
2. Add the @fileserver tag to the metadata of volume group vg2 . In a terminal, enter
3. In a text editor, modify the /etc/lvm/lvm.conf file with the following code to define
the @database , @fileserver , and @fileserverbackup tags.
tags {
database {
host_list = [ "db1", "db2" ]
}
fileserver {
host_list = [ "fs1" ]
}
fileserverbackup {
host_list = [ "fsb1" ]
}
}
activation {
# Activate only if host has a tag that matches a metadata tag
volume_list = [ "@*" ]
}
4. Replicate the modified /etc/lvm/lvm.conf file to the four hosts: db1 , db2 , fs1 , and
fsb1 .
5. If the file server host goes down, vg2 can be brought up on fsb1 by entering the following
commands in a terminal on any node:
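A sketch of the takeover, using the tag and volume group names from this example:

> sudo vgchange --addtag @fileserverbackup vg2
> sudo vgchange -ay vg2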
In the following solution, each host holds locally the information about which classes of volume
to activate.
1. Add the @database tag to the metadata of volume group vg1 . In a terminal, enter the
tagging command (a sketch of both tagging commands is given after this procedure).
2. Add the @fileserver tag to the metadata of volume group vg2 . In a terminal, enter the
tagging command.
a. In a text editor, modify the /etc/lvm/lvm.conf file with the following code to
enable host tag configuration files.
tags {
hosttags = 1
}
b. Replicate the modified /etc/lvm/lvm.conf file to the four hosts: db1 , db2 , fs1 ,
and fsb1 .
4. On host db1 , create an activation configuration file for the database host db1 . In a text
editor, create the /etc/lvm/lvm_db1.conf file and add the following code:
activation {
volume_list = [ "@database" ]
}
5. On host db2 , create an activation configuration file for the database host db2 . In a text
editor, create the /etc/lvm/lvm_db2.conf file and add the following code:
activation {
volume_list = [ "@database" ]
}
6. On host fs1, create an activation configuration file for the file server host fs1 . In a text
editor, create the /etc/lvm/lvm_fs1.conf file and add the following code:
activation {
volume_list = [ "@fileserver" ]
}
a. On host fsb1 , create an activation configuration file for the host fsb1 . In a text
editor, create the /etc/lvm/lvm_fsb1.conf file and add the following code:
activation {
volume_list = [ "@fileserver" ]
}
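A sketch of the tagging commands referred to in steps 1 and 2 above (the tag and volume group
names are those used in this example):

> sudo vgchange --addtag @database vg1
> sudo vgchange --addtag @fileserver vg2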
A Logical Volume Manager (LVM) logical volume snapshot is a copy-on-write technology that
monitors changes to an existing volume’s data blocks so that when a write is made to one of
the blocks, the block’s value at the snapshot time is copied to a snapshot volume. In this way, a
point-in-time copy of the data is preserved until the snapshot volume is deleted.
LVM volume snapshots allow you to create a backup from a point-in-time view of the file system.
The snapshot is created instantly and persists until you delete it. You can back up the file system
from the snapshot while the volume itself continues to be available for users. The snapshot
initially contains some metadata about the snapshot, but no actual data from the source logical
volume. Snapshot uses copy-on-write technology to detect when data changes in an original
data block. It copies the value it held when the snapshot was taken to a block in the snapshot
volume, then allows the new data to be stored in the source block. As more blocks change from
their original value on the source logical volume, the snapshot size grows.
When you are sizing the snapshot, consider how much data is expected to change on the source
logical volume and how long you plan to keep the snapshot. The amount of space that you
allocate for a snapshot volume can vary, depending on the size of the source logical volume,
how long you plan to keep the snapshot, and the number of data blocks that are expected to
change during the snapshot’s lifetime. The snapshot volume cannot be resized after it is created.
As a guide, create a snapshot volume that is about 10% of the size of the original logical volume.
If you anticipate that every block in the source logical volume will change at least once during
the snapshot’s lifetime, size the snapshot volume at least as large as the source logical volume.
When you are done with the snapshot, it is important to remove it from the system. A snapshot
eventually fills up completely as data blocks change on the source logical volume. When the
snapshot is full, it is disabled, which prevents you from remounting the source logical volume.
If you create multiple snapshots for a source logical volume, remove the snapshots in a last
created, first deleted order.
For example:
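A sketch of removing two snapshots of the same source volume in the reverse order of their
creation (the snapshot names are placeholders):

> sudo lvremove /dev/vg1/linux01-snap2
> sudo lvremove /dev/vg1/linux01-snap1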
1. Ensure that the source logical volume that contains the file-backed virtual machine image
is mounted, such as at mount point /var/lib/xen/images/<IMAGE_NAME> .
2. Create a snapshot of the LVM logical volume with enough space to store the differences
that you expect (a sketch of the commands follows this procedure).
3. Create a mount point where you will mount the snapshot volume.
5. In a text editor, copy the configuration file for the source virtual machine, modify the
paths to point to the file-backed image file on the mounted snapshot volume, and save
the file, for example as /etc/xen/myvm-snap.cfg .
6. Start the virtual machine using the mounted snapshot volume of the virtual machine.
7. (Optional) Remove the snapshot, and use the unchanged virtual machine image on the
source logical volume.
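A sketch of creating and mounting the snapshot from steps 2 and 3 (the snapshot size, volume
names, and mount point are placeholders):

> sudo lvcreate -s -L 20G -n myvm-snap /dev/LOCAL/myvm
> sudo mkdir -p /mnt/xen/myvm-snap
> sudo mount /dev/LOCAL/myvm-snap /mnt/xen/myvm-snap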
If neither the source logical volume nor the snapshot volume is open, the merge begins
immediately.
If the source logical volume or the snapshot volume is open, the merge starts the first time
either of them is activated and both are closed.
If the source logical volume cannot be closed, such as the root file system, the merge is
deferred until the next time the server reboots and the source logical volume is activated.
If the source logical volume contains a virtual machine image, you must shut down the
virtual machine, deactivate the source logical volume and snapshot volume (by dismount-
ing them in that order), and then issue the merge command. Because the source logical
volume is automatically remounted and the snapshot volume is deleted when the merge
is complete, you should not restart the virtual machine until after the merge is complete.
After the merge is complete, you use the resulting logical volume for the virtual machine.
After a merge begins, the merge continues automatically after server restarts until it is complete.
A new snapshot cannot be created for the source logical volume while a merge is in progress.
While the merge is in progress, reads or writes to the source logical volume are transparently
redirected to the snapshot that is being merged. This allows users to immediately view and
access the data as it was when the snapshot was created. They do not need to wait for the merge
to complete.
When the merge is complete, the source logical volume contains the same data as it did when
the snapshot was taken, plus any data changes made after the merge began. The resulting logical
volume has the source logical volume’s name, minor number, and UUID. The source logical
volume is automatically remounted, and the snapshot volume is removed.
You can specify one or multiple snapshots on the command line. You can alternatively tag
multiple source logical volumes with the same volume tag then specify @<VOLUME_TAG>
on the command line. The snapshots for the tagged volumes are merged to their respective
source logical volumes. For information about tagging logical volumes, see Section 5.9,
“Tagging LVM2 Storage Objects”.
The options include:
-b,
--background
Run the daemon in the background. This allows multiple specified snapshots to be
merged concurrently in parallel.
-i,
--interval < SECONDS >
Report progress as a percentage at regular intervals. Specify the interval in seconds.
For more information about this command, see the lvconvert(8) man page.
For example:
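A sketch, assuming the relevant volumes have been tagged with mytag as described above:

> sudo lvconvert --merge @mytag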
If lvol1 , lvol2 , and lvol3 are all tagged with mytag , each snapshot volume is merged
serially with its respective source logical volume; that is: lvol1 , then lvol2 , then lvol3 .
If the --background option is specified, the snapshots for the respective tagged logical
volume are merged concurrently in parallel.
2. (Optional) If both the source logical volume and snapshot volume are open and they can
be closed, you can manually deactivate and activate the source logical volume to get the
merge to start immediately (a sketch of the commands follows this list).
3. (Optional) If both the source logical volume and snapshot volume are open and the source
logical volume cannot be closed, such as the root file system, you can restart the server
and mount the source logical volume to get the merge to start immediately after the restart.
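A sketch of manually deactivating and reactivating a source logical volume to trigger the merge
(the volume path is a placeholder; the volume must not be mounted or otherwise in use):

> sudo lvchange -an /dev/vg1/lvol1
> sudo lvchange -ay /dev/vg1/lvol1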
The purpose of RAID (redundant array of independent disks) is to combine several hard disk
partitions into one large virtual hard disk to optimize performance, data security, or both. Most
RAID controllers use the SCSI protocol because it can address a larger number of hard disks
in a more effective way than the IDE protocol and is more suitable for parallel processing of
commands. There are some RAID controllers that support IDE or SATA hard disks. Software
RAID provides the advantages of RAID systems without the additional cost of hardware RAID
controllers. However, this requires some CPU time and has memory requirements that make it
unsuitable for real high performance computers.
SUSE Linux Enterprise offers the option of combining several hard disks into one soft RAID
system. RAID implies several strategies for combining several hard disks in a RAID system, each
with different goals, advantages, and characteristics. These variations are commonly known as
RAID levels.
7.1.1 RAID 0
This level improves the performance of your data access by spreading out blocks of each file
across multiple disks. Actually, this is not a RAID, because it does not provide data backup, but
the name RAID 0 for this type of system has become the norm. With RAID 0, two or more hard
disks are pooled together. The performance is very good, but the RAID system is destroyed and
your data lost if even one hard disk fails.
7.1.4 RAID 4
Level 4 provides block-level striping like Level 0 combined with a dedicated parity disk. If a
data disk fails, the parity data is used to create a replacement disk. However, the parity disk
might create a bottleneck for write access. Nevertheless, Level 4 is sometimes used.
7.1.5 RAID 5
RAID 5 is an optimized compromise between Level 0 and Level 1 in terms of performance
and redundancy. The hard disk space equals the number of disks used minus one. The data is
distributed over the hard disks as with RAID 0. Parity blocks, created on one of the partitions,
are there for security reasons. They are linked to each other with XOR, enabling the contents
to be reconstructed by the corresponding parity block in case of system failure. With RAID 5,
no more than one hard disk can fail at the same time. If one hard disk fails, it must be replaced
when possible to avoid the risk of losing data.
2. If necessary, create partitions that should be used with your RAID configuration. Do not
format them and set the partition type to 0xFD Linux RAID. When using existing partitions
it is not necessary to change their partition type—YaST will automatically do so. Refer
to Book “Deployment Guide”, Chapter 10 “Expert Partitioner”, Section 10.1 “Using the Expert Par-
titioner” for details.
It is strongly recommended to use partitions stored on different hard disks to decrease
the risk of losing data if one is defective (RAID 1 and 5) and to optimize the performance
of RAID 0.
For RAID 0 at least two partitions are needed. RAID 1 requires exactly two partitions,
while at least three partitions are required for RAID 5. A RAID 6 setup requires at least
four partitions. It is recommended to use only partitions of the same size because each
segment can contribute only the same amount of space as the smallest sized partition.
5. Select a RAID Type and Add an appropriate number of partitions from the Available Devices
dialog.
You can optionally assign a RAID Name to your RAID. It will make it available as /dev/
md/NAME . See Section 7.2.1, “RAID Names” for more information.
6. Select the Chunk Size and, if applicable, the Parity Algorithm. The optimal chunk size
depends on the type of data and the type of RAID. See https://raid.wiki.kernel.org/in-
dex.php/RAID_setup#Chunk_sizes for more information. More information on parity al-
gorithms can be found with man 8 mdadm when searching for the --layout option. If
unsure, stick with the defaults.
7. Choose a Role for the volume. Your choice here only affects the default values for the
upcoming dialog. They can be changed in the next step. If in doubt, choose Raw Volume
(Unformatted).
8. Under Formatting Options, select Format Partition, then select the File system. The content
of the Options menu depends on the file system. Usually there is no need to change the
defaults.
Under Mounting Options, select Mount partition, then select the mount point. Click Fstab
Options to add special mounting options for the volume.
9. Click Finish.
10. Click Next, verify that the changes are listed, then click Finish.
It will cause names like myRAID to be used as a “real” device name. The device will not
only be accessible at /dev/myRAID , but also be listed as myRAID under /proc . Note that
this will only apply to RAIDs configured after the change to the configuration file. Active
RAIDs will continue to use the mdN names until they get stopped and re-assembled.
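The monitoring described in the next paragraph can be started with a command along these
lines (a sketch; the device, interval, and mail address are the ones named below):

> sudo mdadm --monitor --mail=root@localhost --delay=1800 /dev/md2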
The command above turns on monitoring of the /dev/md2 array in intervals of 1800 seconds. In case
of a failure, an email will be sent to root@localhost .
Linux RAID mailing lists are also available, such as linux-raid at http://marc.info/?l=linux-raid .
In SUSE Linux Enterprise Server, the Device Mapper RAID tool has been integrated into the
YaST Partitioner. You can use the partitioner at install time to create a software RAID for the
system device that contains your root ( / ) partition. The /boot partition cannot be stored on
a RAID partition unless it is RAID 1.
You need two hard disks to create the RAID 1 mirror device. The hard disks should be
similarly sized. The RAID assumes the size of the smaller drive. The block storage devices
can be any combination of local (in or directly attached to the machine), Fibre Channel
storage subsystems, or iSCSI storage subsystems.
A separate partition for /boot is not required if you install the boot loader in the MBR. If
installing the boot loader in the MBR is not an option, /boot needs to reside on a separate
partition.
For UEFI machines, you need to set up a dedicated /boot/efi partition. It needs to be
VFAT-formatted, and may reside on the RAID 1 device to prevent booting problems in case
the physical disk with /boot/efi fails.
If you are using hardware RAID devices, do not attempt to run software RAIDs on top of them.
If you are using iSCSI target devices, you need to enable the iSCSI initiator support before
you create the RAID device.
If your storage subsystem provides multiple I/O paths between the server and its directly
attached local devices, Fibre Channel devices, or iSCSI devices that you want to use in
the software RAID, you need to enable the multipath support before you create the RAID
device.
8.2 Setting Up the System with a Software RAID
Device for the Root (/) Partition
1. Start the installation with YaST and proceed as described in Book “Deployment Guide”,
Chapter 8 “Installation Steps” until you reach the Suggested Partitioning step.
2. Click Expert Partitioner to open the custom partitioning tool. YaST makes a default proposal;
you can use this proposal by clicking Start with Current Proposal. You can also discard
the suggested proposal by clicking Start with Existing Partitions.
3. (Optional) If there are iSCSI target devices that you want to use, you need to enable the
iSCSI Initiator software by choosing Configure Configure iSCSI from the lower right section
of the screen. Refer to Chapter 15, Mass Storage over IP Networks: iSCSI for further details.
4. (Optional) You may need to configure multiple network interfaces for FCoE, click Configure
FCoE to do so.
5. (Optional) In case you have used the suggested proposal and no longer need it, click
Rescan Devices to delete the proposed partitioning.
6. Set up the Linux RAID format for each of the devices you want to use for the software
RAID. You should use RAID for / , /boot/efi , or swap partitions.
a. In the left panel, select Hard Disks and select the device you want to use. Click the
Partitions tab and then click Add Partition.
b. Under New Partition Size, specify the size to use, then click Next.
d. Select Do not format and Do not mount. Set the Partition ID to Linux RAID.
e. Click Next and repeat these instructions for the second partition.
7. Create the RAID device for the / partition.
b. Set the desired RAID Type for the / partition and the RAID name to system .
c. Select the two RAID devices you prepared in the previous step from the Available
Devices section and Add them.
Proceed with Next.
d. Under RAID Options, select the chunk size from the drop-down box. Sticking with
the default is a safe choice.
e. In the left panel, click RAID and then select the system . Click Edit.
g. Select the File System and set the mount point to / . Leave the dialog with Next .
8. The software RAID device is managed by Device Mapper, and creates a device under the
/dev/md/system path.
9. Optionally for UEFI machines, use similar steps to create the /boot/efi mounted parti-
tion. Remember that only RAID 1 is supported for /boot/efi , and the partition needs to
be formatted with the FAT file system.
FIGURE 8.1: /, /BOOT/EFI, AND SWAP ON RAIDS
11. Continue with the installation. For UEFI machines with a separate /boot/efi partition,
click Booting on the Installation Settings screen and set GRUB2 for EFI as the Boot Loader.
Check that the Enable Secure Boot Support option is activated.
Whenever you reboot your server, Device Mapper is started at boot time so that the soft-
ware RAID is automatically recognized, and the operating system on the root (/) partition
can be started.
9 Creating Software RAID 10 Devices
This section describes how to set up nested and complex RAID 10 devices. A RAID 10 device
consists of nested RAID 1 (mirroring) and RAID 0 (striping) arrays. Nested RAIDs can either be
set up as striped mirrors (RAID 1+0) or as mirrored stripes (RAID 0+1). A complex RAID 10
setup also combines mirrors and stripes and additional data security by supporting a higher data
redundancy level.
RAID 1+0: RAID 1 (mirror) arrays are built rst, then combined to form a RAID 0 (stripe)
array.
RAID 0+1: RAID 0 (stripe) arrays are built rst, then combined to form a RAID 1 (mirror)
array.
The following table describes the advantages and disadvantages of RAID 10 nesting as 1+0
versus 0+1. It assumes that the storage objects you use reside on different disks, each with a
dedicated I/O capability.
RAID 10 (1+0), RAID 0 (stripe) built with RAID 1 (mirror) arrays: RAID 1+0 provides high levels of
I/O performance, data redundancy, and disk fault tolerance. Because each member device in the
RAID 0 is mirrored individually, multiple disk failures can be tolerated and data remains available
as long as the disks that fail are in different mirrors.
RAID 10 (0+1), RAID 1 (mirror) built with RAID 0 (stripe) arrays: RAID 0+1 provides high levels of
I/O performance and data redundancy, but slightly less fault tolerance than a 1+0. If multiple disks
fail on one side of the mirror, then the other mirror is available. However, if disks are lost concur-
rently on both sides of the mirror, all data is lost. This solution offers less disk fault tolerance than
a 1+0 solution, but if you need to perform maintenance or maintain the mirror on a different
site, you can take an entire side of the mirror offline and still have a fully functional storage de-
vice. Also, if you lose the connection between the two sites, either site operates independently of
the other. That is not true if you stripe the mirrored segments, because the mirrors are managed
at a lower level.
If a device fails, the mirror on that side fails because RAID 1 is not fault-tolerant. Create a new
RAID 0 to replace the failed side, then resynchronize the mirrors.
The procedure in this section uses the device names shown in the following table. Ensure that
you modify the device names with the names of your own devices.
In this example, /dev/sdb1 and /dev/sdc1 form the RAID 1 device /dev/md0 , /dev/sdd1 and
/dev/sde1 form the RAID 1 device /dev/md1 , and the two mirrors are combined into the RAID
1+0 device /dev/md2 .
1. Open a terminal.
2. If necessary, create four 0xFD Linux RAID partitions of equal size using a disk partitioner
such as parted.
3. Create two software RAID 1 devices, using two different devices for each device. At the
command prompt, enter these two commands:
> sudo mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
sudo mdadm --create /dev/md1 --run --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1
4. Create the nested RAID 1+0 device, using the software RAID 1 devices you created in the
previous step (a sketch of the commands for steps 4 to 7 follows this procedure).
5. Create a file system on the RAID 1+0 device /dev/md2 , for example an XFS file system:
6. Edit the /etc/mdadm.conf file or create it, if it does not exist (for example by running
sudo vi /etc/mdadm.conf ). Add the following lines (if the file already exists, the first
line probably already exists).
The UUID of each device can be retrieved with the following command:
7. Edit the /etc/fstab file to add an entry for the RAID 1+0 device /dev/md2 . The fol-
lowing example shows an entry for a RAID device with the XFS file system and /data
as a mount point.
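A sketch of the commands and file entries for steps 4 to 7 (the chunk size, UUID, and mount
point are only examples):

> sudo mdadm --create /dev/md2 --run --level=0 --chunk=64 --raid-devices=2 /dev/md0 /dev/md1
> sudo mkfs.xfs /dev/md2
> sudo mdadm -D /dev/md2 | grep UUID

Lines for /etc/mdadm.conf (similar ARRAY lines can be added for /dev/md0 and /dev/md1 ):

DEVICE containers partitions
ARRAY /dev/md2 UUID=UUID_OF_MD2

Entry for /etc/fstab :

/dev/md2 /data xfs defaults 1 2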
Important: Multipathing
If you need to manage multiple connections to the devices, you must configure multipath
I/O before configuring the RAID devices. For information, see Chapter 18, Managing Mul-
tipath I/O for Devices.
In this configuration, spare devices cannot be specified for the underlying RAID 0 devices be-
cause RAID 0 cannot tolerate a device loss. If a device fails on one side of the mirror, you must
create a replacement RAID 0 device, then add it into the mirror.
In this example, /dev/sdd1 and /dev/sde1 form the RAID 0 device /dev/md1 , two further
partitions (such as /dev/sdb1 and /dev/sdc1 ) form the RAID 0 device /dev/md0 , and the two
stripes are mirrored as the RAID 0+1 device /dev/md2 .
1. Open a terminal.
2. If necessary, create four 0xFD Linux RAID partitions of equal size using a disk partitioner
such as parted.
3. Create two software RAID 0 devices, using two different devices for each RAID 0 device
(a sketch of the two commands follows this procedure).
4. Create the nested RAID 0+1 device. At the command prompt, enter the following com-
mand using the software RAID 0 devices you created in the previous step:
> sudo mdadm --create /dev/md2 --run --level=1 --raid-devices=2 /dev/md0 /dev/md1
5. Create a file system on the RAID 0+1 device /dev/md2 , for example an XFS file system:
6. Edit the /etc/mdadm.conf file or create it, if it does not exist (for example by running
sudo vi /etc/mdadm.conf ). Add the following lines (if the file exists, the first line
probably already exists, too).
The UUID of each device can be retrieved with the following command:
7. Edit the /etc/fstab file to add an entry for the RAID 0+1 device /dev/md2 . The fol-
lowing example shows an entry for a RAID device with the XFS file system and /data
as a mount point.
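A sketch of the two RAID 0 creation commands from step 3 (the chunk size is only an example;
the partitions follow the scenario described above):

> sudo mdadm --create /dev/md0 --run --level=0 --chunk=64 --raid-devices=2 /dev/sdb1 /dev/sdc1
> sudo mdadm --create /dev/md1 --run --level=0 --chunk=64 --raid-devices=2 /dev/sdd1 /dev/sde1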
Multiple copies of data: a complex RAID 10 keeps two or more copies, up to the number of devices
in the array; a nested RAID 10 keeps the copies on each mirrored segment.
Hot spare devices: in a complex RAID 10, a single spare can service all component devices; in a
nested RAID 10, configure a spare for each underlying mirrored array, or configure a spare to
serve a spare group that serves all mirrors.
9.2.2 Layout
The complex RAID 10 setup supports three different layouts which define how the data blocks
are arranged on the disks. The available layouts are near (default), far and offset. They have
different performance characteristics, so it is important to choose the right layout for your work-
load.
With the near layout, copies of a block of data are striped near each other on different component
devices. That is, multiple copies of one data block are at similar offsets in different devices. Near
is the default layout for RAID 10. For example, if you use an odd number of component devices
and two copies of data, some copies are perhaps one chunk further into the device.
The near layout for the complex RAID 10 yields read and write performance similar to RAID 0
over half the number of drives.
Near layout with an even number of disks and two replicas:
The far layout stripes data over the early part of all drives, then stripes a second copy of the data
over the later part of all drives, making sure that all copies of a block are on different drives.
The second set of values starts halfway through the component drives.
The offset layout duplicates stripes so that the multiple copies of a given chunk are laid out
on consecutive drives and at consecutive offsets. Effectively, each stripe is duplicated and the
copies are offset by one device. This should give similar read characteristics to a far layout if a
suitably large chunk size is used, but without as much seeking for writes.
Offset layout with an even number of disks and two replicas:
9.2.2.4 Specifying the number of Replicas and the Layout with YaST and
mdadm
The number of replicas and the layout is specified as Parity Algorithm in YaST or with the --
layout parameter for mdadm. The following values are accepted:
nN
Specify n for near layout and replace N with the number of replicas. n2 is the default
that is used when not configuring layout and the number of replicas.
fN
Specify f for far layout and replace N with the number of replicas.
oN
Specify o for offset layout and replace N with the number of replicas.
2. If necessary, create partitions that should be used with your RAID configuration. Do not
format them and set the partition type to 0xFD Linux RAID. When using existing partitions
it is not necessary to change their partition type—YaST will automatically do so. Refer
to Book “Deployment Guide”, Chapter 10 “Expert Partitioner”, Section 10.1 “Using the Expert Par-
titioner” for details.
For RAID 10 at least four partitions are needed. It is strongly recommended to use parti-
tions stored on different hard disks to decrease the risk of losing data if one is defective.
It is recommended to use only partitions of the same size because each segment can con-
tribute only the same amount of space as the smallest sized partition.
6. In the Available Devices list, select the desired partitions, then click Add to move them to
the Selected Devices list.
7. (Optional) Click Classify to specify the preferred order of the disks in the RAID array.
For RAID types such as RAID 10, where the order of added disks matters, you can specify
the order in which the devices will be used. This will ensure that one half of the array
resides on one disk subsystem and the other half of the array resides on a different disk
subsystem. For example, if one disk subsystem fails, the system keeps running from the
second disk subsystem.
a. Select each disk in turn and click one of the Class X buttons, where X is the letter
you want to assign to the disk. Available classes are A, B, C, D and E but for many
cases fewer classes are needed (only A and B, for example). Assign all available RAID
disks this way.
You can press the Ctrl or Shift key to select multiple devices. You can also right-
click a selected device and choose the appropriate class from the context menu.
b. Specify the order of the devices by selecting one of the sorting options:
Sorted: Sorts all devices of class A before all devices of class B and so on. For ex-
ample: AABBCC .
Interleaved: Sorts devices by the first device of class A, then first device of class B,
then all the following classes with assigned devices. Then the second device of class
A, the second device of class B, and so on follows. All devices without a class are
sorted to the end of the devices list. For example: ABCABC .
Pattern File: Select an existing file that contains multiple lines, where each is a
regular expression and a class name ( "sda.* A" ). All devices that match the reg-
ular expression are assigned to the specified class for that line. The regular expres-
sion is matched against the kernel name ( /dev/sda1 ), the udev path name ( /dev/
disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0-part1 ) and then the udev ID
( /dev/disk/by-id/ata-ST3500418AS_9VMN8X8L-part1 ). The first match made de-
termines the class if a device’s name matches more than one regular expression.
8. Click Next.
9. Under RAID Options, specify the Chunk Size and Parity Algorithm, then click Next.
For a RAID 10, the parity options are n (near), f (far), and o (offset). The number indicates
the number of replicas of each data block that are required. Two is the default. For infor-
mation, see Section 9.2.2, “Layout”.
10. Add a file system and mount options to the RAID device, then click Finish.
12. Verify the changes to be made, then click Finish to create the RAID.
In this example, the four partitions /dev/sdf1 , /dev/sdg1 , /dev/sdh1 , and /dev/sdi1
together form the complex RAID 10 device /dev/md3 .
1. Open a terminal.
2. If necessary, create at least four 0xFD Linux RAID partitions of equal size using a disk
partitioner such as parted.
3. Create the complex RAID 10 device with mdadm (a sketch of the creation command follows
this procedure). Make sure to adjust the value for --raid-devices and the list of partitions
according to your setup.
The command creates an array with near layout and two replicas. To change any of the two
values, use the --layout option as described in Section 9.2.2.4, “Specifying the number of Replicas
and the Layout with YaST and mdadm”.
4. Create a file system on the RAID 10 device /dev/md3 , for example an XFS file system:
5. Edit the /etc/mdadm.conf file or create it, if it does not exist (for example by running
sudo vi /etc/mdadm.conf ). Add the following lines (if the file exists, the first line
probably already exists, too).
The UUID of the device can be retrieved with the following command:
6. Edit the /etc/fstab file to add an entry for the RAID 10 device /dev/md3 . The following
example shows an entry for a RAID device with the XFS file system and /data as a mount
point.
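A sketch of the creation command and the follow-up steps (the UUID and mount point are only
examples):

> sudo mdadm --create /dev/md3 --run --level=10 --raid-devices=4 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
> sudo mkfs.xfs /dev/md3
> sudo mdadm -D /dev/md3 | grep UUID

/etc/mdadm.conf and /etc/fstab can then be extended in the same way as for the nested RAID
devices above, for example with an ARRAY line for /dev/md3 and a mount entry for /data .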
A degraded array is one in which some devices are missing. Degraded arrays are
supported only for RAID 1, RAID 4, RAID 5, and RAID 6. These RAID types are de-
signed to withstand some missing devices as part of their fault-tolerance features.
Typically, degraded arrays occur when a device fails. It is possible to create a de-
graded array on purpose.
To create a degraded array in which some devices are missing, simply give the word missing
in place of a device name. This causes mdadm to leave the corresponding slot in the array empty.
When creating a RAID 5 array, mdadm automatically creates a degraded array with an extra
spare drive. This is because building the spare into a degraded array is generally faster than
resynchronizing the parity on a non-degraded, but not clean, array. You can override this feature
with the --force option.
Creating a degraded array might be useful if you want to create a RAID, but one of the devices you
want to use already has data on it. In that case, you create a degraded array with other devices,
copy data from the in-use device to the RAID that is running in degraded mode, add the device
into the RAID, then wait while the RAID is rebuilt so that the data is now across all devices. An
example of this process is given in the following procedure:
1. To create a degraded RAID 1 device /dev/md0 , using one single drive /dev/sd1 , enter
the creation command at the command prompt (a sketch follows this procedure):
The device should be the same size or larger than the device you plan to add to it.
3. Add the device you copied the data from to the mirror. For example, to add /dev/sdb1
to the RAID, enter the following at the command prompt:
You can add only one device at a time. You must wait for the kernel to build the mirror
and bring it fully online before you add another mirror.
4. Monitor the build progress by entering the following at the command prompt:
To see the rebuild progress while being refreshed every second, enter
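A sketch of the commands for this procedure (steps 1 and 3, plus the two monitoring commands;
the device names follow the example in the text):

> sudo mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sd1 missing
> sudo mdadm /dev/md0 --add /dev/sdb1
> cat /proc/mdstat
> sudo watch -n 1 cat /proc/mdstat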
This section describes how to increase or reduce the size of a software RAID 1, 4, 5,
or 6 device with the Multiple Device Administration ( mdadm(8) ) tool.
Resizing an existing software RAID device involves increasing or decreasing the space con-
tributed by each component partition. The file system that resides on the RAID must also be able
to be resized to take advantage of the changes in available space on the device. In SUSE Linux
Enterprise Server, file system resizing utilities are available for file systems Btrfs, Ext2, Ext3,
Ext4, and XFS (increase size only). Refer to Chapter 2, Resizing File Systems for more information.
The mdadm tool supports resizing only for software RAID levels 1, 4, 5, and 6. These RAID
levels provide disk fault tolerance so that one component partition can be removed at a time
for resizing. In principle, it is possible to perform a hot resize for RAID partitions, but you must
take extra care for your data when doing so.
Resizing the RAID involves the following tasks. The order in which these tasks are performed
depends on whether you are increasing or decreasing its size.
Resize the file system: You must resize the file system that resides on the RAID. This is possible
only for file systems that provide tools for resizing. (This is the third task when increasing the
size and the first task when decreasing it.)
The procedures in the following sections use the device names shown in the following table.
Ensure that you modify the names to use the names of your own devices.
In this example, the RAID device /dev/md0 consists of the component partitions /dev/sda1 ,
/dev/sdb1 , and /dev/sdc1 .
1. Open a terminal.
If your RAID array is still synchronizing according to the output of this command, you
must wait until synchronization is complete before continuing.
3. Remove one of the component partitions from the RAID array. For example, to remove
/dev/sda1 , enter the corresponding mdadm command (a sketch follows this procedure).
4. Increase the size of the partition that you removed in the previous step by doing one of
the following:
Increase the size of the partition, using a disk partitioner such as the YaST Partitioner
or the command line tool parted. This option is the usual choice.
Replace the disk on which the partition resides with a higher-capacity device. This
option is possible only if no other file systems on the original disk are accessed by
the system. When the replacement device is added back into the RAID, it takes much
longer to synchronize the data because all of the data that was on the original device
must be rebuilt.
5. Re-add the partition to the RAID array. For example, to add /dev/sda1 , enter
Wait until the RAID is synchronized and consistent before continuing with the next par-
tition.
7. If you get a message that tells you that the kernel could not re-read the partition table
for the RAID, you must reboot the computer after all partitions have been resized to force
an update of the partition table.
8. Continue with Section 11.1.2, “Increasing the Size of the RAID Array”.
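A sketch of the commands for steps 3 and 5 (the array and partition names are those used in
this example):

> sudo mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
> sudo mdadm /dev/md0 --add /dev/sda1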
1. Open a terminal.
If your RAID array is still synchronizing according to the output of this command, you
must wait until synchronization is complete before continuing.
3. Check the size of the array and the device size known to the array by entering
Increase the size of the array to the maximum available size by entering the corresponding
grow command (a sketch of both commands follows this procedure).
Increase the size of the array to a specified size by entering the command with the desired size.
Replace SIZE with an integer value in kilobytes (a kilobyte is 1024 bytes) for the
desired size.
5. Recheck the size of your array and the device size known to the array by entering
If your array was successfully resized, continue with Section 11.1.3, “Increasing the Size
of the File System”.
If your array was not resized as you expected, you must reboot, then try this proce-
dure again.
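A sketch of the commands for checking and growing the array (the device name is a placeholder):

> sudo mdadm -D /dev/md0 | grep -e "Array Size" -e "Dev Size"
> sudo mdadm --grow /dev/md0 -z max
> sudo mdadm --grow /dev/md0 -z SIZE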
The new size must be greater than the size of the existing data; otherwise, data loss occurs.
The new size must be equal to or less than the current RAID size because the file system
size cannot extend beyond the space available.
Important: XFS
Decreasing the size of a file system formatted with XFS is not possible, since such a feature
is not supported by XFS. As a consequence, the size of a RAID that uses the XFS file system
cannot be decreased.
The new size must be greater than the size of the existing data; otherwise, data loss occurs.
The new size must be equal to or less than the current RAID size because the file system
size cannot extend beyond the space available.
1. Open a terminal.
2. Check the size of the array and the device size known to the array by entering
Replace SIZE with an integer value in kilobytes for the desired size. (A kilobyte is 1024
bytes.)
For example, the command sketched after this procedure sets the segment size for each RAID
device to about 40 GB where the chunk size is 64 KB. It includes 128 KB for the RAID superblock.
4. Recheck the size of your array and the device size known to the array by entering
If your array was successfully resized, continue with Section 11.2.3, “Decreasing the Size
of Component Partitions”.
If your array was not resized as you expected, you must reboot, then try this proce-
dure again.
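A sketch of the shrink command referred to in this procedure (the device name is a placeholder;
the size value shown corresponds to 40 GiB plus 128 KB expressed in kilobytes, matching the
~40 GB example above, and is only an illustration):

> sudo mdadm --grow /dev/md0 -z 41943168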
1. Open a terminal.
If your RAID array is still synchronizing according to the output of this command, you
must wait until synchronization is complete before continuing.
3. Remove one of the component partitions from the RAID array. For example, to remove
/dev/sda1 , enter
4. Decrease the size of the partition that you removed in the previous step to a size that is
slightly larger than the size you set for the segment size. The size should be a multiple of
the chunk size and allow 128 KB for the RAID superblock. Use a disk partitioner such as
the YaST partitioner or the command line tool parted to decrease the size of the partition.
5. Re-add the partition to the RAID array. For example, to add /dev/sda1 , enter
Wait until the RAID is synchronized and consistent before continuing with the next par-
tition.
6. Repeat these steps for each of the remaining component devices in the array. Ensure that
you modify the commands for the correct component partition.
7. If you get a message that tells you that the kernel could not re-read the partition table for
the RAID, you must reboot the computer after resizing all of its component partitions.
Storage enclosure LED Monitoring utility ( ledmon ) and LED Control ( ledctl ) util-
ity are Linux user space applications that use a broad range of interfaces and pro-
tocols to control storage enclosure LEDs. The primary usage is to visualize the sta-
tus of Linux MD software RAID devices created with the mdadm utility. The led-
mon daemon monitors the status of the drive array and updates the status of the drive
LEDs. The ledctl utility allows you to set LED patterns for specified devices.
These LED utilities use the SGPIO (Serial General Purpose Input/Output) specification (Small
Form Factor (SFF) 8485) and the SCSI Enclosure Services (SES) 2 protocol to control LEDs. They
implement the International Blinking Pattern Interpretation (IBPI) patterns of the SFF-8489
specification for SGPIO. The IBPI defines how the SGPIO standards are interpreted as states for
drives and slots on a backplane and how the backplane should visualize the states with LEDs.
Some storage enclosures do not adhere strictly to the SFF-8489 specification. An enclosure
processor might accept an IBPI pattern but not blink the LEDs according to the SFF-8489 spec-
ification, or the processor might support only a limited number of the IBPI patterns.
LED management (AHCI) and SAF-TE protocols are not supported by the ledmon and ledctl
utilities.
The ledmon and ledctl applications have been verified to work with Intel storage controllers
such as the Intel AHCI controller and Intel SAS controller. They also support PCIe-SSD (sol-
id-state drive) enclosure LEDs to control the storage enclosure status (OK, Fail, Rebuilding) LEDs
of PCIe-SSD devices that are part of an MD software RAID volume. The applications might al-
so work with the IBPI-compliant storage controllers of other vendors (especially SAS/SCSI con-
trollers); however, other vendors’ controllers have not been tested.
ledmon and ledctl are part of the ledmon package, which is not installed by default. Run
sudo zypper in ledmon to install it.
-c PATH ,
--confg=PATH
The configuration is read from ~/.ledctl or from /etc/ledcfg.conf if existing. Use
this option to specify an alternative configuration le.
Currently this option has no effect, since support for configuration les has not been im-
plemented yet. See man 5 ledctl.conf for details.
-l PATH ,
--log= PATH
Sets a path to a local log file. If this user-defined file is specified, the global log file /var/
log/ledmon.log is not used.
-t SECONDS ,
--interval=SECONDS
Sets the time interval between scans of sysfs . The value is given in seconds. The minimum
is 5 seconds. The maximum is not specified.
-h ,
--help
Prints the command information to the console, then exits.
-v ,
--version
Displays version of ledmon and information about the license, then exits.
-c PATH ,
--confg=PATH
Sets a path to a local configuration file. If this option is specified, the global configuration
file and user configuration file have no effect.
-l PATH ,
--log= PATH
Sets a path to a local log file. If this user-defined file is specified, the global log file /var/
log/ledmon.log is not used.
--quiet
Turns off all messages sent to stdout or stderr. The messages are still logged to the
local file and the syslog facility.
-h ,
--help
Prints the command information to the console, then exits.
-v ,
--version
Displays version of ledctl and information about the license, then exits.
locate
Turns on the Locate LED associated with the specified devices or empty slots. This state
is used to identify a slot or drive.
locate_off
Turns off the Locate LED associated with the specified devices or empty slots.
normal
Turns off the Status LED, Failure LED, and Locate LED associated with the specified devices.
off
Turns off only the Status LED and Failure LED associated with the specified devices.
rebuild ,
rebuild_p
Visualizes the Rebuild pattern. This supports both of the rebuild states for compatibility
and legacy reasons.
ifa ,
failed_array
Visualizes the In a Failed Array pattern.
hotspare
Visualizes the Hotspare pattern.
pfa
Visualizes the Predicted Failure Analysis pattern.
failure ,
disk_failed
Visualizes the Failure pattern.
ses_abort
SES-2 R/R ABORT
ses_rebuild
SES-2 REBUILD/REMAP
ses_ifa
SES-2 IN FAILED ARRAY
ses_ica
SES-2 IN CRITICAL ARRAY
ses_cons_check
SES-2 CONS CHECK
ses_hotspare
SES-2 HOTSPARE
ses_rsvd_dev
SES-2 RSVD DEVICE
ses_ident
SES-2 IDENT
ses_rm
SES-2 REMOVE
ses_insert
SES-2 INSERT
ses_missing
SES-2 MISSING
ses_dnr
SES-2 DO NOT REMOVE
ses_active
SES-2 ACTIVE
ses_enable_bb
SES-2 ENABLE BYP B
ses_enable_ba
SES-2 ENABLE BYP A
ses_devoff
SES-2 DEVICE OFF
ses_fault
SES-2 FAULT
When a non-SES-2 pattern is sent to a device in an enclosure, the pattern is automatically trans-
lated to the SCSI Enclosure Services (SES) 2 pattern as shown above.
TABLE 12.1: TRANSLATION BETWEEN NON-SES-2 PATTERNS AND SES-2 PATTERNS
locate ses_ident
locate_off ses_ident
normal ses_ok
off ses_ok
ica ses_ica
degraded ses_ica
rebuild ses_rebuild
rebuild_p ses_rebuild
ifa ses_ifa
failed_array ses_ifa
hotspare ses_hotspare
pfa ses_rsvd_dev
failure ses_fault
disk_failed ses_fault
If you specify multiple patterns in the same command, the device list for each pattern can use
the same or different format. For examples that show the two list formats, see Section 12.2.3,
“Examples”.
A device is a path to a file in the /dev directory or in the /sys/block directory. The path can
identify a block device, an MD software RAID device, or a container device. For a software RAID
device or a container device, the reported LED state is set for all of the associated block devices.
12.2.3 Examples
To locate a single block device:
To locate disks of an MD software RAID device and to set a rebuild pattern for two of its block
devices at the same time:
To turn off the Status LED and Failure LED for the specified devices:
To locate three block devices, run one of the following commands (both are equivalent):
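Sketches of such ledctl calls (the device names are placeholders; the last two lines show the
two equivalent device-list formats):

> sudo ledctl locate=/dev/sda
> sudo ledctl locate=/dev/md127 rebuild={ /dev/sdb /dev/sdc }
> sudo ledctl off={ /dev/sda /dev/sdb }
> sudo ledctl locate=/dev/sda,/dev/sdb,/dev/sdc
> sudo ledctl locate={ /dev/sda /dev/sdb /dev/sdc }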
In the case of disk media or controller failure, the device needs to be replaced or repaired.
If a hot-spare was not configured within the RAID, then manual intervention is required.
In the last case, the failed device can be automatically re-added by the mdadm command after
the connection is repaired (which might be automatic).
Because md / mdadm cannot reliably determine what caused the disk failure, it assumes a serious
disk error and treats any failed device as faulty until it is explicitly told that the device is reliable.
Under some circumstances, such as storage devices with an internal RAID array, connection
problems are very often the cause of the device failure. In such a case, you can tell mdadm
that it is safe to automatically --re-add the device after it appears. You can do this by adding
the following line to /etc/mdadm.conf :
POLICY action=re-add
Note that the device will be automatically re-added after re-appearing only if the udev rules
cause mdadm -I DISK_DEVICE_NAME to be run on any device that spontaneously appears (de-
fault behavior), and if write-intent bitmaps are configured (they are by default).
Storage area networks (SANs) can contain many disk drives that are dispersed across complex
networks. This can make device discovery and device ownership difficult. iSCSI initiators must
be able to identify storage resources in the SAN and determine whether they have access to them.
Internet Storage Name Service (iSNS) is a standards-based service that simplifies the automat-
ed discovery, management, and configuration of iSCSI devices on a TCP/IP network. iSNS pro-
vides intelligent storage discovery and management services comparable to those found in Fibre
Channel networks.
Without iSNS, you must know the host name or IP address of each node where targets of interest
are located. In addition, you must manually manage which initiators can access which targets,
using mechanisms such as access control lists.
Both iSCSI targets and iSCSI initiators can use iSNS clients to initiate transactions with iSNS
servers by using the iSNS protocol. They then register device attribute information in a common
discovery domain, download information about other registered clients, and receive asynchro-
nous notification of events that occur in their discovery domain.
iSNS servers respond to iSNS protocol queries and requests made by iSNS clients using the
iSNS protocol. iSNS servers initiate iSNS protocol state change notifications and store properly
authenticated information submitted by a registration request in an iSNS database.
Benefits provided by iSNS for Linux include:
2. In case open-isns is not installed yet, you are prompted to install it now. Confirm by
clicking Install.
3. The iSNS Service configuration dialog opens automatically to the Service tab.
Manually (Default): The iSNS service must be started manually by entering sudo
systemctl start isnsd at the server console of the server where you install it.
Open Port in Firewall: Select the check box to open the firewall and allow access to
the service from remote computers. The firewall port is closed by default.
Firewall Details: If you open the firewall port, the port is open on all network inter-
faces by default. Click Firewall Details to select interfaces on which to open the port,
select the network interfaces to use, then click OK.
4. Specify the name of the discovery domain you are creating, then click OK.
to restart an initiator or
to restart a target.
You can select an iSCSI node and click the Delete button to remove that node from the
iSNS database. This is useful if you are no longer using an iSCSI node or have renamed it.
The iSCSI node is automatically added to the list (iSNS database) again when you restart
the iSCSI service or reboot the server unless you remove or comment out the iSNS portion
of the iSCSI configuration file.
4. Click the Discovery Domains tab and select the desired discovery domain.
5. Click Add existing iSCSI Node, select the node you want to add to the domain, then click
Add Node.
You can also use the stop , status , and restart options with iSNS.
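For example (a sketch, using the isnsd service unit mentioned earlier):

> sudo systemctl stop isnsd
> sudo systemctl status isnsd
> sudo systemctl restart isnsd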
General information about iSNS is available in RFC 4171: Internet Storage Name Service at https://
datatracker.ietf.org/doc/html/rfc4171 .
[Figure: iSCSI SAN overview showing servers connected over Ethernet switches and the network backbone]
Note: LIO
LIO (http://linux-iscsi.org ) is the standard open source multiprotocol SCSI target for Lin-
ux. LIO replaced the STGT (SCSI Target) framework as the standard unified storage target
in Linux with Linux kernel version 2.6.38 and later. In SUSE Linux Enterprise Server 12
the iSCSI LIO Target Server replaces the iSCSI Target Server from previous versions.
iSCSI is a storage networking protocol that simplifies data transfers of SCSI packets over TCP/
IP networks between block storage devices and servers. iSCSI target software runs on the target
server and defines the logical units as iSCSI target devices. iSCSI initiator software runs on
different servers and connects to the target devices to make the storage devices available on
that server.
To install the iSCSI LIO Target Server, run the following command in a terminal:
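A sketch of the installation command (the package name yast2-iscsi-lio-server is an
assumption, by analogy with the initiator package named below):

> sudo zypper in yast2-iscsi-lio-server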
In case you need to install the iSCSI initiator or any of its dependencies, run the command sudo
zypper in yast2-iscsi-client .
15.2 Setting Up an iSCSI LIO Target Server
This section describes how to use YaST to configure an iSCSI LIO Target Server and set up iSCSI
LIO target devices. You can use any iSCSI initiator software to access the target devices.
3. Under Service Start, specify how you want the iSCSI LIO target service to be started:
Manually: (Default) You must start the service manually after a server restart by
running sudo systemctl start targetcli . The target devices are not available
until you start the service.
a. On the Services tab, select the Open Port in Firewall check box to enable the firewall
settings.
b. Click Firewall Details to view or configure the network interfaces to use. All available
network interfaces are listed, and all are selected by default. Deselect all interfaces
on which the port should not be opened. Save your settings with OK.
5. Click Finish to save and apply the iSCSI LIO Target service settings.
Important: Security
We recommend that you use authentication for target and initiator discovery in produc-
tion environments for security reasons.
4. Provide credentials for the selected authentication method(s). The user name and pass-
word pair must be different for incoming and outgoing discovery.
15.2.3 Preparing the Storage Space
Before you configure LUNs for your iSCSI Target servers, you must prepare the storage you want
to use. You can use the entire unformatted block device as a single LUN, or you can subdivide
a device into unformatted partitions and use each partition as a separate LUN. The iSCSI target
configuration exports the LUNs to iSCSI initiators.
You can use the Partitioner in YaST or the command line to set up the partitions. Refer to
Book “Deployment Guide”, Chapter 10 “Expert Partitioner”, Section 10.1 “Using the Expert Partitioner”
for details. iSCSI LIO targets can use unformatted partitions with Linux, Linux LVM, or Linux
RAID file system IDs.
You can use a virtual machine guest server as an iSCSI LIO Target Server. This section describes
how to assign partitions to a Xen virtual machine. You can also use other virtual environments
that are supported by SUSE Linux Enterprise Server.
In a Xen virtual environment, you must assign the storage space you want to use for the iSCSI
LIO target devices to the guest virtual machine, then access the space as virtual disks within
the guest environment. Each virtual disk can be a physical block device, such as an entire disk,
partition, or volume, or it can be a file-backed disk image where the virtual disk is a single
image file on a larger physical disk on the Xen host server. For the best performance, create
each virtual disk from a physical disk or a partition. After you set up the virtual disks for the
guest virtual machine, start the guest server, then configure the new blank virtual disks as iSCSI
target devices by following the same process as for a physical server.
File-backed disk images are created on the Xen host server, then assigned to the Xen guest
server. By default, Xen stores file-backed disk images in the /var/lib/xen/images/VM_NAME
directory, where VM_NAME is the name of the virtual machine.
Important: Partitions
Before you begin, choose the partitions you wish to use for back-end storage. The parti-
tions do not have to be formatted—the iSCSI client can format them when connected,
overwriting all existing formatting.
3. Click Add, then define a new iSCSI LIO target group and devices:
The iSCSI LIO Target software automatically completes the Target, Identifier, Portal Group,
IP Address, and Port Number fields. Use Authentication is selected by default.
a. If you have multiple network interfaces, use the IP address drop-down box to select
the IP address of the network interface to use for this target group. To make the
server accessible under all addresses, choose Bind All IP Addresses.
c. Click Add. Enter the path of the device or partition or Browse to add it. Optionally
specify a name, then click OK. The LUN number is automatically generated, begin-
ning with 0. A name is automatically generated if you leave the field empty.
d. (Optional) Repeat the previous steps to add targets to this target group.
e. After all desired targets have been added to the group, click Next.
4. On the Modify iSCSI Target Initiator Setup page, configure information for the initiators that
are permitted to access LUNs in the target group:
After you specify at least one initiator for the target group, the Edit LUN, Edit Auth, Delete,
and Copy buttons are enabled. You can use Add or Copy to add initiators for the target
group:
Add: Add a new initiator entry for the selected iSCSI LIO target group.
Edit Auth: Configure the preferred authentication method for a selected initiator.
You can specify no authentication, or you can configure incoming authentication,
outgoing authentication, or both.
Delete: Remove a selected initiator entry from the list of initiators allocated to the
target group.
Copy: Add a new initiator entry with the same LUN mappings and authentication
settings as a selected initiator entry. This allows you to easily allocate the same
shared LUNs, in turn, to each node in a cluster.
a. Click Add, specify the initiator name, select or deselect the Import LUNs from TPG
check box, then click OK to save the settings.
b. Select an initiator entry, click Edit LUN, modify the LUN mappings to specify which
LUNs in the iSCSI LIO target group to allocate to the selected initiator, then click
OK to save the changes.
If the iSCSI LIO target group consists of multiple LUNs, you can allocate one or
multiple LUNs to the selected initiator. By default, each of the available LUNs in the
group is assigned to an initiator LUN.
To modify the LUN allocation, perform one or more of the following actions:
Add: Click Add to create a new Initiator LUN entry, then use the Change drop-
down box to map a target LUN to it.
Delete: Select the Initiator LUN entry, then click Delete to remove a target LUN
mapping.
Change: Select the Initiator LUN entry, then use the Change drop-down box to
select which Target LUN to map to it.
A single server is listed as an initiator. All of the LUNs in the target group are
allocated to it.
Each node of a cluster is listed as an initiator. All of the shared target LUNs are
allocated to each node. All nodes are attached to the devices, but for most file
systems, the cluster software locks a device for access and mounts it on only
one node at a time. Shared file systems (such as OCFS2) make it possible for
multiple nodes to concurrently mount the same file structure and to open the
same files with read and write access.
You can use this grouping strategy to logically group the iSCSI SAN storage for
a given server cluster.
c. Select an initiator entry, click Edit Auth, specify the authentication settings for the
initiator, then click OK to save the settings.
You can require No Discovery Authentication, or you can configure Authentication by
Initiators, Outgoing Authentication, or both. You can specify only one user name and
password pair for each initiator. The credentials can be different for incoming and
outgoing authentication for an initiator. The credentials can be different for each
initiator.
d. Repeat the previous steps for each iSCSI initiator that can access this target group.
Modify the user name and password credentials for an initiator authentication (incoming,
outgoing, or both)
3. Select the iSCSI LIO target group to be modified, then click Edit.
4. On the Modify iSCSI Target LUN Setup page, add LUNs to the target group, edit the LUN
assignments, or remove target LUNs from the group. After all desired changes have been
made to the group, click Next.
For option information, see Modify iSCSI Target: Options.
5. On the Modify iSCSI Target Initiator Setup page, configure information for the initiators
that are permitted to access LUNs in the target group. After all desired changes have been
made to the group, click Next.
3. Select the iSCSI LIO target group to be deleted, then click Delete.
4. When you are prompted, click Continue to confirm the deletion, or click Cancel to cancel it.
Service:
The Service tab can be used to enable the iSCSI initiator at boot time. It also lets you set
a unique Initiator Name and an iSNS server to use for the discovery.
Connected Targets:
The Connected Targets tab gives an overview of the currently connected iSCSI targets. Like
the Discovered Targets tab, it also gives the option to add new targets to the system.
Discovered Targets:
The Discovered Targets tab lets you manually discover iSCSI targets on the network.
4. In the After reboot menu, specify the action that takes place after a reboot:
Start on demand - the associated socket will be running and, if needed, the service
will be started.
iqn.yyyy-mm.com.mycompany:n1:n2
iqn.1996-04.de.suse:01:a5dfcea717a
The Initiator Name is automatically completed with the corresponding value from the
/etc/iscsi/initiatorname.iscsi file on the server.
If the server has iBFT (iSCSI Boot Firmware Table) support, the Initiator Name is completed
with the corresponding value in the iBFT, and you are not able to change the initiator name
in this interface. Use the BIOS Setup to modify it instead. The iBFT is a block of information
containing various parameters useful to the iSCSI boot process, including iSCSI target and
initiator descriptions for the server.
6. Use either of the following methods to discover iSCSI targets on the network.
iSNS: To use iSNS (Internet Storage Name Service) for discovering iSCSI targets,
continue with Section 15.3.1.2, “Discovering iSCSI Targets by Using iSNS”.
Discovered Targets: To discover iSCSI target devices manually, continue with Sec-
tion 15.3.1.3, “Discovering iSCSI Targets Manually”.
2. Specify the IP address of the iSNS server and port. The default port is 3205.
1. In YaST, select iSCSI Initiator, then select the Discovered Targets tab.
3. Enter the IP address and change the port if needed. The default port is 3260.
5. Click Next to start the discovery and connect to the iSCSI target server.
6. If credentials are required, after a successful discovery, use Connect to activate the target.
You are prompted for authentication credentials to use the selected iSCSI target.
9. You can find the local device path for the iSCSI target device by using the lsscsi command.
1. In YaST, select iSCSI Initiator, then select the Connected Targets tab to view a list of the
iSCSI target devices that are currently connected to the server.
Automatic: This option is used for iSCSI targets that are to be connected when the iSCSI
service itself starts up. This is the typical configuration.
Onboot: This option is used for iSCSI targets that are to be connected during boot; that
is, when root ( / ) is on iSCSI. As such, the iSCSI target device will be evaluated from the
initrd when the server boots. This option is ignored on platforms that cannot boot from iSCSI,
such as IBM Z. Therefore it should not be used on these platforms; use Automatic instead.
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = USERNAME
discovery.sendtargets.auth.password = PASSWORD
The discovery stores all received values in an internal persistent database. In addition, it displays
all detected targets. Run this discovery with the following command:
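For example, a sendtargets discovery could look like the following sketch (the portal address is the example target used in this section and must be adjusted to your environment):
sudo iscsiadm -m discovery --type=sendtargets --portal=10.44.171.99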
10.44.171.99:3260,1 iqn.2006-02.com.example.iserv:systems
To discover the available targets on an iSNS server, use the following command:
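A hedged sketch, assuming the address and port of your iSNS server are known (replace ISNS_SERVER_IP accordingly):
sudo iscsiadm -m discovery --type=isns --portal=ISNS_SERVER_IP:3205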
For each target defined on the iSCSI target, one line appears. For more information about the
stored data, see Section 15.3.3, “The iSCSI Initiator Databases”.
The special --login option of iscsiadm creates all needed devices:
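For illustration, a login to the example target discovered above might look like this (target name and portal are illustrative):
sudo iscsiadm -m node -T iqn.2006-02.com.example.iserv:systems -p 10.44.171.99 --login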
The newly generated devices show up in the output of lsscsi and can now be mounted.
To edit the value of one of these variables, use the command iscsiadm with the update op-
eration. For example, if you want iscsid to log in to the iSCSI target when it initializes, set the
variable node.startup to the value automatic :
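A sketch of such an update, using the example target record from this section:
sudo iscsiadm -m node -T iqn.2006-02.com.example.iserv:systems -p 10.44.171.99 --op=update --name=node.startup --value=automatic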
Remove obsolete data sets with the delete operation. If the target iqn.2006-02.com.example.iserv:systems is no longer a valid record, delete this record with the following command:
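For example (the target name is the obsolete record mentioned above):
sudo iscsiadm -m node -T iqn.2006-02.com.example.iserv:systems --op=delete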
Important: No Confirmation
Use this option with caution because it deletes the record without any additional confir-
mation prompt.
To get a list of all discovered targets, run the sudo iscsiadm -m node command.
You can use the help command in any directory to view a list of available commands or infor-
mation about any command in particular.
The targetcli tool is part of the targetcli-fb package. This package is available in the
official SUSE Linux Enterprise Server software repository, and it can be installed using the fol-
lowing command:
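For example, with zypper:
sudo zypper install targetcli-fb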
After the targetcli-fb package has been installed, enable the targetcli service:
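For example (the service name targetcli is the same one used elsewhere in this chapter):
sudo systemctl enable targetcli
sudo systemctl start targetcli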
You can then run the ls command to see the default configuration.
/> ls
o- / ............................ [...]
o- backstores ................. [...]
| o- block ..... [Storage Objects: 0]
| o- fileio .... [Storage Objects: 0]
| o- pscsi ..... [Storage Objects: 0]
| o- ramdisk ... [Storage Objects: 0]
| o- rbd ....... [Storage Objects: 0]
o- iscsi ............... [Targets: 0]
o- loopback ............ [Targets: 0]
o- vhost ............... [Targets: 0]
As the output of the ls command indicates, there are no configured back-ends. So the first step
is to configure one of the supported software targets.
targetcli supports the back-ends shown in the listing above: block, fileio, pscsi, ramdisk, and rbd.
To familiarize yourself with the functionality of targetcli, set up a local image file as a software
target using the create command:
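A sketch of the create command that matches the listing below (backstore name, path, and size are illustrative):
/> backstores/fileio create test-disc /alt/test.img 1G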
This creates a 1 GB test.img image in the specified location (in this case /alt ). Run ls , and
you should see the following result:
/> ls
o- / ........................................................... [...]
o- backstores ................................................ [...]
| o- block .................................... [Storage Objects: 0]
| o- fileio ................................... [Storage Objects: 1]
| | o- test-disc ... [/alt/test.img (1.0GiB) write-back deactivated]
| | o- alua ...... .......................... [ALUA Groups: 1]
| | o- default_tg_pt_gp .... [ALUA state: Active/optimized]
| o- pscsi .................................... [Storage Objects: 0]
| o- ramdisk .................................. [Storage Objects: 0]
| o- rbd ...................................... [Storage Objects: 0]
o- iscsi .............................................. [Targets: 0]
o- loopback ........................................... [Targets: 0]
o- vhost .............................................. [Targets: 0]
o- xen-pvscsi ......................................... [Targets: 0]
/>
The output indicates that there is now a file-based backstore, under the /backstores/fileio
directory, called test-disc , which is linked to the created file /alt/test.img . Note that the
new backstore is not yet activated.
iqn.YYYY-MM.NAMING-AUTHORITY:UNIQUE-NAME
YYYY-MM is the year and month when the naming authority was established, and NAMING-AUTHORITY is usually the reverse syntax of the Internet domain name of the naming authority.
For example, for the domain open-iscsi.com , the IQN can be as follows:
iqn.2005-03.com.open-iscsi:UNIQUE-NAME
When creating an iSCSI target, the targetcli command allows you to assign your own IQN,
as long as it follows the specified format. You can also let the command create an IQN for you
by omitting a name when creating the target, for example:
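For example, creating a target without specifying an IQN (the generated name in the listing below is an example):
/> iscsi/ create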
/> ls
o- / ............................................................... [...]
o- backstores .................................................... [...]
| o- block ........................................ [Storage Objects: 0]
| o- fileio ....................................... [Storage Objects: 1]
| | o- test-disc ....... [/alt/test.img (1.0GiB) write-back deactivated]
| | o- alua ......................................... [ALUA Groups: 1]
| | o- default_tg_pt_gp ............. [ALUA state: Active/optimized]
| o- pscsi ........................................ [Storage Objects: 0]
| o- ramdisk ...................................... [Storage Objects: 0]
| o- rbd .......................................... [Storage Objects: 0]
o- iscsi .................................................. [Targets: 1]
| o- iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456 ... [TPGs: 1]
| o- tpg1 ..................................... [no-gen-acls, no-auth]
| o- acls ................................................ [ACLs: 0]
| o- luns ................................................ [LUNs: 0]
| o- portals .......................................... [Portals: 1]
| o- 0.0.0.0:3260 ........................................... [OK]
o- loopback ............................................... [Targets: 0]
o- vhost .................................................. [Targets: 0]
o- xen-pvscsi ............................................. [Targets: 0]
/>
Note that targetcli has also created and enabled the default target portal group tpg1 . This
is done because the variables auto_add_default_portal and auto_enable_tpgt at the root
level are set to true by default.
The command also created the default portal with the 0.0.0.0 IPv4 wildcard. This means that
any IPv4 address can access the configured target.
The next step is to create a LUN (Logical Unit Number) for the iSCSI target. The best way to
do this is to let targetcli assign its name and number automatically. Switch to the directory
of the iSCSI target, and then use the create command in the lun directory to assign a LUN
to the backstore.
/> cd /iscsi/iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456/
/iscsi/iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456> cd tpg1
/iscsi/iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456/tpg1> luns/ create /backstores/fileio/test-disc
/iscsi/iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456/tpg1> ls
o- tpg1 .............................................. [no-gen-acls, no-auth]
o- acls ..................................................... [ACLs: 0]
o- luns ..................................................... [LUNs: 1]
| o- lun0 ....... [fileio/test-disc (/alt/test.img) (default_tg_pt_gp)]
o- portals ............................................... [Portals: 1]
o- 0.0.0.0:3260 ................................................ [OK]
There is now an iSCSI target that has a 1 GB file-based backstore. The target has the
iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456 name, and it can be accessed
from any network port of the system.
Finally, you need to ensure that initiators have access to the configured target. One way to do
this is to create an ACL rule for each initiator that allows them to connect to the target. In this
case, you must list each desired initiator using its IQN. The IQNs of the existing initiators can
be found in the /etc/iscsi/initiatorname.iscsi file. Use the following command to add
the desired initiator (in this case, it's iqn.1996-04.de.suse:01:54cab487975b ):
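A sketch of the corresponding ACL creation, run inside the target portal group directory (the initiator IQN is the example from the text):
/iscsi/iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456/tpg1> acls/ create iqn.1996-04.de.suse:01:54cab487975b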
The last step is to save the created configuration using the saveconfig command available in
the root directory:
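For example, saving to the file that is also used in the restore example below:
/> saveconfig /etc/target/example.json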
If at some point you need to restore configuration from the saved file, you need to clear the
current configuration first. Keep in mind that clearing the current configuration results in data
loss unless you save your configuration first. Use the following command to clear and reload
the configuration:
/> clearconfig
As a precaution, confirm=True needs to be set
/> clearconfig confirm=true
All configuration cleared
/> restoreconfig /etc/target/example.json
Configuration restored from /etc/target/example.json
/>
To test whether the configured target is working, connect to it using the open-iscsi iSCSI ini-
tiator installed on the same system (replace HOSTNAME with the hostname of the local machine):
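A minimal discovery sketch that produces output of the form shown below:
sudo iscsiadm -m discovery -t st -p HOSTNAME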
192.168.20.3:3260,1 iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456
You can then connect to the listed target using the login iSCSI command. This makes the target
available as a local disk.
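For illustration, the login could be performed as follows (target name and portal are taken from the discovery output above):
sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.e83.x8664:sn.8b35d04dd456 -p 192.168.20.3 --login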
To use iSCSI disks during the installation, add the following parameter to the boot parameter line:
withiscsi=1
During installation, an additional screen appears that provides the option to attach iSCSI disks
to the system and use them in the installation process.
This problem occurs if the iSCSI LIO Target Server software is not currently running. To resolve
this issue, exit YaST, manually start iSCSI LIO at the command line with systemctl start
targetcli , then try again.
You can also enter the following to check if configfs , iscsi_target_mod , and target_core_mod are loaded.
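For example (a sketch; the module list on your system may differ):
lsmod | grep -E "configfs|iscsi_target_mod|target_core_mod"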
15.6.2 iSCSI LIO Targets Are Not Visible from Other Computers
If you use a firewall on the target server, you must open the iSCSI port that you are using to
allow other computers to see the iSCSI LIO targets. For information, see Section 15.2.1, “iSCSI LIO
Target Service Start-up and Firewall Settings”.
15.6.5 iSCSI Targets Are Mounted When the Configuration File Is
Set to Manual
When Open-iSCSI starts, it can mount the targets even if the node.startup option is set to
manual in the /etc/iscsi/iscsid.conf file if you manually modified the configuration file.
Check the /etc/iscsi/nodes/TARGET_NAME/IP_ADDRESS,PORT/default file. It contains a
node.startup setting that overrides the /etc/iscsi/iscsid.conf file. Setting the mount
option to manual by using the YaST interface also sets node.startup = manual in the
/etc/iscsi/nodes/TARGET_NAME/IP_ADDRESS,PORT/default files.
endpoint
The combination of an iSCSI Target Name with an iSCSI TPG (IQN + Tag).
EUI (extended unique identifier)
A 64‐bit number that uniquely identifies every device in the world. The format consists
of 24 bits that are unique to a given company, and 40 bits assigned by the company to
each device it builds.
initiator
The originating end of an SCSI session. Typically a controlling device such as a computer.
network portal
The combination of an iSCSI endpoint with an IP address plus a TCP (Transmission Control
Protocol) port. TCP port 3260 is the port number for the iSCSI protocol, as defined by
IANA (Internet Assigned Numbers Authority).
target
The receiving end of an SCSI session, typically a device such as a disk drive, tape drive,
or scanner.
target port
The combination of an iSCSI endpoint with one or more LUNs.
Many enterprise data centers rely on Ethernet for their LAN and data traffic, and on Fibre Chan-
nel networks for their storage infrastructure. Open Fibre Channel over Ethernet (FCoE) Initiator
software allows servers with Ethernet adapters to connect to a Fibre Channel storage subsystem
over an Ethernet network. This connectivity was previously reserved exclusively for systems
with Fibre Channel adapters over a Fibre Channel fabric. The FCoE technology reduces com-
plexity in the data center by aiding network convergence. This helps to preserve your existing
investments in a Fibre Channel storage infrastructure and to simplify network management.
FIGURE: Open FCoE Initiators running on servers with 10 Gbps Ethernet cards, connected through an Ethernet switch to the network backbone.
Open-FCoE allows you to run the Fibre Channel protocols on the host, instead of on propri-
etary hardware on the host bus adapter. It is targeted for 10 Gbps (gigabit per second) Ether-
net adapters, but can work on any Ethernet adapter that supports pause frames. The initiator
software provides a Fibre Channel protocol processing module and an Ethernet-based transport
module. The Open-FCoE module acts as a low-level driver for SCSI. The Open-FCoE transport
uses net_device to send and receive packets. Data Center Bridging (DCB) drivers provide the
quality of service for FCoE.
Your enterprise already has a Fibre Channel storage subsystem and administrators with
Fibre Channel skills and knowledge.
To use FCoE disks during the installation, add the following parameter to the boot parameter line:
withfcoe=1
When the FCoE disks are detected, the YaST installation offers the option to configure FCoE
instances at that time. On the Disk Activation page, select Configure FCoE Interfaces to access
the FCoE configuration. For information about configuring the FCoE interfaces, see Section 16.3,
“Managing FCoE Services with YaST”.
Alternatively, use the YaST Software Manager to install the packages listed above.
16.3 Managing FCoE Services with YaST
You can use the YaST FCoE Client Configuration option to create, configure, and remove FCoE
interfaces for the FCoE disks in your Fibre Channel storage infrastructure. To use this option,
the FCoE Initiator service (the fcoemon daemon) and the Link Layer Discovery Protocol agent
daemon ( lldpad ) must be installed and running, and the FCoE connections must be enabled at
the FCoE-capable switch.
2. On the Services tab, view or modify the FCoE service and Lldpad (Link Layer Discovery
Protocol agent daemon) service start time as necessary.
FCoE Service Start: Specifies whether to start the Fibre Channel over Ethernet service
fcoemon daemon at server boot time or manually. The daemon controls the
FCoE interfaces and establishes a connection with the lldpad daemon. The values
are When Booting (default) or Manually.
Lldpad Service Start: Specifies whether to start the Link Layer Discovery Protocol
agent lldpad daemon at server boot time or manually. The lldpad daemon informs
the fcoemon daemon about the Data Center Bridging features and the configuration
of the FCoE interfaces. The values are When Booting (default) or Manually.
3. On the Interfaces tab, view information about all detected network adapters on the serv-
er, including information about VLAN and FCoE configuration. You can also create an
FCoE VLAN interface, change settings for an existing FCoE interface, or remove an FCoE
interface.
Use the FCoE VLAN Interface column to determine whether FCoE is available or not:
Interface Name
If a name is assigned to the interface, such as eth4.200 , FCoE is available on the
switch, and the FCoE interface is activated for the adapter.
Not Configured:
If the status is not configured, FCoE is enabled on the switch, but an FCoE interface
has not been activated for the adapter. Select the adapter, then click Create FCoE
VLAN Interface to activate the interface on the adapter.
Not Available:
If the status is not available, FCoE is not possible for the adapter because FCoE has
not been enabled for that connection on the switch.
4. To set up an FCoE-enabled adapter that has not yet been configured, select it and click
Create FCoE VLAN Interface. Confirm the query with Yes.
The adapter is now listed with an interface name in the FCoE VLAN Interface column.
5. To change the settings for an adapter that is already configured, select it from the list,
then click Change Settings.
FCoE Enable
Enable or disable the creation of FCoE instances for the adapter.
DCB Required
Specifies whether Data Center Bridging is required for the adapter (usually this is
the case).
Auto VLAN
Specifies whether the fcoemon daemon creates the VLAN interfaces automatically.
If you modify a setting, click Next to save and apply the change. The settings are written
to the /etc/fcoe/cfg-ethX file. The fcoemon daemon reads the configuration files for
each FCoE interface when it is initialized.
6. To remove an interface that is already configured, select it from the list. Click Remove
Interface and Continue to confirm. The FCoE Interface value changes to not configured.
7. On the Configuration tab, view or modify the general settings for the FCoE system ser-
vice. You can enable or disable debugging messages from the FCoE service script and the
fcoemon daemon and specify whether messages are sent to the system log.
3. For each Ethernet interface where FCoE offload is configured, run the following command:
The command creates a network interface if it does not exist and starts the Open-FCoE
initiator on the discovered FCoE VLAN.
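The exact command depends on your setup; a common sketch uses the fipvlan utility shipped with the Open-FCoE tools (the interface name eth4 is an assumption):
sudo fipvlan -c -s eth4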
The fcoeadm utility allows you to query the FCoE instances about the following:
Interfaces
Target LUNs
Port statistics
fcoeadm
[-c|--create] [<ethX>]
[-d|--destroy] [<ethX>]
[-r|--reset] [<ethX>]
[-S|--Scan] [<ethX>]
[-i|--interface] [<ethX>]
[-t|--target] [<ethX>]
[-l|--lun] [<ethX>]
[-s|--stats <ethX>] [<interval>]
[-v|--version]
[-h|--help]
Examples
fcoeadm -c eth2.101
Create an FCoE instance on eth2.101.
fcoeadm -d eth2.101
Destroy an FCoE instance on eth2.101.
fcoeadm -i eth3
Show information about all FCoE instances on interface eth3 . If no interface is specified,
information for all interfaces that have FCoE instances created will be shown. The follow-
ing example shows information on connection eth0.201:
fcoeadm -l eth3.101
Show detailed information about all LUNs discovered on connection eth3.101. If no con-
nection is specified, information about all LUNs discovered on all FCoE connections will
be shown.
fcoeadm -r eth2.101
Reset the FCoE instance on eth2.101.
fcoeadm -s eth3 3
Show statistical information about a specific eth3 port that has FCoE instances, at an in-
terval of three seconds. The statistics are displayed one line per time interval. If no interval
is given, the default of one second is used.
fcoeadm -t eth3
Show information about all discovered targets from a given eth3 port having FCoE in-
stances. After each discovered target, any associated LUNs are listed. If no instance is spec-
ified, targets from all ports that have FCoE instances are shown. The following example
shows information of targets from the eth0.201 connection:
For information about the Open-FCoE service daemon, see the fcoemon(8) man page.
For information about the Open-FCoE Administration tool, see the fcoeadm(8) man page.
For information about the Data Center Bridging Configuration tool, see the dcbtool(8)
man page.
For information about the Link Layer Discovery Protocol agent daemon, see the lldpad(8) man page.
This chapter describes how to set up an NVMe over Fabric host and target.
17.1 Overview
NVM Express (NVMe) is an interface standard for accessing non-volatile storage, commonly SSD
disks. NVMe supports much higher speeds and has a lower latency than SATA.
NVMe over Fabric is an architecture to access NVMe storage over different networking fabrics,
for example RDMA, TCP or NVMe over Fibre Channel (FC-NVMe). The role of NVMe over Fab-
ric is similar to iSCSI. To increase fault tolerance, NVMe over Fabric has built-in support
for multipathing. The NVMe over Fabric multipathing is not based on the traditional DM-Mul-
tipathing.
The NVMe host is the machine that connects to an NVMe target. The NVMe target is the machine
that shares its NVMe block devices.
NVMe is supported on SUSE Linux Enterprise Server 15 SP2. There are Kernel modules available
for the NVMe block storage and NVMe over Fabric target and host.
To see if your hardware requires any special consideration, refer to Section 17.4, “Special Hardware
Configuration”.
Use nvme --help to list all available subcommands. Man pages are available for nvme sub-
commands. Consult them by executing man nvme-SUBCOMMAND . For example, to view the man
page for the discover subcommand, execute man nvme-discover .
Replace TRANSPORT with the underlying transport medium: loop , rdma , tcp or fc . Replace
DISCOVERY_CONTROLLER_ADDRESS with the address of the discovery controller. For RDMA and
TCP this should be an IPv4 address. Replace SERVICE_ID with the transport service ID. If the
service is IP based, like RDMA or TCP, service ID specifies the port number. For Fibre Channel,
the service ID is not required.
The NVMe hosts only see the subsystems they are allowed to connect to.
Example:
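A hedged example for a TCP-based discovery controller (the address and port match the example target configured later in this chapter):
nvme discover -t tcp -a 10.0.0.3 -s 4420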
Replace TRANSPORT with the underlying transport medium: loop , rdma , tcp or fc . Replace
DISCOVERY_CONTROLLER_ADDRESS with the address of the discovery controller. For RDMA and
TCP this should be an IPv4 address. Replace SERVICE_ID with the transport service ID. If the
service is IP based, like RDMA or TCP, this specifies the port number. Replace SUBSYSTEM_NQN
with the NVMe qualified name of the desired subsystem as found by the discovery command.
NQN is the abbreviation for NVMe Qualified Name. The NQN must be unique.
Example:
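A hedged example, continuing the TCP case and using the subsystem NQN from the example target setup later in this chapter:
nvme connect -t tcp -a 10.0.0.3 -s 4420 -n nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82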
Alternatively, use nvme connect-all to connect to all discovered namespaces. For advanced
usage see man nvme-connect and man nvme-connect-all .
17.2.4 Multipathing
NVMe native multipathing is enabled by default. If the CMIC option in the controller identity
settings is set, the NVMe stack recognizes an NVMe drive as a multipathed device by default.
To manage the multipathing, you can use the following:
MANAGING MULTIPATHING
nvme list-subsys
Prints the layout of the multipath devices.
multipath -ll
The command has a compatibility mode and displays NVMe multipath devices. Bear in
mind that you need to enable the enable_foreign option to use the command. For details,
refer to Section 18.13, “Miscellaneous options”.
nvme-core.multipath=N
When the option is added as a boot parameter, the NVMe native multipathing will be
disabled.
iostat -p ALL
(nvmetcli)> cd ports
(nvmetcli)> create 1
(nvmetcli)> ls 1/
o- 1
o- referrals
(nvmetcli)> cd /subsystems
(nvmetcli)> create nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82
(nvmetcli)> cd nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82/
(nvmetcli)> ls
o- nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82
o- allowed_hosts
o- namespaces
(nvmetcli)> cd namespaces
(nvmetcli)> create 1
(nvmetcli)> cd 1
(nvmetcli)> set device path=/dev/nvme0n1
Parameter path is now '/dev/nvme0n1'.
(nvmetcli)> cd ..
(nvmetcli)> enable
The Namespace has been enabled.
(nvmetcli)> cd ..
(nvmetcli)> ls
o- nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82
o- allowed_hosts
o- namespaces
o- 1
7. Allow all hosts to use the subsystem. Only do this in secure environments.
(nvmetcli)> cd nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82/allowed_hosts/
(nvmetcli)> cd /
(nvmetcli)> ls
o- /
o- hosts
o- ports
| o- 1
| o- referrals
| o- subsystems
o- subsystems
o- nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82
o- allowed_hosts
o- namespaces
o- 1
9. Make the target available via TCP. Use trtype=rdma for RDMA:
(nvmetcli)> cd ports/1/
(nvmetcli)> set addr adrfam=ipv4 trtype=tcp traddr=10.0.0.3 trsvcid=4420
Parameter trtype is now 'tcp'.
Parameter adrfam is now 'ipv4'.
Parameter trsvcid is now '4420'.
Parameter traddr is now '10.0.0.3'.
(nvmetcli)> cd ports/1/
(nvmetcli)> set addr adrfam=fc trtype=fc traddr=nn-0x1000000044001123:pn-0x2000000055001123 trsvcid=none
(nvmetcli)> cd /ports/1/subsystems
(nvmetcli)> create nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82
Now you can verify that the port is enabled using dmesg :
# dmesg
...
[ 257.872084] nvmet_tcp: enabling port 1 (10.0.0.3:4420)
(nvmetcli)> clear
17.4.1 Overview
Some hardware needs special configuration to work correctly. Skim the titles of the following
sections to see if you are using any of the mentioned devices or vendors.
17.4.2 Broadcom
If you are using the Broadcom Emulex LightPulse Fibre Channel SCSI driver, add a Kernel config-
uration parameter on the target and host for the lpfc module:
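The parameter typically used for this is the lpfc module option lpfc_enable_fc4_type=3, which enables both FCP and NVMe over FC (verify the exact option against the lpfc documentation for your driver version). For example, as a kernel boot parameter:
lpfc.lpfc_enable_fc4_type=3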
Make sure that the Broadcom adapter firmware has at least version 11.4.204.33. Also make sure
that you have the current versions of nvme-cli , nvmetcli and the Kernel installed.
To enable a Fibre Channel port as an NVMe target, an additional module parameter needs to
be configured: lpfc_enable_nvmet= COMMA_SEPARATED_WWPNS . Enter the WWPN with a lead-
ing 0x , for example lpfc_enable_nvmet=0x2000000055001122,0x2000000055003344 . Only
listed WWPNs will be configured for target mode. A Fibre Channel port can either be configured
as target or as initiator.
Last, ensure that the latest versions available for SUSE Linux Enterprise Server of nvme-cli ,
QConvergeConsoleCLI , and the Kernel are installed. You may, for example, run
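One illustrative way to check for available package updates with zypper (the exact workflow depends on your update channels):
sudo zypper list-updates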
http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/ShowEula.aspx?resourceid=32769&docid=96728&ProductCategory=39&Product=1259&Os=126
http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/ShowEula.aspx?resourceid=32761&docid=96726&ProductCategory=39&Product=1261&Os=126
http://nvmexpress.org/
http://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabrics.pdf
https://storpool.com/blog/demystifying-what-is-nvmeof
This section describes how to manage failover and path load balancing for multiple paths be-
tween the servers and block storage devices by using Multipath I/O (MPIO).
Storage array
A hardware device with many disks and multiple fabrics connections (controllers) that
provides SAN storage to clients. Storage arrays typically have RAID and failover features
and support multipathing. Historically, active/passive (failover) and active/active (load-
balancing) storage array configurations were distinguished. These concepts still exist but
they are merely special cases of the concepts of path groups and access states supported
by modern hardware.
WWID
“World Wide Identifier”. multipath-tools uses the WWID to determine which low-level
devices should be assembled into a multipath map. The WWID must be distinguished from
the configurable map name (see Section 18.12, “Multipath device names and WWIDs”).
Device mapper
A framework in the Linux kernel for creating virtual block devices. I/O operations to
mapped devices are redirected to the underlying block devices. Device mappings may be
stacked. The device mapper implements its own event signaling, also known as “device
mapper events” or “dm events”.
initramfs
The initial RAM file system, also referred to as “initial RAM disk” (initrd) for historical rea-
sons (see Book “Administration Guide”, Chapter 12 “Introduction to the Boot Process”, Section 12.1
“Terminology”).
ALUA
“Asymmetric Logical Unit Access”, a concept introduced with the SCSI standard SCSI-3.
Storage volumes can be accessed via multiple ports, which are organized in port groups
with different states (active, standby, etc.). ALUA defines SCSI commands to query the port
groups and their states and change the state of a port group. Modern storage arrays that
support SCSI usually support ALUA, too.
18.3.1 Prerequisites
The storage array you use for the multipathed device must support multipathing. For more
information, see Section 18.2, “Hardware Support”.
You need to configure multipathing only if multiple physical paths exist between host bus
adapters in the server and host bus controllers for the block storage device.
For some storage arrays, the vendor provides its own multipathing software to manage
multipathing for the array’s physical and logical devices. In this case, you should follow
the vendor’s instructions for configuring multipathing for those devices.
The root file system is on a multipath device. This is typically the case for diskless servers that
use SAN storage exclusively. On such systems, multipath support is required for booting, and
multipathing must be enabled in the initramfs.
The root file system (and possibly some other file systems) is on local storage, for example, on
a directly attached SATA disk or local RAID, but the system additionally uses file systems in the
multipath SAN storage. This system type can be configured in three different ways:
The Distributed Replicated Block Device (DRBD) high-availability solution for mirroring devices
across a LAN runs on top of multipathing. For each device that has multiple I/O paths and that
you plan to use in a DRBD solution, you must configure the device for multipathing before you
configure DRBD.
Special care must be taken when using multipathing together with clustering software that relies
on shared storage for fencing, such as pacemaker with sbd . See Section 18.9.2, “Queuing policy
on clustered servers” for details.
If you select “No” at this prompt (not recommended), the installation will proceed as in Sec-
tion 18.4.1, “Installing without connected multipath devices”. In the partitioning stage, do not use/edit
devices that will later be part of a multipath map.
If you select “Yes” at the multipath prompt, multipathd will run during the installation. No de-
vice will be added to the blacklist section of /etc/multipath.conf , thus all SCSI and DASD
devices, including local disks, will appear as multipath devices in the partitioning dialogs. After
installation, all SCSI and DASD devices will be multipath devices, as described in Section 18.3.2.1,
“Root file system on multipath (SAN-boot)”.
PROCEDURE 18.1: DISABLING MULTIPATHING FOR THE ROOT DISK AFTER INSTALLATION
This procedure assumes that you installed on a local disk and enabled multipathing during
installation, so that the root device is on multipath now, but you prefer to set up the
system as described in Local disk is excluded from multipath in Section 18.3.2.2, “Root file system
on a local disk”.
1. Check your system for /dev/mapper/... references to your local root device, and replace
them with references that will still work if the device is not a multipath map anymore (see
Section 18.12.4, “Referring to multipath maps”). If the following command finds no references,
you do not need to apply changes:
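An illustrative check over /etc (extend the search to other configuration locations, such as the boot loader configuration, as needed):
grep -rl /dev/mapper/ /etc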
2. Switch to by-uuid persistent device policy for dracut (see Section 18.7.4.2, “Persistent
device names in the initramfs”):
This command prints all path devices with their WWIDs and vendor/product information.
You will be able to identify the root device (here, the ServeRAID device) and note the
WWID.
4. Create a blacklist entry in /etc/multipath.conf (see Section 18.11.1, “The blacklist sec-
tion in multipath.conf”) with the WWID you just determined (do not apply these settings
just yet):
blacklist {
wwid 3600605b009e7ed501f0e45370aaeb77f
}
The offline update of your system is similar to the fresh installation as described in Section 18.4,
“Installing SUSE Linux Enterprise Server on multipath systems”. There is no blacklist , thus if the
user selects to enable multipath, the root device will appear as a multipath device, even if it is
normally not. When dracut builds the initramfs during the update procedure, it sees a different
storage stack than it would see on the booted system. See Section 18.7.4.2, “Persistent device names
in the initramfs” and Section 18.12.4, “Referring to multipath maps”.
multipathd
The daemon to set up and monitor multipath maps, and a command-line client to commu-
nicate with the daemon process. See Section 18.6.2, “The multipathd daemon”.
multipath
The command-line tool for multipath operations. See Section 18.6.3, “The multipath com-
mand”.
kpartx
The command-line tool for managing “partitions” on multipath devices. See Section 18.7.3,
“Partitions on multipath devices and kpartx”.
mpathpersist
The command-line tool for managing SCSI persistent reservations. See Section 18.6.4, “SCSI
persistent reservations and mpathpersist”.
Distributing load over multiple paths inside the active path group.
Noticing I/O errors on path devices, and marking these as failed, so that no I/O will be
sent to them.
Switching path groups when all paths in the active path group have failed.
Either failing or queuing I/O on the multipath device if all paths have failed, depending
on configuration.
The following tasks are handled by the user-space components in the multipath-tools pack-
age, not by the Device Mapper Multipath module:
Discovering devices representing different paths to the same storage device and assembling
multipath maps from them.
The Device Mapper Multipath module does not provide an easy-to-use user interface for
setup and configuration.
For details about the components from the multipath-tools package, refer to Section 18.6.2,
“The multipathd daemon”.
On startup, detects path devices and sets up multipath maps from detected devices.
Monitors uevents and device mapper events, adding or removing path mappings to multi-
path maps as necessary and initiating failover or failback operations.
Sets up new maps on the fly when new path devices are discovered.
Checks path devices at regular intervals to detect failure, and tests failed paths to reinstate
them if they become operational again.
When all paths fail, multipathd either fails the map, or switches the map device to queu-
ing mode for a given time interval.
Handles path state changes and switches path groups or regroups paths, as necessary.
Tests paths for “marginal” state, i.e. shaky fabric conditions that cause path state flipping
between operational and non-operational.
Handles SCSI persistent reservation keys for path devices if configured. See Section 18.6.4,
“SCSI persistent reservations and mpathpersist”.
A single command can be passed to multipathd as an argument on the command line. There is
also an interactive mode that allows sending multiple subsequent commands (a short usage sketch
follows the command list below):
show topology
Shows the current map topology and properties. See Section 18.14.2, “Interpreting multipath
I/O status”.
show paths
Shows the currently known path devices.
show maps
Shows the currently configured map devices.
reconfigure
Rereads configuration files, rescans devices, and sets up maps again. This is basically equiv-
alent to a restart of multipathd . A few options cannot be modified without a restart.
They are mentioned in the man page multipath.conf(5) . The reconfigure command
reloads only map devices that have changed in some way.
Additional commands are available to modify path states, enable or disable queuing, and more.
See multipathd(8) for details.
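For illustration, a single command can be passed on the command line, or the interactive client can be started with -k (a sketch; see multipathd(8) for the full command set):
sudo multipathd show topology
sudo multipathd -k
multipathd> show paths
multipathd> exit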
multipath
Detects path devices and configures all multipath maps that it finds.
multipath -d
Similar to multipath , but does not set up any maps (“dry run”).
multipath DEVICENAME
Configures a specific multipath device. DEVICENAME can denote a member path device by
its device node name ( /dev/sdb ) or device number in major:minor format. Alternative-
ly, it can be the WWID or name of a multipath map.
multipath -f DEVICENAME
Unconfigures ("flushes") a multipath map and its partition mappings. The command will
fail if the map or one of its partitions is in use. See above for possible values of DEVICENAME .
multipath -F
Unconfigures ("flushes") all multipath maps and their partition mappings. The command
will fail for maps in use.
multipath -ll
Displays the status and topology of all currently configured multipath devices. See Sec-
tion 18.14.2, “Interpreting multipath I/O status”.
multipath -T
Has a similar function as the multipath -t command but shows only hardware entries
matching the hardware detected on the host.
The option -v controls the verbosity of the output. The provided value overrides the verbosity
option in /etc/multipath.conf . See Section 18.13, “Miscellaneous options”.
multipaths {
multipath {
wwid 3600140508dbcf02acb448188d73ec97d
alias yellow
reservation_key 0x123abc
}
}
After setting the reservation_key parameter for all mpath devices applicable for persistent
management, reload the configuration using multipathd reconfigure .
Use the command mpathpersist to query and set persistent reservations for multipath maps
consisting of SCSI devices. Refer to the manual page mpathpersist(8) for details. The com-
mand-line options are the same as those of the sg_persist from the sg3_utils package. The
sg_persist(8) manual page explains the semantics of the options in detail.
In the following examples, DEVICE denotes a device mapper multipath device like /dev/map-
per/mpatha . The commands below are listed with long options for better readability. All op-
tions have single-letter replacements, like in mpathpersist -oGS 123abc DEVICE .
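Two illustrative invocations, registering a key and then reading back the registered keys (the key 123abc and the map name mpatha are the examples used in this section):
sudo mpathpersist --out --register --param-sark=123abc /dev/mapper/mpatha
sudo mpathpersist --in --read-keys /dev/mapper/mpatha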
In most situations, restarting the service is not necessary. To simply have multipathd reload
its configuration, run:
Stopping the service does not remove existing multipath maps. To remove unused maps, run:
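For example, the reload and the removal of unused maps mentioned above can be done as follows:
sudo multipathd reconfigure
sudo multipath -F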
To disable multipathing just for a single system boot, use the kernel parameter
multipath=off . This affects both the booted system and the initramfs, which does not need
to be rebuilt in this case.
(Whenever you disable or enable the multipath services, rebuild the initramfs . See
Section 18.7.4, “Keeping the initramfs synchronized”.)
If you want to make sure multipath devices do not get set up, even when running mul-
tipath manually, add the following lines at the end of /etc/multipath.conf before
rebuilding the initramfs:
blacklist {
wwid .*
}
Configure permissions for host LUNs on the storage arrays with the vendor’s tools.
If SUSE Linux Enterprise Server ships no driver for the host bus adapter (HBA), install a
Linux driver from the HBA vendor. See the vendor’s specific instructions for more details.
If multipath devices are detected and multipathd.service is enabled, multipath maps should
be created automatically. If this does not happen, Section 18.15.3, “Troubleshooting steps in emer-
gency mode” lists some shell commands that can be used to examine the situation. When the
LUNs are not seen by the HBA driver, check the zoning setup in the SAN. In particular, check
whether LUN masking is active and whether the LUNs are correctly assigned to the server.
If the HBA driver can see LUNs, but no corresponding block devices are created, additional
kernel parameters may be needed. See TID 3955167: Troubleshooting SCSI (LUN) Scanning Issues
in the SUSE Knowledge base at https://www.suse.com/support/kb/doc.php?id=3955167 .
Partition tables and partitions on multipath devices can be manipulated as usual, using YaST or
tools like fdisk or parted . Changes applied to the partition table will be noted by the system
when the partitioning tool exits. If this does not work (usually because a device is busy), try
multipathd reconfigure , or reboot the system.
Important
Make sure that the initial RAM file system (initramfs) and the booted system behave
consistently regarding the use of multipathing for all block devices. Rebuild the initramfs
after applying multipath configuration changes.
If multipathing is enabled in the system, it also needs to be enabled in the initramfs and
vice versa. The only exception to this rule is the option Multipath disabled in the initramfs in
Section 18.3.2.2, “Root file system on a local disk”.
The multipath configuration must be synchronized between the booted system and the initramfs.
Therefore, if you change any of the files /etc/multipath.conf , /etc/multipath/wwids ,
/etc/multipath/bindings , or other configuration files, or udev rules related to device identi-
fication, rebuild initramfs using the command:
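A typical rebuild invocation (add any further dracut options your setup requires):
sudo dracut --force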
When multipathing is enabled, dracut refers to storage devices in the initramfs by their device-mapper
names, such as /dev/mapper/3600a098000aad73f00000a3f5a275dc8-part1 , by default. This is
good if the system always runs in multipath mode. But if
the system is started without multipathing, as described in Section 18.7.4.1, “Enabling or disabling
multipathing in the initramfs”, booting with such an initramfs will fail, because the /dev/mapper
devices will not exist. See Section 18.12.4, “Referring to multipath maps” for another possible prob-
lem scenario, and some background information.
To prevent this from happening, change dracut 's persistent device naming policy by using the
--persistent-policy option. We recommend setting the by-uuid use policy:
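For example (a sketch; --force rebuilds the existing initramfs):
sudo dracut --force --persistent-policy=by-uuid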
See also Procedure 18.1, “Disabling multipathing for the root disk after installation” and Section 18.15.2,
“Understanding device referencing issues”.
multipath-tools has built-in defaults for many storage arrays that are derived from
the published vendor recommendations. Run multipath -T to see the current settings
for your devices and compare them to vendor recommendations.
multipath -T >/etc/multipath.conf
The hash ( # ) and exclamation mark ( ! ) characters cause the rest of the line to be discarded
as a comment.
Sections and subsections are started with a section name and an opening brace ( { ) on the
same line, and end with a closing brace ( } ) on a line on its own.
Options and values are written on one line. Line continuations are unsupported.
Options and section names must be keywords. The allowed keywords are documented in
multipath.conf(5) .
Values may be enclosed in double quotes ( " ). They must be enclosed in quotes if they
contain white space or comment characters. A double quote character inside a value is
represented by a pair of double quotes ( "" ).
The values of some options are POSIX regular expressions (see regex(7) ). They are case
sensitive and not anchored, so “ bar ” matches “ rhabarber ”, but not “Barbie”.
section {
subsection {
option1 value
option2 "complex value!"
option3 "value with ""quoted"" word"
} ! subsection end
} # section end
After /etc/multipath.conf , the tools read files matching the pattern /etc/multipath.conf.d/*.conf .
The additional files follow the same syntax rules as /etc/multipath.conf . Sections and
options can occur multiple times. If the same option in the same section is set in multiple
files, or on multiple lines in the same file, the last value takes precedence. Separate precedence
rules apply between multipath.conf sections, see below.
defaults
General default settings.
blacklist
Lists devices to ignore. See Section 18.11.1, “The blacklist section in multipath.conf”.
blacklist_exceptions
Lists devices to be multipathed even though they are matched by the blacklist. See Sec-
tion 18.11.1, “The blacklist section in multipath.conf”.
devices
Settings specific to the storage controller. This section is a collection of device subsec-
tions. Values in this section override values for the same options in the defaults section,
and the built-in settings of multipath-tools .
device entries in the devices section are matched against the vendor and product of a
device using regular expressions. These entries will be “merged”, setting all options from
matching sections for the device. If the same option is set in multiple matching device
sections, the last device entry takes precedence, even if it is less “specific” than preceding
entries. This applies also if the matching entries appear in different configuration les (see
Section 18.8.2.1, “Additional configuration files and precedence rules”). In the following example,
a device SOMECORP STORAGE will use fast_io_fail_tmo 15 .
devices {
    device {
        vendor SOMECORP
        product STOR
        fast_io_fail_tmo 10
    }
    device {
        vendor SOMECORP
        product STORAGE
        fast_io_fail_tmo 15
    }
}
multipaths
Settings for individual multipath devices. This section is a list of multipath subsections.
Values override the defaults and devices sections.
overrides
Settings that override values from all other sections.
Do not forget to synchronize with the configuration in the initramfs. See Section 18.7.4, “Keeping
the initramfs synchronized”.
This command shows new maps to be created with the proposed topology, but not
whether maps will be removed/flushed. To obtain more information, run with increased
verbosity:
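For example, combining the dry-run option of the multipath command described earlier with a higher verbosity level:
sudo multipath -d -v 3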
polling_interval
The time interval (in seconds) between health checks for path devices. The default is 5
seconds. Failed devices are checked at this time interval. For healthy devices, the time
interval may be increased up to max_polling_interval seconds.
detect_checker
If this is set to yes (default, recommended), multipathd automatically detects the best
path checking algorithm.
path_checker
The algorithm used to check path state. If you need to enable the checker, disable
detect_checker as follows:
defaults {
detect_checker no
}
The following list contains only the most important algorithms. See multipath.conf(5)
for the full list.
tur
Send TEST UNIT READY command. This is the default for SCSI devices with ALUA
support.
directio
Read a device sector using asynchronous I/O (aio).
rdac
Device-specific checker for NetApp E-Series and similar arrays.
none
No path checking is performed.
checker_timeout
If a device does not respond to a path checker command in the given time, it is consid-
ered failed. The default is the kernel's SCSI command timeout for the device (usually 30
seconds).
fast_io_fail_tmo
If an error on the SCSI transport layer is detected (for example on a Fibre Channel remote
port), the kernel transport layer waits for this amount of time (in seconds) for the transport
to recover. After that, the path device fails with “transport offline” state. This is very
useful for multipath, because it allows a quick path failover for a frequently occurring class
of errors. The value must match typical time scale for reconfiguration in the fabric. The
default value of 5 seconds works well for Fibre Channel. Other transports, like iSCSI, may
require longer timeouts.
dev_loss_tmo
If a SCSI transport endpoint (for example a Fibre Channel remote port) is not reachable any
more, the kernel waits for this amount of time (in seconds) for the port to reappear until
it removes the SCSI device node for good. Device node removal is a complex operation
which is prone to race conditions or deadlocks and should best be avoided. We therefore
recommend setting this to a high value. The special value infinity is supported. The
default is 10 minutes. To avoid deadlock situations, multipathd ensures that I/O queuing
(see no_path_retry ) is stopped before dev_loss_tmo expires.
no_path_retry
Determine what happens if all paths of a given multipath map have failed. The possible
values are:
fail
Fail I/O on the multipath map. This will cause I/O errors in upper layers such as
mounted file systems. The affected file systems, and possibly the entire host, will
enter degraded mode.
queue
I/O on the multipath map is queued in the device mapper layer and sent to the device
when path devices become available again. This is the safest option to avoid losing
data, but it can have negative effects if the path devices do not get reinstated for
a long time. Processes reading from the device will hang in uninterruptible sleep
( D ) state. Queued data occupies memory, which becomes unavailable for processes.
Eventually, memory will be exhausted.
N
N is a positive integer. Keep the map device in queuing mode for N polling intervals.
When the time elapses, multipathd fails the map device. If polling_interval is
5 seconds and no_path_retry is 6, multipathd will queue I/O for approximately
6 * 5s = 30s before failing I/O on the map device.
flush_on_last_del
If set to yes and all path devices of a map are deleted (as opposed to just failed), fail all
I/O in the map before removing it. The default is no .
deferred_remove
If set to yes and all path devices of a map are deleted, wait for holders to close the file
descriptors for the map device before flushing and removing it. If paths reappear before the
last holder closed the map, the deferred remove operation will be cancelled. The default
is no .
failback
If a failed path device in an inactive path group recovers, multipathd reevaluates the
path group priorities of all path groups (see Section 18.10, “Configuring path grouping and pri-
orities”). After the reevaluation, the highest-priority path group may be one of the currently
inactive path groups. This parameter determines what happens in this situation.
manual
Nothing happens unless the administrator runs a multipathd switchgroup (see
Section 18.6.2, “The multipathd daemon”).
immediate
The highest-priority path group is activated immediately. This is often beneficial for
performance, especially on stand-alone servers, but it should not be used for arrays
on which the change of the path group is a costly operation.
followover
Like immediate , but only perform failback when the path that has just become active
is the only healthy path in its path group. This is useful for cluster configurations: It
keeps a node from automatically failing back when another node requested a failover
before.
N
N is a positive integer. Wait for N polling intervals before activating the highest
priority path group. If the priorities change again during this time, the wait period
starts anew.
eh_deadline
Set an approximate upper limit for the time (in seconds) spent in SCSI error handling if
devices are unresponsive and SCSI commands time out without error response. When the
deadline has elapsed, the kernel will perform a full HBA reset.
After modifying the /etc/multipath.conf le, apply your settings as described in Sec-
tion 18.8.4, “Applying multipath.conf modifications”.
path_grouping_policy
Specifies the method used to combine paths into groups. Only the most important policies
are listed here; see multipath.conf(5) for other less frequently used values.
failover
One path per path group. This setting is useful for traditional “active/passive” storage
arrays.
multibus
All paths in one path group. This is useful for traditional “active/active” arrays.
group_by_prio
Path devices with the same path priority are grouped together. This option is useful
for modern arrays that support asymmetric access states, like ALUA. Combined with
the alua or sysfs priority algorithms, the priority groups set up by multipathd
will match the primary target port groups that the storage array reports through
ALUA-related SCSI commands.
Using the same policy names, the path grouping policy for a multipath map can be changed
temporarily with the command:
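For example, assuming the -p option of the multipath command (which forces the given path grouping policy when maps are set up), with MAPNAME as a placeholder for the map name or WWID:
# multipath -p group_by_prio MAPNAME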
marginal_pathgroups
If set to on or fpin , “marginal” path devices are sorted into a separate path group. This is
independent of the path grouping algorithm in use. See Section 18.13.1, “Handling unreliable
(“marginal”) path devices”.
prio
Determines the method to derive priorities for path devices. If you override this, disable
detect_prio as follows:
defaults {
detect_prio no
}
The following list contains only the most important methods. Several other methods are
available, mainly to support legacy hardware. See multipath.conf(5) for the full list.
alua
Uses SCSI-3 ALUA access states to derive path priority values. The optional
exclusive_pref_bit argument can be used to change the behavior for devices that have
the ALUA “preferred primary target port group” (PREF) bit set:
prio alua
prio_args exclusive_pref_bit
If this option is set, “preferred” paths get a priority bonus over other active/optimized
paths. Otherwise, all active/optimized paths are assigned the same priority.
sysfs
Like alua , but instead of sending SCSI commands to the device, it obtains the access
states from sysfs . This causes less I/O load than alua , but is not suitable for all
storage arrays with ALUA support.
const
Uses a constant value for all paths.
path_latency
Measures I/O latency (time from I/O submission to completion) on path devices, and
assigns higher priority to devices with lower latency. See multipath.conf(5) for
details. This algorithm is still experimental.
weightedpath
Assigns priorities to path devices according to a weighting rule supplied in prio_args . For example:
prio weightedpath
prio_args "hbtl 2:.*:.*:.* 10 hbtl 3:.*:.*:.* 20 hbtl .* 1"
This assigns devices on SCSI host 3 a higher priority than devices on SCSI host 2,
and all others a lower priority.
prio_args
Some prio algorithms require extra arguments. These are specified in this option, with
syntax depending on the algorithm. See above.
hardware_handler
The name of a kernel module that the kernel uses to activate path devices when switching
path groups. This option has no effect with recent kernels because hardware handlers are
autodetected. See Section 18.2.3, “Storage Arrays that Require Specific Hardware Handlers”.
path_selector
The name of a kernel module that is used for load balancing between the paths of the
active path group. The available choices depend on the kernel configuration. For historical
reasons, the name must always be enclosed in quotes and followed by a “0” in
multipath.conf , like this:
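For example (using the default service-time selector):
defaults {
  path_selector "service-time 0"
}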
service-time
Estimates the time pending I/O will need to complete on all paths, and selects the
path with the lowest value. This is the default.
historical-service-time
Estimates future service time based on the historical service time (about which it
keeps a moving average) and the number of outstanding requests. Estimates the time
pending I/O will need to complete on all paths, and selects the path with the lowest
value.
queue-length
Selects the path with the lowest number of currently pending I/O requests.
io-affinity
This path selector currently does not work with multipath-tools .
After modifying the /etc/multipath.conf file, apply your settings as described in Section 18.8.4, “Applying multipath.conf modifications”.
blacklist {
  device { 2
    vendor ATA
    product .*
  }
  protocol scsi:sas 3
  property SCSI_IDENT_LUN_T10 4
  devnode "!^dasd[a-z]*" 5
}
1 wwid entries are ideal for excluding specific devices, for example, the root disk.
2 This device section excludes all ATA devices (the regular expression for product matches
anything).
3 Excluding by protocol allows excluding devices using certain bus types, here SAS. Other
common protocol values are scsi:fcp , scsi:iscsi , and ccw . See multipath.conf(5)
for more. To see the protocols that paths in your systems are using, run:
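One possible command, assuming a multipath-tools version whose path format wildcards include the protocol (see multipathd(8) for the wildcards your version supports):
# multipathd show paths format "%d %P"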
By default, multipath-tools ignores all devices except SCSI, DASD, or NVMe. Technically,
the built-in devnode exclude list is this negated regular expression:
devnode !^(sd[a-z]|dasd[a-z]|nvme[0-9])
The following example excludes all devices except those from a specific vendor:
blacklist {
  wwid .*
}
blacklist_exceptions {
  device {
    vendor ^NETAPP$
    product .*
  }
}
The blacklist_exceptions section supports all methods described for the blacklist section
above.
The property directive in blacklist_exceptions is mandatory because every device must
have at least one of the “allowed” udev properties to be considered a path device for multipath
(the value of the property does not matter). The built-in default for property is
property (SCSI_IDENT_|ID_WWN)
Only devices that have at least one udev property matching this regular expression will be
included.
find_multipaths
This option controls the behavior of multipath and multipathd when a device that is
not excluded is first encountered. The possible values are:
greedy
Every device that is not excluded by the blacklist section in /etc/multipath.conf is included.
This is the default on SUSE Linux Enterprise. If this setting is active, the only way to
prevent devices from being added to multipath maps is to exclude them.
strict
Devices are included only if their WWID is already listed in the file /etc/multipath/wwids .
yes
Devices are included if they meet the conditions for strict , or if at least one other
device with the same WWID exists in the system.
smart
If a new WWID is first encountered, it is temporarily marked as a multipath path device.
multipathd waits for some time for additional paths with the same WWID to
appear. If this happens, the multipath map is set up as usual. Otherwise, when the
timeout expires, the single device is released to the system as a non-multipath device.
The timeout is configurable with the option find_multipaths_timeout .
This option depends on systemd features which are only available on SUSE Linux
Enterprise Server 15.
allow_usb_devices
If this option is set to yes , USB storage devices are considered for multipathing. The
default is no .
ID_SERIAL for SCSI devices (do not confuse this with the device's “serial number”)
multipaths {
multipath {
wwid 3600a098000aad1e3000064e45f2c2355
alias postgres
}
}
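For example, the resulting map can then be inspected by its alias ( postgres is the example alias configured above):
# multipath -ll postgres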
Aliases are expressive, but they need to be assigned to each map individually, which may be
cumbersome on large systems.
Map names are only useful if they are persistent. multipath-tools keeps track of the assigned
names in the file /etc/multipath/bindings (the “bindings file”). When a new map is created,
the WWID is first looked up in this file. If it is not found, the lowest available user-friendly name
is assigned to it.
user_friendly_names
If set to yes , user-friendly names are assigned and used. Otherwise, the WWID is used as
a map name unless an alias is configured.
alias_prefix
The prefix used to create user-friendly names, mpath by default.
1 5 These two links use the map name to refer to the map. Thus, the links will change if the
map name changes, for example, if you enable or disable user-friendly names.
2 This link uses the device mapper UUID, which is the WWID used by multipath-tools
prefixed by the string dm-uuid-mpath- . It is independent of the map name.
The device mapper UUID is the preferred form to ensure that only multipath devices are ref-
erenced. For example, the following line in /etc/lvm/lvm.conf rejects all devices except
multipath maps:
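A commonly used filter expression of this kind (shown here as a sketch only; adapt it to your setup before use) is:
filter = [ "a|/dev/disk/by-id/dm-uuid-mpath-.*|", "r|.*|" ]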
3 4 These are links that would normally point to path devices. The multipath device took them
over, because it has a higher udev link priority (see udev(7) ). When the map is destroyed
or multipathing is turned off, they will still exist and point to one of the path devices instead.
This provides a means to reference a device by its WWID, whether or not multipathing
is active.
For partitions on multipath maps created by the kpartx tool, there are similar symbolic links,
derived from the parent device name or WWID and the partition number:
Note that partitions often have by-uuid links, too, referring not to the device itself but to the
file system it contains. These links are often preferable. They are invariant even if the file system
is copied to a different device or partition.
verbosity
Controls the log verbosity of both multipath and multipathd . The command-line option
-v overrides this setting for both commands. The value can be between 0 (only fatal
errors) and 4 (verbose logging). The default is 2.
uid_attrs
This option enables an optimization for processing udev events, so-called “uevent merging”.
It is useful in environments in which hundreds of path devices may fail or reappear
simultaneously. In order to make sure that path WWIDs do not change (see Section 18.12.1,
“WWIDs and device Identification”), the value should be set exactly like this:
defaults {
uid_attrs "sd:ID_SERIAL dasd:ID_UID nvme:ID_WWN"
}
skip_kpartx
If set to yes for a multipath device (default is no ), do not create partition devices on top
of the given device (see Section 18.7.3, “Partitions on multipath devices and kpartx”). Useful
for multipath devices used by virtual machines. Previous SUSE Linux Enterprise Server
releases achieved the same effect with the parameter “ features 1 no_partitions ”.
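A minimal sketch of such a per-map setting, reusing the example WWID from the multipaths section above:
multipaths {
  multipath {
    wwid 3600a098000aad1e3000064e45f2c2355
    skip_kpartx yes
  }
}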
max_sectors_kb
Limits the maximum amount of data sent in a single I/O request for all path devices of
the multipath map.
recheck_wwid
If set to yes (default is no ), double-checks the WWID of restored paths after failure, and
removes them if the WWID has changed. This is a safety measure against data corruption.
enable_foreign
multipath-tools provides a plugin API for multipathing backends other than Device
Mapper multipath. The API supports monitoring and displaying information about the
multipath topology using standard commands like multipath -ll . Modifying the
topology is unsupported.
The value of enable_foreign is a regular expression to match against foreign library
names. The default value is “ NONE ”.
SUSE Linux Enterprise Server ships the nvme plugin, which adds support for the native
NVMe multipathing (see Section 18.2.1, “Multipath implementations: device mapper and NVMe”).
To enable the nvme plugin, set
defaults {
enable_foreign nvme
}
This algorithm is used if all four numeric marginal_path_ parameters are set to a pos-
itive value, and marginal_pathgroups is not set to fpin . It is available since SUSE
Linux Enterprise Server 15 SP1 and SUSE Linux Enterprise Server 12 SP5.
marginal_path_double_failed_time
Maximum time (in seconds) between two path failures that triggers path monitoring.
marginal_path_err_sample_time
Length (in seconds) of the path monitoring interval.
marginal_path_err_rate_threshold
Minimum error rate (per thousand I/Os).
marginal_path_err_recheck_gap_time
Time (in seconds) to keep the path in marginal state.
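A sketch of a configuration that enables this algorithm (the values are placeholders and must be tuned for your environment):
defaults {
  marginal_path_double_failed_time 60
  marginal_path_err_sample_time 120
  marginal_path_err_rate_threshold 10
  marginal_path_err_recheck_gap_time 300
}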
defaults {
  deferred_remove yes
  find_multipaths smart
}
After modifying the /etc/multipath.conf file, apply your settings as described in Section 18.8.4, “Applying multipath.conf modifications”.
The following settings in /etc/lvm/lvm.conf ensure that LVM2 detects multipath components by their udev properties and does not act on the individual path devices:
multipath_component_detection=1
external_device_info_source="udev"
It is also possible (although normally not necessary) to create a filter expression for LVM2 to
ignore all devices except multipath devices. See Section 18.12.4, “Referring to multipath maps”.
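Queuing can be disabled for a single map through multipathd's interactive command interface, for example (assuming the disablequeueing command of multipathd ):
# multipathd disablequeueing map MAPNAME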
Replace MAPNAME with the correct WWID or mapped alias name for the device.
This command immediately causes all queued I/O to fail and propagates the error to the
calling application. File systems will observe I/O errors and switch to read-only mode.
-a
This option ensures that all SCSI targets are scanned; otherwise, only already existing targets
are scanned for new LUNs.
-r
This option enables the removal of devices that have been removed on the storage side.
--hosts
This option specifies the list of host bus adapters to scan (the default is to scan all).
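Assuming these options belong to the rescan-scsi-bus.sh script (shipped in the sg3_utils package), a typical invocation that scans for new LUNs and removes deleted ones would be:
# rescan-scsi-bus.sh -a -r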
SYSTEMD_READY=0
DM_MULTIPATH_DEVICE_PATH=1
Such references can appear in various places, notably in /etc/fstab and /etc/crypttab , in
the initramfs, or even on the kernel command line.
The safest way to circumvent this problem is to avoid using the kind of device references that
are not persistent between boots or depend on system configuration. We generally recommend
referring to file systems (and similar entities like swap space) by properties of the file system
itself (like UUID or label) rather than the containing device. If such references are not available
and device references are required, for example, in /etc/crypttab , the options should be
evaluated carefully. For example, in Section 18.12.4, “Referring to multipath maps”, the best option
might be the /dev/disk/by-id/wwn- link because it would also work with multipath=off .
Is my root device configured as a multipath device? If not, is the root device properly
excluded from multipath as described in Section 18.11.1, “The blacklist section in multipath.conf”,
or are you relying on the absence of the multipath module in the initramfs
(see Section 18.3.2.2, “Root file system on a local disk”)?
Does the system enter emergency mode before or after switching to the real root file
system?
If you are unsure with respect to the last question, here is a sample dracut emergency prompt
as it would be printed before switching root:
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
The mention of rdsosreport.txt is a clear indication that the system is still running from
the initramfs. If you are still uncertain, log in and check for the existence of the file
/etc/initrd-release . This file exists only in an initramfs environment.
If emergency mode is entered after switching root, the emergency prompt looks similar, but
rdsosreport.txt is not mentioned:
1. Try to figure out what failed by examining failed systemd units and the journal.
# systemctl --failed
# journalctl -b -o short-monotonic
When looking at the journal, determine the first failed unit. When you have found the first
failure, examine the messages before and around that point in time very carefully. Are
there any warnings or other suspicious messages?
Watch out for the root switch (" Switching root. ") and for messages about SCSI devices,
device mapper, multipath, and LVM2. Look for systemd messages about devices and file
systems (" Found device …", " Mounting …", " Mounted …").
2. Examine the existing devices, both low-level devices and device mapper devices (note that
some of the commands below may not be available in the initramfs):
# cat /proc/partitions
# ls -l /sys/class/block
# ls -l /dev/disk/by-id/* /dev/mapper/*
# dmsetup ls --tree
# lsblk
# lsscsi
From the output of the commands above, you should get an idea whether the low-level
devices were successfully probed, and whether any multipath maps and multipath parti-
tions were set up.
3. If the device mapper multipath setup is not as you expect, examine the udev properties,
in particular, SYSTEMD_READY (see above):
# udevadm info -e
4. If the previous step showed unexpected udev properties, something may have gone wrong
during udev rule processing. Check other properties, in particular, those used for device
identification (see Section 18.12.1, “WWIDs and device Identification”). If the udev properties
are correct, check the journal for multipathd messages again. Look for " Device or
resource busy " messages.
5. Try to activate the missing devices or mount the required file systems manually, for example:
# mount /var
# swapon -a
# vgchange -a y
Mostly, the manual activation will succeed and allow you to proceed with the system boot (usually
by simply logging out from the emergency shell) and examine the situation further in the
booted system.
If manual activation fails, you will probably see error messages that provide clues about
what is going wrong. You can also try the commands again with increased verbosity.
6. At this point, you should have some idea what went wrong (if not, contact SUSE support
and be prepared to answer most of the questions raised above).
You should be able to correct the situation with a few shell commands, exit the emergency
shell, and boot successfully. You will still need to adjust your configuration to make sure
the same problem will not occur again in the future.
Otherwise, you will need to boot the rescue system, set up the devices manually to chroot
into the real root file system, and attempt to fix the problem based on the insight you got
in the previous steps. Be aware that in this situation, the storage stack for the root file
system may differ from normal. Depending on your setup, you may have to force the addition
or omission of dracut modules when building a new initramfs. See also Section 18.7.4.1,
“Enabling or disabling multipathing in the initramfs”.
7. If the problem occurs frequently or even on every boot attempt, try booting with increased
verbosity in order to get more information about the failure. The following kernel
parameters, or a combination of them, are often helpful:
udev.log-priority=debug 1
systemd.log_level=debug 2
scsi_mod.scsi_logging_level=020400 3
rd.debug 4
HOWTO: Add, Resize and Remove LUN without restarting SLES (https://www.suse.com/support/kb/
doc/?id=000017762)
nfs4-getfacl
nfs4-setfacl
nfs4-editacl
These operate in a generally similar way to getfacl and setfacl for examining and modifying
NFSv4 ACLs. These commands are effective only if the file system on the NFS server provides
full support for NFSv4 ACLs. Any limitation imposed by the server will affect programs running
on the client in that some particular combinations of Access Control Entries (ACEs) might not
be possible.
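As a usage sketch (depending on packaging, the binaries may be named with underscores, for example nfs4_getfacl and nfs4_setfacl ; the path and principal below are placeholders), displaying and extending an ACL on an NFSv4 mount could look like this:
# nfs4_getfacl /path/to/file
# nfs4_setfacl -a A::user@example.com:rw /path/to/file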
It is not supported to mount NFS volumes locally on the exporting NFS server.
Additional Information
For more information, see Introduction to NFSv4 ACLs at http://wiki.linux-nfs.org/wiki/index.php/ACLs#Introduction_to_NFSv4_ACLs .
This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The "Document", below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as "you". You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.
A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A "Secondary Section" is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The "Invariant Sections" are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A "Transparent" copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.
3. COPYING IN QUANTITY
If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
4. MODIFICATIONS
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.
B. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement.
C. State on the Title page the name of the publisher of the Modified Version, as the publisher.
D. Preserve all the copyright notices of the Document.
E. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.
F. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.
G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice.
H. Include an unaltered copy of this License.
I. Preserve the section Entitled "History", Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled "History" in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence.
J. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the "History" section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.
K. For any section Entitled "Acknowledgements" or "Dedications", Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.
L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles.
M. Delete any section Entitled "Endorsements". Such a section may not be included in the Modified Version.
N. Do not retitle any existing section to be Entitled "Endorsements" or to conflict in title with any Invariant Section.
O. Preserve any Warranty Disclaimers.
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section Entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by various parties--for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
5. COMBINING DOCUMENTS
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled "History" in the various original documents, forming one section Entitled "History"; likewise combine any sections Entitled "Acknowledgements", and any sections Entitled "Dedications". You must delete all sections Entitled "Endorsements".
6. COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
7. AGGREGATION WITH INDEPENDENT WORKS
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an "aggregate" if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.
8. TRANSLATION
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled "Acknowledgements", "Dedications", or "History", the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.
9. TERMINATION
You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
10. FUTURE REVISIONS OF THIS LICENSE
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/ .
Each version of the License is given a distinguishing version number. If the Document specifies
that a particular numbered version of this License "or any later version" applies to it, you have
the option of following the terms and conditions either of that specified version or of any
later version that has been published (not as a draft) by the Free Software Foundation. If the
Document does not specify a version number of this License, you may choose any version ever
published (not as a draft) by the Free Software Foundation.
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the
“with...Texts.” line with this:
with the Invariant Sections being LIST THEIR TITLES, with the
Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST.
If you have Invariant Sections without Cover Texts, or some other combination of the three,
merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend releasing
these examples in parallel under your choice of free software license, such as the GNU General
Public License, to permit their use in free software.