Introduction to IBM Power Using IBM PowerVM
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. These and other
IBM trademarked terms are marked on their first occurrence in this information with the
appropriate symbol (® or ™), indicating US registered or common law trademarks owned by
IBM at the time this information was published. Such trademarks may also be registered or
common law trademarks in other countries. A current list of IBM trademarks is available on the
Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the
United States, other countries, or both:
Acknowledgements
While this whitepaper has two principal authors/editors, it is the culmination of the work of a number of different subject matter experts within IBM who contributed ideas, detailed technical information, and the occasional photograph and section of description.
Kanisha Patel, Kevin Reilly, Marc Gollub, Julissa Villarreal, Michael Mueller, George Ahrens,
Kanwal Bahri, Steven Gold, Jim O’Connor, K Paul Muller, Ravi A. Shankar, Kevin Reick, Peter
Heyrman, Dave Stanton, Dan Hurlimann, Kaveh Naderi, Nicole Nett, John Folkerts and Hoa
Nguyen.
Reliability generally refers to the infrequency of system and component failures experienced by
a server.
Availability, broadly speaking, is how the hardware, firmware, operating systems and
application designs handle failures to minimize application outages.
Serviceability generally refers to the ability to efficiently and effectively install and upgrade
systems firmware and applications, as well as to diagnose problems and efficiently repair faulty
components when required.
These interrelated concepts of reliability, availability and serviceability are often spoken of as "RAS".
Within a server environment, all of RAS, but especially application availability, is really an end-to-end proposition. Attention to RAS needs to permeate all aspects of application deployment. However, a good foundation for server reliability, whether in a scale-out or scale-up environment, is clearly beneficial.
Systems based on the Power processors are generally known for their design emphasizing
Reliability, Availability and Serviceability capabilities. Previous versions of a RAS whitepaper
have been published to discuss general aspects of the hardware reliability and the hardware and
firmware aspects of availability and serviceability.
The focus of this whitepaper is to introduce the Power10 processor-based systems using the
PowerVM hypervisor. Systems not using PowerVM will not be discussed specifically in this
whitepaper.
Section 5: Serviceability
Provides descriptions of the error log analysis, call-home capabilities, service environment and
service interfaces of IBM Power Systems.
Comparative Discussion
In September 2021, IBM introduced the first Power system using the Power10 processor: the IBM Power E1080 system, a scalable server using multiple four-socket Central Electronics Complex (CEC) drawers. The Power E1080 system design was inspired by the Power E980 but has enhancements in key areas to complement the performance capabilities of the Power10 processor.
One of these key enhancements includes an all-new memory subsystem with Differential
DIMMs (DDIMMs) using a memory buffer that connects to processors using an Open Memory
Interface (OMI) which is a serial interface capable of higher speeds with fewer lanes compared
to a traditional parallel approach.
Another enhancement is the use of passive external and internal cables for the fabric busses used to connect processors between drawers, eliminating the routing of signals through the CEC backplane, in contrast to the POWER9 approach where signals were routed through a backplane and the external cables were active. This design point significantly reduces the likelihood that the labor-intensive and costly replacement of the main system backplane will be needed.
Another change of note from a reliability standpoint is that the processor clock design, while still redundant in the Power E1080 system, has been simplified, since it is no longer required that each processor module within a CEC drawer be synchronized with the others.
Figure 1: POWER9/Power10 Server RAS Highlights Comparison (excerpt)
- Multi-node SMP fabric RAS (CRC-checked processor fabric bus retry with spare data lane and/or bandwidth reduction): POWER9 scale-out^ – N/A; Power E950 – N/A; Power E980 – Yes; Power10 – Yes (the Power10 design removes active components on the cable and introduces internal cables to reduce backplane replacements).
- PCIe hot-plug with processor-integrated PCIe controller$: Yes across all of the systems compared.
- Redundant/spare voltage phases on voltage converters for levels feeding processors and custom memory DIMMs or memory risers: POWER9 scale-out^ – No; Power E950 – Redundant; Power E980 – Both redundant and spare; Power10 – Yes for processors (DDIMMs use on-board Power Management Integrated Circuits (PMICs)).
* In scale-out systems Chipkill capability is per rank of a single Industry Standard DIMM
(ISDIMM); in IBM Power E950 Chipkill and spare capability is per rank spanning across an
ISDIMM pair; and in the IBM Power E980, per rank spanning across two ports on a Custom
DIMM.
The Power E950 system also supports DRAM row repair
^ IBM Power® S914, IBM Power® S922, IBM Power® S924; IBM Power® H922, IBM Power® S924, IBM Power® H924
$ Note: I/O adapter and device concurrent maintenance in this document refers to the hardware capability to allow the adapter or device to be removed and replaced concurrently with system operation. Support from the operating system is required and will depend on the adapter or device and configuration deployed.
- Dynamic memory row repair and spare DRAM capability:
  - Power10 1s and 2s (IBM Power System S1014, S1022, S1024): 2U DDIMM – no spare DRAM; 4U DDIMM (post GA) – 2 spare DRAMs per rank; Yes – dynamic row repair.
  - Power10 IBM Power System E1050: Yes – base; 4U DDIMM – 2 spare DRAMs per rank; Yes – dynamic row repair.
  - Power10 IBM Power System E1080: Yes – base; 4U DDIMM – 2 spare DRAMs per rank; Yes – dynamic row repair.
- Active Memory Mirroring for the Hypervisor:
  - Power10 1s and 2s: Yes – base (new to scale-out).
  - Power E1050: Yes – base.
  - Power E1080: Yes – base.
- Redundant on-board Power Management Integrated Circuits (PMICs) on memory DDIMMs:
  - Power10 1s and 2s: No with 2U DDIMM; Yes – optional with 4U DDIMM (post GA).
  - Power E1050: Yes – base (4U DDIMM).
  - Power E1080: Yes – base (4U DDIMM).
Not illustrated in the figure above are the internal connections for the SMP fabric busses, which will be discussed in detail in another section.
In comparing the Power E1080 design to the POWER9-based Power E980 system, it is also
interesting to note that the processor clock function has been separated from the local service
functions and now resides in a dual fashion as separate clock cards.
The POWER8 multi-drawer system design required that all processor modules be synchronized
across all CEC drawers. Hence a redundant clock card was present in the system control drawer
and used for all the processors in the system.
The memory subsystem of the Power E1080 system has been completely redesigned to support
Differential DIMMs (DDIMMs) with DDR4 memory that leverage a serial interface to
communicate between processors and the memory.
A memory DIMM is generally considered to consist of 1 or more “ranks” of memory modules
(DRAMs). A standard DDIMM module may consist of 1 or 2 ranks of memory and will be
approximately 2 standard “rack units” high, called 2U DDIMM. The Power E1080 system
exclusively uses a larger DDIMM with up to 4 ranks per DDIMM (called a 4U DDIMM). This
allows for not only additional system capacity but also room for additional RAS features to
better handle failures on a DDIMM without needing to take a repair action (additional self-
healing features).
Power E1080 CEC Drawer to CEC Drawer SMP Fabric Interconnect Design
The SMP fabric busses used to connect processors across CEC nodes are similar in RAS function to the fabric busses used between processors within a CEC drawer. Each bus is functionally composed of eight bi-directional lanes of data. CRC checking with retry is also used, and ½ bandwidth mode is supported.
Unlike the processor-to-processor within a node design, the lanes of data are carried from each
processor module through internal cables to external cables and then back through internal cables
to the other processor.
Physically each processor module has eight pads (four on each side of the module.) Each pad
side has an internal SMP cable bundle which connects from the processor pads to a bulkhead in
each CEC drawer which allows the external and internal SMP cables to be connected to each
other.
Figure 5: SMP Fabric Bus Slice
The illustration shows just the connections on one side of the processor.
Though it is beyond the scope of this whitepaper to delve into the exact details of how Time Domain Reflectometry (TDR) works, as a very rough analogy it can be likened to a form of sonar: when desired, a processor module that drives a signal on a lane can generate an electrical pulse along the path to a receiving processor in another CEC drawer. If there is a fault along the path, the driving processor can detect a kind of echo or reflection of the pulse. The time it takes for the reflection to be received is indicative of where the fault is within the cable path.
For faults that occur mid-cable, the timing is such that TDR should be able to determine exactly which field replaceable unit (FRU) to replace to fix the problem. If the echo is very close to a connection, two FRUs might be called out, but in any case the use of TDR allows for good fault isolation for such errors while allowing the Power10 system to take advantage of a fully passive path between processors.
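As a rough illustration of the principle only (this is not IBM's diagnostic implementation, and the propagation velocity, path lengths, and tolerance below are invented for the example), a sketch of how a reflection time could be turned into an FRU callout might look like this:

```python
# Illustrative sketch of TDR-style fault isolation; all constants are example assumptions.

SIGNAL_VELOCITY_M_PER_NS = 0.2   # assumed propagation speed (~2/3 of c is typical for copper)

def fault_distance_m(round_trip_ns: float) -> float:
    """Distance to the reflection point; the pulse travels out and back, hence the divide by 2."""
    return round_trip_ns * SIGNAL_VELOCITY_M_PER_NS / 2.0

def isolate_fru(round_trip_ns: float, fru_boundaries_m: list[tuple[str, float]],
                tolerance_m: float = 0.05) -> list[str]:
    """Map an estimated fault position onto the FRUs along the cable path.

    fru_boundaries_m lists (FRU name, cumulative length in meters) in path order,
    e.g. internal cable -> external cable -> internal cable. If the estimate lands
    within tolerance of a connection, both neighboring FRUs are called out.
    """
    distance = fault_distance_m(round_trip_ns)
    for i, (name, end) in enumerate(fru_boundaries_m):
        if distance <= end + tolerance_m:
            callout = [name]
            if distance >= end - tolerance_m and i + 1 < len(fru_boundaries_m):
                callout.append(fru_boundaries_m[i + 1][0])  # fault near a connector
            return callout
    return [fru_boundaries_m[-1][0]]

# Example path: internal cable (0-0.4 m), external cable (0.4-2.4 m), internal cable (2.4-2.8 m)
path = [("internal cable A", 0.4), ("external cable", 2.4), ("internal cable B", 2.8)]
print(isolate_fru(round_trip_ns=12.0, fru_boundaries_m=path))  # ~1.2 m in -> ['external cable']
```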
System Structure
A simplified view of the Power E1050 system design is represented in the figure below:
The E1050 maintains the same system form factor and infrastructure redundancy as the E950. As depicted in the E1050 system diagram below, there are four power supplies and fan Field Replaceable Units (FRUs) that provide at least N+1 redundancy. These components can be concurrently maintained or hot added/removed. There is also N+1 Voltage Regulation Module (VRM) phase redundancy for the processors, and redundant Power Management Integrated Circuits (PMICs) supplying voltage to the 4U DDIMMs that the E1050 offers.
The E1050 Op Panel base and LCD are connected to the same planar as the internal NVMe drives. The Op Panel base and LCD are separate FRUs and are concurrently maintainable. The NVMe backplane also has two USB 3.0 ports, accessible through the front of the system, for the system OS. Not shown in the diagram are two additional OS USB 3.0 ports at the rear of the system, connected through the eBMC card.
[Figure: Power E1050 system structure simplified view. Front USB ports, the Op Panel base and LCD, and ten internal NVMe drives sit on the DASD/NVMe backplane. Four fan FRUs (eight fan rotors) and four power supplies provide cooling and power. The main planar board carries the power distribution, the eBMC service processor card with RTC battery and system VPD, voltage regulator modules, the processors in dual-chip modules, the memory DDIMMs, and PCIe slots C1 and C3 through C9.]
X4 chip kill comparison, POWER9 E950 memory versus Power10 E1050 memory:
- POWER9 E950 memory: one spare DRAM per port or across a DIMM pair.
- Power10 E1050 memory: two spare DRAMs per port.
- RAS impact:
  - P10 4U DDIMM: 1st chip kill fixed with a spare; 2nd serial chip kill fixed with a spare; 3rd serial chip kill fixed with ECC; 4th serial chip kill is uncorrectable.
  - E950 DIMM: 1st chip kill fixed with a spare; 2nd serial chip kill fixed with ECC; 3rd serial chip kill is uncorrectable.
NOTE: A memory ECC code is defined by how many bits or symbols (group of bits) it can correct. The P10
DDIMM memory buffer ECC code organizes the data into 8-bit symbols and each symbol contains the data from
one DRAM DQ over 8 DDR beats.
DASD Options
The E1050 provides 10 internal NVMe drives at Gen4 speeds. The NVMe drives are
connected to DCM0 and DCM3. In a 2S DCM configuration, only 6 of the drives are
available. A 4S DCM configuration is required to have access to all 10 internal NVMe
drives. Unlike the E950, the E1050 has no internal SAS drives. An external drawer can
be used to provide SAS drives.
The internal NVMe drives support OS-controlled RAID0 and RAID1 arrays, but no hardware RAID. For best redundancy, OS mirroring and dual VIOS mirroring can be employed. To ensure as much separation as possible in the hardware path between mirror pairs, the following NVMe configuration is recommended:
a.) Mirrored OS: NVMe3,4 or NVMe8,9 pairs
b.) Mirrored dual VIOS
I. Dual VIOS: NVMe3 for VIOS1, NVMe4 for VIOS2
II. Mirroring the dual VIOS: NVMe9 mirrors NVMe3, NVMe8 mirrors NVMe4
The IBM Power10 E1050 system comes with a redesigned service processor based on a
Baseboard Management Controller (BMC) design with firmware that is accessible through open-
source industry standard APIs, such as Redfish. An upgraded Advanced System Management
Interface (ASMI) web browser user interface preserves the required enterprise RAS functions
while allowing the user to perform tasks in a more intuitive way.
The service processor supports surveillance of the connection to the Hardware Management
Console (HMC) and to the system firmware (hypervisor). It also provides several remote power
control options, environmental monitoring, reset, restart, remote maintenance, and diagnostic
functions, including console mirroring. The BMC service processor's menus (ASMI) can be accessed concurrently during system operation, allowing nondisruptive abilities to change system default parameters, view and download error logs, and check system health.
Redfish, an Industry standard for server management, enables the Power Servers to be managed
individually or in a large data center. Standard functions such as inventory, event logs, sensors,
dumps, and certificate management are all supported with Redfish. In addition, new user
management features support multiple users and privileges on the BMC via Redfish or ASMI.
User management via Lightweight Directory Access Protocol (LDAP) is also supported. The
Redfish events service provides a means for notification of specific critical events such that
actions can be taken to correct issues. The Redfish telemetry service provides access to a wide
variety of data (e.g. power consumption, ambient, core, DDIMM and I/O temperatures, etc.) that
can be streamed on periodic intervals.
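As a hedged illustration of how such Redfish data might be pulled by a management client (the BMC address and credentials are placeholders, and while the resource paths follow standard DMTF Redfish conventions, the exact resources exposed can vary by firmware level), a sketch could look like this:

```python
# Minimal sketch of reading inventory and sensor telemetry over Redfish.
# BMC address/credentials are placeholders; paths follow the DMTF Redfish convention
# and may differ slightly on a given eBMC firmware level.
import requests

BMC = "https://ebmc.example.com"     # placeholder BMC address
AUTH = ("admin", "password")         # placeholder credentials
VERIFY_TLS = False                   # lab-only; use proper certificates in production

def get(path: str) -> dict:
    resp = requests.get(f"{BMC}{path}", auth=AUTH, verify=VERIFY_TLS, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Enumerate systems known to the BMC and print basic inventory/health.
systems = get("/redfish/v1/Systems")
for member in systems.get("Members", []):
    system = get(member["@odata.id"])
    print(system.get("Model"), system.get("SerialNumber"),
          system.get("Status", {}).get("Health"))

# Walk chassis sensors (temperatures, power) if the BMC exposes a Sensors collection.
chassis = get("/redfish/v1/Chassis")
for member in chassis.get("Members", []):
    sensors = get(member["@odata.id"] + "/Sensors")
    for s in sensors.get("Members", []):
        reading = get(s["@odata.id"])
        print(reading.get("Name"), reading.get("Reading"), reading.get("ReadingUnits"))
```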
System Structure
There are multiple scale out system models (MTMs) supported. For brevity, this document
focuses on the largest configuration of the scale out servers.
The simplified illustration in Figure 10 depicts the 2S DCM with 4U CEC drawer. Similar to the S9xx systems, there is infrastructure redundancy in the power supplies and fans. In addition, these components can be concurrently maintained, along with the Op Panel LCD, internal NVMe drives and I/O adapters.
Figure 10: Power S1024 System Structure Simplified View
[Figure: Power S1024 system structure simplified view. Front USB ports, the Op Panel base and LCD, and sixteen internal NVMe drives on the DASD backplane are concurrently replaceable. Six fan FRUs (twelve fan rotors) and four power supplies are concurrently replaceable and collectively provide at least N+1 redundancy. The main planar board carries the TPM and other service functions, clock circuitry, the eBMC service processor with RTC battery, VRMs, two Power10 dual-chip modules (each with two P10 processor chips), and the memory DDIMMs. There are no PCIe switches; all I/O slots (a mix of x8 Gen5/x16 Gen4 and x8 Gen4/Gen5 slots, some usable for NVMe cards, plus an OpenCAPI-only connection) connect directly to the processors.]
As depicted in the simplified illustration in Figure 11, there is another variation of the Power10 processor module. This option contains one P10 chip with processor cores and OMI memory channels and another P10 chip whose primary purpose is to provide PCIe connections to I/O devices. This P10 entry Single Chip Module (eSCM) processor configuration gives customers the option to purchase a cost-reduced solution without losing any I/O adapter slots. A point of note is that if the primary processor chip of the eSCM is nonfunctional and garded by the firmware, the
Figure 11: Power S1024 System Structure Simplified View With eSCM
[Figure: Power S1024 system structure simplified view with eSCMs. The layout matches Figure 10, except that each processor module pairs a P10 chip with cores with a second P10 chip that has zero cores and primarily provides PCIe connectivity to slots C1–C4 and C7–C9.]
The memory on these systems is very different from that of the S9xx systems. These scale-out systems now support Active Memory Mirroring for the Hypervisor, which was not available in the S9xx servers. While the S9xx uses ISDIMMs, these systems support a 2U DDIMM, which is a DDIMM form factor that fits in 2U systems. The 2U DDIMM is a reduced cost and RAS
POWER9 S9xx memory versus Power10 S1014, S1022, S1024 memory:
- DIMM form factor: S9xx – direct-attached ISDIMMs; Power10 – 2U memory DDIMM. RAS impact: the P10 2U DDIMM is a single FRU, with fewer components to replace; on the S9xx, a separate FRU is used for each ISDIMM.
- X4 chip kill: S9xx – single DRAM chipkill correction, but no spare DRAM; Power10 – single DRAM chipkill correction, but no spare DRAM. RAS impact: P10 2U DDIMM – 1st chip kill fixed with ECC, 2nd serial chip kill is uncorrectable; S9xx DIMM – 1st chip kill fixed with ECC, 2nd serial chip kill is uncorrectable.
NOTE: A memory ECC code is defined by how many bits or symbols (group of bits) it can correct. The P10
DDIMM memory buffer ECC code organizes the data into 8-bit symbols and each symbol contains the data from
one DRAM DQ over 8 DDR beats.
PCIe slot connectivity (slot – width – processor connection – supported use):
- C0 – x8 Gen5 / x16 Gen4 – CP1 = DCM1/C1 – PCIe adapters, cable card for I/O expansion
- C3 – x8 Gen5 / x16 Gen4 – CP1 = DCM1/C0 – PCIe adapters, cable card for I/O expansion
- C4 – x8 Gen5 / x16 Gen4 – CP1 = DCM1/C0 – PCIe adapters, cable card for I/O expansion
- C10 – x8 Gen5 / x16 Gen4 – CP0 = DCM0/C0 – PCIe adapters, cable card for I/O expansion, NVMe JBOF
- C11 – x8 Gen5 – CP0 = DCM0/C0 – PCIe adapters, cable card for I/O expansion, NVMe JBOF
DASD Options
These systems provide up to 16 internal NVMe drives at Gen4 speeds. The NVMe drives are
connected to the processor via a plug-in PCIe NVMe JBOF (Just a Bunch Of Flash) card. Up to
2 JBOF cards can be populated in the S1024 and S1014 systems, with each JBOF card attached
to an 8-pack NVMe backplane. The S1022 NVMe backplane only supports the 4-pack, which
provides up to 8 NVMe drives per system. The JBOF cards are operational in PCIe slots C8, C10 and C11 only. However, the C8 and C10 slots cannot both be populated with JBOF cards simultaneously. As depicted in Figure 13 above, all 3 slots are connected to DCM0, which means
a 1S system can have all the internal NVMe drives available. While the NVMe drives are
concurrently maintainable, a JBOF card is not. Unlike the S9xx, these systems have no internal
SAS drives. An external drawer can be used to provide SAS drives.
The S1012 is described as an edge-level server, designed for both edge computing and core business workloads. It is a 1-socket, half-wide system available in a 2U rack-mounted or tower chassis form factor.
While the system benefits from incorporating a Power10 processor, an eBMC, and many firmware components shared in common with other systems, the system design, positioning and form factor are different from other Power systems. Consequently, there are a number of RAS characteristics that differ from the other Power10 systems described in this whitepaper.
Rather than using 2U DDIMMs, the system supports DDR4 Industry Standard RDIMMs
(ISDIMMS). The processor connects to the ISDIMMS through separate memory buffers
incorporated on the system planar.
The S1012 does not support Active Memory Mirroring of the Hypervisor. The PCIe I/O
Adapters are not concurrently maintainable and external I/O drawers are not supported. Live
Partition Mobility (LPM) is not supported for the Power S1012 at General Availability as of the
date of publication of this whitepaper.
The system does support redundancy in system fans and power supplies, as well as concurrent maintenance of them. The four NVMe drives are accessible for hot-plugging (when allowed by the operating system). The four PCIe slots do not support concurrent maintenance or hot-plugging of adapters.
Due to the limited number of components in the system, the use of fault indicators for failing part indication is more limited compared to the scale-out systems.
The PCIe x16 uplink to the system cable cards is carried over a pair of x16 cables. Connectivity is tolerant of a failure of either cable in the pair, and each cable can be repaired independently. The management of the drawer is provided by two Enclosure Services Modules (ESMs), either of which is capable of managing the drawer.
For serviceability, the drawer has front and back service indicators and IBM blue colored touchpoints. These operate according to the IBM Power Systems indicator conventions used in repair and maintenance operations. On the front of the drawer is a display panel with a green power LED, a blue identify LED and an amber fault LED.
Each of the NVMe U.2 drives can be concurrently repaired and can be configured using the same high availability options supported by IBM Power Systems and the operating systems, such as RAID or mirroring. Internally, the NED24 has redundant power supplies and is over-provisioned with six fans in an N+1 configuration.
The following summarizes how the NED24 drawer handles failures of its major components (component; impact of failure; repair; additional requirement):
- NVMe device in slot: loss of function of the NVMe device. The NVMe device can be repaired while the rest of the NED24 continues to operate. Requires that OS device mirroring is employed.
- Enclosure Services Manager (ESM): loss of access to all the NVMe devices through the ESM. The associated ESM must be taken down for repair; the rest of the NED24 can remain active. Systems with an HMC.
- Other failure of the PCIe4 cable adapter card in the system or ESM: loss of access to all the NVMe devices connected to the ESM. The associated ESM must be taken down for repair; the rest of the NED24 can remain active. Systems with an HMC.
- A PCIe lane fault or other failure of a cable: the system continues to run, but the number of active lanes available to the ESM will be reduced. The associated ESM must be taken down for repair; the rest of the NED24 can remain active. Systems with an HMC.
- One power supply: the NED24 continues to run with the remaining PSU. Concurrently repairable. No additional requirement.
- Midplane: depending on the source of the failure, may take down the entire I/O drawer. The I/O drawer must be powered off to repair (loss of use of all I/O in the drawer). Systems with an HMC.
The I/O expansion drawer is attached using a converter card, called a PCIe x16 to CXP Converter Card, that plugs into certain capable PCIe slots of the CEC. The converter card connects a pair of CXP-type copper or active optical cables, offered in a variety of lengths, to the I/O expansion drawer. CXP is a type of cable and connector technology used to carry wide-width links such as PCIe. These cards have been designed to improve signal integrity and error isolation. They now contain a PCI Express switch that provides much better fault isolation by breaking the link to the I/O expansion drawer into two links. Because each link is shorter with fewer components, an error can be better isolated to a subset of failing components. This aids in the repair of a failure.
Each I/O expansion drawer contains up to two pluggable 6-slot fanout modules. Each can be installed and repaired independently. Each uses two x8 cables for its PCIe x16-width uplink. This implementation provides tolerance of a failure in either cable. The failure of one cable reduces the I/O bandwidth and lane width by half without a functional outage. Most client applications will see little or no performance degradation when the link recovers from errors in this fashion.
Additional RAS features allow for power and cooling redundancy. Dual power supplies support N+1 redundancy to allow for concurrent replacement or single power source loss. A voltage regulator module (VRM) associated with each PCIe4 6-slot fanout module includes built-in N+1 redundancy.
Philosophy
In the previous section three different classes of servers were described with different levels of
RAS capabilities for Power10 processor-based systems. While each server had specific
attributes, the section highlighted several common attributes.
This section will outline some of the design philosophies and characteristics that influenced the
design. The following section will detail more specifically how the RAS design was carried out
in each sub-system of the server.
The figure above indicates how IBM design and influence may flow through the different layers
of a representative enterprise system as compared to other designs that might not have the same
level of control. Where IBM provides the primary design and manufacturing test criteria, IBM
can be responsible for integrating all the components into a coherently performing system and
verifying the stack during design verification testing.
In the end-user environment, IBM likewise becomes responsible for resolving problems that may
occur relative to design, performance, failing components and so forth, regardless of which
elements are involved.
Incorporate Experience
Being responsible for much of the system, IBM puts in place a rigorous structure to identify
issues that may occur in deployed systems and identify solutions for any pervasive issue. Having
support for the design and manufacture of many of these components, IBM is best positioned to
fix the root cause of problems, whether changes in design, manufacturing, service strategy,
firmware or other code is needed.
The detailed knowledge of previous system performance has a major influence on future systems
design. This knowledge lets IBM invest in improving the discovered limitations of previous
generations. Beyond that, it also shows the value of existing RAS features. This knowledge
justifies investing in what is important and allows for adjustment to the design when certain
techniques are shown to be no longer of much importance in later technologies or where other
mechanisms can be used to achieve the same ends with less hardware overhead.
It is not feasible to detect or isolate every possible fault or combination of faults that a server
might experience, though it is important to invest in error detection and build a coherent
In general, I/O adapters may also have less hardware error detection capability where they can
rely on a software protocol to detect and recover from faults when such protocols are used.
Redundancy Definition
Sometimes redundant components are not actively in use unless a failure occurs. For example, a
processor may only actively use one clock source at a time even when redundant clock sources
are provided.
In contrast, fans and power supplies are typically all active in a system. If a system is said to have "n+1" fan redundancy, for example, all "n+1" fans will normally be active in the system absent a failure. If a fan failure occurs, the system will run with "n" fans. In cases where there are fan or power supply failures, power and thermal management code may compensate by increasing fan speed or making other adjustments according to operating conditions, per the power management mode and power/thermal management policy.
Build System Level RAS Rather Than Just Processor and Memory RAS
IBM builds Power systems with the understanding that every item that can fail in a system is a
potential source of outage.
While building a strong base of availability for the computational elements such as the
processors and memory is important, it is hardly sufficient to achieve application availability.
The failure of a fan, a power supply, a voltage regulator, or I/O adapter might be more likely
than the failure of a processor module designed and manufactured for reliability.
Scale-out servers will maintain redundancy in the power and cooling subsystems to avoid system
outages due to common failures in those areas. Concurrent repair of these components is also
provided.
For the Enterprise systems, a higher investment in redundancy is made. The Power E1080 system, for example, is designed from the start with the expectation that the system must be largely shielded from failures of these other components causing persistent system unavailability, incorporating substantial redundancy within the service infrastructure (such as redundant service processors, redundant processor boot images, and so forth). An emphasis is also placed on the reliability of the components themselves, so that they are highly reliable and meant to last.
[Figure: Rough comparison of the Power10 scale-up processor and the POWER6 processor. The Power10 side shows up to 16 SMT8 cores (a subset shown), each with L2 and L3 cache, along with MMA/TP/NX/PAU accelerators, memory controllers with OMI interfaces, and the internal fabric and interfaces. The POWER6 side shows two SMT2 cores with L2 caches and directories and an external L3 cache controller.]
The illustration above shows a rough view of the Power10 scale-up processor design leveraging
SMT8 cores (a maximum of 16 cores shown.)
The Power10 design is certainly more capable. There is a maximum of 16 SMT8 cores compared
to 2 SMT2 cores on POWER6. The core designs architecturally have advanced in function as
well. The number of memory controllers has doubled, and the memory controller design is also
different.
The addition of system-wide functions such as the Matrix Multiply Accelerator (MMA), NX
accelerators and the CAPI interfaces provide functions just not present in the hardware of the
POWER6 system.
The Power10 design is also much more integrated. The L3 cache is internal, and the I/O
processor host bridge is integrated into the processor. The thermal management is now
conducted internally using the on-chip controllers.
There are reliability advantages to the integration. It should be noted that when a failure does occur, it is more likely to be a processor module at fault compared to previous generations with less integration. A benefit of this integration approach is that Power10-based systems can leverage the advanced error isolation and recovery mechanisms of the processor to improve system uptime.
PCIe Hub Behavior and Enhanced Error Handling: Each Power10 processor has PCIe Host Bridges (PHBs), called PCIe hubs, which generate the various PCIe Gen4/Gen5 busses used in the system. The hub is capable of "freezing" operations when certain faults occur, and in certain cases can retry and recover from the fault condition. This hub freeze behavior prevents faulty data from being written out through the I/O hub system and prevents reliance on faulty data within the processor complex when certain errors are detected. Along with this hub freeze behavior is what has long been termed Enhanced Error Handling for I/O.
Processor Safe-mode: Failures of certain processor-wide facilities, such as the power and thermal management code running on the on-chip controller (OCC) and the self-boot engine (SBE) used during run time for out-of-band service processor functions, can occur. To protect system function, such faults can result in the system running in a "safe mode," which allows processing to continue with reduced performance in the face of errors where the ability to access power and thermal performance or error data is not available.
Persistent Guarding of Failed Elements: Should a fault in a processor core become serious enough that the component needs to be repaired (a persistent correctable error with all self-healing capabilities exhausted, or an unrecoverable error), the system will remember that a repair has been called for and will not re-use the processor on subsequent system reboots, until repair. This functionality may be extended to other processor elements and to entire processor modules when relying on that element in the future means risking another outage. In a multi-node system, deconfiguration of a processor for fabric bus consistency reasons can result in deconfiguration of a node.
As systems design continues to mature the RAS features may be adjusted based on the needs of
the newer designs as well as field experience; therefore, this list differs from the POWER8
design.
For example, in POWER7+, an L3 cache column repair mechanism was implemented to be used
in addition to the ability to delete rows of a cache line.
This feature was carried forward into the POWER8 design, but the field experience with the POWER8 technology did not show the benefit, based on how faults that might impact multiple rows surfaced. For POWER9 and Power10 enterprise systems, given the size of the caches, the number of cache lines that can be deleted was extended instead of continuing to carry column repair as a separate procedure.
Going forward, the number of rows that need to be deleted is adjusted for each generation based
on technology.
Likewise, in POWER6 a feature known as alternate processor recovery was introduced. The feature had the purpose of handling certain faults in the POWER6 core. The faults handled were limited to cases where the fault was discovered before an instruction was committed and the state of the failing core could be extracted. The feature, in those cases, allowed the failing workload to be dynamically retried on an alternate processor core. In cases where no alternate processor core was available, some number of partitions would need to be terminated, but the
Memory Scrubbing: The memory controller periodically scrubs the DRAMs for soft errors. This hardware-accelerated scrubbing involves reading the memory, checking the data, and correcting the data if an ECC fault is detected. If a UE is detected, corrective action can safely be taken in maintenance mode.
Bus CRC: Like all high speed data busses, the OMI can be susceptible to occasional multiple-bit errors. A cyclic redundancy check (CRC) code is used to determine if there are errors within an entire packet of data. If a transient CRC fault is detected, the packet is retried and the operation continues.
Split Memory Channel (½ bandwidth mode): If there is a persistent CRC error that is confined to the data on half of the OMI channel (1 to 4 lanes), the bandwidth of the bus is reduced and all the memory channel traffic is sent across just the 4 good lanes.
Predictive Memory Deallocation: Refer to the Dynamic Deallocation/Memory Substitution section below for details.
Memory Channel Checkstops and Memory Mirroring: The memory controller communicating between the processor and the memory buffer has its own set of methods for containing errors or retrying operations. Some severe faults require that memory under a portion of the controller become inaccessible to prevent reliance on incorrect data. There are cases where the fault can be limited to just one memory channel. In these cases, the memory controller asserts what is known as a channel checkstop. In systems without hypervisor memory mirroring, a channel checkstop will usually result in a system outage. However, with hypervisor memory mirroring the hypervisor may continue to operate in the face of a memory channel checkstop. Refer to the Hypervisor Memory Mirroring section below for additional details.
Memory Safe-mode: Memory is throttled based on memory temperature readings in order to thermally protect the DDIMMs. If memory temperature readings are not available, then
All Power10 processors provide a means to mirror the memory used in critical areas of the PowerVM hypervisor. This provides protection for the hypervisor memory so that the hypervisor does not need to terminate due to just a fault on a DIMM that causes uncorrectable errors in hypervisor memory.
There are rare conditions when AMM of the hypervisor may not prevent system termination. For instance, there are times when mirroring will be disabled in response to a DIMM error, which could lead to system termination if the second DIMM in the mirrored pair were to experience an uncorrectable event.
[Figure: Hypervisor memory mirroring. Mirrored LMBs span two memory controllers; writes go to each side, and reads alternate between sides, or come from one side only when a DIMM fails.]
By selectively mirroring only the segments used by the hypervisor, this protection is provided
without the need to mirror large amounts of memory.
It should be noted that the PowerVM design is a distributed one. PowerVM code can execute at
times on each processor in a system and can reside in small amounts in memory anywhere in the
system. Accordingly, the selective mirroring approach is fine grained enough not to require the
hypervisor to sit in any particular memory DIMMs. This provides the function while not
compromising the hypervisor performance as might be the case if the code had to reside
remotely from the processors using a hypervisor service.
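Purely as a conceptual sketch of the mirroring policy just described (this is not PowerVM code; the class and its behavior are invented for illustration), the write/read rules can be expressed as follows:

```python
# Conceptual sketch of the mirrored-LMB policy: every write goes to both copies;
# reads alternate between copies, and if one copy's DIMM has failed the read is
# satisfied from the surviving copy. Not the actual hypervisor implementation.
class MirroredLMB:
    def __init__(self, size: int):
        self.copies = [bytearray(size), bytearray(size)]
        self.failed = [False, False]     # marks a copy whose DIMM has failed
        self._next = 0                   # which copy serves the next read

    def write(self, offset: int, data: bytes) -> None:
        for copy in self.copies:         # writes go to each side
            copy[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        side = self._next
        self._next ^= 1                  # alternate sides on successive reads
        if self.failed[side]:            # DIMM failed: read from the other side only
            side ^= 1
        return bytes(self.copies[side][offset:offset + length])

    def mark_uncorrectable(self, side: int) -> None:
        """Called when a copy reports an uncorrectable error; future reads avoid it."""
        self.failed[side] = True

lmb = MirroredLMB(4096)
lmb.write(0, b"hypervisor state")
lmb.mark_uncorrectable(0)
print(lmb.read(0, 16))                   # still served from the surviving copy
```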
Voltage Regulation
There are many different designs that can be used for supplying power to components in a
system.
As described above, power supplies may take alternating current (AC) from a data center power
source, and then convert that to a direct current voltage level (DC).
Modern systems are designed using multiple components, not all of which use the same voltage level. A power supply might provide multiple different DC voltage levels to supply all the components in a system. Failing that, it may supply a single voltage level (e.g., 12V) to voltage regulators, which then convert it to the proper voltage levels needed for each system component.
[Figure: Maximizing availability using redundant I/O drawers. Each DCM provides integrated I/O slots and connects through I/O attach cards and x8 links to PCIe switches in two Gen3 I/O expansion drawers. Redundant LAN and SAN adapters, shown at both the physical and virtual level, are spread across Gen3 I/O Drawer 1 and Gen3 I/O Drawer 2 so that availability is still maximized if a drawer or adapter fails.]
Planned Outages
Unplanned outages of systems and applications are typically very disruptive to applications. This
is certainly true of systems running standalone applications, but is also true, perhaps to a
somewhat lesser extent, of systems deployed in a scaled-out environment where the availability
of an application does not entirely depend on the availability of any one server. The impact of
unplanned outages on applications in both such environments is discussed in detail in the next
section.
Planned outages, where the end-user picks the time and place where applications must be taken
off-line can also be disruptive. Planned outages can be of a software nature – for patching or
upgrading of applications, operating systems, or other software layers. They can also be for
hardware, for reconfiguring systems, upgrading or adding capacity, and for repair of elements
that have failed but have not caused an outage because of the failure.
If all hardware failures required planned downtime, then the downtime associated with planned
outages in an otherwise well-designed system would far-outpace outages due to unplanned
causes.
While repair of some components cannot be accomplished with workload actively running in a
system, design capabilities to avoid other planned outages are characteristic of systems with
advanced RAS capabilities. These may include:
Concurrent Repair
When redundancy is incorporated into a design, it is often possible to replace a component in a
system without taking the entire system down.
As examples, Enterprise Power Systems support concurrently removeable and replaceable
elements such as power supplies and fans.
In addition, Enterprise Power Systems as well as Power10 processor-based 1s and 2s systems
support concurrently removing and replacing I/O adapters according to the capabilities of the OS
and applications.
Integrated Sparing
As previously mentioned, to reduce replacements for components that cannot be removed and
replaced without taking down a system, Power Systems strategy includes the use of integrated
spare components that can be substituted for failing ones.
[Figure: Two servers, each with its own service processor and infrastructure, connected to a Hardware Management Console (HMC).]
In simplified terms, LPM typically works in an environment where all the I/O from one partition
is virtualized through PowerVM and VIOS and all partition data is stored in a Storage Area
Network (SAN) accessed by both servers.
To migrate a partition from one server to another, a partition is identified on the new server and
configured to have the same virtual resources as the primary server including access to the same
logical volumes as the primary using the SAN.
When an LPM migration is initiated on a server for a partition, PowerVM begins the process of
dynamically copying the state of the partition on the first server to the server that is the
destination of the migration.
Thinking in terms of using LPM for hardware repairs, if all the workloads on a server are
migrated by LPM to other servers, then after all have been migrated, the first server could be
turned off to repair components.
LPM can also be used for firmware upgrades or for adding hardware to a server when the hardware cannot be added concurrently, in addition to software maintenance within individual partitions.
Minimum Configuration
For detailed information on how LPM can be configured, the following references may be useful: the IBM Redbook IBM PowerVM Virtualization Introduction and Configuration2, as well as the document Live Partition Mobility3.
In general terms, LPM requires that both the system containing the partition to be migrated and the system to which it is being migrated have a local LAN connection using a virtualized LAN adapter. In addition, LPM requires that all systems in the LPM cluster be attached to the same SAN. If a single HMC is used to manage both systems in the cluster, connectivity to the HMC also needs to be provided by an Ethernet connection to each service processor.
The LAN and SAN adapters used by the partition must be assigned to a Virtual I/O Server, and the partition's access to these would be by virtual LAN (vLAN) and virtual SCSI (vSCSI) connections within each partition to the VIOS.
I/O Redundancy Configurations and VIOS
LPM connectivity in the minimum configuration discussion is vulnerable to a number of
different hardware and firmware faults that would lead to the inability to migrate partitions.
Multiple paths to networks and SANs are therefore recommended. To accomplish this, Virtual
I/O servers (VIOS) can be used.
VIOS as an offering for PowerVM virtualizes I/O adapters so that multiple partitions will be able
to utilize the same physical adapter. VIOS can be configured with redundant I/O adapters so that
the loss of an adapter does not result in a permanent loss of I/O to the partitions using the VIOS.
Externally to each system, redundant hardware management consoles (HMCs) can be utilized for
greater availability. There can also be options to maintain redundancy in SANs and local
network hardware.
[Figure: Multi-path environment for LPM. Two servers, each with a service processor and infrastructure, connected to redundant Hardware Management Consoles (HMCs).]
The figure generally illustrates multi-path considerations within an environment optimized for LPM.
2 Mel Cordero, Lúcio Correia, Hai Lin, Vamshikrishna Thatikonda, Rodrigo Xavier, Sixth Edition, published June 2013.
3 IBM, 2018, ftp://ftp.software.ibm.com/systems/power/docs/hw/p9/p9hc3.pdf
[Figure: A logical partition with redundant virtualized LAN and Fibre Channel (fc) adapter paths provided through duplicated physical LAN and fc adapters.]
Since each VIOS can largely be considered as an AIX based partition, each VIOS also needs the
ability to access a boot image, having paging space, and so forth under a root volume group or
rootvg. The rootvg can be accessed through a SAN, the same as the data that partitions use.
Alternatively, a VIOS can use storage locally attached to a server, either DASD devices or SSD
drives such as the internal NVMe drives provided for the Power E1080 and Power E1050
systems. For best availability, the rootvgs should use mirrored or other appropriate RAID drives
with redundant access to the devices.
PowerVC™ and Simplified Remote Restart
PowerVC is an enterprise virtualization and cloud management offering from IBM that
streamlines virtual machine deployment and operational management across servers. The IBM
Cloud PowerVC Manager edition expands on this to provide self-service capabilities in a private
cloud environment; IBM offers a Redbook that provides a detailed description of these
capabilities. As of the time of this writing, IBM PowerVC Version 1.3.2 Introduction and Configuration4 describes this offering in considerable detail.
Deploying virtual machines on systems with the RAS characteristics previously described will
best leverage the RAS capabilities of the hardware in a PowerVC environment. Of interest in this
availability discussion is that PowerVC provides a virtual machine remote restart capability,
which provides a means of automatically restarting a VM on another server in certain scenarios
(described below).
4 January 2017, International Technical Support Organization, Javier Bazan Lazcano and Martin Parrella.
Introduction
All of the previous sections in this document discussed server specific RAS features and options.
This section looks at the more general concept of RAS as it applies to any system in the data
center. The goal is to briefly define what RAS is and look at how reliability and availability are
measured. It will then discuss how these measurements may be applied to different applications
of scale-up and scale-out servers.
RAS Defined
Mathematically, reliability is defined in terms of how infrequently something fails.
At a system level, availability is about how infrequently failures cause workload interruptions.
The longer the interval between interruptions, the more available a system is.
Serviceability is all about how efficiently failures are identified and dealt with, and how
application outages are minimized during repair.
Broadly speaking, systems can be categorized as "scale-up" or "scale-out" depending on the impact to applications or workload when a system is unavailable.
True scale-out environments typically spread workload among multiple systems so that the
impact of a single system failing, even for a short period of time is minimal.
In scale-up systems, the impact of a server taking a fault, or even a portion of a server (e.g., an
individual partition) is significant. Applications may be deployed in a clustered environment so
that extended outages can in a certain sense be tolerated (e.g., using some sort of fail-over to
another system) but even the amount of time it takes to detect the issue and fail-over to another
device is deemed significant in a scale-up system.
Reliability Modeling
The prediction of system level reliability starts with establishing the failure rates of the
individual components making up the system. Then using the appropriate prediction models, the
component level failure rates are combined to give us the system level reliability prediction in
terms of a failure rate.
In the literature, however, system level reliability is often discussed in terms of Mean Time Between Failures (MTBF) for repairable systems rather than a failure rate; for example, a 50-year Mean Time Between Failures. A 50-year MTBF does not mean that a given system will run 50 years between failures; rather, it means that, given 50 identical systems, on average one will fail in a year over a large population of systems.
The following illustration explains roughly how to bridge from individual component reliability
to system reliability terms with some rounding and assumptions about secondary effects:
[Figure: Component failure rates typically follow a "bathtub curve": a steady-state failure rate (on the order of 1 part in 100,000) that holds for some time, then an increasing rate of failures until the component in use is retired (wear-out).]
FIT stands for Failure In Time: 1 FIT equals 1 failure per billion (10^9) hours of operation. Roughly, 100 FITs corresponds to a failure rate of 1 failure per 1,000 systems per year. Mean Time Between Failures is the inverse of the failure rate, so a rate of 1 failure per 1,000 systems per year equates to a 1,000-year MTBF.
The system failure rate is obtained by summing the component failure rates:
FIT of Component 1 + FIT of Component 2 + FIT of Component 3 + ... + FIT of Component n = System Failure Rate, and MTBF = 1 / System Failure Rate.
Example: component FIT rates of 100 + 50 + 70 + 30 + 200 + 50 + 30 + 170 + 50 + 250 add up to 1,000 FITs for the system. Since 100 FITs is about 1 failure per 1,000 systems per year, 1,000 FITs equates to about 10 failures per 1,000 systems per year, which is a 100-year MTBF.
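The same arithmetic can be written out directly; the FIT values below are the illustrative ones from the example above, and the hours-per-year constant is approximate:

```python
# Worked version of the FIT-to-MTBF arithmetic above (illustrative component FIT rates).
HOURS_PER_YEAR = 8766            # 365.25 days

component_fits = [100, 50, 70, 30, 200, 50, 30, 170, 50, 250]
system_fits = sum(component_fits)                             # 1,000 FITs for the system

# 1 FIT = 1 failure per 10^9 part-hours.
failures_per_year = system_fits * HOURS_PER_YEAR / 1e9        # ~0.0088 per system per year
failures_per_1000_systems_per_year = failures_per_year * 1000 # ~8.8 (roughly 10 in the text)
mtbf_years = 1 / failures_per_year                            # ~114 years (roughly 100 in the text)

print(system_fits, round(failures_per_1000_systems_per_year, 1), round(mtbf_years))
```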
Service Costs
It is common for software costs in an enterprise to include the cost to acquire or license code and
a recurring cost for software maintenance. There is also a cost associated with acquiring the
hardware and a recurring cost associated with hardware maintenance – primarily fixing systems
when components break.
Individual component failure rates, the cost of the components, and the labor costs associated
with repair can be aggregated at the system level to estimate the direct maintenance costs
expected for a large population of such systems.
The more reliable the components the less the maintenance costs should be.
Design for reliability can help not only with maintenance costs to a certain degree, but in some cases with the initial cost of a system as well. For example, designing a processor with extra cache capacity, data lanes on a bus, and so forth can make it easier to yield good processors at the end of a manufacturing process, as an entire module need not be discarded due to a small flaw.
At the other extreme, designing a system with an entire spare processor socket could significantly decrease maintenance costs by not having to replace anything in a system should a single processor module fail. However, every system would incur the cost of the spare processor in return for avoiding a repair in the small proportion of systems that ever need one. This is usually not justified from a system cost perspective; rather, it is better to invest in greater processor reliability.
On scale-out systems redundancy is generally only implemented on items where the cost is
relatively low, and the failure rates expected are relatively high – and in some cases where the
redundancy is not complete. For example, power supplies and fans may be considered redundant
in some scale-out systems because when one fails, operation will continue. However, depending
on the design, when a component fails, fans may have to be run faster, and performance-throttled
until repair.
On scale-up systems redundancy that might even add significantly to maintenance costs is
considered worthwhile to avoid indirect costs associated with downtime, as discussed below.
Measuring Availability
Mathematically speaking, availability is often expressed as a percentage of the time something is
available or in use over a given period of time. An availability number for a system can be
mathematically calculated from the expected reliability of the system so long as both the mean
time between failures and the duration of each outage is known.
For example, consider a system that always runs exactly one week between failures and each time it fails, it is down for 10 minutes. For the 168 hours in a week, the system is down (10/60) hours and up 168 − (10/60) hours. As a percentage of the hours in the week, the system is ((168 − (1/6)) / 168) × 100% ≈ 99.9% available.
99.999% available means approximately 5.3 minutes down in a year. On average, a system that
failed once a year and was down for 5.3 minutes would be 99.999% available. This is often
called 5 9’s of availability.
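A minimal sketch of the same availability arithmetic, using the figures from the examples above:

```python
# Availability from outage frequency and duration, using the examples above.
def availability(mtbf_hours: float, downtime_minutes_per_failure: float) -> float:
    downtime_hours = downtime_minutes_per_failure / 60.0
    return (mtbf_hours - downtime_hours) / mtbf_hours

# A system down 10 minutes after every 168-hour week:
print(f"{availability(168, 10):.3%}")            # ~99.901%

# One 5.3-minute outage per year (8766 hours) is about five nines:
print(f"{availability(8766, 5.3):.5%}")          # ~99.99899% ≈ 99.999%
```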
When talking about modern server hardware availability, short weekly failures like in the
example above is not the norm. Rather the failure rates are much lower and the mean time
between failures (MTBF) is often measured in terms of years – perhaps more years than a system
will be kept in service.
Therefore, when a MTBF of 10 years, for example, is quoted, it is not expected that on average
each system will run 10 years between failures. Rather it is more reasonable to expect that on
average, in a given year, one server out of ten will fail. If a population of ten servers always had
exactly one failure a year, a statement of 99.999% availability across that population of servers
would mean that the one server that failed would be down about 53 minutes when it failed.
In theory, 5 9's of availability can be achieved with a system design that fails frequently, multiple times a year, but whose failures are limited to very small periods of time. Conversely, 5 9's of availability might mean a server design with a very large MTBF, but where a given server takes a fairly long time to recover from the very rare outage.
[Figure: 5 9's of availability. Average time down per failure (minutes) plotted against mean time between failures (years).]
The figure above shows that 5 9’s of availability can be achieved with systems that fail
frequently for miniscule amounts of time, or very infrequently with much larger downtime per
failure.
The figure is misleading in the sense that servers with low reliability are likely to have many
components that, when they fail, take the system down and keep the system down until repair.
Conversely servers designed for great reliability often are also designed so that the systems, or at
least portions of the system can be recovered without having to keep a system down until
repaired.
Hence, in practice, systems with a low MTBF also tend to have longer repair times, and a system with 5 9's of availability would therefore be synonymous with a high level of reliability.
However, in quoting an availability number, there needs to be a good description of what is
being quoted. Is it only concerning unplanned outages that take down an entire system? Is it
concerning just hardware faults, or are firmware, OS and application faults considered?
Are applications even considered? If they are, if multiple applications are running on the server,
is each application outage counted individually? Or does one event causing multiple application
outages count as a single failure?
If there are planned outages to repair components either delayed after an unplanned outage, or
predictively, is that repair time included in the unavailability time? Or are only unplanned
outages considered?
Perhaps most importantly, when reading that a certain company achieved 5 9's of availability for an application, it matters whether that number counted application availability running in a standalone environment, or whether it was a measure of application availability in systems that might have a failover capability.
The table below gives an illustrative example (outage reason; mean time to outage in years; recovery minutes per incident; total minutes down per year; associated availability):
- Fault limited to application: 3 years; 7.00 minutes per incident; 2.33 minutes down per year; 99.99956% availability.
- Fault causing OS crash: 10 years; 11.00 minutes; 1.10 minutes per year; 99.99979%.
- Fault causing hypervisor crash: 80 years; 16.00 minutes; 0.20 minutes per year; 99.99996%.
- Fault impacting the system (crash), but the system recovers on reboot with enough resources to restart the application: 80 years; 26.00 minutes; 0.33 minutes per year; 99.99994%.
- Planned hardware repair for a hardware fault (where the initial fault impact could be any of the above): 70 years; 56.00 minutes; 0.80 minutes per year; 99.99985%.
(Each successive outage class requires additional recovery activities, which is reflected in the longer recovery time per incident.)
The example presumes somewhat longer recovery for the non-enterprise hardware due to the
other kinds of real-world conditions described in terms of parts acquisition, error detection/fault
isolation (ED/FI) and so forth.
Though these examples presume too much to be specifically applicable to any given customer
environment, they are intended to illustrate two things:
• The less frequently the hardware fails, the better the ideal availability, and the less perfect
the clustering must be to achieve the desired availability.
• If the clustering and failover support elements themselves have bugs, pervasive issues, or
single points of failure besides the server hardware, less than 5 9's of availability (with
reference to hardware faults) may still occur in a clustered environment. It is possible that
availability might be worse in those cases than in a comparable stand-alone environment.
Clustering Resources
One of the obvious disadvantages of running in a clustered environment, as opposed to a
standalone system environment, is the need for additional hardware to accomplish the task.
An application running full-throttle on one system, prepared to fail over to another, needs to have
comparable capacity (available processor cores, memory and so forth) on that other system.
There does not need to be exactly one backup server for every server in production, however. If
multiple servers are used to run workloads, then only a single backup system with enough
capacity to handle the workload of any one server might be deployed.
Alternatively, if multiple partitions are consolidated on multiple servers, then presuming that no
server is fully utilized, failover might be planned so that the partitions of a failing server are
restarted across multiple different servers.
When an enterprise has sufficient workload to justify multiple servers, either of these options
reduces the overhead for clustering.
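As a rough illustration of the capacity overhead involved, the sketch below compares a dedicated backup server with spreading a failed server's partitions across the surviving servers. It is a simplification under stated assumptions (uniform utilization, a single failure at a time); the function names are not from any IBM sizing tool.

def overhead_dedicated_backup(n_production_servers):
    # Fraction of total capacity held in reserve with one dedicated backup.
    return 1 / (n_production_servers + 1)

def headroom_distributed_failover(n_servers, utilization):
    # Per-server spare capacity needed so that the partitions of any one
    # failed server can be restarted across the remaining n-1 servers.
    return utilization / (n_servers - 1)

print(overhead_dedicated_backup(4))                       # 0.2 -- 20% idle reserve
print(round(headroom_distributed_failover(5, 0.60), 2))   # 0.15 -- 15% headroom each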
[Figure: Clustered configurations in which each server hosts an application layer, a hypervisor/virtualization layer, computational hardware, and virtualized I/O servers with LAN and SAN adapters, and the servers share storage through a SAN.]
*Where “down” and availability refer to the service period provided, not the customer application.
In understanding the service level agreement, what the “availability of the service” means is
critical to understanding the SLA.
Presuming the service is a virtual machine consisting of certain resources (processor
cores/memory) these resources would typically be hosted on a server. Should a failure occur
which terminates applications running on the virtual machine, depending on the SLA, the
resources could be switched to a different server.
If switching the resources to a different server takes no more than 4.38 minutes and there is no
more than a single failure in a month, then the SLA of 99.99% would be met for the month.
However, such an SLA might take no account of how disruptive the failure to the application
might be. While the service may be down for a few minutes it could take the better part of an
hour or longer to restore the application being hosted. While the SLA may say that the service
achieved 99.99% availability in such a case, application availability could be far less.
Consider the case of an application hosted on a virtual machine (VM) with 99.99% availability
for the VM. To achieve the SLA, the VM would need to be restored in no more than
about 4.38 minutes. This typically means being restored on a backup system.
If the application takes 100 minutes to recover after a new VM is made available (for example),
the application availability would be more like 99.76% for that month.
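A short sketch of the arithmetic behind this example, assuming a month of roughly 43,800 minutes (365/12 days) and treating the recovery times as illustrative:

MINUTES_PER_MONTH = 43_800  # roughly 365/12 days

def monthly_availability(downtime_minutes):
    return 1 - downtime_minutes / MINUTES_PER_MONTH

# Service-level view: the VM is restored on a backup server in ~4.38 minutes.
print(f"{monthly_availability(4.38):.4%}")          # 99.9900% -- the SLA is met
# Application view: add ~100 minutes of application recovery on the new VM.
print(f"{monthly_availability(4.38 + 100):.2%}")    # 99.76% -- what the user sees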
The purpose of serviceability is to efficiently repair the system while attempting to minimize or
eliminate impact to system operation. Serviceability includes new system installation,
Miscellaneous Equipment Specification (MES) which involves system upgrades/downgrades,
and system maintenance/repair. Based on the system warranty and maintenance contract, service
may be performed by the client, an IBM representative, or an authorized warranty service
provider.
The serviceability features delivered in IBM Power systems ensure a highly efficient service
environment by incorporating the attributes described in the sections that follow.
Service Environment
In the PowerVM environment, the HMC is a dedicated server that provides functions for
configuring and managing servers, for either logical partitions or a full-system partition, using a
GUI, command-line interface (CLI), or REST API. An HMC attached to the system enables
support personnel (with client authorization) to log in remotely or locally to review error logs and
perform remote maintenance if required.
• HMC Attached - one or more HMCs or vHMCs are supported by the system with
PowerVM. This is the default configuration for servers supporting logical partitions with
dedicated or virtual I/O. In this case, all servers have at least one logical partition.
• HMC-less - There are two service strategies for non-HMC-managed systems.
1. Full-system partition with PowerVM: A single partition owns all the server
resources and only one operating system may be installed. The primary service
interface is through the operating system and the service processor.
2. Partitioned system with NovaLink: In this configuration, the system can have
more than one partition and can be running more than one operating system. The
primary service interface is through the service processor.
Service Interface
Support personnel can use the service interface to communicate with the service support
applications in a server using an operator console, a graphical user interface on the management
console or service processor, or an operating system terminal. The service interface helps to
deliver a clear, concise view of available service applications, helping the support team to
manage system resources and service information in an efficient and effective way.
Different service interfaces are used, depending on the state of the system, hypervisor, and
operating environment. The primary service interfaces are:
• LEDs
• Operator Panel
• BMC Service Processor menu
• Operating system service menu
• Service Focal Point on the HMC or vHMC with PowerVM
In the light path LED implementation, the system can clearly identify components for
replacement by using specific component-level LEDs and can also guide the servicer directly to
the component by signaling (turning on solid) the enclosure fault LED, and component FRU
fault LED. The servicer can also use the identify function to blink the FRU-level LED. When
this function is activated, a roll-up to the blue enclosure identify LED will occur to identify an
enclosure in the rack. These enclosure LEDs will turn on solid and can be used to follow the
light path from the enclosure and down to the specific FRU in the PowerVM environment.
FFDC information, error data analysis, and fault isolation are necessary to implement the
advanced serviceability techniques that enable efficient service of the systems and to help
determine the failing items.
In the rare absence of FFDC and Error Data Analysis, diagnostics are required to re-create the
failure and determine the failing items.
Diagnostics
General diagnostic objectives are to detect and identify problems so they can be resolved
quickly. The elements of IBM's diagnostics strategy are to:
• Provide a common error code format equivalent to a system reference code with
PowerVM, system reference number, checkpoint, or firmware error code.
• Provide fault detection and problem isolation procedures. Support remote connection
capability that can be used by the IBM Remote Support Center or IBM Designated
Service.
• Provide interactive intelligence within the diagnostics, with detailed online failure
information, while connected to IBM's back-end system.
Stand-alone Diagnostics
As the name implies, stand-alone or user-initiated diagnostics requires user intervention. The
user must perform manual steps, which may include:
Concurrent Maintenance
The determination of whether a firmware release can be updated concurrently is identified in the
readme file that is released with the firmware. An HMC is required for concurrent firmware
updates with PowerVM. In addition, as discussed in more detail in other sections of this
document, concurrent maintenance of PCIe adapters and NVMe drives is supported with
PowerVM. Power supplies, fans, and the operator panel LCD are hot-pluggable as well.
Service Labels
Service providers use these labels to assist them in performing maintenance actions. Service
labels are found in various formats and positions and are intended to transmit readily available
information to the servicer during the repair process. Following are some of these service labels
and their purpose:
• Location diagrams: Location diagrams are located on the system hardware, relating
information regarding the placement of hardware components. Location diagrams may
include location codes, drawings of physical locations, concurrent maintenance status, or
other data pertinent to a repair. Location diagrams are especially useful when multiple
components such as DIMMs, processors, fans, adapter cards, and power supplies are
installed.
• Remove/replace procedures: Service labels that contain remove/replace procedures are
often found on a cover of the system or in other spots accessible to the servicer. These
labels provide systematic procedures, including diagrams detailing how to remove or
replace certain serviceable hardware components.
• Arrows: Numbered arrows are used to indicate the order of operation and the
serviceability direction of components. Some serviceable parts, such as latches, levers,
and touch points, need to be pulled or pushed in a certain direction and in a certain order.
• QR labels: QR labels are placed on the system to provide access to key service functions
through a mobile device. When a QR label is scanned, it directs the user to a landing page for
Power10 processor-based systems. The landing page contains links to the service functions for
each machine type and model (MTM) and is useful to a servicer or operator physically located
at the machine. The service functions include items such as installation and repair instructions,
reference code look-up, and so on.
• Color coding (touch points): Blue-colored touch points indicate where a component can be
safely handled during service actions such as removal or installation.
• Tool-less design: Selected IBM systems support tool-less or simple-tool designs. These
designs require no tools, or only simple tools such as flat-head screwdrivers, to service the
hardware components.
• Positive retention: Positive retention mechanisms help to assure proper connections
between hardware components such as cables to connectors, and between two cards that
attach to each other. Without positive retention, hardware components run the risk of
becoming loose during shipping or installation, preventing a good electrical connection.
Positive retention mechanisms like latches, levers, thumbscrews, pop Nylatches (U-
clips), and cables are included to help prevent loose connections and aid in installing
(seating) parts correctly. These positive retention items do not require tools.
When an HMC is attached in the PowerVM environment, an ELA routine analyzes the error,
forwards the event to the Service Focal Point (SFP) application running on the HMC, and
notifies the system administrator that it has isolated a likely cause of the system problem. The
service processor event log also records unrecoverable checkstop conditions, forwards them to
the SFP application, and notifies the system administrator.
The system can also call home through the operating system to report platform-recoverable
errors and errors associated with PCIe adapters and devices.
In the HMC-managed environment, a call home service request is initiated from the HMC,
and the pertinent failure data, with service parts information and part locations, is sent to the
IBM service organization.
Call Home
Call home refers to an automatic or manual call from a client location to the IBM support
structure with error log data, server status, or other service-related information. Call home
invokes the service organization so that the appropriate service action can begin. Call home
can be done through the Electronic Service Agent (ESA) embedded in the HMC, through a
version of ESA embedded in the operating systems for non-HMC-managed systems, or through a
version of ESA that runs as a standalone call home application. While configuring call home is
optional, clients are encouraged to implement this feature in order to obtain service enhancements
such as reduced problem determination time and faster, potentially more accurate transmittal of
error information. In general, using the call home feature can result in increased system
availability. See the next section for specific details on this application.
The client support portal provides valuable reports of installed hardware and software using
information collected from the systems by IBM Electronic Service Agent. Reports are available
for any system associated with the customer's IBM ID.
For more information on how to use the client support portal, visit the Client Support Portal
website or contact an IBM Systems Services Representative (SSR).
Investing in RAS
Systems designed for RAS may be more costly at the “bill of materials” level than systems with
little investment in RAS.
Some examples as to why this could be so:
In terms of error detection and fault isolation: simplified, at the low level, an 8-bit bus
takes a certain number of circuits. Adding an extra bit to detect a single fault adds hardware to
the bus. In a classic Hamming code, 5 bits of error-checking data are needed to provide
single-bit error correction for 15 bits of data, with an additional parity bit required for
double-bit error detection. Then there is the logic involved in generating the error-detection
bits and checking for and correcting errors.
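The overhead can be estimated from the standard Hamming bound: r check bits can protect m data bits for single-bit correction when 2^r >= m + r + 1, with one more overall parity bit added for double-bit detection. The short sketch below is illustrative only; it is not how any particular IBM ECC implementation is coded.

def sec_check_bits(data_bits):
    # Check bits needed for single-bit error correction (Hamming bound).
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

def sec_ded_check_bits(data_bits):
    # One additional overall parity bit provides double-bit error detection.
    return sec_check_bits(data_bits) + 1

print(sec_check_bits(15))      # 5 check bits for single-bit correction
print(sec_ded_check_bits(15))  # 6 bits once double-bit detection is added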
In some cases, better availability is achieved by having fully redundant components, which more
than doubles the cost of the components, or by having some amount of n+1 redundancy or
sparing, which still adds cost at a somewhat lesser rate.
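To see why duplication can pay for itself despite the cost, assume, purely for illustration, that failures are independent and failover is perfect; the unavailability of a redundant pair is then the product of the individual unavailabilities. The function below is a sketch under those assumptions, not a model of any specific IBM subsystem.

def redundant_availability(component_availability, copies=2):
    # Availability of a set of fully redundant copies, assuming independent
    # failures and perfect failover (an idealization).
    unavailability = 1 - component_availability
    return 1 - unavailability ** copies

print(f"{redundant_availability(0.999):.6%}")   # 99.999900% from a 99.9% part

N+1 redundancy or sparing provides less protection than full duplication, but at a lower cost; its exact benefit depends on how many of the N+1 parts must be working at once.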
In terms of reliability, highly reliable components will cost more. This may be true of the
intrinsic design and the materials used, including the design of connectors, fans, and power
supplies. Increased reliability in the way components are manufactured can also increase costs.
Extensive test time in manufacturing, and a process to "burn in" parts and screen out weak
modules, increase costs. The highest levels of part reliability may be achieved by rejecting entire
lots, even the good components in them, when the overall failure rates for a lot are excessive.
All of these increase the cost of the components.
Design for serviceability, especially for concurrent maintenance, is typically more involved than
a design where serviceability is not a concern. This is especially true when designing, for
example, for concurrent maintenance of components like I/O adapters.
Beyond the hardware costs, it takes development effort to code software to take advantage of the
hardware RAS features, and additional time to test the many "bad-path" scenarios that can
be envisioned.
On the other hand, in all systems, scale-up and scale-out, investing in system RAS has a purpose.
Just as there are recurring costs for software licenses in most enterprise applications, there are
recurring costs associated with maintaining systems. These include direct costs, such as the
cost of replacement components and the cost of the labor required to diagnose and repair a
system.
The somewhat more indirect costs of poor RAS are often the main reasons for investing in
systems with superior RAS characteristics, and over time these have become even more important
to customers. The importance is often directly related to:
• The importance of discovering errors before relying on faulty data or computation,
including the ability to know when to switch over to redundant or alternate resources.
• The costs associated with downtime for problem determination or error re-creation, if
insufficient fault isolation is provided in the system.
• The cost of downtime when a system fails unexpectedly or needs to fail over, including
disruption to applications during the failover process.
• The costs associated with planning an outage for the repair of hardware or firmware,
especially when the repair is not concurrent.
• In a cloud environment, the operations cost of server evacuation.
In a well-designed system, investing in RAS minimizes the need to repair components that are
failing. Systems that recover, rather than crash and need repair, when certain soft errors occur
likewise reduce both repair costs and the cost of unplanned downtime.
Final Word
The POWER9 and Power10 processor-based systems discussed here leverage the long heritage of
Power Systems designed for RAS. The different servers aimed at scale-up and scale-out
environments provide significant choice in selecting servers geared toward the application
environments end users will deploy. The RAS features in each segment differ, but in each case
they provide substantial advantages compared to designs with less of an up-front RAS focus.
Daniel Henderson is an IBM Senior Technical Staff Member. He has been involved with
POWER and predecessor RISC based products development and support since the earliest RISC
systems. He is currently the lead system hardware availability designer for IBM Power Systems
PowerVM based platforms.
Irving Baysah is a Senior Hardware Development Engineer with over 25 years of experience
working on IBM Power systems. He designed Memory Controller and GX I/O logic on multiple
generations of IBM Power processors. As a system integration and post-silicon validation
engineer, he successfully led the verification of complex RAS functions and system features. He
has managed a system cluster automation team that developed tools for rapid deployment in a
private cloud, the PCIe Gen3/4/5 I/O bring-up team, and the Power RAS architecture team. He is
currently the lead RAS Architect for IBM Power Systems.