IBM Versatile Storage Server
http://www.redbooks.ibm.com
SG24-2221-00
International Technical Support Organization
August 1998
Take Note!
Before using this information and the product it supports, be sure to read the general information in
Appendix A, “Special Notices” on page 361.
This edition applies to the IBM Versatile Storage Server storage subsystem. See the PUBLICATIONS section of
the IBM Programming Announcement for IBM Versatile Storage Server for more information about what
publications are considered to be product documentation.
Note
This book is based on a pre-GA version of a product and may not apply when the product becomes generally
available. We recommend that you consult the product documentation or follow-on versions of this redbook
for more current information.
When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any
way it believes appropriate without incurring any obligation to you.
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
The Team That Wrote This Redbook . . . . . . . . . . . . . . . . . . . . . . . . xvii
Comments Welcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Current Open Storage Disk Products . . . . . . . . . . . . . . . . . . . . . . . . . 5
IBM 7204 External Disk Drive . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
IBM 7131-105, SCSI Mixed Media . . . . . . . . . . . . . . . . . . . . . . . . . 5
IBM 7027 High Capacity Storage Drawer . . . . . . . . . . . . . . . . . . . . . 5
IBM 7131-405 SSA High Performance . . . . . . . . . . . . . . . . . . . . . . . 6
IBM 7133, Model 020 and Model 600 . . . . . . . . . . . . . . . . . . . . . . . 6
IBM 7135 RAIDiant Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
IBM 7137 RAID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Customer Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Centralized storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Storage partitioning with heterogeneous environments . . . . . . . . . . . . 9
Data sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Easy user access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
High Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Investment Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Data and File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Classes of Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Storage Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Data Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Storage Infrastructure Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 15
Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Versatile Storage Server Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Thin Film Disks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Substrate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Recording layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Protection layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Landing zone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
MR Head Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Separate read and write elements . . . . . . . . . . . . . . . . . . . . . . . . 89
Magnetic recording and reading process . . . . . . . . . . . . . . . . . . . . 90
PRML Read Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Peak detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Viterbi detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Zoned Bit Recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
ID Sector Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
MR Head Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Rotary actuator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
ID sector format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
MR Head Effect on ID Sector Format . . . . . . . . . . . . . . . . . . . . . . . . . 99
No-ID Sector Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
No-ID Sector Format ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
RAM-based, servo-generated sector IDs . . . . . . . . . . . . . . . . . . . . 101
Other advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Predictive Failure Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Predictive Failure Analysis ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Error logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Channel calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Disk sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Ultrastar 2XP in SSA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 107
Three-way router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Versatile Storage Server 524-Byte Sector Format . . . . . . . . . . . . . . . . 108
Storage server generates cache statistics . . . . . . . . . . . . . . . . . . . 140
Algorithm adapts to changing data access patterns . . . . . . . . . . . . . 141
Algorithm not biased by sequential access . . . . . . . . . . . . . . . . . . 141
Input/Output Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Random reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Sequential reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Random Read Cache Hit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Cache directory searched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Transfer from Storage Server cache to host . . . . . . . . . . . . . . . . . . 144
Random Read Cache Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Cache directory searched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Read from SSA disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Sequential Read . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Cache hit possible if data recently written . . . . . . . . . . . . . . . . . . . 147
Data prestaging begins when sequential access is detected . . . . . . . . 147
Storage server overlaps prestage with host access . . . . . . . . . . . . . 148
Data read sequentially preferentially destaged from storage server cache 148
Fast Write . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Fast Write bypass when appropriate . . . . . . . . . . . . . . . . . . . . . . 149
Optimum data availability − three copies of data . . . . . . . . . . . . . . 149
Data destaged asynchronously from Fast Write Cache . . . . . . . . . . . 150
RAID-5 Write Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Update writes subject to RAID-5 write penalty . . . . . . . . . . . . . . . . 151
Fast write masks the write penalty on update writes . . . . . . . . . . . . . 152
Data Integrity for RAID-5 Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
RAID-5 update write integrity challenge . . . . . . . . . . . . . . . . . . . . 153
Fast Write Cache assists parity integrity . . . . . . . . . . . . . . . . . . . . 153
Parity regenerated if SSA adapter cache malfunctions . . . . . . . . . . . 154
Fast Write Cache Management . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Asynchronous destage to disk . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Fast Write Cache managed by threshold . . . . . . . . . . . . . . . . . . . . 155
Write preempts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Stripe Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
RAID-5 write penalty avoidance . . . . . . . . . . . . . . . . . . . . . . . . . 157
Sequential write throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Write Preempts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Update write to a particular block . . . . . . . . . . . . . . . . . . . . . . . . 159
Fast Write Hits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Multiple updates processed by single destage . . . . . . . . . . . . . . . . 159
Write Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Adjacent blocks destaged together . . . . . . . . . . . . . . . . . . . . . . . 160
Fast Write Cache Destage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Fast Write Cache destage triggered by threshold . . . . . . . . . . . . . . . 161
Other destage triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Destage from SSA adapter cache . . . . . . . . . . . . . . . . . . . . . . . . 162
Writes from Fast Write Cache to Disk . . . . . . . . . . . . . . . . . . . . . . . 163
Destage uses hierarchy of VSS storage . . . . . . . . . . . . . . . . . . . . 163
Data removed from Fast Write Cache following destage . . . . . . . . . . 163
SSA adapter cache copy retained to improve hit rates . . . . . . . . . . . 164
Transfers from Storage Server Cache to SSA Adapter . . . . . . . . . . . . . 165
VSS data transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Fast Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Stripe Write to SSA Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Bypass Fast Write Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Connectivity Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Number of hosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Volume of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Data rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
I/O rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Backup requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Host Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Advanced software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
SMP Cluster Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Read cache size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Standard features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Disk Adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Number of adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
RAID Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Maximum Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Disk Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
State-of-the-art technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
High performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Logical Volume Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
VSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Multiple access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
2105-B09 Storage Server Rack . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Disk storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
2105-100 expansion rack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Power supply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Disk drawers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
SSA cables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Investment Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Versatile Storage Server Maximum Configuration . . . . . . . . . . . . . . . . 212
Power Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Redundant power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Disk drawer power cords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Optional battery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Service indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
VSS Enclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Configuration management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Intranet access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Web interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Configuration Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Number of hosts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Volume of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Data rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
I/O rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
SSA adapter nonvolatile storage . . . . . . . . . . . . . . . . . . . . . . . . 255
Total System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Disk subsystem performance as part of overall system performance . . . 256
System performance considerations . . . . . . . . . . . . . . . . . . . . . . 256
I/O Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Front end I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Back end I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Guidelines for Configuration—Overview . . . . . . . . . . . . . . . . . . . . . . 260
Storage server cache size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Number of SCSI ports per host . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Choice of 4.5 or 9.1 GB disk drives . . . . . . . . . . . . . . . . . . . . . . . 261
Number of RAID-5 arrays per SSA loop . . . . . . . . . . . . . . . . . . . . 261
Storage Server Cache Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Storage server cache size determined by two factors . . . . . . . . . . . . 262
Host Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Effective Use of Host Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Effectiveness of host caching depends on several factors . . . . . . . . . 264
Host Caching Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
UNIX file systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Database management systems . . . . . . . . . . . . . . . . . . . . . . . . . 267
DB2 parallel edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Oracle Parallel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Subsystem Effect of Host Caching . . . . . . . . . . . . . . . . . . . . . . . . . 268
Where host caching is highly effective . . . . . . . . . . . . . . . . . . . . . 268
Where host caching is not highly effective . . . . . . . . . . . . . . . . . . . 269
Storage Server Cache Size Guidelines—Part I . . . . . . . . . . . . . . . . . . 270
Host caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Storage Server Cache Size Guidelines—Part II . . . . . . . . . . . . . . . . . . 272
Where host caching is effective . . . . . . . . . . . . . . . . . . . . . . . . . 272
Storage Server Cache Size Guidelines—Part III . . . . . . . . . . . . . . . . . 274
Where host caching is effective . . . . . . . . . . . . . . . . . . . . . . . . . 274
Four-Way SMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Where both storage servers are four-way SMPs . . . . . . . . . . . . . . . 276
Rules of thumb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Consider Storage server capability during failover . . . . . . . . . . . . . . 277
Versatile Storage Server SCSI Ports . . . . . . . . . . . . . . . . . . . . . . . . 278
Emulates multiple LUNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Supporting multiple SCSI initiators . . . . . . . . . . . . . . . . . . . . . . . 279
UltraSCSI adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Throughput considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Number of SCSI Ports per Host . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
VSS SCSI adapter throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Consider high-availability SCSI connection . . . . . . . . . . . . . . . . . . 280
Multiple SCSI attachment options . . . . . . . . . . . . . . . . . . . . . . . . 281
Consider virtual disk partitioning . . . . . . . . . . . . . . . . . . . . . . . . 281
Disk Capacity Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Disk specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Access density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Disk Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Disk can perform 50 I/Os per second . . . . . . . . . . . . . . . . . . . . . . 283
Access Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
I/Os per second per gigabyte . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Selection of capacity depends on access density . . . . . . . . . . . . . . . 284
Rule of thumb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Select 4.5 GB disk drives where appropriate . . . . . . . . . . . . . . . . . . 285
Chapter 9. Versatile Storage Server Maintenance . . . . . . . . . . . . . . . . 309
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Repair Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Overview ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Sparing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Upgrades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Code EC management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
VS Specialist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
HTML browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Status screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Onsite Maintenance Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
ASCII terminal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Character based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Remote Support Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Call home . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Support Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Reporting − Error Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Error log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Error log analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Problem record generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Reporting − SNMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Simple network management protocol . . . . . . . . . . . . . . . . . . . . . 322
Two management information bases . . . . . . . . . . . . . . . . . . . . . . 322
Reporting − E-mail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Reporting − Call Home . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Remote service facility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Information provided. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Repair Actions − Customer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Console specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Logical configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Limited physical configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Limited repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Subsystem code EC management . . . . . . . . . . . . . . . . . . . . . . . . 328
Repair Actions − CE and PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Service procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Code EC Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Supported interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Code EC process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
Release process and media types . . . . . . . . . . . . . . . . . . . . . . . . 332
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Preface
Dave McAuley is the project leader for open systems disk storage products at
the International Technical Support Organization, San Jose Center. He has
written eight ITSO redbooks and several international conference papers, among
them a paper on image processing and pattern recognition in IBM's Journal of
Research and Development. Dave teaches IBM classes worldwide on all areas
of IBM disk storage products. Before joining the ITSO in 1994, he worked in the
product support center in Scotland as a large systems storage specialist. Dave
has worked at IBM for 19 years, with international career assignments in
manufacturing, design and development, and marketing. His background in
computer science spans a varied portfolio including supercomputing application
development, process control and image analysis systems design and microcode
development, automation and robotics systems project management, and large
systems marketing and technical support.
Pat Blaney is a Technical Support Rep for Storage Systems with Advanced
Technical Support in San Jose, California, USA. He joined IBM in 1977 as a
Systems Engineer in New York and moved to San Jose in 1982, working in
internal I/S for the San Jose plant. In 1987, he became the product planner for
the 3390 and later worked on the introduction of the 3990-3 Extended Platform
and the 3990-6. He joined what is now Advanced Technical Support in 1993
when the support mission for storage systems moved from Gaithersburg to San
Jose. He has worked with all members of the RAMAC Array Family, most
recently with the introduction of the RAMAC Virtual Array and the RAMAC
Barry Mellish is a Senior Storage Specialist in the UK. He has worked for IBM
for the last 14 years. Barry joined IBM as a Property Services Engineer
responsible for IBM premises in Central London. Barry moved into system
engineering 10 years ago, initially working on mid-range systems, and started
specializing in the IBM 6150, the forerunner of today's RS/6000. He joined the AIX
Business Unit when it was set up following the launch of the RS/6000 in 1990.
Barry has worked extensively with Business Partners and Systems Integrators,
providing technical support for systems design. Over the last two years he has
specialized in storage and storage systems, joining SSD EMEA when it was set
up in January 1997. He is currently a member of the UK technical support
group specializing in Open System Storage Solutions.
Mark Blunden is the project leader for Open Systems at the International
Technical Support Organization, San Jose Center. He has coauthored four
previous redbooks and teaches IBM classes worldwide on all areas of Storage.
Mark has worked for IBM for 18 years in many areas of the IT business. Before
joining the ITSO in 1998, Mark worked in Sydney, Australia as an Advisory
Storage Specialist.
Thanks to the following people for their invaluable contributions to this project:
Thanks are also due to the many other collaborators and reviewers who
contributed to the production of this document.
Comments Welcome
Your comments are important to us!
Chapter 1. Introduction
IBM 7131-405 SSA High Performance
This provides up to 45.5 GB of disk storage using IBM's SSA hard disk drives.
The disk drives are housed in five hot-swappable bays and can be a mixture of
2.2 GB, 4.5 GB and 9.1 GB drives.
Customer Requirements
Customer requirements have evolved in response to the explosive growth in
storage requirements brought about by:
• Network computing
• Online storage of voice, image, and video data
• Data mining and data warehousing
• Data collection and point-of-sale terminals
Centralized storage
During the past 10 years, one of the main goals within the information
technology organization was the decentralization of data. The idea was to place
the data as close to the user as possible. Decentralization was a result of the
introduction of client/server computing and advancements in technology that
reduced the size and cost of computers and peripheral devices while increasing
performance. With this decentralization of processing power have come added
costs in managing installations and networks. Now the direction for many
customers is to recentralize, while the processing power remains separate,
whether in the same room or remotely.
Data sharing
Many customers need to share data among several servers. There is confusion
as to exactly what is meant by data sharing and there are problems at a
technical level such as differing file formats and data structures. These are
discussed in the next four foils.
High Availability
Data is a key business resource. As business becomes more reliant on
computer systems for all its activities, it is becoming essential that business data
be available at all times. With the greater advent of globalization, more and
more companies are trading 24 hours a day. The window for data repair and
maintenance is becoming smaller and thus the emphasis is on high availability
or highly resilient and available systems with built-in redundancy. There are
three main components:
• Protection against data loss
• Protection against loss of data integrity
• Protection against loss of data availability
Any failures in these areas will cause downtime on customer computer systems
with possible adverse business effects.
Investment Protection
With a growing investment in storage, customers want to be able to retain
existing storage systems and incorporate them into the overall strategy.
Data and File Systems
This foil shows the types of data and file systems that are associated with the
main classes of host servers. For sharing a true single copy of data, the data
will have to be stored in such a way that it can be presented to each different
host in the manner it expects. The storage system will have to present different
views of the same piece of data to each host.
• S/390 and AS/400 use EBCDIC data format while most other platforms use
ASCII.
• Data is stored in completely different structures. For example, MVS uses
data sets and catalogs and UNIX uses file systems with directories.
Intel-based machines also use file systems and directories, but they use the
file allocation table (FAT) file system and have a different file system naming
convention that uses “\” instead of the UNIX “/.”
• The methods of storing data on disks are different. For example, MVS uses
extended count key data (ECKD) and UNIX, AS/400, and Intel use fixed block
architecture (FBA).
• Attachment protocols from the host to the disk subsystem are different. For
example, MVS uses ESCON, while UNIX, AS/400, and Intel platforms use SCSI.
Storage
With storage sharing, there is a common pool of storage. This is divided into
parts dedicated to individual servers. No part of the data is common. The disk
controller is shared. The effect is that each host has its own “disks,” although it
is possible to reassign disks from one host to another.
Copy
Copy sharing is the most common type of sharing used today. One host creates
or modifies the data, which is then broadcast to all other hosts that need to
access that data. The mechanism for transferring the data in a UNIX
environment is usually either file transfer protocol (FTP) or a database
replication service. Mainframe storage systems have a remote copy facility that
enables distant systems to have a read copy of data.
Data
With data sharing, two or more hosts have direct access to the data and they all
have the ability to create and modify data. Mechanisms have to be put in place
to ensure the integrity of the data.
Data Sharing
In true data sharing, the physical or logical disk is assigned to more than one
host. Procedures must be in place to prevent a second host from reading or
writing data that is already being updated by another host. This locking is
generally provided by the application such as Oracle Parallel Server.
Growth
As applications evolve, transaction workloads and storage capacity
requirements can grow explosively and unpredictably. To enhance value, many
applications are designed to be platform and vendor independent. Flexible
storage systems must address this need for independence while offering
granular capacity growth and the ability to move to newer, faster technologies as
they become available for host adapters and storage controllers.
Access
The exponential growth of the Internet and client/server applications has led
many organizations to rapidly adopt the network computing model. As mobile
employees, business partners, and customers demand global access to data,
storage systems must provide heterogeneous attachment for multiple platforms
and address the requirements to share data across those platforms. Storage
systems must also be flexible enough to easily incorporate new types of host
attachments.
Management
The cost of managing many widely distributed servers can be very high. To
reduce costs, the storage system must provide the option of remote
management and automated operations in most computing environments, giving
the flexibility to locate staff at the most cost effective location.
Movement
Data is being moved in large quantities over networks for many purposes, such
as backup and archive, disaster recovery, and data warehousing. With the rapid
growth in the quantity and types of data that must be moved, data movement can
quickly become a bottleneck in the network environment. To prevent data
transfer from inhibiting growth, the storage system should provide the capability
for automated data movement and management that is transparent to the
platform and common to all heterogeneous environments.
Security
Enterprise data is a vital corporate asset. Storage systems must address data
availability and security concerns. The importance of today's data demands
continued innovation in self-diagnosis and error correction, remote monitoring,
performance monitoring, automatic reconfiguration, and tamper-proof access.
All this has to be provided in an affordable, auditable solution.
• Provides an easy-to-use web-based browser interface, so that configuring
and maintaining the system is as simple as possible.
• Offers built-in protection against data loss, and loss of data integrity.
• Can maintain access to the data under almost all circumstances.
• Through its “plug and play” I/O architecture, components can easily be
changed as technology changes without disruption. The ability to use
existing 7133 disk drawers and racks demonstrates IBM's commitment to
protecting customer investment in storage.
VSS uses existing RISC processor boards and chips, SSA disks and adapters,
and racks and power supplies from ES/390 servers, coupled with the resources
and expertise gained from more than 40 years of making disk subsystems.
VSS has two interface types: a web-based browser interface, known as the IBM
StorWatch Versatile Storage Specialist (VS Specialist for short), for user
configuration and control of the subsystem, and a serial attach interface for use
by service personnel. Additionally, a modem connection can be made for remote
diagnostics and download of fixes.
For a business requiring storage of large amounts of data using a single host or
all RS/6000 hosts, other storage solutions should be considered. For the
homogeneous RS/6000 environment, native attached 7133 SSA disks should be
considered, especially for stand-alone servers, or small clusters. For UNIX
environments where a common storage pool is not required, 7133 SSA disks are also worth considering.
Chapter 2. Versatile Storage Server Architecture
The VSS is one of the Seascape family of products: integrated storage servers
used to attach storage devices to various host computer systems. IBM's new
Seascape architecture helps organizations
implement a simplified, yet flexible storage infrastructure that helps them derive
the most value from their data assets. Seascape includes integrated storage
solutions based on a set of common building blocks, interchangeable
components that can be easily matched to changing storage requirements. With
excellent flexibility, Seascape helps provide a reliable storage infrastructure that
can deliver performance, scalability, affordability and ease of management today
and in the future.
RS/6000 processor
A rack-mounted RS/6000 Model R20 forms the central control unit of Netstore. It
comes with the AIX operating system preinstalled and preconfigured.
SSA disks
The minimum configuration includes 16 x 4.5 GB SSA disks to give 72 GB of
storage. This can be increased to 144 GB.
WebShell client
The ADSM WebShell client and server code support functions such as backup,
client code distribution, and a “help desk” function, all run over a company's
intranet.
The IBM 3466 Network Storage Manager brings to the customer a complete
packaged solution for managing and protecting data in a network.
The function of the Web Cache Manager is to store web objects locally, so that
multiple requests from the user community do not consume bandwidth
needlessly. In performing this function, the Web Cache Manager reduces the
bandwidth needs for connecting those users to the Internet backbone or an
upstream access provider. It provides the capability to cache objects that flow
under the HTTP and the FTP protocols. Because this product has integrated the
IBM Web Traffic Express software, it can also proxy, or screen, Internet requests
of end users.
RS/6000 processor
A rack-mounted RS/6000 forms the central control unit. It comes with the AIX
operating system preinstalled and preconfigured.
SSA disks
The 7133 disk subsystem uses high-speed Serial Storage Architecture (SSA) for
disk I/O.
ADSM/HSM
ADSM/HSM transparently moves the older and larger web objects from disk to
the tape library. This migration of objects frees up the disk space for smaller,
more popular objects.
The cache can hold hundreds of virtual volumes. The content of the cache is
managed to retain the most recently accessed virtual volumes so that numerous
subsequent mount requests can be satisfied very quickly from the cache, similar
to a cache request on DASD. If a requested volume is not present in the cache,
the required Magstar 3590 cartridge is mounted, and the logical volume is
moved back into the cache from a stacked volume.
Stacked volumes
Stacked volumes are Magstar 3590 volumes that contain several logical volumes.
After a virtual volume is unloaded, it is copied onto a stacked volume. The virtual
volume then remains in the cache until its space is required for another virtual
volume. The content of stacked volumes is managed by the Magstar Virtual Tape
Server such that partially full stacked volumes are consolidated to free up space.
3490E emulation
From a host perspective, data is processed as if it resides on actual devices and
cartridges. This representation of tape devices enables transparent use of
Magstar tape technology. It also enables use of the Magstar Virtual Tape Server
without requiring installation of new software releases. Within the Magstar
Virtual Tape Server, data is stored on disk as images of either virtual Cartridge
System Tape (CST) or Enhanced Capacity Cartridge System Tape (ECCST). All
3490E-type commands are supported. Tape motion commands are translated into
disk commands, resulting in response times much faster than in conventional
tape drives.
Disk racks
The disk racks contain from 2 to 18 drawers of 7133 disks. There can be either 8
or 16 disks in a drawer. There are two types of rack. The first rack must be the
storage server rack 2105-B09, which contains the storage server, power
management system and up to four disk drawers. The second and third racks
are disk expansion racks 2105-100, and contain up to seven drawers of SSA
disks each. The maximum supported configuration is 18 disk drawers. The
disks are configured into RAID-5 arrays of eight disks. Each array is in the form
of six disks plus parity plus hot spare (6+P+S) or seven disks plus parity
(7+P). No other disk configurations are supported. The first RAID array in a
drawer must always be configured as a 6+P+S array. The disk drives that are
supported are the 4.5 GB and 9.1 GB drives. The SSA technology architecture
allows new and old devices of the same capacity to coexist on the same SSA
loop.
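
As a rough illustration of these configuration rules, the sketch below computes the usable data capacity of the two supported array types for each drive size. The figures and the program are illustrative arithmetic only, not taken from the product documentation.

/*
 * Illustrative arithmetic only (not from the product documentation):
 * usable capacity of a VSS RAID-5 array is the drive capacity multiplied
 * by the number of data drives; the parity drive's worth of capacity and
 * any hot spare hold no user data.
 */
#include <stdio.h>

int main(void)
{
    const double drive_gb[] = { 4.5, 9.1 };   /* supported drive capacities */
    const int data_drives_6ps = 6;            /* 6+P+S: six data drives     */
    const int data_drives_7p  = 7;            /* 7+P:   seven data drives   */

    for (int i = 0; i < 2; i++)
        printf("%.1f GB drives: 6+P+S array = %.1f GB, 7+P array = %.1f GB\n",
               drive_gb[i],
               drive_gb[i] * data_drives_6ps,
               drive_gb[i] * data_drives_7p);
    return 0;
}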
A web-based interface allows the administrator to control the VSS and to assign
storage to individual hosts as required. Because the configuration tool is web
based, the administrator can manage the subsystem from anywhere on the intranet.
Definition
There is a single storage server in the VSS, consisting of two processing clusters
with RISC-based four-way SMP configurations. The storage server consists of the
host adapter, PCI Link technology, SMP cluster, SSA adapter and communication
ports.
Host
The host sees the logical disk partition as a familiar entity. In the case of the
AS/400, the storage server emulates a 9337, so the AS/400 sees a 9337-580 or
9337-590 disk array. The UNIX host server is presented with a generic SCSI disk.
RISC planars
The planar boards that are used in VSS are RISC planars with RISC chips. They
are based on the boards that are used in the RS/6000.
SSA adapters
The adapters used in VSS are based on the PCI bus RAID-5 adapter used in the
RS/6000.
7133
The 7133 SSA disk drawer and 4.5 GB and 9.1 GB disks are used in VSS. The
drawer is not modified in any way except that new power cords are supplied to
connect to the two 350 V DC power buses. The disk drives for VSS are specially
formatted with 524-byte sectors for AS/400 compatibility, compared with 512-byte
sectors that are used in the native version of the 7133. If existing disk drives are
to be used in VSS, they have to be reformatted with the new disk sector sizes.
Both existing 7133-020 and 7133-010 drawers can be used in the VSS. Note: The
drawer configuration may need to be modified to conform to the VSS rules.
Ultra-SCSI adapter
The host interface adapters are the connection point between VSS and the hosts.
The host interface adapters that are used are Ultra-SCSI adapters. They
conform to the SCSI-3 standard and are backward compatible with SCSI-2 F/W.
Thus SCSI-2 differential F/W adapters in the host can connect to VSS, albeit at a
connection speed of 20 MB/s compared with 40 MB/s for Ultra-SCSI. There are
four bays, each of which can hold two host adapters and two SSA adapters, giving
a total of eight host adapters and eight SSA adapters that can be used. Each
adapter has two independent ports that
can both read or write data concurrently. Any port on any adapter can be
configured to communicate to either processing cluster. Once the connection is
defined, it talks to only that cluster unless a failover occurs. If a host needs to
connect to disks that are attached to both clusters, then there must be two
host-to-VSS connections, one defined to each cluster. Up to 64 hosts can be
connected at once.
Heartbeat monitor
The heartbeat monitor enables each cluster to monitor the other and fail over if a
problem is detected (see Chapter 10, “Subsystem Recovery” on page 333).
Failover support is provided by the link between the two disk adapter PCI
bridges and the host bridge. This link enables one cluster to route operations to
both sets of disks. This link also enables online cluster microcode updates to
take place. One of the clusters is deliberately shut down and all transactions are
failed over to the other. The microcode update is applied and the cluster is then
brought back online.
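
The following sketch is purely conceptual; the actual heartbeat protocol and failover microcode are not documented here. It only illustrates the idea that each cluster watches its peer and routes the peer's disk operations through itself, over the PCI bridge link described above, when the heartbeat stops (whether because of a failure or a planned code load).

/*
 * Conceptual sketch only; not the VSS failover implementation.
 */
#include <stdbool.h>
#include <stdio.h>

struct cluster {
    const char *name;
    bool alive;            /* updated by the heartbeat exchange       */
    bool owns_peer_disks;  /* true after this cluster has taken over  */
};

static void check_peer(struct cluster *self, struct cluster *peer)
{
    if (!peer->alive && !self->owns_peer_disks) {
        self->owns_peer_disks = true;   /* take over the peer's disk I/O */
        printf("%s: peer %s missed its heartbeat, taking over its disks\n",
               self->name, peer->name);
    }
}

int main(void)
{
    struct cluster a = { "cluster-A", true, false };
    struct cluster b = { "cluster-B", true, false };

    b.alive = false;        /* simulate a failure or a planned code load */
    check_peer(&a, &b);
    return 0;
}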
Read/Write cache
Each cluster contains a read/write cache to improve system response time (see
Chapter 8, “Versatile Storage Server Performance” on page 251). The minimum
cache size is 512 MB and the maximum is 6 GB, split evenly between the two
clusters.
RS-232 ports
The RS-232 port is used to attach a modem to the system so that in the event of
a system malfunction the service team can be notified directly. The routing and
severity of the alerts are set up during installation. A service engineer can use
the second port to attach a local monitor to carry out diagnostic and remedial
work.
Fast-write cache
The total fast-write cache in each SSA RAID adapter is 8 MB. Half of this, 4 MB,
is used as a write-through cache and is in volatile memory. The other half is a
mirror copy of this stored in a battery-backed “permastore,” or Fast Write
Cache (see Chapter 3, “Versatile Storage Server Technology” on page 43). The
“permastore” is proven technology used in the 7137 RAID array.
AS/400
VSS emulates the 9337 when attached to AS/400 systems. The following
emulation is provided:
• Logical 4 GB drives emulate the 9337-580. A minimum of four 4 GB drives
must be configured. The actual capacity of each logical drive is 4.194 GB.
• Logical 9 GB drives emulate the 9337-590. A minimum of four 9 GB drives
must be configured. The actual capacity of each logical drive is 8.59 GB.
OS/400 expects to see a separate device address for each disk drive, logical or
physical. VSS will report unique addresses for each arm that is defined to the
AS/400. OS/400 behaves as if 9337s are attached to the system.
UNIX
For UNIX-based systems VSS emulates the generic SCSI device drives that are
supported by drivers found in most systems. Each VSS LUN can have a capacity
ranging from 0.5 GB to 32 GB (valid LUN sizes are 0.5, 1, 2, 4, 8,
12, 16, 20, 24, 28, and 32 GB). There does not have to be a direct correlation
between physical disk drives and the logical LUNs seen by the UNIX-based host.
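
As a small illustration of the LUN sizing rule, the hypothetical helper below checks a requested capacity against the list of valid LUN sizes quoted above. It is not part of any VSS configuration tool.

/* Hypothetical helper: validates a requested UNIX LUN size in GB. */
#include <stddef.h>
#include <stdio.h>

static int is_valid_lun_gb(double gb)
{
    const double valid[] = { 0.5, 1, 2, 4, 8, 12, 16, 20, 24, 28, 32 };

    for (size_t i = 0; i < sizeof valid / sizeof valid[0]; i++)
        if (valid[i] == gb)
            return 1;
    return 0;
}

int main(void)
{
    printf("16 GB valid? %d\n", is_valid_lun_gb(16.0));  /* prints 1 */
    printf("10 GB valid? %d\n", is_valid_lun_gb(10.0));  /* prints 0 */
    return 0;
}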
The racks are more than metal chassis; they contain power control and
sequencing logic to ensure high availability.
The storage server uses PowerPC, Serial Storage Architecture (SSA), and Ultra
SCSI technologies to provide high-performance,
reliable, sharable access to customer data.
The disk drawers are state-of-the-art SSA drawers, providing disk insertion and
removal features that allow configuration and maintenance to be performed
without affecting host system up time.
The Ultrastar disk drives are IBM's latest drives, using such innovative features
as thin-film disk, magnetoresistive heads, zoned recording, No-ID sector formats,
and predictive failure analysis.
Two types of racks are used in the VSS subsystem: the 2105-B09 and 2105-100.
2105-B09
The 2105-B09 rack contains a storage server, dual SMP clusters, read cache, I/O
adapters and bays, 32 9.1 GB drives, redundant power supplies and space for
two 7133s.
Power requirements and specifications of the 2105-B09 rack are fully discussed
under “The 2105-B09 Rack” on page 47. The VSS storage server is discussed in
greater detail under “Storage Server (Front)” on page 50. The VSS storage
server and disk drawers are discussed in greater detail under “The Versatile
Storage Server” on page 52 and “The Disk Drawer” on page 72.
When positioning the VSS racks, any 2105-100 racks are physically placed next to
the 2105-B09 rack, to allow connection of the power sequencing cables (the
frames are bolted together). If existing 7015-R00 or 7202-900 racks must be used
in the subsystem, they should be placed next to any 2105-100 racks.
Where there are only 2105-100 expansion racks in the subsystem, they can be
placed up to 20 m (65 ft) away from the 2105-B09 if the appropriate RPQ is
specified (this orders the cable that enables the 20 m distance). The standard
configuration will have the 2105 racks bolted together.
Power is supplied to the 2105-B09 rack through two power cords carrying either
50 ampere single-phase or 50/60 ampere three-phase alternating current (AC)
power, with each cord providing 100% of the needed power.
There are two 350 V DC bulk power supplies. One of these is enough to supply
power to the subsystem, which provides for uninterruptible service if one of the
DC power supplies fails. A 48 V DC converter on each 350 V rail drops the
voltage to the level required by the individual components, such as the SMP
clusters and the adapter bays. The drive drawers are supplied with 350 V DC
provided by the bulk power supplies. Because the drawers must have at least
two of their three power supplies available to power up, each drawer has a
single connection to each 350 V rail, plus a Y connection that connects to both
rails. In the event that one of the 350 V bulk power supplies fails, the drive
drawers can still power up.
Power cables for the four disk drawers are included in the base rack, regardless
of how many drawers are initially configured in the rack.
The 2105-100 rack contains the same fully redundant power subsystem as the
2105-B09 rack. Power is supplied to the 2105-100 rack through two power cords
carrying either 50 ampere single-phase or 50/60 ampere three-phase AC power,
with each cord providing 100% of the needed power.
The 2105-100 rack also has an optional battery that provides the same function it
provides for the 2105-B09 rack: several minutes of backup to facilitate host
shutdown.
Each cluster has its own motherboard that contains the SMPs, cache memory
and other components (see “The Versatile Storage Server” on page 52 for a
complete explanation of the various components of the storage server). In
addition, each cluster has its own diskette drive and CD-ROM drive for loading
microcode and diagnostics.
The four adapter bays are hot pluggable, hot removable bays used for
connecting the host and drive adapters to the VSS storage servers. Each bay
can be configured with a maximum of four adapters, giving a maximum of
sixteen adapters. Of these, eight can be host adapters and eight can be drive
adapters.
Each SMP cluster uses a RISC planar, which contains a complete computer
system in itself. It has all of the usual functional parts of a RISC system: central
processing units (CPUs), level 2 cache, dynamic random access memory
(DRAM), PCI bus, read-only storage (ROS), nonvolatile random access memory
(NVRAM), an internal small computer systems interface (SCSI) bus, internal SCSI
disk drive, flexible disk drive, compact disk read-only memory (CD-ROM),
multidigit liquid crystal display (LCD), Ethernet, two serial ports, and a power
supply.
The instruction cache can provide up to four instructions in a single clock cycle.
The RISC architecture defines special instructions for maintaining instruction
cache coherence. The 604e implements these instructions.
Packaging
CPUs are packaged two per CPU board. Each SMP cluster has two CPU boards,
or four CPUs per cluster, or eight CPUs in total.
More components of the VSS storage server are described on the next foil.
CD-ROM
Each SMP cluster contains an 8X speed CD-ROM that is used to initially load and
install the operating code. It is also used to load diagnostics and microcode or
operating code updates.
Ethernet
A 10 Mbit/s Ethernet adapter is installed in each SMP cluster. The Ethernet
adapter is used to connect to the customer's intranet (through 10BaseT) and is
the primary interface for configuration of the subsystem.
Serial ports
Two RS-232 serial ports per cluster are used for access by service personnel.
One port is connected to a modem to enable remote diagnostic access and “call
home.” The “call home” feature enables the VSS to contact the IBM service
center in the event of a failure or if the subsystem requires attention by a
Customer Engineer (CE).
The second port is used by the CE for maintenance and repair actions to the
subsystem.
LCD display
A multidigit liquid crystal display (LCD) is used to display status codes while the
cluster boots and while the storage server is operational. Typically, the CE will
use the status codes when performing maintenance and repair actions on the
VSS.
SDRAM also incorporates automatic precharge, which eliminates the need for an
external device to close a memory bank after a burst operation. In addition, it
allows two row addresses to be open simultaneously. Accesses between two
opened banks can be interleaved, hiding row precharge and first access delays.
The cache is used as read and write cache but is considered volatile because it
has no battery backup. The SSA disk adapters have nonvolatile storage (Fast
Write Cache) and a write is not considered complete until the data has been
transferred from the cluster cache into the disk adapter Fast Write Cache. As
the disk adapter Fast Write Cache is battery protected, the host operating system
is informed that the write is complete without having to wait for the data to be
written to disk. This facility is known as fast write.
If an express write of data is transferred from the cluster cache to the adapter in
one operation, it is placed in the adapter memory (not Fast Write Cache) for
calculation of parity data and then immediately written to disk without being
transferred to Fast Write Cache.
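
The two write-completion rules described above can be summarized in a short conceptual sketch. The structure is illustrative only; in particular, the assumption that an express write is acknowledged to the host only after the disk write is an inference, not something the text states explicitly.

/* Conceptual sketch of the fast write and express write paths above. */
#include <stdio.h>

enum write_path { FAST_WRITE, EXPRESS_WRITE };

static void complete_write(enum write_path path)
{
    if (path == FAST_WRITE) {
        /* Data is mirrored into the battery-protected Fast Write Cache, so
         * the host is told the write is complete immediately; the destage
         * to disk happens later, asynchronously. */
        printf("fast write: ack host now, destage to disk later\n");
    } else {
        /* Data goes to adapter DRAM for parity calculation and is written
         * straight to disk, bypassing the Fast Write Cache (assumed: the
         * host sees completion only after the disk write). */
        printf("express write: calculate parity, write to disk, then ack\n");
    }
}

int main(void)
{
    complete_write(FAST_WRITE);
    complete_write(EXPRESS_WRITE);
    return 0;
}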
Staging
Staging is the term used for reading data from the physical disk media into the
cluster cache. When the host accesses a track, if it is not in cache, a read
“miss” occurs. Three types of staging can occur during the read access to the
data:
• Record mode
Only the records asked for by the host are staged. Record mode is the
default mode; that is, when a track is accessed for the first time, only the
records asked for will be staged. If the track is being accessed frequently, it
will be “promoted” to either partial or full-track staging (see below).
Typically, data being serviced in record mode is being accessed so
infrequently and randomly that the adaptive cache algorithm (see below) has
decided it is better served in record mode.
• Partial track
The initial transfer of the data to cache is from the initial record accessed to
the end of the track. If the initial record accessed is not from the index of
the track, the cache image is a partial track.
• Full track
A full track is staged to the cluster cache if the initial record accessed is
from the index record.
For a full description of data flow and caching mechanisms, refer to Chapter 4,
“Versatile Storage Server Data Flow” on page 109.
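
A minimal sketch of the three staging behaviors is shown below. The promotion rule used here (a simple per-track access count) is an assumption made for illustration only; the real adaptive caching algorithm is described in Chapter 4.

/* Illustrative sketch of record-mode, partial-track, and full-track staging. */
#include <stdio.h>

enum stage_mode { RECORD_MODE, PARTIAL_TRACK, FULL_TRACK };

static enum stage_mode choose_stage(int track_access_count,
                                    int first_record_is_index)
{
    if (track_access_count <= 1)
        return RECORD_MODE;     /* default: stage only the records asked for   */
    if (first_record_is_index)
        return FULL_TRACK;      /* access starts at the index: whole track     */
    return PARTIAL_TRACK;       /* stage from the accessed record to track end */
}

int main(void)
{
    printf("%d\n", choose_stage(1, 0));  /* 0 = RECORD_MODE   */
    printf("%d\n", choose_stage(3, 0));  /* 1 = PARTIAL_TRACK */
    printf("%d\n", choose_stage(3, 1));  /* 2 = FULL_TRACK    */
    return 0;
}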
Each SMP cluster has two 4-byte PCI buses running at 32 MHz.
The foil shows a block diagram of the storage server memory, CPU, MX, and PCI
buses. The MX-to-PCI bus bridges are custom chips that allow any PCI adapter
to access the memory bus or be accessed by the memory controller. Because
the PCI buses are isolated from the system bus with the bridge chip, all CPUs
have access to both PCI buses.
Host Adapter
The host adapter is a 32-bit PCI Ultra SCSI adapter. It supports the SCSI-3
protocol and command set and provides two 16-bit-wide, 40 MB/s differential
channels, capable of reading and writing concurrently. Each channel of the host
adapter can support up to 16 target IDs like any 16-bit SCSI bus, but each can
support up to 64 logical unit numbers (LUNs) per target ID (in conformance with
the SCSI-3 protocol).
Each of the channels can be attached to a single host, two homogeneous hosts,
or two heterogeneous hosts. The hosts must attach either through an Ultra SCSI
differential interface or a SCSI-2 fast and wide differential interface.
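
A quick back-of-the-envelope calculation from these figures shows how many LUNs a single host adapter can address; the totals below are simple multiplication of the limits given in the text, not additional product specifications.

/* Addressing arithmetic restating the host adapter limits above. */
#include <stdio.h>

int main(void)
{
    const int channels_per_adapter = 2;    /* two differential channels   */
    const int targets_per_channel  = 16;   /* 16-bit SCSI bus target IDs  */
    const int luns_per_target      = 64;   /* SCSI-3 LUNs per target ID   */

    printf("LUNs addressable per channel: %d\n",
           targets_per_channel * luns_per_target);                 /* 1024 */
    printf("LUNs addressable per adapter: %d\n",
           channels_per_adapter * targets_per_channel * luns_per_target);
    return 0;
}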
The VSS SSA adapter supports two separate loops of disks. As a general VSS
configuration rule, 16 disks are supported per loop, because 16 disks are
supported in one SSA drawer. In a maximum configuration of 18 drawers,
however, two loops will support 32 drives in two drawers.
An SSA adapter contains two SSA initiators. An SSA initiator is an SSA node
that is capable of initiating SSA commands. Each of the two initiators on the
adapter is capable of functioning as a master. An SSA master can access a
node by specifying the node's address in the SSA network. When an SSA
network contains more than one initiator node, the initiator with the highest
unique ID is the master.
Each initiator controls two ports. When connected through a series of SSA disk
drives, the ports form a loop.
Volatile storage
The adapter contains 16 MB of DRAM, which is used for transferring data
between the disk arrays and the SMP cluster cache. The adapter DRAM is the
only place where RAID parity data is stored once it is staged from disk.
Parity data is never transferred to the VSS cache. It is cached in DRAM to
assist in calculation of the new parity data if the data is modified. In addition,
the DRAM is used for storing the adapter microcode, its code control blocks, and
a mirror of the adapter Fast Write Cache. The DRAM is not partitioned in a
physical sense; specific sizes are not set aside for read cache, fast write cache,
adapter microcode, or control blocks. The size of the read cache depends on
the amount of unused DRAM at any given time. The size of the fast write cache
is limited to the size of the Fast Write Cache, 4 MB.
Parity calculation
To maintain data availability if a drive in the array fails, when data is written an
extra parity strip is written along with the data. Depending on the configuration
of the array, a full stripe of RAID data on the VSS is either 224 KB for a
six-drive-plus-parity array (6+P) or 256 KB for a seven-drive-plus-parity (7+P)
array. From the numbers, we can see that a single strip of data is 32 KB. A
strip is the amount of data written to a single drive in the array. In a 7+P array
(eight drives in total), a full stripe is 8 * 32 KB = 256 KB.
The parity data is calculated by XORing the seven 32 KB strips together. The foil
shows a representation of the XOR function.
The XOR function is reversible in that, if you XOR the parity data with any
one of the seven data strips, the result is the XOR of the remaining six data
strips. This reversibility provides two distinct advantages. First, the data of a
failed drive can be reconstructed by XORing the parity with the surviving data
strips. Second, the XOR function minimizes the number of I/Os that have to occur
to calculate new parity data when a strip is updated: four I/Os per write, namely
read old data, read old parity, write new data, write new parity.
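A minimal Python sketch of these XOR relationships, using tiny byte strings in
place of the 32 KB strips of a 7+P array; it is illustrative only, not VSS
adapter code:

import functools

def xor_strips(strips):
    """XOR a list of equal-length strips together, byte by byte."""
    return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [bytes([i] * 8) for i in range(1, 8)]   # seven data strips
parity = xor_strips(data)                      # parity strip written with the data

# Reversibility: parity XORed with any one data strip equals the XOR of the
# remaining six, so a failed strip can be reconstructed.
rebuilt = xor_strips([parity] + data[1:])      # reconstruct strip 0 from the rest
assert rebuilt == data[0]

# Update write: new parity from old data, old parity, and new data only
# (read old data, read old parity, write new data, write new parity).
new_d0 = bytes([9] * 8)
new_parity = xor_strips([parity, data[0], new_d0])
assert new_parity == xor_strips([new_d0] + data[1:])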
The 7133 contains four physically separate loops of slots for installing disk
drives. The slots support hot insertion and removal of drives.
If disk drives are not installed in drive slots, a dummy drive must be installed to
propagate the signals from adjoining slots. These four loops can be physically
cabled in a variety of different configurations. In the VSS subsystem, however,
only one loop of 8 or 16 drives per drawer is supported, so all four drive slot
loops must be cabled together.
Slots 1 to 8 are physically located at the front of the 7133, numbered from left to
right as you face the front. Slots 9 to 16 are physically located at the rear of the
7133, numbered from left to right as you face the rear of the 7133.
7133-010
Two types of 7133 disk drawers are supported in the VSS subsystem: the
7133-010 and the 7133-020. Although the 7133-010 is no longer available,
customers who already have the 7133-010 drawers can use them in the VSS
subsystem.
There are some considerations when using 7133-010 drawers. For example, all
disks within the drawers must be reformatted to the VSS specification. Thus the
data cannot be migrated directly from existing RAID drawers to the VSS
subsystem. Also, because the 7133-010 does not provide sufficient cooling air
flow to support the 9.1 GB disk drives, it supports 4.5 GB disk drives only.
7133-020
The 7133-020 is the latest SSA disk drawer available from IBM. It is similar in
design to the 7133-010, but has some upgraded features and, because of
improved cooling air flow within the drawer, supports the 9.1 GB disk drives in
addition to the 4.5 GB disk drives. Customers with existing 7133-020 drawers can
also use them in the VSS. Differences between the 7133-010 and -020 are
discussed under “7133 Model Differences” on page 76.
For more information about the VSS sector format, see “Versatile Storage Server
524-Byte Sector Format” on page 108.
For more information about migrating data with existing 7133s, see Chapter 7,
“Migrating Data to the Versatile Storage Server” on page 221.
Each power supply module also has a cooling fan assembly which not only cools
the power supply module but also provides cooling to the 7133 drawer and the
drive modules installed within. The 7133-020 has redesigned cooling air flow to
support the 9.1 GB drive modules.
The foils show the external connectors and disk drive module locations of the
7133-010 and the 7133-020.
In a 7133-020, the SSA signal cards have been replaced by SSA bypass cards.
The bypass cards can operate in two modes: bypass mode or forced inline
mode. The bypass cards have active circuitry that can detect whether a cable is
plugged into an external SSA connector. If a cable is not plugged in, the card
operates in bypass mode, where the two connectors are looped together
(Sequence 1). Bypass mode minimizes the amount of intradrawer cabling
required and automatically connects a loop if the host or adapter fails. If the
card detects that a cable is plugged in to one of the external connectors, it
switches inline, connecting the internal SSA electronics to the external connector
(Sequences 2 and 3).
As the dummy drive module does not contain any active circuitry, a limitation on
the number of adjacent dummy modules is imposed. Only three dummy
modules can be adjacent to each other, either in the same loop or adjoining
loops. More than three dummy modules adjacent in a loop degrades signal
quality enough to cause errors.
As a single loop of SSA disks in a VSS can contain up to 16 drives, one or two
arrays are supported per loop. For each drawer, at least one array must be a
6+P+S array. The other can be another 6+P+S array, or a 7+P array.
6+P+S
The 6+P+S array contains six disks used to store customer data, one parity
drive, and one spare drive. In a RAID-5 configuration, parity is not stored on one
single drive, but for purposes of discussion, we assume that the parity is written
on one drive. The spare drive is required to support a single loop in case of
drive loss during normal operations. The adapter can detect the loss of a drive
and perform an automatic reconstruction of the data for the lost disk. The
reconstructed data is then written to the spare, and the bad disk is replaced as
soon as possible. The spare drive must be the same size as the data and parity
disks.
In this case we assume that the array is configured as a 6+P+S array, that is 6
data drives, 1 parity drive, and 1 hot spare drive. A hot spare drive is one that is
actually in the array and powered on, but idle; it is hot because it is at operating
temperature. The built-in idle functions of the drive—disk sweeping and channel
calibration—ensure that the drive is in good shape when required by the sparing
operation.
Referring to the foil and following the sequence numbers, assume that Member 2
of the array above fails in some way. If the failure is complete (that is, the drive
no longer responds to commands), the loop is broken and the SSA adapter is
informed by the drives on either side of the failed drive in the loop (Members 1
and 3); see Sequence 1.
At this time, two things happen. First, any requests for data from the failed
disk are regenerated by the RAID adapter's XOR function (see Sequence 3) by
reading the parity data and the data from the other six drives (see Sequence 2).
The increased number of reads required, and the reconstruction calculation in
the adapter, cause a slight degradation in performance. Second, the data of the
failed drive is reconstructed and written to the hot spare disk.
The hot spare disk is now no longer the spare, but a functional drive in the
array, storing customer and parity data. The failed drive can now be replaced.
Once it has been replaced, it then assumes the role of hot spare for the array
(see Sequence 5).
The failed drive should be replaced as soon as possible, because as long as the
failed drive remains in the array, should another drive fail (an unlikely event) no
spare would be available. The adapter would then have to reconstruct data for
every read request, resulting in decreased performance until drive replacement
and data reconstruction had taken place. For more information on recovery
procedures, see Chapter 10, “Subsystem Recovery” on page 333.
Both disk drives come in an industry-standard 3.5 in. form factor package, which
is compact and robust.
Many of the technologies used in the manufacture of the Ultrastar drives were
developed at IBM′s Almaden Research Center in San Jose, California, USA.
Features
The Ultrastar drives use the latest technology in disk drive manufacture from
IBM. In recent years, many improvements in disk drive manufacture have led to
increased capacity, performance, and reliability at decreased costs.
Capacity
The Ultrastar 2XP drives have capacities of 4.5 GB and 9.1 GB. The 4.5 GB
model has five platters and nine MR heads mounted on a rotary voice-coil
actuator. The 9.1 GB model has 9 platters and 18 MR heads mounted on a
rotary voice-coil actuator.
Performance
The Ultrastar 2XP drive disks rotate at 7200 revolutions per minute (rpm), which
reduces rotational latency to an average of 4.17 ms. Average access times are
less than 8.5 ms, and track-to-track reads average 0.5 ms.
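The rotational latency figure follows directly from the spindle speed, as this
small Python check shows:

rpm = 7200
full_rotation_ms = 60_000 / rpm               # 8.33 ms per revolution
avg_rotational_latency_ms = full_rotation_ms / 2
print(round(avg_rotational_latency_ms, 2))    # 4.17 ms, matching the figure above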
The Ultrastar 2XP drives use embedded sector servo technology, which stores
servo (head positioning) data on the data areas of each platter, rather than on a
dedicated servo platter. Embedded servo eliminates the repetitive thermal
calibration routines required when dedicated servo platters are used, improving
high-speed access to the data.
The Ultrastar 2XP drives can provide a data rate (stream of bits from the read
head) ranging from 80.6 Mbit/s at the inner zones to 123.4 Mbit/s at the outer zones
(for more information about zones, see “Zoned Bit Recording” on page 93).
To aid in read-ahead and faster write access, the Ultrastar 2XP drives contain a
512 KB buffer. The buffer is used for storage of write data on its way to the drive
or as storage for data that has been read in advance in anticipation of being
required by the adapter (read ahead).
Reliability
Recoverable read errors are less than 10 per 10^13 bits read; nonrecoverable read
errors are less than 10 per 10^15 bits read. Seek errors are less than 10 per 10^8
seeks.
The Ultrastar 2XP drives have predictive failure analysis (PFA), which complies
with the Self-Monitoring, Analysis, and Reporting Technology (SMART) industry
standard.
During idle periods, the drive collects various drive characteristics and writes
them to a log area on the drive. If any of the characteristics of the drive exceed
their thresholds, the drive notifies the host to which it is connected. The
thresholds are set in such a way as to give at least 24 hours notice of a pending
failure, which allows maintenance to be carried out before loss of data and
possible downtime occur.
Substrate
Thin film disk platters start with a substrate of aluminum and magnesium (AlMg)
alloy, for lightness and strength. They are then coated with a nickel phosphorus
(NiP) layer and a thin layer of chromium (Cr).
Recording layer
The recording layer is a thin layer of cobalt, platinum, and chromium alloy
(CoPtCr) that is bonded to the chromium layer below it. The recording layer
alloy provides high signal-to-noise ratio, which greatly enhances readability.
Landing zone
Finally, a laser-textured head landing zone is placed at the inner edge of the
disk. When the drive is powered down, the heads are moved to the center of the
disk where they touch the platter. The textured landing zone allows the head to
touch down without damaging itself or the platter and greatly reduces the
likelihood of “head stiction”—the phenomenon where the head, once landed,
sticks to the platter and prevents it from spinning when the drive is next powered
up.
The read element consists of an alloy film, usually NiFe (nickel iron), that
exhibits a change in resistance in the presence of a magnetic field—the MR
effect. Shielding layers protect the MR elements from other magnetic fields.
The second shield also functions as one pole of the inductive write head, thus
giving rise to the term merged MR head.
By storing data more densely, disk drives get smaller and less expensive. As
data storage becomes more dense, however, data bits begin to interfere with
their neighbors, which is known as intersymbol interference (ISI).
Peak detection
In a disk drive using a peak detection head, the read head senses the flux
changes in the platter′s magnetic material and generates an analog waveform
that is passed to the read channel. The read channel detects the waveform
peaks, each of which represents one bit of data. The data is then converted and
deserialized back into digital data bytes.
Viterbi detection
The key to the PRML read channel is the Viterbi algorithm, named after its
inventor, Andrew Viterbi. The PRML read channel uses analog
signal processing to shape the read signal to the required frequency and
equalize the incoming data to the partial response (PR) waveform. The PR
circuits are tuned to match typical signals from the read head.
Ultimately, PRML reduces error rates and allows higher capacities than drives
with peak detection read channels.
Zoned bit recording takes advantage of the differing lengths of the tracks and
keeps the areal density constant across the platter.
Zones
The Ultrastar 2XP platters are divided into eight zones. Since the areal density
remains the same, the number of sectors per track increases toward the outer
edge of the platter. Zone 1 is located at the outer edge, and Zone 8 is located at
the inner edge.
The main advantage of using zoned bit recording is increased capacity without
increased platter size. A side effect of packing more sectors into the longer
outer tracks is a higher data rate off the platter, because the linear velocity
under the head is higher at the outer edge while the rotational speed stays
constant. The maximum data rate at the inner edge is 10.3 MB/s, and at the outer
edge the maximum is 15.5 MB/s.
The foil illustrates the track layout of a typical fixed-block disk drive using
embedded servo. Each track is divided into a number of data sectors and servo
sectors. The servo fields contain the positioning information used to locate the
head over a given track. The user data is stored in the data fields, each with an
associated ID field. The ID fields contain information that identifies the data
sector and other information, such as flags, to indicate defective sectors.
The use of ID fields allows great flexibility in the format and provides a simple
mechanism for handling defects. However, substantial costs are associated with
the use of ID fields. The ID fields themselves can occupy up to 10% of a
track—space that would otherwise be used to store data. Further, because the
disk drive must read the ID field for each sector before a read or write operation,
additional space is required to allow for write-to-read recovery prior to each ID
field (refer to the foil). The recovery gaps can occupy more than 5% of a track.
Rotary actuator
The Ultrastar drives use a rotary voice-coil actuator with approximately 20
degrees of angular movement. When used with a rotary actuator, the head is
skewed with respect to the tracks as the actuator moves across the disk. The
result is a lateral offset between the read and write head centerlines. Optimum
performance is achieved by centering the read head over the data track for read
operations and the write head over the data track for write operations. This
operation causes the read head to be partially off-track during a write operation.
The MR head elements are offset with respect to each other. When a write
operation occurs, the write head must be positioned directly over the data field.
However, the read head must be able to read the ID field immediately preceding.
Because of the offset of the read head with respect to the write head, the ID
field has to be offset from the data field.
The LBN is simply a number from 0 to the number of addressable blocks on the
disk drive. The PBN is a number from 0 to the number of physical blocks on the
disk drive, but with the defective and spare sectors mapped out. Once the PBN
is computed, it may be converted to the exact ZCHS value for the sector.
Because the defect information is known in advance, the proper logical block is
guaranteed to be located at the computed ZCHS. The defect map is stored in a
compressed format, optimized for small size and rapid lookup.
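As an illustration of this kind of mapping, the following Python sketch converts
an LBN to a PBN by skipping over a RAM-resident defect map, then to a simplified
(zone, track, sector) address standing in for the drive's ZCHS value. The
defect-map representation and the geometry numbers are our own assumptions, not
the drive's internal format:

import bisect

DEFECTIVE_PBNS = [5, 17, 90]             # sorted list of mapped-out physical blocks
SECTORS_PER_TRACK_BY_ZONE = [120, 100]   # example geometry for two zones
TRACKS_PER_ZONE = 1000

def lbn_to_pbn(lbn):
    """Skip defective sectors: each defect at or below the PBN shifts it up."""
    pbn = lbn
    while True:
        skipped = bisect.bisect_right(DEFECTIVE_PBNS, pbn)
        new_pbn = lbn + skipped
        if new_pbn == pbn:
            return pbn
        pbn = new_pbn

def pbn_to_zts(pbn):
    """Convert a PBN to (zone, track, sector) using the per-zone track formats."""
    for zone, sectors in enumerate(SECTORS_PER_TRACK_BY_ZONE):
        zone_capacity = sectors * TRACKS_PER_ZONE
        if pbn < zone_capacity:
            return zone, pbn // sectors, pbn % sectors
        pbn -= zone_capacity
    raise ValueError("PBN beyond end of disk")

print(pbn_to_zts(lbn_to_pbn(4)))   # (0, 0, 4): before any defect
print(pbn_to_zts(lbn_to_pbn(5)))   # (0, 0, 6): the defective PBN 5 is skipped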
The servo system is used to locate the physical sector on the basis of knowledge
of the track formats in each zone. This information includes the locations of any
data field splits due to embedded servo, which are also stored in RAM.
The No-ID sector format enhances disk drive reliability because the header and
data field split information is stored in RAM, not on the disk.
Other advantages
No-ID sector format increases the capacity of disk drives by reducing the format
overhead and allowing the MR head to be utilized to its fullest extent. Increasing
track density gives up to a 15% increase in capacity, while increased linear bit
density also gives up to a 15% increase in capacity, in the same space. Disk
drive manufacturing yield is further enhanced by the advanced defect
management capabilities. The performance is enhanced by the increased
throughput (through reduced overhead) and by the knowledge of the absolute
sector locations. Power management is enhanced because there is no need to
supply current to the read electronics to read ID fields when searching for a
sector.
Overview
With any electromechanical device, there are two basic failure types. First, there
is the unpredictable catastrophic failure. A cable breaks, a component burns
out, a solder connection fails. As assembly and component processes have
improved, these defects have been reduced but not eliminated. PFA cannot
provide warning for unpredictable failures. Then there is the gradual
performance degradation of components. PFA has been developed to monitor
performance of the disk drive, analyze data from periodic internal
measurements, and recommend replacement when specific thresholds are
exceeded. The thresholds have been determined by examining the history logs
of disk drives that have failed in actual customer operation.
Seven measurements are taken over a 4-hour period, each taking approximately
80 ms. These seven measurements include various magnetic parameters of the
head and disk, head fly height on all data surfaces, channel noise, signal
coherence, signal amplitude, and writing parameters.
The symptom-driven process of PFA uses the output of data, nondata, and
motor-start error-recovery logs. The analysis of the error log information is
performed periodically during idle periods, along with the data collected by GEM
(generalized error measurement).
When PFA analysis detects a threshold-exceeded failure, the drive notifies the
host system through the controlling adapter.
Error logs
The Ultrastar 2XP drive periodically saves data in error logs located in reserved
areas of the disks. These reserved areas are not part of the usable
customer-data area of the drive. Logs are written during idle times, usually only
after the drive is idle for 5 s or more. The writing process takes about 30 ms,
and the logs are saved about every half-hour.
Channel calibration
The Ultrastar 2XP periodically calibrates the read channel to ensure that the
read and write circuitry is functioning optimally, thus reducing the likelihood of
soft errors. Like the PFA functions, channel calibration is done during idle
periods, every 4 hours. It takes about 20 ms to complete and requires the drive
to have been idle for more than 5 s.
Disk sweep
Disk sweep is the process of ensuring that a drive does not remain idle for
excessive periods. Components that remain idle for long periods become
vulnerable to failure. If the drive has not processed a command for more than
40 s, it initiates a move of the heads to a random location. If the heads fly in the
same place for 9 minutes, a second movement of the heads is initiated.
Three-way router
The three-way router is an important part of the SSA loop. The router takes
inbound frames from either port, and checks the address field of the frame. The
address field has two parts, the path and channel addresses. The router checks
the path address. The path address is typically 1 byte but can be extended to up
to 4 bytes in complex and switched webs. If the first byte of the path address is
a zero, then the frame is assumed to be for the current node, and the frame is
routed to the node (the drive). The channel address is used to route the frame
within the node.
If the first byte of the path address is not zero, the frame is assumed to be for a
device further along the loop. The router decrements the first byte of the path
address, and transmits the frame out of the opposite port.
Thus, a frame travels around the loop only as far as it needs to, minimizing loop
traffic. Minimization of traffic is known in SSA terms as spatial reuse.
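The routing rule can be summarized in a few lines of Python. The frame layout
here is deliberately simplified (a bare path address and channel number), so
this is a sketch of the rule described above, not of the SSA frame format:

def route_frame(path_address, channel):
    """Apply the three-way router rule to an inbound frame."""
    first = path_address[0]
    if first == 0:
        # The frame is for this node; the channel address routes it within the node.
        return ("deliver_to_node", channel)
    # The frame is for a node further along the loop: decrement the first path
    # byte and transmit the frame out of the opposite port.
    return ("forward_out_opposite_port", bytes([first - 1]) + path_address[1:])

# A frame with a first path byte of 2 is forwarded by two intermediate nodes
# and delivered by the third, so it travels no further than it needs to.
action, addr = route_frame(b"\x02", channel=1)    # ('forward_out_opposite_port', b'\x01')
action, addr = route_frame(addr, channel=1)       # ('forward_out_opposite_port', b'\x00')
print(route_frame(addr, channel=1))               # ('deliver_to_node', 1)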
At the start of the sector, 8 bytes are used by IBM AS/400 systems and are not
used when the VSS is attached to UNIX hosts. The data portion of the sector
remains at 512 bytes for all systems. A 2-byte sequence number and a 2-byte
longitudinal redundancy check (LRC) increase the size of the sector to 524 bytes.
The sequence number is a modulo-64K value of the LBA of this sector and is
used as an extra method of ensuring that the correct block is being accessed.
The LRC, generated on the SCSI host adapter, is calculated on the 520 data and
header bytes and is used as an error-checking mechanism as the data
progresses from the host, through the VSS storage server, into the SSA adapter,
and on to the RAID array (see Chapter 4, “Versatile Storage Server Data Flow”
on page 109 for a detailed description of data flow through the subsystem). The
LRC is also used as an error-checking mechanism as the data is read from the
array and passed up to the host adapter. The sequence number and LRC are
never transferred to the host system.
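The 524-byte layout can be illustrated with a short Python sketch. The
modulo-64K sequence number follows the description above; the XOR-per-16-bit-word
LRC is only a stand-in, because the actual LRC algorithm is not specified here:

import struct

def lrc16(payload):
    """XOR the payload as 16-bit words (a stand-in for the real LRC)."""
    value = 0
    for (word,) in struct.iter_unpack(">H", payload):
        value ^= word
    return value

def build_sector(user_data_512, lba, header=bytes(8)):
    assert len(user_data_512) == 512 and len(header) == 8
    seq = lba % 65536                      # modulo-64K sequence number of the LBA
    lrc = lrc16(header + user_data_512)    # LRC over the 520 header and data bytes
    return header + user_data_512 + struct.pack(">HH", seq, lrc)   # 524 bytes in total

def check_sector(sector, expected_lba):
    seq, lrc = struct.unpack(">HH", sector[520:])
    return seq == expected_lba % 65536 and lrc == lrc16(sector[:520])

sector = build_sector(bytes(range(256)) * 2, lba=123456)
print(len(sector))                      # 524
print(check_sector(sector, 123456))     # True
print(check_sector(sector, 123457))     # False: wrong block caught by the sequence number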
In this chapter, we investigate the flow of I/O operations in the Versatile Storage
Server.
RAID-5 Terminology
First, we define the RAID terms that are used in this chapter.
Hardware
Next, we describe the cache and nonvolatile storage (Fast Write Cache)
architecture of the Versatile Storage Server.
Concepts
Here we describe several important concepts unique to the VSS. It is important
to understand these concepts in order to understand the algorithms the Versatile
Storage Server uses to manage cache and access to the attached SSA disks.
RAID array
The RAID Advisory Board defines a RAID array as “a disk array in which part of
the physical storage capacity is used to store redundant information about user
data stored on the remainder of the storage capacity.” We use the term RAID
array to refer to the 6+P or 7+P group of member disks of the RAID-5 disk
array in the VSS subsystem.
Strip
A strip, sometimes called a stripe element, is the unit of consecutively addressed
blocks on a disk that are logically followed by a strip on the next data disk in the
array. In the VSS, the strip size is 32 KB.
In the VSS subsystem, the redundancy group stripe depth is four stripes for a
7+P array, and five stripes for a 6+P array.
Virtual disk
The total space in a disk array can be divided into virtual disks of different sizes.
In the VSS subsystem, the virtual disks appear to the external SCSI host as a
physical disk or hdisk.
A RAID array consists of strips of arbitrary but fixed size on each of the member
disks of the array.
The total space in the RAID array can be subdivided into virtual disks.
The redundancy group stripe size must be small enough to effectively distribute
the RAID parity update workload across all the member disks of the array.
Otherwise, a RAID-5 array can have the parity “hot spot” characteristics of a
RAID-4 array.
The total Fast Write Cache in the Versatile Storage Server depends on the
number of SSA adapters in the subsystem.
In addition to mirroring the Fast Write Cache, the primary purpose of the SSA
adapter volatile cache is to improve performance for RAID-5 update writes by
caching recently accessed data and RAID parity sectors.
The total adapter volatile cache in the VSS depends on the number of SSA
adapters in the subsystem, usually one adapter per drawer.
There are two processing clusters in the storage server, and each can have
access to multiple hosts through multiple SCSI ports and to multiple RAID arrays
through multiple SSA adapters. For simplicity, this diagram, which will be used
repeatedly in this section, shows only the path used by a single I/O. Typically,
many concurrent I/Os will be processed by a single VSS subsystem, but this
concurrency does not affect the data flow used in processing a single I/O
operation.
The choice of storage server cache size is dictated by two factors: the size of the
disk backstore in gigabytes and the workload characteristics. Because UNIX
systems typically provide very efficient disk caching in memory, larger cache
sizes are most useful in VSS subsystems with a large disk backstore capacity or
in environments where host disk caching is not highly effective.
The storage server cache is not involved in protection of fast write data. Fast
write data is protected by SSA adapter cache and Fast Write Cache. Write data
can be LRU destaged from storage server cache whether or not it has been
written from Fast Write Cache to disk.
Data only
RAID parity is not stored in the storage server cache. RAID parity is never
transferred from the SSA adapters to the storage server cache.
RAID parity data is not stored in Fast Write Cache, but an indication of required
pending parity updates is. This is done to preserve Fast Write Cache space and
because the SSA adapter can reread the data necessary to create parity if a
volatile cache failure occurs.
If two SSA loops are attached to the adapter, they share the Fast Write Cache. It
is not statically partitioned across the two loops.
The Fast Write Cache is packaged as a removable card with an integral battery.
The battery, which is a lithium battery and therefore not rechargeable, maintains
the Fast Write Cache contents if the VSS is powered down for any reason.
During a normal shutdown cycle, data in Fast Write Cache is destaged to disk
before power is turned off to the subsystem.
The battery is integrated into the card, so retention of data stored in Fast Write
Cache is ensured even if the card is removed from the SSA adapter. The battery
is designed to provide retention of data for up to 10 years.
32 MB of DRAM
Each SSA adapter contains 32 MB of DRAM, in two independent 16 MB chips. If
there are two SSA loops attached to the SSA adapter, all cache on the adapter
card is shared between the two loops. Least recently used (LRU) algorithms are
used to manage adapter cache and Fast Write Cache across all arrays attached
to both loops.
Data integrity is ensured by storing two copies of all write data on the SSA
adapter card before task end or a write command is sent to the host. Of the
SSA adapter DRAM, 4 MB are used to mirror the contents of Fast Write Cache so
that data integrity is ensured in case of a Fast Write Cache failure.
When data is destaged to disk from Fast Write Cache, it is actually the volatile
cache copy that is used. Technically speaking, the Fast Write Cache mirrors the
4 MB portion of SSA adapter volatile cache, not the other way around.
About 26 MB of the DRAM are used to store user data read from disk and parity
data read from and written to disk. Although it is possible for the SSA adapter to
process a read operation as a cache hit against the adapter cache, the primary
purpose of the adapter cache is to enable cache hits when performing RAID-5
write processing. Because a RAID-5 update generates a read old data, read old
parity, write new data, write new parity sequence, multiple updates to the same
stripe result in a write to parity followed by a read of the data just written (the
old parity for the second write). If the parity data written is still in the SSA
adapter cache, the read of old parity data can be avoided.
While it is also possible to avoid the read of old data by a cache hit in the
adapter cache if data in the same track is written twice, the write preempt
capability of the Fast Write Cache management is likely to result in a single
destage of the blocks. This is described in “Write preempts” on page 156.
The SSA adapter uses a portion of the DRAM for control blocks. In this space,
the adapter keeps track of information such as the SSA loop topology, which
could change as a result of a disk or SSA cable failure.
SSA loop topology is not stored in Fast Write Cache since the topology can be
determined again if necessary by the SSA adapter.
For reads, the disk stages into the buffer the sectors requested and following
sectors up to a total of 64 KB. For writes, only the sectors written are placed in
the buffer.
Cache concepts
We discuss the concept of a 32 KB track and how it relates to cache
management.
Cache transparency
The VSS cache is transparent to SCSI applications.
System additions
The system adds a block sequence number and longitudinal redundancy check
bytes to the user data. Space is reserved for an 8 byte AS/400 header whether
the host is a UNIX host or an AS/400. AS/400 hosts always write data in 520 byte
blocks.
Cache segment
Cache is managed in 4 KB segments. These are segments of 4096 bytes, not
segments of eight 524-byte blocks. This means that space in storage server
cache will always be allocated as an integral number of 4 KB segments.
Disk track
The VSS uses the concept of a disk track, which is equivalent to the RAID-5 strip
size. The disk track is used in cache management, but is not externalized to the
host.
A system administrator will define a file system block size for a UNIX file system.
This is both a unit of allocation for the file and a unit of disk I/O; 4 KB is a
common file system block size. While the SCSI commands used by a UNIX host
can address 512 byte sectors, the UNIX host reads and writes fixed-length
groups of sectors, or blocks.
A set of contiguous blocks that are part of the same 32 KB VSS track will require
from two to nine 4 KB cache segments, depending on the file system block size
and the number of blocks staged to cache.
Read cache hits don′t require any SSA back end I/O.
Where SSA back end I/O is required, both how much data is read and how many
I/Os are used to read the data are determined by the VSS.
With adaptive cache, the VSS may choose to read just the block requested, the
block requested and the balance of the 32 KB track, or the entire 32 KB track.
In the case of a 4 KB block random read, either no back end I/O will be
generated if the block is in storage server cache, or a single read of up to 32 KB
will be generated.
Virtually all SCSI front end writes will generate SSA back end writes, except
where a block is written twice before it has been destaged from Fast Write
Cache. How data is destaged is controlled by the Fast Write Cache management
algorithms, described later in this chapter.
How many SSA back end writes are generated from a SCSI write is a function of
the size of the write and whether it generates a stripe write or an update write.
For update writes, as opposed to sequential writes, the VSS writes one or more
blocks as part of a 32 KB track. Update writes require a parity update, which
invokes the read data, read parity, write data, write parity sequence. Each of
these steps consists of a block or a set of contiguous blocks that are part of the
same 32 KB track.
In the case of sequential writes, the VSS uses stripe writes, avoiding the RAID-5
write penalty. Stripe writes are described later in this chapter.
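The difference in back-end work can be sketched with some simple accounting in
Python, assuming a 7+P array with 32 KB strips as described earlier. This is a
rough illustration of the two paths, not a performance model:

STRIP_KB = 32
DATA_STRIPS = 7                      # 7+P array: seven data strips plus parity

def backend_ops(write_kb, sequential):
    full_stripes, remainder = divmod(write_kb, DATA_STRIPS * STRIP_KB)
    ops = full_stripes * (DATA_STRIPS + 1)   # stripe write: 7 data writes + 1 parity write
    if remainder:
        if sequential:
            ops += 0   # held in Fast Write Cache until the stripe is complete
        else:
            ops += 4   # update write: read data, read parity, write data, write parity
    return ops

print(backend_ops(4, sequential=False))    # 4: the RAID-5 write penalty sequence
print(backend_ops(224, sequential=True))   # 8: one full stripe, no penalty
print(backend_ops(448, sequential=True))   # 16: two full stripes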
Since SCSI I/O gives no explicit signal that a sequential I/O pattern is under way,
the VSS detects this based on the residence in cache of logically contiguous 32
KB tracks. It is important to emphasize that these are logically contiguous, not
physically contiguous, tracks, and that logically contiguous 32 KB tracks are
stored on different member disks in the array.
Once the VSS has determined that a sequential read operation is under way, it
begins anticipatory staging of data.
Adaptive cache
For random reads, the VSS performs adaptive caching.
Depending on the access patterns for data, the VSS will either stage just data
read to cache, or the data read and the balance of the 32 KB track, or the entire
32 KB track. The choice of caching algorithm is adaptive, so if different data is
stored on the disk, the caching algorithm will adapt to the access pattern of the
new data. Similarly, if there is a change in access pattern of data, the algorithm
will adapt.
Because the sequential prediction algorithm cannot predict when the sequential
retrieval will stop (it has no knowledge of the file system using the logical disk),
The SSA disks used in the VSS will read the data requested and the following
sectors up to 64 KB. The 32 KB reads issued by the VSS will result in 64 KB
being read into the SSA disk buffer. The VSS will issue reads for four or five 32
KB tracks on each disk (depending on whether the array is 6+P or 7+P) in
rapid succession. The SSA disks can accept up to three I/O operations before
the first is performed.
This design allows the VSS to read up to four 32 KB blocks in a single rotation,
yet still receive a task-complete indication from the SSA back end when data
from the first stripe is available. Data being read sequentially is returned to the
SCSI host as soon as it is requested and as soon as it is in the storage server
cache. The storage server requires a task-complete indication from the SSA
back end to know that data from a stripe is available in cache to be transmitted
to the host.
If many reads are issued for data that would have been staged had a different
cache management algorithm been used, the VSS may change to a more
appropriate cache management algorithm for that data. The default is to cache
blocks read.
With partial track staging, the block requested and the balance of the 32 KB track
are staged into cache.
The same caching algorithm is used for all data in a band. The choice of
algorithm is dynamic, but if it changes, it changes for access to all data in a
band.
Bands do not span virtual disks. A virtual disk will be divided into bands of 1920
tracks, except that the last band will probably have fewer than 1920 tracks. A
new virtual disk will always begin a new 1920-track band.
A back access is a miss where the block requested follows the block that caused
the track to be in cache. A back access is an indication that partial track staging
would have been beneficial for this data.
A front access is a miss where the block requested precedes the block that
caused the track to be in cache. A front access is an indication that full track
staging would have been beneficial for this data.
The choice of caching algorithm for a band does not affect sequential prestage
for data in that band.
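One plausible way to picture this, in Python, is a pair of per-band miss
counters that trigger a change of staging algorithm. The counters, thresholds,
and switch conditions here are our own assumptions for illustration; the actual
VSS adaptive cache algorithm is not documented at this level:

class Band:
    def __init__(self):
        self.algorithm = "record"   # default: cache only the blocks read
        self.back_accesses = 0      # misses just after a cached block: partial track helps
        self.front_accesses = 0     # misses just before a cached block: full track helps

    def record_miss(self, kind):
        if kind == "back":
            self.back_accesses += 1
        elif kind == "front":
            self.front_accesses += 1
        self._reevaluate()

    def _reevaluate(self, threshold=10):
        # If many misses would have been hits under a different algorithm,
        # switch the whole band to that algorithm.
        if self.front_accesses > threshold:
            self.algorithm = "full"
        elif self.back_accesses > threshold:
            self.algorithm = "partial"

band = Band()
for _ in range(11):
    band.record_miss("back")
print(band.algorithm)   # 'partial' once back accesses exceed the threshold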
This is an introductory chart that describes I/O data flow. For this discussion, I/O
operations are characterized as random reads, sequential reads, and writes.
The specific flow for each of these I/O types is described in the following foils.
Random reads
Read operations reading a small amount of data (usually one or two blocks) from
a disk location, then seeking a new location before reading again, are called
random reads . Random reads are typical of transaction-based systems and are
usually response-time sensitive. We describe the processing flow of a random
read that is a cache hit in the storage server cache and a random read that
results in a cache miss in the storage server cache.
Sequential reads
Read operations reading large contiguous blocks of data, such as file system
backups and database scans used in decision support applications are called
sequential reads . Sequential read and write operations are usually
throughput-sensitive rather than response-time sensitive; the sustained
throughput (in megabytes per second) is usually more important than the
response time of an individual read operation.
The discussion of sequential reads and writes pertains to reads and writes in
LBA sequence. A UNIX file system will perform sequential readahead, and the
file system knows
whether data for a given file is stored in more than one extent on the logical
volume. A UNIX file system considers sequential access to be sequential
relative to the file system, the VSS considers sequential access to be sequential
relative to the VSS virtual disk. Except in very badly fragmented file systems, a
UNIX sequential read of a file 10 MB or larger will generate at least some I/O
considered sequential by the VSS.
Writes
Because much of the processing of random writes and sequential writes is the
same, they are covered in a single discussion.
If the read request can be partially satisfied from cache but also requires some
sectors not in the cache, then the read request is treated as a cache miss. Only
those sectors not already in the cache are read from the SSA disks. There is no
reason to reread data from disk that is already in the cache. Once all the data
requested is in the storage server cache, it can be returned to the SCSI host.
If at least some of the data requested is not in the cache, it must be read from
the SSA disks. Only the data not in the cache is read from disk.
It is unlikely that a host program will know whether a seek will be necessary to
perform a read. In the VSS subsystem, what is seen by the host as a physical
disk is actually a virtual disk spread across the member disks of a RAID array.
The primary use of the disk buffer is to allow data transfer on the SSA loop
asynchronously to the rotational position of the disk surface in relation to the
heads. This allows the interleaved transmission of frames from many disks on
the SSA bus.
Data is read from the disk surface to the disk buffer, and can then be placed on
the SSA loop as frames are available and as dictated by the SSA SAT protocol.
Data read from the disk is transferred on the SSA loop to the SSA adapter, where
it is placed in the cache. It is then also transferred to the storage server cache.
The queue used for cache management is updated for both the adapter cache
and for the storage server cache.
When all data necessary to satisfy the host request is in storage server cache, it
can be transferred across the SCSI bus to the host.
In a case where the adaptive caching algorithm in the VSS storage server has
determined that full or partial track staging should be used, the blocks requested
will not be available for transfer to the host until the completion of the staging
operation.
Even if the host is requesting data faster than the sequential prestage can
provide it, sequential throughput is still maximized because the VSS is issuing
back end I/Os designed to exploit the capabilities of the SSA disks and the SSA
bus.
This is done because the RAID-5 write penalty can be avoided through the stripe
write without the use of Fast Write Cache for a large write.
Data integrity is ensured because there are two copies of the data stored in the
SSA adapter, one in adapter volatile cache, and one in Fast Write Cache. In the
unlikely event of a failure in the SSA adapter, the Fast Write Cache copy ensures
integrity. In the even less likely event of a failure of the Fast Write Cache
component, the SSA adapter cache copy will be used to perform writes.
This approach minimizes the Fast Write Cache space used to ensure data
integrity for RAID-5 parity. In the unlikely event of an SSA adapter failure, the
data required to regenerate the parity is read from disk.
Stripe writes increase the sequential write throughput of the VSS by reducing the
number of disk I/Os required to write data. By holding data sequentially written
in the Fast Write Cache until a full stripe is collected, the VSS can realize this
throughput improvement regardless of the host application's write blocking.
Write preempts
Write preempts occur when a block still in Fast Write Cache is updated again.
The VSS allows the update to occur in Fast Write Cache without the block first
being written to disk. This in turn allows a block that is frequently written, such
as a file system superblock, to be written to the SSA adapter cache and Fast
Write Cache but destaged fewer times than it was written, all while maintaining
full data integrity protection.
Adjacent blocks are destaged together to minimize arm movement and latency.
When a block that has reached the bottom (least recently used end) of the cache
management queue is selected to be destaged to disk, any other blocks adjacent
to it are destaged also. This allows more effective utilization of the target disk
drive and the drive containing associated parity.
Note that storage server cache is managed by 32 KB tracks, so that any blocks
in a 32 KB track will be destaged from storage server cache together. Fast Write
Cache is managed by blocks. Blocks that are part of the same 32 KB track are
destaged independently unless they are adjacent.
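A small Python sketch of this behavior: an LRU-ordered Fast Write Cache from
which the oldest block is destaged together with any adjacent dirty blocks, and
in which a rewrite of a cached block (a write preempt) simply replaces the
earlier copy. The data structure and names are assumptions for illustration,
not the adapter's internals:

from collections import OrderedDict

fast_write_cache = OrderedDict()   # block number -> data, least recently written first

def write_block(block, data):
    # A rewrite of a block still in Fast Write Cache is a write preempt:
    # the earlier copy is replaced and only the newest version is ever destaged.
    fast_write_cache.pop(block, None)
    fast_write_cache[block] = data

def destage_one():
    """Destage the least recently used block plus any adjacent blocks in cache."""
    victim = next(iter(fast_write_cache))
    group = {victim}
    for step in (-1, +1):                        # grow the run of adjacent block numbers
        block = victim + step
        while block in fast_write_cache:
            group.add(block)
            block += step
    for block in group:                          # destaged together: one read data, read
        fast_write_cache.pop(block)              # parity, write data, write parity sequence
    return sorted(group)

for b in (100, 101, 50, 102):
    write_block(b, b"old")
write_block(100, b"new")       # write preempt: block 100 updated in place
print(destage_one())           # [100, 101, 102]: 101 is oldest, 100 and 102 are adjacent
print(list(fast_write_cache))  # [50] remains for a later destage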
A 32 KB strip is written to the member disk containing parity for this redundancy
stripe.
This foil depicts three update writes to the same block. All three are written to
SSA adapter cache and Fast Write Cache. Only the last write is destaged to
disk, which involves the read data, read parity, write data, write parity sequence.
Write blocking will occur whether the blocks were written in the same or different
SCSI I/Os. Write blocking occurs if adjacent blocks are in Fast Write Cache and
one is chosen to be destaged.
This foil depicts two SCSI update writes that write adjacent blocks in the same 32
KB track. Both are written to SSA adapter cache and Fast Write Cache. When
the first write is destaged to disk, which involves the read data, read parity, write
data, write parity sequence, the second block is also destaged. In this case,
“read data” reads two adjacent blocks, as does “read parity.” The two write
operations also write two adjacent blocks. Both blocks are updated with what is
effectively a single RAID-5 write penalty sequence.
Because writes to a disk are written first to the disk buffer, the disk itself can
manage writing up to three 32 KB tracks in one operation if the subsequent 32
KB tracks are written to the disk before the first can be written from the disk
buffer to the media. This can occur if writes occur at a high rate as they can
during a file system restore or the creation of a logical volume copy.
If there has been no activity to the SSA adapter for 5 s, a small amount of data is
LRU destaged from Fast Write Cache even if the Fast Write Cache is not at the
level that normally triggers destaging.
During normal shutdown of the VSS, the VSS issues a command to the SSA
adapter to destage all data from Fast Write Cache to disk. The VSS waits for all
data to be destaged before shutting down.
This is not required to ensure the integrity of fast write data, since data would
remain in Fast Write Cache until the machine was again powered up. It is done
so that system administrators who want all data destaged to disk on normal
shutdown will be satisfied.
Theoretically, data being read and being written could be moving in the same
direction on the SSA bus. In the VSS, this will not occur, since the SSA adapter
sends data to a disk by the shortest route.
As a result, any write that does not cause the VSS to generate a stripe write as
a single I/O will be processed as a fast write. The VSS will generate a stripe
write as a single I/O for any write including one or more full stripes.
The strips in the stripe are XOR′d together to create the parity strip. Note that
parity generation is always performed in adapter cache in the VSS.
Adaptive cache
The adaptive cache algorithms of the VSS increase the number of read cache
hits for a given cache size. The adaptive cache algorithms dynamically observe
the locality of reference of data stored on the VSS, and choose a cache staging
algorithm based on the locality of reference observed. Data with high locality of
reference will have more data loaded into cache for each cache miss read. Data
with low locality of reference will load just the data read into cache, preserving
cache for other uses.
Fast writes
Fast write masks the RAID-5 write penalty. While update writes in a RAID-5 disk
subsystem entail an unavoidable overhead, the effects of the RAID-5 write penalty
can be masked by fast write.
Drawer Design
This foil shows the topics we discuss in this chapter. In this chapter, we present
the drawer design of VSS. The drawer contains 8 or 16 disk drives and carries
the internal SSA path that connects the disks installed in the drawer. The
drawer plays a key role in the SSA loop configurations, and is the physical
interface between the SSA RAID adapter and the disks. In addition, each drawer
has three power supplies and cooling fan modules so that the drawer can
continue operation in case one of the power supplies, or a cooling fan, fails. We
examine the hardware functions and, finally, explain the component interaction
around the drawer.
The connection between two drawers or between the host and the drawer is the
SSA cable, which attaches to the port indicated in the foil. The ports of the
drawer include a function called the host bypass circuit (this function is
explained in the next foil).
Buffer
The disk drive has a 512 KB buffer that can improve the I/O performance of the
disk drive and VSS. The data to be written or read is stored in this buffer and
then physically written to the disk surface or sent to the initiator. The initiator
here means the SSA RAID adapter and the target means the disk drive. In the
write operation, the data to be written can wait in this buffer for access and
seeking. In the sequential read operation, this buffer improves performance by
prefetching the data from the contiguous block.
In terms of cost, one 9.1 GB disk drive is less expensive than two 4.5 GB disk
drives. Also, the maximum capacity to which one VSS can be configured is 2.0 TB
using 9.1 GB disk drives and 1.0 TB using 4.5 GB disk drives, if all RAID arrays
are configured as 6+P+S.
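These capacity figures can be checked with simple arithmetic, assuming the
maximum configuration of 18 drawers described earlier, each holding two 6+P+S
arrays (six data disks per array):

drawers = 18
arrays_per_drawer = 2
data_disks_per_array = 6          # 6+P+S: the parity and spare disks hold no user data

for drive_gb in (9.1, 4.5):
    usable_gb = drawers * arrays_per_drawer * data_disks_per_array * drive_gb
    print(f"{drive_gb} GB drives: about {usable_gb / 1000:.1f} TB usable")
# 9.1 GB drives: about 2.0 TB usable
# 4.5 GB drives: about 1.0 TB usable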
Routing function
Each disk drive is recognized as a node by the initiator. The initiator means the
pair of ports on the SSA RAID adapter. Each SSA RAID adapter has two
initiators because each adapter has four SSA ports. Because there are multiple
disks (up to 48) on one loop and the loop path is shared by all nodes, the control
command or data may have to pass through the other disks before arriving at
the destination disk. Each disk is assigned an individual address on the loop
when the configuration or reconfiguration is executed on the VSS storage server.
The SSA RAID adapter sends the data or command with the destination address
of the disk on the loop. When a disk receives data or a command, it arrives as
an SSA frame through the SSA cable, and the three-way router checks the address
header contained in the frame. If the address is zero, the three-way router
recognizes that the frame is addressed to this node, and the data or command is
passed to the disk drive. If not, the three-way router decrements the address
value and sends the frame on to the adjacent node (most often a disk). All
initiators and targets on the SSA loop route frames in this way, in compliance
with the SSA protocol.
For details of the disk drive technology, such as the disk format, see Chapter 3,
“Versatile Storage Server Technology” on page 43.
PFA
PFA monitors key device indicators for change over time or exceeding specified
limits. The device notifies the system when an indicator surpasses a
predetermined threshold.
Monitoring
PFA monitors the performance of the disk drive in two ways: a
measurement-driven process, and a symptom-driven process. The
measurement-driven process is based on IBM′s exclusive generalized
error-measurement feature. At periodic intervals, PFA′s generalized error
measurement (GEM) automatically performs a suite of self-diagnostic tests that
measure changes of the characteristics of the disk drives. The GEM circuit
monitors the head fly height on all data surfaces, channel noise, signal
coherence, signal amplitude, writing parameters, and more. From those
measurements and equipment history, PFA then recognizes the specific
mechanisms that can cause the disk drive failure.
The symptom-driven process uses the output of the data, nondata, and
motor-start error-recovery logs. The analysis of the error logs is performed
periodically during idle time. When the PFA detects a measurement that exceeds
the threshold, the host system is notified.
SMART
The reports and warnings generated by PFA comply with the industry-standard
Self-Monitoring, Analysis, and Reporting Technology (SMART). The PFA functions
invoked when the disk drive is idle measure the conditions of each head and
save the data to a log in a reserved area of the disk. This log data can be
retrieved and analyzed.
Log recording
The Ultrastar 2XP disk drive periodically saves the data in the logs located in a
reserved area on the disks. These reserved areas on the disks are not included
in the area used for storing customer data. The log recording is an idle-time
function that is invoked only if the disk drive has been idle for 5 s or more. The
log recording takes approximately 30 ms, and the logs are saved about every
half-hour. The logs are used for failure analysis.
Disk sweep
If the Ultrastar 2XP disk drive has not processed a SCSI command for at least 40
s, the disk drive executes the disk sweep, another of the idle-time functions. The
disk sweep exercises the heads by moving them to another area of the disk, and
it initiates a second movement of the heads if they fly in the same point for 9
min. This function ensures that low-use disk drive components do not become
vulnerable to errors through inactivity.
Thus the SSA loop recovers from a single point of failure.
By this process, access to data is allowed when a disk in the RAID array fails.
However, the failed disk should be replaced with a repaired or new disk as soon
as possible. The repaired or new disk can then become the hot spare disk. The
previous hot spare disk is now the data or parity disk replacing Member 2. That
is, the hot spare disk becomes the member of the RAID array.
SSA connection
The drawer offers the SSA loop connection between the disk drives. This facility
eliminates the cabling between the disk drives installed in the same drawer.
Host connectivity
We discuss supported hosts, adapter options and factors that influence
connectivity choices.
Storage server
The storage server includes four-way SMP clusters. Cache size can be a big
factor in influencing system performance; the minimum size is 512 MB and the
maximum is 6 GB.
Racks
In many ways the number of racks is an easy decision; it is based mainly on the
number of disk drawers to be installed. Where a customer wishes to incorporate
existing racks of 7202 or 7015-R00 SSA disk drawers, there are factors to
consider.
Power supplies
An optional power supply battery backup can be considered.
Configuration
We discuss the configuration methods and factors that influence the
configuration such as availability and performance considerations.
Number of hosts
Each host will require at least one connection to the VSS. Each host adapter has
to be configured so that it is “owned” by one of the two SMP clusters. This
means that the adapter can access only disks that are attached to its cluster. If
the host needs to access disks that are attached to the other cluster, then a
second host connection to the VSS is required. In practice, to balance the
workload across the two clusters, a host's disks will in most situations be
spread across the two clusters; therefore, at least two connections per host are
required.
Volume of data
Each Ultra-SCSI adapter in a host can support up to 15 targets with 64 LUNs per
target. In VSS, we can configure the maximum LUN size to be 32 GB.
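The arithmetic behind these limits, using only the figures quoted above:

targets_per_adapter = 15
luns_per_target = 64
max_lun_gb = 32

max_luns = targets_per_adapter * luns_per_target
print(max_luns)                          # 960 LUNs addressable per host adapter
print(max_luns * max_lun_gb / 1000)      # 30.72, roughly 30 TB of addressable LUN space

In practice this addressing limit far exceeds the maximum capacity of a VSS, so
the volume of data and the data rate, rather than addressability, normally
determine the number of connections needed.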
I/O rate
Very often in the early stages of designing a new system, the actual application
data and I/O rates are not known. Some guesstimate based on previous
knowledge has to be made. Where assumptions have been made, they should
be documented and validated as the project progresses.
Availability
The loss of a host-to-VSS connection means that the host cannot access the data
that is normally available by that route. If this loss of data availability would
seriously affect the organization′s business operations, then an alternative path
to the data should be configured using a second adapter on the host and VSS.
Availability can be further improved by configuring a second host that uses
HACMP-type software to provide protection in the event of primary host failure.
Note Not all host operating systems will support adapter failover.
Backup requirements
With the increasing use of around-the-clock operations, the time available for
system backups is decreasing. It is important to consider the demands that
system backups make on the storage system when doing system design and
configuration. When online backups are carried out, an even greater workload is
placed on the system. The backup requirement may change the data rate and
I/O rate requirement but is less likely to affect the storage volume unless
backups are going to be made from a mirrored copy of the data.
Summary
There are four factors that influence the number of host-to-VSS connections
required:
• Volume of data
• Data rate
• I/O rate
• Availability requirement
Normally one of these factors will be the bottleneck and determine how many
connections are required.
Advanced software
VSS provides the capability to share data as part of the base system. No special
features are required to utilize data sharing. Data sharing is under the control of
application code to regulate data access and prevent corruption.
VSS as standard comes with 512 MB of cache. This is split across both of the
SMP clusters. The cache can be upgraded from 512 MB to a maximum of 6 GB
(3 GB per cluster).
Processors
The type of processors required in each cluster is a function of the I/O rate that
the cluster must process. I/O rate has more effect on the processors than the
data rate.
Number of adapters
As standard, the VSS comes with one SSA disk adapter. The maximum number
of adapters that can be installed is eight. The number of adapters required is a
function of the volume of disk storage.
Guidelines
There is one SSA loop per disk drawer, and each drawer will house either one
or two RAID arrays of eight SSA disks. Unless the VSS has more than 16 disk
drawers, the standard method of connecting adapters to disk drawers is to have
one SSA loop per disk drawer. Each disk adapter will therefore be supporting up
to two disk drawers. If more than 16 disk drawers are required, then there will
be two adapters supporting two disk drawers (that is, 32 disks).
Configuration rules are based on the need for maximum performance. The base
configuration includes two SSA adapters and two SSA disk drawers (32 disk
drives). As disk drawers (7133s) are added to the configuration, additional SSA
adapters are also added.
State-of-the-art technology
IBM disk drives are acknowledged as leaders in the field of disk-drive
technology. They have many advanced features, such as PRML and
third-generation MR head technology (see Chapter 3, “Versatile Storage Server
Technology” on page 43).
Server
A separate storage pool can be configured for each server attached to the VSS.
The number of LUNs and their size will depend on two main factors: first, the
total volume of storage required by the server and, second, the characteristics of
the workload. For example, if the work involves the use of large sequential I/O,
then a small number of large LUNs is appropriate, while for a random workload
with very high I/O rates, having more small LUNs spread over several disk
arrays is more appropriate (see Chapter 8, “Versatile Storage Server
Performance” on page 251). Note that a single LUN cannot span more than one
disk array.
VSS
The configuration of the actual disk arrays into a logical representation is done
using the IBM StorWatch Versatile Storage Specialist (VS Specialist), a
web-browser-based management tool.
Disk storage
Included in the base 2105-B09 rack are four RAID-5 arrays (32 9.1 GB drives).
All VSSs are RAID-5 protected. The first RAID array in an SSA loop must include
a hot spare drive. All drives in a loop must have the same capacity, either 4.5
GB or 9.1 GB. There is room in the 2105-B09 rack for two additional 7133
drawers. Each drawer can support up to two RAID-5 arrays. New or existing
7133s (Model 10 or 20) can be placed in the rack. Use 7133 Feature 2105 when
Power supply
The power supply and control system is fully redundant. Power is supplied
through two power cords with either 50 ampere (single phase) or 50/60 ampere
(three phase) connectors with each cord capable of providing 100% of the power
requirements. The type of power supply must be specified. The power unit in
the 2105-100 is controlled by the main power control unit in the 2105-B09 storage
server rack.
SSA cables
7133 drawers shipped with the 2105 racks (using 7133 Feature 2105) do not
require SSA cables. These cables are provided with the rack. If additional new
7133s (using 7133 Feature 2106) are ordered, then SSA cables need to be
ordered with the 7133. Two 5 m cables are standard. If longer cables are
required, please order the 10 m or 25 m lengths.
If existing 7133s are placed in the 2105-100 rack, you may need new cables,
depending on the length. Cables of 5 m or longer are required.
Note: If existing racks are used, their power control is not integrated into the
power control unit in the 2105-B09 storage server rack.
In either case, the disk drives must be reformatted to 524-byte sectors before
they can be used in VSS.
Redundant power
All power supplies are fully redundant. Power is supplied through two power
cords with either 50 ampere (single phase) or 50/60 ampere (three phase)
connectors with each cord capable of providing 100% of the power
requirements. This flexibility allows VSS to be installed into existing
environments with a minimum of disruption. The type of power supply must be
specified.
Optional battery
An optional battery backup system is also available. The battery is designed to
assist with system shutdown in the event of catastrophic power failure. It will
also provide power during a temporary loss, such as a brownout. If the optional
battery is installed, we recommend that it be installed in all of the VSS racks.
The battery can provide power for several minutes.
Service indicators
There are indicators for when service is required and for power availability.
Note: If existing 7202, 7014-S00 and 7015-R00 racks are used, they are not
integrated into the 2105-B09 power control unit.
Intranet access
Configuration access is usually through a customer-controlled network, or
intranet. Using an existing customer intranet allows the user to manage the
complex from his or her usual place of work and usual desktop. The connection
on VSS is Ethernet; if token ring access is required, it must then be made via a
router.
Systems management tasks can also be carried out using this interface. SNMP
alerts can be routed to the appropriate location or server. Error logs can be
viewed and appropriate remedial action taken. The microcode level can be
viewed and new levels applied.
Number of hosts
This is the number of hosts that will be attached to the VSS.
Volume of data
This is usually a relatively easy figure to obtain. The ease with which VSS can
be upgraded online means that if some of the initial sizing assumptions are
wrong, it is relatively easy to rectify them later.
Data rate
The data rate affects two areas of the subsystem:
1. Host to VSS connection
2. Disk to VSS connection
Care must be taken when designing a system that the overall configuration is
balanced and that all the parts of the configuration are synchronized. The host
to VSS connection is affected only by the data rates of the applications running
on the individual host. The disk-to-VSS connection data rate is the sum of the
data rates of all the LUNs that are configured on that adapter. (See Chapter 8,
“Versatile Storage Server Performance” on page 251.)
Availability
VSS has been designed to have zero exposure to loss of data integrity, and all
maintenance can be carried out online without the need to power down attached
host systems. There can, however, be failures during which a host cannot
access its data. If the host operating system supports adapter failover, consider
providing alternative paths to the same LUN. If it does not, consider disk
mirroring. Mirroring across LUNs on different disk adapters should be
considered if disk adapter failure is a concern.
Connection
As with any storage subsystem, the VSS subsystem must be physically
connected to any host that wishes to use it to store data. While making a
physical connection is typically a quick and easy operation, it requires that the
host system be shut down and powered off to plug the cable into the host. If the
VSS is to be accessed by a number of different hosts, connecting to each host
should be coordinated in order to minimize the necessary downtime.
The VSS cabinets, along with the 2105-B09 and 2105-100 racks, have dual power
line cords to facilitate their high availability. Customers should ensure that they
have enough power outlets to accommodate the number of racks being installed.
All host platforms must have, at the very least, a differential SCSI-2 fast and wide
adapter, although an Ultra SCSI adapter is the preferred host adapter, in order to
take advantage of the extra capabilities of Ultra SCSI in the VSS.
IBM RS/6000
All IBM RS/6000 servers support VSS, including the SP.
Operating system levels required to support the VSS are 4.1.5, 4.2.1, 4.3 and
4.3.1. The VSS ships with a small installation package, similar to that of other
external IBM storage subsystems; it contains the vital product data (VPD)
necessary to populate the AIX ODM database. The VPD allows AIX to see the
VSS as a VSS, not as a “generic SCSI disk.”
From the next release of AIX, the VPD for the VSS will be included in the base
releases.
Sun Microsystems
Sun systems supporting the VSS are the Ultraserver Models 1000, 1000E, 2000,
2000E, 3000, 4000, 5000, 6000. They require Solaris versions 2.5.1 or 2.6. No
other special software is required.
Hewlett-Packard
The VSS is supported on HP 9000-800 series, D, E, G, H, I and K series, T series
and EPS systems. HP-UX levels 10.01, 10.10, 10.20 or 10.30 are required. No
other special software is required.
Compaq
The VSS is supported on Compaq ProLiant Models 3000, 5000, 5500, 6500, and
7000. Compaq requires Windows NT 4.0. No other special software is required.
UNIX systems
For UNIX-based systems, the VSS emulates a SCSI disk drive. The host system
will access the VSS's virtual drives as if they were generic SCSI disk drives.
The AIX operating system will contain entries in its ODM database to identify the
VSS properly, but will access it using its generic SCSI disk driver. The VSS will
appear to be a standard physical volume or hdisk to AIX. The VSS will appear
similarly to Solaris and HP-UX systems.
When using Ultra or wide SCSI interfaces, a total of 16 devices can be connected
to the bus. The initiator (a device capable of initiating a SCSI command, usually
an adapter) uses one address, leaving 15 addresses for devices, typically called
targets. Each target can have a total of 64 LUNs for
Ultra SCSI or 32 LUNs for SCSI-2 wide. A VSS can be configured to appear as
up to 64 LUNs per SCSI interface, each LUN having a capacity of 0.5 GB up to 32
GB (valid LUN sizes are: 0.5, 1, 2, 4, 8, 12, 16, 20, 24, 28, and 32 GB).
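As an illustration only (the figures here are assumptions, not recommendations),
a single Ultra SCSI port configured with 64 LUNs of 8 GB each would present
512 GB of storage to the attached host over that one SCSI attachment.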
AS/400 systems
The VSS emulates 9337 subsystems when attached to an AS/400. As AS/400s
require 9337 subsystems to have a minimum of four drives and a maximum of
eight, the VSS must be configured to support a four- to eight-drive subsystem.
The 9337-580 emulation requires logical 4 GB drives, while the 9337-590
emulation requires logical 8.5 GB drives. There is no relationship between the
physical disk drives and the logical volumes assigned to the AS/400. The AS/400
expects to see a separate device address for each disk drive in the subsystem,
so the VSS will report unique addresses for each virtual drive defined to the
AS/400.
A VSS in minimum configuration comes with four arrays of eight disk drives. It
is possible that some part of the array can be used as temporary storage for
archives instead of using a slower tape device.
There is a basic volume management product for Solaris, called Solstice, which
is available from Sun Microsystems. Similarly, the Veritas Volume Manager
(VxVM) and Veritas File System (VxFS) can be purchased as optional products
for Solaris. All of these volume managers provide tools and utilities to move
data—at a logical volume level, at a physical volume (disk drive) level, or by
selective use of the mirroring features of the volume managers.
Direct copy
If the data to be migrated resides as individual files on UNIX file systems, and no
volume management software is available, the next easiest method of moving
the data is to use a utility that supports a direct copy feature, such as cpio with
the -p (pass) option.
Command cpio is available on all of the UNIX operating systems that support the
VSS, and is easy to use. The command cpio reads a list of files to copy from its
standard input. The easiest way to produce the list of files is to use the UNIX
find command and pipe its standard output to the standard input of the cpio
command. The following output shows a typical example of the use of cpio to
move data.
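The example below is a minimal sketch of such a copy; the directory names
/olddata (the existing file system) and /newdata (the file system built on the
VSS) are assumptions used for illustration:
# cd /olddata
# find . -print | cpio -pdmv /newdata
The -d flag creates directories as needed, -m preserves modification times, and
-v lists each file as it is copied.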
There are a number of different utilities on UNIX systems with which to archive
data. The cpio command mentioned above can also create and read archives
on tape devices. Using the -o (output) option creates archives and the -i (input)
option reads and extracts data from archives.
AIX provides a pair of commands backup and restore. The backup command
has two different methods of creating archives, either by file name or by inode.
The restore command must be used to read any archive created by backup.
Solaris and HP-UX provide the dump and restore commands, which provide
backup and restoration by inode.
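As a sketch of the AIX pair (the tape device /dev/rmt0 and the file systems
/home and /newhome are assumptions for illustration), a level-0 backup by
inode and its matching restore might look like this:
# backup -0 -u -f /dev/rmt0 /home
# cd /newhome
# restore -rvf /dev/rmt0
The dump and restore commands on Solaris and HP-UX are used in a broadly
similar way.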
For databases that use raw file systems, logical volumes, or methods other than
a UNIX file system, it may not be possible to use the volume management
methods of migrating data, especially if the database uses volume serial
numbers for its licensing or validity checking. In that case, it may only be
possible to export the database from its old location and import it at its new
location. It is up to the
database software to provide some mechanism to move the data. This may take
the form of a standard database backup and restore if it does not have any
specific tools for movement of data.
For most purposes of data migration, these methods are preferred, as they work
below the file system level; they do not care what sort of data resides on the
logical volume, be it a UNIX file system or a raw database.
Terms
For the purposes of this discussion, the following terms refer to elements of, or
definitions used by, the AIX LVM:
• Volume — a collection of data
• Physical volume — a disk drive, upon which a collection of data is stored.
• Logical volume — how AIX views collections of data. Logical volumes reside
on physical volumes, but there is not necessarily any correlation between
the size or position of logical volumes and the physical volumes underneath
them. That is, a logical volume may span one or more physical volumes, or
there may be more than one logical volume on a physical volume.
Now that we have discussed some basic terms regarding the AIX LVM, let's
discuss how to copy and migrate data using the facilities provided by the LVM.
# cplv -v datavg -y newlv oldlv
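The first example copies the data in the logical volume oldlv into a new logical
volume named newlv, created in the volume group datavg. For the second
example described next, a minimal sketch of the command (assuming the target
logical volume existinglv has already been created) would be:
# cplv -e existinglv oldlv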
The second example shows the use of cplv to copy the data from the existing
logical volume oldlv to the existing logical volume existinglv (-e). When the -e
option is used, the existing target logical volume is overwritten with the data
from the source logical volume, and the characteristics of the existing target
logical volume are retained.
The command cplv is a good method for copying or migrating a single logical
volume. Sometimes, however, it may be necessary to migrate all the data from
a physical volume. The next foil discusses migratepv.
# migratepv hdisk1 hdisk2
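This first example moves all data from hdisk1 to hdisk2. For the second
example described next, a minimal sketch of the command (using the logical
volume and disk names from the text) would be:
# migratepv -l datalv hdisk3 hdisk9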
In the second example, we are moving any data in the logical volume datalv
that physically resides on hdisk3 to hdisk9. If any other data resides on hdisk3,
it is not moved.
The migratepv command works by creating a mirror of the logical volumes being
moved, then synchronizing the logical volumes. The original logical volume is
then removed. The next example describes two ways of using mirroring to
duplicate and migrate data.
The mklvcopy command creates a second copy of the data within the logical
volume. Command mklvcopy is all that is needed to start mirroring of data in a
logical volume. The mklvcopy process can be used on an active system without
disturbing users. Some performance degradation may be noticed while the new
data copy is being synchronized with the existing copy.
# mklvcopy -e m -s y -k datalv 2 hdisk3 hdisk7
.
.
.
# splitlvcopy -y splitlv datalv 1
The example shows how to create a mirror copy in a logical volume. The
options specify to use minimum interdisk allocation policy (-e m), strictly allocate
mirror copies on separate physical volumes (-s y) and synchronize new copies
immediately (-k). Here, datalv is the name of the logical volume we wish to start
mirroring, and 2 is the total number of copies of the data that we want (the
maximum is three). Disks hdisk3 and hdisk7 are the physical volumes upon
which the logical volume will reside: hdisk3 is the physical volume already
holding the data of datalv, and hdisk7 is the physical volume that will hold the
mirror copy, and to which we want to move the data.
The splitlvcopy command creates a new logical volume called splitlv with the -y
option. The source logical volume is datalv. The number 1 specifies how many
copies are to remain in the logical volume. Logical volume datalv will still exist
as it did before, residing on hdisk3, and the new splitlv will reside on hdisk7.
# mklvcopy -e m -s y -k infxlv 2 hdisk4 hdisk10
.
.
.
# rmlvcopy infxlv 1 hdisk4
The disk hdisk4 is where the data for infxlv already exists, while hdisk10 is where
the mirror copy will reside, and where we want ultimately to move the data.
Once the mirror copy on hdisk10 has been synchronized, the rmlvcopy command
removes the copy on hdisk4, leaving the data only on hdisk10.
Other factors like creation of new logical volumes or file systems, modification of
configuration files, data integrity checks, and testing all contribute to the
unavailability of data being migrated. This foil discusses some of the aspects
that must be considered when choosing a method to use for migrating data.
Closing of a logical volume requires either that the file system built upon it be
unmounted, or that any database that has the raw logical volume open be shut
down. After running cplv, the file system configuration file /etc/filesystems
should be updated to refer to the new logical volume, and the file system
integrity checker fsck should be run before the new file system is used.
The command migratepv is one of the better methods for moving data without
disrupting users. Command migratepv can be run on an active system, as it first
creates a mirror of each logical volume contained within the physical volume,
and synchronizes both copies. Once the synchronization is complete, the
original copy is removed, leaving the new copy active and available to users.
Users may notice some performance degradation due to the atomic nature of
creation and synchronization of the mirror, as each physical partition is written
separately, locked from being accessed by any other process. This can slow
down access to the data, but ensures data integrity.
The mklvcopy, splitlvcopy method is ideal for creating a copy of logical volumes,
and then running with both copies, one in a production environment and one in a
test environment. The mklvcopy command ensures data integrity by creating a
mirror copy of the data and synchronizing it atomically. Command splitlvcopy,
however, should not be run on an active logical volume for the same reasons
that cplv should not be run on an active logical volume. If processes are
updating the data while the split is taking place, the consistency of the data on
both copies can not be guaranteed.
After running splitlvcopy, the file system configuration file /etc/filesystems should
be updated to include the relevant configuration data regarding the new logical
volume and file system. The file system integrity checker fsck should then be
run to ensure data consistency within the new logical volume.
Some databases require that to move data between file systems or logical
volumes, a database export must occur, followed by an import of the data on to
the new file system or logical volume. This is inefficient and time consuming but
often necessary. Some reconfiguration of the database may have to occur to
point at the new data location.
Although 7133s in a VSS rack run on 350 V DC, no special procedures are
needed, because the 7133 has auto-sensing power supplies. The
power cords from the original enclosure or rack are not used, however, as the
VSS rack already has power cables installed. Because of configuration
restrictions (see “Use of Existing Disk Drives in VSS” on page 248), existing
7133-010 drawers may require extra loop jumper cables to complete loop
connection.
SSA cabling of the drawers is to the SSA adapters in the 2105-B09 adapter bays.
As all VSS subsystems come with four arrays of eight drives, it is possible that
these drives can be used for either direct migration of data, or as temporary
storage while the existing drives are moved. If this is not possible, a removable
media device such as a tape drive would have to be used to temporarily store
the data while the drives are reformatted.
“524-byte Sector Format” on page 250 explains the new format in detail.
The 8 bytes at the start of the sector are used by IBM AS/400 systems, and are
not used when the VSS is attached to UNIX hosts. The data portion of the sector
remains at 512 bytes, for all systems. A 2-byte sequence number and a 2-byte
LRC increase the size of the sector to 524 bytes. The sequence number is a
modulo 64k value of the LBA of this sector and is used as an extra method of
ensuring that the correct block is being accessed.
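In other words, each 524-byte sector consists of an 8-byte header, 512 bytes of
data, a 2-byte sequence number, and a 2-byte LRC (8 + 512 + 2 + 2 = 524).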
The LRC, generated by the SCSI host adapter, is calculated on the 520 data and
header bytes and is used as an error-checking mechanism as the data
progresses from the host, through the VSS storage server, into the SSA adapter,
and on to the RAID array (see Chapter 4, “Versatile Storage Server Data Flow”
on page 109 for a detailed description of data flow through the subsystem). The
LRC is also used as an error-checking mechanism as the data is read from the
array and passed up to the host adapter. The sequence number and LRC are
never transferred to the host system.
Workload Characterization
In this set of foils, we look at several aspects of a workload that may affect the
configuration of the VSS. They are:
• Read to write ratio
• Synchronous write I/O content
• Sequential I/O content
• Caching characteristics of data
Summary
In these foils, we first repeat the key performance capabilities of the VSS in
review, then summarize the key points presented in this chapter.
Adaptive cache
The VSS storage server cache is managed by an adaptive caching algorithm
which determines how much data should be staged to storage server cache
when data is read from the disk back store. The storage server can either stage
just the block (or blocks) read, the blocks read and the balance of the 32 KB
track, or the entire 32 KB track containing the requested blocks.
The key advantage of this adaptive caching is that it intelligently manages which
data is placed in cache. Adaptive caching can increase the number of read
cache hits given the same size storage server cache.
Fast write
Its fast write capability allows the VSS to return a SCSI task-complete indication
when data has safely been stored in both cache and nonvolatile storage (Fast
Write Cache) in the SSA adapter. The fast write capability allows the VSS to
mask the overhead associated with RAID-5 update writes.
You can assume that the access patterns for the member disks in a 6+P or
7+P array will be reasonably uniform. Although a RAID-5 array has an extra
disk, the smoothing of the access pattern is not a result of adding that extra
disk; it comes from striping what would otherwise be the workload of seven
individual JBOD disks across the whole RAID array. A “hot spot” RAID array is
therefore less likely to occur than a “hot spot” JBOD disk.
Not all reads directed to the VSS result in access to the SSA disk back store.
Only cache read misses require access to the back store. A VSS subsystem
with a high cache hit ratio can have much less back end I/O activity than front
end I/O activity.
To estimate the back end I/O load being generated for a given workload of
random reads, use the formula reads x (1 - cache hit ratio) .
On the other hand, random writes typically generate more back end I/O load
than front end I/O load, because of the RAID-5 write penalty. While the VSS will
cache both data and parity strips in the SSA adapter cache, which can result in
cache hits during parity generation for a random write, a conservative
assumption is that all random writes will generate four I/Os to the back store
SSA disks. Use the formula writes x 4 to calculate the write I/O load that will be
generated from a given random write workload.
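As an illustration (the workload figures here are assumptions), a host issuing
1,000 random reads per second with a 50% read cache hit ratio and 200 random
writes per second would generate roughly 1,000 x (1 - 0.5) = 500 read I/Os and
200 x 4 = 800 write I/Os per second against the SSA back store.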
By sequential reads, we mean reads that are in LBA sequence for the virtual
disk. Sequential processing in a UNIX file system can refer to processing all
blocks in a file, which may not be stored together on the disk, depending on the
structure of the logical volume definition and whether the disk is fragmented.
When the UNIX file system detects sequential processing, it will begin
read-ahead into the host cache. The sequential detect and prefetch of the VSS
complement this sequential read-ahead by increasing the read throughput from
the VSS.
When performing sequential prestaging, the VSS will issue reads for 32 KB
tracks to the SSA disk back store, regardless of the size of the SCSI reads being
received. So we use the formula bytes read / 32 KB to calculate the number of
read operations generated against the SSA disk back store as a result of
sequential read I/O operations.
Data written sequentially is written in stripe writes by the VSS. As a result, there
is no RAID-5 write penalty for sequential writes, although a parity strip must be
written for every stripe. Since the parity overhead of a 6+P RAID array is not
very different from the parity overhead of a 7+P RAID array, use the formula
(bytes written / 32 KB) x 1.15 to calculate the number of write I/Os generated to
the SSA back store for a given sequential write workload.
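As an illustration (again with assumed figures), a sequential workload reading
20 MB/s and writing 10 MB/s would generate roughly 20 MB / 32 KB = 640 read
I/Os per second and (10 MB / 32 KB) x 1.15, or about 368 write I/Os per second,
against the SSA back store.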
Given a certain effectiveness of host caching, the amount of VSS storage server
cache should scale with the size of the SSA disk back store.
From this foil, it's easy to see why the effectiveness of a disk subsystem cache
depends on the effectiveness of host caching. Host cache hits do not result in
I/O to the disk subsystem. Host cache misses that result in I/O to the disk
subsystem are unlikely to be in an external disk subsystem cache, unless the
disk subsystem cache is considerably larger than the host cache, since the disk
subsystem cache is likely to contain much of the same data as the host cache.
The first question is, how much memory is available in the host for caching? An
AIX system will use all memory not committed to other purposes as a disk
cache. Other UNIX implementations may allow the system administrator to
specify the size of the disk cache. Database management systems typically
require an explicit specification of the amount of memory that should be used for
disk buffers.
Data stored in a UNIX file system will automatically be cached in a host cache.
Database data may be stored in either a UNIX file system or, more commonly, in
raw disk. The host may have plenty of memory, but unless a database
management system is using it for buffers, the host caching of the database data
is not going to be highly effective.
Host caching will be less effective in a shared disk environment, depending
on the percent of total data being updated, and the likelihood that updated data
will be retrieved, since recently accessed data is flushed from cache in all
systems but the one that has performed the update.
If the workload to be placed on the VSS consists primarily of data that is part of
a UNIX file system, the effectiveness of host disk caching largely depends on the
amount of memory available for the host cache. Systems that are memory
constrained will have less effective host caching, whereas systems with abundant
memory should perform very effective host caching, assuming that the cache
size has not been severely limited by the system administrator.
This architecture allows the effective use of host caching, because data needs to
be stored in only one host cache, the cache managed by the processor owning
the data. DB2 Parallel Edition requires an explicit specification of disk cache
size, but there are no special considerations for the effectiveness of host disk
caching for a DB2 Parallel Edition configuration, since data will be in only one
host cache.
A new AIX capability, the General Parallel File System (GPFS), exploits the
VSD capability to provide access to data in an AIX file system.
Where host caching is effective, the cache hits that could occur in the external
disk subsystem cache do not because I/O is never generated to the external disk
subsystem. As an extreme example, if there is only one host, all data is stored
in one external disk subsystem, and both host and disk have 1 GB of memory in
use as a disk cache, the two disk caches are likely to contain much of the same
data. The LRU chains in the two caches would be different, because the external
disk subsystem cache would be unaware of rereads of the same blocks, but the
last n blocks read into the host cache would also be the last n blocks read into
the disk subsystem cache.
Since most cache hits occur in the host cache, the cache hit ratio for reads in
the external disk subsystem cache is typically lower. This helps explain why
workloads that would be considered “cache unfriendly” in an S/390 mainframe
environment are the norm in a UNIX environment.
The rule of thumb states that a disk subsystem cache four or more times the
size of the host cache can provide significant performance benefit through
additional read cache hits.
In the case of a cluster environment, the rule of thumb would state that the disk
subsystem cache should be four times or more the size of the total storage used
for host caching in all hosts in the cluster.
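For example (an illustration only), a two-host cluster in which each host
dedicates 512 MB of memory to disk caching uses 1 GB in total for host caching,
so the rule of thumb would call for a VSS storage server cache of 4 GB or more.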
Host caching
This is the first of three VSS storage server cache size guidelines, to be used
where host caching is considered “less effective,” or unknown.
In a configuration with less effective host caching, disk subsystem cache can be
more critical to overall system performance. Read requests not satisfied in the
host cache are being sent as SCSI I/O to the external disk subsystem. A disk
subsystem cache hit is much faster than a cache miss, and therefore subsystem
cache hits are beneficial to performance.
The reason to consider a large VSS cache in a configuration with less effective
host caching is that each increment in VSS storage server cache can give
measurable improvement to I/O performance.
The greater the capacity of the SSA disk back store in gigabytes, the more
storage server cache is needed. Again, it is capacity in gigabytes, not number of
disks or drawers, that
matters. Therefore, 450 GB of 9 GB disks is equivalent to 450 GB of 4.5 GB disks
when sizing VSS storage server cache.
This rule of thumb states that in a configuration with less effective host caching,
twice as much VSS storage server cache should be configured as in a
configuration with effective host caching.
Note that, in an environment with less effective host caching, the largest storage
server cache sizes available for the VSS would be indicated for a VSS
subsystem of less than the maximum number of drawers of 9 GB disks.
The cache should be configured based on the capacity of the SSA disk back
store. Capacity in gigabytes is what matters, not capacity in disks or drawers.
Therefore, 450 GB of 9 GB disks is equivalent to 450 GB of 4.5 GB disks when
sizing VSS storage server cache.
The usable capacity of a VSS drawer containing a 6+P RAID array, a 7+P RAID
array, and a spare disk is 57 GB with 4.5 GB disks and 115 GB with 9 GB disks.
The rules of thumb for VSS storage server cache size are stated in terms of 450
GB of disk, which is the capacity, in round numbers, of four drawers of 9 GB
disks or eight drawers of 4.5 GB disks.
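To check the arithmetic: four drawers of 9 GB disks provide 4 x 115 GB = 460
GB, and eight drawers of 4.5 GB disks provide 8 x 57 GB = 456 GB, both of
which round to the 450 GB used in the rules of thumb.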
The foil includes a table showing how many 9 GB disk drawers and 4.5 GB disk
drawers can be installed for each available cache size. Both columns assume
that all drawers are of the disk capacity indicated. The table does not contain
new information; it applies the rule of thumb to the available VSS storage server
cache sizes.
In this case, it is the host cache, not the capacity of the disk back store, that is
the scaling factor.
The rule of thumb states that for an external disk subsystem cache to be
effective, it must be four or more times the size of host memory used for disk
caching. In a cluster environment, this is the total of all systems in the cluster
environment.
There is no upper limit to this rule of thumb. An external disk subsystem cache
eight times the size of the host memory used for disk caching is not
unreasonable if the objective is to improve performance by providing subsystem
cache read hits.
Note that if you seek an increase in performance where host caching is already
effective, you have a choice between increasing the memory available for host
caching and using a large cache in the VSS storage server. Many factors may
affect this decision, such as the age of the UNIX hosts, their upgradability, and
the relative cost of additional host memory versus VSS storage server cache.
In the table, information is shown defining what VSS storage-server cache sizes
should be used to increase performance for a given host cache size.
Rules of thumb
The first rule of thumb is to select high-performance four-way SMPs where SCSI
front end I/O reaches or exceeds 10,000 operations per second.
Where the SCSI front end I/O rate is not known, a reasonable assumption can be
made by applying a typical access density to the size of the SSA disk back store.
Assuming an access density of about three I/O per second per GB, a 450 GB
VSS subsystem would be expected to receive about 1500 I/O per second.
In case of a failure of a storage server processor or memory, the VSS will shift
the workload of that cluster side to the other cluster side (failover) until a repair
action can be performed against the failed component. In cases of failover, a
single cluster side will be performing more work than under normal operation.
Up to 64 LUNs can be addressed by a single SCSI target ID. In some cases, the
UNIX operating system used will limit the number of LUNs per target ID to a
smaller number. In any case, the VSS can emulate multiple LUNs and multiple
target IDs on a single SCSI port.
Where a SCSI bus is dedicated to a VSS SCSI port and a single host, there will
be no SCSI bus arbitration overhead. This is true even if the LUNs are
addressed as more than one target ID.
UltraSCSI adapters
VSS adapters are UltraSCSI adapters, which operate at a burst speed of 40
MB/s. As is true for any UltraSCSI adapter, backward compatibility with SCSI-2
is provided, so that the VSS UltraSCSI adapter can be attached to a host SCSI-2
adapter. In this instance, the VSS SCSI port will operate at the 20 MB/s burst
speed supported by SCSI-2 when transferring data.
Throughput considerations
The throughput of a VSS SCSI bus depends not just on the VSS SCSI adapter but
also on the host SCSI adapter. In addition to whether the host SCSI adapter is
UltraSCSI or SCSI-2, various SCSI adapters have different limitations in terms of
maximum I/O rate. The maximum I/O rate of a VSS SCSI port may be
determined by the host SCSI adapter limitation.
One reason these figures are lower than the 40 and 20 MB/s burst rates of
UltraSCSI and SCSI-2, respectively, is that all command and control transfer on a
SCSI bus occurs at 10 MB/s for compatibility purposes, as required by the SCSI-3
architecture. Only data transfer occurs at the 20 or 40 MB/s speeds.
A virtual disk can be shared between two SCSI attachments that are used by two
different hosts, but a virtual disk cannot be shared between two SCSI
attachments used by the same host.
The capacity of a VSS-to-host SCSI attachment may be further limited by the
host SCSI adapter, especially if it is SCSI-2.
High sequential throughput requirements, above 25 MB/s for all virtual disks
sharing a VSS-to-host UltraSCSI attachment, or 15 MB/s for a SCSI-2 attachment,
would dictate partitioning virtual disks across multiple SCSI attachments.
Disk specifications
The next foil shows a comparison of the specifications of the 4.5 GB and 9 GB
SSA disks used in the VSS.
Access density
Access density is measured in I/Os per second per GB, which is a key metric in
choosing between 4.5 GB and 9 GB drives.
Because there are more heads on the actuator arm in the 9 GB drive, the
average seek time for the 9 GB drive is slightly longer than for the 4.5 GB drive.
Rotational speed, and therefore average latency, are the same.
Calculation of the SSA back end I/O load for a given SCSI front end I/O load was
previously discussed in “I/O Operations” on page 258.
Since SSA disks in the VSS are always part of a RAID array, this can be stated
as:
• 350 I/Os per second for a 6+P RAID array
• 400 I/Os per second for a 7+P RAID array
Use 7 and 8 as multipliers for the 6+P and 7+P RAID arrays, respectively,
since in a RAID-5 configuration, both data and parity are stored on all disks in
the array. The I/O per second capability of the array, therefore, includes all
disks in the array.
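These figures correspond to roughly 50 back end I/Os per second per SSA disk:
7 x 50 = 350 for a 6+P array and 8 x 50 = 400 for a 7+P array.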
For sequential workloads, the sustained transfer rate is not limited by I/Os per
second to each disk in the array. A rule of thumb for sequential read workloads
can therefore be specified as 25 MB/s for a RAID array, regardless of the disk
capacity used in the array.
For mixed sequential and random workloads, the sequential I/O load will expend
some of the SSA disk I/O capacity and therefore some of the RAID array I/O
capacity.
Migrating data from RAID-1 subsystems may present an exception to this rule of
thumb.
Adapter bandwidth, except for SSA loop transfer bandwidth, is shared by all
RAID arrays attached to either loop.
Rules of thumb
The relevant rules of thumb are stated in terms of numbers of drawers in the
VSS subsystem regardless of disk drive capacity. It is assumed that 9 GB disks
are chosen unless the access density requirements dictate the use of 4.5 GB
disks. Since the rules of thumb are driven by the SSA back end I/O workload,
the workload is expected to be the same for either 4.5 GB or 9 GB disks.
For configurations containing up to eight drawers, use one SSA adapter per
drawer. Configure both RAID arrays (one 6+P and one 7+P) on the same loop
so they can share a spare drive.
Where 4.5 GB drives are chosen for reasons other than access density, such as
where existing drives are used from an installed 7133 rack, it may be possible to
use both loops on an SSA adapter to support two drawers from a single SSA
adapter. Even if the rationale for using 4.5 GB drives is unrelated to
performance requirements, it is important to determine that performance
requirements would not have dictated the use of 4.5 GB drives before configuring
more than two RAID arrays per SSA adapter.
Since there can be only eight SSA adapters in a VSS subsystem, configurations
using 9 to 15 drawers require using both loops on some SSA adapters.
Configurations with 16 drawers require using both loops on all SSA adapters.
Where multiple drawers share an SSA adapter, the two RAID arrays in a drawer
should be configured in one loop, and the two RAID arrays in the second drawer
should be configured in the other loop.
Note that this means that VSS configurations of nine drawers or more will have
less SSA back-end bandwidth per gigabyte for those drawers sharing an SSA
adapter. As an example, in a nine drawer system, there would be two drawers
sharing an SSA adapter, and seven drawers configured with one drawer per SSA
adapter. For many workloads, there will be a measurable performance
difference between the RAID arrays in drawers with one drawer per SSA adapter
and the RAID arrays in drawers with two drawers per adapter.
Since there can be only eight SSA adapters in a VSS subsystem, configurations
using 17 or 18 drawers will require two of the eight SSA adapters to attach the
RAID arrays in three drawers each.
Random writes
For random writes in a RAID-5 disk subsystem, there is the RAID-5 write
penalty: read data, read parity, write data, write parity. Since a single front end
write I/O can generate four back end I/Os, the effect of the RAID-5 write penalty
on the back end I/O load can be considerable.
In the VSS, the RAID-5 write penalty is masked by fast write cache. Note that it
is masked, not eliminated. With fast write cache, a SCSI task-complete
indication is sent as soon as the data written is safely stored in SSA adapter
cache and Fast Write Cache. As long as the arrival rate of update writes does
not exceed the capability of the SSA adapter to destage these writes to disk, the
RAID-5 write penalty is effectively eliminated. During write bursts, it may be
necessary for SCSI write operations to wait for Fast Write Cache space to be
cleared before being processed as fast writes.
If a block is rewritten before it has been destaged from Fast Write Cache, the
VSS will update the SSA adapter cache and Fast Write Cache copy and only one
destage will occur. For planning purposes, a conservative assumption would be
that all SCSI writes generate SSA disk write I/O.
If the RAID-5 update write bandwidth requirement will be very great, consider
configuring multiple VSS virtual disks in different RAID arrays, and then
combining these virtual disks using the logical volume manager of the UNIX
implementation.
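A minimal sketch of this approach with the AIX LVM follows; the volume group
datavg, the logical volume name stripedlv, and the hdisk names (each hdisk
being a VSS virtual disk in a different RAID array) are assumptions for
illustration:
# mklv -y stripedlv -S 64K datavg 128 hdisk4 hdisk5
This creates a logical volume of 128 logical partitions striped in 64 KB strips
across the two virtual disks, so that update writes are spread over both RAID
arrays.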
Sequential writes
The stripe write capability of the VSS allows RAID-3-like processing for sequential
writes. This can speed database loads and file system recoveries from backups.
In a stripe write, data is written to each of the data strips in the stripe and parity
is generated from that data and written to the parity strip. No read before write
is required. A sequential write not only is not subject to the RAID-5 write
penalty, but the striping across multiple drives increases the write throughput
beyond what would be possible with a single unstriped disk.
Stripes are collected in Fast Write Cache. Stripe writes in the VSS are not
dependent on having SCSI write operations write data in exact stripe increments.
Where a SCSI write contains a complete stripe, the VSS storage server
recognizes this and passes the data to the SSA adapter accordingly. Where
SCSI writes are smaller, the collection of the strips in a stripe occurs in Fast
Write Cache and full stripes are destaged to disk as soon as they are collected.
Sequential I/O
Sequential I/O is characterized by the transfer of large amounts of data. From a
disk subsystem perspective, sequential I/O refers to retrieval of data in LBA
sequence. Where a UNIX host does sequential read-ahead, its read-ahead is
based on the logical structure of the files in the file system, which may be stored
in a fragmented manner on disk and which may be striped across several logical
volumes.
Sequential I/O is typically throughput sensitive over many I/O operations rather
than response time sensitive for each I/O. We often speak of the sustained data
rate in discussing sequential throughput, which underscores the throughput
versus response time focus. A fast sequential I/O is of little value if it cannot be
followed immediately by another fast sequential I/O.
A requirement for sequential processing alone will not dictate a need for a large
VSS subsystem cache.
Data that has been read through sequential prestage is destaged more quickly
than data staged to cache through random reads and writes, which avoids
flooding the cache with data read sequentially.
Only the blocks read in a SCSI I/O request are placed in storage server cache.
This uses the least room in cache and therefore allows data to remain in cache
for as long as possible to enable a hit if the data is reread by the same or a
different host.
Or, the blocks read and the rest of the 32 KB track they reside in are placed in
storage server cache. This choice is best where the data exhibits locality of
reference and blocks adjacent to one already read are often later referred to.
In some cases, the locality of reference is not just forward in LBA sequence but
to blocks clustered around the blocks read. Or, where indicated, the VSS will
read not just the block requested but the entire 32 KB track into storage server
cache.
For more information about VSS adaptive caching, see Chapter 4, “Versatile
Storage Server Data Flow” on page 109.
Race conditions
Race condition is the term used to describe contention for resources where the
contenders are unequal in speed. We will see that race conditions resulting
from the attachment of hosts of different processor speeds, and with different
SCSI attachment capabilities, do not affect the VSS.
As we will see, race conditions are not an issue for the VSS. This information is
included to address concerns about VSS performance in heterogeneous
environments.
Heterogeneous environments
Heterogeneous environments include:
• Multiple hosts with a variety of processor speeds sharing a VSS subsystem
• A mixture of UltraSCSI and SCSI-2 attachments to VSS
I/O requests are selected and processed by an available request server, in the
order they were placed on the queue.
This provides significant read bandwidth, both for random reads and for
sequential reads.
Adaptive cache
The VSS storage server cache is managed by an adaptive caching algorithm
which determines how much data should be staged to storage server cache
when data is read from the disk backstore. The storage server can either stage
just the block or blocks read, the block or blocks read and the balance of the 32
KB track, or the entire 32 KB track containing the requested blocks.
The key advantage of this adaptive caching is that it intelligently manages which
data is placed in cache. Adaptive caching can increase the number of read
cache hits given the same size storage server cache.
Fast write
The fast write capability of the VSS allows it to return a SCSI task-complete
indication when data has safely been stored in both cache and nonvolatile
storage (Fast Write Cache) in the SSA adapter. The fast write capability allows
the VSS to mask the overhead associated with RAID-5 update writes.
Configuration flexibility
The VSS has a flexible architecture that can be scaled in subsystem capacity
and can be configured independently of subsystem capacity to meet your
performance needs.
Philosophy
One of the main driving forces behind the VSS was the need to provide
uninterruptible service to the owner or user of the storage subsystem. Typically,
maintenance on most components of a computer system causes some sort of
interruption to service as an engineer replaces a part or a system administrator
reconfigures around a degraded peripheral or component.
Components that are most likely to fail can be replaced without disruption of
service. The components most likely to fail in the VSS are the disk drives. The
7133 disk drawer is designed to facilitate hot replacement and hot insertion of
drives and its power supplies. The RAID-5 adapter can detect a failed disk drive
and initiate sparing, the process whereby a good spare drive is automatically
brought online into an array to take over from a failed drive.
Repair Actions
There are three distinct levels of repairs and maintenance actions that can be
performed on the VSS. These are as follows:
• Customer repairs. Repair actions for the customer are limited.
• CE repairs. Typically, most failures that occur on a VSS will be repaired by
the CE.
• PE repairs. The PE will perform code engineering change (EC) functions and
provide high-level support.
Interfaces
A number of interfaces are provided to allow access for maintenance purposes:
• Web browser. A web-based HTML browser is the primary interface for
configuring the VSS.
• ASCII terminal. A serial port is provided for each storage server to enable
the CE and PE to access error logs, configure error reporting, run
diagnostics, and upgrade microcode.
• Remote support interface. A serial port is provided for each storage server
to allow attachment of a modem to enable remote access by an IBM support
representative. The IBM support representative can examine error logs,
configure error reporting, and run diagnostics without needing access to
customer data.
Reporting
Depending on the configuration of the customer's site and network, the VSS
provides different levels of error reporting to accommodate any situation:
• Error log. The VSS constantly monitors itself and logs any errors and
unusual occurrences in its error log.
• SNMP. If the customer has a large site that uses the SNMP protocol for
network management, the VSS can be configured to send SNMP alerts to the
network management stations.
• E-mail. When a problem is reported, the VSS can e-mail a list of users to
inform them of the problem.
• Call home. Through the modem connected for remote access, the VSS can
automatically dial and connect to the local IBM support office and log a call
for service. The appropriate IBM CE is then dispatched with a replacement
part, or a support representative can dial in to examine the error logs and
make recommendations.
Upgrades
Upgrade of the components of the VSS is made easy by the concurrent
maintenance features of the subsystem. All components of the VSS that are
designed to be replaced while the subsystem is running can be upgraded while
the subsystem is running: disk modules, disk drawer power supplies, the storage
servers, clusters, adapters and rack power supplies. In addition, system
microcode can be upgraded while components are running, through the code EC
management process.
Code EC management
The code EC management process exists to provide concurrent management of
licensed internal code (LIC). It allows the interrogation, modification, and
maintenance of the installed ECs. The EC process is available through the
standard interfaces: web-browser, CE maintenance interface, and remote support
interface. ECs are distributed on CD-ROM, diskette, and network.
HTML browser
The standard interface to the configuration of the VSS is via a web-based HTML
browser, so called because it uses the World-Wide Web style of hypertext and
information browsing currently available on the Internet and intranets. HTML is
the text markup language that is the front end of all web applications. The VSS
configuration menus, written in HTML, provide the user interface. The data
provided by the user is passed through the Common Gateway Interface (CGI) to
binary utilities that perform the actual configuration and repair actions of the
VSS.
The maintenance options for the VSS are available from the system status
screen. From here, the following options are available:
• Change error reporting
• Allow IBM service access to your system
• Allow remote configuration
• View error logs
ASCII terminal
The services provided through this interface are designed to be displayed on
character-based asynchronous ASCII terminals, such as the IBM 3151 or DEC
VT100. Typically, the CE or PE will use this interface while performing onsite
maintenance. The ASCII terminal can be supplied by the customer, or the CE or
PE can plug in a laptop containing the mobile solutions terminal (MoST)
emulator. The terminal connects through a null-modem cable to an RS-232
switch that is in turn connected to a RS-232 serial port on each storage server
cluster. Either storage server cluster can be selected from the switch, without
having to physically unplug the terminal cable. The switch is located in the rear
of the 2105-B09 cabinet.
Call home
The call-home facility provides automatic notification of problems or potential
problems to IBM, as well as notifying the customer by e-mail or SNMP alert. The
VSS RAS code of each cluster constantly monitors and logs errors or events that
occur. The logs are analyzed on a regular basis by a separate RAS process,
and any immediate problems or trends that may cause problems are noted. If
necessary, another RAS process dials out to the closest IBM support center,
where it communicates with a “catcher system” that logs the data passed to it
by the VSS. The catcher system then informs support center personnel of the
problem. The support center will then dial back in to the VSS and examine the
error logs, making the necessary recommendations. If needed, the support
center notifies the PE, who also dials in to examine the problem, or the support
center can dispatch the appropriate CE with a replacement for the failing part.
For code ECs, the remote support interface supports only emergency microcode
fixes, because of the limited data rate of asynchronous modems.
Error log
The VSS has code for logging, tracking, and managing errors and events that
occur. Each storage server runs a copy of the same code and keeps track of its
own errors, both hardware and software (microcode). In addition, the two
storage servers of the VSS monitor each other across the Ethernet interface
using TCP/IP and UDP/IP protocols. Configuration options include the log file
name, maximum size of the log, and the maximum memory buffer size. The
memory buffer is a small (8192 bytes by default) circular buffer in the VSS
operating kernel to which errors are written. A separate process monitors the
kernel buffer, extracts entries that are written to it, and logs them. The next step
in the process is to analyze the logs.
Problem records are available for viewing by the customer, CE, and PE through
all interfaces.
Once the problem record has been generated, a service alert is sent, according
to one of the methods discussed on the next three foils.
The second MIB contains information about pending faults. The information
allows the user of the network management station to display any pending
faults.
Information provided
When the storage server microcode is running, the following data is provided to
the IBM support center:
• The problem record
• The VPD record for the failing component. The VPD record includes
engineering change (EC) levels for both hardware and firmware and for
configuration data.
The call is screened by the support center and the appropriate IBM
representative is contacted to provide support. This support may take the form
of a CE arriving on site with a replacement for the failing part, or a support
representative dialing in to examine error logs and subsystem status. The
call-home facility is configured from the Configure Error Reporting option from
the System Status menu.
Console specification
To perform repairs on the VSS, the customer must have a console running a web
browser capable of handling HTML forms and frames. It attaches to the
customer intranet, and communicates with the hypertext transfer protocol (HTTP)
server running on the VSS. If the customer wishes to install code ECs
distributed on diskette and CD-ROM, the console must have the appropriate
drive installed. If the customer plans to install code ECs over the network, an
Internet or IBM-NET connection is required.
Logical configuration
The customer is able to perform all logical configuration tasks on the VSS.
Logical configuration is defining resources to the host systems that are
accessing the VSS. The customer is able to perform the following logical
configuration tasks:
• View, define, change, and remove disk arrays.
• View, create, change, and remove LUNs.
• View, define, change, and remove configuration data.
Limited repair
The customer is able to perform some limited repair actions of the RAID disk
arrays in the VSS.
All actions apart from the following are performed by the CE or PE:
• Display problem information associated with disk drives.
• Determine disk drive problems that are customer repairable.
• Request that a failed disk be conditioned for repair.
• Upon completing a disk repair, identify the repaired drive.
• Return the repaired drive to the RAID array.
• Close the associated problem.
Service procedures
The service procedures that the CE is required to perform comprise a number of
different “levels.” The first of these is access to the service processor of the
storage server, typically when the storage server microcode is not running
(either after a fatal error or prior to booting the storage server). Access to the
service processor provides limited access to the VPD, NVRAM, and some error
data. Typically, access to the service processor is warranted only after an
internal error causes the microcode to halt.
The second level of service procedures comprises those performed through the
system management services (SMS) of the storage server firmware. The SMS is
accessed when the storage server microcode is not running. The SMS can be
used to view and change the storage server bootlist (the list of devices that the
storage server attempts to boot from, in order), to run extended memory tests,
and to update storage server firmware.
The diagnostic and service aids menus are similar to those of the RS/6000 online
diagnostic and services menus. They allow the CE to run online diagnostics on
the storage server, and access low-level functions of the subsystem, such as
viewing and setting the storage server bootlist, and viewing and setting NVRAM.
Supported interfaces
The web-browser interface discussed under “VS Specialist” on page 314 and the
CE onsite maintenance interface discussed under “Onsite Maintenance
Interface” on page 316 are the primary supported methods for code EC
management. Through the web-based interface, the customer has limited
access to EC management. The CE interface supports only ECs from CD-ROM,
diskette, and from previously loaded file sets.
The remote support interface provides limited support for code EC management.
Typically, asynchronous modems do not provide enough bandwidth to upload
large file sets, so only emergency code fixes will be supported through the
remote support interface.
Subsystem Recovery
In this chapter we discuss how to recover from a failure within the VSS. From a
host-system perspective, VSS is “just another disk storage system.” However,
reliability, availability, and serviceability (RAS) are built into VSS, so that most
types of possible failure are masked from the host operating system and allow
normal data processing to continue. If a failure should occur, the RAS systems
will ensure that repair actions can take place while the system is online.
Data integrity is key to VSS; in all circumstances, once the host has been
signaled that the write is complete, the integrity of the data is assured.
However, the most common cause of data loss and corruption is human error.
As VSS cannot protect data from this, it is vital to have backup and recovery
plans and procedures in place, and to test them. Therefore, all the normal rules
governing data backup, restore, and disaster recovery plans still apply.
The VSS has built-in diagnostic and error warning systems. It can return error
codes relating to failing components back to the attached hosts, send out an
SNMP alert, and dial out and send messages to remote maintenance personnel.
Data integrity
VSS has been designed so that in any set of circumstances, no single failure will
result in any loss of data integrity. Data integrity is assured for any completed
I/O operation.
Data availability
In most circumstances, data availability is maintained. When a repair action is
taken, it may be necessary to restrict data access to the failed area
within VSS while concurrent maintenance is taking place.
Concurrent maintenance
VSS has been designed so that nearly all maintenance can be carried out
without having to power the system down or take it off line.
Call-home support
This is a continuous self-monitoring system that initiates a call to the service
provider if a failure or potential failure is detected. With this feature, a service
technician can automatically be sent out to replace the failed or failing
component and can bring along the required replacement parts. Repair time is
thus minimized.
It is important to stress that this remote access does not compromise security of
the data stored on disk.
There are two levels of support that can access the system via the RS-232 port.
The normal customer engineer (CE) support level does not have access to any
customer data. The product engineer (PE), the highest level of support, may
require root access to the system, which gives him or her access to the data.
This level of access must be specifically unlocked by the customer on site; it
relocks automatically after 8 hours. For all support levels, the customer has to
authorize the use of remote support. If remote support is activated, then the
customer is notified by e-mail. Call-back modems and other security procedures
(such as passwords) should be used to prevent unauthorized access to the
system.
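The 8-hour relock can be pictured as a simple time-window check. The class and timeout handling below are a hypothetical sketch of that behavior under stated assumptions, not the subsystem's actual locking mechanism.

# Sketch of the customer-controlled unlock window for PE (root) access.
# This only illustrates the 8-hour automatic relock described in the text.

import time

UNLOCK_WINDOW_SECONDS = 8 * 3600      # access relocks 8 hours after unlock

class PEAccessLock:
    def __init__(self):
        self._unlocked_at = None      # None means locked

    def unlock(self, now=None):
        """Customer action on site: open the PE access window."""
        self._unlocked_at = time.time() if now is None else now

    def is_open(self, now=None):
        """True only within 8 hours of the last unlock."""
        if self._unlocked_at is None:
            return False
        now = time.time() if now is None else now
        return (now - self._unlocked_at) < UNLOCK_WINDOW_SECONDS

lock = PEAccessLock()
lock.unlock(now=0)
print(lock.is_open(now=7 * 3600))    # True, still inside the window
print(lock.is_open(now=9 * 3600))    # False, relocked automatically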
Both the user and the CE can configure the VSS, assigning storage to individual
hosts. The configuration manager can be accessed from the subsystem
interface, the RS-232 port, or a customer-controlled network (Ethernet based).
Using an existing customer network (intranet) allows the user to manage the
complex from a normal desktop. The configuration manager is a web-based GUI
application that provides an easy-to-use interface to the VSS.
In all cases, the result is the same: the host is unable to read data from or write
data to the storage subsystem. There is no issue of data integrity, because any
in-flight transactions taking place when the failure occurs are aborted. The
application running on the host does not receive an affirmative return code from
the VSS and so takes its normal recovery action. To ensure continuous data
availability, two paths using different adapters on the host and on the VSS can
be used.
Note: This requires the host operating system to have adapter failover capability
or some other method of defining two different routes to the same piece of data.
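A host-side sketch of what such dual pathing amounts to is shown below: if an I/O request fails on one path, it is reissued on the alternate path. The path names and the issue_io function are assumptions for illustration; in practice this is handled by the host's device driver or volume manager, not by application code.

# Sketch of host-side path failover: if an I/O request fails on the primary
# path, reissue it on the alternate path.

class PathFailedError(Exception):
    pass

def read_block(paths, block, issue_io):
    """Try each configured path in turn until one completes the read."""
    last_error = None
    for path in paths:
        try:
            return issue_io(path, block)
        except PathFailedError as err:
            last_error = err          # adapter or cable failure on this path
    raise last_error                  # all paths failed

# Illustrative use: the primary adapter is broken, the alternate path succeeds.
def issue_io(path, block):
    if path == "scsi0":
        raise PathFailedError("adapter scsi0 not responding")
    return f"data for block {block} via {path}"

print(read_block(["scsi0", "scsi1"], 42, issue_io))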
The VSS has been designed so that concurrent maintenance can take place. This
means that in the event of a host connection adapter failure, the failing component
can be replaced without the need to power the VSS down. Maintenance procedures
for this are fully described in Chapter 9, “Versatile Storage Server Maintenance”
on page 309. Operators must understand that, although the VSS is designed for
concurrent maintenance and hot pluggability of components, host access through
the failing adapter remains unavailable until the repair is complete.
At initial program load (IPL), the kernel is loaded into the two storage server
clusters, 0 and 1. The kernel then “walks the bus” as far as the two PCI bridges
and the IOA bays; see Chapter 3, “Versatile Storage Server Technology” on page 43.
The pertinent system data is loaded into the object data manager (ODM)
database. The ODM has no record of any of the adapters in the input/output
adapter (IOA) bays. After this initial phase, the VSS microcode is loaded. It is this
microcode that contains the maps of the adapter configuration.
The microcode in each of the two storage server clusters contains information
about the adapters attached to its IOA bays, together with a mirror copy of the
corresponding data held in the other storage server cluster. It is this mirror copy
that is loaded into the online storage server cluster when the other fails. For
example, storage server Cluster 0 contains live data relating to Host adapter 0 and
Disk adapter 0, and a mirrored (inactive) copy of this data is held in storage server
Cluster 1. The microcode in each of the two storage server clusters monitors a
“heartbeat” in the other storage server cluster.
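The mirrored configuration data can be pictured as follows: each cluster holds a live map of its own adapters plus an inactive mirror of its peer's map, and on failover the survivor activates the mirror. The dictionary layout and adapter names below are illustrative only, not actual VSS data structures.

# Sketch of the mirrored configuration data held by the two clusters.

cluster_config = {
    0: {"live": {"host_adapter_0", "disk_adapter_0"},
        "mirror_of_peer": {"host_adapter_1", "disk_adapter_1"}},
    1: {"live": {"host_adapter_1", "disk_adapter_1"},
        "mirror_of_peer": {"host_adapter_0", "disk_adapter_0"}},
}

def fail_over(failed_cluster: int) -> set:
    """Return the adapter set the surviving cluster now controls."""
    survivor = 1 - failed_cluster
    cfg = cluster_config[survivor]
    return cfg["live"] | cfg["mirror_of_peer"]   # the mirror becomes active

print(sorted(fail_over(failed_cluster=0)))
# ['disk_adapter_0', 'disk_adapter_1', 'host_adapter_0', 'host_adapter_1']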
In the event of a failure, the failover procedure is invoked. The VSS microcode
runs as processes within the storage server clusters. It runs at a higher priority
than the scheduler and controls and schedules all events.
Note: Any host adapter can be “owned” by either cluster; ownership is set up
during configuration. Disk adapters are “tied” to the cluster to which they are
connected; this changes only in the event of a cluster failover. If Cluster 0 fails,
its adapters are controlled by Cluster 1.
When the I/O request is resubmitted, failover will automatically reroute the
requests to the remaining on-line storage server, which will process the request
and return a request-complete code to the host.
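A minimal sketch of this heartbeat-and-reroute behavior follows; the timeout value, the class, and the method names are assumptions for illustration, not the VSS microcode.

# Sketch of heartbeat monitoring between the two clusters: if no heartbeat
# has been seen from the peer within a timeout, failover is invoked and
# resubmitted I/O requests are routed to the surviving cluster.

HEARTBEAT_TIMEOUT = 5.0     # seconds without a heartbeat before failover (assumed)

class ClusterMonitor:
    def __init__(self):
        self.last_heartbeat = {0: 0.0, 1: 0.0}
        self.online = {0: True, 1: True}

    def heartbeat(self, cluster: int, now: float):
        self.last_heartbeat[cluster] = now

    def check(self, now: float):
        """Mark any cluster whose heartbeat has gone silent as offline."""
        for cluster, last in self.last_heartbeat.items():
            if self.online[cluster] and now - last > HEARTBEAT_TIMEOUT:
                self.online[cluster] = False    # trigger failover for this cluster

    def route(self, preferred_cluster: int) -> int:
        """Route a resubmitted request to the preferred cluster if it is online,
        otherwise to the surviving cluster."""
        if self.online[preferred_cluster]:
            return preferred_cluster
        return 1 - preferred_cluster

monitor = ClusterMonitor()
monitor.heartbeat(0, now=0.0)
monitor.heartbeat(1, now=10.0)              # cluster 0 has gone silent
monitor.check(now=10.0)
print(monitor.route(preferred_cluster=0))   # -> 1, request rerouted to cluster 1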
Error signals will be generated and dispatched to the service technician or CE.
The failed cluster will be replaced without powering the VSS down and access to
all the data will be maintained during this operation by the other cluster.
Most single failures still allow unrestricted host access to data. The disk
subsystem is built on SSA components. Since its introduction in October 1995,
more than two petabytes of SSA disk capacity has been sold, demonstrating the
market acceptance and reliability of SSA.
When operation resumes, the Fast Write Cache destages its data to disk, and the
memory cells within the Fast Write Cache are then marked as available. The
write-complete return code is sent to the host as soon as the data is written to
the Fast Write Cache; it is at this point that the integrity of the written data is
assured. The disk adapter can be replaced in the VSS without the need to power
the system down.
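The fast-write protocol described here can be sketched as follows: the host write is acknowledged as soon as the data is held in the Fast Write Cache, and destaging to disk happens later as a background step. The class below is a simplified model under those assumptions, not the adapter microcode.

# Sketch of the fast-write protocol: acknowledge the write once the data is
# in the cache, destage to disk later, then free the cache cells.

class FastWriteCache:
    def __init__(self):
        self.pending = {}                    # block -> data awaiting destage

    def write(self, block, data):
        """Accept a host write: cache it and acknowledge immediately."""
        self.pending[block] = data
        return "write complete"              # the host sees completion here

    def destage(self, disk):
        """Background task: push cached writes to disk, then free the cells."""
        for block, data in list(self.pending.items()):
            disk[block] = data
            del self.pending[block]          # cache cell marked available

disk = {}
cache = FastWriteCache()
print(cache.write(7, b"payload"))            # completion before any disk I/O
cache.destage(disk)
print(disk[7], len(cache.pending))           # b'payload' 0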
The Fast Write Cache in the disk adapter holds a mirror copy of the data in the
write-through cache in the adapter's volatile storage. If the Fast Write Cache
should fail, the copy held in the volatile write cache can still be destaged to disk.
If the write-through cache in volatile storage should fail, the adapter fails and
error messages and alerts are generated.
All drives in a 7133 drawer should be of the same size. If a disk drive in a RAID
array fails, there is always a hot spare in the loop, so data rebuilding begins
immediately after the failure occurs. When the replacement drive is installed, it
becomes the new spare, so the spare effectively floats among the drives in the
array; in fact, one spare supports all the arrays on the loop. The rebuilding
process runs as a background task, giving priority to customer requests for data.
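The rebuild relies on the RAID-5 property that any one strip of a stripe is the XOR of the surviving strips (data and parity). The sketch below reconstructs a lost strip from the survivors; the toy byte strings and function names are illustrative only, and the real rebuild runs in the adapter as a background task.

# Sketch of RAID-5 data reconstruction onto the hot spare.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_strip(surviving_strips):
    """Reconstruct the lost strip by XOR-ing all surviving strips of the stripe."""
    result = surviving_strips[0]
    for strip in surviving_strips[1:]:
        result = xor_bytes(result, strip)
    return result

# One stripe across four drives: three data strips and one parity strip.
d0, d1, d2 = b"\x01\x02", b"\x0f\x00", b"\x10\x20"
parity = xor_bytes(xor_bytes(d0, d1), d2)

# The drive holding d1 fails; rebuild its strip onto the hot spare.
assert rebuild_strip([d0, d2, parity]) == d1
print("strip rebuilt onto hot spare")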
Redundant power
Each VSS rack contains a fully redundant power subsystem. Two 50 or 60
ampere line cords (country and location dependent) provide the power to each
rack. Complete system operation can take place on a single line cord. Like the
line cords, power distribution units and cooling fans have redundancy, so any
single failure will not affect operation. If a power control unit should fail, then
the VSS will run from the remaining unit. The failed unit can be replaced without
need to power the system down or affect user access to data.
Battery backup
There is an optional battery pack for both the 2105-B09 main rack and the
2105-100 expansion rack. The battery should provide enough power to run the
system for 10 minutes after power loss. This battery backup protects the VSS
from temporary loss of power in a brownout.
Concurrent maintenance can take place so that users have access to data while
a failed unit is being replaced.
Data Recovery
It is the responsibility of the host systems that are attached to VSS to back up
data that they own. In this respect, VSS should be viewed as a series of
standard SCSI disks and all normal operational procedures will apply.
Regular backups of the data should be taken and copies stored off site. Tools
such as ADSM are designed to aid this process, and they include special modules
to cover the possibility of site failure.
Software Support
This foil shows the topics that we discuss in this chapter.
Device Drivers
• SSA RAID adapter device driver
This device driver supports the SSA RAID adapter, providing access to the
adapter for communications and for managing the adapter itself.
• SSA router
The SSA router handles the SSA frames that flow on the SSA loop. It receives
every frame that arrives on the loop and either passes it on to the adjacent node
(target or initiator) if the frame is not addressed to that router, or keeps it if it
is; a simple sketch of this decision follows the list.
• SSA physical disk device driver
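The keep-or-forward decision made by the SSA router (see the SSA router item above) can be sketched as follows; the frame fields and node names are hypothetical, and real SSA frame handling is done in adapter hardware and microcode.

# Sketch of the SSA router decision for each frame arriving on the loop:
# keep the frame if it is addressed to this node, otherwise pass it on to
# the adjacent node.

from dataclasses import dataclass

@dataclass
class Frame:
    destination: str
    payload: bytes

def route_frame(frame: Frame, this_node: str, forward):
    """Keep frames addressed to this node; forward all others."""
    if frame.destination == this_node:
        return frame.payload                 # deliver locally
    forward(frame)                           # send on to the adjacent node
    return None

forwarded = []
payload = route_frame(Frame("disk3", b"read request"), this_node="disk1",
                      forward=forwarded.append)
print(payload, len(forwarded))               # None 1 -> frame passed along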
Adapter
The SSA RAID adapter software implements RAID-5 functions such as parity
processing and data reconstruction.
The UNIX operating system recognizes the virtual volume mapped by the VSS
storage server as a single physical, generic SCSI disk. The host cannot tell
whether the RAID-5 function is performed on a real SCSI disk drive or on a
virtual volume.
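For an update write, the parity work the adapter hides from the host reduces to: new parity equals old parity XOR old data XOR new data. The sketch below illustrates that calculation with toy values; it is not the adapter's actual code path.

# Sketch of the RAID-5 update-write parity calculation the adapter performs
# behind the single "SCSI disk" the host sees.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def update_write(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Return the new parity strip for an update write to one data strip."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

# Stripe of three data strips plus parity; the host rewrites strip d1 only.
d0, d1, d2 = b"\x01", b"\x02", b"\x04"
parity = xor_bytes(xor_bytes(d0, d1), d2)          # 0x07
new_d1 = b"\x0a"
new_parity = update_write(d1, new_d1, parity)      # 0x0f

# The parity still covers the whole stripe after the update.
assert xor_bytes(xor_bytes(d0, new_d1), d2) == new_parity
print("parity updated:", new_parity.hex())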
Information in this book was developed in conjunction with use of the equipment
specified, and is limited in application to those specific hardware and software
products and levels.
IBM may have patents or pending patent applications covering subject matter in
this document. The furnishing of this document does not give you any license to
these patents. You can send license inquiries, in writing, to the IBM Director of
Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact IBM Corporation, Dept.
600A, Mail Drop 1329, Somers, NY 10589 USA.
The information contained in this document has not been submitted to any
formal IBM test and is distributed AS IS. The use of this information or the
implementation of any of these techniques is a customer responsibility and
depends on the customer's ability to evaluate and integrate them into the
customer's operational environment. While each item may have been reviewed
by IBM for accuracy in a specific situation, there is no guarantee that the same
or similar results will be obtained elsewhere. Customers attempting to adapt
these techniques to their own environments do so at their own risk.
Any pointers in this publication to external Web sites are provided for
convenience only and do not in any manner serve as an endorsement of these
Web sites.