Data Storage Technologies & Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
Topics
[Figure: processor with registers connected to main memory over address, control, and data buses]
• Computing
– Apps such as web servers, video conferencing,
database servers, streaming, etc.
• Networking
– Provides connectivity between computing nodes
– e.g. a web service running on one computing node talks
to a database service running on another computer
• Storage (Persistent + Non-Persistent)
– Where all data resides
3
BITS Pilani, Pilani Campus
Memory Requirements
4
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[1]
              DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
              (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock Cycle   250 ns           25 ns             1 ns           0.4 ns
5
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[2]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock Cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)
6
BITS Pilani, Pilani Campus
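The table above translates into a rough bandwidth estimate. The following back-of-the-envelope sketch (not from the slides; it assumes about 4 bytes fetched per instruction and ignores data traffic and caches) shows how the required instruction-fetch bandwidth grows from a few MB/s on the VAX to tens of GB/s on a hyperthreaded multi-core:

machines = {
    "DEC VAX 11/780 ('80)":           {"cycle_ns": 250, "ipc": 0.1},
    "Early pipeline ('90)":           {"cycle_ns": 25,  "ipc": 1},
    "Superscalar ('00)":              {"cycle_ns": 1,   "ipc": 2},
    "Hyperthreaded multi-core ('08)": {"cycle_ns": 0.4, "ipc": 8},
}

BYTES_PER_INSTRUCTION = 4  # assumption for illustration only

for name, m in machines.items():
    instr_per_sec = m["ipc"] / (m["cycle_ns"] * 1e-9)      # instructions per second
    bandwidth_gb_s = instr_per_sec * BYTES_PER_INSTRUCTION / 1e9
    print(f"{name:32s} ~{bandwidth_gb_s:10.4f} GB/s of instruction fetch")

This prints roughly 0.0016 GB/s for the VAX and about 80 GB/s for the 2008-era multi-core, which is the point of the slide: memory bandwidth demand has grown by several orders of magnitude.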
Memory Bandwidth
Requirement[3]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock Cycle              250 ns           25 ns             1 ns           0.4 ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)
7
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[4]
[Table columns: DEC VAX 11/780 (circa '80) | Early pipelines (circa '90) | Superscalars (circa '00) | Hyperthreaded Multi-cores (circa '08)]
8
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[5]
[Table columns: DEC VAX 11/780 (circa '80) | Early pipelines (circa '90) | Superscalars (circa '00) | Hyperthreaded Multi-cores (circa '08)]
9
BITS Pilani, Pilani Campus
Memory Hierarchy[1]
– Locality of reference
• Memory references tend to cluster in a small region of
memory locations, and the same set of data tends to be
accessed repeatedly (spatial and temporal locality)
10
BITS Pilani, Pilani Campus
Memory Hierarchy[2]
12
BITS Pilani, Pilani Campus
Memory Hierarchy:
Performance
• Exercise:
– Effective Access time for 2-level hierarchy
13
BITS Pilani, Pilani Campus
Memory Hierarchy: Memory
Efficiency
• Memory Efficiency
– M.E. = 100 × (Th / Teff)
– M.E. = 100 / (1 + Pmiss × (R − 1)), where R = T(h+1) / Th
(ratio of the next lower level's access time to this level's)
• Maximum memory efficiency
– R = 1 or Pmiss = 0
– Consider
• R = 10 (CPU/SRAM)
• R = 50 (CPU/DRAM)
• R = 100 (CPU/Disk)
• What will be the Pmiss for ME = 95% for each of these?
14
BITS Pilani, Pilani Campus
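A minimal sketch of the exercise above, assuming the efficiency formula M.E. = 100 / (1 + Pmiss × (R − 1)) from the previous slide and simply solving it for Pmiss (the helper name is invented for the example):

def pmiss_for_efficiency(me_percent: float, r: float) -> float:
    """Return the miss probability that yields the requested memory efficiency."""
    return (100.0 / me_percent - 1.0) / (r - 1.0)

for r in (10, 50, 100):            # CPU/SRAM, CPU/DRAM, CPU/Disk ratios from the slide
    p = pmiss_for_efficiency(95.0, r)
    print(f"R = {r:3d}: Pmiss <= {p:.4%}")

For the 95% target this gives Pmiss of about 0.58%, 0.11% and 0.05% for R = 10, 50 and 100 respectively: the slower the lower level, the rarer misses must be.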
Memory Technologies-
Computational
• Cache between CPU registers and main memory
– Static RAM (6 transistors per cell)
– Typical Access Time ~10ns
• Main Memory
– Dynamic RAM (1 transistor + 1 capacitor)
– Capacitive leakage results in loss of data
• Needs to be refreshed periodically – hence the term
“dynamic”
– Typical Access Time ~50ns
– Typical Refresh Cycle ~100ms.
15
BITS Pilani, Pilani Campus
Memory Technologies-
Persistent
• Hard Disks
– Used for persistent online storage
– Typical access time: 10 to 15ms
– Semi-random or semi-sequential access:
• Access in blocks – typically – of 512 bytes.
– Cost per GB – Approx. Rs 5.50
• Magnetic Tapes
– Access Time – (Initial) 10 sec.; 60Mbps data transfer
– Density – up-to 6.67 billion bits per square inch
– Data Access – Sequential
– Cost - Cheapest
17
BITS Pilani, Pilani Campus
Caching
18
BITS Pilani, Pilani Campus
Caching- Generic [1]
Major components of a disk drive: platters, read/write heads, actuator assembly, spindle motor
(Figure source: dataclinic.co.uk)
23
BITS Pilani, Pilani Campus
Disk Drive: Geometry
26
BITS Pilani, Pilani Campus
Disk Characteristics
• Fixed (rare) or movable head
– Fixed head
• One r/w head per track, mounted on a fixed rigid arm
– Movable head
• One r/w head per side mounted on a movable arm
• Removable or fixed
• Single or double (usually) sided
• Single or multiple platter
– Heads are joined and aligned
– Aligned tracks on each platter form cylinders
– Data is striped by cylinder
• Reduces head movement
• Increases speed (transfer rate)
• Head mechanism
– Contact (Floppy)
– Fixed gap
– Flying (Winchester)
27
BITS Pilani, Pilani Campus
Capacity
32
BITS Pilani, Pilani Campus
Disk Drive Performance
• Parameters to measure the performance
– Seek time
– Rotational Latency
– IOPS (Input/Output Operations per Second)
• Read
• Write
• Random
• Sequential
• Cache hit
• Cache miss
– MBps
• # of megabytes per second a disk drive can transfer
• Useful to measure sequential workloads like media streaming
33
BITS Pilani, Pilani Campus
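As a hedged illustration of how seek time and rotational latency bound random IOPS (the numbers below are assumed for the example, not vendor figures):

def disk_iops(avg_seek_ms: float, rpm: int, transfer_mb_s: float, io_kb: float) -> float:
    rotational_ms = 0.5 * 60_000 / rpm            # half a revolution on average
    transfer_ms = io_kb / 1024 / transfer_mb_s * 1000
    access_ms = avg_seek_ms + rotational_ms + transfer_ms
    return 1000 / access_ms                       # I/Os per second

# e.g. a 7200 RPM drive, 8 ms average seek, 150 MB/s media rate, 4 KB random I/O
print(f"~{disk_iops(8.0, 7200, 150.0, 4.0):.0f} IOPS")

This lands around 80 random IOPS, which is why sequential MBps and random IOPS must be quoted separately.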
Disk Drive Performance: Eye
Opener Facts!
• Access time (Seek + Rotational ) rating
– Important to distinguish between sequential and
random access request set
• Usually vendors quote IOPS numbers to impress
– Important to note whether the IOPS numbers being
quoted are for cache hits or cache misses
• Real world workload is a mix of accesses with
– Read, Write, random, sequential, cache hit, cache
miss
34
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
37
BITS Pilani, Pilani Campus
Flash Memory
• Semiconductor based persistent storage
• Two types
– NAND and NOR flash
• Anatomy of flash memory
– Cells → Pages → Blocks
– A new flash device comes with all cells set to 1
– Cells can be programmed from 1 to 0
– To change the value of a cell back to 1, the entire
block must be erased
• Erasure is possible at block level only!
38
BITS Pilani, Pilani Campus
Read/Write/Programming on
Flash Memory
• Read operation is the fastest operation
• A first-time write is very fast
– Every cell in the block is preset to 1 and can be
individually programmed to 0
– If any part of a flash memory block has already been
written to, subsequent writes to any part of that
block require a read/erase/program cycle
• This is about 100 times slower than a read operation
– Erasing alone is about 10 times slower than a read
operation
39
BITS Pilani, Pilani Campus
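A toy sketch of the program/erase behaviour described above (the class and values are invented for the example): programming can only clear bits, so rewriting already-written cells needs a block erase first.

class FlashBlock:
    def __init__(self, pages=4, page_bytes=4):
        self.pages = [[0xFF] * page_bytes for _ in range(pages)]   # erased = all 1s

    def program(self, page_no, data):
        # AND models the device: already-0 bits cannot be set back to 1 here
        self.pages[page_no] = [old & new for old, new in zip(self.pages[page_no], data)]

    def erase(self):
        for page in self.pages:
            page[:] = [0xFF] * len(page)    # the only way back to all 1s

blk = FlashBlock()
blk.program(0, [0x0F, 0xAA, 0xFF, 0xFF])    # first write: fast, bits go 1 -> 0
blk.program(0, [0xFF, 0x0F, 0xFF, 0xFF])    # overwrite attempt: 0s stay 0
print(blk.pages[0])                         # [15, 10, 255, 255] -> block erase needed first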
NAND vs. NOR
                      NAND      NOR
Cost per bit          Low       High
Capacity              High      Low
Read Speed            Medium    High*
Write Speed           High      Low
File Storage Use      Yes       No
Code Execution        Hard      Easy
Stand-by Power        Medium    Low
Erase Cycles          High      Low
*Individual cells (in NOR) are connected in parallel, which enables faster random reads
40
BITS Pilani, Pilani Campus
Anatomy of NAND Flash
• NAND Flash types
– Single level cell (SLC)
• A cell can store 1 bit of data
• Highest performance and longest life span (100,000 program/erase cycles
per cell)
– Multi level cell (MLC)
• Stores 2 bits of data per cell.
• P/E cycles = 10,000
– Enterprise MLC (eMLC)
• MLC with stronger error correction
• Heavily over-provisioned for high performance and reliability
– e.g. a 400 GB eMLC drive might actually have 800 GB of eMLC flash
– Triple level cell (TLC)
• Stores 3 bits per cell
• P/E cycles = 5,000 per cell
• High on capacity but low on performance and reliability
41
BITS Pilani, Pilani Campus
Enterprise Class SSD
• Reliability
– Mean Time-Between-Failure (MTBF)
• e.g. 1.2 TB SAS drive states a MTBF value of 2 million
hours
– Annual Failure Rate (AFR)
• To estimate the likelihood that a disk drive will fail during
a year of full use
• Individual disk reliability (as claimed in
manufacturer's warranties) is often very high
– e.g. rated 30,000 hours, with about 100,000 hours observed
in practice for an IBM disk in the 1980s
46
BITS Pilani, Pilani Campus
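A minimal sketch relating the two reliability metrics above, assuming an exponential failure model so that AFR ≈ 1 − e^(−hours per year / MTBF); the helper name is invented:

import math

def afr(mtbf_hours: float, hours_per_year: float = 8760.0) -> float:
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)

print(f"AFR for a 2,000,000 h MTBF drive: {afr(2_000_000):.2%}")   # about 0.44%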
Disk Performance issues[2]
• Access Speed
– Access Speed of a pathway = Minimum speed among all
components in the path
– e.g. CPU and Memory Speeds vs. Disk Access Speeds
• Solution:
– Multiple Disks i.e. array of disks
– Issue: Reliability
• MTTF of an array = MTTF of a single disk / # disks in the array
47
BITS Pilani, Pilani Campus
Disk Reliability
49
BITS Pilani, Pilani Campus
Performance Improvement in
Secondary Storage
• In general, adding multiple components improves
performance
• Similarly, shouldn't multiple disks reduce access time?
– An array of disks can operate independently and in parallel
• Justification
– With multiple disks separate I/O requests can be
handled in parallel
– A single I/O request can be executed in parallel, if the
requested data is distributed across multiple disks
• Researchers @ University of California-Berkeley
proposed the RAID (1988)
50
BITS Pilani, Pilani Campus
RAID
51
BITS Pilani, Pilani Campus
RAID Fundamentals
• Striping
– Map data to different disks
– Advantage…?
• Mirroring
– Replicate data
– What are the implications…?
• Parity
– Loss recovery/Error correction / detection
52
BITS Pilani, Pilani Campus
RAID
• Characteristics
1. Set of physical disks viewed as single logical drive
by operating system
2. Data distributed across physical drives
3. Can use redundant capacity to store parity
information
53
BITS Pilani, Pilani Campus
Data Mapping in RAID 0
54
BITS Pilani, Pilani Campus
RAID 1
Mirrored Disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either and Write to both
55
BITS Pilani, Pilani Campus
Data Mapping in RAID 2
57
BITS Pilani, Pilani Campus
RAID 4
• Make use of independent access with block level striping
• Good for high I/O request rate due to large strips
• Bit-by-bit parity calculated across corresponding strips on each data disk
• Parity stored on parity disk
• Drawback???
58
BITS Pilani, Pilani Campus
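A minimal sketch (illustrative byte strings, not a real RAID implementation) of the XOR parity idea used by RAID 4/5: the parity strip is the XOR of the data strips, so any single lost strip can be rebuilt from the survivors.

from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple) for byte_tuple in zip(*blocks))

d0, d1, d2 = b"\x11\x22\x33", b"\x44\x55\x66", b"\x0f\xf0\xaa"
parity = xor_blocks([d0, d1, d2])

# Disk holding d1 fails: rebuild it from the surviving strips plus parity.
rebuilt = xor_blocks([d0, d2, parity])
assert rebuilt == d1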
RAID 5
• Round robin allocation for parity stripe
• It avoids RAID 4 bottleneck at parity disk
• Commonly used in network servers
• Drawback
– Disk failure has a medium impact on throughput
– Difficult to rebuild in the event of a disk failure (as
compared to RAID level 1)
59
BITS Pilani, Pilani Campus
RAID 6
• Two parity calculations
• Stored in separate blocks on different disks
• High data availability
– Three disks need to fail for data loss
– Significant write penalty
• Drawback
– Controller overhead to compute parity is very high
60
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(1+0)
• RAID 1 (mirror) arrays are built first,
then combined to form a RAID 0
(stripe) array.
• Provides high levels of:
– I/O performance
– Data redundancy
– Disk fault tolerance.
61
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(0+1)
• RAID 0 (stripe) arrays are built first, then
combined to form a RAID 1 (mirror) array
• Provides high levels of I/O performance
and data redundancy
• Slightly less fault tolerance than a 1+0
– How…?
62
BITS Pilani, Pilani Campus
RAID Implementations
• A hardware implementation of RAID requires at least
a special-purpose RAID controller.
• On a desktop system this may be built into the
motherboard.
• The processor is not used for RAID calculations, since
a separate controller is present.
64
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
66
BITS Pilani, Pilani Campus
Files and File Systems:
Introduction
• A file system is a logical abstraction for
secondary (persistent) storage
– i.e. physical storage is partitioned into logical units
– Each logical unit has a file system – modeled as a
tree for navigation
– Nodes of the tree are directories (internal) and files
(terminal)
• In Unix directories are also files that store directory
listings
67
BITS Pilani, Pilani Campus
Files and File Systems: Data
Access Types
• Persistent data may be accessed:
– in large sequential chunks or
– in small random records
69
BITS Pilani, Pilani Campus
File Descriptors in UNIX
72
BITS Pilani, Pilani Campus
Open Files in UNIX: File Entries
[3]
• Each “file entry” has a reference count
– Multiple descriptors may refer to the same file
entry:
• Single Process: dup system call
• Multiple Processes: child process after a fork inherits
file structures
– A read or write by either process (via the corresponding
descriptor) will advance the file offset
– This allows interleaved input/output from/to a file
73
BITS Pilani, Pilani Campus
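A minimal sketch of the shared file entry described above, using dup(); it assumes a small file named demo.txt exists (the name is just a placeholder):

import os

fd = os.open("demo.txt", os.O_RDONLY)
fd2 = os.dup(fd)                       # second descriptor -> same file entry

os.read(fd, 5)                         # advances the shared offset by 5 bytes
print(os.lseek(fd2, 0, os.SEEK_CUR))   # prints 5: fd2 sees the advanced offset

os.close(fd)
os.close(fd2)

A child created by fork() behaves the same way: parent and child descriptors reference one file entry, so their reads and writes interleave on a single offset.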
Virtual File Systems and V-nodes
75
BITS Pilani, Pilani Campus
Unix File System
76
BITS Pilani, Pilani Campus
Local File System and i-node
80
BITS Pilani, Pilani Campus
Organization of File store
• Berkeley Fast File System (FFS) model:
– A file system is described by its superblock
• Located at the beginning of the file system
• Possibly replicated for redundancy
– A disk partition is divided into one or more cylinder
groups i.e. a set of consecutive cylinders:
• Each group maintains book-keeping info. including
– a redundant copy of superblock
– Space for i-nodes
– A bitmap of available blocks and
– Summary usage info.
• Cylinder groups are used to create clusters of
allocated blocks to improve locality.
81
BITS Pilani, Pilani Campus
Local File Store- Storage
Utilization
• Data layout – Performance requirement
– Large blocks of data should be transferable in a single disk operation
• So, logical block size must be large.
– But, typical profiles primarily use small files.
• Internal Fragmentation:
– Increases from 1.1% for a 512-byte logical block size to
2.5% for 1KB, 5.4% for 2KB, and an unacceptable 61% for 16KB
• I-nodes also add to the space overhead:
– But overhead due to i-nodes is about 6% for logical block sizes of
512B, 1KB and 2KB, reduces to about 3% for 4KB, and to 0.8% for
16KB.
• One option to balance internal fragmentation against
improved I/O performance is
– to maintain large logical blocks made of smaller fragments
82
BITS Pilani, Pilani Campus
Local File Store – Layout [1]
• Global Policies:
– Use summary information to make decisions
regarding placement of i-nodes and disk blocks.
• Routines responsible for deciding placement of new
directories and files.
– Layout policies rely on locality to cluster information
for improved performance.
• E.g. Cylinder group clustering
• Local Allocation Routines.
– Use locally optimal schemes for data block layouts.
83
BITS Pilani, Pilani Campus
Local File Store – Layout [2]
84
BITS Pilani, Pilani Campus
OS Support for I/O
• System calls form the interface between applications
and OS (kernel)
– File System and I/O system are responsible for
• Implementing system calls related to file management and handling
input/output.
• Device drivers form the interface between the OS
(kernel) and the hardware
85
BITS Pilani, Pilani Campus
I/O in UNIX - Example
87
BITS Pilani, Pilani Campus
I/O in UNIX – Device Drivers
90
BITS Pilani, Pilani Campus
Device Driver to Disk Interface
• Disk Interface:
– Disk access requires an address:
• device id, LBA OR
• device id, cylinder #, head#, sector #.
– Device drivers need to be aware of disk details for
address translation:
• i.e. converting a logical address (say file system level address
such as i-node number) to a physical address (i.e. CHS) on the
disk;
• they need to be aware of complete disk geometry if CHS
addressing is used.
– Early device drivers had hard-coded disk geometries.
• This results in reduced modularity –
– disks cannot be moved (data mapping is with the device driver);
– device driver upgrades would require shutdowns and data copies.
91
BITS Pilani, Pilani Campus
Disk Labels
92
BITS Pilani, Pilani Campus
Buffering
• Buffer Cache
– Memory buffer for data being transferred to and from
disks
– Cache for recently used blocks
• 85% hit rate is typical
• Typical buffer size = 64KB virtual memory
• Buffer pool – hundreds of buffers
• Consistency issue
– Each disk block mapped to at most one buffer
– Buffers have dirty bits associated
– When a new buffer is allocated and if its disk blocks
overlap with that of an existing buffer, then the old buffer
must be purged.
93
BITS Pilani, Pilani Campus
Buffer Pool Management
• Buffer pool is maintained as a (separately chained) hash table
indexed by a buffer id
• The buffers in the pool are also in one of four lists:
– Locked list:
• buffers that are currently used for I/O and therefore locked and cannot be
released until operation is complete
– LRU list:
• A queue of buffers – a recently used item is added at the rear of the queue
and when a buffer is needed one at the front of the queue is replaced.
• Buffers staying in this queue long enough are migrated to an Aged list
– Aged List:
• Maintained as a list and any element may be used for replacement.
– Empty List
• When a new buffer is needed check in the following order:
– Empty List, Aged List, LRU list
94
BITS Pilani, Pilani Campus
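A simplified sketch of the buffer-pool idea (a dictionary keyed by block id with LRU replacement; this is an illustration, not the actual BSD locked/LRU/aged/empty lists described above):

from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffers = OrderedDict()          # block_id -> data, oldest first

    def read_block(self, block_id, read_from_disk):
        if block_id in self.buffers:          # cache hit
            self.buffers.move_to_end(block_id)
            return self.buffers[block_id]
        data = read_from_disk(block_id)       # cache miss: fetch from disk
        if len(self.buffers) >= self.capacity:
            self.buffers.popitem(last=False)  # evict least recently used buffer
        self.buffers[block_id] = data
        return data

pool = BufferPool(capacity=2)
pool.read_block(7, lambda b: f"block-{b}")   # miss
pool.read_block(7, lambda b: f"block-{b}")   # hit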
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
96
BITS Pilani, Pilani Campus
Flash Memory
• Flash chips are arranged into blocks which are typically 128KB
on NOR and 8KB on NAND flash
• All flash cells are preset to one. These cells can be individually
programmed to zero
– But resetting bits from zero to one cannot be done individually;
it can be done only by erasing a complete block
97
BITS Pilani, Pilani Campus
Traditional File System: Erase-
Modify-Write back
• Use 1:1 mapping from the emulated block
device to the flash chip
– read the whole erase block, modify the
appropriate part of the buffer, then erase and rewrite the
entire block
• No wear leveling…!
• Unsafe due to power loss between erase and write back
• Slightly better method
– by gathering writes to a single erase block and only
performing the erase/modify/write back procedure
when a write to a different erase block is requested.
98
BITS Pilani, Pilani Campus
Journaling File System
99
BITS Pilani, Pilani Campus
JFFS Storage Format
100
BITS Pilani, Pilani Campus
Wear Leveling
101
BITS Pilani, Pilani Campus
Flash Memory: Operations
• I/O Techniques
– Polling
– Interrupt driven
– DMA
105
BITS Pilani, Pilani Campus
I/O Techniques
• Polling:
– CPU checks the status of an I/O device by reading
a memory address associated with that I/O
device
– Pseudo-asynchronous
• Processor inspects (multiple) devices in rotation
– Cons
• Processor may still be forced to do useless work, or wait, or
both
– Pros
• CPU can determine how often it needs to poll
106
BITS Pilani, Pilani Campus
I/O Techniques
• Interrupts:
– Processor initiates I/O by requesting an operation
with the device.
– May disconnect if response can’t be immediate,
which is usually the case
– When device is ready with a response it interrupts
the processor.
• Processor finishes I/O with the device.
– Asynchronous but
• Data transfer between I/O device and memory still
requires processor to execute instructions.
107
BITS Pilani, Pilani Campus
I/O Techniques: Interrupts
108
BITS Pilani, Pilani Campus
I/O Techniques
109
BITS Pilani, Pilani Campus
I/O Techniques
• I/O Processor
– More sophisticated version of DMA controller
with the ability to execute code: execute I/O
routines, interact with the O/S etc
110
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
112
BITS Pilani, Pilani Campus
What is a Bus ?
• A bus is
– a shared communication link between sub- systems
of a computer and
– an associated protocol for communication
• Note: A protocol is a set of rules - often specified formally.
• E.g.
– Single Shared bus
– Separate system bus
and I/O Bus
113
BITS Pilani, Pilani Campus
Traditional Bus Architecture
• Multi bus architecture
– To avoid contention
– Better Performance
– Device requirements are different
114
BITS Pilani, Pilani Campus
Bus Arbitration
116
BITS Pilani, Pilani Campus
Common Bus Protocols: PCI
• PCI Bus
– Created by INTEL in 1993
– Synchronous bus with 32 bits operating at a clock
rate of 33 MHz
– Transfer rate 132 MB/s
– PCI-X extended the bus to 64 bits and either 66
MHz or 133 MHz for data rates of 528 MB/s or
1064 MB/s
117
BITS Pilani, Pilani Campus
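A minimal sketch of the arithmetic behind the quoted figures: peak bandwidth = bus width in bytes × clock rate × transfers per clock (the helper name is invented).

def bus_bandwidth_mb_s(width_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    return width_bits / 8 * clock_mhz * transfers_per_clock

print(bus_bandwidth_mb_s(32, 33))    # ~132 MB/s  (PCI)
print(bus_bandwidth_mb_s(64, 66))    # ~528 MB/s  (PCI-X at 66 MHz)
print(bus_bandwidth_mb_s(64, 133))   # ~1064 MB/s (PCI-X at 133 MHz)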
PCI BUS Lines
119
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
121
BITS Pilani, Pilani Campus
SCSI: I/O Path from CPU to
Storage
• Stands for Small Computer System Interface
– Asynchronous, parallel bus
– Allows multiple bus masters
– Commonly used for high-end Storage devices
• Several standards
– Earliest Standard
• No. of wires (50)
• 8 data lines
• data transfer rate (5MB/s)
• wire length(25m)
122
BITS Pilani, Pilani Campus
SCSI Bus Standard
• Fast SCSI
– Doubled clock rate, data transfer up to 10MB/s
• Wide SCSI
– 16 data lines
– Fast Wide SCSI – 20 MB/s
• Ultra SCSI
– 8 data lines – data transfer rate 20MB/s
– Ultra Wide SCSI – 16 data lines – data transfer 40
MB/s
– Ultra320 - 16 data lines – data transfer 320 MB/s
• Serial Attached SCSI (SAS) and iSCSI (IP over SCSI)
123
BITS Pilani, Pilani Campus
SCSI – Components Model
126
BITS Pilani, Pilani Campus
SCSI Command Protocol
• Command Types
– Non-data, Write, Read, Bidirectional
– e.g.
• Test Unit Ready
• Inquiry
• Start/Stop Unit
• Request Sense (for error)
• Read Capacity
• Format Unit
• Read (4 variants)
• Write (4 variants)
127
BITS Pilani, Pilani Campus
Command Descriptor Block
(CDB)
• Opcode
• LUN (3 bits)
• e.g. for Read
– Read (6): 21 bits LBA, 1 byte transfer length
– Read(10): 32 bit LBA, 2 byte transfer length
– Read(12): 32 bit LBA, 4 byte transfer length
– Read Long for ECC-Compliant data
128
BITS Pilani, Pilani Campus
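As an illustration, a hedged sketch of packing a 10-byte READ(10) CDB (opcode 0x28, a flags byte, a 32-bit big-endian LBA, a group field, a 16-bit transfer length in blocks, and a control byte); the helper name is invented:

import struct

def read10_cdb(lba: int, num_blocks: int) -> bytes:
    return struct.pack(">BBIBHB", 0x28, 0x00, lba, 0x00, num_blocks, 0x00)

cdb = read10_cdb(lba=0x12345678, num_blocks=8)
print(cdb.hex())   # 28 00 12 34 56 78 00 00 08 00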
SCSI – State Transition – Bus
Phases
• Bus Free Phase
– Bus is not being used by anyone
• Arbitration Phase
– One or more initiators request use of the bus and
• the one with the highest priority (SCSI ID order) is allowed to proceed
• Selection / Reselection phase
– Initiator asserts BUS BUSY signal and
• places target’s address on the bus thus establishing a connection/session
– Re-selection applies for a target resuming an interrupted operation:
• target asserts BUS BUSY signal and places initiator’s address on the bus.
• Message
– Exchange of control messages between initiator and target
• Command
– Initiator sends command to target
• Data
– Data exchange (as part of a read/write operation)
• Status
– Status of a command.
129
BITS Pilani, Pilani Campus
SCSI – State Transition – Bus
Phase Sequence [1]
• For a read operation:
1. When bus is free, initiator enters into arbitration (with other possible
initiators)
2. On arbitration, initiator selects the bus and places target address on
bus
3. Target acknowledges selection by a message
4. Initiator sends command
5. Target acknowledges command by a message
6. Target devices places (read) data on bus
7. Initiator acknowledges data by a message
8. Target sends status
• Assumption:
– Target holds the bus while it is reading – this is acceptable only for simple
devices and small read delays.
• Exercise: Modify the above sequence for a write operation.
130
BITS Pilani, Pilani Campus
SCSI – State Transition – Bus
Phase Sequence [2]
• For a read operation:
1. When bus is free, initiator enters into arbitration (with other possible
initiators)
2. On arbitration, initiator selects the bus and places target address on bus
3. Target acknowledges selection by a message
4. Initiator sends command
5. Target acknowledges command by a message
6. Target sends “Disconnect” message to initiator
7. Bus is released
8. When target is ready with (read) data, target reselects bus and places
initiator address on bus
9. Initiator acknowledges selection by a message
10. Target sends data.
11. Initiator acknowledges data by a message
12. Target sends status
• Question: Modify the above sequence for a write operation.
131
BITS Pilani, Pilani Campus
SCSI – State Transition – Bus
Phase Sequence [3]
• For a sequence of read-write operations:
– Commands can be chained – i.e. a sequence of I/O operations
initiated by a single initiator on the same target.
– In this case, one arbitration step and one selection phase are enough.
134
BITS Pilani, Pilani Campus
Error Correction -In channel
135
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
137
BITS Pilani, Pilani Campus
Storage Arrays Types
• SAN
– Provides connectivity via FC, FCoE, iSCSI, SCSI (SAS)
– Uses low level disk drive access commands like READ block,
WRITE block, and READ capacity
• NAS
– Provides connectivity over file based protocols like Network
File System (NFS), SMB/CIFS (Server Message Block/Common
Internet File System)
– Uses high level file based protocols
– Commands: create a file, rename a file, lock a byte range
within a file etc.
• Unified (SAN and NAS)
– Shared storage over both file and block protocols (aka
multiprotocol arrays)
138
BITS Pilani, Pilani Campus
Storage on the Network
[Figure: clients issue file read/write requests to the NAS head, which performs block I/O to its disk drives]
142
BITS Pilani, Pilani Campus
File Systems
143
BITS Pilani, Pilani Campus
File Systems Functions
• Journaling
– Ensure consistency of the file system even after a system
crash
– Every change in the file system is first recorded in a log
file
• Snapshots
– To freeze the state of the file system at a given point of
time (state of the data should be consistent)
– It loads server’s CPU and hardware independent
• Dynamic File System Expansion
– Volume manager
144
BITS Pilani, Pilani Campus
NAS Systems
• File Servers
– Operating System Implementation
• Customized Operating Systems
– Typically thinned versions of Linux or Windows
• Most tasks are I/O bound (particularly file I/O)
– Simpler scheduling and task management
– No user management or user interaction required
• Restricted Memory allocation model (Mostly for buffering)
– No heap needed
– Limited stack size
• Tasks are (soft) real-time
– At the server level, I/O requests must have time-bounds to provide
performance guarantees
145
BITS Pilani, Pilani Campus
NAS Arrays: Scaling
• NAS Cluster
147
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
149
BITS Pilani, Pilani Campus
What is NFS?
• Stateless protocol
– Increased robustness – easy recovery from failures
• Support for Unix filesystem semantics
– Also supports other semantics such as MS-DOS
• Protection and Access Control same as Unix
filesystem semantics
• Transport-independent protocol
– Originally implemented on top of UDP
– Easily moved to TCP later
– Has been implemented over many non-IP-based
protocols.
151
BITS Pilani, Pilani Campus
NFS Design Limitations
• Stateless protocol
– Session state is not maintained
152
BITS Pilani, Pilani Campus
NFS- Transport[1]
158
BITS Pilani, Pilani Campus
NFS - Operation
• NFSv4 features
– Access control lists (similar to Windows ACLs)
– Compound RPCs
– Client side caching
– Operation over TCP port 2049
– Mandating Strong security (Kerberos v5 for cryptographic
services)
161
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
163
BITS Pilani, Pilani Campus
NFS Performance
164
BITS Pilani, Pilani Campus
NFS - Performance -
Inconsistent Data
• Scenario:
First client writes data that is later read by second client.
• Two main ways for stale data to be read:
1. Second client has stale data in its cache and does not
know that modified data are available.
2. First client has modified data in its cache but has not
written those data back to the server.
• Synchronous writing solves the second problem.
– It also results in behavior that is close to the local filesystem.
– But clients are restricted to one write per RPC RTT.
165
BITS Pilani, Pilani Campus
NFS Performance Caching[1]
• Delayed writing model:
– Write request returns as soon as data are cached by the client
• Pros:
– The following can now be bundled into a single request to the
server (i.e. the last one):
• multiple writes to the same blocks,
• file deletion or file truncation shortly after write(s)
• Cons:
– Client crash may result in loss of data
– Server must notify a client - holding a cached copy –
• that other client(s) want to read/write the file held by the first client.
• This introduces state in the implementation
– Error propagation to the client may be problematic:
• e.g. “Out of space” error
• e.g. client process exiting before error notification
166
BITS Pilani, Pilani Campus
NFS Performance Caching[2]
• Asynchronous writing model:
– As soon as data are cached by the client, write to the server is
initiated and then the write request returns.
• Variants:
– write on close (file)
• Delays are only deferred
– Read-sharing only (e.g. the Sprite file system, a Unix-like distributed file
system)
• Cache Verification model
– Client performs cache verification on access
• RPC RTT delays
• Callback model
– Server keeps track of cached copies and notifies them on update
167
BITS Pilani, Pilani Campus
NFS Performance Caching[3]
• Leasing model:
– Leases are issued for time intervals.
– As long as lease holds server will callback on update
– When lease expires client must verify its cache
contents and/or obtain a new lease.
• Requires much less server memory and reduces
traffic.
• Read-caching and write-caching may be given
separate leases.
168
BITS Pilani, Pilani Campus
NFS Crash Recovery
• SMB/CIFS protocol
171
BITS Pilani, Pilani Campus
• CIFS is the de facto file-serving protocol in the
Microsoft windows world
• It operates natively over TCP/IP networks on
port 445
172
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
174
BITS Pilani, Pilani Campus
Storage Area Network
Architecture
[Figure: clients connect to servers over a LAN; the servers connect to shared storage over a SAN]
175
BITS Pilani, Pilani Campus
Storage Area Network
178
BITS Pilani, Pilani Campus
SAN Traffic
181
BITS Pilani, Pilani Campus
FC-SAN High Level View
[Figure: a host with 2x HBAs (initiators) connects through FC Switch A and FC Switch B to a storage array (target)]
182
BITS Pilani, Pilani Campus
FC Protocol Stack
183
BITS Pilani, Pilani Campus
SAN Ports: FC Switch Ports
• U_Port
– Un-configured and uninitialized
• N_Port
– Aka Node Port (to connect end devices)
• F_Port
– Switch ports that accept connections from N_Ports operate as fabric ports
• L_Port
– Node port used to connect a node to a Fibre Channel loop
• E_Port
– Expansion port to connect two SAN switches (allows merging)
• EX_Port
– It is an E_Port used for FC routing (prevents fabrics from merging)
• Port Speed:
– 2, 4, 8, 16 Gbps
184
BITS Pilani, Pilani Campus
FC-SAN Port Connectivity
[Figure: the host's HBA N_Ports connect to F_Ports on FC Switch A and FC Switch B; the storage array's N_Ports connect to F_Ports on the same switches; FC Switch B links to FC Switch C through E_Ports]
185
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
187
BITS Pilani, Pilani Campus
SAN Ports: FC Switch Ports
• U_Port
– Un-configured and uninitialized
• N_Port
– Aka Node Port (to connect end devices)
• F_Port
– Switch ports that accept connections from N_Ports operate as fabric ports
• L_Port
– Node port used to connect a node to a Fibre Channel loop
• E_Port
– Expansion port to connect two SAN switches (allows merging)
• EX_Port
– It is an E_Port used for FC routing (prevents fabrics from merging)
• Port Speed:
– 2, 4, 8, 16 Gbps
188
BITS Pilani, Pilani Campus
Example: FC-SAN Port
Connectivity
[Figure: the host's HBA N_Ports connect to F_Ports on FC Switch A and FC Switch B; the storage array's N_Ports connect to F_Ports on the same switches; FC Switch B links to FC Switch C through E_Ports]
189
BITS Pilani, Pilani Campus
Common SAN Topologies: SAN
Structure
• Point to Point
– Direct connection between HBA port - Storage array
port
– e.g. for 8-port storage array, you can have a
maximum of 8 directly attached servers talking to
that storage array
– Limitation
• No scalability
190
BITS Pilani, Pilani Campus
FC -SAN Structure[1]
• Structure – Arbitrated Loop (AL)
– Storage devices - through L-ports - are connected to an
(FC) AL hub
– Local hosts are also connected to the AL via I/O bus
adapters
– Hubs do not allow a high transfer rate (due to sharing)
but are cheap.
191
BITS Pilani, Pilani Campus
FC SAN Structure[2]
192
BITS Pilani, Pilani Campus
FC SAN Structure[3]
194
BITS Pilani, Pilani Campus
Switched Fabric
• Inter-connected FC-Switches
195
BITS Pilani, Pilani Campus
Switched Fabric Topologies[1]
• Core-Edge topology
196
BITS Pilani, Pilani Campus
Switched Fabric Topologies[2]
• Cascade Topology
• Ring Topology
197
BITS Pilani, Pilani Campus
Switched Fabric Topologies[3]
• Mesh Topology
– Every switch is connected to every other switch
198
BITS Pilani, Pilani Campus
Redundancy and Resiliency[1]
199
BITS Pilani, Pilani Campus
Redundancy and Resiliency[2]
200
BITS Pilani, Pilani Campus
Redundancy and Resiliency[3]
201
BITS Pilani, Pilani Campus
Redundancy and Resiliency
202
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• SAN Components
– Addressing
– Zoning
– Multi-pathing
– Trunking
– LUN Masking
• FC-SAN Performance Issues
204
BITS Pilani, Pilani Campus
SAN Addressing[1]
209
BITS Pilani, Pilani Campus
Zoning: Example
210
BITS Pilani, Pilani Campus
SAN Zoning
211
BITS Pilani, Pilani Campus
SAN Zoning
• Software Zoning:
– Based on WWN – managed by the OS in the switch
– Number of members in a zone limited by memory
available
– A node may belong to more than one zone.
– More than one set of zones can be defined in a
switch, but only one set is active at a time
• Zone sets can be changed without bringing switch down
– Less secure :
• SZ is implemented using SNS
– Device may connect directly to switch without going through SNS
• WWN spoofing
• WWN numbers can be probed
212
BITS Pilani, Pilani Campus
SAN- Frame Filtering
214
BITS Pilani, Pilani Campus
SAN-Multipathing
• Provide multiple paths between a host and a device (LUN).
– Redundancy for improved reliability and/or higher bandwidth for
improved availability / performance
• Channel subsystem of the kernel in switch OS handles multi-
pathing at software level
– Usually a separate device driver is used, with the following capabilities:
• Enhanced Data Availability
• Automatic path failover and recovery to alternative path
• Dynamic Load balancing
• Path selection policies
• Failures handled:
– Device Bus adapters, External SCSI cables, fibre connection cable,
host interface adapters
• Additional software needed for ensuring that the host sees a
single device.
215
BITS Pilani, Pilani Campus
SAN- LUN Masking
• LUN Masking
– Which servers (HBAs) can see which LUNs
– Performed on the storage array using WWPN of
host’s HBA in FC-SAN
• Zoning vs. Masking
– Zoning takes place at SAN switches where as LUN
masking takes place on the storage array
– LUN masking provides more detailed security than
zoning. How?
216
BITS Pilani, Pilani Campus
FC-SAN Performance Issues
• IP Storage standards
– iSCSI: “Storage resources to be shared over an IP
network”
• Connecting two FC-SANs using TCP/IP
– Tunneling (i.e. FCIP)
• Migration from FC-SAN to an IP-SAN
– internet-FCP
219
BITS Pilani, Pilani Campus
iSCSI SAN
[Figure: iSCSI initiators connect to an iSCSI target over an IP network]
220
BITS Pilani, Pilani Campus
FC-SAN vs. IP-SAN
                                            FC-SAN                      IP-SAN
Protocol Overhead                           Low                         High
Distance Limit                              Yes                         No
H/W Cost                                    High                        Low
Network Administration tools availability   No                          Yes
Network Latency                             Low                         High
CPU Use                                     Low                         High
Data Access Protocol                        SCSI-III                    NFS/CIFS
Access type                                 Block level (file system    File level (file system
                                            is part of server)          is part of storage)
Interface (Server connector)                HBA                         NIC
221
BITS Pilani, Pilani Campus
iSCSI SAN Components
• Initiators
– Issue Read/Write data requests to targets
– Request – response mechanism
– Initiators and targets are technically processes that
run on the respective devices
• Target (disk arrays or servers sharing storage
over iSCSI)
• IP Network
– Used for transportation of iSCSI PDUs
222
BITS Pilani, Pilani Campus
iSCSI Interfaces
223
BITS Pilani, Pilani Campus
iSCSI PDU and Encapsulation
228
BITS Pilani, Pilani Campus
iFCP: Mapping of FCP on TCP/IP
229
BITS Pilani, Pilani Campus
FCIP: Connects two FC-SAN by
TCP/IP
• FCIP is a tunneling protocol
230
BITS Pilani, Pilani Campus
SAN Protocol Taxonomy
[Figure: SAN protocol stacks. Applications and operating systems issue SCSI commands; native FC carries FCP over FC-2/FC-3 fabric services; iFCP and FCIP map FCP onto TCP/IP; iSCSI carries SCSI directly over TCP/IP]
231
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
233
BITS Pilani, Pilani Campus
Data Center Network
Technologies
• Ethernet
– Lossy network due to congestion
– Packet or frame losses need to be handled by higher layers
(e.g. TCP)
– Latency used to be an issue, but no longer is (e.g. 10 Gbps,
40 Gbps, and 100 Gbps Ethernet)
– Shared media (e.g. bus topology)
• Fibre Channel
– High speed and low latency network
– Used exclusively for storage traffic (e.g. SCSI traffic)
– Uses link-layer signaling (communicating buffer credits
and status) to avoid packet losses
– Uses point-to-point topology
234
BITS Pilani, Pilani Campus
FC vs. Ethernet
235
BITS Pilani, Pilani Campus
Storage Network Requirements
236
BITS Pilani, Pilani Campus
Enhanced Ethernet: Lossless
Ethernet
• IEEE task group responsible for it
– Data Center Bridging (DCB) Task Group (802.1)
• Also called Converged Enhanced Ethernet (CEE)
– Objectives
• To transport IP LAN traffic
• FC storage traffic
• InfiniBand high-performance computing traffic (2.5 Gbps)
– Lossless Ethernet
• Enabling the PAUSE function (IEEE 802.3)
– Stops all traffic on a port when a full-queue condition is reached
– PAUSE should be issued only where lossless behaviour is required, not for all traffic
– Priority Flow Control (IEEE 802.1Qbb) can halt traffic according to its priority
tag
– Administrators use the eight lanes defined in IEEE 802.1p to create virtual
lossless lanes for traffic classes like storage and lossy lanes for other classes
237
BITS Pilani, Pilani Campus
Lossless Ethernet Cont…
238
BITS Pilani, Pilani Campus
FCoE Encapsulation
239
BITS Pilani, Pilani Campus
FCoE Encapsulation
[Figure: SCSI carried in FC frames, which are encapsulated in FCoE; blade servers attach with CNAs]
241
BITS Pilani, Pilani Campus
Cabling Options
242
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• RDMA
244
BITS Pilani, Pilani Campus
Remote Direct memory Access
(RDMA) [1]
• Communication between applications
(traditional way)
– Incoming data is accepted by the NIC
– The OS kernel processes it and delivers it to the app
– Multiple levels of buffering
– Costs CPU power and loads the system bus
– Increases latency and reduces throughput
245
BITS Pilani, Pilani Campus
Remote Direct memory Access
(RDMA) [2]
• Virtual Interface Architecture
– Allows data exchange between apps and network
card by bypassing OS (i.e. no CPU intervention)
– How to do…?
• Two communicating apps set-up a connection aka Virtual
Interface
• A VI is a common memory area defined on both
computers
• Using this memory, app and network card on a machine
can exchange data
246
BITS Pilani, Pilani Campus
Remote Direct memory Access
(RDMA) [3]
• Sender application
– fills the memory area with data
– then announces this via the send queue of the VI
hardware
– the VI hardware reads the data directly from the
common memory area and transmits it to the VI
hardware of the second computer
– the receiving app is not informed until all data is available
in the common memory area
– Point-to-point communication is allowed
– e.g. VI-capable FC-HBAs, NICs
– VI communication enables RDMA
247
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• Replication Technologies
– Synchronous
– Asynchronous
• Where to perform replication…?
– Application/Database layer
– Host layer (Logical Volume Manager in Linux)
– Hypervisor layer
– Storage array layer
249
BITS Pilani, Pilani Campus
Replication: Synchronous[1]
• Synchronous
– All writes to a volume are committed to both the
source and the remote replica before the write is
considered complete
• Provides zero data loss i.e. RPO of zero
• Negative impact on performance
253
BITS Pilani, Pilani Campus
Application Layer Replication
[Figure: replication from the Production Site to the DR Site]
255
BITS Pilani, Pilani Campus
Data Guard: Standby databases
• Physical
– Exact physical copy of the primary database with identical
on-disk block layout
• Logical
– Data is same as primary but on-disk structures and database
layout will be different
• Redo logs are converted into SQL statements that are then executed
against the standby database
• Logical standby DB is more flexible than physical standby
DB
– It can be used for more than just DR
• Oracle DB allows both switchover and failover
– Switchover = Manual transition from standby to primary
– Failover = Automatic transition when primary DB fails
256
BITS Pilani, Pilani Campus
Replication: MS Exchange
Server
• Active/Passive Architecture
– Allows more than one standby
copy of a database
– Also, utilizes log shipping
– Supports both switch-over and
fail-over
– Fundamental unit of replication is
the Data Availability Group (DAG)
(i.e. collection of exchange
servers)
– Recovery method
• At database level
257
BITS Pilani, Pilani Campus
Replication- Logical Volume
Manager Based
• Host based volume manager for remote
replication using LVM mirrors
– e.g. Linux LVM
• LVM is agnostic about the underlying storage
technology
– LVM can be considered as a thin software layer on
top of the hard disks and partitions
• Eases disk replacement, repartitioning, backup, and
migration
– commonly used for storage migration
258
BITS Pilani, Pilani Campus
Replication-Hypervisor based
260
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
262
BITS Pilani, Pilani Campus
Snapshot Based Replication
263
BITS Pilani, Pilani Campus
Snapshot: Some Key Facts
264
BITS Pilani, Pilani Campus
More on Snapshots
• Snapshots are local replicas of data
– Exists on the same host or storage array as the source
volume
– Provide an image of a system or volume at given point in
time
• Snapshots can be taken at different levels
– Hypervisor based snapshots
• e.g. VMware snapshots (aka PIT snapshots)
– Host based Volume Manager snapshots
– Storage Array based snapshots
• Full clones
– An exact block for block copy of the source volume is created
• Space Efficient
265
BITS Pilani, Pilani Campus
Space Efficient Snapshot
Example
• Pointer based snapshots
Before any new writes:
Block #    0     1     2     3     4     5     6     7
Vol1       4653  1234  3456  3678  5433  1243  2343  6745
Vol1.snap  (pointers to the original Vol1 blocks)

After writes to blocks 0, 3 and 6:
Block #    0     1     2     3     4     5     6     7
Vol1       4670  1234  3456  3690  5433  1243  2348  6745
267
BITS Pilani, Pilani Campus
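A simplified, copy-on-write style sketch of the pointer-based snapshot above (invented class, same block values as the example): the snapshot holds only pointers until a block is overwritten, at which point the old contents are preserved.

class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snap = None            # block# -> preserved old data

    def take_snapshot(self):
        self.snap = {}              # no space consumed yet, only pointers

    def write(self, block_no, data):
        if self.snap is not None and block_no not in self.snap:
            self.snap[block_no] = self.blocks[block_no]   # copy on first overwrite
        self.blocks[block_no] = data

    def read_snapshot(self, block_no):
        return self.snap.get(block_no, self.blocks[block_no])

vol = Volume([4653, 1234, 3456, 3678, 5433, 1243, 2343, 6745])
vol.take_snapshot()
vol.write(0, 4670); vol.write(3, 3690); vol.write(6, 2348)
print(vol.read_snapshot(0), vol.blocks[0])   # 4653 4670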
Redirect-on-Write Snapshots
270
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• Replication Topologies
– Three site cascade
– Three site multi-target
– Three site triangle
272
BITS Pilani, Pilani Campus
Replication Topologies
• Three Site Cascade
– Source, target and bunker sites
– Source and target sites cannot communicate with each
other directly
• Data replication goes through the bunker site
– Example
• Source and bunker sites are within 100 miles of each other
– Configured with synchronous replication
• Target site is 1000 miles away
– Asynchronous replication is used
273
BITS Pilani, Pilani Campus
Three Site Cascade
Resiliency?
What happens if the source site is lost?
What happens if the bunker site is lost?
274
BITS Pilani, Pilani Campus
Three Site Multi-Target
[Figure: the source site replicates directly to each target site]
275
BITS Pilani, Pilani Campus
Three Site Triangle Topology
• Three site triangle topology
– A standby link is used when the source
site fails
– Improvement over the three-site multi-
target topology
[Figure: source, bunker and target sites connected in a triangle]
276
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
278
BITS Pilani, Pilani Campus
Storage Virtualization[1]
• Storage Virtualization
– An abstraction of storage achieved by
• Hiding some details of the functions of an aggregation
of storage devices and
• Presenting a uniform set of I/O services and an
interface for provisioning storage
279
BITS Pilani, Pilani Campus
Storage Virtualization[2]
281
BITS Pilani, Pilani Campus
Storage Virtualization Levels[2]
282
BITS Pilani, Pilani Campus
SNIA Shared Storage Model
• Block aggregation layer – performed at the host, in the network, or in the device
• Storage Devices layer
283
BITS Pilani, Pilani Campus
Storage Virtualization Types
• Host-based Virtualization
• Network-based Virtualization
• Storage-based/Controller based Virtualization
284
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
286
BITS Pilani, Pilani Campus
Host-based Virtualization
• Implemented via Logical Volume Manager
• Volume managers work with block storage
– e.g. DAS or SAN
• Virtualizes multiple block I/O devices into
logical volumes
– Takes devices from the lower layer and creates
LVs
– These LVs are made available to higher layers,
where file systems are written to them
– LVs can be sliced, striped, and concatenated
– Limited scalability: host-centric, so not
easily shared with other hosts
287
BITS Pilani, Pilani Campus
Network-based/ SAN based
Virtualization
• Not successful e.g. EMC Invista
• Storage virtualization at the network layer
requires intelligent network switches, or SAN
based appliances to perform functions like-
– Aggregation and virtualization of storage arrays
– Combining LUNs from heterogeneous arrays into a
single LUN
– Heterogeneous replication at the fabric level
288
BITS Pilani, Pilani Campus
Storage based Virtualization
• It is the predominant form of virtualization in the real world
• In-Band Virtualization (Symmetric)
– The controller sits between the host issuing the I/O and the storage
array being virtualized
• The virtualization function is transparent to the host
– Doesn't require any driver or software on the storage array
– I/O, commands and metadata all pass through the controller
289
BITS Pilani, Pilani Campus
Out-of band Virtualization
• It is asymmetric
• Only command and metadata pass through
the virtualization device (controller)
• Requires HBA drivers and agent software
deployed on the host to communicate with
the controller
– to get the accessibility information
290
BITS Pilani, Pilani Campus
Controller Based Virtualization
291
BITS Pilani, Pilani Campus
Controller Virtualization
Configuration
Array configuration
1. LUNs on the virtualized array are presented out to the WWPNs of the
virtualization controller
2. The controller claims these LUNs and uses them as storage (e.g. formed
into a pool)
3. Volumes are created from the pool
4. Volumes are presented to hosts as LUNs through front-end ports
Controller Configuration
1. Configure front-end ports into virtualization mode
2. This enables these ports to discover and use LUNs of Array being
virtualized
3. These ports usually emulate a standard Windows or Linux host
292
BITS Pilani, Pilani Campus
Benefits of Storage Virtualization
• Management
– Multiple technologies from multiple vendors can be virtualized and managed
through a single management interface
• Functionality
– e.g. snapshots and replication, of the higher tier array can be extended to the
capacity provided by the lower tier array
• Performance
– e.g. addition of drives behind the virtual LUN
• Availability
– e.g. RAID groups are used to create RAID protected LUNs, array based
replication and snapshot technologies also add to the protection provided by
the virtualization
• Technology Migrations/refresh
– Easy to migrate from one storage array to another storage array
• Cost
– We can virtualize lower performance storage array behind higher performance
storage array
293
BITS Pilani, Pilani Campus
Storage Virtualization and Auto-
tiering
• Tier-1 Array (Internal- in controller)
– High performance SSD Drives
– Highest in the performance with least latency
• Tier-2 (Internal- in controller)
– 15K SAS drives
• Tier-3 (External-In virtualized array)
– 7.2K NL-SAS
– Least in the performance with high latency
• Traditional Storage
– Hardware (physical drives) and software (intelligence
and functionality) are tightly coupled
– Vendor specific
295
BITS Pilani, Pilani Campus
SDS Architecture
[Figure: SDS stack with an orchestration layer, decoupled data services, and decoupled hardware]
296
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
299
BITS Pilani, Pilani Campus
Thick &Thin Volumes
• Let’s say you create a 1 TB thick volume from an
array that has 10 TB of free space.
– Thick Volume
• Thick volume will immediately consume 1 TB and reduce the
free capacity on the array to 9 TB, even without writing any
data to it.
• It is a real waste of space if you never bother writing any data
to that 1 TB thick volume.
– Thin Volumes
• Thin volumes consume little or sometimes no capacity when
initially created.
• They start consuming space only as data is written to them.
• Extent size is the unit of growth applied to a thin volume as
data is written to it.
– Large extent size vs. small extent size
300
BITS Pilani, Pilani Campus
Over Provisioning
• Over-provisioning allows us to pretend we have more capacity
than we really do
– It works on the principle that we over-allocate storage and
rarely use everything we asked for
[Figure: five 25 TB thin volumes (125 TB provisioned) carved from a 100 TB pool of storage, of which 60 TB is actually used]
302
BITS Pilani, Pilani Campus
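A minimal sketch of the capacity bookkeeping shown in the figure above (provisioned vs. physical vs. used, and the resulting over-subscription ratio); the variable names are invented:

pool_tb = 100
volumes_tb = [25, 25, 25, 25, 25]          # thin volumes presented to hosts
used_tb = 60                               # what hosts have actually written

provisioned_tb = sum(volumes_tb)
oversubscription = provisioned_tb / pool_tb
print(f"Provisioned {provisioned_tb} TB on a {pool_tb} TB pool "
      f"({oversubscription:.2f}:1), {used_tb} TB used, "
      f"{pool_tb - used_tb} TB of real headroom left")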
Problems with Thin Volumes
303
BITS Pilani, Pilani Campus
Solution: Space Reclamation
304
BITS Pilani, Pilani Campus
Traditional Space Reclamation
• Lossless compression
• Lossy compression
– e.g. JPEG image compression
• Most files and data streams repeat the same data
patterns over and over again.
– Compression works by re-encoding data so that the
resulting data uses fewer bits than the source data.
• Compression is popular in backup and archive
space
– Compression for primary storage (DAS, SAN, NAS) has
not been taken up widely. Why???
307
BITS Pilani, Pilani Campus
Array Based Compression[1]
• Inline
– Inline compression compresses data while in cache,
before it gets sent to disk.
– For high I/O, the acts of compressing and
decompressing can increase I/O latency
– It reduces the amount of data being written to the
backend
• Less capacity is required, both on the internal bus as well
as on the drives on the backend
308
BITS Pilani, Pilani Campus
Array Based Compression[2]
• Post-process
– Data has to be written to disk in its uncompressed
form, and then compressed at a later date
– It demands enough storage capacity to land the data
in its uncompressed form
• Decompression have to be done inline in real
time
– Performance concerns
• Additional latency for data read
• SSDs can get more advantage from compression
309
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• Deduplication
– Replacement of multiple copies of
identical data with reference to a single
shared copy
• Tiering
– Seek to place data on appropriate
storage mediums (based on frequency
of access to the data)
– Moves infrequently accessed data down
to cheaper (slower storage) and putting
more frequently accessed data on
more-expensive (faster storage)
311
BITS Pilani, Pilani Campus
Deduplication
312
BITS Pilani, Pilani Campus
Deduplication Types[1]
• Block level
– Unique blocks to be stored in the system
– What should be the appropriate block size? Smaller
or larger.
– Fixed Length
• A change in one block of the data set offsets all subsequent
blocks, so they no longer deduplicate
– Variable Length
• Variable-block approaches apply a dynamic, floating
boundary when segmenting a data set
• Segmenting based on repeating patterns in the data itself
313
BITS Pilani, Pilani Campus
Deduplication Types[2]
• File level
– If more than one exact copy of a file exist, all
duplicate copies are discarded
– Shouldn’t really be called deduplication?
• Files referred to as single-instance storage (SIS)
– Example
• Two files differing by a single character are treated as
different files!
314
BITS Pilani, Pilani Campus
Where to deduplicate?
• At Source
– Consuming host resources including CPU and RAM
– Can significantly reduce the amount of network
bandwidth consumed between the source and
target—store less, move less!
– Useful in backup scenarios
• Large amount of data to be moved from source to target
– It can only deduplicate against data that exists on
the same host
315
BITS Pilani, Pilani Campus
Where to deduplicate?
316
BITS Pilani, Pilani Campus
Where to deduplicate?
• Federated
– Distributed approach to the process of deduplicating
data
• Both, source and target perform deduplication
– Used in many backup solutions nowadays
• e.g. Symantec OpenStorage (OST) allows for tight
integration between backup software and backup target
devices
317
BITS Pilani, Pilani Campus
When to do deduplication?
• Inline Deduplication
– Searches for duplicates before storing data to disk
– Only needs enough storage to store deduplicated data
– Potential performance impact as data has to be deduplicated
in real-time
• Post-Process Deduplication
– Stores duplicate data to disk and then deduplicates later
– No performance impact
– Requires enough storage to initially store duplicated data
318
BITS Pilani, Pilani Campus
Backup Use Case [1]
321
BITS Pilani, Pilani Campus
Auto-Tiering
• Doesn't increase usable capacity, but optimizes the use of
resources
• Auto-Tiering
– Data sits on the right tier of storage
• Consider an example
– A volume was 200 GB and only 2 GB of it was frequently
accessed and needed moving to tier 1, but auto-tiering
caused the entire 200 GB to be moved
– This is wastage of expensive resources
• Solution
– Sub-LUN tiering does away with the need to move entire
volumes up and down through the tiers.
– It works by slicing and dicing volumes into multiple smaller
extents.
322
BITS Pilani, Pilani Campus
Impact of Auto-Tiering on
Remote Replication
• As a result of the read and write workload, the auto-tiering
algorithms of the array move extents up and down the tiers,
ensuring an optimal layout.
• However, the target array does not see the same workload
– All the target array sees is replicated I/O, and only write I/Os are
replicated from the source to the target.
328
BITS Pilani, Pilani Campus
Backup Architecture: Example
1. The backup server monitors the
backup schedule.
2. When the schedule indicates that a
particular client needs backing up,
the backup server instructs the
agent software on the backup
client to execute a backup.
3. The backup server sends the data
over the IP network to the media server.
4. The media server channels the
incoming backup data to the
backup target over the FC SAN that
it is connected to.
(Fig. source: Data Storage Networks by Nigel Poulton)
329
BITS Pilani, Pilani Campus
Backup Methods: Hot Backups
and Offline Backups
• Online backups (i.e. application is servicing users)
– Reduces administrative overhead and allow backup
environments to scale because they don’t require
administrative stops and restarts of applications
– Provides ability to perform integrity checks against
backups
• Offline Backups
– Require applications and databases to be taken offline
for the duration of the backup
– Rarely used nowadays
330
BITS Pilani, Pilani Campus
Backup Methods: LAN Based
Backups
• Cheap and convenient!
• Sending backup data over the LAN
– Often low performance with a risk of impacting other
traffic on the network
– Can be dedicated VLAN or dedicated physical
network or can be main production network
331
BITS Pilani, Pilani Campus
Backup Methods: LAN Free
Backups
• Data is passed from the backup client to the storage medium
over the SAN
• Block based backup and faster than file based backups
• Restore is complex
– Required to restore the entire backup to a temporary area and then
locate the required file and copy it back to its original location.
• Backup Types
– Full
– Incremental
– Differential
– Synthetic
– Application aware
336
BITS Pilani, Pilani Campus
Full Backups
338
BITS Pilani, Pilani Campus
Cumulative Incremental
Backups
• Only back up the data, that has changed since
the last full backup
– So, in the first place, you need a full backup, and
from there on, you can take cumulative
incremental backups
• Most cumulative incremental backup solutions
will back up an entire file even if only a single
byte of the file has changed since the last
backup
339
BITS Pilani, Pilani Campus
Differential Incremental
Backups
• Like cumulative incremental backups, differential
incremental backups work in conjunction with full
backups
341
BITS Pilani, Pilani Campus
Application Aware Backups
344
BITS Pilani, Pilani Campus
Backup to Tapes
348
BITS Pilani, Pilani Campus
Backup to the Cloud
1. All data from the last 28 days can be restored from the
previous night’s backup (RPO), and can be restored
within one working day (RTO).
2. All data between 28 and 365 days old can be restored
to the nearest Friday night backup of the data (RPO),
and can be restored within two working days (RTO).
3. All data between 1 year and 5 years old can be
restored to the nearest first Friday of the month
(RPO), and can be restored within two working days
(RTO).
351
BITS Pilani, Pilani Campus
Backup vs. Archiving
354
BITS Pilani, Pilani Campus
Capacity Management
356
BITS Pilani, Pilani Campus
Thin Provisioning
Considerations
• Over provisioning
– Allows us to pretend we have more storage than
we really do
– It involves RISK..?
• Physical Capacity
– Usable capacity of the array
• Provisioned Capacity
– How much you are pretending to have
• Used Capacity
– Amount of the capacity hosts have actually
written
• Each of these metrics must be known and
tracked individually for each array you have
359
BITS Pilani, Pilani Campus
Example: Trending
360
BITS Pilani, Pilani Campus
De-duplication and Compression
361
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• Performance Management
– Ensuring that your storage estate is high
performance
– Identifying performance bottlenecks
– Tuning performance
363
BITS Pilani, Pilani Campus
Performance Management
364
BITS Pilani, Pilani Campus
Base-lining
367
BITS Pilani, Pilani Campus
Performance Metric: IOPS and
MBps
• IOPS
– Concept of IOPS is vague.
– What exactly is an I/O?
– Are all I/Os equal?
• Read vs. Write, Random vs. Sequential, Small vs. Big
– Used for transactional applications
• MBps
– Number of Mega bytes transferred by a disk or
storage array
– Used for throughput driven applications
368
BITS Pilani, Pilani Campus
Factors Affecting Storage
Performance [1]
• RAID
– Good for data protection but with impact on
performance front
– Random small-block write workloads are slower
• e.g. RAID 5 and RAID 6
• Thin LUNs
– Follows allocate on demand model
– Performance Impacts
• Can add latency as system has to identify free extents and
allocate them to a volume, each time a write request
comes into a new area on a thin LUN
• Can result in fragmented backend layout
369
BITS Pilani, Pilani Campus
Factors Affecting Storage
Performance [2]
• Cache
– Caches improve performance of spinning disk
based storage
• Network hops
– Leads to network induced latency
– Higher in IP/Ethernet as compared to FC-SAN
• Multi-Pathing
– Prime motivation to provide high availability
– MPIO balances the I/O from a single host across
multiple array ports
370
BITS Pilani, Pilani Campus
Standard Performance Tools
371
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus
• Cloud Types
– Public, Private and Hybrid
373
BITS Pilani, Pilani Campus
Cloud Models
• Public Cloud
– Highest level of abstraction
• Clients don’t have visibility of the technologies holding
the service
• Multitenant-Underlying infrastructure is shared by
multiple customers
• e.g. EC2 (Elastic Compute Cloud), Windows Azure
374
BITS Pilani, Pilani Campus
Private Cloud
• Features
– Elastic Storage
– Massive Scalability
– Accessibility
(Fig. source: Data Storage Networking by Nigel Poulton)
– Fast self service provisioning
– Built-in protection
– No cap-ex cost
377
BITS Pilani, Pilani Campus
Drawbacks for Cloud Storage
379
BITS Pilani, Pilani Campus
Data Consistency
380
BITS Pilani, Pilani Campus
Public Cloud Storage
Performance
• Storage performance hierarchy
– Main memory > Local attached Hard Disk > SAN or NAS > Cloud
Storage
• Capacity relationship
– Closer to the CPU: less capacity, less sharable; farther from
the CPU: more capacity, more sharable
• Best performance use case for cloud storage
– Large objects like videos, photos
• Requirement: Good throughput rather than low latency
• Atomic Uploads
– Object is not available to access until the upload operation is 100%
complete
• CDNs (Geo-Replication)
– Principle: closer the data better the performance
381
BITS Pilani, Pilani Campus