
Data Storage Technologies

& Networks
Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
BITS Pilani, Pilani Campus
Topics

• Computer System Architecture


– Memory Hierarchy

[Figure: block diagram of a computer system. The processor and its registers connect to main memory over address, control, and data buses; a separate bus connects to I/O devices such as printers, modems, monitors, and secondary storage.]
2
BITS Pilani, Pilani Campus
Three Tier Architecture

• Computing
– Apps such as web servers, video conferencing,
database server, streaming etc.
• Networking
– Provides connectivity between computing nodes
– e.g. web service running on a computing node talks
to a database service running on another computer
• Storage (Persistent + Non-Persistent)
– Where all data resides
3
BITS Pilani, Pilani Campus
Memory Requirements

• Per-computation (non-persistent) data and permanent (persistent) data

• Separate memory/storage required for both


– Technology driven
• Volatile vs. Non-volatile
– Cost driven
• Faster and Costlier vs. Slower and Cheaper

4
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[1]
                        DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                        (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock Cycle             250ns            25ns              1ns            0.4ns

5
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[2]
                        DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                        (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock Cycle             250ns            25ns              1ns            0.4ns
Instructions per cycle  0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)

6
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[3]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Clock Cycle              250ns            25ns              1ns            0.4ns
Instructions per cycle   0.1              1                 2 (4-way)      8 (quad core, 2 threads/core)

Instructions per second = Cycles per second * Instructions per cycle

Instructions per second  4 x 10^5         40 x 10^6         2 x 10^9       2 x 10^10

7
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[4]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Instructions per second  4 x 10^5         40 x 10^6         2 x 10^9       2 x 10^10
Instruction Size         3.8B             4B                4B             4B
Operands in memory
per instruction          1.8 x 4B         0.3 x 4B          0.25 x 4B      0.25 x 4B

8
BITS Pilani, Pilani Campus
Memory Bandwidth
Requirement[5]
                         DEC VAX 11/780   Early pipelines   Superscalars   Hyperthreaded Multi-cores
                         (circa '80)      (circa '90)       (circa '00)    (circa '08)
Instructions per second  4 x 10^5         40 x 10^6         2 x 10^9       2 x 10^10
Instruction Size         3.8B             4B                4B             4B
Operands in memory
per instruction          1.8 x 4B         0.3 x 4B          0.25 x 4B      0.25 x 4B

BW Demand = Instructions per second * (Instruction size + Operand size)

BW Demand                4.4 MB/s         208 MB/s          10 GB/s        100 GB/s

9
BITS Pilani, Pilani Campus
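To make the arithmetic above concrete, here is a minimal Python sketch of the bandwidth-demand calculation (figures taken from the table; variable names are illustrative):

# BW demand = instructions per second * (instruction size + operand bytes per instruction)
machines = {
    "DEC VAX 11/780":      {"ips": 4e5,  "instr_size": 3.8, "operand_bytes": 1.8 * 4},
    "Early pipelines":     {"ips": 4e7,  "instr_size": 4.0, "operand_bytes": 0.3 * 4},
    "Superscalars":        {"ips": 2e9,  "instr_size": 4.0, "operand_bytes": 0.25 * 4},
    "Hyperthreaded cores": {"ips": 2e10, "instr_size": 4.0, "operand_bytes": 0.25 * 4},
}

for name, m in machines.items():
    demand = m["ips"] * (m["instr_size"] + m["operand_bytes"])   # bytes per second
    print(f"{name:20s} {demand / 1e6:10.1f} MB/s")   # 4.4, 208.0, 10000.0, 100000.0 MB/s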
Memory Hierarchy[1]

• How do we meet memory BW requirements?


– Observation
• A typical data item may be accessed more than once

– Locality of reference
• Memory references cluster either in a small region of
memory locations, or the same set of data is accessed
frequently

10
BITS Pilani, Pilani Campus
Memory Hierarchy[2]

• How do we meet memory bandwidth requirements?


• Multiple levels
– Early days – register set, primary, secondary and archival
– Present day- register set, L1 cache, L2 cache, DRAM,
direct attached storage, networked storage and archival
storage
• Motivation
– Amortization of cost
– As we move down the hierarchy cost decreases and speed
decreases.
11
BITS Pilani, Pilani Campus
Memory Hierarchy[3]

• Multi-Level Inclusion Principle


– All the data in level h is included in level h+1
• Reasons?
– Level h+1 is typically more persistent than level h.
– Level h+1 is order(s) of magnitude larger.
– When level h data has to be replaced (Why?)
• Only written data needs to be copied.
• Why is this good savings?

12
BITS Pilani, Pilani Campus
Memory Hierarchy:
Performance
• Exercise:
– Effective Access time for 2-level hierarchy

13
BITS Pilani, Pilani Campus
Memory Hierarchy: Memory
Efficiency
• Memory Efficiency
– M.E. = 100 x (T_h / T_eff)
– M.E. = 100 / (1 + P_miss x (R - 1)), where R = T_(h+1) / T_h
• Maximum memory efficiency
– R = 1 or Pmiss = 0
– Consider
• R = 10 (CPU/SRAM)
• R = 50 (CPU/DRAM)
• R = 100 (CPU/Disk)
• What will be the Pmiss for ME = 95% for each of these?
14
BITS Pilani, Pilani Campus
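The question on the previous slide can be answered by inverting the efficiency formula. A small Python sketch, assuming T_eff = (1 - P_miss) x T_h + P_miss x T_(h+1), which is the effective-access-time model the M.E. formula implies:

# M.E. = 100 / (1 + Pmiss * (R - 1))   =>   Pmiss = (100/ME - 1) / (R - 1)
def pmiss_for_efficiency(me_percent, R):
    """Largest miss probability that still achieves a given memory efficiency."""
    return (100.0 / me_percent - 1.0) / (R - 1.0)

for R in (10, 50, 100):
    print(f"R = {R:3d}: Pmiss <= {pmiss_for_efficiency(95, R):.5f}")
# R =  10 -> 0.00585, R = 50 -> 0.00107, R = 100 -> 0.00053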
Memory Technologies-
Computational
• Cache between CPU registers and main memory
– Static RAM (6 transistors per cell)
– Typical Access Time ~10ns
• Main Memory
– Dynamic RAM (1 transistor + 1 capacitor)
– Capacitive leakage results in loss of data
• Needs to be refreshed periodically – hence the term
“dynamic”
– Typical Access Time ~50ns
– Typical Refresh Cycle ~100ms.
15
BITS Pilani, Pilani Campus
Memory Technologies-
Persistent
• Hard Disks
– Used for persistent online storage
– Typical access time: 10 to 15ms
– Semi-random or semi-sequential access:
• Access in blocks – typically of 512 bytes.
– Cost per GB – Approx. Rs 5.50

• Flash Devices (Solid State Drive)


– Electrically Erasable Programmable ROM
– Used for persistent online storage
– Limit on Erases – currently 100,000 to 500,000
– Read Access Time: 50ns
– Write Access Time: 10 micro seconds
– Semi-random or semi-sequential write:
• Blocks – say 512 bits.
– Cost Per GByte – U.S. $5.00 (circa 2007)
16
BITS Pilani, Pilani Campus
Memory Technologies-Archival

• Magnetic Tapes
– Access Time – (Initial) 10 sec.; 60Mbps data transfer
– Density – up to 6.67 billion bits per square inch
– Data Access – Sequential
– Cost - Cheapest

17
BITS Pilani, Pilani Campus
Caching

• L1, L2, and L3 caches between CPU and RAM


– Transparent to OS and Apps.
– L1 is typically on (processor) chip
• R=1 to 2
• May be separate for data and instructions (Why?)
• L3 is typically on-board (i.e. processor board
or “motherboard”)
– R=5 to 10

18
BITS Pilani, Pilani Campus
Caching- Generic [1]

• Caching as a principle can be applied between any two


levels of memory
– e.g. Buffer Cache (part of RAM)
– transparent to App, maintained by OS, between main memory
and hard disk,
• R_(RAM,buffer) = 1
• e.g. Disk cache
– between RAM and hard disk
– typically part of disk controller
– typically semiconductor memory
– may be non-volatile ROM on high end disks to support power
breakdowns.
– transparent to OS and Apps
19
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Storage Devices (Secondary Storage)


– Hard Disk Drive or Hard Disk or Disk Drive

Image source: en.wikipedia.org


21
BITS Pilani, Pilani Campus
Disk Storage

• Electromechanical storage device, also called a disk
drive, hard drive, or hard disk drive (HDD)
• Originally meant for PCs and mainframes
– 14 in. diameter for mainframes in 60’s
– 3.5 in. diameter for PCs from 80’s
• Now available in various shapes:
– Mini disks (2.5 in. dia.) for laptops, gaming consoles and
as external pocket storage
– Micro disks (1.68 in. or 1.8 in. dia) for iPods / Cameras /
other handheld devices
• Note: The diameter is also referred to as the form factor
22
BITS Pilani, Pilani Campus
Disk Drive: Geometry

Major Components

Platters

Read/write heads

Actuator assembly

Spindle motor

Source: dataclinic.co.uk

23
BITS Pilani, Pilani Campus
Disk Drive: Geometry

• Disk substrate coated with magnetizable material


(iron oxide…rust)
• In the early days the substrate was made of aluminium
• Now glass is used
– Improved surface uniformity
• Increases reliability
– Reduction in surface defects
• Reduced read/write errors
– Lower flight heights (r/w heads fly at some height from disk surface)
– Better stiffness
– Better shock/damage resistance
24
BITS Pilani, Pilani Campus
Data Organization and
Formatting
• Concentric rings or tracks
– Gaps between tracks
– Reduce gap to increase capacity
– Same number of bits per track
(variable packing density)
– Constant angular velocity
• Tracks divided into sectors
• Minimum block size is one
sector
• May have more than one
sector per block
• Individual tracks and sectors
are addressable
25
BITS Pilani, Pilani Campus
Zoned Disk Drive

Each track in a zone has the same number of sectors, determined
by the circumference of the innermost track of that zone.

26
BITS Pilani, Pilani Campus
Disk Characteristics
• Fixed (rare) or movable head
– Fixed head
• One r/w head per track mounted on a fixed rigid arm
– Movable head
• One r/w head per side mounted on a movable arm
• Removable or fixed
• Single or double (usually) sided
• Single or multiple platter
– Heads are joined and aligned
– Aligned tracks on each platter form cylinders
– Data is striped by cylinder
• Reduces head movement
• Increases speed (transfer rate)
• Head mechanism
– Contact (Floppy)
– Fixed gap
– Flying (Winchester)
27
BITS Pilani, Pilani Campus
Capacity

• Capacity: maximum number of bits that can be stored.


• Vendors express capacity in units of gigabytes (GB),
where 1 GB = 10^9 bytes
• Capacity is determined by these technology factors:
– Recording density (bits/in): number of bits that can be
squeezed into a 1 inch segment of a track.
– Track density (tracks/in): number of tracks that can be
squeezed into a 1 inch radial segment.
– Areal density (bits/in^2): product of recording and track
density.
• Modern disk drives can have an areal density of more than 1 terabit per square inch
28
BITS Pilani, Pilani Campus
Computing Disk Capacity
• Capacity =(# bytes/sector) x
(# sectors/track) x
(# tracks/surface) x
(# surfaces/platter) x
(# platters/disk)
• Example:
– 512 bytes/sector
– 300 sectors/track
– 20,000 tracks/surface
– 2 surfaces/platter
– 5 platters/disk
– Capacity = 512 x 300 x 20,000 x 2 x 5 = 30.72 GB
29
BITS Pilani, Pilani Campus
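The same calculation in Python, as a sanity check of the example above:

# Capacity = bytes/sector x sectors/track x tracks/surface x surfaces/platter x platters/disk
bytes_per_sector     = 512
sectors_per_track    = 300
tracks_per_surface   = 20_000
surfaces_per_platter = 2
platters_per_disk    = 5

capacity = (bytes_per_sector * sectors_per_track * tracks_per_surface
            * surfaces_per_platter * platters_per_disk)
print(capacity / 1e9, "GB")   # 30.72 GB (vendor GB = 10^9 bytes)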
Disk Drive- Addressing
• Access is always in group of 1 or more contiguous sectors
– Starting sector address must be specified for access
• Addressing:
– Cylinder, Head, Sector (CHS) addressing
– Logical Block Addressing (LBA)
• Issues in LBA:
– Bad sectors (before shipping)
• Address Sliding / Slipping could be used – skip bad sectors for
numbering
– Bad Sectors (during operation)
• Mapping – maintain a map from logical number to physical CHS
address
• Remap when you have a bad sector – use reserve sectors
30
BITS Pilani, Pilani Campus
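A minimal Python sketch of the classical CHS <-> LBA translation. The geometry constants are assumptions for illustration only; real drives use zoned geometries, and the firmware (plus the bad-sector remapping described above) hides the physical layout behind LBA:

HEADS = 16               # heads per cylinder (assumed geometry)
SECTORS_PER_TRACK = 63   # sectors per track, numbered from 1 (assumed geometry)

def chs_to_lba(c, h, s):
    # Sectors are 1-based in CHS addressing, hence the (s - 1).
    return (c * HEADS + h) * SECTORS_PER_TRACK + (s - 1)

def lba_to_chs(lba):
    c, rem = divmod(lba, HEADS * SECTORS_PER_TRACK)
    h, s = divmod(rem, SECTORS_PER_TRACK)
    return c, h, s + 1

print(chs_to_lba(0, 0, 1))   # 0 -> first sector on the disk
print(lba_to_chs(1008))      # (1, 0, 1) with this geometry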
Disk Drive – Access Time
• Read and writing in sector-sized blocks
– Typically 512 bytes
• Access time (Seek time + rotational delay)
– Seek time (Average seek time typically 6 to 9 ms)
– Rotational latency: (Average latency = ½ rotation)
• Transfer time: T = b / (r x N)
• Total average access time
Ta = Ts + 1/(2r) + b/(r x N)
– Here Ts is Average seek time
– r is rotation speed in revolution per second
– b number of bytes to be transferred
– N number of bytes on a track
31
BITS Pilani, Pilani Campus
Disk Access Time Exercise
• Average seek time=4ms
• Rotation speed= 15,000 RPM
• 512 bytes per sector
• No. of sectors per track=500
• Want to read a file consisting of 2500 sectors.
• Calculate the time to read the entire file
– A) File is stored sequentially.
– B) File is stored randomly.

32
BITS Pilani, Pilani Campus
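One common way to work this exercise, sketched in Python. It assumes a single initial seek for the sequential case, negligible track-to-track seek, and a fresh seek plus average rotational delay for every sector in the random case (simplifying assumptions, not the only possible model):

seek_ms      = 4.0
rpm          = 15_000
sectors_trk  = 500
sectors_file = 2_500

rev_ms       = 60_000 / rpm      # one revolution = 4 ms
rot_delay_ms = rev_ms / 2        # average rotational latency = 2 ms
sector_ms    = rev_ms / sectors_trk

# A) Sequential: one seek, then per track an average rotational delay + a full-track read
tracks = sectors_file // sectors_trk
t_seq  = seek_ms + tracks * (rot_delay_ms + rev_ms)
print(f"sequential ~ {t_seq:.1f} ms")         # ~34 ms

# B) Random: every sector pays a seek + rotational delay + one-sector transfer
t_rand = sectors_file * (seek_ms + rot_delay_ms + sector_ms)
print(f"random     ~ {t_rand / 1000:.1f} s")  # ~15 s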
Disk Drive Performance
• Parameters to measure the performance
– Seek time
– Rotational Latency
– IOPS (Input/Output Operations per Second)
• Read
• Write
• Random
• Sequential
• Cache hit
• Cache miss
– MBps
• Number of megabytes per second a disk drive can sustain
• Useful to measure sequential workloads like media streaming
33
BITS Pilani, Pilani Campus
Disk Drive Performance: Eye
Opener Facts!
• Access time (Seek + Rotational ) rating
– Important to distinguish between sequential and
random access request set
• Usually vendors quote IOPS numbers to impress
– Important to note whether the IOPS numbers being
quoted are for cache hits or cache misses
• Real world workload is a mix of accesses with
– Read, Write, random, sequential, cache hit, cache
miss
34
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Storage Devices (Secondary Storage)


– Solid State Storage (SSDs and PCI expansion card)
• Flash Memory

Image Source: electronicdesign.com


36
BITS Pilani, Pilani Campus
SSD Fundamentals

– 2.5 in. and 3.5 in. form factors


– Supports SAS, SATA and FC interfaces and protocols
– Different types, such as Flash memory, Phase Change
Memory (PCM), and Ferroelectric RAM (FRAM)
– Semiconductor based hence no mechanical parts
– Predictable performance due to no positional
latency (i.e. Seek time and Rotational latency)

37
BITS Pilani, Pilani Campus
Flash Memory
• Semiconductor based persistent storage
• Two types
– NAND and NOR flash
• Anatomy of flash memory
– Cells  Pages  Blocks
– New flash device comes with all cells set to 1
– Cells can be programmed from 1 to 0
– To change the value of a cell back to 1, the entire
block must be erased
• Can be erased at block level only!
38
BITS Pilani, Pilani Campus
Read/Write/Programming on
Flash Memory
• Read operation is the fastest operation
• First time write is very fast
– Every cell in the block is preset to 1 and can be
individually programmed to 0
– If any part of a flash memory block has already been
written to, all subsequent writes to any part of that
block will require a process called read/erase/program
• It is 100 times slower than read operation
– Erasing is a 10 times slower process than read
operation
39
BITS Pilani, Pilani Campus
NAND vs. NOR

                   NAND     NOR
Cost per bit       Low      High
Capacity           High     Low
Read Speed         Medium   High*
Write Speed        High     Low
File Storage Use   Yes      No
Code Execution     Hard     Easy
Stand-by Power     Medium   Low
Erase Cycles       High     Low

*Individual cells (in NOR) are connected in parallel, which enables faster random reads

40
BITS Pilani, Pilani Campus
Anatomy of NAND Flash
• NAND Flash types
– Single level cell (SLC)
• A cell can store 1 bit of data
• Highest performance and longest life span (100,000 program/erase cycles
per cell)
– Multi level cell (MLC)
• Stores 2 bits of data per cell.
• P/E cycles = 10,000
– Enterprise MLC (eMLC)
• MLC with stronger error correction
• Heavily over-provisioned for high performance and reliability
– e.g. a 400 GB eMLC drive might actually have 800 GB of eMLC flash
– Triple level cell (TLC)
• Stores 3 bits per cell
• P/E cycles = 5,000 per cell
• High on capacity but low on performance and reliability
41
BITS Pilani, Pilani Campus
Enterprise Class SSD

• More over-provisioned capacity


– Provides Better performance and life-time
• More cache
– Any write to a block that already contains data
requires to copy the existing contents into the cache
– Helps to coalesce and combine writes
• More channels
– Allows concurrent I/O operations
• More comprehensive warranty
42
BITS Pilani, Pilani Campus
Hybrid Drives

• Having both rotating platter


and solid-state memory (i.e.
combination of HDD and SSD)
– Tradeoff between high capacity
and performance
• Hybrid storage technologies
– Dual drive
• Separate SSD and HDD devices are
installed in a computer
– SSHD drive
• Single drive having NAND flash
memory and HDD
43
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Disk reliability measures


• Improving Disk Reliability
– RAID Levels

Image source: msdn.microsoft.com


45
BITS Pilani, Pilani Campus
Disk Performance issues[1]

• Reliability
– Mean Time-Between-Failure (MTBF)
• e.g. a 1.2 TB SAS drive states an MTBF value of 2 million
hours
– Annual Failure Rate (AFR)
• To estimate the likelihood that a disk drive will fail during
a year of full use
• Individual Disk Reliability (as claimed in
manufacturer’s warranties) is often very high
– E.g. rated 30,000 hours vs. 100,000 hours observed in
practice for an IBM disk in the 80s
46
BITS Pilani, Pilani Campus
Disk Performance issues[2]

• Access Speed
– Access Speed of a pathway = Minimum speed among all
components in the path
– e.g. CPU and Memory Speeds vs. Disk Access Speeds

• Solution:
– Multiple Disks i.e. array of disks
– Issue: Reliability
• MTTF of an array = MTTF of a single disk / # disks in the array

47
BITS Pilani, Pilani Campus
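A quick sketch of this reliability arithmetic in Python, using the 2-million-hour MTBF quoted earlier as input (the array figure assumes no redundancy, i.e. any single disk failure loses data):

HOURS_PER_YEAR = 8760

def afr(mtbf_hours):
    """Approximate annual failure rate of one drive (valid when MTBF >> one year)."""
    return HOURS_PER_YEAR / mtbf_hours

def array_mttf(mtbf_hours, n_disks):
    """MTTF of a non-redundant array = MTTF of a single disk / number of disks."""
    return mtbf_hours / n_disks

print(f"single-drive AFR      : {afr(2_000_000) * 100:.2f} %")        # ~0.44 %
print(f"MTTF of 100-disk array: {array_mttf(2_000_000, 100):,.0f} h") # 20,000 h (~2.3 years)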
Disk Reliability

• Redundancy may be used to improve Reliability


– Device Level Reliability
• Improved by redundant disks
– This of course implies redundant data
– Data Level Reliability
• Improved by redundant data
– This of course implies additional disks
• (RAID) Redundant Array of Inexpensive Disks
– or Redundant Array of Independent Disks
• Different Levels / Modes of Redundancy
• Referred to as RAID levels
48
BITS Pilani, Pilani Campus
How to achieve reliability?

• Use a larger number of small disks!
– What should be the number of disks?
– How small should the disks be?
– How should they be structured and used?

49
BITS Pilani, Pilani Campus
Performance Improvement in
Secondary Storage
• In general, multiple components improve
performance
• Similarly, multiple disks should reduce access time
– Arrays of disks operate independently and in parallel
• Justification
– With multiple disks separate I/O requests can be
handled in parallel
– A single I/O request can be executed in parallel, if the
requested data is distributed across multiple disks
• Researchers at the University of California, Berkeley
proposed RAID (1988)
50
BITS Pilani, Pilani Campus
RAID

• Redundant Array of Inexpensive Disks


– Connect multiple disks together to
• Increase storage
• Reduce access time
• Increase data redundancy
• Provide fault tolerance
• Many different levels of RAID systems, with differing
levels of redundancy, error checking, capacity, and cost

51
BITS Pilani, Pilani Campus
RAID Fundamentals

• Striping
– Map data to different disks
– Advantage…?
• Mirroring
– Replicate data
– What are the implications…?
• Parity
– Loss recovery/Error correction / detection
52
BITS Pilani, Pilani Campus
RAID

• Characteristics
1. Set of physical disks viewed as single logical drive
by operating system
2. Data distributed across physical drives
3. Can use redundant capacity to store parity
information

53
BITS Pilani, Pilani Campus
Data Mapping in RAID 0

No redundancy or error correction


Data striped across all disks
Round Robin striping

54
BITS Pilani, Pilani Campus
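A minimal sketch of round-robin striping as used in RAID 0, assuming one block per strip (strip sizes vary in practice):

def raid0_map(logical_block, n_disks):
    """Round-robin striping: logical block i lands on disk i % n, at offset i // n."""
    return logical_block % n_disks, logical_block // n_disks

for lb in range(8):
    print(lb, "->", raid0_map(lb, 4))
# 0 -> (0, 0)   1 -> (1, 0)   2 -> (2, 0)   3 -> (3, 0)
# 4 -> (0, 1)   5 -> (1, 1)   6 -> (2, 1)   7 -> (3, 1)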
RAID 1

Mirrored Disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either and Write to both

55
BITS Pilani, Pilani Campus
Data Mapping in RAID 2

Bit interleaved data


Lots of redundancy
Use parallel access technique
Very small size strips
Expensive: useful mainly when disks are highly error-prone
56
BITS Pilani, Pilani Campus
Data Mapping in RAID 3
• Similar to RAID 2
• Only one redundant disk, no matter how large the array
• Simple parity bit for each set of corresponding bits
• Data on failed drive can be reconstructed from surviving data
and parity information
• Question:
• Can achieve very high transfer rates. How?

57
BITS Pilani, Pilani Campus
RAID 4
• Make use of independent access with block level striping
• Good for high I/O request rate due to large strips
• Bit by bit parity calculated across stripes on each disk
• Parity stored on parity disk
• Drawback???

58
BITS Pilani, Pilani Campus
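The parity used by RAID 3/4/5 is a byte-wise XOR across the data strips, which is also how a failed drive is reconstructed. A toy Python sketch (illustrative only, not a RAID implementation):

from functools import reduce

def parity(strips):
    """XOR of corresponding bytes across all strips (the content of the parity strip)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [b"AAAA", b"BBBB", b"CCCC"]   # strips on three data disks
p = parity(data)                     # strip stored on the parity disk

# Disk holding data[1] fails: XOR the survivors with the parity strip to rebuild it.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
print(rebuilt)                       # b'BBBB'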
RAID 5
• Round robin allocation for parity stripe
• It avoids RAID 4 bottleneck at parity disk
• Commonly used in network servers
• Drawback
– Disk failure has a medium impact on throughput
– Difficult to rebuild in the event of a disk failure (as
compared to RAID level 1)

59
BITS Pilani, Pilani Campus
RAID 6
• Two parity calculations
• Stored in separate blocks on different disks
• High data availability
– Three disks need to fail for data loss
– Significant write penalty
• Drawback
– Controller overhead to compute parity is very high

60
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(1+0)
• RAID 1 (mirror) arrays are built first,
then combined to form a RAID 0
(stripe) array.
• Provides high levels of:
– I/O performance
– Data redundancy
– Disk fault tolerance.

61
BITS Pilani, Pilani Campus
Nesting of RAID Levels:
RAID(0+1)
• RAID 0 (stripe) arrays are built first, then
combined to form a RAID 1 (mirror) array
• Provides high levels of I/O performance
and data redundancy
• Slightly less fault tolerance than a 1+0
– How…?

62
BITS Pilani, Pilani Campus
RAID Implementations

• Software implementations are provided by many


Operating Systems.
• A software layer sits above the disk device drivers
and provides an abstraction layer between the
logical drives(RAIDs) and physical drives.
• Server's processor is used to run the RAID
software.
• Used for simpler configurations like RAID 0 and
RAID 1.
63
BITS Pilani, Pilani Campus
Hardware Implementation

• A hardware implementation of RAID requires at
least a special-purpose RAID controller.
• On a desktop system this may be built into the
motherboard.
• The processor is not used for RAID calculations, as a
separate controller is present.

64
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Today’s Agenda

• File System (Unix FS as an Example)


– Files & File Descriptors
– File Entries
– Virtual File System and v-node
– Local File System and i-node

66
BITS Pilani, Pilani Campus
Files and File Systems:
Introduction
• A file system is a logical abstraction for
secondary (persistent) storage
– i.e. physical storage is partitioned into logical units
– Each logical unit has a file system – modeled as a
tree for navigation
– Nodes of the tree are directories (internal) and files
(terminal)
• In Unix directories are also files that store directory
listings

67
BITS Pilani, Pilani Campus
Files and File Systems: Data
Access Types
• Persistent data may be accessed:
– in large sequential chunks or
– in small random records

• File systems are typically designed to handle both of
these kinds of access

• Any standard OS includes built-in File System


– One or more file systems are supported
– e.g. ext, ext2, ext3, NTFS, FAT
68
BITS Pilani, Pilani Campus
Files in UNIX

• Files are logical storage units as well as logical I/O


streams in Unix
– i.e. access to each physical I/O device is supported
through a logical abstraction, which is a file
• A file is contained in a directory (which is also
supported as a file)
• Unix treats network connections and storage
devices through the same high level abstraction:
– Socket access (by applications) is very similar to I/O on
devices

69
BITS Pilani, Pilani Campus
File Descriptors in UNIX

• An application opening a file gets a “file descriptor”


object
– File descriptor is held by the user process
• Each process maintains a descriptor table for holding descriptors
(nonnegative integer) of open files.
– Typical operations on file descriptor:
• read, write, select, ioctl, close
– Each descriptor object in turn points to a “file entry” in a
shared table maintained by the kernel
• The file entry points to a data structure for maintaining specific
file instance information and state associated with an instance.
– This data structure is opaque to operations manipulating the file
entry.
– Underlying object instances cannot manipulate the file entry
70
BITS Pilani, Pilani Campus
Open Files in UNIX: File Entries
[1]
• read/write system calls do not take an offset as
an argument
– They update the current file offset which determines
the position for the next read/write
• lseek system call can be used to set the offset
• More than one process may open the same file
– i.e. each process needs its own offset
– i.e. each open system call allocates a new file entry
(in the kernel table) which stores the offset
71
BITS Pilani, Pilani Campus
Open Files in UNIX: File Entries
[2]
• Access Control Semantics is enforced at the file
descriptor level.
– There are flags in the descriptor recording access
permissions as “read”, “write”, or “read-write”

• There is also an “append” flag in the descriptors


– Allows file to be treated as a FIFO stream
• E.g. Multiple processes may be writing to the same file

72
BITS Pilani, Pilani Campus
Open Files in UNIX: File Entries
[3]
• Each “file entry” has a reference count
– Multiple descriptors may refer to the same file
entry:
• Single Process: dup system call
• Multiple Processes: child process after a fork inherits
file structures
– A read or write by either process (via the corresponding
descriptor) will advance the file offset
– This allows interleaved input/output from/to a file

73
BITS Pilani, Pilani Campus
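A small Python sketch (via the os module, which exposes the underlying Unix calls) showing that descriptors produced by dup share one file entry and therefore one offset. The file name is arbitrary:

import os

fd1 = os.open("demo.txt", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd1, b"hello world\n")

fd2 = os.dup(fd1)              # second descriptor -> same kernel file entry, same offset
os.lseek(fd1, 0, os.SEEK_SET)

print(os.read(fd1, 5))         # b'hello'  -- advances the shared offset
print(os.read(fd2, 6))         # b' world' -- continues where fd1 left off

os.close(fd1)
os.close(fd2)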
Virtual File Systems and V-nodes

• File descriptors and file entries keep track of per


process, per I/O (sequence) state.
• For each open file a virtual node (v-node) is also
available in memory.
– All file entries corresponding to the file, point to the v-
node and the v-node maintains a reference count.
– Buffers (for I/O on a file) are associated with the v-
node.
• Each v-node has a clean buffers list and a dirty buffers list
– Each v-node is also associated with a mount data
structure (that corresponds to a mounted file system)
74
BITS Pilani, Pilani Campus
Virtual File System and v-nodes
• A v-node is an in-memory abstraction for a file
– i.e. it acts as a proxy for a physical file store data structure
that abstracts away from the details of the file store.
• For instance, irrespective of whether a file is stored remotely
or locally, system call operations flow through a file entry to a
v-node
• A v-node is similar to an interface in an object-
oriented system where multiple implementations
of the interface are possible.
– The underlying implementation would correspond to
the physical file.
• For example, an i-node for a local file or nfs-node for a remote
file.

75
BITS Pilani, Pilani Campus
Unix File System

76
BITS Pilani, Pilani Campus
Local File System and i-node

• For a local file system, an index node (i-node) is an
in-memory copy of a file handle, i.e.
– Each file stored in a local file system (e.g. a disk) has i-node
associated with it.
– All i-nodes are stored (along with the file system) on disk.
– Copies of i-nodes for active (i.e. open) files are kept in
memory in a hash table (i-node table) hashed by i-node
number which is a unique identifier.
• Each i-node provides information for accessing a file
data from a local store (e.g. a disk).
– It also stores file meta-data.
77
BITS Pilani, Pilani Campus
Local File Store

• Each disk drive has one or more partitions


– Typical (1-1) mapping: file store <==> partition
• File store is responsible for
– Creation, storage, retrieval and removal of files
• New file creation – i-node allocation
• Flat name space
• Allocation of new blocks to files as files grow.
• Naming, access control, locking, attribute
manipulation are all handled by the file system
management layer above the file store
78
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• File Store Organization

• OS Support for I/O


– Device Drivers
– Interrupt handling
– Device Interface
– Buffering

80
BITS Pilani, Pilani Campus
Organization of File store
• Berkeley Fast File System (FFS) model:
– A file system is described by its superblock
• Located at the beginning of the file system
• Possibly replicated for redundancy
– A disk partition is divided into one or more cylinder
groups i.e. a set of consecutive cylinders:
• Each group maintains book-keeping info. including
– a redundant copy of superblock
– Space for i-nodes
– A bitmap of available blocks and
– Summary usage info.
• Cylinder groups are used to create clusters of
allocated blocks to improve locality.
81
BITS Pilani, Pilani Campus
Local File Store- Storage
Utilization
• Data layout – Performance requirement
– Large blocks of data should be transferable in a single disk operation
• So, logical block size must be large.
– But, typical profiles primarily use small files.
• Internal Fragmentation:
– Increases from 1.1% for a 512-byte logical block size, to 2.5% for 1KB,
5.4% for 2KB, and an unacceptable 61% for 16KB
• I-nodes also add to the space overhead:
– But overhead due to i-nodes is about 6% for logical block sizes of
512B, 1KB and 2KB, reduces to about 3% for 4KB, and to 0.8% for
16KB.
• One option to balance internal fragmentation against
improved I/O performance is
– to maintain large logical blocks made of smaller fragments
82
BITS Pilani, Pilani Campus
Local File Store – Layout [1]

• Global Policies:
– Use summary information to make decisions
regarding placement of i-nodes and disk blocks.
• Routines responsible for deciding placement of new
directories and files.
– Layout policies rely on locality to cluster information
for improved performance.
• E.g. Cylinder group clustering
• Local Allocation Routines.
– Use locally optimal schemes for data block layouts.
83
BITS Pilani, Pilani Campus
Local File Store – Layout [2]

• Local allocators use a multi-level allocation strategy:


1. Use the next available block that is rotationally
closest to the requested block on the same cylinder
2. If no blocks are available in the same cylinder use
a block within the same cylinder group
3. If the cylinder group is full, choose another group
by quadratic hashing.
4. Search exhaustively.

84
BITS Pilani, Pilani Campus
OS Support for I/O
• System calls form the interface between applications
and OS (kernel)
– File System and I/O system are responsible for
• Implementing system calls related to file management and handling
input/output.
• Device drivers form the interface between the OS
(kernel) and the hardware

85
BITS Pilani, Pilani Campus
I/O in UNIX - Example

• Application level operation


– E.g. printf call
• OS (kernel) level
– System call bwrite
• Device Driver level
– Strategy entry point – code for write operation
• Device level
– E.g. SCSI protocol command for write
86
BITS Pilani, Pilani Campus
I/O in UNIX - Devices

• I/O system is used for


– Network communication and virtual memory (Swap
space)
• Two types of devices
– Character devices
• Terminals, line printers, main memory
– Block devices
• Disks and Tapes
• Buffering done by kernel

87
BITS Pilani, Pilani Campus
I/O in UNIX – Device Drivers

• Device Driver Sections


– Auto-configuration and initialization Routines
• Probe and initialize the device
– Routines for servicing I/O requests
• Invoked because of system calls or VM requests
• Referred to as the “top-half” of the driver
– Interrupt service routines
• Invoked by interrupt from a device
– Can’t depend on per-process state
– Can’t block
• Referred to as the “bottom-half” of the driver
88
BITS Pilani, Pilani Campus
I/O Queuing and Interrupt
Handling
• Each device driver manages one or more queues
for I/O requests
– Shared among asynchronous routines – must be
synchronized.
– Multi-process synchronization also required.
• Interrupts are generated by devices
– Signal status change (or completion of operation)
– DD-ISR invoked through a glue routine that is
responsible for saving volatile registers.
– ISR removes request from queue, notifies requestor
that the command has been executed.
89
BITS Pilani, Pilani Campus
I/O in UNIX – Block Devices and I/O

• Disk sector sized read/write


– Conversion of random access to disk sector
read/write is known as block I/O

– Kernel buffering reduces latency for multiple reads


and writes.

90
BITS Pilani, Pilani Campus
Device Driver to Disk Interface
• Disk Interface:
– Disk access requires an address:
• device id, LBA OR
• device id, cylinder #, head#, sector #.
– Device drivers need to be aware of disk details for
address translation:
• i.e. converting a logical address (say file system level address
such as i-node number) to a physical address (i.e. CHS) on the
disk;
• they need to be aware of complete disk geometry if CHS
addressing is used.
– Early device drivers had hard-coded disk geometries.
• This results in reduced modularity –
– disks cannot be moved (data mapping is with the device driver);
– device driver upgrades would require shutdowns and data copies.
91
BITS Pilani, Pilani Campus
Disk Labels

• Disk Geometry and Data mapping is stored on


the disk itself.
– Known as disk label.
– Must be stored in a fixed location – usually 0th
block i.e. 0th sector on head 0, cylinder 0.
– Usually this is where bootstrap code is stored
• In which case disk information is part of the bootstrap
code, because booting may require disk access.

92
BITS Pilani, Pilani Campus
Buffering
• Buffer Cache
– Memory buffer for data being transferred to and from
disks
– Cache for recently used blocks
• 85% hit rate is typical
• Typical buffer size = 64KB virtual memory
• Buffer pool – hundreds of buffers
• Consistency issue
– Each disk block mapped to at most one buffer
– Buffers have dirty bits associated
– When a new buffer is allocated and if its disk blocks
overlap with that of an existing buffer, then the old buffer
must be purged.
93
BITS Pilani, Pilani Campus
Buffer Pool Management
• Buffer pool is maintained as a (separately chained) hash table
indexed by a buffer id
• The buffers in the pool are also in one of four lists:
– Locked list:
• buffers that are currently used for I/O and therefore locked and cannot be
released until operation is complete
– LRU list:
• A queue of buffers – a recently used item is added at the rear of the queue
and when a buffer is needed one at the front of the queue is replaced.
• Buffers staying in this queue long enough are migrated to an Aged list
– Aged List:
• Maintained as a list and any element may be used for replacement.
– Empty List
• When a new buffer is needed check in the following order:
– Empty List, Aged List, LRU list
94
BITS Pilani, Pilani Campus
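A much-simplified Python sketch of such a buffer pool: a hash map plus an LRU queue with dirty-buffer write-back on eviction (the locked and aged lists are omitted for brevity; names are illustrative):

from collections import OrderedDict

class BufferPool:
    """Toy buffer cache: block id -> (data, dirty), with LRU replacement."""
    def __init__(self, capacity, read_block, write_block):
        self.capacity = capacity
        self.read_block = read_block      # function: block_id -> bytes (disk read)
        self.write_block = write_block    # function: (block_id, bytes) -> None (disk write)
        self.buffers = OrderedDict()      # insertion order doubles as the LRU queue

    def _evict_if_full(self):
        if len(self.buffers) >= self.capacity:
            victim, (data, dirty) = self.buffers.popitem(last=False)  # LRU victim
            if dirty:
                self.write_block(victim, data)   # write back before reusing the buffer

    def get(self, block_id):
        if block_id in self.buffers:             # hit: move to the rear of the LRU queue
            self.buffers.move_to_end(block_id)
            return self.buffers[block_id][0]
        self._evict_if_full()                    # miss: make room, then read from disk
        data = self.read_block(block_id)
        self.buffers[block_id] = (data, False)
        return data

    def put(self, block_id, data):
        if block_id not in self.buffers:
            self._evict_if_full()
        self.buffers[block_id] = (data, True)    # dirty; flushed only on eviction
        self.buffers.move_to_end(block_id)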
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Flash File System


– Wear leveling

96
BITS Pilani, Pilani Campus
Flash Memory

• Flash chips are arranged into blocks which are typically 128KB
on NOR and 8KB on NAND flash

• All flash cells are preset to one. These cells can be individually
programmed to zero
– But resetting bits from zero to one cannot be done individually;
it can be done only by erasing a complete block

• The lifetime of a flash chip is measured in such erase cycles,


with the typical lifetime being 100,000 erases per block
– Erase count per block should be evenly distributed across all the
blocks for better life time of the flash chip
– Process is known as “wear leveling”

97
BITS Pilani, Pilani Campus
Traditional File System: Erase-
Modify-Write back
• Use 1:1 mapping from the emulated block
device to the flash chip
– read the whole erase block, modify the appropriate
part of the buffer, then erase and rewrite the entire
block
• No wear leveling…!
• Unsafe due to power loss between erase and write back
• Slightly better method
– by gathering writes to a single erase block and only
performing the erase/modify/write back procedure
when a write to a different erase block is requested.
98
BITS Pilani, Pilani Campus
Journaling File System

• How to provide wear leveling…?


– Sectors of the emulated block device are stored in
varying locations on the physical medium
– The file system needs to keep track of the current
location of each sector in the emulated block device
• Use of Translation Layer to keep track of current mapping
– It is a form of Journaling File System

99
BITS Pilani, Pilani Campus
JFFS Storage Format

• It is a log structured file system


• Nodes containing data and metadata are stored
on the flash chips sequentially, progressing
strictly linearly through the storage space
available.

100
BITS Pilani, Pilani Campus
Wear Leveling

101
BITS Pilani, Pilani Campus
Flash Memory: Operations

• Out-place updating is usually done to avoid erasing


operations on every write
– Latest copy of data is “live”
• Page meta-data
– “Live” vs “dead”
– “Free” pages are unused pages
• Cleaning
– Required when free pages are not available
– Erasure is done per block
• May require copying of live data
• Some blocks may be erased more than others
– “Worn” pages
102
BITS Pilani, Pilani Campus
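A toy flash-translation sketch of out-of-place updating: each write of a logical page goes to a free physical page, the old copy is marked dead, and per-block erase counts are tracked because wear leveling wants them kept even (sizes and names are illustrative; a real FTL also does garbage collection and free-block selection):

class ToyFTL:
    """Out-of-place updates on a flash of n_blocks blocks x pages_per_block pages."""
    def __init__(self, n_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.mapping = {}                                     # logical page -> physical page
        self.state = ["free"] * (n_blocks * pages_per_block)  # free / live / dead
        self.erase_count = [0] * n_blocks

    def write(self, logical_page):
        phys = self.state.index("free")        # naive allocator: first free page
        old = self.mapping.get(logical_page)
        if old is not None:
            self.state[old] = "dead"           # out-of-place: old copy becomes dead
        self.state[phys] = "live"
        self.mapping[logical_page] = phys

    def erase_block(self, block):
        start = block * self.pages_per_block
        for p in range(start, start + self.pages_per_block):
            assert self.state[p] != "live", "live pages must be copied out first"
            self.state[p] = "free"
        self.erase_count[block] += 1           # wear leveling tries to keep these even

ftl = ToyFTL(n_blocks=2, pages_per_block=4)
for _ in range(3):
    ftl.write(0)                               # rewrite logical page 0 three times
print(ftl.state)   # ['dead', 'dead', 'live', 'free', 'free', 'free', 'free', 'free']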
File System and Flash Access

• Although File Systems abstract out device details


– Most file system internals were designed for semi-
random access models:
• i.e. notions of blocks/sectors, block addresses, and buffers
tied to blocks are incorporated into many file systems
– E.g. Block devices in Unix file systems
• So, either file systems are redesigned for flash
• Or, device drivers handle flash memory access and
file systems use the same access models
– Require a Flash Translation Layer that emulates a block
device
103
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• I/O Techniques
– Polling
– Interrupt driven
– DMA

105
BITS Pilani, Pilani Campus
I/O Techniques

• Polling:
– The CPU checks the status of an I/O device by reading
a memory address associated with that device
– Pseudo-asynchronous
• Processor inspects (multiple) devices in rotation
– Cons
• Processor may still be forced to do useless work or wait or
both
– Pros
• The CPU can determine how often it needs to poll
106
BITS Pilani, Pilani Campus
I/O Techniques

• Interrupts:
– Processor initiates I/O by requesting an operation
with the device.
– May disconnect if response can’t be immediate,
which is usually the case
– When device is ready with a response it interrupts
the processor.
• Processor finishes I/O with the device.
– Asynchronous but
• Data transfer between I/O device and memory still
requires processor to execute instructions.
107
BITS Pilani, Pilani Campus
I/O Techniques: Interrupts

108
BITS Pilani, Pilani Campus
I/O Techniques

• Direct Memory Access


– Processor initiates I/O
– DMA controller acts as an intermediary:
• interacts with the device,
• transfers data to/from memory as appropriate, and
• interrupts processor to signal completion.
– From the processor’s perspective DMA controller is
yet another device
• But one that works at semiconductor speeds

109
BITS Pilani, Pilani Campus
I/O Techniques

• I/O Processor
– More sophisticated version of DMA controller
with the ability to execute code: execute I/O
routines, interact with the O/S etc

110
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• I/O Path From CPU to Storage


– BUSES
– PCI Bus as an example

112
BITS Pilani, Pilani Campus
What is a Bus ?

• A bus is
– a shared communication link between sub- systems
of a computer and
– an associated protocol for communication
• Note: A protocol is a set of rules - often specified formally.
• E.g.
– Single Shared bus
– Separate system bus
and I/O Bus

113
BITS Pilani, Pilani Campus
Traditional Bus Architecture
• Multi bus architecture
– To avoid contention
– Better Performance
– Device requirements are different

114
BITS Pilani, Pilani Campus
Bus Arbitration

• More than one module controlling the bus


– e.g. CPU and DMA controller
• Only one module may control bus at one time

• Arbitration may be-


– Centralised
• Single hardware device controlling bus access
– Distributed
– Each module may claim the bus
• Control logic on all modules
115
BITS Pilani, Pilani Campus
Timing

• Co-ordination of events on bus


• Synchronous
– Events determined by clock signals
– Control Bus includes clock line
– All devices can read clock line
– Usually a single cycle for an event
• Asynchronous
– Events are not synchronized with clock signals

116
BITS Pilani, Pilani Campus
Common Bus Protocols: PCI

• PCI Bus
– Created by INTEL in 1993
– Synchronous bus with 32 bits operating at a clock
rate of 33 MHz
– Transfer rate 132 MB/s
– PCI-X extended the bus to 64 bits and either 66
MHz or 133 MHz for data rates of 528 MB/s or
1064 MB/s

117
BITS Pilani, Pilani Campus
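The quoted transfer rates follow directly from bus width x clock; a one-line check in Python:

def bus_rate_mb_s(width_bits, clock_mhz):
    return (width_bits / 8) * clock_mhz   # MB/s, since bytes x 10^6 cycles/s

print(bus_rate_mb_s(32, 33))    # 132.0  -> PCI
print(bus_rate_mb_s(64, 66))    # 528.0  -> PCI-X at 66 MHz
print(bus_rate_mb_s(64, 133))   # 1064.0 -> PCI-X at 133 MHz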
PCI BUS Lines

• PCI Bus Lines (50) (required)


• Systems lines
– Including clock and reset
• Address & Data
– 32 time-multiplexed lines for address/data
– Interrupt & validate lines
• Interface Control
– Control the timings of transactions
• Arbitration
– Not shared
– Direct connection to PCI bus arbiter
• Error lines
– Used to report parity and other errors
118
BITS Pilani, Pilani Campus
PCI Operation Example: Read

119
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• SCSI Bus Protocol


– Bus components
– SCSI Devices
– SCSI commands
– SCSI r/w operation

121
BITS Pilani, Pilani Campus
SCSI: I/O Path from CPU to
Storage
• Stands for Small Computer System Interface
– Asynchronous, parallel bus
– Allows multiple bus masters
– Commonly used for high-end Storage devices
• Several standards
– Earliest Standard
• No. of wires (50)
• 8 data lines
• data transfer rate (5MB/s)
• wire length(25m)

122
BITS Pilani, Pilani Campus
SCSI Bus Standard

• Fast SCSI
– Doubled clock rate, data transfer up to 10MB/s
• Wide SCSI
– 16 data lines
– Fast Wide SCSI – 20 MB/s
• Ultra SCSI
– 8 data lines – data transfer rate 20MB/s
– Ultra Wide SCSI – 16 data lines – data transfer 40
MB/s
– Ultra320 - 16 data lines – data transfer 320 MB/s
• Serial Attached SCSI (SAS) and iSCSI (IP over SCSI)
123
BITS Pilani, Pilani Campus
SCSI – Components Model

• All devices are connected to a bus through a


SCSI controller
– e.g. a host adaptor on the processor side
– e.g. a disk controller for a hard disk device
• 2 types of devices
– Initiators or Targets
• Initiator has the ability to select a target and send
commands specifying operations
• Target has the ability to execute the operations based on
the commands received
• Data transfers are controlled by the targets
124
BITS Pilani, Pilani Campus
SCSI Devices

• Devices are identified through controllers


• Device Identification
– Identifiers: Narrow SCSI (8 devices)
• 0-7 in increasing order of priority
– Identifiers: Wide SCSI (16 devices)
• 8-15, 0-7 in increasing order of priority
– Identifier assignment is
• physical (through jumpers or switches), or
• programmatic (through BIOS firmware in adapter)
• Logical devices
– Each Drive/Disk is assigned a Logical Unit Number (LUN) for
addressing
– A Single physical device may be divided into logical devices
each with a LUN (Logical Unit Number)
125
BITS Pilani, Pilani Campus
SCSI Command Protocol

• Communication between initiator and target


– Command sent in a Command Descriptor Block
• 1 byte of opcode followed by
• 5+ bytes of command-specific parameters

– Target (eventually) sends a status code byte


• Success, or busy, or Check Condition (i.e. error)

126
BITS Pilani, Pilani Campus
SCSI Command Protocol

• Command Types
– Non-data, Write, Read, Bidirectional
– e.g.
• Test Unit Ready
• Inquiry
• Start/Stop Unit
• Request Sense (for error)
• Read Capacity
• Format Unit
• Read (4 variants)
• Write (4 variants)
127
BITS Pilani, Pilani Campus
Command Descriptor Block
(CDB)
• Opcode
• LUN (3 bits)
• e.g. for Read
– Read (6): 21 bits LBA, 1 byte transfer length
– Read(10): 32 bit LBA, 2 byte transfer length
– Read(12): 32 bit LBA, 4 byte transfer length
– Read Long for ECC-Compliant data

128
BITS Pilani, Pilani Campus
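For concreteness, a sketch of packing a 10-byte READ(10) CDB in Python: opcode 0x28, a flags byte, a big-endian 32-bit LBA, a group-number byte, a big-endian 16-bit transfer length, and a control byte. This is illustrative only, not a driver; the flags, group, and control fields are simply left zero here:

import struct

def read10_cdb(lba, n_blocks, flags=0, control=0):
    """Build a READ(10) command descriptor block."""
    return struct.pack(">BBIBHB", 0x28, flags, lba, 0x00, n_blocks, control)

cdb = read10_cdb(lba=0x12345678, n_blocks=8)
print(len(cdb), cdb.hex())   # 10 28001234567800000800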
SCSI – State Transition – Bus
Phases
• Bus Free Phase
– Bus is not being used by anyone
• Arbitration Phase
– One or more initiators request use of the bus and
• the one with the highest priority (SCSI ID order) is allowed to proceed
• Selection / Reselection phase
– Initiator asserts BUS BUSY signal and
• places target’s address on the bus thus establishing a connection/session
– Re-selection applies for a target resuming an interrupted operation:
• target asserts BUS BUSY signal and places initiator’s address on the bus.
• Message
– Exchange of control messages between initiator and target
• Command
– Initiator sends command to target
• Data
– Data exchange (as part of a read/write operation)
• Status
– Status of a command.
129
BITS Pilani, Pilani Campus
SCSI – State Transition – Bus
Phase Sequence [1]
• For a read operation:
1. When bus is free, initiator enters into arbitration (with other possible
initiators)
2. On arbitration, initiator selects the bus and places target address on
bus
3. Target acknowledges selection by a message
4. Initiator sends command
5. Target acknowledges command by a message
6. Target devices places (read) data on bus
7. Initiator acknowledges data by a message
8. Target sends status
• Assumption:
– Target holds the bus while it is reading – this is acceptable only for simple
devices and small read delays.
• Exercise: Modify the above sequence for a write operation.
130
BITS Pilani, Pilani Campus
SCSI – State Transition – Bus
Phase Sequence [2]
• For a read operation:
1. When bus is free, initiator enters into arbitration (with other possible
initiators)
2. On arbitration, initiator selects the bus and places target address on bus
3. Target acknowledges selection by a message
4. Initiator sends command
5. Target acknowledges command by a message
6. Target sends “Disconnect” message to initiator
7. Bus is released
8. When target is ready with (read) data, target reselects bus and places
initiator address on bus
9. Initiator acknowledges selection by a message
10. Target sends data.
11. Initiator acknowledges data by a message
12. Target sends status
• Question: Modify the above sequence for a write operation.
131
BITS Pilani, Pilani Campus
SCSI – State Transition – Bus
Phase Sequence [3]
• For a sequence of read-write operations:
– Commands can be chained – i.e. a sequence of I/O operations
initiated by a single initiator on the same target.
– In this case, one arbitration step and one selection phase are enough.

• For large data transfers:


– Target could “disconnect” multiple times i.e. data transfer may be
interrupted more than once.
• If the target decides to disconnect, target sends a “save data pointer”
message to the initiator before the “disconnect” message

• A corresponding “restore data pointer” message is required after bus has


been reselected (i.e. data transfer is about to resume).
• SCSI controllers permit at most two sequences (i.e.
connections/sessions) to be interleaved.
132
BITS Pilani, Pilani Campus
SCSI – Command Queuing

• SCSI enables target devices to accept multiple


requests before finishing any of them
– Target devices must maintain a queue of outstanding
requests
– Also referred to as Tagged Command Queuing
– In SATA this is referred to as Native Command
Queuing
• SCSI-3 permits queue lengths up to 64
– In practice, most SCSI controllers (based on traditional
parallel SCSI) support a queue length of 256
• ATA/SATA permits queue lengths up to 32.
133
BITS Pilani, Pilani Campus
SCSI – Command Queuing

• Command Queuing is useful


– For initiator to offload I/O operations instead of
sequencing them itself.

– For amortizing seek and rotation times over multiple


accesses thereby reducing the average access time:
• Controller must have support for ordering the accesses so
that seek time is minimized or rotation time is minimized
or both.

134
BITS Pilani, Pilani Campus
Error Correction -In channel

• SCSI supports read/write operations with error


correction code (ECC)
– ECC uses redundant bits for error correction
– Most modern disks have “out-of-band” ECC per disk
sector
• This is used internally by the controller to verify whether
any bit errors have occurred during storage.

135
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics
• Network Attached Storage (NAS)
– Storage Array Types
– What is NAS….?
– File Systems Functions
– File System for Network Storage vs. Traditional File
System
– NAS Systems/Architectures
• Gateway NAS and NAS Appliances

137
BITS Pilani, Pilani Campus
Storage Arrays Types
• SAN
– Provides connectivity via FC, FCoE, iSCSI, SCSI (SAS)
– Uses low level disk drive access commands like READ block,
WRITE block, and READ capacity
• NAS
– Provides connectivity over file based protocols like Network
File System (NFS), SMB/CIFS (Server Message Block/Common
Internet File System)
– Uses high level file based protocols
– Commands: create a file, rename a file, lock a byte range
within a file etc.
• Unified (SAN and NAS)
– Shared storage over both file and block protocols (aka
multiprotocol arrays)
138
BITS Pilani, Pilani Campus
Storage on the Network

• What does a Computer Network achieve?


– Communication
• Complexity management for computation
– Local computation vs. Non-local Computation
• Collaboration
• Persistent Shared Data
– Data is accessible to (or accessible through) multiple
computers and persistent across computations
– Storage is shared by multiple computers i.e. available
on a network
139
BITS Pilani, Pilani Campus
Network Attached Storage (NAS)

[Figure: a client issues file-level read/write requests to the NAS head, which performs block I/O to the disk drives in the NAS backend.]
• NAS drives examples:
– Windows
• Shares or network shares
• Supports mapping of external file systems on local machine as drive
name
– Linux
• Exports or NFS exports
• NFS exports are usually mounted to a mount point within the root
file system 140
BITS Pilani, Pilani Campus
Network Attached Storage

• Storage units are on the network


– Network is the same as the compute network (LAN)
– Data is accessed as files from file systems
• File Systems are supported by file servers (NAS servers)
• Novell’s file servers were one of the earliest networked file
servers.

• NAS systems come in different configurations:


– Server including FM and FS and direct attached
storage
– NAS head (only the FM) separate from the FS
141
BITS Pilani, Pilani Campus
NAS Architectures

• NAS Server • NAS Gateway

142
BITS Pilani, Pilani Campus
File Systems

• File systems, together with a volume manager, form an
intermediate layer between block-oriented hard disks
and applications
– Manage the blocks of the disk
– Make available to users and applications via
directories and files

143
BITS Pilani, Pilani Campus
File Systems Functions

• Journaling
– Ensure consistency of the file system even after a system
crash
– Every change in the file system is first recorded in a log
file
• Snapshots
– To freeze the state of the file system at a given point of
time (state of the data should be consistent)
– It loads the server's CPU and is hardware independent
• Dynamic File System Expansion
– Volume manager
144
BITS Pilani, Pilani Campus
NAS Systems

• File Servers
– Operating System Implementation
• Customized Operating Systems
– Typically thinned versions of Linux or Windows
• Most tasks are I/O bound (particularly file I/O)
– Simpler scheduling and task management
– No user management or user interaction required
• Restricted Memory allocation model (Mostly for buffering)
– No heap needed
– Limited stack size
• Tasks are (soft) real-time
– At the server level, I/O requests must have time-bounds to provide
performance guarantees
145
BITS Pilani, Pilani Campus
NAS Arrays: Scaling

• Multiple individual NAS arrays can be deployed


– e.g. per project one NAS array
– Drawback
• Management becomes tedious
• Single Global Namespace or single file system
(scale-out NAS system)
– Adding more servers with CPU + memory + storage
– One file system on the network accessed by all nodes
– Known as NAS clusters
146
BITS Pilani, Pilani Campus
Example

• Multiple Individual NAS Arrays

• NAS Cluster

147
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Network File System protocol (NFS)


– Design goals
– Interface
– NFS Transport
– Operation

149
BITS Pilani, Pilani Campus
What is NFS?

• A remote file system protocol


– Designed and implemented by Sun Microsystems
– Provides Unix-like file system interface and
semantics
• It is implemented as a Client-Server application
– Client part imports file systems (from other
computers)
– Server part exports (local) file systems
• i.e. makes it visible on the network and enables access
from other computers
150
BITS Pilani, Pilani Campus
NFS Design Goals

• Stateless protocol
– Increased robustness – easy recovery from failures
• Support for Unix filesystem semantics
– Also supports other semantics such as MS-DOS
• Protection and Access Control same as Unix
filesystem semantics
• Transport-independent protocol
– Originally implemented on top of UDP
– Easily moved to TCP later
– Has been implemented over many non-IP-based
protocols.
151
BITS Pilani, Pilani Campus
NFS Design Limitations

• Designed for clients and servers connected on a fast
local network
– Does not work well over slow links, nor between
clients and servers with intervening gateways

• Stateless protocol
– Session state is not maintained

152
BITS Pilani, Pilani Campus
NFS- Transport[1]

• Uses a Request – Response model


– Server receives RPC – Remote Procedure Call –
requests from clients
– Can be run on top of stream (e.g. TCP) or datagram
protocol (e.g. UDP)
• Each RPC message may need to be broken into
multiple packets to be sent across the network.
– May typically require up to 6 packets
– May cause problems in UDP – in case of failure,
entire message must be retransmitted
153
BITS Pilani, Pilani Campus
NFS – Transport[2]

• Remote Procedure Call (RPC)


– Client sees a procedure call interface (akin to a local
procedure call)
– But the call (along with the parameters) is marshalled
into a message and sent over the network
• Marshalling may involve serializing
• Server unmarshalls the message (i.e. separates it
into pieces) and processes it as a local operation.
• The result is similarly marshalled by the server and
sent to the client which in turn unmarshalls it.
154
BITS Pilani, Pilani Campus
NFS- Interface

• NFS RPC requests


– Idempotent Operations(i.e. NFS client may send the
same request one or more times without any harmful
side effects)
• GETATTR, SETATTR, LOOKUP, READLINK, READ, WRITE, CREATE,
SYMLINK, READDIR, STATFS
– Non-Idempotent Operations
• REMOVE, RENAME, LINK, MKDIR, RMDIR
• Idempotency is significant because of
– slow links and lost RPC acks.
• Each file on the server is identified by a globally
unique file handle
– created when a lookup is sent by client to server
155
BITS Pilani, Pilani Campus
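A toy illustration of why idempotency matters when RPC replies can be lost and requests retried: replaying a WRITE that carries an explicit offset leaves the file in the same state, while replaying a REMOVE fails the second time (hypothetical in-memory server, not the NFS wire protocol):

class ToyServer:
    def __init__(self):
        self.files = {"a.txt": bytearray(b"0123456789")}

    def write(self, name, offset, data):    # idempotent: same args -> same final state
        self.files[name][offset:offset + len(data)] = data

    def remove(self, name):                 # not idempotent: the second call errors
        del self.files[name]

s = ToyServer()
s.write("a.txt", 2, b"XY"); s.write("a.txt", 2, b"XY")   # a retried WRITE is harmless
print(s.files["a.txt"])                                  # bytearray(b'01XY456789')

s.remove("a.txt")
try:
    s.remove("a.txt")                                    # retried REMOVE after a lost ack
except KeyError:
    print("retry of REMOVE fails -> spurious error at the client")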
NFS- Operation[1]
• NFS protocol is stateless
– Each request is independent
• Server need not maintain information about clients and their
requests
• In practice, server caches recently accessed file
data.
– It improves performance
– and is useful for detecting retries of previously serviced
requests
• Transaction-Commit semantics implies high
overhead:
– Synchronous writes are required for write operations.
156
BITS Pilani, Pilani Campus
NFS- Operation[2]

• NFS is a client-server protocol


– No discovery or registry is used.
• i.e. client must know the server and the filesystem
– Server’s filesystem has to be exported (by
mounting).
• Refer to mount system call on Unix
• Each server has a globally unique id (guid)
• Client application need not distinguish
between local and remote files
– i.e. once a filesystem is mounted applications use
the same interface for both types of files
157
BITS Pilani, Pilani Campus
NFS - Operation[3]
• NFS is a client-server protocol
– File manager part of client file system remains the same
as that for the local file system
• Only the file store is local or remote
• Each remote file that is active manifests as an
nfsnode in the client’s filesystem
– Recall the active file data structures:
• Per-Process File Descriptor
– Kernel File Entry
– vnode
– file-store-specific data structure
» inode for local file store and
» nfsnode for remote filestore

158
BITS Pilani, Pilani Campus
NFS - Operation

• Basic Protocol (Mounting): Client C, Server S


– C (mount process) -----> S(portmap daemon)
• request port address of mountd
– S (portmap daemon) ----> C
• return the port address of its mountd
– C (mount process) ----> S(mountd daemon)
• request mounting of filesystem [pathname]
– S (mountd daemon):
• get file handle for desired mount point from kernel
– S (mountd daemon) ---> C
• return filehandle of mountpoint OR error
159
BITS Pilani, Pilani Campus
NFS - Operation

• Basic Protocol (I/O Operation): Client C, Server S


– C (app. process): write system call
• Data is copied into kernel buffer and call returns
• nfsiod daemon wakes up
– C (nfsiod daemon) -----> S(nfs_rcv)
• Request write [buffer contents]
– S delegates request to nfsd daemon
• nfsd daemon invokes write on disk and waits for I/O
– S (nfsd daemon) ---> C (nfsiod daemon)
• return ack for the write operation
160
BITS Pilani, Pilani Campus
NFSv3 vs NFSv4
• NFSv3
– One RPC per RTT
– Port mapper service required to determine which network
port to listen on or connect to
– No caching

• NFSv4 features
– Access control lists (similar to Windows ACLs)
– Compound RPCs
– Client side caching
– Operation over TCP port 2049
– Mandating Strong security (Kerberos v5 for cryptographic
services)
161
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• NFS performance Issues: Inconsistency,


caching

163
BITS Pilani, Pilani Campus
NFS Performance

• Every request by a client may be delayed by:


1. RPC Round-trip
2. I/O Wait at the server
– which may be serving multiple clients
• Server-side Caching addresses the second issue.
– Server may still be serving multiple clients
– But server’s I/O rate will be improved and so should
client’s wait time
• Client-side Caching addresses the first issue.
– But caching may introduce coherence issues:
• Clients may be accessing stale data

164
BITS Pilani, Pilani Campus
NFS - Performance -
Inconsistent Data
• Scenario:
First client writes data that is later read by second client.
• Two main ways for stale data to be read:
1. Second client has stale data in its cache and does not
know that modified data are available.
2. First client has modified data in its cache but has not
written those data back to the server.
• Synchronous writing solves the second problem.
– It also results in behavior that is close to the local filesystem.
– But clients are restricted to one write per RPC RTT.

165
BITS Pilani, Pilani Campus
NFS Performance Caching[1]
• Delayed writing model:
– Write request returns as soon as data are cached by the client
• Pros:
– The following things can now be bundled into a single request to the
server (i.e. the last one):
• multiple writes to the same blocks,
• file deletion or file truncation shortly after write(s)
• Cons:
– Client crash may result in loss of data
– Server must notify a client - holding a cached copy –
• that other client(s) want to read/write the file held by the first client.
• This introduces state in the implementation
– Error propagation to the client may be problematic:
• e.g. “Out of space” error
• e.g. client process exiting before error notification
166
BITS Pilani, Pilani Campus
NFS Performance Caching[2]
• Asynchronous writing model:
– As soon as data are cached by the client, write to the server is
initiated and then the write request returns.
• Variants:
– write on close (file)
• Delays are only deferred
– Read-sharing only (e.g. the Sprite file system, a Unix-like distributed file
system)
• Cache Verification model
– Client performs cache verification on access
• RPC RTT delays
• Callback model
– Server keeps track of cached copies and notifies them on update

167
BITS Pilani, Pilani Campus
NFS Performance Caching[3]

• Leasing model:
– Leases are issued for time intervals.
– As long as lease holds server will callback on update
– When lease expires client must verify its cache
contents and/or obtain a new lease.
• Requires much less server memory and reduces
traffic.
• Read-caching and write-caching may be given
separate leases.
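A toy sketch of the leasing idea: the client trusts its cache while the lease is valid and re-validates (or renews) once it expires. The class, helper names, and the fixed lease duration are illustrative assumptions, not a real NFS client.

import time

LEASE_SECONDS = 30          # illustrative lease duration

class CachedFile:
    def __init__(self, data):
        self.data = data
        self.lease_expiry = time.time() + LEASE_SECONDS

    def read(self, verify_with_server, renew_lease):
        if time.time() < self.lease_expiry:
            # Lease still valid: the server promises to call back on updates,
            # so the cached copy can be used without contacting the server
            return self.data
        # Lease expired: verify cache contents and/or obtain a new lease
        self.data = verify_with_server(self.data)
        self.lease_expiry = renew_lease()
        return self.data

cf = CachedFile(b"cached bytes")
print(cf.read(verify_with_server=lambda d: d,
              renew_lease=lambda: time.time() + LEASE_SECONDS))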
168
BITS Pilani, Pilani Campus
NFS Crash Recovery

• Caching schemes introduce state


– If system crashes, state must be recovered:
• E.g. leases
– If state depends on time (or intervals), recovery
time must be accounted for in leasing
• Clocks become critical.
• Server congestion may also lead to failure
– And recovery issues apply here as well.
• Timing Issues
– Clocks??
169
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• SMB/CIFS protocol

171
BITS Pilani, Pilani Campus
• CIFS is the de facto file-serving protocol in the
Microsoft Windows world
• It operates natively over TCP/IP networks on
port 445

172
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Storage Area Network (SAN)


– What is SAN?
– Why separate network for Storage?
– How it is different from NAS?

• Fibre Channel Protocol Stack


– FCP Layers

174
BITS Pilani, Pilani Campus
Storage Area Network
Architecture
Clients

LAN
Servers

SAN

Storage
175
BITS Pilani, Pilani Campus
Storage Area Network

• Storage units are on the network


– Network is (typically) different from the LAN
• e.g. Fibre-Channel SAN
– Data is accessed raw (in disk blocks) from storage units
• As opposed to file access in NAS
• Fibre-Channel SANs were the earliest:
– FC offers high Bandwidth
• Alternative SAN technologies available today:
– e.g. IP SAN
• SAN and NAS are converging:
– e.g. NAS head with a SAN backend.
176
BITS Pilani, Pilani Campus
Separate Network for Storage
• Reasons to use a separate network for Storage:
– Increased Throughput
• improves performance of I/O sensitive applications
– More flexibility
• for change in storage compared to the Direct Attached Storage model
– Reliability
• allows storage devices to be accessed by all servers on the SAN
– Higher Scalability
• in terms of greater numbers of servers and storage units to be
interconnected
– Data Mobility / Migration
• is easier
– Data / Storage Management
• is easier (e.g. backup is independent of LAN)
177
BITS Pilani, Pilani Campus
SAN Traffic

• Communication between Initiator (a computer and its


controller) and target (storage device) uses SCSI
– Recall SCSI commands:
• Types: Non-data, Write, Read, Bidirectional
– E.g. Test Unit Ready, Inquiry, Start/Stop Unit, Request Sense (for error), Read
Capacity, Format Unit, Read (5 variants), Write (5 variants)
• Command Descriptor:
– Opcode, LUN (3 bits)
– e.g. for Read
• Read (6): 21 bits LBA, 1 byte transfer length
• Read(10): 32 bit LBA, 2 byte transfer length
• Read(12): 32 bit LBA, 4 byte transfer length
• Read (16): 32 bit LBA, 4 byte transfer length (ECC-Compliant data)
• Read(16): 64 bit LBA, 4 byte transfer length

178
BITS Pilani, Pilani Campus
SAN Traffic

• So typical SAN Traffic is that of SCSI commands


– The transmission protocol may vary
• The earliest SANs used Fibre Channel for
transmission
– But iSCSI (Internet SCSI) enables the use of TCP/IP for
transmission
• Essentially, SCSI commands are carried as payloads
(i.e. data) over SAN protocols
– In the SCSI-3 architecture, the SCSI command protocol and
bus protocol are referred to as the SCSI Parallel Interface
(SPI)
179
BITS Pilani, Pilani Campus
NAS vs. SAN

• SAN does Block I/O


– NAS does file I/O (i.e. remote file system I/O)
• File system in SAN is managed by servers (aka
Hosts)
– File system in NAS is managed by NAS head
• SAN uses encapsulated SCSI
– NAS uses CIFS/SMB/HTTP over TCP/IP
• SAN uses Fibre Channel
– NAS uses TCP/IP Networks
180
BITS Pilani, Pilani Campus
FC SAN Components
• Components
– Hosts
• Client / Server computers
– Storage Devices
• Disks, Disk Arrays etc.
– Interfaces
• (Fibre-Channel) Ports for communication
– Hubs, Switches, and Gateways

181
BITS Pilani, Pilani Campus
FC-SAN High Level View

FC Switch A

Storage
Host with Array
2xHBAs
(Initiators) (target)

FC Switch B

182
BITS Pilani, Pilani Campus
FC Protocol Stack

FC-4 Upper Layer protocols (PDU): SCSI, IP


FC-3 Common Services (Multicasting,
Striping etc…)
FC-2 Framing, flow control, zoning
FC-1 Signaling, encode/decode
FC-0 Lasers, cables, connectors, data rates

183
BITS Pilani, Pilani Campus
SAN Ports: FC Switch Ports
• U_Port
– Un-configured and uninitialized
• N_Port
– Aka Node Port (to connect end devices)
• F_Port
– Switch ports that accept connections from N_Ports operate as fabric ports
• L_Port
– Node port used to connect a node to a Fibre Channel loop
• E_Port
– Expansion port to connect two SAN switches (allows merging)
• EX_Port
– It is an E_Port used for FC routing (prevents fabrics from merging)
• Port Speed:
– 2, 4, 8, 16 Gbps

184
BITS Pilani, Pilani Campus
FC-SAN Port Connectivity

FC Switch A
N_Port
F_Port F_Port

N_Port
Host with Storage
2xHBAs Array
(Initiators) (target)
N_Port
FC Switch B
N_Port
F_Port F_Port
E_Port

E_Port

FC Switch C
185
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Fibre Channel SAN: FC-SAN topologies


– Point to Point
– Arbitrated Loop
– Switched Fabric
• Hardware Components of FC-SAN

187
BITS Pilani, Pilani Campus
SAN Ports: FC Switch Ports
• U_Port
– Un-configured and uninitialized
• N_Port
– Aka Node Port (to connect end devices)
• F_Port
– Switch ports that accept connections from N_Ports operate as fabric ports
• L_Port
– Node port used to connect a node to a Fibre Channel loop
• E_Port
– Expansion port to connect two SAN switches (allows merging)
• EX_Port
– It is an E_Port used for FC routing (prevents fabrics from merging)
• Port Speed:
– 2, 4, 8, 16 Gbps

188
BITS Pilani, Pilani Campus
Example: FC-SAN Port
Connectivity

FC Switch A
N_Port
F_Port F_Port

N_Port
Host with Storage
2xHBAs Array
(Initiators) (target)
N_Port
FC Switch B
N_Port
F_Port F_Port
E_Port

E_Port

FC Switch C
189
BITS Pilani, Pilani Campus
Common SAN Topologies: SAN
Structure
• Point to Point
– Direct connection between HBA port - Storage array
port
– e.g. for 8-port storage array, you can have a
maximum of 8 directly attached servers talking to
that storage array
– Limitation
• No scalability

190
BITS Pilani, Pilani Campus
FC -SAN Structure[1]
• Structure – Arbitrated Loop (AL)
– Storage devices - through L-ports - are connected to an
(FC) AL hub
– Local hosts are also connected to the AL via I/O bus
adapters
– Hubs do not allow a high transfer rate (due to sharing)
but are cheap.

191
BITS Pilani, Pilani Campus
FC SAN Structure[2]

• An Arbitrated Loop can span several hubs –


referred to as cascading

192
BITS Pilani, Pilani Campus
FC SAN Structure[3]

• An Arbitrated Loop (AL) can be public or private


– A Private loop is closed on itself
– Public loop is connected to a fabric by a switch
• Although a public loop can be connected to more than one
switch, only one switch can be active at any time
– i.e. additional switch connections are for fault tolerance – i.e. for
fail-over only.
– This is realized by a hub connected through a FC-
switch to remote hosts
• Switches allow individual connections with high transfer
rates but are expensive.
193
BITS Pilani, Pilani Campus
Public Loops

• End Devices in a public loop can communicate


with end devices in the fabric only if they have
NL-ports
• A fabric is a collection of connected FC
switches that have a common set of services

194
BITS Pilani, Pilani Campus
Switched Fabric

• Inter-connected FC-Switches

195
BITS Pilani, Pilani Campus
Switched Fabric Topologies[1]

• Core Edge Topology


Edge

Core

Edge

196
BITS Pilani, Pilani Campus
Switched Fabric Topologies[2]

• Cascade Topology

• Ring Topology

197
BITS Pilani, Pilani Campus
Switched Fabric Topologies[3]

• Mesh Topology
– Every switch is connected to every other switch

198
BITS Pilani, Pilani Campus
Redundancy and Resiliency[1]

• Single Fabric Non-Resilient Design


– Each end-device is connected to one switch (and
to one fabric)

199
BITS Pilani, Pilani Campus
Redundancy and Resiliency[2]

• Single Fabric Resilient Design


– Each end-device is connected to one switch (and
to one fabric)

200
BITS Pilani, Pilani Campus
Redundancy and Resiliency[3]

• Redundancy and Resiliency:


– Redundant Fabric Non-Resilient Design

201
BITS Pilani, Pilani Campus
Redundancy and Resiliency

• Redundancy and Resiliency:


– Redundant Fabric Resilient Design

202
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• SAN Components
– Addressing
– Zoning
– Multi-pathing
– Trunking
– LUN Masking
• FC-SAN Performance Issues

204
BITS Pilani, Pilani Campus
SAN Addressing[1]

• World Wide Names (WWNs)


– unique World Wide Name per N-port (also
referred as WWPNs)
– Devices may have a WWN (independent of the
adapters/ports)
– Defined and maintained by IEEE
– 64-bit long
– 24-bit port addresses may be used locally to
reduce overhead.
205
BITS Pilani, Pilani Campus
SAN Addressing[2]
• 24-bit addressing - in a switched fabric
– Assigned by switch
– At login, each WWN is assigned (mapped) to a 24-bit
address by Simple Name Service (SNS)
• SNS is a component of the fabric OS – acts as a
registry/database
• Address format:
– Domain address (bits 23-16) identifies the switch
• Some addresses are reserved e.g. broadcast
• 239 possible addresses.
– Area address (bits 15-8) identifies a group of F-ports
– Port address (bits 7-0) identifies a specific N-port
• Total addressable ports: 239x256x256
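The 24-bit fabric address layout above can be made concrete with a small sketch that packs and unpacks the domain/area/port fields. This is purely an illustration of the bit layout, not part of any real FC stack, and the example values are made up.

def pack_fcid(domain, area, port):
    # Domain: bits 23-16, Area: bits 15-8, Port: bits 7-0
    assert 0 <= domain <= 0xFF and 0 <= area <= 0xFF and 0 <= port <= 0xFF
    return (domain << 16) | (area << 8) | port

def unpack_fcid(fcid):
    return (fcid >> 16) & 0xFF, (fcid >> 8) & 0xFF, fcid & 0xFF

fcid = pack_fcid(domain=0x10, area=0x02, port=0x1A)
print(hex(fcid))            # 0x10021a
print(unpack_fcid(fcid))    # (16, 2, 26)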
206
BITS Pilani, Pilani Campus
SAN-Addressing[3]
• 24-bit addressing - in an AL
– Obtained at loop initiation time and re-assigned at
login to the switch
– Address Format:
• Fabric loop address (bits 23-8) identifies the loop
– All 0s denotes a private loop i.e., not connected to any fabric
• Port address (bits 7-0) identifies a specific NL-port
– Only 126 addresses are usable (for NL-ports):
– 8B/10B encoding is used for signal balancing
– Out of the 256 bit patterns only 134 have neutral running
disparity – 7 are reserved for FC protocol usage; 1 for an FL-port
(so that the loop can be on the fabric)
207
BITS Pilani, Pilani Campus
SAN Routing
• Routing
– Analogous to switching in a LAN
– Goal:
• Keep a single path (between any two ports) alive – no redundant paths or
loops
• Additional paths are held in reserve – may be used in case of failures.
– Fabric Shortest Path First (FSPF) protocol –
• Cost: hop count
• Link state protocol
• Link state database (or topology database) kept in switches
• Updated/Initialized when switch is turned on or new ISL comes up or
an ISL fails
– Switches use additional logic when hop counts are the same.
• Round Robin is often used for load sharing
208
BITS Pilani, Pilani Campus
SAN Zoning

• Zoning Controls device visibility in a SAN fabric


– Without zoning, initiator will probe and discover all
devices on the SAN fabric
• Zoning allows fabric segmentation
– Storage (traffic) isolation
• e.g. Scenario: Windows systems claim all visible storage
• Similar to Virtual LANs
– Broadcast isolation: each VLAN is a separate broadcast domain
– Zoning can be done based on WWN and Port
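A simplified sketch of WWN-based zoning: two ports are allowed to see each other only if they share at least one zone in the active zone set. The zone names, WWNs, and data structures below are invented for illustration.

# Active zone set: zone name -> set of member WWNs (illustrative values)
active_zone_set = {
    "zone_db":  {"10:00:00:00:c9:aa:aa:01", "50:06:01:60:3b:bb:bb:01"},
    "zone_web": {"10:00:00:00:c9:aa:aa:02", "50:06:01:60:3b:bb:bb:01"},
}

def can_communicate(wwn_a, wwn_b):
    # Visibility requires a common zone in the active zone set
    return any(wwn_a in members and wwn_b in members
               for members in active_zone_set.values())

print(can_communicate("10:00:00:00:c9:aa:aa:01", "50:06:01:60:3b:bb:bb:01"))  # True
print(can_communicate("10:00:00:00:c9:aa:aa:01", "10:00:00:00:c9:aa:aa:02"))  # False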

209
BITS Pilani, Pilani Campus
Zoning: Example

210
BITS Pilani, Pilani Campus
SAN Zoning

• Hardware Zoning: (1-1, 1-*, *-*)


– Based on ports connected to fabric switches
(the switch's internal port numbering is used)
– A port may belong to multiple zones
– Advantage:
• Implemented into a routing engine by filtering – i.e. no
performance overhead
– Disadvantage:
• Device connections are tied to (physical) ports

211
BITS Pilani, Pilani Campus
SAN Zoning
• Software Zoning:
– Based on WWN – managed by the OS in the switch
– Number of members in a zone limited by memory
available
– A node may belong to more than one zone.
– More than one set of zones can be defined in a
switch but only one set is active at a time
• Zone sets can be changed without bringing switch down
– Less secure:
• SZ is implemented using SNS
– Device may connect directly to switch without going through SNS
• WWN spoofing
• WWN numbers can be probed
212
BITS Pilani, Pilani Campus
SAN- Frame Filtering

• Process of inspecting each frame (i.e. header


information) at hardware level for access control
• Usually implemented in an ASIC; the choice and
configuration of filters can be done at switch
initialization/boot time.
– Allows zoning to be implemented with access control
performed at wire speed
– Port level Zoning, WWN level Zoning, Device level
Zoning, LUN level Zoning, and Protocol level Zoning can
be implemented using Frame Filtering
213
BITS Pilani, Pilani Campus
SAN-Trunking

• Grouping of ISLs into a trunk i.e. a logical link


– aka Port Channel/ISL trunk
• Useful for load sharing in the presence of zoning
– i.e. zoning need not restrict ISL usage
• Supports in-order, end-to-end transmission of
frames
– Re-ordering done by the switch as required

214
BITS Pilani, Pilani Campus
SAN-Multipathing
• Provide multiple paths between a host and a device (LUN).
– Redundancy for improved reliability and/or higher bandwidth for
improved availability / performance
• Channel subsystem of the kernel in switch OS handles multi-
pathing at software level
– Usually a separate device driver is used, with the following capabilities:
• Enhanced Data Availability
• Automatic path failover and recovery to alternative path
• Dynamic Load balancing
• Path selection policies
• Failures handled:
– Device Bus adapters, External SCSI cables, fibre connection cable,
host interface adapters
• Additional software needed for ensuring that the host sees a
single device.
215
BITS Pilani, Pilani Campus
SAN- LUN Masking

• LUN Masking
– Which servers (HBAs) can see which LUNs
– Performed on the storage array using WWPN of
host’s HBA in FC-SAN
• Zoning vs. Masking
– Zoning takes place at SAN switches where as LUN
masking takes place on the storage array
– LUN masking provides more detailed security than
zoning. How?
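A minimal sketch of LUN masking on the array: each LUN keeps a list of host WWPNs allowed to see it, which is a finer-grained check than fabric zoning because it is enforced per LUN rather than per port pair. The table and WWPN values are illustrative.

# Masking table on the storage array: LUN id -> WWPNs allowed to see it
lun_masking = {
    0: {"10:00:00:00:c9:aa:aa:01"},
    1: {"10:00:00:00:c9:aa:aa:01", "10:00:00:00:c9:aa:aa:02"},
}

def visible_luns(host_wwpn):
    # The array only reports LUNs whose mask includes the host's WWPN
    return [lun for lun, allowed in lun_masking.items() if host_wwpn in allowed]

print(visible_luns("10:00:00:00:c9:aa:aa:02"))   # [1]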
216
BITS Pilani, Pilani Campus
FC-SAN Performance Issues

• Hop-Count and Latency


• Over-subscription
– Device Over-subscription: Number of Computers that
need to access a storage device
– ISL Over-subscription: Total bandwidth requirements of
all end-to-end connections that are likely over an ISL
• Over-subscription will result in
– Congestion – Delayed Deliveries
– Blocking – Failed Deliveries
• Fan-out, Fan-in:
– Ratio of server-ports to a single storage port (Fan-out)
and Ratio of storage-ports to a single server-port (Fan-in)
217
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• IP Storage standards
– iSCSI: “Storage resources to be shared over an IP
network”
• Connecting two FC-SANs using TCP/IP
– Tunneling (i.e. FCIP)
• Migration from FC-SAN to an IP-SAN
– internet-FCP

219
BITS Pilani, Pilani Campus
iSCSI SAN

• iSCSI allows storage resources to be shared


over an IP network
• iSCSI is a mapping of SCSI protocol over TCP/IP
– Similar to mapping of SCSI over Fibre Channel (FC)

[Figure: iSCSI initiators connected over an IP network to an iSCSI target]

220
BITS Pilani, Pilani Campus
FC-SAN vs. IP-SAN

Attribute: FC SAN vs. IP-SAN
– Protocol overhead: Low vs. High
– Distance limit: Yes vs. No
– H/W cost: High vs. Low
– Network administration tools availability: No vs. Yes
– Network latency: Low vs. High
– CPU use: Low vs. High
– Data access protocol: SCSI-III vs. NFS/CIFS
– Access type: Block level (file system is part of server) vs. File level (file system is part of storage)
– Interface (server connector): HBA vs. NIC

221
BITS Pilani, Pilani Campus
iSCSI SAN Components

• Initiators
– Issue Read/Write data requests to targets
– Request – response mechanism
– Initiators and targets are technically processes that
run on the respective devices
• Target (disk arrays or servers sharing storage
over iSCSI)
• IP Network
– Used for transportation of iSCSI PDUs
222
BITS Pilani, Pilani Campus
iSCSI Interfaces

• Standard Ethernet NIC with a Software Initiator


– Basic and cheapest interface
– Usually implemented as a kernel based device driver
• Ethernet NIC with a TCP Offload Engine and a Software
Initiator
– Offloads TCP stack processing from host to NIC
• iSCSI Host Bus Adapter
– Reduced impact on host CPU
– Allows boot from SAN
– Higher cost
• Converged Network Adapter (CNA)
– Also supports other protocols such as FCoE

223
BITS Pilani, Pilani Campus
iSCSI PDU and Encapsulation

• iSCSI operates at the session layer of the OSI model


– An application issues an I/O operation to the file system
(i.e. OS) which sends to SCSI
– SCSI creates the CDB and sends to iSCSI layer
• The CDB consists of a one byte operation code followed by
some command-specific parameters
– SCSI commands- READ BLOCK LIMITS, TEST UNIT READY, READ,
WRITE, SEEK etc
– CDBs and associated target LUN data are encapsulated
by iSCSI into PDUs at the session layer and passed
further down the layers until they hit the wire for
transport across the network
• There is nothing iSCSI specific about the
encapsulation once below the session layer
224
BITS Pilani, Pilani Campus
iSCSI Names
• Every device on iSCSI SAN needs a name
– iSCSI names are tied to iSCSI nodes, not to interfaces such as
HBAs, NICs etc.
– iSCSI names are permanent and globally unique
• Naming Conventions
– iSCSI qualified names (IQN)
• Up to 223 bytes in length and location independent
• Hierarchical naming
– Type, date, Naming authority, Customizable string
– e.g. iqn.2011-06.com.technicaldeepdive:server:lablinuxdb01
– Other naming conventions
• Extended Unique Identifier (EUI), Network Address Authority (NAA)
• Aliases can be used to give user-friendly names to initiators or
targets
– Can be used in management tools but not for discovery, login, and
authentication
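A small sketch that builds and loosely checks an IQN of the form iqn.<yyyy-mm>.<reversed-domain>:<custom-string>. The check is far simpler than the full standard, and the helper names are assumptions; the example string reuses the slide's sample name.

import re

def make_iqn(year_month, reversed_domain, custom):
    # e.g. iqn.2011-06.com.technicaldeepdive:server:lablinuxdb01
    return f"iqn.{year_month}.{reversed_domain}:{custom}"

def looks_like_iqn(name):
    # Loose structural check only: type, date, naming authority, custom string
    return re.match(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.\-]+:.+$", name) is not None

iqn = make_iqn("2011-06", "com.technicaldeepdive", "server:lablinuxdb01")
print(iqn, looks_like_iqn(iqn))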
225
BITS Pilani, Pilani Campus
Device Discovery
• Initiator to target communication requires
– Target IP address
– TCP port
– iSCSI name

• Device Discovery Methods


– Manual Device Discovery
• Good for small sized SANs
– Send targets
• Initiator issues the SEND TARGETS command to the network portal on the target.
The target responds with a list of available targets
• It assumes prior knowledge of the IP address and the TCP port on the target
• Common for small to medium sized SANs
– Automatic device discovery
• Service Location Protocol (SLP) agent runs on initiator and the target
• The initiator's SLP agent issues multicast messages to targets, and service agents on
targets respond
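A toy sketch of SendTargets-style discovery: the initiator already knows one network portal (IP address plus TCP port) and asks it for the list of target IQNs it serves. The portal address, names, and registry dictionary are made up; no real iSCSI PDUs are exchanged.

# Pretend registry held by the target's network portal (illustrative only)
portal_targets = {
    ("192.168.10.50", 3260): [
        "iqn.2011-06.com.technicaldeepdive:array1:lun-pool-a",
        "iqn.2011-06.com.technicaldeepdive:array1:lun-pool-b",
    ],
}

def send_targets(portal_ip, portal_port):
    # The initiator must know the portal in advance; the portal answers
    # with the iSCSI names it serves
    return portal_targets.get((portal_ip, portal_port), [])

print(send_targets("192.168.10.50", 3260))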
226
BITS Pilani, Pilani Campus
iSNS (Internet Storage Name
Service)
• It is a centralized database containing configuration
data about iSCSI SAN
– Good for managing large SAN array
• It allows initiators and targets to register and query
information. Provides the following services:
– Name service
– Partitioning of the iSCSI SAN into discovery domains
– Login Services
– State change notifications
– Mapping FC to iSCSI devices
• Similar to FC-SAN’s SNS service
227
BITS Pilani, Pilani Campus
Discovery Domains

• Used to partition the iSCSI SAN


– Devices can see and communicate only with other
devices in the same DD

• Discovery domains are a form of security by
obscurity
– DDs have nothing to physically stop a misbehaving
initiator from breaking the rules

228
BITS Pilani, Pilani Campus
iFCP: Mapping of FCP on TCP/IP

• Used to connect two IP-SANs


– It is a gateway protocol that connects Fibre
Channel devices via a TCP/IP network
– Does mapping of FC-FCP on TCP/IP

229
BITS Pilani, Pilani Campus
FCIP: Connects two FC-SAN by
TCP/IP
• FCIP is a tunneling protocol

230
BITS Pilani, Pilani Campus
SAN Protocol Taxonomy

[Figure: protocol stack comparison – applications and operating systems sit on the SCSI protocol; beneath it, FCP runs over FC-2/FC-3 fabric services (FC SAN), FCP runs over iFCP or FCIP on TCP/IP, and SCSI runs over iSCSI on TCP/IP]
231
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Fiber Channel vs. Ethernet


• Storage Traffic over Ethernet
– Fibre Channel over Ethernet (FCoE)
• Enhanced Ethernet
• FCoE Encapsulation
• Converged Storage Network

233
BITS Pilani, Pilani Campus
Data Center Network
Technologies
• Ethernet
– Lossy network due to congestion
– Packet or frame losses needs to be handled by higher layers
(e.g. TCP)
– Latency used to be an issue but is no longer (e.g. 10 Gbps,
40 Gbps, and 100 Gbps Ethernet)
– Shared media (e.g. Bus topology)
• Fibre Channel
– High speed and low latency network
– Exclusively used for storage traffic (e.g. SCSI traffic)
– Operates link layer signaling (communicates buffer credits
and status) to avoid packet losses
– Uses Point to Point topology
234
BITS Pilani, Pilani Campus
FC vs. Ethernet

• FC is less versatile and scalable as compared


to Ethernet
– FC is suited to a channel technology like SCSI rather than a
network technology
– Channel technologies focus on high-performance,
low-overhead interconnects
– Whereas network technologies (e.g. Ethernet)
need to be scalable and more versatile

235
BITS Pilani, Pilani Campus
Storage Network Requirements

• Reliable and efficient transportation of SCSI


commands between targets and initiators
• SCSI was not designed to deal well with delays,
congestion, or transmission errors
• FC is well suited to transporting SCSI traffic because
– Simple topologies (e.g. Arbitrated Loop)
– Good bandwidth
– Low latency
– Lossless
– Deterministic Performance

236
BITS Pilani, Pilani Campus
Enhanced Ethernet: Lossless
Ethernet
• IEEE task group responsible for it
– Data Center Bridging (DCB) Task Group (802.1)
• Also called Converged Enhanced Ethernet (CEE)
– Objectives
• To transport IP LAN traffic
• FC storage traffic
• Infiband high performance computing traffic (2.5 Gbps)
– Lossless Ethernet
• Enabling PAUSE function (IEEE 802.3)
– Stops all traffic on a port when a full-queue condition is reached
– PAUSE should be issued only for traffic requiring lossless behavior, not for all traffic
– Priority Flow Control (IEEE 802.1Qbb) can halt traffic according to the priority
Tag
– Administrators use the eight lanes defined in IEEE 802.1p to create virtual
lossless lanes for traffic classes like storage and lossy lanes for other classes
237
BITS Pilani, Pilani Campus
Lossless Ethernet Cont…

• Bandwidth sharing between lossless and lossy


traffic
– Enhanced Transmission Selection (IEEE 802.1Qaz) allows
to share the same link for both types of traffic
– Data Center Bridging Exchange Protocol (DCBX) is
responsible for the configuration of link parameters for
DCB functions
• Determines which devices support the enhanced functionalities
(creates a DCB cloud for FCoE traffic)
• It is a standard-based extension of the Link Layer Discovery
Protocol (LLDP)

238
BITS Pilani, Pilani Campus
FCoE Encapsulation

• Fibre Channel over Ethernet


– FC frames are encapsulated within layer 2 Ethernet
frames and transported over an IEEE 802.3 network
• FCoE Encapsulation
– FC frame length is 2148 Bytes
– Standard Ethernet frame size is 1518 Bytes
– How to fit FC frame in Ethernet frame?
• Fragment FC frames
– Results in latency and lower performance
• Use larger Ethernet frames
– e.g. Jumbo frames of size 9 KB
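The size mismatch can be made concrete with a small sketch: a full-sized FC frame does not fit in a standard Ethernet frame, so jumbo frames are used rather than fragmenting. The sizes come from the slide; the helper function and the rough comparison are illustrative only.

FC_MAX_FRAME = 2148          # bytes, maximum FC frame (from the slide)
STD_ETH_FRAME = 1518         # bytes, standard Ethernet frame
JUMBO_FRAME = 9000           # bytes, typical jumbo frame

def fits(fc_frame_len, eth_frame_len):
    # Rough check: can the whole FC frame ride inside one Ethernet frame?
    return fc_frame_len <= eth_frame_len

print(fits(FC_MAX_FRAME, STD_ETH_FRAME))   # False -> would require fragmentation
print(fits(FC_MAX_FRAME, JUMBO_FRAME))     # True  -> jumbo frames avoid it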

239
BITS Pilani, Pilani Campus
FCoE Encapsulation

SCSI FC FCoE

• FCoE encapsulation process leaves the FC frame


intact
– Zoning, WWPNs, and SSN are still applicable
• FC frames are not encapsulated within IP packets
• FC networks are large, flat, layer 2 and non-routable
networks
– It is in contrast to iSCSI …?
240
BITS Pilani, Pilani Campus
FCoE in Real World
• An FCoE switch connects to the LAN and the SAN
– Having CEE (Converged Enhanced Ethernet) and FC both ports
• FCoE deployment in industry
– As part of prepackaged converged computing network and storage
package
– At the access layer (first hop between host and access layer switches such
as Top of Rack)
[Figure: blade servers with CNAs connect via direct-attach copper to an FCoE switch, which has a fibre connection to the FC SAN]
241
BITS Pilani, Pilani Campus
Cabling Options

• For Gigabit Ethernet


– Copper (1000 Base-T with RJ-45 connectors)
• Dominating the market as compared to optical
– Optical (same as used in FC)
• For 10 Gigabit Ethernet
– Copper (Twinax)
• Based on SFF-8431 standard and SFP+ interface for a copper
connection
• Limited for short distances (e.g. server to TOR connections)
– Widely deployed
• Standard multimode optical cabling can be used for longer
distances (e.g. rack to core connections)
– Not much deployed

242
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• RDMA

244
BITS Pilani, Pilani Campus
Remote Direct memory Access
(RDMA) [1]
• Communication between applications
(traditional way)
– Incoming data is accepted by the NIC
– The OS kernel processes it and delivers it to the app
– Multiple levels of buffering
– Costs CPU power and loads the system bus
– Increases latency and reduces throughput

245
BITS Pilani, Pilani Campus
Remote Direct memory Access
(RDMA) [2]
• Virtual Interface Architecture
– Allows data exchange between apps and network
card by bypassing OS (i.e. no CPU intervention)
– How to do…?
• Two communicating apps set-up a connection aka Virtual
Interface
• A VI is a common memory area defined on both
computers
• Using this memory, app and network card on a machine
can exchange data

246
BITS Pilani, Pilani Campus
Remote Direct memory Access
(RDMA) [3]
• Sender application
– fills the memory area with data
– then announces this via the send queue of the VI hardware
– the VI hardware reads the data directly from the
common memory area and transmits it to the VI
hardware of the second computer
– the receiving side does not inform the app until all data is available
in the common memory area
– Point-to-point communication is allowed
– e.g. VI-capable FC-HBAs, NICs
– VI communication allows RDMA
247
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Replication Technologies
– Synchronous
– Asynchronous
• Where to perform replication…?
– Application/Database layer
– Host layer (Logical Volume Manager in Linux)
– Hypervisor layer
– Storage array layer

249
BITS Pilani, Pilani Campus
Replication: Synchronous[1]

• Synchronous
– All writes to a volume are committed to both the
source and the remote replica before the write is
considered complete
• Provides zero data loss i.e. RPO of zero
• Negative impact on performance

• Replication Distance and latency


– No. of routers that have to be traversed
– Distance between the source and target
• e.g. ping can be used to find end to end latency
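A back-of-the-envelope sketch of the performance impact: with synchronous replication every write waits for the local commit plus a round trip to the remote array, so write latency grows directly with link RTT, whereas asynchronous replication acknowledges after the local commit alone. The numbers are illustrative.

def sync_write_latency_ms(local_commit_ms, rtt_ms):
    # The host's write is acknowledged only after the remote copy commits,
    # so one full round trip is added to every write
    return local_commit_ms + rtt_ms

def async_write_latency_ms(local_commit_ms):
    # Asynchronous replication acknowledges as soon as the local commit finishes
    return local_commit_ms

print(sync_write_latency_ms(local_commit_ms=1.0, rtt_ms=5.0))   # 6.0 ms per write
print(async_write_latency_ms(local_commit_ms=1.0))              # 1.0 ms per write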
250
BITS Pilani, Pilani Campus
Replication: Synchronous[2]

• Replication link Considerations


– Requires robust and reliable network links
• Deployment of multiple diverse replication networks is
common practice to handle unreliable links
– Sizing network bandwidth for peak traffic including
bursts
• During high bursts additional latency will be added to
each write transaction without peak provisioning
– For shared IP networks (link is shared between
replication and other application traffic) policing or
bandwidth throttling should be used
251
BITS Pilani, Pilani Campus
Replication: Asynchronous

• No overhead to write transactions


– Transactions are signaled as complete as soon as they are
committed to the source volume
– Copying transaction to the target is performed
asynchronously
• Potential for data loss to occur
• Exact loss depends upon the technology involved
– A 15-minute lag is common among existing technologies
– The distance between source and replica can be far larger
than with synchronous replication
• RTT is not so important here
– Sizing network bandwidth for peak traffic is not required
252
BITS Pilani, Pilani Campus
Where to perform replication?

• Layers of storage stack where replication is


commonly performed
– Application/Database layer
– Host layer (Logical Volume Manager in Linux)
– Hypervisor layer
– Storage array layer

253
BITS Pilani, Pilani Campus
Application Layer Replication

• Replication load on application server


– Not a big deal given advancements in CPU technology
• Application aware replication
– Replication technology is tightly coupled with the
application
– Replicas are in an application consistent state
• Quick and effective recovery process
• e.g. Oracle data Guard, Native application technology
comes with Microsoft Application Server and Microsoft
Exchange Server
254
BITS Pilani, Pilani Campus
Replication: Oracle Data Guard

• DG replicas are transactionally consistent


• DG Configuration (One primary + one or more
standby)
– Production data base known as Primary
– Remote recovery copy is known as Standby
• DG’s Log Shipping Technology
Logs Logs DB
DB
Log Replay

Production
DR Site
Site
255
BITS Pilani, Pilani Campus
Data Guard: Standby databases
• Physical
– Exact physical copy of the primary database with identical
on-disk block layout
• Logical
– Data is the same as the primary, but on-disk structures and database
layout will be different
• Redo logs are converted into SQL statements that are then executed
against the standby database
• Logical standby DB is more flexible than physical standby
DB
– It can be used for more than just DR
• Oracle DB allows both switchover and failover
– Switchover = Manual transition from standby to primary
– Failover = Automatic transition when primary DB fails
256
BITS Pilani, Pilani Campus
Replication: MS Exchange
Server
• Active/Passive Architecture
– Allows more than one standby
copy of a database
– Also, utilizes log shipping
– Supports both switch-over and
fail-over
– Fundamental unit of replication is
the Data Availability Group (DAG)
(i.e. collection of exchange
servers)
– Recovery method
• At database level
257
BITS Pilani, Pilani Campus
Replication- Logical Volume
Manager Based
• Host based volume manager for remote
replication using LVM mirrors
– e.g. Linux LVM
• LVM is agnostic about the underlying storage
technology
– LVM can be considered as a thin software layer on
top of the hard disks and partitions
• Eases disk replacement, repartitioning, backup, and
migration
– Commonly used for storage migration
258
BITS Pilani, Pilani Campus
Replication-Hypervisor based

• Hypervisor based replication


– e.g. vSphere Replication from VMware using Site
Recovery Manager (SRM)
• Agnostic about the underlying storage technology
• IP based replication technology, works at the VM disk
(VMDK) level
• Provides replication at very granular level
• e.g. Imagine a VM with four virtual disks
– VMDK 1: OS partition, VMDK 2: Page file partition, VMDK 3:
Database partition, VMDK 4: Scratch partition
– Based on network b/w availability at VMDK level replication can
be configured
• Allows incremental backups to save b/w
259
BITS Pilani, Pilani Campus
Replication-Storage Array based

• Replication is the responsibility of the storage array, not the
applications
– Positive
• Applications are offloaded
– Negative
• The state of the application data in the remote replica volume will
not be ideal for the application to recover from
• Applications may need to do additional work
– Alternative
• Integrate array based replication technique with applications
– e.g. Synchronous replication API available with Microsoft Exchange
Server 2010 and 2013

260
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Asynchronous Replication Techniques


– Snapshot based
• Snapshot types
– Journal based

262
BITS Pilani, Pilani Campus
Snapshot Based Replication

• A storage snapshot is a set of reference


markers, or pointers, to data stored on a disk
drive or tape or on a storage network
• Precondition
– Initial copy of the source volume needs to be
replicated to the target system
• After that, snapshots of the source volume are sent to
the target system

263
BITS Pilani, Pilani Campus
Snapshot: Some Key Facts

• How frequent you should take snapshot?


– Decided by RPO committed to the client in SLA
– e.g. for an RPO of 10 minutes for a volume, you should
configure the replication schedule for that volume to take
a snapshot every 5 minutes
• Time is needed to transfer the snapshot and apply it on the target
volume
• What is ideal Snapshot extent size?
– e.g. For 1 MB extent size, we will have to replicate 1 MB of
data every time at least a single byte within that extent is
updated
• Question ???
– Is smaller extent size better or larger?

264
BITS Pilani, Pilani Campus
More on Snapshots
• Snapshots are local replicas of data
– Exists on the same host or storage array as the source
volume
– Provide an image of a system or volume at given point in
time
• Snapshots can be taken at different levels
– Hypervisor based snapshots
• e.g. VMware snapshots (aka PIT snapshots)
– Host based Volume Manager snapshots
– Storage Array based snapshots
• Full clones
– An exact block for block copy of the source volume is created
• Space Efficient
265
BITS Pilani, Pilani Campus
Space Efficient Snapshot
Example
• Pointer based snapshots

Before new writes:
Block #     0     1     2     3     4     5     6     7
Vol1        4653  1234  3456  3678  5433  1243  2343  6745
Vol1.snap   (pointers only – no data copied yet)

After writes to blocks 0, 3, and 6:
Block #     0     1     2     3     4     5     6     7
Vol1        4670  1234  3456  3690  5433  1243  2348  6745
Vol1.snap   4670              3690              2348


266
BITS Pilani, Pilani Campus
Copy-on-Write Snapshots

• Maintain the contiguous layout of the primary


volume
• The first write to any block on the source volume
(after snapshot has been taken) requires the
original contents of that block to be copied to the
snapshot volume for preservation
• Only after this write operation overwrite original
contents of the block on the source volume

267
BITS Pilani, Pilani Campus
Redirect-on-Write Snapshots

• Also called as “allocate-on-write” snapshots


• Original data in the source volume is not
changed or copied to the snapshot volume
• A new write comes into the source volume
– Data is frozen and becomes part of the snapshot
volume
– It is written to a new location in the array using a
redirect operation
– Metadata table is updated to reflect the new
location
268
BITS Pilani, Pilani Campus
Journal Based Replication
• In this approach new writes are buffered in either cache or
dedicated volumes before being asynchronously sent to the target
array
– Such volumes known as journal volumes or write intent logs
– The gap between the replication intervals can be configured
• Usually it is few seconds
• Journal based replication can apply write sequencing metadata
– Which helps to maintain the consistency of the remote replica
volumes

• In case of failure (i.e. system failed or journal buffer filled),


system will revert to maintaining a bitmap of changed blocks
– It keeps track of changed blocks in the source volume so that only
changes are sent to the remote array
– Write sequencing is not maintained
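A minimal sketch of the journal idea: writes are appended to a journal with a sequence number so the remote array can apply them in order, and if the journal overflows the sketch falls back to a bitmap-style set of changed blocks, losing write ordering. All structures, names, and sizes are illustrative.

class JournalReplicator:
    def __init__(self, max_journal_entries=4):
        self.journal = []               # (sequence, block, data) in write order
        self.changed_blocks = set()     # fallback tracking when the journal fails
        self.sequence = 0
        self.max_entries = max_journal_entries

    def write(self, block, data):
        self.sequence += 1
        if len(self.journal) < self.max_entries:
            self.journal.append((self.sequence, block, data))
        else:
            # Journal full: revert to tracking only which blocks changed
            self.changed_blocks.add(block)

    def drain(self):
        # Periodically ship journal entries, preserving write ordering
        batch, self.journal = self.journal, []
        return batch

r = JournalReplicator()
for i in range(6):
    r.write(block=i, data=f"d{i}")
print(r.drain(), r.changed_blocks)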
269
BITS Pilani, Pilani Campus
Thank You!

270
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Replication Topologies
– Three site cascade
– Three site multi-target
– Three site triangle

272
BITS Pilani, Pilani Campus
Replication Topologies
• Three Site Cascade
– Source, Target and Bunker site
– The source and target sites cannot communicate with each
other directly
• Data replication is through bunker site
– Example
• Source and bunker site are within 100 miles apart
– Configured with synchronous replication
• Target site is 1000 miles away
– Asynchronous replication is used

273
BITS Pilani, Pilani Campus
Three Site Cascade

Sync Replication Async Replication

Source Bunker Target

Resiliency?
What happens if the source site is lost?
What if the bunker site is lost?

274
BITS Pilani, Pilani Campus
Three Site Multi-Target

• Three site multi-target


– The source array simultaneously replicates
to two target arrays
– More resilient than three site cascade.
How…?
Bunker
• If the bunker or source site is lost, then…?

Source
Target

275
BITS Pilani, Pilani Campus
Three Site Triangle Topology
• Three site triangle topology
– Standby link is used when source is
failed
– Improvement over three site multi-
target topology
Bunker

Source
Target

276
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics
• Storage Virtualization
– Virtualization = Abstraction of physical devices to
logical/virtual devices
– Storage Virtualization is virtualization of virtualization
• Happens at every layer of the storage stack
• Storage Virtualization Levels
• SNIA Shared Storage Model

278
BITS Pilani, Pilani Campus
Storage Virtualization[1]

• Storage Virtualization
– An abstraction of storage achieved by
• Hiding some details of the functions of an aggregation
of storage devices and
• Presenting a uniform set of I/O services and an
interface for provisioning storage

279
BITS Pilani, Pilani Campus
Storage Virtualization[2]

• Storage Virtualization can be realized at


different levels of abstraction
– Device Level
• Physical devices are aggregated and presented as
different virtual devices (e.g. partitions, RAID controllers
etc.)
– File System Level
• Block storage devices are presented as file systems
• Question: In each case, what is hidden, what is
made visible?
280
BITS Pilani, Pilani Campus
Storage Virtualization Levels[1]
• Application speaks to filesystem or databases
• File systems/databases laid out over logical
volumes/partitions
• Multiple LUNs configured as a single logical volume
• Logical Units (LUN)
• Storage Network virtualized
• Volumes pooled together into storage pools
• Physical drives carved into RAID-protected volumes
• Virtualized (simplified) LBA view of physical disk drive
• Disk drives specifies (platters, tracks, sectors)

281
BITS Pilani, Pilani Campus
Storage Virtualization Levels[2]

• Due to multiple Layers of virtualization


– Applications/databases, the file system, or even the
operating system have no idea where on
disk their data is actually stored…!
• e.g. local file system address maps on a logical volume,
which maps to an address on a LUN, which maps to
pages in a pool, which map to blocks on a logical device
on a RAID group, which map to LBA addresses on
multiple physical drives, which finally map to sectors
and tracks on drive platters

282
BITS Pilani, Pilani Campus
SNIA Shared Storage Model

• Three layer architecture
– File/Record Layer
• Database (DBMS)
• Filesystem
– Block Aggregation Layer
• Host
• Network
• Device
– Storage Devices layer
[Figure: SNIA shared storage model – the application sits on the file/record layer (DBMS, file system), which sits on block aggregation (host, network, device), which sits on the storage devices]

283
BITS Pilani, Pilani Campus
Storage Virtualization Types

• Host-based Virtualization
• Network-based Virtualization
• Storage-based/Controller based Virtualization

284
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Storage Virtualization Types


– Host-based Virtualization
– Network-based Virtualization
– Storage-based/Controller based Virtualization
• In-band and Out-of-band virtualization

• Software Defined Storage (SDS)

286
BITS Pilani, Pilani Campus
Host-based Virtualization
• Implemented via Logical Volume Manager
• Volume managers work with block storage
– e.g. DAS or SAN
• Virtualizes multiple block I/O devices into
logical volumes
– Takes devices from the lower layer and creates
LVs
– These LVs are made available to higher layers,
where file systems are written to them
– LVs can be sliced, striped, and concatenated
– Limited in scalability because it is host centric – not
easily shared among other hosts
287
BITS Pilani, Pilani Campus
Network-based/ SAN based
Virtualization
• Not commercially successful (e.g. EMC Invista)
• Storage virtualization at the network layer
requires intelligent network switches, or SAN
based appliances to perform functions like-
– Aggregation and virtualization of storage arrays
– Combining LUNs from heterogeneous arrays into a
single LUN
– Heterogeneous replication at the fabric level

288
BITS Pilani, Pilani Campus
Storage based Virtualization
• It is predominant form of virtualization in the real world
• In Band Virtualization (Symmetric)
– Controller sits between the host issuing the I/O and the storage
array that is virtualized
• Virtualization function is transparent to the host
– Doesn’t require any driver or software on the storage array
– I/O, command and metadata pass through the controller

289
BITS Pilani, Pilani Campus
Out-of band Virtualization

• It is asymmetric
• Only command and metadata pass through
the virtualization device (controller)
• Requires HBA drivers and agent software
deployed on the host to communicate with
the controller
– to get the accessibility information

290
BITS Pilani, Pilani Campus
Controller Based Virtualization

• Controller based virtualization occurs


at the device layer
– Called as block based in-band
virtualization
– Controller has all of the intelligence and
functionality
– The array being virtualized just provides
dumb RAID protected capacity with
some cache
– The controller takes the write request
and send the ACK to host
– It writes to an array asynchronously
based on cache algorithm

291
BITS Pilani, Pilani Campus
Controller Virtualization
Configuration
Array configuration
1. LUNs on the virtualized array are presented out to the WWPNs of the
virtualization controller
2. The controller claims these LUNs and use them as storage (e.g. formed
into a pool)
3. Volumes are created from the pool
4. Volumes are presented to hosts as LUNs through front-end ports

Controller Configuration
1. Configure front-end ports into virtualization mode
2. This enables these ports to discover and use LUNs of Array being
virtualized
3. These ports usually emulate a standard Windows or Linux host

292
BITS Pilani, Pilani Campus
Benefits of Storage Virtualization
• Management
– Multiple technologies from multiple vendors can be virtualized and managed
through a single management interface
• Functionality
– e.g. snapshots and replication, of the higher tier array can be extended to the
capacity provided by the lower tier array
• Performance
– e.g. addition of drives behind the virtual LUN
• Availability
– e.g. RAID groups are used to create RAID protected LUNs, array based
replication and snapshot technologies also add to the protection provided by
the virtualization
• Technology Migrations/refresh
– Easy to migrate from one storage array to another storage array
• Cost
– We can virtualize lower performance storage array behind higher performance
storage array
293
BITS Pilani, Pilani Campus
Storage Virtualization and Auto-
tiering
• Tier-1 Array (Internal- in controller)
– High performance SSD Drives
– Highest in the performance with least latency
• Tier-2 (Internal- in controller)
– 15K SAS drives
• Tier-3 (External-In virtualized array)
– 7.2K NL-SAS
– Least in the performance with high latency

Figure source: Data Storage Networking, Sybex/Wiley
294


BITS Pilani, Pilani Campus
Software Defined Storage (SDS)

• Traditional Storage
– Hardware (physical drives) and software (intelligence
and functionality) are tightly coupled
– Vendor specific

• Software Defined Storage


– Use of VSA (Virtual Storage Appliance)-> Intelligent
storage controller implemented as pure software as VM
– Decouples the hardware and software (intelligence)
– Vendor agnostic
– Scalable and compatible with the cloud

295
BITS Pilani, Pilani Campus
SDS Architecture
Orchestration

Decoupled

Data Services

Decoupled

Hardware

296
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics
• Capacity Optimization
– Methods used to reduce the space requirements for
storing data (persistent data)
– Capacity optimization technologies are all about doing
more with less
• i.e. storing more data on less hardware
• Capacity Optimization Technologies
– Thin/Thick Provisioning
– Compression
– Deduplication
– Tiering (Levels)
298
BITS Pilani, Pilani Campus
Capacity Optimization
Technologies: Places to implement

• Capacity optimization technologies


implemented at just about every level of the
storage stack…..
– Source-based
– Appliance-based
– Gateways
– Target-based
– Inline
– Post process

299
BITS Pilani, Pilani Campus
Thick &Thin Volumes
• Let’s say you create a 1 TB thick volume from an
array that has 10 TB of free space.
– Thick Volume
• Thick volume will immediately consume 1 TB and reduce the
free capacity on the array to 9 TB, even without writing any
data to it.
• It is a real waste of space if you never bother writing any data
to that 1 TB thick volume.
– Thin Volumes
• Thin volumes consume little or sometimes no capacity when
initially created.
• They start consuming space only as data is written to them.
• Extent size is the unit of growth applied to a thin volume as
data is written to it.
– Large extent size vs. small extent size
300
BITS Pilani, Pilani Campus
Over Provisioning
• Over-provisioning allows us to pretend we have more capacity
than we really do
– It works on the principle that we over-allocate storage and
rarely use everything we asked for
[Figure: five 25 TB thin volumes (125 TB provisioned in total) carved from a 100 TB pool of storage, of which only 60 TB is actually used]

• Over-provisioning creates the danger that we could try


and use more storage than we actually have
301
BITS Pilani, Pilani Campus
Administrative tasks for Over-
provisioned Arrays
• Trending of storage capacity requirement
• Trigger Points for Purchase Order
– Based on Used capacity
• e.g. a trigger is invoked when used capacity reaches
90%
– Available unused capacity
– Percentage of available capacity
– Percentage over-provisioned
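A small sketch of the monitoring arithmetic for an over-provisioned pool: it reports the metrics listed above and raises a purchase trigger once used capacity crosses a threshold. The numbers reuse the 100 TB / 125 TB / 60 TB example; the 90% threshold and function name are illustrative.

def pool_report(physical_tb, provisioned_tb, used_tb, trigger_pct=90):
    used_pct = 100 * used_tb / physical_tb
    return {
        "available_unused_tb": physical_tb - used_tb,
        "pct_available": 100 * (physical_tb - used_tb) / physical_tb,
        "pct_over_provisioned": 100 * provisioned_tb / physical_tb,
        "raise_purchase_order": used_pct >= trigger_pct,
    }

print(pool_report(physical_tb=100, provisioned_tb=125, used_tb=60))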

302
BITS Pilani, Pilani Campus
Problems with Thin Volumes

• Most traditional filesystems don’t understand thin


provisioning and don’t inform storage arrays when they
delete data
– e.g. Assume 100 GB filesystem on a 100 GB thin provisioned
volume. Host writes 90 GB data to the filesystem. Array
allocates 90 GB of data to the volume. If the host then deletes
80 GB of data, the file system will show 10 GB of used data and 90
GB of available space. The array is unaware of this deletion and
still shows the volume as 90% used.

303
BITS Pilani, Pilani Campus
Solution: Space Reclamation

• Zero space reclamation


– Return the deleted capacity to the free pool on the
array, where it can be used by other volumes
• Space reclamation technology has required deleted data
to be zeroed out—overwritten by binary zeros.
• More-modern approaches use technologies (e.g. T10
UNMAP) to allow a host/filesystem to send signals to the
array telling it which extents can be reclaimed

304
BITS Pilani, Pilani Campus
Traditional Space Reclamation

• Instead of zeroing out the space, the filesystem


just marked the area as deleted, but left the
data there
• It’s of no use to a thin provisioning storage array.
– As far as the array knows, the data is still there!
– Special tools or scripts are required that write binary
zeros to the disk to reclaim it for use elsewhere
• Implication of Thin Provisioning
– It can also lead to fragmented layout on the
backend, that impact sequential performance
305
BITS Pilani, Pilani Campus
Compression[1]

• It is a capacity optimization technology that


enables you to store more data by using less
storage capacity.
– e.g. File based compression i.e. Winzip
• Allows you to transmit data over networks
quicker while consuming less bandwidth
• Types of compression
– Filesystem compression, storage array–based
compression, and backup compression
306
BITS Pilani, Pilani Campus
Compression [2]

• Lossless compression
• Lossy compression
– e.g. JPEG image compression
• Most files and data streams repeat the same data
patterns over and over again.
– Compression works by re-encoding data so that the
resulting data uses fewer bits than the source data.
• Compression is popular in backup and archive
space
– Compression for primary storage (DAS, SAN, NAS) has
not been taken up widely. Why???

307
BITS Pilani, Pilani Campus
Array Based Compression[1]

• Inline
– Inline compression compresses data while in cache,
before it gets sent to disk.
– For high I/O, the acts of compressing and
decompressing can increase I/O latency
– It reduces the amount of data being written to the
backend
• Less capacity is required, both on the internal bus as well
as on the drives on the backend
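A minimal sketch of inline lossless compression using Python's zlib: data is compressed before it is "written" to the backend and decompressed on read, trading CPU cycles and some latency for capacity. The in-memory "backend" dictionary is obviously just an illustration of the data path, not how an array stores blocks.

import zlib

backend = {}                       # pretend drive: block address -> stored bytes

def inline_write(address, data):
    # Compress while still "in cache", before sending to the backend
    backend[address] = zlib.compress(data)

def inline_read(address):
    # Decompression must happen inline, adding some read latency
    return zlib.decompress(backend[address])

data = b"the same data pattern " * 100
inline_write(0, data)
print(len(data), len(backend[0]))          # compressed copy is much smaller
assert inline_read(0) == data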

308
BITS Pilani, Pilani Campus
Array Based Compression[2]

• Post-process
– Data has to be written to disk in its uncompressed
form, and then compressed at a later date
– It demands enough storage capacity to land the data
in its uncompressed form
• Decompression have to be done inline in real
time
– Performance concerns
• Additional latency for data read
• SSDs can get more advantage from compression
309
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Deduplication
– Replacement of multiple copies of
identical data with reference to a single
shared copy

• Tiering
– Seek to place data on appropriate
storage mediums (based on frequency
of access to the data)
– Moves infrequently accessed data down
to cheaper (slower storage) and putting
more frequently accessed data on
more-expensive (faster storage)
311
BITS Pilani, Pilani Campus
Deduplication

• Duplicate data stored only once


– Any block in the data set that already exists in our pool of
existing data is stripped out of the new data set and replaced
with a pointer
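A toy sketch of fixed-length block deduplication: each block is fingerprinted (SHA-256 here) and stored only if that fingerprint has not been seen before; duplicates are replaced by a reference to the existing block. Structures and the block size are illustrative, and a real system must also handle hash collisions and persist its metadata.

import hashlib

BLOCK_SIZE = 4096
store = {}                 # fingerprint -> unique block contents

def dedupe(data):
    # Return the list of fingerprints ("pointers") that describe the data
    pointers = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            store[fp] = block          # first copy is kept
        pointers.append(fp)            # duplicates become pointers only
    return pointers

data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
ptrs = dedupe(data)
print(len(ptrs), len(store))           # 4 pointers, but only 2 unique blocks stored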

312
BITS Pilani, Pilani Campus
Deduplication Types[1]

• Block level
– Unique blocks to be stored in the system
– What should be the appropriate block size? Smaller
or larger.
– Fixed Length
• A change in one block of the data set causes all subsequent blocks
to become offset and no longer deduplicate
– Variable Length
• Variable-block approaches apply a dynamic, floating
boundary when segmenting a data set
• Segmenting based on repeating patterns in the data itself
313
BITS Pilani, Pilani Campus
Deduplication Types[2]

• File level
– If more than one exact copy of a file exists, all
duplicate copies are discarded
– Shouldn’t really be called deduplication?
• Commonly referred to as single-instance storage (SIS)
– Example
• Two files with a one-character difference will be treated
as different!

314
BITS Pilani, Pilani Campus
Where to deduplicate?

• At Source
– Consumes host resources, including CPU and RAM
– Can significantly reduce the amount of network
bandwidth consumed between the source and
target—store less, move less!
– Useful in backup scenarios
• Large amount of data to be moved from source to target
– It can only deduplicate against data that exists on
the same host

315
BITS Pilani, Pilani Campus
Where to deduplicate?

• At Target (e.g. deduplicating backup appliance)


– Referred to as hardware-based deduplication
– These appliances tend to be purpose-built
appliances with their own CPU, RAM, and
persistent storage (disk).
– Increases network bandwidth consumption between
source and target compared to source-based deduplication

316
BITS Pilani, Pilani Campus
Where to deduplicate?

• Federated
– Distributed approach to the process of deduplicating
data
• Both, source and target perform deduplication
– Used in many backup solutions nowadays
• e.g. Symantec OpenStorage (OST) allows for tight
integration between backup software and backup target
devices

317
BITS Pilani, Pilani Campus
When to do deduplication?

• Inline Deduplication
– Searches for duplicates before storing data to disk
– Only needs enough storage to store deduplicated data
– Potential performance impact as data has to be deduplicated
in real-time

• Post-Process Deduplication
– Stores data to disk in full and then deduplicates it later
– No performance impact at ingest time (deduplication happens later)
– Requires enough storage to initially store the duplicated data
318
BITS Pilani, Pilani Campus
Backup Use Case [1]

• Deduplication is a natural fit for the types of
data and repeated patterns of data that most
backup schedules comprise
backup schedules comprise
– During a single month, most people back up the
same files and data time and time again
– Deduplicating backup-related data also enables
longer retention of that data because you get more
usable capacity out of your storage
• The longer you keep the data, the better the
deduplication ratio!
319
BITS Pilani, Pilani Campus
Backup Use Case [2]
• Deduplication technologies tend to fragment file layouts.
– Duplicated data is referenced by pointers that point to other
locations on the backend
• This fragmentation of the data layout has a direct impact on
backup-related technologies such as synthetic full
backups
– Synthetic full backup = full backup + incremental backups
taken after the previous full backup
– Effectively creates a new full backup without transferring all
data from source to target
– Random reads (full backup + incremental backups) and
writes (new full backup) of fragmented data can impact
deduplication appliance performance
320
BITS Pilani, Pilani Campus
Virtualization Use Case

• Consider server and desktop virtualization
• These environments tend to get good
deduplication ratios
– e.g. take a 2 TB Linux server image and deploy 100
servers from that image
– Lots of identical files and blocks of data
– Deduplication can also improve performance when combined
with a large read cache (shared blocks are cached once and
served to many virtual machines)

321
BITS Pilani, Pilani Campus
Auto-Tiering
• Doesn’t increase usable capacity but optimizes the use of
resources
• Auto-Tiering
– Ensures data sits on the right tier of storage
• Consider an example
– A volume was 200 GB and only 2 GB of it was frequently
accessed and needed moving to tier 1, but volume-level
auto-tiering caused the entire 200 GB to be moved
– This wastes expensive resources
• Solution
– Sub-LUN tiering does away with the need to move entire
volumes up and down through the tiers.
– It works by slicing and dicing volumes into multiple smaller
extents (see the sketch after this slide).
322
BITS Pilani, Pilani Campus
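A minimal sketch of sub-LUN tiering logic, assumed purely for illustration (real arrays use far more sophisticated policies): the 200 GB volume from the example is split into 256 MB extents and only the hottest few extents are promoted to the expensive tier.

EXTENT_MB = 256                    # assumed extent size
TIER1_EXTENTS = 8                  # room for 8 extents (2 GB) on the fast tier

def place_extents(io_counts: dict[int, int]) -> dict[int, str]:
    """Map extent id -> tier, promoting only the most frequently accessed extents."""
    hottest_first = sorted(io_counts, key=io_counts.get, reverse=True)
    return {ext: ("tier1" if rank < TIER1_EXTENTS else "tier2")
            for rank, ext in enumerate(hottest_first)}

# 200 GB volume = 800 extents; only a handful are frequently accessed.
io_counts = {ext: 5 for ext in range(800)}
for hot_extent in (3, 42, 97, 512):
    io_counts[hot_extent] = 10_000

placement = place_extents(io_counts)
print(sum(1 for t in placement.values() if t == "tier1"), "extents promoted to tier 1")

Only about 2 GB worth of extents moves to tier 1 instead of the whole 200 GB volume.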
Impact of Auto-Tiering on
Remote Replication
• As a result of the read and write workload, the auto-tiering
algorithms of the array move extents up and down the tiers,
ensuring an optimal layout.

• However, the target array does not see the same workload
– All the target array sees is replicated I/O, and only write I/Os are
replicated from the source to the target.

• As a result, how a volume is spread across the tiers on the
target array will often be very different from the source array
– Read I/Os are not replicated

• This frequently results in a volume sitting on lower-
performance tiers on the target array.
323
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Backup and Recovery


– One of the major business continuity planning activities
– Recovery is more important than backups
– If the business can’t go backward after a disaster,
it can’t go forward.
• Backup methods
– Hot (aka Online)
– Offline
– LAN based
– SAN based
325
BITS Pilani, Pilani Campus
Recovery Point Objectives
(RPO)
• What is RPO…?
– It is the point in time to which a service can be
restored/recovered
– For example a backup schedule that backs up a
database every day at 2 a.m. can recover that
database to the state it was in at 2 a.m. every day
– For example, the RPO for a daily tape-based backup is
• 24 hrs at best (up to a day of data could be lost)
– If more-granular RPOs are required, traditional
backups can be augmented with technologies such
as snapshots
326
BITS Pilani, Pilani Campus
Recovery Time Objective vs.
Recovery Point Objective
• RPO states that you can recover a system or
application to the state it was in within a certain
time window

• RTO states how long it will take you to do that

• Example: It takes 8 hours to recover a system
to the state it was in 24 hours ago
(RTO = 8 hours, RPO = 24 hours)
327
BITS Pilani, Pilani Campus
Backup Window

• A backup window is a period of time—with a
start and stop time—during which data can be
backed up
backed up
– For example, a backup window might include all
noncore business hours for a certain business,
such as 11 p.m. until 9 a.m.
– Why is a backup window required?
• Backups have a negative performance impact on the
system while they run

328
BITS Pilani, Pilani Campus
Backup Architecture: Example
1. The backup server monitors the
backup schedule.
2. When the schedule indicates that a
particular client needs backing up,
the backup server instructs the
agent software on the backup
client to execute a backup.
3. The backup client sends the data
over the IP network to the media server.
4. The media server channels the
incoming backup data to the
backup target over the FC SAN that
it is connected to.
Fig. source: Data Storage Networks by Nigel Poulton
329
BITS Pilani, Pilani Campus
Backup Methods: Hot Backups
and Offline Backups
• Online backups (i.e. application is servicing users)
– Reduce administrative overhead and allow backup
environments to scale because they don’t require
administrative stops and restarts of applications
– Provides ability to perform integrity checks against
backups
• Offline Backups
– Require applications and databases to be taken offline
for the duration of the backup
– Rarely used these days!
330
BITS Pilani, Pilani Campus
Backup Methods: LAN Based
Backups
• Cheap and convenient!
• Sending backup data over the LAN
– Often low performance with a risk of impacting other
traffic on the network
– Can be a dedicated VLAN, a dedicated physical
network, or the main production network

331
BITS Pilani, Pilani Campus
Backup Methods: LAN Free
Backups
• Data is passed from the backup client to the storage medium
over the SAN
• Block-based backup; faster than file-based backups
• Restore is complex
– You may have to restore the entire backup to a temporary area, then
locate the required file and copy it back to its original location.

Fig. source: Data Storage networks by Nigel Poulton 332


BITS Pilani, Pilani Campus
Backup Methods: Server less
Backups
• Don’t utilize the backup client’s server resources (an extension of LAN-free backups)
• Use the SCSI-based EXTENDED COPY command
– This allows data to be copied directly from a source LUN
to a destination LUN

Fig. source: Data Storage networks by Nigel Poulton 333


BITS Pilani, Pilani Campus
Network Data Management
Protocol (NDMP)
• Network Data Management Protocol (NDMP) is a
protocol designed for standards-based, efficient NAS
backups
– Data can be sent directly from the NAS device to the
backup device without having to pass through a backup
media server.
– Without NDMP, NAS-based file shares would have to be
mounted on a backup media server to be able to be
backed up
– Windows and Linux servers don’t need NDMP (they can run
backup agents themselves)
• NAS arrays run custom operating systems that can’t normally have
backup agent software installed on them
334
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Backup Types
– Full
– Incremental
– Differential
– Synthetic
– Application aware

336
BITS Pilani, Pilani Campus
Full Backups

• A backup in which all of a defined set of data
objects are copied, regardless of whether they
have been modified since the last backup
– They offer the best in terms of restore speed and
restore simplicity.
– Consume the most backup space on the backup
target device
– Use the most network bandwidth and server
resources
– Also take the most time
337
BITS Pilani, Pilani Campus
Incremental Backups

• An incremental backup is defined as a job that
backs up files that have changed since the last
backup...
– and that “last backup” could be an incremental
backup or a full backup
• A full recovery requires the most recent full
backup plus all incremental backups since the
full backup

338
BITS Pilani, Pilani Campus
Cumulative Incremental
Backups
• Only back up the data that has changed since
the last full backup
– So, in the first place, you need a full backup, and
from there on, you can take cumulative
incremental backups
• Most cumulative incremental backup solutions
will back up an entire file even if only a single
byte of the file has changed since the last
backup
339
BITS Pilani, Pilani Campus
Differential Incremental
Backups
• Like cumulative incremental backups, differential
incremental backups work in conjunction with full
backups

• But instead of backing up all data that has changed
since the last full backup, differential incremental
backups back up only the data changed since the last
differential incremental (i.e. since the most recent
backup, whether full or incremental).

• These are excellent for space efficiency


340
BITS Pilani, Pilani Campus
Synthetic Full Backup

• A synthetic full backup works by taking a one-
time full backup and then, from that point on,
taking only incremental or differential backups
– New full backup images are then synthesized on the
backup target by merging the last full with the
subsequent incrementals (see the sketch after this slide)

• Synthetic full backups are referred to as a form of
incremental-forever backup system

341
BITS Pilani, Pilani Campus
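A minimal sketch of how a new full image can be synthesized on the backup target (an assumption about the general idea, not any product's on-disk format): merge the last full backup with the incrementals taken since, letting the newest change to each file win.

def synthesize_full(last_full: dict, incrementals: list[dict]) -> dict:
    """Each image maps file path -> contents; None marks a deleted file."""
    image = dict(last_full)
    for inc in incrementals:                      # apply oldest -> newest
        for path, contents in inc.items():
            if contents is None:
                image.pop(path, None)             # file deleted since the full
            else:
                image[path] = contents            # new or changed file
    return image

full = {"/etc/hosts": "v1", "/var/log/app.log": "v1"}
monday = {"/var/log/app.log": "v2"}               # Monday's incremental
tuesday = {"/etc/hosts": None, "/home/report": "v1"}

new_full = synthesize_full(full, [monday, tuesday])
print(new_full)    # {'/var/log/app.log': 'v2', '/home/report': 'v1'}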
Application Aware Backups

• Application-consistent backups (a form of hot
backups)
– e.g. Flushing local buffers, applying logs, setting any
application flags for smooth recovery
• Enable you to restore your applications from your
backups in a consistent state
• Application-aware backups usually require
installing special agent software on the application
server
– Such as installing an Exchange Server agent on the
Exchange server
342
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Backup Target Devices


– Tape devices
– Virtual Tape Libraries
– Disk Pools
– Cloud Services
• Backup retention policies
and Archiving

344
BITS Pilani, Pilani Campus
Backup to Tapes

• Oldest and most common form of backup…!


• Advantages
– High capacity
– High sequential performance
– Low power
• Disadvantages
– Media degradation
– Technology refresh
• Lack of software to read from old backups
– Poorly suited to some restore types
• e.g. differential backups, incremental backups
345
BITS Pilani, Pilani Campus
Linear Tape-Open (LTO) Format
Technology

Table Source: Data Storage networking by Nigel Poulton


346
BITS Pilani, Pilani Campus
Virtual Tape Library (VTL)
Emulation
• At a high level, VTL technologies take disk-based storage
arrays and carve them up to appear as tape libraries and
tape drives—hence the name Virtual Tape Library
– It is disk pretending to be tape
– Pros: Existing backup software can talk to it without modification.

Fig. source: Data Storage networks by Nigel Poulton 347


BITS Pilani, Pilani Campus
Backup to Disk: Disk Pools

• Advantages over tapes


– Superior random access
– Superior concurrent read and write operations
• Good for synthetic full backups, where the system is
both reading and writing during the creation of the
synthetic full backup image
– Better reliability
• Can be protected by RAID technologies

348
BITS Pilani, Pilani Campus
Backup to the Cloud

• Cloud is ideal for backup and recovery
– Backup data doesn’t need fast and frequent access
– How to get backup data to the cloud?
• Disk-to-Disk-to-Cloud (D2D2C)
• D2D portion of it is in your data center (local)

Fig. source: Data Storage networks by Nigel Poulton 349


BITS Pilani, Pilani Campus
Backup Retention Policies

• Grandfather-Father-Son (GFS) scheme


– Grandfather (monthly backups)
– Father (Weekly backups)
– Son (Daily backups)
– Example
• Monthly backups: Keep for 60 months
• Weekly backups: Keep for 52 weeks
• Daily backups: Keep for 28 days
– Monthly and weekly full backups are taken off-site,
while daily incremental backups are kept on-site
350
BITS Pilani, Pilani Campus
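A minimal sketch (using the retention figures from the slide above, with 60 months approximated as 1,800 days) of deciding whether a given backup is still inside its GFS retention period.

from datetime import date, timedelta

RETENTION = {"monthly": timedelta(days=60 * 30),   # keep for ~60 months
             "weekly":  timedelta(weeks=52),       # keep for 52 weeks
             "daily":   timedelta(days=28)}        # keep for 28 days

def keep(backup_date: date, kind: str, today: date) -> bool:
    """True while the backup is still within its retention period."""
    return today - backup_date <= RETENTION[kind]

print(keep(date(2024, 1, 1), "daily",  date(2024, 2, 15)))   # False (45 days old)
print(keep(date(2024, 1, 1), "weekly", date(2024, 2, 15)))   # True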
SLA based on Retention Policies

1. All data from the last 28 days can be restored from the
previous night’s backup (RPO), and can be restored
within one working day (RTO).
2. All data between 28 and 365 days old can be restored
to the nearest Friday night backup of the data (RPO),
and can be restored within two working days (RTO).
3. All data between 1 year and 5 years old can be
restored to the nearest first Friday of the month
(RPO), and can be restored within two working days
(RTO).
351
BITS Pilani, Pilani Campus
Backup vs. Archiving

• Backups are intended to protect data that is currently
or recently in use.
• Backups provide medium to fast recovery of this data,
usually in the event that it’s lost, deleted, or corrupted.
• Backups deal with operational recovery of files, folders,
applications, and services.
• Archiving is used to efficiently store data that is no
longer used, or only very infrequently used, but needs
keeping.
• Archiving is used for legal compliance requirements
352
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics
• Storage Management Areas
– Capacity, Performance, Availability
• Capacity Management
– Make sure you always have enough capacity to
service your applications!
– Make sure capacity is utilized in the best way
– Requires reporting and trending

354
BITS Pilani, Pilani Campus
Capacity Management

• Capacity management in FC-SAN based storage


– SAN switches
– SAN ports
– SAN bandwidth
– Inter-switch links
– Front-end array ports
– Data center power and cooling
– Data center floor space
– Licenses for usable capacity
– Many more….
355
BITS Pilani, Pilani Campus
Capacity Reporting

• What is using up storage in the shared corporate
space?
– Production-Live
– Production-Disaster Recovery
– Development
– Research
–…

356
BITS Pilani, Pilani Campus
Thin Provisioning
Considerations
• Over provisioning
– Allows us to pretend we have more storage than
we really do
– It involves risk: if used capacity grows faster than expected,
the array can run out of physical space

Fig. Source: Storage Area Networks by Nigel Poulton 357


BITS Pilani, Pilani Campus
The Need for Trending

• Start out by over provisioning by a small percentage, e.g.
10–20 percent
• Then trend and forecast for the next few months, say 4 to 6
months
• Once you have a good idea of the capacity-related
characteristics, over provisioning can be increased up to
40 percent
• Keep trending and forecasting, and repeat the cycle
• The key here is to understand the growth and
characteristics of your environment
• There is no rule of thumb for this!
358
BITS Pilani, Pilani Campus
Key Metrics for Trending

• Physical Capacity
– Usable capacity of the array
• Provisioned Capacity
– How much you are pretending to have
• Used Capacity
– Amount of the capacity hosts have actually
written
• Each of these metrics must be known and
tracked individually for each array you have
359
BITS Pilani, Pilani Campus
Example: Trending

• Physical Capacity: 100 TB


• Provisioned Capacity: 150 TB
• Actual used Capacity: 50 TB
• Over provisioned Percentage: 50 %
• Trend:
– Used capacity is increasing by 5TB per month

360
BITS Pilani, Pilani Campus
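A minimal sketch that reproduces the numbers on the slide above and adds the forecast an administrator actually cares about: at the current growth trend, how many months until used capacity reaches the array's physical capacity.

physical_tb = 100          # usable physical capacity of the array
provisioned_tb = 150       # what we are pretending to have
used_tb = 50               # what hosts have actually written
growth_tb_per_month = 5    # observed trend

overprovisioned_pct = (provisioned_tb - physical_tb) / physical_tb * 100
months_until_full = (physical_tb - used_tb) / growth_tb_per_month

print(f"Over-provisioned by {overprovisioned_pct:.0f}%")                  # 50%
print(f"Physical capacity exhausted in ~{months_until_full:.0f} months")  # ~10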
De-duplication and Compression

• Tracking how effective these technologies are is
useful for reporting purposes
• De-duplication effectiveness is expressed as a ratio
– e.g. 4:1 and 10:1
– From a capacity-management point of view, there is a big
difference between inline and post-process de-
duplication technologies
• Conversion ratios to percentage space saved
– 10:1 is converted as 100-(1*100/10) = 90 %
– 25:1 is converted as 100 – (1*100/25) = 96 %

361
BITS Pilani, Pilani Campus
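A one-function sketch of the ratio-to-percentage conversion shown above.

def pct_saved(ratio: float) -> float:
    """Convert an N:1 deduplication/compression ratio into percent space saved."""
    return 100 - (100 / ratio)

for r in (4, 10, 25):
    print(f"{r}:1  ->  {pct_saved(r):.0f}% space saved")   # 75%, 90%, 96%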
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Performance Management
– Ensuring that your storage estate is high
performance
– Identifying performance bottlenecks
– Tuning performance

363
BITS Pilani, Pilani Campus
Performance Management

• It is about the optimal performance of the entire
storage estate
– Storage arrays
– Network
– NICs
– HBAs
– Volume managers
– Multi-pathing software

364
BITS Pilani, Pilani Campus
Base-lining

• To troubleshoot a suspected performance issue, a
frame of reference to compare the system against is a
must; this is called base-lining

• “Good performance” is application dependent
– e.g. 15,000 IOPS with a response time of 15 ms may be
good for some applications and bad for other
applications

• Whether performance is acceptable can only be judged
against a baseline. How do you build one? (See the
sketch after this slide.)


365
BITS Pilani, Pilani Campus
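A minimal sketch of one assumed way to build a baseline: collect response-time samples during normal operation and keep summary statistics (average, 95th and 99th percentiles) to compare against when a problem is suspected.

import statistics

def baseline(samples_ms: list[float]) -> dict:
    """Summarise response-time samples gathered during normal operation."""
    ordered = sorted(samples_ms)

    def pct(p: float) -> float:
        return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

    return {"avg_ms": round(statistics.mean(ordered), 2),
            "p95_ms": pct(0.95),
            "p99_ms": pct(0.99)}

normal_day = [4.8, 5.1, 5.0, 5.3, 4.9, 6.0, 5.2, 7.1, 5.0, 5.4]   # sample latencies
print(baseline(normal_day))   # e.g. {'avg_ms': 5.38, 'p95_ms': 7.1, 'p99_ms': 7.1}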
Latency/Response Time

• Biggest performance killer!


• The time from sending a command to getting the response
is referred to as response time
– Any delay added to the response time is called latency
• Latency can occur at just about every layer in the
design
– Host: File system, volume manager, HBA
– Network: SAN or IP network, inter switch links
– Storage array: Array front end ports, cache, backend,
replication
366
BITS Pilani, Pilani Campus
Low Latency Requirements:
Examples
• Trading systems
• Pay as you go cell phone systems requiring
account balances to be checked before
connecting calls
• Online shopping systems that have to perform
fraud detection before processing payments

367
BITS Pilani, Pilani Campus
Performance Metric: IOPS and
MBps
• IOPS
– The concept of IOPS on its own is vague
– What exactly is an I/O?
– Are all I/Os equal?
• Read vs. write, random vs. sequential, small vs. big
– Used for transactional applications
• MBps
– Number of megabytes per second transferred by a disk or
storage array
– Used for throughput-driven applications
(see the conversion sketch after this slide)
368
BITS Pilani, Pilani Campus
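A tiny sketch of why an IOPS figure alone is vague: the same device throughput corresponds to very different IOPS numbers depending on the I/O size, which is why transactional workloads quote IOPS and throughput-driven workloads quote MBps.

def mbps(iops: int, io_size_kb: int) -> float:
    """Throughput implied by an IOPS figure at a given I/O size."""
    return iops * io_size_kb / 1024

print(mbps(15_000, 4))      # 15,000 x 4 KB I/Os  -> ~58.6 MBps (transactional)
print(mbps(500, 1024))      # 500 x 1 MB I/Os     -> 500 MBps (throughput-driven)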
Factors Affecting Storage
Performance [1]
• RAID
– Good for data protection but with an impact on the
performance front
– Random small-block write workloads are slower
• e.g. the write penalty on RAID 5 and RAID 6
• Thin LUNs
– Follow an allocate-on-demand model
– Performance impacts
• Can add latency, as the system has to identify free extents
and allocate them to a volume each time a write request
comes into a new area of a thin LUN
• Can result in a fragmented backend layout
369
BITS Pilani, Pilani Campus
Factors Affecting Storage
Performance [2]
• Cache
– Caches improve performance of spinning disk
based storage
• Network hops
– Leads to network induced latency
– Higher in IP/Ethernet as compared to FC-SAN
• Multi-Pathing
– The prime motivation is to provide high availability
– MPIO also balances the I/O from a single host over
multiple array ports
370
BITS Pilani, Pilani Campus
Standard Performance Tools

• Perfmon (Performance Monitor) in Windows
– One can monitor host-based performance counters
(e.g. end-to-end latency)
– You can add the counters and monitor the
performance
• iostat in Linux
– CPU utilization
– Device utilization
– Network file system utilization

371
BITS Pilani, Pilani Campus
Data Storage Technologies &
Networks
BITS Pilani Dr. Virendra Singh Shekhawat
Department of Computer Science and Information Systems
Pilani Campus

BITS Pilani, Pilani Campus


Topics

• Cloud Types
– Public, Private and Hybrid

• Storage and the Cloud


– Data Durability
– Eventual Consistency Model

373
BITS Pilani, Pilani Campus
Cloud Models

• Public Cloud
– Highest level of abstraction
• Clients don’t have visibility into the technologies behind
the service
• Multitenant: the underlying infrastructure is shared by
multiple customers
• e.g. EC2 (Elastic Compute Cloud), Windows Azure

374
BITS Pilani, Pilani Campus
Private Cloud

• Your own internal cloud, with on-premises
equipment that you still own and manage, giving
service to your internal customers
– e.g. web portals, billing services

• Alternatively, a third party hosts the service and offers
customers a dedicated infrastructure (no
multi-tenancy)
375
BITS Pilani, Pilani Campus
Hybrid Cloud

• Hybrid = Public + Private


• A single application uses a mix of public and
private cloud
– e.g. Microsoft StorSimple, Panzura
• Examples
– Some services run in the public cloud and some
in-house
– At peak times, run services in the public cloud; at other
times, run them in-house
376
BITS Pilani, Pilani Campus
Storage and the Cloud
• Storage as a Service (SaaS)
– Cloud storage is accessed remotely over the Internet

• Features
– Elastic Storage
– Massive Scalability
– Accessibility
– Fast self-service provisioning
– Built-in protection
– No cap-ex cost
Fig. Source: Data Storage Networking by Nigel Poulton
377
BITS Pilani, Pilani Campus
Drawbacks for Cloud Storage

• Low performance when compared to non-cloud-
based storage solutions
• Most cloud storage solutions are object-storage
based
– Not good for structured data (e.g. databases)
– Not good for data that changes a lot!
– Good for streaming (audio, video, images) and read
workloads
• Suitable for social media sites
378
BITS Pilani, Pilani Campus
Data Durability in Clouds

• Data survivability in failure situations


– Amazon claims 99.999999999% durability
– How to achieve high durability…?
• Multiple copies of the data are stored (within the region and
also outside the region, i.e. geo-replication)
• What about the update overhead?
– Updates are atomic (data is written to multiple locations
before the update is confirmed as successful)
• Data consistency across regions is difficult to achieve
(see the simplified durability sketch after this slide)

379
BITS Pilani, Pilani Campus
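A deliberately simplified sketch (my assumption, not Amazon's published durability model) of why multiple copies drive durability: if replica losses were independent, an object disappears only when every replica is lost before it can be re-replicated. The 1% per-replica loss probability is made up purely for illustration.

def durability(p_replica_loss: float, replicas: int) -> float:
    """Chance an object survives, assuming independent replica failures."""
    return 1 - p_replica_loss ** replicas

P_LOSS = 0.01   # assumed chance of losing one replica before re-replication
for n in (1, 2, 3):
    print(f"{n} replica(s): durability {durability(P_LOSS, n):.6f}")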
Data Consistency

• Eventual Consistency Model


– Updates to objects take time to propagate
throughout the cloud
– Objects are replicated or propagated to multiple
nodes and locations within the cloud to protect the
data
– In geo-replication situations, updates to objects
occur asynchronously

380
BITS Pilani, Pilani Campus
Public Cloud Storage
Performance
• Storage performance hierarchy
– Main memory > Local attached Hard Disk > SAN or NAS > Cloud
Storage
• Capacity relationship
– Closer to the CPU: less capacity, less shareable; farther from the
CPU: more capacity, more shareable
• Best performance use case for cloud storage
– Large objects like videos, photos
• Requirement: Good throughput rather than low latency
• Atomic Uploads
– Object is not available to access until the upload operation is 100%
complete
• CDNs (Geo-Replication)
– Principle: the closer the data, the better the performance
381
BITS Pilani, Pilani Campus
