Storage Devices
Kai Li
Computer Science Department
Princeton University
(http://www.cs.princeton.edu/courses/cos318/)
Today’s Topics
Magnetic disks
Magnetic disk performance
Disk arrays
Flash memory
2
A Typical Magnetic Disk Controller
External connection
IDE/ATA, SATA
SCSI, SCSI-2, Ultra SCSI, Ultra-160 SCSI, Ultra-320 SCSI
Fibre channel
Cache
Buffer data between disk and DRAM
Controller
Read/write operation
Cache replacement
Failure detection and recovery
(Figure: host interface → cache → controller → disk)
3
Disk Caching
Method
Use DRAM to cache recently accessed blocks
Most disks have 16MB of DRAM
Some of the RAM space stores “firmware” (an embedded OS)
Blocks are usually replaced in LRU order
Pros
Good for reads if accesses have locality
Cons
Cost
Need to deal with reliable writes
4
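To make the LRU replacement policy concrete, here is a minimal sketch of a block cache that evicts the least recently used entry on a miss. The block size, entry count, and function names are illustrative, not what a drive's firmware actually does.

```c
#include <stdint.h>

#define BLOCK_SIZE   512        /* bytes per cached block (illustrative)  */
#define CACHE_BLOCKS 32768      /* 32768 * 512B = 16MB of cache DRAM      */

struct cache_entry {
    uint64_t block_no;          /* which disk block is cached here        */
    uint64_t last_used;         /* logical clock used for LRU ordering    */
    int      valid;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_entry cache[CACHE_BLOCKS];
static uint64_t clock_ticks;

/* Look up a block; on a miss, reuse the least recently used entry. */
static struct cache_entry *cache_lookup(uint64_t block_no)
{
    struct cache_entry *victim = &cache[0];
    for (int i = 0; i < CACHE_BLOCKS; i++) {
        if (cache[i].valid && cache[i].block_no == block_no) {
            cache[i].last_used = ++clock_ticks;   /* hit: refresh LRU time */
            return &cache[i];
        }
        if (!cache[i].valid || cache[i].last_used < victim->last_used)
            victim = &cache[i];                   /* track the LRU slot    */
    }
    /* Miss: a real controller would read the block from the platters
       into victim->data before returning it. */
    victim->block_no  = block_no;
    victim->valid     = 1;
    victim->last_used = ++clock_ticks;
    return victim;
}

int main(void)
{
    cache_lookup(7);            /* miss: block 7 fills an empty slot */
    cache_lookup(7);            /* hit: its LRU timestamp is refreshed */
    return 0;
}
```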
Disk Arm and Head
Disk arm
A disk arm carries disk heads
Disk head
Mounted on an actuator
Read and write on disk surface
Read/write operation
Disk controller receives a command with <track#, sector#>
Seek to the right cylinder (track position)
Wait until the right sector rotates under the head
Perform the read/write
Mechanical Components of a Disk Drive
Tracks
Concentric rings around disk surface, bits laid out serially along each track
Cylinder
The set of tracks at the same position on all platter surfaces; 1,000-5,000 cylinders per zone, 1 spare per zone
Sectors
Each track is split into arcs (sectors), the minimum unit of transfer
6
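As a rough illustration of how sectors, heads (tracks), and cylinders compose, the sketch below shows the classic logical-block-to-cylinder/head/sector mapping. The geometry constants are invented for the example; real drives use zoned recording and hide the true geometry behind the controller.

```c
#include <stdio.h>

/* Illustrative geometry only; real drives vary sectors per track by zone. */
#define HEADS             8      /* surfaces (one head per surface) */
#define SECTORS_PER_TRACK 63     /* sectors on each track           */

struct chs { unsigned cylinder, head, sector; };

/* Classic mapping: consecutive logical blocks fill a track, then the next
   head in the same cylinder, then the next cylinder. Sectors are 1-based. */
static struct chs lba_to_chs(unsigned lba)
{
    struct chs pos;
    pos.cylinder = lba / (HEADS * SECTORS_PER_TRACK);
    pos.head     = (lba / SECTORS_PER_TRACK) % HEADS;
    pos.sector   = (lba % SECTORS_PER_TRACK) + 1;
    return pos;
}

int main(void)
{
    struct chs p = lba_to_chs(123456);
    printf("LBA 123456 -> cylinder %u, head %u, sector %u\n",
           p.cylinder, p.head, p.sector);
    return 0;
}
```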
Disk Sectors
Where do they come from?
Formatting process
Logical maps to physical
What is a sector?
Header (ID, defect flag, …)
Real space (e.g. 512 bytes)
Trailer (ECC code)
(Figure: sector layout: Hdr | 512 bytes | ECC)
What about errors?
Detect errors in a sector
Correct them with ECC
If not recoverable, replace it with a spare
Skip bad sectors in the future
7
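The header / data / ECC layout can be pictured as a struct; the field widths below are illustrative, not any vendor's real on-platter format.

```c
#include <stdint.h>
#include <stdio.h>

/* Rough model of one on-platter sector: header, data, ECC trailer. */
struct sector {
    /* Header */
    uint32_t sector_id;      /* identifies the sector                     */
    uint8_t  defect;         /* defect flag: remapped to a spare if set   */
    /* Real space */
    uint8_t  data[512];      /* the 512 bytes visible to the host         */
    /* Trailer */
    uint8_t  ecc[40];        /* ECC used to detect and correct bit errors */
};

int main(void)
{
    printf("modeled sector occupies %zu bytes of raw track space\n",
           sizeof(struct sector));
    return 0;
}
```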
Disks Were Large
First Disk:
IBM 305 RAMAC (1956)
5MB capacity
50 disks, each 24”
8
They Are Now Much Smaller
10
(Mark Kryder at SNW 2006)
50 Years Later (Mark Kryder at SNW 2006)
11
Sample Disk Specs (from Seagate)
Cheetah 15k.7 Barracuda XT
Capacity
Formatted capacity (GB) 600 2000
Discs 4 4
Heads 8 8
Sector size (bytes) 512 512
Performance
External interface Ultra320 SCSI/FC/SAS SATA
Spindle speed (RPM) 15,000 7,200
Average latency (msec) 2.0 4.16
Seek time, read/write (ms) 3.5/3.9 8.5/9.5
Track-to-track read/write (ms) 0.2-0.4 0.8/1.0
Internal transfer (MB/sec) 1,450-2,370 600
Transfer rate (MB/sec) 122-204 138
Cache size (MB) 16 64
Reliability
Recoverable read errors 1 per 10^12 bits read 1 per 10^10 bits read
Non-recoverable read errors 1 per 10^16 bits read 1 per 10^14 bits read
12
Disk Performance (2TB disk)
Seek
Position heads over cylinder, typically 3.5-9.5 ms
Rotational delay
Wait for a sector to rotate underneath the heads
A full rotation typically takes 8 - 4 ms (7,200 - 15,000 RPM)
so ½ rotation takes 4 - 2 ms on average
Transfer bytes
Transfer bandwidth is typically 40-138 Mbytes/sec
Performance of transferring 1 Kbyte
Seek (4 ms) + half rotational delay (2ms) + transfer (0.013 ms)
Total time is 6.01 ms or 167 Kbytes/sec (1/360 of 60MB/sec)!
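The arithmetic above is easy to check with a few lines; the sketch assumes a 75 MB/sec media rate so that a 1 Kbyte transfer takes the 0.013 ms quoted above, and the seek and rotation values are the same ones used in the example.

```c
#include <stdio.h>

int main(void)
{
    /* Assumptions matching the example above. */
    double seek_ms     = 4.0;    /* average seek                  */
    double half_rot_ms = 2.0;    /* average rotational delay      */
    double media_mb_s  = 75.0;   /* sustained media transfer rate */
    double transfer_kb = 1.0;    /* request size: 1 Kbyte         */

    double transfer_ms = transfer_kb / 1024.0 / media_mb_s * 1000.0;
    double total_ms    = seek_ms + half_rot_ms + transfer_ms;
    double eff_kb_s    = transfer_kb / (total_ms / 1000.0);

    /* Prints roughly: transfer 0.013 ms, total 6.01 ms, effective 166 KB/s */
    printf("transfer %.3f ms, total %.2f ms, effective %.0f KB/s\n",
           transfer_ms, total_ms, eff_kb_s);
    return 0;
}
```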
More on Performance
What transfer size can get 90% of the disk bandwidth?
Assume Disk BW = 60MB/sec, ½ rotation = 2ms, ½ seek = 4ms
BW * 90% = size / (size/BW + rotation + seek)
size = BW * (rotation + seek) * 0.9 / 0.1
= 60MB * 0.006 * 0.9 / 0.1 = 3.24MB
15
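The same formula can be evaluated directly; the bandwidth, seek, and rotation values below are the slide's assumptions.

```c
#include <stdio.h>

int main(void)
{
    /* Assumptions from the slide: 60 MB/s bandwidth, 4 ms seek, 2 ms rotation. */
    double bw_mb_s    = 60.0;
    double overhead_s = 0.004 + 0.002;   /* seek + rotational delay   */
    double target     = 0.90;            /* want 90% of raw bandwidth */

    /* target*BW = size / (size/BW + overhead)
       =>  size = BW * overhead * target / (1 - target)               */
    double size_mb = bw_mb_s * overhead_s * target / (1.0 - target);

    /* Prints: need ~3.24 MB per transfer to reach 90% of 60 MB/s */
    printf("need ~%.2f MB per transfer to reach %.0f%% of %.0f MB/s\n",
           size_mb, target * 100.0, bw_mb_s);
    return 0;
}
```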
SSTF (Shortest Seek Time First)
Method
Pick the request closest to the current head position
Rotational delay is included in the calculation
(Figure: cylinders 0-199, head at cylinder 53)
Pros
Try to minimize seek time
Cons
Starvation
Question
Is SSTF optimal?
Can we avoid starvation?
Request queue: 98, 183, 37, 122, 14, 124, 65, 67
SSTF order: 65, 67, 37, 14, 98, 122, 124, 183
16
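A minimal SSTF simulation over the example queue above (head at cylinder 53); it considers seek distance only, not rotational position.

```c
#include <stdio.h>
#include <stdlib.h>

/* Shortest Seek Time First: repeatedly service the pending request
   closest to the current head position. */
static void sstf(int head, int *req, int n)
{
    int done[64] = {0};                    /* assumes n <= 64 */
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done[i]) continue;
            if (best < 0 || abs(req[i] - head) < abs(req[best] - head))
                best = i;
        }
        done[best] = 1;
        head = req[best];
        printf("%d ", head);
    }
    printf("\n");
}

int main(void)
{
    int queue[] = {98, 183, 37, 122, 14, 124, 65, 67};
    sstf(53, queue, 8);      /* prints: 65 67 37 14 98 122 124 183 */
    return 0;
}
```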
Elevator (SCAN)
Method
Take the closest request in the direction of travel
Real implementations do not go to the end (called LOOK)
(Figure: cylinders 0-199, head at cylinder 53)
Pros
Bounded time for each request
Cons
Requests at the other end will take a while
Request queue: 98, 183, 37, 122, 14, 124, 65, 67
SCAN order: 37, 14, 65, 67, 98, 122, 124, 183
17
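A sketch of the elevator order for the same queue, sweeping toward 0 first (as in the figure) and stopping at the last request in each direction, i.e. the LOOK variant.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* LOOK variant of the elevator: sort the queue, sweep downward from the
   head position, then sweep upward through the remaining requests. */
static void elevator_down_then_up(int head, int *req, int n)
{
    qsort(req, n, sizeof(int), cmp_int);
    int first_above = 0;
    while (first_above < n && req[first_above] < head)
        first_above++;
    for (int i = first_above - 1; i >= 0; i--)   /* downward sweep */
        printf("%d ", req[i]);
    for (int i = first_above; i < n; i++)        /* upward sweep   */
        printf("%d ", req[i]);
    printf("\n");
}

int main(void)
{
    int queue[] = {98, 183, 37, 122, 14, 124, 65, 67};
    elevator_down_then_up(53, queue, 8);  /* prints: 37 14 65 67 98 122 124 183 */
    return 0;
}
```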
C-SCAN (Circular SCAN)
Method
Like SCAN, but wrap around instead of reversing direction
(Figure: cylinders 0-199, head at cylinder 53)
Real implementations don't go to the end (C-LOOK)
Pros
Uniform service time
Cons
Do nothing on the return
18
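And the circular variant (C-LOOK) of the previous sketch: sweep in one direction only, then jump back to the other end. The slide gives no C-SCAN order for the example queue, so the output shown is just one possible run with an upward sweep.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* C-LOOK: always sweep upward; after the highest pending request,
   wrap around to the lowest one and continue upward. */
static void c_look_up(int head, int *req, int n)
{
    qsort(req, n, sizeof(int), cmp_int);
    int first_above = 0;
    while (first_above < n && req[first_above] < head)
        first_above++;
    for (int i = first_above; i < n; i++)   /* upward sweep        */
        printf("%d ", req[i]);
    for (int i = 0; i < first_above; i++)   /* wrap to the low end */
        printf("%d ", req[i]);
    printf("\n");
}

int main(void)
{
    int queue[] = {98, 183, 37, 122, 14, 124, 65, 67};
    c_look_up(53, queue, 8);   /* prints: 65 67 98 122 124 183 14 37 */
    return 0;
}
```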
Discussions
Which is your favorite?
FIFO
SSTF
SCAN
C-SCAN
Disk I/O request buffering
Where would you buffer requests?
How long would you buffer requests?
19
RAID (Redundant Array of Independent Disks)
Main idea
Store the error correcting codes on other disks
General error correcting codes are too powerful
Use XORs or single parity
Upon a failure, one can rebuild the entire block onto a spare disk (or any disk) using XORs of the surviving disks
(Figure: RAID controller striping disks D1 D2 D3 D4 with parity P)
P = D1 ⊕ D2 ⊕ D3 ⊕ D4
D3 = D1 ⊕ D2 ⊕ P ⊕ D4
Pros
Reliability
High bandwidth
Cons
The controller is complex
20
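The two XOR relations above can be demonstrated directly; the five-disk layout (D1..D4 plus parity P) mirrors the figure, and the byte values are arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

#define NDATA 4        /* D1..D4, plus one parity "disk" P */

int main(void)
{
    /* One byte from the same offset of each data disk (arbitrary values). */
    uint8_t d[NDATA] = {0x12, 0x34, 0x56, 0x78};

    /* P = D1 xor D2 xor D3 xor D4 */
    uint8_t p = 0;
    for (int i = 0; i < NDATA; i++)
        p ^= d[i];

    /* Suppose D3 (index 2) fails: rebuild it from the others and P.
       D3 = D1 xor D2 xor P xor D4 */
    uint8_t rebuilt = p;
    for (int i = 0; i < NDATA; i++)
        if (i != 2)
            rebuilt ^= d[i];

    printf("parity = 0x%02x, rebuilt D3 = 0x%02x (original 0x%02x)\n",
           p, rebuilt, d[2]);
    return 0;
}
```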
Synopsis of RAID Levels
RAID Level 1:
Mirroring
RAID Level 2:
Byte-interleaved, ECC
RAID Level 3:
Byte-interleaved, parity
RAID Level 4:
Block-interleaved, parity
RAID Level 5:
Block-interleaved, distributed parity
21
RAID Level 6 and Beyond
Goals
Less computation and fewer updates per random write
Small amount of extra disk space
Extended Hamming code
Remember Hamming code?
Specialized Erasure Codes
IBM Even-Odd, NetApp RAID-DP, …
Beyond RAID-6
Reed-Solomon codes, using MOD 4 equations
Can be generalized to deal with k (>2) disk failures
(Figure: striped block layout with data blocks 0-15 and parity blocks A-H)
22
Dealing with Disk Failures
What failures
Power failures
Disk failures
Human failures
What mechanisms are required
NVRAM for power failures
Hot swappable capability
Monitoring hardware
RAID reconstruction
Reconstruction during operation
What happens if a reconstruction fails?
What happens if the OS crashes during a reconstruction?
23
Next Generation: FLASH
Flash chip density increases on the Moore’s law curve
1995 16 Mb NAND flash chips
2005 16 Gb NAND flash chips
2009 64 Gb NAND flash chips
Doubled each year since 1995
Market driven by Phones, Cameras, iPod,…
Low entry cost: ~$30/chip → ~$3/chip
2012: 1 Tb NAND flash (Samsung prediction)
= 128 GB per chip
= a 1TB or 2TB “disk” for ~$400
or a 128GB disk for $40
or a 32GB disk for $5
24
What’s Wrong With FLASH?
Expensive: $/GB
About 2x cheaper than cheap DRAM
50x more than disk today, may drop to 10x in 2012
Limited lifetime
~100k to 1M writes / page (single cell)
~15k to 1M writes / page (single cell)
Requires “wear leveling”
But if you have 1,000M pages, then it takes 15,000 years to “use” ½ the pages
Current performance limitations
Slow to write: can only write 0’s, so erase (set all bits to 1) then write
Large segments (e.g. 128KB) must be erased at a time
25
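A toy model of the erase-before-write constraint: programming can only clear bits (1 to 0), so turning any bit back to 1 requires erasing the whole, much larger, erase block first. The page and block sizes are illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE       2048       /* illustrative NAND page       */
#define PAGES_PER_BLOCK 64         /* 64 * 2KB = 128KB erase block */

static uint8_t block[PAGES_PER_BLOCK][PAGE_SIZE];

/* Erase: the only way to turn bits back to 1, and it hits the whole block. */
static void erase_block(void)
{
    memset(block, 0xff, sizeof(block));
}

/* Program: can only clear bits (1 -> 0), modeled by AND-ing into the page. */
static int program_page(int page, const uint8_t *data)
{
    for (int i = 0; i < PAGE_SIZE; i++) {
        if (~block[page][i] & data[i])
            return -1;      /* would need a 0 -> 1 change: must erase first */
        block[page][i] &= data[i];
    }
    return 0;
}

int main(void)
{
    uint8_t buf[PAGE_SIZE];
    erase_block();

    memset(buf, 0x00, sizeof(buf));
    printf("first write: %s\n", program_page(0, buf) == 0 ? "ok" : "needs erase");

    memset(buf, 0xff, sizeof(buf));   /* overwrite with 1s: impossible in place */
    printf("overwrite:   %s\n", program_page(0, buf) == 0 ? "ok" : "needs erase");
    return 0;
}
```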
Current Development
Flash Translation Layer (FTL)
Remapping
Wear-leveling
Write faster
Form factors
SSD
USB, SD, Stick,…
PCI cards
Performance
Fusion-IO cards achieve 200K IOPS
26
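A very simplified sketch of the FTL idea: logical pages are remapped to a free, least-worn physical page, so overwrites land on fresh pages instead of forcing in-place erases. The structures and names are made up for illustration; real FTLs also do garbage collection, mapping persistence, and much more.

```c
#include <stdio.h>

#define NPAGES 8                     /* tiny flash with 8 physical pages */

static int map[NPAGES];              /* logical page -> physical page    */
static int erase_count[NPAGES];      /* wear per physical page           */
static int in_use[NPAGES];

/* Pick the free physical page with the lowest erase count (wear leveling). */
static int pick_least_worn_free(void)
{
    int best = -1;
    for (int p = 0; p < NPAGES; p++)
        if (!in_use[p] && (best < 0 || erase_count[p] < erase_count[best]))
            best = p;
    return best;
}

/* "Write" a logical page: redirect it to a fresh physical page and retire
   the old copy, which will be erased (and worn) when reclaimed. */
static void ftl_write(int logical)
{
    int target = pick_least_worn_free();
    if (map[logical] >= 0) {             /* old copy becomes garbage */
        in_use[map[logical]] = 0;
        erase_count[map[logical]]++;
    }
    map[logical] = target;
    in_use[target] = 1;
    printf("logical %d -> physical %d\n", logical, target);
}

int main(void)
{
    for (int i = 0; i < NPAGES; i++) map[i] = -1;
    ftl_write(0);                        /* first write               */
    ftl_write(0);                        /* overwrite lands elsewhere */
    ftl_write(0);
    return 0;
}
```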
Summary
Disk is complex
Disk areal density is on the Moore’s law curve
Need large disk blocks to achieve good throughput
OS needs to perform disk scheduling
RAID improves reliability and throughput at a cost
Careful designs to deal with disk failures
Flash memory has emerged at low and high ends
27