
Understanding the Impact of Power Loss on Flash Memory

Hung-Wei Tseng Laura M. Grupp Steven Swanson


The Department of Computer Science and Engineering
University of California, San Diego
{h1tseng,lgrupp,swanson}@cs.ucsd.edu

Abstract

Flash memory is quickly becoming a common component in computer systems ranging from music players to mission-critical server systems. As flash plays a more important role, data integrity in flash memories becomes a critical question. This paper examines one aspect of that data integrity by measuring the types of errors that occur when power fails during a flash memory operation. Our findings demonstrate that power failure can lead to several non-intuitive behaviors. We find that increasing the time before power failure does not always reduce error rates and that a power failure during a program operation can corrupt data that a previous, successful program operation wrote to the device. Our data also show that interrupted program operations leave data more susceptible to read disturb and increase the probability that the programmed data will decay over time. Finally, we show that incomplete erase operations make future program operations to the same block unreliable.

Categories and Subject Descriptors
B.3.4 [Memory Structures]: Reliability, Testing, and Fault-Tolerance

General Terms
Experimentation, Measurement, Performance, Reliability

Keywords
flash memory, power failure, power loss

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2011, June 5-10, 2011, San Diego, California, USA.
Copyright 2011 ACM 978-1-4503-0636-2 ...$10.00.

1. INTRODUCTION

As flash-based solid-state drives (SSDs) make inroads into computer systems ranging from data centers to sensor networks, the integrity of flash memory as a storage technology becomes increasingly important. A key component of that integrity is what happens to the data on an SSD when power failure occurs unexpectedly.

Power loss in flash is potentially much more dangerous than it is for conventional hard drives. If power fails during a write to a hard drive, the data being written may be irretrievable, but the other data on the disk remains intact. However, SSDs use complex flash translation layers (FTLs) to manage the mapping between logical block addresses and physical flash memory locations. FTLs must store metadata about this mapping in the flash memory itself, and, as a result, corruption of the storage array can potentially render the entire drive inoperable: Not only will the in-progress write not succeed, but all the data on the drive may become inaccessible.

To ensure reliability, system designers must engineer the SSDs to withstand power failures and the resulting data corruption. To do this, they must understand what kinds of corruption power failure can cause.

This paper characterizes the effect of power failure on flash memory devices. We designed a testing platform to repeatedly cut power to a raw flash device during program and erase operations. Our data show that flash memory's behavior under power failure is surprising in several ways. First, operations that come closer to completion do not necessarily exhibit fewer bit errors. Second, power failure not only results in failure of the operation in progress, it can also corrupt data already present in the flash device. Third, power failure can negatively impact the integrity of future data written to the device. Our results point out potential pitfalls in designing flash file systems and the importance of understanding failure modes in designing embedded storage systems.

The rest of this paper is organized as follows: Section 2 describes the aspects of flash memory pertinent to this study. Section 3 describes our experimental platform and methodology for characterizing flash memory's behavior during power failure. Section 4 presents our results and describes the sources of data corruption due to power failure. Section 5 provides a summary of related work to put this project in context, and Section 6 concludes the paper.

2. FLASH MEMORY

Flash memory has several unique characteristics that make power failure particularly dangerous. The long latency of program and erase operations presents a large "window of vulnerability," and the complex programming algorithms and data encoding schemes they employ can lead to non-intuitive failure modes. This section presents a summary of flash's characteristics that are most pertinent to this work.

Flash memory stores data by trapping electrons using a floating gate transistor. The electrons affect the transistor's threshold voltage, and the chip measures this change to read data from the cell. The flash chip organizes cells into pages (between 2 and 8 KB) and pages into blocks (between 32 and 256 pages). Erasing a block sets all the bits to '1'. Finer grain erasure is not possible. Programs operate on pages and convert 1s to 0s. To hide the difference in granularity between programs and erases and increase reliability, SSDs use complex flash translation layers (FTLs) to perform out-of-place update and remapping operations. To support these functions, FTLs store metadata in the flash storage array along with the user data.

Table 1: The mapping of voltage level and logic bits in 2-bit MLC chips using Gray coding and 2's complement coding.

Voltage    Gray coding              2's complement coding
level      1st page    2nd page     1st page    2nd page
           bit         bit          bit         bit
Lowest     1           1            1           1
           1           0            1           0
           0           0            0           1
Highest    0           1            0           0

Table 2: Parameters for the 11 flash devices we studied in this work. A dash marks a technology node that the table leaves blank.

Abbrev.      Manufacturer   Cell Type   Cap. (GBit)   Tech. Node (nm)   Page Size (B)   Pgs/Blk
A-SLC2       A              SLC         2             -                 2048            64
A-SLC4       A              SLC         4             -                 2048            64
A-SLC8       A              SLC         8             60                2048            64
B-SLC2       B              SLC         2             50                2048            64
B-SLC4       B              SLC         4             72                2048            64
E-SLC8       E              SLC         8             -                 2048            64
A-MLC16      A              MLC         16            -                 4096            128
B-MLC32-2    B              MLC         32            34                4096            256
D-MLC32      D              MLC         32            -                 4096            128
E-MLC8       E              MLC         8             -                 4096            128
F-MLC16      F              MLC         16            41                4096            128

Program and erase operations are iterative. Program operations selectively inject electrons into floating gates to change the threshold voltage and then perform a read-verify operation to check if the cells have reached the desired threshold voltage. If any of the cells in a page has not reached the target threshold voltage, the chip will repeat the program and read-verify process [15, 6]. For erase operations, the chip removes the electrons from cells within the block. The chip will continue to remove electrons until the voltages of cells reach the erased state.
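
The program and read-verify loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the algorithm any of the tested chips uses: the voltage units, the step size, the pulse limit, and the names `program_page` and `cut_after` are all made up for illustration. It shows why cutting power partway through leaves cells short of their target.

```python
import random
from typing import List, Optional


def program_page(cells: List[float], targets: List[float], step: float = 0.25,
                 max_pulses: int = 64, cut_after: Optional[int] = None,
                 seed: int = 0) -> List[float]:
    """Toy program-and-verify loop; returns the final threshold voltages.

    cells     -- starting threshold voltage of each cell (arbitrary units)
    targets   -- voltage each cell must reach; a target at or below the
                 starting voltage means the cell keeps its erased value
    cut_after -- hypothetical power cut: stop after this many pulses
    """
    rng = random.Random(seed)
    cells = list(cells)
    for pulse in range(max_pulses):
        if cut_after is not None and pulse >= cut_after:
            break  # power lost: cells keep whatever charge they have so far
        # Read-verify: find the cells that are still below their target.
        pending = [i for i, (v, t) in enumerate(zip(cells, targets)) if v < t]
        if not pending:
            break  # every cell verified, so the operation is complete
        # Program pulse: each pending cell gains a slightly different amount.
        for i in pending:
            cells[i] += step * rng.uniform(0.5, 1.5)
    return cells


targets = [2.0, 0.0, 2.0, 2.0]  # the second cell stays erased
full = program_page([0.0] * 4, targets)
cut = program_page([0.0] * 4, targets, cut_after=2)
print(all(v >= t for v, t in zip(full, targets)))  # True
print([v >= t for v, t in zip(cut, targets)])      # [False, True, False, False]
```

In this model an interrupted operation can only leave cells under-programmed. The measurements in Section 4 show that real chips behave in more complicated ways.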
There are two types of flash cells: single-level cell (SLC) and multi-level cell (MLC). SLC devices store one bit per cell, while MLC devices store two or more. SLC chips provide better and more consistent performance than MLC chips. According to empirical measurements in [3], it takes an SLC chip 20 µs to perform a read operation, 200 µs to perform a program operation, and 400 µs to 2 ms to perform an erase operation.

MLC chips achieve higher densities by using 2^n threshold voltage levels to represent n bits. MLC chips need 300 µs to 2 ms to perform a program operation, and 2 ms to 4 ms to perform an erase operation. In this paper, we focus on 2-bit MLC cells, since they are most prevalent in current systems.

For 2-bit MLC chips, cells store data of two different pages. Manufacturers require that pages within a block be programmed in order, so to differentiate between the two pages in a cell, we refer to them as "first page" and "second page." Programming a second page is consistently slower than programming a first page, since programming the second page requires a more complex programming algorithm. Table 1 shows the mappings between threshold voltages and logic bits of a 2-bit MLC cell using Gray coding and 2's complement coding.

3. METHODOLOGY

To study the effect of power failure on flash memory, we built a test platform that allows us to issue commands to raw flash chips and to cut off the power supply to the chip at precise moments. This section describes our test bed, testing methodology, and the flash chips we used in this study.

3.1 Experimental hardware

For this work, we built a test platform that consists of three components: the Xilinx XUP board, a custom flash testing board, and the power control circuit.

The FPGA on the Xilinx XUP board contains a PowerPC 405 core running Linux. A custom flash controller on the FPGA provides us direct access to the flash device via the flash testing board. The FPGA also controls the power to the flash chips by means of a pair of high-speed power transistors. Measurements with an oscilloscope show that the system can switch the chip's power supply to 0 V within 3.7 µs.

3.2 Test procedure

To test the impact of power failure during program and erase operations, we cut power to flash at different points during the operation. We define the power cut off interval as the time between issuing the command to the flash chip and when we trigger the power cut off circuit. We start the cut off interval after sending the last byte of the command to the flash chip. High-resolution measurements of the chips' power consumption show that the chip starts executing the command within a few microseconds.

For program tests, we use cut off intervals varying from 0.4 µs to 2.4 ms at increments of 0.4 µs. For erase, we use power cut off intervals varying from 2 µs to 4.8 ms at increments of 2 µs.

3.3 Flash devices

The behavior of flash memory chips from different manufacturers varies because of architectural differences within the devices and because of differences in manufacturing technologies. To understand the variation in power failure performance, we selected 11 chips that cover a variety of technologies and capacities.

Table 2 lists the flash memory chips that we studied in this work. They come from five different vendors. Their capacities range from 2 GBits to 32 GBits and their feature sizes range from 72 nm to 34 nm. Values that are not publicly available from the manufacturer are from [3].

4. EXPERIMENTAL RESULTS

We found unexpected behavior for both program and erase operations in the presence of power failure. For both program and erase, the variation in bit error rate as the power cut off interval changes is non-monotonic, and our measurements show that power loss can lead to both immediate and long-term data integrity issues. We describe the results for each operation in turn.

4.1 Program and power failure

To understand the impact of power failure during programming, we begin by programming random data and cutting off power at different intervals. Then, we measure the resulting bit error rate. Figure 1 contains the results for SLC chips (a) and MLC chips (b).

Intuitively, the more time we give the flash chip to program a page before power failure, the fewer errors there should be. However, the graphs show that the bit error rate does not decrease monotonically. Instead, the bit error rate for each chip has multiple plateaus, where the error rate remains constant, and spikes, where the bit error rate increases briefly.

For example, the error rate for E-SLC8 jumps dramatically at 30 µs, then drops slowly until 75 µs, when it plummets to nearly zero. The other SLC chips exhibit much more predictable behavior.

MLC behavior is much more complex. For example, B-MLC32-2's error rate remains constant at 50% until 100 µs and then drops sharply to 25% by 110 µs, where it remains until 200 µs. The error rate starts increasing at 200 µs and reaches 29% at 290 µs.
The error rate decreases again after 360 µs and then stays at 25% until 500 µs. After 500 µs, the error rate decreases in steps until it reaches 0 at 1400 µs. The chip also shows numerous spikes in error rate, for example, at 540 µs.

Figure 1: The bit error rate of program operations with different power cut off intervals for (a) SLC chips and (b) MLC chips. [Plots omitted: bit error rate versus power cut off interval (µs), one curve per chip.]

These results are unexpected because programming flash chips should only be able to move bits from 1 to 0, yet the non-monotone error rates suggest that program operations are moving cells in both directions at different times. Below, we investigate this behavior in finer detail.
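
One way to make this observation concrete is to split the bit errors of a read-back page by direction and to look for points where the error rate rises with a longer cut off interval. The sketch below is illustrative only: the function names are hypothetical, and it assumes a simple model in which each stored bit should only ever change from 1 to 0, which ignores how MLC chips map two pages onto one cell.

```python
from typing import List, Tuple


def bit_error_breakdown(expected: bytes, observed: bytes) -> Tuple[int, int, float]:
    """Return (still_one, became_zero, bit_error_rate) for one page."""
    if len(expected) != len(observed):
        raise ValueError("pages must have the same length")
    still_one = 0    # wanted 0, read 1: the program had not finished
    became_zero = 0  # wanted 1, read 0: not explained by a 1-to-0-only model
    for e, o in zip(expected, observed):
        still_one += bin(~e & o & 0xFF).count("1")
        became_zero += bin(e & ~o & 0xFF).count("1")
    total_bits = 8 * len(expected)
    ber = (still_one + became_zero) / total_bits if total_bits else 0.0
    return still_one, became_zero, ber


def rising_points(curve: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (interval_us, ber) points whose BER exceeds the previous point's."""
    return [cur for prev, cur in zip(curve, curve[1:]) if cur[1] > prev[1]]


print(bit_error_breakdown(bytes([0b00001111]), bytes([0b00111100])))  # (2, 2, 0.5)

# Error rates quoted in the text for B-MLC32-2; the sampling points are illustrative.
curve = [(100.0, 0.50), (110.0, 0.25), (200.0, 0.25), (290.0, 0.29)]
print(rising_points(curve))  # [(290.0, 0.29)]
```

A non-empty result from `rising_points` marks exactly the kind of non-monotonic step discussed here.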

4.1.1 Per-page MLC error rates

To understand the cause of MLC chips' non-monotonic error rates, we examine the behavior of pairs of first and second pages in more detail. A pair of corresponding bits from the two pages can be in four states: 01 (i.e., the first page bit is '0' and the second page bit is '1'), 00, 11, and 10. We used program operations to move the pair between those states and interrupted the power to see which intermediate states the cell passes through. We consider four transitions: (1) 11→01: we program the first page bit to 0 from the erased state. (2) 01→00: we program a 0 to the second page bit after programming a 0 to the first page bit. (3) 01→01: we program a 1 to the second page bit after programming a 0 to the first page bit (intuitively, this should cause no change). (4) 11→10: we program a 0 to the second page bit from the erased state. Other transitions are not possible because we must program the first page first, and because programs can only convert 1s to 0s. For 01→00 and 01→01, we only cut off power while programming the second page.

Figure 2: Cell state breakdown for B-MLC32-2 for (a) 11→01, (b) 01→00, (c) 01→01, (d) 11→10 transitions. The results show that even for seeming no-ops, cells may pass through multiple states. [Plots omitted: cell state distribution (states 11, 10, 00, 01) versus power cut off interval (µs).]

Figure 2 shows the experimental results for B-MLC32-2. For each graph in Figure 2, the x-axis shows the power cut off interval, and the y-axis depicts the distribution of cells for four different states in a block. Figure 2(a) plots the distribution of cell states for the 11→01 transition. The graph shows the shift in state between 0 and 220 µs, but it is not a smooth transition: There are two clear steps with spikes that suggest that some cells temporarily move from state 11 to state 01 during programming. The chip reads the cells as state 01 because the second page bits are not programmed yet, but the voltage levels are actually at state 00 instead of state 01 at this point.

Figure 2(b) provides some additional insight into this behavior. It shows the graph for the 01→00 transition. The cell states remain at 01 until 300 µs, when they all instantly become 00. This instantaneous change of cell states indicates, we believe, that the chip switches reference voltages at this point to a new reference that allows the chip to distinguish state 00 from state 01. Since all the cells move to state 00 immediately after the chip applies a new
reference, it appears that the cells were already at a voltage level corresponding to a 00 state after programming the first page was complete.

Figure 2(c) shows the result when we try to perform a 01→01 transition, a seeming no-op. However, if power cuts off between 250 and 1000 µs, a large fraction of the cells will be in state 00 and some may be in state 10.

In Figure 2(d), we make a 11→10 transition. Though we only change the second page bits, the cells move through all possible states during the operation. The chip changes state from 11 to 10 (a shift of one voltage level) during 200 µs to 600 µs. Between 500 µs and 900 µs, it seems to adjust the reference voltage to differentiate between states 00 and 01, which results in the abrupt transitions from state 00 to state 10. Then it applies a new reference voltage at 900 µs. The chip can differentiate state 10 from state 00 with the new threshold voltage. This also causes the transition of cell states from 00 to 10.

4.1.2 Retroactive data corruption

The unpredictable effect of power loss during an MLC program operation demonstrated above makes it clear that SSDs must assume that data written during an interrupted program is corrupt. However, the data in Figure 2 also show something more dangerous: Power failure while programming a second page can corrupt data that the chip successfully programmed into a first page. We call this effect retroactive data corruption.

Figure 2(d) demonstrates the phenomenon. We expect the program operation to move the cell from 11 to 10, leaving the first page's data untouched. However, we can find cells in any of the four states depending on when power failure occurred.

Figure 3 illustrates this effect in more detail. In this graph, we first program random data to first page bits in B-MLC32-2 and E-MLC8 without power failure. Then we cut off power when we program the corresponding second page bits with random data. The x-axis shows the power cut off intervals for second pages, and the y-axis shows the bit error rates for the first pages. For B-MLC32-2, the bit error rate of the first page reaches 25% with power cut off interval between 200 µs and 900 µs even though the program operation of the first page completed successfully! For E-MLC8, the retroactive data corruption effect is more serious. The bit error rate can reach 50% if the power cut off interval for the second page is between 50 µs and 100 µs.

Figure 3: A power failure while programming a second page can corrupt data programmed to the corresponding first page, even if the first page program completed without interruption. [Plot omitted: first-page bit error rate versus power cut off interval (µs) for B-MLC32-2 and E-MLC8.]

Flash device datasheets make no mention of this phenomenon, so FTL designers may assume that once a program operation completes, the data will remain intact regardless of any future failures. This assumption is incorrect for MLC devices. Since retroactive data corruption can affect both user data and FTL metadata, it poses a serious threat to SSD reliability.

Besides backup batteries and capacitors, SSDs can take (at least) three steps to mitigate the effects of retroactive data corruption. First, the FTL could program corresponding pairs of first and second pages together and treat them as a logical unit. If programming the second page failed, the FTL would consider them to both have failed. This is not as easy as it sounds, however, since first and second pages are not logically consecutive within the flash block: In most cases the first page at address n is paired with the second page at address n+6. Flash device datasheets require programming pages in a block in order because of the flash array organizations. However, our experiment shows that for chips like E-MLC8, programming the first and second pages together does not increase the program error rate.

Second, since the retroactive data corruption never affects the second page, the FTL could protect its metadata by storing it solely in second pages. While it would leave user data exposed to retroactive data corruption, the SSD would, at least, remain operational.

Third, the FTL could adopt a specialized data encoding that would avoid the cell state transitions that can lead to retroactive corruption. For E-MLC8, corruption occurs only when making a 01→01 transition. Sacrificing some bit capacity and applying some simple data coding techniques could prevent that transition from occurring. However, for B-MLC32-2, this scheme does not work since the retroactive data corruption happens in all the cases where the first page bit is 0.

4.1.3 Read disturb sensitivity

Power failure can also affect the data integrity of programmed data by making it more susceptible to other flash failure modes. In this section, we examine the relationship of power failure and read disturb.

Read disturb arises because reading data from a flash array applies weak programming voltages to cells in pages not involved in the read operation. Measurements in [3] show that it typically takes several million read operations to cause significant errors due to read disturb.

Figure 4 compares the read disturb sensitivity of pages programmed to completion (i.e., no power cut off) and pages programmed with a power cut off interval of 1.35 ms using B-MLC32-2. For that interval, reading the page back reveals no errors.

For both sets of pages, the error rate starts at 0 after programming. For the completely programmed page, errors from read disturb appear after 2.8 million reads. For the partially programmed page, errors appear after just 1000 reads and the error rate rises quickly to 3.1 × 10^-3. It appears that the power failure prevents the program operation from completely tuning the voltage level on some of the cells, leaving them susceptible to read disturb.

Figure 4: Incompletely programmed pages are more susceptible to read disturb than completely programmed pages. [Plot omitted: bit error rate versus number of reads for a program without power failure and a program with a power cut off interval of 1.35 ms.]

This effect is potentially dangerous, especially given the very steep increase in error rate. A common approach to dealing with
read disturb is to copy data to a fresh page when errors begin to appear. However, if the error rate rises too quickly, the data may become uncorrectable before this can occur. The flash memory controller should copy the data programmed under power failure to a fresh page as soon as possible.

Figure 5: Baking chips to accelerate aging reveals that power failure during program operations reduces the long-term reliability of data stored in flash chips. [Plots omitted: panels (a), (b), (c) show bit error rate versus accelerated age in years (0 to 10) for barely, partially, and fully programmed data.]

Figure 6: Bit error rates after an interrupted erase operation are well-behaved for SLC devices (a), but MLC behavior is much more complex (b). [Plots omitted: bit error rate versus power cut off interval (µs), one curve per chip.]
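
The relocation advice at the end of Section 4.1.3 can be sketched as a small policy function. This is a minimal illustration, not a description of any real controller: `PageStatus`, its fields, and the `margin` threshold are hypothetical, and the sketch assumes the FTL can tell (for example, from its own log) which pages were being programmed when power failed.

```python
from dataclasses import dataclass


@dataclass
class PageStatus:
    corrected_bits: int        # bit errors ECC repaired on the most recent read
    ecc_limit_bits: int        # most bit errors ECC can repair in one page
    program_interrupted: bool  # power failed while this page was being programmed


def should_relocate(page: PageStatus, margin: float = 0.5) -> bool:
    """Decide whether to copy a page to a fresh location."""
    if page.program_interrupted:
        # Relocate at once, even if the page currently reads back cleanly.
        return True
    # Ordinary read-disturb handling: move the data once errors start to appear.
    return page.corrected_bits >= margin * page.ecc_limit_bits


print(should_relocate(PageStatus(0, 8, True)))   # True
print(should_relocate(PageStatus(1, 8, False)))  # False
print(should_relocate(PageStatus(4, 8, False)))  # True
```

The first branch reflects the measurement that a page programmed with a 1.35 ms cut off interval reads back without errors yet degrades after about 1000 reads, so waiting for errors to appear may be too late.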

4.1.4 Data Retention

Programming data with power failure may also reduce the long-term stability of the data stored in the flash chip. We use a laboratory oven to accelerate the aging effect of flash memory chips up to 10 years. Following the JESD22-A113 standard [5], we bake the chips for 9 hours and 20 minutes at 125°C to achieve the aging of one year. For each chip, we programmed 5 blocks for each of the three conditions: (1) Barely programmed: the power cut off interval is as short as possible without resulting in increased bit error rates (for some MLC chips, the programmed error rate is never zero). (2) Fully programmed: the program operation completes without power failure. (3) Partially programmed: the power cut off interval is halfway between barely programmed and fully programmed.

Figure 5 shows the result for (a) B-MLC32-2, (b) E-MLC8, and (c) B-SLC4. The x-axis of the graph is accelerated age in years. The y-axis shows the average bit error rates in a block after aging for each accelerated year. We slightly shift the points horizontally for partially programmed and fully programmed results to make the error bars visible. B-MLC32-2 does not exhibit any relationship between power failure and data retention. However, for E-MLC8, the effect is clear. After 10 years, the error rate for barely programmed data is 1.91 × 10^-7, compared with 4.77 × 10^-8 for partially programmed data and 0 for fully programmed data. B-SLC4 shows the robustness common among SLC chips. The chip only shows errors for the barely programmed case.

4.2 Erase and power failure

Erase operations are subject to a different set of reliability requirements than program operations. While it is important that an erase operation reliably write a '1' to every bit in a block, it is equally important that the erase prepare the block properly for the program operations that will follow it. We investigate both aspects of this reliability below.

4.2.1 Erasing bits

Figure 6 presents the bit error rates of erase operations for different power cut off intervals for (a) SLC chips and (b) MLC chips. For each test, we initially programmed the block with random data.

Behavior is similar among the SLC chips that we tested. The bit error rate stays at 50% (since the block contains random data) for between 50 and 240 µs, after which all the cells become erased in less than 10 µs. However, chips do not report that the command has completed until 400 µs to 2 ms.

For MLC chips, the bit error rate is, once again, non-monotone. For all chips except E-MLC8, it reaches as high as 75%. The timing of MLC chips also varies among different models: It takes between 50 µs and 475 µs for every cell to become 1. However, these chips report that the erase command completes after 2 to 4 ms.

4.2.2 Programming blocks after an erase failure

In previous experiments, we found that erases appear to be complete long before the chip reports that they are finished. Based on discussions with flash manufacturers, we suspect that the chip spends this time fine-tuning the voltage levels in the cells. Presumably, this step is important (or the manufacturers would save time and skip it), so it is important to understand what impact cutting it short will have.

To measure this effect, we cut power during erase operations, performed a complete program operation, and then read back the data to measure the bit error rate.

Figure 7 shows the results for (a) SLC and (b) MLC chips. The x-axis measures the power cut off interval of the previous erase operation and the y-axis depicts the bit error rates of the later programming operations. We start the experiment at 300 µs for SLC and 500 µs for MLC, since by this time, the erase appears to be complete. In both cases, interrupting an erase operation reduces reliability for future program operations to the block. For SLC and most MLC chips, the error rate rises from 0 to between 0.4% and 0.9%. For B-MLC32-2, the program error rate is never zero, and the bit error rate rises from 1.2 × 10^-7 to 0.2%. For E-MLC8, the
program error rate varies between 0.4% and 0% for power cut off intervals between 1747 and 2062 µs. The program error rate of A-MLC16 also bounces for power cut off intervals between 2640 µs and 2814 µs. These frequent variations cause the two vertical bands in the graph.

Figure 7: Bit error rates after an interrupted erase operation are well-behaved for SLC devices (a), but MLC behavior is much more complex (b). [Plots omitted: bit error rate (log scale) versus power cut off interval (µs), one curve per chip.]

5. RELATED WORK

Flash manufacturers provide limited information about many aspects of their chips, including their behavior under power loss. Our work is similar in spirit to Grupp et al. [3] in that we empirically quantify flash behavior in order to better understand the opportunities and design challenges that it presents. In addition to chip level performance, Boboila et al. [1] also explored device-level characteristics including timing, endurance, and FTL designs. However, neither of the above works focuses on the power failure behavior of flash memory chips.

Many high-end SSDs have backup batteries or capacitors to ensure that operations complete even if power fails. Our results argue that these systems should provide power until the chip signals that the operation is finished rather than until the data appears to be correct. Low-end SSDs and embedded systems, however, often do not contain backup power sources due to cost or space constraints, and these systems must be extremely careful to prevent data loss and/or reduced reliability after a power failure.

Existing work on recovery from power failure aims to restore or repair flash file systems using logs and other techniques [16, 4, 12, 10] or page-level atomic writes [7, 2]. Numonyx [11] also provides guidelines to repeat interrupted operations after power failure. These designs may work for SLC chips, but the retroactive data corruption we observed for MLC chips suggests that they will be less effective there.

Some commercial systems avoid retroactive corruption by treating a block as the basic unit of atomic writes [9, 13, 14]. This approach is inefficient for small writes since it requires re-writing at least one entire block for each write operation. We described several alternative solutions.

System software and embedded applications are critical in dealing with reliability issues like power loss. Kim et al. [8] designed a software framework to mimic faults, including power failure, in flash memory. This work (and the results in [3]) demonstrates that flash has many non-intuitive error modes, so fault-injection frameworks require input from real hardware measurements to ensure the faults they inject are representative of what can occur in real systems.

6. CONCLUSION

The flash memory devices we studied in this work demonstrated unexpected behavior when power failure occurs. The error rates do not always decrease as the operation proceeds, and power failure can corrupt the data from operations that completed successfully. We also found that relying on blocks that have been programmed or erased during a power failure is unreliable, even if the data appears to be intact.

7. REFERENCES

[1] S. Boboila and P. Desnoyers. Write endurance in flash drives: measurements and analysis. In FAST '10: Proceedings of the 8th USENIX conference on File and storage technologies, pages 9–9, Berkeley, CA, USA, 2010. USENIX Association.
[2] T.-S. Chung, M. Lee, Y. Ryu, and K. Lee. Porce: An efficient power off recovery scheme for flash memory. Journal of Systems Architecture, 54(10):935–943, 2008.
[3] L. Grupp, A. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. Siegel, and J. Wolf. Characterizing flash memory: Anomalies, observations, and applications. In MICRO-42: 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 24–33, Dec. 2009.
[4] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings. In ASPLOS '09: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems, pages 229–240, 2009.
[5] JEDEC. Preconditioning of Plastic Surface Mount Devices Prior to Reliability Testing. http://www.jedec.org/sites/default/files/docs/22a113F.pdf.
[6] T.-S. Jung, Y.-J. Choi, K.-D. Suh, B.-H. Suh, J.-K. Kim, Y.-H. Lim, Y.-N. Koh, J.-W. Park, K.-J. Lee, J.-H. Park, K.-T. Park, J.-R. Kim, J.-H. Yi, and H.-K. Lim. A 117-mm2 3.3-v only 128-mb multilevel nand flash memory for mass storage applications. IEEE Journal of Solid-State Circuits, 31(11):1575–1583, Nov. 1996.
[7] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho. A space efficient flash translation layer for compactflash systems. IEEE Transactions on Consumer Electronics, 48:366–375, 2002.
[8] S.-K. Kim, J. Choi, D. Lee, S. H. Noh, and S. L. Min. Virtual framework for testing the reliability of system software on embedded systems. In SAC '07: Proceedings of the 2007 ACM symposium on Applied computing, pages 1192–1196, New York, NY, USA, 2007. ACM.
[9] K. Y. Lee, H. Kim, K.-G. Woo, Y. D. Chung, and M. H. Kim. Design and implementation of mlc nand flash-based dbms for mobile devices. Journal of Systems and Software, 82(9):1447–1458, 2009.
[10] P. March. Power Loss Recovery (PLR) for cell phones using NAND Flash memory. http://www.numonyx.com/en-US/ResourceCenter/SoftwareArticles/Pages/PLRforNAND.aspx.
[11] Numonyx. How to operate Power Loss Recovery for the Numonyx 65nm Flash Memory Devices. www.numonyx.com/Documents/Applications_Operation.pdf.
[12] C. Park, P. Talawar, D. Won, M. Jung, J. Im, S. Kim, and Y. Choi. A High Performance Controller for NAND Flash-based Solid State Disk (NSSD). In NVSMW '06: Non-Volatile Semiconductor Memory Workshop, 2006, pages 17–20, Feb. 2006.
[13] S. Park, J. H. Yu, and S. Y. Ohm. Atomic write FTL for robust flash file system. In ISCE '05: Proceedings of the Ninth International Symposium on Consumer Electronics, 2005, pages 155–160, June 2005.
[14] F. M. Systems. Power failure prevention, recovery, and test. http://www.fortasa.com/Attachments/040_Power Failure Corruption Prevention.pdf.
[15] K. Takeuchi, T. Tanaka, and T. Tanzawa. A multipage cell architecture for high-speed programming multilevel NAND flash memories. IEEE Journal of Solid-State Circuits, 33(8):1228–1238, Aug. 1998.
[16] D. Woodhouse. JFFS2: The Journalling Flash File System, version 2. http://sources.redhat.com/jffs2/.
