Dynamic SDRAM SEFI Detection and Recovery Test Results
Steven M Guertin, Jeffrey D Patterson, Duc N Nguyen
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California
Abstract-Single Event Functionality Interrupt (SEFI) results are
presented for Hynix SDRAMs. SEFI response threshold is below LET
9.9 MeV-cm2/mg and saturated cross section is 6e-5 cm2. Dynamic SEFI
identification was made, and in-situ recovery restored functionality.
Verification results of the identification algorithm are presented.
An observed high current radiation response is also presented.

Keywords-SEFI, SEU, SDRAM, Radiation Effects

The research in this paper was carried out at the Jet Propulsion
Laboratory, California Institute of Technology, under contract with
the National Aeronautics and Space Administration (NASA).

I. INTRODUCTION
This paper covers single event functionality interrupts (SEFIs) on
Hynix/Hyundai HY57V654020B SDRAMs. Although the data is provided,
there are really two main issues of concern in this paper. The first
is dynamic SEFI identification and recovery of the device after such
an event. The second is observation of a SEFI mode whereby the device
becomes unusable.

Device interrupt due to radiation will continue to be a major issue
due to the competing demands on spacecraft architecture. Contributing
factors include minimal power cycling, minimal failure
identification, ever-faster modern device structures, and evolving
system architectures. Designers are not usually interested in what
causes these interrupts, but rather in how to deal with them. In a
broader sense, however, while rates stay low enough and effects stay
manageable enough, empirical results quantifying just how many SEFIs
are occurring are sufficient. In order to spot the trends before
entire device types, or even technologies, no longer have manageable
effects, it is useful to try to second-guess error modes such as
SEFIs and measure their rates even though mitigation schemes are
presently sufficient.

II. BACKGROUND
SEFI is something of a catch-all phrase for device interrupt due to
radiation. Classification systems sometimes separate observable
signatures such as row or column hits, region-type failures, or
current behavior. This work uses a definition of SEFI that is purely
based on observation. If a memory device begins to show an error rate
higher than expected due to uniformly distributed bit upsets, it is a
SEFI. [1-5]

The work herein covers SDRAMs, which continue to be considered for
space applications due to speed and density considerations. The
complexity of mitigation systems for general SDRAM failure continues
to grow due to complex failure modes on these devices. Thus a wide
variety of failures might need mitigation. Using the generic
definition of a SEFI, most such failures are usually detected, and
any additional failures would require special testing.

SDRAMs are among the simplest devices with an internal state machine.
Thus they are among the simplest where dynamic testing is necessary.
Development and tuning of such a dynamic test is a useful way of
isolating SEFIs as they occur in order to improve event and fluence
counting.

SEFI mitigation in SDRAMs is often easy to handle. Koga et al. have
shown that periodically rewriting the mode register of the device
basically removes the possibility of SEFI effects for many SDRAMs.
Developing test data on this phenomenon is therefore not entirely
necessary for particular applications if the circuit is flexible
enough to follow their prescription. To understand the trends and
follow possible scaling, voltage, and technology dependencies of the
underlying phenomenon, masking of SEFIs is not desired. Also, SEFI
susceptibility and behavior in these simpler devices may provide
insight into test methods and mitigation schemes for more complex
devices.

III. THEORY
A particle strike disruptive enough to upset the control structure of
an SDRAM is the primary source of SEFI. There are several internal
mechanisms that can potentially be affected. These include control
latches that handle auto sequencing of data, transfer paths that
might trigger actions not requested, and elements directly affecting
the state machine of the device.

When such a strike occurs, if the result is observable, the output
stream will eventually be affected. Since this effect might be
delayed, a single SEFI might disrupt so much data it is
misinterpreted as multiple SEFIs. Although it can be effective to
look at device snap-shots for increased SEU density, a dynamic
solution is desired.

In this paper an attempt is made to push SEFI identification closer
in time to its occurrence. Since the output stream eventually shows
increased SEU density following SEFI, recovery is triggered upon that
sign. This allows better identification of the fluence to SEFI. It
should be noted there was no attempt to identify SEFIs outside of the
increased SEU density, except if the recovery did not work.

Only the recovery mechanism remains for examination here. The first
issue concerning recovery is returning the
Upset-based SEFIs were not the only type seen. The tested Hyundai
devices also showed a high current mode. When that was observed
during testing, modifications were made to the test system because
these high current pseudo-SEFIs (since they didn’t always recover)
occurred more readily when the test program was not exercising the
DUT.

V. TEST DETAILS
Two sets of test devices were used. All test devices were Hyundai
HY57V654020B SDRAMs. For the first set, five bare dies were bonded
into ceramic bond-out packages. The second set of devices, which
totaled three DUTs, was based on plastic encapsulated parts that were
either partially or completely acid decapsulated. Two were partial,
and were thus expected to be useful for qualitative system checks
since they have either plastic or lead-frame covering much of their
dies. The final device was completely decapsulated and rebonded to
mimic the devices in the first set.

SEFIs are expected to occur in the control circuitry only. Hence, it
is expected that the sensitive regions of the devices are the control
structures near the bond pads at the axial spine of the device, and
therefore devices might not require full decapsulation.
Unfortunately, these SDRAMs are known to behave very differently
across both lot/date codes, and diffusion lots, and these
categorizations are different for the two sets, so the test behavior
was not expected to be identical.

The devices are arranged with 12 row bits, 10 column bits, and 2 bank
bits. They have a 4-bit-wide data word. The usual testing arrangement
was 12 row bits by 3 column bits by 2 bank bits, which reduced full
cycle time from minutes to ~1.5 seconds.

Testing was conducted at the Lawrence Berkeley National Laboratory
(LBL), and at Texas A&M (TAM). The results of the initial testing at
LBL prompted the follow-up test at TAM. Basically the LBL results are
the quantitative results while the
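The address arithmetic behind the reduced testing arrangement (12 row, 3 column, and 2 bank bits instead of 12, 10, and 2) can be checked with a short sketch. The function name is illustrative, and the full-cycle estimate assumes that cycle time scales linearly with the number of addresses visited, which the text does not state.

```python
# Address-space arithmetic for the full and reduced test arrangements.
# Bit counts and the ~1.5 s reduced cycle time come from the text; the
# linear scaling of cycle time with address count is an assumption.

WORD_BITS = 4  # 4-bit-wide data word


def address_count(row_bits, col_bits, bank_bits):
    """Number of distinct word addresses for the given bit widths."""
    return 2 ** (row_bits + col_bits + bank_bits)


full_addresses = address_count(12, 10, 2)     # full device
reduced_addresses = address_count(12, 3, 2)   # usual testing arrangement

full_capacity_bits = full_addresses * WORD_BITS
reduction_factor = full_addresses // reduced_addresses

REDUCED_CYCLE_S = 1.5  # "~1.5 seconds" from the text
estimated_full_cycle_s = REDUCED_CYCLE_S * reduction_factor

print(f"full device: {full_addresses} addresses, {full_capacity_bits} bits")
print(f"reduced test: {reduced_addresses} addresses ({reduction_factor}x fewer)")
print(f"estimated full cycle: {estimated_full_cycle_s:.0f} s")
```

Under that assumption the full device works out to 64 Mbit and a full cycle of roughly three minutes, consistent with the "minutes" quoted in the text.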
Ion     LET (MeV-cm2/mg)   Range (µm)
18O      2.2               230
22Ne     3.4               179
40Ar     9.9               129
65Cu    21.6               108
86Kr    30.0               111
136Xe   53.6               110

TABLE II. IONS USED FOR SEFI TESTING AT TAM

Ion     LET (MeV-cm2/mg)   Range (µm)
63C     19.3               141
63C     21.6               104
63C     22.0                96
129Xe   50.8               126
129Xe   60.0                60

Fig. 2 shows the arrangement of test appliances for the testing at
LBL; the TAM version of the setup is the same except that a vacuum
system was not used. The DUTs were tested on a custom SDRAM test
apparatus. Power delivery from an HP6629 power supply was provided on
40-pin ribbon cables. The standard JPL power supply control program
operated, monitored, and logged the power supply. This system has
100 ms observation and shutdown control over the HP6629. For this
testing, the devices were operated at 3.3V, at room temperature, with
a latchup threshold starting at a current of 50mA. During the testing
the latchup threshold was found to be inadequate due to false latchup
events, so the threshold was raised to 100mA, and finally to 500mA.
No latchups were observed, but device currents did wander as high as
250mA.

DUT operation signals originated from the test computer via a custom
PCI digital I/O card. They connected on 40-pin ribbon cables as well.
All signals, including the clock, were generated in the test
computer. Due to the nature of PCI I/O, these signals are not
generally regular, and in some cases have 100 ms of dead time between
signals. In this arrangement, full refresh took about 50 ms of signal
time, and is based on row address refreshing only. Rough study of
these intervals suggests typical refresh is not likely to be longer
than 250 ms. The manufacturer’s specification is 64 ms at 70 degrees
Celsius. Using a derating of 2x per 10 degrees Celsius,

[Figure 2 block diagram. Labeled elements: PCI bus; PLX DIO card
(PLX9050); 40-pin ribbon cables (4), (5), and (1); vacuum bulkhead
adapter; DUT board (1 DUT at a time); HP 6629 power supply (set at
3.3V); GPIB; laptop for monitoring/logging; facility mounting
apparatus; vacuum chamber boundary; annotation "Perpendicular to DUT
board, and device top."]

Figure 2. This is the layout of the test apparatus used. There are
two test PCs (on left) which are connected via the bulk head adapter
to the DUT board (on right).

One of the key goals of the TAM testing was verification of the test
algorithm. To that end, several configuration modifications were
tested. These include enabling device writing during irradiation,
different buffer and SEFI threshold sizes, and different addressing
schemes.

VI. RESULTS OF TESTING
The results presented here are separated into four areas. The first
is the algorithm development and its goals, which establishes the
foundation of the test method. The second is the primary goal of the
testing, which is obtaining SEFI cross section data. The third result
is a quick listing of the single bit upset (SBU) parameters. The
final portion of this section covers the wandering device current
during irradiation (and possible permanent device failure).

A. Algorithm Development
Algorithm development made many assumptions that might affect data
validity. Beyond these assumptions, algorithmic and test variables
might also influence the test results. To address some of these, the
test was modified in a few different ways to look for dependencies on
test conditions. The modifications included changing the definition
of SEFI from (n,N) = (384,1024) to (96,256) and to (1536,4096). Also,
the system was checked to see if writing would change the SEFI cross
section. The final algorithm modification was to change the
addressing of the device to emphasize columns over rows. These
results are presented in Fig. 3. Some other system validation notes
are made in the wandering current portion of the results.
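The (n,N) SEFI definition used in the algorithm development, where a SEFI is declared when n of the previous N addresses show errors, can be sketched as a sliding-window counter. This is a minimal illustration and not the test system's code: the class name `SefiDetector`, the use of "n or more" as the trigger, and clearing the window after a detection (standing in for the recovery step) are all assumptions.

```python
from collections import deque


class SefiDetector:
    """Hypothetical sliding-window detector for the (n, N) SEFI definition."""

    def __init__(self, n=384, N=1024):
        if not 0 < n <= N:
            raise ValueError("require 0 < n <= N")
        self.n = n
        self.N = N
        self.window = deque(maxlen=N)  # error flags for the last N addresses
        self.errors = 0                # running count of errors in the window

    def observe(self, has_error):
        """Record one address read; return True if a SEFI is declared."""
        if len(self.window) == self.N:
            # The oldest flag is about to be pushed out by the append below.
            self.errors -= self.window[0]
        self.window.append(bool(has_error))
        self.errors += bool(has_error)
        if self.errors >= self.n:
            self.reset()  # assumed: recovery restarts the error history
            return True
        return False

    def reset(self):
        self.window.clear()
        self.errors = 0


# Sparse, SEU-like errors (one address in ten) never reach the threshold.
sparse = SefiDetector()
sparse_flags = [sparse.observe(i % 10 == 0) for i in range(5000)]

# A dense burst of errors trips the detector on the 384th erroneous address.
burst = SefiDetector()
burst_flags = [burst.observe(True) for _ in range(384)]
```

The same class covers the other two definitions tested, for example `SefiDetector(96, 256)` or `SefiDetector(1536, 4096)`, which keep the ratio n:N constant.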
B. Normal SEFIs
The primary goal of the test was to produce data on recoverable
SEFIs. The first result along these lines is the absence of any SEFIs
where power cycle was required to return the DUT to normal operation.
Implementing the manufacturer’s warm-up and mode register program
sequence recovered all recoverable SEFIs.

Figure 4. Cross section for SEFIs from data taken at LBL. The spread
at high LETs could be due to device degradation, lot to lot
variations, or even heating and environmental conditions. Note the
worst case curve is based on a data point of no SEFIs at LET=8.4
MeV-cm2/mg courtesy of BAE Systems.

[Plot: "SEFI Behavior Across Devices @ TAM"; y-axis: Cross Section
(cm2), 10-8 to 10-5; x-axis: LET (MeV-cm2/mg), 0 to 70; series:
Device H1, Device H2, Device UCH.]

For the main SEFI results, multiple runs at a single LET are averaged
for each part. This should promote the cancellation of random errors
inherent with particle beam experiments, but still retain the
part-to-part variations known to exist for SEE testing. However, this
technique will not mitigate any biasing that systematic error or
experimental limitations might introduce. The results are shown in
Figs. 4 and 5 for the LBL and TAM testing. Note the large variation
found in both data sets and between the data sets. Also recall that
H1 and H2 are packaged differently than UCH1. The concentration
around LET = 20 MeV-cm2/mg is due to the likelihood of high current
failures in that region.
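The per-part averaging described for the main SEFI results can be sketched as follows. The run records are hypothetical placeholders, not measured data, and a simple arithmetic mean of per-run cross sections (SEFI count divided by fluence) is an assumption about how the runs were combined.

```python
from collections import defaultdict
from statistics import mean


def cross_section(events, fluence):
    """Per-run cross section in cm2: events divided by fluence (ions/cm2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return events / fluence


def average_by_part_and_let(runs):
    """Average per-run cross sections for each (part, LET) pair.

    Runs are grouped by part so that part-to-part variation is kept
    rather than averaged away.
    """
    grouped = defaultdict(list)
    for part, let, events, fluence in runs:
        grouped[(part, let)].append(cross_section(events, fluence))
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical runs: (part label, LET in MeV-cm2/mg, SEFI count, fluence).
runs = [
    ("DUT-A", 21.6, 10, 1.0e6),
    ("DUT-A", 21.6, 30, 1.0e6),
    ("DUT-B", 21.6, 5, 1.0e6),
]
averaged = average_by_part_and_let(runs)

for (part, let), sigma in sorted(averaged.items()):
    print(f"{part} @ LET {let}: {sigma:.2e} cm2")
```

Averaging within a part reduces random run-to-run scatter, but, as the text notes, it does nothing for systematic bias shared by all runs on that part.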
[Figure 6 plot: current stripchart; vertical axis 0 to 140,
horizontal axis Stripchart Time Index (arb. units), 0 to 300.]

Figure 6. During irradiation at TAM under static bias, the device
current was quite erratic (for LETs above 10 MeV-cm2/mg). This
current stripchart shows what such a run looks like. Similar runs
were seen at LBL with some showing currents above 250mA. Two of the
LBL devices failed in response to this behavior.

things. First, the major region of the DUT die that causes SEFI is on
the axial spine of the devices near the bond pads. Second, reading
and writing made no statistically significant difference in SEFI
behavior.

The last major result of the algorithm verification was to show that
the definition of a SEFI was well behaved. Often when examining
phenomena that are defined by a threshold (e.g. earthquakes), the
definition of the object greatly changes how many you see (e.g. if
you labeled a 1.0 or greater “seismic event” as an earthquake, you
would say earthquakes happen all the time). Here a SEFI is defined as
an event where the number of errors in a region of the DUT of length
N reaches n. The fact that the observed SEFI behavior does not depend
much on the selection of n and N is a useful result. It should be
noted, however, that the ratio n:N was held constant. Careful
consideration of the algorithm should show that only n is really
meaningful, but this could have been verified empirically.
C. SEFI Results
The spread in measurements makes it difficult to infer definitive
conclusions from the data set; however, it was decided that a worst
case enveloping curve be used to calculate an upset rate. To do this,
the LBL results were used as the basis since the TAM results were
taken with only qualitative results in mind.

This worst-case curve is also shown in Fig. 4. The behavior of the
enveloping line at the low end is due to the absence of SEFIs (with
more than 1e8 fluence) at LET 3.4 MeV-cm2/mg, and the known threshold
near LET = 8 MeV-cm2/mg [6]. Based on this information, the inferred
rate may or may not be conservative.

A second division in organizing the limitations is to look to the
DUTs themselves. First and foremost are the differing decapsulation
methods. The DUTs marked H1 and H2 were only partially decapsulated.
Thus their results are likely to be different from the remainder of
the DUTs. The fact that for SEFIs this appears to not be the case
might be a very good result for this type of testing because
preparation of H1 and H2 was easily ten times less expensive than the
remainder of the DUTs. The second DUT related limitation is the
severe dependence of this device type on both the lot/date code and
diffusion lots of the tested devices. Unfortunately, although both
are known for the devices tested at LBL, for the remaining devices
this information was not tracked all the way through the rebonding
process; and H1 and H2 are known to be different from those tested at
LBL.
VIII. CONCLUSION
Qualitative validation of a SEFI testing algorithm for
SDRAMs showed the presented method of SEFI testing is
fairly robust. Limitations of this testing, as presented, can be
used both for improving the algorithm itself and for further
establishing its validity. To do this to the level indicated in
this paper, initial testing at LBL was followed by algorithm and
exotic-result (high current) testing at TAM.
The exotic result seen at LBL, the full failure of two of the
test devices, was not corroborated by the TAM testing. There
are several possible reasons for this, including heating and
differing device lots. This phenomenon can be interpreted as
the sequence: high current events lead to higher nominal
operating current, which leads to device failure. In that sense,
the TAM results agree with the LBL results on the first stage of
this sequence, since high current behavior was observed at
TAM.
Dynamic SEFI testing on the Hyundai HY57V654020B
SDRAM is a viable way of estimating possible effects on more
complicated devices. Similar testing might be necessary on
other devices as complexity levels grow and contribute further
SEFI modes. For these devices, SEFI identification where 384
of the previous 1024 addresses show errors provides a worst-
case SEFI threshold around 8 MeV-cm2/mg, with a saturated
cross section of 6e-5 cm2, for any mode fitting this definition.