ECC Memory - Wikipedia

ECC memory uses error correction codes to detect and correct bit errors that occur in computer memory. It is commonly used where data corruption cannot be tolerated, such as industrial control systems, critical databases, and memory caches. ECC memory can detect and correct single bit errors and detect double bit errors. The document discusses how bit flips can occur due to radiation and provides examples of how ECC memory works to detect and correct errors.


ECC memory


From Wikipedia, the free encyclopedia

Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code[a]
(ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers
where data corruption cannot be tolerated, like industrial control applications, critical databases, and
infrastructural memory caches.

Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is
always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to
the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows
detection but not correction.

Figure: ECC DIMMs typically have nine memory chips on each side, one more than usually found on non-ECC DIMMs (some
modules may have 5 or 18).[1]

Description

Error correction codes protect against undetected data corruption and are used in computers where such corruption
is unacceptable, examples being scientific and financial computing applications, or in database and file servers.
ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems.

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access
memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha
particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off
soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries,
which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to
them.[2] Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of
neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial
airplanes).[3] As a result, systems operating at high altitudes require special provisions for reliability.

As an example, the spacecraft Cassini–Huygens, launched in 1997, contained two identical flight recorders, each
with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Due to built-in EDAC functionality, the
spacecraft's engineering telemetry reported the number of (correctable) single-bit-per-word errors and
(uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly
constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in
space, the number of errors increased by more than a factor of four on that single day. This was attributed to a solar
particle event that had been detected by the satellite GOES 9.[4]

There was some concern that as DRAM density increased further, and thus the components on chips got smaller,
while operating voltages continued to fall, DRAM chips would be affected by such radiation more frequently, since
lower-energy particles would be able to change a memory cell's state.[3] On the other hand, smaller cells make smaller
targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or
even reverse, this trend. Recent studies[5] show that single-event upsets due to cosmic radiation have been
dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are
unfounded.

Research

Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude
difference, ranging from 10⁻¹⁰ error/bit·h (roughly one bit error per hour per gigabyte of memory) to 10⁻¹⁷ error/bit·h
(roughly one bit error per millennium per gigabyte of memory).[5][6][7] A large-scale study based on Google's very
large number of servers was presented at the SIGMETRICS/Performance '09 conference.[6] The actual error rate
found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between
25,000 (2.5 × 10⁻¹¹ error/bit·h) and 70,000 (7.0 × 10⁻¹¹ error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours)
errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per
year.
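The unit conversions quoted for the large-scale study can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the function and constant names are not from any cited source, and it assumes a decimal gigabyte (8 × 10⁹ bits), which is the assumption under which the "1.8 hours" figure comes out.

```python
# Illustrative unit conversion; names are hypothetical.
BITS_PER_MEGABIT = 10**6
BITS_PER_GIGABYTE = 8 * 10**9      # assumption: decimal gigabyte
HOURS_PER_BILLION = 10**9


def per_bit_hour(errors_per_billion_hours_per_megabit: float) -> float:
    """Convert 'errors per 10^9 device hours per megabit' to error/bit·h."""
    return errors_per_billion_hours_per_megabit / (HOURS_PER_BILLION * BITS_PER_MEGABIT)


def hours_between_errors_per_gigabyte(rate_per_bit_hour: float) -> float:
    """Mean time, in hours, between bit errors in one gigabyte at the given rate."""
    return 1.0 / (rate_per_bit_hour * BITS_PER_GIGABYTE)


low_rate = per_bit_hour(25_000)    # about 2.5e-11 error/bit·h
high_rate = per_bit_hour(70_000)   # about 7.0e-11 error/bit·h

if __name__ == "__main__":
    print(low_rate, high_rate)
    print(round(hours_between_errors_per_gigabyte(high_rate), 1))
```

With a binary gigabyte (2³⁰ bytes) the interval would come out slightly shorter, so the quoted figure should be read as a rounded estimate.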

The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a
crash or to corruption of data; in large-scale production sites, memory errors are one of the most-common hardware
causes of machine crashes.[6] Memory errors can cause security vulnerabilities.[6] A memory error can have no
consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in
calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory
errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of
memory errors were greater than would be expected for independent soft errors.[8]

Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of
specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak
their charges and interact electrically, as a result of high cell density in modern memory, altering the content of
nearby memory rows that actually were not addressed in the original memory access. This effect is known as row
hammer, and it has also been used in some privilege escalation computer security exploits.[9][10]

An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine
with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or
becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is
loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck
bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved. As a result, the "8" (0011
1000 binary) has silently become a "9" (0011 1001).
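The spreadsheet scenario can be traced in code. This is a minimal sketch, assuming one even-parity bit per byte and a fault model in which the lowest bit always reads as 1; the helper names are hypothetical. It shows the "8" turning into a "9" and shows that a parity check would notice the change, although parity alone cannot say which bit is wrong.

```python
def even_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the total count of 1 bits even."""
    return bin(byte).count("1") % 2


def read_with_stuck_bit(byte: int, stuck_position: int = 0) -> int:
    """Model a memory cell whose bit at `stuck_position` always reads as 1."""
    return byte | (1 << stuck_position)


stored = ord("8")                          # 56, binary 0011 1000
parity_when_written = even_parity_bit(stored)

read_back = read_with_stuck_bit(stored)    # lowest bit forced to 1
corrupted_char = chr(read_back)            # the character the program now sees

# A parity-checking system compares the stored parity bit with the data read.
parity_error = even_parity_bit(read_back) != parity_when_written

if __name__ == "__main__":
    print(corrupted_char, parity_error)
```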

Solutions

Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming,
RAM parity memory, and ECC memory.

This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that
exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows
the detection of all single-bit errors (actually, any odd number of wrong bits). The most-common error correcting
code, a single-error correction and double-error detection (SECDED) Hamming code, allows a single-bit error to be
corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a
more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
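The SECDED idea can be illustrated on a toy scale. The following sketch uses an extended Hamming(8,4) code: a Hamming(7,4) code plus one overall parity bit. Real memory controllers typically protect much wider words, and the bit layout and function names here are assumptions chosen for readability rather than a description of any particular hardware.

```python
from typing import List, Optional, Tuple


def secded_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2             # extra parity bit over the whole word
    return code


def secded_decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when a double-bit error is detected."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position            # XOR of the positions holding a 1
    overall = sum(code) % 2                 # 0 when overall parity still holds
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is non-zero: two bits were flipped.
        return "double-error", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data
```

Flipping any one bit of an encoded word yields the status "corrected" with the original data, while flipping any two bits yields "double-error". Three or more flips can be miscorrected, which is the limitation that schemes such as Chipkill address.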

Implementations

Seymour Cray famously said "parity is for farmers" when asked why he left
this out of the CDC 6600.[11] Later, he included parity in the CDC 7600,
which caused pundits to remark that "apparently a lot of farmers buy
computers". The original IBM PC and all PCs until the early 1990s used
parity checking.[12] Later ones mostly did not.

An ECC-capable memory controller can generally[a] detect and correct errors of a single bit per word[b] (the unit of
bus transfer), and detect (but not correct) errors of two bits per word. The BIOS in some computers, when matched
with operating systems such as some versions of Linux, BSD, and Windows (Windows 2000 and later[13]), allows
counting of detected and corrected memory errors, in part to help identify failing memory modules before the
problem becomes catastrophic.

Figure: In 1982 this 512KB memory board from Cromemco used 22 bits of storage per 16 bit word to allow for single-bit
error correction.
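The word sizes mentioned in this article (22 bits stored per 16-bit word, and 72-bit words carrying 64 data bits) are consistent with the usual bound for Hamming-style codes: single-error correction needs the smallest r with 2^r ≥ m + r + 1 for m data bits, and double-error detection adds one more parity bit. The short calculation below is a sketch of that bound; whether a given board uses exactly this construction is an assumption, and the function name is illustrative.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming-style SECDED code for `data_bits` data bits."""
    r = 1
    # Single-error correction: r syndrome bits must name every bit position
    # (data and check) plus the "no error" case.
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double-error detection


if __name__ == "__main__":
    for m in (8, 16, 32, 64):
        print(m, secded_check_bits(m), m + secded_check_bits(m))
```

The relative overhead shrinks as words get wider: 6 check bits for 16 data bits is 37.5%, whereas 8 check bits for 64 data bits is 12.5%.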

Some DRAM chips include "internal" on-chip error correction circuits, which allow systems with non-ECC memory
controllers to still gain most of the benefits of ECC memory.[14][15] In some systems, a similar effect may be
achieved by using EOS memory modules.

Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed
that the failure of each bit in a word of memory is independent, resulting in improbability of two simultaneous errors.
This used to be the case when memory chips were one-bit wide, which was typical in the first half of the 1980s; later
developments moved many bits into the same chip. This weakness is addressed by various technologies, including
IBM's Chipkill, Sun Microsystems' Extended ECC, Hewlett-Packard's Chipspare, and Intel's Single Device Data
Correction (SDDC).

DRAM memory may provide increased protection against soft errors by relying on error correcting codes. Such
error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant
applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also
"scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to
remove soft errors.

Interleaving allows for distribution of the effect of a single cosmic ray, potentially upsetting multiple physically
neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event
upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it
can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be
maintained.[16]
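The effect of interleaving can be shown with a small model. The sketch below assumes a toy array of 4 words of 8 bits each and a strike that upsets a run of physically adjacent cells; the mapping functions and sizes are hypothetical and chosen only to make the contrast visible.

```python
from typing import Dict

NUM_WORDS = 4        # toy sizes, assumed for illustration
BITS_PER_WORD = 8


def word_of_cell(cell: int, interleaved: bool) -> int:
    """Map a physical cell index to the logical word that owns it."""
    if interleaved:
        return cell % NUM_WORDS        # neighbouring cells belong to different words
    return cell // BITS_PER_WORD       # neighbouring cells belong to the same word


def errors_per_word(burst_start: int, burst_length: int, interleaved: bool) -> Dict[int, int]:
    """Count how many upset bits each word receives from one burst of adjacent cells."""
    counts: Dict[int, int] = {}
    for cell in range(burst_start, burst_start + burst_length):
        word = word_of_cell(cell, interleaved)
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    print(errors_per_word(5, 4, interleaved=False))
    print(errors_per_word(5, 4, interleaved=True))
```

In this model a burst no longer than the number of interleaved words places at most one upset in each word, so a single-error-correcting code can repair all of them; without interleaving the same burst concentrates several upsets in one word.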

Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular
redundancy (TMR). The latter is preferred because its hardware is faster than that of the Hamming error correction
scheme.[16] Space satellite systems often use TMR,[17][18][19] although satellite RAM usually uses Hamming error
correction.[20]
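One reason TMR hardware can be simple is that the voter is a bitwise majority function over three copies. The following is a minimal software sketch of that voter, not a description of any specific controller; the function name is illustrative.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    original = 0b00111000
    damaged = original ^ 0b00000001      # one copy suffers a bit flip
    print(bin(majority_vote(original, damaged, original)))
```

The vote recovers the original value as long as no bit position is wrong in more than one copy. The price is storing three copies, compared with 8 check bits per 64 data bits for a typical SECDED word.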

Many early implementations of ECC memory mask correctable errors, acting "as if" the error never occurred, and
only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors
(UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the
likelihood of uncorrectable error events.[21]
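On Linux, the EDAC subsystem cited above exposes per-memory-controller counters through sysfs. The sketch below assumes the commonly documented layout `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`; the layout can vary with kernel version and driver, and the helper names and the threshold of 100 correctable errors are purely illustrative, not a recommended policy.

```python
from pathlib import Path
from typing import Dict, List

# Assumed sysfs location of the Linux EDAC memory-controller counters.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> Dict[str, Dict[str, int]]:
    """Return {controller: {"ce": n, "ue": n}}; empty if EDAC data is unavailable."""
    counts: Dict[str, Dict[str, int]] = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


def controllers_to_watch(counts: Dict[str, Dict[str, int]], ce_threshold: int = 100) -> List[str]:
    """Flag controllers with any uncorrectable error or many correctable ones."""
    return [name for name, c in counts.items()
            if c["ue"] > 0 or c["ce"] >= ce_threshold]


if __name__ == "__main__":
    print(controllers_to_watch(read_edac_counts()))
```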

Many ECC memory systems use an "external" EDAC circuit between the CPU and the memory. A few systems with
ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to
correct certain errors that the internal EDAC system is unable to correct.[14] Modern desktop and server CPUs
integrate the EDAC circuit into the CPU,[22] even before the shift toward CPU-integrated memory controllers, which
are related to the NUMA architecture. CPU integration enables a zero-penalty EDAC system during error-free
operation.

As of 2009, the most-common error-correction codes use Hamming or Hsiao codes that provide single-bit error
correction and double-bit error detection (SEC-DED). Other error-correction codes have been proposed for
protecting memory – double-bit error correcting and triple-bit error detecting (DEC-TED) codes, single-nibble error
correcting and double-nibble error detecting (SNC-DND) codes, Reed–Solomon error correction codes, etc.
However, in practice, multi-bit correction is usually implemented by interleaving multiple SEC-DED codes.[23][24]

Early research attempted to minimize the area and delay overheads of ECC circuits. Hamming first demonstrated
that SEC-DED codes were possible with one particular check matrix. Hsiao showed that an alternative matrix with
odd weight columns provides SEC-DED capability with less hardware area and shorter delay than traditional
Hamming SEC-DED codes. More recent research also attempts to minimize power in addition to minimizing area
and delay.[25][26][27]
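The odd-weight-column property can be demonstrated on a toy (8,4) code. In the sketch below every column of the check matrix has an odd number of 1s, so a single-bit error produces an odd-weight syndrome equal to one column, while a double-bit error produces a non-zero even-weight syndrome. The particular column ordering and the function names are assumptions made for illustration; practical Hsiao codes use wider words such as (72,64).

```python
from typing import List, Optional, Tuple

# Each column of the check matrix is written as a 4-bit integer.
DATA_COLUMNS = [0b0111, 0b1011, 0b1101, 0b1110]    # weight-3 columns for data bits
CHECK_COLUMNS = [0b0001, 0b0010, 0b0100, 0b1000]   # weight-1 columns for check bits
COLUMNS = DATA_COLUMNS + CHECK_COLUMNS


def syndrome(word: List[int]) -> int:
    """XOR together the columns of every bit that is set in the 8-bit word."""
    s = 0
    for bit, column in zip(word, COLUMNS):
        if bit:
            s ^= column
    return s


def hsiao_encode(data: List[int]) -> List[int]:
    """Append 4 check bits so that the syndrome of the full word is zero."""
    s = syndrome(list(data) + [0, 0, 0, 0])
    return list(data) + [(s >> j) & 1 for j in range(4)]


def hsiao_decode(word: List[int]) -> Tuple[str, Optional[List[int]]]:
    """Return (status, data bits); data is None for a detected double error."""
    s = syndrome(word)
    if s == 0:
        return "ok", word[:4]
    if bin(s).count("1") % 2 == 1 and s in COLUMNS:
        fixed = list(word)
        fixed[COLUMNS.index(s)] ^= 1     # odd-weight syndrome names the flipped bit
        return "corrected", fixed[:4]
    return "uncorrectable", None         # even-weight syndrome: two bits flipped
```

Because the single/double distinction reduces to the parity of the syndrome, no separate overall parity bit has to be computed across the whole word, which is one intuition for the area and delay savings mentioned above.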

Cache

Many CPUs use error-correction codes in the on-chip cache, including the Intel Itanium, Xeon, Core and Pentium
(since P6 microarchitecture)[28][29] processors, the AMD Athlon, Opteron, all Zen-[30] and Zen+-based[31]
processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264.[23][32]

As of 2006, EDC/ECC and ECC/ECC are the two most-common cache error-protection techniques used in
commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If
an error is detected, data is recovered from ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-
protected level 1 cache and an ECC-protected level 2 cache.[33] CPUs that use the EDC/ECC technique always
write-through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data
cache, a copy of that data can be recovered from the level 2 cache.
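The EDC/ECC arrangement can be sketched as a small simulation. In this toy model the level 1 cache keeps one parity bit per entry as its error-detecting code, every store is written through to level 2, and a parity mismatch on a load triggers a refill from level 2. The level 2 cache is simply modelled as reliable, standing in for ECC protection; the class and method names are hypothetical.

```python
def parity(value: int) -> int:
    """Even-parity bit over the bits of `value`."""
    return bin(value).count("1") % 2


class WriteThroughCache:
    """Toy EDC/ECC hierarchy: parity-checked L1 backed by a trusted L2."""

    def __init__(self) -> None:
        self.l1 = {}          # address -> (value, parity bit)
        self.l2 = {}          # address -> value (assumed ECC-protected)
        self.recoveries = 0

    def store(self, address: int, value: int) -> None:
        # Write-through: L2 always holds a good copy of anything in L1.
        self.l1[address] = (value, parity(value))
        self.l2[address] = value

    def load(self, address: int) -> int:
        value, stored_parity = self.l1[address]
        if parity(value) != stored_parity:
            # Error detected in L1: recover the data from L2 and refill L1.
            value = self.l2[address]
            self.l1[address] = (value, parity(value))
            self.recoveries += 1
        return value

    def inject_l1_bit_flip(self, address: int, bit: int) -> None:
        """Simulate a soft error in the L1 data without touching its parity bit."""
        value, stored_parity = self.l1[address]
        self.l1[address] = (value ^ (1 << bit), stored_parity)
```

For example, after `store(0x10, 0x38)` and `inject_l1_bit_flip(0x10, 0)`, a `load(0x10)` still returns `0x38` and increments `recoveries`. A write-back L1 could not rely on this scheme, since a modified line might have no valid copy in level 2.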

Registered memory

Main article: Registered memory

Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for
memory used in servers to be both registered, to allow many memory modules to be used without electrical problems,
and ECC, for data integrity. Memory used in desktop computers is usually neither, for economy. However, unbuffered
(not-registered) ECC memory is available,[34] and some non-server motherboards support ECC functionality of such
modules when used with a CPU that supports ECC.[35] Registered memory does not work reliably in motherboards
without buffering circuitry, and vice versa.

Figure: Two 8 GB DDR4-2133 ECC 1.2 V RDIMMs

Advantages and disadvantages

Ultimately, there is a trade-off between protection against unusual loss of data and a higher cost.

ECC memory usually involves a higher price when compared to non-ECC memory, due to additional hardware
required for producing ECC memory modules, and due to lower production volumes of ECC memory and
associated system hardware. Motherboards, chipsets and processors that support ECC may also be more
expensive.

ECC support varies among motherboard manufacturers, so ECC memory may simply not be recognized by an
ECC-incompatible motherboard. Most motherboards and processors for less critical applications are not designed to
support ECC. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but
will also work with non-ECC memory; system firmware enables ECC functionality if ECC memory is installed.

ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and
implementation, due to the additional time needed for ECC memory controllers to perform error checking.[36]
However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses
as long as no errors are detected.[22][37][38]

ECC supporting memory may contribute to additional power consumption due to error correcting circuitry.

Notes

a. Most ECC memory uses a SECDED code.
b. While 72-bit words with 64 data bits and 8 checking bits are common, ECC is also used with smaller and larger
sizes.

References

1. Werner Fischer. "RAM Revealed". admin-magazine.com. Retrieved October 20, 2014.
2. Single Event Upset at Ground Level, Eugene Normand, Member, IEEE, Boeing Defense & Space Group, Seattle,
WA 98124-2499.
3. "A Survey of Techniques for Modeling and Improving Reliability of Computing Systems", IEEE TPDS, 2015.
4. Gary M. Swift and Steven M. Guertin. "In-Flight Observations of Multiple-Bit Upset in DRAMs". Jet Propulsion
Laboratory.
5. Borucki, "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level", 46th
Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487.
6. Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). DRAM Errors in the Wild: A Large-Scale
Field Study (PDF). SIGMETRICS/Performance. ACM. ISBN 978-1-60558-511-6.
Robin Harris (October 4, 2009). "DRAM error rates: Nightmare on DIMM street". ZDNet.
7. "A Memory Soft Error Measurement on Production Systems". Archived from the original on 2017-02-14.
Retrieved 2011-06-27.
8. Li, Huang; Shen, Chu (2010). "A Realistic Evaluation of Memory Hardware Errors and Software System
Susceptibility" (PDF). Usenix Annual Tech Conference 2010.
9. Yoongu Kim; Ross Daly; Jeremie Kim; Chris Fallin; Ji Hye Lee; Donghyuk Lee; Chris Wilkerson; Konrad Lai; Onur
Mutlu (2014-06-24). "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance
Errors" (PDF). ece.cmu.edu. IEEE. Retrieved 2015-03-10.
10. Dan Goodin (2015-03-10). "Cutting-edge hack gives super user status by exploiting DRAM weakness". Ars
Technica. Retrieved 2015-03-10.
11. "CDC 6600". Microsoft Research. Retrieved 2011-11-23.
12. "Parity Checking". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
13. DOMARS. "mca - Windows drivers". docs.microsoft.com. Retrieved 2021-03-27.
14. A. H. Johnston. "Space Radiation Effects in Advanced Flash Memories". Archived 2016-03-04 at the
Wayback Machine. NASA Electronic Parts and Packaging Program (NEPP). 2001.
15. "ECC DRAM – Intelligent Memory". intelligentmemory.com. Archived from the original on 2019-02-12.
Retrieved 2021-06-12.
16. "Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua
University, Beijing. Archived from the original on 2011-10-02. Retrieved 2009-02-16.
17. "Actel engineers use triple-module redundancy in new rad-hard FPGA". Military & Aerospace Electronics.
Archived from the original on 2012-07-14. Retrieved 2009-02-16.
18. "SEU Hardening of Field Programmable Gate Arrays (FPGAs) For Space Applications and Device
Characterization". Klabs.org. 2010-02-03. Archived from the original on 2011-11-25. Retrieved 2011-11-23.
19. "FPGAs in Space". Techfocusmedia.net. Retrieved 2011-11-23. [permanent dead link]
20. "Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment".
Radhome.gsfc.nasa.gov. Archived from the original on 2001-03-04. Retrieved 2011-11-23.
21. Doug Thompson, Mauro Carvalho Chehab. "EDAC - Error Detection And Correction". Archived 2009-09-05 at
the Wayback Machine. 2005 - 2009. "The 'edac' kernel module goal is to detect and report errors that occur within the
computer system running under linux."
22. "AMD-762™ System Controller Software/BIOS Design Guide, p. 179" (PDF).
23. Doe Hyun Yoon; Mattan Erez. "Memory Mapped ECC: Low-Cost Error Protection for Last Level Caches".
2009. p. 3.
24. Daniele Rossi; Nicola Timoncini; Michael Spica; Cecilia Metra. "Error Correcting Code Analysis for Cache Memory
High Reliability and Performance". Archived 2015-02-03 at the Wayback Machine.
25. Shalini Ghosh; Sugato Basu; and Nur A. Touba. "Selecting Error Correcting Codes to Minimize Power in Memory
Checker Circuits". Archived 2015-02-03 at the Wayback Machine. p. 2 and p. 4.
26. Chris Wilkerson; Alaa R. Alameldeen; Zeshan Chishti; Wei Wu; Dinesh Somasekhar; Shih-lien Lu. "Reducing cache
power with low-cost, multi-bit error-correcting codes". doi:10.1145/1816038.1815973.
27. M. Y. Hsiao. "A Class of Optimal Minimum Odd-weight-column SEC-DED Codes". 1970.
28. Intel Corporation. "Intel Xeon Processor E7 Family: Reliability, Availability, and Serviceability". 2011. p. 12.
29. "Bios and Cache - Custom Build Computers". www.custom-build-computers.com. Retrieved 2021-03-27.
30. "AMD Zen microarchitecture - Memory Hierarchy". WikiChip. Retrieved 15 October 2018.
31. "AMD Zen+ microarchitecture - Memory Hierarchy". WikiChip. Retrieved 15 October 2018.
32. Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe. "Multi-bit Error Tolerant Caches Using
Two-Dimensional Error Coding". 2007. p. 2.
33. Nathan N. Sadler and Daniel J. Sorin. "Choosing an Error Protection Scheme for a Microprocessor's L1 Data
Cache". 2006. p. 1.
34. "Typical unbuffered ECC RAM module: Crucial CT25672BA1067".
35. Specification of desktop motherboard that supports both ECC and non-ECC unbuffered RAM with compatible
CPUs.
36. "Discussion of ECC on pcguide". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
37. Benchmark of AMD-762/Athlon platform with and without ECC. Archived 2013-06-15 at the Wayback Machine.
38. "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All". Systems and Network Security Group at
VU Amsterdam. Retrieved 2018-11-22.

External links

SoftECC: A System for Software Memory Integrity Checking
A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
Single-Bit Errors: A Memory Module Supplier's perspective on cause, impact and detection
Intel Xeon Processor E3 - 1200 Product Family Memory Configuration Guide
Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC



This page was last edited on 12 May 2023, at 06:29 (UTC).

Text is available under the Creative Commons Attribution-ShareAlike License 3.0; additional terms may apply. By using this site, you
agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit
organization.
