
ZenHammer: Rowhammer Attacks on AMD Zen-based Platforms

Patrick Jattke† Max Wipfli† Flavien Solt Michele Marazzi Matej Bölcskei Kaveh Razavi
ETH Zurich
† Equal contribution first authors

Abstract

AMD has gained a significant market share in recent years with the introduction of the Zen microarchitecture. While there are many recent Rowhammer attacks launched from Intel CPUs, they are completely absent on these newer AMD CPUs due to three non-trivial challenges: 1) reverse engineering the unknown DRAM addressing functions, 2) synchronizing with refresh commands for evading in-DRAM mitigations, and 3) achieving a sufficient row activation throughput. We address these challenges in the design of ZenHammer, the first Rowhammer attack on recent AMD CPUs. ZenHammer reverse engineers DRAM addressing functions despite their non-linear nature, uses specially crafted access patterns for proper synchronization, and carefully schedules flush and fence instructions within a pattern to increase the activation throughput while preserving the access order necessary to bypass in-DRAM mitigations. Our evaluation with ten DDR4 devices shows that ZenHammer finds bit flips on seven and six devices on AMD Zen 2 and Zen 3, respectively, enabling Rowhammer exploitation on current AMD platforms. Furthermore, ZenHammer triggers Rowhammer bit flips on a DDR5 device for the first time.

1 Introduction

Recent Rowhammer attacks that require the circumvention of in-DRAM mitigations have mostly been investigated on Intel platforms [6, 8–10, 13, 17, 19, 29, 30, 38, 39, 41, 44]. The success of these attacks crucially depends on intimate architectural and microarchitectural details of Intel CPUs, such as how the memory controller maps physical addresses to DRAM chips, the observability of certain DRAM commands, and the behavior of memory flushing and ordering instructions. Lacking this information, Rowhammer attacks are currently absent on modern AMD CPUs based on the Zen microarchitecture. We conduct experiments to uncover this information, which we then use in the construction of ZenHammer, the first successful Rowhammer attack on Zen-based AMD platforms.

DRAM Addressing. Previous work that reverse engineered the DRAM addressing functions on Intel and ARM platforms assumes that these functions are constructed by XOR-ing certain physical address bits with each other [32]. We find that on AMD platforms, this assumption leads to an incomplete recovery of address functions. Our experiments show that AMD's memory controllers require offsets for certain physical address ranges before applying the XOR functions. Adjusting for these offsets, better handling of noisy measurements, and considering higher physical address bits lead to the recovery of correct and complete DRAM addressing functions for these CPUs. Yet, this initial version of ZenHammer only triggers bit flips on five (Zen 2) and none (Zen 3) of our ten DDR4 DRAM modules, while an Intel-based fuzzer [17] triggers bit flips on eight of them. We track down the reason to inadequate synchronization with DRAM refresh commands and the low throughput of activations sent to DRAM.

Refresh Synchronization. Modern DRAM devices employ in-DRAM Target Row Refresh (TRR) mitigations that detect potential victims of a Rowhammer attack and internally refresh these victims before bits can flip. These preventive refreshes happen transparently inside DRAM during the standard refresh commands issued by the memory controller. To bypass these mitigations, state-of-the-art Rowhammer patterns executed on Intel CPUs synchronize with refresh commands by repeatedly measuring the time it takes to access two rows inside DRAM [17]. The memory controller's refresh commands delay these accesses, allowing the patterns to detect and synchronize with these commands. We find that the required flushing and ordering instructions introduce significant inaccuracies in the detection of refresh commands on AMD's Zen-based CPUs. To address this issue, we rely on synchronization using many uncached addresses that are only flushed after a refresh is detected.

Activation Throughput. To bypass TRR during a Rowhammer attack, maximizing the number of activations on both decoy and target DRAM locations is favorable and often necessary. We noticed that, unlike on Intel CPUs, it is not trivial to saturate the activation throughput on Zen-based AMD CPUs due to the behavior of cache flushing and fencing instructions. To find the best hammering strategy, we systematically explore different memory access instructions and scheduling policies for flushing and fencing instructions. We discover that on AMD Zen 3, the CPU does not require a fence to order the cache flush with the memory access, allowing ZenHammer to achieve a higher activation throughput by omitting unnecessary fence instructions. Furthermore, we learn that on all AMD Zen CPUs, the number of fence instructions necessary to order different memory accesses can be tuned to a given DRAM vendor based on the sensitivity of its TRR mitigation to this ordering.

Equipped with better refresh synchronization and scheduling of flushing and fencing instructions, ZenHammer can trigger bit flips on seven of our ten sample devices (compared to eight on an Intel CPU). Our evaluation further shows that these bit flips can be used to build the page table [36], RSA public key corruption [34], and sudo [11] exploits on 7/6/4 of these devices, taking on average 164/267/209 seconds. To verify the attack's practicality, we implement and evaluate the page table exploit by Seaborn et al. [36] on a Zen 3 system. We also show that ZenHammer can trigger bit flips on a DDR5 device for the first time.

Contributions. We make the following contributions:
• We reverse engineer the confidential DRAM addressing functions on various AMD Zen-based CPUs in different configurations, and we present a coloring technique enabling system exploitation despite higher physical address bits involved in the DRAM functions.
• We show how to synchronize effectively with refresh commands and increase the activation throughput with new fence scheduling strategies on AMD Zen-based CPUs.
• We build ZenHammer using the reverse engineered DRAM addressing functions, the new synchronization, and fencing strategies. Our evaluation shows that ZenHammer is effective on seven out of ten sample DDR4 devices, enabling Rowhammer exploitation on AMD Zen-based systems for the first time.
• Using ZenHammer, we show Rowhammer bit flips on a DDR5 device for the first time.

Responsible Disclosure and Open Sourcing. While Rowhammer is a known problem in industry, we nonetheless informed AMD about our findings and agreed to an embargo expiring on March 25, 2024. More information including the source code of ZenHammer can be found at: https://comsec.ethz.ch/research/dram/zenhammer

2 Background

We provide a high-level overview of DRAM (Section 2.1), how physical memory is mapped to it (Section 2.2), and discuss Rowhammer exploitation (Section 2.3).

2.1 DRAM

In desktop and server systems, dual in-line memory modules (DIMMs) are connected to the CPU's memory channels to equip systems with DRAM. Each of these DIMMs consists of multiple DRAM chips, which operate in lockstep. Each chip contains several banks, a grid-like structure of DRAM cells organized in rows and columns. Each cell stores a single-bit value and consists of a capacitor and an access transistor. Further, each DRAM bank is connected to a row buffer that buffers the whole DRAM row during read and write operations.

DDRx Protocol. The DDRx protocol is used to communicate between the DRAM device and the memory controller. The protocol dictates timing requirements and permitted command orders that the memory controller must respect to ensure the DRAM device's proper functioning. For example, in DDR4, a refresh command (REF) has to be sent to the DRAM devices every 7.8 µs (tREFI) on average. Before any data can be read from (RD) or written to (WR) a DRAM row, it must first be opened with an activate command (ACT), which brings its data into the row buffer. Before any other row in the same bank can be opened, the row must be closed, i.e., precharged (PRE).

Row Buffer Side Channel. Row buffer conflicts have been exploited as a timing side channel to detect same-bank rows [32, 42, 43]. For this, two randomly picked addresses are repeatedly accessed in succession and their access time is measured. If the addresses map to different banks, the rows stay open in their respective bank's row buffer and can be read directly. These row buffer hits lead to faster (lower-latency) accesses. However, if the rows map to the same bank, successive accesses evict each other from the row buffer, thus requiring a PRE and ACT before data can be read or written. This row buffer conflict yields slower (higher-latency) accesses.
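This timing side channel is simple to implement. The following sketch is ours, not from the paper: a minimal C illustration using x86 intrinsics, where CONFLICT_THRESHOLD is a hypothetical value that must be calibrated per system.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

#define ROUNDS 16
#define CONFLICT_THRESHOLD 450  /* hypothetical; calibrate per system (cycles) */

/* Returns 1 if a and b likely map to different rows of the same bank,
 * i.e., their back-to-back accesses cause a row-buffer conflict. */
int same_bank(volatile char *a, volatile char *b) {
  uint64_t best = UINT64_MAX;
  unsigned aux;
  for (int i = 0; i < ROUNDS; i++) {
    _mm_clflush((void *)a);          /* ensure both loads are served by DRAM */
    _mm_clflush((void *)b);
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*a; (void)*b;              /* conflict: PRE + ACT on every access */
    uint64_t stop = __rdtscp(&aux);
    if (stop - start < best) best = stop - start;  /* minimum filters noise */
  }
  return best > CONFLICT_THRESHOLD;
}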
2.2 DRAM Addressing

The memory controller uses an addressing scheme to map the physical address space to DRAM locations. For this, vendors employ confidential address mappings that are optimized for performance. As correctly addressing DRAM rows is essential for Rowhammer attacks, reverse engineering these proprietary functions is often a preliminary step. Unlike Intel, AMD published address mappings for their pre-Zen CPUs in the BIOS and Kernel Developer's Guide [2], but stopped publishing this information for its newer CPUs since 2017.

Linearity of Functions. Previous work [5, 14, 15, 32, 35, 42, 43] assumed the DRAM functions are linear. That is, a function f_j is the exclusive-OR (XOR) of a set S_j of physical address bits: $f_j(a) = \bigoplus_{k \in S_j} a_k$. We will show later (Section 4.2) that this assumption does not hold on our target systems.
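Under the linearity assumption, each function reduces to the parity of the masked physical address, which is why functions can be denoted by a bit mask (as done in Table 3 and Figure 2 later). A minimal sketch of our own; the example masks are the bank-group functions later recovered for the single-rank Z3 configuration (Section 4.3, Table 3):

#include <stdint.h>

/* A linear DRAM function f_j reduces to the XOR (parity) of the
 * physical address bits selected by a mask. */
static inline int dram_fn(uint64_t phys, uint64_t mask) {
  return __builtin_parityll(phys & mask);   /* GCC/Clang builtin */
}

/* Example: 2-bit bank group index on Z3 (single rank, Table 3). */
int bank_group_z3(uint64_t phys) {
  return dram_fn(phys, 0x022220100ULL) |
        (dram_fn(phys, 0x044440200ULL) << 1);
}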

2.3 Rowhammer

The DRAM vulnerability "Rowhammer" [22] allows attackers to induce memory disturbance errors. By rapidly accessing aggressor rows, an attacker can leak charge from adjacent victim rows, eventually causing bit flips in them. The effect is caused by the weak physical isolation of memory rows, and it is expected to worsen in future devices due to the ongoing miniaturization of DRAM cells [28]. In response to the worsening Rowhammer effect, DRAM vendors have deployed in-DRAM mitigations, known as Target Row Refresh (TRR), that preemptively refresh victim rows before bits flip [10, 12].
Hammering Patterns. The originally proposed single-sided [22, 40] and double-sided [36] patterns were later generalized by n-sided patterns with n aggressors, which helped to bypass some of the in-DRAM TRR mitigations [10]. The state-of-the-art non-uniform Rowhammer patterns showed how to bypass all existing mitigations on DDR4 devices [17]. These patterns are composed of double-sided aggressor pairs that are hammered with different frequencies, phases, and amplitudes to find blind spots in the deployed mitigations more effectively. Furthermore, a newer Rowhammer effect known as Half-Double [24, 26] can bypass certain in-DRAM mitigations using near and far aggressors.

Synchronization. A key ingredient of recent Rowhammer patterns is the synchronization with the REF commands that trigger TRRs [9, 17]. Earlier work [9] showed that soft synchronization can be achieved by introducing a carefully chosen number of NOPs into the pattern. In this way, the memory controller is more likely to schedule REFs in these gaps of lower DRAM activity, thus helping to keep the pattern in sync with the REF. Blacksmith [17], however, uses hard synchronization by tailoring the pattern to the length of multiple refresh intervals and exploiting the increased access latency during REFs to detect them.

Discussion. Existing research on Rowhammer has mostly focused on Intel systems [7, 10, 17], where the DRAM address functions [32, 42] and the effects of the hammering instruction sequence [8, 13] are well known. However, AMD has gained a significant market share in recent years and held around 36 % of the market for x86 CPUs in 2024 [1]. Yet, it is unclear if Rowhammer is similarly exploitable on these AMD systems.

3 Overview

Our goal is to trigger bit flips on AMD Zen-based platforms, particularly the systems listed in Table 1. These make use of DDR4 memory technology, allowing us to compare their vulnerability with a baseline on well-studied Intel systems [17].

Table 1. Details about the Ryzen-based test systems (Z+, Z2, Z3) used in this work.

  Microarchitecture   Release Date   System   CPU
  Zen 3               11-2020        Z3       Ryzen 5 5600G
  Zen 2               07-2019        Z2       Ryzen 5 3600X
  Zen+                04-2018        Z+       Ryzen 5 2600X

A requirement for most Rowhammer attacks is knowledge of the DRAM address mapping, i.e., how physical addresses map to DRAM locations. This allows precisely selecting the location of aggressors around a victim row, as needed by the most effective Rowhammer techniques [10, 24, 26, 36]. As the memory controllers of Intel and AMD systems use different DRAM address mappings, determining them poses our first challenge:

Challenge 1. Reverse engineering the undocumented DRAM address mappings on AMD Zen-based systems.

We show in Section 4 that the state-of-the-art DRAMA [32] technique fails to recover the address functions on our AMD systems. Instead, a modified timing primitive and a relaxation of the common linearity assumption are required to obtain the full physical-to-DRAM address mappings. We then use these mappings to build ZenHammer and perform Rowhammer on our AMD systems with a sample of ten DRAM devices. However, even when hammering devices known to be exploitable on Intel systems, we find very few bit flips with ZenHammer, with many devices not showing any bit flips at all. Based on these results, we conclude that additional (micro-)architectural considerations are necessary for effective Rowhammer attacks on AMD Zen-based systems.

Earlier work [9, 17] shows that synchronizing a Rowhammer pattern with DRAM refresh commands is key for bypassing TRR mitigations. State-of-the-art non-uniform patterns [17], for example, rely on a timing side channel to detect spikes in the memory access latency caused by refresh commands [6, 9, 10]. Our analysis shows that this mechanism produces inaccurate results on AMD Zen-based CPUs, even failing completely on the newer Zen 3 platform. This leads us to our second challenge:

Challenge 2. Understanding and overcoming the shortcomings of timing-based refresh synchronization.

In Section 5, we design and implement modified versions of timing-based refresh synchronization. We experimentally evaluate the various implementations to achieve more reliable synchronization on AMD platforms. Additionally, we notice that the activation throughput is only about half of the Intel baseline. This severely reduces the budget of activations that can be used to "trick" TRR mitigations, substantially increasing the difficulty of finding effective Rowhammer patterns. This introduces our last challenge:

Challenge 3. Increasing the activation rate during hammering while preserving the order of memory accesses.

In Section 6, we systematically evaluate the activation throughput achieved by different memory access, flushing, and fencing instructions to find optimal access patterns tuned to the underlying DRAM device. Finally, after solving these challenges, Section 7 evaluates the effectiveness of ZenHammer in triggering bit flips on the AMD Zen 2 and Zen 3 platforms and the suitability of these bit flips for building successful Rowhammer exploits. We further show that ZenHammer can trigger bit flips on one of our DDR5 devices on the latest AMD Zen 4 platform.
4 DRAM Addressing

DRAMA [32] is currently the standard approach for reverse engineering DRAM address mappings. We briefly describe the two main steps of this technique.

Step 1: Clustering. DRAMA measures the access latency of two randomly picked addresses. If the measured value exceeds the row conflict threshold, the addresses map to different rows in the same bank, and otherwise to distinct banks. Using this method, DRAMA creates clusters of addresses in the same bank. DRAMA repeats this process until it finds a cluster for each bank.

Step 2: Function Brute Forcing. DRAMA then generates XOR-function candidates and tests them on the clustered addresses exhaustively. A valid function must (i) be constant for all addresses in each cluster and (ii) not produce the same result over all clusters. After removing linearly dependent functions, the resulting set of log2(N) functions, on a system with N unique banks, can uniquely index every DRAM bank.

We verified that DRAMA, originally designed for Intel CPUs, does not produce valid results on recent AMD CPUs (using the implementation at https://github.com/IAIK/drama): either it does not find any address functions at all, or it finds functions that are incomplete. We describe how improvements to timing (Section 4.1) and taking system-specific address offsets into account (Section 4.2) enable our new DRAM reverse-engineering tool, called DRAM Address Mapping Reverse-Engineering (DARE), to successfully recover the address mappings on AMD Zen-based platforms (Section 4.3).

4.1 Timing Routine

The access time difference between address pairs that produce a row conflict and those that do not is very small. Thus, existing reverse-engineering tools use specially crafted timing routines to amplify the timing difference while eliminating unwanted noise (e.g., from unrelated system activity).

In DARE, we perform 32 iterations of accesses to both addresses while measuring the entire loop. In contrast, DRAMA also measures accesses to two additional addresses that are shifted during measurement repetitions. We then perform 16 measurements and use the minimum value. In total, we only do 16 × 32 = 512 accesses instead of DRAMA's 4 × 5 K = 20 K accesses, making our method significantly faster.
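The paper does not list DARE's timing routine; the following is our plausible C sketch of the description above (16 repetitions of a 32-iteration access loop, keeping the minimum). Placing the flushes inside the timed loop is our assumption, made to keep both accesses DRAM-served:

#include <stdint.h>
#include <x86intrin.h>

#define ITERS 32  /* access pairs per timed loop */
#define REPS  16  /* repetitions; the minimum is kept */

/* Time a whole loop of accesses to amplify the per-access latency
 * difference between row conflicts and bank hits. */
uint64_t dare_time(volatile char *a, volatile char *b) {
  uint64_t best = UINT64_MAX;
  unsigned aux;
  for (int r = 0; r < REPS; r++) {
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < ITERS; i++) {
      (void)*a; (void)*b;
      _mm_clflush((void *)a);   /* assumed: keep accesses uncached */
      _mm_clflush((void *)b);
    }
    uint64_t stop = __rdtscp(&aux);
    if (stop - start < best) best = stop - start;
  }
  return best;  /* compare against the row-conflict threshold */
}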
[Figure 1: Histogram of access latencies measured on Z+ when using DRAMA or DARE. Measurements are partitioned based on whether the address pair should produce a row conflict or not.]

Evaluation Setup. We evaluate the accuracy of both timing routines, DRAMA and DARE, on Z+. For each routine, we generate 100 K random address pairs and measure their access times. We expect the access latencies for bank hits and conflicts to be clearly distinguishable to allow reliable differentiation. The address pairs are then partitioned based on whether they map to the same bank or not using the ground truth obtained with an oscilloscope (see Section 4.3).

Results. The results, shown in Figure 1, clearly show that the DRAMA routine significantly overlaps the two cases, whereas our method shows a clearer separation. This means that it is less susceptible to noise than DRAMA, thus reducing the number of misclassified addresses. We further enhance DARE by cluster cleaning (i.e., intra-cluster pairwise testing of addresses) to ensure that we can reliably find the address functions despite system noise.

4.2 Address Offsets

We notice that DARE fails to find sufficient address function candidates (i.e., at least log2(N) for N banks) but succeeds when restricted to smaller memory ranges such as 256 MiB-aligned blocks. In other words, some functions are valid over a limited area only and cease to work across larger areas.

We investigate this anomaly by applying the obtained functions on smaller windows over the entire memory range, as shown in Figure 2a. We note that the sections where the function result is 0 and where it is 1 have different sizes. However, if the address mapping were linear, the result would have to be either constant 0 or 1 (if the function is correct) or evenly split between the two (for any other linear function). We refer to Appendix A for a proof of this fact. Based on this observation, the correct functions cannot be linear, which contradicts assumptions made by previous work (see Section 2.2).

We find that removing this nonlinearity is possible by subtracting a particular constant offset from all physical addresses in the clusters before brute forcing the functions. This linearization is demonstrated in Figure 2b, where, after removing the offset, it is possible to find the correct function, as shown in Figure 2c. Finally, the correctly identified function generates a constant value for all addresses in the same cluster.

Observation 1. DRAM address functions may be non-linear due to physical address space remapping, in which case a constant offset needs to be subtracted from physical addresses before applying an XOR function.
[Figure 2: (a) Function values for f(x), given by 0x64440100, for same-cluster addresses over the full address range on Z3, showing an uneven distribution between "0" and "1". (b) After offsetting the physical addresses by 768 MiB before applying the function, the same function's output looks evenly distributed. (c) This allows us to find the function g(x), defined by 0x44440100, that is constant for the cluster's addresses across all memory. Addresses whose function value changes when applying the offset are colored green (0 → 1) or blue (1 → 0).]
[Figure 3: Remapping of higher address ranges to unused parts of physical memory on Intel ("OS Invisible Reclaim") and AMD (offsetting) CPUs. The Top of Memory (TOM) is the system's highest addressable memory location.]

System Address Map. We now provide an explanation and supporting evidence for the existence of this offset. The physical address space is divided into ranges backed by main memory (i.e., DRAM) and ranges for memory-mapped I/O (MMIO) devices. In particular, PCI(e) devices are commonly mapped just below the 4 GiB boundary to keep 32-bit compatibility, thus masking parts of main memory. As DRAM sizes are in the order of gigabytes, CPU vendors introduced mechanisms to remap the otherwise inaccessible part of DRAM to a higher address range, as shown in Figure 3. Intel still employs this "OS Invisible Reclaim" mechanism [16, p. 19], but AMD stopped documenting "Memory Hoisting" [2, §2.9.12] with the Zen microarchitecture. Our findings suggest that newer AMD processors shift all physical addresses above 4 GiB by a fixed, system-specific offset. This offset depends on the system's hardware configuration, e.g., the mainboard and the installed PCI(e) devices.

Table 2. Primary PCI memory mappings and detected physical address offsets, i.e., the difference between 4 GiB and the PCI mapping's start address.

  System   PCI Range [MiB]   Offset [MiB]
  Z+       3072 – 4048       1024
  Z2       3584 – 3968        512
  Z3       3328 – 4076        768

Automation. To avoid having to brute force the physical address offset, we analyze the system memory map of our target systems to find the location of the primary PCI memory mapping (in Linux, the "PCI Bus 0000:00" entry in the privileged /proc/iomem file). For example, as we show in Table 2, the PCI memory range on Z2 starts at 3584 MiB and ends at 3968 MiB. This allows us to precisely calculate the system's address offset as the difference between the 4 GiB boundary and the PCI mapping's start address, for example, 4096 − 3584 = 512 MiB for Z2. Applying this offset to our physical addresses before brute forcing the address functions produces valid functions on all our systems.
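In code, the offset correction is a one-liner once the PCI range start is known. A sketch of ours; parsing /proc/iomem is omitted, and the assumption that only addresses above 4 GiB are shifted follows the remapping model of Figure 3:

#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/* Offset = 4 GiB - start of the primary PCI memory range,
 * e.g., 4096 MiB - 3584 MiB = 512 MiB on Z2 (Table 2). */
static uint64_t offset_from_pci_start(uint64_t pci_start) {
  return 4 * GiB - pci_start;
}

/* Linearize a physical address before applying the XOR functions:
 * per Figure 3, only addresses above 4 GiB are remapped (shifted). */
static uint64_t linearize(uint64_t phys, uint64_t offset) {
  return (phys >= 4 * GiB) ? phys - offset : phys;
}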
4.3 Recovered Address Mappings

We run DARE on all our systems using single- and dual-rank DIMMs. DARE successfully reverse engineers the address functions for all memory configurations on all three systems. For simplicity, we limit our analysis to single-channel, single-DIMM systems with default UEFI settings, as this is sufficient for performing Rowhammer.

To validate our results, we verify the functions' correctness using a high-bandwidth oscilloscope, similar to previous work [32]. This also allows us to obtain the function labels (i.e., assign functions to the DRAM address components) and clarify the cases where our tool found linear combinations of the actual address functions. We note that this manual step is not required for Rowhammer attacks.

Results. We provide a list of all our reverse engineered and oscilloscope-validated address functions for three AMD Zen microarchitectures and different memory configurations in Table 3. Note that some physical address bits are above the 1 GiB mark, which explains why DARE uses as many 1 GiB superpages as possible while building same-bank address clusters.

Table 3. Reverse engineered address mappings and offsets for different DRAM configurations. All memory configurations are single-channel, single-DIMM, with the tuple indicating the DIMM's geometry (#ranks, #bank groups, #banks per bank group, #rows).

  Sys.  Geometry          Size   Offt.   Rank (RK)    Bank Group (BG)           Bank Address (BA)         Row Bits
        (RK, BG, BA, R)   [GiB]  [MiB]
  Z+    (1, 4, 4, 2^16)    8     1024    n/a          0x088883fc0, 0x111104000  0x022228000, 0x044450000  32 – 17
        (2, 4, 4, 2^16)   16     1024    0x3fffe0000  0x111103fc0, 0x222204000  0x044448000, 0x088890000  33 – 18
        (2, 4, 4, 2^17)   32     1024    0x7fffe0000  0x111103fc0, 0x222204000  0x444448000, 0x088890000  34 – 18
  Z2    (1, 4, 4, 2^16)    8      512    n/a          0x088883fc0, 0x111104000  0x022228000, 0x044450000  32 – 17
        (2, 4, 4, 2^16)   16      512    0x3fffe0000  0x111103fc0, 0x222204000  0x044448000, 0x088890000  33 – 18
        (2, 4, 4, 2^17)   32      512    0x7fffe0000  0x111103fc0, 0x222204000  0x444448000, 0x088890000  34 – 18
  Z3    (1, 4, 4, 2^16)    8      768    n/a          0x022220100, 0x044440200  0x088880400, 0x111100800  32 – 17
        (2, 4, 4, 2^16)   16      768    0x3fffe0000  0x044440100, 0x088880200  0x111100400, 0x222200800  33 – 18
        (2, 4, 4, 2^17)   32      768    0x7fffe0000  0x444440100, 0x088880200  0x111100400, 0x222200800  34 – 18

Observation 2. We need access to a memory block larger than 1 GiB to entirely recover all DRAM address mappings.

Discussion. To the best of our knowledge, we are the first to reverse engineer and provide physically validated DRAM address mappings on recent AMD Zen-based systems with consideration of the address offsets. Further, we provide an improved reverse-engineering tool to reproduce and extend our results with more memory configurations as needed.

Row Mapping. DARE, just like DRAMA, does not allow the detection of the physical address bits used for DRAM row and column indices. Therefore, before we can experimentally evaluate our address mappings using a Rowhammer attack, we need to extract the row mapping. Based on previous results [9, 42], we assume that the highest available address bits are used for row indexing, which we verified with our oscilloscope. For example, a 16 GiB device (with 2^16 rows) consists of 2^34 individually addressable bytes, and a row index is described by the bits (a_33, a_32, ..., a_18).
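For instance, for the dual-rank 16 GiB configurations in Table 3 (row bits 33–18), the row index can be extracted as follows. A sketch of ours; the address must already be offset-corrected as in Section 4.2:

#include <stdint.h>

/* Row bits 33-18 for a 16 GiB dual-rank DIMM (Table 3): extract the
 * row index from an offset-corrected physical address. */
#define ROW_SHIFT 18
#define ROW_BITS  16

static uint64_t row_index(uint64_t phys) {
  return (phys >> ROW_SHIFT) & ((1ULL << ROW_BITS) - 1);
}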
4.4 Enabling Exploitation

On our Intel Coffee Lake system, the bank, bank group, and rank bits all fall within the lower 21 bits, i.e., within a transparent huge page (THP). However, we noticed that the address functions on AMD Zen 2 and Zen 3 systems can cover bits up to bit 34 (see Table 3). This makes exploitation without knowing these bits challenging. Previous methods assume DRAM functions with all addressing bits falling in the lower 21 bits [24], do not take advantage of THPs [25], or color THPs for other purposes such as cache eviction [9]. We now describe how the bank conflict side channel and the reverse-engineered DRAM mappings can be combined to detect consecutive same-bank rows, which is crucial for Rowhammer attacks.

Coloring THPs. We allocate 256 MiB of 2 MiB-aligned memory and turn it into 2 MiB THPs by using madvise. We then iterate in steps of 2 MiB over the allocated memory such that the 21 lower bits are always the same. As the upper physical address bits are unknown, we cannot directly apply our recovered address functions. Instead, we use the bank conflict side channel to measure whether the current THP conflicts with any other THP we found before. If two THPs conflict, we assign them the same color; otherwise, we assign a new color to the current THP. This approach allows us to assign a color to each THP based on the unknown upper physical address bits.

Detecting same-bank rows. Given that THPs are 2 MiB contiguous memory regions, we know that the lower 21 physical and virtual address bits are the same. Thus, we can group the THPs of the same color and use our recovered address functions on the lower bits to address consecutive same-bank rows. For that, we iterate over the row index bits that fall into the lower 21 bits. As they may overlap with bank address bits, staying within the same bank may require flipping lower (non-overlapping) bits. As the values of the DRAM functions for all THPs with the same color are identical, we can use the same THP row offsets for all THPs of the same color. Finally, we validate the row addresses using our bank conflict side channel and discard all THPs where any two rows do not cause bank conflicts. A sketch of the coloring step follows.
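A minimal sketch of the coloring loop described above. It assumes the same_bank() timing primitive sketched in Section 2.1 and elides error handling; all names are ours, not from the ZenHammer code base:

#include <stddef.h>
#include <sys/mman.h>

#define THP_SIZE   (2UL * 1024 * 1024)
#define REGION_SZ  (256UL * 1024 * 1024)
#define MAX_COLORS 64

extern int same_bank(volatile char *a, volatile char *b);

/* Assign each 2 MiB THP a color based on its (unknown) upper physical
 * address bits: THPs that show a bank conflict at the same in-THP
 * offset share a color. Returns the number of colors found. */
int color_thps(int colors[REGION_SZ / THP_SIZE]) {
  char *mem = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  madvise(mem, REGION_SZ, MADV_HUGEPAGE);  /* back the region with THPs */
  volatile char *rep[MAX_COLORS];          /* one representative per color */
  int n_colors = 0;
  for (size_t off = 0; off < REGION_SZ; off += THP_SIZE) {
    volatile char *thp = (volatile char *)mem + off;  /* same low 21 bits */
    int c = 0;
    while (c < n_colors && !same_bank(thp, rep[c])) c++;  /* conflict? */
    if (c == n_colors && n_colors < MAX_COLORS) rep[n_colors++] = thp;
    colors[off / THP_SIZE] = c;
  }
  return n_colors;
}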
Results. We measured how long the coloring and the detection of same-bank rows take on our Zen 3 system with a dual-rank DIMM (S2 in Table 4). The THP coloring took on average 39.23 s and must be repeated for each attack, as the THP allocation in physical memory changes. Detecting same-bank rows for each THP color is a one-time cost that can be precomputed for each system memory configuration and took on average 18 ms.

4.5 Evaluation

In addition to the physical validation of our mappings, we use Rowhammer on our AMD systems with non-uniform hammering patterns [17] to see if we can trigger bit flips, as this requires precise DRAM addressing. Further, we evaluate the recent Half-Double patterns [24].

Threat Model. In our evaluation, we assume that the CPU model of the target machine is known to the attacker and that they have obtained the correct DRAM address mappings, for example, using DARE. We further assume that an unprivileged attacker can execute programs on the victim's machine but does not know anything more specific about the DRAM devices (e.g., the DRAM chip manufacturer).

Setup. We modify the reference implementation of Blacksmith [17] (https://github.com/comsec-group/blacksmith). Our changes include adding the address mappings we found previously and other necessary platform changes, such as timing thresholds, as the fuzzer was originally designed for an Intel Coffee Lake system. However, we do not apply any microarchitecture-specific optimizations. For the evaluation, we do six-hour fuzzing runs on both Z2 and Z3 with the ten DDR4 DIMMs listed in Table 4 that we ordered randomly from an online retailer. These DIMMs cover the three major DRAM manufacturers. To allow comparison with Intel, we further run the same code on the same DIMMs on a Coffee Lake (Core i7-8700K) machine.
Table 4. DDR4 UDIMMs used in the evaluation of our AMD Zen-optimized Rowhammer fuzzer. We abbreviate the DRAM vendors Samsung (S), SK Hynix (H), and Micron (M). For each device, we report the number of ranks (RK), bank groups (BG), banks per bank group (BA), and rows (R).

  ID   Production Date   Freq. [MHz]   Size [GiB]   DIMM Geometry (RK, BG, BA, R)
  S0   Q3-2020†          2132           8           (1, 4, 4, 2^16)
  S1   Q3-2020†          2132          16           (2, 4, 4, 2^16)
  S2   Q2-2020           2666          32           (2, 4, 4, 2^17)
  S3   Q4-2017           2400           8           (1, 4, 4, 2^16)
  S4   Q3-2020†          2666           8           (1, 4, 4, 2^16)
  S5   Q2-2020           2666          16           (2, 4, 4, 2^16)
  H0   Q3-2020†          2132          16           (2, 4, 4, 2^16)
  H1   Q4-2020           2400           8           (1, 4, 4, 2^16)
  M0   Q1-2020           2666           8           (1, 4, 4, 2^16)
  M1   Q1-2020           2400           8           (1, 4, 4, 2^16)
  † Purchase date used as production date unavailable.
Table 5. Result of running Blacksmith with our address mappings and platform fixes (e.g., thresholds) on AMD Zen 2 and Zen 3 systems, compared to our Intel Coffee Lake baseline. We report for each device the number of patterns found (|P+|) and the number of bit flips over all patterns (|Ffuzz|). We omit devices without any bit flips.

        Zen 2             Zen 3             Coffee Lake
  ID    |P+|   |Ffuzz|    |P+|   |Ffuzz|    |P+|   |Ffuzz|
  S0     14        19      0         0      122     3,502
  S1      4         4      0         0      102     1,374
  S2     14        28      0         0      782    22,339
  S3      0         0      0         0        3         3
  S4      4         5      0         0       47       654
  S5      6         7      0         0      155     4,131
  H1      0         0      0         0       24        35
  M1      0         0      0         0       16        23

Results. The result of our evaluation is presented in Table 5. It shows that with our minimal changes, we can trigger bit flips on our Zen 2 system, however, only on 5 of 10 modules. We could not find any effective patterns on Zen 3. This is much lower than the 8 of 10 modules on the Intel Coffee Lake platform. We further note that the number of patterns found in the worst case (S2) is roughly 50x smaller on Zen 2 (14 patterns) than on Coffee Lake (782 patterns).

We also tested Half-Double [24] patterns on all DDR4 devices with our address mappings and the reference implementation (https://github.com/IAIK/halfdouble). As we did not find any bit flips on our devices using these patterns, and Half-Double has not been shown to be exploitable on x86-64 machines, we disregard these patterns in the remainder of this work and base ZenHammer on non-uniform Rowhammer patterns.

Based on our results, we conclude that the common hammering instruction sequence as used by Blacksmith [17] and others [10] encodes implicit assumptions about the underlying Intel microarchitecture. Our results show that this significantly affects Rowhammer's effectiveness on other platforms, such as the AMD systems targeted in this work. Motivated by this, we investigate the two crucial aspects of hammering, namely, refresh synchronization (Section 5) and the activation rate (Section 6) on AMD systems, and show how ZenHammer can improve them.

5 Refresh Synchronization

As shown by previous work [9, 10, 12, 17], it is essential to synchronize Rowhammer patterns with refresh commands. This is necessary as in-DRAM mitigations (i.e., TRR) have been shown to act during REFs. Synchronization is commonly done by detecting spikes in memory access latency, which correspond to when DRAM is briefly unavailable during refreshes [9, 10]. In this section, we investigate whether the refresh synchronization mechanism used by Blacksmith is effective on AMD Zen-based systems.

5.1 Blacksmith Synchronization

In Listing 1, we present Blacksmith's synchronization routine, which uses two same-bank rows. This method relies on RDTSCP to capture timestamps, LFENCE to serialize the execution stream, and CLFLUSHOPT to immediately flush accessed rows. It assumes a REF has been detected whenever the timing measurement exceeds a predefined threshold.

Listing 1. Refresh synchronization routine as used by Blacksmith.

void ref_sync_original(volatile char* rows[2]) {
  while (true) {
    uint64_t start = rdtscp(); /* START TIMER */
    lfence();
    *rows[0]; *rows[1];
    clflushopt(rows[0]); clflushopt(rows[1]);
    uint64_t stop = rdtscp(); /* STOP TIMER */
    lfence();
    if ((stop - start) > THRESHOLD) break;
  }
}

Evaluation. To detect whether synchronization works properly, we evaluate the time between detected refreshes, both on Z+ and Z3. When refresh commands are correctly detected, we expect the time between them to be around 7.8 µs, i.e., tREFI as specified by the DDR4 standard [18].

Results. The experiment results, each with 10 K iterations, are presented in Figure 4. The median latencies are 7.62 µs for Z+ and 5.37 µs for Z3. (For a fair comparison with Blacksmith, which uses AsmJit [23] to just-in-time (JIT) compile hammering patterns and their synchronization from x86-64 assembly, we implement all routines using AsmJit; we show equivalent C representations throughout this paper.) While the data for the Zen+ system suggests that this method works quite reliably, REFs are often detected too early on Zen 3. This could be for two reasons: either the refresh detection fails most of the time, or the memory controller schedules REFs opportunistically. The latter is possible because the DDR4 standard [18] only specifies the average time between refresh commands and allows for some flexibility. In the following section, we show that it is possible to detect the majority of refreshes reliably, whereas the original refresh synchronization method is inadequate on our AMD platforms.
[Figure 4: Measured time between successive REFs using the refresh synchronization routine ref_sync_original(), for both Z+ and Z3. The number of samples (y-axis) is logarithmically scaled.]

5.2 Precise and Reliable Synchronization

We analyzed Blacksmith's refresh synchronization routine, as used by ZenHammer, to identify possible measurement errors. By looking at the source code (Listing 1), we identified a brief time window that is not measured — between the stop timestamp of one iteration and the start timestamp of the next — in which the fencing (lfence) happens. As the memory controller has some flexibility for scheduling refresh commands, a REF can remain undetected if it falls into this untimed gap. Furthermore, the memory controller may schedule REF commands opportunistically during flush instructions, reducing the accuracy of detecting the REF commands.

Continuous Measurements. To mitigate this issue, we propose a modified refresh synchronization routine with continuous, non-repeating timing measurements: each recorded timestamp serves as both the end time of the current measurement round and the start time of the next. This ensures that all instructions are included in the timing measurement. To ensure that the memory controller does not opportunistically schedule REF commands during the flush instructions, we avoid flushing during the synchronization phase. We solve this by designing a new method that allows a flexible number of rows and measures the latency of each memory access individually.

Avoiding Cache Hits. Without CLFLUSHOPT during synchronization, our code must only access different rows so as not to incur cache hits. To evict the cache lines for the subsequent synchronization phase, we flush the accessed rows after the REF is detected. Our continuous, non-repeating timing measurement routine is presented in Listing 2.

Listing 2. Our continuous, non-repeating refresh synchronization.

void ref_sync_nonrep(volatile char* rows[64]) {
  uint64_t prev = rdtscp();
  for (size_t i = 0; i < 64; i++) {
    *rows[i];
    uint64_t curr = rdtscp();
    if ((curr - prev) > THRESHOLD) break;
    prev = curr;
  }
  // REF detected here (or ran out of rows)
  for (size_t i = 0; i < 64; i++) clflushopt(rows[i]);
}

Evaluation. We evaluate our new routine using the same experiment as before. We show the obtained distribution of measured REF-to-REF intervals in Table 6. The results demonstrate that when more than 32 rows are employed in the synchronization, we correctly identify refreshes on all our systems. This means that a sufficient number of unique rows is necessary to cover an entire refresh interval (i.e., 7.8 µs) before falling through the end of the detection loop.

Table 6. REF-to-REF interval when using the continuous, non-repeating timing measurement routine (ref_sync_nonrep) for different numbers of rows on Z+ and Z3. We identify as outliers all values that differ by more than 10 % from the median.

           Median [µs]      Outliers [%]
  #Rows    Z+      Z3       Z+      Z3
  16       2.01    2.62      7.3    24.7
  32       1.19    4.41     43.4    71.4
  64       7.81    7.77      0.3     0.6
  128      7.93    7.85      0.3     0.7
  256      7.80    7.71      0.2     0.7
  Orig.†   7.62    5.37      1.1    93.4
  † The original refresh sync. routine with 2 rows (see Figure 4).

Observation 3. Continuous, non-repeating time measurements strongly improve the reliability of our refresh command detection.

6 Activation Rate

We noticed that the number of tested patterns on the AMD systems is significantly lower than on the Intel Coffee Lake baseline during fuzzing, on average by 45 % (Z2) and 52 % (Z3). As we fuzz for a fixed period (6 h) while hammering each pattern for 5 M activations, this suggests that each individual pattern takes significantly longer to hammer. To investigate this, we measure hammering execution times to compute the average number of activations per refresh interval (ACTs/tREFI) for each pattern. We present the comparison between Z+, Z3, and Coffee Lake in Figure 5. The data shows that the average numbers of ACTs/tREFI achieved on Z+ (41.9) and Z3 (37.2) are only about half of that on Coffee Lake (76.8). The lower activation rate on the AMD systems has a direct impact on Rowhammer, as discussed next.

[Figure 5: Distribution of the activation rates of non-uniform hammering patterns on Z+, Z3, and Intel Coffee Lake (CL). The whiskers indicate the minimum and maximum values.]
Hammer Count Estimation. We now approximate the hammer count (HC) that a victim row is subjected to given these activation rates. The estimation is made over a refresh window, as the bit flip needs to happen before the refresh of the victim row. In a refresh window, there are 8192 refresh intervals (tREFI). For an activation rate of 40 ACTs/tREFI, this results in a maximum of 328 K row activations before the victim row is refreshed. We consider a device with a Rowhammer mitigation that keeps track of 16 aggressors at a time [12]. To perform effective double-sided Rowhammer on such a device, we need to hammer 18 rows: two aggressors and 16 dummy rows [10]. Assuming that we hammer the rows uniformly, this results in a cumulative HC of 36 K for the victim row. This is smaller than the minimum hammer count (HC_min) reported for many DDR4 devices by previous work [21, 26]. Therefore, based on these estimates, activation rates are insufficient to induce Rowhammer bit flips on many devices from AMD Zen-based CPUs. Thus, we aim to improve activation rates to enable more effective hammering. To this end, we first analyze possible hammering instruction sequences to find the optimal way to hammer.
(hitting the cache) for a higher activation rate [8].
6.1 Instruction Sequences

Existing studies [8, 9, 13, 33] proposed and evaluated different hammering sequences. For example, Cojocar et al. [8] showed that the sequence of machine instructions used for hammering affects the rate of activations. As they performed their experiments on Intel CPUs, it is unlikely that their results transfer to our AMD processors. Therefore, we perform our own analysis of possible instruction sequences.

We start by analyzing the standard instruction sequences used by ZenHammer. They flush the cache directly after each access ("scatter" [9]) and fence (MFENCE) after each flush ("fence each"). However, this instruction sequence might not be optimal on AMD systems, as our earlier results suggest. In the following (a–e), we present the fundamental building blocks of possible hammering instruction sequences; a sketch contrasting the two flushing styles follows after item (e).

a. Cache Flushing. Because a hammered aggressor is cached, we need to ensure that subsequent hammering accesses are fetched from DRAM again. For flushing aggressors from the cache, we can use CLFLUSH or the optimized CLFLUSHOPT. The latter avoids serialization, which improves concurrency when used back-to-back [8, 13]. Depending on the Rowhammer pattern, we might have some flexibility in deciding when to issue the flushing instructions: either batched together for all aggressors at the end of the pattern (i.e., "gather"), or directly after each memory access (i.e., "scatter") [9].

b. Memory Barriers. To ensure that aggressors are flushed from the cache before they are accessed again, existing approaches rely on memory barriers. For example, MFENCE serializes all preceding loads and stores, LFENCE serializes all preceding loads, and SFENCE serializes all preceding stores. Given our "scattered" flushing, fences can either be placed after every flush ("fence each") or only once at the end ("fence once"). Lastly, we may omit fences to sacrifice some accesses (hitting the cache) for a higher activation rate [8].

c. Access Types. Typically, load instructions are used to execute Rowhammer patterns on regular x86-64 machines. This is necessary because the DRAM activate command that triggers the Rowhammer effect cannot be issued directly. Instead of loads, Rowhammer is also possible using store operations, as they also induce row activations [8].

d. Non-Temporal Instructions. The x86 ISA specifies non-temporal instructions that bypass CPU caches entirely, thus avoiding cache flushing [33]. However, they either require non-standard write-combining (WC) memory (MOVNTDQA), may prefetch data from the L3 cache instead of accessing DRAM (PREFETCHNTA), or can be cached in WC buffers (MOVNTI).

e. Vector Instructions. The gather family of AVX2 load instructions can be used to load data from a non-contiguous address list. As an example, VPGATHERDD loads up to eight 32-bit values simultaneously [3]. This method still requires cache-flush instructions and possibly memory barriers.
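The following sketch is ours, simplified to uniform hammering, and contrasts the two flushing styles from (a) using CLFLUSHOPT; fence placement variants are explored in Section 6.2:

#include <x86intrin.h>   /* _mm_clflushopt, _mm_mfence */

/* "Scatter": flush each aggressor right after accessing it. */
void hammer_scatter(volatile char **aggs, int n) {
  for (int i = 0; i < n; i++) {
    (void)*aggs[i];                      /* load -> row activation */
    _mm_clflushopt((void *)aggs[i]);     /* flush immediately */
    _mm_mfence();                        /* "fence each" variant */
  }
}

/* "Gather": batch all flushes at the end of the pattern. */
void hammer_gather(volatile char **aggs, int n) {
  for (int i = 0; i < n; i++) (void)*aggs[i];
  for (int i = 0; i < n; i++) _mm_clflushopt((void *)aggs[i]);
  _mm_mfence();                          /* "fence once" variant */
}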
Evaluation. We implement an experiment to evaluate the performance of the various instruction sequences. For this, we pick N random row addresses and access them in a loop for 10 M memory accesses while recording the elapsed time. Later, we use the measured time to compute the activation rate. Note that N also corresponds to the distance between consecutive accesses to the same row, which we sweep between 1 and 256 to cover the various distances in non-uniform patterns. The result from Z3 is visualized in Table 7 (with similar results on Z+).

Table 7. Heatmap of memory access rates (in ACTs/tREFI) for different instruction sequences and varying numbers of accessed rows on the Z3 system. We omit unsuitable sequences with a low throughput (≤ 100 ACTs/tREFI) and sequences indicating cache hits with a very high throughput (≥ 1000 ACTs/tREFI), and provide the complete table in Appendix B. Fence types: M = MFENCE, L = LFENCE, S = SFENCE, — = no fence.

  Access Type   Flushing   Fence              #Rows
                Strategy   Type    1    2    4    8   16   32  256
  MOV (load)    gather     M      24   49   71   91  100  110  114
  MOV (load)    gather     L      24   49   80  113  134  147  121
  MOV (load)    gather     S      24   49   80  113  133  146  125
  MOV (load)    gather     —      24   49   80  113  133  146  125
  MOV (load)    scatter    M      24   49   79  107  126  143  157
  MOV (load)    scatter    L      24   49   95  137  149  153  159
  MOV (load)    scatter    S      24   48   97  154  159  159  159
  MOV (load)    scatter    —      24   49   97  154  159  159  159
  PREFETCHNTA   scatter    —      80  132  191  208  253  309  273
  PREFETCHNTA   scatter    M      24   49   80  108  131  170  284
  VGATHERDD     scatter    —      24   49   79  112  159  159  159

From these results, we derive the following six observations (O1–O6) and three concrete recommendations (R1–R3):

(O1) Non-temporal instructions hit caches: Non-temporal instructions such as PREFETCHNTA have access rates exceeding the available bandwidth, thus suggesting cache hits. Therefore, we disregard such instructions.

(O2) More rows increase the ACT rate: Using more rows almost always increases the rate of memory accesses, except for "fence each" sequences.

(O3) CLFLUSHOPT is slightly faster than CLFLUSH: In most cases, there is no measurable difference between them. In a few cases, CLFLUSHOPT produces up to 5 % higher activation rates.

R1. Always use CLFLUSHOPT over CLFLUSH to maximize the activation rates.

(O4) "scatter" is always faster than "gather": The "scatter"-style cache flushing always produces higher access rates than the equivalent "gather" sequence. Further, as our non-uniform frequency-based patterns may hammer the same aggressors multiple times consecutively, we consider only "scattered" flushes.

R2. Schedule cache flushes in a "scatter" style, i.e., flush immediately after accessing an aggressor.

(O5) Loads are always faster than stores: All sequences using store instructions result in low memory access rates (up to 76 ACTs/tREFI), with reductions between 5 % and 56 % compared to equivalent load sequences.

R3. Always prefer load instructions over store instructions for hammering to optimize activation rates.

(O6) AVX instructions are fast but complex to implement: The AVX VPGATHERDD instruction produces memory access rates comparable to regular loads (i.e., MOV). However, it is more complex to implement than regular loads. This is especially the case for non-uniform patterns that hammer aggressors with different frequencies and with flushes in between.

Based on these results, we exclude CLFLUSH, "gather"-style flushing, stores, non-temporal accesses, and AVX2 vector instructions in the remaining experiments. Thus, we will focus on CLFLUSHOPT, "scatter"-style flushing, and the different types of fences for loads (MFENCE/LFENCE and "no fence").

We further run this experiment on Intel Coffee Lake to allow comparison with the results of our AMD systems. We provide the full results in Appendix B. They show that the activation rates on Coffee Lake are generally higher for all tested configurations.

Ordering of Loads and Cache-Flushes. We notice that the sequences without memory barriers ("no fence") do not exceed the activation rate of sequences with fences. This suggests that memory loads are served by DRAM and, consequently, that load-flush-load sequences to the same address are strongly ordered. This is surprising, as AMD documents loads only to be ordered with same-cacheline stores [4].

To confirm our observation that all load requests are served by DRAM, we use the CPU's performance counters to measure the number of data cache fills by DRAM. On Z3, we find that the number of measured cache fills does not differ between sequences with and without memory barriers. Moreover, this value is equal to the number of loads issued while hammering. On Z+ and Z2, in contrast, this is not the case, and we observe up to 70 % cache hits in some cases.

Observation 4. Memory load requests following a CLFLUSH(OPT) to the same cache line never incur cache hits on Zen 3, but do incur cache hits on Zen+ and Zen 2.

As omitting all fences leads to very high activation rates without incurring cache hits (on Z3), it seems like the optimal choice for efficient Rowhammering. However, omitting memory barriers allows reordering the accesses of different aggressors. The reason is that load and flush instructions are not ordered between different cache blocks [4], and thus may be rearranged by the processor. This can hinder us in effectively bypassing some TRR mitigations, which are sensitive to the order of row activations [12]. Therefore, we need to determine the optimal balance between high activation rates and strict ordering.

6.2 Fence Scheduling Policies

Based on Observation 4, we hypothesize that we may omit some fences to speed up pattern execution while keeping others to preserve sufficient ordering. To explore this trade-off between high activation rates and strict ordering of memory accesses, we propose six different fence scheduling policies (SPs), which are summarized in Table 8. Besides the two simple policies, no fences (SPnone) and fencing after every access (SPfull), we propose four policies that take the pattern's structure into account: fencing after every base period (SPBP) or every half base period (SPBP/2), fencing between aggressor pairs (SPpair), and fencing between repetitions of the same aggressors (SPrep). Some scheduling policies are cache-avoiding, i.e., they strongly order all consecutive accesses to the same aggressor. However, we still consider all policies on all our systems, as previous work has shown that omitting fences can lead to both higher activation rates [8] and more bit flips [43] despite possibly incurring cache hits.

Table 8. Overview of our proposed fence scheduling policies. We indicate which policies are pattern-aware by taking the pattern's structure into account and which are cache-avoiding.

  Policy   Fencing Frequency / Example            Pattern-Aware   Cache-Avoiding
  SPnone   no fences within pattern               ✘               ✘
  SPBP     between base periods                   ✔               ✘
  SPBP/2   every half base period                 ✔               ✘
  SPpair   between different aggr. pairs†         ✔               ✘
           Ex.: | a1 a2 a1 a2 | a3 a4 |
  SPrep    between aggr. pair repetitions         ✔               ✔
           Ex.: | a1 a2 | a1 a2 | a3 a4 |
  SPfull   after every access                     ✘               ✔
           Ex.: | a1 | a2 | a1 | a2 | a3 | a4 |
  † In Blacksmith's terminology [17]: rows that are 2 rows apart and have the same frequency, phase, and amplitude.
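As an illustration, here is our simplified, uniform rendering of SPpair from Table 8: accesses within an aggressor pair run unfenced, and a single fence orders the transition between different pairs.

#include <x86intrin.h>

/* SPpair sketch: hammer each double-sided pair back to back, with one
 * MFENCE only between different pairs, e.g., | a1 a2 a1 a2 | a3 a4 |. */
void hammer_sp_pair(volatile char **pairs, int n_pairs, int reps) {
  for (int p = 0; p < n_pairs; p++) {
    for (int r = 0; r < reps; r++) {
      (void)*pairs[2 * p];
      (void)*pairs[2 * p + 1];
      _mm_clflushopt((void *)pairs[2 * p]);
      _mm_clflushopt((void *)pairs[2 * p + 1]);
    }
    _mm_mfence();   /* order the switch to the next aggressor pair */
  }
}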
Evaluation. We evaluate the effectiveness of our fence scheduling policies in two ways. To begin with, we build a theoretical model for the amount of ordering provided by the different scheduling policies and contrast it with the hammering speeds obtained with the respective policies on our systems, as described in Appendix C. The results show that SPpair and SPrep can provide significantly higher activation rates than SPfull without allowing significant reordering. To validate our theoretical model against the real world, we perform 6 h of fuzzing for each of our ten DIMMs (Table 4). We employ the two proposed policies SPpair and SPrep, and for comparison SPnone and SPfull. As the activation rate experiment (Section 6.1) was inconclusive in determining which memory barrier is optimal, we randomize the fence type between MFENCE and LFENCE.

In Figure 6, we show the results of our experiments. We present how many configurations generated at least one effective hammering pattern per vendor, normalized by the number of DIMMs from that vendor. These results describe which configuration is most widely effective for each DRAM vendor.

[Figure 6: Comparison of the four evaluated scheduling policies (SPnone, SPpair, SPrep, SPfull) grouped by vendors, normalized by the number of devices per vendor. The dashed areas indicate how often each policy was the best in the number of effective patterns. The percentages per vendor sum up to the total percentage of devices with bit flips.]

From the data, we observe that fencing is not strictly required, as SPnone found bit flips on all devices from Samsung on both Zen 2 and Zen 3. However, SPpair is the most effective policy on Zen 2 across most devices (75 %). The same, though less pronounced, applies to Zen 3.

Observation 5. For Samsung devices, the scheduling policy SPpair is the most widely applicable (across devices) and most effective (across patterns).

For SK Hynix devices, we can see that SPpair works on all tested devices. We have also found effective patterns with SPnone and SPrep on half of all devices.

Observation 6. For SK Hynix devices, choosing SPpair works best across different devices.

Lastly, we have not found any effective hammering pattern for Micron devices using SPnone, which indicates that ordering is essential for these chips. This behavior could be explained by the type of deployed in-DRAM mitigation: Rowhammer mitigations that sample rows with non-uniform probabilities are harder to evade if the accesses are uncontrollably reordered.

Observation 7. Preserving ordering in hammering patterns is essential on Micron devices.

As the results show that the best scheduling policy may vary between devices from the same vendor, we do not incorporate vendor-specific policies in ZenHammer.

7 Evaluation

In this section, we compare ZenHammer, especially designed for Rowhammer on Zen-based systems, to the baseline established on Intel in Section 4.5. In addition, we assess the impact of our optimizations on the effectiveness of ZenHammer and evaluate the exploitability of the discovered bit flips. We first describe our evaluation setup and methodology (Section 7.1) and then present and discuss the results (Section 7.2). We conclude by applying ZenHammer to DDR5 devices (Section 7.3).

7.1 Setup and Methodology

For our evaluation, we pick the same previously used DDR4 devices (Sections 4 and 6), covering DRAM chips from all three major DRAM manufacturers: Samsung (S), SK Hynix (H), and Micron (M). For establishing the Intel baseline, we used an Intel Core i7-8700K. The AMD Zen 2 and Zen 3 machines are equipped with the CPUs listed in Table 1. All machines use default UEFI settings and device timings.

In line with previous work [10, 17], we evaluate ZenHammer in three stages: (i) fuzzing for 6 h using ZenHammer for each configuration (i.e., fence scheduling policy), (ii) determining the best pattern using a minisweep over all effective patterns by moving each pattern over a physically contiguous 4 MiB of memory, and (iii) sweeping the best pattern found over a physically contiguous 256 MiB memory range to assess the device's vulnerability level and the bit flips' exploitability. We note that our approach does not rely on any DRAM device-specific knowledge, as we tested all fence scheduling policies and fence types on each device to determine the optimal per-device configuration (see Section 6).
Table 9. ZENHAMMER results on AMD Zen 2 and Zen 3 as well as Intel Coffee Lake. For each of our ten devices, we report the best scheduling policy (SPopt) and the number of effective patterns (|P+|) and bit flips (|Ffuzz|) found while fuzzing with the best policy. We also show the number of bit flips found when sweeping the best patterns over a 256 MiB range (|Fswp|).

              Zen 2                              Zen 3                              Coffee Lake
ID    SPopt    |P+|   |Ffuzz|   |Fswp|   SPopt    |P+|   |Ffuzz|   |Fswp|   SPopt    |P+|   |Ffuzz|   |Fswp|
S0    SPrep      51      151    6,945   SPnone     31      124   17,775   SPfull    122    3,502    6,782
S1    SPrep      26       97    1,758   SPpair     25      144   15,613   SPfull    102    1,374   10,106
S2    SPnone     97    1,685   12,893   SPnone     45      471   79,306   SPfull    782   22,339    1,708
S3    SPnone      8       15    2,020   SPpair      1        1      667   SPfull      3        3        0
S4    SPnone     60      182    1,183   SPpair     43      297       13   SPfull     47      654   18,357
S5    SPnone     25       83    1,911   SPpair     26       87   10,741   SPfull    155    4,131    5,860
H0    SPnone      6       13      182   –           0        0        0   –           0        0        0
H1    –           0        0        0   –           0        0        0   SPfull     24       35        0
M0    –           0        0        0   –           0        0        0   –           0        0        0
M1    –           0        0        0   –           0        0        0   SPfull     16       23        2
Table 10. Analysis of the bit flip exploitability found during the sweep over 256 MiB on AMD Zen 2, Zen 3, and Intel Coffee Lake. For each attack, we indicate the number of exploitable bit flips (#Ex.) and the average time to find an exploitable bit flip (Time). We mark DIMMs with a single exploitable bit flip by (*). We omit DIMMs without any exploitable bit flips.

                       PTE [36]                              RSA-2048 [34]                           sudo [11]
DIMM    Zen 2        Zen 3          Coffee Lake   Zen 2        Zen 3          Coffee Lake   Zen 2     Zen 3        Coffee Lake
        #Ex. Time    #Ex. Time      #Ex. Time     #Ex. Time    #Ex. Time      #Ex. Time     #Ex. T.   #Ex. Time    #Ex. Time
S0         7 6m 4s      7 2m 55s       3 4m 15s     17 2m 47s    37 46s         14 1m 36s    –   –      4 3m 13s    1* 23m 49s
S1        90 9s      1474 2s         846 2s          6 2m 2s     27 30s         21 26s       –   –      1* 6m 50s   1* 1m 20s
S2       641 21s     5326 1s         126 11s        30 2m 16s   170 6s           6 1m 59s    –   –     12 1m 17s    –  –
S3       142 9s        61 32s          – –           7 2m 21s     – –            – –         –   –      – –         –  –
S4       220 28s        3 23m 52s   2658 1s          7 12m 29s    1* 23m 52s    53 26s       –   –      – –         4  5m 16s
S5       102 6s       625 2s         330 4s          6 1m 14s    28 33s         11 1m 5s     –   –      2 5m 58s    3  2m 34s
H0        11 53s        – –            – –           – –          – –            – –         –   –      – –         –  –
7.2 Effectiveness and Exploitability

The results of our evaluation are presented in Table 9. We show, for each tested platform (AMD Zen 2 and Zen 3, Intel Coffee Lake) and each DDR4 device, the number of effective patterns found (|P+|) and the number of bit flips (|Ffuzz|) found during fuzzing with the device's best fence scheduling policy (SPopt), which we used in all three stages. For Intel Coffee Lake, we assumed the scheduling policy SPfull, which corresponds to the one used by the original Blacksmith fuzzer.

We also show, for the best pattern, the total number of bit flips over the swept 256 MiB of physically contiguous memory (|Fswp|), which we then use to assess the exploitability of three known Rowhammer end-to-end attacks in Table 10.

For the exploitability analysis, we follow prior work [7, 10, 17] and use the Rowhammer attack simulation framework Hammertime [37] to estimate the required time for three previously proposed Rowhammer attacks targeting (i) page table entries (PTEs) to craft an arbitrary memory read/write primitive [36], (ii) RSA-2048 keys to break SSH public-key authentication [34], and (iii) the sudo binary to elevate privileges to the root user [11]. We use the bit flips we found during the sweep with the best pattern to perform the exploitability analysis.

Results. Our results in Table 9 show that our Zen-based platform optimizations have strongly improved the number of devices we can trigger bit flips on, from 5 and 0 devices before any optimizations (see Table 5) to 7 and 6 devices afterward, for Zen 2 and Zen 3, respectively. The number of effective hammering patterns found further increased drastically, in the best case (S2) by roughly six times (from 14 to 97). Moreover, the results on Zen 3, where we had not found any bit flips previously, stress the need for our optimizations to trigger any bit flips on the AMD Zen 3 platform. This shows that the hammering instruction sequence and fence scheduling policy are important when adapting Rowhammer attacks to new platforms.

Nevertheless, we note that there are still strong differences in terms of hammering effectiveness between AMD and Intel. On Intel, four of eight DIMMs have a higher bit flip count in the sweep than the same devices on Zen 2. Interestingly, there is one device (H0) where we could not find any bit flip on Coffee Lake while ZENHAMMER is successful on Zen 2. Generally, our optimizations seem to be more effective on Zen 3, where the number of bit flips of the best pattern during the sweep is in 5 out of 6 cases higher than on Coffee Lake. In the best case (S2), we find 46x more bit flips on Zen 3 (79,306) than on Coffee Lake (1,708). These results suggest that the effectiveness of a Rowhammer attack does not entirely depend on the activation rate, which is generally higher on Coffee Lake than on Zen 3, but also on enforcing the order of aggressor accesses (i.e., the fencing policy) and CPU-specific memory controller optimizations.
Table 11. Reverse engineered address mappings and offsets for our Zen 4 (Ryzen 7 7700X) system. All memory configurations are single-channel, single-DIMM, with the tuple indicating the DIMM's geometry (#subchannels, #ranks, #bank groups, #banks per bank group, #rows).

Geometry             Size    Offt.                                 DRAM Address Functions                                              Row
(SC,RK,BG,BA,R)     [GiB]   [MiB]   Subchannel    Rank          Bank Group (BG)                         Bank Address (BA)             Bits
(2, 1, 4, 4, 2^16)      8    2048   0x1fffe0040   n/a           0x088880100, 0x111100200                0x022220400, 0x044440800      32–17
(2, 1, 8, 4, 2^16)     16    2048   0x3fffc0040   n/a           0x042100100, 0x084200200, 0x108401000   0x210840400, 0x021080800      33–18
(2, 2, 8, 4, 2^16)     32    2048   0x7fff80040   0x000040000   0x084200100, 0x108400200, 0x210801000   0x421080400, 0x042100800      34–19
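To illustrate how the mappings in Table 11 are meant to be evaluated, the sketch below computes the DRAM address bits for the 8 GiB configuration: each function bit is the XOR (parity) of the physical address bits selected by the corresponding mask. Only the masks and the offset value come from the table; treating the 2048 MiB offset as a correction subtracted from addresses above the 4 GiB remapping boundary is our interpretation.

```cpp
#include <cstdint>

// Each DRAM address bit is the parity (XOR) of the physical address
// bits selected by its mask.
static inline unsigned fn_bit(uint64_t paddr, uint64_t mask) {
  return __builtin_parityll(paddr & mask);
}

struct DramAddr { unsigned sc, bg, ba; uint64_t row; };

// Sketch for the 8 GiB Zen 4 configuration from Table 11. Subtracting
// the 2048 MiB offset from addresses above 4 GiB is an assumption on
// our part about how the reported offset is applied.
DramAddr map_zen4_8gib(uint64_t paddr) {
  if (paddr >= (1ULL << 32)) paddr -= (2048ULL << 20);
  DramAddr d;
  d.sc = fn_bit(paddr, 0x1fffe0040ULL);               // subchannel
  d.bg = fn_bit(paddr, 0x088880100ULL) |              // bank group (2 bits)
         (fn_bit(paddr, 0x111100200ULL) << 1);
  d.ba = fn_bit(paddr, 0x022220400ULL) |              // bank address (2 bits)
         (fn_bit(paddr, 0x044440800ULL) << 1);
  d.row = (paddr >> 17) & 0xffff;                     // row: bits 32-17
  return d;
}
```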
Exploitability Analysis. The larger number of bit flips after our optimizations strongly facilitates exploitation, as we show in Table 10. The PTE attack by Seaborn and Dullien [36] can be exploited in the best case in around one second on both Zen 3 and Coffee Lake. Due to the lower number of exploitable bit flips on Zen 2, we need in the best case six times as long (6 s) as on the two other systems. There is one device (S3) where exploitation is not possible at all on Coffee Lake due to missing bit flips, but on Zen 2 and Zen 3 we can find exploitable bit flips in 9 s and 32 s, respectively. We note that even if the number of bit flips is very low (e.g., 3 bit flips on S4, Zen 3), we were still able to exploit the system in a practical time (23 m 52 s).

The RSA-2048 key attack [34] is, on 4 of 5 exploitable devices, on average 38 s faster on Zen 3 than on Coffee Lake. Overall, the average time to find an exploitable bit flip is 3 m 52 s, 29 s, and 1 m 6 s for Zen 2, Zen 3, and Coffee Lake, respectively. We note that the device H0, with bit flips only on Zen 2, is not exploitable. Our data shows that even if we find only a very low number of patterns (e.g., 7 patterns for S3), we are still likely to find an exploitable bit flip (2 m 21 s).

Lastly, the sudo binary exploit [11] is the hardest attack as it requires a precise set of bit flips. Given the low number of bit flips on Zen 2, we cannot find any exploitable bit flips for this attack. For the remaining platforms, Zen 3 and Coffee Lake, we find an equal number of exploitable devices (4) and a similar average time to find an exploitable bit flip, 3 m 29 s and 3 m 55 s, respectively, when excluding devices with a single bit flip only. The exploitable devices are those that showed the highest number of bit flips while sweeping on these platforms.

End-to-End Attack's Practicality. As our exploitability analysis is based on simulation results, we further verified the practicality of the PTE attack by Seaborn and Dullien [36]. Our attack's implementation is based on the THP coloring technique described in Section 4.4. Moreover, we modified our ZENHAMMER fuzzer to use THPs, as has been done before for n-sided patterns [9]: we distribute aggressors across THPs such that aggressor pairs are placed on the same THP and the pattern is spread across multiple THPs (see the sketch below). We successfully verified the attack's feasibility on device S2. Over ten successful attack runs (i.e., obtaining root privileges), we report an average time of 93 seconds for the end-to-end attack once an exploitable bit flip has been found. This includes the time for THP coloring as reported in Section 4.4.
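The following sketch outlines the THP-based allocation; the pairing scheme and the intra-pair distance are placeholders, as the real layout depends on the DRAM address functions.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

constexpr size_t kThpSize = 2u << 20;  // 2 MiB transparent huge page

// Back the attack buffer with THPs. For brevity, we do not align the
// mapping to 2 MiB here, which a real implementation would do.
uint8_t *alloc_thp_buffer(size_t n_thps) {
  void *buf = mmap(nullptr, n_thps * kThpSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) return nullptr;
  madvise(buf, n_thps * kThpSize, MADV_HUGEPAGE);  // request THP backing
  return static_cast<uint8_t *>(buf);
}

// Hypothetical placement: aggressors 2i and 2i+1 land on THP i, so each
// pair shares one THP while the pattern spans multiple THPs. The 8 KiB
// intra-pair distance is a stand-in for the real row-to-row distance.
uint8_t *aggressor_addr(uint8_t *base, size_t idx) {
  return base + (idx / 2) * kThpSize + (idx % 2) * 8192;
}
```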
Discussion. These results show that using the techniques we discussed in this paper, ZENHAMMER enables practical Rowhammer exploits on AMD Zen-based platforms for the first time. We also believe that our insights will make it easier to port Rowhammer attacks to newer platforms in the future, such as DDR5 devices, as we will show next.

7.3 ZenHammer on DDR5

As part of our evaluation, we tested whether ZENHAMMER is effective in triggering bit flips on more recent devices (DDR5). We reverse engineered the DRAM address functions of our Zen 4 system (Ryzen 7 7700X) and present the functions in Table 11. As for DDR4, we randomly picked ten DDR5 devices (Table 16 in Appendix D) and repeated the experiment described in Section 6.2 to find the best fence scheduling policy for each device.

We found bit flips on only 1 of 10 tested devices (S1), suggesting that the changes in DDR5, such as improved Rowhammer mitigations, on-die error correction code (ECC), and a higher refresh rate (32 ms refresh window), make it harder to trigger bit flips. On S1 with the policy SPnone, we found 109 patterns and 23,110 bit flips during fuzzing. The best pattern triggered 41,995 bit flips during the sweep over 256 MiB of memory. Given the lack of bit flips on 9 of 10 DDR5 devices, more work is needed to better understand the potentially new Rowhammer mitigations and their security guarantees.

8 Related Work

In this section, we discuss differences between DARE and existing tools for reverse engineering DRAM address functions (Section 8.1). Thereafter, we discuss similar and orthogonal approaches used to reverse engineer the DRAM address functions (Section 8.2). Lastly, we summarize previous efforts regarding Rowhammer on pre-Zen AMD systems (Section 8.3).

8.1 Comparison to Existing Tools

In Table 12, we compare our new reverse engineering tool DARE to the open-source tool DRAMA [32] and concurrent work AMDRE [14]. DRAMA was not able to recover the correct functions on our Zen-based systems,
while AMDRE could only partially (up to bit 21) recover the Zen 2 functions due to its limitation to 2 MiB THPs.

Table 12. Comparison of DARE with AMDRE and DRAMA. The table shows features and changes made for correctness (Corr.), noise handling (Noise), and performance improvement (Perf.).

                       DARE   AMDRE   DRAMA
Thresh. Detection
– Autom. Detection       ✔      ✔       ✔
– Reliable Timing        ✔      ✔       ✘
Clustering
– Superpages             ✔      ✘       ✘
– Pairwise Testing       ✔      ✔       ✘
Brute forcing
– Address Offsets        ✔      ✘       ✘
– Strict Validation      ✔      ✔       ✘

Our changes enabled us to recover the complete and correct DRAM address mappings in a fast and reliable way. Like DRAMA and AMDRE, our tool requires superuser privileges for the virtual-to-physical address translation. However, an attacker could recover the DRAM address mappings offline, i.e., on another system with the same hardware configuration. We now discuss our improvements over the existing work.

Reliable Timing. The timing routine used in DRAMA does not work reliably on AMD Zen-based systems, leading to many outliers. In AMDRE, the timing routine works mostly reliably, except for the few occasions where the automatic threshold detection fails. We designed an optimized and more reliable timing routine in Section 4.1; a generic version of such a routine is sketched below.
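The flushing, fencing, and averaging choices here are common practice for bank-conflict measurements rather than the exact routine from Section 4.1.

```cpp
#include <x86intrin.h>  // __rdtscp, _mm_clflush, _mm_mfence
#include <cstdint>

// Measure the average access latency of an address pair: flush both,
// then load both between timestamp reads. Same-bank, different-row
// pairs exhibit a visibly higher latency due to row buffer conflicts.
uint64_t time_pair(volatile uint8_t *a, volatile uint8_t *b, int rounds) {
  uint64_t total = 0;
  unsigned aux;
  for (int i = 0; i < rounds; ++i) {
    _mm_clflush((void *)a);
    _mm_clflush((void *)b);
    _mm_mfence();                    // make sure the flushes completed
    uint64_t t0 = __rdtscp(&aux);
    (void)*a;
    (void)*b;
    _mm_mfence();                    // wait for both loads
    uint64_t t1 = __rdtscp(&aux);
    total += t1 - t0;
  }
  return total / rounds;             // average latency in TSC cycles
}
```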
Superpages. During reverse engineering, we use all available 1 GiB superpages, as higher physical address bits (above 1 GiB) are involved in some address mappings. Both DRAMA and AMDRE can be configured to use more memory; however, only with 4 KiB pages and 2 MiB THPs, respectively.

Pairwise Testing. We reduce false positives by measuring pairwise latencies for cluster addresses and removing those conflicting with less than 75% of the cluster, thus creating perfect bank clusters. AMDRE uses a similar technique to remove false positives.
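A sketch of this filtering step is shown below; time_pair() is the timing routine sketched above, and the conflict threshold is a calibrated, device-specific value.

```cpp
#include <cstdint>
#include <vector>

uint64_t time_pair(volatile uint8_t *a, volatile uint8_t *b, int rounds);

// Keep an address only if it conflicts (high latency) with at least 75%
// of the other addresses in its cluster, removing false positives.
std::vector<volatile uint8_t *> filter_cluster(
    const std::vector<volatile uint8_t *> &cluster, uint64_t thresh) {
  std::vector<volatile uint8_t *> kept;
  for (auto cand : cluster) {
    size_t conflicts = 0;
    for (auto other : cluster)
      if (other != cand && time_pair(cand, other, 32) > thresh)
        ++conflicts;
    if (conflicts * 4 >= (cluster.size() - 1) * 3)  // >= 75% of peers
      kept.push_back(cand);
  }
  return kept;
}
```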
Address Offsets. The functions found by DRAMA and AMDRE are not valid across the whole physical address space. This is caused by the remapping of physical memory above the 4 GiB mark, which introduces a nonlinearity. DARE is the first tool to take this into account by applying a system-specific offset prior to brute forcing the XOR functions.
Strict Validation. DRAMA only requires that candidate functions do not produce the same result across the clusters. Our and AMDRE's condition is stronger, requiring that every function returns the same result on exactly half of all clusters. This condition allows us to filter out many invalid address functions early on while brute forcing the functions.
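Expressed as code, the condition could look as follows; here, clusters hold physical addresses after offset correction, and the exact data structures in DARE differ.

```cpp
#include <cstdint>
#include <vector>

// A candidate XOR mask passes only if (a) it is constant within every
// bank cluster and (b) it evaluates to 1 on exactly half of the clusters.
bool strictly_valid(uint64_t mask,
                    const std::vector<std::vector<uint64_t>> &clusters) {
  size_t ones = 0;
  for (const auto &cluster : clusters) {
    unsigned v = __builtin_parityll(cluster.front() & mask);
    for (uint64_t addr : cluster)
      if (__builtin_parityll(addr & mask) != v) return false;  // not constant
    ones += v;
  }
  return 2 * ones == clusters.size();  // same result on exactly half
}
```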
8.2 Comparison to Other Techniques

The approaches used by existing work to reverse engineer the secret DRAM address mappings can be divided into software-based and hardware-based approaches. Software-based approaches generally require side channels, such as bank conflicts. Instead, hardware-based techniques require specialized equipment like a logic analyzer. We compare the existing approaches in Table 13, which we now explain in more detail.

Our comparison considers three categories: requirements, results, and features. For the Requirements, we compare the monetary costs involved (Cst.), if any special hardware is needed (HW), and if the method relies on a side channel (SC). In the Results category, we look at how generic (Gen.) the approach is (i.e., if it also works with different memory configurations), the result's completeness (Cpl.) w.r.t. the different DRAM address components, and the result's precision (Prec., i.e., how reliable results are). Lastly, the Features category considers whether the approach can obtain labels for the found functions (Lbl.) and analyze the devices' internal row remapping (RR).

Table 13. Comparison of existing software-based (top) and hardware-based (bottom) techniques for recovering DRAM address mappings. Our work uses row buffer conflicts to find the functions and an oscilloscope to verify their validity. Software-based: (1) row buffer conflicts [5, 14, 32, 40, 42, 43], (2) Rowhammer [35], (3) performance counters [15]. Hardware-based: (4) oscilloscope [32], (5) logic analyzer [31], (6) retention + temperature [20].

Requirements. Software-based approaches (1)–(3) are cost-effective, essentially free. Oscilloscopes (4) are affordable, while logic analyzers (5) are more expensive. Approach (6) requires an FPGA and special heating equipment. Using Rowhammer bit flips as a side channel (2) requires a vulnerable device, which might be hard to obtain. To the best of our knowledge, only server platforms provide hardware-based performance counters (3) with DRAM-related data. Besides Rowhammer bit flips (2), other side channels used are row buffer conflicts (1) and DRAM retention time (6).

Results. Oscilloscopes (4), logic analyzers (5), and Rowhammer (2) are purely generic and support any DRAM device configuration. Exploiting row buffer conflicts (1) may require tweaking timing thresholds in multi-DIMM/-channel setups. Only logic analyzers (5) can recover all DRAM address components, as the limited number of channels on oscilloscopes (4) may make data filtering for some address component hard or impossible. The retention time approach (6) cannot recover DRAM address bits requiring multiple DRAM devices. The hardware-based approaches (4)–(6) and performance counters (3) provide high precision, whereas row buffer conflicts (1) require a reliable timing function. Using Rowhammer itself (2) might be imprecise, as mitigations in the memory controller or the devices themselves could disturb the bit flip feedback channel.

Features. All hardware-based approaches (4)–(6) provide information to derive labels for DRAM address mappings. Depending on their availability, performance counters (3) may have separate counters per bank and/or rank, allowing only some labels to be derived. Rowhammer bit flips (2) and DRAM retention (6) are the only techniques that allow reversing the DRAM-internal row remapping.

Relation to Our Work. Similar to previous work, we rely on the row buffer conflict side channel (1) to reverse engineer the DRAM address mappings. However, as the first work, we take the address offset into account and collect addresses from multiple superpages, enabling us to recover the correct mappings on all Zen-based systems. Furthermore, we use an oscilloscope (4), with the same method as in previous work [32], to physically validate our address mappings.

8.3 Rowhammer on AMD

Little attention has been paid to Rowhammer on AMD in the past decade. The original Rowhammer study from 2014 by Kim et al. [22] showed bit flips on Intel and AMD Piledriver. On these older systems, using the same hammering instructions on the two systems was still effective. We demonstrated that this is not the case anymore for modern CPUs.

Later, in 2016, a comparative analysis looked into Rowhammer on Intel (Sandy Bridge, Ivy Bridge, and Haswell) and AMD (Piledriver) platforms. It showed that not only is the access rate much lower on AMD (6.1 M/s compared to 11.6 M/s–12.3 M/s), but also the number of bit flips observed is roughly two orders of magnitude larger for Intel (16.1 k–22.9 k) than on AMD (59) [27]. Our findings show a lower number of bit flips on AMD Zen 2 compared to Intel systems, even after our optimizations.

9 Conclusion

We presented ZENHAMMER, the first successful Rowhammer attacks launched from AMD Zen-based CPUs. To build ZENHAMMER, we needed to overcome a number of challenges, including the reverse engineering of the DRAM addressing functions by taking physical address offsets into account, a new mechanism for synchronization with refresh commands, and careful scheduling of flushing and fencing instructions to improve the activation throughput of Rowhammer patterns. ZENHAMMER is capable of flipping bits on 7 and 6 out of our ten DDR4 samples on AMD Zen 2 and 3, respectively, enabling Rowhammer exploits on recent AMD platforms for the first time. We further show Rowhammer bit flips on a DDR5 device for the first time.

Acknowledgments

We thank the anonymous reviewers for their feedback. This research was supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40_180545, by the Swiss State Secretariat for Education, Research and Innovation under contract number MB22.00057 (ERC-StG PROMISE), and by a Microsoft Swiss JRC grant.

References

[1] PassMark CPU Benchmarks: AMD vs Intel Market Share. URL https://www.cpubenchmark.net/market_share.html.

[2] Advanced Micro Devices. BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors, January 2013. URL https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/42301_15h_Mod_00h-0Fh_BKDG.pdf.

[3] Advanced Micro Devices. AMD64 Architecture Programmer's Manual Volume 4: 128-Bit and 256-Bit Media Instructions, November 2021. URL https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/26568.pdf.

[4] Advanced Micro Devices. AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions, June 2023. URL https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf.

[5] Alessandro Barenghi, Luca Breveglieri, Niccolò Izzo, and Gerardo Pelosi. Software-only Reverse Engineering of Physical DRAM Mappings for Rowhammer Attacks. In IVSW '18, pages 19–24, July 2018.

[6] Yaakov Cohen, Kevin Sam Tharayil, Arie Haenel, Daniel Genkin, Angelos D. Keromytis, Yossi Oren, and Yuval Yarom. HammerScope: Observing DRAM Power Consumption Using Rowhammer. In CCS '22, pages 547–561, November 2022.

[7] Lucian Cojocar, Kaveh Razavi, Cristiano Giuffrida, and Herbert Bos. Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against Rowhammer Attacks. In IEEE S&P '19, pages 55–71, May 2019.

[8] Lucian Cojocar, Jeremie Kim, Minesh Patel, Lillian Tsai, Stefan Saroiu, Alec Wolman, and Onur Mutlu. Are We Susceptible to Rowhammer? An End-to-End Methodology for Cloud Providers. In IEEE S&P '20, pages 712–728, May 2020.
[9] Finn de Ridder, Pietro Frigo, Emanuele Vannacci, Herbert Bos, Cristiano Giuffrida, and Kaveh Razavi. SMASH: Synchronized Many-sided Rowhammer Attacks from JavaScript. In USENIX Security '21, pages 1001–1018, August 2021.

[10] Pietro Frigo, Emanuele Vannacc, Hasan Hassan, Victor van der Veen, Onur Mutlu, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. TRRespass: Exploiting the Many Sides of Target Row Refresh. In IEEE S&P '20, pages 747–762, May 2020.

[11] Daniel Gruss, Moritz Lipp, Michael Schwarz, Daniel Genkin, Jonas Juffinger, Sioli O'Connell, Wolfgang Schoechl, and Yuval Yarom. Another Flip in the Wall of Rowhammer Defenses. In IEEE S&P '18, pages 245–261, May 2018.

[12] Hasan Hassan, Yahya Can Tugrul, Jeremie S. Kim, Victor van der Veen, Kaveh Razavi, and Onur Mutlu. Uncovering In-DRAM RowHammer Protection Mechanisms: A New Methodology, Custom RowHammer Patterns, and Implications. In MICRO '21, pages 1198–1213, October 2021.

[13] Wei He, Zhi Zhang, Yueqiang Cheng, Wenhao Wang, Wei Song, Yansong Gao, Qifei Zhang, Kang Li, Dongxi Liu, and Surya Nepal. WhistleBlower: A System-level Empirical Study on RowHammer. IEEE Transactions on Computers, pages 1–15, January 2023.

[14] Martin Heckel and Florian Adamsky. Reverse-Engineering Bank Addressing Functions on AMD CPUs. In DRAMSec '23, pages 1–6, June 2023.

[15] Christian Helm, Soramichi Akiyama, and Kenjiro Taura. Reliable Reverse Engineering of Intel DRAM Addressing Using Performance Counters. In MASCOTS '20, pages 1–8, November 2020.

[16] Intel. 12th Generation Intel Core Processors, Datasheet Volume 2 of 2, April 2022. URL https://cdrdv2.intel.com/v1/dl/getContent/655259.

[17] Patrick Jattke, Victor Van Der Veen, Pietro Frigo, Stijn Gunter, and Kaveh Razavi. Blacksmith: Scalable Rowhammering in the Frequency Domain. In IEEE S&P '22, pages 716–734, May 2022.

[18] JEDEC Solid State Technology Association. DDR4 SDRAM, September 2012. URL https://www.jedec.org/sites/default/files/docs/JESD79-4.pdf.

[19] Michael Fahr Jr, Thinh Dang, Hunter Kippen, Jacob Lichtinger, Andrew Kwong, Dana Dachman-Soled, Daniel Genkin, and Alexander Nelson. When Frodo Flips: End-to-End Key Recovery on FrodoKEM via Rowhammer. In CCS '22, pages 979–993, November 2022.

[20] Matthias Jung, Carl C. Rheinländer, Christian Weis, and Norbert Wehn. Reverse Engineering of DRAMs: Row Hammer with Crosshair. In MEMSYS '16, pages 471–476, October 2016.

[21] Jeremie S. Kim, Minesh Patel, A. Giray Yağlıkçı, Hasan Hassan, Roknoddin Azizi, Lois Orosa, and Onur Mutlu. Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques. In ISCA '20, pages 638–651, May 2020.

[22] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In ISCA '14, pages 361–372, June 2014.

[23] Petr Kobalicek. AsmJit: Low-Latency Machine Code Generation, 2023. URL https://asmjit.com/.

[24] Andreas Kogler, Jonas Juffinger, Salman Qazi, Yoongu Kim, Moritz Lipp, Nicolas Boichat, Eric Shiu, Mattias Nissler, and Daniel Gruss. Half-Double: Hammering From the Next Row Over. In USENIX Security '22, pages 3807–3824, August 2022.

[25] Andrew Kwong, Daniel Genkin, Daniel Gruss, and Yuval Yarom. RAMBleed: Reading Bits in Memory Without Accessing Them. In IEEE S&P '20, pages 695–711, May 2020.

[26] Zhenrong Lang, Patrick Jattke, Michele Marazzi, and Kaveh Razavi. Blaster: Characterizing the Blast Radius of Rowhammer. In DRAMSec '23, pages 1–7, June 2023.

[27] Mark Lanteigne. A Tale of Two Hammers: A Brief Rowhammer Analysis of AMD vs. Intel. Technical report, Third I/O, May 2016. URL http://www.thirdio.com/rowhammera1.pdf.

[28] Michele Marazzi, Flavien Solt, Patrick Jattke, Kubo Takashi, and Kaveh Razavi. REGA: Scalable Rowhammer Mitigation with Refresh-Generating Activations. In IEEE S&P '23, pages 1684–1701, May 2023.

[29] Koksal Mus, Yarkın Doröz, M. Caner Tol, Kristi Rahman, and Berk Sunar. Jolt: Recovering TLS Signing Keys via Rowhammer Faults. In IEEE S&P '23, pages 1719–1736, May 2023.

[30] Lois Orosa, Ulrich Rührmair, A. Giray Yaglikci, Haocong Luo, Ataberk Olgun, Patrick Jattke, Minesh Patel, Jeremie Kim, Kaveh Razavi, and Onur Mutlu. SpyHammer: Using RowHammer to Remotely Spy on Temperature, October 2022. URL https://arxiv.org/abs/2210.04084.
[31] Minesh Patel, Jeremie S. Kim, and Onur Mutlu. The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions. In ISCA '17, pages 255–268, June 2017.

[32] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. In USENIX Security '16, pages 565–581, August 2016.

[33] Rui Qiao and Mark Seaborn. A New Approach For Rowhammer Attacks. In HOST '16, pages 161–166, May 2016.

[34] Kaveh Razavi, Ben Gras, Erik Bosman, Bart Preneel, Cristiano Giuffrida, and Herbert Bos. Flip Feng Shui: Hammering a Needle in the Software Stack. In USENIX Security '16, pages 1–18, August 2016.

[35] Mark Seaborn. How physical addresses map to rows and banks in DRAM, May 2015. URL https://lackingrhoticity.blogspot.com/2015/05/how-physical-addresses-map-to-rows-and-banks.html.

[36] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges, March 2015. URL https://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html.

[37] Andrei Tatar, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Defeating Software Mitigations Against Rowhammer: A Surgical Precision Hammer. In RAID '18, pages 48–66, September 2018.

[38] M. Caner Tol, Saad Islam, Andrew J. Adiletta, Berk Sunar, and Ziming Zhang. Don't Knock! Rowhammer at the Backdoor of DNN Models. In DSN '23, pages 109–122, June 2023.

[39] Chihiro Tomita, Makoto Takita, Kazuhide Fukushima, Yuto Nakano, Yoshiaki Shiraishi, and Masakatu Morii. Extracting the Secrets of OpenSSL with RAMBleed. Sensors, 22(9):3586, January 2022.

[40] Victor van der Veen, Yanick Fratantonio, Martina Lindorfer, Daniel Gruss, Clementine Maurice, Giovanni Vigna, Herbert Bos, Kaveh Razavi, and Cristiano Giuffrida. Drammer: Deterministic Rowhammer Attacks on Mobile Platforms. In CCS '16, pages 1675–1689, October 2016.

[41] Hari Venugopalan, Kaustav Goswami, Zainul Abi Din, Jason Lowe-Power, Samuel T. King, and Zubair Shafiq. Centauri: Practical Rowhammer Fingerprinting, June 2023. URL https://arxiv.org/abs/2307.00143.

[42] Minghua Wang, Zhi Zhang, Yueqiang Cheng, and Surya Nepal. DRAMDig: A Knowledge-assisted Tool to Uncover DRAM Address Mapping. In DAC '20, July 2020.

[43] Yuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Radu Teodorescu. One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation. In USENIX Security '16, pages 19–35, August 2016.

[44] Z. Zhang, Y. Cheng, D. Liu, S. Nepal, Z. Wang, and Y. Yarom. PThammer: Cross-User-Kernel-Boundary Rowhammer through Implicit Accesses. In MICRO '20, pages 28–41, October 2020.

Appendices

A Equally-sized Bins in XOR Partition

In Section 4.2, we assumed that the result of any XOR function on a bin of addresses returns either a constant value (i.e., 0 or 1) for all addresses or evenly splits the addresses. We prove this assumption in the following.

Claim. Consider an aligned power-of-two range of addresses A = [m · 2^n, (m + 1) · 2^n − 1] (m, n ∈ N), a XOR function f which is non-constant on A, and the set of addresses B = {a ∈ A | f(a) = 0}. Partitioning the addresses in B using a different, non-constant XOR function g results in two equally-sized bins where g is constant 0 and constant 1, respectively.

Proof. First, we show that the claim holds for one function g1 ≠ f. We construct g1 by extending f to include another previously unused bit in the XOR computation (alternatively, a bit could be removed from the XOR computation). We note that adding this new bit leads to a different function result for exactly half of all addresses in B (namely, those where that address bit is set). As the function result was previously constant 0 for all b ∈ B, it must now be equally distributed between 0 and 1, satisfying our claim.

Second, we show that we can successively modify g1 to obtain an arbitrary function g without changing the size of the two bins. To do this, we successively add (or remove) a bit to (or from) the XOR computation in g1 until reaching g. During each of these steps, the function result will flip for half of all addresses. We note that the addresses whose result flips, i.e., those where the affected bit is set, are always split evenly between the two bins. Thus, an equal number of addresses moves from each bin to the other, keeping the size of the two bins equal after each step and satisfying our claim for any function g.
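For small parameters, the claim can also be verified exhaustively. The following standalone check is ours, not part of the paper's artifacts; it enumerates all non-constant XOR functions over the low n address bits of an aligned range and asserts both even splits:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
  const unsigned n = 6;                     // 2^6 addresses, here with m = 3
  const uint64_t base = 3ULL << n;          // aligned range [m*2^n, (m+1)*2^n)
  for (uint64_t f = 1; f < (1u << n); ++f) {
    std::vector<uint64_t> B;                // bin where f evaluates to 0
    for (uint64_t a = base; a < base + (1u << n); ++a)
      if (__builtin_parityll(a & f) == 0) B.push_back(a);
    assert(B.size() == (1u << (n - 1)));    // f splits A evenly
    for (uint64_t g = 1; g < (1u << n); ++g) {
      if (g == f) continue;
      size_t zeros = 0;
      for (uint64_t a : B) zeros += (__builtin_parityll(a & g) == 0);
      assert(2 * zeros == B.size());        // g splits B evenly, too
    }
  }
  return 0;
}
```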
B Heatmap of Memory Access Rates

Table 14 shows the same data as Table 7. However, we also show the instruction sequences that were previously excluded due to their throughput either being low (≤ 100 ACTs/tREFI) or very high, indicating cache hits (≥ 1000 ACTs/tREFI). In Table 15, we show the results of the same experiment for the Intel Coffee Lake system.

Table 14. Heatmap of memory access rates (in ACTs/tREFI) for all tested instruction sequences and varying numbers of accessed rows on the AMD Z3 system. We abbreviate "scatter, fence each" by "s.f.e."

Access Type    Flushing    Fence      1     2     4     8    16    32   256
MOV (load)     gather      M         24    49    71    91   100   110   114
MOV (load)     gather      L         24    49    80   113   134   147   121
MOV (load)     gather      S         24    49    80   113   133   146   125
MOV (load)     gather      —         24    49    80   113   133   146   125
MOV (load)     scatter     M         24    49    79   107   126   143   157
MOV (load)     scatter     L         24    49    95   137   149   153   159
MOV (load)     scatter     S         24    48    97   154   159   159   159
MOV (load)     scatter     —         24    49    97   154   159   159   159
MOV (load)     s.f.e.      M         24    33    33    33    33    34    34
MOV (load)     s.f.e.      L         24    49    65    70    69    71    70
MOV (load)     s.f.e.      S         24    41    70    72    72    73    74
MOV (store)    gather      M         24    32    50    67    71    72    72
MOV (store)    gather      L         24    32    49    66    72    71    71
MOV (store)    gather      S         24    32    49    67    67    73    72
MOV (store)    gather      —         24    32    49    67    70    73    72
MOV (store)    scatter     M         24    32    54    72    73    73    72
MOV (store)    scatter     L         24    32    54    72    73    72    72
MOV (store)    scatter     S         24    32    52    73    72    73    72
MOV (store)    scatter     —         24    32    54    73    73    73    72
MOV (store)    s.f.e.      M         24    24    28    28    28    28    28
MOV (store)    s.f.e.      L         24    32    53    72    74    75    72
MOV (store)    s.f.e.      S         24    24    48    49    49    50    50
MOVNTDQA       none        —        15K   20K   24K   26K   28K   26K   12K
MOVNTI         none        —        15K   24K   29K    8K   618   367   131
PREFETCHNTA    none        —        15K   24K   29K   29K   30K   27K   19K
PREFETCHNTA    scatter     —         80   132   191   208   253   309   273
PREFETCHNTA    scatter     M         24    49    80   108   131   170   284
VGATHERDD      scatter     —         24    49    79   112   159   159   159

Table 15. Heatmap of memory access rates (in ACTs/tREFI) for all tested instruction sequences and varying numbers of accessed rows on the Intel CL system. We abbreviate "scatter, fence each" by "s.f.e."

Access Type    Flushing    Fence      1     2     4     8    16    32   256
MOV (load)     gather      M         83   110   130   144   150   153   158
MOV (load)     gather      L        151    90   115   142   146   154   159
MOV (load)     gather      S        248   128   138   148   159   154   159
MOV (load)     gather      —        226   136   160   160   163   153   159
MOV (load)     scatter     M         83   110   130   144   152   156   160
MOV (load)     scatter     L        151    98   121   144   154   157   160
MOV (load)     scatter     S        186   121   142   156   160   160   160
MOV (load)     scatter     —        248   128   143   160   160   160   160
MOV (load)     s.f.e.      M         83    83    83    83    83    83    83
MOV (load)     s.f.e.      L        156   100    99    99    99    99    99
MOV (load)     s.f.e.      S        164   120   137   160   160   160   160
MOV (store)    gather      M         87   154   233   322   464   646    87
MOV (store)    gather      L         95   206   364   670   867   852    87
MOV (store)    gather      S         94   150   262   427   611   712    87
MOV (store)    gather      —         94   206   361   670   871   849    88
MOV (store)    scatter     M         89    93    82   101   103    98    86
MOV (store)    scatter     L         94   183   108   110   108   108    86
MOV (store)    scatter     S         95   116    92   101   106   106    86
MOV (store)    scatter     —         94   187   111   112   108   107    86
MOV (store)    s.f.e.      M         89    54    53    51    51    50    50
MOV (store)    s.f.e.      L         94   190   107   116   109   107    86
MOV (store)    s.f.e.      S         94    71    70    69    70    69    68
MOVNTDQA       none        —        12K   17K   14K   25K   17K    7K    6K
MOVNTI         none        —        12K    9K   14K   15K   950   721   107
PREFETCHNTA    none        —        13K   19K   14K   29K   32K   14K   221
PREFETCHNTA    scatter     —        253   314   261   271   160   160   160
PREFETCHNTA    scatter     M         83   110   130   144   152   156   160
VGATHERDD      scatter     —        156   219   228   143   152   160   160
Figure 7. Activation rates and possible pattern orderings for non-uniform hammering patterns when using different scheduling policies. The data was collected on Z3 using the MFENCE barrier.
C Modelling Fence Scheduling Policies

In this appendix, we first present a theoretical model for the amount of ordering enforced by a scheduling policy (as described in Section 6.2) based on a simple CPU behavior model. We then evaluate the trade-off provided by different scheduling policies by contrasting the amount of ordering provided with the patterns' hammering speeds.

Computing Pattern Permutations. To analyze the amount of ordering provided by a scheduling policy, we use a model for the processor's memory subsystem which assumes that (a) load requests cannot be reordered around memory barriers, as guaranteed by M/LFENCE [4], and (b) all load requests are served by DRAM, including consecutive ones to the same cache line with flushing in between accesses (Obs. 4). Using this model, we can compute the number of theoretically possible orderings of a hammering pattern.

We assume that patterns are always ordered at their beginning and their end, and we compute the number of permutations for each interval (delineated by memory barriers) individually. For a multiset M of size m, containing l different elements with multiplicities m_1, m_2, ..., m_l, the number of permutations is given by the multinomial coefficient C(m; m_1, m_2, ..., m_l) = m! / (m_1! m_2! ... m_l!). To obtain the total number of all permutations, we multiply the numbers for the different intervals.

In practice, it is highly unlikely that memory accesses are reordered over large distances, even if theoretically possible based on ordering semantics. However, as the realistic extent of reordering is unknown, we use this simpler model.

Example. To illustrate, we use an example non-uniform pattern |a1 a2 a1 a2 a3 a4| where fences are shown using vertical bars. The number of possible orderings is computed as C(6; 2, 2, 1, 1) = 6! / (2! 2! 1! 1!) = 180. When inserting another fence after the fourth access (corresponding to SPpair), we get the pattern |a1 a2 a1 a2 | a3 a4| with C(4; 2, 2) · C(2; 1, 1) = 6 · 2 = 12 possible orderings. By inserting a single memory barrier in the middle of the pattern, the number of possible orderings has been reduced drastically.
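The model translates directly into code. The sketch below uses a hypothetical input encoding (one multiplicity vector per fence-delimited interval) and reproduces the numbers from the example:

```cpp
#include <cstdint>
#include <vector>

// Number of possible orderings of one fence-delimited interval: the
// multinomial coefficient m! / (m_1! ... m_l!), where m_i is how often
// aggressor i occurs in the interval. Computed incrementally via
// binomials so intermediate values stay integral and small.
uint64_t interval_orderings(const std::vector<uint64_t> &mult) {
  uint64_t result = 1, m = 0;
  for (uint64_t mi : mult)
    for (uint64_t k = 1; k <= mi; ++k)
      result = result * (++m) / k;  // exact: running product of binomials
  return result;
}

// Total orderings of a pattern = product over its intervals.
uint64_t pattern_orderings(const std::vector<std::vector<uint64_t>> &ivals) {
  uint64_t total = 1;
  for (const auto &iv : ivals) total *= interval_orderings(iv);
  return total;
}

// pattern_orderings({{2, 2, 1, 1}}) == 180   (|a1 a2 a1 a2 a3 a4|)
// pattern_orderings({{2, 2}, {1, 1}}) == 12  (|a1 a2 a1 a2 | a3 a4|)
```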
Ordering vs. Hammering Speed. To explore the trade-off provided by our scheduling policies, we contrast the provided ordering and hammering speeds of 15 K random non-uniform patterns. We implement all proposed scheduling policies (see Table 8) in our fuzzer, hammer the generated patterns using the different policies, and record their activation rates. We then compute the number of pattern permutations using the theoretical model introduced above. To account for different pattern lengths (L), we use the normalized ordering metric Ñ := N^(1/L), where N is the number of possible orderings.

We plot the results for Z3 in Figure 7, where we show, for each policy and each generated pattern, the hammering speed (x-axis) and the number of possible orderings (y-axis). We omit the similar results from Z+, where we also ran this experiment.

Observations. As expected, the scheduling policies differ significantly in the trade-off they provide. SPnone provides very high activation rates, as it allows the most reordering. On the contrary, SPfull allows zero reordering at the expense of low activation rates (of 37 ACTs/tREFI on average). The pattern-aware policies show two different types of distributions. For SPBP and SPBP/2, the distributions are somewhat similar to SPnone, albeit without the very fast outliers. On the other hand, SPpair and SPrep provide ordering that is nearly as strict as SPfull while allowing faster hammering compared to the latter, with average activation rates increased by 51% (SPpair) and 39% (SPrep), respectively.

Based on these results, we believe SPpair and SPrep could be well suited to reduce the amount of fencing without significantly impacting a pattern's ordering.

D Analyzed DDR5 Devices

In Table 16, we present the list of ten randomly chosen DDR5 UDIMMs covering all three major manufacturers, i.e., Samsung, SK Hynix, and Micron. We report each device's production date, speed, size, and DRAM geometry.

Table 16. DDR5 UDIMMs used in the evaluation of our AMD Zen-optimized Rowhammer fuzzer. We abbreviate the DRAM vendors Samsung (S), SK Hynix (H), and Micron (M). We report for each device the number of subchannels (SC), ranks (RK), bank groups (BG), banks per bank group (BA), and rows (R).

ID    Production Date    Freq. [MHz]    Size [GiB]    DIMM Geometry (SC,RK,BG,BA,R)
S0    Q4-2021            4800            8            (2, 1, 4, 4, 2^16)
S1    Q4-2021            4800           16            (2, 1, 8, 4, 2^16)
S2    Q4-2021            5600            8            (2, 1, 4, 4, 2^16)
S3    Q4-2021            4800            8            (2, 1, 4, 4, 2^16)
H0    Q4-2021            4800            8            (2, 1, 4, 4, 2^16)
M0    Q4-2021            4800           16            (2, 1, 8, 4, 2^16)
M1    Q4-2021            4800           16            (2, 1, 8, 4, 2^16)
M2    Q4-2021            4800           16            (2, 1, 8, 4, 2^16)
M3    Q4-2021            4800           16            (2, 1, 8, 4, 2^16)
M4    Q4-2021            4800           16            (2, 1, 8, 4, 2^16)