
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3346790


RL-based Cache Replacement: A Modern Interpretation of Belady's Algorithm with Bypass Mechanism and Access Type Analysis

HO JUNG YOO¹ (Student Member, IEEE), JEONG HUN KIM¹ (Student Member, IEEE), and TAE HEE HAN¹,² (Senior Member, IEEE)
¹Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, South Korea
²Department of Semiconductor Systems Engineering, Sungkyunkwan University, Suwon 16419, South Korea
Corresponding author: Tae Hee Han (than@skku.edu).
This work was supported by the Technology Innovation Program (or Industrial Strategic Technology Development Program-Public-Private Joint Investment Advanced Semiconductor Talent Development Project) (RS-2023-00237136, Development of CXL/DDR5-based memory subsystem for AI accelerators) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) (1415187686); in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korean Government (MSIT) through the Artificial Intelligence Graduate School Support Program, Sungkyunkwan University, under Grant 2019-0-00421; in part by the Ministry of Trade, Industry and Energy (MOTIE) under Grant 20011074; and in part by the National Research and Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) under Grant 2020M3H2A1076786.

ABSTRACT Belady's algorithm is widely known as an optimal cache replacement policy. It has been the foundation of numerous recent studies on cache replacement, and most studies assume it as an upper limit. Despite its widespread adoption, we discovered opportunities to unlock further headroom by addressing cache access types and implementing cache bypass. In this study, we propose Stormbird, a cache replacement policy that synergistically integrates extensions of Belady's algorithm with the power of reinforcement learning. Reinforcement learning is well suited to the cache replacement problem owing to its ability to interact dynamically with the environment, adapt to changing access patterns, and maximize cumulative rewards. Stormbird utilizes several features selected from the reinforcement learning model to enhance instructions per cycle (IPC) efficiency while maintaining a low hardware overhead. Furthermore, it considers cache access types and integrates dynamic set dueling techniques to improve cache performance. For a 2 MB last-level cache per core, Stormbird achieves an average IPC improvement of 0.13% over the previous state-of-the-art on a single-core system and 0.02% on a four-core system while simultaneously reducing hardware overhead by 62.5%. Stormbird incurs a low hardware overhead of only 10.5 KB for a 2 MB last-level cache and can be implemented without using program counter values.

INDEX TERMS Computer architecture, caches, reinforcement learning, replacement policy, Belady’s
algorithm, set dueling, cache access type.

I. INTRODUCTION
Cache replacement is a critical aspect of modern processors because the efficiency of replacement policies significantly influences the overall system performance [1]. The landscape of cache replacement policies has evolved through relentless innovation and extensive research, and many cutting-edge studies are based on Belady's algorithm [2]–[6].

Hawkeye [3], the victor of the 2nd Cache Replacement Championship (CRC2) [7], uses a predictive model (OPT) to efficiently simulate Belady's behavior. Mockingjay [4] exploits an estimated time of arrival (ETA)-based policy that effectively emulates Belady's algorithm by forecasting the anticipated arrival time of future cache accesses. While these representative policies regarded Belady's algorithm as the theoretical upper limit, we pursued an approach to improve Belady's algorithm.
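Belady's algorithm evicts the line whose next reference lies furthest in the future, which requires oracle knowledge of the upcoming access trace. As a minimal illustration (a sketch, not the paper's implementation; the `belady_victim` and `next_use` names are ours), the decision can be written in a few lines of Python:

```python
def belady_victim(cache_set, future_trace, time):
    """Belady's MIN: evict the line in `cache_set` whose next
    reference in `future_trace` is furthest away (or never occurs)."""
    def next_use(addr):
        # Forward reuse distance: steps until `addr` is referenced again.
        for t, ref in enumerate(future_trace[time:]):
            if ref == addr:
                return t
        return float("inf")  # never referenced again
    # The victim is the line with the maximum forward reuse distance.
    return max(cache_set, key=next_use)

# Toy example: 4-way set, future accesses known in advance.
cache_set = ["A", "B", "C", "D"]
future = ["B", "A", "C", "B", "A"]  # "D" is never referenced again
print(belady_victim(cache_set, future, 0))  # → D
```

Real hardware cannot see `future_trace`, which is why the policies above approximate this oracle with predictors.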


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4

FIGURE 1. Comparison of the improved dead-block-reduction rate over LRU of Belady's algorithm and Belady+ on SPEC CPU 2017 and the Streaming benchmark of CloudSuite (16-way, 2 MB LLC)

A. CACHE BYPASS TECHNIQUE
The cache bypass technique can enhance memory efficiency by not allocating cache lines that are unlikely to be reused in the near future [8]. This technique can help avert potential cache pollution and reduce dead blocks, especially in data streaming or scanning access patterns. Utilizing a forward-looking replacement policy, such as Belady's algorithm, enables incorporating a cache bypass mechanism that considers the reuse distance, primarily because the algorithm inherently considers the reuse distance, a critical metric in making bypass decisions.

The bypass technique assesses whether the memory data will be bypassed before selecting the victim way upon a cache miss. If the reuse distance of the cache access is greater than that of all cache lines in the corresponding set, the incoming data are directly delivered to the upper-level cache without being stored in the last-level cache (LLC). The cache bypass technique thus preserves cache space from non-reused accesses.

Fig. 1 compares the dead-block-reduction rate improvements over LRU for the SPEC CPU 2017 benchmarks [25] and the streaming benchmarks of CloudSuite [27]. The graph contrasts the original Belady's algorithm with Belady+, which extends Belady's algorithm by incorporating a reuse-distance-aware bypass technique. The comparison suggests that a cache bypass mechanism informed by reuse distance can enhance the efficiency of cache management, as evidenced by the increased dead-block-reduction rate and the implied reduction in cache pollution under Belady+. Notably, Belady+ exhibits a 73% higher average improvement in dead-block-reduction rate over LRU than Belady's algorithm on the streaming benchmarks. This substantial improvement highlights the effectiveness of Belady+ in scenarios where cache access patterns are sequential and scanning, improving overall cache performance and reducing pollution.

B. CACHE ACCESS TYPES
The cache access type carries potentially significant weight because different types of cache access have distinct effects on cache performance. Fig. 2 shows the cache access types for the victims under Belady's algorithm. This implies that prioritizing the eviction of specific types according to the benchmark can improve the hit rate. For example, in the case of the 649.fotonik3d benchmark, prioritizing the eviction of the writeback type would be effective. The details of utilizing the cache access types in this study are elucidated in Section 3.

FIGURE 2. Cache access types for victims with Belady's algorithm

C. RL-BASED CACHE REPLACEMENT POLICY
As the field of artificial intelligence continues to expand at an unprecedented rate, machine learning (ML) is increasingly being employed across a wide range of chip design processes. The application of ML to cache architectures is multifaceted, with studies investigating areas such as data prefetching [9], branch prediction [10], and replacement policies [6], [11].

In this study, we propose a cache replacement policy, Stormbird (named after a robotic creature in Horizon Zero Dawn: a large bird-like machine with a heavily armored body and a powerful electrical attack), which utilizes the strengths of reinforcement learning (RL). Stormbird aims to derive a cost-effective re-
placement policy using Belady+ to construct an RL simulation framework. The input features obtained from RL with significant weights were refined to mitigate the hardware overhead.

The remainder of this paper is organized as follows. Section 2 provides an overview of related work on cache replacement policies. Section 3 presents the methodology used in our research, encompassing the RL simulation model for Stormbird. Section 4 provides an in-depth exposition of Stormbird and elucidates its mechanisms and functionalities. Section 5 presents the evaluation results and analysis, comparing Stormbird's performance with that of other cache replacement policies. Finally, Section 6 concludes the study and outlines potential directions for future work.

TABLE 1. Comparison of cache replacement policies.

Policies \ Features    Compatible with traditional cache    Standalone mechanism    Considering access type
Hawkeye [3]            ✗                                    ✓                       ✓
Mockingjay [4]         ✗                                    ✓                       ✓
SHiP [12]              ✗                                    ✓                       ✗
SHiP++ [13]            ✗                                    ✓                       ✓
RLR [6]                ✓                                    ✓                       ✓
Glider [11]            ✗                                    ✓                       ✗
EHC [5]                ✓                                    ✗                       ✗
ReD [15]               ✓                                    ✗                       ✗
MRP [16]               ✗                                    ✓                       ✗
Stormbird (ours)       ✓                                    ✓                       ✓

II. RELATED WORK
Several innovative policies have exploited program counter (PC)-based signatures and predictors to boost the efficiency of cache replacement. The signature-based hit prediction (SHiP) replacement policy proposed by Wu et al. [12] focused on predicting the re-reference characteristics of cache lines using a signature history counter table (SHCT). Similarly, the SHiP++ method [13] strengthened the original SHiP approach and introduced several optimizations, including more effective handling of writeback accesses and prefetch-aware re-reference prediction value (RRPV) updates. Moreover, the Hawkeye replacement policy [3] utilized a PC-based predictor to enhance cache management efficiency by emulating Belady's algorithm to classify cache lines as cache-friendly or cache-averse. On a cache miss, Hawkeye prioritized evicting cache-averse lines, utilizing sampling sets and a counter-based structure to track the usefulness of cache lines, leading to elevated cache management efficiency.

Various strategies, including the Hawkeye policy, use different methodologies to simulate Belady's algorithm for cache replacement. The Mockingjay replacement policy [4] introduced an ETA-based policy that mimicked Belady's algorithm while considering cache access types. The expected hit count (EHC) policy [5], developed by Vakil-Ghahani et al., proposed a hit-count-based victim selection procedure intended to bridge the gap between the traditional LRU policy and Belady's MIN policy. The authors observed a strong correlation between a cache block's expected number of hits and the reciprocal of its reuse distance, and their procedure can be implemented on top of existing replacement policies (e.g., DRRIP [14]).

ML has been investigated in the cache replacement field using several approaches. For example, the reinforcement learning replacement (RLR) policy [6] employed RL to isolate critical features for cache replacement. RLR implements a replacement policy without directly integrating an ML framework into the cache architecture. The LSTM learning model was also investigated for cache replacement, as shown in the Glider policy by Shi et al. [11]. Building on the Hawkeye policy, Glider leveraged long short-term memory (LSTM) learning models in offline environments to refine the precision of existing hardware predictors.

Certain policies targeted unique aspects of cache replacement. ReD [15], a block selection policy developed by Diaz et al., was designed to determine the eligibility of a block from main memory for insertion into the LLC based on its expected reuse behavior. The multi-perspective reuse prediction (MRP) method [16], proposed by Jimenez et al., forecasted the future reuse of cache blocks employing diverse features to optimize cache management.

Table 1 summarizes the comparison of the cache replacement policies. Many of the aforementioned policies utilized PC values to bolster their accuracy. However, this approach requires extra logic, wiring, and energy consumption and incurs a substantial organizational expense. When a replacement policy incorporates PC values, it is incompatible with data prefetchers. Additionally, the miss status holding register (MSHR) and pipeline design need to encompass these PC values [17].

Addressing the cache access type can improve the replacement policy's ability to handle various sequences of cache accesses effectively, as different access types can have unique behavior patterns and impacts on cache performance. Many ML-based policies have been established on the foundation of the conventional Belady's algorithm. To improve the overall performance of the replacement policy, we opted to use Belady+ instead of Belady's algorithm.

III. CACHE REPLACEMENT SIMULATION WITH REINFORCEMENT LEARNING
RL is particularly suited to the cache replacement policy for several reasons. The cache replacement policy is essentially a problem of making the best decision under uncertain conditions. RL excels in such problems because it is designed to learn from the environment and make the most beneficial decisions over time [18]. In addition, RL focuses on maximizing long-term rewards rather than immediate gains. This aligns with the goal of cache replacement policies, which aim to minimize the long-term miss rate rather than simply avoiding immediate cache misses. Constructing a neural network directly in hardware is undesirable owing to power,


FIGURE 4. Partitioning of the LLC tag bits (2 MB LLC)

TABLE 2. Features used in the RL model

Classification    Feature                            Size (bit)
Access            Offset                             6
information       3-bit tag subset (1st, [25:23])    3
                  3-bit tag subset (2nd, [22:20])    3
                  3-bit tag subset (3rd, [19:17])    3
                  Access type: Load                  1
                  Access type: RFO                   1
                  Access type: Prefetch              1
                  Access type: Writeback             1
Set               Set number                         11
information       Set total accesses                 8
                  Set accesses since last miss       6
                  Access count: Load                 8
                  Access count: RFO                  8
                  Access count: Prefetch             8
                  Access count: Writeback            8
                  Last access type: Load             1
                  Last access type: RFO              1
                  Last access type: Prefetch         1
                  Last access type: Writeback        1
Cache line        3-bit tag subset (1st, [25:23])    3
information       3-bit tag subset (2nd, [22:20])    3
(for n-way)       3-bit tag subset (3rd, [19:17])    3
                  Offset                             6
                  Dirty                              1
                  Preuse distance                    8
                  Age since insertion                8
                  Age since the last access          8
                  Last access type: Load             1
                  Last access type: RFO              1
                  Last access type: Prefetch         1
                  Last access type: Writeback        1
                  Access count: Load                 8
                  Access count: RFO                  8
                  Access count: Prefetch             8
                  Access count: Writeback            8
                  Line hits since insertion          6
                  Line misses since insertion        6
                  Recency                            4

FIGURE 3. RL cache simulation model

area, and timing constraints. Therefore, we scrutinized the neural network and used the findings to develop a hardware-implementable replacement algorithm.

A. KEY COMPONENTS OF RL CACHE SIMULATION MODEL
Fig. 3 shows the proposed RL cache simulation model. We employ the LRU policy to generate unbiased LLC access traces. These traces are fed into a Python-based cache simulation model equipped with an RL agent responsible for making the replacement and bypass decisions.

1) LLC current access
When a cache line access occurs, the RL model determines whether the LLC access is a hit or a miss. The state is updated on a hit, and the RL framework proceeds to the subsequent access. On encountering a non-compulsory miss, the RL framework collaborates with the agent to make replacement and bypass decisions.

2) State vector
The state vector comprises three distinct types of information: access, set, and way. The access embodies the specifics of the current cache access. The set provides information regarding the cache set being accessed. Lastly, the way conveys details of each way within the accessed cache set. These three data types provide the agent with a comprehensive understanding of the current cache state, guiding its decision-making process.

We used a 32-bit physical address for the LLC. It is divided into tag bits, index bits, and a byte offset. The low part of the tag bits is partitioned into 3-bit segments, starting from the rightmost bit of the tag and excluding the set index. Fig. 4 illustrates the partitioning of tag bits into three categories for experimental purposes: bits [25:23] as the 1st-tag-subset, bits [22:20] as the 2nd-tag-subset, and bits [19:17] as the 3rd-tag-subset.

Table 2 lists the features employed in the RL agent. The state vector for a 16-way set-associative LLC is represented by 323 single-precision floating-point values (access information: 8, set information: 11, line information: 304). Every numerical feature is normalized to values between -1 and 1.

Using diverse information, we focus on uncovering new critical features that are not typically discerned heuristically. For instance, features derived from tags are neither commonly used in cache replacement policies nor typically considered pivotal for cache line eviction or main-


Algorithm 1 Reward function for determining victim selection (Rvictim)
Input: Actionvictim, action of victim selection by agent; RDaccess, reuse distance of current access; RDn, reuse distance of way n in corresponding set;
Output: rvictim, reward of victim selection;
1: Beladyvictim ← way with max(RDn)
2: if Actionvictim is Beladyvictim then
3:    rvictim ← 1
4: else if RD of Actionvictim < RDaccess then
5:    rvictim ← −1
6: else
7:    rvictim ← 0
8: end if
9: return rvictim

Algorithm 2 Reward function for determining bypass decision (Rbypass)
Input: Actionbypass, action of bypass by agent; RDaccess, reuse distance of current access; RDn, reuse distance of way n in corresponding set;
Output: rbypass, reward of bypass decision;
1: if RDaccess > max(RDn) then
2:    if Actionbypass is Bypass then
3:       rbypass ← 3
4:    else
5:       rbypass ← −2
6:    end if
7: else
8:    if Actionbypass is Bypass then
9:       rbypass ← −1
10:   else
11:      rbypass ← 0
12:   end if
13: end if
14: return rbypass

tenance. However, we observed a correlation between certain tag bits and the agent's replacement decisions. We also experimented with additional features, such as the accessed tag-bit sequence and prefetch usefulness counters. These features were excluded from the RL model, taking into account factors such as computational overhead, implementation complexity in the replacement policy, and marginal contribution to performance enhancement. Implementing a policy based on impactful segments, identified from the agent's response in various scenarios, provides reliable performance while containing the hardware overhead.

3) Agent
The agent shown in Fig. 3 evaluates the current state vector and generates an output vector corresponding to an n-way set-associative LLC. The output vector signifies the merit of evicting each way and of bypassing the current access. We use a deep Q-network (DQN) [19] with a multi-layer perceptron (MLP) with a single hidden layer using a tanh activation function. The neural network involves 323 input neurons and 256 neurons in the hidden layer. Agentvictim has 16 output neurons corresponding to the 16-way LLC, and Agentbypass has 2 output neurons for the bypass decision. On every LLC miss, the agent selects the victim way based on the output of the network. We adopt the ϵ-greedy algorithm [19] to avoid overfitting, using an initial value of 0.5 and decaying it by 0.001 every 1024 steps until it reaches 0.01.

4) Reward
To approximate the behavior of Belady+ in Stormbird, we crafted two reward functions: one for the selection of the victim way (Rvictim, 4-a in Fig. 3) and the other for determining bypass decisions (Rbypass, 4-b in Fig. 3).

Rvictim provides a positive reward only when the agent selects the same victim way as Belady's algorithm would. In contrast, if the reuse distance of the way chosen by the agent is shorter than the reuse distance of the current access, Rvictim provides a negative reward, because the selected way could have been reused sooner in the future but is instead being evicted in favor of the current access, which will be reused later. In other situations, a neutral reward is provided.

To balance positive and negative rewards, we set the magnitudes of positive and negative rewards equal. Furthermore, considering the clarity of decision boundaries and computational efficiency, we designated the reward values as integers. The bypass decision made by Belady+ occurs less often than victim selection, which arises on every cache miss. This infrequency makes learning the bypass decision more challenging than the victim selection, necessitating a more sophisticated determination of the reward value for Rbypass. The agent can recognize the importance of an action by receiving a larger reward for more essential actions or outcomes, enhancing learning efficacy.

Therefore, if the bypass action is selected by both the agent and Belady+ (i.e., if the reuse distance of the cache access exceeds that of every way within the cache set, the incoming data are determined to be bypassed), a reward of large magnitude (+3) is provided to encourage the agent to learn this decision strongly. In cases where only Belady+ selects the bypass action and the agent does not, a negative reward (−2) is provided as a strong penalty for missing this behavior. Additionally, a minor negative reward (−1) is assigned for the agent's unilateral decision to bypass, mitigating unnecessary bypass actions.

The rationale for assigning different magnitudes to the three distinct reward scenarios is deliberately structured so that the aggregate sum of rewards equals zero. This equilibrium is intended to balance the reward system, ensuring that the agent's learning process is neither excessively penalized nor unjustifiably rewarded, thus facilitating an unbiased adaptation to the Belady+ algorithm's behavior.
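The two reward functions translate directly into code. The following Python sketch mirrors Algorithms 1 and 2 (function and variable names are ours; `rd_ways` maps each way to its forward reuse distance as observed by the Belady+ oracle):

```python
import math

def reward_victim(action_victim, rd_access, rd_ways):
    """Algorithm 1: reward for victim selection."""
    # Belady's choice: the way with the maximum reuse distance.
    belady_victim = max(rd_ways, key=rd_ways.get)
    if action_victim == belady_victim:
        return 1          # agreed with Belady's choice
    if rd_ways[action_victim] < rd_access:
        return -1         # evicted a line that would be reused sooner
    return 0              # neutral otherwise

def reward_bypass(action_bypass, rd_access, rd_ways):
    """Algorithm 2: reward for the bypass decision."""
    if rd_access > max(rd_ways.values()):   # Belady+ would bypass
        return 3 if action_bypass else -2   # strong reward / strong penalty
    return -1 if action_bypass else 0       # discourage unilateral bypass

# Toy 4-way set; way 2 is never reused, so Belady would evict it.
rd = {0: 5, 1: 12, 2: math.inf, 3: 7}
print(reward_victim(2, 9, rd))                        # agent matches Belady → 1
print(reward_bypass(True, math.inf, {0: 5, 1: 12}))   # both choose bypass → 3
```

Note how the asymmetric bypass rewards (+3, −2, −1, 0) encode the zero-sum balancing described above, while the victim rewards stay symmetric (+1, −1, 0).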


FIGURE 5. The learning process of 602.gcc over 100 epochs (Agentvictim)

FIGURE 6. The learning process of 607.cactuBSSN over 100 epochs (Agentbypass)

5) Training
In alignment with numerous RL implementations, experience replay [19] serves a pivotal role in training the RL networks. This method uses a batch of transitions randomly sampled from the replay buffer. Through a series of experiments, we found it essential to assign a distinct replay buffer size to each benchmark. Empirical analysis revealed that a replay buffer size equivalent to 1/10 of the total LLC access count yielded the best results during training.

For each benchmark, 100 training epochs were applied, enabling thorough training and evaluation of the RL model. Figs. 5 and 6 illustrate the evolution of feature weights over epochs. As the learning progresses, a distinction in the weight significance of each feature emerges, highlighting the adaptation of the model to the importance of each feature for cache replacement.

The feature weights were computed based on the trained RL network. Fig. 7 presents a weight heatmap for selecting the victim way using the Belady+ algorithm, and Fig. 8 shows a weight heatmap for bypass learning based on the same algorithm. Given that the learning directions required for selecting the victim way and for bypassing differ, we conducted separate training for both cases and derived their respective weight heatmaps. Weights were extracted from each heatmap, and the geometric mean of the weights was computed and subsequently normalized individually for each benchmark. This process ensures that the relative importance of each feature is accurately represented across various benchmarks, accommodating the diverse nature of the workloads. The experiments encompassed all benchmark types from SPEC CPU 2017 to guarantee fair feature selection across a broad spectrum of workloads.

FIGURE 7. Weight heatmap for victim selection

FIGURE 8. Weight heatmap for bypass mechanism

B. RL WEIGHT HEATMAP ANALYSIS
Figs. 7 and 8 illustrate the learning from the Belady+ replacement policy, indicating the significance of each feature.
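The aggregation step can be sketched as follows. This is an illustrative reconstruction, not the authors' exact code: it scores each input feature by its mean absolute first-layer weight, normalizes each benchmark's scores, and combines benchmarks with a per-feature geometric mean so that a feature must matter broadly rather than in a single workload (the paper applies the geometric mean before per-benchmark normalization; the order here is our simplification):

```python
import math

def feature_importance(first_layer_weights):
    """Mean absolute first-layer weight per input feature.
    first_layer_weights: list of hidden-neuron rows, each a list
    of per-input weights."""
    n_hidden = len(first_layer_weights)
    n_inputs = len(first_layer_weights[0])
    return [sum(abs(row[i]) for row in first_layer_weights) / n_hidden
            for i in range(n_inputs)]

def aggregate_across_benchmarks(per_benchmark):
    """Normalize each benchmark's importance vector to [0, 1], then
    take a per-feature geometric mean across benchmarks."""
    normalized = [[v / max(vec) for v in vec] for vec in per_benchmark]
    n = len(normalized)
    return [math.exp(sum(math.log(vec[i] + 1e-12) for vec in normalized) / n)
            for i in range(len(normalized[0]))]

# Toy example with 3 features and 2 "benchmarks":
imp_a = feature_importance([[0.2, 0.9, 0.1], [0.4, 0.7, 0.1]])
imp_b = feature_importance([[0.3, 0.8, 0.6], [0.1, 0.6, 0.4]])
scores = aggregate_across_benchmarks([imp_a, imp_b])
print(max(range(3), key=lambda i: scores[i]))  # feature 1 dominates → 1
```

The geometric mean is a deliberate choice here: a feature that is strong in one benchmark but near zero in another is pulled down sharply, matching the paper's goal of fair feature selection across workloads.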


FIGURE 9. LLC Hit rate comparison (16 way, 2MB LLC, w\o prefetcher)
FIGURE 10. IPC comparisons across benchmarks using a cache access
type-aware eviction scheme
TABLE 3. Comparison of cache replacement policies.

Benchmarks \IPC LRU Hawkeye Mockingjay the latter two policies, indicating a substantial deviation from
605.mcf 0.291 0.347 0.387 the LRU. The overall LLC hit rate is higher in LRU than in
619.lbm 0.498 0.458 0.467 the other cases, but IPC improves on the other two policies
620.omnetpp 0.363 0.370 0.373
621.wrf 0.935 0.960 0.958 except for 619.lbm.
625.x264 1.837 1.865 1.867 Fig. 10 shows IPC performance with a simple cache ac-
628.pop2 1.441 1.450 1.458
654.roms 0.551 0.584 0.593
cess type-aware eviction scheme added to the basic LRU
policy. The data indicates that a writeback-aware eviction
technique yields the most substantial improvement, with IPC
enhancements reaching up to 1.88%. This result suggests
Since bypass occurs less frequently than victim selection,
that considering cache access type techniques can improve
features do not stand out as distinctly in the bypass heatmap
the replacement policy performance, especially in writeback
compared to victim selection. Examining the heatmap re-
access.
veals several features as pivotal: set-access-count-since-last-
miss (SALM), tag-bit-subset-array (TBSA), line-recency, Fig. 9 and 10 indicate that writeback miss has a mi-
and access-type. The following discussion delves into the nor impact on IPC because it occurs when data are being
importance of these features.

A higher SALM count implies that the set is frequently hit, indicating more temporal locality for the LLC in the corresponding set. This information could influence the aggressiveness of the replacement policy, with a more assertive eviction strategy in sets with a low SALM.

TBSA captures the spatial locality of the LLC and provides a compact yet informative representation of the cache lines. The details of the TBSA are described in Section IV.

Line-recency represents the principle of LRU, especially over the cache access sequence. However, from Belady's perspective, the MRU line is likely to be evicted, because the MRU line typically has the longest reuse distance in the access sequences. The sequence of cache line accesses itself underscores the importance of victim selection.

Access-type plays a pivotal role in the efficacy of cache replacement policies, given that different access types entail varying latencies and miss penalties. A comparison of the hit rate and IPC results of LRU, Hawkeye, and Mockingjay based on the cache access type is presented in Fig. 9 and Table 3, respectively. In 654.roms, the benchmark exhibiting the largest difference between the load hit rate and the writeback hit rate, writeback hits account for only 26.49~27.46%. Writebacks occur when data are written back to memory from the cache rather than when data are being fetched for processing. Stormbird incorporates this observation at the hardware implementation stage, enabling a more IPC-effective cache replacement policy that assigns different priority values depending on the cache access type.

Our RL framework, which emphasizes the above four features, aligns well with the traditional principles of cache management, underscoring the efficacy of our RL-based approach for cache replacement. Each feature exploits the cache's current state and potential future behavior, enabling the model to make informed decisions regarding which cache lines to evict.

IV. PROPOSED CACHE REPLACEMENT POLICY: STORMBIRD
Building on these simulation observations, we introduce a cache replacement policy called Stormbird. It is designed to effectively exploit temporal and spatial localities, discern between different access types, and manage cache contention adaptively. Stormbird incorporates a victim selection and a bypass policy, reflecting the observations acquired through RL simulation. Resource efficiency has been a primary consideration in the development of Stormbird, designing a
policy to achieve competitive performance with minimized hardware overhead.

Set competition priority (Pset): We use SALM to assign a competition priority to each LLC set. Integrating Pset with the set dueling [20], [21] technique ensures effective exploitation. An LLC set with a higher SALM implies a higher hit rate for that set, suggesting that an age-based replacement policy is required to avoid unnecessary cache line evictions. Conversely, a lower SALM indicates frequent misses in the set, where a context-aware (e.g., addressing access-type or TBSA) replacement policy helps.

We set the thresholds empirically based on an extensive analysis of the behaviors of different workloads. For the 16-way set-associative cache, Fig. 11 shows the distribution of the highest SALM value in each benchmark. For example, the 607.cactuBSSN and 623.xalancbmk benchmarks exhibit SALM greater than 64 in most sets; therefore, an age-based policy would be effective for these benchmarks. Conversely, benchmarks such as 603.bwaves and 649.fotonik3d demonstrate a different behavior, in which all LLC sets have a SALM smaller than 16. These benchmarks indicate the necessity of a more context-aware replacement policy that prioritizes current access characteristics over age-based or reuse-distance-based policies.

FIGURE 11. Distribution of the highest value of SALM

We segment LLC sets using SALM thresholds of 16, 32, and 64. When SALM is below 16, Pset is 1, indicating a context-aware policy. For SALM between 16 and 32, Pset is 2; between 32 and 64, it is 4; and above 64, Pset is 8, representing an age-based policy.

Tag bits priority (Ptag): We explored the correlation between the cache tag bits and victim selection. For most benchmarks, we discovered a pattern in which TBSA frequently led to the eviction of a way with either the same TBSA as the current access or a consistent offset in TBSA.

The idea for TBSA came from the opportunistic aspects provided by the lower part of the cache tag address in terms of spatial locality. When using an n-bit segment size TBSA, there are 2^(2n) possible access-evict TBSA pairs. Therefore, the exclusion of TBSA segment sizes exceeding 4 bits was due to concerns regarding computational overhead. We conducted experiments on the LLC TBSA using segment sizes of 2, 3, and 4 bits.

However, when the TBSA segment size was set to 2 or 4 bits, we observed that the pattern was inconsistent between different benchmarks. For instance, the 4-bit segment size TBSA from 644.nab [24:21] does not manifest consistency on "0000", "1000", "1001", and "1111" (blue boxes in Fig. 12 (a)). Also, in the 2-bit segment size TBSA, [20:19] shows different irregular patterns across the benchmarks under evaluation (blue boxes in Fig. 13). These non-generalized results indicate that 2-bit and 4-bit segment size TBSA are unsuitable for replacement predictors in a 2MB 16-way LLC. The features for cache replacement should show consistent behavior across various benchmarks. Hence, a 3-bit segment size was chosen for the baseline architecture of this study to provide balanced and dependable performance.

Fig. 14 shows the pattern of tag-related evictions for the 619.lbm, 605.mcf, and 644.nab benchmarks when Belady+ is applied. As shown, the 1st-TBSA and 2nd-TBSA mainly evict the way that has the same TBSA as the current access. This observation not only underscores the significance of spatial locality in cache accesses but also suggests that TBSA can be a predictive indicator for eviction decisions. Efforts have been made to analyze tag bit patterns in areas such as mapping policies [22] and cache prefetchers [23]. We integrated these TBSA into our replacement policy: cache lines with the same TBSA as the current access are assigned a higher eviction priority.

However, while the 3rd-TBSA tends to evict ways with a tag one bit smaller than the current access, other patterns were also observed. Depending on the benchmark, it selects ways with a tag two bits (649.fotonik3d in Fig. 15-(d)) or three bits (627.cam4 in Fig. 15-(e)) smaller than the current access, or it follows patterns that do not adhere to this rule (600.perlbench in Fig. 15-(f)). This inconsistency indicates that the optimal eviction strategy may vary according to specific workload characteristics. However, it is not straightforward to define the pattern clearly, because the characteristics appear per benchmark and must be analyzed adaptively at runtime.

Similarly, our initial attempts to employ set dueling with a tag bit offset during the warm-up phase were undertaken to identify and isolate 3rd-TBSA patterns. However, this approach results in the unintended consequence of cache pollution. Moreover, the accuracy in determining the offset was not consistent with that of Belady+. Therefore, we decided to focus on the 1st-TBSA and 2nd-TBSA, where pattern recognition was more reliable than for the 3rd-TBSA.
(a) 644.nab [24:21] TBSA
(b) 644.nab [20:17] TBSA
FIGURE 12. 644.nab 4-bit segment size TBSA

(a) 644.nab [20:19] TBSA
(b) 649.fotonik3d [20:19] TBSA
FIGURE 13. 644.nab and 649.fotonik3d 2-bit segment size TBSA

Line recency priority (Precency): To reduce hardware overhead, we opt to use an age counter and store the preuse distance rather than relying on recency. While recency simply indicates the relative order of access among the ways, the associated hardware requirement is substantial: it necessitates 4 bits per way, amounting to 16KB of overhead for a 16-way 2MB LLC. Instead of storing an age counter for each way, we employ a look-up table that uses an 8-bit portion of the tag bits as an index. For every LLC access, the 18-bit saturation counter value is incremented, and the 16-bit counter value, obtained by a 2-bit right shift, is stored. This 2-bit right shift serves as a discretization process that reduces the storage requirements for the counter values.

This look-up table merges the spatial locality of the cache address with the preuse distance of the age counter. Notably, by incorporating a previously unused feature, the 3rd-TBSA, and employing a tag-based age counter, we effectively substitute the role of recency. Precency consumes only 1KB, a mere 6.25% of the overhead necessitated by traditional LRU recency.

Cache access type priority (Ptype): The RL weight heatmap underscores the importance of the cache access type. This finding emphasizes that cache access types notably influence the IPC. Previous experimental results confirmed that IPC can be improved primarily by evicting the writeback type. In addition, the RL heatmap shows that the cache access type is important for bypassing.
(a) 619.lbm 1st TBSA (b) 619.lbm 2nd TBSA

(c) 605.mcf 1st TBSA (d) 605.mcf 2nd TBSA

(e) 644.nab 1st TBSA (f) 644.nab 2nd TBSA

FIGURE 14. 1st and 2nd TBSA for different benchmarks

(a) 619.lbm 3rd TBSA (b) 605.mcf 3rd TBSA

(c) 644.nab 3rd TBSA (d) 649.fotonik3d 3rd TBSA

(e) 627.cam4 3rd TBSA (f) 600.perlbench 3rd TBSA

FIGURE 15. 3rd TBSA for different benchmarks
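The TBSA segments referenced in Figs. 12 through 15 are bit fields taken from the lower part of the tag/address. A minimal sketch of extracting such a segment using the paper's [hi:lo] notation; the inclusive bit indexing and the helper name are our assumptions.

```python
def tbsa_segment(tag_bits: int, hi: int, lo: int) -> str:
    """Extract bits [hi:lo] (inclusive) of an address/tag as a
    binary string, e.g., the 4-bit [24:21] segment from Fig. 12."""
    width = hi - lo + 1
    value = (tag_bits >> lo) & ((1 << width) - 1)
    return format(value, f"0{width}b")

# example: an address with bits 24 and 21 set
addr = (1 << 24) | (1 << 21)
# tbsa_segment(addr, 24, 21) -> "1001"
```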

TABLE 4. Comparative hardware overhead of various replacement policies for a 16-way 2MB LLC

Policy             HW Overhead (KB)
LRU                16
SHiP               14
SHiP++             20
Hawkeye            28
Mockingjay         31.91
RLR                16.75
Glider             61.60
Stormbird (ours)   10.50

FIGURE 16. Cache access type of bypass with Belady+
TABLE 5. Hardware budget of Stormbird

Component   Budget    Parameters
Pset        1.5KB     6-bit for SALM
Precency    1.002KB   18-bit saturation counter; tag-age vector: 256 entries, 16-bit entry; preuse-distance vector: 256 entries, 16-bit entry
Ptype       8KB       2-bit for access type
Ptag        0KB       Use LLC tag bits
TABLE 6. Baseline configuration

Out-of-order Core   352-entry ROB, 128-entry LQ, 72-entry SQ, FetchWidth=6, ExecWidth=4, DecodeWidth=6, RetireWidth=4
L1 I-Cache          32KB, 4-way, 4-cycle latency, 8 MSHRs
L1 D-Cache          32KB, 4-way, 4-cycle latency, 16 MSHRs
L2 Cache            256KB, 8-way, 12-cycle latency, 32 MSHRs
LLC per core        2MB, 16-way, 40-cycle latency, 64 MSHRs
DRAM                4GB, 8 ranks, 8 banks, 32K rows, 32 columns, tRP=12.5ns, tRCD=12.5ns, tCAS=12.5ns
Prefetcher          Next-line (L1), IP-stride (L2), None (LLC)

FIGURE 17. Overview of the Stormbird algorithm
Fig. 16 shows that the prefetch type accounts for the majority of bypasses in Belady+. In benchmarks with low prefetch accuracy, bypassing prefetch accesses prevents cache pollution and improves LLC performance.

Fig. 17 illustrates the comprehensive algorithm of Stormbird. The algorithm operates in two main stages: set dueling in the warm-up phase, and LLC bypass and replacement. During the warm-up phase, Stormbird conducts set dueling by performing a modulo operation on the set number and dividing the sets among the three policies.

The tag age-based policy (TAB) relies on the tag age counter and mainly evicts the way with the highest value of the corresponding tag age counter. The degree of age-based behavior is guided by Pset, which relies on the SALM. For Ptag, if the 1st-TBSA and 2nd-TBSA of the current access match the TBSA of the corresponding way, a value of 1K is assigned for each match.

Precency x Pset + Ptag is the base formulation of Stormbird. Precency is prioritized because it is based on the age counter, a pivotal aspect of cache replacement strategies. On the other hand, Pset is derived from SALM, which represents the cache set contention. The product of Precency and Pset combines this temporal and spatial locality, aiming for a balanced evaluation of cache lines based on the recency and frequency of access within a set.

Contrary to Precency and Pset, which concretely quantify the access information of cache lines, Ptag provides a measure of spatial locality for the cache access. As such, Ptag is incorporated additively, assigning it a relatively minor role within the overall replacement and bypass priorities; Ptag is predominantly utilized as a secondary criterion.

Pset is a coefficient that determines how strongly Stormbird acts on an age-counter basis. In cache sets with large SALM values, which indicate a high cache hit rate, the product of Precency and Pset assumes a dominant role in calculating the overall priority. This design ensures that for cache sets with a high frequency of access, eviction decisions depend more on recency than on the spatial locality patterns indicated by Ptag.

Conversely, lower Pset values reduce the influence of Precency x Pset within the overall priority calculation, making the replacement policy dependent on factors such as Ptag, which represents the current access information.
FIGURE 18. IPC over LRU for different LLC replacement policies (SPEC CPU 2017)

FIGURE 19. IPC over LRU for different LLC replacement policies (SPEC CPU 2006)

TABLE 7. Overall IPC performance for different replacement policies

                   1-core (2MB LLC)                 4-core (8MB LLC)
Overall            SPEC CPU 2017   SPEC CPU 2006   SPEC CPU 2017   SPEC CPU 2006   Cloudsuite
Hawkeye            99.70%          101.91%         100.07%         101.63%         100.25%
Mockingjay         100.03%         102.64%         100.37%         102.10%         100.31%
SHiP               99.95%          100.88%         100.18%         101.19%         100.33%
SHiP++             100.32%         101.98%         100.12%         101.36%         100.63%
RLR                99.69%          100.40%         99.62%          100.43%         100.13%
Stormbird (ours)   99.83%          100.56%         99.46%          100.13%         100.28%
The writeback prior evict policy (WPE) operates analogously to its predecessor but introduces flexibility in access-type priority through Ptype. Specifically, Ptype is set to 2 if the current access type is writeback; otherwise, it assumes a value of 1. This policy focuses on prioritizing the eviction of writebacks.

The prefetch prior bypass policy (PPB) parallels the functioning of the tag age-based policy. However, before identifying the victim way, the current access is bypassed if its access type is identified as a prefetch. Consequently, prefetch data are bypassed to the L2 rather than to the LLC. This policy mitigates the potential cache pollution caused by non-reused prefetch data.

After completing the warm-up phase, Stormbird elects the policy with the highest IPC performance among the three policies. Stormbird aspires to enhance performance across a diverse range of benchmarks by employing fine-grained set dueling techniques.

Table 4 illustrates the hardware overhead of each policy for a 16-way 2MB LLC. Stormbird outperforms the competitors in this respect, requiring the least hardware overhead of 10.5KB. Further details of Stormbird's hardware budget are listed in Table 5. Conversely, Glider, which is built on an LSTM model, incurs a higher hardware overhead. Similarly, SHiP, Hawkeye, and Mockingjay, which are predictor-based policies, also exhibit relatively larger hardware overheads.

V. EVALUATION
A. METHODOLOGY
We evaluate Stormbird using the ChampSim [24] simulator initially introduced during the CRC2 [7]. Our evaluation
TABLE 8. Maximum SimPoints of benchmarks

Benchmarks (SPEC 2017)   Simpoints   Benchmarks (SPEC 2006)   Simpoints
600.perlbench            210B        400.perlbench            50B
602.gcc                  734B        429.mcf                  51B
603.bwaves               3699B       433.milc                 127B
605.mcf                  665B        434.zeusmp               10B
619.lbm                  4268B       436.cactusADM            1804B
620.omnetpp              874B        437.leslie3d             273B
621.wrf                  575B        447.dealII               3B
625.x264                 18B         450.soplex               247B
627.cam4                 573B        458.sjeng                1088B
631.deepsjeng            928B        459.GemsFDTD             1491B
644.nab                  5853B       462.libquantum           1343B
649.fotonik3d            1176B       465.tonto                1914B
654.roms                 842B        470.lbm                  1274B
657.xz                   3167B       471.omnetpp              188B
                                     473.astar                153B
                                     481.wrf                  196B
                                     482.sphinx3              1395B
                                     483.xalancbmk            716B

TABLE 9. Benchmark mixes from the SPEC CPU 2017 multicore experiment

mix0 625 621 602 623
mix1 619 620 628 654
mix2 607 631 627 603
mix3 602 620 600 625
mix4 623 620 600 619
mix5 628 619 625 649
mix6 627 649 654 603
mix7 649 600 638 627
mix8 607 628 603 619
mix9 600 644 619 602
mix10 621 607 600 619
mix11 600 603 620 607
mix12 620 628 638 644
mix13 623 605 600 627
mix14 638 641 628 605
mix15 607 620 625 628
mix16 641 654 649 600
mix17 620 628 619 625
mix18 605 625 649 638
mix19 649 602 623 631
mix20 621 627 605 623
mix21 654 627 621 638
mix22 623 631 621 619
mix23 619 644 623 603
mix24 600 602 621 638
mix25 638 602 600 605
mix26 641 623 631 638
mix27 600 623 627 649
mix28 603 654 649 625
mix29 623 649 628 625
mix30 603 620 638 625
mix31 628 625 631 638
mix32 607 638 621 600
mix33 627 603 628 619
mix34 654 602 638 649
mix35 641 628 654 638
mix36 623 607 605 619
mix37 641 649 620 627
mix38 602 654 600 631
mix39 619 621 607 644
mix40 619 625 627 641
mix41 602 641 620 654
mix42 641 621 623 625
mix43 602 654 605 621
mix44 621 641 644 620
mix45 644 605 631 620
mix46 603 620 644 602
mix47 602 621 607 631
mix48 631 602 603 644
mix49 623 621 644 631

covers both single-core and 4-core configurations with out-of-order cores. Table 6 outlines the simulation parameters and provides details of the memory hierarchy.

To ensure a comprehensive and fair evaluation, we incorporated all benchmarks from the default sets of SPEC CPU 2017 [25] and SPEC CPU 2006 [26] as provided in CRC2, excluding those with an LLC miss count of less than 10K. Additionally, for the multi-core experiments, we employed the Cloudsuite [27] benchmarks.

The cache is warmed for 200 million instructions, and we evaluate performance for the following one billion instructions of each trace. If a benchmark finishes early, it is rewound until every other application in the mix has finished. In 4-core simulations, we evaluate performance when four different benchmarks run simultaneously on separate cores. For the 4-core simulations, we generated 50 mixes of four benchmarks from SPEC CPU 2017, another 50 mixes from SPEC CPU 2006, and 5 mixes from Cloudsuite.

We use IPC speedup over LRU to evaluate performance. Single-core performance is measured as IPC_i / IPC_{i,LRU} for each benchmark i, and 4-core performance is measured as (prod_{i=1}^{4} IPC_i / IPC_{i,LRU})^{1/4}, the geometric mean of the four per-core IPC speedups.
B. EXPERIMENTAL RESULTS
This section compares the performance of Stormbird with five leading state-of-the-art cache replacement policies: SHiP, SHiP++, Hawkeye, Mockingjay, and RLR. The implementations of SHiP, SHiP++, and Hawkeye are sourced from the CRC2 website; for Mockingjay, the source code is taken from the authors' GitHub repository. As for RLR, we implemented it ourselves, guided by the original paper, owing to the unavailability of an official source code.

First, we observed IPC degradation in several policies compared with LRU, in contrast to the outcomes reported in their original works. Performance discrepancies can occur for several reasons. Changes in the cache specifications used in the experiments can influence performance outcomes. A notable observation was the impact of the presence or absence of prefetchers in the L1 and L2 caches on the evaluation outcomes. Furthermore, the performance across benchmarks varied considerably based on the traces selected for the experiments. Metrics such as IPC or LLC hit rate, and even the ranking between policies, fluctuated based on individual SimPoints [28]. We executed the benchmarks using the highest-weight SimPoints, ensuring that they most accurately represented benchmark behaviors.

Fig. 18 and 19 present the performance comparisons for SPEC CPU 2017 and SPEC CPU 2006 compared with
TABLE 10. Benchmark mixes from the SPEC CPU 2006 multicore experiment

mix50 410 456 454 473


mix51 400 456 447 435
mix52 401 450 454 470
mix53 416 473 445 483
mix54 462 471 416 429
mix55 470 445 434 456
mix56 434 450 483 410
mix57 471 462 435 456
mix58 464 483 471 416
mix59 445 433 447 410
mix60 433 458 401 483
mix61 410 481 445 453
mix62 447 450 436 473

FIGURE 20. IPC over LRU for different LLC replacement policies (4-core, Cloudsuite)
mix63 436 437 473 471
mix64 482 444 456 470
mix65 454 456 450 416
mix66 447 464 437 401
mix67 482 464 444 416
mix68 400 445 410 444
mix69 462 401 456 400
mix70 471 435 434 450
mix71 453 445 429 403
mix72 459 433 453 470
mix73 403 465 462 416
mix74 416 436 433 458
mix75 483 453 401 482
mix76 459 470 433 483
mix77 437 400 434 471
mix78 454 453 481 401
mix79 465 437 416 429
mix80 410 447 434 483
mix81 456 465 459 481
mix82 465 471 453 458
mix83 403 445 464 483
mix84 437 435 470 444
mix85 403 465 458 437
mix86 429 454 436 458
mix87 437 429 454 471

FIGURE 21. IPC over LRU comparison of different policies in the 4-core setup (SPEC CPU 2017)
mix88 436 462 454 410
mix89 453 458 450 482
mix90 470 400 444 436
mix91 403 450 482 434
mix92 436 444 450 481
mix93 434 433 403 400
mix94 444 435 465 453
mix95 401 458 435 473
mix96 459 458 403 437
mix97 459 400 482 462
mix98 459 403 447 435
mix99 464 462 435 445

LRU. The geometric means of Stormbird's performance across all benchmarks are 99.83% and 100.56%, showcasing its consistent and reliable performance across various application domains. In the 619.lbm benchmark, Stormbird achieves an IPC over LRU of 98.09%, outperforming the other policies and maintaining a 7.90% lead over Hawkeye.

FIGURE 22. IPC over LRU comparison of different policies in the 4-core setup (SPEC CPU 2006)

Our observations indicate that prioritizing the eviction of writeback types is beneficial for this benchmark. However, for the 470.lbm benchmark, Stormbird chooses the PPB policy, a deviation from the approach for 619.lbm. Despite this selection, extended testing reveals that the TAB policy would be more advantageous for 470.lbm in full simulation. This points towards the possibility that the warm-up phase may not have fully captured the benchmark's behavior. The TAB policy proves effective for benchmarks such as 602.gcc and 482.sphinx3, resulting in performance improvements of 1.84% and 15.39%, respectively. The evaluation of mcf's performance in Stormbird yields unsatisfactory results. The tag-based age counter is not well utilized because the
behavior of the mcf benchmark spans a large range of addresses with complex patterns. This result suggests avenues for the future refinement of adaptive cache management techniques.

Fig. 20 shows the performance of the different LLC replacement policies in a 4-core setup using the Cloudsuite benchmark suite. On average, all six policies outperform the LRU policy. Stormbird stands at 100.28%, which is competitive with the other policies. In the streaming benchmark, Stormbird's IPC of 100.56% demonstrates its efficacy, showing a marked improvement over the other policies. Stormbird chooses the TAB policy in the streaming benchmark, implying that the streaming benchmark does not efficiently utilize the prefetched data.

Fig. 21 and 22 show the multicore evaluation of IPC over LRU on the SPEC CPU benchmark suites. Stormbird achieves 99.46% and 100.13%, respectively. Given that Stormbird prioritizes the traits of the current access within the LLC, the irregular access patterns arising from varied workloads hindered Stormbird's performance. PC-based policies are noticeable within the multicore extension. A closer examination of the benchmarks, especially those in which Mockingjay stood out, revealed that a limited set of PC values was accessed more intensively. This observation demonstrates the effective operation of the PC-based reuse distance predictor in the multicore extension.

Table 7 summarizes the overall IPC performance for the single-core and 4-core evaluations. Overall, although all policies showcase competitive performance, specific strengths and weaknesses emerge depending on the benchmark and core configuration. This analysis underscores the importance of a tailored approach for choosing cache replacement strategies contingent on the specific behavior of workloads.

VI. CONCLUSION
In this study, we present a replacement policy built upon RL, extracting innovative features for optimal replacement algorithm behavior. As the computer architecture landscape evolves with ever-growing cache sizes, the importance of reducing hardware overhead cannot be overstated, and we therefore focused on reducing the HW overhead of the replacement policy. One of the pivotal findings of our experiments was the distinction between IPC and hit rate: an increase in the hit rate does not invariably lead to a corresponding rise in IPC. This differentiation is accentuated when the access types vary; therefore, it is necessary to explicitly address access types in the replacement policy. Beyond traditional set dueling, we integrated an approach based on SALM to discern the competitiveness of LLC sets. Additionally, we ensure the application of policies tailored to specific benchmark characteristics. Our proposed replacement policy not only demonstrates performance comparable to existing policies but also achieves a significant reduction in HW overhead.

APPENDIX
Table 8 includes a detailed list of the specific SimPoints used in our benchmarks. Tables 9 and 10 show the components of the benchmark mixes used in the multicore experiments. This addition provides fellow researchers with the necessary information to replicate our experiments, thereby improving the objectivity and reproducibility of the experimental results.

REFERENCES
[1] S. Kumar and P. K. Singh, “An overview of modern cache memory and performance analysis of replacement policies,” in 2016 IEEE International Conference on Engineering and Technology (ICETECH). IEEE, 2016, pp. 210–214.
[2] L. A. Belady, “A study of replacement algorithms for a virtual-storage computer,” IBM Systems Journal, vol. 5, no. 2, pp. 78–101, 1966.
[3] A. Jain and C. Lin, “Back to the future: Leveraging Belady’s algorithm for improved cache replacement,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 78–89, 2016.
[4] I. Shah, A. Jain, and C. Lin, “Effective mimicry of Belady’s min policy,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 558–572.
[5] A. Vakil-Ghahani, S. Mahdizadeh-Shahri, M.-R. Lotfi-Namin, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Cache replacement policy based on expected hit count,” IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 64–67, 2017.
[6] S. Sethumurugan, J. Yin, and J. Sartori, “Designing a cost-effective cache replacement policy using machine learning,” in International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 291–303.
[7] “The 2nd cache replacement championship,” https://crc2.ece.tamu.edu/, 2017.
[8] A. M. Krause, P. C. Santos, and P. O. Navaux, “Avoiding unnecessary caching with history-based preemptive bypassing,” in 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2022, pp. 71–80.
[9] Y. Zeng and X. Guo, “Long short term memory based hardware prefetcher: a case study,” in Proceedings of the International Symposium on Memory Systems, 2017, pp. 305–311.
[10] D. A. Jiménez and C. Lin, “Dynamic branch prediction with perceptrons,” in Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture. IEEE, 2001, pp. 197–206.
[11] Z. Shi, X. Huang, A. Jain, and C. Lin, “Applying deep learning to the cache replacement problem,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 413–425.
[12] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely Jr, and J. Emer, “SHiP: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011, pp. 430–441.
[13] V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “SHiP++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017.
[14] A. Jaleel, K. B. Theobald, S. C. Steely Jr, and J. Emer, “High performance cache replacement using re-reference interval prediction (RRIP),” ACM SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 60–71, 2010.
[15] J. Díaz Maag, P. E. Ibáñez Marín, T. Monreal Arnal, V. Viñals Yúfera, and J. M. Llaberia Griñó, “ReD: A policy based on reuse detection for demanding block selection in last-level caches,” in The Second Cache Replacement Championship: workshop schedule, 2017, pp. 1–4.
[16] D. A. Jiménez and E. Teran, “Multiperspective reuse prediction,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 436–448.
[17] J. Kim, E. Teran, P. V. Gratz, D. A. Jiménez, S. H. Pugsley, and C. Wilkerson, “Kill the program counter: Reconstructing program behavior in the processor cache hierarchy,” ACM SIGPLAN Notices, vol. 52, no. 4, pp. 737–749, 2017.
[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[20] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” ACM SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 381–391, 2007.
TAE HEE HAN (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1992, 1994, and 1999, respectively. From 1999 to 2006, he was with the Telecom R&D center of Samsung Electronics, where he developed 3G wireless, mobile TV, and mobile WiMAX handset chipsets. Since March 2008, he has been with Sungkyunkwan University, Suwon, Korea, as a Professor. From 2011 to 2013, he served as a full-time advisor on System ICs for the Korean Government. His current research interests include SoC/Chiplet architectures for AI, advanced memory architecture, network-on-chip, and system-level design methodologies.

HO JUNG YOO (Student Member, IEEE) received the B.S. degree in electronic engineering from Sungkyunkwan University, Suwon, South Korea, in 2022. He is currently pursuing the M.S. degree in artificial intelligence at Sungkyunkwan University, Suwon, South Korea. His research interests include computer architecture, memory systems, and machine learning.

JEONG HUN KIM (Student Member, IEEE) received the B.S. degree in electronic engineering from Korea University of Technology and Education, Cheonan, South Korea, in 2021. He is currently pursuing the M.S. and Ph.D. degrees in artificial intelligence at Sungkyunkwan University, Suwon, South Korea. His research interests include computer architecture, memory systems, and machine learning.

