
Taking Off the Gloves with Reference Counting Immix

Rifat Shahriyar, Stephen M. Blackburn, Xi Yang
Australian National University
<First.Last>@anu.edu.au

Kathryn S. McKinley
Microsoft Research
mckinley@microsoft.com

Abstract
Despite some clear advantages and recent advances, reference counting remains a poor cousin to high-performance tracing garbage collectors. The advantages of reference counting include a) immediacy of reclamation, b) incrementality, and c) local scope of its operations. After decades of languishing with hopelessly bad performance, recent work narrowed the gap between reference counting and the fastest tracing collectors to within 10%. Though a major advance, this gap remains a substantial barrier to adoption in performance-conscious application domains.
Our work identifies heap organization as the principal source of the remaining performance gap. We present the design, implementation, and analysis of a new collector, RC Immix, that replaces reference counting's traditional free-list heap organization with the line and block heap structure introduced by the Immix collector. The key innovations of RC Immix are 1) to combine traditional reference counts with per-line live object counts to identify reusable memory and 2) to eliminate fragmentation by integrating copying with reference counting of new objects and with backup tracing cycle collection. In RC Immix, reference counting offers efficient collection and the line and block heap organization delivers excellent mutator locality and efficient allocation. With these advances, RC Immix closes the 10% performance gap, outperforming a highly tuned production generational collector. By removing the performance barrier, this work transforms reference counting into a serious alternative for meeting high performance objectives for garbage collected languages.

Categories and Subject Descriptors Software, Virtual Machines, Memory management, Garbage collection

Keywords Reference Counting, Immix, Mark-Region, Defragmentation

OOPSLA '13, October 29-31, 2013, Indianapolis, Indiana, USA.
Copyright © 2013 ACM 978-1-4503-2374-1/13/10. http://dx.doi.org/10.1145/2509136.2509527

1. Introduction
In 1960, researchers introduced the two main branches of automatic garbage collection: tracing and reference counting [14, 24]. Reference counting directly identifies dead objects by counting the number of incoming references. When the count goes to zero, the object is unreachable and the collector may reclaim it. Tracing takes the opposite tack. It identifies live objects by performing a transitive closure over the object graph, implicitly identifying dead objects. It then reclaims all untraced objects.
Reference counting has advantages. 1) It may reclaim objects as soon as they are no longer referenced. 2) It is inherently incremental. 3) Its operations are object-local, rather than global in scope. Its major disadvantage is that it cannot reclaim cycles and therefore it requires a backup tracing collector [2, 18]. This limitation has the practical consequence that any reference counter that guarantees completeness (i.e., it will eventually reclaim all garbage) essentially requires two collector implementations. Furthermore, the performance of reference counting implementations lagged high performance tracing collectors by 30% or more until recently [21, 22, 27]. In 2012, Shahriyar et al. solved two problems responsible for much of the performance overhead of reference counting. This paper identifies and solves the remaining problems, completely eliminating performance degradation as a barrier to adoption.
Shahriyar et al. identify the following characteristics of programs and use them to optimize reference counting. (We call their collector RC for simplicity.)
1. The vast majority of reference counts are low, less than five. The RC collector uses only a few bits for the reference count. It sticks counts at a maximum before they overflow and then corrects stuck counts when it traces the heap during cycle collection.
2. Many reference count increments and decrements are to newly allocated objects. RC elides reference counting of new objects and allocates them as dead, which eliminates a lot of useless work.

RC performs deferred reference counting and occasional backup cycle tracing. Deferral trades vastly fewer reference counting increments and decrements for less immediacy of reclamation. RC divides execution into three distinct phases: mutation, reference counting collection, and cycle collection. The result is a reference counting collector with the same performance as a whole-heap tracing collector, and within 10% of the best high performance generational collector in MMTk [6, 8, 27].
This paper identifies that the major source of this 10% gap is that RC's free-list heap layout has poor cache locality and imposes instruction overhead. Poor locality occurs because free-list allocators typically disperse contemporaneously allocated objects in memory, which degrades locality compared to allocating them together in space [6, 8]. Instruction overheads are greater in free lists, particularly when programming languages require objects to be pre-initialized to zero. While a contiguous allocator can do bulk zeroing very efficiently, a free-list allocator must zero object-by-object, which is inefficient [33].
To solve these problems, we introduce Reference Counting Immix (RC Immix). RC Immix uses the allocation strategy and the line and block heap organization introduced by Immix mark-region garbage collection [6]. Immix places objects created consecutively in time consecutively in space in free lines within blocks. Immix allocates into partially free blocks by efficiently skipping over occupied lines. Objects may span lines, but not blocks. Immix reclaims memory at a line and block granularity.
The granularity of reclamation is the key mismatch between reference counting and Immix that RC Immix resolves. Reference counting reclaims objects, whereas Immix reclaims lines and blocks. The design contributions of RC Immix are as follows.
• RC Immix extends the reference counter to count live objects on a line. When the live object count of a line is zero, RC Immix reclaims the free line.
• RC Immix extends opportunistic copying [6], which mixes copying with leaving objects in place. RC Immix adds proactive copying, which combines reference counting and copying to compact newly allocated live objects. RC Immix on occasion reactively copies old objects during cycle detection to eliminate fragmentation.
Combining copying and reference counting is novel and surprising. Unlike tracing, reference counting is inherently local, and therefore in general the set of incoming references to a live object is not known. However, we observe two important opportunities. First, in a reference counter that coalesces increments and decrements [21, 22], since each new object starts with no references to it, the first collection must enumerate all references to that new object, presenting an opportunity to move that object proactively. We find that when new objects have a low survival rate, the remaining live objects are likely to cause fragmentation. We therefore copy new objects, which is very effective in small heaps. Second, since completeness requires a tracing cycle collection phase, RC Immix seizes upon this opportunity to incorporate reactive defragmentation of older objects. In both cases, we use opportunistic copying, which mixes copying and leaving objects in place, and thus can stop copying when it exhausts available memory.
Two engineering contributions of RC Immix are improved handling of roots and sharing the limited header bits to serve triple duty for reference counting, backup cycle collection with tracing, and opportunistic copying. The combination of these innovations results in a collector that attains great locality for the mutator and very low overhead for reference counting.
Measurements on a large set of Java benchmarks show that for all but the smallest of heap sizes RC Immix outperforms the best high performance collector in the literature. In some cases RC Immix can perform substantially better. In summary, we make the following contributions compared to the previous state of the art [27].
1. We identify heap organization as the remaining performance bottleneck for reference counting.
2. We merge reference counting with the heap structure of Immix by marrying per-line live object counts with object reference counts for reclamation.
3. We identify two opportunities for copying objects — one for young objects and one that leverages the required cycle collector — further improving locality and mitigating fragmentation both proactively and reactively.
4. RC Immix improves performance by 12% on average compared to RC and sometimes much more, outperforming the fastest production collector and eliminating the performance barrier to using reference counting.
Because the memory manager determines performance for managed languages and consequently application capabilities, these results open up new ways to meet the needs of applications that depend on performance and prompt reclamation.

2. Motivation and Related Work
This section motivates our approach and overviews the necessary garbage collection background on which we build. We start with a critical analysis of the performance of Shahriyar et al.'s reference counter [27], which we refer to simply as RC. This analysis shows that inefficiencies derive from 1) remaining reference counting overheads and 2) poor locality and instruction overhead due to the free-list heap structure. We then review existing high performance collectors, reference counting, and the Immix [6] garbage collector upon which we build.

2.1 Motivating Performance Analysis
All previous reference counting implementations in the literature use a free-list allocator because when the collector determines that an object's count is zero, it may then immediately place the freed memory on a free list.

[Figure 1. Immix Heap Organization. Blocks are composed of lines; a global free block allocator supplies blocks, and a bump pointer (cursor and limit) allocates into recycled free lines between the recycled allocation start and limit. Legend: Free (unmarked in previous collection), Live (marked in previous collection), Freshly allocated.]

We start our analysis by understanding the performance impact of this choice, using hardware performance counters. We then analyze RC further to establish its problems and opportunities for performance improvements.

Free-List and Contiguous Allocation The allocator plays a key role in mutator performance since it determines the placement and thus locality of objects. Contiguous memory allocation appends new objects by incrementing a bump pointer by the size of the new object [13]. On the other hand, modern free-list allocators organize memory into k size-segregated free lists [4, 8]. Each free list is unique to a size class and is composed from blocks of contiguous memory. It allocates an object into a free cell in the smallest size class that accommodates the object. Whereas a contiguous allocator places objects in memory based on allocation order, a free list places objects in memory based on their size and free memory availability.
Blackburn et al. [8] show that contiguous allocation in a copying collector delivers significantly better locality than free-list allocation in a mark-sweep collector. Feng and Berger [17] show similar locality benefits from initial contiguous allocation in a free list for C applications, but only when allocation to live ratios are very low, since with high ratios the allocator reverts to free-list allocation. We confirm the locality benefits of contiguous allocation on contemporary hardware, workloads, and allocator implementations below.
When contiguous allocation is coupled with copying collection, the collector must update all references to each moved object [13], a requirement that is at odds with reference counting's local scope of operation. Because reference counting does not perform a closure over the live objects, in general, a reference counting collector does not know of and therefore cannot update all pointers to an object it might otherwise move. Thus far, this prevented reference counting from copying and using a contiguous allocator.
On the other hand, the Immix mark-region heap layout offers a largely contiguous heap layout, line and block reclamation, and copying [6]. Figure 1 shows how Immix allocates objects contiguously in empty lines and blocks (see Section 2.3 for more details). In partially full blocks, Immix skips over occupied lines.
First, to explore the performance impact of free-list allocation, we compare the mutator time, which is the total time minus the collector time, in Table 1. We measure Immix, mark-sweep using a free list, and semi-space [13], across a suite of benchmarks. (See Section 4 for methodology details.) We compare mutator time of Immix to mark-sweep to cleanly isolate the performance impact of the free-list allocator versus the Immix allocator. Mark-sweep uses the same free-list implementation as RC, and neither Immix nor mark-sweep use barriers in the mutator. We also compare to semi-space. Semi-space is the canonical example of a contiguous allocator and thus an interesting limit point, but it is incompatible with reference counting. The semi-space data confirms that Immix is very close to the ideal for a contiguous allocator.

  Mutator                 Immix   Mark-Sweep   Semi-Space
  Time                    1.000        1.087        1.007
  Instructions Retired    1.000        1.071        1.000
  L1 Data Cache Misses    1.000        1.266        0.966

Table 1. The mutator characteristics of mark-sweep relative to Immix using the geometric mean of the benchmarks. GC time is excluded. Free-list allocation increases the number of instructions retired and L1 data cache misses. Semi-space serves as an additional point of comparison.

The contiguous bump allocator has two advantages over the free list, both of which are borne out in Table 1. The combined effect is almost a 9% performance advantage. The first advantage of a contiguous allocator is that it improves the cache locality of contemporaneously allocated objects by placing them on the same or nearby cache lines, and interacts well with modern memory systems. Our measurements in Table 1 confirm this intuition, showing that a free list adds 26% more L1 data cache misses to the mutator, compared to the Immix contiguous allocator. This degradation of locality has two related sources. 1) Contemporaneously allocated objects are much less likely to share a cache line when using a free list. 2) A contiguous allocator touches memory sequentially, priming the prefetcher to fetch lines before the allocator writes new objects to them. On the other hand, a free-list allocator disperses new objects, defeating hardware prefetching prediction mechanisms. Measurements by Yang et al. show these effects [33].
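To make the two allocation paths concrete, the following is a minimal sketch, not the MMTk implementation: block size, size classes, and field names are illustrative assumptions. It shows why bump allocation keeps contemporaneous objects adjacent and amortizes zeroing over whole blocks, while free-list allocation places objects by size class and zeroes cell-by-cell, the instruction overhead quantified below.

    import java.util.ArrayDeque;
    import java.util.Arrays;

    // Illustrative only: a bump-pointer path and a size-segregated
    // free-list path side by side. Not MMTk code.
    final class AllocationSketch {
      static final int BLOCK = 32 * 1024;
      private final byte[] heap = new byte[BLOCK]; // fresh blocks arrive bulk-zeroed
      private int cursor = 0, limit = BLOCK;       // bump-pointer state

      // Bump allocation: one bounds check and one add; no per-object
      // zeroing because the block was zeroed in bulk when acquired.
      int bumpAlloc(int bytes) {
        if (cursor + bytes > limit) return -1;     // caller fetches a fresh block
        int addr = cursor;
        cursor += bytes;                           // consecutive objects sit adjacent
        return addr;
      }

      // Free-list allocation: one list of free cell addresses per size class.
      private final ArrayDeque<Integer>[] freeLists;
      private final int[] cellSize = {16, 24, 32, 48, 64, 96, 128};

      @SuppressWarnings("unchecked")
      AllocationSketch() {
        freeLists = new ArrayDeque[cellSize.length];
        for (int i = 0; i < freeLists.length; i++) freeLists[i] = new ArrayDeque<>();
      }

      // Find the smallest size class that fits, pop a cell, and zero it
      // cell-by-cell: placement depends on size, not allocation order.
      int freeListAlloc(int bytes) {
        for (int sc = 0; sc < cellSize.length; sc++) {
          if (cellSize[sc] >= bytes && !freeLists[sc].isEmpty()) {
            int addr = freeLists[sc].pop();
            Arrays.fill(heap, addr, addr + cellSize[sc], (byte) 0);
            return addr;
          }
        }
        return -1;                                 // no cell: acquire a new block
      }
    }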

The second advantage of contiguous allocation is that it uses fewer instructions per allocation, principally because it zeros free memory in bulk using substantially more efficient code [33]. The allocation itself is also simpler because it only needs to check whether there is sufficient memory to accommodate the new object and increase the bump pointer, while the free-list allocator has to look up and update the metadata to decide where to allocate. However, we inspect generated code and confirm the result of Blackburn et al. [8] — that in the context of a Java optimizing compiler, where the size of most objects is statically known, the free-list allocation sequence is only slightly more complex than for the bump pointer. The overhead in additional instructions shown in Table 1 is therefore solely attributable to the substantially less efficient cell-by-cell zeroing required by a free-list allocator. We measure a 7% increase in the number of retired instructions due to the free list compared to Immix's contiguous allocator.

Analyzing RC Overheads We use a similar analysis to examine mutator overheads in RC [27] by comparing to Sticky Immix [6, 15], a generational variant of Immix. We choose Sticky Immix for its similarities to RC. Both collectors a) are mostly non-moving, b) have generational behavior, and c) use similar write barriers. This comparison holds as much as possible constant but varies the heap layout between free list and contiguous.

  Mutator                 Sticky Immix      RC   RC Immix
  Time                           1.000   1.093      0.975
  Instructions Retired           1.000   1.092      0.972
  L1 Data Cache Misses           1.000   1.329      1.018

Table 2. The mutator characteristics of RC and Sticky Immix, which except for heap layout have similar features. GC time is excluded. RC's free-list allocator increases instructions retired and L1 cache misses. RC Immix serves as a point of comparison.

Table 2 compares mutator time, retired instructions, and L1 data cache misses of RC and Sticky Immix. The mutator time of RC is on average 9.3% slower than Sticky Immix, which is reflected by the two performance counters we report. 1) RC has on average 9.2% more mutator retired instructions than Sticky Immix. 2) RC has on average 33% more mutator L1 data cache misses than Sticky Immix. These results are consistent with the hypothesis that RC's use of a free list is the principal source of overhead compared to Sticky Immix, and motivates our design that combines reference counting with the Immix heap structure.

2.2 High Performance Reference Counting
The first account of reference counting was published by George Collins in 1960 [14], just months after John McCarthy first described tracing garbage collection [24]. The two approaches are duals. Reference counting directly identifies dead objects by keeping a count of the number of references to each object, freeing the object when its count reaches zero. Tracing algorithms, such as McCarthy's, do not directly identify dead objects, but rather, they identify live objects, and the remaining objects are implicitly dead. Most high performance tracing algorithms are exact, which means that they precisely identify all live objects in the heap. To identify live objects, they must enumerate all live references from the running program's stacks, which means that the runtime must maintain accurate stack maps. Maintaining stack maps is a formidable engineering burden, and is a reason why some language developers use reference counting rather than tracing [19]. To build stack maps, the compiler (or interpreter) must be able to determine for every register and stack location, at every point in the program's execution where a GC is legal, whether that location contains a valid heap reference or not.
Collins' first reference counting algorithm suffered from significant drawbacks including: a) an inability to collect cycles of garbage, b) overheads due to tracking very frequent pointer mutations, c) overheads due to storing the reference count, and d) overheads due to maintaining counts for short lived objects. The following paragraphs briefly outline five important optimizations developed over the past fifty years to improve over Collins' original design. Shahriyar et al. show that together these optimizations deliver competitive performance [27].

Deferral To mitigate the high cost of maintaining counts for rapidly mutated references, Deutsch and Bobrow introduced deferred reference counting [16]. Deferred reference counting ignores mutations to frequently modified variables, such as those stored in registers and on the stack. Deferral requires a two phase approach, dividing execution into distinct mutation and collection phases. This tradeoff reduces reference counting work significantly, but delays reclamation.
Since deferred references are not accounted for during the mutator phase, the collector counts other references and places zero count objects in a zero count table (ZCT), deferring their reclamation. Periodically, in a GC reference counting phase, the collector enumerates all deferred references into a root set and then reclaims any object in the ZCT that is not in the root set.
Bacon et al. [3] eliminate the zero count table by buffering decrements between collections. At collection time, the collector temporarily increments a reference count to each object in the root set and then processes all of the buffered decrements. Although much faster than naïve immediate reference counting, these schemes typically require stack maps to enumerate all live pointers from the stacks. Stack maps are an engineering impediment, which discourages many reference counting implementations from including deferral [19].
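The sketch below illustrates deferral with buffered decrements in the style of Bacon et al. described above. The Obj class, the free() hook, and the single-threaded structure are assumptions for illustration; the essential point is that stack and register mutations are never counted, and roots receive temporary increments at collection time.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;

    // A minimal sketch of deferred reference counting with buffered
    // decrements. Illustrative only.
    final class DeferredRC {
      static final class Obj { int rc; List<Obj> fields = new ArrayList<>(); }

      private final ArrayDeque<Obj> decBuffer = new ArrayDeque<>();

      // Barrier applied only to heap-field writes; stack writes are ignored,
      // which is precisely what deferral buys.
      void writeHeapField(Obj holder, int i, Obj newRef) {
        Obj old = holder.fields.get(i);        // assumes fields is pre-sized
        if (newRef != null) newRef.rc++;       // immediate increment
        if (old != null) decBuffer.add(old);   // decrement deferred to collection
        holder.fields.set(i, newRef);
      }

      // Collection phase: roots get temporary increments so objects
      // reachable only from stacks and registers are not reclaimed.
      void collect(List<Obj> roots) {
        for (Obj r : roots) r.rc++;
        while (!decBuffer.isEmpty()) decrement(decBuffer.poll());
        for (Obj r : roots) decBuffer.add(r);  // retract root increments next cycle
      }

      private void decrement(Obj o) {
        if (--o.rc == 0) {
          for (Obj child : o.fields) if (child != null) decrement(child);
          free(o);                             // hypothetical reclamation hook
        }
      }

      private void free(Obj o) { /* return o's cell to the free list */ }
    }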

Coalescing Levanoni and Petrank observed that all but the first and last in any chain of mutations to a reference within a given window can be coalesced [21, 22]. Only the initial and final states of the reference are necessary to calculate correct reference counts. Intervening mutations generate increments and decrements that cancel each other out. This observation is exploited by remembering (logging) only the initial value of a reference field when the program mutates it between periodic reference counting collections. At each collection, the collector need only apply a decrement to the initial value of any over-written reference (the value that was logged), and an increment to the latest value of the reference (the current value of the reference).
Levanoni and Petrank implemented coalescing using object remembering. The first time the program mutates an object reference after a collection phase a) a write barrier logs all of the outgoing references of the mutated object and marks the object as logged; b) all subsequent reference mutations in this mutator phase to the (now logged) object are ignored; and c) during the next collection, the collector scans the remembered object, increments all of its outgoing pointers, decrements all of its remembered outgoing references, and clears the logged flag. This optimization uses two buffers called the mod-buf and dec-buf. The allocator logs all new objects, ensuring that outgoing references are incremented at the next collection. The allocator does not record old values for new objects because all outgoing references start as null.
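A minimal sketch of the object-remembering barrier just described follows. The buffer names follow the text (mod-buf, dec-buf), but the Obj representation and single-threaded structure are illustrative assumptions.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;

    // Coalescing via object remembering: log an object once per window,
    // snapshot its outgoing references, reconcile at collection time.
    final class CoalescingBarrier {
      static final class Obj {
        boolean logged;
        List<Obj> fields = new ArrayList<>();
      }

      private final ArrayDeque<Obj> modBuf = new ArrayDeque<>(); // mutated objects
      private final ArrayDeque<Obj> decBuf = new ArrayDeque<>(); // their old referents

      // Write barrier: only the first mutation of an object in this window
      // does any logging work; later mutations just store.
      void writeBarrier(Obj holder, int i, Obj newRef) {
        if (!holder.logged) {
          holder.logged = true;
          modBuf.add(holder);
          for (Obj old : holder.fields)
            if (old != null) decBuf.add(old);   // decremented at collection time
        }
        holder.fields.set(i, newRef);           // the store itself is unconditional
      }

      // Collection phase: increment the current referents of each
      // remembered object, apply buffered decrements, clear logged flags.
      void processBuffers() {
        for (Obj o : modBuf) {
          for (Obj cur : o.fields) if (cur != null) increment(cur);
          o.logged = false;
        }
        modBuf.clear();
        while (!decBuf.isEmpty()) decrement(decBuf.poll());
      }

      private void increment(Obj o) { /* bump o's count (see Limited Bit Counts) */ }
      private void decrement(Obj o) { /* decrement, reclaiming on zero */ }
    }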
Limited Bit Counts Each object has a reference count. A dedicated word for the count guarantees that it will never overflow, since a word is large enough to count a pointer from every address in the address space. However, reference counting may use fewer bits [20]. Shahriyar et al. show that using a full word adds non-negligible overhead. They instead use just four bits that are available in the object's header word in many systems [27]. The reference counter leaves any count that is about to overflow in a stuck state, protecting the integrity of the remainder of the header word, but introducing a potential garbage leak. Each time the cycle collector runs it resets each reference count, which has the effect of bounding the impact of stuck reference counts. Shahriyar et al. show that this strategy performs well.

Cycle Collection Reference counting suffers from the problem that cycles of objects will sustain non-zero reference counts, and therefore cannot be collected. To attain completeness, a separate backup tracing collector executes from time to time to eliminate cyclic garbage [31]. Backup tracing must enumerate live root references from the stack and registers, which requires stack maps. For this reason, naïve reference counting implementations usually do not perform cycle collection.
A backup tracing collector typically collects cycles by performing a mark-sweep trace of the entire heap. Researchers tried limiting tracing to mutated objects [2], but subsequently Frampton showed that backup tracing, starting from the roots, performs better [18].
Both RC and RC Immix use backup tracing [18, 27]. Performing the trace requires a mark bit in each object header. During the trace, RC takes the opportunity to recompute all reference counts and thus may fix stuck counts. RC then sweeps all dead objects to the free list. RC Immix uses the same approach, except that it also recomputes object counts on lines and then reclaims free lines and blocks.

Young Objects As the weak generational hypothesis states, most objects die young [23, 30], and as a consequence, young objects are a very important optimization target. All high performance collectors today exploit this observation, typically via a copying generational nursery [30]. Prior work applies the weak generational hypothesis to reference counting by combining reference counting with a copying nursery [5] and by in-place mark-sweep tracing [25].
Shahriyar et al. applied two optimizations to deferred, coalescing reference counting to exploit short lived young objects: 1) lazy mod-buf insertion and 2) allocate as dead. Lazy mod-buf insertion avoids adding new objects to the mod-buf. Instead, it sets a new bit in object headers during allocation and the collector only adds new objects to the mod-buf lazily when it processes increments. During collection, whenever the subject of an increment has its new bit set, the collector first clears the new bit and then pushes the object into the mod-buf. Because, in a coalescing deferred reference counter, all references from roots and old objects will increment all objects they reach, this approach will only retain new objects directly reachable from old objects and the roots. For each object in the mod-buf, the collector will increment each of its children, which makes this scheme transitive. Thus new objects are effectively traced.
The allocate as dead optimization is a simple extension of the above strategy. Instead of allocating objects live with a reference count of one and enqueueing a compensating decrement, this strategy allocates new objects as dead and does not enqueue a decrement. This optimization inverts the presumption: the reference counter does not need to identify new objects that are dead, but must rather identify live objects. This inversion means that the collector performs work in the infrequent case when a new object survives a collection, rather than in the common case when it dies. New objects become live when they receive their first increment when the collector processes the mod-buf. This strategy removes the need for creating compensating decrements and avoids explicitly freeing short lived objects.
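The sketch below combines lazy mod-buf insertion and allocate-as-dead as described above, under the same simplifying assumptions as the earlier sketches (an Obj with header flags and a child array). New objects are born dead, so objects that die before their first collection cost nothing at all.

    import java.util.ArrayDeque;

    // Lazy mod-buf insertion plus allocate-as-dead. Illustrative only.
    final class YoungObjectRC {
      static final class Obj {
        boolean isNew;         // set at allocation, cleared on first increment
        int rc;                // born dead: zero, with no compensating decrement
        Obj[] children = new Obj[0];
      }

      private final ArrayDeque<Obj> modBuf = new ArrayDeque<>();

      Obj allocate() {
        Obj o = new Obj();
        o.isNew = true;        // allocate as dead: no rc work, no dec-buf entry
        return o;
      }

      // Processing an increment during collection. The first increment to a
      // new object retains it and schedules its children, so surviving new
      // objects are effectively traced; unreferenced ones are never visited.
      void processIncrement(Obj target) {
        if (target.isNew) {
          target.isNew = false;
          modBuf.add(target);  // lazily log the survivor for child increments
        }
        target.rc++;
      }

      void drainModBuf() {
        while (!modBuf.isEmpty())
          for (Obj child : modBuf.poll().children)
            if (child != null) processIncrement(child);
      }
    }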

Modern widely used implementations of reference counting employ few, if any, of the above optimizations. The likely explanation for this phenomenon is two-fold. First, reference counting lagged the best generational garbage collectors by 40% until 2012, when Shahriyar et al. closed the gap to 10%. Consequently, reference counting is not currently used in performance critical settings. Second, one attraction of simple reference counting implementations is that they do not require sophisticated runtime system support, such as precise stack maps. Many of the optimizations we describe here require the same runtime support as a tracing collector, undermining a principal advantage of a simple implementation. The reference counting implementations in widely used languages such as PHP and Objective-C are naïve. Because they lack these optimizations, they are inefficient. Shahriyar et al.'s collector implements all these optimizations and is our RC baseline.
We have outlined the state of the art in reference counting. RC Immix builds upon this foundation and then extends it by a) changing the underlying heap structure, and b) performing proactive and reactive copying to mitigate fragmentation and improve locality. The result is that RC Immix entirely eliminates the 10% performance overhead suffered by the fastest previous reference counting implementation.

2.3 Heap Organization and Immix
Blackburn and McKinley outline three heap organizations: a) free lists, b) contiguous, and c) regions [6]. Until now, reference counting used a free-list heap structure. In this paper, we adapt reference counting to use regions. In particular, we combine object reference counting with the line and block reclamation strategy used by Immix.

Free List A free-list allocator uses a heap structure that divides memory into cells of various fixed sizes [32]. When space is required for an object, the allocator searches a data structure called a free list to find a cell of sufficient size to accommodate the object. When an object becomes free, the allocator returns the cell containing the object to the free list for reuse. Free lists are used by explicit memory management systems and by mark-sweep and reference counting garbage collectors. Importantly, free-list allocators do not require copying of objects, which makes them particularly amenable to systems that use reference counting and to systems that require support for pinning of objects (i.e., objects that cannot be moved).
Free lists support immediate and fast reclamation of individual objects, which makes them particularly suitable for reference counting. Other systems, such as evacuation and compaction, must identify and move live objects before they may reclaim any memory. Also, free lists are a good fit to the backup tracing used by many reference counters. Free lists are easy to sweep because they encode free and occupied memory in separate metadata. The sweep identifies and retains live objects and returns memory occupied by dead objects to the free list. Free lists suffer two notable shortcomings. First, they are vulnerable to fragmentation of two kinds. They suffer from internal fragmentation when objects are not perfectly matched to the size of their containing cell, and they suffer external fragmentation when free cells of particular sizes exist, but the allocator requires cells of another size. Second, they suffer from poor locality because they often position contemporaneously allocated objects in spatially disjoint memory, as discussed in Section 2.1.

Mark-Region Mark-region memory managers use a simple bump pointer to allocate objects into regions of contiguous memory [6]. A tracing collection marks each object and marks its containing region. Once all live objects have been traced, it reclaims unmarked regions. This design addresses the locality problem in free-list allocators. A mark-region memory manager can choose whether to move surviving objects or not. By contrast, evacuating and compacting collectors must copy, leading them to have expensive space or time collection overheads compared to mark-sweep collectors. Mark-region collectors are vulnerable to fragmentation because a single live object may keep an entire region alive and unavailable for reuse, and thus they must copy some objects to attain good performance.

Immix: Lines, Blocks, and Opportunistic Copying Immix is a mark-region collector that uses a region hierarchy with two sizes: lines, which target cache line locality, and blocks, which target page level locality [6]. Each block is composed of lines, as shown in Figure 1. The allocator places new objects contiguously into empty lines and skips over occupied lines. Objects may span lines, but not blocks. Immix uses a bit in the header to indicate whether an object straddles lines, for efficient line marking. Immix recycles partially free blocks, allocating into them first.
Immix tackles fragmentation using opportunistic defragmentation, which mixes marking with copying. At the beginning of a collection, Immix identifies fragmentation as follows. Blocks with available memory indicate fragmentation because although available, the memory was not usable by the mutator. Furthermore, the live/free status for these blocks is up-to-date from the prior collection. In this case, Immix performs what we call here a reactive defragmenting collection. To mix marking and copying, Immix uses two bits in the object header to differentiate between marked and forwarded objects. At the beginning of a defragmenting collection, Immix identifies source and target blocks. During the mark trace, when Immix first encounters an object that resides on a source block and there is still available memory for it on a target block, Immix copies the object to a target block, leaving a forwarding pointer. Otherwise Immix simply marks the object as usual. When Immix encounters forwarded objects while tracing, it updates the reference accordingly. This process is opportunistic, since it performs copying until it exhausts memory to defragment the heap. The result is a collector that combines the locality of a copying collector and the collection efficiency of a mark-sweep collector with resilience to fragmentation.
The best performing production collector in Jikes RVM is generational Immix (GenImmix) [6], which consists of a copying young space and an Immix old space.
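The following is a minimal sketch of the opportunistic trace step described above: copy when the object sits on a source block and the copy reserve still has room, otherwise mark in place. Obj, Block, and copyToTargetBlock() are illustrative stand-ins for the real Immix structures.

    // Opportunistic copying during a defragmenting mark trace. Sketch only.
    final class OpportunisticTrace {
      static final class Block { boolean isDefragSource; }
      static final class Obj {
        Block block;
        boolean marked;
        Obj forwardedTo;                     // non-null once the object has moved
      }

      private int copyReserveBytes;          // strictly bounds copying

      // Returns the object the reference should now point to, so callers
      // can update references when they encounter a forwarded object.
      Obj traceObject(Obj o, int size) {
        if (o.forwardedTo != null) return o.forwardedTo;  // already moved
        if (o.marked) return o;                           // already visited
        if (o.block.isDefragSource && copyReserveBytes >= size) {
          copyReserveBytes -= size;
          Obj copy = copyToTargetBlock(o);   // evacuate to a target block
          copy.marked = true;
          o.forwardedTo = copy;              // leave a forwarding pointer
          return copy;
        }
        o.marked = true;                     // out of reserve: leave in place
        return o;
      }

      private Obj copyToTargetBlock(Obj o) {
        Obj copy = new Obj();                // models allocation on a target block
        copy.block = new Block();
        return copy;
      }
    }

Because marking in place is always a legal fallback, the collector can stop copying the instant the reserve is exhausted, which is what makes the scheme resilient rather than dependent on a worst-case copy reserve.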

3. Design of RC Immix
This section presents the design of RC Immix, which combines the RC and Immix collectors described in the previous section. This combination requires solving two problems. 1) We need to adapt the Immix line/block reclamation strategy to a reference counting context. 2) We need to share the limited number of bits in the object header to satisfy the demands of both Immix and reference counting.
In addition, RC Immix seizes two opportunities for defragmentation using proactive and reactive opportunistic copying. When identifying new objects for the first time, it opportunistically copies them, proactively defragmenting. When it, on occasion, performs cyclic garbage collection, RC Immix performs reactive defragmentation.
Similar to RC, RC Immix has frequent reference counting phases and occasional backup cycle tracing phases. This structure divides execution into discrete mutation, reference counting collection, and cycle collection phases.

3.1 RC and the Immix Heap
Until now, reference counting algorithms have always used free-list allocators. When the reference count for an object falls to zero, the reference counter frees the space occupied by the object, placing it on a free list for subsequent reuse by an allocator. Immix is a mark-region collector, which reclaims memory regions when they are completely free, rather than reclaiming memory on a per-object basis. Since Immix uses a line and block hierarchy, it reclaims free lines and, if all the lines in a block are free, it reclaims the free block. Lines and blocks cannot be reclaimed until all objects within them are dead.

RC Immix Line and Block Reclamation RC Immix detects free lines by tracking the number of live objects on a line. RC Immix replaces Immix's line mark with a per-line live object count, which counts the number of live objects on the line. (It does not count incoming references to the line.)
As mentioned in Section 2.2, each object is born dead in RC, with a zero reference count, to elide all reference counting work for short lived objects. In RC Immix, each line is also born dead with a zero live object count, to similarly elide all line counting work when a newly allocated line only contains short lived objects. RC only increments an object's reference count when it encounters it during the first GC after the object is born, either directly from a root or due to an increment from a live mutated object. We propagate this laziness to per-line live object counts in RC Immix.
A newly allocated line will contain only newly born objects. During a reference counting collection, before RC Immix increments an object's reference count, it first checks the new bit. If the object is new, RC Immix clears the new object bit, indicating the object is now old. It then increments the object reference count and the live object count for the line. When all new objects on a line die before the collection, RC Immix will never encounter a reference to an object on the line, will never increment the live object count, and will trivially collect the line at the end of the first GC cycle. Because Immix's line marks are bytes (stored in the metadata for the block) and the number of objects on a line is limited by the 256 byte line size, live object counts do not incur any space penalty in RC Immix compared to the original Immix algorithm.

[Figure 2. How RC, Immix, and the different phases of RC Immix use the eight header bits. Panels: (a) RC: four reference count bits, mark, new, and two logged bits; (b) Immix: straddle, mark, and two forwarding bits; (c) RC Immix during mutation and reference counting: three reference count bits, straddle, mark, new, and two logged bits; (d) RC Immix during tracing: the logged bits are reused as forwarding bits. Legend: RC: Reference Count; N: New object; S: Straddles lines; M: Marked; LG: Logged; F: Forwarded.]

Limited Bit Count In Jikes RVM, one byte (eight bits) is available in the object header for use by the garbage collector. RC uses all eight bits. It uses two bits to log mutated objects for the purposes of coalescing increments and decrements, one bit for the mark state for backup cycle tracing, one bit for identifying new objects, and the remaining four bits to store the reference count. Figure 2(a) illustrates how RC fully uses all its eight header bits. Table 3 shows that four bits for the reference count is sufficient to correctly count references to more than 99.8% of objects.
To integrate RC and Immix, we need some header bits in objects for Immix-specific functionality as well. The base Immix implementation requires four header bits, fewer header bits than RC, but three bits store different information than RC. Both Immix and RC share the requirement for one mark bit during a tracing collection. Immix however requires one bit to identify objects that span multiple lines and two bits when it forwards objects during defragmentation. (Copying collectors, including Immix and RC Immix, first copy the object and then store a forwarding pointer in the original object's header.) Figure 2(b) shows the Immix header bits.
Immix and RC Immix both require a bit to identify objects that may span lines, to ensure that all affected lines are kept live. Immix and RC Immix both use an optimization called conservative marking, which means this bit is only set for objects that are larger than one line, which empirically is relatively uncommon [6]. Immix stores its line marks in per-block metadata and RC Immix does the same.
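A minimal sketch of the per-line live object counts and lazy new-object processing described in Section 3.1 above follows. The line size is the paper's 256 bytes; the flat lineLiveCount array and address arithmetic are illustrative stand-ins for RC Immix's per-block metadata.

    // Per-line live object counts driven by the reference counter. Sketch only.
    final class LineCountedRC {
      static final int LINE_BYTES = 256;
      static final int LINES = 1 << 16;
      private final byte[] lineLiveCount = new byte[LINES]; // one byte per line, born zero

      static final class Obj { boolean isNew; int rc; int address; }

      private int lineOf(Obj o) { return o.address / LINE_BYTES; }

      // The first increment to a new object makes both the object and its
      // line live; lines whose objects all die first are never touched.
      void processIncrement(Obj o) {
        if (o.isNew) {
          o.isNew = false;                 // object is now old
          lineLiveCount[lineOf(o)]++;      // line gains a live object
          // (a straddling object would also count on its subsequent lines)
        }
        o.rc++;
      }

      // Death of an old object decrements its line; a zero count frees the line.
      void objectDied(Obj o) {
        if (--lineLiveCount[lineOf(o)] == 0)
          reclaimLine(lineOf(o));          // line is free for fresh allocation
      }

      private void reclaimLine(int line) { /* mark line free in block metadata */ }
    }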

[Table 3 (per-benchmark data omitted). Percentage of objects that overflow for a given number of reference counting bits, from one through five, for each benchmark. RC Immix and RC use three and four bits, respectively. Data from Shahriyar et al. [27]. On average, 0.65% of objects overflow with three bits.]

Immix and RC Immix both need to forward objects during defragmentation. Forwarding uses two bits during a collection to record the forwarding state (not forwarded, being forwarded, forwarded).
At first cut, it seems that there are not enough bits, since adding Immix functionality to RC requires three bits and would thus reduce the bits for the reference count to just one. However, we observe that RC Immix only needs the logged bits for an object to coalesce increments and decrements during reference counting, and it only needs forwarding bits when tracing new objects and during backup cycle collection. These activities are mutually exclusive in time, so they are complementary requirements.
We therefore put the two bits to use as follows. 1) During mutation RC Immix follows RC, using the logged bits to mark modified objects that it has remembered for coalescing. 2) During a reference counting collection, RC Immix follows RC. For old objects, RC Immix performs increments and decrements as specified by coalescing and then clears the two bits. 3) For new objects and during cycle collection, RC Immix follows Immix. It sets the now cleared bits to indicate that it has forwarded an object and, at the end of the collection, reclaims the memory. RC Immix thus overloads the two bits for coalescing and forwarding. Figure 2(c) shows how RC Immix uses the header bits during mutation and reference counting. Figure 2(d) shows how RC Immix repurposes the logged bits for forwarding during a collection. All the other bits remain the same in both phases.
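One plausible encoding of the eight GC header bits is sketched below; the exact bit positions used by Jikes RVM are an assumption here, not the documented layout. The key idea is the time multiplexing just described: the same two bits encode the logged state during mutation and reference counting, and the three-valued forwarding state for new objects and during backup cycle collection.

    // Illustrative header-bit layout; positions are assumptions.
    final class HeaderBits {
      static final int RC_MASK      = 0b0000_0111; // bits 0-2: three-bit sticky count
      static final int STRADDLE_BIT = 0b0000_1000; // object spans multiple lines
      static final int MARK_BIT     = 0b0001_0000; // mark for backup cycle tracing
      static final int NEW_BIT      = 0b0010_0000; // object not yet seen by collector
      static final int DUAL_MASK    = 0b1100_0000; // logged state OR forwarding state

      // During mutation: interpret the dual-use bits as coalescing log state.
      static boolean isLogged(int header) { return (header & DUAL_MASK) != 0; }

      // During collection: the same bits, cleared first, record forwarding
      // (not forwarded, being forwarded, forwarded).
      static final int BEING_FORWARDED = 0b0100_0000;
      static final int FORWARDED       = 0b1000_0000;
      static boolean isForwarded(int header) { return (header & DUAL_MASK) == FORWARDED; }
    }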
Consequently, we reduce the number of reference counting bits to three. Three bits will lead to overflow in just 0.65% of objects on average, as shown in Table 3. When a reference count is about to overflow, it remains stuck until a cycle collection occurs, at which time it is reset to the correct value or left stuck if the correct count is higher.
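A minimal sketch of three-bit sticky counting as described above, reusing the illustrative mask from the previous sketch: once a count saturates it stays stuck, and the backup cycle trace recomputes the true count and unsticks it when that count fits.

    // Three-bit sticky reference counting. Sketch only.
    final class StickyCount {
      static final int RC_MASK = 0b0000_0111;
      static final int RC_MAX  = 7;

      static int increment(int header) {
        int rc = header & RC_MASK;
        if (rc == RC_MAX) return header;        // stuck: leave other bits intact
        return (header & ~RC_MASK) | (rc + 1);
      }

      // Callers only decrement live objects (rc >= 1); stuck counts ignore
      // decrements, which is what makes them a bounded garbage leak.
      static int decrement(int header) {
        int rc = header & RC_MASK;
        if (rc == RC_MAX) return header;
        return (header & ~RC_MASK) | (rc - 1);  // caller reclaims when rc hits 0
      }

      // During cycle collection the trace recomputes the true count and
      // writes it back, bounding the impact of stuck counts.
      static int resetFromTrace(int header, int trueCount) {
        return (header & ~RC_MASK) | Math.min(trueCount, RC_MAX);
      }
    }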
Several optimizations and languages such as C# require pinning. Pinned objects are usually identified by a bit in the header. The simplest way to add pinning is to steal another bit from the reference count, reducing it to two bits. A slightly more complex design adds pinning to the logged and forwarded bits, since each of logged and forwarding only requires three states. When we evaluated stealing a reference count bit for pinning, it worked well (see Section 5.3), so we did not explore the more complex implementation. Our default RC Immix configuration does not use pinning.

3.2 Cycle Collection and Defragmentation
Cycle Collection Reference counting suffers from the problem that cycles of objects will sustain non-zero reference counts and therefore cannot be collected. The same problem affects RC Immix, since line counts follow object liveness. RC Immix relies on a backup tracing cycle collector to correct incorrect line counts and stuck object counts. It uses a mark bit for each object and each line. It takes one bit from the line count for the mark bit and uses the remaining bits for the line count. The cycle collector starts by setting all the line marks and counts to zero. During cycle collection, the collector marks each live object, marks its corresponding line, and increments the live object count for the line when it first encounters the object. At the end of marking, the cycle collector reclaims all unmarked lines.
Whenever any reference counting implementation finds that an object is dead, it decrements the reference counts of all the children of the dead object, which may recursively result in more dead objects. This rule applies to reference counting in RC and RC Immix. RC and RC Immix's cycle collection is tasked with explicitly resetting all reference counts. In addition, RC Immix restores correct line counts. This feature eliminates the need to sweep dead objects altogether and RC Immix instead sweeps dead lines.
RC Immix performs cycle collection on occasion. How often to perform cycle collection is an empirical question that trades off responsiveness with immediacy of cycle reclamation, which we explore below.

Defragmentation with Opportunistic Copying Reference counting is a local operation, meaning that the collector is only aware of the number of references to an object, not their origin. Therefore it is generally not possible to move objects during reference counting. However, RC Immix seizes upon two important opportunities to copy objects and thus mitigate fragmentation.

First, we observe that when an object is subject to its first reference counting collection, all references to that object will be traversed, giving us a unique opportunity to move the object during a reference counting collection. Because each object is unreferenced at birth, at its first GC, the set of all increments to a new object must be the set of all references to that object. Second, we exploit the fact that cycle collection involves a global trace, and thus presents another opportunity to copy objects. In both cases, we use opportunistic copying. Opportunistic copying mixes copying with in-place reference counting and marking such that it can stop copying when it exhausts the available space.

Proactive Defragmentation RC Immix's proactive defragmentation copies as many surviving new objects as possible given a particular copy reserve. During the mutator phase, the allocator dynamically sets aside a portion of memory as a copy reserve, which strictly bounds the amount of copying that may occur in the next collection phase. In a classic semi-space copying collector, the copy reserve must be large enough to accommodate all surviving objects because it is dictated by the worst case survival scenario. Therefore, every new block of allocation requires a block for the copy reserve.
Because RC Immix is a mark-region collector, which can reuse partially occupied blocks, copying is optional. Copying is an optimization rather than a requirement for correctness. Consequently, we size the copy reserve according to performance criteria.
Choosing the copy reserve size reflects a tradeoff. A large copy reserve eats into memory otherwise available for allocation and invites a large amount of copying. Although copying mitigates fragmentation, copying is considerably more expensive than marking and should be used judiciously. On the other hand, if the copy reserve is too small, it may not compact objects that will induce fragmentation later.
Our heuristic seeks to mimic the behavior of a generational collector, while making the copy reserve as small as possible. Ideally, an oracle would tell us the survival rate of the next collection (e.g., 10%) and the collector would size the copy reserve accordingly. We seek to emulate this policy by using past survival rate to predict the future. Computing fine-grain byte or object survival in production requires looking up every object's size, which is too expensive. Instead, we use line survival rate as an estimate of byte survival rate. We compute line survival rates of partially full blocks when we scan the line marks in a block to recycle its lines. This computation adds no measurable overhead.
Table 4 shows the average byte, object, line, and block percentage survival rates. Block survival rates significantly over predict actual byte survival rates. Line survival rates over predict as well, but much less. The difference between line and block survival rate is an indication of fragmentation. The larger the difference between the two, the more live objects are spread out over the blocks and the less likely a fresh allocation of a multi-line object will fit in the holes (contiguous free lines).

                Immix Alloc   Min Heap          Immix Survival
  Benchmark              MB         MB   Byte %   Object %   Line %   Block %
  compress              0.3         21        6          5        7        11
  jess                  262         20        1          1        7        53
  db                     53         19        8          6        8        10
  javac                 174         30       17         19       32        66
  mpegaudio             0.2         13       41         37       44       100
  mtrt                   97         18        3          3        6        11
  jack                  248         19        3          2        6        32
  avrora                 53         30        1          4        8         9
  bloat                1091         40        1          1        5        32
  chart                 628         50        4          5       17        67
  eclipse              2237         84        6          6        7        36
  fop                    47         35       14         13       29        69
  hsqldb                112        115       23         23       26        56
  jython               1349         90        0          0        0         0
  luindex                 9         30        8         11       11        15
  lusearch             1009         30        3          2        4        22
  lusearch-fix          997         30        1          1        2         8
  pmd                   364         55        9         11       14        26
  sunflow              1820         30        1          2        5        99
  xalan                 507         40       12          5       24        51
  pjbb2005             1955        355       11         12       24        87

Table 4. Benchmark characteristics. Bytes allocated into the Immix heap and minimum heap, in MB. The average survival rate as a percentage of bytes, objects, lines, and blocks, measured in an instrumentation run at 1.5× the minimum heap size. Block survival rate is too coarse to predict byte survival rates. Line survival rate is fairly accurate and adds no measurable overhead.

We experimented with a number of heuristics and chose two effective ones. We call our default copy reserve heuristic MAX. MAX simply takes the maximum survival rate of the last N collections (4 in our experiments). Also good, but more complex, is a heuristic we call EXP. EXP computes a moving window of survival rates in buckets of N bytes of allocation (32 MB in our experiments) and then weights each bucket by an exponential decay function (1 for the current bucket, 1/2 for the next oldest, 1/4, and so on). Table 5 shows that the simple MAX heuristic performs well. We believe better heuristics are possible.

                        Heap Size
  Heuristic      1.2×      1.5×        2×
  MAX           1.031     0.984     0.976
  EXP           1.036     0.990     0.982

Table 5. Two proactive copying heuristics and their performance at 1.2, 1.5, and 2 times the minimum heap size, averaged over all benchmarks. Time is normalized relative to GenImmix. Lower is better.
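The sketch below implements the two copy reserve heuristics just described. The window length (4 collections for MAX) and bucket size (32 MB for EXP) are the paper's stated parameters; the method signatures and the survival-rate plumbing are illustrative assumptions.

    // Copy reserve sizing heuristics: MAX and EXP. Sketch only.
    final class CopyReserveHeuristics {
      // MAX: reserve for the worst line survival rate seen recently.
      static double maxHeuristic(double[] lastSurvivalRates) { // most recent 4
        double max = 0.0;
        for (double s : lastSurvivalRates) max = Math.max(max, s);
        return max;                       // fraction of allocation to set aside
      }

      // EXP: exponentially decayed average over 32 MB allocation buckets,
      // weighting the current bucket by 1, the next oldest by 1/2, then 1/4...
      static double expHeuristic(double[] bucketSurvivalRates) { // [0] = newest
        double weighted = 0.0, totalWeight = 0.0, w = 1.0;
        for (double s : bucketSurvivalRates) {
          weighted += w * s;
          totalWeight += w;
          w *= 0.5;
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
      }

      // The reserve bounds copying: a predicted 10% survival rate sets
      // aside roughly one block of reserve per ten blocks of allocation.
      static int reserveBlocks(double predictedSurvival, int allocatedBlocks) {
        return (int) Math.ceil(predictedSurvival * allocatedBlocks);
      }
    }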

Reactive Defragmentation RC Immix also performs reactive defragmentation, during cycle collection. At the start of each cycle collection, the collector determines whether to defragment based on fragmentation levels, any available free blocks, and any available partially filled blocks containing free lines, using statistics it gathers in the previous collection. RC Immix uses these statistics to select defragmentation sources and targets. If an object is unmovable when the collector first encounters it, the collector marks the object and line live, increments the object and line counts, and leaves the object in place. When the collector first encounters a movable live object on a source block, and there is still sufficient space for it on a target block, it opportunistically evacuates the object, copying it to the target block, and leaves a forwarding pointer that records the address of the new location. If the collector encounters subsequent references to a forwarded object, it replaces them with the value of the object's forwarding pointer.
A key empirical question for cycle detection and defragmentation is how often to perform them. If we perform them too often, the system loses its incrementality and pays both reference counting and tracing overheads. If we perform them too infrequently, it takes a long time to reclaim objects kept alive by dead cycles and the heap may suffer a lot of fragmentation. Both waste memory. This threshold is necessarily a heuristic. We explore thresholds as a function of heap size.
We use the following principle for our heuristic. If, at the end of a collection, the amount of free memory available for allocation falls below a given threshold, then we mark the next collection for cycle collection. We can always include defragmentation with cycle detection, or we can perform it less frequently. Triggering cycle collection and defragmentation more often enables applications to execute in smaller minimum heap sizes, but will degrade performance. Depending on the scenario, this choice might be desirable. We focus on performance and use a free memory threshold which is a fraction of the total heap size. We experiment with a variety of thresholds to pick the best values for both and show the results for three heap sizes in Table 6. (See Section 4 for our methodology.) Based on the results in Table 6, we use 1% for both.

     Threshold               Heap Size
  Cycle   Defrag      1.2×      1.5×        2×
    1%       1%      1.030     0.983     0.975
    5%       5%      1.041     0.983     0.976
   10%      10%      1.096     0.993     0.980

Table 6. Sensitivity to frequency of cycle detection and reactive defragmentation at 1.2, 1.5, and 2 times the minimum heap size, averaged over all benchmarks. Time is normalized relative to GenImmix. Lower is better.
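A minimal sketch of the trigger policy described above follows: when the free memory remaining after a collection falls below a threshold fraction of the heap (1% by default, per Table 6), the next collection is promoted to a cycle collection, optionally with defragmentation. Field and method names are illustrative.

    // Free-memory-threshold trigger for cycle collection and defragmentation.
    final class CollectionTrigger {
      static final double CYCLE_THRESHOLD  = 0.01; // 1% of total heap
      static final double DEFRAG_THRESHOLD = 0.01;

      private boolean nextIsCycleCollection;
      private boolean nextDefragments;

      // Called at the end of every reference counting collection.
      void endOfCollection(long freeBytes, long totalHeapBytes) {
        double free = (double) freeBytes / totalHeapBytes;
        nextIsCycleCollection = free < CYCLE_THRESHOLD;
        // Defragmentation piggybacks on the cycle (tracing) collection.
        nextDefragments = nextIsCycleCollection && free < DEFRAG_THRESHOLD;
      }

      boolean shouldTraceForCycles() { return nextIsCycleCollection; }
      boolean shouldDefragment()     { return nextDefragments; }
    }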
3.3 Optimized Root Scanning
The existing implementation of the RC algorithm treats Jikes RVM's boot image as part of the root set [27], enumerating each reference in the boot image at each collection. We identified this as a significant bottleneck in small heaps and instead treat the boot image as a non-collected part of the heap, rather than part of the root set. This very simple change delivers a significant performance boost to RC in modest heaps and is critical to RC Immix's performance in small heaps (Figure 4(a)).

4. Methodology
This section presents the software, hardware, and measurement methodologies that we use to evaluate RC Immix.

Benchmarks. We draw 21 benchmarks from DaCapo [10], SPECjvm98 [28], and pjbb2005 [9]. The pjbb2005 benchmark is a fixed workload version of SPECjbb2005 [29] with 8 warehouses that executes 10,000 transactions per warehouse. We do not use SPECjvm2008 because that suite does not hold workload constant, so it is unsuitable for GC evaluations unless modified. Since a few DaCapo 9.12 benchmarks do not execute on our virtual machine, we use benchmarks from both the 2006-10-MR2 and 9.12 Bach releases of DaCapo to enlarge our suite.
We omit two outliers, mpegaudio and lusearch, from our figures and averages, but include them grayed-out in tables, for completeness. The mpegaudio benchmark is a very small benchmark that performs almost zero allocation. The lusearch benchmark allocates at three times the rate of any other. The lusearch benchmark derives from the 2.4.1 stable release of Apache Lucene. Yang et al. [33] found a performance bug in the method QueryParser.getFieldQuery(), which revision r803664 of Lucene fixes [26]. The heavily executed getFieldQuery() method unconditionally allocated a large data structure. The fixed version only allocates a large data structure if it is unable to reuse an existing one. This fix cuts total allocation by a factor of eight, speeds the benchmark up considerably, and reduces the allocation rate by over a factor of three. We patched the DaCapo lusearch benchmark with just this fix and we call the fixed benchmark lusearch-fix. The presence of this anomaly for over a year in public releases of a widely used package suggests that the behavior of lusearch is of some interest. Compared with GenImmix, RC Immix improves the performance of lusearch by 34% on the i7-2600, but we use lusearch-fix in our results.

Jikes RVM & MMTk. We use Jikes RVM and MMTk for all of our experiments. Jikes RVM [1] is an open source high performance Java virtual machine (VM) written in a slightly extended version of Java. We use Jikes RVM release 3.1.2+hg r10475 to build RC Immix and compare it with different GCs. MMTk is Jikes RVM's memory management sub-system. It is a programmable memory management toolkit that implements a wide variety of collectors that reuse shared components [8].

During collection, the collectors exploit available software and hardware parallelism [12]. To compare collectors, we vary the heap size to understand how well collectors respond to the time-space tradeoff. In our experiments, no collector consistently ran in smaller heaps than the other collectors. Therefore we selected for our minimum heap size the smallest heap size in which all of the collectors execute, and thus have complete results at all heap sizes for all collectors.

Jikes RVM does not have a bytecode interpreter. Instead, a fast template-driven baseline compiler produces machine code when the VM first encounters each Java method. The adaptive compilation system then judiciously optimizes the most frequently executed methods. Using a timer-based approach, it schedules periodic interrupts. At each interrupt, the adaptive system records the currently executing method. Using a cost model, it then selects frequently executing methods it predicts will benefit from optimization. The optimizing compiler compiles these methods at increasing levels of optimization.

To reduce perturbation due to dynamic optimization and to maximize the performance of the underlying system that we improve, we use a warmup replay methodology. Before executing any experiments, we gathered compiler optimization profiles from the 10th iteration of each benchmark. When we perform an experiment, we execute one complete iteration of each benchmark without any compiler optimizations, which loads all the classes and resolves methods. We next apply the benchmark-specific optimization profile and perform no subsequent compilation. We then measure and report the subsequent iteration. This methodology greatly reduces non-determinism due to the adaptive optimizing compiler and improves underlying performance by about 5% compared to the prior replay methodology [11]. We run each benchmark 20 times (20 invocations) and report the average. We also report 95% confidence intervals for the average using Student's t-distribution.
Operating System. We use the Ubuntu 10.04.01 LTS server distribution and a 64-bit (x86_64) 2.6.32-24 Linux kernel.

Hardware Platform. We report performance, performance counter, and detailed results on a 32nm Core i7-2600 Sandy Bridge with 4 cores and 2-way SMT running at 3.4GHz. The two hardware threads on each core share a 32KB L1 instruction cache, 32KB L1 data cache, and 256KB L2 cache. All four cores share a single 8MB last level cache. A dual-channel memory controller is integrated into the CPU. The system has 4GB of DDR3-1066 memory installed.

5. Results
We first compare RC Immix with other collectors at a moderate 2× heap size, then consider sensitivity to available memory, and perform additional in-depth analysis.

5.1 RC Immix Performance Overview
Table 7 and Figure 3 compare total time, mutator time, and garbage collection time of RC Immix and RC Immix without proactive copying ('no PC') against a number of collectors. The figure illustrates the data and the table includes raw performance as well as relative measurements of the same data. This analysis uses a moderate heap size of 2× the minimum in which all collectors can execute each benchmark. Production systems often use this heap size because it strikes a balance in the space-time tradeoff exposed by garbage collected languages between memory consumption and garbage collection overheads. We explore the space-time tradeoff in more detail in Section 5.2. In Figures 3(c) and 3(d), results are missing for some configurations on some benchmarks. In each of these cases, either the numerator or denominator or both performed no GC (see Table 7).

The table and figure compare six collectors.
1. GenImmix, which uses a copying nursery and an Immix mature space.
2. Sticky Immix, which uses Immix with an in-place generational adaptation [6, 15].
3. Full heap Immix.
4. RC from Shahriyar et al. [27].
5. RC Immix (no PC), which excludes proactive copying and performs well in moderate to large heaps due to very low collection times.
6. RC Immix as described in the previous section, which performs well at all heap sizes.

We normalize to GenImmix since it is the best performing collector in the literature [6] across all heap sizes and consequently is the default production collector in Jikes RVM. All of the collectors, except RC, defragment when there is an opportunity, i.e., when there are partially filled blocks without fresh allocation and fragmentation is high, as described in Section 3.2.

These results show that RC Immix outperforms the best performing garbage collector at this moderate heap size and completely eliminates the reference counting performance gap. The time_gc columns show that, not surprisingly, Immix, the only full heap collector that does not exploit any generational behaviors, has the worst collector performance, degrading by on average 34%. Since garbage collection time is a relatively smaller influence on total time in a moderate heap, all but RC perform similarly on total time. At this heap size RC Immix performs the same as RC Immix (no PC), but its worst-case degradation is just 5% while its best-case improvement is 22%. By comparison, RC Immix (no PC) has a worst-case degradation of 12% and best-case improvement of 24%. Table 7 and Figure 3(c) show that RC Immix (no PC) has the best garbage collection time, outperforming GenImmix by 48%. As we show later, RC Immix has an advantage over RC Immix (no PC) when memory is tight and fragmentation is a bigger issue.
[Figure 3: four bar charts over the benchmark suite (compress, jess, db, javac, mtrt, jack, avrora, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch-fix, pmd, sunflow, xalan, pjbb2005, mean, geomean) for GenImmix, StickyImmix, Immix, RC, RC Immix (no PC), and RC Immix: (a) total slowdown compared to GenImmix, (b) mutator slowdown compared to GenImmix, (c) GC slowdown compared to GenImmix, and (d) percentage of total execution time spent in GC.]

Figure 3. RC Immix performs 3% better than GenImmix, the highest performance generational collector, at a moderate heap size of 2 times the minimum. The first three graphs compare total, mutator, and GC slowdowns relative to GenImmix; lower is better. The fourth graph indicates the GC load seen by each configuration. RC Immix eliminates all the mutator time overheads of RC. Error bars are not shown, but 95% confidence intervals are given in Table 7.
Benchmark   GenImmix (milliseconds)   StickyImmix · Immix · RC · RC Immix (no PC) · RC Immix (normalized to GenImmix)
            time  time_mu  time_gc    each group: time  time_mu  time_gc

compress 2256 2237 20 1.00 1.00 1.37 0.99 0.99 1.09 1.00 1.00 0.76 0.97 0.98 0.25 0.97 0.97 0.28
±0.2 ±0.2 ±4.6 ±0.2 ±0.2 ±5.9 ±0.2 ±0.2 ±4.1 ±0.1 ±0.2 ±3.2 ±0.2 ±0.2 ±2.9 ±0.2 ±0.2 ±1.6

jess 485 453 32 0.98 0.99 0.77 1.09 1.00 2.42 1.33 1.28 2.08 1.02 1.03 0.85 1.01 1.01 0.98
±0.7 ±0.7 ±4.3 ±0.6 ±0.6 ±3.2 ±0.8 ±0.8 ±8.0 ±1.1 ±1.2 ±7.4 ±0.9 ±1.0 ±6.7 ±0.7 ±0.6 ±6.8

db 1491 1460 31 1.06 1.01 3.29 0.96 0.96 0.92 1.09 1.10 0.68 0.96 0.97 0.51 0.97 0.97 0.85
±0.4 ±0.4 ±7.1 ±0.5 ±0.5 ±17.8 ±1.0 ±1.0 ±5.6 ±0.8 ±0.8 ±4.0 ±0.9 ±0.8 ±7.0 ±0.7 ±0.8 ±8.0

javac 1048 911 137 1.02 1.01 1.10 0.86 0.95 0.25 0.97 1.08 0.20 0.89 1.01 0.08 1.05 1.03 1.17
±0.7 ±0.4 ±4.5 ±0.6 ±0.3 ±4.6 ±0.4 ±0.3 ±0.9 ±0.5 ±0.3 ±0.7 ±0.5 ±0.3 ±1.5 ±2.3 ±0.5 ±15.2

mpegaudio 1406 1406 0 1.01 1.01 0.00 1.01 1.01 0.00 1.00 1.00 0.00 0.97 0.97 0.00 0.97 0.97 0.00
±0.1 ±0.1 ±0.0 ±0.2 ±0.2 ±0.0 ±0.1 ±0.1 ±0.0 ±0.1 ±0.1 ±0.0 ±0.1 ±0.1 ±0.0 ±0.1 ±0.1 ±0.0

mtrt 340 302 38 1.00 1.01 0.92 1.06 0.98 1.72 1.06 1.07 1.00 0.96 1.01 0.58 0.98 0.99 0.89
±3.5 ±3.8 ±2.8 ±3.7 ±4.2 ±7.1 ±2.6 ±2.7 ±5.7 ±2.6 ±2.9 ±3.9 ±3.6 ±4.2 ±3.0 ±3.3 ±3.4 ±7.0

jack 715 665 50 0.94 0.97 0.57 1.00 0.97 1.40 1.18 1.13 1.75 0.97 0.98 0.72 0.97 0.99 0.74
±0.7 ±0.7 ±7.3 ±0.6 ±0.6 ±3.4 ±0.8 ±0.7 ±7.4 ±0.8 ±0.7 ±9.2 ±0.8 ±0.7 ±6.6 ±0.7 ±0.7 ±4.6

mean 1056 1005 51


±0.9 ±0.9 ±4.4
geomean 1.00 1.00 1.12 0.99 0.97 1.06 1.10 1.11 0.85 0.96 1.00 0.39 0.99 0.99 0.75

avrora 3154 3134 20 0.99 1.00 0.21 0.98 0.98 0.58 0.97 0.98 0.35 0.98 0.98 0.12 0.98 0.99 0.46
±1.2 ±1.2 ±9.6 ±1.3 ±1.3 ±1.7 ±1.1 ±1.1 ±5.1 ±1.3 ±1.3 ±2.6 ±1.2 ±1.2 ±1.2 ±1.1 ±1.1 ±16.4

bloat 3164 3018 145 1.04 1.03 1.09 1.07 0.99 2.71 1.20 1.19 1.51 1.02 1.02 0.90 0.99 1.00 0.68
±0.4 ±0.5 ±1.7 ±0.5 ±0.6 ±2.1 ±0.8 ±0.8 ±6.4 ±0.5 ±0.6 ±2.6 ±0.8 ±0.7 ±4.0 ±0.5 ±0.5 ±3.1

chart 3750 3473 276 1.02 1.02 1.09 0.98 1.01 0.60 1.08 1.13 0.48 0.99 1.04 0.35 0.99 1.03 0.52
±0.2 ±0.1 ±1.6 ±0.2 ±0.2 ±1.4 ±0.5 ±0.5 ±0.9 ±0.5 ±0.5 ±0.9 ±0.8 ±0.8 ±1.0 ±0.5 ±0.7 ±3.2

eclipse 16203 15382 821 1.07 1.04 1.51 1.06 1.01 2.06 1.12 1.13 0.99 0.99 1.02 0.57 1.03 1.04 0.79
±4.0 ±4.2 ±1.1 ±5.7 ±5.9 ±1.5 ±13.1 ±8.5 ±170.2 ±5.7 ±6.1 ±1.0 ±4.9 ±5.2 ±0.7 ±5.2 ±5.5 ±4.5

fop 868 848 20 1.05 1.04 1.11 0.99 0.99 0.98 1.02 1.02 0.92 0.97 0.98 0.59 0.99 0.99 1.16
±0.8 ±0.8 ±2.0 ±0.9 ±0.9 ±1.9 ±0.9 ±0.9 ±4.1 ±0.9 ±0.9 ±1.8 ±0.9 ±0.9 ±1.2 ±1.0 ±1.0 ±12.4

hsqldb 970 783 188 1.13 0.98 1.72 1.41 0.96 3.25 1.11 1.16 0.88 0.92 0.98 0.66 1.03 0.98 1.26
±0.8 ±0.1 ±4.3 ±1.9 ±0.2 ±11.2 ±2.4 ±2.8 ±10.0 ±0.7 ±0.2 ±2.7 ±0.6 ±0.5 ±2.3 ±0.6 ±0.1 ±3.9

jython 3581 3493 88 1.03 1.01 1.66 1.02 0.95 3.71 1.15 1.12 2.36 0.99 1.00 0.58 0.98 0.99 0.61
±0.5 ±0.5 ±1.8 ±0.5 ±0.5 ±2.8 ±0.4 ±0.4 ±9.1 ±0.5 ±0.5 ±3.4 ±0.4 ±0.4 ±4.2 ±0.6 ±0.6 ±1.5

luindex 626 620 7 1.02 1.01 1.50 0.99 1.00 0.00 1.02 1.02 1.10 1.02 1.03 0.50 1.04 1.04 0.73
±0.3 ±0.3 ±4.4 ±0.3 ±0.3 ±6.8 ±0.3 ±0.3 ±0.0 ±0.3 ±0.3 ±4.7 ±0.3 ±0.3 ±2.9 ±0.4 ±0.4 ±5.1

lusearch 3154 2147 1007 1.01 0.77 1.52 0.91 0.72 1.30 1.12 0.89 1.61 0.66 0.75 0.46 0.66 0.75 0.46
±0.3 ±0.3 ±0.9 ±0.7 ±0.5 ±2.4 ±0.4 ±0.5 ±1.1 ±0.7 ±0.6 ±1.4 ±0.4 ±0.6 ±0.5 ±0.5 ±0.8 ±0.5

lusearchfix 887 767 120 0.92 0.96 0.66 1.03 0.89 1.90 1.23 1.11 2.05 0.92 0.94 0.75 0.89 0.92 0.68
±3.2 ±3.7 ±2.9 ±2.8 ±3.2 ±1.7 ±3.0 ±3.1 ±4.2 ±4.1 ±4.5 ±4.3 ±2.7 ±3.2 ±1.7 ±2.2 ±2.6 ±1.7

pmd 934 790 144 0.96 0.98 0.82 1.00 0.98 1.07 1.09 1.15 0.78 0.98 1.03 0.73 0.94 0.98 0.72
±1.2 ±1.2 ±4.5 ±1.0 ±1.2 ±4.5 ±1.4 ±1.0 ±7.8 ±1.5 ±1.2 ±6.0 ±1.2 ±1.2 ±5.3 ±1.1 ±1.1 ±3.5

sunflow 2482 2175 307 0.95 0.98 0.72 1.11 0.95 2.23 1.18 1.06 2.03 1.12 0.98 2.12 0.95 0.98 0.73
±1.1 ±1.3 ±1.5 ±1.1 ±1.2 ±1.4 ±1.1 ±1.1 ±3.8 ±1.1 ±1.1 ±2.9 ±1.3 ±1.1 ±5.5 ±1.0 ±1.2 ±2.5

xalan 1393 1008 385 1.09 0.90 1.58 0.93 0.91 0.97 0.98 0.98 0.99 0.76 0.92 0.34 0.78 0.90 0.45
±8.5 ±11.6 ±1.1 ±6.7 ±7.5 ±3.6 ±11.1 ±10.5 ±22.9 ±6.0 ±8.1 ±1.9 ±5.8 ±9.1 ±0.6 ±4.8 ±7.7 ±1.2

mean 3168 2957 210


±1.8 ±2.1 ±2.9
geomean 1.02 1.00 1.01 1.04 0.97 0.00 1.09 1.08 1.05 0.97 0.99 0.56 0.96 0.99 0.70

pjbb2005 3775 3363 412 1.07 1.00 1.61 1.11 1.07 1.37 1.07 1.11 0.80 1.05 1.01 1.40 0.97 1.00 0.74
±1.0 ±1.1 ±2.1 ±1.0 ±1.1 ±3.7 ±20.4 ±23.7 ±8.1 ±1.1 ±1.2 ±4.7 ±1.2 ±1.3 ±4.5 ±1.3 ±1.3 ±6.5

min 340 302 7 0.92 0.90 0.21 0.86 0.89 0.00 0.97 0.98 0.20 0.76 0.92 0.08 0.78 0.90 0.28
max 16203 15382 821 1.13 1.04 3.29 1.41 1.07 3.71 1.33 1.28 2.36 1.12 1.04 2.12 1.05 1.04 1.26

mean 2533 2362 171


±1.4 ±1.6 ±3.4
geomean 1.02 1.00 1.07 1.03 0.98 1.34 1.09 1.09 0.97 0.97 1.00 0.52 0.97 0.99 0.72

Table 7. RC Immix performs 3% better than GenImmix at a moderate heap size of 2× the minimum. We show at left total, mutator, and GC time for GenImmix in milliseconds, and the performance of Sticky Immix, Immix, RC, RC Immix (no PC), and RC Immix normalized to GenImmix. Lower is better. We grey out and exclude from aggregates lusearch and mpegaudio because of their pathological behaviors, although both perform very well with our systems. The numbers in grey beneath each result report 95% confidence intervals, expressed as percentages.

Mutator                  GenImmix   RC      RC Immix
Time                     1.000      1.087   0.985
Instructions Retired     1.000      1.094   1.012
L1 Data Cache Misses     1.000      1.313   1.043

Table 8. Mutator performance counters show RC Immix solves the instruction overhead and poor locality problems in RC. Applications executing RC Immix compared with GenImmix in a moderate heap size of 2× the minimum execute the same number of retired instructions and see only a slightly higher L1 data cache miss rate. Comparing RC to RC Immix, RC Immix reduces miss rates by around 20%.

The time_mu columns of Table 7 and Figure 3(b) show that RC Immix matches or beats the Immix collectors with respect to mutator performance and improves significantly over RC in a moderate heap. The reasons that RC Immix improves over RC in total time stem directly from this improvement in mutator performance. RC mutator time is 9% worse than that of any other collector, as we reported in Table 2 and discussed in Section 2.1. RC Immix completely eliminates this gap in mutator performance.

Table 8 summarizes the reasons for RC Immix's improvement over RC by showing the number of mutator retired instructions and mutator L1 data cache misses for RC and RC Immix normalized to GenImmix. RC Immix solves the instruction overhead and poor locality problems in RC because, by using a bump pointer, it wins twice.

First, it gains the advantage of efficient zeroing of free memory in lines and blocks, rather than zeroing at the granularity of each object when it dies or is recycled in the free list (see Section 2.1 and Yang et al.'s measurements [33]). Second, it gains the advantage of contiguous allocation in memory of objects allocated together in time. This heap layout induces good cache behavior because objects allocated and used together occupy the same cache line, and because the bump pointer marches sequentially through memory, the hardware prefetcher correctly predicts the next line to fetch, so it is in cache when the program (via the memory allocator) accesses it. Yang et al.'s prefetching measurements quantify this effect [33]. Table 8 shows that compared to RC, RC Immix reduces cache misses by around 20% (1.043/1.313). GenImmix has slightly lower cache miss rates than RC Immix, which makes sense because it always allocates new objects contiguously (sequentially) whereas RC Immix sometimes allocates into partially full blocks and must skip over occupied lines.
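The mechanism can be sketched as follows; the constants and names are our own illustration, not the Jikes RVM allocator. Note that the allocator claims memory a run of free lines at a time, which is also the granularity at which it zeroes:

```java
// Sketch of Immix-style bump allocation into a partially full block.
final class BumpAllocator {
  static final int LINE_BYTES = 256;
  static final int LINES_PER_BLOCK = 128;      // 32KB block (illustrative)

  private final boolean[] lineLive;            // per-line occupancy
  private int cursor = 0, limit = 0;           // current run of free lines

  BumpAllocator(boolean[] lineLive) { this.lineLive = lineLive; }

  /** Returns a block-relative offset, or -1 if no free run fits. */
  int alloc(int bytes) {
    while (cursor + bytes > limit) {
      if (!advanceToNextFreeRun()) return -1;  // block exhausted
    }
    int result = cursor;
    cursor += bytes;                           // contiguous within the run
    return result;
  }

  private boolean advanceToNextFreeRun() {
    int line = limit / LINE_BYTES;
    while (line < LINES_PER_BLOCK && lineLive[line]) line++;   // skip occupied
    if (line == LINES_PER_BLOCK) return false;
    cursor = line * LINE_BYTES;                // run start: zeroed wholesale
    while (line < LINES_PER_BLOCK && !lineLive[line]) line++;  // extend the run
    limit = line * LINE_BYTES;
    return true;
  }
}
```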
of RC Immix seen in Figure 4(a). Figure 4(b) reveals that
5.2 Variable Heap Size Analysis

[Figure 4: three line graphs plotting Time / Best, Mutator Time / Best, and GC Time / Best against Heap Size / Minimum Heap (1 to 6) for GenImmix, RC, RC (no scan), RC Immix (no PC), and RC Immix; panels (a) Total time, (b) Mutator time, (c) GC time.]

Figure 4. The performance of GenImmix, RC, RC Immix, and RC Immix with no proactive copying (no PC) as a function of heap size.

Figure 4 evaluates RC Immix performance as a function of available memory. Each of the three graphs varies heap size between 1× and 6× the minimum in which all collectors can execute the benchmark. The graphs plot total time, mutator time, and GC time as a geometric mean of the benchmarks, showing RC Immix, GenImmix, RC, RC with no boot image scanning (Section 3.3), and RC Immix with no proactive copying. Figure 4(a) shows total time, and reveals that RC Immix dominates RC at all heap sizes, and consistently outperforms GenImmix at heap sizes above 1.4× the minimum. Figures 4(b) and 4(c) reveal the source of the behavior of RC Immix seen in Figure 4(a). Figure 4(b) reveals that the mutator performance of RC Immix is consistently good. This graph makes it clear that the underlying heap structure has a profound impact on mutator performance. Figure 4(c) shows that in GC time, RC Immix outperforms RC in tighter
heaps, and outperforms GenImmix at heap sizes above 1.4× the minimum.

The faster degradation of RC Immix relative to GenImmix at the very smallest heap sizes is likely to be due to the more aggressive defragmenting effect of GenImmix's copying nursery. When a nursery collection occurs, those objects that survive are all copied into the mature space, which, in the case of GenImmix, uses the Immix discipline. So while the surviving objects may be scattered throughout the nursery at the start of the collection, they will generally be contiguous after the nursery collection. GenImmix will tend to have a less fragmented mature space than RC Immix because the mature space is not intermingled with the fragmented young space, as it is in RC Immix and Sticky Immix. The result will be two-fold: both improved spatial locality and, more importantly, reduced fragmentation, which will substantially reduce GC load at small heap sizes, as seen in Figure 4(c). Furthermore, as hinted by the data cache miss rates for semi-space in Table 1, the copying order of a copying collector will generally result in particularly good locality among the surviving objects.

5.3 Pinning
We conducted a simple experiment to explore the tradeoff associated with dedicating a header bit for pinning (see Section 3.1). While a pinning bit could be folded into the logged and forwarding bits, in this case we simply trade pinning functionality for reduced reference counting bits. In Jikes RVM, pinning can be utilized by the Java standard libraries to make IO more efficient, so although no Java application can exploit pinning directly, there is a potential performance benefit to providing pinning support.

Bits Used      Heap Size
count   pin    1.2×     1.5×     2×
3       0      1.030    0.998    0.978
2       0      1.022    0.991    0.979
2       1      1.023    0.991    0.974

Table 9. Performance sensitivity of RC Immix with a pinning bit at 1.2, 1.5, and 2 times the minimum heap size, averaged over all benchmarks. Time is normalized relative to GenImmix. Lower is better.

Table 9 shows the result of a simple experiment with three configurations at three heap sizes. The performance numbers are normalized to GenImmix and represent the geometric mean of all benchmarks. In the first row, we have three reference counting bits and no pinning support, which is the default configuration used in this paper. Then we reduce the number of reference counting bits to two, without adding pinning. Finally we use two reference counting bits and add support for pinning. The results show that, to a first approximation, the tradeoff is not significant, with the performance variations all being within 0.8% of each other. Although the variations are small, the numbers are intriguing. We see that at the 2× heap, the introduction of pinning improved total performance by around 0.5% when holding the reference counting bits constant. More interestingly, we see that the reduction in reference counting bits from three to two makes very little difference, perhaps even improving performance at 1.2× and 1.5×. This second result seems counter-intuitive. We surmise that the reason is that while the bulk of objects only need two bits to be correctly counted, many of the overflows may be attributable to objects that also overflow with three bits. The reduction in bits may thus be reducing the total number of increments and decrements performed without greatly reducing the efficacy of reclamation.
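A sketch of the two header layouts being compared may make the tradeoff concrete; the bit assignments are our own illustration, not the exact Jikes RVM encoding:

```java
// Two hypothetical layouts of the RC header byte from Table 9.
// A: three reference count bits, no pin bit. B: two bits plus a pin bit.
final class HeaderBits {
  static final int LOGGED   = 1 << 0;   // coalescing: object remembered
  static final int MARK     = 1 << 1;   // backup cycle trace
  static final int FORWARD  = 1 << 2;   // opportunistic copying
  static final int RC_SHIFT = 3;

  static final int RC_BITS_A = 3;       // counts saturate ("stick") at 7
  static final int RC_BITS_B = 2;       // counts saturate at 3
  static final int PINNED_B  = 1 << 5;  // configuration B only: never copy

  // A stuck count is no longer incremented or decremented; such
  // objects are left to the backup cycle trace to reclaim.
  static boolean stuck(int header, int rcBits) {
    int count = (header >> RC_SHIFT) & ((1 << rcBits) - 1);
    return count == (1 << rcBits) - 1;
  }
}
```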
5.4 Benchmark Analysis
Table 7 reveals that sunflow is significantly slower on RC Immix (no PC) than on GenImmix, whereas xalan and lusearch are significantly faster when using RC Immix. We now analyze these outlier results.

Sunflow. Table 7 shows that sunflow is 12% slower in total time on RC Immix (no PC) than GenImmix, and that this slowdown is entirely due to a garbage collection slowdown of 2.12×. The source of this problem appears to be high fragmentation among surviving young objects in sunflow. It was this observation that encouraged us to explore proactive defragmentation, and this benchmark shows that the strategy is hugely effective, as RC Immix improves over GenImmix by 5%. Sunflow has a high allocation rate [33], and our observation that GenImmix does a large number of nursery collections, but no mature space collections, at 2× minimum heap size confirms this behavior. RC Immix (no PC) does a large number of collections, many of which are defragmenting cycle collections, and yet sunflow has few cycles [27]. Furthermore, Table 4 shows that although the line survival rate for sunflow is 5%, the block survival rate is a remarkable 99%. This indicates that surviving objects are scattered in the heap, generating fragmentation, and thus Immix blocks are being kept alive unnecessarily. We also established empirically that sunflow's performance degraded substantially if the standard defragmentation heuristic was made less aggressive.

Xalan. Both RC Immix (no PC) and RC Immix perform very well on xalan, principally because they have lower GC time than GenImmix. RC Immix (no PC) has 66% lower GC time than GenImmix and RC Immix has 55% lower GC time than GenImmix. Xalan has a large number of medium lifetime objects, which can only be recovered by a full heap collection with GenImmix, but are recovered in a timely way in RC Immix.

Lusearch. RC Immix performs much better on lusearch than GenImmix. In fact, GenImmix has substantially worse mutator time than any other system. This result is due to the bug in lusearch that causes the allocation of a very large number of medium sized objects (Section 4), leading GenImmix to perform over 800 nursery collections, destroying mutator locality. The allocation pathology of lusearch is established and is the reason why we use lusearch-fix in our results, exclude lusearch from all of our aggregate (mean and geomean) results, and leave it greyed out in Table 7. If we were to include lusearch in our aggregate results, then both RC Immix (no PC) and RC Immix would be 5% faster in geomean than GenImmix.

5.5 Further Analysis and Opportunities
We have explored three further opportunities for improving the performance of RC Immix, namely reference level coalescing, conservative stack scanning, and root elision.

Reference Level Coalescing. When Levanoni and Petrank first described coalescing of reference counts, they described it in terms of remembering the address and value of each reference when it was first mutated [21]. However, in practice it is easier to remember the address and contents of each object when the first of its reference fields is mutated [22]. In the first case, the collector compares the GC-time value of the reference with the remembered value, decrements the count for the object referred to by the remembered value, and increments the count for the object referred to by the latest value of the reference. With object level coalescing, each reference within the object is remembered and compared. The implementation challenge is due to the need to remember each reference only once, and therefore to efficiently record somewhere that a given reference has been remembered. Using a bit in the object's header makes it easy to do coalescing at an object granularity. Both RC and RC Immix use object level coalescing.

As part of this work, we implemented reference level coalescing. We did this by stealing a high order bit within each reference to record whether that reference had been remembered. We then map two versions of each page to a single physical page (each one corresponding to the two possible states of the high order bit). We must also modify the JVM's object equality tests to ensure that the stolen bit is ignored in any equality test. We were disappointed to find that despite the low overhead bit stealing approach we devised, we saw no performance advantage in using reference level coalescing. Indeed, we observed a small slowdown. We investigated and noticed that reference level coalescing places a small but uniform overhead on each pointer mutation, but the potential benefit of the optimization is dominated by the young object optimizations implemented in RC and RC Immix. As a result, we use object level coalescing in RC Immix.
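The bit stealing scheme can be sketched as follows; the bit position and helper names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of reference level coalescing via a stolen high order bit.
// The tag lives in the stored reference value; both tag states alias
// the same physical page through the double mapping, so a tagged
// reference still dereferences correctly and only equality tests
// need to mask the bit out.
final class StolenBit {
  static final long REMEMBERED = 1L << 47;                   // assumed-free bit
  static final List<long[]> remembered = new ArrayList<>();  // (slot, old value)

  // Write barrier: on a field's first mutation since the last GC,
  // remember the slot and its old value, then keep the field tagged.
  static long barrier(long slotAddr, long oldValue, long newValue) {
    if ((oldValue & REMEMBERED) == 0) {
      remembered.add(new long[] { slotAddr, oldValue });
    }
    return newValue | REMEMBERED;  // tag persists until the next collection
  }

  // At GC time, each remembered slot yields a decrement for the old
  // value and an increment for the slot's current value. Object
  // equality must ignore the stolen bit:
  static boolean sameObject(long a, long b) {
    return (a & ~REMEMBERED) == (b & ~REMEMBERED);
  }
}
```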
Conservative Stack Scan. One of the explanations for the continued use of naïve reference counting rather than deferred reference counting is that deferred reference counting requires an enumeration of roots [16], which is challenging to implement correctly. To precisely enumerate roots requires implementing stack maps. We note that a conservative stack scan could only introduce false positives, and therefore could never lead to an incorrect decrement, and thus to reclamation of a live object. We therefore believe that RC Immix could be implemented with conservative stack scans, circumventing a major barrier to the use of high performance reference counting. We plan to explore this in future work.

Root Elision. A key advantage of reference counting over generational collection is that it continuously collects mature objects. The benefits are borne out by the improvements we see in xalan, which has many medium lived objects. These objects are promptly reclaimed by RC and RC Immix, but are not reclaimed by a generational collector until a full heap collection occurs. However, this timely collection of mature objects does not come for free. Unlike a nursery collection in a generational collector, a reference counting collector must enumerate all roots, including all pointers from the stacks and all pointers from globals (statics). We realized that it may be possible to greatly reduce the workload of enumerating roots by selectively enumerating only those roots that have changed since the last GC. In the case of globals/statics, this could be achieved either by a write barrier or by keeping a shadow set of globals. We note that the latter may be feasible because the amount of space consumed by global pointers is typically very low. In the case of the stack, we could utilize a return barrier [34] to scan only the parts of the stack that have changed since the last GC. We plan to explore this in future work.
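As a sketch of the shadow-set idea for globals (hypothetical names; nothing here is an implemented RC Immix mechanism, and a return barrier would play the analogous role for stack frames):

```java
// Root elision for globals/statics via a shadow copy: at each
// collection, enumerate only the global slots whose value changed
// since the last collection.
final class ShadowGlobals {
  interface RootVisitor { void visit(long oldRef, long newRef); }

  private final long[] shadow;  // slot values seen at the last GC

  ShadowGlobals(int numGlobalSlots) { shadow = new long[numGlobalSlots]; }

  void enumerateChangedRoots(long[] globals, RootVisitor v) {
    for (int i = 0; i < globals.length; i++) {
      if (globals[i] != shadow[i]) {
        v.visit(shadow[i], globals[i]);  // decrement old, increment new
        shadow[i] = globals[i];
      }
    }
  }
}
```

The feasibility argument in the text is that this shadow array stays small because global pointers occupy very little space.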
6. Conclusion
In the garbage collection literature, two fundamental algorithms identify dead objects. Reference counting identifies them directly and tracing identifies them implicitly. Despite its intrinsic advantages, such as promptness of recovery and dependence only on local rather than global state, reference counting did not deliver high performance and it suffered from incompleteness due to cycles. Recent advances by Shahriyar et al. closed, but did not eliminate, this performance gap.

This paper identified heap organization as the principal source of this gap. In the literature, allocators use three heap organizations to place objects in memory: free lists, contiguous, and regions. Until this paper, reference counting always used a free list because it offered a constant time operation to reclaim each dead object. Unfortunately, optimizing for reclamation time neglects the more essential performance requirement of cache locality on modern systems. We show that indeed RC in a free list heap suffers poor locality compared to contiguous and hierarchical memory organizations. Unfortunately, the contiguous heap organization and freeing at an object granularity are fundamentally incompatible. Fortunately, the region heap organization and reference counting are compatible.

We describe the design and implementation of a new hybrid RC Immix collector. The key design contributions of
our work are an algorithm for performing per-line live object counts and the integration of proactive and reactive opportunistic copying. We show how to copy new objects proactively to mitigate fragmentation and improve locality. We further show how to combine reactive defragmentation with backup cycle detection. The key engineering contribution of our work is how to use limited header bits efficiently, serving triple duty for reference counting, backup cycle collection with tracing, and opportunistic copying.
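To make the first of these contributions concrete, a minimal sketch of per-line live object counting follows; the constants and names are our own illustration:

```java
// Each line in a block counts the live objects that start on it; a
// line whose count drops to zero becomes reusable by the bump
// allocator without waiting for a tracing collection.
final class PerLineCounts {
  static final int LINE_BYTES = 256;
  private final byte[] live;  // one counter per line in the block

  PerLineCounts(int linesPerBlock) { live = new byte[linesPerBlock]; }

  void objectBecameLive(int blockOffset) {
    live[blockOffset / LINE_BYTES]++;
  }

  /** Returns true if the object's line just became free. */
  boolean objectDied(int blockOffset) {
    return --live[blockOffset / LINE_BYTES] == 0;
  }
}
```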
Looking forward, we believe that RC Immix offers a new direction for both high performance throughput collectors and soft real-time collectors because of its ability to provide incremental reclamation with high throughput.

Acknowledgements
We thank Daniel Frampton for his contributions to the RC collector and Bertrand Maher and James Bornholt for their comments on earlier drafts of this paper.

References
[1] B. Alpern, S. Augart, S. M. Blackburn, M. Butrico, A. Cocchi, P. Cheng, J. Dolby, S. J. Fink, D. Grove, M. Hind, K. S. McKinley, M. Mergen, J. E. B. Moss, T. Ngo, V. Sarkar, and M. Trapp. The Jikes RVM Project: Building an open source research community. IBM Systems Journal, 44(2):399–418, 2005. doi: 10.1147/sj.442.0399.
[2] D. F. Bacon and V. T. Rajan. Concurrent cycle collection in reference counted systems. In European Conference on Object-Oriented Programming, Budapest, Hungary, June 18–22, 2001, pages 207–235. LNCS, 2001. ISBN 3-540-42206-4. doi: 10.1007/3-540-45337-7_12.
[3] D. F. Bacon, C. R. Attanasio, H. B. Lee, V. T. Rajan, and S. Smith. Java without the coffee breaks: A nonintrusive multiprocessor garbage collector. In ACM Conference on Programming Language Design and Implementation, PLDI'01, Snowbird, UT, USA, June 2001, pages 92–103. ACM, 2001. doi: 10.1145/378795.378819.
[4] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In ACM Conference on Programming Language Design and Implementation, PLDI'01, Snowbird, UT, USA, June 2001, pages 114–124. ACM, 2001. doi: 10.1145/378795.378821.
[5] S. M. Blackburn and K. S. McKinley. Ulterior reference counting: Fast garbage collection without a long wait. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA'03, Anaheim, CA, USA, Oct. 2003, pages 344–358. ACM, 2003. doi: 10.1145/949305.949336.
[6] S. M. Blackburn and K. S. McKinley. Immix: A mark-region garbage collector with space efficiency, fast collection, and mutator locality. In ACM Conference on Programming Language Design and Implementation, PLDI'08, Tucson, AZ, USA, June 2008, pages 22–32. ACM, 2008. doi: 10.1145/1379022.1375586.
[7] S. M. Blackburn, P. Cheng, and K. S. McKinley. Oil and water? High performance garbage collection in Java with MMTk. In The 26th International Conference on Software Engineering, ICSE'04, Edinburgh, Scotland, 2004, pages 137–146. ACM/IEEE, 2004.
[8] S. M. Blackburn, P. Cheng, and K. S. McKinley. Myths and realities: The performance impact of garbage collection. In SIGMETRICS – Performance 2004, Joint International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, June 12–16, 2004, pages 25–36. ACM, 2004. doi: 10.1145/1005686.1005693.
[9] S. M. Blackburn, M. Hirzel, R. Garner, and D. Stefanović. pjbb2005: The pseudojbb benchmark, 2005. URL http://users.cecs.anu.edu.au/~steveb/research/research-infrastructure/pjbb2005.
[10] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA'06, Portland, OR, USA, Oct. 2006, pages 169–190. ACM, 2006. doi: 10.1145/1167473.1167488.
[11] S. M. Blackburn, K. S. McKinley, R. Garner, C. Hoffman, A. M. Khan, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann. Wake up and smell the coffee: Evaluation methodology for the 21st century. Commun. ACM, 51(8):83–89, Aug. 2008.
[12] T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley. The yin and yang of power and performance for asymmetric hardware and managed software. In The 39th International Conference on Computer Architecture, ISCA'12, Portland, OR, June 2012, pages 225–236. ACM/IEEE, 2012. doi: 10.1145/2366231.2337185.
[13] C. J. Cheney. A nonrecursive list compacting algorithm. Commun. ACM, 13(11):677–678, Nov. 1970. doi: 10.1145/362790.362798.
[14] G. E. Collins. A method for overlapping and erasure of lists. Commun. ACM, 3(12):655–657, Dec. 1960. doi: 10.1145/367487.367501.
[15] A. Demers, M. Weiser, B. Hayes, H. Boehm, D. Bobrow, and S. Shenker. Combining generational and conservative garbage collection: Framework and implementations. In ACM Symposium on the Principles of Programming Languages, POPL'90, San Francisco, CA, USA, pages 261–269. ACM, 1990. doi: 10.1145/96709.96735.
[16] L. P. Deutsch and D. G. Bobrow. An efficient, incremental, automatic garbage collector. Commun. ACM, 19(9):522–526, Sep. 1976. doi: 10.1145/360336.360345.
[17] Y. Feng and E. D. Berger. A locality-improving dynamic memory allocator. In Proceedings of the 2005 Workshop on Memory System Performance, pages 68–77, 2005. doi: 10.1145/1111583.1111594.
[18] D. Frampton. Garbage Collection and the Case for High-level Low-level Programming. PhD thesis, Australian National University, June 2010. URL http://cs.anu.edu.au/~Daniel.Frampton/DanielFrampton_Thesis_Jun2010.pdf.
[19] I. Jibaja, S. M. Blackburn, M. R. Haghighat, and K. S. McKinley. Deferred gratification: Engineering for high performance garbage collection from the get go. In Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC 2011), San Jose, CA, June 5, 2011. ACM, 2011. doi: 10.1145/1988915.1988930.
[20] R. E. Jones, A. Hosking, and J. E. B. Moss. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman and Hall/CRC Applied Algorithms and Data Structures Series, USA, 2011. URL http://gchandbook.org/.
[21] Y. Levanoni and E. Petrank. An on-the-fly reference counting garbage collector for Java. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA'01, Tampa, FL, USA, Oct. 2001, pages 367–380. ACM, 2001. doi: 10.1145/504282.504309.
[22] Y. Levanoni and E. Petrank. An on-the-fly reference-counting garbage collector for Java. ACM Trans. Prog. Lang. Syst., 28(1):1–69, Jan. 2006. doi: 10.1145/1111596.1111597.
[23] H. Lieberman and C. Hewitt. A real-time garbage collector based on the lifetimes of objects. Commun. ACM, 26(6):419–429, June 1983. doi: 10.1145/358141.358147.
[24] J. McCarthy. Recursive functions of symbolic expressions and their computation by machine, part I. Commun. ACM, 3(4):184–195, Apr. 1960. doi: 10.1145/367177.367199.
[25] H. Paz, E. Petrank, and S. M. Blackburn. Age-oriented concurrent garbage collection. In Compiler Construction, volume 3443 of Lecture Notes in Computer Science, pages 121–136. Springer Berlin Heidelberg, 2005. ISBN 978-3-540-25411-9. doi: 10.1007/978-3-540-31985-6_9.
[26] Y. Seeley. JIRA issue LUCENE-1800: QueryParser should use reusable token streams, 2009. URL https://issues.apache.org/jira/browse/LUCENE-1800.
[27] R. Shahriyar, S. M. Blackburn, and D. Frampton. Down for the count? Getting reference counting back in the ring. In Proceedings of the 11th International Symposium on Memory Management, ISMM 2012, Beijing, China, June 15–16, 2012. ACM, 2012. doi: 10.1145/2258996.2259008.
[28] SPEC. SPECjvm98, Release 1.03. Standard Performance Evaluation Corporation, Mar. 1999. URL http://www.spec.org/jvm98.
[29] SPEC. SPECjbb2005 (Java Server Benchmark), Release 1.07. Standard Performance Evaluation Corporation, 2006. URL http://www.spec.org/jbb2005.
[30] D. Ungar. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. In Proceedings of the First ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, SDE 1, 1984, pages 157–167. ACM, 1984. ISBN 0-89791-131-8. doi: 10.1145/800020.808261.
[31] J. Weizenbaum. Recovery of reentrant list structures in Lisp. Commun. ACM, 12(7):370–372, July 1969. doi: 10.1145/363156.363159.
[32] P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In Proceedings of the International Workshop on Memory Management, IWMM'95, Kinross, Scotland, UK, Sep. 27–29, 1995, volume 986 of Lecture Notes in Computer Science, pages 1–116. Springer Berlin Heidelberg, 1995. doi: 10.1007/3-540-60368-9_19.
[33] X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley. Why nothing matters: The impact of zeroing. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA'11, Portland, Oregon, USA, Oct. 2011, pages 307–324. ACM, 2011. doi: 10.1145/2048066.2048092.
[34] T. Yuasa, Y. Nakagawa, T. Komiya, and M. Yasugi. Return barrier. In Proceedings of the International Lisp Conference, 2002.
