Hell Oh Entropy!

That is not a number, that is a freed object

Posted: Apr 20, 2022, 12:01

How many of you have written this kind of code in the past:

o = xmalloc (old_size);
...
n = xrealloc (o, new_size);

if (n != o)
  {
    o = n;
    /* Update other pointers that referred to o or offsets from it.  */
  }

Not uncommon, right? We’re not dereferencing the freed o, and the pointer is, after all, a number and hence should be perfectly safe to check, right? It’s more efficient too, since we’re not updating pointers when it’s not necessary. Well…

Better Fortification

TLDR; I broke this ‘safety’ in my implementation for __builtin_dynamic_object_size in gcc but I’m not wrong, you are! See the last section for why.

Now for those of you interested in the story, it all began with the implementation for __builtin_dynamic_object_size. This builtin was implemented first in clang and promised to be a better __builtin_object_size, which was severely limited by its necessity to emit a constant. That restriction meant that (1) there were many cases where it just couldn’t arrive at a constant size and (2) where it did, it would come up with an upper or lower estimate and not necessarily a precise size. Given that the builtin is primarily used to implement _FORTIFY_SOURCE (there’s a more detailed blog post describing its mechanism out there), this directly reduces the scope of this security protection.

__builtin_dynamic_object_size had deeper implications than just being a dynamic version of __builtin_object_size however, which had led to initial pushback in the gcc community. Now that the implementation is due to come out in gcc 12.1 and is being tested with distribution rebuilds, new and interesting implications are being discovered. One of these (and so far the most fascinating to me) was its impact on using (not dereferencing, mind you) a freed pointer.

How gcc deduces object sizes

The object size computation is largely (there are some caveats here but not important for the purposes of this post) done in a separate pass. The pass runs twice: once very early in the pass chain and once more near the end of the tree passes. The early run is a hack that tries to record subobject size estimates before subsequent passes simplify subobject references to references to their parent object, thus returning a more precise subobject size. The late run is where the actual fun happens.

The object sizes pass, at a high level, tracks the pointer passed to either __builtin_object_size or __builtin_dynamic_object_size to all possible objects it may point to and subsequently, to the site of their assignment, to derive the size. In the static case (i.e. __builtin_object_size), it tries to come up with either the maximum or the minimum estimate while in the dynamic case it builds a fancy expression that would evaluate to the precise size at that point. Of course, ‘precise’ shouldn’t be taken for granted because there could be future changes that make the expressions imprecise in the interest of broadening coverage. If the pass is unable to deduce a size of any of the target objects of the pointer for any reason (passed through a call, non-constant in the static case, etc.), the call is replaced with (size_t) -1 or (size_t) 0 as appropriate.
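As a rough illustration of the static versus dynamic difference, here is a minimal sketch that asks the compiler for the size of a malloc'd buffer whose length is only known at run time. The helper name deduced_size is made up for this example, and the __has_builtin guard is an assumption about how to keep the snippet buildable on compilers that lack the dynamic builtin. The result depends on the compiler and optimization level, so no specific output is claimed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Older compilers do not provide __has_builtin at all.  */
#ifndef __has_builtin
# define __has_builtin(x) 0
#endif

/* Hypothetical helper: allocate n bytes and report what the compiler
   believes the size of that object to be.  */
size_t deduced_size(size_t n)
{
    assert(n > 0);              /* malloc (0) may legitimately return NULL */

    char *p = malloc(n);
    if (p == NULL)
        return (size_t) -1;

#if __has_builtin(__builtin_dynamic_object_size)
    /* Dynamic case: may expand to an expression in terms of n.  */
    size_t sz = __builtin_dynamic_object_size(p, 0);
#else
    /* Static case: must fold to a constant, or give up with (size_t) -1.  */
    size_t sz = __builtin_object_size(p, 0);
#endif

    free(p);
    return sz;
}
```

With the dynamic builtin and optimization enabled, one would expect to get n back. The static builtin typically cannot do better than (size_t) -1 here because n is not a constant.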

I can’t judge what I can’t see

As the pass tracks origins of the pointer in question, it unfortunately does not take into account any uses between the allocation and the reference in the builtin that may alter the nature of the pointer. This means that if the pointer was reallocated between its first allocation and the builtin call, the pass won’t notice unless the pointer was explicitly updated. This is a benign limitation in the static case because for the above example, it would simply compute the maximum of new_size and old_size and return the result. In fact in most real world cases since the reallocation is bound to be dynamic, it would simply bail out, resulting in a missed fortification.

With dynamic sizes though, one will now get the new size for n != o but not for the n == o case. As a result, any fortified function call based on this information will see the old size and abort fearing a buffer overflow even though there technically wasn’t any. This was seen in autogen, which had this precise pattern and hence stumbled when it was built with _FORTIFY_SOURCE=3.
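To make that reasoning concrete, here is a small model of the size the pass ends up believing for o after the if block in the idiom at the top of the post. It is plain arithmetic, not gcc's internal representation; the function names and the moved flag (standing in for the n != o comparison) are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Size the pass associates with o after `if (n != o) o = n;`.
   Only the n != o path reassigns o from the realloc result.  */
size_t modeled_size_of_o(bool moved, size_t old_size, size_t new_size)
{
    return moved ? new_size : old_size;
}

/* A fortified call aborts when the access is larger than the size it
   was told about.  */
bool fortify_would_abort(size_t access_len, size_t believed_size)
{
    return access_len > believed_size;
}

/* Worked example: grow from 16 to 64 bytes, then write 32 bytes.  */
void model_demo(void)
{
    /* Block moved: o was updated, the new size is seen, no abort.  */
    assert(!fortify_would_abort(32, modeled_size_of_o(true, 16, 64)));

    /* Block grew in place: o still traces back to the old allocation,
       so the 32 byte write looks like an overflow of a 16 byte object.  */
    assert(fortify_would_abort(32, modeled_size_of_o(false, 16, 64)));
}
```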

It’s a bug, it’s not a bug…

After a bit of back and forth, Martin Liška very helpfully came up with a contained reproducer that allowed us to see what had actually happened. I had broken a pretty common idiom, which meant that those applications would have false positive aborts, something that hadn’t happened with _FORTIFY_SOURCE before. That is until I found an excuse that I could use to point the finger back at you (which includes past me, who is clearly a different person, no?), the developer!

Object Lifetimes

clang 13 also broke with the test case Martin shared after I altered it a bit to fortify fread. That gave me some initial relief because clearly whatever I did wrong, the smart folks in the clang community did wrong too. So I wasn’t that stupid. Then of course, there was this, which put our collective ‘stupidity’ into perspective, kinda letting us off the hook:

Section 6.2.4 of the ISO C standard (I’m referring to an April 2011 draft because who even in their right minds pays for their copy?!) has this in point number 2:

The lifetime of an object is the portion of program execution during which storage is
guaranteed to be reserved for it. An object exists, has a constant address, and retains
its last-stored value throughout its lifetime. If an object is referred to outside of its
lifetime, the behavior is undefined. The value of a pointer becomes indeterminate when
the object it points to (or just past) reaches the end of its lifetime.

It clearly states that even the value of the pointer pointing to the object is not reliable after it has been freed, so not only should one avoid dereferencing the pointer after it is freed, they should refrain from using it altogether.

Essentially, the comparison with the old pointer results in undefined behaviour. I don’t think the standards committee intended to invalidate this specific idiom with that language, but it does allow compilers the freedom to make assumptions about pointer validity and this idiom ends up trampling on it. It is possible for the compiler to look for a dominating realloc and update its expectations for size in very specific cases, but it still remains largely unsupported. It won’t, for example, work in cases where a reallocation has been wrapped in a function without any malloc attribute annotations. In fact, gcc 12 has a new -Wuse-after-free option that warns users about this, which I admittedly once thought was too harsh.

EDIT 2022-04-21: This spawned a conversation in the Rust community, and Ralf Jung pointed out a way to think about this in pointer provenance terms that does not rely on the above C standard indeterminate pointer clause. This is very relevant because what the object size pass does in this context is essentially pointer provenance tracking (albeit limited and somewhat incomplete), which makes it natural for it to trip on this implicit assumption of o == n. Continuing to use o (and any pointers derived from it) in this context is incorrect.

Getting better together

I’m going to try and support some of these simple cases in gcc during the gcc 13 cycle but in general, this is undefined behaviour. If your code uses this idiom, you should start weaning away from it if it’s not performance sensitive and unconditionally update pointers once their lifetime ends.
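One way to move away from the idiom is to record offsets while the old object is still alive and then rebase unconditionally after the reallocation, without ever looking at the old pointer value again. The sketch below assumes a hypothetical struct buf with a single interior pointer, and uses plain realloc rather than the xrealloc wrapper from the opening example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer with one interior pointer that must survive a resize.  */
struct buf {
    char *base;
    char *cursor;               /* points into [base, base + size] */
    size_t size;
};

/* Resize b to new_size bytes (new_size must be non-zero).
   Returns 0 on success, -1 on failure with b left untouched.  */
int buf_resize(struct buf *b, size_t new_size)
{
    assert(new_size > 0);
    assert(b->cursor >= b->base && (size_t) (b->cursor - b->base) <= b->size);

    /* Capture the offset before the old object's lifetime can end.  */
    size_t cursor_off = (size_t) (b->cursor - b->base);

    char *n = realloc(b->base, new_size);
    if (n == NULL)
        return -1;              /* the old block is still valid here */

    /* Unconditionally rebase; never compare n with the old pointer.  */
    b->base = n;
    b->cursor = n + (cursor_off < new_size ? cursor_off : new_size);
    b->size = new_size;
    return 0;
}
```

The extra stores cost very little compared to the reallocation itself, and every pointer that is used afterwards is derived from the pointer realloc returned.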

Deploying _FORTIFY_SOURCE=3 more widely has been a learning experience (all owing to Martin Liška’s efforts since he was the one building thousands of packages and reporting bugs!) in the deeper implications that __builtin_dynamic_object_size would have when replacing __builtin_object_size. Another interesting implication was the misuse of malloc_usable_size and equivalent interfaces that we discovered with systemd and jemalloc that open up deeper design questions for malloc interfaces. More on that in a separate post either here or on one of the Red Hat blogs.

A simple change of more precise object sizes and wider coverage ended up not only weeding out actual overflows, but also some interesting corner cases and “adventurous” programming practices. I’m going to start rolling some of this out into Fedora near the end of the year and we’ll hopefully have better mitigations in Linux distributions very soon.


Malloc per-thread arenas in glibc

Posted: Oct 24, 2012, 14:40

The malloc subsystem in glibc has had the feature of per-thread arenas for quite some time now and based on my experience, it seems to be the source of a lot of confusion. This is especially true for enterprise users who move from RHEL-5 to RHEL-6 and see their apps taking up a lot more ‘memory’, and you’ll see a fair amount of content in this regard in the Red Hat Customer Portal if you’re a RHEL customer. This post is an attempt to look at this from the internal design perspective. I have not written any of the code in this implementation (barring a tiny improvement), so all of the talk about ‘original intention’ of any design is speculation on my behalf.

Background

The glibc malloc function allocates blocks of address space to callers by requesting memory from the kernel. I have written about the two syscalls glibc uses to do this, so I won’t repeat that here beyond mentioning that the two ‘places’ from where the address space is obtained are ‘arenas’ and ‘anonymous memory maps’ (referred to as anon maps in the rest of the post). The concept of interest here is the arena, which is nothing but a contiguous block of memory obtained from the kernel. The difference from the anon maps is that one anon map fulfills only one malloc request while an arena is a scratchpad that glibc maintains to return smaller blocks to the requestor. As you may have guessed, the data area (or ‘heap’) created by extending the process break (using the brk syscall) is also an arena - it is referred to as the main arena. The ‘heap’ keyword has a different meaning in relation to the glibc malloc implementation as we’ll find out later.

The Arenas

In addition to the main arena, glibc malloc allocates additional arenas. The reason for creation of arenas always seems to have been to improve performance of multithreaded processes. A malloc call needs to lock an arena to get a block from it and contention for this lock among threads is quite a performance hit. So the multiple arenas implementation did the following to reduce this contention:

Firstly, the malloc call always tries to stick to the arena it accessed the last time.
If the earlier arena is not available immediately (as tested by a pthread_mutex_trylock), try the next arena.
If we don't have uncontended access on any of the current arenas, then create a new arena and use it.

We obviously start with just the main arena. The main arena is extended using the brk() syscall and the other arenas are extended by mapping new ‘heaps’ (using mmap) and linking them. So an arena can generally be seen as a linked list of heaps.
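The three steps above can be sketched as follows. This is a single-threaded model of the selection order only, not glibc's code: a plain flag stands in for the arena mutex, and MAX_ARENAS, arena_acquire and arena_release are names invented for the example. A real implementation would also have to protect the arena list itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ARENAS 8                    /* arbitrary cap for this sketch */

static bool arena_busy[MAX_ARENAS];     /* true while a caller holds the arena */
static size_t arena_count = 1;          /* slot 0 plays the main arena */

/* Stand-in for pthread_mutex_trylock: take the arena only if it is free.  */
static bool arena_trylock(size_t idx)
{
    if (arena_busy[idx])
        return false;
    arena_busy[idx] = true;
    return true;
}

void arena_release(size_t idx)
{
    assert(idx < arena_count && arena_busy[idx]);
    arena_busy[idx] = false;
}

/* Pick an arena for a caller that used arena `last` the previous time.
   Returns the index of the arena now held, or -1 if the cap is reached.  */
int arena_acquire(size_t last)
{
    assert(last < arena_count);

    /* Steps 1 and 2: start with the last used arena, then try the rest.  */
    for (size_t i = 0; i < arena_count; i++) {
        size_t idx = (last + i) % arena_count;
        if (arena_trylock(idx))
            return (int) idx;
    }

    /* Step 3: everything is contended, so create a new arena and use it.  */
    if (arena_count == MAX_ARENAS)
        return -1;
    size_t fresh = arena_count++;
    arena_busy[fresh] = true;
    return (int) fresh;
}
```

Starting from just the main arena, two overlapping callers end up on arenas 0 and 1, and a third overlapping caller triggers the creation of arena 2.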

An Arena for a Thread

The multiple arena implementation, with its on-demand model for reducing contention, was faster in the general case than the earlier model of using just a single arena. The process of detecting contention, however, was slow enough that a better idea was needed. This is where the idea of having a thread stick to one arena comes in.

With the arena per-thread model, if malloc is called for the first time within a thread, a new arena is created without looking at whether the earlier locks would be contended or not. As a result, for a sane number of threads*, one can expect zero contention among threads when locking arenas since they’re all working on their own arenas.

There is a limit to the number of arenas that are created in this manner and that limit is determined based on the number of cores the system has. 32-bit systems get twice the number of cores and 64-bit systems get 8 times the number of cores. This can also be controlled using the MALLOC_ARENA_MAX environment variable.
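As a quick worked example of that limit, the sketch below computes the default cap from the multipliers described above. The function name and the pointer_bits parameter are made up for illustration, and the MALLOC_ARENA_MAX override is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Default cap on the number of arenas: 2 per core on 32-bit systems,
   8 per core on 64-bit systems.  */
size_t default_arena_limit(size_t ncores, unsigned pointer_bits)
{
    assert(ncores > 0);
    assert(pointer_bits == 32 || pointer_bits == 64);
    return ncores * (pointer_bits == 32 ? 2 : 8);
}
```

So an 8 core 64-bit machine would allow up to 64 arenas by default, while the same core count on a 32-bit system would allow 16.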

* A sane number of threads is usually not more than twice the number of cores. Anything more and you’ll have a different set of performance problems to deal with.

The Problem (or not)

The whole issue around the concept of arenas is the amount of ‘memory’ it tends to waste. The common complaint is that the virtual memory footprint of programs increases due to use of arenas and definitely more so due to using a different arena for each thread. The complaint is mostly bogus because of a couple of very important features that have been there for years - loads and loads of address space and demand paging.

The Linux virtual memory model can allow a process to access most of its virtual memory address space (provided it is requested and mapped in). A 64-bit address space is massive and is hence not a resource that is likely to run out for any reasonable process. Add to it the fact that the kernel has mechanisms to use physical memory only for pages that are needed by the process at that time and you can rest assured that even if you map in terabytes of memory in your process, only the pages that are used will ever be accounted against you.

Both arena models rely on these facts. A mapped in arena is explicitly requested without any permissions (PROT_NONE) so that the kernel knows that it does not need any physical memory to back it. Block requests are then fulfilled by giving read+write permissions in parts. Freed blocks near the end of heaps in an arena are given back to the system either by using madvise(MADV_DONTNEED) or by explicitly unmapping.
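The sequence described above can be tried out directly with the underlying system calls. The following is a minimal Linux-only sketch, not glibc's implementation: reserve_commit_release is a made-up name, and it only ever commits a single page of the reserved range.

```c
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS and madvise on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve reserve_bytes of address space, commit the first page, use it,
   then hand the page back.  Returns 0 on success, -1 on failure.  */
int reserve_commit_release(size_t reserve_bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && reserve_bytes >= (size_t) page);

    /* PROT_NONE: address space only, nothing that needs backing yet.  */
    char *base = mmap(NULL, reserve_bytes, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    int rc = -1;

    /* Fulfil a block request: open up just the first page.  */
    if (mprotect(base, (size_t) page, PROT_READ | PROT_WRITE) == 0) {
        base[0] = 42;           /* the first touch is what faults the page in */

        /* Block freed: tell the kernel it may drop the page.  */
        if (madvise(base, (size_t) page, MADV_DONTNEED) == 0)
            rc = 0;
    }

    /* Alternatively, give the whole range back by unmapping it.  */
    if (munmap(base, reserve_bytes) != 0)
        rc = -1;
    return rc;
}
```

Reserving a large range this way grows the virtual size of the process by the full amount, while only the single touched page ever needs physical memory.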

So in the light of all of this, as an administrator or programmer, you need to shift focus from the virtual memory column in your top output to the RSS column, since that is where the real resource usage is. Address space usage is not the same thing as memory usage. It is of course a problem if RSS is also high, and in such cases one should look at the possibility of a memory leak within the process and, if that is not the problem, then fragmentation within the arenas. There are tuning options available to reduce fragmentation. Setting resource limits on address space usage is passé and it is time that better alternatives such as cgroups are given serious consideration.
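The same two numbers that top reports can be read from inside a process. Here is a small sketch that assumes a Linux system with /proc mounted; it reads the first two fields of /proc/self/statm, which are the total program size and the resident set size, both in pages. The function names are invented for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Return field `index` of /proc/self/statm in pages, or -1 on failure.
   Field 0 is the total program size, field 1 the resident set size.  */
static long statm_field(int index)
{
    assert(index >= 0);

    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return -1;

    long value = -1;
    for (int i = 0; i <= index; i++) {
        if (fscanf(f, "%ld", &value) != 1) {
            value = -1;
            break;
        }
    }

    fclose(f);
    return value;
}

/* Address space in use by this process, in pages.  */
long vsize_pages(void)
{
    return statm_field(0);
}

/* Pages of this process that are actually resident in memory.  */
long rss_pages(void)
{
    return statm_field(1);
}
```

Comparing the two before and after a burst of allocations in a multithreaded program is a quick way to see how much of the growth is just address space.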
