Malloc per-thread arenas in glibc
The malloc subsystem in glibc has had the feature of per-thread arenas for quite some time now and, based on my experience, it seems to be the source of a lot of confusion. This is especially true for enterprise users who move from RHEL-5 to RHEL-6 and see their apps taking up a lot more ‘memory’; you’ll find a fair amount of content on this in the Red Hat Customer Portal if you’re a RHEL customer. This post is an attempt to provide more perspective on this from the internal design point of view. I have not written any of the code in this implementation (barring a tiny improvement), so all of the talk about the ‘original intention’ of any design is speculation on my part.
Background
The glibc malloc function allocates blocks of address space to callers by requesting memory from the kernel. I have written about the two syscalls glibc uses to do this, so I won’t repeat that here beyond mentioning that the two ‘places’ from which the address space is obtained are ‘arenas’ and ‘anonymous memory maps’ (referred to as anon maps in the rest of the post). The concept of interest here is the arena, which is nothing but a contiguous block of memory obtained from the kernel. The difference from the anon maps is that one anon map fulfills only one malloc request, while an arena is a scratchpad that glibc maintains to return smaller blocks to the requestor. As you may have guessed, the data area (or ‘heap’) created by extending the process break (using the brk syscall) is also an arena - it is referred to as the main arena. The ‘heap’ keyword has a different meaning in relation to the glibc malloc implementation, as we’ll find out later.
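To make the distinction concrete, here is a small program of my own (not taken from glibc documentation) that requests a tiny block and a large one. With default settings the tiny block is carved out of the main arena, while the large one, being above the default mmap threshold (128 KiB, unless the dynamic threshold has moved it), is served by a fresh anon map. malloc_stats() prints a rough summary of both to stderr.

```c
/* Small vs. large allocations: a minimal sketch, assuming default glibc
   malloc settings.  malloc_stats() gives a rough arena/mmap summary. */
#include <stdlib.h>
#include <malloc.h>

int main(void)
{
    void *small = malloc(64);        /* expected to come from the main arena */
    void *large = malloc(1 << 20);   /* 1 MiB: expected to get its own anon map */

    malloc_stats();                  /* summary of arena and mmap usage, to stderr */

    free(small);
    free(large);
    return 0;
}
```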
The Arenas
In addition to the main arena, glibc malloc allocates additional arenas. The reason for creating arenas always seems to have been to improve the performance of multithreaded processes. A malloc call needs to lock an arena to get a block from it, and contention for this lock among threads is quite a performance hit. So the multiple-arena implementation did the following to reduce this contention (roughly sketched in the code after this list):
- Firstly, the malloc call always tries to stick to the arena it accessed the last time
- If the earlier arena is not available immediately (as tested by a pthread_mutex_trylock), try the next arena
- If we don't have uncontended access to any of the current arenas, then create a new arena and use it.
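Put together, the selection logic looks roughly like the sketch below. This is only an illustration of the idea described above, not the actual glibc code; all of the names (arena_t, arena_get, arena_new, main_arena) are made up for the example.

```c
/* Illustrative sketch of contention-avoiding arena selection.
   Names and structure are hypothetical, not glibc internals. */
#include <pthread.h>
#include <stdlib.h>

typedef struct arena {
    pthread_mutex_t lock;
    struct arena *next;               /* circular list of arenas */
} arena_t;

static arena_t main_arena = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .next = &main_arena,
};

static __thread arena_t *last_arena;  /* arena this thread used last */

/* Create a new arena, link it into the list and return it locked. */
static arena_t *arena_new(void)
{
    arena_t *a = malloc(sizeof *a);
    pthread_mutex_init(&a->lock, NULL);
    pthread_mutex_lock(&a->lock);
    a->next = main_arena.next;
    main_arena.next = a;              /* list update not shown thread-safe */
    return a;
}

/* Return a locked arena, preferring the one used last time and
   avoiding contended locks where possible. */
static arena_t *arena_get(void)
{
    arena_t *a = last_arena ? last_arena : &main_arena;

    /* First choice: the arena we accessed the last time. */
    if (pthread_mutex_trylock(&a->lock) == 0)
        return last_arena = a;

    /* Otherwise try the other arenas in the list. */
    for (arena_t *p = a->next; p != a; p = p->next)
        if (pthread_mutex_trylock(&p->lock) == 0)
            return last_arena = p;

    /* Everything is contended: create a new arena and use it. */
    return last_arena = arena_new();
}
```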
We obviously start with just the main arena. The main arena is extended using the brk() syscall and the other arenas are extended by mapping new ‘heaps’ (using mmap) and linking them. So an arena can generally be seen as a linked list of heaps.
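The link between a heap and its owning arena is a small header at the start of each mmap’d heap. A simplified view of it might look like the following; the field names are loosely modelled on glibc’s internal heap_info structure but are trimmed and renamed for illustration.

```c
/* Simplified sketch of the per-heap header tying heaps to an arena.
   Field names are illustrative, not the real glibc definitions. */
#include <stddef.h>

struct arena;                      /* the malloc state for one arena */

struct heap {
    struct arena *arena;           /* arena this heap belongs to */
    struct heap  *prev;            /* previously allocated heap, if any */
    size_t size;                   /* current size of this heap */
    size_t committed;              /* bytes already made readable/writable */
    /* ... the usable memory of the heap follows this header ... */
};
```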
An Arena for a Thread
The arena implementation, which created arenas on demand to reduce contention, was faster in the general case than the earlier model of using just a single arena. The process of detecting contention, however, was still slow enough that a better idea was needed. This is where the idea of having a thread stick to one arena comes in.
With the arena per-thread model, if malloc is called for the first time within a thread, a new arena is created without looking at whether the earlier locks would be contended or not. As a result, for a sane number of threads*, one can expect zero contention among threads when locking arenas since they’re all working on their own arenas.
There is a limit to the number of arenas that are created in this manner, and that limit is determined based on the number of cores the system has: 32-bit systems get at most twice the number of cores and 64-bit systems get at most 8 times the number of cores. This can also be controlled using the MALLOC_ARENA_MAX environment variable.
- * A sane number of threads is usually not more than twice the number of cores. Anything more and you’ll have a different set of performance problems to deal with.
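As a rough illustration of the cap (and assuming a glibc recent enough to expose M_ARENA_MAX through mallopt), the sketch below spawns a handful of threads that each call malloc and then dumps the allocator state. Running it as MALLOC_ARENA_MAX=2 ./app would have the same effect as the mallopt call, without recompiling.

```c
/* Capping the number of arenas: a sketch, assuming M_ARENA_MAX is
   available in this glibc.  The thread and arena counts are arbitrary. */
#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    (void) arg;
    /* The first malloc in each thread picks (or creates) an arena. */
    void *p = malloc(256);
    free(p);
    return NULL;
}

int main(void)
{
    mallopt(M_ARENA_MAX, 2);          /* never create more than 2 arenas */

    pthread_t t[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);

    malloc_info(0, stdout);           /* XML dump lists the arenas in use */
    return 0;
}
```

Note that the cap has to be set before the extra arenas get created, which is why the mallopt call sits at the top of main, before any threads are spawned.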
The Problem (or not)
The whole issue around the concept of arenas is the amount of ‘memory’ they tend to waste. The common complaint is that the virtual memory footprint of programs increases due to the use of arenas, and even more so due to using a different arena for each thread. The complaint is mostly bogus because of a couple of very important features that have been there for years - loads and loads of address space and demand paging.
The Linux virtual memory model allows a process to access most of its virtual address space (provided it is requested and mapped in). A 64-bit address space is massive and is hence not a resource that is likely to run out for any reasonable process. Add to that the fact that the kernel has mechanisms to use physical memory only for the pages the process needs at that time, and you can rest assured that even if you map in terabytes of memory in your process, only the pages that are actually used will ever be accounted against you.
Both arena models rely on these facts. A mapped-in arena is explicitly requested without any permissions (PROT_NONE) so that the kernel knows that it does not need any physical memory to back it. Block requests are then fulfilled by granting read and write permissions in parts. Freed blocks near the end of heaps in an arena are given back to the system either by using madvise(MADV_DONTNEED) or by explicitly unmapping.
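The following standalone snippet demonstrates those kernel mechanisms in isolation. It is not glibc’s heap code, just an illustration of reserving address space with PROT_NONE, committing a slice of it, and handing the pages back with madvise(MADV_DONTNEED).

```c
/* Reserve address space, commit a slice, give pages back: a minimal
   demonstration of the mechanisms the arena code relies on. */
#include <sys/mman.h>
#include <string.h>

#define RESERVE (64UL * 1024 * 1024)   /* 64 MiB of address space */
#define USE     (1UL * 1024 * 1024)    /* commit only 1 MiB of it */

int main(void)
{
    /* Reserve address space; no physical memory is needed to back this. */
    char *heap = mmap(NULL, RESERVE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED)
        return 1;

    /* Make the first part usable; pages are faulted in only when touched. */
    mprotect(heap, USE, PROT_READ | PROT_WRITE);
    memset(heap, 0x5a, USE);           /* RSS now grows by roughly 1 MiB */

    /* 'Free' the pages: the mapping stays, the physical memory goes away. */
    madvise(heap, USE, MADV_DONTNEED);

    munmap(heap, RESERVE);
    return 0;
}
```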
So in light of all of this, as an administrator or programmer, one needs to shift focus from the virtual memory column in your top output to the RSS column, since that is where the real resource usage is. Address space usage is not the same thing as memory usage. It is of course a problem if RSS is also high, and in such cases one should look at the possibility of a memory leak within the process and, if that is not the problem, at fragmentation within the arenas. There are tuning options available to reduce fragmentation. Setting resource limits on address space usage is passé and it is time that better alternatives such as cgroups are given serious consideration.
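For completeness, a couple of the documented mallopt tunables that influence how aggressively memory is returned to the system are sketched below. The values are arbitrary examples rather than recommendations, and whether they help depends entirely on the workload.

```c
/* A sketch of two malloc tunables that can matter when RSS (not address
   space) is the problem.  The thresholds below are arbitrary examples. */
#include <malloc.h>

int main(void)
{
    /* Return free memory at the top of the heap to the kernel sooner. */
    mallopt(M_TRIM_THRESHOLD, 64 * 1024);

    /* Serve more requests with individual mmaps, which are fully
       returned to the kernel on free(). */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    /* ... application code ... */
    return 0;
}
```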