Hell Oh Entropy!

Life, Code and everything in between

The beauty of equations in Physics

I have been stereotyped by many I know (and understandably so) as a quintessential computer geek, probably someone who has dabbled in computers since childhood and whose first love they are. That is however far from the truth, because I first programmed a computer at the age of 20 and started my career soon after as a lowly back office outsourced engineer, the kind a lot of the geek community looks down upon. What I did grow up with was something related but still different - mathematics and physics.

Around the age of 15 my mother enrolled me for IIT coaching classes (a huge financial struggle and a social shock for me, but that’s another story) and I found teachers that ignited a love for these subjects. I was always technically inclined thanks to my father who was an engineer. After his early death everyone around me wanted me to be his ‘successor’ as an engineer and I happily (almost proudly then, what does a 9 year old know!) obliged. Whatever technical inclination I had was due to watching my father as a 6-9 year old (another interesting story) and it was enough to make me an above average math and science student.

In the IIT coaching classes, I learned in Physics, a way to look at the world and in mathematics, a way to express that view. Despite the ragging due to my social status (backward caste and poor, in a predominantly rich, upper caste, South Bombay clique), I was floating in the air those two years, making up problems to trick friends and solving equations for fun. I don’t think I’ve done maths just for the heck of it since. The love affair with physics and its mathematics did not live too long though as I flunked my IIT examinations and had to rethink everything, including my view of myself; it wouldn’t be the last time either.

I went into what I now recognize as depression for over a year. As I recovered, I was ankle deep in Linux and FOSS and it was the beginning of the next chapter of my life, a life that has more extensive documentation on the intertubes. The physics and maths got beaten out of me in the process though. One thing however seems to have stuck: my obsession for beauty and rhythm whenever I encounter a mathematical equation. If a result looked large and unwieldy, I would get very uncomfortable and keep working at it, refactoring it till it looked beautiful and reading it out sounded like I was reciting a poem. It was an obsession that my teachers loved and hated depending on how they were feeling on that day.

I rediscovered that love for symmetry and rhythm when I spent some time working with multiple precision math nearly a decade ago. I discovered it once again some years ago with just a few minutes of hacking away at a physics problem at reserved-bit where I came up with equations for a little maze that the kids at the makerspace wanted to build. The immense satisfaction of seeing the equation being easy on the eyes and almost musical to read is a feeling I cannot express in words. Then there is the beauty of discovering little facts by reading the equation (like location of an object at any point in the maze being independent of the acceleration due to gravity for the maze) that adds to the wonderful experience.

There is a parallel to such beauty in programming in the form of APIs or algorithms, but it doesn’t quite feel the same to me. I guess I enjoy programming quite a lot but no, I don’t love it like I did physics and maths. I don’t seem to have the mental space to go back to it though. I guess it’s a first love that I can only look back fondly at for now.


That is not a number, that is a freed object

How many of you have written this kind of code in the past:

o = xmalloc (old_size);
...
n = xrealloc (o, new_size);

if (n != o)
  {
    o = n;
    /* Update other pointers that referred to o or offsets from it.  */
  }

Not uncommon right? We’re not dereferencing the freed o and the pointer is after all, a number and hence should be perfectly safe to check, right? And more optimal too since we’re not updating pointers if it’s not necessary. Well…

Better Fortification

TLDR; I broke this ‘safety’ in my implementation for __builtin_dynamic_object_size in gcc but I’m not wrong, you are! See the last section for why.

Now for those of you interested in the story, it all began with the implementation for __builtin_dynamic_object_size. This builtin was implemented first in clang and promised to be a better __builtin_object_size, which was severely limited by its necessity to emit a constant. That restriction meant that (1) there were many cases where it just couldn’t arrive at a constant size and (2) where it did, it would come up with an upper or lower estimate and not necessarily a precise size. Given that the builtin is primarily used to implement _FORTIFY_SOURCE (there’s a more detailed blog post describing its mechanism out there), this directly reduces the scope of this security protection.
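
At a high level, the mechanism boils down to the compiler passing its object size estimate to a checking variant of the string or memory function. Here is a rough sketch of the idea; the checking builtins named below are real gcc builtins, but the expansion is simplified compared to what glibc’s fortified headers actually do:

#include <string.h>

void
copy_bytes (char *dst, const char *src, size_t n)
{
  /* With _FORTIFY_SOURCE, a call like memcpy (dst, src, n) is effectively
     turned into a checked call that knows how big dst is: */
  __builtin___memcpy_chk (dst, src, n, __builtin_object_size (dst, 0));

  /* If n turns out to exceed the object size, the program aborts instead of
     overflowing dst.  With _FORTIFY_SOURCE=3, the size argument comes from
     __builtin_dynamic_object_size instead, which covers many more cases.  */
}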

__builtin_dynamic_object_size had deeper implications than just being a dynamic version of __builtin_object_size however, which had led to initial pushback in the gcc community. Now that the implementation is due to come out in gcc 12.1 and is being tested with distribution rebuilds, new and interesting implications are being discovered. One of these (and so far the most fascinating to me) was its impact on using (not dereferencing, mind you) a freed pointer.

How gcc deduces object sizes

The object size computation is largely (there are some caveats here but not important for the purposes of this post) done in a separate pass. The pass runs twice, once very early in the pass chain and finally, near the end of the tree passes. The early run is a hack that tries to record subobject size estimates before subsequent passes simplify subobject references to references to their parent object, thus returning a more precise subobject size. The late run is where the actual fun happens.

The object sizes pass, at a high level, tracks the pointer passed to either __builtin_object_size or __builtin_dynamic_object_size to all possible objects it may point to and subsequently, to the site of their assignment, to derive the size. In the static case (i.e. __builtin_object_size), it tries to come up with either the maximum or the minimum estimate while in the dynamic case it builds a fancy expression that would evaluate to the precise size at that point. Of course, ‘precise’ shouldn’t be taken for granted because there could be future changes that make the expressions imprecise in the interest of broadening coverage. If the pass is unable to deduce a size of any of the target objects of the pointer for any reason (passed through a call, non-constant in the static case, etc.), the call is replaced with (size_t) -1 or (size_t) 0 as appropriate.
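
A small example shows the difference in behaviour (a sketch; it assumes gcc 12 or newer with optimisation enabled, and what the pass deduces depends on what it can see at the call site):

#include <stdio.h>
#include <stdlib.h>

size_t
alloc_and_measure (size_t n)
{
  char *p = malloc (n);

  /* The static builtin must fold to a constant.  With a non-constant n it
     gives up and returns (size_t) -1, the maximum estimate.  */
  size_t fixed = __builtin_object_size (p, 0);

  /* The dynamic builtin can build an expression that evaluates to the
     allocation size, i.e. n, at run time.  */
  size_t dynamic = __builtin_dynamic_object_size (p, 0);

  printf ("%zu %zu\n", fixed, dynamic);
  free (p);
  return dynamic;
}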

I can’t judge what I can’t see

As the pass tracks origins of the pointer in question, it unfortunately does not take into account any uses between the allocation and the reference in the builtin that may alter the nature of the pointer. This means that if the pointer was reallocated between its first allocation and the builtin call, the pass won’t notice unless the pointer was explicitly updated. This is a benign limitation in the static case because for the above example, it would simply compute the maximum of new_size and old_size and return the result. In fact in most real world cases since the reallocation is bound to be dynamic, it would simply bail out, resulting in a missed fortification.

With dynamic sizes though, one will now get the new size for n != o but not for the n == o case. As a result, any fortified function call based on this information will see the old size and abort fearing a buffer overflow even though there technically wasn’t any. This was seen in autogen, which had this precise pattern and hence stumbled when it was built with _FORTIFY_SOURCE=3.
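
Boiled down, the kind of code that tripped looks roughly like this (a simplified sketch of the idiom, not autogen’s actual code):

#include <stdlib.h>
#include <string.h>

void
grow_and_clear (size_t old_size, size_t new_size)
{
  char *o = malloc (old_size);
  char *n = realloc (o, new_size);

  if (n != o)
    o = n;  /* o is left untouched when the address did not change.  */

  /* On the n == o path, the object size pass still tracks o back to the
     malloc (old_size) call, so with _FORTIFY_SOURCE=3 the fortified memset
     below sees old_size and aborts when new_size is larger, even though the
     buffer really is new_size bytes now.  */
  memset (o, 0, new_size);
  free (o);
}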

It’s a bug, it’s not a bug…

After a bit of back and forth, Martin Liška very helpfully came up with a contained reproducer that allowed us to see what had actually happened. I had broken a pretty common idiom, which meant that those applications would have false positive aborts, something that hadn’t happened with _FORTIFY_SOURCE before. That is until I found an excuse that I could use to point the finger back at you (which includes past me, who is clearly a different person, no?), the developer!

Object Lifetimes

clang 13 also broke with the test case Martin shared after I altered it a bit to fortify fread. That gave me some initial relief because clearly, whatever I did wrong, the smart folks in the clang community had done wrong too. So I wasn’t that stupid. Then of course, there was this, which put our collective ‘stupidity’ into perspective, kinda letting us off the hook:

Section 6.2.4 of the ISO C standard (I’m referring to an April 2011 draft because who even in their right minds pays for their copy?!) has this in point number 2:

The lifetime of an object is the portion of program execution during which storage is
guaranteed to be reserved for it. An object exists, has a constant address, and retains
its last-stored value throughout its lifetime. If an object is referred to outside of its
lifetime, the behavior is undefined. The value of a pointer becomes indeterminate when
the object it points to (or just past) reaches the end of its lifetime.

It clearly states that even the value of the pointer pointing to the object is not reliable after it has been freed, so not only should one avoid dereferencing the pointer after it is freed, they should refrain from using it altogether.

Essentially, the comparison with the old pointer results in undefined behaviour. I don’t think the standards committee intended to invalidate this specific idiom with that language, but it does give compilers the freedom to make assumptions about pointer validity, and this idiom ends up running afoul of those assumptions. It is possible for the compiler to look for a dominating realloc and update its size expectations in very specific cases, but the idiom still remains largely unsupported. It won’t, for example, work in cases where a reallocation has been wrapped in a function without any malloc attribute annotations. In fact, gcc 12 has a new -Wuse-after-free option that warns users about this, one that I, admittedly, once thought was too harsh.

EDIT 2022-04-21: This spawned a conversation in the rust community and Ralf Jung pointed out a way to think about this in pointer provenance terms, one that does not rely on the above C standard indeterminate pointer clause. This is very relevant because what the object size pass does in this context is essentially pointer provenance (albeit limited and somewhat incomplete), which makes it natural for it to trip on this implicit assumption of o == n. Continuing to use o (and any pointers derived from it) in this context is incorrect.

Getting better together

I’m going to try and support some of these simple cases in gcc during the gcc 13 cycle but in general, this is undefined behaviour. If your code uses this idiom, you should start weaning yourself off it if it’s not performance sensitive, and unconditionally update pointers once the objects they point to reach the end of their lifetime.
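
For the idiom at the top of this post, weaning off it mostly means saving any offsets you need while the old pointer is still valid, updating the pointer unconditionally after the realloc and rederiving everything else from the new pointer. A sketch, with a made-up struct purely for illustration:

#include <stdlib.h>

struct buffer
{
  char *base;
  char *cursor;  /* points somewhere inside base */
};

int
buffer_grow (struct buffer *b, size_t new_size)
{
  /* Save derived state as offsets while the old pointer is still valid.  */
  size_t off = (size_t) (b->cursor - b->base);

  char *n = realloc (b->base, new_size);
  if (n == NULL)
    return -1;  /* the old pointers remain valid on failure */

  /* Unconditionally switch to the new pointer; no n != old comparison.  */
  b->base = n;
  b->cursor = n + off;
  return 0;
}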

Deploying _FORTIFY_SOURCE=3 more widely has been a learning experience (all owing to Martin Liška’s efforts, since he was the one building thousands of packages and reporting bugs!) in the deeper implications that __builtin_dynamic_object_size would have when replacing __builtin_object_size. Another interesting implication was the misuse of malloc_usable_size and equivalent interfaces, which we discovered with systemd and jemalloc and which opens up deeper design questions for malloc interfaces. More on that in a separate post either here or on one of the Red Hat blogs.

A simple change, namely more precise object sizes and wider coverage, ended up not only weeding out actual overflows, but also exposing some interesting corner cases and “adventurous” programming practices. I’m going to start rolling some of this out into Fedora near the end of the year and we’ll hopefully have better mitigations in Linux distributions very soon.


Relocations: fantastic symbols, but where to find them?

Update: There’s a Links section at the end that should give you a list of all the references you’ll ever need to understand Linkers and Relocations!

When I started out in compilers years ago, I found relocations especially hard to wrap my head around. They’re just simple math in the end but they combine elements from different places that make it complicated enough that many (as did I back then) assume it to be some kind of black art. The Oracle documentation on relocation processing and relocation sections is pretty much the best thing I’ve found on the internet that explains relocations and they’re a great start if you already know what you’re doing. This is why I figured I ought to try writing something more accessible that puts some of the bits into practice. The fact that I’m currently working on relocation processing makes it that much simpler for me to just bash at the keyboard and commit the stuff in my memory to a more persistent medium before it gets swapped out to make space for kitten photos.

While this tutorial is meant to be beginner friendly, it does assume that the reader has some awareness of the ELF format, at least to the extent of knowing about different ELF sections. It also assumes that the reader has some understanding of assembly language programming; bonus if you know aarch64 assembly, since that’s the flavour of the examples.

Finding each other in this crowded world

The idea of relocation is quite simple: when compiling programs, we need flexibility to build components of programs independently and then have them link together. This could be in the same source file where we don’t know where parts of the source would end up, multiple source files built into different object files or sets of object files built into different libraries that reference objects in each other. This flexibility is achieved using relocations. Here’s a very simple example using aarch64 assembly:

.data
.globl somedata
somedata:
	.8byte 0x42

.text
.globl _start
_start:
	nop
	ldr	x2, somedata

This is a simple program that loads somedata into register x2. It doesn’t do much and if you try to run the program it will crash, but it is a useful example that shows an assembly source where parts of it end up in different ELF sections.

The interesting bit here is a load instruction in the text section that is reading a variable somedata from the data section. The load instruction encodes within itself the offset of somedata from the instruction, aka the PC-relative offset. The assembler can see both the variable and the instruction, but it cannot say for sure where they will be in memory at this point, because it does not know how far the data section will be from the text section in the final library or executable. To work around this limitation, it needs to leave the offset field in the ldr instruction blank so that the linker can finally fill it in. It also needs to provide instructions to the linker to tell it how to fill in this field.

This is where relocations come into play.

In this specific case, one may assemble the example and disassemble it using objdump -Dr to find this disassembly:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d503201f 	nop
   4:	58000002 	ldr	x2, 0 <_start>
			4: R_AARCH64_LD_PREL_LO19	somedata

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

The R_AARCH64_LD_PREL_LO19 in that output is the relocation. How did it land in there in between the instructions you ask? Well, it didn’t. The relocations are actually in a separate section of their own, as is evident with objdump -r:

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000004 R_AARCH64_LD_PREL_LO19  somedata

It gets even better with readelf -r, which tells you that the relocations are essentially just a table of entries in the .rela.text section:

Relocation section '.rela.text' at offset 0x110 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000004  000500000111 R_AARCH64_LD_PREL 0000000000000000 somedata + 0

All sections with relocation entries have names prefixed with .rela or .rel, followed by the name of the section for which the relocations need to be applied. Based on these section names, it’s evident that there are two types of relocation entries, REL and RELA. There are a number of important pieces of information the assembler leaves for the linker here:

  • The memory address it needs to fix up
  • Which symbol the memory address is referring to
  • What is the offset from the symbol that it should finally consider as the result
  • How should it perform the computation

All of this can be seen in the above relocation table. Each entry in the relocation table is basically a C structure of the following form for REL type relocations:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
} Elf64_Rel;

and for RELA:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
        Elf64_Sxword    r_addend;
} Elf64_Rela;

r_offset corresponds to the Offset entry in the readelf output above and is typically the memory address that needs to be fixed up. The offset from the symbol, aka the addendum is present only in RELA type relocations and it corresponds to the r_addend element in the structure and the Addend field in the readelf output. The symbol and computation related information is encoded in the r_info field.

The Elf32_Rel and Elf32_Rela structures are similar, except for the data sizes of the elements.

Symbol hunting

The ‘what to write’ is where the r_info field comes in. That’s what the linker needs to figure out before the where and how, which comes later.

This field is split into two 32-bit parts (for 32-bit ELF the split is instead 24 bits for the symbol and 8 bits for the type). The lower part tells the linker how to perform the computation and the upper part tells the linker what the target symbol is. The upper part is a symbol ID, which is basically an index into the symbol table. In our example above, the r_info is 0x000500000111, which means that the symbol id is 0x5. We can pull out the symbol table using readelf -s:

Symbol table '.symtab' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     4: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    1 $x
     5: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    3 somedata
     6: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _start

and we find that the symbol id 0x5 is somedata and has the Value (i.e. the address of the symbol relative to its section) of 0x0.

Now we need to figure out how to combine all of this information together using the lower part of r_info, which is 0x111. This number corresponds to R_AARCH64_LD_PREL_LO19, which has a specific meaning. An architecture defines a number of such relocations with descriptions of how they’re supposed to be applied. In the case of R_AARCH64_LD_PREL_LO19 it means “Add the symbol address and addend and subtract from it the location of the memory address that is being fixed up”. Think about that a bit and you’ll notice that it is how you would compute the PC-relative offset of somedata from the instruction, i.e. subtract the location of the fixup (i.e. the LDR instruction) from the address of somedata. In short form (and you’ll see this and similar notations to describe relocations), it is written as S + A - P.
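
To put the whole example into code, here is a sketch of how one could resolve this relocation by hand. ELF64_R_SYM and ELF64_R_TYPE are the standard macros from elf.h; the function name is made up and the final patching of the instruction is glossed over:

#include <elf.h>
#include <stdint.h>

/* Sketch: compute the value for our R_AARCH64_LD_PREL_LO19 example.
   place is the final address of the fixup, i.e. P.  */
int64_t
resolve_ld_prel_lo19 (const Elf64_Rela *rela, const Elf64_Sym *symtab,
                      uint64_t place)
{
  uint32_t sym_index = ELF64_R_SYM (rela->r_info);   /* 0x5 -> somedata */
  uint32_t r_type = ELF64_R_TYPE (rela->r_info);     /* 0x111 */
  (void) r_type;

  uint64_t s = symtab[sym_index].st_value;   /* S: the symbol address */
  int64_t a = rela->r_addend;                /* A: 0 in our example */
  int64_t result = (int64_t) (s + a - place);   /* S + A - P */

  /* result >> 2 is what finally lands in the 19-bit immediate field of the
     LDR (literal) instruction.  */
  return result;
}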

Putting things together

Now that we know what the target symbol is and how to compute the pc-relative offset, we need to compute the final symbol address, the final target address and then do the actual fixup. This is done near the end of the linking process in the GNU linker, when all sections have been laid out and we finally are in a position to know the relative addresses. The linker will then make the computation (i.e. S+A-P) and patch in the result into the LDR instruction before writing the output to the final binary. Here is what our result looks like:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d503201f 	nop
  4000b4:	58080022 	ldr	x2, 4100b8 <somedata>

Disassembly of section .data:

00000000004100b8 <somedata>:
  4100b8:	00000042 	.inst	0x00000042 ; undefined
  4100bc:	00000000 	.inst	0x00000000 ; undefined

Notice that the opcode of the LDR instruction is now different (and as a result the instruction itself is also different) and includes the 0x8002, which is basically the encoded difference (0x10004) from somedata.

Raising the stakes: Dynamic Relocations

This is all great when all our symbols are local and have predictable layouts such that PC-relative relocations such as R_AARCH64_LD_PREL_LO19 are sufficient to describe and resolve in between assembling and linking a program. What happens however, when these symbol references cross boundaries of sections in ways that we cannot predict at compile or link time? What happens when your symbol references cross boundaries of your shared object? These are problems that need to be solved to make Position Independent Code (PIC) possible. PIC is when your program (and sections within your program) could get mapped anywhere in memory and you need your code to adapt to that fact.

Take this very simple example:

.data
somedata:
        .8byte 0x42
somedata2:
	.8byte somedata

.text
.globl _start
_start:
	ret

There’s next to nothing here; just somedata like in our previous example and a somedata2 which points to somedata. However, in this next-to-nothing example lies an interesting complication that needs to be resolved at runtime: the value in somedata2 cannot be computed at compile time; it needs a fixup at runtime! Let’s walk through the compilation to see how we get to the final result. First, the disassembly to understand what the assembler did for us:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d65f03c0 	ret

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

0000000000000008 <somedata2>:
	...
			8: R_AARCH64_ABS64	.data

We see now that the address to be relocated is somedata2 in the .data section and it is of type R_AARCH64_ABS64. This is a simple relocation that instructs the linker to compute S + A to get the result, i.e. get the symbol address of somedata and add the addendum (0 again in this case). This in fact would be the final result for a statically linked binary (using ld -static) and we’d lose the relocation in favour of the absolute address written into somedata2:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d65f03c0 	ret

Disassembly of section .data:

00000000004100b4 <somedata>:
  4100b4:	00000042 	.inst	0x00000042 ; undefined
  4100b8:	00000000 	.inst	0x00000000 ; undefined

00000000004100bc <somedata2>:
  4100bc:	004100b4 	.inst	0x004100b4 ; undefined
  4100c0:	00000000 	.inst	0x00000000 ; undefined

When compiling a shared object however (i.e. ld -shared) we intend to produce a position independent DSO (dynamic shared object) and to achieve that the linker now emits a relocation to describe how to compute the final address to assign to somedata2 and where the memory address can be located. In this example, it is the R_AARCH64_RELATIVE dynamic relocation, as seen using objdump -DR (output snipped to retain only useful bits):

Disassembly of section .data:

0000000000011000 <somedata>:
   11000:	00000042 	.inst	0x00000042 ; undefined
   11004:	00000000 	.inst	0x00000000 ; undefined

0000000000011008 <somedata2>:
   11008:	00011000 	.inst	0x00011000 ; undefined
			11008: R_AARCH64_RELATIVE	*ABS*+0x11000
   1100c:	00000000 	.inst	0x00000000 ; undefined

This relocation is interesting not just because it is dynamic, but also because it is an S+A type relocation that puts the non-relocated address (i.e. the link time address) of somedata into its addend. This relocation also does not reference a symbol; instead it references an *ABS* value, which is basically the offset at which this DSO would be loaded during execution. It is the dynamic linker in the C runtime library (ld.so in GNU systems) that reads these relocations from the .rela.dyn section. Because this relocation is based on an absolute address computed by the static linker, the dynamic linker does not have to do a symbol lookup to resolve the relocation.
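
Here is a sketch of the work ld.so does for each such entry; the real code in glibc’s dl-machine.h handles many more relocation types and details, but the relative case is essentially this:

#include <elf.h>
#include <stdint.h>

/* Sketch: apply one R_AARCH64_RELATIVE entry from .rela.dyn.  load_base is
   the address at which the DSO happened to get mapped this run.  */
static void
apply_relative (uint64_t load_base, const Elf64_Rela *reloc)
{
  /* Where to write: the load address plus the link-time offset of somedata2
     recorded in r_offset.  */
  uint64_t *fixup = (uint64_t *) (load_base + reloc->r_offset);

  /* What to write: the load address plus the link-time address of somedata
     that the static linker stashed in the addend (0x11000 above).  */
  *fixup = load_base + (uint64_t) reloc->r_addend;
}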

The other difference from static relocations is that when a dynamic relocation references a symbol in its r_info, it is looked up in the .dynsym section, i.e. in the dynamic symbol table and not in the regular symbol table.

Final Thoughts

There are a number of other cases that the linker needs to cater for when it comes to relocations such as entries in Global Offset Tables, resolution of indirect functions (IFUNCs) and Thread-Local Storage. Thankfully though, the first principles behind all those relocations are the same as the above and you can apply this knowledge to GOT, TLS and IFUNC relocations as well. GOT relocations for example reference GOT base, which the linker knows where to find (since it sets up the .got section in the first place) and can use that information to compute the location to fix up. Other than this special knowledge, everything else remains pretty much the same.

Once you’re equipped with these first principles, the next task is to figure out where documentation for specific relocations is for every architecture. While the binutils documentation makes some effort to document the public facing part of relocations, the detailed documentation is usually distributed by the architecture chip vendors. The AArch64 ELF documentation for example is hosted on the Arm website.

Links

  • @gnutools pointed me to a full blog series by Ian Lance Taylor on Linkers, indexed by LWN
  • @matt_dz maintains a comprehensive set of links on linking and loading on GitHub. This looks like everything you’ll ever need to understand ELF and linkers!

gcc under the hood

My background in computers is a bit hacky for a compiler engineer. I never had the theoretical computer science base that the average compiler geek does (yes I have a Masters in Computer Applications, but it’s from India and yes I’m going to leave that hanging without explanation) and I have pretty much winged it all these years. What follows is a bunch of thoughts (high five to those who get that reference!) from my winging it for almost a decade with the GNU toolchain. If you’re a visual learner then I would recommend watching my talk video at Linaro Connect 2019 instead of reading this. This is an imprecise transcript of my talk, with less silly quips and in a more neutral accent.

Hell Oh World!

It all started as a lightning talk at SHD Belgaum where I did a 5 minute demonstration of how a program goes from source code to a binary program. I got many people asking me to talk about this in more detail and it eventually became a full hour workshop at FOSSASIA in 2017. Those posts formed the basis for the first part of my talk at Connect; in fact, they’re a bit more detailed in their treatment of the subject of purely taking code from source to binary.

Go read those posts, I’ll wait.

Under the hood

Welcome back! So now you know almost exactly how code goes from source to binary and how it gets executed on your computer. Congratulations, you now have a head start into hacking on the dynamic loader bits in glibc! Now that we’ve explored the woods around the swamp, let’s step into the swamp. Let’s take a closer look at what gcc does with our source code.

Babelfish

The first thing a compiler does is to read the source code and make it into a data structure form that is easy for the computer to manipulate. Since the computer deals best with binary data, it converts the text form of the source code language into a tree-like structure. There is plenty of literature available on how that is implemented; in fact most compiler texts end up putting too much focus on this aspect of the compiler and then end up rushing through the real fun stuff that the compiler does with the program you wrote.

The data structure that the compiler translates your source code into is called an Intermediate Representation (IR), and gcc has three of them!

This part of the compiler that parses the source code into IR is known as the front end. gcc has many such front ends, one for each language it implements; they are all cluttered into the gcc/ directory. The C family of languages has a common subset of code in the gcc/c-family directory and then there are specializations like the C++ front end, implemented in the gcc/cp directory. There is a directory for fortran, another for java, yet another for go and so on. The translation of code from source to IR varies from frontend to frontend because of language differences, but they all have one thing in common; their output is an IR called GENERIC and it attempts to be language-independent.

Optimisation passes

Once gcc translates the source code into its IR form, it runs the IR through a number of passes (about 200 of them, more or less depending on the -O flag you pass) to try and come up with the most optimal machine code representation of the program. gcc builds source code a file at a time, which is called a Translation Unit (TU). A lot of its optimisation passes operate at the function level, i.e. individual functions are seen as independent units and are optimised separately. Then there are Inter-procedural analysis (IPA) passes that look at interactions of these functions and finally there is Link Time Optimisation (LTO) that attempts to analyse source code across translation units to potentially get even better results.

Optimisation passes can be architecture-independent or architecture-dependent.

Architecture independent passes do not care too much about the underlying machine beyond some basic details like word size, whether the CPU has floating point support, vector support, etc. These passes have some configuration hooks that allow their behaviour to be modified according to the target CPU architecture but the high level behaviour is architecture-agnostic. Architecture-independent passes are the holy grail for optimisation because they usually don’t get old; optimisations that work today will continue to work regardless of CPU architecture evolution.

Architecture-dependent passes need more information about the architecture and as a result their results may change as architectures evolve. Register allocation and instruction scheduling for example are very interesting and complex problems that architecture-specific passes handle. The instruction scheduling problem was a lot more critical back in the day when CPUs could only execute code sequentially. With out-of-order execution, the scheduling problem has not become less critical, but it has definitely changed in nature. Similarly the register allocation problem can be complicated by various factors such as the number of logical registers, how they share physical register files, how costly moving between different types of registers is, and so on. Architecture-dependent passes have to take into consideration all of these factors.

The final pass in architecture-dependent passes does the work of emitting assembly code for the target CPU. The architecture-independent passes constitute what is known as the middle-end of the compiler and the architecture-dependent passes form the compiler backend.

Each pass is a complex work of art, mathematics and logic and may have one or more highly cited research papers as their basis. No single gcc engineer would claim to understand all of these passes; there are many who have spent most of their careers on a small subset of passes, such is their complexity. But then, this is also a testament to how well we can work together as humans to create something that is significantly more complex than what our individual minds can grasp. What I mean to say with all this is, it’s OK to not know all of the passes, let alone know them well; you’re definitely not the only one.

Optimisation passes are all listed in gcc/passes.def and that is the sequence in which they are executed. A pass is defined as a class with a gate and execute function that determine whether to run and what to do respectively. Here’s what a single pass namespace looks like:

/* Pass data to help the pass manager classify, prepare and cleanup.  */
const pass_data pass_data_my_pass =
{
  GIMPLE_PASS, /* type */
  "my_pass", /* name */
  OPTGROUP_LOOP, /* optinfo_flags */
  TV_TREE_LOOP, /* tv_id */
  PROP_cfg, /* properties_required */
  0, /* properties_provided */
  0, /* properties_destroyed */
  0, /* todo_flags_start */
  0, /* todo_flags_finish */
};

/* This is a GIMPLE pass.  I know you don't know what GIMPLE is yet ;) */
class pass_my_pass : public gimple_opt_pass
{
public:
  pass_my_pass (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_my_pass, ctxt)
  {}

  /* opt_pass methods: */
  virtual bool gate (function *) { /* Decide whether to run.  */ return true; }

  virtual unsigned int execute (function *fn);
};

unsigned int
pass_my_pass::execute (function *)
{
  /* Magic!  */
  return 0;  /* TODO flags for the pass manager; 0 means nothing further.  */
}

We will not go into the anatomy of a pass yet. That is perhaps a topic for a follow-up post.

GENERIC

GENERIC is the first IR that gets generated from parsing the source code. It is a tree structure that has been present since the earliest gcc versions. Its core data structure is a tree_node, which is a hierarchy of structs, with tree_base as the Abraham. OK, if you haven’t been following gcc development, this can come as a surprise: a significant portion of gcc is now in C++!

It’s OK, take a minute to mourn/celebrate that.

The tree_node can mean a lot of things (it is a union), but the most important one is the tree_typed node. There are various types of tree_typed, like tree_int_cst, tree_identifier, tree_string, etc. that describe the elements of source code that they house. The base structs, i.e. tree_typed and even further up, tree_base, have flags that carry metadata about the elements, which help in optimisation decisions. You’ll almost never use GENERIC in the context of code traversal in optimisation passes, but the nodes are still very important to know about because GIMPLE continues to use them to store operand information. Take a peek at gcc/tree-core.h and gcc/tree.def to take a closer look at all of the types of nodes you could have in GENERIC.

What is GIMPLE you ask? Well, that’s our next IR and probably the most important one from the context of optimisations.

OK so now you have enough background to go look at the guts of the GENERIC tree IR in gcc. Here’s the gcc internals documentation that will help you navigate all of the convenience macros to analyze the tree nodes.
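
To give a flavour of what working with these nodes looks like, here is a tiny illustrative fragment; TREE_CODE, tree_fits_shwi_p and tree_to_shwi are real gcc internal helpers, but the function itself is made up and assumes it is sitting inside gcc (or a plugin) where the tree headers are available:

/* Illustrative: decide whether an operand is a small integer constant.
   INTEGER_CST is one of the node codes from gcc/tree.def.  */
static bool
is_small_int_cst (tree t)
{
  return (TREE_CODE (t) == INTEGER_CST
          && tree_fits_shwi_p (t)      /* value fits in a signed HOST_WIDE_INT */
          && tree_to_shwi (t) < 256);  /* read the value out and compare */
}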

GIMPLE

The GIMPLE IR is a workhorse of gcc. Passes that operate on GIMPLE are architecture-independent.

The core structure in GIMPLE is a struct gimple that holds all of the metadata for a single gimple statement and is also a node in the list of gimple statements. There are various structures named gimple_statement_with_ops_* that have the actual operand information based on its type. Once again like with GENERIC, it is a hierarchy of structs. Note that the operands are all of the tree type so we haven’t got rid of all of GENERIC. gcc/gimple.h is where all of these structures are defined and gcc/gimple.def is where all of the different types of gimple statements are defined.

Where did my control flow go?

But how is it that a simple list of gimple statements is sufficient to traverse a program that has conditions and loops? Here is where things get interesting. Every sequence of GIMPLE statements is housed in a Basic Block (BB) and the compiler, when translating from GENERIC to GIMPLE (also known as lowering, since GIMPLE is structurally simpler than GENERIC), generates a Control Flow Graph (CFG) that describes the flow of the function from one BB to another. The only control flow idea one then needs in GIMPLE to traverse code from within the gimple statement context is a jump from one block to another and that is fulfilled by the GIMPLE_GOTO statement type. The CFG, with its basic blocks and edges connecting those blocks, takes care of everything else. Routines to generate and manipulate the CFG are defined in gcc/cfg.h and gcc/cfg.c but beware, you don’t modify the CFG directly. Since the CFG is tightly linked with GIMPLE (and RTL, yes that’s our third and final IR), it provides hooks to manipulate the graph and update GIMPLE if necessary.
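
Putting the CFG and GIMPLE together, walking a function inside a pass typically looks like the sketch below; FOR_EACH_BB_FN and the gsi_* iterator helpers are real gcc internals and this fragment would live inside something like the execute function shown earlier:

/* Sketch: visit every GIMPLE statement in every basic block of FN.  */
static void
walk_function (function *fn)
{
  basic_block bb;
  FOR_EACH_BB_FN (bb, fn)
    {
      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
           gsi_next (&gsi))
        {
          gimple *stmt = gsi_stmt (gsi);

          /* For example, only look at assignments.  */
          if (is_gimple_assign (stmt))
            {
              tree lhs = gimple_assign_lhs (stmt);
              (void) lhs;  /* inspect or transform here */
            }
        }
    }
}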

The last interesting detail about the CFG is that it has a special construct for loops, because they’re typically the most interesting subjects for optimisation: you can splice them, unroll them, distribute them and more to produce some fantastic performance results. gcc/cfgloop.h is where you’ll find all of the routines you need to traverse and manipulate loops.

The final important detail with regard to GIMPLE is the Static Single Assignment (SSA) form. Typical source code would have variables that get declared, assigned to, manipulated and then stored back into memory. Essentially, it is multiple operations on a single variable, sometimes destroying its previous contents as we reuse variables for different things that are logically related in the context of the high level program. This reuse however makes it complicated for a compiler to understand what’s going on and hence it ends up missing a host of optimisation opportunities.

To make things easier for optimisation passes, these variables are broken up into versions of themselves that have a specific start and end point in their lifecycle. This is called the Static Single Assignment form of a variable, where each version of the variable has a single starting point, viz. its definition. So if you have code like this:

    x = 10;
    x += 20;

it becomes:

    x_1 = 10;
    x_2 = x_1 + 20;

where x_1 and x_2 are versions of x. If you have versions of variables in conditions, things get interesting and the compiler deals with it with a mysterious concept called PHI nodes. So this code:

    if (n > 10)
      x = 10;
    else
      x = 20;
    return x;

becomes:

    if (n > 10)
      x_1 = 10;
    else
      x_2 = 20;
    # x_3 = PHI<x_1, x_2>;
    return x_3;

So the PHI node is a conditional selector of the earlier two versions of the variable and depending on the results of the optimisation passes, you could eliminate versions of the variables altogether or use CPU registers more efficiently to store those variable versions.
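
Inside gcc, a pass that wants to look at those PHI nodes iterates over them per basic block, roughly like this (again a sketch using real internal helpers; bb would come from the CFG walk shown earlier):

/* Sketch: visit the PHI nodes of basic block BB and their arguments,
   i.e. the incoming versions of each variable.  */
static void
walk_phis (basic_block bb)
{
  for (gphi_iterator gpi = gsi_start_phis (bb); !gsi_end_p (gpi);
       gsi_next (&gpi))
    {
      gphi *phi = gpi.phi ();
      tree result = gimple_phi_result (phi);      /* e.g. x_3 */

      for (unsigned i = 0; i < gimple_phi_num_args (phi); i++)
        {
          tree arg = gimple_phi_arg_def (phi, i);  /* e.g. x_1, x_2 */
          (void) arg;
        }
      (void) result;
    }
}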

There you go, now you have everything you need to get started on hacking GIMPLE. I know this part is a bit heavy but guess what, this is where you can seriously start thinking about hacking on gcc! When you jump in, you’ll need the more detailed information in the gcc internals manual on GIMPLE, CFG and GIMPLE optimisations.

RTL

We are yet another step closer to generating assembly code for our assembler and linker to build into the final program. Recall that GIMPLE is largely architecture independent, so it works on high level ideas of statements, expressions and types and their relationships. RTL is much more primitive in comparison and is designed to mimic sequential architecture instructions. Its main purpose is to do architecture-specific work, such as register allocation and scheduling instructions in addition to more optimisation (because you can never get enough of that!) passes that make use of architecture information.

Internally, you will encounter two forms of RTL, one being the rtx struct that is used for most transformations and passes in the compiler. The other form is textual, used to map machine instructions to RTL, and is written as Lisp-like S-expressions. These are used to make machine descriptions, where you can specify machine instructions for all of the common operations such as add, sub, multiply and so on. For example, this is what the description of a jump looks like for aarch64:

(define_insn "jump"
  [(set (pc) (label_ref (match_operand 0 "" "")))]
  ""
  "b\\t%l0"
  [(set_attr "type" "branch")]
)

Machine description files are in the gcc/config directory in their respective architecture subdirectory and have the .md suffix. The main aarch64 machine description file for example is gcc/config/aarch64/aarch64.md. These files are parsed by a tool in gcc to generate C code with rtx structures for each of those S-expressions.

In general, the gcc/config directory contains source code that handles compilation for that specific architecture. In addition to the machine descriptions, these directories also have C code that enhances the RTL passes to exploit as much of the architecture information as they possibly can. This is where some of the detailed architecture-specific analysis of the RTL instructions goes. For example, combining loads into load pairs for aarch64 is an important task and it is done with a combination of machine description and some code to peek into and rearrange neighbouring RTL instructions.

But then you’re wondering, why are there multiple description files? Other than just cleaner layout (put constraint information in a separate file, type information in another, etc.) it is because there are multiple evolutions of an architecture. The ‘i386’ architecture is a mess of evolutions that span word sizes and capabilities. The aarch64 architecture has within it many microarchitectures developed by various Arm licensee vendors like xgene, thunderxt88, falkor and also those developed by Arm such as the cortex-a57, cortex-a72, ares, etc. All of these have different behaviours and performance characteristics despite sharing an instruction set. For example on some microarchitecture one may prefer to emit the csel instruction instead of cmp, b.cond and multiple mov instructions to reduce code size (and hence improve performance) but on some other microarchitecture, the csel instruction may have been designed really badly and hence it may well be cheaper to execute the 4+ instructions instead of the one csel. These are behaviour quirks that you select when you use the -mtune flag to optimise for a specific CPU. A lot of this information is also available in the C code of the architecture in the form of structures called cost tables. These are relative costs of various operations that help the RTL passes and some GIMPLE passes determine the best target code and optimisation behaviour for the CPU. Here’s an example of register move costs for the Qualcomm Centriq processor:

static const struct cpu_regmove_cost qdf24xx_regmove_cost =
{
  2, /* GP2GP  */
  /* Avoid the use of int<->fp moves for spilling.  */
  6, /* GP2FP  */
  6, /* FP2GP  */
  4 /* FP2FP  */
};

This tells us that in general, moving between general purpose registers is cheaper, moving between floating point registers is slightly more expensive and moving between general purpose and floating point registers is most expensive.

The other important detail in such machine description files is the pipeline description. This is a description of how the pipeline for a specific CPU microarchitecture is designed along with latencies for instructions in that pipeline. This information is used by the instruction scheduler pass to determine the best schedule of instructions for a CPU.

Now where do I start?

This is a lot of information to page in at once and if you’re like me, you’d want something more concrete to get started with understanding what gcc is doing. gcc has you covered there because it has flags that allow you to study the IR the compiler generates at every stage of compilation. By every stage, I mean every pass! The -fdump-* set of flags allow you to dump the IR to study what gcc did in each pass. In particular, the -fdump-tree-* flags dump GIMPLE IR (in a C-like format so that it is not too complicated to read) into a file for each pass and the -fdump-rtl-* flags do the same for RTL. The GIMPLE IR can be dumped in its raw form as well (e.g. -fdump-tree-all-raw), which makes it much simpler to correlate with the code that is manipulating the GIMPLE in an optimisation pass.

The easiest way to get into gcc development (and compiler development in general) in my experience is the back end. Try tweaking the various cost tables to see what effect it has on code generation. Modify the instructions generated by the RTL descriptions and use that to look closer at one pass that interests you. Once you’re comfortable with making changes to gcc, rebuilding and checking its outputs, you can then try writing a pass, which is a slightly more involved process. Maybe I’ll write about it some day.

References

  • The GCC internals manual is the canonical place to read (and fix up) documentation for GCC internals. It is wonderfully detailed and hopelessly outdated at the same time. Bringing it up to date is a task by itself and the project continues to look for volunteers to do that.
  • David Malcolm has a more functional newbie guide for fledgling gcc hackers who might struggle with debugging gcc and getting involved in the gcc development process
  • The GCC Resource Center workshop on GCC is where I cut my teeth on gcc internals. I don’t think their workshop is active anymore but they have presentations and other literature there that is still very relevant.
  • I wrote about micro-optimisations in the past and those ideas are great to try on gcc.

Wrestling with the register allocator: LuaJIT edition

For some time now, I kept running into one specific piece of code in luajit for various reasons, and last month I came across a fascinating register allocator pitfall in it that I had never encountered before. But then, as is the norm after fixing such bugs, I concluded that it was too trivial to write about and that I was just stupid to not have found it sooner; all bugs are trivial once they’re fixed.

After getting over my imposter syndrome (yeah I know, took a while) I finally have the courage to write about it, so like a famous cook+musician says, enough jibberjabber…

Looping through a hash table

One of the key data structures that luajit uses to implement metatables is a hash table. Due to its extensive use, the lookup loop into such hash tables is optimised in the JIT using an architecture-specific asm_href function. Here is what a snippet of the arm64 version of the function looks like. A key thing to note is that the assembly code is generated backwards, i.e. the last instruction is emitted first.

  /* Key not found in chain: jump to exit (if merged) or load niltv. */
  l_end = emit_label(as);
  as->invmcp = NULL;
  if (merge == IR_NE)
    asm_guardcc(as, CC_AL);
  else if (destused)
    emit_loada(as, dest, niltvg(J2G(as->J)));

  /* Follow hash chain until the end. */
  l_loop = --as->mcp;
  emit_n(as, A64I_CMPx^A64I_K12^0, dest);
  emit_lso(as, A64I_LDRx, dest, dest, offsetof(Node, next));
  l_next = emit_label(as);

  /* Type and value comparison. */
  if (merge == IR_EQ)
    asm_guardcc(as, CC_EQ);
  else
    emit_cond_branch(as, CC_EQ, l_end);

  if (irt_isnum(kt)) {
    if (isk) {
      /* Assumes -0.0 is already canonicalized to +0.0. */
      if (k)
        emit_n(as, A64I_CMPx^k, tmp);
      else
        emit_nm(as, A64I_CMPx, key, tmp);
      emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
    } else {
      Reg ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
      emit_nm(as, A64I_FCMPd, key, ftmp);
      emit_dn(as, A64I_FMOV_D_R, (ftmp & 31), (tmp & 31));
      emit_cond_branch(as, CC_LO, l_next);
      emit_nm(as, A64I_CMPx | A64F_SH(A64SH_LSR, 32), tisnum, tmp);
      emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.n));
    }
  } else if (irt_isaddr(kt)) {
    Reg scr;
    if (isk) {
      int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
      scr = ra_allock(as, kk, allow);
      emit_nm(as, A64I_CMPx, scr, tmp);
      emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
    } else {
      scr = ra_scratch(as, allow);
      emit_nm(as, A64I_CMPx, tmp, scr);
      emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key.u64));
    }
    rset_clear(allow, scr);
  } else {
    Reg type, scr;
    lua_assert(irt_ispri(kt) && !irt_isnil(kt));
    type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
    scr = ra_scratch(as, rset_clear(allow, type));
    rset_clear(allow, scr);
    emit_nm(as, A64I_CMPw, scr, type);
    emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));
  }

  *l_loop = A64I_BCC | A64F_S19(as->mcp - l_loop) | CC_NE;

Here, the emit_* functions emit assembly instructions and the ra_* functions allocate registers. In the normal case everything is fine and the table lookup code is concise and effective. When there is register pressure however, things get interesting.

As an example, here is what a typical type lookup would look like:

0x100	ldr x1, [x16, #52]
0x104	cmp x1, x2
0x108	beq -> exit
0x10c	ldr x16, [x16, #16]
0x110	cmp x16, #0
0x114	bne 0x100

Here, x16 is the table that the loop traverses. x1 is a key which, if it matches, results in an exit to the interpreter. Otherwise the loop moves ahead until the end of the table. The comparison is done with a constant stored in x2.

The value of x2 is loaded later (i.e. earlier in the code; we are emitting code backwards, remember?) whenever that register is needed for reuse, through a process called restoration or spilling. In the restore case, it is loaded into the register as a constant or expressed in terms of another constant (look up constant rematerialisation), and in the case of a spill, the register is restored from a slot on the stack. If there is no register pressure, all of this restoration happens at the head of the trace, which is why if you study a typical trace you will notice a lot of constant loads at the top of the trace.

Like the Spill that ruined your keyboard the other day…

Things get interesting when the allocation of x2 in the loop above results in a restore. Looking at the code a bit closer:

    type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
    scr = ra_scratch(as, rset_clear(allow, type));
    rset_clear(allow, scr);
    emit_nm(as, A64I_CMPw, scr, type);
    emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));

The x2 here is type, which is a constant. If a register is not available, we have to make one available by either rematerializing or by restoring the register, which would result in something like this:

0x100   ldr x1, [x16, #52]
0x104   cmp x1, x2
0x108	mov x2, #42
0x10c   beq -> exit
0x110   ldr x16, [x16, #16]
0x114   cmp x16, #0
0x118   bne 0x100

This ends up breaking the loop because the allocator restore/spill logic assumes that the code is linear and the restore will affect only code that follows it, i.e. code that got generated earlier. To fix this, all of the register allocations should be done before the loop code is generated.

Making things right

The result of this analysis was this fix in my LuaJIT fork that allocates registers for operands that will be used in the loop before generating the body of the loop. That is, if the registers have to spill, they will do so after the loop (we are generating code in reverse order) and leave the loop compact. The fix is also in the luajit2 repository in the OpenResty project. The work was sponsored by OpenResty as this wonderfully vague bug could only be produced by some very complex scripts that are part of the OpenResty product.


A JIT in Time...

It’s been a different 3 months. For over 6 years I had been working almost exclusively on the GNU toolchain with a focus on glibc and I now had the chance of working on a completely different set of projects, something I had done a lot of during my Red Hat technical support days but not since. I was to look into Pypy, OpenJDK and LuaJIT, three very different projects with very different development styles, communities and technologies. The comparison of these projects among themselves and the GNU projects is an interesting point but not the purpose of this post, maybe some other day. In this post I want to talk about the project I spent the most time on (~1.5 months) and found to be technically the most intriguing: LuaJIT.

A Just In Time Introduction

For those new to the concept, JIT compilation techniques are pretty old and there is a very interesting paper called A brief history of just in time that does what the title states. The basic concept is quite straightforward - code written in a high level language (in the case of luajit, lua) is interpreted as usual while keeping track of which parts of the code get hit often. If a part of the code is seen to be executed repeatedly, all or part of that code is compiled into binary and mapped in, with entry and exit branches into the interpreter, also known as exit guards. There are a number of tradeoffs in designing a JIT and the paper I’ve linked above gives enough of an introduction to appreciate the complexity of the problem being solved.

The key difference from compilers is that the time required to compile is often as much a performance factor as the quality of the generated code. Due to this, one needs to be careful about the amount of processing one can do on the code to optimise it. So while gcc or llvm may end up giving higher quality code, the ~200 passes that are involved in building a TU may well end up eating up all the performance gains compiling just in time would have given.

LuaJIT: Peeking under the hood

The LuaJIT project was started and is mostly written by Mike Pall, which is apparently a pseudonym for a very private and very smart hacker. I assume that he is male given that Mike is a common male name. The source code repository is a bit odd. There is a github repository that is supposed to be official but isn’t; it is a mirror created by CloudFlare along with Mike with the aim of broadening the developer community base. That ride hasn’t been the smoothest and I’ve talked about it in more detail below. The latest code, with support for other architectures such as arm64 and ppc, is in the v2.1 branch, which has only had beta releases come off it, the last one in 2017. There are tests in a separate repository called LuaJIT-test-cleanup, which has a big fat warning that it is not the official testsuite, although if you look around, it is pretty much the only testsuite worth using for luajit.

Wait, there’s also bench_lua, which has some benchmarks and a pretty nice driver for the benchmarks, something that the LuaJIT-test-cleanup benchmarks lack.

LuaJIT uses trace compilation, which is pretty simple in concept but has some very interesting side-effects. The idea of trace compilation, specifically with luajit, follows roughly this logic:

  • Interpret the program and profile it while it is running. Typical candidates for profiling would be loops, for the obvious reason that they will likely execute repeatedly.
  • If a loop is hit repeatedly, i.e. it crosses a threshold number of iterations, the JIT compiler is invoked on its next iteration.
  • The JIT compiler first traces execution of the program and generates an IR for the trace of the program.
  • The IR then goes through some optimisation passes and finally code is generated for the desired CPU backend.

This keeps on repeating as the interpreter encounters more hotspots. The interesting bit here is that the only bit that gets compiled is the code that gets executed during the trace. So if you have a branch like so:

    if cond > threshold then
        i = i + 1
    else
        i = i - 1
    end

and the else block is executed during the trace, only that bit is compiled and not the if block. The compiled code then has branches (known as exit guards) to jump back into the interpreter if the condition is true. This produces an interesting optimisation opportunity that can be exploited during tracing itself. If cond > threshold is found to be always false, because the operands are constants or for some other reason, the if condition can be eliminated completely, which saves compilation time as well as execution time.

Another interesting side effect of tracing that is not seen in typical compilers is that function calls effectively get inlined. Again, that becomes a very cheap way to achieve something that would otherwise have been done in a separate pass in traditional compilers.

In addition to very fast tracing and compilation, all of luajit is quite compact. Its IR is based on a linear array and hence allows very fast traversal. It’s easy to visualize using the jit.* debug modules and the -jdump flag to dump the IR during execution. The luajit wiki has some pretty detailed documentation on its internals.

The coding style of the project is a bit too compact for my taste since I personally prefer writing for readability. There are a lot of constructs throughout the code that need a fair amount of squinting to understand, such as assignments inside for loop headers and inside conditions. OK, all of you pointing at the macro and makefile soups in glibc and laughing, please be quiet ;)

There’s also the infamous (at least in luajit circles) 47-bit address space limitation for garbage collected objects in luajit because luajit uses the top bits for metadata. This is known to have correctness issues with Lua userdata objects and also performance issues because luajit repeatedly tries allocations until it finds a suitable address in the 47-bit space. It doesn’t hurt x86 much (because of MAP_32BIT) but arm64 feels it and I imagine so do other architectures.
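
The general technique behind that limitation is pointer tagging: stuffing metadata into the otherwise unused upper bits of a 64-bit word. Here is a minimal sketch of the idea in C; the field widths, names and helpers are made up for illustration and are not LuaJIT’s actual layout.

    #include <stdint.h>

    /* Illustrative pointer tagging, not LuaJIT's representation: keep the low
       47 bits for the address and use the upper bits for metadata.  This only
       works if the allocation really lives below 2^47. */
    #define ADDR_BITS 47
    #define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

    static inline uint64_t box_ptr(void *p, uint64_t meta)
    {
        return ((uint64_t)(uintptr_t)p & ADDR_MASK) | (meta << ADDR_BITS);
    }

    static inline void *unbox_ptr(uint64_t boxed)
    {
        return (void *)(uintptr_t)(boxed & ADDR_MASK);
    }

    static inline uint64_t unbox_meta(uint64_t boxed)
    {
        return boxed >> ADDR_BITS;
    }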

My LuaJIT involvement

My full time involvement with luajit was brief and will likely end soon (my personal involvement may still continue) so in this short period I wanted to tick off as many short but significant work items as I could. My github fork is here.

Sameera Deshpande started the initial work and then helped me ramp up later on. We got a couple of CI instances up and running to begin with, one for the official repository and another for my github fork so that I could review my changes regularly. If you’re interested in adding a node for your architecture to the CI projects, please feel free to reach out to me, Linaro will happily add the node to the CI matrix.

Register Allocation improvements

The register allocator in luajit is pretty simple to keep the compilation overhead low. Registers are allocated sequentially based on their categories (caller saved, callee saved, etc.) and it uses some tricks such as constant rematerialization to reduce register pressure. Rematerialization is also very basic in its implementation; whenever constants need to be allocated to registers, it is preferred that they reuse existing constants (assuming their live ranges are compatible), either directly or as a constant computation. This is quite valuable because there is a fair amount of constant usage in the JITted code; exit guard addresses are coded in as constants for example, and so are floating point numbers, in addition to the usual integers. The register modes are not specified during allocation and are defined by the instructions generated in the assembly phase.

There was a bug in the luajit register allocator due to which registers used for constant rematerialization were being clobbered, resulting in corruption. A fix was proposed but the author of the fix was not sure if it was correct. I posted an alternative patch and then realized and explained why my patch was overkill and his approach was optimal. I added additional cleanups on top of that to finish it up.

While working on this problem, I noticed that the arm64 backend was not using XZR often enough and I posted a patch to fix that. I started benchmarking the improvement (the codegen was obviously better, it was saving registers for stores of zeroes for example) and quickly realized that both bench_lua and the LuaJIT-test-cleanup benchmarks were quite raw and couldn’t be relied upon for consistent results.

So I digressed.

Benchmark improvements and LuaJIT-test-cleanup cleanup

bench_lua was the benchmark project I preferred to hack on because it was evident that reviews were very hard to come by in the luajit project. Also, bench_lua had a benchmark driver that produced repeatable results but it still had some cleanup issues, including the fact that it did not have a license! The author was very responsive on the license question though and quickly put one in. I fixed some timing issues in the driver and while doing so, I realized that it might be better if I used this driver on the more extensive set of benchmarks in LuaJIT-test-cleanup. So that’s what I did.

I integrated the bench_lua driver into LuaJIT-test-cleanup and added Makefile targets so that one could easily do make check and make bench to run the tests and benchmarks. Now I had something I could work with but it was still in a different repo and it was getting quite cumbersome to work across the two.

So I integrated LuaJIT-test-cleanup into LuaJIT. Now I had a LuaJIT repository that IMO was complete and could handle the standard make/make check workflow. At the same time, it was modular enough that it could be merged into the upstream LuaJIT with relative ease. I posted all of these patches as PRs and watched as nothing happened. The LuaJIT-test-cleanup project had not seen a PR review since about 2016 and the LuaJIT project had seen occasional comments and patches from Mike in the past couple of years, but not much else.

Fusing and combining optimisations

Instruction fusion is an architecture dependent feature in luajit and each backend implements its own during the IR to assembly conversion phase, where the IR is traversed from the bottom up and assembly instructions are generated sequentially. Luajit does some trivial reordering in its IR optimisation passes but during assembly, it does not peek ahead to actively look for instruction fusion opportunities; it only tries to fuse neighbouring instructions. As a result, while there are implementations for instructions like load and store pair in arm64, they are useful only in the most trivial of tests. Likewise for fmadd/fmsub; a simple intervening load is sufficient to prevent the optimisation.

In addition to this, it is often seen that optimisations like loop unrolling and vectorisation bring in even more opportunities for combining of loads and stores. Luajit does some loop peeling but that’s about it.

Sameera did some analysis on ways to introduce more aggressive unrolling and possibly some amount of vectorisation but we did not have enough time to implement it. She did have enough time to implement some instruction fusing and using fnmadd and fnmsub for arm64. She also looked at load combining opportunities but realized that luajit would need more powerful instruction reordering, similar to the load grouping in the gcc scheduler that makes load pair generation much easier. So that project was also not small enough for us to complete in the limited time.

Casting floats to unsigned integers

The C standard defines casting of floating point types to unsigned integer types only for the range (-1.0, UTYPE_MAX), where UTYPE_MAX is the maximum value of the unsigned type. Casts to signed types work just fine as long as the number is in the range of that type. Waters get a bit murky with dynamic types and type narrowing when the default internal representation for all numbers is double. That was the situation in luajit. The fix for this was pretty straightforward in theory: add an additional cast from float to signed int and then to unsigned int for floating point values less than zero, and stick to a direct cast to unsigned int for positive numbers. I have implemented this for the interpreter and for arm64 in my fork.
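
A minimal sketch of that two-step cast in C (illustrative only, not the luajit change itself; the helper name is made up):

    #include <stdint.h>

    /* Negative doubles go through a signed cast first, non-negative ones are
       cast directly; each step stays within the ranges the C standard defines. */
    static inline uint64_t double_to_u64(double d)
    {
        if (d < 0.0)
            return (uint64_t)(int64_t)d;   /* signed cast, then wraps to unsigned */
        return (uint64_t)d;
    }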

Project state and the road ahead

LuaJIT is a very interesting project with some equally interesting concepts that I learned about in the last month or so. It has a pretty active user community that sings praises of the project and seems to advocate it in a number of areas. However, the project development itself is in a bit of a crisis.

Around 2015 Mike Pall said he wanted to step back from the project and wanted more people to get involved in the development. With that intent, Cloudflare created the github organisation and repository to allow for better collaboration. Based on conversation threads I read, things seemed to go fine when the community stepped in to create the LuaJIT-test-cleanup repository based on some initial tests Mike had written and built it up into a set of 500+ tests. However in about a year that excitement faded because nobody was made maintainer alongside Mike to carry forward the work and that meant that the LuaJIT project itself would only get sporadic fixes whenever Mike had some free time. Minor patches were accepted but bigger pieces of code went unreviewed and presumably the developers also lost interest.

Fast forward four years into 2019 and we are still in the same situation, probably worse. LuaJIT-test-cleanup has not had a patch review since 2016. LuaJIT has seen a comment or two each quarter and bug fixes with similar frequency, but not much else. The mailing list has similar traffic - I announced all of the work I did above and did not get any responses. There are forks of LuaJIT all over the place in projects such as OpenResty and RaptorJIT and those projects seem happy to let things run that way. Lua language support is in a bit of a limbo, with luajit being mostly 5.1 compliant with some 5.2 bits thrown in. Overall, it’s a great chunk of code that’s about to vanish into oblivion.

Then there is the very tricky question of copyright. The copyright notices all over the code say that Mike Pall has ownership. However, the code clearly has a number of contributions from others and there is no copyright assignment in place. While it’s likely not an issue from a licensing standpoint (IANAL, etc.), it is definitely something that needs to be addressed if the project is somehow resurrected, at the very least to give more prominent credit to contributors.

I’ve posted PRs for my work and tried to engage but I don’t have much hope given past history. I intend to spend at least some of my free time tinkering with this code since it’s just a very interesting project and there’s a lot that can be done. I am trawling the PRs and issue lists to look for patches that can be incorporated in my tree so if anyone is interested in contributing patches, you’re most welcome. I will continue to ensure that my tree applies on top of the official repository because I do not want to give up hope of the project coming back to life.


Optimizing toolchains for modern microprocessors

About 2.5 years ago I left Red Hat to join Linaro in a move that surprised even me for the first few months. I still work on the GNU toolchain with a glibc focus, but my role changed considerably. I am no longer looking at the toolchain in its entirety (although I do that on my own time whenever I can, either as glibc release manager or reviewer); my focus is making glibc routines faster for one specific server microprocessor; no prizes for guessing which processor that is. I have read architecture manuals in the past to understand specific behaviours but this is the first time that I have had to pore through the entire manual and optimization guides and try to eke out the last cycle of performance from a chip.

This post is an attempt to document my learnings and make a high level guide of the various things my team and I looked at to improve performance of the toolchain. Note that my team is continuing to work on this chip (and I continue to learn new techniques, so I may write about them later), so this ‘guide’ is more of a personal journey. I may add more follow ups or modify this post to reflect any changes in my understanding of this vast topic.

All of my examples use ARM64 assembly since that’s what I’ve been working on and translating the examples to something x86 would have discouraged me enough to not write this at all.

What am I optimizing for?

CPUs today are complicated beasts. Vulnerabilities like Spectre allude to how complicated CPU behaviour can get but in reality it can get a lot more complicated and there’s never really a universal solution to get the best out of them. Due to this, it is important to figure out what the end goal for the optimization is. For string functions for example, there are a number of different factors in play and there is no single set of behaviours that trumps over all others. For compilers in general, the number of such combinations is even higher. The solution often is to try and ensure that there is a balance and there are no exponentially worse behaviours.

The first line of defence for this is to ensure that the algorithm used for the routine does not exhibit exponential behaviour. I wrote about the algorithmic changes I made to the multiple precision fallback implementation in glibc years ago elsewhere so I’m not going to repeat that. I will however state that the first line of attack to improve any function must be algorithmic. Thankfully, barring strcmp, string routines in glibc had a fairly sound algorithmic base. strcmp fell back to a byte comparison when inputs are not mutually aligned, which is now fixed.

Large strings vs small

This is one question that gets asked very often in the context of string functions and different developers have different opinions on it, some differences even leading to flamewars in the past. One popular approach to ‘solving’ this is to quote usage of string functions in a popular benchmark and use that as a measuring stick. For a benchmark like CPU2006 or CPU2017, it means that you optimize for smaller strings because the number of calls with smaller strings is very high in those benchmarks. There are a few issues with that approach:

  • These benchmarks use glibc routines for a very small fraction of time, so you’re not going to win a lot of performance in the benchmark by improving small string performance
  • Small string operations have other factors affecting them a lot more, i.e. things like cache locality, branch predictor behaviour, prefetcher behaviour, etc. So while it might be fun to tweak behaviour exactly the way a CPU likes it, it may not end up resulting in the kind of gains you’re looking for
  • A 10K string (in theory) takes at least 10 times more cycles than a 1K string, often more. So effectively, there is 10x more incentive to look at improving performance of larger strings than smaller ones.
  • There are CPU features specifically catered for larger sequential string operations and utilizing those microarchitecture quirks will guarantee much better gains
  • There are a significant number of use cases outside of these benchmarks that use glibc far more than the SPEC benchmarks. There’s no established set of benchmarks that represent them though.

I won’t conclude with a final answer for this because there is none. This is also why I had to revisit this question for every single routine I targeted, sometimes even before I decided to target it.

Cached or not?

This is another question that comes up for string routines and the answer is actually a spectrum - a string could be cached, not cached or partially cached. What’s the safe assumption then?

There is a bit more consensus on the answer to this question. It is generally considered safe to assume that shorter strings are cached and to then focus on code scheduling and layout for that case. If the string is not cached, the cost of getting it into cache far outweighs the savings through scheduling and hence it is pointless looking at that case. For larger strings, assuming that they’re cached does not make sense due to their size. As a result, the focus for such situations should be on ensuring that cache utilization is optimal. That is, make sure that the code aids all of the CPU units that populate caches, either through a hardware prefetcher or through judiciously placed software prefetch instructions, or by avoiding caching altogether and thus avoiding evicting other hot data. Code scheduling, alignment, etc. is still important because more often than not you’ll have a hot loop that does the loads, compares, stores, etc. and once your stream is primed, you need to ensure that the loop is not suboptimal and runs without stalls.
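
As a rough illustration of the large-string case, here is a sketch of a copy loop with a software prefetch issued a few cache lines ahead of the loads. The prefetch distance, block size and function name are arbitrary example values; real routines tune them per core.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative streaming copy: prefetch well ahead of the current block so
       the loads in the hot loop find their data in cache. */
    void copy_large(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0;
        for (; i + 64 <= n; i += 64) {
            __builtin_prefetch(src + i + 256, 0, 0);  /* read, low temporal locality */
            for (size_t j = 0; j < 64; j++)
                dst[i + j] = src[i + j];
        }
        for (; i < n; i++)
            dst[i] = src[i];
    }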

My branch is more important than yours

Branch predictor units in CPUs are quite complicated and the compiler does not try to model them. Instead, it tries to do the simpler and more effective thing; make sure that the more probable branch target is reachable through sequential fetching. This is another aspect of the large vs small strings question for string functions and more often than not, smaller sizes are assumed to be more probable for hand-written assembly because it seems to be that way in practice and also because the cost of a mispredict hits the smaller size harder than it does the larger one.
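
In C this bias shows up as hints like __builtin_expect; a small sketch, with the function name and the size threshold made up for the example:

    #include <stddef.h>
    #include <string.h>

    /* Bias the layout so that the expected short-copy path stays on the
       fall-through, sequentially fetched side of the branch. */
    void *copy_biased(void *dst, const void *src, size_t n)
    {
        if (__builtin_expect(n <= 16, 1)) {
            unsigned char *d = dst;
            const unsigned char *s = src;
            while (n--)
                *d++ = *s++;           /* hot, straight-line path */
            return dst;
        }
        return memcpy(dst, src, n);    /* cold bulk path, placed out of line */
    }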

Don’t waste any part of a pig CPU

CPUs today are complicated beasts. Yes I know I started an earlier section with this exact same line; they’re complicated enough to bear repeating. However, there is a bit of relief in the fact that the first principles of their design haven’t changed much. The components of the CPU are all things we heard about in our CS class and the problem then reduces to understanding the specific quirks of the processor core. At a very high level, there are three types of quirks you look for:

  1. Something the core does exceedingly well
  2. Something the core does very badly
  3. Something the core does very well or badly under specific conditions

Typically this is made easy by CPU vendors when they provide documentation that specifies a lot of this information. Then there are cases where you discover these behaviours through profiling. Oh yes, before I forget:

Learn how to use perf or a similar tool and read its output; it will save your life.

For example, the falkor core does something interesting with respect to loads and addressing modes. Typically, a load instruction would take a specific number of cycles to fetch from L1, more if the memory is not cached, but that’s not relevant here. If you issue a load instruction with a pre/post-incrementing addressing mode, the microarchitecture issues two micro-instructions; one for the load and another that updates the base address. So:

   ldr  x1, [x2, 16]!

effectively is:

  ldr   x1, [x2, 16]
  add   x2, x2, 16

and that increases the net cost of the load. While the addressing mode saves us an instruction, it isn’t usually preferred in unrolled loops since you could skip the base address update after every load and do a single update at the end of the loop instead. With falkor however, this operation is very fast and in most cases this addressing mode is preferred for loads. The reason for this is the way its hardware prefetcher works.

Hardware Prefetcher

A hardware prefetcher is a CPU unit that speculatively loads the memory location after the location requested, in an attempt to speed things up. This forms a memory stream, and the larger the string, the more it gains from prefetching. This however also means that in the case of multiple prefetcher units in a core, one must ensure that the same prefetcher unit is hit so that the unit gets trained properly, i.e. knows what’s the next block to fetch. The way a prefetcher typically gets trained is by seeing a consistent stride in memory accesses, i.e. it sees loads of X, X+16, X+32, etc. in a sequence.

On falkor the addressing mode plays an important role in determining which hardware prefetcher unit is hit by a load, and effectively, a pre/post-incrementing load ensures that the loads hit the same prefetcher. That, combined with register renaming, means that it is much quicker to just fetch into the same register and pre/post-increment the base address than to second-guess the CPU and try to outsmart it. The memcpy and memmove routines use this quirk extensively; the falkor routines even have detailed comments explaining the basis of this behaviour.

Doing something so badly that it is easier to win

A colleague once said that the best targets for toolchain optimizations are CPUs that do things badly. There is always this one behaviour or set of behaviours that CPU designers decided to sacrifice to benefit others. On falkor for example, reading some registers with the MRS instruction is painfully slow, whereas it is close to single cycle latency on most other processors. Simply avoiding such slow paths can in itself result in tremendous performance wins; this was evident with the memset function for falkor, which became twice as fast for medium sized strings.

Another example of this is in the compiler and not glibc: using a ‘str’ instruction on 128-bit registers with the register addressing mode is very slow on falkor. Simply avoiding that instruction altogether results in pretty good gains.

CPU Pipeline

Both gcc and llvm allow you to specify a model of the CPU pipeline, i.e.

  1. The number of each type of unit the CPU has. That is, the number of load/store units, number of integer math units, number of FP units, etc.
  2. The latency for each type of instruction
  3. The number of micro-operations each instruction splits into
  4. The number of instructions the CPU can fetch/dispatch in a single cycle

and so on. This information is then used to sequence instructions in the function being optimized. It may also help the compiler choose between instructions based on how long each takes. For example, it may be cheaper to just declare a literal in the code and load from it than to construct a constant using mov/movk. Similarly, it could be cheaper to use csel to select a value to load into a register than to branch to a different piece of code that loads the register, or vice versa.
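
As a tiny illustration, a conditional expression like the one below can be lowered either as a compare-and-branch or as a single csel; the pipeline model is what tells the compiler which is cheaper on a given core (the function itself is just an example):

    #include <stdint.h>

    /* May become "cmp; csel" or "cmp; b.lt; ..." depending on the modelled cost. */
    uint64_t select_min(uint64_t a, uint64_t b)
    {
        return a < b ? a : b;
    }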

Optimal instruction sequencing can often result in significant gains. For example, interspersing load and store instructions with unrelated arithmetic instructions could allow both kinds of instructions to execute in parallel, thus saving time. On the contrary, sequencing multiple load instructions back to back could result in other units being underutilized and all instructions being serialized onto the load unit. The pipeline model allows the compiler to make an optimal decision in this regard.

Vector unit - to use or not to use, that is the question

The vector unit is this temptress that promises to double your execution rate, but it doesn’t come without cost. The most important cost is that of moving data between general purpose and vector registers and quite often this may end up eating into your gains. The cost of the vector instructions themselves may be high, or a CPU might have multiple integer units and just one SIMD unit, because of which code may get a better schedule when executed on the integer units as opposed to via the vector unit.

I had seen the opposite of this in powerpc years ago when I noticed that many of the integer operations in multiple precision math were also implemented using FP. This was because the original authors were from IBM and they had noticed a significant performance gain with that on powerpc (possibly POWER7 or earlier, given the timelines) because the CPU had 4 FP units!

Final Thoughts

This is really just the tip of the iceberg when it comes to performance optimization in toolchains and utilizing CPU quirks. There are more behaviours that could be exploited (such as aliasing behaviour in branch prediction or core topology) but the cost benefit of doing that is questionable.

Despite how much fun it is to hand-write assembly for such routines, the best approach is always to write simple enough code (yes, clever tricks might actually defeat compiler optimization passes!) that the compiler can optimize for you. If there are missed optimizations, improve compiler support for them. For glibc and aarch64, there is also the case of an impending multiarch explosion. Due to the presence of multiple vendors, having a perfectly tuned routine for each vendor may pose code maintenance problems and also secondary performance issues, like code layout in a binary and instruction cache utilization. There are some random ideas floating about for that already, like making separate text sections for vendor-specific code, but that’s something we would like to avoid doing if we can.


Across the Charles Bridge - GNU Tools Cauldron 2017

Since I joined Linaro back in 2015 around this time, my travel has gone up 3x with 2 Linaro Connects a year added to the one GNU Tools Cauldron. This year I went to FOSSAsia too, so it’s been a busy traveling year. The special thing about Cauldron though is that it is one of those conferences where I ‘work’ as well as have a lot of fun. The fun bit is because I get to meet all of the people that I work with almost every day in person and a lot of them have become great friends over the years.

I still remember the first Cauldron I went to in 2013 at Mountain View where I felt dwarfed by all of the giants I was sitting with. It was exaggerated because it was the first time I met the likes of Jeff Law, Richard Henderson, etc. in personal meetings since I had joined the Red Hat toolchain team just months before; it was intimidating and exciting all at once. That was also the first time I met Roland McGrath (I still hadn’t met Carlos, he had just had a baby and couldn’t come), someone I was terrified of back then because his patch reviews would be quite sharp and incisive. I had imagined him to be a grim old man hammering out those words from a stern laptop, so it was a surprise to see him use the same kinds of words but with a sarcastic smile, completely changing the context and tone. That was the first time I truly realized how emails often lack context. Years later, I still try to visualize people when I read their emails.

Skip to 4 years later and I was at my 5th Cauldron last week and despite my assumptions on how it would go, it was a completely new experience. A lot of it had to do with my time at Linaro and very little to do with technical growth. I felt like an equal to Linaro folks all over the world and I seemed to carry that forward here, where I felt like an equal with all of the people present, I felt like I belonged. I did not feel insecure about my capabilities (I still am intimately aware of my limitations), nor did I feel the need to constantly prove that I belonged. I was out there seeking toolchain developers (we are hiring btw, email me if you’re a fit), comfortable with the idea of leading a team. The fact that I managed to not screw up the two glibc releases I managed may also have helped :)

Oh, and one wonderful surprise was that an old friend decided to drop in at Cauldron and spend a couple of days.

This year’s Cauldron had the most technical talks submitted in recent years. We had 5 talks in the glibc area, possibly also the highest for us; just as well because we went over time in almost all of them. I won’t say that it’s a surprise since that has happened in every single year that I attended. The first glibc talk was about tunables where I briefly recapped what we have done in tunables so far and talked about the future a bit more at length. Pedro Alves suggested putting pretty printers for tunables for introspection and maybe also for runtime tuning in the coming future. There was a significant amount of interest in the idea of auto-tuning, i.e. collecting profiling data about tunable use and coming up with optimal default values and possibly even eliminating such tunables in future if we find that we have a pretty good default. We also talked about tuning at runtime and the various kinds of support that would be required to make it happen. Finally there were discussions on tuning profiles and ideas around creating performance-enhanced routines for workloads instead of CPUs. The video recording of the talk will hopefully be out soon and I’ll link the video here when it is available.

Florian then talked about glibc 3.0, a notional concept (i.e. it won’t be a soname bump) where we rewrite sections of code that have been rotting due to having to support some legacy platforms. The most prominent among them is libio, the module in glibc that implements stdio. When libio was written, it was designed to be compatible with libstdc++ so that FILE streams could be compatible with C++ iostreams. The only version of gcc that really supports that is 2.95 since libstdc++ has since moved on. However, because of the way we do things in glibc, we cannot get rid of such interfaces even if there is just one user that needs that ABI. We toyed with the concept of a separate compatibility library that becomes a graveyard for such legacy interfaces so that they don’t hold up progress in the library. It remains to be seen how this pans out, but I would definitely be happy to see this progress; libio was one of my backlog projects for years. I had to miss Raji’s talk on powerpc glibc improvements since I had to be in another meeting, so I’ll have to catch it when the video comes out.

The two BoFs for glibc dealt with a number of administrative and development issues, details of which Carlos will post on the mailing list soon. The highlights for me were the malloc instrumented benchmarks that Carlos wants to add to benchtests, and build and review tools. Once I clear up my work backlog a bit, I’ll attempt to set up something like phabricator or gerrit and see how that works out for the community instead of patchwork. I am convinced that all of the issues that we want to solve, like crediting reviewers, ensuring good git commit logs, running automated builds and tests, etc., can only be effectively solved with a proper review tool in place to review patches.

There was also a discussion on redoing the makefiles in glibc so that the build doesn’t spend so much time doing dependency resolution, but I am going to pretend that it didn’t happen because it is an ugly ugly task :/

I’m back home now, recovering from the cold that worsened while I was in Prague before I head out again in a couple of weeks to SFO for Linaro Connect. I’ve booked tickets for whale watching tours there, so hopefully I’ll be posting some pictures again after a long break.


Tunables story continued - glibc 2.26

Those of you tuned in to the wonderful world of system programming may have noticed that glibc 2.26 was released last night (or daytime if you live west of me or middle of the night/dawn if you live east of me, well you get the drift) and it came out with a host of new improvements, including the much awaited thread cache for malloc. The thread cache for malloc is truly a great step forward - it brings down latency of a bulk of allocations from hundreds of cycles to tens of cycles. The other major improvement that a bulk of users and developers will notice is the fact that glibc now detects when resolv.conf has changed and reloads the lookup configuration. Yes, this was long overdue but hey, it’s not like we were refusing patches for the past half a decade, so thank the nice soul (Florian Weimer) who actually got it done in the end.

We are not here to talk about the improvements mentioned in the NEWS. We are here to talk about an improvement that will likely have a long term impact on how optimizations are implemented in libraries. We are here to talk about…

TUNABLES!

Yes, I’m back with tunables, but this time I am not the one who did the work, it’s the wonderful people from Cavium and Intel who have started using tunables for a use case I had alluded to in my talk at Linaro Connect BKK 2016 and also in my previous blog post on tunables, which was the ability to influence IFUNCs.

IFUNCs? International functions? Intricate Functions? Impossibly ridiculous Functions?

There is a short introduction to GNU Indirect Functions on the glibc wiki that should help you get started on this very powerful yet very complicated concept. In short, ifuncs extend the GOT/PLT mechanism of loading functions from dynamic libraries to loading different implementations of the same function depending on some simple selection criteria. Traditionally this has been based on querying the CPU for features that it supports, and as a result we have had multiple variants of some very common functions such as memcpy_sse2 and memcpy_ssse3 for x86 processors that get executed based on the support declared by the processor the program is running on.
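
Here is a toy example of the mechanism using GCC’s ifunc attribute. The function names are made up and the resolver hard-codes its choice, whereas a real resolver would inspect CPU features (glibc has its own internal machinery for this).

    #include <stddef.h>
    #include <string.h>

    /* Two stand-in implementations; in real life these would be differently
       tuned variants of the same routine. */
    static void *my_copy_generic(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);
    }

    static void *my_copy_tuned(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);
    }

    /* The resolver runs once, at relocation time, and returns the
       implementation that all future calls to my_copy() will use. */
    static void *(*resolve_my_copy(void))(void *, const void *, size_t)
    {
        int have_fancy_cpu = 0;   /* a real resolver would check hwcaps here */
        return have_fancy_cpu ? my_copy_tuned : my_copy_generic;
    }

    void *my_copy(void *dst, const void *src, size_t n)
        __attribute__((ifunc("resolve_my_copy")));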

Tunables allow you to take this idea further because there are two ways to get performance benefits, (1) by utilizing all of the CPU features that help and (2) by catering to the workload. For example, you could have a workload that performs better with a supposedly sub-optimal memcpy variant for the CPU purely because of the way your data is structured or laid out. Tunables allow you to select that routine by pretending that the CPU has a different set of capabilities than it actually reports, by setting the glibc.tune.hwcaps tunable on x86 processors. Not only that, you can even tune cache sizes and non-temporal thresholds (i.e. threshold beyond which some routines use non-temporal instructions for loads and stores to optimize cache usage) to suit your workload. I won’t be surprised if some years down the line we see specialized implementations of these routines that cater to specific workloads, like memcpy_db for databases or memset_paranoid for a time invariant (or mostly invariant) implementation of memset.

Beyond x86

Here’s where another very important feature landed in glibc 2.26: multiarch support in aarch64. The ARMv8 spec is pretty standard and as a result the high level instruction set and feature set of vendor chips is pretty much the same with some minor trivial differences. However, even though the spec is standard, the underlying microarchitecture implementation could be very different and that meant that selection of instructions and scheduling differences could lead to sometimes very significant differences in performance and vendors obviously would like to take advantage of that.

The only way they could reliably (well, kind of, there should be a whole blog post for this) identify their processor variant (and hence deploy routines for their processors) was by reading the machine identification register or MIDR_EL1. If you’re familiar with aarch64 registers, you’ll notice that this register cannot be read from userspace; it can only be read by the kernel. The kernel thus had to trap and emulate this instruction, support for which has been available since Linux 4.11. In glibc 2.26, we now use MIDR_EL1 to identify which vendor processor the program is running on and deploy an optimal routine (in this case for the Cavium thunderxt88) for the processor.

But wait, what about earlier kernels, how do they take advantage of this? There’s a tunable for it! There’s glibc.tune.cpu for aarch64 that allows you to select the CPU variant you want to emulate. For some workloads you’ll find the generic memcpy actually works better and the tunable allows you to select that as well.

Finally, thanks to tunables, the much needed cleanup of LD_HWCAP_MASK happened, giving rise to the tunable glibc.tune.hwcap_mask. Tunables also eliminated a lot of the inconsistency in environment variable behaviour due to the way static and dynamic executables are initialized, so you’ll see far fewer differences in the way your applications behave when they’re built dynamically vs when they’re built statically.

Wow, that sounds good, where do I sign up for your newsletter?

The full list of hardware capability tunables is documented in the glibc manual, so take a look and feel free to hop on to the libc-help mailing list to discuss these tunables and suggest more ways in which you would like to tune the library for your workload. Remember that tunables don’t have any ABI/API guarantees for now, so they can be added or removed between releases as we deem fit. Also, your distribution may end up adding its own tunables too in future, so look out for those as well. Finally, system level tunables are coming up real soon to allow system administrators to control how users use these tunables.

Happy hacking!


The story of tunables

This is long overdue and I have finally got around to writing this. Apologies to everyone who asked me to write about it and I responded with "Oh yeah, right away!" If you are not interested in the story bits, start with So what are tunables anyway below.

The story of tunables began in 2013 when I was a relatively fresh glibc engineer in the Red Hat toolchain team. We wanted to add an environment variable to allow users to set the default stack sizes for thread stacks and Carlos took that idea to the next level with the question: How do we make this more extensible so that we have full control over the kind of tuning parameters we accept in glibc but at the same time, allow distributions to add their own tuning parameters without affecting upstream code? He asked this question in the 2013 Cauldron in Mountain View, where the famous glibc BoF happened in a tiny meeting room which overflowed into an adjacent room, which also filled up quickly, and then the BoF overran its 45 minute slot by roughly a couple of hours! Carlos joined the BoF over Hangout (I think it was called Google Talk then) because he couldn’t make it and we had a lengthy back and forth about the pros and cons of having such tuning parameters. In principle, everybody agreed that such a thing would be desirable from a maintenance perspective. However the approach for doing it was something nobody seemed to agree on.

Thus the idea of tunables was born 4 years ago, except that Carlos wrote the first wiki page and called it ‘tunnables’. He consistently spelled it tunnables and I tunables. I won in the end because I wrote the patches ;)

Jokes aside, we were happy about the reception of the idea and we went about documenting it at length. However, given that we were a two man army manning the glibc bunkers in Red Hat and the fact that upstream was still reviving itself from the post-Uli era, we would not come back to it for a while.

Then 2015 happened and it came with a memorable Cauldron in Prague. It was memorable because by then I had come up with a first draft of an API for the tunables framework. It was also memorable because it was my last month at Red Hat, something I never imagined would ever happen. I was leaving my dream team and I wasn’t sure if I would ever be as happy again. Those uncertainties were unfounded as I know now, but that’s a story for another post.

The struggle to write code

The first draft I presented at Cauldron in 2015 was really just a naive attempt at storing and initializing public values accessed across libraries in glibc and we had not even thought through everything we would end up fixing with tunables. It kinda worked, but it was never going to make the cut. A new employer meant that tunables would become a weekend project and as a result it missed the release deadline. And another, and then another. Towards the closing of every release I would whip out a patchset that would get holes poked into it and then the change would be considered too risky to include.

Finally we set a deadline of 2.25 for tunables because by then quite a few devs had started maintaining their own list of tunables on top of my tree, frustratingly rebasing every time I completely changed my approach. We made it in the end, with Florian and I working through the year end holidays to get the whole patchset in before freeze.

So as of 2.25, tunables is firmly entrenched into glibc and as we speak, there are more tunables to come, especially to override IFUNC selections and to tune the processor capability mask.

So what are tunables anyway?

This is where you start if you want the technical description and are not interested in the story bits.

Tunables is an internal implementation detail in glibc. It is a framework to manage the ways in which we allow behaviour in glibc to be modified. As of now the only way to modify glibc behaviour is via environment variables, and the code to do that used to be strewn all over the place in the source. Tunables provide one place to add a tunable parameter with all of the characteristics it would have, and the framework then handles everything from there. The user of that tunable (e.g. malloc for MALLOC_MMAP_THRESHOLD_ or malloc.mmap.threshold in tunables parlance) would then simply access the tunable from the list and do what it wants to do, without bothering about where it came from.

The framework is implemented in elf/dl-tunables.c and all of the supporting code is named as elf/dl-tunable*. As is evident, tunables is linked into the dynamic linker, where it is initialized very early. In static binaries, the initialization is done in libc-start.c, again early enough to influence almost everything in the program. The list is initialized just once and is modifiable only in the dynamic linker before it relocates itself.

The main list of tunables is maintained in elf/dl-tunables.list. Architectures may define their own tunables in sysdeps/…/dl-tunables.list. There is a README.tunables that lists out the gory details of using tunables within glibc to access its values and if necessary, update it.

This gives us a number of advantages, some of them being the following:

Single Initialization

All environment variables used by glibc would be read in by a single double-nested loop which initializes all tunables. Accesses are then just a GOT away, so no more getenv loops in glibc code. This is not fully achieved yet since not all of the environment variables have been ported to tunables (hint: here’s a nice project for you, you aspiring glibc developer!)
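
For flavour, here is a rough sketch of what such a single double-nested initialization loop looks like in C. The structure, names and the two example variables are made up for illustration; this is not glibc’s actual code.

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* One pass over the environment, one nested pass over a static list of
       known tunables, instead of getenv() calls scattered at every use site. */
    struct tunable { const char *env_name; long val; };

    static struct tunable tunables[] = {
        { "MALLOC_MMAP_THRESHOLD_", 0 },
        { "MALLOC_ARENA_MAX", 0 },
    };

    void init_tunables(char **envp)
    {
        for (char **e = envp; *e != NULL; e++)
            for (size_t i = 0; i < sizeof tunables / sizeof tunables[0]; i++) {
                size_t len = strlen(tunables[i].env_name);
                if (strncmp(*e, tunables[i].env_name, len) == 0 && (*e)[len] == '=')
                    tunables[i].val = strtol(*e + len + 1, NULL, 10);
            }
    }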

All tunables are listed in a single file

The file elf/dl-tunables.list has a full list of tunables along with their properties such as type, value range, default value and their behaviour with setuid binaries. This caused us to introspect on each environment variable we ported into tunables and we ended up fixing a few bugs as well.

Very Early Initialization

Yes, very early, earlier than you would imagine, earlier than IFUNCs! *gasp*

Tunables get initialized very early so that they can influence almost every behaviour in glibc. The unreleased 2.26 makes this even earlier (or rather, delays CPU features initialization enough) so that tunables can impact selection of routines using IFUNCs. This fixes an important inconsistency in glibc, where LD_HWCAP_MASK was read in dynamically linked binaries but not in static binaries because it was not read in early enough.

relro

The tunable list is read-only, so glibc reads from a list that cannot be tampered by malicious code that gets loaded after relocation.

What changes for me as a user?

The change in 2.25 is minimal enough that you won’t notice. In this release, only the malloc tuning environment variables have been ported to tunables and if you’ve been using those environment variables before, they will continue to work even now. In addition, you get to tune these parameters in a fancy way that doesn’t require the stupid trailing underscore, using the GLIBC_TUNABLES environment variable. The manual describes it extensively so I won’t go into details.

The major change is about to happen now. Intel is starting to push a number of tunables to allow you to tune the library to your liking, changing things like the string routines that get selected for your program, cache parameters, etc. I believe PowerPC and S390 will see something similar too in the lock elision space, and aarch64 multiarch will be tunable as well. All of this will hopefully come in 2.26 or at the latest by 2.27.

One thing to note though is that for now tunables are not covered by any ABI or API guarantees. That is to say, if you like a tunable that is in 2.26, we may well remove the tunable in 2.27 if we find that it either does not make sense to have that tunable exposed or exposing that tunable is somehow detrimental to user programs.

The big difference will likely come in when distributions start adding their own tunables into the mix, since it will allow them to add customizations to the library without having to maintain huge ugly patchsets.

The Road Ahead

The big advantage of collecting all tuning parameters under a single framework is the ability to then add new ways to influence those tuning parameters. We have environment variables now, but we could add other methods to tune the library. Some ideas discussed are as follows:

  • Have a systemwide configuration file (e.g. /etc/sysctl.user.conf) that sets different defaults for some tunables and limits the degree to which specific tunables can be altered. This allows system administrators to have more fine grained control over the processes on their systems
  • Have user-specific configuration files (e.g. $HOME/.sysctl.user.conf) that do something similar but at a user level
  • Have some tunables modified during execution via some shared memory mechanism

All of this is still evolving, so if you have an idea or would like to work on any of these ideas, feel free to get in touch with me and we can find a way to get you contributing to one of the most critical parts of the operating system!
