Relocations: fantastic symbols, but where to find them?
Posted: Apr 10, 2020, 12:36
Update: There’s a Links section at the end that should give you a list of all the references you’ll ever need to understand Linkers and Relocations!
When I started out in compilers years ago, I found relocations especially hard to wrap my head around. They’re just simple math in the end, but they combine elements from different places in a way that makes it complicated enough that many (as I did back then) assume it to be some kind of black art. The Oracle documentation on relocation processing and relocation sections is pretty much the best thing I’ve found on the internet that explains relocations, and it’s a great start if you already know what you’re doing. This is why I figured I ought to try writing something more accessible that puts some of the bits into practice. The fact that I’m currently working on relocation processing makes it that much simpler for me to just bash at the keyboard and commit the stuff in my memory to a more persistent medium before it gets swapped out to make space for kitten photos.
While this tutorial is meant to be beginner friendly, it does assume that the reader has some awareness of the ELF format, at least to the extent of knowing about the different ELF sections. It also assumes that the reader has some understanding of assembly language programming; bonus if you know aarch64 assembly, since that’s the flavour of the examples.
Finding each other in this crowded world
The idea of relocation is quite simple: when compiling programs, we need the flexibility to build components of programs independently and then have them link together. This could be within the same source file, where we don’t know where parts of the source will end up; across multiple source files built into different object files; or across sets of object files built into different libraries that reference objects in each other. This flexibility is achieved using relocations. Here’s a very simple example using aarch64 assembly:
.data
.globl somedata
somedata:
.8byte 0x42
.text
.globl _start
_start:
nop
ldr x2, somedata
This is a simple program that loads somedata into register x2. It doesn’t do much and if you try to run the program it will crash, but it is a useful example that shows an assembly source where parts of it end up in different ELF sections.
The interesting bit here is the load instruction in the text section that reads the variable somedata from the data section. The load instruction encodes within itself the offset of somedata from itself, aka the PC-relative offset. The assembler can see both the variable and the instruction, but it cannot say for sure where they will be in memory at this point, because it does not know how far the data section will be from the text section in the final library or executable. To work around this limitation, it leaves the offset field in the ldr instruction blank so that the linker can fill it in later. It also needs to provide instructions to the linker that tell it how to fill in this field.
This is where relocations come into play.
In this specific case, one may assemble the example and disassemble it using objdump -Dr to find this disassembly:
Disassembly of section .text:
0000000000000000 <_start>:
0: d503201f nop
4: 58000002 ldr x2, 0 <_start>
4: R_AARCH64_LD_PREL_LO19 somedata
Disassembly of section .data:
0000000000000000 <somedata>:
0: 00000042 .inst 0x00000042 ; undefined
4: 00000000 .inst 0x00000000 ; undefined
The R_AARCH64_LD_PREL_LO19 in that output is the relocation. How did it land in there between the instructions, you ask? Well, it didn’t. The relocations are actually in a separate section of their own, as is evident with objdump -r:
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000004 R_AARCH64_LD_PREL_LO19 somedata
Even better is readelf -r, because it tells you that the relocations are essentially just a table of entries in the .rela.text section:
Relocation section '.rela.text' at offset 0x110 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
000000000004 000500000111 R_AARCH64_LD_PREL 0000000000000000 somedata + 0
All sections with relocation entries have names with the prefix .rela or .rel, followed by the name of the section to which the relocations need to be applied. Based on these section names, it’s evident that there are two types of relocation entries: REL and RELA. There are a number of important pieces of information the assembler leaves for the linker here:
- The memory address it needs to fix up
- Which symbol the memory address is referring to
- What is the offset from the symbol that it should finally consider as the result
- How should it perform the computation
All of this can be seen in the above relocation table. Each entry in the relocation table is basically a C structure of the following form for REL type relocations:
typedef struct {
Elf64_Addr r_offset;
Elf64_Xword r_info;
} Elf64_Rel;
and for RELA:
typedef struct {
Elf64_Addr r_offset;
Elf64_Xword r_info;
Elf64_Sxword r_addend;
} Elf64_Rela;
r_offset corresponds to the Offset entry in the readelf output above and is typically the memory address that needs to be fixed up. The offset from the symbol, aka the addend, is present only in RELA type relocations; it corresponds to the r_addend element in the structure and the Addend field in the readelf output. The symbol and computation related information is encoded in the r_info field.
The Elf32_Rel and Elf32_Rela structures are similar, except for the data sizes of the elements.
Symbol hunting
The ‘what to write’ is where the r_info field comes in. That’s what the linker needs to figure out before the where and how, which come later. This field is split into two 32-bit parts: the lower part tells the linker how to perform the computation and the upper part tells the linker what the target symbol is. (On 32-bit architectures, r_info is a single 32-bit word, split into a 24-bit symbol index and an 8-bit type.) The upper part is a symbol ID, which is basically an index into the symbol table. In our example above, r_info is 0x000500000111, which means that the symbol id is 0x5. We can pull out the symbol table using readelf -s:
Symbol table '.symtab' contains 7 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 SECTION LOCAL DEFAULT 1
2: 0000000000000000 0 SECTION LOCAL DEFAULT 3
3: 0000000000000000 0 SECTION LOCAL DEFAULT 4
4: 0000000000000000 0 NOTYPE LOCAL DEFAULT 1 $x
5: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 3 somedata
6: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 1 _start
and we find that symbol id 0x5 is somedata and has a Value (i.e. the address of the symbol relative to its section) of 0x0.
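If you want to pull these fields apart programmatically, elf.h ships macros for precisely this split. Here’s a tiny sketch that decodes the r_info value from the readelf output above:

#include <elf.h>
#include <stdio.h>

/* Decode the r_info from our .rela.text entry.  ELF64_R_SYM and
   ELF64_R_TYPE are the standard elf.h macros for the upper and lower
   halves of the field.  */
int main (void)
{
  Elf64_Xword r_info = 0x000500000111;

  printf ("symbol index: %llu\n",
          (unsigned long long) ELF64_R_SYM (r_info));   /* 5 */
  printf ("relocation type: %#llx\n",
          (unsigned long long) ELF64_R_TYPE (r_info));  /* 0x111 */
  return 0;
}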
Now we need to figure out how to combine all of this information using the lower part of r_info, which is 0x111. This number corresponds to R_AARCH64_LD_PREL_LO19, which has a specific meaning. An architecture defines a number of such relocations, with descriptions of how they’re supposed to be applied. In the case of R_AARCH64_LD_PREL_LO19 it means “add the symbol address and addend, and subtract from that the location of the memory address that is being fixed up”. Think about that a bit and you’ll notice that this is exactly how you would compute the PC-relative offset of somedata from the instruction, i.e. subtract the location of the fixup (the LDR instruction) from the address of somedata. In short form (and you’ll see this and similar notations used to describe relocations), it is written as S + A - P.
Putting things together
Now that we know what the target symbol is and how to compute the PC-relative offset, we need to compute the final symbol address, the final target address and then do the actual fixup. This is done near the end of the linking process in the GNU linker, when all sections have been laid out and we are finally in a position to know the relative addresses. The linker then makes the computation (i.e. S + A - P) and patches the result into the LDR instruction before writing the output to the final binary. Here is what our result looks like:
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: d503201f nop
4000b4: 58080022 ldr x2, 4100b8 <somedata>
Disassembly of section .data:
00000000004100b8 <somedata>:
4100b8: 00000042 .inst 0x00000042 ; undefined
4100bc: 00000000 .inst 0x00000000 ; undefined
Notice that the opcode of the LDR instruction is now different (and as a result the instruction itself is also different): its 19-bit immediate field now holds 0x4001, which is the computed difference 0x10004 from somedata, divided by 4 since the field counts 32-bit words.
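If you’d like to verify the arithmetic for yourself, here is a small sketch; this is not the linker’s code, just the same computation and fixup done by hand with the addresses from the disassembly above. The field layout comes from the AArch64 LDR (literal) encoding, which keeps its PC-relative word offset in a 19-bit immediate at bits [23:5]:

#include <stdint.h>
#include <stdio.h>

int main (void)
{
  uint64_t S = 0x4100b8;      /* address of somedata */
  int64_t A = 0;              /* addend from the relocation entry */
  uint64_t P = 0x4000b4;      /* address of the ldr being fixed up */
  uint32_t insn = 0x58000002; /* ldr x2, . with a blank offset field */

  int64_t value = S + A - P;                /* 0x10004 */
  uint32_t imm19 = (value >> 2) & 0x7ffff;  /* offset in words */

  insn |= imm19 << 5;                 /* drop it into bits [23:5] */
  printf ("%#x\n", (unsigned) insn);  /* prints 0x58080022 */
  return 0;
}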
Raising the stakes: Dynamic Relocations
This is all great when all our symbols are local and have predictable layouts, such that PC-relative relocations like R_AARCH64_LD_PREL_LO19 are sufficient to describe them and resolve them between assembling and linking a program. What happens, however, when these symbol references cross boundaries of sections in ways that we cannot predict at compile or link time? What happens when your symbol references cross the boundaries of your shared object? These are problems that need to be solved to make Position Independent Code (PIC) possible. PIC is when your program (and sections within your program) could get mapped anywhere in memory and you need your code to adapt to that fact.
Take this very simple example:
.data
somedata:
.8byte 0x42
somedata2:
.8byte somedata
.text
.globl _start
_start:
ret
There’s next to nothing here; just somedata like in our previous example, and a somedata2 which points to somedata. However, in this next-to-nothing example lies an interesting complication that needs to be resolved at runtime: the value in somedata2 cannot be computed at compile time; it needs a fixup at runtime! Let’s walk through the compilation to see how we get to the final result. First, the disassembly to understand what the assembler did for us:
Disassembly of section .text:
0000000000000000 <_start>:
0: d65f03c0 ret
Disassembly of section .data:
0000000000000000 <somedata>:
0: 00000042 .inst 0x00000042 ; undefined
4: 00000000 .inst 0x00000000 ; undefined
0000000000000008 <somedata2>:
...
8: R_AARCH64_ABS64 .data
We see now that the address to be relocated is somedata2 in the .data section and it is of type R_AARCH64_ABS64. This is a simple relocation that instructs the linker to compute S + A to get the result, i.e. get the symbol address of somedata and add the addend (0 again in this case). This in fact would be the final result for a statically linked binary (using ld -static) and we’d lose the relocation in favour of the absolute address written into somedata2:
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: d65f03c0 ret
Disassembly of section .data:
00000000004100b4 <somedata>:
4100b4: 00000042 .inst 0x00000042 ; undefined
4100b8: 00000000 .inst 0x00000000 ; undefined
00000000004100bc <somedata2>:
4100bc: 004100b4 .inst 0x004100b4 ; undefined
4100c0: 00000000 .inst 0x00000000 ; undefined
When linking a shared object however (i.e. ld -shared), we intend to produce a position independent DSO (dynamic shared object), and to achieve that the linker now emits a relocation that describes how to compute the final address to assign to somedata2 and where the memory address to be fixed up can be located. In this example, it is the R_AARCH64_RELATIVE dynamic relocation, as seen using objdump -DR (output snipped to retain only the useful bits):
Disassembly of section .data:
0000000000011000 <somedata>:
11000: 00000042 .inst 0x00000042 ; undefined
11004: 00000000 .inst 0x00000000 ; undefined
0000000000011008 <somedata2>:
11008: 00011000 .inst 0x00011000 ; undefined
11008: R_AARCH64_RELATIVE *ABS*+0x11000
1100c: 00000000 .inst 0x00000000 ; undefined
This relocation is interesting not just because it is dynamic, but also because it is a S + A type relocation that puts the non-relocated address (i.e. the link-time address) of somedata into its addend. This relocation also does not reference a symbol; instead it references an *ABS* value, which is basically the offset at which this DSO would be loaded during execution. It is the dynamic linker in the C runtime library (ld.so in GNU systems) that reads these relocations from the .rela.dyn section. Because this relocation is based on an absolute address computed by the static linker, the dynamic linker does not have to do a symbol lookup to resolve it.
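Conceptually, the dynamic linker’s job for these entries is just a loop; something like this sketch (not glibc’s actual code; load_base stands for the address at which the DSO got mapped):

#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Apply R_AARCH64_RELATIVE entries from .rela.dyn: no symbol lookup,
   just the load address plus the addend the static linker computed.  */
static void
apply_relative (uintptr_t load_base, const Elf64_Rela *rela, size_t count)
{
  for (size_t i = 0; i < count; i++)
    if (ELF64_R_TYPE (rela[i].r_info) == R_AARCH64_RELATIVE)
      *(uint64_t *) (load_base + rela[i].r_offset)
          = load_base + rela[i].r_addend;
}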
The other difference from static relocations is that when a dynamic relocation references a symbol in its r_info, that symbol is looked up in the .dynsym section, i.e. in the dynamic symbol table, and not in the regular symbol table.
Final Thoughts
There are a number of other cases that the linker needs to cater for when it comes to relocations, such as entries in Global Offset Tables, resolution of indirect functions and Thread-Local Storage. Thankfully though, the first principles behind all those relocations are the same as the above and you can apply this knowledge to GOT, TLS and IFUNC relocations as well. GOT relocations, for example, reference the GOT base, which the linker knows where to find (since it sets up the .got section in the first place) and can use that information to compute the location to fix up. Other than this special knowledge, everything else remains pretty much the same.
Once you’re equipped with these first principles, the next task is to figure out where documentation for specific relocations is for every architecture. While the binutils documentation makes some effort to document the public facing part of relocations, the detailed documentation is usually distributed by the architecture chip vendors. The AArch64 ELF documentation for example is hosted on the Arm website.
Links
- @gnutools pointed me to a full blog series by Ian Lance Taylor on Linkers, indexed by LWN
- @matt_dz maintains a comprehensive set of links on linking and loading on GitHub. This looks like everything you’ll ever need to understand ELF and linkers!
Optimizing toolchains for modern microprocessors
Posted: Feb 11, 2018, 11:37
About 2.5 years ago I left Red Hat to join Linaro in a move that surprised even me for the first few months. I still work on the GNU toolchain with a glibc focus, but my focus changed considerably. I am no longer looking at the toolchain in its entirety (although I do that on my own time whenever I can, either as glibc release manager or reviewer); my focus is making glibc routines faster for one specific server microprocessor; no prizes for guessing which processor that is. I have read architecture manuals in the past to understand specific behaviours, but this is the first time that I have had to pore through an entire manual and optimization guides and try to eke out the last cycle of performance from a chip.
This post is an attempt to document my learnings and make a high level guide of the various things me and my team looked at to improve performance of the toolchain. Note that my team is continuing to work on this chip (and I continue to learn new techniques, I may write about it later) so this ‘guide’ is more of a personal journey. I may add more follow ups or modify this post to reflect any changes in my understanding of this vast topic.
All of my examples use ARM64 assembly since that’s what I’ve been working on and translating the examples to something x86 would have discouraged me enough to not write this at all.
What am I optimizing for?
CPUs today are complicated beasts. Vulnerabilities like Spectre allude to how complicated CPU behaviour can get but in reality it can get a lot more complicated and there’s never really a universal solution to get the best out of them. Due to this, it is important to figure out what the end goal for the optimization is. For string functions for example, there are a number of different factors in play and there is no single set of behaviours that trumps over all others. For compilers in general, the number of such combinations is even higher. The solution often is to try and ensure that there is a balance and there are no exponentially worse behaviours.
The first line of defence for this is to ensure that the algorithm used for the routine does not exhibit exponential behaviour. I wrote elsewhere about algorithmic changes I made to the multiple precision fallback implementation in glibc years ago, so I’m not going to repeat that. I will however state that the first line of attack to improve any function must be algorithmic. Thankfully, barring strcmp, string routines in glibc had a fairly sound algorithmic base. strcmp fell back to a byte comparison when inputs are not mutually aligned, which is now fixed.
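To illustrate what mutual alignment buys you, here is a rough C sketch of the idea; the real strcmp implementations are assembly and also deal with byte order, page boundaries and strict-aliasing concerns, all of which this toy version waves away:

#include <stdint.h>

static int
toy_strcmp (const char *a, const char *b)
{
  /* Mutually aligned: both pointers have the same offset within a
     word, so a few byte comparisons get us to a word boundary.  */
  if ((((uintptr_t) a) & 7) == (((uintptr_t) b) & 7))
    {
      while (((uintptr_t) a) & 7)
        {
          if (*a == '\0' || *a != *b)
            return (unsigned char) *a - (unsigned char) *b;
          a++, b++;
        }
      /* Now compare a word at a time; the bit trick detects a zero
         byte (the terminator) anywhere inside a word.  */
      const uint64_t *wa = (const uint64_t *) a;
      const uint64_t *wb = (const uint64_t *) b;
      while (*wa == *wb
             && ((*wa - 0x0101010101010101ULL)
                 & ~*wa & 0x8080808080808080ULL) == 0)
        wa++, wb++;
      a = (const char *) wa;
      b = (const char *) wb;
    }
  /* Unaligned fallback; also resolves the differing word bytewise.  */
  while (*a != '\0' && *a == *b)
    a++, b++;
  return (unsigned char) *a - (unsigned char) *b;
}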
Large strings vs small
This is one question that gets asked very often in the context of string functions, and different developers have different opinions on it, some differences even leading to flamewars in the past. One popular approach to ‘solving’ this is to quote usage of string functions in a popular benchmark and use that as a measuring stick. For a benchmark like CPU2006 or CPU2017, it means that you optimize for smaller strings because the number of calls with smaller strings is very high in those benchmarks. There are a few issues with that approach:
- These benchmarks use glibc routines for a very small fraction of time, so you’re not going to win a lot of performance in the benchmark by improving small string performance
- Small string operations have other factors affecting them a lot more, i.e. things like cache locality, branch predictor behaviour, prefetcher behaviour, etc. So while it might be fun to tweak behaviour exactly the way a CPU likes it, it may not end up resulting in the kind of gains you’re looking for
- A 10K string (in theory) takes at least 10 times more cycles than a 1K string, often more. So effectively, there is 10x more incentive to look at improving performance of larger strings than smaller ones.
- There are CPU features specifically catered for larger sequential string operations and utilizing those microarchitecture quirks will guarantee much better gains
- There are a significant number of use cases outside of these benchmarks that use glibc far more than the SPEC benchmarks. There’s no established set of benchmarks that represent them though.
I won’t conclude with a final answer for this because there is none. This is also why I had to revisit this question for every single routine I targeted, sometimes even before I decided to target it.
Cached or not?
This is another question that comes up for string routines and the answer is actually a spectrum - a string could be cached, not cached or partially cached. What’s the safe assumption then?
There is a bit more consensus on the answer to this question. It is generally considered safe to assume that shorter strings are cached and to then focus on code scheduling and layout for that case. If the string is not cached, the cost of getting it into cache far outweighs the savings through scheduling, and hence it is pointless looking at that case. For larger strings, assuming that they’re cached does not make sense due to their size. As a result, the focus for such situations should be on ensuring that cache utilization is optimal. That is, make sure that the code aids all of the CPU units that populate caches, either through a hardware prefetcher or through judiciously placed software prefetch instructions, or by avoiding caching altogether, thus avoiding evicting other hot data. Code scheduling, alignment, etc. are still important because more often than not you’ll have a hot loop that does the loads, compares, stores, etc. and once your stream is primed, you need to ensure that the loop is not suboptimal and runs without stalls.
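As a heavily simplified illustration of the large-string case, here is what a copy loop with a software prefetch hint might look like in C; the 512-byte prefetch distance is a made-up number for the example and would have to be tuned per core:

#include <stddef.h>

static void
copy_big (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* Hint at the data we will need soon, once per 64-byte cache
         line.  rw=0 means the access will be a read; locality=0 means
         we do not expect to reuse the data, so don't let it evict
         hotter lines.  */
      if ((i & 63) == 0)
        __builtin_prefetch (src + i + 512, 0, 0);
      dst[i] = src[i];
    }
}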
My branch is more important than yours
Branch predictor units in CPUs are quite complicated and the compiler does not try to model them. Instead, it tries to do the simpler and more effective thing: make sure that the more probable branch target is reachable through sequential fetching. This is another aspect of the large-vs-small question for string functions, and more often than not smaller sizes are assumed to be more probable for hand-written assembly, because it seems to be that way in practice and also because the cost of a mispredict hits a small size harder than it does a large one.
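When writing C rather than assembly, you can hand the compiler this same hint with gcc’s __builtin_expect; a small sketch, with the function and the 64-byte cutoff invented purely for illustration:

#include <stddef.h>

size_t
frob (const unsigned char *s, size_t n)
{
  /* Tell gcc that small inputs dominate; it keeps this path on the
     sequential, fall-through side of the branch.  */
  if (__builtin_expect (n < 64, 1))
    return n;  /* stand-in for cheap small-size handling */

  /* The unlikely large-input path can be laid out out of the way.  */
  size_t sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += s[i];
  return sum;
}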
Don’t waste any part of a pig CPU
CPUs today are complicated beasts. Yes I know I started the previous section with this exact same line; they’re complicated enough to bear repeating. However, there is a bit of relief in the fact that the first principles of their design haven’t changed much. The components of the CPU are all things we heard about in our CS classes, and the problem then reduces to understanding the specific quirks of the processor core. At a very high level, there are three types of quirks you look for:
- Something the core does exceedingly well
- Something the core does very badly
- Something the core does very well or badly under specific conditions
Typically this is made easy by CPU vendors when they provide documentation that specifies a lot of this information. Then there are cases where you discover these behaviours through profiling. For example, the falkor core does something interesting with respect to loads and addressing modes. Typically, a load instruction takes a specific number of cycles to fetch from L1, more if memory is not cached, but that’s not relevant here. If you issue a load instruction with a pre/post-incrementing addressing mode, the microarchitecture issues two micro-instructions: one load and another that updates the base address. So:
ldr x1, [x2, 16]!
effectively is:
ldr x1, [x2, 16]
add x2, x2, 16
and that increases the net cost of the load. While it saves us an instruction, this addressing mode isn’t usually preferred in unrolled loops, since you could drop the per-load base address updates and increment the base just once at the end of the loop iteration. With falkor however, this operation is very fast and in most cases this addressing mode is preferred for loads. The reason for this is the way its hardware prefetcher works.
Hardware Prefetcher
A hardware prefetcher is a CPU unit that speculatively loads the memory location after the location requested, in an attempt to speed things up. This forms a memory stream, and the larger the string, the more it gains from prefetching. This however also means that in the case of multiple prefetcher units in a core, one must ensure that the same prefetcher unit is hit by successive loads so that the unit gets trained properly, i.e. knows what the next block to fetch is. The way a prefetcher typically gets trained is by seeing a consistent stride in memory accesses, i.e. loads of X, X+16, X+32, etc. in a sequence.
On falkor the addressing mode plays an important role in determining which hardware prefetcher unit is hit by the load; effectively, a pre/post-incrementing load ensures that the loads hit the same prefetcher. That, combined with a feature called register renaming, ensures that it is much quicker to just fetch into the same virtual register and pre/post-increment the base address than to second-guess the CPU and try to outsmart it. The memcpy and memmove routines use this quirk extensively; the falkor routines even have detailed comments explaining the basis of this behaviour.
Doing something so badly that it is easier to win
A colleague once said that the best targets for toolchain optimizations are CPUs that do things badly. There always is this one behaviour or set of behaviours that CPU designers decided to sacrifice to benefit other behaviours. On falkor for example, calling the MRS instruction for some registers is painfully slow whereas it is close to single cycle latency for most other processors. Simply avoiding such slow paths in itself could result in tremendous performance wins; this was evident with the memset function for falkor, which became twice as fast for medium sized strings.
Another example of this is in the compiler and not glibc: using a ‘str’ instruction on 128-bit registers with the register addressing mode is very slow on falkor. Simply avoiding that instruction form altogether results in pretty good gains.
CPU Pipeline
Both gcc and llvm allow you to specify a model of the CPU pipeline, i.e.
- The number of each type of unit the CPU has. That is, the number of load/store units, number of integer math units, number of FP units, etc.
- The latency for each type of instruction
- The number of micro-operations each instruction splits into
- The number of instructions the CPU can fetch/dispatch in a single cycle
and so on. This information is then used to sequence instructions in the function being optimized. It may also help the compiler choose between instructions based on how long they take. For example, it may be cheaper to just declare a literal in the code and load from it than to construct a constant using mov/movk. Similarly, it could be cheaper to use csel to select a value to load into a register than to branch to a different piece of code that loads the register, or vice versa.
Optimal instruction sequencing can often result in significant gains. For example, interspersing load and store instructions with unrelated arithmetic instructions could result in those instructions executing in parallel, thus saving time. On the contrary, sequencing multiple load instructions back to back could result in other units being underutilized and all instructions being serialized on to the load unit. The pipeline model allows the compiler to make an optimal decision in this regard.
Vector unit - to use or not to use, that is the question
The vector unit is this temptress that promises to double your execution rate, but it doesn’t come without cost. The most important cost is that of moving data between general purpose and vector registers and quite often this may end up eating into your gains. The cost of the vector instructions themselves may be high, or a CPU might have multiple integer units and just one SIMD unit, because of which code may get a better schedule when executed on the integer units as opposed to via the vector unit.
I had seen an opposite example of this in powerpc years ago, when I noticed that many of the integer operations in multiple precision math were also implemented using FP. This was because the original authors were from IBM and they had noticed a significant performance gain with that on powerpc (possibly power7 or earlier, given the timelines) because the CPU had 4 FP units!
Final Thoughts
This is really just the tip of the iceberg when it comes to performance optimization in toolchains and utilizing CPU quirks. There are more behaviours that could be exploited (such as aliasing behaviour in branch prediction or core topology) but the cost benefit of doing that is questionable.
Despite how much fun it is to hand-write assembly for such routines, the best approach is always to write simple enough code (yes, clever tricks might actually defeat compiler optimization passes!) that the compiler can optimize for you. If there are missed optimizations, improve compiler support for it. For glibc and aarch64, there is also the case of impending multiarch explosion. Due to the presence of multiple vendors, having a perfectly tuned routine for each vendor may pose code maintenance problems and also secondary issues with performance, like code layout in a binary and instruction cache utilization. There are some random ideas floating about for that already, like making separate text sections for vendor-specific code, but that’s something we would like to avoid doing if we can.
Across the Charles Bridge - GNU Tools Cauldron 2017
Posted: Sep 11, 2017, 23:16
Since I joined Linaro back in 2015 around this time, my travel has gone up 3x with 2 Linaro Connects a year added to the one GNU Tools Cauldron. This year I went to FOSSAsia too, so it’s been a busy traveling year. The special thing about Cauldron though is that it is one of those conferences where I ‘work’ as well as have a lot of fun. The fun bit is because I get to meet all of the people that I work with almost every day in person and a lot of them have become great friends over the years.
I still remember the first Cauldron I went to in 2013 at Mountain View where I felt dwarfed by all of the giants I was sitting with. It was exaggerated because it was the first time I met the likes of Jeff Law, Richard Henderson, etc. in personal meetings since I had joined the Red Hat toolchain team just months before; it was intimidating and exciting all at once. That was also the first time I met Roland McGrath (I still hadn’t met Carlos, he had just had a baby and couldn’t come), someone I was terrified of back then because his patch reviews would be quite sharp and incisive. I had imagined him to be a grim old man hammering out those words from a stern laptop, so it was a surprise to see him use the same kinds of words but with a sarcastic smile, completely changing the context and tone. That was the first time I truly realized how emails often lack context. Years later, I still try to visualize people when I read their emails.
Skip to 4 years later and I was at my 5th Cauldron last week and despite my assumptions on how it would go, it was a completely new experience. A lot of it had to do with my time at Linaro and very little to do with technical growth. I felt like an equal to Linaro folks all over the world and I seemed to carry that forward here, where I felt like an equal with all of the people present, I felt like I belonged. I did not feel insecure about my capabilities (I still am intimately aware of my limitations), nor did I feel the need to constantly prove that I belonged. I was out there seeking toolchain developers (we are hiring btw, email me if you’re a fit), comfortable with the idea of leading a team. The fact that I managed to not screw up the two glibc releases I managed may also have helped :)
Oh, and one wonderful surprise was that an old friend decided to drop in at Cauldron and spend a couple of days.
This year’s Cauldron had the most technical talks submitted in recent years. We had 5 talks in the glibc area, possibly also the highest for us; just as well because we went over time in almost all of them. I won’t say that it’s a surprise since that has happened in every single year that I attended. The first glibc talk was about tunables where I briefly recapped what we have done in tunables so far and talked about the future a bit more at length. Pedro Alves suggested putting pretty printers for tunables for introspection and maybe also for runtime tuning in the coming future. There was a significant amount of interest in the idea of auto-tuning, i.e. collecting profiling data about tunable use and coming up with optimal default values and possibly even eliminating such tunables in future if we find that we have a pretty good default. We also talked about tuning at runtime and the various kinds of support that would be required to make it happen. Finally there were discussions on tuning profiles and ideas around creating performance-enhanced routines for workloads instead of CPUs. The video recording of the talk will hopefully be out soon and I’ll link the video here when it is available.
Florian then talked about glibc 3.0, a notional concept (i.e. won’t be a soname bump) where we rewrite sections of code that have been rotting due to having to support some legacy platforms. The most prominent among them is libio, the module in glibc that implements stdio. When libio was written, it was designed to be compatible with libstdc++ so that FILE streams could be compatible with C++ stdio streams. The only version of gcc that really supports that is 2.95 since libstdc++ has since moved on. However because of the way we do things in glibc, we cannot get rid of them even if there is just one user that needs that ABI. We toyed with the concept of a separate compatibility library that becomes a graveyard for such legacy interfaces so that they don’t hold up progress in the library. It remains to be seen how this pans out, but I would definitely be happy to see this progress; libio was one of my backlog projects for years. I had to miss Raji’s talk on powerpc glibc improvements since I had to be in another meeting, so I’ll have to catch it when the video comes out.
The two BoFs for glibc dealt with a number of administrative and development issues, details of which Carlos will post on the mailing list soon. The highlights for me were the malloc instrumented benchmarks that Carlos wants to add to benchtests, and build and review tools. Once I clear up my work backlog a bit, I’ll attempt to set up something like phabricator or gerrit and see how that works out for the community instead of patchwork. I am convinced that all of the issues that we want to solve, like crediting reviewers, ensuring good git commit logs, running automated builds and tests, etc., can only be effectively solved with a proper review tool in place to review patches.
There was also a discussion on redoing the makefiles in glibc so that it doesn’t spend so much time doing dependency resolution, but I am going to pretend that it didn’t happen because it is an ugly ugly task :/
I’m back home now, recovering from the cold that worsened while I was in Prague before I head out again in a couple of weeks to SFO for Linaro Connect. I’ve booked tickets for whale watching tours there, so hopefully I’ll be posting some pictures again after a long break.
Tunables story continued - glibc 2.26
Posted: Aug 02, 2017, 23:57
Those of you tuned in to the wonderful world of system programming may have noticed that glibc 2.26 was released last night (or daytime if you live west of me, or middle of the night/dawn if you live east of me; well, you get the drift) and it came out with a host of new improvements, including the much awaited thread cache for malloc. The thread cache for malloc is truly a great step forward - it brings down the latency of the bulk of allocations from hundreds of cycles to tens of cycles. The other major improvement that a bulk of users and developers will notice is the fact that glibc now detects when resolv.conf has changed and reloads the lookup configuration. Yes, this was long overdue but hey, it’s not like we were refusing patches for the past half a decade, so thank the nice soul (Florian Weimer) who actually got it done in the end.
We are not here to talk about the improvements mentioned in the NEWS. We are here to talk about an improvement that will likely have a long term impact on how optimizations are implemented in libraries. We are here to talk about…
TUNABLES!
Yes, I’m back with tunables, but this time I am not the one who did the work, it’s the wonderful people from Cavium and Intel who have started using tunables for a use case I had alluded to in my talk at Linaro Connect BKK 2016 and also in my previous blog post on tunables, which was the ability to influence IFUNCs.
IFUNCs? International functions? Intricate Functions? Impossibly ridiculous Functions?
There is a short introduction of the GNU Indirect Functions on the glibc wiki that should help you get started on this very powerful yet very complicated concept. In short, ifuncs extend the GOT/PLT mechanism of loading functions from dynamic libraries to loading different implementations of the same function depending on some simple selection criteria. Traditionally this has been based on querying the CPU for features that it supports and as a result we have had multiple variants of some very common functions such as memcpy_sse2 and memcpy_ssse3 for x86 processors that get executed based on the support declared by the processor the program is running on.
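To make that concrete, here is a minimal sketch of an ifunc in C using gcc’s ifunc attribute (this works on ELF targets); the feature check is a stand-in for the real CPU queries, and all of the names here are made up:

#include <stddef.h>

/* Two implementations of the same interface.  */
static void *
copy_generic (void *dst, const void *src, size_t n)
{
  char *d = dst;
  const char *s = src;
  while (n--)
    *d++ = *s++;
  return dst;
}

static void *
copy_fancy (void *dst, const void *src, size_t n)
{
  /* Pretend this one uses some whizzy vector instructions.  */
  return copy_generic (dst, src, n);
}

/* Stand-in for a real feature query (CPUID, hwcaps, ...).  */
static int
cpu_has_whizzy_feature (void)
{
  return 0;
}

/* The resolver runs while relocations are being processed and picks
   the implementation that the `copy' symbol will bind to.  */
static void *(*resolve_copy (void)) (void *, const void *, size_t)
{
  return cpu_has_whizzy_feature () ? copy_fancy : copy_generic;
}

void *copy (void *, const void *, size_t)
    __attribute__ ((ifunc ("resolve_copy")));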
Tunables allow you to take this idea further because there are two ways to get performance benefits, (1) by utilizing all of the CPU features that help and (2) by catering to the workload. For example, you could have a workload that performs better with a supposedly sub-optimal memcpy variant for the CPU purely because of the way your data is structured or laid out. Tunables allow you to select that routine by pretending that the CPU has a different set of capabilities than it actually reports, by setting the glibc.tune.hwcaps tunable on x86 processors. Not only that, you can even tune cache sizes and non-temporal thresholds (i.e. threshold beyond which some routines use non-temporal instructions for loads and stores to optimize cache usage) to suit your workload. I won’t be surprised if some years down the line we see specialized implementations of these routines that cater to specific workloads, like memcpy_db for databases or memset_paranoid for a time invariant (or mostly invariant) implementation of memset.
Beyond x86
Here’s where another very important feature landed in glibc 2.26: multiarch support in aarch64. The ARMv8 spec is pretty standard and as a result the high level instruction set and feature set of vendor chips is pretty much the same with some minor trivial differences. However, even though the spec is standard, the underlying microarchitecture implementation could be very different and that meant that selection of instructions and scheduling differences could lead to sometimes very significant differences in performance and vendors obviously would like to take advantage of that.
The only way they could reliably (well, kind of, there should be a whole blog post for this) identify their processor variant (and hence deploy routines for their processors) was by reading the machine identification register, or MIDR_EL1. If you’re familiar with aarch64 registers, you’ll notice that this register cannot be read from userspace; it can only be read by the kernel. The kernel thus has to trap and emulate this instruction, support for which has been available since Linux 4.11. In glibc 2.26, we now use MIDR_EL1 to identify which vendor processor the program is running on and deploy an optimal routine (in this case for the Cavium thunderxt88) for the processor.
But wait, what about earlier kernels, how do they take advantage of this? There’s a tunable for it! There’s glibc.tune.cpu for aarch64 that allows you to select the CPU variant you want to emulate. For some workloads you’ll find the generic memcpy actually works better and the tunable allows you to select that as well.
Finally due to tunables, the much needed cleanup of LD_HWCAP_MASK happened, giving rise to the tunable glibc.tune.hwcap_mask. Tunables also eliminated a lot of the inconsistency in environment variable behaviour due to the way static and dynamic executables are initialized, so you’ll see much less differences in the way your applications behave when they’re built dynamically vs when they’re built statically.
Wow, that sounds good, where do I sign up for your newsletter?
The full list of hardware capability tunables is documented in the glibc manual, so take a look and feel free to hop on to the libc-help mailing list to discuss these tunables and suggest more ways in which you would like to tune the library for your workload. Remember that tunables don’t have any ABI/API guarantees for now, so they can be added or removed between releases as we deem fit. Also, your distribution may end up adding their own tunables too in future, so look out for those as well. Finally, system level tunables are coming up real soon to allow system administrators to control how users use these tunables.
Happy hacking!
The story of tunables
Posted: May 26, 2017, 08:33
This is long overdue and I have finally got around to writing this. Apologies to everyone who asked me to write about it and I responded with "Oh yeah, right away!" If you are not interested in the story bits, start with So what are tunables anyway below.
The story of tunables began in 2013 when I was a relatively fresh glibc engineer in the Red Hat toolchain team. We wanted to add an environment variable to allow users to set the default stack sizes for thread stacks and Carlos took that idea to the next level with the question: How do we make this more extensible so that we have full control over the kind of tuning parameters we accept in glibc but at the same time, allow distributions to add their own tuning parameters without affecting upstream code? He asked this question in the 2013 Cauldron in Mountain View, where the famous glibc BoF happened in a tiny meeting room which overflowed into an adjacent room, which also filled up quickly, and then the BoF overran its 45 minute slot by roughly a couple of hours! Carlos joined the BoF over Hangout (I think it was called Google Talk then) because he couldn’t make it and we had a lengthy back and forth about the pros and cons of having such tuning parameters. In principle, everybody agreed that such a thing would be desirable from a maintenance perspective. However the approach for doing it was something nobody seemed to agree on.
Thus the idea of tunables was born 4 years ago, except that Carlos wrote the first wiki page and called it ‘tunnables’. He consistently spelled it tunnables and I tunables. I won in the end because I wrote the patches ;)
Jokes aside, we were happy about the reception of the idea and we went about documenting it at length. However given that we were a two man army manning the glibc bunkers in Red Hat and the fact that upstream was still reviving itself from the post-Uli era meant that we would never come back to it for a while.
Then 2015 happened and it came with a memorable Cauldron in Prague. It was memorable because by then I had come up with a first draft of an API for the tunables framework. It was also memorable because it was my last month at Red Hat, something I never imagined would ever happen. I was leaving my dream team and I wasn’t sure if I would ever be as happy again. Those uncertainties were unfounded as I know now, but that’s a story for another post.
The struggle to write code
The first draft I presented at Cauldron in 2015 was really just a naive attempt at storing and initializing public values accessed across libraries in glibc, and we had not even thought through everything we would end up fixing with tunables. It kinda worked, but it was never going to make the cut. A new employer meant that tunables became a weekend project and as a result it missed the release deadline. And another, and then another. Towards the closing of every release I would whip out a patchset that would have holes poked into it, and then the change would be considered too risky to include.
Finally we set a deadline of 2.25 for tunables because by then quite a few devs had started maintaining their own list of tunables on top of my tree, frustratingly rebasing every time I completely changed my approach. We made it in the end, with Florian and I working through the year end holidays to get the whole patchset in before freeze.
So as of 2.25, tunables is firmly entrenched into glibc and as we speak, there are more tunables to come, especially to override IFUNC selections and to tune the processor capability mask.
So what are tunables anyway?
This is where you start if you want the technical description and are not interested in the story bits.
Tunables is an internal implementation detail in glibc. It is a way to uniformly manage the knobs through which we allow behaviour in glibc to be modified. As of now the only way to modify glibc behaviour is via environment variables, and the code to do that was strewn all over the place in the source. Tunables provide one place to add a tunable parameter with all of the characteristics it would have, and the framework then handles everything from there. The user of that tunable (e.g. malloc for MALLOC_MMAP_THRESHOLD_, or malloc.mmap.threshold in tunables parlance) would then simply access the tunable from the list and do what it wants to do, without bothering about where it came from.
The framework is implemented in elf/dl-tunables.c and all of the supporting code is named as elf/dl-tunable*. As is evident, tunables is linked into the dynamic linker, where it is initialized very early. In static binaries, the initialization is done in libc-start.c, again early enough to influence almost everything in the program. The list is initialized just once and is modifiable only in the dynamic linker before it relocates itself.
The main list of tunables is maintained in elf/dl-tunables.list. Architectures may define their own tunables in sysdeps/…/dl-tunables.list. There is a README.tunables that lists out the gory details of using tunables within glibc to access its values and if necessary, update it.
This gives us a number of advantages, some of them being the following:
Single Initialization
All environment variables used by glibc would be read in by a single double-nested loop which initializes all tunables. Accesses are then just a GOT away, so no more getenv loops in glibc code. This is not achieved yet, since not all of the environment variables have been ported to tunables (hint: here’s a nice project for you, you aspiring glibc developer!)
All tunables are listed in a single file
The file elf/dl-tunables.list has a full list of tunables along with their properties, such as type, value range, default value and behaviour with setuid binaries. This caused us to introspect on each environment variable we ported into tunables, and we ended up fixing a few bugs as well.
Very Early Initialization
Yes, very early, earlier than you would imagine, earlier than IFUNCs! *gasp*
Tunables get initialized very early so that they can influence almost every behaviour in glibc. The unreleased 2.26 makes this even earlier (or rather, delays CPU features initialization enough) so that tunables can impact selection of routines using IFUNCs. This fixes an important inconsistency in glibc, where LD_HWCAP_MASK was read in dynamically linked binaries but not in static binaries because it was not read in early enough.
relro
The tunable list is read-only, so glibc reads from a list that cannot be tampered by malicious code that gets loaded after relocation.
What changes for me as a user?
The change in 2.25 is minimal enough that you won’t notice. In this release, only the malloc tuning environment variables have been ported to tunables and if you’ve been using those environment variables before, they will continue to work even now. In addition, you get to tune these parameters in a fancy way that doesn’t require the stupid trailing underscore, using the GLIBC_TUNABLES environment variable. The manual describes it extensively so I won’t go into details.
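Since tunables are read once at process startup, a wrapper program can set them for a child before handing over control. A tiny sketch, where ./myapp is a placeholder and the tunable name is the malloc mmap threshold one from the manual:

#include <stdlib.h>
#include <unistd.h>

int main (void)
{
  /* Must be in the environment before exec; glibc reads it while the
     new program starts up.  */
  setenv ("GLIBC_TUNABLES", "glibc.malloc.mmap_threshold=131072", 1);
  execl ("./myapp", "myapp", (char *) NULL);
  return 1; /* only reached if exec failed */
}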
The major change is about to happen now. Intel is starting to push a number of tunables to allow you to tune your library to your liking, changing things like the string routines that get selected for your program, cache parameters, etc. I believe PowerPC and S390 will see something similar too in the lock elision space, and aarch64 multiarch will be tunable as well. All of this will hopefully come in 2.26, or latest by 2.27.
One thing to note though is that for now tunables are not covered by any ABI or API guarantees. That is to say, if you like a tunable that is in 2.26, we may well remove the tunable in 2.27 if we find that it either does not make sense to have that tunable exposed or exposing that tunable is somehow detrimental to user programs.
The big difference will likely come in when distributions start adding their own tunables into the mix, since it will allow them to add customizations to the library without having to maintain huge ugly patchsets.
The Road Ahead
The big advantage of collecting all tuning parameters under a single framework is the ability to then add new ways to influence those tuning parameters. We have environment variables now, but we could add other methods to tune the library. Some ideas discussed are as follows:
- Have a systemwide configuration file (e.g. /etc/sysctl.user.conf) that sets different defaults for some tunables and limits the degree to which specific tunables are altered. This allows systems administrators to have more fine grained control over the processes on their system
- Have user-specific configuration files (e.g. $HOME/.sysctl.user.conf) that does something similar but at a user level
- Have some tunables modified during execution via some shared memory mechanism
All of this is still evolving, so if you have an idea or would like to work on any of these ideas, feel free to get in touch with me and we can find a way to get you contributing to one of the most critical parts of the operating system!
Hello FOSSASIA: Revisiting the event *and* the first program we write in C
Posted: Mar 19, 2017, 10:15
I was at FOSSAsia this weekend to deliver a workshop on the very basics of programming. It ended a pretty rough couple of weeks for me, with travel to Budapest (for Linaro Connect) followed immediately by the travel to Singapore. It seems like I don’t travel east in the timezone very well and the effects were visible with me napping at odd hours and generally looking groggy through the weekend at Singapore. It was however all worth it because despite a number of glitches, I had some real positives to take back from the conference.
The conference
FOSSAsia had been on my list of conferences to visit due to Kushal Das telling me time and again that I’d meet interesting people there. I had proposed a talk (since I can’t justify the travel just to attend) a couple of years ago but dropped out since I could not find sponsors for my talk and FOSSAsia was not interested in sponsoring me either. Last year I met Hong at SHD Belgaum and she invited me to speak at FOSSAsia. I gladly accepted since Nisha was going to volunteer anyway. However as things turned out in the end, my talk got accepted and I found sponsorship for travel and stay (courtesy Linaro), but Nisha could not attend.
I came (I’m still in SG, waiting for my flight) half-heartedly since Nisha did not accompany me, but the travel seemed worth it in the end. I met some very interesting people and was able to deliver a workshop that I was satisfied with.
Speaking of the workshop…
I was scheduled to talk on the last day (Sunday) first thing in the morning and I was pretty sure I was going to be the only person standing, with nobody in their right minds waking up early on a Sunday for a workshop. A Sunday workshop also meant that I knew the venue and its deficiencies - the “Scientist for a Day” part of the Science Center was a disaster since it was completely open and noisy, with lunch being served right next to the room on the first day. I was wary of that, but the Sunday morning slot protected me from it and my workshop went through without such glitches.
The workshop content itself was based on an impromptu ‘workshop’ I did at FUDCon Pune 2015, but a little more organized. Here’s a blow by blow account of the talk for those who missed it, and also a reference for those who attended and would like a reference to go back to in future.
Hell Oh World
It all starts with this program. Hello World is what we all say when we are looking to learn a new language. However, after Hello World, we move up to learn the syntax of the language and then try to solve more complex user problems, ignoring the wonderful things that happened underneath Hello World to make it all happen. This session is an attempt to take a brief look into these depths. Since I am a bit of a cynic, my Hello World program is slightly different:
#include <stdio.h>

int main (void)
{
  printf ("Hell Oh World!\n");
  return 0;
}
We compile this program:
$ gcc -o helloworld helloworld.c
We can see that the program prints the result just fine:
$ ./helloworld
Hell Oh World!
But then there is so much that went into making that program work. Let’s take a look at the binary using a process called disassembling, which prints the binary program in a human-readable format - well, at least readable to humans that know assembly language programming.
$ objdump -d helloworld
We wrote only one function: main, so we should see only that. Instead however, we see many other functions present in the binary. In fact, you were lied to when they told you back in college that main() is the entry point of the program! The entry point is the function called _start, which calls a function in the GNU C Library called __libc_start_main, which in turn calls the main function. When you invoke the compiler to build the helloworld program, you’re actually running a number of commands in sequence. In general, you do the following steps:
- Preprocess the source code to expand macros and includes
- Compile the source to assembly code
- Assemble the assembly source to binary object code
- Link the code against its dependencies to produce the final binary program
Let us look at these steps one by one.
Preprocessing
gcc -E -o helloworld.i helloworld.c
Run this command instead of the first one to produce a pre-processed file. You’ll see that the resultant file has hundreds of lines of code, and among those hundreds of lines is this one line that we need: the prototype for printf, so that the compiler can identify the call to printf:
extern int printf (const char *__restrict __format, ...);
It is possible to just use this extern decl and avoid including the entire header file, but it is not good practice. The overhead of maintaining something like this is unnecessary, especially when the compiler can do the job of eliminating the unused bits anyway. We are better off just including a couple of headers and getting all declarations.
Compiling the preprocessed source
Contrary to popular belief, the compiler does not compile into binary .o - it only generates assembly code. It then calls the assembler in the binutils project to convert the assembly into object code.
$ gcc -S -o helloworld.s helloworld.i
The assembly code is now just this:
        .file   "helloworld.i"
        .section        .rodata
.LC0:
        .string "Hell Oh World!"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    $.LC0, %edi
        call    puts
        movl    $0, %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (GNU) 6.3.1 20161221 (Red Hat 6.3.1-1)"
        .section        .note.GNU-stack,"",@progbits
which is just the main function and nothing else. The interesting thing there though is that the printf function call has been replaced with puts, because the input to printf is just a string without any format specifiers, and puts is much faster than printf in such cases. This is an optimization by gcc to make code run faster. In fact, the compiler runs close to 200 optimization passes in its attempt to improve the quality of the generated assembly code. However, it does not add all of those additional functions.
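You can watch this substitution happen (and not happen) with a small experiment; compile the following with gcc -S and look at which calls the generated assembly makes:

#include <stdio.h>

int main (void)
{
  printf ("no conversions here\n"); /* gcc can lower this to puts */
  printf ("but %d here\n", 42);     /* has a conversion, stays printf */
  return 0;
}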
So does the assembler add the rest of the gunk?
Assembling the assembly
gcc -c -o helloworld.o helloworld.s
Here is how we assemble the generated assembly source into an object file. The object file can again be disassembled using objdump and we see this:
helloworld.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   bf 00 00 00 00          mov    $0x0,%edi
   9:   e8 00 00 00 00          callq  e <main+0xe>
   e:   b8 00 00 00 00          mov    $0x0,%eax
  13:   5d                      pop    %rbp
  14:   c3                      retq
which is no more than what we saw with the compiler, just in binary format. So it surely is the linker adding all of the gunk.
Putting it all together
Now that we know that it is the linker adding all of the additional stuff into helloworld, let’s look at how gcc invokes the linker. To do this, we need to add -v to the gcc command. You’ll get a lot of output, but the relevant bit is this:
$ gcc -v -o helloworld helloworld.c
...
...
/usr/libexec/gcc/x86_64-redhat-linux/6.3.1/collect2 -plugin /usr/libexec/gcc/x86_64-redhat-linux/6.3.1/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/6.3.1/lto-wrapper -plugin-opt=-fresolution=/tmp/ccEdWzG5.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o helloworld /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1 -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../.. /tmp/cc3m0We9.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/6.3.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'helloworld' '-mtune=generic' '-march=x86-64'
This is a long command, but the main points of interest are all of the object files (*.o) that get linked in: the linker concatenates them and then resolves unresolved references to functions (only puts in this case) among them and the libraries (libc.so via -lc, libgcc.so via -lgcc, and so on). To find out which of the object files has the definition of a specific function, say _start, you can disassemble each of them; you'll find that crt1.o has the definition.
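If you would rather not disassemble a dozen object files, nm can list their symbol tables directly. A quick check might look like this (the crt1.o path is where the relative path in the link command above resolves to on this Fedora system, and will vary across distributions):

$ nm /usr/lib64/crt1.o | grep _start
0000000000000000 T _start
                 U __libc_start_main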
Dynamic and static linking
Another interesting thing to note in the disassembly of the final binary is that the call is to puts@plt, which is not exactly puts. It is in reality a construct called a trampoline, which helps the code jump to the actual puts function at runtime. We need this because puts is actually present in libc.so.6, which the binary simply declares that it needs by encoding the dependency within itself. To see this, inspect the binary using the -x flag:
$ objdump -x helloworld

helloworld:     file format elf64-x86-64
helloworld
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000400430
...
Dynamic Section:
  NEEDED               libc.so.6
...
This is dynamic linking. When a program is executed, what is actually called first is the dynamic linker (ld.so), which then opens all dependent libraries, maps them into memory, and then calls the _start function in the program. During mapping, it also fills in a table of data called the Global Offset Table with the addresses of all of the external references (puts in our case) to help the trampoline jump to the correct location.
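You can also see which libraries the dynamic linker would map in without running the program; ldd prints the resolved list. The output would look roughly like this (load addresses are randomized, so yours will differ):

$ ldd helloworld
	linux-vdso.so.1 (0x00007ffc...)
	libc.so.6 => /lib64/libc.so.6 (0x00007f...)
	/lib64/ld-linux-x86-64.so.2 (0x00007f...)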
If you want to be independent of the dynamic linker, then you can link the program statically:
$ gcc -static -o helloworld helloworld.c
This will however result in a bloated program and has a number of other disadvantages: the program has to be rebuilt for every update of its dependent libraries, and performance is sub-optimal because the kernel can no longer share pages among processes for common code.
BONUS: Writing the smallest program
The basics were done with about 10 minutes to spare, so I showed how one could write the smallest program ever. In principle, the smallest program in C is:
int main (void) { return 42; }
As is evident though, this pulls in everything from the C and gcc libraries, so it is clearly hard to keep things small in C; let's try it in assembly instead. We already know that _start is the main entry point of the program, so that is the function we need to implement. To exit the program, we need to tell the kernel to exit by invoking the exit_group syscall, which has syscall number 231 (0xe7). Here is what the function looks like:
.globl _start
_start:
	mov $0xe7, %rax
	mov $0x42, %rdi
	syscall
We can build this with gcc to get a very small binary, but to do so we need to specify that we don't want to use the standard libraries:
$ gcc -o min -nostdlib min.s
The resultant file is 864 bytes, as opposed to the 8.5K binary from the C program. We can reduce this further by invoking the assembler and linker directly:
$ as -o min.o min.s
$ ld -o min min.o
This results in an even smaller binary, at 664 bytes! This is because gcc puts some extra meta information in the binary to identify its builds.
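That metadata is easy to spot: gcc passes --build-id to the linker (visible in the collect2 invocation earlier), which adds an ELF note section that the plain as + ld build lacks. Checking the gcc-built binary with readelf would show something along these lines:

$ readelf -n min

Displaying notes found in: .note.gnu.build-id
  Owner                 Data size	Description
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)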
Conclusion
At this point we ran out of time and had to cut things short. It was a fun interaction; there were even a couple of people with Macbooks, and we spotted a couple of differences in the way the linker ran due to differences in the libc, despite having the same gcc installed. I wasn't able to focus too much on the specifics of those differences and I hope they weren't a problem for the attendees using Macs. In all it was a satisfying session because the audience seemed happy to learn about all of this. It looked like many of them left with more questions (and wonderment, as I had when I first learned these things) than they came in with, and I hope they follow up and eventually participate in Open Source projects to satisfy their curiosity and learn further.
GNU Tools Cauldron 2016, ARMv8 multi-arch edition
Posted: Sep 28, 2016, 11:49
Worst planned trip ever.
That is what my England trip for the GNU Tools Cauldron was, but that only seemed to add to the pleasure of meeting friends again. I flew in to Heathrow and started on a rather long train journey to Halifax, with two train changes from Reading. I forgot my phone on the train, but the friendly station manager at Halifax helped track it down and got it back to me. That was the first of the many times I forgot stuff in a variety of places during this trip. Like when I discovered that I had forgotten to carry a jacket or an umbrella. Or shorts. Or full length pants for that matter. Like when I purchased an umbrella from Sainsbury's but forgot to carry it out. I guess you get the drift of it.
All that mess aside, the conference itself was wonderful as usual. My main point of interest at the Cauldron this time was to try and make progress on discussions around multi-arch support for ARMv8. I have never talked about this on my blog in the past, so a brief introduction is in order.
What is multi-arch?
Processors evolve over time and introduce features that the C library can exploit to do work faster, like using the vector/SIMD unit to do memory copies and manipulation faster. However, this is at odds with the C library's goal of being able to run on all hardware, including machines that may not have a vector unit or may not have that specific type of vector unit (e.g. have SSE4 but not AVX512 on x86). To solve this problem, we exploit the concepts of the PLT and dynamic linking.
I thought we were talking about multiarch, what’s a PLT now?
When a program calls a function in a library that it links to dynamically (i.e. only the reference of the library and the function are present in the binary, not the function implementation), it makes the call via an indirect reference (aka a trampoline) within the binary because it cannot know where the function entry point in another library resides in memory. The trampoline uses a table (called the Procedure Linkage Table, PLT for short) to then jump to the final location, which is the entry point of the function.
In the beginning, the entry point is set to a function in the dynamic linker (let's call it the resolver function), which looks for the function name in the libraries that the program links to and then updates the table with the result. The dynamic linker's resolver function can do more than just look for the exact function name in the libraries the program links to, and that is where the concept of Indirect Functions, or IFUNCs, comes into the picture.
Further down the rabbit hole - what’s an IFUNC?
When the resolver function finds the function symbol in a library, it looks at the type of the function before simply patching the PLT with its address. If it finds that the function is an IFUNC type (let's call it the IFUNC resolver), it knows that executing that function will give the actual address of the function it should patch into the PLT. This is a very powerful idea because it allows us to have multiple implementations of the same function built into the library for different features and then have the IFUNC resolver study its execution environment and return the address of the most appropriate function. This is fundamentally how multiarch is implemented in glibc, where we have multiple implementations of functions like memcpy, each utilizing different features, like AVX, AVX2, SSE4 and so on. The IFUNC resolver for memcpy then queries the CPU to find the features it supports and returns the address of the implementation best suited to the processor.
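As a concrete (if simplified) illustration, here is roughly what an IFUNC looks like in C on x86, where userspace feature detection works; every name below is invented for the sketch, and glibc's real multiarch resolvers are considerably more involved:

typedef void *(*memcpy_fn) (void *, const void *, unsigned long);

static void *
memcpy_generic (void *dst, const void *src, unsigned long n)
{
  char *d = dst;
  const char *s = src;
  while (n--)
    *d++ = *s++;
  return dst;
}

static void *
memcpy_avx2 (void *dst, const void *src, unsigned long n)
{
  /* Pretend this is the AVX2-optimized version; a plain byte loop keeps
     the sketch self-contained.  */
  return memcpy_generic (dst, src, n);
}

static memcpy_fn
resolve_my_memcpy (void)
{
  /* IFUNC resolvers run before constructors, so initialize the CPU
     feature data by hand before querying it.  */
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx2") ? memcpy_avx2 : memcpy_generic;
}

/* The dynamic linker runs the resolver once, while processing the IFUNC
   relocation, and patches whatever it returns into the PLT as my_memcpy.  */
void *my_memcpy (void *, const void *, unsigned long)
     __attribute__ ((ifunc ("resolve_my_memcpy")));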
… and we’re back! Multi-arch for ARMv8
ARMv8 has been making good progress in terms of adoption and it is clear that ARM servers are going to form a significant portion of datacenters of the future. That said, major vendors of such servers with architecture licenses are trying to differentiate by innovating at the microarchitecture level. This means that a sequence of instructions may not necessarily have the same execution cost on all processors. This gives vendors an opportunity to write optimal code sequences for key function implementations (string functions, for example) for their processors and have them included in the C library, using the IFUNC mechanism to identify their processors and launch the routine best suited to that processor implementation.
This is all great, except that they can't identify their processors reliably with the current state of the kernel and glibc. The way to identify a vendor processor is to read the MIDR_EL1 and REVIDR_EL1 registers using the MRS instruction. As the register names suggest, they are readable only in exception level 1, i.e. by the kernel, which makes it impossible for glibc to read them directly, unlike on Intel processors where the CPUID instruction is executable in userspace and is sufficient to identify the processor and its features.
… and this is only the beginning of the problem. ARM processors have a very interesting (and hence painful) feature called big.LITTLE, which allows for different processor configurations on a single die. Even if we have a way to read the two registers, you could end up reading MIDR_EL1 from one CPU and REVIDR_EL1 from another, so you need a way to ensure that both values are read from the same core.
This led to the initial proposal for kernel support to expose the information in a sysfs directory structure in addition to a trap into the kernel for the MRS instruction. This meant that for any IFUNC implementation to find out the vendor IDs of the cores on the system, it would have to traverse a whole directory structure, which is not the most optimal thing to do in an IFUNC, even if it happens only once in the lifetime of a process. As a result, we wanted to look for a better alternative.
VDSO FTW!
The number of system calls in a directory traversal would be staggering for, say, a 128 core system, and things will undoubtedly get worse as we scale. Another way for the kernel to share this (mostly static) information with userspace is via the VDSO, with an opaque structure in userspace pages in the vdso and helper functions to traverse that structure. This however (or FS traversal, for that matter) exposed a deeper problem: the extent of things we can do in an IFUNC.
An IFUNC resolver runs very early in a dynamically linked program and even earlier in a statically linked program. As a result, there is very little that it can do, because most of the complex features are not even initialized at that point. What's more, the things you can do in a dynamic program are different from the things you can do in a static one (pretty much nothing right now in the latter), so that's an inconsistency that is hard to reconcile. This makes IFUNC resolvers very limited in their power and applicability, at least in their current state.
What were we talking about again?
The brief introduction turned out to be not so brief after all, but I hope it was clear. All of this fine analysis was done by Szabolcs Nagy from ARM when we first talked about multi-arch, and the conclusion was that we needed to fix and enhance IFUNC support if we had any hope of doing micro-architecture detection for ARM. However, there is another way for now…
Tunables!
A (not so) famous person (me) once said that glibc tunables are the answer to all problems including world hunger and of course, the ARMv8 multi-arch problem. This was a long term idea I had shared at the Linaro Connect in Bangkok earlier this year, but it looks like it might become a reality sooner. What’s more, it seems like Intel is looking for something like that as well, so I am not alone in making this potentially insane suggestion.
The basic idea here would be to have environment variable(s) to do/override IFUNC selection via tunables until the multi-arch situation is resolved. Tunables initialization is much more lightweight and only really relies on what the kernel provides on the stack and in the auxiliary vector, and on what the CPU provides directly. It seems easier to delay IFUNC resolution at least until tunables are initialized, and then look harder at how much further it can be delayed so that resolvers can use other things like the VDSO and/or files.
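For a flavour of what this could look like: tunables did eventually ship behind the GLIBC_TUNABLES environment variable, and a hypothetical invocation overriding CPU feature selection with today's tunable names (not the 2016 proposal; ./myprog is a placeholder) might be:

$ GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2 ./myprog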
So here is yet another idea that has culminated in a “just finish tunables already!” suggestion. The glibc community has agreed on the 2.25 release as the deadline to get this support in, so hopefully we will see some real code this time.
Understanding malloc behaviour using Systemtap userspace probes
Posted: Oct 02, 2014, 22:03
A blog post I wrote on Understanding malloc behaviour using Systemtap userspace probes on the Red Hat Developer Blog has now been published. I got a query about a follow-up post with example usage, which I hope to be able to work on soon-ish.
Buggy HLE, microcode updates and SIGILLs
Posted: Sep 26, 2014, 09:32
Update: Disabling lock elision in glibc doesn’t seem to be sufficient. Either way, the Fedora kernel folks will have an update in place to update the microcode early by default so that both the kernel and the first instantiation of pthreads will see HLE disabled. So read the story as something interesting that we did but didn’t quite work. It was fun though…
Amit and I ran into an interesting problem today with his new Haswell processor based system. A fully updated Fedora 21 alpha would fail during boot and fall into the maintainer shell. The systemd journal showed that systemd-udevd was crashing with a SIGILL, which seemed strange. The core dump revealed the problem:
(gdb) x/i $rip
=> 0x7f68b0b978ba <pthread_rwlock_rdlock+186>: xbeginq 0x7f68b0b978c0 <pthread_rwlock_rdlock+192>
The xbeginq instruction is an HLE instruction, so the first thing that came to mind was the recent errata that Intel pushed out, effectively announcing that HLE was buggy and that they were going to disable it soon. We looked at /proc/cpuinfo expecting to find hle and rtm missing, but were even more confused to find that they were present.
After much tinkering about, Amit made a vague reference to microcode_ctl being able to change CPU microcode on the fly. It took a while to hit us, but we finally realized that we had found the culprit: microcode_ctl had been updated with the latest Intel microcode. We initially thought that it ought to be a one-time problem, since the microcode would be flashed into the CPU and everything would work thereafter, but then we found out that the microcode needs to be flashed on every boot.
So the root cause was that the microcode update would happen late enough that systemd was already up and had read the hle bit, thus enabling lock elision support in systemd. Also, since the kernel had already read in the CPU capabilities, it did not have the updated capabilities either, due to which we continued seeing hle and rtm set in cpuinfo.
As a result, thanks to the microcode update, all Haswell based F21 alpha systems are essentially unbootable. Carlos is now fixing this by disabling lock elision completely in the glibc build. Work is in progress for rawhide, F21 and F20 as I write this, so the impact of this will hopefully be minimal. If you do run into this problem, all you have to do is downgrade the microcode_ctl package and pin it so that it doesn’t get updated till the glibc update becomes available.
NOTABUG in glibc
Posted: Sep 18, 2014, 02:16
The glibc malloc implementation has a number of heap consistency checks in place to ensure that memory corruption bugs in programs are caught as early as possible and the program aborted to prevent misuse of the bug. Memory corruption through buffer overruns (or underruns) is often an exploit vector waiting to be ‘used’, which is why these consistency checks and aborts are necessary.
If the heap of a program has been found to be corrupted, the program is terminated with an error that usually looks something like this:
*** glibc detected *** ./foo: double free or corruption (!prev): 0x0000000001362010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x78a96)[0x7f3df63aea96]
/lib64/libc.so.6(cfree+0x6c)[0x7f3df63b2d7c]
./foo[0x400e7c]
/lib64/libc.so.6(__libc_start_main+0xed)[0x7f3df635730d]
./foo[0x4008f9]
======= Memory map: ========
and when one looks at the core dump, the top of the call stack is all inside glibc:
Program terminated with signal 6, Aborted.
#0  0x00007fd0273b6925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007fd0273b6925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fd0273b8105 in abort () at abort.c:92
#2  0x00007fd0273f4837 in __libc_message (do_abort=2, fmt=0x7fd0274dcaa0 "n not possible due to RF-kill") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x00007fd0273fa166 in malloc_printerr (action=3, str=0x7fd0274daa5e "/proc/self/maps", ptr=<optimized out>) at malloc.c:6332
#4  0x00007fd0273fdf9a in _int_malloc (av=0x7fd027713e80, bytes=<optimized out>) at malloc.c:4673
The common mistake one may make here is to assume that it is a glibc bug because the crash is ‘caused’ by glibc. That is the equivalent of killing the whistleblower. The crash is indeed caused by glibc, but the bug is not in glibc. glibc has only caught the bug after it has happened and halted execution of the program.
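For illustration, here is about the smallest (hypothetical, not from any real report) program that trips these checks; running it aborts with a ‘double free or corruption’ message much like the one above:

#include <stdlib.h>

int
main (void)
{
  char *p = malloc (32);
  free (p);
  free (p);   /* the bug is here, in the program; glibc merely detects it */
  return 0;
}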
And if you think glibc is overstepping its bounds by halting the program, you could tell it not to abort by exporting the MALLOC_CHECK_ environment variable set to either 0 (completely silent) or 1 (prints the message on stderr). Of course, you have to be smoking something very exotic to do that instead of finding and fixing the bug.
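For completeness, silencing the whistleblower for the hypothetical ./foo above would look like this (once more: find and fix the bug instead):

$ MALLOC_CHECK_=0 ./foo    # ignore the corruption entirely
$ MALLOC_CHECK_=1 ./foo    # print the report on stderr but keep going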