Hell Oh Entropy!

Life, Code and everything in between

Relocations: fantastic symbols, but where to find them?

Posted: Apr 10, 2020, 12:36

Update: There’s a Links section at the end that should give you a list of all the references you’ll ever need to understand linkers and relocations!

When I started out in compilers years ago, I found relocations especially hard to wrap my head around. They’re just simple math in the end, but they combine elements from enough different places that many (myself included, back then) assume them to be some kind of black art. The Oracle documentation on relocation processing and relocation sections is pretty much the best thing I’ve found on the internet that explains relocations, and it’s a great start if you already know what you’re doing. That is why I figured I ought to try writing something more accessible that puts some of the bits into practice. The fact that I’m currently working on relocation processing makes it that much simpler for me to just bash at the keyboard and commit the stuff in my memory to a more persistent medium before it gets swapped out to make space for kitten photos.

While this tutorial is meant to be beginner friendly, it does assume that the reader has some awareness of the ELF format, at least to the extent of knowing about the different ELF sections. It also assumes that the reader has some understanding of assembly language programming; bonus points if you know aarch64 assembly, since that’s the flavour of the examples.

Finding each other in this crowded world

The idea of relocation is quite simple: when compiling programs, we need flexibility to build components of programs independently and then have them link together. This could be in the same source file where we don’t know where parts of the source would end up, multiple source files built into different object files or sets of object files built into different libraries that reference objects in each other. This flexibility is achieved using relocations. Here’s a very simple example using aarch64 assembly:

.data
.globl somedata
somedata:
	.8byte 0x42

.text
.globl _start
_start:
	nop
	ldr	x2, somedata

This is a simple program that loads somedata into register x2. It doesn’t do much and if you try to run the program it will crash, but it is a useful example that shows an assembly source where parts of it end up in different ELF sections.

The interesting bit here is a load instruction in the text section that is reading a variable somedata from the data section. The load instruction encodes within itself the offset of somedata from itself, aka the PC-relative offset. The assembler can see both the variable and the instruction, but it cannot say for sure where they will be in memory at this point, because it does not know how far the data section will be from the text section in the final library or executable. To work around this limitation, it leaves the offset field in the ldr instruction blank so that the linker can fill it in later. It also needs to leave instructions for the linker telling it how to fill in this field.

This is where relocations come into play.

In this specific case, one may assemble the example and disassemble it using objdump -Dr to find this disassembly:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d503201f 	nop
   4:	58000002 	ldr	x2, 0 <_start>
			4: R_AARCH64_LD_PREL_LO19	somedata

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

The R_AARCH64_LD_PREL_LO19 in that output is the relocation. How did it land in there in between the instructions you ask? Well, it didn’t. The relocations are actually in a separate section of their own, as is evident with objdump -r:

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000004 R_AARCH64_LD_PREL_LO19  somedata

Even better is readelf -r, because it tells you that the relocations are essentially just a table of entries in the .rela.text section:

Relocation section '.rela.text' at offset 0x110 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000004  000500000111 R_AARCH64_LD_PREL 0000000000000000 somedata + 0

All sections with relocation entries have names prefixed with .rela or .rel, followed by the name of the section to which the relocations need to be applied. Based on these section names, it’s evident that there are two types of relocation entries, REL and RELA. In either case, the assembler leaves a number of important pieces of information for the linker: the location in memory that needs to be fixed up, the symbol whose value should be used, the offset (addend) to apply on top of that symbol and the computation that produces the final value.

All of this can be seen in the above relocation table. Each entry in the relocation table is basically a C structure of the following form for REL type relocations:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
} Elf64_Rel;

and for RELA:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
        Elf64_Sxword    r_addend;
} Elf64_Rela;

r_offset corresponds to the Offset entry in the readelf output above and is typically the memory address that needs to be fixed up. The offset from the symbol, aka the addend, is present only in RELA type relocations; it corresponds to the r_addend element in the structure and the Addend field in the readelf output. The symbol and computation related information is encoded in the r_info field.

The Elf32_Rel and Elf32_Rela structures are similar, except for the data sizes of the elements.
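To make the mapping concrete, the single entry in the .rela.text section above could be written out as the following Elf64_Rela value (the numbers are copied straight from the readelf output; this is just an illustration, not how the assembler actually emits it):

Elf64_Rela rela_somedata = {
        .r_offset = 0x4,            /* the location in .text that needs fixing up */
        .r_info   = 0x000500000111, /* which symbol and which computation (decoded below) */
        .r_addend = 0               /* no additional offset from the symbol */
};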

Symbol hunting

The ‘what to write’ is where the r_info field comes in. That’s what the linker needs to figure out before the where and how, which comes later.

For 64-bit ELF, this field is split into two 32-bit parts (for 32-bit ELF it is a single 32-bit word, split into a 24-bit symbol index and an 8-bit type). The lower part tells the linker how to perform the computation and the upper part tells the linker what the target symbol is. The upper part is a symbol ID, which is basically an index into the symbol table. In our example above, the r_info is 0x000500000111, which means that the symbol id is 0x5. We can pull out the symbol table using readelf -s:

Symbol table '.symtab' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     4: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    1 $x
     5: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    3 somedata
     6: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _start

and we find that the symbol id 0x5 is somedata and has the Value (i.e. the address of the symbol relative to its section) of 0x0.
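In code, the two halves of r_info are pulled apart with the ELF64_R_SYM and ELF64_R_TYPE macros from <elf.h>; here is a quick sketch using the value from our example:

#include <elf.h>
#include <stdio.h>

int main(void)
{
        Elf64_Xword r_info = 0x000500000111;

        /* Upper 32 bits: index into the symbol table (0x5, i.e. somedata). */
        printf("symbol index: %llu\n", (unsigned long long) ELF64_R_SYM(r_info));
        /* Lower 32 bits: the relocation type (0x111, i.e. R_AARCH64_LD_PREL_LO19). */
        printf("type:         0x%llx\n", (unsigned long long) ELF64_R_TYPE(r_info));
        return 0;
}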

Now we need to figure out how to combine all of this information using the lower part of r_info, which is 0x111. This number corresponds to R_AARCH64_LD_PREL_LO19, which has a specific meaning. An architecture defines a number of such relocations, with descriptions of how each is supposed to be applied. In the case of R_AARCH64_LD_PREL_LO19 it means “add the symbol address and the addend, and subtract from that the location of the memory address that is being fixed up”. Think about that a bit and you’ll notice that it is exactly how you would compute the PC-relative offset of somedata from the instruction, i.e. subtract the location of the fixup (i.e. the LDR instruction) from the address of somedata. In short form (and you’ll see this and similar notations used to describe relocations), it is written as S + A - P.

Putting things together

Now that we know what the target symbol is and how to compute the PC-relative offset, we need to compute the final symbol address, the final target address and then do the actual fixup. This is done near the end of the linking process in the GNU linker, when all sections have been laid out and we are finally in a position to know the relative addresses. The linker then makes the computation (i.e. S + A - P) and patches the result into the LDR instruction before writing the output to the final binary. Here is what our result looks like:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d503201f 	nop
  4000b4:	58080022 	ldr	x2, 4100b8 <somedata>

Disassembly of section .data:

00000000004100b8 <somedata>:
  4100b8:	00000042 	.inst	0x00000042 ; undefined
  4100bc:	00000000 	.inst	0x00000000 ; undefined

Notice that the encoding of the LDR instruction is now different (and as a result the instruction itself is also different): its literal offset field now holds 0x4001, which is the PC-relative offset to somedata (0x4100b8 - 0x4000b4 = 0x10004) divided by 4.
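To tie the computation and the encoding together, here is a small illustrative sketch (not the actual linker code) that performs the S + A - P computation for R_AARCH64_LD_PREL_LO19 and patches the imm19 field of the LDR (literal) instruction, using the addresses from the output above:

#include <assert.h>
#include <stdint.h>

/* Apply R_AARCH64_LD_PREL_LO19: the LDR (literal) instruction stores its
 * PC-relative offset, divided by 4, in bits [23:5] of the instruction word. */
static uint32_t apply_ld_prel_lo19(uint32_t insn, uint64_t S, int64_t A, uint64_t P)
{
        int64_t X = (int64_t)(S + A) - (int64_t)P;      /* S + A - P */

        /* A real linker also checks that X is 4-byte aligned and fits in +/-1MB. */
        return (insn & ~(0x7ffffu << 5)) | ((((uint64_t)X >> 2) & 0x7ffff) << 5);
}

int main(void)
{
        /* 0x58000002 is the unrelocated LDR from the object file; 0x4100b8 is
         * somedata, 0 the addend and 0x4000b4 the address of the instruction. */
        assert(apply_ld_prel_lo19(0x58000002, 0x4100b8, 0, 0x4000b4) == 0x58080022);
        return 0;
}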

Raising the stakes: Dynamic Relocations

This is all great when all our symbols are local and have predictable layouts, such that PC-relative relocations like R_AARCH64_LD_PREL_LO19 are sufficient to describe and resolve them between assembling and linking a program. What happens, however, when these symbol references cross boundaries of sections in ways that we cannot predict at compile or link time? What happens when your symbol references cross the boundaries of your shared object? These are problems that need to be solved to make Position Independent Code (PIC) possible. PIC is when your program (and sections within your program) could get mapped anywhere in memory and you need your code to adapt to that fact.

Take this very simple example:

.data
somedata:
        .8byte 0x42
somedata2:
	.8byte somedata

.text
.globl _start
_start:
	ret

There’s next to nothing here; just somedata like in our previous example and a somedata2 which points to somedata. However, in this next-to-nothing example lies an interesting complication that needs to be resolved at runtime: the value in somedata2 cannot be computed at compile time; it needs a fixup at runtime! Let’s walk through the compilation to see how we get to the final result. First, the disassembly to understand what the assembler did for us:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d65f03c0 	ret

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

0000000000000008 <somedata2>:
	...
			8: R_AARCH64_ABS64	.data

We see now that the address to be relocated is somedata2 in the .data section and it is of type R_AARCH64_ABS64. This is a simple relocation that instructs the linker to compute S + A to get the result, i.e. take the symbol address of somedata and add the addend (0 again in this case). This, in fact, would be the final result for a statically linked binary (using ld -static), and we’d lose the relocation in favour of the absolute address written into somedata2:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d65f03c0 	ret

Disassembly of section .data:

00000000004100b4 <somedata>:
  4100b4:	00000042 	.inst	0x00000042 ; undefined
  4100b8:	00000000 	.inst	0x00000000 ; undefined

00000000004100bc <somedata2>:
  4100bc:	004100b4 	.inst	0x004100b4 ; undefined
  4100c0:	00000000 	.inst	0x00000000 ; undefined

When linking a shared object however (i.e. ld -shared), we intend to produce a position independent DSO (dynamic shared object), and to achieve that the linker now emits a relocation that describes how to compute the final address to assign to somedata2 and where the memory address to be fixed up is located. In this example, it is the R_AARCH64_RELATIVE dynamic relocation, as seen using objdump -DR (output snipped to retain only the useful bits):

Disassembly of section .data:

0000000000011000 <somedata>:
   11000:	00000042 	.inst	0x00000042 ; undefined
   11004:	00000000 	.inst	0x00000000 ; undefined

0000000000011008 <somedata2>:
   11008:	00011000 	.inst	0x00011000 ; undefined
			11008: R_AARCH64_RELATIVE	*ABS*+0x11000
   1100c:	00000000 	.inst	0x00000000 ; undefined

This relocation is interesting not just because it is dynamic, but also because it is an S + A type relocation that puts the non-relocated address (i.e. the link time address) of somedata into its addend. This relocation also does not reference a symbol; the *ABS*+0x11000 in the output just means that the addend gets added to the offset at which this DSO is loaded during execution. It is the dynamic linker in the C runtime library (ld.so on GNU systems) that reads these relocations from the .rela.dyn section. Because this relocation is based on an address already computed by the static linker, the dynamic linker does not have to do a symbol lookup to resolve it.
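Conceptually, what the dynamic linker does with an R_AARCH64_RELATIVE entry boils down to something like the following sketch (glibc’s real implementation, in sysdeps/aarch64/dl-machine.h, does a lot more bookkeeping):

#include <elf.h>
#include <stdint.h>

/* Apply one R_AARCH64_RELATIVE relocation: no symbol lookup is needed, the
 * result is simply "load base + addend", written at "load base + r_offset". */
static void apply_relative(uint64_t load_base, const Elf64_Rela *reloc)
{
        uint64_t *where = (uint64_t *)(load_base + reloc->r_offset);

        *where = load_base + reloc->r_addend;   /* e.g. base + 0x11000 for somedata2 */
}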

The other difference from static relocations is that when a dynamic relocation references a symbol in its r_info, it is looked up in the .dynsym section, i.e. in the dynamic symbol table and not in the regular symbol table.

Final Thoughts

There are a number of other cases that the linker needs to cater for when it comes to relocations, such as entries in Global Offset Tables, resolution of indirect functions (IFUNCs) and Thread-Local Storage. Thankfully though, the first principles behind all those relocations are the same as the above and you can apply this knowledge to GOT, TLS and IFUNC relocations as well. GOT relocations, for example, reference the GOT base, which the linker knows where to find (since it sets up the .got section in the first place) and can use to compute the location to fix up. Other than this special knowledge, everything else remains pretty much the same.

Once you’re equipped with these first principles, the next task is to figure out where to find the documentation for each architecture’s specific relocations. While the binutils documentation makes some effort to document the public facing part of relocations, the detailed documentation is usually distributed by the architecture chip vendors. The AArch64 ELF documentation, for example, is hosted on the Arm website.

Links


GNU Tools Cauldron 2016, ARMv8 multi-arch edition

Posted: Sep 28, 2016, 11:49

Worst planned trip ever.

That is what my England trip for the GNU Tools Cauldron was, but that only seemed to add to the pleasure of meeting friends again. I flew in to Heathrow and started on a long train journey to Halifax, with two train changes from Reading. I forgot my phone on the train but the friendly station manager at Halifax helped track it down and got it back to me. That was the first of the many times I forgot stuff in a variety of places during this trip. Like I discovered that I forgot to carry a jacket or an umbrella. Or shorts. Or full length pants for that matter. Like I purchased an umbrella from Sainsbury’s but forgot to carry it out. I guess you got the drift of it.

All that mess aside, the conference itself was wonderful as usual. My main point of interest at the Cauldron this time was to try and make progress on discussions around multi-arch support for ARMv8. I have never talked about this on my blog in the past, so a brief introduction is in order.

What is multi-arch?

Processors evolve over time and introduce features that the C library can exploit to do work faster, like using the vector SIMD unit to speed up memory copies and manipulation. However, this is at odds with the goal of the C library to be able to run on all hardware, including processors that may not have a vector unit or may not have that specific type of vector unit (e.g. ones that have SSE4 but not AVX512 on x86). To solve this problem, we exploit the concept of the PLT and dynamic linking.

I thought we were talking about multiarch, what’s a PLT now?

When a program calls a function in a library that it links to dynamically (i.e. only references to the library and the function are present in the binary, not the function implementation), it makes the call via an indirect reference (aka a trampoline) within the binary, because it cannot know where the function entry point in another library resides in memory. The trampoline uses a table (called the Procedure Linkage Table, PLT for short) to then jump to the final location, which is the entry point of the function.

In the beginning, the entry point is set to a function in the dynamic linker (let’s call it the resolver function), which looks for the function name in the libraries that the program links to and then updates the table with the result. The dynamic linker’s resolver function can do more than just look for the exact function name in the libraries the program links to, and that is where the concept of Indirect Functions, or IFUNCs, comes into the picture.

Further down the rabbit hole - what’s an IFUNC?

When the resolver function finds the function symbol in a library, it looks at the type of the function before simply patching the PLT with its address. If it finds that the function is an IFUNC type (let’s call it the IFUNC resolver), it knows that executing that function will give the actual address of the function it should patch into the PLT. This is a very powerful idea because it now allows us to have multiple implementations of the same function built into the library for different features and then have the IFUNC resolver study its execution environment and return the address of the most appropriate function. This is fundamentally how multi-arch is implemented in glibc, where we have multiple implementations of functions like memcpy, each utilizing different features, like AVX, AVX2, SSE4 and so on. The IFUNC resolver for memcpy then queries the CPU to find the features it supports and then returns the address of the implementation best suited to the processor.
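As a rough illustration of the mechanism (not glibc’s actual memcpy selection, which is considerably more involved), here is what an IFUNC-dispatched function can look like at the source level with GCC’s ifunc attribute; all of the names and the feature check below are made up for the sketch:

#include <stddef.h>

static void *my_memcpy_generic(void *dst, const void *src, size_t n)
{
        /* Baseline byte-by-byte copy that works everywhere. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
                *d++ = *s++;
        return dst;
}

static void *my_memcpy_simd(void *dst, const void *src, size_t n)
{
        /* Stand-in for a variant using wider loads/stores or SIMD. */
        return my_memcpy_generic(dst, src, n);
}

/* The IFUNC resolver: runs once, very early during relocation processing,
 * and returns the address that gets patched into the GOT/PLT entry. */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
{
        int cpu_has_simd = 0;   /* a real resolver would query CPU features,
                                   e.g. via getauxval(AT_HWCAP) on Linux */
        return cpu_has_simd ? my_memcpy_simd : my_memcpy_generic;
}

/* Callers just call my_memcpy(); the resolver picks the implementation. */
void *my_memcpy(void *dst, const void *src, size_t n)
        __attribute__((ifunc("resolve_my_memcpy")));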

… and we’re back! Multi-arch for ARMv8

ARMv8 has been making good progress in terms of adoption and it is clear that ARM servers are going to form a significant portion of datacenters of the future. That said, major vendors of such servers with architecture licenses are trying to differentiate by innovating on the microarchitecture level. This means that a sequence of instructions may not necessarily have the same execution cost on all processors. This gives an opportunity for vendors to write optimal code sequences for key function implementations (string functions for example) for their processors and have them included in the C library. They can then use the IFUNC mechanism to identify their processors and launch the routine best suited to their processor implementation.

This is all great, except that they can’t identify their processors reliably with the current state of the kernel and glibc. The way to identify a vendor processor is to read the MIDR_EL1 and REVIDR_EL1 registers using the MRS instruction. As the register names suggest, they are readable only at exception level 1, i.e. by the kernel, which makes it impossible for glibc to read them directly, unlike on Intel processors, where the CPUID instruction is executable in userspace and is sufficient to identify the processor and its features.

… and this is only the beginning of the problem. ARM processors have a very interesting (and hence painful) feature called big.LITTLE, which allows for different processor configurations on a single die. Even if we had a way to read the two registers, you could end up reading MIDR_EL1 from one CPU and REVIDR_EL1 from another, so you need a way to ensure that both values are read from the same core.

This led to the initial proposal for kernel support to expose the information in a sysfs directory structure in addition to a trap into the kernel for the MRS instruction. This meant that for any IFUNC implementation to find out the vendor IDs of the cores on the system, it would have to traverse a whole directory structure, which is not the most optimal thing to do in an IFUNC, even if it happens only once in the lifetime of a process. As a result, we wanted to look for a better alternative.

VDSO FTW!

The number of system calls in a directory traversal would be staggering for, say, a 128-core processor, and things will undoubtedly get worse as we scale. Another way for the kernel to share this (mostly static) information with userspace is via the VDSO, with an opaque structure in userspace pages in the vdso and helper functions to traverse that structure. This however (or FS traversal for that matter) exposed a deeper problem: the extent of things we can do in an IFUNC.

An IFUNC runs very early in a dynamically linked program and even earlier in a statically linked program. As a result, there is very little that it can do because most of the complex features are not even initialized at that point. What’s more, the things you can do in a dynamic program are different from the things you can do in a static program (pretty much nothing right now in the latter), so that’s an inconsistency that is hard to reconcile. This makes the IFUNC resolvers very limited in their power and applicability, at least in their current state.

What were we talking about again?

The brief introduction turned out to be not so brief after all, but I hope it was clear. All of this fine analysis was done by Szabolcs Nagy from ARM when we first talked about multi-arch, and the conclusion was that we needed to fix and enhance IFUNC support first if we had any hope of doing micro-architecture detection for ARM. However, there is another way for now…

Tunables!

A (not so) famous person (me) once said that glibc tunables are the answer to all problems, including world hunger and, of course, the ARMv8 multi-arch problem. This was a long term idea I had shared at the Linaro Connect in Bangkok earlier this year, but it looks like it might become a reality sooner. What’s more, it seems like Intel is looking for something like that as well, so I am not alone in making this potentially insane suggestion.

The basic idea here would be to have environment variable(s) to do/override IFUNC selection via tunables until the multi-arch situation is resolved. Tunables initialization is much more lightweight and only really relies on what the kernel provides on the stack and in the auxiliary vector, and on what the CPU provides directly. It seems easier to delay IFUNC resolution at least until tunables are initialized, and then look harder at how much further it can be delayed so that resolvers can use other things like the VDSO and/or files.

So here is yet another idea that has culminated in a “just finish tunables already!” suggestion. The glibc community has agreed on setting the 2.25 release as the deadline to get this support in, so hopefully we will see some real code by then.
