Hell Oh Entropy!

Relocations: fantastic symbols, but where to find them?

Posted: Apr 10, 2020, 12:36

Update: There’s a Links section at the end that should give you a list of all the reference you’ll ever need to undrstand Linkers and Relocations!

When I started out in compilers years ago, I found relocations especially hard to wrap my head around. They’re just simple math in the end but they combine elements from different places that make it complicated enough that many (as did I back then) assume it to be some kind of black art. The Oracle documentation on relocation processing and relocation sections is pretty much the best thing I’ve found on the internet that explains relocations and they’re a great start if you already know what you’re doing. This is why I figured I ought to try writing something more accessible that puts some of the bits into practice. The fact that I’m currently working on relocation processing makes it that much simpler for me to just bash at the keyboard and commit the stuff in my memory to a more persistent medium before it gets swapped out to make space for kitten photos.

While this tutorial is meant to be beginner friendly, it does assume though that the reader has some awareness of the ELF format, at least to the extent of knowing about different ELF sections. It also assumes that the reader has some undersanding of assembly language programming, bonus if you know aarch64 assembly since that’s the flavour of the examples.

Finding each other in this crowded world

The idea of relocation is quite simple: when compiling programs, we need flexibility to build components of programs independently and then have them link together. This could be in the same source file where we don’t know where parts of the source would end up, multiple source files built into different object files or sets of object files built into different libraries that reference objects in each other. This flexibility is achieved using relocations. Here’s a very simple example using aarch64 assembly:

.data
.globl somedata
somedata:
	.8byte 0x42

.text
.globl start
_start:
	nop
	ldr	x2, somedata

This is a simple program that loads somedata into register x2. It doesn’t do much and if you try to run the program it will crash, but it is a useful example that shows an assembly source where parts of it end up in different ELF sections.

The interesting bit here is a load instruction in the text section that is reading a variable somedata from the data section. The load instruction encodes within itself, the offset of somedata from itself, aka the PC-relative offset. The assembler can see both, the variable and the instruction but it cannot say for sure where they will be in memory at this point, because it does not know how far the data section will be from the text section in the final library or executable. To work around this limitation, it needs to leave the offset field in the ldr instruction blank so that the linker can finally fill it in. It also needs to provide instructions to the linker to tell it how to fill in this field.

This is where relocations come into play.

In this specific case, one may assemble the example and disassemble it using objdump -Dr to find this disassembly:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d503201f 	nop
   4:	58000002 	ldr	x2, 0 <_start>
			4: R_AARCH64_LD_PREL_LO19	somedata

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

The R_AARCH64_LD_PREL_LO19 in that output is the relocation. How did it land in there in between the instructions you ask? Well, it didn’t. The relocations are actually in a separate section of their own, as is evident with objdump -r:

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000004 R_AARCH64_LD_PREL_LO19  somedata

even better with readelf -r because it tells you that the relocations are essentially just a table of entries in the .rela.text section:

Relocation section '.rela.text' at offset 0x110 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000004  000500000111 R_AARCH64_LD_PREL 0000000000000000 somedata + 0

All sections with relocation entries have names with prefix .rela or .rel followed by the name of the section for which the relocations need to be applied. Based on these section names, it’s evident that there are two types of relocation entries, REL and RELA. There are a number of important pieces of information the assembler leaves for the linker here:

The memory address it needs to fix up
Which symbol the memory address is referring to
What is the offset from the symbol that it should finally consider as the result
How should it perform the computation

All of this can be seen in the above relocation table. Each entry in the relocation table is basically a C structure of the following form for REL type relocations:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
} Elf64_Rel;

and for RELA:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
        Elf64_Sxword    r_addend;
} Elf64_Rela;

r_offset corresponds to the Offset entry in the readelf output above and is typically the memory address that needs to be fixed up. The offset from the symbol, aka the addendum is present only in RELA type relocations and it corresponds to the r_addend element in the structure and the Addend field in the readelf output. The symbol and computation related information is encoded in the r_info field.

The Elf32_Rel and Elf32_Rela structures are similar, except for the data sizes of the elements.

Symbol hunting

The ‘what to write’ is where the r_info field comes in. That’s what the linker needs to figure out before the where and how, which comes later.

This field is split into two 32-bit parts (16-bit for 32-bit architectures). The lower part tells the linker how to perform the computation and the upper part tells the linker what the target symbol is. The upper part is a symbol ID, which basically is an index into the symbol table. In our example above, the r_info is 0x000500000111, which means that the symbol id is 0x5. We can pull out the symbol table using readelf -s:

Symbol table '.symtab' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     4: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    1 $x
     5: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    3 somedata
     6: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _start

and we find that the symbol id 0x5 is somedata and has the Value (i.e. the address of the symbol relative to its section) of 0x0.

Now we need to figure out how to combine all of this information together using the lower part of r_info, which is 0x111. This number corresponds to R_AARCH64_PREL_LO19, which has a specific meaning. An architecture defines a number of such relocations with descriptions of how they’re supposed to be applied. In case of R_AARCH64_PREL_LO19 it means “Add the symbol address and addend and subtract from it, the location of the memory address that is being fixed up”. Think about that a bit and you’ll notice that it is how you would compute the PC-relative offset of somedata from the instruction, i.e. subtract the location of the fixup (i.e. the LDR instruction) from the address of somedata. In short form (and you’ll see this and similar notations to describe relocations), it is written as S + A - P.

Putting things together

Now that we know what the target symbol is and how to compute the pc-relative offset, we need to compute the final symbol address, the final target address and then do the actual fixup. This is done near the end of the linking process in the GNU linker, when all sections have been laid out and we finally are in a position to know the relative addresses. The linker will then make the computation (i.e. S+A-P) and patch in the result into the LDR instruction before writing the output to the final binary. Here is what our result looks like:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d503201f 	nop
  4000b4:	58080022 	ldr	x2, 4100b8 <somedata>

Disassembly of section .data:

00000000004100b8 <somedata>:
  4100b8:	00000042 	.inst	0x00000042 ; undefined
  4100bc:	00000000 	.inst	0x00000000 ; undefined

Notice that the opcode of the LDR instuction is now different (and as a result the instruction itself is also different) and includes the 0x8002, which is basically the encoded difference (0x10004) from somedata.

Raising the stakes: Dynamic Relocations

This is all great when all our symbols are local and have predictable layouts such that PC-relative relocations such as R_AARCH64_PREL_LO19 are sufficient to describe and resolve in between assembling and linking a program. What happens however, when these symbol references cross boundaries of sections in ways that we cannot predict at compile or link time? What happens when your symbol references cross boundaries of your shared object? These are problems that need to be solved to make Position Independent Code (PIC) possible. PIC is when your program (and sections within your program) could get mapped anywhere in memory and you need your code to adapt to that fact.

Take this very simple example:

.data
somedata:
        .8byte 0x42
somedata2:
	.8byte somedata

.text
.globl _start
_start:
	ret

There’s next to nothing here; just somedata like in our previous example and a somedata2 which points to somedata. However, in this next-to-nothing example lies an interesting complication that needs to be resolved at runtime: the value in somedata2 cannot be computed at compile time; it needs a fixup at runtime! Let’s walk through the compilation to see how we get to the final result. First, the disassembly to understand what the assembler did for us:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d65f03c0 	ret

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

0000000000000008 <somedata2>:
	...
			8: R_AARCH64_ABS64	.data

We see now that the address to be relocated is somedata2 in the .data section and it is of type R_AARCH64_ABS64. This is simple relocation that instructs the linker to compute S + A to get the result, i.e. get the symbol address of somedata and add the addendum (0 again in this case). This in fact would be the final result for a statically linked result (using ld -static) and we’d lose the relocation in favour of the absolute address written into somedata2:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d65f03c0 	ret

Disassembly of section .data:

00000000004100b4 <somedata>:
  4100b4:	00000042 	.inst	0x00000042 ; undefined
  4100b8:	00000000 	.inst	0x00000000 ; undefined

00000000004100bc <somedata2>:
  4100bc:	004100b4 	.inst	0x004100b4 ; undefined
  4100c0:	00000000 	.inst	0x00000000 ; undefined

When compiling a shared object however (i.e. ld -shared) we intend to produce a position independent DSO (dynamic shared object) and to achieve that the linker now emits a relocation to describe how to compute the final address to assign to somedata2 and where the memory address can be located. In this example, it is the R_AARCH64_RELATIVE dynamic relocation, as seen using objdump -DR (output snipped to retain only useful bits):

Disassembly of section .data:

0000000000011000 <somedata>:
   11000:	00000042 	.inst	0x00000042 ; undefined
   11004:	00000000 	.inst	0x00000000 ; undefined

0000000000011008 <somedata2>:
   11008:	00011000 	.inst	0x00011000 ; undefined
			11008: R_AARCH64_RELATIVE	*ABS*+0x11000
   1100c:	00000000 	.inst	0x00000000 ; undefined

This relocation is interesting not just for the reason that it is dynamic, but also because it is a S+A type relocation that puts the non-relocated address (i.e. the link time address) of somedata into its addend. This relocation also does not reference a symbol; instead it references an *ABS* value, which is basically the offset at which this DSO would be loaded during execution. It is the dynamic linker in the C runtime library (ld.so in GNU systems) that reads these relocations from the .rela.dyn section. Because this relocation is based on an absolute address computed by the static linker, the dynamic linker does not have to do a symbol lookup to resolve the relocation.

The other difference from static relocations is that when a dynamic relocation references a symbol in its r_info, it is looked up in the .dynsym section, i.e. in the dynamic symbol table and not in the regular symbol table.

Final Thoughts

There are a number of other cases that the linker needs to cater for when it comes to relocations such as entries in Global Offset Tables, resolution of intermediate functions and Thread-Local Storage. Thankfully though, the first principles behind all those relocations are the same as the above and you can apply this knowledge to GOT, TLS and IFUNC relocations as well. GOT relocations for example reference GOT base, which the linker knows where to find (since it sets up the .got section in the first place) and can use that information to compute the location to fix up. Other than this special knowledge, everything else remains pretty much the same.

Once you’re equipped with these first principles, the next task is to figure out where documentation for specific relocations is for every architecture. While the binutils documentation makes some effort to document the public facing part of relocations, the detailed documentation is usually distributed by the architecture chip vendors. The AArch64 ELF documentation for example is hosted on the Arm website.

gcc under the hood

Posted: Oct 03, 2019, 14:12

My background in computers is a bit hacky for a compiler engineer. I never had the theoretical computer science base that the average compiler geek does (yes I have a Masters in Computer Applications, but it’s from India and yes I’m going to leave that hanging without explanation) and I have pretty much winged it all these years. What follows is a bunch of thoughts (high five to those who get that reference!) from my winging it for almost a decade with the GNU toolchain. If you’re a visual learner then I would recommend watching my talk video at Linaro Connect 2019 instead of reading this. This is an imprecise transcript of my talk, with less silly quips and in a more neutral accent.

Hell Oh World!

It all started as a lightning talk at SHD Belgaum where I did a 5 minute demonstration of how a program goes from source code to a binary program. I got many people asking me to talk about this in more detail and it eventually became a full hour workshop at FOSSASIA in 2017. Those remained the basis for the first part of my talk at Connect, in fact they’re a bit more detailed in their treatment of the subject of purely taking code from source to binary.

Go read those posts, I’ll wait.

Under the hood

Welcome back! So now you know almost exactly how code goes from source to binary and how it gets executed on your computer. Congratulations, you now have a headstart into hacking on the dynamic loader bits in glibc! Now that we’ve explored the woods around the swamp, lets step into the swamp. Lets take a closer look at what gcc does with our source code.

Babelfish

The first thing a compiler does is to read the source code and make it into a data structure form that is easy for the computer to manipulate. Since the computer deals best with binary data, it converts the text form of the source code language into a tree-like structure. There is plenty of literature available on how that is implemented; in fact most compiler texts end up putting too much focus on this aspect of the compiler and then end up rushing through the real fun stuff that the compiler does with the program you wrote.

The data structure that the compiler translates your source code into is called an Intermediate Representation (IR), and gcc has three of them!

This part of the compiler that parses the source code into IR is known as the front end. gcc has many such front ends, one for each language it implements; they are all cluttered into the gcc/ directory. The C family of languages has a common subset of code in the gcc/c-family directory and then there are specializations like the C++ front end, implemented in the gcc/cp directory. There is a directory for fortran, another for java, yet another for go and so on. The translation of code from source to IR varies from frontend to frontend because of language differences, but they all have one thing in common; their output is an IR called GENERIC and it attempts to be language-independent.

Optimisation passes

Once gcc translates the source code into its IR form, it runs the IR through a number of passes (about 200 of them, more or less depending on the -O flag you pass) to try and come up with the most optimal machine code representation of the program. gcc builds source code a file at a time, which is called a Translation Unit (TU). A lot of its optimisation passes operate at the function level, i.e. individual functions are seen as independent units and are optimised separately. Then there are Inter-procedural analysis (IPA) passes that look at interactions of these functions and finally there is Link Time Optimisation(LTO) that attempts to analyse source code across translation units to potentially get even better results.

Optimisation passes can be architecture-independent or architecture-dependent.

Architecture independent passes do not care too much about the underlying machine beyond some basic details like word size, whether the CPU has floating point support, vector support, etc. These passes have some configuration hooks that allow their behaviour to be modified according to the target CPU architecture but the high level behaviour is architecture-agnostic. Architecture-independent passes are the holy grail for optimisation because they usually don’t get old; optimisations that work today will continue to work regardless of CPU architecture evolution.

Architecture-dependent passes need more information about the architecture and as a result their results may change as architectures evolve. Register allocation and instruction scheduling for example are very interesting and complex problems that architecture-specific passes handle. The instruction scheduling problem was a lot more critical back in the day when CPUs could only execute code sequentially. With out-of-order execution, the scheduling problem has not become less critical, but it has definitely changed in nature. Similarly the register allocation problem can be complicated by various factors such as number of locigal registers, how they share physical register files, how costly moving between different types of registers is, and so on. Architecture-dependent passes have to take into consideration all of these factors.

The final pass in architecture-dependent passes does the work of emitting assembly code for the target CPU. The architecture-independent passes constitute what is known as the middle-end of the compiler and the architecture-dependent passes form the compiler backend.

Each pass is a complex work of art, mathematics and logic and may have one or more highly cited research papers as their basis. No single gcc engineer would claim to understand all of these passes; there are many who have spent most of their careers on a small subset of passes, such is their complexity. But then, this is also a testament to how well we can work together as humans to create something that is significantly more complex than what our individual minds can grasp. What I mean to say with all this is, it’s OK to not know all of the passes, let alone know them well; you’re definitely not the only one.

Optimisation passes are all listed in gcc/passes.def and that is the sequence in which they are executed. A pass is defined as a class with a gate and execute function that determine whether to run and what to do respectively. Here’s what a single pass namespace looks like:

/* Pass data to help the pass manager classify, prepare and cleanup.  */
const pass_data pass_data_my_pass =
{
  GIMPLE_PASS, /* type */
  "my_pass", /* name */
  OPTGROUP_LOOP, /* optinfo_flags */
  TV_TREE_LOOP, /* tv_id */
  PROP_cfg, /* properties_required */
  0, /* properties_provided */
  0, /* properties_destroyed */
  0, /* todo_flags_start */
  0, /* todo_flags_finish */
};

/* This is a GIMPLE pass.  I know you don't know what GIMPLE is yet ;) */
class pass_my_pass : public gimple_opt_pass
{
public:
  pass_my_pass (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_my_pass, ctxt)
  {}

  /* opt_pass methods: */
  virtual bool gate (function *) { /* Decide whether to run.  */ }

  virtual unsigned int execute (function *fn);
};

unsigned int
pass_my_pass::execute (function *)
{
  /* Magic!  */
}

We will not go into the anatomy of a pass yet. That is perhaps a topic for a follow-up post.

GENERIC

GENERIC is the first IR that gets generated from parsing the source code. It is a tree structure that has been present since the earliest gcc versions. Its core data structure is a tree_node, which is a hierarchy of structs, with tree_base as the Abraham. OK, if you haven’t been following gcc development, this can come as a surprise: a significant portion of gcc is now in c++!

It’s OK, take a minute to mourn/celebrate that.

The tree_node can mean a lot of things (it is a union), but the most important one is the tree_typed node. There are various types of tree_typed, like tree_int_cst, tree_identifier, tree_string, etc. that describe the elements of source code that they house. The base struct, i.e. tree_typed and even further up, tree_base have flags that have metadata about the elements that help in optimisation decisions. You’ll almost never use generic in the context of code traversal in optimisation passes, but the nodes are still very important to know about because GIMPLE continues to use them to store operand information. Take a peek at gcc/tree-core.h and gcc/tree.def to take a closer look at all of the types of nodes you could have in GENERIC.

What is GIMPLE you ask? Well, that’s our next IR and probably the most important one form the context of optimisations.

OK so now you have enough background to go look at the guts of the GENERIC tree IR in gcc. Here’s the gcc internals documentation that will help you navigate all of the convenience macros to analyze th tree nodes.

GIMPLE

The GIMPLE IR is a workhorse of gcc. Passes that operate on GIMPLE are architecture-independent.

The core structure in GIMPLE is a struct gimple that holds all of the metadata for a single gimple statement and is also a node in the list of gimple statements. There are various structures named gimple_statement_with_ops_* that have the actual operand information based on its type. Once again like with GENERIC, it is a hierarchy of structs. Note that the operands are all of the tree type so we haven’t got rid of all of GENERIC. gcc/gimple.h is where all of these structures are defined and gcc/gimple.def is where all of the different types of gimple statements are defined.

Where did my control flow go?

But how is it that a simple list of gimple statements is sufficient to traverse a program that has conditions and loops? Here is where things get interesting. Every sequence of GIMPLE statements is housed in a Basic Block (BB) and the compiler, when translating from GENERIC to GIMPLE (also known as lowering, since GIMPLE is structurally simpler than GENERIC), generates a Control FLow Graph (CFG) that describes the flow of the function from one BB to another. The only control flow idea one then needs in GIMPLE to traverse code from within the gimple statement context is a jump from one block to another and that is fulfilled by the GIMPLE_GOTO statement type. The CFG with its basic blocks and edges connecting those blocks, takes care of everything else. Routines to generate and manipulate the CFG are defined in gcc/cfg.h and gcc/cfg.c but beware, you don’t modify the CFG directly. Since CFG is tightly linked with GIMPLE (and RTL, yes that’s our third and final IR), it provides hooks to manipulate the graph and update GIMPLE if necessary.

The last interesting detail about CFG is that it has a special construct for loops, because they’re typically the most interesting subjects for optimisation: you can splice them, unroll them, distribute them and more to produce some fantastic performance results. gcc/cfgloop.h is there you’ll find all of the routines you need to traverse and manipulate loops.

The final important detail with regard to GIMPLE is the Single Static Assignment (SSA) form. Typical source code would have variables that get declared, assigned to, manipulated and then stored back into memory. Essentially, it is multiple operations on a single variable, sometimes destroying its previous contents as we reuse variables for different things that are logically related in the context of the high level program. This reuse however makes it complicated for a compiler to understand what’s going on and hence it ends up missing a host of optimisation opportunities.

To make things easier for optimisation passes, these variables are broken up into versions of themselves that have a specific start and end point in their lifecycle. This is called the Single Static Assignment form of a variable, where each version of the variable as a single starting point, viz. its definition. So if you have code like this:

    x = 10;
    x += 20;

it becomes:

    x_1 = 10;
    x_2 = x_1 + 20;

where x_1 and x_2 are versions of x. If you have versions of variables in conditions, things get interesting and the compiler deals with it with a mysterious concept called PHI nodes. So this code:

    if (n > 10)
      x = 10;
    else
      x = 20;
    return x;

becomes:

    if (n > 10)
      x_1 = 10;
    else
      x_2 = 20;
    # x_3 = PHI<x_1, x_2>;
    return x_3;

So the PHI node is a conditional selector of the earlier two versions of the variable and depending on the results of the optimisation passes, you could eliminate versions of the variables altogether or use CPU registers more efficiently to store those variable versions.

There you go, now you have everything you need to get started on hacking GIMPLE. I know this part is a bit heavy but guess what, this is where you can seriously start thinking about hacking on gcc! When you jump in, you’ll need the more detailed information in the gcc internals manual on GIMPLE, CFG and GIMPLE optimisations.

RTL

We are yet another step closer to generating assembly code for our assembler and linker to build into the final program. Recall that GIMPLE is largely architecture independent, so it works on high level ideas of statements, expressions and types and their relationships. RTL is much more primitive in comparison and is designed to mimic sequential architecture instructions. Its main purpose is to do architecture-specific work, such as register allocation and scheduling instructions in addition to more optimisation (because you can never get enough of that!) passes that make use of architecture information.

Internally, you will encounter two forms of RTL, one being the rtx struct that is used for most transformations and passes in the compiler. The other form is in text to map machine instructions to RTL and these are Lisp-like S expressions. These are used to make machine descriptions, where you can specify machine instructions for all of the common operations such as add, sub, multiply and so on. For example, this is what the description of a jump looks like for aarch64:

(define_insn "jump"
  [(set (pc) (label_ref (match_operand 0 "" "")))]
  ""
  "b\\t%l0"
  [(set_attr "type" "branch")]
)

Machine description files are in the gcc/config directory in their respective architecture subdirectory and have the .md suffix. The main aarch64 machine description file for example is gcc/config/aarch64/aarch64.md. These files are parsed by a tool in gcc to generate C code with rtx structures for each of those S-expressions.

In general, the gcc/config directory contains source code that handles compilation for that specific architecture. In addition to the machine descriptions these directories also have C code that enhance the RTL passes to exploit as much of th architecture information as they possibly can. This is where some of the detailed architecture-specific analysis of the RTL instructions go. For example, combination of loads into load pairs for aarch64 is an important task and it is done with a combination of machine description and some code to peek into and rearrange neighbouring RTL instructions

But then you’re wondering, why are there multiple description files? Other than just cleaner layout (put constraint information in a separate file, type information in another, etc.) it is because there are multiple evolutions of an architecture. The ‘i386’ architecture is a mess of evolutions that span word sizes and capabilities. The aarch64 architecture has within it many microarchitectures developed by various Arm licensee vendors like xgene, thunderxt88, falkor and also those developed by Arm such as the cortex-a57, cortex-a72, ares, etc. All of these have different behaviours and performance characteristics despite sharing an instruction set. For example on some microarchitecture one may prefer to emit the csel instruction instead of cmp, b.cond and multiple mov instructions to reduce code size (and hence improve performance) but on some other architecture, the csel instruction may have been designed really badly and hence it may well be cheaper to execute the 4+ instructions instead of the one csel. These are behaviour quirks that you select when you use the -mtune flag to optimise for a specific CPU. A lot of this information is also available in the C code of the architecture in the form of structures called cost tables. These are relative costs of various operations that help the RTL passes and some GIMPLE passes determine the best target code and optimisation behaviour accordingly for the CPU. Here’s an example of register move costs for the Qualcomm Centriq processor:

static const struct cpu_regmove_cost qdf24xx_regmove_cost =
{
  2, /* GP2GP  */
  /* Avoid the use of int<->fp moves for spilling.  */
  6, /* GP2FP  */
  6, /* FP2GP  */
  4 /* FP2FP  */
};

This tells us that in general, moving between general purpose registers is cheaper, moving between floating point registers is slightly more expensive and moving between general purpose and floating point registers is most expensive.

The other important detail in such machine description files is the pipeline description. This is a description of how the pipeline for a specific CPU microarchitecture is designed along with latencies for instructions in that pipeline. This information is used by the instruction scheduler pass to determine the best schedule of instructions for a CPU.

Now where do I start?

This is a lot of information to page in at once and if you’re like me, you’d want something more concrete to get started with understanding what gcc is doing. gcc has you covered there because it has flags that allow you to study the IR generated for the compiler at every stage of compilation. By every stage, I mean every pass! The -fdump-* set of flags allow you to dump the IR to study what gcc did in each pass. Particularly, -fdump-tree-* flags dump GIMPLE IR (in a C-like format so that it is not too complicated to read) into a file for each pass and the -fdump-rtl-* flags do the same for RTL. The GIMPLE IR can be dumped in its raw form as well (e.g. -fdump-tree-all-raw), which makes it much simpler to correlate with the code that is manipulating the GIMPLE in an optimisation pass.

The easiest way to get into gcc development (and compiler development in general) in my experience is the back end. Try tweaking the various cost tables to see what effect it has on code generation. Modify the instructions generated by the RTL descriptions and use that to look closer at one pass that interests you. Once you’re comfortable with making changes to gcc, rebuilding and checking its outputs, you can then try writing a pass, which is a slightly more involved process. Maybe I’ll write about it some day.

References

The GCC internals manual is the canonical place to read (and fix up) documentation for GCC internals. It is wonderfully detailed and hopelessly outdated at the same time. Bringing it up to date is a task by itself and the project continues to look for volunteers to do that.
David Malcolm has a more functional newbie guide for fledgling gcc hackers who might struggle with debugging gcc and getting involved in the gcc development process
The GCC Resource Center workshop on GCC is where I cut my teeth on gcc internals. I don’t think their workshop is active anymore but they have presentations and other literature there that is still very relevant.
I wrote about micro-optimisations in the past and those ideas are great to try on gcc.

Comments

A JIT in Time...

Posted: Mar 28, 2019, 13:29

It’s been a different 3 months. For over 6 years I had been working almost exclusively on the GNU toolchain with a focus on glibc and I now had the chance of working on a completely different set of projects, something I had done a lot of during my Red Hat technical support days but not since. I was to look into Pypy, OpenJDK and LuaJIT, three very different projects with very different development styles, communities and technologies. The comparison of these projects among themselves and the GNU projects is an interesting point but not the purpose of this post, maybe some other day. In this post I want to talk about the project I spent the most time on (~1.5 months) and found to be technically the most intriguing: LuaJIT.

A Just In Time Introduction

For those new to the concept, JIT compilation techniques are pretty old and there is a very interesting paper called the A brief history of just in time that does what the title states. The basic concept is quite straightforward - code written in a high level language (in the case of luajit, lua) is interpreted as usual while keeping track of which parts of the code get hit often. If a part of the code is seen to be executed repeatedly, all or part of that code is compiled into binary and mapped in, with entry and exit branches into the interpreter, also known as exit guards. There are a number of tradeoffs in designing a JIT and the paper I’ve linked above gives enough of an introduction to appreciate the complexity of the problem being solved.

The key difference from compilers is that the time required to compile is often as much a performance factor as the quality of the generated code. Due to this, one needs to be careful about the amount of processing one can do on the code to optimise it. So while gcc or llvm may end up giving higher quality code, the ~200 passes that are involved in building a TU may well end up eating up all the performance gains compiling just in time would have given.

LuaJIT: Peeking under the hood

The LuaJIT project was started and is mostly written by Mike Pall, that is apparently a pseudonym for a very private and very smart hacker. I assume that he is male given that Mike is a common male name. The source code repository is a bit odd. There is a github repository that is supposed to be official but isn’t; it is a mirror created by CloudFlare along with Mike with the aim to broaden the developer community base. That ride hasn’t been the smoothest and I’ve talked about it in more detail below. The latest code with support for other architectures such as arm64 and ppc are in the v2.1 branch, which has only had beta releases come off it, the last one in 2017. There are tests in a separate repository called LuaJIT-test-cleanup which has a big fat warning that it is not the official testsuite, although if you look around, it pretty much is the only testsuite worth using for luajit.

Wait, there’s also bench_lua, which has some benchmarks and a pretty nice driver for the benchmarks, something that the LuaJIT-test-cleanup benchmarks lack.

LuaJIT uses the concept of trace compiling which is pretty simple in concept but has some very interesting side-effects. The idea of trace compilation, specifically with luajit is quite simple and follows roughly this logic:

Interpret program and profile it while it is running. Typical candidates for profiling would be loops for the obvious reason that it will likely execute repeatedly.
If a loop is hit repeatedly, i.e. it crosses a threshold number of iterations, the JIT compiler is invoked on its next iteration.
The JIT compiler first traces execution of the program and generates an IR for the trace of the program.
The IR then goes through some optimisation passes and finally code is generated for the desired CPU backend.

This keeps on repeating as the interpreter encounters more hotspots. The interesting bit here is that the only bit that gets compiled is the code that gets executed during the trace. So if you have a branch like so:

    if cond > threshold then
        i = i + 1
    else
        i = i - 1
    end

and the else block is executed during the trace, only that bit is compiled and not the if block. The compiled code then has branches (known as exit guards) to jump back into the interpreter if the condition is true. This produces an interesting optimisation opportunity that can be done during tracing itself. If cond > threshold is found to be always false because they are constants or some other reason, the if condition can be completely eliminated, which saves compilation time as well as execution time.

Another interesting side effect of tracing that is not seen in typical compilers is that function calls effectively get inlined. Again, that becomes a very cheap way to achieve something that would otherwise have been done in a separate pass in traditional compilers.

In addition to very fast tracing and compilation, all of luajit is quite compact. It’s IR is linear array based and is hence allows very fast traversal. It’s easy to visualize it using the jit.* debug modules and using the -jdump flag to dump the IR during execution. The luajit wiki has some pretty detailed documentation on its internals.

The coding style of the project is a bit too compact to my taste since I personally prefer writing for readability. There are a lot of constructs throughout the code that need a fair amount of squinting to understand, such as assignments inside the for loop headers and inside conditions. OK all of you pointing at the macro and makefile soups in glibc and laughing, please be quiet ;)

There’s also the infamous (at least in luajit circles) 47-bit address space limitation for garbage collected objects in luajit because luajit uses the top bits for metadata. This is known to have correctness issues with Lua userdata objects and also performance issues because luajit repeatedly tries allocations until it finds a suitable address in the 47-bit space. It doesn’t hurt x86 much (because of MAP_32BIT) but arm64 feels it and I imagine so do other architectures.

My LuaJIT involvement

My full time involvement with luajit was brief and will likely end soon (my personal involvement may still continue) so in this short period I wanted to tick off as many short but significant work items as I could. My github fork is here.

Sameera Deshpande started the initial work and then helped me ramp up later on. We got a couple of CI instances up and running to begin with, one for the official repository and another for my github fork so that I can review my changes regularly. If you’re interested in adding a node for your architecture to the Ci projects, please feel free to reach out to me, Linaro will happily add the node to the CI matrix.

Register Allocation improvements

The register allocator in luajit is pretty simple to keep the compilation overhead low. Registers are allocated sequentially based on their categories (caller saved, callee saved, etc.) and it uses some tricks such as constant rematerialization used to reduce register pressure. Rematerialization is also very basic in its implementation; whenever constants need to be allocated to registers, it is preferred that they use existing constants, (assuming their live ranges are compatible) either directly or as a constant computation. This is quite valuable because there is a fair amount of constant usage in the JITted code; exit guard addresses are coded in as constants for example and so are floating point numbers, in addition to the usual integers. The register modes are not specified during allocation and are defined by the instructions generated in the assembly phase.

There was a bug in the luajit register allocator due to which registers used for constant rematerialization were being clobbered, resulting in corruption. A fix was proposed but the author of the fix was not sure if it was correct. I posted an alternative patch and then realized and explained why my patch is overkill and his approach is optimal. I added additional cleanups to that to finish it up.

While working on this problem, I noticed that the arm64 backend was not using XZR often enough and I posted a patch to fix that. I started benchmarking the improvement (the codegen was obviously better, it was saving registers for stores fo zeroes for example) and quickly realized that both bench_lua and the LuaJIT-test-cleanup benchmarks were quite raw and couldn’t be relied upon for consistent results.

So I digressed.

Benchmark improvements and luaJIT-test-cleanup cleanup

bench_lua was my more favourite project to hack on benchmarks because it was evident that reviews were very hard to come by in the luajit project. Also, bench_lua had a benchmark driver that produced repeatable results but it still had some cleanup issues, including the fact that it did not have a license! The author was very responsive on the license question though and quickly put one in. I fixed some timing issues in the driver and while doing so, I realized that it might be better if I used this driver on the more extensive set of benchmarks in LuaJIT-test-cleanup. So that’s what I did.

I integrated the bench_lua driver into luajit-test-cleanup and added Makefile targets so that one could easily do make check and make bench to run the tests and benchmarks. Now I had something I could work with but it was still in a different repo and it was getting quite cumbersome to work with them.

So I integrated LuaJIT-test-cleanup into LuaJIT. Now I had a LuaJIT repository that IMO was complete and could handle the standard make/make check workflow. At the same time, it was modular enough that it could be merged into the upstream LuaJIT with relative ease. I posted all of these patches as PRs and watched as nothing happened. The LuaJIT-test-cleanup project had not seen a PR review since about 2016 and the LuaJIT project had seen occassional comments and patches from Mike in the past couple of years, but not much else.

Fusing and combining optimisations

Instruction fusion is an architecture dependent feature in luajit and each backend implements its own during the IR to assembly conversion phase, where the IR is traversed from the bottom up and assembly instructions generated sequentially. Luajit does some trivial reordering in its IR optimisation passes but during assembly, it does not peek ahead to actively look for instruction fusion opportunities; it only tries to fuse neighbouring instructions. As a result, while there are implementations for instructions like load and store pair in arm64, it is useful in only the most trivial of tests. Likewise for fmadd/fmsub; a simple intervening load is sufficient to prevent the optimisation.

In addition to this, it is often seen that optimisations like loop unrolling and vectorisation bring in even more opportunities for combining of loads and stores. Luajit does some loop peeling but that’s about it.

Sameera did some analysis on ways to introduce more aggressive unrolling and possibly some amount of vectorisation but we did not have enough time to implement it. She did have enough time to implement some instruction fusing and using fnmadd and fnmsub for arm64. She also looked at load combining opportunities but realized that luajit would need more powerful instruction reordering, similar to the load grouping in the gcc scheduler that makes load pair generation much easier. So that project was also not small enough for us to complete in the limited time.

Casting floats to unsigned integers

The C standard defines casting of floating point types to unsigned integer types only for the range (-1.0, UTYPE_MAX), where UTYPE_MAX is the unsigned version of TYPE. Casts to signed types work just fine as long as the number is in the range of that type. Waters get a bit murky with dynamic types and type narrowing when the default internal representation for all numbers is double. That was the situation in luajit. The fix for this was pretty straightforward in theory, which was to add an additional cast from float to signed int and then to unsigned int for floating point values less than zero and sticking to a direct cast to unsigned int for positive numbers. I have implemented this for the interpreter and for arm64 in my fork.

Project state and the road ahead

LuaJIT is a very interesting project that has some very interesting concepts that I learned in the last month or so. It has a pretty active user community that sings praises of the project and seems to advocate it in a number of areas. However, the project development itself is in a bit of a crisis.

Around 2015 Mike Pall said he wanted to step back from the project and wanted more people to get involved in the development. With that intent, Cloudflare created the github organisation and repository to allow for better collaboration. Based on conversation threads I read, things seemed to go fine when the community stepped in to create the LuaJIT-test-cleanup repository based on some initial tests Mike had written and built it up into a set of 500+ tests. However in about a year that excitement faded because nobody was made maintainer alongside Mike to carry forward the work and that meant that the LuaJIT project itself would only get sporadic fixes whenever Mike had some free time. Minor patches were accepted but bigger pieces of code went unreviewed and presumably the developers also lost interest.

Fast forward four years into 2019 and we are still in the same situation, probably worse. LuaJIT-test-cleanup has not had a patch review since 2016. LuaJIT has had comments about a couple of times each quarter and bug fixes with similar frequency, but not much else. The mailing list also has similar traffic - I announced all of the work I did above and did not get any responses. there are forks of LuaJIT all over the place in projects such as OpenResty and RaptorJIT and the projects seem happy to let things run that way. Lua language support is in a bit of a limbo with it being mostly 5.1 compliant with some 5.2 bits thrown in. Overall, it’s a great chunk of code that’s about to vanish into oblivion.

Then there is the very tricky question of copyright. The copyright notices all over the code say that Mike Pall has ownership. However, the code clearly has a number of contributions from others and there is no copyright assignment in place. While it’s likely not an issue from a licensing standpoint (IANAL, etc.), it is definitely something that needs to be addressed if the project is somehow ressurected, at the very least to give more prominent credit to contributors.

I’ve posted PRs for my work and tried to engage but I don’t have much hope given past history. I intend to spend at least some of my free time tinkering with this code since it’s just a very interesting project and there’s a lot that can be done. I am trawling the PRs and issue lists to look for patches that can be incorporated in my tree so if anyone is interested in contributing patches, you’re most welcome. I will continue to ensure that my tree applies on top of the official repository because I do not want to give up hope of the project coming back to life.

Comments