Hell Oh Entropy!

Life, Code and everything in between

_FORTIFY_SOURCE=3 performance

Posted: Jan 05, 2023, 16:28

So early last year I finished implemented everything needed for a fully working _FORTIFY_SOURCE=3 so that disrtributions can use it out of the box. OpenSUSE adopted it almost immediately and Gentoo started the work of adding it to their hardened profile. I proposed to make it the default for Fedora 38 after some tests but people quoted to me this blog post that some guy wrote, telling me that there’s a performance issue. Since my explanations and clarifications in the Fedora wiki or on the Fedora devel list is not sufficient (the feature was approved but the “_FORTIFY_SOURCE=3 has performance overhead” claims don’t seem to stop), here’s a blog post for a blog post, stating conclusively that the performance issue is theoretical and overstated, the guy didn’t know what he was talking about when he wrote it.

That guy is working on a clarification blog post of his own, describing in some more detail why the concern is overblown but he has to jump through editorial hoops of a multi-billion dollar corporation that pays his salary, so his apology to me is going to take a while. Whenever he gets to publish his work, I’ll link it here so that it’s two blog posts against one. Take that!

Comments

Relocations: fantastic symbols, but where to find them?

Posted: Apr 10, 2020, 12:36

Update: There’s a Links section at the end that should give you a list of all the reference you’ll ever need to undrstand Linkers and Relocations!

When I started out in compilers years ago, I found relocations especially hard to wrap my head around. They’re just simple math in the end but they combine elements from different places that make it complicated enough that many (as did I back then) assume it to be some kind of black art. The Oracle documentation on relocation processing and relocation sections is pretty much the best thing I’ve found on the internet that explains relocations and they’re a great start if you already know what you’re doing. This is why I figured I ought to try writing something more accessible that puts some of the bits into practice. The fact that I’m currently working on relocation processing makes it that much simpler for me to just bash at the keyboard and commit the stuff in my memory to a more persistent medium before it gets swapped out to make space for kitten photos.

While this tutorial is meant to be beginner friendly, it does assume though that the reader has some awareness of the ELF format, at least to the extent of knowing about different ELF sections. It also assumes that the reader has some undersanding of assembly language programming, bonus if you know aarch64 assembly since that’s the flavour of the examples.

Finding each other in this crowded world

The idea of relocation is quite simple: when compiling programs, we need flexibility to build components of programs independently and then have them link together. This could be in the same source file where we don’t know where parts of the source would end up, multiple source files built into different object files or sets of object files built into different libraries that reference objects in each other. This flexibility is achieved using relocations. Here’s a very simple example using aarch64 assembly:

.data
.globl somedata
somedata:
	.8byte 0x42

.text
.globl start
_start:
	nop
	ldr	x2, somedata

This is a simple program that loads somedata into register x2. It doesn’t do much and if you try to run the program it will crash, but it is a useful example that shows an assembly source where parts of it end up in different ELF sections.

The interesting bit here is a load instruction in the text section that is reading a variable somedata from the data section. The load instruction encodes within itself, the offset of somedata from itself, aka the PC-relative offset. The assembler can see both, the variable and the instruction but it cannot say for sure where they will be in memory at this point, because it does not know how far the data section will be from the text section in the final library or executable. To work around this limitation, it needs to leave the offset field in the ldr instruction blank so that the linker can finally fill it in. It also needs to provide instructions to the linker to tell it how to fill in this field.

This is where relocations come into play.

In this specific case, one may assemble the example and disassemble it using objdump -Dr to find this disassembly:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d503201f 	nop
   4:	58000002 	ldr	x2, 0 <_start>
			4: R_AARCH64_LD_PREL_LO19	somedata

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

The R_AARCH64_LD_PREL_LO19 in that output is the relocation. How did it land in there in between the instructions you ask? Well, it didn’t. The relocations are actually in a separate section of their own, as is evident with objdump -r:

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE 
0000000000000004 R_AARCH64_LD_PREL_LO19  somedata

even better with readelf -r because it tells you that the relocations are essentially just a table of entries in the .rela.text section:

Relocation section '.rela.text' at offset 0x110 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000004  000500000111 R_AARCH64_LD_PREL 0000000000000000 somedata + 0

All sections with relocation entries have names with prefix .rela or .rel followed by the name of the section for which the relocations need to be applied. Based on these section names, it’s evident that there are two types of relocation entries, REL and RELA. There are a number of important pieces of information the assembler leaves for the linker here:

All of this can be seen in the above relocation table. Each entry in the relocation table is basically a C structure of the following form for REL type relocations:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
} Elf64_Rel;

and for RELA:

typedef struct {
        Elf64_Addr      r_offset;
        Elf64_Xword     r_info;
        Elf64_Sxword    r_addend;
} Elf64_Rela;

r_offset corresponds to the Offset entry in the readelf output above and is typically the memory address that needs to be fixed up. The offset from the symbol, aka the addendum is present only in RELA type relocations and it corresponds to the r_addend element in the structure and the Addend field in the readelf output. The symbol and computation related information is encoded in the r_info field.

The Elf32_Rel and Elf32_Rela structures are similar, except for the data sizes of the elements.

Symbol hunting

The ‘what to write’ is where the r_info field comes in. That’s what the linker needs to figure out before the where and how, which comes later.

This field is split into two 32-bit parts (16-bit for 32-bit architectures). The lower part tells the linker how to perform the computation and the upper part tells the linker what the target symbol is. The upper part is a symbol ID, which basically is an index into the symbol table. In our example above, the r_info is 0x000500000111, which means that the symbol id is 0x5. We can pull out the symbol table using readelf -s:

Symbol table '.symtab' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     4: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    1 $x
     5: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    3 somedata
     6: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _start

and we find that the symbol id 0x5 is somedata and has the Value (i.e. the address of the symbol relative to its section) of 0x0.

Now we need to figure out how to combine all of this information together using the lower part of r_info, which is 0x111. This number corresponds to R_AARCH64_PREL_LO19, which has a specific meaning. An architecture defines a number of such relocations with descriptions of how they’re supposed to be applied. In case of R_AARCH64_PREL_LO19 it means “Add the symbol address and addend and subtract from it, the location of the memory address that is being fixed up”. Think about that a bit and you’ll notice that it is how you would compute the PC-relative offset of somedata from the instruction, i.e. subtract the location of the fixup (i.e. the LDR instruction) from the address of somedata. In short form (and you’ll see this and similar notations to describe relocations), it is written as S + A - P.

Putting things together

Now that we know what the target symbol is and how to compute the pc-relative offset, we need to compute the final symbol address, the final target address and then do the actual fixup. This is done near the end of the linking process in the GNU linker, when all sections have been laid out and we finally are in a position to know the relative addresses. The linker will then make the computation (i.e. S+A-P) and patch in the result into the LDR instruction before writing the output to the final binary. Here is what our result looks like:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d503201f 	nop
  4000b4:	58080022 	ldr	x2, 4100b8 <somedata>

Disassembly of section .data:

00000000004100b8 <somedata>:
  4100b8:	00000042 	.inst	0x00000042 ; undefined
  4100bc:	00000000 	.inst	0x00000000 ; undefined

Notice that the opcode of the LDR instuction is now different (and as a result the instruction itself is also different) and includes the 0x8002, which is basically the encoded difference (0x10004) from somedata.

Raising the stakes: Dynamic Relocations

This is all great when all our symbols are local and have predictable layouts such that PC-relative relocations such as R_AARCH64_PREL_LO19 are sufficient to describe and resolve in between assembling and linking a program. What happens however, when these symbol references cross boundaries of sections in ways that we cannot predict at compile or link time? What happens when your symbol references cross boundaries of your shared object? These are problems that need to be solved to make Position Independent Code (PIC) possible. PIC is when your program (and sections within your program) could get mapped anywhere in memory and you need your code to adapt to that fact.

Take this very simple example:

.data
somedata:
        .8byte 0x42
somedata2:
	.8byte somedata

.text
.globl _start
_start:
	ret

There’s next to nothing here; just somedata like in our previous example and a somedata2 which points to somedata. However, in this next-to-nothing example lies an interesting complication that needs to be resolved at runtime: the value in somedata2 cannot be computed at compile time; it needs a fixup at runtime! Let’s walk through the compilation to see how we get to the final result. First, the disassembly to understand what the assembler did for us:

Disassembly of section .text:

0000000000000000 <_start>:
   0:	d65f03c0 	ret

Disassembly of section .data:

0000000000000000 <somedata>:
   0:	00000042 	.inst	0x00000042 ; undefined
   4:	00000000 	.inst	0x00000000 ; undefined

0000000000000008 <somedata2>:
	...
			8: R_AARCH64_ABS64	.data

We see now that the address to be relocated is somedata2 in the .data section and it is of type R_AARCH64_ABS64. This is simple relocation that instructs the linker to compute S + A to get the result, i.e. get the symbol address of somedata and add the addendum (0 again in this case). This in fact would be the final result for a statically linked result (using ld -static) and we’d lose the relocation in favour of the absolute address written into somedata2:

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	d65f03c0 	ret

Disassembly of section .data:

00000000004100b4 <somedata>:
  4100b4:	00000042 	.inst	0x00000042 ; undefined
  4100b8:	00000000 	.inst	0x00000000 ; undefined

00000000004100bc <somedata2>:
  4100bc:	004100b4 	.inst	0x004100b4 ; undefined
  4100c0:	00000000 	.inst	0x00000000 ; undefined

When compiling a shared object however (i.e. ld -shared) we intend to produce a position independent DSO (dynamic shared object) and to achieve that the linker now emits a relocation to describe how to compute the final address to assign to somedata2 and where the memory address can be located. In this example, it is the R_AARCH64_RELATIVE dynamic relocation, as seen using objdump -DR (output snipped to retain only useful bits):

Disassembly of section .data:

0000000000011000 <somedata>:
   11000:	00000042 	.inst	0x00000042 ; undefined
   11004:	00000000 	.inst	0x00000000 ; undefined

0000000000011008 <somedata2>:
   11008:	00011000 	.inst	0x00011000 ; undefined
			11008: R_AARCH64_RELATIVE	*ABS*+0x11000
   1100c:	00000000 	.inst	0x00000000 ; undefined

This relocation is interesting not just for the reason that it is dynamic, but also because it is a S+A type relocation that puts the non-relocated address (i.e. the link time address) of somedata into its addend. This relocation also does not reference a symbol; instead it references an *ABS* value, which is basically the offset at which this DSO would be loaded during execution. It is the dynamic linker in the C runtime library (ld.so in GNU systems) that reads these relocations from the .rela.dyn section. Because this relocation is based on an absolute address computed by the static linker, the dynamic linker does not have to do a symbol lookup to resolve the relocation.

The other difference from static relocations is that when a dynamic relocation references a symbol in its r_info, it is looked up in the .dynsym section, i.e. in the dynamic symbol table and not in the regular symbol table.

Final Thoughts

There are a number of other cases that the linker needs to cater for when it comes to relocations such as entries in Global Offset Tables, resolution of intermediate functions and Thread-Local Storage. Thankfully though, the first principles behind all those relocations are the same as the above and you can apply this knowledge to GOT, TLS and IFUNC relocations as well. GOT relocations for example reference GOT base, which the linker knows where to find (since it sets up the .got section in the first place) and can use that information to compute the location to fix up. Other than this special knowledge, everything else remains pretty much the same.

Once you’re equipped with these first principles, the next task is to figure out where documentation for specific relocations is for every architecture. While the binutils documentation makes some effort to document the public facing part of relocations, the detailed documentation is usually distributed by the architecture chip vendors. The AArch64 ELF documentation for example is hosted on the Arm website.

Links

Comments

gcc under the hood

Posted: Oct 03, 2019, 14:12

My background in computers is a bit hacky for a compiler engineer. I never had the theoretical computer science base that the average compiler geek does (yes I have a Masters in Computer Applications, but it’s from India and yes I’m going to leave that hanging without explanation) and I have pretty much winged it all these years. What follows is a bunch of thoughts (high five to those who get that reference!) from my winging it for almost a decade with the GNU toolchain. If you’re a visual learner then I would recommend watching my talk video at Linaro Connect 2019 instead of reading this. This is an imprecise transcript of my talk, with less silly quips and in a more neutral accent.

Hell Oh World!

It all started as a lightning talk at SHD Belgaum where I did a 5 minute demonstration of how a program goes from source code to a binary program. I got many people asking me to talk about this in more detail and it eventually became a full hour workshop at FOSSASIA in 2017. Those remained the basis for the first part of my talk at Connect, in fact they’re a bit more detailed in their treatment of the subject of purely taking code from source to binary.

Go read those posts, I’ll wait.

Under the hood

Welcome back! So now you know almost exactly how code goes from source to binary and how it gets executed on your computer. Congratulations, you now have a headstart into hacking on the dynamic loader bits in glibc! Now that we’ve explored the woods around the swamp, lets step into the swamp. Lets take a closer look at what gcc does with our source code.

Babelfish

The first thing a compiler does is to read the source code and make it into a data structure form that is easy for the computer to manipulate. Since the computer deals best with binary data, it converts the text form of the source code language into a tree-like structure. There is plenty of literature available on how that is implemented; in fact most compiler texts end up putting too much focus on this aspect of the compiler and then end up rushing through the real fun stuff that the compiler does with the program you wrote.

The data structure that the compiler translates your source code into is called an Intermediate Representation (IR), and gcc has three of them!

This part of the compiler that parses the source code into IR is known as the front end. gcc has many such front ends, one for each language it implements; they are all cluttered into the gcc/ directory. The C family of languages has a common subset of code in the gcc/c-family directory and then there are specializations like the C++ front end, implemented in the gcc/cp directory. There is a directory for fortran, another for java, yet another for go and so on. The translation of code from source to IR varies from frontend to frontend because of language differences, but they all have one thing in common; their output is an IR called GENERIC and it attempts to be language-independent.

Optimisation passes

Once gcc translates the source code into its IR form, it runs the IR through a number of passes (about 200 of them, more or less depending on the -O flag you pass) to try and come up with the most optimal machine code representation of the program. gcc builds source code a file at a time, which is called a Translation Unit (TU). A lot of its optimisation passes operate at the function level, i.e. individual functions are seen as independent units and are optimised separately. Then there are Inter-procedural analysis (IPA) passes that look at interactions of these functions and finally there is Link Time Optimisation(LTO) that attempts to analyse source code across translation units to potentially get even better results.

Optimisation passes can be architecture-independent or architecture-dependent.

Architecture independent passes do not care too much about the underlying machine beyond some basic details like word size, whether the CPU has floating point support, vector support, etc. These passes have some configuration hooks that allow their behaviour to be modified according to the target CPU architecture but the high level behaviour is architecture-agnostic. Architecture-independent passes are the holy grail for optimisation because they usually don’t get old; optimisations that work today will continue to work regardless of CPU architecture evolution.

Architecture-dependent passes need more information about the architecture and as a result their results may change as architectures evolve. Register allocation and instruction scheduling for example are very interesting and complex problems that architecture-specific passes handle. The instruction scheduling problem was a lot more critical back in the day when CPUs could only execute code sequentially. With out-of-order execution, the scheduling problem has not become less critical, but it has definitely changed in nature. Similarly the register allocation problem can be complicated by various factors such as number of locigal registers, how they share physical register files, how costly moving between different types of registers is, and so on. Architecture-dependent passes have to take into consideration all of these factors.

The final pass in architecture-dependent passes does the work of emitting assembly code for the target CPU. The architecture-independent passes constitute what is known as the middle-end of the compiler and the architecture-dependent passes form the compiler backend.

Each pass is a complex work of art, mathematics and logic and may have one or more highly cited research papers as their basis. No single gcc engineer would claim to understand all of these passes; there are many who have spent most of their careers on a small subset of passes, such is their complexity. But then, this is also a testament to how well we can work together as humans to create something that is significantly more complex than what our individual minds can grasp. What I mean to say with all this is, it’s OK to not know all of the passes, let alone know them well; you’re definitely not the only one.

Optimisation passes are all listed in gcc/passes.def and that is the sequence in which they are executed. A pass is defined as a class with a gate and execute function that determine whether to run and what to do respectively. Here’s what a single pass namespace looks like:

/* Pass data to help the pass manager classify, prepare and cleanup.  */
const pass_data pass_data_my_pass =
{
  GIMPLE_PASS, /* type */
  "my_pass", /* name */
  OPTGROUP_LOOP, /* optinfo_flags */
  TV_TREE_LOOP, /* tv_id */
  PROP_cfg, /* properties_required */
  0, /* properties_provided */
  0, /* properties_destroyed */
  0, /* todo_flags_start */
  0, /* todo_flags_finish */
};

/* This is a GIMPLE pass.  I know you don't know what GIMPLE is yet ;) */
class pass_my_pass : public gimple_opt_pass
{
public:
  pass_my_pass (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_my_pass, ctxt)
  {}

  /* opt_pass methods: */
  virtual bool gate (function *) { /* Decide whether to run.  */ }

  virtual unsigned int execute (function *fn);
};

unsigned int
pass_my_pass::execute (function *)
{
  /* Magic!  */
}

We will not go into the anatomy of a pass yet. That is perhaps a topic for a follow-up post.

GENERIC

GENERIC is the first IR that gets generated from parsing the source code. It is a tree structure that has been present since the earliest gcc versions. Its core data structure is a tree_node, which is a hierarchy of structs, with tree_base as the Abraham. OK, if you haven’t been following gcc development, this can come as a surprise: a significant portion of gcc is now in c++!

It’s OK, take a minute to mourn/celebrate that.

The tree_node can mean a lot of things (it is a union), but the most important one is the tree_typed node. There are various types of tree_typed, like tree_int_cst, tree_identifier, tree_string, etc. that describe the elements of source code that they house. The base struct, i.e. tree_typed and even further up, tree_base have flags that have metadata about the elements that help in optimisation decisions. You’ll almost never use generic in the context of code traversal in optimisation passes, but the nodes are still very important to know about because GIMPLE continues to use them to store operand information. Take a peek at gcc/tree-core.h and gcc/tree.def to take a closer look at all of the types of nodes you could have in GENERIC.

What is GIMPLE you ask? Well, that’s our next IR and probably the most important one form the context of optimisations.

OK so now you have enough background to go look at the guts of the GENERIC tree IR in gcc. Here’s the gcc internals documentation that will help you navigate all of the convenience macros to analyze th tree nodes.

GIMPLE

The GIMPLE IR is a workhorse of gcc. Passes that operate on GIMPLE are architecture-independent.

The core structure in GIMPLE is a struct gimple that holds all of the metadata for a single gimple statement and is also a node in the list of gimple statements. There are various structures named gimple_statement_with_ops_* that have the actual operand information based on its type. Once again like with GENERIC, it is a hierarchy of structs. Note that the operands are all of the tree type so we haven’t got rid of all of GENERIC. gcc/gimple.h is where all of these structures are defined and gcc/gimple.def is where all of the different types of gimple statements are defined.

Where did my control flow go?

But how is it that a simple list of gimple statements is sufficient to traverse a program that has conditions and loops? Here is where things get interesting. Every sequence of GIMPLE statements is housed in a Basic Block (BB) and the compiler, when translating from GENERIC to GIMPLE (also known as lowering, since GIMPLE is structurally simpler than GENERIC), generates a Control FLow Graph (CFG) that describes the flow of the function from one BB to another. The only control flow idea one then needs in GIMPLE to traverse code from within the gimple statement context is a jump from one block to another and that is fulfilled by the GIMPLE_GOTO statement type. The CFG with its basic blocks and edges connecting those blocks, takes care of everything else. Routines to generate and manipulate the CFG are defined in gcc/cfg.h and gcc/cfg.c but beware, you don’t modify the CFG directly. Since CFG is tightly linked with GIMPLE (and RTL, yes that’s our third and final IR), it provides hooks to manipulate the graph and update GIMPLE if necessary.

The last interesting detail about CFG is that it has a special construct for loops, because they’re typically the most interesting subjects for optimisation: you can splice them, unroll them, distribute them and more to produce some fantastic performance results. gcc/cfgloop.h is there you’ll find all of the routines you need to traverse and manipulate loops.

The final important detail with regard to GIMPLE is the Single Static Assignment (SSA) form. Typical source code would have variables that get declared, assigned to, manipulated and then stored back into memory. Essentially, it is multiple operations on a single variable, sometimes destroying its previous contents as we reuse variables for different things that are logically related in the context of the high level program. This reuse however makes it complicated for a compiler to understand what’s going on and hence it ends up missing a host of optimisation opportunities.

To make things easier for optimisation passes, these variables are broken up into versions of themselves that have a specific start and end point in their lifecycle. This is called the Single Static Assignment form of a variable, where each version of the variable as a single starting point, viz. its definition. So if you have code like this:

    x = 10;
    x += 20;

it becomes:

    x_1 = 10;
    x_2 = x_1 + 20;

where x_1 and x_2 are versions of x. If you have versions of variables in conditions, things get interesting and the compiler deals with it with a mysterious concept called PHI nodes. So this code:

    if (n > 10)
      x = 10;
    else
      x = 20;
    return x;

becomes:

    if (n > 10)
      x_1 = 10;
    else
      x_2 = 20;
    # x_3 = PHI<x_1, x_2>;
    return x_3;

So the PHI node is a conditional selector of the earlier two versions of the variable and depending on the results of the optimisation passes, you could eliminate versions of the variables altogether or use CPU registers more efficiently to store those variable versions.

There you go, now you have everything you need to get started on hacking GIMPLE. I know this part is a bit heavy but guess what, this is where you can seriously start thinking about hacking on gcc! When you jump in, you’ll need the more detailed information in the gcc internals manual on GIMPLE, CFG and GIMPLE optimisations.

RTL

We are yet another step closer to generating assembly code for our assembler and linker to build into the final program. Recall that GIMPLE is largely architecture independent, so it works on high level ideas of statements, expressions and types and their relationships. RTL is much more primitive in comparison and is designed to mimic sequential architecture instructions. Its main purpose is to do architecture-specific work, such as register allocation and scheduling instructions in addition to more optimisation (because you can never get enough of that!) passes that make use of architecture information.

Internally, you will encounter two forms of RTL, one being the rtx struct that is used for most transformations and passes in the compiler. The other form is in text to map machine instructions to RTL and these are Lisp-like S expressions. These are used to make machine descriptions, where you can specify machine instructions for all of the common operations such as add, sub, multiply and so on. For example, this is what the description of a jump looks like for aarch64:

(define_insn "jump"
  [(set (pc) (label_ref (match_operand 0 "" "")))]
  ""
  "b\\t%l0"
  [(set_attr "type" "branch")]
)

Machine description files are in the gcc/config directory in their respective architecture subdirectory and have the .md suffix. The main aarch64 machine description file for example is gcc/config/aarch64/aarch64.md. These files are parsed by a tool in gcc to generate C code with rtx structures for each of those S-expressions.

In general, the gcc/config directory contains source code that handles compilation for that specific architecture. In addition to the machine descriptions these directories also have C code that enhance the RTL passes to exploit as much of th architecture information as they possibly can. This is where some of the detailed architecture-specific analysis of the RTL instructions go. For example, combination of loads into load pairs for aarch64 is an important task and it is done with a combination of machine description and some code to peek into and rearrange neighbouring RTL instructions

But then you’re wondering, why are there multiple description files? Other than just cleaner layout (put constraint information in a separate file, type information in another, etc.) it is because there are multiple evolutions of an architecture. The ‘i386’ architecture is a mess of evolutions that span word sizes and capabilities. The aarch64 architecture has within it many microarchitectures developed by various Arm licensee vendors like xgene, thunderxt88, falkor and also those developed by Arm such as the cortex-a57, cortex-a72, ares, etc. All of these have different behaviours and performance characteristics despite sharing an instruction set. For example on some microarchitecture one may prefer to emit the csel instruction instead of cmp, b.cond and multiple mov instructions to reduce code size (and hence improve performance) but on some other architecture, the csel instruction may have been designed really badly and hence it may well be cheaper to execute the 4+ instructions instead of the one csel. These are behaviour quirks that you select when you use the -mtune flag to optimise for a specific CPU. A lot of this information is also available in the C code of the architecture in the form of structures called cost tables. These are relative costs of various operations that help the RTL passes and some GIMPLE passes determine the best target code and optimisation behaviour accordingly for the CPU. Here’s an example of register move costs for the Qualcomm Centriq processor:

static const struct cpu_regmove_cost qdf24xx_regmove_cost =
{
  2, /* GP2GP  */
  /* Avoid the use of int<->fp moves for spilling.  */
  6, /* GP2FP  */
  6, /* FP2GP  */
  4 /* FP2FP  */
};

This tells us that in general, moving between general purpose registers is cheaper, moving between floating point registers is slightly more expensive and moving between general purpose and floating point registers is most expensive.

The other important detail in such machine description files is the pipeline description. This is a description of how the pipeline for a specific CPU microarchitecture is designed along with latencies for instructions in that pipeline. This information is used by the instruction scheduler pass to determine the best schedule of instructions for a CPU.

Now where do I start?

This is a lot of information to page in at once and if you’re like me, you’d want something more concrete to get started with understanding what gcc is doing. gcc has you covered there because it has flags that allow you to study the IR generated for the compiler at every stage of compilation. By every stage, I mean every pass! The -fdump-* set of flags allow you to dump the IR to study what gcc did in each pass. Particularly, -fdump-tree-* flags dump GIMPLE IR (in a C-like format so that it is not too complicated to read) into a file for each pass and the -fdump-rtl-* flags do the same for RTL. The GIMPLE IR can be dumped in its raw form as well (e.g. -fdump-tree-all-raw), which makes it much simpler to correlate with the code that is manipulating the GIMPLE in an optimisation pass.

The easiest way to get into gcc development (and compiler development in general) in my experience is the back end. Try tweaking the various cost tables to see what effect it has on code generation. Modify the instructions generated by the RTL descriptions and use that to look closer at one pass that interests you. Once you’re comfortable with making changes to gcc, rebuilding and checking its outputs, you can then try writing a pass, which is a slightly more involved process. Maybe I’ll write about it some day.

References

Comments

Hello FOSSASIA: Revisiting the event *and* the first program we write in C

Posted: Mar 19, 2017, 10:15

I was at FOSSAsia this weekend to deliver a workshop on the very basics of programming. It ended a pretty rough couple of weeks for me, with travel to Budapest (for Linaro Connect) followed immediately by the travel to Singapore. It seems like I don’t travel east in the timezone very well and the effects were visible with me napping at odd hours and generally looking groggy through the weekend at Singapore. It was however all worth it because despite a number of glitches, I had some real positives to take back from the conference.

The conference

FOSSAsia had been on my list of conferences to visit due to Kushal Das telling me time and again that I’d meet interesting people there. I had proposed a talk (since I can’t justify the travel just to attend) a couple of years ago but dropped out since I could not find sponsors for my talk and FOSSAsia was not interested in sponsoring me either. Last year I met Hong at SHD Belgaum and she invited me to speak at FOSSAsia. I gladly accepted since Nisha was going to volunteer anyway. However as things turned out in the end, my talk got accepted and I found sponsorship for travel and stay (courtesy Linaro), but Nisha could not attend.

I came (I’m still in SG, waiting for my flight) half-heartedly since Nisha did not accompany me, but the travel seemed worth it in the end. I met some very interesting people and was able to deliver a workshop that I was satisfied with.

Speaking of the workshop…

I was scheduled to talk on the last day (Sunday) first thing in the morning and I was pretty sure I was going to be the only person standing with nobody in their right minds waking up early on a Sunday for a workshop. A Sunday workshop also meant that I knew the venue and its deficiencies - the “Scientist for a Day” part of the Science Center was a disaster since it was completely open and noisy, with lunch being served right next to the room on the first day. I was wary of that, but the Sunday morning slot protected me from that and my workshop personally without such glitches.

The workshop content itself was based on an impromptu ‘workshop’ I did at FUDCon Pune 2015, but a little more organized. Here’s a blow by blow account of the talk for those who missed it, and also a reference for those who attended and would like a reference to go back to in future.

Hell Oh World

It all starts with this program. Hello World is what we all say when we are looking to learn a new language. However, after Hello World, we move up to learn the syntax of the language and then try to solve more complex user problems, ignoring the wonderful things that happened underneath Hello World to make it all happen. This session is an attempt to take a brief look into these depths. Since I am a bit of a cynic, my Hello World program is slightly different:

#include <stdio.h>

int
main (void)
{
  printf ("Hell Oh World!\n");
  return 0;
}

We compile this program:

$ gcc -o helloworld helloworld.c

We can see that the program prints the result just fine:

$ ./helloworld 
Hell Oh World!

But then there is so much that went into making that program. Lets take a look at the binary by using a process called disassembling, which prints the binary program into a human-readable format - well at least readable to humans that know assembly language programming.

$ objdump -d helloworld

We wrote only one function: main, so we should see only that. Instead however, we see so many functions that are present in the binary In fact, you you were lied to when they told back in college that main() is the entry point of the program! The entry point is the function called _start, which calls a function in the GNU C Library called __libc_start_main, which in turn calls the main function. When you invoke the compiler to build the helloworld program, you’re actually running a number of commands in sequence. In general, you do the following steps:

let us look at these steps one by one.

Preprocessing

gcc -E -o helloworld.i helloworld.c

Run this command instead of the first one to produce a pre-processed file. You’ll see that the resultant file has hundreds of lines of code and among those hundreds of lines, is this one line that we need: the prototype for printf so that the compiler identifies the call printf:

extern int printf (const char *__restrict __format, ...);

It is possible to just use this extern decl and avoid including the entire header file, but it is not good practice. The overhead of maintaining something like this is unnecessary, especially when the compiler can do the job of eliminating the unused bits anyway. We are better off just including a couple of headers and getting all declarations.

Compiling the preprocessed source

Contrary to popular belief, the compiler does not compile into binary .o - it only generates assembly code. It then calls the assembler in the binutils project to convert the assembly into object code.

$ gcc -S -o helloworld.s helloworld.i

The assembly code is now just this:

    .file   "helloworld.i"
    .section    .rodata
.LC0:
    .string "Hell Oh World!"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $.LC0, %edi
    call    puts
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (GNU) 6.3.1 20161221 (Red Hat 6.3.1-1)"
    .section    .note.GNU-stack,"",@progbits

which is just the main function and nothing else. The interesting thing there though is that the printf function call is replaced with puts because the input to printf is just a string without any format and puts is much faster than printf in such cases. This is an optimization by gcc to make code run faster. In fact, the code runs close to 200 optimization passes to attempt to improve the quality of the generated assembly code. However, it does not add all of those additional functions.

So does the assembler add the rest of the gunk?

Assembling the assembly

gcc -c -o helloworld.o helloworld.s

Here is how we assemble the generated assembly source into an object file. The generated assembly can again be disassembled using objdump and we see this:

helloworld.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 
: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: bf 00 00 00 00 mov $0x0,%edi 9: e8 00 00 00 00 callq e e: b8 00 00 00 00 mov $0x0,%eax 13: 5d pop %rbp 14: c3 retq

which is no more than what we saw with the compiler, just in binary format. So it surely is the linker adding all of the gunk.

Putting it all together

Now that we know that it is the linker adding all of the additional stuff into helloworld, lets look at how gcc invokes the linker. To do this, we need to add a -v to the gcc command. You’ll get a lot of output, but the relevant bit is this:

$ gcc -v -o helloworld helloworld.c
...

...
/usr/libexec/gcc/x86_64-redhat-linux/6.3.1/collect2 -plugin /usr/libexec/gcc/x86_64-redhat-linux/6.3.1/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/6.3.1/lto-wrapper -plugin-opt=-fresolution=/tmp/ccEdWzG5.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o helloworld /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1 -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../.. /tmp/cc3m0We9.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/6.3.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'helloworld' '-mtune=generic' '-march=x86-64'

This is a long command, but the main points of interest are all of the object files (*.o) that get linked in because the linker concatenates those and then resolves dependencies of unresolved references to functions (only puts in this case) among those and all of the libraries (libc.so via -lc, libgcc.so via -lgcc, etc.). To find out which of the object code files have the definition of a specific function, say, _start, disassemble each of them. You’ll find that crt1.o has the definition.

Static linking

Another interesting thing to note in the generated assembly is that the call is to puts@plt, which is not exactly puts. It is in reality a construct called a trampoline, which helps the code jump to the actual printf function during runtime. We need this because printf is actually present in libc.so.6, which the binary simply claims to need by encoding it in the binary. To see this, disassemble the binary using the -x flag:

$ objdump -x helloworld

helloworld:     file format elf64-x86-64
helloworld
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000400430
...
Dynamic Section:
  NEEDED               libc.so.6
...

This is dynamic linking. When a program is executed, what is actually called first is the dynamic linker (ld.so), which then opens all dependent libraries, maps them into memory, and then calls the _start function in the program. During mapping, it also fills in a table of data called the Global Offset Table with offsets of all of the external references (puts in our case) to help the trampoline jump to the correct location.

If you want to be independent of the dynamic linker, then you can link the program statically:

$ gcc -static -o helloworld helloworld.c

This will however result in bloating of the program and also has a number of other disadvantages, like having to rebuild for every update of its dependent libraries and sub-optimal performance since the kernel can no longer share pages among processes for common code.

BONUS: Writing the smallest program

The basics were done with about 10 minutes to spare, so I showed how one could write the smallest program ever. In principle, the smallest program in C is:

int
main (void)
{
  return 42;
}

As is evident though, this pulls in everything from the C and gcc libraries, so it is clearly hard to do this in C, so lets try it in assembly. We already know that _start is the main entry point of the program, so we need to implement that function. To exit the program, we need to tell the kernel to exit by invoking the exit_group syscall, which has syscall number 231. Here is what the function looks like:

.globl _start
_start:
    mov $0xe7, %rax
    mov $0x42, %rdi
    syscall

We can build this with gcc to get a very small binary but to do this, we need to specify that we don’t want to use the standard libraries:

gcc -o min -nostdlib min.s

The resultant file is 864 bytes, as opposed to the 8.5K binary from the C program. We can reduce this further by invoking the assembler and linker directly:

$ as -o min.o min.s
$ ld -o min min.o

This results in an even smaller binary, at 664 bytes! This is because gcc puts some extra meta information in the binary to identify its builds.

Conclusion

At this point we ran out of time and we had to cut things short. It was a fun interaction because there were even a couple of people with Macbooks and we spotted a couple of differences in the way the linker ran due to differences in the libc, despite having the same gcc installed. I wasn’t able to focus too much on the specifics of these differences and I hope they weren’t a problem for the attendees using Macs. In all it was a satisfying session because the audience seemed happy to learn about all of this. It looked like many of them had more questions (and wonderment, as I had when I learned these things for the first time) in their mind than they came in with and I hope they follow up and eventually participate in Open Source projects to fulfill their curiosity and learn further.

Comments

Higher order functions in C?

Posted: Jun 10, 2012, 10:01

The Structure and Interpretation of Computer Programs course has a class on higher order functions, which is the ability of a function to accept a function and/or returns another function that uses the input function. The C programming language has very limit capability to do this and it is limited to being able to accept function pointers or return function pointers. I wanted to explore this limitation further to figure out exactly how far I can stretch this, so I wrote a small C program that tries to emulate this concept. This is just random tinkering, so if the reader is looking for a specific lesson, then let me clarify that there is none.

I wrote a program to get square roots using the Newton-Raphson method -- the same one that was used in the course with an attempt to keep the structure as close as possible to the scheme example in the class video. The Lisp-based listing for the course can be found on the internet and here is what I wrote:

#include <stdio.h>
#include <stdlib.h>
typedef double (*generic_func_t) (double);
#define dx 0.000001
#define absolute(n) ((n) < 0 ? -(n) : (n))
generic_func_t deriv(generic_func_t f)
{
        double deriv_func(double x)
        {
                return (f(x + dx) - f(x)) / dx;
        }
        return deriv_func;
}
double fixed_point(generic_func_t f, double x)
{
        while (absolute(f(x) - x) > dx)
                x = f(x);       
        return f(x);                            
}
double newton(generic_func_t f, double guess)
{
        double newton_func(double y)
        {
                generic_func_t df = deriv(f);
                return (y - f(y) / df (y));
        }
        return fixed_point(newton_func, guess);
}
double sqrt(double x)
{
        double sqrt_func(double y)
        {
                return (x - y*y);
        }
        return newton(sqrt_func, 1);
}
int main(int argc, char **argv)
{
        if (argc < 2) {
                printf(“Usage: %s <number>\n”, argv[0]);
                exit (1);
        }
        printf (“%lf\n”, sqrt(strtod(argv[1], NULL)));
        return 0;
}
The first thing that is evident in the above is that the syntax is non-standard. I have declared some functions within another function and that is not allowed by the C standard. GCC however implements this concept of nested functions as an extension, so this compiled cleanly for me. Running this however is a completely different thing.

Executing the above program for any argument results in segfaults and analysis of the program under gdb shows that the generated code is quite ridiculous. This is expected however, since we have pushed the gcc nested function support too far. The nested functions within gcc are OK as long as they remain within the following limits:

  1. They are only used within the function that nests them, either directly or indirectly via some function calls. Essentially, control from the nested function should return before it’s nesting function returns.
  2. If they’re used outside the nesting function and called after the nesting function returns, then they don’t use any local variables or arguments of the nesting function
Of course, if (2) is true then the function need not have been nested in the first place. Looking at the listing above, the functions sqrt_func, newton_func and deriv_func are the nested functions. Of these however, only deriv_func is actually returned as a function to be used from within newton_func. This is wrong because it breaks (2). deriv_func uses the input argument f(x) which is essentially sqrt_func and is actually called after the deriv() function returns. Other than that however, all of our ‘functional-type’ code looks sane and within what is supported by gcc.

So I modified the deriv function and came up with the listing below:

#include <stdio.h>
#include <stdlib.h>
typedef double (*generic_func_t) (double);
#define dx 0.000001
#define absolute(n) ((n) < 0 ? -(n) : (n))
double deriv(double x, generic_func_t f)
{
        return (f(x + dx) - f(x)) / dx;
}
double fixed_point(generic_func_t f, double x)
{
        while (absolute(f(x) - x) > dx)
                x = f(x);
        return f(x);
}
double newton(generic_func_t f, double guess)
{
        double newton_func(double y)
        {
                double dy = deriv(y, f);
                return (y - f(y) / dy);
        }
        return fixed_point(newton_func, guess);
}
double sqrt(double x)
{
        double sqrt_func(double y)
        {
                return (x - y*y);
        }
        return newton(sqrt_func, 1);
}
int main(int argc, char **argv)
{
        if (argc < 2) {
                printf(“Usage: %s <number>\n”, argv[0]);
                exit (1);
        }
        printf (“%lf\n”, sqrt(strtod(argv[1], NULL)));
        return 0;
}
The only change above is that instead of accepting a function and returning another in deriv, I now accept a function and return the evaluation of the derived function for the specified value. This is highlighted in bold above. This program now builds with gcc and also runs correctly. It should similarly be possible to eliminate the other nested functions to get a much more compact and standards-compliant program.

Comments

GCC Workshop at GRC, IIT Bombay

Posted: Jul 08, 2010, 22:19

Mustafa (my manager) knew about my current fascination with learning the _real_ basics of computers (electronics, kernel, compilers, etc.), so he arranged for me to attend the GCC workshop held by IIT Bombay this week. My first reaction was that I would be wasting my time there since I didn't know the first thing about compiler theory and was a novice at best in assembly language programming. He said it wouldn't hurt to try. So I had to try.

I got a chance on the day before the workshop to read up some about things like IRs, RTLs, etc. It was enough that I would not be completely lost from day 1. But I did not attend day 1 at all, thanks to the countrywide strike that crippled all public transport (and some people too). So I spent that day working and also trying to cover what would otherwise have been covered in day 1 -- the various phases of compilation, passes, gray box probing to find out some more about intermediate outputs, etc. I got hold of last year's slides, so it was a little easier.

So I finally made it to days 2, 3 and 4. The first thing I noticed during the lecture sessions was that the professor really knew his stuff. He was well acquainted with the internal layout of gcc and was able to explain it well enough that I really _got_ it. Overall I come out of the sessions today with much more knowledge about gcc than I could ever gain in 4 days on my own. Here are some observations that I made during the course of this session.

The professor really knew his stuff. I say this again so that it does not look like I am ignoring that. There are also a lot of really talented individuals at the GRC, who are doing some pretty interesting research based on gcc. The trouble though is that there seem to have been no efforts whatsoever to share these ideas upstream.

One such idea is the Generic Data Flow Analyzer (GDFA). It is a patch to gcc that provides a data flow analyzer, which can be used to find and eliminate dead code or unused variables. It adds a gimple pass to the compilation sequence and intends to replace the current dead code elimination and unused variable elimination passes with the same code called with different parameters. While the idea is pretty interesting, the sad thing is that there are no signs of an attempt to push this idea upstream. All I could find was an announcement to the gcc mailing list, but no request for comments or for inclusion of the patch.

This is only one of many more ideas that are brewing in the GRC in the minds of some very talented people. But one felt that these ideas were being used only to get degrees and nothing was being done to actually test their feasibility in live production level code. It would be nice to see some of these ideas actually presented upstream with a genuine interest in getting them incorporated.

To conclude, it is a pretty good session for those who want to get started with learning compilers and gcc internals.

Comments