_FORTIFY_SOURCE=3 performance
Posted: Jan 05, 2023, 16:28So early last year I finished implemented everything needed for a fully working _FORTIFY_SOURCE=3
so that disrtributions can use it out of the box. OpenSUSE adopted it almost immediately and Gentoo started the work of adding it to their hardened profile. I proposed to make it the default for Fedora 38 after some tests but people quoted to me this blog post that some guy wrote, telling me that there’s a performance issue. Since my explanations and clarifications in the Fedora wiki or on the Fedora devel list is not sufficient (the feature was approved but the “_FORTIFY_SOURCE=3
has performance overhead” claims don’t seem to stop), here’s a blog post for a blog post, stating conclusively that the performance issue is theoretical and overstated, the guy didn’t know what he was talking about when he wrote it.
That guy is working on a clarification blog post of his own, describing in some more detail why the concern is overblown but he has to jump through editorial hoops of a multi-billion dollar corporation that pays his salary, so his apology to me is going to take a while. Whenever he gets to publish his work, I’ll link it here so that it’s two blog posts against one. Take that!
Relocations: fantastic symbols, but where to find them?
Posted: Apr 10, 2020, 12:36Update: There’s a Links section at the end that should give you a list of all the reference you’ll ever need to undrstand Linkers and Relocations!
When I started out in compilers years ago, I found relocations especially hard to wrap my head around. They’re just simple math in the end but they combine elements from different places that make it complicated enough that many (as did I back then) assume it to be some kind of black art. The Oracle documentation on relocation processing and relocation sections is pretty much the best thing I’ve found on the internet that explains relocations and they’re a great start if you already know what you’re doing. This is why I figured I ought to try writing something more accessible that puts some of the bits into practice. The fact that I’m currently working on relocation processing makes it that much simpler for me to just bash at the keyboard and commit the stuff in my memory to a more persistent medium before it gets swapped out to make space for kitten photos.
While this tutorial is meant to be beginner friendly, it does assume though that the reader has some awareness of the ELF format, at least to the extent of knowing about different ELF sections. It also assumes that the reader has some undersanding of assembly language programming, bonus if you know aarch64 assembly since that’s the flavour of the examples.
Finding each other in this crowded world
The idea of relocation is quite simple: when compiling programs, we need flexibility to build components of programs independently and then have them link together. This could be in the same source file where we don’t know where parts of the source would end up, multiple source files built into different object files or sets of object files built into different libraries that reference objects in each other. This flexibility is achieved using relocations. Here’s a very simple example using aarch64 assembly:
.data
.globl somedata
somedata:
.8byte 0x42
.text
.globl start
_start:
nop
ldr x2, somedata
This is a simple program that loads somedata
into register x2
. It doesn’t do much and if you try to run the program it will crash, but it is a useful example that shows an assembly source where parts of it end up in different ELF sections.
The interesting bit here is a load instruction in the text section that is reading a variable somedata
from the data section. The load instruction encodes within itself, the offset of somedata
from itself, aka the PC-relative offset. The assembler can see both, the variable and the instruction but it cannot say for sure where they will be in memory at this point, because it does not know how far the data section will be from the text section in the final library or executable. To work around this limitation, it needs to leave the offset field in the ldr instruction blank so that the linker can finally fill it in. It also needs to provide instructions to the linker to tell it how to fill in this field.
This is where relocations come into play.
In this specific case, one may assemble the example and disassemble it using objdump -Dr
to find this disassembly:
Disassembly of section .text:
0000000000000000 <_start>:
0: d503201f nop
4: 58000002 ldr x2, 0 <_start>
4: R_AARCH64_LD_PREL_LO19 somedata
Disassembly of section .data:
0000000000000000 <somedata>:
0: 00000042 .inst 0x00000042 ; undefined
4: 00000000 .inst 0x00000000 ; undefined
The R_AARCH64_LD_PREL_LO19
in that output is the relocation. How did it land in there in between the instructions you ask? Well, it didn’t. The relocations are actually in a separate section of their own, as is evident with objdump -r
:
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000004 R_AARCH64_LD_PREL_LO19 somedata
even better with readelf -r
because it tells you that the relocations are essentially just a table of entries in the .rela.text
section:
Relocation section '.rela.text' at offset 0x110 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
000000000004 000500000111 R_AARCH64_LD_PREL 0000000000000000 somedata + 0
All sections with relocation entries have names with prefix .rela
or .rel
followed by the name of the section for which the relocations need to be applied. Based on these section names, it’s evident that there are two types of relocation entries, REL and RELA. There are a number of important pieces of information the assembler leaves for the linker here:
- The memory address it needs to fix up
- Which symbol the memory address is referring to
- What is the offset from the symbol that it should finally consider as the result
- How should it perform the computation
All of this can be seen in the above relocation table. Each entry in the relocation table is basically a C structure of the following form for REL type relocations:
typedef struct {
Elf64_Addr r_offset;
Elf64_Xword r_info;
} Elf64_Rel;
and for RELA:
typedef struct {
Elf64_Addr r_offset;
Elf64_Xword r_info;
Elf64_Sxword r_addend;
} Elf64_Rela;
r_offset
corresponds to the Offset
entry in the readelf output above and is typically the memory address that needs to be fixed up. The offset from the symbol, aka the addendum is present only in RELA type relocations and it corresponds to the r_addend
element in the structure and the Addend
field in the readelf output. The symbol and computation related information is encoded in the r_info
field.
The Elf32_Rel
and Elf32_Rela
structures are similar, except for the data sizes of the elements.
Symbol hunting
The ‘what to write’ is where the r_info
field comes in. That’s what the linker needs to figure out before the where and how, which comes later.
This field is split into two 32-bit parts (16-bit for 32-bit architectures). The lower part tells the linker how to perform the computation and the upper part tells the linker what the target symbol is. The upper part is a symbol ID, which basically is an index into the symbol table. In our example above, the r_info
is 0x000500000111
, which means that the symbol id is 0x5. We can pull out the symbol table using readelf -s
:
Symbol table '.symtab' contains 7 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 SECTION LOCAL DEFAULT 1
2: 0000000000000000 0 SECTION LOCAL DEFAULT 3
3: 0000000000000000 0 SECTION LOCAL DEFAULT 4
4: 0000000000000000 0 NOTYPE LOCAL DEFAULT 1 $x
5: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 3 somedata
6: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 1 _start
and we find that the symbol id 0x5 is somedata
and has the Value
(i.e. the address of the symbol relative to its section) of 0x0.
Now we need to figure out how to combine all of this information together using the lower part of r_info
, which is 0x111. This number corresponds to R_AARCH64_PREL_LO19
, which has a specific meaning. An architecture defines a number of such relocations with descriptions of how they’re supposed to be applied. In case of R_AARCH64_PREL_LO19
it means “Add the symbol address and addend and subtract from it, the location of the memory address that is being fixed up”. Think about that a bit and you’ll notice that it is how you would compute the PC-relative offset of somedata
from the instruction, i.e. subtract the location of the fixup (i.e. the LDR instruction) from the address of somedata
. In short form (and you’ll see this and similar notations to describe relocations), it is written as S + A - P
.
Putting things together
Now that we know what the target symbol is and how to compute the pc-relative offset, we need to compute the final symbol address, the final target address and then do the actual fixup. This is done near the end of the linking process in the GNU linker, when all sections have been laid out and we finally are in a position to know the relative addresses. The linker will then make the computation (i.e. S+A-P) and patch in the result into the LDR instruction before writing the output to the final binary. Here is what our result looks like:
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: d503201f nop
4000b4: 58080022 ldr x2, 4100b8 <somedata>
Disassembly of section .data:
00000000004100b8 <somedata>:
4100b8: 00000042 .inst 0x00000042 ; undefined
4100bc: 00000000 .inst 0x00000000 ; undefined
Notice that the opcode of the LDR instuction is now different (and as a result the instruction itself is also different) and includes the 0x8002, which is basically the encoded difference (0x10004) from somedata
.
Raising the stakes: Dynamic Relocations
This is all great when all our symbols are local and have predictable layouts such that PC-relative relocations such as R_AARCH64_PREL_LO19
are sufficient to describe and resolve in between assembling and linking a program. What happens however, when these symbol references cross boundaries of sections in ways that we cannot predict at compile or link time? What happens when your symbol references cross boundaries of your shared object? These are problems that need to be solved to make Position Independent Code (PIC) possible. PIC is when your program (and sections within your program) could get mapped anywhere in memory and you need your code to adapt to that fact.
Take this very simple example:
.data
somedata:
.8byte 0x42
somedata2:
.8byte somedata
.text
.globl _start
_start:
ret
There’s next to nothing here; just somedata
like in our previous example and a somedata2
which points to somedata
. However, in this next-to-nothing example lies an interesting complication that needs to be resolved at runtime: the value in somedata2
cannot be computed at compile time; it needs a fixup at runtime! Let’s walk through the compilation to see how we get to the final result. First, the disassembly to understand what the assembler did for us:
Disassembly of section .text:
0000000000000000 <_start>:
0: d65f03c0 ret
Disassembly of section .data:
0000000000000000 <somedata>:
0: 00000042 .inst 0x00000042 ; undefined
4: 00000000 .inst 0x00000000 ; undefined
0000000000000008 <somedata2>:
...
8: R_AARCH64_ABS64 .data
We see now that the address to be relocated is somedata2
in the .data
section and it is of type R_AARCH64_ABS64
. This is simple relocation that instructs the linker to compute S + A
to get the result, i.e. get the symbol address of somedata
and add the addendum (0 again in this case). This in fact would be the final result for a statically linked result (using ld -static
) and we’d lose the relocation in favour of the absolute address written into somedata2
:
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: d65f03c0 ret
Disassembly of section .data:
00000000004100b4 <somedata>:
4100b4: 00000042 .inst 0x00000042 ; undefined
4100b8: 00000000 .inst 0x00000000 ; undefined
00000000004100bc <somedata2>:
4100bc: 004100b4 .inst 0x004100b4 ; undefined
4100c0: 00000000 .inst 0x00000000 ; undefined
When compiling a shared object however (i.e. ld -shared
) we intend to produce a position independent DSO (dynamic shared object) and to achieve that the linker now emits a relocation to describe how to compute the final address to assign to somedata2
and where the memory address can be located. In this example, it is the R_AARCH64_RELATIVE
dynamic relocation, as seen using objdump -DR
(output snipped to retain only useful bits):
Disassembly of section .data:
0000000000011000 <somedata>:
11000: 00000042 .inst 0x00000042 ; undefined
11004: 00000000 .inst 0x00000000 ; undefined
0000000000011008 <somedata2>:
11008: 00011000 .inst 0x00011000 ; undefined
11008: R_AARCH64_RELATIVE *ABS*+0x11000
1100c: 00000000 .inst 0x00000000 ; undefined
This relocation is interesting not just for the reason that it is dynamic, but also because it is a S+A
type relocation that puts the non-relocated address (i.e. the link time address) of somedata
into its addend. This relocation also does not reference a symbol; instead it references an *ABS*
value, which is basically the offset at which this DSO would be loaded during execution. It is the dynamic linker in the C runtime library (ld.so in GNU systems) that reads these relocations from the .rela.dyn
section. Because this relocation is based on an absolute address computed by the static linker, the dynamic linker does not have to do a symbol lookup to resolve the relocation.
The other difference from static relocations is that when a dynamic relocation references a symbol in its r_info
, it is looked up in the .dynsym
section, i.e. in the dynamic symbol table and not in the regular symbol table.
Final Thoughts
There are a number of other cases that the linker needs to cater for when it comes to relocations such as entries in Global Offset Tables, resolution of intermediate functions and Thread-Local Storage. Thankfully though, the first principles behind all those relocations are the same as the above and you can apply this knowledge to GOT, TLS and IFUNC relocations as well. GOT relocations for example reference GOT base, which the linker knows where to find (since it sets up the .got
section in the first place) and can use that information to compute the location to fix up. Other than this special knowledge, everything else remains pretty much the same.
Once you’re equipped with these first principles, the next task is to figure out where documentation for specific relocations is for every architecture. While the binutils documentation makes some effort to document the public facing part of relocations, the detailed documentation is usually distributed by the architecture chip vendors. The AArch64 ELF documentation for example is hosted on the Arm website.
Links
- @gnutools pointed me to a full blog series by Ian Lance Taylor on Linkers, indexed by LWN
- @matt_dz maintains a comprehensive set of links on linking and loading on GitHub. This looks like everything you’ll ever need to understand ELF and linkers!
gcc under the hood
Posted: Oct 03, 2019, 14:12My background in computers is a bit hacky for a compiler engineer. I never had the theoretical computer science base that the average compiler geek does (yes I have a Masters in Computer Applications, but it’s from India and yes I’m going to leave that hanging without explanation) and I have pretty much winged it all these years. What follows is a bunch of thoughts (high five to those who get that reference!) from my winging it for almost a decade with the GNU toolchain. If you’re a visual learner then I would recommend watching my talk video at Linaro Connect 2019 instead of reading this. This is an imprecise transcript of my talk, with less silly quips and in a more neutral accent.
Hell Oh World!
It all started as a lightning talk at SHD Belgaum where I did a 5 minute demonstration of how a program goes from source code to a binary program. I got many people asking me to talk about this in more detail and it eventually became a full hour workshop at FOSSASIA in 2017. Those remained the basis for the first part of my talk at Connect, in fact they’re a bit more detailed in their treatment of the subject of purely taking code from source to binary.
Go read those posts, I’ll wait.
Under the hood
Welcome back! So now you know almost exactly how code goes from source to binary and how it gets executed on your computer. Congratulations, you now have a headstart into hacking on the dynamic loader bits in glibc! Now that we’ve explored the woods around the swamp, lets step into the swamp. Lets take a closer look at what gcc does with our source code.
Babelfish
The first thing a compiler does is to read the source code and make it into a data structure form that is easy for the computer to manipulate. Since the computer deals best with binary data, it converts the text form of the source code language into a tree-like structure. There is plenty of literature available on how that is implemented; in fact most compiler texts end up putting too much focus on this aspect of the compiler and then end up rushing through the real fun stuff that the compiler does with the program you wrote.
The data structure that the compiler translates your source code into is called an Intermediate Representation (IR), and gcc has three of them!
This part of the compiler that parses the source code into IR is known as the front end. gcc has many such front ends, one for each language it implements; they are all cluttered into the gcc/
directory. The C family of languages has a common subset of code in the gcc/c-family
directory and then there are specializations like the C++ front end, implemented in the gcc/cp
directory. There is a directory for fortran, another for java, yet another for go and so on. The translation of code from source to IR varies from frontend to frontend because of language differences, but they all have one thing in common; their output is an IR called GENERIC
and it attempts to be language-independent.
Optimisation passes
Once gcc translates the source code into its IR form, it runs the IR through a number of passes (about 200 of them, more or less depending on the -O
flag you pass) to try and come up with the most optimal machine code representation of the program. gcc builds source code a file at a time, which is called a Translation Unit (TU). A lot of its optimisation passes operate at the function level, i.e. individual functions are seen as independent units and are optimised separately. Then there are Inter-procedural analysis (IPA) passes that look at interactions of these functions and finally there is Link Time Optimisation(LTO) that attempts to analyse source code across translation units to potentially get even better results.
Optimisation passes can be architecture-independent or architecture-dependent.
Architecture independent passes do not care too much about the underlying machine beyond some basic details like word size, whether the CPU has floating point support, vector support, etc. These passes have some configuration hooks that allow their behaviour to be modified according to the target CPU architecture but the high level behaviour is architecture-agnostic. Architecture-independent passes are the holy grail for optimisation because they usually don’t get old; optimisations that work today will continue to work regardless of CPU architecture evolution.
Architecture-dependent passes need more information about the architecture and as a result their results may change as architectures evolve. Register allocation and instruction scheduling for example are very interesting and complex problems that architecture-specific passes handle. The instruction scheduling problem was a lot more critical back in the day when CPUs could only execute code sequentially. With out-of-order execution, the scheduling problem has not become less critical, but it has definitely changed in nature. Similarly the register allocation problem can be complicated by various factors such as number of locigal registers, how they share physical register files, how costly moving between different types of registers is, and so on. Architecture-dependent passes have to take into consideration all of these factors.
The final pass in architecture-dependent passes does the work of emitting assembly code for the target CPU. The architecture-independent passes constitute what is known as the middle-end of the compiler and the architecture-dependent passes form the compiler backend.
Each pass is a complex work of art, mathematics and logic and may have one or more highly cited research papers as their basis. No single gcc engineer would claim to understand all of these passes; there are many who have spent most of their careers on a small subset of passes, such is their complexity. But then, this is also a testament to how well we can work together as humans to create something that is significantly more complex than what our individual minds can grasp. What I mean to say with all this is, it’s OK to not know all of the passes, let alone know them well; you’re definitely not the only one.
Optimisation passes are all listed in gcc/passes.def
and that is the sequence in which they are executed. A pass is defined as a class with a gate and execute function that determine whether to run and what to do respectively. Here’s what a single pass namespace looks like:
/* Pass data to help the pass manager classify, prepare and cleanup. */
const pass_data pass_data_my_pass =
{
GIMPLE_PASS, /* type */
"my_pass", /* name */
OPTGROUP_LOOP, /* optinfo_flags */
TV_TREE_LOOP, /* tv_id */
PROP_cfg, /* properties_required */
0, /* properties_provided */
0, /* properties_destroyed */
0, /* todo_flags_start */
0, /* todo_flags_finish */
};
/* This is a GIMPLE pass. I know you don't know what GIMPLE is yet ;) */
class pass_my_pass : public gimple_opt_pass
{
public:
pass_my_pass (gcc::context *ctxt)
: gimple_opt_pass (pass_data_my_pass, ctxt)
{}
/* opt_pass methods: */
virtual bool gate (function *) { /* Decide whether to run. */ }
virtual unsigned int execute (function *fn);
};
unsigned int
pass_my_pass::execute (function *)
{
/* Magic! */
}
We will not go into the anatomy of a pass yet. That is perhaps a topic for a follow-up post.
GENERIC
GENERIC is the first IR that gets generated from parsing the source code. It is a tree structure that has been present since the earliest gcc versions. Its core data structure is a tree_node
, which is a hierarchy of structs, with tree_base
as the Abraham. OK, if you haven’t been following gcc development, this can come as a surprise: a significant portion of gcc is now in c++!
It’s OK, take a minute to mourn/celebrate that.
The tree_node
can mean a lot of things (it is a union
), but the most important one is the tree_typed
node. There are various types of tree_typed
, like tree_int_cst
, tree_identifier
, tree_string
, etc. that describe the elements of source code that they house. The base struct, i.e. tree_typed
and even further up, tree_base
have flags that have metadata about the elements that help in optimisation decisions. You’ll almost never use generic in the context of code traversal in optimisation passes, but the nodes are still very important to know about because GIMPLE continues to use them to store operand information. Take a peek at gcc/tree-core.h
and gcc/tree.def
to take a closer look at all of the types of nodes you could have in GENERIC.
What is GIMPLE
you ask? Well, that’s our next IR and probably the most important one form the context of optimisations.
OK so now you have enough background to go look at the guts of the GENERIC tree IR in gcc. Here’s the gcc internals documentation that will help you navigate all of the convenience macros to analyze th tree nodes.
GIMPLE
The GIMPLE IR is a workhorse of gcc. Passes that operate on GIMPLE are architecture-independent.
The core structure in GIMPLE is a struct gimple
that holds all of the metadata for a single gimple statement and is also a node in the list of gimple statements. There are various structures named gimple_statement_with_ops_*
that have the actual operand information based on its type. Once again like with GENERIC, it is a hierarchy of structs. Note that the operands are all of the tree
type so we haven’t got rid of all of GENERIC. gcc/gimple.h
is where all of these structures are defined and gcc/gimple.def
is where all of the different types of gimple statements are defined.
Where did my control flow go?
But how is it that a simple list of gimple statements is sufficient to traverse a program that has conditions and loops? Here is where things get interesting. Every sequence of GIMPLE statements is housed in a Basic Block (BB) and the compiler, when translating from GENERIC to GIMPLE (also known as lowering, since GIMPLE is structurally simpler than GENERIC), generates a Control FLow Graph (CFG) that describes the flow of the function from one BB to another. The only control flow idea one then needs in GIMPLE to traverse code from within the gimple statement context is a jump from one block to another and that is fulfilled by the GIMPLE_GOTO statement type. The CFG with its basic blocks and edges connecting those blocks, takes care of everything else. Routines to generate and manipulate the CFG are defined in gcc/cfg.h
and gcc/cfg.c
but beware, you don’t modify the CFG directly. Since CFG is tightly linked with GIMPLE (and RTL, yes that’s our third and final IR), it provides hooks to manipulate the graph and update GIMPLE if necessary.
The last interesting detail about CFG is that it has a special construct for loops, because they’re typically the most interesting subjects for optimisation: you can splice them, unroll them, distribute them and more to produce some fantastic performance results. gcc/cfgloop.h
is there you’ll find all of the routines you need to traverse and manipulate loops.
The final important detail with regard to GIMPLE is the Single Static Assignment (SSA) form. Typical source code would have variables that get declared, assigned to, manipulated and then stored back into memory. Essentially, it is multiple operations on a single variable, sometimes destroying its previous contents as we reuse variables for different things that are logically related in the context of the high level program. This reuse however makes it complicated for a compiler to understand what’s going on and hence it ends up missing a host of optimisation opportunities.
To make things easier for optimisation passes, these variables are broken up into versions of themselves that have a specific start and end point in their lifecycle. This is called the Single Static Assignment form of a variable, where each version of the variable as a single starting point, viz. its definition. So if you have code like this:
x = 10;
x += 20;
it becomes:
x_1 = 10;
x_2 = x_1 + 20;
where x_1
and x_2
are versions of x
. If you have versions of variables in conditions, things get interesting and the compiler deals with it with a mysterious concept called PHI nodes. So this code:
if (n > 10)
x = 10;
else
x = 20;
return x;
becomes:
if (n > 10)
x_1 = 10;
else
x_2 = 20;
# x_3 = PHI<x_1, x_2>;
return x_3;
So the PHI node is a conditional selector of the earlier two versions of the variable and depending on the results of the optimisation passes, you could eliminate versions of the variables altogether or use CPU registers more efficiently to store those variable versions.
There you go, now you have everything you need to get started on hacking GIMPLE. I know this part is a bit heavy but guess what, this is where you can seriously start thinking about hacking on gcc! When you jump in, you’ll need the more detailed information in the gcc internals manual on GIMPLE, CFG and GIMPLE optimisations.
RTL
We are yet another step closer to generating assembly code for our assembler and linker to build into the final program. Recall that GIMPLE is largely architecture independent, so it works on high level ideas of statements, expressions and types and their relationships. RTL is much more primitive in comparison and is designed to mimic sequential architecture instructions. Its main purpose is to do architecture-specific work, such as register allocation and scheduling instructions in addition to more optimisation (because you can never get enough of that!) passes that make use of architecture information.
Internally, you will encounter two forms of RTL, one being the rtx
struct that is used for most transformations and passes in the compiler. The other form is in text to map machine instructions to RTL and these are Lisp-like S expressions. These are used to make machine descriptions, where you can specify machine instructions for all of the common operations such as add, sub, multiply and so on. For example, this is what the description of a jump looks like for aarch64:
(define_insn "jump"
[(set (pc) (label_ref (match_operand 0 "" "")))]
""
"b\\t%l0"
[(set_attr "type" "branch")]
)
Machine description files are in the gcc/config
directory in their respective architecture subdirectory and have the .md
suffix. The main aarch64 machine description file for example is gcc/config/aarch64/aarch64.md
. These files are parsed by a tool in gcc to generate C code with rtx
structures for each of those S-expressions.
In general, the gcc/config
directory contains source code that handles compilation for that specific architecture. In addition to the machine descriptions these directories also have C code that enhance the RTL passes to exploit as much of th architecture information as they possibly can. This is where some of the detailed architecture-specific analysis of the RTL instructions go. For example, combination of loads into load pairs for aarch64 is an important task and it is done with a combination of machine description and some code to peek into and rearrange neighbouring RTL instructions
But then you’re wondering, why are there multiple description files? Other than just cleaner layout (put constraint information in a separate file, type information in another, etc.) it is because there are multiple evolutions of an architecture. The ‘i386’ architecture is a mess of evolutions that span word sizes and capabilities. The aarch64 architecture has within it many microarchitectures developed by various Arm licensee vendors like xgene
, thunderxt88
, falkor
and also those developed by Arm such as the cortex-a57
, cortex-a72
, ares
, etc. All of these have different behaviours and performance characteristics despite sharing an instruction set. For example on some microarchitecture one may prefer to emit the csel
instruction instead of cmp
, b.cond
and multiple mov
instructions to reduce code size (and hence improve performance) but on some other architecture, the csel
instruction may have been designed really badly and hence it may well be cheaper to execute the 4+ instructions instead of the one csel
. These are behaviour quirks that you select when you use the -mtune
flag to optimise for a specific CPU. A lot of this information is also available in the C code of the architecture in the form of structures called cost tables. These are relative costs of various operations that help the RTL passes and some GIMPLE passes determine the best target code and optimisation behaviour accordingly for the CPU. Here’s an example of register move costs for the Qualcomm Centriq processor:
static const struct cpu_regmove_cost qdf24xx_regmove_cost =
{
2, /* GP2GP */
/* Avoid the use of int<->fp moves for spilling. */
6, /* GP2FP */
6, /* FP2GP */
4 /* FP2FP */
};
This tells us that in general, moving between general purpose registers is cheaper, moving between floating point registers is slightly more expensive and moving between general purpose and floating point registers is most expensive.
The other important detail in such machine description files is the pipeline description. This is a description of how the pipeline for a specific CPU microarchitecture is designed along with latencies for instructions in that pipeline. This information is used by the instruction scheduler pass to determine the best schedule of instructions for a CPU.
Now where do I start?
This is a lot of information to page in at once and if you’re like me, you’d want something more concrete to get started with understanding what gcc is doing. gcc has you covered there because it has flags that allow you to study the IR generated for the compiler at every stage of compilation. By every stage, I mean every pass! The -fdump-*
set of flags allow you to dump the IR to study what gcc did in each pass. Particularly, -fdump-tree-*
flags dump GIMPLE IR (in a C-like format so that it is not too complicated to read) into a file for each pass and the -fdump-rtl-*
flags do the same for RTL. The GIMPLE IR can be dumped in its raw form as well (e.g. -fdump-tree-all-raw
), which makes it much simpler to correlate with the code that is manipulating the GIMPLE in an optimisation pass.
The easiest way to get into gcc development (and compiler development in general) in my experience is the back end. Try tweaking the various cost tables to see what effect it has on code generation. Modify the instructions generated by the RTL descriptions and use that to look closer at one pass that interests you. Once you’re comfortable with making changes to gcc, rebuilding and checking its outputs, you can then try writing a pass, which is a slightly more involved process. Maybe I’ll write about it some day.
References
- The GCC internals manual is the canonical place to read (and fix up) documentation for GCC internals. It is wonderfully detailed and hopelessly outdated at the same time. Bringing it up to date is a task by itself and the project continues to look for volunteers to do that.
- David Malcolm has a more functional newbie guide for fledgling gcc hackers who might struggle with debugging gcc and getting involved in the gcc development process
- The GCC Resource Center workshop on GCC is where I cut my teeth on gcc internals. I don’t think their workshop is active anymore but they have presentations and other literature there that is still very relevant.
- I wrote about micro-optimisations in the past and those ideas are great to try on gcc.
Hello FOSSASIA: Revisiting the event *and* the first program we write in C
Posted: Mar 19, 2017, 10:15I was at FOSSAsia this weekend to deliver a workshop on the very basics of programming. It ended a pretty rough couple of weeks for me, with travel to Budapest (for Linaro Connect) followed immediately by the travel to Singapore. It seems like I don’t travel east in the timezone very well and the effects were visible with me napping at odd hours and generally looking groggy through the weekend at Singapore. It was however all worth it because despite a number of glitches, I had some real positives to take back from the conference.
The conference
FOSSAsia had been on my list of conferences to visit due to Kushal Das telling me time and again that I’d meet interesting people there. I had proposed a talk (since I can’t justify the travel just to attend) a couple of years ago but dropped out since I could not find sponsors for my talk and FOSSAsia was not interested in sponsoring me either. Last year I met Hong at SHD Belgaum and she invited me to speak at FOSSAsia. I gladly accepted since Nisha was going to volunteer anyway. However as things turned out in the end, my talk got accepted and I found sponsorship for travel and stay (courtesy Linaro), but Nisha could not attend.
I came (I’m still in SG, waiting for my flight) half-heartedly since Nisha did not accompany me, but the travel seemed worth it in the end. I met some very interesting people and was able to deliver a workshop that I was satisfied with.
Speaking of the workshop…
I was scheduled to talk on the last day (Sunday) first thing in the morning and I was pretty sure I was going to be the only person standing with nobody in their right minds waking up early on a Sunday for a workshop. A Sunday workshop also meant that I knew the venue and its deficiencies - the “Scientist for a Day” part of the Science Center was a disaster since it was completely open and noisy, with lunch being served right next to the room on the first day. I was wary of that, but the Sunday morning slot protected me from that and my workshop personally without such glitches.
The workshop content itself was based on an impromptu ‘workshop’ I did at FUDCon Pune 2015, but a little more organized. Here’s a blow by blow account of the talk for those who missed it, and also a reference for those who attended and would like a reference to go back to in future.
Hell Oh World
It all starts with this program. Hello World is what we all say when we are looking to learn a new language. However, after Hello World, we move up to learn the syntax of the language and then try to solve more complex user problems, ignoring the wonderful things that happened underneath Hello World to make it all happen. This session is an attempt to take a brief look into these depths. Since I am a bit of a cynic, my Hello World program is slightly different:
#include <stdio.h> int main (void) { printf ("Hell Oh World!\n"); return 0; }
We compile this program:
$ gcc -o helloworld helloworld.c
We can see that the program prints the result just fine:
$ ./helloworld Hell Oh World!
But then there is so much that went into making that program. Lets take a look at the binary by using a process called disassembling, which prints the binary program into a human-readable format - well at least readable to humans that know assembly language programming.
$ objdump -d helloworld
We wrote only one function: main, so we should see only that. Instead however, we see so many functions that are present in the binary In fact, you you were lied to when they told back in college that main() is the entry point of the program! The entry point is the function called _start, which calls a function in the GNU C Library called __libc_start_main, which in turn calls the main function. When you invoke the compiler to build the helloworld program, you’re actually running a number of commands in sequence. In general, you do the following steps:
- Preprocess the source code to expand macros and includes
- Compile the source to assembly code
- Assemble the assembly source to binary object code
- Link the code against its dependencies to produce the final binary program
let us look at these steps one by one.
Preprocessing
gcc -E -o helloworld.i helloworld.c
Run this command instead of the first one to produce a pre-processed file. You’ll see that the resultant file has hundreds of lines of code and among those hundreds of lines, is this one line that we need: the prototype for printf so that the compiler identifies the call printf:
extern int printf (const char *__restrict __format, ...);
It is possible to just use this extern decl and avoid including the entire header file, but it is not good practice. The overhead of maintaining something like this is unnecessary, especially when the compiler can do the job of eliminating the unused bits anyway. We are better off just including a couple of headers and getting all declarations.
Compiling the preprocessed source
Contrary to popular belief, the compiler does not compile into binary .o - it only generates assembly code. It then calls the assembler in the binutils project to convert the assembly into object code.
$ gcc -S -o helloworld.s helloworld.i
The assembly code is now just this:
.file "helloworld.i" .section .rodata .LC0: .string "Hell Oh World!" .text .globl main .type main, @function main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl $.LC0, %edi call puts movl $0, %eax popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size main, .-main .ident "GCC: (GNU) 6.3.1 20161221 (Red Hat 6.3.1-1)" .section .note.GNU-stack,"",@progbits
which is just the main function and nothing else. The interesting thing there though is that the printf function call is replaced with puts because the input to printf is just a string without any format and puts is much faster than printf in such cases. This is an optimization by gcc to make code run faster. In fact, the code runs close to 200 optimization passes to attempt to improve the quality of the generated assembly code. However, it does not add all of those additional functions.
So does the assembler add the rest of the gunk?
Assembling the assembly
gcc -c -o helloworld.o helloworld.s
Here is how we assemble the generated assembly source into an object file. The generated assembly can again be disassembled using objdump and we see this:
helloworld.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: bf 00 00 00 00 mov $0x0,%edi 9: e8 00 00 00 00 callq e e: b8 00 00 00 00 mov $0x0,%eax 13: 5d pop %rbp 14: c3 retq
which is no more than what we saw with the compiler, just in binary format. So it surely is the linker adding all of the gunk.
Putting it all together
Now that we know that it is the linker adding all of the additional stuff into helloworld, lets look at how gcc invokes the linker. To do this, we need to add a -v to the gcc command. You’ll get a lot of output, but the relevant bit is this:
$ gcc -v -o helloworld helloworld.c ... ... /usr/libexec/gcc/x86_64-redhat-linux/6.3.1/collect2 -plugin /usr/libexec/gcc/x86_64-redhat-linux/6.3.1/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/6.3.1/lto-wrapper -plugin-opt=-fresolution=/tmp/ccEdWzG5.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o helloworld /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1 -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../.. /tmp/cc3m0We9.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/6.3.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/6.3.1/../../../../lib64/crtn.o COLLECT_GCC_OPTIONS='-v' '-o' 'helloworld' '-mtune=generic' '-march=x86-64'
This is a long command, but the main points of interest are all of the object files (*.o) that get linked in because the linker concatenates those and then resolves dependencies of unresolved references to functions (only puts in this case) among those and all of the libraries (libc.so via -lc, libgcc.so via -lgcc, etc.). To find out which of the object code files have the definition of a specific function, say, _start, disassemble each of them. You’ll find that crt1.o has the definition.
Static linking
Another interesting thing to note in the generated assembly is that the call is to puts@plt, which is not exactly puts. It is in reality a construct called a trampoline, which helps the code jump to the actual printf function during runtime. We need this because printf is actually present in libc.so.6, which the binary simply claims to need by encoding it in the binary. To see this, disassemble the binary using the -x flag:
$ objdump -x helloworld helloworld: file format elf64-x86-64 helloworld architecture: i386:x86-64, flags 0x00000112: EXEC_P, HAS_SYMS, D_PAGED start address 0x0000000000400430 ... Dynamic Section: NEEDED libc.so.6 ...
This is dynamic linking. When a program is executed, what is actually called first is the dynamic linker (ld.so), which then opens all dependent libraries, maps them into memory, and then calls the _start function in the program. During mapping, it also fills in a table of data called the Global Offset Table with offsets of all of the external references (puts in our case) to help the trampoline jump to the correct location.
If you want to be independent of the dynamic linker, then you can link the program statically:
$ gcc -static -o helloworld helloworld.c
This will however result in bloating of the program and also has a number of other disadvantages, like having to rebuild for every update of its dependent libraries and sub-optimal performance since the kernel can no longer share pages among processes for common code.
BONUS: Writing the smallest program
The basics were done with about 10 minutes to spare, so I showed how one could write the smallest program ever. In principle, the smallest program in C is:
int main (void) { return 42; }
As is evident though, this pulls in everything from the C and gcc libraries, so it is clearly hard to do this in C, so lets try it in assembly. We already know that _start is the main entry point of the program, so we need to implement that function. To exit the program, we need to tell the kernel to exit by invoking the exit_group syscall, which has syscall number 231. Here is what the function looks like:
.globl _start _start: mov $0xe7, %rax mov $0x42, %rdi syscall
We can build this with gcc to get a very small binary but to do this, we need to specify that we don’t want to use the standard libraries:
gcc -o min -nostdlib min.s
The resultant file is 864 bytes, as opposed to the 8.5K binary from the C program. We can reduce this further by invoking the assembler and linker directly:
$ as -o min.o min.s $ ld -o min min.o
This results in an even smaller binary, at 664 bytes! This is because gcc puts some extra meta information in the binary to identify its builds.
Conclusion
At this point we ran out of time and we had to cut things short. It was a fun interaction because there were even a couple of people with Macbooks and we spotted a couple of differences in the way the linker ran due to differences in the libc, despite having the same gcc installed. I wasn’t able to focus too much on the specifics of these differences and I hope they weren’t a problem for the attendees using Macs. In all it was a satisfying session because the audience seemed happy to learn about all of this. It looked like many of them had more questions (and wonderment, as I had when I learned these things for the first time) in their mind than they came in with and I hope they follow up and eventually participate in Open Source projects to fulfill their curiosity and learn further.
Higher order functions in C?
Posted: Jun 10, 2012, 10:01The Structure and Interpretation of Computer Programs course has a class on higher order functions, which is the ability of a function to accept a function and/or returns another function that uses the input function. The C programming language has very limit capability to do this and it is limited to being able to accept function pointers or return function pointers. I wanted to explore this limitation further to figure out exactly how far I can stretch this, so I wrote a small C program that tries to emulate this concept. This is just random tinkering, so if the reader is looking for a specific lesson, then let me clarify that there is none.
I wrote a program to get square roots using the Newton-Raphson method -- the same one that was used in the course with an attempt to keep the structure as close as possible to the scheme example in the class video. The Lisp-based listing for the course can be found on the internet and here is what I wrote:
#include <stdio.h> #include <stdlib.h> typedef double (*generic_func_t) (double); #define dx 0.000001 #define absolute(n) ((n) < 0 ? -(n) : (n)) generic_func_t deriv(generic_func_t f) { double deriv_func(double x) { return (f(x + dx) - f(x)) / dx; } return deriv_func; } double fixed_point(generic_func_t f, double x) { while (absolute(f(x) - x) > dx) x = f(x); return f(x); } double newton(generic_func_t f, double guess) { double newton_func(double y) { generic_func_t df = deriv(f); return (y - f(y) / df (y)); } return fixed_point(newton_func, guess); } double sqrt(double x) { double sqrt_func(double y) { return (x - y*y); } return newton(sqrt_func, 1); } int main(int argc, char **argv) { if (argc < 2) { printf(“Usage: %s <number>\n”, argv[0]); exit (1); } printf (“%lf\n”, sqrt(strtod(argv[1], NULL))); return 0; }The first thing that is evident in the above is that the syntax is non-standard. I have declared some functions within another function and that is not allowed by the C standard. GCC however implements this concept of nested functions as an extension, so this compiled cleanly for me. Running this however is a completely different thing.
Executing the above program for any argument results in segfaults and analysis of the program under gdb shows that the generated code is quite ridiculous. This is expected however, since we have pushed the gcc nested function support too far. The nested functions within gcc are OK as long as they remain within the following limits:
- They are only used within the function that nests them, either directly or indirectly via some function calls. Essentially, control from the nested function should return before it’s nesting function returns.
- If they’re used outside the nesting function and called after the nesting function returns, then they don’t use any local variables or arguments of the nesting function
So I modified the deriv function and came up with the listing below:
#include <stdio.h> #include <stdlib.h> typedef double (*generic_func_t) (double); #define dx 0.000001 #define absolute(n) ((n) < 0 ? -(n) : (n)) double deriv(double x, generic_func_t f) { return (f(x + dx) - f(x)) / dx; } double fixed_point(generic_func_t f, double x) { while (absolute(f(x) - x) > dx) x = f(x); return f(x); } double newton(generic_func_t f, double guess) { double newton_func(double y) { double dy = deriv(y, f); return (y - f(y) / dy); } return fixed_point(newton_func, guess); } double sqrt(double x) { double sqrt_func(double y) { return (x - y*y); } return newton(sqrt_func, 1); } int main(int argc, char **argv) { if (argc < 2) { printf(“Usage: %s <number>\n”, argv[0]); exit (1); } printf (“%lf\n”, sqrt(strtod(argv[1], NULL))); return 0; }The only change above is that instead of accepting a function and returning another in deriv, I now accept a function and return the evaluation of the derived function for the specified value. This is highlighted in bold above. This program now builds with gcc and also runs correctly. It should similarly be possible to eliminate the other nested functions to get a much more compact and standards-compliant program.
GCC Workshop at GRC, IIT Bombay
Posted: Jul 08, 2010, 22:19Mustafa (my manager) knew about my current fascination with learning the _real_ basics of computers (electronics, kernel, compilers, etc.), so he arranged for me to attend the GCC workshop held by IIT Bombay this week. My first reaction was that I would be wasting my time there since I didn't know the first thing about compiler theory and was a novice at best in assembly language programming. He said it wouldn't hurt to try. So I had to try.
I got a chance on the day before the workshop to read up some about things like IRs, RTLs, etc. It was enough that I would not be completely lost from day 1. But I did not attend day 1 at all, thanks to the countrywide strike that crippled all public transport (and some people too). So I spent that day working and also trying to cover what would otherwise have been covered in day 1 -- the various phases of compilation, passes, gray box probing to find out some more about intermediate outputs, etc. I got hold of last year's slides, so it was a little easier.
So I finally made it to days 2, 3 and 4. The first thing I noticed during the lecture sessions was that the professor really knew his stuff. He was well acquainted with the internal layout of gcc and was able to explain it well enough that I really _got_ it. Overall I come out of the sessions today with much more knowledge about gcc than I could ever gain in 4 days on my own. Here are some observations that I made during the course of this session.
The professor really knew his stuff. I say this again so that it does not look like I am ignoring that. There are also a lot of really talented individuals at the GRC, who are doing some pretty interesting research based on gcc. The trouble though is that there seem to have been no efforts whatsoever to share these ideas upstream.
One such idea is the Generic Data Flow Analyzer (GDFA). It is a patch to gcc that provides a data flow analyzer, which can be used to find and eliminate dead code or unused variables. It adds a gimple pass to the compilation sequence and intends to replace the current dead code elimination and unused variable elimination passes with the same code called with different parameters. While the idea is pretty interesting, the sad thing is that there are no signs of an attempt to push this idea upstream. All I could find was an announcement to the gcc mailing list, but no request for comments or for inclusion of the patch.
This is only one of many more ideas that are brewing in the GRC in the minds of some very talented people. But one felt that these ideas were being used only to get degrees and nothing was being done to actually test their feasibility in live production level code. It would be nice to see some of these ideas actually presented upstream with a genuine interest in getting them incorporated.
To conclude, it is a pretty good session for those who want to get started with learning compilers and gcc internals.