Wrestling with the register allocator: LuaJIT edition
For some time now, I kept running into one specific piece of code in luajit repeatedly for various reasons and last month I came across a fascinating bug in the register allocator pitfall that I had never encountered before. But then as is the norm after fixing such bugs, I concluded that it’s too trivial to write about and I was just stupid to not have found it sooner; all bugs are trivial once they’re fixed.
The worst thing about solving tricky problems is deciding to write up about it and then halfway through the writeup, concluding that it was way too simple and I was just stupid to not solve it sooner.
— Siddhesh Poyarekar (@siddhesh_p) August 28, 2019
After getting over my imposter syndrome (yeah I know, took a while) I finally have the courage to write about it, so like a famous cook+musician says, enough jibberjabber…
Looping through a hash table
One of the key data structures that luajit uses to implement metatables is a hash table. Due to its extensive use, the lookup loop into such hash tables is optimised in the JIT using an architecture-specific asm_href
function. Here is what a snippet of the arm64 version of the function looks like. A key thing to note is that the assembly code is generated backwards, i.e. the last instruction is emitted first.
/* Key not found in chain: jump to exit (if merged) or load niltv. */
l_end = emit_label(as);
as->invmcp = NULL;
if (merge == IR_NE)
asm_guardcc(as, CC_AL);
else if (destused)
emit_loada(as, dest, niltvg(J2G(as->J)));
/* Follow hash chain until the end. */
l_loop = --as->mcp;
emit_n(as, A64I_CMPx^A64I_K12^0, dest);
emit_lso(as, A64I_LDRx, dest, dest, offsetof(Node, next));
l_next = emit_label(as);
/* Type and value comparison. */
if (merge == IR_EQ)
asm_guardcc(as, CC_EQ);
else
emit_cond_branch(as, CC_EQ, l_end);
if (irt_isnum(kt)) {
if (isk) {
/* Assumes -0.0 is already canonicalized to +0.0. */
if (k)
emit_n(as, A64I_CMPx^k, tmp);
else
emit_nm(as, A64I_CMPx, key, tmp);
emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
} else {
Reg ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
emit_nm(as, A64I_FCMPd, key, ftmp);
emit_dn(as, A64I_FMOV_D_R, (ftmp & 31), (tmp & 31));
emit_cond_branch(as, CC_LO, l_next);
emit_nm(as, A64I_CMPx | A64F_SH(A64SH_LSR, 32), tisnum, tmp);
emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.n));
}
} else if (irt_isaddr(kt)) {
Reg scr;
if (isk) {
int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
scr = ra_allock(as, kk, allow);
emit_nm(as, A64I_CMPx, scr, tmp);
emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
} else {
scr = ra_scratch(as, allow);
emit_nm(as, A64I_CMPx, tmp, scr);
emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key.u64));
}
rset_clear(allow, scr);
} else {
Reg type, scr;
lua_assert(irt_ispri(kt) && !irt_isnil(kt));
type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
scr = ra_scratch(as, rset_clear(allow, type));
rset_clear(allow, scr);
emit_nm(as, A64I_CMPw, scr, type);
emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));
}
*l_loop = A64I_BCC | A64F_S19(as->mcp - l_loop) | CC_NE;
Here, the emit_*
functions emit assembly instructions and the ra_*
functions allocate registers. In the normal case everything is fine and the table lookup code is concise and effective. When there is register pressure however, things get interesting.
As an example, here is what a typical type lookup would look like:
0x100 ldr x1, [x16, #52]
0x104 cmp x1, x2
0x108 beq -> exit
0x10c ldr x16, [x16, #16]
0x110 cmp x16, #0
0x114 bne 0x100
Here, x16
is the table that the loop traverses. x1
is a key, which if it matches, results in an exit to the interpreter. Otherwise the loop moves ahead until the end of the table. The comparison is done with a constant stored in x2
.
The value of x
is loaded later (i.e. earlier in the code, we are emitting code backwards, remember?) whenever that register is needed for reuse through a process called restoration or spilling. In the restore case, it is loaded into register as a constant or expressed as another constant (look up constant rematerialisation) and in the case of a spill, the register is restored from a slot in the stack. If there is no register pressure, all of this restoration happens at the head of the trace, which is why if you study a typical trace you will notice a lot of constant loads at the top of the trace.
Like the Spill that ruined your keyboard the other day…
Things get interesting when the allocation of x2
in the loop above results in a restore. Looking at the code a bit closer:
type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
scr = ra_scratch(as, rset_clear(allow, type));
rset_clear(allow, scr);
emit_nm(as, A64I_CMPw, scr, type);
emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));
The x2
here is type, which is a constant. If a register is not available, we have to make one available by either rematerializing or by restoring the register, which would result in something like this:
0x100 ldr x1, [x16, #52]
0x104 cmp x1, x2
0x108 mov x2, #42
0x10c beq -> exit
0x110 ldr x16, [x16, #16]
0x114 cmp x16, #0
0x118 bne 0x100
This ends up breaking the loop because the allocator restore/spill logic assumes that the code is linear and the restore will affect only code that follows it, i.e. code that got generated earlier. To fix this, all of the register allocations should be done before the loop code is generated.
Making things right
The result of this analysis was this fix in my LuaJIT fork that allocates registers for operands that will be used in the loop before generating the body of the loop. That is, if the registers have to spill, they will do so after the loop (we are generating code in reverse order) and leave the loop compact. The fix is also in the luajit2 repository in the OpenResty project. The work was sponsored by OpenResty as this wonderfully vague bug could only be produced by some very complex scripts that are part of the OpenResty product.