Buggy HLE, microcode updates and SIGILLs
Update: Disabling lock elision in glibc doesn’t seem to be sufficient. Either way, the Fedora kernel folks will have an update in place to update the microcode early by default so that both the kernel and the first instantiation of pthreads will see HLE disabled. So read the story as something interesting that we did but didn’t quite work. It was fun though…
Amit and I ran into an interesting problem today with his new Haswell processor-based system. A fully updated Fedora 21 alpha would fail during boot and fall into the maintenance shell. The systemd journal showed that systemd-udevd was crashing with a SIGILL, which seemed strange. The core dump revealed the problem:
(gdb) x/i $rip
=> 0x7f68b0b978ba <pthread_rwlock_rdlock+186>: xbeginq 0x7f68b0b978c0 <pthread_rwlock_rdlock+192>
The xbeginq instruction is a TSX instruction (strictly it belongs to RTM, the companion extension to HLE), so the first thing that came to mind was the recent errata that Intel pushed out, effectively announcing that HLE was buggy and that they were going to disable it soon. We looked at /proc/cpuinfo expecting to find hle and rtm missing, but were even more confused to find that they were present.
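For anyone wanting to repeat the check, here is a minimal Python sketch of what we did by eye. The helper name `tsx_flags` is ours, purely for illustration; the only assumption about the system is the usual x86 layout of /proc/cpuinfo, with a `flags` line for each logical CPU.

```python
def tsx_flags(cpuinfo_text):
    """Return the subset of {'hle', 'rtm'} listed in the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        # Each entry looks like "key<tabs>: value"; skip lines without a colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return {"hle", "rtm"} & set(value.split())
    # No flags line at all (for example, empty input).
    return set()
```

On a live system you would pass it the contents of /proc/cpuinfo, for example `tsx_flags(open("/proc/cpuinfo").read())`. An empty set would mean neither flag is being advertised.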
After much tinkering, Amit made a vague reference to microcode_ctl being able to change CPU microcode on the fly. It took a while to hit us, but we finally realized that we had found the culprit: microcode_ctl had been updated with the latest Intel microcode update. We initially thought that it ought to be a one-time problem, since the microcode would be flashed into the CPU and everything would work afterwards, but then we found out that the microcode needs to be flashed on every boot.
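Since the update is applied afresh on every boot, it can be handy to see which revision each CPU currently reports. Below is a rough Python sketch, assuming the x86 /proc/cpuinfo layout in which a `microcode` line follows each `processor` entry. The function name `microcode_revisions` is hypothetical and not part of microcode_ctl or any other tool.

```python
def microcode_revisions(cpuinfo_text):
    """Map each logical CPU number to the microcode revision listed for it."""
    revisions = {}
    current_cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            current_cpu = int(value)
        elif key == "microcode" and current_cpu is not None:
            # The revision is printed in hex, e.g. "0x1c".
            revisions[current_cpu] = int(value, 16)
    return revisions
```

Feeding it `open("/proc/cpuinfo").read()` gives one entry per logical CPU, which makes it easy to compare the reported revision before and after the updated microcode_ctl package is installed.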
So the root cause was that the microcode update would happen late enough that systemd was already up and had read the hle bit, thus enabling lock elision support in systemd. The kernel had also read in the CPU capabilities before the update and did not refresh them afterwards, which is why we continued seeing hle and rtm set in cpuinfo.
As a result, thanks to the microcode update, all Haswell-based F21 alpha systems are essentially unbootable. Carlos is now fixing this by disabling lock elision completely in the glibc build. Work is in progress for rawhide, F21 and F20 as I write this, so the impact of this will hopefully be minimal. If you do run into this problem, all you have to do is downgrade the microcode_ctl package and pin it so that it doesn’t get updated till the glibc update becomes available.