Multiprecision arithmetic in libm
Before my two week break, I was working on a bunch of ideas to try and get the multiprecision arithmetic performance in libm to not suck as badly as it currently does. There was a lot of stuff going on, so I’ll try to summarize them here. The primary reason for this post is to get myself back in groove (although I’ve tried to provide as muchh background as possible), so I apologize to readers if some of the content is not coherent. Feel free to point them out in the comments and I’ll try to clarify.
The multiprecision bits in libm are essentially all files starting with mp
in $srcdir/sysdeps/iee754/dbl-64/
. The structure that stores a multiprecision number is called mp_no and is declared in mpa.h as:
typedef struct { int e; double d[40]; } mp_no;
where e is the exponent of the number and the mantissa digits are in d. The radix of the number is 224, so each digit in d is always a non-negative integral value less than 224.
The other all-important module is mpa.c, which defines basic operations on mp_no
(construct from double, deconstruct to double, add, subtract, multiply, divide). It was relatively easy to see that these basic operations were the hotspots (specifically multiplication), but not so easy to see that Power code has its own copy of these functions. And there begins the real difficulty.
The Power architecture is unique in that it has all of 4 floating point units. The power specific code takes advantage of that fact and is tweaked in a manner that the execution units are used in parallel. In contrast, the intel x86 architecture is quite weak on floating point units and this is where the major conflict is. The mantissa of mp_no, which is currently a double (but does not need to be since it can only have integral values), is perfect for Power, but not good enough for intel, which has much faster fixed point computations. Conversion between doubles and ints is too slow on both platforms and is hence not a good enough alternative.
A possible approach is using a mantissa_t typedef that is then overridden by Power, but I need to do some consolidation in the rest of the code to ensure that the internal structure of mp_no is not exposed anywhere. So that’s a TODO.
Apart from exploiting architecture-specific traits, the other approach I considered was to tweak the algorithm for multiplication to make it as fast as possible in a platform-independent manner. A significant number of multiplication inputs are numbers that do not utilize the full precision of mp_no, i.e. a large number of mantissa digits are zeroes. The current algorithm blindly loops around the entire precision multiplying these zeroes, which is a waste. A better idea is to find the precision of the numbers and then run the multiply-add-carry loop only to the extent of that precision. The result of this was an improvement in performance to an extent of 36% in the pow function, i.e. a bit less than twice the speed of the earlier algorithm!
Next steps are to evaluate the impact of storing the actual precision of numbers so that it does not have to be computed within the multiplication function. That’s another TODO for me.
Finally there is a lot of cleanup in order in pretty much all of the mathlib code (i.e. the stuff in sysdeps/iee754/dbl-64), which I’ve been chugging away at and will continue to do so. I’m sure the glibc community will love to review any patch submissions that even simply reformat these files in the GNU format, so there’s a good way to get started with contributing code to glibc.