libfreevec NG!!
markos — Tue, 24/03/2009 - 23:24
I'm in the process of rewriting libfreevec and porting it to other SIMD platforms, apart from AltiVec (which I consider dead or dying, unfortunately, thanks to the Big Powers that decided it's no longer important along with PowerPC, but that should be another topic). Anyway, the main platforms chosen are AltiVec (of course :), SSE (SSE2, SSE3 and possibly SSE4), ARM NEON and Cell SPU.
The idea behind libfreevec is not restricted to AltiVec anyway. I have proven that glibc, the #1 libc used on Linux, is totally unoptimized even for common platforms (such as x86 and x86_64), and there are performance gains that could/should materialize if someone took the effort to do it. So, I've decided to do exactly that.
First, I'll extend libfreevec to be a full-blown libc, and will try to at least be source-compatible with glibc (that's a definite must; ABI compatibility would be a nice plus, but I don't know if I can do it yet, probably not). For this purpose, I'm also rewriting the make system and will use cmake instead. I have to say, so far it has reduced both compile times and debugging times by a factor of 10!! No more messing around with stupid configure/autoconf scripts. Good riddance!
Second, I'm abstracting the actual functions. After all, a memcpy uses the same algorithm, no matter what the platform or the SIMD engine is, right? This way, each function just includes a header file that has all the macros it needs (or actually inline functions; I moved away from macros because inline functions are much easier to debug). The right header is picked automatically depending on the SIMD engine selected at compile time (or a scalar one, if no SIMD engine was defined).
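To make the idea concrete, here is a minimal sketch of that kind of abstraction, squeezed into a single file instead of separate per-platform headers. It is not libfreevec code: the names sketch_copy_block, sketch_memcpy and SKETCH_BLOCK are made up for illustration, only SSE2, NEON and a scalar fallback are shown, and an AltiVec or SPU branch would slot in the same way.

```c
#include <stddef.h>
#include <string.h>

/* Pick the block-copy primitive at compile time. In a real tree each
 * branch would live in its own header; the names are hypothetical. */
#if defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_BLOCK 16
static inline void sketch_copy_block(unsigned char *dst, const unsigned char *src)
{
    /* Unaligned 16-byte load followed by an unaligned 16-byte store. */
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_BLOCK 16
static inline void sketch_copy_block(unsigned char *dst, const unsigned char *src)
{
    vst1q_u8(dst, vld1q_u8(src));
}
#else
/* Scalar fallback: move one machine word at a time. */
#define SKETCH_BLOCK sizeof(unsigned long)
static inline void sketch_copy_block(unsigned char *dst, const unsigned char *src)
{
    unsigned long tmp;
    memcpy(&tmp, src, sizeof tmp);
    memcpy(dst, &tmp, sizeof tmp);
}
#endif

/* The algorithm itself is written once and does not care which
 * primitive was selected above. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= SKETCH_BLOCK) {      /* bulk: whole blocks */
        sketch_copy_block(d, s);
        d += SKETCH_BLOCK;
        s += SKETCH_BLOCK;
        n -= SKETCH_BLOCK;
    }
    while (n--)                      /* tail: leftover bytes */
        *d++ = *s++;
    return dst;
}
```

Calling sketch_memcpy(dst, src, len) behaves like memcpy for non-overlapping buffers. A real implementation would also deal with alignment and unrolling, which this sketch leaves out on purpose.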
Also, I've started work on rewriting the IEEE754 math functions used in glibc/libm. I've often mentioned that these are slow as molasses on ALL platforms, and now I can prove it. Here are some results:
> ./test_trigf
Populated 100000000 floats in the range [0..pi/4]
dt = 2.830000
Glibc : 35335689.05 calculations of cosf()/sec
<cos(x)> = 0.167772
dt = 2.240000
libfreevec : 44642857.14 calculations of cosf()/sec
<cos(x)> = 0.167772
vec_cosf fail/tot = 1456251/100000000, maxerror = 0.0000001
dt = 1.710000
Glibc : 58479532.16 calculations of sinf()/sec
<sin(x)> = 0.167772
dt = 2.390000
libfreevec : 41841004.18 calculations of sinf()/sec
<sin(x)> = 0.167772
vec_sinf fail/tot = 98844217/100000000, maxerror = 0.0635434
dt = 2.100000
Glibc : 47619047.62 calculations of tanf()/sec
<tan(x)> = 0.167772
dt = 2.220000
libfreevec : 45045045.05 calculations of tanf()/sec
<tan(x)> = 0.167772
vec_tanf fail/tot = 125883/100000000, maxerror = 0.0000001
dt = 4.190000
Glibc : 23866348.45 calculations of coshf()/sec
<cosh(x)> = 0.335544
dt = 1.470000
libfreevec : 68027210.88 calculations of coshf()/sec
<cosh(x)> = 0.335544
vec_coshf fail/tot = 23772394/100000000, maxerror = 0.0000001
dt = 4.470000
Glibc : 22371364.65 calculations of sinhf()/sec
<sinh(x)> = 0.167772
dt = 1.380000
libfreevec : 72463768.12 calculations of sinhf()/sec
<sinh(x)> = 0.167772
vec_sinhf fail/tot = 111824/100000000, maxerror = 0.0000001
dt = 7.980000
Glibc : 12531328.32 calculations of tanhf()/sec
<tanh(x)> = 0.167772
dt = 1.270000
libfreevec : 78740157.48 calculations of tanhf()/sec
<tanh(x)> = 0.167772
vec_tanhf fail/tot = 803327/100000000, maxerror = 0.0000001
Ok, these are preliminary results, and I can probably do better in terms of both accuracy and speed (I'm especially disappointed with sinf(), which is definitely using a wrong approximant function, but I expect to find the culprit soon). All tests were done on an Athlon X2 @ 2.5GHz (the same one used in previous benchmarks); the glibc version used was glibc-2.9-2.11.1 (openSUSE 11.1 package, 64-bit version). The max error reported there is the maximum difference between my version and the glibc version, found in X tests out of 100 million (the fail/tot figure). Ok, you have to admit a max error of 1 * 10^(-7) is negligible :)
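For anyone who wants to reproduce this kind of check, here is a rough sketch of how a fail count and a max error can be collected for two implementations of the same function. It is not the actual test_trigf source. All names are hypothetical, the two demo functions are trivial stand-ins for a libm routine and its replacement, and treating any difference above a tolerance as a "fail" is my assumption about what fail/tot counts.

```c
#include <stddef.h>

typedef float (*unary_fn)(float);

struct cmp_result {
    size_t fails;      /* inputs where the two results differ by more than tol */
    float  max_error;  /* largest absolute difference seen */
};

/* Run both functions over the same inputs and record how they differ. */
struct cmp_result compare_fn(unary_fn ref, unary_fn test,
                             const float *x, size_t n, float tol)
{
    struct cmp_result r = { 0, 0.0f };

    for (size_t i = 0; i < n; i++) {
        float d = ref(x[i]) - test(x[i]);
        if (d < 0.0f)
            d = -d;                 /* absolute difference */
        if (d > tol)
            r.fails++;
        if (d > r.max_error)
            r.max_error = d;
    }
    return r;
}

/* Stand-ins for illustration only: a "reference" and a version that is
 * deliberately off by 0.25 for inputs above 1.5. */
float demo_ref(float x)  { return x * x; }
float demo_test(float x) { return x * x + (x > 1.5f ? 0.25f : 0.0f); }
```

With the inputs {0, 1, 2, 3} and a tolerance of 0, compare_fn(demo_ref, demo_test, ...) counts 2 fails and a max error of 0.25. To compare real routines, pass cosf and the replacement function instead of the stand-ins.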
Also, these are all PLAIN C versions, with no asm or any optimization used. The good thing is that they are ~20 lines of C max for each function, much easier to read than the spaghetti mess in glibc. When I do the custom optimizations per arch, even the functions that are now faster in glibc are going to get totally trounced. Btw, all functions in libfreevec have a consistent speed, whereas most functions in glibc perform well in the 0..pi/4 range (possibly due to the SINCOS asm instruction), but lose a great deal of speed when the sample used is [-pi..pi] (which is the generic case) and are in fact slower than libfreevec.
Stay tuned...
glibc STT_GNU_IFUNC
mab — Fri, 28/01/2011 - 18:28
markos, have you seen what glibc did with STT_GNU_IFUNC?
Re: Proposal for CPU dispatching in libc
glibc 2.10 news - Automatic use of optimized function
Isn't that targeting something similar? Maybe you can make use of that?
gentoo-ppc
mab — Fri, 28/01/2011 - 11:01
I discussed and introduced libfreevec at gentoo-powerpc.
We are really looking forward to the next libfreevec NG release.
I would like to see libfreevec become a real competitor to glibc and a serious, official replacement.
Therefore I would like to help you test on the G4 platform on Gentoo for the next bugfixes.
Best regards,
Massimo B.
re: gentoo-ppc
markos — Wed, 09/02/2011 - 16:39
I read the thread. libfreevec started long before glibc even considered adding STT_GNU_IFUNC (I started in 2004) and it's not going to stop development now because of that. I strongly believe that the way glibc does it is wrong. What's the difference? I guess you'll just have to wait and see.
The other difference is that I've adopted a BSD-like license instead of LGPL/GPL and I'm evolving it -albeit very slowly- to a full libc solution. This will take time though.
Regards
Konstantinos
re: gentoo-ppc
mab — Fri, 19/08/2011 - 14:09
Hi Konstantinos,
So you think libfreevec will still be faster than current glibc and its STT_GNU_IFUNC?
I stopped using libfreevec after my first attempts in January 2011 because I had a lot of crashes, for example in Konqueror on KDE4, which were gone right after turning libfreevec off.
Since YDL was using libfreevec in its release, I would guess it is stable enough, at least for one of its main targets, the AltiVec G4.
Currently I'm clearing out my PPC stock, getting rid of at least the G3. As for the G4, I'm not sure. I definitely need a faster platform, not for power use or gaming, as I'm a desktop user (Gentoo builds can run at night). But some Java GUI applications are so slow that they are hardly usable at all, whether I tinker with the binary-only IBM JDK (slow) or IcedTea-Zero (slowest).
Anyway, can you give me a hint on how best to use libfreevec? I'd like to give it a chance since I like the G4. Do I need to rebuild anything against libfreevec? Is there any advice for preventing crashes, like special CFLAGS or CXXFLAGS?
I know you are working on the next-generation libfreevec, which is more portable and targets ARM. Will this bring any improvement on PPC too? Anyway, I'm looking forward to the libfreevec development.
Btw, libfreevec on ARM could be very useful for the many emerging smartphones running Android and other OSes.
Even PPC could still be interesting, given its use today in high-performance servers and embedded systems. Then there is still the PS3, but you're right, PPC for desktop use has declined, if not died, since Apple moved away.
How is it going at the Genesi job?
Regards,
Massimo
great news :) About the
ggael — Thu, 26/03/2009 - 11:59
great news :)
About the math functions, are you using techniques similar to those in the cephes lib? (http://www.netlib.org/cephes/)
I'm asking because I just added SSE versions of the cephes sin, cos, exp and log functions (all credits go to Julien Pommier http://gruntthepeon.free.fr/ssemath/) and they work pretty well. But if you have something even better to propose...
Also, the cephes sin and cos routines work very well in the range ~ [-8000 : 8000], while the quality slowly degrades for larger values. Have you experienced similar behavior?
not certain
markos — Thu, 26/03/2009 - 14:43
Not certain about this.
I think that Cephes uses Taylor expansions for most of the functions, whereas I use Padé approximants (which are much faster to calculate as they use far fewer terms; the difference in speed is 50-200%). I'll organize the code a little, as it's a mess currently, and will release it shortly. I also have some nice new results from the exp() functions, but I have to fine-tune the polynomial constants in the terms to get full accuracy (right now I get ~3 * 10^(-7), which is ok, but not fully IEEE754). I'll let you know when I'm done with this.
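As a small illustration of the Taylor versus Padé point, here is a textbook comparison for exp(x) near zero. These are not the approximants or constants used in libfreevec, and the function names are made up. It only shows the general idea: the [2/2] Padé approximant (12 + 6x + x^2) / (12 - 6x + x^2) matches the Taylor series of exp(x) through the x^4 term while needing only x and x^2.

```c
/* Degree-4 Taylor polynomial of exp(x) around 0, in Horner form:
 * 1 + x + x^2/2 + x^3/6 + x^4/24 */
double exp_taylor4(double x)
{
    return 1.0 + x * (1.0 + x * (0.5 + x * (1.0 / 6.0 + x * (1.0 / 24.0))));
}

/* [2/2] Pade approximant of exp(x) around 0:
 * (12 + 6x + x^2) / (12 - 6x + x^2)
 * The numerator and denominator share the terms 12 + x^2. */
double exp_pade22(double x)
{
    double a = 12.0 + x * x;   /* shared part */
    double b = 6.0 * x;        /* odd part, sign flips in the denominator */
    return (a + b) / (a - b);
}
```

At x = 0.5 the Taylor version gives 1.6484375 and the Padé version about 1.648649, against a true value of about 1.648721, so the rational form is closer here. Both are only meant for small |x|; a real routine would first reduce the argument to a small range.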