Conversation
|
I don't know how your patch slipped into this PR ... trying to rebase now ... |
28f7ba0 to
b155b7f
Compare
|
Ok, cleaned up now. I only rebuilt and re-ran the tests, not re-benchmarked |
|
On Milan and gcc 11.5 the version from the PR is slightly slower. Before: After: |
|
On a quick glance at godbolt, the assembly looks the same. Your CPU really doesn't like memory loads!
|
|
Here are the, not very clean benchmarks, with What does Edit: epyc milan should be Edit2: the assembly looks the same with Edit3: actually, with GCC 11.4 the code is exactly the same with/without the memory barrier if |
|
Here are the results on my tiger lake laptop. The patch is a consistent win. |
|
|
Interesting that it reports |
|
Any thoughts on using the method from NEON? It would allow keeping the struct const. const constants* c = &consts;
ZMIJ_ASM(("" : "+r"(c)));
using ptr = const __m128i*;
const __m128i div10k = _mm_load_si128(ptr(&c->div10k));
...AFAICS it generates the same(ish) assembly: https://www.godbolt.org/z/Ejrx1EobK |
|
@vitaut I must have messed up when I tried it because I didn't see it have an impact. I'm happy if it works, of course! |
Even more interesting because:
Here's the patch that added it to gcc: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603486.html Edit: yes, it was added in 11.5 https://gcc.gnu.org/onlinedocs/gcc-11.5.0/gcc/x86-Options.html#index-march-14 It adds mainly AVX-512 |
vitaut
left a comment
There was a problem hiding this comment.
On my Intel machine I do see a small improvement with this change. So it looks like the results are mostly positive and I'm fine with merging the PR but please apply the technique from the NEON path to keep the struct const.
b155b7f to
f38a58d
Compare
|
I had to get up to speed with |
|
@vitaut I've no idea how or why I'd closed this? Seems to have happened when I force-pushed it. Anyway, it's still live and updated per your request. |
|
For some reason the PR shows 0 commits. Could you try to rebase/push again or, if it doesn't help, submit a new PR? |
|
Fixed it, sorry about wasting your cycles. Hope it was fast enough to avoid another branch. |
Ensures constants are loaded from memory.
48b2d13 to
07dc3d7
Compare
|
I'm fairly sure that I didn't do something silly this time around. |
|
Merged, thank you! |

Recent changes made gcc no longer load the constants, or less constant, from memorys for the SSE code. The fix is simple though: insert an asm memory-clobber annotation on the
constsstruct (which can no longer beconstexpras it needs to be mutable). The worst offender seems to be the multiplication byneg10(246) which is translated to(ie. 246 = ((2^5 - 1) * 2^2 - 1) * 2)
I'm actually surprised by how small the performance loss is. But then, the compiler is not super-dumb, but it optimizes along the wrong axis as most SIMD code is probably loading a lot of stuff from memory and thus needs minimal distraction.
Here are the benchmark runs on my AMD Ryzen with (
-asm) and without (no suffix) this patch for the library built with SSE2 (-default), SSE4.2 (-sse4.2) and AVX2 (-avx2, this uses the SSE4.2 but a different instruction encoding which gives the register allocator more liberty). As you can see, this is a 3% performance gain.Here's a godbolt: the relevant assembly is around line 120 in both, note the preprocessor variable that controls the presence of the memory barrier.
https://godbolt.org/z/cs7sjzzzT
ps BTW i wasn't able to compile the C test with gcc. It complains about
_Alignas(accepts alignas) and laterconstexpr, which is where I decided to disable the test locally.