Skip to content

Add memory barrier#78

Merged
vitaut merged 1 commit intovitaut:mainfrom
TobiSchluter:add_memory_barrier
Jan 25, 2026
Merged

Add memory barrier#78
vitaut merged 1 commit intovitaut:mainfrom
TobiSchluter:add_memory_barrier

Conversation

@TobiSchluter
Copy link
Contributor

@TobiSchluter TobiSchluter commented Jan 21, 2026

Recent changes made gcc no longer load the constants, or less constant, from memorys for the SSE code. The fix is simple though: insert an asm memory-clobber annotation on the consts struct (which can no longer be constexpr as it needs to be mutable). The worst offender seems to be the multiplication by neg10 (246) which is translated to

        vpmulhuw        xmm2, xmm0, xmm2
        vpsllw  xmm1, xmm2, 5
        vpsubw  xmm1, xmm1, xmm2
        vpsllw  xmm1, xmm1, 2
        vpsubw  xmm1, xmm1, xmm2
        vpsllw  xmm1, xmm1, 1
        vpaddw  xmm0, xmm1, xmm0

(ie. 246 = ((2^5 - 1) * 2^2 - 1) * 2)

I'm actually surprised by how small the performance loss is. But then, the compiler is not super-dumb, but it optimizes along the wrong axis as most SIMD code is probably loading a lot of stuff from memory and thus needs minimal distraction.

Here are the benchmark runs on my AMD Ryzen with (-asm) and without (no suffix) this patch for the library built with SSE2 (-default), SSE4.2 (-sse4.2) and AVX2 (-avx2, this uses the SSE4.2 but a different instruction encoding which gives the register allocator more liberty). As you can see, this is a 3% performance gain.

Here's a godbolt: the relevant assembly is around line 120 in both, note the preprocessor variable that controls the presence of the memory barrier.
https://godbolt.org/z/cs7sjzzzT

build/test/benchmark-avx2
Mean of per-digit medians:
dragonbox: 16.40ns (14.79ns - 19.29ns) noisy
zmij     :  9.26ns ( 9.02ns - 10.88ns) noisy
build/test/benchmark-avx2-asm
Mean of per-digit medians:
dragonbox: 16.70ns (15.80ns - 19.75ns) noisy
zmij     :  9.13ns ( 8.91ns - 10.64ns) noisy
build/test/benchmark-defaults
Mean of per-digit medians:
dragonbox: 16.67ns (15.01ns - 19.75ns) noisy
zmij     :  9.47ns ( 9.19ns - 11.16ns) noisy
build/test/benchmark-defaults-asm
Mean of per-digit medians:
dragonbox: 16.78ns (15.28ns - 19.70ns) 
zmij     :  9.17ns ( 8.92ns - 10.65ns) noisy
build/test/benchmark-sse4.2
Mean of per-digit medians:
dragonbox: 16.67ns (15.23ns - 19.57ns) noisy
zmij     :  9.47ns ( 9.18ns - 11.25ns) noisy
build/test/benchmark-sse4.2-asm
Mean of per-digit medians:
dragonbox: 16.75ns (15.14ns - 19.47ns) noisy
zmij     :  9.15ns ( 8.92ns - 10.70ns) noisy

ps BTW i wasn't able to compile the C test with gcc. It complains about _Alignas (accepts alignas) and later constexpr, which is where I decided to disable the test locally.

@TobiSchluter
Copy link
Contributor Author

I don't know how your patch slipped into this PR ... trying to rebase now ...

@TobiSchluter
Copy link
Contributor Author

Ok, cleaned up now. I only rebuilt and re-ran the tests, not re-benchmarked

@vitaut
Copy link
Owner

vitaut commented Jan 21, 2026

On Milan and gcc 11.5 the version from the PR is slightly slower.

Before:

dragonbox: 30.65ns (27.92ns - 35.06ns)
zmij     : 18.98ns (18.65ns - 22.25ns)

After:

dragonbox: 30.63ns (28.02ns - 35.07ns)
zmij     : 19.18ns (18.80ns - 22.22ns)

@TobiSchluter
Copy link
Contributor Author

On a quick glance at godbolt, the assembly looks the same. Your CPU really doesn't like memory loads!

-mtune=native might make a difference, pre-patch. Will send benchmarks later.

@TobiSchluter
Copy link
Contributor Author

TobiSchluter commented Jan 21, 2026

Here are the, not very clean benchmarks, with -mtune=native added (indicated by the suffix -native).

→ for I in build/test/benchmark-* ; do echo $I; taskset -c 0-4 $I; done
build/test/benchmark-avx2
Mean of per-digit medians:
dragonbox: 17.06ns (14.89ns - 19.37ns) noisy
zmij     : 10.09ns ( 9.10ns - 12.33ns) noisy
build/test/benchmark-avx2-asm
Mean of per-digit medians:
dragonbox: 17.05ns (14.74ns - 19.69ns) noisy
zmij     :  9.59ns ( 8.92ns - 10.67ns) noisy
build/test/benchmark-avx2-asm-native
Mean of per-digit medians:
dragonbox: 17.01ns (15.20ns - 20.68ns) noisy
zmij     :  9.75ns ( 8.93ns - 12.30ns) noisy
build/test/benchmark-avx2-native
Mean of per-digit medians:
dragonbox: 17.52ns (16.24ns - 21.33ns) noisy
zmij     :  9.29ns ( 8.87ns - 11.28ns) noisy
build/test/benchmark-defaults
Mean of per-digit medians:
dragonbox: 17.25ns (15.28ns - 19.77ns) noisy
zmij     :  9.79ns ( 9.18ns - 11.22ns) noisy
build/test/benchmark-defaults-asm
Mean of per-digit medians:
dragonbox: 17.50ns (15.26ns - 19.77ns) noisy
zmij     :  9.49ns ( 8.92ns - 11.79ns) noisy
build/test/benchmark-defaults-asm-native
Mean of per-digit medians:
dragonbox: 17.70ns (16.32ns - 20.33ns) noisy
zmij     :  9.17ns ( 8.93ns - 10.70ns) noisy
build/test/benchmark-defaults-native
Mean of per-digit medians:
dragonbox: 17.13ns (16.10ns - 19.11ns) noisy
zmij     : 10.28ns ( 9.26ns - 12.89ns) noisy
build/test/benchmark-sse4.2
Mean of per-digit medians:
dragonbox: 17.10ns (15.31ns - 21.97ns) noisy
zmij     : 10.09ns ( 9.21ns - 12.35ns) noisy
build/test/benchmark-sse4.2-asm
Mean of per-digit medians:
dragonbox: 17.19ns (16.20ns - 19.77ns) noisy
zmij     :  9.45ns ( 8.92ns - 11.69ns) noisy
build/test/benchmark-sse4.2-asm-native
Mean of per-digit medians:
dragonbox: 16.91ns (14.99ns - 21.24ns) noisy
zmij     :  9.84ns ( 9.61ns - 10.64ns) noisy
build/test/benchmark-sse4.2-native
Mean of per-digit medians:
dragonbox: 16.73ns (15.08ns - 20.89ns) noisy
zmij     :  9.27ns ( 8.97ns - 10.79ns) noisy

What does gcc -Q --help=target -mtune=native print? Unfortunately, this output is very long, but I'm not aware of a way to get a single -mtune=<architecture> flag out of it that reproduces the native tuning :(

Edit: epyc milan should be -mtune=znver3

Edit2: the assembly looks the same with -mtune=znver3 with recent GCC

Edit3: actually, with GCC 11.4 the code is exactly the same with/without the memory barrier if -mavx2 -mtune=znver3 is used. In the default case of SSE2, i.e. with -msse2 -mtune=znver3, it creates 6 out of 7 constants in memory even without the memory barrier. So it seems that gcc thinks that the znver3 microarchitecture is actually quite memory-load friendly! https://godbolt.org/z/fKc5fWWWE

@TobiSchluter
Copy link
Contributor Author

Here are the results on my tiger lake laptop. The patch is a consistent win.

→ for I in out/build/gcc-release/test/benchmark-*; do echo $I; taskset -c 0-3 $I; done
out/build/gcc-release/test/benchmark-avx2
Mean of per-digit medians:
dragonbox: 47.97ns (42.32ns - 60.32ns) noisy
zmij     : 34.42ns (32.89ns - 43.01ns) noisy
out/build/gcc-release/test/benchmark-avx2-asm
Mean of per-digit medians:
dragonbox: 47.47ns (41.56ns - 53.96ns) noisy
zmij     : 33.29ns (29.87ns - 50.13ns) noisy
out/build/gcc-release/test/benchmark-defaults
Mean of per-digit medians:
dragonbox: 47.52ns (41.70ns - 68.05ns) noisy
zmij     : 34.82ns (33.35ns - 43.63ns) noisy
out/build/gcc-release/test/benchmark-defaults-asm
Mean of per-digit medians:
dragonbox: 47.36ns (41.68ns - 54.95ns) noisy
zmij     : 33.44ns (32.06ns - 36.33ns) noisy
out/build/gcc-release/test/benchmark-sse4.2
Mean of per-digit medians:
dragonbox: 46.97ns (41.20ns - 55.84ns) noisy
zmij     : 33.89ns (32.51ns - 43.41ns) noisy
out/build/gcc-release/test/benchmark-sse4.2-asm
Mean of per-digit medians:
dragonbox: 46.79ns (42.19ns - 54.76ns) noisy
zmij     : 33.34ns (32.05ns - 38.00ns) noisy

@vitaut
Copy link
Owner

vitaut commented Jan 22, 2026

What does gcc -Q --help=target -mtune=native print?

The following options are target specific:
  -m128bit-long-double        		[enabled]
  -m16                        		[disabled]
  -m32                        		[disabled]
  -m3dnow                     		[disabled]
  -m3dnowa                    		[disabled]
  -m64                        		[enabled]
  -m80387                     		[enabled]
  -m8bit-idiv                 		[disabled]
  -m96bit-long-double         		[disabled]
  -mabi=                      		sysv
  -mabm                       		[disabled]
  -maccumulate-outgoing-args  		[disabled]
  -maddress-mode=             		long
  -madx                       		[disabled]
  -maes                       		[disabled]
  -malign-data=               		compat
  -malign-double              		[disabled]
  -malign-functions=          		0
  -malign-jumps=              		0
  -malign-loops=              		0
  -malign-stringops           		[enabled]
  -mamx-bf16                  		[disabled]
  -mamx-int8                  		[disabled]
  -mamx-tile                  		[disabled]
  -mandroid                   		[disabled]
  -march=                     		x86-64-v2
  -masm=                      		att
  -mavx                       		[disabled]
  -mavx2                      		[disabled]
  -mavx256-split-unaligned-load 	[disabled]
  -mavx256-split-unaligned-store 	[disabled]
  -mavx5124fmaps              		[disabled]
  -mavx5124vnniw              		[disabled]
  -mavx512bf16                		[disabled]
  -mavx512bitalg              		[disabled]
  -mavx512bw                  		[disabled]
  -mavx512cd                  		[disabled]
  -mavx512dq                  		[disabled]
  -mavx512er                  		[disabled]
  -mavx512f                   		[disabled]
  -mavx512ifma                		[disabled]
  -mavx512pf                  		[disabled]
  -mavx512vbmi                		[disabled]
  -mavx512vbmi2               		[disabled]
  -mavx512vl                  		[disabled]
  -mavx512vnni                		[disabled]
  -mavx512vp2intersect        		[disabled]
  -mavx512vpopcntdq           		[disabled]
  -mavxvnni                   		[disabled]
  -mbionic                    		[disabled]
  -mbmi                       		[disabled]
  -mbmi2                      		[disabled]
  -mbranch-cost=<0,5>         		3
  -mcall-ms2sysv-xlogues      		[disabled]
  -mcet-switch                		[disabled]
  -mcld                       		[disabled]
  -mcldemote                  		[disabled]
  -mclflushopt                		[disabled]
  -mclwb                      		[disabled]
  -mclzero                    		[disabled]
  -mcmodel=                   		small
  -mcpu=                      		
  -mcrc32                     		[enabled]
  -mcx16                      		[enabled]
  -mdaz-ftz                   		[disabled]
  -mdispatch-scheduler        		[disabled]
  -mdump-tune-features        		[disabled]
  -menqcmd                    		[disabled]
  -mf16c                      		[disabled]
  -mfancy-math-387            		[enabled]
  -mfentry                    		[disabled]
  -mfentry-name=              		
  -mfentry-section=           		
  -mfma                       		[disabled]
  -mfma4                      		[disabled]
  -mforce-drap                		[disabled]
  -mforce-indirect-call       		[disabled]
  -mfp-ret-in-387             		[enabled]
  -mfpmath=                   		sse
  -mfsgsbase                  		[disabled]
  -mfunction-return=          		keep
  -mfused-madd                		-ffp-contract=fast
  -mfxsr                      		[enabled]
  -mgather                    		-mtune-ctrl=use_gather
  -mgeneral-regs-only         		[disabled]
  -mgfni                      		[disabled]
  -mglibc                     		[enabled]
  -mhard-float                		[enabled]
  -mharden-sls=               		none
  -mhle                       		[disabled]
  -mhreset                    		[disabled]
  -miamcu                     		[disabled]
  -mieee-fp                   		[enabled]
  -mincoming-stack-boundary=  		0
  -mindirect-branch-cs-prefix 		[disabled]
  -mindirect-branch-register  		[disabled]
  -mindirect-branch=          		keep
  -minline-all-stringops      		[disabled]
  -minline-stringops-dynamically 	[disabled]
  -minstrument-return=        		none
  -mintel-syntax              		-masm=intel
  -mkl                        		[disabled]
  -mlarge-data-threshold=<number> 	65536
  -mlong-double-128           		[disabled]
  -mlong-double-64            		[disabled]
  -mlong-double-80            		[enabled]
  -mlwp                       		[disabled]
  -mlzcnt                     		[disabled]
  -mmanual-endbr              		[disabled]
  -mmemcpy-strategy=          		
  -mmemset-strategy=          		
  -mmitigate-rop              		[disabled]
  -mmmx                       		[enabled]
  -mmovbe                     		[disabled]
  -mmovdir64b                 		[disabled]
  -mmovdiri                   		[disabled]
  -mmpx                       		[disabled]
  -mms-bitfields              		[disabled]
  -mmusl                      		[disabled]
  -mmwait                     		[enabled]
  -mmwaitx                    		[disabled]
  -mneeded                    		[disabled]
  -mno-align-stringops        		[disabled]
  -mno-default                		[disabled]
  -mno-fancy-math-387         		[disabled]
  -mno-push-args              		[disabled]
  -mno-red-zone               		[disabled]
  -mno-sse4                   		[disabled]
  -mnop-mcount                		[disabled]
  -momit-leaf-frame-pointer   		[disabled]
  -mpc32                      		[disabled]
  -mpc64                      		[disabled]
  -mpc80                      		[disabled]
  -mpclmul                    		[disabled]
  -mpcommit                   		[disabled]
  -mpconfig                   		[disabled]
  -mpku                       		[disabled]
  -mpopcnt                    		[enabled]
  -mprefer-avx128             		-mprefer-vector-width=128
  -mprefer-vector-width=      		256
  -mpreferred-stack-boundary= 		0
  -mprefetchwt1               		[disabled]
  -mprfchw                    		[disabled]
  -mptwrite                   		[disabled]
  -mpush-args                 		[enabled]
  -mrdpid                     		[disabled]
  -mrdrnd                     		[disabled]
  -mrdseed                    		[disabled]
  -mrecip                     		[disabled]
  -mrecip=                    		
  -mrecord-mcount             		[disabled]
  -mrecord-return             		[disabled]
  -mred-zone                  		[enabled]
  -mregparm=                  		6
  -mrtd                       		[disabled]
  -mrtm                       		[disabled]
  -msahf                      		[enabled]
  -mserialize                 		[disabled]
  -msgx                       		[disabled]
  -msha                       		[disabled]
  -mshstk                     		[disabled]
  -mskip-rax-setup            		[disabled]
  -msoft-float                		[disabled]
  -msse                       		[enabled]
  -msse2                      		[enabled]
  -msse2avx                   		[disabled]
  -msse3                      		[enabled]
  -msse4                      		[enabled]
  -msse4.1                    		[enabled]
  -msse4.2                    		[enabled]
  -msse4a                     		[disabled]
  -msse5                      		-mavx
  -msseregparm                		[disabled]
  -mssse3                     		[enabled]
  -mstack-arg-probe           		[disabled]
  -mstack-protector-guard-offset= 	
  -mstack-protector-guard-reg= 		
  -mstack-protector-guard-symbol= 	
  -mstack-protector-guard=    		tls
  -mstackrealign              		[disabled]
  -mstringop-strategy=        		[default]
  -mstv                       		[enabled]
  -mtbm                       		[disabled]
  -mtls-dialect=              		gnu
  -mtls-direct-seg-refs       		[enabled]
  -mtsxldtrk                  		[disabled]
  -mtune-ctrl=                		
  -mtune=                     		znver4
  -muclibc                    		[disabled]
  -muintr                     		[disabled]
  -mvaes                      		[disabled]
  -mveclibabi=                		[default]
  -mvect8-ret-in-mem          		[disabled]
  -mvpclmulqdq                		[disabled]
  -mvzeroupper                		[disabled]
  -mwaitpkg                   		[disabled]
  -mwbnoinvd                  		[disabled]
  -mwidekl                    		[disabled]
  -mx32                       		[disabled]
  -mxop                       		[disabled]
  -mxsave                     		[disabled]
  -mxsavec                    		[disabled]
  -mxsaveopt                  		[disabled]
  -mxsaves                    		[disabled]
  Known assembler dialects (for use with the -masm= option):
    att intel
  Known ABIs (for use with the -mabi= option):
    ms sysv
  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small
  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387
  Known choices for mitigation against straight line speculation with -mharden-sls=:
    all indirect-jmp none return
  Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    keep thunk thunk-extern thunk-inline
  Known choices for return instrumentation with -minstrument-return=:
    call none nop5
  Known data alignment choices (for use with the -malign-data= option):
    abi cacheline compat
  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml
  Known address mode (for use with the -maddress-mode= option):
    long short
  Known preferred register vector length (to use with the -mprefer-vector-width= option):
    128 256 512 none
  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls
  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2
  Known valid arguments for -march= option:
    i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 znver4 btver1 btver2 generic native
  Known valid arguments for -mtune= option:
    generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3 znver4

@vitaut
Copy link
Owner

vitaut commented Jan 22, 2026

Interesting that it reports znver4 and not znver3.

@vitaut
Copy link
Owner

vitaut commented Jan 22, 2026

Any thoughts on using the method from NEON? It would allow keeping the struct const.

  const constants* c = &consts;
  ZMIJ_ASM(("" : "+r"(c)));

  using ptr = const __m128i*;
  const __m128i div10k = _mm_load_si128(ptr(&c->div10k));
  ...

AFAICS it generates the same(ish) assembly: https://www.godbolt.org/z/Ejrx1EobK

@TobiSchluter
Copy link
Contributor Author

@vitaut I must have messed up when I tried it because I didn't see it have an impact. I'm happy if it works, of course!

@TobiSchluter
Copy link
Contributor Author

TobiSchluter commented Jan 22, 2026

Interesting that it reports znver4 and not znver3.

Even more interesting because:
image

znver4 was only added in gcc 12.3. Maybe 11.5 (which isn't on godbolt) added it as well? Maybe they added it because EPYC Milan is a picky little bird?

Here's the patch that added it to gcc: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603486.html
"This is a basic enablement patch and as of now the costings, tunings are kept same as znver3."
So one would have to dig deeper to understand more of it.

Edit: yes, it was added in 11.5 https://gcc.gnu.org/onlinedocs/gcc-11.5.0/gcc/x86-Options.html#index-march-14 It adds mainly AVX-512

Copy link
Owner

@vitaut vitaut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my Intel machine I do see a small improvement with this change. So it looks like the results are mostly positive and I'm fine with merging the PR but please apply the technique from the NEON path to keep the struct const.

@TobiSchluter
Copy link
Contributor Author

I had to get up to speed with asm constraints to clearly understand what had to be done, but it's done. Instead of telling the compiler that the memory it's reading has changed, we instruct it to forget the pointer through which it is reading the memory. Performance is not measurably different from my original patch, maybe a tiny bit faster.

@TobiSchluter
Copy link
Contributor Author

TobiSchluter commented Jan 24, 2026

@vitaut I've no idea how or why I'd closed this? Seems to have happened when I force-pushed it. Anyway, it's still live and updated per your request.

@vitaut
Copy link
Owner

vitaut commented Jan 24, 2026

For some reason the PR shows 0 commits. Could you try to rebase/push again or, if it doesn't help, submit a new PR?

@TobiSchluter TobiSchluter reopened this Jan 25, 2026
@TobiSchluter
Copy link
Contributor Author

Fixed it, sorry about wasting your cycles. Hope it was fast enough to avoid another branch.

@TobiSchluter TobiSchluter requested a review from vitaut January 25, 2026 00:35
Ensures constants are loaded from memory.
@TobiSchluter
Copy link
Contributor Author

I'm fairly sure that I didn't do something silly this time around.

@vitaut vitaut merged commit 6589ede into vitaut:main Jan 25, 2026
@vitaut
Copy link
Owner

vitaut commented Jan 25, 2026

Merged, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants