Wait and lock free alternatives to LongAdder and AtomicLong#93
Conversation
|
Do you think it is worth trying to compete with atomic counters after JDK 6? I know the assembler instructions have changed a lot after JDK 6. Doug Lea have been doing a great job with CHM which you could witness at JDK 7 & 8, to the point of rendering any other alternative to be a hassle rather than improving anything. Now queues are a different story, until they aren't anymore. Interesting article: http://brettwooldridge.github.io/HikariCP/ludicrous.html NOTE: Not trying to discourage you as this problem for JDK 6 and 7 is still real (Not so sure for JDK 8 after xadd), just be aware in the future and know when to pull the switch, it might be a long time before that ;-) |
|
@qwwdfsad thanks for your contribution and the considerable effort you put into benchmarking it. |
|
@qwwdfsad After I read my own comment I think I presented it wrong, my comment only applies for Java 8 and that is a maybe, the problem you are trying to solve is very real and valid for JDK 6 and 7 which is where most users are at the moment so please ignore my previous comment, take it a further research for future of JDK 8, 9 and so on. |
|
@nitsanw I've changed return type of CountersFactory and extended it a bit. Factory seems fine (if you don't mind), especially if other counters (thread local / statistical) will be added in the future, plus it's common pattern in JCTools. |
|
@guidomedina as you can see in benchmark results for JDK 8, my counter is already faster. I'm not really competing with Doug Lea: LongAdder is general-purpose counter with dynamic resizing (to avoid excess allocations (256 bytes with @contended per Cell instance) when counter is not contended), that's why xadd can't be used (xadd doesn't indicate contention fact, but cmpxchg fail does) here, and my implementation is much more specific and doesn't try to adapt, so I can use xadd and avoid contention/resize checks. That's why difference between my counter and LongAdder in JDK 6 shouldn't be very noticeable (both rely on same cmpxchg op), it's added for compatibility with JDK 6 only. |
|
@qwwdfsad sorry for taking long, I'm having a rather crowded week... I hope to get to reviewing and merging this in a few days |
|
OK, merged, with some minor rework. |
|
Oh, sorry for such a long pause, I completely missed github notification :( Firstly, I've found article from concurrency freaks with almost the same counter: http://concurrencyfreaks.blogspot.ru/2013/08/concurrency-pattern-distributed-cache.html (and comparison with LongAdder: http://concurrencyfreaks.blogspot.co.uk/2013/09/longadder-and-dclc.html), so my idea isn't new (eh). Their implementation shows significant speed-up on 32 core macnine. Secondly, note that my benchmark configuration was pretty specific: tickless kernel with disabled RCU and hyper-threading, so it's not very "common case". It was done not to show something, but because at that moment it was the only available linux machine without load I had access to. Benefit on 2/4/8 cores shouldn't be very sufficient, but its worth measuring on larger machines. Currently I have no multi-socket machine, but I'm ready to invest some time into benchmarking on 8-cores i7 (where difference shouldn't be huge) if you are still interested. |
|
@qwwdfsad don't worry :-) original ideas are few and far between, I have a single original algorithm in the whole of JCTools (the chunked MPSC). The rest are ports/mashups/variations on existing work. |
|
Thank you, I will try not to :) Alas, I have no ARM machines. |
As part of #36.
Alternative of LongAdder, which uses static-sized inlined stripes and doesn't try to resize adaptively.
So it benefits from less padding and dereferencing overhead (compared to @contended Cell class from Striped64), no complicated adaptive-resizing control flow and uses much more performant lock xadd (or lock addq) instruction (via Unsafe intrinsic) instead of lock cmpxchg (which is used in LongAdder instead of xadd only to indicate contention fact). Tradeoff between inc/get performance could be tuned via stripes count constructor parameter. Less performant version of counter for Java 6 included.
Assembly of inc method is here.
Some interesting benchmarks results are here (different cores amount/cross-socket modes, but always single reader/multiple writers). Benchmarks were performed on 32 cores Xeon with disabled hyper-threading (so in fact there was 16 cores), 8u72, 32 stripes (not cores * 4 as stated in current benchmark code) and with a little bit slower index method (it was 'dec' + two 'and' ops instead of single 'and' as now). Similar results are reproducible on my i7.
If you are going to benchmark this code manually, then be aware of JDK-6515172 (affects LongAdder measurements)