Skip to content

Wait and lock free alternatives to LongAdder and AtomicLong#93

Merged
nitsanw merged 5 commits into
JCTools:masterfrom
qwwdfsad:master
Apr 2, 2016
Merged

Wait and lock free alternatives to LongAdder and AtomicLong#93
nitsanw merged 5 commits into
JCTools:masterfrom
qwwdfsad:master

Conversation

@qwwdfsad

Copy link
Copy Markdown
Contributor

As part of #36.

Alternative of LongAdder, which uses static-sized inlined stripes and doesn't try to resize adaptively.
So it benefits from less padding and dereferencing overhead (compared to @contended Cell class from Striped64), no complicated adaptive-resizing control flow and uses much more performant lock xadd (or lock addq) instruction (via Unsafe intrinsic) instead of lock cmpxchg (which is used in LongAdder instead of xadd only to indicate contention fact). Tradeoff between inc/get performance could be tuned via stripes count constructor parameter. Less performant version of counter for Java 6 included.
Assembly of inc method is here.

Some interesting benchmarks results are here (different cores amount/cross-socket modes, but always single reader/multiple writers). Benchmarks were performed on 32 cores Xeon with disabled hyper-threading (so in fact there was 16 cores), 8u72, 32 stripes (not cores * 4 as stated in current benchmark code) and with a little bit slower index method (it was 'dec' + two 'and' ops instead of single 'and' as now). Similar results are reproducible on my i7.

If you are going to benchmark this code manually, then be aware of JDK-6515172 (affects LongAdder measurements)

@guidomedina

Copy link
Copy Markdown

Do you think it is worth trying to compete with atomic counters after JDK 6? I know the assembler instructions have changed a lot after JDK 6.

Doug Lea have been doing a great job with CHM which you could witness at JDK 7 & 8, to the point of rendering any other alternative to be a hassle rather than improving anything.

Now queues are a different story, until they aren't anymore.

Interesting article: http://brettwooldridge.github.io/HikariCP/ludicrous.html

NOTE: Not trying to discourage you as this problem for JDK 6 and 7 is still real (Not so sure for JDK 8 after xadd), just be aware in the future and know when to pull the switch, it might be a long time before that ;-)

@nitsanw

nitsanw commented Mar 13, 2016

Copy link
Copy Markdown
Contributor

@qwwdfsad thanks for your contribution and the considerable effort you put into benchmarking it.
Looking at your results and assuming we care only for the inc() speed the results for the FixedSizeStripedV8 are encouraging. It's reasonable to compare the V6 results with a LongAdder backport perhaps, if one can be copied from somewhere.
A note on the implementation: You can replace the factory with factory methods returning the FixedSizeStripedLongCounter (and constructing the appropriate subclass) and make the implementations package private. This limits the class exposure and allows the users to make their ref type the super class. Because OpenJDK CHA analysis doesn't cover interfaces this is a minor saving for the performance crazed :-). They can always choose the interface if they like.
There's something to be said for resolving this issue in JCTools since it is Java 1.6 compliant and I know there has been some interest in this in the past. I'm happy to give it a home in the experimental asylum for now and think on it (once we've worked through the review comments).

@guidomedina

Copy link
Copy Markdown

@qwwdfsad After I read my own comment I think I presented it wrong, my comment only applies for Java 8 and that is a maybe, the problem you are trying to solve is very real and valid for JDK 6 and 7 which is where most users are at the moment so please ignore my previous comment, take it a further research for future of JDK 8, 9 and so on.

@qwwdfsad

Copy link
Copy Markdown
Contributor Author

@nitsanw I've changed return type of CountersFactory and extended it a bit. Factory seems fine (if you don't mind), especially if other counters (thread local / statistical) will be added in the future, plus it's common pattern in JCTools.
Maybe it's a good idea not to expose create..V6/V8 methods and let benchmarks and users use most performant implementation?

@qwwdfsad

Copy link
Copy Markdown
Contributor Author

@guidomedina as you can see in benchmark results for JDK 8, my counter is already faster. I'm not really competing with Doug Lea: LongAdder is general-purpose counter with dynamic resizing (to avoid excess allocations (256 bytes with @contended per Cell instance) when counter is not contended), that's why xadd can't be used (xadd doesn't indicate contention fact, but cmpxchg fail does) here, and my implementation is much more specific and doesn't try to adapt, so I can use xadd and avoid contention/resize checks.

That's why difference between my counter and LongAdder in JDK 6 shouldn't be very noticeable (both rely on same cmpxchg op), it's added for compatibility with JDK 6 only.

@nitsanw

nitsanw commented Mar 16, 2016

Copy link
Copy Markdown
Contributor

@qwwdfsad sorry for taking long, I'm having a rather crowded week... I hope to get to reviewing and merging this in a few days

@nitsanw nitsanw merged commit e5a9dd4 into JCTools:master Apr 2, 2016
@nitsanw

nitsanw commented Apr 2, 2016

Copy link
Copy Markdown
Contributor

OK, merged, with some minor rework.
In rudimentary benchmarks the benefit seems marginal, but that's on a laptop with 2 cores(4 threads) and so probably not very meaningful. I'll do more benchmarking and think on making this part of core.
@qwwdfsad Thanks allot for your efforts, please have a look at my changes and comment.

@qwwdfsad

Copy link
Copy Markdown
Contributor Author

Oh, sorry for such a long pause, I completely missed github notification :(
Changes looks good, thanks for cleanup.

Firstly, I've found article from concurrency freaks with almost the same counter: http://concurrencyfreaks.blogspot.ru/2013/08/concurrency-pattern-distributed-cache.html (and comparison with LongAdder: http://concurrencyfreaks.blogspot.co.uk/2013/09/longadder-and-dclc.html), so my idea isn't new (eh). Their implementation shows significant speed-up on 32 core macnine.

Secondly, note that my benchmark configuration was pretty specific: tickless kernel with disabled RCU and hyper-threading, so it's not very "common case". It was done not to show something, but because at that moment it was the only available linux machine without load I had access to. Benefit on 2/4/8 cores shouldn't be very sufficient, but its worth measuring on larger machines.

Currently I have no multi-socket machine, but I'm ready to invest some time into benchmarking on 8-cores i7 (where difference shouldn't be huge) if you are still interested.

@nitsanw

nitsanw commented Jun 21, 2016

Copy link
Copy Markdown
Contributor

@qwwdfsad don't worry :-) original ideas are few and far between, I have a single original algorithm in the whole of JCTools (the chunked MPSC). The rest are ports/mashups/variations on existing work.
I'll have a look at that post, certainly it would be good to benchmark on larger machines, I have a large lab to experiment with (but all intels, so if you got an interesting ARM machine that would be cool).
I'm likely to put this into core at some point, since I'm putting in NHBM which also brings in a counter type.

@qwwdfsad

Copy link
Copy Markdown
Contributor Author

Thank you, I will try not to :) Alas, I have no ARM machines.
Glad to hear about core integration! Hope you will share benchmark results :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants