Fixes #272 BurstCost aren't handling multi-consumers correctly#273
Conversation
This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario and will make both to look similar.
Pull Request Test Coverage Report for Build 596
💛 - Coveralls |
|
Now I'm running the Queue-one to check again the results on http://psy-lob-saw.blogspot.com/2015/01/mpmc-multi-multi-queue-vs-clq.html ;) |
|
I could have used a different approach to track event handlings ie by not using an AtomicLong shared between consumers, but a long[] per-producer (thread) where each consumer is responsible of a specific index. It would make the |
|
With the first version of the benchmark applied on #269 (by using xadd when consumers > 1) I will stress some asymmetric configurations too to make sure that the xadd q is delivering its promises: All the previous results has been collected by tasksetting the bench to run on a single numa node, disabling HT, while the next result is on wild (but idle) machine (HT, 48 cores, 2 numa nodes): It's interesting to see that with HT enabled and no taskset, cases that shows a tiny advantage of Xadd q perform similarly as before, but |
1f41ce6 to
2bac336
Compare
|
I've added a new commit, trying to save using xadd consumer-side, by using separate counters per-consumer. It would hit false-sharing between counters: this could be saved by using 2*CACHE_LINE pads between each counter to avoid consumers invalidating each-others cache-lines while handling producers tasks. With asymmetric configurations: The "wild" machine cases too: Some conclusions from the asymmetric tests: I don't think |
| public void init(QueueBurstCost state) | ||
| { | ||
| // do nothing | ||
| handledCount = new long[state.consumerCount]; |
There was a problem hiding this comment.
I think it would be far less complex to put the counter on Consumer, which already has padding. The waitFor then looks at the counters on the consumers. You can reduce the reader induced coherence overhead by spin waiting on the final event (as before) and then transition to consumer scanning. Further you can batch update the handled counter on the Consumer only when the queue is observed empty or relaxedPoll returns null.
There was a problem hiding this comment.
I think it would be far less complex to put the counter on Consumer, which already has padding
That's a nice idea, but the counter is not just per-consumer, need to be per-producer (thread) as well.
You can reduce the reader induced coherence overhead by spin waiting on the final event (as before) and then transition to consumer scanning
This would work if the last offered event will be the last to be handled, but with multi-consumer is not ensured anymore: we cannot be sure neither that every consumer will handl at least 1 event, because is a kind of fairness that we don't force neither we measure.
There was a problem hiding this comment.
"That's a nice idea, but the counter is not just per-consumer, need to be per-producer (thread) as well."
Fair point, in which case, yes you have to pad the array based counter.
"This would work if the last offered event will be the last to be handled, but with multi-consumer is not ensured anymore"
You can't be sure, but it's fair to assume that there's no need to check the counters before that one in handled.
There was a problem hiding this comment.
it's fair to assume that there's no need to check the counters before that one in handled
I don't understand this: maybe with "before" I mean something different.
My understanding of "before" is the index/id of consumer on Event::handledCount: if the last (submitted) event is being handled, I can trust that the previous events would be handled with offering order just in the single consumer case, saving to poll for theirs completion.
AFAIK the current algorithm won't do that anyway (because Event::handledCount::length == consumerCount == 1, you ends up always polling the same counter/same consumer).
If with "before" you mean the previous handled events by that same consumer: if the last event will make the consumer to write its id while handling it (ie the index on Event::handledCount), then we can save to poll that same consumer and use the Event::handledCount[<id of last event>] count with no need to poll it, but what about the other consumers? They still be needed to be polled until reaching the expected bustSize count.
Help me to understand what do you mean, sorry for being slow in this :)
There was a problem hiding this comment.
"I can trust that the previous events would be handled with offering order just in the single consumer case, saving to poll for theirs completion."
I'm not saying you should trust this as the only indicator, just that it's a good heuristic to reduce the overall noise.
There was a problem hiding this comment.
WRT benchmark results, I would appreciate it if you made the effort to format those into something that is more easily comparable.
There was a problem hiding this comment.
WRT the consume method inlining, I can't remember. There's good reasons to avoid inlining in infinite loops to avoid OSR compilations, but I can't remember what was the reasons/testing here.
There was a problem hiding this comment.
Configuration is the same as https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md#lies-damned-lies-and-benchmarks and test has been run using:
$ for i in 1 2 3 4 5 6; do java -jar microbenchmarks.jar org.jctools.jmh.latency.QueueBurstCost.* -f 10 -t $i -p consumerCount=$i -p burstSize=100 -p qType=MpmcUnboundedXaddArrayQueue,ConcurrentLinkedQueue,MpmcArrayQueue -rf json -rff $i.json >> burst.$i.txt; done
dccdcec to
a4125a7
Compare
|
Configuration is the same as https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md#lies-damned-lies-and-benchmarks and test has been run using: |
|
@nitsanw please, do not hesitate to ask if you want me to run any specific test on the same machine :) |
|
@franz1981 These are interesting. How do they compare to the prev.(before the fix) results? |
|
@nitsanw Symmetric tests Let me know if these are the results you were expecting 👍 |
|
#269 is a different story, I'm interested in the benchmark results before/after the change to the benchmarks. In particular we would expect all the single consumer scenarios to produce the same results. |
|
To be clear, I'm not at all interested here in the comparison between different queue implementations so much as in the effect of the benchmark change. So the visualisation, while nice, still requires me to scroll up and down and look for the b/a picture. |
That's what the numbers says, assuming some error due to my use of
Given that NP-NC with N > 1 are totally not-comparable (the original code with multi-consumer is just broken IIRC) I can arrange the previous results of 1P-1C to make it more easily comparable. That's a better view of the 1P-1C case: Maybe would worth to do it for NP-1C to see if there are any regressions |
I can open a separate issue or write on the PR re the mpmc xadd fix if you prefer, but I need some magic help from you there, bud :) |
|
In addition to the results of 1P-1C on #273 (comment), I'm adding these ones too: TBH I was expecting the same trend of the 1P-1C results ie the new PR to slow down a bit the task execution, but rather it seems the opposite, although I see that the errors are more pronounced. |
|
"TBH I was expecting the same trend of the 1P-1C results ie the new PR to slow down a bit the task execution, but rather it seems the opposite" - A slower consumer means less contention. This is a case where consumers are much faster than producers (no contention) so the queue has no slack. |
That makes sense to me, but how to make it more evident on the benchmark results? Returning on topic: this PR's
I'm going to address #273 (comment) and wil wait your opinion on how to proceed for this. |
|
I think with the start latch it's good to merge. Sorry for the long review process :-) |
|
@nitsanw No worries bud; that's the base where we collect numbers to understand what's going on and is sufficiently delicate that need some care while modifying it :) |
|
I've already pushed the commit with the started latch change: the only downside I see is that one of the consumer(s) will pay the cost of awaking it, unblocking |























This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario
and will make both to look similar.