Fixes #272 BurstCost aren't handling multi-consumers correctly by franz1981 · Pull Request #273 · JCTools/JCTools

franz1981 · 2019-11-16T08:27:37Z

This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario
and will make both to look similar.

This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario and will make both to look similar.

coveralls · 2019-11-16T08:35:00Z

Pull Request Test Coverage Report for Build 596

0 of 0 changed or added relevant lines in 0 files are covered.
20 unchanged lines in 5 files lost coverage.
Overall coverage increased (+0.04%) to 86.033%

Files with Coverage Reduction	New Missed Lines	%
jctools-core/src/main/java/org/jctools/queues/atomic/MpscLinkedAtomicQueue.java	1	98.59%
jctools-core/src/main/java/org/jctools/maps/NonBlockingSetInt.java	3	77.65%
jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMapLong.java	4	80.15%
jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java	4	94.74%
jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMap.java	8	78.12%

Totals
Change from base Build 592:	0.04%
Covered Lines:	4743
Relevant Lines:	5513

💛 - Coveralls

franz1981 · 2019-11-16T08:41:05Z

Now I'm running the Queue-one to check again the results on http://psy-lob-saw.blogspot.com/2015/01/mpmc-multi-multi-queue-vs-clq.html ;)
This fix is related to the mpmc xadd q too: I would like to run it vs #269 to compare with MpmcArrayQueue with multiple consumers

franz1981 · 2019-11-16T08:49:10Z

I could have used a different approach to track event handlings ie by not using an AtomicLong shared between consumers, but a long[] per-producer (thread) where each consumer is responsible of a specific index. It would make the waitFor slower, but won't have the xadd cost of AtomicLong during a contended handling. Probably I will add a commit with this approach too.

franz1981 · 2019-11-16T09:31:53Z

With the first version of the benchmark applied on #269 (by using xadd when consumers > 1)
with burstSize = 100 I'm getting:

---------
1P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score     Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  11318.278 ±  91.259  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  17116.521 ± 480.471  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40   7148.785 ± 236.638  ns/op
---------
2P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score     Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  23409.568 ± 355.988  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  40930.614 ± 876.973  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  26813.796 ± 670.512  ns/op
---------
3P-3C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score      Error  Units
QueueBurstCost.burstCost          100                3       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  37742.824 ±  486.882  ns/op
QueueBurstCost.burstCost          100                3       132000        ConcurrentLinkedQueue      true  avgt   40  67822.700 ± 3790.893  ns/op
QueueBurstCost.burstCost          100                3       132000               MpmcArrayQueue      true  avgt   40  49307.053 ± 1144.943  ns/op
---------
4P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   64889.469 ± 2997.470  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  112583.181 ± 8774.015  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40   67236.589 ± 2185.796  ns/op
---------
5P-5C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                5       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  102090.251 ± 1972.969  ns/op
QueueBurstCost.burstCost          100                5       132000        ConcurrentLinkedQueue      true  avgt   40  183607.705 ± 2645.447  ns/op
QueueBurstCost.burstCost          100                5       132000               MpmcArrayQueue      true  avgt   40   93294.382 ± 3663.733  ns/op
---------
6P-6C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  137735.752 ± 2284.493  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  248958.667 ± 3797.790  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  122083.077 ± 5117.835  ns/op

I will stress some asymmetric configurations too to make sure that the xadd q is delivering its promises:

---------
11P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   65348.761 ±  3252.266  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  274958.795 ± 27544.429  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40  312524.595 ±  5232.305  ns/op
---------
10P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  125105.547 ± 3191.048  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  207516.138 ± 3626.670  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  278119.574 ± 9714.185  ns/op
---------
8P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  130007.279 ±  5887.400  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  203753.285 ± 14697.797  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40  204236.319 ± 18750.423  ns/op

All the previous results has been collected by tasksetting the bench to run on a single numa node, disabling HT, while the next result is on wild (but idle) machine (HT, 48 cores, 2 numa nodes):

---------
22P-22C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt        Score        Error  Units
QueueBurstCost.burstCost          100               22       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   595094.367 ±  22259.547  ns/op
QueueBurstCost.burstCost          100               22       132000        ConcurrentLinkedQueue      true  avgt   40  1973646.942 ±  49411.295  ns/op
QueueBurstCost.burstCost          100               22       132000               MpmcArrayQueue      true  avgt   40  1028019.196 ± 165624.742  ns/op
---------
8P-8C
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                8       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  207026.684 ±  7375.556  ns/op
QueueBurstCost.burstCost          100                8       132000        ConcurrentLinkedQueue      true  avgt   40  765197.805 ± 52690.660  ns/op
QueueBurstCost.burstCost          100                8       132000               MpmcArrayQueue      true  avgt   40  283497.495 ±  6342.431  ns/op

It's interesting to see that with HT enabled and no taskset, cases that shows a tiny advantage of MpmcArrayQueue vs xadd q, obtain a very different result, like:

---------
6P-6C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  145090.336 ±  4518.584  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  554241.709 ± 12009.224  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  199863.872 ± 11239.150  ns/op

Xadd q perform similarly as before, but MpmcArrayQueue is performing a lot worse

franz1981 · 2019-11-17T13:00:43Z

I've added a new commit, trying to save using xadd consumer-side, by using separate counters per-consumer. It would hit false-sharing between counters: this could be saved by using 2*CACHE_LINE pads between each counter to avoid consumers invalidating each-others cache-lines while handling producers tasks.
I'm reporting here the same results of #273 (comment) to see if there is any improvement on the results.

---------
1P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score     Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  11583.685 ±  58.408  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  17266.134 ± 320.246  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40   6979.031 ± 125.073  ns/op
---------
2P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score      Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  23675.018 ±  499.502  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  46451.762 ± 3138.152  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  26189.720 ± 1318.632  ns/op
---------
3P-3C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score      Error  Units
QueueBurstCost.burstCost          100                3       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  43946.182 ± 2227.768  ns/op
QueueBurstCost.burstCost          100                3       132000        ConcurrentLinkedQueue      true  avgt   40  87120.460 ± 3194.932  ns/op
QueueBurstCost.burstCost          100                3       132000               MpmcArrayQueue      true  avgt   40  45105.417 ±  929.084  ns/op
---------
4P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   70045.186 ± 2241.860  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  137313.832 ± 1815.112  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40   68284.462 ± 1983.369  ns/op
---------
5P-5C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                5       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  100252.444 ± 3413.939  ns/op
QueueBurstCost.burstCost          100                5       132000        ConcurrentLinkedQueue      true  avgt   40  191076.869 ± 2130.139  ns/op
QueueBurstCost.burstCost          100                5       132000               MpmcArrayQueue      true  avgt   40   95408.004 ± 4680.241  ns/op
---------
6P-6C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  134966.416 ± 1555.916  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  249424.726 ± 2469.640  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  127035.803 ± 5283.807  ns/op

With asymmetric configurations:

---------
11P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   67192.319 ±  4189.873  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  285255.400 ± 19155.638  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40  315123.221 ±  4491.199  ns/op
---------
10P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  117949.236 ±  4571.083  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  267017.522 ±  8036.012  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  284326.729 ± 10299.422  ns/op
---------
8P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  137988.606 ±  4540.951  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  283946.006 ±  3914.199  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40  222860.950 ± 12340.679  ns/op

The "wild" machine cases too:

---------
22P-22C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt        Score        Error  Units
QueueBurstCost.burstCost          100               22       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   489316.718 ±  41842.624  ns/op
QueueBurstCost.burstCost          100               22       132000        ConcurrentLinkedQueue      true  avgt   40  1901868.898 ±  53407.107  ns/op
QueueBurstCost.burstCost          100               22       132000               MpmcArrayQueue      true  avgt   40   971550.149 ± 150254.715  ns/op
---------
8P-8C
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                8       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  190692.631 ±  7632.108  ns/op
QueueBurstCost.burstCost          100                8       132000        ConcurrentLinkedQueue      true  avgt   40  805924.222 ± 16923.904  ns/op
QueueBurstCost.burstCost          100                8       132000               MpmcArrayQueue      true  avgt   40  279469.248 ± 12048.981  ns/op
---------
6P-6C
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  126767.559 ±  5528.759  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  466296.406 ± 51349.549  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  200847.577 ± 15646.834  ns/op

Some conclusions from the asymmetric tests: I don't think MpmcArrayQueue has something to envy to CLQ although is getting worst results then it, but that given the consumer count is lowered, the chance that consumers are able to getting empty q is higher, invalidating the producer sequence in order to be sure of emptyness; probably that's the reason why it seems slower then CLQ.

nitsanw · 2019-11-18T08:55:37Z

+        public void init(QueueBurstCost state)
        {
-            // do nothing
+            handledCount = new long[state.consumerCount];


I think it would be far less complex to put the counter on Consumer, which already has padding. The waitFor then looks at the counters on the consumers. You can reduce the reader induced coherence overhead by spin waiting on the final event (as before) and then transition to consumer scanning. Further you can batch update the handled counter on the Consumer only when the queue is observed empty or relaxedPoll returns null.

I think it would be far less complex to put the counter on Consumer, which already has padding

That's a nice idea, but the counter is not just per-consumer, need to be per-producer (thread) as well.

You can reduce the reader induced coherence overhead by spin waiting on the final event (as before) and then transition to consumer scanning

This would work if the last offered event will be the last to be handled, but with multi-consumer is not ensured anymore: we cannot be sure neither that every consumer will handl at least 1 event, because is a kind of fairness that we don't force neither we measure.

"That's a nice idea, but the counter is not just per-consumer, need to be per-producer (thread) as well."
Fair point, in which case, yes you have to pad the array based counter.

"This would work if the last offered event will be the last to be handled, but with multi-consumer is not ensured anymore"
You can't be sure, but it's fair to assume that there's no need to check the counters before that one in handled.

it's fair to assume that there's no need to check the counters before that one in handled

I don't understand this: maybe with "before" I mean something different.
My understanding of "before" is the index/id of consumer on Event::handledCount: if the last (submitted) event is being handled, I can trust that the previous events would be handled with offering order just in the single consumer case, saving to poll for theirs completion.
AFAIK the current algorithm won't do that anyway (because Event::handledCount::length == consumerCount == 1, you ends up always polling the same counter/same consumer).

If with "before" you mean the previous handled events by that same consumer: if the last event will make the consumer to write its id while handling it (ie the index on Event::handledCount), then we can save to poll that same consumer and use the Event::handledCount[<id of last event>] count with no need to poll it, but what about the other consumers? They still be needed to be polled until reaching the expected bustSize count.

Help me to understand what do you mean, sorry for being slow in this :)

"I can trust that the previous events would be handled with offering order just in the single consumer case, saving to poll for theirs completion."
I'm not saying you should trust this as the only indicator, just that it's a good heuristic to reduce the overall noise.

WRT benchmark results, I would appreciate it if you made the effort to format those into something that is more easily comparable.

WRT the consume method inlining, I can't remember. There's good reasons to avoid inlining in infinite loops to avoid OSR compilations, but I can't remember what was the reasons/testing here.

Configuration is the same as https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md#lies-damned-lies-and-benchmarks and test has been run using:

$ for i in 1 2 3 4 5 6; do java -jar microbenchmarks.jar org.jctools.jmh.latency.QueueBurstCost.* -f 10 -t $i -p consumerCount=$i -p burstSize=100 -p qType=MpmcUnboundedXaddArrayQueue,ConcurrentLinkedQueue,MpmcArrayQueue -rf json -rff $i.json >> burst.$i.txt; done

Symmetric tests
1P-1C:

2P-2C:

3P-3C:

4P-4C:

5P-5C:

6P-6C:

Asymmetric tests
8P-4C:

10P-2C:

11P-1C:

@nitsanw Let me know any insights about them and how you think is performing xadd q in your opinion @ #269
I see many things that are just weird to me

franz1981 · 2019-11-18T22:23:02Z

Configuration is the same as https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md#lies-damned-lies-and-benchmarks and test has been run using:

$ for i in 1 2 3 4 5 6; do java -jar microbenchmarks.jar org.jctools.jmh.latency.QueueBurstCost.* -f 10 -t $i -p consumerCount=$i -p burstSize=100 -p qType=MpmcUnboundedXaddArrayQueue,ConcurrentLinkedQueue,MpmcArrayQueue -rf json -rff $i.json >> burst.$i.txt; done

Symmetric tests
1P-1C:

2P-2C:

3P-3C:

4P-4C:

5P-5C:

6P-6C:

Asymmetric tests
8P-4C:

10P-2C:

11P-1C:

franz1981 · 2019-11-19T09:00:13Z

@nitsanw please, do not hesitate to ask if you want me to run any specific test on the same machine :)

nitsanw · 2019-11-20T12:54:22Z

@franz1981 These are interesting. How do they compare to the prev.(before the fix) results?

franz1981 · 2019-11-20T20:02:33Z

@nitsanw
I'm re-running the same benchmarks of #273 (comment) on master, but with #269 applied.
If the results will prove to be sound (for this PR) I will perform the same vs master without #269 applied to verify that #269 seems a good change (throughput tests already confirmed it).

Symmetric tests
1P-1C:

2P-2C:

3P-3C:

4P-4C:

5P-5C:

6P-6C:

Asymmetric tests
8P-4C:

10P-2C:

11P-1C:

Let me know if these are the results you were expecting 👍

nitsanw · 2019-11-21T10:01:01Z

#269 is a different story, I'm interested in the benchmark results before/after the change to the benchmarks. In particular we would expect all the single consumer scenarios to produce the same results.

nitsanw · 2019-11-21T10:02:47Z

To be clear, I'm not at all interested here in the comparison between different queue implementations so much as in the effect of the benchmark change. So the visualisation, while nice, still requires me to scroll up and down and look for the b/a picture.

franz1981 · 2019-11-21T10:13:24Z

In particular we would expect all the single consumer scenarios to produce the same results.

That's what the numbers says, assuming some error due to my use of taskset: I'm not using taskset to let the benchs run on just 2 different cores, but always on 12 different cores belonging the same numa node, so both benchs can see some error due to context switches/core migrations.
I can re-run both benchs in the 1P-1C case while pinning it just on 2 cores, if you think it worths.

So the visualisation, while nice, still requires me to scroll up and down and look for the b/a picture.

Given that NP-NC with N > 1 are totally not-comparable (the original code with multi-consumer is just broken IIRC) I can arrange the previous results of 1P-1C to make it more easily comparable.
<joking-mode>
Please, hold your breath and save those expensive scrolling movements to compare charts: we're not athletes!! :D
</joking-mode>
NOTE: https://jmh.morethan.io/ rocks :D

That's a better view of the 1P-1C case:

Maybe would worth to do it for NP-1C to see if there are any regressions

franz1981 · 2019-11-21T10:36:45Z

I'm not at all interested here in the comparison between different queue implementations

I can open a separate issue or write on the PR re the mpmc xadd fix if you prefer, but I need some magic help from you there, bud :)

franz1981 · 2019-11-23T06:52:24Z

In addition to the results of 1P-1C on #273 (comment), I'm adding these ones too:
2P-1C:

4P-1C:

8P-1C:

11P-1C:

TBH I was expecting the same trend of the 1P-1C results ie the new PR to slow down a bit the task execution, but rather it seems the opposite, although I see that the errors are more pronounced.
Maybe the original benchmark was having some issue with bimorphic calls? I cannot think to other reasons that would make the new benchmark lighter then the one on master.

nitsanw · 2019-11-25T05:04:26Z

"TBH I was expecting the same trend of the 1P-1C results ie the new PR to slow down a bit the task execution, but rather it seems the opposite" - A slower consumer means less contention. This is a case where consumers are much faster than producers (no contention) so the queue has no slack.

franz1981 · 2019-11-25T08:25:30Z

A slower consumer means less contention

That makes sense to me, but how to make it more evident on the benchmark results?
With JMH queue throughput benchs we've the failed polls stats, while here we have to profile/perfasm to understand what's going on. Getting the failed poll stats from the beginning till the end of a burst could be helpful although I've no (yet) any idea how to collect these info here.

Returning on topic: this PR's handle seems more heavyweight then master; as you've suggested it can be made lighter for the single-consumer case, although at the cost a bimorphic call on consumer. But I see 2 benefits of this PR as it is:

single/multi-consumer tests behave the same and can be compared (ie handle cost is the same for both)
the error bars on 1P-1C shows that the difference from master is not big

I'm going to address #273 (comment) and wil wait your opinion on how to proceed for this.

nitsanw · 2019-11-25T10:14:41Z

I think with the start latch it's good to merge. Sorry for the long review process :-)

franz1981 · 2019-11-25T10:34:00Z

@nitsanw No worries bud; that's the base where we collect numbers to understand what's going on and is sufficiently delicate that need some care while modifying it :)
Give me few minutes that I write the latch change and will push the commit: do you want me to squash all the commits into 1?

franz1981 · 2019-11-25T11:07:27Z

I've already pushed the commit with the started latch change: the only downside I see is that one of the consumer(s) will pay the cost of awaking it, unblocking startConsumers, but it shouldn't be relevant, given that the bench iteration won't start until startConsumers won't end.

Fixes JCTools#272 BurstCost aren't handling multi-consumers correctly

66e71af

This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario and will make both to look similar.

Unified multi with single-consumer handling and saved XADD handling

2bac336

franz1981 force-pushed the mc_burst_bench branch from 1f41ce6 to 2bac336 Compare November 17, 2019 11:10

nitsanw reviewed Nov 18, 2019

View reviewed changes

Comment thread jctools-benchmarks/src/main/java/org/jctools/jmh/latency/QueueBurstCost.java Outdated

nitsanw reviewed Nov 18, 2019

View reviewed changes

Comment thread jctools-benchmarks/src/main/java/org/jctools/jmh/latency/QueueBurstCost.java

nitsanw reviewed Nov 18, 2019

View reviewed changes

Comment thread jctools-benchmarks/src/main/java/org/jctools/jmh/latency/QueueBurstCost.java

nitsanw reviewed Nov 18, 2019

View reviewed changes

Addressing Event false-sharing and qCapacity as a String config property

a4125a7

franz1981 force-pushed the mc_burst_bench branch from dccdcec to a4125a7 Compare November 18, 2019 13:13

nitsanw reviewed Nov 25, 2019

View reviewed changes

Comment thread jctools-benchmarks/src/main/java/org/jctools/jmh/latency/MpqBurstCost.java

Added start latch to ensure consumers are ready before offering

b86a84a

nitsanw merged commit 9c354ce into JCTools:master Nov 25, 2019

Conversation

franz1981 commented Nov 16, 2019

Uh oh!

coveralls commented Nov 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 596

💛 - Coveralls

Uh oh!

franz1981 commented Nov 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Nov 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Nov 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Nov 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nitsanw Nov 18, 2019

Choose a reason for hiding this comment

Uh oh!

franz1981 Nov 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nitsanw Nov 18, 2019

Choose a reason for hiding this comment

Uh oh!

franz1981 Nov 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nitsanw Nov 18, 2019

Choose a reason for hiding this comment

Uh oh!

nitsanw Nov 18, 2019

Choose a reason for hiding this comment

Uh oh!

nitsanw Nov 18, 2019

Choose a reason for hiding this comment

Uh oh!

franz1981 Nov 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

franz1981 Nov 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

franz1981 Nov 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

franz1981 commented Nov 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Nov 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nitsanw commented Nov 20, 2019

Uh oh!

franz1981 commented Nov 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nitsanw commented Nov 21, 2019

Uh oh!

nitsanw commented Nov 21, 2019

Uh oh!

franz1981 commented Nov 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Nov 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

franz1981 commented Nov 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

coveralls commented Nov 16, 2019 •

edited

Loading

franz1981 commented Nov 16, 2019 •

edited

Loading

franz1981 commented Nov 16, 2019 •

edited

Loading

franz1981 commented Nov 16, 2019 •

edited

Loading

franz1981 commented Nov 17, 2019 •

edited

Loading

franz1981 Nov 18, 2019 •

edited

Loading

franz1981 Nov 18, 2019 •

edited

Loading

franz1981 Nov 18, 2019 •

edited

Loading

franz1981 Nov 18, 2019 •

edited

Loading

franz1981 Nov 18, 2019 •

edited

Loading

franz1981 commented Nov 18, 2019 •

edited

Loading

franz1981 commented Nov 19, 2019 •

edited

Loading

franz1981 commented Nov 20, 2019 •

edited

Loading

franz1981 commented Nov 21, 2019 •

edited

Loading

franz1981 commented Nov 21, 2019 •

edited

Loading

franz1981 commented Nov 23, 2019 •

edited

Loading

franz1981 commented Nov 25, 2019 •

edited

Loading

franz1981 commented Nov 25, 2019 •

edited

Loading