Skip to content

Fixes #272 BurstCost aren't handling multi-consumers correctly#273

Merged
nitsanw merged 4 commits into
JCTools:masterfrom
franz1981:mc_burst_bench
Nov 25, 2019
Merged

Fixes #272 BurstCost aren't handling multi-consumers correctly#273
nitsanw merged 4 commits into
JCTools:masterfrom
franz1981:mc_burst_bench

Conversation

@franz1981

Copy link
Copy Markdown
Collaborator

This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario
and will make both to look similar.

This is fixing MpqBurstCost and QueueBurstCost multi-consumer scenario
and will make both to look similar.
@coveralls

coveralls commented Nov 16, 2019

Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 596

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 20 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.04%) to 86.033%

Files with Coverage Reduction New Missed Lines %
jctools-core/src/main/java/org/jctools/queues/atomic/MpscLinkedAtomicQueue.java 1 98.59%
jctools-core/src/main/java/org/jctools/maps/NonBlockingSetInt.java 3 77.65%
jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMapLong.java 4 80.15%
jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java 4 94.74%
jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMap.java 8 78.12%
Totals Coverage Status
Change from base Build 592: 0.04%
Covered Lines: 4743
Relevant Lines: 5513

💛 - Coveralls

@franz1981

franz1981 commented Nov 16, 2019

Copy link
Copy Markdown
Collaborator Author

Now I'm running the Queue-one to check again the results on http://psy-lob-saw.blogspot.com/2015/01/mpmc-multi-multi-queue-vs-clq.html ;)
This fix is related to the mpmc xadd q too: I would like to run it vs #269 to compare with MpmcArrayQueue with multiple consumers

@franz1981

franz1981 commented Nov 16, 2019

Copy link
Copy Markdown
Collaborator Author

I could have used a different approach to track event handlings ie by not using an AtomicLong shared between consumers, but a long[] per-producer (thread) where each consumer is responsible of a specific index. It would make the waitFor slower, but won't have the xadd cost of AtomicLong during a contended handling. Probably I will add a commit with this approach too.

@franz1981

franz1981 commented Nov 16, 2019

Copy link
Copy Markdown
Collaborator Author

With the first version of the benchmark applied on #269 (by using xadd when consumers > 1)
with burstSize = 100 I'm getting:

---------
1P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score     Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  11318.278 ±  91.259  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  17116.521 ± 480.471  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40   7148.785 ± 236.638  ns/op
---------
2P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score     Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  23409.568 ± 355.988  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  40930.614 ± 876.973  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  26813.796 ± 670.512  ns/op
---------
3P-3C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score      Error  Units
QueueBurstCost.burstCost          100                3       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  37742.824 ±  486.882  ns/op
QueueBurstCost.burstCost          100                3       132000        ConcurrentLinkedQueue      true  avgt   40  67822.700 ± 3790.893  ns/op
QueueBurstCost.burstCost          100                3       132000               MpmcArrayQueue      true  avgt   40  49307.053 ± 1144.943  ns/op
---------
4P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   64889.469 ± 2997.470  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  112583.181 ± 8774.015  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40   67236.589 ± 2185.796  ns/op
---------
5P-5C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                5       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  102090.251 ± 1972.969  ns/op
QueueBurstCost.burstCost          100                5       132000        ConcurrentLinkedQueue      true  avgt   40  183607.705 ± 2645.447  ns/op
QueueBurstCost.burstCost          100                5       132000               MpmcArrayQueue      true  avgt   40   93294.382 ± 3663.733  ns/op
---------
6P-6C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  137735.752 ± 2284.493  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  248958.667 ± 3797.790  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  122083.077 ± 5117.835  ns/op

I will stress some asymmetric configurations too to make sure that the xadd q is delivering its promises:

---------
11P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   65348.761 ±  3252.266  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  274958.795 ± 27544.429  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40  312524.595 ±  5232.305  ns/op
---------
10P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  125105.547 ± 3191.048  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  207516.138 ± 3626.670  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  278119.574 ± 9714.185  ns/op
---------
8P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  130007.279 ±  5887.400  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  203753.285 ± 14697.797  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40  204236.319 ± 18750.423  ns/op

All the previous results has been collected by tasksetting the bench to run on a single numa node, disabling HT, while the next result is on wild (but idle) machine (HT, 48 cores, 2 numa nodes):

---------
22P-22C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt        Score        Error  Units
QueueBurstCost.burstCost          100               22       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   595094.367 ±  22259.547  ns/op
QueueBurstCost.burstCost          100               22       132000        ConcurrentLinkedQueue      true  avgt   40  1973646.942 ±  49411.295  ns/op
QueueBurstCost.burstCost          100               22       132000               MpmcArrayQueue      true  avgt   40  1028019.196 ± 165624.742  ns/op
---------
8P-8C
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                8       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  207026.684 ±  7375.556  ns/op
QueueBurstCost.burstCost          100                8       132000        ConcurrentLinkedQueue      true  avgt   40  765197.805 ± 52690.660  ns/op
QueueBurstCost.burstCost          100                8       132000               MpmcArrayQueue      true  avgt   40  283497.495 ±  6342.431  ns/op

It's interesting to see that with HT enabled and no taskset, cases that shows a tiny advantage of MpmcArrayQueue vs xadd q, obtain a very different result, like:

---------
6P-6C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  145090.336 ±  4518.584  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  554241.709 ± 12009.224  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  199863.872 ± 11239.150  ns/op

Xadd q perform similarly as before, but MpmcArrayQueue is performing a lot worse

@franz1981

franz1981 commented Nov 17, 2019

Copy link
Copy Markdown
Collaborator Author

I've added a new commit, trying to save using xadd consumer-side, by using separate counters per-consumer. It would hit false-sharing between counters: this could be saved by using 2*CACHE_LINE pads between each counter to avoid consumers invalidating each-others cache-lines while handling producers tasks.
I'm reporting here the same results of #273 (comment) to see if there is any improvement on the results.

---------
1P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score     Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  11583.685 ±  58.408  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  17266.134 ± 320.246  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40   6979.031 ± 125.073  ns/op
---------
2P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score      Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  23675.018 ±  499.502  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  46451.762 ± 3138.152  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  26189.720 ± 1318.632  ns/op
---------
3P-3C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt      Score      Error  Units
QueueBurstCost.burstCost          100                3       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  43946.182 ± 2227.768  ns/op
QueueBurstCost.burstCost          100                3       132000        ConcurrentLinkedQueue      true  avgt   40  87120.460 ± 3194.932  ns/op
QueueBurstCost.burstCost          100                3       132000               MpmcArrayQueue      true  avgt   40  45105.417 ±  929.084  ns/op
---------
4P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   70045.186 ± 2241.860  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  137313.832 ± 1815.112  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40   68284.462 ± 1983.369  ns/op
---------
5P-5C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                5       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  100252.444 ± 3413.939  ns/op
QueueBurstCost.burstCost          100                5       132000        ConcurrentLinkedQueue      true  avgt   40  191076.869 ± 2130.139  ns/op
QueueBurstCost.burstCost          100                5       132000               MpmcArrayQueue      true  avgt   40   95408.004 ± 4680.241  ns/op
---------
6P-6C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score      Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  134966.416 ± 1555.916  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  249424.726 ± 2469.640  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  127035.803 ± 5283.807  ns/op

With asymmetric configurations:

---------
11P-1C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                1       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   67192.319 ±  4189.873  ns/op
QueueBurstCost.burstCost          100                1       132000        ConcurrentLinkedQueue      true  avgt   40  285255.400 ± 19155.638  ns/op
QueueBurstCost.burstCost          100                1       132000               MpmcArrayQueue      true  avgt   40  315123.221 ±  4491.199  ns/op
---------
10P-2C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                2       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  117949.236 ±  4571.083  ns/op
QueueBurstCost.burstCost          100                2       132000        ConcurrentLinkedQueue      true  avgt   40  267017.522 ±  8036.012  ns/op
QueueBurstCost.burstCost          100                2       132000               MpmcArrayQueue      true  avgt   40  284326.729 ± 10299.422  ns/op
---------
8P-4C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                4       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  137988.606 ±  4540.951  ns/op
QueueBurstCost.burstCost          100                4       132000        ConcurrentLinkedQueue      true  avgt   40  283946.006 ±  3914.199  ns/op
QueueBurstCost.burstCost          100                4       132000               MpmcArrayQueue      true  avgt   40  222860.950 ± 12340.679  ns/op

The "wild" machine cases too:

---------
22P-22C:
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt        Score        Error  Units
QueueBurstCost.burstCost          100               22       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40   489316.718 ±  41842.624  ns/op
QueueBurstCost.burstCost          100               22       132000        ConcurrentLinkedQueue      true  avgt   40  1901868.898 ±  53407.107  ns/op
QueueBurstCost.burstCost          100               22       132000               MpmcArrayQueue      true  avgt   40   971550.149 ± 150254.715  ns/op
---------
8P-8C
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                8       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  190692.631 ±  7632.108  ns/op
QueueBurstCost.burstCost          100                8       132000        ConcurrentLinkedQueue      true  avgt   40  805924.222 ± 16923.904  ns/op
QueueBurstCost.burstCost          100                8       132000               MpmcArrayQueue      true  avgt   40  279469.248 ± 12048.981  ns/op
---------
6P-6C
---------
Benchmark                 (burstSize)  (consumerCount)  (qCapacity)                      (qType)  (warmup)  Mode  Cnt       Score       Error  Units
QueueBurstCost.burstCost          100                6       132000  MpmcUnboundedXaddArrayQueue      true  avgt   40  126767.559 ±  5528.759  ns/op
QueueBurstCost.burstCost          100                6       132000        ConcurrentLinkedQueue      true  avgt   40  466296.406 ± 51349.549  ns/op
QueueBurstCost.burstCost          100                6       132000               MpmcArrayQueue      true  avgt   40  200847.577 ± 15646.834  ns/op

Some conclusions from the asymmetric tests: I don't think MpmcArrayQueue has something to envy to CLQ although is getting worst results then it, but that given the consumer count is lowered, the chance that consumers are able to getting empty q is higher, invalidating the producer sequence in order to be sure of emptyness; probably that's the reason why it seems slower then CLQ.

Comment thread jctools-benchmarks/src/main/java/org/jctools/jmh/latency/QueueBurstCost.java Outdated
public void init(QueueBurstCost state)
{
// do nothing
handledCount = new long[state.consumerCount];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be far less complex to put the counter on Consumer, which already has padding. The waitFor then looks at the counters on the consumers. You can reduce the reader induced coherence overhead by spin waiting on the final event (as before) and then transition to consumer scanning. Further you can batch update the handled counter on the Consumer only when the queue is observed empty or relaxedPoll returns null.

@franz1981 franz1981 Nov 18, 2019

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be far less complex to put the counter on Consumer, which already has padding

That's a nice idea, but the counter is not just per-consumer, need to be per-producer (thread) as well.

You can reduce the reader induced coherence overhead by spin waiting on the final event (as before) and then transition to consumer scanning

This would work if the last offered event will be the last to be handled, but with multi-consumer is not ensured anymore: we cannot be sure neither that every consumer will handl at least 1 event, because is a kind of fairness that we don't force neither we measure.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"That's a nice idea, but the counter is not just per-consumer, need to be per-producer (thread) as well."
Fair point, in which case, yes you have to pad the array based counter.

"This would work if the last offered event will be the last to be handled, but with multi-consumer is not ensured anymore"
You can't be sure, but it's fair to assume that there's no need to check the counters before that one in handled.

@franz1981 franz1981 Nov 18, 2019

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's fair to assume that there's no need to check the counters before that one in handled

I don't understand this: maybe with "before" I mean something different.
My understanding of "before" is the index/id of consumer on Event::handledCount: if the last (submitted) event is being handled, I can trust that the previous events would be handled with offering order just in the single consumer case, saving to poll for theirs completion.
AFAIK the current algorithm won't do that anyway (because Event::handledCount::length == consumerCount == 1, you ends up always polling the same counter/same consumer).

If with "before" you mean the previous handled events by that same consumer: if the last event will make the consumer to write its id while handling it (ie the index on Event::handledCount), then we can save to poll that same consumer and use the Event::handledCount[<id of last event>] count with no need to poll it, but what about the other consumers? They still be needed to be polled until reaching the expected bustSize count.

Help me to understand what do you mean, sorry for being slow in this :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"I can trust that the previous events would be handled with offering order just in the single consumer case, saving to poll for theirs completion."
I'm not saying you should trust this as the only indicator, just that it's a good heuristic to reduce the overall noise.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WRT benchmark results, I would appreciate it if you made the effort to format those into something that is more easily comparable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WRT the consume method inlining, I can't remember. There's good reasons to avoid inlining in infinite loops to avoid OSR compilations, but I can't remember what was the reasons/testing here.

@franz1981 franz1981 Nov 18, 2019

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configuration is the same as https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md#lies-damned-lies-and-benchmarks and test has been run using:

$ for i in 1 2 3 4 5 6; do java -jar microbenchmarks.jar org.jctools.jmh.latency.QueueBurstCost.* -f 10 -t $i -p consumerCount=$i -p burstSize=100 -p qType=MpmcUnboundedXaddArrayQueue,ConcurrentLinkedQueue,MpmcArrayQueue -rf json -rff $i.json >> burst.$i.txt; done

Symmetric tests
1P-1C:
image
2P-2C:
image
3P-3C:
image
4P-4C:
image
5P-5C:
image
6P-6C:
image

@franz1981 franz1981 Nov 18, 2019

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asymmetric tests
8P-4C:
image
10P-2C:
image
11P-1C:
image

@franz1981 franz1981 Nov 18, 2019

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nitsanw Let me know any insights about them and how you think is performing xadd q in your opinion @ #269
I see many things that are just weird to me

@franz1981

franz1981 commented Nov 18, 2019

Copy link
Copy Markdown
Collaborator Author

Configuration is the same as https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md#lies-damned-lies-and-benchmarks and test has been run using:

$ for i in 1 2 3 4 5 6; do java -jar microbenchmarks.jar org.jctools.jmh.latency.QueueBurstCost.* -f 10 -t $i -p consumerCount=$i -p burstSize=100 -p qType=MpmcUnboundedXaddArrayQueue,ConcurrentLinkedQueue,MpmcArrayQueue -rf json -rff $i.json >> burst.$i.txt; done

Symmetric tests
1P-1C:
image
2P-2C:
image
3P-3C:
image
4P-4C:
image
5P-5C:
image
6P-6C:
image

Asymmetric tests
8P-4C:
image
10P-2C:
image
11P-1C:
image

@franz1981

franz1981 commented Nov 19, 2019

Copy link
Copy Markdown
Collaborator Author

@nitsanw please, do not hesitate to ask if you want me to run any specific test on the same machine :)

@nitsanw

nitsanw commented Nov 20, 2019

Copy link
Copy Markdown
Contributor

@franz1981 These are interesting. How do they compare to the prev.(before the fix) results?

@franz1981

franz1981 commented Nov 20, 2019

Copy link
Copy Markdown
Collaborator Author

@nitsanw
I'm re-running the same benchmarks of #273 (comment) on master, but with #269 applied.
If the results will prove to be sound (for this PR) I will perform the same vs master without #269 applied to verify that #269 seems a good change (throughput tests already confirmed it).

Symmetric tests
1P-1C:
image
2P-2C:
image
3P-3C:
image
4P-4C:
image
5P-5C:
image
6P-6C:
image
Asymmetric tests
8P-4C:
image
10P-2C:
image
11P-1C:
image

Let me know if these are the results you were expecting 👍

@nitsanw

nitsanw commented Nov 21, 2019

Copy link
Copy Markdown
Contributor

#269 is a different story, I'm interested in the benchmark results before/after the change to the benchmarks. In particular we would expect all the single consumer scenarios to produce the same results.

@nitsanw

nitsanw commented Nov 21, 2019

Copy link
Copy Markdown
Contributor

To be clear, I'm not at all interested here in the comparison between different queue implementations so much as in the effect of the benchmark change. So the visualisation, while nice, still requires me to scroll up and down and look for the b/a picture.

@franz1981

franz1981 commented Nov 21, 2019

Copy link
Copy Markdown
Collaborator Author

In particular we would expect all the single consumer scenarios to produce the same results.

That's what the numbers says, assuming some error due to my use of taskset: I'm not using taskset to let the benchs run on just 2 different cores, but always on 12 different cores belonging the same numa node, so both benchs can see some error due to context switches/core migrations.
I can re-run both benchs in the 1P-1C case while pinning it just on 2 cores, if you think it worths.

So the visualisation, while nice, still requires me to scroll up and down and look for the b/a picture.

Given that NP-NC with N > 1 are totally not-comparable (the original code with multi-consumer is just broken IIRC) I can arrange the previous results of 1P-1C to make it more easily comparable.
<joking-mode>
Please, hold your breath and save those expensive scrolling movements to compare charts: we're not athletes!! :D
</joking-mode>
NOTE: https://jmh.morethan.io/ rocks :D

That's a better view of the 1P-1C case:
image

Maybe would worth to do it for NP-1C to see if there are any regressions

@franz1981

franz1981 commented Nov 21, 2019

Copy link
Copy Markdown
Collaborator Author

I'm not at all interested here in the comparison between different queue implementations

I can open a separate issue or write on the PR re the mpmc xadd fix if you prefer, but I need some magic help from you there, bud :)

@franz1981

franz1981 commented Nov 23, 2019

Copy link
Copy Markdown
Collaborator Author

In addition to the results of 1P-1C on #273 (comment), I'm adding these ones too:
2P-1C:
image

4P-1C:
image

8P-1C:
image

11P-1C:
image

TBH I was expecting the same trend of the 1P-1C results ie the new PR to slow down a bit the task execution, but rather it seems the opposite, although I see that the errors are more pronounced.
Maybe the original benchmark was having some issue with bimorphic calls? I cannot think to other reasons that would make the new benchmark lighter then the one on master.

@nitsanw

nitsanw commented Nov 25, 2019

Copy link
Copy Markdown
Contributor

"TBH I was expecting the same trend of the 1P-1C results ie the new PR to slow down a bit the task execution, but rather it seems the opposite" - A slower consumer means less contention. This is a case where consumers are much faster than producers (no contention) so the queue has no slack.

@franz1981

Copy link
Copy Markdown
Collaborator Author

A slower consumer means less contention

That makes sense to me, but how to make it more evident on the benchmark results?
With JMH queue throughput benchs we've the failed polls stats, while here we have to profile/perfasm to understand what's going on. Getting the failed poll stats from the beginning till the end of a burst could be helpful although I've no (yet) any idea how to collect these info here.

Returning on topic: this PR's handle seems more heavyweight then master; as you've suggested it can be made lighter for the single-consumer case, although at the cost a bimorphic call on consumer. But I see 2 benefits of this PR as it is:

  1. single/multi-consumer tests behave the same and can be compared (ie handle cost is the same for both)
  2. the error bars on 1P-1C shows that the difference from master is not big

I'm going to address #273 (comment) and wil wait your opinion on how to proceed for this.

@nitsanw

nitsanw commented Nov 25, 2019

Copy link
Copy Markdown
Contributor

I think with the start latch it's good to merge. Sorry for the long review process :-)

@franz1981

franz1981 commented Nov 25, 2019

Copy link
Copy Markdown
Collaborator Author

@nitsanw No worries bud; that's the base where we collect numbers to understand what's going on and is sufficiently delicate that need some care while modifying it :)
Give me few minutes that I write the latch change and will push the commit: do you want me to squash all the commits into 1?

@franz1981

franz1981 commented Nov 25, 2019

Copy link
Copy Markdown
Collaborator Author

I've already pushed the commit with the started latch change: the only downside I see is that one of the consumer(s) will pay the cost of awaking it, unblocking startConsumers, but it shouldn't be relevant, given that the bench iteration won't start until startConsumers won't end.

@nitsanw nitsanw merged commit 9c354ce into JCTools:master Nov 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants