Skip to content

MpscBlockingConsumerArrayQueue poll with timeout fix#316

Merged
nitsanw merged 12 commits into
JCTools:masterfrom
diffusiondata:fix-race
Jul 30, 2020
Merged

MpscBlockingConsumerArrayQueue poll with timeout fix#316
nitsanw merged 12 commits into
JCTools:masterfrom
diffusiondata:fix-race

Conversation

@philipa

@philipa philipa commented Jul 20, 2020

Copy link
Copy Markdown
Contributor

Reorder offerAndWakeup() to fix race. Previously the following sequence was possible:

Consumer                            Provider

poll(timeout)
                                  offerAndWakeup()
time out
                                  soRefElement(, offset,)
consume from offset
offset -> offset'
poll(timeout)
                                  soBlocked(null)
                                  unpark(consumerThread)
// Spin for ever
spinWaitForElement(, offset')

philipa added 2 commits July 20, 2020 08:35
  Reorder offerAndWakeup() to fix race. Previously the following
  sequence was possible:

  Consumer                            Provider

  poll(timeout)
                                      offerAndWakeup()
  time out
                                      soRefElement(, offset,)
  consume from offset
  offset -> offset'
  poll()
                                      soBlocked(null)
                                      unpark(consumerThread)
  // Spin for ever
  spinWaitForElement(, offset')
@philipa philipa marked this pull request as ready for review July 20, 2020 07:43
@coveralls

coveralls commented Jul 20, 2020

Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 699

  • 41 of 43 (95.35%) changed or added relevant lines in 1 file are covered.
  • 15 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.02%) to 86.194%

Changes Missing Coverage Covered Lines Changed/Added Lines %
jctools-core/src/main/java/org/jctools/queues/MpscBlockingConsumerArrayQueue.java 41 43 95.35%
Files with Coverage Reduction New Missed Lines %
jctools-core/src/main/java/org/jctools/queues/IndexedQueueSizeUtil.java 1 82.35%
jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMapLong.java 2 81.91%
jctools-core/src/main/java/org/jctools/maps/NonBlockingHashMap.java 5 79.48%
jctools-core/src/main/java/org/jctools/maps/NonBlockingIdentityHashMap.java 7 75.06%
Totals Coverage Status
Change from base Build 681: -0.02%
Covered Lines: 4776
Relevant Lines: 5541

💛 - Coveralls

@franz1981

Copy link
Copy Markdown
Collaborator

I don't understand why the last poll should spin wait for element TBH, given that there is no new element on offset' beside the one already consumed by take(timeout), need to look better into this

@philipa

philipa commented Jul 20, 2020

Copy link
Copy Markdown
Contributor Author

Look at spinWaitForElement. It will wait for ever for an update to offset, but none is forthcoming. Further, the "busy bit" of producerIndex is left set but blocked is null, so the next offer will spin too.

Or try reverting the fix and running the test. It failed consistently for me, even with additional logging so I could figure out what was happening.

@franz1981

Copy link
Copy Markdown
Collaborator

What it looks strange to me is the check on https://github.com/pushtechnology/JCTools/blob/865a75f12cfa3d63029f16979dc8ee2af7e4a5b1/jctools-core/src/main/java/org/jctools/queues/MpscBlockingConsumerArrayQueue.java#L577 : I will add some logs to understand why it isn't returning null instead

@philipa

philipa commented Jul 20, 2020

Copy link
Copy Markdown
Contributor Author

Oh, right. I was using poll(timeout) throughout, so this is irrelevant. I'll updated the comment above accordingly.

@franz1981

Copy link
Copy Markdown
Collaborator

Ah ok so the issue happens with 2 subsequent poll(timeout) and the expectation would be that:

  1. first one should time out but read the offered element
  2. second one should time out and return null

Or

  1. first one should time out and return null
  2. second one should return an element regardless timing out (if the offer land in time)

@philipa

philipa commented Jul 20, 2020

Copy link
Copy Markdown
Contributor Author

Yes, at least two poll(timeout)s and a concurrent offerAndWait().

It reproduced for me fairly readily once I had the test balanced to ensure a mostly empty queue. I'd be interested to know if the test fails for you (with the fix reverted).

@nitsanw

nitsanw commented Jul 21, 2020

Copy link
Copy Markdown
Contributor

Hi @philipa ! long time no see :-)

Thanks for the find + PR, this is an interesting one. The thinking behind the original ordering is that we want to avoid the spin when we wake the consumer from an unbounded take. There have been prior concerns around spinning with no yield in single CPU environments or where the scheduler unpark results in the unparked thread executing on the same core. Upon reflection I'm not sure if re-ordering is the "right" solution (I think it's correct), or if we need heavier weight guards on the state transitions here via e.g. a CAS on the blocked field. WDYT?

@nitsanw nitsanw added the bug label Jul 21, 2020
@philipa

philipa commented Jul 21, 2020

Copy link
Copy Markdown
Contributor Author

Not sure CAS on blocked works alone, since the producer would need to distinguish between the consumer waiting for a value at the original offset vs it having woken up early, sneakily read the element, and now waiting for an element at the next offset. cIndex is also pertinent here.

@njhill

njhill commented Jul 21, 2020

Copy link
Copy Markdown
Contributor

Here is an alternative fix to consumer-side which should not affect the existing happy paths at all: 3f82f1c. The existing cas in the finally block already detects this race and so it's just a matter of handling it appropriately (I removed the try/finally and now have separate checks for interrupt and timeout cases). wdyt?

@philipa

philipa commented Jul 22, 2020

Copy link
Copy Markdown
Contributor Author

Looks good. I prefer the revised code flow. It would make sense to restructure take similarly to remove the finally block.

I did like e8e0839 though.

@franz1981

Copy link
Copy Markdown
Collaborator

+100 for the changes of @njhill

@njhill

njhill commented Jul 22, 2020

Copy link
Copy Markdown
Contributor

Thanks @philipa @franz1981, agree about the restructure of take. I should note that I was the one that introduced this bug in the first place, as well as the the try/finally flow which @nitsanw wasn't in favour of at the time... so more reason to change it I guess!

Comment thread jctools-core/src/main/java/org/jctools/queues/MpscBlockingConsumerArrayQueue.java Outdated
@nitsanw

nitsanw commented Jul 29, 2020

Copy link
Copy Markdown
Contributor

@philipa thanks for the good work! I'll do a final pass today

@nitsanw

nitsanw commented Jul 30, 2020

Copy link
Copy Markdown
Contributor

@philipa @njhill See minor changes: diffusiondata@c3ea433

If you can squash it all to one nice commit I think we're done :-)

@philipa

philipa commented Jul 30, 2020

Copy link
Copy Markdown
Contributor Author

Refactoring looks good.

Excuse my GH ignorance, but don't you have a "squash and merge" button for PRs?

@nitsanw nitsanw merged commit 8196da3 into JCTools:master Jul 30, 2020
@nitsanw

nitsanw commented Jul 30, 2020

Copy link
Copy Markdown
Contributor

Great work team :-)

@philipa philipa deleted the fix-race branch July 30, 2020 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants