Skip to content

Regularly wake from timely to deal with eaten wake-ups#13972

Closed
frankmcsherry wants to merge 1 commit into
MaterializeInc:mainfrom
frankmcsherry:regularly_wake_from_timely
Closed

Regularly wake from timely to deal with eaten wake-ups#13972
frankmcsherry wants to merge 1 commit into
MaterializeInc:mainfrom
frankmcsherry:regularly_wake_from_timely

Conversation

@frankmcsherry

Copy link
Copy Markdown
Contributor

This addresses a race that appears in storaged where the LocalClient will wake worker threads that have inbound messages to read, but the wake-ups are (or appear to be) eaten by various non-Timely sleepy moments in the code. If these wake-ups are quietly eaten, and then Timely is entered with a None timeout and nothing to do, it will sleep forever.

It would be nice to fix this better, but just introducing a time-out seems to be the most robust way to fix this.

Motivation

  • This PR fixes a previously unreported bug.

CI had various timeouts that were experienced when storaged clients were "slow-rolled" commands, being provided first with InitializationComplete commands and then source creation commands. The creation commands could be missed, when presented in this order, as was introduced in 9b2eb6d. This has been reverted, but if it is reintroduced with this change the timeout does not appear to manifest.

Tips for reviewer

Checklist

@frankmcsherry frankmcsherry requested a review from benesch July 30, 2022 01:24
benesch added a commit to benesch/materialize that referenced this pull request Jul 30, 2022
This is an alternative to MaterializeInc#13972 that avoids hardcoding the 1s timeout.
As described there:

    This addresses a race that appears in storaged where the LocalClient
    will wake worker threads that have inbound messages to read, but the
    wake-ups are (or appear to be) eaten by various non-Timely sleepy
    moments in the code. If these wake-ups are quietly eaten, and then
    Timely is entered with a None timeout and nothing to do, it will
    sleep forever.

Would close MaterializeInc#13972 if merged.
benesch added a commit to benesch/materialize that referenced this pull request Jul 30, 2022
This is an alternative to MaterializeInc#13972 that avoids hardcoding the 1s timeout.
As described there:

    This addresses a race that appears in storaged where the LocalClient
    will wake worker threads that have inbound messages to read, but the
    wake-ups are (or appear to be) eaten by various non-Timely sleepy
    moments in the code. If these wake-ups are quietly eaten, and then
    Timely is entered with a None timeout and nothing to do, it will
    sleep forever.

Would close MaterializeInc#13972 if merged.

Many, many thanks to @frankmcsherry for determining that lost wakeups
were the cause of this issue.

Co-authored-by: Frank McSherry <fmcsherry@me.com>
benesch added a commit that referenced this pull request Jul 31, 2022
This is an alternative to #13972 that avoids hardcoding the 1s timeout.
As described there:

    This addresses a race that appears in storaged where the LocalClient
    will wake worker threads that have inbound messages to read, but the
    wake-ups are (or appear to be) eaten by various non-Timely sleepy
    moments in the code. If these wake-ups are quietly eaten, and then
    Timely is entered with a None timeout and nothing to do, it will
    sleep forever.

Would close #13972 if merged.

Many, many thanks to @frankmcsherry for determining that lost wakeups
were the cause of this issue.

Co-authored-by: Frank McSherry <fmcsherry@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant