Skip to content

fix: testground instability due to race condition on outcomes#1407

Merged
laurentsenta merged 11 commits intomasterfrom
fix/race-condition-on-outcomes
Aug 5, 2022
Merged

fix: testground instability due to race condition on outcomes#1407
laurentsenta merged 11 commits intomasterfrom
fix/race-condition-on-outcomes

Conversation

@laurentsenta
Copy link
Copy Markdown
Contributor

@laurentsenta laurentsenta commented Jul 28, 2022

Fix #1400
Fix #1382

When the docker runner starts a test run it does a few things:

  1. Wait for the instance containers to complete,
  2. Setup a "pretty printer" that forwards instance's stdout, BUT ALSO tries to decode events from stdout and update the outcomes,
  3. Listen to the events sent by test instances and update the outcomes on success.

When the containers exits (1) testground stops everything related to collecting the outcomes (3).
This is fine-ish with the go-sdk because it dumps events to stdout as well, so the outcomes will be read thanks to (2).

Logging events on stdout and trying to decode the JSON looked like a legacy hack: this was not implemented in the rust-sdk. See my notes on the pretty printer. The rust-sdk triggers a race condition:

  1. test instance sends "success event" to the sync service
  2. test instance exits
  3. Testgound shuts down the test (and the outcome collection part)
  4. Testground receives the "success event", but does not listen to it anymore.

I believe we've seen this error on go-sdk, less often, see this comment.

This PR rework the local docker runner so that a test is complete if and only if:

  • All the instances have exited AND all the outcomes have been received
  • All the instances have exited AND a timeout of x seconds have been reached (prevents hanging when an instance crashes)
  • OR the test timeout has been reached.

See runs:

Todo

  • Spike the patch,
  • Verify that the patch doesn't break other components,
  • Go through all other TODOs (dealing with errors, etc),
  • Double check the outcome outputs and the integration tests results,
  • Look into the testing situation.

@laurentsenta laurentsenta force-pushed the fix/race-condition-on-outcomes branch from dc7f133 to 6027201 Compare July 28, 2022 14:55
@laurentsenta laurentsenta force-pushed the fix/race-condition-on-outcomes branch from d19b932 to 0494134 Compare July 28, 2022 15:35
@laurentsenta laurentsenta marked this pull request as draft July 29, 2022 07:42
@mxinden
Copy link
Copy Markdown
Member

mxinden commented Aug 1, 2022

Thank you @laurentsenta for debugging this. Very valuable work.

@laurentsenta laurentsenta force-pushed the fix/race-condition-on-outcomes branch 2 times, most recently from f483906 to f7e8b52 Compare August 2, 2022 12:30
@laurentsenta laurentsenta force-pushed the fix/race-condition-on-outcomes branch from 8e704e9 to 6d665fc Compare August 3, 2022 11:11
@laurentsenta laurentsenta force-pushed the fix/race-condition-on-outcomes branch from 6d665fc to 882a35f Compare August 3, 2022 11:47
@laurentsenta laurentsenta marked this pull request as ready for review August 3, 2022 11:48
@laurentsenta
Copy link
Copy Markdown
Contributor Author

laurentsenta commented Aug 3, 2022

I'm marking this for review, this new version is more stable in my tests,

The patch does not add tests: testing this error specifically and reliably is tough, and I'd rather spend energy on this later to explose the implementation into unit-testable bricks (when we work on new runners maybe).

I did left the code cleaner than I found it (boyscout'ing ;), this PR extracts a few methods and drops any channel-related spaghetti. The reasons behind success or failure should be much more clear now.

Copy link
Copy Markdown
Contributor

@galargh galargh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
@laurentsenta laurentsenta requested a review from brdji August 4, 2022 11:32
@laurentsenta laurentsenta changed the title Fix: testground instability due to race condition on outcomes fix: testground instability due to race condition on outcomes Aug 4, 2022
@laurentsenta
Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky rust-sdk example Testground reports failure for instances when they appear to succeed

4 participants