Skip to content

drain: reproduce and fix a draining race condition that fails tsan#31452

Merged
jmarantz merged 9 commits intoenvoyproxy:mainfrom
jmarantz:repro-drain-race
Dec 20, 2023
Merged

drain: reproduce and fix a draining race condition that fails tsan#31452
jmarantz merged 9 commits intoenvoyproxy:mainfrom
jmarantz:repro-drain-race

Conversation

@jmarantz
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz commented Dec 19, 2023

Commit Message: This reproduces and fixes a race condition on DrainManagerImpl::drain_deadline_ which is written from the main thread by DrainManagerImpl::startDrainSequence via admin, and read from worker threads via the DrainManagerImpl::drainClose() API. The race condition is resolved by updating the draining_ atomic bool after updating drain_deadline_, as C++ will not re-order statements around atomics. drainClose() does not look at drain_deadline_ unless draining_ is true, and startDrainSequence will not write drain_deadline_ a second time.

Additionally, jitter was added based on the --drain-time-s command-line parameter. However, that defaults to 0 and was used as a modulus to calculate jitter, which causes an arithmetic exception. Instead we assume no jitter. Another similar bug was fixed where there could be a microsecond left, but we calculate the modulus for the jitter based on rounding that to milliseconds, which could be zero.

Additional Description:
Risk Level: low
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes: #31457

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@repokitteh-read-only
Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #31452 was opened by jmarantz.

see: more, trace.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz changed the title drain: reproduce a draining race condition that fails tsan drain: reproduce and fix a draining race condition that fails tsan Dec 20, 2023
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review December 20, 2023 13:59
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a few comments. Not sure if CI issues are real.

/wait

Comment on lines +153 to +156
// Since draining_ is atomic, it is safe to set drain_deadline_ without a mutex
// as drain_close() only reads from drain_deadline_ if draining_ is true, and
// C++ will not re-order an assign to an atomic. See
// https://stackoverflow.com/questions/40320254/reordering-atomic-operations-in-c .
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only true if set with SeqCst (or maybe release, though I'm hazy on all of this), right? I assume this is the default for assignment using an atomic in C++ but maybe you want to make SeqCst explicit to be clear about it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the default based on https://en.cppreference.com/w/cpp/atomic/atomic/store but maybe it'd be clearer if I set the atomic bool by explicitly specifying that with a store call rather than using operator=.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +103 to +105
// Note; current_time may be less than drain_deadline_ by a microsecond, and
// when we round to milliseconds that might be 0, which would throw an arithmetic
// exception if used as a modulus.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry how does this happen? I'm confused how the time could ever be less?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to add a test case to get to this spot with 1us variation on DrainManagerImplTest, RegisterCallbackAfterDrainBeginGradualStrategy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
mattklein123
mattklein123 previously approved these changes Dec 20, 2023
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK I see the case where it can be zero, got it. Cool LGTM.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Copy Markdown
Contributor Author

Sorry Matt -- I made one more tweak; will need another stamp I just found a way to make a test stanza more readable. I'm also waiting for someone to tell me whether this fixes the problem they noticed internally.

@jmarantz
Copy link
Copy Markdown
Contributor Author

thanks Matt!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

server: drain manager has a main-thread vs worker race

2 participants