drain: reproduce and fix a draining race condition that fails tsan#31452
drain: reproduce and fix a draining race condition that fails tsan#31452jmarantz merged 9 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks a few comments. Not sure if CI issues are real.
/wait
| // Since draining_ is atomic, it is safe to set drain_deadline_ without a mutex | ||
| // as drain_close() only reads from drain_deadline_ if draining_ is true, and | ||
| // C++ will not re-order an assign to an atomic. See | ||
| // https://stackoverflow.com/questions/40320254/reordering-atomic-operations-in-c . |
There was a problem hiding this comment.
This is only true if set with SeqCst (or maybe release, though I'm hazy on all of this), right? I assume this is the default for assignment using an atomic in C++ but maybe you want to make SeqCst explicit to be clear about it?
There was a problem hiding this comment.
I believe this is the default based on https://en.cppreference.com/w/cpp/atomic/atomic/store but maybe it'd be clearer if I set the atomic bool by explicitly specifying that with a store call rather than using operator=.
source/server/drain_manager_impl.cc
Outdated
| // Note; current_time may be less than drain_deadline_ by a microsecond, and | ||
| // when we round to milliseconds that might be 0, which would throw an arithmetic | ||
| // exception if used as a modulus. |
There was a problem hiding this comment.
Sorry how does this happen? I'm confused how the time could ever be less?
There was a problem hiding this comment.
I'm going to add a test case to get to this spot with 1us variation on DrainManagerImplTest, RegisterCallbackAfterDrainBeginGradualStrategy.
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
mattklein123
left a comment
There was a problem hiding this comment.
Ah OK I see the case where it can be zero, got it. Cool LGTM.
Signed-off-by: Joshua Marantz <jmarantz@google.com>
|
Sorry Matt -- I made one more tweak; will need another stamp I just found a way to make a test stanza more readable. I'm also waiting for someone to tell me whether this fixes the problem they noticed internally. |
|
thanks Matt! |
Commit Message: This reproduces and fixes a race condition on
DrainManagerImpl::drain_deadline_which is written from the main thread by DrainManagerImpl::startDrainSequence via admin, and read from worker threads via the DrainManagerImpl::drainClose() API. The race condition is resolved by updating thedraining_atomic bool after updatingdrain_deadline_, as C++ will not re-order statements around atomics. drainClose() does not look at drain_deadline_ unless draining_ is true, and startDrainSequence will not writedrain_deadline_a second time.Additionally, jitter was added based on the
--drain-time-scommand-line parameter. However, that defaults to 0 and was used as a modulus to calculate jitter, which causes an arithmetic exception. Instead we assume no jitter. Another similar bug was fixed where there could be a microsecond left, but we calculate the modulus for the jitter based on rounding that to milliseconds, which could be zero.Additional Description:
Risk Level: low
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes: #31457