xds: safely handle watch removals during update, nested pause/resume.#12069
xds: safely handle watch removals during update, nested pause/resume.#12069htuch merged 12 commits intoenvoyproxy:masterfrom
Conversation
To fix envoyproxy#11877, we need to handle safely the case where two watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to remove the other watch during its invoked onConfigUpdate(). While working on this, it made sense to fix envoyproxy#11674, avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test. Risk level: Medium (this has xDS wire-level implications). Testing: New unit tests for pause/resume, regression unit and integration tests for watch map removal behaviors. Fixes envoyproxy#11877 envoyproxy#11674 Signed-off-by: Harvey Tuch <htuch@google.com> Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com> Signed-off-by: Harvey Tuch <htuch@google.com>
|
@sschepens can you verify this fixes your issue? @wgallagher can you take a first pass at this? Thanks. |
|
@stevenzzzz for pause/resume ref counting FYI |
|
@htuch this seems to fix our issue, tried several times to reproduce it without success. Thanks! |
|
@samflattery FY I wonder if you can replicate this corner case in the xDS fuzzer, seems like it should have been possible to induce there. |
|
@wgallagher friendly ping. Also tagging @lizan as senior maintainer reviewer. |
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
|
Ended up having to move around a bit of logic to deal with the remote closes and reconnects. @lizan can you take a pass? Thanks. |
| } | ||
|
|
||
| // Track any removals triggered by earlier watch updates. | ||
| deferred_removed_during_update_ = std::make_unique<absl::flat_hash_set<Watch*>>(); |
There was a problem hiding this comment.
Is the assumption here deferred_removed_during_update_ == nullptr, lets ASSERT?
There was a problem hiding this comment.
Nup, the goal is to reset the queue on reconnect.
There was a problem hiding this comment.
I'm bit confused, what happen to the existing set in there? Aren't we assume they are already removed?
There was a problem hiding this comment.
Right, I thought you were referring to another place we assign to a new unique_ptr. In this case, you are right, it should be null. Added the ASSERTs. Thanks.
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
…eads in ads_integration_test. Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
|
@lizan not sure what is going on with macOS CI here, but I've fixed a few ASAN/TSAN/coverage snafus and this is now ready for approval again. Thanks. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@lizan ready for approval, CI is passing. |
…envoyproxy#12069) To fix envoyproxy#11877, we need to handle safely the case where two watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to remove the other watch during its invoked onConfigUpdate(). While working on this, it made sense to fix envoyproxy#11674, avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test. Risk level: Medium (this has xDS wire-level implications). Testing: New unit tests for pause/resume, regression unit and integration tests for watch map removal behaviors. Fixes envoyproxy#11877 envoyproxy#11674 Signed-off-by: Harvey Tuch <htuch@google.com> Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com> Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…envoyproxy#12069) To fix envoyproxy#11877, we need to handle safely the case where two watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to remove the other watch during its invoked onConfigUpdate(). While working on this, it made sense to fix envoyproxy#11674, avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test. Risk level: Medium (this has xDS wire-level implications). Testing: New unit tests for pause/resume, regression unit and integration tests for watch map removal behaviors. Fixes envoyproxy#11877 envoyproxy#11674 Signed-off-by: Harvey Tuch <htuch@google.com> Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com> Signed-off-by: chaoqinli <chaoqinli@google.com>
To fix #11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().
While working on this, it made sense to fix #11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.
Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.
Fixes #11877 #11674
Signed-off-by: Harvey Tuch htuch@google.com
Co-authored-by: Sebastian Schepens sebastian.schepens@mercadolibre.com