Skip to content

xds: safely handle watch removals during update, nested pause/resume.#12069

Merged
htuch merged 12 commits intoenvoyproxy:masterfrom
htuch:sds-tls-crash
Jul 24, 2020
Merged

xds: safely handle watch removals during update, nested pause/resume.#12069
htuch merged 12 commits intoenvoyproxy:masterfrom
htuch:sds-tls-crash

Conversation

@htuch
Copy link
Copy Markdown
Member

@htuch htuch commented Jul 14, 2020

To fix #11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix #11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.

Fixes #11877 #11674

Signed-off-by: Harvey Tuch htuch@google.com
Co-authored-by: Sebastian Schepens sebastian.schepens@mercadolibre.com

To fix envoyproxy#11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix envoyproxy#11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
  removal behaviors.

Fixes envoyproxy#11877 envoyproxy#11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 14, 2020

@sschepens can you verify this fixes your issue? @wgallagher can you take a first pass at this? Thanks.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 14, 2020

@stevenzzzz for pause/resume ref counting FYI

@sschepens
Copy link
Copy Markdown
Contributor

@htuch this seems to fix our issue, tried several times to reproduce it without success. Thanks!

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 14, 2020

@samflattery FY I wonder if you can replicate this corner case in the xDS fuzzer, seems like it should have been possible to induce there.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 16, 2020

@wgallagher friendly ping. Also tagging @lizan as senior maintainer reviewer.

@htuch htuch requested a review from lizan July 16, 2020 13:32
wgallagher
wgallagher previously approved these changes Jul 16, 2020
wgallagher
wgallagher previously approved these changes Jul 16, 2020
Copy link
Copy Markdown

@wgallagher wgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 16, 2020

Ended up having to move around a bit of logic to deal with the remote closes and reconnects.

@lizan can you take a pass? Thanks.

Copy link
Copy Markdown
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit

}

// Track any removals triggered by earlier watch updates.
deferred_removed_during_update_ = std::make_unique<absl::flat_hash_set<Watch*>>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the assumption here deferred_removed_during_update_ == nullptr, lets ASSERT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nup, the goal is to reset the queue on reconnect.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bit confused, what happen to the existing set in there? Aren't we assume they are already removed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I thought you were referring to another place we assign to a new unique_ptr. In this case, you are right, it should be null. Added the ASSERTs. Thanks.

lizan
lizan previously approved these changes Jul 20, 2020
Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 22, 2020

@lizan not sure what is going on with macOS CI here, but I've fixed a few ASAN/TSAN/coverage snafus and this is now ready for approval again. Thanks.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 22, 2020

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Jul 23, 2020

@lizan ready for approval, CI is passing.

@htuch htuch merged commit 06acbab into envoyproxy:master Jul 24, 2020
@htuch htuch deleted the sds-tls-crash branch July 24, 2020 04:10
KBaichoo pushed a commit to KBaichoo/envoy that referenced this pull request Jul 30, 2020
…envoyproxy#12069)

To fix envoyproxy#11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix envoyproxy#11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.

Fixes envoyproxy#11877 envoyproxy#11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
…envoyproxy#12069)

To fix envoyproxy#11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix envoyproxy#11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.

Fixes envoyproxy#11877 envoyproxy#11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: chaoqinli <chaoqinli@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants