
Fixes #12235: handle dangling reference during VHDS update request#12395

Merged
mattklein123 merged 16 commits into envoyproxy:master from dmitri-d:fix-dangling-pointer-in-vhds
Aug 12, 2020

Conversation

@dmitri-d
Contributor

Fixes #12235

Signed-off-by: Dmitri Dolguikh ddolguik@redhat.com

Risk Level: low
Testing: existing tests
Docs Changes: none
Release Notes: none

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

ping @stevenzzzz

Contributor

@stevenzzzz left a comment


Thanks, looks good. Can we add a test to prevent regressions?

auto cb = config_cbs.lock();
if (s && cb) {
s->updateOnDemand(alias);
cb->push_back({alias, thread_local_dispatcher, route_config_updated_cb});
Contributor

If, by the time we are on the main thread, the management server has already sent back a response with the alias, do you want to send another on-demand request to the management server?

This depends on the management server implementation: if it only sends a response when there is some update of the project, then this HttpRequest will hang until timeout, even though Envoy already has the config to proceed.

Contributor Author

This gets pretty complicated quickly: the only time (that I can see) an instance of RdsRouteConfigProviderImpl can be deallocated is when LDS is used and the listener configuration has changed. At that point all connections are going to be drained, so trying to update a vhost seems not worth it to me.

Even if we wanted to do that, we'd still need an active subscription (it may not be there yet, and there's no way to find out which one to use) and a place to store the callback to notify the on_demand_update filter (which may not be there any more, as the active stream it belongs to may have been closed).

Contributor

I am describing a scenario where no RDS subscription destruction is involved.

Could the management server send the required vhost to Envoy before the on-demand request is sent? In that case, the management server may or may not send the requested resource again depending on its implementation; if it doesn't, the callback will not be called and the request will only time out. This is very implementation-specific, so I am not too concerned.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • added a test

Dmitri Dolguikh added 2 commits July 31, 2020 15:12
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • merged in changes from master

Dmitri Dolguikh added 2 commits August 4, 2020 12:18
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

ping @stevenzzzz.

@stevenzzzz
Contributor

Sorry, I thought I had submitted the comment.

There is one thing to be fixed around which weak_ptr to look at: please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_.

@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

> there is one thing to be fixed around which weak_ptr to look at, please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_

In this context, a weak_ptr to either the subscription or the list of callbacks would work; if anything the subscription is longer-lived and has a benefit of being already stored in a shared_ptr.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

  • switched to use config_update_callbacks to determine if the instance of RdsRouteConfigProviderImpl is still alive.

[subscription = std::weak_ptr<RdsRouteConfigSubscription>(subscription_),
config_cbs = std::weak_ptr<std::list<UpdateOnDemandCallback>>(config_update_callbacks_),
alias, &thread_local_dispatcher, route_config_updated_cb]() -> void {
auto sub = subscription.lock();
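The capture pattern in this snippet can be sketched in a self-contained form. The following is an illustrative reconstruction with simplified stand-in types (Subscription, makeTask — not the real Envoy classes): the posted task captures weak_ptrs to both the subscription and the callback list, and proceeds only if both can still be locked when it finally runs on the main thread.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Envoy types in the diff above.
struct Subscription {
  std::vector<std::string> requested;
  void updateOnDemand(const std::string& alias) { requested.push_back(alias); }
};

using UpdateCallback = std::function<void()>;

// Builds the deferred task the way the patch does: capture weak_ptrs and
// lock them only at execution time.
std::function<void()> makeTask(std::weak_ptr<Subscription> subscription,
                               std::weak_ptr<std::list<UpdateCallback>> config_cbs,
                               std::string alias, UpdateCallback route_config_updated_cb) {
  return [subscription, config_cbs, alias = std::move(alias),
          route_config_updated_cb = std::move(route_config_updated_cb)]() {
    auto s = subscription.lock();
    auto cb = config_cbs.lock();
    if (s && cb) { // both owners still alive: safe to touch them
      s->updateOnDemand(alias);
      cb->push_back(route_config_updated_cb);
    }
    // Otherwise the provider went away before the dispatcher ran us: no-op.
  };
}
```

A failed lock() turns the task into a harmless no-op instead of a dangling-pointer dereference.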
Contributor

you can move this into the if statement below.

Contributor

could you also move this sub lock to the if statement below?

@stevenzzzz
Contributor

> there is one thing to be fixed around which weak_ptr to look at, please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_

> In this context, a weak_ptr to either the subscription or the list of callbacks would work; if anything the subscription is longer-lived and has a benefit of being already stored in a shared_ptr.

If config_callbacks are gone, it's undefined behavior again when you push a callback into the list.

Thanks for the fix.
LGTM

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

  • responded to feedback

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

  • responded to feedback

@dmitri-d
Contributor Author

dmitri-d commented Aug 6, 2020

Looks like the ARM build failure is due to a flaky test.

@dmitri-d
Contributor Author

dmitri-d commented Aug 6, 2020

ping @stevenzzzz

@stevenzzzz
Contributor

I LGTMed already.
Let's ping @snowp
😄

Contributor

@snowp left a comment

Thanks for working on this, just one comment.

config_cbs = std::weak_ptr<std::list<UpdateOnDemandCallback>>(config_update_callbacks_),
alias, &thread_local_dispatcher, route_config_updated_cb]() -> void {
if (auto callbacks = config_cbs.lock()) {
auto sub = subscription.lock();
Contributor

Why capture a weak_ptr if you don't ever validate that it's still alive? Maybe just capture "this" as a weak_ptr?

Dmitri Dolguikh added 2 commits August 7, 2020 10:44
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Member

@mattklein123 left a comment

Thanks for fixing. One small question.

/wait-any

Comment on lines +309 to +310
// The RdsRouteConfigProviderImpl instance can go away before the dispatcher has a chance to
// execute the callback. We use a weak_ptr here to verify that the instance is still around.
Member

Why don't we allocate the provider itself with shared_from_this and then we can capture a weak reference to ourselves and post that? This seems a bit strange and fragile?
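The suggested alternative can be sketched as follows. This is a hypothetical simplification (the Provider class and its members are stand-ins, not the actual Envoy code): enabling shared_from_this lets the posted task hold a weak reference to the whole object rather than to individual members.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical, simplified provider: tasks capture a weak reference to
// the whole object and only dereference it after a successful lock().
class Provider : public std::enable_shared_from_this<Provider> {
public:
  int on_demand_updates = 0;

  std::function<void()> makeTask() {
    return [weak_self = std::weak_ptr<Provider>(shared_from_this())]() {
      if (auto self = weak_self.lock()) {
        self->on_demand_updates++; // the provider is still alive here
      }
      // Otherwise the provider was destroyed; nothing to do.
    };
  }
};
```

Note that reviewers later push back on this approach in the thread, because enabling shared_from_this opens up shared ownership of the provider to worker threads.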

Contributor

I'd hesitate to do so: with that, we could get into another rabbit hole where a worker thread accidentally gets a shared_ptr to the provider, introducing another set of lifecycle problems, since the provider can only be constructed/destructed on the main thread.
Thoughts?

Member

OK fair enough. Can we add a comment on why we are not doing that so the next person that comes along doesn't ask the same question? Thank you.

/wait

Contributor Author

Ahh, yes, this is better.

Dmitri Dolguikh added 2 commits August 10, 2020 17:12
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • switched to using a shared_from_this instance of a RdsRouteConfigProviderImpl
  • merged in latest master

mattklein123 previously approved these changes Aug 11, 2020
Member

@mattklein123 left a comment

This LGTM but I will defer to you and @stevenzzzz as to whether you want to stick with this or not per the other comment.

class RdsRouteConfigProviderImpl : public RouteConfigProvider,
                                   Logger::Loggable<Logger::Id::router>,
                                   public std::enable_shared_from_this<RdsRouteConfigProviderImpl> {
Contributor

@stevenzzzz Aug 11, 2020

Can we please not do this? When the provider is capable of shared_from_this, we actually open up access to it from worker threads, which I'd hope we can avoid: providers are supposed to be created/owned by the HCM, and no one else should have ownership of them; otherwise it introduces lots of trouble due to destruction order when Envoy tears down.
If you need a cleaner signal to decide whether we need a post, one thing you can do is lock route_config_updated_cb, which is the finest-grained object that tells whether the source HttpRequest is alive or not.

Member

I don't quite understand this now that I think about it more. This is already a shared_ptr allocated object. Why does it matter if it enabled shared_from_this?

Contributor

I think we want it shared between listeners on the main thread, but its ownership should not be shared with worker threads.
If the ownership of the provider is shared with a worker thread, the scenario I was talking about would cause the assertion error when LDS deletes the listener or when Envoy quits.

Contributor Author

I think the change is quite safe: we aren't changing the provider interface, just the concrete implementation. To create another shared_ptr, one would need to cast a RouteConfigProvider object to RdsRouteConfigProviderImpl first, and code review would catch such a cast or a call to shared_from_this(). I can add a note re: intended usage of shared_from_this if that would help.

Member

I agree that this is the right change modulo any other lifetime issue.

Contributor

It's a tricky case, and it's not about the life cycle of the provider when the posted lambda gets called.
The problem is that the ownership of any "TLS instance owner" should not be spilled into threads other than the main thread.
Please see my other reply for why I am against this; your change introduces another race which could end in an assertion failure.

});
factory_context_.dispatcher().post([this,
self = std::weak_ptr<RdsRouteConfigProviderImpl>(
shared_from_this()),
Contributor

This could cause trouble, as it introduces another race between worker threads and the main thread: for example, on Envoy teardown, the lambda created on the worker thread may be the last reference to be torn down, in which case the destruction of the provider will happen on a worker thread. That triggers an assert failure, because the provider holds a TLS instance which can only be destructed on the main thread.

Member

But how is this safe then? We capture "this" so it can still crash?

Contributor

@stevenzzzz Aug 11, 2020

I am thinking that capturing a member of the provider will guarantee that the provider is alive when this cb is called on the main thread.
The difference is that the ownership of the provider has not been "shared" outside of the main thread, which will not introduce the new race.

Member

> i am thinking capturing a member of the provider will guarantee that the provider is alive when this cb is called inside main thread.

It definitely does not guarantee that. It just prevents the member from going away.
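The distinction can be demonstrated concretely. A minimal sketch with hypothetical names (Owner, captureMember): capturing a shared_ptr member extends only that member's lifetime, not the lifetime of the enclosing object.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical owner with a shared_ptr member.
struct Owner {
  std::shared_ptr<int> member = std::make_shared<int>(42);
};

// The lambda co-owns `owner.member`, but holds no reference to `owner`
// itself, so destroying the Owner does not invalidate the capture.
std::function<int()> captureMember(const Owner& owner) {
  return [m = owner.member]() { return *m; };
}
```

So the captured member stays valid, but any use of the enclosing object (e.g. a captured `this`) would still dangle.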

Contributor

Let me rephrase why this change is a regression:

  1. This lambda "L" is created with a shared pointer to the RDS config provider, on a worker thread.
  2. It's destructed on the worker thread after it's posted to the main thread.
  3. On the main thread, the RDS config provider could go away because of LDS or Envoy teardown, in which case the ownership of the config provider falls to "L" if it hasn't been deleted from the worker thread yet.
  4. Deallocation of the RDS config provider on a worker thread causes an assertion failure, which is our fence on the clear boundary between the config pipeline and the data plane, as the TLS instance held by the provider can only be destructed on the main thread.
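Steps 1-3 above can be demonstrated in miniature (hypothetical types, threads elided; comments mark where each step would occur):

```cpp
#include <functional>
#include <memory>

// A hypothetical object that records when it is destroyed.
struct Tracked {
  bool* destroyed;
  explicit Tracked(bool* flag) : destroyed(flag) {}
  ~Tracked() { *destroyed = true; }
};

// A task that captured shared ownership, standing in for lambda "L".
std::function<void()> makePendingTask(std::shared_ptr<Tracked> t) {
  return [t]() { /* body would run on the main thread */ };
}
```

The point: once the main thread drops its reference, the pending task is the last owner, and destruction happens wherever that task happens to die.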

Member

Yes, I agree ^ is a problem, but this code is still broken: if the provider goes away on the main thread, we still capture "this". That is not correct. Given how long we've been talking about this, we probably need a cleaner solution here. If the original change is actually safe, can the comments be clearly updated to explain why?

Contributor

Right. So if we capture a weak_ptr to the callback list and verify it's alive on the main thread, we can make sure that we are accessing an active provider (and its subscription); meanwhile, the provider's ownership stays on the main thread, and we don't need to worry about the race mentioned above.

Trying to lock route_config_updated_cb would also do the trick, but it has a different semantic: if the callback is alive, the HCM is alive, and the provider associated with it will be alive. With that, though, we'd only fire an on-demand update while the requesting HttpRequest is alive; if we lock the provider, we fire as long as there was a request (which may be gone by then).

Member

Yes, OK, I think this makes sense. @dmitri-d can we revert back to what you had before but also add a lot more comments? I would like to get this merged, as I'm dealing with additional issues in this area and want to fix on top of this depending on whether anything else needs fixing. Thank you!

/wait

Member

Do we need to capture a weak_ptr to both subscription and callback list? Or would just subscription work? I think the original solution looks right, but it seems awkward to have to lock two objects. Better documentation on lifetime guarantees and relationships would definitely help. I think what you really want is some sentinel member element that provides "am I alive?"; normally this could be shared_from_this but this doesn't work for the reasons @stevenzzzz points out.

@mattklein123
Member

mattklein123 commented Aug 11, 2020

I just checked and the previous version of this patch does not fix #12235 (comment), so I will continue to debug.

@mattklein123
Member

Nevermind regarding ^ this is the same TSAN issue that is fixed with instrumented TSAN which I am not using. :(

Dmitri Dolguikh added 2 commits August 11, 2020 17:41
This reverts commit 97e55fd.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…live' flag

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 12, 2020

  • Reverted shared_from_this related changes
  • Introduced a dedicated 'still alive' flag: functionally the same as relying on a weak_ptr to config_update_callbacks_, but reads a bit clearer (I think). If everyone is ok with this change, I'll add a comment explaining the lifecycle there.
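The merged approach can be sketched roughly as follows, with hypothetical member names standing in for the actual Envoy code: the provider owns a shared_ptr sentinel whose only job is to answer "am I still alive?". Posted tasks capture a weak_ptr to the sentinel; since the sentinel dies with the provider on the main thread, a successful lock() means the captured raw pointer is still valid when the task runs there.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical provider with a dedicated "still alive" sentinel.
class Provider {
public:
  int on_demand_updates = 0;

  std::function<void()> makeTask() {
    return [this, alive = std::weak_ptr<bool>(still_alive_)]() {
      if (alive.lock()) {
        // The sentinel is destroyed with the provider on the main
        // thread, so a successful lock means `this` is still valid.
        on_demand_updates++;
      }
      // Otherwise `this` dangles and is never dereferenced.
    };
  }

private:
  std::shared_ptr<bool> still_alive_{std::make_shared<bool>(true)};
};
```

Because the tasks only ever hold a weak_ptr to the sentinel, ownership of the provider itself never leaves the main thread, avoiding the teardown race discussed above.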

Member

@htuch left a comment

This makes sense to me, @stevenzzzz @mattklein123 ?

@stevenzzzz
Contributor

> This makes sense to me, @stevenzzzz @mattklein123 ?

LGTM, making a shared_ptr on a callback list indeed looks weird.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • fixed formatting
  • added a comment explaining the need for still_alive flag

Member

@mattklein123 left a comment

Awesome, thanks. The comment and the discrete boolean make this much easier to understand.

@mattklein123 mattklein123 merged commit 3498413 into envoyproxy:master Aug 12, 2020
@dmitri-d dmitri-d deleted the fix-dangling-pointer-in-vhds branch August 12, 2020 17:28


Development

Successfully merging this pull request may close these issues.

Potential Dangling Reference in On Demand VHDS

5 participants