
Fixes #12235: handle dangling reference during VHDS update request#12395

Merged
mattklein123 merged 16 commits into envoyproxy:master from dmitri-d:fix-dangling-pointer-in-vhds
Aug 12, 2020

Conversation

@dmitri-d
Contributor

Fixes #12235

Signed-off-by: Dmitri Dolguikh ddolguik@redhat.com

Risk Level: low
Testing: existing tests
Docs Changes: none
Release Notes: none

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

ping @stevenzzzz

Contributor

@stevenzzzz left a comment


Thanks, looks good. Can we add a test to prevent regressions?

auto cb = config_cbs.lock();
if (s && cb) {
s->updateOnDemand(alias);
cb->push_back({alias, thread_local_dispatcher, route_config_updated_cb});
Contributor

If, by the time we are on the main thread, the management server has already sent back a response with the alias, do you want to send another on-demand request to the management server?

This depends on the management server implementation: if it only sends a response when there is some update of the project, then this HttpRequest will hang until timeout, even though Envoy already has the config to proceed.

Contributor Author

This gets pretty complicated quickly: the only time (that I can see) an instance of RdsRouteConfigProviderImpl can be deallocated is when LDS is used and the listener configuration has changed. At that point all connections are going to be drained, so trying to update a vhost seems not worth it to me.

Even if we wanted to do that, we'd still need an active subscription (it may not be there yet, and there's no way to find out which one to use) and a place to store the callback to notify the on_demand_update filter (which may not be there any more, as the active stream it belongs to may have been closed).

Contributor

I am describing a scenario where no RDS subscription destruction is involved.

Could the management server send the required vhost to Envoy before the on-demand request is sent? In that case, the management server may or may not send the requested resource again depending on its implementation; if it doesn't, the callback will not be called and the request will only time out. This is very implementation-specific, so I am not too concerned.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • added a test

Dmitri Dolguikh added 2 commits July 31, 2020 15:12
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • merged in changes from master

Dmitri Dolguikh added 2 commits August 4, 2020 12:18
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

ping @stevenzzzz.

@stevenzzzz
Contributor

Sorry, I thought I had submitted the comment.

There is one thing to be fixed around which weak_ptr to look at: please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_.

@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

> there is one thing to be fixed around which weak_ptr to look at, please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_

In this context, a weak_ptr to either the subscription or the list of callbacks would work; if anything the subscription is longer-lived and has a benefit of being already stored in a shared_ptr.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

  • switched to use config_update_callbacks to determine if the instance of RdsRouteConfigProviderImpl is still alive.

[subscription = std::weak_ptr<RdsRouteConfigSubscription>(subscription_),
config_cbs = std::weak_ptr<std::list<UpdateOnDemandCallback>>(config_update_callbacks_),
alias, &thread_local_dispatcher, route_config_updated_cb]() -> void {
auto sub = subscription.lock();
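The capture pattern in this snippet can be sketched in a self-contained form. The following is an illustrative reconstruction with simplified stand-in types (Subscription, makeTask — not the real Envoy classes): the posted task captures weak_ptrs to both the subscription and the callback list, and proceeds only if both can still be locked when it finally runs on the main thread.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Envoy types in the diff above.
struct Subscription {
  std::vector<std::string> requested;
  void updateOnDemand(const std::string& alias) { requested.push_back(alias); }
};

using UpdateCallback = std::function<void()>;

// Builds the deferred task the way the patch does: capture weak_ptrs and
// lock them only at execution time.
std::function<void()> makeTask(std::weak_ptr<Subscription> subscription,
                               std::weak_ptr<std::list<UpdateCallback>> config_cbs,
                               std::string alias, UpdateCallback route_config_updated_cb) {
  return [subscription, config_cbs, alias = std::move(alias),
          route_config_updated_cb = std::move(route_config_updated_cb)]() {
    auto s = subscription.lock();
    auto cb = config_cbs.lock();
    if (s && cb) { // both owners still alive: safe to touch them
      s->updateOnDemand(alias);
      cb->push_back(route_config_updated_cb);
    }
    // Otherwise the provider went away before the dispatcher ran us: no-op.
  };
}
```

A failed lock() turns the task into a harmless no-op instead of a dangling-pointer dereference.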
Contributor

you can move this into the if statement below.

Contributor

could you also move this sub lock to the if statement below?

@stevenzzzz
Contributor

> there is one thing to be fixed around which weak_ptr to look at, please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_

> In this context, a weak_ptr to either the subscription or the list of callbacks would work; if anything the subscription is longer-lived and has a benefit of being already stored in a shared_ptr.

If config_callbacks are gone, it's undefined behavior again when you push a callback into the list.

Thanks for the fix.
LGTM

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

  • responded to feedback

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 5, 2020

  • responded to feedback

@dmitri-d
Contributor Author

dmitri-d commented Aug 6, 2020

Looks like the ARM build failure is due to a flaky test.

@dmitri-d
Contributor Author

dmitri-d commented Aug 6, 2020

ping @stevenzzzz

@stevenzzzz
Contributor

I LGTMed already.
Let's ping @snowp
😄

Contributor

@snowp left a comment

Thanks for working on this, just one comment.

config_cbs = std::weak_ptr<std::list<UpdateOnDemandCallback>>(config_update_callbacks_),
alias, &thread_local_dispatcher, route_config_updated_cb]() -> void {
if (auto callbacks = config_cbs.lock()) {
auto sub = subscription.lock();
Contributor

Why capture a weak_ptr if you don't ever validate that it's still alive? Maybe just capture "this" as a weak_ptr?

Dmitri Dolguikh added 2 commits August 7, 2020 10:44
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Member

@mattklein123 left a comment

Thanks for fixing. One small question.

/wait-any

Comment on lines +309 to +310
// The RdsRouteConfigProviderImpl instance can go away before the dispatcher has a chance to
// execute the callback. We use a weak_ptr here to verify that the instance is still around.
Member

Why don't we allocate the provider itself with shared_from_this and then we can capture a weak reference to ourselves and post that? This seems a bit strange and fragile?
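The suggested alternative can be sketched as follows. This is a hypothetical simplification (the Provider class and its members are stand-ins, not the actual Envoy code): enabling shared_from_this lets the posted task hold a weak reference to the whole object rather than to individual members.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical, simplified provider: tasks capture a weak reference to
// the whole object and only dereference it after a successful lock().
class Provider : public std::enable_shared_from_this<Provider> {
public:
  int on_demand_updates = 0;

  std::function<void()> makeTask() {
    return [weak_self = std::weak_ptr<Provider>(shared_from_this())]() {
      if (auto self = weak_self.lock()) {
        self->on_demand_updates++; // the provider is still alive here
      }
      // Otherwise the provider was destroyed; nothing to do.
    };
  }
};
```

Note that reviewers later push back on this approach in the thread, because enabling shared_from_this opens up shared ownership of the provider to worker threads.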

Contributor

I'd hesitate to do so: with that, we could get into another rabbit hole where a worker thread accidentally gets a shared_ptr to the provider, introducing another set of lifecycle problems, since the provider can only be constructed/destructed on the main thread.
Thoughts?

Member

OK fair enough. Can we add a comment on why we are not doing that so the next person that comes along doesn't ask the same question? Thank you.

/wait

Contributor Author

Ahh, yes, this is better.

Dmitri Dolguikh added 2 commits August 10, 2020 17:12
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • switched to using a shared_from_this instance of a RdsRouteConfigProviderImpl
  • merged in latest master

mattklein123 previously approved these changes Aug 11, 2020
Member

@mattklein123 left a comment

This LGTM but I will defer to you and @stevenzzzz as to whether you want to stick with this or not per the other comment.

class RdsRouteConfigProviderImpl : public RouteConfigProvider,
                                   Logger::Loggable<Logger::Id::router>,
                                   public std::enable_shared_from_this<RdsRouteConfigProviderImpl> {
Contributor

@stevenzzzz Aug 11, 2020

Can we please not do this? When the provider is capable of shared_from_this, we actually open up access to it from worker threads, which I'd hope we can avoid: providers are supposed to be created/owned by the HCM, and no one else should have ownership of them; otherwise it introduces lots of trouble due to destruction order when Envoy tears down.
If you need a cleaner signal to decide whether we need a post, one thing you can do is lock route_config_updated_cb, which is the finest-grained object that tells whether the source HttpRequest is alive or not.

Member

I don't quite understand this now that I think about it more. This is already a shared_ptr allocated object. Why does it matter if it enabled shared_from_this?

Contributor

I think we want it shared between listeners on the main thread, but its ownership should not be shared with worker threads.
If the ownership of the provider is shared with a worker thread, the scenario I was talking about would cause the assertion error when LDS deletes the listener or when Envoy quits.

Contributor Author

I think the change is quite safe: we aren't changing the provider interface, just the concrete implementation. To create another shared_ptr, one would need to cast a RouteConfigProvider object to RdsRouteConfigProviderImpl first, and code review would catch such a cast or a call to shared_from_this(). I can add a note re: intended usage of shared_from_this if that would help.

Member

I agree that this is the right change modulo any other lifetime issue.

Contributor

It's a tricky case, and it's not about the life cycle of the provider when the posted lambda gets called.
The problem is that the ownership of any "TLS instance owner" should not be spilled into threads other than the main thread.
Please see my other reply for why I am against this; your change introduces another race which could end in an assertion failure.

});
factory_context_.dispatcher().post([this,
self = std::weak_ptr<RdsRouteConfigProviderImpl>(
shared_from_this()),
Contributor

This could cause trouble, as it introduces another race between worker threads and the main thread: for example, on Envoy teardown, the lambda created on the worker thread may be the last reference to be torn down, in which case the destruction of the provider will happen on a worker thread. That triggers an assert failure, because the provider holds a TLS instance which can only be destructed on the main thread.

Member

But how is this safe then? We capture "this" so it can still crash?

Contributor

@stevenzzzz Aug 11, 2020

I am thinking that capturing a member of the provider will guarantee that the provider is alive when this cb is called on the main thread.
The difference is that the ownership of the provider has not been "shared" outside of the main thread, which will not introduce the new race.

Member

> i am thinking capturing a member of the provider will guarantee that the provider is alive when this cb is called inside main thread.

It definitely does not guarantee that. It just prevents the member from going away.
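The distinction can be demonstrated concretely. A minimal sketch with hypothetical names (Owner, captureMember): capturing a shared_ptr member extends only that member's lifetime, not the lifetime of the enclosing object.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical owner with a shared_ptr member.
struct Owner {
  std::shared_ptr<int> member = std::make_shared<int>(42);
};

// The lambda co-owns `owner.member`, but holds no reference to `owner`
// itself, so destroying the Owner does not invalidate the capture.
std::function<int()> captureMember(const Owner& owner) {
  return [m = owner.member]() { return *m; };
}
```

So the captured member stays valid, but any use of the enclosing object (e.g. a captured `this`) would still dangle.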

Contributor

Let me rephrase why this change is a regression:

  1. This lambda "L" is created with a shared pointer to the RDS config provider, on a worker thread.
  2. It's destructed on the worker thread after it's posted to the main thread.
  3. On the main thread, the RDS config provider could go away because of LDS or Envoy teardown, in which case the ownership of the config provider falls to "L" if it hasn't been deleted from the worker thread yet.
  4. Deallocation of the RDS config provider on a worker thread causes an assertion failure, which is our fence on the clear boundary between the config pipeline and the data plane, as the TLS instance held by the provider can only be destructed on the main thread.
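Steps 1-3 above can be demonstrated in miniature (hypothetical types, threads elided; comments mark where each step would occur):

```cpp
#include <functional>
#include <memory>

// A hypothetical object that records when it is destroyed.
struct Tracked {
  bool* destroyed;
  explicit Tracked(bool* flag) : destroyed(flag) {}
  ~Tracked() { *destroyed = true; }
};

// A task that captured shared ownership, standing in for lambda "L".
std::function<void()> makePendingTask(std::shared_ptr<Tracked> t) {
  return [t]() { /* body would run on the main thread */ };
}
```

The point: once the main thread drops its reference, the pending task is the last owner, and destruction happens wherever that task happens to die.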

Member

Yes, I agree ^ is a problem, but this code is still broken: if the provider goes away on the main thread, we still capture "this". That is not correct. Given how long we've been talking about this, we probably need a cleaner solution here. If the original change is actually safe, can the comments be clearly updated to explain why?

Contributor

Right. So if we capture a weak_ptr to the callback list and verify it's alive on the main thread, we can make sure that we are accessing an active provider (and its subscription); meanwhile, the provider's ownership stays on the main thread, and we don't need to worry about the race mentioned above.

Trying to lock route_config_updated_cb would also do the trick, but it has a different semantic: if the callback is alive, the HCM is alive, and the provider associated with it will be alive. With that, though, we'd only fire an on-demand update while the requesting HttpRequest is alive; if we lock the provider, we fire as long as there was a request (which may be gone by then).

Member

Yes, OK, I think this makes sense. @dmitri-d can we revert back to what you had before but also add a lot more comments? I would like to get this merged, as I'm dealing with additional issues in this area and want to fix on top of this depending on whether anything else needs fixing. Thank you!

/wait

Member

Do we need to capture a weak_ptr to both subscription and callback list? Or would just subscription work? I think the original solution looks right, but it seems awkward to have to lock two objects. Better documentation on lifetime guarantees and relationships would definitely help. I think what you really want is some sentinel member element that provides "am I alive?"; normally this could be shared_from_this but this doesn't work for the reasons @stevenzzzz points out.

@mattklein123
Member

mattklein123 commented Aug 11, 2020

I just checked and the previous version of this patch does not fix #12235 (comment), so I will continue to debug.

@mattklein123
Member

Nevermind regarding ^ this is the same TSAN issue that is fixed with instrumented TSAN which I am not using. :(

Dmitri Dolguikh added 2 commits August 11, 2020 17:41
This reverts commit 97e55fd.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…live' flag

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

dmitri-d commented Aug 12, 2020

  • Reverted shared_from_this related changes
  • Introduced a dedicated 'still alive' flag: functionally the same as relying on a weak_ptr to config_update_callbacks_, but reads a bit clearer (I think). If everyone is ok with this change, I'll add a comment explaining the lifecycle there.
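The merged approach can be sketched roughly as follows, with hypothetical member names standing in for the actual Envoy code: the provider owns a shared_ptr sentinel whose only job is to answer "am I still alive?". Posted tasks capture a weak_ptr to the sentinel; since the sentinel dies with the provider on the main thread, a successful lock() means the captured raw pointer is still valid when the task runs there.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical provider with a dedicated "still alive" sentinel.
class Provider {
public:
  int on_demand_updates = 0;

  std::function<void()> makeTask() {
    return [this, alive = std::weak_ptr<bool>(still_alive_)]() {
      if (alive.lock()) {
        // The sentinel is destroyed with the provider on the main
        // thread, so a successful lock means `this` is still valid.
        on_demand_updates++;
      }
      // Otherwise `this` dangles and is never dereferenced.
    };
  }

private:
  std::shared_ptr<bool> still_alive_{std::make_shared<bool>(true)};
};
```

Because the tasks only ever hold a weak_ptr to the sentinel, ownership of the provider itself never leaves the main thread, avoiding the teardown race discussed above.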

Member

@htuch left a comment

This makes sense to me, @stevenzzzz @mattklein123 ?

@stevenzzzz
Contributor

> This makes sense to me, @stevenzzzz @mattklein123 ?

LGTM, making a shared_ptr on a callback list indeed looks weird.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Contributor Author

  • fixed formatting
  • added a comment explaining the need for still_alive flag

Member

@mattklein123 left a comment

Awesome, thanks. The comment and the discrete boolean make this much easier to understand.

@mattklein123 mattklein123 merged commit 3498413 into envoyproxy:master Aug 12, 2020
@dmitri-d dmitri-d deleted the fix-dangling-pointer-in-vhds branch August 12, 2020 17:28


Development

Successfully merging this pull request may close these issues.

Potential Dangling Reference in On Demand VHDS

5 participants