Fixes #12235: handle dangling reference during VHDS update request#12395
Fixes #12235: handle dangling reference during VHDS update request#12395mattklein123 merged 16 commits intoenvoyproxy:masterfrom dmitri-d:fix-dangling-pointer-in-vhds
Conversation
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
ping @stevenzzzz |
stevenzzzz
left a comment
There was a problem hiding this comment.
thanks, looks good. can we add a test to stop regression?
source/common/router/rds_impl.cc
Outdated
| auto cb = config_cbs.lock(); | ||
| if (s && cb) { | ||
| s->updateOnDemand(alias); | ||
| cb->push_back({alias, thread_local_dispatcher, route_config_updated_cb}); |
There was a problem hiding this comment.
if in main thread, management server already sent back response with the alias, do you want to send another on demand request to management server?
Depends on management server implementation, if it only sends a response when there is some update of the project, then this HttpRequest will hang till timeout, while Envoy has the config to proceed.
There was a problem hiding this comment.
This gets pretty complicated quickly: the only time (that I can see) an instance of RdsRouteConfigProviderImplcan be deallocated is when LDS is used and the listener configuration has changed. At this point all connections are going to be drained, trying to update a vhost seems not worth it to me.
Even if we wanted to do that, we'd still need an active subscription (may not be there yet, and no way to find out which one to use) and a place to store the callback to notify the on_demand_update filter (which may not be there any more, as the active stream it belongs to may have been closed).
There was a problem hiding this comment.
I am describing a scenario where no rds subscription destruction is involved.
could the management server send required vhost to Envoy before the on-demand request is sent? in that case, the management server may or may not send the requested resource again depends on its implementation, in which case the callback will not be called and the request will only timeout. this is very implementation specific, so I am not too concerned.
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
ping @stevenzzzz. |
|
sorry I thought I committed the comment. there is one thing to be fixed around which weak_ptr to look at, please lock the most "specific" object to make sure there is no lifecycle problem, i.e., the config_cbs_. |
In this context, a weak_ptr to either the subscription or the list of callbacks would work; if anything the subscription is longer-lived and has a benefit of being already stored in a shared_ptr. |
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
source/common/router/rds_impl.cc
Outdated
| [subscription = std::weak_ptr<RdsRouteConfigSubscription>(subscription_), | ||
| config_cbs = std::weak_ptr<std::list<UpdateOnDemandCallback>>(config_update_callbacks_), | ||
| alias, &thread_local_dispatcher, route_config_updated_cb]() -> void { | ||
| auto sub = subscription.lock(); |
There was a problem hiding this comment.
you can move this into the if statement below.
There was a problem hiding this comment.
could you also move this sub lock to the if statement below?
Thanks for the fix. |
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
|
Looks like arm build failure is due to a flaky test. |
|
ping @stevenzzzz |
|
I LGTMed already. |
snowp
left a comment
There was a problem hiding this comment.
Thanks for working on this, just one comment.
source/common/router/rds_impl.cc
Outdated
| config_cbs = std::weak_ptr<std::list<UpdateOnDemandCallback>>(config_update_callbacks_), | ||
| alias, &thread_local_dispatcher, route_config_updated_cb]() -> void { | ||
| if (auto callbacks = config_cbs.lock()) { | ||
| auto sub = subscription.lock(); |
There was a problem hiding this comment.
why capture a weak ptr if you dont ever validate that its still alive? maybe just capture this as a weak ptr?
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for fixing. One small question.
/wait-any
source/common/router/rds_impl.cc
Outdated
| // The RdsRouteConfigProviderImpl instance can go away before the dispatcher has a chance to | ||
| // execute the callback. We use a weak_ptr here to verify that the instance is still around. |
There was a problem hiding this comment.
Why don't we allocate the provider itself with shared_from_this and then we can capture a weak reference to ourselves and post that? This seems a bit strange and fragile?
There was a problem hiding this comment.
I'd hesitate to do so, with that we could get into another rabbit hole where a worker thread may accidentally get a shared_ptr of the provider and introduce another bunch of life cycle problem due to provider can only be con/destructed on main thread.
thoughts?
There was a problem hiding this comment.
OK fair enough. Can we add a comment on why we are not doing that so the next person that comes along doesn't ask the same question? Thank you.
/wait
There was a problem hiding this comment.
Ahh, yes, this is better.
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…vhds Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
mattklein123
left a comment
There was a problem hiding this comment.
This LGTM but I will defer to you and @stevenzzzz as to whether you want to stick with this or not per the other comment.
source/common/router/rds_impl.h
Outdated
| class RdsRouteConfigProviderImpl : public RouteConfigProvider, | ||
| Logger::Loggable<Logger::Id::router> { | ||
| Logger::Loggable<Logger::Id::router>, | ||
| public std::enable_shared_from_this<RdsRouteConfigProviderImpl> { |
There was a problem hiding this comment.
can we please not do this. When provider is capable of shared_from_this, we actual open its access to worker threads, which I'd hope we can avoid: providers are supposed to be created/owned by HCM and no one else should have ownership to it, otherwise it'd introduce lots of troubles due to destruction order when Envoy tears down.
If you need a cleaner signal to decide if we need a post, one thing you can do is to lock on route_config_updated_cb, which the finest grained object that tells if the source httprequest is alive or not.
There was a problem hiding this comment.
I don't quite understand this now that I think about it more. This is already a shared_ptr allocated object. Why does it matter if it enabled shared_from_this?
There was a problem hiding this comment.
I think we want it shared between listeners in main thread, but its ownership should not be shared to worker thread.
If however the ownership of the provider is shared to worker thread, the scenario i was talking about would cause the assertion error when LDS deletes the listener or when envoy quits.
There was a problem hiding this comment.
I think the change is quite safe: we aren't changing provider interface, just the concrete implementation: to create another shared_ptr one would need to cast a RouteConfigProvider object to RdsRouteConfigProviderImpl before being able to do so. Code review would catch such a cast or a call to shared_from_this(). I can add a note re: intended usage of shared_from_this if that would help.
There was a problem hiding this comment.
I agree that this is the right change modulo any other lifetime issue.
There was a problem hiding this comment.
it's a tricky case and it's not about life cycle of provider when the post lambda gets called.
the problem is that The ownership of any "tls instance owner" should not be spilled into other threads than the main thread.
please see my other reply why I am against this, your change is introducing another race which could ends in an assertion failure.
source/common/router/rds_impl.cc
Outdated
| }); | ||
| factory_context_.dispatcher().post([this, | ||
| self = std::weak_ptr<RdsRouteConfigProviderImpl>( | ||
| shared_from_this()), |
There was a problem hiding this comment.
this could cause trouble as it introduces another race between worker threads and main thread: for example on envoy teardown, the lambda created in the worker thread may be the last one to be teardown, in which case, the destruction of the provider will happen in worker thread, which triggers an assert failure because it holds an TLS instance which can only be destructed in main thread.
There was a problem hiding this comment.
But how is this safe then? We capture "this" so it can still crash?
There was a problem hiding this comment.
i am thinking capturing a member of the provider will guarantee that the provider is alive when this cb is called inside main thread.
The difference is the ownership of provider has not been "shared" outside of main thread, which will not introduce the new race.
There was a problem hiding this comment.
i am thinking capturing a member of the provider will guarantee that the provider is alive when this cb is called inside main thread.
It definitely does not guarantee that. It just prevents the member from going away.
There was a problem hiding this comment.
let me rephrase why this change is a regression:
- this lambda "L" is created with a shared pointer to the RDS config provider, and it's in worker thread.
- it's destructed in worker thread after it's posted to main thread,
- in main thread, the rds config provider could go away because of LDS or Envoy teardown, in which case the ownership of the config provider will fall to the "L" if it's not deleted from the worker thread yet.
- deallocation of the RDS config provider in worker thread will cause an assertion failure, which is our fence on clear boundary between config pipeline and data plane, as the tls instance held by the provider can only be destructed in main thread.
There was a problem hiding this comment.
Yes I agree ^ is a problem, but this code is still broken. If the provider goes away on the main thread, we still capture this. That is not correct. I think given how long we are talking about this we probably need a cleaner solution here. If the original change is actually safe can the comments be clearly updated to why?
There was a problem hiding this comment.
Right. So if we capture a weak_ptr of the callback list, and verify it's alive in main thread, we can make sure that we are accessing an active provider(and its subscription), meanwhile, the provider ownership is still on main thread and we dont need to worry about the race mentioned above.
trying to lock on the route_config_updated_cb will also do the trick, but it has a different semantic: if the callback is alive, which means the hcm is alive and the provider associated with it will be alive.
but now we are firing an on-demand when the requesting HttpRequest is alive, if we lock on provider, we are firing as long as there was a request(may be gone).
There was a problem hiding this comment.
Yes, OK, I think this make sense. @dmitri-d can we revert back to what you had before but also add a lot more comments? I would like to get this merged as I'm dealing with additional issues in this area and I want to fix on top of this depending on if anything else needs fixing. Thank you!
/wait
There was a problem hiding this comment.
Do we need to capture a weak_ptr to both subscription and callback list? Or would just subscription work? I think the original solution looks right, but it seems awkward to have to lock two objects. Better documentation on lifetime guarantees and relationships would definitely help. I think what you really want is some sentinel member element that provides "am I alive?"; normally this could be shared_from_this but this doesn't work for the reasons @stevenzzzz points out.
|
I just checked and the previous version of this patch does not fix #12235 (comment), so I will continue to debug. |
|
Nevermind regarding ^ this is the same TSAN issue that is fixed with instrumented TSAN which I am not using. :( |
This reverts commit 97e55fd. Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…live' flag Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
htuch
left a comment
There was a problem hiding this comment.
This makes sense to me, @stevenzzzz @mattklein123 ?
LGTM, making a shared_ptr on a callback list indeed looks weird. |
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
|
mattklein123
left a comment
There was a problem hiding this comment.
Awesome, thanks. The comment and the discrete boolean make this much easier to understand.
Fixes #12235
Signed-off-by: Dmitri Dolguikh ddolguik@redhat.com
Risk Level: low
Testing: existing tests
Docs Changes: none
Release Notes: none