xDS: gRPC connection failure shouldn't make Envoy continue startup#8152
xDS: gRPC connection failure shouldn't make Envoy continue startup#8152htuch merged 3 commits intoenvoyproxy:masterfrom
Conversation
There was a problem hiding this comment.
The code here is exactly same as the onConfigUpdateFailed in grpc_mux_subscription_impl except logs. Can we refactor them in to a utility and pass ApiType and use it in logs?
There was a problem hiding this comment.
IMHO this is not necessary to use a common function to handle the errors for HTTP and gRPC subscription. It's possible to have different error handling for them.
There was a problem hiding this comment.
Not sure why would it be different once the message is received, but up to you.
There was a problem hiding this comment.
GrpcMuxSubscriptionImpl and HttpSubscriptionImpl share Config::Subscription as common interface, which only defines subscription related API, no API for get stats_ and callbacks_, or disableInitFetchTimeoutTimer(). That's, beside start() and updateResource(), they are not expected to be same, although they have similar implementation.
I guess it's not a good idea to define a new interface for error handling or use template to generalize the type. Maybe they can share a common base class which define the error handling, but for this PR I want to limit the change scope :)
There was a problem hiding this comment.
I think it would be nice to dedupe here, but agree we can push this to a later PR.
There was a problem hiding this comment.
Please hold off on anything like this until #7293 is merged.
There was a problem hiding this comment.
(The reason being that GrpcMuxSubscriptionImpl is about to be entirely replaced)
There was a problem hiding this comment.
But should we fix that case now? Otherwise of http subscriptions even if the connection failed, it will call onConfigUpdateFailed method which is incorrect/inconsistent?
There was a problem hiding this comment.
I suggest use an another PR to fix "//test/integration:hotrestart_test". I'm not fully understand that test case right now, no sure what's the best way to modify that.
There was a problem hiding this comment.
PR #8162 created for fixing "//test/integration:hotrestart_test", that's a small config change, just remove dynamic_resources, after that accepted, I will update code here.
There was a problem hiding this comment.
nit: call this method if reason != ConnectionFailure may be more readable than return?
There was a problem hiding this comment.
Putting const value at left side of == is considered a better style. It doesn't cause any difficulty for readability usually, because reader always need to see both side of == to understand the condition.
There was a problem hiding this comment.
Sorry. My comment was not clear may be. I am not talking about const on left of ==,
I am suggesting the following instead if return - Not a big deal though
if (Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure != reason) {
callbacks_.onConfigUpdateFailed(reason, e);
}
There was a problem hiding this comment.
IMO positive logic is preferred, the comment in if block describe is helpful for understanding what's going on.
There was a problem hiding this comment.
@l8huang can you actually switch this to reason == the const value? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.
There was a problem hiding this comment.
I thought onFetchFailure is always a ConnectionFailure? If so, we do not have to change onFetchFailure signature and just call handleFailure with ConnectionFailure reason here?
There was a problem hiding this comment.
no, please see old code in source/common/http/rest_api_fetcher.cc.
|
@lambdai for first pass. @ramaraochavali I thought we had already done this, do you have an idea of what is different? |
|
@htuch we fixed it for EDS earlier. This fix is trying to generalize for all subscriptions. |
There was a problem hiding this comment.
nit: reset timer if not nullptr, or simply disableInitFetchTimeoutTimer() ?
There was a problem hiding this comment.
disableInitFetchTimeoutTimer() added
lambdai
left a comment
There was a problem hiding this comment.
Theoretically it's safe when the callback need to handle fewer conditions(connection failure)
The questions is, is there any callback expecting connection failure signal, such as clean up? I think the answer is no.
source/common/upstream/eds.cc
Outdated
There was a problem hiding this comment.
Can we add a log line here? It's a critical milestone to trigger an init complete
There was a problem hiding this comment.
Also add a ASSERT(reason != Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure) ?
Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Signed-off-by: lhuang8 <lhuang8@ebay.com>
Signed-off-by: lhuang8 <lhuang8@ebay.com>
|
@lambdai PR updated, could you please take a review? |
|
/lgtm |
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
There was a problem hiding this comment.
I think it would be nice to dedupe here, but agree we can push this to a later PR.
There was a problem hiding this comment.
@l8huang can you actually switch this to reason == the const value? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.
Signed-off-by: lhuang8 <lhuang8@ebay.com>
|
@l8huang Nice change; looking at your code and description, this certainly seems like a better way to handle the situation! I have tried to adapt it into my ongoing xDS PR, #7293. Could you take a quick look at to make sure I have the right idea? (The flow is that DeltaSubscriptionState::handleEstablishmentFailure() calls DeltaSubscriptionImpl::onConfigUpdateFailed()). |
|
@fredlas LGTM |
|
Thank you! :) |
…nvoyproxy#8152) Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Risk Level: Medium Testing: existing test cases updated Fixes envoyproxy#8046 Signed-off-by: lhuang8 <lhuang8@ebay.com>
…nvoyproxy#8152) Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Risk Level: Medium Testing: existing test cases updated Fixes envoyproxy#8046 Signed-off-by: lhuang8 <lhuang8@ebay.com>
…nvoyproxy#8152) Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Risk Level: Medium Testing: existing test cases updated Fixes envoyproxy#8046 Signed-off-by: lhuang8 <lhuang8@ebay.com>
xDS: gRPC connection failure shouldn't make Envoy continue startup
Description:
Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".
This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.
Risk Level: Medium
Testing: existing test cases updated
Fixes #8046
Signed-off-by: lhuang8 lhuang8@ebay.com