ecds: use last received filter config from a subscription#15371
ecds: use last received filter config from a subscription#15371snowp merged 6 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Kuat Yessenov <kuat@google.com>
|
/retest |
|
Retrying Azure Pipelines: |
Signed-off-by: Kuat Yessenov <kuat@google.com>
snowp
left a comment
There was a problem hiding this comment.
Seems right to me, just one question
| nullptr); | ||
| config_applied = true; | ||
| } catch (const EnvoyException& e) { | ||
| ENVOY_LOG(debug, "ECDS subscription {} is invalid in a listener context: {}.", |
There was a problem hiding this comment.
Can you express this in a way that leads to actionable outcome, i.e. what is the likely cause of this error?
Also, should we have some stat bumped? Logs will get lost in production settings.
There was a problem hiding this comment.
Thinking through this more broadly, I kind of find this corner case a bit icky. This relates to my validate config comment above; if all we can do is validate configs, can we pull in the type URL validation into the core here?
There was a problem hiding this comment.
Yeah, I'll take it away from base interfaces. The likely cause is pushing LDS and ECDS together and LDS update arriving first. Do you have any recommendation for a metric?
There was a problem hiding this comment.
Are you making the argument here that since the last config has already been validated and is known to work, the only possible way it could fail here is if the type URLs mismatch somehow? Do we even need onConfigUpdate in try/catch if we validate?
In terms of stats, probably something in https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats#config-http-conn-man-stats namespace makes sense.
There was a problem hiding this comment.
@htuch I'm looking for guidance how to deal with inconsistency between ECDS and LDS. The scenario is the following:
- LDS add listener L1 with config_discovery for x with type URL {T1, T2}
- ECDS update for x with type URL T1 (acked)
- LDS add listener L2 with config_discovery for x with type URL T2.
I think this is similar to a delayed NACK situation since we want the management server to push a new ECDS update for x with type URL T2, that validates both L1 and L2. This PR will make L2 to use the default config, but it does not indicate to the management server than a new ECDS update is needed. Granted, this is a fairly convoluted example. The metric would be useful, but is the assumption here, that some higher level controller (human?) is going to look for it and force push ECDS?
There was a problem hiding this comment.
Added "config_conflict" stat for ECDS resource. I think keying by ECDS is a bit more actionable, since it is clear why resource is the conflict (LDS tends to be more source-of-truth). LMK.
There was a problem hiding this comment.
In your example, is "x" actually different in step 4? I.e. it has been deliberately changed by the management server and when server in hypothetical step 4, it's really just a problem of 3 and 4 racing? Or is "x" somehow polymorphic and could be T1 or T2 and it gets decided based on client state?
This seems a pretty weird situation but I'm likely missing something. Feel free to ping me IM so we can figure this out with lower RTT.
Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
|
Yeah, it's just a problem of racing, "x" needs to be deliberately changed
to satisfy both type URL constraints. I totally agree, this is weird and
unnatural. Nevertheless, it needs to be handled. It has similarities to
sequencing LDS, RDS, CDS updates to avoid introducing conflicts.
…On Thu, Mar 11, 2021 at 7:02 PM htuch ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In source/common/filter/http/filter_config_discovery_impl.cc
<#15371 (comment)>:
> factory_context.initManager().add(provider->init_target_);
}
+
+ // If the subscription already received a config, attempt to apply it.
+ bool config_applied = false;
+ if (subscription->lastConfig().has_value()) {
+ try {
+ provider->validateConfig(subscription->lastTypeUrl());
+ provider->onConfigUpdate(subscription->lastConfig().value(), subscription->lastVersionInfo(),
+ nullptr);
+ config_applied = true;
+ } catch (const EnvoyException& e) {
+ ENVOY_LOG(debug, "ECDS subscription {} is invalid in a listener context: {}.",
In your example, is "x" actually different in step 4? I.e. it has been
deliberately changed by the management server and when server in
hypothetical step 4, it's really just a problem of 3 and 4 racing? Or is
"x" somehow polymorphic and could be T1 or T2 and it gets decided based on
client state?
This seems a pretty weird situation but I'm likely missing something. Feel
free to ping me IM so we can figure this out with lower RTT.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15371 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIYRRVNN2GWAL4OFNOETWLTDFY37ANCNFSM4Y2H4L6Q>
.
|
|
@kyessenov I don't see what would possibly cause the Envoy to re-request This is the moral equivalent of returning 4xx/5xx when cluster or endpoint resources are missing due to eventual consistency. I think in those cases, monitoring would be responsible for spotting the problem. |
|
Right, the update for "x" has to be pushed by management, ideally before
pushing L2. If the sequencing is done correctly, the conflict never arises.
If it does, the metric here conflg_conflict should be a good signal that an
ECDS resource "x" needs an update.
…On Thu, Mar 11, 2021 at 7:10 PM htuch ***@***.***> wrote:
@kyessenov <https://github.com/kyessenov> I don't see what would possibly
cause the Envoy to re-request x in this situation, so I assume it must be
the server pushing an update of x. I think having L2 using the default
config is semantically correct, since this is what it would do is x
wasn't present. Presumably the management server sends the updated x
shortly after and both L1 and L2 switch to it?
This is the moral equivalent of returning 4xx/5xx when cluster or endpoint
resources are missing due to eventual consistency. I think in those cases,
monitoring would be responsible for spotting the problem.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15371 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIYRRUDUD66LJGPA2M6VK3TDFZZ3ANCNFSM4Y2H4L6Q>
.
|
|
@kyessenov I see the metric as not being used by the management server, instead it would be used to diagnose that sometimes there are consistency problems and to fix the management server to stop these happening in the future. Does that seems a reasonable flow? |
|
Yeah, that sounds reasonable. It's a mis-configuration runtime error, so
metrics seem appropriate to alert operators.
…On Thu, Mar 11, 2021 at 7:18 PM htuch ***@***.***> wrote:
@kyessenov <https://github.com/kyessenov> I see the metric as not being
used by the management server, instead it would be used to diagnose that
sometimes there are consistency problems and to fix the management server
to stop these happening in the future. Does that seems a reasonable flow?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15371 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACIYRRXTWFMKJTX22X7AQNLTDF2YHANCNFSM4Y2H4L6Q>
.
|
htuch
left a comment
There was a problem hiding this comment.
@kyessenov thanks for all the great discussion explaining the intuition. Could you capture it in a comment somewhere so that this lives for posterity near the code in question? Ta.
/wait
| bool last_config_valid = false; | ||
| if (subscription->lastConfig().has_value()) { | ||
| try { | ||
| provider->validateTypeUrl(subscription->lastTypeUrl()); |
There was a problem hiding this comment.
Would there be a value in moving this to returning true/false?
There was a problem hiding this comment.
I considered it, but the exception contains the list of required type URLs, which is useful. Bool would be less actionable.
Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov kuat@google.com
Commit Message: Fix a bug with LDS not picking up last ECDS change. This was because the logic applied the default config from the extension config source regardless of the whether subscription has been initialized with a filter config. There is a tricky edge case if the listener restricts the type URLs that makes the subscription invalid. In this case, we continue using the default config, and log a message.
Risk Level: low (bugfix)
Testing: integration
Docs Changes: none
Release Notes: fix a bug with ECDS using the default config in multiple listeners subscribed to the same filter config resource
Fixes: #14934
cc @bianpengyuan @snowp