Skip to content

ecds: use last received filter config from a subscription#15371

Merged
snowp merged 6 commits intoenvoyproxy:mainfrom
kyessenov:skip_ecds_listener_warming
Mar 17, 2021
Merged

ecds: use last received filter config from a subscription#15371
snowp merged 6 commits intoenvoyproxy:mainfrom
kyessenov:skip_ecds_listener_warming

Conversation

@kyessenov
Copy link
Copy Markdown
Contributor

Signed-off-by: Kuat Yessenov kuat@google.com

Commit Message: Fix a bug with LDS not picking up last ECDS change. This was because the logic applied the default config from the extension config source regardless of the whether subscription has been initialized with a filter config. There is a tricky edge case if the listener restricts the type URLs that makes the subscription invalid. In this case, we continue using the default config, and log a message.
Risk Level: low (bugfix)
Testing: integration
Docs Changes: none
Release Notes: fix a bug with ECDS using the default config in multiple listeners subscribed to the same filter config resource
Fixes: #14934

cc @bianpengyuan @snowp

Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov
Copy link
Copy Markdown
Contributor Author

/retest
(not sure how to access test logs...)

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15371 (comment) was created by @kyessenov.

see: more, trace.

Signed-off-by: Kuat Yessenov <kuat@google.com>
@dio dio assigned htuch Mar 9, 2021
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems right to me, just one question

nullptr);
config_applied = true;
} catch (const EnvoyException& e) {
ENVOY_LOG(debug, "ECDS subscription {} is invalid in a listener context: {}.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you express this in a way that leads to actionable outcome, i.e. what is the likely cause of this error?

Also, should we have some stat bumped? Logs will get lost in production settings.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through this more broadly, I kind of find this corner case a bit icky. This relates to my validate config comment above; if all we can do is validate configs, can we pull in the type URL validation into the core here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll take it away from base interfaces. The likely cause is pushing LDS and ECDS together and LDS update arriving first. Do you have any recommendation for a metric?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you making the argument here that since the last config has already been validated and is known to work, the only possible way it could fail here is if the type URLs mismatch somehow? Do we even need onConfigUpdate in try/catch if we validate?

In terms of stats, probably something in https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats#config-http-conn-man-stats namespace makes sense.

Copy link
Copy Markdown
Contributor Author

@kyessenov kyessenov Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch I'm looking for guidance how to deal with inconsistency between ECDS and LDS. The scenario is the following:

  1. LDS add listener L1 with config_discovery for x with type URL {T1, T2}
  2. ECDS update for x with type URL T1 (acked)
  3. LDS add listener L2 with config_discovery for x with type URL T2.

I think this is similar to a delayed NACK situation since we want the management server to push a new ECDS update for x with type URL T2, that validates both L1 and L2. This PR will make L2 to use the default config, but it does not indicate to the management server than a new ECDS update is needed. Granted, this is a fairly convoluted example. The metric would be useful, but is the assumption here, that some higher level controller (human?) is going to look for it and force push ECDS?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added "config_conflict" stat for ECDS resource. I think keying by ECDS is a bit more actionable, since it is clear why resource is the conflict (LDS tends to be more source-of-truth). LMK.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example, is "x" actually different in step 4? I.e. it has been deliberately changed by the management server and when server in hypothetical step 4, it's really just a problem of 3 and 4 racing? Or is "x" somehow polymorphic and could be T1 or T2 and it gets decided based on client state?

This seems a pretty weird situation but I'm likely missing something. Feel free to ping me IM so we can figure this out with lower RTT.

Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov
Copy link
Copy Markdown
Contributor Author

kyessenov commented Mar 12, 2021 via email

@htuch
Copy link
Copy Markdown
Member

htuch commented Mar 12, 2021

@kyessenov I don't see what would possibly cause the Envoy to re-request x in this situation, so I assume it must be the server pushing an update of x. I think having L2 using the default config is semantically correct, since this is what it would do is x wasn't present. Presumably the management server sends the updated x shortly after and both L1 and L2 switch to it?

This is the moral equivalent of returning 4xx/5xx when cluster or endpoint resources are missing due to eventual consistency. I think in those cases, monitoring would be responsible for spotting the problem.

@kyessenov
Copy link
Copy Markdown
Contributor Author

kyessenov commented Mar 12, 2021 via email

@htuch
Copy link
Copy Markdown
Member

htuch commented Mar 12, 2021

@kyessenov I see the metric as not being used by the management server, instead it would be used to diagnose that sometimes there are consistency problems and to fix the management server to stop these happening in the future. Does that seems a reasonable flow?

@kyessenov
Copy link
Copy Markdown
Contributor Author

kyessenov commented Mar 12, 2021 via email

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kyessenov thanks for all the great discussion explaining the intuition. Could you capture it in a comment somewhere so that this lives for posterity near the code in question? Ta.
/wait

bool last_config_valid = false;
if (subscription->lastConfig().has_value()) {
try {
provider->validateTypeUrl(subscription->lastTypeUrl());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a value in moving this to returning true/false?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it, but the exception contains the list of required type URLs, which is useful. Bool would be less actionable.

Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@snowp snowp merged commit 1ecac9e into envoyproxy:main Mar 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dynamic filter config provider provides empty config when listener updates

3 participants