
cluster manager: fix crash related to "local cluster" handling #1275

Merged
mattklein123 merged 8 commits into master from fix_local_cluster_bug
Jul 18, 2017

Conversation

@mattklein123 (Member)

This commit fixes a crashing issue in which a CDS cluster is removed,
leaving behind a stale member update callback inside the "local cluster."
The fix is to allow member update callbacks to be removed. However, as
part of investigating this issue I uncovered several related issues:

  1. We will leak TLS slots in several places when using LDS.
  2. Redis is broken in two ways regarding LDS/CDS. This commit prevents
    one of them from happening by just prohibiting Redis from working with
    a CDS cluster (for now), but LDS is still broken in a similar way WRT
    the member update callbacks.

Fixing both of the above issues is not trivial so I will do them in follow
ups since this change is large enough as it is. Also, fixing the Redis LDS
case will make a portion of this fix cleaner.

Fixes #1254

@mattklein123 (Member Author)

@lyft/network-team @htuch

@htuch I will look through the other reviews tomorrow if you feel like taking a pass on this one. This turned into a pretty complicated issue and will need several follow ups.

@danielhochman FYI re: the Redis stuff. Please take a look. We can discuss offline.

@jwfang FYI

@htuch htuch left a comment (Member)

Good catch! Code looks good, lots of churn due to the added_via_api propagation.

```cpp
cluster_list->remove(&cluster);
ENVOY_LOG(info, "cm init: removing: cluster={} primary={} secondary={}", cluster.info()->name(),
          primary_init_clusters_.size(), secondary_init_clusters_.size());
ENVOY_LOG(info, "cm init: init complete: cluster={} primary={} secondary={}",
```
Member

Nit: Is there much value in removing the "removing" here?

Member Author

Personally, I found "removing" to be confusing when I was debugging this. It seemed like the cluster was being removed. Open to other wordings?

Member

This is actually fine, I wrote that comment oblivious to the fact that we're in ClusterManagerInitHelper.


```cpp
ThreadLocalClusterManagerImpl& cluster_manager =
    tls_.getTyped<ThreadLocalClusterManagerImpl>(thread_local_slot_);

if (cluster_manager.thread_local_clusters_.count(new_cluster->name())) {
```
Member

Nit: prefer > 0

```cpp
  throw EnvoyException(fmt::format("redis filter config: unknown cluster '{}'", cluster_name_));
}

if (cluster->info()->addedViaApi()) {
```
Member

I wonder if it's making things a bit messy to have to track this everywhere. The main use of it is to disallow registering Redis temporarily, and some assertions. As we head towards v2, everything is going to be dynamic, even if it's sourced from a file. Is there a strong case to maintain this distinction across the call stack and objects?

Member Author

I went back and forth on this. I agree it's a lot of churn. On the other hand, I think it does have value in that I was planning on making sure that the RDS cluster for example references a static cluster that can't be removed. I also think that giving the people the ability to write filters that don't work with CDS is probably OK as long as they have error/config checking. So, I think I'm still leaning towards adding it, but I could be swayed either way.

Member

OK, in v2, my thoughts were that clusters that have to live forever get put in https://github.com/lyft/envoy-api/blob/master/api/bootstrap.proto#L25. Everything else is allowed to be dynamically updated, even if it's on the filesystem (the file might be replaced, inotify triggers). From that perspective, filters should be capable of fully dynamic behavior in most cases.

How come RDS can't work with a dynamic cluster BTW? Seems there should be no hard dependency there. I can see why CDS needs to be a static cluster :)

Member Author

RDS should "work" in that it won't crash (I think), but I think people would generally find this behavior extremely confusing if they accidentally deleted the RDS cluster.

Anyhow, I guess I was erring on the side of trying to raise issues to users faster to avoid support burden (and potentially allow people to build things in a simpler way if they want), but given that Redis is not officially supported yet, I can revert this part of the change if you don't think it's useful in general. I don't feel that strongly about it. Let me know. Note that if we revert this part of the change I will definitely work on fixing the Redis code to support cluster removal, since it absolutely will crash in bad ways right now.

Member

I think it's fine for now as is. My main concern beyond the churn was that there might be some disconnect over where we're headed with static vs. dynamic resources. I've created https://github.com/lyft/envoy-api/issues/112 to track.

```cpp
 * @param hosts_added supplies the added hosts.
 * @param hosts_removed supplies the removed hosts.
 */
void runCallbacks(const std::vector<HostSharedPtr>& hosts_added,
```
Member

Wondering if we can make this just a generic callback helper class using varargs magic..

Member Author

I am actually planning on doing this in a follow up commit. I need more callbacks with removals to correctly deal with some of the TODOs so will do this then.


```cpp
LoadBalancerBase::~LoadBalancerBase() {
  if (local_host_set_member_update_cb_handle_ != nullptr) {
    local_host_set_->removeMemberUpdateCb(local_host_set_member_update_cb_handle_);
```
Member

Wondering if we could use RAII on the handles to have them clean themselves up on object destruction..

Member Author

My first attempt at fixing this was RAII where the handles cleaned up during destruction. This led to a bunch of other issues that I won't go into, but it turned very ugly during shutdown due to lack of ordering (basically, universally unregistering all callbacks is tough with the current code). I can definitely do something like add a remove() method on the handle, and hold that in an RAII holder. I can probably also fix it to all work with 100% RAII cleanup, but that will be a larger change. My plan was to try to fix this in a series of follow ups. I'm happy to try to do it all in this change, but all of the TODOs are kind of related and the change will be very large. Probably better to do it in pieces.

Member

Yeah, maybe just a remove() and RAII wrapper where appropriate is fine here. It only makes sense anyway if it's less error prone or simpler than the alternative of manual reclamation.

@mattklein123 (Member Author)

@htuch updated

@htuch htuch left a comment (Member)

The generalized callback manager is great.

```cpp
#include "common/common/assert.h"

namespace Envoy {
namespace Upstream {
```
Member

s/Upstream/Common/


```
envoy_cc_library(
    name = "callback_impl_lib",
    hdrs = ["callback_impl.h"],
```
Member

Nit: should have a dep on :assert_lib.
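For reference, the nit would look something like this in the BUILD target quoted above (the dep label is taken from the comment; the exact form in the repo may differ):

```
envoy_cc_library(
    name = "callback_impl_lib",
    hdrs = ["callback_impl.h"],
    deps = [":assert_lib"],
)
```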

```cpp
 */
void runCallbacks(CallbackArgs... args) {
  for (const auto& cb : callbacks_) {
    cb.cb_(args...);
```
Member

If you add tests for this, a good one would be to validate that removing yourself from the list in the context of a callback is safe, which it should be, as the entry isn't used outside that call stack.

Member Author

Sure can add a quick test.

Member Author

Actually I think it's probably not safe the way the code is written but will fix if not.

@mattklein123 (Member Author)

@htuch updated again. I also made the template interface a bit nicer.

@mattklein123 mattklein123 merged commit daaf87a into master Jul 18, 2017
@mattklein123 mattklein123 deleted the fix_local_cluster_bug branch July 18, 2017 23:51
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: fixing bazel target with explicit stamp. Also updating docs with information about debug symbols.
Risk Level: low
Testing: local binary size analysis
Docs Changes: updated relevant docs.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>

Development

Successfully merging this pull request may close these issues.

weird "pure virtual method called" fault

2 participants