
cluster manager: fix crash related to "local cluster" handling #1275

Merged
mattklein123 merged 8 commits into master from fix_local_cluster_bug
Jul 18, 2017

Conversation

@mattklein123 (Member)

This commit fixes a crashing issue in which a CDS cluster is removed,
leaving behind a stale member update callback inside the "local cluster."
The fix is to allow member update callbacks to be removed. However, as
part of investigating this issue I uncovered several related issues:

  1. We will leak TLS slots in several places when using LDS.
  2. Redis is broken in two ways regarding LDS/CDS. This commit prevents
    one of them from happening by just prohibiting Redis from working with
    a CDS cluster (for now), but LDS is still broken in a similar way WRT
    the member update callbacks.

Fixing both of the above issues is not trivial so I will do them in follow
ups since this change is large enough as it is. Also, fixing the Redis LDS
case will make a portion of this fix cleaner.

Fixes #1254

@mattklein123 (Member Author)

@lyft/network-team @htuch

@htuch I will look through the other reviews tomorrow if you feel like taking a pass on this one. This turned into a pretty complicated issue and will need several follow ups.

@danielhochman FYI re: the Redis stuff. Please take a look. We can discuss offline.

@jwfang FYI

@htuch htuch left a comment (Member)

Good catch! Code looks good, lots of churn due to the added_via_api propagation.

```cpp
cluster_list->remove(&cluster);
ENVOY_LOG(info, "cm init: removing: cluster={} primary={} secondary={}", cluster.info()->name(),
          primary_init_clusters_.size(), secondary_init_clusters_.size());
ENVOY_LOG(info, "cm init: init complete: cluster={} primary={} secondary={}",
```
Member

Nit: Is there much value in removing the "removing" here?

Member Author

Personally, I found "removing" to be confusing when I was debugging this. It seemed like the cluster was being removed. Open to other wordings?

Member

This is actually fine, I wrote that comment oblivious to the fact that we're in ClusterManagerInitHelper.


```cpp
ThreadLocalClusterManagerImpl& cluster_manager =
    tls_.getTyped<ThreadLocalClusterManagerImpl>(thread_local_slot_);

if (cluster_manager.thread_local_clusters_.count(new_cluster->name())) {
```
Member

Nit: prefer > 0

```cpp
  throw EnvoyException(fmt::format("redis filter config: unknown cluster '{}'", cluster_name_));
}

if (cluster->info()->addedViaApi()) {
```
Member

I wonder if it's making things a bit messy to have to track this everywhere. The main use of it is to disallow registering Redis temporarily, and some assertions. As we head towards v2, everything is going to be dynamic, even if it's sourced from a file. Is there a strong case to maintain this distinction across the call stack and objects?

Member Author

I went back and forth on this. I agree it's a lot of churn. On the other hand, I think it does have value in that I was planning on making sure that the RDS cluster for example references a static cluster that can't be removed. I also think that giving the people the ability to write filters that don't work with CDS is probably OK as long as they have error/config checking. So, I think I'm still leaning towards adding it, but I could be swayed either way.

Member

OK, in v2, my thoughts were that clusters that have to live forever get put in https://github.com/lyft/envoy-api/blob/master/api/bootstrap.proto#L25. Everything else is allowed to be dynamically updated, even if it's on the filesystem (the file might be replaced, inotify triggers). From that perspective, filters should be capable of fully dynamic behavior in most cases.

How come RDS can't work with a dynamic cluster BTW? Seems there should be no hard dependency there. I can see why CDS needs to be a static cluster :)

Member Author

RDS should "work" in that it won't crash (I think), but I think people would generally find this behavior extremely confusing if they accidentally deleted the RDS cluster.

Anyhow, I guess I was erring on the side of trying to raise issues to users faster to avoid support burden (and potentially allow people to build things in a simpler way if they want), but given that Redis is not officially supported yet, I can revert this part of the change if you don't think it's useful in general. I don't feel that strongly about it. Let me know. Note that if we revert this part of the change I will definitely work on fixing the Redis code to support cluster removal, since it absolutely will crash in bad ways right now.

Member

I think it's fine for now as is. My main concern beyond the churn was that there might be some disconnect over where we're headed with static vs. dynamic resources. I've created https://github.com/lyft/envoy-api/issues/112 to track.

```cpp
 * @param hosts_added supplies the added hosts.
 * @param hosts_removed supplies the removed hosts.
 */
void runCallbacks(const std::vector<HostSharedPtr>& hosts_added,
```
Member

Wondering if we can make this just a generic callback helper class using varargs magic..

Member Author

I am actually planning on doing this in a follow up commit. I need more callbacks with removals to correctly deal with some of the TODOs so will do this then.


```cpp
LoadBalancerBase::~LoadBalancerBase() {
  if (local_host_set_member_update_cb_handle_ != nullptr) {
    local_host_set_->removeMemberUpdateCb(local_host_set_member_update_cb_handle_);
```
Member

Wondering if we could use RAII on the handles to have them clean themselves up on object destruction..

Member Author

My first attempt at fixing this was RAII where the handles cleaned up during destruction. This led to a bunch of other issues that I won't go into, but it turned very ugly during shutdown due to lack of ordering (basically, universally unregistering all callbacks is tough with the current code). I can definitely do something like add a remove() method on the handle, and hold that in an RAII holder. I can probably also fix it to all work with 100% RAII cleanup, but that will be a larger change. My plan was to try to fix this in a series of follow ups. I'm happy to try to do it all in this change, but all of the TODOs are kind of related and the change will be very large. Probably better to do it in pieces.

Member

Yeah, maybe just a remove() and RAII wrapper where appropriate is fine here. It only makes sense anyway if it's less error prone or simpler than the alternative of manual reclamation.

@mattklein123 (Member Author)

@htuch updated

@htuch htuch left a comment (Member)

The generalized callback manager is great.

```cpp
#include "common/common/assert.h"

namespace Envoy {
namespace Upstream {
```
Member

s/Upstream/Common/


```
envoy_cc_library(
    name = "callback_impl_lib",
    hdrs = ["callback_impl.h"],
```
Member

Nit: should have a dep on :assert_lib.
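For reference, the nit would look something like this in the BUILD target quoted above (the dep label is taken from the comment; the exact form in the repo may differ):

```
envoy_cc_library(
    name = "callback_impl_lib",
    hdrs = ["callback_impl.h"],
    deps = [":assert_lib"],
)
```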

```cpp
 */
void runCallbacks(CallbackArgs... args) {
  for (const auto& cb : callbacks_) {
    cb.cb_(args...);
```
Member

If you add tests for this, a good one would be to validate that removing yourself from the list in the context of a callback is safe, which it should be, as the entry isn't used outside that call stack.

Member Author

Sure can add a quick test.

Member Author

Actually I think it's probably not safe the way the code is written but will fix if not.

@mattklein123 (Member Author)

@htuch updated again. I also made the template interface a bit nicer.

@mattklein123 mattklein123 merged commit daaf87a into master Jul 18, 2017
@mattklein123 mattklein123 deleted the fix_local_cluster_bug branch July 18, 2017 23:51
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: fixing bazel target with explicit stamp. Also updating docs with information about debug symbols.
Risk Level: low
Testing: local binary size analysis
Docs Changes: updated relevant docs.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>

Development

Successfully merging this pull request may close these issues.

weird "pure virtual method called" fault

2 participants