fix(delegate): _load_config reads from disk every call, not cached CLI_CONFIG #15540

Open
it-helloprint wants to merge 1 commit into NousResearch:main from it-helloprint:fix/delegate-load-config-refresh-from-disk

@it-helloprint

Summary

_load_config() in tools/delegate_tool.py reads cli.CLI_CONFIG before falling back to hermes_cli.config.load_config(). Because CLI_CONFIG is populated exactly once at cli.py module import and never refreshed, long-running gateway processes (Discord / Telegram / Slack) can't see edits made to ~/.hermes/config.yaml's delegation.model / delegation.provider after startup. The result is a silent fall-back: users edit the config, restart nothing, and subagent dispatches silently inherit the parent session's model instead of the configured override.

This flips the fallback order so load_config() (which reads the file on every call) is tried first, and CLI_CONFIG is only the backup.

Observed Symptom (production Discord gateway, 2026-04-25)

  • Gateway running as anthropic/claude-opus-4.7
  • ~/.hermes/config.yaml: delegation.model: anthropic/claude-sonnet-4.6, delegation.provider: openrouter
  • delegate_task dispatches returned envelopes with "model": "anthropic/claude-opus-4.7" despite the pin
  • SQLite sessions table confirmed: all child sessions recorded the parent's model, not the configured one

Verified root cause:

  1. _load_config() returned CLI_CONFIG["delegation"], a dict frozen when cli.py was imported on April 22 (when delegation.model / delegation.provider were still empty strings).
  2. _resolve_delegation_credentials() read configured_model = None, configured_provider = None.
  3. Hit the "No provider override — child inherits everything from parent" branch (line ~2307).
  4. creds["model"] = None → effective_model = model or parent_agent.model (line ~321 in _build_child_agent) → child ran Opus.
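
The chain above can be modelled with a minimal sketch. The helper names (`resolve_credentials`, `pick_effective_model`) are hypothetical simplifications, not the real functions in tools/delegate_tool.py, and the sketch assumes empty strings normalise to None, as steps 1 and 2 suggest.

```python
from types import SimpleNamespace


def resolve_credentials(delegation_cfg: dict) -> dict:
    """Hypothetical stand-in for the credential resolution step.

    Empty or missing values are treated as "no override" (None).
    """
    model = (delegation_cfg.get("model") or "").strip() or None
    provider = (delegation_cfg.get("provider") or "").strip() or None
    return {"model": model, "provider": provider}


def pick_effective_model(creds: dict, parent) -> str:
    """Mirror of the `model or parent_agent.model` fallthrough."""
    return creds["model"] or parent.model


# Parent session as described in the incident.
parent = SimpleNamespace(model="anthropic/claude-opus-4.7")

# Stale snapshot: the delegation keys were still empty at import time.
stale_cfg = {"model": "", "provider": ""}
# Live file contents after the user's edit.
live_cfg = {"model": "anthropic/claude-sonnet-4.6", "provider": "openrouter"}

stale_child_model = pick_effective_model(resolve_credentials(stale_cfg), parent)
live_child_model = pick_effective_model(resolve_credentials(live_cfg), parent)
```

With the stale snapshot the child ends up on the parent's model; with the live values it gets the configured override. The inheritance itself is not the bug; the stale input is.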

Cost Impact

The incident that triggered this investigation burned ~$700 of unintended Opus usage on Sonnet-suitable mechanical coding work (4 PRs: CSV streaming importer, data migration command, Filament Infolist grouping refactor, doc updates) before the routing miss was diagnosed. A user who has pinned a cheaper model is probably doing so because the task doesn't need Opus; they should not have to kill their gateway just to pick up a config edit.

Fix

hermes_cli.config.load_config() reads ~/.hermes/config.yaml on every call. Try that first. Only if the disk read fails / returns empty do we fall back to the frozen CLI_CONFIG.
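
A rough sketch of the disk-first order, for illustration only. `load_delegation_config` and its injected `read_disk` / `cached_config` parameters are hypothetical names chosen so the example runs standalone; the actual change lives in `_load_config()` and may differ in detail.

```python
from typing import Callable, Optional


def load_delegation_config(
    read_disk: Callable[[], Optional[dict]],
    cached_config: Optional[dict],
) -> dict:
    """Return the delegation block, preferring a fresh disk read.

    read_disk stands in for a loader that re-reads the config file on
    every call; cached_config stands in for the import-time snapshot.
    """
    try:
        fresh = read_disk() or {}
        delegation = fresh.get("delegation") or {}
        if delegation:
            return delegation
    except Exception:
        # Disk read failed: fall through to the cached snapshot.
        pass
    return (cached_config or {}).get("delegation") or {}


# Import-time snapshot with the old, empty delegation keys.
stale_snapshot = {"delegation": {"model": "", "provider": ""}}


def read_live_file() -> dict:
    # Simulates the file contents after the user's edit.
    return {
        "delegation": {
            "model": "anthropic/claude-sonnet-4.6",
            "provider": "openrouter",
        }
    }


def read_broken_file() -> dict:
    # Simulates an unreadable config file.
    raise OSError("config unreadable")


from_disk = load_delegation_config(read_live_file, stale_snapshot)
from_cache = load_delegation_config(read_broken_file, stale_snapshot)
```

Here `from_disk` carries the edited values, while `from_cache` shows the snapshot is still used when the file cannot be read.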

Cost of the extra read: a ~10 KB YAML file parsed once per delegate_task call (not per API hit, not per token). ~5 ms on a modern machine, on the delegation cold path. Rounding error relative to any LLM API call.

Diff Semantics

CLI users: No change. load_config() reads the same file as load_cli_config, so whatever CLI_CONFIG had, load_config() returns an equivalent delegation block.

Gateway users: Config edits take effect on the next delegate_task call instead of requiring a gateway restart.

Testing

Reproduced in isolation before the fix:

# With stale CLI_CONFIG simulating a long-running gateway:
creds = _resolve_delegation_credentials(_load_config(), parent)
# creds["model"] == None   <- bug

After the fix:

# Same setup, but _load_config now reads the live file:
creds = _resolve_delegation_credentials(_load_config(), parent)
# creds["model"] == "anthropic/claude-sonnet-4.6"  <- fixed

Full delegate_task flow exercised in /tmp/trace_delegation_full.py (not included in commit) confirms the child AIAgent's self.model stays sonnet-4.6 through construction and past the first API call.
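
For readers without the trace script, the stale-snapshot behaviour can be shown in isolation. This is a sketch under stated assumptions: JSON stands in for YAML so the example needs only the standard library, and `read_config` / `write_config` are hypothetical helpers, not Hermes functions.

```python
import json
import os
import tempfile


def read_config(path: str) -> dict:
    """Read the whole config file from disk (JSON stand-in for YAML)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def write_config(path: str, model: str, provider: str) -> None:
    """Overwrite the config file with a delegation block."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"delegation": {"model": model, "provider": provider}}, fh)


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")

    # "Startup": keys are empty and a snapshot is taken once.
    write_config(config_path, "", "")
    cached_snapshot = read_config(config_path)

    # The user edits the file while the process keeps running.
    write_config(config_path, "anthropic/claude-sonnet-4.6", "openrouter")

    # The snapshot never sees the edit; a fresh read does.
    cached_model = cached_snapshot["delegation"]["model"]
    live_model = read_config(config_path)["delegation"]["model"]
```

The snapshot keeps the empty value while the fresh read returns the edited one, which is the difference the fallback flip relies on.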

Docstring update

Added a section explaining the fallback order and the historical failure mode so future readers see why disk-first matters.

…I_CONFIG

The previous `_load_config()` read `cli.CLI_CONFIG` first — a module-level
dict populated ONCE at `cli.py` import. In long-running gateway processes
(Discord / Telegram / Slack) this meant any edits to
`~/.hermes/config.yaml`'s `delegation.model` / `delegation.provider` keys
were invisible to running subagents. Users would edit the config, restart
nothing (the docstring implied config was re-read on each dispatch), and
`delegate_task` calls would silently inherit the parent session's model
instead of the configured override.

Observed symptom (production Discord gateway, 2026-04-25):
- config.yaml `delegation.model: anthropic/claude-sonnet-4.6`
- Session running on `anthropic/claude-opus-4.7`
- Every `delegate_task` dispatch returned envelopes with
  `model: anthropic/claude-opus-4.7` despite the config pin
- Verified via sessions.db: all child sessions showed parent's model
- Verified root cause: `_resolve_delegation_credentials` received
  `configured_model = None` / `configured_provider = None` (the
  pre-April config defaults the frozen `CLI_CONFIG` still held), hit
  the "no provider override" branch, returned null creds, child fell
  through to `effective_model = model or parent_agent.model`

Cost impact: ~$700 of unintended Opus burn on Sonnet-suitable mechanical
coding work (4 PRs of CSV streaming / data migration / Filament refactor)
before the routing miss was diagnosed.

Fix: flip the fallback order so `load_config()` (which reads the live
file) is tried FIRST, and `CLI_CONFIG` is only the backup when file
read fails. The YAML file is <10 KB and `_load_config()` runs on the
cold path (once per `delegate_task` call, not per API hit), so the
~5 ms disk read is worth the correctness guarantee.

CLI path still works identically: `load_config()` reads the same file
that `load_cli_config` already reads. No behavioural change for CLI
users, just a guarantee for gateway users that config edits take
effect without a process restart.

Includes updated docstring explaining the fallback order and the
historical failure mode so future readers see why disk-first matters.
@alt-glitch added the type/bug, P2, comp/agent, tool/delegate, area/config, and duplicate labels on Apr 25, 2026
@alt-glitch

Collaborator

Likely duplicate of #12053 — same root cause: _load_config() returns stale CLI_CONFIG cached at import time, delegation.model edits ignored in long-running gateways. #12053 and #12941 are competing fixes for the same issue (#11999).
