Skip to content

fix(agent): preserve skill commands on scan failure + log diagnostic warning#18669

Closed
alexzhu0 wants to merge 4 commits into
NousResearch:mainfrom
alexzhu0:fix/scan-skill-commands-atomic
Closed

fix(agent): preserve skill commands on scan failure + log diagnostic warning#18669
alexzhu0 wants to merge 4 commits into
NousResearch:mainfrom
alexzhu0:fix/scan-skill-commands-atomic

Conversation

@alexzhu0

@alexzhu0 alexzhu0 commented May 2, 2026

Copy link
Copy Markdown
Contributor

Closes #18659.

Symptom

Any exception during scan_skill_commands() silently blanks all 90+ skill slash commands (/apple-notes, /claude-code, etc.) with zero user-facing error. The user sees their skills disappear but has no log line to diagnose it. Reporter's reproducer: chmod 000 ~/.hermes/skills; hermes → all /skill-name autocompletions gone.

Root cause

agent/skill_commands.py:222 unconditionally clears the global before entering the try block:

```python
_skill_commands = {} # ← cleared FIRST
try:
from tools.skills_tool import ...
... populate _skill_commands ...
except Exception:
pass # ← swallowed, global stays {}
return _skill_commands
```

Any failure in the imports, outer scan loop, or get_external_skills_dirs() leaves the global at {}, and the except: pass swallows the reason.

Fix

Atomic swap — build into local new_commands, only assign to the global on success. Failure returns the last-known-good mapping.

Diagnostic log — replace except Exception: pass with logger.warning(..., exc_info=True), gated by a module-level _scan_error_logged flag so get_skill_commands() hot-loop callers don't flood the log. First failure emits a full traceback; subsequent failures stay silent until a successful scan resets the flag.

Before / after

Scenario Before After
First scan hits unreadable skills dir global = {}, all slash commands gone, no log global unchanged at initial {}, one WARNING traceback
Scan starts healthy then dir breaks mid-session global wiped to {}, 90 skills gone global retains last-known-good 90 entries, one WARNING traceback
Perms restored + /reload-skills shows 90 as "added" from empty before-snapshot (misleading diff) shows correct no-op diff when state was preserved
Persistent failure, 1000 calls to get_skill_commands() 1000× traceback log flood 1× traceback, then silent

Tests

5 new, 10 existing → 15/15 pass on TestScanSkillCommands (40/40 on full file):

  • test_preserves_previous_mapping_on_import_failure
  • test_logs_warning_on_scan_failure
  • test_repeated_scan_failures_do_not_flood_logs
  • test_scan_error_flag_resets_on_success
  • test_successful_scan_still_replaces_mapping

Cross-refs

@alt-glitch noted #18656 (closed, same report) and #2868 (open PR addressing the logging dimension). This PR addresses the core atomic-swap bug plus the diagnostic log dimension together with hot-loop-safe guarding; happy to rebase if #2868's logging approach lands first.

Files

  • agent/skill_commands.py: +17/-3
  • tests/agent/test_skill_commands.py: +132

…warning

Closes #18659.

## Symptom

Any exception during `scan_skill_commands()` silently blanks all 90+
skill slash commands (`/apple-notes`, `/claude-code`, etc.) with zero
user-facing error. Users notice their skills are gone but have no log
line to diagnose why.

## Root cause

`agent/skill_commands.py:222` unconditionally clears the global BEFORE
the try block that populates it:

```python
_skill_commands = {}           # ← cleared FIRST
try:
    from tools.skills_tool import ...
    ... populate _skill_commands ...
except Exception:
    pass                        # ← swallowed, _skill_commands stays empty
return _skill_commands
```

Any failure in the import block or the outer scan loop (e.g.
`chmod 000 ~/.hermes/skills`, a broken `external_dirs` entry, disk
read error) leaves the global at `{}`.

## Fix

1. **Atomic swap** — build into a local `new_commands` dict; only
   assign to the global AFTER success. On failure, return the
   last-known-good mapping so transient errors don't cascade into
   "all slash commands vanished".

2. **Diagnostic log** — replace `except Exception: pass` with a
   `logger.warning(..., exc_info=True)`. Gated by a module-level
   `_scan_error_logged` flag so `get_skill_commands()` hot-loop
   callers don't flood the log — first failure emits a full
   traceback, subsequent failures are silent until a successful
   scan resets the flag.

## Reproducer (from issue)

```
chmod 000 ~/.hermes/skills
hermes
# All skill slash commands now unavailable, no log line
```

Before fix: `_skill_commands = {}`, every `/skill` completes silently.
After fix: `_skill_commands` retains its last-known-good content, and
a single WARNING line names the exception.

## Tests

5 new + 10 existing → 15/15 pass. Full file: 40/40 pass.
- `test_preserves_previous_mapping_on_import_failure`
- `test_logs_warning_on_scan_failure`
- `test_repeated_scan_failures_do_not_flood_logs` (first-failure guard)
- `test_scan_error_flag_resets_on_success`
- `test_successful_scan_still_replaces_mapping` (atomic-swap regression)

## Files

- `agent/skill_commands.py`: +17/-3 (atomic swap + guarded log)
- `tests/agent/test_skill_commands.py`: +132

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder tool/skills Skills system (list, view, manage) labels May 2, 2026
…ands-atomic

# Conflicts:
#	agent/skill_commands.py
@alexzhu0

alexzhu0 commented May 3, 2026

Copy link
Copy Markdown
Contributor Author

Rebased on top of #18739 (c73594fe4, just merged) — resolved cleanly. Both fixes are complementary and now compose:

After the merge, scan_skill_commands updates _skill_commands_platform first, then atomically swaps _skill_commands only on success. Failure path returns the last-known-good mapping with a one-time WARNING.

Tests: pytest tests/agent/test_skill_commands.py → 41/41 pass post-merge.

@alexzhu0

Copy link
Copy Markdown
Contributor Author

Status / gentle ping — opened 2026-05-02, no review yet. Closes #18659.

This is a small but high-impact reliability fix. Today, any exception during scan_skill_commands() silently blanks all 90+ skill slash commands (/apple-notes, /claude-code, etc.) with zero user-facing error. The reporter's reproducer is chmod 000 ~/.hermes/skills/apple/ — agent boots fine, but every skill is gone and there's no log line to diagnose it.

The fix:

  1. Catch + log the diagnostic warning instead of swallowing
  2. Preserve previously-scanned commands on partial failure rather than blanking the whole table

Re: failing CI — only test is red, and the failures are pre-existing on main (Anthropic beta headers, max_completion_tokens, matrix DM thread). Skill-scan tests pass. Happy to rebase.

cc @teknium1 / @alt-glitch — would be great to land before another user hits the silent blank.

@alt-glitch

Copy link
Copy Markdown
Collaborator

@daimon-nous review and reproduce please!

@daimon-nous

daimon-nous Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

PR #18669@alexzhu0
What it does: Atomic-swap for scan_skill_commands() — builds into local new_commands dict, only assigns to the global on success. Adds diagnostic logger.warning with a log-once guard (_scan_error_logged) so hot-loop callers don't flood.
What it solves: Any exception during skill scanning silently blanks all 90+ skill slash commands (_skill_commands = {} before the try block, except Exception: pass swallows the failure).
How: +17/−3 in agent/skill_commands.py, +132 in tests. 5 new tests, 41/41 pass.

Reproduction: Bug confirmed on current main — seeding a good mapping then triggering get_external_skills_dirs to raise returns {} (all commands lost). On PR branch, preserved mapping is returned correctly.

Code quality: Clean and correct.

  • Atomic swap is GIL-safe for concurrent readers (name rebind, not in-place mutation).
  • Log-once flag resets on success, so future failures remain audible. exc_info=True gives a full traceback on first failure.
  • Tests cover: preservation on failure, warning log emission, flood prevention (5 calls → 1 log), flag reset on success, happy-path full replacement (stale entries don't persist).

Two minor items (non-blocking, can be fixed in follow-up):

  1. Stale comment at line 384–385 (reload_skills()): says scan_skill_commands resets _skill_commands = {} internally and repopulates it — no longer accurate after the atomic-swap change.

  2. _skill_commands_platform set before the try block (line 253): On failure, the platform marker advances but the preserved mapping may be from a different platform scope. This suppresses the re-scan in get_skill_commands() for the edge case of multi-platform gateway + scan failure during platform switch. Strictly better than the old behavior (empty dict), but worth a # NOTE: comment or moving the marker into the success path.

Cross-refs:

Verdict: Good to salvage. The fix is correct, well-tested, and the two items above are follow-up material.

@alexzhu0

Copy link
Copy Markdown
Contributor Author

Hi @daimon-nous — following up on your 2026-05-15 review (thank you for the salvage verdict 🙏). The two minor items are easy:

  1. Stale comment in reload_skills() (line 384-385) — I'll update it to reflect atomic-swap semantics
  2. _skill_commands_platform marker before the try block (line 253) — I'll add a # NOTE: clarifying the suppress-re-scan trade-off, or move the marker into the success path if you'd prefer

Happy to push either as a small follow-up commit on this branch — let me know your preference (NOTE comment vs. moving the marker) and I'll have it up within the day. Branch is still MERGEABLE against current main.

alexzhu0 added 2 commits May 23, 2026 17:34
reload_skills()'s comment said scan_skill_commands resets _skill_commands
to {} and repopulates. After the atomic-swap refactor that's no longer
true: the function builds a fresh mapping locally and only swaps it in on
success, preserving the previous mapping on failure. Update the comment
to match.

Addresses one of two nits from review on #18669.
Add a NOTE explaining why _skill_commands_platform is set before the
build try-block: a platform change is reflected even when the build
fails and we keep the previous mapping. The marker is advisory while
the mapping is authoritative, so brief disagreement during a failed
rescan is intentional.

Addresses second nit from review on #18669 (variant A: comment-only).
@alexzhu0

Copy link
Copy Markdown
Contributor Author

@daimon-nous — both follow-up items addressed (docs-only, no behavior change):

  1. Stale comment in reload_skills() (line 384–385): updated to reflect atomic-swap semantics — scan_skill_commands now builds a fresh mapping locally and atomically swaps it on success; on failure the previous mapping is preserved.

  2. _skill_commands_platform marker before the try block (line 253): added a # NOTE: explaining the trade-off — marker advances before the build so subsequent platform changes are reflected even on failure; the marker is advisory metadata, while the mapping is the authoritative state callers depend on.

Two new commits pushed:

  • c57634cbd docs(skill_commands): correct stale comment about reset semantics
  • 61539dd4e docs(skill_commands): document platform marker timing trade-off

Tests unchanged (still 41/41 pass). Ready for merge when convenient.

@alexzhu0 alexzhu0 closed this by deleting the head repository May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists tool/skills Skills system (list, view, manage) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: scan_skill_commands unconditionally clears _skill_commands before try block, silently loses all skills on scan failure

2 participants