fix(agent): preserve skill commands on scan failure + log diagnostic warning by alexzhu0 · Pull Request #18669 · NousResearch/hermes-agent

alexzhu0 · 2026-05-02T05:23:33Z

Symptom

Any exception during scan_skill_commands() silently blanks all 90+ skill slash commands (/apple-notes, /claude-code, etc.) with zero user-facing error. The user sees their skills disappear but has no log line to diagnose it. Reporter's reproducer: chmod 000 ~/.hermes/skills; hermes → all /skill-name autocompletions gone.

Root cause

agent/skill_commands.py:222 unconditionally clears the global before entering the try block:

```python
_skill_commands = {} # ← cleared FIRST
try:
from tools.skills_tool import ...
... populate _skill_commands ...
except Exception:
pass # ← swallowed, global stays {}
return _skill_commands
```

Any failure in the imports, outer scan loop, or get_external_skills_dirs() leaves the global at {}, and the except: pass swallows the reason.

Fix

Atomic swap — build into local new_commands, only assign to the global on success. Failure returns the last-known-good mapping.

Diagnostic log — replace except Exception: pass with logger.warning(..., exc_info=True), gated by a module-level _scan_error_logged flag so get_skill_commands() hot-loop callers don't flood the log. First failure emits a full traceback; subsequent failures stay silent until a successful scan resets the flag.

Before / after

Scenario	Before	After
First scan hits unreadable skills dir	global = `{}`, all slash commands gone, no log	global unchanged at initial `{}`, one WARNING traceback
Scan starts healthy then dir breaks mid-session	global wiped to `{}`, 90 skills gone	global retains last-known-good 90 entries, one WARNING traceback
Perms restored + `/reload-skills`	shows 90 as "added" from empty before-snapshot (misleading diff)	shows correct no-op diff when state was preserved
Persistent failure, 1000 calls to `get_skill_commands()`	1000× traceback log flood	1× traceback, then silent

Tests

5 new, 10 existing → 15/15 pass on TestScanSkillCommands (40/40 on full file):

test_preserves_previous_mapping_on_import_failure
test_logs_warning_on_scan_failure
test_repeated_scan_failures_do_not_flood_logs
test_scan_error_flag_resets_on_success
test_successful_scan_still_replaces_mapping

Cross-refs

@alt-glitch noted #18656 (closed, same report) and #2868 (open PR addressing the logging dimension). This PR addresses the core atomic-swap bug plus the diagnostic log dimension together with hot-loop-safe guarding; happy to rebase if #2868's logging approach lands first.

Files

agent/skill_commands.py: +17/-3
tests/agent/test_skill_commands.py: +132

…warning Closes #18659. ## Symptom Any exception during `scan_skill_commands()` silently blanks all 90+ skill slash commands (`/apple-notes`, `/claude-code`, etc.) with zero user-facing error. Users notice their skills are gone but have no log line to diagnose why. ## Root cause `agent/skill_commands.py:222` unconditionally clears the global BEFORE the try block that populates it: ```python _skill_commands = {} # ← cleared FIRST try: from tools.skills_tool import ... ... populate _skill_commands ... except Exception: pass # ← swallowed, _skill_commands stays empty return _skill_commands ``` Any failure in the import block or the outer scan loop (e.g. `chmod 000 ~/.hermes/skills`, a broken `external_dirs` entry, disk read error) leaves the global at `{}`. ## Fix 1. **Atomic swap** — build into a local `new_commands` dict; only assign to the global AFTER success. On failure, return the last-known-good mapping so transient errors don't cascade into "all slash commands vanished". 2. **Diagnostic log** — replace `except Exception: pass` with a `logger.warning(..., exc_info=True)`. Gated by a module-level `_scan_error_logged` flag so `get_skill_commands()` hot-loop callers don't flood the log — first failure emits a full traceback, subsequent failures are silent until a successful scan resets the flag. ## Reproducer (from issue) ``` chmod 000 ~/.hermes/skills hermes # All skill slash commands now unavailable, no log line ``` Before fix: `_skill_commands = {}`, every `/skill` completes silently. After fix: `_skill_commands` retains its last-known-good content, and a single WARNING line names the exception. ## Tests 5 new + 10 existing → 15/15 pass. Full file: 40/40 pass. - `test_preserves_previous_mapping_on_import_failure` - `test_logs_warning_on_scan_failure` - `test_repeated_scan_failures_do_not_flood_logs` (first-failure guard) - `test_scan_error_flag_resets_on_success` - `test_successful_scan_still_replaces_mapping` (atomic-swap regression) ## Files - `agent/skill_commands.py`: +17/-3 (atomic swap + guarded log) - `tests/agent/test_skill_commands.py`: +132 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ands-atomic # Conflicts: # agent/skill_commands.py

alexzhu0 · 2026-05-03T15:13:33Z

Rebased on top of #18739 (c73594fe4, just merged) — resolved cleanly. Both fixes are complementary and now compose:

fix(skills): rescan skill_commands cache when platform scope changes (salvage #14570) #18739 (@teknium1): rescan when platform scope changes (fixes skill_commands cache can leak one platform's disabled-skill view into another #14536 — gateway-cross-platform stale cache)
fix(agent): preserve skill commands on scan failure + log diagnostic warning #18669 (this): atomic-swap so a scan failure doesn't blank the global; first-failure-only diagnostic log

After the merge, scan_skill_commands updates _skill_commands_platform first, then atomically swaps _skill_commands only on success. Failure path returns the last-known-good mapping with a one-time WARNING.

Tests: pytest tests/agent/test_skill_commands.py → 41/41 pass post-merge.

alexzhu0 · 2026-05-15T15:21:41Z

Status / gentle ping — opened 2026-05-02, no review yet. Closes #18659.

This is a small but high-impact reliability fix. Today, any exception during scan_skill_commands() silently blanks all 90+ skill slash commands (/apple-notes, /claude-code, etc.) with zero user-facing error. The reporter's reproducer is chmod 000 ~/.hermes/skills/apple/ — agent boots fine, but every skill is gone and there's no log line to diagnose it.

The fix:

Catch + log the diagnostic warning instead of swallowing
Preserve previously-scanned commands on partial failure rather than blanking the whole table

Re: failing CI — only test is red, and the failures are pre-existing on main (Anthropic beta headers, max_completion_tokens, matrix DM thread). Skill-scan tests pass. Happy to rebase.

cc @teknium1 / @alt-glitch — would be great to land before another user hits the silent blank.

alt-glitch · 2026-05-15T15:37:47Z

@daimon-nous review and reproduce please!

daimon-nous · 2026-05-15T15:44:16Z

PR #18669 — @alexzhu0
What it does: Atomic-swap for scan_skill_commands() — builds into local new_commands dict, only assigns to the global on success. Adds diagnostic logger.warning with a log-once guard (_scan_error_logged) so hot-loop callers don't flood.
What it solves: Any exception during skill scanning silently blanks all 90+ skill slash commands (_skill_commands = {} before the try block, except Exception: pass swallows the failure).
How: +17/−3 in agent/skill_commands.py, +132 in tests. 5 new tests, 41/41 pass.

Reproduction: Bug confirmed on current main — seeding a good mapping then triggering get_external_skills_dirs to raise returns {} (all commands lost). On PR branch, preserved mapping is returned correctly.

Code quality: Clean and correct.

Atomic swap is GIL-safe for concurrent readers (name rebind, not in-place mutation).
Log-once flag resets on success, so future failures remain audible. exc_info=True gives a full traceback on first failure.
Tests cover: preservation on failure, warning log emission, flood prevention (5 calls → 1 log), flag reset on success, happy-path full replacement (stale entries don't persist).

Two minor items (non-blocking, can be fixed in follow-up):

Stale comment at line 384–385 (reload_skills()): says scan_skill_commands resets _skill_commands = {} internally and repopulates it — no longer accurate after the atomic-swap change.
_skill_commands_platform set before the try block (line 253): On failure, the platform marker advances but the preserved mapping may be from a different platform scope. This suppresses the re-scan in get_skill_commands() for the edge case of multi-platform gateway + scan failure during platform switch. Strictly better than the old behavior (empty dict), but worth a # NOTE: comment or moving the marker into the success path.

Cross-refs:

Closes [Bug]: scan_skill_commands unconditionally clears _skill_commands before try block, silently loses all skills on scan failure #18659 ✓
Subsumes the scan_skill_commands portion of fix(skills): log warnings instead of silently swallowing exceptions in skill_commands.py #2868 (which only adds logging without the atomic-swap fix)
Duplicate of closed fix(skill_commands): preserve cache when scan fails #22514 (same atomic-swap idea, different author)

Verdict: Good to salvage. The fix is correct, well-tested, and the two items above are follow-up material.

alexzhu0 · 2026-05-23T09:15:44Z

Hi @daimon-nous — following up on your 2026-05-15 review (thank you for the salvage verdict 🙏). The two minor items are easy:

Stale comment in reload_skills() (line 384-385) — I'll update it to reflect atomic-swap semantics
_skill_commands_platform marker before the try block (line 253) — I'll add a # NOTE: clarifying the suppress-re-scan trade-off, or move the marker into the success path if you'd prefer

Happy to push either as a small follow-up commit on this branch — let me know your preference (NOTE comment vs. moving the marker) and I'll have it up within the day. Branch is still MERGEABLE against current main.

reload_skills()'s comment said scan_skill_commands resets _skill_commands to {} and repopulates. After the atomic-swap refactor that's no longer true: the function builds a fresh mapping locally and only swaps it in on success, preserving the previous mapping on failure. Update the comment to match. Addresses one of two nits from review on #18669.

Add a NOTE explaining why _skill_commands_platform is set before the build try-block: a platform change is reflected even when the build fails and we keep the previous mapping. The marker is advisory while the mapping is authoritative, so brief disagreement during a failed rescan is intentional. Addresses second nit from review on #18669 (variant A: comment-only).

alexzhu0 · 2026-05-24T07:12:20Z

@daimon-nous — both follow-up items addressed (docs-only, no behavior change):

Stale comment in reload_skills() (line 384–385): updated to reflect atomic-swap semantics — scan_skill_commands now builds a fresh mapping locally and atomically swaps it on success; on failure the previous mapping is preserved.
_skill_commands_platform marker before the try block (line 253): added a # NOTE: explaining the trade-off — marker advances before the build so subsequent platform changes are reflected even on failure; the marker is advisory metadata, while the mapping is the authoritative state callers depend on.

Two new commits pushed:

c57634cbd docs(skill_commands): correct stale comment about reset semantics
61539dd4e docs(skill_commands): document platform marker timing trade-off

Tests unchanged (still 41/41 pass). Ready for merge when convenient.

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder tool/skills Skills system (list, view, manage) labels May 2, 2026

alt-glitch mentioned this pull request May 2, 2026

fix(skills): log warnings instead of silently swallowing exceptions in skill_commands.py #2868

Closed

9 tasks

Merge remote-tracking branch 'upstream/main' into fix/scan-skill-comm…

6a4c51c

…ands-atomic # Conflicts: # agent/skill_commands.py

alexzhu0 added 2 commits May 23, 2026 17:34

alexzhu0 closed this by deleting the head repository May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent): preserve skill commands on scan failure + log diagnostic warning#18669

fix(agent): preserve skill commands on scan failure + log diagnostic warning#18669
alexzhu0 wants to merge 4 commits into
NousResearch:mainfrom
alexzhu0:fix/scan-skill-commands-atomic

alexzhu0 commented May 2, 2026

Uh oh!

alexzhu0 commented May 3, 2026

Uh oh!

alexzhu0 commented May 15, 2026

Uh oh!

alt-glitch commented May 15, 2026

Uh oh!

daimon-nous Bot commented May 15, 2026

Uh oh!

alexzhu0 commented May 23, 2026

Uh oh!

alexzhu0 commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alexzhu0 commented May 2, 2026

Symptom

Root cause

Fix

Before / after

Tests

Cross-refs

Files

Uh oh!

alexzhu0 commented May 3, 2026

Uh oh!

alexzhu0 commented May 15, 2026

Uh oh!

alt-glitch commented May 15, 2026

Uh oh!

daimon-nous Bot commented May 15, 2026

Uh oh!

alexzhu0 commented May 23, 2026

Uh oh!

alexzhu0 commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants