fix(agent): preserve skill commands on scan failure + log diagnostic warning#18669
fix(agent): preserve skill commands on scan failure + log diagnostic warning#18669alexzhu0 wants to merge 4 commits into
Conversation
…warning Closes #18659. ## Symptom Any exception during `scan_skill_commands()` silently blanks all 90+ skill slash commands (`/apple-notes`, `/claude-code`, etc.) with zero user-facing error. Users notice their skills are gone but have no log line to diagnose why. ## Root cause `agent/skill_commands.py:222` unconditionally clears the global BEFORE the try block that populates it: ```python _skill_commands = {} # ← cleared FIRST try: from tools.skills_tool import ... ... populate _skill_commands ... except Exception: pass # ← swallowed, _skill_commands stays empty return _skill_commands ``` Any failure in the import block or the outer scan loop (e.g. `chmod 000 ~/.hermes/skills`, a broken `external_dirs` entry, disk read error) leaves the global at `{}`. ## Fix 1. **Atomic swap** — build into a local `new_commands` dict; only assign to the global AFTER success. On failure, return the last-known-good mapping so transient errors don't cascade into "all slash commands vanished". 2. **Diagnostic log** — replace `except Exception: pass` with a `logger.warning(..., exc_info=True)`. Gated by a module-level `_scan_error_logged` flag so `get_skill_commands()` hot-loop callers don't flood the log — first failure emits a full traceback, subsequent failures are silent until a successful scan resets the flag. ## Reproducer (from issue) ``` chmod 000 ~/.hermes/skills hermes # All skill slash commands now unavailable, no log line ``` Before fix: `_skill_commands = {}`, every `/skill` completes silently. After fix: `_skill_commands` retains its last-known-good content, and a single WARNING line names the exception. ## Tests 5 new + 10 existing → 15/15 pass. Full file: 40/40 pass. - `test_preserves_previous_mapping_on_import_failure` - `test_logs_warning_on_scan_failure` - `test_repeated_scan_failures_do_not_flood_logs` (first-failure guard) - `test_scan_error_flag_resets_on_success` - `test_successful_scan_still_replaces_mapping` (atomic-swap regression) ## Files - `agent/skill_commands.py`: +17/-3 (atomic swap + guarded log) - `tests/agent/test_skill_commands.py`: +132 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ands-atomic # Conflicts: # agent/skill_commands.py
|
Rebased on top of #18739 (
After the merge, Tests: |
|
Status / gentle ping — opened 2026-05-02, no review yet. Closes #18659. This is a small but high-impact reliability fix. Today, any exception during The fix:
Re: failing CI — only cc @teknium1 / @alt-glitch — would be great to land before another user hits the silent blank. |
|
@daimon-nous review and reproduce please! |
|
PR #18669 — @alexzhu0 Reproduction: Bug confirmed on current Code quality: Clean and correct.
Two minor items (non-blocking, can be fixed in follow-up):
Cross-refs:
Verdict: Good to salvage. The fix is correct, well-tested, and the two items above are follow-up material. |
|
Hi @daimon-nous — following up on your 2026-05-15 review (thank you for the salvage verdict 🙏). The two minor items are easy:
Happy to push either as a small follow-up commit on this branch — let me know your preference (NOTE comment vs. moving the marker) and I'll have it up within the day. Branch is still |
reload_skills()'s comment said scan_skill_commands resets _skill_commands
to {} and repopulates. After the atomic-swap refactor that's no longer
true: the function builds a fresh mapping locally and only swaps it in on
success, preserving the previous mapping on failure. Update the comment
to match.
Addresses one of two nits from review on #18669.
Add a NOTE explaining why _skill_commands_platform is set before the build try-block: a platform change is reflected even when the build fails and we keep the previous mapping. The marker is advisory while the mapping is authoritative, so brief disagreement during a failed rescan is intentional. Addresses second nit from review on #18669 (variant A: comment-only).
|
@daimon-nous — both follow-up items addressed (docs-only, no behavior change):
Two new commits pushed:
Tests unchanged (still 41/41 pass). Ready for merge when convenient. |
Closes #18659.
Symptom
Any exception during
scan_skill_commands()silently blanks all 90+ skill slash commands (/apple-notes,/claude-code, etc.) with zero user-facing error. The user sees their skills disappear but has no log line to diagnose it. Reporter's reproducer:chmod 000 ~/.hermes/skills; hermes→ all/skill-nameautocompletions gone.Root cause
agent/skill_commands.py:222unconditionally clears the global before entering the try block:```python
_skill_commands = {} # ← cleared FIRST
try:
from tools.skills_tool import ...
... populate _skill_commands ...
except Exception:
pass # ← swallowed, global stays {}
return _skill_commands
```
Any failure in the imports, outer scan loop, or
get_external_skills_dirs()leaves the global at{}, and theexcept: passswallows the reason.Fix
Atomic swap — build into local
new_commands, only assign to the global on success. Failure returns the last-known-good mapping.Diagnostic log — replace
except Exception: passwithlogger.warning(..., exc_info=True), gated by a module-level_scan_error_loggedflag soget_skill_commands()hot-loop callers don't flood the log. First failure emits a full traceback; subsequent failures stay silent until a successful scan resets the flag.Before / after
{}, all slash commands gone, no log{}, one WARNING traceback{}, 90 skills gone/reload-skillsget_skill_commands()Tests
5 new, 10 existing → 15/15 pass on
TestScanSkillCommands(40/40 on full file):test_preserves_previous_mapping_on_import_failuretest_logs_warning_on_scan_failuretest_repeated_scan_failures_do_not_flood_logstest_scan_error_flag_resets_on_successtest_successful_scan_still_replaces_mappingCross-refs
@alt-glitch noted #18656 (closed, same report) and #2868 (open PR addressing the logging dimension). This PR addresses the core atomic-swap bug plus the diagnostic log dimension together with hot-loop-safe guarding; happy to rebase if #2868's logging approach lands first.
Files
agent/skill_commands.py: +17/-3tests/agent/test_skill_commands.py: +132