Skip to content

fix(skills): surface skill-load failures instead of returning success#11408

Closed
zerone0x wants to merge 1 commit into
NousResearch:mainfrom
zerone0x:fix/skill-load-failure-11200
Closed

fix(skills): surface skill-load failures instead of returning success#11408
zerone0x wants to merge 1 commit into
NousResearch:mainfrom
zerone0x:fix/skill-load-failure-11200

Conversation

@zerone0x

@zerone0x zerone0x commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Fixes #11200

Problem

build_skill_invocation_message() returned a truthy placeholder string ([Failed to load skill: <name>]) when a registered skill's SKILL.md could not be loaded (e.g. file deleted or corrupted between scan and invocation). Every caller used if msg: to distinguish success from failure, so the placeholder was treated as a successful invocation and silently fed into the agent as user prompt content. A missing/deleted/corrupt skill therefore looked like a working slash command from the user's perspective, while the model received [Failed to load skill: ...] as conversation input.

Affected call sites flagged in the bug report:

  • agent/skill_commands.py:311-313
  • cli.py:5571-5580
  • gateway/run.py:2877-2882
  • gateway/platforms/webhook.py:383-388

Fix

  • build_skill_invocation_message() now returns None on load failure (matching the existing "unknown skill" path) and logs a warning with the cmd_key + skill_dir for diagnosis. The docstring is updated to document the new contract.
  • gateway/run.py: when the slash command resolves to a registered skill but loading fails, return a real error message to the user (instead of silently falling through and sending the bare /cmd to the LLM as free text).
  • gateway/platforms/webhook.py: when a configured route skill is registered but fails to load, log loudly at ERROR and skip that skill rather than injecting the placeholder string into the prompt.
  • cli.py already had if not msg: branches for both the skill slash handler and _handle_plan_command; those branches were effectively dead code under the old return contract and now fire correctly on real load failures.

Reproduction

The reporter's reproduction now returns None instead of '[Failed to load skill: demo]':

import agent.skill_commands as sc
from pathlib import Path
from tempfile import TemporaryDirectory
from unittest.mock import patch

with TemporaryDirectory() as td:
    root = Path(td)
    skill_dir = root / "demo"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("---
name: demo
description: demo
---

# Demo
", encoding="utf-8")
    with patch("agent.skill_utils.get_external_skills_dirs", return_value=[root]), \
         patch("tools.skills_tool.SKILLS_DIR", Path(td) / "missing_local_skills"):
        sc.scan_skill_commands()
        (skill_dir / "SKILL.md").unlink()
        print(sc.build_skill_invocation_message("/demo", "hello"))  # -> None

Tests

  • New TestBuildSkillInvocationMessage::test_returns_none_when_payload_load_fails_after_scan covers the regression: scan a skill, delete its SKILL.md, assert build_skill_invocation_message returns None.
  • pytest tests/agent/test_skill_commands.py — 27 passed.
  • pytest tests/gateway/test_webhook_integration.py — 4 passed.

`build_skill_invocation_message()` returned a truthy placeholder string
(`[Failed to load skill: ...]`) when a registered skill's `SKILL.md`
could not be loaded. Every caller checked the result with `if msg:` to
distinguish success from failure, so the placeholder string was treated
as a successful invocation and silently fed into the agent as user
prompt content. A missing/deleted/corrupt skill therefore looked like a
working slash command from the user's perspective.

Make the function return `None` on load failure (matching the existing
"unknown skill" path) and update the gateway and webhook callers to
emit a real user-facing error / log instead of forwarding the bare
slash command. The CLI handler and `_handle_plan_command` already had
dead `if not msg:` branches that now actually fire on load failure.

Adds a regression test that scans a skill, deletes its SKILL.md before
invocation, and asserts `None` is returned.

Fixes #11200

Co-Authored-By: Claude <noreply@anthropic.com>
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery tool/skills Skills system (list, view, manage) labels Apr 25, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #14752 — both fix build_skill_invocation_message() returning truthy error string instead of None on load failure. Same root cause as #11200 and #14713.

@zerone0x

Copy link
Copy Markdown
Contributor Author

Confirmed this overlaps with #14752 on the same root fix: make build_skill_invocation_message() return None on skill payload load failure and have callers surface the failure instead of queueing the sentinel string. Since #14752 is open, I’m closing this one to reduce duplicate PR noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists tool/skills Skills system (list, view, manage) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Skill load failure is treated as successful slash-command invocation

2 participants