Feature: Skill Lifecycle Quality — Better Descriptions, Proactive Improvement Loop, and Writing Principles (inspired by Anthropic skill-creator) #429

@teknium1

Overview

Anthropic's skill-creator meta-skill reveals several practical, low-effort improvements to how Hermes Agent creates, triggers, and iteratively refines skills. Unlike our existing #337 (evolutionary self-improvement via automated pipelines) or #416 (structural validation), this issue focuses on day-to-day skill quality — making the agent write better skills, trigger them more reliably, and improve them during normal use.

The core insight from Anthropic's approach: the skill lifecycle should be a closed loop — create → use → observe gaps → refine → use again. Hermes has the primitives for this (skill_manage with patch, system prompt encouragement) but several tweaks would make the loop significantly tighter.

Source: anthropics/skills/skill-creator — specifically the SKILL.md methodology, improve_description.py, and run_loop.py for description optimization.


Research Findings

What Anthropic Does Well

1. "Pushy" Description Philosophy

Claude undertriggers skills by default — the model errs on the side of NOT loading a skill even when it's relevant. Anthropic's fix: descriptions should be slightly aggressive, explicitly listing edge cases and synonyms:

"Make sure to use this skill whenever the user mentions dashboards, reports, analytics, data viz, charts, or visualizations — even if they don't explicitly ask for a 'dashboard'."

2. Description Length Budget

Anthropic allows up to 1,024 characters for descriptions and recommends 100–200 words. The description is the only thing the model sees when deciding whether to trigger a skill — it needs room to convey trigger conditions, not just a sentence fragment.

3. Imperative + Explain Why

Skills should use imperative commands but explain the reasoning. From their guide:

"Today's LLMs are smart. They have good theory of mind... If you find yourself writing ALWAYS or NEVER in all caps, that's a yellow flag — reframe and explain the reasoning."

4. Anti-Overfitting Guidance

"Don't make instructions too narrow to the test cases; aim for generalizability. Use metaphors and general patterns."

5. "Bundle Repeated Work"

If multiple uses of a skill result in the agent writing the same Python script, that script should be moved into the skill's scripts/ folder. This is a practical iterative refinement pattern.
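One way to notice "the same script written again" is to hash a normalized copy of each helper script the agent produced while using a skill and count repeats. This is only a sketch: `normalize` and `repeated_scripts` are hypothetical names, not existing Hermes functions, and how the scripts would be collected per skill use is left open.

```python
import hashlib
from collections import Counter


def normalize(script: str) -> str:
    """Drop blank lines, full-line '#' comments, and trailing whitespace."""
    kept = []
    for line in script.splitlines():
        stripped = line.rstrip()
        if not stripped.strip() or stripped.lstrip().startswith("#"):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def repeated_scripts(scripts, min_uses=2):
    """Return normalized scripts that showed up in at least `min_uses` uses."""
    counts = Counter()
    first_seen = {}
    for script in scripts:
        body = normalize(script)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        counts[digest] += 1
        first_seen.setdefault(digest, body)
    return [first_seen[d] for d, n in counts.items() if n >= min_uses]
```

Anything this returns is a candidate for the skill's scripts/ folder. Exact-match hashing misses near-duplicates (renamed variables, reordered imports), so it is a conservative signal at best.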

6. Progressive Disclosure Awareness

Keep SKILL.md under 500 lines. Put large reference material in references/, not inline. Use the 3-level loading system consciously.
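The 500-line guideline is easy to check mechanically. The sketch below is illustrative only: `check_skill_length` is a hypothetical helper, not part of Hermes or of Anthropic's tooling, and the limit is simply the number quoted above.

```python
from typing import Optional

MAX_SKILL_LINES = 500  # guideline quoted above, not a hard rule


def check_skill_length(text: str, limit: int = MAX_SKILL_LINES) -> Optional[str]:
    """Return a warning string if SKILL.md text exceeds `limit` lines, else None."""
    line_count = len(text.splitlines())
    if line_count <= limit:
        return None
    return (
        f"SKILL.md has {line_count} lines (limit {limit}); "
        "consider moving reference material into references/"
    )
```

A caller could read a skill's SKILL.md, pass the text in, and surface the warning when the skill is created or patched.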

How This Maps to Hermes

| Anthropic Concept | Hermes Current State | Gap |
| --- | --- | --- |
| Description 100–200 words (1024 chars) | _read_skill_description() truncates to 60 chars | Descriptions are sentence fragments; insufficient for trigger decisions |
| "Pushy" description guidance | No guidance on description writing | Agent writes minimal descriptions by default |
| Post-use skill improvement | System prompt says "if a skill has issues, fix it with patch" | Reactive, not proactive; no guidance on WHAT to observe |
| Skill writing principles | skill_manage schema says "trigger conditions, numbered steps, pitfalls, verification" | Good but missing: explain-why, anti-overfitting, bundle-repeated-work, progressive disclosure awareness |
| Description optimization loop | No equivalent | Out of scope here (see #337), but the GUIDANCE is adoptable |
| Skill testing framework | No equivalent | Out of scope here (see #416 for validation) |

Current State in Hermes Agent

What We Already Have (and it's solid)

  • skill_manage tool with create/patch/edit/delete — the agent can modify skills mid-conversation
  • System prompt injection via build_skills_system_prompt() — automatic skill discovery
  • Progressive disclosure — description in system prompt → skill_view() for body → file_path for resources
  • Security scanning with rollback — way ahead of Anthropic
  • Skills Hub with multi-source federation — distribution solved
  • CONTRIBUTING.md with skill vs. tool criteria — decision framework exists

What's Missing (scope of this issue)

  1. Description budget is too tight — 60 chars is a sentence fragment ("Expert guidance for fine-tuning LLMs with Axolotl - YAML ...")
  2. No guidance on writing triggerable descriptions — agent doesn't know descriptions need to be "pushy"
  3. Passive improvement loop — agent only patches when something actively breaks, doesn't proactively improve after use
  4. No skill writing principles in the prompting — "explain why", "don't overfit", "bundle repeated work" are absent

Implementation Plan

Skill vs. Tool Classification

This is not a skill or tool — it's a set of improvements to existing codebase components: prompt_builder.py (description length + system prompt guidance), skill_manager_tool.py (schema description guidance), and CONTRIBUTING.md (documentation). All changes are to constants, strings, and documentation.

What We'd Need

No new dependencies. No new files. Changes to 3 existing files.

Phased Rollout

Phase 1: Description & Triggering Improvements (< 1 hour)

  1. Increase description budget in system prompt — Change _read_skill_description(max_chars=60) to max_chars=200 in prompt_builder.py:117. This gives the model 3x more context per skill for trigger decisions. System prompt growth is bounded: ~90 skills × 140 extra chars = ~12K chars — acceptable.

  2. Add "pushy description" guidance to skill_manage schema — Append to the schema description:

    Write DESCRIPTIONS that are slightly aggressive about triggering — list
    synonyms, edge cases, and adjacent tasks the skill covers. The description
    is the ONLY thing seen when deciding whether to load a skill. Example:
    "Use this skill whenever the user mentions dashboards, reports, analytics,
    data viz, charts — even if they don't explicitly ask for one."
    
  3. Update CONTRIBUTING.md skill-writing section with description best practices.
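For illustration, a larger budget pairs well with truncation at a word boundary, so a 200-character description does not end mid-word. This is a sketch under assumptions: `truncate_description` is a hypothetical stand-in and does not reproduce what `_read_skill_description()` actually does.

```python
def truncate_description(description: str, max_chars: int = 200) -> str:
    """Collapse whitespace and cut at a word boundary, adding '...' if text is dropped."""
    text = " ".join(description.split())
    if len(text) <= max_chars:
        return text
    ellipsis = "..."
    if max_chars <= len(ellipsis):
        return text[:max_chars]
    cut = text[: max_chars - len(ellipsis)]
    # Prefer cutting at the last space so a word is not split in half.
    head, sep, _ = cut.rpartition(" ")
    if sep:
        cut = head
    return cut.rstrip() + ellipsis
```

With `max_chars=60` a long description is still cut down to a fragment; with `max_chars=200` a typical trigger list plus an "even if" clause has a much better chance of fitting whole.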

Phase 2: Proactive Post-Use Improvement Loop (< 30 min)

  1. Enhance system prompt guidance — Replace the current passive SKILLS_GUIDANCE constant:

    # Current:
    "After completing a complex task (5+ tool calls), fixing a tricky error, 
    or discovering a non-trivial workflow, consider saving the approach as a 
    skill with skill_manage so you can reuse it next time."
    
    # Proposed (adds post-use improvement):
    "After completing a complex task (5+ tool calls), fixing a tricky error, 
    or discovering a non-trivial workflow, consider saving the approach as a 
    skill with skill_manage so you can reuse it next time.
    
    After USING a skill, evaluate: did it have missing steps, outdated
    commands, unclear instructions, or repeated boilerplate you wrote by
    hand? If so, patch the skill immediately with the improvements. Skills
    should get better every time they're used."

  2. Add post-use fix hint to build_skills_system_prompt() output. After the existing "If a skill has issues, fix it with skill_manage(action='patch')" line, add:

    After using a skill successfully, improve it: add missing steps, update
    outdated info, move repeated boilerplate into scripts/. Skills improve
    through use.
    

Phase 3: Skill Writing Principles (< 30 min)

  1. Add writing principles to the skill_manage schema. Extend the "Good skills" guidance:

    Good skills: trigger conditions, numbered steps with exact commands,
    pitfalls section, verification steps. Use imperative commands but
    explain WHY behind each instruction — models reason better with
    context. Keep SKILL.md under 500 lines; put large references in
    references/. Don't overfit instructions to one scenario — write
    for the general case. If you keep generating the same helper code
    when using a skill, move it into the skill's scripts/ folder.
    

Pros & Cons

Pros

  • Zero new infrastructure — All changes are to constants and strings in 3 files
  • Immediate impact — Better descriptions → better triggering on the next session
  • Compounds over time — Proactive improvement loop means every skill use makes skills better
  • Learned from production — Anthropic's patterns come from operating skills at scale (84.5K stars, production Claude deployment)
  • Compatible with #337 (evolutionary self-improvement): if/when we build automated evolution, better starting skills = faster convergence

Cons / Risks

  • System prompt growth — Increasing description length from 60→200 adds ~12K chars for ~90 skills. Need to monitor context usage. Could mitigate with embedding-based pre-filtering later.
  • Proactive patching noise — Agent might over-eagerly patch skills after every use. The guidance should emphasize "only if genuinely improved" not "always patch."
  • Instruction bloat — Adding more guidance to the skill_manage schema and system prompt costs context tokens. Must keep additions concise.
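The system prompt growth figure is a worst-case bound, and the arithmetic is worth making explicit. In the sketch below, `prompt_growth_chars` is a hypothetical helper, and the characters-per-token ratio is a rough rule of thumb assumed for illustration, not a measured value for any particular tokenizer.

```python
def prompt_growth_chars(skill_count: int, old_max: int = 60, new_max: int = 200) -> int:
    """Upper bound on added characters if every description fills the new limit."""
    return skill_count * (new_max - old_max)


extra_chars = prompt_growth_chars(90)  # 90 * 140 = 12600

# ASSUMPTION: ~4 characters per token, a common rough estimate for English text.
CHARS_PER_TOKEN = 4
approx_extra_tokens = extra_chars // CHARS_PER_TOKEN
```

Real growth will be lower wherever existing descriptions are shorter than 200 characters, so measuring the rendered prompt before and after the change would give a better number than this bound.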

Open Questions

  • Should we cap description at 200 or go to the full 1024 like Anthropic? 200 is a pragmatic middle ground for system prompt size, but we could also consider dynamic truncation based on total skill count.
  • Should we add a skill_used counter or timestamp to skills metadata to track usage frequency? This would enable data-driven decisions about which skills to improve first (light lift, could be Phase 4).
  • Is there value in adding an explicit "trigger conditions" YAML field separate from description? E.g., triggers: ["dashboard", "data viz", "chart"] for structured matching vs. relying on free-text descriptions.
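To make the dynamic-truncation question concrete, one possible shape is a fixed total character budget divided across installed skills and clamped between the current 60 and Anthropic's 1024. Everything here is an assumption for discussion: `per_skill_budget` is hypothetical, and the 18,000-character default is chosen only so that ~90 skills land on 200 characters each.

```python
def per_skill_budget(
    skill_count: int,
    total_budget: int = 18_000,  # ASSUMPTION: 90 skills * 200 chars
    floor: int = 60,             # current truncation length
    ceiling: int = 1024,         # Anthropic's stated maximum
) -> int:
    """Split a total description budget evenly across skills, clamped to [floor, ceiling]."""
    if skill_count <= 0:
        return ceiling
    return max(floor, min(ceiling, total_budget // skill_count))
```

With these defaults, 10 skills get the full 1024 characters each, 90 skills get 200, and 600 skills fall back to 60. The floor means the total can exceed `total_budget` for very large skill sets, which is a trade-off to decide on rather than a bug.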
