
chore(skills): move red-team skills (godmode, obliteratus) to optional-skills — Anthropic classifier #43221

Merged: teknium1 merged 2 commits into main from chore/remove-redteam-skills-classifier on Jun 10, 2026

teknium1 commented Jun 10, 2026 (Contributor)

Summary

Moves the two red-team skills (godmode, obliteratus) out of the bundled catalog and into optional-skills/, because their descriptions — injected into every session's system prompt via the bundled <available_skills> list — trip Anthropic's output classifier and intermittently kill unrelated work. They remain installable on demand.

Root cause

The bundled <available_skills> catalog is part of the system prompt of every session, regardless of which skill is actually loaded. Two entries read as jailbreak/abliteration tooling:

  • red-teaming/godmode: "Jailbreak LLMs: Parseltongue, GODMODE, ULTRAPLINIAN"
  • mlops/inference/obliteratus: "OBLITERATUS: abliterate LLM refusals (diff-in-means)"

On claude-fable-5 (Anthropic, via OpenRouter), the output classifier sees these in context and returns empty content on most calls (19 of 20 in the measurement below). The user-visible symptom is the agent dying with:

⚠️ Empty response from model — retrying (1/3..3/3)
❌ Model returned no content after all retries.

This blocks legitimate day-to-day work (PR review, codebase audits, optimization sweeps) that has nothing to do with red-teaming — the skills just need to be listed in the always-injected catalog to do the damage. optional-skills/ entries are not in that catalog until a user explicitly installs them, so relocating fixes the trigger without removing the skills.
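
The catalog rule described above can be sketched as follows. This is a minimal illustration, not the Hermes implementation: the `Skill` dataclass, `build_available_skills`, and the `example/pr-review` entry are hypothetical names introduced here. The only behavior taken from the PR is that bundled skills are always listed and optional skills are listed only after an explicit install.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    path: str                 # catalog path, e.g. "security/godmode"
    description: str          # one-line summary shown in the catalog
    bundled: bool             # True for skills/, False for optional-skills/
    installed: bool = False   # only meaningful for optional skills


def build_available_skills(skills: list[Skill]) -> str:
    """Render the block that would be injected into every system prompt.

    Assumed rule: bundled skills are always listed; optional skills are
    listed only once the user has installed them.
    """
    lines = [
        f"  - {s.path}: {s.description}"
        for s in skills
        if s.bundled or s.installed
    ]
    return "\n".join(["<available_skills>", *lines, "</available_skills>"])


skills = [
    # Hypothetical neutral bundled skill, for illustration only.
    Skill("example/pr-review", "Review pull requests", bundled=True),
    # After this PR: optional, so absent from the catalog until installed.
    Skill(
        "security/godmode",
        "Jailbreak LLMs: Parseltongue, GODMODE, ULTRAPLINIAN",
        bundled=False,
    ),
]
catalog = build_available_skills(skills)
print(catalog)
```

Under this rule the description text never reaches the system prompt of a session that has not opted in, which is why relocation removes the trigger without deleting the skill.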

Measured impact

Controlled, interleaved A/B (calls alternated so server-side classifier drift hits both arms equally). Same live context, same task, prompts differing only by the ~204 chars of these two catalog lines, N=20 each:

| Catalog lines | Blocked |
|---|---|
| present | 19/20 (95%) |
| absent | 5/20 (25%) |
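
With only 20 calls per arm, it is worth checking that a 19/20 vs 5/20 split is unlikely to be noise. The PR does not report a significance test; the sketch below is an added check, using a one-sided Fisher exact test written with the standard library only. The function name `fisher_one_sided` is introduced here for illustration.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two arms, columns are blocked / not blocked. Returns the
    probability, with all margins fixed, that the first arm has at least
    `a` blocked calls if the catalog lines made no difference.
    """
    row1, row2 = a + b, c + d
    blocked = a + c
    denominator = comb(row1 + row2, blocked)
    upper = min(row1, blocked)
    numerator = sum(
        comb(row1, k) * comb(row2, blocked - k) for k in range(a, upper + 1)
    )
    return numerator / denominator


# Counts from the table: present = 19 blocked / 1 ok, absent = 5 blocked / 15 ok.
p_value = fisher_one_sided(19, 1, 5, 15)
ratio = (19 / 20) / (5 / 20)
print(f"one-sided p = {p_value:.2e}, block-rate ratio = {ratio:.1f}x")
```

For these counts the p-value comes out at roughly 5 in a million, so the gap is very unlikely to be sampling noise even at N=20. The 25% residual block rate in the "absent" arm is a separate effect that this test says nothing about.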

Removing them from the bundled catalog roughly quartered the block rate. Rewording the descriptions to neutral phrasing did not help — the skills have to leave the always-injected catalog. (Confirmed separately that the loaded hermes-agent-dev skill itself is inert: full-skill vs no-skill measured 7/20 == 7/20.)
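
The interleaved design can be sketched as below. This is an illustration under stated assumptions, not the harness used for the measurement: `interleaved_ab` and `fake_model` are hypothetical, and a real run would pass a function that calls the model through the provider instead of the stub.

```python
from typing import Callable


def interleaved_ab(
    call_model: Callable[[str], str],
    prompt_a: str,
    prompt_b: str,
    n_per_arm: int = 20,
) -> dict[str, int]:
    """Alternate A, B, A, B, ... and count empty ("blocked") replies per arm.

    Alternating calls means any drift in server-side behavior over the run
    affects both arms about equally instead of biasing one of them.
    """
    blocked = {"A": 0, "B": 0}
    for _ in range(n_per_arm):
        for arm, prompt in (("A", prompt_a), ("B", prompt_b)):
            if not call_model(prompt).strip():
                blocked[arm] += 1
    return blocked


def fake_model(prompt: str) -> str:
    # Deterministic stand-in for a model call; a real classifier is stochastic.
    return "" if "godmode" in prompt else "ok"


base = "<available_skills>\n  - example/pr-review: Review pull requests\n"
with_lines = base + "  - red-teaming/godmode: ...\n</available_skills>"
without_lines = base + "</available_skills>"
result = interleaved_ab(fake_model, with_lines, without_lines)
print(result)
```

The two prompts should differ only in the lines under test, as in the PR, so that any difference in block counts can be attributed to those lines.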

Changes

  • Relocate skills/red-teaming/godmode/ → optional-skills/security/godmode/
  • Relocate skills/mlops/inference/obliteratus/ → optional-skills/mlops/obliteratus/
  • Regenerate bundled + optional catalog pages, sidebars, and EN/zh-Hans entries
  • Drop the godmode hand-written-page exception in generate-skill-docs.py (now an auto-generated optional page)

Both skills stay fully available via hermes skills install official/security/godmode and official/mlops/obliteratus.

Validation

  • generate-skill-docs.py regenerated cleanly (170 skills; 2 moved bundled → optional)
  • Git tracks all skill files as renames (R100), preserving history
  • No remaining references in the bundled catalog; both now appear in the optional catalog
  • Two unrelated code comments about the legacy prefill_messages config format (cli.py, cron/scheduler.py) mention "godmode-generated configs" — that's a config-format reference, not a skill dependency, left intact.
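
The rename check in the list above can be reproduced by parsing the output of `git diff --name-status -M`, where an exact rename appears as `R100`, a tab, the old path, a tab, and the new path. The sketch below is an assumption-laden illustration: `summarize_renames` is a hypothetical helper, and the sample text is illustrative input rather than the real diff for this PR.

```python
def summarize_renames(name_status: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split `git diff --name-status -M` output into exact renames and the rest.

    Returns (exact, other): `exact` holds (old_path, new_path) pairs with
    status R100; `other` holds every remaining non-empty line unchanged.
    """
    exact: list[tuple[str, str]] = []
    other: list[str] = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if status == "R100" and len(paths) == 2:
            exact.append((paths[0], paths[1]))
        else:
            other.append(line)
    return exact, other


# Illustrative sample; in a checkout this text would come from something like
#   git diff --name-status -M origin/main...HEAD
sample = "\n".join(
    [
        "R100\tskills/red-teaming/godmode/SKILL.md"
        "\toptional-skills/security/godmode/SKILL.md",
        "R100\tskills/red-teaming/godmode/scripts/parseltongue.py"
        "\toptional-skills/security/godmode/scripts/parseltongue.py",
        "M\tgenerate-skill-docs.py",
    ]
)
exact, other = summarize_renames(sample)
moved_ok = all(new.startswith("optional-skills/") for _, new in exact)
print(len(exact), "exact renames;", len(other), "other changes; moved_ok =", moved_ok)
```

Any skill file that shows up in `other` with a status such as `D` or `A` would indicate that Git did not detect the move as a rename, and history for that file would not follow it.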

Infographic

Removing the classifier trigger — block rate 95% to 25% after moving red-team skills out of the bundled catalog

…dled catalog

Anthropic's output classifier on claude-fable-5 (and likely other Claude
models served through it) intermittently returns empty content for sessions
whose system prompt advertises these skills. The bundled skills-catalog block
is injected into every session's system prompt, so the descriptions

  - red-teaming/godmode      'Jailbreak LLMs: Parseltongue, GODMODE, ULTRAPLINIAN'
  - mlops/inference/obliteratus 'OBLITERATUS: abliterate LLM refusals (diff-in-means)'

trip the classifier on EVERY session regardless of which skill is actually
loaded, killing unrelated legitimate work (PR review, codebase audits, etc.).

Measured impact (controlled, interleaved A/B, claude-fable-5 via OpenRouter,
prompts differing only by the ~204 chars of these catalog lines, N=20 each):
  catalog lines present -> 19/20 (95%) blocked
  catalog lines absent  -> 5/20  (25%) blocked

Removing them ~quartered the block rate. Rewording the descriptions was not
enough; the skills must leave the bundled catalog.

- Delete skills/red-teaming/godmode and skills/mlops/inference/obliteratus
- Drop their generated doc pages + catalog/sidebar entries (EN + zh-Hans)
- Drop the godmode hand-written-page exception in generate-skill-docs.py

github-actions bot commented Jun 10, 2026 (Contributor)

🔎 Lint report: chore/remove-redteam-skills-classifier vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10626 on HEAD, 10626 on base (➖ 0)

🆕 New issues (9):

| Rule | Count |
|---|---|
| unresolved-reference | 2 |
| call-non-callable | 2 |
| unresolved-import | 2 |
| unresolved-attribute | 1 |
| invalid-parameter-default | 1 |
| invalid-argument-type | 1 |

First entries:
optional-skills/security/godmode/scripts/auto_jailbreak.py:527: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `list[str]` in union `list[str] | dict[str, str] | dict[Unknown, Unknown]`
optional-skills/security/godmode/scripts/auto_jailbreak.py:383: [invalid-parameter-default] invalid-parameter-default: Default value of type `None` is not assignable to annotated parameter type `str`
optional-skills/security/godmode/scripts/parseltongue.py:475: [invalid-argument-type] invalid-argument-type: Argument to function `escape` is incorrect: Argument type `Sized` does not satisfy constraints (`str`, `bytes`) of type variable `AnyStr`
optional-skills/security/godmode/scripts/auto_jailbreak.py:620: [unresolved-reference] unresolved-reference: Name `score_response` used when not defined
optional-skills/security/godmode/scripts/parseltongue.py:520: [call-non-callable] call-non-callable: Object of type `str` is not callable
optional-skills/security/godmode/scripts/auto_jailbreak.py:25: [unresolved-import] unresolved-import: Cannot resolve imported module `openai`
optional-skills/security/godmode/scripts/parseltongue.py:476: [call-non-callable] call-non-callable: Object of type `int` is not callable
optional-skills/security/godmode/scripts/auto_jailbreak.py:540: [unresolved-reference] unresolved-reference: Name `escalate_encoding` used when not defined
optional-skills/security/godmode/scripts/godmode_race.py:27: [unresolved-import] unresolved-import: Cannot resolve imported module `openai`

✅ Fixed issues (9):

| Rule | Count |
|---|---|
| call-non-callable | 2 |
| unresolved-import | 2 |
| unresolved-reference | 2 |
| invalid-parameter-default | 1 |
| invalid-argument-type | 1 |
| unresolved-attribute | 1 |

First entries:
skills/red-teaming/godmode/scripts/parseltongue.py:476: [call-non-callable] call-non-callable: Object of type `int` is not callable
skills/red-teaming/godmode/scripts/godmode_race.py:27: [unresolved-import] unresolved-import: Cannot resolve imported module `openai`
skills/red-teaming/godmode/scripts/auto_jailbreak.py:383: [invalid-parameter-default] invalid-parameter-default: Default value of type `None` is not assignable to annotated parameter type `str`
skills/red-teaming/godmode/scripts/auto_jailbreak.py:620: [unresolved-reference] unresolved-reference: Name `score_response` used when not defined
skills/red-teaming/godmode/scripts/parseltongue.py:475: [invalid-argument-type] invalid-argument-type: Argument to function `escape` is incorrect: Argument type `Sized` does not satisfy constraints (`str`, `bytes`) of type variable `AnyStr`
skills/red-teaming/godmode/scripts/auto_jailbreak.py:25: [unresolved-import] unresolved-import: Cannot resolve imported module `openai`
skills/red-teaming/godmode/scripts/auto_jailbreak.py:527: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `list[str]` in union `list[str] | dict[str, str] | dict[Unknown, Unknown]`
skills/red-teaming/godmode/scripts/auto_jailbreak.py:540: [unresolved-reference] unresolved-reference: Name `escalate_encoding` used when not defined
skills/red-teaming/godmode/scripts/parseltongue.py:520: [call-non-callable] call-non-callable: Object of type `str` is not callable

Unchanged: 5555 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.
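
The nine "new" and nine "fixed" ty diagnostics in this report are the same findings at relocated paths, which is consistent with the unchanged total of 10626. A short sketch of that comparison follows, with entries condensed from the report to `path:line:rule`; the `relocate` helper is hypothetical and not part of the lint tooling.

```python
OLD_PREFIX = "skills/red-teaming/godmode/"
NEW_PREFIX = "optional-skills/security/godmode/"

# Condensed from the "New issues" list above.
new_issues = {
    "optional-skills/security/godmode/scripts/auto_jailbreak.py:527:unresolved-attribute",
    "optional-skills/security/godmode/scripts/auto_jailbreak.py:383:invalid-parameter-default",
    "optional-skills/security/godmode/scripts/parseltongue.py:475:invalid-argument-type",
    "optional-skills/security/godmode/scripts/auto_jailbreak.py:620:unresolved-reference",
    "optional-skills/security/godmode/scripts/parseltongue.py:520:call-non-callable",
    "optional-skills/security/godmode/scripts/auto_jailbreak.py:25:unresolved-import",
    "optional-skills/security/godmode/scripts/parseltongue.py:476:call-non-callable",
    "optional-skills/security/godmode/scripts/auto_jailbreak.py:540:unresolved-reference",
    "optional-skills/security/godmode/scripts/godmode_race.py:27:unresolved-import",
}

# Condensed from the "Fixed issues" list above.
fixed_issues = {
    "skills/red-teaming/godmode/scripts/parseltongue.py:476:call-non-callable",
    "skills/red-teaming/godmode/scripts/godmode_race.py:27:unresolved-import",
    "skills/red-teaming/godmode/scripts/auto_jailbreak.py:383:invalid-parameter-default",
    "skills/red-teaming/godmode/scripts/auto_jailbreak.py:620:unresolved-reference",
    "skills/red-teaming/godmode/scripts/parseltongue.py:475:invalid-argument-type",
    "skills/red-teaming/godmode/scripts/auto_jailbreak.py:25:unresolved-import",
    "skills/red-teaming/godmode/scripts/auto_jailbreak.py:527:unresolved-attribute",
    "skills/red-teaming/godmode/scripts/auto_jailbreak.py:540:unresolved-reference",
    "skills/red-teaming/godmode/scripts/parseltongue.py:520:call-non-callable",
}


def relocate(entry: str) -> str:
    """Rewrite an old-path diagnostic to the path it has after the move."""
    if entry.startswith(OLD_PREFIX):
        return NEW_PREFIX + entry[len(OLD_PREFIX):]
    return entry


# Diagnostics that are new for a reason other than the directory move.
truly_new = new_issues - {relocate(e) for e in fixed_issues}
print(f"{len(truly_new)} diagnostics introduced beyond the relocation")
```

An empty `truly_new` set means the move introduced no type errors of its own; the listed issues predate this PR and simply followed the files.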

@tonydwb tonydwb left a comment

Code Review Summary

Verdict: Approved

Looks Good

  • Clean removal of red-team skills (godmode, obliteratus) that were tripping Anthropic policy.
  • The 5506-line deletion across 25 files represents complete removal of the skill directories.
  • The small +2 lines likely represent any remaining SKILL.md or metadata cleanup.
  • No sensitive content left behind.
  • This is a policy compliance cleanup.

Reviewed by Hermes Agent

alt-glitch added labels on Jun 10, 2026: type/bug (Something isn't working), tool/skills (Skills system: list, view, manage), P2 (Medium: degraded but workaround exists)

Rather than deleting outright, move both into optional-skills/ so they remain
installable via `hermes skills install` while leaving the always-injected
bundled catalog (which is what tripped Anthropic's classifier).

- optional-skills/security/godmode  (was skills/red-teaming/godmode)
- optional-skills/mlops/obliteratus  (was skills/mlops/inference/obliteratus)
- regenerate optional-skills catalog + sidebar entries

teknium1 changed the title from "chore(skills): remove red-team skills (godmode, obliteratus) tripping Anthropic classifier" to "chore(skills): move red-team skills (godmode, obliteratus) to optional-skills — Anthropic classifier" on Jun 10, 2026
teknium1 merged commit fdc9034 into main on Jun 10, 2026 (24 checks passed)
teknium1 deleted the chore/remove-redteam-skills-classifier branch on June 10, 2026 04:41
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…l-skills — Anthropic classifier (NousResearch#43221)

* chore(skills): remove red-team skills (godmode, obliteratus) from bundled catalog

* chore(skills): relocate godmode + obliteratus to optional-skills
