feat: OBLITERATUS skill — LLM refusal removal via SVD-based weight projection #408
Closed
teknium1 wants to merge 2 commits into
…ght projection
Add mlops skill for the OBLITERATUS toolkit, which surgically removes refusal behaviors from open-weight LLMs without retraining or fine-tuning.
Skill includes:
- SKILL.md: Full 7-step workflow (install, hardware check, model browse, method selection, abliteration, verification, output usage)
- references/methods-guide.md: All 13 abliteration methods with decision flowchart and troubleshooting
- references/analysis-modules.md: All 27 analysis modules for mechanistic interpretability of refusal
- templates/abliteration-config.yaml: Standard config template
- templates/analysis-study.yaml: Pre-abliteration analysis template
- templates/batch-abliteration.yaml: Multi-model batch processing template
OBLITERATUS is AGPL-3.0; skill invokes it strictly via CLI to maintain license separation from Hermes Agent's MIT license.
Refs: #407
Fixes based on feedback from OBLITERATUS creator:
- CLI only accepts 9 methods (basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear). The 4 reproduction methods (failspy, gabliteration, heretic, rdo) are Python-API-only and will be rejected by argparse. Separated into 'CLI Methods' and 'Python-API-Only Methods' sections with clear warnings.
- Analysis module count corrected from 27 to 15, matching the README. The analysis/ directory has 24+ .py files but includes utilities, visualization helpers, and __init__.py beyond the 15 core modules.
- Description broadened from 'SVD-based weight projection' to 'mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.)' to better represent the method diversity.
- Telemetry notice clarified: CLI defaults to OFF, opt-in via OBLITERATUS_TELEMETRY=1 or --contribute flag.
Collaborator
|
check |
teknium1
added a commit
that referenced
this pull request
Mar 9, 2026
OBLITERATUS skill (PR #408 updated):
- 9 CLI methods, 28 analysis modules, 116 model presets
- Default method: advanced (multi-direction SVD, norm-preserving)
- Live-tested: Qwen2.5-3B 75%→0% refusal, Qwen2.5-0.5B 60%→20%
- References, templates, and real-world pitfalls included
Gateway compression fix (PR #739):
- Unified session hygiene with agent compression config
- Uses model context length × compression.threshold from config.yaml
- Removed hardcoded 100k/200-msg thresholds
Contributor
Author
|
Merged to main with significant updates (v2.0):
Merged in commit c21d77c. |
Summary
Adds a new mlops skill for OBLITERATUS, a toolkit for surgically removing refusal behaviors from open-weight LLMs without retraining or fine-tuning.
How the Agent Uses This Skill
Skills are Hermes Agent's procedural memory — structured instructions that teach the agent how to accomplish specific tasks. Here's the concrete flow when a user asks to abliterate a model:
1. Trigger Detection
The agent's system prompt includes a one-line description of every available skill. Before replying, it scans these descriptions for relevance. This skill's description contains trigger keywords: "uncensor", "abliterate", "refusal removal", "weight projection", "guardrails".
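As a rough illustration of the matching idea, the sketch below does a plain case-insensitive keyword lookup over skill descriptions. This is a simplification and not Hermes Agent's implementation: in the flow described above the model itself judges relevance from the one-line descriptions. The `Skill` class and `match_skills` function are hypothetical names introduced only for this example, and the description string is a paraphrase of the Summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Skill:
    """Hypothetical stand-in for a skill entry listed in the system prompt."""
    name: str
    description: str
    keywords: Tuple[str, ...]


def match_skills(message: str, skills: List[Skill]) -> List[str]:
    """Return names of skills whose trigger keywords appear in the message."""
    text = message.lower()
    return [s.name for s in skills if any(k in text for k in s.keywords)]


# Keywords are the ones quoted in this PR description.
OBLITERATUS = Skill(
    name="obliteratus",
    description="Remove refusal behaviors from open-weight LLMs",
    keywords=("uncensor", "abliterate", "refusal removal",
              "weight projection", "guardrails"),
)

if __name__ == "__main__":
    print(match_skills("Can you abliterate this model?", [OBLITERATUS]))
    print(match_skills("Summarize this paper for me", [OBLITERATUS]))
```

A substring check like this would over-match in practice (any message mentioning "guardrails" would trigger it), which is one reason to leave the final relevance decision to the model rather than to a fixed rule.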
So when a user says something like:
...the agent matches the `obliteratus` skill and calls `skill_view(name="obliteratus")` to load the full SKILL.md into its context window.
2. Following the Procedures
The SKILL.md is written as step-by-step instructions the agent follows using its existing tools, primarily `terminal` for running CLI commands:
- Runs `obliteratus --version` via terminal to check if installed. If not, it asks the user before installing (the skill explicitly says "confirm with user before installing" due to the ~5-10GB dependency footprint).
- Runs `obliteratus models --tier <tier>` to show available models.
- … `informed` (auto-configuring).
- Runs the `obliteratus obliterate <model> --method <method> [flags]` command via terminal.
3. Reference Files for Deep Dives
If the user asks detailed questions (e.g., "what's the difference between aggressive and surgical?"), the agent calls `skill_view(name="obliteratus", file_path="references/methods-guide.md")` to load the detailed methods reference. Similarly for analysis modules. These reference files stay out of context until needed, keeping the base skill lightweight.
4. Templates for Reproducibility
If the user wants a reproducible or batch workflow, the agent loads YAML config templates from `templates/` and customizes them for the user's specific models and requirements.
Key Point: No Code Changes to Hermes Agent
The skill is pure instructions — it teaches the agent what to do using tools it already has (terminal, read_file, write_file). OBLITERATUS runs as an external CLI process. No new Python code is added to the Hermes Agent codebase, no new tools are created, and no AGPL-licensed code is imported.
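The process boundary described above can be sketched as follows. This is a minimal illustration, assuming only the `obliteratus --version` check mentioned in this PR; `cli_version` is a hypothetical helper, not part of Hermes Agent or OBLITERATUS, and in the actual skill the agent issues the command through its terminal tool instead of through Python. The point is that the tool is reached only as a separate executable, never via `import`.

```python
import shutil
import subprocess
from typing import Optional


def cli_version(executable: str = "obliteratus") -> Optional[str]:
    """Return the CLI's reported version string, or None if unavailable.

    The tool is only ever launched as a separate process; nothing from it
    is imported into the calling program.
    """
    path = shutil.which(executable)
    if path is None:
        # Not installed: the skill says to confirm with the user before installing.
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Some tools print their version to stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = cli_version()
    print(version if version else "obliteratus not found; ask the user before installing")
```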
What This Adds
6 files, 730+ lines:
- `skills/mlops/obliteratus/SKILL.md`: Full 7-step workflow (install, hardware check, model browse, method selection, abliteration, verification, output usage)
- `references/methods-guide.md`: All 9 CLI methods + 4 API-only methods explained in depth, with decision flowchart, parameter tuning guide, and troubleshooting table
- `references/analysis-modules.md`: 15 analysis modules for mechanistic interpretability of refusal (alignment imprint detection, concept cone geometry, logit lens, causal tracing, etc.)
- `templates/abliteration-config.yaml`: Standard YAML config for reproducible abliteration runs
- `templates/analysis-study.yaml`: Pre-abliteration analysis study config
- `templates/batch-abliteration.yaml`: Multi-model batch processing config with per-model method overrides
Example Interaction
License Note
OBLITERATUS is AGPL-3.0. The skill strictly invokes it via CLI (`obliteratus` command), never as a Python import. This maintains clean license separation from Hermes Agent's MIT license. The skill explicitly warns against importing OBLITERATUS as a library.
Related