feat: OBLITERATUS skill — LLM refusal removal via SVD-based weight projection by teknium1 · Pull Request #408 · NousResearch/hermes-agent

teknium1 · 2026-03-05T01:19:39Z

Summary

Adds a new mlops skill for OBLITERATUS, a toolkit for surgically removing refusal behaviors from open-weight LLMs without retraining or fine-tuning.

How the Agent Uses This Skill

Skills are Hermes Agent's procedural memory — structured instructions that teach the agent how to accomplish specific tasks. Here's the concrete flow when a user asks to abliterate a model:

1. Trigger Detection

The agent's system prompt includes a one-line description of every available skill. Before replying, it scans these descriptions for relevance. This skill's description contains trigger keywords: "uncensor", "abliterate", "refusal removal", "weight projection", "guardrails".

So when a user says something like:

"Can you uncensor Llama 3.1 8B for me?"

...the agent matches the obliteratus skill and calls skill_view(name="obliteratus") to load the full SKILL.md into its context window.
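The matching described above is performed by the model reading the skill descriptions, not by code. Purely as an illustration of the idea, a naive substring check over the listed trigger keywords might look like the sketch below; the `TRIGGERS` table and the `matching_skills` helper are hypothetical names and are not part of Hermes Agent.

```python
# Illustrative only: the real agent relies on the LLM to judge relevance.
# TRIGGERS and matching_skills are hypothetical names for this sketch.
TRIGGERS = {
    "obliteratus": [
        "uncensor",
        "abliterate",
        "refusal removal",
        "weight projection",
        "guardrails",
    ],
}


def matching_skills(message, triggers=TRIGGERS):
    """Return the names of skills whose trigger keywords occur in the message."""
    text = message.lower()
    return [
        name
        for name, keywords in triggers.items()
        if any(keyword in text for keyword in keywords)
    ]


if __name__ == "__main__":
    print(matching_skills("Can you uncensor Llama 3.1 8B for me?"))
```

A match would then be followed by the skill_view(name="obliteratus") call mentioned above.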

2. Following the Procedures

The SKILL.md is written as step-by-step instructions the agent follows using its existing tools — primarily terminal for running CLI commands:

Step 1 (Install): Agent runs obliteratus --version via terminal to check whether OBLITERATUS is installed. If it is not, the agent asks the user before installing (the skill explicitly says "confirm with user before installing" because of the ~5-10GB dependency footprint).
Step 2 (Hardware check): Agent runs a Python snippet via terminal to detect GPU/VRAM and determine the compute tier (tiny/small/medium/large/frontier).
Step 3 (Model browse): Agent runs obliteratus models --tier <tier> to show available models.
Step 4 (Method selection): Agent consults the decision table in the skill to recommend a method based on the model type (dense vs MoE, reasoning vs general, etc.). For most users it recommends informed (auto-configuring).
Step 5 (Run): Agent constructs and executes the obliteratus obliterate <model> --method <method> [flags] command via terminal.
Step 6 (Verify): Agent reads the output metrics (refusal rate, perplexity, KL divergence) and interprets them using the thresholds in the skill. If perplexity spiked, it adjusts parameters and retries.
Step 7 (Output): Agent helps the user test the model, upload to HuggingFace, or serve with vLLM.
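The hardware check in Step 2 could be sketched roughly as follows. This is a minimal sketch, not the snippet shipped in SKILL.md: the VRAM cut-offs in `compute_tier` are assumptions chosen for illustration (the only data point in this PR is that a 24 GB card maps to the large tier), and the helper names are hypothetical. The one detail taken from the PR is that PyTorch exposes device memory as `total_memory`.

```python
def compute_tier(vram_gb):
    """Map available VRAM (in GB) to a compute tier.

    The thresholds are hypothetical; the skill defines its own cut-offs.
    """
    if vram_gb < 4:
        return "tiny"
    if vram_gb < 8:
        return "small"
    if vram_gb < 16:
        return "medium"
    if vram_gb < 48:
        return "large"
    return "frontier"


def detect_vram_gb():
    """Return VRAM of the first CUDA device in GB, or 0.0 if none is usable."""
    try:
        import torch  # optional dependency; absent on CPU-only setups
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    props = torch.cuda.get_device_properties(0)
    # The attribute is total_memory (bytes), not total_mem.
    return props.total_memory / 1024**3


if __name__ == "__main__":
    vram = detect_vram_gb()
    print(f"VRAM: {vram:.1f} GB, TIER: {compute_tier(vram)}")
```

Keeping the tier mapping separate from the detection makes the mapping easy to check without a GPU present.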

3. Reference Files for Deep Dives

If the user asks detailed questions (e.g., "what's the difference between aggressive and surgical?"), the agent calls skill_view(name="obliteratus", file_path="references/methods-guide.md") to load the detailed methods reference. It does the same for the analysis modules, loading references/analysis-modules.md. These reference files stay out of context until needed, keeping the base skill lightweight.

4. Templates for Reproducibility

If the user wants a reproducible or batch workflow, the agent loads YAML config templates from templates/ and customizes them for the user's specific models and requirements.

Key Point: No Code Changes to Hermes Agent

The skill is pure instructions — it teaches the agent what to do using tools it already has (terminal, read_file, write_file). OBLITERATUS runs as an external CLI process. No new Python code is added to the Hermes Agent codebase, no new tools are created, and no AGPL-licensed code is imported.
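To make the "external CLI process" point concrete, here is a hedged sketch of how a caller might assemble the command line and hand it to a subprocess without ever importing the package. The method names and the `--method`, `--quantization`, and `--output-dir` flags are the ones quoted elsewhere in this PR; the `build_command` and `run` helpers are hypothetical and exist only for this illustration.

```python
import shlex
import subprocess
from typing import List, Optional

# Method names as listed in this PR's follow-up commit message.
CLI_METHODS = {
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
}
API_ONLY_METHODS = {"failspy", "gabliteration", "heretic", "rdo"}


def build_command(
    model: str,
    method: str,
    quantization: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the obliteratus CLI (hypothetical helper)."""
    if method in API_ONLY_METHODS:
        raise ValueError(f"{method!r} is Python-API-only and not accepted by the CLI")
    if method not in CLI_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    argv = ["obliteratus", "obliterate", model, "--method", method]
    if quantization is not None:
        argv += ["--quantization", quantization]
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    return argv


def run(argv: List[str]) -> int:
    """Run the command as a separate process; nothing is imported as a library."""
    return subprocess.run(argv, check=True).returncode


if __name__ == "__main__":
    cmd = build_command(
        "Qwen/Qwen2.5-7B-Instruct",
        "informed",
        quantization="4bit",
        output_dir="./abliterated-models",
    )
    print(shlex.join(cmd))  # show the command without executing it
```

Passing an argv list rather than a shell string avoids quoting problems with model names and paths.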

What This Adds

6 files, 730+ lines:

skills/mlops/obliteratus/SKILL.md — Full 7-step workflow:
1. Installation (with dependency size warning + user confirmation)
2. Hardware/GPU check (auto-detects tier, recommends model size)
3. Model browsing (116 presets across 5 compute tiers)
4. Method selection (9 CLI methods with decision table; 4 Python-API-only noted separately)
5. Running abliteration (CLI flags, interactive mode, web UI)
6. Verification (metric thresholds + troubleshooting for over/under-aggressive results)
7. Output usage (test, HF upload, vLLM serving)
- Plus: 15 analysis modules, steering vectors, telemetry notice, 9 common pitfalls
references/methods-guide.md — All 9 CLI methods + 4 API-only methods explained in depth, with decision flowchart, parameter tuning guide, and troubleshooting table
references/analysis-modules.md — 15 analysis modules for mechanistic interpretability of refusal (alignment imprint detection, concept cone geometry, logit lens, causal tracing, etc.)
templates/abliteration-config.yaml — Standard YAML config for reproducible abliteration runs
templates/analysis-study.yaml — Pre-abliteration analysis study config
templates/batch-abliteration.yaml — Multi-model batch processing config with per-model method overrides

Example Interaction

User: I want to remove the refusals from Qwen2.5-7B-Instruct

Agent: [loads obliteratus skill]
       Let me check if OBLITERATUS is installed...
       [runs: obliteratus --version]
       
       Checking your GPU...
       [runs: python3 GPU check snippet]
       GPU: NVIDIA RTX 3090, VRAM: 24.0 GB, TIER: large
       
       With 24GB VRAM you can handle Qwen2.5-7B easily.
       I recommend the `informed` method — it auto-detects
       the alignment technique and configures everything.
       
       Want me to run it? I'll use 4-bit quantization to
       be safe on memory.

User: Yeah go for it

Agent: [runs: obliteratus obliterate Qwen/Qwen2.5-7B-Instruct \
         --method informed --quantization 4bit \
         --output-dir ./abliterated-models]
       
       [monitors output, reports progress]
       
       Done! Results:
       - Refusal rate: 0.2% (down from 47%)
       - Perplexity: +8% (healthy, within 10% threshold)
       - KL divergence: 0.04 (minimal distribution shift)
       
       Model saved to ./abliterated-models/Qwen2.5-7B-Instruct-abliterated
       
       Want me to upload it to HuggingFace or test it locally?
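The perplexity check in the example above comes down to simple arithmetic. The sketch below assumes the 10% threshold quoted in the example; the function names and the sample perplexity values are hypothetical and not taken from a real run.

```python
def perplexity_change_pct(before, after):
    """Relative perplexity change in percent (positive means worse)."""
    if before <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (after - before) / before * 100.0


def perplexity_ok(before, after, max_increase_pct=10.0):
    """True if perplexity rose by no more than the allowed percentage."""
    return perplexity_change_pct(before, after) <= max_increase_pct


if __name__ == "__main__":
    # Hypothetical values: baseline 10.0, post-run 10.8 (about +8%).
    change = perplexity_change_pct(10.0, 10.8)
    print(f"Perplexity: {change:+.0f}% ok={perplexity_ok(10.0, 10.8)}")
```

A result above the threshold corresponds to the "perplexity spiked" case in Step 6, where the agent adjusts parameters and retries.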

License Note

OBLITERATUS is AGPL-3.0. The skill strictly invokes it via CLI (obliteratus command) — never as a Python import. This maintains clean license separation from Hermes Agent's MIT license. The skill explicitly warns against importing OBLITERATUS as a library.

…ght projection

Add mlops skill for the OBLITERATUS toolkit, which surgically removes refusal behaviors from open-weight LLMs without retraining or fine-tuning.

Skill includes:
- SKILL.md: Full 7-step workflow (install, hardware check, model browse, method selection, abliteration, verification, output usage)
- references/methods-guide.md: All 13 abliteration methods with decision flowchart and troubleshooting
- references/analysis-modules.md: All 27 analysis modules for mechanistic interpretability of refusal
- templates/abliteration-config.yaml: Standard config template
- templates/analysis-study.yaml: Pre-abliteration analysis template
- templates/batch-abliteration.yaml: Multi-model batch processing template

OBLITERATUS is AGPL-3.0; skill invokes it strictly via CLI to maintain license separation from Hermes Agent's MIT license.

Refs: #407

Fixes based on feedback from OBLITERATUS creator:
- CLI only accepts 9 methods (basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear). The 4 reproduction methods (failspy, gabliteration, heretic, rdo) are Python-API-only and will be rejected by argparse. Separated into 'CLI Methods' and 'Python-API-Only Methods' sections with clear warnings.
- Analysis module count corrected from 27 to 15, matching the README. The analysis/ directory has 24+ .py files but includes utilities, visualization helpers, and __init__.py beyond the 15 core modules.
- Description broadened from 'SVD-based weight projection' to 'mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.)' to better represent the method diversity.
- Telemetry notice clarified: CLI defaults to OFF, opt-in via OBLITERATUS_TELEMETRY=1 or --contribute flag.

J-SUPHA · 2026-03-05T03:18:16Z

check

OBLITERATUS skill (PR #408 updated):
- 9 CLI methods, 28 analysis modules, 116 model presets
- Default method: advanced (multi-direction SVD, norm-preserving)
- Live-tested: Qwen2.5-3B 75%→0% refusal, Qwen2.5-0.5B 60%→20%
- References, templates, and real-world pitfalls included

Gateway compression fix (PR #739):
- Unified session hygiene with agent compression config
- Uses model context length × compression.threshold from config.yaml
- Removed hardcoded 100k/200-msg thresholds
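The gateway compression rule mentioned above (model context length × compression.threshold) amounts to a single multiplication. A minimal sketch follows, with a hypothetical function name and hypothetical sample values; the actual config loading in PR #739 is not shown in this thread.

```python
def compression_trigger_tokens(context_length, threshold):
    """Token count at which session compression would start.

    context_length: the model's context window in tokens.
    threshold: fraction of the window, i.e. the compression.threshold value.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return int(context_length * threshold)


if __name__ == "__main__":
    # Hypothetical values: a 200,000-token window with a threshold of 0.5.
    print(compression_trigger_tokens(200_000, 0.5))
```

Deriving the limit from the model's own window is what replaces the hardcoded 100k/200-msg thresholds noted in the commit message.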

teknium1 · 2026-03-09T09:59:55Z

Merged to main with significant updates (v2.0):

- Updated to match current OBLITERATUS repo (28 analysis modules, new CLI commands, direction extraction methods, strategies, evaluation, MLX support)
- Changed default recommendation from informed (experimental) to advanced (reliable, well-tested)
- Live-tested on real models:
  - Qwen2.5-3B-Instruct: 75% → 0% refusal with advanced defaults
  - Qwen2.5-0.5B-Instruct: 60% → 20% (small models have fragmented refusal)
- Added real-world pitfalls from testing (small model limitations, aggressive backfire, spectral RED is normal)
- Fixed torch API bug (total_mem → total_memory)

Merged in commit c21d77c.


teknium1 added 2 commits March 4, 2026 17:19

teknium1 closed this Mar 9, 2026

EffortlessSteven mentioned this pull request Apr 17, 2026: feat(telegram): location skill #422 (Closed)

rafe-walker mentioned this pull request May 24, 2026: KR-PROMOTE-PHRASEBOOK-FOUNDATION-MEGABUCKET — first promotion loop end-to-end rafe-walker/kora#186 (Merged, 3 tasks)

clwsgruk mentioned this pull request Jun 8, 2026: feat: refresh OBLITERATUS skill #42337 (Closed)

teknium1 wants to merge 2 commits into main from feature/obliteratus-skill
