feat: OBLITERATUS skill — LLM refusal removal via SVD-based weight projection #408

Closed
teknium1 wants to merge 2 commits into main from feature/obliteratus-skill

teknium1 commented Mar 5, 2026

Contributor

Summary

Adds a new mlops skill for OBLITERATUS, a toolkit for surgically removing refusal behaviors from open-weight LLMs without retraining or fine-tuning.

How the Agent Uses This Skill

Skills are Hermes Agent's procedural memory — structured instructions that teach the agent how to accomplish specific tasks. Here's the concrete flow when a user asks to abliterate a model:

1. Trigger Detection

The agent's system prompt includes a one-line description of every available skill. Before replying, it scans these descriptions for relevance. This skill's description contains trigger keywords: "uncensor", "abliterate", "refusal removal", "weight projection", "guardrails".

So when a user says something like:

"Can you uncensor Llama 3.1 8B for me?"

...the agent matches the obliteratus skill and calls skill_view(name="obliteratus") to load the full SKILL.md into its context window.
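
As a rough illustration of the matching step: in practice the model does this in-context by reading the skill descriptions, not by running code. The sketch below approximates that with a plain keyword scan. `SKILL_TRIGGERS` and `match_skills` are hypothetical names, not part of Hermes Agent; the keywords are the ones listed above.

```python
# Hypothetical sketch: approximate skill trigger detection with a keyword scan.
# The real agent matches skills by reading their descriptions in its prompt.
SKILL_TRIGGERS = {
    "obliteratus": [
        "uncensor",
        "abliterate",
        "refusal removal",
        "weight projection",
        "guardrails",
    ],
}


def match_skills(message: str, triggers: dict = SKILL_TRIGGERS) -> list:
    """Return the names of skills whose trigger keywords appear in the message."""
    text = message.lower()
    return [
        name
        for name, keywords in triggers.items()
        if any(keyword in text for keyword in keywords)
    ]


print(match_skills("Can you uncensor Llama 3.1 8B for me?"))
```

A matching skill name would then be passed to `skill_view(name=...)` to load the full SKILL.md.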

2. Following the Procedures

The SKILL.md is written as step-by-step instructions the agent follows using its existing tools — primarily terminal for running CLI commands:

  • Step 1 (Install): Agent runs obliteratus --version via terminal to check if installed. If not, it asks the user before installing (the skill explicitly says "confirm with user before installing" due to the ~5-10GB dependency footprint).
  • Step 2 (Hardware check): Agent runs a Python snippet via terminal to detect GPU/VRAM and determine the compute tier (tiny/small/medium/large/frontier).
  • Step 3 (Model browse): Agent runs obliteratus models --tier <tier> to show available models.
  • Step 4 (Method selection): Agent consults the decision table in the skill to recommend a method based on the model type (dense vs MoE, reasoning vs general, etc.). For most users it recommends informed (auto-configuring).
  • Step 5 (Run): Agent constructs and executes the obliteratus obliterate <model> --method <method> [flags] command via terminal.
  • Step 6 (Verify): Agent reads the output metrics (refusal rate, perplexity, KL divergence) and interprets them using the thresholds in the skill. If perplexity spiked, it adjusts parameters and retries.
  • Step 7 (Output): Agent helps the user test the model, upload to HuggingFace, or serve with vLLM.
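
The Step 2 hardware check could look roughly like the sketch below. It is not the snippet shipped in SKILL.md: the tier cutoffs and the function names `tier_for_vram` and `detect_gpu` are assumptions chosen only so that a 24 GB card lands in the `large` tier, as in the example interaction. The only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.get_device_properties()`.

```python
# Sketch of a GPU/VRAM check that maps VRAM to a compute tier.
# The cutoffs below are illustrative assumptions, not OBLITERATUS's real values.
ASSUMED_TIER_CUTOFFS_GB = [
    (6.0, "tiny"),
    (12.0, "small"),
    (20.0, "medium"),
    (48.0, "large"),
]


def tier_for_vram(vram_gb: float) -> str:
    """Return the first tier whose upper bound exceeds the available VRAM."""
    for upper_bound_gb, tier in ASSUMED_TIER_CUTOFFS_GB:
        if vram_gb < upper_bound_gb:
            return tier
    return "frontier"


def detect_gpu():
    """Return (name, vram_gb) for CUDA device 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    # The attribute is total_memory (bytes), not total_mem.
    return props.name, props.total_memory / 1024**3


gpu = detect_gpu()
if gpu is None:
    print("No CUDA GPU detected")
else:
    name, vram_gb = gpu
    print(f"GPU: {name}, VRAM: {vram_gb:.1f} GB, TIER: {tier_for_vram(vram_gb)}")
```

Keeping the tier mapping in a pure function makes it easy to check without a GPU present.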

3. Reference Files for Deep Dives

If the user asks detailed questions (e.g., "what's the difference between aggressive and surgical?"), the agent calls skill_view(name="obliteratus", file_path="references/methods-guide.md") to load the detailed methods reference. Similarly for analysis modules. These reference files stay out of context until needed, keeping the base skill lightweight.

4. Templates for Reproducibility

If the user wants a reproducible or batch workflow, the agent loads YAML config templates from templates/ and customizes them for the user's specific models and requirements.

Key Point: No Code Changes to Hermes Agent

The skill is pure instructions — it teaches the agent what to do using tools it already has (terminal, read_file, write_file). OBLITERATUS runs as an external CLI process. No new Python code is added to the Hermes Agent codebase, no new tools are created, and no AGPL-licensed code is imported.


What This Adds

6 files, 730+ lines:

  • skills/mlops/obliteratus/SKILL.md — Full 7-step workflow:

    1. Installation (with dependency size warning + user confirmation)
    2. Hardware/GPU check (auto-detects tier, recommends model size)
    3. Model browsing (116 presets across 5 compute tiers)
    4. Method selection (9 CLI methods with decision table; 4 Python-API-only noted separately)
    5. Running abliteration (CLI flags, interactive mode, web UI)
    6. Verification (metric thresholds + troubleshooting for over/under-aggressive results)
    7. Output usage (test, HF upload, vLLM serving)
    • Plus: 15 analysis modules, steering vectors, telemetry notice, 9 common pitfalls
  • references/methods-guide.md — All 9 CLI methods + 4 API-only methods explained in depth, with decision flowchart, parameter tuning guide, and troubleshooting table

  • references/analysis-modules.md — 15 analysis modules for mechanistic interpretability of refusal (alignment imprint detection, concept cone geometry, logit lens, causal tracing, etc.)

  • templates/abliteration-config.yaml — Standard YAML config for reproducible abliteration runs

  • templates/analysis-study.yaml — Pre-abliteration analysis study config

  • templates/batch-abliteration.yaml — Multi-model batch processing config with per-model method overrides

Example Interaction

User: I want to remove the refusals from Qwen2.5-7B-Instruct

Agent: [loads obliteratus skill]
       Let me check if OBLITERATUS is installed...
       [runs: obliteratus --version]
       
       Checking your GPU...
       [runs: python3 GPU check snippet]
       GPU: NVIDIA RTX 3090, VRAM: 24.0 GB, TIER: large
       
       With 24GB VRAM you can handle Qwen2.5-7B easily.
       I recommend the `informed` method — it auto-detects
       the alignment technique and configures everything.
       
       Want me to run it? I'll use 4-bit quantization to
       be safe on memory.

User: Yeah go for it

Agent: [runs: obliteratus obliterate Qwen/Qwen2.5-7B-Instruct \
         --method informed --quantization 4bit \
         --output-dir ./abliterated-models]
       
       [monitors output, reports progress]
       
       Done! Results:
       - Refusal rate: 0.2% (down from 47%)
       - Perplexity: +8% (healthy, within 10% threshold)
       - KL divergence: 0.04 (minimal distribution shift)
       
       Model saved to ./abliterated-models/Qwen2.5-7B-Instruct-abliterated
       
       Want me to upload it to HuggingFace or test it locally?
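
The verification step in this interaction can be sketched as a simple threshold check. Only the 10% perplexity threshold comes from the example above; the refusal-rate and KL-divergence cutoffs, and the name `interpret_metrics`, are assumptions for illustration and may differ from the thresholds in SKILL.md.

```python
# Sketch: interpret post-abliteration metrics against thresholds.
# max_ppl_increase_pct=10 follows the example above; the other defaults are assumed.
def interpret_metrics(
    refusal_rate_pct: float,
    perplexity_change_pct: float,
    kl_divergence: float,
    max_refusal_pct: float = 5.0,       # assumed cutoff
    max_ppl_increase_pct: float = 10.0,
    max_kl: float = 0.1,                # assumed cutoff
) -> list:
    """Return a list of findings, or ["healthy"] if every metric is in range."""
    findings = []
    if refusal_rate_pct > max_refusal_pct:
        findings.append("under-aggressive: refusals remain")
    if perplexity_change_pct > max_ppl_increase_pct:
        findings.append("over-aggressive: perplexity spiked")
    if kl_divergence > max_kl:
        findings.append("over-aggressive: large distribution shift")
    return findings or ["healthy"]


# Metrics from the example run: 0.2% refusal, +8% perplexity, KL 0.04.
print(interpret_metrics(0.2, 8.0, 0.04))
```

A non-healthy result is the cue for the agent to adjust parameters and retry, as described in Step 6.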

License Note

OBLITERATUS is AGPL-3.0. The skill strictly invokes it via CLI (obliteratus command) — never as a Python import. This maintains clean license separation from Hermes Agent's MIT license. The skill explicitly warns against importing OBLITERATUS as a library.
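
To make the separation concrete, here is a minimal sketch of invoking the tool as an external process rather than importing it. The helper `build_obliterate_command` is hypothetical. The flags are the ones shown in the example interaction, and the method list is the set of 9 CLI methods named in this PR's follow-up commit.

```python
# Sketch: call OBLITERATUS as a separate CLI process; never `import obliteratus`.
import shlex
import subprocess

# The 9 methods the CLI accepts, per the follow-up commit in this PR.
CLI_METHODS = {
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
}


def build_obliterate_command(model, method="informed", quantization=None, output_dir=None):
    """Build the argv list for `obliteratus obliterate`, rejecting non-CLI methods."""
    if method not in CLI_METHODS:
        raise ValueError(f"{method!r} is not a CLI method")
    cmd = ["obliteratus", "obliterate", model, "--method", method]
    if quantization:
        cmd += ["--quantization", quantization]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


cmd = build_obliterate_command(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization="4bit",
    output_dir="./abliterated-models",
)
print(shlex.join(cmd))
# To actually run it (requires OBLITERATUS to be installed):
# subprocess.run(cmd, check=True)
```

Passing an argv list to `subprocess.run` avoids shell quoting issues, and the process boundary is what keeps the AGPL code out of the MIT codebase.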

Related

teknium1 added 2 commits March 4, 2026 17:19
…ght projection

Add mlops skill for the OBLITERATUS toolkit, which surgically removes
refusal behaviors from open-weight LLMs without retraining or fine-tuning.

Skill includes:
- SKILL.md: Full 7-step workflow (install, hardware check, model browse,
  method selection, abliteration, verification, output usage)
- references/methods-guide.md: All 13 abliteration methods with decision
  flowchart and troubleshooting
- references/analysis-modules.md: All 27 analysis modules for mechanistic
  interpretability of refusal
- templates/abliteration-config.yaml: Standard config template
- templates/analysis-study.yaml: Pre-abliteration analysis template
- templates/batch-abliteration.yaml: Multi-model batch processing template

OBLITERATUS is AGPL-3.0; skill invokes it strictly via CLI to maintain
license separation from Hermes Agent's MIT license.

Refs: #407
Fixes based on feedback from OBLITERATUS creator:

- CLI only accepts 9 methods (basic, advanced, aggressive, spectral_cascade,
  informed, surgical, optimized, inverted, nuclear). The 4 reproduction methods
  (failspy, gabliteration, heretic, rdo) are Python-API-only and will be
  rejected by argparse. Separated into 'CLI Methods' and 'Python-API-Only
  Methods' sections with clear warnings.

- Analysis module count corrected from 27 to 15, matching the README.
  The analysis/ directory has 24+ .py files but includes utilities,
  visualization helpers, and __init__.py beyond the 15 core modules.

- Description broadened from 'SVD-based weight projection' to
  'mechanistic interpretability techniques (diff-in-means, SVD,
  whitened SVD, SAE decomposition, etc.)' to better represent
  the method diversity.

- Telemetry notice clarified: CLI defaults to OFF, opt-in via
  OBLITERATUS_TELEMETRY=1 or --contribute flag.

J-SUPHA commented Mar 5, 2026

Collaborator

check

teknium1 added a commit that referenced this pull request Mar 9, 2026
OBLITERATUS skill (PR #408 updated):
- 9 CLI methods, 28 analysis modules, 116 model presets
- Default method: advanced (multi-direction SVD, norm-preserving)
- Live-tested: Qwen2.5-3B 75%→0% refusal, Qwen2.5-0.5B 60%→20%
- References, templates, and real-world pitfalls included

Gateway compression fix (PR #739):
- Unified session hygiene with agent compression config
- Uses model context length × compression.threshold from config.yaml
- Removed hardcoded 100k/200-msg thresholds

teknium1 commented Mar 9, 2026

Contributor Author

Merged to main with significant updates (v2.0):

  • Updated to match current OBLITERATUS repo (28 analysis modules, new CLI commands, direction extraction methods, strategies, evaluation, MLX support)
  • Changed default recommendation from informed (experimental) to advanced (reliable, well-tested)
  • Live-tested on real models:
    • Qwen2.5-3B-Instruct: 75% → 0% refusal with advanced defaults
    • Qwen2.5-0.5B-Instruct: 60% → 20% (small models have fragmented refusal)
  • Added real-world pitfalls from testing (small model limitations, aggressive backfire, spectral RED is normal)
  • Fixed torch API bug (total_mem → total_memory)

Merged in commit c21d77c.

@teknium1 teknium1 closed this Mar 9, 2026
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026