Feature: OBLITERATUS Skill — LLM Refusal Removal via SVD-Based Weight Projection

## Overview

[OBLITERATUS](https://github.com/elder-plinius/OBLITERATUS) is an open-source toolkit (~36.5k LOC Python) for surgically removing refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. It uses mechanistic interpretability techniques — primarily SVD-based weight projection — to identify and excise "refusal directions" from model weights while preserving reasoning capabilities.

The toolkit offers 13 abliteration methods (from faithful reproductions of FailSpy, Gabliteration, Heretic, and RDO to novel pipelines like spectral cascade, CoT-aware, and expert-granular for MoE models), 27 analysis modules for mapping refusal geometry, and a 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH). It supports models from GPT-2 (CPU) to DeepSeek-V3 (multi-GPU) across 5 compute tiers with 116 curated model presets.

This is relevant to a subset of Nous Research customers who want to create uncensored models from open-weight base models. A Hermes Agent skill would let users orchestrate the full abliteration workflow — from model selection through analysis, abliteration, verification, and HuggingFace upload — via natural language.

**Source:** [elder-plinius/OBLITERATUS](https://github.com/elder-plinius/OBLITERATUS) (AGPL-3.0)

---

## Research Findings

### How OBLITERATUS Works

The core pipeline operates in 6 stages:

1. **SUMMON** — Load model via HuggingFace Transformers with auto device_map, optional 4-bit/8-bit quantization via bitsandbytes.

2. **PROBE** — Forward-pass harmful & harmless prompt pairs through all layers, collecting per-layer hidden-state activations at the last token position. Also collects jailbreak-contrastive activations and MoE router logits.

3. **DISTILL** — Extract refusal directions via SVD on `(harmful_mean - harmless_mean)` per layer. Methods include standard diff-in-means, Whitened SVD (covariance-normalized), and Wasserstein-optimal extraction. Layer selection uses "knee detection" via COSMIC cosine similarity.

4. **EXCISE** — Project out refusal directions from ALL weight matrices: attention Q/K/V projections, output projections, FFN up/down/gate, MoE routers, shared experts, embeddings. Uses norm-preserving projection. Advanced techniques include: reflection (inversion), per-expert directions for MoE, attention head surgery, SAE feature-level targeting, safety neuron masking, activation steering, and expert transplant.

5. **VERIFY** — Generate responses to test prompts, measure refusal rate, compute KL divergence vs baseline, perplexity, and coherence.

6. **REBIRTH** — Save the modified model + tokenizer + `abliteration_metadata.json`.
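
The DISTILL and EXCISE stages reduce to a few lines of linear algebra. The block below is a sketch of the single-direction case described by Arditi et al. 2024, not a transcription of the OBLITERATUS code. The symbols are introduced here for illustration only: `a_i` is the layer-`l` hidden state at the last token of prompt `i`, `H` and `B` are the harmful and harmless prompt sets, and `W` is any weight matrix whose output is written into the residual stream (`y = Wx`). The norm-preserving rescaling and the multi-direction weighting named in the stage list are not specified in this issue, so they are left out.

```math
\begin{aligned}
\mu^{(\ell)}_{\mathrm{harmful}} &= \frac{1}{|H|}\sum_{i \in H} a^{(\ell)}_i,
\qquad
\mu^{(\ell)}_{\mathrm{harmless}} = \frac{1}{|B|}\sum_{j \in B} a^{(\ell)}_j \\
r_\ell &= \mu^{(\ell)}_{\mathrm{harmful}} - \mu^{(\ell)}_{\mathrm{harmless}},
\qquad
\hat r_\ell = \frac{r_\ell}{\lVert r_\ell \rVert_2} \\
W' &= \left(I - \hat r_\ell \hat r_\ell^{\top}\right) W
\quad\Longrightarrow\quad
\hat r_\ell^{\top} W' = 0
\end{aligned}
```

After the projection, `W'` can no longer write any component along the extracted direction. For matrices that read from the residual stream (for example Q/K/V projections), the same projector would be applied on the right instead. With `k` orthonormal directions stacked as the columns of a matrix `R`, the rank-one projector is replaced by `R R^T`.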

### 13 Abliteration Methods

| Method | Description | Complexity |
|:---|:---|:---|
| `basic` | Single direction diff-in-means (Arditi et al. 2024) | Low |
| `failspy` | FailSpy/abliterator reproduction | Low |
| `gabliteration` | Gabliteration reproduction | Low |
| `heretic` | Heretic/p-e-w reproduction | Medium |
| `rdo` | Refusal Direction Optimization (ICML 2025) | Medium |
| `advanced` | Multi-dir SVD + norm-preserving (default) | Medium |
| `aggressive` | Full whitened SVD + jailbreak contrast + head surgery | High |
| `spectral_cascade` | DCT frequency-domain decomposition | High |
| `informed` | Analysis-guided auto-configuration (the killer feature) | High |
| `surgical` | All SOTA: SAE + neuron masking + head surgery + per-expert | Very High |
| `optimized` | Bayesian auto-tuning via Optuna TPE | Very High |
| `inverted` | Semantic inversion (reflects refusal direction) | Medium |
| `nuclear` | Maximum force combo for stubborn MoE models | Extreme |

### The Informed Pipeline (Key Innovation)

The `InformedAbliterationPipeline` (986 lines) runs analysis DURING abliteration to auto-configure every decision:

1. **AlignmentImprintDetector** → Detects if model was trained via DPO/RLHF/CAI/SFT from subspace geometry → Sets regularization strength
2. **ConceptConeAnalyzer** → Determines if refusal is polyhedral vs linear → Sets number of directions
3. **CrossLayerAlignmentAnalyzer** → Cluster-aware layer selection → Chooses which layers to modify
4. **DefenseRobustnessEvaluator** → Assesses self-repair risk → Sets refinement passes
5. **Ouroboros compensation loop** → Re-probes/re-excises if refusal persists after initial pass

### 27 Analysis Modules (~10,400 lines)

Deep interpretability tools for understanding refusal geometry before touching weights:

- **Refusal Logit Lens** — Identifies the specific layer where a model decides to refuse
- **Causal Tracing** — Determines which components are causally necessary for refusal
- **Residual Stream Decomposition** — Analyzes Attention vs MLP contribution to refusal
- **Alignment Imprint Detection** — Fingerprints DPO vs RLHF vs CAI from subspace geometry
- **Concept Cone Geometry** — Maps per-category guardrails (polyhedral vs linear)
- **Ouroboros Effect Detection** — Measures if guardrails self-repair after removal
- **Cross-Model Transfer** — Tests if refusal directions transfer between models
- **Riemannian Manifold Geometry** — Weight manifold analysis (673 lines)
- **SAE-based Abliteration** — Sparse Autoencoder feature decomposition (762 lines)
- **Whitened SVD** — Covariance-normalized direction extraction
- **Steering Vectors** — Inference-time behavior modification (reversible)
- Plus 16 more modules

### Hardware Requirements

| Tier | VRAM | Example Models |
|:---|:---|:---|
| Tiny | CPU/<1GB | GPT-2 (124M), TinyLlama 1.1B, SmolLM 135-360M |
| Small | 4-8GB | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 1B/3B |
| Medium | 8-16GB | Llama 3.1 8B (4bit), Mistral 7B, Gemma 2 9B |
| Large | 24GB+ | Qwen3-32B (4bit), Llama 3.1 70B (4bit), Mistral Large 2 (4bit) |
| Frontier | Multi-GPU | DeepSeek-V3 (685B), Llama 3.1 405B, Qwen3-235B |
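
The tier table maps onto a small lookup that a helper script could use before calling `obliteratus models --tier <tier>`. The following is a minimal sketch under stated assumptions: the function name `recommend_tier` is hypothetical, the table leaves gaps (1-4GB and 16-24GB) that are assigned here to the lower tier, any multi-GPU host is treated as Frontier, and only the lowercase tier name `medium` is confirmed by the CLI examples in this issue.

```python
def recommend_tier(vram_gb: float, gpu_count: int = 1) -> str:
    """Map detected VRAM (GB on one GPU) to a tier name from the table.

    Assumptions, not taken from OBLITERATUS itself:
    - more than one GPU always means the Frontier tier
    - the 1-4GB gap is treated as Tiny, the 16-24GB gap as Medium
    - an exact 8GB card is treated as Medium
    """
    if gpu_count > 1:
        return "frontier"
    if vram_gb < 4:
        return "tiny"      # CPU-only hosts report 0 GB and land here
    if vram_gb < 8:
        return "small"     # table: 4-8GB
    if vram_gb < 24:
        return "medium"    # table: 8-16GB, plus the unlisted 16-24GB gap
    return "large"         # table: 24GB+


if __name__ == "__main__":
    for vram in (0, 6, 12, 24):
        print(f"{vram:>3} GB -> {recommend_tier(vram)}")
```

Assigning the gaps to the lower tier is the conservative choice: the skill would suggest a model that is known to fit rather than one that might not.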

### CLI Interface

```bash
# Primary command — abliterate a model
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# With options
obliteratus obliterate <model> \
  --method informed \
  --quantization 4bit \
  --device auto \
  --output-dir ./liberated-models

# Browse models by compute tier
obliteratus models --tier medium

# Interactive guided mode
obliteratus interactive

# Launch Gradio web UI
obliteratus ui

# Run study from YAML config
obliteratus run study_config.yaml --preset jailbreak

# Model architecture info
obliteratus info meta-llama/Llama-3.1-8B-Instruct
```
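
Because the skill would drive OBLITERATUS only through its CLI, a helper could assemble the command line and hand it to a child process. The sketch below uses only the subcommand and flags shown above (`obliterate`, `--method`, `--quantization`, `--device`, `--output-dir`). The function names `build_obliterate_cmd` and `run` are hypothetical, and the dry-run default is an assumption added so the example is safe to execute without the tool installed.

```python
import shlex
import subprocess
from typing import List, Optional


def build_obliterate_cmd(
    model: str,
    method: str = "advanced",
    quantization: Optional[str] = None,
    device: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `obliteratus obliterate` from the flags in this issue."""
    cmd = ["obliteratus", "obliterate", model, "--method", method]
    if quantization:
        cmd += ["--quantization", quantization]
    if device:
        cmd += ["--device", device]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> int:
    """Print the command (dry run) or execute it as a separate process."""
    if dry_run:
        print(shlex.join(cmd))
        return 0
    # No shell and no Python import of the tool: it runs as an external process.
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    run(build_obliterate_cmd(
        "meta-llama/Llama-3.1-8B-Instruct",
        method="informed",
        quantization="4bit",
        device="auto",
        output_dir="./liberated-models",
    ))
```

Passing an argv list rather than a shell string avoids quoting problems with model IDs and paths, and keeps the tool on the other side of a process boundary.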

### Telemetry

- **Disabled by default** for local installs (opt-in via `OBLITERATUS_TELEMETRY=1`)
- Auto-enabled on HuggingFace Spaces (when `SPACE_ID` env var detected)
- Collects only: model_id, method, benchmark scores, hardware info, timing
- Does NOT collect: IP addresses, user identity, prompt content
- Stored locally in `~/.obliteratus/telemetry.jsonl`, synced to HF dataset only from Spaces
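
A skill could surface the telemetry state to the user before a run. The sketch below restates the rules listed above in code. It is not taken from the OBLITERATUS source: the function names are hypothetical, and it assumes that only the literal value `1` opts in and that any non-empty `SPACE_ID` counts as running on Spaces.

```python
import json
import os
from pathlib import Path
from typing import Dict, List, Mapping

# Local log location as stated in this issue.
DEFAULT_LOG = Path.home() / ".obliteratus" / "telemetry.jsonl"


def telemetry_enabled(env: Mapping[str, str]) -> bool:
    """Return True if telemetry would be active under the documented rules.

    Assumption: other truthy spellings ("true", "yes") are NOT recognized.
    """
    opted_in = env.get("OBLITERATUS_TELEMETRY") == "1"
    on_spaces = bool(env.get("SPACE_ID"))
    return opted_in or on_spaces


def read_telemetry(path: Path = DEFAULT_LOG) -> List[Dict]:
    """Load locally stored telemetry records, one JSON object per line."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    state = "enabled" if telemetry_enabled(os.environ) else "disabled"
    print(f"telemetry is {state}; {len(read_telemetry())} local record(s)")
```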

### Dependencies

Core runtime: `torch>=2.0`, `transformers>=4.40`, `datasets>=2.14`, `accelerate>=0.24`, `safetensors>=0.4`, `bitsandbytes>=0.46.1`, `pyyaml`, `rich`, `matplotlib`, `seaborn`, `pandas`, `numpy`, `scikit-learn`, `tqdm`.

Optional: `gradio>=5.0` (Web UI), `optuna` (Bayesian optimization).
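
Given the size of this dependency set, a pre-flight check could report which pinned packages are missing or too old before an install is attempted. The following is a rough sketch using only the standard library. The minimum versions are copied from the list above, the function names are hypothetical, and the version comparison is a simplification that looks only at leading numeric release segments (it ignores pre-release ordering).

```python
import re
from importlib import metadata
from typing import Dict, Tuple

# Minimum versions as listed in this issue (packages without a pin are omitted).
MINIMUMS = {
    "torch": "2.0",
    "transformers": "4.40",
    "datasets": "2.14",
    "accelerate": "0.24",
    "safetensors": "0.4",
    "bitsandbytes": "0.46.1",
}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Keep leading numeric segments, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    parts = []
    for segment in version.split("+")[0].split("."):
        match = re.match(r"\d+", segment)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> Dict[str, str]:
    """Return 'ok', 'missing', or a 'too old' note for each pinned package."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package:<14} {status}")
```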

---

## Current State in Hermes Agent

Hermes Agent has no model weight manipulation, abliteration, or refusal removal capability. Related existing components:

- **SAE Training Skill** (`sparse-autoencoder-training`) — Trains Sparse Autoencoders for interpretability; conceptually adjacent (both deal with model internals) but different purpose
- **Axolotl Skill** — Fine-tuning LLMs; complementary (fine-tune after abliteration)
- **Unsloth Skill** — Fast fine-tuning; complementary
- **vLLM Skill** — Serving LLMs; complementary (serve the abliterated model)
- **HuggingFace Tokenizers Skill** — Tokenizer management; tangentially related

No existing open issues cover abliteration, refusal removal, or weight projection topics.

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **skill** (Skills Hub, not bundled) because:

1. **CLI-driven**: OBLITERATUS provides a full CLI (`obliteratus obliterate`, `obliteratus interactive`, `obliteratus ui`, etc.) that the agent can invoke via `terminal`
2. **No custom Python integration needed**: All operations go through the CLI or can be orchestrated via shell commands
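3. **AGPL-3.0 license**: The tool MUST remain an external process, never imported as a Python library into the MIT-licensed Hermes Agent codebase. Invoking the CLI as a separate process keeps the two programs at arm's length, which is generally treated as aggregation rather than a combined work under the AGPL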
4. **Specialized audience**: Only ML practitioners doing model customization need this — not broadly useful enough to bundle

### What We'd Need

1. **SKILL.md** — Procedures for the full abliteration workflow
2. **Template files** — YAML config templates for common abliteration scenarios
3. **Scripts** — Helper scripts for GPU detection, model size estimation, method recommendation

### Phased Rollout

**Phase 1: Core Abliteration Skill (MVP)**
- Installation instructions (`pip install -e .` from git clone)
- GPU/VRAM detection and model tier recommendation
- Model browsing via `obliteratus models --tier <tier>`
- Basic abliteration workflow: model selection → method recommendation → execution → verification
- Method selection guidance (which method for which model family/size)
- Output handling: save locally, verify coherence
- Template YAML configs for common scenarios (8B dense model, 7B MoE, etc.)

**Phase 2: Analysis-Informed Workflow**
- Pre-abliteration analysis: run relevant analysis modules to understand refusal geometry
- Informed pipeline usage: teach agent when to use `--method informed` vs specific methods
- Interpretation of analysis results (alignment imprint, concept geometry, self-repair risk)
- Iterative refinement: if first pass leaves residual refusal, adjust parameters and re-run
- Comparison workflows: run multiple methods and compare results

**Phase 3: Integration & Publishing**
- HuggingFace Hub upload of abliterated models with proper metadata
- Model card generation with abliteration details and benchmark results
- Integration with existing Hermes skills: abliterate → fine-tune (Axolotl/Unsloth) → serve (vLLM)
- Batch abliteration: process multiple models in sequence
- Steering vector workflows: reversible inference-time modifications as alternative to permanent weight changes

---

## Pros & Cons

### Pros
- **Fills a real gap** — No existing Hermes capability for model weight manipulation or refusal removal. This is a genuinely unique workflow.
- **Low integration cost** — As a skill wrapping a CLI tool, implementation is straightforward with no codebase changes needed.
- **Consumer-friendly** — Supports 4-bit quantization, CPU offloading, and tiered model presets. Users with an RTX 3060+ can abliterate 8B models; RTX 3090/4090 users can handle 32B+ models.
- **Well-engineered source** — 36.5k LOC with 13 methods, 27 analysis modules, 837 tests. This is not a toy project.
- **The informed pipeline is genuinely novel** — Auto-configuring abliteration based on real-time analysis of the model's refusal geometry is a significant advancement over manual parameter tuning.
- **Complements existing mlops skills** — Natural workflow: abliterate → fine-tune (Axolotl) → serve (vLLM). Each step has its own skill.
- **Aligned with Nous' mission** — Nous Research has a history of creating uncensored/open models. Providing tooling for customers to do the same is on-brand.

### Cons / Risks
- **AGPL-3.0 license** — Most restrictive common OSS license. Must NEVER be imported as a Python library. CLI-only invocation is safe but requires discipline. Should be clearly documented in the skill.
- **Large dependency footprint** — PyTorch + Transformers + bitsandbytes is ~5-10GB of dependencies. Installation is heavyweight.
- **GPU required for practical use** — While tiny models work on CPU, any meaningful abliteration requires at least an 8GB GPU. This limits the user base.
- **Quality risks** — Aggressive abliteration can damage model coherence. The skill must emphasize verification and recommend conservative methods first.
- **Ethical surface area** — This tool's explicit purpose is removing safety guardrails. While legitimate for research and open-model customization, it may attract scrutiny. The skill should frame it professionally (model customization, not "jailbreaking").
- **Telemetry awareness** — While disabled by default locally, users should know about the telemetry system and make informed choices. The skill should mention this.
- **Beta software (v0.1.2)** — No API stability guarantees. The skill may need updates as OBLITERATUS evolves.

---

## Open Questions

1. **Method recommendation heuristic** — Should the skill include a decision tree for method selection (e.g., "MoE model → use expert-granular or nuclear; dense reasoning model → use CoT-aware; first time → use informed")? Or should it always recommend `informed` as the default?
2. **HuggingFace integration** — Should the skill handle HF Hub authentication and model upload, or delegate that to a separate workflow?
3. **Gated model access** — Many models (Llama 3, Gemma) require HF access tokens. Should the skill handle token management or assume `huggingface-cli login` is pre-configured?
4. **Steering vectors vs. permanent abliteration** — Should the skill give equal weight to reversible steering vectors as an alternative to permanent weight modification? Steering vectors are lower risk but require inference-time hooks.
5. **Batch workflows** — Is there demand for batch abliteration (e.g., "abliterate all Llama 3 sizes with the same method and compare")?
6. **Integration with SAE skill** — The existing `sparse-autoencoder-training` skill trains SAEs; OBLITERATUS can use SAE features for targeted abliteration. Should these skills cross-reference each other?

---

## References

- [elder-plinius/OBLITERATUS](https://github.com/elder-plinius/OBLITERATUS) — Source code (AGPL-3.0, ~36.5k LOC)
- [HuggingFace Spaces demo](https://huggingface.co/spaces/pliny-the-prompter/obliteratus) — ZeroGPU hosted version
- [Arditi et al. 2024 — "Refusal in Language Models Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717) — Foundational paper for the basic method
- [FailSpy/abliterator](https://github.com/FailSpy/abliterator) — Original abliterator implementation
- [Refusal Direction Optimization (RDO)](https://arxiv.org/abs/2411.14793) — ICML 2025 baseline
- Hermes `sparse-autoencoder-training` skill — Related interpretability tooling
- Hermes `axolotl` skill — Complementary fine-tuning workflow
- Hermes `serving-llms-vllm` skill — Complementary serving workflow
