
Feature: OBLITERATUS Skill — LLM Refusal Removal via SVD-Based Weight Projection #407

@teknium1

Overview

OBLITERATUS is an open-source toolkit (~36.5k LOC Python) for surgically removing refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. It uses mechanistic interpretability techniques — primarily SVD-based weight projection — to identify and excise "refusal directions" from model weights while preserving reasoning capabilities.

The toolkit offers 13 abliteration methods (from faithful reproductions of FailSpy, Gabliteration, Heretic, and RDO to novel pipelines like spectral cascade, CoT-aware, and expert-granular for MoE models), 27 analysis modules for mapping refusal geometry, and a 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH). It supports models from GPT-2 (CPU) to DeepSeek-V3 (multi-GPU) across 5 compute tiers with 116 curated model presets.

This is relevant to a subset of Nous Research customers who want to create uncensored models from open-weight base models. A Hermes Agent skill would let users orchestrate the full abliteration workflow — from model selection through analysis, abliteration, verification, and HuggingFace upload — via natural language.

Source: elder-plinius/OBLITERATUS (AGPL-3.0)


Research Findings

How OBLITERATUS Works

The core pipeline operates in 6 stages:

  1. SUMMON — Load model via HuggingFace Transformers with auto device_map, optional 4-bit/8-bit quantization via bitsandbytes.

  2. PROBE — Forward-pass harmful & harmless prompt pairs through all layers, collecting per-layer hidden-state activations at the last token position. Also collects jailbreak-contrastive activations and MoE router logits.

  3. DISTILL — Extract refusal directions via SVD on (harmful_mean - harmless_mean) per layer. Methods include standard diff-in-means, Whitened SVD (covariance-normalized), and Wasserstein-optimal extraction. Layer selection uses "knee detection" via COSMIC cosine similarity.

  4. EXCISE — Project out refusal directions from ALL weight matrices: attention Q/K/V projections, output projections, FFN up/down/gate, MoE routers, shared experts, embeddings. Uses norm-preserving projection. Advanced techniques include: reflection (inversion), per-expert directions for MoE, attention head surgery, SAE feature-level targeting, safety neuron masking, activation steering, and expert transplant.

  5. VERIFY — Generate responses to test prompts, measure refusal rate, compute KL divergence vs baseline, perplexity, and coherence.

  6. REBIRTH — Save the modified model + tokenizer + abliteration_metadata.json.
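
Stages 3 and 4 above can be illustrated on toy data. The following is a minimal sketch, assuming the simplest case (a single diff-in-means direction and a plain orthogonal projection); it is not OBLITERATUS's actual code, and it omits the multi-direction SVD, whitening, and norm-preserving steps the issue mentions. The function names are hypothetical, and pure Python lists stand in for the tensors a real implementation would use.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (harmful_mean - harmless_mean), i.e. basic diff-in-means."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

def project_out(weight, direction):
    """Return (I - r r^T) W for W stored as d_out rows of d_in columns.

    Assumes W writes into the hidden space (y = W x), so removing the
    component of every output along r means subtracting r * (r^T W).
    """
    d_out, d_in = len(weight), len(weight[0])
    r_dot_w = [sum(direction[i] * weight[i][j] for i in range(d_out))
               for j in range(d_in)]
    return [[weight[i][j] - direction[i] * r_dot_w[j] for j in range(d_in)]
            for i in range(d_out)]

# Toy 3-dimensional "hidden states", two prompts per class.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # d_out = 3, d_in = 2
W_ablated = project_out(W, r)
```

After the projection, every column of `W_ablated` has zero component along `r`, so the layer can no longer write along that direction.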

13 Abliteration Methods

| Method | Description | Complexity |
|---|---|---|
| basic | Single direction diff-in-means (Arditi et al. 2024) | Low |
| failspy | FailSpy/abliterator reproduction | Low |
| gabliteration | Gabliteration reproduction | Low |
| heretic | Heretic/p-e-w reproduction | Medium |
| rdo | Refusal Direction Optimization (ICML 2025) | Medium |
| advanced | Multi-dir SVD + norm-preserving (default) | Medium |
| aggressive | Full whitened SVD + jailbreak contrast + head surgery | High |
| spectral_cascade | DCT frequency-domain decomposition | High |
| informed | Analysis-guided auto-configuration (the killer feature) | High |
| surgical | All SOTA: SAE + neuron masking + head surgery + per-expert | Very High |
| optimized | Bayesian auto-tuning via Optuna TPE | Very High |
| inverted | Semantic inversion (reflects refusal direction) | Medium |
| nuclear | Maximum force combo for stubborn MoE models | Extreme |

The Informed Pipeline (Key Innovation)

The InformedAbliterationPipeline (986 lines) runs analysis DURING abliteration to auto-configure every decision:

  1. AlignmentImprintDetector → Detects if model was trained via DPO/RLHF/CAI/SFT from subspace geometry → Sets regularization strength
  2. ConceptConeAnalyzer → Determines if refusal is polyhedral vs linear → Sets number of directions
  3. CrossLayerAlignmentAnalyzer → Cluster-aware layer selection → Chooses which layers to modify
  4. DefenseRobustnessEvaluator → Assesses self-repair risk → Sets refinement passes
  5. Ouroboros compensation loop → Re-probes/re-excises if refusal persists after initial pass

27 Analysis Modules (~10,400 lines)

Deep interpretability tools for understanding refusal geometry before touching weights:

  • Refusal Logit Lens — Identifies the specific layer where a model decides to refuse
  • Causal Tracing — Determines which components are causally necessary for refusal
  • Residual Stream Decomposition — Analyzes Attention vs MLP contribution to refusal
  • Alignment Imprint Detection — Fingerprints DPO vs RLHF vs CAI from subspace geometry
  • Concept Cone Geometry — Maps per-category guardrails (polyhedral vs linear)
  • Ouroboros Effect Detection — Measures if guardrails self-repair after removal
  • Cross-Model Transfer — Tests if refusal directions transfer between models
  • Riemannian Manifold Geometry — Weight manifold analysis (673 lines)
  • SAE-based Abliteration — Sparse Autoencoder feature decomposition (762 lines)
  • Whitened SVD — Covariance-normalized direction extraction
  • Steering Vectors — Inference-time behavior modification (reversible)
  • Plus 16 more modules

Hardware Requirements

| Tier | VRAM | Example Models |
|---|---|---|
| Tiny | CPU/<1GB | GPT-2 (124M), TinyLlama 1.1B, SmolLM 135-360M |
| Small | 4-8GB | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 1B/3B |
| Medium | 8-16GB | Llama 3.1 8B (4bit), Mistral 7B, Gemma 2 9B |
| Large | 24GB+ | Qwen3-32B (4bit), Llama 3.1 70B (4bit), Mistral Large 2 (4bit) |
| Frontier | Multi-GPU | DeepSeek-V3 (685B), Llama 3.1 405B, Qwen3-235B |
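
A helper script for tier recommendation could start from the table above. This is a rough sketch under stated assumptions: `estimate_weight_gb` and `tier_for_vram` are hypothetical names, VRAM values that fall between the listed ranges (1-4 GB, 16-24 GB) are rounded down to the lower tier, exactly 8 GB is treated as Medium, and the Frontier tier is not modeled because it depends on the multi-GPU setup rather than a single VRAM figure.

```python
def estimate_weight_gb(params_billion, bits_per_param=16):
    """Approximate memory for the weights alone: parameters x bits / 8.

    Activations, KV cache, and framework overhead are not included,
    so real usage will be higher than this figure.
    """
    return params_billion * bits_per_param / 8

def tier_for_vram(vram_gb):
    """Map single-GPU VRAM (in GB) to a tier name from the table.

    Assumption: values in the gaps between listed ranges round down
    to the lower tier; 0 means CPU-only.
    """
    if vram_gb >= 24:
        return "Large"
    if vram_gb >= 8:
        return "Medium"
    if vram_gb >= 4:
        return "Small"
    return "Tiny"

# An 8B model quantized to 4 bits needs roughly 4 GB for weights.
weights_8b_4bit = estimate_weight_gb(8, bits_per_param=4)
tier_12gb = tier_for_vram(12)
```

The weight estimate is consistent with the table placing Llama 3.1 8B (4bit) in the 8-16GB tier once runtime overhead is added on top of the weights.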

CLI Interface

# Primary command — abliterate a model
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# With options
obliteratus obliterate <model> \
  --method informed \
  --quantization 4bit \
  --device auto \
  --output-dir ./liberated-models

# Browse models by compute tier
obliteratus models --tier medium

# Interactive guided mode
obliteratus interactive

# Launch Gradio web UI
obliteratus ui

# Run study from YAML config
obliteratus run study_config.yaml --preset jailbreak

# Model architecture info
obliteratus info meta-llama/Llama-3.1-8B-Instruct
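
Since the skill would drive OBLITERATUS only through its CLI, a small wrapper could assemble the argument list and run it as a separate process. The sketch below is an assumption about how such a wrapper might look: `build_obliterate_command` and `run_obliterate` are hypothetical names, and only the flags shown in the commands above are used.

```python
import subprocess

def build_obliterate_command(model, method="advanced", quantization=None,
                             device=None, output_dir=None):
    """Build the argv list for `obliteratus obliterate`.

    Only flags that appear in this issue are emitted; optional flags
    are skipped when not provided.
    """
    argv = ["obliteratus", "obliterate", model, "--method", method]
    if quantization is not None:
        argv += ["--quantization", quantization]
    if device is not None:
        argv += ["--device", device]
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    return argv

def run_obliterate(argv):
    """Run the CLI as an external process and return its exit code."""
    return subprocess.run(argv, check=False).returncode

cmd = build_obliterate_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    method="informed",
    quantization="4bit",
    device="auto",
    output_dir="./liberated-models",
)
```

Passing a list rather than a shell string avoids quoting problems with model IDs and paths, and keeps the tool at arm's length from the agent process.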

Telemetry

  • Disabled by default for local installs (opt-in via OBLITERATUS_TELEMETRY=1)
  • Auto-enabled on HuggingFace Spaces (when SPACE_ID env var detected)
  • Collects only: model_id, method, benchmark scores, hardware info, timing
  • Does NOT collect: IP addresses, user identity, prompt content
  • Stored locally in ~/.obliteratus/telemetry.jsonl, synced to HF dataset only from Spaces
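
The skill could surface the effective telemetry state to the user before a run. The following sketch mirrors the rules listed above rather than the tool's own code; `telemetry_enabled` is a hypothetical helper, and it assumes that any non-empty `SPACE_ID` counts as "detected" and that only the exact value `1` opts in.

```python
import os

def telemetry_enabled(env=None):
    """Return True if telemetry would be active under the rules above.

    Assumptions: a non-empty SPACE_ID means a HuggingFace Space, and
    OBLITERATUS_TELEMETRY must equal "1" to opt in locally.
    """
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        return True
    return env.get("OBLITERATUS_TELEMETRY") == "1"

local_default = telemetry_enabled({})
local_opt_in = telemetry_enabled({"OBLITERATUS_TELEMETRY": "1"})
on_spaces = telemetry_enabled({"SPACE_ID": "user/space"})
```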

Dependencies

Core runtime: torch>=2.0, transformers>=4.40, datasets>=2.14, accelerate>=0.24, safetensors>=0.4, bitsandbytes>=0.46.1, pyyaml, rich, matplotlib, seaborn, pandas, numpy, scikit-learn, tqdm.

Optional: gradio>=5.0 (Web UI), optuna (Bayesian optimization).


Current State in Hermes Agent

Hermes Agent has no model weight manipulation, abliteration, or refusal removal capability. Related existing components:

  • SAE Training Skill (sparse-autoencoder-training) — Trains Sparse Autoencoders for interpretability; conceptually adjacent (both deal with model internals) but different purpose
  • Axolotl Skill — Fine-tuning LLMs; complementary (fine-tune after abliteration)
  • Unsloth Skill — Fast fine-tuning; complementary
  • vLLM Skill — Serving LLMs; complementary (serve the abliterated model)
  • HuggingFace Tokenizers Skill — Tokenizer management; tangentially related

No existing open issues cover abliteration, refusal removal, or weight projection topics.


Implementation Plan

Skill vs. Tool Classification

This should be a skill (Skills Hub, not bundled) because:

  1. CLI-driven: OBLITERATUS provides a full CLI (obliteratus obliterate, obliteratus interactive, obliteratus ui, etc.) that the agent can invoke via terminal
  2. No custom Python integration needed: All operations go through the CLI or can be orchestrated via shell commands
  3. AGPL-3.0 license: The tool MUST remain an external process, never imported as a Python library into the MIT-licensed Hermes Agent codebase. Invoking it over the CLI keeps the two programs separate works (the "mere aggregation" case), which the AGPL generally permits
  4. Specialized audience: Only ML practitioners doing model customization need this — not broadly useful enough to bundle

What We'd Need

  1. SKILL.md — Procedures for the full abliteration workflow
  2. Template files — YAML config templates for common abliteration scenarios
  3. Scripts — Helper scripts for GPU detection, model size estimation, method recommendation

Phased Rollout

Phase 1: Core Abliteration Skill (MVP)

  • Installation instructions (pip install -e . from git clone)
  • GPU/VRAM detection and model tier recommendation
  • Model browsing via obliteratus models --tier <tier>
  • Basic abliteration workflow: model selection → method recommendation → execution → verification
  • Method selection guidance (which method for which model family/size)
  • Output handling: save locally, verify coherence
  • Template YAML configs for common scenarios (8B dense model, 7B MoE, etc.)
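
The verification step in the workflow above could be backed by a couple of simple metrics. This is a minimal sketch, not the VERIFY stage's implementation: the function names are hypothetical, the KL direction (baseline relative to modified) is an assumption, and how individual responses get flagged as refusals is left out of scope.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def refusal_rate(flags):
    """Fraction of test responses flagged as refusals (booleans)."""
    return sum(flags) / len(flags) if flags else 0.0

# Toy next-token logits from the baseline and the modified model.
baseline = softmax([2.0, 1.0, 0.0])
modified = softmax([1.5, 1.2, 0.1])
drift = kl_divergence(baseline, modified)
rate = refusal_rate([True, False, False, False])
```

A low `drift` on harmless prompts together with a low `rate` on the test prompts would suggest the change was targeted; a large `drift` is a signal to fall back to a more conservative method.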

Phase 2: Analysis-Informed Workflow

  • Pre-abliteration analysis: run relevant analysis modules to understand refusal geometry
  • Informed pipeline usage: teach agent when to use --method informed vs specific methods
  • Interpretation of analysis results (alignment imprint, concept geometry, self-repair risk)
  • Iterative refinement: if first pass leaves residual refusal, adjust parameters and re-run
  • Comparison workflows: run multiple methods and compare results

Phase 3: Integration & Publishing

  • HuggingFace Hub upload of abliterated models with proper metadata
  • Model card generation with abliteration details and benchmark results
  • Integration with existing Hermes skills: abliterate → fine-tune (Axolotl/Unsloth) → serve (vLLM)
  • Batch abliteration: process multiple models in sequence
  • Steering vector workflows: reversible inference-time modifications as alternative to permanent weight changes

Pros & Cons

Pros

  • Fills a real gap — No existing Hermes capability for model weight manipulation or refusal removal. This is a genuinely unique workflow.
  • Low integration cost — As a skill wrapping a CLI tool, implementation is straightforward with no codebase changes needed.
  • Consumer-friendly — Supports 4-bit quantization, CPU offloading, and tiered model presets. Users with an RTX 3060+ can abliterate 8B models; RTX 3090/4090 users can handle 32B+ models.
  • Well-engineered source — 36.5k LOC with 13 methods, 27 analysis modules, 837 tests. This is not a toy project.
  • The informed pipeline is genuinely novel — Auto-configuring abliteration based on real-time analysis of the model's refusal geometry is a significant advancement over manual parameter tuning.
  • Complements existing mlops skills — Natural workflow: abliterate → fine-tune (Axolotl) → serve (vLLM). Each step has its own skill.
  • Aligned with Nous' mission — Nous Research has a history of creating uncensored/open models. Providing tooling for customers to do the same is on-brand.

Cons / Risks

  • AGPL-3.0 license — Most restrictive common OSS license. Must NEVER be imported as a Python library. CLI-only invocation is safe but requires discipline. Should be clearly documented in the skill.
  • Large dependency footprint — PyTorch + Transformers + bitsandbytes is ~5-10GB of dependencies. Installation is heavyweight.
  • GPU required for practical use — While tiny models work on CPU, any meaningful abliteration requires at least an 8GB GPU. This limits the user base.
  • Quality risks — Aggressive abliteration can damage model coherence. The skill must emphasize verification and recommend conservative methods first.
  • Ethical surface area — This tool's explicit purpose is removing safety guardrails. While legitimate for research and open-model customization, it may attract scrutiny. The skill should frame it professionally (model customization, not "jailbreaking").
  • Telemetry awareness — While disabled by default locally, users should know about the telemetry system and make informed choices. The skill should mention this.
  • Beta software (v0.1.2) — No API stability guarantees. The skill may need updates as OBLITERATUS evolves.

Open Questions

  1. Method recommendation heuristic — Should the skill include a decision tree for method selection (e.g., "MoE model → use expert-granular or nuclear; dense reasoning model → use CoT-aware; first time → use informed")? Or should it always recommend informed as the default?
  2. HuggingFace integration — Should the skill handle HF Hub authentication and model upload, or delegate that to a separate workflow?
  3. Gated model access — Many models (Llama 3, Gemma) require HF access tokens. Should the skill handle token management or assume huggingface-cli login is pre-configured?
  4. Steering vectors vs. permanent abliteration — Should the skill give equal weight to reversible steering vectors as an alternative to permanent weight modification? Steering vectors are lower risk but require inference-time hooks.
  5. Batch workflows — Is there demand for batch abliteration (e.g., "abliterate all Llama 3 sizes with the same method and compare")?
  6. Integration with SAE skill — The existing sparse-autoencoder-training skill trains SAEs; OBLITERATUS can use SAE features for targeted abliteration. Should these skills cross-reference each other?
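
For question 1, the decision tree could be as small as the sketch below. It is only an illustration of the example given in the question: `recommend_method` is a hypothetical helper, the priority order is an assumption, and the returned strings are the labels used in this issue, which may not match the CLI's `--method` values exactly.

```python
def recommend_method(is_moe=False, is_reasoning_model=False, first_time=True):
    """Return candidate method labels, most preferred first.

    Assumed priority: first-time users get the informed pipeline,
    then MoE models, then dense reasoning models, else the default.
    """
    if first_time:
        return ["informed"]
    if is_moe:
        return ["expert-granular", "nuclear"]
    if is_reasoning_model:
        return ["CoT-aware"]
    return ["advanced"]

first_run = recommend_method()
moe_choice = recommend_method(is_moe=True, first_time=False)
```

Always recommending informed, the alternative raised in the question, corresponds to dropping every branch except the first.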
