Overview
OBLITERATUS is an open-source toolkit (~36.5k LOC Python) for surgically removing refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. It uses mechanistic interpretability techniques — primarily SVD-based weight projection — to identify and excise "refusal directions" from model weights while preserving reasoning capabilities.
The toolkit offers 13 abliteration methods (from faithful reproductions of FailSpy, Gabliteration, Heretic, and RDO to novel pipelines like spectral cascade, CoT-aware, and expert-granular for MoE models), 27 analysis modules for mapping refusal geometry, and a 6-stage pipeline (SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH). It supports models from GPT-2 (CPU) to DeepSeek-V3 (multi-GPU) across 5 compute tiers with 116 curated model presets.
This is relevant to a subset of Nous Research customers who want to create uncensored models from open-weight base models. A Hermes Agent skill would let users orchestrate the full abliteration workflow — from model selection through analysis, abliteration, verification, and HuggingFace upload — via natural language.
Source: elder-plinius/OBLITERATUS (AGPL-3.0)
Research Findings
How OBLITERATUS Works
The core pipeline operates in 6 stages:
- SUMMON — Load model via HuggingFace Transformers with auto device_map, optional 4-bit/8-bit quantization via bitsandbytes.
- PROBE — Forward-pass harmful & harmless prompt pairs through all layers, collecting per-layer hidden-state activations at the last token position. Also collects jailbreak-contrastive activations and MoE router logits.
- DISTILL — Extract refusal directions via SVD on `(harmful_mean - harmless_mean)` per layer. Methods include standard diff-in-means, Whitened SVD (covariance-normalized), and Wasserstein-optimal extraction. Layer selection uses "knee detection" via COSMIC cosine similarity.
- EXCISE — Project out refusal directions from ALL weight matrices: attention Q/K/V projections, output projections, FFN up/down/gate, MoE routers, shared experts, embeddings. Uses norm-preserving projection. Advanced techniques include: reflection (inversion), per-expert directions for MoE, attention head surgery, SAE feature-level targeting, safety neuron masking, activation steering, and expert transplant.
- VERIFY — Generate responses to test prompts, measure refusal rate, compute KL divergence vs baseline, perplexity, and coherence.
- REBIRTH — Save the modified model + tokenizer + `abliteration_metadata.json`.
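The DISTILL and EXCISE stages above can be illustrated with a minimal NumPy sketch. It is not OBLITERATUS code: the function names are hypothetical, only the single-direction diff-in-means case is shown, and "norm-preserving" is interpreted here as rescaling the edited matrix back to its original Frobenius norm, which is an assumption about what the toolkit does.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference-in-means direction for one layer.

    Both inputs have shape (n_prompts, hidden_dim) and hold last-token
    hidden states. Function name and shapes are illustrative assumptions.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project_out(weight: np.ndarray, direction: np.ndarray,
                preserve_norm: bool = True) -> np.ndarray:
    """Remove `direction` from the output space of `weight`.

    `weight` has shape (hidden_dim, in_dim). The edit is
    W' = W - r r^T W, so r^T W' = 0 for a unit vector r.
    """
    r = direction.reshape(-1, 1)
    new_weight = weight - r @ (r.T @ weight)
    if preserve_norm:
        # Assumed reading of "norm-preserving": restore the Frobenius norm.
        # A scalar rescale keeps the result orthogonal to `direction`.
        old_norm = np.linalg.norm(weight)
        new_norm = np.linalg.norm(new_weight)
        if new_norm > 0:
            new_weight = new_weight * (old_norm / new_norm)
    return new_weight
```

With a single difference vector, SVD reduces to normalizing that vector. The multi-direction methods listed below would need a matrix of differences and are not covered by this sketch.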
13 Abliteration Methods
| Method | Description | Complexity |
|---|---|---|
| `basic` | Single direction diff-in-means (Arditi et al. 2024) | Low |
| `failspy` | FailSpy/abliterator reproduction | Low |
| `gabliteration` | Gabliteration reproduction | Low |
| `heretic` | Heretic/p-e-w reproduction | Medium |
| `rdo` | Refusal Direction Optimization (ICML 2025) | Medium |
| `advanced` | Multi-dir SVD + norm-preserving (default) | Medium |
| `aggressive` | Full whitened SVD + jailbreak contrast + head surgery | High |
| `spectral_cascade` | DCT frequency-domain decomposition | High |
| `informed` | Analysis-guided auto-configuration (the killer feature) | High |
| `surgical` | All SOTA: SAE + neuron masking + head surgery + per-expert | Very High |
| `optimized` | Bayesian auto-tuning via Optuna TPE | Very High |
| `inverted` | Semantic inversion (reflects refusal direction) | Medium |
| `nuclear` | Maximum force combo for stubborn MoE models | Extreme |
The Informed Pipeline (Key Innovation)
The InformedAbliterationPipeline (986 lines) runs analysis DURING abliteration to auto-configure every decision:
- AlignmentImprintDetector → Detects if model was trained via DPO/RLHF/CAI/SFT from subspace geometry → Sets regularization strength
- ConceptConeAnalyzer → Determines if refusal is polyhedral vs linear → Sets number of directions
- CrossLayerAlignmentAnalyzer → Cluster-aware layer selection → Chooses which layers to modify
- DefenseRobustnessEvaluator → Assesses self-repair risk → Sets refinement passes
- Ouroboros compensation loop → Re-probes/re-excises if refusal persists after initial pass
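The compensation loop in the last bullet can be pictured as a bounded retry loop. The sketch below is a hypothetical illustration, not the pipeline's actual code: the two callables, the threshold, and the pass budget are all assumed names and defaults.

```python
from typing import Callable, Tuple


def compensation_loop(
    measure_refusal_rate: Callable[[], float],
    excise_once: Callable[[], None],
    threshold: float = 0.05,   # assumed acceptable refusal rate
    max_passes: int = 3,       # assumed upper bound on extra passes
) -> Tuple[int, float]:
    """Re-probe and re-excise until refusal drops or the budget is spent.

    Returns the number of extra passes performed and the final refusal rate.
    """
    passes = 0
    rate = measure_refusal_rate()
    while rate > threshold and passes < max_passes:
        excise_once()                  # one more excision pass
        passes += 1
        rate = measure_refusal_rate()  # re-probe after the pass
    return passes, rate
```

Bounding the number of passes matters because each extra excision risks further damage to coherence, so a loop that never converges should stop rather than keep cutting.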
27 Analysis Modules (~10,400 lines)
Deep interpretability tools for understanding refusal geometry before touching weights:
- Refusal Logit Lens — Identifies the specific layer where a model decides to refuse
- Causal Tracing — Determines which components are causally necessary for refusal
- Residual Stream Decomposition — Analyzes Attention vs MLP contribution to refusal
- Alignment Imprint Detection — Fingerprints DPO vs RLHF vs CAI from subspace geometry
- Concept Cone Geometry — Maps per-category guardrails (polyhedral vs linear)
- Ouroboros Effect Detection — Measures if guardrails self-repair after removal
- Cross-Model Transfer — Tests if refusal directions transfer between models
- Riemannian Manifold Geometry — Weight manifold analysis (673 lines)
- SAE-based Abliteration — Sparse Autoencoder feature decomposition (762 lines)
- Whitened SVD — Covariance-normalized direction extraction
- Steering Vectors — Inference-time behavior modification (reversible)
- Plus 16 more modules
Hardware Requirements
| Tier | VRAM | Example Models |
|---|---|---|
| Tiny | CPU/<1GB | GPT-2 (124M), TinyLlama 1.1B, SmolLM 135-360M |
| Small | 4-8GB | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 1B/3B |
| Medium | 8-16GB | Llama 3.1 8B (4bit), Mistral 7B, Gemma 2 9B |
| Large | 24GB+ | Qwen3-32B (4bit), Llama 3.1 70B (4bit), Mistral Large 2 (4bit) |
| Frontier | Multi-GPU | DeepSeek-V3 (685B), Llama 3.1 405B, Qwen3-235B |
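A helper script could map detected hardware onto these tiers. The sketch below is one possible mapping under stated assumptions: the table leaves the 1-4GB and 16-24GB ranges unassigned, so they are rounded down to the lower tier here, an 8GB card is treated as Medium, and "Multi-GPU" is read as two or more GPUs with at least 24GB each. The function name is hypothetical and GPU detection itself is left out.

```python
def recommend_tier(vram_gb: float, gpu_count: int = 1) -> str:
    """Map per-GPU VRAM (in GB) and GPU count to a compute tier name.

    Thresholds follow the hardware table; behavior in the gaps the table
    does not cover is an assumption of this sketch.
    """
    if gpu_count > 1 and vram_gb >= 24:
        return "frontier"
    if vram_gb >= 24:
        return "large"
    if vram_gb >= 8:
        return "medium"
    if vram_gb >= 4:
        return "small"
    return "tiny"  # CPU-only or very small GPUs
```

The returned name could then be passed to `obliteratus models --tier <tier>`. Only `medium` appears as a tier value in the CLI examples, so the other lowercase names are assumed to follow the same pattern.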
CLI Interface
```bash
# Primary command — abliterate a model
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# With options
obliteratus obliterate <model> \
  --method informed \
  --quantization 4bit \
  --device auto \
  --output-dir ./liberated-models

# Browse models by compute tier
obliteratus models --tier medium

# Interactive guided mode
obliteratus interactive

# Launch Gradio web UI
obliteratus ui

# Run study from YAML config
obliteratus run study_config.yaml --preset jailbreak

# Model architecture info
obliteratus info meta-llama/Llama-3.1-8B-Instruct
```
Telemetry
- Disabled by default for local installs (opt-in via `OBLITERATUS_TELEMETRY=1`)
- Auto-enabled on HuggingFace Spaces (when `SPACE_ID` env var detected)
- Collects only: model_id, method, benchmark scores, hardware info, timing
- Does NOT collect: IP addresses, user identity, prompt content
- Stored locally in `~/.obliteratus/telemetry.jsonl`, synced to HF dataset only from Spaces
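A skill could check the telemetry state before running anything, so the user can make an informed choice. The sketch below mirrors the rules listed above and is not the toolkit's own logic. The function name is hypothetical, and because the source does not say whether `OBLITERATUS_TELEMETRY=0` overrides the Spaces default, this sketch assumes it does not.

```python
import os
from typing import Mapping, Optional


def telemetry_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if telemetry would be on under the documented rules.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to test.
    """
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        # Documented: auto-enabled when running on HuggingFace Spaces.
        return True
    # Documented: local installs are opt-in via OBLITERATUS_TELEMETRY=1.
    return env.get("OBLITERATUS_TELEMETRY") == "1"
```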
Dependencies
Core runtime: torch>=2.0, transformers>=4.40, datasets>=2.14, accelerate>=0.24, safetensors>=0.4, bitsandbytes>=0.46.1, pyyaml, rich, matplotlib, seaborn, pandas, numpy, scikit-learn, tqdm.
Optional: gradio>=5.0 (Web UI), optuna (Bayesian optimization).
Current State in Hermes Agent
Hermes Agent has no model weight manipulation, abliteration, or refusal removal capability. Related existing components:
- SAE Training Skill (`sparse-autoencoder-training`) — Trains Sparse Autoencoders for interpretability; conceptually adjacent (both deal with model internals) but different purpose
- Axolotl Skill — Fine-tuning LLMs; complementary (fine-tune after abliteration)
- Unsloth Skill — Fast fine-tuning; complementary
- vLLM Skill — Serving LLMs; complementary (serve the abliterated model)
- HuggingFace Tokenizers Skill — Tokenizer management; tangentially related
No existing open issues cover abliteration, refusal removal, or weight projection topics.
Implementation Plan
Skill vs. Tool Classification
This should be a skill (Skills Hub, not bundled) because:
- CLI-driven: OBLITERATUS provides a full CLI (`obliteratus obliterate`, `obliteratus interactive`, `obliteratus ui`, etc.) that the agent can invoke via terminal
- No custom Python integration needed: All operations go through the CLI or can be orchestrated via shell commands
- AGPL-3.0 license: The tool MUST remain an external process, never imported as a Python library into the MIT-licensed Hermes Agent codebase. CLI invocation is "mere aggregation" — permitted under AGPL
- Specialized audience: Only ML practitioners doing model customization need this — not broadly useful enough to bundle
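To keep OBLITERATUS an external process, a helper script might build the command line and hand it to `subprocess` instead of importing the package. The sketch below uses only the `obliterate` subcommand and flags shown in the CLI section. The function names and the choice of `advanced` as the default are assumptions of this sketch.

```python
import shlex
import subprocess
from typing import List, Optional


def build_obliterate_command(
    model: str,
    method: str = "advanced",
    quantization: Optional[str] = None,
    device: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `obliteratus obliterate`."""
    cmd = ["obliteratus", "obliterate", model, "--method", method]
    if quantization:
        cmd += ["--quantization", quantization]
    if device:
        cmd += ["--device", device]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


def run_obliterate(cmd: List[str]) -> int:
    """Run the CLI as a separate process and return its exit code."""
    print("Running:", shlex.join(cmd))
    # No shell=True: the argv list is passed directly, so model names
    # and paths are not subject to shell interpretation.
    return subprocess.run(cmd, check=False).returncode
```

Passing an argv list rather than a shell string avoids quoting problems with model IDs and output paths, and it keeps the boundary between the two codebases at the process level.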
What We'd Need
- SKILL.md — Procedures for the full abliteration workflow
- Template files — YAML config templates for common abliteration scenarios
- Scripts — Helper scripts for GPU detection, model size estimation, method recommendation
Phased Rollout
Phase 1: Core Abliteration Skill (MVP)
- Installation instructions (`pip install -e .` from git clone)
- GPU/VRAM detection and model tier recommendation
- Model browsing via `obliteratus models --tier <tier>`
- Basic abliteration workflow: model selection → method recommendation → execution → verification
- Method selection guidance (which method for which model family/size)
- Output handling: save locally, verify coherence
- Template YAML configs for common scenarios (8B dense model, 7B MoE, etc.)
Phase 2: Analysis-Informed Workflow
- Pre-abliteration analysis: run relevant analysis modules to understand refusal geometry
- Informed pipeline usage: teach agent when to use `--method informed` vs specific methods
- Interpretation of analysis results (alignment imprint, concept geometry, self-repair risk)
- Iterative refinement: if first pass leaves residual refusal, adjust parameters and re-run
- Comparison workflows: run multiple methods and compare results
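Comparing runs requires a shared pair of metrics. The sketch below shows two of the metrics named in the VERIFY stage, refusal rate and KL divergence against a baseline, plus a simple selection rule. It is an illustration under assumptions: the refusal markers are a hypothetical keyword list, KL is computed over plain probability lists, and the rule "lowest refusal rate subject to a KL cap" is one possible comparison policy, not something the toolkit prescribes.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical marker phrases; a real check would need a broader list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses containing any refusal marker."""
    if not responses:
        return 0.0
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def pick_method(results: Dict[str, Tuple[float, float]], max_kl: float) -> Optional[str]:
    """Choose the method with the lowest refusal rate whose KL stays under a cap.

    `results` maps a method name to (refusal_rate, kl). Returns None when
    no method satisfies the cap.
    """
    eligible = {m: v for m, v in results.items() if v[1] <= max_kl}
    if not eligible:
        return None
    # Tuples compare by refusal rate first, then by KL as a tie-breaker.
    return min(eligible, key=lambda m: eligible[m])
```

Capping KL before ranking by refusal rate reflects the concern that an aggressive method can remove refusals while also drifting far from the baseline model's behavior.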
Phase 3: Integration & Publishing
- HuggingFace Hub upload of abliterated models with proper metadata
- Model card generation with abliteration details and benchmark results
- Integration with existing Hermes skills: abliterate → fine-tune (Axolotl/Unsloth) → serve (vLLM)
- Batch abliteration: process multiple models in sequence
- Steering vector workflows: reversible inference-time modifications as alternative to permanent weight changes
Pros & Cons
Pros
- Fills a real gap — No existing Hermes capability for model weight manipulation or refusal removal. This is a genuinely unique workflow.
- Low integration cost — As a skill wrapping a CLI tool, implementation is straightforward with no codebase changes needed.
- Runs on consumer hardware — Supports 4-bit quantization, CPU offloading, and tiered model presets. Users with an RTX 3060 or better can abliterate 8B models; RTX 3090/4090 users can handle 32B+ models with 4-bit quantization.
- Well-engineered source — 36.5k LOC with 13 methods, 27 analysis modules, 837 tests. This is not a toy project.
- The informed pipeline is genuinely novel — Auto-configuring abliteration based on real-time analysis of the model's refusal geometry is a significant advancement over manual parameter tuning.
- Complements existing mlops skills — Natural workflow: abliterate → fine-tune (Axolotl) → serve (vLLM). Each step has its own skill.
- Aligned with Nous' mission — Nous Research has a history of creating uncensored/open models. Providing tooling for customers to do the same is on-brand.
Cons / Risks
- AGPL-3.0 license — One of the most restrictive widely used OSS licenses. OBLITERATUS must never be imported as a Python library; CLI-only invocation keeps it a separate process but requires discipline. The skill should document this clearly.
- Large dependency footprint — PyTorch + Transformers + bitsandbytes is ~5-10GB of dependencies. Installation is heavyweight.
- GPU required for practical use — While tiny models work on CPU, any meaningful abliteration requires at least an 8GB GPU. This limits the user base.
- Quality risks — Aggressive abliteration can damage model coherence. The skill must emphasize verification and recommend conservative methods first.
- Ethical surface area — This tool's explicit purpose is removing safety guardrails. While legitimate for research and open-model customization, it may attract scrutiny. The skill should frame it professionally (model customization, not "jailbreaking").
- Telemetry awareness — While disabled by default locally, users should know about the telemetry system and make informed choices. The skill should mention this.
- Beta software (v0.1.2) — No API stability guarantees. The skill may need updates as OBLITERATUS evolves.
Open Questions
- Method recommendation heuristic — Should the skill include a decision tree for method selection (e.g., "MoE model → use expert-granular or nuclear; dense reasoning model → use CoT-aware; first time → use informed")? Or should it always recommend `informed` as the default?
- HuggingFace integration — Should the skill handle HF Hub authentication and model upload, or delegate that to a separate workflow?
- Gated model access — Many models (Llama 3, Gemma) require HF access tokens. Should the skill handle token management or assume `huggingface-cli login` is pre-configured?
- Steering vectors vs. permanent abliteration — Should the skill give equal weight to reversible steering vectors as an alternative to permanent weight modification? Steering vectors are lower risk but require inference-time hooks.
- Batch workflows — Is there demand for batch abliteration (e.g., "abliterate all Llama 3 sizes with the same method and compare")?
- Integration with SAE skill — The existing `sparse-autoencoder-training` skill trains SAEs; OBLITERATUS can use SAE features for targeted abliteration. Should these skills cross-reference each other?
References
- `sparse-autoencoder-training` skill — Related interpretability tooling
- `axolotl` skill — Complementary fine-tuning workflow
- `serving-llms-vllm` skill — Complementary serving workflow