Directional ablation to remove refusal behavior from transformer language models — without TransformerLens. Works with any model supported by HuggingFace Transformers.
Forked from Sumandora/remove-refusals-with-transformers and extended for production use across the Zen model family.
Refusal behavior in language models is mediated by a single linear direction in the residual stream. During inference, harmful prompts activate this direction; the model uses it as a signal to refuse. This technique:
- Identifies the refusal direction by comparing hidden states on harmful vs. harmless inputs at a chosen layer
- Ablates the direction at inference time by projecting it out of the residual stream at every layer
The result is a model that processes all inputs without the refusal gate, while preserving all other capabilities: reasoning, instruction following, coding, factual recall.
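A minimal sketch of the extraction step, assuming an already-loaded `model` and `tokenizer`; the function name and signature here are illustrative, not the repo's actual API:

```python
import torch

def compute_refusal_dir(model, tokenizer, harmful, harmless, layer_idx, pos=-1):
    """Difference of mean hidden states at one layer/position, unit-normalized."""
    def mean_hidden(prompts):
        states = []
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model(**inputs, output_hidden_states=True)
            # Hidden state at the chosen layer and token position
            states.append(out.hidden_states[layer_idx][0, pos, :])
        return torch.stack(states).mean(dim=0)

    direction = mean_hidden(harmful) - mean_hidden(harmless)
    return direction / direction.norm()
```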
This implementation is based on:
- "Refusal in LLMs is mediated by a single direction" — Arditi et al. (2024)
- Maxime Labonne's abliteration methodology — practical guide with permanent weight modification
- Harmful Behaviors dataset (AdvBench)
- Alpaca Cleaned (harmless baseline)
The key insight is that refusal is a linear subspace, not a distributed capability. Projecting it out via a hook does not degrade model performance on other tasks — it surgically removes one learned behavior.
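A minimal sketch of the inference-time projection, assuming a Llama-style `model.model.layers` layout; `inference.py`'s actual hook implementation may differ:

```python
import torch

def make_ablation_hook(refusal_dir):
    r = refusal_dir / refusal_dir.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = r.to(hidden)  # match the hidden states' dtype and device
        # Subtract the component of every residual-stream vector along the direction
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

for layer in model.model.layers:
    layer.register_forward_hook(make_ablation_hook(refusal_dir))
```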
```bash
# Edit MODEL_ID in compute_refusal_dir.py, then:
python compute_refusal_dir.py
# Outputs: {MODEL_ID}_refusal_dir.pt
```

```bash
# Loads the saved direction and patches it into every decoder layer:
python inference.py
```

Both scripts share these parameters (edit them at the top of each file):
| Parameter | Default | Description |
|---|---|---|
| `MODEL_ID` | `tiiuae/Falcon3-1B-Instruct` | HuggingFace model ID |
| `layer_idx` | 60% depth | Layer to extract the refusal direction from |
| `pos` | `-1` | Token position (last token = generation position) |
| `instructions` | `32` | Harmful/harmless samples to average over |
For production deployment (rather than inference-time hooks), the refusal direction can be permanently subtracted from the model's weight matrices. This is how the Zen model family is abliterated:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("zenlm/zen-pro-instruct")
refusal_dir = torch.load("refusal_dir.pt")
refusal_dir = refusal_dir / refusal_dir.norm()  # projection requires a unit vector

# Project the refusal direction out of every weight matrix that writes into the
# residual stream: W' = (I - r r^T) W removes the component along r from
# everything the layer emits.
for layer in model.model.layers:
    for W in (layer.self_attn.o_proj.weight,
              layer.mlp.down_proj.weight):
        r = refusal_dir.to(device=W.device, dtype=W.dtype)
        W.data -= torch.outer(r, r @ W)

model.save_pretrained("zen-pro-abliterated")
```

This permanently bakes the ablation into the weights — no hooks required at inference time.
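After saving, a quick sanity check is to generate from the abliterated checkpoint and compare against the base model; the prompt and paths below are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("zen-pro-abliterated")
abliterated = AutoModelForCausalLM.from_pretrained("zen-pro-abliterated")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Describe how a buffer overflow exploit works."}],
    tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt")
out = abliterated.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```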
| Architecture | Example Models |
|---|---|
| Llama-style (`model.model.layers`) | Llama 2/3, Mistral, Qwen3, Falcon3 |
| DeepSeek V3 (`model.model.layers`) | DeepSeek-V3, Kimi K2.5 |
| GLM-4.7-Flash (`model.model.layers`) | zen4-coder-flash base |
- Qwen 1.x: Uses `model.transformer.h` instead of `model.model.layers`. Change the layer access in `inference.py:76`, as shown after this list
- MoE models: Standard direction ablation misses refusal encoded in expert routing gates. See GT-QLoRA for a MoE-specific approach
- 4-bit quantization: Works, but direction computation may be less precise
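The Qwen 1.x change amounts to swapping the attribute path where the decoder layers are fetched (the exact line in `inference.py` may differ):

```python
# Llama-style access (default in inference.py):
layers = model.model.layers
# Qwen 1.x replacement:
layers = model.transformer.h
```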
This technique is applied to all Zen models to remove refusal bias at the weight level:
| Model | Parameters | Abliterated | Repo |
|---|---|---|---|
| Zen Nano | 0.6B | ✅ | zenlm/zen-nano |
| Zen Scribe | 4B | ✅ | zenlm/zen-scribe |
| Zen Pro | 8B | ✅ | zenlm/zen-pro |
| Zen Omni | 30B MoE | ✅ | zenlm/zen-omni |
| Zen4 Mini | 4B | ✅ | zenlm/zen4-mini |
| Zen4 | 8B | ✅ | zenlm/zen4 |
| Zen4 Pro | 14B | ✅ | zenlm/zen4-pro |
| Zen4 Max | 30B MoE | ✅ | zenlm/zen4-max |
| Zen4 Coder | 80B MoE | ✅ | zenlm/zen4-coder |
| Zen4 Coder Flash | 31B MoE | ✅ | zenlm/zen4-coder-flash |
| Zen4 Pro Max | 80B MoE | ✅ | zenlm/zen4-pro-max |
| Zen4 Ultra | 1.04T MoE | 🔄 GT-QLoRA | zenlm/zen4-ultra |
| Zen Designer GGUF | 235B VL MoE | ✅ GGUF | zenlm/zen-designer-gguf |
All abliterated weights are available at huggingface.co/zenlm.
- MoE support: Extend to gate/router ablation for Mixture of Experts architectures (see GT-QLoRA paper)
- Batch processing: Vectorize direction computation for faster extraction on large models
- Auto layer selection: Heuristic to find optimal ablation layer without manual tuning
- Multi-GPU support: Tensor-parallel ablation for 70B+ models
- Evaluation suite: Automated benchmark to measure capability preservation post-ablation
- CLI interface: Single command for end-to-end compute + ablate + save
Safety guardrails baked into model weights are a product decision, not a technical necessity. For applications where:
- Safety is managed at the application layer (filtering, rate limiting, monitoring)
- The deployment context is restricted (research, security testing, enterprise)
- The use case requires unrestricted reasoning (red team tooling, policy analysis, medical/legal)
...having refusal behavior in the weights is actively harmful to the product. Application-layer controls are more flexible, auditable, and appropriate than weight-level restrictions.
This is a research tool. Use responsibly and within applicable law.
```bash
pip install -r requirements.txt
```

`requirements.txt`:

```
torch>=2.0
transformers>=4.40
bitsandbytes
einops
jaxtyping
tqdm
accelerate
```
- Original implementation: Sumandora
- Technique: "Refusal in LLMs is mediated by a single direction" — Arditi et al.
- Production methodology: Maxime Labonne
- Zen model applications: Hanzo AI / Zen LM
Part of the Zen model ecosystem by Hanzo AI (Techstars '17) and Zoo Labs Foundation.