AntiPaSTO: Anti-Parallel Subspace Training for Ordered Steering
Serving up data-efficient inner alignment, one satisfying rotation at a time.
Gradient-based honesty steering trained as an adapter on the model's own representations, not outputs. Human input: two contrasting words, no preference labels.
How it works: Train a single adapter (~1 hour on Gemma-3-1B). At inference, dial the steering coefficient: +1 for more honest, -1 for less, 0 for baseline. One adapter, bidirectional control.
Why use it? As models get more capable, eval awareness rises: models detect when they're being tested and adjust their behavior. You can't take their outputs, their chain-of-thought, or their stated values at face value. You need a method that operates on internal representations rather than outputs, so it works even when the model is gaming the eval. AntiPaSTO steers what the model actually computes. On DailyDilemmas, it outperforms prompting by 6.9x and works where prompting triggers refusal.
Applications:
- Combat eval awareness: steer toward credulity and honesty so the model takes the eval at face value and gives honest answers.
- Find deeper moral preferences: ask moral questions with and without honesty steering. Do stated values change?
- Swap the assistant axis: find it and replace it with a philosopher-king or a poet.
```sh
uv sync --all-groups
uv run pytest tests/test_train.py::test_train_rnd -v  # smoke test (~3 min)
uv run python nbs/train.py tiny --quick               # al dente check
uv run python nbs/train.py                            # full course (Gemma-3-1B)
uv run python -m pytest                               # integration tests
```

```python
from antipasto.peft_utils.load import load_adapter
from antipasto.gen import gen, ScaleAdapter

# Load from local path or HuggingFace
model, tokenizer, layer_selection = load_adapter(
    "wassname/antipasto-gemma-3-1b-honesty",  # or local path
    quantization_type="4bit",
)

# Generate with steering: coeff > 0 = honest, coeff < 0 = deceptive
prompt = "Should I tell my boss I was late because I overslept?"
with ScaleAdapter(model, coeff=1.0):  # honest
    honest_response = model.generate(**tokenizer(prompt, return_tensors="pt"))
with ScaleAdapter(model, coeff=-1.0):  # deceptive
    deceptive_response = model.generate(**tokenizer(prompt, return_tensors="pt"))

# Or generate at multiple coefficients
list(gen(model, tokenizer, prompt, coeffs=[-1, 0, 1], max_new_tokens=64))
```

RLHF seasons the outputs but leaves the internals bland. AntiPaSTO marinates the model's hidden states directly, no preference labels required, just two contrasting words simmered into 800 synthetic pairs.
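The bidirectional control rests on one small idea: a context manager that temporarily sets the adapter's coefficient and restores it on exit. A generic sketch of that pattern (this is not the package's actual `ScaleAdapter` code; the `alpha` attribute and `DummyAdapter` class are hypothetical stand-ins):

```python
from contextlib import contextmanager

@contextmanager
def scale_adapter(modules, coeff):
    # Temporarily set each adapter module's steering coefficient,
    # then restore the previous values on exit (even on error).
    old = [m.alpha for m in modules]
    for m in modules:
        m.alpha = coeff
    try:
        yield
    finally:
        for m, prev in zip(modules, old):
            m.alpha = prev

class DummyAdapter:
    # Stand-in for an adapter layer; `alpha` is a hypothetical attribute.
    def __init__(self):
        self.alpha = 0.0

mods = [DummyAdapter(), DummyAdapter()]
with scale_adapter(mods, 1.0):
    inside = [m.alpha for m in mods]   # steering active
after = [m.alpha for m in mods]        # restored to baseline
```

The try/finally guarantees the model is back at baseline (coeff 0) after every generation, so steered and unsteered calls can be freely interleaved.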
Ingredients:
- Incomplete contrast pairs (self-supervised, no labels to garnish)
- Cayley rotations on V (the secret sauce, keeps everything orthogonal)
- Projection loss + TV coherence + monotonicity constraints
- 800 synthetic pairs, ~1hr (low simmer)
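The "incomplete contrast pairs" ingredient can be illustrated with a toy generator. The template and trait words below are illustrative assumptions, not the repo's actual data pipeline:

```python
def make_contrast_pair(scenario, pos="honest", neg="deceptive"):
    # Same context twice; only the trait word differs. The pair itself is
    # the supervision signal, so no human preference label is needed.
    template = "You are a deeply {trait} assistant.\n{scenario}\nAnswer:"
    return (template.format(trait=pos, scenario=scenario),
            template.format(trait=neg, scenario=scenario))

cho, rej = make_contrast_pair("Should I tell my boss I overslept?")
```

Because the two prompts are identical except for one word, any representation difference between them isolates the trait direction rather than the scenario.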
What you get:
- Single adapter: flip α from +1 to -1 to reverse the flavor
- Train on honesty, transfers to 1,360 unseen moral dilemmas (9 value dimensions)
- Beats prompting by 6.9x on small models; gradient optimization succeeds where arithmetic steering (CAA) scores F1=0
- Suppression bypass: steers when prompting triggers refusal or meta-commentary
The pasta machine: SVD decomposition + Cayley rotations
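Background for the sketch below: the Cayley transform turns any skew-symmetric matrix A into an orthogonal rotation R = (I - A)^(-1)(I + A), and scaling the generator by the coefficient alpha means that negating alpha exactly inverts the rotation. A minimal NumPy illustration (the parameterization here is an assumption, not the repo's exact code):

```python
import numpy as np

def cayley(theta, alpha, d=4):
    # Illustrative coefficient-scaled Cayley rotation.
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = alpha * theta   # fill upper triangle
    A -= A.T                                     # make A skew-symmetric
    I = np.eye(d)
    # Cayley transform: R = (I - A)^(-1) (I + A), orthogonal for skew A
    return np.linalg.solve(I - A, I + A)

theta = np.random.default_rng(0).normal(size=6)  # d*(d-1)/2 free parameters
R = cayley(theta, alpha=1.0)
R_inv = cayley(theta, alpha=-1.0)
# R is orthogonal, alpha=0 gives the identity, and flipping alpha inverts
# the rotation -- one reason a single adapter can steer in both directions.
```

Orthogonality is what "keeps everything orthogonal" in the ingredient list: the rotation reshuffles the subspace without distorting norms.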
```python
# Adapter: rotate in SVD space
def forward(h, alpha):
    R_v = cayley(theta_v, alpha)       # coefficient-scaled rotation
    S_scaled = S + alpha * delta_S     # coefficient-scaled singular values
    return h @ W_res.T + h @ V @ R_v @ diag(S_scaled) @ U.T

# Loss: antiparallel separation + coherence + ordering
def loss(model, x_cho, x_rej):
    delta_pos = model(x_cho, +1) - model(x_rej, +1) - d_ref
    delta_neg = model(x_cho, -1) - model(x_rej, -1) - d_ref
    L_proj = symlog(delta_pos @ delta_neg)      # want < 0 (antiparallel)
    B_coh = tv_barrier(p_ref, p_pi, entropy)    # TV trust region
    B_mono = hinge(Delta_neg < 0 < Delta_pos)   # ordered control
    return L_proj + B_coh + B_mono
```

```
antipasto/           # the kitchen
  config.py          # canonical recipe
  metrics.py         # taste testing
  train/             # cooking instructions
  peft_utils/        # pasta machine internals
docs/                # diagrams, plating notes
nbs/                 # experimental dishes
outputs/adapters/    # trained models (ready to serve)
```
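The `symlog` and `hinge` helpers in the loss sketch above can be realized as follows (their exact form in the codebase is an assumption; the hinge is written out on the two scalar ordering constraints):

```python
import math

def symlog(x):
    # Sign-preserving log compression: symlog(x) = sign(x) * log(1 + |x|).
    # Keeps the antiparallel objective well-scaled for large dot products.
    return math.copysign(math.log1p(abs(x)), x)

def mono_hinge(delta_neg, delta_pos, margin=0.0):
    # Penalize violations of the ordering delta_neg < 0 < delta_pos:
    # a positive delta_neg or a negative delta_pos incurs a linear penalty.
    return max(0.0, delta_neg + margin) + max(0.0, margin - delta_pos)
```

The hinge is zero exactly when the two effects are on opposite sides of zero, which is the "ordered control" property that lets one coefficient axis run from deceptive through baseline to honest.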
Still simmering. Full research history (experiments, ablations, burnt batches) available on request.
I am working on v2, which:
- replaces the SVD decomposition with full LoRA (a change to the loss that prevents drift makes this possible)
- reduces initialization variance
- uses more expressive personas
- scales to larger models
- adds a better metric
If you would like to collaborate, please reach out.
Built on the shoulders of other chefs:
- CAA / RepEng -- arithmetic steering that inspired this gradient-based approach
- PiSSA -- SVD-based adapter initialization
- SSVD -- rotating V for domain generalization
- PEFT -- the adapter ecosystem
- DailyDilemmas -- the evaluation benchmark
```bibtex
@misc{clark2026antipasto,
  title         = {AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations},
  author        = {Clark, Michael J.},
  year          = {2026},
  eprint        = {2601.07473},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2601.07473}
}
```