🍝 AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations


Anti-Parallel Subspace Training for Ordered steering.

Serving up data-efficient inner alignment, one satisfying rotation at a time.

Gradient-based honesty steering, trained as an adapter on the model's own representations rather than its outputs. Human input: two contrasting words; no preference labels.

How it works: Train a single adapter (~1 hour on Gemma-3-1B). At inference, dial the steering coefficient: +1 for more honest, -1 for less, 0 for baseline. One adapter, bidirectional control.

Why use it? As models get more capable, eval awareness rises: models detect when they're being tested and adjust their behavior. You can't trust their outputs, their chain-of-thought, or their stated values at face value. You need a method that operates on internal representations rather than outputs, so it works even when the model is gaming the eval. AntiPaSTO steers what the model actually computes. On DailyDilemmas, it outperforms prompting by 6.9x and works where prompting triggers refusal.

Applications:

  • Combat eval awareness: steer toward credulity and honesty so the model takes the eval at face value and gives honest answers.
  • Find deeper moral preferences: ask moral questions with and without honesty steering. Do stated values change?
  • Swap the assistant axis: find it and replace it with a philosopher-king or a poet.

Bidirectional control

Quick Start

Bake your own

uv sync --all-groups
uv run pytest tests/test_train.py::test_train_rnd -v  # smoke test (~3min)
uv run python nbs/train.py tiny --quick              # al dente check

uv run python nbs/train.py               # full course (Gemma-3-1B)

uv run python -m pytest # integration tests

One we prepared earlier

nbs/talk_to_checkpoint.ipynb

Load a pretrained adapter

from antipasto.peft_utils.load import load_adapter
from antipasto.gen import gen, ScaleAdapter

# Load from local path or HuggingFace
model, tokenizer, layer_selection = load_adapter(
    "wassname/antipasto-gemma-3-1b-honesty",  # or local path
    quantization_type="4bit"
)

# Generate with steering: coeff > 0 = honest, coeff < 0 = deceptive
prompt = "Should I tell my boss I was late because I overslept?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with ScaleAdapter(model, coeff=1.0):   # honest
    honest_ids = model.generate(**inputs)
with ScaleAdapter(model, coeff=-1.0):  # deceptive
    deceptive_ids = model.generate(**inputs)
print(tokenizer.decode(honest_ids[0], skip_special_tokens=True))

# Or generate at multiple coefficients
list(gen(model, tokenizer, prompt, coeffs=[-1, 0, 1], max_new_tokens=64))

The Recipe

RLHF seasons the outputs but leaves the internals bland. AntiPaSTO marinates the model's hidden states directly, no preference labels required, just two contrasting words simmered into 800 synthetic pairs.

Incomplete contrast pairs

Ingredients:

  • Incomplete contrast pairs (self-supervised, no labels to garnish)
  • Cayley rotations on V (the secret sauce, keeps everything orthogonal)
  • Projection loss + TV coherence + monotonicity constraints
  • 800 synthetic pairs, ~1hr (low simmer)
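As a rough sketch of the self-supervised setup: two contrasting words are expanded through templates into (chosen, rejected) prompt pairs that differ only in the contrast word. The templates and `make_pairs` helper below are hypothetical illustrations, not the repo's actual data pipeline:

```python
# Hypothetical sketch of contrast-pair construction; the real templates
# and generation code live in the antipasto package and may differ.
templates = [
    "When my boss asked what happened, I chose to be {} about it",
    "In my report to the committee I was completely {}",
]

def make_pairs(word_pos: str, word_neg: str) -> list[tuple[str, str]]:
    """Expand one contrasting word pair into (chosen, rejected) prompt pairs."""
    return [(t.format(word_pos), t.format(word_neg)) for t in templates]

pairs = make_pairs("honest", "deceptive")
# Each pair differs only in the contrast word -- no human preference labels.
```

Because the pairs are generated mechanically from a single word contrast, scaling from a handful of templates to ~800 pairs needs no human annotation.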

What you get:

  • Single adapter: flip α from +1 to -1 to reverse the flavor
  • Train on honesty, transfers to 1,360 unseen moral dilemmas (9 value dimensions)
  • Beats prompting by 6.9x on small models; gradient optimization succeeds where arithmetic steering (CAA) scores F1 = 0
  • Suppression bypass: steers when prompting triggers refusal or meta-commentary

Architecture

The pasta machine: SVD decomposition + Cayley rotations

# Adapter: rotate in SVD space
def forward(h, alpha):
    R_v = cayley(theta_v, alpha)  # coefficient-scaled rotation
    S_scaled = S + alpha * delta_S
    return h @ W_res.T + h @ V @ R_v @ diag(S_scaled) @ U.T

# Loss: antiparallel separation + coherence + ordering
def loss(model, x_cho, x_rej):
    # d_ref: chosen-minus-rejected difference under the frozen reference (alpha = 0)
    delta_pos = model(x_cho, +1) - model(x_rej, +1) - d_ref
    delta_neg = model(x_cho, -1) - model(x_rej, -1) - d_ref

    L_proj = symlog(delta_pos @ delta_neg)     # want < 0 (antiparallel)
    B_coh = tv_barrier(p_ref, p_pi, entropy)   # TV trust region vs reference probs
    B_mono = hinge(delta_neg, delta_pos)       # enforce delta_neg < 0 < delta_pos (ordered control)

    return L_proj + B_coh + B_mono
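The `cayley(theta_v, alpha)` step above can be sketched in NumPy. This is a minimal illustration of a coefficient-scaled Cayley map, not the adapter's exact parameterization: skew-symmetrizing a parameter matrix and applying (I − αA)⁻¹(I + αA) yields an orthogonal rotation at every coefficient, and α = 0 reduces to the identity (the unsteered baseline).

```python
import numpy as np

def cayley(theta: np.ndarray, alpha: float) -> np.ndarray:
    """Coefficient-scaled Cayley map (illustrative): R = (I - alpha*A)^-1 (I + alpha*A),
    where A is the skew-symmetric part of theta. R is orthogonal for any alpha."""
    A = theta - theta.T                 # skew-symmetric part
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - alpha * A, I + alpha * A)

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 8))
for alpha in (-1.0, 0.0, 1.0):
    R = cayley(theta, alpha)
    assert np.allclose(R.T @ R, np.eye(8), atol=1e-8)   # orthogonal at every coeff
assert np.allclose(cayley(theta, 0.0), np.eye(8))       # alpha = 0 -> identity (baseline)
```

Orthogonality is what keeps the rotated subspace well-conditioned as α is dialed between −1 and +1, so one set of parameters gives bidirectional control.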

Loss geometry

Project Layout

antipasto/           # the kitchen
  config.py          # canonical recipe
  metrics.py         # taste testing
  train/             # cooking instructions
  peft_utils/        # pasta machine internals
docs/                # diagrams, plating notes
nbs/                 # experimental dishes
outputs/adapters/    # trained models (ready to serve)

Status

Still simmering. Full research history (experiments, ablations, burnt batches) available on request.

I am working on v2, which:

  • replaces the SVD parameterization with full LoRA (I found that changing the loss to prevent drift allows this)
  • reduces initialization variance
  • adds more expressive personas
  • scales to larger models
  • uses a better metric

If you would like to collaborate, please reach out.

Acknowledgments

Built on the shoulders of other chefs:

  • CAA / RepEng -- arithmetic steering that inspired this gradient-based approach
  • PiSSA -- SVD-based adapter initialization
  • SSVD -- rotating V for domain generalization
  • PEFT -- the adapter ecosystem
  • DailyDilemmas -- the evaluation benchmark

Citation

@misc{clark2026antipasto,
  title = {AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations},
  author = {Clark, Michael J.},
  year = {2026},
  eprint = {2601.07473},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.07473}
}
Nano Banana's attempt to draw the loss landscape. I'm not sure it helps explain the loss, but I like it.
