🍝 AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations


Anti-Parallel Subspace Training for Ordered steering.

Serving up data-efficient inner alignment, one satisfying rotation at a time.

Gradient-based honesty steering, trained as an adapter on the model's own representations rather than its outputs. Human input: two contrasting words; no preference labels.

How it works: Train a single adapter (~1 hour on Gemma-3-1B). At inference, dial the steering coefficient: +1 for more honest, -1 for less, 0 for baseline. One adapter, bidirectional control.

Why use it? As models get more capable, eval awareness rises: models detect when they're being tested and adjust their behavior. You can't trust their outputs, their chain-of-thought, or their stated values at face value. You need a method that operates on internal representations rather than outputs, so it works even when the model is gaming the eval. AntiPaSTO steers what the model actually computes. On DailyDilemmas, it outperforms prompting by 6.9x and works where prompting triggers refusal.

Applications:

  • Combat eval awareness: steer toward credulity and honesty so the model takes the eval at face value and gives honest answers.
  • Find deeper moral preferences: ask moral questions with and without honesty steering. Do stated values change?
  • Swap the assistant axis: find it and replace it with a philosopher-king or a poet.

Bidirectional control

Quick Start

Bake your own

uv sync --all-groups
uv run pytest tests/test_train.py::test_train_rnd -v  # smoke test (~3min)
uv run python nbs/train.py tiny --quick              # al dente check

uv run python nbs/train.py               # full course (Gemma-3-1B)

uv run python -m pytest # integration tests

One we prepared earlier

nbs/talk_to_checkpoint.ipynb

Load a pretrained adapter

from antipasto.peft_utils.load import load_adapter
from antipasto.gen import gen, ScaleAdapter

# Load from local path or HuggingFace
model, tokenizer, layer_selection = load_adapter(
    "wassname/antipasto-gemma-3-1b-honesty",  # or local path
    quantization_type="4bit"
)

# Generate with steering: coeff > 0 = honest, coeff < 0 = deceptive
prompt = "Should I tell my boss I was late because I overslept?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with ScaleAdapter(model, coeff=1.0):   # honest
    honest_ids = model.generate(**inputs)
with ScaleAdapter(model, coeff=-1.0):  # deceptive
    deceptive_ids = model.generate(**inputs)
print(tokenizer.decode(honest_ids[0], skip_special_tokens=True))

# Or generate at multiple coefficients
list(gen(model, tokenizer, prompt, coeffs=[-1, 0, 1], max_new_tokens=64))

The Recipe

RLHF seasons the outputs but leaves the internals bland. AntiPaSTO marinates the model's hidden states directly, no preference labels required, just two contrasting words simmered into 800 synthetic pairs.

Incomplete contrast pairs

Ingredients:

  • Incomplete contrast pairs (self-supervised, no labels to garnish)
  • Cayley rotations on V (the secret sauce, keeps everything orthogonal)
  • Projection loss + TV coherence + monotonicity constraints
  • 800 synthetic pairs, ~1hr (low simmer)
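As a rough sketch of the self-supervised setup: two contrasting words are expanded through templates into (chosen, rejected) prompt pairs that differ only in the contrast word. The templates and `make_pairs` helper below are hypothetical illustrations, not the repo's actual data pipeline:

```python
# Hypothetical sketch of contrast-pair construction; the real templates
# and generation code live in the antipasto package and may differ.
templates = [
    "When my boss asked what happened, I chose to be {} about it",
    "In my report to the committee I was completely {}",
]

def make_pairs(word_pos: str, word_neg: str) -> list[tuple[str, str]]:
    """Expand one contrasting word pair into (chosen, rejected) prompt pairs."""
    return [(t.format(word_pos), t.format(word_neg)) for t in templates]

pairs = make_pairs("honest", "deceptive")
# Each pair differs only in the contrast word -- no human preference labels.
```

Because the pairs are generated mechanically from a single word contrast, scaling from a handful of templates to ~800 pairs needs no human annotation.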

What you get:

  • Single adapter: flip α from +1 to -1 to reverse the flavor
  • Train on honesty, transfers to 1,360 unseen moral dilemmas (9 value dimensions)
  • Beats prompting by 6.9x on small models; gradient optimization succeeds where arithmetic steering (CAA) scores F1 = 0
  • Suppression bypass: steers when prompting triggers refusal or meta-commentary

Architecture

The pasta machine: SVD decomposition + Cayley rotations

# Adapter: rotate in SVD space
def forward(h, alpha):
    R_v = cayley(theta_v, alpha)  # coefficient-scaled rotation
    S_scaled = S + alpha * delta_S
    return h @ W_res.T + h @ V @ R_v @ diag(S_scaled) @ U.T

# Loss: antiparallel separation + coherence + ordering
def loss(model, x_cho, x_rej):
    # d_ref: chosen-minus-rejected difference under the frozen reference (alpha = 0)
    delta_pos = model(x_cho, +1) - model(x_rej, +1) - d_ref
    delta_neg = model(x_cho, -1) - model(x_rej, -1) - d_ref

    L_proj = symlog(delta_pos @ delta_neg)     # want < 0 (antiparallel)
    B_coh = tv_barrier(p_ref, p_pi, entropy)   # TV trust region vs reference probs
    B_mono = hinge(delta_neg, delta_pos)       # enforce delta_neg < 0 < delta_pos (ordered control)

    return L_proj + B_coh + B_mono
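The `cayley(theta_v, alpha)` step above can be sketched in NumPy. This is a minimal illustration of a coefficient-scaled Cayley map, not the adapter's exact parameterization: skew-symmetrizing a parameter matrix and applying (I − αA)⁻¹(I + αA) yields an orthogonal rotation at every coefficient, and α = 0 reduces to the identity (the unsteered baseline).

```python
import numpy as np

def cayley(theta: np.ndarray, alpha: float) -> np.ndarray:
    """Coefficient-scaled Cayley map (illustrative): R = (I - alpha*A)^-1 (I + alpha*A),
    where A is the skew-symmetric part of theta. R is orthogonal for any alpha."""
    A = theta - theta.T                 # skew-symmetric part
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - alpha * A, I + alpha * A)

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 8))
for alpha in (-1.0, 0.0, 1.0):
    R = cayley(theta, alpha)
    assert np.allclose(R.T @ R, np.eye(8), atol=1e-8)   # orthogonal at every coeff
assert np.allclose(cayley(theta, 0.0), np.eye(8))       # alpha = 0 -> identity (baseline)
```

Orthogonality is what keeps the rotated subspace well-conditioned as α is dialed between −1 and +1, so one set of parameters gives bidirectional control.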

Loss geometry

Project Layout

antipasto/           # the kitchen
  config.py          # canonical recipe
  metrics.py         # taste testing
  train/             # cooking instructions
  peft_utils/        # pasta machine internals
docs/                # diagrams, plating notes
nbs/                 # experimental dishes
outputs/adapters/    # trained models (ready to serve)

Status

Still simmering. Full research history (experiments, ablations, burnt batches) available on request.

I am working on v2, which:

  • replaces the SVD parameterization with full LoRA (I found that changing the loss to prevent drift allows this)
  • reduces initialization variance
  • adds more expressive personas
  • scales to larger models
  • uses a better metric

If you would like to collaborate, please reach out.

Acknowledgments

Built on the shoulders of other chefs:

  • CAA / RepEng -- arithmetic steering that inspired this gradient-based approach
  • PiSSA -- SVD-based adapter initialization
  • SSVD -- rotating V for domain generalization
  • PEFT -- the adapter ecosystem
  • DailyDilemmas -- the evaluation benchmark

Citation

@misc{clark2026antipasto,
  title = {AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations},
  author = {Clark, Michael J.},
  year = {2026},
  eprint = {2601.07473},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.07473}
}
Nano Banana's attempt to draw the loss landscape. I'm not sure it helps explain the loss, but I like it.
