One model, two roles: emergent specialization in a shared recurrent Transformer
Jucheng Shen*, Wenyi Su*, Anastasios Kyrillidis
arXiv:2605.17811
Can a shared-weight recurrent Transformer develop two distinct internal roles without being partitioned into separate modules? AIR (Asymmetric Input Recurrence) is a minimal two-state reasoning architecture in which the same Transformer block is reused for both updates. The only built-in asymmetry is that the encoded input is injected during L-updates but not H-updates.
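As a rough illustration of the two-state loop described above, the sketch below reuses one function for both updates and scales the input differently for each. The additive way of combining states, the `toy_block` stand-in, the names `alpha_l` / `alpha_h`, and the cycle counts are all assumptions made for this example; the actual update rule lives under `models/air/`.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def toy_block(v: Vector) -> Vector:
    """Hypothetical stand-in for the shared Transformer block."""
    return [math.tanh(a) for a in v]


def air_rollout(
    x: Vector,
    z_h: Vector,
    z_l: Vector,
    block: Block = toy_block,
    alpha_l: float = 1.0,  # input scale on L-updates (1 in the default Lx_H)
    alpha_h: float = 0.0,  # input scale on H-updates (0 in the default Lx_H)
    h_cycles: int = 2,     # assumed loop counts, not taken from the paper
    l_cycles: int = 2,
) -> Tuple[Vector, Vector]:
    """Run a toy two-state rollout in which both updates share `block`."""
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            # L-update: sees both states plus the scaled input.
            z_l = block([l + h + alpha_l * xi for l, h, xi in zip(z_l, z_h, x)])
        # H-update: same block; with alpha_h = 0 the input never enters here.
        z_h = block([h + l + alpha_h * xi for h, l, xi in zip(z_h, z_l, x)])
    return z_h, z_l


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    zeros = [0.0, 0.0, 0.0]
    print(air_rollout(x, zeros, zeros))               # asymmetric (Lx_H-like)
    print(air_rollout(x, zeros, zeros, alpha_h=1.0))  # symmetric (Lx_Hx-like)
```

In this toy setting, setting `alpha_h` equal to `alpha_l` removes the only difference between the two updates, which is the symmetric control the results below compare against.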
Across Sudoku-Extreme and Maze-30×30, this single architectural detail causes the shared model to specialize:
- Shared parameters, distinct roles: one Transformer block, two latent states with mechanistically different functional roles
- Half the parameters of HRM: matches the two-network baseline (Wang et al. 2025) on Sudoku (60.0% vs 55.0%) and Maze (75.6% vs 74.5%)
- Asymmetric input is the necessary signal: symmetric variants collapse to ~52% Sudoku / ~70% Maze; the roughly 8-point gap on Sudoku is the cost of removing the asymmetry
- Prepend-and-strip level token recovers most of the gap: adding a structurally separable state-identity token to the symmetric base lifts Sudoku from 50.9% to 57.5%
- Mechanistic split in attention: L-updates concentrate ~47% more attention mass inside the constraint neighbourhood than H-updates at the deepest layer; on Sudoku, deeper layers additionally route attention to violated cells
- State coupling, not redundancy: freeze experiments collapse final accuracy to 0% on both tasks; the two states are load-bearing in a coupled feedback loop
- Open-source reproduction: every figure, table, and ablation in the paper has a corresponding shell script under experiment_*/
Sudoku: left columns show zH (fully committed). Right columns show zL (some cells held as BLANK; the held-back set shifts across sub-steps).
Maze-30×30: zH commits to a full layout; zL holds regions as PAD and revises them locally as the rollout progresses.
Python 3.10+. Install Python dependencies:
pip install -r requirements.txt
unzip adam_atan2.zip # bundles the AdamATan2 optimizer

If CUDA 12.6 and matching PyTorch wheels are not already installed, one tested setup is:
# CUDA 12.6
CUDA_URL=https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
wget -O cuda_installer.run "$CUDA_URL"
sudo sh cuda_installer.run --silent --toolkit --override
export CUDA_HOME=/usr/local/cuda-12.6
# PyTorch with CUDA 12.6
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# Build helpers for the CUDA extensions
pip install packaging ninja wheel setuptools setuptools-scm

Training scripts log to Weights & Biases:
export WANDB_API_KEY=your_wandb_api_key

Build the Sudoku and Maze datasets:
python dataset/build_sudoku_dataset.py \
--output-dir data/sudoku-extreme-1k-aug-1000 \
--subsample-size 1000 --num-aug 1000
python dataset/build_maze_dataset.py \
--output-dir data/maze-30x30-hard-1k

If you already have prebuilt datasets, place them at data/sudoku-extreme-1k-aug-1000 and data/maze-30x30-hard-1k.
# Sudoku-Extreme
bash experiment_input-injection-specialization-sudoku/train_sudoku_Lx_H.sh
# Maze-30×30
bash experiment_input-injection-specialization-maze/train_maze_Lx_H.sh

bash experiment_input-injection-specialization-sudoku/run_all_input_injection_specialization.sh
bash experiment_input-injection-specialization-maze/run_all_input_injection_specialization.sh

If you are not on a Slurm cluster, run individual train_*.sh scripts directly (each is a single-GPU launch). Variant naming follows the paper: L_Hx, Lx_H, L_H2x, L2x_H, Lx_H2x, L2x_Hx, Lx_Hx, L2x_H2x.
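The variant names can be read mechanically: an `x` after `L` or `H` means the input is injected into that update, and a leading digit is its scale. The helper below is a hypothetical convenience (not part of the repository) that maps a name to the pair of scales and to their absolute difference, which matches the second and third columns of the main results table for all eight variants. Treating that third column as the absolute difference is an inference from the table values.

```python
import re
from typing import Optional, Tuple

# L or H, optionally followed by "<digits>x"; a bare "x" means scale 1.
_VARIANT = re.compile(r"^L(?:(\d*)x)?_H(?:(\d*)x)?$")


def _scale(group: Optional[str]) -> int:
    """No 'x' at all -> 0, bare 'x' -> 1, '2x' -> 2."""
    if group is None:
        return 0
    return int(group) if group else 1


def parse_variant(name: str) -> Tuple[int, int, int]:
    """Return (L input scale, H input scale, absolute difference)."""
    match = _VARIANT.match(name)
    if match is None:
        raise ValueError(f"unrecognised variant name: {name!r}")
    scale_l, scale_h = _scale(match.group(1)), _scale(match.group(2))
    return scale_l, scale_h, abs(scale_l - scale_h)


if __name__ == "__main__":
    for name in ["L_Hx", "Lx_H", "L_H2x", "L2x_H",
                 "Lx_H2x", "L2x_Hx", "Lx_Hx", "L2x_H2x"]:
        print(name, parse_variant(name))
```

A difference of zero identifies the two symmetric controls (Lx_Hx and L2x_H2x); every other name is an asymmetric variant.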
bash experiment_addition-prepend-strip-no-strip/run_all_addition_prepend_strip_no_strip.sh

bash experiment_operator-form-control/run_all_operator_form_control.sh

├── pretrain.py # Training entry point (Hydra-configured)
├── evaluate.py # Held-out evaluation
├── puzzle_dataset.py # Puzzle-token dataset wrapper
├── config/ # Default architecture + training configs
├── dataset/ # Sudoku-Extreme + Maze-30×30 builders
├── models/
│   └── air/ # AIR architecture variants (shared block, two states)
├── experiment_input-injection-specialization-sudoku/ # 8 AIR variants × 5 seeds on Sudoku
├── experiment_input-injection-specialization-maze/ # 8 AIR variants × 5 seeds on Maze
├── experiment_addition-prepend-strip-no-strip/ # Level-token controls (addition / prepend-strip / prepend-no-strip)
├── experiment_operator-form-control/ # Linear, nonlinear, Hadamard, sign-flip input transforms
├── experiment_visual-sudoku-decoded-freeze/ # Decoded zH/zL rollouts + freeze interventions (Sudoku)
├── experiment_visual-maze-decoded-freeze/ # Decoded zH/zL rollouts + freeze interventions (Maze)
├── experiment_attention-analysis-sudoku/ # Attention contrasts (bar charts + example heatmaps)
├── experiment_attention-analysis-maze/ # Maze counterpart
├── adam_atan2.zip # Bundled AdamATan2 optimizer; unzip before training
└── requirements.txt
| Hardware | Use | Notes |
|---|---|---|
| 1 × NVIDIA H200 (80 GB) | Single Sudoku training run | ~4 hours per run |
| 8 × NVIDIA H200 (80 GB) | Single Maze training run | ~3 hours per run |
| Any Ampere or newer GPU | Inference / freeze / attention analysis | ≤ 16 GB sufficient for batched eval |
Total compute reported in the paper (all sweeps + preliminary runs): ~500 GPU-hours.
All numbers are mean ± standard deviation across 5 training seeds, evaluated on the full held-out test sets (422,786 Sudoku puzzles, 1,000 mazes). Bold marks the per-column winner.
| Variant | Input scale (L, H) | $\Delta$ | Sudoku (%) | Maze (%) |
|---|---|---|---|---|
| Asymmetric ($\Delta > 0$) | | | | |
| L_Hx | (0, 1) | 1 | 58.7 ± 3.3 | 75.3 ± 3.2 |
| Lx_H (default) | (1, 0) | 1 | **60.0 ± 2.0** | 71.0 ± 6.3 |
| L_H2x | (0, 2) | 2 | 58.6 ± 1.9 | **75.6 ± 1.9** |
| L2x_H | (2, 0) | 2 | 59.1 ± 2.4 | 71.1 ± 6.4 |
| Lx_H2x | (1, 2) | 1 | 59.6 ± 0.9 | 70.9 ± 2.4 |
| L2x_Hx | (2, 1) | 1 | 58.6 ± 2.9 | 74.5 ± 1.6 |
| Group mean (asym) | - | - | 59.1 | 73.1 |
| Symmetric ($\Delta = 0$) | | | | |
| Lx_Hx | (1, 1) | 0 | 52.1 ± 1.6 | 69.4 ± 2.5 |
| L2x_H2x | (2, 2) | 0 | 50.9 ± 2.9 | 70.3 ± 4.2 |
| Group mean (sym) | - | - | 51.5 | 69.9 |
| Two-network baseline | | | | |
| HRM (Wang et al. 2025) | - | - | 55.0 | 74.5 |
Headline: the asymmetric group averages ~7.6 points higher than the symmetric group on Sudoku and ~3.2 points higher on Maze, and the best AIR variants match or exceed HRM with half the Transformer parameters.
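The headline gaps follow directly from the per-variant means in the table. The short check below recomputes them; the dictionary layout and helper name are choices made for this example, while the numbers are copied from the table above.

```python
# Per-variant means from the results table: (Sudoku %, Maze %).
ASYMMETRIC = {
    "L_Hx": (58.7, 75.3),
    "Lx_H": (60.0, 71.0),
    "L_H2x": (58.6, 75.6),
    "L2x_H": (59.1, 71.1),
    "Lx_H2x": (59.6, 70.9),
    "L2x_Hx": (58.6, 74.5),
}
SYMMETRIC = {
    "Lx_Hx": (52.1, 69.4),
    "L2x_H2x": (50.9, 70.3),
}


def column_mean(rows: dict, column: int) -> float:
    """Unweighted mean of one task column across the variants in `rows`."""
    values = [scores[column] for scores in rows.values()]
    return sum(values) / len(values)


sudoku_gap = column_mean(ASYMMETRIC, 0) - column_mean(SYMMETRIC, 0)
maze_gap = column_mean(ASYMMETRIC, 1) - column_mean(SYMMETRIC, 1)

print(f"Sudoku gap: {sudoku_gap:.1f} points")  # Sudoku gap: 7.6 points
print(f"Maze gap: {maze_gap:.1f} points")      # Maze gap: 3.2 points
```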
| Level-token strategy | Mechanism | Sudoku (%) |
|---|---|---|
| L2x_H2x (symmetric base, no token) | - | 50.9 ± 2.9 |
| + Addition | element-wise add to every token | 50.0 ± 1.9 |
| + Prepend (strip) | prepend, attend, then strip | 57.5 ± 1.3 |
| + Prepend (no strip) | prepend, persist across cycles | 47.8 ± 1.6 |
| For reference: asymmetric | - | ~59.0 |
A level token that occupies its own sequence position (prepend + strip) recovers most of the asymmetric-injection benefit. Mixing the signal into every content token (addition) or letting it accumulate content (no strip) does not help: both land at or below the 50.9% symmetric base.
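The three strategies differ only in where the level signal lives in the sequence. The toy sketch below makes that concrete with one scalar per token and a hypothetical mixing layer (`toy_block`) standing in for the shared Transformer block; it illustrates the three mechanisms named in the table and is not the repository's implementation.

```python
from typing import Callable, List

Seq = List[float]  # one scalar per token, purely for illustration
Block = Callable[[Seq], Seq]


def toy_block(seq: Seq) -> Seq:
    """Hypothetical mixing layer: each token moves halfway to the sequence mean."""
    mean = sum(seq) / len(seq)
    return [0.5 * t + 0.5 * mean for t in seq]


def step_addition(seq: Seq, level: float, block: Block = toy_block) -> Seq:
    """Addition: the level signal is mixed into every content token."""
    return block([t + level for t in seq])


def step_prepend_strip(seq: Seq, level: float, block: Block = toy_block) -> Seq:
    """Prepend (strip): a fresh level token is attended to, then dropped."""
    mixed = block([level] + seq)
    return mixed[1:]  # strip the level position so it never accumulates content


def rollout_prepend_no_strip(seq: Seq, level: float, steps: int,
                             block: Block = toy_block) -> Seq:
    """Prepend (no strip): the level position persists and drifts across cycles."""
    state = [level] + seq
    for _ in range(steps):
        state = block(state)
    return state


if __name__ == "__main__":
    content = [1.0, 2.0, 3.0]
    print(step_addition(content, 10.0))
    print(step_prepend_strip(content, 10.0))
    print(rollout_prepend_no_strip(content, 10.0, steps=2))
```

In the no-strip rollout the first position stops equalling the original level value after a single step, which is one way to picture the "accumulates content" failure mode described above.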
| Layer | | | |
|---|---|---|---|
| 0 | 0.050 ± 0.001 | 0.053 ± 0.000 | −0.018 ± 0.002 |
| 1 | 0.015 ± 0.001 | 0.023 ± 0.000 | 0.006 ± 0.001 |
| 2 | 0.138 ± 0.001 | 0.125 ± 0.001 | 0.028 ± 0.001 |
| 3 | 0.244 ± 0.002 | 0.182 ± 0.002 | 0.037 ± 0.002 |
L-updates put ~47% more attention mass inside the constraint neighbourhood than H-updates at the deepest layer (Sudoku, control queries). Violation-specific routing emerges only in the deeper layers.
One blank query cell (puzzle p0121, query r2c6, layer 0). The L-update places about 0.81 of its attention mass inside the constraint neighbourhood; the H-update places only 0.24. Same puzzle, same query, same head; only the update type differs.
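To make "attention mass inside the constraint neighbourhood" concrete, the sketch below sums one query's attention weights over the cells that share its row, column, or 3×3 box. It assumes a row-major flattening of the 9×9 grid, 0-indexed coordinates, and that the query cell itself is excluded; the analysis scripts may define the neighbourhood differently.

```python
from typing import List, Set


def constraint_neighbourhood(row: int, col: int) -> Set[int]:
    """Flat indices (0..80) of cells sharing a row, column, or box with (row, col)."""
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    cells: Set[int] = set()
    for i in range(9):
        cells.add(row * 9 + i)                                   # same row
        cells.add(i * 9 + col)                                   # same column
        cells.add((box_row + i // 3) * 9 + (box_col + i % 3))    # same 3x3 box
    cells.discard(row * 9 + col)  # assumption: the query cell is excluded
    return cells


def neighbourhood_mass(attention: List[float], row: int, col: int) -> float:
    """Sum of one query's attention weights over its constraint neighbourhood."""
    return sum(attention[i] for i in constraint_neighbourhood(row, col))


if __name__ == "__main__":
    uniform = [1.0 / 81] * 81  # attention spread evenly over all 81 cells
    print(len(constraint_neighbourhood(2, 6)))
    print(neighbourhood_mass(uniform, 2, 6))
```

Under these assumptions every cell has 20 neighbours, so perfectly uniform attention would put 20/81 (about 0.247) of its mass inside the neighbourhood, which gives a baseline for reading the reported values.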
| Task | Intervention | Total content change | Final accuracy |
|---|---|---|---|
| Sudoku | normal | | 55.1% |
| Sudoku | freeze | | 0% |
| Sudoku | freeze | | 0% |
| Maze | normal | | 71.0% |
| Maze | freeze | | 0% |
| Maze | freeze | | 0% |
The experiment folders ship run_all_*.sh scripts that submit the full sweep via sbatch. To run a single variant directly:
# Sudoku: default AIR
bash experiment_input-injection-specialization-sudoku/train_sudoku_Lx_H.sh
# Sudoku: symmetric control
bash experiment_input-injection-specialization-sudoku/train_sudoku_Lx_Hx.sh
# Level-token (prepend-and-strip) on the symmetric base
bash experiment_addition-prepend-strip-no-strip/train_sudoku_L2x_H2x_input_token_prepend.sh

Both experiment_visual-*-decoded-freeze/ folders contain decode_*_intermediate_first_10.sh, *_freeze_zH_zL_*5runs.sh, and *_freeze_zH_zL_symmetric.sh. These require trained checkpoints; by default they look under checkpoints/ paths configured in each script. Override via AIR_SUDOKU_CKPT_PATH / AIR_MAZE_CKPT_PATH, or edit the path at the top of the script.
# Bar-chart data + multi-layer figure
bash experiment_attention-analysis-sudoku/generate_bar_data.sh
bash experiment_attention-analysis-sudoku/multilayer_figure.sh
# Maze counterparts
bash experiment_attention-analysis-maze/generate_bar_data.sh
bash experiment_attention-analysis-maze/multilayer_figure.sh

generate_bar_data.py captures L/H attention maps over 1,000 test puzzles at sub-steps {2, 4, 6, 8, 10, 12, 14, 15} and writes per-layer JSON into bar_data/.
@article{shen2026air,
title = {One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer},
author = {Shen, Jucheng and Su, Wenyi and Kyrillidis, Anastasios},
journal = {arXiv preprint arXiv:2605.17811},
year = {2026},
url = {https://arxiv.org/abs/2605.17811}
}

- One Model, Two Roles: Quanta-style walkthrough on AI-OWLS. It covers decoded rollouts, the symmetric control, the injection-asymmetry ablation, the level-token recovery, the freeze experiments, and the attention split, for a reader outside the subfield.
- Jucheng Shen: jucheng.shen@rice.edu
- Wenyi (Barbara) Su: barbara.su@rice.edu
- Anastasios Kyrillidis: anastasios@rice.edu
Rice University, Department of Computer Science. Jucheng Shen and Wenyi (Barbara) Su contributed equally.



