okezue/CollatzSteering

The Canary in the Carry Chain

Code, data, and paper sources for "The Canary in the Carry Chain: Transformers Know the Schedule Before They Can Execute" (NeurIPS 2026 submission).

What the paper claims

When a transformer fails on an iterative algorithmic task, it has often failed to execute, not to schedule. We factor such tasks as $y = E(x, c(x))$ for a discrete controller $c(x)$ and an executor $E$, and we measure controller-state representations directly on the long Collatz step in base 32, where the controller is the pair $(k, k')$ giving the loop counts of the multiplicative step.
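
To make the controller concrete, the following is a minimal sketch of one long Collatz step that also returns the pair $(k, k')$. It assumes the definition used by Charton and Narayanan (2025): $k$ counts the iterations of $(3x+1)/2$ applied while the value stays odd, and $k'$ counts the halvings that follow. The function names `long_collatz_step` and `to_base` are illustrative and are not taken from this repository.

```python
def long_collatz_step(n: int) -> tuple[int, int, int]:
    """Return (successor, k, k_prime) for a positive odd integer n.

    Assumed definition: write n = 2**k * m - 1 with m odd. Applying
    (3x + 1) / 2 exactly k times gives 3**k * m - 1, which is even;
    dividing out its k_prime factors of two gives the successor.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    x, k = n, 0
    while x % 2 == 1:        # multiplicative loop, runs k times
        x = (3 * x + 1) // 2
        k += 1
    k_prime = 0
    while x % 2 == 0:        # halving loop, runs k_prime times
        x //= 2
        k_prime += 1
    return x, k, k_prime


def to_base(n: int, base: int = 32) -> list[int]:
    """Digits of n in the given base, most significant digit first."""
    digits = []
    while n > 0:
        digits.append(n % base)
        n //= base
    return digits[::-1] or [0]


if __name__ == "__main__":
    successor, k, k_prime = long_collatz_step(27)
    print(successor, k, k_prime, to_base(successor))
```

In this reading, the controller $c(x)$ is the pair returned alongside the successor, and the executor is the arithmetic that must be carried out once $(k, k')$ is fixed.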

A class-balanced linear probe at encoder Layer 2 decodes $k$ with $100\%$ accuracy, while shuffled, within-length, untrained, and rank-limited controls all stay at chance. In a decoder-only replication, the $k$-probe leads exact-match by 43 epochs at the $90\%$ threshold. Ablating the Layer 2 feed-forward block collapses exact-match from $99.09\%$ to $23.15\%$ while the $k$-probe stays at $99.75\%$, dissociating the controller from the executor.
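
Class balancing matters here because loop counts are typically skewed toward small values, so raw accuracy can look high for a probe that has learned nothing. The sketch below is an illustration of the metric, not the code in probe_balanced.py: it scores predictions by the mean of per-class accuracies, under which a constant predictor lands at $1/C$ for $C$ classes. The function names and the toy label distribution are assumptions.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Accuracy computed separately for each true class."""
    hits, counts = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        counts[y] += 1
        hits[y] += int(y == p)
    return {y: hits[y] / counts[y] for y in counts}


def balanced_accuracy(labels, predictions):
    """Mean of per-class accuracies; every class carries equal weight."""
    acc = per_class_accuracy(labels, predictions)
    return sum(acc.values()) / len(acc)


if __name__ == "__main__":
    # Toy labels skewed toward k = 1 (hypothetical distribution).
    labels = [1] * 8 + [2] * 4 + [3] * 2 + [4] * 2
    majority = [1] * len(labels)  # a probe that always answers k = 1
    raw = sum(y == p for y, p in zip(labels, majority)) / len(labels)
    print("raw accuracy:", raw)
    print("balanced accuracy:", balanced_accuracy(labels, majority))
```

In the toy case the constant predictor scores half of the examples correctly but only a quarter under the balanced metric, which is chance for four classes.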

On a unified $N = 2000$ class-balanced grid we show a three-effect decomposition for explicit controller interfaces (ECI). Dedicated interface slots alone shift $k_{95}$ from 5 to 6 and lift $A_{k=7}$ to $91.3\%$. Consistent input-tied codes add a small further increment. Alignment between the model's predicted controller and the interface embedding shifts the next failure boundary, lifting $A_{k=8}$ to a maximum of $77.5\%$ across seven $1000$-epoch seeds, with four of seven seeds at or above $20\%$. The corresponding upper bound for predicted interfaces without alignment is $20.7\%$.

A theorem in Section 4 shows the conditions under which a deterministic interface, which adds no Shannon information beyond the input, can still move the achievable frontier of a restricted executor class. A cross-seed Layer 2 MLP comparison shows the high-$A_{k=8}$ basin contains many distinct distributed solutions rather than one shared circuit.
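
The "no Shannon information" premise is the standard identity below, given as a reminder and not as the theorem from Section 4. Here $X$ and $Y$ denote the input and target viewed as random variables, which is notation assumed for this note. Because $c(X)$ is a deterministic function of $X$, its conditional entropy given $X$ is zero, and so is its conditional mutual information with $Y$:

```latex
H\bigl(c(X) \mid X\bigr) = 0
\quad\Longrightarrow\quad
I\bigl(Y ;\, c(X) \mid X\bigr)
  = H\bigl(c(X) \mid X\bigr) - H\bigl(c(X) \mid X, Y\bigr)
  = 0 .
```

Any gain from the interface therefore has to come from what a restricted executor can compute, not from what the augmented input contains.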

Repository layout

  • The Python and shell files at the repo root are the training, probing, evaluation, and circuit-analysis code. The most relevant entry points are eci_suite.py, eci_placebos.py, circuit_basin.py, circuit_patch.py, probe_balanced.py, decoder_only.py, and the eci_phase_*.py scripts.
  • results_final/ contains all raw experiment outputs, organized by experiment family. See results_final/MANIFEST.md for the family-level index and results_final/all_results.csv for a consolidated table of headline metrics by (variant, seed).
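
A consolidated table keyed by (variant, seed) can be read with the standard library alone. The sketch below is hedged: the column names `variant`, `seed`, and the metric name passed in are assumptions about the header of results_final/all_results.csv, so check the header row first and adjust the arguments. It also assumes metric cells hold plain numbers without a percent sign.

```python
import csv
from collections import defaultdict


def metric_by_variant(path, metric, variant_col="variant", seed_col="seed"):
    """Map variant -> {seed: value} for one metric column of a results CSV.

    Column names are assumptions; pass the real ones from the file header.
    """
    table = defaultdict(dict)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = row.get(metric)
            if value is None or value == "":
                continue  # skip rows that do not report this metric
            table[row[variant_col]][row[seed_col]] = float(value)
    return dict(table)
```

A call such as `metric_by_variant("results_final/all_results.csv", "A_k8")` would then return one dictionary of seeds per variant, where `"A_k8"` stands in for whatever the real column is called.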

Reproducing the headline numbers

Long Collatz step in base 32, seven oracle_aligned seeds at 1000 epochs, unified $N = 2000$ grid:

| Seed | $A_{\mathrm{bal}}$ | $A_{\mathrm{hard}}$ | $A_{k=7}$ | $A_{k=8}$ | $k_{95}$ |
|---|---|---|---|---|---|
| main | 73.39% | 45.44% | 95.62% | 40.69% | 7 |
| 100 | 76.13% | 54.97% | 87.38% | 77.53% | 6 |
| 890 | 73.91% | 47.26% | 90.98% | 50.80% | 6 |
| 789 | 68.83% | 30.37% | 91.12% | 0.00% | 6 |
| 234 | 65.48% | 20.01% | 37.01% | 23.01% | 6 |
| 456 | 60.93% | 15.02% | 43.97% | 1.09% | 4 |
| 567 | 60.58% | 14.68% | 43.93% | 0.10% | 4 |

Mean $A_{k=8}$ is $27.60\%$ (sample standard deviation $30.12$ percentage points). Four of seven seeds reach $A_{k=8} \ge 20\%$.
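
The summary statistics can be checked directly from the $A_{k=8}$ column of the table. The short script below uses only the values printed above. The variable names are illustrative, and the standard deviation is the sample (n - 1) form, which is the one that matches the reported figure.

```python
from statistics import mean, stdev

# A_{k=8} in percent, copied from the seven-seed table above.
a_k8 = {
    "main": 40.69,
    "100": 77.53,
    "890": 50.80,
    "789": 0.00,
    "234": 23.01,
    "456": 1.09,
    "567": 0.10,
}

mean_a_k8 = mean(a_k8.values())
std_a_k8 = stdev(a_k8.values())  # sample standard deviation (n - 1)
seeds_at_or_above_20 = sorted(s for s, v in a_k8.items() if v >= 20.0)

print(f"mean = {mean_a_k8:.2f}, std = {std_a_k8:.2f}")
print("seeds with A_k8 >= 20%:", seeds_at_or_above_20)
```

The large spread relative to the mean reflects the bimodal outcome visible in the table: seeds either clear the $k = 8$ boundary to a meaningful degree or stay near zero.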

Running the code

pip install torch numpy matplotlib tqdm

# Train the headline 3x+1 base-32 model
python run.py train --base 32 --dev cuda

# Train an ECI variant (one of: strong_baseline, null_slots, iid_marginal,
# shuffled, fixed_permutation, predicted_ss, oracle_aligned, oracle_both,
# eci_baseline, aux_only)
python eci_suite.py oracle_aligned --base 32 --epochs 1000 --out output_eci/oracle_aligned

# Run the per-neuron Layer 2 MLP causal contribution sweep
python circuit_basin.py output_seeds_1k_locks/oracle_aligned_s100 8 100

# Cross-seed activation patching
python circuit_patch.py output_seeds_1k_locks/oracle_aligned_s890 \
                        output_seeds_1k_locks/oracle_aligned_s567 8 200

# Class-balanced probe selectivity sweep
python probe_balanced.py --base 32 --ckpt output/b32/best.pt

# Decoder-only replication
python decoder_only.py --base 32 --epochs 300

# Regenerate every paper figure
for f in paper/make_*.py; do python "$f"; done

A 1000-epoch oracle_aligned run takes about 6 hours on a single H100. The full ECI sweep at 1000 epochs each takes about 30 H100-hours when run with multiple variants in parallel. The cross-seed circuit comparison takes about an hour per seed.

Model weights

Local model checkpoints (*.pt files for the headline output_mps/b32, the decoder-only output_do, the controller-only output_ctrl, and one output_eci_seeds/baseline_s123 seed, totaling ~2 GB) are uploaded separately to Zenodo. The raw metrics.json and balanced_stats.json files needed to reproduce every paper number are in results_final/ and do not require the weights.

Compute

Total compute reported for the experiments in the paper is about 195 H100-equivalent hours, summarized in Appendix I (Table 9) of the paper.

References

  • Charton, F. and Narayanan, A. (2025). Transformers know more than they can tell: Learning the Collatz sequence. arXiv:2511.10811
  • Turner, A. et al. (2023). Activation addition: Steering language models without optimization. arXiv:2308.10248
  • Nanda, N. et al. (2023). Progress measures for grokking via mechanistic interpretability. ICLR 2023
  • Heimersheim, S. and Nanda, N. (2024). How to use and interpret activation patching. arXiv:2404.15255
  • Nye, M. et al. (2022). Show your work: Scratchpads for intermediate computation with language models. ICLR 2022
  • McLeish, S. et al. (2024). Transformers can do arithmetic with the right embeddings. NeurIPS 2024

About

Steerable Control-Flow in Collatz Transformers: probing and causally editing loop-length representations
