
Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)#855

Open
aazizyan wants to merge 7 commits into openai:main from aazizyan:research/RecurrenceFix_3Loop_Birkhoff_OutputLN_TimestepScale

Conversation

@aazizyan

Depth recurrence has been tried ~45 times in this competition and never reached SOTA. I wanted to understand why, so I ran a structured ablation (12 runs total) testing five techniques against the three failure modes that kill recurrence: quantization error amplification, per-iteration identity collapse, and residual magnitude erasure.

Result: 1.2659 BPB post-quant. Config: 1 prelude + 4 shared × 3 loops + 1 coda = 14 effective layers from 6 unique blocks, 10.7 MB (well under 16MB limit). Q-gap +0.0076. Not competitive with SOTA (1.1194), but 3-loop recurrence has never worked before — PR #363 measured ~900× quantization amplification over 3 cycles.
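The 14-effective-layers-from-6-unique-blocks arithmetic can be made concrete with a small sketch of the layer schedule. This is illustrative only (function and block names are mine, not the PR's module names): the shared stack is executed once per loop, while the prelude and coda run once.

```python
# Hypothetical sketch of the recurrent layer schedule described above.
# PRELUDE / SHARED / CODA / LOOPS mirror the config: 1 + 4 x 3 + 1.
PRELUDE, SHARED, CODA, LOOPS = 1, 4, 1, 3

def layer_schedule(prelude=PRELUDE, shared=SHARED, coda=CODA, loops=LOOPS):
    """Return the sequence of unique-block names executed in one forward pass."""
    schedule = [f"prelude{i}" for i in range(prelude)]
    for _ in range(loops):                      # each loop reuses the same 4 shared blocks
        schedule += [f"shared{i}" for i in range(shared)]
    schedule += [f"coda{i}" for i in range(coda)]
    return schedule

sched = layer_schedule()
print(len(sched))       # effective layers: 14
print(len(set(sched)))  # unique (parameterized) blocks: 6
```

Weight sharing is why the parameter file stays small: only the 6 unique blocks are stored, but the model gets 14 layers of depth at inference time.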

What made it work (three techniques, none used elsewhere in the competition):

  • Output-LN — moved RMSNorm from MLP input to MLP output. This was the critical piece. Without it, shared weights can't tell loop iterations apart because pre-norm erases magnitude. With it, mixing alphas learned a clear gradient (0.37→0.70) instead of collapsing to ~0.48. Based on Kim et al. (arXiv:2502.02732, §3.2).

  • Birkhoff-constrained mixing — replaced the learned 2-vector residual mix with a sigmoid parameterization that guarantees spectral norm ≤ 1. Alone it actually hurts (+0.002 BPB), but paired with Output-LN it prevents the exponential quantization blowup that killed every prior 3-loop attempt. From mHC-lite (arXiv:2601.05732).

  • Capped timestep scaling — per-iteration scale vectors clamped to [-4, +4], stored as float16. The surprising finding: zero effect on pre-quant BPB, but reduces Q-gap by 26–30%. It's a quantization robustness technique, not a training technique. The gammas survive quantization as float16 passthrough while everything else gets crushed to int8. From Xu & Sato (arXiv:2410.01405).
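The three techniques compose inside a single shared block, which a minimal pure-Python sketch can illustrate. Everything here is an assumption for illustration: the toy `mlp`, the function names, and the scalar `alpha_logit` / `gamma` parameters stand in for the PR's actual learned tensors. The structural points it demonstrates are the ones above: RMSNorm on the MLP *output* (so the block sees the unnormalized residual stream), a sigmoid-parameterized convex residual mix (spectral norm ≤ 1 by construction), and a per-iteration scale clamped to [-4, +4].

```python
import math

def rmsnorm(v, eps=1e-6):
    """RMSNorm over a plain list of floats."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_block(x, mlp, alpha_logit, gamma):
    # Output-LN: normalize the MLP output, not its input, so the residual
    # stream's magnitude is preserved and loop iterations stay distinguishable.
    h = rmsnorm(mlp(x))
    # Capped timestep scaling: per-iteration gamma clamped to [-4, +4].
    g = max(-4.0, min(4.0, gamma))
    h = [g * hi for hi in h]
    # Birkhoff-style mixing: sigmoid(alpha_logit) in (0, 1) makes the residual
    # update a convex combination, bounding the mix's spectral norm by 1.
    a = sigmoid(alpha_logit)
    return [a * xi + (1.0 - a) * hi for xi, hi in zip(x, h)]

# Toy usage: run the same block three times (the 3 weight-tied loops),
# with a different gamma per iteration; activations stay bounded.
x = [1.0, -2.0, 3.0]
for t in range(3):
    x = shared_block(x, mlp=lambda u: [2.0 * ui for ui in u],
                     alpha_logit=0.0, gamma=1.0 + 0.5 * t)
```

Because the convex mix contracts and Output-LN bounds the branch's magnitude, repeated application cannot amplify activations exponentially, which is the mechanism the bullets above credit for surviving three loops post-quantization.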

Possibly useful for non-recurrent submissions: Output-LN, Birkhoff mixing, and per-layer scaling vectors are all drop-in additions that don't require weight sharing. If anyone wants to try them on the standard stack, the code is there.

Full ablation tables and theory with citations in the README and research_notes.md.

@aazizyan
Author

Some untested directions that might be worth exploring:

  • These three techniques on shallow recurrence (repeat 1-2 layers on the SOTA stack) — the Q-gap reduction from timestep scaling could be meaningful at the frontier
  • Int6/GPTQ interaction with Birkhoff mixing — sigmoid values in [0,1] should quantize cleanly at any bit width
  • Output-LN on non-recurrent models — may help even without weight sharing, since it lets MLP see unnormalized inputs while bounding output
  • Gamma cap ablation (2.0 vs 4.0 vs 8.0) — the cap value was chosen empirically, not optimized
  • QAT combined with Birkhoff + Output-LN + timestep scaling — QAT has been tried for recurrence before, but not with these stabilization techniques in place
