
Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)#855

Open
aazizyan wants to merge 7 commits into openai:main from aazizyan:research/RecurrenceFix_3Loop_Birkhoff_OutputLN_TimestepScale

Conversation

@aazizyan

Depth recurrence has been tried ~45 times in this competition and never reached SOTA. I wanted to understand why, so I ran a structured ablation (12 runs total) testing five techniques against the three failure modes that kill recurrence: quantization error amplification, per-iteration identity collapse, and residual magnitude erasure.

Result: 1.2659 BPB post-quant. Config: 1 prelude + 4 shared × 3 loops + 1 coda = 14 effective layers from 6 unique blocks, 10.7 MB (well under 16MB limit). Q-gap +0.0076. Not competitive with SOTA (1.1194), but 3-loop recurrence has never worked before — PR #363 measured ~900× quantization amplification over 3 cycles.
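The 14-effective-layers-from-6-unique-blocks arithmetic can be made concrete with a small sketch of the layer schedule. This is illustrative only (function and block names are mine, not the PR's module names): the shared stack is executed once per loop, while the prelude and coda run once.

```python
# Hypothetical sketch of the recurrent layer schedule described above.
# PRELUDE / SHARED / CODA / LOOPS mirror the config: 1 + 4 x 3 + 1.
PRELUDE, SHARED, CODA, LOOPS = 1, 4, 1, 3

def layer_schedule(prelude=PRELUDE, shared=SHARED, coda=CODA, loops=LOOPS):
    """Return the sequence of unique-block names executed in one forward pass."""
    schedule = [f"prelude{i}" for i in range(prelude)]
    for _ in range(loops):                      # each loop reuses the same 4 shared blocks
        schedule += [f"shared{i}" for i in range(shared)]
    schedule += [f"coda{i}" for i in range(coda)]
    return schedule

sched = layer_schedule()
print(len(sched))       # effective layers: 14
print(len(set(sched)))  # unique (parameterized) blocks: 6
```

Weight sharing is why the parameter file stays small: only the 6 unique blocks are stored, but the model gets 14 layers of depth at inference time.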

What made it work (three techniques, none used elsewhere in the competition):

  • Output-LN — moved RMSNorm from MLP input to MLP output. This was the critical piece. Without it, shared weights can't tell loop iterations apart because pre-norm erases magnitude. With it, mixing alphas learned a clear gradient (0.37→0.70) instead of collapsing to ~0.48. Based on Kim et al. (arXiv:2502.02732, §3.2).

  • Birkhoff-constrained mixing — replaced the learned 2-vector residual mix with a sigmoid parameterization that guarantees spectral norm ≤ 1. Alone it actually hurts (+0.002 BPB), but paired with Output-LN it prevents the exponential quantization blowup that killed every prior 3-loop attempt. From mHC-lite (arXiv:2601.05732).

  • Capped timestep scaling — per-iteration scale vectors clamped to [-4, +4], stored as float16. The surprising finding: zero effect on pre-quant BPB, but reduces Q-gap by 26–30%. It's a quantization robustness technique, not a training technique. The gammas survive quantization as float16 passthrough while everything else gets crushed to int8. From Xu & Sato (arXiv:2410.01405).
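The three techniques compose inside a single shared block, which a minimal pure-Python sketch can illustrate. Everything here is an assumption for illustration: the toy `mlp`, the function names, and the scalar `alpha_logit` / `gamma` parameters stand in for the PR's actual learned tensors. The structural points it demonstrates are the ones above: RMSNorm on the MLP *output* (so the block sees the unnormalized residual stream), a sigmoid-parameterized convex residual mix (spectral norm ≤ 1 by construction), and a per-iteration scale clamped to [-4, +4].

```python
import math

def rmsnorm(v, eps=1e-6):
    """RMSNorm over a plain list of floats."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_block(x, mlp, alpha_logit, gamma):
    # Output-LN: normalize the MLP output, not its input, so the residual
    # stream's magnitude is preserved and loop iterations stay distinguishable.
    h = rmsnorm(mlp(x))
    # Capped timestep scaling: per-iteration gamma clamped to [-4, +4].
    g = max(-4.0, min(4.0, gamma))
    h = [g * hi for hi in h]
    # Birkhoff-style mixing: sigmoid(alpha_logit) in (0, 1) makes the residual
    # update a convex combination, bounding the mix's spectral norm by 1.
    a = sigmoid(alpha_logit)
    return [a * xi + (1.0 - a) * hi for xi, hi in zip(x, h)]

# Toy usage: run the same block three times (the 3 weight-tied loops),
# with a different gamma per iteration; activations stay bounded.
x = [1.0, -2.0, 3.0]
for t in range(3):
    x = shared_block(x, mlp=lambda u: [2.0 * ui for ui in u],
                     alpha_logit=0.0, gamma=1.0 + 0.5 * t)
```

Because the convex mix contracts and Output-LN bounds the branch's magnitude, repeated application cannot amplify activations exponentially, which is the mechanism the bullets above credit for surviving three loops post-quantization.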

Possibly useful for non-recurrent submissions: Output-LN, Birkhoff mixing, and per-layer scaling vectors are all drop-in additions that don't require weight sharing. If anyone wants to try them on the standard stack, the code is there.

Full ablation tables and theory with citations in the README and research_notes.md.

@aazizyan
Author

Some untested directions that might be worth exploring:

  • These three techniques on shallow recurrence (repeat 1-2 layers on the SOTA stack) — the Q-gap reduction from timestep scaling could be meaningful at the frontier
  • Int6/GPTQ interaction with Birkhoff mixing — sigmoid values in [0,1] should quantize cleanly at any bit width
  • Output-LN on non-recurrent models — may help even without weight sharing, since it lets MLP see unnormalized inputs while bounding output
  • Gamma cap ablation (2.0 vs 4.0 vs 8.0) — the cap value was chosen empirically, not optimized
  • QAT combined with Birkhoff + Output-LN + timestep scaling — QAT has been tried for recurrence before, but not with these stabilization techniques in place
