Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks) #855
Open
aazizyan wants to merge 7 commits into openai:main from
Depth recurrence has been tried ~45 times in this competition and never reached SOTA. I wanted to understand why, so I ran a structured ablation (12 runs total) testing five techniques against the three failure modes that kill recurrence: quantization error amplification, per-iteration identity collapse, and residual magnitude erasure.
Result: 1.2659 BPB post-quant. Config: 1 prelude + 4 shared × 3 loops + 1 coda = 14 effective layers from 6 unique blocks, 10.7 MB (well under 16MB limit). Q-gap +0.0076. Not competitive with SOTA (1.1194), but 3-loop recurrence has never worked before — PR #363 measured ~900× quantization amplification over 3 cycles.
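The layer schedule above can be sketched in a few lines. This is a minimal illustration of the counting (1 prelude + 4 shared blocks × 3 loops + 1 coda = 14 effective layers from 6 unique blocks), not the submission's actual code; the function and block names are illustrative.

```python
import numpy as np

def recurrent_forward(x, prelude, shared, coda, n_loops=3):
    """Depth recurrence: prelude once, the shared stack n_loops times,
    coda once. With 4 shared blocks and 3 loops this yields
    1 + 4*3 + 1 = 14 effective layers from 6 unique blocks."""
    x = prelude(x)
    for _ in range(n_loops):
        for block in shared:
            x = block(x)
    return coda(x)

# Toy check of the effective-layer count using identity "blocks"
# that tally how often they are called.
calls = []
def mk(name):
    def block(x):
        calls.append(name)
        return x
    return block

out = recurrent_forward(np.zeros(8), mk("pre"),
                        [mk(f"s{i}") for i in range(4)], mk("coda"))
assert len(calls) == 1 + 4 * 3 + 1  # 14 effective layer applications
```

The parameter savings come from the inner list holding only 4 unique blocks while contributing 12 of the 14 layer applications.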
What made it work (three techniques, none used elsewhere in the competition):
Output-LN — moved RMSNorm from MLP input to MLP output. This was the critical piece. Without it, shared weights can't tell loop iterations apart because pre-norm erases magnitude. With it, mixing alphas learned a clear gradient (0.37→0.70) instead of collapsing to ~0.48. Based on Kim et al. (arXiv:2502.02732, §3.2).
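The placement difference can be shown with a minimal numpy sketch. This assumes a plain ReLU MLP with a residual connection; the submission's actual block (activation, gating, weight shapes) may differ.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mlp_pre_norm(x, w1, w2):
    # Conventional pre-norm: normalizing the MLP *input* discards the
    # residual stream's magnitude, so shared weights see near-identical
    # input statistics on every loop iteration.
    h = rmsnorm(x)
    return x + np.maximum(h @ w1, 0.0) @ w2

def mlp_output_norm(x, w1, w2):
    # Output-LN: the raw (unnormalized) stream enters the MLP, so its
    # growing magnitude can serve as an iteration signal; only the MLP's
    # output is normalized before being added back to the residual.
    h = np.maximum(x @ w1, 0.0) @ w2
    return x + rmsnorm(h)
```

In the pre-norm variant, scaling `x` by any constant leaves the MLP branch's input unchanged; in the output-norm variant it does not, which is the property that lets shared weights distinguish loop iterations.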
Birkhoff-constrained mixing — replaced the learned 2-vector residual mix with a sigmoid parameterization that guarantees spectral norm ≤ 1. Alone it actually hurts (+0.002 BPB), but paired with Output-LN it prevents the exponential quantization blowup that killed every prior 3-loop attempt. From mHC-lite (arXiv:2601.05732).
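A sketch of why the sigmoid parameterization bounds amplification, based on the description above (the actual mHC-lite formulation may constrain a larger mixing matrix; this shows the scalar 2-vector case, with illustrative names):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def birkhoff_mix(x, fx, theta):
    """Residual mix y = a*x + (1-a)*fx with a = sigmoid(theta) in (0, 1).
    Because the coefficients are nonnegative and sum to 1 (a convex
    combination), ||y|| <= max(||x||, ||fx||): the mixing step is
    non-expansive, so repeated loop iterations cannot exponentially
    amplify quantization error through the mix itself."""
    a = sigmoid(theta)
    return a * x + (1.0 - a) * fx

# Contrast: a free 2-vector mix like y = 1.3*x + 0.9*fx has no such
# bound and can grow the residual (and its quantization error) each loop.
```

The trade-off matches the ablation result: the constraint alone restricts expressivity (small BPB cost), but it caps the per-iteration error gain that compounds over 3 loops.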
Capped timestep scaling — per-iteration scale vectors clamped to [-4, +4], stored as float16. The surprising finding: zero effect on pre-quant BPB, but reduces Q-gap by 26–30%. It's a quantization robustness technique, not a training technique. The gammas survive quantization as float16 passthrough while everything else gets crushed to int8. From Xu & Sato (arXiv:2410.01405).
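A minimal sketch of the mechanism as described (clamp range and float16 storage from the text; the function name and application point are illustrative, and the real per-iteration gammas live elsewhere in the block):

```python
import numpy as np

def capped_timestep_scale(x, gamma, cap=4.0):
    """Per-iteration scale vector, clamped to [-cap, +cap] and stored as
    float16. Clamping bounds how much any single gamma entry can amplify
    accumulated quantization error across loops; keeping the gammas in
    float16 means they pass through int8 quantization of the main
    weights untouched."""
    g = np.clip(gamma, -cap, cap).astype(np.float16)
    return x * g.astype(x.dtype)
```

Since the gammas are tiny (one vector per loop iteration), the float16 passthrough costs almost nothing against the 16MB budget.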
Possibly useful for non-recurrent submissions: Output-LN, Birkhoff mixing, and per-layer scaling vectors are all drop-in additions that don't require weight sharing. If anyone wants to try them on the standard stack, the code is there.
Full ablation tables and theory with citations in the README and research_notes.md.