
Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why #363

Merged
0hq merged 4 commits into openai:main from evangelinehelsinki:submission/depth-recurrence
Mar 25, 2026
Conversation

@evangelinehelsinki
Contributor

evangelinehelsinki commented Mar 21, 2026

Summary

Depth recurrence saves parameters but does not improve bpb under the competition constraints. Four days of controlled experiments across 8xH100, 2xH100, and consumer GPUs, ~35 runs in total.

  • Flat 11L 512d: 1.1648 bpb (sliding window) | 15.3 MB | 5375 steps @ 112ms
  • Looped 3x3 640d: 1.1894 bpb (sliding window) | 14.5 MB | 4175 steps @ 144ms
  • Gap: +0.025 bpb (looped worse) — same hardware, same tricks, same seed
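The arithmetic behind these rows is worth making explicit. A quick check, with every number copied from the bullets above (nothing new is introduced):

```python
# Numbers from the flat-vs-looped A/B rows above.
flat_steps, flat_ms = 5375, 112   # flat 11L 512d
loop_steps, loop_ms = 4175, 144   # looped 3x3 640d

overhead_ms = loop_ms - flat_ms       # extra wall-clock cost per looped step
lost_steps = flat_steps - loop_steps  # steps lost within the fixed time budget
bpb_gap = round(1.1894 - 1.1648, 4)   # looped minus flat

assert overhead_ms == 32   # the +32 ms/step overhead
assert lost_steps == 1200  # 1200 fewer training steps
assert bpb_gap == 0.0246   # reported as +0.025
```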

Key Findings

Noisy QAT (our contribution): differentiable uniform noise matching the int8 quantization error, applied to the loop-core blocks during training. It collapses the recurrence quantization gap from 0.37 bpb to 0.002 bpb. This interaction between recurrence and quantization has not been documented in the competition or, to my knowledge, in the published literature.
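To make the idea concrete, here is a minimal sketch of noise injection matching int8 rounding error. This is not the PR's implementation; the helper names are hypothetical, and a real version would operate on tensors inside the training loop:

```python
import random

def int8_scale(weights):
    # Symmetric int8: step size such that max |w| maps to 127.
    return max(abs(w) for w in weights) / 127.0

def noisy_qat_weights(weights):
    """Perturb weights by uniform noise within +/- half an int8 step.

    The noise mimics the rounding the deployed int8 weights will see,
    so training learns weights robust to it. Unlike hard rounding, the
    perturbation is differentiable with respect to the weights.
    """
    step = int8_scale(weights)
    return [w + random.uniform(-step / 2, step / 2) for w in weights]

weights = [0.5, -1.27, 0.03, 0.8]
noisy = noisy_qat_weights(weights)
# Each weight moves by at most half an int8 quantization step.
assert all(abs(a - b) <= int8_scale(weights) / 2
           for a, b in zip(weights, noisy))
```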

3x3 > 2x5 loops: three unique core blocks repeated three times beats two unique blocks repeated five times: faster per step, better bpb, and a smaller artifact. Useful for anyone working on looped transformers.

Why recurrence fails here — the two taxes:

  1. Quantization compounding: the shared weights are quantized only once, but the resulting error propagates through all N repeats and compounds superlinearly
  2. Step-time overhead: +32 ms/step costs 1200 training steps within the 600 s budget, and the smaller artifact cannot compensate for the lost steps
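The compounding tax in item 1 can be illustrated with a toy error-propagation model. The gain and error values below are illustrative only, not measured from the runs; the point is the shape of the growth:

```python
def accumulated_error(gain, per_pass_err, repeats):
    """Toy model: each pass through the shared block multiplies the
    incoming error by `gain` and adds a fresh quantization error."""
    err = 0.0
    for _ in range(repeats):
        err = err * gain + per_pass_err
    return err

# With any effective gain > 1, accumulated error grows geometrically
# in the number of repeats, not linearly. A flat stack of unique
# blocks pays each block's quantization error once; a shared block
# re-amplifies its own error on every repeat.
e1 = accumulated_error(gain=2.0, per_pass_err=1.0, repeats=1)
e12 = accumulated_error(gain=2.0, per_pass_err=1.0, repeats=12)
assert e1 == 1.0
assert e12 == 4095.0      # 2**12 - 1: far beyond 12x linear scaling
```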

12 Negative Results (with numbers)

  1. XSA on all layers: +0.001 (worse)
  2. Cyclic momentum: catastrophic
  3. QuadgramHash: unclear
  4. Factored embeddings 192d: +0.053
  5. Factored embeddings 256d: +0.063
  6. Value Residual: +0.14 (catastrophic)
  7. Progressive unrolling: DDP crash
  8. Sawtooth LR: 4x slowdown from recompilation
  9. Full-weight TTT: overfitting
  10. LeakyReLU: didn't transfer to 8xH100
  11. Late QAT + int5: redundant, +0.006
  12. BigramHash 10240: no improvement on the looped arch

Full Writeup

See README.md for the complete 460-line research document including architecture details, hyperparameter sweep tables, reproduction instructions, and honest speculation about what might work with more compute.

Architecture

Based on PR #325 (Aum08Desai, Middle-Cycle looped transformer):

  • Stem (1-3 unique blocks) → Core (2-3 shared blocks × 2-5 repeats) → Tail (1-3 unique blocks)
  • 640d, 10 heads, GQA or full MHA, 3x MLP, BigramHash, SmearGate, XSA, partial RoPE
  • Loop-specific: per-repeat embeddings, LoRA adapters, refinement blocks on non-attention repeats
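As a control-flow sketch of the Middle-Cycle pattern above (hypothetical helper, not the PR's code; block internals, per-repeat embeddings, and LoRA adapters are elided):

```python
def looped_forward(x, stem, core, tail, repeats):
    """Middle-Cycle looped forward pass: unique stem blocks, then the
    same shared core blocks applied `repeats` times, then unique tail
    blocks. Parameters scale with len(stem) + len(core) + len(tail);
    effective depth with len(stem) + len(core) * repeats + len(tail)."""
    for block in stem:           # unique entry blocks
        x = block(x)
    for _ in range(repeats):     # weight-shared loop core
        for block in core:
            x = block(x)         # same weights on every repeat
    for block in tail:           # unique exit blocks
        x = block(x)
    return x

# Toy blocks that each add 1, so the output counts applications:
# 3 unique + 3 shared = 6 blocks of parameters, 12 effective layers.
inc = lambda x: x + 1
depth = looped_forward(0, [inc] * 2, [inc] * 3, [inc], repeats=3)
assert depth == 2 + 3 * 3 + 1   # 12 effective layers
```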

Test Plan

  • 3-seed validation on 8xH100 SXM (seeds 1337, 42, 7)
  • Controlled flat vs looped A/B comparison (same hardware, same config)
  • Quantization error amplification measured and documented
  • Full pipeline: train → EMA → quantize → compress → decompress → dequantize → eval
  • Hyperparameter sweeps (EMA, warmdown, MTP, WD, grad clip) on 2xH100
  • 12 negative results documented with specific bpb numbers
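The quantize → compress → decompress → dequantize legs of the pipeline bullet can be sketched end to end. Here zlib stands in for the zstd used in the actual artifact, and `quantize_int8` is a minimal hypothetical helper (symmetric per-tensor scaling):

```python
import struct
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: scale so max|w| -> 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 0.8, 0.0]
q, scale = quantize_int8(weights)

# Compression round trip is lossless; only quantization rounding remains.
blob = zlib.compress(struct.pack(f"{len(q)}b", *q))
q2 = list(struct.unpack(f"{len(q)}b", zlib.decompress(blob)))
restored = dequantize(q2, scale)

assert q2 == q
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```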

4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP
BigramHash + XSA + LoRA + Late STE QAT + int8+zstd

Key finding: quantization error amplifies ~900x through recurrence cycles,
making int6 incompatible with weight-sharing architectures. Int8 for shared
blocks reduces the gap from 1.14 to 0.37 bpb.

3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 22, 2026
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete 4-day experimental report on looped transformers in Parameter Golf:
- Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
- Noisy QAT: novel technique collapsing quant error from 0.37 to 0.002 bpb
- 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
- 12 negative results with specific numbers
- Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
- Updated training script with all experimental features
@evangelinehelsinki changed the title from "Non-record: Depth Recurrence + Quantization Error Amplification Finding" to "Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why" Mar 24, 2026
me when I cant write
@0hq
Collaborator

0hq commented Mar 25, 2026

I liked the writeup! Can you remove the README and the extra train gpt file and I can merge?

- Remove pr325_train_gpt.py from PR (dev file, not submission)
- Restore original README.md
- Update records/ writeup with v2 content
- Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
- Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)
@0hq
Collaborator

0hq commented Mar 25, 2026

Thanks!

0hq merged commit 50390d6 into openai:main Mar 25, 2026
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
jl-1 pushed a commit to jl-1/parameter-golf that referenced this pull request Mar 26, 2026
SmearGate is incompatible with weight sharing per the depth recurrence
research (PR openai#363). Disable it automatically when SHARE_BODY=1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
fumbl8 pushed a commit to fumbl8/parameter-golf that referenced this pull request Mar 27, 2026
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026