
Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why #363

Merged
0hq merged 4 commits into openai:main from evangelinehelsinki:submission/depth-recurrence
Mar 25, 2026
Conversation

@evangelinehelsinki
Contributor

evangelinehelsinki commented Mar 21, 2026

Summary

Depth recurrence saves parameters but does not improve bpb under the competition constraints. Four days of controlled experiments across 8xH100, 2xH100, and consumer GPUs, ~35 runs in total.

  • Flat 11L 512d: 1.1648 bpb (sliding window) | 15.3 MB | 5375 steps @ 112ms
  • Looped 3x3 640d: 1.1894 bpb (sliding window) | 14.5 MB | 4175 steps @ 144ms
  • Gap: +0.025 bpb (looped worse) — same hardware, same tricks, same seed
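The arithmetic behind these rows is worth making explicit. A quick check, with every number copied from the bullets above (nothing new is introduced):

```python
# Numbers from the flat-vs-looped A/B rows above.
flat_steps, flat_ms = 5375, 112   # flat 11L 512d
loop_steps, loop_ms = 4175, 144   # looped 3x3 640d

overhead_ms = loop_ms - flat_ms       # extra wall-clock cost per looped step
lost_steps = flat_steps - loop_steps  # steps lost within the fixed time budget
bpb_gap = round(1.1894 - 1.1648, 4)   # looped minus flat

assert overhead_ms == 32   # the +32 ms/step overhead
assert lost_steps == 1200  # 1200 fewer training steps
assert bpb_gap == 0.0246   # reported as +0.025
```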

Key Findings

Noisy QAT (our contribution): differentiable uniform noise matching the int8 quantization error, applied to the loop-core blocks during training. It collapses the recurrence quantization gap from 0.37 bpb to 0.002 bpb. This interaction between recurrence and quantization has not been documented in the competition or, to my knowledge, in the published literature.
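To make the idea concrete, here is a minimal sketch of noise injection matching int8 rounding error. This is not the PR's implementation; the helper names are hypothetical, and a real version would operate on tensors inside the training loop:

```python
import random

def int8_scale(weights):
    # Symmetric int8: step size such that max |w| maps to 127.
    return max(abs(w) for w in weights) / 127.0

def noisy_qat_weights(weights):
    """Perturb weights by uniform noise within +/- half an int8 step.

    The noise mimics the rounding the deployed int8 weights will see,
    so training learns weights robust to it. Unlike hard rounding, the
    perturbation is differentiable with respect to the weights.
    """
    step = int8_scale(weights)
    return [w + random.uniform(-step / 2, step / 2) for w in weights]

weights = [0.5, -1.27, 0.03, 0.8]
noisy = noisy_qat_weights(weights)
# Each weight moves by at most half an int8 quantization step.
assert all(abs(a - b) <= int8_scale(weights) / 2
           for a, b in zip(weights, noisy))
```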

3x3 > 2x5 loops: three unique core blocks repeated three times beats two unique blocks repeated five times: faster per step, better bpb, and a smaller artifact. Useful for anyone working on looped transformers.

Why recurrence fails here — the two taxes:

  1. Quantization compounding: the shared weights are quantized only once, but the resulting error propagates through all N repeats and compounds superlinearly
  2. Step-time overhead: +32 ms/step costs 1200 training steps within the 600 s budget, and the smaller artifact cannot compensate for the lost steps
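The compounding tax in item 1 can be illustrated with a toy error-propagation model. The gain and error values below are illustrative only, not measured from the runs; the point is the shape of the growth:

```python
def accumulated_error(gain, per_pass_err, repeats):
    """Toy model: each pass through the shared block multiplies the
    incoming error by `gain` and adds a fresh quantization error."""
    err = 0.0
    for _ in range(repeats):
        err = err * gain + per_pass_err
    return err

# With any effective gain > 1, accumulated error grows geometrically
# in the number of repeats, not linearly. A flat stack of unique
# blocks pays each block's quantization error once; a shared block
# re-amplifies its own error on every repeat.
e1 = accumulated_error(gain=2.0, per_pass_err=1.0, repeats=1)
e12 = accumulated_error(gain=2.0, per_pass_err=1.0, repeats=12)
assert e1 == 1.0
assert e12 == 4095.0      # 2**12 - 1: far beyond 12x linear scaling
```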

12 Negative Results (with numbers)

  1. XSA on all layers: +0.001 (worse)
  2. Cyclic momentum: catastrophic
  3. QuadgramHash: unclear
  4. Factored embeddings 192d: +0.053
  5. Factored embeddings 256d: +0.063
  6. Value Residual: +0.14 (catastrophic)
  7. Progressive unrolling: DDP crash
  8. Sawtooth LR: 4x slowdown from recompilation
  9. Full-weight TTT: overfitting
  10. LeakyReLU: didn't transfer to 8xH100
  11. Late QAT + int5: redundant, +0.006
  12. BigramHash 10240: no improvement on the looped arch

Full Writeup

See README.md for the complete 460-line research document including architecture details, hyperparameter sweep tables, reproduction instructions, and honest speculation about what might work with more compute.

Architecture

Based on PR #325 (Aum08Desai, Middle-Cycle looped transformer):

  • Stem (1-3 unique blocks) → Core (2-3 shared blocks × 2-5 repeats) → Tail (1-3 unique blocks)
  • 640d, 10 heads, GQA or full MHA, 3x MLP, BigramHash, SmearGate, XSA, partial RoPE
  • Loop-specific: per-repeat embeddings, LoRA adapters, refinement blocks on non-attention repeats
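As a control-flow sketch of the Middle-Cycle pattern above (hypothetical helper, not the PR's code; block internals, per-repeat embeddings, and LoRA adapters are elided):

```python
def looped_forward(x, stem, core, tail, repeats):
    """Middle-Cycle looped forward pass: unique stem blocks, then the
    same shared core blocks applied `repeats` times, then unique tail
    blocks. Parameters scale with len(stem) + len(core) + len(tail);
    effective depth with len(stem) + len(core) * repeats + len(tail)."""
    for block in stem:           # unique entry blocks
        x = block(x)
    for _ in range(repeats):     # weight-shared loop core
        for block in core:
            x = block(x)         # same weights on every repeat
    for block in tail:           # unique exit blocks
        x = block(x)
    return x

# Toy blocks that each add 1, so the output counts applications:
# 3 unique + 3 shared = 6 blocks of parameters, 12 effective layers.
inc = lambda x: x + 1
depth = looped_forward(0, [inc] * 2, [inc] * 3, [inc], repeats=3)
assert depth == 2 + 3 * 3 + 1   # 12 effective layers
```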

Test Plan

  • 3-seed validation on 8xH100 SXM (seeds 1337, 42, 7)
  • Controlled flat vs looped A/B comparison (same hardware, same config)
  • Quantization error amplification measured and documented
  • Full pipeline: train → EMA → quantize → compress → decompress → dequantize → eval
  • Hyperparameter sweeps (EMA, warmdown, MTP, WD, grad clip) on 2xH100
  • 12 negative results documented with specific bpb numbers
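The quantize → compress → decompress → dequantize legs of the pipeline bullet can be sketched end to end. Here zlib stands in for the zstd used in the actual artifact, and `quantize_int8` is a minimal hypothetical helper (symmetric per-tensor scaling):

```python
import struct
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: scale so max|w| -> 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 0.8, 0.0]
q, scale = quantize_int8(weights)

# Compression round trip is lossless; only quantization rounding remains.
blob = zlib.compress(struct.pack(f"{len(q)}b", *q))
q2 = list(struct.unpack(f"{len(q)}b", zlib.decompress(blob)))
restored = dequantize(q2, scale)

assert q2 == q
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```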

4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP
BigramHash + XSA + LoRA + Late STE QAT + int8+zstd

Key finding: quantization error amplifies ~900x through recurrence cycles,
making int6 incompatible with weight-sharing architectures. Int8 for shared
blocks reduces the gap from 1.14 to 0.37 bpb.

3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)
rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 22, 2026
original_model.md:
- Discard depth recurrence (amplifies quant error 900×, throughput loss)
- New direction: eval-time optimization stack (PPM-C + GPTQ-lite)
- Document all our experiment results (v3, v4, v4_30m, ringgolf)
- Add TTT/XSA interaction findings (PR openai#303: mutually exclusive)
- Add PR openai#375 meta-insight (1ms overhead = 0.006 BPB)
- 4-phase execution plan targeting PPM-C as original contribution

review_pr_records_track_10min_16mb.md:
- Add 2026-03-22 update with PRs openai#374, openai#379, openai#390, openai#375, openai#303, openai#363
- New SOTA at 1.1246 (PR openai#374: Tight SWA + VE128)
- Document negative results from $500 compute spend (PR openai#375)
- Unexplored opportunities: PPM-C, Neural Cache

review_records_track_10min_16mb.md:
- Add timestamp note (17 records, no changes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete 4-day experimental report on looped transformers in Parameter Golf:
- Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
- Noisy QAT: novel technique collapsing quant error from 0.37 to 0.002 bpb
- 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
- 12 negative results with specific numbers
- Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
- Updated training script with all experimental features
@evangelinehelsinki changed the title from "Non-record: Depth Recurrence + Quantization Error Amplification Finding" to "Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why" Mar 24, 2026
me when I cant write
@0hq
Collaborator

0hq commented Mar 25, 2026

I liked the writeup! Can you remove the README and the extra train gpt file and I can merge?

- Remove pr325_train_gpt.py from PR (dev file, not submission)
- Restore original README.md
- Update records/ writeup with v2 content
- Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
- Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)
@0hq
Collaborator

0hq commented Mar 25, 2026

Thanks!

0hq merged commit 50390d6 into openai:main Mar 25, 2026
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
jl-1 pushed a commit to jl-1/parameter-golf that referenced this pull request Mar 26, 2026
SmearGate is incompatible with weight sharing per the depth recurrence
research (PR openai#363). Disable it automatically when SHARE_BODY=1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
fumbl8 pushed a commit to fumbl8/parameter-golf that referenced this pull request Mar 27, 2026
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026