
Improved baseline: recurrent depth + LoRA + MTP #21

Closed
monroestephenson wants to merge 7 commits into openai:main from monroestephenson:improved-baseline

Conversation

monroestephenson commented on Mar 18, 2026

Summary

  • Depth recurrence: 5 unique transformer layers looped 2x = 10 effective layers, using weight sharing to trade stored parameters for effective depth
  • Wider model: MODEL_DIM=704 with 8 attention heads and 4 KV heads
  • SwiGLU MLP: Replaces the previous ReLU-squared MLP with a parameter-neutral SwiGLU variant
  • Per-loop LoRA adapters: Adds rank-16 LoRA adapters on attention Q/K/V/O projections for loop-specific specialization
  • Training-time extras: Adds multi-token prediction auxiliary heads and EMA averaging for eval/serialization
  • EMA fix: Prevents short runs from serializing near-initial weights when EMA_START_STEP is not reached
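The EMA fix in the last bullet can be sketched as follows. This is a minimal illustration of the described behavior, not the PR's actual code; the names `EMA_START_STEP`, `EmaTracker`, and the threshold value are assumptions.

```python
# Hypothetical sketch of the EMA fallback: on short runs that never
# reach EMA_START_STEP, serialize the raw weights instead of a
# near-initial EMA shadow. Names and thresholds are illustrative.
EMA_START_STEP = 100   # assumed step at which EMA tracking begins
EMA_DECAY = 0.999

class EmaTracker:
    def __init__(self):
        self.ema = None        # EMA shadow weights, keyed by name
        self.started = False   # whether EMA ever activated

    def update(self, step, weights):
        if step < EMA_START_STEP:
            return
        if not self.started:
            # Initialize the shadow from the current weights.
            self.ema = dict(weights)
            self.started = True
            return
        for k, w in weights.items():
            self.ema[k] = EMA_DECAY * self.ema[k] + (1 - EMA_DECAY) * w

    def weights_for_serialization(self, weights):
        # The fix: fall back to raw weights if EMA never started,
        # so short runs do not serialize near-initial parameters.
        return self.ema if self.started else dict(weights)
```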

Config: SP-1024, 5 unique x 2 recurrence = 10 effective layers, dim=704, 8 heads, 4 KV heads, tied embeddings.
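The depth-recurrence-plus-LoRA scheme in this config can be sketched as below. The dimensions (dim=704, rank 16, 5 unique layers, 2 loops) follow the PR description; the layer itself is a stand-in linear map rather than the real attention/MLP block, and all names are illustrative.

```python
# Minimal numpy sketch of depth recurrence with per-loop LoRA adapters:
# 5 shared layers looped 2x = 10 effective layers, with a separate
# rank-16 LoRA delta per layer per loop. Illustrative, not the PR code.
import numpy as np

DIM, RANK, LOOPS, LAYERS = 704, 16, 2, 5
rng = np.random.default_rng(0)

# Shared weights: stored once, reused on every loop.
W = [rng.normal(0, 0.02, (DIM, DIM)) for _ in range(LAYERS)]

# One (A, B) LoRA pair per layer per loop; B starts at zero so each
# adapter is a no-op at init, the usual LoRA convention.
lora = [[(rng.normal(0, 0.02, (DIM, RANK)), np.zeros((RANK, DIM)))
         for _ in range(LOOPS)] for _ in range(LAYERS)]

def forward(x):
    for loop in range(LOOPS):
        for i in range(LAYERS):
            A, B = lora[i][loop]
            # Residual + shared weight + loop-specific LoRA delta.
            x = x + x @ W[i] + (x @ A) @ B
    return x
```

The design point is the trade the summary describes: the loop reuses stored parameters for effective depth, while the small per-loop adapters let each pass specialize.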

Test plan

  • Model constructs and runs forward pass on CPU for the current config
  • Int8+zlib quantization roundtrip re-run on the current model (6,410,127 bytes on random weights)
  • Post-quant loss delta rerun on random weights (0.017730712890625)
  • python3 -m py_compile records/track_10min_16mb/2026-03-18_ImprovedBaseline/train_gpt.py
  • Full 8xH100 training run
  • Validate BPB improvement over the baseline on a real training run

8 unique layers x 2 recurrence = 16 effective layers (vs baseline 9),
dim=640 (vs 512), SwiGLU MLP, quantization-aware training noise.
Estimated 14.9 MB artifact under the 16 MB cap.

- Replace SwiGLU with ReLU^2 (proven at this scale, fewer params per layer)
- Remove fake QAT noise (it was corrupting weights, not real quantization-aware training)
- Replace dead parameterless recurrence_norms with learned per-recurrence scale vectors
- Config: 5 unique x 2 rec = 10 effective layers, dim=704, ~9.0 MB artifact
- Est. ~2.1x baseline speed, ~6500 steps, ~3.4B tokens in 10 min
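The "learned per-recurrence scale vectors" replacing the parameterless norms can be sketched as follows. This is a hypothetical illustration under the PR's config (dim=704, 2 loops); the names `scales` and `recur` are not from the actual code.

```python
# Sketch: each recurrence pass gets its own learned per-channel gain,
# replacing a dead parameterless norm. Initialized to 1 so the rescale
# is an identity at the start of training. Illustrative names only.
import numpy as np

DIM, LOOPS = 704, 2
scales = np.ones((LOOPS, DIM))       # learned parameters, init to 1

def recur(x, block):
    for loop in range(LOOPS):
        x = scales[loop] * block(x)  # per-loop, per-channel rescale
    return x
```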
0hq self-requested a review on Mar 18, 2026 23:20
0hq self-assigned this on Mar 18, 2026
monroestephenson changed the title from "Improved baseline: depth recurrence + SwiGLU + QAT noise" to "Improved baseline: recurrent depth + LoRA + MTP" on Mar 18, 2026
monroestephenson and others added 4 commits March 19, 2026 00:50
- 4 unique x 3 rec = 12 effective layers (was 5x2=10)
- dim=768, LoRA rank 32 (was dim=704, rank 16)
- Drop fullgraph=True for compatibility with LoRA conditionals
- ~6.7K steps estimated in 10 min, 5.72 MB artifact (36% of budget)
Replace the enable_gqa kwarg (requires PyTorch 2.5+) with manual repeat_interleave over the KV heads.
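The compatibility fix above can be sketched as below. The head counts follow the PR config (8 query heads, 4 KV heads, dim 704); numpy's `repeat` along the head axis mirrors what `torch.repeat_interleave` does here, so the sketch stays torch-free. Shapes and names are illustrative.

```python
# Sketch of the GQA fallback for PyTorch < 2.5: instead of passing
# enable_gqa to scaled_dot_product_attention, expand the 4 KV heads
# to match the 8 query heads by repeating each KV head in place.
import numpy as np

N_HEADS, N_KV_HEADS = 8, 4          # PR config
B, T, HEAD_DIM = 2, 16, 88          # 704 / 8 = 88

k = np.zeros((B, N_KV_HEADS, T, HEAD_DIM))
v = np.zeros((B, N_KV_HEADS, T, HEAD_DIM))

group = N_HEADS // N_KV_HEADS       # 2 query heads share each KV head
k = np.repeat(k, group, axis=1)     # -> (B, 8, T, HEAD_DIM)
v = np.repeat(v, group, axis=1)     # attention then proceeds as MHA
```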
Based on baseline architecture (9L/512d) with proven competition wins:
- Sliding window eval (stride=256) for ~0.03-0.04 BPB improvement
- seq_len=4096 for better per-token context
- Tuned Muon optimizer (0.99 momentum, lower LR, longer warmdown)
- FP16 tied embedding preservation through quantization
- GQA fix for PyTorch 2.4 compatibility
- Previous run log saved (1.2844 pre-quant, 2.3258 post-quant)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
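The sliding-window eval mentioned in the commit above (stride=256 at seq_len=4096) can be sketched as follows. Only the window arithmetic is shown; the function name and return shape are assumptions, not the PR's code.

```python
# Sketch of sliding-window eval: slide a seq_len window over the text
# and only score the final `stride` tokens of each window, so almost
# every scored token sees (near-)full left context. Illustrative only.
def window_spans(n_tokens, seq_len=4096, stride=256):
    spans = []
    start = 0
    while start < n_tokens:
        # Context begins seq_len tokens before the end of this chunk.
        begin = max(0, start + stride - seq_len)
        end = min(start + stride, n_tokens)
        # Score tokens [start, end) given context [begin, end).
        spans.append((begin, start, end))
        start = end
    return spans
```

Scoring only the window tail is what buys the ~0.03-0.04 BPB claimed above: a naive chunked eval scores early tokens of each chunk with little context, while here those tokens are re-scored later with long context.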
@0hq 0hq closed this Mar 19, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 31, 2026
- logs/daily_research.md: append 2026-03-31 research section
  - PR openai#771 CLOSED (score-first TTT rule violation)
  - PR openai#727 CLOSED (n-gram illegal — no renormalization)
  - Merged SOTA: 1.1147 (PR openai#1019, 2026-03-25)
  - New PRs: openai#1184 (0.9485 Scylla tokenizer), openai#1185 (0.9641)
  - SLOT eval technique, Full GPTQ, QK-Gain 4.0 documented
- CLAUDE.md: update Competition Strategy + lessons 21-24
  - Merged SOTA updated to 1.1147
  - Current Best Path rewritten for 2026-03-31
  - Lessons openai#21-24: TTT fix, n-gram risk, Scylla, SLOT
  - TTT constraint clarified to score-first protocol
  - Version bumped to v9.0

https://claude.ai/code/session_015z6QKyKzDSYzTniW1GPhAe
