
Improved baseline: recurrent depth + LoRA + MTP #21

Closed
monroestephenson wants to merge 7 commits into openai:main from monroestephenson:improved-baseline

Conversation

monroestephenson commented on Mar 18, 2026

Summary

  • Depth recurrence: 5 unique transformer layers looped 2x = 10 effective layers, using weight sharing to trade stored parameters for effective depth
  • Wider model: MODEL_DIM=704 with 8 attention heads and 4 KV heads
  • SwiGLU MLP: Replaces the previous ReLU-squared MLP with a parameter-neutral SwiGLU variant
  • Per-loop LoRA adapters: Adds rank-16 LoRA adapters on attention Q/K/V/O projections for loop-specific specialization
  • Training-time extras: Adds multi-token prediction auxiliary heads and EMA averaging for eval/serialization
  • EMA fix: Prevents short runs from serializing near-initial weights when EMA_START_STEP is not reached
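The EMA fix in the last bullet can be sketched as follows. This is a minimal illustration of the described behavior, not the PR's actual code; the names `EMA_START_STEP`, `EmaTracker`, and the threshold value are assumptions.

```python
# Hypothetical sketch of the EMA fallback: on short runs that never
# reach EMA_START_STEP, serialize the raw weights instead of a
# near-initial EMA shadow. Names and thresholds are illustrative.
EMA_START_STEP = 100   # assumed step at which EMA tracking begins
EMA_DECAY = 0.999

class EmaTracker:
    def __init__(self):
        self.ema = None        # EMA shadow weights, keyed by name
        self.started = False   # whether EMA ever activated

    def update(self, step, weights):
        if step < EMA_START_STEP:
            return
        if not self.started:
            # Initialize the shadow from the current weights.
            self.ema = dict(weights)
            self.started = True
            return
        for k, w in weights.items():
            self.ema[k] = EMA_DECAY * self.ema[k] + (1 - EMA_DECAY) * w

    def weights_for_serialization(self, weights):
        # The fix: fall back to raw weights if EMA never started,
        # so short runs do not serialize near-initial parameters.
        return self.ema if self.started else dict(weights)
```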

Config: SP-1024, 5 unique x 2 recurrence = 10 effective layers, dim=704, 8 heads, 4 KV heads, tied embeddings.
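The depth-recurrence-plus-LoRA scheme in this config can be sketched as below. The dimensions (dim=704, rank 16, 5 unique layers, 2 loops) follow the PR description; the layer itself is a stand-in linear map rather than the real attention/MLP block, and all names are illustrative.

```python
# Minimal numpy sketch of depth recurrence with per-loop LoRA adapters:
# 5 shared layers looped 2x = 10 effective layers, with a separate
# rank-16 LoRA delta per layer per loop. Illustrative, not the PR code.
import numpy as np

DIM, RANK, LOOPS, LAYERS = 704, 16, 2, 5
rng = np.random.default_rng(0)

# Shared weights: stored once, reused on every loop.
W = [rng.normal(0, 0.02, (DIM, DIM)) for _ in range(LAYERS)]

# One (A, B) LoRA pair per layer per loop; B starts at zero so each
# adapter is a no-op at init, the usual LoRA convention.
lora = [[(rng.normal(0, 0.02, (DIM, RANK)), np.zeros((RANK, DIM)))
         for _ in range(LOOPS)] for _ in range(LAYERS)]

def forward(x):
    for loop in range(LOOPS):
        for i in range(LAYERS):
            A, B = lora[i][loop]
            # Residual + shared weight + loop-specific LoRA delta.
            x = x + x @ W[i] + (x @ A) @ B
    return x
```

The design point is the trade the summary describes: the loop reuses stored parameters for effective depth, while the small per-loop adapters let each pass specialize.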

Test plan

  • Model constructs and runs forward pass on CPU for the current config
  • Int8+zlib quantization roundtrip re-run on the current model (6,410,127 bytes on random weights)
  • Post-quant loss delta rerun on random weights (0.017730712890625)
  • python3 -m py_compile records/track_10min_16mb/2026-03-18_ImprovedBaseline/train_gpt.py
  • Full 8xH100 training run
  • Validate BPB improvement over the baseline on a real training run

8 unique layers x 2 recurrence = 16 effective layers (vs baseline 9),
dim=640 (vs 512), SwiGLU MLP, quantization-aware training noise.
Estimated 14.9 MB artifact under the 16 MB cap.

- Replace SwiGLU with ReLU^2 (proven at this scale, fewer params per layer)
- Remove fake QAT noise (it was corrupting weights, not real quantization-aware training)
- Replace dead parameterless recurrence_norms with learned per-recurrence scale vectors
- Config: 5 unique x 2 rec = 10 effective layers, dim=704, ~9.0 MB artifact
- Est. ~2.1x baseline speed, ~6500 steps, ~3.4B tokens in 10 min
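The "learned per-recurrence scale vectors" replacing the parameterless norms can be sketched as follows. This is a hypothetical illustration under the PR's config (dim=704, 2 loops); the names `scales` and `recur` are not from the actual code.

```python
# Sketch: each recurrence pass gets its own learned per-channel gain,
# replacing a dead parameterless norm. Initialized to 1 so the rescale
# is an identity at the start of training. Illustrative names only.
import numpy as np

DIM, LOOPS = 704, 2
scales = np.ones((LOOPS, DIM))       # learned parameters, init to 1

def recur(x, block):
    for loop in range(LOOPS):
        x = scales[loop] * block(x)  # per-loop, per-channel rescale
    return x
```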
0hq self-requested a review on Mar 18, 2026 23:20
0hq self-assigned this on Mar 18, 2026
monroestephenson changed the title from "Improved baseline: depth recurrence + SwiGLU + QAT noise" to "Improved baseline: recurrent depth + LoRA + MTP" on Mar 18, 2026
monroestephenson and others added 4 commits March 19, 2026 00:50
- 4 unique x 3 rec = 12 effective layers (was 5x2=10)
- dim=768, LoRA rank 32 (was dim=704, rank 16)
- Drop fullgraph=True for compatibility with LoRA conditionals
- ~6.7K steps estimated in 10 min, 5.72 MB artifact (36% of budget)
Replace the enable_gqa kwarg (requires PyTorch 2.5+) with manual repeat_interleave over the KV heads.
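The compatibility fix above can be sketched as below. The head counts follow the PR config (8 query heads, 4 KV heads, dim 704); numpy's `repeat` along the head axis mirrors what `torch.repeat_interleave` does here, so the sketch stays torch-free. Shapes and names are illustrative.

```python
# Sketch of the GQA fallback for PyTorch < 2.5: instead of passing
# enable_gqa to scaled_dot_product_attention, expand the 4 KV heads
# to match the 8 query heads by repeating each KV head in place.
import numpy as np

N_HEADS, N_KV_HEADS = 8, 4          # PR config
B, T, HEAD_DIM = 2, 16, 88          # 704 / 8 = 88

k = np.zeros((B, N_KV_HEADS, T, HEAD_DIM))
v = np.zeros((B, N_KV_HEADS, T, HEAD_DIM))

group = N_HEADS // N_KV_HEADS       # 2 query heads share each KV head
k = np.repeat(k, group, axis=1)     # -> (B, 8, T, HEAD_DIM)
v = np.repeat(v, group, axis=1)     # attention then proceeds as MHA
```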
Based on baseline architecture (9L/512d) with proven competition wins:
- Sliding window eval (stride=256) for ~0.03-0.04 BPB improvement
- seq_len=4096 for better per-token context
- Tuned Muon optimizer (0.99 momentum, lower LR, longer warmdown)
- FP16 tied embedding preservation through quantization
- GQA fix for PyTorch 2.4 compatibility
- Previous run log saved (1.2844 pre-quant, 2.3258 post-quant)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
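The sliding-window eval mentioned in the commit above (stride=256 at seq_len=4096) can be sketched as follows. Only the window arithmetic is shown; the function name and return shape are assumptions, not the PR's code.

```python
# Sketch of sliding-window eval: slide a seq_len window over the text
# and only score the final `stride` tokens of each window, so almost
# every scored token sees (near-)full left context. Illustrative only.
def window_spans(n_tokens, seq_len=4096, stride=256):
    spans = []
    start = 0
    while start < n_tokens:
        # Context begins seq_len tokens before the end of this chunk.
        begin = max(0, start + stride - seq_len)
        end = min(start + stride, n_tokens)
        # Score tokens [start, end) given context [begin, end).
        spans.append((begin, start, end))
        start = end
    return spans
```

Scoring only the window tail is what buys the ~0.03-0.04 BPB claimed above: a naive chunked eval scores early tokens of each chunk with little context, while here those tokens are re-scored later with long context.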
@0hq 0hq closed this Mar 19, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 31, 2026
- logs/daily_research.md: append 2026-03-31 research section
  - PR openai#771 CLOSED (score-first TTT rule violation)
  - PR openai#727 CLOSED (n-gram illegal — no renormalization)
  - Merged SOTA: 1.1147 (PR openai#1019, 2026-03-25)
  - New PRs: openai#1184 (0.9485 Scylla tokenizer), openai#1185 (0.9641)
  - SLOT eval technique, Full GPTQ, QK-Gain 4.0 documented
- CLAUDE.md: update Competition Strategy + lessons 21-24
  - Merged SOTA updated to 1.1147
  - Current Best Path rewritten for 2026-03-31
  - Lessons openai#21-24: TTT fix, n-gram risk, Scylla, SLOT
  - TTT constraint clarified to score-first protocol
  - Version bumped to v9.0

https://claude.ai/code/session_015z6QKyKzDSYzTniW1GPhAe
