Record Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64 #700
Open
RoyiRa wants to merge 1 commit into openai:main from
Conversation
agalimova added a commit to agalimova/parameter-golf that referenced this pull request on Mar 25, 2026
Built on PR openai#700 with hyperparameter improvements found via autoresearch-multi combinatorial search:

- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045). Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request on Mar 25, 2026
1. Rewrite ttt_adapt() to score-first pattern (Issue openai#677 compliant):
   - Process val data in sequential chunks (TTT_CHUNK_TOKENS=131072)
   - Phase 1: score chunk under inference_mode (forward only)
   - Phase 2: train on scored tokens with AdamW (K epochs)
   - Each token scored BEFORE the model trains on it
2. Switch TTT optimizer from SGD to AdamW (lr=0.0001, wd=0.0)
   - PR openai#700 showed AdamW >> SGD for TTT
   - Default 4 epochs, freeze first 2 blocks
3. Fix DDP find_unused_parameters → static_graph=True
   - Same 3x slowdown fix as submission directory
4. TTT defaults: disabled by default (TTT_ENABLED=0)
   - Enable with TTT_ENABLED=1 for TTT+n-gram combined eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
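The score-first pattern described in item 1 can be sketched as below. This is a minimal illustration, not the PR's actual `ttt_adapt()`: the HF-style `model(chunk).loss` interface and the helper's exact signature are assumptions.

```python
import torch

TTT_CHUNK_TOKENS = 131072  # chunk size from the commit message

def ttt_adapt(model, val_tokens, epochs=4, lr=1e-4):
    """Score-first TTT: each chunk is scored before the model trains on it."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(val_tokens), TTT_CHUNK_TOKENS):
        chunk = val_tokens[start:start + TTT_CHUNK_TOKENS]
        # Phase 1: score under inference_mode (forward only, no adaptation).
        with torch.inference_mode():
            loss = model(chunk).loss
        total_loss += loss.item() * len(chunk)
        total_tokens += len(chunk)
        # Phase 2: adapt on the already-scored tokens for K epochs.
        for _ in range(epochs):
            opt.zero_grad()
            model(chunk).loss.backward()
            opt.step()
    return total_loss / total_tokens
```

Because the Phase 1 forward pass runs before any optimizer step on that chunk, every reported token loss reflects the model state prior to seeing the token, which is what Issue openai#677 requires.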
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Four major additions to the Kuda Architecture:

1. Hedge Mixer (5-expert, eval-time): Multiplicative Weights Update mixing neural + unigram + bigram + trigram + entropy experts. Based on online learning theory (Freund & Schapire 1997). Same principle as the PAQ/CMIX world-best compressors. Expected -0.065 BPB (PR openai#700 validated).
2. CROWN-Q warmdown penalty: lambda * mean(w^2 * delta^2 / 12) pushes weights into flat minima that survive quantization. delta^2/12 is the uniform quantization noise variance; w^2 is a diagonal Fisher proxy. Applied during warmdown only. From PR openai#693.
3. RoPE NTK fix: Propagate train_seq_len to all blocks' Rotary modules. Prevents positional encoding mismatch between train (2048) and eval. From PR openai#714 — produced the tightest seed variance in the competition.
4. TTT infrastructure: Score-first eval with SGD adaptation on scored tokens. FiLM-only TTT planned for Kuda recurrence mode.

All features verified locally: forward/backward, CROWN-Q penalty, 5-expert Hedge mixing, Hedge weight updates, RoPE propagation. Script now 1,559 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
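The 5-expert Hedge mixing in item 1 can be sketched as a standard multiplicative weights update (Freund & Schapire 1997). This is an illustrative sketch only; the learning rate `ETA` and the function interfaces are assumptions, not the PR's code.

```python
import numpy as np

ETA = 0.5  # Hedge learning rate (assumed, not from the PR)

def hedge_mix(expert_probs, weights):
    """Convex combination of per-expert next-token distributions.

    expert_probs: (n_experts, vocab) probability rows, one per expert
                  (neural, unigram, bigram, trigram, entropy).
    weights:      current Hedge weights, shape (n_experts,).
    """
    w = weights / weights.sum()
    return w @ expert_probs

def hedge_update(expert_probs, weights, true_token):
    """Multiplicative weights update from per-expert log loss on the
    observed token: experts that predicted it well gain weight."""
    losses = -np.log(expert_probs[:, true_token] + 1e-12)
    new_w = weights * np.exp(-ETA * losses)
    return new_w / new_w.sum()
```

After scoring each position, the caller mixes first and updates second, so the mixed distribution never sees the true token in advance.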
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Deep analysis of feature dependency chains in both winning approaches. SOTA is speed-first, PR openai#700 is eval-first. Every feature enables the next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Research-backed fixes for all four blockers:

1. Quant gap (0.071→0.005): Late QAT with STE on bank slices, EMA via named_parameters (not state_dict), full GPTQ with Hessian
2. Eval speed (101min→10min): SOTA's sliding window TTT pattern, batch 32 windows, distribute across 8 GPUs, cosine LR decay
3. Artifact (16.9MB→16MB): 3% magnitude pruning (PR openai#700 pattern)
4. EMA/DDP: Use named_parameters() on the unwrapped base_model

All implementations sourced from actual SOTA code (pg-sota-train.md). Priority: EMA fix → Late QAT → Pruning → Sliding TTT → Full GPTQ.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
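The straight-through-estimator (STE) fake quantization in item 1 follows a well-known pattern; a minimal per-tensor sketch is below. The bit width and clip behavior are assumptions for illustration, not the commit's actual per-slice configuration.

```python
import torch

def fake_quant_ste(w, n_bits=4):
    """Forward: round weights to a uniform grid. Backward: pass gradients
    through unchanged (straight-through estimator)."""
    qmax = 2 ** (n_bits - 1) - 1
    delta = w.detach().abs().max() / qmax      # per-tensor step size (assumed)
    w_q = torch.clamp(torch.round(w / delta), -qmax - 1, qmax) * delta
    # w + (w_q - w).detach() evaluates to w_q in the forward pass, but its
    # gradient w.r.t. w is the identity, so training sees quantized weights
    # while the optimizer still receives useful gradients.
    return w + (w_q - w).detach()
```

During late QAT the loss is computed through `fake_quant_ste(w)` instead of `w`, which pulls the weights toward values that survive the real quantizer.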
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Maps every top entry through BPB = L + Q + T + M:

- openai#700 solved M (mixer) but has the worst L (training)
- openai#609 solved Q (quant) but has zero T and M (no eval pipeline)
- openai#549 solved L (training) but has zero M (no mixer)
- Nobody has optimized all four terms simultaneously
- Theoretical optimal = 1.052 (combine best of each)
- Our Track B path to 1.025 via recurrence + FiLM-only TTT + Mixer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
…eframe

Corrections:
- T+M are combined (-0.020), not separate. PR openai#700 gets -0.073 (3.6x better)
- Our Q gap (0.066) is larger than the openai#549-openai#700 total gap — Q is THE bottleneck
- Added "Best Known" column comparing against best per-term, not just merged SOTA

New insights added:
- Kaplan width scaling, hidden ≥ 512 threshold, Goldilocks depth
- MoE viability at small scale (inactive experts compress well)
- Vocab expansion opportunity (mechanical BPB reduction)
- Compression reframe: the BPB competition is a compression competition, with 20 years of literature
- Strategic evolution: feature bloat → simplify → Q bottleneck → compression-first approach
- Theoretical optimal 1.052 = combine best of openai#549 + openai#609 + openai#700 (nobody has done this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
Record: 5-expert Hedge Mixer + CROWN-Q + stride=64 (val_bpb=1.0541)
val_bpb: 1.0541 (3-seed mean) | ~15.7 MB | 8xH100 SXM
Results (8xH100 80GB SXM)
Contributions
1. CROWN-Q Training Penalty (training-time)
Added a quantization-aware penalty during warmdown that penalizes weights sensitive to quantization error:

penalty = CROWN_Q_LAMBDA * mean(w^2 * delta^2 / 12)

where delta = row_max / clip_range is the per-row quantization step size, and delta^2 / 12 is the variance of uniform quantization noise. This encourages weights to be quantization-friendly, reducing post-quantization degradation. CROWN_Q_LAMBDA=0.01.

Effect: slightly better compression (artifact ~200KB smaller) and more robust quantization.
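A minimal sketch of the penalty term, assuming an int8-style clip range of 127 (the actual clip range used by the submission is not stated here):

```python
import torch

CROWN_Q_LAMBDA = 0.01
CLIP_RANGE = 127.0  # assumed int8-style per-row clip range

def crown_q_penalty(weight):
    """lambda * mean(w^2 * delta^2 / 12).

    delta^2 / 12 is the variance of uniform quantization noise for step
    size delta; w^2 acts as a diagonal Fisher proxy for how much each
    weight's error matters. Large weights in rows with large dynamic
    range (large delta) are penalized most.
    """
    row_max = weight.detach().abs().amax(dim=1, keepdim=True)
    delta = row_max / CLIP_RANGE  # per-row quantization step size
    return CROWN_Q_LAMBDA * (weight ** 2 * delta ** 2 / 12).mean()
```

During warmdown the penalty is simply added to the training loss, so gradients push weights toward configurations whose quantization error is cheap.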
2. Eval stride 32 -> 64 (eval-time)
Changed sliding window stride from 32 to 64 during evaluation. Experiment showed identical BPB quality but 2x faster scoring. Frees ~100s of eval budget for more TTT epochs.
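The stride/speed trade-off can be illustrated with a toy scoring loop. This is a hypothetical sketch: the window size, the `model_nll` interface, and the helper name are assumptions, not the submission's code. Each forward pass covers a full window of context but only the last `stride` positions are scored, so doubling the stride halves the number of forward passes.

```python
def score_sliding(model_nll, tokens, window=2048, stride=64):
    """Score every token exactly once with sliding windows.

    model_nll(ctx) is assumed to return per-token NLLs for ctx; only the
    final `stride` positions of each window contribute to the total.
    """
    total, n_passes, pos = 0.0, 0, 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)
        ctx = tokens[start:pos + stride]
        nll = model_nll(ctx)                 # per-token NLL (assumed API)
        k = min(stride, len(tokens) - pos)   # tokens newly scored this pass
        total += sum(nll[-k:])
        pos += stride
        n_passes += 1
    return total, n_passes
```

With a uniform model, total NLL is identical for stride=32 and stride=64 while the pass count halves, matching the observed "same BPB, 2x faster" result.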
3. TTT Epochs 3 -> 4 (eval-time)
Increased test-time training from 3 to 4 epochs per chunk, using the time freed by stride=64. Each additional epoch adapts the model more to scored data. Tested 8 epochs but that overfits (1.0735 vs 1.0473 for 4 epochs).
Combined Effect
Architecture
Reproduction
Compliance
Score-first: every token is scored under inference_mode() before any training on it.

Credits