Record Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64 #700
Open
RoyiRa wants to merge 1 commit into openai:main from
Conversation
agalimova added a commit to agalimova/parameter-golf that referenced this pull request on Mar 25, 2026
Built on PR openai#700 with hyperparameter improvements found via autoresearch-multi combinatorial search:

- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045). Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request on Mar 25, 2026
1. Rewrite ttt_adapt() to score-first pattern (Issue openai#677 compliant):
   - Process val data in sequential chunks (TTT_CHUNK_TOKENS=131072)
   - Phase 1: score chunk under inference_mode (forward only)
   - Phase 2: train on scored tokens with AdamW (K epochs)
   - Each token scored BEFORE the model trains on it
2. Switch TTT optimizer from SGD to AdamW (lr=0.0001, wd=0.0)
   - PR openai#700 showed AdamW >> SGD for TTT
   - Default 4 epochs, freeze first 2 blocks
3. Fix DDP find_unused_parameters → static_graph=True
   - Same 3x slowdown fix as submission directory
4. TTT defaults: disabled by default (TTT_ENABLED=0)
   - Enable with TTT_ENABLED=1 for TTT+n-gram combined eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
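The score-first pattern described in item 1 can be sketched as below. This is a minimal illustration, not the PR's actual `ttt_adapt()`: the HF-style `model(chunk).loss` interface and the helper's exact signature are assumptions.

```python
import torch

TTT_CHUNK_TOKENS = 131072  # chunk size from the commit message

def ttt_adapt(model, val_tokens, epochs=4, lr=1e-4):
    """Score-first TTT: each chunk is scored before the model trains on it."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(val_tokens), TTT_CHUNK_TOKENS):
        chunk = val_tokens[start:start + TTT_CHUNK_TOKENS]
        # Phase 1: score under inference_mode (forward only, no adaptation).
        with torch.inference_mode():
            loss = model(chunk).loss
        total_loss += loss.item() * len(chunk)
        total_tokens += len(chunk)
        # Phase 2: adapt on the already-scored tokens for K epochs.
        for _ in range(epochs):
            opt.zero_grad()
            model(chunk).loss.backward()
            opt.step()
    return total_loss / total_tokens
```

Because the Phase 1 forward pass runs before any optimizer step on that chunk, every reported token loss reflects the model state prior to seeing the token, which is what Issue openai#677 requires.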
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Four major additions to the Kuda Architecture:

1. Hedge Mixer (5-expert, eval-time): Multiplicative Weights Update mixing neural + unigram + bigram + trigram + entropy experts. Based on online learning theory (Freund & Schapire 1997). Same principle as the PAQ/CMIX world-best compressors. Expected -0.065 BPB (PR openai#700 validated).
2. CROWN-Q warmdown penalty: lambda * mean(w^2 * delta^2 / 12) pushes weights into flat minima that survive quantization. delta^2/12 is the uniform quantization noise variance; w^2 is a diagonal Fisher proxy. Applied during warmdown only. From PR openai#693.
3. RoPE NTK fix: Propagate train_seq_len to all blocks' Rotary modules. Prevents positional encoding mismatch between train (2048) and eval. From PR openai#714 — produced the tightest seed variance in the competition.
4. TTT infrastructure: Score-first eval with SGD adaptation on scored tokens. FiLM-only TTT planned for Kuda recurrence mode.

All features verified locally: forward/backward, CROWN-Q penalty, 5-expert Hedge mixing, Hedge weight updates, RoPE propagation. Script now 1,559 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
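The 5-expert Hedge mixing in item 1 can be sketched as a standard multiplicative weights update (Freund & Schapire 1997). This is an illustrative sketch only; the learning rate `ETA` and the function interfaces are assumptions, not the PR's code.

```python
import numpy as np

ETA = 0.5  # Hedge learning rate (assumed, not from the PR)

def hedge_mix(expert_probs, weights):
    """Convex combination of per-expert next-token distributions.

    expert_probs: (n_experts, vocab) probability rows, one per expert
                  (neural, unigram, bigram, trigram, entropy).
    weights:      current Hedge weights, shape (n_experts,).
    """
    w = weights / weights.sum()
    return w @ expert_probs

def hedge_update(expert_probs, weights, true_token):
    """Multiplicative weights update from per-expert log loss on the
    observed token: experts that predicted it well gain weight."""
    losses = -np.log(expert_probs[:, true_token] + 1e-12)
    new_w = weights * np.exp(-ETA * losses)
    return new_w / new_w.sum()
```

After scoring each position, the caller mixes first and updates second, so the mixed distribution never sees the true token in advance.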
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Deep analysis of feature dependency chains in both winning approaches. SOTA is speed-first, PR openai#700 is eval-first. Every feature enables the next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Research-backed fixes for all four blockers:

1. Quant gap (0.071→0.005): Late QAT with STE on bank slices, EMA via named_parameters (not state_dict), full GPTQ with Hessian
2. Eval speed (101min→10min): SOTA's sliding window TTT pattern, batch 32 windows, distribute across 8 GPUs, cosine LR decay
3. Artifact (16.9MB→16MB): 3% magnitude pruning (PR openai#700 pattern)
4. EMA/DDP: Use named_parameters() on the unwrapped base_model

All implementations sourced from actual SOTA code (pg-sota-train.md). Priority: EMA fix → Late QAT → Pruning → Sliding TTT → Full GPTQ.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
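The straight-through-estimator (STE) fake quantization in item 1 follows a well-known pattern; a minimal per-tensor sketch is below. The bit width and clip behavior are assumptions for illustration, not the commit's actual per-slice configuration.

```python
import torch

def fake_quant_ste(w, n_bits=4):
    """Forward: round weights to a uniform grid. Backward: pass gradients
    through unchanged (straight-through estimator)."""
    qmax = 2 ** (n_bits - 1) - 1
    delta = w.detach().abs().max() / qmax      # per-tensor step size (assumed)
    w_q = torch.clamp(torch.round(w / delta), -qmax - 1, qmax) * delta
    # w + (w_q - w).detach() evaluates to w_q in the forward pass, but its
    # gradient w.r.t. w is the identity, so training sees quantized weights
    # while the optimizer still receives useful gradients.
    return w + (w_q - w).detach()
```

During late QAT the loss is computed through `fake_quant_ste(w)` instead of `w`, which pulls the weights toward values that survive the real quantizer.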
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
Maps every top entry through BPB = L + Q + T + M:

- openai#700 solved M (mixer) but has the worst L (training)
- openai#609 solved Q (quant) but has zero T and M (no eval pipeline)
- openai#549 solved L (training) but has zero M (no mixer)
- Nobody has optimized all four terms simultaneously
- Theoretical optimal = 1.052 (combine best of each)
- Our Track B path to 1.025 via recurrence + FiLM-only TTT + Mixer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request on Mar 27, 2026
…eframe

Corrections:
- T+M are combined (-0.020), not separate. PR openai#700 gets -0.073 (3.6x better)
- Our Q gap (0.066) is larger than the openai#549-openai#700 total gap — Q is THE bottleneck
- Added "Best Known" column comparing against best per-term, not just merged SOTA

New insights added:
- Kaplan width scaling, hidden ≥ 512 threshold, Goldilocks depth
- MoE viability at small scale (inactive experts compress well)
- Vocab expansion opportunity (mechanical BPB reduction)
- Compression reframe: the BPB competition is a compression competition, with 20 years of literature
- Strategic evolution: feature bloat → simplify → Q bottleneck → compression-first approach
- Theoretical optimal 1.052 = combine best of openai#549 + openai#609 + openai#700 (nobody has done this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Kevin Francis Tan <kf.tan@lightarchitects.io>
Record: 5-expert Hedge Mixer + CROWN-Q + stride=64 (val_bpb=1.0541)
val_bpb: 1.0541 (3-seed mean) | ~15.7 MB | 8xH100 SXM
Results (8xH100 80GB SXM)
Contributions
1. CROWN-Q Training Penalty (training-time)
Added a quantization-aware penalty during warmdown that penalizes weights sensitive to quantization error:

penalty = CROWN_Q_LAMBDA * mean(w^2 * delta^2 / 12)

where delta = row_max / clip_range is the per-row quantization step size, and delta^2 / 12 is the variance of uniform quantization noise. This encourages weights to be quantization-friendly, reducing post-quantization degradation. CROWN_Q_LAMBDA=0.01.

Effect: slightly better compression (artifact ~200KB smaller) and more robust quantization.
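A minimal sketch of the penalty term, assuming an int8-style clip range of 127 (the actual clip range used by the submission is not stated here):

```python
import torch

CROWN_Q_LAMBDA = 0.01
CLIP_RANGE = 127.0  # assumed int8-style per-row clip range

def crown_q_penalty(weight):
    """lambda * mean(w^2 * delta^2 / 12).

    delta^2 / 12 is the variance of uniform quantization noise for step
    size delta; w^2 acts as a diagonal Fisher proxy for how much each
    weight's error matters. Large weights in rows with large dynamic
    range (large delta) are penalized most.
    """
    row_max = weight.detach().abs().amax(dim=1, keepdim=True)
    delta = row_max / CLIP_RANGE  # per-row quantization step size
    return CROWN_Q_LAMBDA * (weight ** 2 * delta ** 2 / 12).mean()
```

During warmdown the penalty is simply added to the training loss, so gradients push weights toward configurations whose quantization error is cheap.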
2. Eval stride 32 -> 64 (eval-time)
Changed sliding window stride from 32 to 64 during evaluation. Experiment showed identical BPB quality but 2x faster scoring. Frees ~100s of eval budget for more TTT epochs.
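The stride/speed trade-off can be illustrated with a toy scoring loop. This is a hypothetical sketch: the window size, the `model_nll` interface, and the helper name are assumptions, not the submission's code. Each forward pass covers a full window of context but only the last `stride` positions are scored, so doubling the stride halves the number of forward passes.

```python
def score_sliding(model_nll, tokens, window=2048, stride=64):
    """Score every token exactly once with sliding windows.

    model_nll(ctx) is assumed to return per-token NLLs for ctx; only the
    final `stride` positions of each window contribute to the total.
    """
    total, n_passes, pos = 0.0, 0, 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)
        ctx = tokens[start:pos + stride]
        nll = model_nll(ctx)                 # per-token NLL (assumed API)
        k = min(stride, len(tokens) - pos)   # tokens newly scored this pass
        total += sum(nll[-k:])
        pos += stride
        n_passes += 1
    return total, n_passes
```

With a uniform model, total NLL is identical for stride=32 and stride=64 while the pass count halves, matching the observed "same BPB, 2x faster" result.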
3. TTT Epochs 3 -> 4 (eval-time)
Increased test-time training from 3 to 4 epochs per chunk, using the time freed by stride=64. Each additional epoch adapts the model more to scored data. Tested 8 epochs but that overfits (1.0735 vs 1.0473 for 4 epochs).
Combined Effect
Architecture
Reproduction
Compliance
Score-first: every token is scored under inference_mode() before any training on it.

Credits