Record: Sliding Window + FP16 Embed + 10L + Muon WD + Overtone Init (val_bpb=1.1748) #60
Merged
0hq merged 6 commits into openai:main on Mar 19, 2026
Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
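The commit message above trains at a context of 1024 and evaluates at 2048 with NTK-aware dynamic RoPE scaling. The PR's own implementation is not shown here, so the following is only a minimal sketch of the commonly used formulation, in which the rotary base is enlarged when the evaluation length exceeds the training length. The function names, the default base of 10000, and the head dimension of 64 are assumptions for illustration.

```python
def ntk_scaled_base(base, train_len, seq_len, head_dim):
    """Enlarge the RoPE base when seq_len exceeds train_len (assumed formulation)."""
    # No scaling while the sequence still fits in the training context.
    scale = max(1.0, seq_len / train_len)
    # The exponent d / (d - 2) makes the lowest rotary frequency shrink by exactly `scale`.
    return base * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(base, head_dim):
    """Inverse frequencies for each rotary pair: base^(-2i/d), i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# Hypothetical values: train at 1024 tokens, evaluate at 2048.
train_base = 10000.0
eval_base = ntk_scaled_base(train_base, train_len=1024, seq_len=2048, head_dim=64)
train_freqs = rope_inv_freq(train_base, 64)
eval_freqs = rope_inv_freq(eval_base, 64)
```

Under this formulation the highest frequency is left unchanged and the lowest is divided by the length ratio, so short-range positional detail is preserved while long-range positions are interpolated.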
…029 BPB, p=0.0006)
@0hq, looks moltbot-ty to me. Just look at the issues opened by it:
#138 is me lolol
matt-wright86
added a commit
to matt-wright86/parameter-golf
that referenced
this pull request
Mar 21, 2026
… + late QAT scaffold

Based on merged SOTA openai#60 (1.1748 BPB) with phased additions:
- Phase A: Int6-range export + selective fp16 passthrough (QUANT_BITS, COMPRESSOR)
- Phase B: MLP 3x (MLP_MULT=3)
- Phase C: Late QAT scaffold (QAT_ENABLED=0 default, activates at 75% with LR drop)
- Phase D: EMA scaffold (EMA_ENABLED=0 default, decay=0.997)

All features gated behind env vars, defaults match openai#60 behavior. Includes RUNBOOK.md with exact H100 run commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
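Phase D in the commit message refers to an EMA scaffold with decay=0.997. The referenced commit's code is not shown here, so this is only a sketch of a standard exponential moving average over weights. The function name and the plain-list representation of parameters are assumptions for illustration.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Hypothetical usage: the average drifts slowly toward the live weights.
ema = [0.0, 0.0]
live = [1.0, -2.0]
for _ in range(1000):
    ema = ema_update(ema, live)
```

With decay=0.997 each step moves the average 0.3% of the way toward the current weights, so the averaging window is on the order of a few hundred steps.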
leonardcser
added a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 21, 2026
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init). Weight decay 0.04 regularizes weights for better generalization and compressibility. Orthogonal init accelerates early convergence. Grad clip 0.3 stabilizes training. val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
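The commit message above combines weight decay 0.04 with gradient clipping at 0.3. How the referenced commit wires these into Muon is not shown, so this is a minimal sketch assuming a decoupled decay (weights shrunk directly, scaled by the learning rate) and clipping by global L2 norm. Both function names and the flat-list parameters are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm=0.3):
    """Rescale grads so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within the limit
    scale = max_norm / norm
    return [g * scale for g in grads]


def decoupled_weight_decay(weights, lr, weight_decay=0.04):
    """Shrink weights toward zero independently of the gradient (assumed decoupled form)."""
    return [w * (1.0 - lr * weight_decay) for w in weights]


# Hypothetical usage on a two-parameter model.
clipped = clip_by_global_norm([3.0, 4.0])       # norm 5.0 is rescaled to 0.3
decayed = decoupled_weight_decay([1.0], lr=0.1)
```

Shrinking weights toward zero narrows their value range, which fits the commit's note that the compressed artifact got smaller.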
leonardcser
added a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 21, 2026
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride. Muon weight decay 0.04 (credit @notapplica PR openai#60). Orthogonal init with muP scaling (credit @raahilshah PR openai#162). Gradient clipping at 0.3. int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
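The sliding window evaluation with configurable stride mentioned above can be sketched as a window schedule: each window spans up to the full context, advances by the stride, and only the tokens not already scored by an earlier window count toward the loss. This is a sketch of the general technique, not the code from the credited PR. The function name and tuple layout are assumptions.

```python
def sliding_windows(n_tokens, ctx_len, stride):
    """Return (begin, end, n_scored) windows covering n_tokens exactly once."""
    if stride <= 0 or ctx_len <= 0:
        raise ValueError("ctx_len and stride must be positive")
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx_len, n_tokens)
        # Only tokens past the previous window's end are newly scored;
        # the rest of the window serves purely as context.
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


# Hypothetical toy sizes: 10 tokens, context of 4, stride of 2.
schedule = sliding_windows(10, ctx_len=4, stride=2)
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes, which is why the stride is worth exposing as a setting.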
scottspace
pushed a commit
to scottspace/parameter-golf
that referenced
this pull request
Mar 21, 2026
…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

  Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)
* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)
* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)
* Update submission: 1.2000 BPB - FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone
* Update: 1.1748 BPB - Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser
pushed a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 21, 2026
…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

  Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)
* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)
* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)
* Update submission: 1.2000 BPB - FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone
* Update: 1.1748 BPB - Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser
added a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 22, 2026
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init). Weight decay 0.04 regularizes weights for better generalization and compressibility. Orthogonal init accelerates early convergence. Grad clip 0.3 stabilizes training. val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser
added a commit
to leonardcser/parameter-golf
that referenced
this pull request
Mar 22, 2026
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride. Muon weight decay 0.04 (credit @notapplica PR openai#60). Orthogonal init with muP scaling (credit @raahilshah PR openai#162). Gradient clipping at 0.3. int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nedcut
pushed a commit
to nedcut/parameter-golf
that referenced
this pull request
Mar 26, 2026
…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

  Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)
* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)
* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)
* Update submission: 1.2000 BPB - FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone
* Update: 1.1748 BPB - Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer
pushed a commit
to resouer/parameter-golf
that referenced
this pull request
Mar 29, 2026
Summary
Mean val_bpb: 1.1748 (3 seeds, p<0.001)
Stacks 6 orthogonal improvements over the baseline:
Artifact: ~14.7 MB (under 16 MB limit)
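For readers unfamiliar with the val_bpb metric in the summary, the sketch below shows the conventional conversion from summed cross-entropy in nats to bits per byte. It assumes that definition. The repository's exact byte accounting is not shown in this PR, and the function name is hypothetical.

```python
import math


def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed token cross-entropy (nats) to bits per byte of raw text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes by text size.
    return total_loss_nats / (math.log(2) * total_bytes)


# Hypothetical usage: a model spending exactly one bit on each of 8 bytes.
example = bits_per_byte(8 * math.log(2), 8)
```

Normalizing by bytes rather than tokens keeps scores comparable across tokenizers with different vocabulary sizes.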