Record: Sliding Window + FP16 Embed + 10L + Muon WD + Overtone Init (val_bpb=1.1748)#60

Merged
0hq merged 6 commits into openai:main from notapplica:submission/ntk-eval-overtone-init
Mar 19, 2026

@notapplica notapplica commented Mar 19, 2026

Summary

Mean val_bpb: 1.1748 (3 seeds, p<0.001)

Stacks 6 orthogonal improvements over the baseline:

  1. Sliding window evaluation (stride=64, seq_len=1024): every token is scored with 960+ tokens of context
  2. FP16 tied embedding export: skip int8 quantization for tok_emb, since its quantization errors compound through both the input and output paths
  3. 10 transformer layers (up from 9): Muon weight decay shrinks the compressed artifact enough to fit the extra layer
  4. Decoupled weight decay for the Muon optimizer (0.02): Muon has no built-in regularization; adding p.mul_(1 - wd * lr) improves both generalization and quantization robustness
  5. Overtone spectral embedding init: SVD power-law spectrum shaping
  6. Phase-transition residual mixing: sigmoid-scheduled resid_mix initialization
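
To make item 1 concrete, here is a minimal sketch of how sliding-window scoring with stride=64 and seq_len=1024 can be laid out. The function name `sliding_windows` and the bookkeeping are illustrative assumptions, not the submission's actual evaluation code. Note that tokens in the very first window necessarily have less than 960 tokens of context.

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Yield (start, end, score_from) for each evaluation window.

    The model sees tokens [start, end); only tokens [score_from, end)
    are counted toward the loss, so no token is scored twice.
    """
    start = 0
    scored_upto = 0  # index of the first token that has not been scored yet
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        yield start, end, scored_upto
        scored_upto = end
        start += stride


windows = list(sliding_windows(n_tokens=5000))

# Every token is scored exactly once, in order.
scored = [t for start, end, score_from in windows for t in range(score_from, end)]
assert scored == list(range(5000))

# After the first window, the first scored token of each window has
# seq_len - stride tokens of preceding context.
min_context = min(score_from - start for start, end, score_from in windows[1:])
print(len(windows), min_context)
```

The trade-off is cost: each window after the first runs a full 1024-token forward pass to score only 64 new tokens, so evaluation does roughly seq_len / stride = 16 times more compute than non-overlapping chunks.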
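| Seed | val_loss | val_bpb | Steps | ms/step |
|------|----------|---------|-------|---------|
| 1337 | 1.9849 | 1.1756 | 10424 | 57.55 |
| 42 | 1.9827 | 1.1742 | 10710 | 56.06 |
| 7 | 1.9830 | 1.1744 | 10498 | 57.18 |
| Mean | 1.9835 | 1.1748 | | |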
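
The per-seed numbers can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the means and multiplies steps by ms/step to get the training wall-clock per seed. The dictionary layout is only for illustration; the values are copied from the table.

```python
runs = {
    1337: {"val_loss": 1.9849, "val_bpb": 1.1756, "steps": 10424, "ms_per_step": 57.55},
    42: {"val_loss": 1.9827, "val_bpb": 1.1742, "steps": 10710, "ms_per_step": 56.06},
    7: {"val_loss": 1.9830, "val_bpb": 1.1744, "steps": 10498, "ms_per_step": 57.18},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


mean_loss = mean(r["val_loss"] for r in runs.values())
mean_bpb = mean(r["val_bpb"] for r in runs.values())

# Training wall-clock per seed, in seconds.
train_seconds = {seed: r["steps"] * r["ms_per_step"] / 1000.0 for seed, r in runs.items()}

print(f"mean val_loss = {mean_loss:.5f}")
print(f"mean val_bpb  = {mean_bpb:.5f}")
for seed, seconds in train_seconds.items():
    print(f"seed {seed}: {seconds:.1f} s")
```

Each seed comes out at roughly 600 s of training. The mean of the rounded val_bpb values is about 1.17473, which differs from the reported 1.1748 only in the fourth decimal; that gap is small enough to be explained by the reported mean being taken over unrounded per-seed values, though that is an assumption.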

Artifact: ~14.7 MB (under 16 MB limit)
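
For intuition on why the embedding is exported in FP16 rather than int8, the following sketch compares the round-trip error of the two formats on one synthetic embedding row. It assumes symmetric int8 quantization with a per-row max-abs scale and Gaussian weights with standard deviation 0.02; both are illustrative assumptions, not the submission's exporter or its real weights.

```python
import random
import struct


def fp16_roundtrip(x):
    # "e" is the IEEE 754 half-precision format in the struct module.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def int8_roundtrip(row):
    # Symmetric quantization: one scale per row, integer codes in [-127, 127].
    scale = max(abs(v) for v in row) / 127.0
    if scale == 0.0:
        return list(row)
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return [c * scale for c in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # hypothetical embedding row

err_fp16 = max(abs(v - fp16_roundtrip(v)) for v in row)
err_int8 = max(abs(v - q) for v, q in zip(row, int8_roundtrip(row)))

print(f"max abs error fp16: {err_fp16:.2e}")
print(f"max abs error int8: {err_int8:.2e}")
```

FP16 costs 2 bytes per weight instead of 1, so keeping tok_emb in FP16 spends artifact budget to avoid the larger int8 error on a tensor that is used twice when embeddings are tied.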

notapplica and others added 6 commits March 18, 2026 23:51
Train@1024 with overtone embedding init and phase-transition residual
mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb
1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
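
The commit message above mentions NTK-aware dynamic RoPE scaling for evaluating at 2048 after training at 1024. A minimal sketch of the commonly used NTK-aware rule follows: the RoPE base is multiplied by scale^(d / (d - 2)), where scale is the ratio of the current sequence length to the training length. The base of 10000 and head_dim of 64 are hypothetical values, and the PR's exact variant is not shown on this page.

```python
def ntk_scaled_base(base, head_dim, seq_len, train_len):
    # "Dynamic": the scale depends on the sequence length actually being evaluated,
    # and never drops below 1 for sequences within the training length.
    scale = max(1.0, seq_len / train_len)
    return base * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(base, head_dim):
    # One inverse frequency per pair of channels.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


head_dim = 64      # hypothetical
base = 10000.0     # hypothetical

inv_train = rope_inv_freq(base, head_dim)
inv_eval = rope_inv_freq(ntk_scaled_base(base, head_dim, 2048, 1024), head_dim)

# Highest frequency is untouched; the lowest is stretched by the length ratio.
print(inv_eval[0] / inv_train[0])
print(inv_train[-1] / inv_eval[-1])
```

The design intent of this rule is that fast-rotating channels, which encode local position, keep their training-time behavior, while the slowest channel is stretched by exactly the length ratio so that longer sequences stay within the rotation range seen in training.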
@notapplica notapplica changed the title from "NTK Eval + Overtone Init (val_bpb=1.2160)" to "Record: Sliding Window + FP16 Embed + 10L + Muon WD + Overtone Init (val_bpb=1.1748)" Mar 19, 2026
@0hq 0hq left a comment

Looks good to me!

@FI-Mihej
notapplica commented Mar 20, 2026

#138 is me lolol
Not moltbot but somewhat automated (i steer) (:
I have one claude working on the challenge and one claude analyzing everything in public

lolrazh added a commit to lolrazh/parameter-golf that referenced this pull request Mar 20, 2026
openai#59: 5-min + TTT, 258 steps, TTT didn't improve undertrained model
openai#60: 10-min no TTT, 515 steps, best prequant 1.4038, sliding eval incomplete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
matt-wright86 added a commit to matt-wright86/parameter-golf that referenced this pull request Mar 21, 2026
… + late QAT scaffold

Based on merged SOTA openai#60 (1.1748 BPB) with phased additions:
- Phase A: Int6-range export + selective fp16 passthrough (QUANT_BITS, COMPRESSOR)
- Phase B: MLP 3x (MLP_MULT=3)
- Phase C: Late QAT scaffold (QAT_ENABLED=0 default, activates at 75% with LR drop)
- Phase D: EMA scaffold (EMA_ENABLED=0 default, decay=0.997)

All features gated behind env vars, defaults match openai#60 behavior.
Includes RUNBOOK.md with exact H100 run commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
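
As an illustration of the env-var gating and the EMA scaffold described in this commit message, here is a minimal sketch. The names `EMA_ENABLED` and the decay of 0.997 come from the message; the helper `env_flag`, the `EMA` class, and the convention that "1" means enabled are assumptions made for the example, not the fork's code.

```python
import os


def env_flag(name, default="0"):
    # Assumed convention: the feature is on only when the variable equals "1".
    return os.environ.get(name, default) == "1"


class EMA:
    """Exponential moving average over a flat list of float weights."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * p for s, p in zip(self.shadow, params)]


weights = [1.0, -2.0]
ema = EMA(weights, decay=0.997) if env_flag("EMA_ENABLED") else None

# Inside the training loop, after each optimizer step:
if ema is not None:
    ema.update(weights)
```

With decay=0.997 the average has an effective horizon of about 1 / (1 - 0.997) ≈ 333 steps. With the variable unset, `ema` stays `None` and the loop behaves as if the feature did not exist, which is what "defaults match openai#60 behavior" requires.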
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Credit: @notapplica PR openai#60 (Muon WD), @raahilshah PR openai#162 (ortho init).
Weight decay 0.04 regularizes weights for better generalization and
compressibility. Orthogonal init accelerates early convergence.
Grad clip 0.3 stabilizes training.

val_bpb 1.2649, compressed 14.7MB (-0.5MB from weight decay).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
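
The Muon weight decay credited here is the decoupled form from PR #60, `p.mul_(1 - wd * lr)`: the weights are shrunk directly instead of adding a penalty to the gradient. The sketch below applies that rule to a plain Python list so it runs without PyTorch. The learning rate of 0.02 and the step count are hypothetical; wd=0.04 is the value quoted in this commit.

```python
def decoupled_weight_decay_(weights, lr, wd):
    """In-place equivalent of p.mul_(1 - wd * lr) for a list of floats."""
    factor = 1.0 - wd * lr
    for i in range(len(weights)):
        weights[i] *= factor
    return weights


lr = 0.02      # hypothetical learning rate
wd = 0.04      # value quoted in the commit message
steps = 1000   # hypothetical

weights = [1.0, -0.5, 0.25]
for _ in range(steps):
    decoupled_weight_decay_(weights, lr, wd)
    # ...the optimizer's own update would be applied here...

# With no other update, each weight decays geometrically.
expected = (1.0 - wd * lr) ** steps
print(weights[0], expected)
```

Because the shrink factor does not pass through the optimizer's update rule, it acts on weight magnitudes directly; smaller, more concentrated weights are what the message credits for the smaller compressed size.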
leonardcser added a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
Sliding window eval (credit @mattqlf PR openai#50) with configurable stride.
Muon weight decay 0.04 (credit @notapplica PR openai#60).
Orthogonal init with muP scaling (credit @raahilshah PR openai#162).
Gradient clipping at 0.3.

int8 roundtrip val_bpb: 1.2653. Sliding window would add ~0.03 on 8xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
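
The message does not say how gradients are clipped at 0.3. A minimal sketch, assuming clipping by global L2 norm (the behavior of `torch.nn.utils.clip_grad_norm_`), written over a flat list of floats so it has no dependencies:

```python
import math


def clip_grad_norm(grads, max_norm=0.3):
    """Rescale grads so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        coef = max_norm / total_norm
        grads = [g * coef for g in grads]
    return grads, total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=0.3)
print(norm_before, clipped)

small, _ = clip_grad_norm([0.1, 0.1], max_norm=0.3)
print(small)
```

Gradients whose norm is already below the threshold pass through unchanged, so the clip only affects the occasional large step rather than the typical update.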
scottspace pushed a commit to scottspace/parameter-golf that referenced this pull request Mar 21, 2026
…val_bpb=1.1748) (openai#60)

* Add NTK Eval + Overtone Init submission (1.2160 BPB)

Train@1024 with overtone embedding init and phase-transition residual
mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb
1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update submission: Muon WD + NTK Eval + Overtone Init (1.2094 BPB, p=0.0002)

* Update submission: 10-Layer + Muon WD + NTK Eval + Overtone Init (1.2029 BPB, p=0.0006)

* Update submission: FP16 Embed + 10L + Muon WD + NTK + Overtone (1.2008 BPB)

* Update submission: 1.2000 BPB — FP16 Embed + 10L + Muon WD + NTK@1408 + Overtone

* Update: 1.1748 BPB — Sliding Window + FP16 Embed + 10L + Muon WD + Overtone

---------

Co-authored-by: notapplica <notapplica@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leonardcser pushed a commit to leonardcser/parameter-golf that referenced this pull request Mar 21, 2026
…val_bpb=1.1748) (openai#60)

nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
…val_bpb=1.1748) (openai#60)

resouer pushed a commit to resouer/parameter-golf that referenced this pull request Mar 29, 2026