
[10min/16MB] AWQ + Cyclic Momentum + ReLU² + 11L Shared — 1.1507 bpb#623

Open
SPThole wants to merge 8 commits into openai:main from SPThole:main

Conversation


@SPThole SPThole commented Mar 24, 2026

val_bpb: 1.1507 ± 0.0016 (3-seed mean) | ~15.4 MB | 8×H100 SXM, 600s

| Seed | val_bpb | Size |
| --- | --- | --- |
| 42 | 1.1502 | 15.45 MB |
| 43 | 1.1494 | 15.52 MB |
| 44 | 1.1526 | 15.43 MB |

Techniques

| Technique | Description |
| --- | --- |
| AWQ | Activation-aware weight scaling (α=0.5) before int5/int6 quantization. Closed 63% of the quantization gap (0.027 → 0.010 bpb). |
| Cyclic Muon momentum | Triangle wave 0.85–0.95 (period = 50 steps) after warmup. Helps escape sharp minima. |
| ReLU² | Sparser MLP activations; works better at small model sizes. |
| 11L shared | 10 unique weight sets, with the last block reused for an 11th layer. Extra depth at no parameter cost. |
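The cyclic momentum schedule can be sketched as a simple triangle wave. This is a minimal sketch based only on the parameters stated above (0.85–0.95, period 50, applied after warmup); the function name, and the choice to hold the high value during warmup, are assumptions, not taken from the PR's code.

```python
def cyclic_momentum(step, lo=0.85, hi=0.95, period=50, warmup=0):
    """Triangle wave: rises lo -> hi over half a period, falls back over the rest."""
    if step < warmup:
        return hi  # assumption: hold the high value until warmup ends
    phase = ((step - warmup) % period) / period  # position in [0, 1)
    tri = 1.0 - abs(2.0 * phase - 1.0)           # 0 -> 1 -> 0 over one period
    return lo + (hi - lo) * tri
```

For example, with the defaults the schedule starts at 0.85, peaks at 0.95 mid-cycle (step 25), and returns to 0.85 at step 50.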

Development

Experimented on 1×H100 (21+ experiments) to iterate quickly, then promoted the winner to 8×H100. Tested MTP, curriculum learning, TTT, GPTQ-lite, layer-aware quantization, EMA, QAT, AdaptiveRMSNorm, partial RoPE, and value embeddings; most were neutral or negative in isolation. AWQ and cyclic momentum were the clear wins beyond the base architecture.

Full ablation table and experiment logs in the README.
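The AWQ idea in the techniques table can be sketched roughly as follows: boost salient input channels by a power of their mean activation magnitude before quantizing, then fold the scale back out at dequantization. This is a minimal NumPy sketch under assumptions (per-output-row symmetric round-to-nearest quantization, a small epsilon for stability); the function name and grouping scheme are illustrative, not the PR's implementation.

```python
import numpy as np

def awq_scale_and_quantize(W, act_mean_abs, alpha=0.5, bits=5):
    """Activation-aware scaling before symmetric round-to-nearest quantization.

    W: (out_features, in_features) weight matrix
    act_mean_abs: (in_features,) mean |activation| per input channel
    """
    s = np.power(act_mean_abs + 1e-8, alpha)             # per-input-channel scale
    Ws = W * s                                           # scale weight columns
    qmax = 2 ** (bits - 1) - 1                           # e.g. 15 for int5
    step = np.abs(Ws).max(axis=1, keepdims=True) / qmax  # per-output-row step size
    q = np.clip(np.round(Ws / step), -qmax, qmax)        # integer codes
    return (q * step) / s                                # dequantize, undo scaling
```

The intuition is that channels with large activations contribute most to the output, so scaling them up before quantization spends more of the integer range on them; the scale is exactly undone after dequantization, so only the rounding error distribution changes.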

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
