
[10min/16MB] AWQ + Cyclic Momentum + ReLU² + 11L Shared — 1.1507 bpb#623

Open
SPThole wants to merge 8 commits into openai:main from SPThole:main

Conversation


@SPThole SPThole commented Mar 24, 2026

val_bpb: 1.1507 ± 0.0016 (3-seed mean) | ~15.4 MB | 8×H100 SXM, 600s

| Seed | val_bpb | Size |
| --- | --- | --- |
| 42 | 1.1502 | 15.45 MB |
| 43 | 1.1494 | 15.52 MB |
| 44 | 1.1526 | 15.43 MB |

Techniques

| Technique | Description |
| --- | --- |
| AWQ | Activation-aware weight scaling (α=0.5) before int5/int6 quantization. Closed 63% of the quantization gap (0.027 → 0.010 bpb). |
| Cyclic Muon momentum | Triangle wave 0.85–0.95 (period = 50 steps) after warmup. Helps escape sharp minima. |
| ReLU² | Sparser MLP activations; works better at small model sizes. |
| 11L shared | 10 unique weight sets, with the last block reused for an 11th layer. Extra depth at no parameter cost. |
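The cyclic momentum schedule can be sketched as a simple triangle wave. This is a minimal sketch based only on the parameters stated above (0.85–0.95, period 50, applied after warmup); the function name, and the choice to hold the high value during warmup, are assumptions, not taken from the PR's code.

```python
def cyclic_momentum(step, lo=0.85, hi=0.95, period=50, warmup=0):
    """Triangle wave: rises lo -> hi over half a period, falls back over the rest."""
    if step < warmup:
        return hi  # assumption: hold the high value until warmup ends
    phase = ((step - warmup) % period) / period  # position in [0, 1)
    tri = 1.0 - abs(2.0 * phase - 1.0)           # 0 -> 1 -> 0 over one period
    return lo + (hi - lo) * tri
```

For example, with the defaults the schedule starts at 0.85, peaks at 0.95 mid-cycle (step 25), and returns to 0.85 at step 50.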

Development

Experimented on 1×H100 (21+ experiments) to iterate quickly, then promoted the winner to 8×H100. Tested MTP, curriculum learning, TTT, GPTQ-lite, layer-aware quantization, EMA, QAT, AdaptiveRMSNorm, partial RoPE, and value embeddings; most were neutral or negative in isolation. AWQ and cyclic momentum were the clear wins beyond the base architecture.

Full ablation table and experiment logs in the README.
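The AWQ idea in the techniques table can be sketched roughly as follows: boost salient input channels by a power of their mean activation magnitude before quantizing, then fold the scale back out at dequantization. This is a minimal NumPy sketch under assumptions (per-output-row symmetric round-to-nearest quantization, a small epsilon for stability); the function name and grouping scheme are illustrative, not the PR's implementation.

```python
import numpy as np

def awq_scale_and_quantize(W, act_mean_abs, alpha=0.5, bits=5):
    """Activation-aware scaling before symmetric round-to-nearest quantization.

    W: (out_features, in_features) weight matrix
    act_mean_abs: (in_features,) mean |activation| per input channel
    """
    s = np.power(act_mean_abs + 1e-8, alpha)             # per-input-channel scale
    Ws = W * s                                           # scale weight columns
    qmax = 2 ** (bits - 1) - 1                           # e.g. 15 for int5
    step = np.abs(Ws).max(axis=1, keepdims=True) / qmax  # per-output-row step size
    q = np.clip(np.round(Ws / step), -qmax, qmax)        # integer codes
    return (q * step) / s                                # dequantize, undo scaling
```

The intuition is that channels with large activations contribute most to the output, so scaling them up before quantization spends more of the integer range on them; the scale is exactly undone after dequantization, so only the rounding error distribution changes.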

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
