Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) #287
Conversation
nice find on ema over swa, that makes a lot of sense for smoother convergence. did you try different decay values or was 0.997 just the first thing that worked
Adds TTT (3-epoch SGD on val data) to jfprincz's openai#287 base (1.1271). TTT is eval-time only so artifact size stays at ~15.5MB. Projected score: ~1.122-1.124. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mohosy
tried 0.999, averaged too slowly and hurt BPB. 0.997 was better so i kept it
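For anyone following along, the EMA-over-SWA swap is a one-line shadow update per optimizer step. decay=0.997 gives an effective averaging horizon of roughly 1/(1-0.997) ≈ 333 steps, vs ~1000 steps for 0.999, which is consistent with the higher decay "averaging too slowly". A minimal NumPy sketch with a toy constant target (function name is illustrative, not the PR's code):

```python
import numpy as np

def ema_update(shadow, params, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * params, once per step
    for name in shadow:
        shadow[name] = decay * shadow[name] + (1.0 - decay) * params[name]
    return shadow

# toy check: the shadow converges toward a constant target
params = {"w": np.ones(4)}
shadow = {"w": np.zeros(4)}
for _ in range(2000):
    shadow = ema_update(shadow, params, decay=0.997)
print(float(shadow["w"][0]))  # ~0.9975 after 2000 steps (= 1 - 0.997**2000)
```

With decay=0.999 the same 2000 steps only reach ~0.86 of the target, illustrating the slower averaging reported above.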
4 proven optimizations from merged leaderboard entries:
- train_seq_len 1024→2048 (both top entries use this)
- SWA every 50 steps, start_frac=0.4 (swept optimal by openai#2 author)
- grad_clip_norm 0.3 (both top entries use this)
- LRs 0.025/0.025/0.035 (from PR openai#287)

BigramHash stays at 2048 to avoid artifact size risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xsa and ema lowkey hurt mine, was getting similar numbers to urs but they boosted mine back up

i took a lot of diff approaches from u tho
- Add leaderboard table: jfprincz 1.1271 is new target; mohosy racing same stack
- Add Reptile meta-TTT finding (PR openai#296): 10x better than naive TTT with SmearGate; error-guided TTT is negative; 13L crossover point identified
- Add SWA checkpoint count finding (PR openai#238): 84 checkpoints reverses quant gap; explains why our WD=1200 SWA showed no effect
- Update jfprincz entry to include PR openai#287 results (1.1271)
- Add meta-lessons 10 and 11
im switching off transformers |
…erparams Aggressive iteration to beat SOTA 1.1428. Changes from SmearGate v5:
- XSA: exclusive self-attention on last 4 layers (zero-param, from PR openai#287)
- EMA decay=0.997 replaces SWA (continuous averaging, from PR openai#287)
- train_seq_len 1024→2048
- grad_clip_norm 0.3
- LRs 0.025/0.025/0.035
- WD 0.04, 11 layers, BigramHash 2048

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
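The "zero-param" claim for XSA follows from its usual reading: exclusive self-attention simply masks the diagonal so a token cannot attend to its own position, adding no weights. A hypothetical single-head NumPy sketch (causal masking and multi-head plumbing omitted; this is one interpretation, not PR openai#287's actual code):

```python
import numpy as np

def xsa_head(q, k, v):
    # exclusive self-attention: softmax attention with the diagonal
    # masked out, so position t never attends to itself (zero params)
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    np.fill_diagonal(scores, -np.inf)             # exclude self
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 8))  # one toy head, 6 positions, dim 8
out = xsa_head(q, k, v)
print(out.shape)  # (6, 8)
```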
PR openai#287 (SOTA val_bpb=1.1271) with Overtone embedding init added. Overtone reshapes tok_emb singular values to power-law decay for better int6 quantization. No KURE, R2, tanh, or TTT — clean minimal delta over proven SOTA. Modal launcher with CLI flags for EMA decay, num_layers, and XSA sweep. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
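To make the Overtone idea concrete: take the SVD of tok_emb, replace the singular value spectrum with a power-law decay, keep the singular vectors, and rescale to the original Frobenius norm. A sketch only; the exponent `alpha` and the norm-preservation choice are my assumptions, not stated in the PR:

```python
import numpy as np

def overtone_init(emb, alpha=1.0):
    # reshape singular values to s_i proportional to i**-alpha,
    # keeping the singular vectors unchanged
    U, s, Vt = np.linalg.svd(emb, full_matrices=False)
    new_s = np.arange(1, s.size + 1, dtype=float) ** (-alpha)
    new_s *= np.linalg.norm(s) / np.linalg.norm(new_s)  # keep overall scale
    return (U * new_s) @ Vt

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(1000, 64))
shaped = overtone_init(tok_emb, alpha=1.0)
s = np.linalg.svd(shaped, compute_uv=False)
print(s[0] / s[9])  # ~10: the spectrum now follows i**-1
```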
Comprehensive analysis of Parameter Golf techniques across 3 rounds of deep research, validated against 100+ PRs from openai/parameter-golf. Covers SOTA stack (PR openai#287, 1.1271 BPB), proven techniques (TTT, XSA, EMA), debunked approaches (LZMA, BTT, Int5), and prioritized action plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combines PR openai#287 (XSA + EMA + Int6 QAT) with PR openai#254 TTT adaptation. Changes: FA2 fallback import, TTT hyperparameters, ttt_adapt function, TTT call before torch.compile in eval section.
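The TTT step is conceptually tiny: before scoring, run a few epochs of SGD on the eval tokens themselves, updating weights only in memory so the saved artifact is untouched. A toy least-squares sketch of the loop's shape (the real run is 3 epochs on val tokens with the LM loss; this `ttt_adapt` is illustrative, not PR openai#254's function):

```python
import numpy as np

def ttt_adapt(w, X, y, epochs=3, lr=0.05):
    # a few full-batch gradient steps on the eval data itself;
    # nothing is written back to disk, so artifact size is unchanged
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
y = X @ rng.normal(size=8)
w0 = np.zeros(8)
w1 = ttt_adapt(w0, X, y)
before = np.mean((X @ w0 - y) ** 2)
after = np.mean((X @ w1 - y) ** 2)
print(after < before)  # True: eval loss drops after adaptation
```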
@jfprincz bet thats good to know, 0.997 seems like the sweet spot then. gonna run with that

@himanalot bro switching off transformers is wild but honestly respect the move, if it hits it hits lol. what are you thinking, like state space or something completely different
12.5MB compressed with 9 layers → room for 10th layer. Top PRs (openai#287, openai#309) use 10-11 layers for better BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
11 layers + 3x MLP — may be tight on 16MB budget. Will test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ded eval context Non-record research submission. Proposes caching K/V pairs across sliding windows to extend effective context from 2K to 50K+ tokens at eval time. Backward-looking, zero artifact cost, rule-compliant. Implementation provided but untested due to compute constraints. Base: PR openai#287 reproduction at 1.1284 BPB.
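The proposal amounts to a bounded FIFO of per-window K/V tensors that eval-time attention can read from, so lookback grows with zero artifact cost. A hypothetical sketch of the bookkeeping only (class name, eviction policy, and sizes are mine; the PR's own implementation is untested):

```python
from collections import deque

import numpy as np

class SlidingKVCache:
    # bounded FIFO of (K, V) chunks saved from past sliding windows
    def __init__(self, max_tokens=50_000):
        self.max_tokens = max_tokens
        self.chunks = deque()
        self.total = 0

    def append(self, k, v):
        self.chunks.append((k, v))
        self.total += len(k)
        while self.total > self.max_tokens:  # evict oldest windows first
            old_k, _ = self.chunks.popleft()
            self.total -= len(old_k)

    def view(self):
        # concatenated K and V for the extended attention context
        return (np.concatenate([k for k, _ in self.chunks]),
                np.concatenate([v for _, v in self.chunks]))

cache = SlidingKVCache(max_tokens=4096)
for _ in range(4):                           # four 2K-token windows
    cache.append(np.zeros((2048, 8)), np.zeros((2048, 8)))
ks, vs = cache.view()
print(len(ks))  # 4096: only the two newest windows survive
```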
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.
Implemented EMA (exponential moving average) weight averaging as alternative to SWA, matching SOTA PR openai#287 technique. ralph_020 with decay=0.997 regressed badly (initial weights 12% of EMA at 700 steps). Testing lower decay next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mohosy
Dropped QAT: 8% throughput penalty kills 600s budget (per PR openai#360). Three novel additions on thwu1 SOTA base (1.1428):
- TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget

Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only).
Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
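For reference, BigramHash/TrigramHash-style features hash each n-gram of token ids into a fixed-size table of learned vectors, so n-gram coverage costs nothing beyond the table itself. A sketch with a random table (hash constant and names are illustrative, not the repo's code):

```python
import numpy as np

def ngram_hash_embed(token_ids, table, n=2):
    # hash each length-n window of token ids into the table, look up its row
    num_buckets, dim = table.shape
    out = np.zeros((len(token_ids), dim))
    for t in range(n - 1, len(token_ids)):
        h = 0
        for i in range(n):
            h = (h * 1000003 + int(token_ids[t - i])) % num_buckets
        out[t] = table[h]
    return out

rng = np.random.default_rng(0)
table = rng.normal(size=(20480, 32))   # TrigramHash(20480, dim=32)-sized
ids = rng.integers(0, 50_000, size=16)
emb = ngram_hash_embed(ids, table, n=3)
print(emb.shape)  # (16, 32)
```

Positions before the first full n-gram get a zero vector; everything else is a single table lookup per position.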
@himanalot lol fair enough, everyone doing the same xsa ema stack does get kinda stale. lmk what you come up with tho im genuinely curious
@mohosy xsa didn't work for me and ema provided little benefit. ssm didn't work either (not worth taking the hit to avg step time). tbh i was prob wrong in what i said. not sure why everyone's getting benefits with xsa and i'm seeing a 0.006 hit tho.
New addition: EMA (decay=0.9999) shadow model, eval uses EMA weights. EMA coexists with SWA. Zero artifact cost. Consistent with PR openai#338 (best open PR, 1.1254 bpb) which also uses EMA.
11th layer ruled out: needs ~0.91MB, only ~0.36MB budget available.
Full stack on thwu1 base (1.1428):
- TrigramHash(20480, dim=32): trigram embeddings, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers (PR openai#287)
- EMA: decay=0.9999, shadow model used at final eval
- TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget

Artifact: ~15.64MB. H100 validation pending.
T4 ablation (1000 steps, 4 variants):
- V2 bigram=10240, no trigram: 5.4379 loss (WINNER)
- V4 bigram=8192 + trigram=8192: 5.6956 loss
- V3 bigram=4096 + trigram=20480: 5.7924 loss (was our submission)
- V1 bigram=4096, no trigram: 5.8414 loss

TrigramHash adds noise; bigram reduction actively hurts. Restored bigram=10240. Stack is now: XSA + EMA + TTT on thwu1 base. These are proven techniques (XSA from PR openai#287, EMA+TTT from PR openai#338 lineage) applied cleanly on the openai#1 submission.
…nt6-mlp3x-wd04-1.1271 Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
- matrix_lr: 0.04→0.025, scalar_lr: 0.04→0.025 (from PR openai#287, openai#309)
- tied_embed_lr: 0.05→0.035
- muon_momentum: 0.95→0.97, warmup_start: 0.85→0.90
- muon_momentum_warmup_steps: 500→800
- warmdown_iters: 1200→2000 (longer cooldown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
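The momentum numbers imply a warmup from 0.90 to 0.97 over 800 steps; whether the repo interpolates linearly is not stated, so linear is an assumption in this sketch:

```python
def muon_momentum_at(step, start=0.90, end=0.97, warmup_steps=800):
    # linear momentum warmup, held constant at `end` after warmup_steps
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

for s in (0, 800, 2000):
    # 0.90 at step 0, 0.97 from step 800 onward
    print(s, muon_momentum_at(s))
```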
…816 (val_bpb 1.1116)
3-seed mean: 1.1116 ± 0.0005. Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118
Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky) + coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112) + fullgraph=True torch.compile
Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
val_bpb: 1.1271 (sliding window, stride=64) | 15.5 MB | 8xH100 SXM, 600s
Progress from prior submissions
Two new techniques on top of PR #198's 11-layer stack: XSA on the last four layers and EMA weight averaging (decay=0.997).
Key additions over PR #198
Everything else from PR #198 carries forward: 11 layers, OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
Results
Reproducibility (3 seeds)
Mean: 1.1280 | Variance: 0.0015 | Submitted: seed 1337
Run command