Merged
TNELLI-OAI
commented
Mar 12, 2026
- add a missing space in "modelcraft"
- revise somewhat awkward language about the unusual ability
South-33
added a commit
to South-33/parameter-golf
that referenced
this pull request
Mar 19, 2026
- add the PR openai#10 tied-embedding training nuance to project memory so this branch is tracked as training-side plus export-side precision handling
- add the Issue openai#43 tokenizer-artifact accounting note so tokenizer work is not under-ranked by an overly strict byte model
- extend ideas.md research memory with the PR openai#1-openai#35 audit and issue audit so future research passes do not repeat low-signal early PR review
- update the ranked backlog wording to reflect the stronger tokenizer and tied-embedding evidence
rsavitt
pushed a commit
to rsavitt/parameter-golf
that referenced
this pull request
Mar 19, 2026
Proven by Hive leaderboard (openai#1 score). Converges slowly but pays off with 10K+ steps available on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka
referenced
this pull request
in MatoTeziTanka/parameter-golf
Mar 19, 2026
Stack of four published techniques: EMA + seq2048 + FP16 embedding passthrough + sliding window eval (stride=64). Beats current leader (1.1925) by 0.0036 BPB. Built with PROTEUS by LightSpeedUp — lightspeedup.com Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unnir
added a commit
to unnir/parameter-golf
that referenced
this pull request
Mar 19, 2026
MatoTeziTanka
referenced
this pull request
in MatoTeziTanka/parameter-golf
Mar 19, 2026
Previous run exceeded 16MB cap (FP16 embedding + full MLP = 16.15MB). Fixed by shrinking MLP hidden from 1024 to 992. Artifact now 15,878,735 bytes (99.2% of cap). Score: 1.18956858 BPB — still #1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pleasedontddosme
added a commit
to pleasedontddosme/parameter-golf
that referenced
this pull request
Mar 20, 2026
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):
- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
timowhite88
added a commit
to timowhite88/parameter-golf
that referenced
this pull request
Mar 20, 2026
Add second run log with aggressive TTT settings that beats previous openai#1 mean. Both conservative and aggressive run logs included for reproducibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
timowhite88
added a commit
to timowhite88/parameter-golf
that referenced
this pull request
Mar 20, 2026
…6 BPB) Include both conservative (1.1767) and aggressive (1.1744) run results. Best single run beats current openai#1 mean (1.17475). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yesbhautik
added a commit
to yesbhautik/parameter-golf
that referenced
this pull request
Mar 20, 2026
…ing val-only openai#1 behavior. This adds a reusable Modal launcher and updates standard submission artifacts/logs to reflect the new best quantized result (val_bpb 1.14649233).
unixmadtoonslab
pushed a commit
to unixmadtoonslab/parameter-golf
that referenced
this pull request
Mar 20, 2026
Key improvements over baseline:
- Delayed QAT: STE fake-quantization only in the last 15% of training time, allowing the model to train at full precision before adapting to quantization
- Symmetric int6 clip range [-31, 31] instead of asymmetric [-32, 31]
- Wider MLP (3x), tuned LR=0.025, momentum=0.99 with 1500-step warmup
- Sliding window eval with stride=64 for better BPB measurement
- FP16 embedding passthrough (tok_emb kept unquantized)

3-seed validation (seeds 1337, 42, 7): 1.15924, 1.15980, 1.16066 → mean 1.15990 BPB. Beats current openai#1 (PR openai#88) at 1.1605 BPB.
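The symmetric int6 fake-quantization named above can be sketched as follows. This is a minimal illustration, not the submission's code: the per-row scale derivation and the STE backward behavior described in the comments are assumptions based on standard QAT practice.

```python
def fake_quant_int6_symmetric(row, clip=31):
    """Quantize-dequantize one weight row to symmetric int6.

    A symmetric range [-31, 31] keeps zero exactly representable and
    avoids the asymmetry of [-32, 31]. During QAT, the backward pass
    would use a straight-through estimator (STE): gradients flow
    through this op as if it were the identity.
    """
    max_abs = max(abs(v) for v in row) or 1.0
    scale = max_abs / clip                       # per-row scale factor
    q = [max(-clip, min(clip, round(v / scale))) for v in row]  # int6 codes
    return [qi * scale for qi in q]              # dequantized weights

weights = [0.5, -0.31, 0.02, -0.5]
dq = fake_quant_int6_symmetric(weights)
# each dequantized value is within half a quantization step of the original
```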
integrate-your-mind
pushed a commit
to integrate-your-mind/parameter-golf
that referenced
this pull request
Mar 20, 2026
- train_gpt.py: ADAM_WEIGHT_DECAY env var (AdamW when >0), FP16_EMBED flag
- RESEARCH_NOTES.md: full analysis of all open PRs, technique taxonomy, strategy to beat new openai#1 (1.1318 BPB from PR openai#198)
- Key finding: Int6+zstd, SmearGate, BigramHash, SWA, MuonWD are essential
- Our TTT LoRA is a unique advantage not used by any top-5 submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka
referenced
this pull request
in MatoTeziTanka/parameter-golf
Mar 20, 2026
Stack of four published techniques: EMA + seq2048 + FP16 embedding passthrough + sliding window eval (stride=64). Beats current leader (1.1925) by 0.0036 BPB. Built with PROTEUS by LightSpeedUp — lightspeedup.com Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka
referenced
this pull request
in MatoTeziTanka/parameter-golf
Mar 20, 2026
Previous run exceeded 16MB cap (FP16 embedding + full MLP = 16.15MB). Fixed by shrinking MLP hidden from 1024 to 992. Artifact now 15,878,735 bytes (99.2% of cap). Score: 1.18956858 BPB — still #1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kingjulio8238
added a commit
to kingjulio8238/parameter-golf
that referenced
this pull request
Mar 20, 2026
- MLP_MULT=3 support (wider MLP, -0.013 BPB)
- Int6 per-row quantization (QUANT_BITS=6, saves ~4MB)
- FP16 tied embedding passthrough (FP16_EMBED=1)
- Sliding window eval with compiled NTK-RoPE (EVAL_STRIDE, EVAL_SEQ_LEN)
- Muon decoupled weight decay (MUON_WEIGHT_DECAY)
- Overtone spectral embedding init (OVERTONE_INIT)
- Phase-transition resid_mix init (PHASE_RESID_MIX)
- Extra eval loops support (EVAL_NUM_LOOPS)
- Multi-eval mode (EVAL_CONFIGS for testing multiple configs per run)
- VAL_MAX_TOKENS for fast directional experiments
- Compiled forward for eval (compiled_forward)

Validated on Mac: near-zero quant gap (0.0002 BPB) with FP16 embed + Muon WD. All leaderboard openai#1 techniques implemented and tested. Depth recurrence explored and rejected (int6 quant gap too large). 1260 lines, under the 1500 limit. All new features default-disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HaardhikK
added a commit
to HaardhikK/parameter-golf
that referenced
this pull request
Mar 20, 2026
- master_plan.md: Phase 7 added (leaderboard openai#1 techniques: Muon WD, FP16 tied embedding export, sliding window eval, overtone spectral init, phase-transition residual mixing)
- info_runpod.md: replace <your-fork> placeholder URLs with HaardhikK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0xjaishy
pushed a commit
to 0xjaishy/parameter-golf
that referenced
this pull request
Mar 20, 2026
Rebuild from the proven openai#1 submission (PR openai#198, 1.1326 BPB) and stack four untried improvements:
- RoPE base 50K (smoother position interpolation at seq2048)
- LAWA-EMA replacing periodic SWA (continuous exponential moving average)
- Context-length curriculum (seq1024 early for 60% more steps, seq2048 late)
- Full-model SGD test-time training (1 epoch, lr=3e-4, on val data)

Architecture: 11L 512d MLP3x SmearGate BigramHash OrthoInit WD=0.04
Artifact: ~15.7MB (int6+zstd-22), 26.8M params, FA3 with SDPA fallback
Pending 8xH100 run. Target: sub-1.13 BPB.
Made-with: Cursor
unnir
added a commit
to unnir/parameter-golf
that referenced
this pull request
Mar 20, 2026
Near-SOTA result. Key finding: SWA every 120 steps (13 checkpoints) outperforms standard SWA/200 (7 checkpoints) by ~0.001 BPB. 11L int6, FA3, OrthoInit, SmearGate+BigramHash(2048), NTK RoPE. 15.9MB artifact. Gap to leaderboard openai#1: 0.001. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
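The SWA variant described in this commit, averaging a snapshot of the weights every N steps, can be sketched as below. This is illustrative only: the incremental-mean update and the snapshot condition are assumptions based on standard stochastic weight averaging, not this submission's code.

```python
class SWA:
    """Stochastic weight averaging: keep a running mean of checkpoints.

    Snapshot every `interval` steps; per the commit above, interval=120
    over the same span yields 13 checkpoints vs 7 at interval=200,
    worth ~0.001 BPB.
    """
    def __init__(self, interval=120):
        self.interval = interval
        self.n = 0          # number of checkpoints averaged so far
        self.avg = None     # running mean of the (flattened) weights

    def maybe_update(self, step, weights):
        if step % self.interval != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]

swa = SWA(interval=120)
for step in range(120, 481, 120):      # 4 snapshots: steps 120, 240, 360, 480
    swa.maybe_update(step, [float(step)])
# running mean of 120, 240, 360, 480 -> 300.0
```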
haikosys
pushed a commit
to haikosys/parameter-golf
that referenced
this pull request
Mar 20, 2026
…ash SWA

Early submission requesting compute for 8xH100 validation runs. Builds on current SOTA openai#1 (1.1428 BPB) by adding eval-only improvements:
- LoRA TTT: rank-8 per-document adaptation (Q/V + LM head)
- Sliding window stride=32 (from 64): 2x context overlap

Base architecture unchanged: 10L MLP3x, BigramHash(10240), SmearGate, SWA, Int5 MLP + Int6 attention, seq2048, Muon 0.99, zstd-22.

Local validation (RTX 5090, competition-equivalent steps):
- Beating old SOTA by 0.035 BPB at step 4000 (val_bpb 1.2833 vs 1.3185)
- Gap widening through training

Projected: ~1.136 BPB (pending 8xH100 3-seed validation). Requesting compute credits to run 3 seeds for statistical significance.
keshav55
added a commit
to keshav55/parameter-golf
that referenced
this pull request
Mar 20, 2026
Novel techniques from the top 2 leaderboard entries:

1. BigramHash (BIGRAM_BUCKETS=4096, BIGRAM_DIM=128):
   - Hash consecutive token pairs → embedding lookup → project to model_dim
   - XOR with coprime multipliers for the hash function
   - Captures local bigram context (~524K params for 4096 buckets)
   - Used by openai#1 (thwu1, 1.1428 BPB) and openai#2 (Raahil Shah, 1.1458 BPB)

2. SmearGate (SMEAR_GATE=1):
   - Learned per-dim gate blending the current token with the previous token
   - Applied after embedding normalization
   - Only ~512 params
   - Used by openai#2 and openai#4

Both are env-var controlled (0=disabled by default). run_v7_full.sh enables everything for the full stack. Also fixed: BigramHash/SmearGate params added to optimizer groups. 1438 lines (62 under the 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
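The BigramHash bucketing step described above can be sketched as follows. The commit only states "XOR with coprime multipliers"; the specific multiplier constants here are illustrative assumptions, not values from the submission.

```python
def bigram_bucket(prev_tok, cur_tok, n_buckets=4096):
    """Map a consecutive token pair to a hash bucket.

    Each token id is multiplied by an odd constant and the two products
    are XORed, scattering distinct pairs across buckets. The bucket then
    indexes a small learned embedding table (BIGRAM_DIM=128) that is
    projected up to model_dim.
    """
    # multipliers are illustrative placeholders, not from the submission
    h = (prev_tok * 2654435761) ^ (cur_tok * 40503)
    return h % n_buckets

tokens = [17, 93, 4012, 93, 17]
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
# four consecutive pairs -> four bucket indices, each in [0, 4096)
```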
abaybektursun
added a commit
to abaybektursun/parameter-golf
that referenced
this pull request
Mar 24, 2026
Case study: reordering training shards by model difficulty (hardest first) gives a -0.0033 BPB improvement over sequential ordering. Zero architecture changes, zero compute cost, ten lines of code.

Key finding: token-level statistics (KL divergence) find a 0.0009 range across shards. Model perplexity finds a 0.0475 range -- 100x more variation. The two metrics are uncorrelated (r = -0.056).

3-seed validated on PR openai#549 (merged openai#1):
- Seed 1337: 1.1217 -> 1.1183 (-0.0034)
- Seed 42: 1.1222 -> 1.1181 (-0.0041)
- Seed 2025: 1.1221 -> 1.1198 (-0.0023)
- Mean: 1.1220 -> 1.1187 (-0.0033)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
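The reordering trick above reduces to scoring each shard with the model and sorting descending. A minimal sketch; the `score_fn` interface and shard names are hypothetical stand-ins, not the case study's actual code.

```python
def reorder_shards_hardest_first(shards, score_fn):
    """Return training shards sorted by model difficulty, hardest first.

    score_fn(shard) is assumed to return the model's mean loss or
    perplexity on the shard. Per the commit, perplexity varies ~100x
    more across shards than token-level KL statistics do, so it is the
    useful difficulty signal.
    """
    scored = [(score_fn(s), s) for s in shards]
    scored.sort(key=lambda t: t[0], reverse=True)   # highest loss first
    return [s for _, s in scored]

# toy example: per-shard perplexities stand in for measured difficulty
shards = ["shard_a", "shard_b", "shard_c"]
fake_ppl = {"shard_a": 1.31, "shard_b": 1.36, "shard_c": 1.29}
order = reorder_shards_hardest_first(shards, fake_ppl.get)
# -> ["shard_b", "shard_a", "shard_c"]
```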
DbBested
pushed a commit
to DbBested/parameter-golf
that referenced
this pull request
Mar 24, 2026
Built on merged SOTA openai#1 (signalrush, 1.1228 BPB). Contributions: Full GPTQ, VRL, XSA-11, QAT-Export alignment, LeakyReLU², score-first TTT with T=0.98 temperature calibration. BPB numbers to be filled after RunPod validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gthgomez
added a commit
to gthgomez/parameter-golf
that referenced
this pull request
Mar 25, 2026
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local ablation runs.

Features implemented and validated:
- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Pending: full 11L competition run on 8xH100 with seq_len=2048

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
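An int6 clip-search quantizer of the kind listed above can be sketched as below. This is a hedged illustration: the candidate grid over fractions of max|w| and the MSE objective are assumptions about how such a clip search is typically done, not this repo's implementation.

```python
def clip_search_int6(row, levels=31, n_candidates=20):
    """Find the clip value that minimizes int6 round-trip MSE for one row.

    Always clipping at max|w| wastes quantization levels on outliers;
    instead, try several clip fractions of max|w| and keep the one whose
    quantize-dequantize error is smallest.
    """
    max_abs = max(abs(v) for v in row) or 1.0
    best_mse, best_clip = float("inf"), max_abs
    for i in range(1, n_candidates + 1):
        clip = max_abs * i / n_candidates            # candidate clip value
        scale = clip / levels
        dq = [max(-levels, min(levels, round(v / scale))) * scale for v in row]
        mse = sum((a - b) ** 2 for a, b in zip(row, dq)) / len(row)
        if mse < best_mse:
            best_mse, best_clip = mse, clip
    return best_clip

row = [0.02, -0.01, 0.015, 0.9]   # one outlier dominates max|w|
clip = clip_search_int6(row)
# the chosen clip is positive and never exceeds max|w|
```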
gthgomez
added a commit
to gthgomez/parameter-golf
that referenced
this pull request
Mar 25, 2026
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local ablation runs.

Features implemented and validated:
- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition-scale, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Not a leaderboard attempt. Pending: full 11L competition run on 8xH100.
RoyiRa
added a commit
to RoyiRa/parameter-golf
that referenced
this pull request
Mar 25, 2026
v11 BigramHash+SmearGate tracks ~0.006 worse than v4 at every checkpoint. Likely redundant with seq_len=2048 context. Killed early. Started v12: 10L + 3xMLP + int6+zstd + late QAT + warmdown=3000. This targets the "single largest contributor" (3xMLP) from the openai#1 submission. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa
added a commit
to RoyiRa/parameter-golf
that referenced
this pull request
Mar 25, 2026
10L + 3xMLP + int6+zstd + late QAT + WD=3000:
- val_bpb=1.1525 (sliding), artifact 14.1MB
- Late QAT cut int6 damage from 0.034 to 0.002 bpb!
- 3xMLP is the biggest contributor (24.1M params, fits with int6)
- Gap to openai#1 (1.1428): only 0.010 bpb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa
added a commit
to RoyiRa/parameter-golf
that referenced
this pull request
Mar 25, 2026
11L + 3xMLP + int6+zstd + late QAT:
- val_bpb=1.1440 (sliding), artifact 15.6MB
- Only 0.0012 behind openai#1 on the leaderboard
- Within seed variance of the target
- 11th layer added 0.009 bpb per step over 10L

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa
added a commit
to RoyiRa/parameter-golf
that referenced
this pull request
Mar 25, 2026
11L + 3xMLP + int6+zstd + late QAT + SWA(24 checkpoints) + seed=42:
- val_bpb=1.1403 (sliding window), artifact 15.1MB
- BEATS the merged openai#1 score of 1.1428 by 0.0025 bpb!
- SWA added a -0.004 bpb improvement over v13 (1.1440 -> 1.1403)
- Late QAT: only 0.002 bpb int6 quantization damage
- 9205 steps @ 521ms on 1xH100 (80 min equivalent)

Technique stack:
- 11 transformer layers, 512 dim, 8 heads, 4 KV heads (GQA)
- 3x MLP expansion (hidden=1536), relu^2
- seq_len=2048, Muon momentum=0.99, WD=0.04, grad_clip=0.3
- Int6 mixed quantization + zstd-22 compression
- Late QAT (STE int6 at lr_scale < 0.1)
- SWA every 50 steps in last 40% of warmdown
- Sliding window evaluation (stride=64)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
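Sliding-window evaluation as in the stack above scores every token with near-full left context by re-running overlapping windows and keeping only the last `stride` positions of each. A minimal sketch: `nll_fn` and the 1-byte-per-token accounting are hypothetical stand-ins for the real model and byte counts.

```python
import math

def sliding_window_bpb(tokens, nll_fn, seq_len=8, stride=4):
    """Bits-per-byte via overlapping evaluation windows.

    Each window spans up to seq_len tokens but only its final `stride`
    positions are scored, so every scored token sees up to
    seq_len - stride tokens of extra left context. With seq_len=2048
    and stride=64 this costs roughly 32x the forward passes of
    non-overlapping eval.
    """
    total_nll = 0.0     # summed negative log-likelihood, in nats
    n_scored = 0
    pos = 0
    while pos < len(tokens):
        start = max(0, pos + stride - seq_len)
        window = tokens[start:pos + stride]
        keep = min(stride, len(tokens) - pos)        # score only the tail
        total_nll += sum(nll_fn(window, i)
                         for i in range(len(window) - keep, len(window)))
        n_scored += keep
        pos += stride
    n_bytes = n_scored   # toy assumption: 1 byte per token
    return total_nll / math.log(2) / n_bytes         # nats -> bits, per byte

# toy "model": constant 2-nat NLL per token regardless of context
bpb = sliding_window_bpb(list(range(10)), lambda w, i: 2.0)
# 2 nats/token corresponds to 2/ln(2) bits per byte
```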
gravelBridge
added a commit
to gravelBridge/parameter-golf
that referenced
this pull request
Mar 25, 2026
Match the openai#1 record's optimizer settings:
- MATRIX_LR/SCALAR_LR: 0.015 -> 0.025
- MUON_MOMENTUM: 0.95 -> 0.99
- MUON_MOMENTUM_WARMUP_START: 0.85 -> 0.92
- MUON_MOMENTUM_WARMUP_STEPS: 500 -> 1500
- WARMDOWN_ITERS: 1200 -> 3500
newjordan
pushed a commit
to newjordan/parameter-golf-1
that referenced
this pull request
Mar 26, 2026
openai#1 untried combination from competition commentary: TTT (from openai#254) + XSA (from openai#265) = estimated 1.117-1.121 BPB XSA_LAST_N=3 excludes self-attention in final 3 layers. Zero extra params, frees attention capacity for cross-token focus. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>