Closed
newjordan referenced this pull request in newjordan/parameter-golf on Mar 20, 2026
Bavadiya leads with 0.9695 BPB (▼0.2549 vs baseline), far ahead of Sam Larson's #2 at 1.1574. The 0.19 BPB gap suggests techniques beyond the known int6+MLP3x+sliding-window formula. https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
pleasedontddosme added a commit to pleasedontddosme/parameter-golf that referenced this pull request on Mar 20, 2026

Combines the best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):
- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding-window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in the last 30% of training for a near-zero quantization penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
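The "QAT with STE" line refers to quantization-aware training with a straight-through estimator: the forward pass sees fake-quantized weights, while the backward pass treats the quantization op as identity so gradients reach the float weights. A minimal numpy sketch of the forward-pass fake quantization (function name and symmetric per-tensor scaling are illustrative assumptions, not the submission's actual code):

```python
import numpy as np

def fake_quant_ste(w, bits=6):
    # Symmetric fake quantization: map floats to 2^bits integer levels,
    # then dequantize back to float. Under QAT, a straight-through
    # estimator makes the backward pass skip this op entirely, so
    # gradients flow to the underlying float weights unchanged.
    qmax = 2 ** (bits - 1) - 1                     # e.g. 31 for int6
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized weights
```

Training on these dequantized weights is what drives the "near-zero quant penalty": the network adapts to the quantization grid before export.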
keshav55 added a commit to keshav55/parameter-golf that referenced this pull request on Mar 20, 2026

Novel techniques from the top 2 leaderboard entries:

1. BigramHash (BIGRAM_BUCKETS=4096, BIGRAM_DIM=128):
   - Hash consecutive token pairs → embedding lookup → project to model_dim
   - XOR with coprime multipliers for the hash function
   - Captures local bigram context (~524K params for 4096 buckets)
   - Used by openai#1 (thwu1, 1.1428 BPB) and openai#2 (Raahil Shah, 1.1458 BPB)

2. SmearGate (SMEAR_GATE=1):
   - Learned per-dim gate blending the current token with the previous token
   - Applied after embedding normalization
   - Only ~512 params
   - Used by openai#2 and openai#4

Both are env-var controlled (0=disabled by default). run_v7_full.sh enables everything for the full stack.

Also fixed: BigramHash/SmearGate params added to optimizer groups. 1438 lines (62 under the 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
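The two pieces above can be sketched in a few lines of numpy. The multiplier constants and the additive form of the smear blend are illustrative assumptions (the commit only says "XOR with coprime multipliers" and "per-dim gate blending"), not the submission's actual values:

```python
import numpy as np

BIGRAM_BUCKETS = 4096
A, B = 1000003, 998244353      # hypothetical coprime multipliers

def bigram_bucket(prev_tok, cur_tok):
    # Mix the two token ids (each scaled by a coprime multiplier) with
    # XOR, then fold into the bucket range. The bucket id indexes a
    # (BIGRAM_BUCKETS, BIGRAM_DIM) embedding table, whose output is
    # projected to model_dim and added to the residual stream.
    return ((prev_tok * A) ^ (cur_tok * B)) % BIGRAM_BUCKETS

def smear_gate(x, gate_logits):
    # x: (T, D) normalized token embeddings; gate_logits: (D,) learned.
    # Each dimension blends in a sigmoid-gated amount of the previous
    # token's embedding (first token has no predecessor, so zeros).
    g = 1.0 / (1.0 + np.exp(-gate_logits))          # per-dim gate in (0, 1)
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])
    return x + g * prev
```

The ~512-param figure for SmearGate matches one gate logit per model dimension; the ~524K figure for BigramHash matches 4096 buckets × 128 dims.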
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request on Mar 21, 2026

4 proven optimizations from merged leaderboard entries:
- train_seq_len 1024→2048 (both top entries use this)
- SWA every 50 steps, start_frac=0.4 (swept optimal by the openai#2 author)
- grad_clip_norm 0.3 (both top entries use this)
- LRs 0.025/0.025/0.035 (from PR openai#287)

BigramHash stays at 2048 to avoid artifact-size risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evnkm added a commit to evnkm/parameter-golf that referenced this pull request on Mar 21, 2026

Rewrote train_gpt_shared.py with the full SOTA stack from the #1 leaderboard submission (10L GPT, BigramHash, SmearGate, mixed int5/int6 quant, SWA, Muon WD=0.04, magnitude pruning, zstd-22, sliding-window eval). Baseline result: val_bpb = 1.1438 (vs SOTA 1.1428) on 8xH100 in 600s.

Added two new ideas on top:
- TrigramHashEmbedding (4096 buckets, 32-dim): captures 3-token local patterns beyond bigram. Adds ~147K params (~60-80KB compressed).
- Progressive QAT (int5/int6 STE fake-quantize): applied from step 0 via CastedLinear.qat_clip to avoid a costly torch.compile recompile.

Experiment openai#2 (trigram + QAT at 70% wallclock) scored 1.1630, worse than baseline because the torch.compile recompile at activation cost ~130s (22% of the 600s budget). Fixed by moving QAT to the start of training.

Other changes:
- run_modal.py: migrated from the deprecated modal.Mount to Image.add_local_dir; fixed sys.exit(0) traceback to raise RuntimeError only on failure.
- research/IDEAS.md: full research log with 11 ranked ideas and an experiment-tracking table.

Next: run openai#3 with QAT-from-start + trigram to test without the recompile penalty, then a per-layer bitwidth search to squeeze more capacity into the 16MB budget.

Made-with: Cursor
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request on Mar 23, 2026
ifrederico added a commit to ifrederico/parameter-golf that referenced this pull request on Mar 25, 2026

Replace the baseline train_gpt.py with the openai#2 leaderboard submission code, stacking all proven techniques: 11L architecture, INT6 QAT with late activation, GPTQ-lite clip search, XSA on the last 4 layers, EMA (0.997), BigramHash(2048), SmearGate, Partial RoPE (16/64), LN Scale, ValueEmbedding, sliding-window eval (stride=64), zstd-22 compression, warmdown=3500, Muon momentum→0.99, WD=0.04.

Add manifest writing for harness integration. Update the harness to make the manifest optional. Refresh first-wave experiments and tests for the new API.
gthgomez added a commit to gthgomez/parameter-golf that referenced this pull request on Mar 25, 2026

records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local ablation runs.

Features implemented and validated:
- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Pending: full 11L competition run on 8xH100 with seq_len=2048

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
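The EMA feature above keeps an exponential moving average of the weights during training and evaluates with the averaged copy. A minimal numpy sketch of one update step (the function name is illustrative; only the decay values come from the commit):

```python
import numpy as np

def ema_update(ema_w, w, decay=0.997):
    # One EMA step: keep `decay` of the running average and mix in
    # (1 - decay) of the current weights. Evaluation/export then uses
    # ema_w instead of w, smoothing over late-training noise.
    return decay * ema_w + (1.0 - decay) * w
```

The effective averaging window is roughly 1/(1-decay) steps, which is why the commit sweeps decay: 0.997 (~330 steps) for competition-length runs, but 0.97 (~33 steps) for the short 200-step local ablations.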
gthgomez added a commit to gthgomez/parameter-golf that referenced this pull request on Mar 25, 2026

records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local ablation runs.

Features implemented and validated:
- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition-scale, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Not a leaderboard attempt. Pending: full 11L competition run on 8xH100.
adityamhn added a commit to adityamhn/parameter-golf that referenced this pull request on Mar 28, 2026

- Warmdown 1200 → 3500 (proven by both our research and the openai#2 leaderboard entry)
- Muon weight decay WD=0.04 (proven at both Tier 1 and Tier 2 scales)
- Adam embedding weight decay WD=0.01 (proven to stack with Muon WD)
- LeakyReLU(0.5) activation (used by the openai#1 leaderboard entry)

Made-with: Cursor
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request on Mar 30, 2026

Replace the single fixed clip percentile (99.99984) with a per-row optimal clip search across 5 percentiles [0.999, 0.9995, 0.9999, 0.99999, 1.0]. Each row uses the percentile giving the minimum reconstruction MSE. Deterministic, zero training cost. Used by the openai#2 submission (est. -0.003 q_gap).
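The per-row clip search above can be sketched directly: for each weight row, try each candidate percentile as the quantization clip, reconstruct, and keep whichever minimizes MSE. A numpy sketch assuming symmetric int6 quantization (the bit width and function name are assumptions; the commit only specifies the percentile list and the per-row MSE criterion):

```python
import numpy as np

PCTS = [0.999, 0.9995, 0.9999, 0.99999, 1.0]

def best_clip_per_row(W, bits=6):
    # For each row, clip at each candidate percentile of |row|,
    # symmetric-quantize against that clip, and keep the percentile
    # with the lowest reconstruction MSE. Deterministic: no training.
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(W)
    for i, row in enumerate(W):
        best_mse, best_rec = np.inf, row
        for p in PCTS:
            clip = np.quantile(np.abs(row), p)
            if clip == 0:
                continue
            scale = clip / qmax
            rec = np.clip(np.round(row / scale), -qmax, qmax) * scale
            mse = np.mean((row - rec) ** 2)
            if mse < best_mse:
                best_mse, best_rec = mse, rec
        out[i] = best_rec
    return out
```

Since p=1.0 (no clipping) is in the candidate set, the search can never do worse than the unclipped baseline on any row; rows with outliers get a tighter clip and finer resolution for the bulk of their weights.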
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Mar 31, 2026

1. RevDEQ: a single shared block iterated 5 times (Constraint openai#1)
2. MLA with Gated Attention: low-rank KV, decoupled RoPE, sigmoid gates (Constraint openai#3)
3. Soft Dense Routing: router + per-expert sigmoid gate on MLP groups (Constraint openai#2)

All constraints satisfied. Model: 9.4M params. Fix DDP unused-param issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Mar 31, 2026

- Replace split-dim experts (dim//E) with full-dim low-rank experts (dim→rank→dim) so every expert sees all dimensions through a rank bottleneck
- MoS routing: pure softmax convex combination (Mixtape paper); removed sigmoid gates (expert_gate_ctp/ntp_logits)
- Added configurable attn_expert_rank / mlp_expert_rank hyperparameters
- Added MoS eval diagnostics: usage/entropy/balance_cv for CTP+NTP
- Updated metrics plot: expert usage shows min/max/mean/median per component (Attn, MLP, MoS CTP, MoS NTP) for scalability
- Updated CLAUDE.md Constraint openai#2 with full-dim + Mixtape clarifications
- Result: val_bpb=1.4094, attn_cv 0.32→0.22, artifact 15.4MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
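The first two bullets combine into one forward pass: every expert is a full-dim low-rank map (dim→rank→dim), and the router produces a softmax convex combination over expert outputs rather than independent sigmoid gates. A numpy sketch under assumed shapes (dimensions, init scale, and function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank, E = 64, 8, 4

# Full-dim low-rank experts: each expert maps dim -> rank -> dim, so
# every expert sees all input dimensions through a rank bottleneck
# (unlike split-dim experts, which each see only dim//E dimensions).
down = rng.normal(0, 0.02, (E, dim, rank))
up = rng.normal(0, 0.02, (E, rank, dim))
router = rng.normal(0, 0.02, (dim, E))

def mos_forward(x):
    # x: (T, dim). The router's softmax yields per-token weights that
    # sum to 1 across experts (Mixtape-style convex combination),
    # replacing the removed per-expert sigmoid gates.
    logits = x @ router                                   # (T, E)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                         # rows sum to 1
    expert_out = np.einsum('td,edr,erD->teD', x, down, up)  # (T, E, dim)
    return np.einsum('te,teD->tD', w, expert_out)         # weighted mix
```

Because the mixing weights form a convex combination, the output scale stays bounded by the individual expert outputs, and the usage/entropy diagnostics mentioned above fall out directly from the per-token weight matrix.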