
Codex/start 10b token export run #2

Closed
0hq wants to merge 30 commits into main from codex/start-10b-token-export-run

Conversation


@0hq 0hq commented Mar 18, 2026

No description provided.

@0hq 0hq closed this Mar 18, 2026
@0hq 0hq deleted the codex/start-10b-token-export-run branch March 18, 2026 16:32
newjordan referenced this pull request in newjordan/parameter-golf Mar 20, 2026
Bavadiya leads with 0.9695 BPB (▼0.2549 vs baseline), far ahead of
Sam Larson's #2 at 1.1574. The 0.19 BPB gap suggests techniques beyond
the known int6+MLP3x+sliding-window formula.

https://claude.ai/code/session_01RtoPPgJGUFS7XfcFCPwYtq
pleasedontddosme added a commit to pleasedontddosme/parameter-golf that referenced this pull request Mar 20, 2026
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):
- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
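The "QAT with STE" line above refers to fake-quantizing weights during the last stretch of training so the network adapts to quantization error before export. A minimal numpy sketch of symmetric int6 fake-quantization (forward pass only; in training, the straight-through estimator passes gradients through this op unchanged; all names and the clip value are illustrative):

```python
import numpy as np

def fake_quant_int6(w, clip):
    """Symmetric int6 fake-quantization: round to 63 levels in [-clip, clip],
    then dequantize. Under QAT, the straight-through estimator (STE) treats
    this op as the identity in the backward pass."""
    qmax = 31                                   # symmetric int6 range: -31..31
    scale = clip / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.array([0.05, -0.12, 0.9, -1.5])
wq = fake_quant_int6(w, clip=1.0)               # -1.5 saturates at the clip
```

In PyTorch the same op is typically written as `w + (fake_quant(w) - w).detach()` so autograd sees the identity.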
keshav55 added a commit to keshav55/parameter-golf that referenced this pull request Mar 20, 2026


Novel techniques from the top 2 leaderboard entries:

1. BigramHash (BIGRAM_BUCKETS=4096, BIGRAM_DIM=128):
   - Hash consecutive token pairs → embedding lookup → project to model_dim
   - XOR with coprime multipliers for hash function
   - Captures local bigram context (~524K params for 4096 buckets)
   - Used by openai#1 (thwu1, 1.1428 BPB) and openai#2 (Raahil Shah, 1.1458 BPB)

2. SmearGate (SMEAR_GATE=1):
   - Learned per-dim gate blending current token with previous token
   - Applied after embedding normalization
   - Only ~512 params
   - Used by openai#2 and openai#4

Both are env-var controlled (0=disabled by default).
run_v7_full.sh enables everything for the full stack.

Also fixed: BigramHash/SmearGate params added to optimizer groups.
1438 lines (62 under 1500 limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
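The two components above can be sketched in a few lines of numpy. The hash constants, MODEL_DIM, and all function names are illustrative (the submissions' exact values are not in the message); BUCKETS and BDIM are the BIGRAM_BUCKETS/BIGRAM_DIM values quoted above:

```python
import numpy as np

BUCKETS, BDIM, MODEL_DIM = 4096, 128, 512
rng = np.random.default_rng(0)
bigram_emb = rng.standard_normal((BUCKETS, BDIM)) * 0.02   # hashed-bigram table
proj = rng.standard_normal((BDIM, MODEL_DIM)) * 0.02       # projection to model_dim

def bigram_bucket(prev_tok, cur_tok):
    # XOR of the token pair after multiplying by coprime constants,
    # folded into the bucket range (constants here are made up).
    return ((prev_tok * 2654435761) ^ (cur_tok * 40503)) % BUCKETS

def bigram_features(tokens):
    """One hashed-bigram embedding per position, projected to model_dim."""
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])    # position 0 gets a dummy predecessor
    return bigram_emb[bigram_bucket(prev, toks)] @ proj

def smear_gate(h, gate_logits):
    """SmearGate: learned per-dim sigmoid gate blending each position with
    the previous one; only model_dim (~512) parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))     # per-dim gate in (0, 1)
    h_prev = np.concatenate([h[:1], h[:-1]])   # shifted copy; row 0 unchanged
    return (1 - g) * h + g * h_prev

h = bigram_features([5, 17, 17, 99])
out = smear_gate(h, np.zeros(MODEL_DIM))
```

At 4096 buckets of 128 dims the table alone is 4096 x 128 = 524,288 parameters, matching the ~524K figure in the message.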
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 21, 2026
4 proven optimizations from merged leaderboard entries:
- train_seq_len 1024→2048 (both top entries use this)
- SWA every 50 steps, start_frac=0.4 (swept optimal by openai#2 author)
- grad_clip_norm 0.3 (both top entries use this)
- LRs 0.025/0.025/0.035 (from PR openai#287)

BigramHash stays at 2048 to avoid artifact size risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evnkm added a commit to evnkm/parameter-golf that referenced this pull request Mar 21, 2026
Rewrote train_gpt_shared.py with full SOTA stack from #1 leaderboard
submission (10L GPT, BigramHash, SmearGate, mixed int5/int6 quant,
SWA, Muon WD=0.04, magnitude pruning, zstd-22, sliding window eval).

Baseline result: val_bpb = 1.1438 (vs SOTA 1.1428) on 8xH100 in 600s.

Added two new ideas on top:
- TrigramHashEmbedding(4096 buckets, 32-dim): captures 3-token local
  patterns beyond bigram. Adds ~147K params (~60-80KB compressed).
- Progressive QAT (int5/int6 STE fake-quantize): applied from step 0
  via CastedLinear.qat_clip to avoid costly torch.compile recompile.

Experiment openai#2 (trigram + QAT at 70% wallclock) scored 1.1630 — worse
than baseline because the torch.compile recompile at activation cost
~130s (22% of 600s budget). Fixed by moving QAT to start of training.

Other changes:
- run_modal.py: migrated from deprecated modal.Mount to Image.add_local_dir,
  fixed sys.exit(0) traceback to raise RuntimeError only on failure.
- research/IDEAS.md: full research log with 11 ranked ideas and
  experiment tracking table.

Next: run openai#3 with QAT-from-start + trigram to test without recompile
penalty, then per-layer bitwidth search to squeeze more capacity into
the 16MB budget.

Made-with: Cursor
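Several entries above mention sliding-window eval with stride=64: each validation token is scored with a long left context, but full-window compute is paid only once per stride. A sketch of the span bookkeeping, assuming a 1024-token window (the window size and function name are illustrative; only the stride comes from the messages):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (start, lo, hi) triples: score tokens [lo, hi) conditioned on
    tokens[start:hi]. Only the last `stride` positions of each window are
    scored, so every token after the first window sees at least
    window - stride tokens of left context."""
    spans = []
    lo = 0
    while lo < n_tokens:
        hi = min(lo + (window if lo == 0 else stride), n_tokens)
        start = max(0, hi - window)
        spans.append((start, lo, hi))
        lo = hi
    return spans

spans = sliding_window_spans(1200)
```

The scored ranges [lo, hi) partition the token stream, so summing per-span losses gives the exact corpus BPB.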
newjordan referenced this pull request in newjordan/parameter-golf Mar 22, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. #2 overall, #1 non-TTT on the leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
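EMA(0.997) here means maintaining an exponential moving average of the weights alongside training and evaluating (and exporting) the averaged copy. A minimal sketch, with the decay value taken from the message:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """After each optimizer step: ema <- decay * ema + (1 - decay) * current.
    Evaluation uses ema_params instead of the raw weights."""
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]

params = {"w": np.array([1.0])}
ema = {"w": np.array([0.0])}
for _ in range(3):
    ema_update(ema, params)   # ema drifts toward params at rate 1 - decay
```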
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
Copy of pr374_safe — EMA(0.997) + Tight SWA + QAT(0.15) + warmdown(3500).
3-seed mean 1.1248, best seed 1.1243. openai#2 overall, openai#1 non-TTT on the leaderboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ifrederico added a commit to ifrederico/parameter-golf that referenced this pull request Mar 25, 2026
Replace baseline train_gpt.py with the openai#2 leaderboard submission code,
stacking all proven techniques: 11L architecture, INT6 QAT with late
activation, GPTQ-lite clip search, XSA on last 4 layers, EMA (0.997),
BigramHash(2048), SmearGate, Partial RoPE (16/64), LN Scale,
ValueEmbedding, sliding-window eval (stride=64), zstd-22 compression,
warmdown=3500, Muon momentum→0.99, WD=0.04.

Add manifest writing for harness integration. Update harness to make
manifest optional. Refresh first-wave experiments and tests for new API.
gthgomez added a commit to gthgomez/parameter-golf that referenced this pull request Mar 25, 2026
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting
proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local
ablation runs. Features implemented and validated:

- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Pending: full 11L competition run on 8xH100 with seq_len=2048

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
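"Muon decoupled weight decay" above means the decay shrinks the weights directly, separate from Muon's momentum/orthogonalization update, rather than being added to the gradient as an L2 term (the AdamW-style formulation). A one-line sketch of the decoupled step (illustrative; the WD value is the MUON_WD=0.04 quoted in these entries):

```python
def decoupled_weight_decay(params, lr, wd=0.04):
    """Applied once per step, after (and independent of) whatever
    gradient-based update the optimizer computes."""
    for k in params:
        params[k] *= (1.0 - lr * wd)

p = {"w": 2.0}
decoupled_weight_decay(p, lr=0.5)   # 2.0 * (1 - 0.5 * 0.04) = 1.96
```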
gthgomez added a commit to gthgomez/parameter-golf that referenced this pull request Mar 25, 2026
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting
proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local
ablation runs. Features implemented and validated:

- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition-scale, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Not a leaderboard attempt. Pending: full 11L competition run on 8xH100.
adityamhn added a commit to adityamhn/parameter-golf that referenced this pull request Mar 28, 2026
- Warmdown 1200 → 3500 (proven by both our research and openai#2 leaderboard entry)
- Muon weight decay WD=0.04 (proven at both Tier 1 and Tier 2 scales)
- Adam embedding weight decay WD=0.01 (proven to stack with Muon WD)
- LeakyReLU(0.5) activation (used by openai#1 leaderboard entry)

Made-with: Cursor
wsylvest added a commit to wsylvest/parameter-golf that referenced this pull request Mar 30, 2026
Replace single fixed clip percentile (99.99984) with per-row optimal
clip search across 5 percentiles [0.999, 0.9995, 0.9999, 0.99999, 1.0].
Each row uses the percentile giving minimum reconstruction MSE.
Deterministic, zero training cost. Used by openai#2 submission (est. -0.003 q_gap).
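The per-row clip search above can be sketched directly: for each weight row, try every candidate percentile as the clip point, quantize, and keep the one with minimum reconstruction MSE. The percentile list is from the commit; the int6 range (qmax=31) and function name are assumptions:

```python
import numpy as np

def per_row_clip_search(W, percentiles=(0.999, 0.9995, 0.9999, 0.99999, 1.0),
                        qmax=31):
    """Per-row optimal clip: each row uses the percentile whose symmetric
    int quantization gives minimum reconstruction MSE. Deterministic,
    needs no training."""
    out = np.empty_like(W)
    for i, row in enumerate(W):
        best_mse, best = np.inf, row
        for p in percentiles:
            clip = np.quantile(np.abs(row), p)
            if clip == 0:
                continue
            scale = clip / qmax
            rq = np.clip(np.round(row / scale), -qmax, qmax) * scale
            mse = np.mean((row - rq) ** 2)
            if mse < best_mse:
                best_mse, best = mse, rq
        out[i] = best
    return out
```

Because p=1.0 (full range) is one of the candidates, the search can never reconstruct worse than a fixed full-range clip; rows with outliers gain the most, since a tighter clip spends the quantization levels on the bulk of the distribution.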
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Mar 31, 2026
1. RevDEQ: single shared block iterated 5 times (Constraint openai#1)
2. MLA with Gated Attention: low-rank KV, decoupled RoPE, sigmoid gates (Constraint openai#3)
3. Soft Dense Routing: router + per-expert sigmoid gate on MLP groups (Constraint openai#2)

All constraints satisfied. Model: 9.4M params. Fix DDP unused param issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Mar 31, 2026
- Replace split-dim experts (dim//E) with full-dim low-rank (dim→rank→dim)
  so every expert sees all dimensions through a rank bottleneck
- MoS routing: pure softmax convex combination (Mixtape paper),
  removed sigmoid gates (expert_gate_ctp/ntp_logits)
- Added configurable attn_expert_rank / mlp_expert_rank hyperparameters
- Added MoS eval diagnostics: usage/entropy/balance_cv for CTP+NTP
- Updated metrics plot: expert usage shows min/max/mean/median per component
  (Attn, MLP, MoS CTP, MoS NTP) for scalability
- Updated CLAUDE.md Constraint openai#2 with full-dim + Mixtape clarifications
- Result: val_bpb=1.4094, attn_cv 0.32→0.22, artifact 15.4MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>