
[WIP] Optimized Muon/Architecture research by @NOPIMPOSSSIBLEWHY#4

Closed
NOPIMPOSSSIBLEWHY wants to merge 1 commit into openai:main from NOPIMPOSSSIBLEWHY:main


Conversation

@NOPIMPOSSSIBLEWHY

Research is starting on local MLX (Mac M3): benchmarking architectures against the 16 MB limit using Muon and muP.

@0hq 0hq marked this pull request as draft March 19, 2026 16:57
@0hq 0hq closed this Mar 19, 2026
keshav55 added a commit to keshav55/parameter-golf that referenced this pull request Mar 20, 2026


Novel techniques from the top 2 leaderboard entries:

1. BigramHash (BIGRAM_BUCKETS=4096, BIGRAM_DIM=128):
   - Hash consecutive token pairs → embedding lookup → project to model_dim
   - XOR with coprime multipliers for hash function
   - Captures local bigram context (~524K params for 4096 buckets)
   - Used by #1 (thwu1, 1.1428 BPB) and #2 (Raahil Shah, 1.1458 BPB)

2. SmearGate (SMEAR_GATE=1):
   - Learned per-dim gate blending current token with previous token
   - Applied after embedding normalization
   - Only ~512 params
   - Used by #2 and #4

Both are env-var controlled (0=disabled by default).
run_v7_full.sh enables everything for the full stack.
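A minimal sketch of default-off env-var gating of this kind. The variable names come from the commit message; the reading scheme and defaults are assumptions, not the PR's code:

```python
import os

# Feature flags read from the environment; 0 (the default) disables each path.
# Variable names are from the commit message; the parsing is an assumption.
BIGRAM_BUCKETS = int(os.environ.get("BIGRAM_BUCKETS", "0"))
BIGRAM_DIM = int(os.environ.get("BIGRAM_DIM", "128"))
SMEAR_GATE = int(os.environ.get("SMEAR_GATE", "0"))

use_bigram_hash = BIGRAM_BUCKETS > 0  # enabled only when a bucket count is set
use_smear_gate = SMEAR_GATE == 1
```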

Also fixed: BigramHash/SmearGate params added to optimizer groups.
1438 lines (62 under the 1500-line limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
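The two techniques above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the PR's implementation: `MODEL_DIM`, the specific hash multipliers, and the function names are invented here; only `BIGRAM_BUCKETS`, `BIGRAM_DIM`, and the overall dataflow (hash pair → embedding lookup → projection; per-dim gate blending with the previous token) come from the commit message.

```python
import numpy as np

# Dimensions from the commit message; MODEL_DIM and the multipliers are
# assumptions (any pair of odd constants works for the XOR hash).
BIGRAM_BUCKETS = 4096
BIGRAM_DIM = 128
MODEL_DIM = 512
MULT_A, MULT_B = 2654435761, 40503  # hypothetical coprime multipliers

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Hash a consecutive token pair into one of BIGRAM_BUCKETS buckets."""
    return ((prev_tok * MULT_A) ^ (cur_tok * MULT_B)) % BIGRAM_BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.normal(0.0, 0.02, (BIGRAM_BUCKETS, BIGRAM_DIM))  # ~524K params
bigram_proj = rng.normal(0.0, 0.02, (BIGRAM_DIM, MODEL_DIM))
smear_gate_logit = np.zeros(MODEL_DIM)  # learned per-dim gate, ~512 params

def augment(token_ids, tok_emb):
    """tok_emb: (seq, MODEL_DIM) token embeddings after normalization."""
    x = tok_emb.copy()
    # SmearGate: per-dim sigmoid gate blending each token with its predecessor.
    g = 1.0 / (1.0 + np.exp(-smear_gate_logit))
    x[1:] = (1.0 - g) * x[1:] + g * x[:-1]
    # BigramHash: look up the hashed (prev, cur) pair and project to MODEL_DIM.
    buckets = [bigram_bucket(token_ids[i - 1], token_ids[i])
               for i in range(1, len(token_ids))]
    x[1:] += bigram_emb[buckets] @ bigram_proj
    return x
```

Note the parameter math lines up with the commit message: the bigram table is 4096 × 128 ≈ 524K parameters, and the gate is one scalar per model dimension.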
gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026
dhruvjatkar referenced this pull request in dhruvjatkar/parameter-golf Mar 25, 2026
PR openai#672 maxes out TTT at 30 epochs (590 s of the 600 s eval budget), so all future
improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2,
  SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA #6,
  depth recurrence #7 with int6 risk warning, AdEMAMix #8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a
  concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 participants