
Non-record: Negative findings on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization#212

Closed
mrdavtan wants to merge 3 commits into openai:main from mrdavtan:int6-3xMLP-pr

Conversation


@mrdavtan commented Mar 20, 2026

Non-record submission. val_bpb=1.1329 (3-seed mean, std=0.0006). 11-layer, 512d model; int6+zstd-22 artifact, 15.3 MB; trained on 8xH100 SXM. 25 experiments documented in FINDINGS.md.

Findings not tested elsewhere

Codebook quantization (#24): K-means K=256 gives 87% lower reconstruction MSE than int6 per-row, but 25% larger artifact under zstd-22. Higher byte entropy in codebook indices compresses less efficiently than clamped int6 values. Other K values and codecs not tested.
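
A minimal sketch of the comparison described above, assuming numpy, scikit-learn, and zstandard; the quantizers and names here are illustrative stand-ins, not the repo's export code:

```python
# Illustrative comparison: per-row int6 quantization vs. a K=256 K-means codebook,
# scored by reconstruction MSE and zstd-22 artifact size. Not the repo's code.
import numpy as np
import zstandard as zstd
from sklearn.cluster import KMeans

def int6_per_row(w):
    """Per-row symmetric quantization to 6-bit levels (clamped to [-31, 31])."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, q * scale  # quantized values and reconstruction

def codebook_k256(w, seed=0):
    """Scalar K-means codebook with K=256; returns uint8 indices and reconstruction."""
    km = KMeans(n_clusters=256, n_init=3, random_state=seed).fit(w.reshape(-1, 1))
    idx = km.labels_.astype(np.uint8)
    recon = km.cluster_centers_[idx].reshape(w.shape)
    return idx, recon

def zstd22_bytes(arr):
    """Size of the zstd level-22 compressed byte stream for an integer array."""
    return len(zstd.ZstdCompressor(level=22).compress(np.ascontiguousarray(arr).tobytes()))

w = np.random.randn(128, 512).astype(np.float32)  # stand-in weight matrix, kept small

q6, r6 = int6_per_row(w)
idx, rc = codebook_k256(w)
print("int6 : MSE", np.mean((w - r6) ** 2), "| artifact bytes", zstd22_bytes(q6))
print("K=256: MSE", np.mean((w - rc) ** 2), "| artifact bytes", zstd22_bytes(idx))
```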

Magnitude pruning (#23): Zeroing the smallest 3% of weights increased our artifact by 728 KB under zstd-22; 1% and 5% were neutral. Non-monotonic on our checkpoint; other weight distributions may differ.
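
For reference, a sketch of how such a sweep can be scored against compressed artifact size, assuming the same int6+zstd-22 export path described above (illustrative names, not the repo's code):

```python
# Zero the smallest fraction of weights by magnitude, then measure how the
# int6 + zstd-22 artifact size changes. Illustrative sketch only.
import numpy as np
import zstandard as zstd

def prune_smallest(w, fraction):
    """Return a copy of w with the `fraction` smallest-magnitude entries zeroed."""
    k = int(fraction * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k)[k]
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out

def artifact_bytes(w):
    """Per-row int6 quantization followed by zstd-22, mirroring the export path."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return len(zstd.ZstdCompressor(level=22).compress(q.tobytes()))

w = np.random.randn(1024, 512).astype(np.float32)  # stand-in checkpoint tensor
base = artifact_bytes(w)
for frac in (0.01, 0.03, 0.05):
    delta = artifact_bytes(prune_smallest(w, frac)) - base
    print(f"prune {frac:.0%}: artifact delta {delta:+d} bytes")
```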

Multi-token prediction (#14): Auxiliary t+2 head at 0.5× loss weight: +0.0018 BPB, 3% slower. Other weightings and scales not tested.
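
A minimal PyTorch sketch of what an auxiliary t+2 head at 0.5× loss weight looks like; the module and shapes are assumptions for illustration, not the repo's implementation:

```python
# Next-token head plus an auxiliary t+2 head whose loss is weighted 0.5x.
# Hypothetical module for illustration; d_model/vocab match the 512d, sp1024 setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTokenHeads(nn.Module):
    def __init__(self, d_model=512, vocab_size=1024):
        super().__init__()
        self.head_t1 = nn.Linear(d_model, vocab_size)  # standard next-token head
        self.head_t2 = nn.Linear(d_model, vocab_size)  # auxiliary t+2 head

    def forward(self, hidden, targets):
        # hidden: (B, T, d_model) final-layer states, targets: (B, T) token ids
        logits_t1 = self.head_t1(hidden)
        logits_t2 = self.head_t2(hidden)
        # primary loss: predict targets[t+1] from hidden[t]
        loss_t1 = F.cross_entropy(
            logits_t1[:, :-1].reshape(-1, logits_t1.size(-1)),
            targets[:, 1:].reshape(-1))
        # auxiliary loss: predict targets[t+2] from hidden[t], weighted 0.5x
        loss_t2 = F.cross_entropy(
            logits_t2[:, :-2].reshape(-1, logits_t2.size(-1)),
            targets[:, 2:].reshape(-1))
        return loss_t1 + 0.5 * loss_t2

# usage sketch
heads = TwoTokenHeads()
h = torch.randn(2, 16, 512)
tok = torch.randint(0, 1024, (2, 16))
print(heads(h, tok))
```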

Embedding SVD (#25): Rank-64 explains 41.9% of variance on tok_emb (1024×512). Linear low-rank factorization not viable at this vocabulary size. Nonlinear methods not tested.
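
The variance figure can be checked with a short SVD sketch (PCA-style, mean-centered; whether the original analysis centered tok_emb is an assumption):

```python
# Fraction of (centered) squared Frobenius norm captured by the top-r singular values.
import numpy as np

def explained_variance_at_rank(w, rank):
    centered = w - w.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())

tok_emb = np.random.randn(1024, 512).astype(np.float32)  # stand-in for the real tok_emb
print(f"rank-64 explained variance: {explained_variance_at_rank(tok_emb, 64):.1%}")
```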

Also documented: Depth recurrence failure (#13), QAT under torch.compile (#2b, credit @152334H), int5 gap (#15), curriculum learning (#11), optimizer coverage (#16), and 14 others.

Reproduction

```bash
git clone https://github.com/mrdavtan/parameter-golf.git && cd parameter-golf && git checkout int6-3xMLP-pr
pip install flash-attn --no-cache-dir --no-build-isolation && pip install zstandard sentencepiece huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024
SEED=7 QAT=0 TTT_MAX_STEPS=500 TTT_FREEZE_BLOCKS=1 TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 \
UNET_SKIPS=1 ROPE_DIMS=16 LN_SCALE=1 ROPE_BASE=10000 EVAL_STRIDE=32 DOC_ISOLATED_EVAL=0 \
LATE_K_FP16=0 FP16_EMBED_EXPORT=0 \
torchrun --standalone --nproc_per_node=8 records/track_10min_16mb/2026-03-21_11L_XSA_EMA_TTT/train_gpt.py
```

Builds on PRs #162, #77, #180 and modded-nanogpt.

@mrdavtan (Author) commented:

Update: 5-seed statistical validation added

| Seed | val_bpb |
|---|---|
| 31337 | 1.1703 |
| 1337 | 1.1708 |
| 2024 | 1.1712 |
| 42 | 1.1732 |
| 7 | 1.1767 |
| Mean | 1.1724 |
| Std | 0.0026 |

Gap vs baseline: 0.036 nats (threshold: 0.005) | t-stat: 44.2 | p < 0.01
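
For context, one way to run such a test with scipy is sketched below; BASELINE_BPB is a placeholder derived from the mean and gap quoted above, and the exact test behind the reported numbers is not shown here, so the statistics will differ:

```python
# One-sample t-test of the per-seed val_bpb values against a baseline value.
import numpy as np
from scipy import stats

seed_bpb = np.array([1.1703, 1.1708, 1.1712, 1.1732, 1.1767])
BASELINE_BPB = 1.1364  # placeholder: mean (1.1724) minus the quoted gap (0.036)

t_stat, p_value = stats.ttest_1samp(seed_bpb, BASELINE_BPB)
print(f"mean={seed_bpb.mean():.4f} std={seed_bpb.std(ddof=1):.4f} "
      f"t={t_stat:.1f} p={p_value:.2e}")
```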

All 5 runs on 8×H100 SXM (RunPod Parameter Golf template), PyTorch 2.9.1+cu128, same config, only seed varied. README and submission.json updated with full results.

@mrdavtan reopened this Mar 21, 2026
@mrdavtan changed the title from "Record: Int6 + 3x MLP + sliding window (val_bpb=1.1708) + 9 ablations" to "Record: 11L XSA + EMA + TTT + Partial RoPE + LN Scale — val_bpb=1.1401" on Mar 21, 2026
@mrdavtan force-pushed the int6-3xMLP-pr branch 6 times, most recently from 2bc195b to 3b0b437 on March 22, 2026 00:56
@mrdavtan changed the title from "Record: 11L XSA + EMA + TTT + Partial RoPE + LN Scale — val_bpb=1.1401" to "Record: 11L XSA + EMA + TTT + Partial RoPE + LN Scale — val_bpb=1.1375" on Mar 22, 2026
@mrdavtan changed the title from "Record: 11L XSA + EMA + TTT + Partial RoPE + LN Scale — val_bpb=1.1375" to "Record: 11L EMA + TTT + Partial RoPE + LN Scale + XSA4 — mean val_bpb=1.1329 (3 seeds)" on Mar 22, 2026
3-seed validation (mean 1.1329, std 0.0006). Added findings on
codebook quantization vs zstd, magnitude pruning non-monotonicity,
and embedding SVD analysis. Language cleanup throughout.
@mrdavtan changed the title from "Record: 11L EMA + TTT + Partial RoPE + LN Scale + XSA4 — mean val_bpb=1.1329 (3 seeds)" to "Non-record: Negative findings on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization" on Mar 22, 2026
@mrdavtan (Author) commented:

Closing: the reproduction command includes non-causal TTT (TTT_MAX_STEPS=500), which is invalid per #402. The negative findings documented in FINDINGS.md remain in the repo.
