Non-record: Cross-seed rotational symmetry in transformer weights — 33 checkpoint experiments (Procrustes, pruning×zstd, block-level quantization outliers)#1048
Open
mrdavtan wants to merge 74 commits into openai:main
Conversation
Compares int6, GPTQ-lite, and learned codebook quantization on saved checkpoints. Reports per-tensor MSE, outlier detection, embedding entropy analysis, and compressed size. No GPU needed for analysis.
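A minimal sketch of the kind of per-tensor round-trip MSE measurement this analysis relies on. The symmetric per-tensor int6 scaling and the checkpoint layout (a flat state_dict of fp32 tensors) are assumptions, not the PR's actual scripts.

```python
# Minimal sketch: per-tensor int6 round-trip MSE on a saved checkpoint (CPU only).
# The symmetric per-tensor scaling and state_dict layout are assumptions.
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor quantization to 6-bit integers in [-31, 31]."""
    scale = w.abs().max().clamp(min=1e-12) / 31.0
    q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)
    return q, scale

def per_tensor_quant_mse(state_dict):
    """Relative MSE of an int6 round trip, per tensor."""
    report = {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):
            continue
        q, scale = quantize_int6(w)
        err = q.to(torch.float32) * scale - w
        report[name] = (err.pow(2).mean() / w.pow(2).mean().clamp(min=1e-12)).item()
    return report

if __name__ == "__main__":
    # Toy example; in practice load torch.load("checkpoint.pt") instead.
    sd = {"mlp.proj.weight": torch.randn(256, 256)}
    for k, v in per_tensor_quant_mse(sd).items():
        print(f"{k}: relative MSE {v:.2e}")
```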
… DDP 14.8s/epoch)
Major changes:
- DDP gradient sharding: each GPU processes batch_seqs sequences, with a manual all_reduce on gradients (matches the PR openai#415/openai#417 approach; a minimal sketch follows below)
- Two-phase TTT (TTT_TWO_PHASE=1): Phase 1 is norm-only recalibration (50 epochs of Adam, ~22K params); Phase 2 is selective block adaptation (10 epochs of SGD, last 3 blocks)
- TTT_BATCH_SEQS=64 per GPU (512 total with 8 GPUs)
- Falls back to single-phase SGD if TTT_TWO_PHASE=0
Expected speedup: ~235x (from 1344s/epoch to ~5.7s/epoch)
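A minimal sketch of the manual gradient all_reduce described in the first bullet: each rank computes gradients on its own sequences, gradients are averaged across ranks, then the optimizer steps. The model, loss, and loop plumbing are placeholders, not the PR's TTT code.

```python
# Minimal sketch: manual gradient averaging across data-parallel ranks.
# Assumes torch.distributed is already initialized; model/loss are placeholders.
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module):
    """Average gradients across all data-parallel ranks in place."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

def ttt_step(model, batch, loss_fn, optimizer):
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    allreduce_gradients(model)  # each rank saw different sequences
    optimizer.step()
    return loss.detach()
```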
New defaults: TTT_COSINE=1, TTT_EPOCHS=30, TTT_LR=0.0005. Supports TTT_WARMUP_FRAC and TTT_PERLAYER (3x proj, 0.5x fc). Tested across 26 configs in 3 rounds on a 4080 GPU.
…ttern) Root cause: per-sequence indexing from permuted indices was ~100x slower than contiguous val_tokens slicing. Each GPU now takes a contiguous shard and iterates sequentially, matching openai#398's working implementation.
Local ablation showed h1792 gives -0.056 BPB over h1536 at similar step cost. Fixed unset blocks that were clearing MLP_HIDDEN, BIGRAM_HASH_BUCKETS, and TRAIN_BATCH_TOKENS immediately after they were set (same bug class as Finding openai#24).
Run3 checkpoint (1.1496) shows 27.5M params → 15.66MB artifact at full scale. h1792 would add ~3M params → ~17.6MB, 1.6MB over cap. Compression ratio degrades with longer training (denser weights). Staying at h1536 (default). The local ablation was misleading.
Sigmoid skip gates (PR openai#505): replace additive skip connections with a sigmoid-gated blend, x = gate*x + (1-gate)*scaled_skip, with learned per-dim gates initialized to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).
Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get DECODER_LR_MULT=2.0 applied on top of the per-layer LR splits. Combined with the quant-damage LR, decoder proj gets 1.5*2 = 3x base LR.
Both features are env-var controlled and default ON.
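A minimal sketch of the sigmoid-gated skip blend above, with per-dim gates whose logits start at 0 so the initial gate is 0.5. Module and attribute names are illustrative, not PR openai#505's code.

```python
# Minimal sketch: sigmoid-gated skip blend x = gate*x + (1-gate)*scaled_skip.
import torch
import torch.nn as nn

class SigmoidSkipGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Logits start at 0 so the gate starts at an even 0.5/0.5 blend.
        self.gate_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, scaled_skip: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logit)  # (dim,), broadcasts over batch/seq
        return gate * x + (1.0 - gate) * scaled_skip

if __name__ == "__main__":
    blend = SigmoidSkipGate(dim=8)
    x, skip = torch.randn(2, 4, 8), torch.randn(2, 4, 8)
    print(blend(x, skip).shape)  # torch.Size([2, 4, 8])
```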
1. Progressive Layer Freezing: freeze encoder blocks during warmdown (PROGRESSIVE_FREEZE=1) to halve the backward pass and gain ~1700 extra decoder-focused training steps. Includes a Muon weight-decay guard to prevent frozen weights from decaying.
2. Hyper-Connections: learned mixing of all prior layer outputs (HYPER_CONNECTIONS=1, mode=scalar|vector). Generalizes U-Net skips and resid_mix into a single mechanism with negligible param cost.
3. Logit Ensemble: average logits from the EMA and raw checkpoints at eval (LOGIT_ENSEMBLE=1). Eval-only, zero training impact. Quantizes both checkpoints independently for a fair comparison (see the sketch after this list).
All toggleable via env vars, all off by default. New ablation script run_ablation_innovations.sh with experiments F/G/H/I.
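A minimal sketch of the eval-only logit ensemble in item 3: run both independently quantized checkpoints and average their logits. The model-loading helpers are whatever the repo actually uses and are not shown here.

```python
# Minimal sketch: eval-only logit ensemble over EMA and raw checkpoints.
# Both models are assumed to have been quantized/dequantized independently.
import torch

@torch.no_grad()
def ensemble_logits(model_ema, model_raw, inputs: torch.Tensor) -> torch.Tensor:
    """Average logits from two checkpoints at eval time (no training impact)."""
    model_ema.eval()
    model_raw.eval()
    return 0.5 * (model_ema(inputs) + model_raw(inputs))
```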
Stale MLP_HIDDEN=1792 in pod env caused QAT ablation to train with wrong model size (27.8M vs 27.5M params, 108ms vs 70ms step time).
786K at ~108ms/step yields fewer total tokens than 524K at ~70ms/step. Run3 baseline used 524K and achieved 1.1496 BPB — all ablations should use the same batch size for fair comparison.
…ompile Dynamic tensor sizes per layer (2, 3, 4... 12) broke torch.compile graph tracing. Now all alpha tensors are padded to max_slots with -inf masks on unused entries. Pre-allocated output buffer replaces Python list.
Pre-allocated buffer with all_outputs[i+1] = x_in caused version mismatch in backward pass. Now uses list + torch.stack per layer (no in-place ops). Zero-padding + -inf masks still ensure static shapes for torch.compile.
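A minimal sketch of the pattern the last two commits converge on: per-layer outputs are collected in a Python list and stacked (no in-place buffer writes, so autograd version counters stay consistent), while mixing weights are padded to max_slots with -inf masks so torch.compile sees static shapes. Names beyond alpha/max_slots are illustrative.

```python
# Minimal sketch: hyper-connection mixing with static shapes and no in-place ops.
import torch

def mix_layer_inputs(outputs: list, alpha: torch.Tensor, max_slots: int) -> torch.Tensor:
    """Weighted mix of all prior layer outputs for the next layer's input.

    outputs: list of tensors, each (batch, seq, dim); len(outputs) <= max_slots
    alpha:   (max_slots,) raw mixing logits; unused slots are masked with -inf
    """
    n = len(outputs)
    mask = torch.full((max_slots,), float("-inf"))
    mask[:n] = 0.0
    weights = torch.softmax(alpha + mask, dim=0)[:n]   # alpha keeps a static shape
    stacked = torch.stack(outputs, dim=0)              # (n, batch, seq, dim), no in-place writes
    return (weights.view(n, 1, 1, 1) * stacked).sum(dim=0)

if __name__ == "__main__":
    outs = [torch.randn(2, 4, 8) for _ in range(3)]
    alpha = torch.zeros(12)
    print(mix_layer_inputs(outs, alpha, max_slots=12).shape)  # torch.Size([2, 4, 8])
```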
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt. Those techniques are from PR openai#505 which has a different architecture (kv8, h1792). They don't transfer to our kv4/h1536 setup. Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3 Our unique stack remains: VR + GA + Star-ReLU + per-layer lr + GradQuant + TrigramHash
…iques Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792 config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236. Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2; addresses the dead-neuron problem. LEAKY_RELU=1 env var (sketch below).
run_no_ttt_best.sh: run3 base + three free lunches:
- MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
- LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
- QAT=1 (run5 proved a negative quant gap)
Drops sigmoid gates and decoder 2x LR (run6 showed they hurt). The real target is openai#445 at 1.1236 (not openai#505, which doesn't fit in 16MB).
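A minimal sketch of the LeakyReLU(0.5)^2 activation under the literal reading: square the output of a leaky ReLU with negative slope 0.5, as a drop-in for relu^2. The PR text doesn't specify whether the negative branch keeps its sign after squaring, so this shows the plain squared form.

```python
# Minimal sketch: squared LeakyReLU with negative slope 0.5 (literal reading).
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    # Unlike relu^2, negative inputs still carry gradient (no dead neurons).
    return F.leaky_relu(x, negative_slope) ** 2
```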
Checkpoint analysis reveals GA gates negatively correlate with quant damage (corr=-0.43), learning to dampen high-damage heads. But attn_gate weights are themselves the most fragile tensors (top 5 in the damage ranking). Keeping them in fp32 preserves the correction. Also adds CK_LR_MULT for c_k projections, which have ~2x the relative quant MSE of c_q/c_v. Default 1.0 (no change), set to 1.5 to test.
Was missing SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM=4096, GRAD_CLIP=0.3. Previous runs on this script had decoder 2x LR and bigram 8192 which caused the same regression as run6.
Late Training Replay (PR openai#445): buffer the last 100 training batches during warmdown (scale < 0.2), then replay them for 2 epochs at 10% LR after training ends. EMA is updated during replay (a critical detail from openai#445). ~50 lines. run_bestshot.sh stacks everything: MATRIX_LR=0.03, fp32 attn_gate, CK_LR_MULT=1.5, Late Replay, VALUE_RESIDUAL=0.
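A minimal sketch of Late Training Replay as described above: remember the last 100 batches seen while the LR scale is below 0.2, then replay them for 2 epochs at 10% of the base LR with the EMA still updating. The buffer, optimizer plumbing, and update_ema helper are illustrative, not the PR's code.

```python
# Minimal sketch: buffer late-training batches and replay them at 10% LR.
from collections import deque
import torch

REPLAY_CAPACITY = 100
replay_buffer = deque(maxlen=REPLAY_CAPACITY)

def maybe_buffer(batch, lr_scale: float):
    """During warmdown (lr_scale < 0.2), remember the most recent batches."""
    if lr_scale < 0.2:
        replay_buffer.append(batch)

def late_replay(model, ema_model, optimizer, loss_fn, base_lr: float,
                update_ema, epochs: int = 2):
    """Replay buffered batches at 10% LR after normal training ends."""
    for group in optimizer.param_groups:
        group["lr"] = 0.1 * base_lr
    for _ in range(epochs):
        for batch in replay_buffer:
            optimizer.zero_grad(set_to_none=True)
            loss = loss_fn(model(batch["inputs"]), batch["targets"])
            loss.backward()
            optimizer.step()
            update_ema(ema_model, model)  # EMA keeps tracking during replay
```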
Key discovery: changes that improve pre-quant quality (MATRIX_LR=0.03, sigmoid gates) make weights quant-fragile. QAT is the antidote. Run11 confirmed MATRIX_LR=0.03 alone has a +0.014 quant gap. LeakyReLU was also falsified for our architecture. All 11 runs are documented with pre/post-quant breakdowns.
@valerio-oai This is a research contribution — 33 experiments documenting compression dead ends (Procrustes, pruning, codebooks) so others don't repeat them. The cross-seed finding (#3) is novel: layers share identical rotational structure across different random seeds, which may be relevant to the "learning adapters on random linear maps" wishlist item. Happy to answer questions.
Transformer layers share identical rotational structure across different random seeds. Procrustes alignment between the same layer trained on seeds 1337 vs 7 gives 90% MSE reduction — despite zero cosine similarity between raw weights. This is architectural, not learned. Compact parameterization of the rotations remains unsolved (Givens angles unexplored).
8 findings from checkpoint analysis across 33 experiments. Follow-up to PR #212 (25 training experiments).
Summary
1. Symmetry-transport compression
Procrustes alignment between layers shows 91-93% MSE reduction on MLP proj — layers genuinely share rotational structure. Full prototype implemented: store 1 prototype + 10 rotation matrices.
Result: 14.3 MB (int6+zstd) → 68.8 MB (transport). Rotation matrices are dense, high-entropy — zstd can't compress them.
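A minimal sketch of the transport idea and why it loses: fit an orthogonal rotation per layer against a prototype with Procrustes, then compare the zstd-compressed size of storing the rotations against storing the weights directly. Tensor shapes, names, and the toy data are placeholders for the PR's scripts; MSE reduction here is measured relative to the unaligned prototype.

```python
# Minimal sketch: prototype + per-layer rotations vs direct storage under zstd.
import numpy as np
import zstandard as zstd
from scipy.linalg import orthogonal_procrustes

def zstd_size(arr: np.ndarray, level: int = 22) -> int:
    return len(zstd.ZstdCompressor(level=level).compress(arr.tobytes()))

def transport_experiment(layer_weights) -> None:
    proto = layer_weights[0]
    rot_bytes, mse_reductions = 0, []
    for w in layer_weights[1:]:
        R, _ = orthogonal_procrustes(proto, w)  # minimizes ||proto @ R - w||_F
        mse_aligned = np.mean((proto @ R - w) ** 2)
        mse_unaligned = np.mean((proto - w) ** 2)
        mse_reductions.append(1.0 - mse_aligned / mse_unaligned)
        rot_bytes += zstd_size(R.astype(np.float32))  # dense, high-entropy: barely shrinks
    direct_bytes = sum(zstd_size(w.astype(np.float32)) for w in layer_weights[1:])
    print(f"mean MSE reduction vs unaligned prototype: {np.mean(mse_reductions):.1%}")
    print(f"rotations: {rot_bytes/1e6:.1f} MB vs direct fp32: {direct_bytes/1e6:.1f} MB")

if __name__ == "__main__":
    rng = np.random.default_rng(0)  # toy weights; real runs load checkpoint tensors
    transport_experiment([rng.standard_normal((512, 512)) for _ in range(4)])
```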
2. Low-rank rotation approximation
Rank-128 captures only 16.6% of the rotation delta. Dead end without a compact rotation parameterization (Givens angles unexplored).
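A minimal sketch of the low-rank check: truncate the SVD of a fitted rotation's delta from identity to rank k and measure the fraction of the delta's energy captured. The rank-128 value follows the finding; the toy rotation is illustrative.

```python
# Minimal sketch: energy of ||R - I||_F^2 captured by a rank-k approximation.
import numpy as np

def rotation_delta_energy_captured(R: np.ndarray, rank: int) -> float:
    delta = R - np.eye(R.shape[0])
    s = np.linalg.svd(delta, compute_uv=False)
    return float(np.sum(s[:rank] ** 2) / np.sum(s ** 2))

if __name__ == "__main__":
    # Toy random rotation via QR; the PR's rotations come from Procrustes fits.
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((512, 512)))
    print(f"rank-128 captures {rotation_delta_energy_captured(Q, 128):.1%} of the delta")
```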
3. Cross-seed Procrustes
Same layer across different seeds (1337 vs 7): 90% MSE reduction after alignment. Weights are completely different values (cosine similarity ~0.000) but the rotational structure is identical.
Implication: This is an architectural property, not a training artifact. A universal rotation-aware quantization scheme could theoretically work across seeds — but storing the rotations remains the bottleneck.
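A minimal sketch of the cross-seed check itself: the same layer from two seeds has near-zero cosine similarity as raw weights, yet an orthogonal Procrustes fit removes most of the MSE between them. Checkpoint loading and tensor names for the seed-1337 and seed-7 runs are omitted.

```python
# Minimal sketch: cosine similarity of raw weights vs MSE reduction after alignment.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def cross_seed_alignment(w_a: np.ndarray, w_b: np.ndarray):
    cos = float(np.dot(w_a.ravel(), w_b.ravel()) /
                (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
    R, _ = orthogonal_procrustes(w_a, w_b)      # minimizes ||w_a @ R - w_b||_F
    mse_before = np.mean((w_a - w_b) ** 2)
    mse_after = np.mean((w_a @ R - w_b) ** 2)
    return cos, 1.0 - mse_after / mse_before    # (cosine sim, MSE reduction)
```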
4. SWA vs EMA artifact smoothness
SWA's poor score (1.2076) was due to a slow pod (5,532 steps vs 8,672) — not SWA itself. Complementary to PR #989's finding that SWA sabotages QAT. SWA-level smoothness at EMA step counts would give small artifacts with headroom for more parameters.
5. Block 7 is the quantization outlier
c_k kurtosis 11.9 vs average ~0.5. Consistently heaviest tails across checkpoints. Current LATE_K_FP16 protects last 2 layers — block 7 is layer 7, unprotected.
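A minimal sketch of the outlier scan behind this finding: compute excess kurtosis per weight tensor and rank tensors by it; heavy-tailed tensors (like block 7's c_k) are the ones clipping-based int quantization hurts most. The state_dict layout is an assumption.

```python
# Minimal sketch: rank checkpoint tensors by excess kurtosis (heavy tails first).
import torch

def excess_kurtosis(w: torch.Tensor) -> float:
    x = w.float().flatten()
    x = (x - x.mean()) / x.std().clamp(min=1e-12)
    return (x.pow(4).mean() - 3.0).item()  # ~0 for a Gaussian

def rank_by_kurtosis(state_dict, top_k: int = 5):
    scores = {n: excess_kurtosis(w) for n, w in state_dict.items()
              if torch.is_floating_point(w) and w.numel() > 1}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```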
6. GPTQ-lite
Per-row optimal clip search (5 quantile ratios, 0.9→0.99999): 0.2% MSE improvement over uniform int6. Weight distributions are well-conditioned. Zero cost, so keep as default, but don't expect gains.
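A minimal sketch of the per-row clip search: for each weight row, try a small grid of quantile-based clip values and keep the one with the lowest int6 round-trip MSE. The ratio grid follows the finding; the symmetric int6 scheme is the same assumption as in the earlier sketch.

```python
# Minimal sketch: per-row optimal clip search over quantile ratios for int6.
import torch

CLIP_RATIOS = (0.9, 0.99, 0.999, 0.9999, 0.99999)

def int6_roundtrip(row: torch.Tensor, clip: float) -> torch.Tensor:
    scale = clip / 31.0
    q = torch.round(row.clamp(-clip, clip) / scale).clamp(-31, 31)
    return q * scale

def best_clip_per_row(w: torch.Tensor) -> torch.Tensor:
    """Return the per-row clip value minimizing int6 round-trip MSE."""
    clips = []
    for row in w:
        abs_row = row.abs()
        best = min(
            (torch.quantile(abs_row, r).clamp(min=1e-12) for r in CLIP_RATIOS),
            key=lambda c: torch.mean((int6_roundtrip(row, c.item()) - row) ** 2),
        )
        clips.append(best)
    return torch.stack(clips)
```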
7. Selective fp16 embedding
Top 10% highest-entropy embedding rows (102/1024 tokens) in fp16, rest in int6. Saves 944KB vs full fp16 (104KB vs 1,048KB). The quality trade-off is minimal — 90% of tokens have well-clustered embeddings that survive int6.
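A minimal sketch of the row split: score each embedding row by the entropy of its normalized absolute values, keep the top 10% in fp16, and send the rest through int6. The entropy definition here is an assumption; the finding only says "highest-entropy embedding rows".

```python
# Minimal sketch: pick the top-entropy 10% of embedding rows to keep in fp16.
import torch

def row_entropy(emb: torch.Tensor) -> torch.Tensor:
    p = emb.abs() / emb.abs().sum(dim=1, keepdim=True).clamp(min=1e-12)
    return -(p * p.clamp(min=1e-12).log()).sum(dim=1)

def split_embedding(emb: torch.Tensor, fp16_frac: float = 0.10):
    """Return (row indices kept in fp16, row indices quantized to int6)."""
    k = max(1, int(round(fp16_frac * emb.shape[0])))
    fp16_rows = torch.topk(row_entropy(emb), k).indices
    mask = torch.zeros(emb.shape[0], dtype=torch.bool)
    mask[fp16_rows] = True
    return fp16_rows, (~mask).nonzero(as_tuple=True)[0]
```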
8. Non-monotonic pruning + zstd
3% magnitude pruning increases artifact by 728KB. 1% and 5% are neutral. Zeroing creates byte patterns that interact badly with zstd-22 at specific sparsity levels. Always measure compressed size, not just reconstruction quality.
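A minimal sketch of the measurement discipline this finding argues for: apply magnitude pruning at a given sparsity, then compare the zstd-22 size of the pruned vs unpruned bytes directly, since the compressed size is what can move non-monotonically. Serialization is simplified (raw fp32 bytes rather than the full int6 pipeline).

```python
# Minimal sketch: measure zstd-22 size after magnitude pruning, not just MSE.
import numpy as np
import zstandard as zstd

def prune_magnitude(w: np.ndarray, sparsity: float) -> np.ndarray:
    thresh = np.quantile(np.abs(w), sparsity)
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out

def zstd22_size(w: np.ndarray) -> int:
    return len(zstd.ZstdCompressor(level=22).compress(w.astype(np.float32).tobytes()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 1024)).astype(np.float32)
    for s in (0.01, 0.03, 0.05):
        print(f"sparsity {s:.0%}: {zstd22_size(prune_magnitude(w, s))} bytes "
              f"(dense: {zstd22_size(w)} bytes)")
```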
Key takeaway
Int6 + zstd-22 is near-optimal for this architecture. Approaches that improve reconstruction MSE (codebooks, Procrustes) don't improve compressed size — the codec must be considered jointly.
Reproduction
Analysis scripts in experiments/: quant_analysis.py, analyze_quant_gates.py. Builds on PRs #162 (raahilshah), #180 (thwu1), #212 (prior 25 experiments), and modded-nanogpt.