
Non-record: Cross-seed rotational symmetry in transformer weights — 33 checkpoint experiments (Procrustes, pruning×zstd, block-level quantization outliers)#1048

Open
mrdavtan wants to merge 74 commits into openai:main from mrdavtan:compression-negative-results

Conversation

@mrdavtan

@mrdavtan mrdavtan commented Mar 29, 2026

Transformer layers share identical rotational structure across different random seeds. Procrustes alignment between the same layer trained on seeds 1337 vs 7 gives 90% MSE reduction — despite zero cosine similarity between raw weights. This is architectural, not learned. Compact parameterization of the rotations remains unsolved (Givens angles unexplored).

8 findings from checkpoint analysis across 33 experiments. Follow-up to PR #212 (25 training experiments).

Summary

| # | Technique | Result | Verdict |
|---|---|---|---|
| 1 | Symmetry-transport (Procrustes) | 91% MSE reduction but 380% larger artifact | Dead |
| 2 | Low-rank rotation approximation | Rank-128 captures 16.6% of rotation variance | Dead |
| 3 | Cross-seed Procrustes | 90% MSE reduction — rotational structure is architectural | Interesting but unexploitable |
| 4 | SWA vs EMA weight smoothness | SWA 2.3× smoother, 24% smaller artifacts | Viable if step count preserved |
| 5 | Block 7 c_k outlier | Kurtosis 11.9 (avg ~0.5), most sensitive large tensor | Worth fp16 protection |
| 6 | GPTQ-lite on well-conditioned weights | 0.2% MSE improvement | Marginal |
| 7 | Selective fp16 embedding | Entropy-based, saves 944KB vs full fp16 | Viable |
| 8 | Non-monotonic pruning+zstd | 3% pruning increases artifact by 728KB | Surprising interaction |

1. Symmetry-transport compression

Procrustes alignment between layers shows 91-93% MSE reduction on MLP proj — layers genuinely share rotational structure. Full prototype implemented: store 1 prototype + 10 rotation matrices.

Result: 14.3 MB (int6+zstd) → 68.8 MB (transport). Rotation matrices are dense, high-entropy — zstd can't compress them.
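The alignment step can be sketched with the closed-form orthogonal Procrustes solution. This is a minimal NumPy illustration, not the PR's implementation: the matrix sizes, noise level, and synthetic "layers" are hypothetical stand-ins for real checkpoint tensors.

```python
import numpy as np

# Two synthetic "layer" weights that share rotational structure:
# W_other is W_proto under an unknown rotation plus small noise.
rng = np.random.default_rng(0)
W_proto = rng.standard_normal((256, 256))
R_true, _ = np.linalg.qr(rng.standard_normal((256, 256)))
W_other = W_proto @ R_true + 0.05 * rng.standard_normal((256, 256))

# Closed-form Procrustes: R = U V^T from the SVD of W_proto^T W_other
# minimizes ||W_proto @ R - W_other||_F over orthogonal R.
U, _, Vt = np.linalg.svd(W_proto.T @ W_other)
R = U @ Vt

mse_raw = np.mean((W_proto - W_other) ** 2)
mse_aligned = np.mean((W_proto @ R - W_other) ** 2)
reduction = 1.0 - mse_aligned / mse_raw  # large when structure is shared
```

Note the storage problem is visible here too: `R` is a dense orthogonal matrix with near-maximal entropy, which is exactly why the transport artifact grows instead of shrinking.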

2. Low-rank rotation approximation

Rank-128 captures only 16.6% of the rotation delta. Dead end without a compact rotation parameterization (Givens angles unexplored).
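The flat spectrum of a rotation explains why low-rank truncation fails. A quick check, assuming (hypothetically) a 768-dimensional model: every singular value of an orthogonal matrix is 1, so rank-128 can only ever capture 128/768 ≈ 16.7% of the Frobenius energy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthogonal matrix as a stand-in for a learned inter-layer rotation.
R, _ = np.linalg.qr(rng.standard_normal((768, 768)))

s = np.linalg.svd(R, compute_uv=False)       # all singular values are 1
frac = float((s[:128] ** 2).sum() / (s ** 2).sum())
# frac == 128/768: low-rank truncation cannot compress a dense rotation
```

This matches the observed 16.6%: the rotation has no low-rank structure to exploit, which is why a parameterization on the rotation group (e.g. Givens angles) is the remaining open direction.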

3. Cross-seed Procrustes

Same layer across different seeds (1337 vs 7): 90% MSE reduction after alignment. Weights are completely different values (cosine similarity ~0.000) but the rotational structure is identical.

Implication: This is an architectural property, not a training artifact. A universal rotation-aware quantization scheme could theoretically work across seeds — but storing the rotations remains the bottleneck.
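The "zero cosine similarity" part of the measurement is just a flattened dot product; a minimal sketch with synthetic stand-ins for two seeds' checkpoints:

```python
import numpy as np

def flat_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two weight tensors, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Independent random inits (hypothetical stand-ins for seed-1337 and
# seed-7 checkpoints): raw similarity is ~0 even when the rotational
# structure recovered by Procrustes alignment matches.
rng_a, rng_b = np.random.default_rng(1337), np.random.default_rng(7)
W_seed_a = rng_a.standard_normal((256, 256))
W_seed_b = rng_b.standard_normal((256, 256))
```

The interesting claim is the combination: `flat_cosine` near zero while Procrustes alignment still removes 90% of the MSE.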

4. SWA vs EMA artifact smoothness

| Metric | EMA (1.1344) | SWA (1.2076) |
|---|---|---|
| MLP fc std | 0.2056 | 0.0912 (2.3× smoother) |
| MLP fc kurtosis | 0.53 | 0.17 (3× lower) |
| Artifact size | 16.3 MB | 12.4 MB (24% smaller) |

SWA's poor score (1.2076) was due to a slow pod (5,532 steps vs 8,672) — not SWA itself. Complementary to PR #989's finding that SWA sabotages QAT. SWA-level smoothness at EMA step counts would give small artifacts with headroom for more parameters.

5. Block 7 is the quantization outlier

c_k kurtosis 11.9 vs average ~0.5. Consistently heaviest tails across checkpoints. Current LATE_K_FP16 protects last 2 layers — block 7 is layer 7, unprotected.
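A kurtosis scan of this kind needs only a fourth-moment ratio per tensor. A hedged sketch (the tensor names and distributions below are synthetic; a Laplace draw stands in for a heavy-tailed c_k):

```python
import numpy as np

def excess_kurtosis(w: np.ndarray) -> float:
    """Excess kurtosis of a flattened weight tensor (0 for a Gaussian)."""
    x = w.ravel() - w.mean()
    return float(np.mean(x**4) / np.mean(x**2) ** 2 - 3.0)

rng = np.random.default_rng(0)
tensors = {  # hypothetical stand-ins: most blocks near-Gaussian, one heavy-tailed
    "block3.c_k": rng.standard_normal(100_000),
    "block7.c_k": rng.laplace(size=100_000),  # excess kurtosis ~3
}
worst = max(tensors, key=lambda name: excess_kurtosis(tensors[name]))
```

Heavy tails matter for uniform quantization because the scale is set by the extremes, wasting levels on the bulk of the distribution — hence the fp16-protection recommendation.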

6. GPTQ-lite

Per-row optimal clip search (5 quantile ratios, 0.9→0.99999): 0.2% MSE improvement over uniform int6. Weight distributions are well-conditioned. Zero cost, so keep as default, but don't expect gains.
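The per-row clip search can be sketched as follows. This is an illustrative reimplementation, not the PR's `quant_analysis.py`; the quantile ratios match the ones quoted above, the bit width and row size are assumptions.

```python
import numpy as np

def quantize_row(row: np.ndarray, clip: float, bits: int = 6) -> np.ndarray:
    """Symmetric uniform quantization of one row at a given clip value."""
    levels = 2 ** (bits - 1) - 1
    scale = clip / levels
    q = np.clip(np.round(row / scale), -levels, levels)
    return q * scale

def best_clip_mse(row, ratios=(0.9, 0.99, 0.999, 0.9999, 0.99999)):
    """Search quantile-based clip values per row, keep the lowest-MSE one."""
    best = np.inf
    for r in ratios:
        clip = float(np.quantile(np.abs(row), r))
        best = min(best, float(np.mean((row - quantize_row(row, clip)) ** 2)))
    return best

rng = np.random.default_rng(0)
row = rng.standard_normal(4096)  # well-conditioned (near-Gaussian) weights
uniform_mse = float(np.mean((row - quantize_row(row, np.abs(row).max())) ** 2))
searched_mse = best_clip_mse(row)
```

On near-Gaussian rows the searched clip barely beats `max(|w|)`, consistent with the 0.2% figure: the search only pays off when tails are heavy.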

7. Selective fp16 embedding

Top 10% highest-entropy embedding rows (102/1024 tokens) in fp16, rest in int6. Saves 944KB vs full fp16 (104KB vs 1,048KB). Quality trade minimal — 90% of tokens have well-clustered embeddings that survive int6.
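The row selection reduces to a histogram entropy per embedding row. A minimal sketch under assumptions (random table standing in for the real embedding, 32-bin histogram as an arbitrary choice):

```python
import numpy as np

def row_entropy(row: np.ndarray, bins: int = 32) -> float:
    """Shannon entropy (bits) of a value histogram of one embedding row."""
    hist, _ = np.histogram(row, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
emb = rng.standard_normal((1024, 256))   # hypothetical 1024-token table
ent = np.array([row_entropy(r) for r in emb])
n_fp16 = int(0.10 * len(emb))            # top 10% highest-entropy rows
fp16_rows = np.argsort(ent)[-n_fp16:]    # keep these fp16, int6 for the rest
```

With a 1024-token vocabulary this selects 102 rows, matching the 102/1024 split above.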

8. Non-monotonic pruning + zstd

3% magnitude pruning increases artifact by 728KB. 1% and 5% are neutral. Zeroing creates byte patterns that interact badly with zstd-22 at specific sparsity levels. Always measure compressed size, not just reconstruction quality.
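The measurement protocol — prune, quantize, then measure the *compressed* byte count — can be sketched like this. The experiments used zstd level 22; stdlib `zlib` stands in here so the sketch is self-contained, and the tensor is synthetic, so the non-monotonic bump at 3% is not expected to reproduce.

```python
import zlib
import numpy as np

def compressed_size(w: np.ndarray, bits: int = 6, prune_frac: float = 0.0) -> int:
    """Magnitude-prune, int-quantize, and measure losslessly compressed bytes."""
    w = w.copy()
    if prune_frac > 0:
        thresh = np.quantile(np.abs(w), prune_frac)
        w[np.abs(w) < thresh] = 0.0          # zero the smallest-magnitude weights
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int8)
    # zlib stands in for the zstd-22 codec used in the actual experiments.
    return len(zlib.compress(q.tobytes(), 9))

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 18).astype(np.float32)
sizes = {f: compressed_size(w, prune_frac=f) for f in (0.0, 0.01, 0.03, 0.05)}
```

The point of the protocol is that `sizes` is the quantity to optimize: reconstruction MSE can improve while the compressed byte count gets worse.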

Key takeaway

Int6 + zstd-22 is near-optimal for this architecture. Approaches that improve reconstruction MSE (codebooks, Procrustes) don't improve compressed size — the codec must be considered jointly.

Reproduction

Analysis scripts in experiments/: quant_analysis.py, analyze_quant_gates.py.

Builds on PRs #162 (raahilshah), #180 (thwu1), #212 (prior 25 experiments), and modded-nanogpt.

mrdavtan added 30 commits March 21, 2026 13:50
Compares int6, GPTQ-lite, and learned codebook quantization on saved
checkpoints. Reports per-tensor MSE, outlier detection, embedding
entropy analysis, and compressed size. No GPU needed for analysis.
Major changes:
- DDP gradient sharding: each GPU processes batch_seqs sequences,
  manual all_reduce on gradients (matches PR openai#415/openai#417 approach)
- Two-phase TTT (TTT_TWO_PHASE=1):
  Phase 1: norm-only recalibration (50 epochs Adam, ~22K params)
  Phase 2: selective block adaptation (10 epochs SGD, last 3 blocks)
- TTT_BATCH_SEQS=64 per GPU (512 total with 8 GPUs)
- Falls back to single-phase SGD if TTT_TWO_PHASE=0

Expected speedup: ~235x (from 1344s/epoch to ~5.7s/epoch)
New defaults: TTT_COSINE=1, TTT_EPOCHS=30, TTT_LR=0.0005
Supports: TTT_WARMUP_FRAC, TTT_PERLAYER (3x proj, 0.5x fc)
Tested: 26 configs across 3 rounds on 4080 GPU
…ttern)

Root cause: per-sequence indexing from permuted indices was ~100x slower
than contiguous val_tokens slicing. Each GPU now takes a contiguous
shard and iterates sequentially, matching openai#398's working implementation.
mrdavtan added 26 commits March 23, 2026 07:57
Local ablation showed h1792 gives -0.056 BPB over h1536 at similar step cost.
Fixed unset blocks that were killing MLP_HIDDEN, BIGRAM_HASH_BUCKETS, and
TRAIN_BATCH_TOKENS immediately after setting them (same bug class as Finding openai#24).
Run3 checkpoint (1.1496) shows 27.5M params → 15.66MB artifact at full scale.
h1792 would add ~3M params → ~17.6MB, 1.6MB over cap.
Compression ratio degrades with longer training (denser weights).
Staying at h1536 (default). The local ablation was misleading.
Sigmoid skip gates (PR openai#505): replace additive skip connections with
sigmoid-gated blend: x = gate*x + (1-gate)*scaled_skip. Learned per-dim
gates init to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).

Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get
DECODER_LR_MULT=2.0 applied on top of the per-layer lr splits.
Combined with quant-damage lr: decoder proj gets 1.5*2=3x base lr.

Both features are env-var controlled and default ON.
1. Progressive Layer Freezing: freeze encoder blocks during warmdown
   (PROGRESSIVE_FREEZE=1) to halve backward pass and gain ~1700 extra
   decoder-focused training steps. Includes Muon weight-decay guard
   to prevent frozen weights from decaying.

2. Hyper-Connections: learned mixing of all prior layer outputs
   (HYPER_CONNECTIONS=1, mode=scalar|vector). Generalizes U-Net skips
   and resid_mix into a single mechanism with negligible param cost.

3. Logit Ensemble: average logits from EMA + raw checkpoint at eval
   (LOGIT_ENSEMBLE=1). Eval-only, zero training impact. Quantizes both
   checkpoints independently for fair comparison.

All toggleable via env vars, all off by default. New ablation script
run_ablation_innovations.sh with experiments F/G/H/I.
Stale MLP_HIDDEN=1792 in pod env caused QAT ablation to train with
wrong model size (27.8M vs 27.5M params, 108ms vs 70ms step time).
786K at ~108ms/step yields fewer total tokens than 524K at ~70ms/step.
Run3 baseline used 524K and achieved 1.1496 BPB — all ablations should
use the same batch size for fair comparison.
…ompile

Dynamic tensor sizes per layer (2, 3, 4... 12) broke torch.compile graph
tracing. Now all alpha tensors are padded to max_slots with -inf masks on
unused entries. Pre-allocated output buffer replaces Python list.
Pre-allocated buffer with all_outputs[i+1] = x_in caused version
mismatch in backward pass. Now uses list + torch.stack per layer
(no in-place ops). Zero-padding + -inf masks still ensure static
shapes for torch.compile.
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt.
Those techniques are from PR openai#505 which has a different architecture (kv8, h1792).
They don't transfer to our kv4/h1536 setup.

Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3
Our unique stack remains: VR + GA + Star-ReLU + per-layer lr + GradQuant + TrigramHash
…iques

Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792
config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236.

Novel technique analysis: DG Attention (differential values), BitNet b1.58
(ternary weights + depth recurrence), arithmetic coding (replaces zstd-22),
LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2.
Addresses dead neuron problem. LEAKY_RELU=1 env var.

run_no_ttt_best.sh: run3 base + three free lunches:
  - MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
  - LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
  - QAT=1 (run5 proved negative quant gap)

Drops sigmoid gates and decoder 2x LR (run6 showed they hurt).
Real target is openai#445 at 1.1236 (not openai#505 which doesn't fit 16MB).
Checkpoint analysis reveals GA gates negatively correlate with quant
damage (corr=-0.43), learning to dampen high-damage heads. But
attn_gate weights are themselves the most fragile tensors (top 5 of
damage ranking). Keeping them in fp32 preserves the correction.

Also adds CK_LR_MULT for c_k projections which have ~2x the relative
quant MSE of c_q/c_v. Default 1.0 (no change), set to 1.5 to test.
Was missing SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM=4096,
GRAD_CLIP=0.3. Previous runs on this script had decoder 2x LR and
bigram 8192 which caused the same regression as run6.
Late Training Replay (PR openai#445): buffer last 100 training batches during
warmdown (scale < 0.2), replay 2 epochs at 10% LR after training ends.
EMA updated during replay (critical detail from openai#445). ~50 lines.

run_bestshot.sh stacks everything:
  MATRIX_LR=0.03, fp32 attn_gate, CK_LR_MULT=1.5,
  Late Replay, VALUE_RESIDUAL=0
Key discovery: changes that improve pre-quant quality (MATRIX_LR=0.03,
sigmoid gates) make weights quant-fragile. QAT is the antidote.
Run11 confirmed MATRIX_LR=0.03 alone has +0.014 quant gap.
LeakyReLU also falsified for our architecture.

All 11 runs documented with pre/post-quant breakdown.
@mrdavtan mrdavtan changed the title Non-record: Compression moonshots — 8 negative/marginal findings (Procrustes, SWA smoothness, selective fp16, pruning+zstd) Non-record: Cross-seed rotational symmetry in transformer weights — 33 checkpoint experiments (Procrustes, pruning×zstd, block-level quantization outliers) Mar 31, 2026
@mrdavtan
Author

@valerio-oai This is a research contribution — 33 experiments documenting compression dead ends (Procrustes, pruning, codebooks) so others don't repeat them. The cross-seed finding (#3) is novel: layers share identical rotational structure across different random seeds, which may be relevant to the "learning adapters on random linear maps" wishlist item. Happy to answer questions.

