Non-record: Cross-seed rotational symmetry in transformer weights — 33 checkpoint experiments (Procrustes, pruning×zstd, block-level quantization outliers)#1048
Open
mrdavtan wants to merge 74 commits into openai:main
Conversation
Compares int6, GPTQ-lite, and learned codebook quantization on saved checkpoints. Reports per-tensor MSE, outlier detection, embedding entropy analysis, and compressed size. No GPU needed for analysis.
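A minimal sketch of the kind of per-tensor round-trip MSE measurement this analysis relies on. The symmetric per-tensor int6 scaling and the checkpoint layout (a flat state_dict of fp32 tensors) are assumptions, not the PR's actual scripts.

```python
# Minimal sketch: per-tensor int6 round-trip MSE on a saved checkpoint (CPU only).
# The symmetric per-tensor scaling and state_dict layout are assumptions.
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor quantization to 6-bit integers in [-31, 31]."""
    scale = w.abs().max().clamp(min=1e-12) / 31.0
    q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)
    return q, scale

def per_tensor_quant_mse(state_dict):
    """Relative MSE of an int6 round trip, per tensor."""
    report = {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):
            continue
        q, scale = quantize_int6(w)
        err = q.to(torch.float32) * scale - w
        report[name] = (err.pow(2).mean() / w.pow(2).mean().clamp(min=1e-12)).item()
    return report

if __name__ == "__main__":
    # Toy example; in practice load torch.load("checkpoint.pt") instead.
    sd = {"mlp.proj.weight": torch.randn(256, 256)}
    for k, v in per_tensor_quant_mse(sd).items():
        print(f"{k}: relative MSE {v:.2e}")
```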
… DDP 14.8s/epoch)
Major changes:
- DDP gradient sharding: each GPU processes batch_seqs sequences, with a manual all_reduce on gradients (matches the PR openai#415/openai#417 approach; a minimal sketch follows below)
- Two-phase TTT (TTT_TWO_PHASE=1): Phase 1 is norm-only recalibration (50 epochs of Adam, ~22K params); Phase 2 is selective block adaptation (10 epochs of SGD, last 3 blocks)
- TTT_BATCH_SEQS=64 per GPU (512 total with 8 GPUs)
- Falls back to single-phase SGD if TTT_TWO_PHASE=0
Expected speedup: ~235x (from 1344s/epoch to ~5.7s/epoch)
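A minimal sketch of the manual gradient all_reduce described in the first bullet: each rank computes gradients on its own sequences, gradients are averaged across ranks, then the optimizer steps. The model, loss, and loop plumbing are placeholders, not the PR's TTT code.

```python
# Minimal sketch: manual gradient averaging across data-parallel ranks.
# Assumes torch.distributed is already initialized; model/loss are placeholders.
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module):
    """Average gradients across all data-parallel ranks in place."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

def ttt_step(model, batch, loss_fn, optimizer):
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    allreduce_gradients(model)  # each rank saw different sequences
    optimizer.step()
    return loss.detach()
```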
New defaults: TTT_COSINE=1, TTT_EPOCHS=30, TTT_LR=0.0005. Supports TTT_WARMUP_FRAC and TTT_PERLAYER (3x proj, 0.5x fc). Tested across 26 configs in 3 rounds on a 4080 GPU.
…ttern) Root cause: per-sequence indexing from permuted indices was ~100x slower than contiguous val_tokens slicing. Each GPU now takes a contiguous shard and iterates sequentially, matching openai#398's working implementation.
Local ablation showed h1792 gives -0.056 BPB over h1536 at similar step cost. Fixed unset blocks that were clearing MLP_HIDDEN, BIGRAM_HASH_BUCKETS, and TRAIN_BATCH_TOKENS immediately after they were set (same bug class as Finding openai#24).
Run3 checkpoint (1.1496) shows 27.5M params → 15.66MB artifact at full scale. h1792 would add ~3M params → ~17.6MB, 1.6MB over cap. Compression ratio degrades with longer training (denser weights). Staying at h1536 (default). The local ablation was misleading.
Sigmoid skip gates (PR openai#505): replace additive skip connections with a sigmoid-gated blend, x = gate*x + (1-gate)*scaled_skip, with learned per-dim gates initialized to sigmoid(0)=0.5. SIGMOID_SKIP_GATES=1 (default on).
Decoder 2x LR (PR openai#505): decoder layers (>= num_encoder_layers) get DECODER_LR_MULT=2.0 applied on top of the per-layer LR splits. Combined with the quant-damage LR, decoder proj gets 1.5*2 = 3x base LR.
Both features are env-var controlled and default ON.
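A minimal sketch of the sigmoid-gated skip blend above, with per-dim gates whose logits start at 0 so the initial gate is 0.5. Module and attribute names are illustrative, not PR openai#505's code.

```python
# Minimal sketch: sigmoid-gated skip blend x = gate*x + (1-gate)*scaled_skip.
import torch
import torch.nn as nn

class SigmoidSkipGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Logits start at 0 so the gate starts at an even 0.5/0.5 blend.
        self.gate_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, scaled_skip: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logit)  # (dim,), broadcasts over batch/seq
        return gate * x + (1.0 - gate) * scaled_skip

if __name__ == "__main__":
    blend = SigmoidSkipGate(dim=8)
    x, skip = torch.randn(2, 4, 8), torch.randn(2, 4, 8)
    print(blend(x, skip).shape)  # torch.Size([2, 4, 8])
```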
1. Progressive Layer Freezing: freeze encoder blocks during warmdown (PROGRESSIVE_FREEZE=1) to halve the backward pass and gain ~1700 extra decoder-focused training steps. Includes a Muon weight-decay guard to prevent frozen weights from decaying.
2. Hyper-Connections: learned mixing of all prior layer outputs (HYPER_CONNECTIONS=1, mode=scalar|vector). Generalizes U-Net skips and resid_mix into a single mechanism with negligible param cost.
3. Logit Ensemble: average logits from the EMA and raw checkpoints at eval (LOGIT_ENSEMBLE=1). Eval-only, zero training impact. Quantizes both checkpoints independently for a fair comparison (see the sketch after this list).
All toggleable via env vars, all off by default. New ablation script run_ablation_innovations.sh with experiments F/G/H/I.
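A minimal sketch of the eval-only logit ensemble in item 3: run both independently quantized checkpoints and average their logits. The model-loading helpers are whatever the repo actually uses and are not shown here.

```python
# Minimal sketch: eval-only logit ensemble over EMA and raw checkpoints.
# Both models are assumed to have been quantized/dequantized independently.
import torch

@torch.no_grad()
def ensemble_logits(model_ema, model_raw, inputs: torch.Tensor) -> torch.Tensor:
    """Average logits from two checkpoints at eval time (no training impact)."""
    model_ema.eval()
    model_raw.eval()
    return 0.5 * (model_ema(inputs) + model_raw(inputs))
```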
Stale MLP_HIDDEN=1792 in pod env caused QAT ablation to train with wrong model size (27.8M vs 27.5M params, 108ms vs 70ms step time).
786K at ~108ms/step yields fewer total tokens than 524K at ~70ms/step. Run3 baseline used 524K and achieved 1.1496 BPB — all ablations should use the same batch size for fair comparison.
…ompile Dynamic tensor sizes per layer (2, 3, 4... 12) broke torch.compile graph tracing. Now all alpha tensors are padded to max_slots with -inf masks on unused entries. Pre-allocated output buffer replaces Python list.
Pre-allocated buffer with all_outputs[i+1] = x_in caused version mismatch in backward pass. Now uses list + torch.stack per layer (no in-place ops). Zero-padding + -inf masks still ensure static shapes for torch.compile.
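A minimal sketch of the pattern the last two commits converge on: per-layer outputs are collected in a Python list and stacked (no in-place buffer writes, so autograd version counters stay consistent), while mixing weights are padded to max_slots with -inf masks so torch.compile sees static shapes. Names beyond alpha/max_slots are illustrative.

```python
# Minimal sketch: hyper-connection mixing with static shapes and no in-place ops.
import torch

def mix_layer_inputs(outputs: list, alpha: torch.Tensor, max_slots: int) -> torch.Tensor:
    """Weighted mix of all prior layer outputs for the next layer's input.

    outputs: list of tensors, each (batch, seq, dim); len(outputs) <= max_slots
    alpha:   (max_slots,) raw mixing logits; unused slots are masked with -inf
    """
    n = len(outputs)
    mask = torch.full((max_slots,), float("-inf"))
    mask[:n] = 0.0
    weights = torch.softmax(alpha + mask, dim=0)[:n]   # alpha keeps a static shape
    stacked = torch.stack(outputs, dim=0)              # (n, batch, seq, dim), no in-place writes
    return (weights.view(n, 1, 1, 1) * stacked).sum(dim=0)

if __name__ == "__main__":
    outs = [torch.randn(2, 4, 8) for _ in range(3)]
    alpha = torch.zeros(12)
    print(mix_layer_inputs(outs, alpha, max_slots=12).shape)  # torch.Size([2, 4, 8])
```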
Run6 (1.1635) showed sigmoid gates + decoder 2x LR + bigram 8192 all hurt. Those techniques are from PR openai#505 which has a different architecture (kv8, h1792). They don't transfer to our kv4/h1536 setup. Reverted: SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM_HASH_BUCKETS=4096, GRAD_CLIP_NORM=0.3 Our unique stack remains: VR + GA + Star-ReLU + per-layer lr + GradQuant + TrigramHash
…iques Key finding: PR openai#505 (1.1181) does NOT fit in 16MB — their 8KV+h1792 config produces ~20MB artifacts. Real non-TTT target is openai#445 at 1.1236. Novel technique analysis: DG Attention (differential values), BitNet b1.58 (ternary weights + depth recurrence), arithmetic coding (replaces zstd-22), LeakyReLU(0.5)^2 (-0.003 BPB, zero params).
LeakyReLU(0.5)^2: zero extra params, proven -0.003 BPB vs relu^2; addresses the dead-neuron problem. LEAKY_RELU=1 env var (sketch below).
run_no_ttt_best.sh: run3 base + three free lunches:
- MATRIX_LR=0.03 (PR openai#530, verified -0.005+ BPB)
- LeakyReLU(0.5)^2 (zero params, -0.003 BPB)
- QAT=1 (run5 proved a negative quant gap)
Drops sigmoid gates and decoder 2x LR (run6 showed they hurt). The real target is openai#445 at 1.1236 (not openai#505, which doesn't fit in 16MB).
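A minimal sketch of the LeakyReLU(0.5)^2 activation under the literal reading: square the output of a leaky ReLU with negative slope 0.5, as a drop-in for relu^2. The PR text doesn't specify whether the negative branch keeps its sign after squaring, so this shows the plain squared form.

```python
# Minimal sketch: squared LeakyReLU with negative slope 0.5 (literal reading).
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    # Unlike relu^2, negative inputs still carry gradient (no dead neurons).
    return F.leaky_relu(x, negative_slope) ** 2
```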
Checkpoint analysis reveals GA gates negatively correlate with quant damage (corr=-0.43), learning to dampen high-damage heads. But attn_gate weights are themselves the most fragile tensors (top 5 in the damage ranking). Keeping them in fp32 preserves the correction. Also adds CK_LR_MULT for c_k projections, which have ~2x the relative quant MSE of c_q/c_v. Default 1.0 (no change), set to 1.5 to test.
Was missing SIGMOID_SKIP_GATES=0, DECODER_LR_MULT=1.0, BIGRAM=4096, GRAD_CLIP=0.3. Previous runs on this script had decoder 2x LR and bigram 8192 which caused the same regression as run6.
Late Training Replay (PR openai#445): buffer the last 100 training batches during warmdown (scale < 0.2), then replay them for 2 epochs at 10% LR after training ends. EMA is updated during replay (a critical detail from openai#445). ~50 lines. run_bestshot.sh stacks everything: MATRIX_LR=0.03, fp32 attn_gate, CK_LR_MULT=1.5, Late Replay, VALUE_RESIDUAL=0.
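A minimal sketch of Late Training Replay as described above: remember the last 100 batches seen while the LR scale is below 0.2, then replay them for 2 epochs at 10% of the base LR with the EMA still updating. The buffer, optimizer plumbing, and update_ema helper are illustrative, not the PR's code.

```python
# Minimal sketch: buffer late-training batches and replay them at 10% LR.
from collections import deque
import torch

REPLAY_CAPACITY = 100
replay_buffer = deque(maxlen=REPLAY_CAPACITY)

def maybe_buffer(batch, lr_scale: float):
    """During warmdown (lr_scale < 0.2), remember the most recent batches."""
    if lr_scale < 0.2:
        replay_buffer.append(batch)

def late_replay(model, ema_model, optimizer, loss_fn, base_lr: float,
                update_ema, epochs: int = 2):
    """Replay buffered batches at 10% LR after normal training ends."""
    for group in optimizer.param_groups:
        group["lr"] = 0.1 * base_lr
    for _ in range(epochs):
        for batch in replay_buffer:
            optimizer.zero_grad(set_to_none=True)
            loss = loss_fn(model(batch["inputs"]), batch["targets"])
            loss.backward()
            optimizer.step()
            update_ema(ema_model, model)  # EMA keeps tracking during replay
```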
Key discovery: changes that improve pre-quant quality (MATRIX_LR=0.03, sigmoid gates) make weights quant-fragile. QAT is the antidote. Run11 confirmed MATRIX_LR=0.03 alone has a +0.014 quant gap. LeakyReLU was also falsified for our architecture. All 11 runs are documented with pre/post-quant breakdowns.
@valerio-oai This is a research contribution — 33 experiments documenting compression dead ends (Procrustes, pruning, codebooks) so others don't repeat them. The cross-seed finding (#3) is novel: layers share identical rotational structure across different random seeds, which may be relevant to the "learning adapters on random linear maps" wishlist item. Happy to answer questions.
Transformer layers share identical rotational structure across different random seeds. Procrustes alignment between the same layer trained on seeds 1337 vs 7 gives 90% MSE reduction — despite zero cosine similarity between raw weights. This is architectural, not learned. Compact parameterization of the rotations remains unsolved (Givens angles unexplored).
8 findings from checkpoint analysis across 33 experiments. Follow-up to PR #212 (25 training experiments).
Summary
1. Symmetry-transport compression
Procrustes alignment between layers shows 91-93% MSE reduction on MLP proj — layers genuinely share rotational structure. Full prototype implemented: store 1 prototype + 10 rotation matrices.
Result: 14.3 MB (int6+zstd) → 68.8 MB (transport). Rotation matrices are dense, high-entropy — zstd can't compress them.
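A minimal sketch of the transport idea and why it loses: fit an orthogonal rotation per layer against a prototype with Procrustes, then compare the zstd-compressed size of storing the rotations against storing the weights directly. Tensor shapes, names, and the toy data are placeholders for the PR's scripts; MSE reduction here is measured relative to the unaligned prototype.

```python
# Minimal sketch: prototype + per-layer rotations vs direct storage under zstd.
import numpy as np
import zstandard as zstd
from scipy.linalg import orthogonal_procrustes

def zstd_size(arr: np.ndarray, level: int = 22) -> int:
    return len(zstd.ZstdCompressor(level=level).compress(arr.tobytes()))

def transport_experiment(layer_weights) -> None:
    proto = layer_weights[0]
    rot_bytes, mse_reductions = 0, []
    for w in layer_weights[1:]:
        R, _ = orthogonal_procrustes(proto, w)  # minimizes ||proto @ R - w||_F
        mse_aligned = np.mean((proto @ R - w) ** 2)
        mse_unaligned = np.mean((proto - w) ** 2)
        mse_reductions.append(1.0 - mse_aligned / mse_unaligned)
        rot_bytes += zstd_size(R.astype(np.float32))  # dense, high-entropy: barely shrinks
    direct_bytes = sum(zstd_size(w.astype(np.float32)) for w in layer_weights[1:])
    print(f"mean MSE reduction vs unaligned prototype: {np.mean(mse_reductions):.1%}")
    print(f"rotations: {rot_bytes/1e6:.1f} MB vs direct fp32: {direct_bytes/1e6:.1f} MB")

if __name__ == "__main__":
    rng = np.random.default_rng(0)  # toy weights; real runs load checkpoint tensors
    transport_experiment([rng.standard_normal((512, 512)) for _ in range(4)])
```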
2. Low-rank rotation approximation
Rank-128 captures only 16.6% of the rotation delta. Dead end without a compact rotation parameterization (Givens angles unexplored).
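A minimal sketch of the low-rank check: truncate the SVD of a fitted rotation's delta from identity to rank k and measure the fraction of the delta's energy captured. The rank-128 value follows the finding; the toy rotation is illustrative.

```python
# Minimal sketch: energy of ||R - I||_F^2 captured by a rank-k approximation.
import numpy as np

def rotation_delta_energy_captured(R: np.ndarray, rank: int) -> float:
    delta = R - np.eye(R.shape[0])
    s = np.linalg.svd(delta, compute_uv=False)
    return float(np.sum(s[:rank] ** 2) / np.sum(s ** 2))

if __name__ == "__main__":
    # Toy random rotation via QR; the PR's rotations come from Procrustes fits.
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((512, 512)))
    print(f"rank-128 captures {rotation_delta_energy_captured(Q, 128):.1%} of the delta")
```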
3. Cross-seed Procrustes
Same layer across different seeds (1337 vs 7): 90% MSE reduction after alignment. Weights are completely different values (cosine similarity ~0.000) but the rotational structure is identical.
Implication: This is an architectural property, not a training artifact. A universal rotation-aware quantization scheme could theoretically work across seeds — but storing the rotations remains the bottleneck.
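A minimal sketch of the cross-seed check itself: the same layer from two seeds has near-zero cosine similarity as raw weights, yet an orthogonal Procrustes fit removes most of the MSE between them. Checkpoint loading and tensor names for the seed-1337 and seed-7 runs are omitted.

```python
# Minimal sketch: cosine similarity of raw weights vs MSE reduction after alignment.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def cross_seed_alignment(w_a: np.ndarray, w_b: np.ndarray):
    cos = float(np.dot(w_a.ravel(), w_b.ravel()) /
                (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
    R, _ = orthogonal_procrustes(w_a, w_b)      # minimizes ||w_a @ R - w_b||_F
    mse_before = np.mean((w_a - w_b) ** 2)
    mse_after = np.mean((w_a @ R - w_b) ** 2)
    return cos, 1.0 - mse_after / mse_before    # (cosine sim, MSE reduction)
```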
4. SWA vs EMA artifact smoothness
SWA's poor score (1.2076) was due to a slow pod (5,532 steps vs 8,672) — not SWA itself. Complementary to PR #989's finding that SWA sabotages QAT. SWA-level smoothness at EMA step counts would give small artifacts with headroom for more parameters.
5. Block 7 is the quantization outlier
c_k kurtosis 11.9 vs average ~0.5. Consistently heaviest tails across checkpoints. Current LATE_K_FP16 protects last 2 layers — block 7 is layer 7, unprotected.
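A minimal sketch of the outlier scan behind this finding: compute excess kurtosis per weight tensor and rank tensors by it; heavy-tailed tensors (like block 7's c_k) are the ones clipping-based int quantization hurts most. The state_dict layout is an assumption.

```python
# Minimal sketch: rank checkpoint tensors by excess kurtosis (heavy tails first).
import torch

def excess_kurtosis(w: torch.Tensor) -> float:
    x = w.float().flatten()
    x = (x - x.mean()) / x.std().clamp(min=1e-12)
    return (x.pow(4).mean() - 3.0).item()  # ~0 for a Gaussian

def rank_by_kurtosis(state_dict, top_k: int = 5):
    scores = {n: excess_kurtosis(w) for n, w in state_dict.items()
              if torch.is_floating_point(w) and w.numel() > 1}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```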
6. GPTQ-lite
Per-row optimal clip search (5 quantile ratios, 0.9→0.99999): 0.2% MSE improvement over uniform int6. Weight distributions are well-conditioned. Zero cost, so keep as default, but don't expect gains.
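A minimal sketch of the per-row clip search: for each weight row, try a small grid of quantile-based clip values and keep the one with the lowest int6 round-trip MSE. The ratio grid follows the finding; the symmetric int6 scheme is the same assumption as in the earlier sketch.

```python
# Minimal sketch: per-row optimal clip search over quantile ratios for int6.
import torch

CLIP_RATIOS = (0.9, 0.99, 0.999, 0.9999, 0.99999)

def int6_roundtrip(row: torch.Tensor, clip: float) -> torch.Tensor:
    scale = clip / 31.0
    q = torch.round(row.clamp(-clip, clip) / scale).clamp(-31, 31)
    return q * scale

def best_clip_per_row(w: torch.Tensor) -> torch.Tensor:
    """Return the per-row clip value minimizing int6 round-trip MSE."""
    clips = []
    for row in w:
        abs_row = row.abs()
        best = min(
            (torch.quantile(abs_row, r).clamp(min=1e-12) for r in CLIP_RATIOS),
            key=lambda c: torch.mean((int6_roundtrip(row, c.item()) - row) ** 2),
        )
        clips.append(best)
    return torch.stack(clips)
```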
7. Selective fp16 embedding
Top 10% highest-entropy embedding rows (102/1024 tokens) in fp16, rest in int6. Saves 944KB vs full fp16 (104KB vs 1,048KB). The quality trade-off is minimal — 90% of tokens have well-clustered embeddings that survive int6.
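A minimal sketch of the row split: score each embedding row by the entropy of its normalized absolute values, keep the top 10% in fp16, and send the rest through int6. The entropy definition here is an assumption; the finding only says "highest-entropy embedding rows".

```python
# Minimal sketch: pick the top-entropy 10% of embedding rows to keep in fp16.
import torch

def row_entropy(emb: torch.Tensor) -> torch.Tensor:
    p = emb.abs() / emb.abs().sum(dim=1, keepdim=True).clamp(min=1e-12)
    return -(p * p.clamp(min=1e-12).log()).sum(dim=1)

def split_embedding(emb: torch.Tensor, fp16_frac: float = 0.10):
    """Return (row indices kept in fp16, row indices quantized to int6)."""
    k = max(1, int(round(fp16_frac * emb.shape[0])))
    fp16_rows = torch.topk(row_entropy(emb), k).indices
    mask = torch.zeros(emb.shape[0], dtype=torch.bool)
    mask[fp16_rows] = True
    return fp16_rows, (~mask).nonzero(as_tuple=True)[0]
```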
8. Non-monotonic pruning + zstd
3% magnitude pruning increases artifact by 728KB. 1% and 5% are neutral. Zeroing creates byte patterns that interact badly with zstd-22 at specific sparsity levels. Always measure compressed size, not just reconstruction quality.
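A minimal sketch of the measurement discipline this finding argues for: apply magnitude pruning at a given sparsity, then compare the zstd-22 size of the pruned vs unpruned bytes directly, since the compressed size is what can move non-monotonically. Serialization is simplified (raw fp32 bytes rather than the full int6 pipeline).

```python
# Minimal sketch: measure zstd-22 size after magnitude pruning, not just MSE.
import numpy as np
import zstandard as zstd

def prune_magnitude(w: np.ndarray, sparsity: float) -> np.ndarray:
    thresh = np.quantile(np.abs(w), sparsity)
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out

def zstd22_size(w: np.ndarray) -> int:
    return len(zstd.ZstdCompressor(level=22).compress(w.astype(np.float32).tobytes()))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 1024)).astype(np.float32)
    for s in (0.01, 0.03, 0.05):
        print(f"sparsity {s:.0%}: {zstd22_size(prune_magnitude(w, s))} bytes "
              f"(dense: {zstd22_size(w)} bytes)")
```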
Key takeaway
Int6 + zstd-22 is near-optimal for this architecture. Approaches that improve reconstruction MSE (codebooks, Procrustes) don't improve compressed size — the codec must be considered jointly.
Reproduction
Analysis scripts in experiments/: quant_analysis.py, analyze_quant_gates.py. Builds on PRs #162 (raahilshah), #180 (thwu1), #212 (prior 25 experiments), and modded-nanogpt.