Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) 14.7MB in decimal #215
Open
JayCheng113 wants to merge 1 commit into openai:main from
Conversation
11-layer transformer with factored Q projection (rank 192), reducing step time by 22% and enabling 28% more training steps within the 10-minute budget. Weight analysis revealed Q matrices have extreme condition numbers while K/V/O remain full-rank, confirming Q is the only viable target for low-rank factorization. 3-seed validation: 1.1548 / 1.1552 / 1.1575 (mean 1.1558) vs official record (1.1748): -0.019 bpb, -0.031 nats, p < 0.001
SkywardSyntax pushed a commit to SkywardSyntax/parameter-golf that referenced this pull request on Mar 20, 2026:
PR openai#215 found Q projection matrices have condition numbers >100M, meaning Q naturally operates in a low-rank subspace. Factoring Q as W_down(dim→r) @ W_up(r→dim) with r=192 saves 2.6% params and ~22% step time on H100, yielding ~28% more training steps. Enable with Q_RANK=192 (default 0 = full rank, no change). K, V, O projections remain full rank (they ARE full rank). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is the most impressive PR I have seen on this repo yet. The insight into Q is great. If I had to bet on a participant to come up with something novel here, I would bet on you!
Results
All runs use clean compile cache, zstd-22 compression.
vs official record (SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit, val_bpb=1.1748): 3-seed validation 1.1548 / 1.1552 / 1.1575 (mean 1.1558), i.e. -0.019 bpb (-0.031 nats), p < 0.001.
Key Techniques
Low-rank Q factorization (r=192): The standard Q projection uses a full 512×512 matrix (262K params/layer), but weight analysis revealed Q matrices have extreme condition numbers (100M+ for the factored product), meaning Q naturally operates in a low-dimensional subspace. I factor Q into c_q_down (512→192) + c_q_up (192→512), i.e. 196K params/layer, a 25% saving. The trained model uses only 89-114 effective dimensions out of 192 (46-59% utilization), confirming 192 is sufficient. Key benefits: (a) step time drops from 108ms to 77ms (-22%), giving 28% more training steps in 10 minutes; (b) the factored low-rank structure compresses better under int6 per-row quantization. Importantly, K (cond 19-29), V (cond 5-8), and O (cond 531-2620) are all full-rank; Q is the only matrix where this works.

11 transformer layers with encoder-decoder skip connections (5 encoder + 6 decoder), using the parameter savings from low-rank Q to add 2 extra layers within the 16MB budget. (I am considering adding one more layer.)
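The claim that Q "naturally operates in a low-dimensional subspace" is easy to illustrate with a synthetic check (a sketch, not the PR's actual weight-analysis code): a 512×512 matrix whose true rank is roughly 192 shows an enormous condition number, while its top 192 singular values carry essentially all of the spectral energy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "Q-like" weight: a rank-192 product plus tiny full-rank noise.
W_q = rng.standard_normal((512, 192)) @ rng.standard_normal((192, 512))
W_q += 1e-6 * rng.standard_normal((512, 512))

s = np.linalg.svd(W_q, compute_uv=False)   # singular values, descending
cond = s[0] / s[-1]                        # condition number
energy = s[:192].sum() / s.sum()           # spectral energy in the top 192 dims

print(f"cond ~ {cond:.1e}, top-192 energy = {energy:.4f}")
```

The same SVD readout on K/V/O weights with moderate condition numbers would show energy spread across all directions, which is the signal the PR used to target Q only.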
Int6 per-row quantization + zstd-22 compression for MLP and attention weights. FP16 passthrough for tied embeddings and last 2 layers' key weights (Late-K).
Sliding window evaluation (stride=64) for final score.
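For reference, a minimal sketch of the int6 per-row quantization mentioned above (symmetric per-row scales and the rounding rule are assumptions; the PR's exact scheme may differ in detail):

```python
import numpy as np

def quantize_int6_per_row(W):
    """Symmetric per-row int6 quantization: each row keeps its own fp scale,
    and values are rounded to integers in [-31, 31] (signed 6-bit range)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    q = np.round(W / scale).astype(np.int8)   # 6-bit codes stored in int8
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((192, 512)).astype(np.float32)
q, scale = quantize_int6_per_row(W)
W_hat = dequantize(q, scale)
# Codes stay in the 6-bit range; per-element error is at most half a step per row.
print(q.min(), q.max(), bool((np.abs(W_hat - W) <= scale / 2 + 1e-6).all()))
```

The low-rank factors compress well under this scheme because each 512×192 factor has fewer, better-conditioned rows than the full 512×512 matrix; the int6 payload is then what zstd-22 compresses.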
Reproducibility
```
rm -rf ~/.cache/torch/inductor_cache/
NUM_LAYERS=11 WEIGHT_DECAY=0.038 SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```

All 3 seeds ran with clean compile cache. Step counts: 7732 / 7821 / 7597 (2.9% spread).
Test plan
zstandard auto-install

Experiments That Didn't Work
The following ideas were explored through weight analysis and experiments but did not improve val_bpb:
1. Legendre resid_mix initialization
After training, the resid_mix parameters (mix0 and mix1) show a clear depth-dependent pattern: mix0 has a Block-0 outlier with a U-shape for deeper layers, and mix1 follows a strong U-shape with negative values in the middle (embedding subtraction). These patterns fit well to 4th-order Legendre polynomials. I tried initializing resid_mix with integer Legendre coefficients at half-scale (interpolating between standard init and the Legendre target) to give the optimizer a warm start. However, experiments showed no measurable improvement: Adam converges the scalar resid_mix parameters to their targets within ~200 steps regardless of initialization. The Legendre shape is correct, but the optimizer doesn't need help finding it.

2. Content-Dependent Pre-Rotation on FFN
I proposed a content-dependent 2D rotation before each MLP's first linear layer (fully connected, 512→1536): a small projection (angle_proj, 512→32, zero-init) computes 32 rotation angles from the input, then rotates 32 pairs of input dimensions before feeding into the fully connected layer. This adds SwiGLU-like content-dependent feature mixing at only 1% parameter cost (16K params per layer) and without sacrificing MLP width, unlike SwiGLU, which requires a third gate matrix (50% more MLP params, forcing MLP 3x down to 2x in a 16MB budget).

Concretely, for each rotated pair $(u_i, v_i)$, the model predicts an input-dependent angle $\theta_i(x)$ and applies

$$\begin{pmatrix} u_i' \\ v_i' \end{pmatrix} = \begin{pmatrix} \cos\theta_i(x) & -\sin\theta_i(x) \\ \sin\theta_i(x) & \cos\theta_i(x) \end{pmatrix} \begin{pmatrix} u_i \\ v_i \end{pmatrix}$$

So each output coordinate is a sum of original features multiplied by input-dependent coefficients $\cos\theta_i(x)$ and $\sin\theta_i(x)$. In that sense, the module implements the same core mechanism as SwiGLU-style gating: content-dependent multiplicative feature mixing. The difference is that the gating is heavily structured: instead of learning a full gate branch over the expanded MLP width, it learns only 32 scalar gates, each shared by one 2D subspace. This makes it a much more constrained but much cheaper gated MLP variant.
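The mechanism can be sketched in a few lines (a hypothetical sketch; which dimensions are paired, here the first 64, is an assumption beyond what the PR states). Two properties fall out immediately: zero-initializing angle_proj makes the module the identity at init, and the rotation preserves the input norm for any learned angles.

```python
import numpy as np

def pre_rotation(x, W_angle, n_pairs=32):
    """Content-dependent 2D pre-rotation (sketch). x: (batch, 512),
    W_angle: (512, n_pairs). Rotates n_pairs 2D subspaces of x by
    input-dependent angles theta = x @ W_angle."""
    theta = x @ W_angle
    c, s = np.cos(theta), np.sin(theta)
    u, v = x[:, :n_pairs], x[:, n_pairs:2 * n_pairs]  # assumed pairing of dims
    out = x.copy()
    out[:, :n_pairs] = c * u - s * v
    out[:, n_pairs:2 * n_pairs] = s * u + c * v
    return out

x = np.random.default_rng(0).standard_normal((4, 512))

# Zero init (as in the PR) -> theta = 0 -> exact identity at the start of training.
assert np.allclose(pre_rotation(x, np.zeros((512, 32))), x)

# With arbitrary learned angles, each 2D rotation preserves u^2 + v^2 (det R = 1),
# so the full input norm is preserved: mixing without information loss.
W = 0.1 * np.random.default_rng(1).standard_normal((512, 32))
out = pre_rotation(x, W)
assert np.allclose(np.linalg.norm(out, axis=1), np.linalg.norm(x, axis=1))
```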
Experiments confirmed the rotation is genuinely useful: the model learned large rotation angles (~146°, not small perturbations), used near-full effective rank (29-30/32 pairs independently active), and achieved higher per-step quality in later training compared to the baseline. Further analysis showed Block 0 learned the strongest rotation (row_norm=2.55, focusing on raw embedding dimensions 359, 156, 423...) while deeper layers learned weaker rotations (row_norm ~1.1, sharing contextual dimensions 4, 12, 23...), suggesting Block 0 benefits most.

However, torch.compile generates separate kernels for the rotation operations (cos/sin + concatenation), adding ~9 seconds of fixed compilation overhead: theoretically ~2ms of compute inflated to 9ms by graph-level inefficiency. I also tested a Block-0-only variant to reduce overhead, but the fixed compilation cost remained. In a 600-second budget, this overhead costs ~100 training steps, which negated the per-step quality gain.

This remains a promising direction: content-dependent rotation provides norm-preserving, information-lossless feature mixing (det(R)=1) as a near-free alternative to gating mechanisms in parameter-constrained settings. The bottleneck is purely at the operator compilation level, not the method itself.
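The "~100 training steps" figure follows directly from the quoted step time (a back-of-envelope sketch; it assumes the rest of the budget is spent stepping): 9s of fixed overhead at ~77ms/step is on the order of 100 steps, and 600s at 77ms/step also lands near the 7597-7821 step counts reported in the Reproducibility section.

```python
# Back-of-envelope check of the budget arithmetic (numbers quoted in this PR).
budget_s = 600            # 10-minute training budget
step_s = 0.077            # ~77 ms/step with low-rank Q
compile_overhead_s = 9    # fixed torch.compile overhead from the rotation kernels

steps_lost = compile_overhead_s / step_s   # steps forfeited to compilation
total_steps = budget_s / step_s            # upper bound on steps in the budget

print(round(steps_lost), round(total_steps))
```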
3. Depth-attention residual (AttnRes) architecture
Inspired by Moonshot AI's Attention Residuals, I explored replacing the standard residual stream with a depth-attention mechanism: each layer's input is emb + depth_attn(δ₀..δ_{i-1}), where depth_attn uses a learned position bias (Legendre polynomials) and content-based routing over all previous layers' outputs. The motivation was (a) selective delta combination for better gradient flow, and (b) quantization error suppression (softmax weights sum to 1, reducing error accumulation).

However, attention residual turns out to be counterproductive in small, dense models like this one. In Kimi-K2's MoE setting, attention residual helps route across sparsely-activated experts. In our dense 512-dim model, it actually suppresses the residual stream: softmax constrains routing weights to be non-negative and sum-to-1, but weight analysis of the baseline's resid_mix revealed that optimal depth routing requires negative weights (middle layers subtract the embedding with mix1 ≈ -4) and non-normalized weights, patterns that softmax fundamentally cannot express. Additionally, unnormalized Values in depth attention caused block polarization where only 3 of 9 blocks remained active (67% of parameters wasted). The simple resid_mix mechanism (h = mix0 * x + mix1 * x0 with unconstrained per-dim scalars) is strictly more expressive and naturally achieves the same quantization error reduction (Σ A_i² = 2.20, 24% of standard residual) without any architectural overhead.

Future Directions
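The expressivity gap is easy to see numerically (a toy sketch with reduced dimensions): resid_mix with mix1 ≈ -4 subtracts the embedding stream, while any softmax-weighted combination of x and x0 is convex (non-negative weights summing to 1) and therefore trapped between the two inputs.

```python
import numpy as np

x0 = np.ones(8)            # embedding stream
x = 2.0 * np.ones(8)       # current hidden state

# resid_mix: unconstrained scalars (mix1 ~ -4 observed in middle layers).
mix0, mix1 = 1.0, -4.0
h = mix0 * x + mix1 * x0   # = -2 everywhere: the embedding is subtracted

# Softmax routing: weights (w, 1-w) with w in [0, 1] give a convex combination,
# so per-dim outputs are confined to [min(x0, x), max(x0, x)] = [1, 2].
vals = [float((w * x + (1 - w) * x0)[0]) for w in np.linspace(0, 1, 101)]

print(h[0], min(vals))     # -2.0 vs 1.0: softmax routing cannot go below 1.0
```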
Lower-rank Q (r=128 or adaptive per-layer rank): r=128 showed 94-97% energy capture but crossed the quality threshold. An adaptive scheme — wider rank in deep layers (where attn_scale peaks) and narrower in shallow layers — could push further.
Better compilation for Pre-Rotation: The content-dependent rotation achieved higher per-step quality but lost to compilation overhead; the core implementation is minimal. A custom Triton kernel fusing angle_proj -> cos/sin -> rotation -> fc into a single pass, or improvements in torch.compile's handling of trigonometric operations within compiled graphs, would eliminate the 9-second overhead and make this technique viable. The method provides SwiGLU-like content-dependent feature mixing at 1% parameter cost with zero information loss (det(R)=1), making it particularly suited for parameter-constrained or inference-optimized settings.