Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) 14.7MB in decimal #215
Open
JayCheng113 wants to merge 1 commit into openai:main from
Conversation
11-layer transformer with factored Q projection (rank 192), reducing step time by 22% and enabling 28% more training steps within the 10-minute budget. Weight analysis revealed Q matrices have extreme condition numbers while K/V/O remain full-rank, confirming Q is the only viable target for low-rank factorization. 3-seed validation: 1.1548 / 1.1552 / 1.1575 (mean 1.1558) vs official record (1.1748): -0.019 bpb, -0.031 nats, p < 0.001
SkywardSyntax pushed a commit to SkywardSyntax/parameter-golf that referenced this pull request on Mar 20, 2026:
PR openai#215 found Q projection matrices have condition numbers >100M, meaning Q naturally operates in a low-rank subspace. Factoring Q as W_down(dim→r) @ W_up(r→dim) with r=192 saves 2.6% params and ~22% step time on H100, yielding ~28% more training steps. Enable with Q_RANK=192 (default 0 = full rank, no change). K, V, O projections remain full rank (they ARE full rank). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is the most impressive PR I have seen on this repo yet. The insight into Q is great. If I had to bet on a participant to come up with something novel here, I would bet on you!
Results
All runs use clean compile cache, zstd-22 compression.
vs official record (SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit, val_bpb=1.1748): 3-seed validation 1.1548 / 1.1552 / 1.1575 (mean 1.1558), i.e. -0.019 bpb (-0.031 nats), p < 0.001.
Key Techniques
Low-rank Q factorization (r=192): The standard Q projection uses a full 512×512 matrix (262K params/layer), but weight analysis revealed Q matrices have extreme condition numbers (100M+ for the factored product), meaning Q naturally operates in a low-dimensional subspace. I factor Q into c_q_down (512→192) + c_q_up (192→512), i.e. 196K params/layer, a 25% saving. The trained model uses only 89-114 effective dimensions out of 192 (46-59% utilization), confirming 192 is sufficient. Key benefits: (a) step time drops from 108ms to 77ms (-22%), giving 28% more training steps in 10 minutes; (b) the factored low-rank structure compresses better under int6 per-row quantization. Importantly, K (cond 19-29), V (cond 5-8), and O (cond 531-2620) are all full-rank; Q is the only matrix where this works.

11 transformer layers with encoder-decoder skip connections (5 encoder + 6 decoder), using the parameter savings from low-rank Q to add 2 extra layers within the 16MB budget. (I am considering adding one more layer.)
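The claim that Q "naturally operates in a low-dimensional subspace" is easy to illustrate with a synthetic check (a sketch, not the PR's actual weight-analysis code): a 512×512 matrix whose true rank is roughly 192 shows an enormous condition number, while its top 192 singular values carry essentially all of the spectral energy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "Q-like" weight: a rank-192 product plus tiny full-rank noise.
W_q = rng.standard_normal((512, 192)) @ rng.standard_normal((192, 512))
W_q += 1e-6 * rng.standard_normal((512, 512))

s = np.linalg.svd(W_q, compute_uv=False)   # singular values, descending
cond = s[0] / s[-1]                        # condition number
energy = s[:192].sum() / s.sum()           # spectral energy in the top 192 dims

print(f"cond ~ {cond:.1e}, top-192 energy = {energy:.4f}")
```

The same SVD readout on K/V/O weights with moderate condition numbers would show energy spread across all directions, which is the signal the PR used to target Q only.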
Int6 per-row quantization + zstd-22 compression for MLP and attention weights. FP16 passthrough for tied embeddings and last 2 layers' key weights (Late-K).
Sliding window evaluation (stride=64) for final score.
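For reference, a minimal sketch of the int6 per-row quantization mentioned above (symmetric per-row scales and the rounding rule are assumptions; the PR's exact scheme may differ in detail):

```python
import numpy as np

def quantize_int6_per_row(W):
    """Symmetric per-row int6 quantization: each row keeps its own fp scale,
    and values are rounded to integers in [-31, 31] (signed 6-bit range)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    q = np.round(W / scale).astype(np.int8)   # 6-bit codes stored in int8
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((192, 512)).astype(np.float32)
q, scale = quantize_int6_per_row(W)
W_hat = dequantize(q, scale)
# Codes stay in the 6-bit range; per-element error is at most half a step per row.
print(q.min(), q.max(), bool((np.abs(W_hat - W) <= scale / 2 + 1e-6).all()))
```

The low-rank factors compress well under this scheme because each 512×192 factor has fewer, better-conditioned rows than the full 512×512 matrix; the int6 payload is then what zstd-22 compresses.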
Reproducibility
```
rm -rf ~/.cache/torch/inductor_cache/
NUM_LAYERS=11 WEIGHT_DECAY=0.038 SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```

All 3 seeds ran with clean compile cache. Step counts: 7732 / 7821 / 7597 (2.9% spread).
Test plan
zstandard auto-install

Experiments That Didn't Work
The following ideas were explored through weight analysis and experiments but did not improve val_bpb:
1. Legendre resid_mix initialization
After training, the resid_mix parameters (mix0 and mix1) show a clear depth-dependent pattern: mix0 has a Block-0 outlier with a U-shape for deeper layers, and mix1 follows a strong U-shape with negative values in the middle (embedding subtraction). These patterns fit well to 4th-order Legendre polynomials. I tried initializing resid_mix with integer Legendre coefficients at half-scale (interpolating between standard init and the Legendre target) to give the optimizer a warm start. However, experiments showed no measurable improvement: Adam converges the scalar resid_mix parameters to their targets within ~200 steps regardless of initialization. The Legendre shape is correct, but the optimizer doesn't need help finding it.

2. Content-Dependent Pre-Rotation on FFN
I proposed a content-dependent 2D rotation before each MLP's first linear layer (fully connected, 512→1536): a small projection (angle_proj, 512→32, zero-init) computes 32 rotation angles from the input, then rotates 32 pairs of input dimensions before feeding into the fully connected layer. This adds SwiGLU-like content-dependent feature mixing at only 1% parameter cost (16K params per layer) and without sacrificing MLP width, unlike SwiGLU, which requires a third gate matrix (50% more MLP params, forcing MLP 3x down to 2x in a 16MB budget).

Concretely, for each rotated pair $(u_i, v_i)$, the model predicts an input-dependent angle $\theta_i(x)$ and applies

$$\begin{pmatrix} u_i' \\ v_i' \end{pmatrix} = \begin{pmatrix} \cos\theta_i(x) & -\sin\theta_i(x) \\ \sin\theta_i(x) & \cos\theta_i(x) \end{pmatrix} \begin{pmatrix} u_i \\ v_i \end{pmatrix}$$

So each output coordinate is a sum of original features multiplied by input-dependent coefficients $\cos\theta_i(x)$ and $\sin\theta_i(x)$. In that sense, the module implements the same core mechanism as SwiGLU-style gating: content-dependent multiplicative feature mixing. The difference is that the gating is heavily structured: instead of learning a full gate branch over the expanded MLP width, it learns only 32 scalar gates, each shared by one 2D subspace. This makes it a much more constrained but much cheaper gated MLP variant.
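The mechanism can be sketched in a few lines (a hypothetical sketch; which dimensions are paired, here the first 64, is an assumption beyond what the PR states). Two properties fall out immediately: zero-initializing angle_proj makes the module the identity at init, and the rotation preserves the input norm for any learned angles.

```python
import numpy as np

def pre_rotation(x, W_angle, n_pairs=32):
    """Content-dependent 2D pre-rotation (sketch). x: (batch, 512),
    W_angle: (512, n_pairs). Rotates n_pairs 2D subspaces of x by
    input-dependent angles theta = x @ W_angle."""
    theta = x @ W_angle
    c, s = np.cos(theta), np.sin(theta)
    u, v = x[:, :n_pairs], x[:, n_pairs:2 * n_pairs]  # assumed pairing of dims
    out = x.copy()
    out[:, :n_pairs] = c * u - s * v
    out[:, n_pairs:2 * n_pairs] = s * u + c * v
    return out

x = np.random.default_rng(0).standard_normal((4, 512))

# Zero init (as in the PR) -> theta = 0 -> exact identity at the start of training.
assert np.allclose(pre_rotation(x, np.zeros((512, 32))), x)

# With arbitrary learned angles, each 2D rotation preserves u^2 + v^2 (det R = 1),
# so the full input norm is preserved: mixing without information loss.
W = 0.1 * np.random.default_rng(1).standard_normal((512, 32))
out = pre_rotation(x, W)
assert np.allclose(np.linalg.norm(out, axis=1), np.linalg.norm(x, axis=1))
```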
Experiments confirmed the rotation is genuinely useful: the model learned large rotation angles (~146°, not small perturbations), used near-full effective rank (29-30/32 pairs independently active), and achieved higher per-step quality in later training compared to the baseline. Further analysis showed Block 0 learned the strongest rotation (row_norm=2.55, focusing on raw embedding dimensions 359, 156, 423...) while deeper layers learned weaker rotations (row_norm ~1.1, sharing contextual dimensions 4, 12, 23...), suggesting Block 0 benefits most.

However, torch.compile generates separate kernels for the rotation operations (cos/sin + concatenation), adding ~9 seconds of fixed compilation overhead: theoretically ~2ms of compute inflated to 9ms by graph-level inefficiency. I also tested a Block-0-only variant to reduce overhead, but the fixed compilation cost remained. In a 600-second budget, this overhead costs ~100 training steps, which negated the per-step quality gain.

This remains a promising direction: content-dependent rotation provides norm-preserving, information-lossless feature mixing (det(R)=1) as a near-free alternative to gating mechanisms in parameter-constrained settings. The bottleneck is purely at the operator compilation level, not the method itself.
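The "~100 training steps" figure follows directly from the quoted step time (a back-of-envelope sketch; it assumes the rest of the budget is spent stepping): 9s of fixed overhead at ~77ms/step is on the order of 100 steps, and 600s at 77ms/step also lands near the 7597-7821 step counts reported in the Reproducibility section.

```python
# Back-of-envelope check of the budget arithmetic (numbers quoted in this PR).
budget_s = 600            # 10-minute training budget
step_s = 0.077            # ~77 ms/step with low-rank Q
compile_overhead_s = 9    # fixed torch.compile overhead from the rotation kernels

steps_lost = compile_overhead_s / step_s   # steps forfeited to compilation
total_steps = budget_s / step_s            # upper bound on steps in the budget

print(round(steps_lost), round(total_steps))
```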
3. Depth-attention residual (AttnRes) architecture
Inspired by Moonshot AI's Attention Residuals, I explored replacing the standard residual stream with a depth-attention mechanism: each layer's input is emb + depth_attn(δ₀..δ_{i-1}), where depth_attn uses a learned position bias (Legendre polynomials) and content-based routing over all previous layers' outputs. The motivation was (a) selective delta combination for better gradient flow, and (b) quantization error suppression (softmax weights sum to 1, reducing error accumulation).

However, attention residual turns out to be counterproductive in small, dense models like this one. In Kimi-K2's MoE setting, attention residual helps route across sparsely-activated experts. In our dense 512-dim model, it actually suppresses the residual stream: softmax constrains routing weights to be non-negative and sum-to-1, but weight analysis of the baseline's resid_mix revealed that optimal depth routing requires negative weights (middle layers subtract the embedding with mix1 ≈ -4) and non-normalized weights, patterns that softmax fundamentally cannot express. Additionally, unnormalized Values in depth attention caused block polarization where only 3 of 9 blocks remained active (67% of parameters wasted). The simple resid_mix mechanism (h = mix0 * x + mix1 * x0 with unconstrained per-dim scalars) is strictly more expressive and naturally achieves the same quantization error reduction (Σ A_i² = 2.20, 24% of standard residual) without any architectural overhead.

Future Directions
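The expressivity gap is easy to see numerically (a toy sketch with reduced dimensions): resid_mix with mix1 ≈ -4 subtracts the embedding stream, while any softmax-weighted combination of x and x0 is convex (non-negative weights summing to 1) and therefore trapped between the two inputs.

```python
import numpy as np

x0 = np.ones(8)            # embedding stream
x = 2.0 * np.ones(8)       # current hidden state

# resid_mix: unconstrained scalars (mix1 ~ -4 observed in middle layers).
mix0, mix1 = 1.0, -4.0
h = mix0 * x + mix1 * x0   # = -2 everywhere: the embedding is subtracted

# Softmax routing: weights (w, 1-w) with w in [0, 1] give a convex combination,
# so per-dim outputs are confined to [min(x0, x), max(x0, x)] = [1, 2].
vals = [float((w * x + (1 - w) * x0)[0]) for w in np.linspace(0, 1, 101)]

print(h[0], min(vals))     # -2.0 vs 1.0: softmax routing cannot go below 1.0
```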
Lower-rank Q (r=128 or adaptive per-layer rank): r=128 showed 94-97% energy capture but crossed the quality threshold. An adaptive scheme — wider rank in deep layers (where attn_scale peaks) and narrower in shallow layers — could push further.
Better compilation for Pre-Rotation: The content-dependent rotation achieved higher per-step quality but lost to compilation overhead; the core implementation is minimal. A custom Triton kernel fusing angle_proj -> cos/sin -> rotation -> fc into a single pass, or improvements in torch.compile's handling of trigonometric operations within compiled graphs, would eliminate the 9-second overhead and make this technique viable. The method provides SwiGLU-like content-dependent feature mixing at 1% parameter cost with zero information loss (det(R)=1), making it particularly suited for parameter-constrained or inference-optimized settings.