
Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) 14.7MB in decimal #215

Open
JayCheng113 wants to merge 1 commit into openai:main from JayCheng113:submission/11l-lowrank-q192

Conversation


@JayCheng113 JayCheng113 commented Mar 20, 2026

Results

| Seed | Steps | step_avg | val_bpb (sliding) | Artifact Size (bytes) |
|------|-------|----------|-------------------|-----------------------|
| 1337 | 7732  | 77.6ms   | 1.1548            | 14,747,273            |
| 42   | 7821  | 77.1ms   | 1.1552            | 14,939,593            |
| 113  | 7597  | 79.0ms   | 1.1575            | 14,676,072            |
| Mean |       |          | 1.1558            |                       |

All runs use a clean compile cache and zstd-22 compression.

vs official record (SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit, val_bpb=1.1748):

  • Improvement: -0.0190 bpb / -0.031 nats
  • One-sided t-test: t≈22, df=2, p < 0.001
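The reported t-statistic can be reproduced directly from the three per-seed values in the table above (a quick sanity check, not part of the training code):

```python
import statistics, math

seeds = [1.1548, 1.1552, 1.1575]   # per-seed val_bpb from the results table
record = 1.1748                     # official record val_bpb

mean = statistics.mean(seeds)                        # 1.1558
sd = statistics.stdev(seeds)                         # sample std, df = 2
t = (record - mean) / (sd / math.sqrt(len(seeds)))   # one-sample t-statistic
print(round(t, 1))  # ~22, matching the reported t
```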

Key Techniques

  1. Low-rank Q factorization (r=192): Standard Q projection uses a full 512×512 matrix (262K params/layer), but weight analysis revealed Q matrices have extreme condition numbers (100M+ for the factored product) — meaning Q naturally operates in a low-dimensional subspace. I factor Q into c_q_down(512→192) + c_q_up(192→512) (196K params/layer, saving 25%). The trained model uses only 89-114 effective dimensions out of 192 (46-59% utilization), confirming 192 is sufficient. Key benefits: (a) step time drops from 108ms to 77ms (−22%), giving 28% more training steps in 10 minutes; (b) the factored low-rank structure compresses better under int6 per-row quantization. Importantly, K (cond 19-29), V (cond 5-8), and O (cond 531-2620) are all full-rank — Q is the only matrix where this works.

  2. 11 transformer layers with encoder-decoder skip connections (5 encoder + 6 decoder), using parameter savings from low-rank Q to add 2 extra layers within the 16MB budget. (I am considering adding one more layer.)

  3. Int6 per-row quantization + zstd-22 compression for MLP and attention weights. FP16 passthrough for tied embeddings and last 2 layers' key weights (Late-K).

  4. Sliding window evaluation (stride=64) for final score.
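The low-rank Q factorization in technique 1 reduces to two chained linear maps. A minimal sketch, assuming the `c_q_down`/`c_q_up` names from the description (the class name `LowRankQ` and the use of plain `nn.Linear` are illustrative, not the PR's actual code):

```python
import torch
import torch.nn as nn

class LowRankQ(nn.Module):
    """Factored Q projection: 512 -> 192 -> 512 instead of a full 512x512."""
    def __init__(self, dim=512, rank=192):
        super().__init__()
        self.c_q_down = nn.Linear(dim, rank, bias=False)  # 512 -> 192
        self.c_q_up = nn.Linear(rank, dim, bias=False)    # 192 -> 512

    def forward(self, x):
        # rank-192 bottleneck: the product c_q_up @ c_q_down has rank <= 192
        return self.c_q_up(self.c_q_down(x))

# 2 * 512 * 192 = 196,608 params vs 512 * 512 = 262,144 (~25% saved per layer)
```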
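Technique 3's int6 per-row quantization can be sketched as a per-row absmax scale mapped onto the signed range [-31, 31] (function names and the int8 storage container are illustrative assumptions; the PR's actual packing and zstd step are omitted):

```python
import torch

def quantize_int6_per_row(w):
    # one scale per row: absmax mapped to the 6-bit signed range [-31, 31]
    scale = w.abs().amax(dim=1, keepdim=True) / 31.0
    q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)  # int8 used as container
    return q, scale

def dequantize(q, scale):
    # reconstruct approximate weights; error is bounded by scale / 2 per entry
    return q.float() * scale
```

Low-rank factors tend to have more uniform row norms than a full 512x512 matrix, which is consistent with the claim that the factored Q compresses better under this scheme.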

Reproducibility

rm -rf ~/.cache/torch/inductor_cache/
NUM_LAYERS=11 WEIGHT_DECAY=0.038 SEED=1337 torchrun --nproc_per_node=8 train_gpt.py

All 3 seeds ran with clean compile cache. Step counts: 7732 / 7821 / 7597 (2.9% spread).

Test plan

  • 3-seed validation with p < 0.001
  • All artifacts under 16,000,000 bytes
  • Clean compile cache for each seed
  • Sliding window eval (stride=64) for final score
  • train_gpt.py runs independently with zstandard auto-install

Experiments That Didn't Work

The following ideas were explored through weight analysis and experiments but did not improve val_bpb:

1. Legendre resid_mix initialization

After training, the resid_mix parameters (mix0 and mix1) show a clear depth-dependent pattern: mix0 has a Block-0 outlier with a U-shape for deeper layers, and mix1 follows a strong U-shape with negative values in the middle (embedding subtraction). These patterns fit well to 4th-order Legendre polynomials. I tried initializing resid_mix with integer Legendre coefficients at half-scale (interpolating between standard init and Legendre target) to give the optimizer a warm start. However, experiments showed no measurable improvement — Adam converges the scalar resid_mix parameters to their targets within ~200 steps regardless of initialization. The Legendre shape is correct but the optimizer doesn't need help finding it.
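The warm-start idea above can be sketched with NumPy's Legendre utilities. A minimal illustration, assuming a depth axis mapped to [-1, 1] and placeholder integer coefficients (the actual fitted coefficients are not given in the text):

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_warm_start(n_layers=11, coeffs=(1, 0, -2, 0, 1), base=1.0, alpha=0.5):
    """Interpolate a resid_mix init toward a 4th-order Legendre depth profile.
    coeffs are hypothetical integer Legendre coefficients; alpha=0.5 is the
    half-scale interpolation between standard init (base) and the target."""
    x = np.linspace(-1.0, 1.0, n_layers)       # layer depth mapped to [-1, 1]
    target = L.legval(x, list(coeffs))         # sum_i coeffs[i] * P_i(depth)
    return (1 - alpha) * base + alpha * target
```

As the experiment found, Adam recovers the same shape from a standard init within ~200 steps, so this warm start buys nothing in practice.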

2. Content-Dependent Pre-Rotation on FFN

I proposed a content-dependent 2D rotation before each MLP's first linear layer (fully-connected, 512→1536): a small projection (angle_proj, 512→32, zero-init) computes 32 rotation angles from the input, then rotates 32 pairs of input dimensions before feeding into the fully-connected layer. This adds SwiGLU-like content-dependent feature mixing at only 1% parameter cost (16K params per layer) and without sacrificing MLP width — unlike SwiGLU which requires a third gate matrix (50% more MLP params, forcing MLP 3x down to 2x in a 16MB budget).

Concretely, for each rotated pair $(u_i, v_i)$, the model predicts an input-dependent angle $\theta_i(x)$ and applies

$$ u_i' = u_i \cos \theta_i(x) + v_i \sin \theta_i(x), \qquad v_i' = -u_i \sin \theta_i(x) + v_i \cos \theta_i(x). $$

So each output coordinate is a sum of original features multiplied by input-dependent coefficients $\cos \theta_i(x)$ and $\sin \theta_i(x)$. In that sense, the module implements the same core mechanism as SwiGLU-style gating: content-dependent multiplicative feature mixing. The difference is that the gating is heavily structured: instead of learning a full gate branch over the expanded MLP width, it learns only 32 scalar gates, each shared by one 2D subspace. This makes it a much more constrained but much cheaper gated MLP variant.

Experiments confirmed the rotation is genuinely useful: the model learned large rotation angles (~146°, not small perturbations), used near-full effective rank (29-30/32 pairs independently active), and achieved higher per-step quality in later training compared to the baseline. Further analysis showed Block 0 learned the strongest rotation (row_norm=2.55, focusing on raw embedding dimensions 359, 156, 423...) while deeper layers learned weaker rotation (row_norm ~1.1, sharing contextual dimensions 4, 12, 23...), suggesting Block 0 benefits most. However, torch.compile generates separate kernels for the rotation operations (cos/sin + concatenation), adding ~9 seconds of fixed compilation overhead — theoretically ~2ms of compute inflated to 9ms by graph-level inefficiency. I also tested a Block-0-only variant to reduce overhead, but the fixed compilation cost remained. In a 600-second budget, this overhead costs ~100 training steps, which negated the per-step quality gain.

This remains a promising direction: content-dependent rotation provides norm-preserving, information-lossless feature mixing (det(R)=1) as a near-free alternative to gating mechanisms in parameter-constrained settings. The bottleneck is purely at the operator compilation level, not the method itself.

3. Depth-attention residual (AttnRes) architecture

Inspired by Moonshot AI's Attention Residuals, I explored replacing the standard residual stream with a depth-attention mechanism: each layer's input is emb + depth_attn(δ₀..δ_{i-1}) where depth_attn uses learned position bias (Legendre polynomials) and content-based routing over all previous layers' outputs. The motivation was (a) selective delta combination for better gradient flow, and (b) quantization error suppression (softmax weights sum to 1, reducing error accumulation).

However, attention residual turns out to be counterproductive in small, dense models like this one. In Kimi-K2's MoE setting, attention residual helps route across sparsely-activated experts. In our dense 512-dim model, it actually suppresses the residual stream: softmax constrains routing weights to be non-negative and sum-to-1, but weight analysis of the baseline's resid_mix revealed the optimal depth routing requires negative weights (middle layers subtract embedding with mix1 ≈ -4) and non-normalized weights — patterns that softmax fundamentally cannot express. Additionally, unnormalized Values in depth attention caused block polarization where only 3 of 9 blocks remained active (67% of parameters wasted). The simple resid_mix mechanism (h = mix0 * x + mix1 * x0 with unconstrained per-dim scalars) is strictly more expressive and naturally achieves the same quantization error reduction (Σ A_i² = 2.20, 24% of standard residual) without any architectural overhead.
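The resid_mix mechanism referenced above is tiny; a sketch from the formula in the text (h = mix0 * x + mix1 * x0, with unconstrained per-dimension scalars; class name and init values are illustrative):

```python
import torch
import torch.nn as nn

class ResidMix(nn.Module):
    """Unconstrained residual mixing: h = mix0 * x + mix1 * x0.
    Unlike softmax routing, mix1 can go negative (e.g. ~ -4 in middle
    layers, subtracting the embedding) and weights need not sum to 1."""
    def __init__(self, dim=512):
        super().__init__()
        self.mix0 = nn.Parameter(torch.ones(dim))   # weight on layer input x
        self.mix1 = nn.Parameter(torch.zeros(dim))  # weight on embedding x0

    def forward(self, x, x0):
        return self.mix0 * x + self.mix1 * x0
```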

Future Directions

Lower-rank Q (r=128 or adaptive per-layer rank): r=128 showed 94-97% energy capture but crossed the quality threshold. An adaptive scheme — wider rank in deep layers (where attn_scale peaks) and narrower in shallow layers — could push further.

Better compilation for Pre-Rotation: The content-dependent rotation achieved higher per-step quality but lost to compilation overhead. The core implementation is minimal:

class MLP(nn.Module):
    def __init__(self, dim, mlp_mult, n_rot_pairs=32):
        ...
        self.n_rot_pairs = n_rot_pairs
        self.angle_proj = CastedLinear(dim, n_rot_pairs, bias=False)  # zero-init

    def forward(self, x):
        r = self.n_rot_pairs
        angles = self.angle_proj(x)            # [B, T, 32] content-dependent angles
        cos_a, sin_a = angles.cos(), angles.sin()
        x1, x2 = x[..., :r], x[..., r:2*r]
        x = torch.cat([
            x1 * cos_a + x2 * sin_a,           # first half of each rotated pair
            -x1 * sin_a + x2 * cos_a,          # second half of each rotated pair
            x[..., 2*r:]                       # unchanged dims
        ], dim=-1)
        return self.proj(torch.relu(self.fc(x)).square())

A custom Triton kernel fusing angle_proj -> cos/sin -> rotation -> fc into a single pass, or improvements in torch.compile's handling of trigonometric operations within compiled graphs, would eliminate the 9-second overhead and make this technique viable. The method provides SwiGLU-like content-dependent feature mixing at 1% parameter cost with zero information loss (det(R)=1), making it particularly suited for parameter-constrained or inference-optimized settings.

11-layer transformer with factored Q projection (rank 192), reducing
step time by 22% and enabling 28% more training steps within the
10-minute budget. Weight analysis revealed Q matrices have extreme
condition numbers while K/V/O remain full-rank, confirming Q is the
only viable target for low-rank factorization.

3-seed validation: 1.1548 / 1.1552 / 1.1575 (mean 1.1558)
vs official record (1.1748): -0.019 bpb, -0.031 nats, p < 0.001

SkywardSyntax pushed a commit to SkywardSyntax/parameter-golf that referenced this pull request Mar 20, 2026

PR openai#215 found Q projection matrices have condition numbers >100M,
meaning Q naturally operates in a low-rank subspace. Factoring
Q as W_down(dim→r) @ W_up(r→dim) with r=192 saves 2.6% params
and ~22% step time on H100, yielding ~28% more training steps.

Enable with Q_RANK=192 (default 0 = full rank, no change).
K, V, O projections remain full rank (they ARE full rank).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ClassicLarry

This is the most impressive PR I have seen on this repo yet. Insight into Q is great. If I had to bet on a participant to come up with something novel here, I would bet on you!

@JayCheng113 JayCheng113 changed the title Record: 11L Low-Rank Q192 (val_bpb=1.1548) Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) Mar 22, 2026
@JayCheng113 JayCheng113 changed the title Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) Non-Record: 11L Low-Rank on Q192 (val_bpb=1.1548) 14.7MB in decimal Mar 22, 2026