submission: Int6 MLP3x + Late-K Passthrough + SlidingWindow (val_bpb: 1.1605)#99

Open
takhir-iota wants to merge 4 commits into openai:main from takhir-iota:codex/seed2025-top2k-stride64-submission

Conversation


@takhir-iota takhir-iota commented Mar 19, 2026

Int6 MLP3x + Late-K Passthrough + SlidingWindow

Summary

This PR is a 10-minute submission for leaderboard placement, not a record claim.

The submitted run is the best under-cap seed on this lane:

  • seed2025
  • final_sliding_window_eval_exact (stride 64): val_loss 1.95946000, val_bpb 1.16050360
  • Total submission size (quant + zstd): 15,844,924 bytes

The lane stacks four practical improvements on the strong 9-layer, 512-dim GPT recipe:

  1. Int6 mixed quantization + zstd: .mlp., .attn.c_q., .attn.c_v., and .attn.proj. are stored in int6, then compressed with zstd.
  2. 3x MLP expansion: MLP_MULT=3 keeps the wider hidden layer that materially improves score within the byte budget.
  3. Selective K preservation: blocks.7.attn.c_k.weight and blocks.8.attn.c_k.weight stay in fp16, while the remaining c_k matrices use grouped int8 with group_size=64.
  4. Sliding-window evaluation: EVAL_STRIDE=64 gives near-full context at evaluation time and is the main improvement over the stride-256 variant.
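The stride-64 sliding-window evaluation in item 4 can be sketched as pure index bookkeeping (a minimal illustration; the function name and token counts are assumptions, not taken from the submission's code):

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Yield (start, end, first_scored) index triples.

    Each window spans at most seq_len tokens; only positions in
    [first_scored, end) contribute to the loss, so every token is
    scored exactly once while later windows see up to
    seq_len - stride tokens of left context.
    """
    scored = 0
    while scored < n_tokens:
        # The first window scores everything it covers; later windows
        # advance by `stride` and score only the newly exposed tail.
        end = min(seq_len, n_tokens) if scored == 0 else min(scored + stride, n_tokens)
        start = max(0, end - seq_len)
        yield start, end, scored
        scored = end
```

With seq_len=1024 and stride=64, every scored token after the first window has at least 960 tokens of left context, which is the intuition behind stride 64 beating the stride-256 variant at the cost of more forward passes.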

Configuration

VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4
MLP_MULT=3 TIE_EMBEDDINGS=1
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500
WARMDOWN_ITERS=3000 QK_GAIN_INIT=1.7
TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024
LOWBIT_BITS=6 LOWBIT_STE=0
LOWBIT_NAME_PATTERNS=.mlp.,.attn.c_q.,.attn.c_v.,.attn.proj.
INT8_KEEP_FLOAT_NAME_PATTERNS=tok_emb.weight,blocks.7.attn.c_k.weight,blocks.8.attn.c_k.weight
INT8_GROUP_OVERRIDES=.attn.c_k.:64
SERIAL_COMPRESSOR=zstd
EVAL_STRIDE=64
MAX_WALLCLOCK_SECONDS=600
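As a rough illustration, the LOWBIT_NAME_PATTERNS / INT8_KEEP_FLOAT_NAME_PATTERNS settings amount to substring routing over parameter names. The sketch below assumes simple substring matching with fp16 passthrough taking precedence; the actual serializer's precedence rules are not shown in this PR:

```python
# Patterns copied from the config above; precedence order is an assumption.
KEEP_FP16 = ("tok_emb.weight",
             "blocks.7.attn.c_k.weight",
             "blocks.8.attn.c_k.weight")
INT6_PATTERNS = (".mlp.", ".attn.c_q.", ".attn.c_v.", ".attn.proj.")

def route_param(name):
    """Map a parameter name to a storage format: explicit fp16
    passthrough wins, then int6 patterns, and everything else falls
    through to grouped int8 (which the .attn.c_k.:64 override stores
    with group_size=64)."""
    if any(p in name for p in KEEP_FP16):
        return "fp16"
    if any(p in name for p in INT6_PATTERNS):
        return "int6"
    return "int8_grouped"
```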

Command

torchrun --standalone --nproc_per_node=8 train_gpt.py

Key Metrics

  • Training stopped at step 12791/20000 due to the 600s wallclock cap
  • Average training step time: 46.91ms
  • Model params: 21,778,504
  • Pre-quant eval: val_loss 2.0047, val_bpb 1.1873
  • Quantized roundtrip: val_loss 2.01675174, val_bpb 1.19443398
  • Sliding window (stride 64): val_loss 1.95946000, val_bpb 1.16050360
  • Sliding-window eval time: 70834ms
  • Code size: 37,988 bytes
  • Total submission size: 15,844,924 bytes

Quantization Strategy

The main serializer choice is to spend the remaining bytes on the attention K path rather than on broader fp16 promotion:

  • tok_emb.weight: fp16 passthrough
  • blocks.7.attn.c_k.weight and blocks.8.attn.c_k.weight: fp16 passthrough
  • remaining .attn.c_k. matrices: grouped int8, group_size=64
  • .mlp., .attn.c_q., .attn.c_v., .attn.proj.: int6
  • compressor: zstd

This keeps the artifact under 16,000,000 bytes while preserving the highest-value late-layer key projections.
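For reference, symmetric grouped int8 quantization along the lines of group_size=64 can be sketched in NumPy as follows. This is a minimal illustration under assumed conventions (symmetric scales, one scale per group of consecutive values), not the submission's serializer, which additionally applies zstd on top:

```python
import numpy as np

def quant_grouped_int8(w, group_size=64):
    """Symmetric int8 quantization with one fp32 scale per group of
    `group_size` consecutive values (total size must divide evenly)."""
    flat = w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(flat / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequant_grouped_int8(q, scale, shape):
    """Invert quantization: per-group rescale, then restore the shape."""
    return (q.astype(np.float32) * scale).reshape(shape)
```

Per-group maxima bound the reconstruction error by half a quantization step of each group's own scale, so a single outlier only inflates the error of its 64-value group; for the two late-layer c_k matrices the submission avoids even that by keeping them in fp16 outright.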

Additional Seeds

The same lane was rerun on two more under-cap seeds:

  • seed42: val_loss 1.96035715, val_bpb 1.16103494, 15,802,877 bytes
  • seed4242: val_loss 1.96595032, val_bpb 1.16434753, 15,822,568 bytes

Maintainers can place this submission according to the project’s current eligibility and ranking criteria.

@takhir-iota takhir-iota changed the title submission: Top2K + sliding-window stride64 submission: Int6 MLP3x + Late-K Passthrough + SlidingWindow (val_bpb: 1.1605) Mar 19, 2026
@takhir-iota
Author

The submitted code snapshot is minified because code size counts toward the 16,000,000-byte submission limit. I kept the PR body and submission README detailed so the method is still reviewable, and the included logs/config describe the exact run behavior.

m0at added a commit to m0at/parameter-golf that referenced this pull request Mar 20, 2026
MLP_HIDDEN=1488, 15.93MB. 9918 steps in 570s (57ms/step).
LR tuning from PR openai#99: scalar_lr 0.04->0.02, embed_lr 0.05->0.03.
Improvement vs baseline: -0.0596 BPB.
