Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean) #1179

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:submission/splitlr-dim160-gptq-brotli-1.1105

Conversation

@dexhunter

Summary

  • val_bpb: 1.1105 (3-seed mean, std 0.0006)
  • val_loss: 1.8751 nats
  • Artifact: ~15.81 MB
  • Built on PR #1019 by @abaybektursun

What's New

  1. Split-LR — different learning rates for early (0.025) vs late (0.030) layers
  2. BigramHash(2816×160) — wider projection dimension (160 vs 112), fewer buckets
  3. Sigmoid-gated U-Net skip connections — learnable gates on encoder-decoder skips
  4. Soft-round QAT — temperature-controlled rounding (alpha 1→16) replacing STE
  5. Brotli-11 + byte-shuffle — saves ~400KB vs LZMA-9
  6. Code minification — 101KB→23KB, saves 78KB artifact budget
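Item 4's soft-round QAT can be sketched in a few lines. This is the standard tanh-based soft-rounding surrogate with a linearly annealed temperature, given here as a minimal sketch rather than the PR's exact implementation:

```python
import math
import torch

def soft_round(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # Differentiable rounding surrogate: near-identity for small alpha,
    # approaches hard round() as alpha grows, so gradients flow without
    # a straight-through estimator.
    m = torch.floor(x) + 0.5
    return m + 0.5 * torch.tanh(alpha * (x - m)) / math.tanh(alpha / 2.0)

def alpha_schedule(step: int, total_steps: int,
                   lo: float = 1.0, hi: float = 16.0) -> float:
    # Linear anneal from lo to hi over training (the "alpha 1 -> 16" above).
    return lo + (hi - lo) * min(step / max(total_steps, 1), 1.0)
```

At alpha near 16 the surrogate is already very close to hard rounding for inputs away from the cell midpoint, which is why a smooth anneal can replace the STE without a discontinuous switch at the end of training.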

Results (8×H100 SXM, no TTT)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|-------------|-----------------|------------------|
| 1337 | 6,702 | 88.2    | 1.1110      | 1.8758          | 15,807,723       |
| 42   | 6,708 | 88.1    | 1.1098      | 1.8739          | 15,811,712       |
| 2025 | 6,712 | 88.1    | 1.1108      | 1.8755          | 15,800,500       |
| Mean | 6,707 | 88.1    | 1.1105      | 1.8751          |                  |

Compliance

  • 3-seed verification (std 0.0006)
  • No TTT, no SLOT, no eval-time adaptation
  • Standard F.cross_entropy scoring (softmax, sum=1)
  • Artifact < 16,000,000 bytes (all seeds)
  • Training ≤ 600s, eval ≤ 600s
  • GPTQ calibration within training budget (9s reserved)

Reproduction

```sh
pip install brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

for SEED in 1337 42 2025; do
  BIGRAM_DIM=160 SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```
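The Brotli-11 + byte-shuffle step (item 5 above) could look roughly like the following roundtrip, assuming fp16 weights serialized via numpy; the PR's actual serialization layout may differ:

```python
import brotli
import numpy as np

def shuffle_compress(weights: np.ndarray) -> bytes:
    # Byte-shuffle: group byte 0 of every element, then byte 1, etc.
    # Similar high-order bytes end up adjacent, which compresses better.
    flat = np.ascontiguousarray(weights).ravel()
    raw = flat.view(np.uint8).reshape(-1, flat.dtype.itemsize)
    return brotli.compress(np.ascontiguousarray(raw.T).tobytes(), quality=11)

def unshuffle_decompress(blob: bytes, dtype, shape) -> np.ndarray:
    # Inverse: decompress, regroup bytes per element, reinterpret.
    itemsize = np.dtype(dtype).itemsize
    raw = np.frombuffer(brotli.decompress(blob), dtype=np.uint8)
    raw = raw.reshape(itemsize, -1)
    return np.ascontiguousarray(raw.T).view(dtype).reshape(shape)
```

The shuffle is lossless; the ~400KB saving over LZMA-9 comes purely from Brotli's quality-11 entropy coding on the reordered byte stream.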

See README.md for full details.

Tanush1912 added a commit to Tanush1912/parameter-golf that referenced this pull request Mar 31, 2026
Novel contribution: shallow recurrence (layers 4,5 repeated once each)
with rank-2 LoRA corrections on attention projections, RMSNorm before
repeat, and learnable alpha scaling. 13 virtual layers from 11 physical
layers at 28KB (0.18%) parameter overhead.

Hyperparameter changes from PR openai#1179 base (1.1105 BPB):
- NEGATIVE_SLOPE: 0.5 -> 0.9 (validated +0.013 BPB in issue openai#140)
- QK_GAIN_INIT: 1.5 -> 4.0 (validated +0.006 BPB in PR openai#1176)
- TTT_ENABLED: 1 (score-first, legal variant)
- WARMDOWN_ITERS: 4000 (extended from 3500)
- BIGRAM_DIM: 160 (from 112)

Status: WIP - awaiting compute for 3-seed validation runs.
@msisovic

msisovic commented Apr 1, 2026

When I ran this locally and added extra logging to measure the time up to the end of GPTQ calibration (i.e. up to when gptq:calibrated 66 layers in 6.8s is printed), the whole post-training stretch took noticeably longer than the ~7-8 seconds the calibration call itself reports, and I had to raise the GPTQ seconds-reservation parameter to ~25 seconds.

I'm not pointing this out to invalidate the submission; in fact it was super useful as a baseline for mine, and luckily the fix is easy, as mentioned above, and didn't hurt performance much. I did end up porting over the AR generation of calibration samples for GPTQ to avoid the hassle entirely, since it's a tried and already-accepted method; you could do the same.
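The measurement being described, timing the whole post-training window rather than only the logged calibration call, can be sketched generically; the function arguments here are placeholders, not the repo's actual hooks:

```python
import time

def measure_gptq_window(prep_fn, calibrate_fn) -> float:
    # Wall-clock seconds for everything between end-of-training and
    # end-of-calibration. The repo's log line times only calibrate_fn,
    # so budgeting from that number alone under-reserves.
    t0 = time.perf_counter()
    prep_fn()        # e.g. sampling/tokenizing calibration data
    calibrate_fn()   # the part covered by "calibrated 66 layers in 6.8s"
    return time.perf_counter() - t0
```

Setting the reserve to this measured window plus a safety margin (~25s total, per the comment above) keeps the run inside the 600s budget on slower setups.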

Gusanidas added a commit to Gusanidas/parameter-golf that referenced this pull request Apr 1, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bigbag pushed a commit to bigbag/parameter-golf that referenced this pull request Apr 1, 2026
3-seed mean 1.10272 BPB (std 0.00106), beats merged SOTA by 0.012.
Built on PR openai#1179 with MuonEq-R optimizer, context-only SLOT
(causal variant), and QK_GAIN=5.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dexhunter (Author)

Thanks, this is a good point.

Our local H100 runs used GPTQ_RESERVE_MS=9000, and on those three seeds training stopped at ~591.1s with the log showing gptq:calibrated 66 layers in 6.8s, so 9s was sufficient under our script's accounting on our setup.

That said, you're right that the printed 6.8s only reflects the logged calibration call, not necessarily all GPTQ-side overhead between the end of training and the end of calibration. So for portability/re-runs on slightly different setups, a larger reserve like ~25s is safer.

This doesn't change the reported scores, but it is a useful reproducibility note. We've since also moved to AR self-generated GPTQ calibration in later branches for exactly this reason.
