Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean) #1179

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:submission/splitlr-dim160-gptq-brotli-1.1105

Conversation

@dexhunter

Summary

  • val_bpb: 1.1105 (3-seed mean, std 0.0006)
  • val_loss: 1.8751 nats
  • Artifact: ~15.81 MB
  • Built on PR #1019 by @abaybektursun

What's New

  1. Split-LR — different learning rates for early (0.025) vs late (0.030) layers
  2. BigramHash(2816×160) — wider projection dimension (160 vs 112), fewer buckets
  3. Sigmoid-gated U-Net skip connections — learnable gates on encoder-decoder skips
  4. Soft-round QAT — temperature-controlled rounding (alpha 1→16) replacing STE
  5. Brotli-11 + byte-shuffle — saves ~400KB vs LZMA-9
  6. Code minification — 101KB→23KB, saves 78KB artifact budget
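Item 4's soft-round QAT can be sketched in a few lines. This is the standard tanh-based soft-rounding surrogate with a linearly annealed temperature, given here as a minimal sketch rather than the PR's exact implementation:

```python
import math
import torch

def soft_round(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # Differentiable rounding surrogate: near-identity for small alpha,
    # approaches hard round() as alpha grows, so gradients flow without
    # a straight-through estimator.
    m = torch.floor(x) + 0.5
    return m + 0.5 * torch.tanh(alpha * (x - m)) / math.tanh(alpha / 2.0)

def alpha_schedule(step: int, total_steps: int,
                   lo: float = 1.0, hi: float = 16.0) -> float:
    # Linear anneal from lo to hi over training (the "alpha 1 -> 16" above).
    return lo + (hi - lo) * min(step / max(total_steps, 1), 1.0)
```

At alpha near 16 the surrogate is already very close to hard rounding for inputs away from the cell midpoint, which is why a smooth anneal can replace the STE without a discontinuous switch at the end of training.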

Results (8×H100 SXM, no TTT)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|-------------|-----------------|------------------|
| 1337 | 6,702 | 88.2    | 1.1110      | 1.8758          | 15,807,723       |
| 42   | 6,708 | 88.1    | 1.1098      | 1.8739          | 15,811,712       |
| 2025 | 6,712 | 88.1    | 1.1108      | 1.8755          | 15,800,500       |
| Mean | 6,707 | 88.1    | 1.1105      | 1.8751          |                  |

Compliance

  • 3-seed verification (std 0.0006)
  • No TTT, no SLOT, no eval-time adaptation
  • Standard F.cross_entropy scoring (softmax, sum=1)
  • Artifact < 16,000,000 bytes (all seeds)
  • Training ≤ 600s, eval ≤ 600s
  • GPTQ calibration within training budget (9s reserved)

Reproduction

```sh
pip install brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

for SEED in 1337 42 2025; do
  BIGRAM_DIM=160 SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```
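The Brotli-11 + byte-shuffle step (item 5 above) could look roughly like the following roundtrip, assuming fp16 weights serialized via numpy; the PR's actual serialization layout may differ:

```python
import brotli
import numpy as np

def shuffle_compress(weights: np.ndarray) -> bytes:
    # Byte-shuffle: group byte 0 of every element, then byte 1, etc.
    # Similar high-order bytes end up adjacent, which compresses better.
    flat = np.ascontiguousarray(weights).ravel()
    raw = flat.view(np.uint8).reshape(-1, flat.dtype.itemsize)
    return brotli.compress(np.ascontiguousarray(raw.T).tobytes(), quality=11)

def unshuffle_decompress(blob: bytes, dtype, shape) -> np.ndarray:
    # Inverse: decompress, regroup bytes per element, reinterpret.
    itemsize = np.dtype(dtype).itemsize
    raw = np.frombuffer(brotli.decompress(blob), dtype=np.uint8)
    raw = raw.reshape(itemsize, -1)
    return np.ascontiguousarray(raw.T).view(dtype).reshape(shape)
```

The shuffle is lossless; the ~400KB saving over LZMA-9 comes purely from Brotli's quality-11 entropy coding on the reordered byte stream.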

See README.md for full details.

Tanush1912 added a commit to Tanush1912/parameter-golf that referenced this pull request Mar 31, 2026
Novel contribution: shallow recurrence (layers 4,5 repeated once each)
with rank-2 LoRA corrections on attention projections, RMSNorm before
repeat, and learnable alpha scaling. 13 virtual layers from 11 physical
layers at 28KB (0.18%) parameter overhead.

Hyperparameter changes from PR openai#1179 base (1.1105 BPB):
- NEGATIVE_SLOPE: 0.5 -> 0.9 (validated +0.013 BPB in issue openai#140)
- QK_GAIN_INIT: 1.5 -> 4.0 (validated +0.006 BPB in PR openai#1176)
- TTT_ENABLED: 1 (score-first, legal variant)
- WARMDOWN_ITERS: 4000 (extended from 3500)
- BIGRAM_DIM: 160 (from 112)

Status: WIP - awaiting compute for 3-seed validation runs.
@msisovic

msisovic commented Apr 1, 2026

When I ran this locally and added extra logging to measure the time up to the end of GPTQ calibration (i.e. up to when gptq:calibrated 66 layers in 6.8s is printed), the whole post-training stretch took noticeably longer than the ~7-8 seconds the calibration call itself reports, and I had to raise the GPTQ seconds-reservation parameter to ~25 seconds.

I'm not pointing this out to invalidate the submission; in fact it was super useful as a baseline for mine, and luckily the fix is easy, as mentioned above, and didn't hurt performance much. I did end up porting over the AR generation of calibration samples for GPTQ to avoid the hassle entirely, since it's a tried and already-accepted method; you could do the same.
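The measurement being described, timing the whole post-training window rather than only the logged calibration call, can be sketched generically; the function arguments here are placeholders, not the repo's actual hooks:

```python
import time

def measure_gptq_window(prep_fn, calibrate_fn) -> float:
    # Wall-clock seconds for everything between end-of-training and
    # end-of-calibration. The repo's log line times only calibrate_fn,
    # so budgeting from that number alone under-reserves.
    t0 = time.perf_counter()
    prep_fn()        # e.g. sampling/tokenizing calibration data
    calibrate_fn()   # the part covered by "calibrated 66 layers in 6.8s"
    return time.perf_counter() - t0
```

Setting the reserve to this measured window plus a safety margin (~25s total, per the comment above) keeps the run inside the 600s budget on slower setups.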

Gusanidas added a commit to Gusanidas/parameter-golf that referenced this pull request Apr 1, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bigbag pushed a commit to bigbag/parameter-golf that referenced this pull request Apr 1, 2026
3-seed mean 1.10272 BPB (std 0.00106), beats merged SOTA by 0.012.
Built on PR openai#1179 with MuonEq-R optimizer, context-only SLOT
(causal variant), and QK_GAIN=5.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dexhunter (Author)

Thanks, this is a good point.

Our local H100 runs used GPTQ_RESERVE_MS=9000, and on those three seeds training stopped at ~591.1s with the log showing gptq:calibrated 66 layers in 6.8s, so 9s was sufficient under our script's accounting on our setup.

That said, you're right that the printed 6.8s only reflects the logged calibration call, not necessarily all GPTQ-side overhead between the end of training and the end of calibration. So for portability/re-runs on slightly different setups, a larger reserve like ~25s is safer.

This doesn't change the reported scores, but it is a useful reproducibility note. We've since also moved to AR self-generated GPTQ calibration in later branches for exactly this reason.
