Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean) #1179
Novel contribution: shallow recurrence (layers 4 and 5 repeated once each) with rank-2 LoRA corrections on attention projections, RMSNorm before the repeat, and learnable alpha scaling. 13 virtual layers from 11 physical layers at 28KB (0.18%) parameter overhead.

Hyperparameter changes from PR openai#1179 base (1.1105 BPB):
- NEGATIVE_SLOPE: 0.5 -> 0.9 (validated +0.013 BPB in issue openai#140)
- QK_GAIN_INIT: 1.5 -> 4.0 (validated +0.006 BPB in PR openai#1176)
- TTT_ENABLED: 1 (score-first, legal variant)
- WARMDOWN_ITERS: 4000 (extended from 3500)
- BIGRAM_DIM: 160 (from 112)

Status: WIP - awaiting compute for 3-seed validation runs.
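The shallow-recurrence idea above can be sketched as follows. This is a minimal NumPy illustration under assumptions — a single shared projection matrix stands in for the attention projections, and all names, shapes, and initializations are illustrative, not this PR's implementation:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Root-mean-square normalization over the feature dimension.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

class RepeatedBlock:
    """Hypothetical sketch: one physical layer reused as a second, virtual
    layer. The repeated pass applies RMSNorm to its input, adds a rank-2
    LoRA correction to the shared projection, and is scaled by a learnable
    alpha (zero-initialized, so the repeat starts as a no-op)."""
    def __init__(self, d_model, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.A = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)  # LoRA down
        self.B = np.zeros((rank, d_model))  # LoRA up, zero-init
        self.alpha = 0.0  # learnable scale on the repeated (virtual) pass

    def forward(self, x):
        # First (physical) pass: ordinary residual projection.
        h = x + x @ self.W
        # Second (virtual) pass: RMSNorm before repeat, LoRA-corrected
        # shared weights, alpha-scaled residual.
        n = rmsnorm(h)
        return h + self.alpha * (n @ self.W + n @ self.A @ self.B)
```

With `alpha` and the LoRA up-matrix both zero-initialized, the virtual pass is initially the identity, so training starts from the 11-physical-layer behavior; the extra parameters per repeated layer are just `2 * d_model * rank + 1`, consistent with a sub-percent overhead.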
When I ran this locally and added additional logging to check how much time was spent up to when GPTQ calibration is finished (so after

Not pointing this out to invalidate your submission; in fact, it was super useful as a baseline for mine. Luckily the fix is easy, and as I mentioned above, it didn't hurt performance much. I decided to port over the AR generation of samples for GPTQ to avoid the above hassle, though, as it's a tried and already accepted method, so you could do that as well.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3-seed mean 1.10272 BPB (std 0.00106), beating the merged SOTA by 0.012 BPB. Built on PR openai#1179 with the MuonEq-R optimizer, context-only SLOT (causal variant), and QK_GAIN=5.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
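For reference, the bits-per-byte metric and the multi-seed aggregation reported above can be computed as below. The per-seed values are placeholders for illustration (not this PR's actual seed results), and `bits_per_byte` assumes the mean per-token NLL is given in nats:

```python
import math
import statistics

def bits_per_byte(mean_nll_nats, total_tokens, total_bytes):
    # Total bits = tokens * NLL / ln(2); normalize by the byte count
    # of the evaluation text.
    return mean_nll_nats * total_tokens / (math.log(2) * total_bytes)

# Hypothetical per-seed results (placeholders only):
seed_bpbs = [1.1016, 1.1027, 1.1038]
mean_bpb = statistics.mean(seed_bpbs)
std_bpb = statistics.stdev(seed_bpbs)  # sample standard deviation
```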
Thanks, this is a good point. Our local H100 runs used

That said, you're right that the printed

This doesn't change the reported scores, but it is a useful reproducibility note. We've since also moved to AR self-generated GPTQ calibration in later branches for exactly this reason.
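A minimal sketch of the AR self-generated calibration approach mentioned here, assuming a `next_logits` callable standing in for the model's next-token logits and token 0 as a BOS prefix — none of these names come from the PR:

```python
import numpy as np

def ar_calibration_batch(next_logits, vocab_size, n_seqs, seq_len, seed=0):
    """Sample calibration sequences autoregressively from the model itself,
    so GPTQ needs no external calibration file (and no data-loading time
    that can leak into timing measurements)."""
    rng = np.random.default_rng(seed)
    tokens = np.zeros((n_seqs, seq_len), dtype=np.int64)  # column 0 = BOS
    for t in range(1, seq_len):
        logits = next_logits(tokens[:, :t])            # (n_seqs, vocab_size)
        # Softmax with max-subtraction for numerical stability.
        z = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = z / z.sum(axis=-1, keepdims=True)
        for i in range(n_seqs):
            tokens[i, t] = rng.choice(vocab_size, p=probs[i])
    return tokens
```

The returned token batch is then fed through the quantized-in-place layers to collect GPTQ's Hessian statistics, exactly as one would with file-based calibration data.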
Summary
What's New
Results (8×H100 SXM, no TTT)
Compliance
Reproduction
See README.md for full details.