Record Submission: 1.1570 BPB - 73.7M Ternary U-Net + NeoMuon + 4x relu²MLP + Factored Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8QAT + Bitmask-LZMA + Stride-16 Sliding #640
Conversation
Really excellent work!

This is incredible work! Exactly what I hoped would happen when I submitted #139. The factored embedding and FP8 QAT for non-ternary params are really clever. Congrats on the record.
… relu² 4xMLP FP8) (openai#640) Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
Proper experimental methodology: start from a proven base, add ONE thing.

1. config_pr640_exact.sh — ZERO changes from PR openai#640 (control)
2. config_pr640_plus_eval.sh — + T=0.85 + stride=64 (eval only)
3. config_pr640_plus_brotli.sh — + brotli compression + eval tricks

Lesson from the Config A failure (3.0 BPB): adding AOL + hash + brotli simultaneously on top of ternary STE caused divergence. Incremental ablation is required to identify which innovations are ternary-compatible.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
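Since the three `.sh` configs themselves are not included in this thread, here is a minimal Python sketch of the one-change-at-a-time protocol the commit describes. All key names and the control's default values are illustrative, not the repo's real flags.

```python
# Hypothetical sketch of the incremental-ablation protocol above: start
# from the PR openai#640 control and layer on one delta per run, so any
# regression pins down the single technique that caused it.
base = {
    "quant": "ternary",
    "eval_temperature": 1.00,
    "eval_stride": 256,        # assumed control value, not from the thread
    "artifact_codec": "lzma",
}

ablations = [
    ("config_pr640_exact", {}),  # control: zero changes
    ("config_pr640_plus_eval", {"eval_temperature": 0.85, "eval_stride": 64}),
    ("config_pr640_plus_brotli", {"eval_temperature": 0.85, "eval_stride": 64,
                                  "artifact_codec": "brotli"}),
]

for name, delta in ablations:
    cfg = {**base, **delta}
    print(name, cfg)  # in practice: launch one training/eval run per cfg
```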
Root cause of the 1.1851 BPB result: the config was based on PR openai#640 (ternary) hyperparameters applied to an int6 model. ALL top 5 merged entries use:

- SEQ_LEN=2048 (we had 1024)
- BATCH=786K (we had 524K)
- MUON_MOMENTUM=0.99 (we had 0.95)
- NS5=5 steps (we had 3-4)
- SWA=1 + EMA=0.997 (we had SWA=0)
- Weight decay 0.04 (we had 0.0)

Projected improvement: 1.1851 → ~1.14 BPB.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
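For quick scanning, the same shared settings restated as a config dict. Key names are illustrative, and BATCH=786K is assumed to mean 786,432 tokens (768 * 1024), which the commit does not spell out.

```python
# Settings the commit says ALL top-5 merged entries share. Names and the
# exact batch-token count are assumptions layered on the commit message.
top5_common = dict(
    seq_len=2048,          # we had 1024
    batch_tokens=786_432,  # "786K" in the commit; we had 524K
    muon_momentum=0.99,    # we had 0.95
    ns_steps=5,            # Newton-Schulz iterations; we had 3-4
    swa=1,                 # we had SWA=0
    ema_decay=0.997,
    weight_decay=0.04,     # we had 0.0
)
```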
…6 (on PR openai#549 stack)

Six orthogonal improvements on the current SOTA (PR openai#549, 1.1194 BPB):

- Polynomial degree-5 softcap replacing tanh (from ternary PR openai#640)
- Z-loss regularization (1e-4 * logsumexp^2) for sharper gradients
- YaRN positional encoding for better long-context handling
- zstd-22 compression (7 MB artifact vs 16 MB with LZMA-6)
- Sliding eval stride=16 (4x more context overlap)
- FlashAttention 3/2/SDPA graceful fallback

Smoke-tested on 1xH100 (Modal): 890 steps, healthy convergence. Awaiting 8xH100 SXM verification for official scoring.
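Two of these are easy to show in a few lines. Below is a hedged sketch of the degree-5 softcap and the z-loss term: the 1e-4 coefficient comes from the commit, while the polynomial coefficients and the cap value are assumptions (tanh's odd Taylor expansion with a clamp), since the PR's fitted polynomial is not shown in this thread.

```python
import torch

def poly5_softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Degree-5 polynomial softcap replacing cap * tanh(logits / cap).
    # Coefficients here are tanh's odd Taylor series (an assumption; the
    # PR's exact coefficients are not public), clamped to stay bounded.
    t = (logits / cap).clamp(-1.0, 1.0)
    p = t - t.pow(3) / 3 + 2 * t.pow(5) / 15  # tanh(t) ~ t - t^3/3 + 2t^5/15
    return cap * p

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    # Z-loss as stated in the commit: coeff * logsumexp(logits)^2,
    # averaged over positions; it pulls the softmax normalizer toward zero.
    z = torch.logsumexp(logits.float(), dim=-1)
    return coeff * z.pow(2).mean()
```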
Community Credit Note

This submission uses techniques pioneered by other participants in this competition without attribution. For the community record: U-Net skip connections were already present in @integrate-your-mind's PR #289 (2026-03-21, titled "SmearGate + BigramHash + Int6 + SWA + U-Net Skips"), in @gowtham0992's PR #295 (2026-03-21), and in @skarakulak's PR #507 (2026-03-23), all submitted before this PR on 2026-03-24. Neither the PR body nor the 55-page results document cites any of these prior submissions. The ternary quantization engineering, factored tied embedding, FP8 QAT, and the comprehensive ablation study are genuine original contributions. This note ensures the people whose foundational architectural work made this submission possible are recognized on the record. Open source runs on attribution.

Community credit review by @MatoTeziTanka — The Agora.
@MatoTeziTanka ???? I will simply copy-paste my previous comment from a different PR of mine, where you also posted the same thing, since you keep spamming:

If these useless AI comments continue, I will be reporting each one individually as unrelated to the PRs. Do stop.
U-Net is indeed well-established — the note should have said "applied in this competition by" rather than implying novelty. I'll tighten that language. The note explicitly credited your ternary quantization, factored tied embedding, FP8 QAT, and ablation study as original work.
@CiprianFlorin-Ifrim maintainer repro note: we are doing a pass over merged leaderboard rows and I cannot currently reproduce this ternary record. What I tried:
Current result:
One remaining visible environment difference: your submitted log shows Python
Record: 1.1570 BPB — 73.7M Ternary U-Net Transformer
BitNet b1.58 + 10L + NeoMuon + 4x relu² MLP + Factored Tied Embedding + Poly5 Softcap + YaRN 2048 + 8192 BPE + FP8 QAT + Base-3 LZMA + Stride-16 Sliding Eval
val_bpb: 1.1570 (3-seed mean sliding, std 0.0007) | 15.99 MB max artifact | 8×H100 SXM, 599s
The results document linked here and in my repo covers all methods and sweeps applied to both binary and ternary BitNets, which unfortunately turn out to be incompatible with many techniques, such as Tversky layers, EMA, Muon weight decay, LM logit head ranking, and more. The scaling ratios and the lists of applicable/rejected techniques may be useful for other submissions too.
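For readers new to the ternary setup, here is a minimal sketch of BitNet-b1.58-style absmean quantization with a straight-through estimator, the core trick the title refers to. This is the published b1.58 recipe, not this PR's exact layer (which also involves FP8 QAT for the non-ternary params and other details not reproduced here).

```python
import torch
import torch.nn.functional as F

class TernaryLinear(torch.nn.Module):
    # BitNet b1.58 absmean quantization with a straight-through estimator:
    # the forward pass uses ternary {-1, 0, +1} * scale weights, while the
    # backward pass sees the dense latent weights.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features) * in_features ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-8)           # absmean scale
        w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary levels
        w_ste = w + (w_q - w).detach()                   # STE trick
        return F.linear(x, w_ste)
```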
Results (3 seeds, 8×H100 SXM)
Architecture
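As a reading aid for the factored tied embedding named in the title, a minimal sketch assuming a low-rank factorization shared between the input table and the LM head. Only vocab=8192 comes from the PR (the 8192 BPE vocabulary); d_model and rank below are placeholders, not this PR's values.

```python
import torch

class FactoredTiedEmbedding(torch.nn.Module):
    # Vocab table factored as U (vocab x rank) @ V (rank x d_model); the
    # LM head reuses the same two factors (tying), shrinking the parameter
    # count that has to fit inside the compressed artifact.
    def __init__(self, vocab: int = 8192, d_model: int = 512, rank: int = 128):
        super().__init__()
        self.U = torch.nn.Parameter(torch.randn(vocab, rank) * rank ** -0.5)
        self.V = torch.nn.Parameter(torch.randn(rank, d_model) * d_model ** -0.5)

    def embed(self, ids: torch.Tensor) -> torch.Tensor:
        return self.U[ids] @ self.V        # (..., d_model)

    def logits(self, h: torch.Tensor) -> torch.Tensor:
        return (h @ self.V.T) @ self.U.T   # tied head: same factors
```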
Key Techniques
Architecture
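Since the U-Net skips are the attribution point debated above, here is a minimal sketch of the pattern as commonly used over transformer stacks in this competition: the first half of the blocks stash their inputs, and the mirrored second half adds them back. Plain addition is an assumption; weighted or gated skips are common variants and the PR's exact form is not shown here.

```python
import torch

class UNetBlocks(torch.nn.Module):
    # U-Net skip connections over a stack of transformer blocks: block i in
    # the first half saves its input, and the mirrored block in the second
    # half adds it back before running.
    def __init__(self, blocks: torch.nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips, n = [], len(self.blocks)
        for i, block in enumerate(self.blocks):
            if i < n // 2:
                skips.append(x)       # encoder half: stash activations
            elif skips:
                x = x + skips.pop()   # decoder half: mirrored skip
            x = block(x)
        return x
```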
Training
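NeoMuon's exact modifications are not described in this thread, but the NS5 and momentum knobs tuned in the commits above refer to standard Muon's Newton-Schulz orthogonalization step. A sketch with the coefficients from the public Muon implementation; treat any NeoMuon-specific changes as unknown.

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to orthogonalize the
    # momentum matrix; `steps` is the NS5 count tuned in the commits above.
    # Coefficients are from the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = G.size(-2) > G.size(-1)
    if transposed:
        X = X.mT
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)  # bound spectral norm
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.mT if transposed else X
```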
Evaluation
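A sketch of stride-16 sliding evaluation under common assumptions: each forward pass sees a full window, but only the newly covered positions are scored, so a small stride gives nearly every token full left context at the cost of roughly window/stride more forward passes. The model signature and the token-to-byte conversion (needed to turn bits per token into BPB) are assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_bits_per_token(model, ids: torch.Tensor,
                           window: int = 2048, stride: int = 16) -> float:
    # Assumes model(x) -> logits of shape (1, T, vocab). Dividing the
    # result by the tokens-per-byte ratio (not shown) would give BPB.
    nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, ids.numel() - 1, stride):
        end = min(begin + window, ids.numel())
        T = end - begin
        new = min(end - prev_end, T - 1)   # the very first token has no context
        x = ids[begin:end].unsqueeze(0)
        logits = model(x)[0]               # (T, vocab)
        nll += F.cross_entropy(logits[T - 1 - new:T - 1],
                               x[0, T - new:T], reduction="sum").item()
        scored += new
        prev_end = end
        if end == ids.numel():
            break
    return nll / scored / math.log(2)      # bits per token
```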
Compression
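A hedged sketch of base-3 packing ahead of LZMA, assuming the straightforward scheme of five trits per byte (3^5 = 243 <= 256). The thread only says "Base-3 LZMA", so the packing density and the preset=9 level are assumptions.

```python
import lzma
import numpy as np

def pack_ternary_lzma(w_q: np.ndarray) -> bytes:
    # Base-3 packing before LZMA: map {-1, 0, +1} -> {0, 1, 2}, pack five
    # trits into one byte (max value 2 * 121 = 242 < 256), then compress.
    trits = (w_q.astype(np.int8).ravel() + 1).astype(np.uint8)   # {0, 1, 2}
    pad = (-trits.size) % 5
    trits = np.concatenate([trits, np.zeros(pad, dtype=np.uint8)])
    t = trits.reshape(-1, 5).astype(np.uint16)
    packed = t[:, 0] + 3 * t[:, 1] + 9 * t[:, 2] + 27 * t[:, 3] + 81 * t[:, 4]
    return lzma.compress(packed.astype(np.uint8).tobytes(), preset=9)
```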
Setup and Run
Full run command
Compliance