
Record Submission: 1.1570 BPB - 73.7M Ternary U-Net + NeoMuon + 4x relu²MLP + Factored Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8QAT + Bitmask-LZMA + Stride-16 Sliding#640

Merged
0hq merged 1 commit into openai:main from CiprianFlorin-Ifrim:submission-ternary on Mar 25, 2026

Conversation

Contributor

@CiprianFlorin-Ifrim CiprianFlorin-Ifrim commented Mar 24, 2026

Record: 1.1570 BPB — 73.7M Ternary U-Net Transformer

BitNet b1.58 + 10L + NeoMuon + 4x relu² MLP + Factored Tied Embedding + Poly5 Softcap + YaRN 2048 + 8192 BPE + FP8 QAT + Base-3 LZMA + Stride-16 Sliding Eval

val_bpb: 1.1570 (3-seed mean sliding, std 0.0007) | 15.99 MB max artifact | 8×H100 SXM, 599s

Full experiment log covering 250+ runs, ablations, and decision rationale that may help other submissions: RESULTS.md. Complete training logs in my personal repo: logs/.

The linked results document covers every method and sweep applied to both binary and ternary BitNets, which are unfortunately incompatible with many techniques, including Tversky layers, EMA, Muon weight decay, and LM logit head ranking. The scaling ratios and the lists of applicable/rejected techniques may be useful for other submissions too.

Results (3 seeds, 8×H100 SXM)

| Seed | Steps | ms/step | Sliding BPB (s16) | val_bpb | RT bpb | Artifact |
|------|-------|---------|-------------------|---------|--------|----------|
| 42   | 6,530 | 91.7 | 1.1565 | 1.1816 | 1.1837 | 15,993,853 bytes |
| 1337 | 6,520 | 91.9 | 1.1568 | 1.1825 | 1.1839 | 15,995,705 bytes |
| 7    | 6,530 | 91.8 | 1.1578 | 1.1823 | 1.1850 | 15,992,753 bytes |
| Mean | 6,527 | 91.8 | 1.1570 | 1.1821 | 1.1842 | 15,994,104 bytes |
| Std  | 5     | 0.1  | 0.0007 | 0.0005 | 0.0007 | 1,498 bytes |

Architecture

  • 10 transformer layers, dim=768, 8 heads, 4 KV heads (GQA), head_dim=96
  • BitNet b1.58 ternary quantisation: weights {-1, 0, +1}, ~1.6 bits/param, per-group (128) absmean scaling
  • 4x MLP expansion (hidden=3072) with relu² activation, fused gate+up projection
  • U-Net encoder/decoder with learned skip weights (ones-init) and per-block residual mix from input embedding
  • Factored tied embedding: 8192×254 bottleneck with learned 254-to-768 and 768-to-254 projections
  • Polynomial softcap (degree 5, cap=10) with Z-loss regularisation (1e-4)
  • YaRN positional encoding (max_len=2048, ROPE_BASE=5000)
  • Fused QKV projection (single TernaryLinear)
  • FlashAttention-3 (Hopper native kernels)
  • 73.7M parameters, 15.92MB artifact (64.9M ternary + 2.5M fp8 + 70KB code)
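The quantisation recipe in the second bullet can be sketched in a few lines. This is an illustrative NumPy reconstruction of per-group absmean ternary quantisation in the BitNet b1.58 style, not the submission's training code; the function name and shapes are assumptions:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, group_size: int = 128):
    """Per-group absmean ternary quantisation (BitNet b1.58 style).

    Each group of `group_size` weights is scaled by its mean absolute
    value, then rounded and clipped to {-1, 0, +1}. Illustrative only.
    """
    flat = w.reshape(-1, group_size)
    scale = np.maximum(np.abs(flat).mean(axis=1, keepdims=True), 1e-8)
    q = np.clip(np.round(flat / scale), -1, 1)
    return q.reshape(w.shape), scale
```

During QAT the forward pass would use the dequantised weights (`q * scale`) with a straight-through estimator in the backward pass, which is the STE gradient path the training notes refer to.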

Key Techniques

Architecture

  • Width over depth: 768d/10L outperforms 512d/25L — faster steps (91ms vs 127ms) yield 6,530 vs 4,720 steps in 600s
  • 4x relu² MLP: relu² is -0.024 bpb over relu at zero cost; 4x width adds -0.008 bpb over 3x at same step budget
  • EMBED_DIM=254: frees ~4MB for wider MLP; 254 = 256-2 to fit code within the byte budget
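As a sanity check on the "frees ~4MB" figure, the factored-embedding parameter arithmetic works out as follows (illustrative; assumes roughly one byte per stored parameter, in line with the fp8 storage):

```python
VOCAB, D_MODEL, D_EMBED = 8192, 768, 254

full = VOCAB * D_MODEL                               # unfactored 8192x768 embedding table
factored = VOCAB * D_EMBED + 2 * D_EMBED * D_MODEL   # 8192x254 bottleneck + 254<->768 projections
saved = full - factored
print(full, factored, saved)  # 6291456 2470912 3820544
```

Roughly 3.8M parameters saved, i.e. close to 4 MB at one byte per parameter, consistent with the bullet above.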

Training

  • NeoMuon with 3 Newton-Schulz steps: compensates for ternary STE gradient attenuation; 3 steps equivalent to 5 at convergence (+190 free steps)
  • Fused QKV + fused relu²: ~4-6ms/step saving (~180 extra training steps)
  • FlashAttention-3: -9% step time (~380 free steps)
  • 524k batch tokens: optimal for ternary STE — 262k too noisy, 1M loses gradient updates
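The Newton-Schulz step trade-off can be illustrated with the standard quintic iteration used by Muon-style optimizers. This NumPy sketch is not the NeoMuon implementation; the coefficients are the commonly used quintic and are an assumption here:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 3) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix via the quintic
    Newton-Schulz iteration common in Muon-style optimizers.

    The PR runs 3 backend steps (MUON_BACKEND_STEPS=3) instead of the
    usual 5, trading a slightly looser orthogonalization for ~190 extra
    training steps. Illustrative sketch only.
    """
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used quintic coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm bounds the spectral norm
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

After a few steps the singular values land near 1, so the update direction is roughly orthogonal even with the reduced step count.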

Evaluation

  • Temperature scaling (T=0.90): chosen via a 5-point grid on training tokens; relu² logits are slightly underconfident
  • Sliding window (stride=16): full context per scored token, ~0.025 bpb improvement over chunked eval
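The stride-16 sliding evaluation amounts to simple index arithmetic: each window scores only its last `stride` tokens, so every scored token keeps near-full left context. A hypothetical sketch (function name and tuple layout are illustrative, not the submission's code):

```python
def sliding_eval_windows(n_tokens: int, window: int = 1024, stride: int = 16):
    """Return (start, end, score_from) triples for sliding evaluation.

    Each window spans [start, end) but only tokens [score_from, end) are
    scored, so every token is scored exactly once with up to
    `window - stride` tokens of preceding context. Illustrative sketch.
    """
    windows = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        windows.append((start, end, pos))
        pos = end
    return windows
```

Smaller strides mean more forward passes per scored token, which is why the stride-16 setting needs the large SLIDING_BATCH_SIZE to stay cheap at eval time.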

Compression

  • Base-3 + LZMA (preset=9): 5 trits/byte packing, 39% reduction over int8+zlib; auto-compared against bitmask per run
  • FP8 QAT (e4m3): halves fp_params (~5MB to ~2.5MB), only 0.002 bpb RT penalty
  • Shrinkage fix: corrects ternary zero-fraction scale mismatch, eliminating all roundtrip gaps
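The 5-trits-per-byte packing relies on 3⁵ = 243 fitting in a byte. A minimal stdlib sketch of the pack/unpack round trip (illustrative, not the submission's serializer):

```python
import lzma

def pack_trits(trits):
    """Pack ternary values in {-1, 0, +1} at 5 trits/byte (3**5 = 243 <= 256),
    then compress with LZMA at preset 9. Illustrative sketch."""
    data = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)  # map {-1, 0, +1} -> {0, 1, 2}
        data.append(byte)
    return lzma.compress(bytes(data), preset=9)

def unpack_trits(blob, n):
    """Invert pack_trits; `n` trims the padding trits in the final byte."""
    trits = []
    for byte in lzma.decompress(blob):
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```

LZMA then squeezes the remaining redundancy (243 of 256 byte values used, plus the skewed trit distribution), which is where the reported 39% gain over int8+zlib comes from.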

Setup and Run

```bash
# Environment setup (conda + Python 3.13 + PyTorch + FlashAttention-3 + Triton + dataset)
bash setup.sh

# Activate and run
conda activate golf
SEED=42 bash run_cuda_ternary.sh
```
Full run command
```bash
RUN_ID=ternary_run \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
ATTN_PROJ_TYPE=standard \
LOGIT_HEAD_TYPE=standard \
TVERSKY_MEMBERSHIP=sigmoid \
TVERSKY_NUM_FEATURES=0 \
TVERSKY_FEATURE_POOLS=0 \
VOCAB_SIZE=8192 \
BITNET_GROUP_SIZE=128 \
BIGRAM_HASH=0 \
EMBED_DIM=254 \
TRAINING_DEPTH_RECURRENCE=0 \
EVAL_DEPTH_RECURRENCE=0 \
NUM_LAYERS=10 \
MODEL_DIM=768 \
NUM_KV_HEADS=4 \
NUM_HEADS=8 \
DIFF_ATTN=0 \
MLP_MULT=4 \
MLP_GROUPS=0 \
MATRIX_OPTIMIZER=muon \
ADAM_LR=0.05 \
ADAM_WD=0.05 \
MUON_BACKEND_STEPS=3 \
MUON_MOMENTUM=0.95 \
MUON_MOMENTUM_WARMUP_START=0.85 \
MUON_MOMENTUM_WARMUP_STEPS=500 \
MUON_WD=0.0 \
MATRIX_LR=0.04 \
SCALAR_LR=0.02 \
TIED_EMBED_LR=0.02 \
WARMDOWN_FRACTION=0.2 \
LOGIT_SOFTCAP=10 \
QK_GAIN_INIT=2.25 \
ROPE_TYPE=yarn \
YARN_MAX_LEN=2048 \
ROPE_BASE=5000 \
BATCH_TOKENS_START=0 \
BATCH_SCHEDULE_FRACTION=0.33 \
TRAIN_BATCH_TOKENS=524288 \
SEQ_LEN_START=0 \
SEQ_SCHEDULE_FRACTION=0.0 \
TRAIN_SEQ_LEN=1024 \
SMEAR=0 \
ITERATIONS=10000 \
WARMUP_STEPS=5 \
MAX_WALLCLOCK_SECONDS=599 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=1000 \
CHURN_LOG_EVERY=0 \
VAL_MAX_TOKENS=0 \
TIE_EMBEDDINGS=1 \
UNTIE_AT_FRACTION=0.00 \
HEAD_LR=0.02 \
CORR_WEIGHT_LR=0.02 \
ACTIVATION=relu2 \
SOFTCAP_TYPE=poly \
MTP_HEADS=0 \
REFINER=0 \
REFINER_KERNEL=3 \
SLIDING_EVAL=1 \
SLIDING_EVAL_STRIDE=16 \
SLIDING_BATCH_SIZE=256 \
TEMP_SCALING=1 \
FP_STORAGE=FP8 \
SEED=42 \
COMPILE_MODE=default \
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 train_gpt_cuda_ternary.py
```

Compliance

  • 3 seeds run on 8×H100 SXM
  • All 3 seeds train in <=600s (max: 599.7s)
  • All 3 seeds artifact <=16,000,000 bytes (max: 15,995,705)
  • Sliding window eval stride=16, consistent (std=0.0007)
  • No test-time training on validation data
  • No network calls during evaluation
  • No external compute

Contributor

0hq commented Mar 25, 2026

Really excellent work!

@ksang123

This is incredible work! Exactly what I hoped would happen when I submitted #139. The factored embedding and FP8 QAT for non-ternary params are really clever. Congrats on the record.

Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
… relu² 4xMLP FP8) (#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Proper experimental methodology: start from proven base, add ONE thing.

1. config_pr640_exact.sh — ZERO changes from PR openai#640 (control)
2. config_pr640_plus_eval.sh — + T=0.85 + stride=64 (eval only)
3. config_pr640_plus_brotli.sh — + brotli compression + eval tricks

Lesson from Config A failure (3.0 BPB): adding AOL + hash + brotli
simultaneously with ternary STE caused divergence. Incremental ablation
is required to identify which innovations are ternary-compatible.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Root cause of 1.1851 BPP: config was based on PR openai#640 (ternary)
hyperparameters applied to an int6 model. ALL top 5 merged entries use:
  - SEQ_LEN=2048 (we had 1024)
  - BATCH=786K (we had 524K)
  - MUON_MOMENTUM=0.99 (we had 0.95)
  - NS5=5 steps (we had 3-4)
  - SWA=1 + EMA=0.997 (we had SWA=0)
  - Weight decay 0.04 (we had 0.0)

Projected improvement: 1.1851 → ~1.14 BPP.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
monisha-max pushed a commit to monisha-max/parameter-golf that referenced this pull request Apr 4, 2026
…6 (on PR openai#549 stack)

Six orthogonal improvements on the current SOTA (PR openai#549, 1.1194 BPB):
- Polynomial degree-5 softcap replacing tanh (from ternary PR openai#640)
- Z-loss regularization (1e-4 * logsumexp^2) for sharper gradients
- YaRN positional encoding for better long-context handling
- zstd-22 compression (7MB artifact vs 16MB with LZMA-6)
- Sliding eval stride=16 (4x more context overlap)
- FlashAttention 3/2/SDPA graceful fallback

Smoke-tested on 1xH100 (Modal): 890 steps, healthy convergence.
Awaiting 8xH100 SXM verification for official scoring.
@MatoTeziTanka

Community Credit Note

This submission uses techniques pioneered by other participants in this competition without attribution. For the community record:

U-Net skip connections — Already present in @integrate-your-mind's PR #289 (2026-03-21, titled "SmearGate + BigramHash + Int6 + SWA + U-Net Skips"), @gowtham0992's PR #295 (2026-03-21), and @skarakulak's PR #507 (2026-03-23) — all submitted before this PR on 2026-03-24. Neither the PR body nor the 55-page results document cites any of these prior submissions.

The ternary quantization engineering, factored tied embedding, FP8 QAT, and the comprehensive ablation study are genuine original contributions. This note ensures the people whose foundational architectural work made this submission possible are recognized on the record. Open source runs on attribution.


Community credit review by @MatoTeziTanka, The Agora.

Contributor Author

CiprianFlorin-Ifrim commented Apr 12, 2026

@MatoTeziTanka ????
I have mentioned to you on Discord that it's better not to tag the whole OpenAI team on every PR or to send everyone AI-generated reviews unless people ask for them. You are not one of the people who run the competition, so you should leave that task to the organisers. Now you keep spamming my PRs with nonsense, some of them already closed and merged, as a form of retaliation, which is immature.

I will simply copy-paste my previous comment from a different PR, where you also posted the same thing, since you keep spamming:
"Please stop using AI in your messages and evaluations; you clearly do not have the context needed.

  1. These methods are not unique among the participants, and methods are common because they are good. It is like attributing credit to everyone else who used AdamW or Muon because someone else did so first; they are industry standards. Do you expect everyone to check all 2,000 PRs and offer 'credit'? The credit goes to the original authors of these industry standards.
  2. SmearGate is actually a method that was used in the NanoChat/GPT speedrun challenge, a challenge even mentioned in the readme of this one. And if you had inspected my analysis properly, you would have noted that SmearGate is actually worse for many of my submissions, and is not used.
  3. U-Net has been a staple of the AI field, much like point 1, with the paper having been released in 2015."

If these useless AI comments continue, I will report each one individually as unrelated to the PR. Do stop.

@MatoTeziTanka

U-Net is indeed well-established — the note should have said "applied in this competition by" rather than implying novelty. I'll tighten that language. The note explicitly credited your ternary quantization, factored tied embedding, FP8 QAT, and ablation study as original work.

123-code pushed a commit to 123-code/parameter-golf that referenced this pull request Apr 19, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
@cocohearts
Collaborator

@CiprianFlorin-Ifrim maintainer repro note: we are doing a pass over merged leaderboard rows and I cannot currently reproduce this ternary record.

What I tried:

Current result:

  • Submitted seed 42: final_sliding val_bpb=1.1565, stop step 6530, budget 15993853/16000000.
  • Targeted exact-visible-hparam rerun seed 42: final_sliding val_bpb=1.2280, stop step 6470, budget 16000737/16000000 (over budget).
  • The copied internal log hyperparameter line now differs from the submitted one only by run_id. The loss trajectory already diverges at early checkpoints (step 1000: loss 3.2448 vs submitted 3.3120; by step 6000: loss 3.2610 vs submitted 3.0211).

One remaining visible environment difference: your submitted log shows Python 3.13.12, while our audit rerun used Python 3.12.9; PyTorch is still 2.10.0+cu128. Could you reply here with any setup detail that is required but not captured by run_cuda_ternary.sh / the log? In particular: exact conda env, dataset snapshot/download command, FlashAttention wheel/source, or any hidden setup from your original machine. If there is a corrected repro bundle or exact container, please point us to it so we can rerun.
