
Record Submission: 1.1570 BPB - 73.7M Ternary U-Net + NeoMuon + 4x relu²MLP + Factored Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8QAT + Bitmask-LZMA + Stride-16 Sliding#640

Merged
0hq merged 1 commit into openai:main from CiprianFlorin-Ifrim:submission-ternary on Mar 25, 2026

Conversation

Contributor

@CiprianFlorin-Ifrim CiprianFlorin-Ifrim commented Mar 24, 2026

Record: 1.1570 BPB — 73.7M Ternary U-Net Transformer

BitNet b1.58 + 10L + NeoMuon + 4x relu² MLP + Factored Tied Embedding + Poly5 Softcap + YaRN 2048 + 8192 BPE + FP8 QAT + Base-3 LZMA + Stride-16 Sliding Eval

val_bpb: 1.1570 (3-seed mean sliding, std 0.0007) | 15.99 MB max artifact | 8×H100 SXM, 599s

Full experiment log covering 250+ runs, ablations, and decision rationale that may help other submissions: RESULTS.md. Complete training logs in my personal repo: logs/.

The linked results document covers every method and sweep applied to both binary and ternary BitNets, which are unfortunately incompatible with many techniques, including Tversky layers, EMA, Muon weight decay, and LM logit head ranking. The scaling ratios and the lists of applicable/rejected techniques may be useful for other submissions too.

Results (3 seeds, 8×H100 SXM)

| Seed | Steps | ms/step | Sliding BPB (s16) | val_bpb | RT bpb | Artifact |
|------|-------|---------|-------------------|---------|--------|----------|
| 42   | 6,530 | 91.7 | 1.1565 | 1.1816 | 1.1837 | 15,993,853 bytes |
| 1337 | 6,520 | 91.9 | 1.1568 | 1.1825 | 1.1839 | 15,995,705 bytes |
| 7    | 6,530 | 91.8 | 1.1578 | 1.1823 | 1.1850 | 15,992,753 bytes |
| Mean | 6,527 | 91.8 | 1.1570 | 1.1821 | 1.1842 | 15,994,104 bytes |
| Std  | 5     | 0.1  | 0.0007 | 0.0005 | 0.0007 | 1,498 bytes |

Architecture

  • 10 transformer layers, dim=768, 8 heads, 4 KV heads (GQA), head_dim=96
  • BitNet b1.58 ternary quantisation: weights {-1, 0, +1}, ~1.6 bits/param, per-group (128) absmean scaling
  • 4x MLP expansion (hidden=3072) with relu² activation, fused gate+up projection
  • U-Net encoder/decoder with learned skip weights (ones-init) and per-block residual mix from input embedding
  • Factored tied embedding: 8192×254 bottleneck with learned 254-to-768 and 768-to-254 projections
  • Polynomial softcap (degree 5, cap=10) with Z-loss regularisation (1e-4)
  • YaRN positional encoding (max_len=2048, ROPE_BASE=5000)
  • Fused QKV projection (single TernaryLinear)
  • FlashAttention-3 (Hopper native kernels)
  • 73.7M parameters, 15.92MB artifact (64.9M ternary + 2.5M fp8 + 70KB code)
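The quantisation recipe in the second bullet can be sketched in a few lines. This is an illustrative NumPy reconstruction of per-group absmean ternary quantisation in the BitNet b1.58 style, not the submission's training code; the function name and shapes are assumptions:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, group_size: int = 128):
    """Per-group absmean ternary quantisation (BitNet b1.58 style).

    Each group of `group_size` weights is scaled by its mean absolute
    value, then rounded and clipped to {-1, 0, +1}. Illustrative only.
    """
    flat = w.reshape(-1, group_size)
    scale = np.maximum(np.abs(flat).mean(axis=1, keepdims=True), 1e-8)
    q = np.clip(np.round(flat / scale), -1, 1)
    return q.reshape(w.shape), scale
```

During QAT the forward pass would use the dequantised weights (`q * scale`) with a straight-through estimator in the backward pass, which is the STE gradient path the training notes refer to.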

Key Techniques

Architecture

  • Width over depth: 768d/10L outperforms 512d/25L — faster steps (91ms vs 127ms) yield 6,530 vs 4,720 steps in 600s
  • 4x relu² MLP: relu² is -0.024 bpb over relu at zero cost; 4x width adds -0.008 bpb over 3x at same step budget
  • EMBED_DIM=254: frees ~4MB for wider MLP; 254 = 256-2 to fit code within the byte budget
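As a sanity check on the "frees ~4MB" figure, the factored-embedding parameter arithmetic works out as follows (illustrative; assumes roughly one byte per stored parameter, in line with the fp8 storage):

```python
VOCAB, D_MODEL, D_EMBED = 8192, 768, 254

full = VOCAB * D_MODEL                               # unfactored 8192x768 embedding table
factored = VOCAB * D_EMBED + 2 * D_EMBED * D_MODEL   # 8192x254 bottleneck + 254<->768 projections
saved = full - factored
print(full, factored, saved)  # 6291456 2470912 3820544
```

Roughly 3.8M parameters saved, i.e. close to 4 MB at one byte per parameter, consistent with the bullet above.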

Training

  • NeoMuon with 3 Newton-Schulz steps: compensates for ternary STE gradient attenuation; 3 steps equivalent to 5 at convergence (+190 free steps)
  • Fused QKV + fused relu²: ~4-6ms/step saving (~180 extra training steps)
  • FlashAttention-3: -9% step time (~380 free steps)
  • 524k batch tokens: optimal for ternary STE — 262k too noisy, 1M loses gradient updates
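The Newton-Schulz step trade-off can be illustrated with the standard quintic iteration used by Muon-style optimizers. This NumPy sketch is not the NeoMuon implementation; the coefficients are the commonly used quintic and are an assumption here:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 3) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix via the quintic
    Newton-Schulz iteration common in Muon-style optimizers.

    The PR runs 3 backend steps (MUON_BACKEND_STEPS=3) instead of the
    usual 5, trading a slightly looser orthogonalization for ~190 extra
    training steps. Illustrative sketch only.
    """
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used quintic coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm bounds the spectral norm
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

After a few steps the singular values land near 1, so the update direction is roughly orthogonal even with the reduced step count.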

Evaluation

  • Temperature scaling (T=0.90): chosen via a 5-point grid on training tokens; relu² logits are slightly underconfident
  • Sliding window (stride=16): full context per scored token, ~0.025 bpb improvement over chunked eval
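The stride-16 sliding evaluation amounts to simple index arithmetic: each window scores only its last `stride` tokens, so every scored token keeps near-full left context. A hypothetical sketch (function name and tuple layout are illustrative, not the submission's code):

```python
def sliding_eval_windows(n_tokens: int, window: int = 1024, stride: int = 16):
    """Return (start, end, score_from) triples for sliding evaluation.

    Each window spans [start, end) but only tokens [score_from, end) are
    scored, so every token is scored exactly once with up to
    `window - stride` tokens of preceding context. Illustrative sketch.
    """
    windows = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        windows.append((start, end, pos))
        pos = end
    return windows
```

Smaller strides mean more forward passes per scored token, which is why the stride-16 setting needs the large SLIDING_BATCH_SIZE to stay cheap at eval time.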

Compression

  • Base-3 + LZMA (preset=9): 5 trits/byte packing, 39% reduction over int8+zlib; auto-compared against bitmask per run
  • FP8 QAT (e4m3): halves fp_params (~5MB to ~2.5MB), only 0.002 bpb RT penalty
  • Shrinkage fix: corrects ternary zero-fraction scale mismatch, eliminating all roundtrip gaps
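The 5-trits-per-byte packing relies on 3⁵ = 243 fitting in a byte. A minimal stdlib sketch of the pack/unpack round trip (illustrative, not the submission's serializer):

```python
import lzma

def pack_trits(trits):
    """Pack ternary values in {-1, 0, +1} at 5 trits/byte (3**5 = 243 <= 256),
    then compress with LZMA at preset 9. Illustrative sketch."""
    data = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)  # map {-1, 0, +1} -> {0, 1, 2}
        data.append(byte)
    return lzma.compress(bytes(data), preset=9)

def unpack_trits(blob, n):
    """Invert pack_trits; `n` trims the padding trits in the final byte."""
    trits = []
    for byte in lzma.decompress(blob):
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]
```

LZMA then squeezes the remaining redundancy (243 of 256 byte values used, plus the skewed trit distribution), which is where the reported 39% gain over int8+zlib comes from.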

Setup and Run

```bash
# Environment setup (conda + Python 3.13 + PyTorch + FlashAttention-3 + Triton + dataset)
bash setup.sh

# Activate and run
conda activate golf
SEED=42 bash run_cuda_ternary.sh
```
Full run command
```bash
RUN_ID=ternary_run \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
ATTN_PROJ_TYPE=standard \
LOGIT_HEAD_TYPE=standard \
TVERSKY_MEMBERSHIP=sigmoid \
TVERSKY_NUM_FEATURES=0 \
TVERSKY_FEATURE_POOLS=0 \
VOCAB_SIZE=8192 \
BITNET_GROUP_SIZE=128 \
BIGRAM_HASH=0 \
EMBED_DIM=254 \
TRAINING_DEPTH_RECURRENCE=0 \
EVAL_DEPTH_RECURRENCE=0 \
NUM_LAYERS=10 \
MODEL_DIM=768 \
NUM_KV_HEADS=4 \
NUM_HEADS=8 \
DIFF_ATTN=0 \
MLP_MULT=4 \
MLP_GROUPS=0 \
MATRIX_OPTIMIZER=muon \
ADAM_LR=0.05 \
ADAM_WD=0.05 \
MUON_BACKEND_STEPS=3 \
MUON_MOMENTUM=0.95 \
MUON_MOMENTUM_WARMUP_START=0.85 \
MUON_MOMENTUM_WARMUP_STEPS=500 \
MUON_WD=0.0 \
MATRIX_LR=0.04 \
SCALAR_LR=0.02 \
TIED_EMBED_LR=0.02 \
WARMDOWN_FRACTION=0.2 \
LOGIT_SOFTCAP=10 \
QK_GAIN_INIT=2.25 \
ROPE_TYPE=yarn \
YARN_MAX_LEN=2048 \
ROPE_BASE=5000 \
BATCH_TOKENS_START=0 \
BATCH_SCHEDULE_FRACTION=0.33 \
TRAIN_BATCH_TOKENS=524288 \
SEQ_LEN_START=0 \
SEQ_SCHEDULE_FRACTION=0.0 \
TRAIN_SEQ_LEN=1024 \
SMEAR=0 \
ITERATIONS=10000 \
WARMUP_STEPS=5 \
MAX_WALLCLOCK_SECONDS=599 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=1000 \
CHURN_LOG_EVERY=0 \
VAL_MAX_TOKENS=0 \
TIE_EMBEDDINGS=1 \
UNTIE_AT_FRACTION=0.00 \
HEAD_LR=0.02 \
CORR_WEIGHT_LR=0.02 \
ACTIVATION=relu2 \
SOFTCAP_TYPE=poly \
MTP_HEADS=0 \
REFINER=0 \
REFINER_KERNEL=3 \
SLIDING_EVAL=1 \
SLIDING_EVAL_STRIDE=16 \
SLIDING_BATCH_SIZE=256 \
TEMP_SCALING=1 \
FP_STORAGE=FP8 \
SEED=42 \
COMPILE_MODE=default \
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 train_gpt_cuda_ternary.py
```

Compliance

  • 3 seeds run on 8×H100 SXM
  • All 3 seeds train in <=600s (max: 599.7s)
  • All 3 seeds artifact <=16,000,000 bytes (max: 15,995,705)
  • Sliding window eval stride=16, consistent (std=0.0007)
  • No test-time training on validation data
  • No network calls during evaluation
  • No external compute

Contributor

0hq commented Mar 25, 2026

Really excellent work!

@ksang123

This is incredible work! Exactly what I hoped would happen when I submitted #139. The factored embedding and FP8 QAT for non-ternary params are really clever. Congrats on the record.

Mistobaan pushed a commit to Mistobaan/parameter-golf that referenced this pull request Mar 25, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
TimS-ml referenced this pull request in TimS-ml/parameter-golf-autoresearch Mar 26, 2026
… relu² 4xMLP FP8) (#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
nedcut pushed a commit to nedcut/parameter-golf that referenced this pull request Mar 26, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
anish-krishnan pushed a commit to anish-krishnan/parameter-golf that referenced this pull request Mar 30, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Proper experimental methodology: start from proven base, add ONE thing.

1. config_pr640_exact.sh — ZERO changes from PR openai#640 (control)
2. config_pr640_plus_eval.sh — + T=0.85 + stride=64 (eval only)
3. config_pr640_plus_brotli.sh — + brotli compression + eval tricks

Lesson from Config A failure (3.0 BPB): adding AOL + hash + brotli
simultaneously with ternary STE caused divergence. Incremental ablation
is required to identify which innovations are ternary-compatible.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
theLightArchitect added a commit to theLightArchitect/parameter-golf that referenced this pull request Mar 30, 2026
Root cause of 1.1851 BPP: config was based on PR openai#640 (ternary)
hyperparameters applied to an int6 model. ALL top 5 merged entries use:
  - SEQ_LEN=2048 (we had 1024)
  - BATCH=786K (we had 524K)
  - MUON_MOMENTUM=0.99 (we had 0.95)
  - NS5=5 steps (we had 3-4)
  - SWA=1 + EMA=0.997 (we had SWA=0)
  - Weight decay 0.04 (we had 0.0)

Projected improvement: 1.1851 → ~1.14 BPP.

Co-Authored-By: Kevin Tan <kft@lightarchitects.io>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Itssshikhar pushed a commit to Itssshikhar/parameter-golf that referenced this pull request Mar 31, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
monisha-max pushed a commit to monisha-max/parameter-golf that referenced this pull request Apr 4, 2026
…6 (on PR openai#549 stack)

Six orthogonal improvements on the current SOTA (PR openai#549, 1.1194 BPB):
- Polynomial degree-5 softcap replacing tanh (from ternary PR openai#640)
- Z-loss regularization (1e-4 * logsumexp^2) for sharper gradients
- YaRN positional encoding for better long-context handling
- zstd-22 compression (7MB artifact vs 16MB with LZMA-6)
- Sliding eval stride=16 (4x more context overlap)
- FlashAttention 3/2/SDPA graceful fallback

Smoke-tested on 1xH100 (Modal): 890 steps, healthy convergence.
Awaiting 8xH100 SXM verification for official scoring.
@MatoTeziTanka

Community Credit Note

This submission uses techniques pioneered by other participants in this competition without attribution. For the community record:

U-Net skip connections — Already present in @integrate-your-mind's PR #289 (2026-03-21, titled "SmearGate + BigramHash + Int6 + SWA + U-Net Skips"), @gowtham0992's PR #295 (2026-03-21), and @skarakulak's PR #507 (2026-03-23) — all submitted before this PR on 2026-03-24. Neither the PR body nor the 55-page results document cites any of these prior submissions.

The ternary quantization engineering, factored tied embedding, FP8 QAT, and the comprehensive ablation study are genuine original contributions. This note ensures the people whose foundational architectural work made this submission possible are recognized on the record. Open source runs on attribution.


Community credit review by @MatoTeziTanka, The Agora.

Contributor Author

CiprianFlorin-Ifrim commented Apr 12, 2026

@MatoTeziTanka ????
I have mentioned to you on Discord that it's better not to tag the whole OpenAI team on every PR or to send everyone AI-generated reviews unless people ask for them. You are not one of the people who run the competition, so you should leave that task to the organisers. Now you keep spamming my PRs with nonsense, some of them already closed and merged, as a form of retaliation, which is immature.

I will simply copy-paste my previous comment from a different PR, where you also posted the same thing, since you keep spamming:
"Please stop using AI in your messages and evaluations; you clearly do not have the context needed.

  1. These methods are not unique among the participants, and methods are common because they are good. It is like attributing credit to everyone else who used AdamW or Muon because someone else did so first; they are industry standards. Do you expect everyone to check all 2,000 PRs and offer 'credit'? The credit goes to the original authors of these industry standards.
  2. SmearGate is actually a method that was used in the NanoChat/GPT speedrun challenge, a challenge even mentioned in the readme of this one. And if you had inspected my analysis properly, you would have noted that SmearGate is actually worse for many of my submissions, and is not used.
  3. U-Net has been a staple of the AI field, much like point 1, with the paper having been released in 2015."

If these useless AI comments continue, I will report each one individually as unrelated to the PR. Do stop.

@MatoTeziTanka

U-Net is indeed well-established — the note should have said "applied in this competition by" rather than implying novelty. I'll tighten that language. The note explicitly credited your ternary quantization, factored tied embedding, FP8 QAT, and ablation study as original work.

123-code pushed a commit to 123-code/parameter-golf that referenced this pull request Apr 19, 2026
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
@cocohearts
Collaborator

@CiprianFlorin-Ifrim maintainer repro note: we are doing a pass over merged leaderboard rows and I cannot currently reproduce this ternary record.

What I tried:

Current result:

  • Submitted seed 42: final_sliding val_bpb=1.1565, stop step 6530, budget 15993853/16000000.
  • Targeted exact-visible-hparam rerun seed 42: final_sliding val_bpb=1.2280, stop step 6470, budget 16000737/16000000 (over budget).
  • The copied internal log hyperparameter line now differs from the submitted one only by run_id. The loss trajectory already diverges at early checkpoints (step 1000: loss 3.2448 vs submitted 3.3120; by step 6000: loss 3.2610 vs submitted 3.0211).

One remaining visible environment difference: your submitted log shows Python 3.13.12, while our audit rerun used Python 3.12.9; PyTorch is still 2.10.0+cu128. Could you reply here with any setup detail that is required but not captured by run_cuda_ternary.sh / the log? In particular: exact conda env, dataset snapshot/download command, FlashAttention wheel/source, or any hidden setup from your original machine. If there is a corrected repro bundle or exact container, please point us to it so we can rerun.
