Non-record: BitNet b1.58 — 65M ternary params beat 4-hour baseline in 10 minutes (val_bpb=1.2029) #139

ksang123 wants to merge 1 commit into openai:main from
Conversation
…s 4h baseline in 10min via Chinchilla scaling
Pull request overview
Adds a new non-record 16MB submission entry implementing BitNet b1.58 ternary-weight training + base-3 packing to fit ~64.5M params into a ~15.1MB artifact, along with the exact run script, logs, and write-up.
Changes:
- Introduces a new `train_gpt.py` for this record that swaps core linear layers to ternary BitLinear (STE) and adds ternary/base-3 packing export + roundtrip eval.
- Adds reproducibility artifacts: the 8×H100 run script, full training log, and `submission.json` metadata.
- Adds a README write-up describing the scaling rationale and method.
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-19_BitNet158/train_gpt.py | New BitNet training script with ternary layers and ternary artifact serialization. |
| records/track_non_record_16mb/2026-03-19_BitNet158/run_8xh100.sh | Reproduction script for the submission run. |
| records/track_non_record_16mb/2026-03-19_BitNet158/train.log | Captured output from the submission run. |
| records/track_non_record_16mb/2026-03-19_BitNet158/submission.json | Reported metrics and artifact sizing metadata. |
| records/track_non_record_16mb/2026-03-19_BitNet158/README.md | Method/results write-up for the submission. |
…ding window) BitNet b1.58 ternary quantization with full-training STE. 68M params in 15.88MB via base-3 packing (1.6 bits/param). Near-lossless roundtrip (0.0016 BPB gap).

Systematic analysis of why the standard competition stack breaks for ternary:
- XSA, weight decay, grad clipping: cause a training plateau at 2.4
- SmearGate, BigramHash, OrthoInit: hurt or have no effect
- EMA/SWA: fundamentally incompatible
- TTT: no improvement on ternary models

What works: higher LR (0.04), wider MLP, fp16 scale simulation, longer warmdown. Improves on PR openai#139 (1.2029 → 1.1770).
@ksang123 I feel like tagging my PR in this, and also leaving a kind-of snarky comment on my own PR saying it was "exactly what I hoped would happen when I submitted 139", is a bit disingenuous — it reads as if my work extends or builds upon yours. My work started as soon as the challenge was announced, and as the creator of the Bitnet Rust library/bitnet-llm, with a lot of private work and research done on Bitnets (and general int8 transformers) and their application on low-power MCUs, it made sense to work on that for this challenge given the constraints. I simply did not want to submit something just to "hold a place in the queue that I could update later" before it was complete; otherwise, within 3h on the day of the challenge I could have been under 1.20 and mentioned how I was first. You can find a proper research document on the circa 250 runs I did here: Results Document (formatted by Claude from my RESULTS.md file). Do not conflate the two submissions.
@CiprianFlorin-Ifrim I didn't mean to imply your work built on mine; I know it didn't. It was extremely good and I'm genuinely glad the ternary approach was pushed this far. Separately, I do wish my submissions had gotten reviewed.
All love — it could be that the title gave the wrong idea? I see people using "Record" for the main leaderboard and "non-record" for "I am uploading something now to have a PR here, but will update it later", which imo adds pressure on the organisers to manage the hundreds of them. Mine used the specific leaderboard name, "notable non record runs". The timing matters too, as it looks like they want the notable non-record leaderboard to be just for weird stuff with unlimited compute.
BitNet b1.58: 64.5M Ternary Parameters in 15.1MB
val_bpb: 1.2029 (post-roundtrip) | 15.11 MB | 8×H100, 10 minutes
The idea
The baseline's 17.1M params are saturated long before the wallclock runs out (T/N ≈
424×). Chinchilla says: fit more parameters. BitNet b1.58 lets me pack 64.5M ternary
{-1, 0, 1} parameters into 15.1MB via base-3 encoding at 1.6 bits/param — 3.8× more
params than the baseline in the same artifact size.
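The 1.6 bits/param figure follows from packing 5 trits per byte: 3^5 = 243 ≤ 256, so one byte holds five ternary values, i.e. 8/5 = 1.6 bits each. Below is a minimal, dependency-free sketch of such a base-3 pack/unpack roundtrip; the function names are illustrative, not the ones in the repo's `train_gpt.py`, which operates on tensors rather than Python lists.

```python
def pack_ternary(trits):
    """Pack a flat list of ternary values {-1, 0, 1} into bytes.

    5 trits per byte, since 3**5 = 243 <= 256 -> 8/5 = 1.6 bits/param.
    """
    vals = [t + 1 for t in trits]          # map {-1, 0, 1} -> {0, 1, 2}
    pad = (-len(vals)) % 5                 # pad to a multiple of 5
    vals += [0] * pad
    out = bytearray()
    for i in range(0, len(vals), 5):
        b = 0
        for j, v in enumerate(vals[i:i + 5]):
            b += v * (3 ** j)              # base-3 place value, max 242
        out.append(b)
    return bytes(out), len(trits)

def unpack_ternary(packed, n):
    """Invert pack_ternary; n is the original trit count."""
    trits = []
    for b in packed:
        for _ in range(5):
            trits.append(b % 3 - 1)        # back to {-1, 0, 1}
            b //= 3
    return trits[:n]
```

Because 243 of the 256 byte values are used, the roundtrip is exact; the 0.0016 BPB gap reported in the PR comes from scale handling, not from the packing itself.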
What's different
- Weights are ternary in the forward pass throughout training (STE backward), so the exported artifact is lossless. No post-training quantization fight.
- At inference the scales match what the model saw during training.
- Base-3 packing stores weights near the trit minimum (log2 3 ≈ 1.58 bits).
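The ternarization BitNet b1.58 describes uses an absmean scale: γ = mean(|W|), then each weight becomes round(w/γ) clamped to {-1, 0, 1}. The repo's `train_gpt.py` presumably does this on PyTorch tensors with a straight-through estimator; the sketch below is a dependency-free illustration with the STE noted as a comment, and the function name is hypothetical.

```python
def ternary_quantize(W, eps=1e-8):
    """BitNet b1.58-style absmean ternarization of a weight matrix.

    scale = mean(|W|); each weight maps to round(w / scale) clamped to
    {-1, 0, 1}. During training the straight-through estimator (STE)
    treats the rounding as identity in the backward pass, e.g. in
    PyTorch: W_used = W + (quantize(W) * scale - W).detach()
    """
    flat = [abs(w) for row in W for w in row]
    scale = sum(flat) / max(len(flat), 1) + eps   # absmean scale gamma
    Wq = [[max(-1, min(1, round(w / scale))) for w in row] for row in W]
    return Wq, scale
```

Since the same quantized weights and scale are used for the forward pass during training and for the exported artifact, there is no separate post-training quantization step to drift away from.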
Results
A 10-minute ternary model beats 4 hours of full-precision training. The full write-up with scaling law analysis is in the README.