
Non-record: BitNet b1.58 - 68M ternary params, val_bpb=1.1770, systematic analysis of ternary limitations#367

Open
ksang123 wants to merge 1 commit into openai:main from ksang123:bitnet-68M-systematic-analysis

Conversation

@ksang123

Improves on PR #139 (1.2029 → 1.1770). 68M ternary {-1,0,1} params packed at 1.6 bits/param in 15.88MB via base-3 encoding.

Key findings:

  • The entire standard competition stack (XSA, SmearGate, BigramHash, OrthoInit, WD, EMA/SWA, TTT) either breaks ternary models or provides no benefit
  • XSA and weight decay cause complete training plateaus at val_loss 2.4 — ternary is a fundamentally different optimization regime
  • Near-lossless quantization roundtrip (0.0016 BPB gap) via fp16 scale simulation during training
  • Ternary prefers higher LR (0.04 vs 0.025), no regularization, and longer warmdown — the opposite of int6 best practices
  • Suggests int4 with late QAT as an unexplored middle ground: 50% more params than int6 with near-zero quant gap

Full writeup with negative results table in the README.
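As a sketch of how base-3 packing reaches 1.6 bits/param: 3^5 = 243 ≤ 256, so five trits fit in one byte, i.e. 8/5 = 1.6 bits per parameter. Function names below are illustrative, not taken from the PR's code:

```python
import numpy as np

def pack_ternary(w):
    """Pack ternary weights {-1, 0, 1} into bytes, 5 trits per byte (base-3)."""
    trits = (w.astype(np.int64) + 1).ravel()              # map {-1,0,1} -> {0,1,2}
    pad = (-len(trits)) % 5                               # pad to a multiple of 5
    trits = np.concatenate([trits, np.zeros(pad, dtype=np.int64)])
    groups = trits.reshape(-1, 5)
    powers = 3 ** np.arange(5)                            # base-3 place values
    return (groups @ powers).astype(np.uint8), w.size     # max digit value: 242

def unpack_ternary(packed, n):
    """Invert pack_ternary: recover the first n trits as {-1, 0, 1}."""
    vals = packed.astype(np.int64)[:, None] // (3 ** np.arange(5)) % 3
    return (vals.ravel()[:n] - 1).astype(np.int8)

w = np.array([-1, 0, 1, 1, -1, 0, 1], dtype=np.int8)
packed, n = pack_ternary(w)                               # 7 trits -> 2 bytes
assert np.array_equal(unpack_ternary(packed, n), w)
```

The roundtrip is exact, so the packed file size is determined purely by the parameter count: ceil(68M / 5) bytes plus whatever higher-precision tensors (e.g. scales) ship alongside.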

…ding window)

BitNet b1.58 ternary quantization with full-training STE. 68M params in 15.88MB
via base-3 packing (1.6 bits/param). Near-lossless roundtrip (0.0016 BPB gap).
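A minimal sketch of the kind of full-training STE ternarization described above (per-tensor absmean quantization in the BitNet b1.58 style; the function name and the fp16-scale line are my reading of the PR's "fp16 scale simulation", not its actual code):

```python
import torch

def ternary_quantize_ste(w: torch.Tensor) -> torch.Tensor:
    """Absmean ternarization with a straight-through estimator.

    Forward: scale by mean |w|, round to {-1, 0, 1}, rescale.
    Backward: gradients pass through unchanged (the detach trick).
    """
    scale = w.abs().mean().clamp(min=1e-5)
    scale = scale.half().float()                  # assumed: simulate fp16 scale storage
    w_q = (w / scale).round().clamp_(-1, 1) * scale
    return w + (w_q - w).detach()                 # STE: forward w_q, gradient of identity
```

Because the same ternarized forward path is used throughout training (rather than quantizing only at export), the deployed model sees essentially the weights it was trained with, which is consistent with the near-lossless 0.0016 BPB roundtrip gap reported here.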

Systematic analysis of why the standard competition stack breaks for ternary:
- XSA, weight decay, grad clipping: cause training plateau at 2.4
- SmearGate, BigramHash, OrthoInit: hurt or no effect
- EMA/SWA: fundamentally incompatible
- TTT: no improvement on ternary models

What works: higher LR (0.04), wider MLP, fp16 scale simulation, longer warmdown.

Improves on PR openai#139 (1.2029 → 1.1770).
