Non-record: BitNet b1.58 + depth recurrence + NorMuon (1.7510 BPB, 3.78 MB) #126

Open
Athenox14 wants to merge 1 commit into openai:main from Athenox14:submission/bitnet158-depth-recurrence

Conversation

@Athenox14

Non-record submission: BitNet b1.58 + Depth Recurrence + NorMuon

val_bpb (ternary roundtrip): 1.7510 | 3.78 MB | Unlimited compute (~3h, 1×RTX 3060)

This submission explores combining BitNet b1.58 ternary quantization with depth recurrence to maximize model capacity within the 16 MB artifact limit.

Key ideas

Ternary packing (2 bits/weight): storing weights as {-1, 0, +1} packed 4-per-byte then zlib-compressed yields a 3.74 MB model — leaving 4× more parameter budget than an equivalent int8+zlib approach, enabling much larger models within the size limit.
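The packing scheme can be sketched in a few lines (helper names are hypothetical; the real exporter presumably operates on flat tensors rather than Python lists):

```python
import zlib

def pack_ternary(weights):
    """Pack ternary weights {-1, 0, +1} at 2 bits each (4 per byte), then zlib-compress."""
    codes = [w + 1 for w in weights]          # map -1/0/+1 -> codes 0/1/2
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)              # little-endian 2-bit fields
        packed.append(byte)
    return zlib.compress(bytes(packed))

def unpack_ternary(blob, n):
    """Invert pack_ternary; n is the original weight count (trailing pad bits dropped)."""
    out = []
    for byte in zlib.decompress(blob):
        for j in range(4):
            out.append(((byte >> (2 * j)) & 0b11) - 1)
    return out[:n]
```

The raw 2-bit packing is already 4× denser than int8 before compression, which is where the quoted parameter-budget headroom comes from; zlib then further exploits any skew toward zero codes.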

Depth recurrence + U-Net skips:

4 unique transformer blocks run 3× each = 12 effective layers, with learnable skip connections between encoder and decoder halves.
A per-block resid_mix parameter lets each recurrence pass blend the current hidden state with the original embedding, allowing blocks to specialize by depth despite shared weights.
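A toy forward pass illustrating the recurrence and the resid_mix blend (the blend direction and the U-Net skip gating here are assumptions; `skip_scale` is a hypothetical stand-in for the learnable skip weights):

```python
def depth_recurrent_forward(x0, blocks, repeats, resid_mix, skip_scale):
    """Toy depth recurrence: len(blocks) unique blocks applied `repeats` times
    each (4 blocks x 3 passes = 12 effective layers in the submission).
    resid_mix[b] blends the original embedding x0 back in before block b;
    encoder-half activations go on a stack and are added back, scaled by
    skip_scale, in the decoder half, U-Net style."""
    h = list(x0)
    total = len(blocks) * repeats
    stack, layer = [], 0
    for _ in range(repeats):
        for b, block in enumerate(blocks):
            m = resid_mix[b]
            h = [m * e + (1.0 - m) * v for e, v in zip(x0, h)]  # assumed blend direction
            h = block(h)
            if layer < total // 2:
                stack.append(list(h))        # encoder half: remember activation
            elif stack:
                skip = stack.pop()
                h = [v + skip_scale * s for v, s in zip(h, skip)]  # decoder half: skip in
            layer += 1
    return h
```

Because the blend sees the layer-independent embedding x0 with a per-block learned weight, the same shared block can behave differently on its first and third pass through the stack.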

NorMuon: Muon optimizer with per-neuron row-wise RMS normalization after Newton-Schulz orthogonalization, replacing the uniform scaling heuristic.
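A pure-Python sketch of that update step (the quintic coefficients follow the public Muon reference implementation; `normuon_update` and the matrix-as-list-of-lists representation are illustrative assumptions):

```python
import math

def matmul(A, B):
    # naive dense matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration approximating the nearest orthogonal
    matrix to G (coefficients as in the Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(v * v for row in G for v in row)) + 1e-7
    X = [[v / fro for v in row] for row in G]   # scale so spectral norm <= 1
    for _ in range(steps):
        A = matmul(X, [list(r) for r in zip(*X)])   # A = X X^T
        A2 = matmul(A, A)
        M = [[b * A[i][j] + c * A2[i][j] for j in range(len(A))] for i in range(len(A))]
        MX = matmul(M, X)
        X = [[a * x + mx for x, mx in zip(xr, mr)] for xr, mr in zip(X, MX)]
    return X

def normuon_update(G, steps=5, eps=1e-8):
    """NorMuon: orthogonalize the gradient, then RMS-normalize each row
    (one row per output neuron) instead of applying one uniform scale."""
    U = newton_schulz(G, steps)
    out = []
    for row in U:
        rms = math.sqrt(sum(v * v for v in row) / len(row)) + eps
        out.append([v / rms for v in row])
    return out
```

The row-wise normalization guarantees every neuron's update has unit RMS, whereas Muon's single sqrt-shaped scale factor only controls the update's overall magnitude.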

Sequence length warmup + YaRN: geometric warmup 128→1024 over 2000 steps with NTK-aware RoPE base scaling to stabilize early training.
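Both schedules are simple to sketch (argument names are hypothetical; the exponent in the base scaling follows the standard NTK-aware recipe):

```python
def seq_len_at(step, start=128, end=1024, warmup_steps=2000):
    """Geometric sequence-length warmup: start -> end over warmup_steps."""
    if step >= warmup_steps:
        return end
    return int(round(start * (end / start) ** (step / warmup_steps)))

def ntk_rope_base(base, train_len, target_len, head_dim):
    """NTK-aware RoPE base scaling: stretch the rotary base so angles at
    target_len stay within the range seen at train_len."""
    scale = target_len / train_len
    if scale <= 1.0:
        return base
    return base * scale ** (head_dim / (head_dim - 2))
```

Halfway through warmup the geometric schedule sits at the geometric mean of the endpoints (≈362 tokens for 128→1024), so early steps spend most of their compute on short, cheap sequences.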

Limitations & next steps

A significant quantization gap exists (pre-quant 1.4866 → post-quant 1.7510, Δ=+0.264 BPB), indicating the QAT does not sufficiently push latent weights toward {-1, 0, +1}.
Follow-up runs will add a ternary commitment loss to close this gap, and scale to ~60M unique parameters (still within the 16 MB budget).
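One simple form such a commitment loss could take (a sketch of the idea, not the submission's actual loss) is the mean squared distance to the nearest ternary value:

```python
def ternary_commitment_loss(latent_weights, strength=1.0):
    """Mean squared distance from each latent weight to its nearest value in
    {-1, 0, +1}; adding strength * this term to the training loss pushes
    latents toward exactly representable points, shrinking the gap between
    pre-quant and post-roundtrip BPB."""
    total = 0.0
    for w in latent_weights:
        nearest = -1.0 if w < -0.5 else (1.0 if w > 0.5 else 0.0)
        total += (w - nearest) ** 2
    return strength * total / len(latent_weights)
```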
