Non-record: BitNet b1.58 + depth recurrence + NorMuon (1.7510 BPB, 3.78 MB) #126
Open
Athenox14 wants to merge 1 commit into openai:main from
Conversation
Non-record submission: BitNet b1.58 + Depth Recurrence + NorMuon
val_bpb (ternary roundtrip): 1.7510 | 3.78 MB | Unlimited compute (~3h, 1×RTX 3060)
This submission explores combining BitNet b1.58 ternary quantization with depth recurrence to maximize model capacity within the 16 MB artifact limit.
Key ideas
Ternary packing (2 bits/weight): storing weights as {-1, 0, +1} packed 4-per-byte then zlib-compressed yields a 3.74 MB model — leaving 4× more parameter budget than an equivalent int8+zlib approach, enabling much larger models within the size limit.
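The packing scheme described above can be sketched as follows. This is a minimal illustration of 2-bit ternary packing plus zlib, not the PR's actual code; function names and the byte layout are assumptions.

```python
import zlib
import numpy as np

def pack_ternary(w):
    """Map {-1, 0, +1} -> {0, 1, 2}, pack 4 codes per byte, then zlib-compress."""
    codes = (w.astype(np.int8) + 1).astype(np.uint8)
    pad = (-len(codes)) % 4                      # pad to a multiple of 4 codes
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    packed = (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)
    return zlib.compress(packed.tobytes(), 9), len(w)

def unpack_ternary(blob, n):
    """Invert pack_ternary: decompress, unpack 2-bit fields, map back to {-1, 0, +1}."""
    packed = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)[:n]
    return codes.astype(np.int8) - 1

w = np.random.randint(-1, 2, size=10_000)
blob, n = pack_ternary(w)
assert np.array_equal(unpack_ternary(blob, n), w)   # lossless roundtrip
```

At 2 bits per weight the raw stream is already 4× smaller than int8; zlib then exploits the skewed code distribution (many zeros) for further gains.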
Depth recurrence + U-Net skips: 4 unique transformer blocks run 3× each = 12 effective layers, with learnable skip connections between encoder and decoder halves. A per-block `resid_mix` parameter lets each recurrence pass blend the current hidden state with the original embedding, allowing blocks to specialize by depth despite shared weights.
NorMuon: Muon optimizer with per-neuron row-wise RMS normalization after Newton–Schulz orthogonalization, replacing the uniform scaling heuristic.
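The recurrence with `resid_mix` blending can be sketched as below (U-Net skips omitted for brevity). All names are illustrative, and in practice `resid_mix` would be a learned, sigmoid-squashed parameter rather than a plain constant:

```python
def recurrent_forward(x0, blocks, n_passes=3, resid_mix=None):
    """Run the unique blocks n_passes times (4 blocks x 3 passes = 12 effective layers).

    Before each block, the current hidden state h is blended with the original
    embedding x0 via a per-(pass, block) mixing coefficient m in [0, 1], so the
    same shared weights can behave differently at different effective depths.
    """
    h = x0
    for p in range(n_passes):
        for i, block in enumerate(blocks):
            m = resid_mix[p][i]
            h = block(m * h + (1 - m) * x0)
    return h
```

With `m = 1` the model is a plain weight-tied stack; with `m < 1` later passes are re-anchored to the input embedding.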
Sequence length warmup + YaRN: geometric warmup 128→1024 over 2000 steps with NTK-aware RoPE base scaling to stabilize early training.
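The warmup schedule and base scaling might look like the following sketch. The geometric interpolation matches the 128→1024-over-2000-steps description; the NTK-aware formula is the standard simplified variant (full YaRN additionally interpolates per frequency band), and all names are assumptions:

```python
def seq_len_at(step, start=128, end=1024, warmup_steps=2000):
    """Geometric sequence-length warmup: start -> end over warmup_steps."""
    if step >= warmup_steps:
        return end
    return int(start * (end / start) ** (step / warmup_steps))

def ntk_rope_base(base, train_len, cur_len, head_dim):
    """NTK-aware RoPE base scaling for context lengths beyond train_len."""
    s = max(cur_len / train_len, 1.0)
    return base * s ** (head_dim / (head_dim - 2))
```

Short early sequences keep gradient noise manageable; rescaling the RoPE base as the context grows keeps positional frequencies consistent across the warmup.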
Limitations & next steps
A significant quantization gap exists (pre-quant 1.4866 → post-quant 1.7510, Δ=+0.264 BPB), indicating the QAT does not sufficiently push latent weights toward {-1, 0, +1}.
Follow-up runs add a ternary commitment loss to address this, and scale to ~60M unique parameters (still within the 16 MB budget).
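One plausible form of such a commitment loss is sketched below; the PR names only the idea, not a formula, so the exact penalty (here a mean-squared distance to the nearest scaled ternary value) is an assumption:

```python
import numpy as np

def ternary_commitment_loss(w, scale):
    """Penalize latent weights for straying from their nearest ternary level.

    target is the closest value in {-scale, 0, +scale}; the MSE to it pushes
    latent weights toward points that quantize losslessly, shrinking the gap
    between pre-quant and post-quant BPB.
    """
    target = scale * np.clip(np.round(w / scale), -1, 1)
    return float(np.mean((w - target) ** 2))
```

Added to the training objective with a small coefficient, this term is zero exactly when every weight already sits on a ternary level.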