[Non-Record] Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828)#599
Open
mkenney2 wants to merge 1 commit into openai:main from
Conversation
First competitive non-transformer architecture. 7-layer hybrid model running attention and Mamba SSM in parallel within each block.

Mean val_bpb: 1.1828 ± 0.0036 (3 seeds, 8×H100)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request on Mar 28, 2026
… Golf

First functional SSM architecture in Parameter Golf with zero throughput penalty.

Previous SSM attempts (Hymba, PR openai#599) used Mamba's selective scan, which requires custom CUDA kernels, resulting in a 3.4× throughput penalty. S4D-Lin replaces this with standard F.conv1d — pure PyTorch, fully torch.compile compatible.

Results: 1.1682 bpb post-GPTQ-int5 (vs 1.1194 SOTA). The throughput problem is solved (116 ms/step, matching baseline), but attention quality > SSM quality in lower layers at this scale. Detailed analysis of why, plus lessons for future SSM work.

Checks off "State-space models" from Requests for PRs.
Hymba: Hybrid Attention + Mamba SSM
First competitive non-transformer architecture in the competition.
A 7-layer hybrid model that runs attention and Mamba SSM in parallel within each block, reaching a mean val_bpb of 1.1828 over 3 seeds on 8×H100 in 10 minutes, beating the naive transformer baseline (1.2244).
Architecture
Each block runs two branches in parallel on the same input:

- an attention branch, and
- a Mamba SSM branch (mamba-ssm) with causal conv1d and a gated projection.

Outputs are merged with a learned weighted average: sigmoid(α) · attn + (1 − sigmoid(α)) · mamba, then projected through a shared output layer.

Key design: the input projection for K, V, and the Mamba inputs (x, gate) is fused into a single matmul for GPU efficiency. Q is projected separately to allow for future factorization.
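The merge rule above can be sketched as a minimal PyTorch module. The real PR uses mamba-ssm's selective scan and a fused K/V/Mamba projection; here both branches are plain `nn.Linear` stand-ins (assumptions) so only the learned-gate merge is shown.

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """Sketch of the parallel attention + Mamba merge.

    `attn` and `mamba` are hypothetical stand-ins for the real branches;
    only the sigmoid(alpha)-weighted merge and shared output projection
    follow the description in the PR.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, dim)   # stand-in for the attention branch
        self.mamba = nn.Linear(dim, dim)  # stand-in for the Mamba SSM branch
        # Learned mixing logit; sigmoid(0) = 0.5 gives an even initial mix.
        self.alpha = nn.Parameter(torch.zeros(1))
        # Shared output projection applied after merging.
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.attn(x)
        m = self.mamba(x)
        g = torch.sigmoid(self.alpha)
        # sigmoid(alpha) * attn + (1 - sigmoid(alpha)) * mamba
        merged = g * a + (1 - g) * m
        return self.out_proj(merged)
```

Both branches see the same input, so the gate lets the optimizer shift capacity between local attention and long-range SSM state per layer.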
Key Findings
Shallow models win at this compute budget. The SSM branch makes each layer more powerful (attention for local precision, Mamba for long-range state), reducing the need for depth. Fewer layers = faster steps = more training in 10 minutes. The optimal depth (7L) is significantly shallower than the transformer baseline (9-11L).
Training stability determines quantization quality. With standard LR (0.04), the int6 quantization gap was 0.02-0.06 BPB. Reducing to LR=0.02 with aggressive cosine warmdown (3000 steps) produced smoother, more quantization-friendly weights — shrinking the gap to 0.02 without QAT. The warmdown phase alone accounts for ~0.06 BPB improvement as EMA weights converge.
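The schedule described above (constant LR, then a 3000-step cosine warmdown) can be sketched as a plain function; the exact warmup and LR floor are assumptions not stated in the PR.

```python
import math

def lr_at(step: int, max_steps: int, base_lr: float = 0.02,
          warmdown_steps: int = 3000) -> float:
    """Hedged sketch of the LR schedule: hold base_lr, then cosine-decay
    to 0 over the final `warmdown_steps` steps. Details beyond the PR's
    numbers (LR=0.02, 3000-step warmdown) are assumptions."""
    warmdown_start = max_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    # Cosine decay from base_lr down to 0 across the warmdown window.
    progress = (step - warmdown_start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With ~7,000 steps in the 10-minute budget, the warmdown covers roughly the last 40% of training, which is when the EMA weights settle toward a smoother, more quantization-friendly solution.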
The Mamba branch adds minimal overhead on multi-GPU. At 7 layers with MLP 4×, Hymba runs at ~85ms/step on 8×H100 (~7,000 steps in 10 min). The selective scan is O(T) and the per-layer Mamba params are small, so gradient sync overhead is negligible.