
[Non-Record] Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828)#599

Open
mkenney2 wants to merge 1 commit into openai:main from mkenney2:hymba-submission

Conversation


@mkenney2 mkenney2 commented Mar 24, 2026

Hymba: Hybrid Attention + Mamba SSM

First competitive non-transformer architecture in the competition.

A 7-layer hybrid model that runs attention and Mamba SSM in parallel within each block. It reaches val_bpb ~1.18 (mean over 3 seeds) on 8×H100 within the 10-minute budget, beating the naive transformer baseline (1.2244).

Architecture

Each block runs two branches in parallel on the same input:

  1. Attention branch: Standard GQA (8 heads, 4 KV heads) with RoPE + SDPA
  2. Mamba branch: Selective scan SSM (mamba-ssm) with causal conv1d, gated projection

Outputs are merged with a learned weighted average: sigmoid(α) · attn + (1 - sigmoid(α)) · mamba, then projected through a shared output layer.
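The merge described above can be sketched in a few lines. This is an illustrative stand-in, not the PR's actual module: the branch outputs are replaced with random tensors, and `alpha`, `out_proj` are hypothetical names.

```python
import torch

# Minimal sketch of the parallel-branch merge: both branches see the same
# input, and a learned scalar alpha sets the mix per block.
B, T, D = 2, 16, 64

alpha = torch.nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

attn_out = torch.randn(B, T, D)   # stand-in for the GQA branch output
mamba_out = torch.randn(B, T, D)  # stand-in for the Mamba branch output

g = torch.sigmoid(alpha)
merged = g * attn_out + (1 - g) * mamba_out  # learned weighted average
out_proj = torch.nn.Linear(D, D, bias=False)  # shared output projection
y = out_proj(merged)
print(y.shape)  # torch.Size([2, 16, 64])
```

Initializing `alpha` at zero starts the mix at an even 0.5/0.5 split, letting training decide per block how much to lean on each branch.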

Key design: the input projection for K, V, and Mamba (x, gate) is fused into a single matmul for GPU efficiency. Q is projected separately to allow for future factorization.
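A sketch of that fused projection, under assumed (illustrative) widths — the real head/inner dimensions are not stated in the PR:

```python
import torch

D = 64          # model width (illustrative)
kv_dim = 32     # total K (or V) width across the 4 KV heads (illustrative)
d_inner = 128   # Mamba inner width (illustrative)

# One matmul produces K, V, and the Mamba (x, gate) inputs...
fused = torch.nn.Linear(D, 2 * kv_dim + 2 * d_inner, bias=False)
# ...while Q stays a separate projection, so it can be factorized later.
q_proj = torch.nn.Linear(D, D, bias=False)

h = torch.randn(2, 16, D)
k, v, x_ssm, gate = fused(h).split([kv_dim, kv_dim, d_inner, d_inner], dim=-1)
q = q_proj(h)
print(k.shape, v.shape, x_ssm.shape, gate.shape)
```

Fusing the four projections into one GEMM launches a single larger kernel instead of four small ones, which is where the GPU-efficiency win comes from.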

Key Findings

Shallow models win at this compute budget. The SSM branch makes each layer more powerful (attention for local precision, Mamba for long-range state), reducing the need for depth. Fewer layers = faster steps = more training in 10 minutes. The optimal depth (7L) is significantly shallower than the transformer baseline (9-11L).

Training stability determines quantization quality. With the standard LR (0.04), the int6 quantization gap was 0.02-0.06 BPB. Reducing to LR=0.02 with an aggressive cosine warmdown (3000 steps) produced smoother, more quantization-friendly weights, shrinking the gap to ~0.02 BPB without QAT. The warmdown phase alone accounts for ~0.06 BPB of improvement as the EMA weights converge.
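One plausible shape for that schedule (the PR does not show its scheduler code, so this is an assumption): hold the base LR, then cosine-decay to zero over the final 3000 steps.

```python
import math

# Hypothetical cosine-warmdown schedule. Numbers mirror the PR
# (base LR 0.02, 3000-step warmdown, ~7000 total steps); the exact
# scheduler used in the submission is an assumption.
def lr_at(step, total_steps=7000, base_lr=0.02, warmdown_steps=3000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr                       # constant phase
    frac = (step - start) / warmdown_steps   # 0 -> 1 over the warmdown
    return base_lr * 0.5 * (1 + math.cos(math.pi * frac))  # cosine to 0

print(lr_at(0), lr_at(5500), lr_at(7000))  # 0.02, 0.01, ~0.0
```

The smooth, monotone decay at the end is what lets the EMA weights settle, which the PR credits for most of the quantization-gap shrinkage.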

The Mamba branch adds minimal overhead on multi-GPU. At 7 layers with MLP 4×, Hymba runs at ~85ms/step on 8×H100 (~7,000 steps in 10 min). The selective scan is O(T) and the per-layer Mamba params are small, so gradient sync overhead is negligible.

First competitive non-transformer architecture. 7-layer hybrid model
running attention and Mamba SSM in parallel within each block.

Mean val_bpb: 1.1828 ± 0.0036 (3 seeds, 8xH100)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mkenney2 mkenney2 changed the title Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828) [Non-Record] Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828) Mar 24, 2026
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Mar 28, 2026
… Golf

First functional SSM architecture in Parameter Golf with zero throughput
penalty. Previous SSM attempts (Hymba, PR openai#599) used Mamba's selective scan
which requires custom CUDA kernels, resulting in 3.4x throughput penalty.
S4D-Lin replaces this with standard F.conv1d — pure PyTorch, fully
torch.compile compatible.
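The "standard F.conv1d" replacement mentioned in that commit can be sketched as a causal depthwise convolution: left-pad by (kernel - 1) so position t only sees t and earlier. Shapes and names here are illustrative, not taken from the parameter-golf code.

```python
import torch
import torch.nn.functional as F

B, D, T, K = 2, 64, 16, 4
x = torch.randn(B, D, T)      # (batch, channels, time)
w = torch.randn(D, 1, K)      # depthwise: one length-K filter per channel

x_pad = F.pad(x, (K - 1, 0))  # pad only on the left => causal
y = F.conv1d(x_pad, w, groups=D)  # pure PyTorch, torch.compile-friendly
print(y.shape)  # torch.Size([2, 64, 16])
```

Because this is a plain `F.conv1d` call rather than a custom CUDA kernel, it compiles and fuses like any other op, which is the zero-throughput-penalty claim above.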

Results: 1.1682 bpb post-GPTQ-int5 (vs 1.1194 SOTA). The throughput problem
is solved (116ms/step, matching baseline) but attention quality > SSM quality
in lower layers at this scale. Detailed analysis of why, plus lessons for
future SSM work.

Checks off "State-space models" from Requests for PRs.
