[Non-Record] Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828)#599
Open
mkenney2 wants to merge 1 commit into openai:main from
Conversation
First competitive non-transformer architecture. 7-layer hybrid model running attention and Mamba SSM in parallel within each block.

Mean val_bpb: 1.1828 ± 0.0036 (3 seeds, 8×H100)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request on Mar 28, 2026
… Golf

First functional SSM architecture in Parameter Golf with zero throughput penalty.

Previous SSM attempts (Hymba, PR openai#599) used Mamba's selective scan, which requires custom CUDA kernels, resulting in a 3.4× throughput penalty. S4D-Lin replaces this with standard F.conv1d — pure PyTorch, fully torch.compile compatible.

Results: 1.1682 bpb post-GPTQ-int5 (vs 1.1194 SOTA). The throughput problem is solved (116 ms/step, matching baseline), but attention quality > SSM quality in lower layers at this scale. Detailed analysis of why, plus lessons for future SSM work.

Checks off "State-space models" from Requests for PRs.
Hymba: Hybrid Attention + Mamba SSM
First competitive non-transformer architecture in the competition.
A 7-layer hybrid model that runs attention and Mamba SSM in parallel within each block, reaching a mean val_bpb of 1.1828 over 3 seeds on 8×H100 in 10 minutes, beating the naive transformer baseline (1.2244).
Architecture
Each block runs two branches in parallel on the same input:

- an attention branch, and
- a Mamba SSM branch (mamba-ssm) with causal conv1d and a gated projection.

Outputs are merged with a learned weighted average: sigmoid(α) · attn + (1 − sigmoid(α)) · mamba, then projected through a shared output layer.

Key design: the input projection for K, V, and the Mamba inputs (x, gate) is fused into a single matmul for GPU efficiency. Q is projected separately to allow for future factorization.
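The merge rule above can be sketched as a minimal PyTorch module. The real PR uses mamba-ssm's selective scan and a fused K/V/Mamba projection; here both branches are plain `nn.Linear` stand-ins (assumptions) so only the learned-gate merge is shown.

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """Sketch of the parallel attention + Mamba merge.

    `attn` and `mamba` are hypothetical stand-ins for the real branches;
    only the sigmoid(alpha)-weighted merge and shared output projection
    follow the description in the PR.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, dim)   # stand-in for the attention branch
        self.mamba = nn.Linear(dim, dim)  # stand-in for the Mamba SSM branch
        # Learned mixing logit; sigmoid(0) = 0.5 gives an even initial mix.
        self.alpha = nn.Parameter(torch.zeros(1))
        # Shared output projection applied after merging.
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.attn(x)
        m = self.mamba(x)
        g = torch.sigmoid(self.alpha)
        # sigmoid(alpha) * attn + (1 - sigmoid(alpha)) * mamba
        merged = g * a + (1 - g) * m
        return self.out_proj(merged)
```

Both branches see the same input, so the gate lets the optimizer shift capacity between local attention and long-range SSM state per layer.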
Key Findings
Shallow models win at this compute budget. The SSM branch makes each layer more powerful (attention for local precision, Mamba for long-range state), reducing the need for depth. Fewer layers = faster steps = more training in 10 minutes. The optimal depth (7L) is significantly shallower than the transformer baseline (9-11L).
Training stability determines quantization quality. With standard LR (0.04), the int6 quantization gap was 0.02-0.06 BPB. Reducing to LR=0.02 with aggressive cosine warmdown (3000 steps) produced smoother, more quantization-friendly weights — shrinking the gap to 0.02 without QAT. The warmdown phase alone accounts for ~0.06 BPB improvement as EMA weights converge.
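The schedule described above (constant LR, then a 3000-step cosine warmdown) can be sketched as a plain function; the exact warmup and LR floor are assumptions not stated in the PR.

```python
import math

def lr_at(step: int, max_steps: int, base_lr: float = 0.02,
          warmdown_steps: int = 3000) -> float:
    """Hedged sketch of the LR schedule: hold base_lr, then cosine-decay
    to 0 over the final `warmdown_steps` steps. Details beyond the PR's
    numbers (LR=0.02, 3000-step warmdown) are assumptions."""
    warmdown_start = max_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    # Cosine decay from base_lr down to 0 across the warmdown window.
    progress = (step - warmdown_start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With ~7,000 steps in the 10-minute budget, the warmdown covers roughly the last 40% of training, which is when the EMA weights settle toward a smoother, more quantization-friendly solution.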
The Mamba branch adds minimal overhead on multi-GPU. At 7 layers with MLP 4×, Hymba runs at ~85ms/step on 8×H100 (~7,000 steps in 10 min). The selective scan is O(T) and the per-layer Mamba params are small, so gradient sync overhead is negligible.