Is your feature request related to a problem? Please describe.
While Megatron Core supports Multi-Latent Attention (MLA) for KV cache compression (as used in DeepSeek-V2/V3 and Kimi-K2), it lacks support for Kimi Delta Attention (KDA), a sparse attention mechanism that selectively computes attention based on the delta from previous layers or cached representations.
Describe the solution you'd like
Add support for Kimi Delta Attention including the following components:
- New configuration options in `TransformerConfig`
- New attention module: a `DeltaAttention` or `KDASelfAttention` class extending the existing `Attention` base class
- Layer spec support: Add KDA layer specs similar to how MLA is handled
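To make the three components above concrete, here is a rough, self-contained sketch of how they might fit together. It does not import Megatron Core: `TransformerConfigSketch` and `AttentionSketch` are stand-ins for the real classes, and the field names `use_kimi_delta_attention` and `kda_head_dim`, as well as the helper `get_attention_cls`, are hypothetical names chosen for illustration only. The actual KDA computation is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in for TransformerConfig so the sketch runs on its own.
# Both KDA fields are hypothetical; they do not exist in Megatron Core today.
@dataclass
class TransformerConfigSketch:
    hidden_size: int = 1024
    num_attention_heads: int = 16
    use_kimi_delta_attention: bool = False  # hypothetical switch
    kda_head_dim: Optional[int] = None      # hypothetical; None = derive from hidden_size


class AttentionSketch:
    """Stand-in for the existing Attention base class."""

    def __init__(self, config: TransformerConfigSketch):
        self.config = config


class KDASelfAttention(AttentionSketch):
    """Placeholder for the requested module; the forward pass is omitted."""

    def __init__(self, config: TransformerConfigSketch):
        super().__init__(config)
        # Fall back to the usual per-head dimension when no override is given.
        self.head_dim = (
            config.kda_head_dim
            or config.hidden_size // config.num_attention_heads
        )


def get_attention_cls(config: TransformerConfigSketch):
    """Mimics how a layer spec might pick the attention module from the config."""
    if config.use_kimi_delta_attention:
        return KDASelfAttention
    return AttentionSketch
```

With something along these lines, enabling KDA would be a config change (`TransformerConfigSketch(use_kimi_delta_attention=True)`), and the layer spec would resolve to the new module the same way the MLA spec does for MLA.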
Describe alternatives you've considered
Additional context
Kimi Delta Attention arxiv