Feature Request: Kimi Delta Attention (KDA) #2446

**Is your feature request related to a problem? Please describe.**
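While Megatron Core supports Multi-Latent Attention (MLA) for KV cache compression (as used in DeepSeek-V2/V3 and Kimi-K2), it lacks support for Kimi Delta Attention (KDA), the linear attention mechanism introduced in the Kimi Linear paper linked below. KDA extends Gated DeltaNet with finer-grained, channel-wise gating: instead of attending over a growing KV cache, it maintains a fixed-size recurrent state that is decayed per channel and updated with the delta rule at every token.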
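
For reference, here is a minimal sketch of the per-token recurrence as I understand it from the linked paper: `S_t = (I - beta_t k_t k_t^T) Diag(alpha_t) S_{t-1} + beta_t k_t v_t^T`, with output `o_t = S_t^T q_t`. It is a plain-Python, single-head reference for discussion only, not the chunked kernel a real implementation would need. The function name `kda_step` and its argument layout are my own assumptions, and the formula should be checked against the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def kda_step(
    S: Matrix, q: Vector, k: Vector, v: Vector, alpha: Vector, beta: float
) -> Tuple[Matrix, Vector]:
    """One recurrent step of a gated delta rule with channel-wise decay.

    Hypothetical reference helper (not a Megatron Core API).
    S: d_k x d_v state, q/k: length d_k, v: length d_v,
    alpha: per-channel decay of length d_k, beta: scalar write strength.
    """
    d_k, d_v = len(S), len(S[0])

    # 1) Channel-wise decay: Diag(alpha) @ S scales row i by alpha[i].
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]

    # 2) Value currently stored under key k: pred = k^T S.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]

    # 3) Delta rule: move the stored value toward v by a factor beta.
    #    Equivalent to (I - beta k k^T) S + beta k v^T.
    S = [
        [S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]

    # 4) Readout: o = S^T q.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o
```

With a unit-norm key and `beta = 1`, a write replaces whatever value was stored under that key, which is the behaviour the delta rule is meant to provide. The state size is `d_k x d_v` regardless of sequence length.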

**Describe the solution you'd like**
Add support for Kimi Delta Attention, including the following components:

1. New configuration options in `TransformerConfig`
2. New attention module: `DeltaAttention` or `KDASelfAttention` class extending the existing `Attention` base class
3. Layer spec support: Add KDA layer specs similar to how MLA is handled
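
To make the three items above more concrete, here is a rough sketch of what the configuration surface and layer selection might look like. Everything in it is hypothetical: the class `KDAConfigSketch`, the field names, the default values, and the helper `attention_layout` are placeholders for discussion, not existing Megatron Core APIs. It also assumes a hybrid stack in which some layers keep using MLA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KDAConfigSketch:
    """Hypothetical stand-in for the options that could be added to TransformerConfig."""

    kimi_delta_attention: bool = False    # hypothetical flag enabling KDA layers
    kda_num_heads: int = 32               # hypothetical; illustrative default only
    kda_head_dim: int = 128               # hypothetical; illustrative default only
    kda_full_attention_interval: int = 4  # hypothetical: every n-th layer stays MLA

    def __post_init__(self) -> None:
        if self.kda_full_attention_interval < 1:
            raise ValueError("kda_full_attention_interval must be >= 1")


def attention_layout(num_layers: int, cfg: KDAConfigSketch) -> List[str]:
    """Return the attention type per layer, as a layer spec builder might consume it."""
    if not cfg.kimi_delta_attention:
        # Assumption: without KDA, every layer uses the existing MLA path.
        return ["mla"] * num_layers
    n = cfg.kda_full_attention_interval
    return ["mla" if (i + 1) % n == 0 else "kda" for i in range(num_layers)]


if __name__ == "__main__":
    print(attention_layout(8, KDAConfigSketch(kimi_delta_attention=True)))
```

A layer spec function could then map `"kda"` to the new attention module and `"mla"` to the existing MLA module, in the same way MLA specs are selected today.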

**Describe alternatives you've considered**

**Additional context**
Kimi Delta Attention [arxiv](https://arxiv.org/abs/2510.26692)

