DeepSeek V3.2 support by kunlunl · Pull Request #2440 · NVIDIA/Megatron-LM

kunlunl · 2025-12-01T09:21:37Z

DeepSeek V3.2 Sparse Attention Support

dev PR: #2154

1. TL;DR

What: This PR adds support for DeepSeek V3.2-style sparse attention (DSA) to Megatron-LM, enabling models to use learned attention sparsity patterns via a lightweight indexer module.

Why: Dense attention has O(n²) complexity, which becomes computationally prohibitive for long sequences. DSA reduces this cost by learning to predict which key-value pairs are most relevant for each query, so the model attends to only the top-k tokens instead of all tokens.

Impact: Users can now train models with DeepSeek V3.2's sparse attention mechanism in Megatron-LM, which combines Multi-Latent Attention with a trainable sparse indexer.

2. Big Picture

2.1 Before vs After Architecture

graph TB
    subgraph "Before: Standard Dense Attention"
        A1[Hidden States] --> B1[QKV Projection]
        B1 --> C1[Dense Attention]
        C1 --> D1[Output Projection]
    end
    
    subgraph "After: DeepSeek Sparse Attention"
        A2[Hidden States] --> B2[QKV Projection]
        A2 --> E2[DSA Indexer]
        B2 --> E2
        E2 --> F2[Top-K Selection]
        B2 --> G2[Sparse Attention]
        F2 --> G2
        G2 --> D2[Output Projection]
        E2 --> H2[Indexer Loss]
        H2 --> D2
    end
    
    style E2 fill:#90EE90
    style F2 fill:#90EE90
    style H2 fill:#FFD700

Key Changes:

NEW: DSA Indexer module that learns to predict important tokens
NEW: Sparse attention module that only computes attention for top-k tokens
NEW: KL divergence auxiliary loss to train the indexer
MODIFIED: Multi-Latent Attention to support DSA variant
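To make "only computes attention for top-k tokens" concrete, here is a minimal single-head reference sketch in NumPy. It is not the kernel added by this PR: the function name `sparse_attention` and the dense boolean mask are assumptions made for illustration, and `topk_idx` is assumed to already respect the causal mask.

```python
import numpy as np


def sparse_attention(q, k, v, topk_idx):
    """Single-head attention restricted to the key positions in topk_idx.

    q: [sq, d], k: [sk, d], v: [sk, dv], topk_idx: [sq, topk] (integer indices).
    Illustrative reference only; it still materializes an [sq, sk] score matrix.
    """
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)                   # [sq, sk] scaled dot products
    keep = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(keep, topk_idx, True, axis=1)   # mark the selected keys per query
    scores = np.where(keep, scores, -np.inf)          # unselected keys get zero probability
    scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ v                                  # [sq, dv]
```

With `topk = 1` each output row is simply the selected value vector, and with `topk_idx` covering every key the result matches dense attention, which makes the sketch easy to sanity-check.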

2.2 Change Scope Summary

| Category | Files | Description |
| --- | --- | --- |
| New Core Module | `megatron/core/transformer/experimental_attention_variant/dsa.py` | DSA indexer, DSA sparse attention module, loss computation |
| New Spec File | `megatron/core/models/gpt/experimental_attention_variant_module_specs.py` | Module specs for attention variants |
| New Test | `tests/unit_tests/transformer/test_attention_variant_dsa.py` | Comprehensive DSA unit tests |
| Modified Core | `megatron/core/transformer/multi_latent_attention.py` | MLA integration with DSA |
| Modified Config | `megatron/core/transformer/transformer_config.py` | Added DSA config parameters |
| Modified Args | `megatron/training/arguments.py` | CLI arguments for DSA |
| Modified Training | `megatron/training/training.py` | Loss logging for indexer |
| Modified Specs | `megatron/core/models/gpt/gpt_layer_specs.py` | Renamed linear_attention → experimental_attention_variant |
| Modified Builder | `gpt_builders.py` | Updated to use new attention variant system |

3. Key Design Points

Core Abstractions Introduced:

DSAIndexer: Computes index scores to identify top-k most relevant tokens
- Input: Hidden states x [seqlen, batch, hidden_size] + compressed query qr [seqlen, batch, q_lora_rank]
- Output: Top-k indices [batch, seqlen, index_topk]
- Uses its own small transformer-like architecture with Q/K projections + RoPE + Hadamard rotation
DSAttention: Sparse attention mechanism using indexer outputs
- Wraps DSAIndexer and applies sparse attention kernel
- Attaches KL divergence loss to train indexer
DSAIndexerLossAutoScaler: Custom autograd function
- Allows indexer loss to backpropagate independently of main loss
- Scales indexer loss gradient separately
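As a rough illustration of the KL divergence loss that trains the indexer, the sketch below compares the main attention distribution with the distribution implied by the index scores. Several details are assumptions rather than facts from this PR: the target is taken as the head-averaged softmax of the attention logits, inputs are unmasked and finite, and the names `log_softmax` and `indexer_kl_loss` are hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax for finite inputs."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def indexer_kl_loss(attn_logits, index_scores, eps=1e-10):
    """KL(target || predicted), averaged over query positions.

    attn_logits:  [heads, sq, sk] pre-softmax logits of the main attention.
    index_scores: [sq, sk] indexer scores, treated as logits.
    Assumption: the target is the softmax attention averaged over heads.
    """
    target = np.exp(log_softmax(attn_logits)).mean(axis=0)   # [sq, sk], rows sum to 1
    log_pred = log_softmax(index_scores)                     # [sq, sk]
    kl_per_query = (target * (np.log(target + eps) - log_pred)).sum(axis=-1)
    return kl_per_query.mean()
```

The loss is close to zero when the indexer reproduces the attention distribution and grows as the two diverge, which is the signal that pushes the indexer's top-k choices toward the keys the main attention actually uses.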

Interface Contracts:

# DSAIndexer.forward
def forward(x, qr, mask=None, packed_seq_params=None) -> topk_indices
    """
    x: [seqlen, batch, hidden_size] - Main hidden states (DETACHED)
    qr: [seqlen, batch, q_lora_rank] - Compressed query (DETACHED)
    mask: [batch, seqlen, seqlen] - Attention mask (FP32 with -inf for masked positions)
    
    Returns: [batch, seqlen, index_topk] - Indices of top-k tokens to attend to
    """

# DSAttention.forward
def forward(query, key, value, x, qr, attention_mask, ...) -> output
    """
    query: [sq, b, np, hn] - Full query tensor from MLA
    key: [sk, b, np, hn] - Full key tensor from MLA
    value: [sk, b, np, hnv] - Full value tensor from MLA
    x: [sq, b, hidden_size] - Original hidden states for indexer
    qr: [sq, b, q_lora_rank] - Compressed query for indexer
    
    Returns: [sq, b, hidden_size] - Attention output with indexer loss attached
    """

Important Invariants:

Indexer inputs (x, qr) are always detached - gradients don't flow back to main model
Indexer loss is attached via DSAIndexerLossAutoScaler.apply() - backpropagates separately
Top-k selection uses masked index scores (causal mask applied before topk)
DSA currently requires multi_latent_attention=True and context_parallel_size=1
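The loss-attachment invariant can be sketched with a small custom autograd function. This is an illustration of the general pattern, not the PR's `DSAIndexerLossAutoScaler`: the class name `AuxLossAttach` and its `loss_scale` attribute are hypothetical. The forward pass returns the attention output unchanged, and the backward pass injects a scaled unit gradient into the auxiliary loss so that it is optimized alongside the main loss.

```python
import torch


class AuxLossAttach(torch.autograd.Function):
    """Hypothetical stand-in showing how an auxiliary loss can ride along with an output."""

    loss_scale = 1.0  # assumed class-level scale applied to the auxiliary gradient

    @staticmethod
    def forward(ctx, output, aux_loss):
        # Keep the auxiliary loss for backward; the activation passes through untouched.
        ctx.save_for_backward(aux_loss)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (aux_loss,) = ctx.saved_tensors
        # The main gradient is passed through; the auxiliary loss gets d(loss)/d(loss) = scale.
        aux_grad = torch.ones_like(aux_loss) * AuxLossAttach.loss_scale
        return grad_output, aux_grad
```

Because the indexer inputs are detached, the auxiliary gradient injected here would only reach the indexer's own parameters, while the main-loss gradient flows through the attention output as usual.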

4. Execution Path Deep Dive

4.1 Entry Point

DSA is triggered when creating a GPT model with --experimental-attention-variant dsa flag:

# Entry: gpt_builders.py::gpt_builder()
def gpt_builder(args, pre_process, post_process, vp_stage=None, config=None):
    # ...
    linear_attention_variants = ["gated_delta_net"]
    if args.num_experts or args.experimental_attention_variant in linear_attention_variants:
        transformer_layer_spec = get_gpt_decoder_block_spec(...)  # Uses MoE path
    elif ...:  # other spec-selection branches elided in the original
        ...
    else:
        transformer_layer_spec = _get_transformer_layer_spec(
            # ...
            experimental_attention_variant=args.experimental_attention_variant,  # 'dsa'
            # ...
        )

4.2 Data Flow

graph TD
    A["Input: hidden_states<br/>[sq, b, hidden]"] --> B["MLA Q Compression<br/>linear_q_proj→linear_q_down_proj"]
    A --> C["MLA KV Compression<br/>linear_kv_down_proj"]
    
    B --> D["q_compressed<br/>[sq, b, q_lora_rank]"]
    C --> E["kv_compressed<br/>[sq, b, kv_lora_rank]"]
    
    D --> F["MLA Q Upsampling<br/>linear_q_up_proj + RoPE"]
    E --> G["MLA KV Upsampling<br/>linear_kv_up_proj + RoPE"]
    
    F --> H["query<br/>[sq, b, np, hn]"]
    G --> I["key<br/>[sk, b, np, hn]"]
    G --> J["value<br/>[sk, b, np, hnv]"]
    
    A --> K["x.detach()"]
    D --> L["q_compressed.detach()"]
    
    K --> M["DSAIndexer"]
    L --> M
    M --> N["Indexer Q Proj<br/>linear_wq_b<br/>[sq, b, index_n_heads, index_head_dim]"]
    K --> O["Indexer K Proj<br/>linear_wk + k_norm<br/>[sk, b, index_head_dim]"]
    K --> P["Indexer Weights<br/>linear_weights_proj<br/>[sq, b, index_n_heads]"]
    
    N --> Q["Apply RoPE"]
    O --> R["Apply RoPE"]
    Q --> S["rotate_activation<br/>&#40;Hadamard transform&#41;"]
    R --> T["rotate_activation<br/>&#40;Hadamard transform&#41;"]
    
    S --> U["Index Scores<br/>q @ k^T → ReLU → weighted sum<br/>[b, sq, sk]"]
    T --> U
    P --> U
    
    U --> V["TopK Selection<br/>[b, sq, index_topk]"]
    
    H --> W["Sparse Attention"]
    I --> W
    J --> W
    V --> W
    
    W --> X["attention_output<br/>[sq, b, hidden]"]
    
    U --> Y["KL Divergence Loss<br/>KL&#40;true_attn || index_scores&#41;"]
    V --> Y
    H --> Y
    I --> Y
    
    Y --> Z["indexer_loss<br/>scalar"]
    
    X --> AA["DSAIndexerLossAutoScaler.apply"]
    Z --> AA
    AA --> AB["Final Output<br/>&#40;with loss attached&#41;"]
    
    style K fill:#FFE4B5
    style L fill:#FFE4B5
    style M fill:#90EE90
    style U fill:#87CEEB
    style V fill:#87CEEB
    style W fill:#FFD700
    style Y fill:#FF6347
    style Z fill:#FF6347
    style AA fill:#DDA0DD
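The index-score and top-k steps of the flow above can be sketched for a single batch element as follows. This is a simplified NumPy reference under stated assumptions: RoPE and the Hadamard rotation are omitted, `sq == sk` (self-attention without a KV cache), and the function names `index_scores` and `causal_topk` are hypothetical.

```python
import numpy as np


def index_scores(q, k, w):
    """q: [sq, heads, d], k: [sk, d], w: [sq, heads] -> scores [sq, sk]."""
    per_head = np.einsum("qhd,kd->qhk", q, k)       # q @ k^T for every indexer head
    per_head = np.maximum(per_head, 0.0)            # ReLU
    return np.einsum("qhk,qh->qk", per_head, w)     # weighted sum over heads


def causal_topk(scores, topk):
    """Apply a causal mask, then pick the top-k key indices for each query."""
    sq, sk = scores.shape
    future = np.triu(np.ones((sq, sk), dtype=bool), k=1)   # True where key index > query index
    masked = np.where(future, -np.inf, scores)
    k_eff = min(topk, sk)
    # Stable descending sort; masked positions sort last.
    return np.argsort(-masked, axis=1, kind="stable")[:, :k_eff]
```

Note that early queries have fewer than `topk` valid keys, so the trailing indices in those rows point at masked positions; in this sketch the caller is expected to re-apply the causal mask when consuming the indices.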

5. Module Relationships

classDiagram
    class TransformerConfig {
        +int num_layers
        +int hidden_size
        +str experimental_attention_variant
    }
    
    class MLATransformerConfig {
        +int q_lora_rank
        +int kv_lora_rank
        +int dsa_indexer_n_heads
        +int dsa_indexer_head_dim
        +int dsa_indexer_topk
        +float dsa_indexer_loss_coeff
    }
    
    class Attention {
        <<abstract>>
        +forward()*
    }
    
    class MultiLatentAttention {
        +get_query_key_value_tensors()
        +forward()
    }
    
    class MLASelfAttention {
        +linear_q_proj
        +linear_kv_down_proj
        +core_attention
        +get_query_key_value_tensors(return_compressed_tensors)
    }
    
    class DSAttention {
        +indexer: DSAIndexer
        +softmax_scale: float
        +forward(query, key, value, x, qr, ...)
    }
    
    class DSAIndexer {
        +linear_wq_b
        +linear_wk
        +k_norm
        +linear_weights_proj
        +rotary_pos_emb
        +forward(x, qr, mask)
        +forward_with_scores(x, qr, mask)
        -_apply_rope()
        -_compute_index_scores()
    }
    
    class DSAIndexerSubmodules {
        +linear_wq_b: ModuleSpec
        +linear_wk: ModuleSpec
        +k_norm: ModuleSpec
        +linear_weights_proj: ModuleSpec
    }
    
    class DSAttentionSubmodules {
        +indexer: ModuleSpec
    }
    
    class MLASelfAttentionSubmodules {
        +core_attention: ModuleSpec
        +linear_q_proj
        +linear_kv_down_proj
        +q_layernorm
        +kv_layernorm
    }
    
    class RotaryEmbedding {
        +forward(seq_len)
    }
    
    class DSAIndexerLossAutoScaler {
        <<autograd.Function>>
        +forward(output, loss)$
        +backward(grad_output)$
        +set_loss_scale(scale)$
        +main_loss_backward_scale$
    }
    
    class DSAIndexerLossLoggingHelper {
        +save_loss_to_tracker()$
        +reduce_loss_in_tracker()$
        +track_indexer_metrics()$
        +tracker: dict$
    }
    
    TransformerConfig <|-- MLATransformerConfig : extends
    Attention <|-- MultiLatentAttention : extends
    MultiLatentAttention <|-- MLASelfAttention : extends
    Attention <|-- DSAttention : implements (core_attention)
    
    MLASelfAttention --> DSAttention : uses as core_attention
    MLASelfAttention --> MLASelfAttentionSubmodules : configured by
    DSAttention --> DSAIndexer : contains
    DSAttention --> DSAttentionSubmodules : configured by
    DSAIndexer --> DSAIndexerSubmodules : configured by
    DSAIndexer --> RotaryEmbedding : uses
    DSAttention ..> DSAIndexerLossAutoScaler : uses
    DSAttention ..> DSAIndexerLossLoggingHelper : logs to
    
    MLASelfAttention ..> MLATransformerConfig : reads config
    DSAttention ..> MLATransformerConfig : reads config
    DSAIndexer ..> MLATransformerConfig : reads config

Key Relationships:

Composition:
- MLASelfAttention contains DSAttention as its core_attention module
- DSAttention contains DSAIndexer for computing sparse indices
Utility Classes:
- DSAIndexerLossAutoScaler: Custom autograd for loss attachment
- DSAIndexerLossLoggingHelper: Singleton for collecting losses across layers

New Dependencies Introduced:

fast_hadamard_transform (optional): For Hadamard rotation activation
Fallback: Mock implementation in tests
Production: Uses optimized CUDA kernel
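For readers without the CUDA package, the Hadamard rotation can be illustrated with a plain NumPy fast Walsh-Hadamard transform. This is a reference sketch only and does not reproduce the `fast_hadamard_transform` API; the function name `hadamard_transform_ref` and the default orthonormal scaling of `n ** -0.5` are assumptions.

```python
import numpy as np


def hadamard_transform_ref(x, scale=None):
    """Fast Walsh-Hadamard transform along the last axis (Sylvester ordering).

    The last dimension must be a power of two. With the default scale the
    transform is orthonormal, so applying it twice returns the input.
    """
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[-1]
    if n == 0 or n & (n - 1):
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    out = x.copy()
    h = 1
    while h < n:
        # Split into blocks of size 2h and combine the two halves of each block.
        out = out.reshape(*lead, n // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = np.stack((a + b, a - b), axis=-2).reshape(*lead, n)
        h *= 2
    if scale is None:
        scale = n ** -0.5
    return out * scale
```

The butterfly structure costs O(n log n) per vector instead of the O(n²) of an explicit matrix multiply, which is why a fused kernel is attractive for the per-token rotation.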

6. Examples

6.1 Configuration Parameters

CLI Arguments Example (added in arguments.py):

--experimental-attention-variant dsa          # Enable DSA (DeepSeek Sparse Attention)
--dsa-indexer-n-heads 8                       # Number of indexer heads (default: num-attention-heads)
--dsa-indexer-head-dim 64                     # Dimension per indexer head (default: kv-channels)
--dsa-indexer-topk 32                         # Top-k tokens to select per query
--dsa-indexer-loss-coeff 1.0                  # Coefficient for KL divergence loss (0 = disabled)
--dsa-indexer-use-sparse-loss                 # Use sparse KL loss (only on top-k positions)

TransformerConfig Example:

config = MLATransformerConfig(
    # ... standard MLA params ...
    experimental_attention_variant='dsa',      # 'dsa' | 'gated_delta_net' | None
    dsa_indexer_n_heads=8,                    # Must be divisible by TP size
    dsa_indexer_head_dim=64,                  # Typically same as kv_channels
    dsa_indexer_topk=32,                      # k in O(n·k) complexity
    dsa_indexer_loss_coeff=1.0,               # Typical range: 0.0001 - 0.01
    dsa_indexer_use_sparse_loss=False,        # True = sparse, False = dense KL loss
)
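The constraints stated in this description (DSA requires `multi_latent_attention=True` and `context_parallel_size=1`, and the indexer head count must be divisible by the TP size) could be checked up front along the following lines. This is a hypothetical sketch: `DSAConfigSketch` and `dsa_config_errors` are illustrative names and are not part of `MLATransformerConfig` or this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DSAConfigSketch:
    """Hypothetical subset of config fields relevant to the DSA constraints."""

    multi_latent_attention: bool
    context_parallel_size: int
    tensor_model_parallel_size: int
    dsa_indexer_n_heads: int
    dsa_indexer_topk: int
    experimental_attention_variant: Optional[str] = None


def dsa_config_errors(cfg: DSAConfigSketch) -> List[str]:
    """Return human-readable violations; an empty list means the config looks consistent."""
    if cfg.experimental_attention_variant != "dsa":
        return []  # nothing to check when DSA is not enabled
    errors = []
    if not cfg.multi_latent_attention:
        errors.append("DSA requires multi_latent_attention=True")
    if cfg.context_parallel_size != 1:
        errors.append("DSA requires context_parallel_size=1")
    if cfg.dsa_indexer_n_heads % cfg.tensor_model_parallel_size != 0:
        errors.append("dsa_indexer_n_heads must be divisible by the TP size")
    return errors
```

Returning a list rather than raising on the first problem lets a launcher report every misconfiguration in one pass.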

6.2 Example Usage

Training a GPT model with DSA:

# Arguments are collected in a bash array so that comments can sit between
# groups; comments inside a backslash-continued command would cut it short.
ARGS=(
    --num-layers 32
    --hidden-size 4096
    --num-attention-heads 32
    --seq-length 8192

    # Enable Multi-Latent Attention (required for DSA)
    --multi-latent-attention
    --q-lora-rank 512
    --kv-lora-rank 512
    --qk-head-dim 128
    --qk-pos-emb-head-dim 64
    --v-head-dim 128

    # Enable DeepSeek Sparse Attention
    --experimental-attention-variant dsa
    --dsa-indexer-n-heads 16
    --dsa-indexer-head-dim 128
    --dsa-indexer-topk 256
    --dsa-indexer-loss-coeff 0.001

    # Standard training args
    --micro-batch-size 1
    --global-batch-size 512
    --lr 1.0e-4
    --train-iters 100000
    --lr-decay-iters 100000
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --weight-decay 0.1
    --clip-grad 1.0
    --bf16
)

python pretrain_gpt.py "${ARGS[@]}"

Expected Behavior:

Each layer will use sparse attention with top-256 tokens (instead of full 8192)
Indexer loss will be logged to TensorBoard as indexer loss
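A quick back-of-the-envelope calculation shows what top-256 selection means at a sequence length of 8192. The helper names below are illustrative, and the memory figure assumes a single FP32 `[b, sq, sk]` score tensor with batch size 1, as materialized by an unfused implementation.

```python
def causal_pairs_dense(n: int) -> int:
    """Query-key pairs under a causal mask: query i sees keys 0..i."""
    return n * (n + 1) // 2


def causal_pairs_topk(n: int, k: int) -> int:
    """Query-key pairs when each query attends to at most k causal keys."""
    return sum(min(i + 1, k) for i in range(n))


seq_len, topk = 8192, 256
dense_pairs = causal_pairs_dense(seq_len)        # 33,558,528
sparse_pairs = causal_pairs_topk(seq_len, topk)  # 2,064,512
reduction = dense_pairs / sparse_pairs           # roughly 16x fewer pairs

# One dense FP32 score tensor of shape [1, seq_len, seq_len]:
score_bytes = seq_len * seq_len * 4              # 268,435,456 bytes = 256 MiB
```

The attention itself touches about 16x fewer query-key pairs, while any tensor shaped `[b, sq, sk]` still grows quadratically with sequence length.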

Hi, thank you very much for your great work.
However, I’ve encountered an issue where the current DSA implementation consumes a large amount of GPU memory when the sequence length is long.
I’m wondering whether there are any planned or ongoing efforts to optimize the memory usage for long-sequence scenarios.

Thanks a lot in advance for your help.

kunlunl · 2026-01-29T06:38:07Z

Yes. The large memory footprint comes from the unfused DSA and indexer implementations, which materialize many seq^2 tensors. We have ongoing PRs to replace the unfused PyTorch implementation with fused kernels, but this is still WIP and the fused kernels currently support only specific shapes.
Here are three PRs:

Add fused dsa #3044: Adds a new absorption MLA implementation and uses the fused DSA kernel.
[Refactor] Decouple topk and loss from DSA Indexer #3013: Refactors the indexer to decouple top-k and loss, making it compatible with fused indexer kernels. This PR was created by the community.
[WIP Feat] Split-K Indexer Kernels #2869: Fused indexer kernels. This PR was created by the community.

xhjhggybz · 2026-01-29T06:58:05Z

Thanks a lot for the detailed explanation and for sharing the related PRs. This is very helpful!

Add support for DSA

e003f8d

Signed-off-by: kunlunl <kunlunl@nvidia.com>

kunlunl requested review from a team as code owners December 1, 2025 09:21

kunlunl mentioned this pull request Dec 1, 2025

[dev] DeepSeek V3.2 support #2154

Merged

kunlunl changed the title ~~Add support for DSA~~ DeepSeek V3.2 support Dec 1, 2025

fzyzcjy mentioned this pull request Dec 2, 2025

Feature Request: Support for DeepSeek Sparse Attention in Megatron (DeepSeek 3.2) #1869

Closed

yanring added the module: moe and Expert Review [deprecated] labels Dec 3, 2025

ericharper added the complexity: high label Dec 8, 2025

twoflypig reviewed Dec 12, 2025

snowmanwwg added the dev2main: mbridge label Jan 6, 2026

Phlip79 requested a review from deepakn94 January 9, 2026 19:23

Merge branch 'main' into kunlunl/deepseek_v3.2_main

1a1522e

copy-pr-bot Bot temporarily deployed to nemo-ci January 9, 2026 19:28 Inactive

ko3n1g added this to the Core 0.16 milestone Jan 9, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci January 9, 2026 19:29 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 9, 2026 19:29 Failure

Phlip79 added complexity: medium and removed complexity: high labels Jan 9, 2026

Phlip79 requested a review from ananthsub January 9, 2026 19:32

Update transformer_config.py

31f4120

copy-pr-bot Bot temporarily deployed to nemo-ci January 9, 2026 23:15 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 9, 2026 23:15 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 15, 2026 23:07 Inactive

Phlip79 added 2 commits January 15, 2026 15:29

Fix linting

2ec8a8f

Merge branch 'main' into kunlunl/deepseek_v3.2_main

15f1fac

copy-pr-bot Bot temporarily deployed to nemo-ci January 15, 2026 23:30 Inactive

copy-pr-bot Bot temporarily deployed to test January 15, 2026 23:31 Inactive

kunlunl and others added 2 commits January 16, 2026 14:15

Update MLA test for return compressed tensors by get_query_key_value_tensors

87d7e70

Merge branch 'main' into kunlunl/deepseek_v3.2_main

271f29b

copy-pr-bot Bot temporarily deployed to nemo-ci January 16, 2026 06:15 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 16, 2026 06:16 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 16, 2026 06:16 Inactive

Fix linting error

30b970e

Fix UT error

6f0f976

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

+                  ####################
+                  # attention variant
+                  ####################
+                  experimental_attention_variant: Optional[str] = None
