Support DeepSeek-v3.2 by yuzhongw-nvidia · Pull Request #1421 · NVIDIA-NeMo/Megatron-Bridge

yuzhongw-nvidia · 2025-11-19T11:41:31Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

Summary by CodeRabbit

New Features
- Added support for DeepSeek-V3.2 model integration with Megatron framework, including multi-latent attention and sparse attention indexing.
- Enabled mixture-of-experts configuration handling for the new model variant.
Tests
- Added functional tests for DeepSeek-V3.2 model conversion validation.

copy-pr-bot · 2025-11-19T11:41:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-01-08T03:33:29Z

/ok to test 066adea

yaoyu-33 · 2026-01-08T03:36:50Z

@yuzhongw-nvidia can you run " pre-commit run --all-files"

yaoyu-33 · 2026-01-08T18:37:19Z

/ok to test 4f0d855

yaoyu-33 · 2026-01-20T17:07:23Z

pending transformers merging on HF side.

wuyaoxuehun · 2026-02-12T05:56:56Z

pending transformers merging on HF side.

@yaoyu-33 can you merge this pr now? and can GLM-5 be also implemented cause they share the same arch?

coderabbitai · 2026-02-25T09:53:57Z

📝 Walkthrough

Walkthrough

This PR introduces DeepSeek-V3.2 model conversion support in Megatron, adding a new bridge implementation with provider customization for sparse attention indexing, mixture-of-experts settings, multi-latent attention, and comprehensive weight mapping logic. Functional tests validate the conversion pipeline across multi-GPU environments.

Changes

Cohort / File(s)	Summary
Bridge Implementation & Export `src/megatron/bridge/models/deepseek/__init__.py`, `src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`	New DeepSeekV32Bridge class for DeepSeek-V3.2 model conversion with provider customization (experimental DSA attention, MoE settings, rope, fusion options), mapping registry for parameter translation, and automatic rotary embedding inverse frequency injection during weight conversion.
Configuration & Mappings `src/megatron/bridge/models/deepseek/common.py`	Added DeepSeek-V3.2-specific indexer configuration selection logic and sparse attention parameter mappings (indexer.linear_wq_b, indexer.linear_wk, indexer.k_norm, indexer.weights_proj) plus MoE router expert bias mapping to HF e_score_correction_bias.
Functional Tests `tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`	New test suite for DeepSeek-V3.2 conversion validation, creating toy models with MoE and DSA configurations, executing multi-GPU conversion pipeline, and verifying output integrity (config fields, weight presence, parameter preservation).

Sequence Diagram

sequenceDiagram
    participant HF as HuggingFace<br/>DeepSeek-V3.2
    participant Bridge as DeepSeekV32Bridge
    participant Config as Config &<br/>Mappings
    participant Megatron as Megatron<br/>Model
    participant Weights as Weight<br/>Converter

    HF->>Bridge: Load HF model config & weights
    Bridge->>Config: Query get_common_configs()
    Config-->>Bridge: Return DSA indexer & MoE settings
    Bridge->>Config: Query get_common_mapping_list()
    Config-->>Bridge: Return sparse attention & router mappings
    Bridge->>Bridge: Create provider with TE specs & fusion options
    Bridge-->>Megatron: Build GPTModel with DSA & MoE
    Bridge->>Weights: Convert HF weights via mapping
    Weights->>HF: Apply parameter mappings (indexer, router bias)
    Weights->>Weights: Inject rotary embedding inv_freq
    Weights-->>Megatron: Load augmented weights

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

community-request

Suggested reviewers

yaoyu-33

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes	⚠️ Warning	PR adds DeepSeek-V3.2 support but lacks documented test results and evidence of successful execution in description.	Document test results confirming successful conversion with correct model configuration and address all flagged code issues before running tests.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Support DeepSeek-v3.2' directly and clearly summarizes the main change: adding support for the DeepSeek-V3.2 model through new bridge implementation, configuration mappings, and functional tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (2)

src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py (1)

102-109: Remove duplicate expert_bias mapping append.

get_common_mapping_list() already includes decoder.layers.*.mlp.router.expert_bias -> model.layers.*.mlp.gate.e_score_correction_bias, so this append is redundant.

Proposed fix

     def mapping_registry(self) -> MegatronMappingRegistry:
         mapping_list = get_common_mapping_list()
-        mapping_list.append(
-            AutoMapping(
-                megatron_param="decoder.layers.*.mlp.router.expert_bias",
-                hf_param="model.layers.*.mlp.gate.e_score_correction_bias",
-            )
-        )
         return MegatronMappingRegistry(*mapping_list)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 102
- 109, The mapping_registry method is appending a duplicate AutoMapping already
provided by get_common_mapping_list(); remove the redundant mapping append in
mapping_registry (the AutoMapping with megatron_param
"decoder.layers.*.mlp.router.expert_bias" and hf_param
"model.layers.*.mlp.gate.e_score_correction_bias") so that mapping_list only
relies on get_common_mapping_list() and does not re-add the same AutoMapping.

tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py (1)

91-101: Add category mark and explicit GPU-count gating for parallel cases.

This test parametrization includes multi-process variants (world_size=2). Adding explicit gating improves behavior on under-provisioned runners and clarifies execution requirements.

As per coding guidelines "tests//*.py: Use 'pytest.mark' to categorize tests (unit, integration, system)" and "tests//*.py: Document hardware requirements for GPU tests".

Proposed fix

 import pytest
+import torch
 from transformers import AutoConfig, AutoTokenizer, DeepseekV32ForCausalLM
@@
     `@pytest.mark.run_only_on`("GPU")
+    `@pytest.mark.integration`
     `@pytest.mark.parametrize`(
@@
     def test_deepseek_conversion_parallelism(self, deepseek_toy_model_path, tmp_path, tp, pp, ep, test_name):
@@
         world_size = tp * pp * ep
+        if torch.cuda.device_count() < world_size:
+            pytest.skip(f"Requires at least {world_size} GPUs for tp={tp}, pp={pp}, ep={ep}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`
around lines 91 - 101, Add a test category mark (e.g., `@pytest.mark.integration`
or `@pytest.mark.system`) to the test_deepseek_conversion_parallelism test and
guard multi-process cases by checking available GPUs: compute world_size = tp *
pp * ep inside test_deepseek_conversion_parallelism and if world_size > 1 and
torch.cuda.device_count() < world_size call pytest.skip(...) with a clear
message; keep the existing `@pytest.mark.run_only_on`("GPU") and parametrization
but add the new mark and the runtime GPU-count check to avoid running parallel
variants on under-provisioned runners.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/deepseek/common.py`:
- Around line 76-80: The code assumes hf_config.architectures exists; change the
check to safely handle missing or None architectures by reading architectures =
getattr(hf_config, "architectures", None) (or using
hf_config.__dict__.get("architectures")) and only perform the
"DeepseekV32ForCausalLM" membership test if architectures is a non-empty
iterable (e.g., isinstance(architectures, (list, tuple)) or
bool(architectures)); then set configs["experimental_attention_variant"],
configs["dsa_indexer_head_dim"], configs["dsa_indexer_n_heads"], and
configs["dsa_indexer_topk"] exactly as before when the membership test passes.
Use the existing symbols hf_config and configs to locate where to apply this
guard.

In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`:
- Around line 60-64: Replace the hardcoded DSA indexer constants by reading them
from the loaded HuggingFace model config object and assigning those values to
provider.dsa_indexer_head_dim, provider.dsa_indexer_n_heads and
provider.dsa_indexer_topk (and keep provider.dsa_indexer_loss_coeff if it is not
configurable); e.g., pull the values from the HF config variable used when
loading the model (e.g., config or hf_config) and fall back to sensible defaults
only if the config fields are missing, so the conversion uses the actual model
settings rather than fixed numbers.
- Around line 145-159: The code builds inv_freq (stored on
self._deepseek_inv_freq) and injects it into
converted_weights_dict[inv_freq_key] without verifying it matches the expected
tensor shape from the source; update the block that computes and assigns
inv_freq to first obtain a reference tensor (use
next(iter(converted_weights_dict.values())) or
converted_weights_dict[inv_freq_key] if present) and validate that
inv_freq.shape equals the reference tensor.shape (or the expected shape derived
from hf_config.qk_rope_head_dim: torch.arange(0, rotary_dim, 2).shape). If
shapes mismatch, either reshape/expand/convert inv_freq to match the reference
shape or raise/log a clear error and do not write the incompatible tensor; keep
all references to _deepseek_inv_freq, inv_freq, inv_freq_key,
converted_weights_dict, and hf_config.qk_rope_head_dim so the check is applied
before assigning converted_weights_dict[inv_freq_key] = inv_freq.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`:
- Around line 79-80: The test contains a duplicate call to
model.save_pretrained(model_dir, safe_serialization=True) (two identical
invocations); remove the redundant second call so the model is serialized only
once, keeping the remaining call to model.save_pretrained with
safe_serialization=True (reference: the save_pretrained calls on the model in
test_deepseek_v32_conversion.py).
- Line 171: The test asserts the wrong config key name: update the assertion in
test_deepseek_v32_conversion (the saved dict check) to use "index_topk" ->
"index_topk" is incorrect; change saved["index_top_k"] to saved["index_topk"] so
the test matches the conversion/config pathway that uses index_topk.
- Line 63: Guard the deletion of the fixture attribute so the fixture setup
doesn't raise when the attribute is missing: before removing
config.quantization_config in test_deepseek_v32_conversion.py, check whether the
attribute exists (e.g., with hasattr(config, "quantization_config")) or perform
the deletion inside a try/except catching AttributeError, and only remove it
when present to avoid breaking fixture creation.
- Around line 75-76: Replace the broad "except Exception: pass" in the tokenizer
setup block with targeted error handling: either catch specific expected
exceptions (e.g., OSError, FileNotFoundError, ValueError) or log the exception
and call pytest.skip(reason) or re-raise to fail the test; specifically update
the try/except surrounding the tokenizer initialization (the block that
currently uses "except Exception: pass") so failures are not silently swallowed
but reported or skipped with a clear message.

---

Nitpick comments:
In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`:
- Around line 102-109: The mapping_registry method is appending a duplicate
AutoMapping already provided by get_common_mapping_list(); remove the redundant
mapping append in mapping_registry (the AutoMapping with megatron_param
"decoder.layers.*.mlp.router.expert_bias" and hf_param
"model.layers.*.mlp.gate.e_score_correction_bias") so that mapping_list only
relies on get_common_mapping_list() and does not re-add the same AutoMapping.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`:
- Around line 91-101: Add a test category mark (e.g., `@pytest.mark.integration`
or `@pytest.mark.system`) to the test_deepseek_conversion_parallelism test and
guard multi-process cases by checking available GPUs: compute world_size = tp *
pp * ep inside test_deepseek_conversion_parallelism and if world_size > 1 and
torch.cuda.device_count() < world_size call pytest.skip(...) with a clear
message; keep the existing `@pytest.mark.run_only_on`("GPU") and parametrization
but add the new mark and the runtime GPU-count check to avoid running parallel
variants on under-provisioned runners.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b86749b and 8653312.

📒 Files selected for processing (4)

src/megatron/bridge/models/deepseek/__init__.py
src/megatron/bridge/models/deepseek/common.py
src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py
tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py

coderabbitai · 2026-02-25T09:54:00Z

+    if "DeepseekV32ForCausalLM" in hf_config.architectures:
+        configs["experimental_attention_variant"] = "dsa"
+        configs["dsa_indexer_head_dim"] = hf_config.index_head_dim
+        configs["dsa_indexer_n_heads"] = hf_config.index_n_heads
+        configs["dsa_indexer_topk"] = hf_config.index_topk


⚠️ Potential issue | 🟠 Major

Handle missing architectures safely in V3.2 detection.

Line 76 assumes hf_config.architectures is always present and iterable. If it is missing or None, conversion fails before config translation completes.

Proposed fix

- if "DeepseekV32ForCausalLM" in hf_config.architectures: + architectures = getattr(hf_config, "architectures", None) or [] + if "DeepseekV32ForCausalLM" in architectures: configs["experimental_attention_variant"] = "dsa" configs["dsa_indexer_head_dim"] = hf_config.index_head_dim configs["dsa_indexer_n_heads"] = hf_config.index_n_heads configs["dsa_indexer_topk"] = hf_config.index_topk

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/deepseek/common.py` around lines 76 - 80, The code assumes hf_config.architectures exists; change the check to safely handle missing or None architectures by reading architectures = getattr(hf_config, "architectures", None) (or using hf_config.__dict__.get("architectures")) and only perform the "DeepseekV32ForCausalLM" membership test if architectures is a non-empty iterable (e.g., isinstance(architectures, (list, tuple)) or bool(architectures)); then set configs["experimental_attention_variant"], configs["dsa_indexer_head_dim"], configs["dsa_indexer_n_heads"], and configs["dsa_indexer_topk"] exactly as before when the membership test passes. Use the existing symbols hf_config and configs to locate where to apply this guard.

coderabbitai · 2026-02-25T09:54:01Z

+        provider.experimental_attention_variant = "dsa"
+        provider.dsa_indexer_head_dim = 128
+        provider.dsa_indexer_n_heads = 64
+        provider.dsa_indexer_topk = 2048
+        provider.dsa_indexer_loss_coeff = 0.001


⚠️ Potential issue | 🟠 Major

Derive DSA indexer settings from HF config instead of hardcoding.

Lines 61-63 hardcode indexer dimensions/top-k, which can diverge from the loaded model config and cause conversion mismatches.

Proposed fix

provider.experimental_attention_variant = "dsa" - provider.dsa_indexer_head_dim = 128 - provider.dsa_indexer_n_heads = 64 - provider.dsa_indexer_topk = 2048 + provider.dsa_indexer_head_dim = hf_config.index_head_dim + provider.dsa_indexer_n_heads = hf_config.index_n_heads + provider.dsa_indexer_topk = hf_config.index_topk provider.dsa_indexer_loss_coeff = 0.001

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 60 - 64, Replace the hardcoded DSA indexer constants by reading them from the loaded HuggingFace model config object and assigning those values to provider.dsa_indexer_head_dim, provider.dsa_indexer_n_heads and provider.dsa_indexer_topk (and keep provider.dsa_indexer_loss_coeff if it is not configurable); e.g., pull the values from the HF config variable used when loading the model (e.g., config or hf_config) and fall back to sensible defaults only if the config fields are missing, so the conversion uses the actual model settings rather than fixed numbers.

coderabbitai · 2026-02-25T09:54:01Z

+        inv_freq = getattr(self, "_deepseek_inv_freq", None)
+        if inv_freq is None:
+            rotary_dim = self.hf_config.qk_rope_head_dim
+            rotary_base = self.hf_config.rope_theta
+            inv_freq = 1.0 / (rotary_base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
+            self._deepseek_inv_freq = inv_freq
+
+        if converted_weights_dict:
+            reference_tensor = next(iter(converted_weights_dict.values()))
+            if inv_freq.device != reference_tensor.device:
+                inv_freq = inv_freq.to(device=reference_tensor.device)
+                self._deepseek_inv_freq = inv_freq
+
+        converted_weights_dict[inv_freq_key] = inv_freq
+        return converted_weights_dict


⚠️ Potential issue | 🟠 Major

Validate inv_freq tensor shape before injecting converted weights.

Lines 145-159 synthesize inv_freq and write it into converted_weights_dict without checking compatibility with the source tensor shape for that key.

As per coding guidelines "src/megatron/bridge/models/**/*.py: Always validate tensor shapes before copying weights in weight conversion".

Proposed fix

inv_freq = getattr(self, "_deepseek_inv_freq", None) if inv_freq is None: rotary_dim = self.hf_config.qk_rope_head_dim rotary_base = self.hf_config.rope_theta inv_freq = 1.0 / (rotary_base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim)) self._deepseek_inv_freq = inv_freq + + source_inv_freq = hf_state_dict.get(inv_freq_key) + if source_inv_freq is not None and tuple(source_inv_freq.shape) != tuple(inv_freq.shape): + raise ValueError( + f"Shape mismatch for {inv_freq_key}: source={tuple(source_inv_freq.shape)} " + f"computed={tuple(inv_freq.shape)}" + ) if converted_weights_dict: reference_tensor = next(iter(converted_weights_dict.values())) if inv_freq.device != reference_tensor.device: inv_freq = inv_freq.to(device=reference_tensor.device) self._deepseek_inv_freq = inv_freq

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 145 - 159, The code builds inv_freq (stored on self._deepseek_inv_freq) and injects it into converted_weights_dict[inv_freq_key] without verifying it matches the expected tensor shape from the source; update the block that computes and assigns inv_freq to first obtain a reference tensor (use next(iter(converted_weights_dict.values())) or converted_weights_dict[inv_freq_key] if present) and validate that inv_freq.shape equals the reference tensor.shape (or the expected shape derived from hf_config.qk_rope_head_dim: torch.arange(0, rotary_dim, 2).shape). If shapes mismatch, either reshape/expand/convert inv_freq to match the reference shape or raise/log a clear error and do not write the incompatible tensor; keep all references to _deepseek_inv_freq, inv_freq, inv_freq_key, converted_weights_dict, and hf_config.qk_rope_head_dim so the check is applied before assigning converted_weights_dict[inv_freq_key] = inv_freq.

coderabbitai · 2026-02-25T09:54:01Z

+
+        for key, value in DEEPSEEK_V32_OVERRIDES.items():
+            setattr(config, key, value)
+        del config.quantization_config


⚠️ Potential issue | 🟠 Major

Guard quantization_config removal to avoid fixture crashes.

Line 63 can raise AttributeError when quantization_config is absent, which breaks fixture creation before the test body runs.

Proposed fix

- del config.quantization_config + if hasattr(config, "quantization_config"): + delattr(config, "quantization_config")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` at line 63, Guard the deletion of the fixture attribute so the fixture setup doesn't raise when the attribute is missing: before removing config.quantization_config in test_deepseek_v32_conversion.py, check whether the attribute exists (e.g., with hasattr(config, "quantization_config")) or perform the deletion inside a try/except catching AttributeError, and only remove it when present to avoid breaking fixture creation.

coderabbitai · 2026-02-25T09:54:01Z

+        except Exception:
+            pass


⚠️ Potential issue | 🟡 Minor

Avoid broad silent exception swallowing in tokenizer setup.

Lines 75-76 hide all failures, making this functional test harder to debug when environment/setup regressions occur.

Proposed fix

- except Exception: - pass + except OSError as exc: + pytest.skip(f"Tokenizer setup failed in test fixture: {exc}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

except Exception:

pass

except OSError as exc:

pytest.skip(f"Tokenizer setup failed in test fixture: {exc}")

🧰 Tools

🪛 Ruff (0.15.2)

[error] 75-76: try-except-pass detected, consider logging the exception

(S110)

[warning] 75-75: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` around lines 75 - 76, Replace the broad "except Exception: pass" in the tokenizer setup block with targeted error handling: either catch specific expected exceptions (e.g., OSError, FileNotFoundError, ValueError) or log the exception and call pytest.skip(reason) or re-raise to fail the test; specifically update the try/except surrounding the tokenizer initialization (the block that currently uses "except Exception: pass") so failures are not silently swallowed but reported or skipped with a clear message.

coderabbitai · 2026-02-25T09:54:01Z

+        model.save_pretrained(model_dir, safe_serialization=True)
+        model.save_pretrained(model_dir, safe_serialization=True)


⚠️ Potential issue | 🟡 Minor

Remove duplicate model serialization call.

Line 80 repeats the exact save_pretrained invocation from Line 79.

Proposed fix

model.save_pretrained(model_dir, safe_serialization=True) - model.save_pretrained(model_dir, safe_serialization=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

model.save_pretrained(model_dir, safe_serialization=True)

model.save_pretrained(model_dir, safe_serialization=True)

model.save_pretrained(model_dir, safe_serialization=True)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` around lines 79 - 80, The test contains a duplicate call to model.save_pretrained(model_dir, safe_serialization=True) (two identical invocations); remove the redundant second call so the model is serialized only once, keeping the remaining call to model.save_pretrained with safe_serialization=True (reference: the save_pretrained calls on the model in test_deepseek_v32_conversion.py).

coderabbitai · 2026-02-25T09:54:01Z

+        assert saved["moe_intermediate_size"] == 768
+        assert saved["index_head_dim"] == 128
+        assert saved["index_n_heads"] == 16
+        assert saved["index_top_k"] == 2048


⚠️ Potential issue | 🟠 Major

Fix the asserted index-topk config key name.

Line 171 asserts index_top_k, but the conversion/config pathway uses index_topk naming. This can create a false-negative assertion.

Proposed fix

- assert saved["index_top_k"] == 2048 + assert saved["index_topk"] == 2048

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

assert saved["index_top_k"] == 2048

assert saved["index_topk"] == 2048

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` at line 171, The test asserts the wrong config key name: update the assertion in test_deepseek_v32_conversion (the saved dict check) to use "index_topk" -> "index_topk" is incorrect; change saved["index_top_k"] to saved["index_topk"] so the test matches the conversion/config pathway that uses index_topk.

yaoyu-33 · 2026-02-25T18:38:29Z

Need dependency[main] fix(moe): Fix several bugs of experimental attention variant by yuzhongw-nvidia · Pull Request #3026 · NVIDIA/Megatron-LM and need some more tests

janEbert · 2026-03-02T20:02:20Z

Hey, just for my understanding, this PR is not usable to convert checkpoints yet, right? Since DSv3.2 support is currently missing in transformers (huggingface/transformers#41251 has not been merged yet), the original checkpoint cannot be loaded. Or do I have to patch the checkpoint in some way to make it work? (E.g., set model_type: "deepseek_v3" in the checkpoint's config.json?)

yuzhongw-nvidia · 2026-03-03T01:15:29Z

Hey, just for my understanding, this PR is not usable to convert checkpoints yet, right? Since DSv3.2 support is currently missing in transformers (huggingface/transformers#41251 has not been merged yet), the original checkpoint cannot be loaded. Or do I have to patch the checkpoint in some way to make it work? (E.g., set model_type: "deepseek_v3" in the checkpoint's config.json?)

I'm not sure, but @yaoyu-33 told me that megatron-bridge can do the conversion by straightforwardly loading the weights from the safetensors files now, without constructing an HF model object. Hi @yaoyu-33, please correct me if I misunderstood, thanks.

janEbert · 2026-03-09T11:32:56Z

Hey, what you said seems to be correct! However, while the HF model is never instantiated, the script does look for a model class to be defined, so some work is required.

Therefore I fixed the original checkpoint like this to load it:

# "HuggingFace" checkpoint directory from, e.g.,
# https://huggingface.co/deepseek-ai/DeepSeek-V3.2
hf_cp_dir=/path/to/DeepSeek-V3.2  # TODO change this

# Add dummy configuration code
cat <<EOF > "$hf_cp_dir"/configuration_deepseek.py
# coding=utf-8

from transformers.models.deepseek_v3 import DeepseekV3Config

DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}

class DeepseekV32Config(DeepseekV3Config):
    model_type = "deepseek_v32"
    index_head_dim = 128
    index_n_heads = 64
    index_topk = 2048

__all__ = ["DeepseekV32Config"]
EOF

# Add dummy modeling code
cat <<EOF > "$hf_cp_dir"/modeling_deepseek.py
# coding=utf-8

from transformers.models.deepseek_v3 import DeepseekV3ForCausalLM, DeepseekV3Model

class DeepseekV32Model(DeepseekV3Model):
    pass

class DeepseekV32ForCausalLM(DeepseekV3ForCausalLM):
    pass

__all__ = ["DeepseekV32Model", "DeepseekV32ForCausalLM"]
EOF

# Add AutoModel resolution
sed -i '/"architectures":/i\  "auto_map": {\
    "AutoConfig": "configuration_deepseek.DeepseekV32Config",\
    "AutoModel": "modeling_deepseek.DeepseekV32Model",\
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV32ForCausalLM"\
  },' "$hf_cp_dir"/config.json

And a small MBridge patch due to MCore upstream updates:

mbridge_dir=/path/to/Megatron-Bridge  # TODO change this

sed -i 's/== ModelType\.encoder_and_decoder/!= ModelType.encoder_or_decoder/
s/!= ModelType\.encoder_and_decoder/== ModelType.encoder_or_decoder/' \
    "$mbridge_dir"/src/megatron/bridge/models/model_provider.py

How I submitted:

env PYTHONPATH="$mbridge_dir"/src:"$PYTHONPATH" python \
    -u \
    "$mbridge_dir"/examples/conversion/convert_checkpoints.py \
    import \
    --hf-model "$hf_cp_dir" \
    --megatron-path "$hf_cp_dir"_mlm \
    --trust-remote-code

With these patches, I managed to convert the original DSv3.2 checkpoint with your commit df6f50a. :)
Haven't tried to load it yet, but there was no warning or error during the conversion.

yaoyu-33 previously approved these changes Jan 8, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 03:33 Inactive

yuzhongw-nvidia dismissed yaoyu-33’s stale review via 4f0d855 January 8, 2026 04:39

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 18:37 Inactive

copy-pr-bot Bot temporarily deployed to test January 8, 2026 18:38 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 18:41 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 19:03 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 19:09 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 8, 2026 19:09 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 19:09 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 8, 2026 19:09 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 8, 2026 19:09 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 10, 2026 01:04 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 10, 2026 01:04 Failure

yaoyu-33 mentioned this pull request Jan 10, 2026

[Feature Request] Support for DeepSeek-V3.2-Speciale Model Training #1875

Open

Phlip79 mentioned this pull request Jan 28, 2026

DeepSeek V3.2 support NVIDIA/Megatron-LM#2440

Merged

HollowMan6 requested changes Feb 20, 2026

View reviewed changes

Comment thread src/megatron/bridge/models/deepseek/common.py Outdated

support deepseek-v3.2

8653312

yuzhongw-nvidia force-pushed the dsv32 branch from 4f0d855 to 8653312 Compare February 25, 2026 09:39

coderabbitai Bot reviewed Feb 25, 2026

View reviewed changes

update tests

540e463

fix dsv32 bridge

df6f50a

janEbert mentioned this pull request Apr 7, 2026

[QUESTION]Different RoPE Split Order for Q/K in DSA Implementation NVIDIA/Megatron-LM#2733

Closed

yaoyu-33 added area:model Model implementations and HF bridge logic feature New capabilities, enhancements, or enablement work waiting-on-customer Waiting on the original author to respond labels May 12, 2026

-        except Exception:
-            pass
+        except OSError as exc:
+            pytest.skip(f"Tokenizer setup failed in test fixture: {exc}")

		model.save_pretrained(model_dir, safe_serialization=True)
		model.save_pretrained(model_dir, safe_serialization=True)

	assert saved["index_top_k"] == 2048
	assert saved["index_topk"] == 2048

Conversation

yuzhongw-nvidia commented Nov 19, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Changelog

GitHub Actions CI

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Nov 19, 2025

Uh oh!

yaoyu-33 commented Jan 8, 2026

Uh oh!

yaoyu-33 commented Jan 8, 2026

Uh oh!

yaoyu-33 commented Jan 8, 2026

Uh oh!

yaoyu-33 commented Jan 20, 2026

Uh oh!

wuyaoxuehun commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Feb 25, 2026

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

yaoyu-33 commented Feb 25, 2026

Uh oh!

janEbert commented Mar 2, 2026

Uh oh!

yuzhongw-nvidia commented Mar 3, 2026

Uh oh!

janEbert commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

yuzhongw-nvidia commented Nov 19, 2025 •

edited by coderabbitai Bot

Loading

wuyaoxuehun commented Feb 12, 2026 •

edited

Loading