Skip to content

Support DeepSeek-v3.2#1421

Open
yuzhongw-nvidia wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
yuzhongw-nvidia:dsv32
Open

Support DeepSeek-v3.2#1421
yuzhongw-nvidia wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
yuzhongw-nvidia:dsv32

Conversation

@yuzhongw-nvidia

@yuzhongw-nvidia yuzhongw-nvidia commented Nov 19, 2025

Copy link
Copy Markdown
Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added support for DeepSeek-V3.2 model integration with Megatron framework, including multi-latent attention and sparse attention indexing.
    • Enabled mixture-of-experts configuration handling for the new model variant.
  • Tests

    • Added functional tests for DeepSeek-V3.2 model conversion validation.

@copy-pr-bot

copy-pr-bot Bot commented Nov 19, 2025

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33
yaoyu-33 previously approved these changes Jan 8, 2026
@yaoyu-33

yaoyu-33 commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

/ok to test 066adea

@yaoyu-33

yaoyu-33 commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

@yuzhongw-nvidia can you run " pre-commit run --all-files"

@yaoyu-33

yaoyu-33 commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

/ok to test 4f0d855

@yaoyu-33

Copy link
Copy Markdown
Contributor

pending transformers merging on HF side.

@wuyaoxuehun

wuyaoxuehun commented Feb 12, 2026

Copy link
Copy Markdown

pending transformers merging on HF side.

@yaoyu-33 can you merge this pr now? and can GLM-5 be also implemented cause they share the same arch?

Comment thread src/megatron/bridge/models/deepseek/common.py Outdated
@coderabbitai

coderabbitai Bot commented Feb 25, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR introduces DeepSeek-V3.2 model conversion support in Megatron, adding a new bridge implementation with provider customization for sparse attention indexing, mixture-of-experts settings, multi-latent attention, and comprehensive weight mapping logic. Functional tests validate the conversion pipeline across multi-GPU environments.

Changes

Cohort / File(s) Summary
Bridge Implementation & Export
src/megatron/bridge/models/deepseek/__init__.py, src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py
New DeepSeekV32Bridge class for DeepSeek-V3.2 model conversion with provider customization (experimental DSA attention, MoE settings, rope, fusion options), mapping registry for parameter translation, and automatic rotary embedding inverse frequency injection during weight conversion.
Configuration & Mappings
src/megatron/bridge/models/deepseek/common.py
Added DeepSeek-V3.2-specific indexer configuration selection logic and sparse attention parameter mappings (indexer.linear_wq_b, indexer.linear_wk, indexer.k_norm, indexer.weights_proj) plus MoE router expert bias mapping to HF e_score_correction_bias.
Functional Tests
tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py
New test suite for DeepSeek-V3.2 conversion validation, creating toy models with MoE and DSA configurations, executing multi-GPU conversion pipeline, and verifying output integrity (config fields, weight presence, parameter preservation).

Sequence Diagram

sequenceDiagram
    participant HF as HuggingFace<br/>DeepSeek-V3.2
    participant Bridge as DeepSeekV32Bridge
    participant Config as Config &<br/>Mappings
    participant Megatron as Megatron<br/>Model
    participant Weights as Weight<br/>Converter

    HF->>Bridge: Load HF model config & weights
    Bridge->>Config: Query get_common_configs()
    Config-->>Bridge: Return DSA indexer & MoE settings
    Bridge->>Config: Query get_common_mapping_list()
    Config-->>Bridge: Return sparse attention & router mappings
    Bridge->>Bridge: Create provider with TE specs & fusion options
    Bridge-->>Megatron: Build GPTModel with DSA & MoE
    Bridge->>Weights: Convert HF weights via mapping
    Weights->>HF: Apply parameter mappings (indexer, router bias)
    Weights->>Weights: Inject rotary embedding inv_freq
    Weights-->>Megatron: Load augmented weights
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

community-request

Suggested reviewers

  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR adds DeepSeek-V3.2 support but lacks documented test results and evidence of successful execution in description. Document test results confirming successful conversion with correct model configuration and address all flagged code issues before running tests.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Support DeepSeek-v3.2' directly and clearly summarizes the main change: adding support for the DeepSeek-V3.2 model through new bridge implementation, configuration mappings, and functional tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (2)
src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py (1)

102-109: Remove duplicate expert_bias mapping append.

get_common_mapping_list() already includes decoder.layers.*.mlp.router.expert_bias -> model.layers.*.mlp.gate.e_score_correction_bias, so this append is redundant.

Proposed fix
     def mapping_registry(self) -> MegatronMappingRegistry:
         mapping_list = get_common_mapping_list()
-        mapping_list.append(
-            AutoMapping(
-                megatron_param="decoder.layers.*.mlp.router.expert_bias",
-                hf_param="model.layers.*.mlp.gate.e_score_correction_bias",
-            )
-        )
         return MegatronMappingRegistry(*mapping_list)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 102
- 109, The mapping_registry method is appending a duplicate AutoMapping already
provided by get_common_mapping_list(); remove the redundant mapping append in
mapping_registry (the AutoMapping with megatron_param
"decoder.layers.*.mlp.router.expert_bias" and hf_param
"model.layers.*.mlp.gate.e_score_correction_bias") so that mapping_list only
relies on get_common_mapping_list() and does not re-add the same AutoMapping.
tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py (1)

91-101: Add category mark and explicit GPU-count gating for parallel cases.

This test parametrization includes multi-process variants (world_size=2). Adding explicit gating improves behavior on under-provisioned runners and clarifies execution requirements.

As per coding guidelines "tests//*.py: Use 'pytest.mark' to categorize tests (unit, integration, system)" and "tests//*.py: Document hardware requirements for GPU tests".

Proposed fix
 import pytest
+import torch
 from transformers import AutoConfig, AutoTokenizer, DeepseekV32ForCausalLM
@@
     `@pytest.mark.run_only_on`("GPU")
+    `@pytest.mark.integration`
     `@pytest.mark.parametrize`(
@@
     def test_deepseek_conversion_parallelism(self, deepseek_toy_model_path, tmp_path, tp, pp, ep, test_name):
@@
         world_size = tp * pp * ep
+        if torch.cuda.device_count() < world_size:
+            pytest.skip(f"Requires at least {world_size} GPUs for tp={tp}, pp={pp}, ep={ep}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`
around lines 91 - 101, Add a test category mark (e.g., `@pytest.mark.integration`
or `@pytest.mark.system`) to the test_deepseek_conversion_parallelism test and
guard multi-process cases by checking available GPUs: compute world_size = tp *
pp * ep inside test_deepseek_conversion_parallelism and if world_size > 1 and
torch.cuda.device_count() < world_size call pytest.skip(...) with a clear
message; keep the existing `@pytest.mark.run_only_on`("GPU") and parametrization
but add the new mark and the runtime GPU-count check to avoid running parallel
variants on under-provisioned runners.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/deepseek/common.py`:
- Around line 76-80: The code assumes hf_config.architectures exists; change the
check to safely handle missing or None architectures by reading architectures =
getattr(hf_config, "architectures", None) (or using
hf_config.__dict__.get("architectures")) and only perform the
"DeepseekV32ForCausalLM" membership test if architectures is a non-empty
iterable (e.g., isinstance(architectures, (list, tuple)) or
bool(architectures)); then set configs["experimental_attention_variant"],
configs["dsa_indexer_head_dim"], configs["dsa_indexer_n_heads"], and
configs["dsa_indexer_topk"] exactly as before when the membership test passes.
Use the existing symbols hf_config and configs to locate where to apply this
guard.

In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`:
- Around line 60-64: Replace the hardcoded DSA indexer constants by reading them
from the loaded HuggingFace model config object and assigning those values to
provider.dsa_indexer_head_dim, provider.dsa_indexer_n_heads and
provider.dsa_indexer_topk (and keep provider.dsa_indexer_loss_coeff if it is not
configurable); e.g., pull the values from the HF config variable used when
loading the model (e.g., config or hf_config) and fall back to sensible defaults
only if the config fields are missing, so the conversion uses the actual model
settings rather than fixed numbers.
- Around line 145-159: The code builds inv_freq (stored on
self._deepseek_inv_freq) and injects it into
converted_weights_dict[inv_freq_key] without verifying it matches the expected
tensor shape from the source; update the block that computes and assigns
inv_freq to first obtain a reference tensor (use
next(iter(converted_weights_dict.values())) or
converted_weights_dict[inv_freq_key] if present) and validate that
inv_freq.shape equals the reference tensor.shape (or the expected shape derived
from hf_config.qk_rope_head_dim: torch.arange(0, rotary_dim, 2).shape). If
shapes mismatch, either reshape/expand/convert inv_freq to match the reference
shape or raise/log a clear error and do not write the incompatible tensor; keep
all references to _deepseek_inv_freq, inv_freq, inv_freq_key,
converted_weights_dict, and hf_config.qk_rope_head_dim so the check is applied
before assigning converted_weights_dict[inv_freq_key] = inv_freq.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`:
- Around line 79-80: The test contains a duplicate call to
model.save_pretrained(model_dir, safe_serialization=True) (two identical
invocations); remove the redundant second call so the model is serialized only
once, keeping the remaining call to model.save_pretrained with
safe_serialization=True (reference: the save_pretrained calls on the model in
test_deepseek_v32_conversion.py).
- Line 171: The test asserts the wrong config key name: update the assertion in
test_deepseek_v32_conversion (the saved dict check) to use "index_topk" ->
"index_topk" is incorrect; change saved["index_top_k"] to saved["index_topk"] so
the test matches the conversion/config pathway that uses index_topk.
- Line 63: Guard the deletion of the fixture attribute so the fixture setup
doesn't raise when the attribute is missing: before removing
config.quantization_config in test_deepseek_v32_conversion.py, check whether the
attribute exists (e.g., with hasattr(config, "quantization_config")) or perform
the deletion inside a try/except catching AttributeError, and only remove it
when present to avoid breaking fixture creation.
- Around line 75-76: Replace the broad "except Exception: pass" in the tokenizer
setup block with targeted error handling: either catch specific expected
exceptions (e.g., OSError, FileNotFoundError, ValueError) or log the exception
and call pytest.skip(reason) or re-raise to fail the test; specifically update
the try/except surrounding the tokenizer initialization (the block that
currently uses "except Exception: pass") so failures are not silently swallowed
but reported or skipped with a clear message.

---

Nitpick comments:
In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`:
- Around line 102-109: The mapping_registry method is appending a duplicate
AutoMapping already provided by get_common_mapping_list(); remove the redundant
mapping append in mapping_registry (the AutoMapping with megatron_param
"decoder.layers.*.mlp.router.expert_bias" and hf_param
"model.layers.*.mlp.gate.e_score_correction_bias") so that mapping_list only
relies on get_common_mapping_list() and does not re-add the same AutoMapping.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`:
- Around line 91-101: Add a test category mark (e.g., `@pytest.mark.integration`
or `@pytest.mark.system`) to the test_deepseek_conversion_parallelism test and
guard multi-process cases by checking available GPUs: compute world_size = tp *
pp * ep inside test_deepseek_conversion_parallelism and if world_size > 1 and
torch.cuda.device_count() < world_size call pytest.skip(...) with a clear
message; keep the existing `@pytest.mark.run_only_on`("GPU") and parametrization
but add the new mark and the runtime GPU-count check to avoid running parallel
variants on under-provisioned runners.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b86749b and 8653312.

📒 Files selected for processing (4)
  • src/megatron/bridge/models/deepseek/__init__.py
  • src/megatron/bridge/models/deepseek/common.py
  • src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py
  • tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py

Comment on lines +76 to +80
if "DeepseekV32ForCausalLM" in hf_config.architectures:
configs["experimental_attention_variant"] = "dsa"
configs["dsa_indexer_head_dim"] = hf_config.index_head_dim
configs["dsa_indexer_n_heads"] = hf_config.index_n_heads
configs["dsa_indexer_topk"] = hf_config.index_topk

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle missing architectures safely in V3.2 detection.

Line 76 assumes hf_config.architectures is always present and iterable. If it is missing or None, conversion fails before config translation completes.

Proposed fix
-    if "DeepseekV32ForCausalLM" in hf_config.architectures:
+    architectures = getattr(hf_config, "architectures", None) or []
+    if "DeepseekV32ForCausalLM" in architectures:
         configs["experimental_attention_variant"] = "dsa"
         configs["dsa_indexer_head_dim"] = hf_config.index_head_dim
         configs["dsa_indexer_n_heads"] = hf_config.index_n_heads
         configs["dsa_indexer_topk"] = hf_config.index_topk
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/common.py` around lines 76 - 80, The code
assumes hf_config.architectures exists; change the check to safely handle
missing or None architectures by reading architectures = getattr(hf_config,
"architectures", None) (or using hf_config.__dict__.get("architectures")) and
only perform the "DeepseekV32ForCausalLM" membership test if architectures is a
non-empty iterable (e.g., isinstance(architectures, (list, tuple)) or
bool(architectures)); then set configs["experimental_attention_variant"],
configs["dsa_indexer_head_dim"], configs["dsa_indexer_n_heads"], and
configs["dsa_indexer_topk"] exactly as before when the membership test passes.
Use the existing symbols hf_config and configs to locate where to apply this
guard.

Comment on lines +60 to +64
provider.experimental_attention_variant = "dsa"
provider.dsa_indexer_head_dim = 128
provider.dsa_indexer_n_heads = 64
provider.dsa_indexer_topk = 2048
provider.dsa_indexer_loss_coeff = 0.001

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Derive DSA indexer settings from HF config instead of hardcoding.

Lines 61-63 hardcode indexer dimensions/top-k, which can diverge from the loaded model config and cause conversion mismatches.

Proposed fix
         provider.experimental_attention_variant = "dsa"
-        provider.dsa_indexer_head_dim = 128
-        provider.dsa_indexer_n_heads = 64
-        provider.dsa_indexer_topk = 2048
+        provider.dsa_indexer_head_dim = hf_config.index_head_dim
+        provider.dsa_indexer_n_heads = hf_config.index_n_heads
+        provider.dsa_indexer_topk = hf_config.index_topk
         provider.dsa_indexer_loss_coeff = 0.001
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 60 -
64, Replace the hardcoded DSA indexer constants by reading them from the loaded
HuggingFace model config object and assigning those values to
provider.dsa_indexer_head_dim, provider.dsa_indexer_n_heads and
provider.dsa_indexer_topk (and keep provider.dsa_indexer_loss_coeff if it is not
configurable); e.g., pull the values from the HF config variable used when
loading the model (e.g., config or hf_config) and fall back to sensible defaults
only if the config fields are missing, so the conversion uses the actual model
settings rather than fixed numbers.

Comment on lines +145 to +159
inv_freq = getattr(self, "_deepseek_inv_freq", None)
if inv_freq is None:
rotary_dim = self.hf_config.qk_rope_head_dim
rotary_base = self.hf_config.rope_theta
inv_freq = 1.0 / (rotary_base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
self._deepseek_inv_freq = inv_freq

if converted_weights_dict:
reference_tensor = next(iter(converted_weights_dict.values()))
if inv_freq.device != reference_tensor.device:
inv_freq = inv_freq.to(device=reference_tensor.device)
self._deepseek_inv_freq = inv_freq

converted_weights_dict[inv_freq_key] = inv_freq
return converted_weights_dict

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate inv_freq tensor shape before injecting converted weights.

Lines 145-159 synthesize inv_freq and write it into converted_weights_dict without checking compatibility with the source tensor shape for that key.

As per coding guidelines "src/megatron/bridge/models/**/*.py: Always validate tensor shapes before copying weights in weight conversion".

Proposed fix
         inv_freq = getattr(self, "_deepseek_inv_freq", None)
         if inv_freq is None:
             rotary_dim = self.hf_config.qk_rope_head_dim
             rotary_base = self.hf_config.rope_theta
             inv_freq = 1.0 / (rotary_base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
             self._deepseek_inv_freq = inv_freq
+
+        source_inv_freq = hf_state_dict.get(inv_freq_key)
+        if source_inv_freq is not None and tuple(source_inv_freq.shape) != tuple(inv_freq.shape):
+            raise ValueError(
+                f"Shape mismatch for {inv_freq_key}: source={tuple(source_inv_freq.shape)} "
+                f"computed={tuple(inv_freq.shape)}"
+            )
 
         if converted_weights_dict:
             reference_tensor = next(iter(converted_weights_dict.values()))
             if inv_freq.device != reference_tensor.device:
                 inv_freq = inv_freq.to(device=reference_tensor.device)
                 self._deepseek_inv_freq = inv_freq
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 145
- 159, The code builds inv_freq (stored on self._deepseek_inv_freq) and injects
it into converted_weights_dict[inv_freq_key] without verifying it matches the
expected tensor shape from the source; update the block that computes and
assigns inv_freq to first obtain a reference tensor (use
next(iter(converted_weights_dict.values())) or
converted_weights_dict[inv_freq_key] if present) and validate that
inv_freq.shape equals the reference tensor.shape (or the expected shape derived
from hf_config.qk_rope_head_dim: torch.arange(0, rotary_dim, 2).shape). If
shapes mismatch, either reshape/expand/convert inv_freq to match the reference
shape or raise/log a clear error and do not write the incompatible tensor; keep
all references to _deepseek_inv_freq, inv_freq, inv_freq_key,
converted_weights_dict, and hf_config.qk_rope_head_dim so the check is applied
before assigning converted_weights_dict[inv_freq_key] = inv_freq.


for key, value in DEEPSEEK_V32_OVERRIDES.items():
setattr(config, key, value)
del config.quantization_config

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard quantization_config removal to avoid fixture crashes.

Line 63 can raise AttributeError when quantization_config is absent, which breaks fixture creation before the test body runs.

Proposed fix
-        del config.quantization_config
+        if hasattr(config, "quantization_config"):
+            delattr(config, "quantization_config")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` at
line 63, Guard the deletion of the fixture attribute so the fixture setup
doesn't raise when the attribute is missing: before removing
config.quantization_config in test_deepseek_v32_conversion.py, check whether the
attribute exists (e.g., with hasattr(config, "quantization_config")) or perform
the deletion inside a try/except catching AttributeError, and only remove it
when present to avoid breaking fixture creation.

Comment on lines +75 to +76
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Avoid broad silent exception swallowing in tokenizer setup.

Lines 75-76 hide all failures, making this functional test harder to debug when environment/setup regressions occur.

Proposed fix
-        except Exception:
-            pass
+        except OSError as exc:
+            pytest.skip(f"Tokenizer setup failed in test fixture: {exc}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception:
pass
except OSError as exc:
pytest.skip(f"Tokenizer setup failed in test fixture: {exc}")
🧰 Tools
🪛 Ruff (0.15.2)

[error] 75-76: try-except-pass detected, consider logging the exception

(S110)


[warning] 75-75: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`
around lines 75 - 76, Replace the broad "except Exception: pass" in the
tokenizer setup block with targeted error handling: either catch specific
expected exceptions (e.g., OSError, FileNotFoundError, ValueError) or log the
exception and call pytest.skip(reason) or re-raise to fail the test;
specifically update the try/except surrounding the tokenizer initialization (the
block that currently uses "except Exception: pass") so failures are not silently
swallowed but reported or skipped with a clear message.

Comment on lines +79 to +80
model.save_pretrained(model_dir, safe_serialization=True)
model.save_pretrained(model_dir, safe_serialization=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove duplicate model serialization call.

Line 80 repeats the exact save_pretrained invocation from Line 79.

Proposed fix
         model.save_pretrained(model_dir, safe_serialization=True)
-        model.save_pretrained(model_dir, safe_serialization=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
model.save_pretrained(model_dir, safe_serialization=True)
model.save_pretrained(model_dir, safe_serialization=True)
model.save_pretrained(model_dir, safe_serialization=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`
around lines 79 - 80, The test contains a duplicate call to
model.save_pretrained(model_dir, safe_serialization=True) (two identical
invocations); remove the redundant second call so the model is serialized only
once, keeping the remaining call to model.save_pretrained with
safe_serialization=True (reference: the save_pretrained calls on the model in
test_deepseek_v32_conversion.py).

assert saved["moe_intermediate_size"] == 768
assert saved["index_head_dim"] == 128
assert saved["index_n_heads"] == 16
assert saved["index_top_k"] == 2048

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix the asserted index-topk config key name.

Line 171 asserts index_top_k, but the conversion/config pathway uses index_topk naming. This can create a false-negative assertion.

Proposed fix
-        assert saved["index_top_k"] == 2048
+        assert saved["index_topk"] == 2048
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert saved["index_top_k"] == 2048
assert saved["index_topk"] == 2048
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` at
line 171, The test asserts the wrong config key name: update the assertion in
test_deepseek_v32_conversion (the saved dict check) to use "index_topk" ->
"index_topk" is incorrect; change saved["index_top_k"] to saved["index_topk"] so
the test matches the conversion/config pathway that uses index_topk.

@yaoyu-33

Copy link
Copy Markdown
Contributor

@janEbert

janEbert commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Hey, just for my understanding, this PR is not usable to convert checkpoints yet, right? Since DSv3.2 support is currently missing in transformers (huggingface/transformers#41251 has not been merged yet), the original checkpoint cannot be loaded. Or do I have to patch the checkpoint in some way to make it work? (E.g., set model_type: "deepseek_v3" in the checkpoint's config.json?)

@yuzhongw-nvidia

Copy link
Copy Markdown
Contributor Author

Hey, just for my understanding, this PR is not usable to convert checkpoints yet, right? Since DSv3.2 support is currently missing in transformers (huggingface/transformers#41251 has not been merged yet), the original checkpoint cannot be loaded. Or do I have to patch the checkpoint in some way to make it work? (E.g., set model_type: "deepseek_v3" in the checkpoint's config.json?)

I'm not sure, but @yaoyu-33 told me that megatron-bridge can do the conversion by straightforwardly loading the weights from the safetensors files now, without constructing an HF model object. Hi @yaoyu-33, please correct me if I misunderstood, thanks.

@janEbert

janEbert commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Hey, what you said seems to be correct! However, while the HF model is never instantiated, the script does look for a model class to be defined, so some work is required.

Therefore I fixed the original checkpoint like this to load it:

# "HuggingFace" checkpoint directory from, e.g.,
# https://huggingface.co/deepseek-ai/DeepSeek-V3.2
hf_cp_dir=/path/to/DeepSeek-V3.2  # TODO change this

# Add dummy configuration code
cat <<EOF > "$hf_cp_dir"/configuration_deepseek.py
# coding=utf-8

from transformers.models.deepseek_v3 import DeepseekV3Config

DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}

class DeepseekV32Config(DeepseekV3Config):
    model_type = "deepseek_v32"
    index_head_dim = 128
    index_n_heads = 64
    index_topk = 2048

__all__ = ["DeepseekV32Config"]
EOF

# Add dummy modeling code
cat <<EOF > "$hf_cp_dir"/modeling_deepseek.py
# coding=utf-8

from transformers.models.deepseek_v3 import DeepseekV3ForCausalLM, DeepseekV3Model

class DeepseekV32Model(DeepseekV3Model):
    pass

class DeepseekV32ForCausalLM(DeepseekV3ForCausalLM):
    pass

__all__ = ["DeepseekV32Model", "DeepseekV32ForCausalLM"]
EOF

# Add AutoModel resolution
sed -i '/"architectures":/i\  "auto_map": {\
    "AutoConfig": "configuration_deepseek.DeepseekV32Config",\
    "AutoModel": "modeling_deepseek.DeepseekV32Model",\
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV32ForCausalLM"\
  },' "$hf_cp_dir"/config.json

And a small MBridge patch due to MCore upstream updates:

mbridge_dir=/path/to/Megatron-Bridge  # TODO change this

sed -i 's/== ModelType\.encoder_and_decoder/!= ModelType.encoder_or_decoder/
s/!= ModelType\.encoder_and_decoder/== ModelType.encoder_or_decoder/' \
    "$mbridge_dir"/src/megatron/bridge/models/model_provider.py

How I submitted:

env PYTHONPATH="$mbridge_dir"/src:"$PYTHONPATH" python \
    -u \
    "$mbridge_dir"/examples/conversion/convert_checkpoints.py \
    import \
    --hf-model "$hf_cp_dir" \
    --megatron-path "$hf_cp_dir"_mlm \
    --trust-remote-code

With these patches, I managed to convert the original DSv3.2 checkpoint with your commit df6f50a. :)
Haven't tried to load it yet, but there was no warning or error during the conversion.

@yaoyu-33 yaoyu-33 added area:model Model implementations and HF bridge logic feature New capabilities, enhancements, or enablement work waiting-on-customer Waiting on the original author to respond labels May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:model Model implementations and HF bridge logic feature New capabilities, enhancements, or enablement work waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants