Support DeepSeek-v3.2#1421
Conversation
|
/ok to test 066adea |
|
@yuzhongw-nvidia can you run " pre-commit run --all-files" |
|
/ok to test 4f0d855 |
|
pending transformers merging on HF side. |
@yaoyu-33 can you merge this pr now? and can GLM-5 be also implemented cause they share the same arch? |
4f0d855 to
8653312
Compare
📝 WalkthroughWalkthroughThis PR introduces DeepSeek-V3.2 model conversion support in Megatron, adding a new bridge implementation with provider customization for sparse attention indexing, mixture-of-experts settings, multi-latent attention, and comprehensive weight mapping logic. Functional tests validate the conversion pipeline across multi-GPU environments. Changes
Sequence DiagramsequenceDiagram
participant HF as HuggingFace<br/>DeepSeek-V3.2
participant Bridge as DeepSeekV32Bridge
participant Config as Config &<br/>Mappings
participant Megatron as Megatron<br/>Model
participant Weights as Weight<br/>Converter
HF->>Bridge: Load HF model config & weights
Bridge->>Config: Query get_common_configs()
Config-->>Bridge: Return DSA indexer & MoE settings
Bridge->>Config: Query get_common_mapping_list()
Config-->>Bridge: Return sparse attention & router mappings
Bridge->>Bridge: Create provider with TE specs & fusion options
Bridge-->>Megatron: Build GPTModel with DSA & MoE
Bridge->>Weights: Convert HF weights via mapping
Weights->>HF: Apply parameter mappings (indexer, router bias)
Weights->>Weights: Inject rotary embedding inv_freq
Weights-->>Megatron: Load augmented weights
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (2)
src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py (1)
102-109: Remove duplicateexpert_biasmapping append.
get_common_mapping_list()already includesdecoder.layers.*.mlp.router.expert_bias -> model.layers.*.mlp.gate.e_score_correction_bias, so this append is redundant.Proposed fix
def mapping_registry(self) -> MegatronMappingRegistry: mapping_list = get_common_mapping_list() - mapping_list.append( - AutoMapping( - megatron_param="decoder.layers.*.mlp.router.expert_bias", - hf_param="model.layers.*.mlp.gate.e_score_correction_bias", - ) - ) return MegatronMappingRegistry(*mapping_list)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 102 - 109, The mapping_registry method is appending a duplicate AutoMapping already provided by get_common_mapping_list(); remove the redundant mapping append in mapping_registry (the AutoMapping with megatron_param "decoder.layers.*.mlp.router.expert_bias" and hf_param "model.layers.*.mlp.gate.e_score_correction_bias") so that mapping_list only relies on get_common_mapping_list() and does not re-add the same AutoMapping.tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py (1)
91-101: Add category mark and explicit GPU-count gating for parallel cases.This test parametrization includes multi-process variants (
world_size=2). Adding explicit gating improves behavior on under-provisioned runners and clarifies execution requirements.As per coding guidelines "tests//*.py: Use 'pytest.mark' to categorize tests (unit, integration, system)" and "tests//*.py: Document hardware requirements for GPU tests".
Proposed fix
import pytest +import torch from transformers import AutoConfig, AutoTokenizer, DeepseekV32ForCausalLM @@ `@pytest.mark.run_only_on`("GPU") + `@pytest.mark.integration` `@pytest.mark.parametrize`( @@ def test_deepseek_conversion_parallelism(self, deepseek_toy_model_path, tmp_path, tp, pp, ep, test_name): @@ world_size = tp * pp * ep + if torch.cuda.device_count() < world_size: + pytest.skip(f"Requires at least {world_size} GPUs for tp={tp}, pp={pp}, ep={ep}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` around lines 91 - 101, Add a test category mark (e.g., `@pytest.mark.integration` or `@pytest.mark.system`) to the test_deepseek_conversion_parallelism test and guard multi-process cases by checking available GPUs: compute world_size = tp * pp * ep inside test_deepseek_conversion_parallelism and if world_size > 1 and torch.cuda.device_count() < world_size call pytest.skip(...) with a clear message; keep the existing `@pytest.mark.run_only_on`("GPU") and parametrization but add the new mark and the runtime GPU-count check to avoid running parallel variants on under-provisioned runners.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/deepseek/common.py`:
- Around line 76-80: The code assumes hf_config.architectures exists; change the
check to safely handle missing or None architectures by reading architectures =
getattr(hf_config, "architectures", None) (or using
hf_config.__dict__.get("architectures")) and only perform the
"DeepseekV32ForCausalLM" membership test if architectures is a non-empty
iterable (e.g., isinstance(architectures, (list, tuple)) or
bool(architectures)); then set configs["experimental_attention_variant"],
configs["dsa_indexer_head_dim"], configs["dsa_indexer_n_heads"], and
configs["dsa_indexer_topk"] exactly as before when the membership test passes.
Use the existing symbols hf_config and configs to locate where to apply this
guard.
In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`:
- Around line 60-64: Replace the hardcoded DSA indexer constants by reading them
from the loaded HuggingFace model config object and assigning those values to
provider.dsa_indexer_head_dim, provider.dsa_indexer_n_heads and
provider.dsa_indexer_topk (and keep provider.dsa_indexer_loss_coeff if it is not
configurable); e.g., pull the values from the HF config variable used when
loading the model (e.g., config or hf_config) and fall back to sensible defaults
only if the config fields are missing, so the conversion uses the actual model
settings rather than fixed numbers.
- Around line 145-159: The code builds inv_freq (stored on
self._deepseek_inv_freq) and injects it into
converted_weights_dict[inv_freq_key] without verifying it matches the expected
tensor shape from the source; update the block that computes and assigns
inv_freq to first obtain a reference tensor (use
next(iter(converted_weights_dict.values())) or
converted_weights_dict[inv_freq_key] if present) and validate that
inv_freq.shape equals the reference tensor.shape (or the expected shape derived
from hf_config.qk_rope_head_dim: torch.arange(0, rotary_dim, 2).shape). If
shapes mismatch, either reshape/expand/convert inv_freq to match the reference
shape or raise/log a clear error and do not write the incompatible tensor; keep
all references to _deepseek_inv_freq, inv_freq, inv_freq_key,
converted_weights_dict, and hf_config.qk_rope_head_dim so the check is applied
before assigning converted_weights_dict[inv_freq_key] = inv_freq.
In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`:
- Around line 79-80: The test contains a duplicate call to
model.save_pretrained(model_dir, safe_serialization=True) (two identical
invocations); remove the redundant second call so the model is serialized only
once, keeping the remaining call to model.save_pretrained with
safe_serialization=True (reference: the save_pretrained calls on the model in
test_deepseek_v32_conversion.py).
- Line 171: The test asserts the wrong config key name: update the assertion in
test_deepseek_v32_conversion (the saved dict check) to use "index_topk" ->
"index_topk" is incorrect; change saved["index_top_k"] to saved["index_topk"] so
the test matches the conversion/config pathway that uses index_topk.
- Line 63: Guard the deletion of the fixture attribute so the fixture setup
doesn't raise when the attribute is missing: before removing
config.quantization_config in test_deepseek_v32_conversion.py, check whether the
attribute exists (e.g., with hasattr(config, "quantization_config")) or perform
the deletion inside a try/except catching AttributeError, and only remove it
when present to avoid breaking fixture creation.
- Around line 75-76: Replace the broad "except Exception: pass" in the tokenizer
setup block with targeted error handling: either catch specific expected
exceptions (e.g., OSError, FileNotFoundError, ValueError) or log the exception
and call pytest.skip(reason) or re-raise to fail the test; specifically update
the try/except surrounding the tokenizer initialization (the block that
currently uses "except Exception: pass") so failures are not silently swallowed
but reported or skipped with a clear message.
---
Nitpick comments:
In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py`:
- Around line 102-109: The mapping_registry method is appending a duplicate
AutoMapping already provided by get_common_mapping_list(); remove the redundant
mapping append in mapping_registry (the AutoMapping with megatron_param
"decoder.layers.*.mlp.router.expert_bias" and hf_param
"model.layers.*.mlp.gate.e_score_correction_bias") so that mapping_list only
relies on get_common_mapping_list() and does not re-add the same AutoMapping.
In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`:
- Around line 91-101: Add a test category mark (e.g., `@pytest.mark.integration`
or `@pytest.mark.system`) to the test_deepseek_conversion_parallelism test and
guard multi-process cases by checking available GPUs: compute world_size = tp *
pp * ep inside test_deepseek_conversion_parallelism and if world_size > 1 and
torch.cuda.device_count() < world_size call pytest.skip(...) with a clear
message; keep the existing `@pytest.mark.run_only_on`("GPU") and parametrization
but add the new mark and the runtime GPU-count check to avoid running parallel
variants on under-provisioned runners.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/megatron/bridge/models/deepseek/__init__.pysrc/megatron/bridge/models/deepseek/common.pysrc/megatron/bridge/models/deepseek/deepseek_v32_bridge.pytests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py
| if "DeepseekV32ForCausalLM" in hf_config.architectures: | ||
| configs["experimental_attention_variant"] = "dsa" | ||
| configs["dsa_indexer_head_dim"] = hf_config.index_head_dim | ||
| configs["dsa_indexer_n_heads"] = hf_config.index_n_heads | ||
| configs["dsa_indexer_topk"] = hf_config.index_topk |
There was a problem hiding this comment.
Handle missing architectures safely in V3.2 detection.
Line 76 assumes hf_config.architectures is always present and iterable. If it is missing or None, conversion fails before config translation completes.
Proposed fix
- if "DeepseekV32ForCausalLM" in hf_config.architectures:
+ architectures = getattr(hf_config, "architectures", None) or []
+ if "DeepseekV32ForCausalLM" in architectures:
configs["experimental_attention_variant"] = "dsa"
configs["dsa_indexer_head_dim"] = hf_config.index_head_dim
configs["dsa_indexer_n_heads"] = hf_config.index_n_heads
configs["dsa_indexer_topk"] = hf_config.index_topk🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/deepseek/common.py` around lines 76 - 80, The code
assumes hf_config.architectures exists; change the check to safely handle
missing or None architectures by reading architectures = getattr(hf_config,
"architectures", None) (or using hf_config.__dict__.get("architectures")) and
only perform the "DeepseekV32ForCausalLM" membership test if architectures is a
non-empty iterable (e.g., isinstance(architectures, (list, tuple)) or
bool(architectures)); then set configs["experimental_attention_variant"],
configs["dsa_indexer_head_dim"], configs["dsa_indexer_n_heads"], and
configs["dsa_indexer_topk"] exactly as before when the membership test passes.
Use the existing symbols hf_config and configs to locate where to apply this
guard.
| provider.experimental_attention_variant = "dsa" | ||
| provider.dsa_indexer_head_dim = 128 | ||
| provider.dsa_indexer_n_heads = 64 | ||
| provider.dsa_indexer_topk = 2048 | ||
| provider.dsa_indexer_loss_coeff = 0.001 |
There was a problem hiding this comment.
Derive DSA indexer settings from HF config instead of hardcoding.
Lines 61-63 hardcode indexer dimensions/top-k, which can diverge from the loaded model config and cause conversion mismatches.
Proposed fix
provider.experimental_attention_variant = "dsa"
- provider.dsa_indexer_head_dim = 128
- provider.dsa_indexer_n_heads = 64
- provider.dsa_indexer_topk = 2048
+ provider.dsa_indexer_head_dim = hf_config.index_head_dim
+ provider.dsa_indexer_n_heads = hf_config.index_n_heads
+ provider.dsa_indexer_topk = hf_config.index_topk
provider.dsa_indexer_loss_coeff = 0.001🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 60 -
64, Replace the hardcoded DSA indexer constants by reading them from the loaded
HuggingFace model config object and assigning those values to
provider.dsa_indexer_head_dim, provider.dsa_indexer_n_heads and
provider.dsa_indexer_topk (and keep provider.dsa_indexer_loss_coeff if it is not
configurable); e.g., pull the values from the HF config variable used when
loading the model (e.g., config or hf_config) and fall back to sensible defaults
only if the config fields are missing, so the conversion uses the actual model
settings rather than fixed numbers.
| inv_freq = getattr(self, "_deepseek_inv_freq", None) | ||
| if inv_freq is None: | ||
| rotary_dim = self.hf_config.qk_rope_head_dim | ||
| rotary_base = self.hf_config.rope_theta | ||
| inv_freq = 1.0 / (rotary_base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim)) | ||
| self._deepseek_inv_freq = inv_freq | ||
|
|
||
| if converted_weights_dict: | ||
| reference_tensor = next(iter(converted_weights_dict.values())) | ||
| if inv_freq.device != reference_tensor.device: | ||
| inv_freq = inv_freq.to(device=reference_tensor.device) | ||
| self._deepseek_inv_freq = inv_freq | ||
|
|
||
| converted_weights_dict[inv_freq_key] = inv_freq | ||
| return converted_weights_dict |
There was a problem hiding this comment.
Validate inv_freq tensor shape before injecting converted weights.
Lines 145-159 synthesize inv_freq and write it into converted_weights_dict without checking compatibility with the source tensor shape for that key.
As per coding guidelines "src/megatron/bridge/models/**/*.py: Always validate tensor shapes before copying weights in weight conversion".
Proposed fix
inv_freq = getattr(self, "_deepseek_inv_freq", None)
if inv_freq is None:
rotary_dim = self.hf_config.qk_rope_head_dim
rotary_base = self.hf_config.rope_theta
inv_freq = 1.0 / (rotary_base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
self._deepseek_inv_freq = inv_freq
+
+ source_inv_freq = hf_state_dict.get(inv_freq_key)
+ if source_inv_freq is not None and tuple(source_inv_freq.shape) != tuple(inv_freq.shape):
+ raise ValueError(
+ f"Shape mismatch for {inv_freq_key}: source={tuple(source_inv_freq.shape)} "
+ f"computed={tuple(inv_freq.shape)}"
+ )
if converted_weights_dict:
reference_tensor = next(iter(converted_weights_dict.values()))
if inv_freq.device != reference_tensor.device:
inv_freq = inv_freq.to(device=reference_tensor.device)
self._deepseek_inv_freq = inv_freq🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/deepseek/deepseek_v32_bridge.py` around lines 145
- 159, The code builds inv_freq (stored on self._deepseek_inv_freq) and injects
it into converted_weights_dict[inv_freq_key] without verifying it matches the
expected tensor shape from the source; update the block that computes and
assigns inv_freq to first obtain a reference tensor (use
next(iter(converted_weights_dict.values())) or
converted_weights_dict[inv_freq_key] if present) and validate that
inv_freq.shape equals the reference tensor.shape (or the expected shape derived
from hf_config.qk_rope_head_dim: torch.arange(0, rotary_dim, 2).shape). If
shapes mismatch, either reshape/expand/convert inv_freq to match the reference
shape or raise/log a clear error and do not write the incompatible tensor; keep
all references to _deepseek_inv_freq, inv_freq, inv_freq_key,
converted_weights_dict, and hf_config.qk_rope_head_dim so the check is applied
before assigning converted_weights_dict[inv_freq_key] = inv_freq.
|
|
||
| for key, value in DEEPSEEK_V32_OVERRIDES.items(): | ||
| setattr(config, key, value) | ||
| del config.quantization_config |
There was a problem hiding this comment.
Guard quantization_config removal to avoid fixture crashes.
Line 63 can raise AttributeError when quantization_config is absent, which breaks fixture creation before the test body runs.
Proposed fix
- del config.quantization_config
+ if hasattr(config, "quantization_config"):
+ delattr(config, "quantization_config")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` at
line 63, Guard the deletion of the fixture attribute so the fixture setup
doesn't raise when the attribute is missing: before removing
config.quantization_config in test_deepseek_v32_conversion.py, check whether the
attribute exists (e.g., with hasattr(config, "quantization_config")) or perform
the deletion inside a try/except catching AttributeError, and only remove it
when present to avoid breaking fixture creation.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Avoid broad silent exception swallowing in tokenizer setup.
Lines 75-76 hide all failures, making this functional test harder to debug when environment/setup regressions occur.
Proposed fix
- except Exception:
- pass
+ except OSError as exc:
+ pytest.skip(f"Tokenizer setup failed in test fixture: {exc}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception: | |
| pass | |
| except OSError as exc: | |
| pytest.skip(f"Tokenizer setup failed in test fixture: {exc}") |
🧰 Tools
🪛 Ruff (0.15.2)
[error] 75-76: try-except-pass detected, consider logging the exception
(S110)
[warning] 75-75: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`
around lines 75 - 76, Replace the broad "except Exception: pass" in the
tokenizer setup block with targeted error handling: either catch specific
expected exceptions (e.g., OSError, FileNotFoundError, ValueError) or log the
exception and call pytest.skip(reason) or re-raise to fail the test;
specifically update the try/except surrounding the tokenizer initialization (the
block that currently uses "except Exception: pass") so failures are not silently
swallowed but reported or skipped with a clear message.
| model.save_pretrained(model_dir, safe_serialization=True) | ||
| model.save_pretrained(model_dir, safe_serialization=True) |
There was a problem hiding this comment.
Remove duplicate model serialization call.
Line 80 repeats the exact save_pretrained invocation from Line 79.
Proposed fix
model.save_pretrained(model_dir, safe_serialization=True)
- model.save_pretrained(model_dir, safe_serialization=True)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| model.save_pretrained(model_dir, safe_serialization=True) | |
| model.save_pretrained(model_dir, safe_serialization=True) | |
| model.save_pretrained(model_dir, safe_serialization=True) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py`
around lines 79 - 80, The test contains a duplicate call to
model.save_pretrained(model_dir, safe_serialization=True) (two identical
invocations); remove the redundant second call so the model is serialized only
once, keeping the remaining call to model.save_pretrained with
safe_serialization=True (reference: the save_pretrained calls on the model in
test_deepseek_v32_conversion.py).
| assert saved["moe_intermediate_size"] == 768 | ||
| assert saved["index_head_dim"] == 128 | ||
| assert saved["index_n_heads"] == 16 | ||
| assert saved["index_top_k"] == 2048 |
There was a problem hiding this comment.
Fix the asserted index-topk config key name.
Line 171 asserts index_top_k, but the conversion/config pathway uses index_topk naming. This can create a false-negative assertion.
Proposed fix
- assert saved["index_top_k"] == 2048
+ assert saved["index_topk"] == 2048📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert saved["index_top_k"] == 2048 | |
| assert saved["index_topk"] == 2048 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/deepseek/test_deepseek_v32_conversion.py` at
line 171, The test asserts the wrong config key name: update the assertion in
test_deepseek_v32_conversion (the saved dict check) to use "index_topk" ->
"index_topk" is incorrect; change saved["index_top_k"] to saved["index_topk"] so
the test matches the conversion/config pathway that uses index_topk.
|
Need dependency[main] fix(moe): Fix several bugs of experimental attention variant by yuzhongw-nvidia · Pull Request #3026 · NVIDIA/Megatron-LM and need some more tests |
|
Hey, just for my understanding, this PR is not usable to convert checkpoints yet, right? Since DSv3.2 support is currently missing in |
I'm not sure, but @yaoyu-33 told me that megatron-bridge can do the conversion by straightforwardly loading the weights from the safetensors files now, without constructing an HF model object. Hi @yaoyu-33, please correct me if I misunderstood, thanks. |
|
Hey, what you said seems to be correct! However, while the HF model is never instantiated, the script does look for a model class to be defined, so some work is required. Therefore I fixed the original checkpoint like this to load it: # "HuggingFace" checkpoint directory from, e.g.,
# https://huggingface.co/deepseek-ai/DeepSeek-V3.2
hf_cp_dir=/path/to/DeepSeek-V3.2 # TODO change this
# Add dummy configuration code
cat <<EOF > "$hf_cp_dir"/configuration_deepseek.py
# coding=utf-8
from transformers.models.deepseek_v3 import DeepseekV3Config
DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
class DeepseekV32Config(DeepseekV3Config):
model_type = "deepseek_v32"
index_head_dim = 128
index_n_heads = 64
index_topk = 2048
__all__ = ["DeepseekV32Config"]
EOF
# Add dummy modeling code
cat <<EOF > "$hf_cp_dir"/modeling_deepseek.py
# coding=utf-8
from transformers.models.deepseek_v3 import DeepseekV3ForCausalLM, DeepseekV3Model
class DeepseekV32Model(DeepseekV3Model):
pass
class DeepseekV32ForCausalLM(DeepseekV3ForCausalLM):
pass
__all__ = ["DeepseekV32Model", "DeepseekV32ForCausalLM"]
EOF
# Add AutoModel resolution
sed -i '/"architectures":/i\ "auto_map": {\
"AutoConfig": "configuration_deepseek.DeepseekV32Config",\
"AutoModel": "modeling_deepseek.DeepseekV32Model",\
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV32ForCausalLM"\
},' "$hf_cp_dir"/config.jsonAnd a small MBridge patch due to MCore upstream updates: mbridge_dir=/path/to/Megatron-Bridge # TODO change this
sed -i 's/== ModelType\.encoder_and_decoder/!= ModelType.encoder_or_decoder/
s/!= ModelType\.encoder_and_decoder/== ModelType.encoder_or_decoder/' \
"$mbridge_dir"/src/megatron/bridge/models/model_provider.pyHow I submitted: env PYTHONPATH="$mbridge_dir"/src:"$PYTHONPATH" python \
-u \
"$mbridge_dir"/examples/conversion/convert_checkpoints.py \
import \
--hf-model "$hf_cp_dir" \
--megatron-path "$hf_cp_dir"_mlm \
--trust-remote-codeWith these patches, I managed to convert the original DSv3.2 checkpoint with your commit df6f50a. :) |
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Tests