Minor improvements for Dynamic-cp#4226
Conversation
|
/ok to test b169e13 |
|
/claude strict-review |
| if _pad_n > 0: | ||
| qkv = qkv[:, :_orig_seq, :] |
There was a problem hiding this comment.
[CRITICAL Implementation] Related to the same issue above — this output stripping also needs to be guarded against the case where cu_seqlens_q is None and _pad_n > 0. Without the guard on lines 458-461, this line is unreachable (the code would have already crashed), but once the guard is applied above, you need to ensure the stripping only happens when padding was actually applied.
Consider:
if _pad_n > 0 and cu_seqlens_q is not None:
qkv = qkv[:, :_orig_seq, :]Or, more cleanly, track whether padding was actually applied (since without cu_seqlens the padding itself is still harmless—only the cu_seqlens mutation is the problem). If you keep the padding unconditionally, this stripping is still needed. But if you guard the entire padding block on cu_seqlens_q is not None, then this stripping should be guarded the same way.
| keys_to_keep = {'original_seq_len', 'padded_seq_len'} | ||
| if is_first_pp or mtp_on_this_pp: | ||
| keys_to_keep.update(['tokens', 'position_ids']) | ||
| if is_last_pp or mtp_on_this_pp: | ||
| keys_to_keep.update(['labels', 'loss_mask']) | ||
| for sample in batch: | ||
| for key in list(sample.keys()): | ||
| if key not in keys_to_keep: | ||
| del sample[key] |
There was a problem hiding this comment.
[SUGGESTION Simplification] The stripping logic is correct and well-structured. One minor robustness note: this assumes the only keys in the raw (unpacked) samples are {tokens, labels, loss_mask, position_ids, original_seq_len, padded_seq_len}. If a custom dataset adds extra metadata keys beyond these (e.g., a sample_id or language_tag), they would be silently dropped.
This is unlikely to be a problem in practice since _unpack_batch only produces those six keys, but a brief inline comment noting the assumption (that only these keys exist post-unpack) would help future maintainers.
Code Review SummaryPR: Minor improvements for Dynamic-cp — Adds dynamic CP support for GDN and MTP, adapts data scheduling to the new per-stage data_iterator policy from PR #3564. Findings
Critical
What looks good
Risk AssessmentMedium risk due to the critical conv1d padding bug. The data scheduling changes are well-structured but affect a complex subsystem (dynamic CP × VPP × MTP × per-stage data routing), so thorough integration testing with multi-stage pipelines, VPP, and MTP is recommended. The GDN and MTP dynamic CP changes are straightforward |
b169e13 to
b850cee
Compare
|
/ok to test b850cee |
|
/ok to test b642842 |
b642842 to
570ff74
Compare
|
/ok to test 570ff74 |
570ff74 to
daf2e0a
Compare
|
/ok to test daf2e0a |
|
/ok to test daf2e0a |
daf2e0a to
0bede0d
Compare
|
/ok to test 0bede0d |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27022438250 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27066685670 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27112839305 |
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: tailaim <tailaim@nvidia.com>
When token counts differ across ranks (e.g. Dynamic CP) or microbatches, locally dividing the loss by num_tokens before all-reducing produces a biased per-token loss in the logs. Switch MTPLossLoggingHelper to store raw loss sums and token counts, perform a single all-reduce over the packed [sums, counts] tensor, then compute sum/sum afterwards. This yields the correct weighted-average per-token loss regardless of per-rank token-count imbalance. Drop the 1/num_microbatches scaling in training.py since the tracker already returns the per-token loss aggregated across all ranks and microbatches; no further scaling is needed. Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
Signed-off-by: xiaoyao0115 <1804647152@qq.com>
ccd324b to
3de8fbe
Compare
|
/ok to test 3de8fbe |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27135364923 |
What does this PR do ?
Dynamic CP support for GDN and MTP, data scheduling adaptations
Summary
broadcast_to_pp_groupis removed and per-stage data field stripping is addedChanges
1. GDN Dynamic CP support (
gated_delta_net.py)tensor_a2a_cp2hp/tensor_a2a_hp2cp,get_parameter_local_cp, conv1d groups, QKV split dimensions, etc.) now use the dynamic CP group resolved frompacked_seq_paramscausal_conv1dTriton kernel includesNB = cdiv(total_tokens, 1024)in its autotune key. Dynamic CP causestotal_tokensto vary per microbatch, triggering repeated Triton autotuning (hundreds of ms each). Input is now padded to_CONV_PAD_ALIGNMENT = 4096boundaries to collapseNBinto far fewer buckets. Implementation: zero-pad the input tensor along the sequence dimension, adjustcu_seqlens[-1]to include padding, run conv1d, then strip padding from the output2. MTP dynamic CP group fix (
gpt_model.py,multi_token_prediction.py)MultiTokenPredictionLayer._roll_input_ids_and_position_ids:roll_tensornow usesresolve_cp_group()to resolve the dynamic CP group frompacked_seq_params, instead of the staticself.cp_groupGPTModel.forward:process_mtp_lossnow passes the dynamic CP group resolved frompacked_seq_paramsinstead of the staticself.pg_collection.cp3. Data scheduling adaptations (
data_schedule.py,data_schedule_utils.py)Upstream PR #3564 (
Fix quantize.py script and support packed sequences in pretrain_gpt.py) changedis_dataset_built_on_rankandget_batchinpretrain_gpt.pyso that in packed-sequence (SFT) mode, all PP stages (including middle ones) build datasets and receive data_iterators. Middle stages get aPackedSeqParamswithcu_seqlens/max_seqlendirectly from their own iterator. Previously only first/last PP stages had data_iterators, and middle stages relied onbroadcast_to_pp_groupto receive metadata from the first PP stage.Adaptations:
broadcast_to_pp_group: Middle PP stages now participate in scheduling and have their own data_iterators, making the PP-group metadata broadcast unnecessary. The function (~95 lines) and itsnumpydependency are removedtokens/position_ids, last PP keeps onlylabels/loss_mask, MTP stages keep all four. Middle stages only carry metadata (cu_seqlens,max_seqlen, etc.) and do not participate in data tensor all-to-allContribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.