[Dev] feat: Dynamic CP (part 2) by xiaoyao0115 · Pull Request #2000 · NVIDIA/Megatron-LM

xiaoyao0115 · 2025-10-28T08:57:13Z

This PR is the second part of hybrid-cp. The first part is #2054. (PR for the main branch: #2304)

This PR provides end-to-end Dynamic CP support, including PP and VPP.

Major changes

Added default_dynamic_cp scheduler. Dynamic CP now only requires specifying --sequence-packing-scheduler default_dynamic_cp (auto-set when --dynamic-context-parallel is enabled). The previous standalone dynamic_context_parallel_forward_backward function and DynamicCPDataLoaderWrapper are removed — Dynamic CP now goes through the standard pipeline schedule, making it compatible with PP / VPP / TP out of the box.
Added Mamba and MTP (Multi-Token Prediction) support. Both MambaMixer and the MTP module now dynamically switch the CP group per micro-batch via packed_seq_params.cp_group, and restore the original group after each forward pass.
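
The switch-and-restore behaviour described above can be sketched as a context manager. This is a minimal illustration and not the PR's implementation: the helper name use_cp_group and the module.cp_group attribute are assumptions made for the example, and plain strings stand in for real process groups. Only packed_seq_params.cp_group comes from the description.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Any, Iterator, Optional


@contextmanager
def use_cp_group(module: Any, packed_seq_params: Any) -> Iterator[None]:
    """Temporarily point module.cp_group at the micro-batch's CP group (hypothetical helper)."""
    original_group = module.cp_group
    dynamic_group: Optional[Any] = getattr(packed_seq_params, "cp_group", None)
    if dynamic_group is not None:
        module.cp_group = dynamic_group
    try:
        yield
    finally:
        # Restore the original group even if the forward pass raises.
        module.cp_group = original_group


# Stand-ins for a real module and real process groups.
mixer = SimpleNamespace(cp_group="static_cp16")
params = SimpleNamespace(cp_group="dynamic_cp4")

with use_cp_group(mixer, params):
    group_during_forward = mixer.cp_group  # the per-micro-batch group
group_after_forward = mixer.cp_group  # the original group again
```

Putting the restore in a finally clause means a failing forward pass cannot leave the module pointing at a stale per-micro-batch group.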

Other changes

Added --min-dynamic-context-parallel-size argument to control the minimum CP group size (default 1).
Removed DynamicCPMegatronPretrainingSampler; Dynamic CP now uses the standard MegatronPretrainingSampler.
Removed get_batch_on_this_dynamic_cp_rank utility; all CP slicing goes through the THD packed path.
Moved scheduling algorithms (next_hdp_group, dcp_gpus_needed, dcp_make_buckets_equal, etc.) from the deleted dynamic_cp_schedule.py into data_schedule_utils.py as standalone functions.
broadcast_to_pp_group and create_data_iterator now propagate local_cp_size for PP / VPP stages.
Fixed TEDotProductAttention to lazily initialize cp_stream on dynamic CP path.
Fixed attention.py to restore pg_collection.cp after dynamic CP forward.
Properly clean up _DYNAMIC_DP_CP_GROUPS in destroy_model_parallel.
Added functional test gpt3_mcore_te_tp2_pp1_cp4_dcp and extended unit tests with DCP parameter combinations.

Convergence and performance

Convergence has been verified on Qwen3-30B-A3B on 32 GPUs, with max_seqlen set to 49152 and max_seqlen_per_dp_cp_rank set to 3072. In the figure below, bshd refers to running with CP=16, where sequences are padded to max_seqlen and executed in the same bshd format as in pretraining. thd-packing refers to using CP=16 while packing variable-length sequences. In dynamic-cp, the maximum CP group size is also 16.

Known limitations

Dynamic CP group sizes are limited to powers of 2.
CUDA Graphs are not supported.
Works best with FlashAttention; cuDNN FusedAttention recompiles on every shape change, negating performance gains.
FSDP + PP is not yet supported.
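
To make the power-of-two restriction concrete, here is a small sketch of how a CP group size could be derived from a sequence length. The function cp_size_for is hypothetical and is not the PR's dcp_gpus_needed; it assumes the size is the smallest power of two that keeps every rank at or below max_seq_len_per_rank tokens, and that min_cp_size is itself a power of two.

```python
def cp_size_for(
    seq_len: int,
    max_seq_len_per_rank: int,
    min_cp_size: int = 1,
    max_cp_size: int = 16,
) -> int:
    """Smallest power-of-two CP size that fits seq_len (illustrative assumption)."""
    if seq_len <= 0 or max_seq_len_per_rank <= 0:
        raise ValueError("lengths must be positive")
    if min_cp_size < 1 or min_cp_size & (min_cp_size - 1):
        raise ValueError("min_cp_size must be a power of two")
    ranks_needed = -(-seq_len // max_seq_len_per_rank)  # ceiling division
    cp_size = min_cp_size
    while cp_size < ranks_needed:
        cp_size *= 2
    if cp_size > max_cp_size:
        raise ValueError("sequence does not fit in the largest CP group")
    return cp_size


# Settings from the convergence run: max_seqlen 49152, 3072 tokens per rank.
longest = cp_size_for(49152, 3072)   # 16 ranks
short = cp_size_for(3072, 3072)      # 1 rank
rounded = cp_size_for(13000, 3072)   # 5 ranks needed, rounded up to 8
```

With the settings from the convergence run, the longest sequence maps to a group of 16, which matches the stated maximum CP group size.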

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2025-10-28T08:57:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yanring · 2025-11-07T09:26:07Z

Is there any difference between this and #2054?

kunlunl · 2025-11-07T10:27:02Z

> Is there any difference between this and #2054?

This is the second MR, we need to merge 2054 first, and then this 2000 (The reason the second MR is 2000, while the first one is 2054 (>2000), is because they were migrated from GitLab at different times)

yanring · 2025-11-10T06:22:46Z

> > Is there any difference between this and #2054?
>
> This is the second MR, we need to merge 2054 first, and then this 2000 (The reason the second MR is 2000, while the first one is 2054 (>2000), is because they were migrated from GitLab at different times)

Got it, thanks! Could you please update the title to reflect this?

kunlunl · 2025-12-01T12:13:19Z

/ok to test e0c90c5

xiaoyao0115 · 2026-04-01T08:37:17Z

/ok to test b4c4fe6

yuzhongw-nvidia · 2026-04-03T09:51:31Z

/ok to test b649d0b

yuzhongw-nvidia · 2026-04-03T10:42:51Z

/ok to test f632053

yuzhongw-nvidia · 2026-04-03T11:34:22Z

/ok to test 4319109

yuzhongw-nvidia · 2026-04-06T03:24:11Z

/ok to test ec6b5f4

yuzhongw-nvidia · 2026-04-06T04:14:56Z

/ok to test 0cbcc94

yuzhongw-nvidia · 2026-04-06T09:35:40Z

/ok to test 49f9e79

xiaoyao0115 · 2026-04-07T02:34:02Z

/ok to test 45b1232

Signed-off-by: tailaim <tailaim@nvidia.com>

Co-authored-by: Yuzhong Wang <yuzhongw@nvidia.com> Update model_config.yaml Update model_config.yaml

yuzhongw-nvidia · 2026-04-07T07:18:17Z

/ok to test aa49993

svcnvidia-nemo-ci · 2026-04-07T10:07:51Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24075960736

Victarry

Thanks for the great work on Dynamic CP Part 2 — the design of routing dynamic CP through the standard PP/VPP schedule is clean and well-motivated.

Left a mix of comments across CRITICAL / IMPORTANT / SUGGESTION levels. The most actionable ones are:

fill_empty negative-index wrap-around (line 921) — condition order bug in VPP alignment
fill_empty_gpus assertion inverted (line 811) — allows overwriting non-empty GPU slots
[[]] * N shared mutable lists (line 818) — classic Python pitfall, latent data corruption
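
The third item is easy to reproduce in isolation. The sketch below is independent of the PR's code: total_gpus and the sample label are placeholders chosen for illustration.

```python
total_gpus = 3

# Replicating a list containing one inner list copies the reference, not the list.
shared = [[]] * total_gpus
shared[0].append("sample_0")

# A comprehension builds a fresh inner list per slot.
independent = [[] for _ in range(total_gpus)]
independent[0].append("sample_0")

all_slots_alias = all(slot is shared[0] for slot in shared)  # every slot is the same object
```

After the append, every slot of shared contains the sample, while only the first slot of independent does.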

⚠️ Disclosure: Part of this review was assisted by AI (Claude). Please double-check the findings independently — especially the CRITICAL ones — and let me know if any are inaccurate or based on incorrect assumptions.

Victarry · 2026-04-07T15:06:37Z

+        This function recursively forms groups of sub-samples such that all DPxCP ranks
+        have a roughly balanced workload in the group.
+        """
+        mslpr = self.max_seq_len_per_rank


The name mslpr is confusing here. Just use the full name.

Victarry · 2026-04-07T15:10:34Z

+def dcp_make_buckets_equal(
+    sample_seqlens: List[Tuple[int, int]],
+    compute_estimator: Callable,
+    max_seq_len_per_rank: int,
+    min_cp_size: int = 1,
+) -> List[deque]:
+    """Split samples into buckets of roughly equal work, one per unique CP size."""
+    seqlens = [seq_len for _, seq_len in sample_seqlens]
+    k = len({dcp_gpus_needed(L, max_seq_len_per_rank, min_cp_size) for L in seqlens})
+
+    work = []
+    for _, s in sample_seqlens:
+        cp_size = dcp_gpus_needed(s, max_seq_len_per_rank, min_cp_size)
+        work.append(compute_estimator(s, cp_size))
+    total_work = sum(work)
+    target = total_work / k
+    buckets, cur, cur_work = [], [], 0.0
+    remaining_k = k
+
+    for i, (sample_id, seq_len) in enumerate(sample_seqlens):
+        w = compute_estimator(seq_len)
+        projected = cur_work + w
+        if cur and (
+            projected > target * 1.1 or len(sample_seqlens) - i <= remaining_k - len(buckets)
+        ):
+            buckets.append(deque(cur))
+            cur, cur_work = [], 0.0
+            remaining_k -= 1
+        cur.append((sample_id, seq_len))
+        cur_work += w
+
+    if cur:
+        buckets.append(deque(cur))
+    return buckets


The readability of this function could be improved. Example code, refactored with GPT:

def dcp_make_buckets_equal(
    sample_seqlens: List[Tuple[int, int]],
    compute_estimator: Callable,
    max_seq_len_per_rank: int,
    min_cp_size: int = 1,
    bucket_overfill_factor: float = 1.1,
) -> List[deque]:
    """Split samples into buckets of roughly equal work, one per unique CP size.

    Args:
        sample_seqlens (List[Tuple[int, int]]): (sample_id, seq_len) tuples.
        compute_estimator (Callable[[int, Optional[int]], float]): workload estimator function.
        max_seq_len_per_rank (int): max tokens per rank for packing.
        min_cp_size (int, optional): minimum CP size for dynamic CP.
        bucket_overfill_factor (float, optional): tolerance multiplier on the per-bucket target.
    """
    seqlens = [seq_len for _, seq_len in sample_seqlens]
    unique_cp_sizes = {
        dcp_gpus_needed(seq_len, max_seq_len_per_rank, min_cp_size) for seq_len in seqlens
    }
    target_bucket_count = len(unique_cp_sizes)

    per_sample_work = []
    for _, seq_len in sample_seqlens:
        cp_size = dcp_gpus_needed(seq_len, max_seq_len_per_rank, min_cp_size)
        per_sample_work.append(compute_estimator(seq_len, cp_size))
    total_work = sum(per_sample_work)
    target_work_per_bucket = total_work / target_bucket_count
    buckets, current_bucket, current_bucket_work = [], [], 0.0
    remaining_target_buckets = target_bucket_count

    for sample_idx, (sample_id, seq_len) in enumerate(sample_seqlens):
        sample_work = compute_estimator(seq_len)
        projected_bucket_work = current_bucket_work + sample_work
        need_reserve_buckets_for_remaining_samples = len(sample_seqlens) - sample_idx <= (
            remaining_target_buckets - len(buckets)
        )
        exceeds_bucket_work_target = projected_bucket_work > (
            target_work_per_bucket * bucket_overfill_factor
        )
        should_close_current_bucket = bool(current_bucket) and (
            exceeds_bucket_work_target or need_reserve_buckets_for_remaining_samples
        )
        if should_close_current_bucket:
            buckets.append(deque(current_bucket))
            current_bucket, current_bucket_work = [], 0.0
            remaining_target_buckets -= 1
        current_bucket.append((sample_id, seq_len))
        current_bucket_work += sample_work

    if current_bucket:
        buckets.append(deque(current_bucket))
    return buckets

Some conventions for better readability:

Use clear, meaningful names.

Simplify complex conditions, for example by naming each sub-condition.

Avoid hardcoded magic numbers such as the 1.1 here.

Use consistent naming. In this PR, workload should be preferred over work/compute.

Victarry · 2026-04-07T15:12:26Z

+    max_seq_len_per_rank: int,
+    min_cp_size: int = 1,
+) -> List[deque]:
+    """Split samples into buckets of roughly equal work, one per unique CP size."""


It seems this function doesn't guarantee that the number of buckets equals the number of unique CP sizes?
If so, the docstring should be updated.
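
One way to check this concern is to run the same greedy loop on a small input. The sketch below copies the loop structure from the quoted diff but makes two simplifying assumptions: gpus_needed is a hypothetical stand-in for dcp_gpus_needed (smallest power-of-two rank count), and the workload estimator takes only the sequence length.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Sample = Tuple[int, int]  # (sample_id, seq_len)


def gpus_needed(seq_len: int, max_seq_len_per_rank: int, min_cp_size: int = 1) -> int:
    """Hypothetical stand-in for dcp_gpus_needed: power-of-two rank count."""
    ranks = max(min_cp_size, -(-seq_len // max_seq_len_per_rank))  # ceiling division
    size = 1
    while size < ranks:
        size *= 2
    return size


def make_buckets(
    samples: List[Sample],
    workload_fn: Callable[[int], float],
    max_seq_len_per_rank: int,
    min_cp_size: int = 1,
    overfill_factor: float = 1.1,
) -> Tuple[int, List[Deque[Sample]]]:
    """Greedy loop from the quoted diff; returns (unique CP size count, buckets)."""
    k = len({gpus_needed(s, max_seq_len_per_rank, min_cp_size) for _, s in samples})
    target = sum(workload_fn(s) for _, s in samples) / k
    buckets: List[Deque[Sample]] = []
    cur: List[Sample] = []
    cur_work = 0.0
    remaining_k = k
    for i, (sample_id, seq_len) in enumerate(samples):
        w = workload_fn(seq_len)
        if cur and (
            cur_work + w > target * overfill_factor
            or len(samples) - i <= remaining_k - len(buckets)
        ):
            buckets.append(deque(cur))
            cur, cur_work = [], 0.0
            remaining_k -= 1
        cur.append((sample_id, seq_len))
        cur_work += w
    if cur:
        buckets.append(deque(cur))
    return k, buckets


# Four equal-length samples: a single unique CP size, yet two buckets come back.
uniform = [(i, 4096) for i in range(4)]
k_uniform, buckets_uniform = make_buckets(uniform, float, max_seq_len_per_rank=1024)

# One long and one short sample: two unique CP sizes and two buckets.
mixed = [(0, 8192), (1, 1024)]
k_mixed, buckets_mixed = make_buckets(mixed, float, max_seq_len_per_rank=1024)
```

In the uniform case the reserve condition fires on the last sample (one sample left, one bucket still expected), so the loop closes a bucket even though only one was targeted. Under these assumptions the bucket count is a target rather than a guarantee.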

Victarry · 2026-04-08T05:06:50Z

+            while i >= 0:
+                sid0 = sample_id_group[i][0]
+                cp_size = 0
+                while sid0 in sample_id_group[i] and i >= 0:


[CRITICAL Correctness] fill_empty inner while loop condition order causes Python negative-index wrap-around.

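When i = 0, the loop evaluates sample_id_group[0] (true), then decrements i to -1. On the next iteration, Python evaluates sample_id_group[-1] (wrapping to the last element) before checking i >= 0. Because the and operator still evaluates i >= 0 next, the loop exits there and cp_size is not changed by the extra access, but the condition reads an unrelated element and only works because negative indices happen to be valid. Checking the bound first avoids the out-of-range access in the VPP alignment path (align_sample_id_groups).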

Suggestion:

# Before:
while sid0 in sample_id_group[i] and i >= 0:

# After (short-circuit prevents negative index access):
while i >= 0 and sid0 in sample_id_group[i]:
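
The effect of the operand order can be observed with a list that records which indices are read. This is a sketch under assumptions: the loop body (increment cp_size, decrement i) is guessed from the quoted snippet, and TracingList and the two function names are hypothetical helpers for the demonstration.

```python
from typing import List


class TracingList(list):
    """List that records every index passed to __getitem__ (demo helper)."""

    def __init__(self, *args):
        super().__init__(*args)
        self.accessed: List[int] = []

    def __getitem__(self, index):
        self.accessed.append(index)
        return super().__getitem__(index)


def run_length_original(groups: TracingList, start: int) -> int:
    """Membership test first, bound check second (order in the diff)."""
    sid0 = groups[start][0]
    i, cp_size = start, 0
    while sid0 in groups[i] and i >= 0:
        cp_size += 1
        i -= 1
    return cp_size


def run_length_reordered(groups: TracingList, start: int) -> int:
    """Bound check first, so index -1 is never read."""
    sid0 = groups[start][0]
    i, cp_size = start, 0
    while i >= 0 and sid0 in groups[i]:
        cp_size += 1
        i -= 1
    return cp_size


traced_original = TracingList([[7], [7], [7]])
traced_reordered = TracingList([[7], [7], [7]])
original_len = run_length_original(traced_original, 2)
reordered_len = run_length_reordered(traced_reordered, 2)
```

In this sketch both orders return the same count, because the bound check still ends the loop; the difference is that the original order reads index -1 once before stopping.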

Victarry · 2026-04-08T05:07:19Z

+                assert not all(
+                    work for work in micro_batches[empty_gpu : empty_gpu + needed_count]
+                ), "Empty GPUs were detected but not enough to expand."


[CRITICAL Correctness] Assertion logic is inverted — allows overwriting non-empty GPU slots.

not all(work for ...) passes when at least one slot is empty, but the intent is "all slots in this range must be empty". With the current logic, if only 1 of needed_count slots is empty and the others contain real data, the assertion passes and the expansion overwrites existing assignments silently.

Suggestion:

# Before:
assert not all(
    work for work in micro_batches[empty_gpu : empty_gpu + needed_count]
), "Empty GPUs were detected but not enough to expand."

# After:
assert all(
    not work for work in micro_batches[empty_gpu : empty_gpu + needed_count]
), "Empty GPUs were detected but not enough contiguous empty slots to expand."
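
The difference between the two predicates shows up on a window that is only partly empty. The slot contents below are made-up (sample_id, seq_len) placeholders and the function names are chosen for this illustration.

```python
from typing import List, Sequence, Tuple

Slot = List[Tuple[int, int]]  # samples assigned to one GPU; empty list means free


def passes_original_check(slots: Sequence[Slot]) -> bool:
    """Original predicate: true as soon as any slot is empty."""
    return not all(work for work in slots)


def passes_fixed_check(slots: Sequence[Slot]) -> bool:
    """Suggested predicate: true only if every slot is empty."""
    return all(not work for work in slots)


micro_batches: List[Slot] = [[], [(3, 2048)], [(5, 1024)], []]
window = micro_batches[0:3]  # one free slot followed by two occupied ones

original_result = passes_original_check(window)  # True: the occupied slots go unnoticed
fixed_result = passes_fixed_check(window)        # False: expansion would be rejected
```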

Victarry · 2026-04-08T05:08:16Z

+        assert (
+            existing_group_sizes
+        ), "There should be at least one group existing, cannot redistribute, "
+        "try to increase 'max-seqlen-per-dp-cp-rank'."


[IMPORTANT Correctness] Assert message is silently truncated — the second string line is a standalone expression, not part of the message.

Without enclosing parentheses, the newline after "... cannot redistribute, " terminates the assert statement. "try to increase 'max-seqlen-per-dp-cp-rank'." on the next line becomes a no-op expression statement.

Suggestion:

assert existing_group_sizes, (
    "There should be at least one group existing, cannot redistribute, "
    "try to increase 'max-seqlen-per-dp-cp-rank'."
)
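
The truncation can be confirmed by triggering both forms and comparing the messages. The wrapper functions and failure_message below are hypothetical names for this demonstration; the assert statements themselves are copied from the diff and the suggestion. The sketch assumes Python is not run with -O, which strips assert statements.

```python
from typing import Callable, List, Optional


def require_groups_original(existing_group_sizes: List[int]) -> None:
    assert (
        existing_group_sizes
    ), "There should be at least one group existing, cannot redistribute, "
    "try to increase 'max-seqlen-per-dp-cp-rank'."  # separate no-op expression statement


def require_groups_fixed(existing_group_sizes: List[int]) -> None:
    assert existing_group_sizes, (
        "There should be at least one group existing, cannot redistribute, "
        "try to increase 'max-seqlen-per-dp-cp-rank'."
    )


def failure_message(check: Callable[[List[int]], None]) -> Optional[str]:
    """Run the check on an empty list and return the assertion message, if any."""
    try:
        check([])
    except AssertionError as err:
        return str(err)
    return None


truncated = failure_message(require_groups_original)
full = failure_message(require_groups_fixed)
```

Here truncated ends at "cannot redistribute, " while full also carries the hint about max-seqlen-per-dp-cp-rank.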

Victarry · 2026-04-08T05:08:28Z

+def next_hdp_group(
+    sample_seqlens: List[Tuple[int, int]],
+    compute_estimator: Callable[[int], float],
+    total_gpus: int,
+    gpus_needed_fn: Callable[[int], int],
+    make_buckets_equal_fn: Callable,
+    max_seq_len_per_rank: float,
+    get_total_workload_fn: Callable,
+    delta: float = 0.05,
+    strategy: str = "dp",
+    eps_bucket: float = 0.10,
+) -> Tuple[List[List[int]], List[Tuple[int, int]], List[float], List[List[int]]]:


[IMPORTANT Readability] next_hdp_group is 270 lines with 3 nested closures (trim_overload, fill_empty_gpus, inner fill_empty), mixing greedy scheduling, balance checking, empty-GPU filling, and overload trimming in a single scope.

The closures capture mutable outer state via nonlocal, making it hard to reason about which variables are modified where. fill_empty_gpus alone is ~70 lines of array-shifting logic that would be much easier to test and review as a standalone module-level function.

Suggestion: Extract fill_empty_gpus (and trim_overload when uncommented) into top-level functions with explicit input/output signatures. next_hdp_group should only contain the greedy main loop.

Victarry · 2026-04-08T05:08:40Z

+    micro_batches = [[] for _ in range(total_gpus)]
+    exec_times = [0.0 for _ in range(total_gpus)]
+    sample_ids_per_gpu = [[] for _ in range(total_gpus)]
+    packing_sequence_len = {}
+
+    gpu_group_id = [None] * total_gpus
+    group_members = {}
+    group_size = {}
+    next_gid = 0
+
+    pp_cursor = 0
+    prev_needed = None
+    check_balance = False


[SUGGESTION Naming] Core scheduling state variables are overly abbreviated for a 270-line function in distributed scheduling code.

In a function this long with multiple nested scopes, short names like gid, g_members, g_size, nc, next_pw force readers to keep a mental lookup table. Suggested renames:

| Current | Suggested |
| --- | --- |
| `gpu_group_id` | (OK) |
| `group_members` | (OK) |
| `group_size` | `group_cp_size` (disambiguate from `len(group_members[gid])`) |
| `next_gid` | `next_group_id` |
| `packing_sequence_len` | `packed_tokens_per_rank` |
| `pp_cursor` | consider removing (see strategy comment) |

Victarry · 2026-04-08T05:08:49Z

+    # Step4: Prepare "local_cp_size" if dynamic context parallel is enabled.
+    if dynamic_cp:
+        if is_tp_rank_0:
+            if type(batch['local_cp_size']) == int:


[SUGGESTION Readability] Prefer isinstance(batch['local_cp_size'], int) over type(...) == int.

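isinstance is the Pythonic idiom and correctly handles int subclasses (e.g. bool or enum.IntEnum members). Note that np.int64 is not an int subclass on Python 3, so accepting NumPy integers would need a check against np.integer or numbers.Integral instead. Same applies to the type(batch['max_seqlen']) == int check on line 595.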
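
The two checks diverge for any subclass of int. The sketch below uses only the standard library; the CPSize enum is a made-up example type and does not appear in the PR.

```python
import numbers
from enum import IntEnum


class CPSize(IntEnum):
    """Made-up int subclass used only for this comparison."""

    FOUR = 4


value = CPSize.FOUR

exact_type_match = type(value) == int       # False: the type is CPSize, not int
subclass_aware = isinstance(value, int)     # True: CPSize derives from int
abstract_check = isinstance(value, numbers.Integral)  # True: int is registered as Integral

# bool is also an int subclass, which may or may not be desired for a size field.
bool_is_int = isinstance(True, int)         # True
```

If booleans should be rejected for a field like local_cp_size, an explicit extra check would be needed alongside isinstance; that requirement is an assumption, not something stated in the review.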

Victarry · 2026-04-08T05:09:00Z

+        scan_order = (
+            range(len(buckets))
+            if strategy == "dp"
+            else [(pp_cursor + i) % len(buckets) for i in range(len(buckets))]
+        )


[SUGGESTION Simplification] strategy parameter is always "dp" — the "pp" branch and pp_cursor variable are dead code.

No caller passes strategy="pp", so the round-robin scan_order path is unreachable. The pp_cursor maintenance (lines 675-676, 732) and the strategy parameter add complexity without providing value.

If "pp" is planned future work, add a # TODO with a tracking issue. Otherwise, remove the strategy parameter and the pp-related code to simplify the already complex scheduling logic.
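
For reference, the two branches of the quoted expression produce the following orders. The wrapper function scan_order is a hypothetical name introduced here; its body is the expression from the diff with len(buckets) passed in as num_buckets.

```python
from typing import List


def scan_order(num_buckets: int, strategy: str = "dp", pp_cursor: int = 0) -> List[int]:
    """Bucket visiting order for each strategy, as in the quoted expression."""
    if strategy == "dp":
        return list(range(num_buckets))
    # Round-robin: start at pp_cursor and wrap around.
    return [(pp_cursor + i) % num_buckets for i in range(num_buckets)]


dp_order = scan_order(4)                              # [0, 1, 2, 3]
pp_order = scan_order(4, strategy="pp", pp_cursor=2)  # [2, 3, 0, 1]
```

Since the "dp" branch never reads pp_cursor, removing the parameter would also remove the need to maintain the cursor.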

xiaoyao0115 assigned xiaoyao0115 and kunlunl Oct 28, 2025

xiaoyao0115 requested review from a team as code owners October 28, 2025 08:57

xiaoyao0115 added the enhancement New feature or request label Oct 28, 2025

xiaoyao0115 force-pushed the hybrid-cp branch 3 times, most recently from f33edcd to 48e91d2 Compare November 2, 2025 09:33

yanring added module: moe dev branch Dev branch related issues and development labels Nov 5, 2025

xiaoyao0115 changed the title ~~[Dev] feat: hybrid-cp feature for dev branch (Author: Parth Kunlun Tailai)~~ [Dev] feat: hybrid-cp feature for dev branch (part 2) Nov 11, 2025

xiaoyao0115 changed the title ~~[Dev] feat: hybrid-cp feature for dev branch (part 2)~~ [Dev] feat: hybrid-cp for dev branch (part 2) Nov 11, 2025

xiaoyao0115 force-pushed the hybrid-cp branch from 983e5f3 to 11d9960 Compare November 12, 2025 09:54

kunlunl reviewed Nov 24, 2025

Comment thread pretrain_gpt.py Outdated

kunlunl reviewed Nov 24, 2025

Comment thread pretrain_gpt.py Outdated

copy-pr-bot Bot temporarily deployed to nemo-ci December 1, 2025 12:13 Inactive

ko3n1g added this to the Core 0.16 milestone Dec 1, 2025

copy-pr-bot Bot had a problem deploying to nemo-ci December 1, 2025 12:13 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci December 1, 2025 12:13 Inactive

copy-pr-bot Bot had a problem deploying to public December 1, 2025 12:20 Failure

yanring mentioned this pull request Dec 3, 2025

[ROADMAP][Updated on April 07] Megatron Core MoE Roadmap #1729

Open

48 tasks

xiaoyao0115 force-pushed the hybrid-cp branch from e0c90c5 to 501a5f6 Compare December 4, 2025 07:25

shifangx mentioned this pull request Dec 5, 2025

[QWen3_VL] pretrain performance optimization NVIDIA-NeMo/Megatron-Bridge#1605

Open

kunlunl requested a review from parthmannan December 9, 2025 13:25

copy-pr-bot Bot temporarily deployed to test April 1, 2026 08:38 Inactive

yuzhongw-nvidia reviewed Apr 2, 2026

Comment thread tests/test_utils/recipes/h100/gpt.yaml Outdated

yuzhongw-nvidia self-requested a review April 3, 2026 09:51

yuzhongw-nvidia approved these changes Apr 3, 2026

yuzhongw-nvidia enabled auto-merge April 3, 2026 09:52

copy-pr-bot Bot temporarily deployed to test April 3, 2026 09:52 Inactive

copy-pr-bot Bot temporarily deployed to test April 3, 2026 10:44 Inactive

xiaoyao0115 and others added 5 commits April 7, 2026 00:18

dynamic-cp support for latest dev branch

4dd30f7

Signed-off-by: tailaim <tailaim@nvidia.com>

add support for mtp and mamba

50bca8e

Signed-off-by: tailaim <tailaim@nvidia.com>

functional test and several fixes

1a7ab4b

Signed-off-by: tailaim <tailaim@nvidia.com>

Add tag mr-github-slim for the new functional test

039d942

Co-authored-by: Yuzhong Wang <yuzhongw@nvidia.com> Update model_config.yaml Update model_config.yaml

fix

aa49993

yaox12 approved these changes Apr 7, 2026

Victarry reviewed Apr 8, 2026

xiaoyao0115 mentioned this pull request Apr 9, 2026

Minor improvements for Dynamic-cp #4226

Merged

5 tasks

svcnvidia-nemo-ci mentioned this pull request May 5, 2026

chore: nightly sync main into dev (05_05_2026) #4619

Closed

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

This was referenced Jun 10, 2026

Add Dynamic Context Parallelism support (port from dev) #5252

Closed

Sync Dynamic-CP feature from dev to main #5279

Closed
