Add Dynamic Context Parallelism support (port from dev)#5252

Closed: ilml wants to merge 1 commit into NVIDIA:main from ilml:dcp/sync-dynamic-cp-to-main

ilml (Contributor) commented Jun 10, 2026

Summary

Port the dynamic context parallelism feature (--dynamic-context-parallel) from the dev branch to main. The feature works on dev but is missing/incomplete on main; this PR brings main to parity.

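Covers the following dev PRs:

  • NVIDIA#2924: THD E2E sequence-packing framework
  • NVIDIA#3405: THD+DCP rope fix, hybrid->dynamic rename
  • NVIDIA#2000: Dynamic CP part 2
  • NVIDIA#4226: resolve_cp_group, MTP token-weighted loss logging, GDN/MLA enablement
  • NVIDIA#4832: VarlenDataset and dataloader contract fix

plus the dev-side wrap_data_iterator and arg-rename fixes.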

Core feature files (data_schedule.py, data_schedule_utils.py, varlen_dataset.py) are byte-identical to dev.

Main-side adaptations

  • pretrain_gpt.py keeps main's consolidated get_batch/forward_step and adds a sequence-packing branch on top.
  • gated_delta_net.py threads the per-microbatch resolved cp_group through main's fused head-perm all-to-all path.
  • multi_token_prediction.py re-implements the loss-sum/token-count tracker inside main's metrics API, keeping acceptance-rate logging.
  • The deprecated --hybrid-context-parallel flag maps to --dynamic-context-parallel via ModelParallelConfig.__post_init__.
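
For reviewers unfamiliar with the pattern, the deprecated-flag mapping in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `ParallelConfigSketch` is a hypothetical stand-in for `ModelParallelConfig`, the field names are assumed from the CLI flag names, and emitting a `DeprecationWarning` is an assumption about the behaviour.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ParallelConfigSketch:
    """Hypothetical stand-in for ModelParallelConfig (illustration only)."""

    # New name of the feature switch (assumed field name).
    dynamic_context_parallel: bool = False
    # Deprecated alias, kept so that older launch scripts keep working.
    hybrid_context_parallel: bool = False

    def __post_init__(self):
        # Forward the deprecated alias to the new field once, at construction.
        if self.hybrid_context_parallel:
            warnings.warn(
                "hybrid_context_parallel is deprecated; "
                "use dynamic_context_parallel instead",
                DeprecationWarning,
                stacklevel=2,
            )
            self.dynamic_context_parallel = True


# An old-style config still turns the feature on.
legacy = ParallelConfigSketch(hybrid_context_parallel=True)
print(legacy.dynamic_context_parallel)
```

Resolving the alias in `__post_init__` means downstream code only ever has to check the new field.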

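The loss-sum/token-count tracker mentioned for multi_token_prediction.py matters because packed microbatches can hold different numbers of tokens, so averaging per-microbatch means would weight short microbatches too heavily. A minimal sketch of the idea, assuming plain Python floats in place of tensors; `TokenWeightedLoss` and its methods are hypothetical names, not main's metrics API:

```python
from dataclasses import dataclass


@dataclass
class TokenWeightedLoss:
    """Hypothetical accumulator: tracks a loss sum and a token count."""

    loss_sum: float = 0.0
    num_tokens: int = 0

    def update(self, per_token_losses):
        # Accumulate the raw sum and the count, not a per-microbatch mean.
        self.loss_sum += sum(per_token_losses)
        self.num_tokens += len(per_token_losses)

    def mean(self):
        # Token-weighted mean over everything seen so far.
        return self.loss_sum / self.num_tokens if self.num_tokens else 0.0


tracker = TokenWeightedLoss()
microbatches = [[1.0, 1.0, 1.0], [5.0]]  # 3 tokens, then 1 token
for losses in microbatches:
    tracker.update(losses)

mean_of_means = sum(sum(m) / len(m) for m in microbatches) / len(microbatches)
print(tracker.mean())   # 2.0 (8.0 total loss over 4 tokens)
print(mean_of_means)    # 3.0 (over-weights the 1-token microbatch)
```

In a distributed run the two accumulators would presumably be reduced across ranks before dividing; that step is omitted here.
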
Testing

  • All changed Python files pass syntax/compile checks; unit and functional tests have not been run locally — relying on CI.
  • Adds functional test case gpt3_mcore_te_tp2_pp1_cp4_dcp and unit tests (test_sequence_packing.py, test_varlen_dataset.py, test_get_batch.py, etc.) ported from dev.

🤖 Generated with Claude Code

Port the dynamic CP (--dynamic-context-parallel) feature from the dev
branch to main, covering dev PRs NVIDIA#2924 (THD E2E sequence-packing
framework), NVIDIA#3405 (THD+DCP rope fix, hybrid->dynamic rename), NVIDIA#2000
(Dynamic CP part 2), NVIDIA#4226 (resolve_cp_group, MTP token-weighted loss
logging, GDN/MLA enablement), NVIDIA#4832 (VarlenDataset and dataloader
contract fix), plus the dev-side wrap_data_iterator and arg-rename
fixes.

Main-side adaptations:
- pretrain_gpt.py keeps main's consolidated get_batch/forward_step and
  adds a sequence-packing branch on top.
- gated_delta_net.py threads the per-microbatch resolved cp_group
  through main's fused head-perm all-to-all path.
- multi_token_prediction.py re-implements the loss-sum/token-count
  tracker inside main's metrics API, keeping acceptance-rate logging.
- The deprecated --hybrid-context-parallel flag maps to
  --dynamic-context-parallel via ModelParallelConfig.__post_init__.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

copy-pr-bot (bot) commented Jun 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.