moe: add DeepEP V2 ElasticBuffer support to MoE flex dispatcher#4632
Draft
dmvevents wants to merge 3 commits into
2fb717d to
1aa3a7e
Compare
dmvevents
added a commit
to antonai-work/nemo-rl-deepep-v2-efa
that referenced
this pull request
May 5, 2026
…file

Makes this repo rebuildable end-to-end without any private-repo access.

Changes:
- docker/Dockerfile: merged in full base-image recipe. Was `FROM deepep-base-v2:latest` (a pre-built private image); now builds from vanilla `nvidia/cuda:12.9.0-devel-ubuntu24.04` with full EFA + aws-ofi-nccl + NCCL + GDRCopy + NVSHMEM + DeepEP V2 + Megatron + NeMo-RL stack in-Dockerfile. Public git clone URLs only.
- docker/Dockerfile: COPY paths repointed from private-tree locations (`integrations/nemo-rl-fullstack/...`, `scripts/verify_efa_traffic.sh`) to this repo's `patches/`, `tests/`, `docker/` directories.
- docker/build.sh: rewritten to drop the private base-image prereq + private-tree `REPO_ROOT` assumptions.
- docs/ARCHITECTURE.md: removed reference to a "private development repo" shim file; rephrased as a design decision about this repo.
- tests/k8s/multi-node-training-h100.yaml: replaced hard-coded AWS account ID in the ECR image path with a clear placeholder.
- patches/0004-0006: regenerated from the fork branch after amending commit messages + source comments to drop `antonai-work/deepep-v2-integration` refs. Author restored to the real identity (Anton Alexander). Code tree is byte-identical to the pre-rebase branch.
- ci/: removed. The CodeBuild spec was wired to a private account ID and private-repo source paths; leaving it in the public repo would have shipped broken config.

Verification: `grep -r "antonai-work/deepep-v2-integration|/home/ubuntu| /tmp/nemo-rl-pr-prep|058264135704" .` returns no matches.

Megatron fork branch `deepep-v2-elasticbuffer-support` force-pushed with identical code tree; PR NVIDIA/Megatron-LM#4632 picks up the cleaned commit history automatically.
This was referenced May 5, 2026
DeepEP PR #605 (merged 2026-04-29) renames `deep_ep.Buffer` to `deep_ep.ElasticBuffer` and changes the dispatch/combine contract (5-tuple return with the per-expert list moved onto the handle, `async_with_compute_stream` in place of `async_finish`, layout kwargs dropped because V2 infers layout internally from `topk_idx`).

This adds a second import probe next to the existing `HybridEPBuffer` probe and teaches `get_buffer()` / `FusedDispatch` / `FusedCombine` to branch on `HAVE_DEEP_EP_V2`. When V2 is present it is preferred; otherwise the legacy `Buffer` code path is unchanged. `_DeepepManager` itself (token_dispatcher.py) does not change; all V2-specific knowledge lives in this one file.

Why:
- Consumers already use V2 via a downstream compatibility shim. A full reproducible recipe (Dockerfile + k8s manifest + training driver) is published at https://github.com/antonai-work/nemo-rl-deepep-v2-efa. Validated on 2-node p5.48xlarge + EFA with Qwen3-30B-A3B-BF16: loss decreased over 3 steps, real grad_norm, 0.8 GB cross-node EFA TX.
- Removes the need for the downstream shim once this lands.
- Mirrors the existing `HybridEPBuffer` probe pattern already in this file, so review load stays in the "infra bump" bucket rather than the "new feature" bucket.

V1 parity rules baked in:
- `num_max_tokens_per_rank` pinned from env (`MCORE_DEEPEP_V2_MAX_TOKENS_PER_RANK`, default 8192) to avoid the JIT template instantiation drift across ranks that otherwise hangs the cross-node Gin barrier (DeepEP dispatch.hpp:138 template arg).
- `num_allocated_qps=0` on EFA so V2's built-in Queue-Pair auto-cap kicks in (avoids CUDA 719 at dispatch.hpp:183 against AWS EFA provider).
- `num_sms=0` on combine so V2 reuses `handle.num_sms` from dispatch (mismatch triggers sticky CUDA 719 at jit/handle.hpp:86).
- `do_expand=False` matches V1 token layout, so downstream callers like `_DeepepManager.dispatch_postprocess` see the same recv shape.
- `previous_event` is seeded via `buffer.capture()` under `async_finish=True` per V2's contract at buffer.hpp:483 (`previous_event` requires `allocate_on_comm_stream=True`).

Test plan:
- `tests/unit_tests/transformer/moe/test_fused_a2a_deepep_v2.py` exercises probe plumbing (V1-only, V2-only, neither-installed).
- Existing `TestFlexDispatcher` in `test_token_dispatcher.py` runs under whichever DeepEP flavour is installed in the CI image.
- Full 2-node D+C validation run against Qwen3-30B-A3B-BF16 on H100 + EFA is published as a reproducible recipe at https://github.com/antonai-work/nemo-rl-deepep-v2-efa (see docs/VALIDATION.md for the expected-output contract).

Related:
- DeepEP PR #605 (merged 2026-04-29)
- DeepEP PR #612 (still open; AWS EFA auto-QP cap; not required on InfiniBand / NVLink fabrics).
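The first parity rule above (pinning `num_max_tokens_per_rank` from the environment) could look roughly like the sketch below. Only the variable name `MCORE_DEEPEP_V2_MAX_TOKENS_PER_RANK` and the default of 8192 come from this commit message; the helper name `read_max_tokens_per_rank` and the positive-integer check are assumptions made for illustration, not the code in the patch.

```python
import os
from typing import Mapping

# Name and default are taken from the commit message above.
ENV_VAR = "MCORE_DEEPEP_V2_MAX_TOKENS_PER_RANK"
DEFAULT_MAX_TOKENS_PER_RANK = 8192


def read_max_tokens_per_rank(environ: Mapping[str, str] = os.environ) -> int:
    """Return the pinned per-rank token ceiling (hypothetical helper).

    Every rank that reads the same environment gets the same value, which is
    the point of pinning: the value does not depend on the per-call batch.
    """
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_MAX_TOKENS_PER_RANK
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        # Assumption: a non-positive ceiling is never meaningful.
        raise ValueError(f"{ENV_VAR} must be a positive integer, got {raw!r}")
    return value


# Resolved once at import time, so later calls cannot drift.
MAX_TOKENS_PER_RANK = read_max_tokens_per_rank()
```

Reading the value once at module load, rather than deriving it from `seq_length × micro_batch_size` on each call, is what keeps the value identical on every rank that shares the environment.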
V2 (DeepEP PR #605) defines EventOverlap in deep_ep.utils.event but does not re-export it from deep_ep.utils (only EventHandle). Fall through to the submodule path so fused_a2a loads under V2-only installs.
V2 ElasticBuffer.dispatch at elastic.py:768 calls get_theoretical_num_sms(num_experts, num_topk) BEFORE resolving num_experts from the handle at line 782. Passing num_experts=None with num_sms=0 raises "TypeError: unsupported operand type(s) for %: 'NoneType' and 'int'" during the backward of FusedCombine (which reuses a handle). Fix: extract num_experts from handle.num_experts and pass explicitly.
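The failure mode and the fix described in this commit can be reproduced in isolation with stand-ins. In the sketch below, `FakeHandle`, `theoretical_num_sms` and `resolve_num_experts` are hypothetical names, and the arithmetic inside `theoretical_num_sms` is a placeholder; the only properties taken from the commit message are that the helper applies `%` to `num_experts` and that the handle carries a `num_experts` attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeHandle:
    """Hypothetical stand-in for the V2 handle; only the field used here."""
    num_experts: int


def theoretical_num_sms(num_experts, num_topk: int) -> int:
    """Placeholder for the library helper.

    The real formula is not reproduced here. What matters is that it uses
    `%` on num_experts, so passing None raises TypeError.
    """
    return 8 if num_experts % num_topk == 0 else 16


def resolve_num_experts(num_experts: Optional[int], handle: FakeHandle) -> int:
    """Prefer the explicit argument, else fall back to the handle's value."""
    return handle.num_experts if num_experts is None else num_experts


handle = FakeHandle(num_experts=128)

# Before the fix: None reaches the modulo and fails.
try:
    theoretical_num_sms(None, 8)
except TypeError as exc:
    print("unfixed path:", exc)

# After the fix: num_experts is resolved from the handle first.
print("fixed path:", theoretical_num_sms(resolve_num_experts(None, handle), 8))
```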
1aa3a7e to
2f149cf
Compare
dmvevents
pushed a commit
to dmvevents/RL
that referenced
this pull request
May 6, 2026
Bumps the deep_ep git pin in pyproject.toml from bfded348 (2025-10-29, pre-V2) to b306af0 (2026-04-29), which is the merge commit of DeepEP PR #605 "Introducing EPv2".

Why
---
The current pin predates the DeepEP V2 API (ElasticBuffer, PP/CP/Engram support). Consumers of NeMo-RL's Megatron backend that follow NVIDIA/Megatron-LM#4632 ("Shape Y" Megatron V2 adoption) cannot resolve deep_ep.ElasticBuffer with the current pin; the virtualenv still installs the pre-V2 tree.

This change bumps only the pin. It does not by itself change any NeMo-RL code path. Paired with Megatron-LM#4632, it enables the end-to-end V2 path that is already running on AWS p5en.48xlarge 2x H200 in the reproduction repo below.

Upstream references
-------------------
* deepseek-ai/DeepEP#605 (V2 merge 2026-04-29)
* NVIDIA/Megatron-LM#4632 (Megatron-side V2 adoption)

Reproduction
------------
End-to-end reproduction (Dockerfile + K8s manifests + smoke bench) is public at: https://github.com/antonai-work/nemo-rl-deepep-v2-efa

Related NeMo-RL PR (separate concern, same fleet): NVIDIA-NeMo#2410 (Dockerfile LD_LIBRARY_PATH for EFA OFI discovery)

Signed-off-by: Anton Alexander <antonai@users.noreply.github.com>
Summary
Adds DeepEP V2 (`ElasticBuffer`) support next to the existing legacy `Buffer` code path in `megatron/core/transformer/moe/fused_a2a.py`. When `deep_ep.ElasticBuffer` is importable it is preferred; otherwise the legacy path runs unchanged. Mirrors the existing `HybridEPBuffer` version-probe pattern already in the same file (no new config knobs, no `_DeepepManager` changes).

Validated on 2-node p5.48xlarge + AWS EFA with a Qwen3-30B-A3B-style MoE config: loss decreased 26.41 → 24.61 over 3 steps, real grad_norm per step, and 1.096 GB cross-node EFA TX. The V2 class is live in the training path, not a compat shim.

Motivation
DeepEP V2 (deepseek-ai/DeepEP#605) merged on 2026-04-29 and introduces `ElasticBuffer` in place of `Buffer`, changing the dispatch/combine contract:
- Dispatch returns a 5-tuple `(recv_x, recv_topk_idx, recv_topk_weights, EPHandle, event)` instead of the V1 6-tuple. `num_recv_tokens_per_expert_list` now lives on `EPHandle`.
- V2 infers layout internally from `topk_idx`, so the `get_dispatch_layout()` call and its four layout kwargs are gone.
- `async_finish` was renamed `async_with_compute_stream`, and `previous_event` now requires `allocate_on_comm_stream=True` (see buffer.hpp:483 in the V2 tree).
- The JIT kernel templates are instantiated on `num_max_tokens_per_rank`, so the value must be identical across ranks (dispatch.hpp:138,150).

Consumers of Megatron's MoE flex dispatcher (including the vLLM, SGLang, NeMo-RL and TRT-LLM integrations we maintain downstream at antonai-work/nemo-rl-deepep-v2-efa) are already running on V2 via a V1-compat shim. This PR removes the need for that shim on the Megatron side.
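To illustrate the return-shape change in the contract above, the following sketch folds both shapes into one record so that downstream code sees a single layout. It is an illustration under stated assumptions, not the code in this PR: `DispatchResult` and `normalize_dispatch_output` are hypothetical names, the V1 field order is assumed to be `(recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event)`, and the attribute name on the V2 handle is assumed to match the V1 tuple field.

```python
from types import SimpleNamespace
from typing import Any, NamedTuple, Sequence


class DispatchResult(NamedTuple):
    """Common shape for V1 and V2 dispatch outputs (hypothetical)."""
    recv_x: Any
    recv_topk_idx: Any
    recv_topk_weights: Any
    num_recv_tokens_per_expert_list: Any
    handle: Any
    event: Any


def normalize_dispatch_output(out: Sequence[Any]) -> DispatchResult:
    """Accept a V1 6-tuple or a V2 5-tuple and return one record type."""
    if len(out) == 6:
        # V1: the per-expert list is its own tuple element (assumed order).
        return DispatchResult(*out)
    if len(out) == 5:
        # V2: the per-expert list is read from the handle
        # (attribute name is an assumption).
        recv_x, recv_topk_idx, recv_topk_weights, handle, event = out
        return DispatchResult(
            recv_x,
            recv_topk_idx,
            recv_topk_weights,
            handle.num_recv_tokens_per_expert_list,
            handle,
            event,
        )
    raise ValueError(f"unexpected dispatch output of length {len(out)}")


# Usage with plain stand-in values instead of tensors.
v2_handle = SimpleNamespace(num_recv_tokens_per_expert_list=[3, 1])
result = normalize_dispatch_output(("x", "idx", "w", v2_handle, "ev"))
print(result.num_recv_tokens_per_expert_list)
```

Normalizing at the boundary is one way to keep callers such as `_DeepepManager` unchanged, which is the stated goal of confining V2 knowledge to a single file.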
Demand signal: NVIDIA/Megatron-LM#2647 (open since 2026-02-13) tracks the broader "DeepEP on AWS EFA" request, with engagement from the NCCL team (@xiaofanl-nvidia). The V2 branch is the path that version of `deep_ep` targets.

Also resolves #3999 for the V2 code path. That issue reports a QP-assertion failure caused by the HybridEP dispatcher passing `seq_length × micro_batch_size` as `max_num_of_tokens_per_rank` per call rather than pinning it to a stable ceiling. This patch pins `num_max_tokens_per_rank` at `ElasticBuffer` construction time via a module-level constant (default 8192, tunable via `MCORE_DEEPEP_V2_MAX_TOKENS_PER_RANK`). Ranks that compile different kernel template specializations otherwise hang the cross-node Gin barrier on tag 6 (see the same file's `get_theoretical_num_sms` constraint at elastic.py:611).

Design choice: single-class version probe
A dual-class approach (`_DeepepV2Manager` alongside `_DeepepManager`, chosen by a new `moe_deepep_api_version` config field) would add a ~250-line near-duplicate to `token_dispatcher.py` and force a new knob through `MoEFlexTokenDispatcher.__init__`, `TransformerConfig`, YAML and docs. This PR takes the other fork:
- All V2-specific knowledge lives in one file, `fused_a2a.py`.
- `_DeepepManager` is unchanged; `MoEFlexTokenDispatcher` is unchanged.
- The import probe (`try: from deep_ep import ElasticBuffer; HAVE_DEEP_EP_V2 = True; except ImportError: HAVE_DEEP_EP_V2 = False`) has the same shape as the adjacent `HybridEPBuffer` block already present in this file (PR #3627, "Support TP > GQA for inference").

Precedent: #4228 ("build: bump DeepEP to 34152ae") merged 5 days after opening with 0 review comments and 3 additions / 3 deletions. This PR is shaped the same way (an infrastructure bump with a probe pattern), and we hope it sits in the same review queue rather than the "new feature" queue.
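For readers unfamiliar with the version-probe pattern this section relies on, here is a minimal standalone sketch. The flag names `HAVE_DEEP_EP` and `HAVE_DEEP_EP_V2` come from this PR; the helpers `probe` and `select_buffer_flavour` are hypothetical and written generically so the snippet runs whether or not `deep_ep` is installed.

```python
import importlib
from typing import Optional


def probe(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return False
    return hasattr(module, attr)


# Evaluated once at module load; a missing package just leaves a flag False.
HAVE_DEEP_EP = probe("deep_ep", "Buffer")
HAVE_DEEP_EP_V2 = probe("deep_ep", "ElasticBuffer")


def select_buffer_flavour(have_v1: bool, have_v2: bool) -> Optional[str]:
    """Prefer V2 when present, fall back to V1, else report nothing usable."""
    if have_v2:
        return "ElasticBuffer"
    if have_v1:
        return "Buffer"
    return None


print("Active buffer class:", select_buffer_flavour(HAVE_DEEP_EP, HAVE_DEEP_EP_V2))
```

Because the choice is made from import-time flags, no configuration field is needed to switch between the two code paths, which is the trade-off this section argues for.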
What changed
- `megatron/core/transformer/moe/fused_a2a.py`: `get_buffer`, `FusedDispatch.forward/backward`, `FusedCombine.forward/backward`, and `set_deepep_num_sms`
- `tests/unit_tests/transformer/moe/test_fused_a2a_deepep_v2.py`

V1 fall-through:
- `HAVE_DEEP_EP=True, HAVE_DEEP_EP_V2=False` → legacy `Buffer` path runs byte-identical to the pre-patch state.
- Neither installed → `fused_dispatch = fused_combine = set_deepep_num_sms = None` (unchanged).

V2-specific safeguards baked in:
- `num_allocated_qps=0` so V2 auto-caps the QP budget against AWS EFA's 128-slot shared GIN ring (avoids CUDA 719 at dispatch.hpp:183).
- `num_sms=0` on combine so V2 reuses `handle.num_sms` from dispatch (mismatch triggers sticky CUDA 719 at jit/handle.hpp:86).
- `num_max_tokens_per_rank` pinned at construction; this is the fix for #3999 ("[Bug] HybridEP dispatcher passes incorrect max_num_of_tokens_per_rank to DeepEP, causing RDMA QP assertion failure").
- `previous_event` seeded via `buffer.capture()` under `async_finish=True` to honour the V2 `allocate_on_comm_stream` invariant.
- `do_expand=False` on dispatch to preserve V1 token layout.
- `from deep_ep.utils.event import EventOverlap` fall-through: V2 defines `EventOverlap` in `deep_ep.utils.event` but does not re-export it from `deep_ep.utils` (V1 did).

Evidence
Validated on 2-node p5.48xlarge H100 + AWS EFA, namespace `megatron-shapey-validation`.

A fully reproducible Dockerfile + k8s manifest + training driver that reproduces this run is published at antonai-work/nemo-rl-deepep-v2-efa. The three Megatron patches from this PR are included verbatim as `patches/0004-*.patch` so reviewers can rebuild from vanilla upstream without private-repo access. Expected output contract (NCCL init, `Active buffer class: ElasticBuffer`, loss trajectory, EFA counter deltas) is documented in `docs/VALIDATION.md`.

Backwards compatibility
- `_DeepepManager`, `MoEFlexTokenDispatcher`, `TransformerConfig` all untouched.
- `HAVE_DEEP_EP` continues to reflect V1 legacy `Buffer` availability.
- Defaults: `MCORE_DEEPEP_V2_MAX_TOKENS_PER_RANK=8192`, `MCORE_DEEPEP_V2_HIDDEN=7168`, `MCORE_DEEPEP_V2_NUM_TOPK=8`.
- Cost: `try/except ImportError` blocks at module load (microseconds); when V2 isn't installed everything is a no-op.

Tests
- `test_fused_a2a_deepep_v2.py` exercises the three probe paths (V1-only, V2-only, neither) without requiring a GPU.
- `TestFlexDispatcher.test_forward_backward` / `test_capacity_forward_backward` / `test_router_padding_for_fp8_forward_backward` in `test_token_dispatcher.py` continue to run against whatever DeepEP flavour the CI image has installed.

Related
/cc @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech