Skip to content

[Bug] HybridEP dispatcher passes incorrect max_num_of_tokens_per_rank to DeepEP, causing RDMA QP assertion failure #3999

@tingmingzhong

Description

@tingmingzhong

Summary

When using the HybridEP backend (--moe_flex_dispatcher_backend hybridep) for MoE expert-parallel training across multiple nodes, HybridEPDispatch.forward() in fused_a2a.py incorrectly uses the total number of tokens in a micro-batch (seq_length × micro_batch_size) as the max_num_of_tokens_per_rank parameter for HybridEPBuffer. This causes the RDMA Queue Pair send-queue depth (tx_depth) to exceed the hardware limit of 65535, triggering an assertion failure in DeepEP's internode communication initialization.

Environment

  • Megatron-LM: latest (hybrid-ep branch of DeepEP integrated)
  • DeepEP: v1.2.1 (hybrid-ep branch)
  • Model: Qwen3-30B-A3B (MoE, 128 experts)
  • Hardware: 2 nodes × 8 GPUs, InfiniBand RDMA interconnect
  • Training config: --max_length 8192 --micro_batch_size 8 --packing true --expert_model_parallel_size 16 --moe_flex_dispatcher_backend hybridep

Error

python: /path/to/DeepEP/csrc/hybrid_ep/buffer/internode.cu:167:
void setup_qp_init_attr(..., int): Assertion `tx_depth > 0 && tx_depth < 65536' failed.

All ranks crash with SIGABRT (signal 6) during HybridEP buffer initialization.

Root Cause

The call chain

  1. MoEFlexTokenDispatcher.dispatch_preprocess() (token_dispatcher.py:1438) reshapes hidden_states from [seq_length, batch_size, hidden_size] to [seq_length * batch_size, hidden_size]:

    hidden_states = hidden_states.view(-1, self.hidden_shape[-1])
  2. HybridEPDispatch.forward() (fused_a2a.py:354-359) extracts the first dimension of the already-flattened tensor and uses it as seq_len:

    if _hybrid_ep_buffer is None:
        seq_len, hidden_dim = x.shape[-2:]  # x is [seq_len * batch_size, hidden_dim]
        init_hybrid_ep_buffer(group, hidden_dim, seq_len, ...)  # seq_len is actually num_total_tokens
  3. init_hybrid_ep_buffer() (fused_a2a.py:316) passes this value directly as max_num_of_tokens_per_rank:

    _hybrid_ep_buffer = HybridEPBuffer(
        group=group,
        hidden_dim=hidden_dim,
        max_num_of_tokens_per_rank=seq_len,  # <-- This is seq_length * micro_batch_size, not per-rank tokens
        ...
    )
  4. DeepEP internode.cu uses max_num_of_tokens_per_rank to compute the RDMA QP send-queue depth:

    // internode.cu:430 (dispatch)
    setup_qp_init_attr(..., 3 * buffer_config.max_num_of_tokens_per_rank + 1);
    // internode.cu:587 (combine)
    setup_qp_init_attr(..., 2 * buffer_config.max_num_of_tokens_per_rank + 1);
  5. The assertion enforces the IB hardware limit:

    assert(tx_depth > 0 && tx_depth < 65536);

Concrete example

With max_length=8192, micro_batch_size=8, packing=true:

Parameter Value
x.shape[0] (after flatten) 8192 × 8 = 65536
max_num_of_tokens_per_rank passed to DeepEP 65536
dispatch tx_depth 3 × 65536 + 1 = 196609
Hardware limit < 65536

The dispatch tx_depth exceeds the limit by .

Why single-node works but multi-node fails

The RDMA QP initialization (and thus the tx_depth assertion) only runs when num_of_nodes > 1. Single-node setups use NVLink-only communication and never hit this code path.

Steps to Reproduce

  1. Configure a multi-node MoE training with HybridEP:

    --expert_model_parallel_size 16 \
    --moe_token_dispatcher_type flex \
    --moe_flex_dispatcher_backend hybridep \
    --micro_batch_size 8 \
    --max_length 8192 \
    --packing true
  2. Run training across 2+ nodes with RDMA/InfiniBand.

  3. Training crashes immediately during HybridEP buffer initialization with:

    Assertion `tx_depth > 0 && tx_depth < 65536' failed.
    

Note: The issue is triggered when 3 × seq_length × micro_batch_size + 1 > 65535, i.e., seq_length × micro_batch_size > 21845. Common configurations like 8192 × 4 = 32768 or 4096 × 8 = 32768 will hit this.

Affected Code

  • megatron/core/transformer/moe/fused_a2a.pyHybridEPDispatch.forward() (line 354-359) and init_hybrid_ep_buffer() (line 316)

Additional context

Add any other context about the problem here.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions