[WIP][2/2] Add mixed tensor-parallelism for ModelExpress P2P loading #19983
ishandhanani wants to merge 6 commits into main
Conversation
Add a MODEL_EXPRESS backend for remote instance weight loading that uses the ModelExpress gRPC server for metadata coordination instead of direct HTTP between seed and target instances. Supports FP8 and BF16 models with per-tensor byte-size matching for mixed-dtype transfers. New CLI args: --model-express-url, --model-express-model-name, --model-express-source
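A minimal sketch of the per-tensor byte-size check described above; the function name and parameters are illustrative, not the PR's actual code:

```python
import torch

def byte_size_matches(src_numel: int, src_element_size: int,
                      dst_param: torch.Tensor) -> bool:
    """Match source and target tensors by raw byte size.

    Comparing bytes rather than dtype strings lets FP8 (1 byte/elem) and
    BF16 (2 bytes/elem) weights be paired correctly without a lossy
    dtype-name round trip.
    """
    src_bytes = src_numel * src_element_size
    dst_bytes = dst_param.numel() * dst_param.element_size()
    return src_bytes == dst_bytes
```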
Replace --model-express-url, --model-express-model-name, and --model-express-source with a single --model-express-config JSON arg. Properties provide backwards-compatible access for all downstream code (model_runner, loader, load_config). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
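A rough sketch of what the single-arg approach could look like; only the flag name --model-express-config comes from the PR, while the JSON key names, class layout, and property names below are assumptions:

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical usage:
#   --model-express-config '{"url": "localhost:50051", "model_name": "llama-70b", "source": "seed-0"}'


@dataclass
class ModelExpressConfig:
    url: str | None = None
    model_name: str | None = None
    source: str | None = None

    @classmethod
    def from_json(cls, raw: str | None) -> "ModelExpressConfig":
        return cls(**json.loads(raw)) if raw else cls()


class LoadConfig:
    def __init__(self, model_express_config: str | None = None):
        self._mx = ModelExpressConfig.from_json(model_express_config)

    # Properties keep the old per-field access working for downstream code
    # (model_runner, loader) without touching their call sites.
    @property
    def model_express_url(self) -> str | None:
        return self._mx.url

    @property
    def model_express_model_name(self) -> str | None:
        return self._mx.model_name

    @property
    def model_express_source(self) -> str | None:
        return self._mx.source
```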
This is dead code from the initial MX integration; we switched to a raw byte-size comparison instead of dtype string conversion.
force-pushed from f42cb6e to f7e5203
- Remove unused _get_model_dtype_str() method
- Drop lossy element_size_to_dtype reverse mapping from seed publish (the dtype field was never read on the target side)
- Wrap MxClient usage in try/finally to prevent gRPC channel leaks
- Close MxClient before starting RDMA transfers (the connection is not needed during the transfer phase)
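A sketch of the try/finally pattern described above; the MxClient import path and the method names used here (publish, close) are assumptions about the ModelExpress client API:

```python
# from model_express import MxClient  # import path is an assumption

def publish_metadata(mx_url: str, model_name: str, descriptors) -> None:
    client = MxClient(mx_url)
    try:
        client.publish(model_name, descriptors)
    finally:
        # Close before the RDMA/NVLink phase begins: the gRPC channel is only
        # needed for metadata coordination, and closing in finally also
        # prevents a channel leak if publish raises.
        client.close()
```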
Extends ModelExpress remote instance loading to support mixed TP configurations (seed TP != target TP). Adds shard metadata (full_shape, shard_dim, effective_tp_size, shard_index) to tensor descriptors so target ranks can compute byte-range overlaps from arbitrary source TP configurations.

Key changes:
- model_runner.py: publish shard metadata extracted from model modules (ColumnParallelLinear, RowParallelLinear, etc.) to ModelExpress
- loader.py: mixed-TP transfer plan with a dim-0 overlap algorithm (see the sketch after this list), dim-1 GPU-side column slicing, and replicated tensor handling. Remove the small-tensor disk fallback (the mooncake IPC offset bug was fixed upstream); all tensors now transfer via RDMA/NVLink.
- remote_instance_weight_loader_utils.py: v1/v2 registration helpers

Tested: Llama-3.3-70B-Instruct-FP8 TP2->TP2 (same), TP2->TP4 (scale up)
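A sketch of the dim-0 overlap idea, assuming even dim-0 sharding on both sides; the function and parameter names are illustrative, not the PR's _compute_transfer_plan:

```python
def dim0_overlaps(full_dim0: int, row_bytes: int,
                  src_tp: int, dst_tp: int, dst_rank: int):
    """Yield (src_rank, src_offset_bytes, dst_offset_bytes, length_bytes)
    for the rows a target rank needs from each source shard.

    full_dim0 -- size of dim 0 of the unsharded tensor (from full_shape)
    row_bytes -- bytes per dim-0 row (product of remaining dims * element size)
    """
    src_rows = full_dim0 // src_tp            # rows held by each source rank
    dst_rows = full_dim0 // dst_tp            # rows needed by each target rank
    dst_start, dst_end = dst_rank * dst_rows, (dst_rank + 1) * dst_rows

    for src_rank in range(src_tp):
        src_start, src_end = src_rank * src_rows, (src_rank + 1) * src_rows
        lo, hi = max(dst_start, src_start), min(dst_end, src_end)
        if lo < hi:  # this source shard overlaps the target slice
            yield (src_rank,
                   (lo - src_start) * row_bytes,   # offset into the source shard
                   (lo - dst_start) * row_bytes,   # offset into the target param
                   (hi - lo) * row_bytes)
```

For example, going TP2 -> TP4, target rank 1 pulls the second half of source rank 0's shard; going TP4 -> TP2, target rank 0 pulls all of source ranks 0 and 1.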
Two bugs were blocking mixed-TP scale-up and scale-down:
1. Seed published the wrong shard_dim for RowParallelLinear with FP8 quantization. FP8 sets output_dim=0 and input_dim=1 on all weights. The old code always picked output_dim, making row-parallel layers appear dim-0 sharded. Fix: check the module type first; use input_dim for RowParallel.
2. Target treated all tensors from a TP=1 source as replicated due to the condition `shard_dim == -1 or src_eff_tp <= 1`. This copied full unsharded tensors into resized params, breaking weight shapes. Fix: only check shard_dim == -1 for the replicated classification.

Validated: Llama-3.3-70B-FP8 scale-up (TP2->TP4) and scale-down (TP4->TP2), plus Qwen3-0.6B scale-down (TP2->TP1), matching the disk baseline exactly.
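A sketch of the first fix, checking the module type before trusting the per-weight dim attributes; the import path and attribute handling are assumptions, not the PR's exact code:

```python
# Import path is an assumption about where the parallel linear layers live.
from sglang.srt.layers.linear import ColumnParallelLinear, RowParallelLinear


def select_shard_dim(module, weight) -> int:
    """Return the TP shard dim for a weight, or -1 if it is replicated.

    Under FP8 quantization every weight carries both output_dim=0 and
    input_dim=1, so the module type, not the attributes alone, decides
    which one is the real shard dimension.
    """
    input_dim = getattr(weight, "input_dim", None)
    output_dim = getattr(weight, "output_dim", None)
    if isinstance(module, RowParallelLinear):
        dim = input_dim           # row-parallel shards along the input dim
    elif isinstance(module, ColumnParallelLinear):
        dim = output_dim          # column-parallel shards along the output dim
    else:
        dim = None
    return dim if dim is not None else -1  # -1 marks replicated tensors
```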
force-pushed from f7e5203 to ac7b2e3
    logger.info("ModelExpress: weight transfer complete for tp_rank=%d", tp_rank)

    def _compute_transfer_plan(model, source_index, tp_rank, tp_size):

The logic here is pretty complex. How do we verify this works for different models? Is it worth supporting mixed TP transfer? Alternatively, we can treat different TPs as different seed instances.

I agree. Registering TP workers under different parallel mechanisms as distinct seed instances should be sufficient. I would suggest only allowing client instances to load weights from seed instances that share the same model and parallel mechanism.
    # Always use v1 (per-param registration) for correctness.
    # v2's block-merging approach has issues with small tensors
    # that are sub-allocated within CUDA blocks.
    return register_memory_region_v1(model, transfer_engine)

I'm concerned that the v1 registration approach might significantly impact boot-up time. Would it be possible to at least provide users with an option to choose their preferred registration method?
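One way the suggested option could look; the parameter name and the v2 helper are assumptions (only register_memory_region_v1 and the v2 block-merging behavior are mentioned in the diff):

```python
def register_model_memory(model, transfer_engine, registration_version: str = "v1"):
    # Hypothetical dispatcher: default to the safe per-param path, let users
    # opt into block merging when boot-up time matters more.
    if registration_version == "v2":
        # Faster registration: merges params into larger CUDA blocks, but can
        # mis-handle small tensors sub-allocated within a block.
        return register_memory_region_v2(model, transfer_engine)
    # Default: per-param registration, slower but correct for small tensors.
    return register_memory_region_v1(model, transfer_engine)
```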
We will discuss this PR in DMs.
Summary
Bug fixes in this revision
- Wrong shard_dim for RowParallelLinear with FP8 quantization: FP8 sets output_dim=0 and input_dim=1 on all weights. For RowParallelLinear the actual shard dim is input_dim, not output_dim. Fixed by checking the module type before selecting shard_dim.
- src_eff_tp <= 1 replicated short-circuit: when the source is TP=1, all tensors were treated as replicated. Fixed to only check shard_dim == -1.

Depends on
- IntraNodeNvlinkTransport::registerLocalMemory() (upstream pending)

Test plan
- Llama-3.3-70B-Instruct-FP8: TP2->TP2, TP2->TP4 scale-up, TP4->TP2 scale-down
- Qwen3-0.6B (BF16): TP2->TP1 scale-down, matching the disk baseline