
[WIP][2/2] Add mixed tensor-parallelism for ModelExpress P2P loading #19983

Closed
ishandhanani wants to merge 6 commits into main from idhanani/mx-mixed-tp-v1

Conversation

ishandhanani (Collaborator) commented Mar 5, 2026

Summary

  • Mixed TP support: seed TP != target TP (e.g., TP2 seed -> TP4 target, or TP4 -> TP2)
  • Shard metadata published from seed (full_shape, shard_dim, effective_tp_size, shard_index)
  • Transfer plan computes byte-range overlaps for dim-0 sharded params and does GPU-side column slicing for dim-1 (dim-0 overlap computation sketched below)
  • Removes the small-tensor disk fallback -- all tensors now transfer via RDMA/NVLink (the mooncake IPC offset bug was fixed upstream)
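
As a point of reference, here is a minimal sketch of the dim-0 byte-range overlap computation described above; the function and argument names (full_shape, elem_size, src_tp, dst_tp, dst_rank) are illustrative, not the loader's actual API, and even divisibility of the shard dim is assumed:

```python
import math

def dim0_overlaps(full_shape, elem_size, src_tp, dst_tp, dst_rank):
    """Yield (src_rank, src_off, dst_off, nbytes) byte ranges for one target shard."""
    total_rows = full_shape[0]
    row_bytes = math.prod(full_shape[1:]) * elem_size  # bytes per dim-0 slice (row-major)
    src_rows, dst_rows = total_rows // src_tp, total_rows // dst_tp
    dst_lo, dst_hi = dst_rank * dst_rows, (dst_rank + 1) * dst_rows
    for src_rank in range(src_tp):
        src_lo, src_hi = src_rank * src_rows, (src_rank + 1) * src_rows
        lo, hi = max(dst_lo, src_lo), min(dst_hi, src_hi)
        if lo < hi:  # overlapping row range -> one contiguous copy
            yield (src_rank,
                   (lo - src_lo) * row_bytes,  # offset into the source shard
                   (lo - dst_lo) * row_bytes,  # offset into the local param
                   (hi - lo) * row_bytes)
```

Because dim 0 is the outermost dimension of a contiguous tensor, each overlapping row range maps to one contiguous byte range, which is why dim-0 shards move as plain byte-range copies while dim-1 shards need the GPU-side slicing fixup.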

Bug fixes in this revision

  1. FP8 shard_dim for RowParallelLinear: FP8 quantization sets output_dim=0 and input_dim=1 on all weights, but for RowParallelLinear the actual shard dim is input_dim, not output_dim. Fixed by checking the module type before selecting shard_dim.
  2. src_eff_tp <= 1 replicated short-circuit: when the source is TP=1, all tensors were treated as replicated. Fixed to classify replication by shard_dim == -1 only (both fixes sketched below).
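
A minimal sketch of the two fixes, assuming the seed walks the project's ColumnParallelLinear/RowParallelLinear modules; class names are matched by string here only to keep the snippet self-contained, and the helper names are illustrative, not the PR's actual functions:

```python
def select_shard_dim(module, weight) -> int:
    """Pick the tensor-parallel shard dim to publish for a weight (-1 = replicated)."""
    # FP8 quantization marks every weight with output_dim=0 and input_dim=1,
    # so the module type, not the weight attributes alone, decides the shard dim.
    if type(module).__name__ == "RowParallelLinear":
        return getattr(weight, "input_dim", -1)   # row-parallel shards along dim 1
    if type(module).__name__ == "ColumnParallelLinear":
        return getattr(weight, "output_dim", -1)  # column-parallel shards along dim 0
    return -1

def is_replicated(shard_dim: int, src_eff_tp: int) -> bool:
    # Fix 2: a TP=1 seed still publishes shard_dim for sharded params, and the
    # target must slice its own range out of the full tensor, so `src_eff_tp <= 1`
    # must not short-circuit the classification (the parameter is kept only to
    # show what the old condition wrongly included).
    return shard_dim == -1
```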

Depends on

Test plan

Llama-3.3-70B-Instruct-FP8

| Direction  | Seed | Target | RDMA Ops | Dim-1 Fixups | Load Time  | Disk Baseline |
|------------|------|--------|----------|--------------|------------|---------------|
| Same TP    | TP=2 | TP=2   | 1763     | 0            | 1.24s      | 13.6s         |
| Scale-up   | TP=2 | TP=4   | 1763     | 160          | 1.42-1.89s | 13.6s         |
| Scale-down | TP=4 | TP=2   | 2725     | 320          | 1.65-1.83s | 8.5s          |
  • Same TP (TP2->TP2) -- 10.95x speedup, identical outputs
  • Scale-up (TP2->TP4) -- coherent output verified
  • Scale-down (TP4->TP2) -- coherent output verified, matches disk-loaded TP2

Qwen3-0.6B (BF16)

  • Scale-down (TP2->TP1) -- output matches disk-loaded TP1 exactly

ishandhanani and others added 2 commits March 5, 2026 02:20
Add MODEL_EXPRESS backend for remote instance weight loading that uses
ModelExpress gRPC server for metadata coordination instead of direct
HTTP between seed and target instances. Supports FP8 and BF16 models
with per-tensor byte-size matching for mixed-dtype transfers.

New CLI args: --model-express-url, --model-express-model-name,
--model-express-source

Replace --model-express-url, --model-express-model-name, --model-express-source
with single --model-express-config JSON arg. Properties provide backwards-compatible
access for all downstream code (model_runner, loader, load_config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@ishandhanani ishandhanani changed the title Add mixed tensor-parallelism for ModelExpress P2P loading [WIP][2/2] Add mixed tensor-parallelism for ModelExpress P2P loading Mar 5, 2026
Dead code from initial MX integration. We switched to raw byte size
comparison instead of dtype string conversion.
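
For illustration, byte-size matching amounts to something like the following check (names and the descriptor field are assumed, not the actual wire format):

```python
import torch

def sizes_match(src_nbytes: int, dst_param: torch.Tensor) -> bool:
    # Dtype names can differ across quantization paths while the underlying
    # buffers are byte-for-byte compatible; raw size is the invariant a raw
    # RDMA copy needs, so compare bytes rather than dtype strings.
    return src_nbytes == dst_param.numel() * dst_param.element_size()
```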
@ishandhanani ishandhanani force-pushed the idhanani/mx-mixed-tp-v1 branch from f42cb6e to f7e5203 on March 6, 2026 04:17
Ishan Dhanani added 3 commits March 6, 2026 04:21
- Remove unused _get_model_dtype_str() method
- Drop lossy element_size_to_dtype reverse mapping from seed publish
  (dtype field was never read on target side)
- Wrap MxClient usage in try/finally to prevent gRPC channel leaks
- Close MxClient before starting RDMA transfers (connection not needed
  during transfer phase)
Extends ModelExpress remote instance loading to support mixed TP
configurations (seed TP != target TP). Adds shard metadata (full_shape,
shard_dim, effective_tp_size, shard_index) to tensor descriptors so
target ranks can compute byte-range overlaps from arbitrary source TP
configurations.

Key changes:
- model_runner.py: publish shard metadata extracted from model modules
  (ColumnParallelLinear, RowParallelLinear, etc.) to ModelExpress
- loader.py: mixed-TP transfer plan with dim-0 overlap algorithm,
  dim-1 GPU-side column slicing (sketched after this message), and
  replicated tensor handling.
  Remove small-tensor disk fallback (mooncake IPC offset bug fixed
  upstream) -- all tensors now transfer via RDMA/NVLink.
- remote_instance_weight_loader_utils.py: v1/v2 registration helpers

Tested: Llama-3.3-70B-Instruct-FP8 TP2->TP2 (same), TP2->TP4 (scale up)
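
A rough sketch of the dim-1 GPU-side column-slicing fixup mentioned in the list above; names and shapes are illustrative, and the real loader presumably slices staged RDMA buffers rather than a standalone tensor:

```python
import torch

def slice_dim1_shard(staged: torch.Tensor, dst_tp: int, dst_rank: int) -> torch.Tensor:
    """Cut this rank's column range out of a wider dim-1-sharded tensor, on the GPU."""
    full_cols = staged.shape[-1]
    cols = full_cols // dst_tp          # assumes even divisibility
    start = dst_rank * cols
    # Dim-1 shards are strided in the source layout, so the bytes arrive
    # row-wise and are narrowed locally; .contiguous() materializes the shard.
    return staged[..., start:start + cols].contiguous()
```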
Two bugs blocking mixed-TP scale-up and scale-down:

1. Seed published wrong shard_dim for RowParallelLinear with FP8 quant.
   FP8 sets output_dim=0 and input_dim=1 on all weights. The old code
   always picked output_dim, making row-parallel layers appear dim-0
   sharded. Fix: check module type first; use input_dim for RowParallel.

2. Target treated all tensors from a TP=1 source as replicated due to
   the condition `shard_dim == -1 or src_eff_tp <= 1`. This copied full
   unsharded tensors into resized params, breaking weight shapes. Fix:
   only check shard_dim == -1 for replicated classification.

Validated: Llama-3.3-70B-FP8 scale-up (TP2->TP4) and scale-down
(TP4->TP2), plus Qwen3-0.6B scale-down (TP2->TP1) matching disk
baseline exactly.
logger.info("ModelExpress: weight transfer complete for tp_rank=%d", tp_rank)


def _compute_transfer_plan(model, source_index, tp_rank, tp_size):
The logic here is pretty complex. How do we verify that this works for different models? Is it worth supporting mixed-TP transfer? Alternatively, we could treat different TPs as different seed instances.

Contributor

I agree. Registering TP workers under different parallel mechanisms as distinct seed instances should be sufficient. I would suggest only allowing client instances to load weights from seed instances that share the same model and parallel mechanism.

# Always use v1 (per-param registration) for correctness.
# v2's block-merging approach has issues with small tensors
# that are sub-allocated within CUDA blocks.
return register_memory_region_v1(model, transfer_engine)
Contributor

I'm concerned that the v1 registration approach might significantly impact boot-up time. Would it be possible to at least give users an option to choose their preferred registration method?

Base automatically changed from ishan/mx to main March 18, 2026 20:38
ishandhanani (Collaborator, Author) commented:
We will discuss this PR in DMs

