[WIP][2/2] Add mixed tensor-parallelism for ModelExpress P2P loading #19983
ishandhanani wants to merge 6 commits into main
Conversation
Add a MODEL_EXPRESS backend for remote instance weight loading that uses the ModelExpress gRPC server for metadata coordination instead of direct HTTP between seed and target instances. Supports FP8 and BF16 models with per-tensor byte-size matching for mixed-dtype transfers. New CLI args: --model-express-url, --model-express-model-name, --model-express-source
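A minimal sketch of the per-tensor byte-size check described above; the function name and parameters are illustrative, not the PR's actual code:

```python
import torch

def byte_size_matches(src_numel: int, src_element_size: int,
                      dst_param: torch.Tensor) -> bool:
    """Match source and target tensors by raw byte size.

    Comparing bytes rather than dtype strings lets FP8 (1 byte/elem) and
    BF16 (2 bytes/elem) weights be paired correctly without a lossy
    dtype-name round trip.
    """
    src_bytes = src_numel * src_element_size
    dst_bytes = dst_param.numel() * dst_param.element_size()
    return src_bytes == dst_bytes
```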
Replace --model-express-url, --model-express-model-name, and --model-express-source with a single --model-express-config JSON arg. Properties provide backwards-compatible access for all downstream code (model_runner, loader, load_config). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
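A rough sketch of what the single-arg approach could look like; only the flag name --model-express-config comes from the PR, while the JSON key names, class layout, and property names below are assumptions:

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical usage:
#   --model-express-config '{"url": "localhost:50051", "model_name": "llama-70b", "source": "seed-0"}'


@dataclass
class ModelExpressConfig:
    url: str | None = None
    model_name: str | None = None
    source: str | None = None

    @classmethod
    def from_json(cls, raw: str | None) -> "ModelExpressConfig":
        return cls(**json.loads(raw)) if raw else cls()


class LoadConfig:
    def __init__(self, model_express_config: str | None = None):
        self._mx = ModelExpressConfig.from_json(model_express_config)

    # Properties keep the old per-field access working for downstream code
    # (model_runner, loader) without touching their call sites.
    @property
    def model_express_url(self) -> str | None:
        return self._mx.url

    @property
    def model_express_model_name(self) -> str | None:
        return self._mx.model_name

    @property
    def model_express_source(self) -> str | None:
        return self._mx.source
```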
This is dead code from the initial MX integration; we switched to a raw byte-size comparison instead of dtype string conversion.
force-pushed from f42cb6e to f7e5203
- Remove unused _get_model_dtype_str() method
- Drop lossy element_size_to_dtype reverse mapping from seed publish (the dtype field was never read on the target side)
- Wrap MxClient usage in try/finally to prevent gRPC channel leaks
- Close MxClient before starting RDMA transfers (the connection is not needed during the transfer phase)
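A sketch of the try/finally pattern described above; the MxClient import path and the method names used here (publish, close) are assumptions about the ModelExpress client API:

```python
# from model_express import MxClient  # import path is an assumption

def publish_metadata(mx_url: str, model_name: str, descriptors) -> None:
    client = MxClient(mx_url)
    try:
        client.publish(model_name, descriptors)
    finally:
        # Close before the RDMA/NVLink phase begins: the gRPC channel is only
        # needed for metadata coordination, and closing in finally also
        # prevents a channel leak if publish raises.
        client.close()
```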
Extends ModelExpress remote instance loading to support mixed TP configurations (seed TP != target TP). Adds shard metadata (full_shape, shard_dim, effective_tp_size, shard_index) to tensor descriptors so target ranks can compute byte-range overlaps from arbitrary source TP configurations.

Key changes:
- model_runner.py: publish shard metadata extracted from model modules (ColumnParallelLinear, RowParallelLinear, etc.) to ModelExpress
- loader.py: mixed-TP transfer plan with a dim-0 overlap algorithm (see the sketch after this list), dim-1 GPU-side column slicing, and replicated tensor handling. Remove the small-tensor disk fallback (the mooncake IPC offset bug was fixed upstream); all tensors now transfer via RDMA/NVLink.
- remote_instance_weight_loader_utils.py: v1/v2 registration helpers

Tested: Llama-3.3-70B-Instruct-FP8 TP2->TP2 (same), TP2->TP4 (scale up)
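A sketch of the dim-0 overlap idea, assuming even dim-0 sharding on both sides; the function and parameter names are illustrative, not the PR's _compute_transfer_plan:

```python
def dim0_overlaps(full_dim0: int, row_bytes: int,
                  src_tp: int, dst_tp: int, dst_rank: int):
    """Yield (src_rank, src_offset_bytes, dst_offset_bytes, length_bytes)
    for the rows a target rank needs from each source shard.

    full_dim0 -- size of dim 0 of the unsharded tensor (from full_shape)
    row_bytes -- bytes per dim-0 row (product of remaining dims * element size)
    """
    src_rows = full_dim0 // src_tp            # rows held by each source rank
    dst_rows = full_dim0 // dst_tp            # rows needed by each target rank
    dst_start, dst_end = dst_rank * dst_rows, (dst_rank + 1) * dst_rows

    for src_rank in range(src_tp):
        src_start, src_end = src_rank * src_rows, (src_rank + 1) * src_rows
        lo, hi = max(dst_start, src_start), min(dst_end, src_end)
        if lo < hi:  # this source shard overlaps the target slice
            yield (src_rank,
                   (lo - src_start) * row_bytes,   # offset into the source shard
                   (lo - dst_start) * row_bytes,   # offset into the target param
                   (hi - lo) * row_bytes)
```

For example, going TP2 -> TP4, target rank 1 pulls the second half of source rank 0's shard; going TP4 -> TP2, target rank 0 pulls all of source ranks 0 and 1.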
Two bugs were blocking mixed-TP scale-up and scale-down:
1. Seed published the wrong shard_dim for RowParallelLinear with FP8 quantization. FP8 sets output_dim=0 and input_dim=1 on all weights. The old code always picked output_dim, making row-parallel layers appear dim-0 sharded. Fix: check the module type first; use input_dim for RowParallel.
2. Target treated all tensors from a TP=1 source as replicated due to the condition `shard_dim == -1 or src_eff_tp <= 1`. This copied full unsharded tensors into resized params, breaking weight shapes. Fix: only check shard_dim == -1 for the replicated classification.

Validated: Llama-3.3-70B-FP8 scale-up (TP2->TP4) and scale-down (TP4->TP2), plus Qwen3-0.6B scale-down (TP2->TP1), matching the disk baseline exactly.
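A sketch of the first fix, checking the module type before trusting the per-weight dim attributes; the import path and attribute handling are assumptions, not the PR's exact code:

```python
# Import path is an assumption about where the parallel linear layers live.
from sglang.srt.layers.linear import ColumnParallelLinear, RowParallelLinear


def select_shard_dim(module, weight) -> int:
    """Return the TP shard dim for a weight, or -1 if it is replicated.

    Under FP8 quantization every weight carries both output_dim=0 and
    input_dim=1, so the module type, not the attributes alone, decides
    which one is the real shard dimension.
    """
    input_dim = getattr(weight, "input_dim", None)
    output_dim = getattr(weight, "output_dim", None)
    if isinstance(module, RowParallelLinear):
        dim = input_dim           # row-parallel shards along the input dim
    elif isinstance(module, ColumnParallelLinear):
        dim = output_dim          # column-parallel shards along the output dim
    else:
        dim = None
    return dim if dim is not None else -1  # -1 marks replicated tensors
```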
force-pushed from f7e5203 to ac7b2e3
    logger.info("ModelExpress: weight transfer complete for tp_rank=%d", tp_rank)

    def _compute_transfer_plan(model, source_index, tp_rank, tp_size):

The logic here is pretty complex. How do we verify this works for different models? Is it worth supporting mixed TP transfer? Alternatively, we can treat different TPs as different seed instances.

I agree. Registering TP workers under different parallel mechanisms as distinct seed instances should be sufficient. I would suggest only allowing client instances to load weights from seed instances that share the same model and parallel mechanism.
    # Always use v1 (per-param registration) for correctness.
    # v2's block-merging approach has issues with small tensors
    # that are sub-allocated within CUDA blocks.
    return register_memory_region_v1(model, transfer_engine)

I'm concerned that the v1 registration approach might significantly impact boot-up time. Would it be possible to at least provide users with an option to choose their preferred registration method?
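One way the suggested option could look; the parameter name and the v2 helper are assumptions (only register_memory_region_v1 and the v2 block-merging behavior are mentioned in the diff):

```python
def register_model_memory(model, transfer_engine, registration_version: str = "v1"):
    # Hypothetical dispatcher: default to the safe per-param path, let users
    # opt into block merging when boot-up time matters more.
    if registration_version == "v2":
        # Faster registration: merges params into larger CUDA blocks, but can
        # mis-handle small tensors sub-allocated within a block.
        return register_memory_region_v2(model, transfer_engine)
    # Default: per-param registration, slower but correct for small tensors.
    return register_memory_region_v1(model, transfer_engine)
```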
We will discuss this PR in DMs.
Summary
Bug fixes in this revision
- Wrong shard_dim for RowParallelLinear with FP8 quantization: FP8 sets output_dim=0 and input_dim=1 on all weights. For RowParallelLinear the actual shard dim is input_dim, not output_dim. Fixed by checking the module type before selecting shard_dim.
- src_eff_tp <= 1 replicated short-circuit: when the source is TP=1, all tensors were treated as replicated. Fixed to only check shard_dim == -1.

Depends on
- IntraNodeNvlinkTransport::registerLocalMemory() (upstream pending)

Test plan
- Llama-3.3-70B-Instruct-FP8: TP2->TP2, TP2->TP4 scale-up, TP4->TP2 scale-down
- Qwen3-0.6B (BF16): TP2->TP1 scale-down, matching the disk baseline