Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure#4509
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/claude review |
f834483 to
bfb4f2c
Compare
|
/claude review |
24d6864 to
eb82980
Compare
eb82980 to
47ff44d
Compare
47ff44d to
8d2876b
Compare
|
Addressed review comment: renamed Also in the latest push: all buffers now use the shard-aligned layout (not just Muon buffers). This ensures no param is split across shard boundaries regardless of optimizer type. The separate |
8d2876b to
c2a0018
Compare
c2a0018 to
8b54011
Compare
8b54011 to
60b1ef7
Compare
Adds a shard-aligned parameter layout for LayerWiseDistributedOptimizer that guarantees no parameter is split across shard boundaries — the invariant optimizers like Muon need so Newton-Schulz iteration can run on full weight matrices. The optimizer publishes this layout to DDP, which means DDP can manage the buffers exactly as it does for DistributedOptimizer: - Gradient reduction: reduce-scatter via use_distributed_optimizer=True, replacing the previous all-reduce-then-pick-your-shard scheme. - Parameter sync: DDP's standard buffer all-gather via start_param_sync(), replacing the legacy flatten / all_gather_v / unflatten path. Existing layerwise optimizer tests were patched to also exercise the new code path. Code that computes param_layouts is separately tested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a use_layer_wise_param_layout=True kwarg to wrap_model_chunks_with_ddp. Layout computation now requires both use_layer_wise_distributed_optimizer and use_layer_wise_param_layout. The get_model production call site passes False so live training runs stay on the legacy LayerWise sync path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the size-matching loop pops a seed with no exact-numel peers (e.g. an embedding), the remaining shard slots are greedily packed with the next unassigned smaller params from the queue (respecting 64-element alignment) instead of being filled with pure padding. For a unique-large seed at the top of the backprop pool, this turns ``(dp_size - 1) * param_numel`` of empty padding into productive bucket content. Also renames the param-layout test file to ``test_layer_wise_param_layout.py`` to match the ``layer_wise_`` convention used by the optimizer module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
31ba8b9 to
0fa0ad6
Compare
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25770105108 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25776760140 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25805685255 |
Summary
Adds a shard-aligned parameter layout for
LayerWiseDistributedOptimizerthat guarantees no parameter is split across shard boundaries — the invariant optimizers like Muon need so Newton-Schulz iteration can run on full weight matrices. The optimizer publishes this layout to DDP, which means DDP can manage the buffers exactly as it does forDistributedOptimizer:use_distributed_optimizer=True, replacing the previous all-reduce-then-pick-your-shard scheme.model_chunk.start_param_sync(), replacing the legacy flatten /all_gather_v/ unflatten path that copied data on both ends.Why
Before this PR,
LayerWiseDistributedOptimizerran outside DDP's buffer infrastructure:use_distributed_optimizer=False, so gradients were all-reduced rather than reduce-scattered. Each rank held a full reduced gradient buffer even though it would only update its own shard.all_gather_vthat flattened each rank's params into a contiguous tensor, all-gathered, then unflattened andcopy_-ed the results back into model params. Two extra full-size copies per sync.Both fell out of "the optimizer doesn't know how the buffer is laid out, so DDP can't be told to slice it." Once the optimizer can publish a layout that guarantees "every param fits inside one DP shard," DDP's existing reduce-scatter and buffer all-gather work directly.
Approach
Shard-aligned size-matching layout
LayerWiseDistributedOptimizer._compute_per_buffer_param_layoutproduces a shard-aligned layout per buffer using a size-matching algorithm: each round claims one param for shard 0 and assigns same-sized params (or padding) to shards 1..dp-1, so all shards grow by exactly the same amount in lockstep. Shard sizes stay equal by construction; for repeated-layer models (N identical transformer blocks) padding overhead is zero. Every param ends up fully contained within a single DP-rank's shard — the invariant Muon's whole-tensor update requires.Shared (tied) embeddings get isolated buckets — they are placed alone in shard 0 of their own bucket, with shards 1..dp-1 padded to the same size. This preserves the no-shard-crossing invariant but pays a
(dp_size - 1) * pad(numel)cost per shared embedding. Eliminating that cost is the goal of a follow-up PR that routes Adam-managed params through a separateDistributedOptimizer.Optimizer state partitioning derived from the layout
_shard_params_from_layoutderives each parameter's owning rank directly from the published layout ((start - bucket_start) // shard_size) instead of the previous independent ping-pong-by-numel assignment. Optimizer and DDP now agree on shard assignments by construction, eliminating the class of bugs where the two could disagree and the optimizer would step on a parameter whose gradient was reduce-scattered to a different rank. A defensive assertion in_shard_params_from_layoutcatches any param straddling a shard boundary, so layout regressions fail loudly rather than silently corrupting the emerging optimizer's update.Step flow
DDP is now configured with
use_distributed_optimizer=Truefor layerwise mode, so model params are views intobucket.param_dataand grads are reduce-scattered into each rank's shard of the gradient buffer during backward. On step:LayerWise.stepruns the Muon update on its local-rank shard of every layerwise buffer; the standard "main → model_param" copy updates the param buffer in place (because model_param is a view into the buffer).model_chunk.start_param_sync(force_sync=True)syncs the buffer across DP ranks via DDP's standard all-gather — no flatten/unflatten copies.DDP wiring centralised in a helper
wrap_model_chunks_with_ddp(new, inmegatron/training/training.py) centralises the DDP-construction logic shared betweenget_modeland unit tests:ddp_config.use_distributed_optimizer = Truewhenuse_layer_wise_distributed_optimizer=True(needed for reduce-scatter).full_param_layoutviaLayerWiseDistributedOptimizer.compute_full_param_layout(layerwise) orDistributedOptimizer.compute_full_param_layout(standard distopt).Legacy paths kept as fallback (for the no-layout case)
The pre-PR sync paths are retained for callers that don't supply a layout yet and are marked for removal once all call sites pass one:
LayerWiseDistributedOptimizer.allgather_params(flatten /all_gather_v/ unflatten path).LayerWiseDistributedOptimizer.set_bucket_layerwise_params_list.LayerWiseDistributedOptimizer._shard_params_ping_pong(the ping-pong-by-numel fallback used when no layout is supplied)._ParamAndGradBucketGroup.start_param_sync.Tests
tests/unit_tests/distributed/test_layerwise_param_layout.py: covers the size-matching layout (uniform / mixed / shared-embedding / dp-divisibility / backprop ordering /dp_size ∈ {1, 2, 4, 8}).test_emerging_optimizersanddist_checkpointing/test_layer_wise_optimizer: updated to construct DDP via the newwrap_model_chunks_with_ddphelper so the layerwise wiring matchestraining.get_model.Test plan
test_layerwise_param_layout.pypasses (shard divisor, size-matching layout, shared-embedding isolation, bucket alignment, backprop ordering, full layout grouping)test_param_layout.pytests still passtest_emerging_optimizerstests pass againstLayerWiseDistributedOptimizerdist_checkpointing/test_layer_wise_optimizertests pass--use-layer-wise-distributed-optimizerand verify convergence matches the existing layerwise baselineddp_config.use_distributed_optimizer=Trueis in effect and grads are reduce-scattered into per-rank shardsdp_size > 1to verify gradient flow + convergence end-to-endFollow-up
Routing Adam-managed parameters through a separate
DistributedOptimizer(sub-tensor sharding) so tied embeddings avoid the(dp_size - 1) * pad(numel)per-rank padding cost. That work is staged on a separate branch and will land as a follow-up PR.