[torchtitan][resubmit] experimenting new replicate integration with torchtitan#2458
Merged
Conversation
fegin
approved these changes
Feb 27, 2026
Contributor
Did a very light review since this is a resubmit PR. Please explicitly mention in the title and summary that this is a resubmit.
The linter failure is real; please fix it.
We should also wait for the integration tests to be green, though our CI has some issues right now.
Also, do we even have CI for the replicate case? It may be worth adding one for llama3; that can be done in another PR.
**Summary**
- Replaces the old apply_ddp (DDP-based replication) with a new apply_replicate function that uses
torch.distributed._composable.replicate with per-module wrapping and MixedPrecisionPolicy support — mirroring the same wrapping pattern
as FSDP (tok_embeddings, transformer blocks, norm+output, full model)
- Updates all model parallelization files (llama3, llama4, qwen3, deepseek_v3, gpt_oss, VLM, transformers_modeling_backend, RL trainer)
to use apply_replicate instead of apply_ddp
- Removes the old "DDP has not supported > 1D parallelism" restrictions, since replicate integrates with the composable parallelism stack
- Disables automatic gradient division for replicate modules (same as FSDP) so gradient scaling is handled by the training loop
- Updates maybe_enable_amp in distributed utils to recognize dp_replicate_enabled for mixed precision handling
**This is a resubmit PR of #1714.**
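As an illustration only (not torchtitan's actual code), the per-module wrapping order described in the summary can be sketched with a stand-in `wrap` function; `apply_replicate_order` and its arguments are hypothetical simplifications of what `torch.distributed._composable.replicate` is applied to:

```python
# Illustrative sketch of the FSDP-style wrapping order described in the
# summary: tok_embeddings first, then each transformer block, then
# norm+output, then the full model. `wrap` is a hypothetical stand-in for
# torch.distributed._composable.replicate; this is not torchtitan's code.

def apply_replicate_order(layers, wrap):
    """Return the wrapped components in the order they are visited."""
    wrapped = [wrap("tok_embeddings")]
    for block in layers:            # each transformer block is wrapped on its own
        wrapped.append(wrap(block))
    wrapped.append(wrap("norm+output"))
    wrapped.append(wrap("model"))   # finally the full model
    return wrapped

order = apply_replicate_order(
    ["block0", "block1"],
    wrap=lambda name: f"replicate({name})",
)
print(order)
```

The point of the sketch is only the visitation order, which per the summary mirrors FSDP's wrapping of the same components.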
saforem2 added a commit to saforem2/torchtitan that referenced this pull request on Mar 10, 2026
Propagate upstream change (pytorch#2458) that replaces DDP-based replication with the new composable `replicate` API from `torch.distributed._composable.replicate_with_fsdp`. The new `apply_replicate` wraps modules per component (tok_embeddings, transformer blocks, norm+output, full model) with MixedPrecisionPolicy support, mirroring the FSDP wrapping pattern. This removes the old "DDP has not supported > 1D parallelism" restriction.

Updated files:
- ezpz/agpt/parallelize.py: replaced inline apply_ddp with apply_replicate
- ezpz/moe/parallelize.py: switched import and call site
- ezpz/qwen3/parallelize.py: switched import and call site
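The upstream change also disables automatic gradient division for replicate modules (same as FSDP), so averaging across data-parallel ranks moves into the training loop. A minimal sketch of that manual scaling, with hypothetical names and plain floats standing in for tensors (not the actual trainer code):

```python
# Hypothetical sketch: when the replicate collective no longer divides
# gradients by the data-parallel world size, the training loop scales
# them itself before the optimizer step.

def scale_gradients(grads, dp_world_size):
    """Divide summed gradients by the data-parallel degree to get the mean."""
    return [g / dp_world_size for g in grads]

# e.g. gradients summed across 4 replicas
summed = [8.0, 4.0, 2.0]
mean = scale_gradients(summed, dp_world_size=4)
print(mean)
```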