
[torchtitan][resubmit] experimenting new replicate integration with torchtitan #2458

Merged
anshul-si merged 16 commits into main from gh/anshul-si/3/head
Mar 9, 2026
Conversation


@anshul-si anshul-si commented Feb 27, 2026

Summary

  • Replaces the old apply_ddp (DDP-based replication) with a new apply_replicate function that uses
    torch.distributed._composable.replicate with per-module wrapping and MixedPrecisionPolicy support, mirroring the same wrapping pattern
    as FSDP (tok_embeddings, transformer blocks, norm+output, full model)
  • Updates all model parallelization files (llama3, llama4, qwen3, deepseek_v3, gpt_oss, VLM, transformers_modeling_backend, RL trainer)
    to use apply_replicate instead of apply_ddp
  • Removes the old "DDP has not supported > 1D parallelism" restrictions, since replicate integrates with the composable parallelism stack
  • Disables automatic gradient division for replicate modules (same as FSDP) so gradient scaling is handled by the training loop
  • Updates maybe_enable_amp in distributed utils to recognize dp_replicate_enabled for mixed precision handling

This is a resubmit of PR #1714.
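The per-module wrapping order described in the summary can be sketched as follows. This is a pure-Python illustration of the pattern only: `replicate_fn` stands in for `torch.distributed._composable.replicate`, and the `Block`/`Model` classes and the call signature are assumptions for illustration, not torchtitan's real API.

```python
# Sketch of the FSDP-style wrapping order: embeddings first, then each
# transformer block, then norm + output, and finally the root module.
# `replicate_fn` is a stand-in for torch.distributed._composable.replicate.

class Block:
    def __init__(self, name):
        self.name = name

class Model:
    def __init__(self, n_layers):
        self.tok_embeddings = Block("tok_embeddings")
        self.layers = [Block(f"layers.{i}") for i in range(n_layers)]
        self.norm_output = Block("norm_output")
        self.name = "model"

def apply_replicate(model, replicate_fn):
    # Wrap leaf components before the root so each gets its own
    # communication group, mirroring how FSDP wraps the same submodules.
    replicate_fn(model.tok_embeddings)
    for block in model.layers:
        replicate_fn(block)
    replicate_fn(model.norm_output)
    replicate_fn(model)  # root wrap last

wrapped = []
apply_replicate(Model(n_layers=2), lambda m: wrapped.append(m.name))
print(wrapped)
# -> ['tok_embeddings', 'layers.0', 'layers.1', 'norm_output', 'model']
```

The root wrap coming last matches the summary's "tok_embeddings, transformer blocks, norm+output, full model" ordering.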

Stack from ghstack (oldest at bottom):

anshul-si added a commit that referenced this pull request Feb 27, 2026
…torchtitan

ghstack-source-id: 575bd54
Pull Request resolved: #2458
meta-cla bot added the CLA Signed label Feb 27, 2026

@fegin fegin left a comment


Did a very light review since this is a resubmit PR. Please explicitly mention that this is a resubmit PR in the summary and title.

The linter failure is real; please fix it.

We should also wait for the integration tests to be green, though our CI has some issues right now.

Also, do we even have CI for the replicate case? It might be worth adding one for llama3. Can be done in another PR.

@anshul-si anshul-si changed the title [torchtitan][replicate] experimenting new replicate integration with torchtitan [torchtitan][resubmit] experimenting new replicate integration with torchtitan Feb 27, 2026
anshul-si added a commit that referenced this pull request Mar 2, 2026
…torchtitan

ghstack-source-id: 7358605
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 2, 2026
…torchtitan

ghstack-source-id: 575bd54
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 5, 2026
…torchtitan

ghstack-source-id: 77d6de7
Pull Request resolved: #2458
@anshul-si anshul-si mentioned this pull request Mar 5, 2026
anshul-si added a commit that referenced this pull request Mar 5, 2026
…torchtitan

ghstack-source-id: c5f2579
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 5, 2026
…torchtitan

ghstack-source-id: 8299aef
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 6, 2026
…torchtitan

ghstack-source-id: ae10106
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 9, 2026
…torchtitan

ghstack-source-id: 41e80b5
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 9, 2026
…torchtitan

ghstack-source-id: cdf5caf
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 9, 2026
…torchtitan

ghstack-source-id: 8ab7331
Pull Request resolved: #2458
@anshul-si anshul-si changed the base branch from gh/anshul-si/3/base to main March 9, 2026 23:07
@anshul-si anshul-si merged commit e1847f4 into main Mar 9, 2026
20 of 29 checks passed
saforem2 added a commit to saforem2/torchtitan that referenced this pull request Mar 10, 2026
Propagate upstream change (pytorch#2458) that replaces
DDP-based replication with the new composable `replicate` API from
`torch.distributed._composable.replicate_with_fsdp`.

The new `apply_replicate` wraps modules per-component (tok_embeddings,
transformer blocks, norm+output, full model) with MixedPrecisionPolicy
support, mirroring the FSDP wrapping pattern. This removes the old
"DDP has not supported > 1D parallelism" restriction.

Updated files:
- ezpz/agpt/parallelize.py: replaced inline apply_ddp with apply_replicate
- ezpz/moe/parallelize.py: switched import and call site
- ezpz/qwen3/parallelize.py: switched import and call site
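The MixedPrecisionPolicy plumbing mentioned above can be sketched as follows. The dataclass here is a stand-in for `torch.distributed.fsdp.MixedPrecisionPolicy`, and the field names and call shape are assumptions for illustration, not torchtitan's exact API.

```python
from dataclasses import dataclass

# Stand-in for torch.distributed.fsdp.MixedPrecisionPolicy; the fields
# below are illustrative assumptions, not the library's exact definition.
@dataclass
class MixedPrecisionPolicy:
    param_dtype: str = "bfloat16"   # dtype parameters are cast to for compute
    reduce_dtype: str = "float32"   # dtype used when reducing gradients

def replicate_with_policy(module, replicate_fn, policy):
    # Thread the policy through the replicate wrapper, mirroring how the
    # PR summary says apply_replicate passes mixed-precision settings the
    # same way the FSDP path does.
    return replicate_fn(module, mp_policy=policy)

calls = []
replicate_with_policy(
    "blockA",
    lambda m, mp_policy: calls.append((m, mp_policy.reduce_dtype)),
    MixedPrecisionPolicy(),
)
print(calls)
# -> [('blockA', 'float32')]
```

Reducing in float32 while computing in bfloat16 is the common default for this kind of policy; the real values depend on the training config.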

Labels

ciflow/8gpu, CLA Signed
