
[torchtitan][resubmit] experimenting new replicate integration with torchtitan #2458

Merged
anshul-si merged 16 commits into main from gh/anshul-si/3/head
Mar 9, 2026
Conversation


@anshul-si anshul-si commented Feb 27, 2026

Summary

  • Replaces the old apply_ddp (DDP-based replication) with a new apply_replicate function that uses
    torch.distributed._composable.replicate with per-module wrapping and MixedPrecisionPolicy support, mirroring the same wrapping pattern
    as FSDP (tok_embeddings, transformer blocks, norm+output, full model)
  • Updates all model parallelization files (llama3, llama4, qwen3, deepseek_v3, gpt_oss, VLM, transformers_modeling_backend, RL trainer)
    to use apply_replicate instead of apply_ddp
  • Removes the old "DDP has not supported > 1D parallelism" restrictions, since replicate integrates with the composable parallelism stack
  • Disables automatic gradient division for replicate modules (same as FSDP) so gradient scaling is handled by the training loop
  • Updates maybe_enable_amp in distributed utils to recognize dp_replicate_enabled for mixed precision handling

This is a resubmit of PR #1714.
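The per-module wrapping order described in the summary can be sketched as follows. This is a pure-Python illustration of the pattern only: `replicate_fn` stands in for `torch.distributed._composable.replicate`, and the `Block`/`Model` classes and the call signature are assumptions for illustration, not torchtitan's real API.

```python
# Sketch of the FSDP-style wrapping order: embeddings first, then each
# transformer block, then norm + output, and finally the root module.
# `replicate_fn` is a stand-in for torch.distributed._composable.replicate.

class Block:
    def __init__(self, name):
        self.name = name

class Model:
    def __init__(self, n_layers):
        self.tok_embeddings = Block("tok_embeddings")
        self.layers = [Block(f"layers.{i}") for i in range(n_layers)]
        self.norm_output = Block("norm_output")
        self.name = "model"

def apply_replicate(model, replicate_fn):
    # Wrap leaf components before the root so each gets its own
    # communication group, mirroring how FSDP wraps the same submodules.
    replicate_fn(model.tok_embeddings)
    for block in model.layers:
        replicate_fn(block)
    replicate_fn(model.norm_output)
    replicate_fn(model)  # root wrap last

wrapped = []
apply_replicate(Model(n_layers=2), lambda m: wrapped.append(m.name))
print(wrapped)
# -> ['tok_embeddings', 'layers.0', 'layers.1', 'norm_output', 'model']
```

The root wrap coming last matches the summary's "tok_embeddings, transformer blocks, norm+output, full model" ordering.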

Stack from ghstack (oldest at bottom):

anshul-si added a commit that referenced this pull request Feb 27, 2026
…torchtitan

ghstack-source-id: 575bd54
Pull Request resolved: #2458
meta-cla bot added the CLA Signed label Feb 27, 2026

@fegin fegin left a comment


Did a very light review since this is a resubmit PR. Please explicitly mention that this is a resubmit PR in the summary and title.

The linter failure is real; please fix it.

We should also wait for the integration tests to be green, though our CI has some issues right now.

Also, do we even have CI for the replicate case? It might be worth adding one for llama3. Can be done in another PR.

@anshul-si anshul-si changed the title [torchtitan][replicate] experimenting new replicate integration with torchtitan [torchtitan][resubmit] experimenting new replicate integration with torchtitan Feb 27, 2026
anshul-si added a commit that referenced this pull request Mar 2, 2026
…torchtitan

ghstack-source-id: 7358605
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 2, 2026
…torchtitan

ghstack-source-id: 575bd54
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 5, 2026
…torchtitan

ghstack-source-id: 77d6de7
Pull Request resolved: #2458
@anshul-si anshul-si mentioned this pull request Mar 5, 2026
anshul-si added a commit that referenced this pull request Mar 5, 2026
…torchtitan

ghstack-source-id: c5f2579
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 5, 2026
…torchtitan

ghstack-source-id: 8299aef
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 6, 2026
…torchtitan

ghstack-source-id: ae10106
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 9, 2026
…torchtitan

ghstack-source-id: 41e80b5
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 9, 2026
…torchtitan

ghstack-source-id: cdf5caf
Pull Request resolved: #2458
anshul-si added a commit that referenced this pull request Mar 9, 2026
…torchtitan

ghstack-source-id: 8ab7331
Pull Request resolved: #2458
@anshul-si anshul-si changed the base branch from gh/anshul-si/3/base to main March 9, 2026 23:07
@anshul-si anshul-si merged commit e1847f4 into main Mar 9, 2026
20 of 29 checks passed
saforem2 added a commit to saforem2/torchtitan that referenced this pull request Mar 10, 2026
Propagate upstream change (pytorch#2458) that replaces
DDP-based replication with the new composable `replicate` API from
`torch.distributed._composable.replicate_with_fsdp`.

The new `apply_replicate` wraps modules per-component (tok_embeddings,
transformer blocks, norm+output, full model) with MixedPrecisionPolicy
support, mirroring the FSDP wrapping pattern. This removes the old
"DDP has not supported > 1D parallelism" restriction.

Updated files:
- ezpz/agpt/parallelize.py: replaced inline apply_ddp with apply_replicate
- ezpz/moe/parallelize.py: switched import and call site
- ezpz/qwen3/parallelize.py: switched import and call site
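The MixedPrecisionPolicy plumbing mentioned above can be sketched as follows. The dataclass here is a stand-in for `torch.distributed.fsdp.MixedPrecisionPolicy`, and the field names and call shape are assumptions for illustration, not torchtitan's exact API.

```python
from dataclasses import dataclass

# Stand-in for torch.distributed.fsdp.MixedPrecisionPolicy; the fields
# below are illustrative assumptions, not the library's exact definition.
@dataclass
class MixedPrecisionPolicy:
    param_dtype: str = "bfloat16"   # dtype parameters are cast to for compute
    reduce_dtype: str = "float32"   # dtype used when reducing gradients

def replicate_with_policy(module, replicate_fn, policy):
    # Thread the policy through the replicate wrapper, mirroring how the
    # PR summary says apply_replicate passes mixed-precision settings the
    # same way the FSDP path does.
    return replicate_fn(module, mp_policy=policy)

calls = []
replicate_with_policy(
    "blockA",
    lambda m, mp_policy: calls.append((m, mp_policy.reduce_dtype)),
    MixedPrecisionPolicy(),
)
print(calls)
# -> [('blockA', 'float32')]
```

Reducing in float32 while computing in bfloat16 is the common default for this kind of policy; the real values depend on the training config.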

Labels

ciflow/8gpu, CLA Signed
