[Trainer] accelerate context parallel support in Trainer #40205
Conversation
src/transformers/trainer.py
Outdated
self.is_tp_enabled = getattr(self.accelerator.state, "torch_tp_plugin", None) is not None
self.is_cp_enabled = (
    getattr(self.accelerator.state, "parallelism_config", None) is not None
    and getattr(self.accelerator.state.parallelism_config, "cp_size", 1) > 1
Should we only rely on parallelism_config to configure CP?
src/transformers/trainer.py
Outdated
self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None
self.is_fsdp_enabled = getattr(self.accelerator.state, "fsdp_plugin", None) is not None
self.is_tp_enabled = getattr(self.accelerator.state, "torch_tp_plugin", None) is not None
self.is_cp_enabled = (
It would be great to just use self.parallelism_config = getattr(self.accelerator, "parallelism_config", None), and also to have a ref for parallelism_config in TrainerState.
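For illustration, a minimal sketch of that suggestion (attribute names follow the diffs in this thread; this is not the merged code):

```python
# Sketch: store the accelerate parallelism config once on the Trainer and
# derive the CP flag from it, instead of re-reading accelerator.state each time.
self.parallelism_config = getattr(self.accelerator, "parallelism_config", None)
self.is_cp_enabled = (
    self.parallelism_config is not None and getattr(self.parallelism_config, "cp_size", 1) > 1
)
```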
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
src/transformers/training_args.py
Outdated
if not self.fsdp:
    from accelerate.utils import FullyShardedDataParallelPlugin

    self.fsdp_plugin = FullyShardedDataParallelPlugin(
        fsdp_version=2,
        auto_wrap_policy="transformer_based_wrap",
        state_dict_type="FULL_STATE_DICT",
    )
else:
    # Ensure FSDP v2 is used when context parallelism is enabled
    if self.fsdp_config.get("version", 1) != 2:
        logger.warning("Context parallelism requires FSDP v2. Updating FSDP config to use version 2.")
        self.fsdp_config["version"] = 2
Shouldn't it warn the user when it enables FSDP without explicit user configuration?
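As a rough sketch of the suggested behavior (reusing the plugin arguments from the diff above; the warning wording is illustrative, not the merged code):

```python
# Warn when FSDP v2 is turned on implicitly because context parallelism was
# requested without any user-provided FSDP configuration.
if not self.fsdp:
    logger.warning(
        "Context parallelism requires FSDP v2, but no FSDP configuration was provided. "
        "Enabling FSDP v2 with default settings."
    )
    from accelerate.utils import FullyShardedDataParallelPlugin

    self.fsdp_plugin = FullyShardedDataParallelPlugin(
        fsdp_version=2,
        auto_wrap_policy="transformer_based_wrap",
        state_dict_type="FULL_STATE_DICT",
    )
```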
src/transformers/trainer.py
Outdated
    and num_items_in_batch is not None
):
    loss *= self.accelerator.num_processes
# if (
TODO: Need to understand why we need this realistically.
@SunMarc I have fixed the issues you raised.
src/transformers/trainer.py
Outdated
logger.info(f"Saving model checkpoint to {output_dir}")

# Defer to accelerate's get_state_dict when using distributed setups that require special state dict handling
if state_dict is None and (self.is_fsdp2 or self.is_deepspeed_enabled):
We don't need this at all. save_pretrained works with torch parallelism just fine. I suppose we do want to keep this for non-transformers models only?
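A possible shape for keeping the branch only for custom models (a sketch under the assumption that is_fsdp2 and get_state_dict behave as in the diff above; not the merged code):

```python
from transformers import PreTrainedModel

# Only fall back to accelerate's state-dict gathering for models that
# save_pretrained cannot already handle, i.e. custom non-PreTrainedModel models.
unwrapped = self.accelerator.unwrap_model(self.model)
if state_dict is None and not isinstance(unwrapped, PreTrainedModel) and (self.is_fsdp2 or self.is_deepspeed_enabled):
    state_dict = self.accelerator.get_state_dict(self.model)
```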
Failing tests seem unrelated.
SunMarc left a comment
Thanks! A few nits but overall LGTM.
src/transformers/trainer.py
Outdated
if state_dict is None and (getattr(self.accelerator, "is_fsdp2", False) or self.is_deepspeed_enabled):
    state_dict = self.accelerator.get_state_dict(self.model)
Is there an issue with how things are currently handled? Just to better understand.
I think it would just silently fail at this point, but it's with custom models, which is a rather rare use-case.
if (
    getattr(self.accelerator, "parallelism_config") is not None
    and self.accelerator.parallelism_config.cp_enabled
):
We still potentially need to fix that, but otherwise we can do it in a follow-up.
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
On it, @SunMarc.
Am I missing something or was this feature merged w/o adding any tests? I'm working on an HF Trainer integration PR for ALST/UlyssesSP via huggingface/accelerate#3817 and I was hoping to have some existing CP tests I could extend/copy, but I can't find any. How will you know if this feature breaks if you have no tests? The Accelerate side doesn't test most of this feature either. I'm puzzled.
CI test PR #41860

What does this PR do?
Add support for context parallelism in the Trainer.
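For context, a minimal usage sketch. The ParallelismConfig import path and the Accelerator(parallelism_config=...) keyword are assumptions about the accelerate API, not confirmed by this thread; the Trainer-side check mirrors the first diff above.

```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig  # assumed import path

# Configure context parallelism on the accelerate side: shard the sequence
# dimension of attention across 2 ranks.
pc = ParallelismConfig(cp_size=2)
accelerator = Accelerator(parallelism_config=pc)  # assumed keyword argument

# Trainer-side detection then reduces to the check shown at the top of this thread.
is_cp_enabled = (
    getattr(accelerator.state, "parallelism_config", None) is not None
    and getattr(accelerator.state.parallelism_config, "cp_size", 1) > 1
)
```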