Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP by AmineDiro · Pull Request #45994 · huggingface/transformers

AmineDiro · 2026-05-15T10:07:41Z

What does this PR do?

With DP + TP, or DP + EP as set by the FSDP+EP branch in `_build_accelerator_args`, the same batch is replicated across `tp_size` ranks. The model's per-rank loss is already `per_rank_token_sum / global_num_items_in_batch`, so multiplying it by the full `num_processes` over-counts by a factor of `tp_size`.
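The over-counting described above can be illustrated with a toy calculation. This is a minimal sketch, not the Trainer code: the helper names `per_rank_loss` and `scale_loss` are hypothetical, every token is given the same loss for simplicity, and the corrected divisor `num_processes // tp_size` is an assumption inferred from the statement that `num_processes` over-counts by `tp_size`.

```python
def per_rank_loss(token_losses, dp_rank):
    """Loss seen on one rank: its DP group's token-loss sum over the global token count."""
    global_num_items_in_batch = sum(len(group) for group in token_losses)
    return sum(token_losses[dp_rank]) / global_num_items_in_batch


def scale_loss(loss, num_processes, tp_size, fixed):
    # Assumption: the corrected scale counts only ranks that see distinct data,
    # i.e. num_processes // tp_size. The pre-fix scale is the full num_processes.
    loss_scale = num_processes // tp_size if fixed else num_processes
    return loss * loss_scale


dp_size, tp_size = 2, 4
num_processes = dp_size * tp_size

# One list of per-token losses per DP group; TP/EP ranks in a group share it.
token_losses = [[11.97] * 4 for _ in range(dp_size)]

loss = per_rank_loss(token_losses, dp_rank=0)
pre_fix = scale_loss(loss, num_processes, tp_size, fixed=False)
post_fix = scale_loss(loss, num_processes, tp_size, fixed=True)
print(f"pre-fix: {pre_fix:.2f}, post-fix: {post_fix:.2f}")
```

With `dp_size = 2` and `tp_size = 4`, the unscaled per-rank loss is half the true per-token loss, so scaling by 8 inflates it fourfold while scaling by 2 recovers it. This is the same pattern as Row B in the test table.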

Test

Model: Random-init Qwen3-MoE (4L, 8E, Hidden=256)
Hardware: 1 node × 8 H100
Hyperparameters: Context=2k, LR=0, Seed=42
Expected Loss: $\log(151936) \approx 11.93$

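| Row | Backend | DP × EP | Pre-fix | Post-fix | Job |
| --- | --- | --- | --- | --- | --- |
| A | fsdp2 | 8 × 1 | 11.97 | 11.97 | 22153595 |
| B | fsdp2 | 2 × 4 | 47.88 | 11.97 | 22153596 |
| C | DS-Z3 | 8 × 1 | 11.97 | 11.97 | 22152580 |
| D | DS-Z2 | 8 × 1 | 11.97 | 11.97 | 22152581 |
| E | DS-Z2 | 1 × 8 | 11.97 | 11.97 | 22152578 |
| F | DS-Z2 | 2 × 4 | 11.97 | 11.97 | 22153597 |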
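The numbers in the table can be sanity-checked with a few lines of arithmetic. This is a small sketch under two assumptions: that 151936 is the vocabulary size, so the expected loss is the cross-entropy of a uniform prediction, and that the variable names below are purely illustrative.

```python
import math

# Assumption: 151936 is the vocabulary size, so a random-init model that
# predicts roughly uniformly should have a loss near ln(vocab_size).
vocab_size = 151936
expected_loss = math.log(vocab_size)

# Row B (fsdp2, DP x EP = 2 x 4): values copied from the table.
pre_fix, post_fix = 47.88, 11.97
inflation = pre_fix / post_fix

print(f"expected loss: {expected_loss:.2f}, inflation: {inflation:.2f}")
```

The inflation factor for Row B equals its EP size of 4, which is consistent with the loss being over-counted by `tp_size` in the fsdp2 DP + EP configuration only.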

Code Agent Policy

I confirm that this is not a pure code agent PR.

Who can review?

@3outeille @ArthurZucker

HuggingFaceDocBuilderDev · 2026-05-15T10:21:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 · 2026-05-15T14:35:05Z

cc @SunMarc maybe, not sure if it's code agent slop!

vasqu · 2026-05-15T17:38:54Z

Definitely not code agent slop! There is currently a lot of work going on re FSDP+TP+EP

AmineDiro · 2026-05-20T15:40:38Z

> cc @SunMarc maybe, not sure if it's code agent slop!
@Rocketknight1 haha maybe the PR desc made it look like AI slop, we can't have nice PR desc anymore 🤣

This is a real issue for the reported loss at least: it gets inflated by a factor of `tp_size`.

vasqu · 2026-05-25T15:30:09Z

I think we can merge @AmineDiro? Just please sync first and sanity check before 🫡

AmineDiro · 2026-05-26T08:08:42Z

@vasqu Updated and ready!

vasqu · 2026-05-26T13:55:28Z

Neat, merging!

vasqu · 2026-05-26T14:15:19Z

Force merged, because gh actions are not working properly. Multiple runs already showed that those were only flaky tests (if any failed)

ArthurZucker

ty, seems like a test is welcome no?

ArthurZucker · 2026-05-27T13:34:46Z

+            # TP and EP-as-TP ranks see replicated batches; `num_processes` over-counts
+            # them by `tp_size`. Mirror the divisor used in `_get_num_items_in_batch`.
+            loss_scale = self.accelerator.num_processes
+            if (pc := getattr(self.accelerator, "parallelism_config", None)) is not None:


can you give a more meaningful name than PC please

…uggingface#45994)

Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP

cb6b2b8

AmineDiro mentioned this pull request May 15, 2026

🛣️ Path to 30B MoE long-context SFT training huggingface/trl#5713

Open

AmineDiro mentioned this pull request May 20, 2026

EP + Trainer integration on top of DistributedConfig (#45028) #46126

Open

3outeille approved these changes May 25, 2026

View reviewed changes

Merge branch 'main' into fix-fsdp-ep-loss-scale

f8e657d

vasqu added this pull request to the merge queue May 26, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026

vasqu added this pull request to the merge queue May 26, 2026

github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks May 26, 2026

vasqu merged commit b131ec1 into huggingface:main May 26, 2026
48 of 95 checks passed

ArthurZucker reviewed May 27, 2026

View reviewed changes

yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request May 28, 2026

Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP (h…

e7100ac

…uggingface#45994)

kashif pushed a commit to kashif/transformers that referenced this pull request Jun 1, 2026

Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP (h…

1dbf65c

…uggingface#45994)

khushali9 pushed a commit to khushali9/transformers that referenced this pull request Jun 8, 2026

Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP (h…

50aa928

…uggingface#45994)

Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP #45994
vasqu merged 2 commits into huggingface:main from AmineDiro:fix-fsdp-ep-loss-scale

7 participants
