Fix EP token group padding issue by danielvegamyhre · Pull Request #1718 · pytorch/torchtitan

danielvegamyhre · 2025-09-17T23:42:10Z

Fixes #1651

Summary

Round up max_len of the permuted token indices in the expert parallel decorator to a multiple of the token group alignment size.

Test plan

Llama4 debug model with FSDP=2, EP=2:

`NGPU=2 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --parallelism.data_parallel_shard_degree=2 --parallelism.expert_parallel_degree=2 --compile.enable`

danielvegamyhre · 2025-09-17T23:47:41Z

cc @tianyu-l for review

thanks @vkuzo for pointing this out!

vkuzo · 2025-09-17T23:53:25Z


+        # Make sure max_len of permuted token indicies is divisible by TOKEN_GROUP_ALIGN_SIZE_M,
+        # by padding it to the nearest multiple of TOKEN_GROUP_ALIGN_SIZE_M.
+        ceil_div = lambda x, y: (x + y - 1) // y


nit: define a regular function instead of using a lambda

Added round up util for this
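For readers following the nit, a minimal sketch of what replacing the lambda with regular functions could look like. The names `ceil_div` and `round_up_to_multiple` are illustrative assumptions; the helper actually added in this PR is `_round_up`, and its body is not shown in this conversation.

```python
def ceil_div(x: int, y: int) -> int:
    """Integer division that rounds up instead of down (assumes y > 0)."""
    return (x + y - 1) // y


def round_up_to_multiple(x: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= x.

    Hypothetical stand-in for the PR's `_round_up` helper.
    """
    return ceil_div(x, multiple) * multiple
```

A named function gets a docstring, type hints, and a readable name in tracebacks, which is the usual argument against binding a lambda to a variable.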

tianyu-l · 2025-09-18T01:05:05Z

    return total_norm
+
+
+def _round_up(x: int, y: int) -> int:


probably should put in torchtitan/tools/utils.py instead of torchtitan/distributed/utils

tianyu-l · 2025-09-18T01:05:41Z

+        # Make sure max_len of permuted token indicies is divisible by TOKEN_GROUP_ALIGN_SIZE_M,
+        # by padding it to the nearest multiple of TOKEN_GROUP_ALIGN_SIZE_M.
+        x_padded_per_expert = (
+            x.shape[0] + experts_per_ep_rank * TOKEN_GROUP_ALIGN_SIZE_M


Oh, so the previous issue was caused by x.shape[0] not being divisible by TOKEN_GROUP_ALIGN_SIZE_M?

Yeah that's my understanding.

The experts_per_ep_rank * TOKEN_GROUP_ALIGN_SIZE_M term is upper-bound padding: token group sizes vary, so each group needs a variable amount of padding, but never more than TOKEN_GROUP_ALIGN_SIZE_M per group, so we add that much for every group. However, it doesn't account for the original total M (x.shape[0]) potentially not being divisible by the alignment size.
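The arithmetic behind this explanation can be sketched with small numbers. Everything below is an illustrative assumption rather than torchtitan code: the alignment value of 16, the group sizes, and the helper names are made up, and empty groups are assumed to occupy one aligned block.

```python
ALIGN = 16  # assumed stand-in for TOKEN_GROUP_ALIGN_SIZE_M


def align_up(n: int, align: int) -> int:
    """Smallest multiple of `align` that is >= n."""
    return ((n + align - 1) // align) * align


# Hypothetical token counts routed to each local expert.
group_sizes = [30, 0, 45, 25]
num_experts = len(group_sizes)
total_tokens = sum(group_sizes)  # plays the role of x.shape[0]

# Per-group padding: each group grows by at most ALIGN tokens.
padded_sizes = [max(ALIGN, align_up(g, ALIGN)) for g in group_sizes]

# Old bound: large enough to hold every padded group, but not itself aligned
# when total_tokens is not a multiple of ALIGN.
old_max_len = total_tokens + num_experts * ALIGN

# Fixed bound: the same upper bound, rounded up to a multiple of ALIGN.
new_max_len = align_up(old_max_len, ALIGN)

print(padded_sizes, sum(padded_sizes), old_max_len, new_max_len)
```

In this example the padded groups fit inside the old bound, so capacity was never the problem; the old bound was simply not a multiple of the alignment size, and rounding it up restores that property.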

tianyu-l

LGTM, I think it can be merged without changing the attention backends

Depends on previous PR in stack: #1717

danielvegamyhre · 2025-09-18T14:31:19Z

LGTM, I think it can be merged without changing the attention backends

Depends on previous PR in stack: #1717

Confirmed the cuDNN issue is resolved in today's nightly CUDA 12.8 build. Reverted that change.


danielvegamyhre requested review from fegin, tianyu-l, wconstab and wwwjn as code owners September 17, 2025 23:42

temporarily removed cudnn attention backend

88c6894

danielvegamyhre force-pushed the group-pad branch from 7bbcb2d to e64d344 Compare September 17, 2025 23:46

vkuzo reviewed Sep 17, 2025


fix EP token group padding bug

bd14959

danielvegamyhre force-pushed the group-pad branch from e64d344 to bd14959 Compare September 18, 2025 00:42

meta-cla Bot added the CLA Signed label Sep 18, 2025

tianyu-l reviewed Sep 18, 2025


add back cudnn backend, issue resolved

cda0f74

danielvegamyhre force-pushed the group-pad branch from a066ca7 to cda0f74 Compare September 18, 2025 14:30

tianyu-l approved these changes Sep 18, 2025


tianyu-l merged commit 60645bc into pytorch:main Sep 18, 2025
8 checks passed

tianyu-l merged 3 commits into pytorch:main from danielvegamyhre:group-pad
