Enable per-layer compile with or without MoE#2741
Conversation
|
@laithsakka @pianpwk need to follow up here |
for the A100 case, I'm pretty sure #2399 should solve this (an unbacked symbol is allocated for slice output sizes at dynamo time, but disappears at inductor time due to the size being computable), but we decided against that to remove the padding entirely. It should be doable by extending #2620 to the A100 forloop path. |
mmm the decision to remove padding was not related to this problem though right? because this is fixable |
c25b301 to
20fe4dd
Compare
| num_tokens_per_expert: torch.Tensor, | ||
| ) -> torch.Tensor: | ||
| # dynamic number of tokens in expert parallel | ||
| torch._dynamo.mark_dynamic(x, 0) |
There was a problem hiding this comment.
If we don't need to mark_dynamic, we can remove everything in this function except for the per-layer compile call.
There was a problem hiding this comment.
good call!
per-layer compiler was working util rebasing on top of recent moe refactor
i might need to do some change over that moe refactor. will publish shortly
There was a problem hiding this comment.
rebased. finally ready for review
20fe4dd to
5874b3f
Compare
|
|
||
|
|
||
| class TestApplyCompile(unittest.TestCase): | ||
| def test_patched_once(self): |
There was a problem hiding this comment.
no more monkeypatch on _run_experts_grouped_mm
apply_compile now just sets dynamo config and calls in-place .compile(), both inherently idempotent
| for layer_id, transformer_block in model.layers.named_children(): | ||
| transformer_block.compile(backend=compile_config.backend, fullgraph=True) | ||
| # pyrefly: ignore [missing-attribute] | ||
| model.layers.register_module(layer_id, transformer_block) |
There was a problem hiding this comment.
removed register_module because of using in-place .compile
|
"8 GPU Model Tests / build-test (cuda, linux.g5.48xlarge.nvidia.gpu," should be gone once pytorch nightly has pytorch/pytorch#166460 |
|
VLM 8 GPU Integration Tests / build-test (cuda, linux.g5.48xlarge.nvidia.gpu CI error should be cauesd by #2780 |
5874b3f to
b409060
Compare
| # NOTE: this would incur a synchronization between device and host | ||
| num_tokens_per_expert_list = num_tokens_per_expert.tolist() | ||
|
|
||
| total_tokens = sum(num_tokens_per_expert_list) |
There was a problem hiding this comment.
I thought this was a workaround needed when we apply padding, which was removed in #2774. Do we still need this?
There was a problem hiding this comment.
the problematic line is num_tokens_per_expert_list = num_tokens_per_expert.tolist(). this exists before/after moe refactoring (or padding removal)
cc @pianpwk if you have a better answer
from claude
def _run_experts_for_loop(w1, w2, w3, x, num_tokens_per_expert):
# .tolist() crosses device→host, each value becomes an unbacked symint (u0, u1, ...)
num_tokens_per_expert_list = num_tokens_per_expert.tolist()
# sum of unbacked symints → another unbacked symint (u_total)
total_tokens = sum(num_tokens_per_expert_list)
# The compiler sees: x[:u_total]
# It needs to prove: 0 <= u_total <= x.shape[0]
# But u_total is unbacked — no concrete value or range info
# → assertion failure without the guards below
torch._check(total_tokens >= 0)
torch._check(total_tokens <= x.shape[0])
x_splits = torch.split(x[:total_tokens], ...)
The torch._check calls inject symbolic constraints so the compiler can prove the slice is valid.
There was a problem hiding this comment.
@weifengpy what error are you hitting? you likely have to specify capture_scalar_outputs=True or a similar flag as @xmfan mentioned
There was a problem hiding this comment.
And yeah, I thought #2774 should allow you to delete this variable, and not need the x[:total_tokens] slice, since that was meant to remove padding
There was a problem hiding this comment.
finally make it work with torch._check. there is a gap on how inductor handle torch._check( == ), I have to workaround with torch._check(<=) and torch._check(>=)
will file the inductor issue in pytorch
There was a problem hiding this comment.
torch.split decompose into split_with_sizes, and trigger torch._check_with( ==
if we do the fix at pytorch-side: pytorch/pytorch#179311 we won't need torch._check at titan side
There was a problem hiding this comment.
that sounds better, as it solves other use cases as well
There was a problem hiding this comment.
attaching Pian's fix: pytorch/pytorch#179315 at pytorch side. should be better than mine. anyway, I will remove those torch._check from titan once pytorch side fix landed
converting to draft for now if we want to wait for pytorch side fix
There was a problem hiding this comment.
@bobrenjc93 i thought bob suppoorted boolean inputs to graphs in inductor?
7bfb29d to
0da784a
Compare
0da784a to
9f38866
Compare
| # non_blocking=True is safe in eager, but under torch.compile the | ||
| # async D2H transfer can race with the subsequent .tolist()/.item() | ||
| # calls, producing stale values and failing unbacked-symint guards. | ||
| non_blocking = not torch.compiler.is_compiling() |
There was a problem hiding this comment.
Fine for me to land. Thanks!
9f38866 to
61e685e
Compare
7c7e3ca to
136cda6
Compare
Consolidate apply_compile_dense and apply_compile_sparse into a single apply_compile function. The only difference was capture_scalar_outputs which is harmless for dense models. Remove the _run_experts_grouped_mm separate compile boundary and EP wrapper — no longer needed now that the CI uses cu130+ nightly which handles the unbacked-symint Eq(u1, u2) constraints in inductor. Remove the x[:total_tokens] slice in _run_experts_for_loop — padding was removed in pytorch#2774, so sum(num_tokens_per_expert) == x.shape[0] and the slice is a no-op.
136cda6 to
cbc0ec8
Compare
|
note: if people use pytorch nightly with cu128 on A100, |
… models The upstream compile consolidation (PR pytorch#2741) unconditionally sets torch._dynamo.config.capture_scalar_outputs=True in apply_compile. This is needed for MoE dynamic shapes but breaks the separately-compiled loss_fn when loss_parallel + ignore_index in cross_entropy produce unbacked symbols (zuf0, zuf1), causing PendingUnbackedSymbolNotFound. Reset the flag after apply_compile in the dense agpt parallelize path.
18 configs (11 agpt + 7 MoE), 10 steps each on 24 XPU tiles. 8/18 passed. Documents the 80B compile regression from upstream PR pytorch#2741 and the blendcorpus cache race fix.
…crash Documents the PendingUnbackedSymbolNotFound error introduced by upstream PR pytorch#2741's unconditional capture_scalar_outputs=True, and the fix (resetting the flag for dense models after apply_compile).
per-layer compile becomes possible after applying fully_shard at layer-level (no more moe level): #2281
apply_compile: consolidate apply_compile_dense and apply_compile_sparse into one.The only difference was capture_scalar_outputs which is harmless for dense models.
Also removed the _run_experts_grouped_mm separate compile boundary and EP wrapper.
FSDP2 + EP:
NGPU=8 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel ./run_train.sh --compile.enable --compile.components model,loss --parallelism.expert_parallel_degree 4FSDP2 + EP + TP:
NGPU=8 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel ./run_train.sh --compile.enable --compile.components model,loss --parallelism.expert_parallel_degree 4 --parallelism.tensor_parallel_degree 2FSDP2 + EP + TP&ETP:
NGPU=8 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel ./run_train.sh --compile.enable --compile.components model,loss --parallelism.expert_parallel_degree 4 --parallelism.tensor_parallel_degree 2 --parallelism.expert_tensor_parallel_degree 2qwen3-vl:
NGPU=8 MODULE=qwen3_vl CONFIG=qwen3_vl_debugmodel_moe ./run_train.sh --compile.enable --compile.components model,loss --parallelism.expert_parallel_degree 4 --training.steps 20H100 works

A100 also works
