[RL] adopt local map attention for vLLM attention by acisseJZhong · Pull Request #2638 · pytorch/torchtitan

acisseJZhong · 2026-03-20T18:29:53Z

Summary

Adopt LocalMapAttention as the base class for VLLMAttention,
replacing manual DTensor.to_local() / DTensor.from_local()
with local_map for DTensor-to-local conversion.
Add call override that captures seq_len = q.size(2) from
the DTensor before local_map's to_local() and passes it to
forward() via kwargs. This preserves the canonical symbolic
shape (s72) that GQAttention uses in its downstream view(bs,
seqlen, -1). Capturing from the DTensor's global shape ensures
the correct symbolic size is used under torch.compile.
See Dtensor Shard->Replicate redistribution corrupts symbolic shapes under torch.compile pytorch#175690
Remove replace_with_vllm_compatible_flash_attention() and
its usage in the trainer — the trainer no longer patches its
attention module to match vLLM's kernel.
Fix the test's vLLM engine creation to use
GeneratorCompileConfig instead of hardcoded CompilationConfig,
aligning the test with the generator's actual
compile/CUDA-graph settings.

Test attention numerics under both true eager mode and compile mode:

NCCL_NVLS_ENABLE=0 torchrun --nproc_per_node=2     torchtitan/experiments/rl/tests/test_attn_numerics.py

============================================================
LOGPROB COMPARISON RESULTS
============================================================
Bitwise identical : False
Tokens checked : 30
Tokens different : 30
Max delta : 1.041739e-01
Avg delta : 1.816580e-02
Diff mean : 2.139650e-03
Diff max : 1.041739e-01
============================================================

Lucaskabela · 2026-03-20T23:53:42Z

        output_flat = output_flat.narrow(0, 0, batch_size * seq_len)

-        # Reshape back to titan: (batch, num_heads_local, seq_len, head_dim)
+        # Reshape back to titan: (batch, seq_len, num_heads_local, head_dim)


Note the transpose (1, 2) - comment is right since we swap num_heads and seq_len on L273

yup, I thought the shape annotation is only for the next line. Reverted it back!

wwwjn · 2026-03-21T01:02:00Z

-        # supporting paged attention / kv cache.
        if batch_invariant_mode:
-            replace_with_vllm_compatible_flash_attention(
+            replace_with_vllm_attention(


I don't think we need this change to make trainer and generator bit-wise identity. To make bit-wise identity, we need the forward path run the same kernels on trainer and generator. This function is replace to vllm.Attention() for kv cache capability. For trainer, set config to Varlen attention should be enough

Also this might not be trainable as vllm.Attention should not have backward?

Let's remove this change from this PR.

oh I though vllm.Attention which uses PyTorchFlashAttentionImpl has the backward?

also tried removing this, doesn't seem to change numerics

============================================================ LOGPROB COMPARISON RESULTS ============================================================ Bitwise identical : False Tokens checked : 30 Tokens different : 30 Max delta : 1.041739e-01 Avg delta : 1.816580e-02 Diff mean : 2.139650e-03 Diff max : 1.041739e-01 ============================================================

oh I though vllm.Attention which uses PyTorchFlashAttentionImpl has the backward?

I see, you are right. But in trainer, we don't need the kv cache capability so we can directly using Varlen attention to achieve bit-wise identity

wwwjn · 2026-03-21T01:03:22Z

        tensor_parallel_size=gen_config.parallelism.tensor_parallel_degree,
        distributed_executor_backend="external_launcher",
        gpu_memory_utilization=gen_config.gpu_memory_limit,
+        enforce_eager=gen_config.compile.is_eager,


Why adding this? Do you need enforce_eager = True when compile and cudagraph are disabled?

it should use the config from _test_config below, otherwise there's two compile config floating around

tianyu-l · 2026-03-22T00:18:00Z

        Returns:
            ``(batch, num_heads, seq_len, head_dim)``
        """
-        # Capture the original symbolic seq_len from the input BEFORE


Actually I don't know why we were using global seq_len here.

In TP, qkv are sharded on num_head dimension so seq_len should be the same on DTensor / local tensor

In all-gather CP, qkv are sharded on seq_len dimension before entering CP, but replicate on kv after the hooks (it may not work with varlen attention yet).

In either case
q = q.transpose(1, 2).reshape(batch_size * seq_len, -1, head_dim) should work with local qkv's seq_len?

It's because L270

# vLLM's flash attention backend may pad the token count (e.g. # round up to an even number), which introduces a new symbolic # shape under torch.compile. Narrow to trim this padding # NOTE: this error only happens when batch_size and seq_len are 1 # which happens with cudagraph capture for dummy input

during cudagraph capture with TP=2, the seqlen should be 1, but it's padded to 2, and we need to capture the original symbolic seq_len from the input BEFORE to_local. After using local map, this entire forward is wrapped in local region, therefore I moved this chunk of code before calling forward.

this needs follow up with pytorch/pytorch#175690; @Lucaskabela mentioned:

I think this is a symbolic shape propagation error - w.r.t to_local(), the symbol we are using has to be divisible by the TP size, so I think it is something like ( 2*(s77 + 1) // 2) or something odd like that - this actually results in some bug I believe in our code that is generated

The approach in this PR is just workaround like before. Can delete once the fix is landed.

acisseJZhong · 2026-03-23T17:55:18Z

running https://github.com/pytorch/torchtitan/pull/2300/changes#diff-61d04a6e7103debe722a05fcf985c39b2471c332de71b93de4e23e878b9a5d5a the test_numerics.py, all three tests passed. The command I am using is

NCCL_NVLS_ENABLE=0 MASTER_ADDR=localhost MASTER_PORT=29500 MODEL_CHECKPOINT_PATH=Qwen/Qwen3-0.6B pytest         torchtitan/experiments/rl/tests/test_numerics.py -v -s

cc @wwwjn

It means the whole attention module (not only self_attn) keep align with vllm's numerics,

Lucaskabela

PR LGTM but may need a rebase to avoid merge conflicts

Lucaskabela · 2026-03-23T21:07:09Z

    compute_token_log_probs,
    verify_logprob_identity,
 )
 from torchtitan.experiments.rl.types import Episode


I think this is removed on main if you rebase :)

acisseJZhong · 2026-03-23T21:39:28Z

+        # Therefore it is breaking compile. We need to fix this in pytorch.
+        # See more details in https://github.com/pytorch/pytorch/issues/175690
+        # TODO(@Lucaskabela): remove this once the issue is fixed in pytorch
        batch_size, _, seq_len, head_dim = q.shape


@Lucaskabela I updated this to still capture seq_len in local region. This captured seq_len is wrong and will break compile behavior, but it is the right thing to do to get seq_len in local region instead of global.

Please help take a look into the fix for compile. Thanks!!

## Summary We turned the compile for generator off in #2638 due to conflict with DTensor and symbolic propogation We fix this in pytorch/pytorch#178210 so reenable this config (once landed in nightly) ## Test ```bash python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B ```

**Summary** - Adopt LocalMapAttention as the base class for VLLMAttention, replacing manual DTensor.to_local() / DTensor.from_local() with local_map for DTensor-to-local conversion. - Add __call__ override that captures seq_len = q.size(2) from the DTensor before local_map's to_local() and passes it to forward() via kwargs. This preserves the canonical symbolic shape (s72) that GQAttention uses in its downstream view(bs, seqlen, -1). Capturing from the DTensor's global shape ensures the correct symbolic size is used under torch.compile. See pytorch/pytorch#175690 - Remove replace_with_vllm_compatible_flash_attention() and its usage in the trainer — the trainer no longer patches its attention module to match vLLM's kernel. - Fix the test's vLLM engine creation to use GeneratorCompileConfig instead of hardcoded CompilationConfig, aligning the test with the generator's actual compile/CUDA-graph settings. **Test attention numerics under both true eager mode and compile mode:** ``` NCCL_NVLS_ENABLE=0 torchrun --nproc_per_node=2 torchtitan/experiments/rl/tests/test_attn_numerics.py ``` > ============================================================ LOGPROB COMPARISON RESULTS ============================================================ Bitwise identical : False Tokens checked : 30 Tokens different : 30 Max delta : 1.041739e-01 Avg delta : 1.816580e-02 Diff mean : 2.139650e-03 Diff max : 1.041739e-01 ============================================================

## Summary We turned the compile for generator off in pytorch#2638 due to conflict with DTensor and symbolic propogation We fix this in pytorch/pytorch#178210 so reenable this config (once landed in nightly) ## Test ```bash python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B ```

## Summary We turned the compile for generator off in #2638 due to conflict with DTensor and symbolic propogation We fix this in pytorch/pytorch#178210 so reenable this config (once landed in nightly) ## Test ```bash python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B ```

**Summary** - Adopt LocalMapAttention as the base class for VLLMAttention, replacing manual DTensor.to_local() / DTensor.from_local() with local_map for DTensor-to-local conversion. - Add __call__ override that captures seq_len = q.size(2) from the DTensor before local_map's to_local() and passes it to forward() via kwargs. This preserves the canonical symbolic shape (s72) that GQAttention uses in its downstream view(bs, seqlen, -1). Capturing from the DTensor's global shape ensures the correct symbolic size is used under torch.compile. See pytorch/pytorch#175690 - Remove replace_with_vllm_compatible_flash_attention() and its usage in the trainer — the trainer no longer patches its attention module to match vLLM's kernel. - Fix the test's vLLM engine creation to use GeneratorCompileConfig instead of hardcoded CompilationConfig, aligning the test with the generator's actual compile/CUDA-graph settings. **Test attention numerics under both true eager mode and compile mode:** ``` NCCL_NVLS_ENABLE=0 torchrun --nproc_per_node=2 torchtitan/experiments/rl/tests/test_attn_numerics.py ``` > ============================================================ LOGPROB COMPARISON RESULTS ============================================================ Bitwise identical : False Tokens checked : 30 Tokens different : 30 Max delta : 1.041739e-01 Avg delta : 1.816580e-02 Diff mean : 2.139650e-03 Diff max : 1.041739e-01 ============================================================

## Summary We turned the compile for generator off in pytorch#2638 due to conflict with DTensor and symbolic propogation We fix this in pytorch/pytorch#178210 so reenable this config (once landed in nightly) ## Test ```bash python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B ```

**Summary** - Adopt LocalMapAttention as the base class for VLLMAttention, replacing manual DTensor.to_local() / DTensor.from_local() with local_map for DTensor-to-local conversion. - Add __call__ override that captures seq_len = q.size(2) from the DTensor before local_map's to_local() and passes it to forward() via kwargs. This preserves the canonical symbolic shape (s72) that GQAttention uses in its downstream view(bs, seqlen, -1). Capturing from the DTensor's global shape ensures the correct symbolic size is used under torch.compile. See pytorch/pytorch#175690 - Remove replace_with_vllm_compatible_flash_attention() and its usage in the trainer — the trainer no longer patches its attention module to match vLLM's kernel. - Fix the test's vLLM engine creation to use GeneratorCompileConfig instead of hardcoded CompilationConfig, aligning the test with the generator's actual compile/CUDA-graph settings. **Test attention numerics under both true eager mode and compile mode:** ``` NCCL_NVLS_ENABLE=0 torchrun --nproc_per_node=2 torchtitan/experiments/rl/tests/test_attn_numerics.py ``` > ============================================================ LOGPROB COMPARISON RESULTS ============================================================ Bitwise identical : False Tokens checked : 30 Tokens different : 30 Max delta : 1.041739e-01 Avg delta : 1.816580e-02 Diff mean : 2.139650e-03 Diff max : 1.041739e-01 ============================================================

## Summary We turned the compile for generator off in pytorch#2638 due to conflict with DTensor and symbolic propogation We fix this in pytorch/pytorch#178210 so reenable this config (once landed in nightly) ## Test ```bash python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B ```

pytorch-bot Bot added the ciflow/8gpu label Mar 20, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 20, 2026

acisseJZhong marked this pull request as draft March 20, 2026 18:30

acisseJZhong force-pushed the rl_attention branch from 29220ee to 540e0b3 Compare March 20, 2026 23:17

Lucaskabela reviewed Mar 20, 2026

View reviewed changes

acisseJZhong marked this pull request as ready for review March 21, 2026 00:45

acisseJZhong requested review from fegin, tianyu-l, wconstab and wwwjn as code owners March 21, 2026 00:45

acisseJZhong force-pushed the rl_attention branch from dad2b8a to 7cad1e9 Compare March 21, 2026 00:53

wwwjn reviewed Mar 21, 2026

View reviewed changes

acisseJZhong force-pushed the rl_attention branch 2 times, most recently from 391bdda to da24e6c Compare March 21, 2026 07:08

tianyu-l reviewed Mar 22, 2026

View reviewed changes

acisseJZhong requested a review from zhxchen17 March 23, 2026 17:55

acisseJZhong force-pushed the rl_attention branch 2 times, most recently from fa78caf to d976e7c Compare March 23, 2026 20:56

Lucaskabela approved these changes Mar 23, 2026

View reviewed changes

acisseJZhong added 4 commits March 23, 2026 14:37

test

c2a20ec

compile problem

742429c

compile fails

7bc2359

break compile

d36ea94

acisseJZhong force-pushed the rl_attention branch from 6b373fd to d36ea94 Compare March 23, 2026 21:37

acisseJZhong commented Mar 23, 2026

View reviewed changes

acisseJZhong force-pushed the rl_attention branch from e785d79 to 512e404 Compare March 23, 2026 21:46

tianyu-l approved these changes Mar 23, 2026

View reviewed changes

disable compile in config

ba38d7f

acisseJZhong force-pushed the rl_attention branch from 512e404 to ba38d7f Compare March 23, 2026 22:55

acisseJZhong merged commit d229e97 into main Mar 23, 2026
18 of 33 checks passed

This was referenced Mar 23, 2026

[Bugfix] Fix by copying sym shapes as needed pytorch/pytorch#178210

Closed

[RL] Turn compile for generator back on #2710

Merged

Conversation

acisseJZhong commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

acisseJZhong Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

acisseJZhong Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

acisseJZhong commented Mar 23, 2026 • edited by wwwjn Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lucaskabela left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

acisseJZhong Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

acisseJZhong commented Mar 20, 2026 •

edited

Loading

acisseJZhong Mar 21, 2026 •

edited

Loading

acisseJZhong Mar 23, 2026 •

edited

Loading

acisseJZhong commented Mar 23, 2026 •

edited by wwwjn

Loading

acisseJZhong Mar 23, 2026 •

edited

Loading