Conversation
This is GREAT! Looking forward to this feature!
Another issue to think about: the naming. Some decorators change the function's own behavior, like `@staticmethod`. Others change the function's behavior from the caller's perspective, like torch's
It may be too hard to think through all the ways we expose lazy tensors and harmonize them within the timeframe of this PR, so consider this optional; we may have to refactor all the ways we discuss lazy tensors and compilation at some point in the future (e.g. `mark_step`, `xm.optimizer_step`, `torch_xla.compile`, etc.).
Thanks for the great work! Just confirming that when using
Depends on how those two are combined.
Ack. I don't have a great immediate thought (maybe
This is ready for another look
yaoshiang
left a comment
Tests appear to confirm the expected behavior of the decorator.
Fixes #8805.
We introduce a decorator, `@assume_pure`, that can be placed on PyTorch/XLA functions to easily eliminate lazy tensor tracing overhead. If you have a pure function that only uses upstream torch ops, that function can be decorated with `@assume_pure` and will only be traced once for each unique combination of input tensor shapes.

## Design
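In usage terms, a pure function is decorated like this. This is a self-contained sketch: the stand-in no-op `assume_pure` below replaces the real PyTorch/XLA decorator so the snippet runs anywhere, and the function names are hypothetical.

```python
# Stand-in no-op decorator (the real one in PyTorch/XLA additionally
# caches the traced computation per input shape/dtype combination).
def assume_pure(fn):
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

@assume_pure
def scaled_dot(w, x):
    # A "pure" function: the output depends only on the inputs and there
    # are no side effects, so tracing it once per shape is safe.
    return sum(wi * xi for wi, xi in zip(w, x))

print(scaled_dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```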
`@assume_pure` brings together three pieces of existing technology:

- `jax.vjp`, which takes a JAX function and gives you the autograd forward and backward pass
- torchax, which converts a pure PyTorch function to a JAX function
- `xb.call_jax`, which can call any JAX function from PyTorch/XLA and integrate it into the HLO graph

It works by:
1. Using `torchax.interop.jax_view` to obtain a JAX function from the input PyTorch function
2. Using `jax.vjp` to get the forward and backward pass
3. Wrapping them in a `torch.autograd.Function` instance, where the forward implementation is `xb.call_jax(forward_pass)` and the backward implementation is `xb.call_jax(backward_pass)`, respectively

The core logic is actually just a single line:
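The actual one-liner is not reproduced here, but the wiring pattern can be illustrated with a dependency-free sketch (all names are hypothetical stand-ins): a `vjp`-style helper runs the forward pass once, captures the residuals in a backward closure, and the two halves are packaged the way a `torch.autograd.Function` pairs `forward` with `backward`.

```python
forward_runs = 0

def vjp(fn, x):
    # Toy vjp for fn(x) = x * x: returns the primal output and a closure
    # mapping an output cotangent g to the input cotangent 2 * x * g.
    global forward_runs
    forward_runs += 1
    out = fn(x)
    def vjp_fun(g):
        return 2.0 * x * g  # reuses the residual `x` saved in the closure
    return out, vjp_fun

class PureFunction:
    """Minimal stand-in for a torch.autograd.Function wrapper."""
    def __init__(self, fn):
        self.fn = fn
        self._vjp_fun = None
    def forward(self, x):
        out, self._vjp_fun = vjp(self.fn, x)  # save the backward closure
        return out
    def backward(self, grad_out):
        # The forward pass is NOT rerun here; the residuals live in the
        # closure that `forward` saved.
        return self._vjp_fun(grad_out)

f = PureFunction(lambda x: x * x)
print(f.forward(3.0))   # 9.0
print(f.backward(1.0))  # 6.0
print(forward_runs)     # 1
```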
## How is the HLO cached

`xb.call_jax` caches the HLO if all the input shapes/dtypes and non-tensor arguments are the same. Therefore, subsequent `xb.call_jax` calls will just reuse the cached HLO instead of retracing. The same kind of caching happens in both the forward and the backward pass.
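The caching idea can be sketched as follows (hypothetical names; a stdlib-only stand-in for the real HLO cache): the "traced" computation is keyed on the input shapes/dtypes plus any non-tensor arguments, so a matching call reuses the cache instead of retracing.

```python
trace_count = 0

def trace(fn):
    # Stand-in for lowering a function to HLO; we just count invocations.
    global trace_count
    trace_count += 1
    return fn

_cache = {}

def call_cached(fn, array, scale):
    # `array` stands in for a tensor, modeled as (shape, dtype, values).
    shape, dtype, values = array
    key = (fn.__name__, shape, dtype, scale)  # non-tensor args join the key
    if key not in _cache:
        _cache[key] = trace(fn)  # cache miss: trace once for this signature
    return _cache[key](values, scale)

def mul(values, scale):
    return [v * scale for v in values]

call_cached(mul, ((2,), "f32", [1.0, 2.0]), 10)       # traces
call_cached(mul, ((2,), "f32", [5.0, 6.0]), 10)       # cache hit: same shapes
call_cached(mul, ((3,), "f32", [1.0, 2.0, 3.0]), 10)  # new shape: retraces
print(trace_count)  # 2
```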
Different from the JAX wrapper we used in `splash_attention`, the `j2t_autograd` function saves the residuals (intermediate activations) during the forward pass and reuses them during the backward pass by plugging them into the `vjp_fun` again. This means it won't force a rematerialization (rerunning the forward pass) during the backward pass.

## Alternatives
Instead of `jax.vjp`, we could also use AOTAutograd to get the forward and backward pass. However, AOTAutograd has a number of downsides, including issues around mapping `xp.Trace(...)` to `jax.named_scope(...)`.

Instead of
`assume_pure`, we could also use `torch.compile` to cache the XLA executable of the compiled function and skip the lazy tensor tracing. However, `torch.compile` has its own downsides:

- `torch.compile` itself uses AOTAutograd and will suffer from the decomposition and custom-operation issues, etc.
- `torch.compile` has a general perception of "either it works, or debugging will be complicated", which has been corroborated by experiments by people on the PyTorch/XLA team. See PyTorch team members' own recommendation [1]. In contrast, `@assume_pure` has a very simple rule for determining whether it will work: if your function is pure, then it works.
- `torch.compile` will graph-break when entering and leaving the compiled region. In contrast, `@assume_pure` can avoid tracing overhead without even breaking the graph; the cached HLO is inlined into the overall HLO.

## Benchmarks
I tested tracing an example 100-layer decoder-only model:
Importantly, the tracing cost of `@assume_pure` does not scale with increasing complexity inside the model. That's because we only trace the model once, paying a fixed up-front cost, and later runs reuse the cached XLA computation.

Anecdotally, @bhavya01 reported saving >200 ms of tracing time in an SDXL experiment. That's very significant since each training step is sub-1 second.