
Support of Splash Attention using xla_builder.call_jax #145

Merged

zpcore merged 21 commits into main from piz/sa on Mar 14, 2025
Conversation

@zpcore (Collaborator) commented Mar 8, 2025

Support of Splash Attention (SA) using the new xla_builder.call_jax feature (internal one-pager doc). call_jax lets us copy JAX code directly into torch_xla without diving into the details of the low-level pallas kernel API. This PR also serves as an example of how to run the forward pass and retrieve gradients under the torch framework.
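To make the pattern concrete, here is a minimal sketch of bridging a JAX function into torch autograd via call_jax, assuming `call_jax(fn, args)` takes a JAX function and a tuple of XLA torch tensors. The `jax_attention` body and the `jax.vjp`-based backward wiring are illustrative assumptions, not this PR's actual kernel wrapper:

```python
# Minimal sketch (not the PR's actual implementation). `jax_attention` is a
# hypothetical stand-in for the real splash attention pallas kernel.
import jax
import torch
import torch_xla.core.xla_builder as xb


def jax_attention(q, k, v):
  # Stand-in for the pallas splash attention kernel.
  return jax.nn.softmax(q @ k.T) @ v


class SplashAttentionExample(torch.autograd.Function):
  @staticmethod
  def forward(ctx, q, k, v):
    ctx.save_for_backward(q, k, v)
    # call_jax traces the JAX function and embeds it in the torch_xla graph.
    return xb.call_jax(jax_attention, (q, k, v))

  @staticmethod
  def backward(ctx, grad_out):
    q, k, v = ctx.saved_tensors

    def jax_attention_vjp(q, k, v, g):
      # jax.vjp derives the backward pass from the same JAX code, so no
      # hand-written backward kernel binding is needed on the torch side.
      _, vjp_fn = jax.vjp(jax_attention, q, k, v)
      return vjp_fn(g)

    return xb.call_jax(jax_attention_vjp, (q, k, v, grad_out))
```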

For a quick test, we can use:

```sh
python torchprime/torch_xla_models/train.py model=llama-3.1-8b dataset_config_name=wikitext-103-raw-v1 global_batch_size=4 profile_step=3 ici_mesh.fsdp=4 model.attention_kernel=splash_attention
```

Note: The PR is still in the experimental stage. We plan to test it in TorchPrime first and move the kernel to PyTorch/XLA later.

Performance improvement summary compared with MaxText and Flash Attention (FA) kernel:

| | PTXLA + FA | PTXLA + SA | PTXLA + SA + SCAN + hostoffload | PTXLA + SA + jit caching | MaxText + FA | MaxText + SA |
|---|---|---|---|---|---|---|
| step time | 5.87s | 5.367s | 4.619s | 4.694s | 5.95s | 4.45s |
| MFU | 37.15% | 40.6% | 47.19% | 46.44% | 36.63% | 48.98% |

@zpcore zpcore requested review from bhavya01, qihqi and tengyifei March 8, 2025 00:19
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
@zpcore zpcore marked this pull request as ready for review March 8, 2025 09:01
@zpcore (Collaborator, Author) commented Mar 10, 2025

We now almost match MaxText's fwd and bwd time for each decoder layer. However, there is still an overall gap between the torchprime (profile) and maxtext (profile) performance. Below is the summary:

Test: Llama 3.1 8B, 8K seq_len; host: v6e-256.

| | MaxText (JAX) | TorchPrime | Cause of difference |
|---|---|---|---|
| fwd of each decoder layer | 35.7ms | 33.2ms | N/A |
| bwd of each decoder layer | 93.7ms | 109.4ms | Extra pallas kernel call in attention activation remat |
| step time | 4.451s | 5.367s | Barrier core overhead |

Details:

The first main issue is the barrier core, which takes 704ms:

[profile screenshot: barrier core]

If we can get rid of it, the step time can be improved to 4.68s. As a comparison, the JAX step time is 4.451s.

The second issue: PTXLA has an extra pallas call during activation remat.

JAX:

[profile screenshot: JAX backward pass]

PTXLA:

[profile screenshot: PTXLA backward pass with an extra pallas call]
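For context on why remat produces an extra kernel call: activation rematerialization replays the layer's forward, including its attention kernel, during the backward pass. The snippet below is a generic torch illustration of that effect, not code from this PR:

```python
# Generic illustration (not from this PR): with activation checkpointing,
# the layer's forward is replayed during backward to rebuild activations,
# so a layer whose attention is a pallas kernel invokes it a second time.
import torch
from torch.utils.checkpoint import checkpoint


def decoder_layer(x):
  # Stand-in for a decoder layer that calls the splash attention kernel.
  return torch.nn.functional.silu(x)


x = torch.randn(4, 8, requires_grad=True)
y = checkpoint(decoder_layer, x, use_reentrant=False)
y.sum().backward()  # decoder_layer (and its kernel) runs again here
```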

@zpcore zpcore mentioned this pull request Mar 12, 2025
@zpcore (Collaborator, Author) commented Mar 12, 2025

This PR requires the nightly build from 03/12 so that it picks up pytorch/xla#8789. #145 needs to be merged to fix the test failure.

@tengyifei (Contributor) left a comment

still reviewing the other half

Comment thread torchprime/launcher/Dockerfile Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/test/test_splash_attention.py Outdated
Comment thread .github/workflows/e2e_test.yml Outdated
Comment thread .github/workflows/e2e_test.yml Outdated
Comment thread .github/workflows/e2e_test.yml Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
Comment thread torchprime/torch_xla_models/experimental/test/test_splash_attention.py Outdated
Comment thread torchprime/torch_xla_models/experimental/test/test_splash_attention.py Outdated
Comment thread torchprime/torch_xla_models/llama/model.py
Comment thread torchprime/torch_xla_models/train.py Outdated
Comment thread .github/workflows/e2e_test.yml Outdated
Comment thread torchprime/torch_xla_models/experimental/custom_kernel.py Outdated
@tengyifei (Contributor)

need to rebase i think

@tengyifei (Contributor)

Oh! It doesn't have to be in this PR; would you like to update 1 and close out #133? We could add the cmdline recipe and MFU to 2 and close out #135 too!

@zpcore zpcore merged commit e100715 into main Mar 14, 2025
@zpcore zpcore deleted the piz/sa branch March 14, 2025 23:19
@zpcore (Collaborator, Author) commented Mar 14, 2025

> Oh! It doesn't have to be in this PR; would you like to update 1 and close out #133? We could add the cmdline recipe and MFU to 2 and close out #135 too!

Sure, I will update the performance numbers with SA.
