temporarily removed cudnn attention backend by danielvegamyhre · Pull Request #1717 · pytorch/torchtitan

danielvegamyhre · 2025-09-17T23:40:51Z

We should remove this until a long-term fix for #1713 lands. I believe @eqy is working on one. I just tried PyTorch built from source with the latest changes, but the issue persists, so for now we can remove the cuDNN attention backend and add it back later.

danielvegamyhre · 2025-09-17T23:45:39Z

cc @tianyu-l for review

eqy · 2025-09-17T23:59:55Z

What commit did you build at? I believe the fix was merged ~8 hours ago: pytorch/pytorch#163104

tianyu-l

What is the exact issue?
If eager works, we should disable it on compile path only.
If bf16+compile works, we should disable it on quantization path only.
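One way to follow the suggestion above is to narrow the backend list only for the configuration that fails, instead of removing cuDNN everywhere. This is a minimal sketch: the helper names and the `compile_enabled` / `quantized` flags are hypothetical and are not torchtitan's actual config fields. Only `SDPBackend` and `sdpa_kernel` come from PyTorch.

```python
# Hypothetical helpers: names and flags are illustrative, not torchtitan's API.
ALL_BACKENDS = ["CUDNN_ATTENTION", "FLASH_ATTENTION", "EFFICIENT_ATTENTION"]


def choose_backend_names(compile_enabled: bool, quantized: bool,
                         broken_on: str = "compile") -> list[str]:
    """Return SDPBackend member names, dropping cuDNN only on the failing path.

    broken_on="compile" drops cuDNN whenever torch.compile is enabled;
    broken_on="quantization" drops it only when quantization is enabled.
    """
    if broken_on not in ("compile", "quantization"):
        raise ValueError(f"unknown path: {broken_on!r}")
    path_active = compile_enabled if broken_on == "compile" else quantized
    if path_active:
        return [name for name in ALL_BACKENDS if name != "CUDNN_ATTENTION"]
    return list(ALL_BACKENDS)


def sdpa_context(names: list[str]):
    """Build the context manager lazily so the selection logic has no torch dependency."""
    from torch.nn.attention import SDPBackend, sdpa_kernel
    return sdpa_kernel([getattr(SDPBackend, name) for name in names])
```

The attention call would then be wrapped as `with sdpa_context(choose_backend_names(True, False)): ...`, which keeps cuDNN available for eager runs.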

danielvegamyhre · 2025-09-18T00:23:15Z

> What commit did you build at? I believe the fix was merged ~8 hours ago: pytorch/pytorch#163104

2.10.0a0+git28c42cc (28c42cc28090e7ee629c9a89b5ef2cc4838fb755)
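To check whether a source build like the one above actually contains a given fix, the short hash can be pulled out of the version string and tested for ancestry with `git merge-base --is-ancestor`. The sketch below assumes a local pytorch checkout; the function names are hypothetical, and the merge commit of pytorch/pytorch#163104 is not given in this thread, so it has to be supplied by the caller.

```python
import re
import subprocess


def parse_source_build(version: str) -> tuple[str, str]:
    """Split a version like '2.10.0a0+git28c42cc' into (release, short_sha)."""
    match = re.fullmatch(r"(?P<release>[^+]+)\+git(?P<sha>[0-9a-f]{7,40})", version)
    if match is None:
        raise ValueError(f"not a source-build version string: {version!r}")
    return match["release"], match["sha"]


def contains_commit(repo_dir: str, fix_sha: str, build_sha: str) -> bool:
    """True if fix_sha is an ancestor of build_sha in the checkout at repo_dir."""
    cmd = ["git", "-C", repo_dir, "merge-base", "--is-ancestor", fix_sha, build_sha]
    result = subprocess.run(cmd, check=False)
    if result.returncode not in (0, 1):
        # Exit codes other than 0/1 mean git itself failed (e.g. unknown commit).
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result.returncode == 0
```

For example, `parse_source_build(torch.__version__)` gives the build hash to pass as `build_sha`.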

eqy · 2025-09-18T00:24:41Z

Ok, can you share the failure message? I would be surprised if it was the same one...

As a sanity check, the following unit test (included in the PR) should not error out if you have a build with the fix:

    import torch
    from torch.nn.attention import SDPBackend, sdpa_kernel

    # Standalone form of the test; requires a CUDA GPU with cuDNN attention support.
    q = torch.randn(2, 8, 1024, 128, dtype=torch.half, device="cuda", requires_grad=True)
    grad = torch.randn_like(q)

    @torch.compile()
    def func():
        # Force the cuDNN SDPA backend for both forward and backward.
        with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
            out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
            out.backward(grad)
        return out

    out = func()

    # Reference result computed in fp32 on the CPU.
    q_cpu = q.float().cpu().detach().clone()
    q_cpu.requires_grad = True
    grad_cpu = grad.cpu().float()
    out_cpu = torch.nn.functional.scaled_dot_product_attention(q_cpu, q_cpu, q_cpu)
    out_cpu.backward(grad_cpu)
    # In the PR these checks are self.assertEqual calls inside a test class.
    torch.testing.assert_close(out, out_cpu.cuda().half(), atol=1e-3, rtol=1e-3)
    torch.testing.assert_close(q.grad, q_cpu.grad.cuda().half(), atol=7e-3, rtol=5e-3)

danielvegamyhre · 2025-09-18T00:27:25Z

> Ok, can you share the failure message? I would be surprised if it was the same one...

It's the same error message. Maybe I need to uninstall pytorch-triton too and run make triton?

Doing another complete pull, uninstall, make clean, and install.

(torch) [danvm@devgpu007.snb3 ~/torchtitan (main)]$ rm -rf /tmp/torchinductor_danvm;

(torch) [danvm@devgpu007.snb3 ~/torchtitan (main)]$ NGPU=2 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --parallelism.data_parallel_shard_degree=2 --parallelism.expert_parallel_degree=2 --compile.enable 

...

    File "/tmp/torchinductor_danvm/sj/csjffroid2fsi32up4lyfhbmbcsgvwvkdda4fibw5fyenflebnhn.py", line 482, in call
      buf24 = torch.ops.aten._scaled_dot_product_cudnn_attention.default(reinterpret_tensor(buf22, (8, 16, 2048, 16), (524288, 16, 256, 1), 0), reinterpret_tensor(buf23, (8, 16, 2048, 16), (524288, 16, 256, 1), 0), reinterpret_tensor(buf5, (8, 16, 2048, 16), (524288, 16, 256, 1), 0), None, True, 0.0, True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/danvm/pytorch/torch/_ops.py", line 841, in __call__
      return self._op(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
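The shapes and strides passed to the cuDNN op in the traceback above are consistent with a contiguous `(batch, seq, heads, head_dim)` buffer viewed with the sequence and head axes swapped. That layout is an inference from the numbers, not something stated in the thread. A small sketch that checks the arithmetic, with no torch dependency:

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def transpose(shape, strides, a, b):
    """Swap two axes of a strided view without moving any data."""
    shape, strides = list(shape), list(strides)
    shape[a], shape[b] = shape[b], shape[a]
    strides[a], strides[b] = strides[b], strides[a]
    return tuple(shape), tuple(strides)


base_shape = (8, 2048, 16, 16)  # assumed (batch, seq, heads, head_dim) layout
base_strides = contiguous_strides(base_shape)
view_shape, view_strides = transpose(base_shape, base_strides, 1, 2)
print(view_shape, view_strides)  # (8, 16, 2048, 16) (524288, 16, 256, 1)
```

The printed pair matches the `reinterpret_tensor` arguments, so the inputs look like ordinary transposed views rather than anything exotic.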

eqy · 2025-09-18T00:29:17Z

That doesn't look the same as

[rank0]:[rank0]: AssertionError: wrong number of dimensions4 for op: torch.ops.aten._scaled_dot_product_cudnn_attention.default

:/

danielvegamyhre · 2025-09-18T00:31:15Z

> That doesn't look the same as
>
> [rank0]:[rank0]: AssertionError: wrong number of dimensions4 for op: torch.ops.aten._scaled_dot_product_cudnn_attention.default
>
> :/

Oh, I've seen both of these cuDNN-related errors at various points while working on #1713. The issue described in #1713 is indeed a different message, though; sorry for the confusion. The workaround of not using the cuDNN backend resolved both:

CUDNN_STATUS_NOT_INITIALIZED occurred with CUDA 12.9, built from source (we don't build nightlies for 12.9 anymore).
The wrong number of dimensions error occurred with the CUDA 12.8 nightly.

eqy · 2025-09-18T00:33:53Z

cuDNN not initialized is pretty wild, are we almost out of GPU memory or something for this model?

I'll check a source build in 12.9 tomorrow.

danielvegamyhre · 2025-09-18T00:35:06Z

> cuDNN not initialized is pretty wild, are we almost out of GPU memory or something for this model?

I don't think so. After removing the cuDNN backend, it peaks at around 80 GB of GPU memory on a B200 with ~183 GB capacity.

> I'll check a source build in 12.9 tomorrow.

Sounds good, thanks for taking a look

eqy · 2025-09-18T00:44:04Z

To help narrow things down, could you also please collect some logging information:

CUDNN_FRONTEND_LOG_FILE=frontend.txt CUDNN_FRONTEND_LOG_INFO=1 CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=backend.txt python yourrepro.py

Thanks!
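The logging command above can also be driven from Python, which helps when the repro is launched by another script. This is a sketch under assumptions: the environment variable names are copied from the command above, while the helper names and default log file names are hypothetical.

```python
import os
import subprocess
import sys


def cudnn_logging_env(frontend_log="frontend.txt", backend_log="backend.txt"):
    """Copy the current environment and add the cuDNN logging variables."""
    env = dict(os.environ)
    env.update({
        "CUDNN_FRONTEND_LOG_FILE": frontend_log,
        "CUDNN_FRONTEND_LOG_INFO": "1",
        "CUDNN_LOGLEVEL_DBG": "3",
        "CUDNN_LOGDEST_DBG": backend_log,
    })
    return env


def run_with_cudnn_logs(script, *args):
    """Run a repro script under the logging environment and return its exit code."""
    completed = subprocess.run(
        [sys.executable, script, *args],
        env=cudnn_logging_env(),
        check=False,
    )
    return completed.returncode
```

For example, `run_with_cudnn_logs("yourrepro.py")` mirrors the one-line command. Child processes normally inherit these variables, so the same environment should also apply when the script spawns workers.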

fegin

We cannot disable the cuDNN backend entirely. You can disable it for specific settings, but some people were still using it for benchmarking as recently as last week.

danielvegamyhre · 2025-09-18T14:28:05Z

Update: I retried this morning with today's latest nightly build, which includes @eqy's fix, and the issue does not repro. It looks like the CUDNN_STATUS_NOT_INITIALIZED error was a local environment issue with my source build, so we can close this.

danielvegamyhre requested review from fegin, tianyu-l, wconstab and wwwjn as code owners September 17, 2025 23:40

danielvegamyhre mentioned this pull request Sep 17, 2025

Fix EP token group padding issue #1718

Merged

temporarily removed cudnn attention backend

88c6894

danielvegamyhre force-pushed the epfix branch from a5cc80a to 88c6894 Compare September 17, 2025 23:45

tianyu-l requested changes Sep 18, 2025

meta-cla Bot added the CLA Signed label Sep 18, 2025

fegin requested changes Sep 18, 2025

danielvegamyhre closed this Sep 18, 2025
