
temporarily removed cudnn attention backend #1717

Closed
danielvegamyhre wants to merge 1 commit into pytorch:main from danielvegamyhre:epfix

Conversation

@danielvegamyhre

@danielvegamyhre danielvegamyhre commented Sep 17, 2025

Contributor

We should remove this until the long-term fix for #1713 lands. I believe @eqy is working on a fix. I just tried PyTorch built from source with the latest changes, but the issue persists, so for now we can remove the cuDNN attention backend and add it back later.

@danielvegamyhre

Contributor Author

cc @tianyu-l for review

@eqy

eqy commented Sep 17, 2025


What commit did you build at? I believe the fix was merged ~8 hours ago: pytorch/pytorch#163104

@tianyu-l tianyu-l left a comment

Contributor

What is the exact issue?
If eager works, we should disable it on the compile path only.
If bf16+compile works, we should disable it on the quantization path only.
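
The scoping rule suggested above can be sketched as a small decision function. This is only an illustration under stated assumptions: `RunConfig`, `cudnn_allowed`, and `sdpa_backend_names` are hypothetical names, not torchtitan APIs, and the two `*_works` flags stand for the outcome of manual experiments. The backend names are chosen to match members of `torch.nn.attention.SDPBackend`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical summary of the training options relevant to this thread."""
    compile_enabled: bool
    quantization_enabled: bool


def cudnn_allowed(cfg: RunConfig, eager_works: bool, bf16_compile_works: bool) -> bool:
    """Disable the cuDNN backend on the narrowest path that is known to fail."""
    if bf16_compile_works:
        # Only the quantization path is suspect.
        return not cfg.quantization_enabled
    if eager_works:
        # Eager is fine, so only the compile path is suspect.
        return not cfg.compile_enabled
    # Nothing is known to work: disable everywhere.
    return False


def sdpa_backend_names(cfg: RunConfig, eager_works: bool, bf16_compile_works: bool) -> List[str]:
    """Return backend names in preference order for this configuration."""
    names = ["FLASH_ATTENTION", "EFFICIENT_ATTENTION", "MATH"]
    if cudnn_allowed(cfg, eager_works, bf16_compile_works):
        names.insert(0, "CUDNN_ATTENTION")
    return names
```

In a real setup, each name could be resolved with `getattr(torch.nn.attention.SDPBackend, name)` and the resulting list passed to `torch.nn.attention.sdpa_kernel`.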

@danielvegamyhre

Contributor Author

> What commit did you build at? I believe the fix was merged ~8 hours ago: pytorch/pytorch#163104

2.10.0a0+git28c42cc (28c42cc28090e7ee629c9a89b5ef2cc4838fb755)
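
A quick way to answer the "which commit" question programmatically is sketched below. It assumes the source-build version string has the `+git<shorthash>` suffix seen above (the kind of string `torch.__version__` reports for a source build). The helper names are hypothetical, and the hash of the fix commit is not given in this thread, so it is left as a parameter. The ancestry check relies on `git merge-base --is-ancestor`, run against a local PyTorch checkout.

```python
import re
import subprocess
from typing import Optional


def short_hash_from_version(version: str) -> Optional[str]:
    """Return the git short hash embedded in a source-build version string, if any."""
    match = re.search(r"\+git([0-9a-f]{7,40})$", version)
    return match.group(1) if match else None


def build_matches_commit(version: str, full_sha: str) -> bool:
    """True if the version string's short hash is a prefix of full_sha."""
    short = short_hash_from_version(version)
    return short is not None and full_sha.startswith(short)


def contains_commit(repo_dir: str, fix_sha: str, head_sha: str = "HEAD") -> bool:
    """True if fix_sha is an ancestor of head_sha in the checkout at repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", fix_sha, head_sha],
        check=False,
    )
    # git exits with 0 for "is an ancestor" and 1 for "is not"; anything else is an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git merge-base failed with exit code {result.returncode}")
    return result.returncode == 0
```

For example, `build_matches_commit("2.10.0a0+git28c42cc", "28c42cc28090e7ee629c9a89b5ef2cc4838fb755")` confirms that the version string and the full hash quoted above refer to the same commit.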

@eqy

eqy commented Sep 18, 2025


Ok, can you share the failure message? I would be surprised if it was the same one...

As a sanity check, the following unit test (included in the PR) should not error out if you have a build with the fix:

        # Half-precision tensor on the GPU; reused as query, key, and value below.
        q = torch.randn(2, 8, 1024, 128, dtype=torch.half, device='cuda', requires_grad=True)
        grad = torch.randn_like(q)

        # Forward and backward under torch.compile, restricted to the cuDNN SDPA backend.
        @torch.compile()
        def func():
            with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.CUDNN_ATTENTION):
                out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
                out.backward(grad)
            return out

        out = func()

        # Reference: the same computation in float32 on the CPU.
        q_cpu = q.float().cpu().detach().clone()
        q_cpu.requires_grad = True
        grad_cpu = grad.cpu().float()
        out_cpu = torch.nn.functional.scaled_dot_product_attention(q_cpu, q_cpu, q_cpu)
        out_cpu.backward(grad_cpu)
        # Compare the output and the input gradient against the CPU reference.
        self.assertEqual(out, out_cpu.cuda().half(), atol=1e-3, rtol=1e-3)
        self.assertEqual(q.grad, q_cpu.grad.cuda().half(), atol=7e-3, rtol=5e-3)

@danielvegamyhre

danielvegamyhre commented Sep 18, 2025

Contributor Author

> Ok, can you share the failure message? I would be surprised if it was the same one...

It's the same error message. Maybe I need to uninstall pytorch-triton too and run make triton?

Doing another complete pull, uninstall, make clean, and install:

(torch) [danvm@devgpu007.snb3 ~/torchtitan (main)]$ rm -rf /tmp/torchinductor_danvm;

(torch) [danvm@devgpu007.snb3 ~/torchtitan (main)]$ NGPU=2 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --parallelism.data_parallel_shard_degree=2 --parallelism.expert_parallel_degree=2 --compile.enable 

...

    File "/tmp/torchinductor_danvm/sj/csjffroid2fsi32up4lyfhbmbcsgvwvkdda4fibw5fyenflebnhn.py", line 482, in call
      buf24 = torch.ops.aten._scaled_dot_product_cudnn_attention.default(reinterpret_tensor(buf22, (8, 16, 2048, 16), (524288, 16, 256, 1), 0), reinterpret_tensor(buf23, (8, 16, 2048, 16), (524288, 16, 256, 1), 0), reinterpret_tensor(buf5, (8, 16, 2048, 16), (524288, 16, 256, 1), 0), None, True, 0.0, True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/danvm/pytorch/torch/_ops.py", line 841, in __call__
      return self._op(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

@eqy

eqy commented Sep 18, 2025


That doesn't look the same as

[rank0]:[rank0]: AssertionError: wrong number of dimensions4 for op: torch.ops.aten._scaled_dot_product_cudnn_attention.default

:/

@danielvegamyhre

danielvegamyhre commented Sep 18, 2025

Contributor Author

> That doesn't look the same as
>
> [rank0]:[rank0]: AssertionError: wrong number of dimensions4 for op: torch.ops.aten._scaled_dot_product_cudnn_attention.default
>
> :/

Oh, I've seen both of these cuDNN-related errors at various points while debugging #1713. The error described in #1713 is indeed a different message, though; sorry for the confusion. The workaround of not using the cuDNN backend resolved both:

  • CUDNN_STATUS_NOT_INITIALIZED occurs with CUDA 12.9, built from source (we don't build nightlies for 12.9 anymore).
  • The "wrong number of dimensions" assertion occurred with the CUDA 12.8 nightly.

@eqy

eqy commented Sep 18, 2025


cuDNN not initialized is pretty wild, are we almost out of GPU memory or something for this model?

I'll check a source build in 12.9 tomorrow.

@danielvegamyhre

Contributor Author

> cuDNN not initialized is pretty wild, are we almost out of GPU memory or something for this model?

I don't think so. After removing the cuDNN backend, it hits around 80 GB of GPU memory on a B200 with ~183 GB capacity.

> I'll check a source build in 12.9 tomorrow.

Sounds good, thanks for taking a look.

@eqy

eqy commented Sep 18, 2025


To help narrow things down, could you also please collect some logging information:

CUDNN_FRONTEND_LOG_FILE=frontend.txt CUDNN_FRONTEND_LOG_INFO=1 CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=backend.txt python yourrepro.py

Thanks!
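
For runs launched from Python rather than a shell, the same logging setup can be sketched as below. The environment variable names and values are taken from the command above; the helper names `cudnn_logging_env` and `run_with_cudnn_logging` are hypothetical, as is the choice to launch the repro with the current interpreter.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def cudnn_logging_env(
    base_env: Optional[Dict[str, str]] = None,
    frontend_log: str = "frontend.txt",
    backend_log: str = "backend.txt",
) -> Dict[str, str]:
    """Return a copy of the environment with the cuDNN logging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "CUDNN_FRONTEND_LOG_FILE": frontend_log,
            "CUDNN_FRONTEND_LOG_INFO": "1",
            "CUDNN_LOGLEVEL_DBG": "3",
            "CUDNN_LOGDEST_DBG": backend_log,
        }
    )
    return env


def run_with_cudnn_logging(script: str, *args: str) -> int:
    """Run a repro script with cuDNN logging enabled and return its exit code."""
    completed = subprocess.run(
        [sys.executable, script, *args],
        env=cudnn_logging_env(),
        check=False,
    )
    return completed.returncode
```

When training is started through a launcher script such as ./run_train.sh, exporting the same four variables in the shell beforehand should have the same effect, assuming the launcher passes its environment through to the training processes.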

@meta-cla meta-cla bot added the CLA Signed label Sep 18, 2025

@fegin fegin left a comment

Contributor

We cannot disable the cuDNN backend entirely. You can disable it for specific settings, but some people were still using it for benchmarking as of last week.

@danielvegamyhre

Contributor Author

Update: I retried this morning with today's nightly build, which includes @eqy's fix, and the issue does not repro. It looks like the CUDNN_STATUS_NOT_INITIALIZED error must be a local environment issue with my source build, so we can close this.

