Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    output_attentions=output_attentions,
)

# Ignore copy
```
This `# Ignore copy` doesn't work, any ideas why @ydshieh?
Hi. `Copied from` and `Ignore copy` only work with a named entity (like a function, method, or class). They cannot be used in a place (a block) where no declared name is given. Either refactor the code, or just remove the copy statement

```python
# Copied from transformers.models.bart.modeling_bart.BartSdpaAttention with Bart->Musicgen
```

if no code refactoring is possible.
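For readers less familiar with the mechanism, here is a minimal sketch of what "named entity" means in practice (the base class is a stand-in so the snippet parses on its own):

```python
import torch.nn as nn

class MusicgenAttention(nn.Module):  # stand-in for the real eager attention class
    ...

# Works: the statement is attached to a declared name (here, a class),
# so the copy checker knows which object to compare against the Bart original.
# Copied from transformers.models.bart.modeling_bart.BartSdpaAttention with Bart->Musicgen
class MusicgenSdpaAttention(MusicgenAttention):
    ...

def some_forward(attention_mask=None):
    # Does not work: this `if` block carries no declared name for the checker
    # to bind to, so neither `# Copied from` nor `# Ignore copy` can target it.
    # Ignore copy
    if attention_mask is not None:
        ...
```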
Thanks @ydshieh, I'll remove the copy statement.
amyeroberts left a comment
Thanks for looking into and addressing this!
```python
# Ignore copy
if (
    attention_mask is not None
    and (attention_mask.mean(dim=[1, 2, 3]) <= torch.finfo(attention_mask.dtype).min).any()
):
```
What's the reason for using `finfo` here and not just 0?
The attention mask has already been processed and is filled with the `-inf` of the corresponding dtype!
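To make the distinction concrete, a minimal sketch (mask shapes are illustrative): after mask preparation, masked slots hold `torch.finfo(dtype).min` rather than 0, so the fully-masked check has to compare against the dtype minimum:

```python
import torch

# Processed 4D additive mask (batch, 1, tgt_len, src_len): unmasked slots
# are 0, masked slots hold the dtype minimum.
attention_mask = torch.zeros(2, 1, 1, 5)
attention_mask[0].fill_(torch.finfo(attention_mask.dtype).min)  # fully padded entry

# A threshold of 0 cannot separate "fully masked" from "unmasked or partially
# masked": an all-zero entry also has mean <= 0. The mean only saturates at or
# below finfo(dtype).min when every slot of an entry is masked.
is_fully_masked = attention_mask.mean(dim=[1, 2, 3]) <= torch.finfo(attention_mask.dtype).min
print(is_fully_masked)  # tensor([ True, False])
```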
```python
    return super().forward(
        hidden_states,
        key_value_states=key_value_states,
        past_key_value=past_key_value,
        attention_mask=attention_mask,
        layer_head_mask=layer_head_mask,
        output_attentions=output_attentions,
    )
```
Rather than falling back, I would just raise an exception. Otherwise this expensive check and forward pass can easily go unnoticed.
Well, SDPA and `guidance_scale > 1` are used by default, so we'd raise the error almost every time. Also, even if the model uses eager mode for the cross-attention layers (in which the bug happens), it'll still benefit from the speed-up of the self-attention layers.

Should we find a better way of testing the attention mask? For example, we could raise a warning here and here that this will happen, and switch the cross-attention SDPA layers to eager layers by default when it does?
OK, I see. I'm a bit concerned about this causing unexpected behaviour, in particular defaulting to eager like this, as it's a bit magic. However, as there are other layers which can still use SDPA, this seems like a pragmatic solution.

Let's leave it as-is. If more users raise issues, then we'll have to re-think.
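For reference, here is the pattern that was settled on, condensed from the diff fragments above into one sketch (not the exact merged code; `MusicgenAttention` is a stand-in for the real eager implementation, and the warning text is illustrative):

```python
import torch
from transformers.utils import logging

logger = logging.get_logger(__name__)

class MusicgenAttention(torch.nn.Module):  # stand-in for the real eager module
    def forward(self, hidden_states, **kwargs):
        ...

class MusicgenSdpaAttention(MusicgenAttention):
    def forward(self, hidden_states, key_value_states=None, past_key_value=None,
                attention_mask=None, layer_head_mask=None, output_attentions=False):
        if (
            attention_mask is not None
            and (attention_mask.mean(dim=[1, 2, 3]) <= torch.finfo(attention_mask.dtype).min).any()
        ):
            logger.warning_once(
                "`scaled_dot_product_attention` returns NaN for fully masked rows; "
                "falling back to the eager attention implementation."
            )
            # MusicgenAttention.forward is the eager path, which handles
            # fully masked rows without producing NaN.
            return super().forward(
                hidden_states,
                key_value_states=key_value_states,
                past_key_value=past_key_value,
                attention_mask=attention_mask,
                layer_head_mask=layer_head_mask,
                output_attentions=output_attentions,
            )
        # ... otherwise, run the SDPA path as usual ...
```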
amyeroberts left a comment
Thanks for digging into this and fixing!
```python
)

# Copied from transformers.models.bart.modeling_bart.BartSdpaAttention with Bart->Musicgen
```
I removed the statement here, just want to make sure that it's okay with you @amyeroberts before merging!
Fix #31189 and #30020.

SDPA produces NaN when given a padding mask that attends to no position at all (see pytorch/pytorch#103749 (comment)).

In the case of Musicgen, this can happen for two reasons:

There might be a more elegant way to do this, WDYT @amyeroberts?
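For context, here is a minimal sketch of the upstream issue (shapes are illustrative; an explicit `-inf` additive mask is used so the fully masked row is obvious):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 4, 8)
k = torch.randn(1, 1, 4, 8)
v = torch.randn(1, 1, 4, 8)

# Additive mask: the first query position attends to no key position at all.
mask = torch.zeros(1, 1, 4, 4)
mask[..., 0, :] = float("-inf")

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out[0, 0, 0])  # all NaN: softmax over a row that is entirely -inf
```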