Add Flash Attention 2 to M2M100 model #30256
Conversation
younesbelkada
left a comment
Hi @visheratin
Thanks for this great addition! I see in the PR you used some old / deprecated variables such as _use_flash_attention_2, please see: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L405 - you need to inherit from M2M100Attention. There is also slightly more work to be done on the documentation side to add the expected speedups; check out this recent PR: #29226 to see what the required changes are, and let me know if you have any questions - thanks!
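(For context, the suggested pattern looks roughly like the sketch below. This is a simplified illustration with assumed class and argument names, not the code merged in this PR; cross-attention, the KV cache, and padding-mask handling are omitted.)

```python
# Minimal sketch of the suggested inheritance pattern: subclass M2M100Attention
# and override only forward(), mirroring how LlamaFlashAttention2 subclasses
# LlamaAttention. Names and simplifications here are illustrative.
from flash_attn import flash_attn_func
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Attention


class M2M100FlashAttention2(M2M100Attention):
    """Reuses the parent's q/k/v/out projections; swaps in flash-attn kernels."""

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        bsz, seq_len, _ = hidden_states.size()

        # Same projections as M2M100Attention, reshaped to (batch, seq, heads, head_dim),
        # which is the layout flash_attn_func expects.
        q = self.q_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        k = self.k_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        v = self.v_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)

        # flash_attn applies the 1/sqrt(head_dim) scaling internally.
        attn_output = flash_attn_func(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            causal=getattr(self, "is_causal", False),
        )

        attn_output = attn_output.reshape(bsz, seq_len, self.embed_dim)
        attn_output = self.out_proj(attn_output)
        # Match the parent's return signature: (output, attn_weights, past_key_value).
        return attn_output, None, None
```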
@visheratin correct, for llama it's because the attention mask logic has been refactored in favor of
I fixed the inheritance and added the FA2 sections, along with a speedup image, to the NLLB and M2M100 doc pages. I also added an integration test. Let me know if there is anything else that needs to be done.
younesbelkada
left a comment
Looks very clean! Thanks for working on this! I left a single comment - what do you think?
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Sure! I committed the change.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
younesbelkada
left a comment
Thanks again for the smooth integration!
amyeroberts
left a comment
Thanks for adding!
Just some small nits to resolve before merge
# create causal mask
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
combined_attention_mask = _prepare_4d_causal_attention_mask(
Why rename here from combined_attention_mask to attention_mask?
This is an artifact of debugging. I reverted to the old name.
decoder_layer.__call__,
hidden_states,
combined_attention_mask,
# combined_attention_mask,
It would be better to keep the old name though.
layer_outputs = decoder_layer(
    hidden_states,
    attention_mask=combined_attention_mask,
    # attention_mask=combined_attention_mask,
Same here.
| "I think there are two levels of response from the French government.", | ||
| "When François Hollande calls Barack Obama or when Foreign Minister Laurent Fabius calls the U.S." | ||
| " Ambassador, they respond to a real discovery, which is that of the scale of U.S. surveillance on all" | ||
| " communications in France.", |
Same examples as in the original tests.
Ah, sorry, I didn't mean to be confusing! It's just that it was talking about surveillance, so I thought I'd drop a wee spy.
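(For context, a hedged sketch of what such an integration test could look like; the actual test name, checkpoint, and generation settings in the PR may differ. It checks that the flash_attention_2 backend decodes to the same translations as the default eager attention.)

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer


def check_flash_attention_2_generation_matches_eager():
    src_text = ["I think there are two levels of response from the French government."]

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")
    inputs = tokenizer(src_text, return_tensors="pt", padding=True).to("cuda")

    eager_model = M2M100ForConditionalGeneration.from_pretrained(
        "facebook/m2m100_418M", torch_dtype=torch.float16, attn_implementation="eager"
    ).to("cuda")
    fa2_model = M2M100ForConditionalGeneration.from_pretrained(
        "facebook/m2m100_418M", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
    ).to("cuda")

    gen_kwargs = dict(forced_bos_token_id=tokenizer.get_lang_id("fr"), num_beams=2, max_new_tokens=32)
    eager_out = eager_model.generate(**inputs, **gen_kwargs)
    fa2_out = fa2_model.generate(**inputs, **gen_kwargs)

    # Both backends should decode to identical translations.
    assert tokenizer.batch_decode(eager_out, skip_special_tokens=True) == tokenizer.batch_decode(
        fa2_out, skip_special_tokens=True
    )
```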
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
amyeroberts
left a comment
Thanks again for adding and iterating!
My pleasure! Thank you both, @amyeroberts and @younesbelkada, for the fast review!
* Added flash attention 2.
* Fixes.
* Fix inheritance.
* Fixed init.
* Remove stuff.
* Added documentation.
* Add FA2 to M2M100 documentation.
* Add test.
* Fixed documentation.
* Update src/transformers/models/m2m_100/modeling_m2m_100.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update docs/source/en/model_doc/nllb.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fixed variable name.
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR adds support for Flash Attention 2 in M2M100 models (e.g., NLLB). Here is the Colab notebook with a working demo.
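A rough usage sketch of what the demo covers (assumed checkpoint and prompt; requires a CUDA GPU, half precision, and the flash-attn package installed):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"  # any M2M100/NLLB checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # the backend added by this PR
).to("cuda")

inputs = tokenizer("Flash Attention 2 makes long-sequence generation faster.", return_tensors="pt").to("cuda")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # translate to French
    max_new_tokens=40,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```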
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @younesbelkada