Add Flash Attention 2 to M2M100 model #30256
Conversation
younesbelkada
left a comment
Hi @visheratin
Thanks for this great addition! I see in the PR you used some old / deprecated variables such as _use_flash_attention_2, please see: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L405 - you need to inherit from M2M100Attention. There is also slightly more work to be done on the documentation side to add the expected speedups; check out this recent PR: #29226 to see what the required changes are, and let me know if you have any questions - thanks!
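(For context, the suggested pattern looks roughly like the sketch below. This is a simplified illustration with assumed class and argument names, not the code merged in this PR; cross-attention, the KV cache, and padding-mask handling are omitted.)

```python
# Minimal sketch of the suggested inheritance pattern: subclass M2M100Attention
# and override only forward(), mirroring how LlamaFlashAttention2 subclasses
# LlamaAttention. Names and simplifications here are illustrative.
from flash_attn import flash_attn_func
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Attention


class M2M100FlashAttention2(M2M100Attention):
    """Reuses the parent's q/k/v/out projections; swaps in flash-attn kernels."""

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        bsz, seq_len, _ = hidden_states.size()

        # Same projections as M2M100Attention, reshaped to (batch, seq, heads, head_dim),
        # which is the layout flash_attn_func expects.
        q = self.q_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        k = self.k_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        v = self.v_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)

        # flash_attn applies the 1/sqrt(head_dim) scaling internally.
        attn_output = flash_attn_func(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            causal=getattr(self, "is_causal", False),
        )

        attn_output = attn_output.reshape(bsz, seq_len, self.embed_dim)
        attn_output = self.out_proj(attn_output)
        # Match the parent's return signature: (output, attn_weights, past_key_value).
        return attn_output, None, None
```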
@visheratin correct, for llama it's because the attention mask logic has been refactored in favor of
I fixed the inheritance and added the FA2 sections, along with a speedup image, to the NLLB and M2M100 doc pages. I also added an integration test. Let me know if there is anything else that needs to be done.
younesbelkada
left a comment
Looks very clean! Thanks for working on this! I left a single comment - what do you think?
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Sure! I committed the change.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
younesbelkada
left a comment
Thanks again for the smooth integration!
amyeroberts
left a comment
Thanks for adding!
Just some small nits to resolve before merge
# create causal mask
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
combined_attention_mask = _prepare_4d_causal_attention_mask(
Why rename here from combined_attention_mask to attention_mask?
This is an artifact of debugging. I reverted to the old name.
decoder_layer.__call__,
hidden_states,
combined_attention_mask,
# combined_attention_mask,
It would be better to keep the old name though.
layer_outputs = decoder_layer(
    hidden_states,
    attention_mask=combined_attention_mask,
    # attention_mask=combined_attention_mask,
Same here.
| "I think there are two levels of response from the French government.", | ||
| "When François Hollande calls Barack Obama or when Foreign Minister Laurent Fabius calls the U.S." | ||
| " Ambassador, they respond to a real discovery, which is that of the scale of U.S. surveillance on all" | ||
| " communications in France.", |
Same examples as in the original tests.
Ah, sorry, I didn't mean to be confusing! It's just that it was talking about surveillance, so I thought I'd drop a wee spy.
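(For context, a hedged sketch of what such an integration test could look like; the actual test name, checkpoint, and generation settings in the PR may differ. It checks that the flash_attention_2 backend decodes to the same translations as the default eager attention.)

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer


def check_flash_attention_2_generation_matches_eager():
    src_text = ["I think there are two levels of response from the French government."]

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")
    inputs = tokenizer(src_text, return_tensors="pt", padding=True).to("cuda")

    eager_model = M2M100ForConditionalGeneration.from_pretrained(
        "facebook/m2m100_418M", torch_dtype=torch.float16, attn_implementation="eager"
    ).to("cuda")
    fa2_model = M2M100ForConditionalGeneration.from_pretrained(
        "facebook/m2m100_418M", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
    ).to("cuda")

    gen_kwargs = dict(forced_bos_token_id=tokenizer.get_lang_id("fr"), num_beams=2, max_new_tokens=32)
    eager_out = eager_model.generate(**inputs, **gen_kwargs)
    fa2_out = fa2_model.generate(**inputs, **gen_kwargs)

    # Both backends should decode to identical translations.
    assert tokenizer.batch_decode(eager_out, skip_special_tokens=True) == tokenizer.batch_decode(
        fa2_out, skip_special_tokens=True
    )
```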
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
amyeroberts
left a comment
Thanks again for adding and iterating!
My pleasure! Thank you both, @amyeroberts and @younesbelkada, for the fast review!
* Added flash attention 2.
* Fixes.
* Fix inheritance.
* Fixed init.
* Remove stuff.
* Added documentation.
* Add FA2 to M2M100 documentation.
* Add test.
* Fixed documentation.
* Update src/transformers/models/m2m_100/modeling_m2m_100.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update docs/source/en/model_doc/nllb.md
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Fixed variable name.
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR adds support for Flash Attention 2 in M2M100 models (e.g., NLLB). Here is the Colab notebook with a working demo.
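A rough usage sketch of what the demo covers (assumed checkpoint and prompt; requires a CUDA GPU, half precision, and the flash-attn package installed):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"  # any M2M100/NLLB checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # the backend added by this PR
).to("cuda")

inputs = tokenizer("Flash Attention 2 makes long-sequence generation faster.", return_tensors="pt").to("cuda")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # translate to French
    max_new_tokens=40,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```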
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @younesbelkada