[`BERT`] Add support for sdpa by hackyon · Pull Request #28802 · huggingface/transformers

hackyon · 2024-01-31T22:55:28Z

What does this PR do?

Adding support for SDPA (scaled dot product attention) for Bert. More context in #28005.
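For readers unfamiliar with SDPA, the following is a minimal sketch of what the fused call computes, compared against a hand-written ("eager") attention. The helper name `eager_attention` and the tensor sizes are illustrative assumptions, not code from this PR.

```python
import math

import torch
import torch.nn.functional as F


def eager_attention(query, key, value):
    """Reference attention: softmax(Q K^T / sqrt(head_dim)) V."""
    scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(query.size(-1))
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, value)


torch.manual_seed(0)
# Shape is (batch, heads, seq_len, head_dim); the sizes are arbitrary.
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))

eager_out = eager_attention(q, k, v)
# Fused kernel shipped with PyTorch (available from torch 2.0 onwards).
sdpa_out = F.scaled_dot_product_attention(q, k, v)

# The two paths should agree up to floating-point noise.
max_diff = (eager_out - sdpa_out).abs().max().item()
```

The fused call can dispatch to different backends depending on the environment, which is where the speed and memory differences reported below would come from.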

Benchmarking Results on A100-80GB, CPUx12, RAM 96.6GB, OS Ubuntu 22.04, using BertLMHeadModel

Training benchmark based on fxmarty's script:

| num_training_steps | batch_size | seq_len | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | sdpa peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|---|
| 1000 | 1 | 256 | 0.022 | 0.018 | 23.905 | 1128.190 | 1065.286 | 5.905 |
| 1000 | 1 | 512 | 0.034 | 0.028 | 20.473 | 1345.791 | 1093.933 | 23.023 |
| 1000 | 2 | 256 | 0.031 | 0.026 | 18.701 | 1175.685 | 1093.933 | 7.473 |
| 1000 | 2 | 512 | 0.057 | 0.047 | 21.315 | 2123.874 | 1370.097 | 55.016 |
| 1000 | 4 | 256 | 0.052 | 0.044 | 16.446 | 1784.135 | 1369.489 | 30.277 |
| 1000 | 4 | 512 | 0.106 | 0.087 | 21.524 | 3706.609 | 2196.791 | 68.728 |
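The percentage columns appear to be computed relative to the SDPA value, i.e. `(eager / sdpa - 1) * 100`. This is an inference from the numbers, not something the benchmark script is quoted as doing here. A small sketch with a hypothetical helper name checks it against two rows of peak-memory figures:

```python
def relative_gain_percent(eager: float, sdpa: float) -> float:
    """Percentage by which the eager value exceeds the SDPA value."""
    return (eager / sdpa - 1.0) * 100.0


# (eager peak MB, sdpa peak MB) taken from the first and last training rows.
rows = [
    (1128.190, 1065.286),
    (3706.609, 2196.791),
]

# Compare these with the "Mem saving (%)" column for the same rows.
savings = [round(relative_gain_percent(eager, sdpa), 3) for eager, sdpa in rows]
```

The time columns are rounded to three decimals, so applying the same formula to them only approximates the reported speedups.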

Inference benchmark based on fxmarty's script:

| num_batches | batch_size | seq_len | Per token latency eager (ms) | Per token latency SDPA (ms) | Speedup (%) | Mem eager (MB) | Mem SDPA (MB) | Mem saved (%) |
|---|---|---|---|---|---|---|---|---|
| 50 | 1 | 64 | 5.906 | 5.420 | 8.962 | 271.610 | 271.407 | 0.075 |
| 50 | 1 | 128 | 5.825 | 5.402 | 7.834 | 279.157 | 279.718 | -0.200 |
| 50 | 2 | 64 | 6.190 | 5.349 | 15.709 | 291.489 | 291.751 | -0.090 |
| 50 | 2 | 128 | 6.168 | 5.360 | 15.066 | 307.514 | 307.776 | -0.085 |
| 50 | 4 | 64 | 6.262 | 5.392 | 16.137 | 332.177 | 332.440 | -0.079 |
| 50 | 4 | 128 | 6.201 | 5.382 | 15.215 | 364.271 | 364.742 | -0.129 |

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @younesbelkada

(cc @fxmarty)

hackyon · 2024-01-31T23:01:02Z

Hey @ArthurZucker @younesbelkada

I was thinking SDPA (#28005) could be a good addition to BERT, so I drafted this change. It doesn't look too hairy so far.

As @ArthurZucker mentioned, BERT doesn't have a lot of params so there might not be much of a speedup, but this didn't look too difficult to implement so I figured whatever little improvement might still be helpful (as an aside, there's been some benchmarking of Flash Attention on training other implementations of BERT, and it still shows decent improvements).

Can you let me know if this is worth pursuing? If so, I'll add the tests and also fix the fix-copies dependencies.

Thanks!

hackyon · 2024-01-31T23:04:09Z

src/transformers/models/bert/modeling_bert.py

This is fixed in torch 2.2.0 I think, maybe I should check for it and skip the calls?

I think it is fine to leave. We should probably bump the requirement for SDPA to torch>=2.2 in the future.

This got me thinking, and I ran an additional set of benchmarking, given that FA2 is supported and the contiguous bug is fixed in 2.2.0: training and inference.

Both training and inference were ~5% faster with torch==2.2.0 (FA2 should be supported). I also tried out gating the .contiguous() requirement and saw an additional ~5-10% gain on top of that.

    if version.parse(get_torch_version()) < version.parse("2.2.0"):
        query_layer = query_layer.contiguous()
        key_layer = key_layer.contiguous()
        value_layer = value_layer.contiguous()

I'm leaning towards adding the if-statement to gate the call, so users who upgrade to torch=2.2.0 first can get the benefits right away (before we set the min torch version to 2.2.0). WDYT?

I added the if-statement for 2.2.0 in there. If you don't think it's a good idea, let me know and I'll remove it.
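The gating idea discussed above can be sketched in isolation as follows. This is a simplified illustration, assuming the `packaging` library is installed; the helper names `needs_contiguous_workaround` and `maybe_contiguous` are hypothetical and do not appear in the PR.

```python
from packaging import version


def needs_contiguous_workaround(torch_version: str) -> bool:
    """True for torch versions older than 2.2.0, where the workaround is kept."""
    return version.parse(torch_version) < version.parse("2.2.0")


def maybe_contiguous(torch_version: str, *tensors):
    """Call .contiguous() on each tensor only when the workaround is needed."""
    if needs_contiguous_workaround(torch_version):
        return tuple(t.contiguous() for t in tensors)
    return tensors
```

In real code the version string would come from the installed torch (for example `torch.__version__`), and the tensors would be the query, key and value layers. Comparing parsed versions rather than raw strings avoids mis-ordering values such as `2.10.0` and `2.2.0`.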

ArthurZucker · 2024-02-01T13:04:32Z

I think a good way to see if it is worth the shot is to benchmark your code and check if you have speedups in different contexts!

hackyon · 2024-02-01T16:21:06Z

Sounds good, lemme look into that

src/transformers/models/data2vec/modeling_data2vec_text.py

src/transformers/models/roberta/modeling_roberta.py

hackyon · 2024-02-06T07:38:42Z

@ArthurZucker I did some training and inference benchmarking for my change and posted the results in the PR description.

It looks like there are decent improvements across the board (percentage-wise, but I think the improvements would add up if we're doing a lot of training/inferencing). I think it could be a good addition. Thoughts?

ArthurZucker · 2024-02-07T07:38:02Z

Sounds like a good addition then! I'll let @fxmarty review and will be doing the final pass!

pommedeterresautee · 2024-02-07T16:01:01Z

Just curious, is it similar to #27478 ?
Seems also #28713 is highly related.

hackyon · 2024-02-07T16:05:13Z

re: @pommedeterresautee

Yes, it's similar. SDPA is built into pytorch, and can support Flash Attention (1) depending on the environment. AFAIK Flash Attention 2 isn't supported in SDPA yet, but there is a possibility for it to be supported down the road (but that should be built into pytorch already, and shouldn't need many changes from our end).

pommedeterresautee · 2024-02-07T16:13:26Z

Thanks, I think it is now
https://pytorch.org/blog/pytorch2-2/
scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.

hackyon · 2024-02-07T16:51:37Z

Oh nice, so I guess we could get FA2 for free eventually (when we upgrade pytorch).

Thanks for the links to similar work. I think they could cause some merge conflicts, so I'll message them and try to resolve it before it goes in.

fxmarty

It looks to be in good shape, thank you. I left a few comments.

fxmarty · 2024-02-08T08:58:06Z

src/transformers/models/bridgetower/modeling_bridgetower.py

I would probably move the Copied from just to the __init__ and other methods, but not forward. For the forward, you can probably just add a comment that it is adapted from bert/roberta and once bridge_tower supports sdpa we can put back to copied from.

WDYT @ArthurZucker @amyeroberts

There seem to be 8 methods that are copied from BertModel#forward() exactly and have this section of change.

I won't mind adding SDPA to them as well once this goes in and reinstating the copy-from. It shouldn't be that difficult (famous last words)

I've removed the fix-copies from the instances, and so the logic for sdpa attention masks should only be in BertModel now.

src/transformers/models/camembert/modeling_camembert.py

src/transformers/models/clap/modeling_clap.py

src/transformers/models/data2vec/modeling_data2vec_text.py

src/transformers/models/roberta/modeling_roberta.py

src/transformers/models/xlm_roberta_xl/modeling_xlm_roberta_xl.py

src/transformers/models/bert/modeling_bert.py

fxmarty · 2024-02-08T09:11:06Z

src/transformers/models/bert/modeling_bert.py

@ArthurZucker there are create_extended_attention_mask_for_decoder, invert_attention_mask, get_extended_attention_mask methods in modeling_utils.py that should probably be deprecated / redirect to modeling_attn_mask_utils.py.

Yea, I agree.

It'd be great if we could mark those old methods as deprecated, and slowly update them once we verify that the old methods and the new methods are always returning the same results.

For the updated_attention_mask for sdpa, why can't we keep the previous logic and just do:

    # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
    # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
    # Details: https://github.com/pytorch/pytorch/issues/110213
    causal_mask = causal_mask.mul(~torch.all(causal_mask == torch.finfo(embedding_output.dtype).min, dim=-1, keepdim=True)).to(
        dtype
    )

(from Llama)?
Not super fan of the complexity of _prepare_4d_causal_attention_mask_for_sdpa, and we should not add it in our new code IMO.
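To make the quoted snippet easier to follow, here is a small standalone sketch of the same trick on a toy additive mask. The function name `unmask_fully_masked_rows` and the 2x3 mask are illustrative assumptions; real masks in the library are 4D.

```python
import torch


def unmask_fully_masked_rows(mask: torch.Tensor) -> torch.Tensor:
    """Zero out rows of an additive mask in which every position is masked."""
    min_value = torch.finfo(mask.dtype).min
    # True for rows where every entry equals the dtype's minimum value.
    fully_masked = torch.all(mask == min_value, dim=-1, keepdim=True)
    # Multiplying by the negated flag turns fully masked rows into zeros
    # (attend to everything) and leaves all other rows unchanged.
    return mask.mul(~fully_masked)


min_value = torch.finfo(torch.float32).min
mask = torch.tensor(
    [
        [min_value, min_value, min_value],  # fully masked row, e.g. left padding
        [min_value, 0.0, 0.0],              # ordinary row with one masked position
    ]
)
fixed = unmask_fully_masked_rows(mask)
```

Only the fully masked row changes, so positions that carry real tokens keep their original masking.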

hackyon · 2024-02-08T14:49:41Z

src/transformers/modeling_attn_mask_utils.py

This code was changed to pass the fx tracing test (in common tests).

It would be good if you can help double check the logic here. I think the idea here is that we'll still have to use our own attention mask (rather than None) when tracing is active. The previous "pass" would cause the function to end without any return statements, which would have defaulted to None.

It looks OK to me, cc @fxmarty to confirm.

AFAICT, the difference here is coming from the additional isinstance(mask, torch.fx.Proxy) in the is_tracing_check. I don't believe the reworking to remove pass should affect anything - the new code is equivalent.

Yes it is fine, see

transformers/src/transformers/modeling_attn_mask_utils.py

Line 353 in 1ecf5f7

is_tracing = torch.jit.is_tracing() or isinstance(inputs_embeds, torch.fx.Proxy)

hackyon · 2024-02-08T14:55:45Z

src/transformers/models/bert/modeling_bert.py

This fix was added due to a test failure that uncovered an existing bug.

The head was initialized but the weights weren't retied as necessary. This was causing self.decoder.bias to be different from self.bias. When loading the pretrained model with low_cpu_mem_usage=True, self.decoder.bias had uninitialized params (with device=meta) whereas self.bias was set properly (with device=cpu).

I'm slightly concerned this will affect the output some users see when using this model. Please let me know what you think about this.

I pulled this out to its own PR here:
#28948

This issue is unrelated to SDPA, but was just uncovered by an SDPA test, so I just pulled it out to its own PR.

Addition looks OK to me - thanks for digging into this.

I'm slightly concerned this will affect the output some users see when using this model. Please let me know what you think about this.

Could you expand on what you think might be an issue?

I was initially concerned that users were loading and using the model with a wrong bias (ie. device=meta), and this fix to use the correct bias will cause the results to change between versions.

However, that seems unlikely after playing around with this a bit more - turns out it was quite difficult to run the model when the bias had device=meta, so I doubt anyone was actually running the model in this particular configuration before the fix.
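The kind of mismatch described in this thread can be illustrated with a simplified sketch. This is not the actual BERT head or the fix from #28948; `TinyLMHead` and its sizes are hypothetical, and the replacement of the decoder stands in for whatever step breaks the tie.

```python
import torch
from torch import nn


class TinyLMHead(nn.Module):
    """Toy LM head whose decoder bias is shared with a separate parameter."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        # Tie the two names to the same parameter object.
        self.decoder.bias = self.bias

    def _tie_weights(self):
        # Re-establish the link if the decoder has been replaced.
        self.decoder.bias = self.bias


head = TinyLMHead(hidden_size=8, vocab_size=16)
tied_at_init = head.decoder.bias is head.bias

# Swapping in a new decoder leaves it with its own, separate bias.
head.decoder = nn.Linear(8, 16)
tied_after_swap = head.decoder.bias is head.bias

head._tie_weights()
tied_after_retie = head.decoder.bias is head.bias
```

When the two names point at different parameters, loading weights into one does not update the other, which matches the symptom of one bias being initialized while the other is not.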

hackyon · 2024-02-08T14:57:14Z

tests/test_modeling_common.py

The self._prepare_for_class is necessary to support the BertForMultipleChoice model.

hackyon · 2024-02-08T15:22:41Z

I've rebased off of head and marked as ready for review. I had to dig through a couple of issues to get the tests to pass, let me know if you want to chat about any of them.

Thanks!

amyeroberts · 2024-02-08T16:54:16Z

@fxmarty @hackyon There are still several tests failing related to this PR. Once these are resolved you can ping me again for a final review.

src/transformers/modeling_utils.py

hackyon · 2024-02-08T21:03:35Z

The tests are passing now. I also verified that test_modeling_bert passes with RUN_SLOW=1 (which contains the tests to ensure model output is the same for eager and sdpa attentions)

Please take another look when you get a chance. Thanks!

amyeroberts

Thanks for all the work adding this @hackyon as well as the additional work to dig into weird errors and find solutions. Great work!

Some general comments:

- Let's wait for the merging of #28948 before merging this in.
- It would be good to add the performance numbers in the PR description to BERT's model page, similar to what's done for Flash Attention, e.g. [here](https://huggingface.co/docs/transformers/v4.37.2/en/model_doc/gpt_neox#using-flash-attention-2).
- test_eager_matches_sdpa_inference should be run for all existing models with SDPA implemented to confirm compatibility with the change in processed_inputs.
- We shouldn't be setting self._use_sdpa in models that don't have an SDPA attention class. We can just about get away with it for the models which have an attention dict, but not for the other models.

amyeroberts · 2024-02-13T18:59:49Z

src/transformers/modeling_attn_mask_utils.py

src/transformers/models/altclip/modeling_altclip.py

src/transformers/models/bert/modeling_bert.py

amyeroberts · 2024-02-13T19:04:18Z

src/transformers/models/bert/modeling_bert.py

src/transformers/models/xlm_roberta_xl/modeling_xlm_roberta_xl.py

src/transformers/models/xmod/modeling_xmod.py

ArthurZucker · 2024-04-05T07:18:53Z

Oh wow

ArthurZucker

LGTM, let's rebase on main!

hackyon · 2024-04-06T22:45:49Z

Thanks!

I merged with main/HEAD, and re-ran the RUN_SLOW tests for both bert and also for test_eager_matches_sdpa_inference and they work as expected. There were existing failures for test_eager_matches_sdpa_inference with RUN_SLOW on main/HEAD, but nothing new introduced by this change.

I'm not sure about this test_pipelines_tf failure. I haven't touched any code with tf, and I was able to get the failing test test_stop_sequence_stopping_criteria to pass locally, so I'm thinking it's a flake or unrelated to this change.

amyeroberts · 2024-04-08T08:26:04Z

Hi @hackyon - great to see this ready to merge!

The generation tests aren't related to this diff and are failing on other PRs. We're working to push a fix to main - will let you know when resolved, you can rebase and hopefully we have full 🟢 for merging 🤗

hackyon · 2024-04-11T19:22:28Z

Thanks @amyeroberts @ArthurZucker

Just remerged with main/HEAD, and the unrelated failing TF pipeline test now passes. I checked the bert tests again with RUN_SLOW for good measure, and they continue to pass.

Let me know if there's anything else I could do here. Thanks!

hackyon · 2024-04-15T15:18:57Z

@ArthurZucker Please let me know if there's anything else you'd like me to do for this PR. Thanks!

hackyon · 2024-04-22T13:37:14Z

Remerged with the latest main, and fixed a test.

@ArthurZucker @amyeroberts @fxmarty Please let me know if there's anything I can do here.

amyeroberts · 2024-04-26T15:23:38Z

@hackyon Everything's green and two approvals, so we're good to merge. Thanks for all the effort in adding this and iterating with us. It's great to have this added to one of the most popular models ❤️

HuggingFaceDocBuilderDev · 2024-04-26T15:42:46Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

hackyon · 2024-04-26T19:01:27Z

Thanks @amyeroberts for the merge! 🎉 I appreciate all the help from @fxmarty, @ArthurZucker, and you in getting this PR merged 🙏

I see you've submitted #30506 as a follow-up, and thank you for covering that. Please let me know if there's any other follow-up work, and I'd be happy to look into it.

hackyon · 2024-04-26T21:13:14Z

As I mentioned previously, I've also drafted a PR for adding SDPA support to RoBERTa-based models at #30510. Almost all of the changes are "Copied from" BERT, and so there is a little less room for error.

coekfung · 2024-07-17T08:11:20Z

I appreciate your work! As ESM is a BERT-based model, I think SDPA can be added to ESM with little modification.

This was referenced Feb 7, 2024

feat: add flash_attn 2 to bert #27478

Closed

Add FlashAttention2 for XLM-RoBERTa #28713

Closed

hackyon force-pushed the sdpa-bert branch from bee1cce to 748e659 Compare February 8, 2024 15:19

hackyon marked this pull request as ready for review February 8, 2024 15:21

fxmarty requested a review from amyeroberts February 8, 2024 16:46

This was referenced Feb 9, 2024

Always initialize tied output_embeddings if it has a bias term #28947

Merged

Add tie_weights() to LM heads and set bias in set_output_embeddings() #28948

Merged

hackyon force-pushed the sdpa-bert branch from 5601e9f to b68240d Compare February 12, 2024 17:22

hackyon force-pushed the sdpa-bert branch from 2576e38 to fe6db3c Compare February 14, 2024 18:00

ArthurZucker self-requested a review March 25, 2024 08:24

ArthurZucker approved these changes Apr 5, 2024

Merge remote-tracking branch 'upstream/main' into sdpa-bert

0965399

minostauros mentioned this pull request Apr 7, 2024

_prepare_4d_attention_mask_for_sdpa is not for causal attention but claims... #30095

Closed

fxmarty mentioned this pull request Apr 9, 2024

Ignore non-causal mask in more cases with SDPA #30138

Merged

Merge remote-tracking branch 'upstream/main' into sdpa-bert

b4813a0

hackyon added 2 commits April 22, 2024 09:20

Merge remote-tracking branch 'upstream/main' into sdpa-bert

e312cd1

Fix test_sdpa_can_dispatch_on_flash by preparing input (required for MultipleChoice models)

66a24c1

His-Wardship mentioned this pull request Apr 26, 2024

Open to contribution: adding torch.nn.functional.scaled_dot_product_attention support for more architectures #28005

Closed

6 tasks

amyeroberts merged commit dfa7b58 into huggingface:main Apr 26, 2024

amyeroberts mentioned this pull request Apr 26, 2024

Fix GroundingDINO, DPR after BERT SDPA update #30506

Merged

hackyon deleted the sdpa-bert branch April 26, 2024 19:01

hackyon mentioned this pull request Apr 26, 2024

[RoBERTa-based] Add support for sdpa #30510

Merged

5 tasks

michaelfeil mentioned this pull request Jun 9, 2024

[Bettertransformer] Transformers 4.41.0 (torch.SDPA-Bert) breaks bettertransformers Bert, but works in Transformers 4.40.2 huggingface/optimum#1902

Closed

4 tasks

This was referenced Jul 2, 2024

Suport sdpa for RoBERTa and XLM-RoBERTa models #31752

Open

[roberta] add sdpa to roberta and xlm-roberta #31754

Closed

imatiach-msft mentioned this pull request Aug 19, 2024

Fix blbooksgenre notebook failures due to error on deployment Azure/azureml-examples#3353

Merged

4 tasks

coekfung mentioned this pull request Nov 27, 2024

[ESM] Add support for sdpa. #34954

Open

5 tasks
