Improve BERT-like models performance with better self attention #9124
jplu merged 8 commits into huggingface:master
Conversation
A Python profiling call gives the following improvements:
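For reference, a minimal sketch of this kind of profiling call; the checkpoint and input here are illustrative, not the exact setup used for the numbers above:

```python
import cProfile

from transformers import BertTokenizer, TFBertModel

# Illustrative checkpoint; any BERT-like TF model can be profiled the same way.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Profile this forward pass.", return_tensors="tf")

# Profile a single forward pass, sorted by cumulative time.
cProfile.run("model(inputs)", sort="cumtime")
```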
sgugger
left a comment
Thanks for working on this! I just have some cosmetic nits, since I'm annoying and can't read things that are longer than 119 chars ;-)
If you rebase after merging #9120, we can clean up the version test in tf_optimization.
Can we do that in a separate PR that only takes care of TF >= 2.3 compliance, instead?
Why is the assert better than raising the ValueError? I liked the version above better, but mostly because it fits within the 119-char limit. If we keep the assert, could you just split the line to respect that char limit?
I don't mind removing the assert and putting back the ValueError. Will do this!
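For illustration, a sketch of the kind of change being discussed; the exact condition and message are hypothetical, loosely modeled on the head-size check in BERT-like attention layers:

```python
hidden_size, num_attention_heads = 768, 12

# Before: an assert, which is silently stripped when Python runs with -O.
assert hidden_size % num_attention_heads == 0, "hidden size must be a multiple of the number of heads"

# After: an explicit ValueError, with the line split to respect the 119-char limit.
if hidden_size % num_attention_heads != 0:
    raise ValueError(
        f"The hidden size ({hidden_size}) is not a multiple of the number "
        f"of attention heads ({num_attention_heads})"
    )
```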
Done in the last commit!
Same comment as for the other assert.
Done in the last commit!
Same comment as for the other assert.
Done in the last commit!
Why is Longformer not included in the change?
Because I don't know this model well enough, I'd prefer to let @patrickvonplaten handle this one.
The intermediate layers of Longformer are 1-to-1 the same as BERT's, so there should be no problem keeping those lines. I'd be surprised if leaving them in threw an error tbh. Did you try just leaving them? The only difference in Longformer is the self-attention layer, and none of those copy statements concern the self-attention layer, so IMO we should leave the statements and run make fix-copies.
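For context, the copy statements in question are comments of this form, which make fix-copies uses to keep a layer's body in sync with the referenced BERT layer (the class shown is illustrative):

```python
import tensorflow as tf


# Copied from transformers.models.bert.modeling_tf_bert.TFBertIntermediate with Bert->Longformer
class TFLongformerIntermediate(tf.keras.layers.Layer):
    # `make fix-copies` rewrites this body from the BERT layer,
    # substituting "Bert" with "Longformer" everywhere.
    ...
```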
Oh, indeed it works for the intermediate layer! Only the self-attention still needs to be updated accordingly :)
Yes, that I'm happy to do in another PR - it would be amazing if you could open a one-liner issue about it and tag me :-)
patrickvonplaten
left a comment
Awesome! I am sure you already did this, but before merging we should be sure of two things:
- All the slow TF BERT + BERT-like model tests are passing
- "Old" pre-trained models' tf_model.h5 files that were saved with TF < 2.3 can be loaded into the new layer design, and TF1 model (.ckpt) files can be loaded into the new layers as well
I don't really see a reason why either 1) or 2) should not work, but just to be sure it'd be great to test both quickly :-)
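A minimal sketch of such a quick check for point 2); the checkpoint is illustrative, and any tf_model.h5 saved with TF < 2.3 exercises the same path:

```python
from transformers import TFBertModel

# Load a checkpoint whose tf_model.h5 predates this change; weights are
# matched by name, so a clean load (no missing/unexpected weight warnings)
# is a reasonable smoke test of the new layer design.
model = TFBertModel.from_pretrained("bert-base-uncased")
```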
And I think we can leave all the Longformer copy statements -> there shouldn't be a problem :-)
Thanks @patrickvonplaten !!
I haven't tested the tf1 models, you mean testing the …
@jlei2 has confirmed that everything now works as expected in the profiler and benchmark 👍 #6771 (comment)
Yeah, I mean loading a tf …
Ok, as discussed offline, TF1 checkpoints cannot even be loaded into TF2 at the moment (only if one goes through PT), so this PR is good to go for me!
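For reference, a sketch of the "through PT" route; all paths here are hypothetical, and passing a config is assumed to be needed for a TF1 index checkpoint:

```python
from transformers import BertConfig, BertModel, TFBertModel

# Hypothetical paths to a TF1 checkpoint and its config.
config = BertConfig.from_json_file("./tf1_model/bert_config.json")
pt_model = BertModel.from_pretrained("./tf1_model/model.ckpt.index", from_tf=True, config=config)
pt_model.save_pretrained("./converted")

# Reload the converted weights into the new TF2 layer design.
tf_model = TFBertModel.from_pretrained("./converted", from_pt=True)
```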
LysandreJik
left a comment
This is very clean, and the performance improvements are amazing! Thanks for checking that the slow tests pass and that the previous checkpoints can still be loaded.
Great job, thank you for working on this!
Why not use tf.keras.layers.experimental.EinsumDense and keep the copy?
This one is not possible because the input/output shapes would no longer be compatible.
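For context, a sketch of what the experimental layer does, assuming BERT-base sizes: it fuses the dense projection and the reshape to per-head dimensions into a single einsum, so a layer whose input/output shapes don't fit that pattern can't simply be swapped in.

```python
import tensorflow as tf

num_attention_heads, attention_head_size = 12, 64  # BERT-base sizes

# Projects (batch, seq, hidden) directly to (batch, seq, heads, head_size),
# with a bias over the two trailing axes.
query = tf.keras.layers.experimental.EinsumDense(
    equation="abc,cde->abde",
    output_shape=(None, num_attention_heads, attention_head_size),
    bias_axes="de",
)

hidden_states = tf.random.normal((2, 8, 768))  # (batch, seq, hidden)
print(query(hidden_states).shape)  # (2, 8, 12, 64)
```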
What does this PR do?
This PR updates the way we implement the self-attention layers in order to align with the original BERT implementation's performance. Small breaking change: this improvement needs at least TF 2.3. This change has already been discussed with @thomwolf and he agreed, but it still needs the approval of @LysandreJik, @patrickvonplaten and @sgugger.
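Roughly, the new design keeps query/key/value as 4D (batch, seq, heads, head_size) tensors and computes attention scores with a single tf.einsum instead of explicit transposes and matmuls. A simplified sketch of the idea, not the exact code of this PR:

```python
import tensorflow as tf

batch, seq_len, num_heads, head_size = 2, 8, 12, 64

query = tf.random.normal((batch, seq_len, num_heads, head_size))
key = tf.random.normal((batch, seq_len, num_heads, head_size))

# Previous style: transpose to (batch, heads, seq, head_size), then matmul.
scores_old = tf.matmul(
    tf.transpose(query, perm=[0, 2, 1, 3]),
    tf.transpose(key, perm=[0, 2, 1, 3]),
    transpose_b=True,
)

# New style: one einsum over the 4D tensors, no explicit transposes.
scores_new = tf.einsum("aecd,abcd->acbe", key, query)

# Both give (batch, heads, query_pos, key_pos) attention scores.
tf.debugging.assert_near(scores_old, scores_new, atol=1e-5)
```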
@patrickvonplaten I have removed the check_copies comment in the Longformer model because I don't know this model well enough to apply the proper changes; I will apply this update model by model for the ones I know, but can you take this one?
@jlei2 As I'm on Windows and GPU profiling is not yet available in WSL, can you clone this branch and make sure that everything works as expected with your benchmark? Thanks!!
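For anyone wanting to reproduce the numbers, a sketch using the library's benchmarking utilities; the model, batch size, and sequence length are illustrative:

```python
from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments

# Benchmark inference of a BERT-base checkpoint at one batch size / sequence length.
args = TensorFlowBenchmarkArguments(
    models=["bert-base-uncased"],
    batch_sizes=[8],
    sequence_lengths=[128],
)
benchmark = TensorFlowBenchmark(args)
results = benchmark.run()
```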
Fixes #6771