Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you very much for this great contribution and for introducing one of the first bio-medical text models to transformers!
I left a couple of minor comments; overall the PR looks really good to me! The main point is that xxxLMHeadModel needs to be renamed to xxxForCausalLM. I also left a comment about layerdrop and output_hidden_states; let me know if this refactoring is possible!
Another point we should discuss is whether we can use a smaller model for the integration tests; let me know if you need help with that (no need to train the model).
Great effort on the integration side 💪 We should be very close to merging this!
@younesbelkada
Thanks @kamalkraj
younesbelkada
left a comment
Thanks a lot for the great efforts! I left a few minor comments that need to be double-checked; otherwise everything looks great to me!
    if output_hidden_states:
        all_hidden_states += (hidden_states,)
    dropout_probability = random.uniform(0, 1)
    if self.training and (dropout_probability < self.layerdrop):
        continue
Thanks for the clarification; I think you are right here and I am fine with this change. Let's keep it like this.
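The layerdrop behaviour quoted above can be sketched in isolation. This is a toy stand-in, not the transformers implementation: the function name, the list-of-callables layer stack, and the integer hidden state are all illustrative. Note that the hidden states are recorded before the drop check, which is the point the review discussed.

```python
import random

def forward_layers(hidden_states, layers, layerdrop=0.1, training=True):
    """Run a stack of layers, randomly skipping whole layers during
    training (LayerDrop). `layers` is a list of callables; this mirrors
    the quoted snippet's control flow, not the real API."""
    all_hidden_states = []
    for layer in layers:
        # Record hidden states BEFORE the drop check, as in the quoted code,
        # so output_hidden_states stays consistent whether or not a layer is skipped.
        all_hidden_states.append(hidden_states)
        if training and random.uniform(0, 1) < layerdrop:
            continue  # skip this whole layer
        hidden_states = layer(hidden_states)
    all_hidden_states.append(hidden_states)  # final hidden states
    return hidden_states, all_hidden_states
```

At evaluation time (`training=False`) no layer is ever skipped, so inference is deterministic while training still sees a shallower stochastic stack.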
    def test_batch_generation(self):
        model = BioGptLMHeadModel.from_pretrained("kamalkraj/biogpt")
        model.to(torch_device)
        tokenizer = BioGptTokenizer.from_pretrained("kamalkraj/biogpt")
But even if the model outputs gibberish I'd expect the logits to be different if someone changes the code in the future, so I guess we'll be able to flag it.
I think it's fine as it is, but let's see what others think for final approval, so let's keep this as unresolved for now.
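The batch-generation test above is sensitive to one of the PR's changes: positional embeddings computed from the attention mask, so that padded batches produce the same positions as unpadded single-sequence runs. A rough pure-Python sketch of that fairseq-style position computation, assuming pad positions all map to `padding_idx` (the helper name is hypothetical; the real model does this with tensor ops):

```python
def position_ids_from_mask(attention_mask, padding_idx=1):
    """Count positions over real tokens only, offset past padding_idx
    (fairseq convention); every pad position gets padding_idx itself."""
    position_ids = []
    for mask_row in attention_mask:
        pos, row = 0, []
        for m in mask_row:
            if m:
                pos += 1
                row.append(pos + padding_idx)  # real token: next position
            else:
                row.append(padding_idx)  # pad token: fixed padding position
        position_ids.append(row)
    return position_ids
```

With left-padding, a sequence padded as `[0, 0, 1, 1]` then gets the same positions for its real tokens as the same sequence run unpadded, which is what makes batched generation match single-sequence generation.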
younesbelkada
left a comment
Thanks so much for your very clean contribution of the first (if I am not mistaken) bio-medical model!
Clean code & documentation and nice tests 💪🏻
Leaving now the PR to @sgugger for a final review
Thanks!
sgugger
left a comment
Thanks for adding this model! Once all comments have been resolved, we'll need to move the checkpoints to the microsoft org and adapt all references in this PR :-)
    logger = logging.get_logger(__name__)

    BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
        "kamalkraj/biogpt": "https://huggingface.co/kamalkraj/biogpt/resolve/main/config.json",
    }
Checkpoints will need to be transferred to the microsoft org before we merge.
Let me know once the checkpoints are transferred and I can make the changes.
Thanks
@sgugger
Hi @kamalkraj
@younesbelkada Done the changes
thanks @kamalkraj !
@younesbelkada fixed
Thanks so much @kamalkraj !
sgugger
left a comment
Thanks again for all your work on this!
* biogpt initial commit
* updated init
* fix faster decoding with use_cache
* 1. fix input_ids and input_embeds with correct device 2. added _keys_to_ignore_on_load_missing 3. updated prepare_inputs_for_generation
* add activation_dropout and scale_embedding
* replace fsmt attention with bart attention
* added test
* run make fix-copies
* doc init and fix build
* updated README with proper information
* 1. added tips to docs 2. updated BioGptTokenizer func
* 1. added tokenizer test 2. refactor tokenizer
* make fixup
* add biogpt fairseq to hf converter
* updated layer names more similar to original checkpoints
* config update doc string and set defaults
* added "#copied" from bart model and updated doc strings
* enable model_input_names in tokenizer
* 1. positionalembedding depending on attention_mask 2. added attention mask to prepare for generation
* added test to verify past and generation
* BioGptLMHeadModel -> BioGptForCausalLM
* fix typo
* tokenization and test Copyright and updated assertion
* updated Copyright and one func at time in line
* Copyright updates and minor doc fix
* replace assertion with ValueError
* rm extra space
* added code syntax
* revert cmnt position change
* add tokenizer to auto
* updated doc string
* tokenizer doc string update
* biogpt hub model update to microsoft/biogpt
* make fixup
* rm cmnt to fix flake8 5.0.4 vs 6 error
What does this PR do?
Adding BioGPT
Original Implementation and weights - https://github.com/microsoft/BioGPT
Fixes # (issue)
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger @patrickvonplaten