Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you very much for this great contribution and for introducing one of the first bio-medical text models to transformers!
I left a couple of minor comments; overall the PR looks really good to me! The main point is that xxxLMHeadModel needs to be renamed to xxxForCausalLM. I also left a comment about layerdrop and output_hidden_states; let me know if this refactoring is possible!
Another point we should discuss is whether we can use a smaller model for the integration tests; let me know if you need help with that (no need to train the model).
Great effort on the integration side 💪 We should be very close to merging this!
@younesbelkada
Thanks @kamalkraj
younesbelkada
left a comment
Thanks a lot for the great efforts! I left a few minor comments that need to be double-checked; otherwise everything looks great to me!
    if output_hidden_states:
        all_hidden_states += (hidden_states,)
    dropout_probability = random.uniform(0, 1)
    if self.training and (dropout_probability < self.layerdrop):
        continue
Thanks for the clarification; I think you are right here and I am fine with this change. Let's keep it like this.
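The layerdrop behaviour quoted above can be sketched in isolation. This is a toy stand-in, not the transformers implementation: the function name, the list-of-callables layer stack, and the integer hidden state are all illustrative. Note that the hidden states are recorded before the drop check, which is the point the review discussed.

```python
import random

def forward_layers(hidden_states, layers, layerdrop=0.1, training=True):
    """Run a stack of layers, randomly skipping whole layers during
    training (LayerDrop). `layers` is a list of callables; this mirrors
    the quoted snippet's control flow, not the real API."""
    all_hidden_states = []
    for layer in layers:
        # Record hidden states BEFORE the drop check, as in the quoted code,
        # so output_hidden_states stays consistent whether or not a layer is skipped.
        all_hidden_states.append(hidden_states)
        if training and random.uniform(0, 1) < layerdrop:
            continue  # skip this whole layer
        hidden_states = layer(hidden_states)
    all_hidden_states.append(hidden_states)  # final hidden states
    return hidden_states, all_hidden_states
```

At evaluation time (`training=False`) no layer is ever skipped, so inference is deterministic while training still sees a shallower stochastic stack.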
    def test_batch_generation(self):
        model = BioGptLMHeadModel.from_pretrained("kamalkraj/biogpt")
        model.to(torch_device)
        tokenizer = BioGptTokenizer.from_pretrained("kamalkraj/biogpt")
But even if the model outputs gibberish I'd expect the logits to be different if someone changes the code in the future, so I guess we'll be able to flag it.
I think it's fine as it is, but let's see what others think for final approval, so let's keep this as unresolved for now.
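The batch-generation test above is sensitive to one of the PR's changes: positional embeddings computed from the attention mask, so that padded batches produce the same positions as unpadded single-sequence runs. A rough pure-Python sketch of that fairseq-style position computation, assuming pad positions all map to `padding_idx` (the helper name is hypothetical; the real model does this with tensor ops):

```python
def position_ids_from_mask(attention_mask, padding_idx=1):
    """Count positions over real tokens only, offset past padding_idx
    (fairseq convention); every pad position gets padding_idx itself."""
    position_ids = []
    for mask_row in attention_mask:
        pos, row = 0, []
        for m in mask_row:
            if m:
                pos += 1
                row.append(pos + padding_idx)  # real token: next position
            else:
                row.append(padding_idx)  # pad token: fixed padding position
        position_ids.append(row)
    return position_ids
```

With left-padding, a sequence padded as `[0, 0, 1, 1]` then gets the same positions for its real tokens as the same sequence run unpadded, which is what makes batched generation match single-sequence generation.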
younesbelkada
left a comment
Thanks so much for your very clean contribution of the first (if I am not mistaken) bio-medical model!
Clean code & documentation and nice tests 💪🏻
Leaving now the PR to @sgugger for a final review
Thanks!
sgugger
left a comment
Thanks for adding this model! Once all comments have been resolved, we'll need to move the checkpoints to the microsoft org and adapt all references in this PR :-)
    logger = logging.get_logger(__name__)

    BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
        "kamalkraj/biogpt": "https://huggingface.co/kamalkraj/biogpt/resolve/main/config.json",
    }
Checkpoints will need to be transferred to the microsoft org before we merge.
Let me know once the checkpoints are transferred and I can make the changes.
Thanks
@sgugger
Hi @kamalkraj
@younesbelkada Done the changes
thanks @kamalkraj !
@younesbelkada fixed
Thanks so much @kamalkraj !
sgugger
left a comment
Thanks again for all your work on this!
* biogpt initial commit
* updated init
* fix faster decoding with use_cache
* 1. fix input_ids and input_embeds with correct device 2. added _keys_to_ignore_on_load_missing 3. updated prepare_inputs_for_generation
* add activation_dropout and scale_embedding
* replace fsmt attention with bart attention
* added test
* run make fix-copies
* doc init and fix build
* updated README with proper information
* 1. added tips to docs 2. updated BioGptTokenizer func
* 1. added tokenizer test 2. refactor tokenizer
* make fixup
* add biogpt fairseq to hf converter
* updated layer names more similar to original checkpoints
* config update doc string and set defaults
* added "#copied" from bart model and updated doc strings
* enable model_input_names in tokenizer
* 1. positionalembedding depending on attention_mask 2. added attention mask to prepare for generation
* added test to verify past and generation
* BioGptLMHeadModel -> BioGptForCausalLM
* fix typo
* tokenization and test Copyright and updated assertion
* updated Copyright and one func at time in line
* Copyright updates and minor doc fix
* replace assertion with ValueError
* rm extra space
* added code syntax
* revert cmnt position change
* add tokenizer to auto
* updated doc string
* tokenizer doc string update
* biogpt hub model update to microsoft/biogpt
* make fixup
* rm cmnt to fix flake8 5.0.4 vs 6 error
What does this PR do?
Adding BioGPT
Original Implementation and weights - https://github.com/microsoft/BioGPT
Fixes # (issue)
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger @patrickvonplaten