
[docstring] Fix docstrings for CodeGenConfig, CodeGenTokenizer, CodeGenTokenizerFast#26821

Merged
ydshieh merged 7 commits into huggingface:main from daniilgaltsev:fix-codegen-docstring
Oct 19, 2023
Conversation

@daniilgaltsev
Contributor

What does this PR do?

Fixes the docstrings for CodeGen following #26638

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@daniilgaltsev
Contributor Author

@ydshieh Hi, can you review this?

@daniilgaltsev daniilgaltsev force-pushed the fix-codegen-docstring branch 2 times, most recently from f3ef48f to a757614 Compare October 16, 2023 10:01
@daniilgaltsev daniilgaltsev requested a review from ydshieh October 16, 2023 10:19
Collaborator

Looking at the usage of n_ctx in the file src/transformers/models/codegen/configuration_codegen.py, I think this is not the context length. Could you double-check it too 🙏.

Contributor Author

Sorry for the delayed response; I only now had some time to work on this. I've stepped through the initialization code line by line and, unless I am completely missing something, n_ctx doesn't affect anything.

In CodeGenConfig, n_ctx is only saved as an instance variable and is not present in attribute_map. In modeling_codegen.py, n_ctx appears in only two places:

  1. The docstring for CodeGenAttention._merge_heads (which is what made me think n_ctx was used to implicitly set the context window size).
  2. In CodeGenModel.__init__: self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads). Here it is used to compute self.rotary_dim, but the config (which is what the other code reads) is never updated, and self.rotary_dim is not used anywhere else in CodeGenModel.

To test this further, I wrote a script that sets n_ctx in the config to a bogus value; the model still works fine.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from transformers.models.codegen import CodeGenForCausalLM

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
if True:
    config.n_ctx = -1  # This works fine
else:
    config.rotary_dim = -1  # This throws a RuntimeError during init

# import pdb; pdb.set_trace()
model = CodeGenForCausalLM(config=config)
# model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config)  # This behaves the same
model.transformer.rotary_dim = "a"  # This does not break anything

text = "def hello_world():"
completion = model.generate(**tokenizer(text, return_tensors="pt"))
print(tokenizer.decode(completion[0]))

Does this seem right to you, or are there other aspects I should consider? If so, what should be done with n_ctx? Should we just state in the docstring that it is unused, so as not to break anyone downstream who might still be reading it?

Collaborator

Hi @daniilgaltsev, you are right. This attribute is kind of confusing, but from the two places you mentioned, it looks more like the dimension of the rotary embedding. Let's just describe it that way, despite its confusing usage.

Thank you a lot!

Contributor Author

Sorry, I still don't quite understand. Are you saying that n_ctx is actually used somewhere (as rotary_dim)? I can't find any place where it affects anything beyond setting an unused instance variable. To me it looks like either a bug, or code that was superseded by explicitly adding rotary_dim to the config, with the old code never deleted.

Collaborator

No, I'm not saying it is really used. But from the two places

config.n_ctx // config.num_attention_heads

and

Merges attn_head_size dim and num_attn_heads dim into n_ctx

it looks like n_ctx means the dimension of something, rather than "Number of tokens provided in a single pass" (sequence length), despite the name suggesting a sequence length.
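The expression under discussion can be reproduced in isolation. A minimal sketch, using a stand-in config object; the attribute values here are illustrative, not necessarily those of any real checkpoint:

```python
from types import SimpleNamespace

# Stand-in for CodeGenConfig; illustrative values, not a real checkpoint's.
config = SimpleNamespace(rotary_dim=32, n_ctx=2048, num_attention_heads=16)

# The line from CodeGenModel.__init__ being discussed. With typical values,
# n_ctx // num_attention_heads (here 128) exceeds rotary_dim, so the min()
# resolves to rotary_dim and n_ctx has no effect on the result.
rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads)
print(rotary_dim)
```

This illustrates why the attribute reads like a dimension rather than a sequence length: it only enters the computation divided by the number of attention heads.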

Contributor Author

Ok, I see. But what should we put in its description? Since it does not affect anything, it would be misleading to suggest otherwise, because people would expect to see an effect.

Collaborator

Just put

This attribute is used in `CodeGenModel.__init__` without any real effect.

We can add a warning later in another PR. Let's move on and merge this.

Contributor Author

Ok, I've pushed the updated n_ctx description. Thank you for the help.

Collaborator

@ydshieh ydshieh left a comment

Just a few suggestions and one question about n_ctx 🙏

Collaborator

@ydshieh ydshieh left a comment

Thank you @daniilgaltsev

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ydshieh ydshieh merged commit ad08137 into huggingface:main Oct 19, 2023
@daniilgaltsev daniilgaltsev deleted the fix-codegen-docstring branch October 24, 2023 10:01
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* remove docstrings CodeGen from objects_to_ignore

* autofix codegen docstrings

* fill in the missing types and docstrings

* fixup

* change descriptions to be in a separate line

* apply docstring suggestions from code review

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>

* update n_ctx description in CodeGenConfig

---------

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>