[docstring] Fix docstrings for CodeGenConfig, CodeGenTokenizer, CodeGenTokenizerFast #26821
Conversation
@ydshieh Hi, can you review this?
force-pushed from f3ef48f to a757614
Looking at the usage of `n_ctx` in the file src/transformers/models/codegen/configuration_codegen.py, I think this is not the context length. Could you double-check it too 🙏.
Sorry for the delayed response; only now have I had some time to work on this. I've stepped through the initialization code line by line and, if I am not completely missing something, `n_ctx` doesn't affect anything.
In `CodeGenConfig`, `n_ctx` is only saved as an instance variable and is not present in `attribute_map`. In modeling_codegen.py, `n_ctx` appears in only 2 places:
- The docstring for `CodeGenAttention._merge_heads` (that's what made me think that `n_ctx` was used to implicitly set the context window size)
- In `CodeGenModel.__init__`: `self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads)`. Here it is used to set `self.rotary_dim`, but it is not updated in the config, which is used in other places. And `self.rotary_dim` is not used in `CodeGenModel`.
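To make that data flow concrete, here is a minimal stand-alone sketch (plain Python, no transformers required; `FakeConfig` and `init_rotary_dim` are hypothetical names, not library code) showing that the `min(...)` result only lands in a local value while the config object is never written back:

```python
# Hypothetical sketch of the single place n_ctx is read; not transformers code.

class FakeConfig:
    def __init__(self, rotary_dim, n_ctx, num_attention_heads):
        self.rotary_dim = rotary_dim
        self.n_ctx = n_ctx
        self.num_attention_heads = num_attention_heads


def init_rotary_dim(config):
    # Mirrors the line in CodeGenModel.__init__:
    # self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads)
    return min(config.rotary_dim, config.n_ctx // config.num_attention_heads)


cfg = FakeConfig(rotary_dim=64, n_ctx=2048, num_attention_heads=16)
local_rotary_dim = init_rotary_dim(cfg)

print(local_rotary_dim)  # 64: min(64, 2048 // 16) = min(64, 128)
print(cfg.rotary_dim)    # still 64: the config itself is never updated
```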
To further test this, I wrote a script which sets `n_ctx` in the config to a bogus value, and everything works fine.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from transformers.models.codegen import CodeGenForCausalLM

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

if True:
    config.n_ctx = -1  # This works fine
else:
    config.rotary_dim = -1  # This throws a RuntimeError during init

# import pdb; pdb.set_trace()
model = CodeGenForCausalLM(config=config)
# model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config)  # This behaves the same
model.transformer.rotary_dim = "a"  # This does not break anything

text = "def hello_world():"
completion = model.generate(**tokenizer(text, return_tensors="pt"))
print(tokenizer.decode(completion[0]))
```

Does this seem right to you, or are there other aspects I should consider? Otherwise, what should be done with `n_ctx` (just write in a docstring that it's not used, so as not to break anyone downstream who might be using it)?
Hi @daniilgaltsev You are right, this attribute is kinda confusing. But from the two places you mentioned, it looks more like the dimension of the rotary embedding. Let's just keep it this way, despite its confusing usage.
Thank you a lot!
Sorry, I still don't quite understand. Are you saying that n_ctx is actually used somewhere (as rotary_dim)? Because I can't find a place where it actually affects anything except setting an unused instance variable. To me it looks more like either a bug or code that was replaced by explicitly adding rotary_dim to config and forgetting to delete the old unused code.
No, I'm not saying it is really used. But from the two places
`config.n_ctx // config.num_attention_heads`
and
"Merges attn_head_size dim and num_attn_heads dim into n_ctx"
it looks like `n_ctx` means the dimension of something rather than "Number of tokens provided in a single pass" (sequence length), even though the name suggests it is a sequence length.
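For what it's worth, a shape-level sketch (pure Python; `merge_heads_shape` is a hypothetical helper, not transformers code) of what the quoted `_merge_heads` docstring describes. The merged axis is `num_attn_heads * attn_head_size` (the hidden size), not a sequence length, which is exactly why calling it `n_ctx` is confusing:

```python
def merge_heads_shape(shape, num_attn_heads, attn_head_size):
    # (batch, seq_len, num_attn_heads, attn_head_size)
    #   -> (batch, seq_len, num_attn_heads * attn_head_size)
    *leading, heads, head_dim = shape
    assert (heads, head_dim) == (num_attn_heads, attn_head_size)
    return tuple(leading) + (heads * head_dim,)


# The merged axis is the hidden size (16 * 64 = 1024), not the 10-token sequence.
print(merge_heads_shape((2, 10, 16, 64), 16, 64))  # (2, 10, 1024)
```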
Ok, I see. But what should we put into its description? Since it does not affect anything, it would be misleading to indicate otherwise, because people would expect to see an effect.
Just put
This attribute is used in `CodeGenModel.__init__` without any real effect.
We can add a warning later in another PR. Let's move on and merge this.
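For illustration, a sketch (hypothetical class name, and not necessarily the exact merged wording) of where such a note would sit in the config docstring, following the usual `Args:` format:

```python
class CodeGenConfigSketch:
    """Sketch of a CodeGen-style config docstring entry (hypothetical).

    Args:
        n_ctx (`int`, *optional*, defaults to 2048):
            This attribute is used in `CodeGenModel.__init__` without any real effect.
    """


print("without any real effect" in CodeGenConfigSketch.__doc__)  # True
```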
Ok, I've pushed the updated n_ctx description. Thank you for the help.
ydshieh
left a comment
Just a few suggestions and one question for `n_ctx` 🙏
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
force-pushed from 59b9d61 to f9d0f5a
ydshieh
left a comment
Thank you @daniilgaltsev
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
* remove docstrings CodeGen from objects_to_ignore
* autofix codegen docstrings
* fill in the missing types and docstrings
* fixup
* change descriptions to be in a separate line
* apply docstring suggestions from code review

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>

* update n_ctx description in CodeGenConfig

---------

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
What does this PR do?
Fixes the docstrings for CodeGen following #26638
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.