Support auto_docstring in Processors
#42101
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
[For maintainers] Suggested jobs to run (before merge): run-slow: align, altclip, aria, aya_vision, bamba, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp

Also cc @stevhliu :)
Cyrilvallez
left a comment
Trusting you on that, but I think it would be time to add some proper tests, no? I see a very old `test_auto_docstrings.py` but it does not run any tests -> probably a very nice idea to start rewriting it!
```python
intro = f"""Constructs a {class_name} which wraps {components_text} into a single processor.
[`{class_name}`] offers all the functionalities of {classes_text}. See the
{classes_text_short} for more information.
"""
```
nit: can we use `textwrap.dedent` here, so that the string respects the function indentation?
Yep, it's done right after.
Humm, I don't see it 😅 I meant doing something like

```python
intro = textwrap.dedent(
    """
    bla
    bla
    more bla
    """
).strip()
```

so that the indentation stays inside the function.
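For context, here is a minimal runnable sketch of the pattern being suggested (function name and docstring text are illustrative, not from the PR):

```python
import textwrap


def build_intro() -> str:
    # The string literal can stay indented at the function's level;
    # dedent() removes the common leading whitespace, and strip()
    # drops the leading/trailing newlines from the triple quotes.
    intro = textwrap.dedent(
        """
        Constructs a processor which wraps several components
        into a single processor.
        """
    ).strip()
    return intro


print(build_intro())
```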
Yes clearly! I'll add tests in the next auto_docstring PR ;)
stevhliu
left a comment
nice, thanks! added a few nits to the parameter definitions :)
```python
chat_template = {
    "description": """
A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
```
```diff
- A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
+ A Jinja template to convert lists of messages in a chat into a tokenizable string.
```
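As a usage note (illustrative, not part of the diff): the chat template is what `apply_chat_template` uses to turn a message list into a single tokenizable string. Assuming any chat checkpoint that bundles a template:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint choice; any chat model with a chat template works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]
# Render the messages through the Jinja chat template without tokenizing.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```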
```python
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
```
```diff
- The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
- (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
- `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+ (pretokenized string). If you pass a pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.
```
| """, | ||
| } | ||
|
|
||
| audio = { |
Just curious, what's the difference between `audio` and `audios` below it?
I think `audios` is deprecated but still present in some places.
```python
pad_to_multiple_of = {
    "description": """
If set will pad the sequence to a multiple of the provided value. Requires `padding` to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
```
```diff
- This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+ This is especially useful to enable using Tensor Cores on NVIDIA hardware with compute capability
```
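For illustration (not from the diff): `pad_to_multiple_of` is used together with `padding` when calling a tokenizer, rounding the padded length up:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint; any tokenizer behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["short", "a somewhat longer example sentence"],
    padding=True,           # pad to the longest sequence in the batch...
    pad_to_multiple_of=8,   # ...then round that length up to a multiple of 8
)
print(len(batch["input_ids"][0]))  # padded length, a multiple of 8
```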
```python
add_special_tokens = {
    "description": """
Whether or not to add special tokens when encoding the sequences. This will use the underlying
`PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
```
```diff
- `PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
+ [`PretrainedTokenizerBase.build_inputs_with_special_tokens`] function, which defines which tokens are
```
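For illustration (assuming a BERT-style tokenizer): the flag controls whether markers such as `[CLS]` and `[SEP]` are inserted around the encoded text:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
with_special = tokenizer("hello world")["input_ids"]
without_special = tokenizer("hello world", add_special_tokens=False)["input_ids"]
# with_special contains the [CLS] and [SEP] ids; without_special does not.
print(len(with_special) - len(without_special))  # 2 for BERT-style tokenizers
```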
```python
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
```
```diff
- list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
- you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+ list of strings (pretokenized string). If you pass pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.
```
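To make the ambiguity concrete (illustrative example, not from the diff): a list of strings can mean either a batch of two texts or one pretokenized text, and the flag disambiguates:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Without the flag, ["Hello", "world"] is treated as a batch of two sequences;
# with it, the list is treated as one already-split sequence.
encoded = tokenizer(["Hello", "world"], is_split_into_words=True)
print(encoded["input_ids"])
```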
What does this PR do?
Add support for Processors in `@auto_docstring`, along with many other improvements to `auto_docstring.py` and `check_docstrings.py`, including a more robust auto-fix with `check_docstrings` for missing, redundant, or unnecessary docstrings.
For processors, `auto_docstring` will pull custom args docstrings from custom "Kwargs" TypedDicts and add them to the `__doc__`. For example, for `processing_aria`, the custom Kwargs definitions produce the corresponding processor docstring (the exact before/after is shown in the PR; a generic sketch of the pattern follows below).
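A minimal sketch of the kind of "Kwargs" TypedDict this refers to, modeled on the existing `ProcessingKwargs` pattern in `transformers.processing_utils` (class names and default values below are hypothetical, not the actual Aria definitions):

```python
from typing import Optional

from transformers.processing_utils import ImagesKwargs, ProcessingKwargs


class DummyImagesKwargs(ImagesKwargs, total=False):
    # Custom, model-specific image kwargs; with this PR, auto_docstring can
    # pull their documented descriptions into the processor's __doc__.
    split_image: Optional[bool]
    max_image_size: Optional[int]


class DummyProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: DummyImagesKwargs
    _defaults = {
        "images_kwargs": {"split_image": False, "max_image_size": 980},
    }
```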