Conversation

@yonigozlan (Member) commented Nov 7, 2025

What does this PR do?

Add support for Processors in @auto_docstring, plus many other improvements to auto_docstring.py and check_docstrings.py, including a more robust auto-fix in check_docstrings for missing, redundant, or unnecessary docstrings.

For processors, auto_docstring pulls custom argument docstrings from the custom "Kwargs" TypedDicts and adds them to the method's `__doc__`. For example, for processing_aria, we have:

class AriaImagesKwargs(ImagesKwargs, total=False):
    """
    split_image (`bool`, *optional*, defaults to `False`):
        Whether to split large images into multiple crops. When enabled, images exceeding the maximum size are
        divided into overlapping crops that are processed separately and then combined. This allows processing
        of very high-resolution images that exceed the model's input size limits.
    max_image_size (`int`, *optional*, defaults to `980`):
        Maximum image size (in pixels) for a single image crop. Images larger than this will be split into
        multiple crops when `split_image=True`, or resized if splitting is disabled. This parameter controls
        the maximum resolution of individual image patches processed by the model.
    min_image_size (`int`, *optional*):
        Minimum image size (in pixels) for a single image crop. Images smaller than this will be upscaled to
        meet the minimum requirement. If not specified, images are processed at their original size (subject
        to the maximum size constraint).
    """

    split_image: bool
    max_image_size: int
    min_image_size: int


class AriaProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: AriaImagesKwargs

    _defaults = {
        "text_kwargs": {
            "padding": False,
            "return_mm_token_type_ids": False,
        },
        "images_kwargs": {
            "max_image_size": 980,
            "split_image": False,
        },
        "return_tensors": TensorType.PYTORCH,
    }


@auto_docstring
class AriaProcessor(ProcessorMixin):
    ...

    @auto_docstring
    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]],
        images: Optional[ImageInput] = None,
        **kwargs: Unpack[AriaProcessorKwargs],
    ) -> BatchFeature:
        r"""
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_mask** -- Pixel mask to be fed to a model. Returned when `images` is not `None`.
        """
        ...

which results in the following docstring:

print(AriaProcessor.__call__.__doc__)
        Args:
            text (`Union[str, list, list]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list, list, list]`, *optional*):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            split_image (`bool`, *optional*, defaults to `False`):
                Whether to split large images into multiple crops. When enabled, images exceeding the maximum size are
                divided into overlapping crops that are processed separately and then combined. This allows processing
                of very high-resolution images that exceed the model's input size limits.
            max_image_size (`int`, *optional*, defaults to `980`):
                Maximum image size (in pixels) for a single image crop. Images larger than this will be split into
                multiple crops when `split_image=True`, or resized if splitting is disabled. This parameter controls
                the maximum resolution of individual image patches processed by the model.
            min_image_size (`int`, *optional*):
                Minimum image size (in pixels) for a single image crop. Images smaller than this will be upscaled to
                meet the minimum requirement. If not specified, images are processed at their original size (subject
                to the maximum size constraint).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_mask** -- Pixel mask to be fed to a model. Returned when `images` is not `None`.
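The mechanism can be illustrated with a minimal standalone sketch: a decorator reads the argument entries documented on a "Kwargs" TypedDict and appends them to the wrapped method's docstring. The names below (`merge_kwargs_docstring`, `DemoImagesKwargs`, `DemoProcessor`) are hypothetical and the logic is heavily simplified; this is not the actual transformers implementation.

```python
# Sketch only: append the per-arg docs from a "Kwargs" TypedDict to a
# method's existing docstring. Hypothetical names, not transformers code.
import textwrap
from typing import TypedDict


class DemoImagesKwargs(TypedDict, total=False):
    """
    split_image (`bool`, *optional*, defaults to `False`):
        Whether to split large images into multiple crops.
    """

    split_image: bool


def merge_kwargs_docstring(kwargs_cls):
    """Append the arg entries documented on `kwargs_cls` to the decorated
    function's own docstring."""

    def decorator(func):
        extra = textwrap.dedent(kwargs_cls.__doc__ or "").strip()
        base = textwrap.dedent(func.__doc__ or "").strip()
        func.__doc__ = base + "\n" + extra
        return func

    return decorator


class DemoProcessor:
    @merge_kwargs_docstring(DemoImagesKwargs)
    def __call__(self, images=None, **kwargs):
        """
        Args:
            images (`ImageInput`, *optional*):
                Image to preprocess.
        """


print(DemoProcessor.__call__.__doc__)
```

The printed docstring contains both the explicitly written `images` entry and the `split_image` entry lifted from the TypedDict, which is the shape of the merged output shown above.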

yonigozlan and others added 30 commits October 15, 2025 15:47
* Super

* Super

* Super

* Super

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* detectron2 - part 1

* detectron2 - part 2

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…ingface#41978)

fix autoawq[kernels]

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yonigozlan changed the title from "[WIP] Support auto_doctring in Processors" to "Support auto_doctring in Processors" on Jan 6, 2026
@github-actions bot (Contributor) commented Jan 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: align, altclip, aria, aya_vision, bamba, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp

@yonigozlan (Member Author): Also Cc @stevhliu :)

@Cyrilvallez (Member) left a comment:
Trusting you on that, but I think it would be time to add some proper tests no? I see a very old test_auto_docstrings.py but that does not run any tests -> probably a very nice idea to start rewriting it!

Comment on lines +1452 to +1456
intro = f"""Constructs a {class_name} which wraps {components_text} into a single processor.
[`{class_name}`] offers all the functionalities of {classes_text}. See the
{classes_text_short} for more information.
"""
nit: can we use textwrap.dedent here, so that the string respects the function indentation?

@yonigozlan (Member Author): Yep, it's done right after

@Cyrilvallez (Member) commented Jan 8, 2026:
Humm, I don't see it 😅 I meant doing something like

    intro = textwrap.dedent(
        """
        bla
        bla

        more bla
        """
    ).strip()

so that the indentation stays inside the function
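For illustration, the suggested `textwrap.dedent(...).strip()` pattern behaves as follows on an indented literal (a standalone sketch, not the PR's actual intro-building code):

```python
import textwrap


def build_intro():
    # The literal keeps the function's indentation; dedent() removes the
    # common leading whitespace and strip() drops the outer blank lines.
    intro = textwrap.dedent(
        """
        Constructs a processor which wraps several components.

        See the individual docstrings for more information.
        """
    ).strip()
    return intro


print(build_intro())
```

Every line of the returned string is flush left, so the string can be defined at function indentation without leaking that indentation into the docstring.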

@yonigozlan (Member Author):

> Trusting you on that, but I think it would be time to add some proper tests no? I see a very old test_auto_docstrings.py but that does not run any tests -> probably a very nice idea to start rewriting it!

Yes clearly! I'll add tests in the next autodocstring PR ;)

@stevhliu (Member) left a comment: nice, thanks! Added a few nits to the parameter definitions :)


chat_template = {
"description": """
A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
Suggested change
A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
A Jinja template to convert lists of messages in a chat into a tokenizable string.

Comment on lines 305 to 307
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
Suggested change
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If you pass a pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.

""",
}

audio = {
just curious, whats the difference between audio and audios below it?

@yonigozlan (Member Author): I think `audios` is deprecated but still present in some places

pad_to_multiple_of = {
"description": """
If set will pad the sequence to a multiple of the provided value. Requires `padding` to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
Suggested change
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
This is especially useful to enable using Tensor Cores on NVIDIA hardware with compute capability

add_special_tokens = {
"description": """
Whether or not to add special tokens when encoding the sequences. This will use the underlying
`PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
Suggested change
`PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
[`PretrainedTokenizerBase.build_inputs_with_special_tokens`] function, which defines which tokens are

Comment on lines 484 to 485
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
Suggested change
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
list of strings (pretokenized string). If you pass pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.

Comment on lines 493 to 494
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
Suggested change
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
list of strings (pretokenized string). If you pass pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.

@yonigozlan yonigozlan enabled auto-merge (squash) January 7, 2026 20:41
@yonigozlan yonigozlan merged commit c8bc4de into huggingface:main Jan 8, 2026
25 checks passed