Uniformize model processors (models *with* special arg names) #32841
Uniformize model processors (models *with* special arg names) #32841leloykun wants to merge 10 commits intohuggingface:mainfrom
Conversation
molbap
left a comment
There was a problem hiding this comment.
Thanks for the contribution! I think handling the text_pair etc kwargs is a bit more challenging, we should be able to use it solely with TypedDicts and without adding extra args
There was a problem hiding this comment.
Hm, it is not robust to rely on audio here - or at least we have to explicitly comment why we are doing that.
There was a problem hiding this comment.
I think that's what we also do in UDOP. Maybe a comment explaining is enough as audio arg-name isn't expected in Nougat
There was a problem hiding this comment.
yup, I just followed the code here: #32544 (comment)
without this, old code that passed the args as positional arguments would break
I'll add comments & warnings for now, but lemme guys know what you think is best
There was a problem hiding this comment.
I commented on the other PR as well. I missed it for Udop but I don't think it's a good precedent unfortunately - it's a bit hacky as it stands, trying to see if we can do it some other way that's cleaner
There was a problem hiding this comment.
Thanks for raising this! I have a new approach which I described in more detail here: #31911 (comment)
pls lemme know what you think abt it!
831e64c to
257c690
Compare
zucchini-nlp
left a comment
There was a problem hiding this comment.
Thanks for this awesome work!
The workaround seems a bit hacky but I guess we need that for BC. Can we add tests for the newly added prepare_and_validate_optional_call_arg in the models that support extra args? And let's add a version when we'll stop handling BC for users, it can be until v4.47 which gives two major versions for users
| raise ValueError( | ||
| f"Expected *at most* {len(self.optional_call_args)} optional positional arguments in processor call but received {len(args)}." | ||
| "Passing positional arguments to the processor call is not recommended" | ||
| ) |
There was a problem hiding this comment.
This message imo isn't very informative for users who have no idea what is happening under the hood. Maybe we could show optional_call_args names and say we couldn't map positional args to those
There was a problem hiding this comment.
I've just changed the error message
@zucchini-nlp can you check if the new one's okay?
| text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None, | ||
| images: Optional[ImageInput] = None, |
There was a problem hiding this comment.
another thing we're doing now is swap the arg order, so that it is image, text, audio, videos. And that needs another deprecation cycle...
BTW, i am quite out of the loop, do we need this order-swapping for pipeline @yonigozlan ?
| The channel dimension format of the input image. If not provided, it will be inferred. | ||
| """ | ||
| input_height, input_width = get_image_size(image, channel_dim=input_data_format) | ||
| size = get_size_dict(size) |
There was a problem hiding this comment.
not clear why we needed these changes, was this causing CI failure?
There was a problem hiding this comment.
these are utils for old processors that don't support the new image size format yet
we might as well add these here since (1) they help w/ backwards compatibility, (2) make the image-text-to-text pipeline easier to implement, & (3) they just revert to a no-op if size already follows the new image size format
| text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None, | ||
| images: Optional[ImageInput] = None, |
There was a problem hiding this comment.
Oh, and also, these two should be swapped so that the order is image, text, kwargs
| text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None, | ||
| images: Optional[ImageInput] = None, | ||
| # The following is to capture `visual_prompt` argument that may be passed as a positional argument. |
| text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] | ||
| text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] | ||
| text_pair_target: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] |
There was a problem hiding this comment.
Should these be added here and not be treated as model-specific args?
There was a problem hiding this comment.
they're in the base tokenizer class so I figured it's okay to put them here
lemme know if I should remove them
There was a problem hiding this comment.
Personally for me it's okey to have it here, as it is a general arg accepted by all tokenizers, though not used by all processors
There was a problem hiding this comment.
Sounds good, that will remove the need for special args for Udop so that's great :)
|
I'll also swap the This PR should now be ready for review |
|
Hi @leloykun ! Pinging this as blocking requirements for this PR should all be completed soon! Notably the function to check that the images and text inputs are in the correct order is merged. Here's how it is used in LlaVa for example. Also I think your |
|
Hi again @leloykun , I opened a PR with some of your changes to processing_utils.py and test_processing_common.py here #33479 . If you think you will have the bandwidth to work on this, feel free to open a PR and I'll close mine. In any case, thanks a lot for your contributions on this! |
What does this PR do?
[text, images, audio, videos]and not config values for the tokenizer, image processor, etc.(arguments that carry data
TODO:
Fixes # (issue)
Who can review?
@zucchini-nlp @molbap @NielsRogge