Uniformize model processors (models w/o special arg names) #32845
leloykun wants to merge 25 commits into huggingface:main
Conversation
same remark as in #32841, we should not be using kwargs for a purpose other than the one advertised!
force-pushed from 8a2a2c8 to 0e8e99e
…lipvideo, llava_next, llava_next_video, siglip, video_llava, vilt
force-pushed from 1380596 to ce9cc73
zucchini-nlp left a comment
Thanks for this awesome work! I left some general comments, mostly nits about cleaning up docstrings and clarification questions for my own understanding.
Additionally, we have to return BatchFeature and not BatchEncoding in the processor's call,
and Chameleon needs to be removed from this PR, as it now has its own PR
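The BatchFeature-vs-BatchEncoding point can be sketched as follows. This is a minimal stand-in (the real BatchFeature lives in transformers.feature_extraction_utils and additionally converts values to tensors); the helper name is illustrative, not the actual processor code:

```python
class BatchFeature(dict):
    """Stand-in for transformers.BatchFeature: a dict of model inputs.

    The real class also converts values to tensors according to tensor_type.
    """

    def __init__(self, data=None, tensor_type=None):
        super().__init__(data or {})
        self.tensor_type = tensor_type


def merge_processor_outputs(text_encoding, image_features, return_tensors=None):
    # A processor __call__ should merge tokenizer and image-processor
    # outputs into a single BatchFeature, not a tokenizer BatchEncoding.
    return BatchFeature(data={**text_encoding, **image_features}, tensor_type=return_tensors)


out = merge_processor_outputs({"input_ids": [[1, 2]]}, {"pixel_values": [[0.5]]}, return_tensors="pt")
```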
    images: Optional[VideoInput] = None,
    text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
    audio=None,
    videos=None,
Hmm, this is quite interesting, since we accept videos as an argument but we still leave it as images = VideoInput.
Okay, let's leave it for BC. I need to spend some time on making the new video models more standard; currently some work with images and some with videos. But that will not be soon, I guess
@zucchini-nlp I've made these changes:
- we now allow the user to use videos instead of images
- using images now results in a FutureWarning
- using both videos and images now results in an error
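The three rules above can be sketched with a hypothetical helper (the function name and messages are illustrative, not the actual PR code):

```python
import warnings


def resolve_video_input(images=None, videos=None):
    # Hypothetical helper sketching the BC behavior listed above:
    # `videos` is the new argument name, `images` still works but emits
    # a FutureWarning, and passing both is an error.
    if images is not None and videos is not None:
        raise ValueError("Pass video inputs via `videos` only, not via both `images` and `videos`.")
    if images is not None:
        warnings.warn(
            "Passing videos through `images` is deprecated; use `videos` instead.",
            FutureWarning,
        )
        return images
    return videos
```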
src/transformers/models/instructblipvideo/processing_instructblipvideo.py
    if isinstance(component_class_name, tuple):
        if "_fast" in component_class_name[0]:
            component_class_name = component_class_name[0]
        else:
            component_class_name = component_class_name[1]
Not very clear why we had to overwrite this; is it because we need to work only with fast tokenizers?
If that's the case, there should usually be no difference in how text is encoded with fast vs. slow tokenizers, so we should rather dig into why there is a difference
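For reference, the selection logic under discussion can be isolated as a small function (the name is illustrative; this mirrors the quoted diff, not the real test helper):

```python
def pick_tokenizer_class_name(component_class_name):
    # Sketch of the override quoted above: an auto-mapping entry may be a
    # 2-tuple of class names, and this picks whichever member looks like
    # the fast tokenizer (first element if it contains "_fast", else second).
    if isinstance(component_class_name, tuple):
        if "_fast" in component_class_name[0]:
            return component_class_name[0]
        return component_class_name[1]
    return component_class_name
```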
I'll make the Chameleon-related changes in this PR: #32181
After that gets merged, I'll rebase this onto main
@zucchini-nlp I just added 4 more models:
zucchini-nlp left a comment
Perfect, thanks for adding them! I left some general questions, but I think we will first focus on and merge a smaller PR, to decide how to handle the edge cases we've been discussing
        return_token_type_ids=False,
        **kwargs,
    )
    textual_input = self.tokenizer(text, **output_kwargs["text_kwargs"])
this actually doesn't match the removed lines. Previously some kwargs were hardcoded, like truncation or padding, and now users can set those to any value.
TBH I don't know why these were hardcoded, but I'd leave it as is to not break anything. In that case we don't need the pad_to_max_length kwarg, as it is going to be padded to max length no matter what
I moved them to the default values in TvpProcessorKwargs. I think that suffices, since this part would have errored out anyway if downstream users also passed them as kwargs (duplicate-argument error).
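The defaults-instead-of-hardcoding idea can be sketched as follows. Names are illustrative (the real TvpProcessorKwargs is a transformers ProcessingKwargs subclass, not a plain dict); the point is that user kwargs now override the defaults rather than colliding with hardcoded values:

```python
# Hedged sketch: previously hardcoded tokenizer kwargs moved into
# processor-level defaults.
TVP_TEXT_KWARGS_DEFAULTS = {
    "padding": "max_length",        # previously hardcoded in __call__
    "truncation": True,             # previously hardcoded in __call__
    "return_token_type_ids": False,
}


def build_text_kwargs(**user_kwargs):
    # Defaults first, user overrides second: no duplicate-argument error,
    # and unspecified kwargs keep the old hardcoded behavior.
    return {**TVP_TEXT_KWARGS_DEFAULTS, **user_kwargs}
```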
    return_tensors = output_kwargs["common_kwargs"].get("return_tensors")
    if text is not None and videos is not None:
        encoding["pixel_values"] = image_features.pixel_values
-       return encoding
+       return BatchFeature(data=dict(**encoding, **image_features), tensor_type=return_tensors)
    elif text is not None:
-       return encoding
+       return BatchFeature(data=dict(**encoding), tensor_type=return_tensors)
    else:
-       return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
+       return BatchFeature(data=dict(**image_features), tensor_type=return_tensors)
We can simplify this and other processors by assigning encoding = image_features = {} at the start, processing each input if present, and always returning:
BatchFeature(data=dict(**encoding, **image_features), tensor_type=return_tensors)
@zucchini-nlp I've implemented it slightly differently, but it's now a lot cleaner than before
@yonigozlan @molbap you guys might wanna take a look too
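The suggested simplification can be sketched end to end. This is a stand-in, not the merged code: a plain dict stands in for BatchFeature, and the callables are injected so the sketch stays self-contained:

```python
def processor_call_sketch(text=None, videos=None, tokenize=None,
                          extract_video_features=None, return_tensors=None):
    # Start both payloads as empty dicts, fill whichever modality is
    # present, and return a single merged payload with no per-case
    # branching (a dict stands in for BatchFeature here).
    encoding, image_features = {}, {}
    if text is not None:
        encoding = tokenize(text)
    if videos is not None:
        image_features = extract_video_features(videos)
    return {"data": {**encoding, **image_features}, "tensor_type": return_tensors}


out = processor_call_sketch(
    text="hi",
    videos=["clip"],
    tokenize=lambda t: {"input_ids": [[1]]},
    extract_video_features=lambda v: {"pixel_values": [[0.0]]},
    return_tensors="np",
)
```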
| "videos" not in inspect.signature(self.processor_class.__call__).parameters | ||
| or inspect.signature(self.processor_class.__call__).parameters["videos"].annotation == inspect._empty |
Is this because of InstructBlipVideo? If it's only that, we'd better override this test for InstructBlipVideo and leave the check as "video_processor" in classes.
When overriding for TVP, there's no need to check, as we know whether the model has a video processor or not. Actually, it seems like it's going to be skipped anyway, because TVP only has an image_processor
This is for:
- InstructBlipVideo
- TVP
- Video Llava
- X-Clip
Only Llava-Next-Video actually has a video_processor_class
I'm leaning toward leaving this here, because it covers more models than the original check and it significantly reduces LoCs
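The skip condition being debated can be isolated as a small predicate (inverted here for readability; the helper and sample functions are illustrative, not test code from the PR). inspect.Parameter.empty is the public spelling of the inspect._empty sentinel used in the quoted check:

```python
import inspect


def has_typed_videos_param(call):
    # True when the callable both declares a `videos` parameter and gives
    # it a type annotation; the PR's check skips the test when this is False.
    params = inspect.signature(call).parameters
    return "videos" in params and params["videos"].annotation is not inspect.Parameter.empty


# Sample callables standing in for processor __call__ signatures:
def typed(videos: list = None): ...
def untyped(videos=None): ...
def absent(text=None): ...
```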
force-pushed from a44a76a to 9e00f68
    class TvpProcessorTest(ProcessorTesterMixin, unittest.TestCase):
        from_pretrained_id = "Jiqing/tiny-random-tvp"
        processor_class = TvpProcessor
        videos_data_arg_name = "pixel_values"
@zucchini-nlp just a heads up, because I remember you mentioning that you plan on reworking the text-video models:
the output key for video data is inconsistent; some models use pixel_values_videos while others just use pixel_values. You'd most likely need to uniformize/standardize those too
Yes, it's me who started using pixel_values_videos hehe, thanks for letting me know :)
What does this PR do?
TODO:
Fixes # (issue)
Who can review?
@zucchini-nlp @molbap @NielsRogge