VLM support for image and video processing with SmolVLM support #206

davidkoski merged 54 commits into ml-explore:main from
Conversation
Video/image fixes
Text inputs, with hardcoded values and considering a single image. Image patching still not done. You need to define HF_TOKEN in the environment to be able to download the model.
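For reference, defining `HF_TOKEN` in the environment looks like this (the token value below is a placeholder, not a real token):

```shell
# Hypothetical placeholder value; substitute your own Hugging Face access token.
export HF_TOKEN="hf_your_token_here"
```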
I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames untouched :)
smolvlm processing
Some cleanup
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
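To illustrate the "always upscaled, so always tiled" behavior, here is a rough sketch of the tiling arithmetic. The tile size of 512 and the `tileGrid` helper are assumptions for illustration, not necessarily what the PR implements:

```swift
import Foundation

// Sketch: after upscaling, the image is split into a grid of fixed-size
// tiles. Tile size 512 is an assumption for this example.
func tileGrid(width: Int, height: Int, tileSize: Int = 512) -> (cols: Int, rows: Int) {
    let cols = Int(ceil(Double(width) / Double(tileSize)))
    let rows = Int(ceil(Double(height) / Double(tileSize)))
    return (cols, rows)
}

// e.g. a 1536x1024 image yields a (3, 2) grid: 6 tiles plus a global view.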
}

// inputs_merger
// TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?
This is actually a pending to-do for Idefics3. We can remove the comment here, but revisit whether this works for the previous smolvlm.
Sorry, I've been distracted with other stuff. I'll get back to this soon to address all the feedback!
[wip] Addressing PR comments
I think I addressed most of the comments:
I think what's pending is:
    userInfo: [NSLocalizedDescriptionKey: "Failed to load the asset's duration"])
}
let fps = targetFPS(duration)
// Note: the round was not present in `asCIImageSequence`, so we may now be passing 1 more frame to Qwen depending on video duration.
As noted in the comment, this may result in an additional frame being extracted for users of the previous asCIImageSequence (only Qwen VL). I don't think this would be a big deal, so we can just remove the comment.
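For context on the rounding discussion, a minimal sketch of why `round` can change the frame count. The `frameTimestamps` helper and the timescale of 600 are hypothetical, used only to illustrate the arithmetic; the PR's actual helpers are `targetFPS` and `asCIImageSequence`:

```swift
import CoreMedia

// Hypothetical helper: compute sample timestamps for a video of the given
// duration at the target fps. Using `round` (rather than truncation, as in
// the old `asCIImageSequence` path) can produce one extra frame for some
// durations.
func frameTimestamps(duration: Double, fps: Double) -> [CMTime] {
    let frameCount = Int(round(duration * fps))  // trunc would give Int(duration * fps)
    guard frameCount > 0 else { return [] }
    let step = duration / Double(frameCount)
    return (0..<frameCount).map { i in
        CMTime(seconds: Double(i) * step, preferredTimescale: 600)
    }
}
```

For example, a 5.5-second clip sampled at 1 fps yields 5 frames with truncation but 6 with `round`, which matches the "1 more frame" behavior noted in the comment.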
Awesome! @pcuenca and @cyrilzakka, see my suggestion here on the video: https://github.com/ml-explore/mlx-swift-examples/pull/206/files#r2010564639. And yes, it looks like it needs swift-format. Then I think it is ready to go.
@pcuenca and @cyrilzakka -- I think there were just a few pending issues, in particular around the inclusion of the video. What do you think? It also needs a swift-format run. If both of you are busy, I am happy to handle these last couple of items so we can merge this.
Sorry, I dropped the ball here. Looking at the final pieces today.
Co-authored-by: David Koski <46639364+davidkoski@users.noreply.github.com>
Update SmolVLM PR
Please let us know if there's anything else to revisit 🤗
Awesome, @pcuenca! I will review it and hopefully merge it this afternoon.
davidkoski
left a comment
Awesome, thank you @cyrilzakka and @pcuenca for your hard work here!
Hey all,
@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. We would love a second pair of eyes on this!