feat: Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils by autinn · Pull Request #240 · unslothai/unsloth-zoo

autinn · 2025-08-13T22:40:52Z

Problem

Currently, Unsloth's vision pipeline is designed primarily for static image inputs. It lacks native functionality to process video files directly within the unsloth_zoo library. This prevents users from performing video-based inference with a pre-trained Qwen2.5-VL model without relying on external, non-integrated dependencies (e.g. qwen-vl-utils).

This also relates to the feature request for processing video inputs (#3061).

Solution

Our team (@neenz16, @suchithgali, and @autinn) created this pull request to introduce a complete, self-contained solution for video inference within the unsloth_zoo library. Instead of adding an external dependency on Qwen2.5-VL/qwen-vl-utils, we implement the core video processing logic directly in unsloth_zoo/vision_utils.py using the source code from qwen_vl_utils/vision_process.py.

Key changes include:

Implemented a Full Video Pipeline: The necessary functions to read, decode, and sample frames from video files, including smart_nframes, _read_video_torchvision, _read_video_decord, _read_video_torchcodec, and their helpers, have been added. This makes the unsloth_zoo library self-contained for video processing.
Dynamic Video Sampling: The fetch_video function has been fully implemented to handle video files, sample frames intelligently, and manage resolution and other model constraints.
Processor Integration: The process_vision_info function and UnslothVisionDataCollator are updated to correctly receive and pass the video tensor and its associated metadata (like FPS) to the model's processor.
Debug Print Statement: Added a print statement to indicate and verify unsloth_zoo/vision_utils is running for video reading with the specified backend.

Test

The provided Colab notebook demonstrates end-to-end video inference before and after this change, with and without the qwen-vl-utils dependency.

Before (Missing Dependency on Video Processing): A test shows that without installing qwen-vl-utils, a user attempting to run the original Qwen video inference code would encounter a ModuleNotFoundError.

Before (With External Dependency on qwen-vl-utils): The notebook shows that when qwen-vl-utils is installed, a user can successfully run video inference using the library's functions. This establishes a baseline for the desired functionality.

After (Integrated qwen-vl-utils to unsloth): The new test script successfully loads the unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit model. It then overwrites the vision_utils.py in unsloth_zoo with the newly implemented code to process a video file and a textual prompt. The test runs a single inference call with model.generate() and confirms that the model successfully processes the video and generates a coherent text response, all without relying on the external dependency.

Documentation Suggestion

We also propose a new “Video Inference” section in the “Vision Fine-tuning” part of the Unsloth documentation as a first step toward video fine-tuning. We are happy to work on that as well after the PR is merged to main.

Reference
qwen-vl-utils documentation: https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils
qwen vision_process.py code: https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py

Co-authored-by: Neenu Antony <neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>

mmathew23 · 2025-08-19T01:36:45Z

Hi @autinn @neenz16 @suchithgali, great work! Thank you for submitting!

Are there any additions to qwen vl utils in the PR, or is it the same logic but integrated into unsloth?

autinn · 2025-08-19T01:47:35Z

Hi @mmathew23, we integrated the same logic from qwen-vl-utils into unsloth and tested it in a Colab notebook for video inference. We also updated the print statement from “qwen-vl-utils reading videos” to “unsloth_zoo/vision-utils reading videos”.

mmathew23 · 2025-08-19T14:30:02Z

    "<|IMG_PATCH|>",      # Cohere
 ]

+from __future__ import annotations


This causes an error since it's not at the top of the file, and it doesn't seem needed, so it can be removed.

mmathew23 · 2025-08-19T14:45:54Z

 VIDEO_MAX_PIXELS = 768 * 28 * 28
-VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
+VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 128000 * 28 * 28 * 0.9)))
+logger.info(f"set VIDEO_TOTAL_PIXELS: {VIDEO_TOTAL_PIXELS}")


Using the logger is fine, but we generally want to keep logging less cluttered unless the user wants it. In other parts of the library we check for do_logging = os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1" to determine whether or not to log. If you'd like to stick with logger I'd suggest configuring the logging level based on the env variable, and no prints/logs if the variable is off, and prints if the variable is on. Best to also prepend "Unsloth: " before the actual string to log.
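The gating described in this comment could be sketched as follows. This is a minimal illustration, not the code that was merged: the helper name `log_if_enabled` and the logger name are hypothetical, and only the `UNSLOTH_ENABLE_LOGGING` check and the "Unsloth: " prefix come from the comment itself.

```python
import logging
import os

logger = logging.getLogger("unsloth_zoo_sketch")  # hypothetical logger name


def logging_enabled() -> bool:
    # Same environment check that the review comment quotes.
    return os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1"


def log_if_enabled(level: int, message: str) -> bool:
    """Log `message` with an "Unsloth: " prefix only when logging is enabled.

    Returns True if the message was handed to the logger, False otherwise.
    """
    if not logging_enabled():
        return False
    logger.log(level, "Unsloth: " + message)
    return True
```

With a helper like this, a call such as `log_if_enabled(logging.INFO, f"set VIDEO_TOTAL_PIXELS: {value}")` stays silent unless the user sets `UNSLOTH_ENABLE_LOGGING=1`.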

mmathew23 · 2025-08-19T14:47:49Z

+        max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
+        nframes = total_frames / video_fps * fps
+        if nframes > total_frames:
+            logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.
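For readers unfamiliar with the hunk above, the frame-count arithmetic can be sketched in plain Python. The constant values and the final clamping step are assumptions made for illustration (they are not shown in the quoted diff), and `sketch_nframes` is a hypothetical name rather than the function in vision_utils.py.

```python
import math

# Assumed values for illustration only; the real constants live in vision_utils.py.
FRAME_FACTOR = 2
FPS = 2.0
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768


def ceil_by_factor(number: float, factor: int) -> int:
    # Smallest multiple of `factor` that is >= number.
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    # Largest multiple of `factor` that is <= number.
    return math.floor(number / factor) * factor


def sketch_nframes(ele: dict, total_frames: int, video_fps: float) -> int:
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(
        ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR
    )
    # Frames needed to cover the clip at the requested sampling rate.
    nframes = total_frames / video_fps * fps
    # Assumed clamp: stay within [min_frames, max_frames] and never exceed the clip.
    nframes = min(max(nframes, min_frames), max_frames, total_frames)
    return floor_by_factor(nframes, FRAME_FACTOR)
```

For a 300-frame clip recorded at 30 fps and sampled at the assumed default of 2 fps, this yields 20 frames; requesting more frames than the clip contains is the case that triggers the warning in the hunk.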

mmathew23 · 2025-08-19T14:48:18Z

+        output_format="TCHW",
+    )
+    total_frames, video_fps = video.size(0), info["video_fps"]
+    logger.info(f"torchvision:  {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.

mmathew23 · 2025-08-19T14:48:38Z

+            f"Video duration: {max_duration:.2f}s ({total_frames} frames @ {video_fps}fps)"
+        )
+
+    logger.info(f"calculate video frame range: {start_frame=}, {end_frame=}, {total_frames=} from {video_start=}, {video_end=}, {video_fps=:.3f}")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.

mmathew23 · 2025-08-19T14:48:49Z

+    idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist()
+    video = vr.get_batch(idx).asnumpy()
+    video = torch.tensor(video).permute(0, 3, 1, 2)  # Convert to TCHW format
+    logger.info(f"decord:  {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.

mmathew23 · 2025-08-19T14:49:08Z

+    idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist()
+    sample_fps = nframes / max(total_frames, 1e-6) * video_fps
+    video = decoder.get_frames_at(indices=idx).data
+    logger.info(f"torchcodec:  {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.
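The index sampling and effective sample rate used in the decord and torchcodec hunks above can be approximated without torch. This is a pure-Python sketch with hypothetical function names; it mirrors `torch.linspace(start_frame, end_frame, nframes).round()` in intent, though floating-point precision differences could produce a different index in rare edge cases.

```python
from typing import List


def evenly_spaced_indices(start_frame: int, end_frame: int, nframes: int) -> List[int]:
    """Pick `nframes` frame indices spread evenly over [start_frame, end_frame]."""
    if nframes == 1:
        return [start_frame]
    step = (end_frame - start_frame) / (nframes - 1)
    return [round(start_frame + i * step) for i in range(nframes)]


def effective_sample_fps(nframes: int, total_frames: int, video_fps: float) -> float:
    """Rate at which the sampled frames cover the clip, as computed in the hunk."""
    # The max(...) guard avoids division by zero for an empty clip.
    return nframes / max(total_frames, 1e-6) * video_fps
```

Sampling 20 frames from a 300-frame clip at 30 fps gives an effective rate of about 2 frames per second, which is the kind of metadata passed on to the processor alongside the video tensor.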

mmathew23 · 2025-08-19T14:49:52Z

+    "torchcodec": _read_video_torchcodec,
+}
+
+FORCE_QWENVL_VIDEO_READER = os.getenv("FORCE_QWENVL_VIDEO_READER", None)


The file already has attribution to the qwen team on top, so let's rename the variable to be UNSLOTH to avoid confusion.

mmathew23 · 2025-08-19T14:50:21Z

+        video_reader_backend = "decord"
+    else:
+        video_reader_backend = "torchvision"
+    print(f"unsloth_zoo/vision_utils using {video_reader_backend} to read video.", file=sys.stderr)


Check if logging is enabled before logging, and prepend "Unsloth: " to the message. Let's also make sure we are consistent about using either prints or logs.
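A rough sketch of the backend selection under discussion is shown below. It assumes the renamed environment variable `FORCE_UNSLOTH_VIDEO_READER` and a torchcodec, then decord, then torchvision preference order; the order and the function name `pick_video_reader_backend` are assumptions for illustration, not confirmed details of the merged code.

```python
import importlib.util
import os
from typing import Optional


def pick_video_reader_backend(forced: Optional[str] = None) -> str:
    """Choose a video reader backend name.

    An explicit override (argument or environment variable) wins; otherwise
    fall back through optional packages to torchvision.
    """
    if forced is None:
        forced = os.getenv("FORCE_UNSLOTH_VIDEO_READER")
    if forced is not None:
        return forced
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec("torchcodec") is not None:
        return "torchcodec"
    if importlib.util.find_spec("decord") is not None:
        return "decord"
    return "torchvision"
```

Returning the name rather than printing it leaves the decision about printing or logging to the caller, which makes it easier to apply one consistent, opt-in logging policy.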

mmathew23 · 2025-08-19T14:50:49Z

+        try:
+            video, sample_fps = VIDEO_READER_BACKENDS[video_reader_backend](ele)
+        except Exception as e:
+            logger.warning(f"video_reader_backend {video_reader_backend} error, use torchvision as default, msg: {e}")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.
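The try/except fallback in the hunk above, combined with the requested logging gate, might look roughly like this. The function name, the `Reader` alias, and the logger name are hypothetical, and the readers are passed in as a plain dictionary so the sketch stays independent of any video library.

```python
import logging
import os
from typing import Any, Callable, Dict, Tuple

logger = logging.getLogger("unsloth_zoo_sketch")  # hypothetical logger name

# A reader takes the video element dict and returns (frames, sample_fps).
Reader = Callable[[dict], Tuple[Any, float]]


def read_video_with_fallback(
    ele: dict,
    backends: Dict[str, Reader],
    preferred: str,
    fallback: str = "torchvision",
) -> Tuple[Any, float]:
    try:
        return backends[preferred](ele)
    except Exception as exc:
        # Only warn when the user opted in, and prefix the message.
        if os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1":
            logger.warning(
                f"Unsloth: video_reader_backend {preferred} error, "
                f"use {fallback} as default, msg: {exc}"
            )
        return backends[fallback](ele)
```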

mmathew23 · 2025-08-19T14:51:12Z

+        max_pixels_supposed = ele.get("max_pixels", max_pixels)
+
+        if max_pixels_supposed > max_pixels:
+            logger.warning(f"The given max_pixels[{max_pixels_supposed}] exceeds limit[{max_pixels}].")


Check if logging is enabled before logging, and prepend "Unsloth: " to the message.

mmathew23 · 2025-08-19T15:07:12Z

Hi @autinn, I added some comments. Since there are also changes to the collator, we should make sure that existing training runs still work as intended and that video models can train too.

Great work, and appreciate the help! Once the comments are addressed and fine-tuning is verified, we can merge this.

mmathew23 · 2025-08-19T15:47:47Z

+def smart_nframes(
+    ele: dict,
+    total_frames: int,
+    video_fps: int | float,


I think | needs to be changed to Union in order to support Python 3.9.

mmathew23 · 2025-08-19T15:48:03Z

+    return video_reader_backend
+
+
+def fetch_video(ele: dict, image_factor: int = IMAGE_FACTOR, return_video_sample_fps: bool = False) -> torch.Tensor | list[Image.Image]:


I think | needs to be changed to Union in order to support Python 3.9.
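As background for this suggestion: the `X | Y` annotation syntax was introduced in Python 3.10 (PEP 604), so on Python 3.9 it raises a TypeError when the function is defined unless annotations are deferred. A small sketch of the `Union` spelling follows; `describe_video` is a hypothetical function used only to show the annotation style.

```python
from typing import List, Union


def describe_video(frames: Union[str, List[str]], video_fps: Union[int, float]) -> str:
    """Hypothetical helper: annotations written so they also work on Python 3.9."""
    count = 1 if isinstance(frames, str) else len(frames)
    return f"{count} item(s) at {float(video_fps):.1f} fps"
```

Built-in generics such as `list[...]` are already accepted on Python 3.9 (PEP 585), so only the `|` operator in the two flagged signatures needs to change.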

neenz16 · 2025-08-19T17:02:14Z

Thanks a lot for the review and guidance @mmathew23! 🙏
We’ll work on the suggested changes and run verifications to ensure both existing training runs and video model finetuning work as expected. We’ll get back with updates soon. Really appreciate your support!

autinn · 2025-08-21T23:43:15Z

@mmathew23 Thank you again for your feedback! We worked on the code for a bit the past few days and here are the changes we made:

Removed the from __future__ import annotations statement.
Imported logger from .log and UNSLOTH_ENABLE_LOGGING from .temporary_patches.common, similar to vllm_utils.py.
Added conditional logging for the logger.info() and logger.warning() statements, including the print statement we originally added.
Renamed the variable FORCE_QWENVL_VIDEO_READER to FORCE_UNSLOTH_VIDEO_READER.
Tested in the Colab notebook and confirmed that all logging statements appear as intended.

This is a cleaned up and merged version of unslothai#240. This adds video processing utilities for VLM finetuning. It's based on qwen-vl-utils repo https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils Co-authored-by: autinn <au-yeung@uni.minerva.edu> Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>

mmathew23 · 2025-09-13T03:47:53Z

Hello @autinn @neenz16 @suchithgali! Appreciate all the help on this.

We had refactored the vision_utils code before getting this in. I also had to clean up some issues to make sure non-video fine-tuning still worked the same. I didn't have access to edit this PR branch directly, so I created a new PR with your contributions, merged it into the new code, and cleaned it up. You can check it out here: #279. Thank you so much for your contribution!

neenz16 · 2025-09-14T09:23:37Z

Thanks so much, @mmathew23! Really appreciate you merging and cleaning this up. Glad our contributions made it in. We'll check out #279!

mmathew23 · 2025-10-01T04:00:32Z

Closing as #279 was merged

autinn and others added 6 commits August 11, 2025 08:57

update vision_utils.py for video inputs

d2da860

file update for logger

67ec832

file update on print statement

72bfa00

Merge branch 'unslothai:main' into video-inference-feature

2668410

added video inference feature using qwen-vl-utils

050e6ec

Co-authored-by: Neenu Antony <neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>

edited debug print statement

aaefed2

Co-authored-by: Neenu Antony <neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>

autinn changed the title ~~Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils~~ feat: Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils Aug 13, 2025

mmathew23 reviewed Aug 19, 2025

autinn and others added 5 commits August 20, 2025 08:42

Merge branch 'unslothai:main' into video-inference-feature

5adbb30

Merge branch 'unslothai:main' into video-inference-feature

073248f

added conditional loggings, fixed syntax error, renamed QWENVL_VIDEO …

818fc8a

…to UNSLOTH_VIDEO Neenu Antony <Neenu.antony@sjsu.edu> Suchith Gali <sgali@ucmerced.edu>

Merge branch 'unslothai:main' into video-inference-feature

f00f5e8

added conditional loggings, fixed syntax error, renamed QWENVL_VIDEO …

b010ad3

…to UNSLOTH_VIDEO Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>

mmathew23 mentioned this pull request Sep 13, 2025

Add video processing utils to vision_utils #279

Merged

Wu-Yuanfei mentioned this pull request Sep 23, 2025

GRPO Fine-tuning Implementation and Vision_Utils Integration for Qwen2.5-VL Model unslothai/unsloth#3357

Open

mmathew23 closed this Oct 1, 2025

