
feat: Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils #240

Closed
autinn wants to merge 11 commits into unslothai:main from autinn:video-inference-feature

Conversation

@autinn

@autinn autinn commented Aug 13, 2025

Contributor

Problem

Currently, Unsloth's vision pipeline is designed primarily for static image inputs. The unsloth_zoo library has no native way to process video files directly. As a result, users cannot run video-based inference with a pre-trained Qwen2.5-VL model without relying on external, non-integrated dependencies (e.g. qwen-vl-utils).

This also addresses the feature request for processing video inputs (#3061).

Solution

Our team (@neenz16, @suchithgali, and @autinn) created this pull request to introduce a complete, self-contained solution for video inference within the unsloth_zoo library. Instead of adding an external dependency on Qwen2.5-VL/qwen-vl-utils, we implement the core video processing logic directly in unsloth_zoo/vision_utils.py using the source code from qwen_vl_utils/vision_process.py.

Key changes include:

  • Implemented a Full Video Pipeline: The necessary functions to read, decode, and sample frames from video files, including smart_nframes, _read_video_torchvision, _read_video_decord, _read_video_torchcodec, and their helpers, have been added. This makes the unsloth_zoo library self-contained for video processing.

  • Dynamic Video Sampling: The fetch_video function has been fully implemented to handle video files: it samples frames based on the requested frame rate and frame limits, and resizes them to respect resolution and other model constraints.

  • Processor Integration: The process_vision_info function and UnslothVisionDataCollator are updated to correctly receive and pass the video tensor and its associated metadata (like FPS) to the model's processor.

  • Debug Print Statement: Added a print statement to indicate and verify unsloth_zoo/vision_utils is running for video reading with the specified backend.

Test

The provided Colab notebook demonstrates end-to-end video inference before and after this change, with and without the qwen-vl-utils dependency.

Before (Missing Dependency on Video Processing): A test shows that without installing qwen-vl-utils, a user attempting to run the original Qwen video inference code would encounter a ModuleNotFoundError.

Before (With External Dependency on qwen-vl-utils): The notebook shows that when qwen-vl-utils is installed, a user can successfully run video inference using the library's functions. This establishes a baseline for the desired functionality.

After (Integrated qwen-vl-utils to unsloth): The new test script successfully loads the unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit model. It then overwrites the vision_utils.py in unsloth_zoo with the newly implemented code to process a video file and a textual prompt. The test runs a single inference call with model.generate() and confirms that the model successfully processes the video and generates a coherent text response, all without relying on the external dependency.

Documentation Suggestion

We also propose adding a new “Video Inference” section to the “Vision Fine-tuning” part of the Unsloth documentation as a first step toward video fine-tuning. We are happy to work on that as well after the PR is merged to main.

Reference
qwen-vl-utils documentation: https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils
qwen vision_process.py code: https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py

autinn and others added 6 commits August 11, 2025 08:57
Co-authored-by: Neenu Antony <neenu.antony@sjsu.edu>
Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
@autinn autinn changed the title from "Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils" to "feat: Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils" Aug 13, 2025
@mmathew23

Collaborator

Hi @autinn @neenz16 @suchithgali, great work! Thank you for submitting!

Are there any additions to qwen vl utils in the PR, or is it the same logic but integrated into unsloth?

@autinn

autinn commented Aug 19, 2025

Contributor Author

Hi @mmathew23, we integrated the same logic from qwen-vl-utils into unsloth and tested it in a Colab notebook for video inference. We also updated the print statement from “qwen-vl-utils reading videos” to “unsloth_zoo/vision-utils reading videos”.

Comment thread unsloth_zoo/vision_utils.py Outdated
"<|IMG_PATCH|>", # Cohere
]

from __future__ import annotations

Collaborator

This causes an error since it's not at the top of the file, and it doesn't seem to be needed, so it can be removed.
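For context, Python only accepts `from __future__` imports at the start of a module (after the docstring, comments, and blank lines), which is why the placement flagged above fails. A minimal sketch that reproduces this with the built-in `compile()`; the two source strings are illustrative stand-ins, not the actual contents of vision_utils.py:

```python
# Illustrative module sources; neither is the real vision_utils.py.
MISPLACED = 'IMAGE_TOKENS = ["<|IMG_PATCH|>"]\nfrom __future__ import annotations\n'
AT_TOP = 'from __future__ import annotations\nIMAGE_TOKENS = ["<|IMG_PATCH|>"]\n'


def compiles(source: str) -> bool:
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(MISPLACED))  # future import placed after another statement
print(compiles(AT_TOP))     # future import is the first statement
```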

Comment thread unsloth_zoo/vision_utils.py Outdated
VIDEO_MAX_PIXELS = 768 * 28 * 28
VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 128000 * 28 * 28 * 0.9)))
logger.info(f"set VIDEO_TOTAL_PIXELS: {VIDEO_TOTAL_PIXELS}")

Collaborator

Using the logger is fine, but we generally want to keep logging uncluttered unless the user asks for it. In other parts of the library we check do_logging = os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1" to determine whether or not to log. If you'd like to stick with logger, I'd suggest configuring the logging level based on the env variable: no prints or logs when the variable is off, and logs when it is on. It's also best to prepend "Unsloth: " to the string being logged.
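The gating described in this comment could look roughly like the sketch below. The environment variable name and the "Unsloth: " prefix come from the comment itself; the logger name and the `log_info` helper are hypothetical and are not unsloth_zoo's actual API:

```python
import logging
import os

logger = logging.getLogger("unsloth_zoo_example")  # hypothetical logger name


def logging_enabled() -> bool:
    """Logging is opt-in: only on when UNSLOTH_ENABLE_LOGGING is set to "1"."""
    return os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1"


def log_info(message: str) -> bool:
    """Log with the "Unsloth: " prefix when enabled; return whether it logged."""
    if not logging_enabled():
        return False
    logger.info("Unsloth: %s", message)
    return True
```

Reading the variable on every call keeps the sketch easy to toggle in a notebook; a library would more likely read it once at import time.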

Comment thread unsloth_zoo/vision_utils.py Outdated
max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
nframes = total_frames / video_fps * fps
if nframes > total_frames:
    logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")

@mmathew23 mmathew23 Aug 19, 2025

Collaborator

check if logging is enabled before logging and include Unsloth:
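For readers unfamiliar with the snippet under review, here is a simplified sketch of the frame-count calculation it belongs to. It keeps the `total_frames / video_fps * fps` formula and the `floor_by_factor` clamping shown above, but the constant values and the `sketch_nframes` name are assumptions for illustration, and the explicit `nframes`, `min_frames`, and `max_frames` overrides are omitted:

```python
import math

# Assumed values for illustration only.
FRAME_FACTOR = 2      # frame count is kept a multiple of this
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768


def ceil_by_factor(number: float, factor: int) -> int:
    """Smallest multiple of `factor` that is >= number."""
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    """Largest multiple of `factor` that is <= number."""
    return math.floor(number / factor) * factor


def sketch_nframes(total_frames: int, video_fps: float, fps: float = 2.0) -> int:
    """Number of frames to sample when targeting `fps` frames per second."""
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    nframes = total_frames / video_fps * fps  # duration in seconds * target fps
    nframes = min(max(nframes, min_frames), max_frames, total_frames)
    return floor_by_factor(nframes, FRAME_FACTOR)


print(sketch_nframes(total_frames=300, video_fps=30.0, fps=2.0))  # 10 s clip at 2 fps
```

The warning in the reviewed line fires when the requested rate would need more frames than the video contains; the clamp then caps the result at the available frames.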

Comment thread unsloth_zoo/vision_utils.py Outdated
output_format="TCHW",
)
total_frames, video_fps = video.size(0), info["video_fps"]
logger.info(f"torchvision: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")

Collaborator

check if logging is enabled before logging and include Unsloth:

Comment thread unsloth_zoo/vision_utils.py Outdated
f"Video duration: {max_duration:.2f}s ({total_frames} frames @ {video_fps}fps)"
)

logger.info(f"calculate video frame range: {start_frame=}, {end_frame=}, {total_frames=} from {video_start=}, {video_end=}, {video_fps=:.3f}")

Collaborator

check if logging is enabled before logging and include Unsloth:

Comment thread unsloth_zoo/vision_utils.py Outdated
idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist()
video = vr.get_batch(idx).asnumpy()
video = torch.tensor(video).permute(0, 3, 1, 2) # Convert to TCHW format
logger.info(f"decord: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")

Collaborator

check if logging is enabled before logging and include Unsloth:

Comment thread unsloth_zoo/vision_utils.py Outdated
idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist()
sample_fps = nframes / max(total_frames, 1e-6) * video_fps
video = decoder.get_frames_at(indices=idx).data
logger.info(f"torchcodec: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")

Collaborator

check if logging is enabled before logging and include Unsloth:
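Both the decord and torchcodec snippets above pick frame indices with `torch.linspace(start_frame, end_frame, nframes).round()`, and the torchcodec one also derives the effective sampling rate. A dependency-free sketch of the same idea follows; the function names are hypothetical, and rounding of exact .5 ties may differ slightly from the torch version:

```python
from typing import List


def evenly_spaced_indices(start_frame: int, end_frame: int, nframes: int) -> List[int]:
    """Pick `nframes` indices spread evenly over [start_frame, end_frame]."""
    if nframes == 1:
        return [start_frame]
    step = (end_frame - start_frame) / (nframes - 1)
    return [round(start_frame + i * step) for i in range(nframes)]


def effective_sample_fps(nframes: int, total_frames: int, video_fps: float) -> float:
    """Frame rate of the sampled clip; the max() guards against division by zero."""
    return nframes / max(total_frames, 1e-6) * video_fps


print(evenly_spaced_indices(0, 9, 4))
print(effective_sample_fps(20, 300, 30.0))
```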

Comment thread unsloth_zoo/vision_utils.py Outdated
"torchcodec": _read_video_torchcodec,
}

FORCE_QWENVL_VIDEO_READER = os.getenv("FORCE_QWENVL_VIDEO_READER", None)

Collaborator

The file already has attribution to the qwen team on top, so let's rename the variable to be UNSLOTH to avoid confusion.

Comment thread unsloth_zoo/vision_utils.py Outdated
    video_reader_backend = "decord"
else:
    video_reader_backend = "torchvision"
print(f"unsloth_zoo/vision_utils using {video_reader_backend} to read video.", file=sys.stderr)

Collaborator

Check if logging is enabled before logging and include "Unsloth: ", but let's also make sure we are consistent about using either prints or logs.
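A rough sketch of the backend selection this thread is reviewing, using the FORCE_UNSLOTH_VIDEO_READER name adopted later in this conversation. Only the decord-then-torchvision fallback is visible in the snippet above; trying torchcodec first and probing with `importlib.util.find_spec` are assumptions, and `pick_video_reader_backend` is a hypothetical name:

```python
import importlib.util
import os


def pick_video_reader_backend() -> str:
    """Return the name of the video reader backend to use."""
    forced = os.getenv("FORCE_UNSLOTH_VIDEO_READER")
    if forced is not None:
        return forced  # explicit user override wins
    # Assumed preference order: optional decoders first, torchvision last.
    for candidate in ("torchcodec", "decord"):
        if importlib.util.find_spec(candidate) is not None:
            return candidate
    return "torchvision"


print(pick_video_reader_backend())
```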

Comment thread unsloth_zoo/vision_utils.py Outdated
try:
    video, sample_fps = VIDEO_READER_BACKENDS[video_reader_backend](ele)
except Exception as e:
    logger.warning(f"video_reader_backend {video_reader_backend} error, use torchvision as default, msg: {e}")

Collaborator

check if logging is enabled before logging and include Unsloth:
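The try/except above falls back to torchvision when the chosen reader fails. Combined with the gated, prefixed warning requested in this review, that pattern might be sketched as follows; `read_with_fallback`, the logger name, and the backends mapping passed in are hypothetical stand-ins for the real readers:

```python
import logging
import os
from typing import Any, Callable, Dict

logger = logging.getLogger("unsloth_zoo_example")  # hypothetical logger name
Reader = Callable[[dict], Any]


def read_with_fallback(
    ele: dict,
    backend: str,
    backends: Dict[str, Reader],
    fallback: str = "torchvision",
) -> Any:
    """Try the requested reader; on any error, retry with the fallback reader."""
    try:
        return backends[backend](ele)
    except Exception as err:
        # Only warn when the user opted in to logging.
        if os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1":
            logger.warning(
                "Unsloth: video reader backend %s failed, using %s instead: %s",
                backend, fallback, err,
            )
        return backends[fallback](ele)
```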

Comment thread unsloth_zoo/vision_utils.py Outdated
max_pixels_supposed = ele.get("max_pixels", max_pixels)

if max_pixels_supposed > max_pixels:
    logger.warning(f"The given max_pixels[{max_pixels_supposed}] exceeds limit[{max_pixels}].")

Collaborator

check if logging is enabled before logging and include Unsloth:

@mmathew23

Collaborator

Hi @autinn, I added some comments. Since there are also changes to the collator, we should make sure that existing training runs still work as intended and that video models can train too.

Great work, and appreciate the help! Once we get the comments and finetuning verification we can merge this.

Comment thread unsloth_zoo/vision_utils.py Outdated
def smart_nframes(
    ele: dict,
    total_frames: int,
    video_fps: int | float,

@mmathew23 mmathew23 Aug 19, 2025

Collaborator

I think | needs to be changed to Union in order to support Python 3.9
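To illustrate the suggestion: the `X | Y` annotation syntax (PEP 604) is evaluated when the function is defined, and on Python 3.9 that raises a TypeError unless annotations are deferred with `from __future__ import annotations`. `typing.Union` works on both 3.9 and newer versions. A small sketch with a hypothetical helper, not a function from the PR:

```python
from typing import Union


def frames_to_seconds(frame_index: int, video_fps: Union[int, float]) -> float:
    """Convert a frame index to a timestamp in seconds."""
    return frame_index / video_fps


print(frames_to_seconds(60, 30))
```

The same change applies to return annotations, for example `Union[torch.Tensor, list[Image.Image]]` in place of `torch.Tensor | list[Image.Image]`.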

Comment thread unsloth_zoo/vision_utils.py Outdated
return video_reader_backend


def fetch_video(ele: dict, image_factor: int = IMAGE_FACTOR, return_video_sample_fps: bool = False) -> torch.Tensor | list[Image.Image]:

Collaborator

I think | needs to be changed to Union in order to support Python 3.9

@neenz16

neenz16 commented Aug 19, 2025

Contributor

Thanks a lot for the review and guidance @mmathew23! 🙏
We’ll work on the suggested changes and run verifications to ensure both existing training runs and video model finetuning work as expected. We’ll get back with updates soon. Really appreciate your support!

@autinn

autinn commented Aug 21, 2025

Contributor Author

@mmathew23 Thank you again for your feedback! We worked on the code over the past few days, and here are the changes we made:

  1. Removed from __future__ import annotations.
  2. Imported logger from .log and UNSLOTH_ENABLE_LOGGING from .temporary_patches.common, similar to vllm_utils.py.
  3. Added conditional logging for the logger.info() and logger.warning() statements, including the print statement we originally added.
  4. Renamed the variable FORCE_QWENVL_VIDEO_READER to FORCE_UNSLOTH_VIDEO_READER.
  5. Tested in the Colab notebook and confirmed that all logging statements appear as intended.

mmathew23 added a commit to mmathew23/unsloth-zoo that referenced this pull request Sep 13, 2025
This is a cleaned up and merged version of
unslothai#240.

This adds video processing utilities for VLM finetuning.

It's based on qwen-vl-utils repo https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils

Co-authored-by: autinn <au-yeung@uni.minerva.edu>
Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu>
Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
@mmathew23

Collaborator

Hello @autinn @neenz16 @suchithgali! Appreciate all the help on this.

We had refactored the vision_utils code before getting this in. I also had to clean up some issues to make sure non-video finetuning still worked the same. I didn't have access to edit this PR branch directly, so I created a new PR with your contributions, merged them into the new code, and cleaned it up. You can check it out here: #279. Thank you so much for your contribution!

@neenz16

neenz16 commented Sep 14, 2025

Contributor

Thanks so much, @mmathew23! Really appreciate you merging and cleaning this up. Glad our contributions made it in. We'll check out #279!

danielhanchen pushed a commit that referenced this pull request Sep 30, 2025
This is a cleaned up and merged version of
#240.

This adds video processing utilities for VLM finetuning.

It's based on qwen-vl-utils repo https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils

Co-authored-by: autinn <au-yeung@uni.minerva.edu>
Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu>
Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
@mmathew23

Collaborator

Closing as #279 was merged

@mmathew23 mmathew23 closed this Oct 1, 2025

3 participants