feat: Added Video inference feature into unsloth_zoo/vision_utils.py from qwen-vl-utils #240
autinn wants to merge 11 commits
Co-authored-by: Neenu Antony <neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
Hi @autinn @neenz16 @suchithgali, great work! Thank you for submitting! Are there any additions to qwen-vl-utils in the PR, or is it the same logic but integrated into unsloth?
Hi @mmathew23, we integrated the same logic from qwen-vl-utils into unsloth and tested it in a Colab notebook for video inference. We also updated the print statement from “qwen-vl-utils reading videos” to “unsloth_zoo/vision-utils reading videos”.
| "<|IMG_PATCH|>", # Cohere | ||
| ] | ||
|
|
||
| from __future__ import annotations |
This causes an error because it is not at the top of the file. It doesn't seem to be needed, so it can be removed.
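For reference, Python only accepts `from __future__` imports at the very beginning of a module (after an optional docstring, comments, and other future imports). The sketch below shows the rule with `compile` on two made-up source strings; they are illustrative only and are not the contents of vision_utils.py.

```python
# Hypothetical module sources, used only to demonstrate the placement rule.
GOOD_SOURCE = '"""Docstring."""\nfrom __future__ import annotations\nimport os\n'
BAD_SOURCE = 'import os\nfrom __future__ import annotations\n'


def compiles(source: str) -> bool:
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(GOOD_SOURCE))  # future import right after the docstring
print(compiles(BAD_SOURCE))   # future import after a regular import
```

The second call fails at compile time, which is why a future import placed mid-file breaks the whole module on import.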
| VIDEO_MAX_PIXELS = 768 * 28 * 28 | ||
| VIDEO_TOTAL_PIXELS = 24576 * 28 * 28 | ||
| VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 128000 * 28 * 28 * 0.9))) | ||
| logger.info(f"set VIDEO_TOTAL_PIXELS: {VIDEO_TOTAL_PIXELS}") |
Using the logger is fine, but we generally want to keep logging uncluttered unless the user asks for it. In other parts of the library we check do_logging = os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1" to determine whether or not to log. If you'd like to stick with the logger, I'd suggest configuring the logging level based on the env variable: no prints/logs if the variable is off, and prints if it is on. It's also best to prepend "Unsloth: " to the string being logged.
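One way to follow this suggestion is sketched below. It assumes only the `UNSLOTH_ENABLE_LOGGING` check quoted in the comment; the logger name and the helper functions (`configure_logging`, `format_message`, `log_info`) are hypothetical and not part of unsloth_zoo.

```python
import logging
import os

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("unsloth_zoo.vision_utils_example")


def configure_logging() -> bool:
    """Set the logger level from UNSLOTH_ENABLE_LOGGING; return whether logging is on."""
    enabled = os.environ.get("UNSLOTH_ENABLE_LOGGING", "0") == "1"
    # When disabled, raise the threshold above CRITICAL so nothing is emitted.
    logger.setLevel(logging.INFO if enabled else logging.CRITICAL + 1)
    return enabled


def format_message(message: str) -> str:
    """Prepend the "Unsloth: " prefix exactly once."""
    prefix = "Unsloth: "
    return message if message.startswith(prefix) else prefix + message


def log_info(message: str) -> None:
    """Log at INFO level; a no-op when logging is disabled."""
    logger.info(format_message(message))


configure_logging()
log_info("set VIDEO_TOTAL_PIXELS: 90316800")
```

Setting the level once keeps the call sites free of repeated `if do_logging:` checks, at the cost of having to call `configure_logging()` again if the variable changes at runtime.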
| max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR) | ||
| nframes = total_frames / video_fps * fps | ||
| if nframes > total_frames: | ||
| logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
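For context, the frame-count calculation in this hunk can be sketched roughly as follows. The constant values (`FRAME_FACTOR = 2`, `FPS = 2.0`, `FPS_MIN_FRAMES = 4`, `FPS_MAX_FRAMES = 768`) and the final clamping step are assumptions based on the upstream qwen-vl-utils code and are not shown in the excerpt; `sketch_nframes` is a hypothetical name, not the PR's `smart_nframes`.

```python
import math

# Assumed defaults; not visible in the diff excerpt above.
FRAME_FACTOR = 2
FPS = 2.0
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768


def floor_by_factor(number: float, factor: int) -> int:
    """Largest multiple of `factor` that is <= number."""
    return math.floor(number / factor) * factor


def ceil_by_factor(number: float, factor: int) -> int:
    """Smallest multiple of `factor` that is >= number."""
    return math.ceil(number / factor) * factor


def sketch_nframes(ele: dict, total_frames: int, video_fps: float) -> int:
    """Pick how many frames to sample for a target fps, within min/max bounds."""
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(
        ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR
    )
    # Video duration (total_frames / video_fps) times the target sampling rate.
    nframes = total_frames / video_fps * fps
    # Clamp to [min_frames, max_frames] and never exceed the frames available.
    nframes = min(max(nframes, min_frames), max_frames, total_frames)
    return floor_by_factor(nframes, FRAME_FACTOR)


print(sketch_nframes({}, total_frames=300, video_fps=30))
```

With these assumed defaults, a 10-second clip at 30 fps is sampled at 2 fps, and the warning in the excerpt corresponds to the case where the requested rate exceeds the source rate.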
| output_format="TCHW", | ||
| ) | ||
| total_frames, video_fps = video.size(0), info["video_fps"] | ||
| logger.info(f"torchvision: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
| f"Video duration: {max_duration:.2f}s ({total_frames} frames @ {video_fps}fps)" | ||
| ) | ||
|
|
||
| logger.info(f"calculate video frame range: {start_frame=}, {end_frame=}, {total_frames=} from {video_start=}, {video_end=}, {video_fps=:.3f}") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
| idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist() | ||
| video = vr.get_batch(idx).asnumpy() | ||
| video = torch.tensor(video).permute(0, 3, 1, 2) # Convert to TCHW format | ||
| logger.info(f"decord: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
| idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist() | ||
| sample_fps = nframes / max(total_frames, 1e-6) * video_fps | ||
| video = decoder.get_frames_at(indices=idx).data | ||
| logger.info(f"torchcodec: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
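As a side note on the sampling lines in the excerpt above, here is a dependency-free sketch of what `torch.linspace(start_frame, end_frame, nframes).round().long().tolist()` and the `sample_fps` formula compute. The function names are hypothetical, and the pure-Python rounding may differ from torch's float32 result in rare tie cases.

```python
from typing import List


def uniform_frame_indices(start_frame: int, end_frame: int, nframes: int) -> List[int]:
    """Evenly spaced frame indices from start_frame to end_frame, inclusive."""
    if nframes == 1:
        return [start_frame]
    step = (end_frame - start_frame) / (nframes - 1)
    return [round(start_frame + i * step) for i in range(nframes)]


def effective_sample_fps(nframes: int, total_frames: int, video_fps: float) -> float:
    """Sampling rate implied by keeping nframes out of total_frames."""
    # max(..., 1e-6) guards against division by zero for empty videos.
    return nframes / max(total_frames, 1e-6) * video_fps


indices = uniform_frame_indices(0, 9, 4)
print(indices)
print(effective_sample_fps(nframes=20, total_frames=300, video_fps=30.0))
```

Both backends in the excerpt share this index computation; only the decoding call that consumes the indices differs.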
| "torchcodec": _read_video_torchcodec, | ||
| } | ||
|
|
||
| FORCE_QWENVL_VIDEO_READER = os.getenv("FORCE_QWENVL_VIDEO_READER", None) |
The file already has attribution to the Qwen team at the top, so let's rename the variable to use UNSLOTH to avoid confusion.
| video_reader_backend = "decord" | ||
| else: | ||
| video_reader_backend = "torchvision" | ||
| print(f"unsloth_zoo/vision_utils using {video_reader_backend} to read video.", file=sys.stderr) |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ". Let's also make sure we are consistent about using either prints or logs.
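A rough sketch of the backend selection around these lines, for readers following along. The environment variable name below is a placeholder (the final renamed variable is not shown in this thread), and checking torchcodec before decord is an assumption; only the decord-then-torchvision fallback is visible in the excerpt.

```python
import importlib.util
import os
from typing import Callable, Mapping, Optional

# Placeholder name for illustration; not the variable used in the PR.
FORCE_READER_ENV = "UNSLOTH_FORCE_VIDEO_READER_EXAMPLE"


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_video_reader_backend(
    environ: Optional[Mapping[str, str]] = None,
    installed: Callable[[str], bool] = is_installed,
) -> str:
    """Choose a video reader: explicit override first, then what is installed."""
    environ = os.environ if environ is None else environ
    forced = environ.get(FORCE_READER_ENV)
    if forced:
        return forced
    if installed("torchcodec"):  # assumed priority order
        return "torchcodec"
    if installed("decord"):
        return "decord"
    return "torchvision"


print(pick_video_reader_backend())
```

Passing `environ` and `installed` as parameters keeps the selection testable without touching the real environment or installing any decoder.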
| try: | ||
| video, sample_fps = VIDEO_READER_BACKENDS[video_reader_backend](ele) | ||
| except Exception as e: | ||
| logger.warning(f"video_reader_backend {video_reader_backend} error, use torchvision as default, msg: {e}") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
| max_pixels_supposed = ele.get("max_pixels", max_pixels) | ||
|
|
||
| if max_pixels_supposed > max_pixels: | ||
| logger.warning(f"The given max_pixels[{max_pixels_supposed}] exceeds limit[{max_pixels}].") |
Check whether logging is enabled before logging, and prefix the message with "Unsloth: ".
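To make the warning in this hunk concrete, here is a sketch of how a per-frame pixel limit might be derived and then used to cap a user-supplied `max_pixels`. `VIDEO_MAX_PIXELS` and the default total-pixel budget come from the diff above; `FRAME_FACTOR`, `VIDEO_MIN_PIXELS`, the 1.05 floor, and the budget formula are assumptions based on upstream qwen-vl-utils, and `per_frame_max_pixels` is a hypothetical name.

```python
FRAME_FACTOR = 2                    # assumed, not shown in the excerpt
VIDEO_MIN_PIXELS = 128 * 28 * 28    # assumed, not shown in the excerpt
VIDEO_MAX_PIXELS = 768 * 28 * 28    # from the diff
VIDEO_TOTAL_PIXELS = int(128000 * 28 * 28 * 0.9)  # default from the diff


def per_frame_max_pixels(ele: dict, nframes: int) -> float:
    """Cap a requested max_pixels by the limit implied by the total pixel budget."""
    min_pixels = ele.get("min_pixels", VIDEO_MIN_PIXELS)
    total_pixels = ele.get("total_pixels", VIDEO_TOTAL_PIXELS)
    # Spread the total budget across frames, bounded above and below.
    limit = max(
        min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),
        int(min_pixels * 1.05),
    )
    requested = ele.get("max_pixels", limit)
    # A request above the limit is the case that triggers the warning.
    return min(requested, limit)


print(per_frame_max_pixels({"max_pixels": 200_000}, nframes=20))
```

Under these assumptions, more sampled frames means a smaller per-frame budget, so long videos are downscaled more aggressively than short ones.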
Hi @autinn, I added some comments. Since there are also changes to the collator, we should make sure that existing training runs work as intended, while video models can train too. Great work, and I appreciate the help! Once the comments are addressed and finetuning is verified, we can merge this.
| def smart_nframes( | ||
| ele: dict, | ||
| total_frames: int, | ||
| video_fps: int | float, |
I think | needs to be changed to Union in order to support Python 3.9 (the X | Y annotation syntax requires Python 3.10).
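A small sketch of the suggested change, using a made-up helper rather than the PR's functions. `typing.Union` works on Python 3.9, whereas `int | float` in an annotation that is evaluated at definition time raises a `TypeError` before 3.10.

```python
from typing import Union

# Alias for readability; works on Python 3.9 and later.
Number = Union[int, float]


def frames_to_seconds(total_frames: int, video_fps: Number) -> float:
    """Hypothetical helper: video duration in seconds."""
    return total_frames / video_fps


print(frames_to_seconds(300, 30))
print(frames_to_seconds(300, 29.97))
```

The same substitution applies to return annotations, for example `Union[torch.Tensor, List[Image.Image]]` in place of `torch.Tensor | list[Image.Image]`.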
| return video_reader_backend | ||
|
|
||
|
|
||
| def fetch_video(ele: dict, image_factor: int = IMAGE_FACTOR, return_video_sample_fps: bool = False) -> torch.Tensor | list[Image.Image]: |
I think | needs to be changed to Union in order to support Python 3.9 (the X | Y annotation syntax requires Python 3.10).
Thanks a lot for the review and guidance @mmathew23! 🙏
…to UNSLOTH_VIDEO Neenu Antony <Neenu.antony@sjsu.edu> Suchith Gali <sgali@ucmerced.edu>
…to UNSLOTH_VIDEO Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
@mmathew23 Thank you again for your feedback! We worked on the code for a bit the past few days and here are the changes we made:
This is a cleaned up and merged version of unslothai#240. This adds video processing utilities for VLM finetuning. It's based on qwen-vl-utils repo https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils Co-authored-by: autinn <au-yeung@uni.minerva.edu> Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
Hello @autinn @neenz16 @suchithgali! I appreciate all the help on this. We had refactored the vision_utils code before getting this in. I also had to clean up some issues to make sure non-video finetuning still worked the same. I didn't have access to edit this PR branch directly, so I created a new PR with your contributions, merged it into the new code, and cleaned it up. You can check it out here: #279. Thank you so much for your contribution!
Thanks so much, @mmathew23! Really appreciate you merging and cleaning this up. Glad our contributions made it in. We'll check out #279!
This is a cleaned up and merged version of #240. This adds video processing utilities for VLM finetuning. It's based on qwen-vl-utils repo https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils Co-authored-by: autinn <au-yeung@uni.minerva.edu> Co-authored-by: Neenu Antony <Neenu.antony@sjsu.edu> Co-authored-by: Suchith Gali <sgali@ucmerced.edu>
Closing as #279 was merged.
Problem
Currently, Unsloth's vision pipeline is designed primarily for static image inputs. It lacks native functionality to process video files directly within the unsloth_zoo library. This prevents users from performing video-based inference with a pre-trained Qwen2.5-VL model without relying on external, non-integrated dependencies (e.g. qwen-vl-utils). This also relates to the feature request for processing video inputs (#3061).
Solution
Our team (@neenz16, @suchithgali, and @autinn) created this pull request to introduce a complete, self-contained solution for video inference within the unsloth_zoo library. Instead of adding an external dependency on Qwen2.5-VL/qwen-vl-utils, we implement the core video processing logic directly in unsloth_zoo/vision_utils.py using the source code from qwen_vl_utils/vision_process.py.
Key changes include:
Implemented a Full Video Pipeline: The functions needed to read, decode, and sample frames from video files, including smart_nframes, _read_video_torchvision, _read_video_decord, _read_video_torchcodec, and their helpers, have been added. This makes the unsloth_zoo library self-contained for video processing.
Dynamic Video Sampling: The fetch_video function has been fully implemented to handle video files, sampling frames adaptively and enforcing resolution and other model constraints.
Processor Integration: The process_vision_info function and UnslothVisionDataCollator are updated to correctly receive the video tensor and its associated metadata (such as FPS) and pass them to the model's processor.
Debug Print Statement: Added a print statement to indicate and verify that unsloth_zoo/vision_utils is being used for video reading with the specified backend.
Test
The provided Colab notebook demonstrates an end-to-end video inference run before and after this change, with and without the qwen-vl-utils dependency.
Before (Missing Dependency on Video Processing): A test shows that without installing qwen-vl-utils, a user attempting to run the original Qwen video inference code encounters a ModuleNotFoundError.
Before (With External Dependency on qwen-vl-utils): The notebook shows that when qwen-vl-utils is installed, a user can successfully run video inference using the library's functions. This establishes a baseline for the desired functionality.
After (Integrated qwen-vl-utils into unsloth): The new test script loads the unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit model. It then overwrites vision_utils.py in unsloth_zoo with the newly implemented code and processes a video file together with a textual prompt. The test runs a single inference call with model.generate() and confirms that the model processes the video and generates a coherent text response, all without relying on the external dependency.
Documentation Suggestion
We also propose adding a new “Video Inference” section to the “Vision Fine-tuning” part of the Unsloth documentation as a first step toward video fine-tuning. We are happy to work on that as well after the PR is merged to main.
Reference
qwen-vl-utils documentation: https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-utils
qwen vision_process.py code: https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py