[video processors] decode only sampled videos -> less RAM and faster processing by zucchini-nlp · Pull Request #39600 · huggingface/transformers

zucchini-nlp · 2025-07-23T09:55:59Z

What does this PR do?

This PR moves the video decoding code entirely into video processors, so that we can load only necessary video frames into memory. To be consistent with video processors, I also updated image processors to accept str in inputs and optionally load images.

The docs for video processors are also updated to explain how frames are sampled and what users need to do to turn sampling on or off. Note that we'll use torchcodec by default and fall back to torchvision, and we won't support arbitrary video decoders within the video processor class. Otherwise we'd need to introduce more kwargs and handle differences between decoders, which would bloat the code even more.
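
To make the sampling idea concrete, here is a minimal sketch of picking evenly spaced frame indices before decoding, so that only those frames would need to be loaded into memory. The helper name `sample_frame_indices` and its spacing rule are assumptions for illustration, not the code in this PR.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick `num_frames` evenly spaced indices out of `total_frames`.

    Hypothetical helper: the actual sampling logic in the video
    processors may differ (for example, it may sample by fps).
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Nothing to skip: every frame is kept.
        return list(range(total_frames))
    step = total_frames / num_frames
    # Truncate each evenly spaced position to a valid frame index.
    return [int(i * step) for i in range(num_frames)]


indices = sample_frame_indices(total_frames=100, num_frames=4)
print(indices)  # [0, 25, 50, 75]
```

With indices like these, a decoder that supports access by frame index only has to decode 4 frames instead of all 100.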

HuggingFaceDocBuilderDev · 2025-07-23T10:10:59Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

src/transformers/audio_utils.py

tests/models/qwen2_vl/test_video_processing_qwen2_vl.py

qubvel

Thanks for the PR, it should be a great improvement!

docs/source/en/main_classes/image_processor.md

src/transformers/audio_utils.py

qubvel · 2025-08-05T09:25:58Z

src/transformers/image_processing_utils_fast.py

cc @yonigozlan for changes in this file

src/transformers/image_utils.py

src/transformers/models/glm4v/image_processing_glm4v.py

src/transformers/models/glm4v/video_processing_glm4v.py

src/transformers/video_processing_utils.py

src/transformers/video_utils.py

zucchini-nlp · 2025-08-19T13:11:35Z

run-slow: aria, aya_vision, blip, bridgetower, chameleon, clip, colpali, deepseek_vl, deepseek_vl_hybrid, emu3, eomt, flava, gemma3, gemma3n, glm4v

github-actions · 2025-08-19T13:13:06Z

This comment contains run-slow, running the specified jobs:

models: ['models/aria', 'models/aya_vision', 'models/blip', 'models/bridgetower', 'models/chameleon', 'models/clip', 'models/colpali', 'models/deepseek_vl', 'models/deepseek_vl_hybrid', 'models/emu3', 'models/eomt', 'models/flava', 'models/gemma3', 'models/gemma3n', 'models/glm4v']
quantizations: [] ...

zucchini-nlp · 2025-08-19T14:29:55Z

run-slow: qwen2_vl, qwen2_5_vl, qwen2_5_omni, smolvlm, llava_onevision, llava_next_video, perception_lm

github-actions · 2025-08-19T14:31:14Z

This comment contains run-slow, running the specified jobs:

models: ['models/llava_next_video', 'models/llava_onevision', 'models/perception_lm', 'models/qwen2_5_omni', 'models/qwen2_5_vl', 'models/qwen2_vl', 'models/smolvlm']
quantizations: [] ...

zucchini-nlp · 2025-08-19T15:18:17Z

Oh no, the new torch release doesn't work well with Bytes objects 😓 (fails only in CI, still figuring out why)

zucchini-nlp · 2025-08-20T09:47:04Z

run-slow: qwen2_vl, qwen2_5_vl, qwen2_5_omni, smolvlm, llava_onevision, llava_next_video, perception_lm

zucchini-nlp · 2025-08-20T09:56:38Z

run-slow: qwen2_vl, qwen2_5_vl, qwen2_5_omni, smolvlm, llava_onevision, llava_next_video, perception_lm

github-actions · 2025-08-20T09:57:55Z

This comment contains run-slow, running the specified jobs:

models: ['models/llava_next_video', 'models/llava_onevision', 'models/perception_lm', 'models/qwen2_5_omni', 'models/qwen2_5_vl', 'models/qwen2_vl', 'models/smolvlm']
quantizations: [] ...

zucchini-nlp · 2025-08-26T08:03:53Z

The CI is impossible to pass 🙃

zucchini-nlp · 2025-08-26T08:45:17Z

@bot /style

github-actions · 2025-08-26T08:45:59Z

Style bot fixed some files and pushed the changes.

github-actions · 2025-08-26T08:47:59Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, blip, bridgetower, chameleon, clip, colpali, deepseek_vl, deepseek_vl_hybrid, emu3, eomt, flava, gemma3, gemma3n, glm4v

shaform · 2026-01-22T17:12:29Z

@zucchini-nlp I believe it was previously possible for users to sample frames themselves and then pass the batched video frames directly to the video processor. This usage can be found in existing code, for example in the following notebook: https://huggingface.co/facebook/vjepa2-vitl-fpc16-256-ssv2/blob/main/notebook_finetuning.ipynb

However, this no longer seems to be supported due to changes introduced in this PR. In particular, the notebook above now produces errors when run with the current versions.

Was this behavior change intentional, or is it an unintended regression?

zucchini-nlp · 2026-01-22T18:25:18Z

@shaform hey, it is still possible to sample frames yourself and pass them as a list of 3D frames or a 4D array. You just need to pass do_sample_frames=False in the video processor call to disable sampling. For example:

pixel_values = video_processor(my_sampled_video, do_sample_frames=False).pixel_values_videos

shaform · 2026-01-22T19:27:06Z

@zucchini-nlp Thank you for your quick response. I tested a bit and found that the notebook now fails because the processor returns a tensor with an incorrect shape when given a batched input. Here is a minimal example to reproduce:

# input: B x T x C x H x W
processor(th.zeros(4, 5, 3, 100, 100), return_tensors="pt", do_sample_frames=False).pixel_values_videos.shape
# output: 1 x 1 x B x T x C x H x W
Out[17]: torch.Size([1, 1, 4, 5, 3, 256, 256])

It seems a simple workaround is to convert the input into a list, i.e.,

processor([t for t in th.zeros(4, 5, 3, 100, 100)], return_tensors="pt", do_sample_frames=False).pixel_values_videos.shape

Could this be considered a bug?
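
The workaround above can be sketched as a small normalization step: split a 5D batch (B x T x C x H x W) into a list of 4D videos before calling the processor. The helpers `nested_shape` and `as_video_list` are hypothetical names used only for illustration. Nested lists stand in for tensors here so the sketch runs without extra dependencies, though iterating over a torch tensor or NumPy array along its first axis behaves the same way.

```python
from typing import Any, List, Tuple


def nested_shape(x: Any) -> Tuple[int, ...]:
    """Shape of a regular nested list, or of an object with `.shape`."""
    dims = []
    while isinstance(x, (list, tuple)) and len(x) > 0:
        dims.append(len(x))
        x = x[0]
    # Leaves that are array-like contribute their own shape.
    return tuple(dims) + tuple(getattr(x, "shape", ()))


def as_video_list(videos: Any) -> List[Any]:
    """Return a list of 4D videos (T x C x H x W) from 4D or 5D input."""
    ndim = len(nested_shape(videos))
    if ndim == 5:
        # Batched input: iterate over the batch axis.
        return [video for video in videos]
    if ndim == 4:
        # A single video: wrap it in a list.
        return [videos]
    raise ValueError(f"expected 4D or 5D input, got {ndim}D")


# Dummy batch with B=2, T=3, C=1, H=2, W=2.
batch = [[[[[0] * 2 for _ in range(2)]] for _ in range(3)] for _ in range(2)]
videos = as_video_list(batch)
print(len(videos), nested_shape(videos[0]))  # 2 (3, 1, 2, 2)
```

The resulting list could then be passed to the processor in place of the stacked 5D input, as in the one-liner above.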

zucchini-nlp · 2026-01-23T08:06:02Z

@shaform yeah, should not be happening. Would you mind opening an issue with a minimal reproducer (model_id and how you call it) so I don't forget about it?

zucchini-nlp added 4 commits July 15, 2025 16:25

draft update two models for now

0096460

batch update all VLMs first

34a2ff1

update some more image processors

4a8b169

merge main

c7ec229

zucchini-nlp added 10 commits July 23, 2025 12:37

update

0c1024a

fix a few tests

01ad83a

just make CI green for now

3187cc4

fix copies

7900ce8

update once more

47cce51

update

ea9a29b

merge main

4424f55

unskip the test

9957f9d

fix these two

1c9ad58

fix torchcodec audio loading

595fe00

zucchini-nlp commented Jul 28, 2025

src/transformers/audio_utils.py

tests/models/qwen2_vl/test_video_processing_qwen2_vl.py

zucchini-nlp mentioned this pull request Jul 31, 2025

Add support for including in-memory videos (not just files/urls) in apply_chat_template #39494

Merged

5 tasks

zucchini-nlp added 10 commits August 1, 2025 16:52

maybe

3f66126

merge main

4c5a674

yay, i fixed torchcodec installation and now can actually test it

ba02dec

Merge remote-tracking branch 'upstream/main' into video-decoding

272054f

fix copies deepseek

c05f31c

make sure the metadata is returrned when users request it

0fe6e26

add docs

6286a8a

update

1a78709

merge main

c9562f4

fixup

86ab24a

zucchini-nlp requested a review from qubvel August 4, 2025 13:12

Merge branch 'main' into video-decoding

f8f2506

qubvel reviewed Aug 5, 2025

zucchini-nlp added 3 commits August 19, 2025 17:58

Merge branch 'main' into video-decoding

de9f7fb

fixup

baeb0e4

typo

1fb826c

zucchini-nlp added 7 commits August 20, 2025 12:00

fix copies

c0a1c62

ifx smolvlm test

248716c

this is why torch's official benchmark was faster, set threads to 0

ca0e8ae

Merge branch 'main' into video-decoding

ceeab83

Merge branch 'main' into video-decoding

bf7e1d3

Merge branch 'main' into video-decoding

cd4f073

Merge branch 'main' into video-decoding

97e672f

Apply style fixes

1fb502b

zucchini-nlp merged commit f690a2a into huggingface:main Aug 26, 2025
24 checks passed

Isotr0py mentioned this pull request Sep 11, 2025

[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames vllm-project/vllm#24161

Merged

5 tasks

shaform mentioned this pull request Jan 23, 2026

Video processors return incorrect shape when input is batched #43450

Closed

4 tasks

4 participants
