VLM Energon updates for videos, multiple images#3691
Conversation
Light Code ReviewCritical Issues:
Missing Test Coverage:
Suggested test cases: No perf tests impacted. |
|
/ok to test ca928cb |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
|
/ok to test 9d1fed0 |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
|
/ok to test 67e6f24 |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
|
/ok to test 78aca60 |
Adapts the unit tests to the refactored encoder which now computes visual-token counts via .prod(dim=-1) (torch syntax) on the processor's image_grid_thw / video_grid_thw outputs. The mocks previously returned np.array, causing TypeError. Also bumps max_padding_length to 512 so the expanded sequence length stays within seq_len and avoids the new SkipSample() path. Signed-off-by: Huy Vu <huvu@nvidia.com>
Adds README section describing the three composable controls that bound GPU cost per sample (min/max_pixels, max_num_images/max_num_frames, max_visual_tokens) and asserts the PEFT energon recipe defaults so the documented contract is enforced by tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
|
/ok to test bca268d |
Pre-commit / ruff format requires two blank lines between a function and the following module-level block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
|
/ok to test 3b48e4e |
| input_ids = pre_expand_image_tokens( | ||
| text_inputs["input_ids"], | ||
| video_proc["video_grid_thw"], | ||
| image_token_id=151656, # <|video_pad|> for Qwen-VL family |
There was a problem hiding this comment.
this value should not be hardcoded, can we get it from the tokenizer? also it's a little confusing why this is the <video_pad token> when the function name is pre_expand_image_tokens
There was a problem hiding this comment.
Fixed both issues:
(1) Not using hardcoded.
(2) Change the name to pre_expand_vision_tokens to avoid confusing (the method is used for both image and video)
|
/ok to test 78f0d65 |
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
|
/ok to test 23641ca |
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com> Signed-off-by: Huy Vu <huvu@nvidia.com> Signed-off-by: Huy Vu2 <huvu@nvidia.com> Co-authored-by: Huy Vu2 <huvu@eos0156.eos.clusters.nvidia.com> Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
…instead of build_datasets() QwenVLEnergonProvider builds its task encoder eagerly at recipe-build time, capturing scalars such as seq_length, and stores it on the provider. Because the task encoder is a plain (non-dataclass) object, process_config_with_overrides excludes it from the OmegaConf round-trip and restores it verbatim. The per-field sync (seq_len, seq_length, min_pixels, max_pixels, max_num_images, max_num_frames, max_visual_tokens) was added in NVIDIA-NeMo#3691 inside build_datasets(). Move it into a finalize() override instead. finalize() is invoked by ConfigContainer.validate() right after overrides are merged, which is the hook the framework already reserves for post-override reconciliation, so the task encoder is made consistent with the (possibly overridden) config earlier and build_datasets() returns to pure dataset construction. Behavior is unchanged for Qwen3-VL when run through the normal training entrypoints, which call validate() (hence finalize()) before datasets are built. Signed-off-by: Naoyuki Terashita <nayopu3@gmail.com>
What does this PR do ?
Verifying and improving VLM Energon to work with videos, multiple images.
Task PR for reference: #3133
Note: this PR tested specifically for Qwen3-VL, which uses
qwen3_vl_step. General VLM, which usesvlm_stepwill be left for future work.Changelog
GitHub Actions CI
See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information