VLM Energon updates for videos, multiple images by huvunvidia · Pull Request #3691 · NVIDIA-NeMo/Megatron-Bridge

huvunvidia · 2026-05-05T17:06:39Z

What does this PR do ?

Verifies and improves VLM Energon support for videos and multiple images.
Task PR for reference: #3133

Note: this PR was tested specifically with Qwen3-VL, which uses qwen3_vl_step. Support for general VLMs, which use vlm_step, is left for future work.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

copy-pr-bot · 2026-05-05T17:06:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-05-05T17:10:58Z

Light Code Review

Critical Issues:

Commented-out raise SkipSample() is a logic bug (task_encoder.py:350): The warning says "dropping sample" but the raise is commented out, so the sample silently proceeds into truncation that the warning itself says will corrupt visual tokens. Either uncomment the raise or fix the log message.
5 bare print() debug statements (task_encoder.py:226, 239, 269, 349): Project rules prohibit bare print(). Use logging or print_rank_0(). Each debug print is redundant with a logging.warning() call directly above it. These should be removed before merge.
DEBUGGING comment and commented-out code (task_encoder.py:60-61): The DEBUGGING label is not a valid justification for keeping commented-out code. The real reason (pre-decoded WDS frames) is explained in the comment block below. Clean up the stale marker and the dead line.
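The first critical issue can be illustrated with a minimal sketch. It assumes a locally defined stand-in `SkipSample` exception (so the example is self-contained) and a hypothetical helper `check_fits`; neither the helper name nor its signature comes from the PR. It only shows the warning and the raise staying in agreement.

```python
import logging

logger = logging.getLogger(__name__)


class SkipSample(Exception):
    """Local stand-in for the data loader's skip signal (assumption)."""


def check_fits(num_tokens: int, seq_len: int) -> None:
    """Drop an over-long sample instead of truncating through visual tokens."""
    if num_tokens > seq_len:
        logger.warning(
            "Sample has %d tokens but seq_len is %d; dropping sample.",
            num_tokens,
            seq_len,
        )
        # The raise must stay active: without it the log line claims a drop
        # while the sample still flows into truncation.
        raise SkipSample()
```

Using `logger.warning` here also covers the second issue, since it replaces the bare `print()` calls with a single logging path.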

Missing Test Coverage:

The new QwenVLEnergonProvider fields (max_num_images, max_num_frames, max_visual_tokens) are not asserted in test_qwen3_vl_8b_peft_energon_task_encoder.
No test covers QwenVLEnergonProvider.build_datasets syncing its fields onto the task encoder.
No test covers the new pixel_values_videos / video_grid_thw normalization paths in Qwen2_5_VLVisualInputs.normalized_for_model().
No test covers the max_num_images skip, max_num_frames truncation, or max_visual_tokens skip logic in encode_sample.
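As a rough sketch of the behaviors the last item asks tests to cover, the snippet below models the three limits in isolation. `VisualLimits`, `apply_limits`, and `check_visual_tokens` are hypothetical names, `SkipSample` is a local stand-in, and keeping the first `max_num_frames` frames is an assumption; the PR's `encode_sample` may select frames differently.

```python
from dataclasses import dataclass
from typing import List, Optional


class SkipSample(Exception):
    """Local stand-in for the data loader's skip signal (assumption)."""


@dataclass
class VisualLimits:
    max_num_images: Optional[int] = None
    max_num_frames: Optional[int] = None
    max_visual_tokens: Optional[int] = None


def apply_limits(images: List[str], frames: List[str], limits: VisualLimits) -> List[str]:
    """Return the frames to keep, or raise SkipSample if there are too many images."""
    if limits.max_num_images is not None and len(images) > limits.max_num_images:
        raise SkipSample()
    if limits.max_num_frames is not None:
        # Assumption: truncation keeps the leading frames.
        frames = frames[: limits.max_num_frames]
    return frames


def check_visual_tokens(num_visual_tokens: int, limits: VisualLimits) -> None:
    """Raise SkipSample when the visual-token budget is exceeded."""
    if limits.max_visual_tokens is not None and num_visual_tokens > limits.max_visual_tokens:
        raise SkipSample()
```

A unit test along these lines would assert one case per limit: a sample with too many images is skipped, a long video is shortened rather than skipped, and a sample over the token budget is skipped.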

Suggested test cases:

No perf tests impacted.

huvunvidia · 2026-05-06T15:32:28Z

/ok to test ca928cb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

huvunvidia · 2026-05-06T15:56:24Z

/ok to test 9d1fed0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

huvunvidia · 2026-05-06T17:05:01Z

/ok to test 67e6f24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

huvunvidia · 2026-05-06T17:25:20Z

/ok to test 78aca60

Adapts the unit tests to the refactored encoder which now computes visual-token counts via .prod(dim=-1) (torch syntax) on the processor's image_grid_thw / video_grid_thw outputs. The mocks previously returned np.array, causing TypeError. Also bumps max_padding_length to 512 so the expanded sequence length stays within seq_len and avoids the new SkipSample() path. Signed-off-by: Huy Vu <huvu@nvidia.com>
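For illustration, the token-count arithmetic described in this commit message can be sketched with plain Python integers. In the PR it is a tensor operation (`.prod(dim=-1)`); NumPy's `prod` takes `axis` rather than `dim`, which is consistent with the `TypeError` the old mocks hit. The division by `merge_size**2` and the default `merge_size=2` are assumptions based on how Qwen-VL style processors merge patches, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Grid = Tuple[int, int, int]  # (t, h, w) measured in patches


def visual_token_count(grids: Sequence[Grid], merge_size: int = 2) -> int:
    """Sum of t*h*w per grid, divided by the spatial merge factor (assumption)."""
    return sum(math.prod(g) // merge_size**2 for g in grids)


def fits_in_sequence(text_tokens: int, grids: Sequence[Grid], seq_len: int, merge_size: int = 2) -> bool:
    """Check whether the sequence still fits once placeholders are expanded."""
    # Each visual input is assumed to occupy one placeholder token before expansion.
    expanded = text_tokens - len(grids) + visual_token_count(grids, merge_size)
    return expanded <= seq_len
```

With mocked grids, a check like `fits_in_sequence` shows why the test's padding length has to grow with the expanded token count to stay off the skip path.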

Adds README section describing the three composable controls that bound GPU cost per sample (min/max_pixels, max_num_images/max_num_frames, max_visual_tokens) and asserts the PEFT energon recipe defaults so the documented contract is enforced by tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

huvunvidia · 2026-05-07T15:39:18Z

/ok to test bca268d

Pre-commit / ruff format requires two blank lines between a function and the following module-level block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

huvunvidia · 2026-05-07T15:45:55Z

/ok to test 3b48e4e

cuichenx · 2026-05-13T21:21:21Z

+    input_ids = pre_expand_image_tokens(
+        text_inputs["input_ids"],
+        video_proc["video_grid_thw"],
+        image_token_id=151656,  # <|video_pad|> for Qwen-VL family


This value should not be hardcoded; can we get it from the tokenizer? Also, it's a little confusing that this is the <|video_pad|> token when the function is named pre_expand_image_tokens.

Fixed both issues:
(1) The token ID is no longer hardcoded.
(2) Renamed the function to pre_expand_vision_tokens to avoid confusion (it is used for both images and videos).
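A minimal sketch of the idea discussed in this thread, under stated assumptions: the placeholder ID is looked up by token string instead of being hardcoded, and each placeholder is repeated according to its grid. `StubTokenizer` and `expand_vision_placeholders` are hypothetical stand-ins, not the PR's `pre_expand_vision_tokens`; the stub mirrors the `convert_tokens_to_ids` method of Hugging Face tokenizers, and `merge_size=2` is an assumption.

```python
from typing import Dict, List, Sequence, Tuple


class StubTokenizer:
    """Hypothetical stand-in exposing convert_tokens_to_ids like a Hugging Face tokenizer."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


def expand_vision_placeholders(
    input_ids: Sequence[int],
    grid_thw: Sequence[Tuple[int, int, int]],
    vision_token_id: int,
    merge_size: int = 2,
) -> List[int]:
    """Repeat each placeholder so it covers t*h*w // merge_size**2 positions."""
    counts = [t * h * w // merge_size**2 for t, h, w in grid_thw]
    num_placeholders = sum(1 for tok in input_ids if tok == vision_token_id)
    if num_placeholders != len(counts):
        raise ValueError("number of placeholders does not match number of grids")
    remaining = iter(counts)
    expanded: List[int] = []
    for tok in input_ids:
        if tok == vision_token_id:
            expanded.extend([tok] * next(remaining))
        else:
            expanded.append(tok)
    return expanded


# The same function serves images and videos; only the looked-up ID differs.
tokenizer = StubTokenizer({"<|video_pad|>": 151656})
video_token_id = tokenizer.convert_tokens_to_ids("<|video_pad|>")
ids = expand_vision_placeholders([1, video_token_id, 2], [(2, 4, 4)], video_token_id)
```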

huvunvidia · 2026-05-19T21:15:07Z

/ok to test 78f0d65

Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

huvunvidia · 2026-05-19T21:19:59Z

/ok to test 23641ca

cuichenx

LGTM

Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com> Signed-off-by: Huy Vu <huvu@nvidia.com> Signed-off-by: Huy Vu2 <huvu@nvidia.com> Co-authored-by: Huy Vu2 <huvu@eos0156.eos.clusters.nvidia.com> Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

…instead of build_datasets()

QwenVLEnergonProvider builds its task encoder eagerly at recipe-build time, capturing scalars such as seq_length, and stores it on the provider. Because the task encoder is a plain (non-dataclass) object, process_config_with_overrides excludes it from the OmegaConf round-trip and restores it verbatim.

The per-field sync (seq_len, seq_length, min_pixels, max_pixels, max_num_images, max_num_frames, max_visual_tokens) was added in NVIDIA-NeMo#3691 inside build_datasets(). Move it into a finalize() override instead. finalize() is invoked by ConfigContainer.validate() right after overrides are merged, which is the hook the framework already reserves for post-override reconciliation, so the task encoder is made consistent with the (possibly overridden) config earlier and build_datasets() returns to pure dataset construction.

Behavior is unchanged for Qwen3-VL when run through the normal training entrypoints, which call validate() (hence finalize()) before datasets are built.

Signed-off-by: Naoyuki Terashita <nayopu3@gmail.com>
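The stale-value problem this follow-up addresses can be sketched as below. `SketchTaskEncoder` and `SketchProvider` are hypothetical classes, not the real `QwenVLEnergonProvider`; building the encoder in `__post_init__` and the two synced fields are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


class SketchTaskEncoder:
    """Plain (non-dataclass) object that captures scalars when it is built."""

    def __init__(self, seq_len: int, max_num_images: Optional[int] = None) -> None:
        self.seq_len = seq_len
        self.max_num_images = max_num_images


@dataclass
class SketchProvider:
    seq_length: int
    max_num_images: Optional[int] = None
    task_encoder: Optional[SketchTaskEncoder] = None

    def __post_init__(self) -> None:
        # Eager construction: the encoder snapshots the current field values.
        if self.task_encoder is None:
            self.task_encoder = SketchTaskEncoder(self.seq_length, self.max_num_images)

    def finalize(self) -> None:
        """Reconcile the encoder with fields that overrides may have changed."""
        self.task_encoder.seq_len = self.seq_length
        self.task_encoder.max_num_images = self.max_num_images


provider = SketchProvider(seq_length=4096)
provider.seq_length = 8192  # simulated config override after the encoder was built
stale_seq_len = provider.task_encoder.seq_len
provider.finalize()
synced_seq_len = provider.task_encoder.seq_len
```

In this sketch the encoder keeps the old sequence length until `finalize()` runs, which is the gap the per-field sync closes.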

workable code

60294bf

claude Bot reviewed May 5, 2026

Comment thread src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py Outdated

claude Bot reviewed May 5, 2026

Comment thread src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py Outdated

claude Bot reviewed May 5, 2026

Comment thread src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py Outdated

claude Bot reviewed May 5, 2026

Comment thread tests/unit_tests/recipes/qwen_vl/test_qwen3_vl_recipes.py

Huy Vu2 added 2 commits May 6, 2026 07:59

adding inference code for Qwen3 for multi-images and video

33ddcd9

resolve conflict

ca928cb

style: fix ruff-format line-length violations flagged by CI

9d1fed0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

style: apply ruff-format reformats and remove debug prints

67e6f24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

Huy Vu2 and others added 2 commits May 6, 2026 10:13

style: fix remaining ruff-format violations in task_encoder.py

c1555c0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

Merge remote-tracking branch 'origin/main' into huvu/vlm_energon

78aca60

copy-pr-bot Bot temporarily deployed to test May 6, 2026 17:26 Inactive

yaoyu-33 added the area:data and needs-review labels May 7, 2026

huvunvidia and others added 3 commits May 7, 2026 08:22

Merge remote-tracking branch 'origin/main' into huvu/vlm_energon

4623a8f

copy-pr-bot Bot temporarily deployed to public May 7, 2026 15:40 Inactive

copy-pr-bot Bot temporarily deployed to public May 7, 2026 15:46 Inactive

copy-pr-bot Bot temporarily deployed to test May 7, 2026 15:47 Inactive

copy-pr-bot Bot temporarily deployed to public May 7, 2026 15:59 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 14:36 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 14:51 Inactive

cuichenx reviewed May 13, 2026

adding examples for mantis dataset for finetuning

6493a99

yaoyu-33 added the waiting-on-customer label and removed the needs-review label May 19, 2026

Huy Vu2 added 2 commits May 19, 2026 13:59

addressing comments

6c4dec3

change prepare_mantis_energon.py file location

78f0d65

copy-pr-bot Bot temporarily deployed to public May 19, 2026 21:15 Inactive

[recipe] chore: apply ruff format to prepare_mantis_energon.py

23641ca

Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

copy-pr-bot Bot temporarily deployed to public May 19, 2026 21:20 Inactive

copy-pr-bot Bot temporarily deployed to test May 19, 2026 21:20 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 21:28 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 21:44 Inactive

cuichenx approved these changes May 19, 2026

cuichenx added the ready-to-merge label and removed the waiting-on-customer label May 19, 2026

cuichenx enabled auto-merge (squash) May 19, 2026 21:46

cuichenx merged commit 609255f into main May 19, 2026
97 checks passed

cuichenx deleted the huvu/vlm_energon branch May 19, 2026 22:48

cuichenx mentioned this pull request Jun 2, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

nayopu mentioned this pull request Jun 13, 2026

[recipes] refactor: Sync Qwen3-VL Energon task encoder in finalize() instead of build_datasets() #4341

Closed
