Skip to content

VLM Energon updates for videos, multiple images#3691

Merged
cuichenx merged 23 commits into
mainfrom
huvu/vlm_energon
May 19, 2026
Merged

VLM Energon updates for videos, multiple images#3691
cuichenx merged 23 commits into
mainfrom
huvu/vlm_energon

Conversation

@huvunvidia

@huvunvidia huvunvidia commented May 5, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

Verifying and improving VLM Energon to work with videos, multiple images.
Task PR for reference: #3133

Note: this PR tested specifically for Qwen3-VL, which uses qwen3_vl_step. General VLM, which uses vlm_step will be left for future work.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py Outdated
Comment thread src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py Outdated
Comment thread src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py Outdated
Comment thread tests/unit_tests/recipes/qwen_vl/test_qwen3_vl_recipes.py
@claude

claude Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Light Code Review

Critical Issues:

  1. Commented-out raise SkipSample() is a logic bug (task_encoder.py:350): The warning says "dropping sample" but the raise is commented out, so the sample silently proceeds into truncation that the warning itself says will corrupt visual tokens. Either uncomment the raise or fix the log message.

  2. 5 bare print() debug statements (task_encoder.py:226, 239, 269, 349): Project rules prohibit bare print(). Use logging or print_rank_0(). Each debug print is redundant with a logging.warning() call directly above it. These should be removed before merge.

  3. DEBUGGING comment and commented-out code (task_encoder.py:60-61): The DEBUGGING label is not a valid justification for keeping commented-out code. The real reason (pre-decoded WDS frames) is explained in the comment block below. Clean up the stale marker and the dead line.

Missing Test Coverage:

  • The new QwenVLEnergonProvider fields (max_num_images, max_num_frames, max_visual_tokens) are not asserted in test_qwen3_vl_8b_peft_energon_task_encoder.
  • No test covers QwenVLEnergonProvider.build_datasets syncing its fields onto the task encoder.
  • No test covers the new pixel_values_videos / video_grid_thw normalization paths in Qwen2_5_VLVisualInputs.normalized_for_model().
  • No test covers the max_num_images skip, max_num_frames truncation, or max_visual_tokens skip logic in encode_sample.

Suggested test cases:

No perf tests impacted.

@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test ca928cb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 9d1fed0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 67e6f24

Huy Vu2 and others added 2 commits May 6, 2026 10:13
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 78aca60

@yaoyu-33 yaoyu-33 added area:data Dataset builders, preprocessing, and samplers needs-review PR is ready for code review and waiting on a reviewer labels May 7, 2026
huvunvidia and others added 3 commits May 7, 2026 08:22
Adapts the unit tests to the refactored encoder which now computes
visual-token counts via .prod(dim=-1) (torch syntax) on the processor's
image_grid_thw / video_grid_thw outputs. The mocks previously returned
np.array, causing TypeError. Also bumps max_padding_length to 512 so
the expanded sequence length stays within seq_len and avoids the new
SkipSample() path.

Signed-off-by: Huy Vu <huvu@nvidia.com>
Adds README section describing the three composable controls that bound
GPU cost per sample (min/max_pixels, max_num_images/max_num_frames,
max_visual_tokens) and asserts the PEFT energon recipe defaults so the
documented contract is enforced by tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test bca268d

Pre-commit / ruff format requires two blank lines between a function and
the following module-level block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 3b48e4e

input_ids = pre_expand_image_tokens(
text_inputs["input_ids"],
video_proc["video_grid_thw"],
image_token_id=151656, # <|video_pad|> for Qwen-VL family

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this value should not be hardcoded, can we get it from the tokenizer? also it's a little confusing why this is the <video_pad token> when the function name is pre_expand_image_tokens

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed both issues:
(1) Not using hardcoded.
(2) Change the name to pre_expand_vision_tokens to avoid confusing (the method is used for both image and video)

@yaoyu-33 yaoyu-33 added waiting-on-customer Waiting on the original author to respond and removed needs-review PR is ready for code review and waiting on a reviewer labels May 19, 2026
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 78f0d65

Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 23641ca

@cuichenx cuichenx left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cuichenx cuichenx added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed waiting-on-customer Waiting on the original author to respond labels May 19, 2026
@cuichenx cuichenx enabled auto-merge (squash) May 19, 2026 21:46
@cuichenx cuichenx merged commit 609255f into main May 19, 2026
97 checks passed
@cuichenx cuichenx deleted the huvu/vlm_energon branch May 19, 2026 22:48
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Huy Vu <huvu@nvidia.com>
Signed-off-by: Huy Vu2 <huvu@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@eos0156.eos.clusters.nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
nayopu added a commit to nayopu/Megatron-Bridge that referenced this pull request Jun 13, 2026
…instead of build_datasets()

QwenVLEnergonProvider builds its task encoder eagerly at recipe-build time,
capturing scalars such as seq_length, and stores it on the provider. Because the
task encoder is a plain (non-dataclass) object, process_config_with_overrides
excludes it from the OmegaConf round-trip and restores it verbatim.

The per-field sync (seq_len, seq_length, min_pixels, max_pixels, max_num_images,
max_num_frames, max_visual_tokens) was added in NVIDIA-NeMo#3691 inside build_datasets().
Move it into a finalize() override instead. finalize() is invoked by
ConfigContainer.validate() right after overrides are merged, which is the hook
the framework already reserves for post-override reconciliation, so the task
encoder is made consistent with the (possibly overridden) config earlier and
build_datasets() returns to pure dataset construction. Behavior is unchanged for
Qwen3-VL when run through the normal training entrypoints, which call
validate() (hence finalize()) before datasets are built.

Signed-off-by: Naoyuki Terashita <nayopu3@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:data Dataset builders, preprocessing, and samplers feature New capabilities, enhancements, or enablement work ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants