[data] tracking: VLM SFT benchmark — high-res & interleaved multi-image dataset support #3133

## Summary

Track end-to-end support for high-resolution image and interleaved multi-image datasets in the VLM SFT benchmark pipeline (Energon-based).

Driven by user requests to cover corner cases where encoder bottlenecks appear (e.g., 10+ images per sample, very large image sizes).

## Phase 1 — Functional Test (throughput + loss convergence)

Verify that data loads via Energon, images are processed correctly, throughput holds up, and loss converges. (Needs data sample balancing for the vision encoder.)

- [x] `TIGER-Lab/Mantis-Instruct` — interleaved multi-image (10+ images/sample), throughput verification + Energon image processing
- [x] `Ahren09/InfoVQA` — high-res infographic images, throughput + loss convergence
- [x] `lmms-lab/LLaVA-Video-178K` — video pipeline functional check

### Functional test checklist (per dataset)
- [x] `datasets.load_dataset()` loads successfully
- [x] Data format parseable by current VLM data pipeline (Energon)
- [x] Images/video decode correctly (especially large images, multi-image, video)
- [x] Single forward pass without OOM
- [x] 10-50 step training run completes without errors, loss converges
- [x] For video: confirm on-the-fly loading support
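
The "loss converges" item for the 10-50 step run can be made less subjective with a small check on the logged loss values. The following is a minimal sketch, not the check used in the runs recorded here: the function name `loss_converges`, the window size, and the 5% relative-drop threshold are all assumptions chosen for illustration.

```python
import math
from statistics import mean


def loss_converges(losses, window=5, min_rel_drop=0.05):
    """Heuristic convergence check for a short smoke-test run.

    Compares the mean loss of the first `window` steps with the mean of
    the last `window` steps. `window` and `min_rel_drop` are illustrative
    defaults, not values taken from the benchmark pipeline.
    """
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    # Any NaN/inf means the run diverged, regardless of the trend.
    if not all(math.isfinite(value) for value in losses):
        return False
    head = mean(losses[:window])
    tail = mean(losses[-window:])
    return tail < head * (1.0 - min_rel_drop)
```

For example, passing the per-step losses exported from a 20-step run returns `True` only if the last five steps average at least 5% below the first five.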

Change notes:
- Adding proper support for video
- Adding control and processing logic for `max_num_images`, `max_num_frames`, `max_visual_tokens`
- Fixing an issue where sequence truncation cuts off visual tokens (e.g., when running with PP>1, where the sequence length needs to be constant)
- Adding the `QwenVLEnergonProvider` wrapper to expose VLM-specific knobs on the CLI

Resulting general setup:
- `max_pixels`: upper bound on media height/width; media exceeding it are resized
- `max_num_images`, `max_num_frames`: upper bounds on the number of media items; items beyond the limit are dropped
- `max_visual_tokens`: upper bound on the number of media tokens; media exceeding it are dropped
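
How the three limits might interact can be sketched as below. This is an illustration under stated assumptions, not the logic in `QwenVLEnergonProvider`: `MediaLimits`, `resize_to_max_pixels`, `estimate_visual_tokens`, and `apply_limits` are hypothetical names; `max_pixels` is treated as a bound on pixel area (height × width); the token estimate assumes one visual token per 32×32 pixel tile; and media are dropped from the end of the sample once a limit is hit.

```python
import math
from dataclasses import dataclass


@dataclass
class MediaLimits:
    # All defaults are illustrative, not pipeline defaults.
    max_pixels: int = 1024 * 1024
    max_num_images: int = 10
    max_visual_tokens: int = 8192
    pixels_per_token_side: int = 32  # assumed patch size * merge size


def resize_to_max_pixels(height, width, max_pixels):
    """Scale (height, width) down, keeping aspect ratio, to fit max_pixels."""
    area = height * width
    if area <= max_pixels:
        return height, width
    scale = math.sqrt(max_pixels / area)
    return max(1, math.floor(height * scale)), max(1, math.floor(width * scale))


def estimate_visual_tokens(height, width, side):
    """Rough token count: one token per side x side tile (an assumption)."""
    return math.ceil(height / side) * math.ceil(width / side)


def apply_limits(sizes, limits):
    """Return (kept_sizes, tokens_used) for a list of (height, width) media."""
    kept, used = [], 0
    # Media beyond max_num_images are dropped outright.
    for height, width in sizes[: limits.max_num_images]:
        h, w = resize_to_max_pixels(height, width, limits.max_pixels)
        tokens = estimate_visual_tokens(h, w, limits.pixels_per_token_side)
        # Stop before the visual-token budget would be exceeded.
        if used + tokens > limits.max_visual_tokens:
            break
        kept.append((h, w))
        used += tokens
    return kept, used
```

Dropping whole media items, rather than truncating their tokens, keeps every retained image or frame intact, which is in the spirit of the truncation fix noted above.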
  
Records:
- Wandb convergence runs for 8b:
   - infovqa: https://wandb.ai/nvidia/vlm_energon_naver/runs/r3ocpr6e?nw=nwuserhuvu
   - mantis: https://wandb.ai/nvidia/vlm_energon_naver/runs/5tpirdqv?nw=nwuserhuvu
   - llava_video: https://wandb.ai/nvidia/vlm_energon_naver/runs/g7mbdx6t?nw=nwuserhuvu
- Wandb benchmarking runs for 30b and 235b:
   - https://wandb.ai/nvidia/vlm_energon_naver/reports/Qwen3-VL-Energon-benchmarking--VmlldzoxNjkzOTcwNw


## Phase 2 — Benchmark + Perf Optimization

After functional tests pass, move to performance reproduction and benchmark accuracy.

- [ ] `lmms-lab/LLaVA-NeXT-Interleave-Bench` — perf reproduction for interleaved multi-image (17 GB, 39K rows)
- [ ] `Ahren09/InfoVQA` — (continued) benchmark accuracy for high-res
- [ ] `lmms-lab/LLaVA-Video-178K` — (continued) video perf benchmark
- [ ] `HuggingFaceFV/finevideo` — stretch: long video + audio

## Dataset Summary

| Dataset | Category | HF Link | Size |
|---------|----------|---------|------|
| `TIGER-Lab/Mantis-Instruct` | Multi-image interleaved | [HF](https://huggingface.co/datasets/TIGER-Lab/Mantis-Instruct) | 462 MB, 1M rows |
| `Ahren09/InfoVQA` | High-res image | [HF](https://huggingface.co/datasets/Ahren09/InfoVQA) | 2 GB, 30K rows |
| `lmms-lab/LLaVA-Video-178K` | Video | [HF](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | 645 MB annotations + videos, 1.6M rows |
| `lmms-lab/LLaVA-NeXT-Interleave-Bench` | Interleaved benchmark | [HF](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Interleave-Bench) | 17 GB, 39K rows |
| `HuggingFaceFV/finevideo` | Long video + audio | [HF](https://huggingface.co/datasets/HuggingFaceFV/finevideo) | Large |

## Context

- All benchmarking and validation should be conducted within an Energon-based setup.
- Users specifically care about samples with 10+ image paths per sample, which expose encoder bottlenecks.
- `MP-DocVQA` and `MINT-1T-PDF` were considered but excluded since `Mantis-Instruct` and `LLaVA-NeXT-Interleave-Bench` already cover 10+ image interleaved samples.
