[data] tracking: VLM SFT benchmark — high-res & interleaved multi-image dataset support #3133

@yaoyu-33

Summary

Track end-to-end support for high-resolution image and interleaved multi-image datasets in the VLM SFT benchmark pipeline (Energon-based).

Driven by user requests to cover corner cases where encoder bottlenecks appear (e.g., 10+ images per sample, very large image sizes).

Phase 1 — Functional Test (throughput + loss convergence)

Verify that data loads through Energon, images are processed correctly, throughput holds up, and loss converges. (Needs balanced data samples for the vision encoder.)

  • TIGER-Lab/Mantis-Instruct — interleaved multi-image (10+ images/sample), throughput verification + Energon image processing
  • Ahren09/InfoVQA — high-res infographic images, throughput + loss convergence
  • lmms-lab/LLaVA-Video-178K — video pipeline functional check

Functional test checklist (per dataset)

  • datasets.load_dataset() loads successfully
  • Data format parseable by current VLM data pipeline (Energon)
  • Images/video decode correctly (especially large images, multi-image, video)
  • Single forward pass without OOM
  • 10-50 step training run completes without errors, loss converges
  • For video: confirm on-the-fly loading support
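The "loss converges" item in the checklist can be turned into a quick automated check on the losses logged during the 10-50 step run. The sketch below is only one possible criterion and is an assumption, not something the pipeline defines: it requires every loss to be finite and the mean of the last few steps to be lower than the mean of the first few. The function name `check_loss_curve` and the `window` parameter are hypothetical.

```python
import math


def check_loss_curve(losses, window=5):
    """Return (ok, reason) for a short training run.

    Assumed criterion (not defined by the issue): all losses are finite and
    the mean of the last `window` steps is below the mean of the first
    `window` steps.
    """
    if len(losses) < 2 * window:
        return False, "too few steps"
    if any(not math.isfinite(x) for x in losses):
        return False, "non-finite loss"
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    if tail >= head:
        return False, "loss did not decrease"
    return True, "ok"
```

For a 20-step run, `check_loss_curve(step_losses)` compares steps 1-5 against steps 16-20; a stricter threshold or a smoothed curve may be more appropriate for noisy multi-image batches.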

Change notes:

  • Adding proper support for video
  • Adding control and processing logic for max_num_images, max_num_frames, max_visual_tokens
  • Fixing an issue where sequence truncation cuts off visual tokens (e.g., when running with PP>1, the sequence length needs to be constant)
  • Adding wrapper QwenVLEnergonProvider to bring VLM-specific knobs to CLI
    => General setup:
    • max_pixels: upper bound on media height/width; media that exceed it are resized
    • max_num_images, max_num_frames: upper bounds on the number of media items; media are dropped when the limit is exceeded
    • max_visual_tokens: upper bound on the number of media tokens; media are dropped when the limit is exceeded
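How the three caps might interact can be sketched in a few lines. This is an illustration under stated assumptions, not the logic of QwenVLEnergonProvider: `max_pixels` is read here as a cap on total pixel count (height × width), one visual token is assumed to cover a 28×28 pixel area (a Qwen2-VL-style 14-pixel patch with 2×2 merging), and "dropping" is read as discarding trailing media rather than the whole sample. All function names are hypothetical.

```python
import math

# Assumption: one visual token covers a 28x28 pixel area. Adjust for the
# actual encoder; this constant is not taken from the issue.
PIXELS_PER_TOKEN_SIDE = 28


def resize_to_max_pixels(height, width, max_pixels):
    """Shrink (height, width), keeping aspect ratio, so that the pixel count
    is at most max_pixels (up to floating-point rounding)."""
    if height * width <= max_pixels:
        return height, width
    scale = math.sqrt(max_pixels / (height * width))
    return max(1, math.floor(height * scale)), max(1, math.floor(width * scale))


def count_visual_tokens(height, width):
    """Estimate the visual token count for one image under the assumption above."""
    rows = math.ceil(height / PIXELS_PER_TOKEN_SIDE)
    cols = math.ceil(width / PIXELS_PER_TOKEN_SIDE)
    return rows * cols


def apply_media_caps(sizes, max_pixels, max_num_images, max_visual_tokens):
    """Apply the caps to a list of (height, width) pairs.

    Returns the kept (resized) sizes and their total estimated token count.
    Media beyond max_num_images, or media that would push the running token
    total past max_visual_tokens, are dropped from the tail.
    """
    kept = []
    total_tokens = 0
    for height, width in sizes[:max_num_images]:
        height, width = resize_to_max_pixels(height, width, max_pixels)
        tokens = count_visual_tokens(height, width)
        if total_tokens + tokens > max_visual_tokens:
            break
        kept.append((height, width))
        total_tokens += tokens
    return kept, total_tokens
```

Applying the caps before tokenization, as in this sketch, means the text sequence is built only from media that fit the budget, which is one way to avoid truncating in the middle of a run of visual tokens.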

Records:

Phase 2 — Benchmark + Perf Optimization

After functional tests pass, move to performance reproduction and benchmark accuracy.

  • lmms-lab/LLaVA-NeXT-Interleave-Bench — perf reproduction for interleaved multi-image (17 GB, 39K rows)
  • Ahren09/InfoVQA — (continued) benchmark accuracy for high-res
  • lmms-lab/LLaVA-Video-178K — (continued) video perf benchmark
  • HuggingFaceFV/finevideo — stretch: long video + audio

Dataset Summary

| Dataset | Category | HF Link | Size |
| --- | --- | --- | --- |
| TIGER-Lab/Mantis-Instruct | Multi-image interleaved | HF | 462 MB, 1M rows |
| Ahren09/InfoVQA | High-res image | HF | 2 GB, 30K rows |
| lmms-lab/LLaVA-Video-178K | Video | HF | 645 MB annotations + videos, 1.6M rows |
| lmms-lab/LLaVA-NeXT-Interleave-Bench | Interleaved benchmark | HF | 17 GB, 39K rows |
| HuggingFaceFV/finevideo | Long video + audio | HF | Large |

Context

  • All benchmarking and validation should be conducted within an Energon-based setup.
  • Users specifically care about samples with 10+ image paths per sample, which expose encoder bottlenecks.
  • MP-DocVQA and MINT-1T-PDF were considered but excluded since Mantis-Instruct and LLaVA-NeXT-Interleave-Bench already cover 10+ image interleaved samples.
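
To confirm that a dataset actually exercises the 10+ images per sample case, it can help to measure the image-count distribution before benchmarking. The following is a minimal sketch that assumes each sample is a dict with a list under an `images` key; that field name is an assumption and should be adapted to the dataset's real schema. The helper name `image_count_stats` is hypothetical.

```python
from collections import Counter


def image_count_stats(samples, images_key="images", threshold=10):
    """Summarize how many samples carry at least `threshold` images.

    `samples` is any iterable of dicts; a missing or None image field is
    counted as zero images.
    """
    counts = [len(sample.get(images_key) or []) for sample in samples]
    heavy = sum(1 for c in counts if c >= threshold)
    return {
        "num_samples": len(counts),
        "num_heavy": heavy,
        "heavy_fraction": heavy / len(counts) if counts else 0.0,
        "histogram": dict(Counter(counts)),
    }
```

A low `heavy_fraction` would suggest filtering or re-weighting toward multi-image samples so that the encoder bottleneck is visible in throughput numbers.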
Pinned by huvunvidia

Metadata

Labels: area:data, feature, tracking, x-naver
