Summary
Track end-to-end support for high-resolution image and interleaved multi-image datasets in the VLM SFT benchmark pipeline (Energon-based).
Driven by user requests to cover corner cases where encoder bottlenecks appear (e.g., 10+ images per sample, very large image sizes).
Phase 1 — Functional Test (throughput + loss convergence)
Verify data loading via Energon and confirm that images are processed correctly, throughput holds up, and loss converges. (Requires balanced data samples for the vision encoder.)
Functional test checklist (per dataset)
Change notes:
- Adding proper support for video
- Adding control and processing logic for max_num_images, max_num_frames, max_visual_tokens
- Fixing an issue where sequence truncation cut off visual tokens (e.g., when running with PP>1, the sequence length needs to be constant)
- Adding a wrapper, QwenVLEnergonProvider, to expose VLM-specific knobs on the CLI
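The truncation issue above can be illustrated with a minimal sketch. It is not the actual fix in the pipeline: the token ids and the function name are hypothetical, and it assumes each image is represented by a contiguous run of placeholder tokens. The idea is to keep the output at a fixed length while never cutting through the middle of a visual-token run.

```python
from typing import List

IMAGE_TOKEN_ID = -200  # hypothetical placeholder id for one visual token
PAD_TOKEN_ID = 0       # hypothetical padding id


def truncate_keep_visual_runs(tokens: List[int], seq_length: int) -> List[int]:
    """Return exactly seq_length tokens without splitting a run of visual tokens."""
    cut = min(len(tokens), seq_length)
    # If the cut lands inside a run of visual tokens, move it back to the start
    # of the run so the image is dropped whole rather than partially truncated.
    if cut < len(tokens) and tokens[cut] == IMAGE_TOKEN_ID:
        while cut > 0 and tokens[cut - 1] == IMAGE_TOKEN_ID:
            cut -= 1
    kept = tokens[:cut]
    # Pad back to the fixed length so shapes stay constant (e.g., for PP>1).
    return kept + [PAD_TOKEN_ID] * (seq_length - len(kept))
```

A caller would also need to drop the pixel data of any image whose tokens were removed. If two images sit back to back with no separator token, this sketch drops both, which is conservative.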
=> General setup:
- max_pixels: upper bound on media height/width; media exceeding it are resized
- max_num_images, max_num_frames: upper bound on the number of media items; dropping when exceeded
- max_visual_tokens: upper bound on the number of media tokens; dropping when exceeded
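How these three knobs might interact can be sketched as follows. This is an illustration under stated assumptions, not the QwenVLEnergonProvider implementation: it treats max_pixels as a cap on pixel area (height x width), estimates visual tokens as one token per 14x14 patch, and drops trailing images when a cap is exceeded. All function names and the patch size are hypothetical.

```python
import math
from typing import List, Tuple

PATCH = 14  # hypothetical patch size, used only to estimate visual tokens


def fit_max_pixels(height: int, width: int, max_pixels: int) -> Tuple[int, int]:
    """Downscale (height, width) so the area fits max_pixels, keeping aspect ratio."""
    if height * width <= max_pixels:
        return height, width
    scale = math.sqrt(max_pixels / (height * width))
    return max(1, math.floor(height * scale)), max(1, math.floor(width * scale))


def visual_tokens(height: int, width: int, patch: int = PATCH) -> int:
    """Rough estimate: one token per patch x patch tile (an assumption)."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def cap_media(
    sizes: List[Tuple[int, int]],
    max_pixels: int,
    max_num_images: int,
    max_visual_tokens: int,
) -> List[Tuple[int, int]]:
    """Resize each image to max_pixels, then drop trailing images over either cap."""
    kept: List[Tuple[int, int]] = []
    used = 0
    for height, width in sizes[:max_num_images]:  # cap on the number of images
        h, w = fit_max_pixels(height, width, max_pixels)
        tokens = visual_tokens(h, w)
        if used + tokens > max_visual_tokens:
            break  # drop this image and everything after it
        kept.append((h, w))
        used += tokens
    return kept
```

Whether the real pipeline drops the excess media or the whole sample is not stated above; this sketch drops only the excess media.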
Records:
- Wandb convergence runs for 8b:
- Wandb benchmarking runs for 30b and 235b:
Phase 2 — Benchmark + Perf Optimization
After functional tests pass, move to performance reproduction and benchmark accuracy.
Dataset Summary
| Dataset | Category | HF Link | Size |
|---|---|---|---|
| TIGER-Lab/Mantis-Instruct | Multi-image interleaved | HF | 462 MB, 1M rows |
| Ahren09/InfoVQA | High-res image | HF | 2 GB, 30K rows |
| lmms-lab/LLaVA-Video-178K | Video | HF | 645 MB annotations + videos, 1.6M rows |
| lmms-lab/LLaVA-NeXT-Interleave-Bench | Interleaved benchmark | HF | 17 GB, 39K rows |
| HuggingFaceFV/finevideo | Long video + audio | HF | Large |
Context
- All benchmarking and validation should be conducted within an Energon-based setup.
- Users specifically care about samples with 10+ image paths per sample, which expose encoder bottlenecks.
MP-DocVQA and MINT-1T-PDF were considered but excluded since Mantis-Instruct and LLaVA-NeXT-Interleave-Bench already cover 10+ image interleaved samples.
Phase 1 datasets:
- TIGER-Lab/Mantis-Instruct: interleaved multi-image (10+ images/sample), throughput verification + Energon image processing
- Ahren09/InfoVQA: high-res infographic images, throughput + loss convergence
- lmms-lab/LLaVA-Video-178K: video pipeline functional check
Functional test checklist (per dataset):
- datasets.load_dataset() loads successfully
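The throughput and loss-convergence parts of the functional test could be checked with small helpers like the ones below. This is a sketch under assumptions: the helper names, the window size, and the 10% relative-drop threshold are all hypothetical and not part of the pipeline. `batches` stands for any iterable of batches, such as a dataloader.

```python
import time
from itertools import islice
from typing import Iterable, Sequence


def measure_throughput(batches: Iterable[Sequence], max_batches: int = 100) -> float:
    """Return samples per second over at most max_batches batches."""
    start = time.perf_counter()
    samples = 0
    for batch in islice(batches, max_batches):
        samples += len(batch)
    elapsed = time.perf_counter() - start
    return samples / elapsed if elapsed > 0 else float("inf")


def loss_converging(
    losses: Sequence[float], window: int = 10, min_rel_drop: float = 0.1
) -> bool:
    """True if the mean of the last window is at least min_rel_drop below the first."""
    if len(losses) < 2 * window:
        return False  # not enough steps to compare two windows
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last <= first * (1 - min_rel_drop)
```

Comparing window means rather than single steps keeps the check from being thrown off by per-step noise in the loss curve.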
Phase 2 datasets:
- lmms-lab/LLaVA-NeXT-Interleave-Bench: perf reproduction for interleaved multi-image (17 GB, 39K rows)
- Ahren09/InfoVQA: (continued) benchmark accuracy for high-res
- lmms-lab/LLaVA-Video-178K: (continued) video perf benchmark
- HuggingFaceFV/finevideo: stretch goal, long video + audio
[Figure: histograms of the image dataset, which contains multiple sub-datasets with different resolutions, for reference]
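A resolution histogram of the kind referenced above could be computed along these lines. This is a minimal sketch: the function name and the megapixel bucketing are assumptions, and the input is simply a list of (height, width) pairs collected from the dataset by whatever means is available.

```python
from collections import Counter
from typing import Iterable, Tuple


def resolution_histogram(
    sizes: Iterable[Tuple[int, int]], bucket_megapixels: float = 1.0
) -> Counter:
    """Count images per bucket; key k covers [k*b, (k+1)*b) megapixels."""
    hist: Counter = Counter()
    for height, width in sizes:
        megapixels = height * width / 1_000_000
        hist[int(megapixels // bucket_megapixels)] += 1
    return hist
```

Running this per sub-dataset would show which ones contribute the largest images, and therefore where encoder bottlenecks are most likely to appear.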