In real-world multimodal applications, systems must comprehend arbitrarily combined and interleaved multimodal inputs from users while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces UniM, the first Unified Any-to-Any Interleaved Multimodal benchmark. UniM contains 31K high-quality instances spanning 30 domains and 7 representative modalities (text, image, audio, video, document, code, and 3D), with each instance requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence.
Table abbreviations: Inter. Comb. = interleaved combinations of modalities; Cap. per Instance = capabilities required per instance; Difficulty Tax. = difficulty taxonomy.
Construction pipeline. We first collect multimodal data from three main sources: curated samples from public datasets, real-world multimedia content from social media (vlogs, posts), and open web resources (forums, websites). We then manually design interleaved combinations tailored to different modalities, which serve as templates for subsequent construction. When constructing QA pairs, we design task types and template instances to ensure task diversity and semantic validity. Building on these, GPT-5-mini is employed to generate additional candidate instances for data expansion, simulating any-to-any multimodal scenarios that are hard to retrieve directly from the Internet.
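As a concrete illustration, an interleaved instance built from such templates might look like the sketch below. The schema, field names, asset filenames, and the `<modality>` placeholder syntax are all our own illustrative assumptions, not the dataset's actual format:

```python
import re

# Hypothetical UniM-style instance: interleaved text with modality
# placeholder tags that reference attached assets. All field names,
# filenames, and the <modality> tag syntax are illustrative assumptions.
instance = {
    "domain": "travel",
    "question": (
        "Watch this vlog clip <video> and listen to the narration <audio>. "
        "Write a short trip summary and design a matching poster."
    ),
    "answer": (
        "Here is a summary of the trip... "
        "And a poster capturing its highlights: <image>"
    ),
    "assets": {
        "video": ["clip_001.mp4"],
        "audio": ["narration_001.wav"],
        "image": ["poster_001.png"],
    },
}

# Sanity check: every placeholder tag in the text has a matching asset list.
tags = re.findall(
    r"<(video|audio|image|document|code|3d)>",
    instance["question"] + instance["answer"],
)
assert all(t in instance["assets"] for t in tags)
```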
Quality Control. We adopt a two-phase quality control process. First, all QA pairs are manually reviewed and revised as needed to ensure that the modality placeholder tags comply with task specifications and that the content remains logically consistent. Second, an independent checking pass is conducted, in which reviewers carefully examine each completed sample to further ensure the reliability and quality of the dataset.
Data construction pipeline and quality control overview.
Semantic Correctness & Generation Quality. Semantic Correctness (SC) measures how well the generated output semantically aligns with the reference answer. To ensure fair evaluation across modalities with varying instruction-following capabilities, we convert all modality outputs into comparable caption-like textual representations and employ an LLM-as-a-Judge strategy for measurement. Generation Quality (GQ) evaluates the perceptual quality and structural coherence of generated content; accordingly, we design modality-specific no-reference quality assessment methods to obtain unified, comparable quality metrics across multimodal scenarios. Finally, we combine SC and GQ into a Semantic–Quality Coupled Score (SQCS) that reflects overall performance.
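A minimal sketch of how SC and GQ might be coupled into SQCS. The geometric-mean coupling below is our own assumption, chosen so that a low score on either axis drags down the composite; the paper's exact formula may differ:

```python
def sqcs(sc: float, gq: float) -> float:
    """Couple Semantic Correctness (sc) and Generation Quality (gq),
    both in [0, 1], into a single score. The geometric mean is an
    illustrative assumption, not the benchmark's exact formula."""
    assert 0.0 <= sc <= 1.0 and 0.0 <= gq <= 1.0
    # A response that is semantically right but perceptually broken
    # (or vice versa) scores low under this coupling.
    return (sc * gq) ** 0.5
```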
Response Structure Integrity. Response Structure Integrity evaluates whether a model adheres to task-defined structural requirements on modality types and item quantities, regardless of semantic or logical correctness. Technically, we break it into two branches. Strict Structure Score (StS) measures strict structural consistency: the types and quantities of modalities generated in the model's response must precisely match those in the ground truth, and any missing or redundant modalities, or discrepancies in the number of modality placeholder tags, are explicitly penalized. Lenient Structure Score (LeS) measures coverage at the modality level: it assesses whether the modality types generated in the model's response are consistent with those in the ground truth.
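Unlike the judge-based metrics, both structure scores can be computed deterministically from placeholder tags. A minimal sketch, assuming `<modality>`-style tags and a simple proportional penalty for StS (the tag syntax and the exact penalty function are our assumptions):

```python
import re
from collections import Counter

# Assumed placeholder syntax: <image>, <audio>, <video>, <document>,
# <code>, <3d>. The benchmark's real tag format may differ.
TAG = re.compile(r"<(image|audio|video|document|code|3d)>")

def tag_counts(text: str) -> Counter:
    """Count modality placeholder tags per modality."""
    return Counter(TAG.findall(text.lower()))

def strict_structure_score(pred: str, ref: str) -> float:
    """StS: 1.0 only when tag types and counts match exactly; otherwise
    missing/redundant tags are penalized proportionally (illustrative)."""
    p, r = tag_counts(pred), tag_counts(ref)
    if p == r:
        return 1.0
    diff = sum((p - r).values()) + sum((r - p).values())
    total = sum(p.values()) + sum(r.values())
    return max(0.0, 1.0 - diff / total) if total else 1.0

def lenient_structure_score(pred: str, ref: str) -> float:
    """LeS: fraction of ground-truth modality *types* covered by the
    prediction, ignoring tag counts."""
    p, r = set(tag_counts(pred)), set(tag_counts(ref))
    return len(p & r) / len(r) if r else 1.0
```

For example, a response with two `<image>` tags against a ground truth of one `<image>` and one `<audio>` is penalized by StS for both the redundant image and the missing audio, while LeS credits it for covering one of the two required modality types.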
Interleaved Coherence. Interleaved Coherence evaluates a model's ability to maintain logical connectivity and expressive coordination during multimodal integration. It is measured by Holistic Coherence (HC), which focuses on cross-modal semantic and structural consistency, and Stylistic Harmony (SH), which evaluates consistency in writing style, tone, and visual aesthetics. We adopt LLM-as-a-Judge to quantify HC and SH, and combine them into a composite metric, the Interleaved Coherence Score (ICS).
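Since HC and SH come from an LLM judge, only their composition is mechanical. A sketch assuming judge scores in [0, 1] and an equal-weight average for ICS; both the rubric wording and the weighting below are our illustrative assumptions:

```python
# Illustrative judge rubric; the actual prompt used to elicit HC and SH
# scores is not specified in this section.
JUDGE_RUBRIC = (
    "Rate the response on two axes, each from 0 to 1:\n"
    "HC (Holistic Coherence): cross-modal semantic and structural consistency.\n"
    "SH (Stylistic Harmony): consistency of writing style, tone, and "
    "visual aesthetics."
)

def ics(hc: float, sh: float, w: float = 0.5) -> float:
    """Interleaved Coherence Score as a weighted mean of HC and SH.
    Equal weighting (w=0.5) is an illustrative assumption."""
    assert 0.0 <= hc <= 1.0 and 0.0 <= sh <= 1.0
    return w * hc + (1.0 - w) * sh
```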
UniMA architecture.
| Rank | Model |
|---|---|
| 🥇 | UniMA |
| 🥈 | MIO |
| 🥉 | AnyGPT |
| 4 | NExT-GPT |
@article{li2026unim,
title={UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark},
author={Li, Yanlin and Guo, Minghui and Zhang, Kaiwen and Zhang, Shize and Zhao, Yiran and Li, Haodong and Zhou, Congyue and Zheng, Weijie and Yan, Yushen and Wu, Shengqiong and others},
journal={arXiv preprint arXiv:2603.05075},
year={2026}
}