A2A-MI: Any-to-Any Multimodal Intelligence

Any-to-Any Multimodal Intelligence aims to build a unified intelligent system capable of understanding, reasoning, and generating across arbitrary combinations of modalities. Unlike traditional multimodal models, which are typically limited to fixed input-output pairings such as text-to-image or image-to-text, the any-to-any framework emphasizes flexible modeling of scenarios that involve multiple inputs, multiple outputs, and interleaved modalities.
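
To make the contrast concrete, the sketch below shows a minimal interface for this setting. It is not any particular model's API; every name in it is hypothetical.

```python
# A minimal sketch (not any specific model's API) of what "any-to-any,
# interleaved" means in practice: the model consumes and produces an
# ordered sequence of (modality, payload) segments, so a single request
# and a single response can each mix several modalities.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Segment:
    modality: str  # e.g. "text", "image", "audio", "video"
    payload: Any   # raw text, pixel tensor, waveform, ...

class AnyToAnyModel:
    def generate(self, prompt: List[Segment],
                 target_modalities: List[str]) -> List[Segment]:
        """Map an interleaved multimodal prompt to an interleaved
        multimodal response. Fixed-pair models such as text-to-image
        are the special case of one input and one output modality."""
        raise NotImplementedError  # placeholder, no concrete model here

# An interleaved request: text + image in, text + audio out.
# request = [Segment("text", "Describe this scene, then narrate it:"),
#            Segment("image", scene_pixels)]
# response = model.generate(request, target_modalities=["text", "audio"])
```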

The core challenge is threefold: designing a unified architecture that supports joint modeling across modalities, learning to align semantics and control information flow within complex interleaved structures, and evaluating generalization and scalability in open, real-world environments. This direction therefore involves not only innovation in model architecture but also the systematic construction of data organization methods, task design paradigms, and evaluation frameworks. This page serves as a central hub for our representative models, benchmarks, workshops, and surveys.

Overview figure for Any-to-Any Multimodal Intelligence
Overview

1. Flagship models. Core model papers push toward unified any-input, any-output architectures and training paradigms.
2. Benchmarks. Evaluation work turns the vision into measurable tasks that stress flexible modality composition.
3. Workshops. These create a public venue for the field, clarifying open problems and bringing together adjacent communities.
4. Survey papers. Survey work will consolidate technical progress, taxonomy, and unresolved challenges across this direction.

Models
Architecture figure for NExT-GPT
Any2Any MLLM · ICML 2024 Oral

NExT-GPT: Any-to-any multimodal model

NExT-GPT advances a unified framework for multimodal understanding and generation across text, image, video, and audio. It supports interleaved multimodal prompting, cross-modal reasoning, and controllable output generation within a single coherent model design; a schematic of this general style of architecture is sketched below.

Unified architecture · Interleaved prompting · Cross-modal generation
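
As a complement to the description above, here is a minimal sketch of the encoder-backbone-decoder pattern that unified any-to-any models broadly follow. It is an assumption-laden schematic, not NExT-GPT's actual implementation, and all component names are hypothetical.

```python
# Illustrative sketch of the encoder-backbone-decoder pattern that
# unified any-to-any models broadly follow; this is NOT NExT-GPT's
# actual code, and every component name here is hypothetical.
import torch
import torch.nn as nn

class UnifiedAnyToAny(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int,
                 feat_dims: dict[str, int]):
        super().__init__()
        self.backbone = backbone  # shared sequence model (the "LLM")
        # Lightweight projections that align each modality's encoder
        # features with the backbone's embedding space.
        self.input_proj = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in feat_dims.items()})
        # Projections from backbone states to per-modality decoder
        # conditioning (e.g. for a diffusion decoder).
        self.output_proj = nn.ModuleDict(
            {m: nn.Linear(d_model, d) for m, d in feat_dims.items()})

    def forward(self, segments: list[tuple[str, torch.Tensor]],
                target_modality: str) -> torch.Tensor:
        # Project each (modality, features) segment and concatenate into
        # one interleaved sequence of shape (batch, total_len, d_model).
        tokens = torch.cat(
            [self.input_proj[m](x) for m, x in segments], dim=1)
        hidden = self.backbone(tokens)
        # Condition the requested modality's decoder on the hidden states.
        return self.output_proj[target_modality](hidden)

# e.g. backbone = nn.TransformerEncoder(
#     nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
#     num_layers=2)
# model = UnifiedAnyToAny(backbone, d_model=512,
#                         feat_dims={"text": 512, "image": 1024, "audio": 768})
```

In this family of designs, the lightweight projections are often the main trainable pieces while the backbone and the heavy per-modality encoders and decoders stay frozen, which keeps training tractable as the number of modality combinations grows.
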
Benchmarks
Overview figure for UniM benchmark
Benchmark · CVPR 2026

UniM: Unified any-to-any interleaved multimodal benchmark

UniM is designed as a unified benchmark for any-to-any interleaved multimodal intelligence, with 31K high-quality instances spanning 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D. It targets intertwined reasoning and generation demands that go beyond conventional multimodal evaluation; an illustrative instance layout is sketched below.

31K instances · 7 modalities · Interleaved evaluation
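
For illustration only, here is one plausible way to represent an interleaved evaluation instance. The field names and file paths are hypothetical, not UniM's published schema.

```python
# Hypothetical sketch of an interleaved benchmark instance: inputs and
# expected outputs are ordered, modality-tagged segment lists, so a
# single instance can demand reasoning over one mix of modalities and
# generation into another. All field names and paths are illustrative.
instance = {
    "id": "unim-000123",
    "domain": "instructional-video",  # one of the 30 domains
    "input": [                        # interleaved prompt
        {"modality": "text",  "content": "Summarize the clip, then narrate it."},
        {"modality": "video", "content": "clips/000123.mp4"},
    ],
    "expected_output": [              # interleaved target
        {"modality": "text",  "content": "The clip demonstrates ..."},
        {"modality": "audio", "content": "refs/000123_narration.wav"},
    ],
}

# Scoring would walk the aligned output segments and apply a
# per-modality metric (text overlap, audio/image similarity, etc.).
```
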
Workshops
Poster-like visual for the A2A-MI workshop
Workshop · CVPR 2026

A2A-MI: The first workshop in this series

The Any-to-Any Multimodal Intelligence workshop centers on the next stage of multimodal intelligence: systems that can understand, align, transform, and generate across arbitrary modality sets. It provides a focal point for discussing architecture, training, data, evaluation, and real-world deployment challenges.

Community building · Open problems · Field shaping
Surveys
In preparation · A2A-MI Survey

A forthcoming synthesis of methods, benchmarks, workshop themes, and unresolved questions across the A2A-MI landscape.

Survey paper · Coming soon

Systematic landscape review: Taxonomy, methods, and open challenges

This module is reserved for the survey paper, which will tie the thread together at a higher level: what counts as any-to-any multimodal intelligence, how existing approaches differ, which evaluation settings are still missing, and where the most important opportunities lie.

Taxonomy · Method landscape · Open problems
Link will be added when available
Community curated · Awesome Any2Any

A living repository that organizes the broader any-to-any landscape, including datasets, paper collections, workshops, surveys, tools, and ongoing community updates.

Datasets · Papers · Surveys · Tools
Community resource · GitHub repository

Awesome Any2Any: Curated resources for the A2A-MI landscape

This repository systematizes Any-to-Any Generation and adjacent A2A-MI research through a continuously maintained collection of resources: news, datasets, paper lists, and tools.

Paper collections · Datasets and tools · Community updates