Yuxuan Wang's Homepage

Yuxuan Wang

flagwyx [at] gmail.com

I am currently a research engineer at the Qwen team, Alibaba Inc. I obtained my Master's degree from Peking University, under the supervision of Dongyan Zhao. I have had the wonderful experience of working with Zilong Zheng @ BIGAI, Cihang Xie @ UCSC, and Alan L. Yuille @ JHU. My current work primarily focuses on omni-LMs. I am especially interested in studies that offer novel insights and impactful applications.

I am looking for interns to work on omni-LM and open-world modeling research. Please feel free to contact me!

Scholar • GitHub • CV

Qwen3-VL Technical Report
Qwen Team (core contributor)
PDF | Code | Qwen Chat | Cite

Qwen3-Omni Technical Report
Qwen Team (core contributor)
PDF | Code | Qwen Chat | Cite

VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
Yuxuan Wang*, Yiqi Song*, Cihang Xie, Yang Liu, Zilong Zheng
ICCV 2025 | PDF | Code | Homepage | Cite

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng
EMNLP 2024 | PDF | Code & Demo | Cite

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
COLM 2024 | PDF | Code | Cite

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao
ACL 2023 | PDF | Code | Homepage | Cite

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao
EMNLP 2025 Findings | PDF | Code | Cite

The AI Hippocampus: How Far are We From Human Memory?
Zixia Jia*, Jiaqi Li*, Yipeng Kang*, Yuxuan Wang*, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu
TMLR 2025 | PDF | Code | Cite

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng
PDF | Code | Homepage | Cite

HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
PDF | Code | Cite

Open-Omni-Nexus

A fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.

Multimodal Needle In A Video Haystack

Pressure-testing large video-language models (LVLMs): running multimodal retrieval against LVLMs at various video lengths to measure accuracy.

Streaming Grounded SAM 2

Grounded SAM 2 for streaming video tracking using natural language queries.

Colorful Multimodal Research

Recent advances driven by large language models (LLMs) across a range of domains, including vision, audio, agents, robotics, and fundamental sciences such as mathematics.

Language Modeling Research Hub

A comprehensive compendium for enthusiasts and scholars delving into the fascinating realm of language models (LMs), with a particular focus on large language models (LLMs).

Multimodal Memory Research

A reading list of memory-augmented multimodal research, including multimodal context modeling, memory in vision and robotics, and external memory/knowledge-augmented MLLMs.

Reviewer: ARR 2023-Present (Great Review Mention), CVPR 2024
Area Chair: ARR 2024-Present
Organizer: NLPCC 2022 Shared Task 4, NLPCC 2023 Shared Task 10