MeshMimic. In-the-wild monocular videos yield long-horizon motions over complex terrains
for contact-consistent motion–terrain interaction learning.
Each result has three panels: the left is footage captured with a consumer monocular RGB camera, without motion-capture (MoCap) assistance; the middle is the reconstructed scene with the human SMPL-X model; and the right is the deployed result. Most videos show contact-rich interactions with the environment.
These real-sim results are reconstructed from in-the-wild monocular videos, with the original video on the left and the reconstruction on the right. Most examples involve long-horizon, contact-rich interactions in challenging real-world environments.
@misc{zhang2026meshmimic,
  title         = {MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction},
  author        = {Zhang, Qiang and Ma, Jiahao and Liu, Peiran and Shi, Shuai and Su, Zeran and Wang, Zifan and Sun, Jingkai and Cui, Wei and Yu, Jialin and Han, Gang and Zhao, Wen and Sun, Pihai and Yin, Kangning and Wang, Jiaxu and Cao, Jiahang and Zhang, Lingfeng and Cheng, Hao and Hao, Xiaoshuai and Ji, Yiding and Liang, Junwei and Tang, Jian and Xu, Renjing and Guo, Yijie},
  year          = {2026},
  eprint        = {2602.15733},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2602.15733}
}