EmboSceneExplorer is a multimodal scene perception, understanding, and navigation system built on the Habitat simulation environment. It enables embodied AI agents to perform 3D perception and reconstruction, LLM-based grounding, and goal-oriented navigation within virtual 3D scenes (e.g., scenes from ScanNet).
The system runs on the Habitat 3 simulation platform and implements a complete pipeline spanning scene perception, semantic understanding, localization, and navigation. It constructs 3D scene meshes, point clouds, and 3D Gaussian Splatting (3DGS) representations from multi-view RGB(D) images or video inputs, and achieves precise target localization and navigation by combining spatial occupancy reasoning with language instruction parsing.
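To make the idea of spatial occupancy reasoning concrete, here is a minimal, self-contained sketch. It is not the project's implementation: the function names `occupancy_grid` and `shortest_path`, the z-up axis convention, the cell size, and the height thresholds are all assumptions chosen for illustration. It projects a reconstructed point cloud onto a 2D grid of occupied cells and then plans a route to a goal cell with breadth-first search.

```python
from collections import deque


def occupancy_grid(points, cell=0.5, min_h=0.1, max_h=1.8):
    """Project 3D points (x, y, z), with z assumed up, onto occupied 2D cells."""
    occupied = set()
    for x, y, z in points:
        # Keep only points at obstacle height; floor and ceiling are ignored.
        if min_h <= z <= max_h:
            occupied.add((int(x // cell), int(y // cell)))
    return occupied


def shortest_path(occupied, start, goal, bounds):
    """Breadth-first search over free cells of a 4-connected grid.

    bounds is ((x_min, x_max), (y_min, y_max)) in inclusive cell indices.
    Returns the list of cells from start to goal, or None if unreachable.
    """
    (x_min, x_max), (y_min, y_max) = bounds
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        cx, cy = cur
        for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            nx, ny = nxt
            in_bounds = x_min <= nx <= x_max and y_min <= ny <= y_max
            if in_bounds and nxt not in occupied and nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None


# Hypothetical toy scene: a short wall plus one floor point that is filtered out.
wall = [(1.2, 0.25 + 0.5 * i, 1.0) for i in range(3)]
floor = [(3.0, 0.2, 0.0)]
occupied = occupancy_grid(wall + floor)
path = shortest_path(occupied, start=(0, 0), goal=(4, 0), bounds=((0, 4), (0, 3)))
```

In this toy scene the wall blocks the direct route, so the planned path detours around its open end. A real pipeline would likely replace the uniform grid and breadth-first search with a navigation mesh or a cost-aware planner, and would take the goal cell from the language grounding step rather than hard-coding it.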
User-captured videos are also supported as input: 3D point cloud reconstruction and language annotation are completed by automated algorithms, which significantly reduces data production costs. In addition, the system can simulate dynamic character behavior in a scene from user text instructions, and it offers preliminary multi-agent interaction and language alignment capabilities.