EmboSceneExplorer: Embodied Scene Explorer for Multimodal Perception and Navigation

A Comprehensive Framework for Embodied AI

What is EmboSceneExplorer?

EmboSceneExplorer is a multimodal system for scene perception, understanding, and navigation, built on the Habitat simulation environment. It enables embodied AI agents to perform 3D perception and reconstruction, LLM-based grounding, and goal-oriented navigation within virtual 3D scenes (e.g., scenes from ScanNet).

The system runs on the Habitat 3 simulation platform and implements a complete pipeline that goes from scene perception through semantic understanding and localization to navigation. It constructs 3D scene meshes, point clouds, and 3D Gaussian Splatting (3DGS) representations from multi-view RGB(D) images or video inputs, and it localizes and navigates to targets in a scene through spatial occupancy reasoning and language instruction parsing.

User-captured videos are supported as input: automated algorithms perform the 3D point cloud reconstruction and language annotation, which significantly reduces data production costs. The system can also drive the behavior of dynamic characters in a scene from user text instructions, with preliminary support for multi-agent interaction and language alignment.

Core Components

The EmboSceneExplorer workflow comprises four core components that work together to enable comprehensive embodied AI capabilities:

🎥 Multimodal Data Collection

Captures comprehensive multimodal data including:

RGB image sequences - High-quality visual data from multiple viewpoints
Depth maps and semantic segmentation maps - 3D spatial understanding
COLMAP-style camera intrinsics and extrinsics - Supporting 3D Gaussian Splatting training

🏗️ Scene Reconstruction

Builds comprehensive multimodal scene representations including:

Dense point clouds - Detailed 3D geometric representation
High-fidelity meshes - Accurate surface reconstruction
Occupancy grid maps (Occ) - Spatial navigation planning
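
As a rough sketch of how an occupancy grid can be derived from a point cloud, the snippet below projects points onto a 2D grid and marks a cell as occupied when it contains a point inside a chosen height band. The function name, the `(x, y, height)` point layout, and the default thresholds are hypothetical choices made for illustration. The project's actual occupancy construction may differ.

```python
import math


def build_occupancy_grid(points, cell_size=0.1, min_height=0.1, max_height=1.5):
    """Project 3D points onto a 2D grid of 0 (free) and 1 (occupied) cells.

    points: list of (x, y, height) tuples; the third value is assumed to be
            the height above the floor.
    Returns (grid, origin), where grid[row][col] indexes (y, x) and origin is
    the world (x, y) of the corner of cell (0, 0).
    """
    if not points:
        raise ValueError("points must not be empty")

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    cols = int(math.floor((max(xs) - x0) / cell_size)) + 1
    rows = int(math.floor((max(ys) - y0) / cell_size)) + 1

    grid = [[0] * cols for _ in range(rows)]
    for x, y, h in points:
        # Points below min_height count as floor, points above max_height
        # are assumed to be out of the agent's way (e.g. ceiling).
        if min_height <= h <= max_height:
            col = int(math.floor((x - x0) / cell_size))
            row = int(math.floor((y - y0) / cell_size))
            grid[row][col] = 1
    return grid, (x0, y0)
```

The height band is the main design choice here: it keeps floor and ceiling points from blocking cells, at the cost of ignoring low obstacles below `min_height`.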

🎯 3D Visual Grounding

Bridging language and spatial understanding with state-of-the-art performance:

Natural language parsing - Supporting both English and Chinese instructions
Semantic concept grounding - Precise 3D location mapping
Point-cloud-level accuracy - Localizing the objects referred to by textual queries at the point level

🧭 Autonomous Navigation

Integrates scene representations for intelligent navigation:

Navigable occupancy maps - Safe path planning
Optimal collision-free paths - Efficient route optimization
Exploration and goal-reaching behaviors - Adaptive navigation strategies
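
To make the idea of collision-free planning on an occupancy map concrete, here is a minimal sketch of A* search on a 4-connected grid with unit step costs. A* is used as a generic, well-known technique. The source does not state which planner EmboSceneExplorer uses, and the function name and grid layout (0 = free, 1 = occupied) are assumptions.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (occupied) cells.

    start and goal are (row, col) tuples. Returns the list of cells from
    start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}

    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if cost > best_cost[current]:
            continue  # stale heap entry, a cheaper route was found already
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = current
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(grid, (0, 0), (2, 0)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```

In practice the occupancy map is usually inflated by the agent's radius before planning, so that paths keep a safety margin from obstacles. That step is not shown here.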

Key Features

🌐 Complete Closed-Loop System

Implements a complete pipeline from perception to action, providing a solid foundation for more complex embodied intelligence systems.

📹 User Video Support

Supports user-captured videos as input, significantly reducing data production costs through automated algorithms.

🤖 Multi-Agent Interaction

Features preliminary multi-agent interaction and language alignment capabilities for dynamic character behavior simulation.

🎯 State-of-the-Art Grounding

Achieves state-of-the-art performance in 3D visual grounding across multiple metrics and datasets.

🌍 Real-World Validation

Validated on multiple public real-world scene datasets, demonstrating excellent generalization and scalability.

🗣️ Bilingual Support

Supports natural language instructions in both English and Chinese for broader accessibility.

Citation

If you find our work useful, please consider citing:

@misc{embosceneexplorer2024,
  title={EmboSceneExplorer: Embodied Scene Explorer for Multimodal Perception and Navigation},
  author={Ao Gao and Luosong Guo and Chaoyang Li and Jiangming Shi and Zilong Xie and Jingyu Gong and Xin Tan and Zhizhong Zhang and Yuan Xie},
  year={2025},
  url={https://github.com/ECNU-AILab-SII/EmboSceneExplorer}
}
