Authors: Hongyu Li*, Lingfeng Sun*, Yafei Hu, Duy Ta, Jennifer Barry, George Konidaris, Jiahui Fu
Affiliations: Robotics and AI Institute, Brown University
*Equal contribution
NovaFlow enables robots to execute novel manipulation tasks in a zero-shot manner without any demonstrations or embodiment-specific training. Given a natural language task description, NovaFlow autonomously synthesizes a video using state-of-the-art video generation models and distills it into 3D actionable object flow. This flow is then converted into precise robot actions through grasp proposals and trajectory optimization, enabling seamless transfer across different robotic platforms.
- Zero-Shot Manipulation: Execute novel tasks without demonstrations or training
- Multi-Embodiment Transfer: Naturally transfers across different robots (Franka arm, Spot quadruped)
- Object Agnostic: Handles rigid, articulated, and deformable objects
- Language-to-Action: Converts natural language task descriptions into precise robot trajectories
- Actionable Flow: Distills generated videos into 3D object motion plans
- Robust Execution: Grasp proposal + trajectory optimization for reliable manipulation
Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training.
- Hardware: A multi-GPU setup (H100/A100 recommended) for the Wan2.1 video generation pipeline. For GPUs below the A100/H100 class, the Veo model is recommended; it requires only a single gaming GPU.
- Software: Python 3.8+, Docker (recommended)
- API Keys: `GOOGLE_API_KEY` (required if using the Veo model)
- Robots: Franka Panda arm or Boston Dynamics Spot (for physical execution)

To install and run NovaFlow:

- Clone the repository:

  ```bash
  git clone https://github.com/bdaiinstitute/NovaFlow.git
  cd NovaFlow
  ```

  The dependency repos (`tapip3d`, `grounded_sam_2`, `wan2.1`) are vendored under `server/`.

- Build and enter Docker. You can pull our prebuilt Docker image:

  ```bash
  docker pull lhy0807/novaflow
  docker tag lhy0807/novaflow novaflow
  ```

  or build it yourself:

  ```bash
  cd server/docker
  docker build -t novaflow .
  cd ../..
  # Run the container with the repo mounted
  docker run -it --gpus all -v $(pwd):/workspace novaflow bash
  ```

- Download model weights (inside Docker):

  ```bash
  cd /workspace/server
  ./download_weights.sh
  ```

- Start the server (inside Docker). To use prompt extension, set `GOOGLE_API_KEY` to your Google API key.

  Using Wan (default, requires A100/H100):

  ```bash
  cd /workspace/server
  ./start_ray_server.sh
  ```

  Using Veo (recommended for GPUs below A100/H100):

  ```bash
  export GOOGLE_API_KEY="your_api_key_here"
  cd /workspace/server
  ./start_ray_server.sh --model veo
  ```

- Run your first job (from a separate terminal on the host).

  Using Wan (default):

  ```bash
  cd client
  python submit_jobs.py --num-jobs 1 --base-seed 42
  ```

  Using Veo (recommended for GPUs below A100/H100):

  ```bash
  cd client
  python submit_jobs.py --num-jobs 1 --base-seed 42 --use-veo
  ```
NovaFlow operates through two main pipelines that convert language instructions into robot actions:
Converts task descriptions into 3D actionable object flow:
- Video Generation: Synthesizes plausible object motion videos using a state-of-the-art video model (Wan2.1)
- 3D Lifting: Converts 2D video to 3D using monocular depth estimation
- Depth Calibration: Calibrates estimated depth against initial observations
- Point Tracking: Tracks dense per-point motion using 3D point tracking (TAPIP3D)
- Object Grounding: Extracts object-centric 3D flow via segmentation (Grounded SAM 2)
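The depth-calibration step above aligns monocular depth estimates, which are only defined up to an affine ambiguity, with metric depths from the initial observation. A minimal sketch of one standard way to do this (our own illustration with a hypothetical `calibrate_depth` helper, not NovaFlow's actual code): fit a global scale and shift by closed-form least squares over pixels where both depths are valid.

```python
def calibrate_depth(d_est, d_obs):
    """Fit scale s and shift t minimizing sum((s * d_est + t - d_obs)^2).

    d_est: monocular depth estimates at sampled pixels (floats).
    d_obs: metric depths at the same pixels from the initial RGB-D frame.
    Returns (s, t). Illustrative helper only.
    """
    n = len(d_est)
    mean_e = sum(d_est) / n
    mean_o = sum(d_obs) / n
    # Closed-form simple linear regression of d_obs on d_est.
    var_e = sum((d - mean_e) ** 2 for d in d_est)
    cov = sum((de - mean_e) * (do - mean_o) for de, do in zip(d_est, d_obs))
    s = cov / var_e
    t = mean_o - s * mean_e
    return s, t

# Example: estimated depths differ from metric by scale 0.5 and shift 0.1.
est = [1.0, 2.0, 3.0, 4.0]
obs = [0.5 * d + 0.1 for d in est]
s, t = calibrate_depth(est, obs)  # recovers s = 0.5, t = 0.1
```

Once `(s, t)` are known, the whole estimated depth video can be rescaled so the lifted 3D flow lives in the camera's metric frame.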
Converts 3D flow into precise robot trajectories:
- Grasp Proposal: Determines initial end-effector poses from grasp candidates
- Trajectory Planning: Plans robot trajectories based on actionable flow with cost/constraint optimization
- Motion Execution: Tracks planned trajectories on physical robots (Franka/Spot)
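For rigid objects, the relative pose between two timesteps of the object flow can be recovered from corresponding 3D points with the classic Kabsch/Procrustes fit. A minimal NumPy sketch of that standard algorithm (our own illustration, not NovaFlow's implementation):

```python
import numpy as np

def relative_pose(p0, p1):
    """Least-squares rigid transform (R, t) such that p1 ~= R @ p0 + t.

    p0, p1: (N, 3) arrays of corresponding 3D flow points at two timesteps.
    Standard Kabsch algorithm via SVD of the cross-covariance matrix.
    """
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)
    H = (p0 - c0).T @ (p1 - c1)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t

# Example: points rotated 90 degrees about z and translated.
rng = np.random.default_rng(0)
p0 = rng.standard_normal((50, 3))
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.3])
p1 = p0 @ R_true.T + t_true
R, t = relative_pose(p0, p1)  # recovers R_true, t_true
```

Chaining these per-step transforms yields the object's pose trajectory, which the grasp proposal and trajectory optimizer then realize as end-effector motion.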
- Rigid Objects: Cup placement, block insertion, mug hanging
- Articulated Objects: Drawer opening, lid lifting
- Deformable Objects: Rope straightening, plant watering
- Video Generation: Wan2.1 model integration for task video synthesis
- Depth Estimation: Monocular depth lifting for 3D reconstruction
- Depth Calibration: Calibration against initial depth observations
- Point Tracking: TAPIP3D integration for dense 3D motion tracking
- Object Grounding: Grounded SAM 2 for object-centric flow extraction
- Flow Visualization: 3D motion flow visualization and analysis
- Grasp Planning: Candidate grasp pose generation and selection
- Trajectory Optimization: Motion planning with constraints and costs
- Execution: Real-time trajectory tracking on physical robots
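The trajectory-optimization step trades off tracking the actionable flow against motion smoothness. A toy 1-D sketch of that trade-off (our own illustration; the actual optimizer works over full 6-DoF poses with additional constraints): gradient descent on a quadratic cost that pulls waypoints toward flow targets while penalizing large jumps between neighbors.

```python
def optimize_trajectory(targets, smooth_weight=1.0, lr=0.05, iters=1000):
    """Minimize sum_i (x_i - target_i)^2 + w * sum_i (x_{i+1} - x_i)^2.

    targets: desired 1-D waypoint positions derived from the flow.
    Returns smoothed waypoints. Toy cost; a real planner would add
    collision, joint-limit, and grasp constraints.
    """
    x = list(targets)  # initialize at the flow targets
    n = len(x)
    for _ in range(iters):
        # Gradient of the tracking term.
        grad = [2.0 * (x[i] - targets[i]) for i in range(n)]
        # Gradient of the smoothness term.
        for i in range(n - 1):
            d = 2.0 * smooth_weight * (x[i + 1] - x[i])
            grad[i] -= d
            grad[i + 1] += d
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# A jittery flow target sequence gets pulled toward a smoother path.
targets = [0.0, 1.2, 1.8, 3.1, 4.0]
traj = optimize_trajectory(targets, smooth_weight=2.0)
```

Because the cost is quadratic, gradient descent with a small step size converges to the unique optimum; heavier `smooth_weight` yields smoother but less faithful tracking.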
This project is licensed under the RAI License - see the LICENSE file for details.
NovaFlow builds upon several outstanding research projects and open-source implementations:
- Wan2.1: Video generation models
- TAPIP3D: 3D point tracking
- Grounded SAM 2: Object segmentation and tracking
- Mega-SAM: Lifting 2D video to 3D via depth estimation
- GraspGen: Grasp planning and generation
- Ray: Distributed computing framework
If you find NovaFlow useful in your research, please cite our paper:
```bibtex
@article{li2025novaflow,
  title={Novaflow: Zero-shot manipulation via actionable flow from generated videos},
  author={Li, Hongyu and Sun, Lingfeng and Hu, Yafei and Ta, Duy and Barry, Jennifer and Konidaris, George and Fu, Jiahui},
  journal={arXiv preprint arXiv:2510.08568},
  year={2025}
}
```