Authors: Hongyu Li*, Lingfeng Sun*, Yafei Hu, Duy Ta, Jennifer Barry, George Konidaris, Jiahui Fu
Affiliations: Robotics and AI Institute, Brown University
*Equal contribution
NovaFlow enables robots to execute novel manipulation tasks in a zero-shot manner without any demonstrations or embodiment-specific training. Given a natural language task description, NovaFlow autonomously synthesizes a video using state-of-the-art video generation models and distills it into 3D actionable object flow. This flow is then converted into precise robot actions through grasp proposals and trajectory optimization, enabling seamless transfer across different robotic platforms.
- Zero-Shot Manipulation: Execute novel tasks without demonstrations or training
- Multi-Embodiment Transfer: Naturally transfers across different robots (Franka arm, Spot quadruped)
- Object Agnostic: Handles rigid, articulated, and deformable objects
- Language-to-Action: Converts natural language task descriptions into precise robot trajectories
- Actionable Flow: Distills generated videos into 3D object motion plans
- Robust Execution: Grasp proposal + trajectory optimization for reliable manipulation
Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training.
- Hardware: A multi-GPU setup (H100/A100 recommended) for the Wan2.1 video generation pipeline. For GPUs below the A100/H100 class, the Veo model is recommended; it requires only a single gaming GPU.
- Software: Python 3.8+, Docker (recommended)
- API Keys: `GOOGLE_API_KEY` (required if using the Veo model)
- Robots: Franka Panda arm or Boston Dynamics Spot (for physical execution)

To install and run NovaFlow:

- Clone the repository:

  ```bash
  git clone https://github.com/bdaiinstitute/NovaFlow.git
  cd NovaFlow
  ```

  The dependency repos (`tapip3d`, `grounded_sam_2`, `wan2.1`) are vendored under `server/`.

- Build and enter Docker. You can pull our prebuilt Docker image:

  ```bash
  docker pull lhy0807/novaflow
  docker tag lhy0807/novaflow novaflow
  ```

  or build it yourself:

  ```bash
  cd server/docker
  docker build -t novaflow .
  cd ../..
  # Run the container with the repo mounted
  docker run -it --gpus all -v $(pwd):/workspace novaflow bash
  ```

- Download model weights (inside Docker):

  ```bash
  cd /workspace/server
  ./download_weights.sh
  ```

- Start the server (inside Docker). To use prompt extension, set `GOOGLE_API_KEY` to your Google API key.

  Using Wan (default, requires A100/H100):

  ```bash
  cd /workspace/server
  ./start_ray_server.sh
  ```

  Using Veo (recommended for GPUs below A100/H100):

  ```bash
  export GOOGLE_API_KEY="your_api_key_here"
  cd /workspace/server
  ./start_ray_server.sh --model veo
  ```

- Run your first job (from a separate terminal on the host).

  Using Wan (default):

  ```bash
  cd client
  python submit_jobs.py --num-jobs 1 --base-seed 42
  ```

  Using Veo (recommended for GPUs below A100/H100):

  ```bash
  cd client
  python submit_jobs.py --num-jobs 1 --base-seed 42 --use-veo
  ```
NovaFlow operates through two main pipelines that convert language instructions into robot actions:
Converts task descriptions into 3D actionable object flow:
- Video Generation: Synthesizes plausible object motion videos using a state-of-the-art video model (Wan2.1)
- 3D Lifting: Converts 2D video to 3D using monocular depth estimation
- Depth Calibration: Calibrates estimated depth against initial observations
- Point Tracking: Tracks dense per-point motion using 3D point tracking (TAPIP3D)
- Object Grounding: Extracts object-centric 3D flow via segmentation (Grounded SAM 2)
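The depth-calibration step above aligns monocular depth estimates, which are only defined up to an affine ambiguity, with metric depths from the initial observation. A minimal sketch of one standard way to do this (our own illustration with a hypothetical `calibrate_depth` helper, not NovaFlow's actual code): fit a global scale and shift by closed-form least squares over pixels where both depths are valid.

```python
def calibrate_depth(d_est, d_obs):
    """Fit scale s and shift t minimizing sum((s * d_est + t - d_obs)^2).

    d_est: monocular depth estimates at sampled pixels (floats).
    d_obs: metric depths at the same pixels from the initial RGB-D frame.
    Returns (s, t). Illustrative helper only.
    """
    n = len(d_est)
    mean_e = sum(d_est) / n
    mean_o = sum(d_obs) / n
    # Closed-form simple linear regression of d_obs on d_est.
    var_e = sum((d - mean_e) ** 2 for d in d_est)
    cov = sum((de - mean_e) * (do - mean_o) for de, do in zip(d_est, d_obs))
    s = cov / var_e
    t = mean_o - s * mean_e
    return s, t

# Example: estimated depths differ from metric by scale 0.5 and shift 0.1.
est = [1.0, 2.0, 3.0, 4.0]
obs = [0.5 * d + 0.1 for d in est]
s, t = calibrate_depth(est, obs)  # recovers s = 0.5, t = 0.1
```

Once `(s, t)` are known, the whole estimated depth video can be rescaled so the lifted 3D flow lives in the camera's metric frame.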
Converts 3D flow into precise robot trajectories:
- Grasp Proposal: Determines initial end-effector poses from grasp candidates
- Trajectory Planning: Plans robot trajectories based on actionable flow with cost/constraint optimization
- Motion Execution: Tracks planned trajectories on physical robots (Franka/Spot)
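For rigid objects, the relative pose between two timesteps of the object flow can be recovered from corresponding 3D points with the classic Kabsch/Procrustes fit. A minimal NumPy sketch of that standard algorithm (our own illustration, not NovaFlow's implementation):

```python
import numpy as np

def relative_pose(p0, p1):
    """Least-squares rigid transform (R, t) such that p1 ~= R @ p0 + t.

    p0, p1: (N, 3) arrays of corresponding 3D flow points at two timesteps.
    Standard Kabsch algorithm via SVD of the cross-covariance matrix.
    """
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)
    H = (p0 - c0).T @ (p1 - c1)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t

# Example: points rotated 90 degrees about z and translated.
rng = np.random.default_rng(0)
p0 = rng.standard_normal((50, 3))
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.3])
p1 = p0 @ R_true.T + t_true
R, t = relative_pose(p0, p1)  # recovers R_true, t_true
```

Chaining these per-step transforms yields the object's pose trajectory, which the grasp proposal and trajectory optimizer then realize as end-effector motion.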
- Rigid Objects: Cup placement, block insertion, mug hanging
- Articulated Objects: Drawer opening, lid lifting
- Deformable Objects: Rope straightening, plant watering
- Video Generation: Wan2.1 model integration for task video synthesis
- Depth Estimation: Monocular depth lifting for 3D reconstruction
- Depth Calibration: Calibration against initial depth observations
- Point Tracking: TAPIP3D integration for dense 3D motion tracking
- Object Grounding: Grounded SAM 2 for object-centric flow extraction
- Flow Visualization: 3D motion flow visualization and analysis
- Grasp Planning: Candidate grasp pose generation and selection
- Trajectory Optimization: Motion planning with constraints and costs
- Execution: Real-time trajectory tracking on physical robots
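The trajectory-optimization step trades off tracking the actionable flow against motion smoothness. A toy 1-D sketch of that trade-off (our own illustration; the actual optimizer works over full 6-DoF poses with additional constraints): gradient descent on a quadratic cost that pulls waypoints toward flow targets while penalizing large jumps between neighbors.

```python
def optimize_trajectory(targets, smooth_weight=1.0, lr=0.05, iters=1000):
    """Minimize sum_i (x_i - target_i)^2 + w * sum_i (x_{i+1} - x_i)^2.

    targets: desired 1-D waypoint positions derived from the flow.
    Returns smoothed waypoints. Toy cost; a real planner would add
    collision, joint-limit, and grasp constraints.
    """
    x = list(targets)  # initialize at the flow targets
    n = len(x)
    for _ in range(iters):
        # Gradient of the tracking term.
        grad = [2.0 * (x[i] - targets[i]) for i in range(n)]
        # Gradient of the smoothness term.
        for i in range(n - 1):
            d = 2.0 * smooth_weight * (x[i + 1] - x[i])
            grad[i] -= d
            grad[i + 1] += d
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# A jittery flow target sequence gets pulled toward a smoother path.
targets = [0.0, 1.2, 1.8, 3.1, 4.0]
traj = optimize_trajectory(targets, smooth_weight=2.0)
```

Because the cost is quadratic, gradient descent with a small step size converges to the unique optimum; heavier `smooth_weight` yields smoother but less faithful tracking.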
This project is licensed under the RAI License - see the LICENSE file for details.
NovaFlow builds upon several outstanding research projects and open-source implementations:
- Wan2.1: Video generation models
- TAPIP3D: 3D point tracking
- Grounded SAM 2: Object segmentation and tracking
- Mega-SAM: Lifting 2D video to 3D via depth estimation
- GraspGen: Grasp planning and generation
- Ray: Distributed computing framework
If you find NovaFlow useful in your research, please cite our paper:
```bibtex
@article{li2025novaflow,
  title={Novaflow: Zero-shot manipulation via actionable flow from generated videos},
  author={Li, Hongyu and Sun, Lingfeng and Hu, Yafei and Ta, Duy and Barry, Jennifer and Konidaris, George and Fu, Jiahui},
  journal={arXiv preprint arXiv:2510.08568},
  year={2025}
}
```