
Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation

1Peking University; 2Beijing Academy of Artificial Intelligence (BAAI); 3University of Sydney; 4Institute of Automation;
*Equal contribution · Project leaders · Corresponding author
Overview of Action-Sketcher

 

Action-Sketcher operates in a See-Think-Sketch-Act loop, where a foundation model first performs temporal and spatial reasoning to decompose a high-level instruction (e.g., "Clean the objects on the table") into a subtask and a corresponding Visual Sketch. This sketch, composed of primitives like points, boxes, and arrows, serves as an explicit, human-readable plan that guides a low-level policy to generate robust action sequences. This methodology enables three key capabilities: (bottom left) long-horizon planning through task decomposition, (bottom middle) explicit spatial reasoning by grounding instructions in scene geometry, and (bottom right) seamless human-in-the-loop adaptability via direct sketch correction and intent supervision.
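To make the sketch primitives concrete, the following Python snippet shows one possible way to represent them as data; the class and field names are illustrative assumptions, not the actual schema used by Action-Sketcher.

from dataclasses import dataclass, field
from typing import List, Tuple

# A point annotation in image coordinates, e.g. a grasp target (hypothetical schema).
@dataclass
class Point:
    xy: Tuple[float, float]
    label: str = ""

# An axis-aligned box marking an object or a placement region.
@dataclass
class Box:
    xyxy: Tuple[float, float, float, float]
    label: str = ""

# An arrow from a start point to an end point, e.g. a motion or placement direction.
@dataclass
class Arrow:
    start: Tuple[float, float]
    end: Tuple[float, float]
    label: str = ""

# A typed relation between two named elements, e.g. ("mug", "on", "tray").
@dataclass
class Relation:
    subject: str
    predicate: str
    object: str

# One Visual Sketch: the primitives overlaid on the current camera view
# for a single subtask, together with the subtask text they ground.
@dataclass
class VisualSketch:
    subtask: str
    points: List[Point] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)
    arrows: List[Arrow] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)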

Abstract

Long-horizon robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines referential grounding in cluttered or underspecified scenes, impedes effective task decomposition of long-horizon goals under closed-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, an interpretable visual intermediate that renders points, boxes, arrows, and typed relations onto the robot's current views to externalize spatial intent, connect language to scene geometry, and provide a human-verifiable bridge between high-level reasoning and low-level control. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See-Think-Sketch-Act workflow coordinated by an adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a diverse corpus of interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum that combines interleaved-sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Extensive experiments on cluttered scenes and multi-object tasks, both in simulation and in the real world, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans.
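As a rough aid to reading the training recipe above, the stages could be summarized in configuration form as below; the stage names, data keys, and objective descriptions are our own shorthand, not the paper's exact setup.

# Illustrative shorthand for the three-stage curriculum described in the abstract.
# All names below are assumptions made for exposition, not the released training code.
CURRICULUM = [
    {
        "stage": "interleaved_sequence_alignment",
        "data": ["images", "text", "visual_sketches", "actions"],
        "objective": "align all modalities in one interleaved token stream",
    },
    {
        "stage": "language_to_sketch_consistency",
        "data": ["instructions", "visual_sketches"],
        "objective": "predict sketch primitives from language for precise grounding",
    },
    {
        "stage": "sketch_conditioned_control",
        "data": ["visual_sketches", "robot_states", "action_chunks"],
        "objective": "imitation learning plus sketch-to-action reinforcement",
    },
]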

Pipeline of Action-Sketcher

The Action-Sketcher framework is model-agnostic: it can be integrated with any VLA model through an event-driven loop that (i) summarizes the next subtask, (ii) emits a compact Visual Sketch (points, boxes, arrows, and relations) that externalizes spatial intent, and (iii) synthesizes an action chunk conditioned on that sketch and the robot state. This explicit intermediate supports targeted supervision, on-the-fly correction, and reliable long-horizon execution within a single-model architecture.
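A minimal pseudo-implementation of this event-driven loop might look like the sketch below; model, robot, and the gate-token strings are hypothetical placeholders for whichever VLA backbone and controller the framework is wrapped around.

def run_episode(model, robot, instruction, max_steps=500):
    """Illustrative See-Think-Sketch-Act loop; not the released implementation."""
    subtask, sketch = None, None
    for _ in range(max_steps):
        obs = robot.get_observation()        # See: current camera views + robot state
        # The model emits a gate token choosing whether to re-plan, redraw, or act.
        gate = model.decide(obs, instruction, subtask, sketch)
        if gate == "<think>":                # Think: summarize the next subtask
            subtask = model.plan_subtask(obs, instruction)
        elif gate == "<sketch>":             # Sketch: (re)draw points, boxes, arrows on the view
            sketch = model.draw_sketch(obs, subtask)
        else:                                # Act: action chunk conditioned on sketch and state
            robot.execute(model.predict_actions(obs, sketch))
        if robot.task_done(instruction):
            break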

Experiments

Results on LIBERO, RoboTwin-2.0, and real-world long-horizon and spatially complex tasks, comparing Action-Sketcher to top baselines.

Visualization of Task Completion

Qualitative rollouts on long-horizon and spatial manipulation tasks. Our framework generates Visual Sketches (overlaid points, boxes, and arrows) to ground high-level reasoning into low-level actions, successfully completing tasks like tidying a tabletop and pouring tea in cluttered environments.

Visualization of Reasoning Process

Videos

Simulation Demos

Pick and Place

Hang the Mug

Pick A to B

Stack Three Blocks

Real-World Demos

Clean the Table

Pour Tea

Citation

If you find our work helpful, feel free to cite it:


@article{tan2026actionsketcher,
    title={Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation}, 
    author={Tan, Huajie and Co, Peterson and Xu, Yijie and Rong, Shanyu and Ji, Yuheng and Chi, Cheng and Chen, Xiansheng and Zhao, Zhongxia and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2601.01618},
    year={2026}
}