Bridging Vision, Language, and Action: What's Missing in Actionable Visual Perception for Robotics


CVPR 2026 Workshop, Denver CO, USA

Time: June 3rd, Wednesday (Full Day)

Location: 503


Overview

Vision foundation models excel at encoding passive data, yet robots require physically grounded reasoning about pose, dynamics, and affordances. This workshop bridges the gap between computer vision and robotics, moving beyond simple deployment to pioneer task-driven, co-designed perception-action loops. We aim to translate perceptual abstractions into actionable structures, closing the loop from pixels to torque for robust, real-world systems.

To achieve this, we foster a bidirectional dialogue: enabling vision researchers to incorporate robotic constraints into design, while empowering roboticists to effectively deploy advanced models. Specifically, this workshop focuses on three interactive dimensions and the corresponding stage-wise challenges essential for actionable visual perception.

Topics

Interactive Dimensions

What visual capability is needed for fully autonomous systems?

How can the vision community contribute to general-purpose robotic systems?

What data modality is critical for generalizable, robust robot control?

Stage-Wise Challenges

Data

  • Dynamic logs with multi-modal feedback.
  • Teaching models about risk.
  • Physically accurate data to narrow the sim-to-real gap.

Model

  • 3D geometry and physical dynamics.
  • Differentiable cause-effect relationships.
  • Conditioned representations over passive observation.

Optimization

  • Safety and stability in the learning objective.
  • Downstream task success.
  • Model confidence for safe real-world deployment.

Evaluation

  • Closed-loop performance.
  • Reliability against environmental variability.
  • Physical task completion and safe interaction.

Call for Papers (Last updated on May 21st)

Submission Instructions

We welcome submissions covering:

All formats allow unlimited references and appendices.

Contributions are non-archival but will be hosted on the workshop website, so dual submission is allowed where permitted by third parties. We welcome papers that are under review at, or already accepted by, other conferences. If this applies to your paper, please say so in the last sentence of the abstract.

Submissions should follow CVPR two-column style and be anonymous; see the CVPR-26 author kit for details.

Submission and Important Dates

Invited Speakers (Last updated on May 21st)

Schedule (Last updated on June 1st)

To encourage open-ended discussion and maximize in-person engagement, the workshop will feature a mix of structured and interactive formats.

These interactive elements are designed to stimulate lively exchanges, bridge the gap between junior and senior researchers, and cultivate an open, inclusive research community.

Time Session
8:50 – 9:00 Opening Remarks
9:00 – 9:40 Invited Talk by Saurabh Gupta
9:40 – 10:20 Invited Talk by Ming-Yu Liu
10:20 – 10:30 Coffee Break
10:30 – 11:10 Invited Talk by Nathan F. Lepora
11:10 – 11:50 Invited Talk by Yunzhu Li
11:50 – 14:00 Lunch Break
14:00 – 14:40 Invited Talk by Chelsea Finn
14:40 – 15:20 Invited Talk by Ruoshi Liu
15:20 – 15:30 Coffee Break
15:30 – 16:10 Invited Talk by Marco Pavone
16:10 – 16:20 Closing Remarks
16:20 – 18:00 Poster (Exhibit Hall A)

Accepted Papers

  1. Multi-Objective Photoreal Simulation (MOPS) Dataset for Computer Vision in Robot Manipulation
  2. SIR: Structured Image Representations for Explainable Robot Learning
  3. Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
  4. SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models
  5. LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
  6. Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
  7. Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
  8. PhaForce: Phase-Scheduled Visual–Force Policy Learning with Slow Planning and Fast Correction for Contact-Rich Manipulation
  9. PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning
  10. Multimodal Causal Subtask Modeling for Scalable VLA Pipelines in Long-Horizon Manipulation
  11. Learning End-to-End Visuomotor Control via Stochastic Decoupled Policy Gradient
  12. VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
  13. PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
  14. Towards Robust Robot Manipulation: Visual-Tactile Sensing for VLA Models
  15. Geometry-Regularized Affordance Prediction for Sim-to-Real Transfer in Vision-Language-Action Models
  16. STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation
  17. Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow
  18. Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
  19. ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
  20. KEYGEN: Unsupervised Keypoint based Object-Centric Representations for Category-Level Policy Generalization
  21. HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
  22. Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
  23. Point and Check: Diagnosing Fragile Visual Grounding for Embodied Systems
  24. VLS: Steering Pretrained Robot Policies via Vision–Language Models
  25. APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model
  26. Moving Through Clutter: Scaling Data Collection and Benchmarking for 3D Scene-Aware Humanoid Locomotion via Virtual Reality
  27. Multi-Phase Vision-Based Navigation and Inspection for Legged Robots with Online Goal Refinement and Vision-Only Halting
  28. ForesightNav: Learning Scene Imagination for Efficient Exploration
  29. Before the Needle: Action-Conditioned Risk-Affordance Maps for Safe Ultrasound-Guided Intervention
  30. Can Vision-Language Models See the Future? Coverage-Aware Pedestrian Forecasting and Intent-Guided Residuals
  31. From Vision to Action: Benchmarking VLMs for Multi-Arm Robotic Fruit Harvesting
  32. Text to Multi-Component Robotic Assembly: Vision Language Reasoning for Functional Component Assignment
  33. Reasoning-Guided Part-Level Visual Grounding via Reinforcement Learning
  34. CANON: Canonical Observation Pretraining for Visuomotor Control

Organizers

Student Organizers