Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning

TL;DR: Explicitly conditioning on camera pose helps BC policies.

Tianchong Jiang¹ Jingtian Ji¹ Xiangshan Tan¹ Jiading Fang^2,* Anand Bhattad³ Vitor Guizilini^4,† Matthew R. Walter^1,†

¹ Toyota Technological Institute at Chicago (TTIC) ² Waymo ³ Johns Hopkins University ⁴ Toyota Research Institute (TRI)

* Work completed while at TTIC † Equal advising

IEEE International Conference on Robotics and Automation (ICRA), 2026

Best Paper Award on Robot Learning

arXiv Code Dataset

We explicitly condition behavior-cloning policies on camera geometry by encoding each pixel as a Plücker ray.

Key Findings

Camera conditioning consistently helps: Per-pixel Plücker ray maps improve ACT, Diffusion Policy, and SmolVLA across simulated and real-world tasks.
Randomization removes pose shortcuts: Static simulated environments can leak camera pose through background cues; randomizing appearance removes this shortcut, so explicit camera-pose conditioning helps more.
Cropping is crucial: Applying random cropping jointly to images and Plücker gives significant boosts.
Action space matters: Delta end-effector pose performs best; nonetheless, camera conditioning helps across all action spaces.

Main results table — Main results (simulated tasks)

Effect of cropping on performance — Cropping improves success

Action space comparison — Action space ablation

Fixed vs randomized comparison — Fixed vs. randomized settings

Method

We explicitly condition imitation learning policies on camera geometry using per-pixel Plücker ray embeddings. Given camera intrinsics and extrinsics, each pixel is mapped to a 6D ray representation that is concatenated with image features.

With pretrained encoders: encode the Plücker map with a small CNN to the image-latent dimension, then concatenate channel-wise.
Without pretrained encoders: concatenate the image with the Plücker map channel-wise.

Architecture (no pretraining) — No pretraining: input concatenation

Architecture (pretrained) — Pretrained encoder: encode rays, late fuse

Benchmarks

Six tasks across two simulated environments
Paired fixed vs randomized variants

Top: fixed | Bottom: randomized

Assembly Square (fixed) — Assembly Square

Pick Place Can (randomized) — Pick Place Can

Assembly Square (randomized) — Assembly Square

Lift Upright (randomized) — Lift Upright

Real-World Experiments

UR5 robot; three movable third‑person cameras
Tasks: Pick Place, Plate Insertion, Hang Mug
Success counts across randomized camera poses

Success counts across settings — Success counts (per setting)

Plücker Snippet

To add camera conditioning to your policy, you can use the following minimalist snippet to get Plücker raymap from intrinsics and extrinsics. (It assumes OpenCV convention i.e. image origin at top-left, +z is forward.)

import torch

def get_plucker_raymap(K, c2w, height, width):
    """intrinsics (3,3), cam2world (4,4), height int, width int"""
    vv, uu = torch.meshgrid(
        torch.arange(height, device=K.device, dtype=K.dtype) + 0.5,
        torch.arange(width, device=K.device, dtype=K.dtype) + 0.5,
        indexing="ij",
    )
    rays = torch.stack([uu, vv, torch.ones_like(uu)], dim=-1)
    d_world = torch.nn.functional.normalize(
        (rays @ torch.linalg.inv(K).T) @ c2w[:3, :3].T,
        dim=-1,
        eps=1e-9,
    )
    o = c2w[:3, 3].view(1, 1, 3)
    m = torch.cross(o, d_world, dim=-1)
    return torch.cat([d_world, m], dim=-1)

BibTeX

@inproceedings{jiang2026knowyourcamera,
  title     = {Do You Know Where Your Camera Is? {V}iew-Invariant Policy Learning with Camera Conditioning},
  author    = {Tianchong Jiang and Jingtian Ji and Xiangshan Tan and Jiading Fang and Anand Bhattad and Vitor Guizilini and Matthew R. Walter},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
}