🔗 Download:
Egocentric AI assistants must understand human intentions that are often implicit, context-dependent, and not explicitly tied to object names.
We introduce EgoIntention, the first dataset for egocentric visual intention grounding, where models must localize objects in first-person views based on natural intention queries.
The model must infer the intended object from the full intention sentence rather than simply detecting explicitly mentioned objects. In this example, “gather my phone and belongings” explicitly mentions “phone” (highlighted in red), which often misleads existing visual grounding models into identifying the wrong object (red box). The correct target, a handbag (green box), is only implied.
We source images from PACO-Ego4D and annotate them with intention queries and corresponding bounding boxes.
- 15,667 training samples
- 825 validation samples
- 9,892 test samples
- Two query types:
- Context intentions – leveraging environmental cues
- Uncommon intentions – alternative object functionalities
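To make the dataset layout concrete, here is a minimal loading sketch. The file name and field names (`egointention_train.json`, `image`, `intention`, `bbox`, `query_type`) are illustrative assumptions, not the released annotation schema.

```python
import json

# Hypothetical annotation layout -- field names are illustrative,
# not the released EgoIntention schema.
with open("egointention_train.json") as f:
    samples = json.load(f)

for s in samples[:3]:
    image_path = s["image"]        # PACO-Ego4D frame
    query = s["intention"]         # e.g. "gather my phone and belongings"
    x, y, w, h = s["bbox"]         # target object bounding box
    query_type = s["query_type"]   # "context" or "uncommon"
    print(query_type, query, (x, y, w, h), image_path)
```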
🔗 Download:
Our proposed Reason-to-Ground (RoG) instruction tuning improves grounding performance by chaining two stages:
- Intention reasoning → infer the explicit object category from the intention.
- Object grounding → localize the object in the scene.
RoG enables unified visual grounding across both egocentric (implicit intentions) and exocentric (explicit queries) perspectives.
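As a concrete illustration, the sketch below chains the two stages with a generic chat-style VLM. The `vlm.generate(image, prompt)` call is a placeholder and the prompt templates are illustrative, not the released RoG instructions; the `[refer]` tag follows MiniGPT-v2's task-identifier convention.

```python
def reason_to_ground(vlm, image, intention_query):
    """Two-stage RoG-style inference sketch: reason about the intended
    object, then ground it. `vlm.generate` is a placeholder chat-VLM call."""
    # Stage 1: intention reasoning -- map the implicit intention to an
    # explicit object category (e.g. "gather my belongings" -> "handbag").
    reason_prompt = (
        f"Intention: {intention_query}\n"
        "Which object in the image best serves this intention? "
        "Answer with a short object name."
    )
    object_name = vlm.generate(image, reason_prompt).strip()

    # Stage 2: object grounding -- localize the inferred object with a
    # standard referring-expression grounding prompt.
    ground_prompt = f"[refer] give me the location of the {object_name}"
    bbox = vlm.generate(image, ground_prompt)
    return object_name, bbox
```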
Each row corresponds to an SFT configuration used to fine-tune MiniGPT-v2; check marks indicate which training sets are included.
| RC/+/g | RCInt./+/g | EgoInt. | Method | Context | Uncommon | Object | Checkpoint |
|---|---|---|---|---|---|---|---|
| – | – | – | 0-shot | 21.7 | 18.0 | 40.8 | ckpt |
| ✓ | – | – | Naive SFT | 23.7 | 19.4 | 38.1 | ckpt |
| – | – | ✓ | Naive SFT | 42.8 | 39.2 | 46.2 | ckpt |
| ✓ | – | ✓ | Naive SFT | 45.9 | 40.8 | 48.6 | ckpt |
| ✓ | ✓ | ✓ | Naive SFT | 46.0 | 40.9 | 51.3 | ckpt |
| ✓ | ✓ | ✓ | RoG SFT | 49.9 | 44.7 | 52.2 | ckpt |
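The scores above are grounding accuracies; assuming the standard referring-expression protocol (a prediction counts as correct when its IoU with the ground-truth box is at least 0.5), scoring can be sketched as follows.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, thresh=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```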
We would like to thank the following open-source projects that made this work possible:
- MiniGPT-v2 for their awesome open-source vision-language model.
- PACO for their valuable dataset contribution.
If you find this project useful, please cite:
@article{sun2025visual,
  title={Visual Intention Grounding for Egocentric Assistants},
  author={Sun, Pengzhan and Xiao, Junbin and Tse, Tze Ho Elden and Li, Yicong and Akula, Arjun and Yao, Angela},
  journal={arXiv preprint arXiv:2504.13621},
  year={2025}
}


