Visual Intention Grounding for Egocentric Assistants (ICCV 2025)


Introduction

Egocentric AI assistants must understand human intentions that are often implicit, context-dependent, and not explicitly tied to object names.
We introduce EgoIntention, the first dataset for egocentric visual intention grounding, in which models must localize objects in first-person views from natural-language intention queries.


Challenge

The model must infer the intended object from the full intention sentence rather than simply detecting explicitly mentioned objects. In this example, “gather my phone and belongings” explicitly mentions “phone” (highlighted in red), which often misleads existing visual grounding models into selecting the wrong object (red box). The correct target, a handbag (green box), is only implied.


Dataset

We source images from PACO-Ego4D and annotate them with intention queries and bounding boxes.

  • 15,667 training samples
  • 825 validation samples
  • 9,892 test samples
  • Two query types:
    • Context intentions – leveraging environmental cues
    • Uncommon intentions – alternative object functionalities

🔗 Download:
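To make the sample structure concrete, here is a minimal loading sketch. The JSON layout and field names (`image`, `query`, `bbox`, `query_type`) are assumptions for illustration only; check the released annotation files for the actual format.

```python
import json

def load_egointention(path: str):
    """Yield EgoIntention-style samples from a JSON annotation file.

    The schema below (field names, box convention) is assumed for
    illustration; the released files may differ.
    """
    with open(path) as f:
        records = json.load(f)
    for r in records:
        yield {
            "image": r["image"],        # path to the PACO-Ego4D frame
            "query": r["query"],        # e.g. "gather my phone and belongings"
            "bbox": tuple(r["bbox"]),   # (x1, y1, x2, y2), assumed pixel coords
            "type": r["query_type"],    # "context" or "uncommon" (assumed key)
        }
```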


Method

Our proposed Reason-to-Ground (RoG) instruction tuning improves grounding performance by chaining two stages:

  1. Intention reasoning → infer the explicit object category from the intention.
  2. Object grounding → localize the object in the scene.

RoG enables unified visual grounding across both egocentric (implicit intentions) and exocentric (explicit queries) perspectives.
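To illustrate the two-stage chaining, here is a minimal inference sketch with the model call abstracted behind a callable. The `ask` interface, the `toy_vlm` stub, and the prompt wording are hypothetical placeholders for illustration, not the released RoG instruction templates or the MiniGPT-v2 API.

```python
from typing import Callable

def reason_to_ground(image, intention: str,
                     ask: Callable[[object, str], str]) -> str:
    """Sketch of RoG-style two-stage inference.

    `ask(image, prompt)` stands in for a vision-language model call
    (e.g. a fine-tuned MiniGPT-v2); prompts here are illustrative.
    """
    # Stage 1: intention reasoning -- infer the explicit object category.
    category = ask(
        image,
        f"Which object in the image would help to '{intention}'? "
        "Answer with a single object name.",
    )
    # Stage 2: object grounding -- localize the inferred category.
    return ask(image, f"Locate the {category} in the image and "
                      "return its bounding box.")

# Toy stand-in for a real VLM, just to show the control flow.
def toy_vlm(image, prompt: str) -> str:
    return "handbag" if "object name" in prompt else "<box>120,40,310,220</box>"

print(reason_to_ground(None, "gather my phone and belongings", toy_vlm))
```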


Checkpoints

Each row corresponds to a MiniGPT-v2 configuration; the SFT rows differ in which training sets (RC/+/g, RCInt./+/g, EgoInt.) are combined.

| Method    | Context | Uncommon | Object | Checkpoint |
|-----------|---------|----------|--------|------------|
| 0-shot    | 21.7    | 18.0     | 40.8   | ckpt       |
| Naive SFT | 23.7    | 19.4     | 38.1   | ckpt       |
| Naive SFT | 42.8    | 39.2     | 46.2   | ckpt       |
| Naive SFT | 45.9    | 40.8     | 48.6   | ckpt       |
| Naive SFT | 46.0    | 40.9     | 51.3   | ckpt       |
| RoG SFT   | 49.9    | 44.7     | 52.2   | ckpt       |
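The Context, Uncommon, and Object columns report grounding accuracy. Assuming the standard acc@0.5 criterion from referring expression grounding (a prediction counts as correct when its IoU with the ground-truth box is at least 0.5), a minimal scorer would look like the sketch below; the threshold choice is an assumption, not confirmed by this README.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predicted boxes whose IoU with ground truth >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```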

Acknowledgments

We thank the following open-source projects that made this work possible:

  • MiniGPT-v2 for their awesome open-source vision-language model.
  • PACO for their valuable dataset contribution.

Citation

If you find this project useful, please cite:

@article{sun2025visual,
  title={Visual Intention Grounding for Egocentric Assistants},
  author={Sun, Pengzhan and Xiao, Junbin and Tse, Tze Ho Elden and Li, Yicong and Akula, Arjun and Yao, Angela},
  journal={arXiv preprint arXiv:2504.13621},
  year={2025}
}
