Visual Intention Grounding for Egocentric Assistants (ICCV 2025)


Introduction

Egocentric AI assistants must understand human intentions that are often implicit, context-dependent, and not explicitly tied to object names.
We introduce EgoIntention, the first dataset for egocentric visual intention grounding, in which models must localize objects in first-person views from natural-language intention queries.


Challenge

The model must infer the intended object from the full intention sentence rather than simply detecting explicitly mentioned objects. In this example, “gather my phone and belongings” explicitly mentions “phone” (highlighted in red), which often misleads existing visual grounding models into selecting the wrong object (red box). The correct target, a handbag (green box), is only implied.


Dataset

We source images from PACO-Ego4D and annotate them with intention queries and bounding boxes.

  • 15,667 training samples
  • 825 validation samples
  • 9,892 test samples
  • Two query types:
    • Context intentions – leveraging environmental cues
    • Uncommon intentions – alternative object functionalities

🔗 Download:
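To make the sample structure concrete, here is a minimal loading sketch. The JSON layout and field names (`image`, `query`, `bbox`, `query_type`) are assumptions for illustration only; check the released annotation files for the actual format.

```python
import json

def load_egointention(path: str):
    """Yield EgoIntention-style samples from a JSON annotation file.

    The schema below (field names, box convention) is assumed for
    illustration; the released files may differ.
    """
    with open(path) as f:
        records = json.load(f)
    for r in records:
        yield {
            "image": r["image"],        # path to the PACO-Ego4D frame
            "query": r["query"],        # e.g. "gather my phone and belongings"
            "bbox": tuple(r["bbox"]),   # (x1, y1, x2, y2), assumed pixel coords
            "type": r["query_type"],    # "context" or "uncommon" (assumed key)
        }
```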


Method

Our proposed Reason-to-Ground (RoG) instruction tuning improves grounding performance by chaining two stages:

  1. Intention reasoning → infer the explicit object category from the intention.
  2. Object grounding → localize the object in the scene.

RoG enables unified visual grounding across both egocentric (implicit intentions) and exocentric (explicit queries) perspectives.
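To illustrate the two-stage chaining, here is a minimal inference sketch with the model call abstracted behind a callable. The `ask` interface, the `toy_vlm` stub, and the prompt wording are hypothetical placeholders for illustration, not the released RoG instruction templates or the MiniGPT-v2 API.

```python
from typing import Callable

def reason_to_ground(image, intention: str,
                     ask: Callable[[object, str], str]) -> str:
    """Sketch of RoG-style two-stage inference.

    `ask(image, prompt)` stands in for a vision-language model call
    (e.g. a fine-tuned MiniGPT-v2); prompts here are illustrative.
    """
    # Stage 1: intention reasoning -- infer the explicit object category.
    category = ask(
        image,
        f"Which object in the image would help to '{intention}'? "
        "Answer with a single object name.",
    )
    # Stage 2: object grounding -- localize the inferred category.
    return ask(image, f"Locate the {category} in the image and "
                      "return its bounding box.")

# Toy stand-in for a real VLM, just to show the control flow.
def toy_vlm(image, prompt: str) -> str:
    return "handbag" if "object name" in prompt else "<box>120,40,310,220</box>"

print(reason_to_ground(None, "gather my phone and belongings", toy_vlm))
```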


Checkpoints

Each row corresponds to a MiniGPT-v2 configuration; the SFT rows differ in which training sets (RC/+/g, RCInt./+/g, EgoInt.) are combined.

| Method    | Context | Uncommon | Object | Checkpoint |
|-----------|---------|----------|--------|------------|
| 0-shot    | 21.7    | 18.0     | 40.8   | ckpt       |
| Naive SFT | 23.7    | 19.4     | 38.1   | ckpt       |
| Naive SFT | 42.8    | 39.2     | 46.2   | ckpt       |
| Naive SFT | 45.9    | 40.8     | 48.6   | ckpt       |
| Naive SFT | 46.0    | 40.9     | 51.3   | ckpt       |
| RoG SFT   | 49.9    | 44.7     | 52.2   | ckpt       |
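The Context, Uncommon, and Object columns report grounding accuracy. Assuming the standard acc@0.5 criterion from referring expression grounding (a prediction counts as correct when its IoU with the ground-truth box is at least 0.5), a minimal scorer would look like the sketch below; the threshold choice is an assumption, not confirmed by this README.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predicted boxes whose IoU with ground truth >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```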

Acknowledgments

We thank the following open-source projects that made this work possible:

  • MiniGPT-v2 for their awesome open-source vision-language model.
  • PACO for their valuable dataset contribution.

Citation

If you find this project useful, please cite:

@article{sun2025visual,
  title={Visual Intention Grounding for Egocentric Assistants},
  author={Sun, Pengzhan and Xiao, Junbin and Tse, Tze Ho Elden and Li, Yicong and Akula, Arjun and Yao, Angela},
  journal={arXiv preprint arXiv:2504.13621},
  year={2025}
}
