Shijie Zhou*1
Viet Dac Lai2
Hao Tan2
Jihyung Kil2
Wanrong Zhu2
Changyou Chen1
Ruiyi Zhang2†
1 University at Buffalo 2 Adobe Research
* Majority of this work was done while SZ was at University at Buffalo † Leadership
- [2025/11/05] 🔥 GUI-AIMA-lite now supports FlashAttention-2, making it faster and significantly more memory-efficient than the previous Eager attention implementation. Since `output_attentions=True` is not supported by FlashAttention, we resolve this by mixing FlashAttention and Eager attention implementations.
- [2026/03/27] 🔥 Updated smz8599/GUI-AIMA-3B with stronger desktop grounding capacity. It achieves 61.5% on ScreenSpot-Pro and 68.1% on OSWorld-G with 2-step inference.
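The FlashAttention/Eager mixing mentioned in the news above can be illustrated in plain PyTorch: use the fused kernel (which never materializes the attention matrix) wherever weights are not needed, and fall back to an explicit softmax path only for the layer whose attention map must be read out. This is a minimal sketch of the idea, not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def fast_attention(q, k, v):
    # Fused kernel path (FlashAttention-style): fast and memory-efficient,
    # but the attention weights are never materialized.
    return F.scaled_dot_product_attention(q, k, v)

def eager_attention(q, k, v):
    # Explicit softmax path: slower, but exposes the attention matrix,
    # which is what GUI grounding needs to read out.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v, attn

# Mixed strategy: fused attention for every layer except the one whose
# weights are needed, which uses the eager path instead.
torch.manual_seed(0)
q = torch.randn(1, 8, 16, 64)  # (batch, heads, tokens, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

out_fast = fast_attention(q, k, v)
out_eager, weights = eager_attention(q, k, v)  # weights: (1, 8, 16, 16)
```

Both paths compute the same outputs up to numerical tolerance; only the eager path additionally returns the weights.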
- Main Results
- Architecture
- Installation
- Model Training
- Evaluation on GUI Grounding Benchmarks
- Acknowledgements
- Citation
There are two variants of GUI-AIMA: GUI-AIMA-3B and GUI-AIMA-lite-3B. GUI-AIMA-3B is further trained from GUI-AIMA-lite-3B on 250k samples from GroundCUA.
1-step inference of GUI-AIMA achieves 53.8% on ScreenSpot-Pro, 62.8% on OSWorld-G, 60.0% on UI-Vision, 79.1% on MMBench-GUI-L2, and 92.1% on ScreenSpot-v2. With 2-step zoom-in inference, it achieves 61.5% and 68.1% on ScreenSpot-Pro and OSWorld-G, respectively.
We trained GUI-AIMA for one-step center-point prediction. However, GUI-AIMA can also be run in a 2-step fashion without further fine-tuning: (step 1) a first inference determines the rough grounding area; (step 2) that area is cropped and zoomed in for a second, more precise grounding inference. The 2-step inference is very helpful for GUI grounding on high-resolution screenshots, such as the samples in ScreenSpot-Pro and OSWorld-G.
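The crop-and-zoom geometry behind the 2-step inference can be sketched as pure coordinate math. The helper names and the `crop_ratio` parameter below are hypothetical, not taken from the evaluation scripts:

```python
def crop_box(img_size, center_xy, crop_ratio=0.25):
    """Compute the step-2 crop box around the step-1 predicted point.

    img_size:   (width, height) of the full screenshot.
    center_xy:  rough point predicted by the first inference pass.
    crop_ratio: hypothetical parameter setting the crop size relative
                to the full screenshot.
    """
    w, h = img_size
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)
    cx, cy = center_xy
    # Clamp the box so it stays inside the screenshot.
    left = min(max(cx - cw // 2, 0), w - cw)
    top = min(max(cy - ch // 2, 0), h - ch)
    return (left, top, left + cw, top + ch)

def to_full_coords(local_xy, box):
    # Map a point predicted on the zoomed crop back to
    # full-screenshot coordinates.
    return (local_xy[0] + box[0], local_xy[1] + box[1])

# Example: a 3840x2160 screenshot with a rough step-1 prediction
# near the top-right corner; the box is clamped to the image edge.
box = crop_box((3840, 2160), (3500, 200))
```

The crop is then resized and fed back to the model for the second, more precise prediction, and `to_full_coords` maps the result back to the original screenshot.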
Figure 1. GUI-AIMA utilizes the inherent attention of MLLMs for patch-wise GUI grounding. It simplifies vanilla attention grounding, which requires proper aggregation of all query tokens' grounding vectors, by adding a learnable ANCHOR token as the context anchor of the query. Multi-head aggregation of the attention vectors between the ANCHOR token and visual tokens is sufficient for grounding.
Figure 2. GUI-AIMA proposes an effective multi-head weighting approach that measures the uniformity between the global query-visual pattern and each head-wise query-visual pattern.
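The aggregation in Figures 1 and 2 can be sketched as follows: each head contributes one ANCHOR-to-visual attention distribution, heads are weighted by how closely they agree with the head-averaged pattern, and the weighted sum gives a single patch-wise grounding map. The cosine-similarity weighting here is an assumption standing in for the paper's uniformity measure; see the paper for the exact definition:

```python
import numpy as np

def aggregate_heads(anchor_attn):
    """Combine per-head ANCHOR->visual attention into one grounding map.

    anchor_attn: (H, V) array, one attention distribution per head over
    the V visual patches. Cosine similarity to the head-averaged pattern
    is used as a stand-in for the paper's uniformity measure.
    """
    global_pattern = anchor_attn.mean(axis=0)                     # (V,)
    num = anchor_attn @ global_pattern                            # (H,)
    den = np.linalg.norm(anchor_attn, axis=1) * np.linalg.norm(global_pattern)
    sims = num / den                                              # cosine sims
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()                                      # softmax over heads
    return weights @ anchor_attn                                  # (V,) patch scores

# Toy example: 8 heads attending over 1024 visual patches.
rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 1024))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
scores = aggregate_heads(attn)
patch = int(scores.argmax())  # index of the predicted grounding patch
```

Because each head's row and the head weights both sum to one, the aggregated map is itself a valid distribution over patches, and its argmax gives the predicted patch.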
- Environment:
git clone https://github.com/sjz5202/GUI-AIMA
cd GUI-AIMA
conda create -n gui_aima python=3.10
conda activate gui_aima
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -e .
- Download the GUI-Actor data from here.
- Download the UGround single-round dialogue json data from here.
- Download the GTA1 data without the web part from here.
- Single-node training:
bash scripts/sft_single_node.sh
- Multi-node training (for reference; needs adjustment for your environment):
bash scripts/sft_multi_node.sh
We provide evaluation scripts for ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2 and ScreenSpot-v2 under the eval/ folder: eval_ss_pro.sh, eval_osworld_g.sh, eval_ui_vision.sh, eval_mmbench_l2.sh, eval_ss_v2.sh.
For ScreenSpot-Pro and OSWorld-G, we provide 2-step inference in eval_ss_pro.sh and eval_osworld_g.sh, which determines the focus area in the first step and zooms into it for grounding in the second step, without extra model training.
Evaluation datasets are available from ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2 and ScreenSpot-v2. The data path in each evaluation script needs to be adjusted.
A single-sample usage example is available in eval/example_inference.py.
GUI-AIMA is built upon the following projects.
Thanks for their great work!
@misc{zhou2025guiaimaaligningintrinsicmultimodal,
title={GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding},
author={Shijie Zhou and Viet Dac Lai and Hao Tan and Jihyung Kil and Wanrong Zhu and Changyou Chen and Ruiyi Zhang},
year={2025},
eprint={2511.00810},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.00810},
}
