Shijie Zhou*1
Viet Dac Lai2
Hao Tan2
Jihyung Kil2
Wanrong Zhu2
Changyou Chen1
Ruiyi Zhang2†
1 University at Buffalo 2 Adobe Research
* Majority of this work was done while SZ was at University at Buffalo † Leadership
- [2025/11/05] 🔥 GUI-AIMA-lite now supports FlashAttention-2, making it faster and significantly more memory-efficient than the previous Eager attention implementation. Since `output_attentions=True` is not supported by FlashAttention, we resolve this by mixing FlashAttention and Eager attention implementations.
- [2026/03/27] 🔥 Updated smz8599/GUI-AIMA-3B with stronger desktop grounding capacity. It achieves 61.5% on ScreenSpot-Pro and 68.1% on OSWorld-G with 2-step inference.
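The FlashAttention/Eager mixing mentioned in the news above can be illustrated in plain PyTorch: use the fused kernel (which never materializes the attention matrix) wherever weights are not needed, and fall back to an explicit softmax path only for the layer whose attention map must be read out. This is a minimal sketch of the idea, not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def fast_attention(q, k, v):
    # Fused kernel path (FlashAttention-style): fast and memory-efficient,
    # but the attention weights are never materialized.
    return F.scaled_dot_product_attention(q, k, v)

def eager_attention(q, k, v):
    # Explicit softmax path: slower, but exposes the attention matrix,
    # which is what GUI grounding needs to read out.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v, attn

# Mixed strategy: fused attention for every layer except the one whose
# weights are needed, which uses the eager path instead.
torch.manual_seed(0)
q = torch.randn(1, 8, 16, 64)  # (batch, heads, tokens, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

out_fast = fast_attention(q, k, v)
out_eager, weights = eager_attention(q, k, v)  # weights: (1, 8, 16, 16)
```

Both paths compute the same outputs up to numerical tolerance; only the eager path additionally returns the weights.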
- Main Results
- Architecture
- Installation
- Model Training
- Evaluation on GUI Grounding Benchmarks
- Acknowledgements
- Citation
There are two variants of GUI-AIMA: GUI-AIMA-3B and GUI-AIMA-lite-3B. GUI-AIMA-3B is further trained from GUI-AIMA-lite-3B on 250k samples from GroundCUA.
1-step inference of GUI-AIMA achieves 53.8% on ScreenSpot-Pro, 62.8% on OSWorld-G, 60.0% on UI-Vision, 79.1% on MMBench-GUI-L2, and 92.1% on ScreenSpot-v2. With 2-step zoom-in inference, it achieves 61.5% and 68.1% on ScreenSpot-Pro and OSWorld-G, respectively.
We trained GUI-AIMA for one-step center-point prediction. However, GUI-AIMA can also be run in a 2-step fashion without further fine-tuning: (step 1) a first inference determines the rough grounding area; (step 2) that area is cropped and zoomed in for a second, more precise grounding inference. The 2-step inference is very helpful for GUI grounding on high-resolution screenshots, such as the samples in ScreenSpot-Pro and OSWorld-G.
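The crop-and-zoom geometry behind the 2-step inference can be sketched as pure coordinate math. The helper names and the `crop_ratio` parameter below are hypothetical, not taken from the evaluation scripts:

```python
def crop_box(img_size, center_xy, crop_ratio=0.25):
    """Compute the step-2 crop box around the step-1 predicted point.

    img_size:   (width, height) of the full screenshot.
    center_xy:  rough point predicted by the first inference pass.
    crop_ratio: hypothetical parameter setting the crop size relative
                to the full screenshot.
    """
    w, h = img_size
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)
    cx, cy = center_xy
    # Clamp the box so it stays inside the screenshot.
    left = min(max(cx - cw // 2, 0), w - cw)
    top = min(max(cy - ch // 2, 0), h - ch)
    return (left, top, left + cw, top + ch)

def to_full_coords(local_xy, box):
    # Map a point predicted on the zoomed crop back to
    # full-screenshot coordinates.
    return (local_xy[0] + box[0], local_xy[1] + box[1])

# Example: a 3840x2160 screenshot with a rough step-1 prediction
# near the top-right corner; the box is clamped to the image edge.
box = crop_box((3840, 2160), (3500, 200))
```

The crop is then resized and fed back to the model for the second, more precise prediction, and `to_full_coords` maps the result back to the original screenshot.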
Figure 1. GUI-AIMA utilizes the inherent attention of MLLMs for patch-wise GUI grounding. It simplifies vanilla attention grounding, which requires proper aggregation of all query tokens' grounding vectors, by adding a learnable ANCHOR token as the context anchor of the query. Multi-head aggregation of the attention vectors between the ANCHOR token and visual tokens is sufficient for grounding.
Figure 2. GUI-AIMA proposes an effective multi-head weighting approach that measures the uniformity between the global query-visual pattern and each head-wise query-visual pattern.
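The aggregation in Figures 1 and 2 can be sketched as follows: each head contributes one ANCHOR-to-visual attention distribution, heads are weighted by how closely they agree with the head-averaged pattern, and the weighted sum gives a single patch-wise grounding map. The cosine-similarity weighting here is an assumption standing in for the paper's uniformity measure; see the paper for the exact definition:

```python
import numpy as np

def aggregate_heads(anchor_attn):
    """Combine per-head ANCHOR->visual attention into one grounding map.

    anchor_attn: (H, V) array, one attention distribution per head over
    the V visual patches. Cosine similarity to the head-averaged pattern
    is used as a stand-in for the paper's uniformity measure.
    """
    global_pattern = anchor_attn.mean(axis=0)                     # (V,)
    num = anchor_attn @ global_pattern                            # (H,)
    den = np.linalg.norm(anchor_attn, axis=1) * np.linalg.norm(global_pattern)
    sims = num / den                                              # cosine sims
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()                                      # softmax over heads
    return weights @ anchor_attn                                  # (V,) patch scores

# Toy example: 8 heads attending over 1024 visual patches.
rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 1024))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
scores = aggregate_heads(attn)
patch = int(scores.argmax())  # index of the predicted grounding patch
```

Because each head's row and the head weights both sum to one, the aggregated map is itself a valid distribution over patches, and its argmax gives the predicted patch.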
- Environment:
git clone https://github.com/sjz5202/GUI-AIMA
cd GUI-AIMA
conda create -n gui_aima python=3.10
conda activate gui_aima
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -e .
- Download the GUI-Actor data from here.
- Download the UGround single-round dialogue json data from here.
- Download the GTA1 data without the web part from here.
- Single-node training:
bash scripts/sft_single_node.sh
- Multi-node training (for reference; needs adjustment for your environment):
bash scripts/sft_multi_node.sh
We provide evaluation scripts for ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2 and ScreenSpot-v2 under the eval/ folder: eval_ss_pro.sh, eval_osworld_g.sh, eval_ui_vision.sh, eval_mmbench_l2.sh, eval_ss_v2.sh.
For ScreenSpot-Pro and OSWorld-G, we provide 2-step inference in eval_ss_pro.sh and eval_osworld_g.sh, which determines the focus area in the first step and zooms into it for grounding in the second step, without extra model training.
Evaluation datasets are available from ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2 and ScreenSpot-v2. The data path in each evaluation script needs to be adjusted.
A single-sample usage example is available in eval/example_inference.py.
GUI-AIMA is built upon the following projects.
Thanks for their great work!
@misc{zhou2025guiaimaaligningintrinsicmultimodal,
title={GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding},
author={Shijie Zhou and Viet Dac Lai and Hao Tan and Jihyung Kil and Wanrong Zhu and Changyou Chen and Ruiyi Zhang},
year={2025},
eprint={2511.00810},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.00810},
}
