
GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

Shijie Zhou*1  Viet Dac Lai2  Hao Tan2  Jihyung Kil2  Wanrong Zhu2 
Changyou Chen1 Ruiyi Zhang2

1 University at Buffalo  2 Adobe Research
* Majority of the work was done while SZ was at University at Buffalo

Release

  • [2025/11/05] 🔥 GUI-AIMA-lite now supports FlashAttention-2, which is faster and significantly more memory-efficient than the previous eager attention implementation. Since `output_attentions=True` is not supported by FlashAttention, we resolve this by mixing FlashAttention and eager attention implementations.

  • [2026/03/27] 🔥 Updated smz8599/GUI-AIMA-3B with stronger desktop grounding capability. It achieves 61.5% on ScreenSpot-Pro and 68.1% on OSWorld-G with 2-step inference.
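The FlashAttention/eager mixing mentioned in the release notes can be sketched as follows. This is an illustrative helper, not the repo's actual code; the backend strings follow the Hugging Face Transformers `attn_implementation` convention, and `pick_attn_implementation` is a hypothetical name.

```python
# Illustrative sketch: FlashAttention-2 cannot return attention weights,
# so any forward pass that needs output_attentions=True (e.g. reading out
# attention maps for grounding) must fall back to eager attention, while
# all other passes keep the fast FlashAttention-2 backend.

def pick_attn_implementation(need_attention_maps: bool) -> str:
    """Return the HF-style `attn_implementation` string for a forward pass."""
    return "eager" if need_attention_maps else "flash_attention_2"

# Typical usage: run ordinary decoding with FlashAttention-2 and switch
# to eager only for the pass whose attention maps are consumed.
backend_for_grounding = pick_attn_implementation(need_attention_maps=True)
backend_for_decoding = pick_attn_implementation(need_attention_maps=False)
```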


Main Results

There are two variants of GUI-AIMA: GUI-AIMA-3B and GUI-AIMA-lite-3B. GUI-AIMA-3B is GUI-AIMA-lite-3B additionally trained on 250k samples from GroundCUA.

With 1-step inference, GUI-AIMA achieves 53.8%, 62.8%, 60.0%, 79.1%, and 92.1% on ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2, and ScreenSpot-v2, respectively. With 2-step zoom-in inference, it achieves 61.5% and 68.1% on ScreenSpot-Pro and OSWorld-G.

We trained GUI-AIMA for one-step center-point prediction. However, GUI-AIMA can also be run in a 2-step fashion without further fine-tuning: (step 1) a first inference determines the rough grounding area; (step 2) that area is cropped and zoomed in for a second, more precise grounding inference. The 2-step inference is very helpful for GUI grounding on high-resolution screenshots, such as the samples in ScreenSpot-Pro and OSWorld-G.
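The coordinate bookkeeping for the 2-step zoom-in can be sketched as below. `two_step_ground` is a hypothetical helper for illustration (the actual model call and crop heuristics live in the eval scripts); it computes a crop box centered on the step-1 prediction and maps a step-2 point back to full-image coordinates.

```python
# Minimal sketch of 2-step zoom-in inference, assuming a step-1 coarse
# point is already available. Names and the crop_ratio heuristic are
# illustrative, not the repo's API.

def two_step_ground(image_size, coarse_point, crop_ratio=0.5):
    """Return the step-2 crop box and a mapper from crop coordinates
    back to full-image coordinates.

    image_size:   (W, H) of the full screenshot
    coarse_point: (x, y) predicted in step 1
    crop_ratio:   fraction of each dimension kept in the zoomed crop
    """
    W, H = image_size
    cw, ch = int(W * crop_ratio), int(H * crop_ratio)
    # Center the crop on the coarse prediction, clamped to image bounds.
    x0 = min(max(coarse_point[0] - cw // 2, 0), W - cw)
    y0 = min(max(coarse_point[1] - ch // 2, 0), H - ch)
    crop_box = (x0, y0, x0 + cw, y0 + ch)

    def to_full(pt_in_crop):
        # Map a step-2 prediction inside the crop back to the original image.
        return (x0 + pt_in_crop[0], y0 + pt_in_crop[1])

    return crop_box, to_full
```

In practice the crop would be resized up before the second inference, which is what makes the step-2 prediction more precise on high-resolution screenshots.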

Architecture

Figure 1. GUI-AIMA utilizes the inherent attention of MLLMs for patch-wise GUI grounding. It simplifies vanilla attention grounding, which requires proper aggregation of all query tokens' grounding vectors, by adding a learnable ANCHOR token as the context anchor of the query. Multi-head aggregation over the attention vectors between the ANCHOR and visual tokens is sufficient for grounding.

Figure 2. GUI-AIMA proposes an effective multi-head weighting approach by measuring the uniformity between the global query-visual pattern and each head-wise query-visual pattern.
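The head-weighting idea in Figure 2 can be sketched as follows: weight each head by how well its anchor-to-visual attention pattern agrees with the head-averaged ("global") pattern, then aggregate. This is an illustration only; cosine similarity is used as a stand-in for the paper's uniformity measure, and `aggregate_heads` is a hypothetical function, not the repo's implementation.

```python
import math

# Sketch: per-head attention vectors over visual tokens are combined
# with weights proportional to each head's agreement with the global
# (head-averaged) attention pattern.

def aggregate_heads(head_attn):
    """head_attn: list of per-head attention vectors over visual tokens
    (lists of floats). Returns the agreement-weighted average pattern."""
    n_heads = len(head_attn)
    n_tokens = len(head_attn[0])
    # Global pattern: mean over heads at each visual token.
    global_attn = [sum(h[i] for h in head_attn) / n_heads for i in range(n_tokens)]

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Heads that agree with the global pattern receive larger weights.
    sims = [cos(h, global_attn) for h in head_attn]
    z = sum(sims)
    weights = [s / z for s in sims]
    return [sum(w * h[i] for w, h in zip(weights, head_attn)) for i in range(n_tokens)]
```

The aggregated pattern over visual patches can then be read off directly as a patch-wise grounding map.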

Installation

  1. Environment:
git clone https://github.com/sjz5202/GUI-AIMA
cd GUI-AIMA
conda create -n gui_aima python=3.10
conda activate gui_aima
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -e .

Model Training

Data preparation (GUI-AIMA-lite)

  1. Download the GUI-Actor data from here.
  2. Download the UGround single-round dialogue json data from here.
  3. Download the GTA1 data without the web part from here.

Training

  1. Single-node training:
bash scripts/sft_single_node.sh
  2. Multi-node training (for reference; needs to be adjusted for your environment):
bash scripts/sft_multi_node.sh

Evaluation on GUI Grounding Benchmarks

We provide evaluation scripts on ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2 and ScreenSpot-v2 under the eval/ folder: eval_ss_pro.sh, eval_osworld_g.sh, eval_ui_vision.sh, eval_mmbench_l2.sh, eval_ss_v2.sh.

For ScreenSpot-Pro and OSWorld-G, we provide 2-step inference in eval_ss_pro.sh and eval_osworld_g.sh, which determine the focus area in the first step and zoom into it for grounding in the second step, without extra model training.

Evaluation datasets are available from ScreenSpot-Pro, OSWorld-G, UI-Vision, MMBench-GUI-L2, and ScreenSpot-v2. Adjust the data path in each evaluation script to match your setup.

Single sample example usage is available in eval/example_inference.py.

Acknowledgements

GUI-AIMA is built upon the following projects.

Thanks for their great work!

Citation

@misc{zhou2025guiaimaaligningintrinsicmultimodal,
      title={GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding}, 
      author={Shijie Zhou and Viet Dac Lai and Hao Tan and Jihyung Kil and Wanrong Zhu and Changyou Chen and Ruiyi Zhang},
      year={2025},
      eprint={2511.00810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.00810}, 
}
