AAAI 2025
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding—accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.
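The core idea of TAG, aggregating attention maps from selected query-prompt tokens and taking the peak as the predicted location, can be illustrated with a minimal sketch. Note the function name, tensor layout, and token-selection interface below are our own assumptions for illustration; the actual implementation lives in this repo's evaluation scripts.

```python
import numpy as np

def attention_to_point(attn, query_positions, grid_h, grid_w):
    """Aggregate attention maps from selected query-token positions into one
    spatial heatmap and return the (x, y) location of its peak.

    attn: array of shape (num_heads, num_query_tokens, num_image_tokens),
          e.g. attention weights over image tokens from one transformer layer.
    query_positions: indices of the prompt tokens that describe the target.
    grid_h, grid_w: image-patch grid (grid_h * grid_w == num_image_tokens).
    Returns normalized (x, y) coordinates in [0, 1].
    """
    # Average over attention heads and the chosen query tokens.
    agg = attn[:, query_positions, :].mean(axis=(0, 1))  # (num_image_tokens,)
    heat = agg.reshape(grid_h, grid_w)
    iy, ix = np.unravel_index(np.argmax(heat), heat.shape)
    # Map the peak patch's center to normalized image coordinates.
    return (ix + 0.5) / grid_w, (iy + 0.5) / grid_h
```

In practice the attention maps come from a pretrained MLLM such as MiniCPM-Llama3-V 2.5 and are aggregated across layers as well; this sketch only shows the per-map aggregation step.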
In this work, we develop an OCG dataset to evaluate the optical character grounding ability of MLLMs. Based on common screen resolutions, we construct 10 different width:height aspect ratios to comprehensively assess the model's grounding robustness.
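One plausible way to turn a width:height ratio into a concrete evaluation image size is to fix a total pixel budget and solve for the side lengths; the exact construction used for OCG may differ, and the helper below is our own illustration.

```python
import math

def aspect_to_size(w_ratio, h_ratio, target_pixels=1024 * 1024):
    """Compute (width, height) matching a w:h aspect ratio while keeping
    roughly `target_pixels` total pixels in the image."""
    unit = math.sqrt(target_pixels / (w_ratio * h_ratio))
    return round(w_ratio * unit), round(h_ratio * unit)
```

For example, a 16:9 ratio at a ~1 MP budget yields a 1365x768 canvas, while 9:16 yields the transposed 768x1365 one.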

We evaluate methods on the ScreenSpot dataset.
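Grounding accuracy on ScreenSpot-style benchmarks is commonly scored by checking whether the predicted click point falls inside the ground-truth element's bounding box. A minimal sketch of that metric (function and argument names are ours):

```python
def click_accuracy(pred_points, gt_boxes):
    """Fraction of predicted click points (x, y) that fall inside the paired
    ground-truth box (left, top, right, bottom), all in the same coordinates."""
    hits = 0
    for (x, y), (left, top, right, bottom) in zip(pred_points, gt_boxes):
        if left <= x <= right and top <= y <= bottom:
            hits += 1
    return hits / len(pred_points)
```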

We evaluate methods using the Element accuracy metric on the Mind2Web dataset.
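Element accuracy is the fraction of action steps on which the predicted target element matches the annotated ground-truth element; a minimal sketch (names are ours, and the official Mind2Web scoring code should be preferred for reported numbers):

```python
def element_accuracy(pred_elements, gt_elements):
    """Fraction of action steps where the predicted target element matches
    the annotated ground-truth element (compared by identifier)."""
    correct = sum(p == g for p, g in zip(pred_elements, gt_elements))
    return correct / len(gt_elements)
```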
Please download the following three benchmarks: OCG, ScreenSpot, and Mind2Web (screenshots and annotations). Please DO NOT redistribute the unzipped data files online, to avoid risks such as model overfitting.
- Clone this repository and enter the project folder

```shell
git clone https://github.com/HeimingX/TAG.git
cd TAG
```

- Create conda environment

```shell
conda create -n TAG python=3.10 -y
conda activate TAG
```

- Install dependencies

```shell
pip install -r requirements.txt
```

Evaluate on OCG:

```shell
MLLM_PATH=openbmb/MiniCPM-Llama3-V-2_5
OCG_DATAPATH=PATH/TO/OCG
IMG_ASPECTS='[[1, 4], [9, 21], [9, 19], [1, 2], [9, 16], [4, 3], [16, 9], [2, 1], [21, 9], [4, 1]]'

# Evaluate with MiniCPMV2.5
python eval_mm/ocg/eval_MiniCPMV2_5.py \
    --mllm_path ${MLLM_PATH} \
    --data_path ${OCG_DATAPATH} \
    --image_aspects "${IMG_ASPECTS}" \
    --save-dir output/ocg/mv2_5 \
    --sampling

# Evaluate with TAG
python eval_mm/ocg/eval_TAG.py \
    --mllm_path ${MLLM_PATH} \
    --data_path ${OCG_DATAPATH} \
    --image_aspects "${IMG_ASPECTS}" \
    --save-dir output/ocg/tag \
    --batchsize 4
```

Evaluate on ScreenSpot:

```shell
MLLM_PATH=openbmb/MiniCPM-Llama3-V-2_5
SCREENSPOT_IMGS=PATH/TO/IMGS
SCREENSPOT_TEST=PATH/TO/TESTSET

# Evaluate with MiniCPMV2.5
python eval_mm/screenspot/eval_MiniCPMV2_5.py \
    --mllm_path ${MLLM_PATH} \
    --screenspot_imgs ${SCREENSPOT_IMGS} \
    --screenspot_test ${SCREENSPOT_TEST} \
    --save-dir output/screenspot/mv2_5

# Evaluate with TAG
python eval_mm/screenspot/eval_TAG.py \
    --mllm_path ${MLLM_PATH} \
    --screenspot_imgs ${SCREENSPOT_IMGS} \
    --screenspot_test ${SCREENSPOT_TEST} \
    --save-dir output/screenspot/tag
```

Evaluate on Mind2Web:

```shell
MLLM_PATH=openbmb/MiniCPM-Llama3-V-2_5
MIND2WEB_DATAPATH=PATH/TO/MIND2WEB
TASKTYPES=(task website domain)

for TASK in "${TASKTYPES[@]}"
do
    # Evaluate with MiniCPMV2.5
    python eval_mm/mind2web/eval_MiniCPMV2_5.py \
        --mllm_path ${MLLM_PATH} \
        --data_dir ${MIND2WEB_DATAPATH} \
        --task ${TASK} \
        --save-dir output/mind2web/mv2_5

    # Evaluate with TAG
    python eval_mm/mind2web/eval_TAG.py \
        --mllm_path ${MLLM_PATH} \
        --data_dir ${MIND2WEB_DATAPATH} \
        --task ${TASK} \
        --save-dir output/mind2web/tag
done
```

Note: some evaluation log files are provided for reference.
We thank MiniCPM-V, SeeClick, and Mind2Web for their impressive work and open-sourced projects.
If you find our code/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
```bibtex
@inproceedings{xu2025tag,
    title={Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning},
    author={Hai-Ming Xu and Qi Chen and Lei Wang and Lingqiao Liu},
    booktitle={The 39th Annual AAAI Conference on Artificial Intelligence},
    year={2025}
}
```


