AITtrack: Attention-based Image-Text Alignment for Visual Tracking

Repository under construction...
This is the official implementation of our paper, AITtrack: Attention-based Image-Text Alignment for Visual Tracking.

Brief Introduction

AITrack simplifies vision-language model (VLM) based tracking by aligning visual and textual features with attention-based modules. A region-of-interest (ROI) text-guided encoder leverages existing pre-trained language models to implicitly extract and encode textual features, while a simple image encoder encodes visual features. A simple alignment module then combines the encoded visual and textual features, exposing the semantic relationship between the template and search frames and their surroundings and providing rich encodings for improved tracking performance. A simple decoder takes past predictions as spatiotemporal clues to model changes in target appearance, without complex customized post-processing or prediction heads.
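To make the alignment idea concrete, here is a minimal sketch of generic single-head scaled dot-product cross-attention in which visual tokens query text tokens. It is not the module shipped in this repository: the function names, the absence of learned projections, and the residual connection are all illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(visual_tokens, text_tokens):
    """Let each visual token attend over the text tokens.

    visual_tokens: list of N vectors used as queries.
    text_tokens:   list of M vectors used as keys and values (same dimension).
    Returns N fused vectors: each visual token plus its text summary.
    Hypothetical helper: no learned projections, single head.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual_tokens:
        # Scaled dot-product similarity between this query and every text token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in text_tokens]
        weights = softmax(scores)
        # Attention-weighted average of the text tokens.
        context = [sum(w * value[d] for w, value in zip(weights, text_tokens))
                   for d in range(dim)]
        # Residual connection keeps the original visual evidence.
        fused.append([q + c for q, c in zip(query, context)])
    return fused

visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attend(visual, text)
```

Each fused vector has the same dimension as the visual token it came from, so a downstream decoder could consume it in place of the purely visual encoding.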

AITrack Pipeline

Our Main Contributions

We propose an ROI-based text-guided encoder that leverages existing pre-trained language models to implicitly extract and encode textual descriptions.
We propose a simple image-text alignment module that encodes the semantic relationship between the template and search regions and their surroundings, providing rich and meaningful representations for improved visual object tracking (VOT) performance.
We also incorporate a simple decoder that leverages spatiotemporal representations to model variations in the target object's appearance across video frames, without complex customized post-processing or prediction heads.
We perform rigorous experimental evaluations on seven publicly available VOT benchmark datasets to show the advantages of our proposed AITrack.
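One way a decoder can use past predictions as spatiotemporal clues is to quantize each predicted box into discrete tokens and feed the most recent ones back as a prompt. The sketch below assumes such a coordinate-token scheme; the source does not confirm that AITrack works this way, and `box_to_tokens`, `PredictionHistory`, the bin count, and the history length are hypothetical.

```python
from collections import deque

def box_to_tokens(box, bins=400):
    """Quantize a normalized (x1, y1, x2, y2) box into integer tokens in [0, bins - 1]."""
    return [min(bins - 1, max(0, int(coord * bins))) for coord in box]

def tokens_to_box(tokens, bins=400):
    """Map tokens back to the centre coordinate of their bin."""
    return [(t + 0.5) / bins for t in tokens]

class PredictionHistory:
    """Fixed-length buffer of past box tokens (hypothetical helper)."""

    def __init__(self, length=3, bins=400):
        self.bins = bins
        self.buffer = deque(maxlen=length)  # the oldest entry is dropped automatically

    def push(self, box):
        self.buffer.append(box_to_tokens(box, self.bins))

    def prompt(self):
        """Flatten the stored boxes, oldest first, into one token sequence."""
        return [t for tokens in self.buffer for t in tokens]

history = PredictionHistory(length=2, bins=100)
for box in [(0.105, 0.205, 0.305, 0.405),
            (0.125, 0.225, 0.325, 0.425),
            (0.155, 0.255, 0.355, 0.455)]:
    history.push(box)
prompt = history.prompt()
```

With a history length of 2, only the two most recent boxes remain in the prompt, so the sequence length stays constant however long the video is.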

The ROI-based Text-guided Encoder

Results Comparison

Trackers with Only Bounding Box (BB) Initialization

Trackers with Bounding Box (BB) and Natural Language (NL) Initialization

Environment Setup

Clone this repository

git clone https://github.com/BasitAlawode/AITrack AITrack
cd AITrack

Create and activate the environment with Anaconda (CUDA 11.3)

conda env create -f environment.yml
conda activate aitrack
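Once the environment is active, a quick check can confirm which CUDA version the installed PyTorch build targets. This sketch assumes that environment.yml installs PyTorch, which is not stated above; `describe_cuda` is a hypothetical helper, not part of the repository.

```python
def describe_cuda():
    """Return a one-line summary of the PyTorch/CUDA setup without failing if PyTorch is absent."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not importable in this environment"
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    built_for = torch.version.cuda
    # True only if a usable GPU and driver are visible at runtime.
    available = torch.cuda.is_available()
    return f"PyTorch {torch.__version__}, built for CUDA {built_for}, GPU available: {available}"

print(describe_cuda())
```

If the reported CUDA version is not 11.3, the environment may have resolved to a different PyTorch build than the one the instructions above target.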

Set project paths

Modify the project paths by editing these two files:

lib/train/admin/local.py  # paths used for training
lib/test/evaluation/local.py  # paths used for testing
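The contents of these files are not shown here. As a rough illustration only, a path-settings file of this kind is often a small class of absolute paths. Every attribute name and directory below is a hypothetical placeholder, so check the actual files for the names they expect.

```python
import os

class EnvironmentSettings:
    """Hypothetical container for local paths; the real attribute names may differ."""

    def __init__(self, project_root, data_root):
        self.workspace_dir = project_root                                 # where outputs are written
        self.checkpoint_dir = os.path.join(project_root, "checkpoints")   # placeholder name
        self.dataset_dir = os.path.join(data_root, "example_dataset")     # placeholder name

    def missing_paths(self):
        """Return the configured paths that do not exist on disk yet."""
        return [path for path in vars(self).values() if not os.path.isdir(path)]

settings = EnvironmentSettings("/path/to/AITrack", "/path/to/data")
missing = settings.missing_paths()
```

Calling `missing_paths()` before training or testing is a cheap way to catch a mistyped directory early.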

Dataset Preparation

To be updated....

Training

To be updated....

Evaluation

To be updated....

Acknowledgement

Our work is based on
1. ARTrack,
2. Alpha-CLIP, and
3. RTS for the segmentation mask.

We thank the authors for making their code available.

Citation

If you find our work useful, please consider citing:

@ARTICLE{basit_aitrack25,
  author={Alawode, Basit and Javed, Sajid},
  journal={IEEE Access}, 
  title={AITtrack: Attention-based Image-Text Alignment for Visual Tracking}, 
  year={2025},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/ACCESS.2025.3555816}}
