PyTorch implementation of "Self-Adaptive Vision-Language Tracking With Context Prompting" (IEEE TIP)
The paper can be found here.
To bridge the substantial gap between the vision and language modalities, and to address the mismatch between fixed language descriptions and dynamic visual information, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned visual-language representations. A novel context-aware prompting mechanism dynamically adapts the linguistic cues to the evolving visual context during tracking. The framework employs a unified one-stream Transformer architecture that supports joint training for both vision-only and vision-language tracking.
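The context-aware prompting idea can be illustrated with a minimal CoCoOp-style sketch: a lightweight meta-network maps the current visual context feature to a per-frame shift that is added to a set of learnable prompt tokens, so the language cue adapts as the target's appearance evolves. This is not the released code; the module name, dimensions, and fusion step below are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' exact implementation) of
# context-conditioned prompting: learnable prompt tokens are shifted by a
# meta-network conditioned on the current visual context, CoCoOp-style.
import torch
import torch.nn as nn


class ContextPrompt(nn.Module):
    def __init__(self, embed_dim=512, n_prompts=4):
        super().__init__()
        # Static learnable prompt tokens, shared across frames.
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)
        # Meta-network: visual context feature -> per-frame prompt shift.
        self.meta_net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim // 16, embed_dim),
        )

    def forward(self, visual_ctx):
        # visual_ctx: (B, D) pooled feature of the current search region.
        shift = self.meta_net(visual_ctx)                       # (B, D)
        # Broadcast the shift over all prompt tokens.
        return self.prompts.unsqueeze(0) + shift.unsqueeze(1)   # (B, P, D)


if __name__ == "__main__":
    B, D = 2, 512
    prompter = ContextPrompt(embed_dim=D, n_prompts=4)
    visual_ctx = torch.randn(B, D)          # stand-in for a CLIP image feature
    text_tokens = torch.randn(B, 16, D)     # stand-in for CLIP text embeddings
    dynamic_prompts = prompter(visual_ctx)  # (B, 4, D)
    # Prepend the adapted prompts to the text token sequence before the
    # text encoder / fusion Transformer consumes them.
    conditioned = torch.cat([dynamic_prompts, text_tokens], dim=1)
    print(conditioned.shape)  # torch.Size([2, 20, 512])
```

In practice the conditioned token sequence would be fed through the CLIP text encoder (or the unified one-stream Transformer), so the prompt shift is trained end-to-end with the tracking loss.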

Please refer to install.sh for environment installation, and set your own project/model/data paths.
See eval.sh for the training and testing commands; commands for language-only tracking are in eval_nl.sh.
The required pretrained models are provided here [pwd: c5ie]; please download, extract, and place them in your project directory.
We also release our trained models here [pwd: jpj8] and our results here [pwd: nrkw].
We thank the excellent prior works SUTrack and CoCoOp for inspiring our methodology. If you find this work helpful for your research, please consider citing our paper:
@article{zhaoself,
  title={Self-Adaptive Vision-Language Tracking with Context Prompting},
  author={Zhao, Jie and Chen, Xin and Li, Shengming and Bo, Chunjuan and Wang, Dong and Lu, Huchuan},
  journal={IEEE Transactions on Image Processing},
  year={2026}
}