Paper list for vision-language tracking (continuously updated)
Benchmarks

- Tracking by Natural Language Specification, Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, Arnold W. M. Smeulders (CVPR17) [Paper] [Github]
- LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking, Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, Haibin Ling (CVPR19) [Paper] [Github] [Project]
- Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark, Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, Feng Wu (CVPR21) [Paper] [Evaluation Toolkit & Github] [Project]
- WebUAV-3M: A Benchmark for Unveiling the Power of Million-Scale Deep UAV Tracking, Chunhui Zhang, Guanjie Huang, Li Liu, Shan Huang, Yinan Yang, Xiang Wan, Shiming Ge, Dacheng Tao (TPAMI23) [Paper] [Github]
- Elysium: Exploring Object-level Perception in Videos via MLLM, Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, Can Huang (ECCV24) [Paper] [Github] [Project]
- VastTrack: Vast Category Visual Object Tracking, Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, Libo Zhang (NeurIPS24) [Paper] [Github]
Vision-Language Tracking

- Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers (SNLT), Qi Feng, Vitaly Ablavsky, Qinxun Bai, Stan Sclaroff (CVPR21) [Paper] [Code]
- Divert More Attention to Vision-Language Tracking, Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing (NeurIPS22) [Paper] [Code]
- Towards Unified Token Learning for Vision-Language Tracking, Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, Xianxian Li (TCSVT23) [Paper] [Code]
- All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment, Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, Yanfeng Wang (MM23) [Paper]
- One-Stream Vision-Language Memory Network for Object Tracking, Huanlong Zhang, Jingchao Wang, Jianwei Zhang, Tianzhu Zhang, Bineng Zhong (TMM23) [Paper] [Code]
- CiteTracker: Correlating Image and Text for Visual Tracking, Xin Li, Yuqing Huang, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang (ICCV23) [Paper]
- One-Stream Stepwise Decreasing for Vision-Language Tracking, Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Ning Li, Shuxiang Song (TCSVT24) [Paper]
- Consistencies are All You Need for Semi-supervised Vision-Language Tracking, Jiawei Ge, Jiuxin Cao, Xuelin Zhu, Xinyu Zhang, Chang Liu, Kun Wang, Bo Liu (MM24) [Paper]
- Diffusion Mask-Driven Visual-language Tracking, Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shuxiang Song (IJCAI24) [Paper]
- Context-Aware Integration of Language and Visual References for Natural Language Tracking, Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen (CVPR24) [Paper] [Code]
- OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning, Lingyi Hong, Shilin Yan, Renrui Zhang, Wanyun Li, Xinyu Zhou, Pinxue Guo, Kaixun Jiang, Yiting Cheng, Jinglun Li, Zhaoyu Chen, Wenqiang Zhang (CVPR24) [Paper]
- MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts, Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, Kaiqi Huang (NeurIPS24) [Paper]
- Divert More Attention to Vision-Language Object Tracking, Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, Heng Fan (TPAMI24) [Paper]
- ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model, Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Yang Li, Chenhui Li, Changbo Wang (NeurIPS24) [Paper]
- Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking, Jiawei Ge, Jiuxin Cao, Xiangmei Chen, Xuelin Zhu, Weijia Liu, Chang Liu, Kun Wang, Bo Liu (TOMM25) [Paper]
- Language-guided Visual Tracking: Comprehensive and Effective Multimodal Information Fusion, Jianbo Song, Hong Zhang, Yachun Feng, Hanyang Liu, Yifan Yang (TOMM25) [Paper]
- Gen4Track: A Tuning-free Data Augmentation Framework via Self-correcting Diffusion Model for Vision-Language Tracking, Jiawei Ge, Xinyu Zhang, Jiuxin Cao, Xuelin Zhu, Weijia Liu, Qingqing Gao, Biwei Cao, Kun Wang, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras (MM25) [Paper]
- SUTrack: Towards Simple and Unified Single Object Tracking, Xin Chen, Ben Kang, Wanting Geng, Jiawen Zhu, Yi Liu, Dong Wang, Huchuan Lu (AAAI25) [Paper] [Code]
- AVLTrack: Dynamic Sparse Learning for Aerial Vision-Language Tracking, Yuanliang Xue, Bineng Zhong, Guodong Jin, Tao Shen, Lining Tan, Ning Li, Yaozong Zheng (TCSVT25) [Paper] [Code]
- Progressive Semantic-Visual Alignment and Refinement for Vision-Language Tracking, Yanjie Liang, Qiangqiang Wu, Lin Cheng, Changqun Xia, Jia Li (TCSVT25) [Paper]
- Learning Language Prompt for Vision-Language Tracking, Chengao Zong, Jie Zhao, Xin Chen, Huchuan Lu, Dong Wang (TCSVT25) [Paper]
- State Space Models for Natural Language Tracking: Exploring Context-adaptive Language Cues, Yuyang Tang, Yinchao Ma, Dengqing Yang, Jie Xiao, Tianzhu Zhang (TCSVT25) [Paper]
- Mamba Adapter: Efficient Multi-Modal Fusion for Vision-Language Tracking, Liangtao Shi, Bineng Zhong, Qihua Liang, Xiantao Hu, Zhiyi Mo, Shuxiang Song (TCSVT25) [Paper] [Code]
- ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking, Xiaokun Feng, Shiyu Hu, Xuchen Li, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, Kaiqi Huang (ICCV25) [Paper] [Code]
- Dynamic Updates for Language Adaptation in Visual-Language Tracking, Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song (CVPR25) [Paper] [Code]
Tracking by Natural Language Specification

- Tracking by Natural Language Specification, Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, Arnold W. M. Smeulders (CVPR17) [Paper] [Github]
- Grounding-Tracking-Integration, Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jingsong Su, Jiebo Luo (TCSVT20) [Paper]
- Real-time Visual Object Tracking with Natural Language Description, Qi Feng, Vitaly Ablavsky, Qinxun Bai, Guorong Li, Stan Sclaroff (WACV20) [Paper]
- Capsule-based Object Tracking with Natural Language Specification, Ding Ma, Xiangqian Wu (MM21) [Paper]
- Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark, Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, Feng Wu (CVPR21) [Paper] [Evaluation Toolkit & Github] [Project]
- Cross-modal Target Retrieval for Tracking by Natural Language, Yihao Li, Jun Yu, Zhongpeng Cai, Yuwen Pan (CVPRW22) [Paper]
- Tracking by Natural Language Specification with Long Short-term Context Decoupling, Ding Ma, Xiangqian Wu (ICCV23) [Paper]
- Joint Visual Grounding and Tracking with Natural Language Specification, Li Zhou, Zikun Zhou, Kaige Mao, Zhenyu He (CVPR23) [Paper] [Github]
- Unifying Visual and Vision-Language Tracking via Contrastive Learning, Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang (AAAI24) [Paper] [Code]
- Semantic-Aware Network for Natural Language Tracking, Yuyang Tang, Yinchao Ma, Tianzhu Zhang (TCSVT25) [Paper]
- Multi-Modal Hybrid Interaction Vision-Language Tracking, Lei Lei, Xianxian Li (TMM25) [Paper]
- MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking, Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He (CVPR25) [Paper]
- A Swiss Army Knife for Tracking by Natural Language Specification, Kaige Mao, Xiaopeng Hong, Xiaopeng Fan, Wangmeng Zuo (TIP25) [Paper] [Code]
- UniSOT: A Unified Framework for Multi-Modality Single Object Tracking, Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Xu Zhou, Feng Wu (TPAMI25) [Paper]