VimTS
VimTS copied to clipboard
VimTS: A Unified Video and Image Text Spotter
VimTS
VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
Description
VimTS is a unified video and image text spotter for enhancing the cross-domain generalization. It outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaption, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data.
News
2024.5.3🚀 Code available.2024.5.1🚀 Release paper VimTS.
Framework
Overall framework of our method.
Overall framework of CoDeF-based synthetic method.
VTD-368K
We manually collect and filter text-free, open-source and unrestricted videos from NExT-QA, Charades-Ego, Breakfast, A2D, MPI-Cooking, ActorShift and Hollywood. By utilizing the CoDeF, our synthetic method facilitates the achievement of realistic and stable text flow propagation, significantly reducing the occurrence of distortions.
Compared with MLMMs
Getting Started
-
Installation
Python 3.8 + PyTorch 1.10.0 + CUDA 11.3 + torchvision=0.11.0 + Detectron2 (v0.2.1) + OpenCV for visualization
conda create -n VimTS python=3.8 -y
conda activate VimTS
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
git clone https://github.com/Yuliang-Liu/VimTS.git
cd detectron2-0.2.1
python setup.py build develop
pip install opencv-python
cd models/vimts/ops
sh make.sh
Data Preparation
Please download TotalText, CTW1500, and ICDAR2015 according to the guide provided by SPTS v2: README.md.
Please download Ground Truth of line-level TotalText and Ground Truth of word-level CTW1500.
Extract all the datasets and make sure you organize them as follows
- datasets
| - CTW1500
| | - annotations
| | - ctwtest_text_image
| | - ctwtrain_text_image
| - totaltext (or icdar2015)
| | - test_images
| | - train_images
| | - test.json
| | - train.json
| - evaluation
| | - gt_ctw1500_word.zip
| | - gt_totaltext_line.zip
Training
We use 8 GPUs for training and 2 images each GPU by default.
bash scripts/multi_tasks.sh /path/to/your/dataset
Evaluation
Download the weight Google Drive.
0 for Text Detection; 1 for Text Spotting.
bash scripts/test.sh config/VimTS/VimTS_multi_finetune.py /path/to/your/dataset 1 /path/to/your/checkpoint /path/to/your/test_dataset
e.g.:
bash scripts/test.sh config/VimTS/VimTS_multi_finetune.py ../datasets 1 cross_domain_checkpoint.pth totaltext_val
Visualization
Visualize the detection and recognition results
python vis.py
Cite
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
@misc{liuvimts,
author={Liu, Yuliang and Huang, Mingxin and Yan, Hao and Deng, Linger and Wu, Weijia and Lu, Hao and Shen, Chunhua and Jin, Lianwen and Bai, Xiang},
title={VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization},
publisher={arXiv preprint arXiv:2404.19652},
year={2024},
}
Copyright
We welcome suggestions to help us improve the VimTS. For any query, please contact Prof. Yuliang Liu: [email protected]. If you find something interesting, please also feel free to share with us through email or open an issue. Thanks!